1. Introduction
The same word can be pronounced in different ways according to the intended message. We would shout Yes! when our favorite table tennis player wins a long rally, but we might whisper Yes… when we respond to our friend’s question while watching a movie. In Peirce’s (1932) semiotics, these prosodic features of speech sound are indexical of our emotional state or attitude: There is a causal relationship between our inner state and the phonetic details of our pronunciation.
On the other hand, we might whisper the adjective quiet when we exaggerate the quietness of the car we want to sell, as in This car is {W quiet W}. [Footnote 1] This whisper is iconic of the car’s quietness: The voice quality resembles the car’s sonic attribute. While most studies on the functions of suprasegmental features investigate their indexical use (Anderson et al., 2014; Esling et al., 2019; Gussenhoven, 2016; Hancil & Hirst, 2013; Hübscher et al., 2017; Laver, 1994), their iconic use also deserves special attention in current cognitive science, since both indexicality and iconicity are viewed as relevant to the origin of human language (Everett, 2017; Imai & Kita, 2014; Perniss et al., 2010; Vigliocco et al., 2014).
To this end, the current study explores iconic prosody, as it relates to the use of Japanese ideophones, which are imitative words that themselves iconically represent various sensory and emotional information, such as piyopiyo ‘cheeping’, sutasuta ‘walking briskly’, sarari ‘dry and smooth’, zukin ‘one’s head throbbing’ and ukiuki ‘buoyant’. Ideophones are often accompanied by prominent prosody, such as distinctly high or low pitch and marked voice quality (Childs, 1994; Dingemanse et al., 2016). For example, according to Dingemanse and Akita’s (2017) study using a multimodal corpus of Japanese, about 30% of adverbial ideophones marked by a quotative particle were ‘phonationally foregrounded’, i.e., pronounced with marked voice quality. In the following quote, the female speaker pronounces the ideophone hyoi ‘unexpected’ in a falsetto voice, which appears to emphasize the unexpectedness or suddenness of her son’s answering her phone call.
Soshite, nan-byak-kai-ka nan-zen-kai-ka wakar-anai-n-des-u-kedo, keitai shi-tara, {F hyoi F} -to koo yuugata, daibu kuraku nat-te-kara, guuzen tsunagat-ta-n-des-u.
‘And, I don’t know how many phone calls I made, but when I made a phone call on my cellphone, unexpectedly in the evening, when it was very dark, [my son] answered it by chance’.
https://www2.nhk.or.jp/archives/movies/?id=D0026020063_00000
Expanding on these previous observations, in particular the one by Dingemanse and Akita (2017), the current study reports on three experiments, which together explore what the prosodic features of ideophones can represent iconically in Japanese.
The organization of this paper is as follows. Section 2 summarizes previous experimental studies on the iconic properties of speech sounds. Section 3 reports on an experiment in which native speakers of Japanese pronounced ideophones expressively, and Sections 4 and 5 report on perception experiments in which native Japanese speakers chose pitch–intensity–duration combinations and voice qualities that intuitively suited individual ideophones. Section 6 discusses the implications that the current findings may offer for the theory of language evolution. Section 7 is the conclusion.
2. Previous studies
There are several studies on the production and perception of the iconic functions of speech prosody, but they are so far limited in both number and scope. Shintel et al. (2006), for example, asked English speakers to describe the direction of motion of an animated dot and found that the participants tended to use a higher fundamental frequency (f0) to describe upward motion than to describe downward motion and a faster speech rate to describe faster motion than to describe slower motion.
Nygaard et al. (2009) asked three female speakers of English to pronounce sentences with novel adjectives in infant-directed speech (IDS): Can you get the {blicket/seebow/daxen/foppick/tillen/riffel} one? The participants pronounced the novel words as having positive (‘happy’, ‘hot’, ‘big’, ‘yummy’, ‘tall’, ‘strong’), negative (‘sad’, ‘cold’, ‘small’, ‘yucky’, ‘short’, ‘weak’) or neutral (i.e., unspecified) meanings. The acoustic analysis of the recorded materials revealed that, for instance, the novel words were pronounced with a higher f0, higher intensity and shorter duration, when the speakers intended to express ‘happiness’, and with higher intensity and longer duration, when they intended to express ‘largeness’ (see also Ferrara et al., 2025; Herold, 2006; Herold et al., 2011; Kunihira, 1971; Michelini & Nygaard, 2025, for related experiments).
Similar experiments have also been conducted on iconic vocalizations (Ćwiek et al., 2021; Ćwiek & Fuchs, 2019; Perlman, 2026; Perlman et al., 2015; Perlman & Cain, 2014; Perlman & Lupyan, 2018). In this series of experiments, English speakers were instructed to use nonlinguistic vocalizations to express various meanings, such as ‘tiger’, ‘water’, ‘small’, ‘many’ and ‘that’. For example, ‘water’ was iconically represented by mimicking the sound of pouring water into a glass and ‘tiger’ by mimicking its roar. In forced-choice tasks, speakers of both English and several other languages showed accuracy that was greater than chance at selecting the intended meanings of the obtained vocalizations.
A few studies on iconic prosody can also be found in the recent literature on sound symbolism. According to Akita (2021, 2025) and Motoki et al. (2022), Japanese speakers associate novel words pronounced with creaky voice with largeness, spikiness and bitterness; those pronounced in a falsetto with roundedness, brightness and sweetness; and those pronounced in a whisper with smallness, roundedness and darkness. English speakers share most of these associations (see also Lacey et al., 2020; Villegas et al., 2023).
While these studies are starting to unveil an important role of iconic prosody in natural languages, what is yet to be addressed is its role in real words, such as hyoi ‘unexpected’ discussed in Section 1. It remains an open question how prosodic features interact with the lexical meanings of conventional words other than the directional words (i.e., up and down) examined in Shintel et al. (2006) (cf. Stolarski, 2019). We would like to emphasize here that iconic prosody has not so far been tested with ideophones, iconic words that are characterized by their marked prosody. The current study fills this gap in the literature by experimentally examining the production and perception of Japanese ideophones.
3. Experiment 1: Production
To investigate whether – and if so, how – iconic prosody can contribute to ideophonic utterances, we first built upon Nygaard et al.’s (2009) nonword-based study, asking Japanese speakers to produce sentences with ideophones in IDS. We focused on IDS, as it generally employs exaggerated prosody, such as heightened pitch, a wide pitch range and lengthened vowels (Fernald et al., 1989; Garnica, 1977; Igarashi et al., 2013; Mazuka et al., 2015).
3.1. Method
3.1.1. Participants
Thirty female monolingual speakers of Japanese (age: 23–63; M = 38.00; standard deviation [SD] = 9.43) were recruited on CrowdWorks, a Japanese crowdsourcing platform. Twenty-four of them had childcare experience, but this factor did not significantly improve the fit of a regression model reported below; hence, this factor was not considered in the subsequent analysis. They were paid 300 JPY for their participation.
3.1.2. Stimuli
We prepared a total of 20 simple sentences containing an ideophone, as listed in Table 1. The current experiment focused on those ideophones that represent manners of motion, such as walking, running and floating. They constitute a major semantic domain in the Japanese ideophone inventory, and their meanings have been extensively described in the literature (Akita, 2020a; Ibarretxe-Antuñano, 2019; Saji et al., 2019; Toratani, 2012). We selected motion ideophones from Akita (2020a) that have a non-reduplicated form and end with a so-called sokuon /Q/ (phonetically realized as the first half of a geminate when followed by a consonant, as in the current experiment). We used this particular morphological shape because it is known that expressive, emphatic prosody appears most frequently with ideophones of this type (Akita, 2020b). The 20 sentences were presented in a random order on Google Forms.
Table 1. Stimulus sentences for Experiment 1, with abbreviated semantic labels for cross-referencing in parentheses

In order to quantitatively explore the possible correlations between the semantic features of these ideophones and the use of particular prosodic patterns, a different group of 20 monolingual speakers of Japanese (female: 13 and male: 7; age: 22–55; M = 40.05; SD = 8.80) rated each of the 20 ideophones on six 7-point semantic scales, adapted from Ibarretxe-Antuñano (2019) and Saji et al. (2019): size (from 0 ‘small’ to 6 ‘large’), speed (from 0 ‘slow’ to 6 ‘fast’), weight (from 0 ‘light’ to 6 ‘heavy’), intensiveness (from 0 ‘moderate’ to 6 ‘intensive’), pleasantness (from 0 ‘unpleasant’ to 6 ‘pleasant’) and noise (from 0 ‘quiet’ to 6 ‘noisy’). Although no additional instructions were given as to how to interpret these scales, the ratings were no more variable for highly subjective scales (e.g., pleasantness: mean SD = 1.07) than for less subjective scales (e.g., speed: mean SD = 1.28). Using Google Forms, the ideophones were visually presented with the example sentences in Table 1. The order of the ideophones was randomized. The mean semantic ratings for the ideophones are shown in Table 2.
Table 2. Mean semantic ratings, with standard deviation in parentheses, for all the ideophones that were examined

Some of these scales were found to be strongly correlated with each other. For example, the Spearman correlation between size and weight was 0.78. Entering all six scales into a single regression analysis was therefore undesirable, as it would have caused a collinearity problem. Hence, a principal component (PC) analysis was run, using R version 4.4.0 (R Core Team, 2024). It was revealed that the first three components account for 89.52% of the variability in the data. As shown in Table 3, these dimensions are primarily characterized by speed and pleasantness, and the other four scales primarily contribute to PCs 4 to 6. Therefore, we used only speed and pleasantness in the following statistical analyses.
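The collinearity check described above rests on rank correlations between pairs of rating scales. The following is a minimal sketch of a Spearman correlation in plain Python; the rating vectors are hypothetical values chosen for illustration (they are not the ratings in Table 2), and the function names are our own rather than those of the authors' R script.

```python
from statistics import mean

def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        tied_rank = (i + j) / 2 + 1  # mean of positions i..j, 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = tied_rank
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Hypothetical mean ratings for five ideophones (illustration only).
size = [1.2, 2.5, 4.8, 3.1, 5.0]
weight = [0.9, 2.0, 5.2, 3.5, 4.4]
rho = spearman(size, weight)
```

A coefficient close to 1 for two scales, as in this toy case, is the kind of pattern that motivates reducing the scales before regression.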
Table 3. Principal components’ loadings

Note: Boldface > |0.50|.
3.1.3. Procedure
The 30 participants, recruited for the main experiment, were instructed to complete the experiment alone in a quiet room, read the sentences aloud so that even a 1-year-old infant could understand what they meant and record their pronunciation on their smartphone or other devices. They were allowed to pronounce each sentence as many times as they liked. In this experiment, as well as in Experiments 2 and 3 reported below, the participants read the consent form before they began their task.
3.1.4. Predictions
The previous literature on sound symbolism allows us to make some specific predictions about how the prosodic features of ideophones may be used to express the speed and pleasantness of motion. Notably, the frequency code hypothesis (Ohala, 1984) – one of the most influential hypotheses in the literature on sound symbolism – states that higher-frequency sounds signal a smaller vocalizer than lower-frequency sounds; for example, a small mouse produces a higher-frequency voice than a large elephant. From this hypothesis, Ohala and his colleagues proposed to derive several sound–meaning associations, as quoted below:
… high tones, vowels with high second formants (notably /i/), and high-frequency consonants are associated with high-frequency sounds, small size, sharpness, and rapid movement; low tones, vowels with low second formants (notably /u/), and low-frequency consonants are associated with low-frequency sounds, large size, softness, and heavy, slow movements.
(Hinton et al., 1994, p. 10)
If Japanese ideophones behave according to the frequency code, it is predicted, for instance, that fast motion is expressed by high f0.
In addition, according to Nygaard et al.’s (2009) nonce word experiment, high f0 might also be used to express pleasant motion of some kind. It may also be reasonable to expect faster motion to be expressed by shorter duration, as in Shintel et al. (2006) (see also Knoeferle et al., 2017; Perlman & Cain, 2014).
3.1.5. Analysis
We analyzed the last repetition of each ideophone unless it was mispronounced. Ideophones whose form was changed from the intended target form, as in buraburaburaaQ or buran for buraQ ‘taking a walk’, were excluded from the data. A total of 557 recordings entered the following acoustic and statistical analysis (30 participants × 20 sentences − 43 exclusions; exclusion percentage = 7%). Although the participants were from different areas of Japan, we did not exclude any recordings, as the obtained pronunciation of the ideophones did not exhibit conspicuous dialectal variations.
Using Praat version 6.3.18 (Boersma & Weenink, 2023), we obtained the mean f0 of the second vowel (V2) of each ideophone (e.g., /a/ of buraQ) and the mean intensity and duration of the entire ideophone (i.e., from the initial consonant to the beginning of the quotative particle -to) and standardized them within each participant. We focused on the f0 of V2 specifically, as it is the locus of the pitch accent.
Using the brms package (Bürkner, 2017) with R version 4.4.0 (R Core Team, 2024), Bayesian mixed effects models were fit, with mean f0, mean intensity and duration as the dependent variables and the two mean ratings (speed and pleasantness), initial voicing of the ideophones and V2 as fixed effects, as well as a random intercept for participant and a random slope for participant associated with each of the three fixed factors. The consonantal and vocalic factors were included in the current models, as they may influence f0 in nontrivial ways. In particular, in the sound-symbolic system of Japanese ideophones, voiceless obstruents in the word-initial position represent small, light, fine objects, as in korokoro ‘a light object rolling’ versus gorogoro ‘a heavy object rolling’ (Hamano, 1998). The frequency code hypothesis would predict that ideophones with voiceless obstruents are pronounced with a higher f0. There are also acoustic bases for the inclusion of these segmental factors. Voiceless obstruents are known to raise the f0 of adjacent vowels, and voiced obstruents lower it (Kingston & Diehl, 1994). High vowels, such as [i] and [u], tend to have higher pitch than low vowels, such as [ɑ] (Ohala, 1973).
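Standardizing within each participant removes between-speaker differences in baseline pitch, loudness and tempo, so that values are comparable across speakers. A minimal sketch of such a by-participant z-transformation follows; the record layout, the field names and the use of the sample SD are assumptions made for illustration, and the f0 values are invented.

```python
from collections import defaultdict
from statistics import mean, stdev

def standardize_within(records, key="participant", value="f0"):
    """Z-score `value` separately for each level of `key` (sample SD assumed)."""
    groups = defaultdict(list)
    for r in records:
        groups[r[key]].append(r[value])
    stats = {p: (mean(v), stdev(v)) for p, v in groups.items()}
    return [(r[value] - stats[r[key]][0]) / stats[r[key]][1] for r in records]

# Hypothetical mean f0 values (Hz) for two speakers with different baselines.
records = [
    {"participant": "P1", "f0": 200.0},
    {"participant": "P1", "f0": 220.0},
    {"participant": "P1", "f0": 240.0},
    {"participant": "P2", "f0": 300.0},
    {"participant": "P2", "f0": 330.0},
    {"participant": "P2", "f0": 360.0},
]
z_scores = standardize_within(records)
```

After the transformation, both hypothetical speakers span the same standardized range even though their raw f0 values differ by about 100 Hz.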
We used default weakly informative priors for the intercept and group-level SDs. Four Markov Chain Monte Carlo (MCMC) chains were run with 2,000 iterations each, but the first 1,000 iterations were discarded as warm-up (burn-in) iterations. Convergence was assessed via R-hat statistics (all R-hat values = 1.00) and visual inspection of trace plots. To improve sampling efficiency and avoid divergent transitions, we set the target acceptance rate as adapt_delta = 0.97 and the maximum tree depth as max_treedepth = 15. All the analytical details can be checked in the R Markdown file at the project’s Open Science Framework (OSF) repository at https://osf.io/xrd96/.
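To make the convergence criterion concrete, the sketch below computes the classic (non-split) Gelman–Rubin R-hat for equal-length chains. This is a simplified stand-in: Stan-based tools such as brms report a more refined rank-normalized split R-hat, and the chains here are simulated draws, not output from the reported models.

```python
import random
from statistics import mean, variance

def classic_rhat(chains):
    """Classic (non-split) potential scale reduction factor for equal-length chains."""
    n = len(chains[0])
    within = mean([variance(c) for c in chains])        # W: mean within-chain variance
    between = n * variance([mean(c) for c in chains])   # B: between-chain variance
    pooled = (n - 1) / n * within + between / n         # pooled variance estimate
    return (pooled / within) ** 0.5

random.seed(1)
n_chains, n_iter, n_warmup = 4, 2000, 1000

# Simulated chains that all sample the same distribution; warm-up draws are dropped.
chains = [
    [random.gauss(0.5, 0.1) for _ in range(n_iter)][n_warmup:]
    for _ in range(n_chains)
]
post_warmup_draws = sum(len(c) for c in chains)  # 4 chains x 1,000 retained draws
rhat_mixed = classic_rhat(chains)

# Two chains stuck in different regions yield a value far above 1.
rhat_stuck = classic_rhat([[0.0, 1.0, 2.0], [10.0, 11.0, 12.0]])
```

Values near 1.00 indicate that the chains agree with one another, which is the pattern reported for the models above.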
3.2. Results
3.2.1. Prosodic tendencies of individual ideophones
Overall, the prosodic properties of the pronounced ideophones were consistent with some previously reported sound–meaning associations in Japanese and beyond. Figure 1 presents the mean f0 of the V2 of the 20 ideophones. Ideophones with an initial voiceless obstruent (/p, t, k, s, h/) tend to be pronounced with a higher f0 than those with a voiced obstruent onset (/b, d, g, z/). Note that these differences are not due to phonetic f0 perturbation effects (Kingston & Diehl, 1994), as f0 was measured at V2.

Figure 1. Mean standardized f0 of the V2 of ideophones, from the lowest to the highest.
In Japanese ideophones, smallness and lightness are often represented by word-initial voiceless obstruents (Hamano, 1998). The current results suggest that the same size and weight information is also expressible by f0: The higher the fundamental frequency, the smaller and lighter the referent. For example, kuruQ ‘a light object spinning quickly once’ was pronounced with a higher f0 (M = 0.33) than guruQ ‘going around, drawing a large circle’ (M = −0.37). Similarly, potoQ represents the falling motion of a small light object and was generally produced with high f0 (M = 0.97), whereas bataQ represents that of a heavy two-dimensional object and was produced with low f0 (M = −0.39). These results appear to accord well with the frequency code hypothesis.
Figure 2 shows the mean intensity of the 20 ideophones. It appears that ideophones that depict heavy objects’ forceful movements tend to be pronounced strongly. For example, goroQ represents a person’s lazy rolling movement on the floor and tends to be produced with high intensity (M = 0.73), whereas pukaQ represents a light object’s floating motion and is generally produced with low intensity (M = −1.04).

Figure 2. Mean standardized intensity of ideophones, from the lowest to the highest.
Figure 3 shows the mean duration of the 20 ideophones. /Q/-ending ideophones generally tend to represent quick movements. However, the relatively long duration of mowaQ ‘steam/smoke coming out’ (M = 1.23) reflects the slow movement it depicts.

Figure 3. Mean standardized duration of ideophones, from the lowest to the highest.
3.2.2. Ideophone semantics and prosody
The subjective ratings obtained for these ideophones (see Table 2) allow us to quantitatively assess how ideophones’ meanings and their prosodic features are correlated with each other. Figure 4 shows positive correlations between the two selected semantic dimensions of motion ideophones (i.e., speed and pleasantness) and the mean f0 of V2. Ideophones that represent faster and more pleasant motion, such as pyokoQ ‘a little frog hopping once’, tend to be pronounced with a higher f0 than those that represent slower and less pleasant motion, such as goroQ ‘a heavy object rolling down, lying down’.

Figure 4. The speed and pleasantness of ideophones and the mean standardized f0 of their V2.
Table 4 shows the results of the Bayesian regression model. The last column shows the probability that the posterior samples are either positive or negative, depending on the skew of the posterior distribution. These probabilities represent the certainty of the effects being credible. The Bayesian mixed regression model revealed positive, very credible effects of speed and pleasantness on f0 at V2. Moreover, ideophones with V2 /o/ (e.g., pyokoQ and potoQ) tend to be pronounced with a higher f0 than those with V2 /a/. It might be that the small and ‘inconspicuous’ image associated with the vowel /o/ in Japanese ideophones (Hamano, 1998) was produced with a higher f0 via the frequency code. [Footnote 2]
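The probability in the last column of such tables can be read as the share of posterior draws that fall on the dominant side of zero. The sketch below illustrates one way to compute it, under the assumption that the reported direction is whichever side holds more draws; the sample values are invented and are not draws from the fitted model.

```python
def probability_of_direction(samples):
    """Return the dominant sign of the draws and the proportion of draws with that sign."""
    n = len(samples)
    p_pos = sum(s > 0 for s in samples) / n
    p_neg = sum(s < 0 for s in samples) / n
    return ("positive", p_pos) if p_pos >= p_neg else ("negative", p_neg)

# Hypothetical posterior draws for a single regression coefficient.
draws = [0.2, 0.5, -0.1, 0.3, 0.4, 0.6, 0.1, 0.35]
direction, probability = probability_of_direction(draws)
```

With real model output, the list would hold several thousand post-warm-up draws per coefficient rather than eight values.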
Table 4. The results of the Bayesian mixed regression model for the mean f0 of the V2 of ideophones

Figure 5 shows the weak inverse relationship between the two semantic scales and mean intensity.

Figure 5. The speed and pleasantness of ideophones and their mean standardized intensity.
As shown in Table 5, a Bayesian regression model revealed that ideophones with an initial voiceless obstruent (e.g., koroQ) tend to be pronounced with lower intensity than those with an initial voiced obstruent (e.g., goroQ). Moreover, ideophones with V2 /o/ and those with V2 /u/ tend to be pronounced with higher intensity than those with V2 /a/. The effects of the two semantic scales, speed and pleasantness, were modest at best, however.
Table 5. The results of the Bayesian mixed regression model for the intensity of ideophones

Figure 6 shows the relationships between the two semantic scales and mean duration.

Figure 6. The speed and pleasantness of ideophones and their mean standardized duration.
As shown in Table 6, a Bayesian regression model revealed that ideophones with V2 /o/ tend to be pronounced with shorter duration than those with V2 /a/. No noticeable effects were found for the two semantic scales in the regression model.
Table 6. The results of the Bayesian mixed regression model for the duration of ideophones

3.3. Discussion
The production experiment demonstrated the iconic use of expressive prosody in Japanese ideophones. It showed that the association between pleasantness and f0 that Nygaard et al. (2009) found with native speakers of English also holds with Japanese speakers producing ideophones. We also found that f0 was associated with speed.
On the other hand, the effects of the semantic scales were modest – if not entirely absent – in terms of the intensity and duration of ideophones, which were more strongly affected by consonant and vowel types. The modest effects on intensity may be partly attributable to the uncontrolled recording environment; the distance between the participants’ mouths and the recording device may not have been fixed across the tokens. A perception experiment using controlled audio stimuli, such as Experiment 2, will address this potential confound.
In addition to these findings, it was observed that some participants used marked voice quality to express nuanced semantic differences between motion ideophones. For example, one participant pronounced dokaQ ‘a heavy object thudding’ with a harsh, pressed voice to emphasize the violent sound and motion. Falsetto was used for potoQ ‘a small light object dropping’ and pyokoQ ‘a little frog hopping once’, both of which represent a light motion of a small entity. One participant used a voiceless pronunciation for the quick, violent motion expressed by bataQ ‘a two-dimensional object slamming down’.
Ideophones generally have highly specific, holistic, multisensorial meanings, as suggested by their multiword translations used in this paper (Akita, 2012; Iida & Akita, 2023; Nuckolls, 2019). Therefore, it might be that the relationship between prosody and ideophones’ semantic properties can be captured more intuitively in terms of specific voice quality categories, such as falsetto and creaky voice, at least more so than general prosodic features, such as f0 and intensity. The relationship between ideophones and specific voice qualities will be further explored in Experiment 3, after testing the sound-symbolic significance of pitch, intensity and duration in perception in Experiment 2.
4. Experiment 2: Perceived meaning of f0, intensity and duration
Experiment 1 demonstrated that Japanese speakers can utilize iconic prosody to highlight the meaning of ideophones. Considering that production experiments generally have a high degree of freedom, the iconic effects of each prosodic feature may manifest themselves in a clearer fashion in perception experiments. To this end, Experiments 2 and 3 examined whether Japanese speakers use iconic prosody in understanding ideophones.
4.1. Method
4.1.1. Participants
A total of 50 monolingual speakers of Japanese (female: 26; male: 24; age: 22–62; M = 42.04; SD = 9.19) were recruited on CrowdWorks. None of them participated in Experiment 1. They were paid 300 JPY for their participation.
4.1.2. Stimuli
The same set of 20 simple ideophone sentences as Experiment 1 was used, but this time without IDS-like expressions, such as papa ‘dad’, Pengin-san ‘Mr. Penguin’ and the sentence-final particle -ne, which conveys ‘a soft tone of voice’. The first author, who is a male native speaker of Japanese, pronounced all ideophones in eight (2 × 2 × 2) distinct prosodic patterns: high (M = 145.10 Hz, SD = 5.16) versus low f0 (M = 118.86 Hz, SD = 2.99), high (M = 66.99 dB, SD = 1.83) versus low intensity (M = 63.04 dB, SD = 1.51) and long (M = 0.57 s, SD = 0.03) versus short duration (M = 0.42 s, SD = 0.02). Long pronunciation involved an extra-long V2, as in pukaaaQ, a strategy that is commonly found in the emphatic usage of ideophones. The same recording of the non-ideophonic part of each sentence, which was pronounced in normal tones, was used for all eight pronunciations of each ideophone. All the sound files are available at the project’s OSF repository mentioned above.
The sentences were presented in a random order, one sentence per page, on Google Forms, but the order of the eight audio files, labeled ‘A’ to ‘H’, was fixed due to the technical limitations of the platform: A: high f0–high intensity–long duration; B: high–high–short; C: high–low–long; D: high–low–short; E: low–high–long; F: low–high–short; G: low–low–long; and H: low–low–short.
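The fixed A-to-H ordering is simply the full crossing of the three two-level factors, with f0 varying slowest and duration fastest. A minimal sketch that regenerates the label-to-condition mapping is given below; the variable names are our own.

```python
from itertools import product
from string import ascii_uppercase

# Factor levels in the order in which they vary (slowest first).
levels = {
    "f0": ["high", "low"],
    "intensity": ["high", "low"],
    "duration": ["long", "short"],
}

# Pair the letters A..H with the 2 x 2 x 2 combinations.
design = {
    label: dict(zip(levels, combo))
    for label, combo in zip(ascii_uppercase, product(*levels.values()))
}

for label, condition in design.items():
    print(label, condition)
```

Because the order was fixed rather than randomized, position and prosodic condition are confounded within each page, which is worth bearing in mind when interpreting the choice proportions.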
4.1.3. Procedure
The participants were instructed to wear headphones or earphones and perform the task alone in a quiet environment. They were also instructed to listen to each recording as many times as they liked and choose the most suitable (or, if all sounded unnatural, most acceptable) pronunciation, from A to H, for the sentence.
4.1.4. Analysis
Separate Bayesian logistic regression models were fit for f0 (high [A, B, C, D] versus low [E, F, G, H]), intensity (high [A, B, E, F] versus low [C, D, G, H]) and duration (long [A, C, E, G] versus short [B, D, F, H]). The two mean semantic ratings (speed and pleasantness) for the ideophones, initial voicing and V2 were included as fixed effects, together with a random intercept for participant and by-participant random slopes for the two semantic ratings, initial voicing and V2. We used default weakly informative priors for the intercept and group-level SDs. In the f0 model, four MCMC chains were run with 5,000 iterations each, of which the first 4,000 were discarded as warm-ups. In the intensity and duration models, four MCMC chains were run with 3,000 iterations each, of which 2,000 iterations were warm-up iterations. Convergence was assessed via R-hat statistics (R-hat values = 1.00) and visual inspection of trace plots. The other details are identical to those in Experiment 1. See the R Markdown file for further details.
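Each eight-way choice thus yields three binary outcomes, one per logistic model. The sketch below shows this recoding and a resulting choice proportion, using the letter groupings stated above; the list of responses is hypothetical, and the function and field names are our own.

```python
# Letter groupings taken from the analysis description.
HIGH_F0 = set("ABCD")
HIGH_INTENSITY = set("ABEF")
LONG_DURATION = set("ACEG")

def code_choice(label):
    """Recode one A-H response into the three binary outcome variables."""
    return {
        "f0_high": int(label in HIGH_F0),
        "intensity_high": int(label in HIGH_INTENSITY),
        "duration_long": int(label in LONG_DURATION),
    }

# Hypothetical responses from five participants for a single ideophone.
choices = ["A", "D", "F", "G", "B"]
coded = [code_choice(c) for c in choices]

# Proportion of high-f0 choices, as plotted per ideophone in the results.
prop_high_f0 = sum(c["f0_high"] for c in coded) / len(coded)
```

The same coded records could then serve as the response variable in each of the three models.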
4.2. Results
In general, we observe those patterns that accord well with the overall results of Experiment 1. Figure 7 shows the proportions of high- and low-frequency choices for each ideophone. High f0 was preferred for ideophones with initial voiceless obstruents, which generally represent fast and pleasant motion, such as pyokoQ ‘a little frog hopping once’, and low f0 was preferred for those with initial voiced obstruents, which represent slow and unpleasant motion, such as goroQ ‘a heavy object rolling once, lying down’.

Figure 7. Proportions of high and low f0 sounds (A, B, C, D versus E, F, G, H) preferred for the 20 ideophones, ordered in the same way as the corresponding figure in Experiment 1.
A Bayesian regression model, summarized in Table 7, revealed credible effects of pleasantness and voicing. High f0 was preferred for ideophones with initial voiceless obstruents and for those that represent pleasant motion.
Table 7. The results of the Bayesian mixed regression model for the preferred f0 level

Figure 8 shows the proportions of high- and low-intensity sounds chosen for each ideophone, which showed results that are more straightforward than the corresponding results of Experiment 1. Two ideophones for violent movements (dokaQ ‘a hard object thudding down’ and bataQ ‘a two-dimensional object slamming down’) exhibited a strong preference for pronunciation with high intensity. In contrast, ideophones for light objects’ quiet motion, such as hiraQ ‘a light thin object fluttering down’, horoQ ‘a light teardrop falling’, potaQ ‘liquid dropping quietly’ and potoQ ‘a small light object dropping’, preferred pronunciation with low intensity.

Figure 8. Proportions of high- and low-intensity sounds (A, B, E, F versus C, D, G, H) preferred for the 20 ideophones, ordered in the same way as the corresponding figure in Experiment 1.
As shown in Table 8, a Bayesian regression model revealed credible effects of speed and pleasantness. Tokens with high intensity were preferred for ideophones for fast and unpleasant motion. [Footnote 3]
Table 8. The results of the Bayesian mixed regression model for the preferred intensity level

Figure 9 shows the preference for long versus short renditions for each ideophone. As was the case with Experiment 1, our /Q/-ending ideophones generally represent quick motion and prefer short pronunciation, but long pronunciation was also chosen frequently for ideophones that represent relatively slow movements, such as mowaQ ‘steam/smoke coming out’, buraQ ‘taking a walk’ and goroQ ‘a heavy object rolling once, lying down’.

Figure 9. Proportions of long and short sounds (A, C, E, G versus B, D, F, H) preferred for the 20 ideophones, ordered in the same way as the corresponding figure in Experiment 1.
As shown in Table 9, a Bayesian regression model revealed credible effects of speed, pleasantness, voicing, V2 /o/ and /u/. Long duration was preferred for those with initial voiceless obstruents, V2 /u/ and slow and pleasant motion, whereas short duration was preferred for V2 /o/ and fast and unpleasant motion.
Table 9. The results of the Bayesian mixed regression model for the preferred duration

4.3. Discussion
The current perception experiment provided additional, and arguably even clearer, evidence for the sound-symbolic relevance of prosody in Japanese ideophones. As in Experiment 1, higher f0 was associated with more pleasant motion. Higher intensity was associated with faster and less pleasant motion, which may involve greater energy. Longer duration was associated with slower motion and, contrary to the results of Experiment 1, greater pleasantness, extending the scope of Shintel et al.’s (2006) findings. This association between slow motion and pleasantness – rather than unpleasantness – may be due to specific motion contexts where someone or something moves in a relaxed, leisurely manner, illustrated by buraQ ‘taking a walk’, goroQ ‘lying down’ and pukaQ ‘a light object floating’.
5. Experiment 3: Perceived meaning of voice quality
In the final experiment, we went a step further into the nature of iconic prosody and investigated how different voice quality categories may interact with the meanings of Japanese ideophones. People do not actively use marked voice quality in noninteractive experimental settings such as Experiment 1. Therefore, in order to explore the question of whether certain voice qualities are favored for particular meanings, we recorded ideophones pronounced with marked voice quality and asked Japanese speakers to evaluate them in a perception experiment.
5.1. Method
5.1.1. Participants
Fifty-one monolingual speakers of Japanese who had not participated in either Experiment 1 or 2 (21 female, 29 male, 1 preferred not to answer; age: 22–66, M = 42.82, SD = 10.24) were recruited on CrowdWorks. They were paid 300 JPY for their participation.
5.1.2. Stimuli
The same set of 20 simple ideophone sentences as in Experiment 2 was used. The first author, who is a male native speaker of Japanese, pronounced each ideophone with four types of marked phonation: creaky voice (a low-pitched voice with audible pulses, as in the end of an old man’s utterance), harsh voice (a rough, pressed voice produced with tensed vocal folds, similar to the sound that one often makes when lifting a heavy object), falsetto and whisper. The rest of each sentence was pronounced in a modal voice. These sounds are available at the OSF repository. The order of the sentences was randomized on Google Forms, but the order of the four voice qualities for each sentence was fixed: creaky voice, harsh voice, falsetto and whisper.
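The presentation scheme described above, with sentence order shuffled and the within-sentence order of voice qualities held fixed, can be sketched as follows. This is only an illustration: the actual randomization was handled by Google Forms, and the function name `build_trial_list`, the seed and the three sample ideophones are assumptions made for the example.

```python
import random

# Fixed within-sentence order of voice qualities, as stated in the text.
VOICE_QUALITIES = ["creaky", "harsh", "falsetto", "whisper"]


def build_trial_list(ideophones, seed=None):
    """Return one trial per ideophone sentence in shuffled order.

    Hypothetical helper: the study used Google Forms for randomization.
    """
    rng = random.Random(seed)
    order = list(ideophones)
    rng.shuffle(order)  # sentence order is randomized
    # Options 1-4 always map onto the same voice qualities.
    return [(item, list(VOICE_QUALITIES)) for item in order]


if __name__ == "__main__":
    sample = ["zuboQ", "tsuruQ", "mowaQ"]  # three of the 20 ideophones
    for item, options in build_trial_list(sample, seed=1):
        print(item, options)
```

Because the option order is fixed, position and voice quality are confounded within a sentence; the sketch makes that design property explicit rather than resolving it.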
5.1.3. Procedure
The participants were instructed to wear headphones or earphones and complete the task alone in a quiet environment. They were also instructed to listen to each recording as many times as they liked and choose the most suitable (or, if all sounded unnatural, most acceptable) pronunciation for the sentence: the first, second, third or fourth.
5.1.4. Analysis
A Bayesian multinomial logistic regression model was fit, with the four voice quality categories as the dependent variable and the two mean semantic ratings (speed and pleasantness) for the ideophones, initial voicing and V2 as fixed effects; the random effect structure was identical to that of Experiments 1 and 2. The model was run with four chains of 2,000 iterations, including a warm-up period of 1,000 iterations. We set the target acceptance rate to adapt_delta = 0.97 and the maximum tree depth to max_treedepth = 15. See the R Markdown file at the study’s OSF repository for details.
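To make the model structure concrete, the sketch below shows how a multinomial logistic model with a reference category converts linear predictors into choice probabilities, and how many post-warm-up draws the stated sampler settings would yield. It assumes that the 2,000 iterations are counted per chain (the usual convention in Stan-based tools) and that creaky voice is the reference category, as in Table 10. Random effects, voicing and V2 are omitted for brevity, and the coefficient values are invented purely for illustration; they are not the fitted estimates.

```python
import math

CATEGORIES = ["creaky", "harsh", "falsetto", "whisper"]  # creaky = reference


def choice_probabilities(speed, pleasantness, coefs):
    """Softmax over linear predictors; the reference category is fixed at 0.

    coefs maps each non-reference category to (intercept, b_speed, b_pleasant).
    """
    eta = {"creaky": 0.0}
    for cat, (b0, b_speed, b_pleasant) in coefs.items():
        eta[cat] = b0 + b_speed * speed + b_pleasant * pleasantness
    denom = sum(math.exp(v) for v in eta.values())
    return {cat: math.exp(eta[cat]) / denom for cat in CATEGORIES}


def post_warmup_draws(chains=4, iterations=2000, warmup=1000):
    """Total retained draws, assuming iterations are counted per chain."""
    return chains * (iterations - warmup)


if __name__ == "__main__":
    # Invented coefficients (NOT the estimates reported in Table 10).
    toy_coefs = {
        "harsh": (0.0, 0.8, -0.6),
        "falsetto": (0.0, 0.9, 0.7),
        "whisper": (0.0, 0.5, 0.0),
    }
    probs = choice_probabilities(speed=1.0, pleasantness=1.0, coefs=toy_coefs)
    print({cat: round(p, 3) for cat, p in probs.items()})
    print(post_warmup_draws())
```

Under this parameterization, a positive coefficient for a category raises its odds relative to creaky voice as the predictor increases, which is how baseline-relative estimates of this kind are usually read.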
5.2. Results
As shown in Figure 10, different ideophones exhibited different preferences for the four voice qualities, and the iconic basis of these sound–meaning associations appears to be fairly straightforward to interpret. Harsh voice was preferred for ideophones that represent a violent sound and motion, such as zuboQ ‘one’s foot falling into a ditch’, dokaQ ‘a hard object thudding down’ and bataQ ‘a two-dimensional object slamming down’. Falsetto was preferred for ideophones that represent a light object’s fast motion, such as tsuruQ ‘slipping’, koroQ ‘a light object rolling once’, kuruQ ‘a light object spinning quickly once’, potoQ ‘a small light object dropping’ and pyokoQ ‘a little frog hopping once’. This result is again consistent with the frequency code hypothesis (Ohala, 1984).Footnote 4 Whisper was preferred for ideophones that represent quiet motion, such as mowaQ ‘steam/smoke coming out’, paraQ ‘small light drops falling’, hiraQ ‘a light thin object fluttering down’, horoQ ‘a light teardrop falling’, suruQ ‘a light object going off quietly’ and potaQ ‘liquid dropping quietly’.

Figure 10. Proportions of the four voice qualities preferred for the 20 ideophones.
No straightforward semantic generalization appears to hold for ideophones that preferred creaky voice: buraQ ‘taking a walk’, goroQ ‘a heavy object rolling once’, guruQ ‘going around’, taraQ ‘liquid dropping’, pukaQ ‘a light object floating’ and poroQ ‘a small light object dropping’. This is probably because a creaky voice (i.e., a very low-pitched voice) generally sounded most natural and least marked for the male speaker.
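Proportions of the kind plotted in Figure 10 are a per-ideophone tally of forced-choice responses. A minimal sketch follows, using a handful of made-up responses rather than the study's data (which are in the OSF repository); the function name `preference_proportions` is hypothetical.

```python
from collections import Counter, defaultdict

VOICE_QUALITIES = ["creaky", "harsh", "falsetto", "whisper"]


def preference_proportions(responses):
    """Map each ideophone to the share of choices per voice quality.

    responses: iterable of (ideophone, chosen_voice_quality) pairs.
    """
    counts = defaultdict(Counter)
    for ideophone, choice in responses:
        counts[ideophone][choice] += 1
    return {
        ideophone: {vq: c[vq] / sum(c.values()) for vq in VOICE_QUALITIES}
        for ideophone, c in counts.items()
    }


if __name__ == "__main__":
    # Made-up responses for illustration only.
    toy = [
        ("zuboQ", "harsh"), ("zuboQ", "harsh"),
        ("zuboQ", "creaky"), ("zuboQ", "harsh"),
        ("mowaQ", "whisper"), ("mowaQ", "whisper"),
    ]
    for ideophone, props in preference_proportions(toy).items():
        print(ideophone, props)
```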
As shown in Table 10, a Bayesian regression model revealed that, consistent with the results of Experiments 1 and 2, falsetto (i.e., a distinctly high-pitched sound) was preferred to creaky voice for faster and more pleasant motion. This result may be related to the actual example of an ideophonic utterance cited in Section 1, in which hyoi pronounced in a falsetto appeared to express suddenness and unexpected happiness. On the other hand, harsh voice was preferred to creaky voice for ideophones depicting faster but less pleasant motion. Whisper was preferred to creaky voice for ideophones depicting faster motion.
Table 10. The results of the Bayesian regression model for the preferred voice qualities, with creaky voice as a baseline

Note: The factors that are of particular interest are highlighted in bold.
5.3. Discussion
The current results show that the complex semantics of motion ideophones can be iconically associated with specific voice quality categories. Crucially, raw acoustic features, such as f0 and intensity, do not fully account for these associations. Specifically, while in Experiments 1 and 2, fast speed was generally associated with pleasantness, a harsh voice was associated with fast but unpleasant motion. Likewise, while low intensity evoked a slow image in Experiment 2, whisper, which is a low-intensity sound, was associated with fast motion. These results indicate that Japanese speakers can refer to the acoustic details of ideophones – demonstrably in terms of both raw acoustic features such as f0 and voice qualities – in their iconic interpretation of these words.
As an anonymous reviewer pointed out, it is worth mentioning that semantic scales and voice quality categories do not always stand in one-to-one correspondence. For example, in Experiment 1, one participant pronounced bataQ ‘a two-dimensional object slamming down’ without regular vibrations of the vocal folds (i.e., in a whispered voice). However, in the current experiment, this ideophone exhibited a strong preference for harsh voice, arguably due to the violent movement it represents. This variation indicates that iconic prosody can not only emphasize the semantic components inherent in individual ideophones but may also be able to add a new dimension to them.
6. General discussion
The three experiments demonstrated that the prosody of real words can be used and understood iconically. These findings confirm the importance of iconic prosody in human communication, which has been primarily investigated using novel words and nonlinguistic vocalizations in previous studies. Iconic prosody is attracting broad attention in cognitive science, as it is one of the major candidates for the original form of human language (Arbib et al., 2008; Ćwiek et al., 2021; Haiman, 2018; Perlman, 2026). The iconic prosody of ideophones examined in the current study may be able to provide a missing link in this evolutionary theory.
Ideophones are imitative lexemes that are iconic and conventionalized at the same time. Unlike iconic vocalizations (Ćwiek et al., 2021), which are primarily prosodic and do not consist of consonants and vowels, ideophones are segmentally specified in the lexicon of an individual language. One can speculate that ideophones inherited iconic prosody from nonlinguistic vocalizations and gradually lost it to form a conventionalized system of nonimitative symbols (i.e., signs whose form–meaning relationship is arbitrary; Peirce, 1932).
This hypothesis gains additional support from the fact that the iconic prosody of ideophones can be adjusted in a graded manner. For example, pyokoQ-to ‘with a light hop’ can be pronounced in plain prosody [pʲokótːo], expressive prosody [pʲŏkőtːo] (extra-short V1, extra-high-pitched V2) and even more expressive prosody [{F pʲŏkőtːo F}] (extra-short V1, falsetto) (for a related observation, see Rhodes, 1994). Thus, as depicted in Figure 11, iconic prosody allows us to draw a fine-grained evolutionary path from nonlinguistic vocalizations to nonimitative, symbolic words via ideophones with varying degrees of expressiveness. Analyzing the iconic function of ideophone prosody might reveal how this evolution may have taken place – for example, what type of meaning was lexicalized first.

Figure 11. Possible evolutionary path from nonlinguistic vocalizations to non-ideophonic, symbolic words via ideophones.
7. Conclusion
This paper has examined the relevance of prosody in both the production and perception of ideophonic expressions in Japanese. We have demonstrated that Japanese speakers utilize iconic prosody for enhancing the depictive power of each ideophone and that they also have a clear preference regarding which acoustic properties should accompany what kinds of meanings. It is also worth noting that iconic prosody can be observed at different levels of abstraction. The association between high f0 and speed was confirmed in all three experiments, but the results of Experiment 3 also indicated that specific semantic aspects of ideophones can be iconically linked with a specific type of voice quality. The iconic prosody of ideophones can be considered a remnant of nonlinguistic vocalizations as an early form of spoken language. This remnant may connect iconic vocalizations and arbitrary words, which are otherwise far apart in the relevant evolutionary theory.
This study is a first step toward a comprehensive examination of iconic prosody in real words and opens up research opportunities in various respects. For example, future research should examine how iconic prosody works in semantic domains other than motion, including more abstract ones such as pain and emotion (McLean, 2021). Another future direction would be a finer-grained psychoacoustic analysis of iconic prosody using continuous measures of voice quality instead of discrete categories (Lacey et al., 2020; Villegas et al., 2023). Furthermore, a cross-linguistic comparison of prosody–meaning associations may enrich our evolutionary considerations. The hypothesis outlined in Section 6 would predict that iconic prosody should manifest itself across languages, a prediction that should be empirically tested in future studies (Akita, 2025). Finally, a qualitative analysis of actual ideophone uses in specific discourse, such as the ideophone in a falsetto cited at the beginning of this paper, may help us to better understand the meaning of iconic prosody.
Data availability statement
All stimuli, experimental instructions, data and code are available at https://osf.io/xrd96/.
Acknowledgments
We are grateful to Laura Speed and the anonymous reviewer for their insightful comments on an earlier version of this paper.
Competing interests
The authors declare none.