1. Introduction
Mathematical theories such as information theory (Shannon 1948) and Bayesian inference (Bayes 1763; Laplace [1820] 1886) offer a means of quantifying the notion of communicative efficiency under the assumption that language is an effective system of message transfer. Recent linguistic research has provided increasing evidence that information-theoretic measures such as predictability, surprisal and informativity/entropy play a significant role in accounting for patterns in linguistic behaviour and the course of linguistic change. One of the underlying ideas of this research program is that linguistic signals differ in the amount and quality of the information they carry, which causes a tendency for contextually predictable linguistic units to be reduced. Conversely, signals that increase the probability of successfully conveying the intended message may be enhanced in order to achieve robust information transmission. Prior research has investigated the role of information-theoretical and probabilistic measures in linguistic behaviour (see Jaeger & Buz 2017 for a comprehensive review). For example, findings establish links with morphological contraction (Frank & Jaeger 2008; Bresnan & Spencer 2012) and with the omission of morphemes (e.g., case marker omission; Fry 2001; Lee 2006; Kurumada & Jaeger 2015; Norcliffe & Jaeger 2015), function words (e.g., that; Jaeger 2010; Jaeger & Grimshaw 2013) or referring expressions (e.g., pronouns; Tily & Piantadosi 2009).
This kind of approach is especially flourishing in phonology in the framework of Message-Oriented Phonology (Hall et al. 2016, 2018; henceforth MOP), which applies the basic concepts of information theory and Bayesian inference to phonological research. The idea that constitutes the foundation of MOP is that context-specific differences in the amount of information that phonological units (segments, syllables, etc.) carry about messages influence the extent to which general articulatory and perceptual biases affect their realisation (Hall et al. 2016). In this way, more informative pieces of the signal will be enhanced, while less informative pieces will be reduced. When Hall et al. (2016) first introduced MOP, they demonstrated the concept based on final obstruent devoicing, a cross-linguistically common phenomenon, focussing on the patterns in English and German. They claimed that there is a direct correlation between the rate of devoicing in word-final obstruents and the number of minimal pairs (lexical competitors). Under the assumptions of MOP, final voiced obstruents are prone to reduction in languages such as German, where only a few minimal pairs exist, because word-final positions are ‘weak’ in terms of the amount of information they can convey, and maintaining voicing word-finally is physiologically costly. On the other hand, when the voicing distinction in word-final position allows distinguishing many lexical competitors, as in English, voicing will be preserved to maintain the contrast.
Probability effects at different levels provide supporting evidence for MOP: segments (Seyfarth 2014; Cohen Priva 2015; Turnbull 2018; with a focus on consonants in Kirov & Wilson 2012; Schertz 2013; Seyfarth et al. 2016; Nelson & Wedel 2017; Chodroff & Wilson 2018; Sano 2018a; Wedel et al. 2018; and on vowels in Aylett & Turk 2004; Hume & Bromberg 2005; Shaw & Kawahara 2017; Wedel et al. 2018), phonological patterns and processes (Hall et al. 2018; Kawahara & Lee 2018; Wedel et al. 2019), variation and sound change (Wedel et al. 2013a,b; Bowern & Babinsky 2018) or morpheme/word duration (Bell et al. 2003, 2009; Hashimoto 2021), among others.
Within MOP-based research, close attention has been paid to the linguistic phenomenon referred to as contrastive hyperarticulation (Wedel et al. 2018). When a phonetic cue contributes to phonetically distinguishing a word from its lexical competitors, that cue tends to be hyperarticulated. That is, the distinctive feature of a sublexical unit that is relevant to a specific contrast is enhanced (e.g., longer duration, greater distance between vowels; Wedel et al. 2018) to make the distance in phonetic properties greater. This process is active not only within a single conversation, but also based on the existence of competitors in the lexicon as a whole (as also shown in Baese-Berk & Goldrick 2009).
Contrastive hyperarticulation must be distinguished from slow/clear-speech hyperarticulation. Although the two types seem to have similar purposes (i.e., better comprehension/identification of lexical items; Payton et al. 1994; Bradlow et al. 1996), they can have opposite effects on phonetic implementation. If we take the example of voice onset time (VOT; Lisker & Abramson 1964), the interval of time between the release of the stop closure and the onset of voicing of the following vowel, the logical consequence of slow speech hyperarticulation is that, together with the enhancement of segment duration, VOT for voiceless stops should be longer. Contrastive hyperarticulation has a similar effect. For voiced stops, however, enhancement due to contrastive hyperarticulation should result in a shorter VOT, while we expect that in clear speech it would become longer in most cases (Smiljanić & Bradlow 2008). (In contrast, Kang & Guion 2008 report a shortening of VOT in Korean lenis stops for clear speech hyperarticulation.)
Previous studies provide evidence that context-based contrastive hyperarticulation is active cross-linguistically, for example, duration of /p/ aspiration, fricative voicing contrasts, vowel length and quality contrasts and stop VOT duration in English (Baese-Berk & Goldrick 2009; Kirov & Wilson 2012; Schertz 2013; Seyfarth et al. 2016; Nelson & Wedel 2017; Wedel et al. 2018); vowels in Korean (Kang et al. 2019); and closure duration in singleton–geminate contrasts in Japanese (Sano 2018a).
These studies suggest, in part, that hyperarticulation is cue-specific. That is, the only phonetic cues that participate in contrastive hyperarticulation are those that specifically contribute to the maintenance of the phonetic distance between the target and competitors for a given contrast. For example, hyperarticulation of VOT can be triggered by the presence of ‘voicing’ minimal pairs (e.g., /pit/ vs. /bit/), but not by other lexical neighbours of the target word (e.g., /kit/, /sit/, …).
The studies mentioned above offer a picture of the cue-specific nature of contrastive hyperarticulation in English. Experimental findings in Kirov & Wilson (2012), for example, indicate that the VOT of word-initial voiceless stops correlates with the existence of a competitor for voicing and place of articulation, while no such correlation is observed in other positions. In Fricke (2013), rather than the number of neighbours of a given word (neighbourhood density), the number of minimal pairs targeting the initial stop was the better predictor of VOT duration of voiceless stops in the Buckeye Corpus. In another experimental study exploring voiced and voiceless stops, Schertz (2013) reports that hyperarticulation of VOT was triggered by the existence of a voicing competitor, rather than a place/manner competitor. Seyfarth et al. (2016) focus on the word-final /s/–/z/ voicing contrast and experimentally demonstrate that speakers hyperarticulate when it is contextually relevant. Specifically, they observed that the signal was enhanced (i.e., shorter vowels before /s/ and longer voicing for /z/) when doing so increased a relevant contrast, namely when there was a lexical competitor. Nelson & Wedel (2017), using a speech corpus, found that the presence of a VOT-specific minimal pair was a better predictor of contrastive hyperarticulation than the number of lexical neighbours differing in the initial segment. They also found a more robust effect for voiced stops (decreased duration when there is a competitor) than for voiceless stops. Lastly, Wedel et al. (2018), using the same corpus, investigated both VOT of word-initial stops in voicing contrasts and F1−F2 Euclidean distance for vowel contrasts. For both types, they report that the existence of a cue-specific minimal pair competitor is a trigger for hyperarticulation: a greater phonetic distance with the competitor was observed. Conversely, neighbourhood density was shown to be irrelevant.
This article offers a case study of the cue-specificity of contrastive hyperarticulation, focussing on voicing and length contrasts in Japanese using a speech corpus. Based on the assumption that lexical competition induces synchronic, phonetically specific enhancement of phonemic contrasts (Aylett & Turk 2004, inter alia), this study analyses the patterns in sub-lexical hyperarticulation and examines the hypothesis that a specific cue (viz., VOT) that allows distinguishing a word from its minimal pair competitor is hyperarticulated to provide more information and maintain the contrast. Note that corpus studies appear to be especially relevant to the study of contrastive hyperarticulation: Nelson & Wedel (2017) point out that experimental studies, which often involve the use of speech paradigms, do not seem to provide enough motivation to trigger hyperarticulation in speakers. They suggest that speech in an experimental setting may induce so much clear-speech hyperarticulation that contrastive hyperarticulation is obscured or does not occur, as further hyperarticulation may not seem necessary to speakers.
Japanese differs from languages that are already well-studied in the MOP framework (English and other typologically related Indo-European languages) in rhythmic/prosodic properties, phonological characteristics and grammatical structure (such as word order or morphological system). Thus, investigating Japanese provides evidence bearing on the question of whether the cue-specificity of contrastive hyperarticulation holds across language types.
In terms of rhythmic/prosodic properties, Japanese is a mora-timed language (Han 1962, 1994; Port et al. 1987), in which moras tend to be of similar duration (see Port et al. 1980; Homma 1981; Warner & Arai 2001a,b for a critical review of mora isochrony and mora-timing compensation), which may affect the realisation of hyperarticulation in ways not found in other languages. That is, we expect hyperarticulation targeting durational cues to be constrained to maintain mora-based durational contrasts: variations affecting duration should be restricted to a threshold of categorical perception in order to avoid the violation of language-specific timing properties that might hinder communication (e.g., by causing a short segment to be identified as long). Conversely, in a stress-timed language like English, no such restriction should be active, as there are no contrasts based solely on durational cues, and so the degree of hyperarticulation of durational cues should be greater in English than in Japanese. It is therefore meaningful to examine whether there is contrastive hyperarticulation despite the timing restrictions, or whether the phonetic implementation is rigid.
Furthermore, Japanese has two different phonological contrasts that target stops: voicing (see §1.1.1) and length (§1.1.2), which means that the same segment can have both a voicing and a geminate minimal pair (e.g., /kaki/ ‘persimmon’ vs. /kagi/ ‘key’ and /kakki/ ‘energy’). This allows us to test the cue-specificity of contrastive hyperarticulation by comparing the effects that contrasts have on VOT. If contrastive hyperarticulation is cue-specific, then we expect it to target only the specific cue that is relevant to a contrast. However, if it is not cue-specific, then we expect it to target the segment as a whole (as in clear speech hyperarticulation). For example, VOT duration should vary as well in the case of a geminate counterpart (and not only closure duration as reported in previous studies; Sano 2019). Additionally, the literature reports an ongoing change in the VOT of Japanese voiced stops (from long lead to short lag; Riney et al. 2007; Gao et al. 2019; see §1.1.1), and it seems of interest to investigate how this sound change may affect patterns of hyperarticulation.
Lastly, we are not aware of any study examining multiple contrasts targeting the same segment in the same language. Because Japanese offers a variety of minimal pairs (voicing and length, both contrasts having a high functional load) for stop consonants, it provides a good opportunity to explore the phonetic specificity of contrastive hyperarticulation.
1.1. Phonetic properties of Japanese
1.1.1. Voicing contrast in Japanese
Japanese has a two-way contrast in voicing: voiceless stops /p/, /t/, /k/ contrast with their voiced counterparts /b/, /d/, /g/ (e.g., /kin/ ‘gold’ vs. /gin/ ‘silver’), in both word-initial and word-medial positions (Vance 1987). Prior phonetic research investigating Japanese stops has identified the presence/absence of laryngeal activity and VOT as the main acoustic cues that contribute to the distinction between the two categories (Shimizu 1977 et seq.). Thus, the traditional description of VOT opposes two types of plosives, ‘prevoiced’ (negative VOT; Shimizu 1996, 1999) and ‘voiceless unaspirated’ (positive VOT; Homma 1980; Shimizu 1996, 1999), with variations depending on speech rate and position in the word.
However, the nature of VOT in Japanese is more complex than the dichotomy presented above. Recent studies indicate that the realisation of voicing is variable and undergoing a synchronic change: younger speakers tend to devoice stops in word-initial position (while older speakers tend to retain prevoicing), which suggests that VOT might be progressively losing its status as the primary cue for voicing (Takada et al. 2015; Gao & Arai 2018; Gao et al. 2019). Takada et al. (2015) found that the degree of voicing appears to be shifting, as they observe extreme-lead, short-lead and short-lag VOT in speakers from oldest to youngest. Additionally, the VOT of voiceless consonants in Japanese does not match straightforwardly the traditional short lag vs. long lag dichotomy (Lisker & Abramson 1967). Instead, voiceless stops are ‘moderately’ aspirated (Riney et al. 2007), with a VOT that is shorter than long-lag VOT but longer than short-lag VOT.
The consequence of the above is that the VOT of voiced and voiceless consonants in initial position may overlap, suggesting that VOT alone might not be enough to maintain the contrast. However, the investigation of other possible secondary cues (f0, voice quality, following vowel) has produced limited results, and VOT still remains the primary and necessary cue for the voicing contrast (Riney et al. 2007; Takada et al. 2015; Byun 2021). In the case of English, although both voiced and voiceless stops may have a positive VOT (Lisker & Abramson 1964), there is no confusion between categories, because they fall into the short-lag and long-lag categories, respectively. In Japanese, by contrast, the distinction may be difficult in initial position, because voiceless stop VOT values are between the short- and long-lag categories. In medial position, the contrast is nevertheless retained.
1.1.2. Consonantal length contrast in Japanese
Segmental length plays an important role in Modern Japanese. Consonants in intervocalic position can be short (singleton) or long (geminate), and the distinction between them carries lexical contrasts in a variety of minimal pairs, such as /kako/ ‘past’ vs. /kakko/ ‘parenthesis’ or /hato/ ‘dove’ vs. /hatto/ ‘hat’. A large body of research has been conducted on the phonetics of singletons and geminates in Japanese. These studies have contributed to identifying the acoustic correlates involved in this distinction. The primary acoustic correlate of the singleton–geminate contrast in Japanese is closure duration: constriction for geminates is two to three times longer than for their singleton counterparts, with duration varying according to place and voicing (Han 1962, 1994; Homma 1981; Port et al. 1987; Kawahara 2015). Previous studies have also identified factors such as the duration of preceding or following vowels and non-durational acoustic correlates like intensity, f0 and F1 (Port et al. 1987; Han 1994; Hirata 2007; Kawahara 2015). On the other hand, other phonetic cues such as VOT have been shown to be unrelated to the phonetic implementation of the contrast (Homma 1981; Hirata & Whiton 2005; Sano 2019).
1.2. Goals and research agenda
In examining voicing and consonantal length contrasts in Japanese, we aim to provide additional evidence for the cue-specific nature of contrastive hyperarticulation, as MOP predicts that the same phonological unit can be affected in different ways by contrastive hyperarticulation when it contributes to identifying specific words relative to competitors. That is, depending on the competitors in the lexicon (e.g., the presence or absence of a minimal pair), specific phonetic cues (e.g., VOT) are enhanced, while others remain unchanged.
Previous studies on VOT in the MOP framework suggest that, in English, the existence of a voiced stop competitor for a word-initial voiceless stop tends to induce contrastive hyperarticulation of VOT. That is, in an experimental setting, contrastive hyperarticulation of a voiceless stop leads to a longer VOT for target words with a minimal pair competitor when compared to those without (Baese-Berk & Goldrick 2009; Peramunage et al. 2011). Other studies have also found that hyperarticulation of the voicing contrast could be enhanced by visual stimuli (Kirov & Wilson 2012; Buz et al. 2016; Seyfarth et al. 2016) and specifically targets VOT, the primary acoustic cue, instead of triggering a lengthening at the word level (Buz et al. 2014). In the case of Japanese, it appears reasonable to expect that the presence of lexical competitors should also trigger hyperarticulation of VOT duration: a shorter VOT for voiced consonants and a longer one for voiceless consonants. This is supported by Sano (2018a,b), who found in a corpus study that hyperarticulation can also be observed in Japanese consonantal length contrasts: when a lexical item has lexical competitors, the closure duration is longer for geminates and shorter for singletons.
In the current study, we specifically investigate the prediction that a distinctive phonetic cue (VOT) will undergo hyperarticulation when involved in a lexical contrast in which it maintains the phonetic distance between the target and the competitor. We show that in Japanese the existence of a voicing minimal pair competitor in the lexicon affects (i.e., enhances) the VOT duration of the target segment (shorter for voiced stops, longer for voiceless stops), while no such effect is induced by the existence of other types of contrasts, here consonantal length, in which VOT is not a relevant phonetic cue. This provides further evidence that the phonetic specificity of contrastive hyperarticulation is not limited to English and related languages, but also holds in typologically different languages like Japanese.
The remainder of this article is structured as follows: in §2, we introduce the corpus used for the analysis and the variables included in the statistical model. §3 summarises the results obtained in the statistical analysis. Based on the results, §4 discusses the nature of cue-specificity and other issues related to hyperarticulation with reference to the previous literature. §5 concludes this study.
2. Methods
2.1. The corpus and data collection
The analysis of the present study is based on the Corpus of Spontaneous Japanese Relational Database (NINJAL 2012; henceforth CSJ-RDB). The CSJ-RDB is a subset of the Corpus of Spontaneous Japanese (CSJ), one of the largest annotated corpora of spoken Japanese. The CSJ is abundantly annotated with linguistic and non-linguistic information that is suitable for detailed analysis. The target data were retrieved from the CSJ-RDB by focussing on a selection of 12 speech samples that is balanced in speech style (the CSJ-mini provided by Hanae Koiso of the National Institute for Japanese Language and Linguistics, which consists of the following speech samples: A01F0055, A02M0098, A05F0039, A11M0846, D01F0023, D01M0009, D04F0022, D04M0010, S00F0014, S01M0005, S02M0043, S03F0108). The breakdown of the CSJ-mini is as follows: monologue (eight speech samples) and dialogue (four speech samples), amounting to about 34,000 words produced by 11 speakers (five males and six females; age: 20s (3), 30s (5), 40s (1) and 60s (2)). Using the SQLite database language (https://jp.navicat.com/), this study employed the phonetic/phonological and morphological information annotated in the CSJ-RDB.
The target segments were stops (/p/, /t/, /k/, /b/, /d/, /g/). In addition to the target segments, we also retrieved information from the CSJ-mini about (a) segments immediately preceding/following the target segments, (b) syllables immediately preceding/following the syllables that contain the target segments and (c) words and phrases that contain the target segments. Tokens were excluded from the dataset, however, if the targeted segments occurred in filled pauses or word fragments.
After exclusion, the remaining target segments in the dataset were categorised as voiceless or voiced, based on the phonetic annotation provided in the CSJ-RDB. Other segmental properties, such as place, position in a word, height of the following vowel and existence of a minimal pair (see §2.2.2) were manually annotated for each target segment.
Segmental intervals in the CSJ-RDB (onset time and offset time) are annotated for each linguistic unit, such as segment, syllable, word and phrase. For stops, separate annotation is provided for the closure (with the label ‘<cl>’) and the burst (labelled with the segment) portions. The duration of VOT was obtained by subtracting the onset time of the target segments from their offset time based on the annotation in the CSJ-RDB. Note that the nature of the labels did not allow us to take into account the presence of laryngeal activity during the closure portion; thus, VOT was treated as positive in our analysis, as in English (which has an aspiration contrast). However, as mentioned in §1.1.1, the VOT of voiced plosives in Japanese is undergoing a change, and prevoicing is often absent in younger generations. Given that speakers in the CSJ-mini mainly belong to younger generations, we expect to observe few occurrences of prevoicing, and so this treatment of VOT seems suitable. The duration of other units was calculated in the same manner. An exhaustive search and filtering of the data from the CSJ-mini resulted in a dataset of 4,448 tokens, of which 1,222 (27.5%) were voiced stops and 3,226 (72.5%) were voiceless stops.
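The duration measurement described above can be sketched as follows. This is a minimal illustration, not the authors' extraction script: the `Interval` class, its field names and the example times are hypothetical stand-ins for the onset/offset annotation in the CSJ-RDB, whose actual table and column names are not given here.

```python
from dataclasses import dataclass


@dataclass
class Interval:
    """Hypothetical stand-in for one annotated interval (times in seconds)."""
    label: str    # "<cl>" for the closure, the segment symbol for the burst
    onset: float
    offset: float


def duration_ms(interval: Interval) -> float:
    """Duration = offset time minus onset time, converted to milliseconds."""
    return (interval.offset - interval.onset) * 1000.0


def token_share(n_tokens: int, n_total: int) -> float:
    """Percentage of the dataset, rounded to one decimal place."""
    return round(100.0 * n_tokens / n_total, 1)


# Invented example: a closure interval followed by the burst interval of /k/.
closure = Interval("<cl>", 1.230, 1.285)
burst = Interval("k", 1.285, 1.307)

# VOT is read off the burst interval only, so it is always positive here.
vot_ms = duration_ms(burst)

# Shares of voiced and voiceless tokens, using the counts given in the text.
voiced_share = token_share(1222, 4448)
voiceless_share = token_share(3226, 4448)
```

Because only the burst interval is measured, any voicing during the closure is invisible to this computation, which is the limitation the text notes.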
2.2. Factors in statistical analysis
2.2.1. Response variable
For the purpose of examining the relationship between speech rate and VOT duration, we calculated speech rate by dividing the number of moras in a word containing the target segment by the duration of the word in seconds. As previously reported, VOT varied considerably depending on speech rate, and the two were inversely correlated (voiced: $r = -0.189$, $t(1220) = -6.724$, $p < 0.01$; voiceless: $r = -0.123$, $t(3224) = -7.015$, $p < 0.01$). In other words, VOT tends to be shorter as the speech rate increases and longer as it decreases. Furthermore, phonetic enhancement resulting from contrastive hyperarticulation should be distinguished from enhancement due to slow/clear-speech hyperarticulation (see, e.g., Wedel et al. 2018). For these reasons, raw VOT values observed in the corpus were normalised by speech rate using a measure-internal method (see Wedel et al. 2018 and the references cited therein): namely, we multiplied raw VOT by speech rate. Following the previous literature, the speech rate-normalised VOT was then log-transformed (Bell et al. 2009; Seyfarth 2014).
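The normalisation steps can be illustrated with a short sketch. The function names and the example values are hypothetical; only the operations (moras divided by word duration, raw VOT multiplied by speech rate, then a log transform) come from the description above. The helper `t_from_r` uses the standard relation between a Pearson correlation and its t statistic, $t = r\sqrt{n-2}/\sqrt{1-r^{2}}$, and assumes that the reported tests are of this kind.

```python
import math


def speech_rate(n_moras: int, word_duration_s: float) -> float:
    """Speech rate in moras per second for the word containing the target."""
    if word_duration_s <= 0:
        raise ValueError("word duration must be positive")
    return n_moras / word_duration_s


def normalised_log_vot(vot_s: float, n_moras: int, word_duration_s: float) -> float:
    """Log of raw VOT (seconds) multiplied by speech rate (moras/second)."""
    return math.log(vot_s * speech_rate(n_moras, word_duration_s))


def t_from_r(r: float, n: int) -> float:
    """t statistic (df = n - 2) implied by a Pearson correlation r over n tokens."""
    return r * math.sqrt(n - 2) / math.sqrt(1.0 - r * r)


# Invented token: 22 ms of VOT in a 4-mora word lasting 0.5 s (8 moras/s).
example = normalised_log_vot(0.022, 4, 0.5)

# An inverse correlation (negative r) yields a negative t, as in the text.
t_voiced = t_from_r(-0.189, 1222)
t_voiceless = t_from_r(-0.123, 3226)
```

Multiplying by speech rate expresses VOT as a fraction of a mora, so a token produced at a faster rate is not counted as less hyperarticulated merely because all of its segments are shorter.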
2.2.2. Factors of interest
As this study examines whether minimal-pair-driven contrastive hyperarticulation in Japanese is cue-specific, the factors of primary interest are the presence or absence of minimal pairs for (1) the voicing contrast and (2) the singleton–geminate contrast. Labels regarding the presence/absence of minimal pairs were coded item-by-item, using a lexicon based on three Japanese dictionaries (Kōjien, Shimmura 2018; Sanseidō Kokugo Jiten, Kembo et al. 2013; and Goo Kokugo Jisyo, Matsumura 2024). For each word token, from the corresponding phonemic representation of the lemma annotated in the CSJ-RDB, we identified a potential minimal-pair competitor that contrasts with the lemma by substituting the distinctive feature (voicing or length) of the stop in question and checked if the competitor was present as a dictionary entry. If the potential competitor was present, the value of the minimal pair existence was coded as true; otherwise it was coded as false. Lexical accent was not taken into account. In this process, if a member of a pair was a personal name, jargon, or an archaic or dialectal form, the pair was not regarded as a minimal pair, as it is not likely to be shared by the majority of Japanese speakers. This process resulted in the distribution of minimal pairs summarised in Tables 1 (for the voicing contrast) and 2 (for the length contrast).
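The coding procedure can be sketched as a lookup against a word list. Everything below is illustrative: the toy lexicon stands in for the dictionary-based lexicon, phonemic forms are simplified to one character per phoneme, and writing a geminate as a doubled consonant is an assumption of this sketch rather than the CSJ-RDB notation. The manual exclusion of names, jargon and archaic or dialectal forms is not modelled.

```python
from typing import Optional

# Voiced/voiceless counterparts of the six target stops.
VOICING = {"p": "b", "t": "d", "k": "g", "b": "p", "d": "t", "g": "k"}

# Toy lexicon built from examples in the text (hypothetical stand-in).
LEXICON = {"kaki", "kagi", "kakki", "kako", "kakko"}


def voicing_competitor(form: str, idx: int) -> str:
    """Flip the voicing of the stop at position idx."""
    return form[:idx] + VOICING[form[idx]] + form[idx + 1:]


def length_competitor(form: str, idx: int) -> Optional[str]:
    """Toggle singleton/geminate for the stop at idx (doubled letter = geminate)."""
    seg = form[idx]
    if idx > 0 and form[idx - 1] == seg:                # second half of a geminate
        return form[:idx - 1] + form[idx:]
    if idx + 1 < len(form) and form[idx + 1] == seg:    # first half of a geminate
        return form[:idx] + form[idx + 1:]
    if idx == 0:                                        # assumed: no initial geminates
        return None
    return form[:idx] + seg + form[idx:]                # singleton -> geminate


def has_minimal_pair(form: str, idx: int, contrast: str, lexicon: set) -> bool:
    """True if the competitor for the given contrast is a lexicon entry."""
    if contrast == "voicing":
        candidate = voicing_competitor(form, idx)
    else:
        candidate = length_competitor(form, idx)
    return candidate is not None and candidate in lexicon
```

With this toy lexicon, the medial stop of /kaki/ has both a voicing competitor (/kagi/) and a length competitor (/kakki/), whereas the medial stop of /kako/ has only a length competitor (/kakko/).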
Table 1 Distribution of minimal pair existence by segment and position for the voicing contrast ($x/y$: for each cell, x is the number of types and y the number of tokens).

Table 2 Distribution of minimal pair existence by segment and position for the singleton–geminate contrast ($x/y$: for each cell, x is the number of types and y the number of tokens).

2.2.3. Control variables
Factors that may have an effect on VOT were also included in the model:
- Place of articulation (labial, coronal, dorsal): VOT differs across places of articulation, as in labial $<$ coronal $<$ dorsal (Lisker & Abramson 1967).
- Position in word (initial/non-initial): Previous studies on VOT hyperarticulation in English focussed on word-initial stops (Wedel et al. 2018), but did not explore stops in non-initial position. However, following previous research on VOT in Japanese, this study explored both word-initial and non-initial positions, taking the effect of position on VOT into account.
- Word frequency: We counted and log-transformed the number of occurrences of each word in the complete CSJ ($N(\textit{word}_{x})$). Word frequency is known to affect word duration: frequent words are shorter than less frequent ones (Zipf 1935; Wright 1979). Therefore, in the present study, word frequency may affect VOT duration.
- Contextual predictability/local phonotactic probability (backward and forward): The average probability of each two-phoneme sequence was calculated by dividing the probability (number of occurrences in the corpus) of the target segment and the preceding/following vowel by the unconditional probability of the target segment (backward: $p(\textit{phoneme}_{x} \mid \textit{phoneme}_{x-1})$; forward: $p(\textit{phoneme}_{x} \mid \textit{phoneme}_{x+1})$). The values were log-transformed. Contextual predictability has been shown to be a predictor of duration in a corpus study of English (Seyfarth 2014).
- Following vowel height (high/non-high): VOT is affected by the height of the following vowel (e.g., Klatt 1975). /i, u/ were coded as high and /a, e, o/ as non-high.
- Following vowel duration: VOT of stops may be affected by the duration of the following vowel, based on the principle of mora-timing compensation (Port et al. 1980; Homma 1981). We can expect that if vowel duration is longer, VOT is shorter and vice versa. The values were normalised by speech rate.
- Word length: the number of moras in a word containing the target segment. The values were log-transformed. Turk & Shattuck-Hufnagel (2000) propose that mean syllable duration may decrease with the number of syllables in a word, a phenomenon that they call ‘polysyllabic shortening.’ If we postulate that the same is true with moras, then the more moras a word contains, the shorter a mora will be. Segments should be affected accordingly at the level of the phonetic cue.
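As an illustration of the contextual predictability measure, the sketch below estimates log-transformed backward and forward probabilities from bigram counts. It reads the notation $p(\textit{phoneme}_{x} \mid \textit{phoneme}_{x\pm1})$ as an ordinary conditional probability, that is, a bigram count divided by the count of the conditioning neighbour; this reading, the function names and the toy word list are assumptions of the sketch, not a description of the authors' script.

```python
import math
from collections import Counter


def count_bigrams(words):
    """Count adjacent phoneme pairs, plus left and right context totals."""
    bigrams, left, right = Counter(), Counter(), Counter()
    for word in words:
        for prev, nxt in zip(word, word[1:]):
            bigrams[(prev, nxt)] += 1
            left[prev] += 1    # prev occurs as a left-hand neighbour
            right[nxt] += 1    # nxt occurs as a right-hand neighbour
    return bigrams, left, right


def backward_logp(target, prev, counts):
    """log p(target | preceding phoneme); -inf if the pair is unattested."""
    bigrams, left, _ = counts
    n = bigrams[(prev, target)]
    return math.log(n / left[prev]) if n else float("-inf")


def forward_logp(target, nxt, counts):
    """log p(target | following phoneme); -inf if the pair is unattested."""
    bigrams, _, right = counts
    n = bigrams[(target, nxt)]
    return math.log(n / right[nxt]) if n else float("-inf")


# Toy word list, one character per phoneme (hypothetical data).
counts = count_bigrams(["kaki", "kagi", "kako"])

# /k/ after /a/: 2 of the 3 bigrams starting with /a/ continue with /k/.
back_k = backward_logp("k", "a", counts)
# /k/ before /i/: 1 of the 2 bigrams ending in /i/ starts with /k/.
fwd_k = forward_logp("k", "i", counts)
```

Word frequency and word length enter the model in the same log-transformed form, so the corresponding predictors would simply be `math.log` of the raw count and of the mora count, respectively.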
2.3. Model building
Following prior work (e.g., Wedel et al. 2018), we fitted separate models for voiced and voiceless stops, since (a) VOT significantly differs depending on voicing (voiced: $\textrm{mean} = 16.15\,\textrm{ms}$, $\textrm{SD} = 7.17$; voiceless: $\textrm{mean} = 21.84\,\textrm{ms}$, $\textrm{SD} = 10.05$; $t(3068.8) = -21.008$, $p < 0.01$); (b) contrastive hyperarticulation is expected to affect VOT in opposite directions: voiced stops become shorter and voiceless stops become longer (if there are minimal-pair competitors); and (c) other control variables may not affect voiced and voiceless stops the same way. Specifically, we fitted linear mixed-effects models to our data using lmer of the lmerTest package (Kuznetsova et al. 2017) in R (R Core Team 2019). The full model had normalised VOT as the dependent variable; the fixed effects were presence/absence of minimal pairs (for voicing and singleton–geminate contrasts), place of articulation, position in word, word frequency, contextual predictability (backward and forward), following vowel height, following vowel duration and word length. Random intercepts for speaker and item (lemma) and by-speaker random slopes (consisting of position, voicing minimal pair, singleton–geminate minimal pair and an interaction term of position and voicing minimal pair) were also included in the model. The model was structured to include the greatest number of theoretically relevant factors, rather than focussing on the most complex random effects structure (see Tang & Bennett 2018, following Baayen et al. 2017; Matuschek et al. 2017). The model fit was assessed by referring to AIC, BIC and log-likelihood.
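The fractional degrees of freedom in $t(3068.8)$ are what Welch's unequal-variance test produces; the text does not name the test, so this reading is an assumption. Under it, with $\bar{x}_i$, $s_i$ and $n_i$ denoting the mean, standard deviation and number of tokens for voiced ($i = 1$) and voiceless ($i = 2$) stops (the token counts are not given in this section), the statistic and its degrees of freedom would be:

$$
t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{s_1^2/n_1 + s_2^2/n_2}}, \qquad
\nu = \frac{\left(s_1^2/n_1 + s_2^2/n_2\right)^2}{\dfrac{\left(s_1^2/n_1\right)^2}{n_1 - 1} + \dfrac{\left(s_2^2/n_2\right)^2}{n_2 - 1}}
$$

With $\bar{x}_1 = 16.15$ and $\bar{x}_2 = 21.84$, the numerator is negative, which matches the sign of the reported statistic.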
3. Results
We now turn to the description of the final models. Results are presented separately for voiced and voiceless stops. Table 3 presents the summary of fixed factors for the voiced-stop model. For the control predictors, place of articulation, position in word, word frequency, contextual predictability (backward and forward), following vowel height, following vowel duration and word length were all retained in the final model, which indicates that they make significant contributions to model fit. Although word frequency did not reach significance, other factors had highly significant positive or negative effects on VOT duration.
Table 3 Fixed effects summary for the voiced-stop model.

The factors of interest regarding the two minimal pair contrasts were retained in the final model. Figure 1 illustrates the distribution of speech rate-normalised VOT for voiced and voiceless stops, by presence/absence of a minimal pair competitor.

Figure 1 Distribution of speech rate-normalised VOT values by presence/absence of a minimal-pair competitor for voiced and voiceless stops. Solid circles represent mean values and vertical lines interquartile ranges.
Figure 1 shows that the mean VOT value for voiced stops is lower when a lexical competitor (voiceless counterpart) exists than when there is no such competitor. This is also confirmed in Table 3. The presence of a minimal-pair competitor in a voicing contrast was shown to significantly predict a shorter VOT ($p < 0.01$) for voiced stops, as indicated by the negative coefficient for this variable. The average difference in raw VOT between voiced and voiceless stops when there is a minimal pair competitor was 8.45 ms, and it was 5.08 ms with no such competitor (see §4.1 for a comparison with English). On the other hand, the presence of a lexical competitor for the singleton–geminate contrast (i.e., a geminate counterpart) was not significantly predictive ($p = 0.38304$). This is consistent with the hypothesis that contrastive hyperarticulation of a given cue is triggered by the existence of a minimal pair distinguished by that cue. Additionally, the shorter VOT in voiced stops caused by the existence of voicing minimal pairs suggests that what is observed in this model is contrastive hyperarticulation, rather than slow/clear-speech hyperarticulation. Next, let us turn our attention to the summary of fixed factors for the voiceless stop model in Table 4.
Table 4 Fixed effects summary for the voiceless stop model.

For the control predictors, place of articulation, position in word, word frequency, contextual predictability (backward and forward), following vowel height, following vowel duration and word length were all retained in the final model, suggesting that these predictors significantly contribute to model fit. Unlike in the voiced model, place of articulation (labial), position in word and backward contextual predictability were not significant, while word frequency was highly significant. Forward contextual predictability was close to significance at the five percent level.
For the factors of interest, both retained in the final model, the presence of a minimal pair competitor in a voicing contrast (voiced counterpart) was found to significantly affect VOT in the expected direction: longer VOT ($p < 0.05$), as illustrated in Figure 1, while the presence of a minimal-pair competitor in the singleton–geminate contrast did not reach significance ($p = 0.16767$). The result in the voiceless stop model is again consistent with the hypothesis that contrastive hyperarticulation is triggered by the existence of cue-specific minimal pairs, but not by non-cue-specific minimal pairs. The direction of the effect of cue-specific lexical competition in the voiceless-stop model was reversed from what was found in the voiced-stop model, in support of the hypothesis that contrastive hyperarticulation is realised in such a way that the phonetic correlate of a distinctive feature moves away from the competitor.
However, care should be taken in interpreting these results, since the distribution of geminates is biased with respect to position and voicing. As Table 2 shows, there is no singleton–geminate contrast word-initially in Japanese; consonantal length is contrastive only word-medially. Additionally, there are very few voiced geminates. Considering the possible effects of these biases, we ran a third model consisting only of the word-medial voiceless stop fraction of the data for triangulation purposes. For consistency, position in word, as a control predictor and as a term in the by-speaker random slope, was taken out of this model. The results are presented in Table 5.
Table 5 Fixed effects summary for the word-medial voiceless stop model.

In the word-medial-voiceless-stop model, the pattern for the control predictors is mostly similar to the voiceless-stop model in Table 4, except for the following vowel duration, which did not reach significance. Most importantly, the pattern of the factors of interest in this model is consistent with the voiceless-stop model in that the presence of a minimal-pair competitor in a voicing contrast (i.e., a voiced counterpart) was found to significantly affect VOT, making it longer ($p < 0.05$), while the presence of a minimal-pair competitor in the singleton–geminate contrast did not reach significance ($p = 0.12024$). Thus, we can safely reject the possibility that the lack of significant contrastive hyperarticulation effect of singleton–geminate minimal pairs is due to an unbalanced distribution of geminates regarding position and voicing.
4. Discussion
We confirmed in the previous section that (a) the presence of a voicing minimal pair competitor induces hyperarticulation of VOT in such a way that voiceless stops become longer, while voiced stops become shorter; and (b) the presence of a singleton–geminate minimal pair competitor does not induce hyperarticulation of VOT. The VOT of stop consonants in Japanese thus provides additional support for the cue-specific nature of contrastive hyperarticulation. This is the first study that attests the existence of minimal pair-driven contrastive hyperarticulation of VOT in Japanese, which is typologically different from better-studied languages in its grammatical structure and rhythmic/prosodic properties. This provides additional evidence that the cue-specificity of contrastive hyperarticulation holds across language types.
As reviewed above, VOT functions as the main acoustic cue for the distinction between voiced and voiceless stops in Japanese (Shimizu 1977 et seq.). For the singleton–geminate contrast, closure duration has been shown to be the primary acoustic correlate in natural speech (Sano 2019). Furthermore, the perceptual distance between singletons and geminates is found to be enhanced by contrastive hyperarticulation (Sano 2018a,b). Building upon prior research, the present study provides further evidence for the existence of contrastive hyperarticulation in VOT in Japanese, but its effect is induced only by the existence of voiced/voiceless minimal-pair competitors, and it is insensitive to lexical competition with singleton–geminate minimal pairs. This is consistent with previous findings that contrastive hyperarticulation occurs only at the level of the phonetic cue; hence, it is cue-specific.
4.1. Degree of contrastive hyperarticulation
The results of this study are consistent with recent seminal work on the effect of contrastive hyperarticulation on VOT in English (Wedel et al. 2018). We found that a greater difference was observed between voiceless and voiced stops’ VOT when a lexical competitor exists in the lexicon than when there is no such competitor. In this study, the average difference in raw VOT between voiced and voiceless stops with a minimal-pair competitor (8.45 ms) is about 66% greater than without such a competitor (5.08 ms).
This difference between VOT of voiced vs. voiceless stops in the presence/absence of a minimal pair is below the threshold of the Weber–Fechner law of the just-noticeable difference – that is, the minimal difference between two stimuli that leads to a change in experience (Treutwein 1995). Under the assumptions of the just-noticeable difference, a 3 ms difference in duration seems indeed too small for listeners to make a conscious distinction between the two competitors (see Lehiste 1976, inter alia; a 10 ms difference is necessary for hearers to distinguish two sounds). However, contrastive hyperarticulation is assumed to be an unconscious mechanism of spontaneous speech (in contrast with the conscious mechanism involved in clear/elicited speech): it is speakers’ implicit knowledge of the existence of a minimal pair in the lexicon that leads them to enhance specific cues in order to better keep lexical items apart. We can thus postulate that the same is true for perception: both speakers and listeners should be able to unconsciously use sub-lexical information at a higher level for a more efficient communication process. Findings from a study on VOT in English by McMurray et al. (2002) suggest that this inference is correct. Their results in a perceptual experiment where VOT varied on a continuum indicate that fine-grained acoustic differences at the level of the phonemic cue that have minimal effects on phoneme identification play, in fact, a crucial role in lexical access. Drawing a parallel with the results of the present study, we propose that a minimal change in VOT, which should not be relevant when examining the two segments in isolation, provides listeners with additional cues for lexical identification.
Let us now turn our attention to the reasons behind the small size of this difference. One possibility is that, since the just-noticeable difference is defined as being proportional to the size of the original unit, the increase should not be considered in terms of raw duration, but as a proportion. In this case, the small size of the difference in VOT between voiced and voiceless stops without a minimal-pair competitor (5.08 ms) might justify the small size of the difference between stops with competitors (8.45 ms), as a 66% increase between the two is observed. Note that contrastive hyperarticulation of the closure duration in the singleton–geminate contrast (Sano 2018a) represents an increase of 59% of the difference between singleton and geminate closure (81.7 ms with a minimal-pair competitor vs. 51.2 ms without).
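The proportional reading above can be made concrete with a short calculation. The sketch below is illustrative only: the helper name `relative_increase` is introduced here, and the inputs are the mean differences quoted in the text.

```python
def relative_increase(baseline: float, enhanced: float) -> float:
    """Return the increase from baseline to enhanced as a fraction of baseline."""
    return (enhanced - baseline) / baseline


# Mean voiced-voiceless VOT difference (ms): without vs. with a minimal-pair competitor.
vot_gain = relative_increase(5.08, 8.45)

# Mean singleton-geminate closure difference (ms) cited from Sano (2018a): without vs. with.
closure_gain = relative_increase(51.2, 81.7)

print(f"VOT difference:     +{8.45 - 5.08:.2f} ms ({vot_gain:.1%})")
print(f"Closure difference: +{81.7 - 51.2:.1f} ms ({closure_gain:.1%})")
```

Expressed this way, the two contrasts show relative enhancements of similar size even though the absolute gains differ by a factor of about nine.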
We can find further food for thought in the nature of VOT in Japanese. As previously mentioned in §1.1.1, recent studies report that VOT in Japanese is undergoing a change and tends to deviate from its original description, especially in younger speakers: devoicing is often observed in initial position, and VOT seems to be shifting from a long lead to a short lag (Riney et al. 2007; Takada et al. 2015; Gao & Arai 2018; Gao et al. 2019). What these studies suggest is that VOT seems to be losing its role as a primary cue for the voicing contrast in Japanese. Thus, one might postulate that cues other than VOT might also be hyperarticulated to enhance the contrast. In this regard, our models seem to indicate that the following vowel plays a role in the contrast, as we found a correlation with the following vowel duration in both the voiced ($p < 0.001$) and the voiceless ($p < 0.05$) models. On the other hand, as indicated in Gao et al. (2019), although the status of VOT as a phonetic cue for voicing appears to be shifting, there is a lag between production and perception, and they find that in terms of perception, listeners still heavily rely on VOT. This suggests that even a small difference in VOT duration should be meaningful.
If we put the present results into perspective with those of Wedel et al. (2018), it is interesting to note that the degree of contrastive hyperarticulation was found to be greater in Japanese than in English. In Wedel et al. (2018), the average difference in raw VOT between voiced and voiceless stops in words that have a voicing minimal-pair competitor (63 ms) is approximately 20% greater than those in words that do not have such a competitor (53 ms), while it was 66% in this study.
One clue to understanding this difference between English and Japanese may be the role of aspiration. Wedel et al. (2018) and most previous studies focussed on word-initial stops, where voiceless segments are aspirated, producing longer VOT durations (Lisker & Abramson 1964). For this reason, the effect of contrastive hyperarticulation on VOT in English may be inhibited; that is, it takes more effort to produce contrastive articulation at longer VOTs. The present study, on the other hand, targeted both word-initial and non-initial stops, because Japanese does not have positional aspiration, and VOT is thus not expected to differ greatly between these two positions. Compared to English, the shorter VOT in Japanese may allow extra room for the effect of contrastive hyperarticulation (i.e., lengthening). Furthermore, for a shorter VOT to be salient enough to convey information about lexical contrast, the duration should be enhanced to a greater extent. In other words, voiced and voiceless stops in Japanese are phonetically more similar, with closer VOT durations; the stops of interest therefore are subject to a stronger effect of contrastive hyperarticulation (even 66% of enhancement results in no more than 3.37 ms of difference). This is not the case for English, where the difference in VOT between voiced and voiceless stops is greater than in Japanese (20% of enhancement results in about 10 ms of difference).
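The contrast between relative and absolute enhancement drawn here can be checked directly from the reported means. As a worked sketch, writing $\Delta_{\text{MP}}$ and $\Delta_{\text{no MP}}$ (labels introduced here only for illustration) for the mean voiced–voiceless VOT difference with and without a minimal-pair competitor:

$$
\begin{aligned}
\text{Japanese:}\quad & \Delta_{\text{MP}} - \Delta_{\text{no MP}} = 8.45 - 5.08 = 3.37\ \text{ms}, & \frac{3.37}{5.08} &\approx 0.66 \\
\text{English:}\quad & \Delta_{\text{MP}} - \Delta_{\text{no MP}} = 63 - 53 = 10\ \text{ms}, & \frac{10}{53} &\approx 0.19
\end{aligned}
$$

The English ratio of about 0.19 corresponds to the figure given in the text as approximately 20.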
4.2. Other issues
4.2.1. Position in word
It has been shown that language users pay more attention to word-initial positions (Bruner & O’Dowd 1958), which contribute more information to lexical identification (van Son & Pols 2003; Wedel et al. 2019). Beckman (1998) classifies initial syllables as ‘psycholinguistically prominent’, and therefore initial positions favour phonological processes that enhance the realisation of lexical contrasts. Other previous research has shown that cross-linguistically, strengthening phonological processes are more likely to target word beginnings, while neutralisation processes prefer word ends (Barnes 2006; Wedel et al. 2019). In terms of prosody as well, domain-initial positions are stronger (Keating 2003 and references cited therein). In our dataset, the difference in raw VOT between voiced and voiceless stops with and without a minimal-pair competitor was slightly greater in initial positions (3.82 ms) than in non-initial positions (3.58 ms), although the predictor, position in word, was significant in the voiced-stop model ($p < 0.01$) but not in the voiceless-stop model. From this, a possibility arises that the degree of contrastive hyperarticulation differs depending on the position of the segment in the word. If the positional difference is confirmed in future work, it can provide additional evidence for the cross-linguistic precedence of (word) initial positions over non-initial positions (e.g., Wedel et al. 2019).
4.2.2. Slow/clear speech vs. casual speech
In §3, our results confirmed that contrastive hyperarticulation tends to be incompatible with slow/clear speech. This supports previous findings in Wedel et al. (2018) that, assuming that the purpose of contrastive hyperarticulation is to avoid perceptually confusable productions near the category boundary, its effect may be less robust in slow/clear speech, where phonemic categories are less likely to overlap. Conversely, the context where contrastive hyperarticulation can be observed more robustly and is more likely to occur would be casual speech, which includes more perceptually confusable productions due, for example, to reduction processes (Wedel et al. 2018).
We examined the potential connection between speech style and the effect of contrastive hyperarticulation by running additional models with style in interaction with the existence of a voicing minimal pair. The distinction in the CSJ-RDB (monologue vs. dialogue) corresponds to the distinction between slow/clear speech and casual speech, as monologues represent slow/clear speech and dialogues represent casual speech (Maekawa 2003). The gap in raw VOT duration due to the existence of a minimal-pair competitor was greater in dialogues than in monologues both in voiced and voiceless stops (voiced: monologues = 0.47 ms, dialogues = 3.05 ms; voiceless: monologues = 2.21 ms, dialogues = 3.11 ms), although the predictor was not significant in the voiceless-stop model, and the voiced-stop model only showed a trend towards significance ($p = 0.09068$). If it is confirmed in future work that the degree of contrastive hyperarticulation is greater in casual speech than in slow/clear speech, it will reinforce our understanding of the distinction between slow/clear-speech hyperarticulation and contrastive hyperarticulation, and suggest that contrastive hyperarticulation is induced only when necessary.
5. Conclusion
This study sheds light on hitherto unexplored aspects of contrastive hyperarticulation. Building upon prior work, this study provided an additional test case that supports cue-specificity. As mentioned above, in Japanese two kinds of minimal pairs are distinguished by duration-based cues coexisting within a single stop consonant: VOT for the voicing contrast and closure duration for the singleton–geminate contrast. By taking advantage of the variety of minimal pairs distinguished based solely on durational considerations, we examined how information about lexical competition is reflected at the level of the phonetic cue. The results showed that what matters in contrastive hyperarticulation of VOT is the existence of voicing minimal pairs, providing further support for the cue-specificity of contrastive hyperarticulation. Our results also offer additional support for MOP’s perspective that language is an effective tool for communication, and that speakers phonetically enhance cues for accurate message transmission: when there is a lexical competitor, VOT is hyperarticulated and the distance is increased, but speakers are less likely to hyperarticulate when little benefit is expected.
Acknowledgements
We would like to thank Andrew Wedel for his valuable feedback and support on earlier versions of this article. We are also deeply grateful to the editors and anonymous reviewers for their constructive suggestions. Any remaining errors are solely our responsibility.
Funding statement
This study is supported by the Japan Society for the Promotion of Science KAKENHI Grant No. 19K00558.
Competing interests
The authors declare no competing interests.