1. Introduction
Turn-taking in conversation is managed in versatile ways, and even more so in multimodal settings. While conversational participants organize their turns easily, measuring turn-taking cues by analyzing recordings has proven complex, leading many researchers to study single phenomena, mostly independently of one another. For example, phonetic, and especially prosodic, features have been studied extensively in terms of their ability to predict points of speaker change. Gestures, such as head nods, have also been studied extensively in the vicinity of places of potential turn transitions.
In attempts to combine spoken with gestural features, the temporal characteristics of gestures, such as the apices of gesture strokes, have been compared with the temporal characteristics of speech, such as the stressed syllable of a lexical affiliate of the corresponding gesture (that is, a spoken lexical item with which a gesture shares lexical content). These studies have often found timing differences suggesting that gestures come slightly before their lexical affiliate (Bergmann et al., 2011; Ferré, 2010; ter Bekke et al., 2020). Such findings on speech-gesture synchrony often address implications such as whether gestures facilitate lexical access, decrease processing time or the like. So far, however, it has not been investigated in depth what consequences the overall activity of gesturing or not gesturing has on the management of turn-taking or how the uptake or non-uptake of a turn depends on the temporal dynamics of that gesturing activity.
In general, there seem to be two possibilities: on one hand, gesture strokes may mark an end-point of an utterance, thus yielding the turn. On the other hand, gesture activity may signal that an utterance is still in progress, hindering an uptake from the interlocutor. In the current study, we build on prior work calling for multimodal analyses of talk, for example, work by Mondada (2019) positioning multimodality in the context of social interaction, embodiment and multisensoriality from a conversation analytic (CA) and therefore qualitative perspective, and extend it by taking a quantitative approach toward annotating real data with traditional features and exploring them in an innovative way. Regarding annotations, we take the well-known gesture phases, that is, preparation, hold, stroke and retraction following Kendon (2004) and Kita et al. (1997), and basic turn transition types, such as keeping the turn, receiving a backchannel or yielding the turn. The simple but innovative analysis approach presented in this study centers on the offset of speech, that is, where a turn (potentially) comes to a syntactic and/or semantic end, and looks into the current speaker’s gestures in the vicinity of this point in time. Our study therefore contributes to the body of research that investigates the resources that interactional participants employ at potential turn transition places for managing their turns.
In our quantitative approach, we only refer to hand gestures and disregard other gesture types such as head gestures or eye gaze. Although our annotations included whether we interpreted the gesture as referential or not, we do not differentiate between these referentiality types in the current work, cf. Loehr (2004), who does not distinguish between referential and non-referential gestures, and Shattuck-Hufnagel and Ren (2018), whose results span across referentiality.
1.1. Gesture analysis
A gesture is a movement of some part of the body that accompanies speech and has communicative value for listeners (Kendon, 1994); gesturing can be distinguished from movement for its own sake, or movement which involves object manipulation (Novack et al., 2015). However, gestures are defined based on inferences about their communicative intent, not external criteria such as form (Bavelas, 1994). It has been shown that naïve observers are able to interpret gestural movements reliably (Goldin-Meadow & Sandhofer, 1999). Thus, any analysis of gesture makes the fundamental assumption that the body movement in question is intended to be communicative.
Early analyses of manual gestures focused on gestures that formed specific shapes, or referred to specific locations, sizes, objects, metaphors or ideas (cf. Kendon, 1980; McNeill, 1992). Since this time, however, a variety of classification systems have arisen, allowing the study of gestures across various parameters. The most popular of these systems were introduced by Kendon (1980) with the categories sign language, pantomime, emblems and gesticulation. McNeill arranged these bodily movements on a continuum with gesticulation being defined as co-speech gestures. To further classify the co-speech gestures, McNeill (2006) introduced four categories roughly related to gesture semantics or function: iconic, metaphoric, deictic and beat gestures. More recent behavioral and cognitive evidence indicates that there is not a clear divide between, e.g., beat gestures and metaphoric gestures (Casasanto, 2008, 2009); rather, gestures may be classified in more than one way simultaneously. McNeill had already observed that a strict classification is not realistic and that a dimensional description would better characterize the ways in which gestures are implemented. Thus, even if a single or specific function is attributed to a gesture, this attribution should not be considered as a unique or exclusive function but rather simply one aspect of the gesture in question.
1.2. Coordination of speech and gesture
A growing body of evidence supports the argument that linguistic research should treat speech and gesture as a unified system (cf., e.g., Kendon, 2004; McNeill, 2005). Wagner et al. (2014) provide an in-depth review of relationships between speech and gesture that have been reported in the literature. Their review raises the question of whether the auditory modality, that is, spoken language, and visual modality, that is, gesture, are used in parallel or as complements.
In speech production, a strong effect appears to arise in the context of prosodic features. Rhythmic or beat gestures have been demonstrated to appear in consistent temporal alignment with prosodic prominences in spoken language in adults as well as children (Ambrazaitis & House, 2017b; Esposito et al., 2007; Esteve-Gibert & Prieto, 2013; Florit-Pons et al., 2020; Knight, 2009; Krahmer & Swerts, 2007; Leonard & Cummins, 2011). Specifically, gesture apices tend to align with stressed syllables (e.g., Loehr, 2004; Rochet-Capellan et al., 2008) or intonation peaks (e.g., Esteve-Gibert & Prieto, 2013; Nobe, 1996; Pouw & Dixon, 2019).
Visual and auditory information have been found to be automatically integrated in the course of speech perception, and the combination of these different input streams influences speech intelligibility (Kelly et al., 2010; McGurk & MacDonald, 1976). Viewing speech-accompanying gesture has been demonstrated to lead to increased activity in the auditory cortex (Hubbard et al., 2009), as well as in brain areas involved with semantic processing (Dick et al., 2009).
Specific constellations of different prosodic and gestural prominence cues may also have different communicative effects than the individual cues alone (Ambrazaitis & House, 2017a, 2017b; Prieto et al., 2015). A manual McGurk effect has even been reported, where gestural beats were used to override intonation cues for differentiating lexical stress, e.g., OBject versus obJECT (Bosker & Peeters, 2021). Guellaï et al. (2014) report that listeners can identify congruencies between even unintelligible speech and gesture and use gesture for disambiguation in cases when information in the speech signal is ambiguous or conflicting. It is thus clear that the temporal placement of speech-accompanying gesture can and does play a crucial role for speech understanding.
1.3. The management of turn-taking in conversation
Conversation tends to proceed with a minimum of problematic (that is, disruptive) overlaps or silent gaps (Sacks et al., 1974), and the amount of silent time between conversational turns appears to have a stable mean of around 200 ms across a variety of typologically different languages, including sign languages (Buanzur et al., 2018; de Vos et al., 2015; Heldner & Edlund, 2010; Stivers et al., 2009). Many linguistic features, phonetic/prosodic and otherwise, play a role in signaling turn transition, including syntactic/semantic completion (e.g., Auer, 1996; de Ruiter et al., 2006; Schaffer, 1983), intonational features (e.g., Bögels & Torreira, 2015; Caspers, 2003; Local et al., 1986; Peters, 2006; Selting, 1996) and phonation quality/spectral characteristics (e.g., Kane et al., 2014; Ogden, 2001).
Studies using larger corpora (e.g., Gravano & Hirschberg, 2009, 2011; Hjalmarsson, 2011; Koiso et al., 1998) tend to find a hierarchy of various features correlated with speaker transition or floor hold, including lexico-syntactic as well as phonetic features; however, syntactic/semantic completion is not always a definitive cue to finality. Different types of conversational actions or turns have different degrees of ‘projectability’, or predictability as to their future direction, such as when a speaker tells a story which requires multiple conversational turns to complete (Auer, 2005).
Like linguistic cues, gestural cues have been shown to be relevant for the management of turn-taking in conversation. Schegloff (1984) points out that it is mostly current speakers who gesture, although gestures may be used by a current hearer to indicate the desire to take the floor, as has been found for a variety of languages (Li, 2014; Mondada & Oloff, 2011; Streeck & Hartge, 1992). Similarly, gestures may be used at turn ends to hold the floor during a pause or to invite a response from an interlocutor (Kendon, 1995; Mondada, 2007; Stivers & Rossano, 2010). Sikveland and Ogden (2012) demonstrate how hand gesturing across a turn end can help achieve the complex function of identifying and resolving a problem of understanding. Some gestural cues appear to parallel roles of prosodic structure; thus, Quek et al. (2002) find that hand gestures are temporally correlated with prosodic phrase boundaries, possibly contributing to the segmentation of speech into phrases. In addition, Chui (2005) and Graziano and Gullberg (2018) report that gesturing is linked with ongoing speech. From a turn-taking perspective, the end of a turn constitutes a break in continuity of speech, which might allow the absence of gesturing to be a turn-yielding cue, too. Similarly, Barkhuysen et al. (2008) report that speakers tend to look away from their interlocutor phrase-medially and to look back at them phrase-finally. These studies serve as evidence that gestures may provide information about the completeness of a spoken turn.
1.4. Aims of the current study
It is clear from the literature discussed above that close temporal relationships exist between spoken language and gesture and that coordinated speech and gesture are relevant for the management of turn-taking in conversation. At the same time, the literature reported above suffers to some degree from a lack of methodological unity, with the results of qualitative and quantitative studies not always brought into harmony with one another. The question of whether and how possible variation in temporal relationships between speech and gesture contributes to the management of turn-taking remains open.
Thus, in the current study, we investigate the extent to which malleability in temporal relationships between speech and gesture is used for conversation management and how the use of such features may differ across languages.
Our specific research question is: how and to what extent does the temporal relationship between speech and gesture contribute to the management of turn-taking in conversation? We operationalize the temporal relationship between speech and gesture as the temporal relationship between different phases of manual gestures produced by the speaker of the turn that is (potentially) coming to an end and the offset of speech at a location where turn transition may become relevant. We hypothesize that, at locations in conversation in which a current speaker reaches a point of possible completion but wishes to hold the floor, extra effort is needed, which will result in different temporal relationships between speech and gesture (cf. Kendrick et al., 2023; Schegloff, 1984).
We further address our research question in the context of two related languages with different prosodic structures, German and Swedish. While both are Germanic languages and thus have some substantial structural similarity, they differ in their intonational structure. German is an intonation language, where pitch movements are used exclusively for pragmatic purposes. In Swedish, however, pitch movements are part of the lexical specification of words, with words carrying one of two lexical pitch accents. The differences in the phonological systems have already been shown to be relevant for prosodic signaling of turn transition intentions (Rossi et al., 2022; Zellers et al., 2019a). Since gestural features are closely linked with prosodic features (cf. Section 1.2), it is thus possible that these prosodic differences could lead to differences in gesture use even in two relatively closely related languages.
2. Method
We adopt a quantitative, corpus-based approach that involves the annotation and analysis of video recordings. The data are spontaneous conversations from pre-existing corpora in German and Swedish that have already been transcribed orthographically. In this section, we give more details on the selected recordings, the annotations we added for the purposes of this study, and an outlook on the statistical analyses we employed.
2.1. Data
The data used in the current study are drawn from two corpora of conversational speech. The Swedish data come from the Spontal corpus (Edlund et al., 2010), a corpus of two-party conversations collected in Stockholm, Sweden. Spontal comprises audio, video and motion-capture data, although only the video and audio data are used in the current study. The German data are taken from FOLK (Forschungs- und Lehrkorpus Gesprochenes Deutsch, Research and Teaching Corpus of Spoken German) (Schmidt, 2014), a collection of speech taken from a variety of natural settings, comprising audio and video data.
An important goal of the current research was to use existing data rather than to collect new data, since so much data are already available. To do this, it was necessary to make a selection of the data that was maximally similar, while taking into account the fundamental differences between these two speech databases. The materials in the Spontal corpus were more constrained in their form: all interactions involved two-party conversations, with participants sitting face to face and with no fixed topic of conversation in the portions of the data used. The topics of the conversations were quite varied although they generally fell into the categories of daily or common activities, such as hobbies, working out at the gym, buying a drill, moving into a new apartment, working as a translator, building a closet and travelling. Two of the speakers (09-22B and 09-35B) participated in more than one conversation, as indicated in Table 1.
Table 1. Metadata for the FOLK and Spontal files

The selection constraints of two-party conversations with participants sitting face to face were also adopted while searching FOLK for appropriate data for a comparison. Three relevant conversations in FOLK were identified. In two of the conversations, two speakers interact in the context of a mock job interview (a third party is present but does not contribute to the conversation once the mock interview has begun; the excerpts we analyzed began after this point). In the third conversation, an expert in birds of prey is interviewed in an informal setting. Although these conversational settings may be more formally structured than in the Spontal data, observation of the data indicates that turn-taking proceeds much as it does in the fully spontaneous conversations in Spontal. Furthermore, we do not anticipate that the differences in topic or formality would have a large impact on the temporal coordination of speech and gesture, since this is likely to rest on cognitive processes rather than on the specific content of a conversation.
For each language, we used a total of 55 minutes of data. In Spontal, the 55 minutes comprise 5 minutes each from 9 conversations, and 10 minutes, in two separate chunks, from a tenth conversation (09-35; see Table 1). In FOLK, the 55 minutes comprise 17–20 minutes each from the three conversations. Due to the constraints of the available data and annotations, it was not possible to use data from an equivalent number of speakers while maintaining a similar amount of data as measured in minutes; we prioritized having a similar quantity of data per language so as to have a comparable number of completion points (see Section 2.2).
2.2. Annotations
Gesture and turn annotations were carried out using ELAN (Max Planck Institute, 2018), cf. Figure 1. Spoken features were annotated in Praat (Boersma & Weenink, 2021).

Figure 1. Screenshot of the annotation environment in ELAN (data from FOLK).
Gesture annotation was carried out using the video signal only (that is, with the audio muted) and proceeded one conversational participant at a time. The first step of the annotation process was to identify gesture phrases, that is, stretches of time when one or both of a participant’s hands moved. In a second step, we segmented the gesture phases (preparation, stroke, hold, retraction, cf. Kendon (2004); Kita et al. (1997)). In the analysis below, where we relate these gesture phases with syntactic/semantic features, we also labeled areas with no hand gesture as none. Strictly speaking, this is not a gesture phase per se, but it is implemented so that measurement points with gesture can also be compared to those without gesture. The boundaries of the gesture phases were refined by moving frame-by-frame through the video in ELAN; if a boundary was ambiguous between two frames, the earlier frame was chosen as the boundary.
In a separate annotation phase in Praat, using the audio data only, we labeled locations where a speaker’s turn was potentially complete and the possibility of speaker change thus became relevant (cf. TRPs, Sacks et al. (1974); SYNCOMPS, Local and Walker (2012); Potential Turn Boundaries (completion points), Zellers (2017)). Since only locations in the conversation that were clearly syntactically or semantically complete in context were included in this classification, we adopt the term completion points for these locations, rather than, for example, TRPs, which are defined by constellations of features, not only syntactic/semantic completion. We excluded locations where the incoming speaker’s turn or backchannel began in overlap with the end of the current speaker’s turn, since these early incomings might represent an ‘incorrect’ prediction about the current speaker’s turn-taking intentions. The completion points were then classified based on the sequential structure of the possible transition as one of the following: holding the floor (with or without a verbal backchannel from the other speaker), releasing the floor (either with or without an explicit question form) or ambiguous cases.
• Floor hold without verbal backchannel (Keep): Following the boundary location, the current speaker takes the next full turn, thus keeping the conversational floor; the interlocutor does not produce any kind of verbalization.
• Floor hold with verbal backchannel (Backchannel): After the completion point, the interlocutor produced a verbal backchannel but no other speech. Evidence from Truong et al. (2011) and Ferré and Renaudier (2017) indicates that verbal backchannels and gestural backchannels are positioned differently in conversation, with gestural backchannels tending to arise in overlap with ongoing speech, while verbal backchannels tend to be placed in silent gaps. Thus, verbal backchannels may also be produced in response to different speaker behavior than gestural backchannels. Furthermore, in our data, it was not always possible to identify whether the speaker in question was able to see a potential visual backchannel produced by a listener. To be as consistent as possible, we thus include only locations with a verbal backchannel in the current study.
• Change: After the completion point, the interlocutor takes the next full turn.
• Question: The current turn ended in a syntactically marked interrogative form (with, e.g., subject–verb inversion or a wh-word), and the next turn was taken up by the interlocutor. The role of the question label was to help distinguish cases with a clear invitation for a next speaker from speaker change cases where the lexical content does not specifically invite a contribution from the next speaker. Due to their rarity, questions are not included in the turn-taking analyses below.
• Ambiguous: This label was used when no clear decision could be made, e.g., when the interlocutor laughed or produced unintelligible vocalisations, or when both speakers overlapped, e.g., talking collaboratively until the end of the turn. Ambiguous turns are also excluded from the turn-taking analyses.
2.3. Feature extraction and quantitative analysis
Using scripts, we extracted completion points and the ongoing gesture phase at the completion point, as well as over stretches of time beginning at 3 seconds before the completion point and ending at 3 seconds after the completion point.
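The extraction step can be illustrated with a minimal sketch. It is not the authors' script (their R code is in the linked repository); the tier contents, the function names `phase_at` and `sample_window`, and the use of integer milliseconds are assumptions made for illustration. It looks up the ongoing gesture phase at a completion point and at regular offsets within 3 seconds on either side of it, treating any time outside an annotated interval as none.

```python
from bisect import bisect_right

# Hypothetical gesture-phase tier for one speaker: (start_ms, end_ms, label),
# sorted by start time and non-overlapping. Gaps between intervals count as "none".
PHASES = [
    (10000, 10400, "preparation"),
    (10400, 11100, "stroke"),
    (11100, 11600, "hold"),
    (11600, 12000, "retraction"),
]


def phase_at(intervals, t_ms):
    """Return the label of the interval containing t_ms, or "none"."""
    starts = [start for start, _, _ in intervals]
    i = bisect_right(starts, t_ms) - 1  # last interval starting at or before t_ms
    if i >= 0:
        start, end, label = intervals[i]
        if start <= t_ms < end:
            return label
    return "none"


def sample_window(intervals, completion_ms, half_width_ms=3000, step_ms=100):
    """Sample the ongoing phase at fixed offsets around a completion point."""
    return [
        (offset, phase_at(intervals, completion_ms + offset))
        for offset in range(-half_width_ms, half_width_ms + 1, step_ms)
    ]


# A hypothetical completion point at 11.0 s falls inside the stroke interval.
window = sample_window(PHASES, completion_ms=11000)
```

With a 100 ms step, each completion point yields 61 samples, one of which (offset 0) is the phase at the completion point itself.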
Each analysis reported below has different requirements for the statistical analysis; thus the individual statistical analyses are reported in the Results. All statistical tests were calculated using R version 4.2.1 (R Core Team, 2022); α = .05. Figures were generated using ggplot2 (Wickham, 2016). The extracted data and R code are available at https://osf.io/efs4c/?view_only=16af14465e314724aa44ba709f051860.
3. Results
3.1. Temporal alignment of gestures with turn ends
Parts of this analysis follow a similar procedure to that used by Zellers et al. (2019b); however, the current dataset is much larger than was used in that study, and some of the annotations were refined since the time of the previous analysis.
3.1.1. German
In the German data, we identified 451 completion points with the transition type Backchannel, Change or Keep. Of these, 223 (49.4%) had an ongoing gesture by the current speaker at the time of the offset of speech.
Figure 2a shows proportionally the gesture phase that was ongoing at the time of the offset of speech according to the type of turn transition in the German data; raw counts are given in Table 2. A χ² test shows that the distribution of the gesture phases differs across types of turn transition (χ²(8) = 67.2, p < .05). Specifically, significant residual values indicate that preparations and holds are more likely in Keeps and less likely in Changes. Changes are more likely to have no gesture (that is, none) and Keeps are less likely to have none. Cramér’s V = 0.546, indicating a large effect size.

Figure 2. Ongoing gesture phase at time of speech offset, German data left, Swedish data right. The y-axis shows the proportion of gesture phases at each transition type, rather than raw counts.
Table 2. Gesture phases according to transition types observed at the time of speech offset in completion points in German and Swedish

3.1.2. Swedish
In the Swedish data, we identified 511 completion points with the transition type Backchannel, Change or Keep. Of these, 125 (24.5%) had an ongoing gesture by the current speaker at the time of the offset of speech.
Figure 2b shows proportionally the gesture phase that was ongoing at the time of the offset of speech according to the type of turn transition in the Swedish data; raw counts are given in Table 2. As in German, a χ² test shows that the distribution of the gesture phases differs across types of turn transition (χ²(8) = 19.86, p < .05). Specifically, significant residual values show that strokes are more likely to arise in Keeps and less likely to arise in Changes. Cramér’s V = 0.282, indicating a medium effect size.
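For readers who want to trace how a test of this kind is assembled, the following sketch computes a χ² statistic, its degrees of freedom and Pearson residuals for a 5 × 3 table of gesture phases by transition types. The counts are invented for illustration and are not those of Table 2. The Cramér's V function uses one common textbook definition; the source does not specify how its effect sizes were computed, so that definition is an assumption here.

```python
from math import sqrt

# Hypothetical counts (NOT the study's Table 2): rows are gesture phases,
# columns are transition types in the order Backchannel, Change, Keep.
PHASE_LABELS = ["preparation", "stroke", "hold", "retraction", "none"]
COUNTS = [
    [6, 1, 10],
    [8, 4, 15],
    [10, 3, 12],
    [9, 8, 6],
    [30, 40, 20],
]


def chi_square(table):
    """Return (chi2, dof, Pearson residuals) for an r x c contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    residuals = []
    for i, row in enumerate(table):
        res_row = []
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            res = (observed - expected) / sqrt(expected)  # Pearson residual
            res_row.append(res)
            chi2 += res * res
        residuals.append(res_row)
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return chi2, dof, residuals


def cramers_v(chi2, table):
    """Cramér's V as sqrt(chi2 / (n * (min(r, c) - 1))); an assumed definition."""
    n = sum(sum(row) for row in table)
    k = min(len(table), len(table[0])) - 1
    return sqrt(chi2 / (n * k))


chi2, dof, residuals = chi_square(COUNTS)
v = cramers_v(chi2, COUNTS)
```

A positive residual marks a cell with more observations than expected under independence, which is the sense in which a phase is "more likely" at a given transition type. With five phases and three transition types, the degrees of freedom are (5 − 1)(3 − 1) = 8, matching the χ²(8) reported above.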
3.1.3. Cross-linguistic comparison
In both languages, it was more frequent in general for turns to end without gesture than with gesture. Specifically, the likelihood that no gesture will be ongoing at the time of speech offset is highest in Changes, followed by Backchannels and Keeps. The pattern of the retraction phase is similar. The inverse can be seen for the preparation and the hold phases, which are more frequent at Keeps and Backchannels than at Changes.
There are almost no preparations at all at Changes. In both German and Swedish, there are more stroke phases in Keeps than in Backchannels or Changes.
In sum, we can say about the distribution of gesture phases at the offset of speech of the current speaker that the gesture phases which move into or take place within the gesture space (that is, preparations, holds and strokes) tend to correlate with the same speaker keeping the floor, while gesture phases which move out of the gesture space (retraction, or none) tend to correlate with a change in speakership.
3.2. Timing of gestural activity around completion points
The analysis in Section 3.1 provides a snapshot of what happens at the offset of speech at a potential turn boundary, that is, at a specific point in time. However, it is clear that the distribution of gesture phases must evolve over time. Thus we are left with the open question of how the distribution develops toward – and also away from – the single point of time where speech stops.
We therefore take the distributions of gesture phases as shown in Figure 2 and treat them as if they were spectral slices. Taking a slice every tenth of a second, starting from 3 seconds before the offset of speech to 3 seconds after, and arranging them horizontally according to gesture phases, we obtain distributions of gesture phases over time, which can also be divided for each transition type separately. The result of this analysis/transformation is shown in Figure 3. We chose 3 seconds following Pöppel (2009), who suggests this duration as the window of cognitive ‘presence’.

Figure 3. Distribution of gesture phases over time according to transition types; German data left, Swedish data right. The zero point is at the speech offset at a potential turn boundary (PTB). On the y-axis we count which gesture phase the current speaker is currently in. All counts across all phases sum up to the number of completion points in the data with the specific transition label. The counts in the slice at time point zero accord with the numbers in Table 2.
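The slicing described above amounts to counting, at each time offset, how many completion points of a given transition type have each gesture phase ongoing. A minimal sketch under stated assumptions follows: the data structure, the function name `slice_counts` and the example values are hypothetical, and only three offsets are shown instead of the 61 slices used in the text.

```python
from collections import Counter


def slice_counts(sampled_points, transition):
    """Count ongoing gesture phases per time offset for one transition type.

    sampled_points: list of (transition_label, {offset_ms: phase_label}).
    Returns {offset_ms: Counter mapping phase label to count}.
    """
    counts = {}
    for label, samples in sampled_points:
        if label != transition:
            continue
        for offset, phase in samples.items():
            counts.setdefault(offset, Counter())[phase] += 1
    return counts


# Three hypothetical completion points, sampled at only three offsets for brevity.
DATA = [
    ("Keep", {-100: "stroke", 0: "hold", 100: "hold"}),
    ("Keep", {-100: "stroke", 0: "none", 100: "preparation"}),
    ("Change", {-100: "retraction", 0: "none", 100: "none"}),
]

keep = slice_counts(DATA, "Keep")
change = slice_counts(DATA, "Change")
```

Because every completion point contributes exactly one phase label per slice, the counts in each slice sum to the number of completion points with that transition label, as the caption of Figure 3 states.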
For each type of gesture phase, we conducted a mixed logistic regression with the frequency of the gesture phase as the criterion variable and with the fixed factors time, language, transition type and the interaction time:transitionType; the speaker was also included as a random factor (cf. function glmer in the R package lme4, Bates et al. (2022)). The results for the model predicting the presence of gesture strokes are given in Figure 4; the model achieved an $R^2_m = 0.091$ without the random factor of speaker and $R^2_c = 0.283$ with the random factor. Expanded model results for all gesture phases are shown in Table 3, while the results of models for gesture phases other than strokes are summarized in Figure 5. We calculated the $R^2$ values using the function r.squaredGLMM (Nakagawa & Schielzeth, 2013) in the R package MuMIn (Bartoń, 2022).
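For orientation, the two reported quantities can be written out in general form. This is a sketch of the definitions proposed by Nakagawa and Schielzeth (2013) for a model with a single random intercept (here, speaker); the symbols are introduced for illustration only, and the source does not state which variance terms r.squaredGLMM used for these particular models.

$$ R^2_m = \frac{\sigma^2_f}{\sigma^2_f + \sigma^2_{\alpha} + \sigma^2_{\varepsilon}}, \qquad R^2_c = \frac{\sigma^2_f + \sigma^2_{\alpha}}{\sigma^2_f + \sigma^2_{\alpha} + \sigma^2_{\varepsilon}} $$

Here $\sigma^2_f$ is the variance of the fixed-effect predictions, $\sigma^2_{\alpha}$ the variance of the random intercept and $\sigma^2_{\varepsilon}$ the residual variance, which for a logit-link binomial model is commonly taken as the distribution-specific value $\pi^2/3$ on the latent scale. Under these definitions, the marginal value $R^2_m$ reflects the fixed factors alone, while the conditional value $R^2_c$ additionally credits between-speaker variation, so $R^2_c \geq R^2_m$ always holds.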

Figure 4. Model estimates for strokes. The x-axis shows the time offset from the completion point. The y-axis shows the estimated probability of strokes on a logarithmic scale. The random factor (speaker) is considered in the plot. We used the R package effects (Fox et al., 2022) in plotting the estimates.
Table 3. Upper part: statistical evaluation for each logistic model: $R^2_m$ = explained variation without random factor (speaker) and $R^2_c$ corrected for the random factor. Lower part: p-values (log probabilities) for each factor and the interaction with time. The reference group (Intercept) is backchannels in German. Bold text shows predictors that achieve statistical significance.


Figure 5. Estimates for the models for gesture phases preparations, holds, retractions, and none.
As the results in Section 3.1 have already shown, for strokes, there was a significant main effect of transition type on the frequency of strokes: Overall, participants produce more gesture strokes around Keeps than around Backchannels and more strokes around Backchannels than around Changes. The analysis here further shows a significant interaction between transition type and time: In Changes, strokes become progressively less frequent as the speech offset approaches and passes, while in Keeps and Backchannels, gesture strokes are equally probable preceding and following the end of the current turn. No significant effects were found for language in this model.
Although this finding is valid, it could be considered trivial, since it is already well-known that manual gestures are mostly performed by current speakers. However, a logistic regression attempts to fit a linear model, while, as the distributions shown in Figure 3 suggest, there might be more details in the evolution of gesture phases over time, which the logistic regression is unable to model. Therefore, to look deeper into the gesture dynamics, we calculated binomial tests for each point in time (that is, every tenth of a second), from which we obtained the probability of stroke activity at each time point as well as 95% confidence intervals for these probabilities. The resulting plot for gesture strokes is shown in Figure 6. Since previously no effect was found for language, both languages are modeled together.
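The per-time-point estimate described above can be illustrated with a minimal sketch. It assumes that the confidence intervals are exact (Clopper-Pearson) binomial intervals, which is what R's binom.test reports; the text does not state which interval was used. The function names and the toy counts are hypothetical and are not taken from the study's data.

```python
from math import comb


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p); adequate for n up to a few hundred."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def clopper_pearson(k, n, alpha=0.05):
    """Exact two-sided (1 - alpha) confidence interval for k successes in n trials."""

    def solve(f, target):
        # f decreases monotonically in p on [0, 1]; bisect for f(p) == target.
        lo, hi = 0.0, 1.0
        for _ in range(60):
            mid = (lo + hi) / 2
            if f(mid) > target:
                lo = mid
            else:
                hi = mid
        return (lo + hi) / 2

    # Lower bound p_L satisfies P(X >= k | p_L) = alpha / 2.
    lower = 0.0 if k == 0 else solve(lambda p: binom_cdf(k - 1, n, p), 1 - alpha / 2)
    # Upper bound p_U satisfies P(X <= k | p_U) = alpha / 2.
    upper = 1.0 if k == n else solve(lambda p: binom_cdf(k, n, p), alpha / 2)
    return lower, upper


def phase_probabilities(counts):
    """Map each time offset to (estimate, lower, upper).

    counts: {time offset in s: (turn ends with the phase present, turn ends in total)}
    """
    result = {}
    for t, (k, n) in sorted(counts.items()):
        lower, upper = clopper_pearson(k, n)
        result[t] = (k / n, lower, upper)
    return result


# Hypothetical counts for one transition type, sampled every tenth of a second.
toy_counts = {-0.2: (12, 80), -0.1: (10, 80), 0.0: (4, 80), 0.1: (7, 80)}
for t, (p_hat, lower, upper) in phase_probabilities(toy_counts).items():
    print(f"{t:+.1f} s: p = {p_hat:.3f}, 95% CI = [{lower:.3f}, {upper:.3f}]")
```

Running the same computation separately for Keeps, Backchannels and Changes would yield one probability curve with a confidence band per transition type, which is the kind of display shown in the figures that follow.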

Figure 6. Probability of strokes over time (zero = offset of speech) according to transition types. Stretches where the confidence intervals do not overlap can be considered as significantly different.
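The reading rule in the caption, that stretches with non-overlapping confidence intervals differ significantly, can be sketched as a small helper. The function name and the interval values are hypothetical and serve only to illustrate the procedure; they are not the study's results.

```python
def non_overlapping_stretches(times, ci_a, ci_b):
    """Return (start, end) spans in which two confidence bands do not overlap.

    times: time offsets in seconds, in ascending order
    ci_a, ci_b: (lower, upper) intervals for two transition types, one per offset
    """
    stretches = []
    start = None
    previous = None
    for t, (lo_a, hi_a), (lo_b, hi_b) in zip(times, ci_a, ci_b):
        separated = hi_a < lo_b or hi_b < lo_a
        if separated and start is None:
            start = t  # a new stretch begins at this offset
        elif not separated and start is not None:
            stretches.append((start, previous))  # stretch ended at the previous offset
            start = None
        previous = t
    if start is not None:
        stretches.append((start, previous))
    return stretches


# Hypothetical 95% intervals for two transition types at five time offsets.
times = [-0.2, -0.1, 0.0, 0.1, 0.2]
keeps = [(0.30, 0.40), (0.28, 0.38), (0.20, 0.30), (0.25, 0.35), (0.30, 0.40)]
changes = [(0.10, 0.20), (0.12, 0.22), (0.18, 0.28), (0.10, 0.18), (0.05, 0.15)]
print(non_overlapping_stretches(times, keeps, changes))
```

Non-overlap of two 95% intervals is a conservative criterion: intervals that overlap slightly can still correspond to a significant difference, so stretches identified in this way are, if anything, an underestimate.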
While the overall logistic regression did not show an effect of time for Keeps and Backchannels, Figure 6 shows that stroke activity already differs between Backchannels (fewer strokes) and Keeps (more strokes) as early as 3 seconds before the offset of speech. This difference, however, disappears from ca. 2.2 seconds before the completion point to ca. 0.2 seconds after it. Within this stretch, stroke activity in both Backchannels and Keeps first increases, reaching a peak at around 1 second preceding the speech offset, then decreases, reaching a valley at the speech offset, and then starts to increase again. From ca. 0.4 seconds to 0.8 seconds following the speech offset, there is higher stroke activity at Keeps than at Backchannels.
This may relate to the preparation phases, shown in Figure 7. Interesting stretches of time here go from −0.5 seconds to +0.5 seconds and from ca. +1.8 to +2.2 seconds, where the probability of gesture preparations at Keeps is higher than for Backchannels. The peak at the completion point could mean that at Keeps, the speaker already prepares upcoming strokes even before the completion point. The preparations at Backchannels seem to come slightly later.

Figure 7. Probability of preparations over time (zero = offset of speech) according to transition types.
The distribution of hold phases over time (Figure 8) shows that Backchannels and Keeps do not differ significantly in terms of the presence of holds. From about 0.7 seconds before the completion point, however, holds are more probable in Keeps than at Changes, and this higher probability of holds is retained throughout the remaining time.

Figure 8. Probability of holds over time (zero = offset of speech) according to transition types.
The distributions of retractions have overlapping confidence intervals throughout the entire time period investigated and are therefore not shown and discussed further.
The distributions of the last condition, none, are shown in Figure 9. The pattern for Changes seems to mirror that of the strokes (cf. Figure 6): the probability of the none phase is higher after the completion point, and the likelihood of there being no gesture begins to increase about 0.5 seconds before the offset of speech.

Figure 9. Probability of none over time (zero = offset of speech) according to transition types.
In terms of the frequency of turn transitions arising without gesture (none), all three transition types are significantly different throughout the 6 seconds surrounding the completion point. The none category is highest at Changes, lowest for Keeps, with Backchannels in the middle.
4. Discussion
With the current study, we took a first step to investigate precise temporal dynamics in the realm of turn-taking and gesture, where they have not yet been investigated quantitatively. We addressed the temporal relationships between manual gestures and the semantic and pragmatic content of conversational speech on two scales: the distribution of gestures in relation to turn-taking, as well as the overall distribution and alignment of hand gestures in the vicinity of potential turn boundaries.
We hypothesized that locations where a current speaker wishes to hold the floor (that is, Keeps and Backchannels) would demonstrate different temporal relationships between speech and gesture than cases in which the current speaker is ready to release the floor (that is, Changes).
We indeed found differences in gesturing behavior between Keeps and Changes, and to some extent between Keeps and Backchannels. The Backchannel locations may consist of a mixture of locations in which the current speaker wishes to keep the turn and locations where the current speaker would have been willing to allow an interlocutor to take up a turn but was ‘refused’ by the use of a backchannel (cf. Taboada, Reference Taboada2006; Yngve, Reference Yngve1970). Thus, the finding that gesture behavior in Keeps and Backchannels is similar but not identical is not unexpected. Gestural activity was more frequent and contained more strokes leading up to Keeps compared to Changes. Thus, the evidence from our study supports the interpretation suggested by Sacks et al. (Reference Sacks, Schegloff and Jefferson1974), that participants must do more active work to keep the floor than to release it.
Kita et al. (Reference Kita, van Gijn, van der Hulst, Wachsmuth and Fröhlich1997) termed preparation, stroke, partial retraction and retraction ‘active phases’ and distinguished them in this way from gesture holds (p. 34). Our results, however, indicate that the pattern of the retraction phase behaves in a way similar to the pattern of no gesture. So if we think in terms of the movement effort of a gesture, we could rather classify preparation, stroke and hold as gesture phases that contribute to gesture activity and classify retraction and none as not contributing to an active gestural movement, that is, gesture passivity. In this sense, our results could mean that a high degree of gesture activity is more likely to lead to a Keep, while a reduction of gesture activity, that is, passivity, is more likely to invite a Backchannel or lead to a Change in speakership. Overall, gesture passivity (demonstrated by a gesture retraction or no gesture) might thus be an indication for interlocutors that at the upcoming turn end, some kind of contribution is expected.
From a multimodal perspective, gestures may be equally informative for participants of face-to-face conversations and complement other turn-taking cues, such as pitch, or even override them. As Truong et al. (Reference Truong, Poppe, de Kok and Heylen2011) have shown, compared to speech activity and mutual gaze, pitch was not a relevant factor in explaining the presence or absence of a backchannel signal (vocal, visual or bimodal). Our own previous work has also suggested that when gestural cues to turn-taking are available, pitch variation may be employed to a lesser extent (Zellers et al., Reference Zellers, Gorisch, House and Peters2019a).
Our investigation also included a cross-linguistic comparison. Findings on one language may not hold for another language, but as we were analyzing basic gestural properties (gesture phases) and not their emblematic use (Kendon, Reference Kendon and Kay1980), the influence of the language may be rather low, indicating a more general pattern of gesture dynamics and turn-taking.
Although we did not find systematic or significant language differences, there appear to be differences in size of effects between German and Swedish (see Figure 3 and results for Cramér’s V in Sections 3.1.1 and 3.1.2). These differences may arise from other aspects of the data. First, our Swedish sample contained overall less gesturing than the German sample. Second, the conversational activities in the corpora differed, with the German conversations being in general more task-oriented than the Swedish ones. Thus, the current results are probably better understood as supporting the argument that gesture implementation in the vicinity of completion points is a universal communicative strategy, rather than one strongly mediated by linguistic structure.
4.1. Temporal features of hand gestures at turn ends
The analyses reported in Sections 3.1 and 3.2 provide evidence that hand gesturing overall is structured in a way that supports the structuring and management of turn-taking in conversation. This complements and expands upon findings by, for example, Kendrick et al. (Reference Kendrick, Holler and Levinson2023), who report that preparations and strokes at TCU ends are associated with floor-holding by the current speaker, as well as by Kendon (Reference Kendon1995) and Mondada (Reference Mondada2007), who report functions of specific gesture shapes in terms of their contribution to the activity of holding or releasing the floor. By looking at all hand gestures, regardless of their form, we find that the stroke phase in general, that is, the obligatory and ‘meaningful’ portion of the gesture, also becomes rarer as a completion point approaches. At the time of speech offset, strokes are extremely rare in both German and Swedish, and this is further modulated by the type of turn transition that is taking place: ongoing strokes at the time of speech offset are more frequent in Keeps, where the same speaker intends to continue speaking, than at Changes, where a new speaker will take over. Changes are also the type of turn transition least likely to have any kind of ongoing gesture at the time of speech offset. Thus, for example, Schegloff’s (Reference Schegloff, Atkinson and Heritage1984) claim that hand gesturing is a current-speaker activity is supported by our quantitative analysis, and we can expand upon this by arguing that hand gesturing is also an activity that can indicate intentions about future speakership.
Expanding our view outward from the single time point of the offset of speech to look at the larger picture approaching and following a completion point, we see differences in gesturing behavior even at a substantial distance from this time point. Even in the few seconds preceding a completion point, gesturing is less frequent preceding Changes than preceding Keeps or Backchannels. In Backchannels and Keeps, holds remain similarly frequent, or may even increase, approaching the speech offset, and dip in frequency shortly afterward. In all types of transitions, the frequency of strokes is also at its peak about 1 second before the completion point. This suggests that stroke activity could be a useful early visual cue to an upcoming boundary, alerting an interlocutor to search for other cues about the current speaker’s turn-taking intentions.
4.2. Limitations
The current study adopts the offset of speech as a reference point, thus taking the perspective of the current speaker. Different results might have arisen if our reference point was the onset of speech following our completion points, taking a more recipient-oriented perspective, as did Truong et al. (Reference Truong, Poppe, de Kok and Heylen2011), who time-locked at the start of (verbal, visual or bimodal) backchannels and investigated the prior interlocutor’s presence or absence of speech and mutual gaze. Our study also restricts gesture annotation to the four gesture phases following Kendon (Reference Kendon2004). An addition to the stroke phase could be the annotation of the gesture apex, that is, the place of maximum effort (peak velocity, peak acceleration, or peak deceleration), cf. Pouw and Dixon (Reference Pouw, Dixon and Grimminger2019). A focus on the most prominent part of a gesture stroke might result in different (potentially sharper) distribution contours.
Another limitation of our study is our focus on the recipient’s verbal behavior while not taking into account the visual resources of feedback such as head nods, facial expressions, gaze and so forth. Including such annotations in future studies would better reflect the multimodal richness of face-to-face interaction. Similarly, it is beyond the scope of the current study to account for additional signaling on the part of the first speaker regarding his or her turn-taking intentions, either by lexical means or by variation in, e.g., pitch or duration; annotations of the turn-final pitch contours exist, and their relationship to gestural behavior will be explored in future research.
We were also limited by the available data. While many corpora of conversational speech exist, it is challenging to identify corpora that are sufficiently similar in terms of their structure. The interactional settings in particular are different for most corpora. In addition, our study used data from a different number of speakers in each language, meaning that for the $ \chi^2 $ tests, which could not incorporate random factors, the differing amount of individual variability in the two languages could have influenced the statistical results. Once comparable data are available across languages, e.g., the parallel corpus of the PECII project (Kornfeld et al., Reference Kornfeld, Küttner, Zinken, Deppermann, Fandrych, Kupietz and Schmidt2023) with constant interactional settings, our study could be repeated, also taking the other shortcomings into account.
5. Conclusions
Annotating and carrying out quantitative analyses of conversational data could be interpreted as ignoring or overgeneralizing important complexity arising in conversational interaction; however, our larger-scale quantitative analysis has identified larger-scale temporal patterns arising across languages and across conversational settings. While interacting participants can still make sense of very context-specific cue organizations, we find evidence supporting the hypothesis that conversational participants systematically vary their gestural behavior in the approach to and at turn boundaries and that the temporal placement of different gesture phases (strokes versus holds versus other phases) shows a tendency to pattern similarly depending on the sequential structure of the conversational turn. We found differences only in the degree to which these patterns arose between German and Swedish, suggesting that these temporal patterns are either universal or that linguistic or cultural differences must be much larger to identify differences in timing behavior.
Future research will bring another prosodic parameter, the pitch contour at turn ends, into the equation. It will also expand the scope of the investigation beyond Germanic languages and beyond European cultures. These parameters will help us to refine our assessment of the universality of gesturing behavior at turn boundaries as well as its interaction with the linguistic system.
Data availability statement
Legal restrictions prohibit sharing of the raw video and audio dataset due to GDPR privacy restrictions and demands for anonymity of participants. The extracted data and R code for these analyses are publicly available at https://osf.io/efs4c/?view_only=16af14465e314724aa44ba709f051860.
Acknowledgments
We are grateful to Simon Alexanderson, Jonas Beskow and Jens Edlund for assistance with Spontal and to Caroline Kleen for supplementary annotation work. We also want to thank Sandra Hansen and Sascha Wolfer for valuable feedback on the statistical analyses and the anonymous reviewers for their comments and suggestions for improvement on previous versions of this paper. We gratefully acknowledge support from CLARIN SPEECH, the CLARIN Knowledge Centre for Speech Analysis.
Funding statement
This work was supported by the German Research Foundation (DFG; Aufbau Internationaler Kooperationen GO 3063/1–1, PE 2879/1–1, ZE 1178/1–1), the Swedish Research Council (VR-2017-02140) and the Riksbankens Jubileumsfond (P12–0634:1).
Competing interests
The authors declare none.