Temporal relationships between speech and hand gestures in the vicinity of potential turn boundaries in German and Swedish conversation

Margaret Zellers; Jan Gorisch; David House

doi:10.1017/langcog.2025.10014

Published online by Cambridge University Press: 21 July 2025


Margaret Zellers*: Institut für Skandinavistik, Frisistik, & Allgemeine Sprachwissenschaft, Kiel University, Kiel, Germany
Jan Gorisch: Department of Pragmatics, Leibniz-Institute for the German Language, Mannheim, Germany
David House: Division of Speech, Music and Hearing, KTH Royal Institute of Technology, Stockholm, Sweden
*Corresponding author: Margaret Zellers; Email: mzellers@isfas.uni-kiel.de


Abstract

Both gesture and talk are basic building blocks of face-to-face conversation. In this study, we address the temporal dynamics of hand gesture phases relative to places and types of turn transition. We annotated gesture features and measured temporal aspects of gesture related to speech in two languages, German and Swedish. We found variation in the temporal relationships of gesture types and alignment of gesture phases that relate to the management of turn-taking in conversation. Specifically, the frequency of different gesture phases accompanying the offset of speech differed depending on whether the same speaker held the floor or whether a new speaker took up a turn. In addition, we found that differences in temporal alignment of gesture phases can distinguish between the type of turn transition that is upcoming up to a second before the place of transition is reached. Our results emphasize the importance of the interaction of the verbal and the gestural modality to maintain the smooth flow of conversation.

Keywords

conversation; co-speech gesture; German; gesture phases; hand gestures; potential turn boundary; Swedish; temporal gesture alignment; turn transitions

Information

Type: Article
Information: Language and Cognition, Volume 17, 2025, e57

DOI: https://doi.org/10.1017/langcog.2025.10014
Creative Commons: This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (http://creativecommons.org/licenses/by/4.0), which permits unrestricted re-use, distribution and reproduction, provided the original article is properly cited.
Copyright: © The Author(s), 2025. Published by Cambridge University Press

1. Introduction

Turn-taking in conversation is managed in versatile ways, and even more so in multimodal settings. While conversational participants organize their turns easily, measuring turn-taking cues by analyzing recordings has proven complex, leading many researchers to study single phenomena and mostly independently of other phenomena. For example, phonetic, and especially prosodic, features have been studied extensively in terms of their ability to predict points of speaker change. Gestures, such as head nods, have also been studied extensively in the vicinity of places of potential turn transitions.

In attempts to combine spoken with gestural features, the temporal characteristics of gestures, such as the apices of gesture strokes, have been compared with the temporal characteristics of speech, such as the stressed syllable of a lexical affiliate of the corresponding gesture (that is, a spoken lexical item with which a gesture shares lexical content). These studies have often found timing differences suggesting that gestures come slightly before their lexical affiliate (Bergmann et al., 2011; Ferré, 2010; ter Bekke et al., 2020). Such findings on speech-gesture synchrony often address implications, such as whether gestures facilitate lexical access or decrease processing time. So far, however, it has not been investigated in depth what consequences the overall activity of gesturing or not gesturing has on the management of turn-taking or how the uptake or non-uptake of a turn depends on the temporal dynamics of that gesturing activity.

In general, there seem to be two possibilities: on the one hand, gesture strokes may mark an end-point of an utterance, thus yielding the turn. On the other hand, gesture activity may signal that an utterance is still in progress, hindering an uptake from the interlocutor. In the current study, we build on prior work calling for multimodal analyses of talk – for example, work by Mondada (2019) positioning multimodality in the context of social interaction, embodiment and multisensoriality from a conversation analytic (CA) and therefore qualitative perspective – and extend it by taking a quantitative approach toward annotating real data with traditional features and exploring them in an innovative way. Regarding annotations, we take the well-known gesture phases, that is, preparation, hold, stroke and retraction following Kendon (2004) and Kita et al. (1997), and basic turn transition types, such as keeping the turn, receiving a backchannel or yielding the turn. The simple but innovative analysis approach presented in this study takes a perspective focused around the offset of speech, that is, where a turn (potentially) comes to a syntactic and/or semantic end, and looks into the current speaker’s gestures in the vicinity of this point in time. Our study therefore contributes to the body of research that investigates the resources that interactional participants employ at potential turn transition places for managing their turns.

In our quantitative approach, we only refer to hand gestures and disregard other gesture types such as head gestures or eye gaze. Although our annotations included whether we interpreted the gesture as referential or not, we do not differentiate between these referentiality types in the current work, cf. Loehr (2004), who does not distinguish between referential and non-referential gestures, and Shattuck-Hufnagel and Ren (2018), whose results span across referentiality.

1.1. Gesture analysis

A gesture is a movement of some part of the body that accompanies speech and has communicative value for listeners (Kendon, 1994); gesturing can be distinguished from movement for its own sake, or movement which involves object manipulation (Novack et al., 2015). However, gestures are defined based on inferences about their communicative intent, not external criteria such as form (Bavelas, 1994). It has been shown that naïve observers are able to interpret gestural movements reliably (Goldin-Meadow & Sandhofer, 1999). Thus, any analysis of gesture makes the fundamental assumption that the body movement in question is intended to be communicative.

Early analyses of manual gestures focused on gestures that formed specific shapes, or referred to specific locations, sizes, objects, metaphors or ideas (cf. Kendon, 1980; McNeill, 1992). Since this time, however, a variety of classification systems have arisen, allowing the study of gestures across various parameters. The most popular of these systems were introduced by Kendon (1980) with the categories sign language, pantomime, emblems and gesticulation. McNeill arranged these bodily movements on a continuum with gesticulation being defined as co-speech gestures. To further classify the co-speech gestures, McNeill (2006) introduced four categories roughly related to gesture semantics or function: iconic, metaphoric, deictic and beat gestures. More recent behavioral and cognitive evidence indicates that there is not a clear divide between, e.g., beat gestures and metaphoric gestures (Casasanto, 2008, 2009); rather, gestures may be classified in more than one way simultaneously. McNeill had already observed that a strict classification is not realistic and that a dimensional description would better characterize the ways in which gestures are implemented. Thus, even if a single or specific function is attributed to a gesture, this attribution should not be considered as a unique or exclusive function but rather simply one aspect of the gesture in question.

1.2. Coordination of speech and gesture

A growing body of evidence supports the argument that linguistic research should treat speech and gesture as a unified system (cf. e.g., Kendon, 2004; McNeill, 2005). Wagner et al. (2014) provide an in-depth review of relationships between speech and gesture that have been reported in the literature. Their review raises the question of whether the auditory modality, that is, spoken language, and visual modality, that is, gesture, are used in parallel or as complements.

In speech production, a strong effect appears to arise in the context of prosodic features. Rhythmic or beat gestures have been demonstrated to appear in consistent temporal alignment with prosodic prominences in spoken language in adults as well as children (Ambrazaitis & House, 2017b; Esposito et al., 2007; Esteve-Gibert & Prieto, 2013; Florit-Pons et al., 2020; Knight, 2009; Krahmer & Swerts, 2007; Leonard & Cummins, 2011). Specifically, gesture apices tend to align with stressed syllables (e.g., Loehr, 2004; Rochet-Capellan et al., 2008) or intonation peaks (e.g., Esteve-Gibert & Prieto, 2013; Nobe, 1996; Pouw & Dixon, 2019).

Visual and auditory information have been found to be automatically integrated in the course of speech perception, and the combination of these different input streams influences speech intelligibility (Kelly et al., 2010; McGurk & MacDonald, 1976). Viewing speech-accompanying gesture has been demonstrated to lead to increased activity in the auditory cortex (Hubbard et al., 2009), as well as in brain areas involved with semantic processing (Dick et al., 2009).

Specific constellations of different prosodic and gestural prominence cues may also have different communicative effects than the individual cues alone (Ambrazaitis & House, 2017a, 2017b; Prieto et al., 2015). A manual McGurk effect has even been reported, where gestural beats were used to overwrite intonation cues for differentiating lexical stress, e.g., OBject versus obJECT (Bosker & Peeters, 2021). Guellaï et al. (2014) report that listeners can identify congruencies between even unintelligible speech and gesture and use gesture for disambiguation in cases when information in the speech signal is ambiguous or conflicting. It is thus clear that the temporal placement of speech-accompanying gesture can and does play a crucial role for speech understanding.

1.3. The management of turn-taking in conversation

Conversation tends to proceed with a minimum of problematic (that is, disruptive) overlaps or silent gaps (Sacks et al., 1974), and the amount of silent time between conversational turns appears to have a stable mean of around 200 ms across a variety of typologically different languages, including sign languages (Buanzur et al., 2018; de Vos et al., 2015; Heldner & Edlund, 2010; Stivers et al., 2009). Many linguistic features, phonetic/prosodic and otherwise, play a role in signaling turn transition, including syntactic/semantic completion (e.g., Auer, 1996; de Ruiter et al., 2006; Schaffer, 1983), intonational features (e.g., Bögels & Torreira, 2015; Caspers, 2003; Local et al., 1986; Peters, 2006; Selting, 1996) and phonation quality/spectral characteristics (e.g., Kane et al., 2014; Ogden, 2001).

Studies using larger corpora (e.g., Gravano & Hirschberg, 2009, 2011; Hjalmarsson, 2011; Koiso et al., 1998) tend to find a hierarchy of various features correlated with speaker transition or floor hold, including lexico-syntactic as well as phonetic features; however, syntactic/semantic completion is not always a definitive cue to finality. Different types of conversational actions or turns have different degrees of ‘projectability’, or predictability as to their future direction, such as when a speaker tells a story which requires multiple conversational turns to complete (Auer, 2005).

Like linguistic cues, gestural cues have been shown to be relevant for the management of turn-taking in conversation. Schegloff (1984) points out that it is mostly current speakers who gesture, although gestures may be used by a current hearer to indicate the desire to take the floor, as has been found for a variety of languages (Li, 2014; Mondada & Oloff, 2011; Streeck & Hartge, 1992). Similarly, gestures may be used at turn ends to hold the floor during a pause or to invite a response from an interlocutor (Kendon, 1995; Mondada, 2007; Stivers & Rossano, 2010). Sikveland and Ogden (2012) demonstrate how hand gesturing across a turn end can help achieve the complex function of identifying and resolving a problem of understanding. Some gestural cues appear to parallel roles of prosodic structure; thus, Quek et al. (2002) find that hand gestures are temporally correlated with prosodic phrase boundaries, possibly contributing to the segmentation of speech into phrases. In addition, Chui (2005) and Graziano and Gullberg (2018) report that gesturing is linked with ongoing speech. From a turn-taking perspective, the end of a turn constitutes a break in continuity of speech, which might allow the absence of gesturing to be a turn-yielding cue, too. Similarly, Barkhuysen et al. (2008) report that speakers tend to look away from their interlocutor phrase-medially and to look back at them phrase-finally. These studies serve as evidence that gestures may provide information about the completeness of a spoken turn.

1.4. Aims of the current study

It is clear from the literature discussed above that close temporal relationships exist between spoken language and gesture and that coordinated speech and gesture are relevant for the management of turn-taking in conversation. At the same time, the literature reported above suffers to some degree from a lack of methodological unity, with the results of qualitative and quantitative studies not always brought into harmony with one another. The question of whether and how possible variation in temporal relationships between speech and gesture contributes to the management of turn-taking remains open.

Thus, in the current study, we investigate the extent to which malleability in temporal relationships between speech and gesture is used for conversation management and how the use of such features may differ across languages.

Our specific research question is: how and to what extent does the temporal relationship between speech and gesture contribute to the management of turn-taking in conversation? We operationalize the temporal relationship between speech and gesture as the temporal relationship between different phases of manual gestures produced by the speaker of the turn that is (potentially) coming to an end and the offset of speech at a location where turn transition may become relevant. We hypothesize that, at locations in conversation in which a current speaker reaches a point of possible completion but wishes to hold the floor, extra effort is needed, which will result in different temporal relationships between speech and gesture (cf. Kendrick et al., 2023; Schegloff, 1984).

We further address our research question in the context of two related languages with different prosodic structures, German and Swedish. While both are Germanic languages and thus have some substantial structural similarity, they differ in their intonational structure. German is an intonation language, where pitch movements are used exclusively for pragmatic purposes. In Swedish, however, pitch movements are part of the lexical specification of words, with words carrying one of two lexical pitch accents. The differences in the phonological systems have already been shown to be relevant for prosodic signaling of turn transition intentions (Rossi et al., 2022; Zellers et al., 2019a). Since gestural features are closely linked with prosodic features (cf. Section 1.2), it is thus possible that these prosodic differences could lead to differences in gesture use even in two relatively closely related languages.

2. Method

We adopt a quantitative, corpus-based approach that involves the annotation and analysis of video recordings. The data are spontaneous conversations from pre-existing corpora in German and Swedish that have already been transcribed orthographically. In this section, we give more details on the selected recordings, the annotations we added for the purposes of this study, and an outlook on the statistical analyses we employed.

2.1. Data

The data used in the current study are drawn from two corpora of conversational speech. The Swedish data come from the Spontal corpus (Edlund et al., 2010), a corpus of two-party conversations collected in Stockholm, Sweden. Spontal comprises audio, video and motion-capture data, although only the video and audio data are used in the current study. The German data are taken from FOLK (Forschungs- und Lehrkorpus Gesprochenes Deutsch, Research and Teaching Corpus of Spoken German) (Schmidt, 2014), a collection of speech taken from a variety of natural settings, comprising audio and video data.

An important goal of the current research was to use existing data rather than to collect new data, since so much data are already available. To do this, it was necessary to make a selection of the data that was maximally similar, while taking into account the fundamental differences between these two speech databases. The materials in the Spontal corpus were more constrained in their form: all interactions involved two-party conversations, with participants sitting face to face and with no fixed topic of conversation in the portions of the data used. The topics of the conversations were quite varied although they generally fell into the categories of daily or common activities, such as hobbies, working out at the gym, buying a drill, moving into a new apartment, working as a translator, building a closet and travelling. Two of the speakers (09-22B and 09-35B) participated in more than one conversation, as indicated in Table 1.

Table 1. Metadata for the FOLK and Spontal files

The selection constraints of two-party conversations with participants sitting face to face were also adopted while searching FOLK for appropriate data for a comparison. Three relevant conversations in FOLK were identified. In two of the conversations, two speakers interact in the context of a mock job interview (a third party is present but does not contribute to the conversation once the mock interview has begun; the excerpts we analyzed began after this point). In the third conversation, an expert in birds of prey is interviewed in an informal setting. Although these conversational settings may be more formally structured than in the Spontal data, observation of the data indicates that turn-taking proceeds much as it does in the fully spontaneous conversations in Spontal. Furthermore, we do not anticipate that the differences in topic or formality would have a large impact on the temporal coordination of speech and gesture, since this is likely to rest on cognitive processes rather than on the specific content of a conversation.

For each language, we used a total of 55 minutes of data. In Spontal, the 55 minutes comprise 5 minutes each from 9 conversations, and 10 minutes, in two separate chunks, from a tenth conversation (09-35; see Table 1). In FOLK, the 55 minutes comprise 17–20 minutes each from the three conversations. Due to the constraints of the available data and annotations, it was not possible to use data from an equivalent number of speakers while maintaining a similar amount of data as measured in minutes; we prioritized having a similar quantity of data per language so as to have a comparable number of completion points (see Section 2.2).

2.2. Annotations

Gesture and turn annotations were carried out using ELAN (Max Planck Institute, 2018), cf. Figure 1. Spoken features were annotated in Praat (Boersma & Weenink, 2021).

Figure 1. Screenshot of the annotation environment in ELAN (data from FOLK).

Gesture annotation was carried out using the video signal only (that is, with the audio muted) and proceeded one conversational participant at a time. The first step of the annotation process was to identify gesture phrases, that is, stretches of time when one or both of a participant’s hands moved. In a second step, we segmented the gesture phases (preparation, stroke, hold, retraction, cf. Kendon (2004); Kita et al. (1997)). In the analysis below, where we relate these gesture phases with syntactic/semantic features, we also labeled areas with no hand gesture as none. Strictly speaking, this is not a gesture phase per se, but it is implemented so that measurement points with gesture can also be compared to those without gesture. The boundaries of the gesture phases were refined by moving frame-by-frame through the video in ELAN; if a boundary was ambiguous between two frames, the earlier frame was chosen as the boundary.

In a separate annotation phase in Praat, using the audio data only, we labeled locations where a speaker’s turn was potentially complete and the possibility of speaker change thus became relevant (cf. TRPs, Sacks et al. (1974); SYNCOMPS, Local and Walker (2012); Potential Turn Boundaries (completion points), Zellers (2017)). Since only locations in the conversation that were clearly syntactically or semantically complete in context were included in this classification, we adopt the term completion points for these locations, rather than, for example, TRPs, which are defined by constellations of features, not only syntactic/semantic completion. We excluded locations where the incoming speaker’s turn or backchannel began in overlap with the end of the current speaker’s turn, since these early incomings might represent an ‘incorrect’ prediction about the current speaker’s turn-taking intentions. The completion points were then classified based on the sequential structure of the possible transition as one of the following: holding the floor (with or without a verbal backchannel from the other speaker), releasing the floor (either with or without an explicit question form) or ambiguous cases.

• Floor hold without verbal backchannel (Keep): After the completion point, the current speaker takes the next full turn, thus keeping the conversational floor; the interlocutor does not produce any kind of verbalization.
• Floor hold with verbal backchannel (Backchannel): After the completion point, the interlocutor produces a verbal backchannel but no other speech. Evidence from Truong et al. (2011) and Ferré and Renaudier (2017) indicates that verbal backchannels and gestural backchannels are positioned differently in conversation, with gestural backchannels tending to arise in overlap with ongoing speech, while verbal backchannels tend to be placed in silent gaps. Thus, verbal backchannels may also be produced in response to different speaker behavior than gestural backchannels. Furthermore, in our data, it was not always possible to identify whether the speaker in question was able to see a potential visual backchannel produced by a listener. To be as consistent as possible, we thus include only locations with a verbal backchannel in the current study.
• Change: After the completion point, the interlocutor takes the next full turn.
• Question: The current turn ends in a syntactically marked interrogative form (with, e.g., subject–verb inversion or a wh-word), and the next turn is taken up by the interlocutor. The role of the question label was to help distinguish cases with a clear invitation for a next speaker from speaker change cases where the lexical content does not specifically invite a contribution from the next speaker. Due to their rarity, questions are not included in the turn-taking analyses below.
• Ambiguous: This label was used when no clear decision could be made, e.g., when the interlocutor laughed or produced unintelligible vocalisations, or when both speakers overlapped, e.g., talking collaboratively until the end of the turn. Ambiguous turns are also excluded from the turn-taking analyses.
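
The decision logic behind these labels can be sketched in code. The following Python fragment is only an illustration of the category definitions above; the classification in the study was carried out by manual annotation, and the record type, field names and function are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CompletionPoint:
    # All field names are hypothetical; they do not correspond to the
    # authors' annotation tiers.
    next_full_turn_by: Optional[str]  # "current", "other", or None if unclear
    verbal_backchannel: bool          # interlocutor produces a verbal backchannel
    interrogative_form: bool          # turn ends in a syntactically marked question
    overlapping_incoming: bool        # incoming speech overlaps the turn end


def classify(cp: CompletionPoint) -> Optional[str]:
    """Return the transition label, or None for excluded locations."""
    if cp.overlapping_incoming:
        return None  # early incomings are excluded from the data
    if cp.next_full_turn_by is None:
        return "Ambiguous"
    if cp.next_full_turn_by == "other":
        return "Question" if cp.interrogative_form else "Change"
    # The current speaker keeps the floor, with or without a backchannel.
    return "Backchannel" if cp.verbal_backchannel else "Keep"


# Only these three labels enter the turn-taking analyses.
ANALYSED = {"Keep", "Backchannel", "Change"}
```

In this sketch, Question and Ambiguous points receive a label but would be filtered out before analysis by checking membership in ANALYSED.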

2.3. Feature extraction and quantitative analysis

Using scripts, we extracted completion points and the ongoing gesture phase at the completion point, as well as over stretches of time beginning at 3 seconds before the completion point and ending at 3 seconds after the completion point.
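
As a rough illustration of this extraction step, the Python sketch below looks up the gesture phase that is ongoing at a given time. It assumes, hypothetically, that the gesture annotations have been exported as (start, end, label) intervals in seconds; it is not the authors' script, and the function names and the half-open interval convention are assumptions.

```python
def phase_at(intervals, t):
    """Return the label of the gesture phase ongoing at time t (seconds).

    intervals: iterable of (start, end, label) tuples, assumed not to overlap.
    Intervals are treated as half-open [start, end); times that fall outside
    every interval are labelled "none", as in the annotation scheme.
    """
    for start, end, label in intervals:
        if start <= t < end:
            return label
    return "none"


def window_bounds(completion_time, window=3.0):
    """Start and end of the stretch examined around a completion point."""
    return completion_time - window, completion_time + window


# Hypothetical annotations for one speaker.
gestures = [
    (10.0, 10.4, "preparation"),
    (10.4, 11.0, "stroke"),
    (11.0, 11.6, "retraction"),
]
phase_at_offset = phase_at(gestures, 10.5)   # falls inside the stroke interval
phase_after = phase_at(gestures, 12.0)       # no gesture annotated here
```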

Each analysis reported below has different requirements for the statistical analysis; thus the individual statistical analyses are reported in the Results. All statistical tests were calculated using R version 4.2.1 (R Core Team, 2022); α = .05. Figures were generated using ggplot2 (Wickham, 2016). The extracted data and R code are available at https://osf.io/efs4c/?view_only=16af14465e314724aa44ba709f051860.

3. Results

3.1. Temporal alignment of gestures with turn ends

Parts of this analysis follow a similar procedure to that used by Zellers et al. (2019b); however, the current dataset is much larger than was used in that study, and some of the annotations were refined since the time of the previous analysis.

3.1.1. German

In the German data, we identified 451 completion points with the transition type Backchannel, Change or Keep. Of these, 223 (49.4%) had an ongoing gesture by the current speaker at the time of the offset of speech.

Figure 2a shows proportionally the gesture phase that was ongoing at the time of the offset of speech according to the type of turn transition in the German data; raw counts are given in Table 2. A χ² test shows that the distribution of the gesture phases differs across types of turn transition (χ²(8) = 67.2, p < .05). Specifically, significant residual values indicate that preparations and holds are more likely in Keeps and less likely in Changes. Changes are more likely to have no gesture (that is, none) and Keeps are less likely to have none. Cramér’s V = 0.546, indicating a large effect size.
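
For readers less familiar with this kind of test, the following Python sketch shows how a χ² statistic, its degrees of freedom and per-cell residuals can be computed from a table of transition types by gesture phases. The counts are invented for illustration and are not the values in Table 2; the analysis in the study was run in R, and the source does not state which kind of residual was used, so the adjusted standardized residuals here are an assumption.

```python
import math


def chi_square(table):
    """Pearson chi-square for a contingency table given as a list of rows.

    Returns (statistic, degrees_of_freedom, residuals), where residuals holds
    adjusted standardized residuals; absolute values above roughly 1.96 are
    conventionally read as cells that deviate from independence.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    statistic = 0.0
    residuals = []
    for i, row in enumerate(table):
        res_row = []
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            statistic += (observed - expected) ** 2 / expected
            denom = math.sqrt(
                expected * (1 - row_totals[i] / n) * (1 - col_totals[j] / n)
            )
            res_row.append((observed - expected) / denom)
        residuals.append(res_row)
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return statistic, df, residuals


# Hypothetical counts. Rows: Keep, Backchannel, Change.
# Columns: preparation, stroke, hold, retraction, none.
counts = [
    [10, 20, 15, 5, 50],
    [5, 10, 10, 10, 65],
    [1, 5, 5, 15, 74],
]
stat, df, res = chi_square(counts)
```

With three transition types and five phase labels, the degrees of freedom come out as (3 − 1) × (5 − 1) = 8, which matches the df reported for the tests in this section.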

Figure 2. Ongoing gesture phase at time of speech offset, German data left, Swedish data right. The y-axis shows the proportion of gesture phases at each transition type, rather than raw counts.

Table 2. Gesture phases according to transition types observed at the time of speech offset in completion points in German and Swedish

3.1.2. Swedish

In the Swedish data, we identified 511 completion points with the transition type Backchannel, Change or Keep. Of these, 125 (24.5%) had an ongoing gesture by the current speaker at the time of the offset of speech.

Figure 2b shows proportionally the gesture phase that was ongoing at the time of the offset of speech according to the type of turn transition in the Swedish data; raw counts are given in Table 2. As in German, a χ² test shows that the distribution of the gesture phases differs across types of turn transition (χ²(8) = 19.86, p < .05). Specifically, significant residual values show that strokes are more likely to arise in Keeps and less likely to arise in Changes. Cramér’s V = 0.282, indicating a medium effect size.

3.1.3. Cross-linguistic comparison

In both languages, it was more frequent in general for turns to end without gesture than with gesture. Specifically, the likelihood that no gesture will be ongoing at the time of speech offset is highest in Changes, followed by Backchannels and Keeps. The pattern of the retraction phase is similar. The inverse can be seen for the preparation and the hold phases, which are more frequent at Keeps and Backchannels than at Changes.

There are almost no preparations at all at Changes. In both German and Swedish, there are more stroke phases in Keeps than in Backchannels or Changes.

In sum, we can say about the distribution of gesture phases at the offset of speech of the current speaker that the gesture phases which move into or take place within the gesture space (that is, preparations, holds and strokes) tend to correlate with the same speaker keeping the floor, while gesture phases which move out of the gesture space (retraction, or none) tend to correlate with a change in speakership.

3.2. Timing of gestural activity around completion points

The analysis in Section 3.1 provides a snapshot of what happens at the offset of speech at a potential turn boundary, that is, at a specific point in time. However, it is clear that the distribution of gesture phases must evolve over time. Thus we are left with the open question of how the distribution develops toward – and also away from – the single point of time where speech stops.

We therefore take the distributions of gesture phases as shown in Figure 2 and treat them as if they were spectral slices. Taking a slice every tenth of a second, starting from 3 seconds before the offset of speech to 3 seconds after, and arranging them horizontally according to gesture phases, we obtain distributions of gesture phases over time, which can also be divided for each transition type separately. The result of this analysis/transformation is shown in Figure 3. We chose 3 seconds following Pöppel (2009), who suggests this duration as the window of cognitive ‘presence’.
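
The slicing procedure can be sketched as follows. This Python fragment is an illustrative reconstruction under stated assumptions, not the authors' R code: it assumes that the phase label at every slice has already been determined for each completion point, and the function names and input format are hypothetical.

```python
from collections import Counter

PHASES = ("preparation", "stroke", "hold", "retraction", "none")


def slice_offsets(window_s=3, per_second=10):
    """Time offsets in seconds, from -window_s to +window_s in 0.1 s steps.

    Integer steps are divided at the end to avoid accumulating float error.
    """
    steps = window_s * per_second
    return [k / per_second for k in range(-steps, steps + 1)]


def counts_per_slice(sequences):
    """Count gesture phases at each slice.

    sequences: one list of phase labels per completion point (for a single
    transition type), aligned with slice_offsets(). At every slice the counts
    sum to the number of completion points.
    """
    table = []
    for i, offset in enumerate(slice_offsets()):
        tally = Counter(seq[i] for seq in sequences)
        table.append((offset, {phase: tally.get(phase, 0) for phase in PHASES}))
    return table


# Two hypothetical completion points: one without any gesture, and one with a
# stroke up to the speech offset followed by a retraction.
toy = [["none"] * 61, ["stroke"] * 30 + ["retraction"] * 31]
slices = counts_per_slice(toy)
```

The window yields 61 slices per completion point, with the slice at index 30 corresponding to the speech offset itself.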

Figure 3. Distribution of gesture phases over time according to transition types; German data left, Swedish data right. The zero point is at the speech offset at a potential turn boundary (PTB). On the y-axis we count which gesture phase the current speaker is currently in. All counts across all phases sum up to the number of completion points in the data with the specific transition label. The counts in the slice at time point zero accord with the numbers in Table 2.

For each type of gesture phase, we conducted a mixed logistic regression with the frequency of the gesture phase as the criterion variable and with time, language, transition type and the interaction time:transitionType as fixed factors; speaker was included as a random factor (cf. function glmer in the R package lme4; Bates et al., 2022). The results for the model predicting the presence of gesture strokes are given in Figure 4; the model achieved an $ {R}^2m=0.091 $ without the random factor of speaker and $ {R}^2c=0.283 $ with the random factor. Expanded model results for all gesture phases are shown in Table 3, while the results of models for gesture phases other than strokes are summarized in Figure 5. We calculated the $ {R}^2 $ values using the function r.squaredGLMM (Nakagawa & Schielzeth, 2013) in the R package MuMIn (Bartoń, 2022).
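For readers who prefer a formula, the model described above can be written out as follows. This is a sketch under assumptions: the symbols ($y_i$, $\beta$, $u_{s(i)}$, $\sigma_f^2$, $\sigma_u^2$) are our own notation, a single random intercept per speaker is assumed, and the $R^2$ expressions assume the theoretical logit-link variance $\pi^2/3$ with no extra overdispersion term, following the general approach of Nakagawa and Schielzeth (2013). Here $R^2_m$ and $R^2_c$ correspond to the values reported without and with the random factor.

```latex
% y_i = 1 if the gesture phase in question is present in observation i
% s(i) = speaker of observation i; transition type has three levels,
% so \beta_3 and \beta_4 each stand for a set of coefficients
\operatorname{logit} P(y_i = 1)
  = \beta_0
  + \beta_1\,\mathrm{time}_i
  + \beta_2\,\mathrm{language}_i
  + \beta_3\,\mathrm{transition}_i
  + \beta_4\,(\mathrm{time}_i \times \mathrm{transition}_i)
  + u_{s(i)},
\qquad u_{s(i)} \sim \mathcal{N}(0, \sigma_u^2)

% \sigma_f^2 = variance of the fixed-effects part of the linear predictor
R^2_m = \frac{\sigma_f^2}{\sigma_f^2 + \sigma_u^2 + \pi^2/3},
\qquad
R^2_c = \frac{\sigma_f^2 + \sigma_u^2}{\sigma_f^2 + \sigma_u^2 + \pi^2/3}
```

Under this reading, the gap between the two values for the stroke model (0.091 versus 0.283) reflects the share of variance attributable to differences between speakers.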

Figure 4. Model estimates for strokes. The x-axis shows the time offset from the completion point. The y-axis shows the estimated probability of strokes on a logarithmic scale. The random factor (speaker) is considered in the plot. We used the R package effects (Fox et al., 2022) in plotting the estimates.

Table 3. Upper part: statistical evaluation for each logistic model: $ {R}^2m $ = explained variation without the random factor (speaker) and $ {R}^2c $ = explained variation corrected for the random factor. Lower part: p-values (log probabilities) for each factor and the interaction with time. The reference group (Intercept) is backchannels in German. Bold text shows predictors that achieve statistical significance.

Figure 5. Estimates for the models for gesture phases preparations, holds, retractions, and none.

As the results in Section 3.1 have already shown, for strokes, there was a significant main effect of transition type on the frequency of strokes: Overall, participants produce more gesture strokes around Keeps than around Backchannels and more strokes around Backchannels than around Changes. The analysis here further shows a significant interaction between transition type and time: In Changes, strokes become progressively less frequent as the speech offset approaches and passes, while in Keeps and Backchannels, gesture strokes are equally probable preceding and following the end of the current turn. No significant effects were found for language in this model.

Although this finding is valid, it could be considered trivial, since it is already well known that manual gestures are mostly performed by current speakers. However, a logistic regression attempts to fit a linear model, while, as the distributions shown in Figure 3 suggest, there might be more details in the evolution of gesture phases over time, which the logistic regression is unable to model. Therefore, to look deeper into the gesture dynamics, we calculated binomial tests for each point in time (that is, every tenth of a second), from which we obtained the probability of stroke activity at each time point as well as 95% confidence intervals for these probabilities. The resulting plot for gesture strokes is shown in Figure 6. Since no effect was previously found for language, both languages are modeled together.
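The per-time-point comparison can be illustrated with a short Python sketch. Note the swapped-in technique: for simplicity it uses the Wilson score interval, whereas R's `binom.test` reports an exact (Clopper-Pearson) interval, so the bounds will differ slightly from those an exact test would give. The function names and the `(n_strokes, n_points)` input layout are hypothetical, and the overlap check mirrors the reading rule given for Figure 6.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z = 1.96 for ~95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)

def intervals_overlap(a, b):
    """True if the closed intervals a = (lo, hi) and b = (lo, hi) overlap."""
    return a[0] <= b[1] and b[0] <= a[1]

def differing_times(type_a, type_b):
    """Return the time points at which the two intervals do not overlap.

    type_a, type_b: dicts mapping relative time (s) to (n_strokes, n_points)
    for two transition types (assumed layout).
    """
    result = []
    for t in sorted(type_a.keys() & type_b.keys()):
        ci_a = wilson_interval(*type_a[t])
        ci_b = wilson_interval(*type_b[t])
        if not intervals_overlap(ci_a, ci_b):
            result.append(t)
    return result
```

Treating non-overlapping intervals as a significant difference is a conservative criterion: two proportions can differ significantly even when their 95% intervals overlap slightly.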

Figure 6. Probability of strokes over time (zero = offset of speech) according to transition types. Stretches where the confidence intervals do not overlap can be considered as significantly different.

While the overall logistic regression did not show an effect of time for Keeps and Backchannels, Figure 6 shows that stroke activity already differs between Backchannels (fewer strokes) and Keeps (more strokes) as early as 3 seconds before the offset of speech. This difference, however, disappears between ca. 2.2 seconds before the completion point and ca. 0.2 seconds after it. In this stretch, stroke activity in both Backchannels and Keeps first increases, reaching a peak at around 1 second preceding the speech offset, then decreases, reaching a valley at the speech offset, and then starts to increase again. From ca. 0.4 seconds to 0.8 seconds following the speech offset, there is higher stroke activity at Keeps than at Backchannels.

This may relate to the preparation phases, shown in Figure 7. Interesting stretches of time here go from −0.5 seconds to +0.5 seconds and from ca. +1.8 to +2.2 seconds, where the probability of gesture preparations at Keeps is higher than for Backchannels. The peak at the completion point could mean that at Keeps, the speaker already prepares upcoming strokes even before the completion point. The preparations at Backchannels seem to come slightly later.

Figure 7. Probability of preparations over time (zero = offset of speech) according to transition types.

The distribution of hold phases over time (Figure 8) shows that Backchannels and Keeps do not differ significantly in the presence of holds. From about 0.7 seconds before the completion point, however, more gesture holds appear in Keeps than in Changes, and this higher probability of holds is retained throughout the remaining time.

Figure 8. Probability of holds over time (zero = offset of speech) according to transition types.

The distributions of retractions have overlapping confidence intervals throughout the entire time period investigated and are therefore not shown and discussed further.

The distributions of the last condition, none, are shown in Figure 9. The pattern for Changes seems to mirror that of the strokes (cf. Figure 6), with a higher probability of the none phase after the completion point and an increase in the likelihood of there being no gesture beginning about 0.5 seconds before the offset of speech.

Figure 9. Probability of none over time (zero = offset of speech) according to transition types.

In terms of the frequency of turn transitions arising without gesture (none), all three transition types are significantly different throughout the 6 seconds surrounding the completion point. The none category is highest at Changes, lowest for Keeps, with Backchannels in the middle.

4. Discussion

With the current study, we took a first step to investigate precise temporal dynamics in the realm of turn-taking and gesture, where they have not yet been investigated quantitatively. We addressed the temporal relationships between manual gestures and the semantic and pragmatic content of conversational speech on two scales: the distribution of gestures in relation to turn-taking, as well as the overall distribution and alignment of hand gestures in the vicinity of potential turn boundaries.

We hypothesized that locations where a current speaker wishes to hold the floor (that is, Keeps and Backchannels) would demonstrate different temporal relationships between speech and gesture than cases in which the current speaker is ready to release the floor (that is, Changes).

We indeed found differences in gesturing behavior between Keeps and Changes, and to some extent between Keeps and Backchannels. The Backchannel locations may consist of a mixture of locations in which the current speaker wishes to keep the turn and locations where the current speaker would have been willing to allow an interlocutor to take up a turn but was ‘refused’ by the use of a backchannel (cf. Taboada, 2006; Yngve, 1970). Thus, the finding that gesture behavior in Keeps and Backchannels is similar but not identical is not unexpected. Gestural activity was more frequent and contained more strokes leading up to Keeps compared to Changes. Thus, the evidence from our study supports the interpretation suggested by Sacks et al. (1974), that participants must do more active work to keep the floor than to release it.

Kita et al. (1997) termed preparation, stroke, partial retraction and retraction as ‘active phases’ and distinguished them that way from gesture holds (p. 34). Our results, however, indicate that the pattern of the retraction phase behaves in a way similar to the pattern of no gesture. So if we think in terms of movement effort of a gesture, we could rather classify preparation, stroke and hold as gesture phases that contribute to gesture activity and classify retraction and none as not contributing to an active gestural movement, that is, gesture passivity. In this sense, our results could mean that a high degree of gesture activity is more likely to lead to a Keep, while a reduction of gesture activity, that is, passivity, is more likely to invite a Backchannel or lead to a Change in speakership. Overall, gesture passivity (demonstrated by a gesture retraction or no gesture) might thus be an indication for interlocutors that at the upcoming turn end, some kind of contribution is expected.

From a multimodal perspective, gestures may be equally informative for participants of face-to-face conversations and complement other turn-taking cues, such as pitch, or even override them. As Truong et al. (2011) have shown, compared to speech activity and mutual gaze, pitch was not a relevant factor in explaining the presence or absence of a backchannel signal (vocal, visual or bimodal). Our own previous work has also suggested that when gestural cues to turn-taking are available, pitch variation may be employed to a lesser extent (Zellers et al., 2019a).

Our investigation also included a cross-linguistic comparison. Findings on one language may not hold for another language, but as we were analyzing basic gestural properties (gesture phases) and not their emblematic use (Kendon, 1980), the influence of the language may be rather low, indicating a more general pattern of gesture dynamics and turn-taking.

Although we did not find systematic or significant language differences, there appear to be differences in size of effects between German and Swedish (see Figure 3 and results for Cramér’s V in Sections 3.1.1 and 3.1.2). These differences may arise from other aspects of the data. First, our Swedish sample contained overall less gesturing than the German sample. Second, the conversational activities in the corpora differed, with the German conversations being in general more task-oriented than the Swedish ones. Thus, the current results are probably better understood as supporting the argument that gesture implementation in the vicinity of completion points is a universal communicative strategy, rather than one strongly mediated by linguistic structure.

4.1. Temporal features of hand gestures at turn ends

The analyses reported in Sections 3.1 and 3.2 provide evidence that hand gesturing overall is structured in a way that supports the structuring and management of turn-taking in conversation. This complements and expands upon findings by, for example, Kendrick et al. (2023), who report that preparations and strokes at TCU ends are associated with floor-holding by the current speaker, as well as by Kendon (1995) and Mondada (2007), who report functions of specific gesture shapes in terms of their contribution to the activity of holding or releasing the floor. By looking at all hand gestures, regardless of their form, we find that the stroke phase in general, that is, the obligatory and ‘meaningful’ portion of the gesture, also becomes rarer as a completion point approaches. At the time of speech offset, strokes are extremely rare in both German and Swedish, and this is further modulated by the type of turn transition that is taking place: ongoing strokes at the time of speech offset are more frequent in Keeps, where the same speaker intends to continue speaking, than at Changes, where a new speaker will take over. Changes are also the type of turn transition least likely to have any kind of ongoing gesture at the time of speech offset. Thus, for example, Schegloff’s (1984) claim that hand gesturing is a current-speaker activity is supported by our quantitative analysis, and we can expand upon this by arguing that hand gesturing is also an activity that can indicate intentions about future speakership.

Expanding our view outward from the single time point of the offset of speech to look at the larger picture approaching and following a completion point, we see differences in gesturing behavior even at a substantial distance from this time point. Even in the few seconds preceding a completion point, gesturing is less frequent preceding Changes than preceding Keeps or Backchannels. In Backchannels and Keeps, holds remain similarly frequent, or may even increase, approaching the speech offset, and dip in frequency shortly afterward. In all types of transitions, the frequency of strokes is also at its peak about 1 second before the completion point. This suggests that stroking could be a useful early visual cue to an upcoming boundary, alerting an interlocutor to search for other cues about the current speaker’s turn-taking intentions.

4.2. Limitations

The current study adopts the offset of speech as a reference point, thus taking the perspective of the current speaker. Different results might have arisen if our reference point was the onset of speech following our completion points, taking a more recipient-oriented perspective, as did Truong et al. (2011), who time-locked at the start of (verbal, visual or bimodal) backchannels and investigated the prior interlocutor’s presence or absence of speech and mutual gaze. Our study also restricts gesture annotation to the four gesture phases following Kendon (2004). An addition to the stroke phase could be the annotation of the gesture apex, that is, the place of maximum effort (peak velocity, peak acceleration, or peak deceleration), cf. Pouw and Dixon (2019). A focus on the most prominent part of a gesture stroke might result in different (potentially sharper) distribution contours.

Another limitation of our study is our focus on the recipient’s verbal behavior while not taking into account the visual resources of feedback such as head nods, facial expressions, gaze and so forth. Including such annotations in future studies would better reflect the multimodal richness of face-to-face interaction. Similarly, it is beyond the scope of the current study to account for additional signaling on the part of the first speaker regarding his or her turn-taking intentions, either by lexical means or by variation in, e.g., pitch or duration; annotations of the turn-final pitch contours exist, and their relationship to gestural behavior will be explored in future research.

We were also limited by the available data. While many corpora of conversational speech exist, it is challenging to identify corpora that are sufficiently similar in terms of their structure. The interactional settings in particular are different for most corpora. In addition, our study used data from a different number of speakers in each language, meaning that for the $ {\chi}^2 $ tests, which could not incorporate random factors, the differing amount of individual variability in the two languages could have influenced the statistical results. Once comparable data are available across languages, e.g., the parallel corpus of the PECII project (Kornfeld et al., 2023) with constant interactional settings, our study could be repeated, also taking the other shortcomings into account.

5. Conclusions

Annotating and carrying out quantitative analyses of conversational data could be interpreted as ignoring or overgeneralizing important complexity arising in conversational interaction; however, our larger-scale quantitative analysis has identified larger-scale temporal patterns arising across languages and across conversational settings. While interacting participants can still make sense of very context-specific cue organizations, we find evidence supporting the hypothesis that conversational participants systematically vary their gestural behavior in the approach to and at turn boundaries and that the temporal placement of different gesture phases (strokes versus holds versus other phases) shows a tendency to pattern similarly depending on the sequential structure of the conversational turn. We found differences only in the degree to which these patterns arose between German and Swedish, suggesting that these temporal patterns are either universal or that linguistic or cultural differences must be much larger to identify differences in timing behavior.

Future research will bring another prosodic parameter, the pitch contour at turn ends, into the equation. It will also expand the scope of the investigation beyond Germanic languages and beyond European cultures. These parameters will help us to refine our assessment of the universality of gesturing behavior at turn boundaries as well as its interaction with the linguistic system.

Data availability statement

Legal restrictions prohibit sharing of the raw video and audio dataset due to GDPR privacy restrictions and demands for anonymity of participants. The extracted data and R code for these analyses are publicly available at https://osf.io/efs4c/?view_only=16af14465e314724aa44ba709f051860.

Acknowledgments

We are grateful to Simon Alexanderson, Jonas Beskow and Jens Edlund for assistance with Spontal and to Caroline Kleen for supplementary annotation work. We also want to thank Sandra Hansen and Sascha Wolfer for valuable feedback on the statistical analyses and the anonymous reviewers for their comments and suggestions for improvement on previous versions of this paper. We gratefully acknowledge support from CLARIN SPEECH, the CLARIN Knowledge Centre for Speech Analysis.

Funding statement

This work was supported by the German Research Foundation (DFG; Aufbau Internationaler Kooperationen GO 3063/1–1, PE 2879/1–1, ZE 1178/1–1), the Swedish Research Council (VR-2017-02140) and the Riksbankens Jubileumsfond (P12–0634:1).

Competing interests

The authors declare none.

References

Ambrazaitis, G., & House, D. (2017a). Acoustic features of multimodal prominences: Do visual beat gestures affect verbal pitch accent realization? In Proceedings of 14th International Conference on Auditory-Visual Speech Processing. Stockholm, Sweden.

Ambrazaitis, G., & House, D. (2017b). Multimodal prominences: Exploring the patterning and usage of focal pitch accents, head beats and eyebrow beats in Swedish television news readings. Speech Communication, 95, 110–113. https://doi.org/10.1016/j.specom.2017.08.008

Auer, P. (1996). On the prosody and syntax of turn-continuations. In Couper-Kuhlen, E. & Selting, M. (Eds.), Prosody in conversation: Interactional studies (pp. 57–100). Cambridge: Cambridge University Press. https://doi.org/10.1017/CBO9780511597862.004

Auer, P. (2005). Projection in interaction and projection in grammar. Text-Interdisciplinary Journal for the Study of Discourse, 25(1), 7–36. https://doi.org/10.1515/text.2005.25.1.7

Barkhuysen, P., Krahmer, E., & Swerts, M. (2008). The interplay between the auditory and visual modality for end-of-utterance detection. Journal of the Acoustical Society of America, 123(1), 354–365. https://doi.org/10.1121/1.2816561

Bartoń, K. (2022). MuMIn: Multi-model inference. R package version 1.47.1. https://CRAN.R-project.org/package=MuMIn

Bates, D., Maechler, M., Bolker, B., & Walker, S. (2022). lme4: Linear mixed-effects models using ‘Eigen’ and S4. R package version 1.1–31. https://CRAN.R-project.org/package=lme4

Bavelas, J. B. (1994). Gestures as part of speech: Methodological implications. Research on Language and Social Interaction, 27, 201–221. https://doi.org/10.1207/s15327973rlsi2703_3

Bergmann, K., Aksu, V., & Kopp, S. (2011). The relation of speech and gestures: Temporal synchrony follows semantic synchrony. In Proceedings of the 2nd Workshop on Gesture and Speech in Interaction (GeSpIn 2011).

Boersma, P., & Weenink, D. (2021). Praat: Doing phonetics by computer. http://www.praat.org/

Bögels, S., & Torreira, F. (2015). Listeners use intonational phrase boundaries to project turn ends in spoken interaction. Journal of Phonetics, 52, 46–57. https://doi.org/10.1016/j.wocn.2015.04.004

Bosker, H. R., & Peeters, D. (2021). Beat gestures influence which speech sounds you hear. Proceedings of the Royal Society B, 288(1943), 20202419. https://doi.org/10.1098/rspb.2020.2419

Buanzur, T., Zellers, M., Namyalo, S., & Witzlack-Makarevich, A. (2018). A first investigation of the timing of turn-taking in Ruuli. In Proceedings of Interspeech 2018, Hyderabad, India (pp. 621–625). https://doi.org/10.21437/Interspeech.2018-1254

Casasanto, D. (2008). Conceptual affiliates of metaphorical gestures. In International Conference on Language, Communication, & Cognition. Brighton, UK.

Casasanto, D. (2009). When is a linguistic metaphor a conceptual metaphor? New Directions in Cognitive Linguistics, 24, 127–146. https://doi.org/10.1075/hcp.24.11cas

Caspers, J. (2003). Local speech melody as a limiting factor in the turn-taking system in Dutch. Journal of Phonetics, 31, 251–276. https://doi.org/10.1016/S0095-4470(03)00007-X

Chui, K. (2005). Temporal patterning of speech and iconic gestures in conversational discourse. Journal of Pragmatics, 37(6), 871–887. https://doi.org/10.1016/j.pragma.2004.10.010

de Ruiter, J., Mitterer, H., & Enfield, N. (2006). Projecting the end of a speaker’s turn: A cognitive cornerstone of conversation. Language, 82(3), 515–535. https://doi.org/10.1353/lan.2006.0130

de Vos, C., Torreira, F., & Levinson, S. C. (2015). Turn-timing in signed conversations: Coordinating stroke-to-stroke turn boundaries. Frontiers in Psychology, 6, 268. https://doi.org/10.3389/fpsyg.2015.00268

Dick, A. S., Goldin-Meadow, S., Hasson, U., Skipper, J. I., & Small, S. L. (2009). Co-speech gestures influence neural activity in brain regions associated with processing semantic information. Human Brain Mapping, 30(11), 3509–3526. https://doi.org/10.1002/hbm.20774

Edlund, J., Beskow, J., Elenius, K., Hellmer, K., Strömbergsson, S., & House, D. (2010). Spontal: A Swedish spontaneous dialogue corpus of audio, video and motion capture. In Proceedings of LREC 2010, Valetta, Malta.

Esposito, A., Esposito, D., Refice, M., Savino, M., & Shattuck-Hufnagel, S. (2007). A preliminary investigation of the relationship between gestures and prosody in Italian. In Esposito, A., Bratanić, M., Keller, E., & Marinaro, M. (Eds.), Fundamentals of verbal and nonverbal communication and the biometric issue (pp. 65–74). Amsterdam: IOS Press. https://doi.org/10.1007/978-3-540-76442-7

Esteve-Gibert, N., & Prieto, P. (2013). Prosodic structure shapes the temporal realization of intonation and manual gesture movements. Journal of Speech, Language, and Hearing Research, 56(3), 850–864. https://doi.org/10.1044/1092-4388(2012/12-0049)

Ferré, G. (2010). Timing relationships between speech and co-verbal gestures in spontaneous French. In Proceedings of LREC 2010 (pp. 86–91). Valetta, Malta.

Ferré, G., & Renaudier, S. (2017). Unimodal and bimodal backchannels in conversational English. Proceedings of SEMDIAL, 2017, 27–37.

Florit-Pons, J., Vilà-Giménez, I., Rohrer, P., & Prieto, P. (2020). The development and temporal integration of co-speech gesture in narrative speech: A longitudinal study. In Proceedings of GESPIN 2020. Stockholm, Sweden.

Fox, J., Weisberg, S., Price, B., Friendly, M., & Hong, J. (2022). effects: Effect Displays for Linear, Generalized Linear, and Other Models. R package version 4.2–2. https://CRAN.R-project.org/package=effects

Goldin-Meadow, S., & Sandhofer, C. M. (1999). Gestures convey substantive information about a child’s thoughts to ordinary listeners. Developmental Science, 2, 67–74. https://doi.org/10.1111/1467-7687.00056

Gravano, A., & Hirschberg, J. (2009). Turn-yielding cues in task-oriented dialogue. In Healey, P., Pieraccini, R., Byron, D., Young, S., & Purver, M. (Eds.), Proceedings of the SIGDIAL 2009 Conference (pp. 253–261). London: Association for Computational Linguistics.

Gravano, A., & Hirschberg, J. (2011). Turn-taking cues in task-oriented dialogue. Computer Speech and Language, 25, 601–634. https://doi.org/10.1016/j.csl.2010.10.003

Graziano, M., & Gullberg, M. (2018). When speech stops, gesture stops: Evidence from developmental and crosslinguistic comparisons. Frontiers in Psychology, 9, 879. https://doi.org/10.3389/fpsyg.2018.00879

Guellaï, B., Langus, A., & Nespor, M. (2014). Prosody in the hands of the speaker. Frontiers in Psychology, 5, 700.

Heldner, M., & Edlund, J. (2010). Pauses, gaps and overlaps in conversation. Journal of Phonetics, 38, 555–568. https://doi.org/10.1016/j.wocn.2010.08.002

Hjalmarsson, A. (2011). The additive effect of turn-taking cues in human and synthetic voice. Speech Communication, 53, 23–35. https://doi.org/10.1016/j.specom.2010.08.003

Hubbard, A. L., Wilson, S. M., Callan, D. E., & Dapretto, M. (2009). Giving speech a hand: Gesture modulates activity in auditory cortex during speech perception. Human Brain Mapping, 30(3), 1028–1037. https://doi.org/10.1002/hbm.20565

Kane, J., Yanushevskaya, I., de Looze, C., Vaughan, B., & Ní Chasaide, A. (2014). Analysing the prosodic characteristics of speech-chunks preceding silences in task-based interactions. In Proceedings of 15th Interspeech (pp. 333–337). Singapore.

Kelly, S. D., Creigh, P., & Bartolotti, J. (2010). Integrating speech and iconic gestures in a Stroop-like task: Evidence for automatic processing. Journal of Cognitive Neuroscience, 22(4), 683–694. https://doi.org/10.1162/jocn.2009.21254

Kendon, A. (1980). Gesticulation and speech: Two aspects of the process of utterance. In Kay, M. R. (Ed.), The role of nonverbal communication (pp. 207–227). Berlin/The Hague: De Gruyter Mouton.

Kendon, A. (1994). Do gestures communicate? A review. Research on Language and Social Interaction, 27, 175–200. https://doi.org/10.1207/s15327973rlsi2703_2

Kendon, A. (1995). Gestures as illocutionary and discourse structure markers in southern Italian conversation. Journal of Pragmatics, 23(3), 247–279. https://doi.org/10.1016/0378-2166(94)00069-V

Kendon, A. (2004). Gesture: Visible action as utterance. Cambridge: Cambridge University Press. https://doi.org/10.1017/CBO9780511807572

Kendrick, K. H., Holler, J., & Levinson, S. C. (2023). Turn-taking in human face-to-face interaction is multimodal: Gaze direction and manual gestures aid the coordination of turn transitions. Philosophical Transactions of the Royal Society B, 378(1875), 20210473. https://doi.org/10.1098/rstb.2021.0473

Kita, S., van Gijn, I., & van der Hulst, H. (1997). Movement phases in signs and co-speech gestures, and their transcription by human coders. In Wachsmuth, I. & Fröhlich, M. (Eds.), International Gesture Workshop, Bielefeld, Germany, September 1997. Lecture Notes in Artificial Intelligence 1371 (pp. 23–35). Berlin: Springer.

Knight, D. (2009). A multi-modal corpus approach to the analysis of backchanneling behaviour. PhD Thesis, University of Nottingham.

Koiso, H., Horiuchi, Y., Tutiya, S., Ichikawa, A., & Den, Y. (1998). An analysis of turn-taking and backchannels based on prosodic and syntactic features in Japanese map task dialogs. Language and Speech, 41, 295–321. https://doi.org/10.1177/002383099804100404

Kornfeld, L., Küttner, U.-A., & Zinken, J. (2023). Ein Korpus für die vergleichende Interaktionsforschung. Das Parallel European Corpus of Informal Interaction (PECII). In Deppermann, A., Fandrych, C., Kupietz, M., & Schmidt, T. (Eds.), Korpora in der germanistischen Sprachwissenschaft. Mündlich, schriftlich, multimedial (pp. 103–127). Berlin/Boston: de Gruyter. https://doi.org/10.1515/9783111085708-006

Krahmer, E., & Swerts, M. (2007). The effects of visual beats on prosodic prominence: Acoustic analyses, auditory perception and visual perception. Journal of Memory and Language, 57(3), 396–414. https://doi.org/10.1016/j.jml.2007.06.005

Leonard, T., & Cummins, F. (2011). The temporal relation between beat gestures and speech. Language and Cognitive Processes, 26(10), 1457–1471. https://doi.org/10.1080/01690965.2010.500218

Li, X. (2014). Multimodality, interaction and turn-taking in Mandarin conversation (Vol. 3). Amsterdam: John Benjamins Publishing Company. https://doi.org/10.1075/scld.3

Local, J., & Walker, G. (2012). How phonetic features project more talk. Journal of the International Phonetic Association, 42, 255–280. https://doi.org/10.1017/S0025100312000187

Local, J. K., Kelly, J., & Wells, W. H. G. (1986). Towards a phonology for conversation: Turn-taking in Tyneside English. Journal of Linguistics, 22, 411–437. https://doi.org/10.1017/S0022226700010859

Loehr, D. P. (2004). Gesture and intonation. PhD thesis, Georgetown University.

Max Planck Institute for Psycholinguistics (2018). ELAN (version 5.2). Nijmegen, Netherlands. https://tla.mpi.nl/tools/tla-tools/elan/

McGurk, H., & MacDonald, J. (1976). Hearing lips and seeing voices. Nature, 264(5588), 746–748. https://doi.org/10.1038/264746a0

McNeill, D. (1992). Hand and mind: What gestures reveal about thought. Chicago, IL: The University of Chicago Press.

McNeill, D. (2005). Gesture and thought. Chicago, IL: University of Chicago Press. https://doi.org/10.7208/chicago/9780226514642.001.0001

McNeill, D. (2006). Gesture and communication. In Brown, K. & Anderson, A. (Eds.), The Encyclopedia of language and linguistics (2nd ed., pp. 58–66). Psycholinguistics Series. Amsterdam and Boston: Elsevier. https://doi.org/10.1016/B0-08-044854-2/00798-7

Mondada, L. (2007). Multimodal resources for turn-taking: Pointing and the emergence of possible next speakers. Discourse Studies, 9(2), 194–225. https://doi.org/10.1177/1461445607075346

Mondada, L. (2019). Contemporary issues in conversation analysis: Embodiment and materiality, multimodality and multisensoriality in social interaction. Journal of Pragmatics, 145, 47–62. https://doi.org/10.1016/j.pragma.2019.01.016

Mondada, L., & Oloff, F. (2011). Gestures in overlap. The situated establishment of speakership. In Stam, G. & Ishino, M. (Eds.), Integrating gestures. The interdisciplinary nature of gesture (pp. 321–338). Amsterdam: John Benjamins. https://doi.org/10.1075/gs.4.29mon

Nakagawa, S., & Schielzeth, H. (2013). A general and simple method for obtaining $ {R}^2 $ from generalized linear mixed-effects models. Methods in Ecology and Evolution, 4(2), 133–142. https://doi.org/10.1111/j.2041-210x.2012.00261.x

Nobe, S. (1996). Representational gestures, cognitive rhythms, and acoustic aspects of speech: A network/threshold model of gesture production. PhD thesis, University of Chicago.

Novack, M. A., Wakefield, E. M., & Goldin-Meadow, S. (2015). What makes a movement a gesture? Cognition, 146, 339–348. https://doi.org/10.1016/j.cognition.2015.10.014

Ogden, R. (2001). Turn transition, creak and glottal stop in Finnish talk-in-interaction. Journal of the International Phonetic Association, 31(1), 139–152. https://doi.org/10.1017/S0025100301001116

Peters, B. (2006). Form und Funktion prosodischer Grenzen im Gespräch – Ein phonetischer Beitrag zur Gesprächsforschung. Saarbrücken: Südwestdeutscher Verlag für Hochschulschriften.

Pöppel, E. (2009). Pre-semantically defined temporal windows for cognitive processing. Philosophical Transactions of the Royal Society B: Biological Sciences, 364(1525), 1887–1896. https://doi.org/10.1098/rstb.2009.0015

Pouw, W., & Dixon, J. A. (2019). Quantifying gesture-speech synchrony. In Grimminger, A. (Ed.), 6th gesture and speech in interaction conference – GESPIN 6 (pp. 75–80). Universitaetsbibliothek Paderborn.

Prieto, P., Puglesi, C., Borràs-Comes, J., Arroyo, E., & Blat, J. (2015). Exploring the contribution of prosody and gesture to the perception of focus using an animated agent. Journal of Phonetics, 49(1), 41–54. https://doi.org/10.1016/j.wocn.2014.10.005

Quek, F., McNeill, D., Bryll, R., Duncan, S., Ma, X.-F., Kirbas, C., McCullough, K. E., & Ansari, R. (2002). Multimodal human discourse: Gesture and speech. ACM Transactions on Computer-Human Interaction, 9(3), 171–193. https://doi.org/10.1145/568513.568514

R Core Team (2022). R: A language and environment for statistical computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/

Rochet-Capellan, A., Laboissière, R., Galván, A., & Schwartz, J.-L. (2008). The speech focus position effect on jaw–finger coordination in a pointing task. Journal of Speech, Language, and Hearing Research, 51, 1507–1521. https://doi.org/10.1044/1092-4388(2008/07-0173)

Rossi, M., Feindt, K., & Zellers, M. (2022). Individual variation in F0 marking of turn-taking in natural conversation in German and Swedish. In Proceedings of Speech Prosody 2022 (pp. 185–189). Lisbon, Portugal. https://doi.org/10.21437/SpeechProsody.2022-38

Sacks, H., Schegloff, E., & Jefferson, G. (1974). A simplest systematics for the organisation of turn-taking for conversation. Language, 50(4), 696–735. https://doi.org/10.1353/lan.1974.0010

Schaffer, D. (1983). The role of intonation as a cue to turn taking in conversation. Journal of Phonetics, 11, 243–257. https://doi.org/10.1016/S0095-4470(19)30825-3

Schegloff, E. A. (1984). On some gestures' relation to talk. In Atkinson, J. M. & Heritage, J. (Eds.), Structures of social action: Studies in conversation analysis (pp. 266–296). Cambridge: Cambridge University Press.

Schmidt, T. (2014). The research and teaching corpus of spoken German – FOLK. In Proceedings of the Ninth Conference on International Language Resources and Evaluation (LREC’14) (pp. 383–387). Reykjavik, Iceland: European Language Resources Association (ELRA).

Selting, M. (1996). On the interplay of syntax and prosody in the constitution of turn-constructional units and turns in conversation. Pragmatics, 6, 357–388.

Shattuck-Hufnagel, S., & Ren, A. (2018). The prosodic characteristics of non-referential co-speech gestures in a sample of academic-lecture-style speech. Frontiers in Psychology, 9, 1514. https://doi.org/10.3389/fpsyg.2018.01514

Sikveland, R., & Ogden, R. (2012). Holding gestures across turns: Moments to generate shared understanding. Gesture, 12(2), 166–199. https://doi.org/10.1075/gest.12.2.03sik

Stivers, T., Enfield, N. J., Brown, P., Englert, C., Hayashi, M., Heinemann, T., Hoymann, G., Rossano, F., de Ruiter, J. P., Yoon, K.-E., & Levinson, S. C. (2009). Universals and cultural variation in turn-taking in conversation. Proceedings of the National Academy of Sciences of the United States of America, 106(26), 10587–10592. https://doi.org/10.1073/pnas.0903616106

Stivers, T., & Rossano, F. (2010). Mobilizing response. Research on Language and Social Interaction, 43(1), 3–31. https://doi.org/10.1080/08351810903471258

Streeck, J., & Hartge, U. (1992). Previews: Gestures at the transition place. In Auer, P. & di Luzio, A. (Eds.), The contextualization of language (pp. 135–158). Amsterdam: Benjamins B.V. https://doi.org/10.1075/pbns.22.10str

Taboada, M. (2006). Spontaneous and non-spontaneous turn-taking. Pragmatics, 16(2–3), 329–360.

ter Bekke, M., Drijvers, L., & Holler, J. (2020). The predictive potential of hand gestures during conversation: An investigation of the timing of gestures in relation to speech. PsyArXiv. https://doi.org/10.31234/osf.io/b5zq7

Truong, K. P., Poppe, R., de Kok, I., & Heylen, D. (2011). A multimodal analysis of vocal and visual backchannels in spontaneous dialogues. In Proceedings of 12th Interspeech (pp. 2973–2976). Florence, Italy.

Wagner, P., Malisz, Z., & Kopp, S. (2014). Gesture and speech in interaction: An overview. Speech Communication, 57, 209–232. https://doi.org/10.1016/j.specom.2013.09.008

Wickham, H. (2016). ggplot2: Elegant graphics for data analysis (2nd ed.). Cham, Switzerland: Springer International Publishing. https://doi.org/10.1007/978-3-319-24277-4

Yngve, V. H. (1970). On getting a word in edgewise. In Papers from the Sixth Regional Meeting Chicago Linguistic Society, April 16–18, 1970 (pp. 567–578). Chicago: Chicago Linguistic Society.

Zellers, M. (2017). Prosodic variation and segmental reduction and their roles in cuing turn transition in Swedish. Language and Speech, 60(3), 454–478. https://doi.org/10.1177/0023830916658680

Zellers, M., Gorisch, J., House, D., & Peters, B. (2019a). Hand gestures and pitch contours and their distribution at possible speaker change locations: A first investigation. In Proceedings of GeSpIn 2019 (pp. 93–98). Paderborn, Germany.

Zellers, M., Gorisch, J., House, D., & Peters, B. (2019b). Timing properties of hand gestures and their lexical counterparts at turn transition places. Proceedings of FONETIK, 2019, 119–124.

Table 1. Metadata for the FOLK and Spontal files

Figure 1. Screenshot of the annotation environment in ELAN (data from FOLK).

Figure 2. Ongoing gesture phase at time of speech offset, German data left, Swedish data right. The y-axis shows the proportion of gesture phases at each transition type, rather than raw counts.

Table 2. Gesture phases according to transition types observed at the time of speech offset in completion points in German and Swedish

Table 3. Upper part: statistical evaluation for each logistic model: $R^2_m$ = explained variation without the random factor (speaker); $R^2_c$ = explained variation corrected for the random factor. Lower part: p-values (log probabilities) for each factor and the interaction with time. The reference group (Intercept) is backchannels in German. Bold text shows predictors that achieve statistical significance.

Figure 5. Estimates for the models for gesture phases preparations, holds, retractions, and none.

Figure 6. Probability of strokes over time (zero = offset of speech) according to transition types. Stretches where the confidence intervals do not overlap can be considered as significantly different.

Figure 7. Probability of preparations over time (zero = offset of speech) according to transition types.

Figure 8. Probability of holds over time (zero = offset of speech) according to transition types.

Figure 9. Probability of none over time (zero = offset of speech) according to transition types.
