1. Introduction
While Hockett (Reference Hockett1960) identified the use of the vocal-auditory channel as one of the main features of language, and one leaving the rest of the body free to perform other activities, it is now well established that communication is better characterised as involving multimodality. Effects due to the interaction between speech and the visual modality go well beyond the use of visual input in the form of lip movement to decode the speech signal (McGurk & MacDonald, Reference McGurk and MacDonald1976). For instance, Holler et al. (Reference Holler, Kendrick and Levinson2018) investigated how fast participants in a conversation are able to answer questions and found faster responses for answers to questions accompanied by gestures. Drijvers and Holler (Reference Drijvers and Holler2022) asked participants to shadow (i.e., repeat back as fast as possible) recordings from natural conversations and found that seeing the speaker helps even when mouth movements are invisible, suggesting that the presence of gesture supports comprehension. The importance of gestural behaviour in communication is also evident in a recent study that found less creative thinking in online meetings than in physical multimodal settings (Brucks & Levav, Reference Brucks and Levav2022). Such findings raise the question of what kinds of multimodal information support the perception of what types of linguistic information.
In the current paper, we investigate the extent to which prominence perception may benefit from gestural cues that are produced during face-to-face interaction. Prominence is understood here as a relational property whereby a linguistic element stands out from other elements in an utterance (Grice & Kügler, Reference Grice and Kügler2021), and we speak of multimodal prominence as a property combining visual and prosodic characteristics.
It is well established that gesture is tightly connected with speech in human communication (Kendon, Reference Kendon2004; McNeill, Reference McNeill2008) and cognition (de Ruiter, Reference de Ruiter2000; Gentilucci & Dalla Volta, Reference Gentilucci and Dalla Volta2007; Iverson & Thelen, Reference Iverson and Thelen1999; Willems & Hagoort, Reference Willems and Hagoort2007), and that one of the ways in which this tight relationship manifests itself is in the temporal coordination between gestures and units of speech, in particular units of prosodic structure (see Wagner et al., Reference Wagner, Malisz and Kopp2014, for an overview).
Alignment between speech and the gestural modality, including not only hand gestures but also facial expressions and head movements, has been studied from the perspective of language production (Cavé et al., Reference Cavé, Guaïtella, Bertrand, Santi, Harlay and Espesser1996; Loehr, Reference Loehr2007; McClave, Reference McClave1998), language development (Esteve-Gibert et al., Reference Esteve-Gibert, Lœvenbruck, Dohen and d’Imperio2022) and in several languages (Chui, Reference Chui2005; Ferré, Reference Ferré2010). These findings converge on a view of gesture and speech as temporally aligned via their prominence cues and provide support for an understanding of communication as involving a common cognitive origin of the two modalities.
It has been debated whether gestures play a role for the speaker or the listener. For instance, Krauss and Hadar (Reference Krauss and Hadar1999) claim that their main function is to help the speaker in managing their communicative flow, for example, for lexical retrieval. Other studies, however, suggest that the presence of gestural behaviour has an effect on the way interlocutors perceive speaker utterances (Dohen & Loevenbruck, Reference Dohen and Loevenbruck2009; Glave & Rietveld, Reference Glave and Rietveld1979; House et al., Reference House, Beskow and Granström2001; Prieto et al., Reference Prieto, Puglesi, Borràs-Comes, Arroyo and Blat2015), and that beats in particular contribute to the perception of prominence of target words (Bosker & Peeters, Reference Bosker and Peeters2021; Krahmer & Swerts, Reference Krahmer and Swerts2004, Reference Krahmer and Swerts2007). The latter studies are based on highly controlled Dutch stimuli. The goal of the present study, by contrast, is to investigate how gestures contribute to the perception of overall (multimodal) prominence in naturally occurring stimuli extracted from conversations. This extension to conversational data requires explicitly addressing challenges such as the large variability in both speech and gesture, which in turn makes it harder to disentangle the factors that affect the perception of different elements in multimodal signals. Furthermore, ours is, to our knowledge, the first study of multimodal prominence perception in Maltese.
Assuming that speakers use gestures if a word is particularly important to them, we expect that words accompanied by gestures will be produced with stronger prominence in the speech signal than words not accompanied by a gesture. Therefore, we hypothesise that the perceived prominence of a target word in an utterance will be increased if the word is accompanied by a manual gesture. We further hypothesise that the perceived prominence of the co-occurring word will be higher if listeners see the speaker produce the manual gesture compared with when they do not. In other words, the first hypothesis concerns the listener’s perception of utterances that are produced with vs. without a co-speech gesture. The second concerns the listener’s perception of the same two types of utterance while viewing vs. not viewing the gestures and, therefore, investigates whether seeing the gesture actually affects prominence perception. The choice of positing two separate hypotheses is based on the fact that several studies (Ambrazaitis & House, Reference Ambrazaitis and House2023; Berger & Zellers, Reference Berger and Zellers2022; Esteve-Gibert et al., Reference Esteve-Gibert, Lœvenbruck, Dohen and d’Imperio2022; Krahmer & Swerts, Reference Krahmer and Swerts2007; Krivokapić et al., Reference Krivokapić, Tiede and Tyrone2017; Pouw et al., Reference Pouw, de Jonge-Hoekstra, Harrison, Paxton and Dixon2021) have provided evidence for more strongly articulated verbal prominence cues in connection with accompanying gestures. Therefore, if gestures affect prominence perception, this may also happen if the gestures are not seen (first hypothesis). However, if the visual cues provided by the gestures add to the perception of prominence, beyond their effects on the speech signal, then the effect on prominence perception should be stronger when gestures are visible than otherwise (second hypothesis).
In this study, we compare the acoustic cues to prominence in words produced with vs. without co-occurring gestures in utterances extracted from Maltese dialogues. We then test whether participants listening to the utterances perceive words accompanied by gestures as more prominent. Finally, we test whether seeing the accompanying gestures also increases the perceived prominence.
We begin with a brief review of research investigating the relationship between gestures and prosodic features in Section 2.1, followed by an account of how prosodic prominence is expressed in Maltese in Section 2.2. In Section 3, we explain the methods used in the experiment, including gestural and acoustic properties of the stimuli used, particularly the effect of gesture co-presence on intensity and pitch values of the stressed vowels in the target words. The results of the perception experiment are reported in Section 4 and discussed in Section 5. Section 6 draws the conclusions.
2. Background
2.1. Related work
Systematic relationships between gestures (not necessarily of the hand) and prosodic features have been observed in several studies. In general, there is broad agreement in the literature that hand gestures are coordinated with prosodic events, such as pitch accents and prosodic phrase boundaries (Bolinger, Reference Bolinger1986; Kendon, Reference Kendon and Key1980; Leonard & Cummins, Reference Leonard and Cummins2010; Loehr, Reference Loehr2007, Reference Loehr2004), and several studies claim that hand gesture strokes – a stroke being defined as the most dynamic part of a gesture – are temporally aligned with (or slightly precede) the main sentence accent (Alahverdzhieva & Lascarides, Reference Alahverdzhieva, Lascarides and Müller2010; McNeill, Reference McNeill1992). In their empirical study of German data (276 examples), Ebert et al. (Reference Ebert, Evert and Wilmes2011) confirmed this generalisation by observing that gesture strokes tend to precede sentence accent by 0.36 s on average. Esteve-Gibert and Prieto (Reference Esteve-Gibert and Prieto2013) investigated the alignment of speech and pointing gestures in Catalan and found that the positions of intonation peaks and gesture apexes were correlated and influenced by prosodic structure. In a further study on Catalan data, Esteve-Gibert et al. (Reference Esteve-Gibert, Borràs-Comes, Asor, Swerts and Prieto2017) looked at coordination between prosodic elements and head movements and demonstrated that the position of prosodic heads (accented syllables) and prosodic edges (prosodic word and intonational phrase boundaries) has an impact on the timing of head movements. Ambrazaitis et al. (Reference Ambrazaitis, Zellers and House2020b) found a tendency for stressed syllables in Swedish compounds to be accompanied by gestural movements.
It has been suggested that the choice of which words will be accompanied by gestures could be explained in terms of information structure (IS), which in turn affects the degree of prominence of linguistic expressions in a sentence. Thus, Ebert et al. (Reference Ebert, Evert and Wilmes2011) observed that the onsets of hand gesture phrases in their German data align with new-information foci but not contrastive ones. Krahmer and Swerts (Reference Krahmer and Swerts2004) studied the way eyebrows contribute to the perception of focus in Dutch and Italian. Their conclusion, however, is that eyebrow movements seem to play little to no role in the perception of focus in spite of the fact that speakers of both languages prefer an eyebrow movement to coincide with the most prominent word. Paggio et al. (Reference Paggio, Galea, Vella, Gatt and Paggio2018) investigated the prosodic characteristics and occurrence of gestures in Maltese fronted complements. Complement fronting in Maltese is a marked construction in IS terms, with the fronted complement taking the role of contrastive topic or focus depending on the context. The study showed that in Maltese, fronted complements with a falling pitch accent are often accompanied by hand gestures. Debreslioska and Gullberg (Reference Debreslioska and Gullberg2020) found, on the basis of German narrative spoken data, that the information status of a referent, whether new or inferable, has an effect on the type of gesture it is accompanied by. Esteve-Gibert et al. (Reference Esteve-Gibert, Lœvenbruck, Dohen and d’Imperio2022) showed that French-speaking children use head movements to mark the informational status of discourse referents.
The role of co-speech gesture – again, not just hand gestures – has also been studied by testing its effect on listeners. Several studies provide experimental evidence suggesting that beats contribute to both the realisation and perception of prominence. Prieto et al. (Reference Prieto, Puglesi, Borràs-Comes, Arroyo and Blat2015) explored the role of visual cues in the perception of focus in Catalan using a 3D animated character and found that both head and eyebrow movements contributed to the perception of contrastive focus, with head nods being more informative than eyebrow movements for focus identification. In House et al. (Reference House, Beskow and Granström2001), listeners were tasked with identifying the most prominent of two words in test sentences uttered by a talking face in Swedish. The study found that eyebrow and head movements contributed to prominence perception when synchronised with the stressed vowel of the potentially prominent word. Krahmer and Swerts (Reference Krahmer and Swerts2007) used highly controlled Dutch stimuli to investigate the effect of gesture on prominence perception. The results of the study suggest that visual beats (hand, face, and eyebrow) have a significant effect on the perceived prominence of the target word (in this study, either the word Amanda or the word Malta; in other words, only two possible words, each containing the same target vowel). Seeing a speaker realise a visual beat on a word increases its perceived prominence compared with not seeing the beat. In addition, the presence of an accompanying beat gesture has an effect on some of the acoustic characteristics of the target word, that is, duration and the F$_2$ formant. The combined effect of acoustic and gestural cues on multimodal perception of prominence is discussed in Jiménez-Bravo and Marrero-Aguiar (Reference Jiménez-Bravo and Marrero-Aguiar2020) based on audiovisual stimuli created from recordings of conversations in an online Spanish talent show. Stimuli for this study were generated by manipulating both F$_0$ and intensity to determine their contribution to perceived prominence in the presence of gestural cues. It was found that duration had a stronger effect than F$_0$.
Bosker and Peeters (Reference Bosker and Peeters2021) showed that beat gestures can influence stress perception. The authors used Dutch minimal stress pairs (e.g., KAnon vs. kaNON ‘canon’ vs. ‘cannon’ in English), removed the pitch and intensity cues to stress and had a hand gesture produced on the first or second syllable. Participants perceived the word to be stressed on the syllable that was aligned with the hand gesture.
Ambrazaitis et al. (Reference Ambrazaitis, Frid and House2020a, Reference Ambrazaitis, Frid and House2022) studied prominence perception with data from Swedish TV news. Similarly to Krahmer and Swerts (Reference Krahmer and Swerts2007), they found that words realised with a pitch accent and head movement tended to receive higher prominence ratings than words with only a pitch accent in the audiovisual condition, whereas words with low prominence tended to be rated slightly higher in the audio-only condition. These results were further analysed and discussed in Ambrazaitis and House (Reference Ambrazaitis and House2022), where cumulative effects were observed between F$_0$ rises, the presence of accompanying head movements and the occurrence of eyebrow movements.
In summary, empirical studies of multimodal language data in several languages have shown systematic relationships between prosodic features of speech and the timing as well as the function of accompanying gestures. One aspect of this relationship is the way gestures, both of the hands and the head, affect the perception of prominence by listeners. Several studies have demonstrated the effect of gestures using highly controlled data or data from read TV news. The only study we know of that uses naturally occurring conversational data is Jiménez-Bravo and Marrero-Aguiar (Reference Jiménez-Bravo and Marrero-Aguiar2020). The authors themselves, however, consider the study a pilot project due to the limited number of experiment participants (12). More research is necessary to understand how listeners perceive multimodal prominence in naturally occurring unscripted data. The current study is a contribution in this direction.
2.2. Prosodic prominence in Maltese
Determining the location of prosodically prominent elements is an important prerequisite for a study on the perception of prominence of gestures that may co-occur with such elements. Research on prosodic prominence in Maltese, though not extensive, suggests that there is a strong tendency for the main sentence accent that marks prominence to gravitate towards the right edge of the relevant prosodic domain (Vella, Reference Vella and Fabri2009a). The lexically stressed syllable that is assigned the main sentence accent also serves as the anchor for a pitch accent. More specifically, Vella (Reference Vella1995) claims that prosodic structure in Maltese is organised in such a way that it is the phonological phrase which is the domain of focus, defined in Gussenhoven (Reference Gussenhoven1983, p. 18) as ‘mark[ing] the speaker’s declared contribution to the conversation, while [−focus] constitutes his cognitive starting point’. The focused element within the (final) phonological phrase (in a sequence) is designated as nuclear within the higher-level intonational phrase domain and assigned one of a number of pitch accents identified for Maltese. Additional material can follow this focused element outside the phonological phrase but within the same intonational phrase. This gets assigned a phrase accent rather than a pitch accent.
In Maltese, two phrase accents, defined following Grice et al. (Reference Grice, Ladd and Arvaniti2000, p. 180) as ‘edge tones with a secondary association to an ordinary tone-bearing unit’, have also been identified. These phrase accents (Vella, Reference Vella1995, Reference Vella2003, Reference Vella and Fabri2009a, Reference Vella, Comrie, Fabri, Hume, Mifsud, Stolz and Vanhove2009b) have a secondary association to a lexically stressed syllable in the stretch of speech which follows nuclear material, by definition post-nuclear and post-focal [-focus] following Gussenhoven’s definition (see above). Vella (Reference Vella1995)’s claim is that these phrase accents in Maltese are confined to [-focus] stretches of speech which are extrametrical to the phonological phrase while nevertheless occurring within the intonational phrase, and by virtue of this fact are followed by a tone associated with the intonational phrase boundary. A recent discussion of these accents is provided in Vella and Grice (Reference Vella and Grice2024). What is relevant in the context of this paper is that the phrase accents that occur on [-focus] stretches of speech are not considered to be prominence-lending in the same way that pitch accents are.
The placement of pitch accents in Maltese is influenced by a number of factors, including both syntactic structure and IS. As a result, when formulating the criteria for selecting target utterances, it was necessary to take these factors into account to ensure that they were held constant across the sets of experimental stimuli (see also Section 3.2.2). For example, in utterances in which a pitch accent is followed by a phrase accent, prosodic prominence is signalled by the pitch accent, which occurs relatively early in the utterance, while the following phrase accent signals only a secondary rather than a primary prominence.
The relative constituent order flexibility that characterises Maltese, as well as other structural conditions such as the use of pronominal clitics at the end of verbs, also often lead to the [+focus] element in an utterance, which gets assigned the main sentence accent, coming relatively early in the utterance (Vella, Reference Vella2003). As mentioned above, the [+focus] element, irrespective of whether it comes late or early, is the one which carries the nuclear pitch accent and is considered prominent.
Thus, for example, SEna, ‘years’ is the [+focus] element in Example A.2.12 in Appendix A.2 from the list of examples used in our study,Footnote 1 Jien għandi dsatax il-SEna, ‘I have nineteen years’ (more idiomatically, ‘I am nineteen years old’). In this case, there is no material following the prominent element that comes at the end of the sentence. By contrast, the [+focus] and therefore prominent element, in Example A.1.13 in Appendix A.1, Sal-aħħar tax-XAHAR għandhom ‘Till the end of the month they have’, is XAHAR ‘month’. This is the last word in the phonological phrase sal-aħħar tax-XAHAR, but it is followed post-nuclearly by għandhom, which, following Vella (Reference Vella1995), is deemed to fall outside the phonological phrase, though within the intonational phrase, and is thus ineligible to carry the prominent main sentence accent. An (unmarked) SVC(omplement) version of this, Għandhom sal-aħħar tax-xahar, ‘They have till the end of the month’, without a following post-nuclear stretch, would of course also be possible. But the version of this sentence in our dataset starts with the complement rather than the verb. This is the most ‘important’ information, the information that the speaker wants to ensure is conveyed to their interlocutor. The pro-dropped verb għandhom ‘they have’, which follows the complement in this sentence, is given information, which is backgrounded in terms of IS. It is [-focus] and hence gets assigned a phrase accent rather than a pitch accent, marking a secondary rather than a primary prosodic prominence (see Vella, Reference Vella1995, Reference Vella2003).
To conclude, while prosodic prominence in Maltese has a tendency to occur late, for example, in Example A.2.12 in Appendix A.2 discussed above, it is also possible for prominence to occur earlier than the final word in the phonological phrase. As mentioned above, Paggio et al. (Reference Paggio, Galea, Vella, Gatt and Paggio2018) have shown that fronted complements in Maltese often involve both a pitch accent, which by definition implies prosodic prominence, and a hand gesture. Example A.1.13 in Appendix A.1, also discussed above, is a good illustration of a sentence in which prominence occurs relatively early.
In view of the complexity in prominence relations arising from the interplay between prosodic structure and prominence in Maltese, in selecting our stimuli, we tried to keep examples in which constituent order is not SVO or SVC to an absolute minimum. This means that sentence accent in our examples generally occurs on the final element within the sentence, although this final element could have the lexical stress on the final, penultimate or antepenultimate syllable. There are a few exceptions, such as Example A.1.13 in Appendix A.1.
3. Methods
An experiment was carried out to test our hypotheses, which, to repeat, are (i) if a target word has been produced with an accompanying manual gesture, a listener will hear it as more prominent than a word that was produced without a gesture and (ii) if the listener can also see the accompanying gesture, the perceived prominence of the target word will be even stronger. In this section, we explain the methods used in the design, conduct and analysis of the experiment.
3.1. Participants
A convenience sampleFootnote 2 from the student (and staff) population at the University of Malta participated in the study. A total of 95 participants (51 male and 44 female) took part in the study for a monetary compensation. Participants were asked to provide their age bracket, with 49 being 18–26 years old, 30 being 26–35 years old and a further 11, 3 and 2 participants in the age brackets 36–56, 46–59 and 60+, respectively.
Participants self-rated their proficiency in Maltese and English on a scale from 0 to 4 for the four language skills (listening, speaking, reading and writing). As Table 1 shows, participants identified themselves as highly proficient in Maltese and English, although slightly less so in Maltese writing. This reflects the fact that students mostly write in English, while Maltese tends to be their preference in other modalities (Vella, Reference Vella2013).
Table 1. Mean (SD) of participants’ self-rated proficiency (scale 0–4) in Maltese and English

3.2. Materials
We used the MAMCO corpus of video-recorded Maltese conversations (Paggio et al., Reference Paggio, Galea, Vella, Gatt and Paggio2018; Paggio & Vella, Reference Paggio and Vella2014) to find examples containing words bearing prominence that were or were not accompanied by a hand gesture. The MAMCO corpus is the first multimodal resource involving Maltese conversational data. It consists of 12 video-recorded conversations between 12 speakers (6 females and 6 males), each taking part in two different short (∼5 min) dialogues in a recording studio. All participants are Maltese-dominant speakers. They had not met prior to the experiment. They were asked to speak freely and try to get to know each other. They were recorded by three different cameras while standing facing each other. For this study, we employed the recordings from the two cameras that captured each participant from a semi-frontal angle. For each dialogue, the recordings from these two cameras were later edited together into a single video. An example is shown in Figure 1.

Figure 1. Semi-frontal speaker view from one of the MAMCO conversations.
Lapel microphones were used to record the audio. After the recordings, participants were asked to fill in a survey and sign a consent form. All gave their permission for the resulting data collection to be used for research purposes. The setting and general organisation used to collect the corpus replicate those used in the Nordic NOMCO corpus (Paggio et al., Reference Paggio, Allwood, Ahlsén, Jokinen and Navarretta2010), a choice that was motivated by the wish to allow future comparative analyses of the way gestures are used in different languages.
All twelve conversations were orthographically transcribed with word boundaries. An acoustic analysis of the data was also carried out where required for the present study (see Section 3.2.5).
3.2.1. Gesture annotation in MAMCO
All head movements and a random selection of hand gestures were annotated prior to this study following the MUMIN coding scheme for gestural annotation (Allwood et al., Reference Allwood, Cerrato, Jokinen, Navarretta, Paggio, Martin, Paggio, Kuehnlein, Stiefelhagen and Pianesi2007). The coding scheme defines formal and functional attributes for various types of movements. For instance, a hand movement can be one-handed or two-handed, and it can be annotated as symbolic, iconic or indexical following Peirce’s semiotic categories (Peirce, Reference Peirce2009). Within the indexical category, a distinction is made between deictic and non-deictic, the latter corresponding to gestures that only display a beat-like dimension, that is, a rapid biphasic movement excursion. Gestures are annotated as a single time interval, that is, including the preparatory phase, without marking internal components.
3.2.2. Extracting and further annotating the stimuli
We extracted 60 short sentence stimuli from the MAMCO corpus, all involving a target word carrying sentence accent; the target word mostly occurs in the final part of a sentence. In half of the stimuli, the target word is accompanied by a hand gesture, the ‘withGesture’ condition. This means that the stroke of the gesture co-occurs with the word. Since the stroke was not indicated in the original annotations, it had to be identified by the research assistant choosing the examples. Hand gestures that were not already coded were also annotated following the scheme described earlier. Semantic aspects depicted in the gesture did not play a role in deciding the alignment.
In the remaining half of the stimuli, the ‘noGesture’ condition, no hand gestures are present.
The stimuli (together with 30 fillers) were taken from speech produced by ten of the twelve speakers in the corpus. Two research assistants were tasked with choosing and extracting the example sentences, and two senior researchers validated them, both in terms of the choice of examples and clip editing. The complete sentence list is provided in the Appendix.
An example where the accented word is accompanied by a hand gesture is shown in Figure 2 (Example A.1.7 in Appendix A.1): the speaker on the right says kelli subSIdiary area ‘I had a subSIdiary area’ and makes a hand gesture that is aligned with the accented syllable SI. In this and other illustrative examples in the text, the stressed syllable of the target word is shown in capitals. Note that in the stimuli shown to the experiment participants, faces were not blurred. They are blurred here for reasons of privacy protection.

Figure 2. Video frame from an example.
It was difficult to find stimuli in which the alignment between the target word and the gesture was clearly identifiable, and which also met a number of syntactic constraints aimed at reducing syntactic variation (see Section 3.2.3). Therefore, we did not limit the form and function of the gestures. The final selection consisted of 18 one-handed and 12 two-handed gestures. As for the semiotic type, 17 are indexical and non-deictic, 6 are deictic, and the remaining 7 are iconic.
3.2.3. Reducing variability in the stimuli
Since our knowledge of the effects of sentence modality and IS in Maltese is, to date, quite limited, and given the small size of the dataset, we made the decision to restrict our stimuli with the aim of reducing syntactic diversity and phonological confounds. The following conditions served as a guide to the choice of stimuli:
• The stimulus is a relatively short independent clause.
• The main sentence accent is on a content word.
• The clause is not a negative statement.
• The clause is not a question.
• The clause generally has an SVO or SVC order, with few exceptions.
• There are no contrastive accents.
• The speaker does not move their hands, with the exception of the gesture accompanying the target word.
• There is no overlap between the speakers.
As mentioned above, these conditions were purposely meant to limit syntactic diversity in our dataset. The sentences nevertheless show wide lexical variation. In particular, the target words differ in almost all examples. This is intentional, since our goal was to verify, for naturally occurring speech, results that other studies obtained using controlled experimental stimuli, which draw conclusions based only on a comparison of the same target vowel in different conditions (Bosker & Peeters, Reference Bosker and Peeters2021; Krahmer & Swerts, Reference Krahmer and Swerts2007).
Care was taken, as mentioned earlier, not to include stimuli (both with and without accompanying gestures) where contrastive accents were present. We did not, however, otherwise control for differences in IS. Both sets of stimuli include examples displaying a broad focus on a final constituent or on the entire sentence, and others with a narrow focus on the phrase containing the target word.Footnote 3 The distribution of the two IS patterns, however, is very similar in the two types of stimulus, with 22 examples of broad focus and 8 of narrow focus in the ‘withGesture’ condition, whereas the counts are 24 and 6, respectively, in the ‘noGesture’ one. In addition, the stimuli were presented to the experiment participants (see below) without any preceding context, and the IS of the individual stimulus is therefore difficult, if not impossible, for them to determine out of context. We therefore did not include IS as a factor in our analyses.
A set of 30 fillers was also identified and used in the experiment. These are, in most respects, similar to the experimental stimuli. In these sentences, however, the sentence accent may fall on a word anywhere in the sentence, with or without one or more accompanying hand gestures.
3.2.4. Preparing the stimuli
Since noise levels differed between clips, we used the noise-reduction functionality in Audacity (version 2.4.2) to equalise noise levels across files. To do this, we separated the video from the audio stream using VirtualDub2 (www.virtualdub.com). In order to generate a noise profile for each speaker, we identified and used a stretch of silence of at least 1 s from the channel of that speaker from the relevant MAMCO corpus recording. This noise profile was then used in the noise reduction algorithm in Audacity. In order to avoid sudden visual onsets, we used MATLAB to generate a linear 400-ms fade-in and fade-out for the video files, and zero-padded the noise-reduced audio files with the same amount of silence. The processed files were then recombined using VirtualDub2 and then encoded as MP4 files.
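The fade and padding steps were implemented in MATLAB; purely as an illustration, the same two operations on a mono signal could look as follows in R (the function name, the sample rate and the assumption that the signal is a plain numeric vector of samples are ours, not part of the original pipeline):

```r
# Illustrative sketch (not the authors' MATLAB code): apply a 400-ms linear
# fade-in and fade-out to a mono signal and zero-pad it by the same amount.
# `audio` is assumed to be a numeric vector of samples longer than twice the
# fade length.
apply_fade_and_pad <- function(audio, sample_rate = 44100, fade_ms = 400) {
  n_fade <- round(sample_rate * fade_ms / 1000)            # samples in the ramp
  ramp   <- seq(0, 1, length.out = n_fade)                 # linear 0 -> 1 ramp
  n      <- length(audio)
  audio[1:n_fade]           <- audio[1:n_fade] * ramp                 # fade in
  audio[(n - n_fade + 1):n] <- audio[(n - n_fade + 1):n] * rev(ramp)  # fade out
  c(rep(0, n_fade), audio, rep(0, n_fade))                 # silence padding
}
```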
3.2.5. Acoustic properties of target vowels
Here, we give an account of a number of relevant acoustic properties of the target words with a view to getting a better understanding of whether some of these properties are influenced by the presence of a co-occurring gesture. This information will be important for interpreting the results of the perceptual experiment described below.
We start by showing counts of the vowels in the accented syllables of the target words in Table 2.
Table 2. Count of vowels in the accented syllables of the examples analysed, grouped by degree of openness

We have grouped these vowels according to their degree of openness because openness is known to influence F$_0$ and loudness, parameters that are also relevant for prosody. We know, in fact, that listeners take into account vowel-intrinsic pitch (higher vowels have a higher intrinsic pitch) when judging the prosodic prominence of a syllable (Fowler & Brown, Reference Fowler and Brown1997). Note that the vowel ‘æ’, not a Maltese vowel, has also been included in the list, since a vowel having this quality is found in these data, mostly in loanwords from English, such as clan and jazz.
Previous studies have observed effects of co-speech gesture on some of the acoustic properties of the associated stressed syllable, particularly duration (Krahmer & Swerts, Reference Krahmer and Swerts2007), one of the higher formants (F$_2$) (Bernardis & Gentilucci, Reference Bernardis and Gentilucci2006; Krahmer & Swerts, Reference Krahmer and Swerts2007) and fundamental frequency (F$_0$) (Ambrazaitis & House, Reference Ambrazaitis and House2022). A biomechanical explanation for such effects, linking hand movements to movements of the respiratory system, has been suggested by Pouw et al. (Reference Pouw, Harrison and Dixon2020a) and Pouw et al. (Reference Pouw, Trujillo and Dixon2020b).
Unlike the studies that found an effect of F$_2$, our target words contained a wide variety of vowels (see again Table 2). Therefore, measuring F$_1$, F$_2$ and F$_3$ would not have been useful because the dataset is too small to differentiate the effects of vowel quality and prominence on the formant measures. As a consequence, in our acoustic analysis, we focus on duration, pitch and intensity.
In order to conduct the analysis, the orthographic transcription had to be word-aligned. Therefore, the Munich Automatic Segmentation System (Kisler et al., Reference Kisler, Reichel and Schiel2017) was run on all the sound files, and the output was manually corrected. An example of the resulting word-by-word segmentation can be seen in the tier below the oscillogram in Figure 3. The stressed vowel of the target word was segmented by hand in a separate tier below the tier containing the word-by-word segmentation of the text.

Figure 3. Example stimulus (‘withGesture’) with, starting in the topmost part of the figure, intensity (in blue), pitch (in red) and the oscillogram.
Praat (Boersma & Weenink, Reference Boersma and Weenink2022) was then used to manually extract average pitch and intensity values of the stressed vowel in the target word of each example sentence. Due to the fact that slightly divergent values may be computed by the tool in different runs, values were averaged across three different runs. Average sentence pitch and intensity values were then obtained automatically by means of a Praat script that extracts pitch and intensity values at steps of 0.01 s. The script used a pitch floor of 75 Hz and a pitch ceiling of 400 Hz, values that were deemed adequate for the Maltese speakers in the dataset used. Values below or above this range were ignored.
Since a number of different speakers were involved, a comparison of the absolute values would not have been meaningful. The measures were normalised by subtracting from them the average pitch and intensity values of the entire sentence in which the word occurs.
For instance, in the example quoted earlier, where the speaker says kelli subSIdiary area ‘I had a subSIdiary area’ and makes a hand gesture that is aligned with the accented syllable SI, the pitch and intensity of the vowel are 142.93 Hz and 73.86 dB, whereas the mean pitch and intensity for the entire sentence are 133.23 Hz and 68.01 dB, respectively. In the case of both pitch and intensity, therefore, average values in this example are higher for the stressed vowel compared with those for the sentence as a whole. The acoustic properties of the example are visualised in Figure 3.
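For transparency, the normalisation in this example amounts to a simple subtraction for intensity and, for pitch, a conversion of the ratio of the two Hz values into semitones; a small R sketch with the figures quoted above follows (the 12·log2 semitone formula is our assumption about the exact conversion applied, not a detail given in the text):

```r
# Normalised values for the subSIdiary example above; the semitone conversion
# 12 * log2(f_vowel / f_sentence) is assumed, not taken from the paper.
vowel_f0  <- 142.93   # Hz, stressed vowel
sent_f0   <- 133.23   # Hz, sentence mean
vowel_db  <- 73.86    # dB, stressed vowel
sent_db   <- 68.01    # dB, sentence mean

pitch_diff_st  <- 12 * log2(vowel_f0 / sent_f0)  # ~1.22 ST above the sentence mean
intensity_diff <- vowel_db - sent_db             # 5.85 dB above the sentence mean
```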
Table 3 displays the normalised mean (SD) pitch and intensity values of the stressed vowel in the target words in the 30 ‘noGesture’ examples and the 30 ‘withGesture’ examples. Note that a positive value means that the value of the accented vowel is higher than the average value of the relevant sentence, and vice versa for negative values. The difference between the pitch of the target vowel and the average sentence pitch is expressed in semitones (ST). The table also reports the average duration of the accented vowels in both conditions.
Table 3. Mean (SD) values for pitch, intensity and duration of the stressed vowels in the target words

Note: Pitch and intensity values were normalised by subtracting the mean sentence value for each example. The pitch value is expressed in semitones.
As can be seen, there is no notable difference in vowel duration in the two conditions. This is not very surprising since our dataset included target words containing phonemically short and long vowels as well as diphthongs, so differences in duration due to these sources of variation may well have hidden any possible effect of co-speech gesture. The differences in pitch and intensity between the target vowel and the sentence, by contrast, are larger if a gesture is present. The distribution of the differences for both types of measures is shown in the boxplots in Figure 4, with the distribution for the ‘noGesture’ examples displayed on the left in both plots.

Figure 4. Acoustic differences between the target vowel and the sentence.
Two mixed-effects models were tested to predict (i) the difference between the target vowel pitch and the average pitch of the corresponding sentence and (ii) the difference between the target vowel intensity and the average intensity of the corresponding sentence. In both cases, the difference in acoustic property, either in pitch or in intensity, is the dependent variable, and the condition (‘noGesture’ vs. ‘withGesture’) is used as a fixed factor. We attempted to use both speaker and vowel as factors in both cases. However, the effect structure was too complex to be supported by the data. Therefore, vowel openness rather than the full range of vowels was used as an additional fixed effect, while speaker was added as a random effect.
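A minimal sketch of how the two acoustic models could be specified with lmerTest is given below; the data frame and column names (`acoustic`, `pitch_diff`, `intensity_diff`, `gesture`, `openness`, `speaker`) are placeholders of ours, not the variable names actually used in the analysis.

```r
library(lmerTest)

# Sketch of the two acoustic models: gesture condition and vowel openness as
# fixed effects, speaker as a random intercept; one model per normalised measure.
m_pitch     <- lmer(pitch_diff     ~ gesture + openness + (1 | speaker), data = acoustic)
m_intensity <- lmer(intensity_diff ~ gesture + openness + (1 | speaker), data = acoustic)

summary(m_pitch)      # coefficients as reported in Table 4
summary(m_intensity)  # coefficients as reported in Table 5
```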
The coefficients of these models can be inspected in Table 4 for pitch and Table 5 for intensity. We see that the effect of gesture in the case of intensity only approaches significance, whereas it is significant in the prediction of pitch. If the gesture is present, the difference in pitch between the target vowel and the sentence average increases. This preliminary exploration of some of the acoustic properties of the target words in the presence of a co-occurring gesture suggests some interesting avenues for more in-depth study. The variability in the naturally occurring stimuli does not, however, lend itself well to further analysis at this point.
Table 4. Predicting vowel pitch from gesture and vowel openness: model coefficients; significant p values are shown in boldface

Table 5. Predicting vowel intensity from gesture and vowel openness: model coefficients; significant p values are shown in boldface

In general, however, the finding that target words accompanied by gestures had stronger cues to prosodic prominence than target words without accompanying gestures makes it reasonable to ask whether these stronger cues influence prominence perception.
3.3. Procedure
The stimuli described in Section 3.2 were used to build the experiment for this study, which was run online using JSPsych (de Leeuw, Reference de Leeuw2015). All materials used, including the explanations, were presented to participants in Maltese. The experiment began with a short questionnaire asking participants about their age, gender, language background, self-rated proficiency and patterns of use. This was followed by a training phase, in which they were exposed to six examples based on two sentences.
First, they heard three versions of the sentence Jien qed nistudja l-mediċina ‘I am studying medicine’: one with the pronoun ‘I’ carrying the sentence accent; another with the sentence accent on ‘medicine’ but with broad focus, signalling a ‘neutral’ rendering; and a third with a sentence accent again on ‘medicine’, but this time with narrow focus. In these utterances, l-mediċina was successively longer, louder and higher in maximum pitch as we moved from the version with an accent on Jien, to the unmarked utterance, and finally to the version with an accent on l-mediċina. Moreover, in the utterances with a narrow focus on either Jien or l-mediċina, there was a clear pitch accent on that element. Participants were told that in the first case, the word l-mediċina might be rated as one or two on a scale of importance/prominence from one to seven, as either three or four in the second case, and five or six in the third. Note that we are using a wider scale than is usually employed in linguistic coding of prosodic prominence, for example, in Krahmer and Swerts (Reference Krahmer and Swerts2007), since a scale with more levels may make it easier to reach a decision (as argued for segmental perception by Apfelbaum et al., Reference Apfelbaum, Kutlu, McMurray and Kapnoula2022).
Next, participants were asked to rate three versions of the sentence Iz-ziju għandu kelb tal-kaċċa ‘My uncle has a hunting dog’ (literally ‘The uncle has a dog for hunting’). They were asked to rate the prominence of kelb tal-kaċċa ‘hunting dog’ when it was in narrow focus, when the sentence was pronounced signalling broad focus, that is, with a ‘neutral’ rendering, and when iz-ziju ‘the uncle’ was in narrow focus. Participants saw the whole sentence written on the screen, with the element they had to judge in terms of prominence highlighted in colour. After listening to the sentence, they could rate the prominence of the highlighted element on a scale from 1 to 7, or press the ‘R’ key to listen to the whole sentence again.
After this practice phase, the main experiment started. Participants were asked to listen to a number of Maltese sentences and rate the prominence of a target word. All participants were exposed to both audio-only and audiovisual conditions. Each trial started with a visual display of the whole sentence in written form, together with a prompt requiring them to indicate how important they considered the target element in that sentence. An example is shown below:
Il-frażi hija: ‘għandi xi ħbieb minn Architecture’.
‘The sentence is: “I have some friends who study Architecture”’.
Kemm hi importanti: ‘Architecture’?
‘How important is: “Architecture”?’
The display also mentioned that the audio or video clip could be replayed by pressing ‘R’ if participants were unsure of their rating. We gave participants the option to repeatedly hear the stimuli, because speech stimuli from a spontaneous speech corpus taken out of context are often difficult to process (Brouwer et al., Reference Brouwer, Mitterer and Huettig2013). This differs from what has been done in other studies (e.g., Krahmer & Swerts, Reference Krahmer and Swerts2007) that use a fixed utterance, usually recorded carefully as read speech. Participants saw the sentence for 2.5 s before the audio or video clip started playing. The experiment only proceeded to the next trial when participants had rated the importance of the highlighted word, with an inter-trial interval of 50 ms.
The experiment was blocked by modality condition, so that participants first rated 45 sentences (15 ‘withGesture’ stimuli, 15 ‘noGesture’ stimuli and 15 fillers) presented either as audio-only or audiovisual stimuli, and then 45 stimuli (again 15 × 3) in the other condition. Participants were given the opportunity to take a short break after one block of 45 trials. Most participants finished the study in 20–30 min. Block order was counterbalanced across participants, and the stimuli used in each condition were also counterbalanced across participants, so that each stimulus appeared equally often in the audiovisual and audio-only conditions in the complete dataset.
3.4. Statistical analysis
The data were analysed using a linear mixed-effects model to accommodate the fact that we have a sample of participants and a sample of stimuli (Westfall et al., Reference Westfall, Kenny and Judd2014). We used the lmerTest package (v3.1-3) in R (v4.1.3). The data and analysis code are available on the Open Science Framework.
The dependent variable is prominence rating. The analysis used stimuli and participants as random factors and started with a maximal random-effect structure (but note that the factor Gesture is between stimuli, rendering a random slope of Gesture over stimuli nonsensical). Due to convergence issues, we removed the correlations between random effects and the random slope of the interaction over participants. The fixed factors were Modality (audio-only vs. audiovisual) and Gesture (‘withGesture’ vs. ‘noGesture’) and their interaction. The fixed factors were contrast coded (Modality: audio-only = −0.5, audiovisual = 0.5; Gesture: ‘noGesture’ = −0.5, ‘withGesture’ = 0.5). With this contrast coding, the regression weights show the mean differences between the conditions, and interactions and main effects are linearly independent (i.e., the interaction does not influence the main effects).
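Putting the pieces of this description together, a sketch of the final model specification might look as follows; the data frame and column names are our own placeholders, and the exact random-effects structure is our reconstruction of the simplifications described above rather than the authors’ code (which is available on the Open Science Framework).

```r
library(lmerTest)

# Contrast coding as described in the text.
ratings$modality <- ifelse(ratings$modality == "audiovisual", 0.5, -0.5)
ratings$gesture  <- ifelse(ratings$gesture  == "withGesture", 0.5, -0.5)

# Sketch of the final model: by-participant slopes for Modality and Gesture
# (no slope for their interaction, no random-effect correlations) and a
# by-stimulus slope for Modality only, since Gesture varies between stimuli.
m <- lmer(
  rating ~ modality * gesture +
    (1 + modality + gesture || participant) +
    (1 + modality || stimulus),
  data = ratings
)
summary(m)  # fixed-effect estimates correspond to Table 6
```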
4. Results
We first investigated whether the task of rating the importance of a given word in a sentence succeeded in getting participants to rate the prosodic prominence of the utterances, by analysing the responses obtained in the training phase. As indicated in the description of the procedure, participants rated three versions of Iz-ziju għandu kelb tal-kaċċa ‘My uncle has a hunting dog’ (literally ‘The uncle has a dog for hunting’), one with a broad focus and two with a narrow focus on either Iz-ziju or kelb tal-kaċċa. The participants rated the ‘importance’ of kelb tal-kaċċa as 1.16 if there was a narrow focus on Iz-ziju, as 3.57 if there was a broad focus and as 4.98 if there was a narrow focus on kelb tal-kaċċa. With a standard deviation of around 1.3 across these items, this represents a large effect. Unsurprisingly, this effect was significant in a repeated-measures ANOVA, p < 0.001. This indicates that the task elicited responses that reflect the cueing of prosodic prominence.
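The test reported here is a repeated-measures ANOVA over the three training versions of the sentence; as an illustration only, it could be run as follows in R, with `training` and its column names as assumed placeholders.

```r
# Sketch of the repeated-measures ANOVA on the training-phase ratings, with one
# rating per participant for each of the three focus versions of the sentence.
training$focus       <- factor(training$focus)        # subject-focus / broad / object-focus
training$participant <- factor(training$participant)

rm_anova <- aov(rating ~ focus + Error(participant / focus), data = training)
summary(rm_anova)  # the effect of focus version was significant, p < 0.001
```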
Figure 5 shows the mean prominence ratings for all six conditions (including the fillers). Target words in ‘noGesture’ stimuli were generally rated as less prominent than those in the ‘withGesture’ stimuli. There is no overall effect of Modality, as, in the ‘noGesture’ condition, audio-only stimuli receive higher ratings than audiovisual stimuli, while the reverse is observed in the ‘withGesture’ condition. This suggests an interaction between the factors based on the descriptive data.

Figure 5. Mean prominence ratings for all six conditions (including the fillers) with error bars based on the method of Morey (Reference Morey2008).
The linear mixed-effects model (see Table 6) fully confirms only the first two observations from the descriptive data, namely that there is a clear effect of the Gesture condition and no effect of the Modality condition. The third observation was that the effect of having a gesture is larger in the audiovisual modality, but the corresponding interaction is only marginally significant.
Table 6. Results from the linear mixed-effects model of the prominence ratings; significant p values are shown in boldface

Our first hypothesis was confirmed, in that the model found a clear effect of the Gesture condition: target words accompanied by a gesture tend to be rated as more prominent by listeners. However, the interaction of Modality by Gesture, which would indicate that seeing gestures enhances the perceived prominence, only approaches significance; thus, contrary to what our second hypothesis predicted, the small additional effect of the audiovisual modality does not reach significance.
At the request of a reviewer, we performed an exploratory analysis of whether the order of the blocks (audiovisual then audio-only vs. audio-only then audiovisual) influenced the ratings. To that end, we added to the model a contrast-coded predictor for block order interacting with the other two factors. Comparing this model with the initial model that did not take into account block order indicated that block order did not affect the pattern of results ($\chi^2(4) = 5.504$, $p = 0.239$). Moreover, none of the individual beta weights for the order factor were close to significance (min(p) = 0.12).
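The comparison described above corresponds to a likelihood-ratio test between the original model and one extended with the order predictor; a sketch, again with assumed variable names and continuing from the model `m` sketched in Section 3.4:

```r
# Exploratory block-order check: add a contrast-coded order predictor that
# interacts with the other two factors, then compare by likelihood-ratio test.
ratings$order <- ifelse(ratings$block_order == "audiovisual_first", 0.5, -0.5)

m_order <- update(m, . ~ . + order * modality * gesture)
anova(m, m_order)  # chi-square test on the four added parameters
```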
5. Discussion
As we have seen, in our experiment, the presence of an accompanying hand gesture can increase the strength of some acoustic cues to the prominence of target words, and consequently their perceived prominence for a listener. In spite of this, neither an effect of modality (audio-only vs. audiovisual) nor an interaction between gesture condition and modality was found. While words accompanied by a gesture were rated by listeners as more prominent than those without a co-occurring gesture, the prominence difference between these two sets of stimuli was not enhanced in the audiovisual condition. This suggests that words that are accompanied by gestures are also more prominent in the auditory domain and that the visibility of the gestures does not further increase the perceived prominence, at least not in a significant way. The failure to find a significant effect of the audiovisual modality contrasts with the results of Bosker and Peeters (Reference Bosker and Peeters2021), who found clear evidence of an effect due to beat gestures with a smaller sample of participants and items (though items were used repeatedly in their experiments). However, our results do not necessarily contradict Bosker and Peeters’s. The stimuli in the Bosker and Peeters study were acoustically ambiguous, and even with a matching gesture, they were never perceived as unambiguously prominent on the first or second syllable. Our stimuli, in contrast, were selected because they are produced in a prominent fashion, and for such stimuli, there might be diminishing effects of the visibility of gestures, given that the auditory modality is relatively strong in and of itself. Therefore, our results are probably more in line with a small additional effect of gestures on prominence perception (given acoustic prominence to begin with) than with a null effect.
The acoustic analysis of the stimuli shows that the presence of gestures has some effect on the realisation of prominence, as also observed by others (Bernardis & Gentilucci, Reference Bernardis and Gentilucci2006; Krahmer & Swerts, Reference Krahmer and Swerts2007). The difference in intensity between target words and the rest of the sentence is larger, on average, when words are accompanied by gestures than when they are not. However, the effect of gesture presence on intensity only approaches significance. In contrast, we found a small effect of gesture co-occurrence on pitch: on average, the difference between the pitch of the vowel and the mean pitch of the sentence is larger when a gesture is present than when it is not. However, the normalisation method we applied is a simple difference of pitch values between the target vowel and the sentence average. A more fine-grained measure that captures F$_0$ movement in the vicinity of the stressed vowel might yield a different result. Finally, no effect was found for duration.
Thus, our results diverge somewhat from the findings reported in Krahmer and Swerts (Reference Krahmer and Swerts2007), who found significant effects on target words due to duration and F$_2$ values. Their analysis, however, contrasts two target words in the same sentence rather than one target word and the associated sentence as we do. Furthermore, only one vowel is represented in the materials on which that study is based. Since, in contrast, a range of different vowels occur in our examples, it made little sense for us to even look at formants.
One could ask whether the small effect on pitch that we observe when speakers also produce a gesture on the target word is actually caused by the gesture. A causal relationship has been suggested in connection with arm and wrist movements, for example, by Pouw et al. (Reference Pouw, de Jonge-Hoekstra, Harrison, Paxton and Dixon2021), who found that participants asked to retell a cartoon in three different gestural conditions (no gesture, wrist movement and larger arm movement) showed increased F$_0$ values in the two gestural conditions compared with the passive, no-gesture condition. This difference, however, was only significant for arm movements. More generally, a correlation between the presence of gestures and acoustic properties could have other sources, such as a common motor system, as suggested, for example, by Parrell et al. (Reference Parrell, Goldstein, Lee and Byrd2014).
While this study has managed to show, in naturally occurring multimodal data in Maltese, that the perceived auditory prominence of a target word is greater if a hand gesture is produced by the speaker while uttering the word, we must also note that extracting stimuli from naturally occurring data may have created a number of confounds, making the audiovisual condition not totally ‘clean’, and thus diminishing the likelihood of observing a potential effect.
A potential source of complexity, for example, could be due to different types of pitch accents. Ambrazaitis and House (Reference Ambrazaitis and House2022) discussed how different types of lexical accents in Swedish are affected differently by the presence of visual beats, with larger effects seen on accentual rises than on falls. We cannot exclude that such differences may also hold for our data, although care was taken to ensure that target words in the stimuli used in the experiment were restricted to the part of the utterance carrying the nuclear pitch accent rather than the phrase accents typical on post-nuclear stretches of speech in Maltese (see Vella, Reference Vella1995, Reference Vella2003).
Another source of variation we did not control for relates to different properties of the hand gestures. All the gestures in the examples have a beat-like quality, but about 40% of them also have a deictic component or iconic properties, and both one-handed and two-handed gestures are present. Although no detailed analysis of the kinematic characteristics of the gestures was conducted to investigate the possible impact of gesture size and velocity, there is probably variation in all these dimensions. The overall complexity may have diluted the general effect on prominence perception.
6. Conclusion and future work
Our study has shown that, on the one hand, the perceived auditory prominence of a target word is greater if a hand gesture is produced by the speaker while uttering the word. Importantly, our study uses examples from natural conversational data to confirm effects established previously on the basis of highly controlled multimodal stimuli (Bolinger, Reference Bolinger1986; Krahmer & Swerts, Reference Krahmer and Swerts2004) or read text (Ambrazaitis et al., Reference Ambrazaitis, Frid and House2020a; Ambrazaitis & House, Reference Ambrazaitis and House2022). On the other hand, however, whether the participants see the gesture or not does not make a significant difference in our study. An interpretation of these two results leads us to suggest that gestures appear to provide an additional cue to prominence, but are not in themselves a necessary prominence cue for the listener.
A number of follow-up studies could be carried out to further corroborate and understand the results presented here. One possibility is certainly that of looking in more detail at the individual stimuli used in the experiment, to see whether other factors may be at play in the perception of prominence in the different conditions. One such factor that certainly merits some probing is constituent order, in particular the fact that the use of certain syntactic constructions in some of the examples may have an additive effect on prominence perception.
Another direction would be to study in more depth how prominence perception is affected by different types of gestures, in terms of both their functions and their kinematic characteristics. Yet another would be to investigate the effect of gestures on the perception of prominence where the sentence accent is not placed towards the end of a sentence, or where it is combined with a phrase accent following the principles explained in Section 2.2. However, it is likely to be a challenge to find enough relevant examples of these more complex phenomena in naturally occurring data, so it may be necessary to resort to elicited experimental data.
It would also be interesting to extend our investigation to multimodal stimuli involving facial beats. It must be noted, however, that in the design phase of our study we searched for relevant examples of head movements co-occurring with sentence accents, but it proved impossible to isolate a sufficient number of short sentences in which only one head movement was present. One possible way around this problem would be to have artificial agents perform our original stimuli with the addition of head movements, following a methodology employed in several other studies to test the perception of multimodal signals (Heylen et al., Reference Heylen, Bevacqua, Pelachaud, Poggi, Gratch and Schröder2011; House et al., Reference House, Beskow and Granström2001; Prieto et al., Reference Prieto, Puglesi, Borràs-Comes, Arroyo and Blat2015).
Acknowledgements
We acknowledge the work of our research assistant, Amanda Muscat. We also thank the reviewers for their useful comments.
Funding statement
This work was supported by the University of Malta’s Research Seed Fund.
A. Appendix
In all the examples below, the syllable containing the main sentence accent is capitalised. In the examples with gestures, the hand gesture stroke is aligned with the accented syllable. The order of constituents in the original Maltese examples is retained in the translations provided below, although this may make them sound unidiomatic in English. In the examples involving pro-drop, the pronoun is placed in square brackets in the translation; the same holds for the verb in sentences that are verbless in Maltese. In a few of the translations, elements that may help increase idiomaticity are also included in square brackets.
A.1. Examples with gestures
1. ee jiena min-naħa tas-SOUTH
‘yeah I’m from the south side’
2. aħna nibdew mis-SEcond year
‘we start in the second year’
3. sixth year insiru avuKAti
‘[in the] sixth year [we] become lawyers’
4. t’id tkun avuKAT
‘[you] have to be a lawyer’
5. jien kelli em IngLIŻ
‘[I] had um English’
6. kelli interMEdiate
‘[I] had intermediate’
7. kelli subSIdiary area
‘[I] had a subsidiary area’
8. għandi naq’a probLEma
‘[I] have a bit of a problem’
9. nħobb ħafna nSAjar ukoll
‘[I] love a lot cooking as well’
10. għamilna naq’a fiGOLli
‘[we] made some figolli’
11. l’aqwa li m’hemmx xagħar ABjad u
‘the most important thing is that there is no white hair and’
12. il-mużika BAXxa
‘the music [is] low’
13. sal-aħħar tax-XAHAR għandhom
‘till the end of the month [they] have’
14. imma kelli IngLIŻ
‘but [I] had English’
15. di’ waħda minnhom is-celeBRAtions allura jew?
‘this [was] one of them the celebrations so or?’
16. ħafna nies it-teQUIla jdejjaqhom ħafna
‘many people the tequila [it] bothers them a lot’
17. taraha bilQIEGĦda
‘[you] see her sitting’
18. le quantu tal-kwalità tal-MUżika
‘no depending on the quality of the music’
19. diffiċli social LIFE għalina
‘[it is] difficult social life for us’
20. inħobb immur NIEkol
‘[I] love going to eat’
21. inħobb immur nara FILM
‘[I] love going to watch a film’
22. inħobb noħroġ JIEna ‘iġifieri
‘[I] love to go out me so’
23. infatti jkollna CLAN
‘in fact [we] have a clan’
24. minn Ħal Tarxien, xagħarha NOKkli
‘from Ħal Tarxien, her hair [is] curly’
25. jiena iktar volontarJAT
‘I [do] more voluntary work’
26. jien KAren
‘I [am] Karen’
27. qiegħda l-universiTA’
‘[I] am at the university’
28. il-ballu li GĦADda kien
‘the ball that passed it was’
29. kien ikun hemm il-BASketball
‘[there] used to be the basketball’
30. tista’ ma’ tagħmel XEJN
‘[you] can do nothing’
A.2. Examples without gestures
1. fis-sajf ikun l-aħjar Wied il-GĦAJN
‘in the summer [it] is best [at] Wied il-Għajn’
2. emozzjoniJIET qiegħda nagħmel
‘emotions [I] am working on’
3. minn Ħ’AtTARD jien
‘from Ħ’Attard I [am]’
4. għandi xi ħbieb minn ArchiTECture
‘[I] have some friends from Architecture’
5. fih ukoll ee ĦAmes snin
‘[it] has as well eh five years’
6. jien qed nagħmel ArchiTECture
‘I am doing Architecture’
7. mort xi tlett SNIN sentej’
‘[I] went about three years two years’
8. qed issa t-tiel’t SEna u
‘[I] am now [in] the third year and’
9. ijja kelli BŻONN jien
‘yes [I] needed I’
10. jiena gradwajt sa third year LIġi
‘I graduated until third year law’
11. qed nagħmel kriminoloĠIja
‘[I] am doing criminology’
12. jien għandi dsatax-il SEna
‘I have nineteen years’
13. jiena qalli sieħbi NIġi imma
‘I [he] told me my friend to come but’
14. jie’ ħa nagħlaq twenty three dis-SEna
‘I will be twenty three this year’
15. haw’ sabiħ IMma haw’
‘here [is] nice but here’
16. qed niftehmu fil-PARty
‘[we] are agreeing in the party’
17. vera tajba kienu FL-AĦħar
‘really good [they] were in the end’
18. tagħmilha l-LUmi jiġ’ieri
‘make it the lemon so’
19. llum il-ġurnata ma tantx ikonna kunċerti tal-JAZZ
‘today the day [we] do not have many jazz concerts’
20. rari ssib xi ħadd BĦAlek
‘[it is] rare to find someone like you’
21. kont għamilt ħin nisTENna bilwiefqa
‘[I] had done [some] time waiting standing’
22. le hekk sabiħ uKOLL
‘no like that [it is] nice too’
23. qed nistudja accounts u BANking
‘[I] am studying accounts and banking’
24. qiegħda second YEAR
‘[I] am [in] second year’
25. ħabibti tmu’ masqueRADE
‘my friend goes [to] Masquerade’
26. tri’ tkun deĊIża
‘[you] must be definite’
27. fil-verita’ fil-BIdu jidħlu ħafna
‘in truth in the beginning [they] enter a lot’
28. em għandi TWENty
‘em [I] have twenty’
29. intom kontu tagħmlu l-ĠLIED
‘you used to make fights’
30. konna għamilna żmien ħbi’b SEW
‘[we] had made [some] time good friends’
A.3. Fillers
1. ee minn Ħaż-ŻABbar
‘eh from Ħaż-Żabbar’
2. forsi għandna xi ħbieb in KOmuni mela
‘maybe [we] have some friends in common then’
3. noqGĦOD Ħaż-Żabbar
‘[I] live [in] Ħaż-Żabbar’
4. hemm l-istatWA eħe
‘there [is] the statue yes’
5. toqgħod targuMENta qisek avukat
‘[you] stay arguing like [a] lawyer’
6. ĊEMpluli mill-fakultà
‘[they] called me from the faculty’
7. jkun haw’ wisq SĦANA
‘[it] is too much heat’
8. meta ikolli ċans għax iFHIMni
‘when [I] have chance because [you] understand me’
9. użgur għax, m’hemmx GĦALfejn BEd
‘of course because, there is no need [for a] bed’
10. qed jBATtu n-nies
‘[they] are decreasing the people’
11. JIEna Karen
‘I [am] Karen’
12. aħjar milli għaLAQT
‘better than [it] closing’
13. la lesta TAJtha
‘since [it is] ready [I] gave it’
14. għaMILT tlett ijiem
‘[I] did three days’
15. disserTAtion ukoll
‘dissertation as well’
16. le PARty bil-
‘no party with’
17. morna naraw OPra
‘[we] went to see [an] opera’
18. meta tpoġġi għandek LUSsu
‘when [you] sit [you] have luxury’
19. għax bħalissa BROKE
‘because right now [I am] broke’
20. daw’ id-diski tal-PARties jew hekk
‘these songs of [the] parties or that’
21. twenty TWO ħa nagħlaq. F’June propjament.
‘twenty two [I] will be. In June actually’
22. jiddependi ĦAFna mi… miċ-ċirkustanzi
‘[it] depends a lot o… on the circumstances’
23. eżatt allura tiġini awtoMAtika il-ħaġa
‘exactly so [it] comes automatic[ally] the thing’
24. kull m’għandi ERbgħa
‘all [I] have is four’
25. t’id tifFOkka
‘[you] have to focus’
26. kont immur MasqueRADE
‘[I] used to go to Masquerade’
27. F’GĦAxar snin mhux tilħaq
‘in ten years not [you] don’t reach’
28. għamilna reUNion milux
‘[we] did a reunion not long ago’
29. għax propj’ment għandu BAby
‘because actually [he] has a baby’
30. m’għadx hemm dik il-ĦEĠġa ta’ futbol
‘[there] isn’t there still that enthusiasm of football’