1. Introduction
Dialogue systems are intelligent computer programs that can engage humans in natural conversational interactions. Usually empowered by automatic speech recognition (ASR) and natural language processing technologies, along with machine learning techniques, these systems provide learners with interaction opportunities ranging from highly constrained tasks, such as read-aloud and elicited imitation, to more spontaneous and contextualized interactions, such as role-plays and paired discussions (e.g. Gokturk & Chukharev-Hudilainen, 2023). The advent of generative artificial intelligence (GenAI) has ushered in a new era of dialogue systems that support autonomous, multi-turn, and coherent conversations. Debates have arisen about GenAI applications in language learning, teaching, assessment, and policymaking (e.g. Voss et al., 2023). We conducted the present study to offer a holistic portrayal of dialogue system applications for advancing L2 speaking skills, using a multilevel quantitative meta-analysis. Our study builds on the existing body of knowledge to help the field prepare for the substantial challenges that GenAI applications pose in educational contexts, especially for L2 speaking, an area that remains under-researched.
Using these intelligent systems for speaking practice offers noticeable advantages in both affective and cognitive domains. The low-stakes, non-judgmental practice environment with a virtual interlocutor helps alleviate L2 learners’ speaking anxiety and enhances their willingness to communicate (Jeon, 2024; Kohnke, 2023; Shafiee Rad, 2024; Tai & Chen, 2023). Beyond these affective advantages, dialogue systems can also function as effective self-learning tools, providing real-time multimodal corrective feedback (Petersen, 2010; Tai, 2022), an authentic speaking environment (Hwang et al., 2022), and ample interactive exercises (Hsu, Chen & Yu, 2023). A well-trained dialogue system can also provide learners with increased exposure to high-quality oral input (see Gokturk & Chukharev-Hudilainen, 2023, for a typical dialogue system architecture).
Dialogue systems can thus shape the development of L2 speaking abilities. However, their effectiveness appears to vary with factors such as system delivery mode, learning environment, and corrective feedback provision, and research findings remain inconclusive. For instance, while single-mode voice-based or text-based chatbots showed advantages for speaking (e.g. Hsu, Chen & Yu, 2023; Kim, Kim & Cha, 2021; Tai, 2022), other studies found multimodal feedback presentation to be superior for promoting speaking proficiency (e.g. Hwang et al., 2022; Liu, Hwang & Su, 2024). Another example concerns the use of dialogue systems in formal versus informal learning contexts. While many studies suggested that dialogue systems are effective in in-class instructional contexts (Dizon, 2020; Kim, Kim & Cha, 2021), other research highlighted their particular value as an extension of in-class teaching for L2 speaking development (Liu, Hwang & Su, 2024; Tai, 2022). Additionally, as dialogue systems leverage different technologies that influence user speech production, the interaction tasks used for speaking practice vary. It remains unclear which types of dialogue systems or interaction tasks are more conducive to developing L2 speaking proficiency. These findings underscore the importance of a comprehensive review that synthesizes the existing literature and identifies gaps in our understanding, enabling more effective strategies for using dialogue systems to enhance speaking proficiency.
Many narrative reviews have explored the application of dialogue systems in language learning. Ji, Han and Ko (2023) examined the collaboration between conversational AIs and teachers. Huang, Hew and Fryer (2022) synthesized chatbot affordances from technological, pedagogical, and social perspectives. Bibauw, François and Desmet (2019) discussed definitions and research trends of dialogue-based computer-assisted language learning (CALL) systems, while Litman, Strik and Lim (2018) reviewed speech technologies used for language assessment. Quantitative meta-analyses have also investigated the effectiveness of chatbots (Lee & Hwang, 2022; Zhang et al., 2023), social robots (Lee & Lee, 2022), and dialogue systems (Bibauw, Van den Noortgate et al., 2022) for general language learning, and of ASR specifically for L2 pronunciation (Ngo, Chen & Lai, 2024). To the best of our knowledge, however, no meta-analysis has focused on L2 speaking development. This paucity in the L2 speaking domain can be partially explained by the complexity of coordinating experiments using dialogue systems in learning contexts (Bibauw, François & Desmet, 2019). The diverse terminology and technology used for dialogue systems also make data collection and coding from primary studies more difficult. To address this issue, we adopted the umbrella term dialogue-based CALL proposed by Bibauw, François and Desmet (2019), which refers to learners’ use of any system or application to engage in a dialogue with an automated interlocutor in an L2. This term encompasses all forms of dialogue systems that emphasize the interactive process with virtual agents, as the specific typology of dialogue systems is not the focal point of the present review.
A meta-analysis focused on L2 speaking in dialogue-based CALL is therefore essential. Unlike other language skills, speaking requires interactive engagement and immediate feedback, both of which are central to dialogue-based learning environments. By concentrating on L2 speaking within these systems, this study offers a comprehensive review to evaluate their effectiveness, identify trends, and highlight gaps, all of which can guide future research and pedagogical practices. The present study aims to depict a general picture of research on dialogue-based CALL for L2 speaking development by synthesizing the overall effect size and the moderators affecting its effectiveness. To this end, we conducted a three-level meta-analysis addressing the following research questions (RQs):
- RQ1. To what extent do dialogue-based CALL systems promote L2 speaking development?
- RQ2. What are the significant moderator variables in using dialogue-based CALL for L2 speaking development?
- RQ3. To what extent do these moderators affect L2 speaking development?
2. Methodology
2.1 Literature search
To ensure broad inclusion of eligible research, we conducted an extensive literature search for studies indexed in well-known online databases, including Scopus, Taylor & Francis Online, and Web of Science. Additionally, dissertations were searched in ProQuest and CNKI China. Due to our language repertoire, only research published in English and Chinese was included. Two sets of search terms, one for technology and one for L2 speaking, were combined and applied. Informed by Bibauw, François and Desmet (2019) and Huang, Hew and Fryer’s (2022) reviews of dialogue-based CALL, the technology-related keywords were “chatbot*,” “intelligent personal assistant*,” “conversation* agent*,” “spoken dialog* system*,” and “spoken dialog* technolog*.” Search terms for L2 speaking included “speech*,” “oral,” “conversation*,” “interaction*,” “talk*,” “speak*,” and “language*.” A secondary search strategy was also applied by checking the reference lists of retrieved studies to include additional eligible research.
For all search procedures, we filtered the results by research field to computer science, linguistics, education research, arts and humanities, and social science. The search period ended in May 2023.
2.2 Inclusion and exclusion criteria
We defined the inclusion and exclusion criteria from three aspects: technology used, target language proficiency, and research design. Table 1 provides detailed descriptions of these criteria. For L2 speaking proficiency, a broad definition was adopted, including but not limited to conventional speaking proficiency from a psycholinguistic-individualist perspective (e.g. fluency, pronunciation, grammatical accuracy, lexical diversity) and from a sociolinguistic-interactional perspective (e.g. interactional competence; IC) (Roever & Kasper, 2018). Ultimately, we obtained 16 studies for coding and analysis. Figure 1 illustrates the search and inclusion process in a PRISMA flowchart.
Table 1. Inclusion and exclusion criteria


Figure 1. PRISMA flowchart of article search and selection (adapted from Page et al., 2021).
2.3 Coding scheme
We adopted a coding scheme from Plonsky and Oswald (2012) to explore the included research from three general perspectives: study context, design and treatment, and measures. Overall, 13 codes were investigated across these three categories. Specifically, for coding the dialogue systems, we adopted Bibauw, François and Desmet’s (2019) typologies of systems, interactions, and degree of constraint on meaning. The detailed coding scheme can be found in supplementary material S1.
2.4 Inter-coder reliability
The overall coding process involved several coding cycles by two independent researchers. Initially, agreement on the inclusion of effect sizes in the present analysis reached an ideal 94.16% (Mackey & Gass, 2016). The effect size calculations were then checked, with an acceptable 86.52% agreement. Any discrepancies in effect size inclusion and calculation were resolved through discussion between the coders, resulting in a final set of 89 effect sizes for the subsequent coding and analysis. Furthermore, in coding the 13 moderator variables, the average Cohen’s kappa coefficient was 0.93. This value falls within the “excellent” range of coding reliability (i.e. from 0.8 to 1), indicating a robust and dependable coding process (Mackey & Gass, 2016: 141).
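To make the reliability checks concrete, the following minimal sketch shows how percentage agreement and Cohen’s kappa of the kind reported above can be computed in R; the two coder vectors are hypothetical illustrations rather than our actual coding data, and the kappa2 function comes from the irr package.

```r
# Minimal sketch of the inter-coder reliability checks (illustrative data only).
library(irr)  # install.packages("irr") if needed

# Hypothetical codes assigned by the two coders to the same 20 items
coder1 <- c("task", "open", "task", "open", "task", "task", "open", "task",
            "open", "task", "task", "open", "task", "open", "task", "task",
            "open", "task", "open", "task")
coder2 <- c("task", "open", "task", "task", "task", "task", "open", "task",
            "open", "task", "task", "open", "task", "open", "task", "open",
            "open", "task", "open", "task")

# Simple percentage agreement
mean(coder1 == coder2) * 100

# Unweighted Cohen's kappa for the two coders
kappa2(data.frame(coder1, coder2))
```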
2.5 Effect size calculation and interpretation
We calculated effect sizes based on the standardized mean difference. To obtain an unbiased estimate in small-sample analyses, we used Hedges’s g (Hedges & Olkin, 2014), which corrects the overestimation that tends to occur with Cohen’s d (Cohen, 2013). To combine effect sizes across study designs, we adopted Morris and DeShon’s (2002) formulas to transform effect sizes into a comparable metric (see supplementary material S2 for details). We followed the field-specific guidelines of Plonsky and Oswald (2014), with Cohen’s d of about 0.40 interpreted as a small effect, 0.70 as a medium effect, and 1.0 as a large effect. These benchmarks for Cohen’s d apply directly to Hedges’s g, since g is simply a bias-corrected estimate that removes the small-sample overestimation.
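As a minimal sketch of the effect size computation, the code below calculates Hedges’s g and its sampling variance for a simple two-group post-test comparison using metafor’s escalc function; the descriptive statistics are hypothetical, and the Morris and DeShon (2002) conversions for pre-post and repeated-measures designs (supplementary material S2) are not reproduced here.

```r
# Minimal sketch: Hedges's g for a hypothetical two-group post-test design.
library(metafor)

dat <- data.frame(
  m1i = 78.2, sd1i = 9.5,  n1i = 25,   # treatment group post-test mean, SD, n
  m2i = 71.4, sd2i = 10.1, n2i = 24    # comparison group post-test mean, SD, n
)

# measure = "SMD" returns Hedges's g, i.e. Cohen's d multiplied by the
# small-sample correction factor, approximately 1 - 3 / (4 * (n1 + n2 - 2) - 1)
es <- escalc(measure = "SMD",
             m1i = m1i, sd1i = sd1i, n1i = n1i,
             m2i = m2i, sd2i = sd2i, n2i = n2i,
             data = dat)
es$yi  # Hedges's g
es$vi  # sampling variance of g
```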
2.6 Data analysis using a three-level meta-analytical model
We utilized a multilevel (three-level) meta-analytic model, a robust but not yet widely applied method in L2 studies, to account for the non-independence of the included effect sizes. Studies of dialogue-based CALL for L2 speaking development tend to contribute more than one effect size each, leading to correlations among the included effect sizes. This dependence threatens the validity of meta-analyses (Matt & Cook, 2009). A three-level meta-analytic model allows effect sizes to vary across participants (level 1, sampling variance), outcomes (level 2, within-study variance), and studies (level 3, between-study variance), yielding more precise estimates (Assink & Wibbelink, 2016). This approach also allows more effect sizes to be included, which increases the statistical power of the analysis. Additionally, as effect sizes are extracted from different outcome variables, including more of them provides opportunities to test more study characteristics (see Cheung, 2019, for further discussion).
We used R (R Core Team, 2018) for data analysis. The model was fitted with the rma.mv function of the metafor package (Viechtbauer, 2010), which fits multivariate/multilevel meta-analytic models that account for non-independence among effects/outcomes. We fitted a three-level random-effects model with this function, using restricted maximum likelihood (REML) estimation of the model parameters. This model accounts for heterogeneity in effect sizes both within and between studies, accommodating the nested structure of multiple effect sizes per study. Detailed information on the code, formulas, and a step-by-step guide to performing a three-level meta-analysis in R can be found in Assink and Wibbelink (2016) and Harrer et al. (2021). Four outliers were detected and winsorized to a cutoff of 2 standard deviations from the mean of all effect sizes (g low = −1.44, g up = 2.98) (Lipsey & Wilson, 2001: 108).
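The following sketch illustrates the winsorizing step and the three-level model specification with rma.mv; the data frame dat and its columns (yi, vi, study_id, es_id) are hypothetical placeholders for the coded data set, following the general approach described by Assink and Wibbelink (2016).

```r
# Minimal sketch of the outlier handling and the three-level random-effects model.
library(metafor)

# dat: one row per effect size, with Hedges's g (yi), its sampling variance (vi),
# a study identifier (study_id), and an effect size identifier (es_id)

# Winsorize outliers at 2 SDs from the mean of all effect sizes
cut_low <- mean(dat$yi) - 2 * sd(dat$yi)
cut_up  <- mean(dat$yi) + 2 * sd(dat$yi)
dat$yi  <- pmin(pmax(dat$yi, cut_low), cut_up)

# Effect sizes (level 2) nested within studies (level 3), estimated with REML
m_full <- rma.mv(yi, vi,
                 random = ~ 1 | study_id / es_id,
                 method = "REML",
                 data = dat)
summary(m_full)  # overall g, confidence interval, and variance components
```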
3. Results
3.1 Overall effect of the dialogue-based CALL systems on L2 speaking development
Table 2 provides the overall effect size and results of heterogeneity tests across levels. The estimated average effect was g = .61 (95% CI [0.34, 0.89]), indicating a significant medium effect of dialogue systems on L2 speaking development.
Table 2. Overall effect size and results of heterogeneity tests at different levels

Note. CI = confidence interval.
3.2 Heterogeneity and publication bias
As shown in Table 2, the result of the Q test was significant (p < .0001), suggesting significant variation in the outcomes of the primary studies and the need for moderator analyses. The estimated variance components were τ² = 0.18 at level 2 and τ² = 0.20 at level 3, indicating that I² (level 2) = 38.58% of the total variation can be attributed to within-study heterogeneity, whereas I² (level 3) = 42.45% can be attributed to between-study heterogeneity. In other words, more variation was observed at the between-study level of the model. Additionally, the three-level model provided a significantly better fit than a two-level model in which level 3 heterogeneity was constrained to zero (χ²(1) = 14.54; p = .0001).
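A minimal sketch of how these quantities can be obtained is shown below, continuing from the hypothetical m_full model and dat data frame above; the level-specific I² values follow the variance-apportioning approach described by Assink and Wibbelink (2016), and the model comparison refits the model with the between-study variance constrained to zero.

```r
# Minimal sketch of the heterogeneity breakdown and the two- vs three-level comparison.

# "Typical" within-study sampling variance (level 1)
w <- 1 / dat$vi
v_typical <- (length(w) - 1) * sum(w) / (sum(w)^2 - sum(w^2))

# Apportion total variance across the three levels
total_var <- m_full$sigma2[1] + m_full$sigma2[2] + v_typical
I2_level3 <- m_full$sigma2[1] / total_var  # between-study heterogeneity
I2_level2 <- m_full$sigma2[2] / total_var  # within-study heterogeneity

# Two-level comparison model: between-study (level 3) variance fixed to zero
m_reduced <- rma.mv(yi, vi,
                    random = ~ 1 | study_id / es_id,
                    sigma2 = c(0, NA),
                    method = "REML",
                    data = dat)
anova(m_full, m_reduced)  # likelihood ratio test of level 3 heterogeneity
```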
For the multilevel meta-analytic model employed here, no appropriate statistical test exists for quantifying the relationship between study size and effect size (i.e. publication bias) (Assink & Wibbelink, 2016). Consequently, a contour-enhanced funnel plot was used to assess this association visually, without a formal symmetry test (Figure 2). The asymmetrical plot indicates potential publication bias, with studies missing from the lower left of the funnel. However, sample size did not significantly moderate the effect (b = 0.00, 95% CI [−0.01, 0.02], p = .60), suggesting that larger studies did not produce more negative effect sizes than smaller ones. Therefore, the scarcity of strongly negative effects in the funnel plot likely reflects the true effect size distribution rather than publication bias. Tests using a traditional two-level meta-analytic model also indicated publication bias, but with minimal impact on the overall effect (see supplementary material S3 for details).

Figure 2. Contour-enhanced funnel plot of the standard error of Hedges’s g.
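The sketch below illustrates how a funnel plot of this kind and the sample-size check can be produced, again continuing from the hypothetical m_full model and dat data frame; the total sample size column n_total is an assumed placeholder name.

```r
# Minimal sketch of the publication bias checks.
library(metafor)

# Contour-enhanced funnel plot of Hedges's g against its standard error
funnel(m_full, level = c(90, 95, 99),
       shade = c("white", "gray75", "gray85"),
       refline = 0, legend = TRUE)

# Total sample size as a continuous moderator of the three-level model
m_n <- rma.mv(yi, vi,
              mods = ~ n_total,
              random = ~ 1 | study_id / es_id,
              method = "REML",
              data = dat)
summary(m_n)  # regression coefficient b for sample size
```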
3.3 Moderator analyses
Significant variation across the three levels necessitated moderator analyses for study context, study design and treatment, and measurement variables. A series of omnibus tests was used to examine whether each variable differed across subgroups. For categorical variables, we report a Q test indicating whether the moderator has a significant effect, together with the estimated Hedges’s g for each level. For continuous variables, namely the three codes of duration in weeks, number of sessions, and time per session, we report the regression coefficient b, which indicates the change in effect associated with one additional unit. Specifications of the included studies’ features and system designs are provided in supplementary material S4.
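For illustration, the sketch below shows how a categorical and a continuous moderator analysis can be specified with rma.mv; the moderator columns modality and weeks are hypothetical placeholder names for the coded variables.

```r
# Minimal sketch of the moderator analyses.
library(metafor)

# Categorical moderator: the omnibus QM test indicates whether subgroups differ
m_cat <- rma.mv(yi, vi,
                mods = ~ factor(modality),
                random = ~ 1 | study_id / es_id,
                method = "REML",
                data = dat)
summary(m_cat)

# Dropping the intercept yields the estimated Hedges's g for each subgroup
m_cat_levels <- rma.mv(yi, vi,
                       mods = ~ factor(modality) - 1,
                       random = ~ 1 | study_id / es_id,
                       method = "REML",
                       data = dat)
summary(m_cat_levels)

# Continuous moderator: b is the change in effect per additional unit (e.g. week)
m_weeks <- rma.mv(yi, vi,
                  mods = ~ weeks,
                  random = ~ 1 | study_id / es_id,
                  method = "REML",
                  data = dat)
summary(m_weeks)
```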
3.3.1 Study context
Table 3 presents the moderator analysis results for study context variables. Most research targeted higher education (k = 11, 69%), with limited investigation in K-12 settings (k = 5, 31%). However, the effect of dialogue systems on L2 speaking development does not differ significantly between K-12 and higher education contexts, with both exhibiting a similar medium effect. Concerning L2 proficiency, dialogue systems demonstrate effects for beginner and intermediate learners, although proficiency level does not significantly moderate the overall effect. Notably, beginner learners appear to benefit more, with a large effect size, compared to a medium effect for intermediate learners. For advanced learners, the results failed to reach significance, potentially due to the small sample drawn from a single study (n = 5, k = 1).
Table 3. Moderator analyses in data from study context

Note. CI = confidence interval; n = number of effect sizes; k = number of studies; g = Hedges’s g. *p < .05. **p < .01. ***p < .001.
3.3.2 Study design and treatment data
Table 4 reports the moderator analysis results for design and treatment data encompassing seven subgroups. Learning location does not significantly affect the results. Nevertheless, dialogue systems show effectiveness in both out-of-classroom and in-classroom learning contexts, with a noticeably large effect size in informal out-of-classroom learning scenarios. Similarly, the presence or absence of oral corrective feedback (CF) makes no significant difference. The mean effect of the system intervention reaches a medium-to-large level when CF is absent, while interventions incorporating CF demonstrate a medium effect on learners’ L2 speaking development.
Table 4. Moderator analyses in data from the design and treatment

Note. CI = confidence interval; n = number of effect sizes; k = number of studies; g = Hedges’s g. *p < .05. **p < .01. ***p < .001.
Looking at treatment duration, coded as overall duration in number of weeks, number of sessions, and task time per session in hours, none of these reaches statistical significance. However, they appear to influence the speaking outcome, as their positive regression coefficients suggest. Additionally, time per session in hours appears to be associated with higher outcomes than the other two. Interaction type is not a significant moderator either. Task-oriented interaction was predominantly implemented (k = 9, 56%), followed by open-ended interaction (k = 6, 38%). Both task-oriented and open-ended interactions differ significantly from the null effect, unlike system-guided interaction, probably due to the small sample size (n = 1, k = 12).
Significant differences emerge across system design features. Narrative systems were omitted due to a lack of applications. Most research utilized reactive (k = 9, 56%) and goal-oriented systems (k = 6, 38%). Goal-oriented systems exhibit a large effect, followed by a medium effect for reactive systems. Form-focused systems show a non-significant small effect, likely due to the limited number of studies analyzed. Regarding constraints on meaning, the results unsurprisingly mirrored those for system type, as system coding partially relied on the constraints placed on the meaning and form of learners’ production (see supplementary material S1). To avoid redundancy, the meaning constraint moderator was omitted from the table.
Lastly, system modality significantly impacts overall effectiveness. Most systems employed a mixed multimodal interface (k = 9, 56%), followed by voice-based systems (k = 7, 44%). Learners benefited most from the mixed mode, exhibiting a large effect size. Voice-based systems, relying solely on sound recognition and production, provide a small effect. Text-based systems demonstrate the lowest effect among the three, although the difference was not statistically significant.
3.3.3 Measures
Table 5 shows the mean effect sizes for two outcome variables: measure and speaking rating criteria. Holistic proficiency measures dominate the reports (n = 44, k = 15). No significant difference is found between effect sizes in terms of measure. While speaking performance graded with analytical scales shows a medium effect, holistic grading presents a large effect, with both differing from the null effect. Regarding rating criteria, the results show no significant difference across the five components. Although large effect sizes were obtained for task completion, vocabulary, and fluency, dialogue systems show a medium effect on pronunciation (g = .58). Their effect on grammatical accuracy in speaking remains undetermined, as this domain failed to reach significance. Since we found only one study measuring IC (Kim, 2017), conducting a moderator analysis would be biased; we therefore omitted this variable.
Table 5. Moderator analyses in data from measures

Note. CI = confidence interval; n = number of effect sizes; k = number of studies; g = Hedges’s g. *p < .05. **p < .01. ***p < .001.
4. Discussion
The present meta-analysis synthesized the results of 16 studies to assess the effectiveness of dialogue-based CALL systems in enhancing L2 speaking skills. The analysis also incorporated a set of moderator variables and examined their effects on L2 speaking development. The following section answers the three research questions by discussing the overall effectiveness of dialogue systems for L2 speaking development and the moderators influencing their effects.
4.1 The effectiveness of dialogue-based CALL (RQ1)
In general, dialogue-based CALL exhibited a significantly positive and medium effect on L2 learners’ speaking development. This finding is consistent with previous meta-analyses, albeit with a slightly larger effect (e.g. Bibauw, Van den Noortgate, et al., 2022; Zhang et al., 2023). The speaking gains observed are similar to those reported by Zhang et al. (2023) and Lee and Hwang (2022).
This effectiveness could be explained by several advantages of dialogue systems for speaking practice, including but not limited to their ability to (1) create continuing and meaningful interactive opportunities (Han, 2020; Hsu, Chen & Todd, 2023); (2) construct authentic speaking contexts (Hwang et al., 2022); (3) provide multimodal feedback (Tai & Chen, 2022); and (4) engage L2 learners in a stress-free, interactive environment (Hsu, Chen & Yu, 2023; Tai, 2022). As the realm of AI continues to evolve, future research can further explore the ubiquity, interactivity, and authenticity afforded by dialogue systems for enhancing speaking practice.
However, the effectiveness of dialogue systems for L2 speaking development is tempered by several methodological limitations, particularly small sample sizes. Nearly all the studies (k = 15, 94%) involved fewer than 60 learners, with the largest sample being 73 (Kim, Kim & Cha, 2021). Moreover, checking participants’ homogeneity before the trials was often overlooked. While some studies employed analysis of covariance (ANCOVA) with speaking pre-test scores as a covariate to address group differences (e.g. Hsu, Chen & Todd, 2023; Hwang et al., 2022; Yang, Lai & Chen, 2022), they did not examine the assumption of homogeneity of regression slopes. Assumptions for parametric analysis were rarely reported, with normality being the only assumption addressed, and in just one study (Yang, Lai & Chen, 2022). This lack of methodological rigor may lead to inaccurate results. Furthermore, studies using a mixed design frequently neglected to incorporate pre-test and post-test within-group repeated measures (k = 9, 70%), potentially leading to biased conclusions regarding intervention effectiveness. This also poses challenges for meta-analysis, as calculating the sampling variance of effect sizes often requires t-values from repeated measures. Future research should overcome these methodological limitations by using larger sample sizes, ensuring participant homogeneity, including pre-test and post-test repeated measures in mixed-design studies, and reporting the assumptions of parametric analyses.
4.2 Toward a comprehensive understanding of effective dialogue-based CALL for L2 speaking development (RQ2 and RQ3)
Beyond the established effectiveness of dialogue-based CALL for L2 speaking, the more crucial and useful questions concern when, where, how, and for whom this effectiveness can be realized. The current study addresses these questions through a comprehensive exploration of multiple moderators of dialogue system interventions. It is noteworthy that dialogue-based CALL remains a burgeoning area of research, as evidenced by the restricted sample sizes in the present and related reviews (e.g. Bibauw, François & Desmet, 2019; Zhang et al., 2023). Therefore, instead of drawing definitive conclusions, we hope the following findings stimulate greater research attention and provoke further testing and analysis.
4.2.1 Study context
Regarding educational stage, all stages showed a similar medium effect, and no significant difference was observed for this moderator. This finding aligns with previous results (e.g. Bibauw, Van den Noortgate, et al., 2022; Zhang et al., 2023), suggesting that dialogue systems benefit students across levels of education. While prior studies found an advantage for younger learners (e.g. Zhang et al., 2023), the present analysis had a limited representation of K-12 learners (k = 5, 31%), indicating insufficient empirical evidence. It is therefore too early to conclude that dialogue-based CALL favors L2 learners at specific educational stages. Further investigations are warranted to explore the applications of dialogue systems in relevant teaching contexts and potentially uncover differential effects across educational levels.
L2 proficiency was also a non-significant moderator, although lower-proficiency (A1/A2) and intermediate (B1/B2) learners demonstrated clear learning gains. This finding may be ascribed to the social and psychological support that the dialogue-based CALL environment offers less proficient L2 learners. Dialogue systems can offer multimodal feedback, enhancing comprehension and consequently production for less proficient learners (Tai, 2022). This advantage for less proficient L2 learners also coincides with Bibauw, Van den Noortgate, et al. (2022), who hypothesized a particular effect of dialogue-based CALL in the consolidation stages of learning. For advanced learners, dialogue systems appeared less effective (e.g. Kim, 2016; Tai, 2022), although the limited sample size precluded a significant effect. Tai (2022) posited that ASR technology sacrifices sentence length and complexity to maintain high recognition rates, resulting in interaction tasks that are less challenging for proficient L2 speakers. In contrast, Hsu, Chen and Todd (2023) observed better interaction experiences for advanced learners due to fluent conversation flow and fewer communication breakdowns resulting from adequate L2 proficiency. While supporting Bibauw, Van den Noortgate, et al.’s (2022) hypothesis regarding dialogue-based CALL’s effect in the early consolidation stage for less proficient learners, our findings indicate that learners’ experiences of interacting with virtual interlocutors vary across proficiency levels. We call for research investigating dialogue-based CALL with different participant populations, particularly considering potential communication breakdowns across proficiency groups.
4.2.2 Study design and treatment
Dialogue systems have shown effectiveness in both in-classroom and out-of-classroom settings for speaking practice. Integrating them with mobile devices allows dialogue-based CALL to enjoy the mobility, ubiquity, and flexibility of mobile-assisted language learning (MALL), characterized as anytime, anywhere learning (Kukulska-Hulme & Shield, 2008). This facilitates authentic, contextualized conversation by connecting knowledge with learners’ surroundings, using language in meaningful contexts, and stimulating learners’ interests (Hsu, Chen & Todd, 2023; Hwang et al., 2022; Tai, 2022). For instance, Tai (2022) encouraged out-of-classroom dialogue-based CALL activities as meaningful extensions of classroom learning. This integration raises the important issue of the teacher’s role in dialogue-based CALL. We agree with Ji, Han and Ko (2023) that collaboration between teachers and machines (i.e. dialogue systems) is pivotal for successful AI-integrated language learning. Dialogue systems could help teachers better allocate teaching resources in combination with classroom-based interactive practice (Tai, 2022). Teachers can also help maintain learners’ interest, especially when the novelty effect wears off (El Shazly, 2021). Compared with the rich exploration of dialogue systems’ role in helping learners with L2 learning (e.g. Kohnke, 2023; Tai & Chen, 2023), limited research investigates how language teachers can guide students during dialogue-based CALL. Stronger orchestration between teachers and technology will be required in future classrooms (Roschelle, Lester & Fusco, 2020). More studies, especially empirical research, are needed to explore teachers’ participation in dialogue-based CALL, from course design and practical teaching to class management and language assessment.
Second, the presence or absence of CF did not make a significant difference in L2 speaking practice. Notably, the effect sizes for the two conditions in our study are much closer to each other than those reported in Bibauw, Van den Noortgate, et al. (2022), which used more nuanced classifications of CF. This suggests that while we did not find a significant difference, there may still be potential benefits of CF that are not fully captured by our results. Therefore, this finding does not necessarily contradict the literature on CF’s benefits in L2 learning, particularly in traditional classrooms (e.g. Lyster, Saito & Sato, 2013; Nassaji & Kartchava, 2021) or in technology-enhanced learning environments for pronunciation and speech fluency (Gu et al., 2021; Ngo, Chen & Lai, 2024). The non-significant difference may stem from CF’s heterogeneous nature, characterized by disparate techniques, objectives, and instructional contexts across the studies. In dialogue-based CALL, providing CF is challenging, as it risks disrupting learner interaction and willingness to communicate (Hwang et al., 2022). System feedback in the form of silence or erroneous responses can prompt immediate self-correction among learners, particularly in pronunciation. In evaluating feedback of this nature, discerning its corrective intent is difficult, as such implicit instances of CF may be present but remain unreported. We therefore applied a dichotomous coding to distinguish studies with and without a clear report of corrective notifications (e.g. incorrect pronunciation notifications in Hsu, Chen & Yu, 2023; Tai, 2022) or CF moves during interaction (e.g. recasts in Petersen, 2010). This yes/no coding might have affected the finding. Given that dialogue-based CALL for speaking is still evolving, the specific use of CF across different intelligent dialogue systems warrants further empirical investigation. Future work should examine context-specific CF to explore its unique contributions to L2 speaking development in dialogue-based CALL, especially when GenAI-based systems deliver the CF.
Third, although it did not reach significance, intervention duration seems to affect the overall effectiveness of dialogue-based CALL for L2 speaking development, particularly through longer individual sessions. Studies in this domain often omit precise time-on-task data in favor of reporting overall session duration, leading to ambiguity and inconsistency in defining intervention length and frequency. This tendency may have contributed to the non-significant findings. Interestingly, while increasing the number of weeks or sessions shows relatively small effects, longer individual sessions seem to have a more pronounced impact. Bibauw, Van den Noortgate, et al. (2022) reported higher learning outcomes for dialogue-based CALL studies using packed practice. Together with our findings, this might imply that both frequency and depth of use matter for effective learning. When learners use the system frequently and each session is long enough for meaningful engagement, the combined effect may produce the best outcomes. Additionally, our findings could also indicate a novelty effect, whereby learners initially engage more deeply with the system but experience diminishing returns as they grow familiar with it over time. Given that the impact of intervention duration remains unclear, this potential novelty effect, observed in other technology-enhanced learning environments such as MALL (e.g. Tseng et al., 2022), warrants further investigation. While researchers should strive for greater control and clarity in reporting intervention duration, future studies should examine how both duration and frequency affect dialogue systems for L2 speaking development.
Fourth, our analysis reveals moderating effects of system design and of the meaning constraints on learners’ production, while interaction type does not differentially affect L2 speaking development. This impact suggests that different systems possess distinctive instructional and interactional value for L2 speaking development. Goal-oriented systems show advantages for L2 speaking development, backing Bibauw, François and Desmet’s (2022) claim that form-focused and goal-oriented systems offer the most promising affordances for language learning, while the impact of form-focused systems remains uncertain due to the limited number of studies involved. Goal-oriented systems place implicit meaning constraints on learner production, prompting learners to engage in dialogic interaction to achieve a specific goal. Unlike open-ended free dialogue, their interactional value lies in collaborative activity to accomplish a task, known as task-oriented interaction. Tasks for speaking development vary widely, from everyday transactions such as travel (Park, 2022) and daily life (Hwang et al., 2022) to exam-oriented tasks (Hsu, Chen & Yu, 2023). These tasks and systems give learners full interactivity and high user initiative in working toward predetermined learning goals. Compared to open-ended interactions facilitated by reactive systems, it appears that dialogue tasks for L2 speaking development are better set within a specific context or domain rather than left entirely to user discretion in unrestricted communication.
The effectiveness of contextualized interaction in goal-oriented systems can also explain the non-significant effect of interaction type. Some studies in the analysis employed reactive systems for task-oriented interaction activities, predominantly oriented around speech topics (e.g. Dizon, 2020; Tai & Chen, 2022; Yang, Lai & Chen, 2022). These tasks were more open-ended in nature, especially when conducted with intelligent personal assistants (i.e. reactive systems such as Google Assistant). Reactive systems operate solely in response to prompts or questions, providing limited contextual interaction. In contrast, goal-oriented systems incorporate tasks with diverse technological affordances such as CF (Hsu, Chen & Yu, 2023), virtual reality (VR) learning environments (Park, 2022), and a blend of controlled and free-speaking practice (Hwang et al., 2022). To effectively enhance L2 speaking, it is advisable to use goal-oriented systems for task-oriented interactions that guide the student through the steps required to accomplish tasks. Additionally, our findings diverge from Bibauw, Van den Noortgate, et al.’s (2022) view that learners benefit most from system-guided interactions and form-focused systems, in which the system guides learners through predetermined activities. This contrast also underscores the unique interactional and instructional demands of dialogue systems for L2 speaking practice, which are technologically more challenging to develop. With advances in complex dialogue management techniques, future research can further explore the design and application of systems varying in user control and interactivity levels.
Lastly, system modality emerged as a noteworthy moderator. Dialogue systems with a mixed modality, integrating voice, text, and additional channels such as VR, yielded a large and significant effect. This finding is consistent with the modality effect of dialogue systems for L2 speaking reported by Tai and Chen (2022). While learners prefer voice chatting over text chatting (Kim, Kim & Cha, 2021), a mixed written and spoken interface can increase the intelligibility of the interaction and thus facilitate optimal communication. For less proficient L2 learners, relying solely on auditory feedback from the system may lead to processing and retrieval difficulties, particularly in cases of miscommunication (Tai & Chen, 2022). Visual support in feedback, provided through screen displays or VR equipment, enables learners to pinpoint sources of miscommunication, thereby promoting self-directed learning and correction. Furthermore, multimodal feedback presentation can motivate learners to explore unknown information and enhance processing and comprehension (Tai & Chen, 2022). Similar significant effects of mixed interaction modes have been reported by Zhang et al. (2023) and Lee and Hwang (2022). As dialogue systems become more advanced, future research on L2 speaking practice should consider multimodal interfaces that combine various visual and auditory modes.
4.2.3 Measures
Overall, no significant difference is observed between speaking gains measured with holistic and analytical scales in dialogue-based CALL for L2 speaking development, although both indicate some effect. The prevalent use of holistic proficiency scales provides limited insight into specific areas such as pronunciation, grammar, and vocabulary. Future studies should explore dialogue systems’ effects on specific aspects of speaking, using more informative measures, including linguistic features (e.g. speech fluency, lexical diversity, or syntactic complexity) and interactional patterns (e.g. discourse markers, conversational repair strategies, and IC). Based on the limited reports on specific aspects of speaking, dialogue systems seem to improve fluency, pronunciation, task completion, and vocabulary, but not grammatical accuracy. Tai (2022) attributed the limited effectiveness of IPA-mediated interaction for grammar to technological constraints, particularly ASR’s struggle to accurately recognize longer sentences. Consequently, participants often use simpler grammatical structures to sustain conversational flow, focusing on meaning and fluency during free practice with a native-like virtual interlocutor. However, this finding contradicts Hwang et al. (2022), where free talk with a chatbot improved grammatical accuracy while no such effect was found for controlled talk practicing predetermined sentence structures. Given the insufficient empirical evidence, it is premature to draw firm conclusions about dialogue-based CALL’s effect on specific speaking areas. Nevertheless, dialogue systems seem to effectively enhance L2 learners’ vocabulary, speech fluency, pronunciation, and task completion; their impact on grammatical accuracy remains to be established.
It is also important to address the coding of language proficiency, particularly the distinction between the psycholinguistic-individualist and IC domains. Notably, only Kim (2017), which assessed speaking proficiency through the negotiation of meaning, relates to IC. Considering the non-equivalence of L2 speaking proficiency between the conventional psycholinguistic-individualist and IC domains (Roever & Ikeda, 2022), it is worth exploring the impact of dialogue-based CALL on learners’ IC. Moreover, as highlighted earlier, there is a noticeable research gap concerning the scarcity of studies targeting different proficiency levels. For studies of advanced learners in particular, utilizing intelligent dialogue systems to develop and assess their IC emerges as a promising avenue.
5. Conclusion
This meta-analysis aimed to present a general picture of the effect of dialogue-based CALL on L2 speaking development. After a stringent search and inclusion process, we identified 16 eligible studies. Results showed a moderate effect (g = .61) of dialogue systems on L2 speaking development. Three significant moderators of this effect were found: system type, the meaning constraints on learner production, and system modality. Learners benefit more when they use goal-oriented systems, which place implicit meaning constraints on learner production. Regarding system modality, mixed modalities are the most effective, highlighting the need to integrate visual and auditory modes.
The present analysis also yields several implications that highlight potential directions for future research. First, given that providing immediate, continuous feedback is one of the key features of current intelligent dialogue systems, it is important to discern the appropriate typology of CF within dialogue systems to ascertain its effectiveness. Second, future research can target the unique affordances of dialogue systems for learners across proficiency levels. Third, although it did not reach statistical significance, the contrast between in-classroom and out-of-classroom use prompts further investigation of dialogue-based CALL within the context of MALL to uncover its adaptivity and mobility. Additionally, understanding the role of teachers in both formal and informal learning contexts is paramount. Collaborative efforts between dialogue systems and language teachers warrant exploration, encompassing aspects such as course design, practical teaching, classroom management, and language assessment. Fourth, given the limited control of time-on-task in the field, there is a clear need for further investigation into the effects of intervention duration and frequency of dialogue system use on L2 speaking development. Lastly, the established effectiveness of goal-oriented systems suggests a future research agenda of developing task-based speaking activities that simulate real-life situations. Cross-disciplinary collaborations are encouraged to leverage advanced dialogue manager modules empowered by GenAI for highly contextualized interactions. To conclude, while the current analysis offers insights, the limited number of studies underscores that the use of dialogue systems for L2 speaking development remains a nascent field. With advancements in GenAI-powered dialogue systems, it is imperative to advocate for further research into the potential of dialogue-based CALL for enhancing speaking proficiency.
This study is not without limitations. The analysis falls short of representing the global spectrum of dialogue-based CALL systems for speaking proficiency. Its scope was confined by the researchers’ language repertoire, encompassing only empirical studies published in English and Chinese. Furthermore, the limited number of included studies underscores the lack of conclusive evidence, signifies the preliminary phase of this research domain, and thus limits the strength of the effects observed. Given this limited number of studies, the potential publication bias that might be indirectly reflected in the moderator analyses should also be noted. Additionally, this study only investigated journal publications, which may give an incomplete representation of the field. Future research could also consider other publication sources, such as conference proceedings and book chapters. Lastly, the omission of the search term “robot” also implies incomplete coverage of the field, given that contemporary educational robots often integrate dialogue systems to facilitate human-like interactions.
Supplementary material
To view supplementary material referred to in this article, please visit https://doi.org/10.1017/S0958344025100268
Data availability statement
Data available on request from the authors.
Acknowledgements
We would like to thank the anonymous reviewers for their insightful feedback on earlier drafts. Special thanks go to Professor Serge Bibauw for sharing the R code that informed our analysis, and to Jili Shen and Hui Wang for their support with the coding. Any remaining limitations are our own.
Authorship contribution statement
Zhuohan Hou: Conceptualization, Data curation, Formal analysis, Funding acquisition, Methodology, Writing – original draft, Writing – review & editing. Shangchao Min: Conceptualization, Funding acquisition, Methodology, Supervision, Writing – original draft, Writing – review & editing.
Funding disclosure statement
This research was supported by the Zhejiang Provincial Planning Office of Philosophy and Social Science [22ZJQN16YB] and Zhejiang Provincial Graduates’ Science and Technology Innovation Program [2023R401174].
Competing interests statement
The authors declare no competing interests.
Ethical statement
Ethical approval was not required.
GenAI use disclosure statement
ChatGPT (Version 4) was used to revise some of the sentences for clarity.
About the authors
Zhuohan Hou is currently a PhD candidate in Applied Linguistics at the School of International Studies, Zhejiang University, Hangzhou, China. Her research interests include speaking and listening assessment, computer-assisted language learning, and quantitative research methods. Her work has been published in peer-reviewed journals such as System.
Shangchao Min is a professor of Applied Linguistics at the School of International Studies, Zhejiang University, Hangzhou, China. Her research interests include language testing and assessment, educational measurement, and second language acquisition. She serves on the editorial boards of Language Testing and Language Assessment Quarterly. She has undertaken several research projects funded by the Ministry of Education and the National Social Science Foundation of China as a principal investigator.