Introduction
From a usage-based perspective, language exposure is essential for learning a second/foreign language (L2) (Ellis & Wulff, 2008). While formal classroom instruction continues to be an important source of input, L2 learners of English today are increasingly exposed to the target language outside the classroom through self-initiated, so-called Extramural English (EE) activities such as gaming, watching films/TV, listening to music, and social media engagement (Sundqvist, 2009). Recent studies show that such activities are not only becoming increasingly common but also serve as an effective complement to classroom learning by creating conditions for incidental and implicit language learning (e.g., Pfenninger, 2022; Zhang et al., 2021).
This effectiveness can be understood through the lens of self-regulated engagement, which encompasses behavioral, emotional, and cognitive dimensions (Fredricks et al., 2016), as well as through the core principles of learning beyond the classroom, including agency, motivation, and interaction (Richards, 2015). At the heart of both perspectives is the idea that learners take initiative and invest in language use in pursuit of personal goals or interests, during which language is acquired implicitly. The behavioral dimension of engagement is reflected in the frequency at which learners choose to participate in EE activities, demonstrating their agency to pursue language learning on their own terms. Emotional engagement, closely tied to motivation, is seen in the enjoyment learners derive from these activities, which in turn helps sustain their involvement over time (Sundqvist, 2019). In addition, EE activities often involve authentic input and collaborative interaction, requiring learners to process language in real time. This relates to the cognitive dimension of engagement, in which learners are challenged to interpret meaning, respond to feedback, and negotiate understanding, all processes known to support L2 development (Gass & Mackey, 2006; Pfenninger, 2022).
Indeed, research on EE has provided empirical evidence of its generally positive effects on L2 speaking (e.g., De Wilde et al., 2020, 2021; Hannibal Jensen, 2019; Sundqvist, 2009), as well as on reading and listening comprehension (De Wilde et al., 2020, 2021; Pfenninger & Singleton, 2017; Sylvén & Sundqvist, 2012; Verspoor et al., 2011). However, in some specific areas of language development, such as vocabulary, findings present a more mixed picture. While many studies have found a positive relationship between EE and L2 vocabulary development (e.g., De Wilde et al., 2020, 2021; Hannibal Jensen, 2017; Sundqvist, 2009, 2019; Sylvén & Sundqvist, 2012), others have reported less consistent results (e.g., Bollansée et al., 2020; Peters, 2018; Schwarz, 2020). These inconsistencies may be related to differences in learner groups (e.g., in terms of age, gender, proficiency), the specific aspects of vocabulary knowledge assessed (often through vocabulary tests), and the type of EE activity involved (see Kaatari et al., 2023, for a review), highlighting the need for further research into the conditions under which EE supports different dimensions of L2 vocabulary development.
The relationship between EE and grammar or writing skills has been less extensively explored, with some notable exceptions. These include Muñoz et al. (2018) on learners’ receptive English grammar skills, as well as Olsson (2012), Sundqvist (2019), Olin-Scheller and Wikström (2010), Pfenninger (2022), and Pfenninger and Wirtz (2024) on various aspects of writing proficiency. While based on relatively small datasets (in terms of the number of learners involved), these studies suggest that L2 written production is a promising area for further research into the effects of EE on L2 development. As Olsson (2012) rightly noted, advancing this line of inquiry will require large learner corpora that include not only texts produced by L2 learners, but also rich metadata on their EE exposure. The Swedish Learner English Corpus (SLEC; Kaatari et al., 2024) was created out of this need, enabling studies that have yielded insights into the relationship between EE activities and lexical complexity (Kaatari et al., 2023) and also the use of multiword units in L2 writing (Kim et al., 2025; Wang et al., 2025). However, while a valuable resource, SLEC only includes data from one regional context: Sweden.
In this data report, we present the Chinese Learner English Corpus (CLEC), designed as the L1 Chinese counterpart to SLEC, to help further enrich our understanding of the relationship between EE and L2 writing by incorporating a different educational and cultural context. Beyond its specific focus on EE, the shared design of SLEC and CLEC also facilitates contrastive interlanguage analysis (CIA) (Granger, 2015, 2024), thereby enhancing our understanding of the impact of various variables (such as L1 background, proficiency, and gender) on the learner output. In addition, like SLEC, CLEC responds to calls for expanding learner demographics in learner corpus research, particularly by including intermediate-level learners (Paquot & Plonsky, 2017, p. 87). Most existing learner corpora tend to focus on advanced learners of English; examples of well-known corpora of this kind include the International Corpus of Learner English (ICLE) (Granger et al., 2020), as well as more recent collections such as the Varieties of English for Specific Purposes dAtabase (VESPA) (Paquot et al., 2022), and the International Corpus Network of Asian Learners of English (ICNALE) (Ishikawa, 2023). Freely available corpora that target intermediate-level or younger learners remain scarce, with the International Corpus of Crosslinguistic Interlanguage (ICCI) (Hong, 2012) being one of the few available. In this context, corpora like CLEC thus constitute valuable additions to the field of learner corpus research.
In the remainder of the paper, we provide a detailed overview of the corpus design and compilation of CLEC, as well as its current composition. To demonstrate potential uses of the corpus, we also present two case studies. The first is a keyword analysis, which illustrates how the corpus can be used to examine how different learning contexts may influence content selection and topic development in L2 writing. The second case study involves a lexical bundle analysis, comparing gamers and nongamers within CLEC to showcase how the corpus can be used to explore whether engagement in this particular EE activity is reflected in the use of multiword combinations in L2 writing.
Design and compilation
Explicit design criteria are essential in learner corpus research to ensure interstudy comparability (Granger, 2024). As mentioned, the design and compilation process of CLEC closely follows that of SLEC, which in turn was developed with the intention of aligning, to a large extent, with existing well-designed learner corpora such as ICLE. This applies both to the procedures for collecting learner texts and to the use of a background questionnaire for gathering metadata.
However, an important distinguishing feature of the CLEC/SLEC design, compared to most existing corpora, including ICLE, is the use of a fixed writing task with a single topic. This design choice reflects growing attention in learner corpus and SLA research to the influence of task and topic on learner language, since “opportunity of use” for certain linguistic features can vary across different task-topic types (Caines & Buttery, 2019; see also Wang, 2016; Yoon, 2017). In the compilation of both CLEC and SLEC, students were asked to write an argumentative text on the same topic (“how to lead a good life”), based on an identical prompt. The prompt, provided in the Appendix (see also Kaatari et al., 2024), includes a mind map designed to help students consider factors relevant to the topic, along with instructions intended to elicit a prototypical argumentative text structure. To ensure comprehension, especially for younger learners in China, the prompt was translated into Chinese by three of the authors, all native speakers of Chinese.
To compile metadata for CLEC (i.e., information about the students), a questionnaire was used. It was adapted from the one used to compile SLEC and then translated into Chinese. The questionnaire captures learner and task variables such as L1, gender, school year, educational program, time allocated for the task, and whether a preparatory lesson was given. Regarding learners’ EE engagement, the SLEC questionnaire includes five activity types. Specifically, students were asked to estimate the number of hours they spend per week on the following: reading (books, newspapers, magazines), watching (TV shows or films), engaging in face-to-face conversation, using social media, and playing computer games involving communication. These activities were selected from the seven most extensively researched types of EE activities, as identified in a meta-analysis of the EE literature (see Zhang et al., 2021). CLEC expands on this by including all seven activity types, adding listening to music and personal or recreational writing to those covered in SLEC. These two categories were slightly adapted from Zhang et al. (2021) to better suit the learner context: listening to music replaces “listening to audio” to reduce ambiguity, and personal or recreational writing replaces “writing compositions,” which could be interpreted as school-assigned tasks. To gather more information about students’ EE engagement, we also included questions on what they watch, listen to, and read in English outside of school.
As some students take after-school English tutoring classes, which provide additional opportunities to practice English skills (including speaking, reading, and writing), we also included questions about their participation in such activities. Table 1 provides an overview of all the metadata recorded in CLEC as well as information on the circumstances for the writing task.
Table 1. Overview of the metadata included in CLEC

The corpus was compiled in collaboration with lower and upper secondary school teachers in China. Prior to the official roll-out, we conducted a pilot compilation to ensure clarity in both the prompt and the questionnaire. While no changes to the final documents were deemed necessary, the pilot helped identify some practical issues worth communicating to participating teachers. For instance, we learned that it is important to remind students to complete both the questionnaire and the writing task, and to ensure that both documents are collected and stapled together for each student to avoid any mix-ups during later processing.
All the texts were handwritten on paper in the classroom, with students having no access to digital tools or other language support resources. Teachers were encouraged to allocate a minimum of 50 minutes for data collection, as was done in SLEC. While most teachers spent between 50 and 100 minutes, this proved difficult for Year 7 students, who typically completed both the writing task and the questionnaire within one lesson (40–45 minutes).
After the data collection, we also interviewed four of the teachers involved to gain further insights into the educational and cultural background. We asked questions about the typical requirements for English writing at different levels, their approaches to writing instruction in the classroom, and their understanding of students’ EE engagement, including how much time students spend in school and the amount of homework typically assigned (as an estimate of their available free time), and whether teachers recommend any specific EE activities or materials. This information was used to contextualize the learner data in the case studies, and this approach to gathering supplementary insights may also inform future corpus compilation projects targeting similar learner populations.
To date, data have been collected primarily from four provinces: Hubei, Shaanxi, Shanxi, and Fujian. The handwritten texts were digitized by manually transcribing them into .txt files, and the accompanying metadata were entered into an Excel spreadsheet for further processing. Texts containing only a few words or sentences, or lacking certain background information, were excluded from the corpus (N = 190). The remaining texts were screened to anonymize any identifiable information.
Corpus description
CLEC is a monitor corpus, which means data will be continuously added to the corpus. In the first release, CLEC contains 828 texts with a total of 170,079 words. In this section, we present the current composition of CLEC, focusing on the distribution of texts by gender, school year, and students’ EE engagement.
Table 2 presents the distribution of students across gender and the word count for each subset. While there are more female than male students (55.4% vs. 44.6%), male students produced slightly longer texts on average (209.0 vs. 202.5 words).
Table 2. Distribution across gender

Table 3 presents the distribution of texts across school years and the word count for each subset. Year 11 students contributed the most to the corpus (28.6%), followed by Year 9 (20.5%) and Year 8 (19.4%) students. The average text length for Year 7 students is notably lower than that of the other groups, at around 130 words, whereas Year 10 students produced the longest texts, averaging approximately 240 words.
Table 3. Distribution across school years

According to the interviews with the teachers who assisted in our data collection, writing instruction in Year 7 focuses on various communication genres such as announcements, emails, and letters, with a typical length requirement of 60–80 words. In Years 8–9, students begin writing narratives (usually with a few guiding questions), and occasionally argumentative essays, typically ranging from 80 to 100 words. In upper secondary school (Years 10–12), these genres remain central, with slightly higher length requirements: 80 words for shorter communicative texts (e.g., letters) and 150 words for longer pieces (mostly narrative assignments, such as continuing a story). Given this context, the word counts in our corpus are higher than expected.
Figure 1 illustrates the distribution of students grouped by the time spent on each of the seven activities. Time categories from the questionnaire range from 0 to 9 hours, with additional categories for 10–15 hours, 16–20 hours, and over 20 hours. For clarity in the graph, the first ten categories are aggregated as 0 hours, 1–3 hours, 4–6 hours, and 7–9 hours. As shown in Figure 1, the majority of students spent limited time on most activities, with a strong concentration in the 0-hour and 1–3 hour categories. More than half of the students reported spending 0 hours on gaming (55%), conversation (53%), and writing in English (51%). In the 1–3 hour range, reading (64%), social media use (63%), and watching (61%) are the most common activities. These three activities also follow a similar pattern across the remaining time ranges, with approximately 10% of students reporting 4–6 hours per week and a very small proportion spending more than that. Listening stands out from the other activities in that students are more likely to engage in it for an extended time.

Figure 1. Time spent on EE activities.
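The aggregation of questionnaire time categories described for Figure 1 can be sketched in a few lines of Python. This is only an illustrative sketch, not the procedure actually used to produce the figure: the function names, the bin labels, and the assumption that responses are coded as integers 0 to 9 or as the strings "10-15", "16-20", and ">20" are all hypothetical.

```python
from collections import Counter

# Display bins used in the figure (labels are assumed for illustration).
BINS = ["0", "1-3", "4-6", "7-9", "10-15", "16-20", ">20"]


def aggregate_hours(response):
    """Map one questionnaire time category to its coarser display bin.

    Assumes responses are coded as integers 0..9 or as one of the
    strings "10-15", "16-20", ">20" (a hypothetical coding scheme).
    """
    if isinstance(response, int):
        if response == 0:
            return "0"
        if 1 <= response <= 3:
            return "1-3"
        if 4 <= response <= 6:
            return "4-6"
        if 7 <= response <= 9:
            return "7-9"
        raise ValueError(f"unexpected hour value: {response}")
    if response in ("10-15", "16-20", ">20"):
        return response
    raise ValueError(f"unexpected category: {response!r}")


def bin_shares(responses):
    """Return the percentage of students falling into each display bin."""
    counts = Counter(aggregate_hours(r) for r in responses)
    total = sum(counts.values())
    return {b: 100 * counts[b] / total for b in BINS}
```

Applied to the responses for one activity (for example, gaming), `bin_shares` would yield the kind of percentages reported in the text, such as the share of students in the 0-hour bin.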
The generally limited engagement with EE reported by the students aligns with the reflections of the teachers, who emphasized that students spend most of their time in school, leaving them with very limited free time. In particular, they noted the scarce opportunities for students to speak English outside of school. However, since speaking skills will be included in the national exam at the end of Year 9 starting in 2026, some students have begun dedicating more effort to these skills. Some teachers also mentioned encouraging students to subscribe to certain AI apps for practicing spoken English.
Table 4 lists the most popular films (or TV series), songs, and types of reading among the students. The films and TV series show a strong preference for iconic Hollywood blockbusters and popular TV shows like Harry Potter, The Avengers, and Friends. The songs listed feature a mix of global hits, predominantly Western, with a moderate presence of K-pop hits, reflecting the growing influence of K-pop music in recent years.
Table 4. Popular films/TV series, songs, and reading genres

The answers students provided regarding what they read outside of school are less straightforward to summarize, as some mentioned broad genres like “novel,” while others provided specific titles. Notably, the common types of reading include English-language newspapers in China such as China Daily and educational publications such as 21st Century and English Weekly. These sources target learners of English, presenting cultural and global topics in accessible English, often with incorporated language exercises. Graded reading series such as Bookworm and children’s biographies such as Who was/is…? are also popular. These books are designed specifically to improve learners’ reading skills, with the former featuring classic literature adapted for various proficiency levels, and the latter introducing young readers to famous historical and contemporary figures in a straightforward and engaging style. Together, these reading materials suggest a strong motivation for extramural reading aimed at skill development, rather than solely for leisure or pleasure. Our interviews with the teachers confirm that these commonly read titles are often recommended to students as complementary reading in preparation for exams. Beyond this, most students simply have little time for additional reading.
Case studies
In this section, we present two case studies to demonstrate potential uses of the corpus. The first study involves a keyword analysis to compare CLEC and its Swedish counterpart, SLEC. This analysis serves as an initial exploration of thematic, stylistic, and lexical differences between the two corpora, which could be attributed to different learning contexts (Collentine, 2004), including learners’ EE engagement.
Indeed, as illustrated in Figure 2, which compares CLEC and SLEC in terms of time spent on five of the EE activities, overall, Swedish students reported spending more time on all activities except reading. For instance, a smaller proportion of Swedish students reported spending 0 hours on gaming (36% vs. 55% in CLEC) and conversation (28% vs. 53%), indicating generally higher engagement in these activities. However, the opposite trend is observed for reading, where a greater proportion of Swedish students reported spending 0 hours (47%), compared to Chinese students (23%). In addition, a larger share of Chinese students reported spending 1–3 hours per week reading (64% vs. 32% in SLEC). For the other activities, Swedish students also tend to spend more extended time, particularly on social media use and watching English-language films or TV series. Given these discrepancies, the learning environments represented in CLEC and SLEC may exemplify two distinct contexts for L2 development well recognized in SLA research: the more traditional foreign language instruction in parts of China, and what may be seen as a domestic immersion context in Sweden (see, e.g., Köylü & Tracy-Ventura, 2022, for more on learning contexts for L2 development).

Figure 2. Time spent on EE activities: comparing CLEC and SLEC.
The second study examines the impact of gaming, a type of EE activity shown to contribute positively to L2 phraseology (e.g., Wang et al., 2025), focusing on lexical bundles.
Comparing CLEC and SLEC: a keyword analysis
Keywords are words that appear with an unusually high frequency in a given corpus compared to a reference corpus (Scott, 1997, p. 236). These words provide insights into the “aboutness and style” of the texts being analyzed (Baker, 2004, p. 347). Traditional keyness measures, such as Chi-square and Log-likelihood, rely solely on word frequencies and often generate extensive keyword lists, including many high-frequency function words or words concentrated in only a few texts, which may not meaningfully represent the corpus or its discourse domain (Baker, 2004; Egbert & Biber, 2019). To refine keyword analysis, researchers have increasingly incorporated dispersion as a key dimension, using measures such as DP (deviation of proportions) (Gries, 2008; see Biber et al., 2016; Gries, 2020, for an overview of dispersion measures). This study applies text dispersion keyness, a method developed by Egbert and Biber (2019) and implemented in AntConc (Version 4.2.0) (Anthony, 2022), to extract keywords from CLEC and SLEC. The analysis is based on a balanced sample of 500 texts drawn from each corpus, with 106,101 words from CLEC and 190,405 words from SLEC.
Keyword analysis requires a reference corpus, which may be a large, balanced general corpus or a directly comparable corpus, depending on the research questions (Baker, 2004). In the present study, the two learner corpora serve as each other’s reference corpus. This approach was chosen because our aim is to explore differences in thematic focus and lexical choices between two closely matched learner corpora. While using an external reference corpus for both learner corpora could also help identify differences, it would likely introduce variables not controlled for in the comparison between SLEC and CLEC.
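To make the idea behind text dispersion keyness concrete, the following Python sketch computes a log-likelihood statistic over the number of texts in which a word occurs, rather than over its token frequency. It rests on our reading of Egbert and Biber (2019) and is not a reimplementation of AntConc: the function name, the 2×2 contingency formulation, and the sign convention are assumptions made for illustration, and the tool's exact computation may differ.

```python
import math


def text_dispersion_keyness(range_t, n_texts_t, range_r, n_texts_r):
    """Signed log-likelihood (G2) based on text counts, not token counts.

    range_t / range_r: number of texts containing the word in the
    target / reference corpus. n_texts_t / n_texts_r: corpus sizes in
    texts (assumed positive). A positive value means the word is more
    widely dispersed in the target corpus (assumed sign convention).
    """
    observed = [
        [range_t, n_texts_t - range_t],  # target: texts with / without word
        [range_r, n_texts_r - range_r],  # reference: texts with / without word
    ]
    total = n_texts_t + n_texts_r
    row_totals = [n_texts_t, n_texts_r]
    col_totals = [range_t + range_r, total - range_t - range_r]

    g2 = 0.0
    for i in range(2):
        for j in range(2):
            o = observed[i][j]
            e = row_totals[i] * col_totals[j] / total
            if o > 0:  # empty cells contribute nothing to the sum
                g2 += o * math.log(o / e)
    g2 *= 2

    # Flip the sign when the word is proportionally rarer in the target corpus.
    if range_t / n_texts_t < range_r / n_texts_r:
        g2 = -g2
    return g2
```

With two samples of 500 texts each, a word found in 200 target texts but only 50 reference texts would receive a large positive score, whereas a word occurring in the same number of texts in both samples would score zero regardless of how often it is repeated within those texts.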
A total of 53 keywords were identified in CLEC, in comparison to 227 in SLEC. Table 5 presents the top 30 keywords from each corpus, ranked in descending order of keyness. It also includes the raw frequencies of these keywords and the number of texts where they occur, both in the target corpus (freq_t, range_t) and the reference corpus (freq_r, range_r).
Table 5. Top 30 keywords from CLEC and SLEC

One advantage of the text dispersion keyness method, as noted by Egbert and Biber (2019), is that it gives preference to content words over grammatical or function words. This is clearly reflected in the CLEC keyword list in Table 5. However, the SLEC list is still notably dominated by function words, highlighting the distinctiveness of certain grammatical elements in the Swedish learners’ writing. These include prepositions and particles (with, about, out, on, around), some of which (out, on, and around) are likely linked to phrasal verbs occurring more frequently in the Swedish learner corpus (e.g., stress out, figure out, work out, burn out, go on, keep on, stick around, fool around). The frequent occurrence of that and when may further suggest a preference among Swedish students for clause-heavy constructions. While such patterns fall outside the scope of this case study, they are potentially meaningful in connection with EE exposure and merit closer examination in future research.
Although the texts from both corpora were written on the same topic, the keywords revealed distinct thematic priorities between the two groups of students. The keywords in SLEC suggest an emphasis on training and sports (e.g., sport, gym, football, hockey), emotional states (e.g., fun, depression, stress, sad, happier, hate), and family-related topics (e.g., loved, food, house, siblings). In contrast, the keywords in CLEC tend to be associated with personal and spiritual development (e.g., development, study, personal, spirit, spiritual, enrich, knowledge), health (e.g., healthy, strong, sports, outdoors), and family and friends (e.g., warm, warmth, difficulties).
Example (1) illustrates that spiritual development in CLEC is often linked to education and knowledge. Alongside education, health and family emerged as key themes, as reflected in Example (2). While family (and friends) is also an important theme in the Swedish counterpart, Examples (2)–(4) suggest that for the Chinese students, family and friends were particularly valued for the warmth and support they provide in times of difficulty, whereas the Swedish students emphasized the importance of feeling loved. A similar contrast can be found in the discussion of sports: for the Chinese students, sports were primarily associated with health, as shown in Example (5), whereas for the Swedish students, they were mostly seen as fun, as illustrated in Example (6).
(1) Besides health education is also very significant. Our spiritual world need knowledge to fill it. (CLEC: G_2_F_23_G01_20)
(2) For me, a good life is having a warm family, a stable job, and a healthy body. (CLEC: G_2_F_23_G03_7(1))
(3) Friends can help you when you encounter difficulties. (CLEC: H_2_M_23_X1755)
(4) You need friends, family and you need to feel that you are loved. (SLEC: G_1_S_F_21_156)
(5) You can do sports or outdoors activities to improve your health. (CLEC: H_3_M_23_GS18)
(6) I play football because I think it’s fun and I like to win. (SLEC: H_8_M_24_60)
In addition to these thematic differences, the keywords also indicate a clear reliance on the prompt in the CLEC texts, with some keywords directly drawn from the prompt (e.g., leading, aspects, personal development, recreation, visibility, outdoors, sports, appearance, owning). This tendency is likely in line with a core aspect of Chinese essay writing, which places particular emphasis on topic relevance; that is, the composition must maintain a clear theme that closely aligns with the essay prompt requirements. Accordingly, the first and foremost steps in essay writing, commonly encouraged in Chinese writing instruction, are analyzing the prompt and identifying the main idea (Chen, 2013), and one strategy often used to reinforce this alignment is repeating the title within the composition (Wang, 1994). This approach also appears to influence English writing instruction. Indeed, all the teachers interviewed noted that their students are often instructed to carefully analyze the prompt and stay closely aligned with it in their texts. The unusually frequent occurrence of these prompt words may reflect a learner’s strategy to ensure adherence to the topic.
Some keywords suggest different styles of writing between the two corpora. In SLEC, words and phrases like going (to), just, stuff, lot, and big indicate a more colloquial style. This is evident in Example (7), where the vague expression a lot of stuff and the use of the second-person pronoun contribute to an informal tone. In contrast, keywords in CLEC such as role, sense, view, opinion(s), and vital reflect a more formal style, characterized by expressions such as play an important role, from my point of view, and of vital importance in (8), (9), and (10). These are likely formulaic expressions that students are taught and tend to retain, and that are used to emphasize significance and express personal opinions in this type of writing.
(7) If you are healthy, it is easier to do a lot of stuff that you think is fun. (SLEC: G_1_Y_M_21_74)
(8) Many people think that appearance plays an important role in life, which I don’t agree. (CLEC: G_1_F_23_WYS4(3))
(9) From my point of view, the most significant aspects for leading a good life are personal development and fitness. (CLEC: G_2_M_23_G02_7)
(10) As far as I’m concerned, the security from family is of vital importance. (CLEC: H_3_F_23_W31)
A further notable difference between the two corpora is the use of personal pronouns to engage with the reader. The Swedish students tended to use second-person pronouns (your, yourself, you), whether to address the reader directly or more generically, as seen in (7) and (11), while the Chinese students were more likely to adopt a collective voice with first-person pronouns (us, our, we), as in (12). The collective perspective was also observed in Wang (2016, p. 124), through the use of verb-noun collocations such as make contribution/sacrifice/progress in argumentative essays written by university-level Chinese EFL learners, reflecting the influence of “cultural-specific patterns of writing” (Ädel, 2006, p. 148).
(11) If you have friends you can do things with them that you and your friend find interesting and funny. (SLEC: G_1_S_F_22_15)
(12) When we get a healthy body, we can do anything we want. (CLEC: H_2_F_23_GS(2)2)
These differences in thematic focus and writing style, revealed through the keywords in CLEC and SLEC, together with variations in learners’ EE engagement, appear to reflect the distinct learning contexts in the two countries. One context still seems to be dominated by traditional foreign language instruction, which prioritizes structured learning (including, for example, formal writing, topic analysis, disciplined pursuits), while the other resembles a type of domestic immersion context where language use is both functional and integrated into learners’ daily communication. However, we acknowledge that it is highly unlikely that a single factor, or even a small set of independent variables, can account for a complex phenomenon such as L2 learning in a linear fashion (Pfenninger & Wirtz, 2024). While our study design controlled for several key variables (e.g., task, topic, learner age range, L1), other factors, including individual differences such as age of onset of English learning, motivation, and anxiety, as well as task repetition (see Baba & Nitta, 2014), were not controlled for and may also have played a role in the outcomes. Moreover, potential interactions between variables were not examined. Therefore, the findings should not be taken as evidence of a direct causal relationship, but rather as indicative of certain patterns and tendencies that differ between the two learner corpora and may be worth exploring further.
Exploring the effect of gaming on 3-word bundles
Lexical bundles refer to uninterrupted sequences of three or more words that occur frequently in natural language use, regardless of their idiomaticity or structural status (e.g., as can be seen, in other words, although it is) (Biber et al., Reference Biber, Conrad and Cortes2004), and they have been found to be an important component of fluent linguistic production (Hyland, Reference Hyland2008). L1 speakers, through extensive exposure to language in various communicative contexts, have access to a vast repertoire of multi-word units. In contrast, L2 learners, with more limited exposure to the target language, often struggle with this aspect of language use (Pawley & Syder, Reference Pawley, Syder, Richards and Schmidt1983; Wray, Reference Wray2012). As L2 learners of English are increasingly exposed to authentic English through EE activities, research has begun to examine the role of such exposure in L2 learning and use (e.g., Sundqvist, Reference Sundqvist2009; Kaatari et al., Reference Kaatari, Larsson, Wang, Eickhoff and Sundqvist2023).
Among these activities, gaming is believed to contribute to language learning by providing enjoyable experiences that motivate sustained engagement necessary to succeed in the game. This reflects the behavioral and emotional dimensions of self-regulated engagement in language learning (Fredricks et al., Reference Fredricks, Filsecker and Lawson2016). At the same time, gaming often involves a “trial-and-error” approach to language acquisition in naturalistic settings (Sundqvist, Reference Sundqvist2015), requiring players to process authentic input in real time, particularly when interacting with other players, navigating game instructions, or solving in-game tasks in English, thereby engaging the cognitive dimension as well. In this way, language is acquired implicitly as a by-product of pursuing individual interests.
However, previous research has yielded mixed results regarding the relationship between gaming and L2 vocabulary development (e.g., Sundqvist, Reference Sundqvist2019; Muñoz, Reference Muñoz2020), likely due to variations in game types, learners’ ages and proficiency levels, or methodological approaches across studies. Similarly, Wang et al. (Reference Wang, Kaatari, Larsson, Kim and Sundqvist2025) reported inconclusive findings concerning the effect of gaming on multiword combinations in the written production of Swedish L2 learners of English: while no significant differences were observed between gamers and nongamers in their use of adjective-noun and verb-noun combinations, the amount of time spent gaming appeared to exert some influence. These findings point to the need for further research into the potential impact of gaming on different aspects of L2 lexical development. This case study aims to illustrate the potential of CLEC in supporting this line of research by examining the use of lexical bundles, another type of multi-word sequence often associated with fluency and genre-specific discourse (e.g., Biber et al., Reference Biber, Conrad and Cortes2004; Wray, Reference Wray2012), in the writing of Chinese learners of English, to explore whether exposure through gaming may have an impact on language output.
To investigate this, we extracted two subsets from CLEC: a gaming subset, comprising 112 texts produced by students who reported spending more than 3 hours per week playing games that involve communication in English, and a nongaming subset, consisting of the same number of texts produced by students who do not game at all. The two subsets of texts are distributed in the same way across school years. In total, the gaming subset encompasses 19,969 words, while the nongaming subset encompasses 26,073 words.
We decided to focus on three-word bundles, primarily because the small datasets make it unlikely that setting the word length to more than three would yield enough bundles for analysis. In this study, the cut-off points were set to a minimum frequency of five occurrences across at least five different texts. AntConc (Version 4.2.0) (Anthony, Reference Anthony2022) was used to retrieve three-word bundles that meet these criteria. The subsequent analysis is based on the top 100 bundles from each subset. Table 6 presents the top 30 from each.
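For readers who wish to apply comparable cut-off points outside AntConc, the two thresholds (frequency and number of texts) can be approximated with a minimal Python sketch. This is not AntConc's implementation. The function name extract_bundles, its parameters, and the letters-only tokenization are assumptions made purely for illustration, and the sketch lets bundles span punctuation and sentence boundaries.

```python
import re
from collections import Counter, defaultdict


def extract_bundles(texts, n=3, min_freq=5, min_range=5):
    """Return {bundle: (frequency, range)} for n-word sequences that
    occur at least min_freq times in at least min_range different texts.

    Hypothetical helper for illustration; not AntConc's algorithm.
    """
    freq = Counter()                 # total occurrences of each bundle
    text_ids = defaultdict(set)      # texts in which each bundle occurs
    for text_id, text in enumerate(texts):
        # Assumed tokenization: lowercase, runs of letters only, so
        # punctuation and apostrophes act as token boundaries.
        tokens = re.findall(r"[a-z]+", text.lower())
        for i in range(len(tokens) - n + 1):
            bundle = " ".join(tokens[i:i + n])
            freq[bundle] += 1
            text_ids[bundle].add(text_id)
    return {
        bundle: (count, len(text_ids[bundle]))
        for bundle, count in freq.items()
        if count >= min_freq and len(text_ids[bundle]) >= min_range
    }
```

The range threshold matters because a bundle repeated many times by a single writer would otherwise pass the frequency threshold alone. Results from such a sketch may differ from AntConc output, since tokenization settings affect which sequences count as three words.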
Table 6. Top 30 3-word bundles in gaming and nongaming subsets

Out of the top 100 bundles from each subset, 73 are shared, suggesting that gaming has a limited effect on this aspect of language use overall. As shown in Table 6, the top eight bundles all come from the prompt: “what are the most important aspects for leading a good life”. Among the remaining shared bundles, some reflect the predominant themes discussed earlier (e.g., a healthy body, life and health, friends and family, well paid job). Other shared bundles include common formulaic sequences such as all in all, a lot of, and so on, and what’s more. According to the teacher interviews, multi-word transitional signals such as what’s more and all in all are explicitly taught in writing instruction. The rest of the shared bundles are mainly those used to express opinions (e.g., in my opinion, so I think), disagreement (e.g., don’t think, but I think), and evaluation (e.g., is very important, can help us). Example (13) illustrates a common pattern used to introduce a rebuttal, with the phrase (but I) don’t think (so) following a counterargument introduced by some people think. The frequent use of such expressions suggests that this may be a strategy students tend to rely on. The fact that these expressions are shared by both subsets indicates that they are likely a product of classroom instruction, which may not be easily influenced by students’ EE exposure.
(13) Some people think money is the most important thing. But I don’t think so. (CLEC: G_1_F_23_W8)
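The comparison of the two top-100 lists (73 shared bundles, with the remainder exclusive to one subset) amounts to a set comparison. The sketch below shows one way to compute it. The function name compare_top_lists and the short ranked lists are illustrative assumptions; the bundles themselves are taken from the analysis, but their order here is invented.

```python
def compare_top_lists(ranked_a, ranked_b, top_n=100):
    """Split two frequency-ranked bundle lists into shared and exclusive sets.

    Hypothetical helper for illustration; only the first top_n items
    of each list are compared.
    """
    top_a = set(ranked_a[:top_n])
    top_b = set(ranked_b[:top_n])
    return {
        "shared": top_a & top_b,   # bundles in both top lists
        "only_a": top_a - top_b,   # bundles exclusive to the first list
        "only_b": top_b - top_a,   # bundles exclusive to the second list
    }


# Toy ranked lists (order is illustrative, not the reported ranking).
gaming_top = ["in my opinion", "a lot of", "do you think"]
nongaming_top = ["in my opinion", "a lot of", "we want to"]

result = compare_top_lists(gaming_top, nongaming_top)
```

Because the comparison is limited to the top-ranked items, a bundle counted as exclusive to one list may still occur in the other subset at a lower rank.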
Among the bundles that occur exclusively in each subset, some suggest different thematic priorities. In the gaming subset, money-related bundles (money is the, money is not) stand out, while the nongaming subset features bundles related to health and education (a good education, education and health, healthy body is). The gaming subset also includes fragments of longer formulaic sequences (e.g., is the key to, one of the most), as well as a range of bundles involving personal pronouns I and you (e.g., I have a, you want to, for me I, I think have, if you want, can give you, do you think, can make me), suggesting a more informal and interactive style. Examples (14) and (15) illustrate this style well. The use of do you think and other questions invites engagement characteristic of casual conversation.
(14) What do you think of your life? Incredible? Dull? Or anything else? (CLEC: G_1_F_23_W1)
(15) That’s my point of view, what about you? What do you think can make your life better? (CLEC: H_2_F_23_X1619)
The bundles exclusive to the nongaming list include those involving the impersonal it structure (e.g., it’s important, think it is, it is not) and the inclusive pronoun we (we don’t, we have a, we want to, if we don), suggesting a preference for a more collective and impersonal approach. In Example (16), the use of anticipatory it adds a sense of formality and objectivity to the statement.
(16) It is not enough to focus solely on one aspect while neglecting the other. (CLEC: G_2_F_23_G03_3(1))
In conclusion, while the differences between the gaming and nongaming subsets appear minimal in this respect, there are some signs that gamers tend to adopt a more informal and interactive style, in contrast to the more formal and impersonal style observed among nongamers. This may suggest a degree of register transfer, whereby features of the language input encountered during gaming (often informal and conversational in nature) are reflected in learners’ written output. However, given the exploratory nature of this case study, these observations should be seen as tentative yet promising, warranting further investigation in future research with more fine-grained learner profiling and robust statistical analysis.
Conclusion
In this data report, we introduced the Chinese Learner English Corpus (CLEC), with the aim of providing an additional freely available corpus resource, thus facilitating the growing research interest in the role of extramural exposure in L2 development. CLEC adds value both as a standalone dataset to explore the effect of EE engagement and in direct comparison with SLEC to facilitate contrastive interlanguage research.
The two case studies presented serve to illustrate some of the analytical possibilities afforded by the corpus. In the keyword analysis, the tightly controlled task and topic across both learner corpora, along with comparable learner profiles, made it more plausible to attribute observed differences to contextual factors such as educational traditions and exposure to English, rather than to variation in prompt, task type, or mode of production. In the lexical bundle analysis, metadata on learners’ engagement in specific EE activities allowed us to construct subcorpora for exploratory comparisons, highlighting the potential of the corpus for targeted investigations into how extramural exposure might be reflected in different aspects of learner output. More broadly, the shared design of CLEC and SLEC, and potentially other corpora to be added to the “LEC” family, opens up possibilities for contrastive interlanguage analysis (CIA; Granger, Reference Granger2015, Reference Granger2024), similar to what has been enabled by ICLE, but with greater precision due to the control of the topic variable.
While the findings were preliminary, they highlighted patterns (e.g., prompt reliance, register transfer from input to output) that merit further investigation using more refined analytical approaches. Methodologically, while the keyword analysis relied on a direct comparison between two well-matched learner corpora, future research could benefit from combining this with comparisons to a large, external reference corpus, as demonstrated by Baker (Reference Baker2004), to uncover both differences and potential similarities between learner groups. With regard to gaming, more fine-grained learner profiling within the subcorpora and robust statistical analysis will be necessary in order to disentangle the effects of multiple confounding variables—an essential step toward understanding a phenomenon as complex as L2 development (Pfenninger & Wirtz, Reference Pfenninger and Wirtz2024).
By making CLEC freely available to researchers (https://sites.google.com/view/ee-corpora/), we hope to encourage further research in these areas. The corpus is also a valuable resource for teachers and teacher trainees looking to deepen their understanding of L2 development and acquisition (e.g., developmental sequences, L1 influence, individual differences), as well as to explore topics such as assessment and feedback.
Supplementary material
The supplementary material for this article can be found at http://doi.org/10.1017/S0272263125101332.
Competing interests
The authors declare none.
Appendix
