Exploring the Role of n-Grams in L1 Identification

Scott Jarvis, Magali Paquot

Research output: Chapter in Book/Report/Conference proceeding › Chapter

12 Scopus citations

Abstract

Chapter 2 showed that relatively high levels of L1 classification accuracy can be achieved under the following conditions: (1) Five L1 groups, some of which are closely related to each other. (2) The learners within each group vary widely in terms of L2 proficiency. (3) The texts are all written narrative descriptions of a silent film. (4) The features (i.e. variables) used by the classifier include a few dozen highly frequent words. Conditions (1) and (2) were intended to make the detection task challenging for the classifier in order to investigate how sensitive the classifier is to even subtle between-group differences in learners’ language-use patterns, and simultaneously to gather possible evidence of L1 effects that may tend to evade conscious awareness but are nevertheless reliable enough – even across proficiency levels – to be detected by a computer-based classifier. Condition (3) represented a control variable whose purpose was to limit the range of variation in the data to that which could be attributed to proficiency differences (within-group) and L1 differences (between-group). This was done to enhance the clarity of interpretations that could be made on the basis of the results – to show whether certain L1-related tendencies are reliable enough such that they are detectable even when proficiency differences within L1 groups are greater than differences between groups. Finally, condition (4) represented the wealth of resources that were made available to the classifier. In order to test the reliability of L1 lexical effects as well as the strength, sensitivity, and practicality of the classifier, the pool of features made available to the classifier was intentionally restricted to just 53 of the most frequent words in the data. The stepwise feature-selection parameters were further set in such a way as to allow the classifier to build its L1 prediction model using no more than 40 of the 53 features that were made available to it. This was done for purposes of adhering to the convention of restricting the number of variables to no more than 10% of the number of cases.

Original language: English (US)
Title of host publication: Approaching Language Transfer through Text Classification
Subtitle of host publication: Explorations in the Detection-Based Approach
Publisher: Channel View Publications
Pages: 71-105
Number of pages: 35
ISBN (Electronic): 9781847696991
ISBN (Print): 9781847696977
State: Published - Mar 14 2012
Externally published: Yes

ASJC Scopus subject areas

  • General Arts and Humanities
  • General Social Sciences
