Learner corpora and native language identification

Scott Jarvis, Magali Paquot

Research output: Chapter in Book/Report/Conference proceeding › Chapter

8 Scopus citations

Abstract

Native language identification (NLI) is the task of automatically identifying the first language (L1) of a language user on the basis of the person's production of the target language. This research pursuit is guided by the assumption that a person's L1 background can be inferred from how frequently he or she makes use of certain features of the target language (e.g. words, word sequences, sequences of characters). The task is typically modelled as a text categorisation problem where the set of L1s is predefined and each text is assigned an L1 on account of its specific language features. NLI offers potential practical applications in a wide variety of domains that rely on language corpora. Among other benefits, NLI appears to enhance the performance of a number of natural language processing (NLP) tasks, such as speech recognition, parsing and information extraction (Mayfield Tomokiyo and Jones 2001). NLP tools and techniques are typically trained on native-speaker data and are consequently often less robust when applied to non-native language (L2) (Díaz-Negrillo et al. 2010; Chapter 24, this volume). A second benefit of NLI is that its results may contribute to the success of machine-learning approaches to author identification and profiling. These techniques are today of crucial interest for a number of web-related fields such as internet security and cybercrime investigation (Argamon et al. 2009). The results of an NLI task may also contribute to second language acquisition (SLA) theory building. The ability to detect the L1 of individuals on the basis of their use of certain specific features of the target language indeed offers unprecedented opportunities for the study of transfer, i.e. ‘the influence resulting from similarities and differences between the target language and any other language that has been previously (and perhaps imperfectly) acquired’ (Odlin 1989: 27; see also Chapter 15, this volume). 
The rapprochement between NLI techniques and transfer research was first made by Tsur and Rappoport (2007) and has recently been fully articulated in the detection-based approach to transfer (Jarvis 2010, 2012). In this exploratory approach, the results of an NLI task are used as primary data to investigate the nature and extent of L1 influence in non-native language use.
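
The text categorisation framing described in the abstract (a predefined set of L1s, with each text assigned an L1 from the frequencies of features such as character sequences) can be illustrated with a minimal sketch. This is not the authors' method: the nearest-profile classifier with cosine similarity is a deliberately simple stand-in chosen for illustration, and the function names (`char_ngrams`, `train`, `predict`), the labels `L1_A` and `L1_B`, and the toy strings are all hypothetical.

```python
from collections import Counter
from math import sqrt


def char_ngrams(text, n=3):
    """Count overlapping character n-grams in a lower-cased text."""
    text = text.lower()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def relative_freqs(counts):
    """Turn raw n-gram counts into relative frequencies."""
    total = sum(counts.values())
    return {g: c / total for g, c in counts.items()} if total else {}


def cosine(a, b):
    """Cosine similarity between two sparse frequency dictionaries."""
    dot = sum(v * b.get(g, 0.0) for g, v in a.items())
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def train(labelled_texts, n=3):
    """Pool the texts of each predefined L1 label into one frequency profile."""
    pooled = {}
    for l1, text in labelled_texts:
        pooled.setdefault(l1, Counter()).update(char_ngrams(text, n))
    return {l1: relative_freqs(counts) for l1, counts in pooled.items()}


def predict(profiles, text, n=3):
    """Assign the L1 label whose profile is most similar to the text."""
    feats = relative_freqs(char_ngrams(text, n))
    return max(profiles, key=lambda l1: cosine(profiles[l1], feats))


# Hypothetical toy data: arbitrary labels and invented strings, not learner data.
TOY_CORPUS = [
    ("L1_A", "abcabcabc abcabc"),
    ("L1_B", "xyzxyzxyz xyzxyz"),
]
toy_profiles = train(TOY_CORPUS)
toy_guess = predict(toy_profiles, "abcabc")
```

In practice the features would be extracted from a learner corpus and the classifier would be evaluated on held-out texts; the sketch only shows the shape of the task, namely frequency features in and one label from a closed set out.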

Original language: English (US)
Title of host publication: The Cambridge Handbook of Learner Corpus Research
Publisher: Cambridge University Press
Pages: 605-628
Number of pages: 24
ISBN (Electronic): 9781139649414
ISBN (Print): 9781107041196
DOIs
State: Published - Jan 1 2015
Externally published: Yes

ASJC Scopus subject areas

  • General Arts and Humanities
  • General Social Sciences
