Methodological issues regarding corpus-based analyses of linguistic variation

Douglas Biber

Research output: Contribution to journal › Article › peer-review

145 Scopus citations


Although corpus-based analyses of linguistic variation have provided fresh insights into previously intractable issues, several methodological criticisms have been raised about the overall design of text corpora and the validity of text 'genres' as a basis for analyses of variation. Unfortunately, most of these criticisms have been based on intuitive judgements rather than empirical investigation. The present study begins to correct this lack of evidence. It focuses on four particular methodological issues: (1) how long texts should be in order to reliably represent the distribution of linguistic features in particular text categories; (2) how many texts within each text category are required in order to reliably represent the linguistic characteristics of that category, and related questions concerning the validity of 'genre' categories; (3) how many texts are needed in a corpus to accurately identify the salient parameters of linguistic variation among texts; and (4) how much of a cross-section is required to identify and analyze the salient parameters of variation among texts. These issues are addressed through statistical investigation of the distribution of linguistic features across various sub-samples of the LOB and London-Lund corpora, in comparison to their distribution across the full corpora. The results indicate that existing corpora are adequate for many analyses of linguistic variation. In conclusion, the paper welcomes the future availability of larger and more representative corpora, but it also urges researchers to fully exploit existing corpora for ongoing investigations of linguistic variation.

Original language: English (US)
Pages (from-to): 257-269
Number of pages: 13
Journal: Literary and Linguistic Computing
Issue number: 4
State: Published - 1990

ASJC Scopus subject areas

  • Information Systems
  • Language and Linguistics
  • Linguistics and Language

