Evaluating reliability in quantitative vocabulary studies: The influence of corpus design and composition

Don Miller, Douglas Biber

Research output: Contribution to journal › Article › peer-review

36 Scopus citations


Recent methodological advances have been used to create word lists based on large corpora. The present paper explores whether these corpora - and the associated lists - are unequivocally more representative. Corpus design considerations have usually focused on issues of external representativeness (representing the target discourse domain), while disregarding issues of internal representativeness (whether the corpus permits reliable descriptions of linguistic variation). This disregard may be especially problematic for studies of lexical variation, where it is difficult to achieve stable, reliable results from corpus analysis. The present paper illustrates these challenges through experiments based on analysis of a corpus representing a highly restricted discourse domain: university-level introductory psychology textbooks. The results indicate that corpus design and composition have a much greater influence on lexical variation than previously recognized, highlighting the need to evaluate internal representativeness in quantitative corpus-based research.

Original language: English (US)
Pages (from-to): 30-53
Number of pages: 24
Journal: International Journal of Corpus Linguistics
Issue number: 1
State: Published - 2015


  • Corpus representativeness
  • Lexical diversity and variability
  • Reliability and validity
  • Word lists

ASJC Scopus subject areas

  • Language and Linguistics
  • Linguistics and Language

