Representativeness in Corpus Design

Research output: Chapter in Book/Report/Conference proceeding › Chapter

Abstract

Some of the first considerations in constructing a corpus concern the overall design: for example, the kinds of texts included, the number of texts, the selection of particular texts, the selection of text samples from within texts, and the length of text samples. Each of these involves a sampling decision, whether conscious or not. The use of computer-based corpora provides a solid empirical foundation for general-purpose language tools and descriptions, and enables analyses of a scope not otherwise possible. However, a corpus must be 'representative' in order to be appropriately used as the basis for generalizations concerning a language as a whole; for example, corpus-based dictionaries, grammars, and general part-of-speech taggers are applications requiring a representative basis (cf. Biber, 1993b). Typically, researchers focus on sample size as the most important consideration in achieving representativeness: how many texts must be included in the corpus, and how many words per text sample. Books on sampling theory, however, emphasize that sample size is not the most important consideration in selecting a representative sample; rather, a thorough definition of the target population and decisions concerning the method of sampling are prior considerations. Representativeness refers to the extent to which a sample includes the full range of variability in a population.

Original language: English (US)
Title of host publication: Practical Lexicography
Subtitle of host publication: A Reader
Publisher: Oxford University Press
Pages: 63-87
Number of pages: 25
ISBN (Electronic): 9781383043891
ISBN (Print): 9780199292332
State: Published - Jan 1 2023

Keywords

  • dictionaries
  • emphasize
  • enables
  • generalizations
  • population

ASJC Scopus subject areas

  • General Computer Science
  • General Arts and Humanities
  • General Social Sciences
