Exploring the composition of the searchable web: A corpus-based taxonomy of web registers

Douglas Biber, Jesse Egbert, Mark Davies

Research output: Contribution to journal › Article › peer-review

44 Scopus citations

Abstract

One major challenge for Web-As-Corpus research is that a typical Web search provides little information about the register of the documents that are searched. Previous research has attempted to address this problem (e.g., through the Automatic Genre Identification initiative), but with only limited success. As a result, we currently know surprisingly little about the distribution of registers on the web. In this study, we tackle this problem through a bottom-up user-based investigation of a large, representative corpus of web documents. We base our investigation on a much larger corpus than those used in previous research (48,571 web documents), obtained through random sampling from across the full range of documents that are publicly available on the searchable web. Instead of relying on individual expert coders, we recruit typical end-users of the Web for register coding, with each document in the corpus coded by four different raters. End-users identify basic situational characteristics of each web document, coded in a hierarchical manner. Those situational characteristics lead to general register categories, which eventually lead to lists of specific sub-registers. By working through a hierarchical decision tree, users are able to identify the register category of most Internet texts with a high degree of reliability. After summarising our methodological approach, this paper documents the register composition of the searchable web. Narrative registers are found to be the most prevalent, while Opinion and Informational Description/Explanation registers are also found to be extremely common. One of the major innovations of the approach adopted here is that it permits an empirical identification of 'hybrid' documents, which integrate characteristics from multiple general register categories (e.g., opinionated-narrative). These patterns are described and illustrated through sample Internet documents.

Original language: English (US)
Pages (from-to): 11-45
Number of pages: 35
Journal: Corpora
Volume: 10
Issue number: 1
State: Published - Apr 1 2015

Keywords

  • Hybrid registers
  • Informational registers
  • Internet language
  • Mechanical Turk
  • Narrative
  • Opinion
  • Web registers
  • Web-As-Corpus

ASJC Scopus subject areas

  • Language and Linguistics
  • Linguistics and Language
