TY - JOUR
T1 - Register identification from the unrestricted open Web using the Corpus of Online Registers of English
AU - Laippala, Veronika
AU - Rönnqvist, Samuel
AU - Oinonen, Miika
AU - Kyröläinen, Aki Juhani
AU - Salmela, Anna
AU - Biber, Douglas
AU - Egbert, Jesse
AU - Pyysalo, Sampo
N1 - Publisher Copyright: © 2022, The Author(s).
PY - 2023/9
Y1 - 2023/9
AB - This article examines the automatic identification of Web registers, that is, text varieties such as news articles and reviews. Most studies have focused on corpora restricted to include only preselected classes with well-defined characteristics. These corpora feature only a subset of documents found on the unrestricted open Web, for which register identification has been particularly difficult because the range of linguistic variation on the Web is known to be substantial. As part of this study, we present the first open release of the Corpus of Online Registers of English (CORE), which is drawn from the unrestricted open Web and, currently, is the largest collection of manually annotated Web registers. Furthermore, we demonstrate that the CORE registers can be automatically identified with competitive results, with the best performance being an F1-score of 68% with the deep learning model BERT. The best performance was achieved using two modeling strategies. The first one involved modeling the registers using propagated register labels, that is, repeating the main register label along with its corresponding subregister label in a multilabel model. In the second one, we explored how the length of the document affects model performance, discovering that the beginning provided superior classification accuracy. Overall, the current study presents a systematic approach for the automatic identification of a large number of Web registers from the unrestricted Web, hence providing new pathways for future studies.
KW - Deep learning
KW - Document classification
KW - Web register identification
KW - Web-as-corpus
UR - http://www.scopus.com/inward/record.url?scp=85140654411&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85140654411&partnerID=8YFLogxK
U2 - 10.1007/s10579-022-09624-1
DO - 10.1007/s10579-022-09624-1
M3 - Article
AN - SCOPUS:85140654411
SN - 1574-020X
VL - 57
SP - 1045
EP - 1079
JO - Language Resources and Evaluation
JF - Language Resources and Evaluation
IS - 3
ER -