Constructing two Vietnamese corpora and building a lexical database

Hien Pham, Benjamin V. Tucker, R. Harald Baayen

Research output: Contribution to journal › Article › peer-review

3 Scopus citations

Abstract

Corpus-based research has formed the backbone of linguistic research in recent decades. Large text corpora are used for solving various kinds of linguistic problems, including those of quantitative linguistics, cognitive linguistics, and psycholinguistics. This paper reports the creation of two corpora of contemporary Vietnamese. It also describes the construction of these two equally sized Vietnamese corpora (a corpus from Vietnamese film subtitles, subtlex-viet, and a general corpus of varieties of online newspapers and stories, genlex-viet). We document the general steps of the construction and extraction of linguistic information from the language corpora and provide a road map for others who would like to create similar corpora. The resultant corpora are available in three versions: plain text, tokenized, and POS tagged. In the second half of the paper, the construction of a lexical database derived from the corpora is described. The database includes measures such as frequency of occurrence, dispersion, Mutual Information, Inverse Document Frequency, as well as vector space measures based on Latent Semantic Analysis and Hyperspace Analogue to Language. We conclude by reporting a comparison of the lexical predictors and a validation using psycholinguistic data from visual lexical decision experiments.

Original language: English (US)
Pages (from-to): 465-498
Number of pages: 34
Journal: Language Resources and Evaluation
Volume: 53
Issue number: 3
State: Published - Sep 15 2019
Externally published: Yes

Keywords

  • Dispersion
  • Film subtitle corpus
  • Frequency
  • HAL
  • LSA
  • Validation
  • Vietnamese
  • Written corpus

ASJC Scopus subject areas

  • Language and Linguistics
  • Education
  • Linguistics and Language
  • Library and Information Sciences
