Title :
Compilation of the Balanced Corpus of Contemporary Written Japanese in the KOTONOHA Initiative (Invited Paper)
Author_Institution :
Dept. Language Res., Nat. Inst. for Japanese Language, Japan
Abstract :
Compilation of a hundred-million-word balanced corpus, the Balanced Corpus of Contemporary Written Japanese (BCCWJ), is underway at the National Institute for Japanese Language. This corpus is a component of the KOTONOHA super-corpus, which covers the full range of modern Japanese from the middle of the 19th century to the present. A distinctive characteristic of the BCCWJ is that about two thirds of the samples in the corpus were randomly selected from two statistical populations: one consists of the publication data of books, magazines, and newspapers from the years 2001-2005, and the other consists of the set of books registered in more than 13 public libraries in the Tokyo metropolis (335,000 different books). The corpus will be publicly available in the first half of 2011.
Keywords :
linguistics; natural language processing; text analysis; Balanced Corpus of Contemporary Written Japanese; KOTONOHA super-corpus; modern Japanese; statistical population; word balanced corpus; Books; Copyright protection; Government; History; Internet; Large-scale systems; Libraries; Natural languages; Speech; Writing; BCCWJ; KOTONOHA; balanced corpus; copyright law; random sampling
Conference_Title :
Universal Communication, 2008. ISUC '08. Second International Symposium on
Conference_Location :
Osaka
Print_ISBN :
978-0-7695-3433-6
DOI :
10.1109/ISUC.2008.82