Regional Information Center for Science and Technology (RICeST)

DocumentCode :

1204341

Title :

The Unreasonable Effectiveness of Data

Author :

Halevy, Alon ; Norvig, Peter ; Pereira, Fernando

Volume :

Issue :

fYear :

2009

Firstpage :

Lastpage :

Abstract :

At Brown University, there was excitement about having access to the Brown Corpus, containing one million English words. Since then, we have seen several notable corpora that are about 100 times larger, and in 2006, Google released a trillion-word corpus with frequency counts for all sequences up to five words long. In some ways this corpus is a step backwards from the Brown Corpus: it's taken from unfiltered Web pages and thus contains incomplete sentences, spelling errors, grammatical errors, and all sorts of other errors. It's not annotated with carefully hand-corrected part-of-speech tags. But the fact that it's a million times larger than the Brown Corpus outweighs these drawbacks. A trillion-word corpus - along with other Web-derived corpora of millions, billions, or trillions of links, videos, images, tables, and user interactions - captures even very rare aspects of human behavior. So, this corpus could serve as the basis of a complete model for certain tasks - if only we knew how to extract the model from the data.

Keywords :

Internet; data handling; natural language processing; Brown Corpus; English words; Web-derived corpora; data unreasonable effectiveness; frequency counts; grammatical errors; hand-corrected part-of-speech tags; incomplete sentences; spelling errors; trillion-word corpus; unfiltered Web pages; Broadcasting; Data mining; Frequency estimation; Humans; Machine learning; Natural language processing; Speech recognition; Tagging; Videos; Web pages; Semantic Web; machine learning; very large data bases

fLanguage :

English

Journal_Title :

Intelligent Systems, IEEE

Publisher :

IEEE

ISSN :

1541-1672

Type :

jour

DOI :

10.1109/MIS.2009.36

Filename :

4804817

Link To Document :

https://search.ricest.ac.ir/dl/search/defaultta.aspx?DTC=49&DC=1204341