Title of article
Unsupervised named-entity extraction from the Web: An experimental study (Original Research Article)
Author/Authors
Oren Etzioni, Michael Cafarella, Doug Downey, Ana-Maria Popescu, Tal Shaked, Stephen Soderland, Daniel S. Weld, Alexander Yates
Issue Information
Journal, serial issue, year 2005
Pages
44
From page
91
To page
134
Abstract
The KnowItAll system aims to automate the tedious process of extracting large collections of facts (e.g., names of scientists or politicians) from the Web in an unsupervised, domain-independent, and scalable manner. The paper presents an overview of KnowItAll's novel architecture and design principles, emphasizing its distinctive ability to extract information without any hand-labeled training examples. In its first major run, KnowItAll extracted over 50,000 class instances, but suggested a challenge: How can we improve KnowItAll's recall and extraction rate without sacrificing precision?
This paper presents three distinct ways to address this challenge and evaluates their performance. Pattern Learning learns domain-specific extraction rules, which enable additional extractions. Subclass Extraction automatically identifies sub-classes in order to boost recall (e.g., “chemist” and “biologist” are identified as sub-classes of “scientist”). List Extraction locates lists of class instances, learns a “wrapper” for each list, and extracts elements of each list. Since each method bootstraps from KnowItAll's domain-independent methods, the methods also obviate hand-labeled training examples. The paper reports on experiments, focused on building lists of named entities, that measure the relative efficacy of each method and demonstrate their synergy. In concert, our methods gave KnowItAll a 4-fold to 8-fold increase in recall at precision of 0.90, and discovered over 10,000 cities missing from the Tipster Gazetteer.
Keywords
Question answering , Unsupervised , Pointwise mutual information , Information extraction
Journal title
Artificial Intelligence
Serial Year
2005
Record number
1207422