DocumentCode :
1498265
Title :
Gleaning the Web
Author :
Kushmerick, N.
Author_Institution :
Dept. of Comput. Sci., Univ. Coll. Dublin
Volume :
14
Issue :
2
fYear :
1999
Firstpage :
20
Lastpage :
22
Abstract :
We are drowning in words. E-mail and Web browsers provide access to myriad sources of text, from newswires to product catalogs, from recipes to movie schedules, but we can parse just a tiny fraction of the available terabytes. The cliche is born: information overload, which threatens to swamp the Internet's promised productivity gains, educational benefits, and entertainment value. In recent years, computer science has risen to this challenge, with substantial progress on systems for retrieving and filtering text. Information extraction (IE) systems provide a complementary service. IE is the task of identifying the specific fragments of a single document that constitute its core semantic content. Scalability is the major challenge to IE. IE systems usually rely on extraction rules tailored to a particular document collection. If this knowledge is hand-crafted, porting an IE system to new collections will be expensive. Recent research has led to the identification of important classes of Internet IE tasks for which highly scalable systems have been developed. I describe these IE tasks, explain how machine learning yields highly scalable IE systems, discuss remaining challenges, and argue that scaling up AI applications on the Internet is an important challenge to machine learning.
Keywords :
Internet; information resources; information retrieval; learning (artificial intelligence); online front-ends; Web browsers; document collection; e-mail; information extraction; information overload; machine learning; scalability; semantic content; Catalogs; Computer science; Data mining; Electronic mail; Filtering; Motion pictures; Processor scheduling; Productivity
fLanguage :
English
Journal_Title :
Intelligent Systems and their Applications, IEEE
Publisher :
IEEE
ISSN :
1094-7167
Type :
jour
DOI :
10.1109/5254.757626
Filename :
757626