Abstract :
We are drowning in words. E-mail and Web browsers provide access to myriad sources of text, from newswires to product catalogs, from recipes to movie schedules, but we can parse just a tiny fraction of the available terabytes. The cliche is born: information overload, which threatens to swamp the Internet's promised productivity gains, educational benefits, and entertainment value. In recent years, computer science has risen to this challenge, with substantial progress on systems for retrieving and filtering text. Information extraction (IE) systems provide a complementary service. IE is the task of identifying the specific fragments of a single document that constitute its core semantic content. Scalability is the major challenge to IE. IE systems usually rely on extraction rules tailored to a particular document collection. If this knowledge is hand-crafted, porting an IE system to new collections will be expensive. Recent research has led to the identification of important classes of Internet IE tasks for which highly scalable systems have been developed. I describe these IE tasks and explain how machine learning yields highly scalable IE systems. I also discuss remaining challenges and argue that scaling up AI applications on the Internet is an important challenge to machine learning.
Keywords :
Internet; information resources; information retrieval; learning (artificial intelligence); online front-ends; Web browsers; document collection; e-mail; information extraction; information overload; machine learning; scalability; semantic content; Catalogs; Computer science; Data mining; Electronic mail; Filtering; Motion pictures; Processor scheduling; Productivity