Title :
Searching for Historical Events on a Large-Scale Web Archive
Author :
Huang, Lian'en ; Lin, Wu ; Li, Xiaoming
Author_Institution :
Internet Res. & Eng. Center, Peking Univ., Shenzhen, China
Abstract :
Finding knowledge on the Web has long been an active research topic. Today the Web has become a popular medium for publishing news and opinion articles, which are important carriers of human knowledge, especially social knowledge. Techniques for automatically collecting and analysing these articles on a large scale are therefore desirable. In this paper we propose techniques for searching for events on the Web and test them on a large-scale web archive. Given an event, or a news topic that many people care about, the goal is to find nearly all news stories related to it. First, we propose a novel domain-independent approach to extracting news stories from web pages; it is based on anchor text and is applicable to most websites. Experiments show that our approach performs well and outperforms another approach we have found. Second, we propose a domain-based method of representing events, in which hundreds of keywords represent an event and compose the query expression. This retrieval setting differs from that of most search engines in that the number of keywords is large. We then propose several BM25-based retrieval algorithms for this method. Evaluation shows that these algorithms perform better than unmodified BM25 in our setting, and the best one is chosen as the algorithm of our system. Finally, an experimental system has been built on a collection of 2 billion web pages, and its running performance is reported, which shows the effectiveness of our approaches.
Keywords :
Internet; Web sites; history; information retrieval systems; query processing; BM25; Web page; anchor text; domain independent approach; event search; historical event; human knowledge; knowledge search; large scale Web archive; news story extraction; query expression; retrieval algorithm; search engine; social knowledge; Domain; Historical Event; Knowledge Finding; News Extraction; Web Mining;
Conference_Title :
Semantics Knowledge and Grid (SKG), 2010 Sixth International Conference on
Conference_Location :
Beijing
Print_ISBN :
978-1-4244-8125-5
Electronic_ISBN :
978-0-7695-4189-1
DOI :
10.1109/SKG.2010.37