• DocumentCode
    3520827
  • Title

    Searching for Historical Events on a Large-Scale Web Archive

  • Author

    Huang, Lian'en ; Lin, Wu ; Li, Xiaoming

  • Author_Institution
    Internet Res. & Eng. Center, Peking Univ., Shenzhen, China
  • fYear
    2010
  • fDate
    1-3 Nov. 2010
  • Firstpage
    259
  • Lastpage
    266
  • Abstract
    Finding knowledge on the Web has long been an active research topic. Today the Web is a popular medium for publishing news and opinion articles, which are important carriers of human knowledge, especially social knowledge. Techniques for automatically collecting and analysing these articles on a large scale are therefore desirable. In this paper we propose techniques for searching for events on the Web and test them on a large-scale web archive. Given an event, or a news topic that many people care about, the goal is to find nearly all news stories related to it. First, we propose a novel domain-independent approach to extracting news stories from web pages; it is based on anchor text and is applicable to most websites. Experiments show that our approach performs well and outperforms another approach we found. Second, we propose a domain-based method of representing events, in which hundreds of keywords represent an event and compose the query expression. This retrieval setting differs from that of most search engines in that the number of keywords is large. We then propose several retrieval algorithms based on BM25 for this method. Evaluation shows that these algorithms perform better than unmodified BM25 in our setting, and the best one is chosen as the algorithm of our system. Finally, an experimental system has been built on a collection of 2 billion web pages, and its running performance is reported, which shows the effectiveness of our approaches.
  • Keywords
    Internet; Web sites; history; information retrieval systems; query processing; BM25; Web page; anchor text; domain independent approach; event search; historical event; human knowledge; knowledge search; large scale Web archive; news story extraction; query expression; retrieval algorithm; search engine; social knowledge; Domain; Historical Event; Knowledge Finding; News Extraction; Web Mining;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Semantics Knowledge and Grid (SKG), 2010 Sixth International Conference on
  • Conference_Location
    Beijing
  • Print_ISBN
    978-1-4244-8125-5
  • Electronic_ISBN
    978-0-7695-4189-1
  • Type

    conf

  • DOI
    10.1109/SKG.2010.37
  • Filename
    5663519