• DocumentCode
    2840405
  • Title
    Dash: A Novel Search Engine for Database-Generated Dynamic Web Pages
  • Author
    Lee, Ken C. K. ; Bankar, K. ; Baihua Zheng ; Chi-Yin Chow ; Honggang Wang
  • Author_Institution
    Dept. of Comput. & Inf. Sci., Univ. of Massachusetts, Dartmouth, MA, USA
  • fYear
    2012
  • fDate
    18-21 June 2012
  • Firstpage
    435
  • Lastpage
    444
  • Abstract
    Database-generated dynamic web pages (db-pages, for short), whose contents are created on the fly by web applications and databases, are now prominent on the web. However, many of them cannot be searched by existing search engines. Accordingly, we develop a novel search engine named Dash, which stands for Db-pAge Search, to support db-page search. Dash determines the db-pages that a target web application and its database can generate by exploring the application code and the related database content, and it supports keyword search on those db-pages. In this paper, we present its system design and focus on efficiency. To minimize the costs of collecting, maintaining, indexing, and searching a massive number of db-pages that may have overlapping content, Dash derives and indexes db-page fragments in place of db-pages. Each db-page fragment carries a disjoint part of a db-page. To efficiently compute and index db-page fragments from huge datasets, Dash is equipped with MapReduce-based algorithms for database crawling and db-page fragment indexing. In addition, Dash has a top-k search algorithm that efficiently assembles db-page fragments into db-pages relevant to the search keywords and returns the k most relevant ones. The performance of Dash is evaluated via extensive experimentation.
  • Keywords
    Internet; database management systems; indexing; information retrieval; search engines; search problems; Dash; MapReduce based algorithm; Web application; application code; cost minimization; database content; database crawling; database-generated dynamic Web page; db-page fragment indexing; db-page search; huge dataset; keyword search; search engine; searching; system design; top-k search algorithm; Educational institutions; Indexing; Keyword search; Search engines; Web pages; Database Crawling; Database-Generated Dynamic Web Pages; Hadoop and Performance; Indexing; MapReduce; Search Engine; Top-k Search;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Distributed Computing Systems (ICDCS), 2012 IEEE 32nd International Conference on
  • Conference_Location
    Macau
  • ISSN
    1063-6927
  • Print_ISBN
    978-1-4577-0295-2
  • Type
    conf
  • DOI
    10.1109/ICDCS.2012.53
  • Filename
    6258016