• DocumentCode
    2830395
  • Title
    Text Extraction from the Web via Text-to-Tag Ratio
  • Author
    Weninger, Tim; Hsu, William H.
  • Author_Institution
    Comput. & Inf. Sci., Kansas State Univ., Manhattan, KS
  • fYear
    2008
  • fDate
    1-5 Sept. 2008
  • Firstpage
    23
  • Lastpage
    28
  • Abstract
    We describe a method to extract content text from diverse Web pages by using the HTML document's text-to-tag ratio rather than specific HTML cues that may not be constant across various Web pages. We describe how to compute the text-to-tag ratio on a line-by-line basis and then cluster the results into content and non-content areas. With this approach we then show surprisingly high levels of recall for all levels of precision, and a large space savings.
  • Keywords
    Internet; hypermedia markup languages; information retrieval; text analysis; HTML document; diverse Web pages; text extraction; text-to-tag ratio; Art; Cascading style sheets; Data mining; Databases; Expert systems; HTML; Histograms; Testing; Web pages; Histogram; Information Extraction; Web
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Database and Expert Systems Application, 2008. DEXA '08. 19th International Workshop on
  • Conference_Location
    Turin
  • ISSN
    1529-4188
  • Print_ISBN
    978-0-7695-3299-8
  • Type
    conf
  • DOI
    10.1109/DEXA.2008.12
  • Filename
    4624686
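
The abstract describes computing a text-to-tag ratio line by line and then separating content from non-content lines. The following is a minimal sketch of that idea, not the authors' implementation. It makes several assumptions that the record does not state: tags are matched with a simple regular expression, a line with no tags uses its text length as its ratio, and a plain mean threshold stands in for the clustering step. The names `text_to_tag_ratio` and `extract_content` are hypothetical.

```python
import re
from statistics import mean

# Assumption: anything between '<' and '>' counts as one tag.
TAG_RE = re.compile(r"<[^>]*>")


def text_to_tag_ratio(line: str) -> float:
    """Return non-tag character count divided by tag count for one line."""
    tag_count = len(TAG_RE.findall(line))
    text_len = len(TAG_RE.sub("", line).strip())
    # Assumption: with zero tags, the ratio is simply the text length.
    return text_len / max(tag_count, 1)


def extract_content(html: str) -> list[str]:
    """Keep the text of lines whose ratio exceeds the mean ratio.

    The mean threshold is a stand-in for the clustering step mentioned
    in the abstract; it is an illustrative choice, not the paper's method.
    """
    lines = [ln for ln in html.splitlines() if ln.strip()]
    if not lines:
        return []
    ratios = [text_to_tag_ratio(ln) for ln in lines]
    threshold = mean(ratios)
    return [
        TAG_RE.sub("", ln).strip()
        for ln, ratio in zip(lines, ratios)
        if ratio > threshold
    ]


if __name__ == "__main__":
    page = (
        '<div><a href="/">Home</a><a href="/about">About</a></div>\n'
        "<p>This is a long paragraph of article text that should be kept.</p>\n"
        "<div><span>ad</span></div>"
    )
    for kept in extract_content(page):
        print(kept)
```

In this sketch, navigation and advertisement lines carry many tags and little text, so their ratios fall below the threshold, while the paragraph line has a high ratio and is kept. A real page would likely need script and comment removal and some smoothing of the per-line ratios before thresholding.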