• DocumentCode
    183047
  • Title
    Page query language generation for structural extraction
  • Author
    He Hu ; Xiaoyong Du
  • Author_Institution
    Sch. of Inf., Renmin Univ. of China, Beijing, China
  • fYear
    2014
  • fDate
    19-21 Aug. 2014
  • Firstpage
    605
  • Lastpage
    609
  • Abstract
    Information on the Web is usually designed to be understood by human users rather than machines. It's not easy to automatically catalogue and extract Web information solely with a software agent. Based on these observations, we present an approach that uses human-guided operations to automatically generate a query in PQL, an SQL-like query language focusing on Web pages, to extract the information fragments of interest from Web pages. The PQL query uses XPath expressions to locate the target HTML nodes. We develop a K-Medoid clustering algorithm to process PQL queries and generate the structural extractions. The extracted information is structured as a relational table (in CSV format), which can be manipulated smoothly with spreadsheet software or a relational DBMS.
  • Keywords
    SQL; Web sites; cataloguing; hypermedia markup languages; pattern clustering; query processing; software agents; CSV format; HTML nodes; K-medoid clustering algorithm; PQL query generation; SQL like query language; Web information cataloguing; Web information extraction; Web pages; XPath expressions; human guided operations; page query language generation; relational DBMS system; relational table; software agent; spreadsheet software; structural extractions; Algorithm design and analysis; Browsers; Clustering algorithms; Data mining; Database languages; HTML; Web pages; Browser Extension; PQL; Structural Extraction;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Fuzzy Systems and Knowledge Discovery (FSKD), 2014 11th International Conference on
  • Conference_Location
    Xiamen
  • Print_ISBN
    978-1-4799-5147-5
  • Type
    conf
  • DOI
    10.1109/FSKD.2014.6980903
  • Filename
    6980903