DocumentCode :
183047
Title :
Page query language generation for structural extraction
Author :
He Hu ; Xiaoyong Du
Author_Institution :
Sch. of Inf., Renmin Univ. of China, Beijing, China
fYear :
2014
fDate :
19-21 Aug. 2014
Firstpage :
605
Lastpage :
609
Abstract :
Information on the Web is usually designed to be understood by human users rather than machines, so it is not easy for a software agent alone to catalogue and extract it automatically. Based on these observations, we present an approach that uses human-guided operations to automatically generate a PQL query, which extracts the information fragments of interest from Web pages. PQL is an SQL-like query language focusing on Web pages. A PQL query uses XPath expressions to locate the target HTML nodes. We develop a K-Medoid clustering algorithm to process PQL queries and generate the structural extractions. The extracted information is structured as a relational table (in CSV format), which can be manipulated smoothly with spreadsheet software or a relational DBMS.
Keywords :
SQL; Web sites; cataloguing; hypermedia markup languages; pattern clustering; query processing; software agents; CSV format; HTML nodes; K-medoid clustering algorithm; PQL query generation; SQL like query language; Web information cataloguing; Web information extraction; Web pages; XPath expressions; human guided operations; page query language generation; relational DBMS system; relational table; software agent; spreadsheet software; structural extractions; Algorithm design and analysis; Browsers; Clustering algorithms; Data mining; Database languages; HTML; Web pages; Browser Extension; PQL; Structural Extraction;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Fuzzy Systems and Knowledge Discovery (FSKD), 2014 11th International Conference on
Conference_Location :
Xiamen
Print_ISBN :
978-1-4799-5147-5
Type :
conf
DOI :
10.1109/FSKD.2014.6980903
Filename :
6980903