DocumentCode :
133976
Title :
Automated specification extraction for consolidated product catalogue
Author :
Hareendran, Stuthi ; Parashar, Anuvrat ; Khan, Farhat Ullah
Author_Institution :
Dept. of Comput. Sci. & Eng., Amity Univ., Noida, India
fYear :
2014
fDate :
1-2 March 2014
Firstpage :
1
Lastpage :
7
Abstract :
This paper develops and implements a methodology for extracting product specifications from HTML product-detail pages on various e-commerce portals. The extracted data must be in a standardised, uniform format that does not reflect the structure of its source. The most significant problem in designing a solution is the source of the data itself: because the data is fetched from many different portals, its format and structure vary from portal to portal. The paper considers two subproblems: data available in structured format and data available in unstructured format. For structured data, the methodology performs extraction using the information pattern contained in the underlying tree structure of the page's HTML content. For unstructured data, it uses pattern matching with regular expressions. The implementation uses Python, with tools such as Scrapy and LXML.
Keywords :
electronic commerce; hypermedia markup languages; pattern matching; portals; text analysis; tree data structures; HTML pages; LXML; Python; Scrapy; automated specification extraction; consolidated product catalogue; data source; data structure; e-commerce portals; information pattern; page HTML content; product details; product specification extraction; programming language; regular expressions; source format; tree structure; Data mining; Dictionaries; HTML; Pediatrics; Portals; e-commerce; information extraction; specification extraction
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Electrical, Electronics and Computer Science (SCEECS), 2014 IEEE Students' Conference on
Conference_Location :
Bhopal
Print_ISBN :
978-1-4799-2525-4
Type :
conf
DOI :
10.1109/SCEECS.2014.6804527
Filename :
6804527
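Illustrative sketch :
The two approaches named in the abstract (tree-based extraction for structured pages, regular expressions for unstructured text) might look roughly like the following. This is a minimal sketch and not the authors' implementation: the sample markup, the sample sentence, the field names, and the helpers `normalise_key`, `extract_structured`, and `extract_unstructured` are all hypothetical, and the standard library's `xml.etree.ElementTree` stands in for LXML so the example runs without third-party packages (it assumes well-formed markup, which real portal pages often are not).

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical, well-formed product-page fragment (not taken from the paper).
STRUCTURED_HTML = """
<table class="specs">
  <tr><td>Brand</td><td>Acme</td></tr>
  <tr><td>Screen Size</td><td>5.0 inch</td></tr>
  <tr><td>RAM</td><td>2 GB</td></tr>
</table>
"""

# Hypothetical free-text description with no markup structure to rely on.
UNSTRUCTURED_TEXT = "Acme phone with 2 GB RAM, 5.0 inch screen and 16 GB storage."


def normalise_key(label):
    """Map a portal-specific label to a uniform key, e.g. 'Screen Size' -> 'screen_size'."""
    return re.sub(r"\W+", "_", label.strip().lower()).strip("_")


def extract_structured(markup):
    """Walk the element tree and read each two-cell table row as a (label, value) pair."""
    root = ET.fromstring(markup.strip())
    specs = {}
    for row in root.iter("tr"):
        cells = ["".join(td.itertext()).strip() for td in row.findall("td")]
        if len(cells) == 2:
            specs[normalise_key(cells[0])] = cells[1]
    return specs


# Assumed patterns for illustration; a real system would need one set per attribute type.
PATTERNS = {
    "ram": re.compile(r"(\d+(?:\.\d+)?\s*GB)\s+RAM", re.IGNORECASE),
    "screen_size": re.compile(r"(\d+(?:\.\d+)?\s*inch)\s+screen", re.IGNORECASE),
    "storage": re.compile(r"(\d+(?:\.\d+)?\s*GB)\s+storage", re.IGNORECASE),
}


def extract_unstructured(text):
    """Apply each regular expression and keep the first match per attribute."""
    specs = {}
    for key, pattern in PATTERNS.items():
        match = pattern.search(text)
        if match:
            specs[key] = match.group(1)
    return specs


if __name__ == "__main__":
    print(extract_structured(STRUCTURED_HTML))
    print(extract_unstructured(UNSTRUCTURED_TEXT))
```

Both functions return plain dictionaries with normalised keys, so the output carries no trace of whether the value came from a table row or from free text, which is the uniform-format goal the abstract describes.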