Title :
Regular expression-based reference metadata extraction from the web
Author :
Tang, Xiaoyu ; Zeng, Qingtian ; Cui, Tingting ; Wu, Zeze
Author_Institution :
Coll. of Inf. Sci. & Eng., Shandong Univ. of Sci. & Technol., Qingdao, China
Abstract :
Accurate reference metadata extraction has become an intriguing task for researchers who want to collect data on scientific publications. In this paper, we introduce an approach to extracting reference metadata based on regular expressions. We built a prototype system named “Goldrusher” that automatically extracts data from the website of the Association for Computing Machinery (ACM). The experimental results show that our regular expression-based method can effectively extract author names, article titles, journal titles, DOIs, etc.
Keywords :
Internet; Web sites; information retrieval; meta data; Association for Computing Machinery; Goldrusher; Web site; World Wide Web; accurate reference metadata extraction; regular expression; Books; Crawlers; Data mining; HTML; Libraries; Machinery; Web pages
Conference_Title :
Web Society (SWS), 2010 IEEE 2nd Symposium on
Conference_Location :
Beijing
Print_ISBN :
978-1-4244-6356-5
DOI :
10.1109/SWS.2010.5607427
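Example :
The abstract describes extracting reference fields with regular expressions but does not give the patterns used by Goldrusher. The following is only a minimal sketch of the general idea, assuming a hypothetical plain-text citation layout ("Authors. Year. Title. Journal Volume, Issue, Pages. DOI=..."). The pattern, the function name `extract_reference_metadata`, and the sample citation are illustrative assumptions and are not taken from the paper.

```python
import re

# Assumed citation layout (hypothetical, not from the paper):
#   Authors. Year. Title. Journal Volume, Issue, Pages. DOI=<doi>
REFERENCE_PATTERN = re.compile(
    r"^(?P<authors>.+?)\.\s+"             # author list, up to the period before the year
    r"(?P<year>\d{4})\.\s+"               # four-digit publication year
    r"(?P<title>.+?)\.\s+"                # article title, up to the next period
    r"(?P<journal>.+?)\s+\d+,"            # journal title, followed by the volume number
    r".*?DOI=(?P<doi>10\.\d{4,9}/\S+)$"   # DOI at the end of the citation
)


def extract_reference_metadata(reference):
    """Return a dict of fields, or None if the citation does not fit the assumed layout."""
    match = REFERENCE_PATTERN.match(reference.strip())
    if match is None:
        return None
    fields = match.groupdict()
    # Split the comma-separated author list into individual names.
    fields["authors"] = [
        name.strip() for name in fields["authors"].split(",") if name.strip()
    ]
    return fields


# Made-up sample citation, used only to exercise the pattern.
SAMPLE = (
    "Jane Doe, Richard Roe. 2009. An example article title. "
    "Example Journal 12, 3, 1-10. DOI=10.1234/example.5678"
)

if __name__ == "__main__":
    print(extract_reference_metadata(SAMPLE))
```

A single pattern like this only handles one citation layout; text that does not match returns None, so a real extractor would likely need one pattern per layout found on the target pages.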