Title :
Self-Adaptive Extracting Academic Entities from World Wide Web
Author :
Pingpeng Yuan;Yi Li;Hai Jin;Ling Liu
Author_Institution :
Services Comput. Technol. &
Abstract :
A huge number of entities and their relationships are published on the Web, and these entities and their relationship networks support many activities. In this paper, we focus on the task of extracting academic entity networks from homepages. Homepages usually contain many entities, such as persons, conferences/journals, and organizations, together with their relationships. However, homepages do not follow a unified layout format: they often contain similar information but differ greatly in layout and style, which makes it impossible to handle them all with a single set of rules. We therefore propose an integrated approach to automatically extract data from unstructured text. The main idea is to adopt the most suitable technique for each kind of content, which makes the approach self-adaptive. First, the approach decomposes web pages into text units, and a classifier determines each unit's type. Once the unit types are known, different techniques are chosen to handle them. For example, edit distance and an inverted index are used to identify entities such as names, while Conditional Random Fields are considered the best solution for extracting publication entries. The results show that LineX achieves high performance in extracting entities from web pages in the academic community.
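The abstract mentions edit distance combined with an inverted index for identifying names but gives no details. The following is only a minimal sketch of how such a combination might work, not the authors' implementation: the function names, the token-level index, the `max_distance` threshold, and the sample name list are all assumptions made for illustration.

```python
from collections import defaultdict


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via a rolling one-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def build_inverted_index(names):
    """Map each lowercase token to the set of known names containing it."""
    index = defaultdict(set)
    for name in names:
        for token in name.lower().split():
            index[token].add(name)
    return index


def match_name(text_unit, index, max_distance=2):
    """Return the known name closest to text_unit, or None if none is close.

    The index restricts candidates to names sharing at least one token
    with the unit, so edit distance is computed only for that subset.
    max_distance is a hypothetical threshold, not a value from the paper.
    """
    candidates = set()
    for token in text_unit.lower().split():
        candidates |= index.get(token, set())
    best, best_d = None, max_distance + 1
    for name in sorted(candidates):
        d = edit_distance(text_unit.lower(), name.lower())
        if d < best_d:
            best, best_d = name, d
    return best


# Hypothetical dictionary of known names, used only for this illustration.
KNOWN_NAMES = ["Pingpeng Yuan", "Yi Li", "Hai Jin", "Ling Liu"]
INDEX = build_inverted_index(KNOWN_NAMES)
```

Under these assumptions, a misspelled unit such as `"Pingpeng Yuen"` is still resolved because it shares the token `pingpeng` with a known name and lies within the distance threshold, while a unit sharing no token with any known name is rejected without computing any edit distance.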
Keywords :
"Web pages","Data mining","Knowledge based systems","Electronic mail","HTML","Internet","Organizations"
Conference_Title :
Collaboration and Internet Computing (CIC), 2015 IEEE Conference on
DOI :
10.1109/CIC.2015.33