DocumentCode :
2435149
Title :
Identifying named entities on a University intranet
Author :
Althobaiti, Maha ; Kruschwitz, Udo ; Poesio, Massimo
Author_Institution :
Sch. of Comput. Sci. & Electron. Eng., Univ. of Essex, Colchester, UK
fYear :
2012
fDate :
12-13 Sept. 2012
Firstpage :
94
Lastpage :
99
Abstract :
Named entities (NEs) are textual references via proper names, such as people names, company names, places and so on. The importance of NEs has been observed in intranet search engines, including university web sites. In this paper, a mechanism is built exclusively to recognize the three named entities, which are constantly referenced in the University of Essex domain: names, course codes, and room numbers. While a person name is considered a common named entity, course codes and room numbers are specific to the University domain. We developed a technique specifically to train three different classifiers on electronic corpora, consisting of 16,629 examples in total, which were collected and annotated manually from the University domain. The resulting models were then incorporated into the NER system that was built to use pre-trained classifiers in the detection process, mark these NEs, and cross-reference them to the related documents. The proposed method performed well on a test corpus, with the average precision reaching nearly 0.97. The recall varied, but was lower overall than precision with an average of 0.82. Moreover, in terms of name recognition in the University domain, our system outperformed two other systems: the OpenNLP name finder and ANNIE system.
Keywords :
Web sites; educational computing; educational institutions; information retrieval; intranets; learning (artificial intelligence); natural language processing; search engines; statistical analysis; text analysis; ANNIE system; NE; OpenNLP name finder; University of Essex; company names; corpus-based methods; course codes; detection process; electronic corpora; information extraction; machine learning; name recognition; named entity identification; named entity recognition; natural language processing; people names; pretrained classifiers; proper names; room numbers; statistical approach; textual references; university Web sites; university domain; university intranet search engines; Computer science; Educational institutions; Entropy; Search engines; Training; Training data; Web pages; Corpus-based methods; Information Extraction from the web; Machine learning; Named Entity Recognition; Natural Language Processing; Statistical approach;
fLanguage :
English
Publisher :
IEEE
Conference_Title :
Computer Science and Electronic Engineering Conference (CEEC), 2012 4th
Conference_Location :
Colchester
Print_ISBN :
978-1-4673-2665-0
Type :
conf
DOI :
10.1109/CEEC.2012.6375385
Filename :
6375385