DocumentCode
2053906
Title
Extracting Geospatial Entities from Wikipedia
Author
Witmer, Jeremy ; Kalita, Jugal
Author_Institution
Colorado Springs Comput. Sci. Dept., Univ. of Colorado, Colorado Springs, CO, USA
fYear
2009
fDate
14-16 Sept. 2009
Firstpage
450
Lastpage
457
Abstract
This paper addresses the challenge of extracting geospatial data from the article text of the English Wikipedia. In the first phase of our work, we create a training corpus and select a set of word-based features to train a Support Vector Machine (SVM) for the task of geospatial named entity recognition. For testing, we target a corpus of Wikipedia articles about battles and wars, as these have a high incidence of geospatial content. The SVM recognizes place names in the corpus with very high recall, close to 100%, and acceptable precision. The set of geospatial named entities (NEs) is then fed into a geocoding and resolution process, whose goal is to determine the correct coordinates for each place name. Because many place names are ambiguous and do not immediately geocode to a single location, we present a data structure and algorithm that resolve ambiguity based on sentence and article context, so that the correct coordinates can be selected. We achieve an F-measure of 82% and create a set of geospatial entities for each article, combining the place names, spatial locations, and an assumed point geometry. These entities can enable geospatial search and geovisualization of Wikipedia.
Keywords
Web sites; geographic information systems; natural language processing; Wikipedia; assumed point geometry; geospatial data; geospatial entities; geospatial named entity recognition; geospatial search; geovisualization; place names; spatial location; support vector machine; training corpus; word-based features; Computer science; Data mining; Databases; Hidden Markov models; Internet; Open source software; Springs; Support vector machine classification; Support vector machines; NER; Wikipedia extraction; geospatial entity recognition; geospatial extraction; location extraction
fLanguage
English
Publisher
IEEE
Conference_Titel
Semantic Computing, 2009. ICSC '09. IEEE International Conference on
Conference_Location
Berkeley, CA
Print_ISBN
978-1-4244-4962-0
Electronic_ISBN
978-0-7695-3800-6
Type
conf
DOI
10.1109/ICSC.2009.62
Filename
5298641