• DocumentCode
    2053906
  • Title
    Extracting Geospatial Entities from Wikipedia
  • Author
    Witmer, Jeremy ; Kalita, Jugal
  • Author_Institution
    Colorado Springs Comput. Sci. Dept., Univ. of Colorado, Colorado Springs, CO, USA
  • fYear
    2009
  • fDate
    14-16 Sept. 2009
  • Firstpage
    450
  • Lastpage
    457
  • Abstract
    This paper addresses the challenge of extracting geospatial data from the article text of the English Wikipedia. In the first phase of our work, we create a training corpus and select a set of word-based features to train a Support Vector Machine (SVM) for the task of geospatial named entity recognition. For testing, we target a corpus of Wikipedia articles about battles and wars, as these have a high incidence of geospatial content. The SVM recognizes place names in the corpus with very high recall, close to 100%, and acceptable precision. The set of geospatial NEs is then fed into a geocoding and resolution process, whose goal is to determine the correct coordinates for each place name. Because many place names are ambiguous and do not immediately geocode to a single location, we present a data structure and algorithm that resolve ambiguity based on sentence and article context, so that the correct coordinates can be selected. We achieve an F-measure of 82% and create a set of geospatial entities for each article, combining the place names, spatial locations, and an assumed point geometry. These entities can enable geospatial search on and geovisualization of Wikipedia.
  • Keywords
    Web sites; geographic information systems; natural language processing; Wikipedia; assumed point geometry; geospatial data; geospatial entities; geospatial named entity recognition; geospatial search; geovisualization; place names; spatial location; support vector machine; training corpus; word-based features; Computer science; Data mining; Databases; Hidden Markov models; Internet; Open source software; Springs; Support vector machine classification; Support vector machines; NER; Wikipedia extraction; geospatial entity recognition; geospatial extraction; location extraction
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Semantic Computing, 2009. ICSC '09. IEEE International Conference on
  • Conference_Location
    Berkeley, CA
  • Print_ISBN
    978-1-4244-4962-0
  • Electronic_ISBN
    978-0-7695-3800-6
  • Type
    conf
  • DOI
    10.1109/ICSC.2009.62
  • Filename
    5298641