Regional Information Center for Science and Technology - Detecting environmental disasters in digital news archives

Abstract:

Automatically extracting events from large, unstructured or semi-structured textual data requires a mechanism for identifying the event, abstracting it from the text, validating the event's occurrence against known values, and sharing the event with users effectively. A challenge inherent in Big Data is that it often exceeds the scale at which humans can operate effectively. In this paper, we focus on the domain of archived newspaper articles and describe a system that generates a collection of event summaries from unstructured text, extracts a geographic marker for each event, and stores both in an online database that can be searched and/or visualized using an interactive map. The system relies on text mining techniques to select a dataset of news stories from a digital news archive and extracts 1-2 sentences for each event to be stored in the database. We illustrate this approach with a flood database case study, automatically extracting descriptions of past flooding events in Nova Scotia, Canada, from a 20-year archive of regional newspaper articles. We validate our event extraction in two dimensions (identification of articles that mention flood events, and identification of accurate geographic markers in articles about flood events), using Amazon's Mechanical Turk (MTurk) to obtain human assessments at scale.
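
As a rough illustration of the kind of pipeline the abstract outlines (select flood-related text, keep 1-2 sentences, attach a geographic marker), the following Python sketch may help. It is not the system described in the paper: the keyword pattern, the tiny gazetteer with approximate coordinates, the naive sentence splitter, and the function names are all assumptions made for this example.

```python
import re

# Hypothetical keyword filter; the paper's actual text mining filters are not given here.
FLOOD_PATTERN = re.compile(r"\bflood(?:s|ed|ing)?\b", re.IGNORECASE)

# Hypothetical two-entry gazetteer with approximate (latitude, longitude) values.
GAZETTEER = {
    "Halifax": (44.65, -63.57),
    "Truro": (45.37, -63.28),
}


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def extract_event(article, max_sentences=2):
    """Return a short summary and a geographic marker, or None if no flood term is found."""
    hits = [s for s in split_sentences(article) if FLOOD_PATTERN.search(s)]
    if not hits:
        return None
    summary = " ".join(hits[:max_sentences])
    # Take the first gazetteer name that appears as a whole word in the summary.
    place = next(
        (name for name in GAZETTEER
         if re.search(rf"\b{re.escape(name)}\b", summary)),
        None,
    )
    return {"summary": summary, "place": place, "coords": GAZETTEER.get(place)}


article = (
    "Heavy rain fell overnight. Flooding closed several roads in Truro on Monday. "
    "Crews expect repairs to take days."
)
event = extract_event(article)
```

A record such as `event` could then be written to a database and plotted on a map. A real system would need a far larger gazetteer and some way to disambiguate place names, which is part of what human validation of the geographic markers would check.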