DocumentCode :
3571008
Title :
Automatically classifying and interpreting polar datasets with Apache Tika
Author :
Burgess, Ann B. ; Mattmann, Chris A.
Author_Institution :
Comput. Sci. Dept., Univ. of Southern California, Los Angeles, CA, USA
fYear :
2014
Firstpage :
863
Lastpage :
867
Abstract :
The Arctic and Antarctic are undergoing rapid change attributed to the Earth's changing climate. This change is captured via space and airborne remote sensing, in-situ measurement, and climate modeling. Those observations and simulations record data in myriad formats across a number of pertinent data archives funded by NSF, NASA, NOAA, and other federal agencies. Simply finding data may be hard, but we restrict our focus in this paper to the subject of what to do with the data (and metadata) once it is found - the "complexity" portion of the Big Data challenge. We present our current efforts for dealing with the complexity and heterogeneity of Arctic and Antarctic data - Apache Tika. Apache Tika is an open source framework for metadata exploration, automatic text mining, and information retrieval of 1200 of the most widely used data file formats and 20 rich metadata models to go along with those formats. Our current research efforts are targeted at expanding Apache Tika to parse, extract, and analyze common data formats used in Arctic and Antarctic research, making them more easily accessible, searchable, and retrievable by all major content management systems (Plone, Drupal, Alfresco, etc.). Furthermore, expanding Tika to handle common Polar data formats will also naturally invite the technology/open source community to deal with Polar use cases, helping to draw attention to and increase understanding of these remote regions.
Keywords :
Big Data; geophysics computing; pattern classification; Antarctic; Apache Tika; Arctic; airborne remote sensing; climate change; climate modeling; content management system; data archives; data complexity; data heterogeneity; in-situ measurement; information retrieval; metadata exploration; metadata models; polar dataset classification; polar dataset interpretation; space remote sensing; text mining; Antarctica; Content management; Data mining; Ice; MATLAB; NASA; MIME; Polar; Tika; metadata; open source
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Information Reuse and Integration (IRI), 2014 IEEE 15th International Conference on
Type :
conf
DOI :
10.1109/IRI.2014.7051982
Filename :
7051982