DocumentCode
2126228
Title
Cross Domain Assessment of Document to HTML Conversion Tools to Quantify Text and Structural Loss during Document Analysis
Author
Goslin, Kyle ; Hofmann, Martin
Author_Institution
Dept. of Inf. & Eng., Inst. of Technol. Blanchardstown, Dublin, Ireland
fYear
2013
fDate
12-14 Aug. 2013
Firstpage
100
Lastpage
105
Abstract
Automation is key during forensic text analysis when working with large quantities of documents. Because documents come in a wide variety of file types, tailored tools must be developed for each document type to correctly identify and extract text elements for analysis without loss. These text extraction tools often omit sections of text that they cannot read, leaving drastic inconsistencies in the forensic text analysis process. As a solution, a single output format, HTML, was chosen as a unified analysis format. Document-to-HTML/CSS extraction tools, each using different techniques to convert common document formats to rich HTML/CSS counterparts, were tested. By relying on a single document format, this approach can reduce the number of analysis tools needed during forensic text analysis. Two tests were designed, a 10-point document overview test and a 48-point detailed document analysis test, to assess and quantify the level of loss, the rate of error, and the overall quality of the output HTML structures. The study concluded that tools that combine a number of different approaches and have an understanding of the document structure yield the best results with the least loss.
Keywords
feature extraction; hypermedia markup languages; law; text analysis; text detection; HTML structures; HTML/CSS output; cross domain assessment; document analysis; document structure; document-to-HTML conversion tools; error rate; forensic text analysis process; structural loss; text extraction tools; text loss; unified analysis format; Forensics; HTML; Layout; Optical character recognition software; Portable document format; Text analysis;
fLanguage
English
Publisher
IEEE
Conference_Titel
Intelligence and Security Informatics Conference (EISIC), 2013 European
Conference_Location
Uppsala
Type
conf
DOI
10.1109/EISIC.2013.22
Filename
6657132