• DocumentCode
    2126228
  • Title

    Cross Domain Assessment of Document to HTML Conversion Tools to Quantify Text and Structural Loss during Document Analysis

  • Author

    Goslin, Kyle ; Hofmann, Martin

  • Author_Institution
    Dept. of Inf. & Eng., Inst. of Technol. Blanchardstown, Dublin, Ireland
  • fYear
    2013
  • fDate
    12-14 Aug. 2013
  • Firstpage
    100
  • Lastpage
    105
  • Abstract
    During forensic text analysis, automation is key when working with large quantities of documents. Because documents come in a wide variety of file types, tailored tools must be developed for each document type to correctly identify and extract text elements for analysis without loss. These text extraction tools often omit sections of text that they cannot read, leaving drastic inconsistencies in the forensic text analysis process. As a solution, a single output format, HTML, was chosen as a unified analysis format. Document-to-HTML/CSS extraction tools, each using different techniques to convert common document formats to rich HTML/CSS counterparts, were tested. This approach can reduce the number of analysis tools needed during forensic text analysis by relying on a single document format. Two tests were designed, a 10-point document overview test and a 48-point detailed document analysis test, to assess and quantify the level of loss, the rate of error, and the overall quality of the output HTML structures. The study concluded that tools that combine several approaches and have an understanding of the document structure yield the best results with the least loss.
  • Keywords
    feature extraction; hypermedia markup languages; law; text analysis; text detection; HTML structures; HTML/CSS output; cross domain assessment; document analysis; document structure; document-to-HTML conversion tools; error rate; forensic text analysis process; structural loss; text extraction tools; text loss; unified analysis format; Forensics; HTML; Layout; Optical character recognition software; Portable document format; Text analysis
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Intelligence and Security Informatics Conference (EISIC), 2013 European
  • Conference_Location
    Uppsala
  • Type

    conf

  • DOI
    10.1109/EISIC.2013.22
  • Filename
    6657132