• DocumentCode
    3462834
  • Title
    Practical token retrieval and indexing from binary data: An application in computer aided design
  • Author
    Gruber, M.; Geschray, R.; Hillbrand, C.
  • Author_Institution
    V-Res. GmbH, Dornbirn, Austria
  • fYear
    2011
  • fDate
    25-27 Aug. 2011
  • Firstpage
    123
  • Lastpage
    126
  • Abstract
    In many commercial applications, proprietary file formats make it difficult to access the generated data. In the worst case, interoperability is impeded even further by shortcomings in interface technology. The objective of this work is to find out whether textual data can be retrieved from certain binary files at a quality sufficient to build a useful index. We propose a method to parse and filter binary data in multiple stages. Besides stop-words, we use whitelists as well as phonetic and phonotactic criteria to create token data while minimizing noise. The results are promising: with a few simple steps, we are able to filter out most of the invalid tokens while preserving abbreviations and terms such as company names, even though they are not in a dictionary.
  • Keywords
    CAD; indexing; information retrieval; binary data; binary files; computer aided design; interface technology; phonetic; phonotactic criteria; proprietary file formats; textual data retrieval; token retrieval; whitelists; worst case interoperability; Design automation; Encoding; Filtering algorithms; ISO standards; Indexes; Law; Software
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Logistics and Industrial Informatics (LINDI), 2011 3rd IEEE International Symposium on
  • Conference_Location
    Budapest
  • Print_ISBN
    978-1-4577-1842-7
  • Type
    conf
  • DOI
    10.1109/LINDI.2011.6031132
  • Filename
    6031132
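
The multi-stage filtering summarized in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give the word lists, thresholds, or rules used in the paper, so the names `STOP_WORDS`, `WHITELIST`, `looks_pronounceable`, and `extract_tokens`, the three-letter minimum, and the crude vowel and consonant-run check standing in for the phonotactic criteria are all assumptions made for illustration.

```python
import re

# Hypothetical example lists; the paper's actual lists are not given in the abstract.
STOP_WORDS = {"the", "and", "of"}
WHITELIST = {"CAD", "DXF"}  # abbreviations to keep even if they fail later checks

# Stage 1: runs of at least three ASCII letters found in the raw bytes (assumed minimum).
LETTER_RUN = re.compile(rb"[A-Za-z]{3,}")

# Stage 3: a rough phonotactic stand-in. Reject words with no vowel
# or with four or more non-vowel letters in a row (assumed threshold).
VOWELS = set("aeiouy")
CONSONANT_RUN = re.compile(r"[^aeiouy]{4,}")


def looks_pronounceable(word: str) -> bool:
    """Return True if the word has a vowel and no long consonant cluster."""
    lower = word.lower()
    has_vowel = any(ch in VOWELS for ch in lower)
    return has_vowel and CONSONANT_RUN.search(lower) is None


def extract_tokens(data: bytes) -> list[str]:
    """Extract candidate index tokens from binary data in successive stages."""
    tokens = []
    for match in LETTER_RUN.finditer(data):
        word = match.group().decode("ascii")
        if word in WHITELIST:
            tokens.append(word)        # whitelisted terms bypass the other filters
        elif word.lower() in STOP_WORDS:
            continue                   # Stage 2: drop stop-words
        elif looks_pronounceable(word):
            tokens.append(word)        # Stage 3: keep plausible words
    return tokens


if __name__ == "__main__":
    sample = b"\x00\x01the\x00CAD\xffxqzt\x02bracket\x00\x10kkkk\x7fof\x00"
    print(extract_tokens(sample))
```

Checking the whitelist before the other filters is what lets abbreviations survive even though they would fail a pronounceability test. A fixed consonant-run limit like the one assumed here also rejects some legitimate words (for example, "strength"), so a real system would need language-specific rules.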