DocumentCode
3462834
Title
Practical token retrieval and indexing from binary data: An application in computer aided design
Author
Gruber, M. ; Geschray, R. ; Hillbrand, C.
Author_Institution
V-Res. GmbH, Dornbirn, Austria
fYear
2011
fDate
25-27 Aug. 2011
Firstpage
123
Lastpage
126
Abstract
In many commercial applications, proprietary file formats make it difficult to access the generated data. In the worst case, interoperability is impeded even further by shortcomings in interface technology. The objective of this work is to determine whether textual data can be retrieved from certain binary files with sufficient quality to build a useful index. We propose a method to parse and filter binary data in multiple stages. Besides stop-words, we use whitelists as well as phonetic and phonotactic criteria to create token data while minimizing noise. The results are promising: with a few simple steps, we are able to filter out most of the invalid tokens while preserving abbreviations and terms such as company names, even though they are not in a dictionary.
Keywords
CAD; indexing; information retrieval; binary data; binary files; computer aided design; interface technology; phonetic; phonotactic criteria; proprietary file formats; textual data retrieval; token retrieval; whitelists; worst case interoperability; Design automation; Encoding; Filtering algorithms; ISO standards; Indexes; Law; Software
fLanguage
English
Publisher
IEEE
Conference_Titel
Logistics and Industrial Informatics (LINDI), 2011 3rd IEEE International Symposium on
Conference_Location
Budapest
Print_ISBN
978-1-4577-1842-7
Type
conf
DOI
10.1109/LINDI.2011.6031132
Filename
6031132