• DocumentCode
    8138
  • Title

    Automatic General-Purpose Sanitization of Textual Documents

  • Author

    Sanchez, David ; Batet, Montserrat ; Viejo, Alexandre

  • Author_Institution
    Dept. d'Eng. Inf. i Mat., Univ. Rovira i Virgili, Tarragona, Spain
  • Volume
    8
  • Issue
    6
  • fYear
    2013
  • fDate
    Jun-13
  • Firstpage
    853
  • Lastpage
    862
  • Abstract
    The advent of new information sharing technologies has led society to a scenario where thousands of textual documents are publicly published every day. The existence of confidential information in many of these documents motivates the use of measures to hide sensitive data before publication, which is precisely the goal of document sanitization. Even though methods to assist the sanitization process have been proposed, most of them focus on detecting specific types of sensitive entities for concrete domains, lacking generality and requiring user supervision. Moreover, to hide sensitive terms, most approaches opt to remove them, a measure that hampers the utility of the sanitized document. This paper presents a general-purpose sanitization method that, based on information theory and exploiting knowledge bases, detects and hides sensitive textual information while preserving its meaning. Our proposal works in an automatic and unsupervised way and can be applied to heterogeneous documents, which makes it especially suitable for environments with massive and heterogeneous information-sharing needs. Evaluation results show that our method outperforms strategies based on trained classifiers in detection recall, and that it retains the document's utility better than term-suppression methods.
  • Keywords
    data privacy; information theory; publishing; security of data; text analysis; automatic general-purpose sanitization method; confidential information; data publishing; document sanitization goal; information sharing technologies; information theory; knowledge bases; sensitive textual information detection; textual documents; Companies; Context; Data privacy; Government; Knowledge based systems; Manuals; Proposals; Data publishing; document sanitization; information theory; privacy
  • fLanguage
    English
  • Journal_Title
    Information Forensics and Security, IEEE Transactions on
  • Publisher
    IEEE
  • ISSN
    1556-6013
  • Type

    jour

  • DOI
    10.1109/TIFS.2013.2239641
  • Filename
    6410029