• DocumentCode
    3187696
  • Title

    E-Clean: A Data Cleaning Framework for Patient Data

  • Author

    Mohamed, Hasimah Hj ; Kheng, Tee Leong ; Collin, Chee ; Lee, Ong Siong

  • Author_Institution
    Sch. of Comput. Sci., Univ. Sains Malaysia, Pulau, Malaysia
  • fYear
    2011
  • fDate
    12-14 Dec. 2011
  • Firstpage
    63
  • Lastpage
    68
  • Abstract
    We need to prepare quality data by pre-processing the raw data. Data cleaning, also called data cleansing or scrubbing, deals with detecting and removing errors and inconsistencies from data in order to improve data quality. Data cleaning systems are needed to support any changes in the structure, representation or content of data. The cleaning process has three parts: extracting invalid values, matching attributes with valid values, and applying a data cleaning algorithm. Our system uses the extract, transform and load model as its main process model, which serves as a guideline for the implementation of the system. In addition, parsing techniques are used to identify dirty data. The method we chose for matching attributes is regular expressions. Among the available data cleaning algorithms, the k-Nearest Neighbor algorithm was selected for the data cleaning part of this project because it is simple to understand and easy to implement.
  • Keywords
    attribute grammars; data handling; medical administrative data processing; E-Clean; data cleaning algorithm; data cleansing; data inconsistency; data scrubbing; dirty data identification; error detection; error removal; k-nearest neighbor algorithm; matching attributes; parsing techniques; patient data; raw data pre-processing; Classification algorithms; Cleaning; Data mining; Databases; Knowledge based systems; Load modeling; Transforms; data cleaning; k-Nearest Neighbor; regular expression;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Informatics and Computational Intelligence (ICI), 2011 First International Conference on
  • Conference_Location
    Bandung
  • Print_ISBN
    978-1-4673-0091-9
  • Type

    conf

  • DOI
    10.1109/ICI.2011.21
  • Filename
    6141651