• DocumentCode
    3128147
  • Title

    Automatic Cleaning and Linking of Historical Census Data Using Household Information

  • Author

    Fu, Zhichun ; Christen, Peter ; Boot, Mac

  • Author_Institution
    Res. Sch. of Comput. Sci., Australian Nat. Univ. Canberra, Canberra, ACT, Australia
  • fYear
    2011
  • fDate
    11-11 Dec. 2011
  • Firstpage
    413
  • Lastpage
    420
  • Abstract
    Historical census data capture information about our ancestors. These data record social status at a certain point in time and contain valuable information for genealogists, historians, and social scientists. Historical census data can be used to reconstruct important aspects of a particular era in order to trace the changes in households and families. Record linkage across different historical census datasets can help to improve the quality of the data, enrich existing census data with additional information, and facilitate improved retrieval of information. In this paper, we introduce a domain-driven approach to automatically clean and link historical census data based on recent developments in group linkage techniques. The key contribution of our approach is to first detect households, and to use this information to refine the cleaned data and improve the accuracy of linking records between census datasets. We have developed a two-step linking approach, which first links individual records using approximate string similarity measures, and then performs a group linking based on the previously detected households. The results show that this approach is effective and can greatly reduce the manual effort required for data cleaning and linking by social scientists.
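
    The two-step idea described in the abstract (link individual records by approximate string similarity, then score whole households) might be sketched roughly as follows. This is an illustrative sketch only, not the authors' implementation: the field names (`first`, `surname`), the 0.8 threshold, the use of Python's `difflib` ratio as the string measure, the greedy one-to-one matching, and the normalisation of the household score are all assumptions made for the example.

```python
from difflib import SequenceMatcher


def name_similarity(a, b):
    """Approximate string similarity in [0, 1] (difflib ratio, a stand-in measure)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def record_similarity(r1, r2):
    """Average similarity over first name and surname (hypothetical fields)."""
    return (name_similarity(r1["first"], r2["first"])
            + name_similarity(r1["surname"], r2["surname"])) / 2


def link_records(old, new, threshold=0.8):
    """Step 1: return (i, j, score) for record pairs scoring at or above threshold."""
    links = []
    for i, r1 in enumerate(old):
        for j, r2 in enumerate(new):
            score = record_similarity(r1, r2)
            if score >= threshold:
                links.append((i, j, score))
    return links


def household_similarity(h1, h2, threshold=0.8):
    """Step 2: greedily match members one-to-one, then normalise the summed
    scores by the number of distinct people across both households."""
    links = sorted(link_records(h1, h2, threshold), key=lambda t: -t[2])
    used1, used2, total = set(), set(), 0.0
    for i, j, score in links:
        if i not in used1 and j not in used2:
            used1.add(i)
            used2.add(j)
            total += score
    return total / (len(h1) + len(h2) - len(used1))


# Made-up households: one from an earlier census, two candidates from a later one.
h_a = [{"first": "John", "surname": "Smith"},
       {"first": "Mary", "surname": "Smith"},
       {"first": "Ann", "surname": "Smith"}]
h_b = [{"first": "Jon", "surname": "Smith"},
       {"first": "Mary", "surname": "Smith"},
       {"first": "Thomas", "surname": "Smith"}]
h_c = [{"first": "John", "surname": "Smyth"},
       {"first": "Eliza", "surname": "Smyth"}]

candidates = {"h_b": h_b, "h_c": h_c}
best = max(candidates, key=lambda k: household_similarity(h_a, candidates[k]))
print(best)
```

    In this toy data a single record ("John Smith" versus "John Smyth") links well on its own, but the household-level score prefers the candidate in which more members match, which is the kind of disambiguation that group information is meant to provide.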
  • Keywords
    demography; history; information retrieval; records management; approximate string similarity measures; automatic cleaning; group linkage techniques; historical census data; household information; record linkage; social status; two-step linking approach; Cleaning; Computer science; Couplings; Educational institutions; Electronic mail; Joining processes; Magnetic heads; data cleaning; domain knowledge; group linking
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Data Mining Workshops (ICDMW), 2011 IEEE 11th International Conference on
  • Conference_Location
    Vancouver, BC
  • Print_ISBN
    978-1-4673-0005-6
  • Type

    conf

  • DOI
    10.1109/ICDMW.2011.35
  • Filename
    6137409