• DocumentCode
    579464
  • Title
    Demographics Identification: Variable Extraction Resource (DIVER)
  • Author
    Hsieh, Alexander ; Doan, Son ; Conway, Michael ; Lin, Ko-Wei ; Kim, Hyeoneui
  • Author_Institution
    Dept. of Med., Univ. of California, San Diego, La Jolla, CA, USA
  • fYear
    2012
  • fDate
    27-28 Sept. 2012
  • Firstpage
    40
  • Lastpage
    49
  • Abstract
    Lack of standardization in representing phenotype data generated in different studies is a major barrier to data reuse for cross-study analyses. To address this issue, we developed DIVER, a tool that identifies and standardizes demographic variables in dbGaP, based on simple natural language processing and standardized terminology mapping. In its evaluation using variables (N=3,565) from a range of pulmonary studies in dbGaP, DIVER proved to be an effective approach to standardizing dbGaP variables by successfully identifying demographic variables with high rates of recall and precision (98% and 94%, respectively). In addition, DIVER correctly modeled 79% of the identified demographic variables at the core semantic level. Examination of variables that DIVER could not handle shed light on where our tool needs enhancement so it can further improve its semantic modeling accuracy. DIVER is an important component of a system for phenotype discovery in dbGaP studies.
  • Keywords
    demography; medical computing; natural language processing; standardisation; DIVER tool; cross-study analysis; data reuse; dbGaP variable standardization; demographic variable identification; demographic variable standardization; phenotype data generation; precision rates; pulmonary studies; recall rates; semantic modeling accuracy improvement; variable extraction resource; Bioinformatics; Data mining; Dictionaries; Genomics; Semantics; Standardization; Unified modeling language; data standardization; dbGaP; phenotype variables
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Healthcare Informatics, Imaging and Systems Biology (HISB), 2012 IEEE Second International Conference on
  • Conference_Location
    San Diego, CA
  • Print_ISBN
    978-1-4673-4803-4
  • Type
    conf
  • DOI
    10.1109/HISB.2012.17
  • Filename
    6366187