• DocumentCode
    257479
  • Title
    Exploiting limited data for parsing
  • Author
    Dongchen Li ; Xiantao Zhang ; Xihong Wu
  • Author_Institution
    Key Lab. of Machine Perception & Intell., Peking Univ., Beijing, China
  • fYear
    2014
  • fDate
    4-6 June 2014
  • Firstpage
    171
  • Lastpage
    175
  • Abstract
    Data sparsity is an extremely severe problem for parsers because of the flexibility of tree structures. Many tags and productions appear only rarely; nevertheless, they are crucial for parse disambiguation where they occur. Moreover, when a common tag occurs somewhat regularly in a non-canonical position, its distribution is usually distinct. In this paper, we propose a metric that measures the scarcity of any phrase with arbitrary span size. To strike a better compromise between training trees with high confidence and those with high scarcity, we impose constraints that respond to rare but informative categories when training latent variable grammars. We exploit the limited data more fully by capturing the descriptive power of rare tree structure configurations in the Expectation-Maximization procedure and the Split-Merge framework. The resulting grammars are interpretable, as intended. Based on this approach, we further propose a method that exploits the limited training data from multiple perspectives and accumulates their advantages in a product model. Despite its limited training data, our model improves parsing performance on the Penn Chinese Treebank Fifth Edition, even outperforming some systems that use extra unlabeled data and external resources. Furthermore, this method is easy to generalize to cope with data sparsity in other natural language processing tasks.
  • Keywords
    expectation-maximisation algorithm; grammars; merging; natural language processing; tree data structures; Penn Chinese treebank fifth edition; data sparsity; expectation and maximization procedure; latent variable grammar; natural language processing tasks; parse disambiguation; parsing performance improvement; rate tree structure configuration; scarcity measurement; split and merge framework; training trees; tree structure flexibility; Computational linguistics; Data models; Grammar; Merging; Production; Training; Vegetation;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Computer and Information Science (ICIS), 2014 IEEE/ACIS 13th International Conference on
  • Conference_Location
    Taiyuan
  • Type
    conf
  • DOI
    10.1109/ICIS.2014.6912128
  • Filename
    6912128