• DocumentCode
    322419
  • Title

    Inferring lexical and grammatical structure from sequences

  • Author

    Manning, Craig G Nevill ; Witten, Ian H.

  • Author_Institution
    Dept. of Biochem., Stanford Univ., CA, USA
  • fYear
    1997
  • fDate
    11-13 Jun 1997
  • Firstpage
    265
  • Lastpage
    274
  • Abstract
    In a wide variety of sequences from various sources, from music and text to DNA and computer programs, two different but related kinds of structure can be discerned. First, some segments tend to be repeated exactly, such as motifs in music, words or phrases in text, identifiers and syntactic idioms in computer programs. Second, these segments interact with each other in variable but constrained ways. For example, in English text only certain syntactic word classes can appear after the word `the´ many parts of speech (such as verbs) are necessarily excluded. This paper shows how these kinds of structure can be inferred automatically from sequences. We begin with an example that both illustrates the utility of inferring the kinds of structure we seek and shows what our techniques can do. Next we present an efficient and non-obvious algorithm for identifying exact repetitions-including nested repetitions-in time which is linear with the length of the sequence. Then we describe a very simple algorithm for identifying interactions between sequence elements. The focus of this paper is on how these two algorithms can work together, for their combination is far more powerful than either alone. We show how they combine to generate the kind of structure sought in the original motivating example. Although the two methods work well together on many simple examples, the results frequently conflict with intuition in the inference of branching structure. The minimum description length principle seems to provide the only satisfactory general approach
  • Keywords
    grammars; inference mechanisms; sequences; DNA; branching structure inference; computer programs; efficient algorithm; exact repetitions; grammatical structure; lexical structure; minimum description length principle; music; nested repetitions; sequences; text; Automata; Biochemistry; Computer science; DNA computing; Fractals; Frequency; Large-scale systems; Sequences; Speech; Switches;
  • fLanguage
    English
  • Publisher
    ieee
  • Conference_Titel
    Compression and Complexity of Sequences 1997. Proceedings
  • Conference_Location
    Salerno
  • Print_ISBN
    0-8186-8132-2
  • Type

    conf

  • DOI
    10.1109/SEQUEN.1997.666921
  • Filename
    666921