• DocumentCode
    3169483
  • Title

    Exploring Regularity in Source Code: Software Science and Zipf's Law

  • Author

    Zhang, Hongyu

  • Author_Institution
    Sch. of Software, Tsinghua Univ., Beijing
  • fYear
    2008
  • fDate
    15-18 Oct. 2008
  • Firstpage
    101
  • Lastpage
    110
  • Abstract
    Are there statistical regularities behind computer programming? In the 1970s, Halstead proposed the theory of software science, which attempted to describe some of these regularities based on the direct measurement of lexical tokens in programs. The famous software science length equation models the relationship between program length and vocabulary. By analyzing the source code of twelve Java software systems collected from public software repositories, we find that Halstead's length equation does not hold for large-scale modern software systems. We discover that the distribution of lexical tokens in the studied systems follows Zipf's law (or, more generally, the Zipf-Mandelbrot law), which is an empirical law in statistical natural language processing. Based on the discovery of Zipf's law, we propose a revised software science length equation for describing the vocabulary-length relationship. Our new equation fits the real data well and achieves better accuracy than the original equation. Our study reveals that we can discover statistical regularities behind computer programming by mining software repositories.
  • Keywords
    Java; programming; software engineering; Halstead length equation; Java software systems; Zipf law; computer programming; public software repositories; software repository mining; software science; source code; statistical natural language processing; Computer languages; Equations; Frequency; Natural language processing; Natural languages; Programming; Software measurement; Software systems; Vocabulary; Zipf's law; mining software repository; regularity; software metrics
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Reverse Engineering, 2008. WCRE '08. 15th Working Conference on
  • Conference_Location
    Antwerp
  • ISSN
    1095-1350
  • Print_ISBN
    978-0-7695-3429-9
  • Type

    conf

  • DOI
    10.1109/WCRE.2008.37
  • Filename
    4656399
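
The two relationships named in the abstract can be illustrated with a minimal Python sketch. It assumes the standard form of Halstead's length equation, N̂ = n1·log2(n1) + n2·log2(n2), where n1 and n2 are the numbers of distinct operators and operands, and the simplest form of Zipf's law, in which token frequency is proportional to 1/rank^a. The function names `halstead_estimated_length` and `zipf_slope` are hypothetical and chosen for illustration only. The revised length equation proposed in the paper is not stated in this record, so it is not reproduced here.

```python
import math
from collections import Counter


def halstead_estimated_length(n1: int, n2: int) -> float:
    """Standard Halstead length estimate from vocabulary sizes.

    n1: number of distinct operators, n2: number of distinct operands.
    """
    if n1 < 1 or n2 < 1:
        raise ValueError("vocabulary sizes must be positive")
    return n1 * math.log2(n1) + n2 * math.log2(n2)


def zipf_slope(tokens) -> float:
    """Least-squares slope of log(frequency) against log(rank).

    Under Zipf's law the points fall near a straight line; a slope
    close to -1 corresponds to the classic frequency ~ 1/rank form.
    """
    freqs = sorted(Counter(tokens).values(), reverse=True)
    if len(freqs) < 2:
        raise ValueError("need at least two distinct tokens")
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(freq) for freq in freqs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den


if __name__ == "__main__":
    # Toy token stream whose frequencies are exactly 12 / rank.
    toy_tokens = ["a"] * 12 + ["b"] * 6 + ["c"] * 4 + ["d"] * 3
    print(halstead_estimated_length(4, 8))
    print(zipf_slope(toy_tokens))
```

In practice the tokens would come from a lexer run over a code base; the toy list above only stands in for that input. Fitting the more general Zipf-Mandelbrot form, which adds a rank offset, would need a nonlinear fit and is left out of this sketch.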