• DocumentCode
    1764763
  • Title
    Active learning and effort estimation: Finding the essential content of software effort estimation data
  • Author
    Kocaguneli, Ekrem ; Menzies, T. ; Keung, Jacky ; Cok, D. ; Madachy, R.
  • Author_Institution
    Lane Dept. of Comput. Sci. & Electr. Eng., West Virginia Univ., Morgantown, WV, USA
  • Volume
    39
  • Issue
    8
  • fYear
    2013
  • fDate
    Aug. 2013
  • Firstpage
    1040
  • Lastpage
    1053
  • Abstract
    Background: Do we always need complex methods for software effort estimation (SEE)? Aim: To characterize the essential content of SEE data, i.e., the least number of features and instances required to capture the information within SEE data. If the essential content is very small, then 1) the contained information must be very brief and 2) the value added of complex learning schemes must be minimal. Method: Our QUICK method computes the Euclidean distance between rows (instances) and columns (features) of SEE data, then prunes synonyms (similar features) and outliers (distant instances), then assesses the reduced data by comparing predictions from 1) a simple learner using the reduced data and 2) a state-of-the-art learner (CART) using all data. Performance is measured using hold-out experiments and expressed in terms of mean and median MRE, MAR, PRED(25), MBRE, MIBRE, or MMER. Results: For 18 datasets, QUICK pruned 69 to 96 percent of the training data (median = 89 percent). K = 1 nearest neighbor predictions (in the reduced data) performed as well as CART's predictions (using all data). Conclusion: The essential content of some SEE datasets is very small. Complex estimation methods may be overelaborate for such datasets and can be simplified. We offer QUICK as an example of such a simpler SEE method.
  • Keywords
    data handling; learning (artificial intelligence); software cost estimation; statistical analysis; CART learner; Euclidean distance; K-nearest neighbor prediction; QUICK method; SEE data content; complex learning scheme; mean; median; software effort estimation; Complexity theory; Estimation; Frequency selective surfaces; Indexes; Labeling; Principal component analysis; Software cost estimation; active learning; analogy; k-NN
  • fLanguage
    English
  • Journal_Title
    Software Engineering, IEEE Transactions on
  • Publisher
    IEEE
  • ISSN
    0098-5589
  • Type
    jour
  • DOI
    10.1109/TSE.2012.88
  • Filename
    6392173
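
The evaluation described in the abstract (k = 1 nearest-neighbor predictions scored with MRE, MAR, and PRED(25)) can be illustrated with a minimal sketch. This is not the authors' QUICK implementation: the pruning of synonyms and outliers is omitted, the function names are hypothetical, the toy data is invented for illustration, and the metric formulas are the commonly used definitions (MRE = |actual - predicted| / actual, MAR = mean absolute residual, PRED(25) = percentage of estimates with MRE at or below 0.25), which are assumed here rather than taken from the record.

```python
import math
from statistics import mean


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def one_nn_estimate(train_rows, train_efforts, query):
    """Return the effort of the single nearest training instance (k = 1)."""
    nearest = min(range(len(train_rows)),
                  key=lambda i: euclidean(train_rows[i], query))
    return train_efforts[nearest]


def mre(actual, predicted):
    """Magnitude of relative error for one estimate (assumed definition)."""
    return abs(actual - predicted) / actual


def mar(actuals, predictions):
    """Mean absolute residual over all estimates (assumed definition)."""
    return mean(abs(a - p) for a, p in zip(actuals, predictions))


def pred(actuals, predictions, level=25):
    """Percentage of estimates whose MRE is at most level percent."""
    hits = sum(1 for a, p in zip(actuals, predictions)
               if mre(a, p) <= level / 100)
    return 100.0 * hits / len(actuals)


# Toy data, invented for illustration only (not from any SEE dataset).
train_rows = [(1.0, 2.0), (4.0, 6.0), (10.0, 10.0)]
train_efforts = [100.0, 200.0, 400.0]
test_rows = [(1.5, 2.0), (9.0, 9.0)]
actuals = [120.0, 300.0]

predictions = [one_nn_estimate(train_rows, train_efforts, q) for q in test_rows]
print("predictions:", predictions)
print("MAR:", mar(actuals, predictions))
print("PRED(25):", pred(actuals, predictions))
```

In a hold-out experiment of the kind the abstract mentions, the test rows would be withheld from the training set before the nearest neighbor is looked up, as they are in this toy split.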