• DocumentCode
    2941469
  • Title

    How many clusters to report: A recursive heuristic

  • Author

    Carlis, John ; Bruso, Kelsey

  • Author_Institution
    Comput. Sci. & Eng. Dept., Univ. of Minnesota, Minneapolis, MN, USA
  • fYear
    2010
  • fDate
    Aug. 31 2010-Sept. 4 2010
  • Firstpage
    1069
  • Lastpage
    1072
  • Abstract
    Clustering can be a valuable tool for analyzing large amounts of data, but anyone who clusters must choose how many item clusters, K, to report. Unfortunately, one must guess at K or some related parameter when working within each of the three available frameworks where one thinks of clustering: as a Euclidean distance problem; as a statistical model problem; or as a complexity theory problem. We report here a novel recursive square root heuristic, RSQRT, which accurately predicts K_reported as a function of the attribute or item count, depending on attribute scales. We tested the heuristic on 226 widely-varying, but mostly scientific, studies, and found that the heuristic's K_best-predicted rounded to exactly K_reported in over half of the studies and was close in almost all of them. We claim that this strongly-supported heuristic makes sense and that, although it is not prescriptive, using it prospectively is much better than guessing.
  • Keywords
    bioinformatics; data analysis; data clustering; item count; recursive square root heuristic; Bayesian methods; Clustering algorithms; Complexity theory; Computational modeling; Presses; Shape; Spirals; Algorithms; Cluster Analysis; Data Interpretation, Statistical; Humans; Incidence; Proportional Hazards Models; Risk Assessment; Risk Factors; Schizophrenia;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Engineering in Medicine and Biology Society (EMBC), 2010 Annual International Conference of the IEEE
  • Conference_Location
    Buenos Aires
  • ISSN
    1557-170X
  • Print_ISBN
    978-1-4244-4123-5
  • Type

    conf

  • DOI
    10.1109/IEMBS.2010.5627287
  • Filename
    5627287