Data categorization using decision trellises

Author

Frasconi, Paolo ; Gori, Marco ; Soda, Giovanni

Author_Institution

Dept. of Syst. & Inf., Florence Univ., Italy

Volume

11

Issue

5

Year

1999

Firstpage

697

Lastpage

712

Abstract

We introduce a probabilistic graphical model for supervised learning on databases with categorical attributes. The proposed belief network contains hidden variables that play a role similar to nodes in decision trees, and each of their states corresponds either to a class label or to a single attribute test. The major difference with respect to decision trees is that the selection of the attribute to be tested is probabilistic. Thus, the model can be used to assess the probability that a tuple belongs to some class, given the predictive attributes. Unfolding the network along the hidden-state dimension yields a trellis structure whose signal flow is similar to that of second-order connectionist networks. The network encodes context-specific probabilistic independencies to reduce parametric complexity. We present a custom-tailored inference algorithm and derive a learning procedure based on the expectation-maximization algorithm. We propose decision trellises as an alternative to decision trees for tuple categorization in databases, which is an important step in building data mining systems. Preliminary experiments on standard machine learning databases are reported, comparing the classification accuracy of decision trellises with that of decision trees induced by C4.5. In particular, we show that the proposed model can offer significant advantages for sparse databases in which many predictive attributes are missing.

Keywords

belief networks; data mining; decision trees; deductive databases; inference mechanisms; learning (artificial intelligence); neural nets; optimisation; probability; belief network; categorical attributes; class label; classification accuracy; context specific probabilistic independencies; custom tailored inference algorithm; data categorization; data mining systems; decision trellises; expectation-maximization algorithm; hidden states dimension; hidden variables; learning procedure; parametric complexity; predictive attributes; probabilistic graphical model; second order connectionist networks; signal flow; single attribute test; sparse databases; standard machine learning databases; supervised learning; trellis structure; tuple categorization; Databases; Expectation-maximization algorithms; Graphical models; Inference algorithms; Machine learning algorithms; Predictive models; Testing

Language

English

Journal_Title

Knowledge and Data Engineering, IEEE Transactions on

Publisher

IEEE

ISSN

1041-4347

Type

jour

DOI

10.1109/69.806931

Filename

806931