This document is an english translation of the parts of my PhD thesis (Clusterverfahren zur datenbasierten Generierung interpretierbarer Regeln unter Verwendung lokaler Entscheidungskriterien, Lars Haendel, PhD thesis, Chair of Electrical Control Engineering, University of Dortmund, 2003), that are relevant to the PNC2 Cluster Algorithm.
|Download the document here!|
This document describes the hierarchical agglomerative cluster algorithm PNC2 in the context of direct generation of If-Then rules for classification tasks. As an agglomerative cluster algorithm, the PNC2 initializes each learn data tuple as a single cluster. Then, if a merge test is passed, iteratively always those two clusters with the same output value are merged, that are closest to each other. The merge test transforms the generalized cluster into a rule and evaluates it by a kind of hitrate. The rule's premise is the cuboid, that encloses the input vectors of all learn data tuples merged in the cluster. This representation suffers in high dimensional input spaces due to the COD problem and thus a special mechanism is used to extend the cuboid during the merge test.
A Heterogenous Normalized and Weighted Minkowski Overlap Metric is used to be able to process continuous and nominal inputs at the same time. An integrated bagging component can improve accuracy and also reduces the time complexity for a learn data sample with N data tuples from O(N^3) to approximately O(N^2). The size of the learned rule set can be further reduced by applying a context sensitive feature selection, that individually removes the unnecessary inputs from each rule's premise. The algorithm can also be viewed as an instance based learning algorithm, namely as an exemplar-based generalization approach. Thus the idea of the k-nearest-neighbor algorithm (kNN), to base the decision on several surrounding learn data tuples, can be transferred to improve the prediction accuracy.
The number of free parameters of the PNC2 is been reduced in a preliminary study with some development benchmarks. Then the PNC2 is compared experimentally with the most similar existing algorithms, namely with the NGE, the RISE and, of course, with the kNN algorithm. All remaining free parameters of the PNC2 are tuned using cross-validation or similar approaches within the respective learn data samples. The PNC2 outperforms the NGE algorithms and its variants and reaches better or comparative accuracies as the kNN or the RISE algorithm - with typically much smaller ruleset/model sizes.
The PNC2 Algorithm was developed while I was a scholarship holder in the post graduate research program Modelling and Model-Based Design of Complex Technological Systems at the Chair of Electrical Control Engineering at the University of Dortmund, Germany. My research project was initiated by Prof. Dr. rer. nat. H. Kiendl. The post graduate research program was founded by the Deutsche Forschungsgemeinschaft (DFG).
The PNC2 Rule Induction System is a free software tool, that is using the PNC2 Algorithm to automatically induce rules from a given data sample. This program is distributed under the terms of the GNU General Public License. Additionally a DOS command line version will be available soon. The source kernels are written in ANSI C++, they are well documented and should easily be compiled for different operating systems.
|Download the document here!|
last updated: 22 January 2004 © 2000-2004 by Lars Haendel