Representation and distance measure
Previous  Top  Next

Representation
A data tuple is given, as described in What is it all about ?, by an m dimensional input vector x and an associated output value y. A cluster is represented by an output value c and by an input space cuboid H. This cuboid has m single components that are defined as follows:
 
·Continuous inputs  
For a continuous input, the cuboid component is described by its lower left and its upper right bound l respectively r.  
 
·Nominal inputs  
For a nominal input, the cuboid component is defined by a so called bit string b, that encodes, which symbols are covered by the cuboid component. The bit string has as many elements as the nominal input can take on different symbols. Each element corresponds to one of the possible symbols and can either have the value 0 or the value 1, whereas the latter means, that the symbol is covered by the bit string.  

A data tuple can be transformed to an elementary cluster by taking the data tuple's output as the cluster's output value and by building a
 trivial cuboid, that includes only the data tuple's input vector.



Figure: Cuboid representation and component wise distances in the case of two continuous inputs.



Distance measure
The distance between two clusters
 Ca and Cb is defined as the distance of the two cluster's cuboids Ha and Hb. This distance is measured using a Heterogenous Normalized and Weighted Minkowski Overlap Metric as



For continuous inputs, the component wise distances are determined according to



and for symbolic inputs by



The feature weights used are calculated based one mutual information of the particular input to the output and the normalization factors used can be chosen either as the range or as the fourth of the standard deviation of the particular input.

Note:
As the input vector of a data tuple can be transformed into a trivial cuboid, this distance measure can also be used, to determine the distance of a data tuple to a cluster. A distance of
 d=0 means, that the data tuple lies inside the cuboid. The data tuple is covered by the cluster.

Note: The distances are determined in the input space only and thus the output value of the data tuples and the clusters are irrelevant.


Generalization
The cuboid of a cluster should include all input vectors of the data tuples merged within. Thus two clusters
 Ca and Cb with the same output value are merged to a generalized cluster by just taking the output value and building a generalized cuboid Hg from the two cuboids Ha and Hb as follows:

·Continuous inputs  
Take the minimum of the two lower left bounds of the two cuboid components that are generalized as the new lower left bound. Proceed the same way with the upper right bound, but use the maximum now. This is the most specific generalization with respect to the representation used.  
 
·Nominal inputs  
When generalizing two cuboids, the bit string for a symbolic input is build by an OR-Operation, such that a symbol is covered by the generalized bit string, if it is covered by any of the two bit strings which are generalized.  
 



Figure: Two examples to illustrate representation and generalization. The first three inputs are nominal and can take on the symbols a, b and c whereas the latter three inputs are continuous.

 
More Info