\section{Distance Functions\label{SeMetric}} Distance functions
are used to measure the distance between two data tuples $P_a$ and
$P_b$. In this work distances are always calculated in the input
space. Thus the terms {\em input vector} or just {\em input} are
used in the following descriptions. In the context of unsupervised
clustering these terms should be replaced by {\em data vector}
and {\em variable}. A widely known distance function is the {\em
Minkowski} metric, which is defined as
\begin{equation}
d = \sqrt[\rho]{\sum_{j=1}^m d_j^\rho}\ .\label{EqMinkowskiMetric}
\end{equation}
For continuous inputs the component-wise distances $d_j$ are
calculated as
\begin{equation}
d_j = |x_{a j}-x_{b j}|\label{EqSingleDistance}\ ,
\end{equation}
where $x_{a j}$ and $x_{b j}$ denote the $j$-th component of
the input vectors $\mathbf{x}_a$ and $\mathbf{x}_b$, respectively. For some
particular values of the metric parameter $\rho$, the Minkowski
metric corresponds to special distance functions as stated in
table \ref{TaMinkowskiMetric}.
\begin{table}[!ht]
\caption{\label{TaMinkowskiMetric}Parameter values $\rho$ for the
Minkowski metric.} \centering \vspace{\TableSkip}
\begin{tabular}{lp{1cm}lp{0.5cm}l}
\hline\\[-7pt]
Block-wise distance && $\rho = 1$ && ${\displaystyle d = \sum_{j=1}^m d_j}$\\
\\
Euclidean distance && $\rho = 2$ && ${\displaystyle d = \sqrt{\sum_{j=1}^m d_j^2}}\qquad$\\
\\
Chebychev distance && $\rho = \infty$ && ${\displaystyle d =
\max_{j=1}^m |d_j|}$\\[7pt]
\hline
\end{tabular}\end{table}
\vspace{10pt}
Usually the metric parameter is chosen to satisfy the inequality
$\rho \geq 1$; however, other values are possible as well. In
\cite{AgHiKe01} the so-called {\em fractional distance metric},
which allows values $0 < \rho < 1$, is introduced, and it is shown
experimentally that this metric can improve the prediction
accuracy if applied in a high-dimensional input space together
with a $k${\sc nn} algorithm.
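As a minimal sketch, the Minkowski metric of eqn. (\ref{EqMinkowskiMetric}) for continuous inputs can be computed as follows (the function name and the plain-Python style are our own, not taken from the cited work); $\rho = 1$, $\rho = 2$ and $\rho = \infty$ reproduce the special cases of table \ref{TaMinkowskiMetric}, while $0 < \rho < 1$ gives the fractional distance metric:

```python
def minkowski_distance(xa, xb, rho=2.0):
    """Minkowski distance between two continuous input vectors.

    rho = 1 gives the block-wise distance, rho = 2 the Euclidean
    distance, rho = float('inf') the Chebychev distance; values
    0 < rho < 1 yield the fractional distance metric.
    """
    # component-wise distances d_j = |x_aj - x_bj|
    diffs = [abs(a - b) for a, b in zip(xa, xb)]
    if rho == float('inf'):
        return max(diffs)
    return sum(d ** rho for d in diffs) ** (1.0 / rho)
```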
\paragraph{Component-Wise Distances for Nominal Inputs}
The simplest approach to calculate the component-wise distance
$d_j$ of the two symbols $w=x_{a j}$ and $z=x_{b j}$ of a nominal
input $x_j$ is the so-called {\em overlap} metric: the distance is
0 if the two symbols coincide and 1 otherwise.
\begin{equation}
d_j = \left\{ \begin{array}{ll}
0 &\quad x_{a j} = x_{b j}\\
1 &\quad \mbox{else}\\
\end{array}\right.\label{EqOverlapMetric}
\end{equation}
In \cite{StaWa92} the so-called {\em Value Difference Metric}
(VDM) is introduced to allow a more precise graduation of the
calculated distances. The idea of the VDM is that two
symbols $w=x_{a j}$ and $z=x_{b j}$ of a nominal input $x_j$ are
closer to each other if the conditional probabilities
$P(y=s|x_j=w)$ and $P(y=s|x_j=z)$ are similar for the different
possible output classes $s$. A simplified VDM can be
calculated as
\begin{equation}
d_j = \sum_{s=1}^{S_y} \left| P(y=s|x_j=w)-P(y=s|x_j=z) \right| =
\sum_{s=1}^{S_y} \left| \frac{N_{w, s}}{N_w} - \frac{N_{z,
s}}{N_z} \right|\label{EqVDM}\ .
\end{equation}
The required probabilities are estimated from the relative
frequencies in the given learning data sample. $N_w$ and
$N_z$ denote the numbers of learning data tuples for which the
input $x_j$ has the symbol $w$ or $z$, respectively, and $N_{w, s}$ and
$N_{z, s}$ are correspondingly the numbers of learning data tuples for
which additionally the output has the symbol $s$. Note that the
different possible output symbols are encoded as integer values
starting from 1.
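The counting scheme behind eqn. (\ref{EqVDM}) can be sketched as follows (function name and data layout are illustrative assumptions; the input column and the output labels are given as parallel lists):

```python
from collections import Counter

def vdm_distance(column, labels, w, z):
    """Simplified Value Difference Metric between two symbols w and z
    of a nominal input, estimated from a learning data sample."""
    n = Counter(column)                     # N_w and N_z
    n_joint = Counter(zip(column, labels))  # N_{w,s} and N_{z,s}
    classes = set(labels)
    # sum over output classes of |N_{w,s}/N_w - N_{z,s}/N_z|
    return sum(abs(n_joint[(w, s)] / n[w] - n_joint[(z, s)] / n[z])
               for s in classes)
```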
If there are several nominal or mixed nominal and continuous
inputs, the component-wise distances $d_j$ are calculated
according to eqn. (\ref{EqOverlapMetric}) or (\ref{EqVDM}) for
nominal and according to eqn. (\ref{EqSingleDistance}) for
continuous inputs. Afterwards the component-wise distances are
combined into the overall distance using eqn.
(\ref{EqMinkowskiMetric}). In the case of $\rho=2$ the resulting
distance measures are called {\em Heterogeneous Euclidean Overlap
Metric\label{DeHEOM}} (HEOM) and {\em Heterogeneous Value
Difference Metric} (HVDM) \cite{WiMa97}\footnote{Recently
published work \cite{WiMa97, WiMa00b} reports the successful
application of the VDM and the similar, further developed {\em
Interpolated Value Difference Metric} (IVDM) in the context of
data-based learning algorithms. However, first experiments with the
{\sc Pnc\,2} cluster algorithm were disappointing, as the
prediction accuracy -- compared with a properly normalized and
weighted heterogeneous Minkowski overlap metric as defined by eqn.
(\ref{EqHMOM}) -- could not be improved. Thus we discarded the
approach of the VDM or IVDM metric, as it is more complicated.}.
\paragraph{Missing Feature Values}
Some practical learning tasks contain input vectors with missing
feature values. There are several approaches to deal with this
problem. The rule induction algorithm {\sc Rise} \cite{Dom97}, for
example, treats a missing feature value as an additional symbol.
The representation of the rules is extended by an additional flag
for each input that indicates whether missing feature values are
covered by the rule's premise. Note that this approach is
possible independently of the input's type. Another approach is
the ignore strategy \cite{Aha90}, which simply ignores a component-wise
distance in eqn. (\ref{EqMinkowskiMetric}) by setting
$d_j=0$ if a feature value is missing. The overall distance is
afterwards divided by the number of known features to distinguish
a perfect match from the case of an input vector where all
feature values are missing. Alternatively, the data sample can be
prepared by guessing missing values -- for example by using a
special model that was learned for this purpose. An overview of
several approaches used together with decision trees can be found
in \cite{Qui89,Liu97}.
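The ignore strategy can be sketched as follows, with missing values encoded as \texttt{None} (the function name, the \texttt{None} encoding, and the convention of returning infinity when no feature is known in common are our own assumptions, not prescribed by \cite{Aha90}):

```python
def distance_with_missing(xa, xb, rho=2.0):
    """Minkowski distance with the ignore strategy: a component is
    skipped when either feature value is missing (None), and the
    result is divided by the number of known components so that a
    perfect match on a few known features is distinguishable from a
    full match on all features."""
    dists = [abs(a - b) for a, b in zip(xa, xb)
             if a is not None and b is not None]
    if not dists:
        # no feature known in common -- one possible convention
        return float('inf')
    return sum(d ** rho for d in dists) ** (1.0 / rho) / len(dists)
```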
\subsection{Normalization\label{SeNormalization}}
\paragraph{Continuous Inputs}
All inputs should contribute evenly to the overall distance
value\footnote{If some inputs are more relevant than others, one can use
feature weights as described in section \ref{SeWeights}.}. However,
this requirement is not always satisfied for the distance function
(\ref{EqMinkowskiMetric}). For continuous inputs a problem
arises if the ranges of the different inputs differ significantly. For
example, if one input is within a range of $[0..1]$ and the other
within a range of $[0..100]$, then the latter input can
dominate the former. Thus it is advisable to normalize each input
by a factor $\delta_j$, so that the component-wise distances are
mostly within the interval $[0,1]$. The normalization can be
integrated into the distance function:
\begin{equation}
d = \sqrt[\rho]{\sum_{j=1}^m
\left(\frac{d_j}{\delta_j}\right)^\rho}\ .\label{Eq231}
\end{equation}
An easy approach to choose the normalization factors is to couple
them to the range or standard deviation of the particular input,
calculated using the learning data sample. For a formal
definition of these simple statistics see table \ref{TaStatistic}.
\begin{table}[!ht]\NormalsizeIfTenPt
\caption{Simple statistics of an input $x$.\label{TaStatistic}}
\centering\vspace{\TableSkip}\begin{tabular}{lp{1cm}l}
\hline\\
Mean && $\displaystyle \overline{x} = \frac{1}{N}
\sum_{i=1}^N x_i$\\
\\
Standard deviation && $\displaystyle \sigma = \sqrt{\frac{1}{N-1} \sum_{i=1}^N (x_i-\overline{x})^2}$\\
\\Maximum value && $\displaystyle x_{max} = \max_{i=1}^N x_i$\\
\\Minimum value && $\displaystyle x_{min} = \min_{i=1}^N x_i$\\
\\Range && $\displaystyle x_{range} =  x_{max} - x_{min}$\\
\\\hline
\end{tabular}
\end{table}
If the normalization factors are chosen as
\begin{equation}
\delta_j = x_{range\,j}\label{EqRangeNormalization}\ ,
\end{equation}
then all possible component-wise distances $d_j$ of two learning data
tuples are guaranteed to be within the interval $[0,1]$. However,
the inputs of a test data sample can have a different range, and
thus it is possible to get a component-wise distance $d_j>1$ --
but usually this should not be a problem, as a slightly higher
component-wise distance is often justified in such a case. A
problem when using the range normalization is that this statistic
can be highly affected by noise. Thus often a certain percentage
of the smallest and biggest values is omitted when calculating the
range.
If the normalization factors are chosen as four times the
standard deviation,
\begin{equation}
\delta_j = 4 \sigma_j\label{EqDevNormalization}\ ,
\end{equation}
this leads to component-wise distances that are mostly in the
interval $[0,1]$. This is due to the fact that 95\% of all values
of a normally distributed variable $x_j$ are inside the interval
$[\overline{x}_j-2\sigma_j, \overline{x}_j+2\sigma_j]$ and thus the distance of two randomly
chosen values will mostly be less than $4 \sigma_j$.
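Both choices of normalization factors, eqn. (\ref{EqRangeNormalization}) and eqn. (\ref{EqDevNormalization}), can be sketched as follows (function name, the list-of-rows sample layout, and the method keywords are illustrative assumptions):

```python
def normalization_factors(sample, method='range'):
    """Per-input normalization factors delta_j from a learning data
    sample given as a list of input vectors: either the input's range
    or four times its standard deviation."""
    m = len(sample[0])
    deltas = []
    for j in range(m):
        col = [row[j] for row in sample]
        if method == 'range':
            deltas.append(max(col) - min(col))
        else:  # '4sigma': four times the (sample) standard deviation
            mean = sum(col) / len(col)
            var = sum((v - mean) ** 2 for v in col) / (len(col) - 1)
            deltas.append(4.0 * var ** 0.5)
    return deltas
```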
\paragraph{Mixed Continuous and Nominal Inputs}
Proper normalization for mixed continuous and nominal inputs is a
non-trivial problem if the overlap metric according to eqn.
(\ref{EqOverlapMetric}) is used to calculate the component-wise
distances of nominal inputs\footnote{For notes on how to properly
normalize distances that were calculated using eqn. (\ref{EqVDM}),
see \cite{WiMa97}.}. The component-wise distances are within the
interval $[0,1]$, but their average value tends to be much bigger
than the average value of a normalized continuous input. This
leads to a disproportionately high influence of the nominal inputs. In
the context of this work, two different methods to normalize
mixed continuous and nominal inputs are considered.
The first and simplest approach is to keep using eqn.
(\ref{EqRangeNormalization}) or (\ref{EqDevNormalization}) to
determine the normalization factors for continuous inputs and to
simply choose a fixed $\delta = \delta_{Symb} > 1$ for all nominal
inputs.
The second approach is to calculate all normalization factors --
the ones for the nominal as well as the ones for the continuous
inputs -- using the normalization by the {\em average component-wise
distance} as described in \cite{HoCa99}. The basic idea
of this approach is to normalize each input such that the average
normalized component-wise distance is 1. Therefore the average
component-wise distance of two learning data tuples is
calculated as
\begin{equation}
\delta_j =  d_{avr} = \frac{{\displaystyle \sum_{i=1}^N
\sum_{z=i+1}^N d_j(x_{i,j}, x_{z,j})}}{{\displaystyle \frac{1}{2}
N ( N - 1 )}}\ ,\label{EqAverageDistance}
\end{equation}
where $N$ is the sample size and $d_j(x_{i,j}, x_{z,j})$ denotes
the component-wise distance of the $i$-th to the $z$-th learning data tuple. For
large sample sizes $N$ it is not necessary to evaluate all
possible pairs of learning data tuples to get a sufficiently
robust estimate -- the computational costs can be reduced by
evaluating only a subset of the learning data sample.
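A per-input sketch of eqn. (\ref{EqAverageDistance}) follows (function name and the nominal/continuous switch are our own assumptions; for a nominal input the overlap metric serves as the component-wise distance):

```python
def avg_componentwise_distance(column, nominal=False):
    """Normalization factor delta_j: the average component-wise
    distance over all N(N-1)/2 pairs of learning data tuples."""
    n = len(column)
    total = 0.0
    for i in range(n):
        for z in range(i + 1, n):
            if nominal:
                # overlap metric: 0 if symbols coincide, 1 otherwise
                total += 0.0 if column[i] == column[z] else 1.0
            else:
                total += abs(column[i] - column[z])
    return total / (0.5 * n * (n - 1))
```

For large $N$ the double loop can be restricted to a random subset of tuples, as noted above.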
\subsection{Feature Weights\label{SeWeights}}
The influence of more relevant inputs can be increased using
so-called {\em feature weights}. The different aims of
normalization and weighting should not be confused: the normalization tries to
balance the influence of the inputs, whereas the feature weights
afterwards scale each input's influence with respect to its
relevance.
Several methods to calculate feature weights with application to
IBL algorithms are described in \cite{WeAhMo97, Aha98}. The
following paragraph describes an approach to calculate feature
weights based upon the mutual information \cite{Re94} of a
particular single input and the output. This approach has also been
used in \cite{WD94} together with the {\sc Nge} and the $k${\sc
nn} algorithm.
The mutual information $I(x,y)$ measures the decrease of the
uncertainty about the outcome of a discrete random variable $y$
if another discrete variable $x$ is known. The uncertainty is
measured by the entropy $E(y)$ \cite{Sha48}:
\begin{equation}
E(y)=-\sum_{s=1}^{S_y}P(y=s) \log P(y=s)\ ,\label{EqEntropy}
\end{equation}
\begin{equation}
I(x,y)=I(y,x)=\sum_{w=1}^{S_x}\sum_{s=1}^{S_y}P(y=s \wedge x=w)
\log \frac{P(y=s \wedge x=w)}{P(y=s) P(x=w)}\ .
\end{equation}
Here $S_x$ and $S_y$ denote the numbers of possible discrete
values of the random variables $x$ and $y$, and $P(x=w)$ and
$P(y=s)$ denote the probabilities of the $w$-th and the $s$-th
value, respectively. A division by zero is circumvented by setting
$\log\frac{0}{0}=0$. If an input does not contribute to reducing the
uncertainty about the output, then the mutual information of this
input will be 0. If an input completely explains the output, then
the mutual information equals the entropy of the output.
The mutual information can be calculated easily, and the resulting
values are relatively robust with respect to different learning data
samples.
The feature weights are calculated based upon the mutual
information as follows: the mutual information $I(x_j,y)$ is
calculated for each of the $m$ inputs. Continuous inputs are
discretized -- as described in the following paragraph -- for this
calculation. The values $I(x_j,y)$ are normalized to get an
average feature weight of one:
\begin{equation}\label{EqWeights}
\omega_j=\frac{m\,I(x_j,y)}{{\displaystyle
\sum_{w=1}^{m}I(x_w,y)}}\ .
\end{equation}
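The whole weighting scheme, mutual information estimated from relative frequencies followed by the normalization of eqn. (\ref{EqWeights}), can be sketched as follows (function names, the list-based data layout, and the fallback to uniform weights when all mutual informations are zero are our own assumptions):

```python
import math
from collections import Counter

def mutual_information(x, y):
    """Mutual information I(x, y) of two discrete variables, with
    probabilities estimated from relative frequencies; terms of the
    form log(0/0) never occur because only observed pairs are summed."""
    n = len(x)
    px, py = Counter(x), Counter(y)
    pxy = Counter(zip(x, y))
    return sum((c / n) * math.log((c / n) / ((px[w] / n) * (py[s] / n)))
               for (w, s), c in pxy.items())

def feature_weights(columns, y):
    """Feature weights normalized so that the average weight is one."""
    infos = [mutual_information(col, y) for col in columns]
    total = sum(infos)
    m = len(columns)
    if total == 0:
        return [1.0] * m  # no input is informative: uniform weights
    return [m * i / total for i in infos]
```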
\paragraph{Discretization of Continuous Inputs\label{SeDiscretization}}
Continuous inputs can be transformed into ordinal inputs using a
so-called discretization method that divides the range of the
continuous input into several intervals. Each continuous value is
replaced by the number of the interval it falls into\footnote{We
generally enumerate these intervals starting with 1.}. The
simplest and most commonly used approaches are the so-called {\em
equidistant} and {\em equifrequent} discretizations. Both
approaches work univariately and unsupervised. It is necessary to
pre-specify a fixed number of discretization intervals $N_{Bins}$.
\begin{itemize}
\item {\em Equidistant discretization.} The range of the input
$x$ is divided into $N_{Bins}$ intervals of equal
width $w=\frac{x_{range}}{N_{Bins}}$. As with the range
normalization, it is advisable to ignore a small percentage of the
smallest and highest feature values when determining the input's
range, to get a more robust estimate in the case of noise. The
left- and the right-most interval are extended to infinity.

\item {\em Equifrequent discretization.} The equidistant
discretization can be unfavorable if the numbers of data tuples
that fall into each interval differ significantly. The
equifrequent discretization addresses this problem by positioning the
intervals such that each interval contains approximately the same
number of data tuples $N'=\lfloor\frac{N}{N_{Bins}}\rfloor$, where the
operator $\lfloor x \rfloor$ returns the largest integer not greater than
$x$, i.e.\ rounds down. The left- and the right-most interval are
extended to infinity.
\end{itemize}
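Both discretizations can be sketched as follows (function names are illustrative; the sketch omits the trimming of extreme values and simply clamps out-of-range values to the outermost intervals, which plays the role of extending them to infinity; ties in the equifrequent variant are broken by sort order):

```python
def equidistant_bins(values, n_bins):
    """Interval index (starting at 1) for each value, using n_bins
    equally wide intervals over the sample range."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    out = []
    for v in values:
        k = int((v - lo) // width) + 1
        out.append(min(max(k, 1), n_bins))  # clamp outermost intervals
    return out

def equifrequent_bins(values, n_bins):
    """Interval index for each value so that each interval holds
    roughly floor(N / n_bins) data tuples."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    per_bin = len(values) // n_bins
    out = [0] * len(values)
    for rank, i in enumerate(order):
        out[i] = min(rank // per_bin + 1, n_bins)
    return out
```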
Several more advanced discretization approaches have been published. Of
particular interest for application together with the calculation
of the mutual information is the entropy-based method first
published in \cite{FaIr89}. For an overview of other possible
approaches see \cite{HuLiTa99, DoKoSa95}.
The method in \cite{FaIr89} produces a discretization that
minimizes the entropy of the output $y$, calculated
according to eqn. (\ref{EqEntropy}) for each interval. It is not
necessary to pre-specify the number of intervals. The approach is
applied as follows: the input is divided into two intervals such
that the entropy of this discretization is minimal. Afterwards,
one continues recursively with the resulting intervals as long as
the {\em Minimum Description Length Principle} (MDLP) is
satisfied. This principle weighs the reduction of the entropy
against the increasing complexity of the discretization. The MDLP
was originally introduced in the field of communication theory in
the context of the bandwidth needed to transmit a message from a
sender to a receiver.\footnote{This approach seems to be
especially qualified for use in the context of mutual
information calculations. However, first experiments with the {\sc
Pnc\,2} cluster algorithm were disappointing, as the prediction
accuracy was not increased. Thus this approach was discarded.}