\section{Distance Functions\label{SeMetric}}
Distance functions are used to measure the distance between two
data tuples $P_a$ and $P_b$. In this work distances are always
calculated in the input space. Thus the term {\em input vector} or
just {\em input} is used in the following descriptions. In the
context of unsupervised clustering these terms should be replaced
by {\em data vector} and {\em variable}. A widely known distance
function is the {\em Minkowski} metric, which is defined as

\begin{equation}
d = \sqrt[\rho]{\sum_{j=1}^m d_j^\rho}\ .\label{EqMinkowskiMetric}
\end{equation}

For continuous inputs the component wise distances $d_j$ are
calculated as

\begin{equation}
d_j = |x_{a j}-x_{b j}|\label{EqSingleDistance}\ ,
\end{equation}

where $x_{a j}$ and $x_{b j}$ denote the $j$-th component of the
input vectors $\mathbf{x}_a$ and $\mathbf{x}_b$, respectively. For
some particular values of the metric parameter $\rho$, the
Minkowski metric corresponds to the special distance functions
stated in table \ref{TaMinkowskiMetric}.


\begin{table}[!ht]
\caption{\label{TaMinkowskiMetric}Special cases of the Minkowski
metric for particular values of the parameter $\rho$.}
\centering \vspace{\TableSkip}
\begin{tabular}{lp{1cm}lp{0.5cm}l}
\hline\\[-7pt]
Block wise distance && $\rho = 1$ && ${\displaystyle d = \sum_{j=1}^m d_j}$\\
\\
Euclidean distance && $\rho = 2$ && ${\displaystyle d = \sqrt{\sum_{j=1}^m d_j^2}}\qquad$\\
\\
Chebychev distance && $\rho = \infty$ && ${\displaystyle d = \max_{j=1}^m |d_j|}$\\[7pt]
\hline
\end{tabular}
\end{table}
\vspace{10pt}
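
The calculation of eqns. (\ref{EqSingleDistance}) and
(\ref{EqMinkowskiMetric}) can be summarized in a few lines of
code. The following Python fragment is a minimal sketch for
continuous inputs; the function name and the example vectors are
chosen for illustration only.

\begin{verbatim}
# Minimal sketch of the Minkowski metric for continuous inputs;
# rho corresponds to the special cases in the table above
# (rho = 1 block wise, rho = 2 Euclidean, rho = inf Chebychev).
import math

def minkowski_distance(x_a, x_b, rho=2.0):
    """Distance between two input vectors x_a and x_b."""
    d_j = [abs(a - b) for a, b in zip(x_a, x_b)]  # component wise
    if math.isinf(rho):                           # Chebychev limit
        return max(d_j)
    return sum(d ** rho for d in d_j) ** (1.0 / rho)

# Example: block wise, Euclidean and Chebychev distance of one pair
x_a, x_b = [0.2, 0.5, 1.0], [0.4, 0.1, 0.0]
for rho in (1, 2, math.inf):
    print(rho, minkowski_distance(x_a, x_b, rho))
\end{verbatim}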

Usually the metric parameter is chosen to satisfy the inequality
$\rho \geq 1$, but other values are possible as well. In
\cite{AgHiKe01} the so-called {\em fractional distance metric},
which allows values $0 < \rho < 1$, is introduced, and it is shown
experimentally that this metric can improve the prediction
accuracy when applied in a high dimensional input space together
with a $k${\sc nn} algorithm.


\paragraph{Component Wise Distances for Nominal Inputs}
The simplest approach to calculate the component wise distance
$d_j$ of the two symbols $w=x_{a j}$ and $z=x_{b j}$ of a nominal
input $x_j$ is the so-called {\em overlap} metric: the distance is
0 if the two symbols coincide and 1 otherwise.

\begin{equation}
d_j = \left\{ \begin{array}{ll}
0 &\quad x_{a j} = x_{b j}\\
1 &\quad \mbox{else}\\
\end{array}\right.\label{EqOverlapMetric}
\end{equation}

In \cite{StaWa92} the so-called {\em Value Difference Metric}
(VDM) is introduced to allow a finer graduation of the calculated
distances. The idea of the VDM is that two symbols $w=x_{a j}$ and
$z=x_{b j}$ of a nominal input $x_j$ are closer to each other if
the conditional probabilities $P(y=s|x_j=w)$ and $P(y=s|x_j=z)$
are similar for the different possible output classes $s$. A
simplified VDM can be calculated as

\begin{equation}
d_j = \sum_{s=1}^{S_y} \left| P(y=s|x_j=w)-P(y=s|x_j=z) \right| =
\sum_{s=1}^{S_y} \left| \frac{N_{w,s}}{N_w} - \frac{N_{z,s}}{N_z}
\right|\label{EqVDM}\ .
\end{equation}

The required probabilities are estimated from the frequencies in
the given learn data sample. $N_w$ and $N_z$ are the numbers of
learn data tuples for which the input $x_j$ has the symbol $w$ and
$z$, respectively, and $N_{w,s}$ and $N_{z,s}$ are the numbers of
those tuples for which additionally the output has the symbol $s$.
Recall that the different possible output symbols are encoded as
integer values starting from 1.

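As an illustration, the following Python sketch estimates the
simplified VDM component distance of eqn. (\ref{EqVDM}) from a
small learn data sample; the data and the symbol names are purely
illustrative.

\begin{verbatim}
# Rough sketch of the simplified VDM for one nominal input; the
# conditional probabilities are estimated from a list of
# (symbol, output_class) pairs taken from the learn data sample.

def vdm_component_distance(pairs, w, z):
    """Distance between the symbols w and z of one nominal input."""
    n_w = sum(1 for sym, _ in pairs if sym == w)
    n_z = sum(1 for sym, _ in pairs if sym == z)
    classes = {cls for _, cls in pairs}
    d = 0.0
    for s in classes:
        n_ws = sum(1 for sym, cls in pairs if sym == w and cls == s)
        n_zs = sum(1 for sym, cls in pairs if sym == z and cls == s)
        d += abs(n_ws / n_w - n_zs / n_z)
    return d

# Example: symbols 'red' and 'blue', two output classes
learn = [('red', 1), ('red', 1), ('red', 2), ('blue', 2), ('blue', 2)]
print(vdm_component_distance(learn, 'red', 'blue'))
\end{verbatim}
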
If there are several nominal or mixed nominal and continuous
inputs, the component wise distances $d_j$ are calculated
according to eqn. (\ref{EqOverlapMetric}) or (\ref{EqVDM}) for
nominal inputs and according to eqn. (\ref{EqSingleDistance}) for
continuous inputs. Afterwards the component wise distances are
combined into the overall distance using eqn.
(\ref{EqMinkowskiMetric}). In the case of $\rho=2$ the resulting
distance measures are called {\em Heterogeneous Euclidean Overlap
Metric\label{DeHEOM}} (HEOM) and {\em Heterogeneous Value
Difference Metric} (HVDM) \cite{WiMa97}\footnote{Recently
published work \cite{WiMa97, WiMa00b} reports the successful
application of the VDM and of the similar, further developed {\em
Interpolated Value Difference Metric} (IVDM) in the context of
data-based learning algorithms. However, first experiments with
the {\sc Pnc\,2} cluster algorithm were disappointing, as the
prediction accuracy -- compared with a properly normalized and
weighted heterogeneous Minkowski overlap metric as defined by eqn.
(\ref{EqHMOM}) -- could not be improved. Thus the VDM and IVDM
approaches were discarded, as they are more complicated.}.


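A minimal Python sketch of such a heterogeneous distance is given
below, using the overlap metric for nominal inputs and assuming
that normalization factors $\delta_j$ (see section
\ref{SeNormalization}) are already given; the function name and
the example data are illustrative only.

\begin{verbatim}
# Sketch of a heterogeneous distance in the spirit of the HEOM:
# overlap metric for nominal inputs, absolute differences for
# continuous inputs, combined with the Minkowski metric (rho = 2).

def heterogeneous_distance(x_a, x_b, is_nominal, delta, rho=2.0):
    """is_nominal[j] marks nominal inputs,
    delta[j] is the normalization factor of input j."""
    total = 0.0
    for j, (a, b) in enumerate(zip(x_a, x_b)):
        if is_nominal[j]:
            d_j = 0.0 if a == b else 1.0   # overlap metric
        else:
            d_j = abs(a - b)               # continuous component
        total += (d_j / delta[j]) ** rho
    return total ** (1.0 / rho)

# Example with one nominal and two continuous inputs
x_a = ['red', 0.3, 12.0]
x_b = ['blue', 0.7, 10.0]
print(heterogeneous_distance(x_a, x_b,
                             is_nominal=[True, False, False],
                             delta=[1.0, 1.0, 20.0]))
\end{verbatim}
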
\paragraph{Missing Feature Values}
Some practical learning tasks contain input vectors with missing
feature values. There are several approaches to deal with this
problem. The rule induction algorithm {\sc Rise} \cite{Dom97}, for
example, treats a missing feature value as an additional symbol.
The representation of the rules is extended by an additional flag
for each input that indicates whether missing feature values are
covered by the rule's premise. Note that this approach is possible
independently of the input's type. Another approach is the ignore
strategy \cite{Aha90}, which simply ignores a component wise
distance in eqn. (\ref{EqMinkowskiMetric}) by setting $d_j=0$ if a
feature value is missing. The overall distance is afterwards
divided by the number of known features to distinguish a perfect
match from the case of an input vector where all feature values
are missing. Alternatively the data sample can be prepared by
guessing missing values -- for example by using a special model
that was learned for this purpose. An overview of several
approaches used together with decision trees can be found in
\cite{Qui89,Liu97}.


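The ignore strategy can be sketched in a few lines of Python; this
is only an illustration of the idea described above, with missing
values represented by \texttt{None} and the all-missing case
handled by returning an infinite distance.

\begin{verbatim}
# Sketch of the ignore strategy: a component with a missing value
# (None) contributes d_j = 0, and the overall distance is divided
# by the number of components that could actually be compared.

def distance_with_missing(x_a, x_b, rho=2.0):
    total, known = 0.0, 0
    for a, b in zip(x_a, x_b):
        if a is None or b is None:
            continue                  # missing value: d_j = 0
        total += abs(a - b) ** rho
        known += 1
    if known == 0:
        return float('inf')           # all feature values missing
    return total ** (1.0 / rho) / known

print(distance_with_missing([0.1, None, 0.9], [0.2, 0.5, None]))
\end{verbatim}
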
\subsection{Normalization\label{SeNormalization}}
\paragraph{Continuous Inputs}
All inputs should contribute evenly to the overall distance
value\footnote{If some inputs are more relevant than others, one
can use feature weights as described in section \ref{SeWeights}.}.
However, this requirement is not always satisfied by the distance
function (\ref{EqMinkowskiMetric}). For continuous inputs a
problem arises if the ranges of the different inputs differ
significantly. For example, if one input lies within the range
$[0,1]$ and another within the range $[0,100]$, then the latter
input can dominate the former. Thus it is advisable to normalize
each input by a factor $\delta_j$, so that the component wise
distances are mostly within the interval $[0,1]$. The
normalization can be integrated into the distance function.

\begin{equation}
d = \sqrt[\rho]{\sum_{j=1}^m
\left(\frac{d_j}{\delta_j}\right)^\rho}\label{Eq231}
\end{equation}

An easy approach to choose the normalization factors is to couple
them to the range or standard deviation of the particular input,
calculated using the learn data sample. For a formal definition of
these simple statistics see table \ref{TaStatistic}.


\begin{table}[!ht]\NormalsizeIfTenPt
\caption{Simple statistics of an input $x$.\label{TaStatistic}}
\centering\vspace{\TableSkip}\begin{tabular}{lp{1cm}l}
\hline\\
Mean && $\displaystyle \overline{x} = \frac{1}{N} \sum_{i=1}^N x_i$\\
\\
Standard deviation && $\displaystyle \sigma = \sqrt{\frac{1}{N-1} \sum_{i=1}^N (x_i-\overline{x})^2}$\\
\\Maximum value && $\displaystyle x_{max} = \max_{i=1}^N x_i$\\
\\Minimum value && $\displaystyle x_{min} = \min_{i=1}^N x_i$\\
\\Range && $\displaystyle x_{range} = x_{max} - x_{min}$\\
\\\hline
\end{tabular}
\end{table}

If the normalization factors are chosen as

\begin{equation}
\delta_j = x_{range\,j}\label{EqRangeNormalization}\ ,
\end{equation}

then all possible component wise distances $d_j$ of two learn data
tuples are guaranteed to be within the interval $[0,1]$. However,
the inputs of a test data sample can have a different range, so a
component wise distance $d_j>1$ is possible -- but usually this
should not be a problem, as a slightly higher component wise
distance is often justified in such a case. A drawback of the
range normalization is that this statistic can be highly affected
by noise. Thus often a certain percentage of the smallest and
biggest values is omitted when calculating the range.

If the normalization factors are chosen as four times the standard
deviation

\begin{equation}
\delta_j = 4 \sigma_j\label{EqDevNormalization}\ ,
\end{equation}

this leads to component wise distances that are mostly in the
interval $[0,1]$. This is due to the fact that 95\% of all values
of a normally distributed variable $x_j$ are inside the interval
$[\overline{x}_j-2\sigma_j,\ \overline{x}_j+2\sigma_j]$, and thus
the distance of two randomly chosen values will mostly be less
than $4 \sigma_j$.


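Both normalization factors are easy to estimate from the learn
data sample. The following Python sketch computes the (optionally
trimmed) range and the $4\sigma_j$ factor; the trimming fraction
is an illustrative parameter, not a prescribed value.

\begin{verbatim}
# Sketch of the two simple normalization factors for a continuous
# input: the (possibly trimmed) range and four times the sample
# standard deviation, both estimated from the learn data sample.
import statistics

def range_factor(values, trim=0.0):
    """Range of the input; a fraction `trim` of the smallest and
    largest values can be dropped to be robust against noise."""
    v = sorted(values)
    k = int(len(v) * trim)
    v = v[k:len(v) - k] if k > 0 else v
    return v[-1] - v[0]

def deviation_factor(values):
    """Four times the sample standard deviation."""
    return 4.0 * statistics.stdev(values)

x = [0.1, 0.4, 0.5, 0.9, 1.2, 5.0]     # 5.0 might be an outlier
print(range_factor(x), range_factor(x, trim=0.2), deviation_factor(x))
\end{verbatim}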
\paragraph{Mixed Continuous and Nominal Inputs}
Proper normalization for mixed continuous and nominal inputs is a
non-trivial problem if the overlap metric according to eqn.
(\ref{EqOverlapMetric}) is used to calculate the component wise
distances of nominal inputs\footnote{For notes on how to properly
normalize distances that were calculated using eqn. (\ref{EqVDM}),
see \cite{WiMa97}.}. The component wise distances are within the
interval $[0,1]$, but their average value tends to be much bigger
than the average value of a normalized continuous input. This
leads to a disproportionately high influence of nominal inputs. In
the context of this work, two different methods to normalize mixed
continuous and nominal inputs are considered.

The first and simplest approach is to keep using eqn.
(\ref{EqRangeNormalization}) or (\ref{EqDevNormalization}) to
determine the normalization factors for continuous inputs and to
simply choose a fixed $\delta = \delta_{Symb} > 1$ for all nominal
inputs.

The second approach is to calculate all normalization factors --
for the nominal as well as for the continuous inputs -- using the
normalization by the {\em average component wise distance} as
described in \cite{HoCa99}. The basic idea of this approach is to
normalize each input such that the average normalized component
wise distance is 1. To this end, the average component wise
distance between two learn data tuples is calculated as

\begin{equation}
\delta_j = d_{avr} = \frac{{\displaystyle \sum_{i=1}^N
\sum_{z=i+1}^N d_j(x_{i,j}, x_{z,j})}}{{\displaystyle \frac{1}{2}
N ( N - 1 )}}\ ,\label{EqAverageDistance}
\end{equation}

where $N$ is the sample size and $d_j(x_{i,j}, x_{z,j})$ denotes
the component wise distance of the $j$-th input between the $i$-th
and the $z$-th learn data tuple. For large sample sizes $N$ it is
not necessary to evaluate all possible pairs of learn data tuples
to get a sufficiently robust estimate -- the computational cost
can be reduced by evaluating only a subset of the learn data
sample.


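A small Python sketch of this normalization is given below; the
restriction to a random subset of pairs for large samples is an
illustrative implementation choice, not a prescription taken from
\cite{HoCa99}.

\begin{verbatim}
# Sketch of the normalization by the average component wise
# distance; for large samples only a random subset of pairs is
# evaluated to keep the computational cost low.
import random

def average_distance(values, dist, max_pairs=10000):
    """values -- the j-th component of all learn data tuples,
    dist -- the component wise distance function of this input."""
    n = len(values)
    if n * (n - 1) // 2 <= max_pairs:
        pairs = [(i, z) for i in range(n) for z in range(i + 1, n)]
    else:                           # random subset of pairs
        pairs = [tuple(random.sample(range(n), 2))
                 for _ in range(max_pairs)]
    total = sum(dist(values[i], values[z]) for i, z in pairs)
    return total / len(pairs)

continuous = [0.1, 0.5, 0.9, 1.3]
nominal = ['a', 'a', 'b', 'c']
print(average_distance(continuous, lambda a, b: abs(a - b)))
print(average_distance(nominal, lambda a, b: 0.0 if a == b else 1.0))
\end{verbatim}
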
\subsection{Feature Weights\label{SeWeights}}
The influence of more relevant inputs can be increased using
so-called {\em feature weights}. Note that the aims of
normalization and weighting differ: the normalization tries to
balance the influence of the inputs, whereas the feature weights
afterwards scale each input's influence with respect to its
relevance.

Several methods to calculate feature weights with application to
IBL algorithms are described in \cite{WeAhMo97, Aha98}. The
following paragraph describes an approach that calculates feature
weights based upon the mutual infor\-mation \cite{Re94} between a
single input and the output. This approach has also been used in
\cite{WD94} together with the {\sc Nge} and the $k${\sc nn}
algorithm.

The mutual information $I(x,y)$ measures the decrease in
uncertainty about the outcome of a discrete random variable $y$ if
another discrete variable $x$ is known. The uncertainty is
measured by the entropy $E(y)$ \cite{Sha48}.

\begin{equation}
E(y)=-\sum_{s=1}^{S_y}P(y=s) \log P(y=s)\label{EqEntropy}
\end{equation}

\begin{equation}
I(x,y)=I(y,x)=\sum_{w=1}^{S_x}\sum_{s=1}^{S_y}P(y=s \wedge x=w)
\log \frac{P(y=s \wedge x=w)}{P(y=s) P(x=w)}
\end{equation}

Here $S_x$ and $S_y$ denote the numbers of possible discrete
values of the random variables $x$ and $y$, and $P(x=w)$ and
$P(y=s)$ denote the probabilities of the $w$-th and the $s$-th
value, respectively. A division by zero is circumvented by the
convention $0\log\frac{0}{0}=0$. If an input does not contribute
to reducing the uncertainty about the output, then the mutual
information of this input will be 0. If an input completely
explains the output, then the mutual information equals the
entropy of the output.

The mutual information can be calculated easily, and the resulting
values are relatively robust with respect to different learn data
samples.

The feature weights are calculated based upon the mutual
information as follows: the mutual information $I(x_j,y)$ is
calculated for each of the $m$ inputs, whereby continuous inputs
are discretized -- as described in the following paragraph -- for
this calculation. The values $I(x_j,y)$ are then normalized to get
an average feature weight of one.

\begin{equation}\label{EqWeights}
\omega_j=\frac{m\,I(x_j,y)}{{\displaystyle
\sum_{w=1}^{m}I(x_w,y)}}
\end{equation}


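As an illustration, the following Python sketch estimates the
mutual information from a discrete data sample and derives feature
weights according to eqn. (\ref{EqWeights}); the example inputs
are artificial.

\begin{verbatim}
# Sketch of mutual information based feature weights; inputs and
# output are assumed to be discrete (continuous inputs would be
# discretized first).
import math
from collections import Counter

def mutual_information(x, y):
    n = len(x)
    p_x, p_y, p_xy = Counter(x), Counter(y), Counter(zip(x, y))
    mi = 0.0
    for (w, s), n_ws in p_xy.items():   # only observed pairs,
        p_joint = n_ws / n              # so no division by zero
        p_w, p_s = p_x[w] / n, p_y[s] / n
        mi += p_joint * math.log(p_joint / (p_w * p_s))
    return mi

def feature_weights(inputs, y):
    """inputs -- list of columns, one per (discretized) input."""
    mi = [mutual_information(x_j, y) for x_j in inputs]
    m = len(inputs)
    return [m * v / sum(mi) for v in mi]   # average weight is one

x1 = [1, 1, 2, 2, 1, 2]     # informative input
x2 = [1, 2, 1, 2, 1, 2]     # nearly uninformative input
y  = [1, 1, 2, 2, 1, 2]
print(feature_weights([x1, x2], y))
\end{verbatim}
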
\paragraph{Discretization of Continuous Inputs\label{SeDiscretization}}
Continuous inputs can be transformed into ordinal inputs using a
so-called discretization method that divides the range of the
continuous input into several intervals. Each continuous value is
replaced by the number of the interval it falls into\footnote{These
intervals are always enumerated starting with 1.}. The simplest
and most commonly used approaches are the so-called {\em
equidistant} and the {\em equifrequent} discretization. Both
approaches are univariate and unsupervised, and a fixed number of
discretization intervals $N_{Bins}$ has to be pre-specified; a
sketch of both variants is given after the following list.

\begin{itemize}
\item {\em Equidistant discretization.} The range of the input $x$
is divided into $N_{Bins}$ intervals of equal width
$w=\frac{x_{range}}{N_{Bins}}$. As with the range normalization,
it is advisable to ignore a small percentage of the smallest and
largest feature values when determining the input's range, to get
a more robust estimate in the case of noise. The left-most and the
right-most interval are extended to infinity.

\item {\em Equifrequent discretization.} The equidistant
discretization can be unfavorable if the numbers of data tuples
that fall into each interval differ significantly. The
equifrequent discretization addresses this problem by positioning
the intervals such that each interval contains approximately the
same number of data tuples $N'=\lfloor N/N_{Bins} \rfloor$, where
$\lfloor x \rfloor$ denotes the largest integer not greater than
$x$, i.e. $x$ rounded down. The left-most and the right-most
interval are extended to infinity.
\end{itemize}

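The following Python sketch illustrates both variants; the
interval boundaries, parameter names and example data are chosen
for illustration only.

\begin{verbatim}
# Sketch of the equidistant and the equifrequent discretization of
# a continuous input; both return, for each value, the number of
# its interval (enumerated starting from 1), with the outer
# intervals effectively extended to infinity by clamping.

def discretize_equidistant(values, n_bins):
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    bins = []
    for v in values:
        b = int((v - lo) / width) + 1 if width > 0 else 1
        bins.append(min(max(b, 1), n_bins))   # clamp outer intervals
    return bins

def discretize_equifrequent(values, n_bins):
    order = sorted(range(len(values)), key=lambda i: values[i])
    per_bin = len(values) // n_bins           # floor(N / N_bins)
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = min(rank // per_bin + 1, n_bins)
    return bins

x = [0.1, 0.2, 0.3, 0.4, 5.0, 6.0]
print(discretize_equidistant(x, 3))   # outlier dominates the width
print(discretize_equifrequent(x, 3))  # two values per interval
\end{verbatim}
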
Several more advanced discretization approaches have been
published. Of particular interest in combination with the
calculation of the mutual information is the entropy-based method
first published in \cite{FaIr89}. For an overview of other
possible approaches see \cite{HuLiTa99, DoKoSa95}.

The method in \cite{FaIr89} produces a discretization that
minimizes the entropy of the output $y$, calculated according to
eqn. (\ref{EqEntropy}) for each interval. It is not necessary to
pre-specify the number of intervals. The approach is applied as
follows: the input is divided into two intervals such that the
entropy of this discretization is minimal. Afterwards, one
continues recursively with the resulting intervals as long as the
{\em Minimum Description Length Principle} (MDLP) is satisfied.
This principle weighs the reduction of the entropy against the
increasing complexity of the discretization. The MDLP was
originally introduced in the field of communication theory in the
context of the bandwidth needed to transmit a message from a
sender to a receiver.\footnote{This approach seems especially well
suited for use in the context of mutual information calculations.
However, first experiments with the {\sc Pnc\,2} cluster algorithm
were disappointing, as the prediction accuracy was not increased.
Thus this approach was discarded.}
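
To give a feeling for the recursive splitting, the following
Python sketch implements a strongly simplified variant: the cut
point with the lowest class entropy is chosen recursively, but the
full MDLP stopping test of \cite{FaIr89} is replaced here by a
simple minimum gain threshold, so this is not the original method.

\begin{verbatim}
# Strongly simplified sketch of a recursive entropy based
# discretization: split at the cut point minimizing the class
# entropy, stop when the entropy gain falls below a threshold
# (used here in place of the MDLP criterion).
import math

def entropy(labels):
    n = len(labels)
    counts = {}
    for s in labels:
        counts[s] = counts.get(s, 0) + 1
    return -sum(c / n * math.log(c / n) for c in counts.values())

def split_entropy(pairs, cut):
    left = [s for v, s in pairs if v <= cut]
    right = [s for v, s in pairs if v > cut]
    n = len(pairs)
    return (len(left) / n * entropy(left)
            + len(right) / n * entropy(right))

def entropy_cuts(values, labels, min_gain=0.05):
    pairs = sorted(zip(values, labels))
    if len(pairs) < 2:
        return []
    candidates = sorted({v for v, _ in pairs})[:-1]
    if not candidates:
        return []
    best = min(candidates, key=lambda c: split_entropy(pairs, c))
    gain = entropy([s for _, s in pairs]) - split_entropy(pairs, best)
    if gain < min_gain:
        return []                       # stop splitting
    left = [(v, s) for v, s in pairs if v <= best]
    right = [(v, s) for v, s in pairs if v > best]
    return (entropy_cuts(*zip(*left), min_gain=min_gain) + [best]
            + entropy_cuts(*zip(*right), min_gain=min_gain))

x = [0.1, 0.2, 0.3, 1.1, 1.2, 2.5, 2.6]
y = [1,   1,   1,   2,   2,   1,   1]
print(sorted(entropy_cuts(x, y)))   # cuts between the class regions
\end{verbatim}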
