# Some Basics

## The basic task of data-based modelling

The following figure illustrates the basic task of data-based modelling. Given is a system with an input vector x and a corresponding output y. The input vector consists of one or more input variables. The aim is to build a model that predicts the unknown output value from the input vector. A tuple P=(x,y) consisting of an input vector and its output value is called a data tuple. To build a data-based model, you first need to collect data tuples from the real system by recording input values and the corresponding output values under representative operating conditions. The PNC2 cluster algorithm can then be employed to find rules in your data.
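As a rough illustration of what a collection of data tuples might look like, here is a minimal Python sketch. The `DataTuple` type, the variable meanings, and all recorded values are hypothetical and chosen only for illustration; they are not part of the PNC2 Rule Induction System or its file format.

```python
from typing import NamedTuple, Sequence, Union

# A single value may be symbolic (str) or numeric (float).
Value = Union[str, float]


class DataTuple(NamedTuple):
    """Hypothetical container for one data tuple P = (x, y)."""
    x: Sequence[Value]  # input vector: one or more input variables
    y: Value            # corresponding output value


# Made-up recordings: (temperature, color) -> decision
samples = [
    DataTuple(x=(21.5, "red"), y="accept"),
    DataTuple(x=(35.0, "blue"), y="reject"),
    DataTuple(x=(19.0, "green"), y="accept"),
]

# Split the sample into input vectors and output values.
inputs = [p.x for p in samples]
outputs = [p.y for p in samples]
```

A learning algorithm would receive `inputs` and `outputs` from such a sample and search for rules that map the former to the latter.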

## What about a variable's type?

With respect to the values they can take, there are three different types of variables.

• Nominal variables can only have symbolic values that cannot be ordered by a greater-less relation. An example of a nominal variable is the color of an object, which can take the symbols red, green and blue.
• Ordinal variables also have only symbolic values but, in contrast to nominal variables, these can be ordered by a greater-less relation. An example of an ordinal variable is a temperature that is measured with the qualitative terms cold, warm and hot. Another example is the age of a person measured in years. Within the PNC2 Rule Induction System, ordinal variables with just a few different symbols, such as the temperature example above, should be treated as nominal. Ordinal variables with many different symbols, such as the age example above, should be treated as continuous.
• Continuous variables can have arbitrary real values, limited only by the precision of the measuring device. An example of a continuous variable is the temperature measured in degrees Celsius.

The type of the output variable determines which of two fundamentally different learning tasks has to be solved. If the output is nominal, the task is a classification task. If the output is continuous, the task is a regression task. The PNC2 Rule Induction System is primarily intended for classification tasks.
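The treatment rules above can be sketched in a few lines of Python. This is only an illustration: the function names and, in particular, the cut-off `few_symbols_limit=10` are assumptions made here. The text only distinguishes "a few" from "many" symbols and does not give a number.

```python
def treat_as(kind: str, n_symbols: int = 0, few_symbols_limit: int = 10) -> str:
    """Return how a variable might be treated: 'nominal' or 'continuous'.

    few_symbols_limit is an assumed threshold, not a documented PNC2 value.
    """
    if kind in ("nominal", "continuous"):
        return kind
    if kind == "ordinal":
        # Few symbols -> treat as nominal; many symbols -> treat as continuous.
        return "nominal" if n_symbols <= few_symbols_limit else "continuous"
    raise ValueError(f"unknown variable type: {kind!r}")


def task_type(output_treatment: str) -> str:
    """Map the treatment of the output variable to the learning task."""
    tasks = {"nominal": "classification", "continuous": "regression"}
    return tasks[output_treatment]


# Temperature as cold/warm/hot: ordinal with 3 symbols.
temperature = treat_as("ordinal", n_symbols=3)
# Age in years: ordinal with many symbols.
age = treat_as("ordinal", n_symbols=100)
```

With these assumptions, an output such as the qualitative temperature leads to a classification task, while an output such as the age leads to a regression task.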

## How to estimate a model's prediction accuracy?

Usually the prediction accuracy of a learned model is evaluated on a new, unseen test data sample. For each test data tuple, the model predicts the output value from the tuple's input vector. The differences between the real and the predicted output values are then summarized into a single loss function value as follows:

• Classification tasks: mean classification error (MCE), i.e. the fraction of test data tuples that are misclassified
• Regression tasks: mean absolute error (MAE), i.e. the mean absolute difference between the real and the predicted output values
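
Both loss values are simple averages over the test sample. The following Python sketch shows one way to compute them. The function names and the example values are illustrative only and are not taken from the PNC2 software.

```python
def mean_classification_error(y_true, y_pred):
    """Fraction of test tuples whose predicted class differs from the real one."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("need two non-empty sequences of equal length")
    return sum(t != p for t, p in zip(y_true, y_pred)) / len(y_true)


def mean_absolute_error(y_true, y_pred):
    """Mean of the absolute differences between real and predicted values."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("need two non-empty sequences of equal length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Classification: one of four predictions is wrong.
mce = mean_classification_error(
    ["red", "green", "blue", "red"],
    ["red", "blue", "blue", "red"],
)

# Regression: absolute errors are 1.0, 0.0 and 2.0.
mae = mean_absolute_error([20.0, 25.0, 30.0], [21.0, 25.0, 28.0])
```

Here `mce` is 0.25 (one misclassification in four test tuples) and `mae` is 1.0 (the average of 1.0, 0.0 and 2.0).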

last updated: 22 January 2004   © 2000-2004 by Lars Haendel