18 October 2010

The Effect of Test Set Selection on Classification Accuracy

I was looking at some prediction results for the UCI Michalski and Chilausky soybean data set and wondered how much they depended on test set selection. Some reported classification accuracy as high as 93.1% with a 25% training set, and 97.1% with 290 training and 340 test instances.

A few weeks ago I had been asked to find the best classifier for the soybean data set based on prediction accuracy on a test set of 20% of the data, with the remaining 80% available for training. That gives 306!/(245! × 61!) ≈ 1.3 × 10^65 possible splits of the 306 data points into training and test sets. Could some of these splits lead to better results than others for the classifiers I was about to use?
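That count is easy to check with a few lines of Python (an illustrative snippet, not part of the original scripts):

    # Number of ways to choose a 61-instance test set from the 306 instances.
    def factorial(n):
        result = 1
        for i in range(2, n + 1):
            result *= i
        return result

    splits = factorial(306) // (factorial(245) * factorial(61))
    print len(str(splits)), 'digits'    # 66 digits, i.e. on the order of 10^65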

The WEKA data mining package was used for classification. WEKA provides many classifiers that can be run on a data set so that their performance can be compared.

WEKA also has a programming interface, so I used it to write some Jython tools to explore the performance of a range of classifiers.
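The tools themselves aren't reproduced here, but the core of each one looks something like the following Jython sketch, which evaluates a single classifier on a single 80/20 split. The file name, random seed and choice of Naive Bayes are placeholders, not the settings actually used.

    from java.io import FileReader
    from java.util import Random
    from weka.core import Instances
    from weka.classifiers import Evaluation
    from weka.classifiers.bayes import NaiveBayes

    data = Instances(FileReader('soybean.arff'))      # load the ARFF data
    data.setClassIndex(data.numAttributes() - 1)      # class is the last attribute
    data.randomize(Random(1))                         # shuffle before splitting

    train_size = int(round(data.numInstances() * 0.8))
    test_size = data.numInstances() - train_size
    train = Instances(data, 0, train_size)            # first 80% for training
    test = Instances(data, train_size, test_size)     # remaining 20% for testing

    classifier = NaiveBayes()
    classifier.buildClassifier(train)

    evaluation = Evaluation(train)
    evaluation.evaluateModel(classifier, test)
    print '%.1f%% correct' % evaluation.pctCorrect()

Searching for good or bad splits then amounts to repeating this evaluation for many different orderings of the data.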

One of these tools was run on the soybean data to find the training/test splits with the best and worst classification accuracy. The results were:

Classifier                      Best Accuracy   Worst Accuracy
Naive Bayes                     100%            70.5%
Bayes Net                       100%            75.4%
J48 (C4.5)                      95%             69%
JRip (RIPPER)                   98.4%           70.5%
KStar                           96.7%           65.6%
Random Forest                   95%             62.3%
SMO (support vector machine)    96.7%           82%
MLP (neural network)            100%            77%
Fig 1. Best and worst accuracies for selected WEKA classifiers run on different training/test splits

That is quite a range of test set accuracies across different training/test splits. My simple genetic algorithm may not have found the extremes of the distributions, so the actual range may be even wider.
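The genetic algorithm isn't reproduced here, but the search was roughly of the following shape. This is a sketch only, not the actual tool: the fitness function is assumed to be a WEKA evaluation like the one sketched earlier, and the numeric settings are placeholders.

    import random

    N_INSTANCES = 306   # the 306 soybean instances mentioned above
    TEST_SIZE = 61      # roughly 20% held out for testing

    def random_split():
        return frozenset(random.sample(range(N_INSTANCES), TEST_SIZE))

    def mutate(test_set, swaps=3):
        # Swap a few instances between the test set and the training set.
        test = set(test_set)
        train = set(range(N_INSTANCES)) - test
        for _ in range(swaps):
            out_instance = random.choice(list(test))
            in_instance = random.choice(list(train))
            test.remove(out_instance)
            test.add(in_instance)
            train.remove(in_instance)
            train.add(out_instance)
        return frozenset(test)

    def search(fitness, generations=200, population=30):
        # Keep the fittest splits each generation and refill with mutants.
        # Maximise fitness to find the best split; negate it to find the worst.
        pool = [random_split() for _ in range(population)]
        for _ in range(generations):
            pool.sort(key=fitness, reverse=True)
            survivors = pool[:population // 2]
            pool = survivors + [mutate(random.choice(survivors)) for _ in survivors]
        return max(pool, key=fitness)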

When I ran the test set selection script a second time (Fig 2) it also found a 100% SMO accuracy. The second test was set up to find a single training/test split that gave the best results for all classifiers at once. It also used slightly different pre-processing: the 4 duplicate instances were removed but the troublesome single 2-4-5-t sample was left in, so I expected it to give worse results than the pre-processing used for the results in Fig 1.
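The per-classifier evaluation generalises easily to this combined search. A sketch of one way to do it, where accuracy_of_split_for() is a hypothetical helper wrapping the evaluation code shown earlier and the mean accuracy is just one possible combined score:

    from weka.classifiers.bayes import NaiveBayes, BayesNet
    from weka.classifiers.trees import J48, RandomForest
    from weka.classifiers.rules import JRip
    from weka.classifiers.lazy import KStar
    from weka.classifiers.functions import SMO, MultilayerPerceptron

    CLASSIFIERS = [NaiveBayes, BayesNet, J48, JRip, KStar,
                   RandomForest, SMO, MultilayerPerceptron]

    def combined_fitness(test_indices):
        # Score a split by the mean accuracy of all classifiers trained and
        # tested on it, so that one split is found for every classifier at once.
        scores = [accuracy_of_split_for(cls(), test_indices) for cls in CLASSIFIERS]
        return sum(scores) / len(scores)

A fitness function like this can be handed to the search sketched above in place of the single-classifier evaluation.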

Classifier       Correct (out of 60)   Percent Correct
Naive Bayes      57                    95%
Bayes Net        59                    98.3%
J48              58                    96.7%
JRip             60                    100%
KStar            60                    100%
Random Forest    59                    98.3%
SMO              60                    100%
MLP              60                    100%
Fig 2. Best accuracies for selected WEKA classifiers all run on the same training/test split

Both of the above results were obtained with the default settings of each WEKA classifier. The WEKA classifiers all have parameters that can be tuned, and subsets of attributes can be selected, so they can give both better and much worse results than the defaults. However, the default parameters are usually close to the best, so the results above may be good indicators of the best possible accuracies.
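For example, a classifier's parameters can be changed through its setters before it is built; the values below are arbitrary and only illustrate the mechanism.

    from weka.classifiers.trees import J48

    tree = J48()
    tree.setConfidenceFactor(0.1)   # prune harder than the 0.25 default
    tree.setMinNumObj(5)            # require at least 5 instances per leaf
    tree.buildClassifier(train)     # 'train' as in the earlier sketch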

It appears that the choice of training/test split can change measured classification accuracy by more than 30 percentage points, and this was observed on a well-known and widely used classification data set.
