18 October 2010

The Effect of Test Set Selection on Classification Accuracy

I was looking at some prediction results for the UCI Michalski and Chilausky soybean data set and wondered how much they depended on test set selection. Some reported classification accuracy as high as 93.1% with a 25% training set, and 97.1% with 290 training and 340 test instances.

A few weeks ago I had been asked to find the best classifier for the soybean data set based on prediction accuracy on a test set of 20% of the data, with the remaining 80% available for training. That gives 306!/(245! × 61!) ≈ 1.3 × 10^65 possible splits of the 306 data points into training and test sets. Could some of these splits lead to better results than others for the classifiers I was about to use?
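That count is easy to check with a few lines of Python (an illustrative snippet, not part of the original scripts):

    # Number of ways to choose a 61-instance test set from the 306 instances.
    def factorial(n):
        result = 1
        for i in range(2, n + 1):
            result *= i
        return result

    splits = factorial(306) // (factorial(245) * factorial(61))
    print len(str(splits)), 'digits'    # 66 digits, i.e. on the order of 10^65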

The WEKA data mining package was used for classification. WEKA provides many classifiers that can be run on a data set so that their performance can be compared.

WEKA also has a programming interface, so I used it to write some Jython tools to explore the performance of a range of classifiers.
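The tools themselves aren't reproduced here, but the core of each one looks something like the following Jython sketch, which evaluates a single classifier on a single 80/20 split. The file name, random seed and choice of Naive Bayes are placeholders, not the settings actually used.

    from java.io import FileReader
    from java.util import Random
    from weka.core import Instances
    from weka.classifiers import Evaluation
    from weka.classifiers.bayes import NaiveBayes

    data = Instances(FileReader('soybean.arff'))      # load the ARFF data
    data.setClassIndex(data.numAttributes() - 1)      # class is the last attribute
    data.randomize(Random(1))                         # shuffle before splitting

    train_size = int(round(data.numInstances() * 0.8))
    test_size = data.numInstances() - train_size
    train = Instances(data, 0, train_size)            # first 80% for training
    test = Instances(data, train_size, test_size)     # remaining 20% for testing

    classifier = NaiveBayes()
    classifier.buildClassifier(train)

    evaluation = Evaluation(train)
    evaluation.evaluateModel(classifier, test)
    print '%.1f%% correct' % evaluation.pctCorrect()

Searching for good or bad splits then amounts to repeating this evaluation for many different orderings of the data.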

One of these tools was run on the soybean data to find the training/test splits with the best and worst classification accuracy. The results were:

Classifier                      Best Accuracy   Worst Accuracy
Naive Bayes                     100%            70.5%
Bayes Net                       100%            75.4%
J48 (C4.5)                      95%             69%
JRip (RIPPER)                   98.4%           70.5%
KStar                           96.7%           65.6%
Random Forest                   95%             62.3%
SMO (support vector machine)    96.7%           82%
MLP (neural network)            100%            77%
Fig 1. Best and worst accuracies for selected WEKA classifiers run on different training/test splits

That is quite a range of test set accuracies across different training/test splits. My simple genetic algorithm may not have found the extremes of the distributions, so the actual range may be even wider.
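The genetic algorithm isn't reproduced here, but the search was roughly of the following shape. This is a sketch only, not the actual tool: the fitness function is assumed to be a WEKA evaluation like the one sketched earlier, and the numeric settings are placeholders.

    import random

    N_INSTANCES = 306   # the 306 soybean instances mentioned above
    TEST_SIZE = 61      # roughly 20% held out for testing

    def random_split():
        return frozenset(random.sample(range(N_INSTANCES), TEST_SIZE))

    def mutate(test_set, swaps=3):
        # Swap a few instances between the test set and the training set.
        test = set(test_set)
        train = set(range(N_INSTANCES)) - test
        for _ in range(swaps):
            out_instance = random.choice(list(test))
            in_instance = random.choice(list(train))
            test.remove(out_instance)
            test.add(in_instance)
            train.remove(in_instance)
            train.add(out_instance)
        return frozenset(test)

    def search(fitness, generations=200, population=30):
        # Keep the fittest splits each generation and refill with mutants.
        # Maximise fitness to find the best split; negate it to find the worst.
        pool = [random_split() for _ in range(population)]
        for _ in range(generations):
            pool.sort(key=fitness, reverse=True)
            survivors = pool[:population // 2]
            pool = survivors + [mutate(random.choice(survivors)) for _ in survivors]
        return max(pool, key=fitness)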

When I ran the test set selection script a second time (Fig 2) it also found a 100% SMO accuracy. The second test was set up to find a single training/test split that gave the best results for all classifiers at once. It also used slightly different pre-processing: the 4 duplicate instances were removed but the troublesome single 2-4-5-t sample was left in, so I expected it to give worse results than the pre-processing used for the results in Fig 1.
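The per-classifier evaluation generalises easily to this combined search. A sketch of one way to do it, where accuracy_of_split_for() is a hypothetical helper wrapping the evaluation code shown earlier and the mean accuracy is just one possible combined score:

    from weka.classifiers.bayes import NaiveBayes, BayesNet
    from weka.classifiers.trees import J48, RandomForest
    from weka.classifiers.rules import JRip
    from weka.classifiers.lazy import KStar
    from weka.classifiers.functions import SMO, MultilayerPerceptron

    CLASSIFIERS = [NaiveBayes, BayesNet, J48, JRip, KStar,
                   RandomForest, SMO, MultilayerPerceptron]

    def combined_fitness(test_indices):
        # Score a split by the mean accuracy of all classifiers trained and
        # tested on it, so that one split is found for every classifier at once.
        scores = [accuracy_of_split_for(cls(), test_indices) for cls in CLASSIFIERS]
        return sum(scores) / len(scores)

A fitness function like this can be handed to the search sketched above in place of the single-classifier evaluation.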

Classifier       Correct (out of 60)   Percent Correct
Naive Bayes      57                    95%
Bayes Net        59                    98.3%
J48              58                    96.7%
JRip             60                    100%
KStar            60                    100%
Random Forest    59                    98.3%
SMO              60                    100%
MLP              60                    100%
Fig 2. Best accuracies for selected WEKA classifiers all run on the same training/test split

Both of the above results were obtained with the default settings of each WEKA classifier. The WEKA classifiers all have parameters that can be tuned, and subsets of attributes can be selected, so they can give both better and much worse results than the defaults. However, the default parameters are usually close to the best, so the results above may be good indicators of the best possible accuracies.
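For example, a classifier's parameters can be changed through its setters before it is built; the values below are arbitrary and only illustrate the mechanism.

    from weka.classifiers.trees import J48

    tree = J48()
    tree.setConfidenceFactor(0.1)   # prune harder than the 0.25 default
    tree.setMinNumObj(5)            # require at least 5 instances per leaf
    tree.buildClassifier(train)     # 'train' as in the earlier sketch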

It appears that the choice of training/test split can change measured classification accuracy by more than 30 percentage points, and this was observed on a well-known and widely used classification data set.
