Test accuracy higher than training. How to interpret?

I have a dataset of at most 150 examples (split into training and test sets) with more than 1000 features. I need to compare classifiers and feature selection methods to find out which ones perform well on this data. So I'm using three classification methods (J48, NB, SVM) and two feature selection methods (CFS, WrapperSubset) with different search methods (Greedy, BestFirst).

While comparing, I look at the training accuracy (5-fold cross-validation) and the test accuracy. Here is one of the results for J48 with CFS-BestFirst:

< "accuracyTraining" : 95.83, "accuracyTest" : 98.21 >

Many results are like this, and with the SVM there are many results where the test accuracy is much higher than the training accuracy (e.g. training: 60%, test: 98%).

How can I meaningfully interpret these kinds of results? If the test accuracy were lower, I would say it's overfitting. Is there something to be said about bias and variance in this case by looking at all the results? What can I do to make this comparison meaningful, such as re-selecting the training and test sets, or just using cross-validation on all the data?

Update: I have 73 training and 58 test instances. Some answers didn't have this info when they were posted.
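For concreteness, here is a rough scikit-learn sketch of the kind of pipeline I mean. It is not my actual WEKA setup: DecisionTreeClassifier and SelectKBest only stand in for J48 and CFS/WrapperSubset, and the data is random, with just the sample sizes and feature count matching my problem.

```python
# Minimal sketch of the workflow: feature selection + classifier, evaluated
# with 5-fold CV on the training set and once on the held-out test set.
# The estimators and the random data are placeholders, not the WEKA setup.
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X_train, y_train = rng.normal(size=(73, 1000)), rng.integers(0, 2, 73)
X_test, y_test = rng.normal(size=(58, 1000)), rng.integers(0, 2, 58)

# Feature selection lives inside the pipeline so it is re-fitted in every fold.
model = Pipeline([
    ("select", SelectKBest(f_classif, k=20)),
    ("clf", DecisionTreeClassifier(random_state=0)),
])

cv_acc = cross_val_score(model, X_train, y_train, cv=5).mean()   # "training" (internal CV) accuracy
test_acc = model.fit(X_train, y_train).score(X_test, y_test)     # held-out test accuracy
print(f"5-fold CV accuracy: {cv_acc:.3f}, test accuracy: {test_acc:.3f}")
```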


7 Answers


I think a first step is to check whether the reported training and test performance are in fact correct.

I agree with @mbq that training error is hardly ever useful in machine learning. But you may be in one of the few situations where it actually is useful: If the program selects a "best" model by comparing accuracies, but has only training errors to choose from, you need to check whether the training error actually allows a sensible choice.
@mbq outlined the best-case scenario, where the models are genuinely indistinguishable. However, worse scenarios happen as well: just like test accuracy, training accuracy is subject to variance, but in addition it has an optimistic bias compared to the generalization accuracy that is usually of interest. This can lead to a situation where models that really do have different performance cannot be distinguished, because their training (or internal CV) accuracies are too close together due to that optimistic bias. Iterative feature selection methods, for example, can be subject to such problems, which may even persist for the internal cross-validation accuracies (depending on how that cross-validation is implemented).
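To make this concrete, here is a small sketch (scikit-learn stand-ins, not your WEKA methods) of how feature selection that sees all training labels inflates the reported accuracy on pure noise, where the true accuracy is 50 %, while selection re-fitted inside each fold does not:

```python
# Selection-bias demo on random data: picking features with all labels before
# the CV split reports far better than chance; selection inside each fold stays
# near 50 %. Estimators and k are illustrative choices, not from the question.
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(73, 1000))          # pure noise features
y = rng.integers(0, 2, 73)               # random labels: true accuracy ~ 0.5

# Biased: select features using all labels, then cross-validate only the classifier.
X_leaky = SelectKBest(f_classif, k=20).fit_transform(X, y)
leaky_cv = cross_val_score(SVC(), X_leaky, y, cv=5).mean()

# Unbiased: selection happens inside each CV training fold.
nested = Pipeline([("select", SelectKBest(f_classif, k=20)), ("clf", SVC())])
honest_cv = cross_val_score(nested, X, y, cv=5).mean()

print(f"selection outside CV: {leaky_cv:.2f}, selection inside CV: {honest_cv:.2f}")
```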

So if such an issue could arise, I think it is a good idea to check whether a sensible choice can possibly result from the accuracies the program uses for the decision. This would mean checking that the internal CV accuracy (which is supposedly used for selecting the best model) is not, or not too much, optimistically biased with respect to an externally done CV with statistically independent splitting. Again, synthetic and/or random data can help you find out what the program actually does.
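One way to do such a check, sketched here with scikit-learn as a stand-in for your setup and random data as the null case: wrap the model-selection step in an outer cross-validation and compare the internally reported best score with the externally estimated accuracy. A noticeable gap means the internal figure is too optimistic to base your comparison on.

```python
# Compare the internal CV score used to pick the "best" model with an external
# CV on statistically independent splits (nested cross-validation).
# Random data, so the true accuracy is 50 %; the grid is only illustrative.
import numpy as np
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.svm import SVC

rng = np.random.default_rng(1)
X = rng.normal(size=(73, 1000))          # noise features
y = rng.integers(0, 2, 73)               # random labels

param_grid = {"C": [0.1, 1, 10], "gamma": ["scale", 1e-3, 1e-4]}

# Internal CV accuracy: what the model-selection step itself reports.
inner = GridSearchCV(SVC(), param_grid, cv=StratifiedKFold(5)).fit(X, y)
internal_cv = inner.best_score_

# External CV accuracy: the whole selection procedure evaluated on independent splits.
external_cv = cross_val_score(GridSearchCV(SVC(), param_grid, cv=5),
                              X, y, cv=StratifiedKFold(5)).mean()

print(f"internal CV: {internal_cv:.2f}, external CV: {external_cv:.2f}")
```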

A second step is to look at whether the observed differences for statistically independent splits are meaningful, as @mbq pointed out already.

I suggest you calculate what difference in accuracy you need to observe with your given sample size in order to have a statistically meaningful difference. If your observed variation is less, you cannot decide which algorithm is better with your given data set: further optimization does not make sense.
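As a rough sketch of such a calculation (scipy, exact Clopper-Pearson intervals; the counts below are only approximate matches to the accuracies you quote):

```python
# 95 % Clopper-Pearson confidence intervals for an observed accuracy of
# correct/n, to see how large a difference the sample sizes can support.
from scipy.stats import beta

def accuracy_ci(correct, n, alpha=0.05):
    """Exact (Clopper-Pearson) confidence interval for an accuracy of correct/n."""
    lo = 0.0 if correct == 0 else beta.ppf(alpha / 2, correct, n - correct + 1)
    hi = 1.0 if correct == n else beta.ppf(1 - alpha / 2, correct + 1, n - correct)
    return lo, hi

for correct, n, label in [(70, 73, "training CV, ~96 % of 73"),
                          (57, 58, "test, ~98 % of 58"),
                          (44, 73, "training CV, ~60 % of 73")]:
    lo, hi = accuracy_ci(correct, n)
    print(f"{label}: {correct / n:.3f}  95 % CI [{lo:.3f}, {hi:.3f}]")
```

With samples this small, the intervals for roughly 96 % (training) and 98 % (test) overlap almost completely, so that particular gap on its own cannot be interpreted.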