Test accuracy higher than training. How to interpret?

I have a dataset of at most 150 examples (split into training and test sets) with more than 1000 features. I need to compare classifiers and feature selection methods to find out which ones perform well on this data. So I'm using three classification methods (J48, NB, SVM) and two feature selection methods (CFS, WrapperSubset) with different search methods (Greedy, BestFirst).

While comparing, I look at the training accuracy (5-fold cross-validation) and the test accuracy. Here is one of the results for J48 with CFS-BestFirst:

< "accuracyTraining" : 95.83, "accuracyTest" : 98.21 >

Many results are like this, and with the SVM there are many results where the test accuracy is much higher than the training accuracy (e.g. training: 60%, test: 98%).

How can I meaningfully interpret these kinds of results? If the test accuracy were lower, I would say it's overfitting. Is there something to be said about bias and variance in this case by looking at all the results? What can I do to make this comparison meaningful, such as re-selecting the training and test sets, or just using cross-validation on all the data?

Update: I have 73 training and 58 test instances. Some answers didn't have this info when they were posted.
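For concreteness, here is a rough scikit-learn sketch of the kind of pipeline I mean. It is not my actual WEKA setup: DecisionTreeClassifier and SelectKBest only stand in for J48 and CFS/WrapperSubset, and the data is random, with just the sample sizes and feature count matching my problem.

```python
# Minimal sketch of the workflow: feature selection + classifier, evaluated
# with 5-fold CV on the training set and once on the held-out test set.
# The estimators and the random data are placeholders, not the WEKA setup.
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X_train, y_train = rng.normal(size=(73, 1000)), rng.integers(0, 2, 73)
X_test, y_test = rng.normal(size=(58, 1000)), rng.integers(0, 2, 58)

# Feature selection lives inside the pipeline so it is re-fitted in every fold.
model = Pipeline([
    ("select", SelectKBest(f_classif, k=20)),
    ("clf", DecisionTreeClassifier(random_state=0)),
])

cv_acc = cross_val_score(model, X_train, y_train, cv=5).mean()   # "training" (internal CV) accuracy
test_acc = model.fit(X_train, y_train).score(X_test, y_test)     # held-out test accuracy
print(f"5-fold CV accuracy: {cv_acc:.3f}, test accuracy: {test_acc:.3f}")
```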


7 Answers


I think a first step is to check whether the reported training and test performance are in fact correct.

I agree with @mbq that training error is hardly ever useful in machine learning. But you may be in one of the few situations where it actually is useful: If the program selects a "best" model by comparing accuracies, but has only training errors to choose from, you need to check whether the training error actually allows a sensible choice.
@mbq outlined the best-case scenario, where the models are genuinely indistinguishable. However, worse scenarios happen as well: just like test accuracy, training accuracy is subject to variance, but in addition it has an optimistic bias compared to the generalization accuracy that is usually of interest. This can lead to a situation where models that really do have different performance cannot be distinguished, because their training (or internal CV) accuracies are too close together due to that optimistic bias. Iterative feature selection methods, for example, can be subject to such problems, which may even persist for the internal cross-validation accuracies (depending on how that cross-validation is implemented).
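To make this concrete, here is a small sketch (scikit-learn stand-ins, not your WEKA methods) of how feature selection that sees all training labels inflates the reported accuracy on pure noise, where the true accuracy is 50 %, while selection re-fitted inside each fold does not:

```python
# Selection-bias demo on random data: picking features with all labels before
# the CV split reports far better than chance; selection inside each fold stays
# near 50 %. Estimators and k are illustrative choices, not from the question.
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(73, 1000))          # pure noise features
y = rng.integers(0, 2, 73)               # random labels: true accuracy ~ 0.5

# Biased: select features using all labels, then cross-validate only the classifier.
X_leaky = SelectKBest(f_classif, k=20).fit_transform(X, y)
leaky_cv = cross_val_score(SVC(), X_leaky, y, cv=5).mean()

# Unbiased: selection happens inside each CV training fold.
nested = Pipeline([("select", SelectKBest(f_classif, k=20)), ("clf", SVC())])
honest_cv = cross_val_score(nested, X, y, cv=5).mean()

print(f"selection outside CV: {leaky_cv:.2f}, selection inside CV: {honest_cv:.2f}")
```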

So if such an issue could arise, I think it is a good idea to check whether a sensible choice can possibly result from the accuracies the program uses for the decision. This would mean checking that the internal CV accuracy (which is supposedly used for selecting the best model) is not, or not too much, optimistically biased with respect to an externally done CV with statistically independent splitting. Again, synthetic and/or random data can help you find out what the program actually does.
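One way to do such a check, sketched here with scikit-learn as a stand-in for your setup and random data as the null case: wrap the model-selection step in an outer cross-validation and compare the internally reported best score with the externally estimated accuracy. A noticeable gap means the internal figure is too optimistic to base your comparison on.

```python
# Compare the internal CV score used to pick the "best" model with an external
# CV on statistically independent splits (nested cross-validation).
# Random data, so the true accuracy is 50 %; the grid is only illustrative.
import numpy as np
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.svm import SVC

rng = np.random.default_rng(1)
X = rng.normal(size=(73, 1000))          # noise features
y = rng.integers(0, 2, 73)               # random labels

param_grid = {"C": [0.1, 1, 10], "gamma": ["scale", 1e-3, 1e-4]}

# Internal CV accuracy: what the model-selection step itself reports.
inner = GridSearchCV(SVC(), param_grid, cv=StratifiedKFold(5)).fit(X, y)
internal_cv = inner.best_score_

# External CV accuracy: the whole selection procedure evaluated on independent splits.
external_cv = cross_val_score(GridSearchCV(SVC(), param_grid, cv=5),
                              X, y, cv=StratifiedKFold(5)).mean()

print(f"internal CV: {internal_cv:.2f}, external CV: {external_cv:.2f}")
```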

A second step is to look at whether the observed differences for statistically independent splits are meaningful, as @mbq pointed out already.

I suggest you calculate what difference in accuracy you need to observe with your given sample size in order to have a statistically meaningful difference. If your observed variation is less, you cannot decide which algorithm is better with your given data set: further optimization does not make sense.
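As a rough sketch of such a calculation (scipy, exact Clopper-Pearson intervals; the counts below are only approximate matches to the accuracies you quote):

```python
# 95 % Clopper-Pearson confidence intervals for an observed accuracy of
# correct/n, to see how large a difference the sample sizes can support.
from scipy.stats import beta

def accuracy_ci(correct, n, alpha=0.05):
    """Exact (Clopper-Pearson) confidence interval for an accuracy of correct/n."""
    lo = 0.0 if correct == 0 else beta.ppf(alpha / 2, correct, n - correct + 1)
    hi = 1.0 if correct == n else beta.ppf(1 - alpha / 2, correct + 1, n - correct)
    return lo, hi

for correct, n, label in [(70, 73, "training CV, ~96 % of 73"),
                          (57, 58, "test, ~98 % of 58"),
                          (44, 73, "training CV, ~60 % of 73")]:
    lo, hi = accuracy_ci(correct, n)
    print(f"{label}: {correct / n:.3f}  95 % CI [{lo:.3f}, {hi:.3f}]")
```

With samples this small, the intervals for roughly 96 % (training) and 98 % (test) overlap almost completely, so that particular gap on its own cannot be interpreted.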