## Lively discussion: how to cross-validate?

So today’s group meeting got a bit heated as Nafiz, Ashley, and Xiao touched on the finer points of how to cross-validate. Machine learning people, your comments are welcome.

It would be easier to work out what the argument was if people didn’t talk all over each other 😉

Nafiz (I assume that’s him at the board) is correct.

Given a dataset, and the scheme he’s describing on the board (k-fold CV), you would:

i) divide the data into two groups in the proportion 1/k [test] and (k-1)/k [train] (your choice of k is how you decide where the line is)

ii) train/fit the model on the larger [train] portion

iii) evaluate the model/fit on the smaller [test] portion

You would then do this (as he describes) k times, once for each of k mutually exclusive 1/k [test] portions*, training each time on the remaining (k-1)/k of the data. You do not train/fit your model on the same data used to test it. The whole purpose of this procedure is to keep the information used to evaluate the fit/model out of the procedure used to train/fit it, while still allowing the model/fit to be assessed on every point in the dataset. It provides an estimate of the performance of the fit/model, which is what should be reported as the expected performance of the final model/fit, where the final model is trained on the complete dataset.
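To make the loop concrete, here is a minimal sketch in plain Python. Everything in it is an illustration rather than anything shown in the meeting: the helper names `k_fold_indices` and `cross_validate`, the toy "predict the training mean" model, and the made-up data are all assumptions.

```python
import random

def k_fold_indices(n, k, seed=0):
    """Shuffle the indices 0..n-1 once, then deal them into k disjoint test folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_validate(xs, ys, k, fit, score, seed=0):
    """Return one score per fold; each model is scored only on points it never saw."""
    scores = []
    for test_idx in k_fold_indices(len(xs), k, seed):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(xs)) if i not in held_out]
        model = fit([xs[i] for i in train_idx], [ys[i] for i in train_idx])
        scores.append(score(model, [xs[i] for i in test_idx], [ys[i] for i in test_idx]))
    return scores

# Hypothetical toy model: always predict the mean of the training targets.
def fit_mean(xs, ys):
    return sum(ys) / len(ys)

def mse(model, xs, ys):
    return sum((y - model) ** 2 for y in ys) / len(ys)

xs = list(range(20))
ys = [2.0 * x + 1.0 for x in xs]

fold_scores = cross_validate(xs, ys, k=5, fit=fit_mean, score=mse)
cv_estimate = sum(fold_scores) / len(fold_scores)  # the performance you report
final_model = fit_mean(xs, ys)                     # final model: trained on all the data
```

Note that the test folds are disjoint and together cover every point exactly once, while the training sets overlap heavily from one fold to the next.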

However, there is one more subtlety… 😉

If you are comparing models/fits (which includes parameter choices) to decide which is the best, you should divide your dataset into three groups, proportioned as e.g. 1/k [test], 1/k [hold-out], (k-2)/k [train]. The procedure runs as before with the training and test sets, which are used for all models/choices, and the estimate of performance is obtained as before; but this performance estimate only helps decide between the models/fits/parameter sets. The final reported model/fit performance should instead be evaluated on the hold-out set. This is because the training process (both the model/fit process itself, and choosing between alternative models/fits) has effectively involved both the training and test sets, and you should not employ information that was used to train a model in its assessment (otherwise you risk reporting an overfitted, over-optimistic result).
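One way to read the three-way scheme, sketched in plain Python: set one 1/k block aside as the hold-out, rotate the test role among the remaining blocks to choose between candidates, and touch the hold-out only once at the end. The function `select_and_report`, the two candidate models, and the synthetic data are all hypothetical, and fixing a single hold-out block is my assumption about how the proportions would be used.

```python
import random

def mean_model(xs, ys):
    m = sum(ys) / len(ys)
    return lambda x: m

def line_model(xs, ys):
    # Ordinary least-squares straight line.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return lambda x: my + slope * (x - mx)

def mse(predict, xs, ys):
    return sum((predict(x) - y) ** 2 for x, y in zip(xs, ys)) / len(ys)

def select_and_report(xs, ys, candidates, k=5, seed=0):
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    blocks = [idx[i::k] for i in range(k)]
    holdout, cv_blocks = blocks[0], blocks[1:]  # hold-out is never used for selection
    take = lambda ids: ([xs[i] for i in ids], [ys[i] for i in ids])

    cv_error = {}
    for name, fit in candidates.items():
        errs = []
        for j, test in enumerate(cv_blocks):
            train = [i for b, blk in enumerate(cv_blocks) if b != j for i in blk]
            errs.append(mse(fit(*take(train)), *take(test)))
        cv_error[name] = sum(errs) / len(errs)  # used only to choose between models

    best = min(cv_error, key=cv_error.get)
    non_holdout = [i for blk in cv_blocks for i in blk]
    # Refit the winner on everything except the hold-out, then score it once.
    reported = mse(candidates[best](*take(non_holdout)), *take(holdout))
    return best, cv_error, reported

rng = random.Random(1)
xs = [float(x) for x in range(40)]
ys = [3.0 * x + 2.0 + rng.gauss(0, 1) for x in xs]  # made-up noisy line

best, cv_error, reported = select_and_report(
    xs, ys, {"mean": mean_model, "line": line_model}
)
```

Here `cv_error` only decides which candidate wins; `reported` is the number you would quote, because the hold-out block played no part in either fitting or choosing.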

It does not matter if your model/fit accuracy is lower in any particular 1/k and (k-1)/k subsetting – you would expect it to sometimes be higher, sometimes lower, than the final model/fit estimate.

Shameless self-promotion:

We discuss CV methods briefly in this, but there are links out to more detail: https://www.ncbi.nlm.nih.gov/pubmed/24643551

Slides 118-126 here should give a graphical explanation of where/how the line is decided: http://www.slideshare.net/leightonp/mining-plant-pathogen-genomes-for-effectors

*constructed, for example (as he says), by shuffling the dataset before you start, and pre-marking the set with k divisions**.

**noting that, for the most appropriate results, you should aim not to truly randomise the dataset, but to have each 1/k block be representative of the dataset as a whole (e.g. if stratified, the proportions of each group should be about the same)
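A rough sketch of what "representative blocks" could look like in practice, again in plain Python: shuffle within each group, then deal each group's members round-robin across the k blocks. The function name `stratified_folds` and the 10/40 label split are hypothetical, chosen only for illustration.

```python
import random
from collections import defaultdict

def stratified_folds(labels, k, seed=0):
    """Deal each class's shuffled members round-robin into k folds."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, lab in enumerate(labels):
        by_class[lab].append(i)

    folds = [[] for _ in range(k)]
    for lab in sorted(by_class):
        members = by_class[lab]
        rng.shuffle(members)  # random within a class, balanced across folds
        for pos, i in enumerate(members):
            folds[pos % k].append(i)
    return folds

# Made-up imbalanced dataset: 10 positives, 40 negatives.
labels = ["pos"] * 10 + ["neg"] * 40
folds = stratified_folds(labels, k=5)

for f in folds:
    n_pos = sum(labels[i] == "pos" for i in f)
    print(f"fold size {len(f)}, positives {n_pos}")
```

With these numbers every block ends up with the same 1:4 ratio as the full dataset. When class sizes do not divide evenly by k, this simple dealing gives the earlier blocks the extra members, so the proportions are only approximately equal.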