An Efficient Data Partitioning to Improve Classification Performance While Keeping Parameters Interpretable

Standard

An Efficient Data Partitioning to Improve Classification Performance While Keeping Parameters Interpretable. / Korjus, Kristjan; Hebart, Martin N; Vicente, Raul.

In: PLOS ONE, Vol. 11, No. 8, 2016, p. e0161788.

Publications: SCORING: Contribution to journal/newspaper › SCORING: Journal article › Research › Peer-reviewed


Bibtex

@article{bd50757e0143493db81b4fbc50939a38,
title = "An Efficient Data Partitioning to Improve Classification Performance While Keeping Parameters Interpretable",
abstract = "Supervised machine learning methods typically require splitting data into multiple chunks for training, validating, and finally testing classifiers. For finding the best parameters of a classifier, training and validation are usually carried out with cross-validation. This is followed by application of the classifier with optimized parameters to a separate test set for estimating the classifier's generalization performance. With limited data, this separation of test data creates a difficult trade-off between having more statistical power in estimating generalization performance versus choosing better parameters and fitting a better model. We propose a novel approach that we term {"}cross-validation and cross-testing{"}, improving this trade-off by re-using test data without biasing classifier performance. The novel approach is validated using simulated data and electrophysiological recordings in humans and rodents. The results demonstrate that the approach has a higher probability of discovering significant results than the standard approach of cross-validation and testing, while maintaining the nominal alpha level. In contrast to nested cross-validation, which is maximally efficient in re-using data, the proposed approach additionally maintains the interpretability of individual parameters. Taken together, we suggest an addition to currently used machine learning approaches which may be particularly useful in cases where model weights do not require interpretation, but parameters do.",
keywords = "Journal Article",
author = "Kristjan Korjus and Hebart, {Martin N} and Raul Vicente",
year = "2016",
doi = "10.1371/journal.pone.0161788",
language = "English",
volume = "11",
pages = "e0161788",
journal = "PLOS ONE",
issn = "1932-6203",
publisher = "Public Library of Science",
number = "8",
}
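For context, the abstract contrasts the proposed method with the standard "cross-validation and testing" scheme: hold out a test set, tune parameters by k-fold cross-validation on the remaining data, then estimate generalization performance once on the held-out test set. The sketch below illustrates only that standard baseline, not the authors' proposed cross-testing procedure; the toy nearest-centroid classifier, the `shrink` parameter, and all split sizes are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def kfold_indices(n, k):
    """Yield (train_idx, val_idx) index pairs for k-fold cross-validation."""
    folds = np.array_split(np.arange(n), k)
    for i in range(k):
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train, folds[i]

def nearest_mean_classify(X_train, y_train, X_eval, shrink):
    """Toy nearest-centroid rule; `shrink` is the interpretable
    parameter tuned by cross-validation (an illustrative stand-in)."""
    means = {c: X_train[y_train == c].mean(axis=0) * (1.0 - shrink)
             for c in np.unique(y_train)}
    return np.array([min(means, key=lambda c: np.linalg.norm(x - means[c]))
                     for x in X_eval])

# Simulated two-class data (5 features, shifted means).
rng = np.random.default_rng(0)
n = 120
X = np.vstack([rng.normal(0.0, 1.0, (n // 2, 5)),
               rng.normal(1.0, 1.0, (n // 2, 5))])
y = np.repeat([0, 1], n // 2)
perm = rng.permutation(n)
X, y = X[perm], y[perm]

# 1) Hold out a separate test set -- the costly split whose size
#    trades statistical power against parameter selection.
X_dev, y_dev = X[:90], y[:90]
X_test, y_test = X[90:], y[90:]

# 2) Choose the parameter by 5-fold cross-validation on the dev set only.
best_shrink, best_acc = None, -1.0
for shrink in [0.0, 0.1, 0.5]:
    accs = [np.mean(nearest_mean_classify(X_dev[tr], y_dev[tr],
                                          X_dev[va], shrink) == y_dev[va])
            for tr, va in kfold_indices(len(y_dev), 5)]
    if np.mean(accs) > best_acc:
        best_shrink, best_acc = shrink, float(np.mean(accs))

# 3) Estimate generalization performance once on the untouched test set.
test_acc = float(np.mean(nearest_mean_classify(X_dev, y_dev, X_test,
                                               best_shrink) == y_test))
print("selected shrink:", best_shrink, "test accuracy:", round(test_acc, 2))
```

Because the test set in step 3 is never used for parameter selection, the accuracy estimate is unbiased, but with limited data the small test set has low statistical power; that is the trade-off the paper's cross-testing procedure is designed to improve.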

RIS

TY - JOUR

T1 - An Efficient Data Partitioning to Improve Classification Performance While Keeping Parameters Interpretable

AU - Korjus, Kristjan

AU - Hebart, Martin N

AU - Vicente, Raul

PY - 2016

Y1 - 2016

N2 - Supervised machine learning methods typically require splitting data into multiple chunks for training, validating, and finally testing classifiers. For finding the best parameters of a classifier, training and validation are usually carried out with cross-validation. This is followed by application of the classifier with optimized parameters to a separate test set for estimating the classifier's generalization performance. With limited data, this separation of test data creates a difficult trade-off between having more statistical power in estimating generalization performance versus choosing better parameters and fitting a better model. We propose a novel approach that we term "cross-validation and cross-testing", improving this trade-off by re-using test data without biasing classifier performance. The novel approach is validated using simulated data and electrophysiological recordings in humans and rodents. The results demonstrate that the approach has a higher probability of discovering significant results than the standard approach of cross-validation and testing, while maintaining the nominal alpha level. In contrast to nested cross-validation, which is maximally efficient in re-using data, the proposed approach additionally maintains the interpretability of individual parameters. Taken together, we suggest an addition to currently used machine learning approaches which may be particularly useful in cases where model weights do not require interpretation, but parameters do.

AB - Supervised machine learning methods typically require splitting data into multiple chunks for training, validating, and finally testing classifiers. For finding the best parameters of a classifier, training and validation are usually carried out with cross-validation. This is followed by application of the classifier with optimized parameters to a separate test set for estimating the classifier's generalization performance. With limited data, this separation of test data creates a difficult trade-off between having more statistical power in estimating generalization performance versus choosing better parameters and fitting a better model. We propose a novel approach that we term "cross-validation and cross-testing", improving this trade-off by re-using test data without biasing classifier performance. The novel approach is validated using simulated data and electrophysiological recordings in humans and rodents. The results demonstrate that the approach has a higher probability of discovering significant results than the standard approach of cross-validation and testing, while maintaining the nominal alpha level. In contrast to nested cross-validation, which is maximally efficient in re-using data, the proposed approach additionally maintains the interpretability of individual parameters. Taken together, we suggest an addition to currently used machine learning approaches which may be particularly useful in cases where model weights do not require interpretation, but parameters do.

KW - Journal Article

U2 - 10.1371/journal.pone.0161788

DO - 10.1371/journal.pone.0161788

M3 - SCORING: Journal article

C2 - 27564393

VL - 11

SP - e0161788

JO - PLOS ONE

JF - PLOS ONE

SN - 1932-6203

IS - 8

ER -