Superior Cross-Validation, Ensemble Generation And Big Data Partitioning


To develop predictive models with machine learning effectively, it is often necessary to partition data. Sometimes the data is partitioned to facilitate testing, and sometimes partitioning is required to analyze very large data volumes. The authors introduce a novel technique for both types of partitioning, rooted in Latin Square experimental design theory. It allows analysts to obtain new measures of the uncertainty surrounding record-level predictions, provides new forms of automatic ensemble creation, introduces a new strategy for deliberately overfitting the models that participate in an ensemble (with the overfitting eliminated by ensemble averaging), and supports partitioning very large databases into optimally overlapping subsamples. The partitioning plans also apply to partitioning data by columns rather than rows; thus, one might partition data into many thousands of subsets of overlapping predictors while simultaneously partitioning the data by rows. The plans are generated by a straightforward recursive algorithm that can be applied to data of any scale, from a simple 7-fold variation of cross-validation to partitioning schemes involving hundreds of millions of parts.

For K-fold cross-validation, the most obvious novelty is that multiple parts are left out for testing in every fold, instead of the classical “leave out just one part”. Each part is also left out for testing in multiple folds, so every record of data receives multiple “test” predictions, which supports a measure of the prediction variance. Examples of several variations of the new scheme applied to real data are presented.
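
To make the idea of overlapping test parts concrete, here is a minimal sketch in Python. It is not the authors' recursive Latin Square algorithm, which the abstract does not spell out. As a stand-in it assumes a simple cyclic plan on 7 parts: fold i holds out parts i, i+1 and i+3 (mod 7), so every part is tested in exactly 3 folds and every pair of parts is held out together exactly once. The function names, the offsets (0, 1, 3), the one-variable least-squares model and the synthetic data are all illustrative assumptions.

```python
import random
from statistics import mean, pvariance


def cyclic_holdout_plan(n_parts=7, offsets=(0, 1, 3)):
    """Assumed plan: fold i holds out parts (i + d) % n_parts for d in offsets."""
    return [sorted((i + d) % n_parts for d in offsets) for i in range(n_parts)]


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x (toy stand-in model)."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return my - b * mx, b


def overlapping_cv(xs, ys, n_parts=7, offsets=(0, 1, 3), seed=0):
    """Return the plan and, per record, the list of its held-out predictions."""
    rng = random.Random(seed)
    # Assign records to equally sized parts at random.
    part_of = [i % n_parts for i in range(len(xs))]
    rng.shuffle(part_of)

    plan = cyclic_holdout_plan(n_parts, offsets)
    preds = [[] for _ in xs]
    for held_out in plan:
        held = set(held_out)
        train = [i for i in range(len(xs)) if part_of[i] not in held]
        a, b = fit_line([xs[i] for i in train], [ys[i] for i in train])
        # Score only the records whose part is held out in this fold.
        for i in range(len(xs)):
            if part_of[i] in held:
                preds[i].append(a + b * xs[i])
    return plan, preds


# Synthetic data, for illustration only.
data_rng = random.Random(1)
xs = [data_rng.uniform(0, 10) for _ in range(70)]
ys = [2.0 + 0.5 * x + data_rng.gauss(0, 1) for x in xs]

plan, preds = overlapping_cv(xs, ys)
pred_mean = [mean(p) for p in preds]      # ensemble-style average per record
pred_var = [pvariance(p) for p in preds]  # spread of the test predictions
```

Because each record is scored by three models that never saw it, `pred_var` gives a per-record spread that classical K-fold, with one test prediction per record, cannot provide. Other offsets or part counts would give different overlap patterns; the balance properties stated above are specific to this choice.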



Open Data Science




