Abstract: Incomplete data is a long-standing, pathological problem across science and engineering, and it arises for a variety of reasons. It hampers reliable data-driven research and trustworthy critical decision-making. The scientific and engineering research communities, as well as data science, lack data-curing tools designed for large incomplete data, while existing methods and theories often require complex distributional assumptions or difficult interventions by statistical experts. Naïve data-curing methods therefore remain widely used in data science and machine learning.
To meet this daunting challenge in the emerging era of machine learning and data, our team combined the theory of fractional hot-deck imputation (FHDI), computational statistics, and parallel computing to cure ultra incomplete data (i.e., concurrently big-n and big-p), which has a tremendous number of instances and high dimensionality. The ultra data-oriented FHDI is named UP-FHDI. The parallel program and source code of UP-FHDI are publicly available to benefit broader audiences in science, engineering, and data science. Measuring the uncertainty of the cured data is another important issue. In lieu of the computationally expensive parallel Jackknife method, UP-FHDI assesses uncertainty with a computationally efficient parallel linearization technique. Results confirm that UP-FHDI can handle diverse ultra data with up to millions of instances and more than 10,000 variables. Scale-up now depends on the amount of memory and computing power available. We also show that UP-FHDI has a positive impact on the prediction performance of subsequent deep learning. We believe that this achievement will catalyze large/big data-driven science and engineering, where large incomplete data pose a daunting challenge to advanced machine learning and statistical prediction.
Bio: In Ho Cho is an associate professor in the CCEE department at Iowa State University.
He received his PhD from the California Institute of Technology with a focus on computational science and engineering. His major research areas cover data-driven science and engineering, forging a technological convergence of computational statistics, machine learning, engineering principles, and physics. One of his ongoing projects focuses on curing large/big incomplete data for a broad range of engineers and scientists, without the barriers of complex assumptions or statistical expertise. His group seeks to answer how to easily, efficiently, and accurately cure formidably large and complex missing data to best catalyze subsequent statistical inference and machine learning. An open-source R package is available via CRAN, and a parallel computing version of the program for ultra data is also publicly shared. His research is supported by several awards from the National Science Foundation.