InfoCoBuild

Divide and Recombine for the Analysis of Big Data

Divide and Recombine for the Analysis of Big Data by William S. Cleveland - Machine Learning Summer School at Purdue, 2011. Divide and Recombine (D&R) consists of the general approach of parallelizing big data, statistical methods for division and recombination, sampling and display methods for visualization of samples of subsets, computational methods, and computational environments.

In D&R, the data are divided into structured subsets, general analysis methods are applied to each subset, and the results of the analyses are recombined. The necessary steps of data division and recombination open up an exciting area of research in statistical theory and methods, and there are already a number of very useful results. The steps also open up research in computational methods and hardware-software environments, and here, too, there are important results.
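The divide, apply, and recombine steps described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the method presented in the lectures: the round-robin division rule, the choice of a straight-line fit as the per-subset analysis, the averaging of coefficients as the recombination, and the names `divide`, `fit_line`, and `recombine` are all hypothetical. The data are synthetic and noise-free.

```python
from statistics import fmean


def divide(records, n_subsets):
    """Split records into n_subsets by round-robin assignment (an assumed division rule)."""
    return [records[i::n_subsets] for i in range(n_subsets)]


def fit_line(subset):
    """Apply one analysis method to one subset: least squares for y = a + b*x."""
    xs = [x for x, _ in subset]
    ys = [y for _, y in subset]
    x_bar, y_bar = fmean(xs), fmean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in subset)
    slope = sxy / sxx
    return (y_bar - slope * x_bar, slope)


def recombine(estimates):
    """Recombine per-subset fits by averaging coefficients (one possible choice)."""
    intercepts, slopes = zip(*estimates)
    return (fmean(intercepts), fmean(slopes))


# Synthetic, noise-free data lying on the line y = 2 + 3x.
records = [(float(x), 2.0 + 3.0 * x) for x in range(100)]

subsets = divide(records, n_subsets=4)
dr_estimate = recombine([fit_line(s) for s in subsets])
print(dr_estimate)  # (intercept, slope), close to the generating values 2 and 3
```

With noisy data the averaged estimate generally differs from the fit to the full data set, which is one reason the choice of division and recombination methods is treated as a statistical research question.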

By introducing exploitable parallelization of the data, D&R makes it possible to apply almost any existing analysis method from statistics, machine learning, and visualization to big data. This enables detailed, comprehensive analysis of big data at all stages of the analysis process, starting with the raw data. It includes visualization of the detailed data at all stages, not just of reduced data such as summary statistics, results of dimension reduction methods, fitted models, and the output of algorithms applied to the detailed data. Visualization at all stages substantially reduces the chances of losing critical information in the data.
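Because each subset is analyzed independently, the per-subset work can be handed to parallel workers. The following is a minimal sketch, assuming a thread pool from Python's standard library stands in for the computational environment; a real big-data setting would more likely use separate processes or a cluster. The function names `summarize` and `recombine` and the fixed subset size of 250 are hypothetical choices made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
from statistics import fmean


def summarize(subset):
    """Per-subset analysis: record the subset size and its mean."""
    return {"n": len(subset), "mean": fmean(subset)}


def recombine(summaries):
    """Recombine subset means into an overall mean, weighting by subset size."""
    total = sum(s["n"] for s in summaries)
    return sum(s["n"] * s["mean"] for s in summaries) / total


# Synthetic data: the values 1.0 through 1000.0, divided into four equal subsets.
values = [float(v) for v in range(1, 1001)]
subsets = [values[i:i + 250] for i in range(0, len(values), 250)]

# Each subset is analyzed by its own worker; results come back in subset order.
with ThreadPoolExecutor(max_workers=4) as pool:
    summaries = list(pool.map(summarize, subsets))

overall_mean = recombine(summaries)
print(overall_mean)
```

For a statistic such as the mean, the size-weighted recombination reproduces the value computed from all the data at once. Many other analysis methods recombine only approximately.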

Lecture 1 - D&R for the Analysis of Big Data (Part 1)
Lecture 2 - D&R for the Analysis of Big Data (Part 2)
Lecture 3 - D&R for the Analysis of Big Data (Part 3)
Lecture 4 - D&R for the Analysis of Big Data (Part 4)
Lecture 5 - D&R for the Analysis of Big Data (Part 5)
Lecture 6 - D&R for the Analysis of Big Data (Part 6)
Lecture 7 - D&R for the Analysis of Big Data (Part 7)
Lecture 8 - D&R for the Analysis of Big Data (Part 8)


Machine Learning Summer School at Purdue, 2011
A Machine Learning Approach for Complex Information Retrieval Applications
A Short Course on Reinforcement Learning
Classic and Modern Data Clustering
Divide and Recombine for the Analysis of Big Data
Graphical Models for the Internet
Introduction to Machine Learning
Large-Scale Machine Learning and Stochastic Algorithms
Machine Learning for a Rainy Day
Machine Learning for Discovery in Legal Cases
Machine Learning for Statistical Genetics
Mining Heterogeneous Information Networks
Modeling Complex Social Networks
Optimization for Machine Learning
Privacy Issues with Machine Learning: Fears, Facts, and Opportunities
Survey of Boosting from an Optimization Perspective
The MASH Project