Large data sets make it difficult to get started on a data analysis project, especially when the preliminary steps are performed on a local machine. If the data is too large to store or process all at once, it must be subdivided and processed either in parallel, across a network of separate processors, or sequentially, one piece at a time.
For distributed data, algorithms must be adapted to work on just a piece of the data, to save the state of the model, and to provide information that can be passed on to a controller or to another instance that is working on a different piece of the data. Ultimately, the information generated from the different pieces must be combined in a way that makes sense and leads to something useful. Some algorithms are well suited to this type of parallelization and, with very little effort, produce exact results, meaning that the combined output is equivalent to the result that would have been obtained had the entire data set been processed at once. For other algorithms, approximate results are easier to implement and are still very useful, especially in preliminary work.
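To make the idea of an exact result concrete, here is a minimal sketch using the mean and variance, which are among the simplest statistics that can be combined without loss. Each chunk is reduced to a small summary (count, mean, and sum of squared deviations), and two summaries are merged with the standard pairwise update formula. The names `ChunkSummary`, `summarize`, and `combine` are made up for this illustration and do not come from any particular library.

```python
from dataclasses import dataclass


@dataclass
class ChunkSummary:
    """State that one worker saves and passes on for its piece of the data."""
    n: int = 0
    mean: float = 0.0
    m2: float = 0.0  # sum of squared deviations from the mean


def summarize(chunk):
    """Reduce one chunk of numbers to its summary."""
    n = len(chunk)
    if n == 0:
        return ChunkSummary()
    mean = sum(chunk) / n
    m2 = sum((x - mean) ** 2 for x in chunk)
    return ChunkSummary(n, mean, m2)


def combine(a, b):
    """Merge two summaries; equivalent to summarizing both chunks together."""
    n = a.n + b.n
    if n == 0:
        return ChunkSummary()
    delta = b.mean - a.mean
    mean = a.mean + delta * b.n / n
    m2 = a.m2 + b.m2 + delta ** 2 * a.n * b.n / n
    return ChunkSummary(n, mean, m2)


def variance(summary):
    """Population variance recovered from a summary."""
    return summary.m2 / summary.n
```

A controller would start from an empty `ChunkSummary()` and fold in each worker's summary with `combine`. Up to floating-point rounding, the final mean and variance match what a single pass over the full data would give, which is what makes this an exact method rather than an approximate one.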
At this year’s Conference on Statistical Practice (CSP 2018), I will be presenting an example of an algorithm adapted to distributed data that yields approximate results. The talk, entitled “Stochastic Gradient Boosting for Distributed Data,” discusses an approximate method that is applied sequentially, but to separate chunks of the data. This allows practitioners to get a relatively quick assessment of how well the method (stochastic gradient boosting) performs and of the quality of the data for a given problem. This approach, and others like it, helps to evaluate the suitability of the data and the feasibility of the analysis before investing in a larger-scale solution. On big data platforms, where it is much more difficult to detect and diagnose data issues, model failures are extremely costly and sometimes embarrassing, as described in the Fortune article “When Big Data goes bad.”
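The chunk-by-chunk idea can be sketched in a few lines of plain Python. The sketch below is only an illustration of the general approach, not the method presented in the talk. It assumes a single numeric feature, squared-error loss, one-split regression stumps as base learners, and non-empty chunks. All function names and default values (`rounds`, `subsample`, `rate`) are hypothetical. Each new chunk is first scored with the current model and then used to add more stumps, so the list of scores gives a running assessment of the fit.

```python
import random


def fit_stump(xs, residuals):
    """Least-squares regression stump on one feature: (threshold, left, right)."""
    n = len(xs)
    order = sorted(range(n), key=lambda i: xs[i])
    total = sum(residuals)
    best = None
    left_sum = 0.0
    for k in range(1, n):
        prev, cur = order[k - 1], order[k]
        left_sum += residuals[prev]
        if xs[prev] == xs[cur]:
            continue  # cannot split between identical feature values
        right_sum = total - left_sum
        # Maximizing this quantity minimizes the squared error of the split.
        score = left_sum ** 2 / k + right_sum ** 2 / (n - k)
        if best is None or score > best[0]:
            threshold = (xs[prev] + xs[cur]) / 2
            best = (score, threshold, left_sum / k, right_sum / (n - k))
    if best is None:  # no valid split: use the mean residual on both sides
        mean = total / n
        return (float("inf"), mean, mean)
    return best[1:]


def predict(model, x):
    """Initial constant plus the shrunken contribution of every stump."""
    out = model["init"]
    for threshold, left, right in model["stumps"]:
        out += model["rate"] * (left if x <= threshold else right)
    return out


def mse(model, xs, ys):
    return sum((y - predict(model, x)) ** 2 for x, y in zip(xs, ys)) / len(xs)


def boost_on_chunk(model, xs, ys, rounds, subsample, rng):
    """Add `rounds` stumps, each fit to residuals on a random subsample."""
    preds = [predict(model, x) for x in xs]
    size = min(len(xs), max(1, int(subsample * len(xs))))
    for _ in range(rounds):
        idx = rng.sample(range(len(xs)), size)  # the "stochastic" part
        stump = fit_stump([xs[i] for i in idx], [ys[i] - preds[i] for i in idx])
        model["stumps"].append(stump)
        threshold, left, right = stump
        for i, x in enumerate(xs):
            preds[i] += model["rate"] * (left if x <= threshold else right)


def fit_sequentially(chunks, rounds=50, subsample=0.5, rate=0.1, seed=0):
    """Visit chunks in order; score each new chunk before training on it."""
    rng = random.Random(seed)
    model, scores = None, []
    for xs, ys in chunks:
        if model is None:
            model = {"init": sum(ys) / len(ys), "stumps": [], "rate": rate}
        else:
            scores.append(mse(model, xs, ys))
        boost_on_chunk(model, xs, ys, rounds, subsample, rng)
    return model, scores
```

Only one chunk needs to be in memory at a time, and the saved state is just the list of stumps. The result is approximate because each stump sees only the chunk it was trained on, so the final model generally differs from one fit to the full data at once.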
This year’s Conference on Statistical Practice will be held February 15-17, 2018, in Portland, Oregon. The purpose of the conference is to bring together statisticians and data scientists from around the world who work daily to solve real-life problems in business and industry using statistical methods and data analysis. The talks, tutorials, and demonstrations cover many aspects of data science, from data wrangling (cleaning, summarizing, transforming) to the latest tools, techniques, and perspectives in machine learning.