The latest release of the IMSL Numerical Libraries for Java (JMSL) brings a powerful new technique for classification and regression problems: stochastic gradient boosting. This relatively new technique combines many weak learners into a single strong, iterative learner to improve the accuracy and robustness of regression models, an important consideration for big data and Hadoop-based applications.
In general, boosting is a type of ensemble method that combines a set of base learners, typically shallow decision trees, to generate predictions. With gradient boosting, the trees are not independent: they are built in sequence, with each stage chosen to reduce the remaining prediction error. Essentially, each successive tree is fit to the residuals of the current ensemble, using the best partitioning of the data.
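The residual-fitting loop described above can be sketched in plain Java. This is an illustrative toy, not the JMSL API: the class name `BoostingSketch`, the depth-1 regression stump, and the least-squares setting are all assumptions chosen to keep the example minimal.

```java
/** Minimal sketch of gradient boosting for least-squares regression.
 *  Weak learner: a depth-1 regression stump. Illustrative only,
 *  not the JMSL implementation. */
public class BoostingSketch {

    /** A stump: split on x <= threshold, predict leftValue or rightValue. */
    static final class Stump {
        double threshold, leftValue, rightValue;
        double predict(double x) { return x <= threshold ? leftValue : rightValue; }
    }

    /** Fit a stump to (x, target) by scanning every observed value as a threshold. */
    static Stump fitStump(double[] x, double[] target) {
        Stump best = new Stump();
        double bestErr = Double.POSITIVE_INFINITY;
        for (double t : x) {
            double sumL = 0, sumR = 0;
            int nL = 0, nR = 0;
            for (int i = 0; i < x.length; i++) {
                if (x[i] <= t) { sumL += target[i]; nL++; } else { sumR += target[i]; nR++; }
            }
            double vL = nL > 0 ? sumL / nL : 0, vR = nR > 0 ? sumR / nR : 0;
            double err = 0;
            for (int i = 0; i < x.length; i++) {
                double p = x[i] <= t ? vL : vR;
                err += (target[i] - p) * (target[i] - p);
            }
            if (err < bestErr) {
                bestErr = err;
                best.threshold = t; best.leftValue = vL; best.rightValue = vR;
            }
        }
        return best;
    }

    /** Boosting loop: each stump is fit to the residuals of the current ensemble. */
    static Stump[] boost(double[] x, double[] y, int rounds, double learningRate) {
        Stump[] model = new Stump[rounds];
        double[] pred = new double[y.length];          // current ensemble prediction
        for (int m = 0; m < rounds; m++) {
            double[] residual = new double[y.length];  // what the ensemble has not yet explained
            for (int i = 0; i < y.length; i++) residual[i] = y[i] - pred[i];
            Stump s = fitStump(x, residual);
            for (int i = 0; i < y.length; i++) pred[i] += learningRate * s.predict(x[i]);
            model[m] = s;
        }
        return model;
    }

    /** Ensemble prediction: the shrunken sum of all stump outputs. */
    static double predict(Stump[] model, double learningRate, double x) {
        double p = 0;
        for (Stump s : model) p += learningRate * s.predict(x);
        return p;
    }

    public static void main(String[] args) {
        double[] x = {1, 2, 3, 4, 5, 6, 7, 8};
        double[] y = {1.1, 1.9, 3.2, 3.8, 5.1, 6.2, 6.8, 8.1};  // roughly y = x
        Stump[] model = boost(x, y, 200, 0.1);
        System.out.printf("f(2)=%.2f f(7)=%.2f%n",
                predict(model, 0.1, 2), predict(model, 0.1, 7));
    }
}
```

The learning rate (0.1 here) shrinks each tree's contribution, so no single stump dominates and many small corrections accumulate into the final model.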
A common issue with this type of iterative algorithm is "overfitting": because there is no obvious stopping point, later iterations begin to fit noise in the training data rather than signal, which hurts the model's validity on new data. Stochastic sampling was introduced to mitigate this problem.
With stochastic gradient boosting, each successive decision tree is built on an independently drawn random subsample of the training data, rather than on the full sample. Because each tree sees a different random subset of observations, the ensemble is less prone to overfitting and typically achieves greater accuracy.
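The per-iteration subsampling step can be isolated as a small routine: each boosting round draws a fresh set of distinct observation indices and trains that round's tree only on those rows. This is a hypothetical sketch (the class name `Subsample` and the half-sample fraction are assumptions, not JMSL parameters); a partial Fisher-Yates shuffle gives a sample without replacement in one pass.

```java
import java.util.Arrays;
import java.util.Random;

/** Sketch of the per-round subsampling in stochastic gradient boosting:
 *  each iteration trains on a random fraction of the observations,
 *  drawn without replacement. Illustrative only, not the JMSL API. */
public class Subsample {

    /** Return round(sampleFraction * n) distinct indices in [0, n)
     *  using a partial Fisher-Yates shuffle. */
    static int[] drawIndices(int n, double sampleFraction, Random rng) {
        int k = (int) Math.round(sampleFraction * n);
        int[] idx = new int[n];
        for (int i = 0; i < n; i++) idx[i] = i;
        for (int i = 0; i < k; i++) {          // shuffle only the first k slots
            int j = i + rng.nextInt(n - i);
            int tmp = idx[i]; idx[i] = idx[j]; idx[j] = tmp;
        }
        return Arrays.copyOf(idx, k);
    }

    public static void main(String[] args) {
        Random rng = new Random(42);
        // In a boosting loop, each round would see a different half of the data:
        for (int round = 0; round < 3; round++) {
            System.out.println(Arrays.toString(drawIndices(10, 0.5, rng)));
        }
    }
}
```

Inside the boosting loop, the residuals and tree fit from the previous example would simply be restricted to these sampled indices each round.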
As part of JMSL 7.2, stochastic gradient boosting adds an impressive capability to the only commercially available set of predictive analytics solutions for Java.