In part 1 of the blog series, the new support vector machines algorithm in the JMSL Library was presented along with a straightforward linear example. The methodology described is also applicable to more advanced data sets, which we will cover in this blog.
Example two: Non-linearly separable data
Example one provides a starting point for writing applications using the JMSL SVM classification algorithm. However, it is admittedly trivial. Most clustering and classification algorithms should have no problem with easily grouped data. The next example that typically appears in SVM introductions moves on to a less trivial case where the data classes are not linearly separable. In these cases, traditional algorithms like K-Means clustering or K-Nearest Neighbors do not do a very good job. Consider the data set pictured below:
As with the first example, there are sixteen total points, eight classified as 0 and eight as 1, but this time there is no single simple line to be drawn to separate the domains. The source code and analysis steps are very similar to the first example, but with different input data and a different kernel function. To begin, this data set is defined as:
The same array of variable types is used and the analysis steps are exactly the same. However, because prior knowledge led to the choice of a linear kernel in the first example, the kernel needs to change here. The default radial basis kernel actually does a fine job of properly classifying the training set, but there is additional knowledge that is advantageous here. One could easily draw a circle that encompasses all of the blue points in the graph, and so a second-order polynomial kernel is an appropriate choice. As it is easy to try different kernels, confirming that the radial basis default works well enough is left to the reader as an exercise. To set the desired circular kernel, the PolynomialKernel class is used, but the default arguments are insufficient as they will generate a first-order polynomial (i.e., a line). Therefore the constructor that allows setting the parameters is used:
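Independent of the JMSL class, the effect of the polynomial order can be illustrated with a small plain-Java sketch. It assumes the common kernel form k(x, y) = (a·⟨x, y⟩ + b)^d; the class name, method name, and the parameter names a, b, and d are illustrative assumptions and are not taken from the JMSL API.

```java
/** Library-independent sketch of a polynomial kernel; this is not the JMSL PolynomialKernel class. */
final class PolyKernelSketch {

    /** k(x, y) = (a * <x, y> + b)^d; the names a, b and d are illustrative assumptions. */
    static double polynomialKernel(double[] x, double[] y, double a, double b, int d) {
        if (x.length != y.length) {
            throw new IllegalArgumentException("vectors must have the same length");
        }
        double dot = 0.0;
        for (int i = 0; i < x.length; i++) {
            dot += x[i] * y[i];
        }
        return Math.pow(a * dot + b, d);
    }

    public static void main(String[] args) {
        double[] p = {1.0, 2.0};
        double[] q = {3.0, 4.0};
        // <p, q> = 1*3 + 2*4 = 11
        System.out.println(polynomialKernel(p, q, 1.0, 1.0, 1)); // first order: 11 + 1 = 12.0
        System.out.println(polynomialKernel(p, q, 1.0, 1.0, 2)); // second order: 12^2 = 144.0
    }
}
```

With d = 1 the kernel is linear in the coordinates, so the resulting boundary is a line. With d = 2, expanding the square introduces squared and cross terms of the coordinates, which is what allows a circular boundary in the original x,y plane.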
Following the rest of the steps discussed in part one to validate classification of the training data again results in no misclassifications. As further validation and a test of the prediction capabilities of the trained model, a set of sixteen points along the main diagonal can be classified with the following result:
Complex examples: Good and bad
Above and in part one of the blog, two textbook examples were presented to help understand how the JMSL SVM classes can be used to classify data sets. In those examples, the data were simple x,y pairs classified as either 0 or 1 and grouped in a way that a line or a circle easily delineates the two classes. In the real world, data are rarely so clean or so easily understood. Here, the JMSL SVM algorithm is applied to two of the data sets from the UCI Machine Learning Repository with mixed results. There are numerous data sets available, most from real-world academic studies. The selection criteria for this analysis were to find sets for classification (many are for regression analysis, which is also possible with JMSL and the SVRegression class, but not the focus here) that are small enough to be hard-coded without being unwieldy and that have more than three attributes. The methodology is straightforward: load the data, randomly select about 10 percent of it (using the JMSL RandomSamples class) as a test set with the remainder used for training, evaluate errors in the training set, and then evaluate errors in the test set. The JMSL CrossValidation class could be used to automate this type of random sampling and testing, repeating the work multiple times, resampling with replacement, and obtaining average prediction errors.
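The hold-out step of that methodology can be sketched in plain Java. This is a minimal illustration using only java.util, not the JMSL RandomSamples class; the class name, method name, and seed are assumptions made for the example. The value 250 matches the size of the first data set discussed below.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

/** Plain-Java sketch of a random hold-out split; it does not use the JMSL RandomSamples class. */
final class HoldoutSplitSketch {

    /** Returns a two-element list: training indices first, then test indices. */
    static List<List<Integer>> split(int n, double testFraction, long seed) {
        List<Integer> indices = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            indices.add(i);
        }
        // Shuffle with a fixed seed so the split is reproducible.
        Collections.shuffle(indices, new Random(seed));
        int testSize = (int) Math.round(n * testFraction);
        List<Integer> test = new ArrayList<>(indices.subList(0, testSize));
        List<Integer> train = new ArrayList<>(indices.subList(testSize, n));
        List<List<Integer>> result = new ArrayList<>();
        result.add(train);
        result.add(test);
        return result;
    }

    public static void main(String[] args) {
        List<List<Integer>> parts = split(250, 0.10, 42L);
        System.out.println("training size: " + parts.get(0).size()); // 225
        System.out.println("test size: " + parts.get(1).size());     // 25
    }
}
```

The returned indices would then be used to pick rows out of the full data array, so that no observation appears in both the training and the test set.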
The Qualitative Bankruptcy data contains 250 observations with six categorical inputs and one classification result. The idea is very much a real-world problem. If there is historical data on how companies are rated on various risk factors, along with the known bankruptcy status of each one, is it possible to predict the bankruptcy status accurately given a new set of inputs? For this data, each observation includes a ranking (positive, average, or negative) across six qualities: industrial risk, management risk, financial flexibility, credibility, competitiveness, and operating risk, along with a flag indicating bankruptcy or not. The raw data are text values (P, A, and N for ranks, B and NB for bankruptcy or not), which were converted to categorical integer values. Of the 250 data points, 25 randomly selected points were set aside for testing, leaving 225 for training the model. The output of validating the training set and the test data is:
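As an aside on the data preparation step, the conversion of text ranks to categorical integers could be sketched in plain Java as follows. The integer codes (P=1, A=2, N=3, NB=0, B=1), the comma-separated record layout, and the sample record are all assumptions made for illustration; the codes actually used in the analysis are not stated here.

```java
import java.util.Arrays;

/** Sketch of encoding the text ranks as integers; the codes chosen here are hypothetical. */
final class RankEncodingSketch {

    /** Maps a rank letter to an assumed integer code: P=1, A=2, N=3. */
    static int rankCode(String rank) {
        switch (rank) {
            case "P": return 1;
            case "A": return 2;
            case "N": return 3;
            default: throw new IllegalArgumentException("unknown rank: " + rank);
        }
    }

    /** Maps the class label to an assumed integer code: NB=0, B=1. */
    static int classCode(String label) {
        switch (label) {
            case "NB": return 0;
            case "B": return 1;
            default: throw new IllegalArgumentException("unknown class: " + label);
        }
    }

    /** Encodes one record of six ranks followed by the class label. */
    static int[] encode(String record) {
        String[] fields = record.trim().split(",");
        if (fields.length != 7) {
            throw new IllegalArgumentException("expected 7 fields, got " + fields.length);
        }
        int[] codes = new int[7];
        for (int i = 0; i < 6; i++) {
            codes[i] = rankCode(fields[i].trim());
        }
        codes[6] = classCode(fields[6].trim());
        return codes;
    }

    public static void main(String[] args) {
        // Illustrative record only; not copied from the data set.
        System.out.println(Arrays.toString(encode("N,A,P,A,N,P,B"))); // [3, 2, 1, 2, 3, 1, 1]
    }
}
```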
The JMSL SVM algorithm classifies this data perfectly. This result isn’t unique: several other small data sets from the same repository were tested with similarly good results. While one should regard such results with suspicion, there are several reasons why very good output may be observed. First, the data categories may fundamentally capture the dynamics of the classes (in this case, the measures of risk are likely very good indicators of bankruptcy). Second, much of the data available in such repositories is not raw data but has already been cleaned and pre-processed, reducing the noise found in true raw data. While data scrubbing is an important part of any analysis, data scientists must be mindful of confirmation bias in what they include or exclude. Finally, even though this data can be classified without error, support vector machine algorithms provide no additional insight, such as which characteristics are most important.
Consider the Haberman’s Survival data set, which includes data on five-year survival rates after breast cancer surgery. Given the patient age, year of operation, and number of positive axillary nodes detected, is it possible to predict whether the patient survived at least five years? The output of the validation steps is as follows:
There were some errors in classification of the training data set. While this is expected for real-world data, an error rate of nearly 33 percent (24 of 73) for one class is worrisome. When the model trained on 276 observations is applied to the 30 randomly selected test data points, all eight of the test cases where the patient died within five years were misclassified. The primary conclusion from such results is that, without additional tweaking of the model or the use of different methodologies, the input data are not sufficient to predict the outcome. Using cross-validation techniques can help select the best-performing model, but if important predictor variables are absent from the input data, there simply isn’t much a model can do to make up for it.
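The per-class error rates quoted above come from comparing known and predicted labels class by class. A minimal plain-Java sketch of that tally follows; it is not part of JMSL, the class and method names are assumptions, and the label arrays are toy values rather than the Haberman data.

```java
import java.util.Locale;

/** Plain-Java sketch for tallying misclassifications per class; not part of JMSL. */
final class ClassErrorSketch {

    /** Returns counts[c] = {observations whose actual class is c, how many of those were misclassified}. */
    static int[][] errorsByClass(int[] actual, int[] predicted, int numClasses) {
        if (actual.length != predicted.length) {
            throw new IllegalArgumentException("arrays must have the same length");
        }
        int[][] counts = new int[numClasses][2];
        for (int i = 0; i < actual.length; i++) {
            counts[actual[i]][0]++;
            if (predicted[i] != actual[i]) {
                counts[actual[i]][1]++;
            }
        }
        return counts;
    }

    /** Fraction of a class that was misclassified; 0 when the class is empty. */
    static double rate(int errors, int total) {
        return total == 0 ? 0.0 : (double) errors / total;
    }

    public static void main(String[] args) {
        // Toy labels for illustration only; these are not the Haberman data.
        int[] actual = {0, 0, 0, 1, 1, 1};
        int[] predicted = {0, 0, 0, 0, 1, 0};
        int[][] counts = errorsByClass(actual, predicted, 2);
        for (int c = 0; c < counts.length; c++) {
            System.out.printf(Locale.ROOT, "class %d: %d of %d misclassified%n",
                    c, counts[c][1], counts[c][0]);
        }
        // The training figure quoted in the text: 24 of 73.
        System.out.printf(Locale.ROOT, "%.1f%%%n", 100.0 * rate(24, 73)); // 32.9%
    }
}
```

Looking at each class separately matters here: an overall error rate can look acceptable while one class, such as the patients who died within five years, is predicted poorly.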
Support vector machines can be a robust tool for classification problems, complementing other data mining techniques. The JMSL Library implements a flexible programming interface for users to explore this modern technique, as shown through the examples in this blog series. The IMSL Numerical Libraries have helped users solve numerical algorithm challenges for over 40 years. To discover more information on the IMSL algorithms discussed here, visit roguewave.com.
Access the complimentary full source code for the examples discussed in our support vector machines in JMSL blog series.