An implementation of support vector machines (SVM) is available in the JMSL Numerical Library as of release 7.3. The data mining functionality in JMSL continues to expand with the latest release, including areas such as decision trees and bootstrap aggregation. The documentation of the IMSL Libraries is detailed and robust, but the algorithm discussion and examples can only cover a finite set of use cases.
This series of blog posts walks through some additional examples with a focus on classification, starting with the textbook examples part of most SVM resources. Notes and key points are highlighted throughout to provide a complementary resource for users new to SVM or new to using the JMSL Library.
There are numerous resources available online and in books that cover SVM concepts in significant depth, well beyond what any blog post could accomplish. Some basic familiarity with the fundamentals of SVM will be helpful to understand the context here. No matter how great a library of software algorithms is, it cannot replace domain knowledge and experience.
This blog focuses on using SVM for classification use cases. However, the JMSL implementation also features algorithms for regressions and the “one class” problem. The example problems here will all use the SVClassification class as the primary tool, but readers are encouraged to investigate the SVRegression and SVOneClass classes as well.
Nearly every discussion about SVM classification starts with two rather basic examples: Linearly separable and non-linearly separable. Variations of the same two data sets appear repeatedly across different material from different authors. These two textbook examples will be presented as a means to get started with the JMSL SVM implementation using familiar data.
Text book example 1: Linearly separable data
Consider the following graph consisting of two sets of points, where any given x,y location is associated with a value of 0 or 1 in a prime example of a classification problem.
The key idea here in terms of classification is that given this set of known coordinates and 0 or 1 values, can one accurately predict whether any new x,y point should be a 0 or a 1? In this case, the data are linearly separable meaning that one can easily draw a line between the two populations and define a border between them. Any new point to be classified would simply require identification of which side of the boundary line the point is located. Where exactly to place the dividing line requires some effort and this is what the SVM algorithm accomplishes. The method essentially involves drawing two parallel lines (or hyperplanes as problems extend to higher dimensions) to define the margin between the two sets such that it is as large as possible, and then defining the line (or maximum-margin hyperplane) that lies midway between the first lines. Note that mathematically the placement of these hyperplanes in the linearly-separable case depend only on the points closest to the margin. It does not matter if there are three points behind the margin or three million, thus those points nearest to the boundary are called support vectors.
Mapping this problem to be solved using the JMSL SVM classes can be done with a handful of lines of Java code. Including validation results and classifying new data requires a few more lines. As this is the first, and most straightforward example, the required source will be traversed in sections, with succinct concepts for each section.
To begin, the data set is defined by eight points in each group set for a total of sixteen x,y pairs, each with a classification of either 0 or 1:
The JMSL class SVClassification will be used for the analysis and has a constructor signature of:
The first argument is the input data that is defined as x above. The algorithm needs to know which column contains the response variable, which is column 0 for this example. Additionally for each column, the variable type is required. The types can be one of Categorical, One Class, Ordered Discrete, Quantitative Continuous, or Ignore. (More details can be found in the PredictiveModel.VariableType enumeration documentation.) For this problem the response column is categorical while the x,y locations are continuous, leading to the following:
With the SVClassification object created, the next step is to set the kernel. A kernel helps map higher dimensional problems by transforming the data. As such, the default kernel for SVM in the JMSL Library is a radial basis function with one parameter. However, for this linear case, a linear kernel is the best choice. This is set easily using:
At this stage the model needs to be trained with the known data classes provided in the constructor, which is done with just one more line of code:
Although not a requirement, it’s a good idea to validate the model by confirming that the training data is properly classified. This is more important as the data becomes more complex. To accomplish this create a set of known values (below,
knownClass) and compare those with the values predicted by fitting the SVM model to the training data (below,
predict() method fits the training data while the
getClassErrors() method helps in the comparison:
From here, one can use the JMSL PrintMatrix class to output results. For a small data set simply printing the
fittedClass array is sufficient, but printing the
fittedClassErrors output array provides a summary table containing the number of errors and non-missing observations for each class and a sum for all the classes. The print methods and their respective output follows:
This output reveals that the model fits the training data perfectly, with zero errors across all sixteen observations.
The next test for this introductory example is to classify some additional data to confirm that the model works against unknown values not present in the training data. For this test sixteen more points are defined along the main diagonal of the domain and each are classified using the trained model. Plotting the results provides an indication of where the hyperplane lies that separates the two classes of data. Above, to classify the training data
svm.predict() was called with no arguments. To classify new data simply pass an array of x,y values to this method as shown below:
Plotting these predicted values as solid squares, colored with blue = 0 and red = 1, provides confidence that classification of additional points will be accurate. The boundary between the domains, which is linear due to the selected kernel, can be inferred easily from this test.
In this first part, the new support vector machines algorithm in the JMSL Library is presented along with a straightforward linear example. The methodology described is applicable to more advanced data sets, which will be covered in part 2.