See it for yourself: analytics in IMSL C# vs. R – Part I

Statistical analysis and desktop modeling are often performed with software such as the well-known R Project. Many statisticians are not programmers, and straightforward, higher-level languages such as R and Python lend themselves well to the iterative, try-test-repeat pattern of model development.

However, statistical analysis is sometimes required in a fully featured language framework such as C#.NET, for situations where you’re building a breadth of applications, integrating data analysis tasks with web applications or databases, or relying on high application performance. R is designed to make statistics easier, not to make the most efficient use of a computer. The options for analysis and modeling are more limited in these cases: they include implementing your own algorithms or kludging together calls to a desktop package’s API.

Recently, Microsoft published an article in MSDN Magazine that discusses the need for statistical analysis and introduces the R language as an option to perform these statistical duties. They were even nice enough to include a home-grown version of a popular statistical function. However, you don’t need to copy, write, or test your own code when there’s a more robust solution available that you can simply plug in. IMSL.NET Numerical Library (IMSL C#) provides a faster path to a complete solution in pure C#, including statistical algorithms.

Over the course of this two-part blog post, we’ll walk through the examples in the MSDN article and show how the same numerical results can be achieved in less time and with less complexity using the IMSL C# library.

Example #1 – Chi-squared test

The initial example in the MSDN article tests whether the rolls of a die are random. The chi-squared test is a solid choice for analyzing this data set and is available in IMSL C#. The input data are the counts of each face observed over 150 rolls of a die. Using the chi-squared test, we can determine statistically whether the observed results are in line with a fair die, meaning each outcome is equally likely. Here is the entire functional code in C# for this example:

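```csharp
// Observed counts for faces 1 through 6 (150 rolls in total)
double[] obs = { 20, 28, 12, 32, 22, 36 };
// "this" supplies the CDF; obs.Length - 1 cutpoints; 0 estimated parameters
ChiSquaredTest cs = new ChiSquaredTest(this, obs.Length - 1, 0);
cs.SetRange(0, 6); // min is exclusive, max is inclusive
cs.Update(new double[] { 1, 2, 3, 4, 5, 6 }, obs);
```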

Note the `this` in the `ChiSquaredTest` constructor. The main class implements the IMSL `ICdfFunction` interface to specify the cumulative distribution function (CDF) used in the test. For a fair die, the appropriate CDF is the discrete uniform, which is also available in IMSL (among many other functions). Implementing the interface requires adding a method called `CdfFunction` that accepts, as a double, the location at which to evaluate the CDF and returns the result as a double. A little additional code is needed to convert the input to an integer and to return 0.0 for any input below 1. The full method is:

```csharp
public double CdfFunction(double x)
{
    int xi = (int)Math.Floor(x);
    if (xi > 0)
    {
        return Cdf.DiscreteUniform(xi, 6);
    }
    else
    {
        return 0.0;
    }
}
```
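To make the role of the CDF more concrete, here is a minimal sketch in plain C# with no IMSL dependency. It assumes that `Cdf.DiscreteUniform` follows the textbook definition of the discrete uniform CDF, and that the expected count for each face is the total number of rolls multiplied by the CDF increment across that face. The names `FairDieCdf`, `Evaluate`, and `ExpectedCounts` are hypothetical and are not part of IMSL.

```csharp
using System;

// Hypothetical stand-in for the IMSL-backed CdfFunction; not part of IMSL.
public static class FairDieCdf
{
    // Textbook discrete uniform CDF on {1, ..., sides}: P(X <= x).
    public static double Evaluate(double x, int sides)
    {
        int xi = (int)Math.Floor(x);
        if (xi < 1) return 0.0;       // below the smallest face
        if (xi >= sides) return 1.0;  // at or above the largest face
        return (double)xi / sides;
    }

    // Expected count per face = total rolls * (F(k) - F(k - 1)).
    public static double[] ExpectedCounts(double totalRolls, int sides)
    {
        double[] expected = new double[sides];
        for (int k = 1; k <= sides; k++)
        {
            expected[k - 1] =
                totalRolls * (Evaluate(k, sides) - Evaluate(k - 1, sides));
        }
        return expected;
    }

    public static void Main()
    {
        // 150 rolls of a six-sided die, as in the example above.
        double[] expected = ExpectedCounts(150, 6);
        Console.WriteLine(string.Join(", ", expected));
    }
}
```

For 150 rolls and six faces, every expected count comes out at 25 (up to floating-point rounding), which is the baseline the observed counts are compared against.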

At this stage, the calculations are in place and it’s just a matter of retrieving the statistical measures of interest. For those reported in the article (a subset of what’s available in IMSL C#), retrieving the information is a matter of accessing properties or getter methods, such as:

```csharp
double chisq = cs.ChiSquared;
double dof = cs.DegreesOfFreedom;
double pvalue = cs.P;
double[] expected = cs.GetExpectedCounts();
```

Output preferences and formatting are up to the user’s application, but could be reported in any of various GUI elements or directly to the console.

```
chi-squared statistic: 15.28
degrees of freedom: 5
p-value: 0.009231
Expected Counts: { 25, 25, 25, 25, 25, 25 }
```

With a large chi-squared value and a p-value less than 1 percent, we reject the hypothesis that the die is fair: the observed counts are not consistent with equally likely outcomes.
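The reported statistic is easy to check by hand. With 150 rolls and six equally likely faces, each expected count is 150/6 = 25, and the Pearson chi-squared statistic sums the squared deviations of the observed counts (20, 28, 12, 32, 22, 36) from that value:

```latex
\chi^2 = \sum_{i=1}^{6} \frac{(O_i - E_i)^2}{E_i}
       = \frac{(-5)^2 + 3^2 + (-13)^2 + 7^2 + (-3)^2 + 11^2}{25}
       = \frac{382}{25}
       = 15.28
```

This matches the value returned by `cs.ChiSquared`. For reference, the upper 1 percent critical value of the chi-squared distribution with 5 degrees of freedom is about 15.09, so a statistic of 15.28 is consistent with a p-value just under 0.01.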

Example #2 – Contingency table

Another example in the article is explained in much less detail on the R side. The idea is to evaluate whether two categorical variables, recorded as counts in a table, are statistically independent. If the resulting p-value is less than 5 percent, the conclusion is that the two variables are dependent on each other. One could again use a chi-squared algorithm to determine this, but IMSL C# has a separate class for this problem. Using the `ContingencyTable` class results in notably terse code:

```csharp
double[,] matrix = { { 15, 25, 10 }, { 30, 15, 5 } };
ContingencyTable ct = new ContingencyTable(matrix);
double pvalue = ct.P;
```

You just need to compare the resulting p-value against 0.05 to test for statistical significance. In this example, the p-value is 0.01022, so we conclude that the two variables are not independent.
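As a cross-check on that p-value, the following sketch computes Pearson's chi-squared statistic for the same table in plain C#, without IMSL. It assumes `ct.P` reports the Pearson chi-squared p-value, which is consistent with the number quoted above. The class and method names (`IndependenceCheck`, `Statistic`, `PValueTwoDof`) are hypothetical, and the p-value shortcut applies only to tables with 2 degrees of freedom, such as this 2 x 3 table.

```csharp
using System;

// Hypothetical cross-check of Pearson's chi-squared test; not part of IMSL.
public static class IndependenceCheck
{
    // Pearson chi-squared statistic for an r x c table of observed counts.
    public static double Statistic(double[,] table)
    {
        int rows = table.GetLength(0);
        int cols = table.GetLength(1);
        double[] rowSum = new double[rows];
        double[] colSum = new double[cols];
        double total = 0.0;

        for (int i = 0; i < rows; i++)
        {
            for (int j = 0; j < cols; j++)
            {
                rowSum[i] += table[i, j];
                colSum[j] += table[i, j];
                total += table[i, j];
            }
        }

        double chisq = 0.0;
        for (int i = 0; i < rows; i++)
        {
            for (int j = 0; j < cols; j++)
            {
                // Expected count under independence of rows and columns.
                double expected = rowSum[i] * colSum[j] / total;
                double diff = table[i, j] - expected;
                chisq += diff * diff / expected;
            }
        }
        return chisq;
    }

    // Upper-tail probability for exactly 2 degrees of freedom, where the
    // chi-squared distribution reduces to an exponential with mean 2.
    public static double PValueTwoDof(double chisq)
    {
        return Math.Exp(-chisq / 2.0);
    }

    public static void Main()
    {
        double[,] matrix = { { 15, 25, 10 }, { 30, 15, 5 } };
        double chisq = Statistic(matrix);
        Console.WriteLine("chi-squared: " + chisq);
        Console.WriteLine("p-value:     " + PValueTwoDof(chisq));
    }
}
```

The table has (2 - 1) x (3 - 1) = 2 degrees of freedom. The statistic works out to roughly 9.1667 and the p-value to roughly 0.01022, in agreement with the value reported by `ContingencyTable`.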