# Histograms and LinqPad

As an avid user of Stack Overflow, I find myself reaching for LinqPad when I want to write and test a snippet of code quickly. LinqPad is billed as a great way to query databases and data sources with LINQ, but it is also one of the best C# scratchpads available. The basic version is free and fully functional, but there’s an upsell for autocompletion, smart tags, code outlining and .NET Reflector in the Pro edition. The Premium edition includes code snippets and the ability to execute queries across multiple databases in Microsoft SQL Server. You can find out more about LinqPad at http://www.linqpad.net

Recently LinqPad added support for custom ‘visualizers’ as a way to display queried data, and I thought that it would be great to be able to add a front-end to the histogram control found in NumSkull.

The custom visualizer support is currently in beta (as of 3/25/2012), so you’ll need to download the beta version until the new version is released. You’ll also need to download and compile NumSkull from https://bitbucket.org/skyguy94/numskull.

Once you have the tools in-hand, start up LinqPad and create a new query. From the Query menu, select query properties (or press F4 from the query window). The Query Properties window should appear. You’ll need to add references to the following assemblies (if you miss one, a detailed error window will appear and it’ll be obvious which is missing).

Once you’ve configured the references, creating a histogram is similar to programmatically creating a standard control in C#:

var histogram = new LinqPadVisualizers.HistogramVisualizer();
var data = new NumSkull.Histogram(new[] { 0d, 5, 10, 15, 20 });

The first line creates an instance of the histogram visualizer, and the second line creates an instance of the underlying histogram control with five evenly spaced bins. You connect the two by means of a dependency property on the visualizer, but the next step is to add some data to the histogram control.

var deltas = new[] { 195d,20,2,17,3,54,2,18,1065,45,34 };
LinqPadVisualizers.HistogramVisualizer.SetBins(histogram, data);

The last line uses the dependency property to assign the data to the visualizer. The final step is to tell LinqPad to show the control. Here’s the entire snippet:

var histogram = new LinqPadVisualizers.HistogramVisualizer();
var data = new NumSkull.Histogram(new[] { 0d, 5, 10, 15, 20 });
var deltas = new[] { 195d,20,2,17,3,54,2,18,1065,45,34 };
LinqPadVisualizers.HistogramVisualizer.SetBins(histogram, data);
PanelManager.DisplayWpfElement(histogram, "My Histogram");

The call to PanelManager.DisplayWpfElement is needed to tell LinqPad to display the visualizer. The first argument is the instance of the visualizer and the second argument is its title which will appear in the results window.
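To make the idea of binning concrete, here is a minimal stand-alone sketch that counts the `deltas` values against the bin values used above. It is not NumSkull code: `BinningSketch` and `CountPerBin` are hypothetical names, and the rule that each bin value is a lower bound (with the last bin open-ended) is an assumption, since NumSkull's own binning rules are not described here.

```csharp
using System;

// Hypothetical helper for illustration; not part of NumSkull or LinqPad.
public static class BinningSketch
{
    // Counts how many values fall into each bin. lowerBounds must be ascending.
    // Assumption: a value belongs to the last bin whose lower bound is <= the value,
    // so the final bin is open-ended and values below the first bound are ignored.
    public static int[] CountPerBin(double[] lowerBounds, double[] values)
    {
        var counts = new int[lowerBounds.Length];
        foreach (var value in values)
        {
            // Index of the last lower bound that does not exceed the value.
            var index = Array.FindLastIndex(lowerBounds, bound => bound <= value);
            if (index >= 0) counts[index]++;
        }
        return counts;
    }
}
```

Under that assumption, calling `BinningSketch.CountPerBin` with the bins `{ 0, 5, 10, 15, 20 }` and the eleven `deltas` values gives the counts 3, 0, 0, 2 and 6: most of the values land in the open-ended last bin.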

This is a sample of typical output (this is from a different, and much larger dataset):

Enjoy!

# Statistics and C# (Part 3)

In previous posts, I’ve been developing a rudimentary statistical library in C#. The library, NumSkull, currently supports various descriptive statistics.  This post adds the variance and standard deviation estimators to the library.

Variance

The variance of a sample of measurements is a fairly complicated idea that is usually presented as a formula in basic statistics courses without much explanation.  Unfortunately, I intend to do the same thing for this entry, but at some point I’d like to blog about estimators and explore population and sample variance in greater depth. For this post, I present sample variance mathematically and then leverage it to provide the standard deviation.

$$s^2 = \frac{1}{n-1} \sum_{i=1}^n (y_i - \bar y)^2$$
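As a quick check of the formula, take a made-up sample of eight measurements: 2, 4, 4, 4, 5, 5, 7, 9 (illustrative numbers, not data from the library). The mean is $\bar y = 40/8 = 5$, so:

$$s^2 = \frac{(2-5)^2 + 3(4-5)^2 + 2(5-5)^2 + (7-5)^2 + (9-5)^2}{8-1} = \frac{9 + 3 + 0 + 4 + 16}{7} = \frac{32}{7} \approx 4.57$$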

The LINQ for this is fairly boilerplate and uses the mean function previously developed, with some help from the System.Math library:

public static double Variance(this IEnumerable<double> data)
{
if (data.IsNullOrEmpty()) return default(double);

var mean = data.Mean();
var values = data.Select(v => Math.Pow(v - mean, 2));
// Sample variance: sum of squared deviations scaled by 1 / (n - 1).
var variance = (1d / (data.Count() - 1)) * values.Sum();
return variance;
}

Standard Deviation

The sample standard deviation is the positive square root of the sample variance. Because it is expressed in the same units as the measurements, it gives a more readable picture of variation for a single set of measurements. Like variance, standard deviation is an estimator.

$$s = \sqrt{s^2}$$

The LINQ in this code is confined to the variance method, so the code becomes a wrapper around the Math.Sqrt function.

public static double StandardDeviation(this IEnumerable<double> data)
{
var variance = data.Variance();
var standardDeviation = Math.Sqrt(variance);
return standardDeviation;
}
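For readers who want to try both estimators outside the library, the following is a minimal stand-alone sketch. The method names mirror the post, but `DispersionSketch` is a hypothetical class, not NumSkull's source. It uses LINQ's built-in `Average` in place of the library's `Mean`, and returning 0 for fewer than two values is an assumption made here to avoid dividing by zero.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Stand-alone sketch for illustration; not NumSkull's source.
public static class DispersionSketch
{
    // Sample variance: sum of squared deviations from the mean, divided by n - 1.
    // Assumption: null input or fewer than two values yields 0.
    public static double Variance(this IEnumerable<double> data)
    {
        if (data == null) return 0d;
        var values = data.ToList();   // enumerate the source only once
        if (values.Count < 2) return 0d;

        var mean = values.Average();
        var sumOfSquares = values.Sum(v => (v - mean) * (v - mean));
        return sumOfSquares / (values.Count - 1);
    }

    // Sample standard deviation: positive square root of the sample variance.
    public static double StandardDeviation(this IEnumerable<double> data)
    {
        return Math.Sqrt(data.Variance());
    }
}
```

For the made-up sample `new[] { 2d, 4, 4, 4, 5, 5, 7, 9 }`, `Variance()` returns 32/7 (about 4.57) and `StandardDeviation()` returns its square root (about 2.14).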

# Statistics and C# (Part 2)

In a previous post, Statistics and C# (Part 1), I introduced the concept of the sample mean as one of the measures of central tendency used in basic statistical analysis.  Included with that group of measurements are the median and mode.

Median

The median provides the middle value of a set of numbers if that set has an odd cardinality, or it is the mean of the middle two numbers if the set has an even cardinality.

Formula for odd cardinality:

$$median = y_{\left(\frac{n+1}{2}\right)}$$ (where n is the number of values and $y_{(k)}$ is the k-th value in sorted order)

Formula for even cardinality:

$$x = \frac{n+1}{2}$$ (where n is the number of values), which generates a fractional value for the index variable x.

$$median = \frac{y_{(x - .5)} + y_{(x + .5)}}{2}$$

In code, the method will have to combine and test for these two cases. Also, implicit in the definition is the notion that the set is ordered, so the code must sort the set in order to calculate the correct result.

public static double Median(this IEnumerable<double> data)
{
if (data.IsNullOrEmpty()) return default(double);

var sorted = data.OrderBy(d => d);
var count = sorted.Count();

var isEven = count % 2 == 0;
var middle = sorted.Skip((count - 1) / 2).Take(isEven ? 2 : 1);

var median = middle.Average();
return median;
}
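A short worked example, using made-up numbers, shows how the Skip and Take arguments line up with the two cases. Here $y_{(k)}$ denotes the k-th value in sorted order.

For the sorted sample 1, 3, 5, 7, 9 (n = 5), the count is odd, so the code skips (5 - 1) / 2 = 2 values and takes one:

$$median = y_{(3)} = 5$$

For the sorted sample 1, 3, 5, 7 (n = 4), the count is even, so the code skips (4 - 1) / 2 = 1 value (integer division) and takes two, then averages them:

$$median = \frac{y_{(2)} + y_{(3)}}{2} = \frac{3 + 5}{2} = 4$$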


Mode

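The mode measurement returns the number that occurs most often in the set. If several values tie for the highest number of occurrences, the mode returns a subset of the sequence containing those numbers.

Implementing the mode in LINQ is fairly interesting: group the values, count the members of each group, find the largest count, and keep only the groups that reach it. If no value occurs more than once, the result is empty.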

public static IEnumerable<double> Mode(this IEnumerable<double> data)
{
if (data.IsNullOrEmpty()) return Enumerable.Empty<double>();

var mode = data.GroupBy(d => d).Select(g => new
{
Value = g.Key,
Count = g.Count()
});

var max = mode.Max(d => d.Count);
var groupedModes = mode.OrderByDescending(d => d.Value);
var filtered = groupedModes.Where(d => d.Count == max && max > 1);

var modes = filtered.Select(g => g.Value);
return modes.ToList();
}
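The two less obvious behaviours, ties and the case where nothing repeats, are easier to see in a stand-alone version. This is a sketch under stated assumptions: `ModeSketch` is a hypothetical class rather than NumSkull's source, and it replaces the `IsNullOrEmpty` helper with explicit checks so that it compiles on its own.

```csharp
using System.Collections.Generic;
using System.Linq;

// Stand-alone sketch for illustration; not NumSkull's source.
public static class ModeSketch
{
    public static IEnumerable<double> Mode(this IEnumerable<double> data)
    {
        if (data == null) return Enumerable.Empty<double>();

        // Count the occurrences of each distinct value.
        var counts = data.GroupBy(d => d)
                         .Select(g => new { Value = g.Key, Count = g.Count() })
                         .ToList();
        if (counts.Count == 0) return Enumerable.Empty<double>();

        var max = counts.Max(c => c.Count);
        // Rule carried over from the post: a value must repeat to count as a mode.
        if (max < 2) return Enumerable.Empty<double>();

        // Keep every value that reaches the highest count, largest value first.
        return counts.Where(c => c.Count == max)
                     .OrderByDescending(c => c.Value)
                     .Select(c => c.Value)
                     .ToList();
    }
}
```

With this sketch, `new[] { 1d, 2, 2, 3, 3 }.Mode()` yields 3 and 2 (both occur twice), while `new[] { 1d, 2, 3 }.Mode()` yields an empty sequence because no value repeats.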

The code for this project, NumSkull, can be found on CodePlex.

# Statistics and C# (Part 1)

I’m currently attending a night class covering statistics for mathematicians at Eastern Michigan University. This class, MATH 370, covers basic concepts of probability; expectation, variance, covariance, distribution functions and their application to statistical tests of hypothesis; bivariate, marginal and conditional distributions; treatment of experimental data. More information can be found at Eastern’s website http://www.emich.edu.

As a method for increasing my personal knowledge of programming with statistical functions and floating point representations, I thought it would be worthwhile to try and develop a library that mirrors the algorithms and functions that I am learning about in the class.

I am developing this library using C# 4.0 in VS2010 using TDD as my development methodology. For functions that consume sets of data, I will implement them as extension methods that operate solely on the System.Double type, which is a 64-bit implementation of the IEEE-754 standard for representation of floating point numbers (http://en.wikipedia.org/wiki/Double_precision_floating-point_format).

The class started off by discussing the need for numerically descriptive measures of a set of data. These measures can be broken down into two categories. The first contains algorithms that measure the central tendency of a set of numbers, and the second contains algorithms that measure dispersion or variation.

The most common measure of central tendency is the arithmetic mean, commonly known as the (plain vanilla) mean or average.

Mean

The mathematical representation of mean is this:

$$\bar y = \frac{1}{n} \sum_{i=1}^n y_i$$

There’s a statistical point to be made here. The formula gives the sample mean, not the population mean. In plain English, a sample is usually much smaller than the entire set (the population) it was drawn from, so the sample mean describes the sample exactly but is only an estimate of the population mean. On its own, it does not tell you how far that estimate may be from the true population value.

LINQ does provide an extension method for the mean, called Average, but for illustrative purposes, I chose to re-implement it using the LINQ Sum() and Count() extension methods.

public static double Mean(this IEnumerable<double> data)
{
if (data.IsNullOrEmpty()) return default(double);

var mean = (1d / data.Count()) * data.Sum();
return mean;
}
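The method above calls an `IsNullOrEmpty` extension that is not shown in this post. A minimal sketch of what such a helper might look like follows, together with a copy of the mean so the snippet compiles on its own. `MeanSketch` and this `IsNullOrEmpty` are assumptions for illustration, not NumSkull's actual code.

```csharp
using System.Collections.Generic;
using System.Linq;

// Stand-alone sketch for illustration; not NumSkull's source.
public static class MeanSketch
{
    // Hypothetical stand-in for the helper the post calls but does not show.
    public static bool IsNullOrEmpty<T>(this IEnumerable<T> source)
    {
        return source == null || !source.Any();
    }

    // Sample mean: (1 / n) * sum of the values.
    // A null or empty sequence yields 0, as in the post.
    public static double Mean(this IEnumerable<double> data)
    {
        if (data.IsNullOrEmpty()) return default(double);

        var values = data.ToList();   // materialize so Count and Sum share one pass over the source
        return (1d / values.Count) * values.Sum();
    }
}
```

One design difference worth noting: for an empty `IEnumerable<double>`, LINQ's `Average` throws an `InvalidOperationException`, whereas this `Mean` returns 0. For `new[] { 1d, 2, 3, 4 }` both give 2.5.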


The code for this project, NumSkull, can be found on CodePlex.