Demystifying LINQ to Objects

19 Oct 2010

To improve my understanding of LINQ, I’ve long wanted to learn how LINQ to Objects works under the covers. I’ll do so by relating query expression syntax to method invocation syntax and lambda expressions to generic delegates. Since many of the building blocks that make up LINQ are nothing more than syntactic sugar around .NET 2.0 constructs, learning LINQ in terms of .NET 2.0 is one way to get into the more functional spirit of C#.

To set the stage, I’ve created an array of people to query.

Person[] people = new[] { new Person { Age = 1, ...}, ... };

It’s worth noting that all arrays implicitly derive from the Array class. It’s a special class that only the compiler and the runtime may derive from. Doing so yourself is rewarded with a compiler error stating simply that the class cannot be derived from. What’s also not apparent from the definition of the Array class is that, as of .NET 2.0, it also implements IList<T>, ICollection<T>, and IEnumerable<T>, which derive from the corresponding non-generic interfaces.

public abstract class Array :
    ICloneable, IList, ICollection,
    IEnumerable, IStructuralComparable, IStructuralEquatable

The runtime takes special care of extending the Array type with implementations of the generic interfaces based on the type of object that the array stores. That’s why the interfaces are invisible in the documentation. Any array of type T — in this case Person — will therefore implement, among others, the IEnumerable<T> interface. It’s the implementation of this interface on a class that enables LINQ to Objects to query any array or collection.

Query expression syntax

Let’s create a simple query using query expression syntax. To support it, additional keywords have been added to C# to express common query operators. Examples of what’s not supported with query expression syntax are the Sum, Take, and Skip query operators. In those cases you can combine query expression with method invocation syntax or write everything using the latter.

var q = from p in people
        where p.Age > 25
        select p;

Method invocation syntax

To make matters transparent, let’s not rely on type inference and the var keyword for typing the result. At the same time, let’s assume the role of the compiler and translate the query expression syntax into method invocation syntax with lambda expressions. Each standard query operator, such as Where and Select, translates to a corresponding method invocation on the collection, oftentimes with a lambda expression as the argument. Both representations are semantically equivalent, but for this simple query method invocation syntax appears more complex. This is generally not the case for either representation as you can try out with a tool like LINQPad.

IEnumerable<Person> r = people
    .Where(p => p.Age > 25)
    .Select(p => p);

The concise nature of the code is due to the compiler inferring the type arguments to Where and Select and the type of the lambda expression. In this case, because the collection stores objects of type Person, the type arguments of the generic methods as well as the type of the lambda expression is also of type Person. To make these types explicit, you can substitute real types for the generic ones in the definition of Where and Select provided later.

IEnumerable<Person> s = people
    .Where<Person>((Person p) => p.Age > 25)
    .Select<Person, Person>((Person p) => p);

Generic delegates

With the type specifications made explicit, it’s more obvious how lambda expressions are compatible with delegates. A lambda expression is nothing more than a short-hand notation for a delegate, a type-safe pointer to a piece of code, to be passed into a method. Hence, a lambda expression can be substituted with an anonymous delegate by wrapping it in additional ceremony.

IEnumerable<Person> t = people
    .Where<Person>(delegate(Person p) { return p.Age > 25; })
    .Select<Person, Person>(delegate(Person p) { return p; });

To make delegates type safe, their definition include the return type and the types of the arguments passed into it. This is unfortunate since the standard query operators must be able to work on any type of object. LINQ therefore relies on generic delegates in the definition of its operators. Like with other generic types, the compiler and runtime work together to generate real delegates based on the specified, or inferred, return type and types of arguments. Delegates like the ones below for Where and Select are what’s generated by the runtime.

// delegate Boolean WhereDelegate(Person p);
// delegate Person SelectDelegate(Person p);
bool WhereClause(Person p) { return p.Age > 25; }
Person SelectClause(Person p) { return p; }

IEnumerable<Person> u = people
    .Select<Person, Person>(SelectClause);

LINQ relies on a set of generic delegates defined in the .NET framework. These delegates come in two flavors: those that return void, named Action, and those that don’t, named Func. You can use these delegates in your own code to not only parameterize methods by value but by functionality.

delegate void Action();
delegate void Action<T>(T obj);
delegate void Action<T1, T2>(T1 arg1, T2 arg2);
// up to eight arguments

delegate TResult Func<TResult>();
delegate TResult Func<T, TResult>(T arg);
delegate TResult Func<T1, T2, TResult>(T1 arg1, T2 arg2);
// up to eight arguments

Extension methods

Each of the 50 or so standard query operators is defined on the Enumerable class with the this modifier on their first IEnumerable<T> argument. This makes them extension methods to every object that implements IEnumerable<T>. From the definition of the Where operator below, it then follows that when you write "Where(p => p.Age > 25)", the compiler infers that because Where is called on a collection of type Person, p must also be of type Person. Furthermore, because the result of the comparison is a bool, the lambda expression is compatible with a delegate that accepts a type Person and returns a type bool. In other words, the matching Where has a signature like the commented one below.

The purpose of the Where operator is to filter a collection based on a predicate, the mathematical term for a function that returns true or false. The people that satisfy the predicate is therefore returned as an IEnumerable<Person>. However, the yield keyword adds an interesting twist to the return of Where and Select and other operators that return IEnumerable<T>. Instead of the operator iterating the source collection and returning every element of the result at once, the yield keyword instructs the compiler to emit a state machine so the operator can keep track of how far through the source collection it is, and return one resulting element at a time. The effect is lazy evaluation of LINQ to Objects queries.

In the example, the output of Where goes into Select whose output may goto the console using a foreach. To understand yield, think of the foreach that writes to the console as pulling on a string of objects connecting Select to Where to the source collection. Each iteration of the loop pulls the string just enough to get at the next resulting element. This may in turn require Where to iterate multiple times until the predicate is once again satisfied.

// IEnumerable<Person> Where<Person>(
//     IEnumerable<Person> source,
//     Func<Person, bool> predicate)

public static IEnumerable<TSource> Where<TSource>(
    this IEnumerable<TSource> source,
    Func<TSource, bool> predicate) {
    foreach (TSource element in source) {
        if (predicate(element))
            yield return element;

The Select operator is an example of one whose input type may differ from its output type. By its nature, it projects one type onto another. It can even project onto an anonomous type, like in "Select(p => new { x = p.Name })", but then the var keyword must be used for its return type. In the example I project from Person to Person, which makes the matching signature of Select like the commented one below.

// IEnumerable<Person> Select<Person, Person>(
//     this IEnumerable<Person> source,
//     Func<Person, Person> selector)

public static IEnumerable<TResult> Select<TSource, TResult>(
    this IEnumerable<TSource> source,
    Func<TSource, TResult> selector) {
    foreach (TSource element in source)
        yield return selector(element);

The ability to chain methods together to form a data pipeline is what adds real power to LINQ. Although the idea of chaining methods together is old, the traditional approach of each method returning a new or mutated instance works best for objects of your own. Imagine adding to the IEnumerable<T> interface a set of query operators. Then every collection would have to provide implementations for Where, Select, and so on. Also, it wouldn’t be possible to add new operators in vNext of .NET without breaking existing code implementing the interface.

Extension methods solve this issue by providing the necessary syntactic sugar to make chaining of methods on a collection feel like chaining on an object of your own. Behind the scenes, the compiler rewrites calls to extension methods to static methods.

var v = Enumerable.Select(
    Enumerable.Where(people, p => p.Age > 25),
    p => p);

Note how, without the use of extension methods, operations must be specified in the opposite order in which they’re executed. This doesn’t read as nicely as the “infix” syntax when many operations are chained together.