Iteratee module used in this post is part of the FSharpx
library and provides a set of types and functions for building compositional, input processing components.
A fold, by any other name
A common scenario for I/O processing is parsing an HTTP request message. Granted, most will rely on ASP.NET, HttpListener, or WCF to do this for them. However, HTTP request parsing has a lot of interesting elements that are useful for demonstrating problem areas in inefficient resource usage using other techniques. For our running sample, we’ll focus on parsing the headers of the following HTTP request message:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21
When parsing an HTTP request message, we most likely want to return not just a list of strings but something more descriptive, such as an abstract syntax tree represented as a F# discriminated union. A perfect candidate for taking in our request message above and producing the desired result is
The left fold is a very useful function. It equivalent to the
Enumerable.Aggregate extension method in LINQ. This function takes in a state transformation function, an initial seed value, and a sequence of data, and produces a final state value incrementally.
Looks like we’re done here. However, there are still problems with stopping here:
You can certainly use function composition to generate an appropriate state incrementing function, but you would still have the problem of being able to pause to delay selecting the appropriate message body processor.
Suppose you really only ever want just the message headers. How would you stop processing to free your resources as quickly as possible? Forget it. Fold is going to keep going until it runs out of chunks. Even if you have your fold function stop updating the state after the headers are complete, you won’t get a result until the entire data stream has been processed.
The iteratee itself is a data consumer. It consumes data in chunks until it has either consumed enough data to produce a result or receives an EOF.
An iteratee is based on the
fold operator with two differences:
The iteratee may complete its processing early by returning a Done state. The iteratee may return the unused portion of any chunk it was passed in its Done state. This should not be confused with the rest of the stream not yet provided by the enumerator.
The iteratee does not pull data but receives chunks of data from an “enumerator”. It returns a continuation to receive and process the next chunk. This means that an iteratee may be paused at any time, either by neglecting to pass it any additional data or by passing an Empty chunk.
For those interested in reading more, there are any number of articles and examples. Here are a few I’ve referenced quite frequently:
- Iteratee IO
- enumerator package for Haskell
- Yesod Book - Enumerator package
- Scalaz Tutorial: Enumeration-Based I/O with Iteratees
The Scalaz version arguably comes closest to the F# implementation, but I have generally based my implementations on the Haskell source. I’ve got two implementations currently, a version based on the enumerator package, and a continuation-passing style version based on the iteratee package.
The iteratee is composed of one of two states, either a done state with a result and the remainder of the data stream, or a continue state containing a continuation to continue processing.
1 2 3
Stream type used here is not
System.IO.Stream but yet another discriminated union to represent different stream chunk states:
1 2 3 4
The following are some sample iteratees using the List<’a> type in F#. While not the most efficient, they are easy to express concisely and are thus better for illustrating the use of iteratees.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101
Enumerators feed data into iteratees, advancing their state until they complete.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39
Reading an HTML Request by Lines
Back to our earlier example of reading HTTP requests, let’s see how we might read it by lines. I’ll add a few caveats. This is not encapsulating the
Stream well as we are using a memory stream. I could wrap this up in an
enumerateMemoryStream function that takes all the bytes to read (which will likely happen soon). The example below is using the
Iteratee.Binary module, not the
1 2 3 4 5 6 7 8 9 10
The System.IO.Stream type should be familiar to anyone who has ever worked with I/O in .NET. Streams are the primary abstraction available for working with streams of data, whether over the file system (FileStream) or network protocols (NetworkStream). Streams also have a nice support structure in the form of TextReaders and TextWriters, among other, similar types.
Using the standard Stream processing apis, you might write something like the following:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
What might be the problems with this approach?
Blocking the current thread
This is a synchronous use of reading from a Stream. In fact,
StreamReader can only be used in a synchronous fashion. There are no methods that even offer the Asynchronous Programming Model (APM). For that, you’ll need to read from the
Stream in chunks and find the line breaks yourself. We’ll get to more on this in a few minutes.
First off, the sample above is performing side effects throughout, and side-effects don’t compose. You might suggest using standard function composition to create a complete message parser, but what if you don’t know ahead of time what type of request body you’ll receive? Can you pause processing to select the appropriate processor? You could certainly address this with some conditional branching, but where exactly do you tie that into your current logic? Also, the current parser breaks on lines using the StreamReader’s built in logic. If you want to do parsing on individual lines, where would you tie in that logic?
In this example, we are not taking any care about our memory use. To that end, you might argue that we should just use
StreamReader.ReadToEnd() and then break up the string using a compositional parser like FParsec. If we don’t care about careful use of memory, that’s actually quite a good idea. However, in most network I/O situations – such as writing high-performance servers – you really want to control your memory use very carefully.
StreamReader does allow you to specify the chunk size, so it’s not a complete case against using
StreamReader, which is why this is a last-place argument. And of course, you can certainly go after
How can we add better compositionality and refrain from blocking the thread? One way others have solved this problem is through the use of iterators. A common example can be found on Stack Overflow. Iterators allow you to publish a lazy stream of data, either one element at a time or in chunks, and through LINQ or F#’s
Seq module, chain together processors for the data.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
Three options are presented. In each, I’m using a
StreamReader, but you could just as easily replace those with a
SocketAsyncEventArgs. What could be wrong with these options?
Lazy, therefore resource contentious
Iterators are pull-based, so you’ll only retrieve a chunk when requested. This sounds pretty good for your processing code, but it’s not a very good situation for your sockets. If you have data coming in, you want to get it moved out as quickly as possible to free up your allocated threads and pinned memory buffers for more incoming or outgoing traffic.
Lazy, therefore non-deterministic
Each of these items is lazy; therefore, it’s impossible to know when you can free up resources. Who owns the
StreamReader passed into each method? When is it safe to free the resource? If used immediately within a function, you may be fine, but you’d also be just as well off if you used a
array comprehension and blocked until all bytes were read. (The one possible exception might be when using a co-routine style async pattern.)
Iteratees are able to resolve a number of issues related to general stream processing, as well as either lazy, e.g. seq<_>- or Async<_>-based, non-determinism. I’m sure you are wondering why I didn’t dive deeper in explaining the iteratees above. In the next part, we’ll explore a more idiomatic .NET solution for creating iteratees. We’ll then decompose the iteratees above and cover each in more detail. In the meantime, please checkout FSharpx for additional information.