Skip to content

Iloggable

Fixing leaking semaphores with mod_mono

After porting my mod_mono ASP.NET MVC application to Ruby and Rails and setting up Phusion Passenger up to run the application under mono, I finally figured out how to fix the leaking semaphore issue. The real title of this post should probably be "PEBKAC or Don't assume errors are unrelated, you idiot".

Recap of the problem

The problem manifests itself as a build up of semaphore arrays by the apache process, which is visible via ipcs. When the site is first started the output looks like this:

[root@host ~]# ipcs

------ Shared Memory Segments --------
key        shmid      owner      perms      bytes      nattch     status
0x01014009 1671168    root       600        52828      48
0x0101400a 1703937    root       600        52828      25
0x0101400c 1736706    root       600        52828      35

------ Semaphore Arrays --------
key        semid      owner      perms      nsems
0x00000000 10616832   apache     600        1
0x00000000 10649601   apache     600        1
0x00000000 10682370   apache     600        1
0x00000000 10715139   apache     600        1

------ Message Queues --------
key        msqid      owner      perms      used-bytes   messages

Eventually it'll look like this:

[root@host ~]# ipcs

------ Shared Memory Segments --------
key        shmid      owner      perms      bytes      nattch     status
0x01014009 1671168    root       600        52828      48
0x0101400a 1703937    root       600        52828      25
0x0101400c 1736706    root       600        52828      35

------ Semaphore Arrays --------
key        semid      owner      perms      nsems
0x00000000 10616832   apache     600        1
0x00000000 10649601   apache     600        1
0x00000000 10682370   apache     600        1
_...
lots more
..._
0x00000000 11141158   apache     600        1
0x00000000 11173927   apache     600        1
0x00000000 11206696   apache     600        1

------ Message Queues --------
key        msqid      owner      perms      used-bytes   messages

At some point all ASP.NET pages will return blank pages. No errors, no nothing, .NET logging reports normal behavior, but no content is sent. And you can restart the mono processes and apache all you want, it won't come back. Sorry.

What's really wrong

Since day one i'd been receiving warnings at apache startup up, but since i didn't understand what they meant and things seemed to be working, i had been ignoring the,. Of course, that was a lie on its face. Things were clearly not working, with the leaking semaphores, but I conveniently filed the two issues as unrelated in my head and ignored them at my peril. The warning was this:

[Mon Jan 24 00:12:50 2011] [crit] The unix daemon module not initialized yet.
Please make sure that your mod_mono module is loaded after the User/Group
directives have been parsed. Not initializing the dashboard.

This warning was repeated for as many times as I had ASP.NET vhosts defined. I looked at my vhost configurations and saw nothing about users and groups and thought it was some weird mono issue and left it at that. But the actual problem was not in the vhost configuration but in httpd.conf. The problem was this default section:

#
# Load config files from the config directory "/etc/httpd/conf.d".
#
Include conf.d/\*.conf

#
# If you wish httpd to run as a different user or group, you must run
# httpd as root initially and it will switch.
#
# User/Group: The name (or #number) of the user/group to run httpd as.
#  . On SCO (ODT 3) use "User nouser" and "Group nogroup".
#  . On HPUX you may not be able to use shared memory as nobody, and the
#    suggested workaround is to create a user www and use that user.
#  NOTE that some kernels refuse to setgid(Group) or semctl(IPC_SET)
#  when the value of (unsigned)Group is above 60000;
#  don't use Group #-1 on these systems!
#
User apache
Group apache

Obviously User and Group are set after all vhost configs are loaded. Pretty much exactly what the warning was saying (doh). I simply moved the Include below User/Group and since then I have not seen more than 9 semaphores and I've restarted mono, rebuilt the application and hit the app with ApacheBench, the combination of which used to drive semaphores up.

Since things are working now and ASP.NET MVC under mod_mono is significantly faster than the Rails port, I'm sticking with ASP.NET MVC for production right now, monitoring semaphores to make sure this really did fix the problem.

Installing Phusion Passenger on Amazon Linux AMI 1.0

Once again, this is a progression of building out my Amazon Linux AMI, so the pre-requisites might be off, since I've previously installed a number of other things. And once again, this is simply a log of tasks for my own future reference, rather than a build recipe. Maybe this will be useful to someone else as well, so I've gone back and tagged all AMI articles with aws-linux-ami, so you can at least see the history of pre-requisites.

Anyway, this time I'm installing phusion passenger to host the ASP.NET app I ported to Rails last week. The AMI comes with Ruby 1.8.7. I next installed the following repos:

yum install libcurl-devel openssl-devel mysql-devel ruby-devel rubygems

Even tho gems is now installed, it's not current enough for rails, so first thing, upgrade gems

gem update --system

I also had rails fail to install, with

Installing ri documentation for rails-3.0.3...
File not found: lib

Which i fixed with rebuilding rdoc:

gem install rdoc-data
rdoc-data --install
gem rdoc --all --overwrite

Now it's finally time to install and build rails with mysql support (which is how i set my rails application up) and passenger

gem install mysql2
gem install rails
gem install passenger

Next, build the passenger apache2 module. I actually killed the install the first time around because libcurl-devel and openssl-devel were missing. The installer assured me that it would guide me through getting those dependencies resolved, but I wanted to make sure they came in through yum rather than have this installer download and build them from source. Anyway the command was:

passenger-install-apache2-module

This installed flawlessly and ended with instructions to put the following in my apache config:

LoadModule passenger_module /usr/lib/ruby/gems/1.8/gems/passenger-3.0.2/ext/apache2/mod_passenger.so
PassengerRoot /usr/lib/ruby/gems/1.8/gems/passenger-3.0.2
PassengerRuby /usr/bin/ruby

A git diversion

Before getting to the apache setup of my rails app, I ran into this error trying to check the port out from my repo:

warning: remote HEAD refers to nonexistent ref, unable to checkout.

I don't know how this happened, since other gitosis repos i've created haven't had the same problem, but running

git push --all

on my development machine did the job. Apparently it had been pushing changes into the repo, but never set up a branch because that command reported:

\* [new branch]      master -> master

Well, fortunately after that all was good :)

Configuring rails in apache

Finally, the apache vhost config was exceedingly simple:

   <VirtualHost \*:80>
      ServerName www.yourhost.com
      DocumentRoot /somewhere/public    # <-- be sure to point to 'public'!
      <Directory /somewhere/public>
         AllowOverride all              # <-- relax Apache security settings
         Options -MultiViews            # <-- MultiViews must be turned off
      </Directory>
   </VirtualHost>

The important thing is that the DocumentRoot needs to point to the rails public directory not the root of the rails application.

The last task was running

rake db:create:all

to set up the expected db locally. After that, and an apache restart, the app came up without a hitch.

Of course, while setting all this up, I finally figured out why mod_mono was leaking semaphores, making all of this likely moot. But i'm glad to have this alternative while I determine whether the mod_mono behavior is really fixed.

Type-safe actor messaging approaches

For notify.me I hand-rolled a simple actor system to handle all Xmpp traffic. Every user in the system has its own actor that maintains their xmpp state, tracking online status, resources, resource capability, notification queues and command capabilities. When a message comes in either via our internal notification queues or from the user, a simple dispatcher sends the message on to the actor which handles the message and responds via a message that the dispatcher either hands off to the Xmpp bot for formatting and delivery to the client or sends it to our internal queues for propagation to other parts of the notify.me system.

This has worked flawlessly for over 2 years now, but its ad-hoc nature means it's a fairly high touch system in terms of extensibility. This has led me down building a more general actor system. Originally Xmpp was our backbone transport among actors in the notify.me system, but at this point, I would like to use Xmpp only as an edge transport, and otherwise use in-process mailboxes and serialize via protobuf for remote actors. I still love the Xmpp model for distributing work, since nodes can just come up anywhere, sign into a chatroom and report for work. You get broadcast, online monitoring, point-to-point messaging, etc. all for free. But it means all messages go across the xmpp backbone, which has a bit of overhead and with thousands of actors, i'd rather stay in process when possible. No point going out to the xmpp server and back just to talk to the actor next to you. I will likely still use Xmpp for Actor Host nodes to discover each other, but the actual inter-node communication will be direct Http-RPC (no, it's not RESTful, if it's just messaging).

Definining the messaging contract as an Interface

One design approach I'm currently playing with is using actors that expose their contract via an interface. Keeping the share-nothing philosophy of traditional actors, you still won't have a reference to an actor, but since you know its type, you know exactly what capabilities it has. That means rather than having a single receive point on the actor and making it responsible for routing the message internally based on message type (a capability that lends itself better to composition), messages can arrive directly at their endpoints by signature. Another benefit is that testing the actor behavior is separate from its routing rules.

public interface IXmppAgent {
    void Notify(string subject, string body);
    OnlineStatus QueryStatus();
}

Given this contract we could just proxy the calls. So our mailbox could have a proxy factory like this:

public interface IMailbox {
    TRecipient For<TRecipient>(string id);
}

allowing us to send messages like this:

var proxy = _mailbox.For<IXmppAgent>("foo@bar.com");
proxy.Notify("hey", "how'd you like that?");
var status = proxy.QueryStatus();

But messaging is supposed to be asynchronous

While this is simple and decoupled, it is implictly synchronous. Sure .Notify could be considered a fire-and-forget message, .QueryStatus definitely blocks. And if we wanted to communicate an error condition like not finding the recipient, we'd have to do it as an exception, moving errors into the synchronous pipeline as well. In order to retain the flexibility of a pure message architecture, we need a result handle that let's us handle results and/or errors via continuation.

My first pass at an API for this resulted in this calling convention:

public interface IMailbox {
    void Send<TRecipient>(string id, Expression<Action<TRecipient>> message);
    Result SendAndReceive<TRecipient>(string id, Expression<Action<TRecipient>>  message);
    Result<TResponse> SendAndReceive<TRecipient, TResponse>(
        string id,
        Expression<Func<TRecipient, TResponse>>  message
    );
}

transforming the messaging code to this:

_mailbox.Send<IXmppAgent>("foo@bar.com",a => a.Notify("hey", "how'd you like that?"));
var result = _mailbox.SendAndReceive<IXmppAgent, OnlineStatus>(
    "foo@bar.com",
    a => a.QueryStatus()
);

I'm using MindTouch Dream's Result<T> class here, instead of Task<T>, primarily because it's battle tested and I have not properly tested Task under mono yet, which is where this code has to run. In this API, .Send is meant for fire-and-forget style messaging while .SendAndReceive provides a result handle -- and if Void were an actual Type, we could have dispensed with the overload. The result handle has the benefit of letting us choose how we want to deal with the asynchronous response. We could simply block:

var status = _mailbox.SendAndReceive<IXmppAgent, OnlineStatus>(
        "foo@bar.com",
        a => a.QueryStatus())
    .Wait();
Console.WriteLine("foo@bar.com status:", status);

or we could attach a continuation to handle it out of band of the current execution flow:

_mailbox.SendAndReceive<IXmppAgent, OnlineStatus>(
        "foo@bar.com",
        a => a.QueryStatus()
    )
    .WhenDone(r => {
        var status = r.Value;
        Console.WriteLine("foo@bar.com status:", status);
    });

or we could simply suspend our current execution flow, by invoking it from a coroutine:

var status = OnlineStatus.Offline;
yield return _mailbox.SendAndReceive<IXmppAgent, OnlineStatus>(
        "foo@bar.com",
        a => a.QueryStatus()
    )
    .Set(x => status = x);
Console.WriteLine("foo@bar.com status:", status);

Regardless of completion strategy, we have decoupled the handling of the result and error conditions from the message recipient's behavior, which is the true goal of the message passing decoupling of the actor system.

Improving usability

Looking at the signatures there are two things we can still improve:

  1. If we send a lot of messages to the same recipient, the syntax is a bit repetitive and verbose
  2. Because we need to specify the recipient type, we also have to specify the return value type, even though it should be inferable

We can address both of these, by providing a factory method for a typed mailbox:

public interface IMailbox {
    IMailbox<TRecipient> To<TRecipient>(string id);
}

public interface IMailbox<TRecipient> {
    void Send(Expression<Action<TRecipient>> message);
    Result SendAndReceive<TResponse>(Expression<Action<TRecipient>>  message);
    Result<TResponse> SendAndReceive<TResponse>(
        Expression<Func<TRecipient, TResponse>>  message
    );
}

which let's us change our messaging to:

var actorMailbox = _mailbox.To<IXmppAgent>("foo@bar.com");
actorMailbox.Send(a => a.Notify("hey", "how'd you like that?"));
var result2 = actorMailbox.SendAndReceive(a => a.QueryStatus());

// or inline
_mailbox.To<IXmppAgent>("foo@bar.com")
    .Send(a => a.Notify("hey", "how'd you like that?"));
var result3 = _mailbox.To<IXmppAgent>("foo@bar.com")
    .SendAndReceive(a => a.QueryStatus());

I've included the inline version because it is still more compact than the explicit version, since it can infer the result type.

Supporting Remote Actors

The reason the mailbox uses Expression instead of raw Action and Func is that at any point an actor we're sending a message to could be remote. The moment we cross process boundaries, we need to serialize the message. That means we need to be able to programatically inspect the inspection, and build a serializable AST as well as serialize the captured data members used in the expression.

Since we're talking serializing, inspecting the expression also allows us to verify that all members are immutable. For value types, this is easy enough, but DTOs would need be prevented from changing so that local vs. remote invocation won't end up with different result just because the sender changed it's copy. We could handle this via serialization at message send time, although this looks like a perfect place to see how well the Freezable pattern works.

Porting ASP.NET MVC to Ruby on Rails

This isn't yet another .NET developer defecting to Ruby. I have very little interest in making Ruby my primary language. I've done a couple of RoR projects over the years, nothing serious I admit, but I just don't seem to enjoy it in the way that so many of my peers do. That said, RoR does hit a sweetspot for websites. The site I'm porting has very little in terms of business logic -- it's primarily HTML templating with navigation -- so this was an exercise to circumvent my mod_mono issues.

I'm a huge C# fanboy, but having worked with ASP.NET MVC for a while I have to admit that the amount of cruft one has to assemble to stay DRY in ASP.NET templating is just not worthwhile. While views can be strongly typed, it's an exercise in frustration trying to write templates generically. Maybe this becomes easier with dynamic usage in MVC3, but i haven't checked it out. What certainly doesn't help is that the MVC team decided to make TemplateHelper internal, turning the addition of helpers in the vein of .DisplayFor or .EditorFor into a major task that still ends up being a pile of hacks. Now I'm not an ASP.NET MVC expert and there's probably a lot of extension points I just don't know about. But the articles on extending it that I have found are usually pages of code. I shouldn't have to become a framework internals expert just to add some generic templating extensibility.

Ok, enough ranting. ASP.NET MVC is still a huge improvement over webforms, but right now I'm watching Manos de Mono and OWIN to see what develops in .NET land for websites there. The ASP.NET stack, in my opinion, is just too heavy for something that should be simple.

So, why RoR instead of node.js, since I claimed that I was going to get serious about javascript this year? Mostly because this port has a deadline, so use what you know applies, and it's a production site, so use known stable tech applies. Another benefit was that RoR uses the same <% %> syntax as webforms views and MVC was clearly heavily inspired by RoR.

I ported the site over 3 nights, maybe 10 hours of cumulative seat time which feels like time well spent. Strategic search and replace got me 80% there, faking Html. for my custom extension in RoR got me another 10%, leaving only 10% for actual new business logic written in ruby. Once I get to more complex business logic for the site I may stick to Ruby, although I know I'll be sorely tempted to write it as REST services in C# on top of Dream.

Convention over Configuration in statically typed languages

For me, static typing serves two purposes: First, compile time checking that arguments to calls are the appropriate type, and second, static discovery of code usage, dependency and wire-up. The former is about correctness and safety, and constitutes the "free unit testing" the compiler provides (which statically-typed detractors deride and statically-typed proponents celebrate). The latter is about productivity. It allows me to quickly navigate through a complex codebase, find dead code, track down unintended coupling, etc. In many ways, the discovery aspects is the greater reason why I like statically-typing. And that's why I consider Convention over Configuration a paradigm that should be approached with caution.

If you are in a dynamically typed language, discovery relies on reading code and documentation. While your code may be expressed as symbols not strings, navigation of those symbols is often impossible until runtime. You have nothing that's guarding you from your own typos, so all code is equally suspect and you are likely to manually or automatically test basic syntax a lot more. With that premise, automatic configuration of instances and wireup doesn't complicate the existing testing burden significantly but greatly cuts down on typing and code size, i.e. it's a good thing.

Now in statically-typed land things are different: Compiler checked syntax can lull you into complacency, especially if you enjoy refactorings that track down on the symbolic rather than search/replace level. Renaming something to fit the context can quickly break mappings or wire-ups relying on convention. Don't get me wrong, config files for mapping or wire-up is no better -- I am advocating mapping and wire-up in code as much as possible. The minutes spent on initial coding are worth hours in maintainability.

I'm not going to discount Convention over Configuration in totallity -- I think it is immensely useful at code boundaries, such as mapping schemas to entities, or Urls to controller actions. These touch points already are on the code edge where represenation is leaving the homogenous domain, so using convention only provides naming guidance and reduces tedious scaffolding.

So, my advice is, don't use Convention over Configuration to wire up your code internals unless the relationships are obvious and/or well covered by tests. But do favor it for the plumbing required to connect your code to outside input and output sources.

nginx+mono vs. apache+mod_mono

I've been using Apache with mod_mono for some ASP.NET MVC2 projects and kept having problems with semaphore arrays being leaked. Under 2.6.7 this even broke xbuild after a while. I then went to 2.8 and 2.8.1, but it didn't stop the leaks. I posted on the mono-devel list and after lack of response simply asked if anyone was actually running ASP.NET under mod_mono, which also elicited no replies. Finally, I posted the problem on stackoverflow, again without any resolution.

The mod_mono problem

The problem manifests itself as a build up of semaphore arrays by the apache process, which is visible via ipcs. When the site is first started the output looks like this:

[root@host ~]# ipcs

------ Shared Memory Segments --------
key        shmid      owner      perms      bytes      nattch     status
0x01014009 1671168    root       600        52828      48
0x0101400a 1703937    root       600        52828      25
0x0101400c 1736706    root       600        52828      35

------ Semaphore Arrays --------
key        semid      owner      perms      nsems
0x00000000 10616832   apache     600        1
0x00000000 10649601   apache     600        1
0x00000000 10682370   apache     600        1
0x00000000 10715139   apache     600        1

------ Message Queues --------
key        msqid      owner      perms      used-bytes   messages

Eventually it'll look like this:

[root@host ~]# ipcs

------ Shared Memory Segments --------
key        shmid      owner      perms      bytes      nattch     status
0x01014009 1671168    root       600        52828      48
0x0101400a 1703937    root       600        52828      25
0x0101400c 1736706    root       600        52828      35

------ Semaphore Arrays --------
key        semid      owner      perms      nsems
0x00000000 10616832   apache     600        1
0x00000000 10649601   apache     600        1
0x00000000 10682370   apache     600        1
_...
lots more
..._
0x00000000 11141158   apache     600        1
0x00000000 11173927   apache     600        1
0x00000000 11206696   apache     600        1

------ Message Queues --------
key        msqid      owner      perms      used-bytes   messages

At some point all ASP.NET pages will return blank. No errors, no nothing, .NET logging reports normal behavior, but no content is sent. And you can restart the mono processes and apache all you want, it won't come back. Sorry.

What does work is to remove all semaphore arrays via ipcrm and restart apache. For the time being i've had a script in cron that did this:

/usr/bin/ipcrm sem $(/usr/bin/ipcs -s | grep apache | awk '{print$2}');

Unfortunately, the leaking semaphores are somehow related to traffic, so eventually i'd either have to increase the frequency of the restart script or make it more intelligent. I opted for neither and decided to try out nginx+fastcgi+mono.

Installing and Configuring nginx+fastcgi+mono

Like my mono 2.8.1 install, I'm doing this on an Amazon Linux AMI 1.0. And like that article, this isn't so much a recipe than a log of my actions. Note that this was done after the 2.8.1 install from source so there might be dependencies i'm not mentioning since they'd already been addressed.

First, the simple part, the yum install:

yum install nginx

Append the below to /etc/nginx/fastcig_params:

# mono
fastcgi_param PATH_INFO "";
fastcgi_param SCRIPT_FILENAME $document_root$fastcgi_script_name;

Now, let's assume there's an apache vhost config in /etc/httpd/conf.d/foobar.conf that looks like this:

Include conf.d/mod_mono.conf

MonoSetEnv MONO_DISABLE_SHM=1

<VirtualHost \*:80>
  ServerName www.foobar.com
  ServerAdmin admin@foobar.com
  DocumentRoot /foobar/http/www

  ErrorLog      /foobar/log/www/error_log
  CustomLog     /foobar/log/www/access_log common

  MonoServerPath www.foobar.com "/opt/mono-2.8.1/bin/mod-mono-server2"
  MonoDebug www.foobar.com true
  MonoApplications www.foobar.com "/://foobar/http/www"
  MonoAutoApplication disabled
  AddHandler mono .aspx .ascx .asax .ashx .config .cs .asmx .axd

  <Location "/">
    Allow from all
    Order allow,deny
    MonoSetServerAlias www.foobar.com
    SetHandler mono
  </Location>
</VirtualHost>

The equivalent nginx config in /etc/nginx/conf.d/foobar.conf would look like this:

server {
  server_name  www.foobar.com;
  access_log   /foobar/log/www/nginx.access.log;

  location / {
    root /foobar/http/www;
    index index.html index.htm default.aspx Default.aspx;
    fastcgi_index /;
    fastcgi_pass 127.0.0.1:9000;
    include /etc/nginx/fastcgi_params;
  }
}

Now we need to set up the fastcgi server:

fastcgi-mono-server4 /applications=/:/foobar/http/www/ /socket=tcp:127.0.0.1:9000

and finally we can start nginx:

/etc/init.d/nginx start

Voila, ASP.NET MVC2 under nginx. This may have other issues, but i have not yet observed them, so this seems to be a way to get around the mod_mono issues.

Of course that's a bit cumbersome. What we really need is an init script so we can start and stop teh fastcgi server like other services:

#!/bin/sh

# chkconfig:   - 85 15
# description:  Fast CGI mono server
# processname: fastcgi-mono-server2.exe

PATH=/usr/local/sbin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin:/opt/mono-2.8.1/bin
DESC=fastcgi-mono-server2

WEBAPPS="/:/foobar/http/www/"
LISTENER="tcp:127.0.0.1:9000"

MONOSERVER=/opt/mono-2.8.1/bin/fastcgi-mono-server2
MONOSERVER_PID=$(ps auxf | grep "${LISTENER}" | grep -v grep | awk '{print $2}')

case "$1" in
        start)
                if [ -z "${MONOSERVER_PID}" ]; then
                        echo "starting mono server"
                        ${MONOSERVER} /applications=${WEBAPPS} /socket=${LISTENER} &
                        echo "mono server started"
                else
                        echo ${WEBAPPS}
                        echo "mono server is running"
                fi
        ;;
        stop)
                if [ -n "${MONOSERVER_PID}" ]; then
                        kill ${MONOSERVER_PID}
                        echo "mono server stopped"
                else
                        echo "mono server is not running"
                fi
        ;;
esac

exit 0

And now we can start and stop fastcgi properly.

After all that, i'll likely not use it

While this takes care of my ASP.NET troubles, it now means that I'd have to migrate the various php packages over as well. Wordpress is no problem, but OpenCart would be a bit of hacking, which is really the last thing I want to do when it comes to ecom.I thought about running both nginx and apache and using one to proxy the sites on the other (since EC2 won't let me attach multiple IPs to a single host), but decided against that as well, since it would just be a hack of a different color. There's also the option of running fastcgi against apache, but I've not found any docs on how to set up ASP.NET MVC that way, all the existing examples map ASP.NET file extensions to fastcgi, which isn't an option.

Apache is still the most supported solution, so when integrating a number of sites on a single host, it ends up being the best option. It's just that mod_mono doesn't seem to be playing along for me :( So, I hatched a scheme to rid myself of ASP.NET for this site, since it really only has trivial business logic and I have a holiday coming up. More on that later.

More on .ToList() vs. .ToArray()

Like my last post, "Materializing an Enumerable" this may be a bit academic, but as a linq geek, whether I should use .ToList() or .ToArray() is something the piques my curiosity. Most of the time when I return IEnumerable<T> i want it to be in a threadsafe manner, i.e. i don't want the list to change underneath the iterator, so I return a unique copy. For this I have always used .ToArray(), since it's immutable and I figured it was leaner.

Finally having challenged this assumption, it turns out that .ToList() is theoretically faster for sources that are not ICollection<T>. When the final count is known, as is the case with ICollection<T>, both .ToList() and .ToArray() create an array under the hood for storage that is sufficiently large and copy the source into the destination. When the count isn't known however, both allocate an array and write to it, copying the contents to a larger array anytime the size is exceeded. So far, both are nearly identical in execution. However, once the end of the source is reached, .ToList() is done, while .ToArray() does one more copy to return a properly sized array. Of course, the overhead of iterating on that source, which is more than likely hitting some I/O or Computation barrier, means that in terms of measurable performance difference, again, both are identical.

It is still true that a List<T> object uses more memory than an T[], but even that difference is almost always going to be irellevant as the collections size is insignificant compared to the items it contains. That means that using .ToList() or .ToArray() to create an IEnumerable<T> is really a matter of personal preference.

Materializing an Enumerable

Yesterday I posted the question "Is there a way to Memorize or Materialize an IEnumerable?" on stackoverflow, hoping that there was already a built in way in the BCL. The answers and comments showed, that there wasn't but also challenged my existing assumptions as well as illustrated that materializing and/or memorizing could be interpreted in a number of ways. I figured that amount of ambiquity required a deeper dive into the subject.

What's this for anyway?

I try to use IEnumerable<T> as the return value for any method that is supposed to return a sequence meant purely for consumption. I choose IEnumerable<T> over an array or list because T[] exposes an unneeded implementation details while returning IList<T> or ICollection<T> allow modification of the sequence which is almost always undesirable behavior. And that doesn't even address that the enumerable might be a stream of items coming from an external source like a database cursor, a file stream or from executing a linq AST.

The drawback of this is that making multiple calls on an IEnumerable<T> that enumerate it under the hood may either incur a large cost, in the case of executing a linq AST repeatedly, or fail, in the case of stream or cursor. In order to be able to do something like the below, you really want to be certain that you have a finite sequence to query:

if(enumberable.Any() {
  foreach(var item in enumerable) {
    ...
  }
} else {
  ...
}

.Any() has to get an enumerator and call .MoveNext() once to see if it returns true and foreach, of course, gets the enumerator and iterates over it until the end. In order to safely write the above code, you really want the IEnumerable<T> converted into a computed collection.

The usual solution is to just call either .ToList() or .ToArray() and be done with it. But both have undesirable side-effects. Both will always create a new copy of the collection, which may have a non-insignifcant cost. And both change the type from IEnumerable<T>. Sure you can cast it back, but because both are not idempotent, casting to IEnumerable<T> hides the only clue that you don't want to call .ToList()/.ToArray() again. In addition, .ToList() also produces a mutable collection.

Most of the time, none of these side-effects are significant detractors, but should you return the memorized version from a method, you probably would want to cast it back to IEnumerable<T> and then the cost of this behavior can start to add up. Having a method that lets you memorize or materialize in an idempotent fashion would be useful.

Memorize()

What is the expected behavior of .Memorize()? It should capture the current state of sequence at the time of call and return an immutable sequence and it should force that sequence into memory so that multiple enumerations are relatively cheap. This one is fairly simple to implement:

public static IEnumerable Memorize(this IEnumerable enumerable) { return enumerable.GetType().IsArray ? enumerable : enumerable.ToArray(); }

Arrays are already immutable sequences, so we can use them reliably as our memorized collection. And if the source already is an array, we can safely return it unmodified. Now we can pass the resultant enumerable arround without concern that someone else calling .Memorize() again needlessly copies it.

Materialize()

Unlike .Memorize(), .Materialize() does not imply that the enumerable becomes a private, immutable copy. It only wants to make certain that the type can be safely enumerated. This lesser requirement actually complicates the idempotency scenario, requiring a internmediate collection class to be created:

public static class LinqEx {

    public static IEnumerable<T> Materialize<T>(this IEnumerable<T> enumerable) {
        if(enumerable is MaterializedEnumerable<T> || enumerable.GetType().IsArray) {
            return enumerable;
        }
        return new MaterializedEnumerable<T>(enumerable);
    }

    private class MaterializedEnumerable<T> : IEnumerable<T> {
        private readonly ICollection<T> _collection;
        public MaterializedEnumerable(IEnumerable<T> enumerable) {
            _collection = enumerable as ICollection<T> ?? enumerable.ToArray();
        }

        public IEnumerator<T> GetEnumerator() {
            return _collection.GetEnumerator();
        }

        IEnumerator IEnumerable.GetEnumerator() {
            return GetEnumerator();
        }
    }
}

The purpose of MaterializedEnumerable<T> is as marker for a previous materialization that can wrap or coerce a collection, so that no unnecessary copying is done.

A word on the use of .ToArray() instead of .ToList(): I've always leaned towards .ToArray(), both because it creates an immutable collection and because I thought arrays to be more lightweight than lists. After cracking them both open in Reflector, it became apparent that they should be about the same and confirmed that there is no significant difference with some simple tests.

While memorize and materialize have subtly different meaning, both intending to optimize access to an enumerable idempotently, in day to day use simply using .ToArray() will usually be just fine.

Func/Action vs. Delegate

A while back I wrote that you really never have to write another delegate again, since any delegate can easily be expressed as an Action or Func. After all what's preferable? This:

var work = worker.ProcessTaskWithUser(delegate(Task t, User u) {
  // define the work callback
});

or this:

var work = worker.ProcessTaskWithUser((t, u) => {
  // define the work callback
});

I know I prefer lambda's over delegates. But this is just on the consuming end. The signature for the above could be either:

delegate Task TaskUserDelegate(Task inputTask, User contextUser);
IEnumerable<Task> ProcessTaskWithUser( TaskUserDelegate processCallback );

or:

IEnumerable<Task> ProcessTaskWithUser( Func<Task,User,Task> processCallback );

Either one can be used with the same lambda, so using the delegate doesn't inconvenience us in consumption. But writing the Func version is certainly more concise so it seems like the winner once again. But In terms of consumption of that API, we've lost the signature of the method which would explain what each parameter is used for. Sure, .Where(Func<T,bool> filter) is pretty self-explanatory, but .WhenDone(Func<T,V,string,T> callback) really doesn't tell us much of anything.

So there seems to be straight forward usability rule of thumb: Use a delegate if the parameter's meaning isn't obvious from the usage of the lambda. But if the goal here is to make it easier for the consumer of the API, unfortunately it's not that simple, since the primary tool for communicating the API's documentation, intellisense, actually makes things worse.

Usability of delegate

For maximum usability, let's document the the API so it's meaning is discoverable:

/// <summary>
/// The task user delegate is meant to transform a given task into a new one in the context of a user.
/// </summary>
/// <param name="inputTask">The task to transform.</param>
/// <param name="activeUser">The user context to use for the transform.</param>
/// <returns>A new task in the user's context.</returns>
delegate Task TaskUserDelegate(Task inputTask, User activeUser);

/// <summary>
/// Transform all tasks for a set of users.
/// </summary>
/// <param name="processCallback">Callback for transforming each task for a specific user</param>
/// <returns>Sequence of transformed tasks</returns>
IEnumerable<Task> ProcessTaskWithUser(TaskUserDelegate processCallback) {
    //...
}

And this is what it looks like on code completion:

While TaskUserDelegate is well documented, this does not get exposed via intellisense. Worse, this signature tells us nothing about the arguments for our lambda. So, yeah, we created a better documented API, but made it's discovery worse.

Usability of func

Now, let's do the same for the func signature:

/// <summary>
/// Transform all tasks for a set of users.
/// </summary>
/// <param name="processCallback">Callback for transforming each task for a specific user</param>
/// <returns>Sequence of transformed tasks</returns>
IEnumerable<Task> ProcessTaskWithUserx(Func<Task, User, Task> processCallback) {
    //...
}

which gives us this completion:

Now we at least know the exact signature of the lambda we're creating, even if we don't know what the purpose of the arguments is.

Best usability: Plain Old Documentation

In both cases, the best discoverability ends up being plain old textual documentation of the parameter and even though the delegate provides extra documentation possibilities, their access is not convenient that for expediency i'd still have to vote for the Func signature.

The one exception to the rule would be a lambda that is meant as a dependency. I.e. a class or method that has a callback that you attach for later use, rather than immediate dispatch. In that case the lambda really functions as a single method interface and should be treated like any other dependency and be as explicit as possible.

Happy New Year, part II

Now that I have last year's progress out of the way, let's examine what I'd like to accomplish outside my regular work and play development.

resolution: javascript, again

Since last year ended up being an academic exercise in learning javascript, that task is still outstanding. But it is no less important. I consider javascript to be the one language I cannot afford not being good at. It is truly the assembly language of the web and is rapidly gaining ground on the server as well. I may or may not warm up to going back to a dynamically typed language but that's really independent of the need represented by this gap in expertise.

In many ways, javascript provides a lot of the features I am aiming for with promise. It follows the same pattern of methods being lambda's attached to slots on objects. Although syntax-wise, CoffeeScript is an even better match -- getting rid of the overly verbose function prefix for lambda's among many other cool changes.

resolution: Use Scala/Akka for a real world project

I still doubt that javascript will become my new favorite language, simply because of my strongly-typed tendencies. But being a C# programmer i'm in kind of a weird space. I don't much care for windows as client and abhor it as a server. So almost everything I do in C# runs under mono. I admire Miguel DeIcaza's relentless drive for creating the best environment possible regardless of detractors and am continually amazed at the quality and completeness of mono. That said being a C# advocate in the linux world is asking for additional pain. Mono will always be trailing MS's implementation by a bit and for your troubles you do end up being a pariah in the linux world. Finding a language that is more at home and accepted on my favorite platform would be beneficial. For a long time, java and C# were just close enough that in the worst case scenario, I could always go java again. But now that i'm used to C#'s lambda syntax and linq, java just feels ancient and dead to me.

Of all the languages I've looked at Scala hits my personal feature bingo the best. I love the actor pattern and built it using Xmpp in C# for notify.me. From my sideline review Akka seems to be the actor implementation to beat, so picking a project and implementing it start to finish in Scala/Akka seems like the way to go. After that I should have enough of a feel for the language to see whether it's a contender for my C# affections

resolution: Release an App for iPhone, Android and Windows Phone 7

To stick with the common thread, I think that going forward in the mobile space, javascript once again is going to be the most important tool in the development toolkit. But at the same time, I am a sucker for native clients and am happy that the current crop of smartphones have revived writing client side software.

However, the last time I did mobile client programming was WM5, so I have some serious catching up to do. The goal here is to pick a useful app and write it for all three of the above platforms and release it. I'm going to stick somewhat to my comfort zone here by using C#, the default on Windows Phone 7 and enabled by MonoTouch and MonoDroid on the other two. The departure from my comfort zone is venturing back to the client after spending almost all my time on the server and figuring out that the re-use vs. platform specific stories are and what deployment looks like. I've not settled on an app, but most likely it will be a native notify.me client.

Those resolutions should keep me busy enough, especially since they are spare time activities for when I'm not busy working on MindTouch and Dream, extending notify.me or maintaining curdsandwine.com.