Like my last post, "Materializing an Enumerable," this may be a bit academic, but as a linq geek, whether I should use .ToList() or .ToArray() is something that piques my curiosity. Most of the time when I return IEnumerable<T> I want to do so in a threadsafe manner, i.e. I don't want the list to change underneath the iterator, so I return a unique copy. For this I have always used .ToArray(), since it's immutable and I figured it was leaner.
Finally having challenged this assumption, it turns out that .ToList() is theoretically faster for sources that are not ICollection<T>. When the final count is known, as is the case with ICollection<T>, both .ToList() and .ToArray() create an array under the hood for storage that is sufficiently large and copy the source into the destination. When the count isn't known however, both allocate an array and write to it, copying the contents to a larger array anytime the size is exceeded. So far, both are nearly identical in execution. However, once the end of the source is reached, .ToList() is done, while .ToArray() does one more copy to return a properly sized array. Of course, the overhead of iterating on that source, which is more than likely hitting some I/O or Computation barrier, means that in terms of measurable performance difference, again, both are identical.
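To make the difference concrete, here is a minimal sketch of that unknown-count growth behavior — an approximation of the algorithm described above, not the actual BCL code:

```csharp
using System;
using System.Collections.Generic;

static class GrowthSketch {
    // Simplified sketch of materializing a source whose count isn't known
    // up front (illustrative only, not the actual BCL implementation).
    public static T[] ToArraySketch<T>(IEnumerable<T> source) {
        var buffer = new T[4];
        var count = 0;
        foreach(var item in source) {
            if(count == buffer.Length) {
                // size exceeded: copy contents into a larger array
                Array.Resize(ref buffer, buffer.Length * 2);
            }
            buffer[count++] = item;
        }
        // .ToList() can stop here, keeping the oversized buffer and tracking
        // 'count' internally; .ToArray() pays for one final right-sized copy.
        var result = new T[count];
        Array.Copy(buffer, result, count);
        return result;
    }
}
```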
It is still true that a List<T> object uses more memory than a T[], but even that difference is almost always going to be irrelevant, as the collection's size is insignificant compared to the items it contains. That means that using .ToList() or .ToArray() to create an IEnumerable<T> is really a matter of personal preference.
Yesterday I posted the question "Is there a way to Memorize or Materialize an IEnumerable?" on stackoverflow, hoping that there was already a built-in way in the BCL. The answers and comments showed that there wasn't, but they also challenged my existing assumptions and illustrated that materializing and/or memorizing could be interpreted in a number of ways. I figured that amount of ambiguity required a deeper dive into the subject.
I try to use IEnumerable<T> as the return value for any method that is supposed to return a sequence meant purely for consumption. I choose IEnumerable<T> over an array or list because T[] exposes unneeded implementation details while returning IList<T> or ICollection<T> allows modification of the sequence, which is almost always undesirable behavior. And that doesn't even address that the enumerable might be a stream of items coming from an external source like a database cursor, a file stream or from executing a linq AST.
The drawback of this is that making multiple calls on an IEnumerable<T> that enumerate it under the hood may either incur a large cost, in the case of executing a linq AST repeatedly, or fail, in the case of a stream or cursor. In order to be able to do something like the below, you really want to be certain that you have a finite sequence to query:
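The original code sample isn't reproduced in this excerpt; a hypothetical example of the pattern I mean, which enumerates the same IEnumerable<T> twice, would be:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class Consumer {
    // Hypothetical consuming code: .Any() enumerates 'items' once,
    // and the foreach then enumerates it a second time.
    public static int ProcessAll(IEnumerable<string> items) {
        var processed = 0;
        if(items.Any()) {
            foreach(var item in items) {
                Console.WriteLine(item); // stand-in for real work
                processed++;
            }
        }
        return processed;
    }
}
```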
.Any() has to get an enumerator and call .MoveNext() once to see if it returns true, and foreach, of course, gets the enumerator and iterates over it until the end. In order to safely write the above code, you really want the IEnumerable<T> converted into a computed collection.
The usual solution is to just call either .ToList() or .ToArray() and be done with it. But both have undesirable side-effects. Both will always create a new copy of the collection, which may have a non-trivial cost. And both change the type from IEnumerable<T>. Sure, you can cast it back, but because neither is idempotent, casting to IEnumerable<T> hides the only clue that you don't want to call .ToList()/.ToArray() again. In addition, .ToList() also produces a mutable collection.
Most of the time, none of these side-effects are significant detractors, but should you return the memorized version from a method, you probably would want to cast it back to IEnumerable<T>, and then the cost of this behavior can start to add up. Having a method that lets you memorize or materialize in an idempotent fashion would be useful.
What is the expected behavior of .Memorize()? It should capture the current state of sequence at the time of call and return an immutable sequence and it should force that sequence into memory so that multiple enumerations are relatively cheap. This one is fairly simple to implement:
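The implementation isn't included in this excerpt; given the description that follows, a sketch of it might look like this (the extension class name is mine):

```csharp
using System.Collections.Generic;
using System.Linq;

public static class MemorizeEx {
    // Sketch of an idempotent .Memorize(): if the source is already an
    // array, return it as-is; otherwise force it into memory exactly once.
    public static IEnumerable<T> Memorize<T>(this IEnumerable<T> enumerable) {
        return enumerable is T[] ? enumerable : enumerable.ToArray();
    }
}
```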
Arrays are already immutable sequences, so we can use them reliably as our memorized collection. And if the source already is an array, we can safely return it unmodified. Now we can pass the resultant enumerable around without concern that someone else calling .Memorize() again needlessly copies it.
Unlike .Memorize(), .Materialize() does not imply that the enumerable becomes a private, immutable copy. It only wants to make certain that the type can be safely enumerated. This lesser requirement actually complicates the idempotency scenario, requiring an intermediate collection class to be created:
using System.Collections;
using System.Collections.Generic;
using System.Linq;

public static class LinqEx {

    public static IEnumerable<T> Materialize<T>(this IEnumerable<T> enumerable) {
        if(enumerable is MaterializedEnumerable<T> || enumerable.GetType().IsArray) {
            return enumerable;
        }
        return new MaterializedEnumerable<T>(enumerable);
    }

    private class MaterializedEnumerable<T> : IEnumerable<T> {
        private readonly ICollection<T> _collection;

        public MaterializedEnumerable(IEnumerable<T> enumerable) {
            _collection = enumerable as ICollection<T> ?? enumerable.ToArray();
        }

        public IEnumerator<T> GetEnumerator() {
            return _collection.GetEnumerator();
        }

        IEnumerator IEnumerable.GetEnumerator() {
            return GetEnumerator();
        }
    }
}
The purpose of MaterializedEnumerable<T> is to act as a marker for a previous materialization that can wrap or coerce a collection, so that no unnecessary copying is done.
A word on the use of .ToArray() instead of .ToList(): I've always leaned towards .ToArray(), both because it creates an immutable collection and because I thought arrays to be more lightweight than lists. After cracking them both open in Reflector, it became apparent that they should perform about the same, and some simple tests confirmed that there is no significant difference.
While memorize and materialize have subtly different meanings, both intend to optimize access to an enumerable idempotently; in day-to-day use, simply calling .ToArray() will usually be just fine.
A while back I wrote that you really never have to write another delegate again, since any delegate can easily be expressed as an Action or Func. After all, what's preferable? This:
var work = worker.ProcessTaskWithUser(delegate(Task t, User u) {
    // define the work callback
});
or this:
var work = worker.ProcessTaskWithUser((t, u) => {
    // define the work callback
});
I know I prefer lambdas over delegates. But this is just on the consuming end. The signature for the above could be either:
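The two signatures aren't shown in this excerpt, but matching the delegate and Func versions documented later in the post, they would look roughly like this (IWorker and the stub Task/User types are mine, for illustration):

```csharp
using System;
using System.Collections.Generic;

class Task { }
class User { }

// Option 1: a named delegate type carries the parameter names
delegate Task TaskUserDelegate(Task inputTask, User activeUser);

interface IWorker {
    // delegate-based signature
    IEnumerable<Task> ProcessTaskWithUser(TaskUserDelegate processCallback);
    // Func-based signature (renamed here only to avoid overload ambiguity)
    IEnumerable<Task> ProcessTaskWithUserFunc(Func<Task, User, Task> processCallback);
}
```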
Either one can be used with the same lambda, so using the delegate doesn't inconvenience us in consumption. And writing the Func version is certainly more concise, so it seems like the winner once again. But in terms of consuming that API, we've lost the signature of the method, which would explain what each parameter is used for. Sure, .Where(Func<T,bool> filter) is pretty self-explanatory, but .WhenDone(Func<T,V,string,T> callback) really doesn't tell us much of anything.
So there seems to be a straightforward usability rule of thumb: use a delegate if the parameters' meaning isn't obvious from the usage of the lambda. But if the goal here is to make it easier for the consumer of the API, unfortunately it's not that simple, since the primary tool for communicating the API's documentation, intellisense, actually makes things worse.
For maximum usability, let's document the API so its meaning is discoverable:
/// <summary>
/// The task user delegate is meant to transform a given task into a new one in the context of a user.
/// </summary>
/// <param name="inputTask">The task to transform.</param>
/// <param name="activeUser">The user context to use for the transform.</param>
/// <returns>A new task in the user's context.</returns>
delegate Task TaskUserDelegate(Task inputTask, User activeUser);

/// <summary>
/// Transform all tasks for a set of users.
/// </summary>
/// <param name="processCallback">Callback for transforming each task for a specific user.</param>
/// <returns>Sequence of transformed tasks.</returns>
IEnumerable<Task> ProcessTaskWithUser(TaskUserDelegate processCallback) {
    //...
}
And this is what it looks like on code completion:
While TaskUserDelegate is well documented, this does not get exposed via intellisense. Worse, this signature tells us nothing about the arguments for our lambda. So, yeah, we created a better documented API, but made its discovery worse. Let's try the Func version instead:
/// <summary>
/// Transform all tasks for a set of users.
/// </summary>
/// <param name="processCallback">Callback for transforming each task for a specific user.</param>
/// <returns>Sequence of transformed tasks.</returns>
IEnumerable<Task> ProcessTaskWithUserx(Func<Task, User, Task> processCallback) {
    //...
}
which gives us this completion:
Now we at least know the exact signature of the lambda we're creating, even if we don't know what the purpose of the arguments is.
In both cases, the best discoverability ends up being plain old textual documentation of the parameter, and even though the delegate provides extra documentation possibilities, access to it is inconvenient enough that for expediency I'd still have to vote for the Func signature.
The one exception to the rule would be a lambda that is meant as a dependency, i.e. a class or method that has a callback that you attach for later use rather than immediate dispatch. In that case the lambda really functions as a single-method interface and should be treated like any other dependency and be as explicit as possible.
Since last year ended up being an academic exercise in learning javascript, that task is still outstanding. But it is no less important. I consider javascript to be the one language I cannot afford not to be good at. It is truly the assembly language of the web and is rapidly gaining ground on the server as well. I may or may not warm up to going back to a dynamically typed language, but that's really independent of the need represented by this gap in expertise.
In many ways, javascript provides a lot of the features I am aiming for with Promise. It follows the same pattern of methods being lambdas attached to slots on objects. Although syntax-wise, CoffeeScript is an even better match -- getting rid of the overly verbose function prefix for lambdas, among many other cool changes.
I still doubt that javascript will become my new favorite language, simply because of my strongly-typed tendencies. But being a C# programmer, I'm in kind of a weird space. I don't much care for windows as a client and abhor it as a server. So almost everything I do in C# runs under mono. I admire Miguel de Icaza's relentless drive for creating the best environment possible regardless of detractors and am continually amazed at the quality and completeness of mono. That said, being a C# advocate in the linux world is asking for additional pain. Mono will always be trailing MS's implementation by a bit, and for your troubles you do end up being a pariah in the linux world. Finding a language that is more at home and accepted on my favorite platform would be beneficial. For a long time, java and C# were just close enough that in the worst case scenario, I could always go java again. But now that I'm used to C#'s lambda syntax and linq, java just feels ancient and dead to me.
Of all the languages I've looked at, Scala hits my personal feature bingo the best. I love the actor pattern and built it using Xmpp in C# for notify.me. From my sideline review, Akka seems to be the actor implementation to beat, so picking a project and implementing it start to finish in Scala/Akka seems like the way to go. After that I should have enough of a feel for the language to see whether it's a contender for my C# affections.
To stick with the common thread, I think that going forward in the mobile space, javascript once again is going to be the most important tool in the development toolkit. But at the same time, I am a sucker for native clients and am happy that the current crop of smartphones has revived writing client-side software.
However, the last time I did mobile client programming was WM5, so I have some serious catching up to do. The goal here is to pick a useful app, write it for all three of the above platforms and release it. I'm going to stick somewhat to my comfort zone here by using C#, the default on Windows Phone 7 and enabled by MonoTouch and MonoDroid on the other two. The departure from my comfort zone is venturing back to the client after spending almost all my time on the server, figuring out what the re-use vs. platform-specific stories are and what deployment looks like. I've not settled on an app, but most likely it will be a native notify.me client.
Those resolutions should keep me busy enough, especially since they are spare time activities for when I'm not busy working on MindTouch and Dream, extending notify.me or maintaining curdsandwine.com.
I was just reading David Tchepak's inspiring new year's resolution and also re-read his linked article "There is no U in Collective Ownership", and I thought that it sounded very much like an adaptation of Isaac Asimov's First Law of Robotics. So, just for fun, I wondered if all three laws couldn't be adapted as laws for practicing coders:
1. A Coder may not harm the code base or, through inaction, allow the code base to come to harm.
2. A Coder must follow the guidance set up for the code base except where such practices would conflict with the First Law.
3. A Coder must protect their livelihood as long as such protection does not conflict with the First or Second Law.
While this is mostly in jest, I do think this works as a pretty good set of guidelines. The overarching principle here is that code quality must be the highest goal, followed by consistent application of team standards and finally, that not wanting to rock the boat is no excuse for letting quality slip.
Of course, the hard and subjective term here is code quality. It's easy to slip into academic exercises of aesthetics and call them quality, just as it is easy to label horrible spaghetti code as pragmatism. My personal test for quality is best expressed by the following questions:
It's the beginning of a new year and you know what that means: Public disclosure of measurable goals for the coming year. I've not made a New Year's Resolution post before, which of course means that not living up to them was slightly less embarrassing. Well, this year, I'm going to go on the record with some professional development goals. I consider the purpose of these goals to be exercises of taking myself out of my comfort zone. Simply resolving to do something that is a logical extension of my existing professional development road map is just padding. But before I get into enumerating my resolutions for 2011, let's see how I fared on those undisclosed ones from last year.
By the end of 2009, I had only used git to the minimum extent required to get code from github. I knew I didn't like svn, because I was a big branch advocate and it just sucked at all related tasks, like managing, merging, re-merging and re-branching branches. I had been using perforce for years and considered it the pinnacle of revision control because of its amazing branch handling and excellent UI tooling. I also got poisoned against git early when I watched Linus assign the sins of svn to all non-distributed version control systems in his googletalk on git. I knew this was irrational and eventually I would need to give git a chance to stand on its merits. But the only way that was going to happen was by going git cold turkey and forcing myself to use it until I was comfortable with it. That happened on January 1st, 2010. I imported all my perforce repos into git and made the switch. I also started using git on top of all projects that were in svn that I couldn't change, keeping my own local branches and syncing/merging back into svn periodically. This latter workflow has been amazingly productive and gives me far greater revision granularity, since I constantly commit WIP to my local branches that wouldn't be fit for a shared SVN trunk.
One other aspect of DVCS that had kept me from it was that I consider version control both my work history and my offsite backup. So I probably still push a lot more than most git folks. Sure, I've only lost work once due to disk failure vs. several times because of ill-considered disk operations or lack of appropriate rollback points, but I also work on a number of machines, and religiously pushing and pulling lets me move between machines more easily. Basically, I never leave my desk without committing and pushing, because I've been randomized by meetings or other occasions that sent me home before I'd made sure everything was pushed so I could work from home.
The last couple of years I've been doing toy projects in Ruby as an alternative to my daily C# work. But unlike seemingly everyone else, I never found it to be more fun than C#. Maybe it's because I used to be a dynamic language guy doing perl, and I became a static typing guy by choice. As dynamic languages go, Ruby doesn't really buy me anything over perl, which I'd worked with off and on for the last 15 years. And while the energy of the Ruby community is amazing, too much of that energy seems to be devoted to re-discovering patterns and considering them revolutionary inventions.
Javascript, on the other hand, offered something over C# other than just being a dynamic language. It was a language that could be used efficiently on both client and server. That was compelling (and was the same reason why I liked Silverlight as a developer, although I never considered it viable for use on the web). Up until last year, I used javascript like so many server-side programmers: only in anger. I knew enough to write crappy little validation and interactivity snippets for web pages, but tried to keep all real logic on the server where I was most comfortable. When I did venture into javascript, I'd try to treat it like C# and hated it even more because I perceived it to be a crappy object-oriented language. But even then I understood that what I hated more than anything was the DOM and its inconsistencies, and that blaming javascript for those failures was misguided.
So in 2010 I was going to get serious about javascript, but initially went down the frustrating path of trying to treat javascript like other OO languages I knew. It wasn't until I watched Douglas Crockford's InfoQ talk "The State and Future of Javascript" that it clicked. What I called object-oriented was really a sub-species called class-oriented. If I was to grok and love javascript, I needed to meet it on its own turf.
In the end, 2010 never went beyond lots of reading, research and little toy projects. I should have committed to solving a real problem without flinching. While my understanding of javascript is pretty good now on an academic level, I certainly didn't get serious.
It wasn't as much a new lesson learned as a re-affirmation of an old axiom: learning something to the extent that you can truly judge its merits and become more than just proficient requires immersion. Casual use just doesn't build up the muscle memory and understanding required to reason in the context of the subject. If you don't immerse yourself, your use of that subject will always be one of translation from your comfort zone into the foreign concept, and like all translations, things are only likely to get lost in the process.
I generally stay away from anything that's not in rpm form, because playing the configuration/build game is just such a waste of time, especially when trying to come up with a repeatable recipe. Unfortunately, there are no rpm's for 2.8.x, 2.6.7 being the latest on the mono-project site.
I don't have a clue about building rpm's from source and have not found a build recipe either, but if one exists, I'd be more than happy to start building and distributing current rpm's for mono, since I don't feel like building from source on every new machine and having that machine's config fall out of the management purview of yum/rpm.
The purpose of building 2.8.1 is running ASP.NET MVC2. In theory this should work under 2.6.7, but there are a couple of bugs that make it not quite usable for production purposes. While I develop in Visual Studio, I don't want to commit binaries to my git repo, and I want the ability to make fixes with emacs on the development server. To make this work, I also include xbuild as part of the recipe, since it lets me build the Visual Studio solution.
Anyway, here's the chronicle of my install. This is less an article than a log of what I did to get mono 2.8.1 installed from source on a fresh Amazon Linux AMI 1.0, so it may include things you don't need that were dead-ends I went down. I started by looking at Nathan Bridgewater's Ubuntu & Fedora recipe.
cd /tmp
mkdir mono-2.8.1
cd mono-2.8.1
wget http://ftp.novell.com/pub/mono/sources/xsp/xsp-2.8.1.tar.bz2
wget http://ftp.novell.com/pub/mono/sources/mod_mono/mod_mono-2.8.tar.bz2
wget http://ftp.novell.com/pub/mono/sources/mono/mono-2.8.1.tar.bz2
wget http://ftp.novell.com/pub/mono/sources/libgdiplus/libgdiplus-2.8.1.tar.bz2
tar jxf libgdiplus-2.8.1.tar.bz2
tar jxf mod_mono-2.8.tar.bz2
tar jxf mono-2.8.1.tar.bz2
tar jxf xsp-2.8.1.tar.bz2
cd libgdiplus-2.8.1
./configure --prefix=/opt/mono-2.8.1   # ~30s
make
make install
cd ..
cd mono-2.8.1
./configure --prefix=/opt/mono-2.8.1   # ~2m
make
make install
cd ..
In order to build xsp, we need to configure the environment to find mono:
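The exact environment setup isn't captured in this log excerpt; something along these lines (paths derived from the --prefix used above) should let xsp's configure find the freshly built mono:

```shell
# make the new mono toolchain and its pkg-config metadata discoverable
# (paths match the --prefix=/opt/mono-2.8.1 used above)
export PATH=/opt/mono-2.8.1/bin:$PATH
export PKG_CONFIG_PATH=/opt/mono-2.8.1/lib/pkgconfig
```

After that, xsp and mod_mono build with the same configure/make/make install sequence as the other packages.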
I did this build on an Amazon EC2 micro instance, and it took 3 hours. So it might be worth attaching the EBS volume to an EC2 High-CPU instance, building there, and then re-attaching it to a micro. But you can see why I prefer rpm's. Sure, I can zip up the build directories and run make install on another identical machine, but the make install on the micro instance for mono alone took 20 minutes. Compare that to the 2.6.7 rpm install, which took me 42 seconds, including downloading via yum from novell.
To back up my wordpress install, I used to do a mysqldump of the DB, bzip it, separately tar/bzip the install directory and rsync the timestamped files to my backup server. This worked well and was a simple enough setup. A couple of days ago, however, I decided to update wordpress and all extensions and noticed that some of my manual tweaks were gone.
Now, I have to say that the install/upgrade system of wordpress is one of the simplest and most trouble-free I've used in both commercial and free software -- this is not a complaint about that system. Clearly I manually tweaked files, which means I inherited the responsibility of migrating those tweaks forward. But it did get me thinking about a different upgrade strategy. Even though I had the backups to determine the changes I had to re-apply, the process was annoying: untar the backup and manually run diffs between live and backup. The task reminded me too much of how I deal with code changes, so why wasn't I treating this like code?
Sure, there's some editing, but most of it is already revisioned, which makes it append-only. That means storing all wordpress assets, including the DB dump, makes it an ideal candidate for revision control. Most of the time, the only changes are additions, and if files are actually changed, it represents an update of an existing system and should be tracked with change history. If the live install was under revision control, you could just run the upgrade, do a diff to see the local changes, tweak/revert them one at a time, then commit the finished state.
Inside backup, I kept the tar'ed-up copies of the wordpress directory with a date suffix, as well as dated, bzipped mysqldumps of the DB. I generally deleted older backups after rsync, since on the live server I only cared about having the most recent backup.
The first thing I did was create a mirror of this hierarchy. I used an old wordpress tar for the wordpress directory and bunzipped the db archive from the same date as iloggable.sql as the only file in backup, since for the git repo I no longer needed the other backups, only the most current database dump. I then ran git init inside of iloggable. I also created .gitignore with wordpress/wp-content/cache in it to avoid capturing the cache data.
I added all files and committed this state. I then unarchived the subsequent archives, copied them on top of the mirror hierarchy and added/committed those files in succession. At this point I had created a single git repo of the backup history I had. Now I could review previous states via git log and git diff.
Finally, I copied the .git directory into the live hierarchy. Immediately, doing git status showed me the changes on production since the last backup. I deleted all the old files and again added/committed the resulting "changes". That gave me a git repo of the current state, which I pushed to my backup server.
Now, my backup script just overwrites the current mysqldump, does a git commit -a -m "$timestamp" and a git push. From now on, as I do tweaks or upgrade wordpress, I can do a git commit before and after, and I have an exact change history for the change.
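As a sketch, the resulting backup script boils down to something like this (the install path, DB name and credentials are placeholders):

```shell
#!/bin/sh
# Sketch of the nightly wordpress backup: refresh the DB dump in place,
# commit everything, push to the backup remote.
# /var/www/iloggable, the DB name and the credentials are placeholders.
backup_wordpress() {
    cd /var/www/iloggable || return 1
    mysqldump --user=backup --password=secret iloggable > backup/iloggable.sql
    git add -A
    git commit -a -m "$(date +%Y-%m-%d-%H%M)"
    git push origin master
}
```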
See what I did there? Err... yeah, I know it's horrible. I apologize.
I did want to post an update about Promise, since I've gone radio-silent since I finished up my series about features and syntax. I've started a deep dive into the DLR, but mostly got sidetracked with learning antlr, since I don't want to build the AST by hand for testing. However, coming up with a grammar for Promise is the real stumbling block. So that's where I'm currently at, spending a couple of hours here and there playing with antlr and the grammar.
In the meantime, I've been doing more ruby coding over the last couple of weeks and even dove back into perl for some stuff, and the one thing I am now more sure of is that I find dynamically typed code incredibly tedious to work with. The lack of static analysis and even simple contracts turns dynamic programming into a task of memorization, terse syntax and text search/replace. I'm not saying that the static type systems of yore are not equally painful with the hoops you have to jump through to explain contracts with any flexibility to the compiler, but given the choice I tend towards that end. This experience just solidifies my belief that a type system like Promise's, i.e. types describe contracts, but classes aren't typed other than by their ability to quack appropriately, would be so much more pleasant.
I'm currently splitting up some MindTouch Dream functionality into client and server assemblies and running into a consistent pattern where certain classes have augmented behavior when running in the server context instead of the client context. Since the client assembly is used by the server assembly, and these classes are part of the client assembly, they do not even know about the concept of the server context. Which leads me to my dilemma: how do I inject different behavior into these classes when they run under the server context?
Ok, the short answer to this is probably "you're doing it wrong", but let's just go with it and accept that this is what's happening. Let's also assume that the server context can be discovered by static means (i.e. a static singleton accessor) which also lives in the server assembly.
Here's what I've come up with: extract the desired functionality into an interface and chain implementations together. Each class that needs this facility has to create its own interface and default implementation, which looks something like this:
public interface IFooHandler {
    ushort Priority { get; }
    bool TryFoo(string input, out string output);
}
TryFoo gives the handler implementation a chance to look at the inputs and decide whether to handle it, or whether to pass on it. The usage of collection of handlers takes the following form:
public string Foo(string input) {
    string output = null;
    _handlers.Where(x => x.TryFoo(input, out output)).First();
    return output;
}
This assumes that _handlers is sorted by priority. It returns the result of the first handler to report true on invocation. Building up _handlers happens in the static constructor:
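The static constructor isn't shown in this excerpt. Here's a sketch of what it might look like; the post leaves DiscoverImplementors and Instantiate as an exercise, so the reflection-based versions below are just one possible take, and the class name FooThing, the default handler and the descending priority order (so higher-priority handlers run first) are all my assumptions:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public interface IFooHandler {
    ushort Priority { get; }
    bool TryFoo(string input, out string output);
}

public static class TypeEx {
    // One possible implementation of the helper extension methods:
    // scan loaded assemblies for concrete implementors of the contract.
    public static IEnumerable<Type> DiscoverImplementors(this Type contract) {
        return AppDomain.CurrentDomain.GetAssemblies()
            .SelectMany(assembly => assembly.GetTypes())
            .Where(t => contract.IsAssignableFrom(t) && !t.IsInterface && !t.IsAbstract);
    }

    public static IEnumerable<T> Instantiate<T>(this IEnumerable<Type> types) {
        return types.Select(t => (T)Activator.CreateInstance(t));
    }
}

public class FooThing {
    // Sketch of the static wire-up: discover handlers, instantiate them,
    // and sort so that higher-priority handlers get first crack at the input.
    private static readonly List<IFooHandler> _handlers;

    static FooThing() {
        _handlers = typeof(IFooHandler)
            .DiscoverImplementors()
            .Instantiate<IFooHandler>()
            .OrderByDescending(x => x.Priority)
            .ToList();
    }

    public string Foo(string input) {
        string output = null;
        _handlers.Where(x => x.TryFoo(input, out output)).First();
        return output;
    }
}

// Hypothetical default handler so the sketch is self-contained
public class DefaultFooHandler : IFooHandler {
    public ushort Priority { get { return 0; } }
    public bool TryFoo(string input, out string output) {
        output = input.ToUpperInvariant(); // stand-in for real work
        return true; // the default always handles
    }
}
```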
where DiscoverImplementors and Instantiate are extension methods I'll leave as an exercise to the reader.
Now the server assembly simply creates its implementation of IFooHandler, gives it a higher priority, and on invocation checks its static accessor to see if it's running in the server context; if not, it lets the chain fall through to the next (likely the default) implementation.
This works just fine. I don't really like the static discovery of handlers, and if it weren't for backwards compatibility, I'd move the injection of handlers into a factory class and leave all that wire-up to IoC. Since that's not an option, I think this is the most elegant and performant solution.
It still feels clunky, though. Anyone have a better solution for changing the behavior of a method in an existing class that doesn't require the change to be in a subclass (since the instance will be of the base type)? Is there a pattern I'm overlooking?