Wednesday, September 9, 2009

Map / Reduce and Dependency Injection

I've had a number of good discussions lately about map / reduce frameworks like Hadoop and GridGain and how to manage (object and resource) dependencies within them. This is just a summarized rehash of those discussions, but hopefully in a slightly more organized format.

So you need / want to write map / reduce code. It's a whole new world out there. It's the in thing to do and all your friends are cooler than you. I get it. You fire up Hadoop and realize after a few rounds of edit, deploy, test that this is less than fun. GridGain solves this with the zero deployment model, serializing your map job and thawing it on the other side of the cluster, but you quickly run into fun with things that don't serialize cleanly. The question becomes: how do I get what I need into my map (or reduce) implementation?

Before we continue, it's worth noting that GridGain offers some support for (GridGain component and SPI) resource injection as well as some support for integrating with Spring. To be honest, I found it awkward, but I don't want to address that here. I'd like to give you something that works on Hadoop as well. (That said, it's worth checking out the deeper levels of GridGain as it is a great project that I still use daily.)

My preferred method of working on moderately complex code is to build business logic as POJOs and test them directly from JUnit. I may or may not manually inject data sources or configuration data (via simple setter method calls). Once things work, I wire them together in Spring configs (although one could easily pick some other DI container) and test the aggregate with an integration-oriented test suite. With everything working together, I box the result up into whatever deployable unit is appropriate. For me, this is usually a standalone daemon or a web service, but it could be OSGi or whatever environment you prefer.
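As a minimal sketch of that style: the class and interface names below (VisitClassifier, CountryLookup) are hypothetical and only illustrate a POJO whose one dependency arrives through a plain setter.

```java
// Hypothetical collaborator; a production version might be backed by a data source.
interface CountryLookup {
    // Returns a country code, or null when the address is not known.
    String countryFor(String ipAddress);
}

// Plain business logic: no Hadoop, Spring, or servlet types in sight.
class VisitClassifier {
    private CountryLookup lookup;

    // Simple setter: called by hand in a unit test, or by a DI container later.
    public void setLookup(CountryLookup lookup) {
        this.lookup = lookup;
    }

    public String classify(String ipAddress) {
        if (lookup == null) {
            throw new IllegalStateException("lookup has not been injected");
        }
        String country = lookup.countryFor(ipAddress);
        return country == null ? "unknown" : country;
    }
}
```

A unit test can inject a stub by hand, for example `classifier.setLookup(ip -> "US")`, and assert on `classify(...)` without starting any container.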

In many cases, there's some integration code to deal with the environment. In the case of a web service, I use Spring MVC and have a Spring config devoted to the MVC specific bits and the obligatory web.xml (against my will, usually). If I luck out and we're talking about Spring dm Server, that gets easier, but that's another post. You get the idea.

When we talk about something like Hadoop, we seem to treat it as a wildly different execution environment and start trying to find crazy ways to make resources available to Mappers or Reducers. I'll make it simple: wire up your Spring beans into a top level component that acts as a controller, instantiate the ApplicationContext from within your Mapper, get over the ick of calling context.getBean("myapp"), and invoke the appropriate methods. Revolutionary? Not at all. That's the point. Simple, straightforward, but somehow not something people think of doing.
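The shape of that idea can be sketched without either library on the classpath. Everything below is a stand-in and an assumption, not real Hadoop or Spring API: BeanLookup mimics only the getBean(name) call on an ApplicationContext, ContextBackedMapper mimics a Mapper with a once-per-task setup step, and LineProcessor is a hypothetical top level component registered as "myapp". In a real job the factory would build an actual Spring context.

```java
import java.util.List;
import java.util.function.Supplier;

// Stand-in for the one ApplicationContext call used here: getBean(name).
interface BeanLookup {
    Object getBean(String name);
}

// Hypothetical top level application component acting as the controller.
interface LineProcessor {
    String process(String line);
}

// Stand-in for a Mapper: setup() runs once per task, map() once per record.
class ContextBackedMapper {
    private final Supplier<BeanLookup> contextFactory;
    private LineProcessor app;

    ContextBackedMapper(Supplier<BeanLookup> contextFactory) {
        this.contextFactory = contextFactory;
    }

    // Build the context once per task, never once per record.
    void setup() {
        BeanLookup context = contextFactory.get();
        app = (LineProcessor) context.getBean("myapp");
    }

    // The mapper does no business logic of its own; it only delegates.
    void map(String line, List<String> output) {
        if (app == null) {
            throw new IllegalStateException("setup() must run before map()");
        }
        output.add(app.process(line));
    }
}
```

Creating the context in the setup step matters because building a container is comparatively expensive, while map() may run millions of times per task.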

My suggestion for reducing the chance of Hadoop types sneaking into the thick of your code is to use the Mapper the way you would use a servlet shell: confirm the environment is as you expect it, coerce the input into something your application code likes, and then pass your application's pseudo-main() method a custom parameter object that encapsulates the Hadoop-iness of the input.
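A rough sketch of such a shell, with plain Java types standing in for Hadoop's key and value types so that it runs on its own. LogRecord, LogApplication, and LogMapperShell are hypothetical names, and handle() plays the part of the pseudo-main() method.

```java
import java.util.Objects;

// Hypothetical parameter object: the only view of a record the application gets.
final class LogRecord {
    private final long offset;
    private final String line;

    LogRecord(long offset, String line) {
        this.offset = offset;
        this.line = line;
    }

    long getOffset() { return offset; }
    String getLine() { return line; }
}

// Hypothetical application entry point; it never sees a framework type.
interface LogApplication {
    void handle(LogRecord record);
}

// Thin shell: check the environment, coerce the input, delegate.
class LogMapperShell {
    private final LogApplication app;

    LogMapperShell(LogApplication app) {
        this.app = Objects.requireNonNull(app, "app");
    }

    // In a real Mapper these arguments would be framework types, not long and CharSequence.
    void map(long offset, CharSequence rawValue) {
        if (rawValue == null) {
            throw new IllegalArgumentException("null input value at offset " + offset);
        }
        String line = rawValue.toString().trim();
        if (line.isEmpty()) {
            return; // blank records never reach the application
        }
        app.handle(new LogRecord(offset, line));
    }
}
```

With this split, only LogMapperShell would need to change if the input format or framework changed; LogApplication implementations stay testable with nothing more than a LogRecord.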

Map / reduce applications differ in the type of data they process, and what I'm talking about is not a drop-in solution. You absolutely should write your application code in a way that is conducive to the environment. This, in the case of large scale data tasks, usually means being stream oriented rather than complete-collection (in memory) based. Don't try to shoehorn some web controller into a Mapper and send me hate mail. I'll only pass it around as a cautionary tale, one that will scare the less jaded of our ilk, about following overly general advice from know-it-all Internet nobodies. Of course, writing for the environment does not necessitate abandoning things like encapsulation, layering, testability, the open / closed principle, and sometimes even good judgement.
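To illustrate the stream oriented style, here is a small sketch of an aggregate that consumes values one at a time instead of collecting them first. RunningStats is a hypothetical class, and the statistics it tracks are chosen only for the example.

```java
import java.util.Iterator;

// Hypothetical aggregate whose state stays the same size however many values arrive.
class RunningStats {
    private long count;
    private long sum;
    private long max = Long.MIN_VALUE;

    void accept(long value) {
        count++;
        sum += value;
        if (value > max) {
            max = value;
        }
    }

    // Consumes the iterator element by element; nothing is copied into a list.
    static RunningStats consume(Iterator<Long> values) {
        RunningStats stats = new RunningStats();
        while (values.hasNext()) {
            stats.accept(values.next());
        }
        return stats;
    }

    long getCount() { return count; }
    long getSum() { return sum; }
    long getMax() { return max; }
}
```

The complete-collection alternative would gather every value into a list before computing anything, which works in a unit test and then falls over when a single key carries more values than fit in memory.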

In my opinion, this leaves you with a system that shields the majority of your code from being glued to the outer infrastructure. There is always a risk of going overboard, and you shouldn't sacrifice too much performance for obtuse abstraction. Use your brain. You know your tasks better than anyone. This structure should let you continue your normal development and testing outside of the M/R framework until you actually need to get to integration testing. Even then, you have some additional flexibility to vary your application components without worrying about completely breaking your M/R glue layer. This reduces test time and (hopefully) lets you get back to what you were supposed to be doing... whatever that might have been.