Wednesday, December 23, 2009

Upcoming Posts

It's been longer than I'd have liked since I've posted anything. I've been involved in transitioning to a new position at a fantastic (secret) startup and immersing myself in learning new things. I want to take some time to put together some really interesting pieces that will be helpful not only in the academic "oh, that's nice" way, but also pragmatically.

With the growing emphasis on distributed, highly concurrent applications in the development world, I'd like to put out some analysis of best practices in these types of applications. Frameworks / projects like Spring and Hadoop give us the means to build huge applications that solve real problems, but so many people still struggle to put them into practice. We have excellent reference material, but the narrative is filled with oversimplified how-tos that don't show their work, as my eighth grade math teacher would say.

Of particular interest, and the focus of the next few posts, will be Spring (3.0.0 release), Spring Integration, OSGi (yea, again), Hadoop, HBase, and EC2. Sure, plenty has been written about all of these, but hey, it's the internet and every crackpot gets their say.

Wednesday, September 9, 2009

Map / Reduce and Dependency Injection

I've had a number of good discussions lately around map / reduce frameworks like Hadoop and GridGain, and around (object and resource) dependency management. This is just a summarized rehash of those discussions, but hopefully in a slightly more organized format.

So you need / want to write map / reduce code. It's a whole new world out there. It's the in thing to do and all your friends are cooler than you. I get it. You fire up Hadoop and realize after a few rounds of edit, deploy, test that this is less than fun. GridGain solves this with the zero deployment model, serializing your map job and thawing it on the other side of the cluster, but you quickly run into fun with things that don't serialize cleanly. The question becomes: how do I get what I need into my map (or reduce) implementation?

Before we continue, it's worth noting that GridGain offers some support for (GridGain component and SPI) resource injection as well as some support for integrating with Spring. To be honest, I found it awkward, but I don't want to address that here. I'd like to give you something that works on Hadoop as well. (That said, it's worth checking out the deeper levels of GridGain as it is a great project that I still use daily.)

My preferred method of working on moderately complex code is to build business logic as POJOs and test them directly from JUnit. I may or may not manually inject data sources or configuration data (via simple setter method calls). Once things work, I wire them together in Spring configs (although one could easily pick some other DI container) and test the aggregate with an integration-oriented test suite. Having everything working together, I box the result up into whatever deployable unit is appropriate. For me, this is usually a standalone daemon or a web service, but it could be OSGi or whatever environment you prefer.
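
As a trivial (and entirely made up) example of what I mean, the business logic knows nothing about its container and gets its dependencies through plain setters, so a JUnit test can wire it up by hand long before any Spring config exists:

    import java.util.Arrays;
    import java.util.HashSet;
    import java.util.Set;

    import junit.framework.TestCase;

    public class TokenFilterTest extends TestCase {

        // A plain business-logic POJO: no Spring, no Hadoop, just a setter for
        // the configuration it needs. (All names here are hypothetical.)
        static class TokenFilter {
            private Set<String> stopWords = new HashSet<String>();

            // Simple setter injection; Spring can call this later from a config.
            public void setStopWords(Set<String> stopWords) {
                this.stopWords = stopWords;
            }

            public boolean accept(String token) {
                return token.length() > 0 && !stopWords.contains(token.toLowerCase());
            }
        }

        // Tested straight from JUnit, wiring the dependency by hand.
        public void testAccept() {
            TokenFilter filter = new TokenFilter();
            filter.setStopWords(new HashSet<String>(Arrays.asList("the", "a", "of")));

            assertTrue(filter.accept("hadoop"));
            assertFalse(filter.accept("the"));
        }
    }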

In many cases, there's some integration code to deal with the environment. In the case of a web service, I use Spring MVC and have a Spring config devoted to the MVC specific bits and the obligatory web.xml (against my will, usually). If I luck out and we're talking about Spring dm Server, that gets easier, but that's another post. You get the idea.

When we talk about something like Hadoop, we seem to treat it as a wildly different execution environment and start trying to find crazy ways to make resources available to Mappers or Reducers. I'll make it simple: wire up your Spring beans into a top level component that acts as a controller, instantiate the ApplicationContext from within your Mapper, get over the ick of calling context.getBean("myapp"), and invoke the appropriate methods. Revolutionary? Not at all. That's the point. Simple, straightforward, but somehow not something people think of doing.
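
Roughly, using the mapred API, that looks something like the following sketch; the controller interface, config file name, and key / value types are all invented for the example:

    import java.io.IOException;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;
    import org.springframework.context.ApplicationContext;
    import org.springframework.context.support.ClassPathXmlApplicationContext;

    // The hypothetical top level application bean. It lives in your application
    // code; it's only shown here to keep the sketch self-contained.
    interface MyAppController {
        long process(String record);
    }

    public class MyAppMapper extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, LongWritable> {

        private MyAppController controller;

        @Override
        public void configure(JobConf job) {
            // Bootstrap the Spring wiring once per task. The config file name and
            // the "myapp" bean id are made up for the example.
            ApplicationContext context =
                    new ClassPathXmlApplicationContext("myapp-context.xml");
            controller = (MyAppController) context.getBean("myapp");
        }

        public void map(LongWritable key, Text value,
                OutputCollector<Text, LongWritable> output, Reporter reporter)
                throws IOException {
            // Hand the record to plain application code and emit whatever it returns.
            output.collect(value, new LongWritable(controller.process(value.toString())));
        }
    }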

My suggestion for reducing the chance of Hadoop types sneaking into the thick of your code is to use the Mapper to do what you would do with a servlet shell: confirm the environment is as you expect it, coerce the input into something your application code likes, and then pass to your application's pseudo-main() method a custom parameter object that encapsulates the Hadoop-iness of the input.
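
As a sketch (with invented names), the pieces the Mapper hands off to might look like this; note the complete absence of Hadoop types:

    // Hypothetical names throughout; in practice these would be separate files.

    // A parameter object that encapsulates the "Hadoop-iness" of the input: the
    // Mapper validates the raw Text record, parses it, and builds one of these
    // before handing off to the application.
    public class PageViewEvent {
        private final String url;
        private final long timestamp;

        public PageViewEvent(String url, long timestamp) {
            this.url = url;
            this.timestamp = timestamp;
        }

        public String getUrl() { return url; }
        public long getTimestamp() { return timestamp; }
    }

    // The application's pseudo-main(): plain Java with no Hadoop imports, so it
    // can be developed and unit tested completely outside the cluster.
    class PageViewAnalyzer {
        public long handle(PageViewEvent event) {
            // Real business logic would live here; the placeholder just keeps the
            // sketch compilable.
            return event.getUrl().length();
        }
    }

The map() method then does nothing more than validate the raw record, build the parameter object, call handle(), and emit the result.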

Many map / reduce applications differ in the type of data they process, and what I'm talking about is not a drop-in solution. You absolutely should write your application code in a way that is conducive to the environment. In the case of large scale data tasks, this usually means being stream oriented rather than complete-collection (in memory) based. Don't try to shoehorn some web controller into a Mapper and send me hate mail. I'll only pass it around as a cautionary tale of following overly general advice from know-it-all Internet nobodies, one that will scare the less jaded of our ilk. Of course, none of this necessitates abandoning things like encapsulation, layering, testability, the open / closed principle, and sometimes even good judgement.

In my opinion, this leaves you with a system that shields the majority of your code from being glued to the outer infrastructure. There is always a risk of going overboard, and you shouldn't sacrifice too much performance for the sake of abstraction. Use your brain. You know your tasks better than anyone. This structure should allow you to easily continue your normal development and testing outside of the M/R framework until you actually need to get to integration testing. Even then, you have some additional flexibility to vary your application components without needing to worry about completely breaking your M/R glue layer. This reduces test time and (hopefully) lets you get back to what you were supposed to be doing... whatever that might have been.

Friday, June 12, 2009

Why OSGi needs to come of age

The team I'm lucky enough to be part of is currently working on a project with GridGain, which I've mentioned before, once or twice. In our case, we are invoking map / reduce jobs via RESTful web services. It works well. The problems start cropping up with the way classes are loaded (or not) inside Tomcat.

GridGain is, in our case, effectively deployed in a war file. This is because we want to communicate with a layer around GridGain via HTTP (and Spring MVC, specifically). In the interest of time and simplicity, we chose to start up the grid from within Tomcat, although a cleaner design would be to queue tasks from within Tomcat, run our GridGain app as a daemon, and dequeue tasks to be executed there (maybe using JMS or something similar). It's a lot of jumping through hoops, really, because at that point, you may as well speak JMS directly to the GridGain-enabled daemon. Too much bloat.

Most of the complexity is really introduced simply because of the environment required to receive HTTP requests. Really, when you think about it, we've had to - simply to support HTTP invocation - change how we build, package, deploy, start and stop services, and deal with 3rd party dependencies. Worst of all, Tomcat has to deal with the servlet specification and how class loading works therein. As it turns out, GridGain plus Tomcat equals yuck.

Now, I won't really go into specifics (in this post) about my thoughts on class loading in Java, peer class loading with GridGain, and the myriad of issues we've seen and run into. What I do want to touch on is the more general theme that, even after all of this time, this stuff (i.e. static and dynamic class loading in containers) is still way too thick, intrusive, delicate, under-documented, and buggy. We need evolution, if not revolution.

I dropped in a war file containing our code and 100+ jars' worth of dependencies to make Spring, Hibernate, AspectJ, and GridGain happy. One hundred, plus. Smells like disaster already. GridGain starts up, Spring MVC does its thing, life looks good. I make a request to a controller that invokes a task via GridGain and things get interesting.

The task we want to start is looked up in a Spring config and instantiated using reflection. This makes it easy to add new types of tasks without mucking around with the core grid assembly and packaging, but it creates an interesting case: the jar(s) containing the grid tasks now need to live in the war file. The alternative seemed to be storing them in a directory accessible to the Tomcat shared class loader, but during testing, we found that Tomcat couldn't handle this (for some reason that is still admittedly a little unclear to me). The class simply wasn't found. The class loader hierarchy is modified in Tomcat due to the servlet spec, so the order should be bootstrap, system (JDK), the war classes and internal libs, then the common and shared class loaders (if I remember correctly). This means that when we move our grid tasks to the shared class loader, they come after the web app class loader. Normally this shouldn't be a problem, but I think the issue we ran into was related to the fact that the grid task extended a class that was present in the war's class loader, not in the shared class loader. When we moved the grid task jar into WEB-INF/lib, it magically started working. Sigh.
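
For context, the lookup-and-instantiate step is roughly the following sort of thing (the bean name, config file, and structure are invented for illustration); the last line is where the class loader story begins:

    import java.util.Map;

    import org.springframework.context.ApplicationContext;
    import org.springframework.context.support.ClassPathXmlApplicationContext;

    // The config just maps a logical task name to a fully qualified class name
    // (via something like a <util:map> bean); everything here is hypothetical.
    public class GridTaskFactory {

        private final ApplicationContext context =
                new ClassPathXmlApplicationContext("grid-tasks.xml");

        @SuppressWarnings("unchecked")
        public Object createTask(String taskName) throws Exception {
            Map<String, String> taskClasses =
                    (Map<String, String>) context.getBean("taskClassNames");

            String className = taskClasses.get(taskName);

            // Class.forName() uses the defining loader of this class, which inside
            // Tomcat is the web app class loader, not the shared one.
            return Class.forName(className).newInstance();
        }
    }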

I'm a fan of OSGi. It (mostly) makes sense to me. It's explicit (painfully so, in many cases) and direct. My feeling is that, in a case like we have here, at least it would be obvious that class loaders needed to be wired together to make this work. The main grid application could have fragments containing the grid tasks, for instance, to push additional task plugins into the core application, making the classes available. Alternatively, a more direct way would be to simply list the packages as dependencies in the manifest of the core application. This would obviously couple the core app to the tasks, which isn't very nice.

The GridGain team, unfortunately, hasn't embraced OSGi. In fact, Ivanov seems to summarily disregard it in the comments of this post. I was a bit let down to see this kind of hard-line stance on a subject that more and more people seem to be interested in, especially with the SpringSource folks driving at it. While Nikita makes an excellent point about distributed class loading and OSGi being a less than nice match, I do think there's room for GridGain with peer class loading disabled in an OSGi container. For those of us who don't mind the hassle of deploying grid tasks (and GridJob implementations, specifically) on all nodes, at least having the option of running GridGain in an OSGi container like Spring dm Server would be nice. It's possible that this can be done, but when I tried, things didn't go well. There's no doubt that Ivanov and the GridGain team have far more experience with the class loading details than I do, but the stance still lets me down, at least without a better explanation and documentation.

OSGi isn't perfect. It's far from it, obviously. What it does begin to chip away at, though, is the big fat bundle-it-all-in-one-giant-file issue that plagues Java-land. Honestly, Java is thick and we have to change that. OSGi is one way to cut it down to a reasonable footprint while still allowing for decoupling and service-like component design. Sure, permgen errors are a dirty word and the class loading mechanics may be too simplistic, but it scares me to think we'd be stuck in war-land for the rest of the foreseeable future.

Monday, April 27, 2009

The Cloud - Not a Panacea

Everyone loves The Cloud. I mean, why not? There's no hardware to own, no storage systems to maintain, no networking hardware to deal with, it's infinitely scalable to the billionth degree; it's perfect!

Ok, so no one gushes about it that much (do they?). It has its notable benefits, but they aren't quite as obvious or simple as people sometimes make them out to be.

The crux of the issue is this: simply moving traditional services, applications, or systems to a cloud-like environment yields little. Traditional systems with no knowledge of a cloud (or grid, or virtualized infrastructure, or...) environment can't take advantage of the dynamism it offers. Many systems (and arguably, people) don't understand what the true benefits of the cloud are.

A Common Cloud Case

Let's take a concrete - and common - example: an RDBMS on the cloud. I'll talk specifically about MySQL because it's on the tip of my tongue, but it will apply to many similar systems.

So there you are, running MySQL on Amazon EC2 on your Linux distro of choice. That was easy enough. You cron a nightly export to S3. You've taken advantage of some of the shared resources and that is good, to be sure. You can use EBS for large volumes. You lose some performance, but you can afford to get a larger instance so maybe you can increase some of the buffers and keep more in memory. Depending on your situation, it may all work out in the end. We won't get into the nuances of exactly what computational resources you get because it's damn near impossible to measure accurately and consistently. Let's call it a wash.

Perhaps the most important part of this is that you can easily set up one or more replicas. That's pretty damn nice.

That's a lot. The problem is that there isn't much here one couldn't do in a traditional data center. Sure, something like EC2 can help jumpstart a startup - which is great - but for medium sized and larger companies, this isn't a concern. The question is: what does the cloud do that the traditional data center (and approach) does not? The cloud is dynamic and on demand for a reason, and this isn't it.

What the Cloud Does Well

Applications that natively and intrinsically know about the cloud and its properties, and can actually react to changing conditions, are the true candidates for cloud computing. For instance, stateless web serving is something that can take advantage of this kind of environment (with some additional functionality). The reasoning here is simple: with some simple measuring of load and knowledge of capacity, additional web server instances can be forked off and run independently. These additional web servers (or their IPs) could be added to an external load balancing system to make them available to the public. There's no significant dependency here. Content to be served by the additional web servers can be made available with little fanfare and, provided you have enough resources, additional connections can be made to resources such as relational databases, caching servers like memcached, and so forth.
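
To sketch what that "additional functionality" amounts to - and this is a toy illustration, not any particular vendor's API - the orchestration piece is conceptually just a loop like the one below, where every integration point is a hypothetical stand-in:

    import java.util.concurrent.TimeUnit;

    // A toy orchestration loop for stateless web serving: measure load, compare to
    // capacity, grow or shrink the pool. launchInstance(), terminateInstance(),
    // currentRequestRate(), and registerWithLoadBalancer() are hypothetical
    // stand-ins for whatever your cloud API and load balancer actually expose.
    public class WebTierScaler implements Runnable {

        private static final double REQUESTS_PER_SECOND_PER_INSTANCE = 200.0;

        public void run() {
            while (!Thread.currentThread().isInterrupted()) {
                try {
                    double load = currentRequestRate();
                    int desired = (int) Math.ceil(load / REQUESTS_PER_SECOND_PER_INSTANCE);
                    int running = runningInstances();

                    if (desired > running) {
                        // Stateless web servers have no peers to coordinate with,
                        // so adding one is just: boot it, register it, done.
                        String instanceId = launchInstance();
                        registerWithLoadBalancer(instanceId);
                    } else if (desired < running) {
                        terminateInstance(pickLeastLoadedInstance());
                    }

                    TimeUnit.MINUTES.sleep(1);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
        }

        // --- hypothetical integration points ---
        private double currentRequestRate() { return 0.0; }
        private int runningInstances() { return 0; }
        private String launchInstance() { return "i-00000000"; }
        private String pickLeastLoadedInstance() { return "i-00000000"; }
        private void registerWithLoadBalancer(String instanceId) { }
        private void terminateInstance(String instanceId) { }
    }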

That's, of course, not the native case I mentioned. That's an adaptation of a traditional service to the cloud using an orchestration process or resource manager, of sorts. The best cases for the cloud are applications where change in the infrastructure is built in. People tend to go straight to things like map / reduce frameworks, where jobs are self-contained or transport state with them. Not to trivialize that case, but it's not every day you find the perfect fit for such a model. Many times, you find hybrids where map / reduce jobs (a computation layer) require access to a data grid (the storage layer), which limits your flexibility and ability to deal with changing conditions. In that case, it's usually not desirable to repartition or shift your data storage based on load (although you can repopulate caches based on load and network expansion). Maybe you don't care about data affinity because the footprint is small or access is infrequent, but with massive data stores or logically partitioned data, this is prohibitively expensive.

I'm not sure I have a good conclusion to this. What I'm driving at is that, while the cloud and dynamic infrastructure are a blessing, don't think of them as a panacea. Do the cost breakdown and consider what portions of your systems make sense in such an environment. Many times, running a traditional system - one where dynamic setup / teardown of nodes isn't feasible - on a platform like EC2 will wind up being more expensive after a year or so, even with light usage. It's a buzzy topic and everyone wants to be on the cloud. Remember that the fastest way to kill something in the eyes of the business is to push it for the wrong reasons or in the wrong situations.

Friday, April 17, 2009

We're Hiring!

We at Conductor are looking for Java developers to join our engineering team here in New York City, NY, US. Please take a look at the position details and submit your resume if you fit the bill. Make sure you indicate where you heard about the position (no, I don't get a referral bonus if people come from my blog. Unless someone from our HR department is reading this - then I want my bonus).

We're very interested in people who have real world experience building large scale, highly available, distributed, performant applications in Java with Spring. Love of open source and / or technology in general is a huge plus. Knowledge of a language like Python, Ruby, Groovy, or JavaScript is also good. Experience building data mining or analysis applications is good for bonus points. Come work with us. Trust me, it's a good place to be!

A reminder: None of this is sponsored, reviewed, or endorsed by my employer. Please see the Careers section of the Conductor corporate website for details.

Thursday, March 12, 2009

Principles of Architecture - Anticipate Reality

My title at work is System Architect. Actually, it's something like System Architect / Engineering Lead, but I have my suspicions that it was suggested by our business card printing company who may, or may not, get paid by the letter. That's a story for another time.

What I'm driving at is that, many times, architects are thought of as different from developers. It's true that the path to software architecture is either rooted in, or tightly entwined with, software development, but it tends to be some kind of a specialization. The reason I bring this up is because, as architects, we run the risk of separation from actual, real-world implementation concerns. By removing ourselves from the nitty gritty implementation details (a phrase I've been lazy enough to toss around in certain circumstances, admittedly), we have the potential to forget or otherwise disregard the veritable minefield that is the production environment and even the real world.

Using tools like diagrams, white boards with the nice non-smelly dry erase markers, pens and paper, and even more direct methods such as defining interfaces, we're still far removed from the underbelly. We're removed from reality both conceptually and (almost) physically; the implementation - the realization of a given architecture, big or small - is not our own. In other words, we're not subject to our own dog food. And, if you're not careful, you might end up designing for the utopian world of your favorite modeling tools. The professional term for that is screwed.

By not pushing yourself back into the role of implementor, either by contributing to code directly or working closely with those dealing with your precious architecture, you are robbing yourself as well as sabotaging the rest of the team, not to mention the project. If you work in an organization where this isn't feasible, you can still place yourself in the shoes of the developers, the testers, tech writers, product managers, all the way up to the end users. The impact of your design decisions is greater than just the common set of technical concerns. You are bound by the goals of the project. Your design must not only be simple, elegant, and technically correct, but it must deal with the idiosyncrasies of the business, production woes, maintenance and operation teams, and so on. In fact, I'm a little sad I even used the word elegant.

I spend quite a bit of time writing code at work. I do this for a few reasons. For starters, we're just in need of extra hands; we're working on some very cool projects on tight timelines and I'm always a developer, regardless of title. Possibly even more important than simply generating code, though, is the need to get things right in terms of design. By working with the rest of the team, dealing with implementation questions and concerns, all day, every day, I'm forced to constantly reconsider what is working and what isn't. Developers are the first to trip over corner cases in the design or find awkward situations that are difficult to detect in a pure design phase. In an agile environment, the constant attention to high quality and correctness means - in terms of design - following a design through to the end, through all its transformations, and anticipating real world situations.

Wednesday, March 4, 2009

Declarative Concurrency In Java - A follow up

It's put up or shut up, right? I don't know if this is something I can do by myself, but I'm happy to try and get the ball rolling.

Some time ago, I wrote about declarative concurrency in Java. It seemed to get a good reaction from many people in different sections of the community. I wound up receiving a lot of email about how people were interested and how the idea of being able to define concurrency semantics in such a manner was appealing for a number of reasons. Well, I went ahead and stubbed out a project which I've pushed to GitHub[1] for the world to pick at! It's minimal and there's very little there right now, but I wanted to solidify my intent to actually produce a prototype by physically creating the project and pushing it out there.

Everyone is welcome and encouraged to participate. The goal of the project will be to create a simple, open source library in Java that will do the following (a rough sketch of the idea follows the list).

  1. Allow developers to annotate methods, indicating that they may be executed in parallel.
  2. Provide a simple library that will, based on configuration and hints from the annotations, intercept invocations of the annotated methods and execute them concurrently.
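
To be clear about what I have in mind - and this is just a sketch of the idea, not what's in the repository (there's barely anything there yet) - the two pieces might look something like this, with every name invented:

    import java.lang.annotation.ElementType;
    import java.lang.annotation.Retention;
    import java.lang.annotation.RetentionPolicy;
    import java.lang.annotation.Target;
    import java.lang.reflect.InvocationHandler;
    import java.lang.reflect.Method;
    import java.lang.reflect.Proxy;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    // 1. The annotation a developer puts on (interface) methods that may run in
    //    parallel.
    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.METHOD)
    @interface Concurrent {
    }

    // 2. A toy interceptor: wrap a target behind a JDK dynamic proxy and push
    //    invocations of @Concurrent methods onto an executor. A real library would
    //    have to handle return values, exceptions, pool configuration, and so on.
    class ConcurrencyProxy {

        private static final ExecutorService POOL = Executors.newCachedThreadPool();

        @SuppressWarnings("unchecked")
        static <T> T wrap(final T target, Class<T> iface) {
            return (T) Proxy.newProxyInstance(iface.getClassLoader(),
                    new Class<?>[] { iface }, new InvocationHandler() {

                public Object invoke(Object proxy, final Method method, final Object[] args)
                        throws Throwable {
                    // With a dynamic proxy we see the *interface* method, so in this
                    // toy version the annotation must live on the interface.
                    if (method.isAnnotationPresent(Concurrent.class)) {
                        POOL.submit(new Runnable() {
                            public void run() {
                                try {
                                    method.invoke(target, args);
                                } catch (Exception e) {
                                    throw new RuntimeException(e);
                                }
                            }
                        });
                        return null; // fire and forget; only sensible for void methods
                    }
                    return method.invoke(target, args);
                }
            });
        }
    }

Usage would be along the lines of Worker worker = ConcurrencyProxy.wrap(new WorkerImpl(), Worker.class), after which calls to any @Concurrent method on worker land on the pool instead of the calling thread. Whether interception ends up being JDK proxies, AspectJ, or something else entirely is exactly the kind of feedback I'm after.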

I hope to get additional feedback on the concepts as well as the implementation as it evolves over time. Thanks to all who have provided feedback and encouragement thus far!

[1] decothread project @ GitHub - http://wiki.github.com/esammer/decothread