Wednesday, December 23, 2009

Upcoming Posts

It's been longer than I'd have liked since I've posted anything. I've been involved in transitioning to a new position at a fantastic (secret) start up and immersing myself in learning new things. I want to take some time to put together some really interesting pieces that will be helpful not only in the academic "oh, that's nice" way, but also pragmatically.

With the growing emphasis on distributed, highly concurrent applications in the development world, I'd like to put out some analysis of best practices in these types of applications. Frameworks and projects like Spring and Hadoop give us the means to build huge applications that solve real problems, but so many people still struggle to put them into practice. We have excellent reference material, but the narrative is filled with oversimplified how-tos that don't show their work, as my eighth grade math teacher would say.

Of particular interest, and the focus of the next few posts, will be Spring (3.0.0 release), Spring Integration, OSGi (yea, again), Hadoop, HBase, and EC2. Sure, plenty has been written about all of these, but hey, it's the internet and every crackpot gets their say.

Wednesday, September 9, 2009

Map / Reduce and Dependency Injection

I've had a number of good discussions lately around map / reduce frameworks like Hadoop and GridGain and (object and resource) dependency management. This is just a summarized rehash of those discussions, but hopefully in a slightly more organized format.

So you need / want to write map / reduce code. It's a whole new world out there. It's the in thing to do and all your friends are cooler than you. I get it. You fire up Hadoop and realize after a few rounds of edit, deploy, test that this is less than fun. GridGain solves this with the zero deployment model, serializing your map job and thawing it on the other side of the cluster, but you quickly run into fun with things that don't serialize cleanly. The question becomes how do I get what I need into my map (or reduce) implementation?

Before we continue, it's worth noting that GridGain offers some support for (GridGain component and SPI) resource injection as well as some support for integrating with Spring. To be honest, I found it awkward, but I don't want to address that here. I'd like to give you something that works on Hadoop as well. (That said, it's worth checking out the deeper levels of GridGain as it is a great project that I still use daily.)

My preferred method of working on moderately complex code is to build business logic as POJOs and test them directly from JUnit. I may or may not manually inject data sources or configuration data (via simple setter method calls). Once things work, I wire them together in Spring configs (although one could easily pick some other DI container) and test the aggregate with an integration-oriented test suite. Having everything working together, I box the result up into whatever deployable unit is appropriate. For me, this is usually a standalone daemon or a web service, but it could be OSGi or whatever environment you prefer.

In many cases, there's some integration code to deal with the environment. In the case of a web service, I use Spring MVC and have a Spring config devoted to the MVC specific bits and the obligatory web.xml (against my will, usually). If I luck out and we're talking about Spring dm Server, that gets easier, but that's another post. You get the idea.

When we talk about something like Hadoop, we seem to treat it as a wildly different execution environment and start trying to find crazy ways to make resources available to Mappers or Reducers. I'll make it simple - wire up your Spring beans into a top-level component that acts as a controller, instantiate the ApplicationContext from within your Mapper, get over the ick of calling context.getBean("myapp"), and invoke the appropriate methods. Revolutionary? Not at all. That's the point. Simple, straightforward, but somehow not something people think of doing.

My suggestion for reducing the chance of Hadoop types sneaking into the thick of your code is to use the Mapper the way you would a servlet shell: confirm the environment is as you expect it, coerce the input into something your application code likes, and then pass as arguments to your application's pseudo-main() method a custom parameter object that encapsulates the Hadoop-iness of the input.
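
To make that concrete, here's a minimal sketch of the shape I'm describing, assuming a Spring config named applicationContext.xml on the classpath and a top-level bean named "myapp". MyAppController and MyAppRequest are made-up names standing in for your own controller and parameter object.

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.springframework.context.ApplicationContext;
import org.springframework.context.support.ClassPathXmlApplicationContext;

public class MyAppMapper extends Mapper<LongWritable, Text, Text, Text> {

  private ApplicationContext applicationContext;
  private MyAppController controller; // hypothetical top-level bean

  @Override
  protected void setup(Context context) {
    // Pay the getBean() ick once per task, not once per record.
    applicationContext = new ClassPathXmlApplicationContext("applicationContext.xml");
    controller = (MyAppController) applicationContext.getBean("myapp");
  }

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {

    // Coerce the Hadoop types into something the application likes and hand
    // off; the Mapper stays a thin shell, like a servlet would. Writing the
    // results back out via context.write(...) is omitted here.
    MyAppRequest request = new MyAppRequest(value.toString()); // hypothetical parameter object
    controller.process(request);
  }
}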

Many map / reduce applications differ in the type of data they process, and what I'm talking about is not a drop-in solution. You absolutely should write your application code in a way that is conducive to the environment. This, in the case of large scale data tasks, usually means being stream oriented rather than complete-collection (in memory) based. Don't try and shoehorn some web controller into a Mapper and send me hate mail. I'll only pass it around as a cautionary tale of following overly general advice from know-it-all Internet nobodies that will scare the less jaded of our ilk. Of course, this does not necessitate abandoning things like encapsulation, layering, testability, the open / closed principle, or, sometimes, even good judgement.

In my opinion, this leaves you with a system that shields the majority of your code from being glued to the outer infrastructure. There is always a risk of going overboard and you shouldn't sacrifice performance so much for obtuse abstraction. Use your brain. You know your tasks better than anyone. This structure should allow you to easily continue your normal development and testing outside of the M/R framework until you need to actually get to the integration testing. Even then, you have some additional flexibility with varying your application components without needing to worry about completely breaking your M/R glue layer. This reduces test time and (hopefully) lets you get back to what you were supposed to be doing... whatever that might have been.

Friday, June 12, 2009

Why OSGi needs to come of age

The team I'm lucky enough to be part of is currently working on a project with GridGain, which I've mentioned before, once or twice. In our case, we are invoking map / reduce jobs via RESTful web services. It works well. The problems start cropping up with the way classes are loaded (or not) inside Tomcat.

GridGain is, in our case, effectively deployed in a war file. This is because we want to communicate with a layer around GridGain via HTTP (and Spring MVC, specifically). In the interest of time and simplicity, we chose to start up the grid from within Tomcat, although a cleaner design would be to queue tasks from within Tomcat, run our GridGain app as a daemon, and dequeue tasks to be executed there (maybe using JMS or something similar). It's a lot of jumping through hoops, really, because at that point, you may as well speak JMS directly to the GridGain-enabled daemon. Too much bloat.

Most of the complexity is really introduced simply because of the environment required to receive HTTP requests. Really, when you think about it, we've had to - simply to support HTTP invocation - change how we build, package, deploy, start and stop services, and deal with 3rd party dependencies. Worst of all, Tomcat has to deal with the servlet specification and how class loading works, therein. As it turns out, GridGain plus Tomcat equals yuck.

Now, I won't really go into specifics (in this post) about my thoughts on class loading in Java, peer class loading with GridGain, and the myriad of issues we've seen and run into. What I do want to touch on is the more general theme that, even after all of this time, this stuff (i.e. static and dynamic class loading in containers) is still way too thick, intrusive, delicate, under-documented, and buggy. We need evolution, if not revolution.

I dropped in a war file containing our code and 100+ jars' worth of dependencies to make Spring, Hibernate, AspectJ, and GridGain happy. One hundred, plus. Smells like disaster already. GridGain starts up, Spring MVC does its thing, life looks good. I make a request to a controller that invokes a task via GridGain and things get interesting.

The task we want to start is looked up in a Spring config and instantiated using reflection. That makes it easy to add new types of tasks without mucking around with the core grid assembly and packaging, but it creates an interesting case: the jar(s) containing the grid tasks now need to live in the war file. The alternative seemed to be storing them in a directory accessible to the Tomcat shared class loader, but during testing we found that Tomcat couldn't handle this (for some reason that is still, admittedly, a little unclear to me). The class simply wasn't found. The class loader hierarchy is modified in Tomcat due to the servlet spec, so the order should be bootstrap, system (JDK), the war's classes and internal libs, then the common and shared class loaders (if I remember correctly). This means that when we moved our grid tasks to the shared class loader, they would come after the web app class loader. Normally this shouldn't be a problem, but I think the issue we ran into was related to the fact that the grid task extended a class that was present in the war's class loader, not in the shared class loader. When we moved the grid task jar into WEB-INF/lib, it magically started working. Sigh.

I'm a fan of OSGi. It (mostly) makes sense to me. It's explicit (painfully so, in many cases) and direct. My feeling is that, in a case like we have here, at least it would be obvious that class loaders needed to be wired together to make this work. The main grid application could have fragments containing the grid tasks, for instance, to push additional task plugins into the core application, making the classes available. Alternatively, a more direct way would be to simply list the packages as dependencies in the manifest of the core application. This would obviously couple the core app to the tasks, which isn't very nice.
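
For illustration, the manifest of such a task-plugin fragment might look roughly like this (bundle names invented for the example); the Fragment-Host header is what attaches the tasks' classes to the core application's class loader:

Manifest-Version: 1.0
Bundle-ManifestVersion: 2
Bundle-SymbolicName: com.example.grid.tasks
Bundle-Version: 1.0.0
Fragment-Host: com.example.grid.core;bundle-version="[1.0.0,2.0.0)"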

The GridGain team, unfortunately, hasn't embraced OSGi. In fact, Ivanov seems to summarily disregard it in the comments of this post. I was a bit let down to see this kind of hard-line stance on a subject that more and more people seem to be interested in, especially with the SpringSource folks driving at it. While Nikita makes an excellent point about distributed class loading and OSGi being a less than nice match, I do think there's room for GridGain with peer class loading disabled in an OSGi container. For some of us who don't care about the hassle of deploying grid tasks (and GridJob implementations, specifically) on all nodes, at least having the option of GridGain in an OSGi container like Spring dm would be nice. It's possible that this can be done, but when I tried, things didn't go well. There's no doubt that Ivanov and the GridGain team have far more experience with the class loading details than I do, but the stance still lets me down, at least without a better explanation and documentation.

OSGi isn't perfect. It's far from it, obviously. What it does begin to chip away at, though, is the big fat bundle-it-all-in-one-giant-file issue that plagues Java-land. Honestly, Java is thick and we have to change that. OSGi is one way to cut it down to a reasonable footprint while still allowing for decoupling and service-like component design. Sure, permgen errors are a dirty word and the class loading mechanics may be too simplistic, but it scares me to think we'd be stuck in war-land for the rest of the foreseeable future.

Monday, April 27, 2009

The Cloud - Not a Panacea

Everyone loves The Cloud. I mean, why not? There's no hardware to own, no storage systems to maintain, no networking hardware to deal with, it's infinitely scalable to the billionth degree; it's perfect!

Ok, so no one gushes about it that much (do they?). It has its notable benefits, but they aren't quite as obvious or simple as people sometimes make them out to be.

The crux of the issue is simple: merely moving traditional services, applications, or systems to a cloud-like environment yields little. Traditional systems with no knowledge of a cloud (or grid, or virtualized infrastructure, or...) environment can't take advantage of the dynamicity that it offers. Many systems (and arguably, people) don't understand what the true benefits of the cloud are.

A Common Cloud Case

Let's take a concrete - and common - example: an RDBMS on the cloud. I'll talk specifically about MySQL because it's on the tip of my tongue, but it will apply to many similar systems.

So there you are, running MySQL on Amazon EC2 on your Linux distro of choice. That was easy enough. You cron a nightly export to S3. You've taken advantage of some of the shared resources and that is good, to be sure. You can use EBS for large volumes. You lose some performance, but you can afford to get a larger instance so maybe you can increase some of the buffers and keep more in memory. Depending on your situation, it may all work out in the end. We won't get into the nuances of exactly what computational resources you get because it's damn near impossible to measure accurately and consistently. Let's call it a wash.

Perhaps the most important part of this is that you can easily set up one or more replicas. That's pretty damn nice.

That's a lot. The problem is that there isn't much in that list one couldn't do in a traditional data center. Sure, something like EC2 can help jumpstart a start up - which is great - but for medium-sized and larger companies, this isn't a concern. The question is: what does the cloud do that the traditional data center (and approach) does not? The cloud is dynamic and on demand for a reason, and this isn't it.

What the Cloud Does Well

Applications that natively and intrinsically know about the cloud and its properties and can actually react to changing conditions are the true candidates for cloud computing. For instance, stateless web serving is something that can take advantage of this kind of environment (with some additional functionality). The reasoning here is simple: with some simple measuring of load and knowledge of capacity, additional web server instances can be forked off and run, independently. These additional web servers (or their IPs) could be added to an external load balancing system to make them available to the public. There's no significant dependency here. Content to be served by the additional web servers can be made available with little fanfare and, provided you have enough resources, additional connections can be made to resources such as relational databases, caching servers like memcached, and so forth.

That's, of course, not the native case I mentioned. That's an adaptation of a traditional service to the cloud using an orchestration process or resource manager of sorts. The best cases for the cloud are applications where change in the infrastructure is built in. People tend to go straight to places like map / reduce frameworks, where jobs are self contained or transport state with them. Not to trivialize that case, but it's not every day you find the perfect fit for such a model. Many times, you find hybrids where map / reduce jobs (a computation layer) require access to a data grid (the storage layer), which limits your flexibility and ability to deal with changing conditions. In that case, it's usually not desirable to repartition or shift your data storage based on load (although you can repopulate caches based on load and network expansion). Maybe you don't care about data affinity because the footprint is small or access is infrequent, but with massive data stores or logically partitioned data, this is prohibitively expensive.

I'm not sure I have a good conclusion to this. What I'm driving at is that, while the cloud and dynamic infrastructure is a blessing, don't think of it as a panacea. Do the cost breakdown and consider what portions of your systems make sense in such an environment. Many times, running a traditional system where dynamic setup / teardown of nodes isn't feasible, on a platform like EC2, will wind up being more expensive after a year or so given even light usage. It's a buzzy topic and everyone wants to be on the cloud. Remember that the fastest way to kill something in the eyes of business is to push it for the wrong reasons or in the wrong situations.

Friday, April 17, 2009

We're Hiring!

We at Conductor are looking for Java developers to join our engineering team here in New York City, NY, US. Please take a look at the position details and submit your resume if you fit the bill. Make sure you indicate where you heard about the position (no, I don't get a referral bonus if people come from my blog. Unless someone from our HR department is reading this - then I want my bonus).

We're very interested in people who have real world experience building large scale, highly available, distributed, performant applications in Java with Spring. Love of open source and / or technology in general is a huge plus. Knowledge of a language like Python, Ruby, Groovy, or JavaScript is also good. Experience building data mining or analysis applications is good for bonus points. Come work with us. Trust me, it's a good place to be!

A reminder: None of this is sponsored, reviewed, or endorsed by my employer. Please see the Careers section of the Conductor corporate website for details.

Thursday, March 12, 2009

Principles of Architecture - Anticipate Reality

My title at work is System Architect. Actually, it's something like System Architect / Engineering Lead, but I have my suspicions that it was suggested by our business card printing company who may, or may not, get paid by the letter. That's a story for another time.

What I'm driving at is that many times, architects are thought of as different from developers. It's true that the path to software architecture is either rooted in, or tightly entwined with, software development, but it tends to be some kind of specialization. The reason I bring this up is because, as architects, we run the risk of separation from actual, real world, implementation concerns. By removing ourselves from the nitty gritty implementation details (a phrase I've been lazy enough to toss around in certain circumstances, admittedly) we have the potential to forget or otherwise disregard the veritable minefield that is the production environment and even the real world.

Using tools like diagrams, white boards with the nice non-smelly dry erase markers, pens and paper, and even more direct methods such as defining interfaces, we're still far removed from the underbelly. We're removed from reality both conceptually and (almost) physically; the implementation - the realization of a given architecture, big or small - is not our own. In other words, we're not subject to our own dog food. And, if you're not careful, you might end up designing for the utopian world of your favorite modeling tools. The professional term for that is screwed.

By not pushing yourself back into the role of implementor, either by contributing to code directly or working closely with those dealing with your precious architecture, you are robbing yourself as well as sabotaging the rest of the team, not to mention the project. If you work in an organization where this isn't feasible, you can still place yourself in the shoes of the developers, the testers, tech writers, product managers, all the way up to the end users. The impact of your design decisions is greater than just the common set of technical concerns. You are bound by the goals of the project. Your design must not only be simple, elegant, and technically correct, but it must deal with the idiosyncrasies of the business, production woes, maintenance and operation teams, and so on. In fact, I'm a little sad I even used the word elegant.

I spend quite a bit of time writing code at work. I do this for a few reasons. For starters, we're just in need of extra hands; we're working on some very cool projects on tight timelines and I'm always a developer regardless of title. Possibly even more important than simply generating code, though, is the need to get things right, in terms of design. By working with the rest of the team, dealing with implementation questions and concerns, all day, every day, I'm forced to constantly reconsider what is working and what isn't. Developers are the first to trip over corner cases in the design or find awkward situations that are difficult to detect in a pure design phase. In an agile environment, the constant attention to high quality and correctness means - in terms of design - following a design through to the end, through all its transformations, and anticipating real world situations.

Wednesday, March 4, 2009

Declarative Concurrency In Java - A follow up

It's put up or shut up, right? I don't know if this is something I can do by myself, but I'm happy to try and get the ball rolling.

Some time ago, I wrote about declarative concurrency in Java. It seemed to get a good reaction from many people from different sections of the community. I wound up receiving a lot of email about how people were interested and how the idea of being able to define concurrency semantics in such a manner was appealing for a number of reasons. Well, I went ahead and stubbed out a project which I've pushed to github[1] for the world to pick at! It's minimal and there's very little there right now, but I wanted to solidify my intent to actually produce a prototype by physically creating the project and pushing it out there.

Everyone is welcome and encouraged to participate. The goal of the project will be to create a simple, open source library in Java that will do the following.

  1. Allow developers to annotate methods, indicating that they may be executed in parallel.
  2. Provide a simple library that will, based on configuration and hints from the annotations, intercept invocations of the annotated methods and execute them concurrently (a rough sketch of what that interception might look like follows).
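
As a very rough sketch of the second point, one naive way to do the interception is a JDK dynamic proxy that pushes annotated invocations onto a thread pool. The annotation and class names here are placeholders, and the real library would more likely lean on bytecode instrumentation or AOP:

import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.Method;
import java.lang.reflect.Proxy;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

/* Hypothetical annotation; the real project may name and shape this differently. */
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.METHOD)
@interface Concurrent {
}

/* A JDK dynamic proxy that submits invocations of @Concurrent methods to a
 * thread pool instead of running them inline. The annotation is assumed to
 * be on the interface method, and scheduling hints are ignored for brevity.
 */
class ConcurrentInvoker implements InvocationHandler {

  private final Object target;
  private final ExecutorService executor = Executors.newFixedThreadPool(4);

  private ConcurrentInvoker(Object target) {
    this.target = target;
  }

  @SuppressWarnings("unchecked")
  public static <T> T wrap(T target, Class<T> iface) {
    return (T) Proxy.newProxyInstance(iface.getClassLoader(),
        new Class<?>[] { iface }, new ConcurrentInvoker(target));
  }

  public Object invoke(Object proxy, final Method method, final Object[] args)
      throws Throwable {

    if (method.isAnnotationPresent(Concurrent.class)) {
      // Fire and forget; assumes void methods for the sake of the sketch.
      executor.submit(new Runnable() {
        public void run() {
          try {
            method.invoke(target, args);
          } catch (Exception e) {
            e.printStackTrace();
          }
        }
      });

      return null;
    }

    return method.invoke(target, args);
  }
}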

I hope to get additional feedback on the concepts as well as the implementation as it evolves over time. Thanks to all who have provided feedback and encouragement thus far!

[1] - decothread project @ github - http://wiki.github.com/esammer/decothread

Tuesday, February 24, 2009

Principles of Architecture - Reduce and Simplify

Just a few days ago, I had a weekly one on one meeting with my boss. It's times like that where work becomes kind of like a game of "Can I go a full hour without putting my foot in my mouth?" Turns out, I came out unscathed this time.

Recently, around the office, we've been talking a lot about the principles behind the agile manifesto. Pav (aka John Pavley, CTO) mentioned that I probably had a similar list of principles of software architecture I operate by. He pointed out that I hadn't really vocalized what those are in any obvious way and that doing so would probably be beneficial or at least interesting. Thinking about what those principles are and needing to actually enumerate them also helps me think about what's really important and why.

Before we get into the first principle I want to discuss, I want to clarify why I'm using the term principle rather than, well, anything else. During our discussions of the principles of the agile manifesto, we used this word because, like what I hope to describe in software architecture, those items are considered intrinsic properties of software development. As Pav would say, they're discovered, not invented. I tend to think he's right and that the choice of the word principle is deliberate and intentional. This is also my intention here.

One of the first two principles that came to mind was the idea of reduction and simplicity. When designing software, we strive to reduce or eliminate complexity wherever possible. There are times where a task is inherently complicated, but the design of the system need not necessarily be complicated. If that sounds counter-intuitive, consider the separation between designing the system's architecture - the way it behaves, the layering, the major objects in play, the way it interacts with constituent systems or resources, its fault semantics, and so on - from its implementation. In many cases, what you'll find is that the implementation may have some inherent level of complexity to meet the business requirements, but that a well designed system is almost obvious and easily fits in your brain without confusing you. Let's consider something concrete.

If you were to design an SQL query execution engine... I don't even need to finish that sentence for it to sound scary. Take a few minutes to think about how you might design a query execution engine. In five or ten minutes you might actually be able to work out a simple model that makes sense (within reason). The details of how to implement that design are where one gets into the shady business of making the magic happen. Even the design of a compiler is simple enough, in most cases, whereas the implementation is where the complexity lies. A compiler will have a grammar, a lexer, a parser, probably an AST, a chain of optimization strategies, maybe a number of output generation strategies. Within each of those major components, you could break things down further and come up with an easily understood design for a modular compiler. I'm not trying to trivialize building a compiler (try implementing a C++ compiler some day), but I do think that with some thought, the design process would effect a reasonable, intuitive result.
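
To make that concrete, here's a sketch of what such a top-level design might look like in Java. Token, SyntaxTree, and ParseException are placeholders; the point is only that the shape of the design fits on one screen, even though each piece could be a project in its own right.

import java.io.IOException;
import java.io.OutputStream;
import java.io.Reader;
import java.util.List;

interface Lexer {
  List<Token> tokenize(Reader source) throws IOException;
}

interface Parser {
  SyntaxTree parse(List<Token> tokens) throws ParseException;
}

interface Optimizer {
  SyntaxTree optimize(SyntaxTree tree);
}

interface CodeGenerator {
  void generate(SyntaxTree tree, OutputStream out) throws IOException;
}

class Compiler {

  private final Lexer lexer;
  private final Parser parser;
  private final List<Optimizer> optimizers;
  private final CodeGenerator generator;

  Compiler(Lexer lexer, Parser parser, List<Optimizer> optimizers,
      CodeGenerator generator) {
    this.lexer = lexer;
    this.parser = parser;
    this.optimizers = optimizers;
    this.generator = generator;
  }

  void compile(Reader source, OutputStream out) throws IOException, ParseException {
    SyntaxTree tree = parser.parse(lexer.tokenize(source));

    // Each optimization strategy transforms the tree in turn.
    for (Optimizer optimizer : optimizers) {
      tree = optimizer.optimize(tree);
    }

    generator.generate(tree, out);
  }
}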

The point I'm trying to make is that a well designed system should be intuitive to the person or team implementing the system as well as the architect. If you find it difficult to communicate a design, there's a high chance that the implementation of that design will not make things any simpler. In fact, it's probably impossible. To be clear, some things have an inherent degree of complexity, but we should always strive for the simplest, but still most complete, design possible.

These are all relative terms; simple, complex, intuitive, complete. You'll always have to rely on your judgement, experience, and best practices of the trade. By properly deconstructing an application, its components, their components, and so on, even the most complex system can be easily understood and digested.

Some techniques I find useful for making this happen are:

  • Apply Divide and Conquer. Break down components recursively until you get to easy-to-understand units of functionality.
  • Never work alone. Statistically, you're more than likely to be surrounded by people who can contribute experiences and ideas during the design process that will yield a better result. The added benefit here is that you're constantly having to explain your thought process and ideas to other humans; the degree of complexity is probably proportional to the number of times a junior developer says Huh?
  • Follow patterns and best practices. Silly questions like "Does your class do one thing and do it well?" have saved me from my own cleverness more than once (but admittedly not always).
  • Trust your gut. If it sounds too complicated, it probably is. Take a break, look at similar problems, ask around, and try a different approach.

There's no complexity blasting ray gun of designly awesomeness. There's no third party library that you can just drop in to make it simple. Sometimes, the business requirements are as tough as they sound. Most of the time though, you can reduce and simplify.

Tuesday, February 10, 2009

Class Categories in Java

Class categories have existed in a few incarnations over the years. My personal knowledge of comp sci history is thin, at best, but what little research I've done suggests the idea came from Smalltalk-80. Certainly, my first exposure to class categories came from working with Objective-C on NeXTSTEP and later Mac OS X.

What a category does is relatively simple to understand. Basically, the idea is that a developer may take an existing class and, effectively, append methods to it without having access to the source code or subclassing it. This is best illustrated in code.

// A normal, and terribly boring class.
public class Foo {

  public void displayMessage(String message) {
    System.out.println("A useless message:" + message);
  }

}

// Extending the class by creating a category on it.
public class Foo (MyExtensions) {

  public String getDefaultMessage() {
    return "Hello world.";
  }

}

// ...and finally, what one would expect.
Foo f = new Foo();

f.displayMessage(f.getDefaultMessage());

Normally, when I explain class categories to Java developers, I barely get the words out before I'm hit with what you might expect. People question whether this is breaking encapsulation, if this bloats code, if it creates tight coupling, if it breaks access rules, and so on. Some of the more dynamic languages like Ruby and Perl will happily let you do things like this (albeit, sometimes safer than others), but that happens at runtime.

I'm proposing we bring categories to Java. Yep. I said it. I'm going to focus more on how this might work rather than why I think it's a good idea, although I'll try and briefly address that too. Here's how I think it could work and why.

The Basics

The syntax would work like Objective-C's. Creating a category would be done by specifying the same package and class definition, in the interest of simplicity, with the addition of a category name enclosed in parentheses (see the above example). It would not be legal to specify inheritance when creating a category; inheritance would always be defined by the original definition (i.e. the uncategorized class declaration), although there's no reason to prohibit the implementation of additional interfaces in a category. This would allow those creating categories on a class to extend an existing class to implement a new interface without modifying source code.
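
For example, under this (hypothetical, not legal today) syntax, a category could bolt an interface onto the Foo class from earlier without touching its source:

// Hypothetical syntax, per the proposal: a category that adds an interface
// to the existing Foo class without modifying its original declaration.
public class Foo (Comparisons) implements Comparable<Foo> {

  public int compareTo(Foo other) {
    return toString().compareTo(other.toString());
  }

}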

Access, Security, and Visibility

In many ways, the access and visibility rules of subclassing apply to categories, as the result is very similar from the perspective of the original class; the new functionality is unknown and untrusted.

It would only be legal to access public or protected members of a class when creating a category. This would respect access and visibility restrictions on code developed prior to the existence of categories. Overriding a method in a category would not be permitted, although method overloading is fine. Private members within a category would not be visible outside of the category.

Category Availability

The biggest differentiation between categories in Java and class reopening in Ruby, for instance, is that the contributions made to a class via a category would be known and could be checked for at compile time. This would allow developers to see and avoid cases of competing categories or member addition during development, which is usually not possible with languages that allow for this kind of functionality.

It would not be legal to create a category on a class declared as final. This extends the meaning of declaring a class final, but only slightly, as creating a category is similar in intention to subclassing (in theory). It rightly implies that there is no way to prevent the creation of categories while still allowing subclassing, but there's no obvious reason to draw that distinction, since categories can only access public and protected members of an existing class, just as a subclass would.

External classes, both unrelated classes and subclasses of a class with categories, would see all members, including those defined in categories, as usual. The category information of a member should be made available via the standard reflection classes and methods. Given the above example Foo class, the following would work as expected.

/* Includes both methods from the original class declaration
 * as well as methods from categories.
 */

Method[] methods = Foo.class.getMethods();

/* Additionally, category information should be made available
 * via reflection.
 */

Category[] categories = Foo.class.getCategories();

for (Category category : categories) {
  System.out.println("Methods in category:" + category.getName());

  for (Method method : category.getMethods()) {
    System.out.println("method name:" + method.getName());
  }
}

Some Quick Reasons Why

There are a few nice advantages to having categories available in a language, especially at compile time. There are the obvious advantages such as simple code organization. What I tend to think is more interesting, though, is creating categories to apply specialized functionality to core classes. For instance, one may want to add methods to collections to glue validation logic to core components. In cases where Spring is used, it's not uncommon to see many adapter type objects that simply exist to make an object more amenable to participate in DI. I believe that a lot of code and class structures could be greatly simplified by being able to make minor alterations to existing classes rather than resorting to multiple objects to mediate or adapt existing code to new systems and frameworks.

Like anything else, there is the obvious ability to abuse something like categories. I think there are times when more traditional approaches are the best option, and there's no replacement for good design and education, but to remove a valuable tool because some subset of the population may misuse it only serves to hurt those that could make proper use.

My plan is to attempt to draft this as a JSR and submit it for review. I don't know if I have the ability to chew through the politics (that I'm assuming are) attached to that, but it might be fun to try. Clearly there's more to work out (I haven't looked deeply into what this does to the compiler and runtime at a low level, for instance), but I'm interested in what people think about categories in Java.

Wednesday, January 21, 2009

Another Night with GridGain

Earlier this evening, I had a chance to attend a presentation on GridGain at this month's NYC JavaSIG at the Google Engineering building here in New York City, NY, US. I've written about GridGain before, but if you haven't read my thoughts on it, I'll sum it up; I'm a fan.

I got a chance to talk to Nikita Ivanov, if only briefly. Nice enough guy. What I like about his presentation most is the lack of - and there's probably no other way to really say it - bullshit. Sure, he uses words like grid and cloud which is always suspect, but in this case, he provides an actual, single slide, definition of what it means to him and GridGain.

Grid Computing = Compute Grid + Data Grid

Makes sense if the terms compute grid and data grid mean something to you. Nikita seems to stick to (what I think is) the standard definition of a data grid - a network of data storage machines containing partitioned or distributed storage. I'm paraphrasing a bit here, but mostly because I don't recall his exact wording. The compute grid portion of that should be obvious. I'm not providing any hints on that one.

Cloud computing is defined by Ivanov as follows.

Cloud Computing = Grid Computing + Data Center Automation

This is also simple and concise. So we get grid computing; at least in the context of GridGain. Data center automation, in this case (and in Ivanov's opinion), covers not just the normal stuff, but specifically the creation and shutdown of machine instances. This is generally stuck behind an API such as Amazon's EC2 and related services with the goal that one can have a greater degree of flexibility. While I'm not really in love with EC2 as some others may be (not referring to Ivanov, specifically, just the sometimes expressed idea that EC2 is the solution to all woes), it is a readily available cloud environment that one can play with. I'm glad it exists.

A panacea? Of course not. Honestly, what could be, short of a super code-monkey falling from the sky to do your evil bidding? Ok, maybe an intern.

The point, I think, is that this kind of functionality - the ability to perform massive, distributed, parallelized computing without six-plus figures worth of hardware and software - is both simple and significant.

As usual, no grand epiphany here on my part... just some commentary on one of the areas where we can push performance in real world applications. That, of course, being something we should always be looking to do. Thanks to GridGain, Ivanov, NY JavaSIG, Google, the JavaSIG sponsors, and my employers for not getting annoyed that I suckered my team into cutting out early to go to the event.

Friday, January 9, 2009

Of Maven Dependencies and Repositories

I think of Maven the same way I tend to think of Git; excellent features, but just a little more complicated and obtuse than is really reasonable for the task. I know that's insanely unpopular to say about Git, but luckily this isn't about Git.

Recently, I was converting a project at work to Maven (from Ant) as an experiment. This is a relatively standard, mid-sized, Java project that makes heavy use of a number of what I would consider common Java libraries. In our case, we use Spring very heavily, along with other staples like Hibernate. One of Maven's killer features is the ability to resolve dependencies and pull the correct versions from the Maven Repository, but we already know that.

I found the selection of dependencies from the central repository to be one of the worst things I've had to do in recent days. I was spending more time setting up different repositories and wading through the duplicate packages than I was enjoying the benefits of such features. It's almost more of a headache than doing it all by hand.

In OSGi bundles, which have at least some similarities to Maven in how dependencies are specified, one has the ability to express dependencies in a few ways. The obvious unit of dependency is specifying another bundle. This is, effectively, the same as Maven. OSGi bundles, though, may also opt to specify only what Java packages the bundle imports and let the runtime figure out what bundle to take those packages from. This is similar to how many Linux package managers operate when more than one package can fulfill a dependency, and it's traditionally referred to as a virtual dependency.
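
To make the contrast concrete, here is roughly what the two styles look like in a bundle manifest (bundle and package names invented for the example):

Require-Bundle: com.example.persistence.impl;bundle-version="[1.2.0,2.0.0)"

Import-Package: javax.persistence;version="[1.0.0,2.0.0)",
 org.example.util;version="2.1.0"

The first header couples you to a specific bundle; the Import-Package form only states which packages are needed and leaves it to the framework to pick a provider - the virtual dependency flavor.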

Maybe what we need from Maven is the notion of the virtual dependency. A Maven POM could specify virtual packages as dependencies that could be filled by any one of a number of providers. Java lends itself to this very well because the majority of standards define the APIs with service providers being distributed separately. Think of things like JPA (provided by Hibernate EM), JAXP (Xerces and friends), and so on. I suppose it's a little different because Java developers want to pick an implementation for a specific reason, but having virtual dependencies would eliminate many of the overly specific dependency graphs created when dealing with complex packages such as Spring, for instance.
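
Purely as a sketch of the idea - and to be clear, this is invented syntax, not something Maven supports - a virtual dependency in a POM might read something like:

<!-- Hypothetical: depend on a package "virtually" and let resolution pick a provider. -->
<virtualDependency>
  <package>javax.persistence</package>
  <version>[1.0,2.0)</version>
  <!-- Optionally pin a preferred provider when it matters. -->
  <preferredProvider>org.hibernate:hibernate-entitymanager</preferredProvider>
</virtualDependency>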

It's worth noting that the most significant issue I have with Maven is the quality of the metadata. It is just plain awful. Some of the things I ran into were:

  • Packages that weren't updated with bug fixes or recent versions
  • Many copies of the same package with different names and odd discrepancies in versions
  • Missing (or unavailable) dependencies
  • When using Spring's Maven repositories, duplicates of the dependencies are pulled in because Spring depends on versions not in the central rep.
  • Because Spring came from Spring's repository, packages like GridGain which depend on Spring, grab the version from the central repository, but Spring Integration which is only available from Spring's rep has a dependency on the version of Spring from Spring's rep... AARRRRRRRRRRRRGGGGGGGGGGHHHHH!

I get that this is hard and it requires a lot of coordination. I get that I could repackage things in my local repository or a corporate shared repository. Should I have to? A lot of the advantage of Maven is lost when one has to manually follow dependencies to figure out why there are two (full) versions of Spring Core in the project. It's annoying, wasteful, and prone to error.

Maven, I want to like you. Really I do. But like a real, live, flesh and blood human, you make it so difficult sometimes, just like your sister (Git).

Wednesday, January 7, 2009

Declarative Concurrency in Java

Good, solid, safe, effective concurrent programming is hard. Modern languages and paradigms make it easier, but for most, it's still a challenge to get right, right away. Many people have predicted the end of the great GHz race. They're probably right. I don't have any great insight into the CPU design community. Honestly, it just doesn't hold my attention. Multi-core systems are all the rage these days, though, and that's pretty damn cool. None of this is new; plenty of people smarter than me have pointed it out.

One of the purported benefits of functional programming is how it lends itself to concurrent programming. Luckily, I work with a smart guy who's both patient and polite enough to talk to me about FP without serving kool-aid (thanks Adam). Many of those conversations entail discussion about state, immutability, and side effects in software implementation. This, of course, leads me to think about how some of these things apply to one of our weapons of choice where we work - Java.

Java accomplishes concurrency via thread objects. Big deal; nothing new here. Most of the confusion comes into play not when deciding what should run concurrently - that's usually obvious - but when figuring out how to protect shared state. Again, in Java-land, we do this with different types of locks, either implicitly with synchronized blocks or explicitly with the grab bag of fun from the java.util.concurrent.locks package. Many of the Sun docs talk about how we use locks to establish happens-before relationships between points in code. What's interesting is that this language seems so natural and simple. So why is lock management such a pain?
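
For reference, this is the kind of imperative lock management I mean; a contrived counter using java.util.concurrent.locks, where the happens-before edge comes from pairing unlock() with a later lock() on the same lock object:

import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReentrantLock;

public class Counter {

  private final Lock lock = new ReentrantLock();
  private int count = 0;

  public void increment() {
    lock.lock();

    try {
      // The unlock() below happens-before the next successful lock(),
      // so the incremented value is visible to the next locker.
      count++;
    } finally {
      lock.unlock();
    }
  }

  public int getCount() {
    lock.lock();

    try {
      return count;
    } finally {
      lock.unlock();
    }
  }
}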

Maybe imperative locking isn't the right approach. Maybe there's a more natural way to establish a happens-before relationship. It sounds like dependency declaration. I'm wondering if we can't find a way to declare those dependencies within source code, with something like annotations, where instrumentation can infer what we're looking for. This, of course, is sugar for what we have now, but I don't think sugar is always bad.

 public class MyClass {

   private int counter = 0;

   @Concurrent( stateful = true )
   public void execute() {
     /* Do something that might touch shared state. */
     this.counter++;
   }

   @Concurrent( unitName = "otherExecute", stateful = false )
   public void otherExecute(String someArg) {
     /* Do something that promises not to alter ourselves. */
   }

   @Concurrent(
     unitName      = "somethingElse",
     stateful      = true,
     happensBefore = "otherExecute"
   )
   public void somethingElse() {
     /* This can be run concurrently, could touch state, but
      * must happen before "otherExecute" is called.
      */
   }

   static public void main(String[] args) {
     ConcurrentController controller;
     ConcurrentUnit       unit1;
     ConcurrentUnit       unit2;

     controller = ConcurrentController.forClass(MyClass.class);

     unit1 = controller.getUnit("somethingElse").setThreadPoolSize(10);
     unit2 = controller.getUnit("otherExecute").setThreadPoolSize(5);

     unit1.start();
     unit2.start();
   }
 }

The @Concurrent annotations would instruct an instrumentation library to perform an operation in parallel. The hints stateful and happensBefore could be used to perform additional automatic member variable monitor acquisition or something equally snazzy. The unitNames could be used to grab a handle, of sorts, to a concurrent unit of work and be used to establish relationships or to report on concurrency plans (which could be similar to an RDBMS query execution plan). Who knows... I'm tossing ideas around.

I don't think it covers every situation. In fact, I'm sure it doesn't cover everything. It's beyond flawed and probably not possible. I'm just trying to get some wheels turning. The goal is to have simpler, coarse-grained, declarative concurrency definition that can be externalized.

I'm intrigued by the idea of simple concurrency models that don't remove the fine-grained control given to us by the language and APIs. If concurrency isn't going away, it has to get easier for the majority of people to do it correctly.

I'm especially interested in feedback on this.

Thursday, January 1, 2009

Agile Languages and Developer Experience

I just finished reading Jamis Buck's Legos, Play-Doh, and Programming article where he discusses some significant differences in methodologies between languages like Ruby and Java. It got me thinking.

Jamis makes a number of points about how the Ruby way is generally more dynamic and less prone to specialized components. This has a lot to do with the points of extension in Ruby and the malleability of the language, itself. For instance, Ruby's ability to inject methods into existing classes or the convention of duck typing in standard libraries allow developers to pose as different types of objects and get away with a lot more than a language like Java. There are tons of arguments in both communities over which approach is better. I call shenanigans.

It's my hypothesis that both are great. I know that sounds like a cheap way to duck the flying artillery between the camps, but it's the truth. Hey, I wrote Perl for years so don't pretend to have invented the notion of language flexibility with me. (The previous sentence just started a million replies; let's pretend I've already read them all because I know what you're going to say.) I've used Ruby, Python, Perl, Java, C, C++, and others on medium to large sized, real projects so I think I can be objective on this one.

It's my experience that the problem isn't with either approach, really. I've met ninja-good developers on all sides. The wall I have run into, on the other hand, is that the amount of rope given to a lesser experienced or disciplined developer is almost directly proportional to the insanity they can manufacture. I don't think one can make a blanket statement on language suitability one way or the other.

In the past, I've used dynamic languages (heavily) for my own projects. No matter how right or wrong my own code is, it's mine and I get it, most of the time. I have a basis in C and tend to do a lot of mental bookkeeping when I code. It's a holdover from days when I didn't have things to sanity check my own work. There was a lot of rope, but I was the only one in the room, so I made sure not to throw it over the rafters, wrap it around my neck, climb on a chair, and then try to figure out what I had done wrong. If I did, I had only myself to blame.

When I'm working with a team of fifty developers of varying levels of experience, discipline, knowledge of good design and testability, and interest in their craft, the game changes. You can't bank on each member of the team having the same skill set or even interest in what you're trying to get done. Or, maybe you can, simply by applying very strict guidelines. The fact of the matter is that it's not as cut and dried as either side wants you to believe. You're going to have code-cowboys who are clever - and beware that word, for it means smart, but with a hint of trickiness and subversiveness - who are going to do things like override the built-in functions to give them cool new uses never previously imagined! You're going to have recent university grads who need mentoring, even if only half of them know it. You're going to have your average, mediocre developer for whom this is a nine-to-five gig funding an ever more expensive pot habit. And, if you're lucky, you'll have one or two superstar ninja coders who turn out reliable, efficient, readable, documented, testable, well designed code. (As an aside, if you're one of those people, have one, or know one, I'm looking for resumes.)

My point: do not optimize for a group you do not have.

Lying to yourself will only get you knee deep in your own rationalizations about why your language is, in fact, the best one on the planet, and how the other guys just don't get it. The worst part is you'll still be right every single time.