If I had a nickel for every time I heard people mention Hadoop or Map/Reduce in the same sentence with Cloud Computing, well … I would have a lot of nickels. If you are not familiar with Hadoop, the best way to understand what it does is to think of it as a programming model for executing complex compute jobs on very large clusters of computers. These clusters can comprise hundreds and, sometimes, thousands of machines. What Hadoop does is break, or “Map”, these complex jobs into much more manageable tasks that are distributed to run on the machines in the cluster. It then assembles the results of these much smaller parts of the overall job into one coherent answer. This process of collecting and consolidating the results is called “Reduce”. Easy, right?
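To make the idea concrete, here is a minimal single-machine sketch of the Map/Reduce pattern using the classic word-count example. This is just an illustration of the programming model, not Hadoop itself — real Hadoop jobs distribute the map tasks across the cluster and shuffle the intermediate pairs between machines:

```python
from collections import defaultdict

def map_phase(documents):
    """Map: break the job into small tasks -- emit a (word, 1) pair
    for every word in every document. In Hadoop, these tasks would
    run in parallel on different machines in the cluster."""
    for doc in documents:
        for word in doc.split():
            yield (word.lower(), 1)

def reduce_phase(pairs):
    """Reduce: collect and consolidate the intermediate results
    into one coherent answer -- here, a total count per word."""
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

docs = ["the cloud", "the cluster runs the job"]
result = reduce_phase(map_phase(docs))
print(result)  # {'the': 3, 'cloud': 1, 'cluster': 1, 'runs': 1, 'job': 1}
```

The key point is that each map task needs only its own slice of the input, which is what lets the work spread across hundreds or thousands of machines.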
Well, if it is so easy, why do so many think that Hadoop is synonymous with Cloud Computing, which is an IT resource deployment model and has nothing to do with how you break up your job and reconstitute the results? It may very well be because Cloud Computing provides an excellent way to provision these very large compute clusters, return the resources when the job is done, and pay only for the time and resources you used to execute the job. This is a very pedestrian explanation. If you are looking for something much deeper and of more substance, read this excellent post on the subject by Anant Jhingran, IBM Information Management CTO.

Most of the writings I have come across on Hadoop and Map/Reduce tend to deal with just the compute resources, and this is where the cloud — even the public cloud — fits very nicely. But Anant rightfully asks the tougher questions about the data. Given the volume of data typically processed by Map/Reduce jobs, getting the data in and out of the cloud is a much tougher issue to tackle than adding a hundred nodes to a compute cluster. Interested in this topic? Then point your browser to Anant’s blog to continue the conversation.