Amazon Web Services is arguably the most recognized name in the world of cloud computing. Amazon is best known for its storage-as-a-service offering, S3, and its compute-power-as-a-service offering, EC2. A lesser-known Amazon cloud offering is a cloud-based database called SimpleDB, which is currently in beta. SimpleDB is quite a bit different from traditional relational databases: it is a simple key-value pair database that targets web developers who don’t need or want a relational database.
Amazon has also partnered with major DBMS vendors, and today one can deploy major relational DBMSs on Amazon EC2. We offer IBM DB2 and IDS for deployment on Amazon EC2, and this week Amazon’s CTO Werner Vogels described DB2 as having the most innovative approach for cloud deployment.
Today, Amazon announced its second entry into the world of cloud databases, called Amazon Elastic MapReduce. This appears to be a hosted implementation of the Hadoop framework. Hadoop, in a nutshell, provides a way to analyse very large amounts of data by employing a large number of processing nodes working independently. One does not use Hadoop or MapReduce like just another database, where an application opens a connection and submits a query or an update operation. An application that uses MapReduce is more like a batch job that one submits; instead of running on a single server, the application and data are spread across many servers, each one crunching its share of the data. This is an approach that is often used by companies like Google and Yahoo to analyse vast amounts of information. It is also very popular in many scientific communities.
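To make the map-then-reduce idea concrete, here is a minimal sketch in plain Python of the classic word-count example. It is an illustration of the programming model only, not of Hadoop's actual API: the function names and the in-memory list of "chunks" are hypothetical stand-ins for data that, in a real cluster, would live on separate servers and be processed in parallel.

```python
from collections import defaultdict

# Map step: each node independently turns its chunk of input
# into (key, value) pairs -- here, (word, 1) for every word seen.
def map_words(chunk):
    return [(word, 1) for word in chunk.split()]

# Reduce step: all values that share a key are combined -- here, summed.
def reduce_counts(pairs):
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

# The "batch job": in a real cluster each chunk sits on a different server.
chunks = ["the quick brown fox", "the lazy dog", "the fox"]
pairs = [p for chunk in chunks for p in map_words(chunk)]
print(reduce_counts(pairs))  # e.g. {'the': 3, 'fox': 2, ...}
```

Because each map call touches only its own chunk, the map work parallelizes across as many nodes as you care to rent, which is exactly what makes the model a fit for Hadoop-style clusters.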
Hadoop is the kind of application that, at least on the surface, is a natural fit for the elastic nature of cloud computing. Instead of procuring large computing clusters, one can just go to Amazon to run a job and pay only for the resources used by that job. However, running such a job will require transfers of very large volumes of data into and out of the cloud. And, while compute charges on EC2 and storage charges on S3 are quite low, data transfer charges can really add up. Amazon’s home page for MapReduce has a pretty good explanation of the charges for MapReduce itself and the EC2 charges that one can expect, but it is silent on the data transfer charges. I have to assume that these are standard Amazon S3 data transfer charges, as S3 is both the source of the data and the destination for the output. These charges are 10 cents per GB for transfer into S3 (on sale for 3 cents until July 1, 2009) and 17 cents per GB for retrieving your data from S3. Since most Hadoop jobs require very large data sets, this can get expensive.

Yes, it will be much cheaper than doing something like this on your own equipment, but it is not exactly going to all of a sudden democratize the world of very complex data analysis and make it available to everyone. One thing is certain: this is a terrific way for Amazon to generate significant revenue by letting more of us transfer and store more data in S3 and spin up hundreds if not thousands of EC2 machine images. I think I am going to buy some Amazon stock!
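A quick back-of-the-envelope calculation shows how the transfer charges stack up. The per-GB rates are the S3 prices quoted above; the job size is a hypothetical assumption purely for illustration.

```python
# S3 data transfer rates quoted in the post (2009 regular prices, USD/GB).
RATE_IN = 0.10   # transfer into S3
RATE_OUT = 0.17  # retrieval out of S3

# Hypothetical job: 1 TB of input uploaded, 100 GB of results pulled back.
data_in_gb = 1000
data_out_gb = 100

cost_in = data_in_gb * RATE_IN
cost_out = data_out_gb * RATE_OUT
print(round(cost_in + cost_out, 2))  # 117.0 dollars in transfer fees alone
```

Note that this $117 is on top of the EC2 and Elastic MapReduce charges for actually running the job, which is why transfer costs deserve a line of their own in any estimate.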