The IT industry’s penchant for trends is eclipsed only by that of the fashion industry. The latest IT fashion is big data. As with most trendy topics, there is no shortage of pundits willing to espouse their views on the subject. I happen to curate a daily electronic paper called All About Big Data, and I am overwhelmed by the volume of writing on the subject. Unfortunately, much of this writing is thinly veiled, self-serving chest-beating by major IT vendors. Not a day goes by without someone asking me, “So, what is so different about big data?”
I don’t think I have a definitive answer, but as I have transitioned into this space from the database world, I’ve formed an opinion. So here is how I see it. For years, we in IT have been collecting data that is largely an artifact of business transactions. Every time a cash register rang up a sale, or a widget came off an assembly line, or a box of goods was loaded on a truck, or someone signed for its delivery, we diligently recorded all the pertinent details in our databases. We captured all of the information about who, where, how, how much and maybe even why. This data is neat, precise and easily organized. It is expressed in dollars and yen, with unique customer numbers and package IDs, dates and times in minutes and seconds, and location information expressed in standard postal addresses and longitude and latitude. To make this data easy to process, we described it with metadata. In other words, we made storing and operating on this transactional data a piece of cake. As the volume of economic activity and the corresponding volume of data increased, our systems had little trouble keeping up, thanks to Moore’s Law. I can almost see some of you raising an eyebrow. Yes, keeping up with the increase in the volume of business transactions has been relatively easy. Today, it is not uncommon to see databases that handle thousands of transactions per second and store terabytes, and in some cases even petabytes, of data. The world’s largest database, according to the Guinness Book of World Records, handles over 3 petabytes of data.
Surely this is big data! As impressive as it is, I don’t think it qualifies. 3 petabytes is a lot of data, no doubt. A few decades ago, 3 petabytes would have represented the entire volume of information generated by humanity. The fact that a database system can handle this much data is really impressive. Yet it falls short, and not because the Volume of data fails to meet some cutoff point for admission to the big data club. Nor is it because this data lacks Velocity or Variety, the other two of the three Vs of big data. For me, this incredibly large database does not make the cut because it is built on a well-known technology (DB2) that can handle this load. I am not saying this to boost DB2. My point is that what defines a big data problem is the fact that it exceeds the capabilities of current systems and technologies. There are many examples where this is the case and the data volume is just a few gigabytes. There is no magical point of terabytes, petabytes or any other arbitrary marker on the volume scale that serves as a signal for people to get interested in big data. It is more of a feeling that you have ventured out into some big waves and are in an environment that far exceeds the design parameters of your ship. That does not mean you need to batten down the hatches and hope to weather the storm. Big data is not a fad or a hurricane that will pass, returning us to calmer seas. It is quickly becoming a fact of life for the modern enterprise, and IT will have to learn to deal with it. There are people out there who travel the world over seeking out giant waves to ride. Take a look at some of the startups and established companies that have embraced the big data challenge and are enjoying the ride.
In the coming days, I plan to write several posts on the aspects of big data that make it so different from established data management and information processing norms and practices. I plan to focus on both technologies and products, and to compare and contrast these technologies with traditional database systems from the point of view of a DBA, a system administrator and a database programmer. You will find that not only do the technologies take an interesting approach, but the way we use them also differs from the approaches we take with the data warehouses and analytical products in use in the enterprise. Stay tuned. If you are interested, I recommend subscribing to this blog.