Big data made simple

Taking technical jargon out of Big Data

There is a lot of excitement about Big Data in the ICT industry, and businesses are trying to understand what it all means. Most managers will say it has something to do with analysing the large volumes of data that companies gather from their interactions with customers and prospects, in order to understand them better and market to them more effectively. What follows is a business-level explanation of Big Data, as free as possible of technical jargon.

Since the 1990s, businesses have used customer relationship management systems to centralise information about their customers, and data warehousing and business intelligence applications to analyse that data so they can better understand how to engage with and sell to those customers. This analysis has relied on structured, tabular data held in transaction systems built on relational database management systems.

More recently, social media has meant that engagement with customers generates large volumes of unstructured multimedia data such as text, video, audio and graphics. This created a need for a new type of technology that could analyse high volumes of all data types cheaply and efficiently.

Google and Yahoo came to the rescue with technology they developed to analyse the multimedia data they find on the websites they index for their search engines. It is based on a data analysis approach, patented by Google, called MapReduce.

As the name suggests, the software is based on a two-stage process: Map and Reduce. The Map function analyses the data and transforms it into discrete pieces, each consisting of a key and an associated value. These key-value pairs classify the data. For example, cars can be classified by manufacturer. The result would be a key of “Toyota” with values of “Corolla” and “Hilux”, and a key of “Ford” with values of “Ranger”, “Mustang” and “Territory”. The Reduce function then aggregates these key-value pairs into totals, such as two models for Toyota and three for Ford.
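For readers who would like to see the two stages written down, here is a minimal sketch in Python using the car example. It is an illustration only, not how Google or Hadoop implement MapReduce: the input records and the function names map_record and reduce_pairs are assumptions made for this example.

```python
from collections import defaultdict

# Hypothetical raw records standing in for the data to be analysed.
records = [
    "Toyota Corolla",
    "Ford Ranger",
    "Toyota Hilux",
    "Ford Mustang",
    "Ford Territory",
]


def map_record(record):
    """Map stage: turn one raw record into a (key, value) pair."""
    manufacturer, model = record.split(" ", 1)
    return manufacturer, model


def reduce_pairs(pairs):
    """Reduce stage: group the values by key, then total them."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return {key: len(values) for key, values in grouped.items()}


pairs = [map_record(record) for record in records]
totals = reduce_pairs(pairs)
print(totals)  # {'Toyota': 2, 'Ford': 3}
```

The Map stage looks at each record on its own, which is what allows the work to be shared out across many servers.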

This type of processing is a simple and effective way of handling unstructured data, because it transforms the data into a small common unit. The real power comes from how the software has been implemented. The software distributes the data across large numbers of servers. Each piece of data is copied to several servers, so that typically three copies are held.

The analysis is not performed sequentially on one server, as with traditional transaction systems, because the high volumes would take too long. Instead it is performed in parallel, with every server concurrently processing its own smaller share of the data, which makes it much faster. Processing is also non-stop: if one server fails, the data is still available on other servers, which automatically take over.
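The combination of copying and parallel processing can be sketched in a few lines of Python. This is a simplified model under stated assumptions, not Hadoop's actual mechanism: the server names, the round-robin placement of copies and the use of threads in place of separate machines are all invented for illustration.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

REPLICAS = 3  # assumed number of copies kept of each chunk
SERVERS = ["s1", "s2", "s3", "s4", "s5"]  # hypothetical server names


def place_chunks(chunks):
    """Assign every chunk to REPLICAS different servers, round-robin."""
    return {
        i: [SERVERS[(i + r) % len(SERVERS)] for r in range(REPLICAS)]
        for i in range(len(chunks))
    }


def count_chunk(chunk):
    """Work done on one server: count manufacturers in its own chunk."""
    return Counter(record.split(" ", 1)[0] for record in chunk)


def run(chunks, failed=frozenset()):
    """Process all chunks in parallel, tolerating the failed servers."""
    placement = place_chunks(chunks)
    for i, holders in placement.items():
        # A chunk is only lost if every server holding a copy has failed.
        if all(server in failed for server in holders):
            raise RuntimeError(f"chunk {i} has no surviving copy")
    # Threads stand in for separate servers working at the same time.
    with ThreadPoolExecutor(max_workers=len(SERVERS)) as pool:
        partial_counts = list(pool.map(count_chunk, chunks))
    total = Counter()
    for partial in partial_counts:
        total.update(partial)
    return dict(total)


chunks = [
    ["Toyota Corolla", "Ford Ranger"],
    ["Toyota Hilux"],
    ["Ford Mustang", "Ford Territory"],
]
result = run(chunks, failed={"s2"})
print(result)
```

Even with server s2 marked as failed, every chunk still has a copy on another server, so the totals come out the same as if nothing had gone wrong.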

An open source implementation of this approach is available as an application called Hadoop, which makes it cheaper. It also runs on commodity server hardware, which further decreases the cost.

High-level programming languages have been developed to simplify the use of Hadoop. Some of these emulate more traditional languages such as SQL, which is the standard data manipulation language for relational applications. Supporting applications enable traditional relational data and unstructured data to be integrated and processed together by Hadoop.
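To give a feel for why an SQL-style language is simpler, the sketch below expresses the car totals as a single query. It uses SQLite, which ships with Python, purely as a stand-in: it is not Hadoop or one of the Hadoop languages, and the table and column names are assumptions made for this example.

```python
import sqlite3

# An in-memory SQLite database used only as a stand-in for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cars (manufacturer TEXT, model TEXT)")
conn.executemany(
    "INSERT INTO cars VALUES (?, ?)",
    [
        ("Toyota", "Corolla"),
        ("Toyota", "Hilux"),
        ("Ford", "Ranger"),
        ("Ford", "Mustang"),
        ("Ford", "Territory"),
    ],
)

# One declarative statement replaces hand-written Map and Reduce steps.
rows = conn.execute(
    "SELECT manufacturer, COUNT(*) FROM cars "
    "GROUP BY manufacturer ORDER BY manufacturer"
).fetchall()
conn.close()

print(rows)  # [('Ford', 3), ('Toyota', 2)]
```

The analyst states what totals are wanted and leaves it to the software to decide how the work is carried out.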

These developments have made available a low-cost parallel processing platform for cost-effective analysis of big data. Big data is commonly characterised by its volume, velocity (the rate at which it is gathered) and variety (the diversity of the data types and formats).

The output is analysed statistically in a search for patterns in the data, a discipline widely known as data analytics. Specialists in this type of analysis are usually referred to as data analysts or data scientists.

The patterns found introduce a new way of thinking: a pattern is accepted as existing even if there is no evidence of its cause. On this view the patterns exist and are of value, while the reason they exist is of no value and therefore of no interest.

Companies seeking to use this new technology alongside their traditional data warehousing and business intelligence technology have a choice of how they deploy it. Few are following the practice of Google or Yahoo by acquiring their own farm of servers to use in-house; for Google and Yahoo this is a core competency, not a supplementary capability. Most are outsourcing to providers who offer the resources they need on demand. These arrangements are called software as a service (SaaS), platform as a service (PaaS) or infrastructure as a service (IaaS), depending on the level of support required. The outsourced resources can be local or available over the Internet as a Cloud service.

The use of the Hadoop model of parallel processing is also being extended into other novel application areas. Companies will make use of Hadoop for more types of analysis as more complementary applications are developed. It is not intended to replace traditional transaction systems and should be regarded as complementary to them.

As with any technology, there are risks and benefits. Companies must adopt it at an enterprise level as part of their corporate data strategy. It must be aligned with business strategy as part of a change-managed programme that addresses cultural change, training and other issues inherent in acquiring this capability. External expertise is necessary to help realise the benefits and manage the risks. It is early days, but Hadoop is expected to become a mainstream corporate technology for Big Data and beyond.