Big data refers to things one can do at a large scale that cannot be done at a smaller one. Big data analysis typically draws on all available information, often billions of data points, to identify correlations which reveal new insights about human behaviour that are simply not available from smaller data sets.
Big data has emerged with the widespread digitisation of information, which has made it easier to store and process the ever-increasing volume of information available to us.
Big data is also dependent on the emergence of new data processing tools such as Hadoop, which are not based on the rigid hierarchies of the ‘analogue’ age, in which data was typically collected with specific purposes in mind. The rise of big data is likely to continue given that society is increasingly engaged in a process of ‘datafication’ – companies are steadily collecting data about all things under the sun.
Big data is also fundamentally related to the rise of large information technology companies, most obviously Google, Facebook and Amazon, who collect huge volumes of data and see that data as having an economic value.
A good example of ‘big data analysis’ is Google’s use of its search data to predict the spread of the H1N1 flu virus in 2009, based on the billions of search queries it receives every day. Google took the 50 million most common search terms, compared them with data from the CDC (Centers for Disease Control and Prevention), and found 45 search terms which correlated with the official figures on the spread of flu.
As a result, Google could track how the H1N1 virus was spreading in real time in 2009, without the reporting lag built into CDC data, which relies on people visiting doctors to report flu – a method which can only reveal the spread of flu some days after it has already happened.
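The core of this approach is simple: measure how strongly each search term tracks the official figures over time, and keep the terms that correlate. A minimal sketch of that idea in Python, using entirely invented weekly numbers (none of these figures come from Google or the CDC):

```python
import numpy as np

# Invented weekly data for illustration only: rows are weeks,
# columns are candidate search terms.
rng = np.random.default_rng(0)
cdc_flu_cases = np.array([120, 150, 300, 800, 1500, 1200, 600, 250])  # official weekly counts
n_terms = 5
search_volumes = rng.random((8, n_terms)) * 1000  # most terms: unrelated noise
# Make terms 0 and 2 roughly track the flu curve (plus noise)
search_volumes[:, 0] = cdc_flu_cases * 2.1 + rng.normal(0, 50, 8)
search_volumes[:, 2] = cdc_flu_cases * 0.9 + rng.normal(0, 40, 8)

def pearson(x, y):
    """Pearson correlation between two equal-length series."""
    x = (x - x.mean()) / x.std()
    y = (y - y.mean()) / y.std()
    return float(np.mean(x * y))

# Keep terms whose search volume correlates strongly with official figures
correlated_terms = [t for t in range(n_terms)
                    if pearson(search_volumes[:, t], cdc_flu_cases) > 0.9]
print(correlated_terms)
```

Once the correlated terms are known, their current search volumes can be read off in real time, days before doctors' reports reach the CDC – which is the whole point of the method.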
A second useful example is Oren Etzioni’s company Farecast, which grew to use 200 billion flight-price records to predict the best time for consumers to buy plane tickets. The technology he developed to crunch that data now forms the basis of sites such as Expedia.
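Farecast's actual models are proprietary, but the underlying question can be sketched very simply: given how prices on a route have historically moved as departure approaches, should a consumer buy now or wait? A toy Python version, with made-up prices and a deliberately crude rule:

```python
# Invented historical data: average ticket price observed N days before
# departure, pooled from past flights on the same route.
history = {60: 320, 45: 300, 30: 280, 21: 310, 14: 360, 7: 420, 3: 500}

def advise(days_to_departure, current_price):
    """Return 'wait' if history shows a cheaper window still ahead, else 'buy'."""
    # Prices historically seen closer to departure than we are now
    future_points = [p for d, p in history.items() if d < days_to_departure]
    if not future_points:
        return "buy"
    return "wait" if min(future_points) < current_price else "buy"

print(advise(40, 330))  # history shows a cheaper window ~30 days out -> 'wait'
print(advise(10, 350))  # inside 10 days prices only climb -> 'buy'
```

A real system would predict per-flight price trajectories from billions of records rather than a single pooled curve, but the output is the same kind of answer: not *why* prices move, just whether they are likely to rise or fall – correlation put to work.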
There are three shifts in information analysis that occur with big data:
- Big data analysts seek to use all available data rather than relying on sampling. This is especially useful for gaining insights into niche subcategories.
- Big data analysts give up on exactitude at the micro level to gain insight at the macro level – they look for the general direction rather than measuring exactly down to the single penny or inch.
- Big data analysis looks for correlations, not causation – it can tell us that something is happening rather than why it is happening.
Cukier uses two analogies to emphasise how different working with big data is from the ‘sampled data’ approach of the analogue age.
Firstly, he likens it to the shift from painting to film as a form of representation – the latter is fundamentally different from a still painting.
Secondly, he likens it to the fact that at the subatomic level matter behaves differently from how it does at the atomic level – a whole new system of laws seems to operate at that scale.
Big Data – don’t forget to be sceptical!
This post is only intended to provide a simple starting-point definition of big data, and the summary above is taken from a best-selling book on big data (source below). That book is strongly pro-big data – overwhelmingly in favour of it – so if you buy and read it, keep this in mind! Big data also has its critics, but more of that later.
Based on chapter 1 of Mayer-Schonberger and Cukier (2017) ‘Big Data: The Essential Guide to Work, Life and Learning in the Age of Insight’.