Big data will change the nature of social research – more data will do away with the need for sampling (and eradicate the biases that emerge with sampling); big data analysis will be messier, but this will lead to more insights and allow for greater depth of analysis; and finally it will move us away from a limiting hypothesis-led search for causality, to non-causal analysis based on correlation.
At least according to Mayer-Schönberger and Cukier (2017) Big Data: The Essential Guide to Work, Life and Learning in the Age of Insight.
A third of social science researchers are already working with Big Data
Below I outline how Mayer-Schönberger and Cukier think big data will change social research:
You might like to read my summary of the introduction to ‘Big Data’ first
The ability to collect and analyse large amounts of data in real time has many advantages:
It does away with the need for sampling, and all the problems that can emerge with biased sampling.
More data enables us to make accurate predictions down to smaller levels – as when Google Flu Trends was able to predict the spread of flu on a city-by-city basis across the USA.
It enables us to use outliers to spot interesting trends – for example credit card companies can use it to detect fraud if too many transactions for a particular type of card originate in one particular area.
When we use all the data, we are more likely to find things which we never expected to find…
Cukier uses Steven Levitt’s analysis of all the data from 11 years’ worth of Sumo bouts as a good example of the interesting insights to be gained through big data analysis.
A suitable analogy for big data may be the Lytro camera, which captures not just a single plane of light, as with conventional cameras, but rays from the entire light field… the photographer decides later on which element of light to focus on in the digital file, and he can reuse the same information in different ways.
One of the areas that is most dramatically being shaken up by big data is the social sciences, which have traditionally made use of sampling techniques. Social scientists’ near-monopoly on interpreting social data is likely to be broken by big data firms, and the old biases associated with sampling should disappear.
Albert-László Barabási examined social networks using logs of mobile phones from about one fifth of an unidentified European country’s population – the first analysis done on networks at the societal level using a dataset in the spirit of n = all. He found something unusual: if one removes people with lots of close links in the local area the societal network remains intact, but if one removes people with links outside their community, the social network degrades.
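Barabási’s finding can be illustrated with a toy sketch – entirely hypothetical nodes and links, not his dataset: two tightly knit communities stay connected only through a couple of ‘bridging’ members, so removing the bridges fragments the network, while removing a locally popular member does not.

```python
from collections import defaultdict, deque

def is_connected(nodes, edges):
    """Breadth-first search check that all surviving nodes reach each other."""
    if not nodes:
        return True
    adj = defaultdict(set)
    for a, b in edges:
        if a in nodes and b in nodes:
            adj[a].add(b)
            adj[b].add(a)
    start = next(iter(nodes))
    seen = {start}
    queue = deque([start])
    while queue:
        for n in adj[queue.popleft()]:
            if n not in seen:
                seen.add(n)
                queue.append(n)
    return seen == set(nodes)

# Two tight communities, linked to each other only via B1 and B2.
community_a = {"A1", "A2", "A3", "B1"}
community_b = {"C1", "C2", "C3", "B2"}
edges = [
    ("A1", "A2"), ("A2", "A3"), ("A1", "A3"),  # dense local links
    ("A1", "B1"), ("A2", "B1"),
    ("C1", "C2"), ("C2", "C3"), ("C1", "C3"),
    ("C1", "B2"), ("C2", "B2"),
    ("B1", "B2"),                              # the only inter-community tie
]
nodes = community_a | community_b

# Removing a locally well-connected node leaves the network intact...
print(is_connected(nodes - {"A3"}, edges))         # True
# ...but removing the bridging nodes splits it into two islands.
print(is_connected(nodes - {"B1", "B2"}, edges))   # False
```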
All other things being equal, big data is ‘messier’ than small data – because the more data you collect, the higher the chance that some of it will be inaccurate. However, the aggregate of all the data should provide more breadth and frequency of data than smaller data sets.
Cukier uses the analogy of measuring temperature in a vineyard to illustrate this – if we have just one temperature gauge, we have to make sure it is working perfectly, but if we have a thousand, we will have more errors, but a much wider breadth of data, and if we take measurements with greater frequency, we will have a more sensitive measurement of changes over time.
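The vineyard analogy is easy to simulate. A minimal sketch (the numbers are assumed for illustration: one precise gauge with small error versus a thousand cheap gauges with much larger error) shows that the aggregate of many messy readings still tracks the true temperature closely:

```python
import random
import statistics

random.seed(42)       # fixed seed so the simulation is repeatable
TRUE_TEMP = 20.0      # the 'real' vineyard temperature

# One carefully calibrated gauge: a single, fairly precise reading.
single_reading = TRUE_TEMP + random.gauss(0, 0.1)

# A thousand cheap gauges: each individual reading is far noisier,
# but the aggregate stays close to the true temperature.
noisy_readings = [TRUE_TEMP + random.gauss(0, 2.0) for _ in range(1000)]
aggregate = statistics.mean(noisy_readings)

print(round(single_reading, 2))
print(round(aggregate, 2))
```

The thousand-gauge setup also gives breadth the single gauge cannot: each noisy reading is tied to a location, so patterns across the vineyard become visible even though every individual number is less trustworthy.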
When using big data, analysts are generally happy sacrificing some accuracy for knowing the general trend – in the big data world, it is OK if 2+2 = 3.9.
More data is sometimes all we need for 100% accuracy: chess endgames with six or fewer pieces on the board, for example, have all been mapped out in their entirety, so once a game reaches that point a human will never be able to beat the computer.
The fact that messiness doesn’t matter that much is evidenced in Google’s success with its translation software – Google employed a relatively simple algorithm but fed it trillions of words from across the internet, all the messy data it could find – suggesting that simple models and lots of data trump smart models and less data.
We see messiness in action all over the internet – it lies in ‘tagging’ and likes being rounded up – none of this is precise, but it works, it provides us with usable information.
Ultimately big data means we are going to have to become happier with uncertainty.
It might be hard to fathom today, but when Amazon started up it actually employed book critics and editors to write reviews of books and make recommendations to customers.
Then the CEO Jeff Bezos had the idea of making specific recommendations to customers based on their individual shopping preferences, and employed someone called Greg Linden to develop a recommendation system – in 1998 he and his colleagues applied for a patent on ‘item-to-item’ collaborative filtering, which allowed Amazon to look for relationships between products.
As a result, Amazon’s sales shot up and they sacked the human advisors; today about a third of all its sales are based on its recommendation systems. Amazon was an early adopter of big data analytics to drive up sales, and today many other companies, such as Netflix, also use it as one of their primary methods to keep profits rolling in.
These companies don’t need to know why consumers like the products that they do, knowing that there’s a relationship between the products people like is enough to drive up sales.
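As a sketch of the general idea behind ‘item-to-item’ collaborative filtering – a toy version with made-up purchase baskets, not Amazon’s patented method – score each pair of products by how often they are bought together, then recommend the highest-scoring partners:

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical purchase histories: each set is one customer's basket.
baskets = [
    {"hamlet", "macbeth", "king lear"},
    {"hamlet", "macbeth"},
    {"hamlet", "king lear", "dune"},
    {"dune", "neuromancer"},
    {"dune", "neuromancer", "foundation"},
]

# Count how often each item, and each pair of items, appears.
co_counts = defaultdict(int)
item_counts = defaultdict(int)
for basket in baskets:
    for item in basket:
        item_counts[item] += 1
    for a, b in combinations(sorted(basket), 2):
        co_counts[(a, b)] += 1
        co_counts[(b, a)] += 1

def recommend(item, k=2):
    """Rank other items by their co-purchase rate with `item`."""
    scores = {
        other: co_counts[(item, other)] / item_counts[other]
        for other in list(item_counts)
        if other != item and co_counts[(item, other)]
    }
    return sorted(scores, key=scores.get, reverse=True)[:k]

print(recommend("hamlet"))
```

Note that nothing here asks *why* Shakespeare buyers also buy King Lear – the relationship between products is all the system needs, which is exactly the point the authors make above.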
Predictions and Predilections
In the big data world, correlations really shine – we can use them to gain more insights extremely rapidly.
At its core, a correlation quantifies the statistical relationship between two data values. A strong correlation means that when one of the data values changes, the other is highly likely to change as well.
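That definition can be made concrete with Pearson’s correlation coefficient, the standard measure of a linear statistical relationship between two data series (the figures below are made up purely for illustration):

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series:
    +1 is a perfect positive relationship, 0 none, -1 perfect negative."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

temps = [10, 14, 18, 22, 26]            # hypothetical daily temperatures
ice_cream_sales = [20, 28, 37, 44, 52]  # hypothetical sales figures

# A value close to 1: when temperature changes, sales almost always change too.
print(round(pearson(temps, ice_cream_sales), 3))
```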
Correlations let us analyse a phenomenon not by shedding light on its inner workings, but by identifying a useful proxy for it.
In the small data age, researchers needed to use hypotheses to select one or a handful of proxies to analyse, so hard statistical evidence on the relationship between variables was collected quite slowly; with the increase in computational power we no longer need hypothesis-driven analysis – we can simply analyse billions of data points and ‘stumble upon’ correlations.
In the big-data age we can use a data-driven approach to collecting data, and our results should be less biased and more accurate, and we should also be able to get them faster.
One example of where this data-driven approach has turned up strong big data correlations is Google’s flu predictions: we didn’t need to know in advance which search terms were the best proxy for ‘people with flu symptoms’ – the data simply showed us which ones were.
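A toy sketch of this ‘let the data choose the proxy’ approach (all numbers hypothetical – this is not Google’s data or method): correlate each candidate search term’s weekly volume against reported flu cases, and keep whichever term tracks them best.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var = math.sqrt(sum((x - mx) ** 2 for x in xs)
                    * sum((y - my) ** 2 for y in ys))
    return cov / var

flu_cases = [12, 30, 55, 80, 60, 25]  # hypothetical weekly case counts

# Hypothetical weekly search volumes for candidate proxy terms.
search_volumes = {
    "flu symptoms":    [10, 28, 50, 78, 58, 22],  # tracks cases closely
    "fever remedy":    [15, 25, 48, 70, 55, 30],  # tracks them fairly well
    "football scores": [40, 42, 39, 41, 43, 40],  # unrelated background noise
}

# No hypothesis needed: rank every term by its correlation with flu cases.
ranked = sorted(search_volumes,
                key=lambda term: pearson(search_volumes[term], flu_cases),
                reverse=True)
print(ranked[0])
```

At Google’s scale the same idea runs over millions of candidate terms rather than three, which is why no prior hypothesis about which terms matter is required.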
With correlations there is no certainty, only probability, but this can still provide us with actionable data, as with the case of Amazon above, and there are many other examples of where data driven big data analytics are changing our lives. (p56)
We can use correlations to predict the future – for example, Wal-Mart noticed a correlation between hurricanes and flashlight sales, but also Pop-Tarts sales, so when a hurricane is predicted, it moves the Pop-Tarts to the front of the store and further boosts its sales.
Probably the most notorious use of big data correlations to make predictions is by the American discount retailer Target, which uses its data on the products women buy as a proxy for pregnancy – women tend to buy unscented body lotions around the third month of pregnancy and then various vitamin supplements around the six-month mark – big data even allows predictions about the approximate birth date to be made!
Finding proxies in social contexts is only one way that big-data techniques are being employed – another use is through ‘predictive analytics’, which aims to foresee events before they happen.
One example of predictive analytics is the shipping company UPS using it to monitor its fleet of tens of thousands of vehicles – replacing parts just before they wear out, saving the company millions of dollars.
Another use is in health care – one piece of research by Dr Carolyn McGregor, with IBM, used 16 different data streams to track the vital signs of premature babies, and found a correlation between certain patterns in those readings and an infection occurring 24 hours later. Interestingly, this research found that an infant’s unusual stability was a predictor of a forthcoming infection, which flew in the face of conventional wisdom – again, we don’t know why this is, but the correlation was there.
Illusions and Illuminations
Big data also makes it easier to find more complex, non-linear relationships than when working within a hypothesis-limiting small data paradigm.
One example of a non-linear relationship uncovered by big data analysis is that between income and happiness: happiness increases with income up until about $30K per year, but then it levels out – once we have ‘enough’, adding on more money doesn’t make us any happier…
Big data also opens up more possibilities for exploring networks – by analyzing how ideas spread through the nodes of networks such as Facebook, for example.
In network analysis, it is very difficult to attribute causality, because everything is connected to everything else, and big data analysis is typically non-causal, just looking for correlations not ‘causation’.
Does big data mean the end of theory?
In 2008 Wired magazine’s editor-in-chief Chris Anderson argued that in the ‘Petabyte age’ we would be able to do away with theory – that correlation would be enough for us to understand reality – citing as examples Google’s search engine and gene sequencing, where huge amounts of data and applied mathematics simply replace every other tool that might be brought to bear.
However, this view is problematic because big data is itself founded on theory – it employs mathematical and statistical theories for example, and humans still select data, or at least the tools which select data, which in turn are often driven by convenience and economic concerns.
Having said that, Big Data does potentially move us away from theory and closer to empiricism than in the small data age.