Big data - ReviseSociology

Variables in quantitative research

What is the difference between interval/ ratio, ordinal, nominal and categorical variables? This post answers this question!

Interval/ ratio variables

Where the distances between the categories are identical across the range of categories.

For example, in question 2, the age intervals go up in years, and the distance between the years is the same between every interval.

Interval/ ratio variables are regarded as the highest level of measurement because they permit a wider variety of statistical analyses to be conducted.

There is also a difference between interval and ratio variables… the latter have a fixed zero point.

Ordinal variables

These are variables that can be rank ordered but the distances between the categories are not equal across the range. For example, in question 6, the periods can be ranked, but the distances between the categories are not equal.

NB if you choose to group an interval variable like age in question 2 into groups (e.g. 20 and under, 21-30, 31-40 and so on) you are converting it into an ordinal variable.
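The conversion described in the NB above can be sketched in a few lines of Python. This is only an illustration: the band boundaries follow the example in the text, while the top band ‘41 and over’ and the function name age_band are assumptions added for the sketch.

```python
from bisect import bisect_left

# Upper bounds of the bands from the text; "41 and over" is an assumed top band.
UPPER_BOUNDS = [20, 30, 40]
BAND_LABELS = ["20 and under", "21-30", "31-40", "41 and over"]


def age_band(age: int) -> str:
    """Convert an exact age (interval/ratio) into a ranked band (ordinal)."""
    # bisect_left returns the index of the first upper bound that is >= age,
    # or len(UPPER_BOUNDS) when the age is above every bound.
    return BAND_LABELS[bisect_left(UPPER_BOUNDS, age)]


ages = [17, 20, 21, 35, 52]
bands = [age_band(a) for a in ages]
```

The bands can still be rank ordered, but the exact distance between two respondents is lost: ages 17 and 20 fall into the same band even though they are three years apart.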

Nominal or categorical variables

These consist of categories that cannot be rank ordered. For example, in questions 7-9, it is not possible to rank subjective responses of respondents here into an order.

Dichotomous variables

These variables contain data that have only two categories – e.g. ‘male’ and ‘female’. Their relationship to the other types of variable is slightly ambiguous. In the case of question one, this dichotomous variable is also a categorical variable. However, some dichotomous variables may be ordinal variables as they could have one distinct interval between responses – e.g. a question might ask ‘have you ever heard of Karl Marx’ – a yes response could be regarded as higher in rank order to a no response.

Multiple-indicator measures such as Likert Scales provide, strictly speaking, ordinal variables. However, many writers argue they can be treated as though they produce interval/ ratio variables if they generate a large number of categories.

In fact Bryman and Cramer (2011) make a distinction between ‘true’ interval/ ratio variables and those generated by Likert Scales.

A flow chart to help define variables

*A nominal variable – aka categorical variable!

Questionnaire Example

This section deals with how different types of question in a questionnaire can be designed to yield different types of variable in the responses from respondents.

If you look at the example of a questionnaire below, you will notice that the information you receive varies by question.

Some of the questions ask for answers in terms of real numbers, such as question 2, which asks ‘how old are you’, or questions 4, 5 and 6, which ask students how many hours a day they spend doing sociology class work and homework. These will yield interval variables.

Some of the questions ask for either/ or answers or yes/ no answers and are thus in the form of dichotomies. For example, question 1 asks ‘are you male or female’ and question 10 asks students to respond ‘yes’ or ‘no’ to whether they intend to study sociology at university. These will yield dichotomous variables.

The rest of the questions ask the respondent to select from lists of categories:

The responses to some of these list questions can be rank ordered – for example in question 6, once a day is clearly more than once a month! Responses to these questions will yield ordinal variables.

Some other ‘categorical list’ questions yield responses which cannot be ranked in order – for example it is impossible to say that studying sociology because you find it generally interesting is ranked higher than studying it because it fits in with your career goals. These will yield categorical variables.

These different types of response correspond to the four main types of variable above.

Knowing Capitalism and Lively Data


Nigel Thrift (2005) developed the concept of ‘knowing capitalism’ to denote a new form of global economy which depends not only on technologies which generate large amounts of digital data, but also on the commodification of that data: a big data economy in which power operates through modes of communication, and

Digital data have become especially valuable as forms of knowledge, especially when they are aggregated into big data sets, and are seen as having huge potential to offer new insights into a range of human behaviours, and to disrupt various industries: from health care to education.

One key change in the age of ‘knowing capitalism’ is that there has been a shift from commodifying workers’ physical labour to profiting from information collected on people’s preferences – which online users willingly give when they create and upload digital content online, download and use geolocation apps, shop online, and like various content.

In this digital age, prosumption is the new norm – people simultaneously consume and generate online content. In commercial circles, the user of online technologies is ‘the product’, because the information they give off when online is so valuable.

This is why so many applications, such as Facebook, are free to use – because they are really just platforms to harvest valuable data (why charge?)… and the four big tech companies exercise huge power by virtue of the sheer amount of big data they have already collected, and continue to collect, on their users.

Central to portrayals of the digital data economy is the idea that digital data are lively, mutable, and hybrid. Metaphors of liquidity are very commonly used:

Flows
Streams

Rivers
Floods
Tsunamis

In the digital data economy flows of information are generated and engage in non-linear movement, and according to Thrift (2014) new hybrid beings emerge with the mixture of data, objects and bodies… and bodies and identities are fragmented and reassembled through a process of reconfiguration.

Furthermore, digital data and the algorithmic analytics used to interpret them are beginning to have determining effects on people’s lives, influencing their life chances and opportunities.

There is a mobile dimension to how we interact with data too.

Data can become stuck, for example when a company hoards it, or when people do not know how to use it!

Data materialisations constitute an important dimension of knowing capitalism – data is lively, in flux, but it needs to be frozen to be used – in 2D (infographics) or 3D… through printers.

Where 2D data visualisations are concerned, a lot of emphasis is placed on their aesthetic quality, and how the meaning of the data is structured. Behind this process lie decisions about what to include and what to exclude, and limitations on what can be shown due to the software used. Thus there are many contingencies framing the way we understand big data in knowing capitalism!

Sources

Summarised from:

Lupton, Deborah (2017) The Quantified Self, Polity

Sociomaterial Perspectives on the self in digital networks

Sociomaterial perspectives hold that datafication via digital devices (both personal and public) is fundamentally intertwined with the way we construct our identities and ‘practice selfhood’, so much so that it is more accurate to say that today we ‘live in media’ rather than ‘we live with media’.

The most obvious manifestation of the intertwining of digital technologies, datafication and selfhood is our extensive use of mobile phones, tablets and laptops: not only do we rely on these devices for information, we also use them (sometimes consciously, sometimes not) to continually upload information about ourselves to the net.

And even if we choose to reduce our use of such technologies, or live without them altogether, our sense of self will still be partially governed by digital technology because so much of public life and public space is informed by its use.

Sociomaterial perspectives on human action are strongly influenced by actor-network theory and take our extensive use of digital technologies into account by focussing on the way that humans interact with non-human material objects such as computers in heterogeneous and diverse networks.

This approach sees objects as agents within a network, able to exert influence on humans, and it is interested in how things and meanings interrelate. It also takes account of how factors such as class, gender and ethnicity influence the context of a relational network.

Sociomaterial perspectives also recognize that there is a complex ‘web’ of interaction which lies beyond (or behind) technologically mediated networks: programmers, marketers etc, and (importantly I think) that the technologies and software which govern action within a network are themselves the product of human interactions (and thus values).

This perspective offers a useful response to post-structuralism which focuses purely on discourses and meanings, which are largely seen as floating free from the material context of action.

More specifically the sociomaterial perspective on understanding selfhood in a digital age focuses on:

How people experience technologies

How technologies are incorporated into people’s senses of self, and how they extend their sense of self
How social relations are configured through such networks.

Assemblages

The concept of assemblage is often used in the sociomaterialism literature. An assemblage is configured when humans, nonhumans, practices, ideas and discourses come together in a complex system. With digital systems, an assemblage will consist of the following:

Computer software and hardware
Developers
Manufacturers and retailers

Software coders
Algorithms
Computer servers and archives

The computing cloud
Platforms and social media

According to the sociomaterial perspective, individuals are ‘entangled’ in such assemblages – and understanding these entanglements is a complex business, precisely because these assemblages are complex – there are a lot of human and non-human actors involved.

Within these assemblages, humans can imbue objects (such as their phones) with biological meaning, and understanding these meanings is key to understanding human action, but humans are also changed by all of the above ‘objects’ (along with the other actual humans) which make up the assemblage in which an individual acts.

Turkle (2007) for example calls mobile devices ‘evocative objects’ because they are basically repositories of ourselves – we have so much information stored on them!

Kitchin and Dodge (2011) use the term code/space to denote the ways in which software and devices such as mobile phones and sensors are configuring concepts of space and identity – our devices may even govern our access to certain spaces (think e-tickets), and because our behaviour can be tracked through them, we can also be nudged, or disciplined, into certain ways of acting via our technologies.

Sources and Notes

This is my summary of part one of chapter two of my current January 2018 read:

Lupton, Deborah (2017) The Quantified Self, Polity

This kind of theory should hit A-level sociology about 2035, about 2 years before the cyborgs take over once and for all.

Big Data: Controlling its Use

Changes in the way we interact and communicate lead to changes in the way we govern ourselves. Just as the invention of the printing press resulted in the evolution of copyright and libel laws, so the emergence of big data will result in new laws to govern the new ways in which this information is collected, analysed and utilized.

In this final chapter of the main section of Viktor Mayer-Schonberger and Kenneth Cukier’s (2017) ‘Big Data: The Essential Guide to Work, Life and Learning in the Age of Insight’, the authors suggest four ways in which we might control the use of Big Data in the coming years…

Firstly, Cukier suggests we will need to move from ‘privacy by consent’ to ‘privacy by accountability’. Because old consent-based privacy laws don’t work in the big data age (See here for why), we will effectively have to trust companies to make informed judgments about the risks of re-purposing the data they hold. If they deem there to be an element of risk of harm to people, they may have to administer a second round of ‘consent of use’; if the risk is very small, they can just go ahead and use it.

It is also possible to deliberately blur data so that it becomes fuzzy and you cannot see individuals in it – so you can set analytical programmes to return aggregate results only – an approach known as differential privacy.
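To give a rough idea of what ‘blurring’ an aggregate result can look like, here is a minimal Python sketch. It uses Laplace noise, which is one common way of doing this but is not something the book spells out; the function name noisy_count, the value of epsilon and the count of 120 are all hypothetical.

```python
import random


def noisy_count(true_count: int, epsilon: float, rng: random.Random) -> float:
    """Return a count blurred with Laplace noise of scale 1/epsilon.

    Adding or removing one person changes a count by at most 1, so the
    noise scale is 1/epsilon (smaller epsilon means more blur).
    """
    scale = 1.0 / epsilon
    # The difference of two independent exponential draws (each with mean
    # `scale`) follows a Laplace distribution centred on zero.
    noise = rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)
    return true_count + noise


rng = random.Random(0)  # fixed seed so the sketch is repeatable
# Hypothetical aggregate: 120 people in a data set share some attribute.
answers = [noisy_count(120, epsilon=0.5, rng=rng) for _ in range(20000)]
average_answer = sum(answers) / len(answers)
```

Each individual answer is fuzzy, so one person's presence in the data is hard to detect, but the answers still centre on the true aggregate.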

Comment: NB – this sounds dubious – we just trust companies more… the problem here being that we can only really trust them to do one thing – put their profits before everything else, including people’s privacy rights.

Secondly, we will also need to ensure that we do not judge people based on propensity by aggregate. In the big data era of justice, we need to hold people to account for their individual actions – i.e. for what they have actually done as individuals, rather than what the big data says people like them are likely to do.

Comment: NB – all he seems to be saying here is that we carry on doing what we already do (in most cases at least!)

Thirdly (which stems from the problem that big data can be something of a ‘black box’ – that is to say the number of variables which go into making up predictions and the algorithms which calculate them defy ordinary human understanding) – we will need a new series of experts called algorithmists to be on hand to analyse big data findings if and when individuals feel wronged by them. Cukier argues that these will take a ‘vow of impartiality’ in monitoring and reviewing the accuracy of big data predictions, and sees a role for both internal and external algorithmists.

Comment: this doesn’t half sound like something Auguste Comte, the founding father of Positivism, would say!

Cukier argues this is just the same as new specialists emerging in law, medicine and computer security as these fields developed in complexity.

Fourthly and finally, Cukier suggests we will need to develop some sort of new anti-trust laws to ensure that one company does not come to have a monopoly on data.

Comment: Fair enough!

Overall Comment

I detect a distinct pro-market tone in the authors’ analysis of big data – basically we trust companies to use it (but avoid monopoly power), but we mistrust governments – precisely what you’d expect from the Silicon Valley set!

The Risks of Big Data

There are three main risks of Big Data:
the paralysis of privacy,
punishment through propensity,
the fetishization of and dictatorship through data


Here I continue my summary of Mayer-Schonberger and Cukier (2017) ‘Big Data: The Essential Guide to Work, Life and Learning in the Age of Insight’.

Three Risks of Big Data

Firstly, simply because so much data is collected on individuals – not only via state surveillance but also via Amazon, Google, Facebook and Twitter – protecting privacy is more difficult, especially when so much of that data is sold on to be analysed for other purposes.

Secondly, there is the possibility of penalties based on propensities – the possibility of punishing people even before they have done anything wrong.

Finally, we have the possibility of a dictatorship of data – whereby information becomes an instrument of the powerful and a tool of repression.

Paralyzing Privacy

The value of big data lies in its reuse, quite possibly in ways that have not been imagined at the time of collecting it. In terms of personal information, if we are to re-purpose people’s personal data then they cannot give informed consent in any meaningful sense of the phrase – because in order to do so you need to know what data a company is collecting and what use they are going to put it to.

The only way big data can work is for companies to ask customers to agree to have their data collected ‘for any purpose’, which undermines the concept of informed consent.

There are still possible ways to protect privacy – for example opting out and anonymisation.

Opting out is simply where some individuals choose not to have their data collected – however, opting out can itself identify certain things about the users – for example, when certain people opted out of Google’s Street View and their houses were blurred – they were still noticeable as people who had ‘opted out’ (and thus maybe had more valuable stuff to steal!)

Anonymisation is where all personal identifiers are stripped from data – such as national insurance number, date of birth and so on, but here people can still be identified – when AOL released its data set of 20 million search queries from over 650K users in 2006, researchers were able to pick individual people out – simply by looking at the content of searches they could deduce that someone was single, female, lived in a certain area, purchased certain things – then it’s just a matter of cross referencing to find the particular individual.

In 2006 Netflix released over 100 million rental records of half a million users – again anonymised, and again researchers managed to identify one specific lesbian living in a conservative area by comparing the dates of movies rented with her entries on the IMDb.

Big data, it appears, aids de-anonymisation because we collect more data and we combine more data.
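The ‘cross referencing’ step can be illustrated with a toy Python sketch. All of the records, names, areas and attribute choices below are invented for the illustration and have nothing to do with the real AOL or Netflix data; the point is only to show how combining two data sets can single people out.

```python
# Hypothetical "anonymised" release: names removed, other attributes kept.
anonymised_searches = [
    {"user": "u1", "area": "Northtown", "sex": "female", "status": "single"},
    {"user": "u2", "area": "Northtown", "sex": "male", "status": "married"},
    {"user": "u3", "area": "Southville", "sex": "female", "status": "single"},
]

# Hypothetical second data set that does carry names.
public_records = [
    {"name": "Person A", "area": "Northtown", "sex": "female", "status": "single"},
    {"name": "Person B", "area": "Northtown", "sex": "male", "status": "married"},
    {"name": "Person C", "area": "Southville", "sex": "female", "status": "single"},
    {"name": "Person D", "area": "Southville", "sex": "female", "status": "single"},
]

# Attributes that are not names but narrow down who someone is.
QUASI_IDENTIFIERS = ("area", "sex", "status")


def reidentify(anonymised, named):
    """Map each anonymised user to a name when exactly one named record matches."""
    matches = {}
    for row in anonymised:
        key = tuple(row[q] for q in QUASI_IDENTIFIERS)
        candidates = [
            person["name"]
            for person in named
            if tuple(person[q] for q in QUASI_IDENTIFIERS) == key
        ]
        if len(candidates) == 1:  # a unique match singles out one person
            matches[row["user"]] = candidates[0]
    return matches


identified = reidentify(anonymised_searches, public_records)
```

In this toy data two of the three ‘anonymous’ users are picked out. The third is protected only because two named people share her attributes, and every extra attribute that is collected makes that kind of overlap less likely.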

Of course it’s not just private companies collecting data… it’s the government too. The U.S. collects an enormous amount of data – amounts that are unthinkably large – and today it is possible to tell a lot about people by looking at how they are connected to others.

Probability and Punishment

This section starts with a summary of the introductory scene of Minority Report…

We already see the seeds of this type of pre-crime control through big data:

Parole boards in more than half the states of the US use big data predictions to inform their parole decisions.

A growing number of precincts use ‘Predictive Policing’ – using big data analysis to select which streets to patrol and which individuals to harass.

A research project called FAST – Future Attribute Screening Technology – tries to identify potential terrorists by monitoring people’s vital signs.

Cukier now outlines the argument for big-data profiling – mainly pointing out that we’ve taken steps to prevent future risks for years (e.g. seat-belts) and we’ve profiled for years with small data (insurance!) – the argument for big data profiling is that it allows us to be more granular than previously – we can make our profiling more individualised – thus there’s no reason to stop every Arab man under 30 with a one way ticket from boarding a plane, but if that man has done a-e also, then there is a reason.

However, there is a fundamental problem with punishing people based on big data: it undermines the very foundations of justice – individual choice and responsibility – by disallowing people choice. Big data predictions about parole reoffending are accurate 75% of the time, which means that if we use the profiling 100% of the time we are wrongly punishing 1 in 4 people.
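The arithmetic behind the ‘1 in 4’ figure can be checked with a short Python sketch. It takes the 75% figure from the text at face value and makes the simplifying assumption that every wrong prediction leads to a wrong punishment; the function name expected_wrong and the figure of 1,000 decisions are hypothetical.

```python
from fractions import Fraction

accuracy = Fraction(75, 100)  # figure quoted in the text
error_rate = 1 - accuracy     # share of predictions that are wrong


def expected_wrong(decisions: int) -> Fraction:
    """Expected number of wrong predictions among `decisions` cases."""
    return decisions * error_rate


wrong_per_thousand = expected_wrong(1000)
```

So for every 1,000 parole decisions made purely on the prediction, around 250 would be expected to be wrong.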

Dictatorship of Data

The problem with relying on data to inform policy decisions is that the underlying quality of data can be poor – it can be biased, mis-analysed or used misleadingly. It can also fail to capture what it is actually supposed to measure!

Education is a good example of a sector which is governed by endless testing – which only measures a sliver of intelligence – the ability to demonstrate knowledge (predetermined by a curriculum) and show analytical and evaluative skills as an individual, in written form, all under timed conditions.

Google, believe it or not, is an example of a company that in the past has been paralysed by data – in 2009 its top designer, Douglas Bowman, resigned because he had to prove whether a border should be 3, 4 or 5 pixels wide, using data to back up his view. He argued that such a dictatorship by data stifled any sense of creativity.

The problem with the above, in Steve Jobs’ words: ‘it isn’t the consumers’ job to know what they want’.

In his book Seeing Like a State, the anthropologist James Scott documents the way in which governments make people’s lives a misery by fetishizing quantitative data: they use maps to reorganise communities rather than asking people on the ground, for example.

The problem we face in the future is how to harness the utility of big data without becoming overly reliant on its predictions.

The Big Data Value Chain

There are three types of company in the big-data value chain: the companies who collect the data, data-analytics companies, and data-ideas companies. This new ‘organisational landscape’ will change the power-relations between businesses enormously, at least according to Viktor Mayer-Schonberger and Kenneth Cukier (2017) in ‘Big Data: The Essential Guide to Work, Life and Learning in the Age of Insight’.

‘Pure’ data companies are those which have the data, or at least access to it, but do not necessarily have the right skills to extract the value from the data. A good example of such a company is Twitter, which has masses of data but licences it out to independent firms to use.

Data analytics companies are those with the statistical, programming, and communication skills necessary to mine insights from data – Teradata is a good example of such a company.

Finally there are those companies with the ‘big-data mindset’ whose founders and employees have unique ideas about how to unlock and combine data to find new forms of value – for example, Pete Warden, the co-founder of Jetpac, which makes travel recommendations based on the photos users upload to the site.

Data analytics has recently been touted as being in the ‘prime position’ in the big-data value chain: there has been a lot of recent talk of the shortage of ‘data scientists’ in the age of ever increasing amounts of data… The McKinsey Global Institute has talked about this for example, and Google’s chief economist Hal Varian famously called statistician the ‘sexiest job around’.

We have been given the impression that we are wallowing in data, but lack sufficient people with the skills to mine this data.

Cukier, however, thinks such claims are exaggerated because it is likely that this skills gap will close. Interestingly, in a recent talk on big data science, this view also seemed to be the consensus.

He predicts that what is more likely to happen is that firms controlling access to the data will start to charge more for it, and big data innovators will be where the real money is…

Hybrid Data Companies

Companies such as Google and Amazon stretch across all three links in the data value chain. Google collects data like search-query typos, uses it to create a spell-checker and employs people in-house to do the analytics. Such vertical integration is no doubt precisely why Google is today one of the world’s largest companies.

The New Data Intermediaries

Cukier also predicts that there are certain business sectors which will benefit from giving their data to third parties, because keeping it in-house will not be as beneficial to them as sharing their data and combining it with others – third parties are needed to facilitate trust – for example, travel firms will benefit from such an arrangement, not to mention the banking and finance sectors – where more data is better.

The Demise of the Expert

Cukier also predicts that big data analytics will see specialists in different fields being replaced with those with data-science skills able to manage whatever field based on data. He argues that ‘mathematics, statistics, perhaps with a sprinkling of programming and network science, will be as foundational to the modern workplace as numeracy was a century ago and literacy before that’.

Big Winners, Medium Sized Losers

Large data companies such as Google and Amazon will continue to soar, but big data presents a challenge to the victors of small-world data such as Walmart, Nestle, Boeing…. How these will adapt remains to be seen.

There are, of course, opportunities for ‘smart and nimble start-ups’, but also individuals might start to sell their own data, possibly through new third party firms.

Problems with the fusion of big data and education

The first problem is that it will be more difficult for us to forget and escape our past….

While we as individuals grow, evolve and change, comprehensive educational data collected through the years remains unchanged – the problem is that as the amount of data collected on us through our formative years increases, we might be judged in the future by this historic data – creating a kind of ‘permanence of the past’.

Our historic data record might show a future employer that we were enrolled in a remedial math class in our first year of university, and this fact alone might put them off calling us for interview, even if our maths has evolved in the intervening years, which means we might not get credit for how we have evolved in our later years.

The problem with data is that it is unlikely to tell anyone about the context in which it takes place – if test scores are low during particular years, for example, the data alone is unlikely to tell us what was going on more broadly in our lives at that time – unlike today, when we can effectively forget low-periods in our lives, in the forthcoming age of big data, they will always be on display for anyone to scrutinise, without access to the more in-depth context.

Employers already track Facebook posts; if there is more educational data, then they might well delve into that too.

A second problem is that our big data record might fix our future…

Today schools make predictions based on ‘small data’, yet students can argue against the paths suggested by such small data (GCSEs etc) because it is precisely that, small, collected at only a few points in time, clearly not telling the whole story.

In the Big data age, however, predictions based on more data may become so accurate that they lock students into educational tiers of particular programmes of study – some universities are already experimenting with ‘e-advisors’ – since Arizona State University implemented such a system in 2007, the proportion of students moving on from one year to the next has increased from 77% to 84%… In future these systems may evolve to advise students against, or prevent them from, undertaking particular courses of study deemed to be too difficult for them.

This may lock-in students to pre-determined study and career paths, which may have a detrimental effect on equality of opportunity.

A third problem, largely dismissed by Cukier, is that the fusion between big data and educational institutions will only work if students and parents consent to tech companies having access to their children’s private data. For some reason he cannot see the problems with this, which suggests more than anything else he’s an industry-insider.

How will Big Data Change Social Research?

Big data will change the nature of social research – more data will do away with the need for sampling (and eradicate the biases that emerge with sampling); big data analysis will be messier, but this will lead to more insights and allow for greater depth of analysis; and finally it will move us away from a limiting hypothesis-led search for causality, to non-causal analysis based on correlation.

At least according to Mayer-Schonberger and Cukier (2017) Big Data: The Essential Guide to Work, Life and Learning in the Age of Insight.

Big Data Research — A third of social science researchers are already working with Big Data

Below I outline how Cukier thinks big data will change social research:

You might like to read my summary of the introduction to ‘Big Data’ first

More Data

The ability to collect and analyse large amounts of data in real time has many advantages:

It does away with the need for sampling, and all the problems that can emerge with biased sampling.

More data enables us to make accurate predictions down to smaller levels – as with the case of Google’s flu predictions being able to predict the spread of flu on a city by city basis across the USA.

It enables us to use outliers to spot interesting trends – for example credit card companies can use it to detect fraud if too many transactions for a particular type of card originate in one particular area.

When we use all the data, we are more likely to find things which we never expected to find…

Cukier uses Steven Levitt’s analysis of all the data from 11 years’ worth of Sumo bouts as a good example of the interesting insights to be gained through big data analysis.

A suitable analogy for big data may be the Lytro camera, which captures not just a single plane of light, as with conventional cameras, but rays from the entire light field… the photographer decides later on which element of light to focus on in the digital file…. And he can reuse the same information in different ways.

One of the areas that is most dramatically being shaken up by big data is the social sciences, which have traditionally made use of sampling techniques. This monopoly is likely to be broken by big data firms and the old biases associated with sampling should disappear.

Albert-Laszlo Barabasi examined social networks using logs of mobile phones from about one fifth of an unidentified European country’s population – which was the first analysis done on networks at the societal level using a dataset in the spirit of n = all. They found something unusual – if one removes people with lots of close links in the local area the societal network remains intact, but if one removes people with links outside their community, the social network degrades.

Messier

All other things being equal, big data is ‘messier’ than small data – because the more data you collect, the higher the chance that some of it will be inaccurate. However, the aggregate of all the data should provide more breadth and frequency of data than smaller data sets.

Cukier uses the analogy of measuring temperature in a vineyard to illustrate this – if we have just one temperature gauge, we have to make sure it is working perfectly, but if we have a thousand, we will have more errors, but a much wider breadth of data, and if we take measurements with greater frequency, we will have a more sensitive measurement of changes over time.
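The vineyard analogy can be simulated in a few lines of Python. This is only a sketch under stated assumptions: the ‘true’ temperature, the size of the sensor errors and the fixed random seed are all invented, and the errors are assumed to be random rather than biased in one direction.

```python
import random
import statistics

rng = random.Random(42)   # fixed seed so the sketch is repeatable
TRUE_TEMPERATURE = 18.0   # hypothetical "real" vineyard temperature in °C


def noisy_reading(error_sd: float) -> float:
    """One reading from an imperfect gauge with random, unbiased error."""
    return rng.gauss(TRUE_TEMPERATURE, error_sd)


# A thousand cheap gauges, each of which can be a couple of degrees out.
cheap_readings = [noisy_reading(2.0) for _ in range(1000)]

largest_single_error = max(abs(r - TRUE_TEMPERATURE) for r in cheap_readings)
error_of_average = abs(statistics.fmean(cheap_readings) - TRUE_TEMPERATURE)
```

Individual readings can be several degrees out, but their average lands very close to the true value. Note that this only holds for random error: if every gauge read too high, more gauges would not help.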

When using big data, analysts are generally happy sacrificing some accuracy for knowing the general trend – in the big data world, it is OK if 2+2 = 3.9.

More data is sometimes all we need for 100% accuracy, for example chess games with fewer than 6 pieces on the board have all been mapped out in their entirety, thus a human will never be able to beat a computer again once this point has been reached.

The fact that messiness doesn’t matter that much is evidenced in Google’s success with its translation software – Google employed a relatively simple algorithm but fed it trillions of words from across the internet – all of the messy data it could find – this suggests that simple models and lots of data trump smart models and less data.

We see messiness in action all over the internet – it lies in ‘tagging’ and likes being rounded up – none of this is precise, but it works, it provides us with usable information.

Ultimately big data means we are going to have to become happier with uncertainty.

Correlation

It might be hard to fathom today, but when Amazon started up it actually employed book critics and editors to write reviews of books and make recommendations to customers.

Then the CEO Jeff Bezos had the idea of making specific recommendations to customers based on their individual shopping preferences and employed someone called Greg Linden to develop a recommendation system – in 1998 he and his colleagues applied for a patent on ‘item to item’ collaborative filtering – which allowed Amazon to look for relationships between products.

As a result, Amazon’s sales shot up, they sacked the human advisors, and today about a third of all its sales are based on its recommendation systems. Amazon was an early adopter of big data analytics to drive up sales, and today many other companies such as Netflix also use it as one of the primary methods to keep profits rolling in.

These companies don’t need to know why consumers like the products that they do: knowing that there’s a relationship between the products people like is enough to drive up sales.

Predictions and Predilections

In the big data world, correlations really shine – we can use them to gain more insights extremely rapidly.

At its core, a correlation quantifies the statistical relationship between two data values. A strong correlation means that when one of the data values changes, the other is highly likely to change as well.
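
One standard way of putting a number on this is Pearson's correlation coefficient, which runs from -1 (a perfect negative relationship) through 0 (no linear relationship) to +1 (a perfect positive relationship). The sketch below computes it by hand; the revision hours and test scores are invented for the example.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson's correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # How far the two variables move together around their means...
    covariation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # ...scaled by how much each one varies on its own.
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return covariation / sqrt(spread_x * spread_y)

# Invented data: hours of revision and test scores for five students.
hours_revised = [1, 2, 3, 4, 5]
test_score = [52, 55, 61, 64, 70]

r = pearson_r(hours_revised, test_score)
print(f"r = {r:.2f}")
```

Here r comes out close to +1: students who revised more scored higher. The figure describes the strength of the relationship in this data and says nothing about why it exists.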

Correlations let us analyse a phenomenon not by shedding light on its inner workings, but by identifying a useful proxy for it.

In the small data age, researchers needed to use hypotheses to select one or a handful of proxies to analyse, and hence hard statistical evidence on the relationship between variables was collected quite slowly; with the increase in computational power we don’t need hypothesis-driven analysis, we can simply analyse billions of data points and ‘stumble upon’ correlations.

In the big-data age we can use a data-driven approach to collecting data, and our results should be less biased and more accurate, and we should also be able to get them faster.

One example of where this data-driven approach has been applied, and strong big data correlations found, is the case of Google’s flu predictions. We didn’t need to know in advance which flu search terms were the best proxy for ‘people with flu symptoms’; in this case, the data simply showed us which search terms were the best proxies.
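
The 'let the data pick the proxy' idea can be sketched as follows: take a pile of candidate search terms, measure how strongly each one tracks the thing we care about, and rank them. All of the figures below are invented and the three search terms are hypothetical; this is not Google's actual method or data.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson's correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    covariation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return covariation / sqrt(spread_x * spread_y)

# Invented weekly figures for recorded flu cases.
flu_cases = [10, 14, 25, 40, 32, 18]

# Invented weekly search counts for three hypothetical search terms.
search_counts = {
    "cough medicine":  [100, 135, 250, 410, 320, 175],
    "football scores": [500, 480, 510, 495, 505, 490],
    "sun cream":       [300, 260, 180, 90, 150, 240],
}

# No hypothesis about which term 'should' matter: just rank every candidate
# by the strength (absolute value) of its correlation with flu cases.
ranked = sorted(
    search_counts,
    key=lambda term: abs(pearson_r(search_counts[term], flu_cases)),
    reverse=True,
)

for term in ranked:
    print(f"{term}: r = {pearson_r(search_counts[term], flu_cases):+.2f}")
```

In this made-up data 'sun cream' is strongly negatively correlated with flu, so it counts as a usable proxy too, while 'football scores' trails well behind the other two.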

With correlations there is no certainty, only probability, but this can still provide us with actionable data, as with the case of Amazon above, and there are many other examples of where data driven big data analytics are changing our lives. (p56)

We can use correlations to predict the future – for example, Wal-Mart noticed a correlation between hurricanes and sales of flashlights, but also of Pop-Tarts, so when a hurricane is predicted, it moves the Pop-Tarts to the front of the store and further boosts its sales.

Probably the most notorious use of big data correlations to make predictions is the American discount retailer, Target, who use their data on the products women buy as a proxy for pregnancy – women tend to buy non scented body lotions around the third month of pregnancy and then various vitamin supplements around the 6 month mark – big data even allows predictions about the approximate birth date to be made!

Finding proxies in social contexts is only one way that big-data techniques are being employed – another use is through ‘predictive analytics’, which aims to foresee events before they happen.

One example of predictive analytics is the shipping company UPS using them to monitor its fleet of 10s of 1000s of vehicles – to replace parts just before they wear out, saving them millions of dollars.

Another use is in health care – one piece of research by Dr Carolyn McGregor, with IBM, used 16 different data streams to track the stats of premature babies – and found that there was a correlation between certain stats and an infection occurring 24 hours later. Interestingly this research found that an infant’s stability was a predictor of a forthcoming infection, which flew in the face of convention – again we don’t know why this is, but the correlation was there.

Illusions and Illuminations

Big data also makes it easier to find more complex, non-linear relationships than when working within a hypothesis-limiting small data paradigm.

One example of a non-linear relationship uncovered by big data analysis is that of the relationship between income and happiness – that happiness increases with income up until about $30K per year, but then it levels out – once we have ‘enough’, adding on more money doesn’t make us any happier…
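
A 'rises, then levels out' relationship like this can be written down as a very simple toy model. The $30K threshold is the figure quoted above; the straight-line rise below the threshold and the 0 to 10 happiness scale are assumptions made purely for illustration.

```python
def happiness(income, threshold=30_000, max_score=10.0):
    """Toy model: happiness rises with income up to a threshold, then stays flat."""
    return max_score * min(income, threshold) / threshold

for income in (10_000, 20_000, 30_000, 60_000, 120_000):
    print(f"${income:>7,}: {happiness(income):.1f}")
```

A single straight line fitted through data shaped like this would understate the effect of income for poorer people and overstate it for richer people, which is why spotting the bend matters.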

Big data also opens up more possibilities for exploring networks – by analyzing how ideas spread through the nodes of networks such as Facebook, for example.

In network analysis, it is very difficult to attribute causality, because everything is connected to everything else, and big data analysis is typically non-causal, just looking for correlations not ‘causation’.

Does big data mean the end of theory?

In 2008 Wired magazine’s chief editor argued that in the ‘Petabyte age’ we would be able to do away with theory – that correlation would be enough for us to understand reality – citing as examples Google’s search engine and gene sequencing – where simply huge amounts of data and applied mathematics replace every other tool that might be brought to bear.

However, this view is problematic because big data is itself founded on theory – it employs mathematical and statistical theories for example, and humans still select data, or at least the tools which select data, which in turn are often driven by convenience and economic concerns.

Having said that, Big Data does potentially move us away from theory and closer to empiricism than in the small data age.

Will E-learning Platforms change Education?

Big data enthusiasts argue that the greater data collection and analysis potential provided by e-learning platforms such as Khan Academy and Udacity provides much more immediate feedback to students about how they learn, and they thus predict a future in which schools and private data companies will work together in a new educational ecosystem…

This is a continuation of my summary of Mayer-Schonberger and Cukier’s (2017) ‘Big Data: The Essential Guide to Work, Life and Learning in the Age of Insight’.

You might like to read this previous post first – How will Big Data Change Education? (according to the above authors).

The advantages of e-learning platforms over traditional education

Khan Academy is well-known for its online videos, but just as important to its success is the software which collects data about how students learn, as well as what they are learning.

To date, Khan Academy has data on over a billion completed exercises, which includes information on not only what videos students watch and what tests scores they achieve, but also on the length and number of times they engage with each aspect of the course, and the time of day they did their work. This enables data analysts to deduce (probabilistically) how students learn most effectively, and to provide feedback as to how they might improve their learning.

Khan Academy is just one online learning platform, along with a whole range of MOOCs offered through Udacity, Coursera and edX, as well as SPOCs (small, private online courses), which are collecting huge volumes of data on student learning. The volume of data is unprecedented in human history, and Cukier suggests that this could change the whole ecosystem of learning, incorporating third parties who do the data analysis, and with the role of instructors (‘teachers’) changing to providing advice on which learning pathways students should adopt.

At least some of the Khan Academy data on learning is available to third parties to analyse for free, and information personal to students is presented to them in the form of a dashboard, which allows for real-time feedback to take place.

Cukier contrasts the above, emerging ecosystem of online learning with the ‘backward’ way in which data is collected and managed in the current education system (he actually uses the term ‘agrarian’ to describe the process) – in which students are subjected to a few SATs tests at predetermined stages, and this score is ‘borne by them’ until the next test, which makes labelling by teachers more likely.

In addition to this, the school day and year are run in a 19th century style, pigeon holed into year groups, pre-determined classes, students exposed to the same material, and with digital devices often banned from classes. All of this means data cannot be harnessed and analysed.

Where does this leave existing institutions of learning?

Schools and universities are well poised to harvest huge amounts of data on students, simply because they have 1000s, or 10s of 1000s of students enrolled.

To date, however, these traditional education institutions have shown a very limited ability to collect, let alone analyse and use big data to better inform how students learn.

The coming change will affect universities first – these have mature students, and this audience is more than capable of digesting insights about how to learn more effectively… the big universities where fees are expensive and students don’t get much back in return are poised for disruption by innovators…

Some of the very top universities seem to have got the importance of Big Data – MIT identified edX as a crucial part of its forward strategy in 2013 for example, but some of the universities lower down the pecking order may find it difficult to compete.

The response of some forward looking schools is to embrace e-learning, recognising the importance of getting and utilising more data on how students learn. Khan Academy is partnered with a number of schools, for example Peninsula Bridge, a summer school for middle schoolers from poor communities in the Bay Area. Cukier cites an example of one girl who managed to improve her maths due to this (again, evidence cited is almost non-existent here!)

The chapter concludes by imagining a future in which schools are just part of a broader ecosystem of learning – which includes a much more prominent role for private companies and where data plays a more central role in learning.

Comments

There are a number of factors which may contribute to schools’ inability to harness big data:

Time limitations – as Frank Furedi argues in ‘Wasted’, the functions of schools have expanded so that they are now expected to do more than just educate kids – thus an ever larger proportion of schools’ budgets is taken up with other aspects of child development; combined with meddling by successive governments introducing new policies every few years, schools are caught in the trap of having to devote their resources to adapting to external stimuli rather than being able to innovate.

Financial limitations/ equality issues – correct me if I’m wrong, but any online course tailored to GCSEs or A-levels is going to cost money, and this might be prohibitively expensive!
The negative teacher experience of governance by ‘small data’ – there is a staggering amount of small data already collected and teachers are governed by this – it might actually be this experience of being governed by data that makes teachers reluctant to collect even more data – no one wants to be disempowered!
Child privacy rights – there is the not insignificant issue of letting big ICT education companies have access to our children’s learning data!

How will Big Data Change Education?

Big Data will make Feedback more focussed on effective teaching rather than student progress, it will make learning more individualised, and it will enable us to make probabilistic predictions about what programmes are best for different students.

This is according to Big Data enthusiasts Mayer-Schonberger and Cukier in their (2017) reprint of their 2013 original ‘Big Data: The Essential Guide to Work, Life and Learning in the Age of Insight’…

This post is a summary of the section at the back of this book, which focuses on big data and education (introduction to this section is here).

An excellent counterpoint to the outrageous, almost entirely speculative and sweepingly general claims made in this book is Neil Selwyn’s ‘Is Technology Good for Education?’ – the latter is based on stacks of peer-reviewed evidence, the former on speculation only.

How will big data change feedback in education?

In the small data age, data collection in schools was largely limited to test scores and attendance, focussing on collecting standardised data on student performance, with feedback being almost exclusively in one direction – from the teachers to the schools to the kids and their parents – what is not measured is how well we teach our kids, or how effective different teaching techniques are in facilitating student progress.

Big data changes this by datafying the learning process – for example, e-books allow us to track how students read books, what they take notes on, at what point they give up reading, what sections they go back and check – thus we can measure how effective different books are, or different passages within books are, at helping students to understand knowledge, which can be used as a basis for immediate and differentiated intervention by teachers.

We could also use e-books in conjunction with testing to measure the relationship between different textual materials and the ‘decay curve’ – the rate at which students forget knowledge, which might be useful in improving test scores.
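
To make the ‘decay curve’ idea concrete, here is a rough sketch. It assumes forgetting follows a simple exponential curve (the shape usually associated with Ebbinghaus's forgetting curve); the book does not specify a formula, and the textbook names, test scores and function names are all invented for the example.

```python
from math import exp, log

def retention(days, stability):
    """Assumed exponential decay: fraction of knowledge retained after `days`."""
    return exp(-days / stability)

def estimate_stability(score_now, score_later, days_between):
    """Back out the 'stability' (in days) implied by two test scores."""
    return -days_between / log(score_later / score_now)

# Invented average scores (%) straight after reading, and again a week later.
results = {
    "textbook A": (80, 60),
    "textbook B": (80, 40),
}

stability = {
    book: estimate_stability(first, second, days_between=7)
    for book, (first, second) in results.items()
}

for book, days in stability.items():
    print(f"{book}: knowledge fades over roughly {days:.0f} days "
          f"({retention(14, days):.0%} retained after two weeks)")
```

On these made-up numbers, students forget the material from textbook B more than twice as fast, which is the kind of comparison that combining e-book data with testing might make possible.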

Companies such as Pearson and Kaplan are very involved in producing e-books, but at time of writing (2017) even in America only 5% of school text books are digital.

Individualisation

In schools, the education which we are exposed to is standardised into a one size fits all package, tailored to a mythical average student. Learning has barely evolved from the industrial era – the materials students are given are identical, and the learning process still works essentially like an assembly line, with all students being paced through a syllabus at the same rate and learning benchmarked against a series of standardised tests.

All of this is tailored towards the needs of the teachers and the system, not the needs of the students.

However, in the Big Data age, following the American economist Tyler Cowen, ‘average is over’, and following Khan Academy founder Sal Khan ‘one size fits few’. The problem with the current, industrial era education system is that very few people actually benefit from it – the bright student is bored, while the weaker understands nothing. What we need is a means of flexibly adapting the pace and content of teaching to better fit the needs of individual students.

Tailoring education to each student has long been the aim of adaptive-learning software – an example of this is Carnegie Learning’s ‘Cognitive Tutor’ for school mathematics which decides which math questions to ask based on how students answered previous questions. This way it can identify problem areas and drill them, rather than try to cover everything but miss holes in their knowledge, as happens with the traditional system.
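
The basic logic of choosing the next question from previous answers can be sketched in a few lines. This is a toy illustration only, not how Cognitive Tutor actually works: the class name `AdaptiveQuiz`, the skills and the 'drill the weakest skill' rule are all assumptions.

```python
class AdaptiveQuiz:
    """Toy adaptive tutor: always asks about the skill with the lowest success rate."""

    def __init__(self, skills):
        self.attempts = {skill: 0 for skill in skills}
        self.correct = {skill: 0 for skill in skills}

    def record(self, skill, was_correct):
        """Store the outcome of one answered question."""
        self.attempts[skill] += 1
        if was_correct:
            self.correct[skill] += 1

    def success_rate(self, skill):
        # Untested skills count as 0.0 so that they get asked about first.
        if self.attempts[skill] == 0:
            return 0.0
        return self.correct[skill] / self.attempts[skill]

    def next_skill(self):
        """Pick the skill the student currently seems weakest at."""
        return min(self.attempts, key=self.success_rate)


quiz = AdaptiveQuiz(["fractions", "decimals", "percentages"])
for skill, was_correct in [
    ("fractions", True), ("fractions", True),
    ("decimals", True), ("decimals", False),
    ("percentages", True), ("percentages", True),
]:
    quiz.record(skill, was_correct)

print(quiz.next_skill())
```

In this example the student has slipped up on decimals, so that is what gets drilled next, rather than marching through the syllabus in a fixed order.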

Another example is New York City’s ‘School of One’, a math programme in which students get their own personalised ‘playlist’ determined by an algorithm, each day, with maths problems for them suited to their needs.

Such individualised learning systems are dynamic — the learning materials change and adapt as more data is collected, analysed and transformed into feedback. More advanced material is only provided once students have mastered the fundamentals.

All of this is based on the idea of the ‘student as consumer/ client’ – one argument is that ‘if we can rip our favourite music and burn it into our own playlist’, why can’t we do this with education? A second argument is that in any other field of business, consumers provide feedback on products and the manufacturers improve (and increasingly personalise) the products to meet the demands of diverse consumers…. Adaptive learning should transform education into something which is more responsive to the needs of students/ consumers, rather than it being led by unresponsive systems and teachers.

Supporting evidence for adaptive learning:

In a trial of 400 high school freshmen in Oklahoma, the Cognitive Tutor system helped them achieve the same level of math proficiency in 12% less time than students learning math in the traditional way.

According to Bill Gates, talking in 2013, students on remedial education courses using adaptive software outperformed students in conventional courses and colleges benefitted from a 28% reduction in the cost per student.

Probabilistic Predictions

Big data will provide us with insights into how people in aggregate learn, but more importantly, into how each of us individually acquires knowledge. These insights are not perfect – they do not give us cause and effect relationships – Big data insights are probabilistic:

For example, we may spot that teaching materials of a certain sort have a 95% chance of improving a particular person’s test scores, but if we make a recommendation based on this, it will not work in 5% of cases.

This is something we are going to have to learn to live with, and parents and students are going to have to bear the risk – for example, all Big Data can do is to tell ‘clients’ that if they study this particular course, then there is a 70-80% chance they’ll see ‘x’ amount of improvement.
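
The arithmetic behind ‘bearing the risk’ is simple enough to spell out. The sketch below takes a hypothetical cohort of 1,000 students and works out how many we would expect a recommendation to work for at different probabilities; the cohort size and the function name are assumptions.

```python
def expected_outcomes(n_students, success_probability):
    """Expected number of students a recommendation does and doesn't work for."""
    helped = n_students * success_probability
    return helped, n_students - helped

# 95% is the 'near certain' example; 70-80% is the range quoted for a course.
for probability in (0.95, 0.80, 0.70):
    helped, not_helped = expected_outcomes(1000, probability)
    print(f"{probability:.0%} chance: works for about {helped:.0f} students, "
          f"fails for about {not_helped:.0f}")
```

Even a recommendation that sounds very safe leaves dozens of students in a cohort this size for whom it will not work, and nobody can say in advance which students they will be.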

However, some probabilities will be more certain than others, and so for at least some specific recommendations, we can act with reasonable certainty.

We are going to have to get over seeing the world through the lens of cause and effect…

Criticisms of Mayer-Schonberger and Cukier’s views on how Big Data will transform education

Personally, as a teacher myself, I’m sceptical when non-experts start making sweeping predictions about the future of education based on speculation. One of the claims for Big Data is that it provides empirical insights, so such speculation is hypocritical, precisely because it’s not based on any actual data!

The idea that transnational technology companies are going to help everyone in education is nonsense – they are profit driven. The fact that profit comes first, and that this will be a limiting factor in how data is used in the future, is not even mentioned.

They see ‘teachers as the enemy’, as a barrier to Big Data. This is highly dismissive of a group of people who have gone into a job to benefit children, whereas I doubt that people working for tech companies have this as their primary motive. Also see below for an alternative explanation of their criticism of ‘teachers as a barrier’ to ed tech companies playing more of a role in education.

The ‘one size fits all model’ might be dominant in education because with a teacher-student ratio of 1:100 (in colleges) teachers literally cannot meet the individual needs of individual students. There simply isn’t time for this, along with the need for teachers to keep on top of the knowledge themselves, keep up to date with technological changes and institutional-legal requirements, and do all of the (still necessary) marking of students’ work.

Related to the above point, making teachers analogous to other professionals with clients, I don’t believe there’s any other field of work where professionals are expected to deal with 100 clients at a time and personally interact with each of them every single day in a meaningful way… dealing with diverse and complex knowledge (rather than specialising in one particular thing, e.g. a haircut or financial advice) – while it might be fair to expect teachers to respond to ‘clients’ demands, 1 teacher cannot do this with 100 students. The ratio needs altering (1:10 maybe?).

The authors cite very few examples of peer-reviewed evidence to back up their claims.

Further Reading…

Problems of the role of technology companies in education
