The middle classes, and especially those in creative industries, are more likely to be on Twitter, but finding this out is more difficult than you might think, at least according to some recent research:
Who Tweets?: Deriving the Demographic Characteristics of Age, Occupation and Social Class from Twitter User Meta-Data
This post is a brief summary of the methods and findings of the above.
Introduction/ Context/ Big Data
90% of the world’s data has been generated in the past 2 years and the trend is apparently exponential, but the key challenges of harnessing this data (known as the 5Vs: volume, veracity, velocity, variety and value) are not so easily overcome.
The primary criticism of such data is that it is there to be collected and analysed before the question is asked and, because of this, the data required to answer the research question may not be available, with important information such as demographic characteristics being absent.
The sheer volume of data and its constant, flowing, locomotive nature provide an opportunity to take the ‘pulse of the world’ every second of the day rather than relying on punctiform and time-consuming terrestrial methods such as surveys. Just 1% of Twitter users in the UK amounts to around 150,000 users, so even a tiny kernel of ‘useful’ data can still amount to a sample bigger than some of the UK’s largest sample surveys.
However, social media data sources are often considered to be ‘data-light’ as there is a paucity of demographic information on individual content producers.
Yet, as Savage and Burrows argue, sociology needs to respond to the emergence of these new data sources and investigate the ways in which they inform us of the social world. One response to this has been the development of ‘signatures’ in social media as proxies for real world events and individual characteristics.
This paper builds on this work, conducted at the Collaborative Online Social Media Observatory (COSMOS), by proposing methods and processes for estimating two demographic variables: age and occupation (with associated class).
How Do Twitter Users Vary by Occupation and Social Class – Methods
The researchers used a sample of 32,032 Twitter profiles collected by COSMOS, relying on the entry in the ‘profile’ box to uncover occupation and class background.
They took the occupation term with the greatest number of words as the primary occupation and, where multiple occupations were listed, took the first one as the primary occupation.
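The selection rule above can be sketched roughly as follows. This is a minimal illustration, not the COSMOS tool itself: the term list is a hypothetical placeholder, and reading the rule as "most words wins, earliest position breaks ties" is an assumption about how the two criteria combine.

```python
import re

# Hypothetical lookup of occupational terms. The real tool's term list is not
# given in this post, so these entries are placeholders for illustration.
OCCUPATION_TERMS = [
    "software engineer", "engineer", "teacher", "dancer", "actor", "writer",
]


def primary_occupation(profile):
    """Return the primary occupation found in a profile description, or None."""
    text = profile.lower()
    matches = []
    for term in OCCUPATION_TERMS:
        # Whole-word match, so 'actor' is not found inside 'factory'.
        found = re.search(r"\b" + re.escape(term) + r"\b", text)
        if found:
            matches.append((term, found.start()))
    if not matches:
        return None
    # Assumed reading of the rule: prefer the term with the most words,
    # then break ties by whichever term appears first in the profile.
    best = min(matches, key=lambda m: (-len(m[0].split()), m[1]))
    return best[0]


print(primary_occupation("Teacher and part-time writer"))
print(primary_occupation("Senior software engineer, ex-teacher"))
```

Under these assumptions the first profile resolves to 'teacher' (two one-word terms, so the earliest wins) and the second to 'software engineer' (the longest term wins even though 'engineer' and 'teacher' also match).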
They then randomly selected 1,000 of the 32,032 cases to which an occupation had been assigned, and three expert coders visually inspected these 1,000 Twitter profiles for inaccuracies and errors.
They found that 241 (24%) had been misclassified, with a high level of inter-rater reliability among the coders.
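As a rough sketch of this validation step, the snippet below draws a random sample of 1,000 case IDs and turns the 241 misclassifications into an error rate. The approximate 95% interval is not reported in the post; it is added here only to illustrate how much uncertainty a sample of this size carries, and the function names and the seed are hypothetical choices.

```python
import math
import random


def draw_validation_sample(case_ids, k=1000, seed=0):
    """Randomly pick k cases for human checking (the seed is an arbitrary choice)."""
    rng = random.Random(seed)
    return rng.sample(case_ids, k)


def error_rate_with_interval(misclassified, checked, z=1.96):
    """Return the observed error rate and an approximate 95% interval."""
    rate = misclassified / checked
    # Normal-approximation standard error for a proportion.
    se = math.sqrt(rate * (1 - rate) / checked)
    return rate, rate - z * se, rate + z * se


# Stand-in IDs for the 32,032 profiles that were assigned an occupation.
sample = draw_validation_sample(list(range(32032)))
rate, low, high = error_rate_with_interval(241, 1000)
print(f"{rate:.1%} misclassified (approx. 95% interval {low:.1%} to {high:.1%})")
```

The interval uses the simple normal approximation, which is reasonable for a proportion this far from 0 or 1 in a sample of 1,000.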
The main problems of identification stemmed from the multiple meanings of many occupation-related words, from hobbies being described in occupational terms, and from obscure occupations. For example, people might refer to themselves as a ‘Doctor Who fan’ or a ‘Dancer trapped in a software engineer’s body’.
So what is the class background of twitter users?
The table below shows three different data sets: the class backgrounds as automatically derived from the entire COSMOS sample of profiles, the class backgrounds of the 32,032-profile sample the researchers used, and the class backgrounds of the 1,000 that were visually verified by the three expert coders (for comments on the differences see ‘validity problems’ below).
There is a clear over-representation of NS-SEC 2 occupations in the data compared with the general UK population, which may be explained by the confusion between occupations and hobbies and/or the use of Twitter to promote oneself or one’s work. NS-SEC 2 is where occupations such as ‘artist’, ‘singer’, ‘coach’, ‘dancer’ and ‘actor’ are located, and the difficulty of identifying occupation for this group is compounded by the fact that this is by far the most populous group among Twitter users and the largest group in the general UK population by 10 percentage points. Alternatively, if the occupation of these individuals has been correctly classified, then they are over-represented on Twitter by a factor of two when using Census data as a baseline measure.
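The ‘factor of two’ here is simply a ratio of shares: the proportion of Twitter users in a class divided by the proportion of the general population in that class. A minimal sketch follows; the function name is hypothetical and the input shares are placeholders chosen only to produce a factor of two, not figures from the paper or the Census.

```python
def representation_factor(twitter_share, population_share):
    """Ratio of a group's share on Twitter to its share in the baseline population."""
    if population_share <= 0:
        raise ValueError("population_share must be positive")
    return twitter_share / population_share


# Placeholder shares used purely for illustration. They are NOT figures
# from the paper or from Census data.
example = representation_factor(twitter_share=0.50, population_share=0.25)
print(example)
```

A value above 1 means the group is over-represented relative to the baseline, and a value below 1 means it is under-represented.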
Occupations such as ‘teacher’, ‘manager’ and ‘councillor’ are not likely to be hobbies, but there is an unusually high representation of creative occupations which could also be pursued as leisure interests, with 4% of people in the dataset claiming to be an ‘actor’, 3.5% an ‘artist’ and 3.5% a ‘writer’. An alternative explanation is that Twitter is used by people who work in the creative industries as a promotional tool.
Validity problems with the social-class demographics of twitter data
Interestingly, the researchers rejected the idea that people would just outright lie about their occupations, noting that ‘previous research [has] indicated that identity-play and the adoption of alternative personas was often short-lived, with “real” users’ identities becoming dominant in prolonged interactions. The exponential uptake of the Internet, beyond this particular group of early adopters, was accompanied with a shift in the presentation of self online resulting in a reduction in online identity-play’.
The COSMOS engine does automatically identify occupation, but it identifies occupation inaccurately – and the degree of inaccuracy varies with social class background. The researchers note:
‘[The] unmodified occupation identification tool appears to be effective and accurate for NS-SEC groups in which occupational titles are unambiguous such as professions and skilled trades (NS-SEC 1, 3, 4 and 5). Where job titles are less clear or are synonymous with alternative activities (NS-SEC 2, 6 and 7) the requirement for human validation becomes apparent as the context of the occupational term must be taken into account such as the difference between “I’m a dancer in a ballet company” and “I’m a dancer trapped in the body of a software engineer”’.
The researchers note that the next step is to further validate their methodology through establishing the ground-truth via ascertaining the occupation of tweeters through alternative means, such as social surveys (an on-going programme of work for the authors).
In some ways the findings are not surprising: middle-class professionals and the self-employed are over-represented on Twitter. But if we are honest, we don’t know by how much, because of the factors mentioned above. It seems fairly likely that many of the people self-identifying on Twitter as ‘actors’ and so on don’t do this as their main job, but we just can’t assess this using Twitter alone.
Thus this research is a reminder that hyper-reality is not more real than actual reality. In hyper-reality these people are actors; in actual reality, they are frustrated actors. This is an important distinction, and this alone could go some way to explaining why virtual worlds can be so much meaner than the real world.
This research also serves as a refreshing reminder of how traditional ‘terrestrial’ methods such as surveys are still required to ascertain the truth of the occupations and social class backgrounds of Twitter users. As it stands, if we left it to algorithms we’d end up with around a quarter of people being incorrectly identified (24% in the validated sample), which is a huge margin of error. If we leave these questions up to Twitter, then we are left with a very misleading picture of ‘who tweets’ by social class background.
Having said this, it is quite possible for further rules to be developed and applied to algorithms which could increase the accuracy of automatic demographic data-mining.
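To give a flavour of what such further rules might look like, here is a minimal sketch of a context check that rejects an occupational term when the words immediately after it suggest it is not being used as a job title. The patterns and the function name are hypothetical, invented for the examples quoted in this post; they are not rules proposed by the researchers.

```python
import re

# Hypothetical context rules: patterns that, when they directly follow a matched
# occupation term, suggest the term is not being used as a job title.
NON_JOB_CONTEXTS = [
    r"\s+who\b",         # 'Doctor Who fan'
    r"\s+trapped in\b",  # 'dancer trapped in ... body'
    r"\s+at heart\b",    # 'artist at heart'
]


def is_plausible_job(profile, term):
    """Return True if `term` appears in the profile and no rule rejects its context."""
    text = profile.lower()
    found = re.search(r"\b" + re.escape(term.lower()) + r"\b", text)
    if not found:
        return False
    # Only look at the text that follows the first occurrence of the term.
    rest = text[found.end():]
    return not any(re.match(pattern, rest) for pattern in NON_JOB_CONTEXTS)


print(is_plausible_job("I'm a dancer in a ballet company", "dancer"))
print(is_plausible_job("I'm a dancer trapped in the body of a software engineer", "dancer"))
print(is_plausible_job("Doctor Who fan", "doctor"))
```

With these three rules the ballet dancer is accepted and the other two profiles are rejected for ‘dancer’ and ‘doctor’ respectively. Hand-written rules like this only catch the phrasings someone thought to list, which is part of why human validation was still needed.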