In the spring of 2006, AOL’s chief researcher, Dr. Abdur Chowdhury, had a brainwave brilliant enough to get him the sack. Given how many millions of people were typing their thoughts into AOL’s search engine in search of enlightenment, he reckoned, why not post that information somewhere public and try to do something interesting with it? Even though he worked for a huge multinational, Chowdhury’s motives were benign. Knowing what was going on in the minds of internet users, he believed, would surely help technologists to design better and smarter internet search engines. But that wasn’t the only use for it. Before he’d been hired by AOL, Chowdhury had worked as an academic researching the deluge of electronic information and what might be done with it. He was well aware that, outside of the big companies which can afford to buy it, fresh data about human behaviour was becoming incredibly difficult to come by, especially among the social researchers who were best placed to make use of it. His plan was to gift those researchers with the freshest and most immediate data there was – a whole new set of tools with which to understand the thought-processes, interests and preoccupations of internet users.
Chowdhury’s decision to post the data was a brave one, but it didn’t reckon on the feeding frenzy which often accompanies the low-key release of sensitive information onto the net. When his huge file was chanced upon and investigated by a passing internet user, it was discovered to contain no fewer than twenty-three million search keywords from 650,000 AOL users over three months earlier that year. Within hours, it had been pilfered by nimble internet users and pasted up all over the more anarchic corners of the net. But in the brouhaha over privacy and data protection which followed, during which Abdur Chowdhury was loudly fired and his research unit closed down, it was too easy to forget that all of those users had voluntarily queued up to type all this material into a gigantic database, one which was capable as a result of compiling a comprehensive ledger of each of their internet-mediated thoughts just as soon as they wrote them. Within days, whole websites had sprung up dedicated to understanding what that data meant. One of the most popular is called AOL Stalker. The goal of AOL Stalker, as its name suggests, is to make AOL’s vast data bank of information entirely searchable by those who want to browse through the searches of AOL’s users. Its founder is a reserved and slightly suspicious 26-year-old Swedish hacker called Hjalmar and, when I tracked him down to Sweden and spoke to him on Skype, he forwarded me a popular sequence of searches for one AOL customer, whose data had been downloaded more than fifty thousand times. The anonymous internet user, identified in AOL’s data only as user 672368, seems to be a young woman, and the three months of chronological searches that we have for her tell us a great deal about her life and her mental state. At the beginning of March 2006, she appears to be in the early stages of pregnancy:
2006-03-01 18:54:10 Body fat calliper
2006-03-05 08:53:23 Curb morning sickness
2006-03-09 18:49:37 Get fit while pregnant
Two days later, her mood seems to darken.
2006-03-11 03:52:01 He doesn’t want the baby
2006-03-11 03:52:58 You’re pregnant he doesn’t want the baby
Soon things are back to normal, and her enthusiasm for having the baby returns.
2006-03-14 19:11:28 Baby names and meanings
2006-03-28 09:28:25 Maternity clothes
2006-03-29 10:01:39 Pregnancy workout videos
2006-03-29 10:12:38 Buns of steel video
It’s not long, however, before user 672368 seems to be having second thoughts.
2006-04-17 11:00:02 Abortion clinics charlotte nc
2006-04-17 11:40:22 Greater Carolinas Womens Center
2006-04-17 21:14:19 Can Christians be forgiven for abortion
2006-04-17 22:22:07 Roe vs. Wade
2006-04-18 06:50:34 Effects of abortion on fibroids
2006-04-18 15:14:03 Abortion clinic charlotte
2006-04-18 16:14:07 Symptoms of miscarriage
A few days later, she is thinking about engagement rings.
2006-04-20 16:58:37 Engagement rings
On the same day, however, abortion is still weighing on user 672368’s mind.
2006-04-20 17:53:49 High-risk abortions
Two days later, the decision seems to have been made on her behalf.
2006-05-22 18:17:53 Recover after miscarriage
Only several days later, her thoughts turn once again to marriage.
2006-05-06 21:22:18 www.weddingchannel.org
2006-05-26 19:32:52 Demetrios bridesmaid dresses
2006-05-27 07:25:45 Marry your live-in
What Hjalmar had nudged me in the direction of seems to be the inchoate story of someone’s life – the affecting and real-life story of one young woman’s pregnancy, her subsequent wrestle with the fact that her baby might not be wanted by her partner, and her eventual miscarriage. As the sequence of searches ends, user 672368 appears to be recovering from that miscarriage, and looking forward to marriage with her partner. All of this, of course, can only be inferred. No one really knows what is going on inside our heads just from the keywords we type, because the internet is not a complete approximation of our thought-processes or what we’re up to. All the same, there’s something awesomely fascinating about what all this electronic chatter can tell us about ourselves. On its own it doesn’t make much sense. Join up the dots, however, and it can leave us with a frighteningly immediate route into our collective psyche. The hodgepodge of thoughts, desires and impulses which tumble out when we sit in front of our internet search boxes tends to short-circuit our rational selves, and makes for a uniquely powerful way of tapping into our collective mood which goes right under the radar of more traditional measures.
It’s not only subterranean hackers who are trying to make sense of it. In the last few years social scientists and market researchers have begun looking afresh at the kind of “self-reporting” which happens on blogs and social networks for what it can tell us about ourselves and which way we’re headed. The upshot is a new kind of gold rush among companies like Google and Cisco who are clever or rich enough to exploit our data and make some use of it. Many of the same authorities and institutions that we thought we had left behind when we migrated to online ecosystems like Facebook and Google have quietly become peeping toms there, and spend a great deal of time monitoring the traffic and thinking about how they can crunch it to their own advantage. To help them do so, they’ve been hiring data analysts by the dozen. These new data-crunchers have even invented a whole vocabulary – they talk about tracking the “social index” or “social graph”, and refer to themselves, rather grandly, as “sentiment analysts” or practitioners of the science of “info-demiology.” All the same, their rise tells us something significant. What it means is that, as the initial utopian impulse which grew up around all things web-related gives way to a new realism about what social networking can achieve, the balance of power on the web is slowly shifting to the number-crunchers and data analysts who have the resources to exploit it. For social networking systems like Facebook and Twitter the imperative to make more use of the data we’re throwing their way is even more pressing. We’re playing on their turf, after all, and sooner or later they’re going to need to pay the rent.
About a decade ago AOL entered into a long-term agreement to have its searches answered by Google, which means those searches made public by Abdur Chowdhury were forwarded to Google’s search box as soon as they were typed. Since three out of five internet searches anywhere in the world are answered by the company, it’s no surprise that Google has always been keenly aware of the uses to which its data might be put. In 2008 I sat at a table in the company’s plush London offices and was shown how to use one of the new tools to emerge from its laboratory, Google Trends, which uses our search words to track our enthusiasm for different subjects over time. The executive who’d invited me in demonstrated his brand new gizmo by using it to compare the numbers of people who searched for Barack Obama versus those who searched for Hillary Clinton during their race for the Democratic Party’s nomination for the 2008 American Presidential Election. Looking at the peaks and troughs of support for each and comparing public interest in the two over time, it was easy to chart very accurately the rise in public fascination with Obama and the waning of interest in Hillary Clinton. Just as Chowdhury had predicted, what’s so wonderful about search data is its freshness and its immediacy. Given the symbiotic relationship between our fragmented thought-processes and the words we end up typing into search engines, what comes out looks like an eerily instantaneous chronicle of the public mood. To Google’s managers, all this has the potential to be seriously useful. In 2008, the same year the company invited me into its London offices, it launched Google Flu Trends, an ambitious attempt to use its searches to predict epidemics of flu in advance of medical authorities. Other big companies are experimenting with ways to use this kind of data to track public health problems or even short-term economic trends.
Mining for search data gold, however, isn’t as easy as it sounds. For one thing, it’s difficult to know what we’re measuring here. Are our searches for Michael Jackson, for example, an indication of his popularity or his lingering infamy? There are lots of different reasons why we might be searching, and no easy way to find out. If this kind of data is any use, it’s only at the aggregate level, where analysts can trawl through it to decipher patterns in the ether. Even then, it’s far from foolproof. Take Google Flu Trends. Just as Abdur Chowdhury predicted, the great thing about Google Flu Trends is that its search data can be collected and analysed instantaneously, whereas traditional flu surveillance systems can take days or even weeks to process. But that doesn’t mean it’s more accurate. According to a major study published by the University of Washington last year, in fact, it’s substantially less accurate than the traditional systems used by the medical authorities. The problem for Google Flu Trends is that lots of different viruses can give us flu-like illnesses, but that doesn’t mean we’ve got flu. The stuff we’re typing into the net, if it can tell us anything, is usually better at telling us what we think is happening than what’s really happening. And, as we know from the lightning pace at which information finds its way around the net, this kind of “info-demiology” lends itself very well to “viral” bouts of social hysteria.
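The comparison the Google executive demonstrated can be sketched in a few lines of Python. This is only an illustration of the general idea (counting how often a term turns up in a timestamped search log, week by week) and not a description of how Google Trends itself works; the function name `weekly_interest` and the sample log entries are invented for the purpose.

```python
from collections import Counter
from datetime import datetime


def weekly_interest(log, terms):
    """Count, per ISO week, how many queries mention each term.

    `log` is a list of (timestamp, query) pairs in the same
    "YYYY-MM-DD HH:MM:SS" layout as the AOL excerpts above.
    """
    counts = {term: Counter() for term in terms}
    for timestamp, query in log:
        moment = datetime.strptime(timestamp, "%Y-%m-%d %H:%M:%S")
        year, week, _ = moment.isocalendar()
        text = query.lower()
        for term in terms:
            # Crude substring match; a real system would need far more care.
            if term.lower() in text:
                counts[term][(year, week)] += 1
    return counts


# Made-up log entries, purely for illustration.
sample_log = [
    ("2008-01-07 09:15:00", "obama iowa speech"),
    ("2008-01-08 11:02:10", "clinton new hampshire"),
    ("2008-01-15 20:45:31", "obama rally tickets"),
    ("2008-01-16 08:30:00", "barack obama biography"),
]

interest = weekly_interest(sample_log, ["obama", "clinton"])
for term, weeks in interest.items():
    print(term, dict(weeks))
```

Plotting those weekly counts side by side gives the peaks and troughs described above, though a count like this says nothing about why people were searching.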
Search data, of course, isn’t the only tool that data analysts have at their disposal. When millions of us migrated to spend our lives on online social networks like Facebook, a more promising avenue for researchers opened up – to work out who we are on the basis of who we know. 2007 was the year in which Mark Zuckerberg began to present Facebook as an all-knowing “social graph” which would open our eyes to the ties which bind people to each other without anyone ever knowing. In that same year, two MIT students decided to take him at his word. Wondering what we were unconsciously telling others about ourselves by ‘friending’ people there, the pair used data downloaded from Facebook’s MIT network and a piece of software to try to predict, on the basis of who they knew, the sexuality of some of their fellow students who didn’t report their sexual preferences. Though there was no scientific way of proving their results, their knowledge of the students they predicted to be gay proved that they were right in every case they checked. Simply by looking at who people know on Facebook, in other words, they were able to predict whether that person was gay.
This idea of working out who people are on the basis of the company they keep is called social network analysis, and it is not at all new. To sociologists it’s known as the “homophily principle” – the tendency of similar kinds of people to hang together. It was a kind of social network analysis, for example, which led the 4th American Infantry Division to Saddam Hussein’s hole in the ground in December 2003. Other recent studies have tried to use it to predict everything from who might be a terrorist to the likelihood that we’ll end up happy or fat. Online social networks, however, have added high-tech fuel to the theory. It’s easy to see how it might work. If your online friends are all Muslim or Jewish it’s likely that you will be too. It might even come in handy – the police, for example, might investigate a murder by trawling through a list of the victim’s online acquaintances. All the same, it seems a circular and slightly primitive way of understanding who we are. We don’t necessarily act like our friends; often, in fact, we choose friends precisely because they’re so different from ourselves. And even if we did work out that someone is gay or Muslim, what would that really tell us – do people with the same religion or sexual preference really think alike?
In July 2009 actor Sacha Baron Cohen’s comedy film Bruno made an impressive one-day debut of $US14.4 million at US and Canadian box offices. The following day, however, it dropped precipitously, falling 39 per cent to $US8.8 million. Americans and Canadians had reacted quickly and badly to what they saw of the film, media reports surmised, and had voted with their tweets. Since then, many big companies have been working hard to take account of the “Twitter effect.” The result is a growing cadre of “sentiment analysts” who are paid to trawl through our electronic chatter and find out what we make of products of all kinds. One of them is a sprightly woman called Margaret Francis. When I visited her in her office in San Francisco last year, she gave me a crash course in how to do it. Francis works for a company called Scout Labs (it’s since changed its name to Lithium), and she and her colleagues have trained a computer programme to recognise thousands of words which come up in publicly accessible online conversation. She doesn’t need to know whether the people she’s eavesdropping on are gay or Muslim; she doesn’t even need to know their names. The only thing she cares about is what they’re saying about her clients; what comes out the other end, she showed me, is a bar chart tracking the murmurings of online opinion about different companies.
Sentiment analysis is a growth profession. While we dart around online, teams of ethnographic bird-watchers are looking over our shoulder to find out what we’re tweeting about. Last year a pair of researchers from Hewlett Packard’s Lab in Palo Alto used a computer algorithm to crunch the positive or negative sentiments expressed in 2.9 million Twitter messages about 24 movies. The result, they claimed, perfectly predicted the box-office performance of each film, with an accuracy of over 97 per cent in the opening weekend – they’re now in the process of patenting it. Quantifying sentiment in this way, according to its boosters, isn’t only useful for its insights into our mood – it can also help us understand the direction in which things are headed. In October of last year, a team of researchers at Indiana University classified 9.7 million Twitter posts under six mood categories (happiness, kindness, alertness, sureness, vitality and calmness), and reckoned that by doing so they could predict changes in the Dow Jones Industrial Average. It’s not as easy as it sounds. Recognising the nuances of online conversation can be a tricky business, and Margaret Francis has one of her analysts read through the data to look for discrepancies and make sure the computer’s getting it right. “We had motherfucker on the list as a negative word”, she told me, “and then we were like ‘why does the machine think this is negative and a person thinks it’s positive?’ And then you look at the content, and it says ‘badass motherfucker’ – and motherfucker is only a bad word, it turns out, if it isn’t preceded by ‘badass motherfucker’ or ‘that righteous motherfucker’. There are all these permutations of motherfucker that are good.”
In her previous career Margaret Francis spent much of her time dissecting customers into different demographic groups, but when I asked her about it she rolled her eyes as if it were ancient history. “You’re taking me back years now”, she said. If the stuff that she does works, it’s only because the audience itself is changing its form. In the time we spend online we’re developing passionate attachments to the most recherché of interests and flocking together with those who feel the same way. Those we join up with online aren’t usually our friends, but those with whom we share a passion or an affinity. Identify with something, however, and it’s easier for other people to identify you with it too. For half a century everyone from marketers to political parties has been using demographic information about us – how old we are and our gender, our ethnicity, our sexual preferences and where we live – to help them predict what we might want and how we might act. Most of it was guesswork: reading between the lines of large datasets, asking questions of small samples of the population in the hope of blowing them up into more general significance. It was always a little flimsy, an unreliable guide to how we were. This is different. Instead of identifying people via their accidental demographic characteristics, it identifies them by the things they’re interested in, which usually happen to be the stuff they really like. It’s not perfect, of course. Tracking buzz about a product or a film in real time usually comes much too late. If the film’s a turkey, it’s a bit too late for the studio to get its money back. In any case, this is a brutal and unforgiving way to understand what works and what doesn’t – the best stuff usually takes time to breathe and develop. But for anything susceptible to being built up or understood over time, the opportunities are almost limitless.
For everyone from TV shows to pop bands to political sub-cultures, these new tools open up whole new routes to connecting with a real audience (rather than a statistically imaginary one) and growing audiences more organically than ever before.
But it’s not only businesses and institutions who have something to learn from it. Just as Abdur Chowdhury guessed and those subterranean data hounds seemed to realise, all that information about us online is likely to be a huge free gift to social researchers as well as the downright prurient. The masses of data which now exist about us on places like Facebook, Twitter, blogs and the net seem sure to revolutionise the slightly arcane world of polling and social research. The industry, after all, is still young – its origins can be traced to the middle decades of the 20th century. Among its early pioneers was a bunch of well-meaning British anthropologists who, in 1937, took to the streets to report the behaviour of ordinary people incognito because they didn’t trust more traditional sociological ways of understanding how ordinary people behaved and believed. In what became known as the “mass-observation” experiments – it all started with a letter in the pages of the New Statesman – hundreds of observers were sent out to mingle among the natives and quietly report back on the everyday habits and behaviour of the British public. This new kind of self-reported mass observation has the potential to be every bit as illuminating – if only we know what to look for, and which questions to ask.
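To see how thin the underlying logic is, here is a minimal Python sketch of the homophily idea: guess an undeclared trait from the most common declared trait among someone’s friends. It is not the method the MIT students used, which the account above doesn’t describe in detail; the function `guess_from_friends`, the 60 per cent threshold and the toy data (a deliberately harmless trait, taste in music) are all assumptions made for illustration.

```python
from collections import Counter


def guess_from_friends(person, friends, declared, min_share=0.6):
    """Guess a person's undeclared trait from their friends' declared traits.

    Returns the most common trait among the friends who have declared one,
    or None if no single trait reaches `min_share` of them.
    """
    values = [declared[f] for f in friends.get(person, []) if f in declared]
    if not values:
        return None
    value, count = Counter(values).most_common(1)[0]
    return value if count / len(values) >= min_share else None


# Hypothetical friendship lists and self-declared tastes.
friends = {
    "ann": ["bob", "cat", "dan"],
    "eve": ["bob", "fay"],
}
declared = {"bob": "jazz", "cat": "jazz", "dan": "folk", "fay": "folk"}

print(guess_from_friends("ann", friends, declared))  # two of three friends agree
print(guess_from_friends("eve", friends, declared))  # friends split evenly
```

The sketch makes the circularity plain: it can only ever tell us that someone resembles the people around them, and it falls silent as soon as their friends disagree.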
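The fix Francis describes can be illustrated with a toy word-list scorer in Python. This is a sketch of the general technique only, not Scout Labs’ software: the word lists, the `FLIPPERS` set and the `score` function are invented here, and a real system would handle thousands of words and far subtler context.

```python
import re

# Tiny hypothetical word lists; a real lexicon would run to thousands of entries.
NEGATIVE = {"awful", "boring", "motherfucker"}
POSITIVE = {"great", "funny", "brilliant"}
# Words which, when they come immediately before a negative word, flip it to praise.
FLIPPERS = {"badass", "righteous"}


def score(message):
    """Return a crude sentiment score: +1 per positive word, -1 per negative word.

    A negative word directly preceded by a flipper counts as +1 instead.
    """
    words = re.findall(r"[a-z']+", message.lower())
    total = 0
    for i, word in enumerate(words):
        previous = words[i - 1] if i > 0 else ""
        if word in POSITIVE:
            total += 1
        elif word in NEGATIVE:
            total += 1 if previous in FLIPPERS else -1
    return total


print(score("That badass motherfucker was brilliant"))
print(score("Boring film, awful motherfucker"))
```

Every exception of this kind has to be spotted by a person first, which is why one of Francis’s analysts still reads through the data by hand.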
©James Harkin. James Harkin is Director of the strategic research agency Flockwatching. His book Niche: Why the market no longer favours the mainstream was published in March by Little, Brown