The Internet of Us
Michael P. Lynch
Understanding and the Digital Human
Google knows us so well that it finishes our sentences. This feature, familiar to any user of the Internet, is called Google Autocomplete. Search as I just did for “Web 3.0 and . . .” and Google will suggest “big data” and “education”; search for “knowledge and . . .” and you might get “power” and “information systems.” Autocomplete is a familiar, if rather gentle, form of big data analysis. It works because Google knows not only what much of the world is searching for on the Web, but also what you've been searching for. That data is useless without Google's proprietary analytic tools for transforming the numbers and words into a predictive search. These predictions aren't perfect. But they are amazingly good, and getting better all the time.
Google has done more than perhaps any other single high-profile company to usher in the brave new world of big data. As I noted in the first chapter, the term “big data” can refer to three different things. The first is the ever-expanding volume of data being collected by our digital devices. The second is the analytical tools for extracting information from that data. And the third is the firms, like Google, that employ them.
One of the lessons of previous chapters is that big data and our digital form of life, while sometimes making it easier to be a responsible and reasonable believer, often make it harder as well, while at the same time setting up conditions that make reasonable belief more important than ever before. The same thing could be said for understanding, except even more so. And that's important, because understanding is what keeps the “human” in what I earlier called the digital human.
In 2008, Chris Anderson, then editor of Wired, wrote a controversial and widely cited editorial called “The End of Theory: The Data Deluge Makes the Scientific Method Obsolete.” Anderson claimed that what we are now calling big data analytics was overthrowing traditional ways of doing science:
This is a world where massive amounts of data and applied mathematics replace every other tool that might be brought to bear. Out with every theory of human behavior, from linguistics to sociology. Forget taxonomy, ontology, and psychology. Who knows why people do what they do? The point is they do it, and we can track and measure it with unprecedented fidelity.
With enough data, the numbers speak for themselves. . . . Petabytes allow us to say: “Correlation is enough.” We can stop looking for models. We can analyze the data without hypotheses about what it might show. We can throw the numbers into the biggest computing clusters the world has ever seen and let statistical algorithms find patterns where science cannot.[1]
Traditional scientific theorizing aims at model construction. Collecting data is just a first step; to do good science, you must explain the data by constructing a model of how and why the phenomenon in question occurred as it did. Anderson's point was that the traditional view assumes that the data is always limited. That, he says, is the assumption big data is overthrowing.
In 2013, the data analytics expert Christian Rudder, cofounder of the dating website OkCupid, echoed Anderson's point. Discussing the massive amount of information that OkCupid and other dating sites collect, Rudder writes:
Eventually we were analyzing enough information that larger trends became apparent, big patterns in the small ones, and even better, I realized I could use the data to examine taboos like race by direct inspection. That is, instead of asking people survey questions or contriving small-scale experiments, which was how social science was often done in the past, I could go and look at what actually happens when, say, 100,000 white men and 100,000 black women interact in private.[2]
Anderson's and Rudder's comments are not isolated; they bring to the surface sentiments echoed across discussions of analytics over the last few years. Rudder has been particularly adept at showing how the huge datasets gathered by social sites can yield eye-opening correlations, and data scientists and companies the world over have been harvesting a wealth of surprising information using analytics. But Google remains the most visible leader in the field. The most frequently cited, and still one of the most interesting, examples is Google Flu Trends. In a now-famous article in Nature, Google scientists compared the 50 million most common search terms used in America with the CDC's data about the spread of seasonal flu between 2003 and 2008.[3] What they learned was that forty-five search terms could be used to predict where the flu was spreading, and do so in real time, as they did with some accuracy during the 2009 H1N1 outbreak.
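The procedure can be sketched in miniature: rank candidate search terms by how strongly their weekly frequencies correlate with official case counts, then build predictions from the top-ranked terms. The sketch below uses invented numbers and term names, and Google's actual method (selecting 45 terms out of 50 million and fitting regression models) was far more elaborate; this only illustrates the correlation-ranking idea:

```python
# Toy sketch: pick search terms whose weekly frequency best tracks flu cases.
# All data here is invented for illustration.

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

cdc_cases = [120, 340, 560, 480, 200]          # weekly flu counts (invented)
term_freqs = {
    "flu symptoms":   [110, 300, 580, 460, 210],
    "cough medicine": [90, 320, 500, 450, 190],
    "cheap flights":  [400, 380, 410, 395, 405],
}

# Rank terms by how well they track the official counts.
ranked = sorted(term_freqs, key=lambda t: pearson(term_freqs[t], cdc_cases),
                reverse=True)
print(ranked)  # the health-related terms rank above the unrelated one
```

The point the chapter goes on to make applies here too: nothing in this computation knows *why* the top terms track the flu; it only registers that they do.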
Google Flu Trends, which we will look at again below, is really only an extension of the design methods behind Google's main search engine. Its algorithms (and their creators) don't know why one page rather than another is what you want; they just apply mathematical techniques to find patterns in incoming links. That's all. Similarly, Google Flu Trends doesn't care why people are searching as they do; it just correlates the data. And Walmart doesn't care why people buy more Pop-Tarts before a hurricane, nor do insurance companies care why certain credit scores correlate with certain rates of medication adherence; they care only that they do. As Viktor Mayer-Schönberger and Kenneth Cukier put it, “predictions based on correlations lie at the heart of big data. Correlation analyses are now used so frequently that we sometimes fail to appreciate the inroads they have made. And the uses will only increase.”[4]
Does the use of big data in this way, however, really signal the end of theory, as Anderson alleged? The answer is no. And, as we'll see, that is a very good thing.
Start with Rudder's and Anderson's remarks. As Rudder puts it, big data seems to allow us to investigate by direct inspection. We don't have to look through the lens of a model or theory; we can let the numbers speak for themselves. Big data brings us to the real-life correlations that exist, and because those correlations are so perfectly . . . well, correlated, we can predict what happens without having to worry about why it happens.
But can we ever look at the “data in itself” without presupposing a theory? In The Structure of Scientific Revolutions, Thomas Kuhn famously argued that you cannot: data is always “theory-laden.” His point was that there is no direct observation of the world that isn't at least somewhat affected by prior observations, experiences and the beliefs we've formed as a result. These beliefs in turn set up expectations. In short, theory permeates data.
This operates even at the level of deciding what experimental techniques or devices to employ. As Kuhn put it, “consciously or not, the decision to employ a particular piece of apparatus and to use it in a particular way carries an assumption that only a certain sort of circumstances will arise. There are instrumental as well as theoretical expectations.”[5]
In support of the claim, Kuhn cited the now-classic 1949 article by Bruner and Postman on perceptual incongruity. Bruner and Postman showed their subjects playing cards, some of which had abnormalities (a spade card was red, for example).[6] What they found was that, primed with ordinary cards, respondents identified the abnormal cards as perfectly normal; their expectations seemingly affected what they saw. The last seventy-five years of psychology have only underlined the lesson (if not necessarily the letter) of Bruner and Postman's experiment. What you believe can affect what you observe.
Rudder and Anderson may well protest that they don't mean to deny Kuhn's point. They aren't concerned with perceptual observations but with mathematics. But even when it comes to mathematical correlations detected by mindless programs, our theoretical assumptions will matter: they will determine how we interpret those correlations as meaningful and, most importantly, what we do with them.
A trivial example of how assumptions can matter in this way occurs in Rudder's book. When discussing a well-known data map that tracks the “emotional epicenter” of an earthquake by looking at Twitter reactions, Rudder notes that “Knowing nothing else about a quake, if it were your job to distribute aid to victims, the contours of the Twitter reaction would be a far better guide than the traditional shockwaves around an epicenter model.”[7]
Maybe so; and the data map, and others like it, certainly are interesting. But Rudder's point here rests on some key assumptions. First, it assumes that aid workers won't be concerned about aftershocks (which will be better predicted by models employing geological and geographical data). Second, it assumes that all types of quakes generate equally explicable Twitter reactions. (What happens, for example, if people are too injured to type?) Third, it assumes not only that people have equal access to smartphones, but that their first priority is to tweet rather than rescue the injured. In an extreme quake, the emotional epicenter as charted by Twitter may be far away from the point of truest need. My point is not to overhype what is a passing remark in a much longer work; it is to show that data correlations themselves are useful only under certain background assumptions. And where do those assumptions ultimately come from? Theory.
Another word for this is context. Without it, correlations can be as misleading as they are informative. A recent and extremely striking example is art historian Maximilian Schich's video map of cultural history (reported in Science, with an accompanying video posted by Nature).[8]
Schich and his colleagues, employing data gleaned from Freebase (a huge dataset owned by, who else, Google), used the mathematical techniques of network analysis to map what they referred to as the development of cultural history. After collecting data about the locations and times of the births and deaths of 150,000 “notable” people over 2,000 years, they made a video map of the data (with births in blue and deaths in red). What resulted showed how, over time, “culture” moved and migrated: sometimes, according to the map, clustering around certain cities (Paris, a center of red), and sometimes distributed more widely. The video is arresting (if you haven't seen it, google “Schich and cultural history”). The idea, Schich said, was to show that you could do in history what is done in the sciences: use data to show actual correlations rather than relying on armchair theorizing.
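Stripped of its visualization, the core of such a map is simple aggregation: count where notable people were born and where they died, and treat a surplus of notable deaths as a sign that a place attracted cultural figures. The tiny record format and the “pull” measure below are my illustrative simplifications, not Schich's actual pipeline (which ran network analysis over roughly 150,000 records):

```python
# Toy version of the birth/death aggregation behind a cultural-history map.
# Each record: (name, birth place, death place). Real figures, tiny sample.
from collections import Counter

people = [
    ("René Descartes", "La Haye", "Stockholm"),
    ("Frédéric Chopin", "Żelazowa Wola", "Paris"),
    ("Oscar Wilde", "Dublin", "Paris"),
    ("Vincent van Gogh", "Zundert", "Auvers-sur-Oise"),
]

births = Counter(p[1] for p in people)
deaths = Counter(p[2] for p in people)

# A place's "pull" = notable deaths there minus notable births there.
pull = {place: deaths[place] - births[place] for place in births | deaths}
print(max(pull, key=pull.get))  # Paris attracts more notables than it produces
```

Note that every theoretical assumption the chapter is about to criticize is already baked into these few lines: who counts as “notable,” and why death place rather than working life should stand in for cultural gravity.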
But Schich's data map relies on a host of assumptions. A good deal of the discussion of the video on Twitter and elsewhere following its release concerned the Eurocentric nature of the map. The notable figures chosen by Schich were almost entirely white, European and male (and in many cases, possessing some wealth). This makes the widely viewed video, which talks about cultural history simpliciter, not just striking but strikingly cringeworthy at points. In fairness, Schich was well aware of this bias; his point, as he noted, was to use the available data to discover patterns in broadly European cultural history.
Yet Schich's assumptions don't stop at race, gender and ethnicity, nor are they all the products of the available data. Some of his assumptions concern how to define “culture.” Schich's map suggests that culture is driven by notable figures (from scientists to movie stars). But is what used to be called the “great man” theory the only or best way to understand what shapes cultural change? What about economics or politics, for example? Other assumptions concern how the drift of culture is measured. Why think that where someone died has more predictive value for cultural development than where they spent their most productive years? Descartes, for example, died in Sweden, but he spent most of his productive life in the Netherlands. Once again, theoretical assumptions drive work in big data as much as they do in any other field. Kuhn would not be surprised.
None of this diminishes the importance of network analyses as tools for research, including in fields not typically associated with data, like history. It's a growing and exciting mode of research across anthropology, literature, the digital arts and the humanities. But as the historian and digital humanities scholar Tom Scheinfeldt has remarked, this work is only as good as the theoretical context in which we place it.[9]
So, like it or not, we can't do data analytics without theory. It's what gives us the context in which to pose questions and interpret the correlations we discover. But we should like theory; the process of theorizing employs a composite of cognitive capacities, ones that when employed together make up understanding, another way of knowing that is important to human beings.
Suppose you want to learn why your apple tree is not producing good apples. You google it, and the first website you look at (for example, the ACME Apple Research Center) is a source of scientific expertise on apples. It tells you the correct answer; call it X. But there are many other websites (e.g., the nefarious Center for Apple Research) that came up during your search that would have told you the wrong answer, and many others (e.g., MysticApples.com) that would have given you the right answer but for the wrong reasons. So you could easily have been wrong about X, or right about it but for the wrong reasons.[10]
Silly as it is, this example replicates how we now know much of what we know, as I've been pointing out in this book. We know by Google-knowing. Not that there is anything wrong with that. After all, in the above case we are being responsible in believing X, based on an investigation and on the basis of a reliable source.[11]
In several ways, then, you could be said to know X. And for most purposes, that's good enough. Yet it is clear that something valuable can sometimes go missing even when you go about the process responsibly. Sometimes we need to know more than the facts; sometimes we want to understand. And it is our capacity to understand that our digital form of life often undersells, and which more data alone can't give us.