Authors: Ian Ayres
Tell Me What You Know About Me
Tera mining sometimes gives businesses a decided information advantage over their customers. Hertz, after analyzing terabytes of sales data, knows a lot more than you do about how much gas you're likely to leave in the tank if you prepay for the gas. Cingular knows the probability that you will go beyond your “anytime minutes” or leave some unused. Best Buy knows the probability that you will make a claim on an extended warranty. Blockbuster knows the probability that you will return the rental late.
In each of these cases, the companies not only know the generalized probability of some behavior, they can make incredibly accurate predictions about how individual customers are going to behave. The power of corporate tera mining creepily suggests the opening lines of Psalm 139:
You have searched me and you know me.
You know when I sit and when I rise; you perceive my thoughts from afar.
You discern my going out and my lying down; you are familiar with all my ways.
We may have free will, but data mining can let business emulate a kind of aggregate omniscience. Indeed, because of Super Crunching, firms sometimes may be able to make more accurate predictions about how you'll behave than you could ever make yourself.
But instead of trying to prohibit statistical analysis, we might react to the possibility of advantage-taking by simply making sure that consumers know that the number crunching is going on. The rise of these predictive models suggests the possibility of a new kind of disclosure duty. Usually government only requires firms to tell a consumer about their products or services (“made in Japan”). Now firms sometimes know more about consumers than the consumers know about themselves. We could require firms to educate consumers about themselves. It might be helpful if Avis told you, before you agree to prepay for gasoline, that other people like you tend to leave more than a third of a tank full when they return the car, so that the effective price for prepaid gas is really four bucks per gallon. Or Verizon might be asked to tell you when their statistical model predicts that you're on the wrong phone plan.
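The arithmetic behind that kind of disclosure is simple enough to sketch. The example below is only an illustration: the posted price of $2.67 per gallon and the one-third leftover are assumed numbers, not figures from any rental company, and `effective_price_per_gallon` is a hypothetical helper name.

```python
def effective_price_per_gallon(prepaid_price: float, fraction_left: float) -> float:
    """Price per gallon actually burned when part of a prepaid tank comes back unused.

    prepaid_price: posted price per gallon for the prepaid tank (assumed value).
    fraction_left: share of the tank still full at return, in [0, 1).
    """
    if not 0 <= fraction_left < 1:
        raise ValueError("fraction_left must be in [0, 1)")
    # You pay for the whole tank but use only (1 - fraction_left) of it.
    return prepaid_price / (1 - fraction_left)


if __name__ == "__main__":
    # Hypothetical numbers: $2.67 posted price, one third of the tank returned.
    print(effective_price_per_gallon(2.67, 1 / 3))
```

With a third of the tank returned, the price per gallon actually used is one and a half times the posted price, which is how a prepaid rate that looks competitive can end up at roughly four dollars.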
Government could also Super Crunch some of its enormous datasets to inform citizens about themselves. Indeed, Super Crunching could truly help reinvent government. The IRS nowadays is almost universally disliked. Yet the IRS has tons of information that could help people if only it would analyze and disseminate the results. Imagine a world where people looked to the IRS as a source for useful information. The IRS could tell a small business that it might be spending too much on advertising or tell an individual that the average taxpayer in her income bracket gave more to charity or made a larger IRA contribution. Heck, the IRS could probably produce fairly accurate estimates about the probability that small businesses (or even marriages) would fail. In fact, I'm told that Visa already does predict the probability of divorce based on credit card purchases (so that it can make better predictions of default risk). Of course, this is all a bit Orwellian. I might not particularly want to get a note from the IRS saying my marriage is at risk. (A little later on, we will take on whether all this Super Crunching is really a good idea. Just because it's possible to make accurate predictions about intimate matters doesn't mean that we should.) But I might at least want the option of having the government make predictions about various aspects of my life. Instead of thinking of the IRS as solely a taker, we might also think of it as an information provider. We could even change its name to the “Information & Revenue Service.”
Consumers Fight Back
Even without government's help, entrepreneurs are bringing new services to market which use Super Crunching as a consumer advocacy tool. Coming to the aid of consumers, these firms are using data-crunching to counteract the excesses of seller-side price extraction. The airline industry is especially fertile ground for such advocacy, because airlines engage in increasingly bewildering pricing shenanigans, trying to find in their databases any crevice of an opportunity to enhance their “revenue yield.”
What's a consumer to do? Enter Oren Etzioni, a professor of computer science at the University of Washington. On a fateful day in 2002, Etzioni was peeved to learn that the people sitting next to him on an airplane trip had bought their tickets for a much lower price merely because they waited to buy their tickets later. He had a student go out and try to forecast whether particular airline fares would increase or decrease as you got closer to the travel date. With just a little data, the student could make pretty accurate forecasts about whether it was a good idea to buy early or wait.
Etzioni ran with the idea in a big way. What he did is a prime example of how consumer-oriented Super Crunching can counteract the number-crunching price manipulations of sellers. He created Farecast.com, a travel website that lets you search for the lowest current fare. Farecast goes further than other fare-search sites; it adds an arrow that simply points up or down telling you which way Farecast predicts fares are headed. Even a prediction that the fare is likely to go up is valuable, because it lets consumers know that they should hurry up and pull the trigger.
“We're doing the same thing the weatherman does,” said Hugh Crean, Farecast's chief executive. “We haven't achieved clairvoyance, nor will we. But we're doing travel search with a real level of advocacy for the consumer.” Henry H. Harteveldt, a vice president and principal travel analyst at Forrester Research in Cambridge, says Farecast is trying to level the informational playing field for travelers. “Farecast provides guidance, much like a stockbroker, about whether you should take action now, or whether you should wait.”
The company (which was originally named Hamlet and had the motto “to buy or not to buy”) is based on a serious Super Crunch. In a five-terabyte database, it keeps fifty billion prices that it purchased from ITA Software, a company that sells price data to travel agents, websites, and computer reservation services. Farecast has information on nearly all the major carriers except Jet Blue and Southwest (who do not provide data to ITA). Farecast can indirectly account for and even predict Jet Blue and Southwest pricing by looking at how other airlines on the same routes react to price changes of the two missing competitors.
Farecast bases its predictions on 115 indicators that are reweighted every day for every market. It pays attention not just to historic pricing patterns, but also to a host of factors that will shift the demand or supply of tickets: things like the price of fuel or the weather or who wins the National League pennant. It turns all this information into an up-arrow if it predicts the price will go up, or a down-arrow if it predicts the price will go down. “It's like going to the ballet,” Harteveldt says. “We don't see the many years of practice and toil and blood and sweat and strain that the ballet dancer has experienced. We're only there in the auditorium watching them as they dance gracefully on the stage. With Farecast, we see the graceful dancing onstage. We don't see the data-crunching, we don't really care about the data-crunching.”
Farecast turns the tera-crunching tables on the airlines. It uses the same databases and even some of the same statistical techniques that airlines have been using to extract money from consumers. But Farecast isn't the only service that has been crunching numbers to help the little guy.
There are a bunch of other services popping up that crunch large datasets to predict prices. Zillow.com in just a few months has become one of the most visited real estate sites on the net. Zillow crunches a dataset of over sixty-seven million home prices to help both buyers and sellers price their homes.
And if you can predict the selling price of a house, why not the selling price of a PDA? Accenture is doing just that. Rayid Ghani, a researcher at Accenture's Information Technology group, for the past two years has been mining the data from 50,000 eBay auctions to predict the price that PalmPilots and other PDAs will ultimately sell for. He hopes to convince insurance companies or eBay itself to offer sellers price-protection insurance that guarantees a minimum price they'll receive. Explains Ghani, “You'll put a nice item on eBay. Then if you pay me ten dollars, I'll guarantee it will go for at least a thousand dollars. And if it doesn't, I'll pay you the difference.” Of course, auction bidders will also be interested in these predictions. Bidcast software that will suggest whether you should bid now or wait for the next item is sure to be coming to a web portal near you.
Sometimes Super Crunching is helping consumers just get through the day. Inrix's “Dust Network” crunches data on the speed of a half million commercial vehicles to predict traffic jams. Today's large commercial fleets of taxis and delivery vans are equipped with global positioning systems that in real time can relay information not just about their position but about how fast they're going. Inrix combines this traffic-flow information with information about the weather, accidents, and even when schools and rock concerts are letting out, to provide instantaneous advice on the fastest way to get from point A to point B.
Meanwhile, Ghani is working to use Super Crunching to personalize our shopping experience further. Soon, supermarkets may ask us to swipe our loyalty cards as we enter the store, at which point the store will data mine through our previous shopping trips and make a prediction of what foods we're running out of. Ghani sees a day when the supermarket will become a food shopping advisor, telling us what we need to buy and offering special deals for the day's shopping trip.
The simple predictive power of a good data crunch can be applied to almost any activity where people do the same thing again and again. Super Crunching can be used to give one side an edge in a commercial transaction, but there's no reason why it has to be the seller. As more and more data becomes increasingly available for free, consumer services like Farecast and Zillow will step forward and crunch it.
In Regressions We Trust
These services not only tell you which way the price is going to move, they also tell you how confident they are in their estimates. So with Farecast a consumer might learn not only that the fare is expected to drop, but also that this type of prediction turns out to be correct 80 percent of the time. Farecast knows that it doesn't always have enough data to make a very precise prediction. Other times it does. So it lets you know not only its best guess, but how confident it is in that guess. Farecast not only tells you how confident it is, but it puts its money where its mouth is. For $10, it will provide you with “Fareguard” insurance, which guarantees that an offered airfare will remain valid for a week, or Farecast will make up the difference.
This ability to report a confidence level in predictions underscores one of the most amazing things about the regression technique. The statistical regression not only produces a prediction, it also simultaneously reports how precisely it was able to predict. That's right: a regression tells you how accurate the prediction is. Sometimes there are just not enough historical data to make a very precise estimate and the output of the regression technique tells you just this. Indeed, it gets even better, because the regression tells you not only the precision of the regression equation on the whole, it also tells you the precision with which it was able to estimate the impact of each individual term in the regression equation.
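To make this concrete, here is a minimal sketch of a one-variable least-squares regression that returns not just its estimates but also their standard errors. It uses the textbook formulas for simple linear regression; the function name `simple_regression` and the tiny dataset are made up for illustration and have nothing to do with any company's actual model.

```python
import math


def simple_regression(x, y):
    """Fit y = intercept + slope * x by ordinary least squares.

    Returns the estimates together with their standard errors and the
    residual standard error, i.e. the regression's own report on its precision.
    """
    n = len(x)
    if n != len(y) or n < 3:
        raise ValueError("need at least three paired observations")
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("x must vary")
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

    slope = sxy / sxx
    intercept = y_bar - slope * x_bar

    # Residual variance uses n - 2 degrees of freedom (two estimated terms).
    sse = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    resid_var = sse / (n - 2)

    return {
        "slope": slope,
        "intercept": intercept,
        "se_slope": math.sqrt(resid_var / sxx),
        "se_intercept": math.sqrt(resid_var * (1 / n + x_bar ** 2 / sxx)),
        "residual_se": math.sqrt(resid_var),
    }


if __name__ == "__main__":
    # Made-up data purely for illustration.
    fit = simple_regression([1, 2, 3, 4], [2, 4, 5, 8])
    print(fit["slope"], fit["se_slope"], fit["residual_se"])
```

The standard errors are the precision report: `residual_se` speaks to the equation as a whole, while `se_slope` and `se_intercept` speak to the individual terms. With fewer or noisier observations, all three grow.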
So Wal-Mart learns three different kinds of things from its employment test regression. First, it learns how long a particular applicant is likely to stay on the job. Second, it learns how precisely it made this prediction. The predicted longevity of an applicant might be thirty months, but the regression will separately report the probability that the applicant would work less than fifteen months. If the thirty months prediction is fairly accurate, the probability that the applicant will work only fifteen months would be pretty small, but for inaccurate predictions this probability might begin to balloon. A lot of people want to know whether they can really trust a regression prediction. If the prediction is imprecise (say because of poor or incomplete data), the regression itself will be the first one to tell you not to rely on it. When was the last time you heard a traditional expert tell you the precision of his or her estimate?
And finally, the regression output tells Wal-Mart how precisely it was able to measure the impact of individual parts of the regression equation. Wal-Mart isn't about to report the results of its regression formula. However, the regression output might tell Wal-Mart that applicants who think “there is room in every corporation for a non-conformist” are likely to work 2.8 months less than people who disagree. The prediction associated with the specific question is 2.8 fewer months, holding everything else about the applicant constant. The regression output can go even further and tell Wal-Mart the chance that “non-conformist” applicants will end up working longer. Depending on the accuracy of the 2.8-month prediction, this probability of a contrary influence might be 2 percent or 40 percent. The regression begins the process of validating itself. It tells you the impact of more rainfall on wine, and whether that particular influence is really valid.
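Both kinds of probability in the Wal-Mart example (the chance of staying under fifteen months, and the chance that the 2.8-month effect really runs the other way) can be roughed out once you have an estimate and its standard error. The sketch below leans on a normal approximation, which is an assumption of this illustration and not something the text specifies; the function names and the standard errors plugged in are likewise hypothetical.

```python
from statistics import NormalDist


def prob_below(threshold, predicted, std_error):
    """P(actual outcome < threshold), treating the prediction error as normal."""
    return NormalDist(mu=predicted, sigma=std_error).cdf(threshold)


def prob_contrary_sign(coefficient, std_error):
    """P(true effect has the opposite sign of the estimate), normal approximation."""
    return NormalDist(mu=0.0, sigma=std_error).cdf(-abs(coefficient))


def std_error_for_contrary_prob(coefficient, prob):
    """Standard error at which the contrary-sign probability equals `prob`."""
    if not 0 < prob < 0.5:
        raise ValueError("prob must be strictly between 0 and 0.5")
    return abs(coefficient) / NormalDist().inv_cdf(1 - prob)


if __name__ == "__main__":
    # Thirty-month prediction: a tight vs. a loose standard error (assumed values).
    print(prob_below(15, 30, 7.5), prob_below(15, 30, 30))
    # How imprecise must the -2.8 estimate be for a 2% or 40% contrary chance?
    print(std_error_for_contrary_prob(-2.8, 0.02))
    print(std_error_for_contrary_prob(-2.8, 0.40))
```

The same 2.8-month estimate is nearly conclusive when its standard error is small and close to a coin flip when the standard error is several times larger than the estimate itself.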
All the World's a Mine
Tera mining of customer records, airline prices, and inventories is peanuts compared to Google's goal of organizing all the world's information. Google reportedly has five petabytes of storage capacity. That's a whopping 5,000 terabytes (or five quadrillion bytes). At first, it may not seem that a search engine really has much to do with data mining. Google makes a concordance of all the words used on the Internet and then if you search for “kumquat,” it simply sends you a list of all the web pages that use that word the most times. Yet Google uses all kinds of Super Crunching to help you find the kumquat pages you really want to see.
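The naive concordance idea, before any of that extra crunching, might look something like the following sketch. It is a toy illustration of counting words per page and ranking by raw count, not a description of how Google actually works; `build_concordance`, `search`, and the sample pages are all invented for the example.

```python
import re
from collections import Counter, defaultdict


def build_concordance(pages):
    """Map each word to a Counter of {page_id: times the word appears there}."""
    index = defaultdict(Counter)
    for page_id, text in pages.items():
        for word in re.findall(r"[a-z0-9']+", text.lower()):
            index[word][page_id] += 1
    return index


def search(index, word):
    """Return page ids that contain `word`, most occurrences first."""
    counts = index.get(word.lower(), {})
    # Sort by descending count, breaking ties alphabetically by page id.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [page_id for page_id, _ in ranked]


if __name__ == "__main__":
    # Invented sample pages.
    pages = {
        "recipes": "Kumquat marmalade: slice each kumquat thinly.",
        "garden": "A kumquat tree needs sun.",
        "news": "Airfares rose again this week.",
    }
    print(search(build_concordance(pages), "kumquat"))
```

Ranking by raw word count is easy to game and says nothing about which page is actually useful, which is why a real search engine needs much more than a concordance.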