Chances Are

Authors: Michael Kaplan

Nowadays, the randomized double-blind clinical trial is to medical experiment what the Boy Scout oath is to life: something we should all at least try to live up to. It is the basis of all publication in good medical journals; it is the prerequisite for all submissions to governmental drug approval agencies. But there have always been problems adapting Fisher's guidelines to fit the requirements of a global industry.
The simplest question is still the most important: has the trial shown an effect or not? The classical expression of statistical significance is Fisher's p-number, which, as you've seen, describes a slightly counterintuitive idea: assuming that chance alone was responsible for the results, what is the probability of a correlation at least as strong as the one you saw? As a working scientist with many experiments to do, Fisher simply chose an arbitrary value for the p-number to represent the boundary between insignificant and significant: .05. That is, if simple chance would produce these results only 5 percent of the time, you can move on, confident that either there really was something in the treatment you'd tested—or that a coincidence had occurred that would not normally occur more than once in twenty trials. Choosing a level of 5 percent also made it slightly easier to use the table that came with Student's t-test; in the days before computers, anything that saved time with pencil and slide rule was a boon.
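To make the definition concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: a hypothetical trial of 60 treated patients with 4 deaths, a background death rate taken to be 1 in 6, and a helper we have named binomial_tail_at_most. It computes the probability that chance alone would produce a result at least as favorable as the one observed.

```python
from math import comb


def binomial_tail_at_most(k, n, p):
    """Return P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


# Hypothetical numbers, not taken from any real trial:
# 60 treated patients, 4 deaths, assumed chance-only death rate of 1/6.
deaths_observed = 4
patients = 60
chance_death_rate = 1 / 6

# One-sided p-number: how often would chance alone give 4 or fewer deaths?
p_number = binomial_tail_at_most(deaths_observed, patients, chance_death_rate)
significant = p_number < 0.05
```

With these assumed numbers the p-number comes out at about .02, under Fisher's line of .05. Note what it does not say: it is a statement about how chance behaves, not about how likely the treatment is to work.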
So there is our standard; when a researcher describes a result as “statistically significant,” this is what is meant, and nothing more. If we all had the rigorous self-restraint of Fisher, we could probably get along reasonably well with this, taking it to signify no more than it does.
Unfortunately, there are problems with p-numbers. The most important is that we almost cannot help but misinterpret them. We are bound to be less interested in whether a result was produced by chance than in whether it was produced by the treatment we are testing: it is a natural and common error, therefore, to transfer over the degree of significance, turning a 5 percent probability of getting these results assuming randomness into a 5 percent probability of randomness assuming we got these results (and therefore a 95 percent probability of genuine causation). The two do not line up: the probability that I will carry an umbrella, assuming it is raining, is not the same as the probability that it is raining, assuming that I'm carrying an umbrella. And yet even professionals can be heard saying that results show, “with 95 percent probability,” the validity of a cause. It's a fault as natural and pervasive as misusing “hopefully”—but far more serious in its effects.
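The gap between the two probabilities can be put in numbers with Bayes' rule. The sketch below is only an illustration: the rain and umbrella figures, the 10 percent share of treatments that truly work, and the 80 percent chance of detecting a real effect are all invented assumptions, and the function name posterior is ours.

```python
def posterior(prior, p_evidence_given_h, p_evidence_given_not_h):
    """Bayes' rule: P(H | evidence) from P(H) and the two conditional rates."""
    joint_h = prior * p_evidence_given_h
    joint_not_h = (1 - prior) * p_evidence_given_not_h
    return joint_h / (joint_h + joint_not_h)


# Umbrella example with invented figures: it rains on 10% of days, I carry
# an umbrella on 90% of rainy days and on 20% of dry days.
p_umbrella_given_rain = 0.9
p_rain_given_umbrella = posterior(0.1, p_umbrella_given_rain, 0.2)

# Trial example with invented figures: 90% of tested treatments do nothing,
# a useless treatment reaches significance 5% of the time, and a real one
# is detected 80% of the time (an assumed figure).
p_chance_given_significant = posterior(0.9, 0.05, 0.8)
```

Under these assumptions the chance of rain given an umbrella is one in three, not 90 percent, and the chance that a significant result is mere randomness is 36 percent, not 5. Different assumed inputs give different answers; the point is only that the p-number alone cannot supply this figure.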
Another problem is that, in a world where more than a hundred thousand clinical trials are going on at any moment, this casually accepted 5 percent chance of coincidence begins to take on real importance. Would you accept a 5 percent probability of a crash when you boarded a plane? Certainly not, but researchers are accepting a similar probability of oblivion for their experiments. Here's an example: in a group of 33 clinical trials on death from stroke, with a total of 1,066 patients, the treatment being tested reduced mortality on average from 17.3 percent in the control group to 12 percent in the treated group—a reduction of more than 25 percent. Are you impressed? Do you want to know what this treatment is?
It's rolling a die. In a study reported in the British Medical Journal, members of a statistics course were asked to roll dice representing individual patients in trials; different groups rolled various numbers of times, representing trials of various sizes. The rules were simple: rolling a six meant the patient died. Overall, mortality averaged out at the figure you would expect: 1/6, or 16.7 percent. But two trials out of 44 (1/22—again, the figure you'd expect for a p-value of 5 percent) showed statistically significant results for the “treatment.” Many of the smaller trials veered far enough from the expected probabilities to produce a significant result when taken together. A follow-up simulation using the real mortality figures from the control group of patients in a study of colorectal cancer (that is, patients who received no treatment), showed the same effect: out of 100 artificially generated randomized trials, four showed statistically significant positive results. One even produced a 40 percent decrease in mortality with a p-value of .003. You can imagine how a study like that would be reported in the news: “Med Stats Prove Cancer Miracle.”
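The dice experiment is easy to imitate on a computer. What follows is a rough sketch, not the method of the British Medical Journal study: the arm size of 50, the 2,000 repetitions, the fixed random seed, and the pooled two-proportion z-test used to judge significance are all our own assumptions.

```python
import math
import random


def two_sided_p(deaths_a, n_a, deaths_b, n_b):
    """Two-sided p-value from a pooled two-proportion z-test (normal approximation)."""
    pooled = (deaths_a + deaths_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0  # no deaths at all, or everyone died: nothing to compare
    z = (deaths_a / n_a - deaths_b / n_b) / se
    return math.erfc(abs(z) / math.sqrt(2))


def roll_deaths(n_patients, rng):
    """Roll one die per patient; a six means the patient died."""
    return sum(1 for _ in range(n_patients) if rng.randint(1, 6) == 6)


def run_dice_trials(n_trials, arm_size, rng):
    """Simulate trials in which 'treated' and control arms are both pure dice."""
    p_values = []
    for _ in range(n_trials):
        treated = roll_deaths(arm_size, rng)
        control = roll_deaths(arm_size, rng)
        p_values.append(two_sided_p(treated, arm_size, control, arm_size))
    return p_values


rng = random.Random(0)  # fixed seed so the run is repeatable
p_values = run_dice_trials(2000, 50, rng)
false_positive_rate = sum(p < 0.05 for p in p_values) / len(p_values)
```

Since no arm receives anything but dice, every "significant" trial here is a coincidence, and the share of such trials should land somewhere near one in twenty.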
Chance, therefore, plays its habitual part even in the most rigorous of trials—and while Fisher would be willing to risk a small chance of false significance, you wouldn't want that to be the basis of your own cancer treatment. If you want to squeeze out chance, you need repetition and large samples.
Large samples turn out to be even more important than repetition. Meta-analysis—drawing together many disparate experiments and merging their results—has become an increasingly used tool in medical statistics, but its reliability depends on the validity of the original data. The hope is that if you combine enough studies, a few sets of flawed data will be diluted by the others, but “enough” is both undefined and crucial. One meta-analysis in 1991 came up with a very strong positive result for injecting magnesium in cases of suspected heart attack: a 55 percent reduction in the chance of death with a p-value less than .001. The analysis was based on seven separate trials with a total of 1,301 patients. Then came ISIS-4, an enormous trial with 58,050 patients; it found virtually no difference in mortality between the 29,011 who were injected with magnesium and the 29,039 who were not. The results differed so dramatically because, in studies with 100 patients or fewer, one death more in the control group could artificially boost the apparent effectiveness of the treatment—while no deaths in the control group would make it almost impossible to compare effectiveness. Only a large sample allows chance to play its full part.
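A little arithmetic shows how much a single death can move a small trial. The sketch below uses invented death counts (they are not the magnesium or ISIS-4 data), and relative_reduction is simply our name for one minus the ratio of the two death rates.

```python
def relative_reduction(deaths_treated, n_treated, deaths_control, n_control):
    """Relative reduction in mortality, or None when the control arm has no deaths."""
    if deaths_control == 0:
        return None  # nothing to compare against
    treated_rate = deaths_treated / n_treated
    control_rate = deaths_control / n_control
    return 1 - treated_rate / control_rate


# Hypothetical small trial: 50 patients per arm.
small_before = relative_reduction(2, 50, 4, 50)
small_after = relative_reduction(2, 50, 5, 50)  # one extra control death
small_shift = small_after - small_before

# Hypothetical large trial: 29,000 patients per arm.
large_before = relative_reduction(2000, 29000, 2100, 29000)
large_after = relative_reduction(2000, 29000, 2101, 29000)  # one extra control death
large_shift = large_after - large_before

# A small trial with no control deaths cannot be scored at all.
no_comparison = relative_reduction(1, 50, 0, 50)
```

In the small trial, one additional control death lifts the apparent benefit from 50 to 60 percent; in the large one, the same death shifts the figure by less than a tenth of a percentage point.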
There is a further problem with statistical significance if what you are investigating is intrinsically rare. Since significance is a matter of proportion, a high percentage of a tiny population can seem just as significant as a high percentage of a large one. Tuberculosis was a single disease, spread over a global population. Smoking was at one time a nearly universal habit. But now researchers are going after cagier game: disorders that show themselves only rarely and are sometimes difficult to distinguish accurately from their background.
In Dicing with Death, Stephen Senn tells the story of the combined measles, mumps, and rubella vaccine—which points out how superficially similar situations can require widely separate styles of inference, leading to very different conclusions. That rubella during pregnancy caused birth defects was discovered when an Australian eye specialist overheard a conversation in his waiting room between two mothers whose babies had developed cataracts. A follow-up statistical study revealed a strong correlation between rubella (on its own, a rather minor disease) and a range of birth defects—so a program of immunization seemed a good idea. Similar calculations of risk and benefit suggested that giving early and complete immunity to mumps and measles would have important public health benefits. In the mid-1990s, the UK established a program to inoculate children with a two-stage combined measles, mumps, rubella (MMR) vaccine.
In 1998, The Lancet published an article by Dr. Andrew Wakefield and others, describing the cases of twelve young children who had a combination of gastrointestinal problems and the sort of developmental difficulties associated with autism. In eight of the cases, the parents said they had noticed the appearance of these problems about the time of the child's MMR vaccination. Dr. Wakefield and his colleagues wondered whether, in the face of this large correlation, there might be a causal link between the two. The reaction to the paper was immediate, widespread, and extreme: the authors' hypothetical question was taken by the media as a definitive statement; public confidence in the vaccine dropped precipitately; and measles cases began to rise again.
On the face of it, there seem to be parallels between the autism study and the original rubella discovery: parents went to a specialist because their children had a rare illness; they found a past experience in common; the natural inference was a possible causal relationship between the two.
But rubella during pregnancy is relatively rare and the incidence of cataracts associated with it was well over the expected annual rate. The MMR vaccination, on the other hand, is very common—it would be unusual to find a child who had not received it during the years covered by the Lancet study. Moreover, autism is not only rare, but difficult to define as one distinct disorder. To see whether the introduction of MMR had produced a corresponding spike in cases of autism would require a stable baseline of autism cases—something not easy to establish.
Later studies looking for a causal link between MMR and autism could find no temporal basis for causation, either in the year that the vaccine was introduced or in the child's age at the onset of developmental problems. In 2004, ten of the authors of the original Lancet paper formally disassociated themselves from the inferences that had been drawn from it—possibly the first time in history that people have retracted not their own statement, but what others had made of it.
High correlation is not enough for inference: when an effect is naturally rare and the putative cause is very common, the chance of coincidence becomes significant. If you asked people with broken legs whether they had eaten breakfast that morning, you would see a very high correlation. The problem of rarity remains and will become more troubling, the more subtle the illnesses we investigate. Fisher could simply plant another block of wheat; doctors cannot simply conjure up enough patients with a rare condition to ensure a reliable sample.
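The breakfast example can be checked with a two-by-two count. The figures below are invented for illustration (1,000 people, 900 of whom ate breakfast, 10 of whom broke a leg), and exposure_rates is a hypothetical helper of our own. The comparison that matters is not how many of the injured ate breakfast, but whether that share differs from the share among everyone else.

```python
def exposure_rates(cases_exposed, cases_total, others_exposed, others_total):
    """Share exposed to the putative cause among cases and among non-cases."""
    return cases_exposed / cases_total, others_exposed / others_total


# Invented counts: 10 people with broken legs, 9 of whom ate breakfast;
# 990 uninjured people, 891 of whom ate breakfast.
among_broken_legs, among_everyone_else = exposure_rates(9, 10, 891, 990)

# The excess exposure among cases is what would hint at a real link.
excess = among_broken_legs - among_everyone_else
```

With these counts, 90 percent of the injured ate breakfast, and so did 90 percent of the uninjured: the impressive first figure reflects nothing but how common breakfast is.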
 
The word “control” has misleading connotations for medical testing: when we hear of a “controlled experiment,” it's natural to assume that, somehow, all the wild variables have been brought to heel. Of course, all that's really meant is that the experiment includes a control group who receive a placebo as well as a treatment group who get the real thing. The control group is the fallow ground, or the stand of wheat that has to make its way unaided by guano. Control is the foundation stone of meaning in experiment; without it, we build conclusions in the air. But proper control is not an easy matter: in determining significance, a false result in the control can have the same effect as a true one in the treatment group.
Let's consider an obvious problem first: a control group should be essentially similar to the treatment group. It's no good comparing throat cancer with cancer of the tongue, or the depressed with the schizophrenic. If conditions are rare or difficult to define, the control group will pose the same difficulties of sample size, chance effects, and confounded variables as the treatment group. Controls also complicate the comparison of trials from different places: is the reason your throat-cancer controls show a higher baseline mortality in China than in India simply a matter of chance, to be adjusted out of your data—or is there hidden causation: drinking boiling hot tea or using bamboo food scrapers?
Moreover, how do you ensure that the control group really believes that it might have received the treatment? In the 1960s, controls in some surgery trials did indeed have their chests opened and immediately sewn up again—a procedure unlikely to pass the ethics committee now. How would you simulate chemotherapy or radiation? Surreptitiously paint the patient's scalp with depilatory cream? Patients are no fools, especially now in the age of Internet medicine: they know a lot about their conditions and are naturally anxious to find out whether they are getting real treatment. If the control is the foundation, there are some soils on which it is hard to build securely.
The word “placebo” is a promise: “I will please.” This promise is not a light matter. For a double-blind experiment to succeed, both the control group and the doctors who administer the placebo have to be pleased by it—they have to believe that it is indistinguishable from the active treatment. Considerable work has gone into developing placebos that, while inactive for the condition being tested, provide the side effects associated with the treatment.
But placebos can please too much: patients get drunk on placebo alcohol, they become more alert (although not more irritable) on placebo coffee. Placebo morphine is a more effective painkiller than placebo Darvon. The same placebo constricts the airways of asthmatic people when described as a constrictor and dilates them when described as a dilator. Red sugar pills stimulate; blue ones depress—brand-name placebos work better than generic. And higher dosages are usually more effective.
This placebo effect is well documented, but that doesn't make it easy to deal with. If a control-group patient improves, it needn't be because of the placebo; some simply get better, whether through the natural course of the illness or through reversion to the mean. You need, essentially, a control for your control—another group you are not even trying to please. How do you formulate that? “Would you like to participate in a study where we do nothing for your condition?” Might this cheery message affect the patient's well-being? We are already weaving a tangled web.
Quantifying the placebo effect can be essential, particularly in subjective areas like pain or depression. A meta-analysis of all the effectiveness trials submitted to the FDA for the six most widely prescribed antidepressants approved for use between 1987 and 1999 found that, on the standard 50-point scale for depression, the drugs, on average, moved the patients' mood by only two points more than did the placebo. In other words, the placebo showed about 80 percent of the effectiveness of the medication. So, although antidepressants showed a statistically significant effect, it was clinically negligible—if you assume that the effect of medication should be additional to the placebo effect.
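The two figures in that paragraph, a two-point advantage and a placebo that achieves about 80 percent of the drug's effect, together imply the size of each improvement. The short calculation below works this out; the resulting totals are our inference from those two numbers, not values quoted from the meta-analysis, and implied_improvements is a hypothetical helper.

```python
def implied_improvements(extra_points, placebo_share):
    """Solve drug - placebo = extra_points with placebo = placebo_share * drug."""
    drug = extra_points / (1 - placebo_share)
    placebo = placebo_share * drug
    return drug, placebo


# Figures from the text: drug beats placebo by 2 points,
# and placebo delivers about 80 percent of the drug's effect.
drug_points, placebo_points = implied_improvements(2, 0.8)
```

On these figures the drug would move patients roughly 10 points on the scale and the placebo roughly 8, which is why the difference can be statistically detectable and still small in clinical terms.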
