Failure

by Eva Amsen

Prologue: I have been working on this blog post about failure for months and kept cutting bits out of it because it was so long and boring. This is actually the second-to-last version, because I cut out a bit too much in the end, and that’s how I knew it was done. (I wish they could do that at the hairdresser: “No, this is too much, can you just undo that last bit?”)

The post originated from a session I held at BioBarCamp in August of this year. There was another part to the story – that of the value of alternative careers – but I left that out of this. I’ll probably get back to it at some point. Because I took so long writing it, I also added things that I did not mention in Palo Alto, such as the Douglas Prasher story.

During the writing process I needed to purge some other thoughts, which led to this earlier blog post in October, Intro to Failure.

I’m completely bored with this post now, having looked at it on and off for the past 5 months, but here it is:

Science is an extremely competitive line of work, and the unit of success is the publication record. You need many publications in good journals to get good jobs and to get funded. These good journals publish interesting work. To be interesting, you also need to be original. Original experiments are those that have not been done before, and that means that you don’t know, going into it, whether they will even work.

Whether a new type of experiment will work on the first try or not has nothing to do with how smart the scientist is or how hard they work. It’s luck. But the person who is so lucky as to get everything to work right away will be the first to publish. Meanwhile, someone who wasn’t as lucky on the first attempt will have switched methods, talked to a lot of people to figure out how to get the experiment to work, read some more papers, tried other experiments, and still has not even started to test their initial hypothesis, let alone think about publishing. At some point they might give up and study something else, at which point they have to start at square one.

Are they any less smart than the person who already published their paper? No. Have they worked any less hard? No. They might even have worked harder. Do they have any less experience? Again, no: they may have looked into more techniques than the person who got everything to work right away, and they certainly have a lot more experience in troubleshooting problems! But that doesn’t count. No publication means failure in this line of work.

It is this publish or perish pressure that drives some people to extreme measures in an attempt to keep their jobs. My favourite example of this is William Summerlin, who, in 1974, painted mice with Sharpies and claimed that it was a successful skin transplant. There have been several other reported cases of fraud, and those are just the ones that were caught. Many others commit scientific fraud and get away with it. Worse, many scientists commit fraud without even realizing they do so.

The pressure is not just to publish, but to publish positive data. But scientific data, especially in biology, are usually not simply positive or negative. There is a spectrum of possibilities, and while some results are very obviously negative or positive, most are somewhere in the middle. Where do we draw the line? By convention we use statistics.

Results from biological experiments are considered significant (positive results) when the P-value is smaller than 0.05, and not significant (negative results) when the P-value is larger than 0.05. A P-value of 0.05 means that, if the effect you are looking for did not actually exist, the probability of getting a result at least as extreme as the one you observed purely by chance is 5%. If the P-value is 0.049, that chance is 4.9%; if P equals 0.051, it is 5.1%. Of these examples, only P=0.049 would be considered statistically significant and publishable.

Do you think it really matters, biologically speaking, if the P value that comes out of your calculations is 0.049 or 0.051? Nah.
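To see how little separates the two verdicts, here is a minimal sketch, not from the original post: it assumes Python with NumPy and SciPy, uses a two-sample t-test as a stand-in for whatever test a real experiment would use, and the sample size and effect size are invented. Many identical "experiments" differ only in sampling luck, and that alone decides which side of the 0.05 line each one lands on.

```python
# Toy illustration (hypothetical numbers): two groups are drawn from
# distributions with the SAME modest true difference, over and over.
# The only thing that changes between "experiments" is random sampling.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_experiments = 1000
n_per_group = 15       # hypothetical sample size per group
true_effect = 0.7      # hypothetical true difference between group means (in SDs)

p_values = []
for _ in range(n_experiments):
    control = rng.normal(loc=0.0, scale=1.0, size=n_per_group)
    treated = rng.normal(loc=true_effect, scale=1.0, size=n_per_group)
    _, p = stats.ttest_ind(control, treated)   # same test, same biology, different luck
    p_values.append(p)

p_values = np.array(p_values)
print(f"P < 0.05 ('significant, publishable'): {np.mean(p_values < 0.05):.0%}")
print(f"0.05 <= P < 0.10 ('just missed'):      {np.mean((p_values >= 0.05) & (p_values < 0.10)):.0%}")
```

With these made-up numbers, a sizeable share of perfectly real effects falls just short of "significance", and which experiments do so is decided entirely by the random draw.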

A few years ago, a study looked at the reported P-values in political science papers. The social sciences also use the rule that “positive data” are those with a P-value just under a certain cutoff. If you plot how often reported P-values fall at a given distance below or above that cutoff, you would expect something like a bell curve, with very few extremely low values and most values close to the cutoff. Because of a bias towards positive data, you might even expect more values below the cutoff than above it. But what was found was not only a very high peak just under the cutoff value, but also a dip just above it:
<img src="http://farm4.static.flickr.com/3008/3145969749_f8c7d6d901.jpg?v=0">
(The plot shows the distance of reported test statistics from the critical value (c.v.), which is 1.64 for one-tailed tests; the dotted line is the critical z statistic for p=0.05. This explains it better.)

There were fewer P-values just a little bit too high than there were P-values much too high. The blog post I linked to suggests that this happened because when people found a P-value of 0.051 they simply repeated the study until they got 0.049, a value that is publishable. It also suggests that this only happened near the cutoff point.
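That mechanism is easy to reproduce in a toy simulation. The sketch below is hypothetical and is not the analysis from the study described above; the two-group t-test, the effect size, the sample size, and the "quietly redo near misses and keep the best attempt" rule are all invented purely to show how a dip just above the cutoff can appear in the reported values.

```python
# Hypothetical illustration: compare the distribution of *reported* P-values
# when near misses (0.05 <= P < 0.10) are quietly re-run and the best attempt
# is kept, versus honest one-shot reporting.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

def one_study(n=20, effect=0.5):
    """One simulated two-group comparison; returns its P-value."""
    a = rng.normal(0.0, 1.0, n)
    b = rng.normal(effect, 1.0, n)
    return stats.ttest_ind(a, b)[1]

def reported_p(rerun_near_misses):
    """The P-value as it would be reported, with or without re-running near misses."""
    p = one_study()
    if rerun_near_misses and 0.05 <= p < 0.10:
        # Near miss: repeat the experiment a couple of times and keep the best result.
        p = min(p, one_study(), one_study())
    return p

for practice in (False, True):
    ps = np.array([reported_p(practice) for _ in range(5000)])
    just_under = np.mean((ps >= 0.04) & (ps < 0.05))
    just_over = np.mean((ps >= 0.05) & (ps < 0.06))
    label = "with re-runs" if practice else "honest      "
    print(f"{label}  share in [0.04, 0.05): {just_under:.3f}   share in [0.05, 0.06): {just_over:.3f}")
```

In this toy version the band just above 0.05 is thinned out while the "significant" side picks up the re-run results, a milder version of the asymmetry the histogram above shows.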

Such practices undoubtedly occur in other fields as well. Is this bad? Not for the authors: they got their paper out. Not for research in general: results with P=0.051 are no less relevant than those with a P value of 0.049. These are no Sharpie-coloured mice.

But the fact that people feel the need to fudge their data (because that’s what the non-reporting of the P=0.051 values is) to publish and to secure their jobs, that is bad. That means there is something wrong with the system by which success in science is determined.

There are other clues that something may be wrong. If the system worked, people with the potential to be the best and most necessary in their field would not lose their funding. You might say that we will never know: once they’re gone, how do you determine whether they were the best in their field?

Well, you can’t, really. Unless that person has left an obvious legacy: Two of the winners of this year’s Nobel Prize in Chemistry could not have done their work if they had not received the DNA for Green Fluorescent Protein (GFP) from Douglas Prasher, the guy who initially isolated the gene. Sure, if he hadn’t done it, someone else would have, but Prasher did it first, and that’s what counts, right?

Prasher no longer works in science. His grant money ran out, and he is now driving a shuttle bus for a car dealership in Huntsville, Alabama. If someone working on groundbreaking, Nobel-worthy research cannot keep his lab running, how is anyone else supposed to be able to do so?

If only positive, exciting data are a measure of success – as expressed in the unit of The Published Paper In A Good Journal – then a lot of things get left out. Incremental projects, such as cloning a gene (but not doing many groundbreaking experiments with it), are relevant, yet they don’t count. And if the difference between positive data and negative data is for a large part based on luck, isn’t science more a gamble than a career?


(This was based in part on a session I ran at BioBarCamp, on August 7, 2008, in Palo Alto. Thanks to Michael Nielsen for telling me the statistics story (twice), and thanks to the many people who have discussed scientific publishing with me, both online and offline.)


8 comments

Björn Brembs December 29, 2008 - 7:47 AM

Awesome post! This is exactly what is wrong with the current system. It’s a snowball system. That’s when it becomes more of a gamble than a career.

Bob O'Hara December 29, 2008 - 10:20 AM

I guess it could be argued that Prasher was lucky in the first place to be working on a gene that would be so important.
I’ve blogged before about the evils of p-values, but this is a nice demonstration of what effect their worship has (“funnel plots”:http://www.badscience.net/2008/03/beau-funnel/ are another). You don’t even have to remove data or get more of it to shift the p-value: just try 17 analyses, and use the “best” one.

Mike Fowler December 29, 2008 - 8:44 PM

A nice post, Eva, very smoothly written and well worth the effort!
Any chance of suggesting alternative methods to compare between competitors for a limited funding source? That’s hopefully the next big thing – shifting the focus for reviewing bodies (for grants/jobs) from the impact factor of the journal a paper is published in, to the citation rate of a paper itself. Prasher might have done better if judged that way.
Some -shameless self promotion- alternative publishing outlets can be found that can help move focus away from _p_ = 0.05 as a Holy grail, starting with the “Journal of Negative Results – Ecology & Evolutionary Biology”:http://www.jnr-eeb.org/ but there are other Negative Result journals out there too.

Thomas Kluyver December 29, 2008 - 9:14 PM

I’d definitely agree. Mike got there just before me to ask about alternatives, but on his point, I’m not sure that citation rate of the paper is necessarily a great metric–as a speaker pointed out at one stage, an effective way to get a lot of citations is to get something published that is quickly shown to be wrong–lots of people will then cite it to explain why it’s wrong. There are ways around that, but I think there are more serious problems. Papers which get cited a lot may not be the best science–they may be simply more eloquent, have more connections with other people in the field, or just be plain lucky (a few citations from well-read papers will be borrowed by many more authors). And it also reinforces the bias towards the fashionable science (e.g. climate change at the moment) that is already widely done, devaluing people working in fields that aren’t trendy just now.
It’s quite probably a better measure than simply where a paper is published, but I hope you’ll agree that it’s still far from perfect. Which brings us back to alternatives–what’s a better way to assess the quality of science or scientists?

Åsa Karlström January 2, 2009 - 3:47 PM

Eva> Thanks for an interesting post. (I know it can be boring when you’ve rewritten it a bunch of times. Still, I’m happy you published it – it’s an important topic.)
I wonder if there is some kind of correlation with “what n” is used in the studies? I mean, sometimes you can simulate what would happen to the P-value if you take the result of n=20 and make both groups (for example) bigger with the same proportions. Would some of the P=0.051 and then P=0.049 results be due to increased sample size, rather than, as suggested in the comments, the method of analysis being changed, etc.?

Eva Amsen January 3, 2009 - 7:10 PM

Maybe? I don’t know if increasing sample size was an option in those kinds of studies.

steffi suhr January 3, 2009 - 8:58 PM

Is it realistic to think that some day the quality of papers and the research presented might be evaluated via a voting system, as in PLoS ONE for example? I don’t know the story of the PLoS ONE model – whether people are going for it etc. – if you all do, please don’t mind me rambling.
Of course there would be problems that need solving, such as the visibility of publications and the dependence on people really buying into this. A paper in a more obscure journal would not be read by as many people and would not get as many votes, so that would be a bit of the same effect as impact factors right now. Unless you remove the quantity factor from the equation… or does that even matter, as long as it gets good ‘grades’ from the few in any super-specialized area?
The plus would be that the process would be about as democratic as science gets, and a wide variety of people with expertise would have a bearing on the rating of a paper (including grad students who are deeply into the subject area and might otherwise not have any input). Of course connections, luck, etc. would still factor in, but I don’t think we’ll ever change that completely (it’s kind of part of anything people do, isn’t it?).
Eva – great post. Sorry my comment doesn’t address the positive results issue. I loved the last sentence of the first paragraph, it’s a fantastic opening 🙂

Åsa Karlström January 7, 2009 - 10:47 PM

Eva> Maybe not in the political science papers (I am not sure what exactly they measure, since it is not my area). I was thinking more in terms of mice and random samples of bacteria/cells, where the P-value usually changes when the sample size increases. Not always, but a P-value obtained with a small N will tend to decrease when N is increased, if there really is a difference. (I was thinking of a case where you see a trend towards a difference with n=10. If you increase the sample size and get the same outcome with N=20, the P-value will have decreased, since it is then more statistically likely that the groups are different.)
(I’m not trying to be arrogant here, although it is stats 101 I guess… I am kind of asking whether it would be likely in other studies too? And if you know this, doesn’t it point to the “problem” of overvaluing P=0.05 and undervaluing P=0.052? This is one of the reasons I am not too sure about P-values around 0.05, but rather focus on P smaller than 0.01 or 0.001, which means that there are really big differences.)
Of course, there is a grey scale where you would find interesting findings too (the P=0.052), but I wonder where 0.05 comes from? The idea that this is the “magic” upper number? Because it is easier to reach than 0.01 or 0.001, I guess? And the fact that it has been used in mathematical models for a long time, although there it is easy to increase sample size… and the questions are slightly more clear-cut yes/no ones. I wonder if this is partly because scientists today aren’t as well versed in stats and math as “they were before”?
[disclaimer for the huge and arbitrary assumption here. I might be way off, but looking at a few biology programs and MD programs, classes in stats and math aren’t high on the scale unless you are a population biologist/ecologist]
