I recently learned that I have an above average number of legs. This is no cause for concern: most of you do, too. It was something I first learned when watching Hans Rosling’s The Joy of Stats BBC documentary. He pointed out that, since there are a few people with only one leg or none at all, the average number of legs is about 1.99 – just short of most people’s two.
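To see where a number like 1.99 comes from, here's a back-of-envelope sketch in Python (the proportions below are entirely made up for illustration, not real population figures):

```python
# Toy illustration of why the "average number of legs" falls just below two.
# These fractions are invented for the example, not real statistics.
leg_counts = {2: 0.990, 1: 0.008, 0: 0.002}  # legs -> assumed fraction of people

mean_legs = sum(legs * fraction for legs, fraction in leg_counts.items())
print(f"average legs per person: {mean_legs:.3f}")  # just short of 2
```

Any small fraction of people with fewer than two legs drags the mean below two, which is the whole joke: almost everyone is above average.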
It shows that sometimes statistics are meaningless. There is no practical application to knowing the exact average number of legs per person. If you told a jeans manufacturer that he was accounting for too many legs, since the average person has fewer than two, he’d rightly say “What does that have to do with anything?”
And that is pretty much how I’ve seen all statistics for a very long time. Sometimes I understood it, or at least understood how to manipulate some numbers according to the proper rules, but I always thought “What does this have to do with anything?”
I managed to go through all of high school without learning a single thing about statistics. It was part of the other math class, the kind the kids who went into economics took. On the science path, you learned to solve differential equations or calculate the area of a plane intersecting a cube at weird angles, but not statistics.
While in high school, we visited the university once, and took intro classes in two departments. In the biochemistry group of the chemistry department (where I would later study), we isolated DNA from E. coli, and got to take it home in a little vial in ethanol. Awesome. In the economics department, we calculated the odds of finding a yellow marble in a jar full of blue and yellow ones. I have no clue what this had to do with economics.
In undergrad, there was a bit of statistics in one of the required courses in the chemistry program, but it ended up being only one question on one exam, and you could get it wrong and still do really well overall. No need for statistics there.
I didn’t really need statistics at all until I had to analyse the data that came out of my PhD research. I tried to look at introductory texts and websites, but nothing made sense. Okay, so that’s the formula for Student’s t-test, but what does that have to do with anything? Every textbook that explained statistics either gave formulas that I still didn’t know when or how to use, or talked about situations entirely different from the research I was doing. I couldn’t relate the examples about profit margins to my data from the lab.
Eventually, a lab mate and I convinced our supervisor to buy Intuitive Biostatistics for the lab. It’s by the guy who developed GraphPad, but it’s not shilling the software in any way. It explains when to use which kind of calculations, and why, using examples from biology. Suddenly it all made sense, and I could analyse my work, and even give sensible answers to the questions about statistics at my defense, such as why I calculated confidence intervals for some of my experiments.
My feelings about statistics are best summarized in this section from my thesis. It’s one of my favourite paragraphs in there, and yet another way of saying “What does this have to do with anything?”
bq. “However, a statistically significant difference between transfected and untransfected cells does not necessarily correlate with a biologically significant difference. This is clear from the data collected from non-silencing controls. After both 48 hour and 72 hour transfections, two out of eight non-silencing controls show a statistically significant (P<0.05) reduction (…). This means that there is a strong possibility that about a quarter of all “hits” for which P<0.05 are a false positive.”
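For what it's worth, a quick binomial check (my own back-of-envelope sketch here, not anything from the thesis) puts numbers on that worry: at P<0.05 you'd expect only 0.4 chance hits among 8 true-null controls, so seeing 2 is on the unlucky side.

```python
# If the null hypothesis is true for every control, each of the 8
# non-silencing controls still has a 5% chance of crossing P < 0.05
# by luck alone. How surprising is seeing 2 or more "hits"?
from math import comb

n, alpha = 8, 0.05  # 8 controls, cut-off from the quote above

def prob_at_least(k, n, p):
    """P(at least k successes in n independent trials with success prob p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

expected_hits = n * alpha  # expected false positives: 0.4
p_two_or_more = prob_at_least(2, n, alpha)
print(f"expected chance hits among 8 controls: {expected_hits}")
print(f"P(>=2 of 8 significant by chance): {p_two_or_more:.3f}")  # about 0.057
```

So seeing 2 of 8 controls come up "significant" would itself happen less than 6% of the time by pure chance, which makes the arbitrariness of the P<0.05 cut-off all the more glaring.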
But oddly, as much as I dislike the kind of stats that come with data analysis – the kind that I usually don’t see the point of – I really LOVE graphs and data.
Here is a graph that shows how much I love graphs compared to how much I dislike statistics:
It’s a little confusing, though, because this love includes visualisation of website access numbers – which are still called “statistics”. But they’re just the data, and nobody is asking me to do a t-test on the numbers and then grill me about R values and curve fitting and probabilities.
I also really love Information is Beautiful. I have the book, and enjoyed David McCandless’ talk at Science Online this year.
And in undergrad, where I avoided statistics like the plague, a friend and I spent a few days drawing absurd graphs of absolutely everything, xkcd-style, based on fictional scenarios. My favourite of the batch was the number of visitors to the university cafeteria plotted against the cooking time of the green beans they served. It peaked at 10 minutes, but of course it was never that busy, because the beans were always cooked for about 20 minutes.
Years later, in October 2008, a few months after I got completely annoyed by prettifying all the graphs in my PhD thesis, I made a similarly silly graph showing my happiness and IQ over time.
I also make more serious graphs for fun. For the past eleven months (since I moved), I’ve religiously tracked every single penny I spent, and sorted the resulting amounts into a pie chart, to see where my money goes. The different categories are roughly coloured by whether I can reduce them or not. (I’m not showing the legend for personal reasons, but can tell you that pink represents money spent on the cat. I didn’t have my cat with me most of the year, so this is pretty small, and the only section where I’m planning an increase in spending. Although, as I type this, the same cat is pulling decorations off the Christmas tree in the background. Reduce spending! Reduce!)
Another graph I really like is this one. It shows individual visitors to the Node, with clear dips on weekends.
We now (finally) got stats on our Nature Network blogs as well, but since mine only started tracking on December 24, they’re not very exciting yet. I’m intrigued, though.
Why do I like graphs but not the other kind of stats?
What data visualisation does that R values and t-tests don’t is make it immediately clear what you’re looking at, and how it’s relevant to the real world. Setting P<0.05 as the cut-off value for “statistical significance” is not relevant to the real world. It’s arbitrary, and it even affects which data get published.
Here’s a graph I’ve discussed on the blog before. It comes from a study showing that, in a certain field, reported P-values are almost never just above 0.05, implying that data were manipulated or extra experiments were added to push results just below 0.05 so they could count as “significant”.
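The logic behind that suspicion can be sketched with a toy simulation: when the null hypothesis is really true, p-values are uniformly distributed, so honest results should land just above 0.05 about as often as just below it. A dip right above the cut-off is what looks fishy. The setup below (two-sample z-tests on identical populations) is my own illustration, not the study's method:

```python
# Under a true null, p-values are uniform on [0, 1], so the fractions
# landing in [0.04, 0.05) and [0.05, 0.06) should both be about 1%.
import math
import random

random.seed(42)

def two_sample_p(n=20):
    """Two-sided p-value comparing two samples drawn from the same N(0,1)."""
    a = [random.gauss(0, 1) for _ in range(n)]
    b = [random.gauss(0, 1) for _ in range(n)]
    z = (sum(a) / n - sum(b) / n) / math.sqrt(2 / n)  # known variance = 1
    return math.erfc(abs(z) / math.sqrt(2))  # equals 2 * (1 - Phi(|z|))

pvals = [two_sample_p() for _ in range(20000)]
just_below = sum(0.04 <= p < 0.05 for p in pvals) / len(pvals)
just_above = sum(0.05 <= p < 0.06 for p in pvals) / len(pvals)
print(f"fraction in [0.04, 0.05): {just_below:.3f}")
print(f"fraction in [0.05, 0.06): {just_above:.3f}")
```

Since both bins come out roughly equal in an honest world, a published literature where the bin just above 0.05 is nearly empty says more about the publishing than about the science.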
And I’ve got to say: as much as I don’t like the arbitrariness of “statistical significance” and the rules about which analysis to use for what kind of data, I really like this graph.
So, you see, it’s complicated. Initially, I was going to title the post “The Fear of Stats”, but as I started writing, I realized I wasn’t scared of stats: just bored and annoyed and wondering, indeed, what they had to do with various things. Keeping track of lots of data makes for pretty graphs and useful trends. Those kinds of stats are cool. But statistical analysis of data doesn’t always make sense to the people using it. Not just because it’s complicated, but because it’s not always informative of what they’re looking at. It has to make sense in context. You have to be able to actually answer the question “what’s stats got to do with it?”, and not just use it rhetorically like I did in most of this post.