Ten News Stories of 2010 - and the Statistics that Made Them. Part 5

Author: Stephanie Kovalchik

Today, New Year's Eve, we bring you the final two of our 10 most noteworthy stories of 2010 and the statistics that made them possible. After Christmas over-indulgence, we have to confirm that being overweight does indeed do no favours to your health. And we show that statistics, with digital technology, can reveal whole new understandings of human creativity, literature and artistic endeavour through the ages.  Happy 2011!

9. Fat kills. Meta-analysis was at the center of several surprising health studies of 2010. What has made meta-analysis an increasingly popular method of scientific review is that, by pooling data from multiple, compatible studies, researchers can conduct more powerful analyses and estimate more quantities of interest than with a single trial. This year quantitative reviews overturned common beliefs about the accuracy of diagnosing food allergies, the dangers of consuming red meat and the health benefits of vitamin D supplements.
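To make the pooling idea concrete, here is a minimal sketch of fixed-effect (inverse-variance) meta-analysis. The effect estimates and standard errors below are hypothetical, invented purely for illustration; real reviews use more elaborate models (random effects, heterogeneity tests), but the weighting principle is the same - more precise studies count for more.

```python
# Minimal sketch of fixed-effect (inverse-variance) meta-analysis pooling.
# The study estimates (e.g. log relative risks) and standard errors are hypothetical.
import math

def pool_fixed_effect(effects, std_errors):
    """Combine study estimates using inverse-variance weights."""
    weights = [1.0 / se ** 2 for se in std_errors]
    pooled = sum(w * e for w, e in zip(weights, effects)) / sum(weights)
    pooled_se = math.sqrt(1.0 / sum(weights))
    return pooled, pooled_se

effects = [0.10, 0.25, 0.18]       # hypothetical log relative risks from three studies
std_errors = [0.08, 0.12, 0.05]
est, se = pool_fixed_effect(effects, std_errors)
print(f"pooled estimate: {est:.3f}, SE: {se:.3f}")
```

Note that the pooled standard error (about 0.04) is smaller than that of even the most precise single study - which is exactly why pooling buys statistical power.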

Come January 1st 2011, as many of us make that regrettable transition from reveller to dieter, we might wonder what meta-analyses in 2010 had to say about the harms of being overweight. Perhaps researchers would find that fat has been given a bad rap by associating with bad company - poverty and inactivity. The poor, and those who take no exercise, tend to be overweight and to die young; but that might not be cause and effect - it might be the poverty and the inactivity causing the early deaths, not the fatness. Can we - hope against hope - reasonably believe that a few extra pounds are not in themselves a health concern?

[Image: Duck with stuffing, roast pork with crackling, potatoes fried in caramel, sweet and sour red cabbage and gravy - Christmas dinner. Photo: Malene Thyssen.]

As if in anticipation of this New Year predicament, in this month's issue of the New England Journal of Medicine Berrington de Gonzalez and colleagues reported a pooled analysis of 19 prospective cohort studies - a combined 1.46 million white adults - looking specifically at the association between survival and body weight. With such a large data set, their statistical tools (technically, Cox proportional hazards analysis) could maintain high precision for estimates of the risks associated with body-mass index while accounting for age, smoking, physical activity and other socio-demographic variables. Based on the pooled analysis, adjusted for all those variables, Berrington de Gonzalez and colleagues found the characteristic J-shaped relationship between mortality and BMI, showing that too few or too many pounds can be a bad thing. The dip of this curve - the section of it where people live longest - was in the BMI range of 20-24. (A BMI of 25 or more is usually classified as overweight.) For a woman 5 foot 6 inches (1.7 m) tall and weighing 140 lb (63.5 kg), which corresponds to a BMI of 22, a gain of 20 lb (9 kg) was associated with a 10% increase in the risk of death from any cause over the following ten years; the increase was 20% for a gain of 35 lb (16 kg). So being overweight does indeed increase your risk of dying. Diet resolutions might not be well liked, but they might be well to live by.
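The BMI arithmetic in that example is easy to check for yourself: BMI is weight in kilograms divided by the square of height in metres. A quick sketch, using the figures quoted in the text:

```python
# Checking the BMI arithmetic from the example in the text.
def bmi(weight_kg, height_m):
    """Body-mass index: weight in kilograms divided by height in metres squared."""
    return weight_kg / height_m ** 2

height = 1.7                          # 5 ft 6 in, as in the text
print(round(bmi(63.5, height), 1))    # 140 lb -> 22.0
print(round(bmi(63.5 + 9.0, height), 1))  # after a 20 lb (9 kg) gain -> 25.1
```

A 9 kg gain moves our hypothetical woman from the longest-lived BMI band (20-24) into the overweight range - which is where the 10% extra mortality risk comes in.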

10. Words, words, words. For those who thought a profession in statistics would mean never having to read Shakespeare, be advised: words are the new data. As with so many areas of research, the information age is fundamentally changing how scholars study human creativity of the past and present. Like the DNA sequencing advances that have made it possible for biologists to conduct whole-genome studies, book digitizing projects that have turned millions of texts into searchable, parsable forms - ideal for data analysis - have given literary scholars their own entryway into high-throughput science. As the company at the forefront of these digitizing efforts, with 12 million books scanned in more than 400 languages, Google is unsurprisingly encouraging humanists to revolutionize their qualitative mindset by trying out quantitative approaches to the study of the human arts. This year the company awarded 12 one-year grants for projects that will wed the humanistic and computational sciences. Most of these efforts will make use of the data from Google's Book Digitization Project to build new tools to support future language and literary studies.

[Image: Title page of the First Folio of Shakespeare's plays, 1623. Wikimedia Commons.]

An early star in the inchoate discipline of the Digital Humanities is the Culturomics project. Led by a group of Harvard investigators and the Google Books Team, this data collection endeavor has extracted each word from 5.2 million books written in seven languages, spanning the 16th century to the present; the sample represents approximately 4% of all books ever published. Even a swift reader, tackling only the English-language texts from the year 2000, would require some 80 uninterrupted years to get through them.

With such a rich, cross-cultural source of human language and thought, the directions for inquiry are fascinating. What is the size of the English lexicon? Are the French more wordy than the British? How long does it take a word to enter a dictionary? Does this time depend on the word's popularity? What percentage of a language's lexicon does a dictionary miss? The principal ambition of the project is to use the corpus to better understand culture. The Culturomics researchers believe that sweeping insights about the humanities are waiting to be discovered. What can the corpus tell us about when Western societies secularized? Can we learn whether modern societies are more violent? Or more creative? Can analysis of the corpus show statisticians how their science influenced popular thought of the 20th century? And who was most responsible for this change?

[Graph: use of the word 'significant' in scholarly literature over time.]

A pressing issue in realizing the potential of the Culturomics project is the fact that the 4% sample was not a random sample. Statisticians could have an important role in identifying the potential biases of the collected texts and in developing sampling and analytic methods to diminish the influence of those biases on estimates made in future studies. For those eager to make their contribution, the 1- to 5-gram frequency tables of the corpus (an n-gram being a contiguous sequence of n words) are ready to be mined.
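As a toy illustration of what such frequency tables contain, here is a sketch that builds 1- to 5-gram counts from a scrap of text. The real Culturomics tables are, of course, tabulated by year across millions of volumes; the function name and the snippet of Hamlet here are purely illustrative.

```python
# Toy sketch: building 1- to 5-gram frequency counts from raw text,
# the same kind of table the Culturomics corpus exposes at scale.
from collections import Counter

def ngram_counts(text, n_max=5):
    """Count all contiguous word n-grams of length 1..n_max."""
    words = text.lower().split()
    counts = Counter()
    for n in range(1, n_max + 1):
        for i in range(len(words) - n + 1):
            counts[" ".join(words[i:i + n])] += 1
    return counts

counts = ngram_counts("to be or not to be")
print(counts["to be"])               # the 2-gram 'to be' appears twice
print(counts["to be or not to be"])  # the full 5-gram... wait, 6 words: counted only up to 5-grams
```

Run against a whole corpus and indexed by year, tables like this are what let researchers plot the rise and fall of a word or phrase across centuries.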

Catch up with parts 1, 2, 3, 4 of our Top Ten News Stories of 2010.
