Volume 35 Number 2 Winter 2013

### In this Issue

#### Features

- The Nugent Years
- Inspector Gadget
- Hip Hop Sure Shot
- Everlasting Speech

#### The Editor's Page

- Letters to the Editor
- The Not-So-Super Market

#### Along Middle Path

- Gambier is Talking About...
- Test your KQ
- Late Night Nosh
- Anatomy of an Athlete
- Kenyon Now
- Pass/Fail
- Margin of Error
- Kenyon in Quotes
- The Hot Sheet

#### Books

- Woody at 100: The Woody Guthrie Centennial Collection
- Recent Books by Kenyon Authors

#### Office Hours

- Internalize This
- Trees of Life

#### Alumni News

- Class Notes
- Sight Unseen
- See How She Runs
- Obituaries
- Alumni Digest

#### The Last Page

- As technology advances...how could it affect the evolution of species? Let's take a look ahead...

## Internalize This

*by Dan Laskin*


The crackers are part of a day-one exercise in sampling, which will become a key concept in the course, because, in this data-saturated, data-dependent world, so many important studies hinge on how and whether your samples (of patients in a clinical trial, for example, or defective pipe welds in a nuclear power plant, or likely voters in an election) allow you to draw conclusions that hold for whole populations (i.e., all patients, all welds, all voters).
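
The idea can be sketched in a few lines of code; the population of pipe welds below is entirely hypothetical, invented for illustration:

```python
import random

random.seed(1)

# Hypothetical population: 10,000 pipe welds, 3 percent of them defective.
population = [1] * 300 + [0] * 9_700
random.shuffle(population)

# A simple random sample of 500 welds stands in for the whole population.
sample = random.sample(population, 500)
estimate = sum(sample) / len(sample)
print(f"defect rate in sample: {estimate:.3f} (true rate in population: 0.030)")
```

Because the sample is drawn at random, the estimate lands close to the true 3 percent without anyone inspecting all 10,000 welds.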

As for Greek: well, the students don't recite Homer, but, as they embark on epic number-crunching routines and learn to spot the Achilles heel in a survey design, they become adept with symbols like μ (mu, for the mean of a population) and σ (sigma, for the standard deviation of a population). Throw in terms like x-bar, z-score, and p-value, and a typical class discussion on the third floor of Hayes Hall may sound like a foreign language.
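
In code (a Python sketch, rather than the Minitab the course uses), those symbols map onto just a few lines; the exam scores here are made up for illustration:

```python
import statistics

# Hypothetical exam scores, invented for this example.
scores = [72, 85, 90, 66, 78, 95, 88, 74, 81, 91]

x_bar = statistics.mean(scores)   # x-bar: the sample mean
s = statistics.stdev(scores)      # s: the sample standard deviation

# A z-score says how many standard deviations a value sits from the mean.
z = (95 - x_bar) / s
print(f"x-bar = {x_bar:.1f}, s = {s:.1f}, z-score of a 95 = {z:.2f}")
```

A z-score near 1.4, as here, marks the 95 as well above average but not an extreme outlier.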

In MATH 106, the grammar and syntax of statistics come at you fast. Snipes projects a buoyant, friendly energy as she fills the white boards on both sides of the room, tosses out puns, widens her eyes in encouragement, then stops and squints to make sure everyone is with her. In almost every class, the lecture gives way to a hands-on exercise using real data sets; the Math Department stockpiles them on the campus network, and the classroom is equipped with computers for all of the students, who become proficient with Minitab, a powerful statistical software package.

They learn to display and dissect data in dotplots, scatterplots, boxplots, and histograms. They learn to pick up insights by eyeballing graphs, noting the center, spread, and shape of data. They develop an instinct for skepticism, ferreting out "lurking" or "confounding" variables. Essential ideas soon take hold: for instance, standard deviation, which shows how data disperse from the mean, and normal distribution, which is the famous (and immensely useful) bell curve. By the time Snipes arrives at the Central Limit Theorem, she has already enthused about the wonders of the Empirical Rule, which tells you how data are distributed under the bell curve. (About 68 percent of your data lie within one standard deviation of the mean, about 95 percent within two standard deviations, and 99.7 percent within three.)
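
A quick simulation (again a Python sketch, not the course's Minitab) shows the Empirical Rule emerging from nothing but random draws off a bell curve:

```python
import random

random.seed(0)

# Draw 100,000 values from a standard normal distribution (mu = 0, sigma = 1).
data = [random.gauss(0, 1) for _ in range(100_000)]

# Fraction of the data within 1, 2, and 3 standard deviations of the mean.
fractions = {k: sum(abs(x) <= k for x in data) / len(data) for k in (1, 2, 3)}
for k, frac in fractions.items():
    print(f"within {k} standard deviation(s): {frac:.3f}")
```

The printed fractions land near 0.68, 0.95, and 0.997, matching the rule.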

What's so central about the Central Limit Theorem (CLT)? It goes to a fundamental reality underlying statistical analysis: that studies, whether of Goldfish crackers or pipe welds, almost invariably rely on samples. In essence, the CLT shows that if your sample size is large enough, you can reach some firm conclusions about the whole population.

Briefly, the CLT says that, no matter how the numeric values in a population are actually distributed (even when they're not "normal"), if you take large enough samples, the distribution of the sample means will approach a normal distribution: the bell curve. "That's fantastic," says Snipes, "because we know a lot about normal curves."

The more data you are able to collect, the better. But it's not always possible to collect huge samples, which is why the CLT is such good news. If your sample is random, relatively small sample sizes (thirty is enough for many important situations) will yield useful results.
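
That claim is easy to test in a simulation; the sketch below uses an exponential population, just one convenient example of a decidedly non-normal shape:

```python
import random
import statistics

random.seed(42)

# A right-skewed, non-normal population: exponential,
# with mean 1.0 and standard deviation 1.0.
n = 30  # the "thirty is enough" rule of thumb
sample_means = [
    statistics.mean(random.expovariate(1.0) for _ in range(n))
    for _ in range(5_000)
]

# The CLT predicts the means cluster around 1.0 with spread sigma / sqrt(n).
print(f"mean of the sample means:   {statistics.mean(sample_means):.3f}")
print(f"spread of the sample means: {statistics.stdev(sample_means):.3f}")
```

A histogram of `sample_means` would look close to a bell curve centered on 1.0, even though the population it was drawn from is badly skewed.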

With this theorem in their toolkit, researchers can design experiments incorporating "confidence intervals" (margins of error) and "hypothesis tests," principles that lie at the heart of all statistical studies and that make it possible to determine whether results are significant (or whether they could have arisen just by chance). The CLT allows you to "leverage statistical information," as Snipes puts it. "If you can really internalize this," she tells her students, "you should be happy for the rest of the course."
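
The familiar polling margin of error, for instance, falls out of the CLT directly; the poll numbers below are invented for illustration:

```python
import math

# Hypothetical poll: 540 of 1,000 likely voters favor a candidate.
n, in_favor = 1_000, 540
p_hat = in_favor / n  # the sample proportion

# 95% confidence interval: the z-value 1.96 captures the middle
# 95 percent of the bell curve that the CLT guarantees.
margin = 1.96 * math.sqrt(p_hat * (1 - p_hat) / n)
print(f"{p_hat:.0%} favor, margin of error ±{margin:.1%}")
```

So the poll would be reported as 54 percent in favor, plus or minus about 3 percentage points.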

Take big enough samples and tell the world about everything from drug safety to unemployment trends. That's worth a whole lot of Goldfish. -Dan Laskin
