## RANDOM SAMPLING

On Wednesday, we talked about sample bias, or ways to really screw up the results of a survey or study. So how can researchers avoid this problem? By being random.

There are several kinds of samples, from simple random samples to convenience samples, and the type that is chosen determines the reliability of the data. The more random the selection, the more reliable the results. Here’s a rundown of several different types:

Simple Random Sample: The most reliable option, the simple random sample works well because each member of the population has the same chance of being selected. There are several different ways to select the sample — from a lottery to a number table to computer-generated values. The values can be replaced for a second possible selection or each selection can be held out, so that there are no duplicate selections.
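If you'd like to see the "computer-generated values" option in action, here is a minimal Python sketch. The population of 100 ID numbers and the sample size of 10 are made up for illustration; the two calls show the difference between holding each selection out and putting it back for a possible second selection.

```python
import random

# Hypothetical population: ID numbers 1 through 100.
population = list(range(1, 101))
rng = random.Random(42)  # fixed seed so the draw can be repeated

# Without replacement: each selection is held out, so there are no duplicates.
without_replacement = rng.sample(population, k=10)

# With replacement: each selection goes back in, so a member can be drawn twice.
with_replacement = rng.choices(population, k=10)

print("Held out:", without_replacement)
print("Replaced:", with_replacement)
```

Either way, every member of the population has the same chance of being picked on each draw, which is the whole point.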

Stratified Sample: In some cases it makes sense to divide the population into subgroups and then conduct a random sample of each subgroup. This method helps researchers highlight a particular subgroup in a sample, which can be useful when observing the relationship between two or more subgroups. The number of members selected from each subgroup must match that subgroup’s representation in the larger population.

What the heck does that mean? Let’s say a researcher is studying glaucoma progression and eye color. If 25% of the population has blue eyes, 25% of the sample must also. If 40% of the population has brown eyes, so must 40% of the sample. Otherwise, the conclusions may be unreliable, because the samples do not reflect the entire population.
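Here is one way the eye-color example could be sketched in Python. The population of 1,000 people, the "other" group and the function name `stratified_sample` are all assumptions for illustration, not part of any real study.

```python
import random

def stratified_sample(groups, sample_size, seed=0):
    """Draw a random sample from each subgroup, sized to match its share."""
    rng = random.Random(seed)
    total = sum(len(members) for members in groups.values())
    sample = {}
    for name, members in groups.items():
        # The subgroup's slice of the sample mirrors its slice of the population.
        quota = round(sample_size * len(members) / total)
        sample[name] = rng.sample(members, quota)
    return sample

# Hypothetical population of 1,000 people, grouped by eye color.
groups = {
    "blue": [f"blue-{i}" for i in range(250)],    # 25% of the population
    "brown": [f"brown-{i}" for i in range(400)],  # 40% of the population
    "other": [f"other-{i}" for i in range(350)],  # 35% of the population
}

sample = stratified_sample(groups, sample_size=100)
counts = {name: len(members) for name, members in sample.items()}
print(counts)
```

With these numbers the sample of 100 holds 25 blue-eyed, 40 brown-eyed and 35 other members. When the shares don't divide so neatly, the rounding can leave the total a person or two off, so a real design would need a rule for that.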

Then there are the samples that don’t provide such reliable results:

Quota Sample: In this scenario, the researcher deliberately sets a quota for a certain stratum. When done honestly, this allows for representation of minority groups of the population. But it does mean that the sample is no longer random. For example, if you wanted to know how elementary-school teachers feel about a new dress code developed by the school district, a random sample may not include any male teachers, because there are so few of them. However, requiring that a certain number of male teachers be included in the sample ensures that male teachers are represented, even though the sample is no longer random.

Purposeful Sample: When it’s difficult to identify members of a population, researchers may include any member who is available. And when those already selected for the sample recommend other members, this is called a Snowball Sample. While this type is not random, it is a way to look at more invisible issues, including sexual assault and illness.

Convenience Sample: When you’re looking for quick and dirty, a convenience sample is it. Remember when survey companies stalked folks at the mall? That’s a convenience or accidental sample. These depend on someone being at the right (wrong?) place at the right (wrong?) time. When people volunteer for a sample, that’s also a convenience sample.

So whenever you’re looking at data, consider how the sample was formed. If the results look funny, it could be because the sample was off.

On Monday, I’ll tackle sample size (something that I had hoped to include today, but didn’t get to). Meantime, if you have questions about how sampling is done, ask away!

Most of you are probably sick to death of 2012 campaign poll results. But these numbers have become a mainstay of the American political process. In other words, we’re stuck with them, so you might as well get used to it — or at least understand the process as well as you can.

Last Friday, I wrote about how the national polls really don’t matter. That’s because our presidential elections depend on the Electoral College. We certainly don’t want to see one candidate win the popular vote, while the other wins the Electoral College, but it’s those electoral votes that really matter.

Still, polls matter too. I know, I know. Statistics can be created to support *any* cause or person. And that’s true. (Mark Twain popularized the saying, “There are lies, damned lies and statistics.”) But good statistics are good statistics. These results are only as reliable as the process that created them.

But what is that process? If it’s been a while since you took a stats course, here’s a quick refresher. You can put it to use tomorrow, when the media uses exit polls to predict election and referendum results before the polls close.

Random Sampling

If I wanted to know how my neighbors were voting in this year’s election, I could simply ask each of them. But surveying the population of an entire state — or all of the more than 200 million eligible voters in the U.S. — is downright impossible. So political pollsters depend on a tried-and-true method of gathering reliable information: random sampling.

A random sample does give a good snapshot of a population — but it may seem a bit mysterious. There are two obvious parts: random and sample.

The amazing thing about a sample is this: when it’s done properly (and I’ll get to that in a minute) the sample does accurately represent the entire population. The most common analogy is the basic blood draw. I’ve got a wonky thyroid, so several times a year, I need to check to see that my medication is keeping me healthy, which is determined by a quick look at my blood. Does the phlebotomist take all of my blood? Nope. Just a sample is enough to make the diagnosis.

The same thing is true with population samples. And in fact, there’s a magic number that works well enough for most situations: 1,000. (This is probably the hardest thing to believe, but it’s true!) For the most part, researchers are happy with a 95% confidence level and a ±3% margin of error. This means that if the same survey were run many times, about 95% of the results would land within 3 percentage points of the true value for the whole population. (More on that later.) According to the math, reaching that level of precision takes only about 1,000 respondents.
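Where does 1,000 come from? A rough sketch of the math, using the standard textbook formula for the margin of error of a proportion in a simple random sample. The function name is mine, and the defaults (z = 1.96 for 95% confidence, p = 0.5 as the worst case) are the usual assumptions rather than anything a particular pollster has published.

```python
import math

def margin_of_error(n, z=1.96, p=0.5):
    """Approximate margin of error for a proportion from a simple random sample.

    n: sample size
    z: z-score for the confidence level (1.96 corresponds to 95%)
    p: assumed proportion; 0.5 gives the widest (most cautious) margin
    """
    return z * math.sqrt(p * (1 - p) / n)

for n in (100, 500, 1000, 2000):
    print(f"n = {n:>5}: about ±{100 * margin_of_error(n):.1f} points")
```

At 1,000 respondents the margin comes out to about ±3.1 points. Doubling the sample to 2,000 only trims it to about ±2.2, which is why pollsters rarely bother going much bigger.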

So we’re looking at surveying at least 1,000 people, right? But it’s not good enough to go door-to-door in one neighborhood to find these people. The next important feature is randomness.

If you put your hand in a jar full of marbles and pull one marble out, you’ve randomly selected that marble. That’s the task that pollsters have when choosing people to respond to their questions. And it’s not as hard as you might think.

Let’s take exit polls on Election Day. These are short surveys conducted at the voting polls themselves. As people exit the polling place, pollsters stop certain voters to ask a series of questions. The answers to these questions can predict how the election will end up and what influenced voters to vote a certain way.

The enemy of good polling is homogeneity. If only senior citizens who live in wealthy areas of a state are polled, well, the results will not be reliable. But randomness irons all of this out.

First, the polling place must be random. Imagine writing down the locations of all of the polling places in your state on little strips of paper. Then put all of these papers into a bowl, reach in and choose one. That’s the basic process, though this is done with computer programs now.

Then the polling times must be well represented. If a pollster only surveys people who voted in the morning, the results could be skewed toward people who vote on their way home from a night shift, who don’t work at all or who are early risers, right? So, care is taken to survey people at all times of the day.

And finally, it’s important to randomly select people to interview. Most often, this can be done by simply approaching every third voter who exits the polling place (or every other voter or every fifth voter; you get my drift).
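Two of these steps, drawing polling places from the bowl and stopping every third voter, can be sketched in a few lines of Python. The precinct names, the voter list and the `every_kth` helper are hypothetical, and starting the count at a random voter is my own assumption about how to keep the first pick from being predictable.

```python
import random

rng = random.Random(7)  # fixed seed so the example can be repeated

# Step 1: the "strips of paper in a bowl," done by computer.
polling_places = [f"Precinct {i}" for i in range(1, 201)]  # hypothetical list
chosen_places = rng.sample(polling_places, k=5)

# Step 2: approach every k-th voter leaving the polling place.
def every_kth(voters, k, rng):
    """Pick a random starting voter among the first k, then every k-th after."""
    start = rng.randrange(k)
    return voters[start::k]

voters = [f"voter-{i}" for i in range(1, 31)]  # 30 hypothetical voters
interviewed = every_kth(voters, 3, rng)

print(chosen_places)
print(interviewed)
```

Out of 30 voters, every third one gives the pollster 10 interviews, and nobody gets to choose who looks friendly enough to approach.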

Questions

But the questions being asked — or I should say the ways in which the questions are asked — are at least as important. These should not be “leading questions,” or queries that might prompt a particular response. Here’s an example:

Same-sex marriage is threatening to undermine religious liberty in our country. How do you plan to vote on Question 6, which legalizes same-sex marriage in the state?

(It’s easier to write a leading question about how someone intends to vote than a leading exit-poll question.)

Questions must be worded so that they elicit the most reliable responses. When they are confusing or leading, the results cannot be trusted. Simplicity is almost always the best policy here.

Interpreting the Data

It’s not enough to just collect information. No survey results are 100 percent reliable 100 percent of the time. In fact there are “disclaimers” for every single survey result. First of all, there’s the confidence level, which is generally 95%. This means roughly what you might think: if the same survey were repeated many times with new random samples, about 95 percent of those surveys would produce results within the margin of error of the true value. Specifically, a 95% confidence interval covers 95 percent of the normal (or bell-shaped) curve.

The larger the random sample, the smaller the margin of error at a given confidence level. The smaller the sample, the larger the margin of error. The confidence level itself, usually 95%, is chosen by the researcher; it does not grow with the sample.

But why 95%? The answer has to do with standard deviation, or how much variation (deviation) there is from the mean or average of the data. When the data is normally distributed (that is, it follows the normal or bell curve), about 95% of it falls within plus or minus two standard deviations of the mean.
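You can check the two-standard-deviations rule yourself with Python's built-in `statistics.NormalDist` (available in Python 3.8 and later). The `coverage` helper is just a name I made up for this sketch.

```python
from statistics import NormalDist

std_normal = NormalDist()  # the standard bell curve: mean 0, standard deviation 1

def coverage(k):
    """Share of a normal curve lying within k standard deviations of the mean."""
    return std_normal.cdf(k) - std_normal.cdf(-k)

for k in (1, 2, 3):
    print(f"Within ±{k} standard deviations: {coverage(k):.1%}")

# Working backward: how many standard deviations cover exactly 95%?
z_95 = std_normal.inv_cdf(0.975)
print(f"Exactly 95% falls within ±{z_95:.2f} standard deviations")
```

Two standard deviations actually cover a bit more than 95% (about 95.4%). The exact figure for 95% is about 1.96 standard deviations, which is the number statisticians plug into their formulas.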

This isn’t the same thing as margin of error, which is the range above and below a reported result where the true value is likely to fall.

Let’s say exit polls show that Governor Romney is leading President Obama in Ohio by 2.5 percentage points. If the margin of error is 3%, Romney’s lead is within the margin of error. And therefore, the results are really a statistical tie. However, if he’s leading by 8 percentage points, it’s more likely the results are showing a true lead.
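That rule of thumb is simple enough to write as a tiny Python function. The function name is hypothetical, and the comparison follows the simplified rule used in this post; since each candidate's share carries its own margin of error, stricter analyses compare the lead against a larger margin.

```python
def is_statistical_tie(lead, margin_of_error):
    """Rule of thumb from this post: a lead no bigger than the margin of error is a tie.

    Both arguments are in percentage points.
    """
    return abs(lead) <= margin_of_error

# The two scenarios described above, with a ±3-point margin of error.
for lead in (2.5, 8.0):
    verdict = "statistical tie" if is_statistical_tie(lead, 3.0) else "likely a real lead"
    print(f"Lead of {lead} points: {verdict}")
```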

Of course all of that depends — heavily — on the sampling and questions. If either or both of those are suspect, it doesn’t matter what the polling shows. We cannot trust the numbers. Unfortunately, we often don’t know how the samples were created or the questions were asked. Reliable statistics will include that information somewhere. And of course you should only trust stats from sources that you can trust.

Summary

In short, there are three critical numbers in most reliable survey results:

• 1,000 (sample size)
• 95% (confidence interval or level)
• ±3% (margin of error)

Look for these in the exit polling you hear about tomorrow. Compare the exit polls with the actual election results. Which polls turned out to be most reliable?

I’m not a statistician, but I’d be happy to answer your questions or find an expert who can. Ask away!

P.S. I hope every single one of my U.S. readers (who are registered voters) will participate in our democratic process. Please don’t throw away your right to elect the people who make decisions on your behalf. VOTE!