The Sampling Distribution

We have been discussing some concepts that imply a random sample, or a random selection from a distribution, and so forth. We need to be more precise, now, about what we mean by random in these cases.

We define a random sample as one in which every element in the population has an equal, nonzero chance of being selected into the sample.
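To make the definition concrete, here is a minimal sketch of drawing a simple random sample in Python. The population of 1,000 numbered elements, the sample size of 50, and the seed are all hypothetical choices made for illustration.

```python
import random

# Hypothetical population: 1,000 element IDs (illustrative only).
population = list(range(1000))

rng = random.Random(42)  # fixed seed so the draw is repeatable

# random.sample draws without replacement; every element has the
# same chance (n / N) of ending up in the sample.
sample = rng.sample(population, k=50)

inclusion_probability = len(sample) / len(population)
print(sorted(sample)[:5], inclusion_probability)
```

Each element's chance of selection here is 50 / 1000, the same for every element and greater than zero, which is exactly what the definition requires.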

Random sampling is usually the best way of generating a sample that is representative of its population. It does not guarantee such an outcome, because of sampling error, which we will discuss later.

It may seem counter-intuitive, at first, that random sampling is a better strategy than purposive sampling (where one goes out and looks for people representing the larger population). The key to its success is that every element has an equal chance of being selected, so the researcher's own judgment cannot systematically favor some kinds of cases over others.

In practice, we sometimes modify random sampling, combining it with other strategies designed to target specific groups in the population. When a population includes a small minority group, we often have to oversample for that group. Sometimes we identify important characteristics that define the population and stratify the sample along them. Both of these techniques, though, still rely on random sampling to select elements from subgroups.
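The combination of stratifying and oversampling can be sketched as follows. This is only an illustration under assumed numbers: a hypothetical population of 950 majority-group and 50 minority-group members, and a helper we have named `stratified_sample` for the purpose. Note that selection within each subgroup is still a simple random draw.

```python
import random

rng = random.Random(7)  # fixed seed so the draw is repeatable

# Hypothetical population: 950 majority-group and 50 minority-group members.
population = (
    [("majority", i) for i in range(950)]
    + [("minority", i) for i in range(50)]
)

def stratified_sample(pop, sizes, rng):
    """Draw a simple random sample of sizes[group] cases from each stratum."""
    sample = []
    for group, n in sizes.items():
        stratum = [p for p in pop if p[0] == group]
        sample.extend(rng.sample(stratum, n))  # random selection within stratum
    return sample

# Oversample the minority group: 25 of 100 sampled cases (25%), even though
# the group makes up only 5% of the population.
sample = stratified_sample(population, {"majority": 75, "minority": 25}, rng)
minority_share = sum(1 for group, _ in sample if group == "minority") / len(sample)
print(minority_share)  # 0.25
```

Because the minority group is deliberately overrepresented, estimates for the whole population would normally weight the cases back to their true population shares.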

Random sampling allows the researcher to generalize characteristics of the sample to the population. This kind of inference is based on the well-studied sampling distribution.

Properties of the Sampling Distribution
The most important topic in the introductory statistics course is the logic of inference, and this begins with the sampling distribution.

Illustration of how probability samples are selected from a population.

Imagine some population with a particular mean, mu, and standard deviation, sigma, on some variable. We know that scores vary around the mean, some larger and some smaller. Roughly speaking, a typical score differs from mu by about sigma.

If we take a random sample, we can calculate its mean, x-bar.

Now, imagine that we take repeated random samples from this population, and calculate the mean for each.

Illustration of how a sampling distribution is derived.

We can take these sample means and generate a frequency distribution. We can define the mean of this distribution, X-bar sub-x-bar, and its standard deviation, sigma sub-x-bar.

Most of the sample means will be relatively close to the population mean. Some will be larger and some smaller. On average, sample means will differ from X-bar sub-x-bar by sigma sub-x-bar.

If we were to draw a large number of samples, each of them reasonably large, the frequency distribution of the sample means would approximate the normal curve. This allows us to take advantage of the properties of the normal curve.
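We can watch this happen in a small simulation. The sketch below assumes a deliberately skewed hypothetical population (exponentially distributed scores with a mean of about 50), samples of 100 cases, and 5,000 repeated samples; none of these numbers come from a real study. If the sample means are approximately normal, about 68% of them should fall within one standard deviation of their own mean, and about 95% within two.

```python
import random
import statistics

rng = random.Random(0)  # fixed seed so the simulation is repeatable

# Hypothetical skewed population of 100,000 scores (exponential, mean about 50).
population = [rng.expovariate(1 / 50) for _ in range(100_000)]

n = 100              # size of each sample
num_samples = 5_000  # number of repeated samples

# Draw repeated random samples and record the mean of each one.
sample_means = [
    statistics.fmean(rng.sample(population, n)) for _ in range(num_samples)
]

center = statistics.fmean(sample_means)   # mean of the sample means
spread = statistics.pstdev(sample_means)  # their standard deviation

# Share of sample means within one and two standard deviations of the center.
within_one = sum(abs(m - center) <= spread for m in sample_means) / num_samples
within_two = sum(abs(m - center) <= 2 * spread for m in sample_means) / num_samples
print(round(within_one, 2), round(within_two, 2))
```

Even though the individual scores are far from normally distributed, the sample means pile up symmetrically around the population mean.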

These are the things you want to remember about the sampling distribution:

A. for relatively large samples, the sampling distribution approximates the normal curve (and the frequency distribution of sample means comes closer to it as the number of samples grows);

B. the mean of the sampling distribution equals the population mean;
Mean of sample means approaches the population mean.

C. the standard deviation of the sampling distribution is less than the standard deviation of the population (for samples of more than one case). We call the standard deviation of the sampling distribution the standard error.
The standard error is less than the population standard deviation.
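Properties B and C can be checked numerically. The following is a sketch under assumed values: a hypothetical population of 50,000 scores with a mean near 100 and a standard deviation near 15, samples of 25 cases, and 4,000 repeated samples. It also compares the simulated standard error with the textbook value, sigma divided by the square root of the sample size.

```python
import math
import random
import statistics

rng = random.Random(1)  # fixed seed so the simulation is repeatable

# Hypothetical population: 50,000 scores, mean about 100, SD about 15.
population = [rng.gauss(100, 15) for _ in range(50_000)]
mu = statistics.fmean(population)
sigma = statistics.pstdev(population)

n = 25  # cases per sample
sample_means = [
    statistics.fmean(rng.sample(population, n)) for _ in range(4_000)
]

mean_of_means = statistics.fmean(sample_means)    # property B: close to mu
standard_error = statistics.pstdev(sample_means)  # property C: less than sigma
expected_se = sigma / math.sqrt(n)                # textbook standard error

print(f"mu = {mu:.2f}, mean of sample means = {mean_of_means:.2f}")
print(f"sigma = {sigma:.2f}, simulated SE = {standard_error:.2f}, "
      f"sigma / sqrt(n) = {expected_se:.2f}")
```

With samples of 25 cases, the standard error should come out at roughly one fifth of the population standard deviation.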

In other words, samples are more alike than cases. Think about why this is. In the population, the presence of extreme scores (above or below the mean) makes the standard deviation larger. When we take probability samples from that population and compute each sample mean, some of those extreme cases might be selected, but they are balanced in the sample by less extreme scores. The probability of drawing a sample with only extreme cases (as we know from the multiplication rule) is very low. For this reason, the standard deviation of the sampling distribution, the standard error, tends to be smaller. Samples tend to be more alike than cases.
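The multiplication rule makes the point quickly. As a sketch, suppose (purely for illustration) that 10% of cases in the population count as extreme, and that draws are independent, which is a reasonable approximation when the population is much larger than the sample. The function name `prob_all_extreme` is our own.

```python
def prob_all_extreme(p_extreme, n):
    """Probability that all n independently drawn cases are extreme."""
    return p_extreme ** n  # multiplication rule for independent events

p_extreme = 0.10  # assumed share of extreme cases in the population

for n in (1, 2, 5, 10):
    print(n, prob_all_extreme(p_extreme, n))
```

With only five cases the probability is already about one in a hundred thousand, so a sample mean built entirely from extreme scores almost never occurs.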

The Z-Test
Since the sampling distribution has the characteristics of the normal curve, we can generalize the notion of a standard score. We can calculate a z-score for a sample mean.

Formula for the z-score:

\[ z = \frac{\bar{x} - \mu}{\sigma_{\bar{x}}} \]

This z-score tells us the distance between the sample mean and the population mean, in standard error units.

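The denominator of the z-score formula is defined as the standard error and is calculated thus:

\[ \sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}} \]

where sigma is the population standard deviation and n is the number of cases in the sample.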
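A minimal sketch of the two calculations follows, using the usual definitions: the standard error is sigma divided by the square root of the sample size, and z is the difference between the sample mean and mu divided by the standard error. The function names and the example values (mu = 100, sigma = 15, a sample of 36 cases with a mean of 105) are hypothetical.

```python
import math

def standard_error(sigma, n):
    """Standard error of the mean: sigma / sqrt(n)."""
    return sigma / math.sqrt(n)

def z_score(sample_mean, mu, sigma, n):
    """Distance of a sample mean from mu, in standard error units."""
    return (sample_mean - mu) / standard_error(sigma, n)

# Hypothetical example: mu = 100, sigma = 15, n = 36, sample mean = 105.
se = standard_error(15, 36)    # 15 / 6 = 2.5
z = z_score(105, 100, 15, 36)  # (105 - 100) / 2.5 = 2.0
print(se, z)  # 2.5 2.0
```

In this example the sample mean lies two standard errors above the population mean.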

We can mark off the area under the curve in standard error units. We can think of the area under the curve having a range of about six standard errors — just as the normal curve has a range of about six standard deviations.
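The areas themselves can be computed rather than read from a table. This sketch uses `NormalDist` from Python's standard `statistics` module; the helper name `area_within` is our own.

```python
from statistics import NormalDist

standard_normal = NormalDist()  # mean 0, standard deviation 1

def area_within(k):
    """Area under the normal curve within k standard errors of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

for k in (1, 2, 3):
    print(k, round(area_within(k), 4))
# 1 0.6827
# 2 0.9545
# 3 0.9973
```

Three standard errors on either side of the mean, six in all, cover about 99.7% of the area, which is why we can treat the curve as having a range of about six standard errors.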

We can use this knowledge to estimate, for example, the probability of drawing a sample with a specific mean, or larger, from a population whose mean and standard deviation we know.
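As a sketch of such an estimate, suppose a hypothetical population with mu = 100 and sigma = 15, and ask how likely it is that a random sample of 36 cases has a mean of 105 or larger. The function name `prob_mean_at_least` is our own, and the calculation assumes the sampling distribution is close enough to normal.

```python
import math
from statistics import NormalDist

def prob_mean_at_least(sample_mean, mu, sigma, n):
    """Probability of a sample mean this large or larger, assuming normality."""
    se = sigma / math.sqrt(n)       # standard error
    z = (sample_mean - mu) / se     # z-score of the sample mean
    return 1 - NormalDist().cdf(z)  # area under the curve above z

# Hypothetical example: z = (105 - 100) / 2.5 = 2.0
p = prob_mean_at_least(105, 100, 15, 36)
print(round(p, 4))  # 0.0228
```

A sample mean that far above the population mean would turn up in only about 2 of every 100 random samples.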

Author: Timothy Shortell, Ph.D., Professor & Chair, Department of Sociology, Brooklyn College CUNY