Random Sampling

The outcome of a statistical experiment may be recorded either as a numerical value or as a descriptive representation. For example, when a pair of dice is tossed and the total is the outcome of interest, we record a numerical value. However, if the students of a certain school are given blood tests and the type of blood is of interest, then a descriptive representation might be more useful. A person’s blood can be classified in 8 ways: AB, A, B, or O, each with a plus or minus sign, depending on the presence or absence of the antigen.

In this chapter, we focus on sampling from distributions or populations and study such important quantities as the sample mean and sample variance, which will be of vital importance in future chapters. In addition, we introduce the role that the sample mean and variance will play in statistical inference. The use of modern high-speed computers allows the scientist or engineer to greatly enhance their use of formal statistical inference with graphical techniques. Much of the time, formal inference appears quite dry and perhaps even abstract to the practitioner or to the manager who wishes to let statistical analysis be a guide to decision-making.

Populations and Samples

We begin by discussing the notions of populations and samples. Both are mentioned in a broad fashion in the introductory chapter, but more detail is needed here, particularly in the context of random variables.

Definition:

A population consists of the totality of the observations with which we are concerned. The number of observations in the population is called the size of the population.

Examples:

  • The numbers on the cards in a deck, the heights of residents in a certain city, and the lengths of fish in a particular lake are examples of populations with finite size.
  • The observations obtained by measuring the atmospheric pressure every day, from the past on into the future, or all measurements of the depth of a lake, from any conceivable position, are examples of populations whose sizes are infinite.
  • Some finite populations are so large that in theory we assume them to be infinite (e.g., the population of lifetimes of a certain type of storage battery being manufactured for mass distribution).

Each observation in a population is a value of a random variable $X$ having some probability distribution $f(x)$. For example:

  • Inspecting items coming off an assembly line for defects: each observation is a value $0$ or $1$ of a Bernoulli random variable $X$ with probability distribution
    $$
    b(x; 1, p) = p^x q^{1-x}, \quad x = 0, 1,
    $$
    where $q = 1 - p$.
  • In the blood-type experiment, the random variable $X$ represents the type of blood and is assumed to take on values from $1$ to $8$.
  • The lives of storage batteries are values assumed by a continuous random variable, perhaps with a normal distribution.
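As a concrete illustration, the Bernoulli probability distribution above is easy to evaluate directly. The following Python sketch (the helper name `bernoulli_pmf` is ours, not a standard library function) does so:

```python
def bernoulli_pmf(x, p):
    """b(x; 1, p) = p^x * q^(1 - x) for x in {0, 1}, with q = 1 - p."""
    if x not in (0, 1):
        raise ValueError("a Bernoulli observation must be 0 or 1")
    q = 1.0 - p
    return p ** x * q ** (1 - x)

# With p = 0.1 (say, a 10% defect rate), x = 1 (defective) has
# probability 0.1 and x = 0 (nondefective) has probability 0.9.
print(bernoulli_pmf(1, 0.1))  # 0.1
print(bernoulli_pmf(0, 0.1))  # 0.9
```

Each item inspected on the assembly line contributes one such value to the population.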

When we refer to a “binomial population,” a “normal population,” or, in general, the “population $f(x)$,” we mean a population whose observations are values of a random variable having a binomial distribution, a normal distribution, or the probability distribution $f(x)$. The mean and variance of a random variable or probability distribution are also referred to as the mean and variance of the corresponding population.

In statistical inference, we are interested in drawing conclusions about a population when it is impossible or impractical to observe the entire set of observations. For example, to determine the average length of life of a certain brand of light bulb, it would be impossible to test all such bulbs. Exorbitant costs can also be a prohibitive factor. Therefore, we depend on a subset of observations from the population to help us make inferences concerning that population. This brings us to the notion of sampling.

Definition:

A sample is a subset of a population.

If our inferences from the sample to the population are to be valid, we must obtain samples that are representative of the population. Any sampling procedure that produces inferences that consistently overestimate or consistently underestimate some characteristic of the population is said to be biased. To eliminate bias, it is desirable to choose a random sample: observations made independently and at random.

Suppose we select a random sample of size $n$ from a population $f(x)$. Let $X_i$, $i = 1, 2, \ldots, n$, represent the $i$th measurement or sample value. The random variables $X_1, X_2, \ldots, X_n$ constitute a random sample from the population $f(x)$ if the measurements are obtained by repeating the experiment $n$ independent times under essentially the same conditions. Thus, the random variables $X_1, X_2, \ldots, X_n$ are independent and each has the same probability distribution $f(x)$. Their joint probability distribution is given in the following definition.

Definition:

Let $X_1, X_2, \ldots, X_n$ be $n$ independent random variables, each having the same probability distribution $f(x)$. Then $X_1, X_2, \ldots, X_n$ form a random sample of size $n$ from the population $f(x)$, with joint probability distribution

$$
f(x_1, x_2, \ldots, x_n) = f(x_1) f(x_2) \cdots f(x_n).
$$
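The product form of the joint distribution can be made concrete with a short Python sketch (the helper names are ours, for illustration), here for a normal population:

```python
import math

def normal_pdf(x, mu, sigma):
    """Density f(x) of one observation from a N(mu, sigma^2) population."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def joint_pdf(sample, mu, sigma):
    """Joint density of an i.i.d. sample:
    f(x1, ..., xn) = f(x1) * f(x2) * ... * f(xn),
    which holds precisely because the observations are independent."""
    density = 1.0
    for x in sample:
        density *= normal_pdf(x, mu, sigma)
    return density
```

For instance, a sample of two observations both equal to the mean has joint density $f(\mu)^2$, the square of the single-observation density.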

Example:

If we select $n$ storage batteries from a manufacturing process and record the life of each battery, with $x_1$ the value of $X_1$, $x_2$ the value of $X_2$, etc., then $x_1, x_2, \ldots, x_n$ are the values of the random sample $X_1, X_2, \ldots, X_n$. If the population of battery lives is normal, each $X_i$ has the same normal distribution as $X$.

Some Important Statistics

Our main purpose in selecting random samples is to elicit information about unknown population parameters. For example, to estimate the proportion of coffee drinkers in the United States who prefer a certain brand, we select a large random sample and compute the sample proportion $\hat{p}$ preferring that brand. Since many random samples are possible, $\hat{p}$ varies from sample to sample; it is a value of a random variable, called a statistic.

Definition:

Any function of the random variables $X_1, X_2, \ldots, X_n$ constituting a random sample is called a statistic.

Location Measures of a Sample: The Sample Mean, Median, and Mode

Let $X_1, X_2, \ldots, X_n$ represent $n$ random variables.

  • Sample mean:

    $$
    \overline{X} = \frac{1}{n} \sum_{i=1}^{n} X_i
    $$

    The statistic $\overline{X}$ assumes the value $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$ for a given sample.

  • Sample median:

    $$
    \tilde{x} =
    \begin{cases}
    x_{(n+1)/2} & \text{if } n \text{ is odd,} \\
    \frac{1}{2}\left(x_{n/2} + x_{n/2+1}\right) & \text{if } n \text{ is even.}
    \end{cases}
    $$

    The sample median is the middle value of the sample once the observations are arranged in increasing order.

  • Sample mode:
    The value of the sample that occurs most often.

    Example:

    Suppose a data set consists of the following observations:

0.32,\ 0.53,\ 0.28,\ 0.37,\ 0.47,\ 0.43,\ 0.36,\ 0.42,\ 0.38,\ 0.43

The sample mode is $0.43$, since it occurs more often than any other value.
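All three location measures for this data set can be checked with Python’s standard `statistics` module:

```python
import statistics

data = [0.32, 0.53, 0.28, 0.37, 0.47, 0.43, 0.36, 0.42, 0.38, 0.43]

print(statistics.mean(data))    # sample mean x-bar, about 0.399
print(statistics.median(data))  # average of the two middle ordered values, 0.40
print(statistics.mode(data))    # 0.43, the most frequent value
```

Since $n = 10$ is even, the median is the average of the 5th and 6th ordered observations, $(0.38 + 0.42)/2 = 0.40$.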

Variability Measures of a Sample: The Sample Variance, Standard Deviation, and Range

A measure of location or central tendency in a sample does not by itself give a clear indication of the nature of the sample. Thus, a measure of variability in the sample must also be considered.

The variability in a sample displays how the observations spread out from the average.

  • Sample variance:

    $$
    S^2 = \frac{1}{n-1} \sum_{i=1}^{n} \left(X_i - \overline{X}\right)^2
    $$

    The computed value of $S^2$ for a given sample is denoted by $s^2$.

    Example:

    A comparison of coffee prices at 4 randomly selected grocery stores in San Diego showed increases from the previous month of 12, 15, 17, and 20 cents for a 1-pound bag. Find the variance of this random sample of price increases.

    Solution:
    Sample mean: $\bar{x} = \frac{12 + 15 + 17 + 20}{4} = 16$ cents.
    Sample variance:

    $$
    s^2 = \frac{1}{3}\left[(12-16)^2 + (15-16)^2 + (17-16)^2 + (20-16)^2\right] = \frac{34}{3} \approx 11.3.
    $$

  • An alternative formula for the sample variance is:

    Theorem:

    If $S^2$ is the variance of a random sample of size $n$, we may write

    $$
    S^2 = \frac{1}{n(n-1)}\left[n \sum_{i=1}^{n} X_i^2 - \left(\sum_{i=1}^{n} X_i\right)^2\right].
    $$

  • Sample standard deviation:

    $$
    S = \sqrt{S^2},
    $$

    where $S^2$ is the sample variance.

  • Sample range:

    $$
    R = X_{\max} - X_{\min}
    $$

Example:

Find the variance of the data $3, 4, 5, 6, 6, 7$, representing the number of trout caught by a random sample of 6 fishermen.

Solution:
Calculating the sample mean, we get $\bar{x} = \frac{3 + 4 + 5 + 6 + 6 + 7}{6} = \frac{31}{6}$.
Using the theorem above, we find that $\sum_{i=1}^{6} x_i^2 = 171$, $\sum_{i=1}^{6} x_i = 31$, and $n = 6$. Hence,

$$
s^2 = \frac{1}{(6)(5)}\left[(6)(171) - (31)^2\right] = \frac{13}{6} \approx 2.17.
$$

Sample standard deviation: $s = \sqrt{13/6} \approx 1.47$.

Sample range: $r = 7 - 3 = 4$.
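Both forms of the sample variance, together with the standard deviation and range, can be verified in Python (assuming the trout counts $3, 4, 5, 6, 6, 7$ used in the example above):

```python
import math

data = [3, 4, 5, 6, 6, 7]   # trout caught by each of the 6 fishermen
n = len(data)
xbar = sum(data) / n

# Definitional form: s^2 = sum (x_i - x-bar)^2 / (n - 1)
s2 = sum((x - xbar) ** 2 for x in data) / (n - 1)

# Computational form from the theorem:
# s^2 = [n * sum x_i^2 - (sum x_i)^2] / [n(n - 1)]
s2_alt = (n * sum(x * x for x in data) - sum(data) ** 2) / (n * (n - 1))

s = math.sqrt(s2)            # sample standard deviation, about 1.47
r = max(data) - min(data)    # sample range, 7 - 3 = 4

print(s2, s2_alt, s, r)      # both variance forms give 13/6
```

The computational form avoids subtracting $\bar{x}$ from every observation, which is why it was historically preferred for hand calculation.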

Sampling Distribution of Means

The first important sampling distribution to be considered is that of the mean $\overline{X}$. Suppose that a random sample of $n$ observations is taken from a normal population with mean $\mu$ and variance $\sigma^2$. Each observation $X_i$, $i = 1, 2, \ldots, n$, of the random sample will then have the same normal distribution as the population being sampled.

Hence, we conclude that

$$
\overline{X} = \frac{X_1 + X_2 + \cdots + X_n}{n}
$$

has a normal distribution with mean

$$
\mu_{\overline{X}} = \frac{\mu + \mu + \cdots + \mu}{n} = \mu
$$

and variance

$$
\sigma_{\overline{X}}^2 = \frac{\sigma^2 + \sigma^2 + \cdots + \sigma^2}{n^2} = \frac{\sigma^2}{n}.
$$
If we are sampling from a population with unknown distribution, either finite or infinite, the sampling distribution of $\overline{X}$ will still be approximately normal with mean $\mu$ and variance $\sigma^2/n$, provided that the sample size is large. This amazing result is an immediate consequence of the following theorem, called the Central Limit Theorem.

The Central Limit Theorem

The Central Limit Theorem is one of the most important results in probability theory and statistics. It provides the theoretical foundation for many statistical procedures and explains why the normal distribution appears so frequently in nature.

Theorem:

The Central Limit Theorem states that if $\overline{X}$ is the mean of a random sample of size $n$ taken from a population with mean $\mu$ and finite variance $\sigma^2$, then the limiting form of the distribution of

$$
Z = \frac{\overline{X} - \mu}{\sigma / \sqrt{n}},
$$

as $n \to \infty$, is the standard normal distribution $N(z; 0, 1)$.

The normal approximation for $\overline{X}$ will generally be good if $n \geq 30$, provided the population distribution is not terribly skewed. If $n < 30$, the approximation is good only if the population is not too different from a normal distribution. If the population is known to be normal, the sampling distribution of $\overline{X}$ will follow a normal distribution exactly, no matter how small the size of the samples.

The sample size $n \geq 30$ is a guideline to use for the Central Limit Theorem. However, as the statement of the theorem implies, the presumption of normality for the distribution of $\overline{X}$ becomes more accurate as $n$ grows larger. The following figure illustrates how the theorem works:

[Figure: Illustration of the Central Limit Theorem — distribution of $\overline{X}$ for $n = 1$, moderate $n$, and large $n$ (Walpole et al., 2017).]

The figure shows how the distribution of $\overline{X}$ becomes closer to normal as $n$ grows larger, beginning with the clearly nonsymmetric distribution of an individual observation ($n = 1$). It also illustrates that the mean of $\overline{X}$ remains $\mu$ for any sample size and the variance of $\overline{X}$ gets smaller as $n$ increases.
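A quick simulation illustrates this behavior. The sketch below (the function name is ours) draws repeated samples from an exponential population with mean 1 and variance 1, a clearly nonsymmetric distribution, and tracks the mean and variance of $\overline{X}$:

```python
import random
import statistics

def sample_means(n, reps=5000):
    """Means of `reps` random samples of size n drawn from an exponential
    population with mean 1 and variance 1 (clearly nonsymmetric)."""
    return [statistics.fmean(random.expovariate(1.0) for _ in range(n))
            for _ in range(reps)]

random.seed(1)
for n in (1, 5, 30):
    means = sample_means(n)
    # The mean of X-bar stays near mu = 1 for every n, while its variance
    # shrinks toward sigma^2 / n = 1 / n, and a histogram of `means`
    # looks increasingly normal as n grows.
    print(n, round(statistics.fmean(means), 3),
          round(statistics.pvariance(means), 4))
```

Plotting a histogram of `means` for each $n$ reproduces the three panels of the figure.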

Example:

An electrical firm manufactures light bulbs that have a length of life that is approximately normally distributed, with mean equal to 800 hours and a standard deviation of 40 hours. Find the probability that a random sample of 16 bulbs will have an average life of less than 775 hours.

Solution:
The sampling distribution of $\overline{X}$ will be approximately normal, with $\mu_{\overline{X}} = 800$ and $\sigma_{\overline{X}} = 40/\sqrt{16} = 10$. The desired probability is given by the area of the shaded region in the following figure:

[Figure: Shaded area $P(\overline{X} < 775)$ under the sampling distribution of $\overline{X}$.]

Corresponding to $\bar{x} = 775$, we find that

$$
z = \frac{775 - 800}{10} = -2.5,
$$

and therefore

$$
P(\overline{X} < 775) = P(Z < -2.5) = 0.0062.
$$
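This calculation can be reproduced in Python using the error function for the standard normal CDF (assuming the values $\mu = 800$, $\sigma = 40$, $n = 16$ used in the example above):

```python
import math

def normal_cdf(z):
    """Standard normal CDF, Phi(z), via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

mu, sigma, n = 800, 40, 16          # population mean, sd, and sample size
sigma_xbar = sigma / math.sqrt(n)   # sd of X-bar: 40 / sqrt(16) = 10
z = (775 - mu) / sigma_xbar         # standardized value: -2.5
print(round(normal_cdf(z), 4))      # P(X-bar < 775) ~ 0.0062
```

The result agrees with the table value $P(Z < -2.5) = 0.0062$.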

Inferences on the Population Mean

One very important application of the Central Limit Theorem is the determination of reasonable values of the population mean $\mu$. Topics such as hypothesis testing, estimation, quality control, and many others make use of the Central Limit Theorem.

Case Study: Automobile Parts

An important manufacturing process produces cylindrical component parts for the automotive industry. It is important that the process produce parts having a mean diameter of 5.0 millimeters. The engineer involved conjectures that the population mean is 5.0 millimeters. An experiment is conducted in which 100 parts produced by the process are selected randomly and the diameter measured on each. It is known that the population standard deviation is $\sigma = 0.1$ millimeter. The experiment indicates a sample average diameter of $\bar{x} = 5.027$ millimeters. Does this sample information appear to support or refute the engineer’s conjecture?

Solution:
This example reflects the kind of problem often posed and solved with hypothesis testing machinery introduced in future chapters. We will not use the formality associated with hypothesis testing here, but we will illustrate the principles and logic used.

Whether the data support or refute the conjecture depends on the probability that data similar to those obtained in this experiment ($\bar{x} = 5.027$) can readily occur when in fact $\mu = 5.0$ (see the following figure). In other words, how likely is it that one can obtain $\bar{x} \geq 5.027$ with $n = 100$ if the population mean is $\mu = 5.0$?

[Figure: Sampling distribution of $\overline{X}$ with $\mu = 5.0$, showing the observed $\bar{x} = 5.027$ (Walpole et al., 2017).]

The probability that we choose to compute is given by $P(|\overline{X} - 5| \geq 0.027)$. In other words, if the mean is 5, what is the chance that $\overline{X}$ will deviate by as much as 0.027 millimeter?

$$
\begin{aligned}
P(|\overline{X} - 5| \geq 0.027) &= P(\overline{X} - 5 \geq 0.027) + P(\overline{X} - 5 \leq -0.027) \\
&= 2P\left(\frac{\overline{X} - 5}{0.1/\sqrt{100}} \geq 2.7\right).
\end{aligned}
$$

Here we are simply standardizing $\overline{X}$ according to the Central Limit Theorem. If the conjecture $\mu = 5.0$ is true, $\frac{\overline{X} - 5}{0.1/\sqrt{100}}$ should follow $N(0, 1)$. Thus,

$$
2P\left(\frac{\overline{X} - 5}{0.1/\sqrt{100}} \geq 2.7\right) = 2P(Z \geq 2.7) = 2(0.0035) = 0.007.
$$

Therefore, one would experience by chance an $\bar{x}$ that is $0.027$ millimeter from the mean in only about 7 in 1000 experiments. As a result, this experiment with $\bar{x} = 5.027$ does not give supporting evidence to the conjecture that $\mu = 5.0$.
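The two-sided probability in this case study can be reproduced in Python, again using the error function for the standard normal CDF:

```python
import math

def normal_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

mu, sigma, n = 5.0, 0.1, 100               # conjectured mean, known sd, sample size
xbar = 5.027                               # observed sample average diameter
z = (xbar - mu) / (sigma / math.sqrt(n))   # standardizes to 2.7
p_two_sided = 2.0 * (1.0 - normal_cdf(z))  # 2 * P(Z >= 2.7)
print(round(p_two_sided, 3))               # 0.007
```

Because the sampling distribution of $\overline{X}$ is symmetric about $\mu$ under the conjecture, the two tail probabilities are equal, which is why doubling the upper tail suffices.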