### [A, SfS] Chapter 1: Sampling, Descriptive Statistics, Intr: 1.2: Random Sampling

### Random Sampling

We want to have a sample that represents the population from which it is selected, so that we can make generalizations from the sample to the population.

Such a sample should be free from bias toward any characteristic that is not shared by every element of the population. One sampling method that is likely to produce a biased sample is *convenience sampling*.

Convenience Sample

A **convenience sample** is the lazy way to select a sample. This means that the researcher just selects elements based on the most convenient approach, such as walking up to people in public.

This includes the **voluntary response sample**, whereby the researcher publicizes the need for participants in a study and accepts all those who voluntarily offer to be participants.


The best tool for obtaining a representative sample is *random sampling*.

Random Sampling

**Random sampling** is an unbiased method for selecting a representative sample.

This method requires that the researcher has access to a **sampling frame**, i.e., a list of every element in the population. Then the researcher needs some system for randomly selecting elements from this list; this is usually accomplished using computer algorithms.

We will now discuss four sampling methods that make use of random sampling:

- *Simple random sampling*
- *Systematic random sampling*
- *Cluster sampling*
- *Stratified sampling*


Simple Random Sample

In a **simple random sample** (SRS) of size *n*, every possible subset of *n* elements from the population has an equal chance of being selected.

If there are #25# students in a classroom and we select *n* = #3# students using SRS, there are #2300# possible subsets consisting of #3# students, and each one has the same chance of being selected.

Most of these #2300# subsets of students would be somewhat representative of the whole class in terms of characteristics such as *gender, race, national origin, sexuality, religion, economic background*, etc. But many of the #2300# subsets would not be representative, such as a subset consisting of all females from the same country.

So there is a large chance that an SRS will result in a somewhat representative sample (especially if *n* is large relative to the size of the population or if there is not much variation among the elements of the population), but also some chance that the sample will be unrepresentative. Since we can never be #100 \%# certain that the SRS is representative, we cannot be #100 \%# certain that any conclusions we draw from the sample are valid.
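The count of #2300# subsets in the classroom example can be checked directly, and an SRS can be drawn with a standard library routine. The following is a minimal sketch in Python; the roster of student numbers and the fixed seed are assumptions made only for illustration.

```python
import math
import random

# Hypothetical class roster: 25 students labelled 1..25.
students = list(range(1, 26))
n = 3

# Number of distinct subsets of size 3 that can be formed from 25 students.
num_subsets = math.comb(len(students), n)
print(num_subsets)  # 2300

# random.sample draws n distinct elements without replacement,
# so every subset of size n is equally likely to be chosen.
rng = random.Random(0)  # fixed seed only so the sketch is reproducible
srs = rng.sample(students, n)
print(srs)
```

Changing the seed gives a different subset, but each of the #2300# subsets has the same probability of being drawn.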


An alternative to simple random sampling is the *systematic random sample*.

Systematic Random Sampling

With **systematic random sampling**, the researcher begins at some arbitrary element of the sampling frame and selects it as the first element of the sample. The researcher then skips ahead by a fixed amount *k* and selects that element as the second element of the sample, and repeats this step until *n* elements have been selected (starting back over at the beginning of the sampling frame if necessary).

Suppose my sampling frame consists of #2000# names, and I want a sample size of #20#. I decide to begin by choosing the #5th# name, then I skip ahead by #100# and choose the #105th# name, then the #205th# name, then the #305th# name, and so on until I choose the #1905th# name, which gives me #20# names for my sample.
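The selection in this example can be sketched in a few lines of Python. The function name `systematic_sample` and the use of integers as stand-ins for the #2000# names are assumptions for illustration, not part of the original text.

```python
def systematic_sample(frame, start, k, n):
    """Select n elements from frame, beginning at the 1-based position
    `start` and skipping ahead by k each time. The modulo wraps around
    to the beginning of the frame if the end is reached."""
    size = len(frame)
    return [frame[(start - 1 + i * k) % size] for i in range(n)]

# Placeholder sampling frame: the numbers 1..2000 stand in for 2000 names.
frame = list(range(1, 2001))

sample = systematic_sample(frame, start=5, k=100, n=20)
print(sample[:4], sample[-1])  # [5, 105, 205, 305] 1905
print(len(sample))             # 20
```

Here the starting position is fixed at #5# to match the example; a starting position chosen at random would be one way to make the procedure less dependent on the researcher's choice.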

This differs from an SRS because many subsets cannot be chosen: the only possible subsets are those whose elements are spaced *k* positions apart in the sampling frame. However, if the order of elements in the sampling frame is arbitrary (such as alphabetical order of names), a systematic random sample has a good chance of producing a representative sample.


Stratified Random Sample

A **stratified random sample** is one in which the sampling frame is first divided into separate categories based on some criterion, such as *gender* or *race* or *national origin* or *species*. Then an SRS is selected from each category.

It is usually not necessary that the SRS for each category be of the same size; it depends on the researcher's goals. The idea is to make sure that each category is adequately represented in the sample, which an SRS of the whole population cannot guarantee.

If a researcher wants the sample to be half male and half female, a stratified random sample with equal sample sizes for males and females would be a good way to ensure that.

But if a researcher wants the sample to contain all four blood types in the same proportion as they occur in the population, a stratified sample with different sample sizes from each blood type (more for type O, fewer for type AB) would make more sense.
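A proportional stratified sample like the blood-type example can be sketched as follows. The population counts per blood type are made-up numbers chosen only to illustrate the allocation, and the helper `stratified_sample` is a hypothetical name.

```python
import random

def stratified_sample(strata, n, rng):
    """Draw an SRS from each stratum, with a sample size proportional
    to the stratum's share of the population. Rounding may make the
    total differ slightly from n for other inputs."""
    total = sum(len(members) for members in strata.values())
    sample = {}
    for name, members in strata.items():
        n_h = round(n * len(members) / total)  # size for this stratum
        sample[name] = rng.sample(members, n_h)
    return sample

# Hypothetical population of 1000 people; these counts are illustrative only.
counts = {"O": 450, "A": 400, "B": 110, "AB": 40}
strata = {}
next_id = 0
for blood_type, count in counts.items():
    strata[blood_type] = list(range(next_id, next_id + count))
    next_id += count

rng = random.Random(1)  # fixed seed only so the sketch is reproducible
sample = stratified_sample(strata, n=100, rng=rng)
sizes = {name: len(chosen) for name, chosen in sample.items()}
print(sizes)  # {'O': 45, 'A': 40, 'B': 11, 'AB': 4}
```

For the half-male, half-female design, the same idea applies with a fixed sample size per category instead of a proportional one.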


Cluster Sampling

In **cluster sampling**, we have population elements which are already grouped together in some way, such as primary-school students who are already grouped into different primary schools, or football players who are already grouped into different teams.

The researcher would then take an SRS of the clusters, and then within each of these clusters select an SRS of elements.

A researcher might take an SRS of #10# primary schools out of all the primary schools in the region. Then within each of those #10# primary schools the researcher might take an SRS of #10# students.

Thus the sample would contain #100# students in total. This might be much easier than randomly selecting #100# students from a list of all primary school students in the region.
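The two stages of this example can be sketched in Python. The number of schools in the region, the number of students per school, and all identifiers are assumptions invented for illustration.

```python
import random

rng = random.Random(2)  # fixed seed only so the sketch is reproducible

# Hypothetical region: 40 primary schools with 200 students each.
schools = {
    f"school_{s}": [f"school_{s}_student_{i}" for i in range(200)]
    for s in range(40)
}

# Stage 1: an SRS of 10 clusters (schools).
chosen_schools = rng.sample(sorted(schools), 10)

# Stage 2: an SRS of 10 students within each chosen school.
sample = []
for school in chosen_schools:
    sample.extend(rng.sample(schools[school], 10))

print(len(chosen_schools))  # 10
print(len(sample))          # 100
```

Only the rosters of the #10# chosen schools are needed in the second stage, which is why this design can be easier than sampling from a list of all students in the region.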


It is important to differentiate between the following two types of scientific studies:

- *Experiments*
- *Observational studies*

Experimental Units

When conducting an experiment, researchers begin with a pool of **experimental units** (which are referred to as **subjects** if human or animal) selected from the population in (usually) a non-randomized way.

When people are involved, participation in the experiment can be demanding or even risky, so the subjects have to be recruited (possibly with a small payment for their participation). This means that the pool may be biased with regard to various characteristics.

Experiments and Causality

Once the pool has been assembled, the researchers randomly assign the experimental units to two or more treatment groups. This method makes it very likely that the only systematic difference between the treatment groups is the difference in treatments received.
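Random assignment can be sketched as shuffling the pool and dealing the units out to the groups in turn. The function name `randomly_assign`, the subject labels, and the group names are hypothetical and chosen only for illustration.

```python
import random

def randomly_assign(units, groups, rng):
    """Shuffle the experimental units, then deal them out to the
    treatment groups in turn so that group sizes differ by at most one."""
    shuffled = list(units)
    rng.shuffle(shuffled)
    assignment = {group: [] for group in groups}
    for i, unit in enumerate(shuffled):
        assignment[groups[i % len(groups)]].append(unit)
    return assignment

# Hypothetical pool of 20 recruited subjects.
subjects = [f"subject_{i}" for i in range(1, 21)]

rng = random.Random(3)  # fixed seed only so the sketch is reproducible
assignment = randomly_assign(subjects, ["treatment", "control"], rng)
print({group: len(members) for group, members in assignment.items()})
# {'treatment': 10, 'control': 10}
```

Because chance alone decides which group a subject lands in, characteristics of the subjects are unlikely to differ systematically between the groups.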

Consequently, if significant differences are found with respect to one or more measured variables among the treatment groups, we can infer that the differences are *caused* by the differences in treatment. In scientific research, to establish *causality*, we must use this experimental approach.

This conclusion can then be generalized to the population from which the pool was taken, but only to the extent that the pool is representative of the population. If the pool consisted only of university students, one must be cautious about extending conclusions to older or younger segments of the population, or even to people of the same age range who do not attend university.


Observational Study

In an **observational study**, researchers record information based on measurements on the elements of a sample, whether they use a survey (assuming human elements), or observe the elements, or observe evidence left behind by the elements.

Correlation and causation

Unlike in an experiment, researchers conducting an observational study do not subject the elements in the sample to any conditions. Consequently, if the researchers conclude that there is some association between measurements of two or more variables, they cannot conclude that the association is due to changes in one variable *causing* changes in the other variable.

The catchphrase is: "Correlation does not imply causation".

It might be that the association is a coincidence. Or the association might be because there is another variable not measured, which we call a **confounding variable** or **lurking variable**, that is responsible for the changes in both measured variables, and thus the observed association is a **spurious effect**.

It might be observed that there is a strong positive correlation between alcohol use and lung cancer. But it cannot be concluded that alcohol use *causes* a rise in lung cancer. It could be that both phenomena are explained by their strong connection to another variable: smoking.
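A small simulation can illustrate how a confounding variable produces a spurious association. In the sketch below all data are simulated under an invented model: a smoking score drives both an alcohol score and a lung-cancer risk score, and alcohol has no effect on risk at all. None of the numbers come from real studies.

```python
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

rng = random.Random(4)  # fixed seed only so the sketch is reproducible

# Invented model: smoking influences both variables; alcohol does NOT
# influence cancer risk in this simulation.
smoking = [rng.random() for _ in range(5000)]
alcohol = [s + rng.gauss(0, 0.3) for s in smoking]
cancer_risk = [s + rng.gauss(0, 0.3) for s in smoking]

# Association across everyone: clearly positive, although not causal.
r_all = pearson(alcohol, cancer_risk)

# Association among people with nearly the same smoking score.
similar = [i for i, s in enumerate(smoking) if 0.45 < s < 0.55]
r_within = pearson([alcohol[i] for i in similar],
                   [cancer_risk[i] for i in similar])

print(round(r_all, 2), round(r_within, 2))
```

Under these assumptions the overall correlation is clearly positive, while the correlation among people with similar smoking scores is close to zero, which is the pattern expected when the association is due to the confounder.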
