1. Subjects and Variables Chapter Two § 0
Statistical techniques are tools for scholars in many fields: science, social science, applied science, health, engineering, education, and business. The unit of study is a subject: a person, manufactured part, corporation, state of the union, nation, or other individual, and many of its properties can be described by words or numbers. A variable is a quantity that can take on different values in different instances. (These values may be words or phrases, but are usually numbers.) A level is a possible value of a variable, and a score (or datum) is an observed level (for a particular subject). A statistical data set is a set of subjects together with their scores on one or more subject variables.
A population is a set of all of the subjects of interest, together with data on one or more subject variables. But usually we don't have information on a population; we only wish we did. Most often, the data set we have is for only some of the subjects in our population. A sample is a set of some of the subjects of interest, together with scores on one or more subject variables. A numerical property (e.g., mode, median or mean) of a population is called a parameter, but a numerical property of a sample is called a statistic.
The set of levels (possible values) is called the scale of a variable. We classify a variable according to the kind of arithmetic that can be performed on its levels. All scales are considered nominal, in the sense that their levels are names. But if the levels can be put in a meaningful order (not just alphabetical), we say the scale is ordinal. And if the levels have arithmetic differences that have the same meaning at different parts of the scale, we say the scale is interval. (Remember that all variables are nominal and an interval variable is also ordinal.)
2. Descriptive Statistics Chapter Two § 6
Different graphical displays are appropriate for variables on different scales. Pie charts and bar graphs are appropriate for nominal variables that have no more than about ten levels. A box plot is built from percentiles, so it works only for an ordinal (or interval) variable. A histogram is a bar graph with bars listed in increasing order from left to right, so it also works only for an ordinal (or interval) variable. (Unlike a pie chart or bar graph, a histogram can be understood even with a large number of bars.)
Different summary statistics are appropriate for variables on different scales. With a nominal variable, we can report the mode (most common level). With an ordinal variable, we can also report the percentiles in a data set (including the median, quartiles, minimum, and maximum). With an interval variable, we can also report the range, inter-quartile range, midrange, mean, variance, and standard deviation.
The range and standard deviation are both nonnegative, and if either is zero, then all the scores in the data set are the same. And the bigger the range or standard deviation, the more spread out the scores are. But the range considers only the minimum and maximum scores, while the standard deviation considers all scores in a data set.
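As a rough illustration of these summary statistics, the sketch below uses Python's standard statistics module on a small made-up data set. The scores are hypothetical, and the population form of the standard deviation (dividing by n) is an assumption, since the text does not say which form is intended.

```python
import statistics

# Hypothetical interval-scale scores (made up for illustration).
scores = [2, 4, 4, 5, 7, 9, 11]

mode = statistics.mode(scores)            # most common level
median = statistics.median(scores)        # 50th percentile
mean = statistics.mean(scores)
score_range = max(scores) - min(scores)   # uses only the two extreme scores
sd = statistics.pstdev(scores)            # uses every score (divides by n)

print(mode, median, mean, score_range, round(sd, 3))

# If every score is the same, both measures of spread are zero.
constant = [5, 5, 5, 5]
constant_range = max(constant) - min(constant)
constant_sd = statistics.pstdev(constant)
print(constant_range, constant_sd)
```

Changing any middle score changes the standard deviation but leaves the range alone, which is the contrast the paragraph above draws.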
The histogram of an interval variable is called symmetric about a level L when it can be folded at L so the two parts of the scale of levels come together and the two sides of the histogram match. If a histogram is symmetric, it is symmetric about a level which is simultaneously the median and the mean: the median because half the subjects are below it and half above it, and the mean because scores in the histogram occur in paired strips equidistant from the center L of symmetry.
A histogram which is not symmetric is called skewed. If the mean is higher than the median, we say the histogram is skewed positively or high. But if the mean is lower than the median, we say the histogram is skewed negatively or low. However, the mean cannot be farther from the median than the amount of the standard deviation.
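The comparison of mean and median can be sketched as follows. The function name skew_direction and the data are hypothetical, and the population form of the standard deviation is again an assumption.

```python
import statistics

def skew_direction(scores):
    """Classify skew by comparing the mean with the median, as in the text."""
    mean = statistics.mean(scores)
    median = statistics.median(scores)
    if mean > median:
        return "positive"
    if mean < median:
        return "negative"
    return "none by this criterion"

# Hypothetical scores with one large value pulling the mean upward.
scores = [1, 2, 2, 3, 12]
gap = abs(statistics.mean(scores) - statistics.median(scores))
sd = statistics.pstdev(scores)  # population form (divide by n): an assumption

# The last value printed checks that the mean is within one
# standard deviation of the median.
print(skew_direction(scores), gap, round(sd, 3), gap <= sd)
```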
In any set of scores (data), whether it is a population or a sample, we can find a mean and a standard deviation. Then each raw score x has a standard score z = (x – mean)/standard deviation. (You should be able to solve for any quantity in this equation, given all the others.)
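Solving the standard score equation for each quantity in turn might look like the sketch below. The function names and the numbers (a score of 650 with mean 500 and standard deviation 100) are hypothetical.

```python
def z_score(x, mean, sd):
    """Standard score of a raw score x."""
    return (x - mean) / sd

def raw_score(z, mean, sd):
    """Raw score that has standard score z."""
    return mean + z * sd

def mean_from(x, z, sd):
    """Mean, given a raw score, its standard score, and the standard deviation."""
    return x - z * sd

def sd_from(x, z, mean):
    """Standard deviation, given a raw score, its (nonzero) standard score, and the mean."""
    return (x - mean) / z

# Hypothetical numbers for illustration.
z = z_score(650, 500, 100)
print(z, raw_score(z, 500, 100), mean_from(650, z, 100), sd_from(650, z, 500))
```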
Chebychev's rule says that the fraction of scores within k standard deviations of the mean (i.e., for which -k <= z <= k ) is at least 1-1/k^2. The number k is intended to be more than 1, but need not be an integer. (Be especially familiar with the cases k=2 and 3.)
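A quick way to get a feel for the rule is to compare the guaranteed fraction with the observed fraction in a data set. This is only a sketch: the data are made up, the function names are hypothetical, and the standard deviation is computed in its population form (an assumption).

```python
import statistics

def chebychev_bound(k):
    """Guaranteed minimum fraction of scores with -k <= z <= k (for k > 1)."""
    return 1 - 1 / k**2

def fraction_within(scores, k):
    """Observed fraction of scores within k standard deviations of the mean."""
    mean = statistics.mean(scores)
    sd = statistics.pstdev(scores)  # population form: an assumption
    inside = [x for x in scores if abs(x - mean) <= k * sd]
    return len(inside) / len(scores)

scores = [1, 2, 2, 3, 12]  # hypothetical data
for k in (1.5, 2, 3):
    print(k, round(chebychev_bound(k), 3), fraction_within(scores, k))
```

For k=2 the bound is 3/4 and for k=3 it is 8/9; the observed fraction can be larger than the bound but never smaller.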
3. Probability Chapter Two §1.5 §1.6 §4.5 Chapter Three §4.3 §4.5
When a sample of some fixed size n is chosen from a given population, certain statements are true, whether or not a subject is replaced after being drawn (is allowed to be chosen repeatedly). If at each draw, each available subject is as likely to be selected as any other, then every possible sample arrangement is as likely to be obtained as any other. Such a selection process is called random sampling, and any sample that results from it is called a random sample. An event is any set of possible sample arrangements which share some common feature, and the fraction of all possible sample arrangements of size n for which some event occurs is called the probability of that event. Typical events are different for variables using different types of measurement scale: nominal, ordinal, and interval.
A typical nominal variable is the party affiliation (Republican, Democrat, Independent) of a registered voter in a certain state. A typical event would be the occurrence of at least x Republicans in a sample of size n. The statistic used for this is the binomial random variable X. In the case n=1, the probability of obtaining a Republican is precisely the fraction of Republicans in the state.
A typical ordinal variable is an opinion response on a scale of 1,2,3,4,5,6,7,8,9 with associated word phrases like "very strongly agree". (Such a variable is said to use a Likert scale.) A typical event would be the occurrence of a sample median of at least 6. In the case n=1, the probability of the response being at least 6 is the fraction of subjects who respond to the question with at least 6.
A typical interval variable is a measurement like blood pressure. A typical event would be the occurrence of a sample mean of at least 140. In the case n=1, the probability of a person's blood pressure being at least 140 is the fraction of persons in the population whose blood pressure is at least 140.
On the final exam, I will provide you with normal and binomial tables. Be prepared to use them. Know that the mean (expected value) of a binomial random variable is the product np (the sample size n times the probability p of the outcome on a single draw).
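For readers who want to check table values by machine, here is a minimal sketch of the binomial calculation. It assumes draws with replacement (so each draw has the same probability p), and the values n=10 and p=0.25 are hypothetical.

```python
from math import comb

def binom_pmf(k, n, p):
    """P(X = k) for a binomial random variable with n draws and probability p."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def prob_at_least(x, n, p):
    """P(X >= x): e.g. at least x Republicans in a sample of size n."""
    return sum(binom_pmf(k, n, p) for k in range(x, n + 1))

n, p = 10, 0.25  # hypothetical sample size and fraction of Republicans
mean = sum(k * binom_pmf(k, n, p) for k in range(n + 1))

print(round(prob_at_least(3, n, p), 4))
print(round(mean, 10), n * p)   # the computed mean agrees with the product np
```

With n=1, prob_at_least(1, 1, p) is just p, matching the remark that the probability of obtaining a Republican is the fraction of Republicans in the state.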
4. Two-Way Tables Chapter Two §1
Two-way tables aid in computing probabilities [cf. the example in Luft Chap Two Sec 1.6-1.7].
4.1 Unconditional Probability of Events Chapter Two §1.5
EXAMPLE: Consider a room containing 12 people: 8 women and 4 men. Seven of the women are Democrats and one is a Republican. Two of the men are Democrats and two are Republicans. This situation is illustrated by the table below.

          D     R     Total
  W       7     1     8
  M       2     2     4
  Total   9     3     12

If we choose one person at random from the room, then
  P(W) = 8/12        P(D) = 9/12
  P(M) = 4/12        P(R) = 3/12
  P(WnD) = 7/12      P(WnR) = 1/12
  P(MnD) = 2/12      P(MnR) = 2/12
In finding the probability of a union, we must first calculate the count in the union (adding each cell of the table that belongs to either event, once).
  P(WUD) = (2+7+1)/12 = 10/12
  P(WUR) = (7+1+2)/12 = 10/12
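The counting in this example can be sketched in a few lines. The counts come from the room described in 4.1; the helper name prob and the use of exact fractions are choices made for this illustration.

```python
from fractions import Fraction

# Counts from the room in the example: (sex, party) -> number of people.
counts = {("W", "D"): 7, ("W", "R"): 1,
          ("M", "D"): 2, ("M", "R"): 2}
total = sum(counts.values())

def prob(event):
    """Probability that a randomly chosen person satisfies `event`.

    `event` is a function taking (sex, party) and returning True or False.
    """
    favorable = sum(c for cell, c in counts.items() if event(*cell))
    return Fraction(favorable, total)

p_w = prob(lambda sex, party: sex == "W")
p_d = prob(lambda sex, party: party == "D")
p_w_and_d = prob(lambda sex, party: sex == "W" and party == "D")
p_w_or_d = prob(lambda sex, party: sex == "W" or party == "D")

print(p_w, p_d, p_w_and_d, p_w_or_d)  # Fraction prints in lowest terms
```

Because each cell is counted once, the union comes out the same as P(W) + P(D) - P(WnD).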
4.2 Conditional Probability (Row Fraction) Chapter Two §1.6.2
But if we choose from only the women, we have
P(D|W) = 7/8 P(R|W) = 1/8
Or if we choose from only the Democrats, we have
P(W|D) = 7/9 P(M|D) = 2/9
The table under 4.1 is dependent because the row fractions within a column are unequal:
P(D|W) = 7/8 P(D|M) = 2/4
The terminology for this uses the words given or provided or on condition that. For example the statement P(D|W) = 7/8 might be read, "The probability of choosing a Democrat from the room at random, given that the person is a woman, is 7/8".
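Restricting the count to one row or one column can be sketched as follows, using the counts from the table under 4.1. The function name conditional is hypothetical.

```python
from fractions import Fraction

# Counts from the table under 4.1: (sex, party) -> number of people.
counts = {("W", "D"): 7, ("W", "R"): 1,
          ("M", "D"): 2, ("M", "R"): 2}

def conditional(target, given):
    """P(target | given) = count(target and given) / count(given)."""
    given_count = sum(c for cell, c in counts.items() if given in cell)
    both_count = sum(c for cell, c in counts.items()
                     if given in cell and target in cell)
    return Fraction(both_count, given_count)

print(conditional("D", "W"), conditional("R", "W"))  # choosing from the women
print(conditional("W", "D"), conditional("M", "D"))  # choosing from the Democrats
print(conditional("D", "M"))                         # choosing from the men
```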
4.3 Independent Events Chapter Two §1.7.2
But in the corresponding independent table

          D     R     Total
  W       6     2     8
  M       3     1     4
  Total   9     3     12

the row fractions within a column are equal:
P(D|W) = 6/8 P(D|M) = 3/4
A short-cut check for independence is whether the so-called product rule is satisfied:
P(WnD) = P(W) P(D)
6/12 = (8/12)(9/12)
On the other hand, if we are told that W and D are independent events, we can use the product rule to compute the probability of their intersection:
  P(WnD) = P(W) P(D) = (8/12)(9/12) = 1/2
In the independent table, we get different probabilities of unions.
P(WUD) = (3+6+2)/12
P(WUR) = (6+2+1)/12
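The product-rule check can be applied to both tables with a short sketch like the one below. The function name is_independent is hypothetical; it tests every cell of the table, not just the W and D cell.

```python
from fractions import Fraction

def is_independent(table):
    """Check P(row and column) == P(row) * P(column) for every cell."""
    total = sum(table.values())
    rows = {r for r, _ in table}
    cols = {c for _, c in table}
    for r in rows:
        for c in cols:
            p_row = Fraction(sum(table[r, cc] for cc in cols), total)
            p_col = Fraction(sum(table[rr, c] for rr in rows), total)
            if Fraction(table[r, c], total) != p_row * p_col:
                return False
    return True

# The table under 4.1 and the table under 4.3.
dependent = {("W", "D"): 7, ("W", "R"): 1, ("M", "D"): 2, ("M", "R"): 2}
independent = {("W", "D"): 6, ("W", "R"): 2, ("M", "D"): 3, ("M", "R"): 1}

print(is_independent(dependent), is_independent(independent))
```

Exact fractions are used so that the equality test is not upset by rounding.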
4.4 Mutually Exclusive Events Chapter Two §1.4
By contrast, if two events A and B are mutually exclusive, that means their intersection is empty (contains no subjects)
  AnB = ∅
so the probability of their intersection is zero.
  P(AnB) = 0
Clearly if A and B are mutually exclusive (and each has positive probability), they are dependent, since P(AnB) = 0 while P(A) P(B) > 0.
Moreover, the table for A and B looks like this:

          B       B'      Total
  A       0       P(A)    P(A)
  A'      P(B)    ?       ?
  Total   P(B)    ?       1

Therefore P(AUB) = P(A) + P(B)
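A small sketch with sets makes the addition rule concrete. The twelve numbered subjects and the events A and B are hypothetical, chosen only so that A and B share no subject.

```python
from fractions import Fraction

# Hypothetical sample space: 12 equally likely subjects, numbered 0..11.
subjects = set(range(12))
A = {0, 1, 2}        # hypothetical event A
B = {3, 4, 5, 6}     # hypothetical event B, sharing no subject with A

def prob(event):
    """Fraction of all subjects that belong to the event."""
    return Fraction(len(event), len(subjects))

print(A & B == set())                    # mutually exclusive: empty intersection
print(prob(A | B), prob(A) + prob(B))    # the two values agree

# The product rule fails here, so A and B are dependent.
print(prob(A & B) == prob(A) * prob(B))
```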
5. Hypothesis Testing
Hypothesis testing seeks to establish a research hypothesis by discrediting a null hypothesis. The observed level of significance (p-value) is the probability of the supporting event recurring, if the null hypothesis is true (at its edge). The observed level measures the consistency of the sample data with the null hypothesis.
When a specified level alpha is provided, we agree to reject the null hypothesis if the observed level (p-value) is less than or equal to alpha. But if the observed level is greater than alpha, we cannot reject the null hypothesis, and in fact we cannot reach a conclusion at all, though we may adopt a predefined course of inaction. In particular, we cannot establish the null hypothesis.
If the observed level is less than the specified level alpha, it is customary to speak of evidence (or to reject the null hypothesis) at the specified level alpha, and in fact at every level larger than the observed level. Thus if we are told there is evidence at the 10% level, it is quite possible that the observed level is 5% or 4% or .001; we wouldn't know for sure without knowing the observed level (p-value).
To say "the data are not significant" means that we did not reject the null hypothesis; it implies that a specified level alpha was chosen (even if we are not told what it is), and the observed level was bigger. Also, to say "there is evidence at all levels" means the observed level was zero to the number of decimal places that it was calculated (usually three or four); thus the observed level is less than .0005 .
We can report the result of a hypothesis test either in terms of the research hypothesis (alternative hypothesis) or the null hypothesis. To illustrate, suppose we obtain the verbal SAT scores of a random sample of students and ask whether there is evidence at the 5% level that the population mean verbal SAT is above 500. If the observed level (p-value) is computed to be 40%, then the following statements are equivalent.
1. At the 40% level there is evidence that the mean verbal SAT in the population is above 500. There is not evidence at the 5% level.
2. At the 40% level we can reject the hypothesis that the mean verbal SAT in the population is at or below 500. At the 5% level we cannot reject the hypothesis that the mean verbal SAT in the population is at or below 500.
But it is NOT correct to say "At the 40% level there is not evidence that the mean verbal SAT in the population is above 500", nor to say "At the 5% level we can reject the hypothesis that the mean verbal SAT in the population is at or below 500".
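The decision rule behind these statements is a single comparison, sketched below for the observed level of 40% from the SAT illustration. The function name reject_null and the list of alpha values are hypothetical.

```python
def reject_null(p_value, alpha):
    """Reject the null hypothesis exactly when the observed level is <= alpha."""
    return p_value <= alpha

p_value = 0.40  # observed level from the SAT illustration
for alpha in (0.01, 0.05, 0.10, 0.40, 0.50):
    if reject_null(p_value, alpha):
        decision = "evidence (reject the null hypothesis)"
    else:
        decision = "no evidence (cannot reject the null hypothesis)"
    print(f"alpha = {alpha:.2f}: {decision}")
```

The output switches from "no evidence" to "evidence" at alpha = 0.40, which is why evidence at one level implies evidence at every larger level.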
The null hypothesis is universally abbreviated H_0, but the alternative hypothesis may be abbreviated H_1, H_A, or even H_a (an especially bad choice). Be able to write hypotheses for a sign test, both in terms of m and in terms of p. [Five §1.1]
6. Decision Errors Chapter Four §3.3
Ideally, we would always reject the null hypothesis when it is false and accept it when it is true. But we cannot avoid making decision errors, because we are analyzing only part of a population.
If the null hypothesis is true, we make a Type I Error if we reject it. We will do so if the observed level (p-value) is less than or equal to the specified level alpha. Therefore if the null hypothesis is true, the probability of a Type I error is alpha.
If the null hypothesis is false, it is not possible to make a Type I Error, but we make a Type II Error if we fail to reject the null hypothesis. The smaller we make alpha, the less likely we are to reject the null hypothesis, even if it is false. Thus the smaller we make the probability of a Type I Error, the larger we make the probability of a Type II Error.
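The trade-off between the two errors can be computed exactly for a simple binomial test. The sketch below assumes a null hypothesis p <= .4 tested with a sample of size 20, and a true value p = .6 when the null is false; all three numbers are hypothetical. Because a binomial count is discrete, the Type I error probability at the edge of the null comes out at most alpha rather than exactly alpha.

```python
from math import comb

def upper_tail(c, n, p):
    """P(X >= c) for a binomial count X with n draws and probability p."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(c, n + 1))

def cutoff(n, p0, alpha):
    """Smallest c with P(X >= c | p = p0) <= alpha; we reject when X >= c."""
    for c in range(n + 2):
        if upper_tail(c, n, p0) <= alpha:
            return c

n, p0, p_true = 20, 0.4, 0.6   # hypothetical sample size and parameter values
for alpha in (0.10, 0.05, 0.01):
    c = cutoff(n, p0, alpha)
    type1 = upper_tail(c, n, p0)          # null true, at its edge p = p0
    type2 = 1 - upper_tail(c, n, p_true)  # null false, true value p_true
    print(alpha, c, round(type1, 4), round(type2, 4))
```

As alpha shrinks, the cutoff rises, so the Type I probability falls while the Type II probability grows.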
7. Power Of A Statistical Test Chapter Four §3.4
It is possible that the alternative hypothesis is true, and yet the data look consistent with the null hypothesis. This usually happens because the actual value of the population parameter is close to the edge of the "null" hypothesis. In fact, an alternative hypothesis often contains many possibilities for many possible values of the parameter under study. Thus in Example A, the alternative hypothesis p>.4 contains the possibilities p=.41, p=.42, p=.5, p=.6, and many others. Some of these would give rise to sample data very consistent with p=.4, which is part of the null hypothesis p <= .4.
The ability to detect such a circumstance and establish an alternative hypothesis close to the null hypothesis is called the power of a test. Power is defined precisely as the probability of accepting the alternative hypothesis when it is true. This power depends on the size of the sample, the scale of the variable, and any special features of the population distribution which may be invoked by the test. In fact, a large sample can detect very small deviations from the null hypothesis. Such small deviations are considered statistically significant, but need not be practically significant. So it is important to have a sense for the size of effect that the test can discern.
When inferring the value of a population parameter, use the most powerful test which applies to the population and sample.
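To get a sense of how power depends on sample size and on the distance of the true value from the null hypothesis, the sketch below computes the power of an exact binomial test of p <= .4 against p > .4. The sample sizes 20 and 200, the level alpha = .05, and the choice of an exact binomial test are assumptions made for illustration.

```python
from math import comb

def upper_tail(c, n, p):
    """P(X >= c) for a binomial count X with n draws and probability p."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(c, n + 1))

def power(n, p_true, p0=0.4, alpha=0.05):
    """Probability of rejecting p <= p0 when the true value is p_true."""
    # Reject when X >= c, where c is the smallest cutoff whose
    # tail probability at the edge p0 does not exceed alpha.
    c = next(c for c in range(n + 2) if upper_tail(c, n, p0) <= alpha)
    return upper_tail(c, n, p_true)

for n in (20, 200):
    for p_true in (0.41, 0.5, 0.6):
        print(n, p_true, round(power(n, p_true), 3))
```

Values of p close to .4 give low power at either sample size, while the larger sample detects p = .5 far more reliably than the smaller one.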
8. Confidence Intervals
Whenever a random sample of size n is used to construct a confidence interval a <= m <= b, then the phrase with 95% confidence refers to the method of construction: if this same method were applied to all possible sample arrangements (of size n) then 95% of the resulting intervals would contain m.
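The "95% of the resulting intervals" interpretation can be approximated by simulation, drawing many random samples instead of listing all possible sample arrangements. This is only a sketch under stated assumptions: a normal population with known standard deviation, a large-sample Z interval, and hypothetical values for every number. The parameter written m in the text is called mu in the code.

```python
import random
from math import sqrt
from statistics import NormalDist

random.seed(1)  # fixed seed so the run is repeatable

# Hypothetical normal population with known mean and standard deviation.
mu, sigma, n, trials = 100.0, 15.0, 25, 2000
z_star = NormalDist().inv_cdf(0.975)   # about 1.96 for 95% confidence
half_width = z_star * sigma / sqrt(n)

hits = 0
for _ in range(trials):
    sample = [random.gauss(mu, sigma) for _ in range(n)]
    center = sum(sample) / n
    # Count the interval as a hit when it contains the true mean.
    if center - half_width <= mu <= center + half_width:
        hits += 1

coverage = hits / trials
print(round(coverage, 3))   # should land near 0.95
```

Any single interval either contains mu or does not; the 95% describes the long-run behavior of the method.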
The larger the level of confidence used to construct a confidence interval, the wider the confidence interval will be. The only way to avoid this problem is to increase the sample size. Before choosing a random sample, we can specify the level of confidence and the desired width of the confidence interval (or the half-width: the margin of error) and compute a sample size large enough so that such a confidence interval can be achieved. The limitation of this method is our ability to estimate the population standard deviation accurately in advance.
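A sample-size calculation of this kind might look like the sketch below. It assumes the usual large-sample Z interval, whose margin of error is z* times sigma divided by the square root of n; that formula, the function names, and the advance estimate sigma = 15 are assumptions not stated in the text.

```python
from math import ceil, sqrt
from statistics import NormalDist

def z_star(confidence):
    """Two-sided normal critical value, e.g. about 1.96 for 0.95."""
    return NormalDist().inv_cdf(1 - (1 - confidence) / 2)

def margin_of_error(confidence, sigma, n):
    """Half-width of the large-sample Z interval (assumed formula)."""
    return z_star(confidence) * sigma / sqrt(n)

def sample_size(confidence, sigma, margin):
    """Smallest n whose margin of error is at most `margin`."""
    return ceil((z_star(confidence) * sigma / margin) ** 2)

sigma = 15.0   # hypothetical advance estimate of the population standard deviation
n = sample_size(0.95, sigma, margin=2.0)
print(n, round(margin_of_error(0.95, sigma, n), 3))
print(round(margin_of_error(0.99, sigma, n), 3))  # same n, higher confidence: wider
```

If the advance estimate of sigma is too small, the achieved margin of error will be larger than planned, which is the limitation noted above.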
9. Z versus t Chapter Six §4.2
All large-sample methods (using Z) are based on the Central Limit Theorem, which says: the larger the sample, the more nearly the sample mean random variable is normally distributed. They assume accurate knowledge of sigma, such as s from a large sample.
Strictly speaking, t is used instead of Z when the population is normal and we are using the s from the sample at hand to compute a modified standard score. In such cases, it doesn't matter what the sample size n is, though small n makes t useful in a way that Z is not. But it is also customary to use t when the sample is large, even if the population is not known to be normal. The result will be similar to what would be obtained with Z, but allows for the uncertainty in using s for sigma.