Hypothesis Testing

Learning Objectives For This Lesson

Upon completion of this lesson, you should be able to:

  1. write null and alternative hypotheses for a test of a proportion or a mean
  2. calculate an appropriate test statistic and determine its p-value
  3. decide between the null and alternative hypotheses and state a "real world" conclusion
  4. distinguish between Type I and Type II errors and describe the power of a test

Hypothesis Testing

Previously we used confidence intervals to estimate an unknown population parameter. For example, we constructed 1-proportion confidence intervals to estimate the true population proportion, this population proportion being the parameter of interest. We even went as far as comparing two intervals to see whether they overlapped (if so, we concluded that there was no difference between the population proportions for the two groups) or checking whether an interval contained a specific parameter value.

Statistical Significance

A sample result is called statistically significant when the p-value for a test statistic is less than the level of significance, which for this class we will keep at 0.05. In other words, the result is statistically significant when we reject the null hypothesis.

Five Steps in a Hypothesis Test (Note: some texts will label these steps differently, but the premise is the same)

  1. Check any necessary assumptions and write null and alternative hypotheses.
  2. Calculate an appropriate test statistic.
  3. Determine a p-value associated with the test statistic.
  4. Decide between the null and alternative hypotheses.
  5. State a "real world" conclusion.

Now let’s try to tie together the concepts we discussed regarding Sampling and Probability to delve further into statistical inference with the use of hypothesis tests.

Two designs for producing data are sampling and experimentation, both of which should employ randomization. We have learned that randomization is advantageous because it controls bias. Now we will see another advantage: because chance governs our selection, we may make use of the laws of probability – the scientific study of random behavior – to draw conclusions about the entire population from which the units (e.g. students, machined parts, U.S. adults) originated. Again, this process is called statistical inference.

Previously we had defined population and sample and what we use to describe their values, but we will revisit these:

Parameter: a number that describes the population. It is fixed, but rarely do we know its value. (e.g. the true proportion of PSU undergraduates who would date someone of a different race.)

Statistic: a number that describes the sample. This value is known but can vary from sample to sample. For instance, from the Class Survey data we may get one proportion of those who said they would date someone of a different race, but if we gave that survey to another sample of PSU undergraduate students, do you really believe that the proportion from that sample would be identical to ours?

EXAMPLES

    1. A survey is carried out at a university to estimate the mean GPA of undergraduates living off campus current term. Population: all undergraduates at the university who live off campus; sample: those undergraduates surveyed; parameter: mean GPA of all undergraduates at that university living off campus; statistic: mean GPA of sampled undergraduates.

    2. A balanced coin is flipped 100 times and the percentage of heads is 47%. Population: all coin flips; sample: the 100 coin flips; parameter: 50% - the percentage of all coin flips that would result in heads if the coin is balanced; statistic: 47%.


Hypothesis Testing for a Proportion

Ultimately we will measure statistics (e.g. sample proportions and sample means) and use them to draw conclusions about unknown parameters (e.g. the population proportion and population mean). This process of using statistics to make judgments or decisions regarding population parameters is called statistical inference.

Example 2 above produced a sample proportion of 47% heads, which is written:

p̂ (read "p-hat") = 47/100 = 0.47

P-hat is called the sample proportion, and remember it is a statistic (soon we will look at sample means, x-bar). But how can p-hat be an accurate measure of p, the population parameter, when another sample of 100 coin flips could produce 53 heads? And for that matter, we only did 100 coin flips out of an uncountable possible total!

The fact that these samples will vary in repeated random sampling taken at the same time is referred to as sampling variability. The reason sampling variability is acceptable is that if we took many samples of 100 coin flips and calculated the proportion of heads in each sample, then constructed a histogram or boxplot of the sample proportions, the resulting shape would look normal (i.e. bell-shaped) with a mean of 50%.

[The reason we selected a simple coin flip as an example is that the concepts just discussed can be difficult to grasp, especially since earlier we mentioned that rarely is the population parameter value known. But most people accept that a coin will produce an equal number of heads as tails when flipped many times.]
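This claim is easy to check by simulation. The sketch below is a minimal example in Python using numpy (neither appears in the original lesson); it draws many samples of 100 fair-coin flips and summarizes the resulting sample proportions:

```python
import numpy as np

rng = np.random.default_rng(0)

# 5,000 samples of 100 fair-coin flips; record the proportion of heads in each
p_hats = rng.binomial(n=100, p=0.5, size=5000) / 100

print(p_hats.mean())  # close to 0.50, the true proportion
print(p_hats.std())   # close to sqrt(0.5 * 0.5 / 100) = 0.05
# a histogram of p_hats would look roughly bell-shaped, centered at 0.5
```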

A statistical hypothesis test is a procedure for deciding between two possible statements about a population. The phrase significance test means the same thing as the phrase "hypothesis test."

The two competing statements about a population are called the null hypothesis and the alternative hypothesis.

NOTATION: The notation Ho represents the null hypothesis and Ha represents the alternative hypothesis, while po (read "p-naught" or "p-zero") represents the hypothesized null value. Shortly, we will substitute μo for po when discussing a test of means.

Ho: p = po
Ha: p ≠ po     or     Ha: p > po     or     Ha: p < po    [Remember, only select one Ha]

The first Ha is called a two-sided test since "not equal" implies that the true value could be either greater than or less than the test value, po. The other two Ha are referred to as one-sided tests since they are restricting the conclusion to a specific side of po.

Example 3 – This is a test of a proportion:

A Tufts University study finds that 40% of 12th grade females feel they are overweight. Is this percentage lower for college age females? Let p = the proportion of college age females who feel they are overweight. The competing hypotheses are:

Ho: p = .40 (or greater) That is, no difference from the Tufts study finding.
Ha: p < .40 (the proportion feeling they are overweight is lower for college age females).

Example 4 – This is a test of a mean:

Is there a difference between the mean amount that men and women study per week? Competing hypotheses are:

Null hypothesis: There is no difference between the mean weekly hours of study for men and women, written in statistical language as μ1 = μ2
Alternative hypothesis: There is a difference between the mean weekly hours of study for men and women, written in statistical language as μ1 ≠ μ2

This notation is used since the study would consider two independent samples: one from Women and another from Men.

Test Statistic and p-value

A small p-value favors the alternative hypothesis. A small p-value means the observed data would not be very likely to occur if we believe the null hypothesis is true. So we believe in our data and disbelieve the null hypothesis. An easy (hopefully!) way to grasp this is to consider the situation where a professor states that you are just a 70% student. You doubt this statement and want to show that you are better than a 70% student. If you took a random sample of 10 of your previous exams and calculated the mean percentage of these 10 tests, which mean would be less likely to occur if in fact you were a 70% student (the null hypothesis): a sample mean of 72% or one of 90%? Obviously the 90% would be less likely and therefore would have a small probability (i.e. p-value).

Using the p-value to Decide between the Hypotheses

In general, the smaller the p-value the stronger the evidence is in favor of the alternative hypothesis.

EXAMPLE 3 CONTINUED:

In a recent elementary statistics survey, the sample proportion (of women) saying they felt overweight was 37/129 = .287. Note that this leans toward the alternative hypothesis that the "true" proportion is less than .40. [Recall that the Tufts University study finds that 40% of 12th grade females feel they are overweight. Is this percentage lower for college age females?]

Step 1: Let p = proportion of college age females who feel they are overweight.

Ho: p = .40 (or greater) That is, no difference from the Tufts study finding.
Ha: p < .40 (the proportion feeling they are overweight is lower for college age females).

Step 2:

If npo ≥ 10 and n(1 – po) ≥ 10, then we can use the following Z-test statistic. Since both (129)(0.4) = 51.6 and (129)(0.6) = 77.4 are greater than 10 [or consider that the numbers of successes and failures, 37 and 92 respectively, are both at least 10], we calculate the test statistic by:

$$z = \frac{\hat{p} - p_0}{\sqrt{\frac{p_0(1 - p_0)}{n}}}$$

Note: In computing the Z-test statistic for a proportion, we use the hypothesized value po, not the sample proportion p-hat, in calculating the standard error! We do this because we "believe" the null hypothesis to be true until evidence says otherwise.

$$z = \frac{0.287 - 0.40}{\sqrt{\frac{0.40(1 - 0.40)}{129}}} = \frac{-0.113}{0.0431} \approx -2.62$$

Step 3: The p-value can be found from Standard Normal Table

Calculating p-value:

The method for finding the p-value is based on the alternative hypothesis:

2P(Z ≥ | z | ) for Ha : p ≠ po
P(Z ≥ z ) for Ha : p > po
P(Z ≤ z) for Ha : p < po

In our example we are using Ha : p < .40 so our p-value will be found from P(Z ≤ z) = P(Z ≤ -2.62) and from Standard Normal Table this is equal to 0.0044.

Step 4: We compare the p-value to alpha, which we will set at 0.05. Since 0.0044 is less than 0.05, we reject the null hypothesis and decide in favor of the alternative, Ha.

Step 5: We’d conclude that the percentage of college age females who felt they were overweight is less than 40%. [Note: we are assuming that our sample, since not random, is representative of all college age females.]

The p-value= .004 indicates that we should decide in favor of the alternative hypothesis. Thus we decide that less than 40% of college women think they are overweight.

The "Z-value" (-2.62) is the test statistic. It is a standardized score for the difference between the sample p and the null hypothesis value p = .40. The p-value is the probability that the z-score would lean toward the alternative hypothesis as much as it does if the true population really was p = .40.

Using Software to Perform a One Proportion Test Analysis Using Raw Data

To perform a one proportion test analysis in Minitab using raw data:

  1. Open Minitab data set Class_Survey.MTW
  2. Go to Stat > Basic Stat > 1- proportion
  3. Click the radio button for Samples in Columns (this is the default)
  4. Click the text box under this title (cursor should be in this box)
  5. Select from the variables list the variable Gender (be sure the variable Gender appears in the text box)
  6. Check the box for Perform Hypothesis Test and enter 0.5 (note that for Minitab versions earlier than 15 this test is found under the Options)
  7. Click Options and select the correct Alternative (e.g. not equal to)
  8. Check the box for Use Test and Interval Based on Normal Distribution (remember to verify this use by checking that the number of successes and failures are at least ten)
  9. Click OK twice

This should result in the following output:

Minitab output

To perform a one proportion test analysis in SPSS using raw data:

  1. Import data Class_Survey.XLS into SPSS
  2. Since the variable Gender has text responses (i.e. Male, Female) we need to recode this variable into a numeric. We will use 1 to represent Male and 0 for Female.
  3. Go to Transform > Recode Into Different Variables
  4. Enter Gender into the Output Variable Window
  5. In the text box under Output Variable labeled Name: enter Male
  6. Click Change
  7. Click the button Old and New Values
  8. Under Old Value click Value and type in Male
  9. Under New Value enter in the Value text box the value 1
  10. Under Old → New click Add
  11. Repeat steps 8 through 10, typing in Female and 0
  12. Click Continue
  13. Click OK (you should now have a new column of ones and zeroes titled Male)
  14. Go to Analyze > Nonparametric Tests > Binomial
  15. Enter the variable Male into the text box for Test Variable List
  16. The Test Proportion value defaults to 0.5; if this is not correct, change it
  17. Click OK

This should result in the following output:

spss output
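The same recode-then-test sequence can be sketched in Python with pandas and scipy (these tools, and the exact layout of the survey file, are assumptions, not part of the original lesson):

```python
import pandas as pd
from scipy.stats import binomtest

# assumes the lesson's survey file with a Gender column of "Male"/"Female" text values
survey = pd.read_excel("Class_Survey.xls")
male = (survey["Gender"] == "Male").astype(int)   # recode: 1 = Male, 0 = Female

# exact binomial test of H0: p = 0.5, the same test SPSS's Binomial procedure performs
result = binomtest(int(male.sum()), n=len(male), p=0.5)
print(result.pvalue)
```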

Using Software to Perform a Summarized One Proportion Test Analysis

To perform a summarized one proportion test analysis in Minitab:

  1. Open Minitab without data
  2. Go to Stat > Basic Stat > 1- proportion
  3. Click the radio button for Summarized Data
  4. Enter 37 for Number of Events and 129 for Number of Trials
  5. Check the box for Perform Hypothesis Test and enter 0.4 (note that for Minitab versions earlier than 15 this test is found under the Options)
  6. Click Options and select the correct Alternative (e.g. less than)
  7. Check the box for Use Test and Interval Based on Normal Distribution (remember to verify this use by checking that the number of successes and failures are at least ten)
  8. Click OK twice

This should result in the following output:

Minitab output

To perform a summarized one proportion test analysis in SPSS:

  1. Open SPSS without data
  2. Enter in the first empty cell the number of successes, 37
  3. Enter in the cell below that one the number of failures, 92
  4. Click Data > Weight Cases
  5. Click the radio button Weight Cases By and enter in the text box the variable of interest from the variable list (there should be only one variable, VAR00001, if you started with an empty data set)
  6. Click OK
  7. Go to Analyze > Nonparametric Tests > Binomial
  8. Enter the variable of interest into the Test Variable List
  9. Change the test proportion value to 0.4
  10. Click OK
  11. NOTE: SPSS does not provide a method based on the normal approximation (even though the notation in the output references a Z approximation); SPSS uses exact methods based on the binomial distribution. However, the hypothesis setup, decision rules, and conclusion follow the same approach as the normal-approximation (z) method.

This should result in the following output:

spss output
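SPSS's exact binomial calculation can be mirrored with a short scipy sketch (scipy is an assumption here, not part of the lesson):

```python
from scipy.stats import binomtest

# 37 successes in 129 trials; H0: p = 0.4 vs. Ha: p < 0.4 (exact binomial test)
exact = binomtest(37, n=129, p=0.4, alternative='less')
print(round(exact.pvalue, 4))   # an exact p-value close to the 0.0044 normal approximation
```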



Hypothesis Testing for a Mean

Quantitative Response Variables and Means

We usually summarize a quantitative variable by examining the mean value. We summarize categorical variables by considering the proportion (or percent) in each category. Thus we use the methods described in this handout when the response variable is quantitative. Again, examples of quantitative variables are height, weight, blood pressure, pulse rate, and so on.

Null and Alternative Hypotheses for a Mean

As with proportions, the null hypothesis states that the mean equals a specified value μo, and the alternative takes one of three forms:

Ho: μ = μo
Ha: μ ≠ μo     or     Ha: μ > μo     or     Ha: μ < μo    [Remember, only select one Ha]

Test Statistics

The test statistic for examining hypotheses about one population mean:

$$t = \frac{\bar{x} - \mu_0}{s/\sqrt{n}}$$

where x-bar is the observed sample mean, μ0 is the value specified in the null hypothesis, s is the standard deviation of the sample measurements, and n is the sample size (for a test of a mean difference, the number of differences).

Notice that the numerator of the statistic is the difference between the sample mean and the null hypothesis value. The denominator is the standard error of the mean.

It is a convention that a test using a t-statistic is called a t-test. That is, hypothesis tests using the statistic above would be referred to as "1-sample t tests".
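The formula translates directly into code. Here is a minimal Python helper (an illustration, not part of the lesson), demonstrated with summary values that appear later in this lesson:

```python
from math import sqrt

def one_sample_t(xbar, mu0, s, n):
    """t statistic for testing H0: mu = mu0, with df = n - 1."""
    return (xbar - mu0) / (s / sqrt(n))

print(one_sample_t(76.8, 72, 11.62, 35))   # about 2.44 (these summary values are rounded)
```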

Finding the p-value

Recall that a p-value is the probability that the test statistic would "lean" as much (or more) toward the alternative hypothesis as it does if the real truth is the null hypothesis.

When testing hypotheses about a mean or mean difference, a t-distribution is used to find the p-value. This is a close cousin to the normal curve. T-Distributions are indexed by a quantity called degrees of freedom, calculated as df = n – 1 for the situation involving a test of one mean or test of mean difference.

The p-values for the t-distribution are found in your text, or a copy can be found at the following link: T-Table. To interpret the table, use the column under DF to find the correct degrees of freedom. Use the top row under Absolute Value of t-Statistic to locate your calculated t-value. Most likely you will not find an exact match for your t-value, so locate the range for your t-value. This means that your t-value will be either less than 1.28; between two t-statistics in the table; or greater than 3.00. Once you have located the range, find the corresponding p-value(s) associated with your range of t-statistics. This would be the p-value you compare to an alpha of 0.05.

NOTE: the t-statistics increase from left to right, but the p-values decrease! So if your range for the t-statistic is greater than 3.00 your p-value would be less than the corresponding p-value listed in the table.

Examples of reading the T-Table follow [recall that degrees of freedom for a 1-sample t are equal to n − 1, or one less than the sample size]; the table is read as p-value = P(T ≥ |t|). NOTE: If this formula appears familiar it should, as it closely resembles the one for finding probability values using the Standard Normal Table with z-values.

  1. If you had a sample of size 15, resulting in DF = 14 and t-value = 1.20, your t-value range would be less than 1.28, producing a p-value of p > 0.111. That is, since P(T ≥ 1.28) = 0.111, the probability P(T ≥ 1.20) is greater than 0.111.
  2. If you had a sample of size 15, resulting in DF = 14 and t-value = 1.95, your t-value range would be from 1.80 to 2.00, producing a p-value of 0.033 < p < 0.047. That is, the probability P(T ≥ 1.95) is between 0.033 and 0.047.
  3. If you had a sample of size 15, resulting in DF = 14 and t-value = 3.20, your t-value range would be greater than 3.00, producing a p-value of p < 0.005. That is, since P(T ≥ 3.00) = 0.005, the probability P(T ≥ 3.20) is less than 0.005.

NOTE: The increments for the degrees of freedom in the T-Table are not always 1. This column increases by 1 up to DF = 30, then the increments change. If your DF is not found in the table, just go to the nearest DF. Also, note that the last row, "Infinite", displays the same p-values as those found in the Standard Normal Table. This is because as n increases, the t-distribution approaches the standard normal distribution.
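Software makes the table lookup exact. The sketch below uses scipy (an assumption, not part of the lesson) to compute the one-tail areas for the three examples above:

```python
from scipy.stats import t

df = 14  # a sample of size 15
for t_stat in (1.20, 1.95, 3.20):
    p = t.sf(t_stat, df)   # upper-tail area P(T >= t_stat)
    print(t_stat, round(p, 4))

# 1.20 -> about 0.125 (greater than 0.111)
# 1.95 -> about 0.036 (between 0.033 and 0.047)
# 3.20 -> about 0.003 (less than 0.005)
```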


Using Software to Perform a One Mean Test Analysis Using Raw Data

Example:

Students measure their pulse rates. Is the mean pulse rate for college age women equal to 72 (a long-held standard for average pulse rate)?

Null hypothesis: μ = 72
Alternative hypothesis: μ ≠ 72

Pulse rates for n = 35 women are available.
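Before turning to the point-and-click steps, here is a minimal Python sketch of the same test using scipy; the pulse values below are made up for illustration, since the lesson's raw data file is not reproduced here:

```python
import numpy as np
from scipy.stats import ttest_1samp

# hypothetical readings standing in for the n = 35 recorded pulse rates
pulse = np.array([68, 75, 80, 71, 90, 62, 77, 84, 73, 69,
                  88, 76, 81, 70, 95, 66, 74, 79, 85, 72,
                  78, 83, 65, 91, 74, 70, 86, 77, 82, 68,
                  75, 89, 73, 80, 76])

t_stat, p_value = ttest_1samp(pulse, popmean=72)   # two-sided by default
print(round(float(t_stat), 2), round(float(p_value), 3))
```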

To perform a one mean hypothesis test in Minitab using raw data:

  1. Open the data set
  2. Go to Stat > Basic Statistics > 1-Sample t
  3. Click inside the text area for Samples in Columns, then select the variable of interest (e.g. the pulse rate variable) and move it into that text box
  4. Click the check box for Perform Hypothesis Test and enter the hypothesized value into the text box for Hypothesized Mean (e.g. 72 for the pulse example above)
  5. Click Options. Here you can select the correct alternative hypothesis (the default is not equal to - keep that for now)
  6. Click OK
  7. Click OK

This should result in the following output:

minitab output

To perform a one mean hypothesis test in SPSS:

  1. Import the data into SPSS
  2. Go to Analyze > Compare Means > One-Sample T Test
  3. Select the variable of interest (e.g. the pulse rate variable) and move it into the text box for Test Variable(s)
  4. Enter in the text box for Test Value the hypothesized value being tested (e.g. 72 for the pulse example above)
  5. Click OK

Special Note: SPSS performs all tests as two-sided. If you are interested in a one-sided alternative (e.g. "greater than"), you would have to divide the p-value in half.

This should result in the following output:

spss output

spss output

Using Software to Perform a Summarized One Mean Test Analysis

To perform a one mean hypothesis test in Minitab using summarized data:

  1. Open Minitab
  2. Go to Stat > Basic Statistics > 1-Sample t
  3. Click the radio button for Summarized Data
  4. Enter the appropriate values (e.g. Sample Size: 35, Mean: 76.8, Standard Deviation: 11.62)
  5. Click the check box for Perform Hypothesis Test and enter the hypothesized value into the text box for Hypothesized Mean (e.g. 72)
  6. Click Options. Here you can select correct alternative hypothesis (default is not equal to - keep that for now)
  7. Click OK
  8. Click OK

This should result in the following output:

minitab output

SPSS cannot perform a hypothesis test for a mean using summarized data.

INTERPRETATION:

The p-value is p = 0.019. This is below the .05 standard, so the result is statistically significant. This means we decide in favor of the alternative hypothesis. We're deciding that the population mean is not 72.

The test statistic is

$$t = \frac{\bar{x} - \mu_0}{s/\sqrt{n}} = \frac{76.8 - 72}{11.62/\sqrt{35}} \approx 2.47$$

Because this is a two-sided alternative hypothesis, the p-value is the combined area to the right of 2.47 and the left of −2.47 in a t-distribution with 35 – 1 = 34 degrees of freedom.
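The same computation from summary statistics, sketched in Python with scipy (an assumption, not part of the lesson):

```python
from math import sqrt
from scipy.stats import t

n, xbar, s, mu0 = 35, 76.8, 11.62, 72
t_stat = (xbar - mu0) / (s / sqrt(n))       # about 2.44 from these rounded summaries
p_value = 2 * t.sf(abs(t_stat), n - 1)      # two-sided: area in both tails
print(round(t_stat, 2), round(p_value, 3))  # Minitab's t = 2.47, p = 0.019 come from less-rounded values
```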


Example 2:

In the same "survey" there were n = 57 men. Is the mean pulse rate for college age men equal to 72?

Null hypothesis: μ = 72
Alternative hypothesis: μ ≠ 72

RESULTS:

Minitab output

INTERPRETATION:

The p-value is p = 0.236. This is not below the .05 standard, so we do not reject the null hypothesis. Thus it is possible that the true value of the population mean is 72. The 95% confidence interval suggests the mean could be anywhere between 67.78 and 73.06.

The test statistic is

$$t = \frac{\bar{x} - \mu_0}{s/\sqrt{n}} = -1.20$$

(from the Minitab output for the n = 57 men)

The p-value is the combined probability that a t-value would be less than (to the left of) −1.20 plus the probability that it would be greater than (to the right of) +1.20.


Errors, Practicality and Power in Hypothesis Testing

Errors in Decision Making – Type I and Type II

How do we determine whether to reject the null hypothesis? It depends on the level of significance α, which is the probability of the Type I error.

What is Type I error and what is Type II error?

When doing hypothesis testing, two types of mistakes may be committed and we call them Type I error and Type II error.

Decision \ Reality           H0 is true       H0 is false
Reject H0 and conclude Ha    Type I error     Correct
Do not reject H0             Correct          Type II error

If we reject H0 when H0 is true, we commit a Type I error. The probability of a Type I error is denoted by alpha, α (as we already know, this is commonly 0.05).

If we accept H0 when H0 is false, we commit a Type II error. The probability of a Type II error is denoted by beta, β.

Our convention is to set up the hypotheses so that type I error is the more serious error.

Example 1: Mr. Orangejuice goes to trial, where he is being tried for the murder of his ex-wife.

We can put it in a hypothesis testing framework. The hypotheses being tested are:

  1. Mr. Orangejuice is guilty
  2. Mr. Orangejuice is not guilty

Set up the null and alternative hypotheses where rejecting the null hypothesis when the null hypothesis is true results in the worst scenario:

H0 : Not Guilty
Ha : Guilty

Here we put "Mr. Orangejuice is not guilty" in H0 since we consider false rejection of H0 a more serious error than failing to reject H0. That is, finding an innocent person guilty is worse than finding a guilty person innocent.

Type I error is committed if we reject H0 when it is true. In other words, when Mr. Orangejuice is not guilty but found guilty.

α = probability( Type I error)

Type II error is committed if we accept H0 when it is false. In other words, when Mr. Orangejuice is guilty but found not guilty.

β = probability( Type II error)

Relation between α and β

Note that the smaller we specify the significance level α, the larger the probability β of accepting a false null hypothesis will be.

Cautions About Significance Tests

  1. If a test fails to reject Ho, it does not necessarily mean that Ho is true – it just means we do not have compelling evidence to refute it. This is especially true for small sample sizes n. To grasp this, if you are familiar with the judicial system you will recall that when a judge/jury renders a decision the decision is "Not Guilty". They do not say "Innocent". This is because you are not necessarily innocent, just that you haven’t been proven guilty by the evidence, (i.e. statistics) presented!
  2. Our methods depend on a normal approximation. If the underlying distribution is not normal (e.g. heavily skewed, several outliers) and our sample size is not large enough to offset these problems (think of the Central Limit Theorem from Chapter 9) then our conclusions may be inaccurate.

Power of a Test

When the data indicate that one cannot reject the null hypothesis, does it mean that one can accept the null hypothesis? For example, when the p-value computed from the data is 0.12, one fails to reject the null hypothesis at α = 0.05. Can we say that the data support the null hypothesis?

Answer: When you perform hypothesis testing, you only set the size of the Type I error and guard against it. Thus, we can only present the strength of evidence against the null hypothesis. One can sidestep the concern about the Type II error if the conclusion never mentions that the null hypothesis is accepted. When the null hypothesis cannot be rejected, there are two possible cases: 1) one can accept the null hypothesis, or 2) the sample size is not large enough to either accept or reject the null hypothesis. To make the distinction, one has to check β. If β at a likely value of the parameter is small, then one accepts the null hypothesis. If β is large, then one cannot accept the null hypothesis.

The relationship between α and β:

If the sample size is fixed, then decreasing α will increase β. If one wants both α and β to decrease, then one has to increase the sample size.

Power = the probability of correctly rejecting a false null hypothesis = 1 − β.
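Power is easy to estimate by simulation. The sketch below, in Python with numpy and scipy (all values are hypothetical, chosen only for illustration), repeatedly samples from a population where H0 is false and counts how often the test rejects:

```python
import numpy as np
from scipy.stats import ttest_1samp

rng = np.random.default_rng(1)

# hypothetical scenario: true mean 76, sd 12, testing H0: mu = 72 with n = 35
mu0, mu_true, sigma, n, alpha = 72, 76, 12, 35, 0.05

trials = 10_000
rejections = 0
for _ in range(trials):
    sample = rng.normal(mu_true, sigma, n)
    if ttest_1samp(sample, mu0).pvalue < alpha:
        rejections += 1

print(rejections / trials)   # estimated power: P(reject H0 | H0 is false)
# setting mu_true = 72 instead would estimate alpha, the Type I error rate
```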

© 2008 The Pennsylvania State University. All rights reserved.