Two-Sample Hotelling's T-Square

This lesson is concerned with the two sample Hotelling's T2 test. This test is the multivariate analog of the two sample t-test in univariate statistics. These two sample tests, in both cases are used to compare two populations. Two populations may correspond to two different treatments within an experiment.

For example, in a completely randomized design with two treatments, are randomly assigned to the experimental units. Here, we would like to distinguish between the two treatments. Another situation occurs where the observations are taken from two distinct populations of a sample units. But in either case, there is no pairing of the observations as was in the case where the paired Hotelling's T2 was applied.

Example: Swiss Bank Notes

This is an example where we are sampling from two distinct population occurs with the Swiss Bank Notes.

Here, we have observations that are taken from two populations of 1,000 franc Swiss bank notes.

1. The first population is the population of Genuine Bank Notes

2. The second population is the population of Counterfeit Bank Notes

While the above is an example of a more recent version issue of the note, it depicts the different measurement locations taken in this study. For both population of bank notes the following measurements were taken:

Objective: To determine if counterfeit notes can be distinguished from the genuine Swiss bank notes.

This is essential for the police if they wish to be able to use these kinds of measurements to determine if a bank notes are genuine or not and help solve counterfeiting crimes.

Before considering the multivariate case, we will first consider the univariate case.


Univariate Case

Suppose we have data from a single variable from two populations:

The data will be denoted in Population 1 as: X11,X12,...,X1n1

The data will be denoted in Population 2 as: X21,X22,...,X2n2

For both populations the first subscript will denote which population the note is from. The second subscript will denote which observation we are looking at from each population.

Here we will make the standard assumptions:

  1. The data from population i is sampled from a population with mean μi. This assumption simply means that there are no sub-populations to note.
  2. Homoskedasticity: The data from both populations have common variance σ2
  3. Independence: The subjects from both populations are independently sampled.
  4. Normality: The data from both populations are normally distributed.

Here we are going to consider testing, , that both populations have have equal population means, against the alternative hypothesis that the means are not equal.

We shall define the sample means for each population using the following expression:

We will let s2i denote the sample variance for the ith population, again calculating this using the usual formula below:

Assuming homoskedasticity, both of these sample variances, s21 and s22, are both estimates of the common variance σ2. A better estimate can be obtained, however, by pooling these two different estimates. This calculating will be done by calculating the pool variance given by the expression below:

Here each sample variance is given a weight equal to the sample size less 1 so that the pool variance is simply equal to n1 - 1 times the variance of the first population plus n2 - 1 times the variance of the second population, divided by the total sample size minus 2.

Our test statistic is the students' t-statistic which is calculated by dividing the difference in the sample means by the standard error of that difference. Here the standard error of that difference is given by pooled variance times the sum of the inverses of the sample sizes as shown below:

under the null hypothesis, Ho of the equality of the population means, this test statistic will be t-distributed with n1 + n2 - 2 degrees of freedom.

We will reject Ho at level α if the absolute value of this test statistic exceeds the critical value from the t-table with n1 + n2 - 2 degrees of freedom evaluated at α over 2..

All of this should be familiar to you from your introductory statistics course.

Next, let's consider the multivariate case...


Multivariate Case

In this case we are replacing the random variables Xij, for the jth sample for the ith population, with random vectors Xij, for the jth sample for the ith population. These vectors contain the observations from the p variable.

In our notation, we will have our two populations:

The data will be denoted in Population 1 as: X11,X12,...,X1n1

The data will be denoted in Population 2 as: X21,X22,...,X2n2

Here the vector Xij represents all of the data for all of the variables for sample unit j, for population i.

This vector contains elements Xijk where k runs from 1 to p, for p different variables that we are observing. So, Xijk is the observation for variable k of subject j from population i.

The assumptions here will be analogous to the assumptions in the univariate setting.

Assumptions:

  1. The data from population i is sampled from a population with mean vector μi. Again, corresponding to the assumption that there are no sub-populations.
  2. Instead of assuming homoskedsticity, we now see that the data from both populations have common variance-covariance matrix Σ, instead of simply saying that they have a common variance.
  3. Independence. The subjects from both populations are independently sampled.
  4. Normality. Both populations are normally distributed.

Consider testing the null hypothesis that the two populations have identical population mean vectors. This is represented below as well as the general alternative that the mean vectors are not equal.

So here what we are testing is:

Or, in other words...

For the null hypothesis, that is fine only if the population means are identical for all of the variables.

The alternative is that at least one pair of these means is different. This is expressed below:

This could be different for only one or it could be all of them.

Here we will carry out the test. We will define the sample mean vectors, calculated the same way as before, using data only from the ith population.

Similarly, using data only from the ith population, we will define the sample variance-covariance matrices:

Under out assumption of homogeneous variance-covariance matrices, both S1 and S2 are estimators for the common variance-covariance matrix Sigma. A better estimate can be obtained by pooling the two estimates using the expression below:

Again, the sample variance-covariance matrix is weighted by the sample sizes plus 1.


Two-sample Hotelling's T-Square

Now we are ready to define the Two-sample Hotelling's T-Square test statistic. As in the expression below you will note that this is all computation of differences in the sample mean vectors. It also involves a calculation of the pooled variance-covariance matrix multiplied by the sum of the inverses of the sample size. The resulting matrix is then inverted.

For large samples, this test statistic will be approximately chi-square distributed with p degrees of freedom. However, as before this approximation does not take into account the variation due to estimating the variance-covariance matrix. So, as before, we will look at transforming this Hotelling's T2 statistic into an F-statistic using the following expression. Note that this is a function of the sample sizes of the two populations and the number of variables measured, p.

Under the null hypothesis, Ho : μ1 = μ2 this F-statistic will be F-distributed with p and n1 + n2 - p degrees of freedom. We would reject Ho at level α if it exceeds the critical value from the F-table evaluated at α.

 


Example: Swiss Bank Notes

The two sample Hotelling's T2 test can be carried out using the Swiss Bank Notes data using the SAS program swiss10.sas.

SAS Program Discussion - swiss10.sas

options ls=78;
title "2-Sample Hotellings T2 - Swiss Bank Notes";
data swiss;
  infile "c:/WINDOWS/Temp/swiss3.txt";
  input type $ length left right bottom top diag;
  run;
proc iml;
  start hotel2;
    n1=nrow(x1);
    n2=nrow(x2);
    k=ncol(x1);
    one1=j(n1,1,1);
    one2=j(n2,1,1);
    ident1=i(n1);
    ident2=i(n2);
    ybar1=x1`*one1/n1;
    s1=x1`*(ident1-one1*one1`/n1)*x1/(n1-1.0);
    print n1 ybar1;
    print s1;
    ybar2=x2`*one2/n2;
    s2=x2`*(ident2-one2*one2`/n2)*x2/(n2-1.0);
    print n2 ybar2;
    print s2;
    spool=((n1-1.0)*s1+(n2-1.0)*s2)/(n1+n2-2.0);
    print spool;
    t2=(ybar1-ybar2)`*inv(spool*(1/n1+1/n2))*(ybar1-ybar2);
    f=(n1+n2-k-1)*t2/k/(n1+n2-2);
    df1=k;
    df2=n1+n2-k-1;
    p=1-probf(f,df1,df2);
    print t2 f df1 df2 p;
  finish;
  use swiss;
    read all var{length left right bottom top diag} where (type="real") into x1;
    read all var{length left right bottom top diag} where (type="fake") into x2;
  run hotel2;
		  

Here, we start with the usual options and title statement. The data are entered using data step swiss. The data are stored in the file swiss3.txt and are in seven different columns. The first column gives the type of note. Note the dollar sign that appears after the word type indicating that this is an alphanumeric variable. This is then followed by the six measurements on each note, length, left width, right width, bottom, top and diagonal.

The iml program used for carrying out Hotelling's T2 test is called hotel2. When using this program there is no need to change anything between the start statement and the finish statement that appears in this iml procedure. The only thing you need to change is what appears are the bottom of the procedure.

First of all, you need to specify which SAS dataset you want to analyze. In this case the name of the SAS dataset is called swiss, so we are going to use swiss.

Then you need to read in the data for the two types of notes into x1 and x2 using the read statement. In both cases, we are going to read all. In both cases the variables we want to analyze are specified inside the curly brackets: { }

The where specifies which subset of the data we want to input into x1 and x2. So in the first of these two read statements we going to read where type is equal to "real" into x1, "real" for the genuine bank notes. The second of these lines where type is equal to "fake" into x2. It is completely arbitraty which of these two you read into x1 or x2, you just need to keep track of which ones are read.

Finally at the end you run hotel2.

The output is given in swiss10.lst.

SAS Output Discussion - swiss10.lst

The top of the page you see that N1 is equal to 100 indicating that we have 100 bank notes in the first sample. In this case 100 real or genuine notes.

This is then followed by the sample mean vector for this population of notes. The sample mean vectors is copied into the table below:

Means
Variable
Genuine
Counterfeit
Length
214.969
214.823
Left Width
129.943
130.300
Right Width
129.720
130.193
Bottom Margin
8.305
10.530
Top Margin
10.168
11.133
Diagonal
141.517
139.450

Below this appears elements of the sample variance-covariance matrix for this first sample of notes, that is for the real or genuine notes. Those numbers are also copied into matrix that appears below:

Sample variance-covariance matrix for genuine notes:

The next item listed on the output we have the information for the second sample. First appears the sample mean vector for the second population of back notes. These results are copied into the table above.

The sample variance-covariance for the second sample of notes, the countefeit note, is given under S2 in the output and also copied below:

Sample variance-covariance matrix for the counterfeit notes:

This is followed by the pooled variance-covariance matrix for the two sample of notes. For example, the pooled variance for the length is about 0.137. The pooled variance for the left-hand width is about 0.099. The pooled covariance between length and left-hand width is 0.045. You should be able to see where all of these numbers appear in the output. These results are also copied from the output and placed below:

The two-sample Hotelling's T2 statistic is 2412.45. The F-value is about 391.92 at 6 and 193 degrees of freedom. The 6 is equal to the number of variables being measured and the 193 comes from the fact that we have 100 sample for a total of 200. We subtract the number of variables and get 194, and then subtract 1 more to get 193. In this case the p-value is close to 0, here we will write this as 0.0001.

In this case we can reject the null hypothesis that the mean vector for the counterfeit notes equals the mean vector for the genuine notes giving the evidence as usual: (T2 = 2412. 45; F = 391. 92; d. f. = 6, 193; p < 0.0001)

Conclusion:

Our conclusion here is that the counterfeit notes can be distinguished from the genuine notes.

After concluding that the counterfeit notes can be distinguished from the genuine notes the next step in our analysis is to determine which variables they are different???

All we can conclude at this point is that two types of notes differ on at least one variable. It could be just one variable, or a combination of more than one variable. Or, potentially, all of the variables.

To assess which variable these notes differ on we will consider the (1 - α) × 100% Confidence Ellipse for the difference in the population mean vectors for the two population of bank notes, μ1 - μ2. This ellipse is given by the expression below:

On the left hand of the expression we have a function of the differences between the sample mean vectors for the two populations and the pooled variance-covariance matrix Sp, as well as the two sample sizes n1 and n2. On the right hand side of the equation we have a function of the number of variables, p, the sample sizes n1 and n2, and the critical value from the F-table with p and n1 + n2 - p - 1 degrees of freedom, evaluated at α.

To understand the geometry of this ellipse, let λ1 through λp, below:

denote the eigenvalues of the pooled variance-covariance matrix Sp, and let

e1, e2,..., ep

denote the corresponding eigenvectors. Then the kth axis of this p dimensional ellipse points in the direction specified by the kth eigenvector, ek.

And, it has a half-length given by the expression below:

Note, again, that this is a function of the number of variables, p, the sample sizes n1 and n2, and the the critical value from the F-table.

The (1 - α) × 100% confidence ellipse yields simultaneous (1 - α) × 100% confidence intervals for all linear combinations of the form given in the expression below:

So, these are all linear combinations of the differences in the sample means between the two populations variable where we are taking linear combinations to cross variables. This is expressed in three terms in long hand from the far left hand side of this expression above, simplified using a summation sign in the middle term, and finally in vector notation in the final term to the right.

These confidence intervals are given by the expression below:

This involves the same linear combinations of the differences in the sample mean vectors between the two populations, plus or minus the first radical term which contains the function of the sample sizes and the number variables times the critical value from the F-table. The second radical term is the standard error of this particular linear combination.

Here, the terms denote the pooled covariances between variables k and l.

Interpretation: We with the single sample Hotelling's T2 test the interpretation of these confidence intervals are the same. Here, we are (1 - α) × 100% confident that all of these intervals cover their respective linear combinations of the differences between the means of the two populations. In particular, we are also (1 - α) × 100% confident that all of the intervals of the individual variables also cover their respective differences between the population means. For the individual variables we are looking at, say, the kth individual variable we are looking at the difference between the sample means for that variable, k, plus or minus the same radical term that we had in the expression previously, times the standard error of that difference between the sample means for the kth variable which involves the inverses of the sample sizes and the pooled variance for variable k.

So, here, is the pooled variance for variable k. These intervals are called simultaneous confidence intervals.

Let's work through an example of their calculation using the Swiss Bank Notes data.


Example: Swiss Bank Notes

An example of the calculation of simultaneous confidence intervals using the Swiss Bank Notes data is given in the expression below:

Here we note that the sample sizes are both equal to 100, n = n1 = n1 = 100, so there is going to be some sort of location??? of our formula as shown by the first term in the calculation above.

Let's look at our results of the differences between the genuine notes, the lengths of the genuine notes minus the lengths of the counterfeit notes.

The sample mean for the length of the genuine notes was 214.969. The sample mean for the length of the countefeit notes was 214.823. We add and subtract, the radical term. Here p is equal to 6 and n is equal to 100 for each set of bank notes. The critical value from the F-table, with in this case, 6 and 193 degrees of freedom was 2.14. The standard error of the difference in these sample means is given by the second radical where we have 2 times the pooled variance for the length, which was 0.137, looking at the variance covariance matrix, and n again is 100.

Carrying out the math we end up with an interval that runs from -0.044 to 0.336 as shown above.

The SAS program swiss11.sas can be used to compute the simultaneous confidence intervals for the 6 variables.

SAS Program Discussion - swiss11.sas

options ls=78;
title "Confidence Intervals - Swiss Bank Notes";
%let p=6;
data swiss;
  infile "c:/WINDOWS/Temp/swiss3.txt";
  input type $ length left right bottom top diag;
  run;
data real;
  set swiss;
  if type="real";
  variable="length";   x=length; output;
  variable="left";     x=left;   output;
  variable="right";    x=right;  output;
  variable="bottom";   x=bottom; output;
  variable="top";      x=top;    output;
  variable="diagonal"; x=diag;   output;
  keep type variable x;
  run;
proc sort;
  by variable;
  run;
proc means noprint;
  by variable;
  id type;
  var x;
  output out=pop1 n=n1 mean=xbar1 var=s21;
data fake;
  set swiss;
  if type="fake";
  variable="length";   x=length; output;
  variable="left";     x=left;   output;
  variable="right";    x=right;  output;
  variable="bottom";   x=bottom; output;
  variable="top";      x=top;    output;
  variable="diagonal"; x=diag;   output;
  keep type variable x;
  run;
proc sort;
  by variable;
  run;
proc means noprint;
  by variable;
  id type;
  var x;
  output out=pop2 n=n2 mean=xbar2 var=s22;
data combine;
  merge pop1 pop2;
  by variable;
  f=finv(0.95,&p,n1+n2-&p-1);
  t=tinv(1-0.025/&p,n1+n2-2);
  sp=((n1-1)*s21+(n2-1)*s22)/(n1+n2-2);
  losim=xbar1-xbar2-sqrt(&p*(n1+n2-2)*f*(1/n1+1/n2)*sp/(n1+n2-&p-1));
  upsim=xbar1-xbar2+sqrt(&p*(n1+n2-2)*f*(1/n1+1/n2)*sp/(n1+n2-&p-1));
  lobon=xbar1-xbar2-t*sqrt((1/n1+1/n2)*sp);
  upbon=xbar1-xbar2+t*sqrt((1/n1+1/n2)*sp);
  run;
proc print;
  run;
		  

This program starts with the usual options and titles statements.

Next, we must specifiy the number of variables to be analyzed. In this case we specify this by using the statement %let p = 6.

The data are entered using the data step swiss. The data are stored in the file swiss3.txt. This dataset has columns for type of note, length, left, right, bottom, top and diagonal for the 6 variables.

The next thing we must do is pull out the data for the real bank notes for one of the two populations. This is going to be accomplished using the data step "real". Here we set swiss, where we can take all of the information from swiss and put it into real. Then with the if statement we are going to select the real bank notes by writing if type = "real" using quotes. This dataset as in the previous programs for computing simultaneous confidence intervals we need a separate line of data for each of the variables. So, the next 6 lines of this program specifies the information for each of the variables. First we define the name of the variable, by setting variable = length, left right and so on on each of these 6 lines. Then we set the x equal to the name that we gave to each of the variables in the input statement above. So, for length we set x=length, for left-hand width we set x=left and so on.

The final, third column here in these 6 lines is the output statement which causes SAS to create a line in the SAS dataset.

The keep statement below the 6 lines of code tells SAS that we just want to keep the information on the type of note, the variable name and the value of the variable x.

This is followed by a sort statement where the data are going to be sorted by variable.

The means procedure will be given noprint option causing SAS not to actually print out any of the results for the means at this time. The means are going to be calculated by variable. We set id type so that they type of population is brought along with the SAS dataset that we are going to create here. The variable statement specifies x as the variable we want to analyze. And the results are going to be output into a new SAS dataset using the output statement, where we specify out=pop1 for the first population of bank notes. Then we specify sample size for this population n to be equal to n1, the mean to be equal to xbar1, and the variance to be equal to s21.

Following this means procedure we are going to do the same thing for the "fake" or counterfeit notes. Again, we create a new SAS dataset called "fake". We set swiss into fake to bring the original variables into that dataset. We set if type="fake" then select these fake bank notes.

The next 6 lines, just as above, create separate lines of data for each of the variables and again, we will have it keep type variable in x.

The data in this dataset "fake" will be sorted by variable, and we will compute means by variable, again with the noprint option so that no actual results are printed at this time.

The results are going to be output in the SAS dataset "pop2" with sample size set to n2, with means equal to xbar2 and the variances equal to s22.

The results of the two means procedures are going to be combined into a single dataset called "combined". This is carried out using the merge statement which merges dataset pop1 and pop2 which were created by the two means procedures that preceeded.

These datasets are going to be merged by variable. This is to make sure that everything is going to line up.

The following line f=inv denotes f inverse, computes the critical value from the F-table, in this case with p and n1+n2-p-1 degrees of freedom.

Skipping the line for t, which we will return to later when we talk about Bonferroni intervals, sp calculates the pooled variance for each variable. This weights the individual variances for the individual populations by their sample sizes less 1.

losim and upsim computer the lower and upper bounds for the simultaneous confidence intervals for the variables.

lobon and upbon will compute similar information for the Bonferroni intervals.

The results are listed in swiss11.lst.

SAS Output Discussion - swiss11.lst

The results are given in columns for losim and upsim. Those entries are copied into the table below:

Variable
95% Confidence Interval
Length
-0.044, 0.336
Left Width
-0.519, -0.195
Right Width
-0.642, -0.304
Bottom Margin
-2.698, -1.752
Top Margin
-1.295, -0.635
Diagonal
1.807, 2.327

You need to be careful where they appear in the table in the output. Note that the variables are now sorted in alphabetic order. For example, length would be the fourth line of the output data.

The interval for length can then be seen to be -0.044 to 0.336 as was obtained from the hand calculations previously. In any case you should be able to find the numbers for the lower and upper bound of the simultaneous confidence intervals from the SAS output and see where they appear in the table above.

When interpreting these intervals we need to see which intervals include 0, which ones fall entirely below 0, and which ones fall entirely above 0.

The first thing that we notice is that interval for length includes 0. This suggests that we can not distinguish between the lengths of the counterfeit and genuine bank notes. The intervals for both width measurements fall below 0.

Since these intervals are being calculated by taking the genuine notes minus the counterfeit notes this would suggest that the counterfeit notes are larger on these variables and we can conclude that the left and right margins of the counterfeit notes are wider than the genuine notes.

Similarly we can conclude that the top and bottom margins of the counterfeit are also too large. Note, however, that the interval for the diagonal measurements fall entirely above 0. This suggests that the diagonal measurements of the counterfeit notes are smaller than that of the genuine notes.

Conclusions:


Profile Plots

Simultaneous confidence intervals may be plotted using swiss12.sas.

SAS Program Discussion - swiss12.sas

options ls=78;
title "Confidence Intervals - Swiss Bank Notes";
%let p=6;
data swiss;
  infile "c:/WINDOWS/Temp/swiss3.txt";
  input type $ length left right bottom top diag;
  run;
data real;
  set swiss;
  if type="real";
  variable="length";   x=length; output;
  variable="left";     x=left;   output;
  variable="right";    x=right;  output;
  variable="bottom";   x=bottom; output;
  variable="top";      x=top;    output;
  variable="diagonal"; x=diag;   output;
  keep type variable x;
  run;
proc sort;
  by variable;
  run;
proc means noprint;
  by variable;
  id type;
  var x;
  output out=pop1 n=n1 mean=xbar1 var=s21;
data fake;
  set swiss;
  if type="fake";
  variable="length";   x=length; output;
  variable="left";     x=left;   output;
  variable="right";    x=right;  output;
  variable="bottom";   x=bottom; output;
  variable="top";      x=top;    output;
  variable="diagonal"; x=diag;   output;
  keep type variable x;
  run;
proc sort;
  by variable;
  run;
proc means noprint;
  by variable;
  id type;
  var x;
  output out=pop2 n=n2 mean=xbar2 var=s22;
data combine;
  merge pop1 pop2;
  by variable;
  f=finv(0.95,&p,n1+n2-&p-1);
  sp=((n1-1)*s21+(n2-1)*s22)/(n1+n2-2);
  diff=xbar1-xbar2; output;
  diff=xbar1-xbar2-sqrt(&p*(n1+n2-2)*f*(1/n1+1/n2)*sp/(n1+n2-&p-1)); output;
  diff=xbar1-xbar2+sqrt(&p*(n1+n2-2)*f*(1/n1+1/n2)*sp/(n1+n2-&p-1)); output;
  run;
filename t1 "swiss12.ps";
goptions device=ps300 gsfname=t1 gsfmode=replace;
proc gplot;
  axis1 length=4 in;
  axis2 length=6 in;
  plot diff*variable / vaxis=axis1 haxis=axis2 vref=0 lvref=21;
  symbol v=none i=hilot color=black;
  run;
		  

This program is similar to the program swiss11.sas. It contains the same options and title statement. Let sets the number of parameters, p=6.

The data are input into using datastep swiss.

They are split into separate datasets, real and fake notes. For each set of notes we split out separate lines for each of the variables. Sort the data by variables and apply a means procedure to compute variable means and variances.

The data are combined at the bottom in the datastep combine which computes the information required for the plot.

The plotting is carried out in the gplot procedure.

The results are shown in the plot of simultaneous confidence intervals and from this plot it is easy to see how the two sets of bank notes differ. It is immediately obvious that the intervals for length includes 0, and that the intervals for bottom, left, right and top are all below 0, while the interval for diagonal measurements falls above 0.

Since we are taking the genuine minus the counterfeit notes, this would suggest that both the left and right margins of the counterfeit notes are larger than those of the genuine notes. The bottom and top margins are also larger for the counterfeit notes than they are for the genuine notes. Finally, the diagonal measurements are smaller for the counterfeit notes than for the genuine notes.


As in the one-sample case, the simlutaneous confindence intervals should be computed only when we are centrally interested in linear combinations of the variables. If the only thing that is of interest, however, is the confidence intervals for the individual variables with no linear combinations, then a better approach is to calculate the Bonferroni corrected confidence intervals which are given in the expression below:

Bonferroni Corrected (1 - α) x 100% Confidence Intervals

These, again, involve the difference in the population means for each of the variables, plus or minus the critical value from the t-table times the standard error of the difference in these sample means. The t-table is evaluated from a t-distribution with n1+n2-2 degrees of freedom, evaluated at α divided by 2p where p is the number of variables to be analyzed. The radical term gives the standard error of the difference in the sample mean and involves the pooled variance for the kth variable times the sum of the inverses of the sample sizes.

For length of the bank notes that expression can be simplified to the expression that follows since the sample sizes are identical. The average length of the genunine notes was 214.959 from which we subtract the average length from the counterfeit notes, 214.823. As for the t-table, we will be looking it up for 2 times the sample size minus 2 or 198 degrees of freedom at the critical value for 0.05 divided by 2 times the number of variables, 6, or 0.05/12. The critical value turns out to be about 2.665. The standard error is obtained by taking 2 times the pooled variance, 0.137 divided by 100. Carrying out the math we wnd up with an interval that runs from 0.006 to 0.286 as shown below:

These calculations can also be obtained from the SAS program swiss11.sas.

SAS Program Discussion - swiss11.sas

...

data combine;
  merge pop1 pop2;
  by variable;
  f=finv(0.95,&p,n1+n2-&p-1);
  t=tinv(1-0.025/&p,n1+n2-2);
  sp=((n1-1)*s21+(n2-1)*s22)/(n1+n2-2);
  losim=xbar1-xbar2-sqrt(&p*(n1+n2-2)*f*(1/n1+1/n2)*sp/(n1+n2-&p-1));
  upsim=xbar1-xbar2+sqrt(&p*(n1+n2-2)*f*(1/n1+1/n2)*sp/(n1+n2-&p-1));
  lobon=xbar1-xbar2-t*sqrt((1/n1+1/n2)*sp);
  upbon=xbar1-xbar2+t*sqrt((1/n1+1/n2)*sp);
  run;
proc print;
  run;
		  

Looking at the datastep combine and moving down, (right), we can see that under combine the fourth line sets t=tinv. This calculates the critical value from the t-table as desired. Then the lower and upper bounds for the Bonferroni intervals are calculated under lobon and upbon at the bottom of this datastep.

The output as given in swiss11.lst, in the columns for lobon and upbon.

SAS Output Discussion - swiss11.lst

Again, make sure you note that the variables are given in alphabetical order rather than in the original order of the data. In any case, you should be able to see where the numbers appearing in the SAS output appear in the table below:

In summary, we have:

Variable
95% Confidence Interval
Length
0.006, 0.286
Left Width
-0.475, -0.239
Right Width
-0.597, -0.349
Bottom Margin
-2.572, -1.878
Top Margin
-1.207, -0.723
Diagonal
1.876, 2.258

The intervals are interpreted in a way similar as before. Here we can see that:

Conclusions:


Profile Plots

Profile plots provide another useful graphical summary of the data. These are only meaningful if all variables have the same units of measurement. They are not meaningful if the the variables units measurement. For example, some variables are measured in grams while other variables are measured in centimeters. In this case profile plots should not be constructed.

Plot 1 shows the profile plot for the swiss bank notes data. In this plot we can see that we have the variables listed on the x-axis and the means for each of the variables is given on the y-axis. The variable means for the fake notes are given by the circle, while the variable means for the real notes are given by the squares. These data points are then connected by straight line segments.

This plot was obtained by the swiss13a.sas.

SAS Program Discussion - swiss13a.sas

The program starts off with the usual options and title statement.

The data are input as before in previous programs.

This plot requires separate lines of data for each variable, so these are created as before in the 6 lines of code following the input statement.

We create a new variable called variable under which we specify the name we want to give each variable, in this case, length, left, right bottom, top and diag. We set x = to the variable we called in the input statement, using the same names. Then after the semicolon we have an output statement.

The data are then sorted by type in variable.

Then the means are calculated by type and variable in the means procedure. You will note that the by statement appears just below the means procedure and has the same format as the by statement that appears in the sort statement.

The variable we want to calculate the means on in this case is x, created in the dataset swiss.

The results are going to be output to a new SAS dataset called "a". Here the means are just going to be called xbar.

Finally, in the plot procedure we will plot xbar by variable equals type. The axis statement specifies the appearance we want to give the axis. In this case we want to give y-axis 4 inches long, the x-axis 6 inches long, and we want to label the y-axis with the name "Mean".

The symbol statements correspond to each of the two populations. Symbol 1 setting v=J and f=special means use a special font. Capital J must be specified here which asks for closed circles. h=2 indicates that we want to make the closed circles twice as large as the default setting. i=join means we want to connect these dots and we need to specify that the color will be black.

Under symbol 2 I set v=K to specify squares. The remaining parts of the statement are the same as for the first symbol.

 

Returning to our plot, we note that the two population of the bank notes are plotted right on top of one another, so this plot is not particularly useful in this particular example. This is not very informative for the Swiss bank notes, since the variation among the variable means far exceeds the variation between the group means within a given variable,

This plot can be obtained by the swiss13b.sas.

SAS Program Discussion - swiss13b.sas

Comparing the two programs we notice only one difference between the programs. If you look in the data step where the values for the variable x is specified, in each case we are going to subtract off the government specifications. So, for length we are subtracting off the government specification of 215. In left and right we subtract off 130, and so on.

 

The results can be found in Plot 2. From this plot we can see that the bottom and top margins of the counterfeit notes are larger than the corresponding mean for the genuine notes. Likewise, the left and right margins are also wider for the counterfeit notes than the genuine notes. However, the diagonal and length measurement for the counterfeit notes appear to be smaller than the genuine notes. Please note, however, this plot does not show which results are significant. Significance would be obtained from the previous simultaneous or Bonferroni confidence intervals.

One of the things that we look for in these plots is to see if the line segments joining the dots are parrallel to one another. In this case, they do not appear to be even close to being parrellel for any pair of variables.


Profile Analysis

Profile Analysis is used to test the null hypothesis that these line segments are indeed parrallel. If the variables have the same units of measurement, it may be appropriate to test this hypothesis that the line segments in the profile plot are parallel to one another. They might be expected to be parrellel in the case where all of the measurements for the counterfeit notes were consistently some constant larger than the measurements for the genuine notes.

To test this null hypothesis we use the following procedure:

Step 1: For each group, we create a new random vector Yij corresponding to the jth observation from population i. The elements in this vector are the differences between the values of the successive variables as shown below:

Step 2: Apply the two-sample Hotelling's T-square to the data Yij to test the null hypothesis that the means of the Yij's for population 1 and the same as the means of the Yij's for population 2. In shorthand this reads as follows:


Testing for Parrallel Profiles

To test for parrallel profiles may be carried out using the SAS program swiss14.sas.

SAS Program Discussion - swiss14.sas

options ls=78;
title "Profile Analysis - Swiss Bank Notes";
data swiss;
  infile "c:/WINDOWS/Temp/swiss3.txt";
  input type $ length left right bottom top diag;
  diff1=left-length;
  diff2=right-left;
  diff3=bottom-right;
  diff4=top-bottom;
  diff5=diag-top;
  run;
proc iml;
  start hotel2;
    n1=nrow(x1);
    n2=nrow(x2);
    k=ncol(x1);
    one1=j(n1,1,1);
    one2=j(n2,1,1);
    ident1=i(n1);
    ident2=i(n2);
    ybar1=x1`*one1/n1;
    s1=x1`*(ident1-one1*one1`/n1)*x1/(n1-1.0);
    print n1 ybar1;
    print s1;
    ybar2=x2`*one2/n2;
    s2=x2`*(ident2-one2*one2`/n2)*x2/(n2-1.0);
    print n2 ybar2;
    print s2;
    spool=((n1-1.0)*s1+(n2-1.0)*s2)/(n1+n2-2.0);
    print spool;
    t2=(ybar1-ybar2)`*inv(spool*(1/n1+1/n2))*(ybar1-ybar2);
    f=(n1+n2-k-1)*t2/k/(n1+n2-2);
    df1=k;
    df2=n1+n2-k-1;
    p=1-probf(f,df1,df2);
    print t2 f df1 df2 p;
  finish;
  use swiss;
    read all var{diff1 diff2 diff3 diff4 diff5} where (type="real") into x1;
    read all var{diff1 diff2 diff3 diff4 diff5} where (type="fake") into x2;
  run hotel2;
		  

Turning our attention to the data step swiss where the data are input, we will note that we have created 5 new variables, diff1 through diff5. Each variable is the difference between a pair of the original variables. Basically the pairs, in successive orders, thus diff1 is set to be left minus length, diff2 is right minus left, diff3 is bottom - right and so on.

Again, we use the iml program hotel2 to carry out the analysis.

At the bottom of the program we specify that we want to use swiss, the name of the SAS dataset. In this case, we are going to read all of the variables, diff1 through diff5 into x1 and x2.

x1 recieves those variables where the type is equal to real, while x2 recieves those variables where the type is equal to fake.

The results, (swiss14.lst), yield the Hotelling T2 of 2356.38 with a corresponding F-value of 461.76. Since there are 5 differences we will have 5 for the numerator degrees of freedom and the denominator degrees of freedom is equal to the total number of observations of 200, (100 of each type), minus 5 - 1, or 194. The p-value is very close to 0 indicating that we can reject the null hypothesis.

Conclusion: We reject the null hypothesis of parallel profiles between genuine and counterfeit notes (T2 = 2356. 38; F = 461. 76; d. f = 5, 194; p < 0. 0001).


Model Assumptions and Diagnostics Assumptions:

In carrying out any statistical analysis it is always important to consider the assumptions under which that analysis was carried out. Also, to assess where those assumptions may be satisfied for this data.

Let's recall the four assumptions underlying the Hotelling's T2 test.

  1. The data from population i is sampled from a population with mean vector μi.
  2. The data from both populations have common variance-covariance matrix Σ
  3. Independence. The subjects from both populations are independently sampled. (Note that this does not mean that the variables are independent of one another.)
  4. Normality. Both populations are multivariate normally distributed.

The following will consider each of these assumptions for diagnosing their validity.

Assumption 1: The data from population i is sampled from a population mean vector μi.

Assumption 2: For now we will skip Assumption 2 and return to it at a later time.

Assumption 3: Independence

Assumption 4: Multivariate Normality

To assess this assumption we can produce employ the following diagnostic procedures:

Note that the Central Limit Theorem implies that the sample mean vectors are going to be approximately multivariate normally distributed regardless of the distribution of the original variables.

So, in general Hotelling's T-square is not going to be sensitive to violations of this assumption.

Now let us return to assumption 2.


Assumption 2. The data from both populations have common variance-covariance matrix Σ.

This assumption may be assessed using Barlett's Test.

Bartlett's Test

Suppose that the data from population i have variance-covariance matrix Σi; for population i = 1, 2. What we wish to do is to test the null hypothesis that Σ1 is equal to Σ2 against the general alternative that they are not equal as shown below:

Here, the alternative is that the variance-covariance matrices differ in at least one of their elements.

The test statistic for Bartlett's Test is given by L prime as shown below:

This involves a finite population correction factor c, which is given below. Inside the brackets above, we see it involves also the determinants of the sample variance-covariance matrices for the individual populations as well as the pooled sample variance for the variance matrix. So, what we have in the brackets is the total number of observations minus 2 times the log of the determinant of the pooled variance-covariance matrix, minus n1-1 times the log of the determinant of the sample variance-covariance matrix for the first population, minus n2-1 times the log of the determinant of the sample variance-covariance matrix for the second population. (Note that is this formula, the logs are all the natural logs.)

The finite population correction factor, c, is given below:

It is a function of the number of variables p, and the sample sizes n1 and n2.

Under the null hypothesis, Ho : Σ1 = Sigma;2 , the Bartlett's test statistic is going to be approximately chi-square distributed with p(p+1) divided by 2 degrees of freedom.

We will reject Ho at level α if the test statistic exceeds the critical value from the chi-square table evaluated at level α.

Bartlett's Test may be carried out using the SAS program swiss15.sas.

SAS Program Discussion - swiss15.sas

Here the data are input as usual.

Bartlett's test can be carried out using the discriminant analysis procedure, "proc discrim". Here to ask for Bartlett's Test we set pool = test. Below this procedure we see a class statement where type is specified. Here, the option type corresponds to the name we gave to the two populations of bank notes, in this case the type of note, fake or real.

The variable statement below the class statement lists the variables that are included in the analysis. In this case, length, left, right, bottom, top and diag.

options ls=78;
title "Bartlett's Test - Swiss Bank Notes";
data swiss;
  infile "c:/WINDOWS/Temp/swiss3.txt";
  input type $ length left right bottom top diag;
  run;
proc discrim pool=test;
  class type;
  var length left right bottom top diag;
  run;
		  

The output for swiss15.lst on the first page just gives information.

SAS Output Discussion - swiss15.lst

On the top we can see that we have 200 observations on 6 variables and we have two classes for populations. DF total or total degrees of freedom is the total number of observations minus 1, 199. The DF within classes is basically the total sample size minus 2, in this case 198.

The class level information is not particualrly useful at this time, but it does tell us that we have 100 observations on each type of note.

Within Covariance Matrix observations gives us basically the size of the two variance-covariance matrices, which in this case are 6 by 6 matrices corresponding to the 6 variables in our analyses. It also gives the natural log of the determinant of the variance-covariance matrices. For the fake notes the natural log of the determinant of the covariance matrix is -10.79, for the real notes the natural log of the determinant of the covariance matrix is -11.2, and for the pooled the natural log of the determinant of the covariance matrix is -10.3.

Under the null hypothesis that the variance-covariance matrices for the two populations natural logs of the determinants, and the variance-covariance matrixes should be approximately the same for the fake and the real notes.

The results of Bartlett's Test are on bottom of page two of the output. Here we get a test statistic of 121.90 with 21 degrees of freedom, the 21 coming from the 6 variables. The p-value for the test is less than 0.0001 indicating that we reject the null hypothesis.

The conclusion here is that the two populations of bank notes have different variance-covariance matrices in at least one of their elements. This is backed up by the evidence given by the test statistic (L' = 121. 899; d. f. = 21; p < 0. 0001). Therefore, the assumption of homogeneous variance-covariance matrices is violated.

Notes:


Testing for Equality of Mean Vectors when Σ1 ≠ Sigma;2

The following considers test for equality of the population mean vectors under violations of the assumption homogenity of variance-covariance matrices.

Here we will consider the modified Hotelling's T2 test statistic given in the expression below:

Again, this is a function of the differences between the sample means for the two populations. But instead of being a function of the pooled variance-covariance matrix we can see that the modified test statistic is written as a function of the sample variance-covariance matrix, S1, for the first population and the sample variance-covariance matrix, S2, for the second population. It is also a function of the sample sizes n1 and n2.

For large samples, that is if both samples are large, T2 is approximately chi-square distributed with p d.f. We will reject Ho : μ1 = μ2 at level α if T2 exceeds the critical value from the chi-square table with p d.f. evaluated at level α.

For small samples, we can calculate an F transformation as before using the formula below.

This formula is a function of sample sizes n1 and n2, and the number of variables p. Under the null hypothesis this will be F-distributed with p and approximately ν degrees of freedom, where 1 divided by ν is given by the formula below:

This is involves summing over the two samples of bank notes, a function of the number of observations of each sample, the difference in the sample mean vectors, the sample variance-covariance matrix for each of the individual samples, as well as a new matrix ST which is given by the expression below:

We will reject Ho : μ1 = μ2 at level α if the F-value exceeds the critical value from the F-table with p and ν degrees of freedom evaluated at level α.

A reference for this particular test is given in: Seber, G.A.F. 1984. Multivariate Observations. Wiley, New York.

This modified Hotelling's T2 test can be carried out on the Swiss Bank Notes data using the SAS program swiss16.sas.

SAS Program Discussion - swiss16.sas

options ls=78;
title "2-Sample Hotellings T2 - Swiss Bank Notes (unequal variances)";
data swiss;
  infile "c:/WINDOWS/Temp/swiss3.txt";
  input type $ length left right bottom top diag;
  run;
proc iml;
  start hotel2;
    n1=nrow(x1);
    n2=nrow(x2);
    k=ncol(x1);
    one1=j(n1,1,1);
    one2=j(n2,1,1);
    ident1=i(n1);
    ident2=i(n2);
    ybar1=x1`*one1/n1;
    s1=x1`*(ident1-one1*one1`/n1)*x1/(n1-1.0);
    print n1 ybar1;
    print s1;
    ybar2=x2`*one2/n2;
    s2=x2`*(ident2-one2*one2`/n2)*x2/(n2-1.0);
    st=s1/n1+s2/n2;
    print n2 ybar2;
    print s2;
    t2=(ybar1-ybar2)`*inv(st)*(ybar1-ybar2);
    df1=k;
    p=1-probchi(t2,df1);
    print t2 df1 p;
    f=(n1+n2-k-1)*t2/k/(n1+n2-2);
    temp=((ybar1-ybar2)`*inv(st)*(s1/n1)*inv(st)*(ybar1-ybar2)/t2)**2/(n1-1);
    temp=temp+((ybar1-ybar2)`*inv(st)*(s2/n2)*inv(st)*(ybar1-ybar2)/t2)**2/(n2-1);
    df2=1/temp;
    p=1-probf(f,df1,df2);
    print f df1 df2 p;
  finish;
  use swiss;
    read all var{length left right bottom top diag} where (type="real") into x1;
    read all var{length left right bottom top diag} where (type="fake") into x2;
  run hotel2;
		  

This program is used in the same way as previous programs for the two sample Hotelling's T2 test.

It starts with the usual options and title statement.

The data are input in the data step swiss as before.

The program for carrying out this modified Hotelling's T2 test is called hotel2m. It is run under the iml procedure.

To use this iml code at the bottom of the program after the finish statement specify the name of the SAS dataset you wish to analyze, in this case we will use swiss. We want to read all of the data for all the real notes into x1 in the first read statement. Here we are reading all of the variables as listed. Similarly we will read into x2 where type is equal to fake.

This program carries out both the large sample approximation and the small sample approximation test.

The output is given in the file (swiss16.lst).

SAS Output Discussion - swiss16.lst

As before, we are given the sample sizes for each population, the sample mean vector for each population, followed by the sample variance-covariance matrix for each population.

In the large sample approximation we find that T2 is 2412.45 with 6 degrees of freedom, since we have 6 variables and a p-value that is close to 0.

Note that this value for the Hotelling's T2 is identical to the value that we obtained for our un-modified test. This will always be the case if the sample sizes are equal to one another.


Simultaneous (1 - α) x 100% Confidence Intervals

As before, the next step is to determine how these notes differ. This may be carried out using the simultaneous (1 - α) × 100% confidence intervals.

For Large Samples: simultaneous (1 - α) × 100% confidence intervals may be calculated using the expression below:

This involves the differences in the sample means for the kth variable, plus or minus the square root of the critical value from the chi-square table times the sum of the sample variances divided by their respective sample sizes.

For Small Samples: it is better use the expression below:

Basically the chi-square value and the square root is replaced by the critical value from the F-table, times a function of the number variables p, and the sample sizes n1 and n2.

Example: Swiss Bank Notes

An example of the large approximation for length is given by the hand calculation in the expression below:

Here the sample mean for the length for the genuine notes was 214.969. We will subtract the sample mean for the length of the counterfeit notes of 214.823. The critical value for a chi-square distribution with 6 degrees of freedom evaluated at 0.05 is 12.59. The sample variance for the first population of genuine note is 0.15024 which we will divide by a sample size of 100. The sample variance for the second population of counterfeit note is 0.15024 which will also divide by its sample size of 100. This yields the confidence interval that runs from -0.04 through 0.332.

The results of these calculations for each of the variables are summarized in the table below. Basically, they give us results that are comparable to the results we obtained earlier under the assumption of homgenity for variance-covariance matrices.

Variable
95% Confidence Interval
Length
-0.040, 0.332
Left Width
-0.515, -0.199
Right Width
-0.638, -0.308
Bottom Margin
-2.687, -1.763
Top Margin
-1.287, -0.643
Diagonal
1.813, 2.321

 

Click on the "Next" above, to continue this lesson.

© 2004 The Pennsylvania State University. All rights reserved.