Cluster Analysis

Ward’s Method

This is an alternative approach for performing cluster analysis. Basically, it looks at cluster analysis as an analysis of variance problem, instead of using distance metrics or measures of association.

This method involves an agglomerative clustering algorithm. It starts at the leaves and works its way to the trunk, so to speak: it looks for groups of leaves that it forms into branches, the branches into limbs, and eventually into the trunk. Ward's method starts with n clusters of size 1 and continues until all the observations are included in one cluster.

This method is most appropriate for quantitative variables, and not binary variables.

Based on the notion that clusters of multivariate observations should be approximately elliptical in shape, we assume that the data from each of the clusters are realizations of a multivariate distribution. It follows that they would fall into an elliptical shape when plotted in a p-dimensional scatter plot.

The notation that we will use is as follows: Let X_ijk denote the value for variable k in observation j belonging to cluster i.

Furthermore, this particular method relies on two quantities: the error sum of squares and the r² value.
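As a sketch, the usual forms of these quantities can be written in the notation above. The symbols for the cluster mean of variable k in cluster i and for the grand mean of variable k over all observations are assumptions introduced here for illustration, as is the total sum of squares (TSS), which is needed to express r²:

```latex
\begin{aligned}
\text{ESS} &= \sum_{i}\sum_{j}\sum_{k} \left( X_{ijk} - \bar{x}_{i \cdot k} \right)^2
  && \text{(error sum of squares; } \bar{x}_{i \cdot k} \text{ is the mean of variable } k \text{ in cluster } i\text{)} \\
\text{TSS} &= \sum_{i}\sum_{j}\sum_{k} \left( X_{ijk} - \bar{x}_{\cdot \cdot k} \right)^2
  && \text{(total sum of squares; } \bar{x}_{\cdot \cdot k} \text{ is the grand mean of variable } k\text{)} \\
r^2 &= \frac{\text{TSS} - \text{ESS}}{\text{TSS}}
  && \text{(proportion of variation explained by the clustering)}
\end{aligned}
```

Because TSS does not depend on how the observations are grouped, a smaller ESS always corresponds to a larger r².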

Using Ward's method, we start out with all sample units in n clusters of size 1 each. In the first step of the algorithm, n - 1 clusters are formed, one of size 2 and the remaining of size 1. The error sum of squares and r² values are then computed. The pair of sample units that yields the smallest error sum of squares or, equivalently, the largest r² value forms the first cluster.

Then, in the second step of the algorithm, n - 2 clusters are formed from the n - 1 clusters defined in step 1. These may include two clusters of size 2, or a single cluster of size 3 that includes the two items clustered in step 1. Again, the value of r² is maximized.

Thus, at each step of the algorithm, clusters or observations are combined in such a way as to minimize the error sum of squares or, alternatively, maximize the r² value. The algorithm stops when all sample units are combined into a single large cluster of size n.
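The steps described above can be sketched in a few lines of Python. This is a minimal, brute-force illustration and not the SAS implementation: the function names ess and ward, the n_clusters argument, and the toy data are all hypothetical, and r² is computed as 1 - ESS/TSS under the assumption that TSS is the error sum of squares of the single all-inclusive cluster.

```python
from itertools import combinations


def ess(clusters):
    """Error sum of squares: squared deviations of each value from its cluster mean."""
    total = 0.0
    for members in clusters:
        p = len(members[0])  # number of variables
        means = [sum(obs[k] for obs in members) / len(members) for k in range(p)]
        total += sum((obs[k] - means[k]) ** 2 for obs in members for k in range(p))
    return total


def ward(data, n_clusters=1):
    """Merge clusters one pair at a time, always choosing the merge with the smallest ESS."""
    clusters = [[tuple(obs)] for obs in data]   # start with n clusters of size 1
    tss = ess([[tuple(obs) for obs in data]])   # ESS of one cluster holding everything
    history = []                                # (number of clusters, ESS, r2) per step
    while len(clusters) > n_clusters:
        best = None
        for a, b in combinations(range(len(clusters)), 2):
            # Candidate partition: clusters a and b merged, all others unchanged.
            merged = [c for i, c in enumerate(clusters) if i not in (a, b)]
            merged.append(clusters[a] + clusters[b])
            candidate_ess = ess(merged)
            if best is None or candidate_ess < best[0]:
                best = (candidate_ess, merged)
        best_ess, clusters = best
        r2 = 1.0 - best_ess / tss if tss > 0 else 1.0
        history.append((len(clusters), best_ess, r2))
    return clusters, history


# Hypothetical toy data: five observations on two variables.
toy = [(1.0, 1.0), (1.2, 0.9), (5.0, 5.1), (5.2, 4.8), (9.0, 1.0)]
clusters, history = ward(toy, n_clusters=3)
for size, step_ess, r2 in history:
    print(f"{size} clusters: ESS = {step_ess:.3f}, r2 = {r2:.3f}")
```

Recomputing the full ESS for every candidate merge keeps the sketch close to the verbal description, at the cost of speed. Stopping at n_clusters greater than 1 corresponds to cutting the tree at that number of clusters.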


Example: Woodyard Hammock Data

We will take a look at the implementation of Ward's Method using the SAS program wood5.sas.

SAS Program - wood5.sas

This program is very similar to the previous program (wood1.sas) that was discussed earlier in this lesson. The only difference is that we have specified method=ward in the cluster procedure. The tree procedure is used to draw the tree diagram, as well as to assign cluster identifications. Here we will look at four clusters.

SAS tree plot - Ward's Method

We decided earlier that we wanted four clusters, so we placed the break in the plot accordingly and highlighted the resulting clusters. There appear to be two very well-defined clusters, because there is a fairly large break between the first and second branches of the tree. The partitioning yields four clusters of sizes 31, 24, 9, and 8.

Referring back to the SAS output, we have copied the results of the ANOVAs here for discussion.

Results of ANOVAs

Code     Species               F        p-value
carcar   Ironwood              67.42    < 0.0001
corflo   Dogwood               2.31     0.0837
faggra   Beech                 7.13     0.0003
ileopa   Holly                 5.38     0.0022
liqsty   Sweetgum              0.76     0.5188
maggra   Magnolia              2.75     0.0494
nyssyl   Blackgum              1.36     0.2627
ostvir   Blue Beech            32.91    < 0.0001
oxyarb   Sourwood              3.15     0.0304
pingla   Spruce Pine           1.03     0.3839
quenig   Water Oak             2.39     0.0759
quemic   Swamp Chestnut Oak    3.44     0.0216
symtin   Horse Sugar           120.95   < 0.0001

d.f. = 3, 68

The species whose F-values are significant after a Bonferroni correction are Ironwood, Beech, Holly, Blue Beech, and Horse Sugar.
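The Bonferroni screening can be checked with a short sketch. The family-wise significance level of 0.05 is an assumption, since the text does not state the level used, and each p-value reported as "< 0.0001" is entered as its upper bound of 0.0001.

```python
# p-values copied from the ANOVA table; "< 0.0001" is entered as the upper bound 0.0001.
p_values = {
    "Ironwood": 0.0001,
    "Dogwood": 0.0837,
    "Beech": 0.0003,
    "Holly": 0.0022,
    "Sweetgum": 0.5188,
    "Magnolia": 0.0494,
    "Blackgum": 0.2627,
    "Blue Beech": 0.0001,
    "Sourwood": 0.0304,
    "Spruce Pine": 0.3839,
    "Water Oak": 0.0759,
    "Swamp Chestnut Oak": 0.0216,
    "Horse Sugar": 0.0001,
}

alpha = 0.05                        # assumed family-wise level, not stated in the text
threshold = alpha / len(p_values)   # Bonferroni: divide by the number of tests (13)

significant = [species for species, p in p_values.items() if p < threshold]
print(f"Per-test threshold: {threshold:.5f}")
print("Significant species:", ", ".join(significant))
```

Under this assumption, several species with p-values below 0.05 (Magnolia, Sourwood, and Swamp Chestnut Oak) do not survive the correction, because the per-test threshold is 0.05 / 13.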

Next, we will look at the cluster means for these significant species:

         Cluster
Code     1       2       3       4
carcar   2.8     18.5    1.0     7.4
faggra   10.6    6.0     5.9     6.4
ileopa   7.5     4.3     12.3    7.9
ostvir   5.4     3.1     18.3    7.5
symtin   1.3     0.7     1.4     18.8

Here, the largest mean in each row shows the cluster in which that species is most abundant: Beech in cluster 1, Ironwood in cluster 2, Holly and Blue Beech in cluster 3, and Horse Sugar in cluster 4.

Note that this interpretation is cleaner than the interpretation obtained earlier from the complete linkage method. This suggests that Ward's method may be preferred for the current data.

The results can then be summarized in the following dendrogram:

In summary, this method is performed in essentially the same manner as the previous method; the only difference is that the cluster analysis is based on analysis of variance instead of distances.

© 2004 The Pennsylvania State University. All rights reserved.