# Cluster Analysis

### Ward’s Method

This is an alternative approach for performing cluster analysis. It treats cluster analysis as an analysis of variance problem, instead of relying on distance metrics or measures of association.

This method involves an agglomerative clustering algorithm. It will start out at the leaves and work its way to the trunk, so to speak. It looks for groups of leaves that it forms into branches, the branches into limbs and eventually into the trunk. Ward's method starts out with *n* clusters of size 1 and continues until all the observations are included in one cluster.

This method is most appropriate for quantitative variables, and not binary variables.

Based on the notion that clusters of multivariate observations should be approximately elliptical in shape, we assume that the data from each cluster are realized from a multivariate distribution. Therefore, it would follow that they would fall into an elliptical shape when plotted in a *p*-dimensional scatter plot.

The notation we will use is as follows: let \(X_{ijk}\) denote the value for variable *k* in observation *j* belonging to cluster *i*.

Furthermore, for this particular method we define the following:

**Error Sum of Squares**:

$$\text{ESS} = \sum_{i}\sum_{j}\sum_{k}\left(X_{ijk} - \bar{x}_{i \cdot k}\right)^2$$

where \(\bar{x}_{i \cdot k}\) is the mean of variable *k* over the observations in cluster *i*.

**Total Sum of Squares**:

$$\text{TSS} = \sum_{i}\sum_{j}\sum_{k}\left(X_{ijk} - \bar{x}_{\cdot \cdot k}\right)^2$$

where \(\bar{x}_{\cdot \cdot k}\) is the grand mean of variable *k*.

**R-Square**:

$$r^2 = \frac{\text{TSS} - \text{ESS}}{\text{TSS}}$$

For the error sum of squares we sum over all variables and over all of the units within each cluster, comparing each individual observation against the cluster mean for that variable. Note that when the error sum of squares is small, our data are close to their cluster means, implying that we have clusters of like units.

The total sum of squares is defined in the same way as usual. Here we compare the individual observations for each variable against the grand mean for that variable.

This *r*^{2} value is interpreted as the proportion of variation explained by a particular clustering of the observations.
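As a concrete illustration, the three quantities above can be computed directly with NumPy. The data matrix and cluster assignment below are made up for the example, not taken from the lesson's data:

```python
import numpy as np

# Toy data: 6 observations on 2 variables, assigned to 2 clusters.
X = np.array([[1.0, 2.0],
              [1.2, 1.8],
              [0.9, 2.1],
              [5.0, 6.0],
              [5.2, 5.8],
              [4.9, 6.3]])
labels = np.array([0, 0, 0, 1, 1, 1])

# Total sum of squares: deviations from the grand mean of each variable.
grand_mean = X.mean(axis=0)
tss = ((X - grand_mean) ** 2).sum()

# Error sum of squares: deviations from each cluster's mean of each variable.
ess = 0.0
for c in np.unique(labels):
    members = X[labels == c]
    ess += ((members - members.mean(axis=0)) ** 2).sum()

# Proportion of variation explained by this clustering.
r_squared = (tss - ess) / tss
print(round(r_squared, 4))
```

With two tight, well-separated groups such as these, nearly all of the variation is between clusters, so the \(r^2\) value is close to 1.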

Using Ward's method we start out with all sample units in *n* clusters of size 1 each. In the first step of the algorithm, *n* - 1 clusters are formed, one of size two and the remaining of size 1. The error sum of squares and *r*^{2} values are then computed. The pair of sample units that yields the smallest error sum of squares, or equivalently the largest *r*^{2} value, forms the first cluster. Then, in the second step of the algorithm, *n* - 2 clusters are formed from the *n* - 1 clusters defined in the first step. These may include two clusters of size 2, or a single cluster of size 3 that includes the two items clustered in step 1. Again, the value of *r*^{2} is maximized. Thus, at each step of the algorithm, clusters or observations are combined in such a way as to minimize the error sum of squares or, equivalently, maximize the *r*^{2} value. The algorithm stops when all sample units are combined into a single large cluster of size *n*.
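The agglomerative procedure just described is what SciPy implements as Ward linkage. A minimal sketch on synthetic data follows; the three well-separated groups are an assumption made for illustration, not the data analyzed in this lesson:

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

rng = np.random.default_rng(0)
# Three well-separated synthetic groups of 10 points each, chosen so the
# correct clustering is unambiguous.
X = np.vstack([rng.normal(loc=m, scale=0.3, size=(10, 2))
               for m in (0.0, 5.0, 10.0)])

# Ward linkage: at each step, merge the pair of clusters whose union
# gives the smallest increase in the error sum of squares.
Z = linkage(X, method="ward")

# Cut the resulting tree to obtain a fixed number of clusters.
labels = fcluster(Z, t=3, criterion="maxclust")
print(np.bincount(labels)[1:])  # sizes of the three clusters
```

The linkage matrix `Z` records the full merge history, from *n* singleton clusters down to one cluster of size *n*, and `fcluster` plays the role of placing the break in the tree at a chosen number of clusters.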

### Example: Woodyard Hammock Data

We will take a look at the implementation of Ward's Method using the SAS program wood5.sas.

As you can see, this program is very similar to the previous program (wood1.sas) that was discussed earlier in this lesson. The only difference is that we have specified **method=ward** in the cluster procedure. The tree procedure is used to draw the tree diagram shown below, as well as to assign cluster identifications. Here we will look at four clusters.

We had decided earlier that we wanted four clusters, so we placed the break in the plot and highlighted the resulting clusters. It looks as though there are two very well-defined clusters, because there is a fairly large break between the first and second branches of the tree. Partitioning into four clusters yields clusters of sizes 31, 24, 9, and 8.

Referring back to the SAS output, we found the results of the ANOVAs and have copied them here for discussion.

**Results of ANOVAs**

| Code | Species | F | p-value |
|------|---------|------|----------|
| carcar | **Ironwood** | 67.42 | < 0.0001 |
| corflo | Dogwood | 2.31 | 0.0837 |
| faggra | **Beech** | 7.13 | 0.0003 |
| ileopa | **Holly** | 5.38 | 0.0022 |
| liqsty | Sweetgum | 0.76 | 0.5188 |
| maggra | Magnolia | 2.75 | 0.0494 |
| nyssyl | Blackgum | 1.36 | 0.2627 |
| ostvir | **Blue Beech** | 32.91 | < 0.0001 |
| oxyarb | Sourwood | 3.15 | 0.0304 |
| pingla | Spruce Pine | 1.03 | 0.3839 |
| quenig | Water Oak | 2.39 | 0.0759 |
| quemic | Swamp Chestnut Oak | 3.44 | 0.0216 |
| symtin | **Horse Sugar** | 120.95 | < 0.0001 |

*d.f.* = 3, 68

We have boldfaced those species whose *F*-values, using a Bonferroni correction, show as being significant. These include Ironwood, Beech, Holly, Blue Beech, and Horse Sugar.
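A per-species ANOVA of this kind can be sketched with `scipy.stats.f_oneway`. The group means and standard deviation below are hypothetical, with cluster sizes echoing the 31, 24, 9, and 8 partition from the text, and alpha is divided by 13 because thirteen species are tested:

```python
import numpy as np
from scipy.stats import f_oneway

rng = np.random.default_rng(1)
# Hypothetical abundances of one species across the four clusters.
groups = [rng.normal(loc=m, scale=2.0, size=n)
          for m, n in zip((3.0, 18.0, 1.0, 7.0), (31, 24, 9, 8))]

# One-way ANOVA comparing the cluster means (d.f. = 3, 68 here as well).
f_stat, p_value = f_oneway(*groups)

# Bonferroni correction: divide the nominal level by the 13 species tested.
alpha = 0.05 / 13
print(f_stat, p_value, p_value < alpha)
```

Because one cluster mean is far from the others relative to the within-cluster spread, the *F*-statistic is large and the *p*-value survives the Bonferroni correction, mirroring how species such as Ironwood stand out in the table above.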

The next thing we will do is look at the cluster means for these significant species:

| Code | Cluster 1 | Cluster 2 | Cluster 3 | Cluster 4 |
|------|-----------|-----------|-----------|-----------|
| carcar | 2.8 | **18.5** | 1.0 | 7.4 |
| faggra | **10.6** | 6.0 | 5.9 | 6.4 |
| ileopa | 7.5 | 4.3 | **12.3** | 7.9 |
| ostvir | 5.4 | 3.1 | **18.3** | 7.5 |
| symtin | 1.3 | 0.7 | 1.4 | **18.8** |

Again, we have boldfaced those values that show an abundance of that species within the different clusters.
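A table of per-cluster means like the one above is straightforward to compute once cluster identifications have been assigned. The abundance matrix and labels below are simulated placeholders, not the actual Woodyard Hammock data:

```python
import numpy as np

rng = np.random.default_rng(2)
# Simulated abundance matrix: 72 sites (rows) by 5 species (columns),
# with each site assigned to one of four equal-sized clusters.
abundance = rng.poisson(lam=5.0, size=(72, 5)).astype(float)
labels = np.repeat(np.arange(1, 5), 18)

# Mean abundance of each species within each cluster (rows = clusters).
cluster_means = np.vstack([abundance[labels == c].mean(axis=0)
                           for c in range(1, 5)])
print(cluster_means.shape)  # (4, 5)
```

Scanning each row of `cluster_means` for unusually large values is the same exercise as boldfacing the abundant species in the table above.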

- Cluster 1: Beech (faggra): Canopy species typical of old-growth forests.
- Cluster 2: Ironwood (carcar): Understory species that favors wet habitats.
- Cluster 3: Holly (ileopa) and Blue Beech (ostvir): Understory species that favor dry habitats.
- Cluster 4: Horse Sugar (symtin): Understory species typically found in disturbed habitats.

Note that this interpretation is cleaner than the interpretation obtained earlier from the complete linkage method. This suggests that Ward's method may be preferred for the current data.

The results can then be summarized in the following dendrogram:

In summary, this method is performed in essentially the same manner as the previous method; the only difference is that the cluster analysis is based on analysis of variance instead of distances.