BA using SAS: 04 Predictive Modeling Techniques

4.4 Cluster Analysis

Cluster Analysis

Cluster Analysis is the process of forming groups or related variables for the purpose of drawing important conclusions based on the similarities within the group.

We will begin by asking a few basic questions such as:

Why do we perform cluster analysis?
What insights does cluster analysis give us?

The figure below shows an example of clusters.

The observed data is plotted as scatters and then grouped based on the similarities into three clusters.
The greater the similarity within a group and greater the difference between the groups, more distinct is the clustering.
Generally clusters are made in such a way as to satisfy this criteria.
Often there are no assumptions about the underlying distribution of the data.
The reason for taking such an approach is that the objects in a group are similar to one another and are different from the objects in other groups. Therefore it is very easy to find a pattern here.

Types of Cluster Analysis

Hierarchical Clustering: Also known as nesting clusters as it also clusters to exist within bigger clusters to form a tree. It can be either agglomerative or divisive.

Agglomerative clustering is a bottom up approach, here each individual object has its own cluster. And cluster merge as per giving criteria as we move up the hirarchy.
Divisive clustering is the opposite of agglomerative i.e. a top down approach. Here all the objects form a single cluster and are divided into smaller clusters as we move down the hirarchy.

Partitioned clustering: Division of the set of data objects into non-overlapping clusters such that each object is in exactly one subset.
Overlapping clustering: Used to reflect the fact that an object can simultaneously belong to more than one group.
Exclusive clustering: They assign each object to a single cluster. i.e. there is no overlapping.
Complete clustering: It assigns every object to a cluster. i.e. there cannot be any object un-assigned to a cluster.

Types of Clusters

Well-separated: The distance between any two points in different groups is greater than the distance between any two points within a group. They need not be globular.
Prototype - based: The prototype of a cluster is often a centroid for data with contineous attributes. Such clusters tend to be globular.
Graph - based: When data is represented as a graph where nodes are the objects and links represent connection among the objects. They tend to be globular.
Density - based: This method is employed when the clusters are irregular and when noise and outliers are present.
Shared - property: Also known as conceptual clustering its the process of identifying the pattern in the cluster to successfully segregate into groups of cluster.

Methods to form clusters

K-means: It's a prototype based clustering technique that attempts to define the number of clusters(K). They are represented as centroids. Most common method. The idea is to minimise the distance between the data and the corresponding cluster centroid. K means analysis is based on one of the simplest algorithms for solving the cluster problem, and is therefore much faster then hirarchial cluster analysis. The K-means analysis is typically considered when the sample size is larger than 100. Note, however, that K-means cluster analysis assumes that the centroid of the observations or atleast the number of groups to be clustered is already known.
Agglomerative Hirarchical Clustering: It refers to collection of closely related clustering techniques that produce a hirarchical clustering by starting with each point as singleton cluster and repeatedly merging the closest cluster untill a single, all encompassing cluster remains. Some of these techniques has natural representation of graph based clustering, while others have interpretations in terms of a proto type based approach.
DBSCAN: It's a density based clustering algorithm that produces a partitioned clustering, in which number of clusters is automatically determined by the algorithm. Points in low density region are classified as noise and omitted and thus DBSCAN doesn't produce complete clustering.

Clustering in SAS Studio

In SAS clustering can be done using cluster or fasclus procedure.
In this example we will use the cluster procedure and see how the analysis can be done on the data.

To keep it simple we will use a data set already present in the sashelp library i.e. sashelp.mileages.
The mileages data set consists of the flying mileages between 10 American cities in a matrix format.
Let us first have a look at the data set.
We use the 'proc print' procedure with the data attribute having the value sashelp.mileages.

proc print noobs data=sashelp.mileages;

run;

In the results we can see the flying mileages values and half of the matrix is null values.
Now we will use the cluster procedure to cluster the city spaced on their flying mileages.
The cluster procedure does a hierarchical clustering data based on any of the 11 predefined methods.
The methods include:

average linkage
centroid method
complete linkage
density linkage
among other methods of clustering.

The differences between these methods are mainly in the way they calculate the distances between points.
We will try to cluster the data using 5 of the 11 methods and see how they differ from each other using graphical outputs.

title 'Cluster Analysis of Flying Mileages Between 10 American Cities';
ods graphics on;

To recap the hierarchical methods start by observing each point as a cluster and merge two clusters together till there is only one cluster in the data.

Average Linkage Method - Clustering

First let us use the average linkage method to computer the clusters.

title2 'Using METHOD=AVERAGE';

proc cluster data=sashelp.mileages(type=distance) method=average pseudo;

id City;

run;

As seen in the previous videos we will use the 'ods graphics' to output the statistics.
Keeping the ods graphics on, while using the cluster procedure will output the dendogram of the process.
In the proc cluster syntax we specify the data and method is given as average.
Once we execute the code we get the cluster results table and the dendogram.

From the results table we can see that in the first step there are 9 clusters by merging New York and Washington D. C. to a single cluster the Frequency column shows the number of observation as merged into a cluster for that particular step.
We can also notice that the clustering is based on the Norm RMS Distance or the average distance between pairs of observation in each cluster.
In the first three steps the clusters are formed by merging the individual observation.
In the fourth step with 6 clusters we can see that cluster 7 and cluster 9 are merged. And thus the frequency is 2 + 2 = 4.
Similarly you can look at all the rows and in the final step there is only a single cluster which contain all 10 observations.

The dendogram shows the clusters joined in graphical format. This shows a hierarchy of which two observations or clusters were merged to a single cluster reached as a function of average distance in the clusters.

Density Method - Clustering

Next let us try the density method for clustering. Here as we know the clustering is done based on parametric probabilities of the estimates.

title2 'Using METHOD=DENSITY K=3';
proc cluster data =sashelp.mileages (type=distance) method=density k=3;
id City;
run;

We specify the proc cluster procedure with the data as sashelp.mileages and method as density. Additionally we give a k value of 3.
This specify that we will use the kth nearest neihbor method to obtain the density estimates.
On Executing this code we get the cluster table and the dendogram.

As you can see from the table the clustering starts from different cities when compared to the previous method.
The first cluster is obtained by merging Atlanta and Washington D. C. Here the normalized density and maximum density in each cluster are used in selecting the observations or clusters to be merged.
The tie column specifies if there was a tie between two or more cluster options. In this table we can see that there was a tie in the 6th cluster step.

The dendogram shows the hierarchy of clusters as a function of the cluster fusion density.

Centroid Method - Clustering

Next we will try the centroid method of clustering.

This is very similar to the average method.
Except that the distance between cluster is calculated as the euclidian distance between their centroids or mean rather than averaging the distance between pairs of observations.

title2 'Using METHOD=CENTROID';
proc cluster data=sashelp.mileages(type=distance) method=centroid pseudo;
id City;
run;

We specify the data and the method as centroid in the cluster proc statement and run the code.

And you can see that the Norm Centroid Distance is used in deciding which centroids to merge.
The dendogram shows the graphical output of the order in which the clusters were merged as a function of distance between their centroids.

Single Linkage Method - Clustering

Now let us perform the clustering using the single linkage method.

title2 'Using METHOD=CENTROID';
proc cluster data=sashelp.mileages(type=distance) method=single;
id City;
run;

In the proc cluster code we change the method attribute to single and run the code.

In the single linkage method the distance between the two cluster is calculated as the minimum distance between the single pair of observation.
Though this performs well theoretically. This method has its limitation due to very little constraints placed on the clusters.

We can see from the table that the clusters are decided based on Norm Minimum distance and the dendogram shows the graphical output shows against the minimum distance between the clusters.

Ward Method - Clustering

Finally let us try a complex method of clustering called a WARD's Minimum Distance method.

title2 'Using METHOD=WARD';
proc cluster data=sashelp.mileages(type=distance) method=ward pseudo;
id City;
run;

We now change the method attribute of the proc cluster statement to ward and execute the code.
In this method the distance between the clusters is been calculated as the ANOVA sum of squares between two clusters added up over the other variable.

This is generally used to cluster a smaller data set, and is quite sensitive to outliers.
The table shows the clustering order with the Semi-partial R-Squared R-Square value.
The dendogram shows the graphical results as a function of Semi-Partial R-Squared.

Similar to the above methods we can use the other methods as well to form clusters on this data.

However the eml method cannot be used on this particular data set since it requires data co-ordinates.
From the dendograms we can see the most clustering methods produce clusters dividing the cities along the east west dimension. There is a slight disagreements between the methods as to where dender must belong to.
And the average and ward method suggest that dender in Huston might form a separate cluster.

Clustering Case Study

Objective of this case study is to present an outline steps that helps in understanding cluster analysis using SAS.
In this case study we will use well known data set the iris data published by Fisher in (1936). It has been widely used in examples of cluster analysis.
The sepal.length, sepal.width, petal.length and petal.width are measured in mm on 50 iris specimen from each of the three species Iris Sertosa, Iris Versicolor, and Iris Virginica.
Masage and Solar in 1980 discovered the variety of cluster analysis of the iris data.
Let us start with looking at the data.
The iris data set is already available in SAS under the SAS help library.

We will use the proc print statement to display the data set and click run.

title 'Cluster Analysis of Fisher (1936) Iris Data'

proc print sashelp.iris;
run;

In the results section We can see the entire dataset of the 50 specimen given.

This case study analyses the Iris data by using wards method and two-stage density linkage.
The fast clus procedure can also be used in combination with the cluster procedure to analyze large data set.
As we have seen earlier the cluster procedure is used to form clusters. However with the large data sets the CPU time required to form the cluster is almost square to cube of the number of observations. In such cases we use the fast clus procedure. In this procedure the time required is proportional to number of observations and hence is preferred in large data sets.
In the following code we will be using a macro show to display the cluster results.Macros are used in SAS to group a set of procedures.
For those who are familiar with programming languages, macros are equivalent of functions or methods.

%macro show
proc freq;
tables cluster * species / nopercent norow nocol plot=none;
run;

proc candisc noprint out=can;
class cluster;
var petal: sepal:;
run;

proc sgplot data=can;
scatter y=can2 x=can1 / group=cluster;
run;
%mend;

Here the show macro invokes the freq procedure to cross tabulate clusters and species.
The candisk procedure computes canonical variables for discriminating among the clusters.
The first two canonical variables are plotted to show cluster membership.
The sgplot procedure plots a scatter of the canonical variables grouped by the clusters.
In the show macro we have included the tables candisc and the plotting functions.

Clustering using Ward's Method

First let us use the Ward's minimum variance method to group the clusters.

title2 'By Ward's Method';
ods graphics on

proc cluster data=sashelp.iris method=ward print=15 ccc pseudo;
var petal: sepal:;
copy species;
run;

proc tree noprint ncl=3 out=out;
copy petal: sepal: species:
run;

%show

We will specify the title as 'By Ward's Method' and turn on ods graphics
Next we enter the cluster procedure with data=sashelp.iris and method=ward
The variables petal and sepal are numerical variables and it is defined by using the var statement.
The values given the in copy statement i.e. the copy species; is used to copy the variable from the input data set to the output tree data set to be used in displaying the cluster.
Next we specify the tree procedure to output the cluster membership.
Finally we use the show macro to display all the desired output.

From the output we see the cluster history as seen earlier we know that the procedure start with displaying all observation as individual clusters.
In the cluster procedure we mentioned the print value as 15 and hence the final 15 cluster values are displayed in this table.
Since ward's method uses ANOVA the semi-partial R-square and R-square values are used in grouping the cluster.
The ccc pseudo F and t-squared values are used in selecting the optimum number of clusters.

Here we can see that ccc is having a local peak at 3 clusters with a higher peak at 5.
F statistics has a peak at 3 clusters and the t-squared statistics suggest 3 or 6 clusters.

The cross tabulation matrix shows the predictive cluster values against the species.
We can see that Satosa completely falls into the second cluster, Versicolor falls majorly into the first cluster, whereas Virginica falls in both 3rd and first cluster.
We can note that first cluster has more observaions i.e. 64, and the third cluster has least number of observation i.e. 36.

The scatter plot displayed the clusters in different colors.

Clustering using Two Stage Density Linkage Method

The two stage 'density linkage method' is in proposition with single linkage method that we saw in the earlier example.
In the proc cluster procedure specify method as two stage and we give a k value of 8 to specify that we would like to use kth nearest neighbor method.
The other parts of the code are as explained previously. So we click run.

As you can see last 15 clusters are displayed as seen in the cluster history table.
The table also specifies that the three modal clusters have been formed.

The ccc, pseudo F and t-squared statistics suggest 2 clusters.

From the cross tabulation table we can see that Satosa is completely clustered into cluster 1. Versicolor and Verginica are clustered into cluster 2 and 3 respectively with an exception of only 6 observation points.
The clusters have 50 observations each.
Scatter plot displays the observations points with the clusters in different colors.
Comparing the two results we can see that for this particular data set the two stage density linkage performs better than the ward's method.for clustering.
However clustering is generally used in cases where there is no classification and in such case we might not be able to compare the results with the classified output to verify the goodness of clustering algorithm. So choosing the correct clustering algorithm for any particular data set is a separate process on its own.
As mentioned earlier the fast clus procedure can be used in cases where there are last number of observations.
There is also an ace clus procedure or the Approximate Co-variance Estimation for clustering. Which can be used in prep process data to be subsequently clustered by the fast clus or the cluster procedure.

This concludes the clustering.

1442/2352

BA using SAS

Wednesday, July 13, 2016

04 Predictive Modeling Techniques_Part 4