BA using SAS: July 2016

Sunday, July 17, 2016

04 Predictive Modeling Techniques_Part 5 Time Series

Time Series

Time series data is an ordered sequence of observation on a quantitative variable measured over an equally spaced time interval.
Time series are used in statistics, signal processing, pattern recognition, econometrics, mathematics, finance, whether forecasting, earthquake prediction, electroencephalography, control engineering, astronomy, communications engineering and other places.
Time Series analysis is used in

Analysing time series data
Forecasting the future value of the variable under consideration.

In time series analysis it is assumed that the data consists of identifiable components and random errors which usually makes the pattern difficult to identify.
E.g. Sales of quilts and blankets in a store across a period of five years.

Components of a Time Series

Long term trend - The smooth long term direction of time series where the data can increase or decrease in some pattern.
Seasonal variation - Patterns of change in a time series within a year which tends to repeat every year.
Cyclical variation - Its much alike seasonal variation but the rise and fall of time series over periods are longer than one year.
Irregular variation - Any variation that is not explainable by any of the three above mentioned components. They can be classified into - stationary and non-stationary variation.

When the data neither increase not decreases, i.e. its completely random its called stationary variation.
When the data has some explainable portion remaining and can be analyzed further then such case is called non-stationary variation.

Decomposition of Time Series

We can see a sample plot of decomposition of time series.
The observed sales of quilt and blanket as discussed in the previous section is displayed in the first box.
The observed time series data is decomposed into the trend, seasonal and random components.
As you can see that the trend component shows and increase and decrease in sales over a period of time. Overall we can infer that the sales increases.
The seasonality component shows the seasonal pattern of the sales data.
The random component counts for the differences in the trend and the seasonal data from the observed data.

Moving Average

Moving average is a widely used indicator in technical analysis that helps in smoothing out actions by filtering out the noise
i.e. the residuals from random fluctuations.
Moving average is also otherwise called as trend follower or lagging indicator because it always depend on historical data.
Commonly used moving averages are

Simple Moving Average (SMA)
Exponential Moving Average (EMA)

A simple moving average is calculated by adding the value for a number of time periods and then dividing this total by the same number of time periods.
Exponential moving average gives a higher weighting to recent prices but in case of simple moving average it assigns equal weighting to all values.

Goals of Time Series Analysis

Descriptive:
Identify different patterns in correlated data which helps in finding the trend and seasonal variation
Explanation:
Understanding and modeling the data
Forecasting:
Predicting the short-term trends from the previous existing patterns
Intervention Analysis:
How does a single event change the time series?
Quality Control:
Deviation of a specified size indicate a problem.

Steps for Simple Moving Average in Excel

Open the file for performing moving average
Make sure that the Analysis ToolPak add-in is installed in Excel
If not, install it by selecting Add-ins from the Office button and selecting Manage add-ins
Click on Data Analysis
Select Moving Average and Click Ok
Select the input range for the data by clicking and dragging the data
Check on "Labels in First Row" if data extracted has column name in first row.
Specify the interval value as required
Check on Chart Output and click ok.
The results will be obtained in a new worksheet ply by default.

Demo for Simple Moving Average in Excel

Let us take an example of a watch manufacturing company in the center of belgium.
The company is experiencing fluctuations in the net sales quantity and hence we want to find if there is any trend involved in it.
The company maintains a record of monthly net sales quantity over a period of 4 years.

These kind of data are time series data as it is stored over a long period of time.
To find these trend changes we need to perform time series analysis on the data.
We might need to know the future net sales quantity of products in that company.
And this is where analytics will help.
This helps in discovering the historical pattern that is hidden within the data.
And with this discovered pattern we are going to explore the future pattern, i.e. how the future net quantity sales would be.
The management and the sales heads of that company are really interested in finding where their sales quantity trends increase and where their trends decrease.
In out time series data we have different crest and truss.
And if all these trends are included in prediction model. We will not be able to get the accurate results.
So in order to make our prediction results still more accurate you need to smoothen your results.
In order to smooth en the non-normalized patterns we can use two methods.
The simple moving average and exponential moving average.
In order to perform simple moving average in excel. Open the data set in Excel.
Select the data analysis tool from the top right corner of the data ribbon.

Once the Data Analysis Windows is displayed, select the Moving Average option and click on ok.

A pop-up window of Moving Average is displayed.
Select the Input Range as net sales by clicking and dragging the mouse on the data.
Check "Labels in First Row" option which specify that the label names are in the first row.
Then define the moving average interval as 5. i.e. we use the last 5 months data to average and smooth en the time series data.
Define the Output range on which you require the output results by clicking on the desired cell you want to display the desired output.

Check on "Standard Errors" and "Chart Output" and finally click Ok.
The results are displayed in the cells, and a line chart is displayed with the predicted values.
The results are calculated with the moving average interval of 5 and so the first 4 values are specified as N/A. The remaining predicted net sales quantities are obtained by averaging the previous 5 values.
It also provides the residual term for the particular net sales value. i.e. the difference between the actual net sale quantity and predicted net sales is being calculated. Since the interval is given as 5 the residual term is calculated as the square root of the average differences between the past 5 values and the predicted value.

Then a line graph is plotted between the two values i.e. predicted and forecasted values.
You can see that the forecast ed values in red are more smoothen, compared to the actual plotted values in blue.
More smoothing can be achieved by increasing the interval, but it might loosen the effect of more recent sales values.
In simple smoothing averages the weights given to the past n interval values are equal.
However in case you want to provide more weightage or consideration to the current value and the weightage of older values should decrease exponential, then you should allow exponential smooting average.

Steps for Exponential Moving Average in Excel

Open the file for performing exponential moving average
Make sure that the Analysis ToolPak add-in is installed in Excel
If not, install it by selecting Add-ins from the Office button and selecting Manage add-ins
Click on Data Analysis
Select Exponential Moving Average and Click Ok
Select the input range for the data by clicking and dragging the data
Specify the damping factor as required.
Check on "Labels in First Row" if data extracted has column name in first row.
Specify the interval value as required
Check on Chart Output and click Ok.
The results will be obtained in a new worksheet ply by default.

Demo for Exponential Moving Average in Excel

Open the data set in Excel.
Open the Data Analysis from Data Tab.
Select Exponential Smoothing from the list of option

The window for exponential smoothing is displayed.

Here define your input range for which you need to perform exponential smoothing.
In our example we are going to provide the net sales quantity as our input range.
Then define the damping factor value. A high value of damping factor may dampen your results more, giving the results more smooth values.
In our example we will smoothen the data with value 0.45.
Check the Labels in First row option to specify that labels are being specified in your first row.
Specify your output range where your results need to be displayed.

Check the Chart Outputs and Standard Errors and finally click Ok.
The results will be displayed in the output range and a chart is generated.

The results provided the predicted sales quantity value with an exponential smoothing damping factor of 0.45.
The need of using this exponential smoothing is to provide more weightage to recent net sales quantity and less declined exponential values to the older value. i.e for the value of 0.45 a 55% weightage is given to the previous value and exponential lower values are given to the older values.
The results also provide the residual term to introduce the difference between the actual values and predicted values.
The graph shows a line output of the Actual and Predicted values with exponentially smoothened effect.
If the damping factor is very high then your results will get more smoothened out and you can for a normal straight line.
Thus using the simple moving average and exponential smoothing functions we have generated a trend of time series data and have arrived a way of predicting the future net sales quantity data.

Case Study - Time Series Analysis.

Let us take the example of a retail store which contains the sales data for last 15 weeks and we need to see how the retail store is going to perform in the future weeks. We need to know the demand before hand. And this is where Analytics will help. For this prediction we are going to perform a time series analysis to this sales data. This helps in discovering the historical patterns and is hidden within those data, and with this discovered pattern, we are going to explore the future pattern and how the future sales would be. Just to give a business perspective we want to have predicted demand so that the retail store accordingly plans the stock levels and is ready in case the demand suddenly goes up or down. In our data set we are having two variables. The first variables specifies the week and the second variable specifies the demand in that week for the retail store. Now let us look at the data set given through a chart.

We see that in our time series data there are many fluctuations. For the sake of simplicity we will assume that the fluctuation are random. Now we need to take care of these fluctuations as it may affect our prediction results. So in order to reduce these fluctuations, we need to smoothen out these crest an truss in our prediction model.
The most commonly used smoothing techniques are simple moving average and exponential moving average. There are few more but we will for the moment discuss the common ones.
Let us start with calculating the simple moving average for our data set.
A simple moving average uses the average of most recent end data points in a time series and helps in forecasting for the future time period. We will make the explanationexplicit by doing this for our data set.
For our data set we are going to perform a moving average for the 15 week demand with an interval of 5 week. i.e. prediction for a particular week is done by doing an average of the last 5 weeks.
Let us do this on Excel.
Select the Data Analysis from data tab, and then option Moving Average.
The moving average window pops-up.
Here specify the demand as your input range.
Check on Labels in first row (in case first row contains label names.)
Define the Moving Average interval as 5.
Specify the output range and check on the chart output to display your demand and moving average as chart.
Now you will get the forecasted demand of 5 week moving average.
Also you can find out the absolute error by calculating the difference between forecasted demand and actual demand.
From individual absolute errors we obtain the mean absolute error. Alternatively you can also get the mean absolute error by checking the option of standard errors from the moving average pop-up window while feeding the inputs for the model.
The mean absolute error helps in choosing the right interval for moving average by applying trial and error method we can find the best interval that provides you most accurate forecasted value.
If the absolute mean error is less than we will get the high accurate prediction results.
Now let us observe the prediction values given by moving average. Note that the moving average are close to just the previous average.
what would happen if we increase the interval from 5 to 8 weeks. Let us again repeat the steps mentioned earlier to calculate the moving average forecast with interval as 8 weeks.

After looking at the new chart and comparing with the old we see that the newer forecast is much more smooth or averaged.As we keep increasing the intervals the forecasted values will be much more averaged out.similarly the decreasing intervals leads to much more fluctuation values of predicted demand. Ofcourse what interval to choose as mentioned earlier depends upon absolute mean square error.
One of the limitations of moving average forecast is that it does not predict the spike in sales for a week and averages out the demand.This leads to inaccurate forecast for weeks whether there is an increase in sales as compared to earlier weeks. To understand this look at Week7, Week11 and Week13. The demand is way higher than the moving average value.
Also the model does not take care of the observation values from older time line. This can easily create problem if the recent sales are not normal and the factor increase in the sales was only and anomaly for certain time period. The next model we are going to use for forecasting takes care of giving a weightage to the previous demands and also the latest ones.
We will do exponential smoothing to take care of all the demand values and accordingly come out with a forecast. We assign a damping factor. The factor gives some weightage to the recent values and some weightage to the previous forecast values.

Previous sales get factored in the forecast value. So lets begin with the steps involved to get some smoothen forecast values.

As already explained before, to perform exponential smoothing, click on data analysis tab and select exponential smoothing and specify the input range. Input range in our case will be the demand data set. We then need to specify the damping factor based on which exponential smoothing is to be made. For our case we are going to specify the damping factor as 0.24. Then check on the Label to specify that your first row contains the column names. Then specify the output range and check on Chart Output and click on Ok.

If you look at the chart the forecast values resemble the demand values.But in each of the case the forecasted values seems to mirror the last week demand value. Taking the example of Week7 and Week8 Week8 forecast values seems closer to Week 7 demand value than Week8 actual demand value.

We do the earlier steps again and put damping factor as 0.8. The output chart that comes out is completely different from the earlier chart. The forecasted values are much smoother than the earlier values. This explains the effect of the damping factor. More the damping factor, smoother the graph would become.
Choosing the right damping factor is again on mean absolute error and sometime depends on business experience.
But even this model does not solve all the problems mentioned after moving average forecast. We still have not solved the problems of sudden spikes in particular week.
Coming back to the original assumption of our problem, the spikes are random, and hence we use smoothing techniques but what if there is a pattern to the spikes.
Ofcourse there are more advanced models such has hotwinters, arima, multiple linear regression, etc. which can take care of the sudden spikes and forecast the value of demand more accurately. This is where advanced analytics comes into the picture.

Time Series - Business Case Study

Now that we have covered the fundamentals of analytics course through Excel and SAS let us take a business case where all the concepts learned from lesson 1 to now make a use. The intention here is to give a way the business problem is typically solved.

In our business case we have a retail store who wants to derive values from their point of sales data across different stores.

Point of sales data is data which is captured in each store and covers the billing process from the products sold. It does not give any details about the customers themselves, but a lot of useful information from the products being sold has come out as data as well we will see going further in the case.
Now let's look at the data set given. The first tab shows the weekly data starting from 30 Dec 2012 to 31 Mar 2014.
The data is for 5 product categories, namely Kitchen, Grains, Pulses, Vegetables, and Merchandise. The numbers are quantity sold of each type.

Second tab specifically has data for broccoli 500 gm packets sold across outlets in a particular city.
Rainfall gives the value in centimeter for the particular city across time.
Season column give the value of 1 for winters and 0 for other seasons. Winter assumed from Nov to Feb.
We need to first understand the data and start to work over.

Scatter Plot

Going back to category wise data let us draw scatter plot understanding the quantity sold for each category across timeline.

The scatter plot shows that pulses and kitchen's items are sold less compared to other. But there are no fluctuations and sold uniformly.
Merchandise Sales quantity increases and decreases randomly and there seems to be no pattern in crest and truss.
Grains are the highest selling category and have fluctuations where there seems to be a pattern.
Similar is the case with the vegetable category.

Descriptive Analysis

Now let us do some descriptive analysis on the data.
We do a descriptive statistics on each of the category together. So what do we observe?

Same as in the scatter chart.
Merchandise has the highest standard deviation and hence so much fluctuation.
Grains, Pulses and Vegetables are negatively skewed, whereas Kitchen and Merchandise are positively skewed.
The range of pulses is minimum and also the standard deviation is the lowest, which signifies that the demand for pulses is uniform throughout the year and does not vary much.
After understand the descriptive part of the variables we will look into the diagnostic part of Analytics.
This part will answer any relationship we have among variables or between dependent and independent variable.

ANOVA

We will first do an ANOVA test to see the relationship between the 5 categories.

Again using excel we go to the Data Tab, Select Data Analysis, Choose ANOVA single factor from the options and select all the 5 columns except the time column including the header. Choose alpha as 0.05.

The results from the test shows the p-value is much less than the significant value of 0.05 chosen for the test.
So we reject the hypothesis that the means of all the 5 categories are the same.
This leads us to the direction that the demand quantity of these groups is independent of each other and should be taken care separately.
We can still find how different the means are by doing a pairwise t-test for each combination. But that again does not form the part of analysis here.

Correlation

Next test we perform with the data given is the correlation between the product categories demand quantity.
Why did we choose correlation and not chi-square test? Because the values of each of the variables here are continuous numerical and non-categorical.

We again click the data analysis tab, choose correlation and put the input range including the 5 columns. So what do we see in the results?
Looking at all the values from 0 to 1 we see interesting observations.
Grains and Vegetables has a correlation of 0.9. Meaning the demand quantity follow each other and have similar patterns.
Even Pulses and Vegetables has a correlation of 0.77. Again demand quantities of Vegetables and Pulses follow each other but to a lesser extent than between grains and vegetables.
All the values are positive and none of them negative.
All the demand quantities are positively correlated to each other.
None of the categories affect negatively or cannibalise the demand.
What it also implies is that we predict the demand of vegetables and get a working model.
The same model can be used to predict the demand of Grains.to a large extent and pulses to a smaller extent.

Let us now look at the second tab and understand the data given.

Wednesday, July 13, 2016

04 Predictive Modeling Techniques_Part 4

4.4 Cluster Analysis

Cluster Analysis

Cluster Analysis is the process of forming groups or related variables for the purpose of drawing important conclusions based on the similarities within the group.

We will begin by asking a few basic questions such as:

Why do we perform cluster analysis?
What insights does cluster analysis give us?

The figure below shows an example of clusters.

The observed data is plotted as scatters and then grouped based on the similarities into three clusters.
The greater the similarity within a group and greater the difference between the groups, more distinct is the clustering.
Generally clusters are made in such a way as to satisfy this criteria.
Often there are no assumptions about the underlying distribution of the data.
The reason for taking such an approach is that the objects in a group are similar to one another and are different from the objects in other groups. Therefore it is very easy to find a pattern here.

Types of Cluster Analysis

Hierarchical Clustering: Also known as nesting clusters as it also clusters to exist within bigger clusters to form a tree. It can be either agglomerative or divisive.

Agglomerative clustering is a bottom up approach, here each individual object has its own cluster. And cluster merge as per giving criteria as we move up the hirarchy.
Divisive clustering is the opposite of agglomerative i.e. a top down approach. Here all the objects form a single cluster and are divided into smaller clusters as we move down the hirarchy.

Partitioned clustering: Division of the set of data objects into non-overlapping clusters such that each object is in exactly one subset.
Overlapping clustering: Used to reflect the fact that an object can simultaneously belong to more than one group.
Exclusive clustering: They assign each object to a single cluster. i.e. there is no overlapping.
Complete clustering: It assigns every object to a cluster. i.e. there cannot be any object un-assigned to a cluster.

Types of Clusters

Well-separated: The distance between any two points in different groups is greater than the distance between any two points within a group. They need not be globular.
Prototype - based: The prototype of a cluster is often a centroid for data with contineous attributes. Such clusters tend to be globular.
Graph - based: When data is represented as a graph where nodes are the objects and links represent connection among the objects. They tend to be globular.
Density - based: This method is employed when the clusters are irregular and when noise and outliers are present.
Shared - property: Also known as conceptual clustering its the process of identifying the pattern in the cluster to successfully segregate into groups of cluster.

Methods to form clusters

K-means: It's a prototype based clustering technique that attempts to define the number of clusters(K). They are represented as centroids. Most common method. The idea is to minimise the distance between the data and the corresponding cluster centroid. K means analysis is based on one of the simplest algorithms for solving the cluster problem, and is therefore much faster then hirarchial cluster analysis. The K-means analysis is typically considered when the sample size is larger than 100. Note, however, that K-means cluster analysis assumes that the centroid of the observations or atleast the number of groups to be clustered is already known.
Agglomerative Hirarchical Clustering: It refers to collection of closely related clustering techniques that produce a hirarchical clustering by starting with each point as singleton cluster and repeatedly merging the closest cluster untill a single, all encompassing cluster remains. Some of these techniques has natural representation of graph based clustering, while others have interpretations in terms of a proto type based approach.
DBSCAN: It's a density based clustering algorithm that produces a partitioned clustering, in which number of clusters is automatically determined by the algorithm. Points in low density region are classified as noise and omitted and thus DBSCAN doesn't produce complete clustering.

Clustering in SAS Studio

In SAS clustering can be done using cluster or fasclus procedure.
In this example we will use the cluster procedure and see how the analysis can be done on the data.

To keep it simple we will use a data set already present in the sashelp library i.e. sashelp.mileages.
The mileages data set consists of the flying mileages between 10 American cities in a matrix format.
Let us first have a look at the data set.
We use the 'proc print' procedure with the data attribute having the value sashelp.mileages.

proc print noobs data=sashelp.mileages;

run;

In the results we can see the flying mileages values and half of the matrix is null values.
Now we will use the cluster procedure to cluster the city spaced on their flying mileages.
The cluster procedure does a hierarchical clustering data based on any of the 11 predefined methods.
The methods include:

average linkage
centroid method
complete linkage
density linkage
among other methods of clustering.

The differences between these methods are mainly in the way they calculate the distances between points.
We will try to cluster the data using 5 of the 11 methods and see how they differ from each other using graphical outputs.

title 'Cluster Analysis of Flying Mileages Between 10 American Cities';
ods graphics on;

To recap the hierarchical methods start by observing each point as a cluster and merge two clusters together till there is only one cluster in the data.

Average Linkage Method - Clustering

First let us use the average linkage method to computer the clusters.

title2 'Using METHOD=AVERAGE';

proc cluster data=sashelp.mileages(type=distance) method=average pseudo;

id City;

run;

As seen in the previous videos we will use the 'ods graphics' to output the statistics.
Keeping the ods graphics on, while using the cluster procedure will output the dendogram of the process.
In the proc cluster syntax we specify the data and method is given as average.
Once we execute the code we get the cluster results table and the dendogram.

From the results table we can see that in the first step there are 9 clusters by merging New York and Washington D. C. to a single cluster the Frequency column shows the number of observation as merged into a cluster for that particular step.
We can also notice that the clustering is based on the Norm RMS Distance or the average distance between pairs of observation in each cluster.
In the first three steps the clusters are formed by merging the individual observation.
In the fourth step with 6 clusters we can see that cluster 7 and cluster 9 are merged. And thus the frequency is 2 + 2 = 4.
Similarly you can look at all the rows and in the final step there is only a single cluster which contain all 10 observations.

The dendogram shows the clusters joined in graphical format. This shows a hierarchy of which two observations or clusters were merged to a single cluster reached as a function of average distance in the clusters.

Density Method - Clustering

Next let us try the density method for clustering. Here as we know the clustering is done based on parametric probabilities of the estimates.

title2 'Using METHOD=DENSITY K=3';
proc cluster data =sashelp.mileages (type=distance) method=density k=3;
id City;
run;

We specify the proc cluster procedure with the data as sashelp.mileages and method as density. Additionally we give a k value of 3.
This specify that we will use the kth nearest neihbor method to obtain the density estimates.
On Executing this code we get the cluster table and the dendogram.

As you can see from the table the clustering starts from different cities when compared to the previous method.
The first cluster is obtained by merging Atlanta and Washington D. C. Here the normalized density and maximum density in each cluster are used in selecting the observations or clusters to be merged.
The tie column specifies if there was a tie between two or more cluster options. In this table we can see that there was a tie in the 6th cluster step.

The dendogram shows the hierarchy of clusters as a function of the cluster fusion density.

Centroid Method - Clustering

Next we will try the centroid method of clustering.

This is very similar to the average method.
Except that the distance between cluster is calculated as the euclidian distance between their centroids or mean rather than averaging the distance between pairs of observations.

title2 'Using METHOD=CENTROID';
proc cluster data=sashelp.mileages(type=distance) method=centroid pseudo;
id City;
run;

We specify the data and the method as centroid in the cluster proc statement and run the code.

And you can see that the Norm Centroid Distance is used in deciding which centroids to merge.
The dendogram shows the graphical output of the order in which the clusters were merged as a function of distance between their centroids.

Single Linkage Method - Clustering

Now let us perform the clustering using the single linkage method.

title2 'Using METHOD=CENTROID';
proc cluster data=sashelp.mileages(type=distance) method=single;
id City;
run;

In the proc cluster code we change the method attribute to single and run the code.

In the single linkage method the distance between the two cluster is calculated as the minimum distance between the single pair of observation.
Though this performs well theoretically. This method has its limitation due to very little constraints placed on the clusters.

We can see from the table that the clusters are decided based on Norm Minimum distance and the dendogram shows the graphical output shows against the minimum distance between the clusters.

Ward Method - Clustering

Finally let us try a complex method of clustering called a WARD's Minimum Distance method.

title2 'Using METHOD=WARD';
proc cluster data=sashelp.mileages(type=distance) method=ward pseudo;
id City;
run;

We now change the method attribute of the proc cluster statement to ward and execute the code.
In this method the distance between the clusters is been calculated as the ANOVA sum of squares between two clusters added up over the other variable.

This is generally used to cluster a smaller data set, and is quite sensitive to outliers.
The table shows the clustering order with the Semi-partial R-Squared R-Square value.
The dendogram shows the graphical results as a function of Semi-Partial R-Squared.

Similar to the above methods we can use the other methods as well to form clusters on this data.

However the eml method cannot be used on this particular data set since it requires data co-ordinates.
From the dendograms we can see the most clustering methods produce clusters dividing the cities along the east west dimension. There is a slight disagreements between the methods as to where dender must belong to.
And the average and ward method suggest that dender in Huston might form a separate cluster.

Clustering Case Study

Objective of this case study is to present an outline steps that helps in understanding cluster analysis using SAS.
In this case study we will use well known data set the iris data published by Fisher in (1936). It has been widely used in examples of cluster analysis.
The sepal.length, sepal.width, petal.length and petal.width are measured in mm on 50 iris specimen from each of the three species Iris Sertosa, Iris Versicolor, and Iris Virginica.
Masage and Solar in 1980 discovered the variety of cluster analysis of the iris data.
Let us start with looking at the data.
The iris data set is already available in SAS under the SAS help library.

We will use the proc print statement to display the data set and click run.

title 'Cluster Analysis of Fisher (1936) Iris Data'

proc print sashelp.iris;
run;

In the results section We can see the entire dataset of the 50 specimen given.

This case study analyses the Iris data by using wards method and two-stage density linkage.
The fast clus procedure can also be used in combination with the cluster procedure to analyze large data set.
As we have seen earlier the cluster procedure is used to form clusters. However with the large data sets the CPU time required to form the cluster is almost square to cube of the number of observations. In such cases we use the fast clus procedure. In this procedure the time required is proportional to number of observations and hence is preferred in large data sets.
In the following code we will be using a macro show to display the cluster results.Macros are used in SAS to group a set of procedures.
For those who are familiar with programming languages, macros are equivalent of functions or methods.

%macro show
proc freq;
tables cluster * species / nopercent norow nocol plot=none;
run;

proc candisc noprint out=can;
class cluster;
var petal: sepal:;
run;

proc sgplot data=can;
scatter y=can2 x=can1 / group=cluster;
run;
%mend;

Here the show macro invokes the freq procedure to cross tabulate clusters and species.
The candisk procedure computes canonical variables for discriminating among the clusters.
The first two canonical variables are plotted to show cluster membership.
The sgplot procedure plots a scatter of the canonical variables grouped by the clusters.
In the show macro we have included the tables candisc and the plotting functions.

Clustering using Ward's Method

First let us use the Ward's minimum variance method to group the clusters.

title2 'By Ward's Method';
ods graphics on

proc cluster data=sashelp.iris method=ward print=15 ccc pseudo;
var petal: sepal:;
copy species;
run;

proc tree noprint ncl=3 out=out;
copy petal: sepal: species:
run;

%show

We will specify the title as 'By Ward's Method' and turn on ods graphics
Next we enter the cluster procedure with data=sashelp.iris and method=ward
The variables petal and sepal are numerical variables and it is defined by using the var statement.
The values given the in copy statement i.e. the copy species; is used to copy the variable from the input data set to the output tree data set to be used in displaying the cluster.
Next we specify the tree procedure to output the cluster membership.
Finally we use the show macro to display all the desired output.

From the output we see the cluster history as seen earlier we know that the procedure start with displaying all observation as individual clusters.
In the cluster procedure we mentioned the print value as 15 and hence the final 15 cluster values are displayed in this table.
Since ward's method uses ANOVA the semi-partial R-square and R-square values are used in grouping the cluster.
The ccc pseudo F and t-squared values are used in selecting the optimum number of clusters.

Here we can see that ccc is having a local peak at 3 clusters with a higher peak at 5.
F statistics has a peak at 3 clusters and the t-squared statistics suggest 3 or 6 clusters.

The cross tabulation matrix shows the predictive cluster values against the species.
We can see that Satosa completely falls into the second cluster, Versicolor falls majorly into the first cluster, whereas Virginica falls in both 3rd and first cluster.
We can note that first cluster has more observaions i.e. 64, and the third cluster has least number of observation i.e. 36.

The scatter plot displayed the clusters in different colors.

Clustering using Two Stage Density Linkage Method

The two stage 'density linkage method' is in proposition with single linkage method that we saw in the earlier example.
In the proc cluster procedure specify method as two stage and we give a k value of 8 to specify that we would like to use kth nearest neighbor method.
The other parts of the code are as explained previously. So we click run.

As you can see last 15 clusters are displayed as seen in the cluster history table.
The table also specifies that the three modal clusters have been formed.

The ccc, pseudo F and t-squared statistics suggest 2 clusters.

From the cross tabulation table we can see that Satosa is completely clustered into cluster 1. Versicolor and Verginica are clustered into cluster 2 and 3 respectively with an exception of only 6 observation points.
The clusters have 50 observations each.
Scatter plot displays the observations points with the clusters in different colors.
Comparing the two results we can see that for this particular data set the two stage density linkage performs better than the ward's method.for clustering.
However clustering is generally used in cases where there is no classification and in such case we might not be able to compare the results with the classified output to verify the goodness of clustering algorithm. So choosing the correct clustering algorithm for any particular data set is a separate process on its own.
As mentioned earlier the fast clus procedure can be used in cases where there are last number of observations.
There is also an ace clus procedure or the Approximate Co-variance Estimation for clustering. Which can be used in prep process data to be subsequently clustered by the fast clus or the cluster procedure.

This concludes the clustering.

1442/2352

Saturday, July 9, 2016

04 Predictive Modeling Techniques_Part 3

4.3 Logistic Regression

Logistic Regression

It's a statistical method that is used in analyzing dataset where one or more independent variables would determine the outcome
The dependent variables are binary (True or False)
Find the best fitting model to describe the relationship between the dichotomous characteristics and a set of independent variables
Logistic regression generates the coefficient of a formula to predict a logit transformation of the probability of presence of the characteristic of interest.

logit(p) = B0 + B1X1 + B2X2 + B3X3 + ... + BnXn

where, p is the probability of presence of the characteristic of interest.

The logit transformation is defined as the logged odds

odds = (p/1-p)

logit(p) = ln (p/1-p)

Method to develop a logistic model

At first we finalize on the data i.e. the target and the predictor variable.
Then we collect the data by defining proper position and observation windows.
Before the data is used for further analysis data preparation and data treatment are done and a hygiene check is performed.
Often we are required to create new variables for the model, so we identify derived variables from the given data.
Next steps are fine classing and coarse classing based on some continuous variables if required.
Finally we get the logistic model and it is analyzed through proper diagnostics.

Linear Regression vs Logistic Regression

First let us recap what Linear Regression is:

Linear regression is mainly used to establish a relationship between a dependent and independent variables. It helps in estimating the impact of independent variable on a dependent variable.
Example - using a linear regression, the relationship between temperature (T) and icecream sales(I) is found to be

I = 2T + 4000

This equation says that for every 1 degree rise in temperature, there is a demand of 4002 ice creams.

On the other hand:

Logistic regression helps in finding out the probability of an event and this event is captured in binary format i.e. 0 or 1.
Example - In order to know whether customers will buy a product or not, run a Logistic Regression on the relevant data. The dependent variable would be a binary variable. either 1 or 0 i.e. yes or no.
In terms of graphical representation, Linear Regression gives a linear line as an output, once the value are plotted on the graph. Whereas, the logistic regression an S shaped line to a logistic function.

Logistic Regression

In this demo we will implement logistic regression on a sample data set.

In this data set we have the survey results of various people about their willingness to subscribe to a news paper.
Their response is pulled with only two outcomes whether they are willing to subscribe to the newspaper or not.
Since it is binary response variable we can use logistic regression to predict the outcome.
The data set consists of three variables and we will look at each of them.
The first variable specifies the gender of the variable. Male or Female.
Second variable specifies the age.
The third variable is the response variable and denotes their subscription willingness, i.e. whether the individual is interested in subscribing or not.
In most analytics tools and languages there are the LogIt and ProbeIt procedures, used in fitting the logistic model.
For this example we will use the probit model to fit a model to the data.
Generally the probit procedure is used to fit a logistic regression model to the probability of a positive response, i.e. the subscription of a newspaper as a function in the variables sex and age.
Specifically the probability of subscribing can be obtained using cumulative logistic distribution function of intercept with logistic coefficient values of sex and age.
By default the probit procedure models the probability of the lower response level for binary data. The probit model is made such that subscription as 1 is defined as the acceptance of subscription,
Correspondingly if it specifies 0 then the subscription is being rejected.
Here in this case we are going to use a probit procedure that calculates the maximum likely hood estimates of regression parameters.
The probability analysis is mainly used to analyze the qualitative dependent variable i.e. a dichotomous value with in the regression framework.
In our case study the value of subscription is binary by nature. i.e. accepted or rejected.
However, the other variables are measured in terms of ordinal values rather than counting into a continuous variable.
In this example the probability of an individual subscribing to a model calculated as a cumulative logistic distribution function of the intercept, B0 and the values of sex and age with their respective coefficients B1 and B2.
For the logistic regression and the demo that follow we will use a codes to generate a desired output. Instead of the graphical options as the earlier analytics functions.
Let us look at the code to perform logistic regression.
In the first section of the code we create the data set.
The data set is named as 'news'.

data news;

The input syntax declares three input variables under the news data set.
Here we specify the three variables the 'sex', 'age' and 'subs'. The '$' symbol is used to specify the variable preceding it is a character data. i.e. in this data the sex contains character data.

input sex $ age subs @@;

We build the data set by giving the values in the same format, sex, age and subs.
You can see that we have created the data set of 40 records.

datalines;

Female 35 0 Male 44 0

Male 45 1 Female 47 1

Female 51 0 Female 47 0

Male 54 1 Male 47 1

Female 35 0 Female 34 0

Female 48 0 Female 56 1

Male 46 1 Female 59 1

Female 46 1 Male 59 1

Male 38 1 Female 39 0

Male 49 1 Male 42 1

Male 50 1 Female 45 0

Female 47 0 Female 30 1

Female 39 0 Female 51 0

Female 45 0 Female 43 1

Male 39 1 Male 31 0

Female 39 0 Male 34 0

Female 52 1 Female 46 0

Male 58 1 Female 50 1

Female 32 0 Female 52 1

Female 35 0 Female 51 0

;

Let us look at the next section

proc format;

value subscrib 1 = 'accept' 0 = 'reject';

run;

The format procedure provides a convenient way to do a table lookup to SAS.
User generated SAS format can be used to assign descriptive labels to data values, create new variables and find unexpected values.
The format procedure can also be used to generate data, extract and merge data sets.
Finally the probit code is given

proc probit data=news;

class subs sex;

model subs=sex age / d=logistic itprint;

format subs subscrib;

We specify the data as news
The class statement specifies the classification or binary values in our data set. i.e. the subs and sex variables.
In the next line we build the model, specifying that the subs value is the response variable, and has to be modeled using the sex and age variables as the explanatory variables.
The final line specifies the model to use the subscribe variable while displayed the values of subs variable i.e. as 'accept' and 'reject' instead of '1's and '0's.
After giving the code specified here we will click run to execute the code.
The result shows the logistic regression of subscription status.

Starting with the iteration history for parameter estimates.
In the integration history table the Loglikelihood, intercept and values for the Sex and Age variables are displayed.
We can see that the loglikekyhood is ended by the 6th iteration.
The next table displays the :

dependent variables
number of observations
name of distribution and
loglikelyhood estimation of the model.

The class level table shows the class variables sex and subs and the two values for the class variables.

The negative of gradient values and negative of Hessian values are displayed.
The final table shows the "Analysis of Maximum Likelyhood Parameter Estimates".
The intercept values is displayed as -5.762.
The estimate of female is -2.42 and this shows that the women are less interested in subscribing to the news paper then men. This is the inference found based on Sex variable.
To find the insights based on age variable we need to see the loglikelyhood parameter of age.
And this parameter estimate shows the value as 0.1649. This positive coefficient values shows that if the age of the individual us higher then his willingness to accept the subscription is higher. And conversely in case of young individuals their willingness to reject the subscription is higher.
The table also shows the values of

standard error,
95% confidence limits,
The chi-squared values, and
P-values for every variables including intercept.

Logistic Regression Case Study

We will take a medical case study to illustrate the applications of logistic regression.
Let us consider a hospital diagnosing a threat level of tumor patient given a few attribute values.
The threat level of the patient is categorized into a binary output 'mild' or 'severe'
The hospital maintains a record of three attributes of evey patients pertaining to their threat levels.
The first variables is the drug influencing values i.e. the dosage of drug given to the patient to cure the tumor.
The second variable is the observed weight of the tumor.
The third variable is a binary response variable and this specifies the seriousness of threat, whether it is severe or not.
Seen here are two sample records of the data. The first patient has a 3.70 drug dosage value 0.825 tumor weight and is categorized as a severe threat level patient.
The second patient as a comparitively lower drug dosage value of 0.60, a tumor weight of 0.75, and is categorized as non-severe threat patient.
We will use a set of data to illustrate the diagnostic measure for detecting influential observations and to quantify their effects on various aspects of the maximum likelyhood fit.
Let us start by creating the data set in SAS.

data tumor;

length Response $12;

input Drug Weight Response @@;

LogDrug=log(Drug);

LogWeight=log(Weight);

datalines;

3.70 0.825 severe 3.50 1.090 severe

1.25 2.500 severe 0.75 1.500 severe

0.80 3.200 severe 0.70 3.500 severe

0.60 0.750 non-severe 1.10 1.700 non-severe

0.90 0.750 non-severe 0.90 0.450 non-severe

0.80 0.570 non-severe 0.55 2.750 non-severe

0.60 3.000 non-severe 1.40 2.330 severe

0.75 3.750 severe 2.30 1.640 severe

3.20 1.600 severe 0.85 1.415 severe

1.70 1.060 non-severe 1.80 1.800 severe

0.40 2.000 non-severe 0.95 1.360 non-severe

1.35 1.350 non-severe 1.50 1.360 non-severe

1.60 1.780 severe 0.60 1.500 non-severe

1.80 1.500 severe 0.95 1.900 non-severe

1.90 0.950 severe 1.60 0.400 non-severe

2.70 0.750 severe 2.35 0.030 non-severe

1.10 1.830 non-severe 1.10 2.200 severe

0.95 1.900 non-severe 0.75 1.900 non-severe

1.30 1.625 severe

We use the data statement to create the data set and here we name the data set as tumor.
The length statement is used to specify the maximum value that a variable can take.
Here we specify the length of the Response variable as 12.
The input statement specifies the variables in the data set. i.e. the Drug Dosage, Weight of tumor and the Response.
To perform the logistic regression we will perform a log function on the two explanatory variables, drug and weight. We declare two variables LogDrug and LogWeight calculated as log function of Drug and Weight variables respectively.
Next we have the datalines statement.Note that the datalines statement is used with the input statement to read data that you enter directly in the program, rather than read data stored in an external file.
Now we specify all the data values corresponding to the input format of Drug Weight and Response variable.

ods graphics on;

title 'Occurrence of tumor';

proc logistic data=tumor plots=effect;

model Response=LogDrug LogWeight;

run;

ods graphics off;

Then define the ods graphics on. Ods statistical graphics is a functionality that is mainly used for easily creating statistical graphics. With ods graphics over 60 statistical procedures can produce graphs as automatically as they do tables. For statistical procedures that support ods graphics, you invoke the functionality with the statement 'ods graphics on'. Graphs and Tables created by these procedures are then integrated in your ods output destination.
We specify the title of the specification as 'Occurrence of tumor'.
We use the logistic procedure for this case study instead of the probit model that we saw earlier.
So here we specify proc logistic data=tumor. We will specify plot=effect to get the predictive probability output. The plots are part of the ods package. And SAS provides the lists of plots that can be used along with the logistic regression procedure.
Here we will use a simple probability plot to model our output.
Then define the model Response=LogDrug LogWeight; The model statement names the response variable and the explanatory effect. Including co-variance main effects interactions and nested effects.
Here the Response is the dependent variable and LogDrug and LogWeight are explanatory variable.
Finally we click run to get the results in the results tab.

The first table gives the Model Information. We have used the tumor data set with the Response as Response variable. The response variable has two levels 'severe' and 'non-severe'. The data has been modeled as a logit procedure by using the Fisher's scoring Optimization Technique.
There are 39 observations in the data and all of them have been used to build the logistic model.
The response profiles shows the statistics of the Response variable. i.e. there are 19 'non-severe' outcomes and 20 'severe' outcomes for the Response variable in the data set.
The model fit statistics gives the criteria for goodness of fit of a model.
The most commonly used criteria is AIC, which deals with the goodness of fit of a model against complexity of a model. When we construct same model on the same data, the AIC value of the different models can be used to compare them against each other and decide on the model that best fits the data.

In the next table various tests are performed to test the NULL hypothesis and the results are displayed.
The NULL hypothesis is that the response variable cannot be modeled from the given explanatory variables i.e. Beta=0.
All the tests result shown here of a very low P-value and thus we can deduce that there is a model where the response variable can be explained as a function of the explanatory variable.
Next we come to the most important result that helps in making conclusions on the data. "The Analysis of Maximum Likelyhood Estimates" shows the estimate values for Intercept Drug and Weight variables.
Result of the model shows that both the drug and weight are most significant to the occurrence of tumor as shown from the P-values of 0.0131 and 0.0055 respectively.
Their positive parameter estimates indicate that higher the drug consumption rate or a larger weight of tumor is likely to increase the probability of severe tumor threat.
The following table shows the ratio of odds estimates and the association of probabilities and responses.

The effect plot are shown under the final section under influence diagnostics.
It shows different residuals and leverage plots.

The final plot shows the predicted probabilities for the response variable with 95% confidence limits.
As shown in the legend the blue circles denote the observed values and the predicted probability is shown by the blue line.
From the predicted probabilities model outputs the predicted values of the response variable either 0 or 1.

Conclusion:

To conclude in this case study we fit a logistic model for the given tumor data set and find that the direct dosage and weight of tumor strongly affect severity of the threat for the patient.