BA using SAS: 04 Predictive Modeling Techniques

Time Series

Time series data is an ordered sequence of observation on a quantitative variable measured over an equally spaced time interval.
Time series are used in statistics, signal processing, pattern recognition, econometrics, mathematics, finance, whether forecasting, earthquake prediction, electroencephalography, control engineering, astronomy, communications engineering and other places.
Time Series analysis is used in

Analysing time series data
Forecasting the future value of the variable under consideration.

In time series analysis it is assumed that the data consists of identifiable components and random errors which usually makes the pattern difficult to identify.
E.g. Sales of quilts and blankets in a store across a period of five years.

Components of a Time Series

Long term trend - The smooth long term direction of time series where the data can increase or decrease in some pattern.
Seasonal variation - Patterns of change in a time series within a year which tends to repeat every year.
Cyclical variation - Its much alike seasonal variation but the rise and fall of time series over periods are longer than one year.
Irregular variation - Any variation that is not explainable by any of the three above mentioned components. They can be classified into - stationary and non-stationary variation.

When the data neither increase not decreases, i.e. its completely random its called stationary variation.
When the data has some explainable portion remaining and can be analyzed further then such case is called non-stationary variation.

Decomposition of Time Series

We can see a sample plot of decomposition of time series.
The observed sales of quilt and blanket as discussed in the previous section is displayed in the first box.
The observed time series data is decomposed into the trend, seasonal and random components.
As you can see that the trend component shows and increase and decrease in sales over a period of time. Overall we can infer that the sales increases.
The seasonality component shows the seasonal pattern of the sales data.
The random component counts for the differences in the trend and the seasonal data from the observed data.

Moving Average

Moving average is a widely used indicator in technical analysis that helps in smoothing out actions by filtering out the noise
i.e. the residuals from random fluctuations.
Moving average is also otherwise called as trend follower or lagging indicator because it always depend on historical data.
Commonly used moving averages are

Simple Moving Average (SMA)
Exponential Moving Average (EMA)

A simple moving average is calculated by adding the value for a number of time periods and then dividing this total by the same number of time periods.
Exponential moving average gives a higher weighting to recent prices but in case of simple moving average it assigns equal weighting to all values.

Goals of Time Series Analysis

Descriptive:
Identify different patterns in correlated data which helps in finding the trend and seasonal variation
Explanation:
Understanding and modeling the data
Forecasting:
Predicting the short-term trends from the previous existing patterns
Intervention Analysis:
How does a single event change the time series?
Quality Control:
Deviation of a specified size indicate a problem.

Steps for Simple Moving Average in Excel

Open the file for performing moving average
Make sure that the Analysis ToolPak add-in is installed in Excel
If not, install it by selecting Add-ins from the Office button and selecting Manage add-ins
Click on Data Analysis
Select Moving Average and Click Ok
Select the input range for the data by clicking and dragging the data
Check on "Labels in First Row" if data extracted has column name in first row.
Specify the interval value as required
Check on Chart Output and click ok.
The results will be obtained in a new worksheet ply by default.

Demo for Simple Moving Average in Excel

Let us take an example of a watch manufacturing company in the center of belgium.
The company is experiencing fluctuations in the net sales quantity and hence we want to find if there is any trend involved in it.
The company maintains a record of monthly net sales quantity over a period of 4 years.

These kind of data are time series data as it is stored over a long period of time.
To find these trend changes we need to perform time series analysis on the data.
We might need to know the future net sales quantity of products in that company.
And this is where analytics will help.
This helps in discovering the historical pattern that is hidden within the data.
And with this discovered pattern we are going to explore the future pattern, i.e. how the future net quantity sales would be.
The management and the sales heads of that company are really interested in finding where their sales quantity trends increase and where their trends decrease.
In out time series data we have different crest and truss.
And if all these trends are included in prediction model. We will not be able to get the accurate results.
So in order to make our prediction results still more accurate you need to smoothen your results.
In order to smooth en the non-normalized patterns we can use two methods.
The simple moving average and exponential moving average.
In order to perform simple moving average in excel. Open the data set in Excel.
Select the data analysis tool from the top right corner of the data ribbon.

Once the Data Analysis Windows is displayed, select the Moving Average option and click on ok.

A pop-up window of Moving Average is displayed.
Select the Input Range as net sales by clicking and dragging the mouse on the data.
Check "Labels in First Row" option which specify that the label names are in the first row.
Then define the moving average interval as 5. i.e. we use the last 5 months data to average and smooth en the time series data.
Define the Output range on which you require the output results by clicking on the desired cell you want to display the desired output.

Check on "Standard Errors" and "Chart Output" and finally click Ok.
The results are displayed in the cells, and a line chart is displayed with the predicted values.
The results are calculated with the moving average interval of 5 and so the first 4 values are specified as N/A. The remaining predicted net sales quantities are obtained by averaging the previous 5 values.
It also provides the residual term for the particular net sales value. i.e. the difference between the actual net sale quantity and predicted net sales is being calculated. Since the interval is given as 5 the residual term is calculated as the square root of the average differences between the past 5 values and the predicted value.

Then a line graph is plotted between the two values i.e. predicted and forecasted values.
You can see that the forecast ed values in red are more smoothen, compared to the actual plotted values in blue.
More smoothing can be achieved by increasing the interval, but it might loosen the effect of more recent sales values.
In simple smoothing averages the weights given to the past n interval values are equal.
However in case you want to provide more weightage or consideration to the current value and the weightage of older values should decrease exponential, then you should allow exponential smooting average.

Steps for Exponential Moving Average in Excel

Open the file for performing exponential moving average
Make sure that the Analysis ToolPak add-in is installed in Excel
If not, install it by selecting Add-ins from the Office button and selecting Manage add-ins
Click on Data Analysis
Select Exponential Moving Average and Click Ok
Select the input range for the data by clicking and dragging the data
Specify the damping factor as required.
Check on "Labels in First Row" if data extracted has column name in first row.
Specify the interval value as required
Check on Chart Output and click Ok.
The results will be obtained in a new worksheet ply by default.

Demo for Exponential Moving Average in Excel

Open the data set in Excel.
Open the Data Analysis from Data Tab.
Select Exponential Smoothing from the list of option

The window for exponential smoothing is displayed.

Here define your input range for which you need to perform exponential smoothing.
In our example we are going to provide the net sales quantity as our input range.
Then define the damping factor value. A high value of damping factor may dampen your results more, giving the results more smooth values.
In our example we will smoothen the data with value 0.45.
Check the Labels in First row option to specify that labels are being specified in your first row.
Specify your output range where your results need to be displayed.

Check the Chart Outputs and Standard Errors and finally click Ok.
The results will be displayed in the output range and a chart is generated.

The results provided the predicted sales quantity value with an exponential smoothing damping factor of 0.45.
The need of using this exponential smoothing is to provide more weightage to recent net sales quantity and less declined exponential values to the older value. i.e for the value of 0.45 a 55% weightage is given to the previous value and exponential lower values are given to the older values.
The results also provide the residual term to introduce the difference between the actual values and predicted values.
The graph shows a line output of the Actual and Predicted values with exponentially smoothened effect.
If the damping factor is very high then your results will get more smoothened out and you can for a normal straight line.
Thus using the simple moving average and exponential smoothing functions we have generated a trend of time series data and have arrived a way of predicting the future net sales quantity data.

Case Study - Time Series Analysis.

Let us take the example of a retail store which contains the sales data for last 15 weeks and we need to see how the retail store is going to perform in the future weeks. We need to know the demand before hand. And this is where Analytics will help. For this prediction we are going to perform a time series analysis to this sales data. This helps in discovering the historical patterns and is hidden within those data, and with this discovered pattern, we are going to explore the future pattern and how the future sales would be. Just to give a business perspective we want to have predicted demand so that the retail store accordingly plans the stock levels and is ready in case the demand suddenly goes up or down. In our data set we are having two variables. The first variables specifies the week and the second variable specifies the demand in that week for the retail store. Now let us look at the data set given through a chart.

We see that in our time series data there are many fluctuations. For the sake of simplicity we will assume that the fluctuation are random. Now we need to take care of these fluctuations as it may affect our prediction results. So in order to reduce these fluctuations, we need to smoothen out these crest an truss in our prediction model.
The most commonly used smoothing techniques are simple moving average and exponential moving average. There are few more but we will for the moment discuss the common ones.
Let us start with calculating the simple moving average for our data set.
A simple moving average uses the average of most recent end data points in a time series and helps in forecasting for the future time period. We will make the explanationexplicit by doing this for our data set.
For our data set we are going to perform a moving average for the 15 week demand with an interval of 5 week. i.e. prediction for a particular week is done by doing an average of the last 5 weeks.
Let us do this on Excel.
Select the Data Analysis from data tab, and then option Moving Average.
The moving average window pops-up.
Here specify the demand as your input range.
Check on Labels in first row (in case first row contains label names.)
Define the Moving Average interval as 5.
Specify the output range and check on the chart output to display your demand and moving average as chart.
Now you will get the forecasted demand of 5 week moving average.
Also you can find out the absolute error by calculating the difference between forecasted demand and actual demand.
From individual absolute errors we obtain the mean absolute error. Alternatively you can also get the mean absolute error by checking the option of standard errors from the moving average pop-up window while feeding the inputs for the model.
The mean absolute error helps in choosing the right interval for moving average by applying trial and error method we can find the best interval that provides you most accurate forecasted value.
If the absolute mean error is less than we will get the high accurate prediction results.
Now let us observe the prediction values given by moving average. Note that the moving average are close to just the previous average.
what would happen if we increase the interval from 5 to 8 weeks. Let us again repeat the steps mentioned earlier to calculate the moving average forecast with interval as 8 weeks.

After looking at the new chart and comparing with the old we see that the newer forecast is much more smooth or averaged.As we keep increasing the intervals the forecasted values will be much more averaged out.similarly the decreasing intervals leads to much more fluctuation values of predicted demand. Ofcourse what interval to choose as mentioned earlier depends upon absolute mean square error.
One of the limitations of moving average forecast is that it does not predict the spike in sales for a week and averages out the demand.This leads to inaccurate forecast for weeks whether there is an increase in sales as compared to earlier weeks. To understand this look at Week7, Week11 and Week13. The demand is way higher than the moving average value.
Also the model does not take care of the observation values from older time line. This can easily create problem if the recent sales are not normal and the factor increase in the sales was only and anomaly for certain time period. The next model we are going to use for forecasting takes care of giving a weightage to the previous demands and also the latest ones.
We will do exponential smoothing to take care of all the demand values and accordingly come out with a forecast. We assign a damping factor. The factor gives some weightage to the recent values and some weightage to the previous forecast values.

Previous sales get factored in the forecast value. So lets begin with the steps involved to get some smoothen forecast values.

As already explained before, to perform exponential smoothing, click on data analysis tab and select exponential smoothing and specify the input range. Input range in our case will be the demand data set. We then need to specify the damping factor based on which exponential smoothing is to be made. For our case we are going to specify the damping factor as 0.24. Then check on the Label to specify that your first row contains the column names. Then specify the output range and check on Chart Output and click on Ok.

If you look at the chart the forecast values resemble the demand values.But in each of the case the forecasted values seems to mirror the last week demand value. Taking the example of Week7 and Week8 Week8 forecast values seems closer to Week 7 demand value than Week8 actual demand value.

We do the earlier steps again and put damping factor as 0.8. The output chart that comes out is completely different from the earlier chart. The forecasted values are much smoother than the earlier values. This explains the effect of the damping factor. More the damping factor, smoother the graph would become.
Choosing the right damping factor is again on mean absolute error and sometime depends on business experience.
But even this model does not solve all the problems mentioned after moving average forecast. We still have not solved the problems of sudden spikes in particular week.
Coming back to the original assumption of our problem, the spikes are random, and hence we use smoothing techniques but what if there is a pattern to the spikes.
Ofcourse there are more advanced models such has hotwinters, arima, multiple linear regression, etc. which can take care of the sudden spikes and forecast the value of demand more accurately. This is where advanced analytics comes into the picture.

Time Series - Business Case Study

Now that we have covered the fundamentals of analytics course through Excel and SAS let us take a business case where all the concepts learned from lesson 1 to now make a use. The intention here is to give a way the business problem is typically solved.

In our business case we have a retail store who wants to derive values from their point of sales data across different stores.

Point of sales data is data which is captured in each store and covers the billing process from the products sold. It does not give any details about the customers themselves, but a lot of useful information from the products being sold has come out as data as well we will see going further in the case.
Now let's look at the data set given. The first tab shows the weekly data starting from 30 Dec 2012 to 31 Mar 2014.
The data is for 5 product categories, namely Kitchen, Grains, Pulses, Vegetables, and Merchandise. The numbers are quantity sold of each type.

Second tab specifically has data for broccoli 500 gm packets sold across outlets in a particular city.
Rainfall gives the value in centimeter for the particular city across time.
Season column give the value of 1 for winters and 0 for other seasons. Winter assumed from Nov to Feb.
We need to first understand the data and start to work over.

Scatter Plot

Going back to category wise data let us draw scatter plot understanding the quantity sold for each category across timeline.

The scatter plot shows that pulses and kitchen's items are sold less compared to other. But there are no fluctuations and sold uniformly.
Merchandise Sales quantity increases and decreases randomly and there seems to be no pattern in crest and truss.
Grains are the highest selling category and have fluctuations where there seems to be a pattern.
Similar is the case with the vegetable category.

Descriptive Analysis

Now let us do some descriptive analysis on the data.
We do a descriptive statistics on each of the category together. So what do we observe?

Same as in the scatter chart.
Merchandise has the highest standard deviation and hence so much fluctuation.
Grains, Pulses and Vegetables are negatively skewed, whereas Kitchen and Merchandise are positively skewed.
The range of pulses is minimum and also the standard deviation is the lowest, which signifies that the demand for pulses is uniform throughout the year and does not vary much.
After understand the descriptive part of the variables we will look into the diagnostic part of Analytics.
This part will answer any relationship we have among variables or between dependent and independent variable.

ANOVA

We will first do an ANOVA test to see the relationship between the 5 categories.

Again using excel we go to the Data Tab, Select Data Analysis, Choose ANOVA single factor from the options and select all the 5 columns except the time column including the header. Choose alpha as 0.05.

The results from the test shows the p-value is much less than the significant value of 0.05 chosen for the test.
So we reject the hypothesis that the means of all the 5 categories are the same.
This leads us to the direction that the demand quantity of these groups is independent of each other and should be taken care separately.
We can still find how different the means are by doing a pairwise t-test for each combination. But that again does not form the part of analysis here.

Correlation

Next test we perform with the data given is the correlation between the product categories demand quantity.
Why did we choose correlation and not chi-square test? Because the values of each of the variables here are continuous numerical and non-categorical.

We again click the data analysis tab, choose correlation and put the input range including the 5 columns. So what do we see in the results?
Looking at all the values from 0 to 1 we see interesting observations.
Grains and Vegetables has a correlation of 0.9. Meaning the demand quantity follow each other and have similar patterns.
Even Pulses and Vegetables has a correlation of 0.77. Again demand quantities of Vegetables and Pulses follow each other but to a lesser extent than between grains and vegetables.
All the values are positive and none of them negative.
All the demand quantities are positively correlated to each other.
None of the categories affect negatively or cannibalise the demand.
What it also implies is that we predict the demand of vegetables and get a working model.
The same model can be used to predict the demand of Grains.to a large extent and pulses to a smaller extent.

Let us now look at the second tab and understand the data given.

BA using SAS

Sunday, July 17, 2016

04 Predictive Modeling Techniques_Part 5 Time Series