4.1 Predictive Modeling Techniques
Regression Analysis
- Regression analysis mainly focuses on finding a relationship between a dependent variable and one or more independent variable.
- Predict the value of a dependent variable based on the value of at least on independent variable.
- Coefficient explains the impact of changes in an independent variable on the dependent variable.
- Y = f(X,B)
where: - Y is the dependent variable
- X is the independent variable
- B is the unknown coefficient
- Widely used in prediction and forecasting
Types of Regression
- Regression Models are mainly classified as Uni-Variate and Multi-Variate models.
- In Uni-Variate model the response variable is just affected by one predictor variable. It is the simplest from of statistical analysis.
- In Multi-Variate model the response variable is affected by more than one predictor variable.
- The Uni-Variate and Multi-Variate models can be further classified into Linear and Non Linear models.
- A linear model is fitted with a straight line else it is considered as a non-linear model.
- The Uni-variate Linear Regression model is further classified as Simple and Multiple.
- Usually more than one dependent variable as an influence into dependent variable.
- When one independent variable is used in a regression is called simple regression.
- When two or more independent variable are used it is called multi regression.
4.2 Linear Regression
Linear Regression
- It's a common technique to determine how one variable of interest is affected by another.
- It's used for three main purposes:
- For describing the linear dependence of one variable on the other.
- For prediction of values of other variable from the one which has more data.
- Correction of linear dependence of one variable one the other.
- A line is fitted through the group of plotted data.
- Y = A + BX + E
Where, - A = intercept coefficient
- B = slope coefficient
- E = residuals
- The residual value is a discrepancy between the actual and the predicted value.
- The distance of the plotted points from the line gives the residual value.
- The procedure to find the best fit is called the least-squares method.
Linear Regression
- Above graph illustrates the concept of linear regression.
- In this graph the independent variable Y is plotted on the Y - axis.
- The independent variable is plotted on the X - axis.
- The dots in orange are the observed values of the dependent variable.
- The fitted linear regression line is drawn in blue.
- For the given scatter plot data a regression line has been fitted.
- The regression line is the line with the smallest possible distances with itself and each data point.
- As you can see the regression line touches some data points but not others.
- The slope of the line as given by the coefficient data is shown as a blue dotted line.
- The intercept A is the difference between the origin and the regression line.
- The line is fitted using the equation mentioned in the previous slide. i.e. Y = A + BX + E.
- For input Xi the predicted values shown here from the line.
- The observed value for the Xi can be seen from the orange dot.
- Difference between the observed and the predicted value is denoted by E.
Coefficient of determination R Square
- A measure of goodness of fit - How well your model does fit the data?
- R Square is calculated in different ways for different models.
- For a simple linear regression R square is calculated of the square of the correlation coefficient between the observed and the predicted values.
- We will now look at how different values of R square are interpreted.
- In the first figure the line is perfectly horizontal and the R square is 0, which implies no linear relationship.
- In the second figure the R square value is -1 implying a negative linear relationship.
- In the third figure the R square value is +1 implying a positive linear relationship.
How good is the Model?
- How do we use the value of R square to know how good the model is?
- Based on R square value, we can explain how well the model explains the data and the percentage of differences that are explained by this model.
- The differences between observations that are not explained by the model are the error terms or residuals.
- Suppose we have a case in which R square value is 0.74. This means that 74% of variance in the values of the dependent variable is explained by the model and the remaining 26% which is not explained is its residual or error term.
Linear Regression in SAS Studio
- You are familiar to SAS Studio and how to upload a file to SAS.
- In this section we will look at the steps to perform Linear Regression in SAS Studio.
- The steps involved are:
- Open SAS
- Extract data into SAS system from the disk
- Click on tasks
- Then click on statistics and select the option linear regression.
- Select a dependent variable and one or more explanatory variable.
- Click on methods tab, define the value for confidence level and check on include intercept.
- Click on options tab and check statistics plot if required.
- Click Run.
- Regression output will be displayed in results window.
Linear Regression in SAS Studio (Demo)
- Open the SAS Studio
- Use the PROC Procedure to upload a new file.
- Here for our convenience we will use the same CARS sample data set from the library.
- Here we will check the price variable MSRP being affected by all the other factors like Weight, Cylinder and Horsepower.
- If we can find the relationship between these factors and the MSRP, we can then make the prediction of future values of MSRP given the values for other factors.
- In order to perform this, click on Tasks and select Linear Regression under Statistics.
- Once the Linear Regression Data tab is displayed select the CARS data set under the DATA.
- Under the Roles Section select the Dependent Variable as MSRP.
- Select the Explanatory variables as Weight, Cylinders and Horsepower.
- Click on the Methods tab.
- Here you can specify the confidence level.
- Check Include intercept to include intercept in the model.
- Choose the Model Selection Method from the following:
- None
- Forward Selection
- Backward Selection
- Step wise Selection
- Minimum R square Improvement
- Maximum R square Improvement
- All possible regression
- You can also choose the model selection statistics and plots
- Click the Options Tab
- Here you can select the values to be displayed including
- Parameter Estimates
- Standardized Regression Coefficient
- Confidence limits for estimates
- Sums of Squares
- Sequential sum of square (Type I)
- Partial sum of square (Type II)
- Partial and Semi Partial Correlation
- Squared partial correlation
- Squared semi-partial correlation
- Diagnostics
- Analysis of influence
- Analysis of residuals
- Predicted values
- Colinearity
- Heteroscedasticity
- Autocorrelation
- You can also select the Diagnostic and Residual PLOTS.
- Diagnostic Plots
- Display as Panel of Plots
- Residual for each explanatory
- Click on the Outputs tab.
- Here you can choose to display the parameter estimates of the eliminates data set and the statistics data set
- Under each of these you can choose the data to be saved in the particular output data sets.
- For this example to keep it simple we will use the default settings for the linear regression.
- Finally click Run.
- The Linear Regression results will be displayed in the results window.
- Let us now look at the results.
- It shows the dependent variable MSRP.
- Number of Observations Read and Used to construct the model.
- The Analysis of Variance tab provide the degrees of freedom, sum of squared and mean squared values. From these values the F statistics is calculated and the corresponding P-value is displayed.
- The result shows the P-values as 0.0001 and this is less than the standard value 0.05. So we reject the NULL hypothesis and conclude that there is some linear relationship between the MSRP values and the given factors.
- The R-squared value displayed as 0.69 which means that the 69% of the relationship can be explained i.e. the three variables weight, cylinder and horsepower can explain the MSRP variable to an extent of 69%.
- The parameter estimates provides the values of intercepts and regression estimates.
- The intercept value is -8817 and the table provides the corresponding regression coefficient for all the independent variables. From the parameter estimates we can draw the regression equation as.
MSRP = -8817 - 3.49 * Weight + 390 * Cylinder + 240 * Horsepower - The regression equation helps us in predicting the future trend of MSRP given the values of weight cylinder and horsepower.
- The P-values for the weight and horsepower variable are less than 0.05 so we are rejecting the NULL hypothesis and we make a conclusion that there is a relationship between these variables and the MSRP variable.
- However in case of the cylinder variable the P-value is greater than 0.05 and thus we are not rejecting the NULL hypothesis. There is probably no relationship between the cylinder and MSRP and hence it is not necessary to consider cylinder as an independent factor for MSRP.
- The plot shows the observed vs the predicted values for MSRP.
- The next section show the fit diagnostic for the MSRP variable.
- The normality plot, the Cook's Distant plot, and a histogram of the residual values.
- The final section shows the Residual by Regressors values for the MSRP variable with the factors weight, horsepower and cylinder. i.e. the plot show the residuals against the different values weight, horsepower and cylinder.
Linear Regression in Excel
- Click on Office button and select Excel options
- Select add-ins, then select Analysis toolPak and click on Go
- Check on Analysis toolPak and click on Ok.
- Click on Data from menu bar on the right most side you will see the Data Analysis tool
- Click on Data Analysis
- Click on Regression
- Select the input Y range and input X range
- Check on 'Labels in first row' option
- Check on Residuals and Normal probability plot if required
- Click Ok and results will be obtained in new worksheet ply
Linear Regression in Excel Demo
- For this linear regression let us consider retail sales of a store in California.
- This retail store has good sales through out the year and the management wants to further increase the sales.
- They invest into advertising their products and the store, In hopes of increasing the sales and thus increasing the profit.
- However they want to know if the investment yeilds enough profits, and if an investment in advertising increase their sale.
- To find this relationship the records of total sales and the amount spent on advertisement is being tracked.
- We have the data set of two variables.
- The first variable Sales, specifies the amount of total sales today.
- The second variable Advertisement, specifies the amount that is being spent on the advertisement.
- We will try to find out the relationship between sales and advertisement, using simle linear regression in excel.
- The regression method helps in finding the relationship between two factors i.e. the dependent variable and the independent variable.
- This helps us in fitting a regression model and with this we can make future prediction of dependent variable.
- Let us start performing linear regression in excel.
- Open the data set in excel.
- To perform regression in excel click on Data ribbon and click on Data Analysis tool.
- On clicking data analysis tool you will see various data analysis tool options.
- From this select Regression and click Ok.
- The regression window is displayed.
- From this select the input Y range as Sales Column and input X range as Advertisement Column.
- In our example we want to find out the impact on sales due to the expense in advertisement.
- The dependent variable should be specified in the input Y range. So here we will select Sales as our dependent Y range.
- The independent variables is specified in the input X range. Here we will select the advertisement column as our input X range.
- Then check on Labels in order to show that the label names are being specified in the first row.
- The confidence level option can be used to change the confidence level.
- Here we will use the default option of 95% confidence interval.
- In the output options, click the New Worksheet Ply radio button to get the results in a new worksheet ply.
- In the residuals sections there are options to display Residuals(Error Term), Standardized Residuals and Residual and Line Fit plots.
- The normal probability plot can also be displayed to check the error distributions.
- We will click on Line Fit Plots and Residuals Plot.
- Click on Ok.
- The results will be displayed in a new worksheet.
- The Regression Statistics result window provides the R-squared value, Multiple R-squared value and Adjusted R-squared value.
- The R-squared value is 0.97 and this specifies that the 97% of the variation in Sales variable can be explained by the changes in the advertisement value.
- From this we can also infer that the variables Sales and Advertisement are 97% related to each other and there is a high correlation between them.
- If the F value is 5811 and the P-value obtained is less than 0.05 specifies that we should reject the NULL hypothesis and shows the fact that there is a relationship between the dependent and independent variable.
- From this result we can construct the regression line as
- Sales = 31.35 + 4.81 * Advertisement + error term.
- This table also specifies the T-stat and P-value.
- From this fitted regression line we can predict the future value.
- In our example we can predict the future values of Sales with given values of advertisement.
- From this we can draw an useful insight which can help in business growth. i.e. if the amount spent on advertisement is increased then the sales of that particular product will also increase.
- The next table defines the predicted values, number of observations and Residuals.
- This is displayed to construct the residual plots.
- Incase the plots are not needed we can not check the options.
- Finally we have the plot where the Sales Values (Y-axis) is plotted against the Advertisement (X-axis) where sales and actual sales is plotted and the regression line is also displayed.
- You can note that all predicted sales values fall on the regression line and the residual terms are being minimized.
- This helps in forecasting the future trend in sales from advertisement.
- The graph we obtained shows a positive trend, and thus we conclude that the sales and advertisement variables are positively correlated.
Case Study
- This survey result provided the death rate, number of doctors, number of hospitals, annual per capita income and population density for every city.
- From the given data, we will try to find relations within the data and thus bring out useful insights, which would ultimately help the government organization in taking necessary steps to improve the society helth case facilities.
- Specifically, we need to find the factors that cause an increased mortality rate.
Sample Data
- The above is the sample data.
- In this case we want to find how death rate is related to all other variable.
- Let us look at the data in hand.
- We have 5 variables.
- X1-Death Rate - Specifies the death rate per 1000 residents.
- X2-Doctors Availability - Specifies number of doctors per 100000 residents.
- X3-Hospital Availability - Specifies number of hospitals per 100000 residents.
- X4-Annual Per Capita - Income in 1000s of dollars.
- X5-Population Density- Specifies the details of Population Density per Square Mile.
Regression using SAS
- Load the data set into SAS Studio using the proc import procedure
- Select the Linear Regression under Statistics section from Tasks Tab on the left.
- We are going to consider the first variable which is the death rate per 1000 residents as the dependent variable and all the other variables as explanatory variables.
- After specifying these details in the linear regression window, click on run option.
- Then you will get the regression results in the results window.
- The regression results provide the number of observations in our data set and the number of observations used in fitting the regression line.
- The Analysis of Variance table gives us the ANOVA output for the fitted and residual values.
- We hope that you have a basic understanding of ANOVA from previous section so we will start with the obtained results.
- For the ANOVA in multiple linear regression, the NULL hypothesis is that the values of data i.e. the intercepts and coefficients in the equation is 0. It does not specify which of the coefficient is not zero. The alternative hypothesis is that atleast one of the variable is linearly related to the response variable.
- We can see here that the Sum of Squares for the Model. i.e. the Square of Variation between the response and its means is 20.65.
- The Mean Square value is 5.16. Which is obtained by dividing the Sum of Squares by the Degrees of Freedom (df).
- Similarly for the Error, we get the Sum of Squares as 123.07 and Mean Square as 2.56.
- With these values we calculate the F-value and the P-value.
- The P value of obtained as 0.1 which is not less than the default value of 0.05. Thus we do not reject the Null hypothesis.
- And we will conclude that there are external factors apart from the variables we have in our data set that make an effect on the response variable i.e. the death rate.
- The R-Square and the Adjusted R-Square values gives us the goodness of fit of the model.
- The R Square value of 0.143 explains that 14% of the variability in values of death rate can be explained by the given data.
- We can thus deduce that the given variables do not make a significant impact on the death rate of a particular city.
- Now lets concentrate on the parameter estimates table.
- The intercept values that is obtained is 12.26 with the standard error of 2.02. It shows the parameter estimates for the variables X2, X3, X4, and X5. From these results we can fit a regression line and as per the equation:
Death Rate = 12.26 + (0.007 * X2) + (0.0005 * X3) - (0.33 * X4) - (0.009 * X5)
- This equation will help us in finding the relationship between the variables and predicting the future values. i.e. by applying the values of all other independent variables in this regression equation, we can find the predicted value of death rate.
- Next the scatter plot of the observed values and predicted values is displayed.
- There are a couple of points that completely deviate from the line, with overall we can see that the prediction is moderate.
- We will look at few of these plots in detail:
- As we know earlier residuals are the difference between the predicted and estimated values.
- The first plot shows us a graphical summary of the predicted values against the residuals.
- The residual quantile plot show the plot of quantiles i.e. the points taken from the cumulative distribution functions at regular intervals against the residuals. It shows the probability distribution of the dependent variable.
- The Cook's distance plot shows the calculated values for each observation. It tries to detect points which are more influential than the other points. Though there is no particular rule on interpretting the Cook's distance but we can know that the points that have very large values will require further investigation.
- The percent residual plot shows the Histogram of the residuals. The curve shows that the errors follow a normal distribution. The linear trend of the quantile residual plot also supports that the error follows a normal distribution.
- Finally we see the plot of each individual variable against the residuals. We can see that there is no particular trend. There residuals are placed around the zero line which says that the assumption that the model is linear is acceptable. Also the variances in the error term are equal. There are a couple of outliers in the whole data set which can be seen from the plot with the maximum deviation from the zero line.
Relationship between the variables.
First we will find the relationship between the availability of doctors (X2) and the population density (X5)
- To find this relation do a linear regression analysis with the doctor availability as dependent variable and population density as explanatory variable. Then click on run.
- The resultant window shows the intercept and coefficient values from which we can derive a regression equation and help in prediction and finding a relation.
Doctor Availability = 117.86 - 0.01 * Population Density(X5)
- From this equation we can infer that with the increase 0.01 the population density the availability of doctors will decrease by 1 unit.
Next we will find the relation between the number of hospitals(X3) and the population density(X5)
- To find this relationship perform a linear regression between the Hospitals Availability(X3) as Dependent variable and Population Density(X5) as Explanatory variable.
- Then click on Run.
- The resultant window shows the intercept and coefficient value from which we derive a regression line.
- The regression equation is:
Hospital Availability = 444.22 + 1.31 * Population Density(X5)
- So this equation specified that with a 1.31 rise of population density there is an increase in one unit hospitals.
Next we will find the relation between the Number of Doctors(X2) and the Number of Hospitals(X3)
- Perform a Linear Regression between the Doctors available(X2) as dependent variable and Hospitals available(X3) as Explanatory variable.
- Click on Run.
- The regression equation obtained to make a prediction is:
Doctors available(X2) = 96.23 + 0.033 * Hospitals available(X3)
- This equation specifies that with a 0.03 rise in Hospital Availability there is an increase in 1 unit of Doctor availability.
Finally we will find the relation between the Number of Hospitals(X3) and the annual per capita income(X4)
- Perform a Linear Regression between the Hospital available(X3) as dependent variable and per capita income(X4) as Explanatory variable.
- Click on Run to obtain the results.
- The regression equation obtained to make the prediction is:
Hospital Available(X3) = 509.5 + 8.50 * Annual per capita income(X4)
- This equation shows that with increase of 8.50 Annual per capita income there is a rise in one unit change in Hospital Available.
Regression using Excel
- Open the file in Excel.
- Click on Data ribbon and click on Data Analysis tool.
- We will perform linear regression on the same data set.
- Open the Regression window and specify the input x and input y range.
- The input y range is the dependent variable i.e. X1 and the input x range are the independent variable i.e. X2, X3, X4, and X5.
- We will output the Residuals and Residual plots and the Normal Probability plot.
- Let us look at the Linear Regression output
- The statistics shows the R-square, Adjusted R-square, Standard Error.
- We can see that we got the same output as the SAS output with 14% variation explained by the data.
- The number of observations as we know is 53.
- Next the ANOVA results are displayed.
- Since we extensively covered these results in previous session we will not go much into the details.
- We can note that the values that are obtained are same as the output obtained from SAS studio.
- In the next sections we see the Coefficient, Standard Error, t-stat, P-value, lower and upper 95% levels.
- The coefficients of the intercept, X2, X3, X4 and X5 give the same value and we get the same equation for the regression line.
- Next we have the residual output section. Here the predicted values of X1 are displayed along with the residuals.
- For example the first observation as a death rate of 8 and the predicted death rate is 8.97. Hence the residuals or the error term is -0.97.
- The probability output shows the predicted values of X1 in increasing order starting at 3.6.
- The percentile column shows the quantile value i.e. the points taken from the cumulative distribution function. You can notice that the percentile values start from 0 and ends at 99. ie. places in the plot where the particular X1 value occurs.
- With the percentile and the X1 values the normal probability plot is displayed.
- Since it almost follows a linear trend we can say that the error follows a normal distribution.
- With the predicted X1 values we can plot a graph of predicted v/s actual values.
- We will make a scatter plot with actual values of X1 v/s the predicted values of X1.
- We will change the axes values to start at 4 to verify whether we get the same output as SAS.
- to get a trend use the trend-line option from the ribbon and select a linear trend. This will fit a linear trend to the scatter plot.
- We can see that the plot is identical to the one obtained from SAS.
- Since most values are closer to the trend line, we can see that the regression model performs quite well, except of couple of outliers.



































