Saturday, July 9, 2016

04 Predictive Modeling Techniques_Part 3

4.3 Logistic Regression

Logistic Regression

  • Logistic regression is a statistical method for analyzing a data set in which one or more independent variables determine an outcome.
  • The dependent variable is binary (true or false).
  • The goal is to find the best-fitting model to describe the relationship between the dichotomous characteristic of interest and a set of independent variables.
  • Logistic regression generates the coefficients of a formula that predicts a logit transformation of the probability of presence of the characteristic of interest:
                logit(p) = B0 + B1X1 + B2X2 + B3X3 + ... + BnXn
                where, p is the probability of presence of the characteristic of interest.
  • The logit transformation is defined as the logged odds:
                odds = p / (1 - p)
                logit(p) = ln(p / (1 - p))
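
The odds and logit definitions above can be checked numerically. The following is a minimal Python sketch added for illustration; it is not part of the SAS demo, and the helper names odds, logit and inverse_logit are hypothetical.

```python
import math

def odds(p):
    """Odds of an event with probability p, for 0 < p < 1."""
    return p / (1.0 - p)

def logit(p):
    """Log odds: ln(p / (1 - p))."""
    return math.log(odds(p))

def inverse_logit(z):
    """Map a log-odds value z back to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

if __name__ == "__main__":
    for p in (0.1, 0.5, 0.8):
        z = logit(p)
        print(f"p={p:.2f}  odds={odds(p):.4f}  logit={z:.4f}  back={inverse_logit(z):.2f}")
```

The logit maps probabilities in (0, 1) onto the whole real line, which is why a linear formula in the predictors can be fitted to it; inverse_logit converts a fitted value back into a probability.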

Method to develop a logistic model

  • First we finalize the data, i.e. the target and the predictor variables.
  • Then we collect the data by defining proper performance and observation windows.
  • Before the data is used for further analysis, data preparation and data treatment are done and a hygiene check is performed.
  • Often we need to create new variables for the model, so we identify derived variables from the given data.
  • The next steps are fine classing and coarse classing of the continuous variables, if required.
  • Finally we fit the logistic model and analyze it through proper diagnostics.
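
The post does not define fine and coarse classing. The Python sketch below assumes their usual meaning in scorecard building: fine classing cuts a continuous variable into many narrow bins, and coarse classing merges those bins into a few wider groups. The function names, the bin width, the group edges and the sample records are all hypothetical and chosen only for illustration.

```python
def fine_class(records, width):
    """Fine classing: group (value, target) pairs into narrow equal-width bins.

    Returns {bin_start: [count, events]} keyed by the lower edge of each bin.
    """
    bins = {}
    for value, target in records:
        start = (value // width) * width
        count_events = bins.setdefault(start, [0, 0])
        count_events[0] += 1
        count_events[1] += target
    return dict(sorted(bins.items()))

def coarse_class(fine_bins, edges):
    """Coarse classing: merge fine bins into wider groups split at the given edges.

    `edges` are the lower bounds of the coarse groups, in ascending order.
    Returns {group_start: event_rate}.
    """
    totals = {edge: [0, 0] for edge in edges}
    for start, (count, events) in fine_bins.items():
        # Assign the fine bin to the last coarse edge that does not exceed it.
        group = max(edge for edge in edges if edge <= start)
        totals[group][0] += count
        totals[group][1] += events
    return {g: e / c for g, (c, e) in totals.items() if c > 0}

# Hypothetical (age, response) pairs, made up for illustration only.
sample = [(31, 0), (34, 0), (38, 1), (42, 1), (44, 0), (47, 1), (52, 1), (58, 1)]
fine = fine_class(sample, width=5)
coarse = coarse_class(fine, edges=[30, 40, 50])
print(fine)
print(coarse)
```

In practice the coarse groups would be chosen by looking at the event rates of the fine bins and merging neighbours that behave alike; here the edges are simply fixed by hand.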

Linear Regression vs Logistic Regression

First let us recap what Linear Regression is:
  • Linear regression is mainly used to establish a relationship between a dependent variable and one or more independent variables. It helps in estimating the impact of an independent variable on the dependent variable.
  • Example: using linear regression, the relationship between temperature (T) and ice cream sales (I) is found to be
          I = 2T + 4000
  • This equation says that every 1 degree rise in temperature increases sales by 2 ice creams, on top of a baseline of 4000. At T = 1, for instance, the predicted sales are 4002.
On the other hand:
  • Logistic regression helps in finding the probability of an event, and this event is captured in binary format, i.e. 0 or 1.
  • Example: to know whether customers will buy a product or not, run a logistic regression on the relevant data. The dependent variable would be a binary variable, either 1 or 0, i.e. yes or no.
  • In terms of graphical representation, linear regression gives a straight line once the values are plotted on a graph, whereas logistic regression fits an S-shaped curve, the logistic function.
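
The contrast between the two kinds of output can be sketched in a few lines of Python. This is an illustration only: linear_sales encodes the example equation I = 2T + 4000 from the text, while the function names and the input values are hypothetical.

```python
import math

def linear_sales(temperature):
    """Ice cream sales predicted by the example equation I = 2T + 4000."""
    return 2 * temperature + 4000

def logistic(z):
    """Logistic function: maps any real z to a value strictly between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

# Linear regression: each extra degree adds the same amount (the slope, 2).
for t in (20, 21, 30):
    print(f"T={t}  predicted sales={linear_sales(t)}")

# Logistic regression: the output flattens towards 0 and 1 at the extremes,
# which gives the S-shaped curve.
for z in (-6, -2, 0, 2, 6):
    print(f"z={z:+d}  probability={logistic(z):.4f}")
```

The linear prediction is unbounded, whereas the logistic output always stays between 0 and 1, so it can be read as a probability.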

Logistic Regression

  • In this demo we will implement logistic regression on a sample data set.
  • In this data set we have the survey results of various people about their willingness to subscribe to a newspaper.
  • Their response is recorded with only two outcomes: whether or not they are willing to subscribe to the newspaper.
  • Since this is a binary response variable, we can use logistic regression to predict the outcome.
  • The data set consists of three variables and we will look at each of them.
  • The first variable specifies the gender of the respondent: Male or Female.
  • The second variable specifies the age.
  • The third variable is the response variable and denotes their subscription willingness, i.e. whether the individual is interested in subscribing or not.
  • Most analytics tools and languages provide logit and probit procedures for fitting the logistic model.
  • For this example we will use the probit procedure to fit a model to the data.
  • Here the probit procedure is used to fit a logistic regression model to the probability of a positive response, i.e. subscribing to the newspaper, as a function of the variables sex and age.
  • Specifically, the probability of subscribing is obtained by applying the cumulative logistic distribution function to the intercept plus the coefficient-weighted values of sex and age.
  • By default the probit procedure models the probability of the lower response level for binary data.
  • The model is set up so that a subs value of 1 denotes acceptance of the subscription and, correspondingly, a value of 0 denotes rejection.
  • In this case the probit procedure calculates the maximum likelihood estimates of the regression parameters.
  • Probit analysis is mainly used to analyze a qualitative dependent variable, i.e. a dichotomous value, within the regression framework.
  • In our case study the subscription value is binary by nature, i.e. accepted or rejected.
  • The explanatory variables are of a different kind: sex is a categorical variable and age is recorded in whole years.
  • In this example the probability of an individual subscribing is calculated as the cumulative logistic distribution function of the intercept B0 plus the values of sex and age multiplied by their respective coefficients B1 and B2.
  • For the logistic regression and the demos that follow, we will use code to generate the desired output instead of the graphical options used for the earlier analytics functions.
  • Let us look at the code to perform logistic regression. 
  • In the first section of the code we create the data set.
  • The data set is named 'news'.
data news;
  • The input statement declares three input variables for the news data set.
  • Here we specify the three variables 'sex', 'age' and 'subs'. The '$' symbol specifies that the variable preceding it holds character data, i.e. in this data set sex contains character data. The trailing '@@' holds the input line so that more than one record can be read from each line of data.
input sex $ age subs @@;
  • We build the data set by giving the values in the same order: sex, age and subs.
  • You can see that we have created a data set of 40 records.
datalines;
Female 35 0 Male   44 0
Male   45 1 Female 47 1
Female 51 0 Female 47 0
Male   54 1 Male   47 1
Female 35 0 Female 34 0
Female 48 0 Female 56 1
Male   46 1 Female 59 1
Female 46 1 Male   59 1
Male   38 1 Female 39 0
Male   49 1 Male   42 1
Male   50 1 Female 45 0
Female 47 0 Female 30 1
Female 39 0 Female 51 0
Female 45 0 Female 43 1
Male   39 1 Male   31 0
Female 39 0 Male   34 0
Female 52 1 Female 46 0
Male   58 1 Female 50 1
Female 32 0 Female 52 1
Female 35 0 Female 51 0
;
    • Let us look at the next section
    proc format;
        value subscrib 1 = 'accept'  0 = 'reject';
    run;
    • The format procedure provides a convenient way to do a table lookup in SAS.
    • User-defined SAS formats can be used to assign descriptive labels to data values, create new variables and find unexpected values.
    • The format procedure can also be used to generate data and to extract and merge data sets.
    • Finally the probit code is given
    proc probit data=news;
        class subs sex;
        model subs=sex age / d=logistic itprint;
        format subs subscrib.;
    run;
    • We specify the data as news.
    • The class statement specifies the classification variables in our data set, i.e. the subs and sex variables.
    • In the next line we build the model, specifying that subs is the response variable and that it is to be modeled using sex and age as the explanatory variables.
    • The final line applies the subscrib format to the subs variable, so that its values are displayed as 'accept' and 'reject' instead of 1 and 0.
    • After entering the code shown here, we click run to execute it.
    • The result shows the logistic regression of subscription status.
    • It starts with the iteration history for the parameter estimates.
    • In the iteration history table the log likelihood, the intercept and the values for the sex and age variables are displayed.
    • We can see that the log likelihood converges by the 6th iteration.
    • The next table displays the:
      • dependent variable
      • number of observations
      • name of the distribution and
      • log likelihood of the model.
    • The class level table shows the class variables sex and subs and the two values of each class variable.
    • The negative of the gradient and the negative of the Hessian are displayed.
    • The final table shows the "Analysis of Maximum Likelihood Parameter Estimates".
    • The intercept value is displayed as -5.762.
    • The estimate for Female is -2.42, which shows that women are less interested in subscribing to the newspaper than men. This is the inference based on the sex variable.
    • To find the insights based on the age variable we need to look at the parameter estimate for age.
    • This parameter estimate is 0.1649. The positive coefficient shows that the older the individual, the higher the willingness to accept the subscription. Conversely, younger individuals are more likely to reject the subscription.
    • The table also shows the values of
      • standard error,
      • 95% confidence limits,
      • chi-squared values, and
      • p-values for every variable, including the intercept.
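
As a cross-check outside SAS, the same maximum likelihood fit can be sketched in plain Python with a basic Newton-Raphson iteration. This is an illustration under stated assumptions, not a reproduction of the probit procedure: sex is coded as a 0/1 indicator with Male as the baseline, which is assumed so that the coefficient matches the "Female" estimate reported above, the response modeled is subs = 1, and all function names are hypothetical.

```python
import math

# The 40 survey records from the data step above: sex, age, subs.
RAW = """
Female 35 0 Male 44 0 Male 45 1 Female 47 1 Female 51 0 Female 47 0
Male 54 1 Male 47 1 Female 35 0 Female 34 0 Female 48 0 Female 56 1
Male 46 1 Female 59 1 Female 46 1 Male 59 1 Male 38 1 Female 39 0
Male 49 1 Male 42 1 Male 50 1 Female 45 0 Female 47 0 Female 30 1
Female 39 0 Female 51 0 Female 45 0 Female 43 1 Male 39 1 Male 31 0
Female 39 0 Male 34 0 Female 52 1 Female 46 0 Male 58 1 Female 50 1
Female 32 0 Female 52 1 Female 35 0 Female 51 0
""".split()

def parse(tokens):
    """Turn the flat token list into rows [1, is_female, age] and 0/1 responses."""
    X, y = [], []
    for i in range(0, len(tokens), 3):
        sex, age, subs = tokens[i], float(tokens[i + 1]), int(tokens[i + 2])
        X.append([1.0, 1.0 if sex == "Female" else 0.0, age])
        y.append(subs)
    return X, y

def sigmoid(z):
    """Numerically stable cumulative logistic distribution function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def solve(A, b):
    """Solve the small linear system A x = b by Gaussian elimination with pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x

def fit_logistic(X, y, max_iter=25, tol=1e-8):
    """Maximum likelihood fit by Newton-Raphson, starting from all-zero coefficients."""
    k = len(X[0])
    beta = [0.0] * k
    for _ in range(max_iter):
        grad = [0.0] * k                      # gradient of the log likelihood
        hess = [[0.0] * k for _ in range(k)]  # negative of its Hessian
        for row, target in zip(X, y):
            p = sigmoid(sum(b * v for b, v in zip(beta, row)))
            for i in range(k):
                grad[i] += (target - p) * row[i]
                for j in range(k):
                    hess[i][j] += p * (1.0 - p) * row[i] * row[j]
        step = solve(hess, grad)
        beta = [b + s for b, s in zip(beta, step)]
        if max(abs(s) for s in step) < tol:
            break
    return beta

X, y = parse(RAW)
intercept, b_female, b_age = fit_logistic(X, y)
print(f"intercept={intercept:.4f}  female={b_female:.4f}  age={b_age:.4f}")
for sex_flag, label in ((0.0, "Male"), (1.0, "Female")):
    p = sigmoid(intercept + b_female * sex_flag + b_age * 45)
    print(f"P(subscribe | {label}, age 45) = {p:.3f}")
```

The last loop applies the cumulative logistic distribution function to the intercept plus the weighted values of sex and age, which is how a predicted probability of subscribing is obtained for one individual. The age of 45 is an arbitrary example value.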

     Logistic Regression Case Study

    • We will take a medical case study to illustrate the applications of logistic regression.
    • Let us consider a hospital diagnosing the threat level of a tumor patient given a few attribute values.
    • The threat level of the patient is categorized into a binary output, 'mild' or 'severe'.
    • The hospital maintains a record of three attributes of every patient pertaining to their threat level.
    • The first variable is the drug value, i.e. the dosage of drug given to the patient to treat the tumor.
    • The second variable is the observed weight of the tumor.
    • The third variable is a binary response variable that specifies the seriousness of the threat, i.e. whether it is severe or not.
    • Seen here are two sample records of the data. The first patient has a drug dosage value of 3.70 and a tumor weight of 0.825, and is categorized as a severe threat level patient.
    • The second patient has a comparatively lower drug dosage value of 0.60 and a tumor weight of 0.75, and is categorized as a non-severe threat patient.
    • We will use this set of data to illustrate the diagnostic measures for detecting influential observations and to quantify their effects on various aspects of the maximum likelihood fit.
    • Let us start by creating the data set in SAS.
    data tumor;
        length Response $12;
        input Drug Weight Response @@;
        LogDrug=log(Drug);
        LogWeight=log(Weight);
        datalines;
    3.70 0.825 severe 3.50 1.090 severe
    1.25 2.500 severe 0.75 1.500 severe
    0.80 3.200 severe 0.70 3.500 severe
    0.60 0.750 non-severe 1.10 1.700 non-severe
    0.90 0.750 non-severe 0.90 0.450 non-severe
    0.80 0.570 non-severe 0.55 2.750 non-severe
    0.60 3.000 non-severe 1.40 2.330 severe
    0.75 3.750 severe 2.30 1.640 severe
    3.20 1.600 severe 0.85 1.415 severe
    1.70 1.060 non-severe 1.80 1.800 severe
    0.40 2.000 non-severe 0.95 1.360 non-severe
    1.35 1.350 non-severe 1.50 1.360 non-severe
    1.60 1.780 severe 0.60 1.500 non-severe
    1.80 1.500 severe 0.95 1.900 non-severe
    1.90 0.950 severe 1.60 0.400 non-severe
    2.70 0.750 severe 2.35 0.030 non-severe
    1.10 1.830 non-severe 1.10 2.200 severe
    0.95 1.900 non-severe 0.75 1.900 non-severe
    1.30 1.625 severe
    ;
    • We use the data statement to create the data set, and here we name the data set tumor.
    • The length statement specifies the number of bytes used to store a variable. For a character variable this is the maximum number of characters it can hold.
    • Here we set the length of the Response variable to 12 characters.
    • The input statement specifies the variables in the data set, i.e. the drug dosage, the weight of the tumor and the response.
    • To perform the logistic regression we apply a log transformation to the two explanatory variables, Drug and Weight. We create two variables, LogDrug and LogWeight, calculated as the natural log of the Drug and Weight variables respectively.
    • Next we have the datalines statement. Note that the datalines statement is used with the input statement to read data that you enter directly in the program, rather than data stored in an external file.
    • Now we specify all the data values in the order given by the input statement: Drug, Weight and Response.
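    ods graphics on;
    title 'Occurrence of tumor';
    proc logistic data=tumor plots=effect;
        /* event='severe' models the probability of a severe threat; by default the
           alphabetically first level ('non-severe') would be modeled instead */
        model Response(event='severe')=LogDrug LogWeight;
    run;
    ods graphics off;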
    • Next we turn ODS Graphics on. ODS Statistical Graphics is a functionality that is mainly used for easily creating statistical graphics. With ODS Graphics, over 60 statistical procedures can produce graphs as automatically as they do tables. For statistical procedures that support ODS Graphics, you invoke the functionality with the statement 'ods graphics on'. Graphs and tables created by these procedures are then integrated in your ODS output destination.
    • We specify the title of the output as 'Occurrence of tumor'.
    • We use the logistic procedure for this case study instead of the probit procedure that we saw earlier.
    • So here we specify proc logistic data=tumor. We specify plots=effect to get the predicted probability output. The plots are part of the ODS package, and SAS provides a list of plots that can be used along with the logistic regression procedure.
    • Here we will use a simple probability plot to show our output.
    • Then we define the model with Response as the response variable and LogDrug and LogWeight as the explanatory variables. The model statement names the response variable and the explanatory effects, including covariates, main effects, interactions and nested effects.
    • Here Response is the dependent variable and LogDrug and LogWeight are the explanatory variables.
    • Finally we click run to get the results in the results tab.
    • The first table gives the Model Information. We have used the tumor data set with Response as the response variable. The response variable has two levels, 'severe' and 'non-severe'. The data has been modeled with a binary logit model, using Fisher's scoring as the optimization technique.
    • There are 39 observations in the data and all of them have been used to build the logistic model.
    • The response profile shows the statistics of the Response variable, i.e. there are 19 'non-severe' outcomes and 20 'severe' outcomes for the Response variable in the data set.
    • The model fit statistics give the criteria for the goodness of fit of a model.
    • The most commonly used criterion is AIC, which balances the goodness of fit of a model against its complexity. When we fit different models to the same data, their AIC values can be compared with each other to decide on the model that best fits the data (lower is better).
    • In the next table various tests of the null hypothesis are performed and the results are displayed.
    • The null hypothesis is that the response variable cannot be modeled from the given explanatory variables, i.e. Beta = 0.
    • All the test results shown here have a very low p-value, so we reject the null hypothesis and conclude that the response variable can be explained as a function of the explanatory variables.
    • Next we come to the most important result, the one that helps in drawing conclusions from the data. The "Analysis of Maximum Likelihood Estimates" table shows the estimates for the intercept and for the drug and weight variables.
    • The result of the model shows that both drug and weight are significant for the severity of the tumor threat, as shown by the p-values of 0.0131 and 0.0055 respectively.
    • Their positive parameter estimates indicate that a higher drug dosage or a larger tumor weight is likely to increase the probability of a severe tumor threat.
    • The following tables show the odds ratio estimates and the association of predicted probabilities and observed responses.
    • The plots are shown in the final section, under influence diagnostics.
    • It shows different residual and leverage plots.
    • The final plot shows the predicted probabilities for the response variable with 95% confidence limits.
    • As shown in the legend, the blue circles denote the observed values and the blue line shows the predicted probability.
    • From the predicted probabilities, the model outputs the predicted value of the response variable, either 0 or 1.
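
Three of the quantities mentioned above follow from short formulas, sketched here in Python for illustration. The function names are hypothetical, the numbers passed in are made up and are not taken from the SAS output, and the 0.5 cutoff for turning a probability into a 0 or 1 is an assumption, since the post does not state which cutoff is used.

```python
import math

def aic(log_likelihood, n_params):
    """Akaike information criterion: -2 * log likelihood + 2 * number of parameters."""
    return -2.0 * log_likelihood + 2.0 * n_params

def odds_ratio(coefficient):
    """Multiplicative change in the odds for a one-unit increase in a predictor."""
    return math.exp(coefficient)

def classify(probability, cutoff=0.5):
    """Turn a predicted probability into a 0/1 prediction using a cutoff."""
    return 1 if probability >= cutoff else 0

# Hypothetical values for illustration only; they are not taken from the SAS output.
print(aic(log_likelihood=-15.0, n_params=3))
print(odds_ratio(0.7))
print([classify(p) for p in (0.12, 0.50, 0.91)])
```

Because the predictors in this case study are log-transformed, a one-unit increase in LogDrug or LogWeight corresponds to multiplying the original Drug or Weight value by e, which matters when reading the odds ratio estimates.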

    Conclusion:

    • To conclude, in this case study we fit a logistic model to the given tumor data set and find that the drug dosage and the weight of the tumor strongly affect the severity of the threat for the patient.








