Sunday, May 8, 2016

02 Statistical Concepts And Their Application In Business

    Agenda

    • Get an overview of Statistical Methods
    • Understand Population Samples
    • Develop a sampling plan and know Sampling Methods 
    • Know what is Descriptive Statistics
    • Know what are its components
    • Learn about Probability theory and distributions
    • Know what is Confidence Interval 
    • Learn about the concepts of tests of significance
    • Differentiate between One sided and two sided hypothesis testing
    • Know the various tests of significance
    • Know about non-parametric testing
    • Understand the main topic better with case studies

    Statistical Methods

    Statistics is a applied/business mathematics where we collect, organize, analyze, and interpret numerical facts
    • Descriptive Statistics
      • Sample
      • Measure of Central Tendency
      • Measures of dispersion
    • Inferential Statistics
      • Population
      • Estimation
      • Hypothesis Testing

    Population and Samples


    • A population is any entire collection of objects or observations from which we may collect data. It is the entire group we are interested in, which we wish to describe or draw conclusions about.
    • For each population there are many possible samples.
    • It is important that the investigator carefully and completely defines the population before collecting the sample, including a description of the members to be included.
    • A sample is a group of units selected from a larger group (the population). By studying the sample it is hoped to draw valid conclusions about the larger group.
    • A sample is generally selected for study because the population is too large to study in its entirety. The sample should be representative of the general population. This is often best achieved by random sampling.

    Developing a sampling plan


    • Define the target population – in terms of number of elements, sampling unit, extent and time.
    • Select a sampling method – probability or non-probability sampling.
    • Obtain the sampling frame – must contain all the potential factors.
    • Determination of sample size – for desired level of accuracy.
    • Choose data collection method – procedure to obtain the data.
    • Develop operational plan – which technique fits the best.
    • Execute operational plan – verification of specified procedure.

    Sampling Techniques




    Descriptive Statistics

    Analyse Data to extract meaningful information

    Measure of Central Tendency

    Help describe, show and summarize data in a meaningful manner
    Measure of Central Tendency

    • Mean
    • Median
    • Mode


    Mean

    mean is the average of the numbers
    a calculated "central" value of a set of numbers

    Median

    Median is the number in the middle
    Number of values above and below median is same

    Mode

    Mode is the value that occurs often
    A set of data can have more than one mode

    When to use what?

    Mean:

    The average is required
    The variable is continuous / discrete

    Median:

    The variable is discrete
    There are abnormal extreme values / Non-normal data
    The characteristic under study is qualitative

    Mode:

    The variable is discrete
    There are abnormal extreme values
    The characteristic under study is qualitative

    Measure of Dispersion

    The spread or dispersion of a set of scores around some central value
    Describes the amount of heterogeneity or variation within a distribution of scores

    Measure of Dispersion


    • Variance
    • Standard Deviation


    Variance and Standard Deviation

    Variance is an average of squared deviations about the mean
    Standard deviation is the squared root of variance. Example data : 2,5,5,4,6,8
    n= 6
    Mean = (2+5+5+4+6+8)/6 = 5 Example data : 2,5,4,6,8
    Variance =
    Standard Deviation =
    [(2−5)2+ (5−5)2+ (4−5)2+ (6−5)2+ (8−5)25]/5 = 20/5 = 4
    √4 = 2

    It is the measure of how spread the distribution is?

    Case Study – Descriptive Statistics

    Business Case: A telecommunications company maintains a customer database that includes, among other things, information on how much each customer spent on long distance, toll-free, equipment rental, calling card, and wireless services in the previous month.
    The telecom company surveyed 1000 of its customers on all the above services.
    Use Descriptive analysis to study customer spending to determine which services are most profitable.

    N Valid N Min Max Mean Standard Deviation
    Long distance last month 1000 1000 0.9 99.95 11.72 10.36
    Toll free last month 1000 475 0 173 13.27 16.9
    Equipment last month 1000 386 0 77.7 14.21 19.07
    Calling card last month 1000 678 0 109.25 13.78 14.08
    Wireless last month 1000 296 0 111.95 11.58 19.72

    On average, customers spend the most on equipment rental, but there is a lot of variation in the amount spent.
    Customers with calling card service spend only slightly less, on average, than equipment rental customers, and there is much less variation in the values.
    The real problem here is that most customers don't have every service, so a lot of 0's are being counted. One solution to this problem is to treat 0's as missing values so that the analysis for each service becomes conditional on having that service.

    Probability Theory

    Probability us a branch of mathematics that deals with the uncertainty of an even happening in the future.
    Probability value always occurs within a range of 0 to 1.
    Probability of an event, P(E) = (No. of favorable occurrences)/(No. of possible occurrences)
    Let us take the example of an unbiased coin which has two faces. Heads and Tails. As a result we will have two possible outcomes with equal probability. i.e. the probability of getting a head is equal to getting a tail i.e. 1/2 = 0.5.

    Assigning Probabilities

    Classical method - based on equally likely outcomes.
    E.g. Rolling a dice.  Probability of getting any number out of 1,2,3,4,5,6 is 1/6.
    Relative frequency method - based on experimentation or historical data.
    E.g. A car agency has 5 cars. His past record as shown in the table shows his cars used in past 60 days.
    No. of cars used No. of days Probability
    0 3 (3/60) = 0.05
    1 10 (10/60) = 0.17
    2 16 (16/60) = 0.27
    3 15 (15/60) = 0.25
    4 9 (9/60) = 0.15
    5 7 (7/60) = 0.12

    According to the table above there were:
    No cars used for 3 days
    1 car used for 10 days
    2 cars used for 16 days
    3 cars used for 15 days
    4 cars used for 9 days
    5 cars used for 7 days
    The probability can thus be calculated based on the relative frequency of each occurrence.
    i.e. dividing each of the days with total of 60 days.

    Subjective Method: based on judgement.
    E.g. 75% chance that England will adopt to Euro currency by 2020.

    Probability Distribution

    Probability distribution for a random variable gives information about how the probabilities are distributed over the values of that random variable.
    Its defined by f(x) which gives probability if each value.
    E.g. Suppose we have sales data for AC sale in last 300 days.

    Unit Sold No. of days Probability of units Sold f(x)
    0 10         0.03
    1 55         0.18
    2 150         0.5
    3 55         0.18
    4 25         0.08
    5 5         0.02



    Binomial Distribution

    Binomial Distribution satisfies:


    • A fixed number of trials
    • Each trial is independent of the others
    • The probability of each outcome remains constant from trial to trial.

    Example of binomial experiments


    • Tossing a coin 20 times, what is the probability of getting head 5 times?
    • Getting a diamond King from a pack of 52 cards.


    Case Study - Binomial Distribution

    Example of binomial distribution: Amir buys a chocolate bar every day during a promotion that says one out of six chocolate bars has a gift coupon within.
    Answer the following questions:

    • What is the distribution of the number of chocolates with gift coupons in seven days?
    • What is the probability that Amir gets no chocolates with gift coupons in seven days?
    • Amir gets no gift coupons for the first six days of the week. What is the chance that he will get a one on the seventh day?
    • Amir buys a bar every day for six weeks. What is the probability that he gets at least three gift coupons? 
    • How many days of purchase are required so that Amir's chance of getting at least one gift coupon is 0.95 or greater?

    (Assume that the conditions of binomial distribution apply: the outcomes for Amir's purchase are independent, and the population of chocolate bars is effectively infinite.)

    Formula = nCr p^r q^(n-r)
    n - number of trials
    r - number of successful outcomes
    p - probability of success
    q - probability of failure

    Other Important formulas include
    p + q = 1
    So q = 1- p  i.e.

    Now in this case p = 1/6 as Amir has a chance of winning one coupons out of every 6 chocolates he buys.
    The number of favorable cases is 1.
    And the total number of cases is 6.

    So q = 1-p = 1 - 1/6 = 5/6.

    • What is the distribution of the number of chocolates with gift coupons in seven days?
    7 C r (1/6) ^ r (5/6) ^ (7-r)

    • What is the probability that Amir gets no chocolates with gift coupons in seven days?
    Probability of failing 7 days : P (X=0) = (5/6)^7
    • Amir gets no gift coupons for the first six days of the week. What is the chance that he will get a one on the seventh day?
    Probability of winning a coupon in the 7th day : 1/6
    • Amir buys a bar every day for six weeks. What is the probability that he gets at least three gift coupons?
    Number of winning atleast 3 wrappers in six weeks:
    P(X>=3) = 1 - P ( X<=2) 
    = 1 - (P (X=0) + P (X=1) + P (X=2))
    = 1 - (0.0005 + 0.0040 + 0.0163 )
    = 0.979
    • How many days of purchase are required so that Amir's chance of getting at least one gift coupon is 0.95 or greater?
    Number of purchase days required so that the probability of success is greater than 0.95:
    P(X>=1) >= 0.95 (As per Binomial Distribution)
    P(X=1) + P(X=2) + ... + P(X=6) >= 0.95 but since Summation ( P(X=r)) = 1 so,
    1- P(X=0) >= 0.95 >> P(X=0)<=0.05 >> (5/6)n <= 0.05 ...taking log both sides
    n log(5/6) <= log(0.05)
    So, n>= 16.67
    That is n=17 days minimum.

    Normal Distribution

    Normal Distribution is a theoretical model of the whole population.
    It is perfectly symmetrical about the central value;the mean Mu represented by zero.



    Poisson distribution

    Discrete probability distribution for events that happen randomly in time.
    Following conditions need to be satisfied -
    The event results in a success or failure
    The average number of successes, Mu is known
    Probability of success is proportional to the region/time.
    Probability of success in an extremely small region/time is almost zero.
    Properties: Mean and variance is equal and is denoted by Mu.

    Examples
    Average number of houses sold by a company is 5 per day. What is the probability that exactly 4 houses will be sold tomorrow?
    Average number of births in a hospital is 2.1 births per hour. What is the probability that there will be exactly 6 births in the next two hours?


    Skewness and Kurtosis

    Skewness - measure of deviation from symmetry

    • Difference between median and mean
    • Right or left skewed
      • Skewness negative - more negative values (Left Skewed) more values to the left of median.
      • Skewness positive - more positive values (Right Skewed) more values to the right of median.

    Skewness can be removed from the data by doing mathematical transformations of the variable like logarithmic, squared, squared root etc.

    Kurtosis - measure of peakedness of the distribution

    • High Kurtosis - tall peak, rapid decline in the tails
    • Low Kurtosis - flat peaks, gradual decline in the tails
    • Extreme case - uniform distribution

    Case Study - Skewness and Kurtosis

    N Skewness Kurtosis
    Statistic Std. Error Statistic Std. Error
    Long Distance last month 1000 2.966 0.077 14.012 0.155
    Toll Free last month 475 3.465 0.112 26.735 0.224
    Equipment last month 386 0.756 0.124 0.641 0.248
    Calling card last month 678 2.15 0.094 7.572 0.187
    Wireless last month 296 1.359 0.142 3.079 0.282

    Equipment last month data is more accurate in nature and its SD is comparatively lower than the other measures.
    Conclusion - Equipment is the segment where the telecom company is getting more profits than the others and it can invest more.

    Confidence Interval


    • It's a rule for a population parameter to determine an interval that is likely to include the parameter based on the sample information.
    • Supposing that a random variable has been taken and the random samples were taken repeatedly from the population, certain percentage of interval contains unknown value.
    • In such case, if population is repeatedly sampled and intervals calculated in that fashion then 95% of interval contains true value of the unknown parameter.
    • This interval is then said to be 95% confident for the population proportion.
    • Data Requirements
      • Confidence level
      • Statistic
      • Margin of Error
      • Range of the confidence interval = sample statistic + margin of error.
      • The uncertainty associated with the confidence interval is specified by the confidence level.

    How to Construct a Confidence Interval


    • Identify a sample statistic - Choose the statistic that will be used to estimate a population parameter.
    • Select a confidence level - It describes the uncertainty of a sampling method.
    • Find the margin of error.
    • Margin of error = Critical value * Standard error of statistic
    • Specify the confidence interval - The range of the confidence interval is defined by the following equation.
    • Confidence Interval = Sample Statistic +/- Margin of error

    2.3 Tests of Significance 

    • Tests used in assessing the evidence in favor of or against a given assumption
    • Begins with a Null Hypothesis, Ho
    • Tests either validate the null hypothesis, or reject it in favor of an Alternate Hypothesis, Ha
    • Two types of tests:
      • One sided tests
      • Two sided tests
    • Results decided by calculating the "p - value"
    • P value can be defined at the probability that the calculate test statistic can take extreme value as the absurd value given that the null hypothesis is true.
    • Interpretation:
      • If p-value is less than the significance level alpha, reject the null hypothesis.
      • General values of alpha are 0.05, 0.01.
    • General Assumptions:
      • The distribution is almost normal
      • The sample in the distribution have almost unequal variances.

    One sided hypothesis testing

    • Muo = null value
    • Null hypothesis Mu = Muo
    • Alternative hypothesis: Mu < Muo or Mu > Muo
    Example: Given a sample of heights of 100 males in New York, decide whether the height has increased in general form a given average height of 5 feet 9 inches.
    • Null Value: Muo = 5 feet 9 inches
    • Null Hypothesis: Mu = 5.9
    • Alternative Hypothesis: Mu > 5.9
    Using one of various hypothesis tests, calculate "p-value" and reject null hypothesis if p-value is less than 0.05.

    Two sided hypothesis testing

    • Muo = null value
    • Null Hypothesis: Mu = Muo
    • Alternative hypothesis: Mu <> Muo
    Example: given a sample heights of 100 males in New York, decide whether the height has increased/decreased in general form a given average height of 5 feet 9 inches.
    • Null Value = Muo = 5.9 
    • Alternative Hypothesis = Mu <> 5.9
    Using one of various hypothesis tests, calculate p-value and reject null hypothesis if p-value is less than 0.05.

    2.4 Tests of Significance

    • One Sample z test- The Z test is used to compare the mean with the given standard
    • Two Sample z test - The Z test is used to compare the means of two groups.
      The standard deviation need not be known to calculate the Z statistics.
      The Z test is generally used when the number of samples is greater than 30.
    • T test 
    • The t test is used with mean statistics as well but to calculate the t statistic the standard deviation must be known the test is preferred if the number of samples is less than 30. As earlier the t test can be one sample two sample or paired t tests.
    • One Sample t test - 
    • Two Sample t test - When the compared groups are independent. e.g.  To compare the marks or students of two different schools.
    • Paired t test - When the compared groups are paired. To compare the marks of students of same schools before and after a training class.
    • Chi-Squared test - For goodness of fit is used to test if there is a different between the observed values and the expected values according to a particular hypothesis.
    • F test - Annalysis of Variance (ANNOVA) - To compare variances of two or more groups. The mostly used f test is ANNOVA.
    • F test - Regression - lesser used is the regression analysis.
    In all the analysis tests the null hypothesis states that there is no difference between mean or variances and the alternative hypothesis suggests otherwise.

    Chi-Squared Tests

      Compare the observed results against an expected result based on a hypothesis
      Steps:
      • State the null hypothesis
      • Prepare the contingency table for the variable
      • Determite the expected results
      • Calculate the chi-squared values
      • Calculate the degree of freedom
      • Based on the above, calculate the p-value
      • If p-value <0.05, reject the null hypothesis
      Test of independence:
      • Verify if two variables are independent
      • Same steps as above

      Case Study - Chi Squared Test

      A city has a newly opened nuclear plant, and there are families staying dangerously close to the plant. A health safety officer wants to take this case up to provide relocation for the families that live in the surrounding area. to make a strong case, he wants to prove with numbers that an exposure to radiation levels is leading to an increase in diseased population. He formulates a contingency table of exposure and disease.

      ExposureDisease YesDisease NoTotal
      Yes371350
      No175370
      Total5466120

      Does the data suggest an association between the disease and exposure?

      Steps:
      • Calculate the number of individuals of exposed and unexposed groups expected in each disease category (yes or no) if the probabilities were the same.
      • If there were no effect of exposure, the probabilities should be same and the chi-squared statistics would have a very low value.
      Proportion of population exposed = (50/120)=0.42
      Proportion of population not exposed = (70/120)=0.58

      Thus, expected values:
      Popolation with disease = 54
      Exposure Yes: 54 * 0.42 = 22.5
      Exposure No: 54 * 0.58 = 31.5

      Population without disease = 66
      Exposure Yes: 66 * 0.42 = 27.5
      Exposure No: 66 * 0.58 = 38.5

      ExposureDisease YesDisease NoTotalTotal Proportion
      Yes Actual37135050/120 = 0.42
      Yes Expected54 * 0.42 = 22.5 66 * 0.42 = 27.5
      No17537070/120 = 0.58
      No Expected54 * 0.58 = 31.566 * 0.58 = 38.5
      Total5466120

      • Calculate the Chi-Squared statistic
      X^2 = Summation of  [(Observed Freq. - Expected Freq.)^2/ Expected Freq]
      = ((37-22.5)^2 / 22.5) + ((13-27.5)^2 / 27.5) + ((17-31.5)^2 / 31.5) + ((53-38.5)^2 / 38.5) 
      = 29.1
      • Calculate the degree of freedom:
      df = (Number of rows -1) x (Number of columns -1)
      df = (2-1) x (2-1)
      df = 1
      • Calculate the p-value from the chi-squared table(found online).
        For Chi-Squared value 29.1 and degree of freedom =1, from the table, p-value is < 0.001
      • Interpretation: There is 0.001 chance of obtaining such discrepancy between expected and observed values if there is no association.

      ANNOVA

      • Analysis of Variance - used to compare more than two groups
      • Extension of the independent t-tests
      • Factor variable - variable defining the groups
      • Response variable - variable being compared
      • One way ANNOVA
        • Groups of a single variable
        • E.g.: Is there a difference in student's marks based on the row he is seated - front / middle / back?
      • Two way ANNOVA
        • Two independent variables
        • E.g.: Does the race and gender affect a person's yearly income?

      Case Study - One way ANNOVA


      • Marks obtained in the same subject by three students belonging to three different schools are given below.
      • Does the data suggest any association between school and marks?
      SchoolABC
      Marks 1828338
      Marks 2837859
      Marks 3976855

      The basic idea in ANNOVA: Partition the total variation in the data into the variation between groups and variation between groups.
      Steps:

      • Calcaute the means

      School A: mean(82, 83, 97) = 87.3
      School B: mean(83, 78, 68) = 76.3
      School C: mean(38, 59, 55) = 50.6


      • Calcualte the grand mean

      Grand: mean(82, 83, 97, 83, 78, 68, 38, 59, 55) = 71.4


      • Calculating the variations

      Sum of Squared Deviations about the grand mean, across all observed values: SStotal = 2630.2
      Sum of Squared Deviations of group mean about the grand mean - three group mean against the grand mean: SSbetween=2124.2
      Sum of Squared Deviations of observations within a group about their group mean; added across all groups: SSwithin=506



      • Calculate the degree of freedom for every variance:

      df_total = number of observations -1 = 9-1 = 8
      df_between = number of groups -1 = 3 -1 = 2
      df_within = number of observations - number of groups = 6


      • Calculate the Mean Squared Variances

      Mean Suared variance between group MS_between = SS_between / df_between = 2124/2 = 1062
      Mean Suared variance within group MS_within = SS_within / df_within = 506/6 = 84.3


      • Calculate the f-statistics

      F-value = MS_between/MS_within = 1062.1/84.3 = 12.59


      • Calculate the p-value from the F-table

      P-value for given f-value 12.59 and degree of freedom 2 and 6 is 0.007


      • Conclusion: since the p-value is less than alpha, we can conclude by rejecting the null hypothesis, that there is a difference in the marks obtained by students belonging to different groups.

      2.5 Non Parametric Testing

      • Referred to as "distribution free", as they don't involve making assumptions of any data.
      • They have lower power than the parametric tests and hence are always given the second preference after the parametric tests
      • These tests are typically focused on median rather than mean
      • They involve straight forward procedures like counting and ordering
      • There are at least one non-parametric test done for each parametric test and are classified into following categories.
        • Tests of differences between groups (independent samples)
        • Tests of differences between variables (dependent variables)
        • Tests of relationship between variables
      One usually computes the correlation coefficient.
      Non parametric equivalence to the standard correlation coefficient are 
      • Spearman's R
      • Kendall's Tau
      • Coefficient Gamma
      Appropriate non-parametric testing for testing the relationship between the two variables are the chi-squared tests, the pi coefficient and the fisher exact test. In addition a simultaneous test for relationship between multiple cases is available. Kendall coefficient of concordance. This test is often used to express the inter-relative  agreement among independent judges who are rating ranking the same simulate

      Non Parametric Tests

      TestsParametricNon Parametric
      One Qualitative Response VariableOne Sample TestSign Test
      One Qualitative Response Variable - Two Values from Paired SamplesPaired Sample T - testWilcoxon Signed Rank Test
      One Qualitative Response Variable - One Qualitative Independent Variable with Two GroupsTwo Independent Sample T - testWilcoxon Rank Sum or Mann Whitney Test
      One Qualitative Response Variable - One Qualitative Independent Variable with Three or more GroupsANNOVAKruskall Wallis

      Correlation

      Measure of association between variables
      Positive and negagive correlation, ranging between +1 and -1
      A value of +1 or positive correlation applies that if the value of independent variable increases the value of response variable also increases.
      Similarly, a value of -1 or negative correlation applies that if the value of independent variable increases the value of response variable decreases.

      Positive Correlation Example:
      Earning and expenditure - more a person earns more he/she spends.
      Negative Correlation Example:
      Speed and time - As the speed of the vehicle increases the time taken to cover a given distance decreases.

      Parametric - normal distribution and hogeneous variance.
      Pearson correlation
      Non Parametric - no assumption, nominal variable
      Spearman correlation

      Correlation Coefficient

      r: correlation coefficient
      -1: Perfectly Negative
      +1: Perfectly Positive
      0 - 0.2 : No or very weak association
      0.2 - 0.4 : Weak association
      0.4 - 0.6 : Moderate association
      0.6 - 0.8 : Strong association
      0.8 - 1 : Very strong to perfect association

      Summary

      • Overview of Statistical Methods
      • Population, Samples &  Sampling Plan and Sampling Methods
      • Descriptive Statistics - Measure of Central Tendency and Measure of Dispersion
      • Probability Theory and Distributions
      • Confidence Interval
      • What are Tests of Significance
      • The process flow of hypothesis testing
      • One Sided and Two Sided Hypothesis Testing
      • Various Tests used in calculating p-value
      • What is Non-Parametric Testing and why it is used.
      • Non-parametric alternatives for the usual tests of significance

      03 Basic Analytic Techniques - Part 1

      3.1 Basic Analytic Techniques using SAS and Excel - Part 1

      Objective Slide

      • Get an introduction to SAS Studio
      • Learn how to upload a file to SAS
      • Know the advantages if using SAS
      • Understand data exploration through SAS and Excel
      • Implement distribution analysis throgh SAS and Excel
      • Learn the application of data visualization
      • Visualize data using SAS and Excel
      • Understand diagnostic analytics
      • Perform correlation, ANNOVA and t-tests through SAS and Excel
      • Understand the concepts with the help if case studies.

      Introduction - SAS Studio

      One of the most effective, interative tool for data exploration and analysis.
      • It is used for Exploring data
      • Transforming data
      • Analysing distribution
      • Fitting and evaluating models
      • Developing interactive programs
      • Creating custom analysis
      • Sharing analysis

      Command used for uploading a file in SAS:

      • Use proc import procedure to import data file into working directory.
      • The file from the SAS Folder is taken into the Working directory by using the syntax;
        • data file = file destination
      • The command out =  "file name" directs the file name to the work library which cab be  viewed under the work folder of library.
      • Specify the type of file by using the syntax dbms=filetype
      • Syntax replace is used if any manipulation is to be done with the data
      • Run is used in executing the commands.

      Demo - Load Data File

      Let us now see a video to understand these commands better.
      In this section we will now see:
      • How to upload a file
      • How to view a file and
      • The different syntax used for the same to upload a file into the SAS working directory.
      • Click on folder section, click upload and load the required file using the Choose File (browse) option.
      • After loading the file use proc import procedure to import data file into working directory.

      proc import datafile="/folders/myfolder/sasuser.v94/sales_reg.csv"
          out=Sales
          dbms=csv
          replace;
      run;

      • Here in this case we have uploaded the data file into "/folders/SASuser.v94/" by the name sales_reg so we give the appropriate syntax as shown above.
      • Second command out=sales directs the sales file to work library. This can be viewed under the work folder of the library.
      • Specify the type of file by using the syntax "dbms=filetype"
      • Syntax replace is used if any manipulation is done with the data. For example using only one column from the data set.
      The run icon is used to execute the commands.

      To view and check whether the right data is loaded into the system we can use "proc print".
      The syntax is "proc print data=filename"

      proc print;
      run;



      Once you click on the run icon you can see the loaded file.

      After loading the file into SAS library. 
      All the statistical functions can be performed on the data. 
      SAS also provides a wide range of sample data for learning data analysis.
      During the learning phase you can work on these sample data to get a better understanding of the concepts that are covered in this course.

      The steps to upload a file has not been added for excel. Since Excel does not require importing data.
      In the next section we will look at the advantages of using excel for data exploration.

      3.2 Data Exploration

      Since data exploration was already covered in the previous slides we will cover how data exploration can be covered using other tools.

      Data Exploration through SAS

      • Complete exploration of data is possible in very less time.
      • Simplified steps for data exploration
      • Quality of statistical analysis is better compared to other tools.
      • User Friendly
      • Time for debugging program errors significantly reduced and the training required to learn SAS is relatively less.
      • Macro approach - integration of graphical analysis tool; thereby quickly completing data analysis tasks without using SAS program code.
      • Limitations -  Only the most commonly used data analysis methods available.

      Steps for data exploration using SAS studio

      The steps involved are:

      • Open SAS
      • Extract data into the SAS system from the disk
      • Click on Tasks
      • Click on summary statistics
      • Check whether we have the right data set under the field "DATA".
        • If not, then load appropriate data set.
      • Once the appropriate data set is loaded, pick the Analysis variables
      • In certain cases, we need to fill in an additional field named Classification Variables
      • Run the program.
      • The summary statistics will be displayed in the results window.

      Now let us understand the flow and instructions better.

      • In this section we will look at how to explore data through SAS studio.
      • The first step is to upload the data set on which you want to perform the analysis.
      • You can upload a new data set as shown in the previous section or 
      • You can use an existing sample data set within SAS Studio under the Library > SASHelp.



      • We will select a data set which is already available in SAS...the CARS data set and form data exploration.
      • To start with click on the Tasks tab under which you can see various functions which can be performed.
      • Click on summary statistics under Statistics. A new tab Summary Statistics is displayed.
      • And you will see the data section under the Summary Statistics.
      • Select the particular data set on which you are going to perform summary statistics.
      • The next step is to provide the Analysis variable.
      • Click the + button and the list of columns is displayed. In this case we are considering the column MSRP for analysis.
      • You can also specify the classification variable if required. We will classify cars based on the Make.



      • Now click the Options Tab
      • A list of Basic Statistics and Additional Statistics option is displayed.
      • Here few statistics are already selected. We will check on the Median under Basic Statisitics and Mode under Additional Statistics
      • You can look at the list of options and select them as necessary and the case that is being analyzed.
      • There are also options for Percentile Statistics, Plots etc. We will look at them at the later sections of this lesson.



      • The information tab provides the properties and resources. You can note that as you can make changes to data and options tab. The code window at the right keeps getting updated with the corresponding code.
      • Finally Click on the Run icon and the output will be displayed in the Results tab.



      • The result shows mean, standard deviation, minimum, maximum, median and mode for the analysis variable MSRP classified by the Make as seen in the first column.
      • You can also download the results as .RTF or .PDF file from the icons on the top left of the Results tab.

      Steps for Data Exploration using Excel

      • Open the file
      • Click on the Office button and select Excel Options, and select Add-ins.
      • Select the Analysis ToolPak option from the list of options, and click on Go.
      • Place a check next to the Analysis ToolPak Add-in and click OK.
      • Click on Data ribbon from Menu. On the right most side, the Data Analysis tool options will be displayed.
      • Click on the Data Analysis button. 
      • Select Descriptive Statistics from the options and click OK.
      • Select an input range for the data , by clicking the icon in Input Range field. Click and drag on the data to select the range.
      • Check on labels in first row option if data extracted has column name in its first row.
      • Check on Summary Statistics and click OK.
      • The results will be obtained in a new worksheet.
      Let us know look at a demo to understand the steps better.
      • Let us take a students marks list as a data set to perform data exploration.


      • To perform any data analysis we need to load the data analysis "add-in" in Excel.
      • To load "add-in" click on "Office Button" (on the top left) and "Excel Options".

      • From the "Excel Options" dialog box, click on the "Add-Ins" option on the left.
      • The list of available "Add-ins" will be displayed. Select "Analysis ToolPak". Then select "Go..." button.

      • Now you will see the list of "Add-ins" available.
      • Check option "Analysis ToolPak" and click on Ok.

      • Go to the Data Option in the Menu Bar.  Select the Data Analysis Tool.



      • Select 'Descriptive Statistics' from the window that is displayed and click OK.

      • Now you can specify your input variables in the Input Range by clicking and dragging on the required cells.


      • Here we are selecting Percentage 1 column as our input range. Check on "Labels in first row" options.
      • Specify your output range where your results are to be obtained. By default the results are displayed in the "New Worksheet Ply".
      • Check on "Summary Statistics". You can also change your "Confidence Level for Mean" if you want  and click on Ok.

      • Results are obtained in a new worksheet. You can see that the statistics are displayed.
      • Mean, Median, Mode, Std. Dev., Variance, Kurtosis, etc. 
      • It is easier to infer that the minimum of this variable is 12.5 and the maximum value is 80. Thus, the range is 67.5.
      • Count shows the number of instances i.e. 115. The average of the %ages are nearly 53.2. 
      • Similarly other statistics can be used to get useful information from the data.

      Steps for Distribution Analysis in SAS Studio

      • Open SAS
      • Extract data into the SAS system from the disk.
      • Click on the Tasks section
      • Click on Distribution Analysis 
      • Define the appropriate variable as discussed in the previous slide.
      • Once this is done, click on Options
      • Check histogram, to view a graphical result and click on run.
      • The histogram results will be displayed in the results window.
      • Additionally, there are also other options for checking normality. That is, if required, you can perform a goodness of fit test, normal probability plot, etc.
      Now let us look at a demo to under the steps better:

      How to perform Distribution Analysis for given data through SAS.

      • First we will load the data set into SAS. (Following the steps mentioned into earlier videos)
      • To illustrate the distribution analysis we will look at the same car sample data set from the SASHELP library.
      • Once the file is loaded just select task "Distribution Analysis" from the Statistics section.
      • Fill in the required fields 
      • In the Data  Field select SASHELP.CARS
      • Select the Analysis Variable and the Classification Variable.
      • In our example we will analyse the Horsepower of the CAR.
      • For the classification variable we will again use the Make of the Car.


      Click on the options tab.
      Here check the Histogram Options to view the graphical output.

      Other options are also available to check the normality assumptions, goodness of fit, etc.

      Finally click on the Run Icon
      You will see the histogram plot of the Horse Power in the resultant window.





      In the next section we will cover data distribution in Excel.

      Steps for performing Distribution Analysis in Excel.

      • Open the file for performing data for exploration.
      • Make sure that the Analysis ToolPak add-In is installed in Excel.
      • If not, install it by selecting Add-ins from the Office button and selecting Manage add-ins.
      • Click on data analysis from the right corner of data ribbon.
      • Select Histogram from the list of options and click on ok. 
      • Select the input range for the data by clicking and dragging on the data.
      • Check on labels in first row option if data extracted has column name in its first row.
      • Check on chart output and click ok.
      • The result will be obtained in New Worksheet ply by default.

      Let us now look at a demo to understand the steps better.

      • We will use the student mark sheet as an example to show how to perform distributions analysis.
      • For performing analysis operation for data analysis tool we need to install the data analysis "Add-ins"
      • Following the same steps which we discussed in the previous video they have the data and the analysis tool.
      • Click on the Data tab and you will see the Data analysis tool at the top right corner.
      • The data analysis window is displayed. Select the Histogram option to analyse the distribution of the data.
      • The Histogram Dialog is displayed.
      • Please select the Input range. In this example we will use the Percentage1 column.
      • If required you can also specify the Bin Range.
      • To Specify the bin range enter the values in a column in the same sheet. So create a column Bin with values 30, 40, 50, 60, 70, and 80.


      • Select this as bin values. Check on Labels if your first row specifies the column name.
      • Check Chart Output to give your result as chart.
      • Finally click on Ok.
















    • The resultant worksheet ply defines the bins and frequency for particular input variable and a Histogram chart is also displayed.




      • From the resultant tab we can see that there are 2 percentages below 30, 5 percentages above 80 and similarly the frequencies for other bins are displayed.
      • It can also be seen that the maximum of marks fall within 50 to 70 range. With 80 values in this range.
      • Similar inferences can be made from Mark1, Mark2 and Percentage2 as well.

      3.3 Data Visualization

      Data Visualization

      Data Visualization is the process of graphically or pictorially representing your information with qualitative understanding.
      It makes interpretation very easier.
      Things  essential to make good visualization:

      • Effective format
      • Accurate result
      • Efficient visuals
      • Aesthetics
      • Adaptable form
      In case of business organization it helps in decision making by

      • Identifying regions where improvement has to be done.
      • Influencing customer behavior and predicting sales volumes.
      • Helping in increasing revenues and decreasing dispenses.

      Steps for Data Visulization in SAS Studi

      • Open SAS
      • Extract the required data into the SAS system from the disk.
      • Click on Tasks.
      • Click on the graph section and select the appropriate chart for your analysis.
      • Click on bar chart from the options tp make bar chart comparision between car make and its weight.
      • Confirm whether the right data set is loaded under the data option.
      • Select the appropriate category variable.
      • Select the appropriate response variable (Group Variable if required).
      • Click on Run.
      • The bar chart will be displayed in the result window.
      • Similarly, you can also perform various other chart like bar-line chart, line chart, pie chart and scatter plot etc.
      Now let us look at a demo to understand the steps better.

      • We will try to plot the weight of cars on the basis of their make.
      • We will first load the particular data-set into SAS studio.
      • In this video we will take the same example of cars data set.
      • To perform data visualization in SAS click on the Tasks tab, and select bar chart from the graph section.
      • You can also visualize your data by selecting other graphs that suits your data like bar-line chart, line chart, and scatter plot etc. 
      • Here we have chosen bar chart for R visualization.
      • In the data tab ensure that the CARS data is loaded into the data field.
      • Select the category variable by clicking the + icon.
      • Here for example we have chosen "Make" as our category variable.
      • Click on OK.
      • Next choose the response variable, in our example we will use "Weight" as our response variable.
      • You can also group your data by specifying group variable. We will leave this option blank for now.

      • Click on the Options tab.
      • Here you can specify whether your graph should be vertical or horizontal.
      • You can also add a title and a footnote to the bar chart.
      • Let us add the title as "Weight by Model".
      • You can also aggregate by mean or sum by Statistics section.
      • We will use the default option by aggregating by the average.

      • Other options are also available like Axis, legend details, category, graph size etc.
      • Here we will use the default options.
      • In the information tab the resources and explanation are being displayed.
      • Finally click on run.
      • The Bar chart is displayed.

      • Since the bar chart does not fit within the window let us open it in a new browser tab.using the right most icon on the top.
      • We can infer from the plot that the "Hummer" cars are the heaviest having an average over 6000 pounds.
      • Majority of cars are having average weight around 4000 pounds.
      • Similar plots can be created to analyse different variables.

      Steps for Data Visualization in Excel

      • Open the file for performing data visualization.
      • Select the Input Range
      • Click on Insert from the menu bar
      • Click on the required graph icon from the Chart section.
      • The graph or chart will be displayed.
      • The chart tools ribbons are displayed with the design, layout and format sections.
      • To modify the legend or name the axes, select the layout section.
      • You can see the labels and axes sections in the ribbon. These ribbons can be used to modify the chart in any way.
      Now let us look at the video to understand the concepts better:
      • Let us look at the steps to visualize data through excel.
      • As usual we will look at the student mark list example.
      • Select the variable for which you need to make the visualization.
      • For this example let us try to visualize the Mark1 variable.
      • Select the Mark1 column or click and drag on the cell.
      • After selecting the variable, click on the Insert Menu.
      • You can see many charts that help in visualizing the data.
      • In our example since it is a continuous variable, we will use the line chart option.
      • Click on the line icon and excel shows the different types of line charts.
      • Let us choose the line with the markers option.
      • On Selecting the option the chart is automatically displayed.
      • Let us resize the chart to make it more visually readable.
      • The x-axis denotes the entry and the y-axis denotes the marks obtained.
      • To know any particular data hover the mouse over the marker and the value is displayed.
      • For example let us find out the lowest value in the data.
      • When we hover the mouse over the lowest values it displays the value 4 at point 106.
      • As mentioned before the chart option can be changed from Design, Layout and Format ribbons which is available when we select the chart.

      Case Study

      Now we will take up the case studies to apply the techniques of data visualization and data exploration in real world problems so that we can understand the concepts better.
      • Let us now take a banking related data set as an example for doing data exploration.
      • Officer David from the sales department has been asked to do a study on the overall performance of a bank by analyzing the bank loans. As a part of the investigation, David has asked his colleague to categorize the loans as seasoned or bad loans with respect to Age category.
      • The following data categories have been captured:
        • Age Group
        • Total number of loans
        • Number of good loans
        • Number of bad loans
        • Percentage of good loans
        • Percentage of bad loans
      Let us have a look at the data.
      • The table below shows the bank loan data
      • As mentioned before the number of good loans and bad loans has been tabulated as per the age category.
      • There are 15 categories of the age in the table.
      • Given the total number of loans and the number of good and bad loans, the percentage of good and bad loans has been calculated and tabulated.
      Now we will look at the questions that David wants to answer:
      • Should bank loans be restricted to 40-45 age category?
      • There are any relationship between loans and age?
      • Is there any pattern for defaulters?
      • How is the bank performing on the whole?
      Now we will look at how to approach these questions with the help of SAS and Excel.
      • In the following demo we will look at different statistics and graphical representation of the data that will help finding the answers.
      • We will start visualizing the data to get meaningful insights.
      • As shown earlier the steps the visualize data are
      • Open SAS
      • Extract data into the SAS system from the disk.
      • Click on Tasks
      • Click on Graphs and you will see various charts for visualization.
      • Select appropriate chart for the analysis.
      • Here we will use Histogram to plot number of good loans and number of bad loans.
      • Select Histogram from the Graph section of the Tasks.
      • Confirm whether the right data is loaded under the data in Data tab.
      • Select the Analysis variable i.e. "Number of Good Loans"
      • Run the code.
      • The histogram is displayed in the results window
      • Change the Analysis variable from "Number of Good Loans" to "Number of Bad Loans" and run the code
      • A new histogram is displayed for the number of bad loans.

      Data Exploration using SAS - Descriptive Analysis

      • To get the better understanding of the data we use data exploration. i.e. the Descriptive Analysis.
      • Thus we get the numerical summaries of the data.
      • Let us try to obtain descriptive statistics through SAS.
      • Open SAS
      • Extract Data into the SAS from the disk.
      • Click on Tasks
      • Click on Summary Statistics
      • To do the basic data exploration.
      • Check whether we have the right data set under the field data.
      • Select the Analysis Variables as 
        • Number of Good Loans
        • Number of Bad Loans
      • Select the Classification Variable as "Age Group".
      • Run the program.
      • The Summary Statistics are displayed in the result window.
      • From these results we obtain the following inferences
      • Number of Good Loans in comparatively greater than the Number of Bad Loans. And hence we can conclude that the bank is performing well in this segment.
      • Since the percentage of good loans is high in the 42 to 45 years Age_Group category, we can conclude that it is worth taking a risk in the category.
      • However, there are other categories, that have better percentage of good loans, and hence it is preferable not to concentrate on 42 to 45 years Age_Group category.
      • We will try to answer the remaining questions through Excel.

      Data Exploration and Visualization using Excel - Descriptive Analysis

      • Open the file in Excel for Data Visualization.
      • Select the Input range.
      • Click on Insert
      • Select the required graph.
      • The graph will then be displayed
      • Let us start creating graphs
      • First we create a graph of "Number of Bad Loans" v/s the Age_Category.
      • Select the Age_Category and  "Number of Bad Loans" column.
      • Click on the Insert Menu and select any Bar Graph.
      • The resulting Bar Graph shows the bar of "Number of Bad Loans" and the function of the Age_Category on the X-axis.
      • Here the first thing that comes to our notice that  the maximum number of bad loans is in the 42-45 Age_Category.
      • However, it does not necessarily mean that the loans should not be given to this category as maximum profit also falls under the same category.
      • Another trend that can be noticed from this graph is that the defaulters are fewer as the borrower gets older. That is a good insight.
      • It supports that fact that there is a definite relationship between age and loan patterns.
      • Let us now create a graph of total number of loans v/s the age group.
      • Select the two columns and click on the insert menu.
      • Let us create the line chart.
      • From the chart we can infer that  the distribution of loans across age group is reasonably smooth normally distributed curve without too many outliers.
      • In most business cases age often shows this kind of interesting patterns for most products. However, other closely related businesses cannot produce similar smooth curves. In such cases we might have to smoother the curve.
      • Adding to the above insights it is very difficult to draw results form the end values. But neglecting the end values could be fatal if it is not based on sound business acumen.Often neglecting the end point is a common mistake as far as analysis is concerned. 

      Summary Statistics using Excel

      • To obtain summary statistics in Excel, open the file.
      • Make sure that the data analysis tool pack is installed.
      • Click on Data Analysis from the Data Menu.
      • Click on Descriptive Statistics and click Ok.
      • Select the Input range from the data. In this case select the whole data set.
      • Check on "Labels in the First Row" options, since the data extracted has the column names in its first row.
      • Check on "Summary Statistics" and click Ok.
      • The results are obtained in a new worksheet.
      • The results show different statistics for every variable of the data set.
      • For the questions if the bank is performing well on the whole let us look at the mean of the number of good loans and the number of bad loans.
      • There are an average of 3952 good loans and about a 100 bad loans. So we can confidently say bank is doing well in the overall sense.