BA using SAS: 02 Statistical Concepts And Their Application In Business

Agenda

Get an overview of Statistical Methods
Understand Population Samples
Develop a sampling plan and know Sampling Methods
Know what is Descriptive Statistics
Know what are its components
Learn about Probability theory and distributions
Know what is Confidence Interval
Learn about the concepts of tests of significance
Differentiate between One sided and two sided hypothesis testing
Know the various tests of significance
Know about non-parametric testing
Understand the main topic better with case studies

Statistical Methods

Statistics is a branch of applied/business mathematics in which we collect, organize, analyze, and interpret numerical facts.

Descriptive Statistics

Sample
Measure of Central Tendency
Measures of dispersion

Inferential Statistics

Population
Estimation
Hypothesis Testing

Population and Samples

A population is any entire collection of objects or observations from which we may collect data. It is the entire group we are interested in, which we wish to describe or draw conclusions about.
For each population there are many possible samples.
It is important that the investigator carefully and completely defines the population before collecting the sample, including a description of the members to be included.
A sample is a group of units selected from a larger group (the population). By studying the sample it is hoped to draw valid conclusions about the larger group.
A sample is generally selected for study because the population is too large to study in its entirety. The sample should be representative of the general population. This is often best achieved by random sampling.
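
As a rough illustration of random sampling, the sketch below draws a simple random sample without replacement using Python's standard library. The population of customer IDs 1 to 1000, the sample size of 50 and the helper name simple_random_sample are all assumptions made for the example; they do not come from the original material.

```python
import random

def simple_random_sample(population, sample_size, seed=None):
    """Draw sample_size units without replacement; every unit is equally likely."""
    rng = random.Random(seed)  # the seed only makes the illustration repeatable
    return rng.sample(population, sample_size)

# Hypothetical population: customer IDs 1..1000
population = list(range(1, 1001))
sample = simple_random_sample(population, 50, seed=42)

print(len(sample), "units drawn,", len(set(sample)), "of them distinct")
```

Because the draw is without replacement, no unit can appear twice in the sample.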

Developing a sampling plan

Define the target population – in terms of number of elements, sampling unit, extent and time.
Select a sampling method – probability or non-probability sampling.
Obtain the sampling frame – must contain all the potential factors.
Determination of sample size – for desired level of accuracy.
Choose data collection method – procedure to obtain the data.
Develop operational plan – which technique fits the best.
Execute operational plan – verification of specified procedure.

Sampling Techniques

Descriptive Statistics

Analyse Data to extract meaningful information

Measure of Central Tendency

Help describe, show and summarize data in a meaningful manner
Measure of Central Tendency

Mean
Median
Mode

Mean

Mean is the average of the numbers
a calculated "central" value of a set of numbers

Median

Median is the number in the middle
Number of values above and below median is same

Mode

Mode is the value that occurs most often
A set of data can have more than one mode
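
The three measures can be computed with Python's standard statistics module. This is a minimal sketch: the first data set is the one used in the variance example of these notes, while the second one (with the value 80) is a made-up variation added only to show how an extreme value moves the mean but not the median.

```python
import statistics

data = [2, 5, 5, 4, 6, 8]

mean_value = statistics.mean(data)        # sum of the values divided by their count
median_value = statistics.median(data)    # middle of the sorted values
modes = statistics.multimode(data)        # list, because data can have several modes

# Hypothetical variation: replace 8 by an extreme value of 80
data_with_outlier = [2, 5, 5, 4, 6, 80]
mean_with_outlier = statistics.mean(data_with_outlier)
median_with_outlier = statistics.median(data_with_outlier)

print(mean_value, median_value, modes)
print(mean_with_outlier, median_with_outlier)
```

The mean jumps when the extreme value is introduced while the median stays put, which is why the median is preferred when there are abnormal extreme values.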

When to use what?

Mean:

The average is required
The variable is continuous / discrete

Median:

The variable is discrete
There are abnormal extreme values / Non-normal data
The characteristic under study is qualitative

Mode:

The variable is discrete
There are abnormal extreme values
The characteristic under study is qualitative

Measure of Dispersion

The spread or dispersion of a set of scores around some central value
Describes the amount of heterogeneity or variation within a distribution of scores

Measure of Dispersion

Variance
Standard Deviation

Variance and Standard Deviation

Variance is the average of squared deviations about the mean (for a sample, the sum of squared deviations is divided by n − 1).
Standard deviation is the square root of the variance.
Example data: 2, 5, 5, 4, 6, 8
n = 6
Mean = (2+5+5+4+6+8)/6 = 5
Variance = [(2−5)^2 + (5−5)^2 + (5−5)^2 + (4−5)^2 + (6−5)^2 + (8−5)^2]/(6−1) = 20/5 = 4
Standard Deviation = √4 = 2

It is a measure of how spread out the distribution is.
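
The worked example can be checked with Python's statistics module. A small sketch: statistics.variance divides by n − 1 (sample variance), whereas statistics.pvariance divides by n (population variance); the choice of which one to report is an assumption that depends on whether the data are a sample or the whole population.

```python
import statistics

data = [2, 5, 5, 4, 6, 8]

sample_variance = statistics.variance(data)       # sum of squared deviations / (n - 1)
sample_sd = statistics.stdev(data)                # square root of the sample variance
population_variance = statistics.pvariance(data)  # sum of squared deviations / n

print(sample_variance, sample_sd, population_variance)
```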

Case Study – Descriptive Statistics

Business Case: A telecommunications company maintains a customer database that includes, among other things, information on how much each customer spent on long distance, toll-free, equipment rental, calling card, and wireless services in the previous month.
The telecom company surveyed 1000 of its customers on all the above services.
Use Descriptive analysis to study customer spending to determine which services are most profitable.

	N	Valid N	Min	Max	Mean	Standard Deviation
Long distance last month	1000	1000	0.9	99.95	11.72	10.36
Toll free last month	1000	475	0	173	13.27	16.9
Equipment last month	1000	386	0	77.7	14.21	19.07
Calling card last month	1000	678	0	109.25	13.78	14.08
Wireless last month	1000	296	0	111.95	11.58	19.72

On average, customers spend the most on equipment rental, but there is a lot of variation in the amount spent.
Customers with calling card service spend only slightly less, on average, than equipment rental customers, and there is much less variation in the values.
The real problem here is that most customers don't have every service, so a lot of 0's are being counted. One solution to this problem is to treat 0's as missing values so that the analysis for each service becomes conditional on having that service.

Probability Theory

Probability is a branch of mathematics that deals with the uncertainty of an event happening in the future.
Probability value always occurs within a range of 0 to 1.
Probability of an event, P(E) = (No. of favorable occurrences)/(No. of possible occurrences)
Let us take the example of an unbiased coin which has two faces. Heads and Tails. As a result we will have two possible outcomes with equal probability. i.e. the probability of getting a head is equal to getting a tail i.e. 1/2 = 0.5.

Assigning Probabilities

Classical method - based on equally likely outcomes.
E.g. Rolling a die. Probability of getting any number out of 1,2,3,4,5,6 is 1/6.
Relative frequency method - based on experimentation or historical data.
E.g. A car agency has 5 cars. Its past record, shown in the table, gives the number of cars used on each of the past 60 days.

No. of cars used	No. of days	Probability
0	3	(3/60) = 0.05
1	10	(10/60) = 0.17
2	16	(16/60) = 0.27
3	15	(15/60) = 0.25
4	9	(9/60) = 0.15
5	7	(7/60) = 0.12

According to the table above there were:
No cars used for 3 days
1 car used for 10 days
2 cars used for 16 days
3 cars used for 15 days
4 cars used for 9 days
5 cars used for 7 days
The probability can thus be calculated based on the relative frequency of each occurrence.
i.e. dividing each of the days with total of 60 days.
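
The relative frequency calculation in the table can be reproduced in a few lines of Python. This is only a sketch of the arithmetic; the variable names are illustrative.

```python
# Number of days (out of the past 60) on which 0..5 cars were used, from the table
days_by_cars_used = {0: 3, 1: 10, 2: 16, 3: 15, 4: 9, 5: 7}

total_days = sum(days_by_cars_used.values())

# Relative frequency: days with that usage divided by the total number of days
probabilities = {cars: days / total_days for cars, days in days_by_cars_used.items()}

for cars, probability in probabilities.items():
    print(cars, round(probability, 2))
```

The probabilities add up to 1 because every one of the 60 days falls into exactly one row of the table.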

Subjective Method: based on judgement.
E.g. 75% chance that England will adopt the Euro currency by 2020.

Probability Distribution

Probability distribution for a random variable gives information about how the probabilities are distributed over the values of that random variable.
It is defined by f(x), which gives the probability of each value.
E.g. Suppose we have sales data for AC units over the last 300 days.

Unit Sold	No. of days	Probability of units Sold f(x)
0	10	0.03
1	55	0.18
2	150	0.5
3	55	0.18
4	25	0.08
5	5	0.02
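
As a sketch of how f(x) is obtained from the table, the Python snippet below divides each day count by the 300 days observed. The expected number of units sold per day is not part of the original table; it is added here as one common use of a probability distribution.

```python
# Days (out of 300) on which 0..5 AC units were sold, from the table
days_by_units_sold = {0: 10, 1: 55, 2: 150, 3: 55, 4: 25, 5: 5}

total_days = sum(days_by_units_sold.values())

# f(x): probability of selling exactly x units in a day
f = {units: days / total_days for units, days in days_by_units_sold.items()}

# Expected units sold per day: sum of x * f(x) over all x
expected_units = sum(units * probability for units, probability in f.items())

print({units: round(probability, 2) for units, probability in f.items()})
print(round(expected_units, 2))
```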

Binomial Distribution

Binomial Distribution satisfies:

A fixed number of trials
Each trial is independent of the others
The probability of each outcome remains constant from trial to trial.

Example of binomial experiments

Tossing a coin 20 times, what is the probability of getting head 5 times?
Getting a diamond King from a pack of 52 cards.

Case Study - Binomial Distribution

Example of binomial distribution: Amir buys a chocolate bar every day during a promotion that says one out of six chocolate bars has a gift coupon within.
Answer the following questions:

What is the distribution of the number of chocolates with gift coupons in seven days?
What is the probability that Amir gets no chocolates with gift coupons in seven days?
Amir gets no gift coupons for the first six days of the week. What is the chance that he will get one on the seventh day?
Amir buys a bar every day for six weeks. What is the probability that he gets at least three gift coupons?
How many days of purchase are required so that Amir's chance of getting at least one gift coupon is 0.95 or greater?

(Assume that the conditions of binomial distribution apply: the outcomes for Amir's purchase are independent, and the population of chocolate bars is effectively infinite.)

Formula: P(X = r) = nCr p^r q^(n-r)
n - number of trials
r - number of successful outcomes
p - probability of success
q - probability of failure

Other important formulas include

p + q = 1
So q = 1 - p

Now in this case p = 1/6, as Amir has a chance of winning one coupon out of every 6 chocolates he buys.
The number of favorable cases is 1.
And the total number of cases is 6.

So q = 1-p = 1 - 1/6 = 5/6.

What is the distribution of the number of chocolates with gift coupons in seven days?

Binomial with n = 7 and p = 1/6: P(X = r) = 7Cr (1/6)^r (5/6)^(7-r), for r = 0, 1, ..., 7

What is the probability that Amir gets no chocolates with gift coupons in seven days?

Probability of failing 7 days : P (X=0) = (5/6)^7

Amir gets no gift coupons for the first six days of the week. What is the chance that he will get one on the seventh day?

Probability of winning a coupon on the 7th day: 1/6, because each purchase is independent of the earlier ones.

Amir buys a bar every day for six weeks. What is the probability that he gets at least three gift coupons?

Probability of winning at least 3 coupons in six weeks (n = 42 days):

P(X>=3) = 1 - P ( X<=2)

= 1 - (P (X=0) + P (X=1) + P (X=2))

= 1 - (0.0005 + 0.0040 + 0.0163 )

= 0.979

How many days of purchase are required so that Amir's chance of getting at least one gift coupon is 0.95 or greater?

Number of purchase days required so that the probability of at least one success is 0.95 or greater:

P(X>=1) >= 0.95 (As per Binomial Distribution)

P(X=1) + P(X=2) + ... + P(X=n) >= 0.95 but since Summation ( P(X=r)) = 1 so,

1 - P(X=0) >= 0.95 >> P(X=0) <= 0.05 >> (5/6)^n <= 0.05 ...taking log both sides

n log(5/6) <= log(0.05)

Since log(5/6) is negative, dividing by it reverses the inequality: n >= log(0.05)/log(5/6) = 16.43

That is n=17 days minimum.
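
The case study answers can be checked numerically. The sketch below implements the binomial formula with math.comb from the standard library; the function name binomial_pmf and the variable names are illustrative choices, not part of the original case study.

```python
from math import comb

def binomial_pmf(r, n, p):
    """P(X = r) = nCr * p^r * q^(n-r), with q = 1 - p."""
    return comb(n, r) * p**r * (1 - p) ** (n - r)

p = 1 / 6  # one chocolate bar in six carries a coupon

# No coupons in seven days
p_none_in_week = binomial_pmf(0, 7, p)

# At least three coupons in six weeks (42 days): 1 - P(X <= 2)
p_at_least_three = 1 - sum(binomial_pmf(r, 42, p) for r in range(3))

# Smallest number of days with P(at least one coupon) >= 0.95
days = 1
while 1 - binomial_pmf(0, days, p) < 0.95:
    days += 1

print(round(p_none_in_week, 4), round(p_at_least_three, 3), days)
```

Searching day by day avoids the logarithm step and gives the same minimum number of days as the algebraic solution.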

Normal Distribution

Normal Distribution is a theoretical model of the whole population.

It is perfectly symmetrical about the central value, the mean Mu; for the standard normal distribution, Mu is zero.

Poisson distribution

Discrete probability distribution for events that happen randomly in time.

Following conditions need to be satisfied -

The event results in a success or failure

The average number of successes, Mu is known

Probability of success is proportional to the region/time.

Probability of success in an extremely small region/time is almost zero.

Properties: The mean and the variance are equal, and both are denoted by Mu.

Examples

Average number of houses sold by a company is 5 per day. What is the probability that exactly 4 houses will be sold tomorrow?

Average number of births in a hospital is 2.1 births per hour. What is the probability that there will be exactly 6 births in the next two hours?
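
Both examples can be worked out with the Poisson probability formula P(X = k) = e^(-Mu) Mu^k / k!. The formula is standard but is not stated in these notes, so treat the sketch below as a supplementary illustration; the function name poisson_pmf is an assumption.

```python
from math import exp, factorial

def poisson_pmf(k, mu):
    """P(X = k) for a Poisson distribution with mean mu."""
    return exp(-mu) * mu**k / factorial(k)

# Houses: mean of 5 per day, exactly 4 sold tomorrow
p_four_houses = poisson_pmf(4, 5)

# Births: 2.1 per hour, so the mean over two hours is 2 * 2.1 = 4.2
p_six_births = poisson_pmf(6, 2.1 * 2)

print(round(p_four_houses, 4), round(p_six_births, 4))
```

Note that the mean must be rescaled to the length of the interval in question, which is why the births example uses 4.2 rather than 2.1.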

Skewness and Kurtosis

Skewness - measure of deviation from symmetry

Difference between median and mean
Right or left skewed

Skewness negative - longer tail on the left (Left Skewed); the mean is typically less than the median.
Skewness positive - longer tail on the right (Right Skewed); the mean is typically greater than the median.

Skewness can be reduced by mathematical transformations of the variable, such as logarithmic, square or square root transformations.

Kurtosis - measure of peakedness of the distribution

High Kurtosis - tall, sharp peak with heavy tails
Low Kurtosis - flat peak with light tails
Extreme case of low kurtosis - uniform distribution
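
A minimal sketch of how skewness and kurtosis can be computed from the central moments of the data. It uses the simple moment formulas (skewness = m3 / m2^1.5, excess kurtosis = m4 / m2^2 - 3); statistical packages often apply small-sample corrections, so their output can differ slightly. Both data sets are made up for the illustration.

```python
from math import log

def central_moment(values, order):
    """Average of (value - mean)^order."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** order for v in values) / len(values)

def skewness(values):
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5

def excess_kurtosis(values):
    # Subtracting 3 makes a normal distribution score 0
    return central_moment(values, 4) / central_moment(values, 2) ** 2 - 3

symmetric = [1, 2, 3, 4, 5]          # hypothetical symmetric data
right_skewed = [1, 1, 2, 2, 3, 20]   # hypothetical data with a long right tail
log_transformed = [log(v) for v in right_skewed]

print(skewness(symmetric), excess_kurtosis(symmetric))
print(skewness(right_skewed), skewness(log_transformed))
```

The last line shows the effect of a logarithmic transformation: the skewness of the transformed values is smaller than that of the raw values.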

Case Study - Skewness and Kurtosis

	N	Skewness Statistic	Skewness Std. Error	Kurtosis Statistic	Kurtosis Std. Error
Long Distance last month	1000	2.966	0.077	14.012	0.155
Toll Free last month	475	3.465	0.112	26.735	0.224
Equipment last month	386	0.756	0.124	0.641	0.248
Calling card last month	678	2.15	0.094	7.572	0.187
Wireless last month	296	1.359	0.142	3.079	0.282

Equipment last month has the skewness and kurtosis closest to zero, so its distribution is the nearest to normal of the five services, and its SD is comparatively lower than the other measures.
Conclusion - Equipment is the segment where the telecom company is making more profit than in the others, and where it can invest more.

Confidence Interval

A confidence interval is a rule for determining, from sample information, an interval that is likely to include a population parameter.
Suppose random samples are taken repeatedly from the population and an interval is calculated from each sample; a certain percentage of those intervals will contain the unknown value of the parameter.
For example, if the population is repeatedly sampled and the intervals are calculated in the same fashion, 95% of the intervals will contain the true value of the unknown parameter.
Such an interval is then called a 95% confidence interval for the population parameter.
Data Requirements

Confidence level
Statistic
Margin of Error
Range of the confidence interval = sample statistic +/- margin of error.
The uncertainty associated with the confidence interval is specified by the confidence level.

How to Construct a Confidence Interval

Identify a sample statistic - Choose the statistic that will be used to estimate a population parameter.
Select a confidence level - It describes the uncertainty of a sampling method.
Find the margin of error.
Margin of error = Critical value * Standard error of statistic
Specify the confidence interval - The range of the confidence interval is defined by the following equation.
Confidence Interval = Sample Statistic +/- Margin of error
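
The steps above can be sketched for a sample mean. The example reuses the Long distance figures from the descriptive statistics case study (mean 11.72, standard deviation 10.36, n = 1000). Two assumptions are made here that the notes do not state: the standard error of the mean is taken as SD / sqrt(n), and the critical value comes from the standard normal distribution, which is reasonable for a sample this large.

```python
from math import sqrt
from statistics import NormalDist

def mean_confidence_interval(sample_mean, sample_sd, n, confidence=0.95):
    """Sample statistic +/- critical value * standard error."""
    # Two-sided critical value, e.g. about 1.96 for 95% confidence
    critical_value = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    standard_error = sample_sd / sqrt(n)
    margin_of_error = critical_value * standard_error
    return sample_mean - margin_of_error, sample_mean + margin_of_error

# Long distance last month: mean 11.72, SD 10.36, n = 1000 (from the case study table)
lower, upper = mean_confidence_interval(11.72, 10.36, 1000)

print(round(lower, 3), round(upper, 3))
```

Raising the confidence level widens the interval, because the critical value grows while the standard error stays the same.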

2.3 Tests of Significance

Tests used in assessing the evidence in favor of or against a given assumption
Begins with a Null Hypothesis, Ho
Tests either fail to reject the null hypothesis, or reject it in favor of an Alternate Hypothesis, Ha
Two types of tests:

One sided tests
Two sided tests

Results decided by calculating the "p - value"
The p-value is the probability that the test statistic takes a value at least as extreme as the observed value, given that the null hypothesis is true.
Interpretation:

If p-value is less than the significance level alpha, reject the null hypothesis.
General values of alpha are 0.05, 0.01.

General Assumptions:

The distribution is almost normal
The samples in the distribution have almost equal variances.

One sided hypothesis testing

Muo = null value
Null hypothesis Mu = Muo
Alternative hypothesis: Mu < Muo or Mu > Muo

Example: Given a sample of heights of 100 males in New York, decide whether the height has increased in general from a given average height of 5 feet 9 inches.

Null Value: Muo = 5 feet 9 inches (5.75 feet)
Null Hypothesis: Mu = 5.75
Alternative Hypothesis: Mu > 5.75

Using one of various hypothesis tests, calculate "p-value" and reject null hypothesis if p-value is less than 0.05.

Two sided hypothesis testing

Muo = null value
Null Hypothesis: Mu = Muo
Alternative hypothesis: Mu <> Muo

Example: Given a sample of heights of 100 males in New York, decide whether the height has changed (increased or decreased) in general from a given average height of 5 feet 9 inches.

Null Value: Muo = 5 feet 9 inches (5.75 feet)
Null Hypothesis: Mu = 5.75
Alternative Hypothesis: Mu <> 5.75

Using one of various hypothesis tests, calculate p-value and reject null hypothesis if p-value is less than 0.05.
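
To make the difference between the one sided and two sided p-values concrete, here is a sketch based on a z test for the height example. Only the sample size of 100 and the null value of 5 feet 9 inches (5.75 feet) come from the example; the sample mean of 5.80 feet and the standard deviation of 0.25 feet are hypothetical numbers chosen for the illustration.

```python
from math import sqrt
from statistics import NormalDist

def z_test_p_values(sample_mean, null_value, sd, n):
    """Return the z statistic with the upper-tail and two sided p-values."""
    z = (sample_mean - null_value) / (sd / sqrt(n))
    standard_normal = NormalDist()
    p_upper_tail = 1 - standard_normal.cdf(z)            # Ha: Mu > Muo
    p_two_sided = 2 * (1 - standard_normal.cdf(abs(z)))  # Ha: Mu <> Muo
    return z, p_upper_tail, p_two_sided

# Hypothetical sample: mean 5.80 ft, SD 0.25 ft, n = 100; null value 5.75 ft
z, p_one_sided, p_two_sided = z_test_p_values(5.80, 5.75, 0.25, 100)

alpha = 0.05
print(round(z, 2), round(p_one_sided, 4), round(p_two_sided, 4))
print("reject (one sided):", p_one_sided < alpha, "reject (two sided):", p_two_sided < alpha)
```

For the same data the two sided p-value is twice the one sided value, so a result can be significant in a one sided test and not in a two sided one.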

2.4 Tests of Significance

One Sample z test - The Z test is used to compare the mean with the given standard.
Two Sample z test - The Z test is used to compare the means of two groups.
The population standard deviation must be known to calculate the Z statistic.
The Z test is generally used when the sample size is greater than 30.
T test
The t test is also used with means, but the t statistic is calculated from the sample standard deviation, so the population standard deviation need not be known. The test is preferred if the sample size is less than 30. As with the Z test, the t test can be a one sample, two sample or paired t test.
One Sample t test - When the mean of one group is compared with a given standard.
Two Sample t test - When the compared groups are independent, e.g. to compare the marks of students of two different schools.
Paired t test - When the compared groups are paired, e.g. to compare the marks of students of the same school before and after a training class.
Chi-Squared test - The goodness-of-fit test is used to test if there is a difference between the observed values and the expected values according to a particular hypothesis.
F test - Analysis of Variance (ANOVA) - To compare the means of two or more groups by analysing their variances. The most used F test is ANOVA.
F test - Regression - A less commonly used F test is the one in regression analysis.

In all the analysis tests the null hypothesis states that there is no difference between mean or variances and the alternative hypothesis suggests otherwise.

Chi-Squared Tests

Compare the observed results against an expected result based on a hypothesis
Steps:

State the null hypothesis
Prepare the contingency table for the variable
Determine the expected results
Calculate the chi-squared values
Calculate the degree of freedom
Based on the above, calculate the p-value
If p-value <0.05, reject the null hypothesis

Test of independence:

Verify if two variables are independent
Same steps as above

Case Study - Chi Squared Test

A city has a newly opened nuclear plant, and there are families staying dangerously close to the plant. A health safety officer wants to take this case up to provide relocation for the families that live in the surrounding area. To make a strong case, he wants to prove with numbers that an exposure to radiation levels is leading to an increase in diseased population. He formulates a contingency table of exposure and disease.

Exposure	Disease Yes	Disease No	Total
Yes	37	13	50
No	17	53	70
Total	54	66	120

Does the data suggest an association between the disease and exposure?

Steps:

Calculate the number of individuals of exposed and unexposed groups expected in each disease category (yes or no) if the probabilities were the same.
If there were no effect of exposure, the probabilities should be same and the chi-squared statistics would have a very low value.

Proportion of population exposed = (50/120) = 0.42
Proportion of population not exposed = (70/120) = 0.58

Thus, expected values (calculated with the unrounded proportions):
Population with disease = 54
Exposure Yes: 54 * (50/120) = 22.5
Exposure No: 54 * (70/120) = 31.5

Population without disease = 66
Exposure Yes: 66 * (50/120) = 27.5
Exposure No: 66 * (70/120) = 38.5

Exposure	Disease Yes	Disease No	Total	Total Proportion
Yes Actual	37	13	50	50/120 = 0.42
Yes Expected	22.5	27.5
No Actual	17	53	70	70/120 = 0.58
No Expected	31.5	38.5
Total	54	66	120

Calculate the Chi-Squared statistic

X^2 = Summation of [(Observed Freq. - Expected Freq.)^2/ Expected Freq]
= ((37-22.5)^2 / 22.5) + ((13-27.5)^2 / 27.5) + ((17-31.5)^2 / 31.5) + ((53-38.5)^2 / 38.5)
= 29.1

Calculate the degree of freedom:

df = (Number of rows -1) x (Number of columns -1)
df = (2-1) x (2-1)
df = 1

Calculate the p-value from the chi-squared table (found online).
For Chi-Squared value 29.1 and degree of freedom = 1, from the table, the p-value is < 0.001.
Interpretation: There is less than a 0.001 probability of obtaining such a discrepancy between expected and observed values if there is no association.
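
The whole case study can be reproduced with a short Python sketch. The expected counts, the chi-squared statistic and the degrees of freedom follow the steps above. For the p-value, the sketch uses the closed form erfc(sqrt(x / 2)), which holds only when there is 1 degree of freedom; that shortcut is an assumption of this example and replaces the table lookup.

```python
from math import erfc, sqrt

# Observed counts: rows = exposure (yes, no), columns = disease (yes, no)
observed = [[37, 13],
            [17, 53]]

row_totals = [sum(row) for row in observed]
column_totals = [sum(column) for column in zip(*observed)]
grand_total = sum(row_totals)

# Expected count under no association: row total * column total / grand total
expected = [[r * c / grand_total for c in column_totals] for r in row_totals]

chi_squared = sum(
    (o - e) ** 2 / e
    for observed_row, expected_row in zip(observed, expected)
    for o, e in zip(observed_row, expected_row)
)

df = (len(observed) - 1) * (len(observed[0]) - 1)

# Upper-tail probability of a chi-squared variable; valid only for df == 1
p_value = erfc(sqrt(chi_squared / 2))

print(expected, round(chi_squared, 1), df, p_value < 0.001)
```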

ANOVA

Analysis of Variance - used to compare more than two groups
Extension of the independent t-tests
Factor variable - variable defining the groups
Response variable - variable being compared
One way ANOVA

Groups of a single variable
E.g.: Is there a difference in students' marks based on the row they are seated in - front / middle / back?

Two way ANOVA

Two independent variables
E.g.: Do race and gender affect a person's yearly income?

Case Study - One way ANOVA

Marks obtained in the same subject by three students from each of three different schools are given below.
Does the data suggest any association between school and marks?

School	A	B	C
Marks 1	82	83	38
Marks 2	83	78	59
Marks 3	97	68	55

The basic idea in ANOVA: Partition the total variation in the data into the variation between groups and the variation within groups.
Steps:

Calculate the means

School A: mean(82, 83, 97) = 87.3
School B: mean(83, 78, 68) = 76.3
School C: mean(38, 59, 55) = 50.7

Calculate the grand mean

Grand: mean(82, 83, 97, 83, 78, 68, 38, 59, 55) = 71.4

Calculating the variations

Sum of Squared Deviations about the grand mean, across all observed values: SStotal = 2630.2
Sum of Squared Deviations of group mean about the grand mean - three group mean against the grand mean: SSbetween=2124.2
Sum of Squared Deviations of observations within a group about their group mean; added across all groups: SSwithin=506

Calculate the degree of freedom for every variance:

df_total = number of observations -1 = 9-1 = 8
df_between = number of groups -1 = 3 -1 = 2
df_within = number of observations - number of groups = 6

Calculate the Mean Squared Variances

Mean squared variance between groups MS_between = SS_between / df_between = 2124.2/2 = 1062.1
Mean squared variance within groups MS_within = SS_within / df_within = 506/6 = 84.3

Calculate the f-statistics

F-value = MS_between/MS_within = 1062.1/84.3 = 12.59

Calculate the p-value from the F-table

P-value for given f-value 12.59 and degree of freedom 2 and 6 is 0.007

Conclusion: since the p-value is less than alpha, we can conclude by rejecting the null hypothesis, that there is a difference in the marks obtained by students belonging to different groups.
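
The steps of the case study can be sketched in Python as follows. The sums of squares, degrees of freedom and F-value follow the calculation above. The p-value line uses a closed form, (1 + 2F / df_within)^(-df_within / 2), that is valid only when the between-groups degrees of freedom equal 2, as they do here; this shortcut is an assumption of the example and stands in for the F-table.

```python
# Marks of three students from each school, from the case study table
groups = {"A": [82, 83, 97], "B": [83, 78, 68], "C": [38, 59, 55]}

all_marks = [mark for marks in groups.values() for mark in marks]
grand_mean = sum(all_marks) / len(all_marks)

def mean(values):
    return sum(values) / len(values)

# Variation of the group means about the grand mean, weighted by group size
ss_between = sum(len(marks) * (mean(marks) - grand_mean) ** 2 for marks in groups.values())

# Variation of the observations about their own group mean
ss_within = sum((mark - mean(marks)) ** 2 for marks in groups.values() for mark in marks)

df_between = len(groups) - 1
df_within = len(all_marks) - len(groups)

f_value = (ss_between / df_between) / (ss_within / df_within)

# Upper-tail probability of the F distribution; valid only for df_between == 2
p_value = (1 + df_between * f_value / df_within) ** (-df_within / 2)

print(round(ss_between, 1), round(ss_within, 1), round(f_value, 2), round(p_value, 3))
```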

2.5 Non Parametric Testing

Referred to as "distribution free", as they don't involve making assumptions about the distribution of the data.
They have lower power than the parametric tests and hence are usually given second preference after the parametric tests.
These tests are typically focused on the median rather than the mean.
They involve straightforward procedures like counting and ordering.
There is at least one non-parametric test for each parametric test, and they are classified into the following categories:

Tests of differences between groups (independent samples)
Tests of differences between variables (dependent samples)
Tests of relationship between variables

To test the relationship between variables, one usually computes the correlation coefficient.

Non-parametric equivalents of the standard correlation coefficient are:

Spearman's R
Kendall's Tau
Coefficient Gamma

Appropriate non-parametric tests for the relationship between two categorical variables are the chi-squared test, the phi coefficient and the Fisher exact test. In addition, a simultaneous test for the relationship between multiple cases is available: the Kendall coefficient of concordance. This test is often used to express inter-rater agreement among independent judges who are rating (ranking) the same stimuli.

Non Parametric Tests

Tests	Parametric	Non Parametric
One Quantitative Response Variable	One Sample t-test	Sign Test
One Quantitative Response Variable - Two Values from Paired Samples	Paired Sample t-test	Wilcoxon Signed Rank Test
One Quantitative Response Variable - One Qualitative Independent Variable with Two Groups	Two Independent Sample t-test	Wilcoxon Rank Sum or Mann-Whitney Test
One Quantitative Response Variable - One Qualitative Independent Variable with Three or more Groups	ANOVA	Kruskal-Wallis

Correlation

Measure of association between variables

Positive and negative correlation, ranging between +1 and -1

A positive correlation (up to a value of +1) implies that as the value of the independent variable increases, the value of the response variable also increases.

Similarly, a negative correlation (down to a value of -1) implies that as the value of the independent variable increases, the value of the response variable decreases.

Positive Correlation Example:

Earning and expenditure - more a person earns more he/she spends.

Negative Correlation Example:

Speed and time - As the speed of the vehicle increases the time taken to cover a given distance decreases.

Parametric - normal distribution and homogeneous variance.

Pearson correlation

Non Parametric - no distributional assumption, ordinal variable

Spearman correlation

Correlation Coefficient

r: correlation coefficient
-1: Perfectly Negative
+1: Perfectly Positive
0 - 0.2 : No or very weak association
0.2 - 0.4 : Weak association
0.4 - 0.6 : Moderate association
0.6 - 0.8 : Strong association
0.8 - 1 : Very strong to perfect association
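
The speed and time example can be used to compare the Pearson and Spearman coefficients. The sketch below implements both by hand; the distance of 120 km and the list of speeds are made-up values, and the simple ranking helper assumes there are no tied values.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation: covariance divided by the product of the spreads."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / sqrt(spread_x * spread_y)

def ranks(values):
    """Rank 1 for the smallest value; assumes no ties."""
    ordered = sorted(values)
    return [ordered.index(v) + 1 for v in values]

def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))

speeds = [20, 40, 60, 80, 100]            # km/h, hypothetical
times = [120 / speed for speed in speeds]  # hours to cover a hypothetical 120 km

r_pearson = pearson(speeds, times)
r_spearman = spearman(speeds, times)

print(round(r_pearson, 3), round(r_spearman, 3))
```

Time falls whenever speed rises, so the rank-based Spearman coefficient is exactly -1, while the Pearson coefficient is strongly negative but not -1 because the relationship is not a straight line.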

Summary

Overview of Statistical Methods
Population, Samples & Sampling Plan and Sampling Methods
Descriptive Statistics - Measure of Central Tendency and Measure of Dispersion
Probability Theory and Distributions
Confidence Interval
What are Tests of Significance
The process flow of hypothesis testing
One Sided and Two Sided Hypothesis Testing
Various Tests used in calculating p-value
What is Non-Parametric Testing and why it is used.
Non-parametric alternatives for the usual tests of significance

Sunday, May 8, 2016
