BA using SAS: 03 Basic Analytic Techniques

3.1 Basic Analytic Techniques using SAS and Excel - Part 2

Diagnostic Analytics

For the questions like:

What happened?
What is happening?
Why something happened?

You focus on the information that you have in hand and on the information that can be used for data mining. this is the purview of descriptive and diagnostic analytics.
Tries to understand the reasons behind the past behavior of data.
Descriptive analysis is generally used for identifying what really happened?
In this course we will look at Correlation, T-tests, and ANOVA in detail.

Steps for Correlation in SAS Studio

From the previous lessons we know that the correlation is the measure of association between two variables and could be either positive or negative. Let us now look at the steps involved in performing correlation in SAS.

Load the data set into SAS Library using proc procedure.
Click on the Tasks tab.
Click on correlation under statistics.
Load the appropriate data set using drop down data option.
Choose the analysis variable and appropriate correlate with value.
Click on the options tab.
Check on correlation and display p value options.
Click on run.
Output is displayed.

Demo for Correlation in SAS Studio

Firstly we need to import the file into the SAS system.
Here for the sake of convenience we will be using the sample CARS data set from the SAS library.
We will try to find if there is any association between the selling price and the horse power or the engine size.
We will be analyzing the selling price variable against both horse power and the engine size.
Click on Tasks
Click on Correlation under Statistics.
Check whether the right data set is loaded under the DATA option.
Choose the appropriate "Analysis variable. For this case choose the MSRP variable.
Under the "Correlate with" option choose the HorsePower and the EngineSize

Click the Options tab from the top.
Check whether options for Correlation and Display p-values are marked.
Click on the Run Icon
The results will be obtained in a new window.

Here we could see strong association between Horsepower and MSRP because of the high value of the correlation coefficient. Moderate association between MSRP and EngineSize.

Steps for Correlation in Excel

Select the Excel data sheet in which you need to perform the correlation.
Click on the data analysis from the Data tab right most side. (Data Analysis tool most be added through the Excel Add-ins options.)
Click on Correlation and select Ok.
Select the input range for correlation by clicking and dragging the cursor over the data.
Check on "Labels in first row" option to define if your first row represents variable names.
The default option for output is the new worksheet ply option.
Finally click on OK.
You will get the results in new worksheet ply.

Demo for Correlation in Excel

In this demo we are going to find the correlation between the Sales and the Money spent on advertisements.

Click on Data Analysis under Data Menu.
Click on Correlation and Ok.

Click and Drag the input range using the mouse to select the data.
If your data is grouped by column check on column.
Check "Labels in First Row", if you have Labels in first row of your input data.
Click Ok.

The results are displayed in the new work sheet.We can see the value of 0.98 for advertisement and sales. As we know the value of 0.8 to 1.0 is considered to be a very strong to perfect association category.

So we can conclude that there is a strong positive correlation between Sales and Advertisement

Case Study - Correlation in SAS Studio

Business Case:

An ice-cream company maintains a database of total sales per day and the corresponding temperature everyday. Given the data for a particular week, the management wants to find whether temperature is a factor that could potentially cause a difference in their sales.

Open the SAS Studio
Load the data for analysis
Click on Tasks Tab
Click on Correlations under Statistics
Select Sales v/s Temp dataset from the working library under DATA option
Select the Analysis Variable as Sales.
Select the Correlate with Variable as Temp.
You can see that the code section is getting updated with the necessary code for running the correlation task.

Let us now fill the details in the options tab.
Ensure that the Correlation, Display p-values and Description Statistics options are checked.
Click on Run.

Output is displayed in the Results Tab

The Results tab shows the list of variables in the data and provides the descriptive statistics.
The correlation coefficient between sales and temperature for the given data is calculated as 0.94761.
From the correlation value tables we can infer that there is a very strong positive relationship between sales and temperature.
A positive correlation coefficient denotes that an increase in temperature results an increase in sales.

Case Study - Correlation in Excel

Let us perform the correlation using the same data set using Excel.

Open the data set in Excel
From the Data Tab click on Data Analysis option
Select the correlation option and click on Ok.

Correlation window appears.
Define the Input Range on which correlation needs to be performed.
Click and drag the cursor to select the Sales and Temp variables in the input range.
Check the "Labels in First Row" as the data we have has the Labels in first row.
The "New worksheet ply" is selected by default to show the results in the new worksheet.
Finally click on Ok.

The results are displayed on the new worksheet.

As you can see that the correlation between Sales and Temp is calculated as 0.947607, same as the one obtained in SAS Studio.
Again this values shows that there is a strong positive relationship between sales and temperature. i.e.if the temperature on a particular day is high the sale of ice-cream is also high.
Though it can be deduced based on intuition that the sale of ice-cream is higher for summer days.
Correlation helps to support the hypothesis with proof obtained from the data.

Steps for ANOVA in SAS Studio

Analysis of Variance or ANOVA can be performed by the following steps:

Load the data set into SAS working library using proc procedure
Click on Tasks
Select one way ANOVA under the statistics section
Load the appropriate dataset using the data drop down option
Select the dependent variable
Select the explanatory variable
Then click on options tab and check on tests for normality under normality assumptions
Then check on Welch's variance weighted under homogeneity of variance
Place a check next to the box plots and diagnostic plots from plot tab
Click on the run icon and you will get the output in a resultant window

Demo for ANOVA in SAS Studio

In this demo we will look at the steps involved in analyzing the variable using ANOVA for a particular data set using SAS studio.

Here for the sake of convenience we will use the same sample car data set.from the SAS help library.
As already discussed ANOVA is used to compare the means between two or more groups.
Here we will compare the MPG_City variable i.e. Miles per Gallon in a city of a Car based on its version. We can see that origin has three values ASIA, EUROPE and UNITED STATES.
To perform ANOVA on these variables, click on Tasks.
Click on One-way ANOVA from Summary Statistics.

From the data option select SASHELP.CARS data set.
Select MPG_City in dependent variable.
Select Origin in Explanatory variable.

Click on OPTIONS tab.
There are settings to check the NORMALITY, HOMOGENEITY OF VARIANCE, etc.
You can also check on plots to be displayed, like Box Plot and Diagnostic Plot.
Finally click on Run

The ANOVA Output between MPG_City and Origin will get displayed in Resultant window.

These results provide the P and F value for the variables and it also displays the box plot distribution between those variable since Box Plot is selected by default.

We already know that if the P value is less than 0.05 then we reject the NULL hypothesis and if not we don't reject the NULL hypothesis.
Here since the P value is less than 0.001 we reject the NULL hypothesis and conclude that the cars of different origins have different MPG values.

In addition to the P value result the test also provides difference in mean and confidence limit of that variable.

Steps for performing ANOVA in Excel

Open the data set in Excel
Select data analysis from the data tab.
From the resulting window, click on ANOVA - single factor.
The single factor ANOVA window appears. Define the input range for which you need to perform ANOVA.
Check on labels in first row if your first row contains variable name.
Check on labels in first row if your first row contains variable name.
Define the alpha level as 0.05.
Click on new worksheet ply radio button to obtain results in new worksheet ply.
Finally click on Ok.

Demo for performing ANOVA in Excel

We will take the example data set of student performance in three sections of a school.

The first column represents the marks of students belonging to section A.
The second column represents the marks of students belonging to section B.
The third column represents the marks of students belonging to section C.
Here we want to check whether the mean mark of the sections equal or not.
Thus our NULL hypothesis is where all the means are equal.
And our ALTERNATIVE hypothesis is at least one mean in different.
To find the results we will perform one-way ANOVA test on the data.
Click on the "Data" menu from the menu bar.
Click on "Data Analysis" tool from the right most side.

Click on "Anova: Single" Factor from the Data Analysis window.

Click and drag using mouse to select the input data.
Check "Labels in First Row" if the input data has labels in it otherwise ignore the labels column.
Mark where you would like to see the output. -- New Worksheet Ply
Finally click on Ok.

The result is displayed in the New Worksheet Ply.

To recap the basics of ANOVA.
The F value is the value for the test statistic and F critical is the citical value.

The test statistics in ANOVA is the ratio of two scale sums of squares reflecting different sources of variability.
That is F is equal to explained variance divided by unexplained variance.
The critical value is the number that the test statistic must exceed to reject the test.
If F > F-critical we reject the NULL hypothesis.
If F < F-critical we do not reject the NULL hypothesis.
Here the F value is obtained as 21.27 and F-critical value is 3.25.
Since F values > F-critical we reject the NULL hypothesis i.e the assumptions that all means are equal.
And we accept the Alternate hypothesis that at least one mean is different.

Case Study - ANOVA

A pharmaceutical company has tested three formulations of a body pain for athletes. There are 27 volunteers selected and 9 were assigned randomly to one of the medications. The athletes were instructed to take the medication during body pain and report the pain in a scale of 1 to 10. (10 being too much of pain).

Pain Rating	Medication Type
4	1
5	1
4	1
3	1
2	1
4	1
3	1
4	1
4	1
6	2
8	2
4	2
5	2
4	2
6	2
5	2
8	2
6	2
6	3
7	3
6	3
6	3
7	3
5	3
6	3
5	3
5	3

The table shows the 9 observations of each medication.
We will now use Analysis of Variance i.e. ANOVA to test the significance difference between means of all three medications.

ANOVA Case Study using SAS Studio

Let us look at the steps for doing ANOVA in SAS Studio.

Load the data set into SAS working library using PROC procedure - refer to the previous steps to upload the data set using the PROC procedure.
Here in our case study we will use the data set of athletes as shown in the previous example.

Through ANOVA we are going to analyse the differences between means of different medications.
Click on Tasks
Select One-Way ANOVA

Select the WORK.MEDICATION data set under DATA option.
Select the Dependent Variable as Pain Rating.
Select the Explanatory Variable as Medication Type.

Click on Options Tab and check for Normality Assumptions and see if Test for normality is checked.
Check on Welch's variance-weighted ANOVA under Homogeneity of Variance.
Check on the Box Plot and Diagnostic Plots under PLOTS option.
Click on Run.

The output is displayed in the RESULTS window.

The RESULT window shows the P values as 0.0003.
Since the P value is less than 0.05 we reject the NULL hypothesis in favor of the Alternative hypothesis.
This shows that there is some difference between the group means.

ANOVA Case Study using Excel

The steps that are involved in ANOVA using Excel are:
Open the data set in Excel.

Select Data Analysis from Data tab.

Click on the "Anova: Single Factor" from the window and click on Ok.

Define input range for which you need to perform ANOVA by clicking and dragging the mouse over the data range.
Check on "Labels in first row"
Let us define the Alpha values as 0.05.
Click on the "New Worksheet Ply" radio button.
Click on Ok.

The result is displayed in New Worksheet Ply.
The result summary shows the Sum, Average and Variance of all groups.
Let us look at the results one by one.
The ANOVA result window shows the Sum of Squared values (SS) between groups is 130.666 and within groups is 74.666.
The Mean Squared (MS) values between groups is 130.666 and within groups is 1.4358.
The Degrees of Freedom (df) between groups is 1 and within groups is 52.
The P value 5.16E-13 shows that it is less than 0.05. Hence we reject the NULL hypothesis value that the means are equal in favor of the alternate hypothesis which states that there is a difference in means.
In our case study we can conclude that different medications have different effects on athletes pain rating.

Steps for performing Once Sample T-test in SAS Studio

We have learned in our previous lessons that one sample t tests is a statistical procedure often performed for testing the mean value of a distribution.
To implement the t-test using SAS let us now look at the steps involved.
Load the data set into SAS working library using proc procedure.
Click on Tasks
Select one sample t-test under Statistics.
Load the appropriate data set using drop down data option.
Select the analysis variable from roles.
Click on options. Select the test that is required for your analysis from the drop box.
Mention the alternate hypothesis value, that is, the mean against which you need to perform the one sample t-test.
Check on the plots that are required for your analysis.
Click on Run.
The output is displayed in the resultant window.

Demo for One Sample t-test in SAS Studio

In this demo we will use the same sample CARS data set from SAS help library.
We will concentrate on the variable retail price of car i.e. stored in the variable named MSRP.
And we will check the mean value of this distribution.
To start with select the SAS sample CARS data set from SAS library and load into the workspace.
Click on Tasks and Select "One-Sample t Test" under Statistics.
From the DATA drop box select SASHELP.CARS sample data set.
In the next steps select Analysis variable from ROLES i.e. MSRP.

Click on the Options Tab.
Select the test which is required for your analysis from the drop box.
Here we will select a "two tailed test".
Then mention the Alternative hypothesis mu values i.e. the mean agains which you want to perform the one sample t-test.
For this example let us give a mean of 3000 dollars. i.e. we would like to know if the average price of a car is higher or lesser than 30000 dollars.
Check on the plots that are required for your analysis.
Here we have checked on Histogram and box plots and the Normality Plot.
Click on Run.

The output is displayed in resultant window.

The resultant window is showing the P value as 0.0033.
Since it is less than 0.05 we will reject the NULL hypothesis.

The histogram also displays the distribution of prices.
The blue curve shows a normal distribution and the red curve shows the distribution for our data.
From the curve we can infer that the distribution is right skewed.

The results also display the distribution of MSRP with 95% confidence intervals for mean and quantile plot for the MSRP variable.

Steps for Paired Two Sample t-test in Excel

Open the data set in excel.
Select data analysis from data tab.
Select paired two sample for mean from the list.
Select input range for Variable1
Select input range for Variable2
Type in the hypothesized mean difference in option provided.
Check on labels if every column has a column name.
Click on new worksheet ply radio button to obtain result in a new worksheet ply.
Then click on "Ok" and get the resultant window in new worksheet ply.

Demo for Paired Two Sample t-test in Excel

Now we will perform 2 sample t-test in Excel. Here for simplicity we will take a small data set.
The data set contains the batting average of 6 Batsman and 6 All rounders.
Here you will check if the two samples have the same or different mean.
To formulate our NULL hypothesis we specify that the sample means are equal.
Correspondently the alternative hypothesis specifies that the sample means are different.
Let us now start the task.
Open Excel.
Load the required data.

Click on the Data Menu
Click on Data Analysis from the right most side.
Select "t-test: Paired Two Sample for Means" from the displayed list.
Click on Ok.

Select the input range for Variable 1 i.e. the averages for batsman.
Select the input range for Variable 2 i.e. the averages for allrounder.
Type in the "Hypothesized Mean Difference" in the option provided. Here we are assuming that the means are equal i.e. mu1 - mu2 = 0. So we input the values as 0.
Check on Labels option if every column has the column name.
Select the place to display the results i.e "New Worksheet Ply".
Click on ok.

The results will be obtained in the New Worksheet Ply.

Here we have performed a 2 tailed test.
If the obtained t stat value is greater than t Critical two-tail then we reject the NULL hypothesis.
However the case here is different. Here the t stat is less than t critical.
Therefore we cannot reject the null hypothesis.
The observed difference between the sample means 36 to 28 is not convincing enough to say the batting averages vary significantly.

Case Study - Independent t-tests

A researcher studying the effect of a drug on the growth of tumors wishes to find if the drug readuces the growth of tumors. The researcher obtains mice with tumors and divides them into two groups. One group of mice were injected with the drug, and the other group was used as the control group. After one week, the tumors were weighed again and the results were noted as follows. The research question is, does the drug have an effect of reducing the tumor size?

Treatment	Weighed Tumor	Treatment	Weighed Tumor
A	0.72	B	0.71
A	0.68	B	0.83
A	0.69	B	0.89
A	0.66	B	0.57
A	0.57	B	0.68
A	0.66	B	0.74
A	0.7	B	0.75
A	0.63	B	0.67
A	0.71	B	0.8
A	0.73	B	0.78

The data set contains 2 variables, the treatment and the weight of the tumors. There are two treatment types A and B, with A being the treatment group and B being the control group. In the next section we will look at the steps to solve this problem using the two sample t-tests in SAS Studio.
Load the data set into the SAS working library using proc procedure as mentioned in earlier slides.
Note that the data set has to be in a single file though the previous slide shows as two different table.
Click on Tasks and select Two-Sample t-test.
The two sample t-test window appears.
From the data drop down box select the data set for which the two sample t test need to be performed i.e. WORK.DRUGTREATMENT.
Then select Weighted_Tumor as Analysis Variable.
Select the Drug_Treatment as the Group Variable.

Select the OPTIONS tab and select the Tails as "Lower one-tailed test".
Since we are interested in checking whether the tumor size reduces as a result of the medication.
Define the Alternative hypothesis value i.e. the difference mu1-mu2 <0 as Zero by default.
Check on the "Tests for normality" under Normality Assumption.
Check on the Histogram and box plot to obtain the graphical summary.
Check on Normality Plot.
Finally click on Run.
The result will be displayed in the resultant window.

The result shows that the mean difference between both treatment is -0.0670 and the standard deviation is 0.0726.
For equal variances the p-values is 0.0269.
For unequal variances the p-value is 0.0293.
Since both p-values obtained are less than 0.05 we reject the null hypothesis in favor of the alternative hypothesis.
i.e. there is a difference in mean. In our case we can interpret the results as follows:
The tumor size decreases with the effect of medication.

The distribution plot and quantile plot are also being displayed in the resultant window.

As seen earlier the results can be downloaded as PDF and RTF file.

Two Sample T- Test in Excel

In this section we will see to solve the same problem using two-sample t-test in excel.
Let us follow the same procedure to perform the two-sample t test in Excel.
After opening the data file select data analysis from the data tab.

Select "t-test: Two-Sample Assuming Unequal Variances" from the Data Analysis Window.
A new window appears with the fields required for the test.

To find the input range for which you need to perform the t-test
Select "treated with drug" as variable 1.
Select "Not Treated with Drug" as variable 2.
Define "Hypothesized Mean Difference" as 0.
Check on "Labels" if your first row contain variable name.
The alpha value is defined as 0.05.
Click on "New Worksheet Ply" to obtain results in a new Worksheet.
Then click on Ok and the result is obtained in a new worksheet ply.

The resultant window shows the mean and variance of the tumor weights for the variables "Treated with Drug" and "Not Treated with Drug" categories.
The one tail P value is 0.0290 and it is less than 0.05.
So we reject the Null hypothesis in favor of alternate hypothesis.
We can thus conclude that the tumor size decreases as an effect of the medication.

3.5 Case Studies

Case Study

David, a construction startup company chairman from Canada wants to construct a residential apartment in a locality near Ottawa. He plans to do a complete analysis to avoid any management mistakes. So, he asks his colleagues, Mike and Tom, to gather all relevant data and start working on the data to get useful insights regarding the residential apartments in this area. He wants to get all the business insights possible, so that he can compete with the competitors, and also avoid taking much risk.

Let us assume that Mike is an expert in Excel, and Tom is an expert at SAS. In the following sections, we will look at the given case from two different perspectives, and try to gather as much information as possible from the data and the tools provided.

Now we will look at the data in detail:

Above is the snapshot of the data displayed. The given data consists of the information on 58 residential buildings. Here in this snapshot we see the data for first 10 buildings. Let us look at different variables in detail.

Local Price gives us the minimum price per square foot for each building.
Bath Count gives us the count of number of bathrooms in each building.
Area in Sq. Ft. mentions the area of the building.
Living space gives us the area of the building which forms the living space i.e. for the rooms.
Garage Count gives us the number of garages in the building.
Room Count gives us the total number of rooms.
Bed Count gives us the number of bedrooms in the building.
Age in years, specifies the number of years since the building has been constructed.
Material gives the materials used in the building e.g. Alumunium, Brick, Wood, etc.
Level Specifies the type of buildings. Here we can see buildings with two storey. Other types given here are split level and ranch.
Finally the Selling Price supplies the price of the building was build for.

Mike's questions:

What is the average, minimum and maximum selling price of residential apartments in particular locality?
What is the most commonly occurring room count the residential apartments belonging to the same locality?
Is there any relationship between the selling price of the apartments and the number of bathrooms or bedrooms?
Does the selling price depends on the age of apartment, and, how do apartments that are old fare?
Does the selling price depend on the area of living space square feet?
What range does the maximum selling price lies between?

To solve the questions we load the data into excel.

Check if the data analysis tool in Excel is installed as an add in.

What is the average, minimum and maximum selling price of residential apartments in particular locality near Ottowa?
Select the Selling Price Column and Calculate the Summary Statistics

We can see that average selling price is 34.27 and the maximum and minimum selling price are 84.9 and 1.2 respectively.

What is the most commonly occurring room count the residential apartments belonging to the same locality?

Select the Room Count column and calculate summary statistics.

We can see that the apartments with 6 Room count are selling the most.

Is there any relationship between the selling price of the apartments and the number of bathrooms or bedrooms?

In order to see whether the selling price and the bathroom count is related or not check for correlation between those two variables.
Here the Analysis variable is selling price and the Correlate with value is Bathroom count.

From the results we can understand that the correlation value is 0.5263.
From the earlier slides we know that this value specifies a moderate relationship between the selling price and the bathroom count.
A similar study can be done on the relationship between the selling price and the number of bedrooms as well.

Does the selling price depends on the age of apartment, and, how do apartments that are old fare?

To see if the selling price is dependent on the age of the apartment perform correlation with the selling price as analysis variable and correlate with value as Age_yr variable.

The result show that there is a weak negative correlation of -0.2105 between the selling price and the age of the apartment.
This means that if the age of the apartment is high then the selling price of that apartment decreases.

Does the selling price depend on the area of living space square feet?

To see if the selling price is dependent on the area of living space perform the correlation with the selling price as the analysis variable and correlate with value as living space.

The obtained results shows that there is a strong positive correlation of 0.7896 between selling price and living space.
So bigger the living space in an apartment, the higher its selling price.

What range does the maximum selling price lies between?

To define the maximum selling price range create a bin value column containing values 10, 20, 30, 40, 50, 60, 70, 80, 90 with an increments of 10.

Plot the histogram for selling price with the newly created bin values as its bin range.

The resultant graph shows the histogram plot of selling prices.
From the graph we can see that the bars for the range 30 to 40 is the highest.
Thus we can understand that the majority of the apartment sold have a selling price ranging between 30 and 40.

With this Mike is done with the given analysis of his data.

Tom's questions:

What is the general trend of all the variables?
What is the average selling price for the different levels, that is, the two-level, split and ranch apartments in this region?
What is the average selling price for apartments grouped by the different materials that is being used for construction, that is, aluminium/wood, brick, brick/wood, wood in this region?
What is the general trend of the number of rooms in the apartment?
Is there any relationship between the local price and other variables such as Bathroom count, Area square feet, Garage count, rooms count, bedrooms count, age and selling price?
Is there any relationship between the selling price and bathroom count?
Is there any relationship between the selling price and number of bedrooms?
Does the distribution of the selling price of apartments fall around an average value of 31?

To solve these questions we will first load the data set into SAS using the proc procedure as mentioned in the previous sections of our lesson. Now we will look at the questions one by one.

Since the procedures to implement the tests have been discussed extensively we will concentrate on the business applications of these tests in this session.

What is the general trend of all the variables?

Tom wants to start by answering the basic questions, so he wants to prepare basic summary of data using bar chart, based on which he can find the maximum selling price, material wise.

To create the bar chart click on the Bar Chart under Graph section.
Select Material under the Category variable.
Select Selling Price under the Response variable.
Click on Run.

From the results it is easy to understand that wood has the maximum selling price, whereas aluminium and brick has comparatively low prices.

What is the average selling price for the different levels, that is, the two-level, split and ranch apartments in this region?

Tom also wants to find the maximum and minimum value in this location.
The variation between other values and the mean value.
To answer this question we need to perform a descriptive analysis on this data.
So choose Summary Statistics under Statistics.
Choose Selling Price under Analysis variable.
Choose Level under Classification variable.
Click on Run

From the results Tom has found that the price of the split level apartment is the lowest while two story has the high value.
The fluctuation of the price in the split level category is also comparatively less.
Ranch and Two story apartment has shown a similar kind of trend as far as fluctuation in selling price is concerned.
Interestingly two story has recorded as the maximum and minimum selling price of the apartments.

What is the average selling price for apartments grouped by the different materials that is being used for construction, that is, aluminium/wood, brick, brick/wood, wood in this region?

In order to answer this question we perform Summary Statistics with
Selling price as analysis variable and
Materials as classification variable.
By this we can find the descriptive details of selling price based on each material being used in the construction.

From the results it can be found that the average selling price of the material Aluminium/Wood is the lowest and Wood is the highest.
The fluctuation of the selling price of the material Aluminium/Wood is comparatively lower than Brick/Wood and Wood.

What is the general trend of the number of rooms in the apartment?

The contractor is in a dilemma whether to increase the number of rooms or not.
Tom has decided to find the general trend, so that he can provide his boss some useful insights.
These kinds of questions can also be answered using Summary Statistics (Descriptive Analysis).
Select Selling price under analysis variable.
Room Count under Classification variable.

Here we have a parabolic curve with the values increasing initially and then showing a declining trend.
It can be inferred that the builders have got in a maximum profit when the room count is equal to 8.
Here the flats were sold at the highest price and the average selling price is also comparatively higher compared to other set of room numbers.

Is there any relationship between the local price and other variables such as Bathroom count, Area square feet, Garage count, rooms count, bedrooms count, age and selling price?

We will use correlation to answer this question.
Choose Local Price as Analysis Variable
Choose Bathroom_Cnt, Bedroom_Cnt, Area_Sqft, Garage_Cnt, Rooms_Cnt, Age_yr, Selling_Price in Correlate with:
Click on Run

From the results we can derive the following important conclusions:

For the Bathroom_Cnt v/s Local Price Column we have two values in each column, the first value gives the correlation value and the second p-value. In this case we have correlation value as 0.44245 and p-value as 0.0006. Since the p-value is less than 0.05 we can conclude that there is a correlation between these two variables.
Similar insights could be found for all other variables.

Is there any relationship between the selling price and bathroom count?

To check the differences between the selling price and the bathroom count Tom uses ANOVA.
Here the null hypothesis is that there is no difference in mean between the groups and the alternative hypothesis is that there is at least one mean that is different.
To perform the test in SAS we input Selling Price as the dependent variable and Bathroom Count as the explanatory variable and run.

From the results as the p-value is less than 0.05 we reject the NULL hypothesis and conclude that there is a difference in the means.

Is there any relationship between the selling price and number of bedrooms?

Since there is no relationship between selling price and bathroom count Tom has decided to check one more variable.
In this case the Selling Price and Bedrooms Count.
Again he decides to use ANOVA to confirm it.
Here the NULL hypothesis is that there is no difference in mean and the alternative hypothesis is that there is a difference in mean.
Again to perform the test in SAS we input the selling price as the dependent variable and bedroom count as the explanatory variable and click on Run.

As the p-value is less than 0.05, we conclude that there is a difference in means i.e. we reject the NULL hypothesis.

Does the distribution of the selling price of apartments fall around an average value of 31?

To answer this question Tom uses One-sample t-test with Selling price as Analysis Variable and

Alternative hypothesis of mean 31. Click on Run.

The p-values is 0.0592.
It is almost equal to 0.05.
Hence we are not rejecting the Null Hypothesis.
From this we can draw a conclusion that there is no significance deviation from the given average and the selling price falls with the mean value 31.

This concludes the list of questions that Tom wanted to answer.

In this case study we have analyzed a data set through descriptive and inferential statistic methods using both SAS tools and Excel Add-ins and we have arrived at a number of useful insights.

Summary

Introduction to SAS Studio

Uploading files to SAS
Advantages if using SAS
Data exploratory through SAS and Excel
Distributive analysis through SAS and Excel
Data visualization using SAS and Excel
Diagnostic analytics using SAS and Excel

Correlation
ANOVA
T-tests

Case Studies for all the concepts

BA using SAS

Monday, June 27, 2016

03 Basic Analytic Techniques - Part 2