Objectives
- Understand what analytics is and how it differs from analysis
- Know the popular tools used in analytics
- Understand the role of a data scientist
- Know the processes involved in analytics
- Define a problem statement
- Collect and summarize data
- Detect and treat outliers in the data
Analytics versus Analysis
Analytics
Analytics is the science of analysis, in which statistics, data mining, computer technology, etc. are used to perform the analysis.
Analysis
Analysis is the process of breaking down a complex object into its simpler forms
What is Analytics?
- It is the science of deriving meaningful results from given data using various methods and technologies.
- It aims to discover patterns of variation in the given data.
- It helps in understanding the future from past data, along with the uncertainty related to the business.
- It is a sophisticated process that uses statistical, mathematical and economic models to predict the future and prescribe strategies.
How analytics works
Gather Data ==> Organize Data ==> Analyse Data
Analytics Stages
Information Stage
- Descriptive: What was the wear rate of MRF tyres over the last 8 months?
Insight Stages
- Diagnostic: Why has the wear rate increased over the last 8 months?
- Predictive: What kinds of issues (such as reduced mileage) are MRF tyres most likely to face if the problem is not addressed now?
Decision Stage
- Prescriptive: What should MRF Tyres concentrate on to reduce the overall effect?
Popular Tools:
- R
- Revolution R
- RStudio
- Tableau
- SAP HANA
- Weka
- KXEN
- SAS
Role of a Data Scientist
- Be inquisitive: examine data closely and spot trends.
- Uncover stories hidden in the data that lead to more useful insights and help solve business problems.
- Work in sync with application developers to get relevant data for analysis.
- Make an analytical plan in such a way that the results satisfy the business needs.
- Come up with an effective data mining architecture and prepare suitable models.
- Respond to and resolve data mining performance issues.
- Generate reports that are affordable from a business perspective.
Data Analytics Methodology
Discovery ==> Data Preparation ==> Model Planning ==> Model Building ==> Deliver Results ==> Put into Use
Problem Definition
- What is the problem?
- What is it not?
- We have this problem because?
- We don't have a solution because?
Techniques involved in defining a problem
- State the problem in a general way
- Understand the nature of the problem
- Survey the available literature
- Hold discussions to develop ideas
- Rephrase the research problem into a working proposition
Types of Data
Data can be of two types – qualitative and quantitative
Qualitative Data
- Data expressed as groups or categories
- Descriptive data
- E.g. Dividing a population into high, medium and low height groups
Quantitative Data
- Data expressed as numbers
- Definitive data
- E.g. The height of a person
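The two examples above are related: a quantitative measurement can be turned into a qualitative category by grouping it. The following is a minimal Python sketch of that idea. The function name `height_group`, the cut-off values (160 cm and 175 cm) and the sample heights are all assumptions made for illustration; they do not come from the text.

```python
def height_group(height_cm, low_cutoff=160.0, high_cutoff=175.0):
    """Map a quantitative height (in cm) to a qualitative group.

    The cut-offs are hypothetical; pick values that suit the population.
    """
    if height_cm < low_cutoff:
        return "low"
    if height_cm < high_cutoff:
        return "medium"
    return "high"


# Quantitative data: heights in centimetres (made-up sample values).
heights_cm = [152.0, 168.5, 181.0, 174.9, 160.0]

# Qualitative data: the same observations expressed as categories.
groups = [height_group(h) for h in heights_cm]
print(groups)
```

Note that the grouping loses information: the category can be derived from the number, but not the other way round.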
Summarizing Data
- Summarizing is the process of converting huge amounts of raw data into a format that can be easily analyzed.
- Summaries differ based on the type of data, and can be descriptive or graphical.
| Batsman | Frequency of not outs |
| Sachin | 11 |
| Sehwag | 2 |
| Dravid | 36 |
| Dhoni | 32 |
| Virat | 7 |
Summarizing Data
Numeric - Descriptive
- Mean
- Median
- Mode
Categorical - Descriptive
- Frequency distribution tables
Numeric - Graphical
- Box plot
- Histograms
Categorical - Graphical
- Bar charts
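The descriptive summaries listed here can be sketched with Python's standard library. This is a minimal illustration, not a prescribed method: the not-out counts from the batsman table are treated as a small numeric sample, while the `shirt_sizes` list is made-up categorical data added only so that the mode and the frequency table have repeated values to work with.

```python
import statistics
from collections import Counter

# Numeric sample: the "not out" counts from the batsman table.
not_outs = {"Sachin": 11, "Sehwag": 2, "Dravid": 36, "Dhoni": 32, "Virat": 7}
values = list(not_outs.values())

mean_value = statistics.mean(values)      # arithmetic average
median_value = statistics.median(values)  # middle value once sorted

# Hypothetical categorical sample (not from the text).
shirt_sizes = ["M", "L", "M", "S", "M", "L"]
mode_value = statistics.mode(shirt_sizes)  # most frequent category

# Frequency distribution table for the categorical sample.
frequency_table = Counter(shirt_sizes)

print("mean:", mean_value, "median:", median_value, "mode:", mode_value)

# A rough text-only bar chart, one row per category.
for category, count in frequency_table.most_common():
    print(f"{category:<2} {'#' * count} ({count})")
```

The mean (17.6) sits well above the median (11) because the two large counts pull the average up, which is one reason to report both.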
Data Collection
- Process of collecting relevant data that aids in solving the problem statement
- The data collection process needs to be well defined and systematic.
- Observations need to be recorded and organized for optimal usefulness
- Collect Relevant Data
- Categorize the Data
- Organize the Data
Data Collection Methods
- Observation
- Experiment
- Census
- Questionnaire
- Survey
- Reporting
- Registration
Data Sources
- Data collection methods fall broadly into two categories – primary and secondary.
- Primary methods gather the data directly by investigating, experimenting on or observing various entities.
- Secondary methods use data that was gathered before the study and is available as published facts and reports.
Data Dictionary
- A Data Dictionary is a file that describes the structure of the database itself.
- It includes details such as:
  - Number of records
  - Name of each field
  - Characteristics of each field
  - Description of each field
  - Relationships between different fields
- It helps in analyzing the different data variables and the relationships among them.
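As an illustration of what such a file might contain, the sketch below builds a very small data dictionary from a list of records. It is a minimal Python example under stated assumptions: the helper `build_data_dictionary`, the sample records (including the deliberately missing value) and the field descriptions are all hypothetical, and relationships between fields are left out because they normally have to be documented by hand.

```python
def build_data_dictionary(records, descriptions):
    """Summarize a list of dict records into a simple data dictionary."""
    field_names = sorted({name for record in records for name in record})
    fields = {}
    for name in field_names:
        # Values that are actually present (not missing) for this field.
        present = [r[name] for r in records if r.get(name) is not None]
        fields[name] = {
            "types": sorted({type(v).__name__ for v in present}),
            "missing": len(records) - len(present),
            "description": descriptions.get(name, ""),
        }
    return {"number_of_records": len(records), "fields": fields}


# Hypothetical records; the None marks a deliberately missing value.
records = [
    {"player": "Sachin", "not_outs": 11},
    {"player": "Sehwag", "not_outs": 2},
    {"player": "Dravid", "not_outs": None},
]
descriptions = {
    "player": "Batsman name",
    "not_outs": "Number of innings finished not out",
}

data_dictionary = build_data_dictionary(records, descriptions)
print("records:", data_dictionary["number_of_records"])
for name, info in data_dictionary["fields"].items():
    print(name, info)
```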
Outlier Treatment
- An outlier is a point or observation that deviates significantly from the other observations.
- Outliers arise from experimental errors or “special circumstances”.
- Outlier detection tests are used to check for outliers.
- Outlier treatment options:
  - Retention
  - Exclusion
  - Other treatment methods
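The text does not name a specific detection test, so the sketch below assumes one common choice, the interquartile range (IQR) rule: a value is flagged when it lies more than 1.5 × IQR below the first quartile or above the third quartile. The function names, the multiplier `k` and the sample readings are illustrative assumptions. Capping values at the fences is shown as one possible "other treatment" and is likewise an assumption, not a method taken from the text.

```python
import statistics


def iqr_fences(values, k=1.5):
    """Return the (low, high) fences of the IQR rule."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def split_outliers(values, k=1.5):
    """Separate values inside the fences from those outside them."""
    low, high = iqr_fences(values, k)
    kept = [v for v in values if low <= v <= high]
    outliers = [v for v in values if v < low or v > high]
    return kept, outliers


# Hypothetical readings with one suspicious value.
readings = [10, 12, 11, 13, 12, 11, 95]

# Detection, followed by exclusion: work with `kept` only.
kept, outliers = split_outliers(readings)

# Retention: keep `readings` unchanged and just report `outliers`.

# One other treatment (assumed here): cap values at the fences.
low, high = iqr_fences(readings)
capped = [min(max(v, low), high) for v in readings]

print("outliers:", outliers)
print("after exclusion:", kept)
print("after capping:", capped)
```

Which treatment is appropriate depends on why the outlier occurred: a recording error is a candidate for exclusion, whereas a genuine but unusual observation is often worth retaining.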
Summary
- What analytics and analysis are, and how they differ
- Popular tools used in analytics
- What a data scientist does
- The processes involved in the analytics life cycle
- How to formally define a problem statement
- Methods of collecting and summarizing data for analytics
- Data dictionary and its contents
- What outliers are, and how to detect and treat them
