Statistical Modeling For Data Analysis


Data Analysis

GROUP TWO
Cont.….
What is data analysis?
Data analysis is the process of collecting, cleaning, sorting, and processing raw data to extract relevant and valuable information that helps businesses. An in-depth understanding of data can improve customer experience, retention, and targeting, reduce operational costs, and strengthen problem-solving methods.
Cont…
Data analysis embraces a whole range of activities of both the qualitative and quantitative type.
There is a usual tendency in behavioral research to make much use of quantitative analysis and to employ statistical methods and techniques.
Statistical methods and techniques hold a special position in research because they provide answers to the problems.
Statistical Modeling

A simple example of data analysis can be seen whenever we make a decision in our daily lives by evaluating what has happened in the past or what will happen if we make that decision. Basically, this is the process of analyzing the past or future and making a decision based on that analysis.
Statistical modeling is the process of applying statistical
analysis to a dataset. A statistical model is a mathematical
representation (or mathematical model) of observed data.
When data analysts apply various statistical models to the
data they are investigating, they are able to understand and
interpret the information more strategically.
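As a minimal sketch of what this means in practice (assuming Python with NumPy and SciPy, two of the tools listed later in this deck; the sales figures are made up for illustration), fitting a normal distribution to observed data produces a simple statistical model of those data:

import numpy as np
from scipy import stats

# Hypothetical observed data: ten daily sales figures (illustrative values only)
observed = np.array([102, 98, 107, 95, 110, 101, 99, 104, 96, 103])

# Fitting a normal distribution gives a mathematical representation of the data:
# the fitted mean and standard deviation are the model's parameters
mu, sigma = stats.norm.fit(observed)
print(f"Fitted model: Normal(mu={mu:.1f}, sigma={sigma:.1f})")

# The model can then be used to interpret the data, e.g. estimating
# how likely a value above 110 would be under this model
print("P(value > 110):", 1 - stats.norm.cdf(110, loc=mu, scale=sigma))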
CONT…
This practice allows them to:
 identify relationships between variables,
 make predictions about future sets of data, and
 visualize that data so that non-analysts and stakeholders can consume and leverage it.
Reasons to Learn Statistical Modeling
1. You will be better equipped to choose the right model for your
needs.
There are many different types of statistical models, and an effective data analyst needs to have a comprehensive understanding of them all.
In each scenario, you should be able to identify not only which model
will help best answer the question at hand, but also which model is
most appropriate for the data you’re working with.
2. You will be better able to prepare your data for analysis.
Data is rarely ready for analysis in its raw form. To ensure your analysis is accurate and viable, the data must first be cleaned up. This
cleanup often includes organizing the gathered information and
removing “bad or incomplete data” from the sample.
3. You will become a better communicator.
In most organizations, data analysts are required to communicate their findings to an audience.
Important Statistical Techniques in Data Analysis
Before any statistical model can be created, an analyst needs to collect or fetch the data, which may be housed in:
 a database,
 cloud storage,
 social media, or
 a plain Excel file.
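For instance, a minimal Python sketch of fetching data with pandas (the file names below are hypothetical; reading Excel files additionally requires the openpyxl package):

import pandas as pd

# Load data exported from a database or collected as a CSV file (hypothetical file name)
orders = pd.read_csv("orders_export.csv")

# Load data stored in a plain Excel file (hypothetical file name)
survey = pd.read_excel("survey_responses.xlsx")

print(orders.head())
print(survey.head())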
Data Analysis Process
 Data Collection: Guided by your identified requirements, it’s time to collect the data from your sources. Sources include case studies, surveys, interviews, questionnaires, direct observation, and focus groups. Make sure to organize the collected data for analysis.
 Data Cleaning: Not all of the data you collect will be useful, so it’s time to clean it up. This process is where you remove white spaces, duplicate records, and basic errors (see the sketch after this list). Data cleaning is mandatory before sending the information on for analysis.
 Data Analysis: Here is where you use data analysis software and other tools to help you interpret and understand the data and arrive at conclusions. Data analysis tools include Excel, Python, R, Looker, RapidMiner, Chartio, Metabase, Redash, and Microsoft Power BI.
 Data Interpretation: Now that you have your results, you need to interpret them and come up with the best courses of action based on your findings.
 Data Visualization: Data visualization is a fancy way of saying, “graphically show your information in a way that people can read and understand it.” You can use charts, graphs, maps, bullet points, or a host of other methods. Visualization helps you derive valuable insights by helping you compare datasets and observe relationships.
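A brief, hedged sketch of the Data Cleaning step in Python with pandas (the column names and values are invented for illustration):

import pandas as pd

# Hypothetical raw survey data showing the problems described above
raw = pd.DataFrame({
    "name": ["  Alice ", "Bob", "Bob", None],
    "score": [88, 92, 92, 75],
})

cleaned = (
    raw
    .dropna()                                      # drop incomplete records
    .drop_duplicates()                             # remove duplicate records
    .assign(name=lambda d: d["name"].str.strip())  # remove white spaces
)
print(cleaned)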
Types of Data Analysis
Ideally, analysts find patterns similar to those that existed in the past and, consequently, apply those earlier solutions to resolve present challenges.
Predictive Analysis: Predictive analysis answers the question,
“What is most likely to happen?” By using patterns found in
older data as well as current events, analysts predict future
events. While there’s no such thing as 100 percent accurate
forecasting, the odds improve if the analysts have plenty of
detailed information and the discipline to research it thoroughly.
Prescriptive Analysis: Mix all the insights gained from the other
data analysis types, and you have prescriptive analysis.
Sometimes, an issue can’t be solved solely with one analysis
type, and instead requires multiple insights.
CONT…

Statistical Analysis: Statistical analysis answers the question, “What happened?” This analysis covers data collection, analysis, modeling, interpretation, and presentation using dashboards. Statistical analysis breaks down into two sub-categories:
Descriptive: Descriptive analysis works with either complete or selections
of summarized numerical data. It illustrates means and deviations in
continuous data and percentages and frequencies in categorical data.
Inferential: Inferential analysis works with samples derived from complete
data. An analyst can arrive at different conclusions from the same
comprehensive data set just by choosing different samplings.
Text Analysis: Also called “data mining,” text analysis uses
databases and data mining tools to discover patterns residing in large
datasets. It transforms raw data into useful business information. Text
analysis is arguably the most straightforward and the most direct
method of data analysis.
Cont.….
There are two general classes of significance tests: parametric
and nonparametric.
 Parametric tests are more powerful because their data are derived from interval and ratio measurements.
 Non-parametric tests are used to test hypotheses with nominal and ordinal data.
Parametric techniques are the tests of choice if their assumptions are met.
Testing for statistical significance follows a relatively well-defined pattern, although authors differ in the number and sequence of the steps they describe.
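As an illustration of the two classes (assuming Python with SciPy; the sample values are made up), the same two groups can be compared with a parametric t-test and its nonparametric counterpart, the Mann-Whitney U test:

from scipy import stats

# Hypothetical interval-scale measurements from two groups
group_a = [4.1, 3.8, 4.5, 4.0, 3.9, 4.3, 4.2]
group_b = [3.5, 3.6, 3.9, 3.4, 3.7, 3.8, 3.6]

# Parametric test: assumes interval/ratio data that are approximately normal
t_stat, t_p = stats.ttest_ind(group_a, group_b)

# Nonparametric alternative: uses only the ranks (ordering) of the values
u_stat, u_p = stats.mannwhitneyu(group_a, group_b, alternative="two-sided")

print(f"t-test:       statistic={t_stat:.2f}, p={t_p:.4f}")
print(f"Mann-Whitney: statistic={u_stat:.1f}, p={u_p:.4f}")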
Data Analysis Methods
Qualitative Data Analysis: The qualitative data analysis
method derives data via words, symbols, pictures, and
observations. This method doesn’t use statistics. The most
common qualitative methods include:
Content Analysis, for analyzing behavioral and verbal data.
Narrative Analysis, for working with data culled from
interviews, diaries, surveys.
Grounded Theory, for developing causal explanations of a
given event by studying and extrapolating from one or more
past cases.
CONT….

Quantitative Data Analysis: Statistical data analysis methods collect raw data and process it into numerical data.
Quantitative analysis methods include:
Hypothesis Testing, for assessing the truth of a given hypothesis or theory for a data set or demographic.
Mean, or average, which determines a subject’s overall trend by dividing the sum of a list of numbers by the number of items on the list.
Sample Size Determination, which takes a small sample from a larger group of people and analyzes it. The results gained are considered representative of the entire body.
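A small sketch of Sample Size Determination (assuming Python with SciPy; the standard deviation and margin of error are hypothetical), using the common formula n = (z · σ / E)² for estimating a population mean:

from math import ceil
from scipy import stats

confidence = 0.95       # desired confidence level
sigma = 15.0            # assumed population standard deviation (hypothetical)
margin_of_error = 2.0   # acceptable margin of error (hypothetical)

z = stats.norm.ppf(1 - (1 - confidence) / 2)   # two-sided critical z value (about 1.96)
n = ceil((z * sigma / margin_of_error) ** 2)   # required sample size, rounded up
print("Required sample size:", n)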
Goodness-of-Fit
The term goodness-of-fit refers to a statistical test that
determines how well sample data fits a distribution from a
population with a normal distribution. Put simply, it
hypothesizes whether a sample is skewed or represents the
data you would expect to find in the actual population.
A goodness-of-fit test is a statistical technique. It is applied to measure how well the actual (observed) data points fit a statistical or Machine Learning model.
It summarizes the divergence between actual observed data points and expected data points in the context of a statistical or Machine Learning model.
The Most Common Goodness Of Fit Tests
Broadly, goodness-of-fit tests can be categorized based on the distribution of the predictand variable of the dataset.
The chi-square
The chi-square goodness-of-fit test is conducted when the predictand variable in the dataset is categorical. It is applied to determine whether sample data are consistent with a hypothesized distribution.
Kolmogorov-Smirnov
The Kolmogorov-Smirnov Goodness of Fit Test (K-S test) compares the dataset under consideration with a known distribution and lets us know if they have the same distribution. It’s also used to check the assumption of normality in Analysis of Variance.
Anderson-Darling
The Anderson-Darling test compares the fit of an observed cumulative distribution function to an expected cumulative distribution function. This test gives more weight to the tails than the Kolmogorov-Smirnov test.
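All three tests are available in SciPy; a minimal sketch with made-up data (assuming scipy and numpy are installed):

import numpy as np
from scipy import stats

# Chi-square: do observed category counts match the hypothesized (expected) counts?
observed_counts = [18, 22, 20, 40]
expected_counts = [25, 25, 25, 25]
chi2, chi2_p = stats.chisquare(f_obs=observed_counts, f_exp=expected_counts)

# Kolmogorov-Smirnov: does a continuous sample follow a standard normal distribution?
sample = np.random.default_rng(0).normal(loc=0, scale=1, size=200)
ks_stat, ks_p = stats.kstest(sample, "norm")

# Anderson-Darling: like K-S, but gives more weight to the tails
ad = stats.anderson(sample, dist="norm")

print(f"Chi-square: stat={chi2:.2f}, p={chi2_p:.3f}")
print(f"K-S:        stat={ks_stat:.3f}, p={ks_p:.3f}")
print(f"A-D:        stat={ad.statistic:.3f}, critical values={ad.critical_values}")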
Descriptive Analysis

Descriptive analysis is a sort of data research that aids in describing, demonstrating, or helpfully summarizing data points so that patterns may emerge that satisfy all of the conditions of the data.
It is the technique of identifying patterns and links by utilizing recent and historical data. Because it identifies patterns and associations without going any further, it is frequently referred to as the most basic form of data analysis.
Types of Descriptive Analysis
 Descriptive analysis can be categorized as one of four
types. They are measures of frequency, central tendency,
dispersion or variation, and position.
 Measures of Frequency
In descriptive analysis, it’s essential to know how
frequently a certain event or response occurs. This is the
purpose of measures of frequency, like a count or percent.
For example, consider a survey where 1,000 participants
are asked about their favorite ice cream flavor. A list of
1,000 responses would be difficult to consume, but the
data can be made much more accessible by measuring
how many times a certain flavor was selected.
CONT

 Measures of Central Tendency
In descriptive analysis, it’s also worth knowing the central (or average) event or response. Common measures of central tendency include the mean, median, and mode.
 Measures of Dispersion
Sometimes, it may be worth knowing how data is distributed
across a range.
 Measures of Position
Last of all, descriptive analysis can involve identifying the
position of one event or response in relation to others. This is
where measures like percentiles and quartiles can be used.
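The four kinds of measures can be computed directly; a short sketch in Python with NumPy and pandas (the flavor responses and scores are invented):

import numpy as np
import pandas as pd

flavors = pd.Series(["vanilla", "chocolate", "vanilla", "strawberry", "chocolate", "vanilla"])
scores = pd.Series([70, 85, 85, 90, 60, 75, 95, 80])

# Measures of frequency: how often each flavor was selected
print(flavors.value_counts())

# Measures of central tendency: mean, median, mode
print("mean:", scores.mean(), "median:", scores.median(), "mode:", scores.mode().tolist())

# Measures of dispersion: how spread out the data is
print("std:", scores.std(), "range:", scores.max() - scores.min())

# Measures of position: percentiles/quartiles
print("25th/50th/75th percentiles:", np.percentile(scores, [25, 50, 75]))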
How to Do Descriptive Analysis
Like many types of data analysis, descriptive analysis can be quite open-ended.
In other words, it’s up to you what you want to look for in your analysis. With that
said, the process of descriptive analysis usually consists of the same few steps.
A. Collect data
The first step in any type of data analysis is to collect the data. This can be done in a variety of ways, but surveys and good old-fashioned measurements are often used.
B. Clean data
Another important step in descriptive and other types of data analysis is to clean the
data. This is because data may be formatted in inaccessible ways, which will make it
difficult to manipulate with statistics. Cleaning data may involve changing its textual
format, categorizing it, and/or removing outliers.
C. Apply methods
Finally, descriptive analysis involves applying the chosen statistical methods so as to
draw the desired conclusions. What methods you choose will depend on the data you
are dealing with and what you are looking to determine. If in doubt, review the four
types of descriptive analysis methods explained above.
Inferential Statistics
Inferential statistics uses statistical techniques to extrapolate
information from a smaller sample to make predictions and
draw conclusions about a larger population.
It uses probability theory and statistical models to estimate
population parameters and test population hypotheses based
on sample data. The main goal of inferential statistics is to
provide information about the whole population using
sample data to make the conclusions drawn as accurate and
reliable as possible.
There are two primary uses for inferential statistics:
Providing population estimations.
Testing theories to make conclusions about populations.
Types of Inferential Statistics
Inferential statistics are divided into two categories:

A. Hypothesis Testing
Testing hypotheses and drawing generalizations about the population from the sample data are examples of inferential statistics. This requires creating a null hypothesis and an alternative hypothesis, then performing a statistical test of significance.
A hypothesis test can have a left-, right-, or two-tailed distribution. The test statistic’s value, the critical value, and the confidence intervals are used to draw a conclusion. Below are a few significant hypothesis tests that are employed in inferential statistics.
CONT…

Z Test:
When data has a normal distribution and a sample size of at least 30, the z
test is applied to the data. When the population variance is known, it
determines if the sample and population means are equal. The following
setup can be used to test the right-tailed hypothesis:
Null Hypothesis: H0: μ=μ0
Alternate hypothesis: H1: μ>μ0
Test Statistic: Z Test = (x̄ – μ) / (σ / √n)
where,
x̄ = sample mean
μ = population mean
σ = standard deviation of the population
n = sample size
Decision Criteria: If the z statistic > z critical value, reject the null
hypothesis.
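Plugging the slide’s formula and decision criteria into Python (SciPy assumed; the sample numbers are hypothetical) for this right-tailed test:

from math import sqrt
from scipy import stats

x_bar = 52.0    # sample mean (hypothetical)
mu_0 = 50.0     # hypothesized population mean
sigma = 6.0     # known population standard deviation (hypothetical)
n = 36          # sample size (at least 30, as required above)
alpha = 0.05

z = (x_bar - mu_0) / (sigma / sqrt(n))    # Z Test = (x̄ – μ) / (σ / √n) under H0: μ = μ0
z_critical = stats.norm.ppf(1 - alpha)    # right-tailed critical value

# Decision criteria: reject H0 if the z statistic exceeds the critical value
print(f"z = {z:.2f}, critical value = {z_critical:.2f}")
print("Reject H0" if z > z_critical else "Fail to reject H0")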
B. Regression Analysis
Regression analysis is done to calculate how one variable
will change in relation to another. Numerous regression
models can be used, including simple linear, multiple
linear, nominal, logistic, and ordinal regression.

In inferential statistics, linear regression is the most often employed type of regression. The dependent variable’s response to a unit change in the independent variable is examined through linear regression. These are a few crucial equations for regression analysis using inferential statistics:
CONT….

Regression Coefficients:
The straight line equation is given as y = α + βx, where α
and β are regression coefficients.
β = Σ (xᵢ − x̄)(yᵢ − ȳ) / Σ (xᵢ − x̄)², with the sums running over i = 1 to n
β = rxy · σy / σx
α = ȳ − β x̄
Here, x̄ is the mean and σx is the standard deviation of the first data set (x). Similarly, ȳ is the mean and σy is the standard deviation of the second data set (y).
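The coefficient formulas above can be computed directly; a minimal NumPy sketch with made-up x and y values:

import numpy as np

# Hypothetical paired observations
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

x_bar, y_bar = x.mean(), y.mean()

# β = Σ(xi − x̄)(yi − ȳ) / Σ(xi − x̄)²
beta = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)

# α = ȳ − β·x̄
alpha = y_bar - beta * x_bar

print(f"Fitted line: y = {alpha:.2f} + {beta:.2f}x")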
