Data and Analysis
Introduction to Data Analysis
Data analysis is the process of examining data to find useful insights (information) and help in decision-making.
• It involves cleaning the data, understanding it, and summarizing it.
• The goal is to make predictions or conclusions about real-world things.
• Data analysis helps reduce risk by giving us facts and figures. It is often shown using charts, tables, or
graphs.
Example: In our daily life, when we make decisions based on past experiences or on what we think will happen in the future, that's a simple form of data analysis.
4.1 What is Statistical Modeling?
Statistical modeling uses math and statistics to show the relationship between different variables and predict
outcomes.
Variables: A variable is any characteristic, number, or quantity that can be measured or counted. A variable may
also be called a data item. In an experiment, we might have two variables:
• Dependent variable • Independent variable
Example: If you collect data about students' height and weight, you may find that weight depends on height. A formula like y = 5x + 2 can describe this relationship.
4.1.1 Use Cases for Statistical Modeling
Use cases are real-world problems or tasks that can be solved using data analysis.
Companies use data science to solve different problems related to their business.
Steps for solving a use case:
• Planning: Identifying clear goals, resources, risks, and key performance indicators (KPIs) for success.
• Approaches: Common solutions include forecasting, classification, pattern and anomaly detection,
recommendations, and image recognition.
Common examples of use cases:
• Forecasting (predicting future trends).
• Fraud detection (identifying scam calls, messages, emails, etc.).
• Classifying data (grouping things into categories).
• Detecting patterns or unusual things (anomalies).
• Making recommendations (like what products to suggest to customers).
• Image recognition (finding objects in pictures).
4.1.2 How to Solve a Data Science Case Study?
When solving a data science case study, the approach can vary depending on the company’s goals and the nature
of the problem. However, here is a general roadmap you can follow for any data science case study:
1. Formulating the Right Question: Start by understanding the problem clearly. Review any
existing research or information related to the case study. The key is to ask the right
question that will guide your analysis.
2. Data Collection: Gather all the data you need. This can involve collecting new data or
using existing data sources. For example, you may pull data from surveys, databases, or
public records.
3. Data Wrangling: Clean and organize the data. This step involves fixing errors, removing
duplicates, filling in missing data, and transforming the raw data into a more useful
format.
4. Data Analysis and Modeling: Analyze the cleaned data and use it to create a
predictive model. This model helps you understand trends and make predictions.
Different statistical or machine learning models can be used depending on the type of
problem.
5. Result Communication: Share your findings with relevant people like managers,
shareholders, or anyone who needs the insights. It's important to clearly explain the
conclusions and how the data supports them.
Real-life Examples of Case Studies:
Production Goods and Services: Analyzing production data can help improve the quality of products.
Stock Market Data Analysis: Predictive analysis of stock market data helps investors make better decisions.
Weather Forecasting: Analyzing weather data is crucial for aviation, agriculture, and planning daily activities.
Medical Records: Analyzing patient data helps doctors diagnose diseases and conduct medical research.
Sales Tracking: Sales data analysis helps businesses plan strategies to avoid losses and increase profits.
Population Record: Governments use population data to plan and distribute resources effectively.
Educational Data: Analysis of student and teacher data helps improve education systems.
Natural Disaster Prediction: Data analysis can predict disasters like earthquakes or floods, helping people
take precautions.
Pandemic Analysis: Analyzing health data helps control pandemics and take timely preventive actions.
Example - Weather Forecasting Case Study: Weather conditions play a significant role in our daily life, from dressing and travelling to planning activities and events. Unfavorable weather conditions can cause damage to life and property. Let's walk through a simple case study on predicting the weather using data science:
1. Formulating the Right Question: Analyze existing weather data to predict future conditions. For example,
what weather patterns signal a rainy day or a sunny day?
2. Data Collection: Collect data about temperature, rainfall, wind speed, and atmospheric pressure. You could
use instruments like a thermometer, rain gauge, and barometer, or you could get data from a meteorological
department.
3. Data Wrangling: Clean the data by removing any errors, filling in missing information, and organizing it for
analysis.
4. Data Analysis and Modeling: Use the cleaned data to build a statistical model that can predict the
weather based on the input variables (temperature, rainfall, etc.). This model can then forecast the likelihood
of future weather events.
5. Result Communication: Share the results of your analysis with others. For example, after training your
model with past weather data, it can predict tomorrow’s weather based on current conditions.
4.1.3 Statistical Modeling Techniques:
Statistical modeling relies on data, which can come from various sources like spreadsheets, databases, or the
cloud.
There are two main types of statistical modeling methods: supervised learning and unsupervised learning.
Supervised Learning: In supervised learning, the algorithm
learns from a labeled dataset, where each data point is paired
with the correct output (label). This helps the model learn
patterns in the data and predict results for new, unlabeled data.
Example: If you have a dataset of fruits and vegetables labeled
as "Fruit" or "Not Fruit," the model learns the difference. Then,
when given a new item without a label, it can classify it correctly
as either a fruit or not.
Supervised learning techniques include:
a) Regression Model:
Regression is a statistical approach to understand how one variable (like salary or
house price) depends on other variables (like years of experience or house size).
It helps us find a pattern in the continuous data so we can make predictions or
understand relationships better.
Examples: Regression is used in predicting things like house prices, sales trends, or even the weather.
Linear Regression: It is the simplest form of regression, which draws a straight line through the data points to
make the best predictions.
The formula for linear regression is y = mx + b, where:
m is the slope (how steep the line is).
b is the y-intercept (where the line crosses the y-axis).
Independent Variable (x): A variable that is adjusted or modified to see how it affects the dependent
variable. Its variation does not depend on other variables in the experiment.
Dependent Variable (y): A variable that depends on another for its value. It is tested and measured in an
experiment.
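For illustration, here is a minimal sketch (not from the textbook) that fits a straight line y = mx + b with NumPy; the experience and salary numbers are made up.

import numpy as np

# Hypothetical data: years of experience (x) and salary in thousands (y)
x = np.array([1, 2, 3, 4, 5, 6])
y = np.array([32, 36, 41, 44, 50, 53])

# Fit a straight line y = m*x + b to the points
m, b = np.polyfit(x, y, deg=1)
print(f"slope m = {m:.2f}, intercept b = {b:.2f}")

# Predict the salary for 8 years of experience
print("predicted salary:", m * 8 + b)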
b) Classification Model: Classification is the process of categorizing data into predefined classes or categories based on their features or attributes. It is used to predict discrete values (e.g., Yes/No answers).
Example: Predicting whether an employee will get a salary
raise (Yes/No). If you're predicting the exact amount of the
raise, it becomes a regression problem.
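As a rough illustration, the following sketch trains a small classification model with scikit-learn (a library not covered in this chapter); the feature values and Yes/No labels are invented.

from sklearn.tree import DecisionTreeClassifier

# Hypothetical features: [years of service, performance score]
X = [[1, 60], [2, 85], [5, 90], [3, 55], [7, 95], [4, 70]]
# Labels: 1 = got a raise (Yes), 0 = no raise (No) -- made-up data
y = [0, 1, 1, 0, 1, 0]

model = DecisionTreeClassifier()
model.fit(X, y)

# Predict Yes/No for a new employee with 6 years of service and a score of 80
print(model.predict([[6, 80]]))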
Unsupervised Learning:
Unsupervised learning is a type of machine learning in which models are trained on an unlabeled dataset and must find patterns or group the data on their own.
a) Clustering Algorithms: These are methods of grouping objects into clusters so that objects with the most similarities stay in the same group and have few or no similarities with objects in other groups.
K-Means clustering is a popular clustering algorithm used in
unsupervised learning.
Example: A telecom company can group its customers based on
call duration and internet usage, then offer tailored packages (long
call durations, heavy internet usage, etc.).
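A minimal K-Means sketch using scikit-learn is shown below (the chapter names the algorithm but not a library, so this choice is an assumption); the call and data-usage figures are made up.

from sklearn.cluster import KMeans
import numpy as np

# Hypothetical customers: [average call minutes per day, GB of data per month]
customers = np.array([
    [5, 1], [8, 2], [6, 1.5],       # light users
    [60, 3], [55, 4], [70, 2],      # heavy callers
    [10, 30], [12, 25], [8, 28],    # heavy internet users
])

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(customers)

print("cluster of each customer:", labels)
print("cluster centres:")
print(kmeans.cluster_centers_)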
b) Association Rules: Find relationships between items in data.
Example: If a customer buys bread, the system might predict they
will also buy milk because there is a common association between
those items.
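Dedicated libraries exist for mining association rules, but the basic idea can be sketched in plain Python by counting how often two items appear in the same basket; the transactions below are made up.

# Hypothetical shopping baskets
transactions = [
    {"bread", "milk", "eggs"},
    {"bread", "milk"},
    {"bread", "butter"},
    {"milk", "cereal"},
    {"bread", "milk", "butter"},
]

bread_baskets = [t for t in transactions if "bread" in t]
both = [t for t in bread_baskets if "milk" in t]

# Confidence of the rule "bread -> milk"
confidence = len(both) / len(bread_baskets)
print(f"P(milk | bread) = {confidence:.2f}")  # 3 of the 4 bread baskets also contain milk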
4.1.4 Build a Statistical Model Using Python
Various tools and methods can be used to build a statistical model for predictive analysis. A statistical model helps in predicting outcomes based on data patterns, and Python provides easy tools to implement this.
Tools for Building Statistical Models
There are several tools available for building statistical models, such as:
• MS-Excel • Weka • R Studio • Python
Among these, Python is widely used because of its simplicity and powerful libraries like NumPy, Pandas, and
Matplotlib.
Datasets for Statistical Models
Many datasets can be downloaded from websites like:
o Kaggle (https://www.kaggle.com) o GitHub (https://github.com)
However, instead of downloading, you can create your own dataset by generating random numbers using
Python.
Example: Create a Linear Regression Model in Python
Here, we will generate random data and build a linear regression model using the equation: y = mx + c
Where:
• m is the slope,
• c is the intercept,
• x is a randomly generated number,
• y is calculated based on x using the equation y = 3x + 4.
Steps to Build the Model
1. Generate Random Data: We’ll generate random x values and use them to calculate y.
2. Plot the Data: We’ll use Matplotlib to plot the values of x and y on a graph.
3. Fit a Linear Regression Line: We’ll draw a line that best fits the data points.
Writing Python Code:
• We can write Python code in any IDE, or we can use an online environment like Google Colab. Here's how to do it:
1. Open your browser and go to Google Colab.
2. Click the "Open Colab" button on the top-right corner.
3. Write the Python code in Google Colab.
4. Run the code by pressing the Run button or Ctrl + Enter keys.
Python Code to Generate Data and Plot a Graph:
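Below is a minimal sketch that follows the three steps listed above under "Steps to Build the Model": it generates random x values, computes y = 3x + 4 with some random noise, plots the points with Matplotlib, and fits a regression line using NumPy's polyfit.

import numpy as np
import matplotlib.pyplot as plt

# Step 1: generate random data following y = 3x + 4 (plus some noise)
np.random.seed(0)
x = np.random.rand(50) * 10          # 50 random x values between 0 and 10
y = 3 * x + 4 + np.random.randn(50)  # y = 3x + 4 with random noise added

# Step 2: plot the data points
plt.scatter(x, y, label="data points")

# Step 3: fit a linear regression line and draw it
m, c = np.polyfit(x, y, deg=1)
plt.plot(x, m * x + c, color="red", label=f"fit: y = {m:.2f}x + {c:.2f}")

plt.xlabel("x")
plt.ylabel("y")
plt.legend()
plt.show()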
4.2 Experimental Design in Data Science
Experimental design is a structured approach to ensure accurate and reliable results in experiments. It involves
careful planning to gather the right data and conduct the experiment effectively, preventing any incorrect
conclusions.
4.2.1 Experimentation in Data Science as a Tool
Experimental design helps organize and run experiments systematically. It focuses on selecting the right sample
size, designing the experiment correctly, and analyzing results efficiently. This technique is used in various fields
like engineering, psychology, agriculture, and medicine.
Experimental Design Flow in Data Science
The process of experimental design in data science usually follows a series of steps:
1. Identify the Research Question: Clearly define the problem or question the experiment aims to address. This
will guide the experiment throughout.
2. Develop Hypotheses: Form hypotheses to predict relationships between variables involved in the
experiment.
3. Find the Variables: Decide which variables will be independent and which will be dependent.
Example: If you are studying how experience affects salary, the years of experience is the independent
variable and salary is the dependent variable.
4. Determine the Experimental Design: Choose the right experimental design. Some common designs include:
• Factorial Design • Randomized Block Design • Completely Randomized Design
5. Calculate Sample Size: Make sure the sample size is large enough to produce reliable statistical results.
6. Random Assignment and Selection: Randomly assign subjects to different groups to avoid bias and ensure
the results are accurate.
7. Carry Out the Experiment: Carefully follow the experiment plan, collect the data, and stick to the
methodology.
8. Data Analysis: After collecting data, perform statistical analysis to test hypotheses and draw conclusions. Techniques like hypothesis testing, regression analysis, and ANOVA (Analysis of Variance) are used. ANOVA checks for differences between group means (a short Python sketch follows this list).
9. Interpret and Conclude: Based on data analysis, interpret the results. Consider any practical factors that
might affect the outcomes.
10. Discuss and Report: Present the experimental design, methodology, findings, and conclusions clearly. Proper
documentation allows others to reproduce the experiment and get similar results.
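To illustrate step 8, here is a minimal one-way ANOVA sketch using SciPy (a library assumed here, not named in this chapter); the group measurements are invented.

from scipy import stats

# Hypothetical outcomes for three treatment groups
group_a = [12, 14, 11, 13, 15]
group_b = [16, 18, 17, 15, 19]
group_c = [12, 13, 12, 14, 13]

# One-way ANOVA: do the group means differ?
f_stat, p_value = stats.f_oneway(group_a, group_b, group_c)
print(f"F = {f_stat:.2f}, p = {p_value:.4f}")
# A small p-value (for example, below 0.05) suggests at least one group mean differs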
Principles of Experimental Design
The success of an experiment depends on following certain key principles. These principles ensure that results
are accurate and reliable.
1. Principle of Randomization: Subjects or items are divided into groups randomly, ensuring no bias in the
selection process.
Example: When testing a new asthma medicine, divide 100 patients randomly into two groups of 50 each, so that patients with severe asthma and those with milder symptoms are spread across both groups.
2. Principle of Local Control: Establish a control group that doesn’t receive treatment. This allows for
comparison and ensures that differences in results are due to the treatment alone.
Example: One group of asthma patients receives the new medicine, while the control group continues with
their regular treatment.
3. Principle of Blocking: Divide subjects into blocks based on traits (e.g., gender) that may influence results.
This helps eliminate the effect of these traits on the outcome.
Example: Split the asthma patients by gender. Treat 28 women and 22 men with the new medicine and
another 28 women and 22 men with the regular medicine.
4. Principle of Replication: Repeat the experiment multiple times to ensure results are not just coincidental or
random.
Example: Repeat the asthma medicine experiment with different groups or even different demographics. If
the new medicine shows consistent effectiveness, it can be declared more effective.
4.2.2 Correlation and Causation
Correlation refers to a statistical relationship between two variables, meaning they change together. However,
this relationship does not imply that one variable causes the other to change.
Example: If you notice that whenever you send a text message, your phone lags, it might be easy to assume that texting causes the lag. However, the real reason might be the phone's lack of free memory (too many apps running), which is the actual cause of the lag, not the texting.
Causation, on the other hand, means that a change in one variable directly causes a change in another.
Important Note: While causation always implies correlation, not all correlations imply causation. It's essential
to avoid jumping to conclusions when two variables appear to be related.
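A small simulation (made-up numbers) can show how two variables driven by a common third factor are strongly correlated even though neither causes the other; NumPy's corrcoef measures the correlation.

import numpy as np

np.random.seed(1)

# Hidden common cause, e.g. how much memory the phone has free
free_memory = np.random.rand(200)

# Both lag and message delay depend on free memory, not on each other
lag = 10 - 8 * free_memory + np.random.randn(200) * 0.5
message_delay = 5 - 4 * free_memory + np.random.randn(200) * 0.5

# Strong correlation, even though texting does not cause the lag
print(np.corrcoef(lag, message_delay)[0, 1])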
4.2.3 Population and Random Sample
Population: This term refers to the complete set of items, people,
or events that are being studied. For instance, if you’re studying
the average height of adults in a country, the population includes
all adults in that country.
Random Sample: A random sample is a subset of the population
where every member has an equal chance of being selected. It
ensures that the sample represents the population as closely as
possible. By studying a random sample, you can make inferences
about the entire population without studying everyone.
4.2.4 Parameter and Statistic
Parameter: A parameter is a number that describes a characteristic of a population. Since it’s often impossible
to study an entire population, parameters are usually unknown.
Example: The average age of all people in a country.
Statistic: A statistic is a number that describes a characteristic of a sample of the population. Statistics provide
estimates for parameters.
Example: The average age of people in a random sample is a statistic, which helps estimate the population’s
average age.
Mean, median, and mode are different types of averages used to represent typical values in a population. Therefore, we can say that the mean of a population is a parameter, while the mean of a sample is a statistic.
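A quick NumPy illustration with made-up ages: the mean of the whole population is the parameter, while the mean of a random sample is a statistic used to estimate it.

import numpy as np

np.random.seed(2)

# Pretend this is the entire population of ages
population = np.random.randint(18, 80, size=100_000)

# Parameter: the true population mean
print("population mean (parameter):", population.mean())

# Statistic: the mean of a random sample of 500 people
sample = np.random.choice(population, size=500, replace=False)
print("sample mean (statistic):", sample.mean())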
4.2.5 Data Collection Methods
Data collection is essential for conducting meaningful research and drawing accurate conclusions. There are two
main types of data collection methods: primary and secondary.
Primary Data Collection Methods
Primary data refers to data collected firsthand by the researcher. The methods include:
1. Interviews: Directly asking questions to participants. It allows flexibility in questioning.
2. Observations: Observing behaviors or events and recording the findings, either in a controlled or
uncontrolled environment.
Example: Observing how many people walk their pets in a busy street to decide if a pet food store should
be opened in that area.
3. Surveys and Questionnaires: Collecting data from a large group of people through yes/no, multiple-choice,
or open-ended questions.
4. Focus Groups: Similar to interviews but conducted with a group of people sharing common traits or
experiences. This method helps gather diverse opinions but can be time-consuming.
5. Oral Histories: Collecting data by asking participants about their personal experiences regarding a specific
event or phenomenon.
Secondary Data Collection Methods
Secondary data is data collected by someone else, often for a different purpose. It is easier and less expensive
to obtain but may not be as tailored to the current research need. Common methods include:
1. Internet: A quick and accessible way to gather data from various sources. It's important to verify the
authenticity of the data.
2. Government Archives: Official records are reliable but may not always be easily accessible.
3. Libraries: A valuable source for academic research, business directories, and other documented
information.
4.2.6 Real-World Experimentation Examples
Here are some practical examples of how real-world companies use experimentation to make data-driven
decisions:
A/B Testing: It is a popular testing method used by companies to compare two versions of a feature, webpage,
or advertisement to see which performs better.
1. Facebook and Version Testing: Facebook might test two different versions of a feature to determine which
one gets more engagement from users. The version that performs better based on data (like clicks, interactions,
or conversions) is usually implemented.
2. Airbnb and Price Optimization
Airbnb uses data science to help users, such as homeowners and renters, set optimal prices for their properties.
They experiment with different platform features like search algorithms or the booking flow.
Example: When testing a new booking feature, Airbnb would create two versions (A and B) and assign users
randomly to each group. By analyzing user behavior (e.g., booking rates, user satisfaction), they can decide which
version improves the user experience and should be implemented.
3. YouTube and User Engagement
YouTube uses statistical experimentation, particularly A/B testing, to enhance user engagement and content
discovery.
Example: When introducing a new video recommendation algorithm or layout, YouTube might create two
versions (A and B) and assign users randomly to each group. By comparing data like clickthrough rates, watch
time, or user feedback, YouTube can see which version keeps users engaged longer. This process helps them
make data-driven improvements to the platform.
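As a rough sketch of how A/B test results might be compared, the example below applies SciPy's chi-square test to made-up click counts; real companies use far more elaborate pipelines.

from scipy.stats import chi2_contingency

# Hypothetical results: [clicked, did not click] for versions A and B
results = [
    [120, 880],   # version A: 12% click-through
    [150, 850],   # version B: 15% click-through
]

chi2, p_value, dof, expected = chi2_contingency(results)
print(f"p-value = {p_value:.4f}")
# A small p-value suggests the difference between A and B is unlikely to be chance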
4.3 Analyze Pre-existing Datasets to Create Summary Statistics and Data Visuals
When analyzing a dataset, several steps are involved to extract meaningful insights and information. These steps
include:
• Data Exploration: Getting a general understanding of the dataset by exploring its features and structure.
• Data Cleaning: Removing unwanted or erroneous data to gain more meaningful insights.
After cleaning the data, you compute summary statistics to understand the central tendency (the typical or central value of the data) and dispersion (how spread out the data is). This includes calculating the mean, median, mode, count, and frequency.
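A short Pandas sketch of these summary statistics, using a small made-up column of exam scores:

import pandas as pd

# Hypothetical column of exam scores
scores = pd.Series([55, 60, 60, 72, 80, 80, 80, 95])

print("count :", scores.count())
print("mean  :", scores.mean())
print("median:", scores.median())
print("mode  :", scores.mode().tolist())
print("frequency of each value:")
print(scores.value_counts())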
Once these steps are complete, the data is ready for visualization using various tools like bar charts, pie charts,
line graphs, etc., to represent the data visually. These visuals help interpret the data more easily and uncover
patterns, trends, and relationships.
4.3.1 Data Products (Charts, Graphs, Statistics)
A data product is a tool or application that uses data to help businesses improve their decision-making and
processes. Data products automate processes, provide real-time recommendations, or deliver data-driven
services. They rely on data analysis to derive insights that can be used to generate value for organizations.
4.3.2 Data Visualization
Data visualization is the graphical representation of information and data using visual elements like charts,
graphs, and maps. This makes it easier to see and understand trends, outliers, and patterns. Data visualization
tools are crucial in the era of Big Data, helping to analyze large datasets and make data-driven decisions.
Some common data visualization methods include:
• Bar Charts: Used to compare different categories of data.
• Pie Charts: Show the proportional representation of
categories.
• Line Graphs: Display trends over time.
• Histograms: Show the distribution of numerical data.
• Boxplots: Visualize the central tendency and spread of
the data.
4.3.3 Data Analysis through Python
Python offers several libraries to help in data analysis and visualization. These libraries can generate various
charts and graphs for better data interpretation. One example dataset that can be analyzed in Python is the Tips
Dataset.
The Tips Dataset is a record of the tips given by customers in a restaurant over two and a half months in the early 1990s. It is a simple dataset used to practice data analysis in Python. The dataset contains seven columns:
• total_bill: The total amount of the bill.
• tip: The tip amount given.
• gender: Whether the customer is male or female.
• smoker: Whether the customer is a smoker or not.
• day: The day of the week the bill was generated (e.g., Sun, Sat).
• time: The time of the day (Lunch or Dinner).
• size: The number of people at the table.
Sample rows from the dataset:
total_bill  tip   gender  smoker  day  time    size
16.99       1.01  Female  No      Sun  Lunch   2
10.34       1.66  Male    No      Sun  Dinner  3
21.01       3.50  Male    No      Sun  Lunch   3
23.68       3.31  Male    No      Sun  Lunch   2
24.59       3.61  Female  No      Sun  Dinner  4
25.29       4.71  Male    No      Sun  Lunch   4
8.77        2.00  Male    No      Sun  Dinner  2
26.88       3.12  Male    No      Sun  Lunch   4
15.04       1.96  Male    No      Sun  Dinner  2
14.78       3.23  Male    No      Sun  Dinner  2
To start working with the dataset, you'll need to install Python libraries such as Pandas and Matplotlib.
Steps for Analyzing the Dataset:
1. Install Required Libraries: To analyze data in Python, we need to use external libraries. The most important
one is Pandas. Pandas allows you to handle and manipulate datasets efficiently.
To install Pandas, run the following command in your Python environment: pip install pandas
2. Loading the Dataset: The dataset (tips.csv) must be uploaded to Google Drive when using Google Colab for
analysis. To access it in our Colab notebook, follow these steps:
o Upload the file to Google Drive.
o Mount your Google Drive in Colab.
o After mounting, we can read the dataset into a Pandas DataFrame.
3. Data Visualization Using Matplotlib: Once the dataset is loaded into a Pandas DataFrame, various types of plots can be created to visualize the data (a combined sketch covering the loading step and each chart type is given at the end of this section).
One of the most commonly used data visualization libraries in Python is Matplotlib.
Matplotlib:
A low-level data visualization library in Python built on NumPy arrays.
It offers flexibility with various types of plots like scatter plots, line plots, and histograms.
To install Matplotlib, we can run the following command in our terminal: pip install matplotlib
a) Scatter Plot: A scatter plot is used to observe relationships between two variables. Each dot represents
a data point, with its position determined by the values of the variables on the x and y axes.
It's useful for visualizing patterns, trends, and correlations between variables.
The scatter() function from the Matplotlib library is used to create scatter plots.
b) Line Chart: A line chart shows the relationship between two variables on the x and y axes using a continuous
line.
It’s often used to show trends over time or a sequence.
The plot() function is used to create a line chart.
c) Bar Chart: A bar chart represents data categories with rectangular bars, where the height or length of the
bars is proportional to the values they represent.
It’s great for comparing data across categories.
The bar() function is used to create bar charts.
d) Histogram: A histogram displays the distribution of a dataset by grouping data into bins (ranges). The x-axis represents the bin ranges, while the y-axis shows the frequency of data points within those bins, i.e., how often values in each range occur.
The hist() function is used to create histograms.
e) Boxplot: A boxplot, also known as a box-and-whisker plot, summarizes data distribution. It shows the
minimum, first quartile (25th percentile), median, third quartile (75th percentile), and maximum values,
along with any outliers.
The boxplot() function is used to create boxplots.
f) Pie Chart: A pie chart is a circular chart divided into slices to show proportions of a whole.
It’s useful for visualizing percentages or proportions of different categories.
The pie() function is used to create pie charts.
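Putting the steps above together, here is a minimal sketch that assumes the notebook runs in Google Colab and that tips.csv has been uploaded to a hypothetical path MyDrive/tips.csv in Google Drive, with column names matching the sample table shown earlier. It loads the Tips dataset and produces one example of each chart type.

import pandas as pd
import matplotlib.pyplot as plt

# Step 2: mount Google Drive (Colab only) and load the data; adjust the path as needed
from google.colab import drive
drive.mount('/content/drive')
df = pd.read_csv("/content/drive/MyDrive/tips.csv")

print(df.head())        # first few rows
print(df.describe())    # summary statistics of the numeric columns

# Step 3: one example of each chart type
plt.scatter(df["total_bill"], df["tip"])                  # a) scatter plot: bill vs tip
plt.xlabel("total_bill"); plt.ylabel("tip"); plt.show()

plt.plot(df["tip"])                                       # b) line chart: tips in sequence
plt.ylabel("tip"); plt.show()

df.groupby("day")["total_bill"].mean().plot(kind="bar")   # c) bar chart: average bill per day
plt.ylabel("average total_bill"); plt.show()

plt.hist(df["total_bill"], bins=10)                       # d) histogram: distribution of bills
plt.xlabel("total_bill"); plt.show()

plt.boxplot(df["tip"])                                    # e) boxplot: spread of the tips
plt.ylabel("tip"); plt.show()

df["time"].value_counts().plot(kind="pie", autopct="%1.0f%%")  # f) pie chart: Lunch vs Dinner share
plt.ylabel(""); plt.show()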