
MODULE - 2

Course Outcome

At the end of the course the student will be able to:

CO 2. Apply different techniques to Exploratory Data Analysis and the
Data Science Process
Syllabus
• Exploratory Data Analysis and the Data Science Process: Basic tools
(plots, graphs and summary statistics) of EDA, Philosophy of EDA, The
Data Science Process, Case Study: RealDirect (online real estate firm).
Three Basic Machine Learning Algorithms: Linear Regression, k-Nearest
Neighbours (k-NN), k-means.

Textbook: Doing Data Science, Cathy O’Neil and Rachel Schutt, O'Reilly
Media, Inc., 2013

Textbook 1: Chapter 2, Chapter 3


Note:

• Outliers are data points that stand out from the majority of the data
because they are significantly different or unusual compared to the rest.
What is Data Science Modelling?

• Data science modelling is a set of steps from defining the problem to
deploying the model in the real world.

• EDA is one of the important steps in the data modelling process.
EDA - Exploratory Data Analysis

• Exploratory data analysis is one of the basic and essential steps of a
data science project.
• A data scientist spends almost 70% of their work doing EDA on the
dataset.
Example of EDA:

1. Starting Point: You don't have a specific question in mind, like "Are
customers satisfied?" Instead, you're just curious about what the reviews
can tell you.
2. Exploring the Data:
1. Reading Through Reviews: You start by reading some
reviews. You notice some mention quality, some mention price,
and others mention customer service.
2. Creating Word Clouds: You create a word cloud to see which
words appear most frequently. You might find "great",
"expensive", and "helpful" are common.
3. Looking at Ratings: You plot the ratings on a histogram to see
how they are distributed. You might see that most reviews are 4
stars, but there are also a few 1-star reviews.
3. Finding the Unexpected:
1. Surprises: While exploring, you might notice that many
low ratings mention "late delivery," something you didn't
expect to be an issue.
2. Expected Findings: You see lots of mentions of "good
quality," which you expected because your product team
focused on quality.

4. New Insights: Now you realize that, apart from quality, delivery times
are a significant issue that needs addressing.
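The exploration above can be sketched in a few lines of Python. This is a minimal illustration only: the file name reviews.csv and the columns rating and text are assumptions, not part of the original example.

```python
# Hypothetical review exploration: ratings histogram and crude word counts.
import pandas as pd
import matplotlib.pyplot as plt
from collections import Counter

reviews = pd.read_csv("reviews.csv")        # assumed columns: rating, text

# Distribution of star ratings (the "histogram of ratings" step).
reviews["rating"].hist(bins=range(1, 7))
plt.xlabel("Stars")
plt.ylabel("Number of reviews")
plt.show()

# Word frequencies as a rough stand-in for a word cloud.
all_words = " ".join(reviews["text"].str.lower()).split()
print(Counter(all_words).most_common(20))

# What do the low-rated reviews mention? (the "finding the unexpected" step)
low = reviews[reviews["rating"] <= 2]
low_words = " ".join(low["text"].str.lower()).split()
print(Counter(low_words).most_common(20))
```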
What is Exploratory Data Analysis
(EDA)?

• Exploratory Data Analysis (EDA) is like having an open mind and being
flexible when looking at data.
What It involves?
• Being Open-Minded: You're ready to find things you
didn't expect, as well as things you thought might be
there.

• Curiosity: You explore the data with a sense of curiosity, without
making any assumptions beforehand.

• In simple terms, it's about being willing to see and discover anything
in the data, whether it's surprising or expected.
What is Exploratory Data Analysis
(EDA)?

“Exploratory data analysis” is an attitude, a state of flexibility, a
willingness to look for those things that we believe are not there, as
well as those we believe to be there. — John Tukey
Who Developed EDA?

• John Tukey, a mathematician at Bell Labs, developed exploratory data
analysis (EDA).

• In EDA, there is no initial hypothesis or model, and the "exploratory"
aspect means that your understanding of the problem evolves as you
analyze the data.
Historical Perspective: Bell Labs
• Bell Labs: A research laboratory established in the
1920s, known for its innovations in physics, computer
science, statistics, and mathematics.

• It has produced notable technologies like the C++ programming language
and numerous Nobel Prize winners.

• Statistics Group: Bell Labs had a highly successful and productive
statistics group, which included John Tukey.
Historical Perspective: Bell Labs
cont..
• Tukey is considered the father of EDA and contributed
to the development of the S language, which evolved
into R, the open-source statistical software.

• Bell Labs is seen as a birthplace of data science due to its
collaborative environment, where statisticians and computer scientists
worked together with large and complex datasets.
What is Exploratory Data Analysis
(EDA)?
• Exploratory Data Analysis (EDA) is an approach to
analyze the data using visual techniques.

• It is used to discover trends and patterns, or to check assumptions,
with the help of statistical summaries and graphical representations.

• It's the process of exploring and becoming familiar with the data so
you can make better decisions.
Why EDA?
• EDA is a crucial step in the data analysis process
because it helps you to
• Understand,
• Clean,
• and Prepare your data,
• Generate hypotheses,
• and Make better decisions.
• It's the foundation upon which reliable and
meaningful analysis is built.
Philosophy of Exploratory Data
Analysis
• Exploratory Data Analysis (EDA) is about understanding your data before
trying to convince others of any findings.

• Rachel, while at Google, learned from former Bell Labs statisticians that
EDA is essential even with huge datasets.

• Here’s why EDA is important:

1.Understand the Data: Gain intuition and make comparisons.

2.Sanity Check: Ensure data is in the expected format and scale.

3.Identify Issues: Find missing data or outliers.

4.Debugging: Detect errors in data logging, which helps engineers fix
them.


Here are some references to help you
understand best practices and historical
context:
1. Exploratory Data Analysis by John Tukey (Pearson)
2. The Visual Display of Quantitative Information by Edward Tufte
(Graphics Press)
3. The Elements of Graphing Data by William S. Cleveland (Hobart Press)
4. Statistical Graphics for Visualizing Multivariate Data by William G.
Jacoby (Sage)
5. “Exploratory Data Analysis for Complex Models” by Andrew Gelman
(American Statistical Association)
6. The Future of Data Analysis by John Tukey. Annals of Mathematical
Statistics, Volume 33, Number 1 (1962), 1-67.
7. Data Analysis, Exploratory by David Brillinger [8-page excerpt from
International Encyclopedia of Political Science (Sage)]
Exercise: EDA

• There are 31 datasets named nyt1.csv, nyt2.csv, …, nyt31.csv, which you
can find here:
https://github.com/oreillymedia/doing_data_science
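A possible starting point for the exercise, assuming the nyt*.csv files use the column layout from the book's examples (Age, Gender, Impressions, Clicks, Signed_In); verify the column names on the actual file before relying on them.

```python
# Sketch for exploring one day of the NYT ad data (column names assumed).
import pandas as pd

df = pd.read_csv("nyt1.csv")
print(df.columns)        # confirm the assumed columns first
print(df.describe())     # summary statistics for a quick sanity check

# Example cut: click-through rate by age group for signed-in users.
signed_in = df[df["Signed_In"] == 1].copy()
signed_in["age_group"] = pd.cut(signed_in["Age"],
                                bins=[0, 18, 24, 34, 44, 54, 64, 120])
ctr = (signed_in.groupby("age_group", observed=True)
                .apply(lambda g: g["Clicks"].sum() / max(g["Impressions"].sum(), 1)))
print(ctr)
```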
The Data Science Process
1. Raw Data Collection:

• Raw Data: This is the unprocessed, original


data that is gathered directly from various
real-world sources.

• It has not yet been cleaned or transformed


and may contain errors, duplicates, missing
values, or irrelevant information.
Sources of Raw Data: Raw data can come from a wide variety of
sources, including:
• Log Files: Records of events or transactions, such as web server
logs or application logs.
• Social Media: Data from platforms like Twitter, Facebook, or
Google+.
• Sensors: Data from IoT devices, environmental sensors, or
medical devices.
• Databases: Data stored in relational or NoSQL databases.
• Manual Entry: Data entered by humans, such as survey
responses or forms.
• Public Records: Data from public datasets, such as government
reports, academic publications, or historical records.
2. Data Processing

• Data Processing: This involves transforming raw data into a clean,
organized, and usable format.
3. Data Cleaning
• The raw data often contains noise, errors, and
inconsistencies. It needs to be cleaned and transformed
into a usable format.

• Tools: Python, shell scripts, R, SQL, or a combination of these tools
can be used for data cleaning.

• Output: The cleaned data is typically structured in a tabular format
with rows and columns, making it ready for analysis.
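A minimal pandas sketch of the kind of cleaning described above; the file name and column names are placeholders, not from the text.

```python
# Illustrative cleaning steps on a hypothetical raw file.
import pandas as pd

raw = pd.read_csv("raw_data.csv")

raw = raw.drop_duplicates()                                   # remove duplicate rows
raw["sale_date"] = pd.to_datetime(raw["sale_date"], errors="coerce")  # parse dates
raw["price"] = pd.to_numeric(raw["price"], errors="coerce")   # force numeric type
raw = raw.dropna(subset=["sale_date", "price"])               # drop unusable rows
raw = raw[raw["price"] > 0]                                   # discard zero-price records

raw.to_csv("clean_data.csv", index=False)   # tabular output, ready for EDA
```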
4. Exploratory Data Analysis
(EDA):
• Initial Examination: Conduct an initial analysis to
understand the data's characteristics and to identify any
issues such as missing values, duplicates, or outliers.

• Iterative Process: EDA is iterative, meaning you may need to go back
and clean the data further if new issues are discovered during the
analysis.
Model Building:

• Algorithm Selection: Choose a suitable algorithm based on the problem
type (e.g., classification, regression, clustering).

• Examples: Algorithms like k-nearest neighbors (k-NN), linear
regression, and Naive Bayes are common choices.

• Training: Train the model using the cleaned data to learn patterns and
relationships.
Interpretation and
Communication:
• Results Interpretation: Understand the model's
results and insights.

• Communication: Communicate findings through reports, visualizations,
presentations, or academic papers to stakeholders, such as managers or
coworkers.
Building a Data Product

• Prototype Development: Develop a prototype of the data product, like a
recommendation system, spam classifier, or search ranking algorithm.

• Deployment: Deploy the data product so that users can interact with it.
• In summary, the data science process is a
comprehensive and iterative workflow that
involves collecting, cleaning, analyzing, and
modeling data, followed by interpreting and
communicating the results.
A Data Scientist's Role in This
Process
• The process of working with data doesn't happen
automatically; it requires a data scientist or a data
science team.
• These experts make key decisions about what data to
collect and why.
• They also formulate questions, create hypotheses, and
plan how to tackle problems.
• In addition to coding and analyzing the data, they are
involved in the entire process from start to finish.
• This involvement ensures that the data collection,
cleaning, analysis, and modeling are all aligned with
the project goals.
Case Study: RealDirect

https://www.youtube.com/watch?v=JXaf2I6C0Ho
About RealDirect

• RealDirect is a real estate company focused on using data to improve
the home buying and selling process.
Purpose and Mission
• Goal: The primary goal of RealDirect is to leverage data and
technology to streamline and optimize the process of buying
and selling homes. This involves providing real-time data and
insights to homeowners and buyers, reducing commission
costs, and enhancing the overall efficiency of real estate
transactions.

• Mission: RealDirect aims to make the real estate market more
transparent and accessible by offering detailed data and actionable
advice to both buyers and sellers.
Business Model
• Subscription Service: RealDirect offers a subscription
service for home sellers at approximately $395 per month.
This subscription provides access to various selling tools
and data-driven recommendations.
• Reduced Commission: Sellers can also choose to use
RealDirect's agents, who work at a reduced commission rate
of 2%, compared to the typical 2.5% to 3%. This lower rate
is possible due to the efficiency gained from pooling data
and resources.
• Platform Features: The RealDirect platform helps manage the sale or
purchase process through various statuses (e.g., active, offer made,
showing, in contract).
Technology and Data Use
• Data Integration: RealDirect integrates various sources
of data, including publicly available information and real-
time interaction data, to offer comprehensive insights.
This includes data on property listings, sales trends, and
market conditions.
• User Interface: The platform provides an interface for
sellers with tips and recommendations on how to sell
their house effectively. It also uses interaction data to
give real-time advice on what steps to take next.
• Information Collection: The agents at RealDirect
become proficient in using tools to collect and analyze
data, which helps them stay updated on new and
relevant information.
Challenges and Legal
Considerations
• Legal Hurdles: In New York, RealDirect must comply with laws that
require housing listings to be behind a registration wall. This means
users need to register to see listings, which can be a barrier but is
similar to other platforms like Pinterest.

• Industry Resistance: RealDirect has faced pushback from traditional
real estate brokers who are unhappy with its approach to lowering
commission rates. However, these brokers often have to cooperate because
buyers can find the same listings on other platforms, leading to
transparency in the market.
Competitive Advantage
• Data-Driven Approach: By leveraging data and technology,
RealDirect offers a more efficient and transparent process for buying
and selling homes.

• Cost Savings: The reduced commission rates and subscription model make
it a cost-effective alternative for sellers.

• Comprehensive Service: RealDirect not only provides listings but also
offers detailed information about properties, such as nearby amenities,
price comparisons, and market trends.
• In summary, RealDirect was founded in 2010 in New York
City by Doug Perlson and Perry Tamarkin with the
mission to improve the real estate market using data and
technology.

• The company offers subscription services and reduced commission rates,
integrates various data sources, and faces challenges from traditional
brokers and legal requirements.
Exercise: RealDirect Data
Strategy

• You've been hired as the chief data scientist at RealDirect, a real
estate website, and need to develop a data strategy for the company.

• Here are steps to approach this task:


Step 1: Understand the Current
System
• Explore the Website: Navigate the RealDirect
site to understand how buyers and sellers use it
and how it’s organized.

• Research Questions: Identify what data should be collected, how it
would look, and how it would be used for reporting and monitoring.
Step 2: Use Auxiliary Data

• Find Market Data: Since there’s no internal data yet, use external
data from sources like GitHub’s Rolling Sales Update.

• Data Cleanup: Load and clean the data by fixing outliers, formatting
dates correctly, and ensuring numerical values are treated as such.

• Exploratory Analysis: Visualize and compare data across neighborhoods
and time to find meaningful patterns.
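One way Step 2 might look in pandas. The column names (SALE PRICE, SALE DATE, NEIGHBORHOOD) and the file name are assumptions about the rolling-sales extract; check df.columns on the real file first.

```python
# Sketch of cleanup and a neighborhood/time comparison (columns assumed).
import pandas as pd

sales = pd.read_csv("rollingsales_manhattan.csv")   # hypothetical file name

# Cleanup: strip "$" and "," from prices, parse dates, drop unusable rows.
sales["SALE PRICE"] = pd.to_numeric(
    sales["SALE PRICE"].astype(str).str.replace(r"[$,]", "", regex=True),
    errors="coerce")
sales["SALE DATE"] = pd.to_datetime(sales["SALE DATE"], errors="coerce")
sales = sales.dropna(subset=["SALE PRICE", "SALE DATE"])
sales = sales[sales["SALE PRICE"] > 100]            # crude filter for $0 "gift" sales

# Exploratory comparison: median sale price by neighborhood and month.
sales["month"] = sales["SALE DATE"].dt.to_period("M")
summary = (sales.groupby(["NEIGHBORHOOD", "month"])["SALE PRICE"]
                .median()
                .unstack("month"))
print(summary.head())
```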
Step 3: Report Findings

• Write a Report: Summarize your findings in a simple, clear report for
the CEO.

• Communication: Develop strategies to communicate effectively with
non-data scientists. Identify other people in the company you should
talk to for more information.
Step 4: Understand the Domain

• Step Out of Comfort Zone: Collecting data in a new field can provide
insights for your own work.

• Learn the Vocabulary: Real estate has specific terms. Asking questions
to understand these terms is important for grasping the problem fully.
• In simple terms, your job is to explore how RealDirect
operates, determine what data to collect and analyze,
use external data to gain insights, and communicate
your findings effectively while learning the real estate
terminology and developing best practices for data
strategy.
Note:

• Data scientist (noun): Person who is better at statistics than any
software engineer and better at software engineering than any
statistician.

— Josh Wills
Three Basic Machine Learning
Algorithms:
• Linear Regression
• k-Nearest Neighbours (k-NN)
• k-means
Linear Regression

• Linear regression is a type of machine-learning algorithm, more
specifically a supervised machine-learning algorithm.

• It learns from labelled datasets and maps the data points to the most
optimized linear function, which can be used for prediction on new
datasets.
Definition of Linear Regression
• Linear Regression is a statistical method used to
model and analyze the relationships between a
dependent variable and one or more independent
variables.

• The goal is to find the best-fitting straight line through the data
points that can be used to predict the dependent variable based on the
independent variable(s).
• The blue dots represent
actual data points,
showing the number of
hours studied and the
corresponding test
scores.
• The red line is the
regression line, which is
the best-fit straight line
that represents the
overall trend of the data.
Linear Regression cont..
• Linear regression is a method used to understand the
relationship between two things.

• Imagine you have a bunch of dots on a graph that represent how two
things change together, like hours studied and test scores.

• Linear regression draws a straight line through these dots in a way
that best shows the overall trend.

• This line can help predict one thing (like a test score) if you know
the other thing (like hours studied).

• It's like finding the best-fitting path that the dots generally follow.
Linear Regression cont..
• Equation of linear regression:
y = β0 + β1x
where:
y : Dependent variable (outcome)
x : Independent variable (predictor)
β0 : Intercept (the value of y when x = 0)
β1 : Slope (the change in y for a one-unit change in x)
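A small sketch of fitting this equation on made-up hours-studied data with scikit-learn (the numbers are invented for illustration):

```python
# Fit y = b0 + b1*x on toy hours-vs-score data.
import numpy as np
from sklearn.linear_model import LinearRegression

hours = np.array([[1], [2], [3], [4], [5], [6]])   # x: independent variable
scores = np.array([52, 58, 63, 70, 74, 81])        # y: dependent variable

model = LinearRegression().fit(hours, scores)
print("intercept (b0):", model.intercept_)
print("slope (b1):", model.coef_[0])
print("predicted score for 7 hours:", model.predict([[7]])[0])
```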
Example 1:
• A social networking site charges a monthly subscription fee of
$25.

• The site's revenue is recorded daily over two years, resulting in a
series of data points (number of users and total revenue).

• The first four data points are (1, 25), (10, 250), (100, 2500), and
(200, 5000).

• A clear relationship y = 25x is observed, indicating a linear pattern
with a coefficient of 25: each additional user adds $25 in revenue.

• The graph shows these data points as blue dots, with a red dashed line
representing the equation y = 25x, illustrating the linear relationship.
Fitting the model
• The goal is to find the optimal line that best fits the data points by
minimizing the distance between the points and the line.

• Least Squares Estimation:

• Linear regression uses a method called least squares estimation.

• The objective is to minimize the Residual Sum of Squares (RSS), which
is the sum of the squared vertical distances between the observed data
points and the predicted values on the line.
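For simple linear regression the least-squares estimates have a closed form; a minimal NumPy sketch on toy numbers, showing the RSS being minimized:

```python
# Closed-form least squares: b1 = cov(x, y) / var(x), b0 = mean(y) - b1*mean(x).
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 4.3, 5.9, 8.2, 9.8])

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()

residuals = y - (b0 + b1 * x)            # vertical distances to the fitted line
rss = np.sum(residuals ** 2)             # Residual Sum of Squares being minimized
print(b0, b1, rss)
```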
In Summary
1. Plotting Points:
• Plot the data points on a graph.
2. Drawing the Line:
• Draw a line that seems to fit the trend of the points.
3. Calculating Residuals:
• Measure the vertical distances (residuals) from each
point to the line.
4. Squaring and Summing Residuals:
• Square these residuals to avoid negative values and
sum them up to get RSS.
5. Finding the Best Fit Line:
• Adjust the line to minimize RSS.
Refer to the class notes for the values and the full solution.
The K-Nearest Neighbors (KNN)
algorithm
• The K-Nearest Neighbors (KNN) algorithm is a
supervised machine learning method employed to
tackle classification and regression problems.

• Evelyn Fix and Joseph Hodges developed this algorithm in 1951; it was
subsequently expanded by Thomas Cover.
Definition of KNN

• k-Nearest Neighbors (k-NN) is a non-parametric, lazy learning
algorithm that classifies a new data point based on the classes of its k
nearest neighbors.

• The "k" in k-NN is a user-defined constant that determines the number
of nearest neighbors to consider when making a classification or
prediction.
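A minimal illustration of k-NN classification with scikit-learn on a toy 2-D dataset (values invented):

```python
# k-NN classification; k (n_neighbors) is the user-defined constant.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

X = np.array([[1, 1], [1, 2], [2, 1],    # three points of class 0
              [6, 6], [7, 6], [6, 7]])   # three points of class 1
y = np.array([0, 0, 0, 1, 1, 1])

knn = KNeighborsClassifier(n_neighbors=3)   # default metric is Euclidean
knn.fit(X, y)

print(knn.predict([[2, 2], [6, 5]]))        # expected output: [0 1]
```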
Different Distance Metrics in
KNN
• Euclidean Distance
• Manhattan
• Cosine Similarity
• Jaccard Distance or Similarity
• Mahalanobis Distance
• Hamming Distance
Euclidean Distance
• This distance is the most widely used one, as it is the default metric
that the scikit-learn (sklearn) library in Python uses for K-Nearest
Neighbors.
• It is a measure of the true straight line distance between
two points in Euclidean space.
Manhattan
• This distance is also known as taxicab distance or city block
distance, because of the way it is calculated.
• The distance between two points is the sum of the
absolute differences of their Cartesian coordinates.
Cosine Similarity

• This distance metric is used mainly to calculate the similarity
between two vectors.

• It is measured by the cosine of the angle between two vectors and
determines whether two vectors are pointing in the same direction.
Jaccard Distance or Similarity
• This measures how similar two sets are.
• Imagine comparing two friend lists.
• You look at how many friends are in both lists
(intersection) and how many unique friends
are in either list (union).
• The Jaccard Similarity is the ratio of the
intersection to the union. Higher values mean
more similarity.
Mahalanobis Distance

• Similar to Euclidean distance, but it takes into account how the data
is spread out (correlation) and adjusts for different scales in the data.

• Calculate the Covariance Matrix (S): The covariance matrix represents
how each feature varies with every other feature. It helps understand
the relationships and spread of the data.

• Invert the Covariance Matrix (S⁻¹): The inverse of the covariance
matrix is used to adjust for the relationships between features.
Hamming Distance

• Used for strings of the same length.

• It counts how many positions have different characters.

• For example, comparing "cat" and "bat" gives a distance of 1 because
only the first letters are different, while "olive" and "ocean" give a
distance of 4.
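A short NumPy sketch computing several of the metrics above on small invented examples:

```python
# Hand-rolled versions of some of the distance/similarity measures above.
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])

euclidean = np.sqrt(np.sum((a - b) ** 2))     # straight-line distance
manhattan = np.sum(np.abs(a - b))             # city-block / taxicab distance
cosine_sim = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Jaccard similarity on two "friend lists": |intersection| / |union|.
friends1, friends2 = {"ann", "bob", "cara"}, {"bob", "cara", "dan", "eve"}
jaccard = len(friends1 & friends2) / len(friends1 | friends2)

# Hamming distance between equal-length strings.
hamming = sum(c1 != c2 for c1, c2 in zip("olive", "ocean"))   # -> 4

print(euclidean, manhattan, cosine_sim, jaccard, hamming)
```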
K - Means

• K-Means is a popular and straightforward clustering algorithm used in
machine learning and data analysis to partition a dataset into K
distinct, non-overlapping groups (clusters) based on the inherent
structure of the data. Here's a simple definition:
K Means Definition

• K-Means is an iterative algorithm that divides a set of data points
into K clusters, where each data point belongs to the cluster with the
nearest mean (centroid).
How K-Means Works
1.Choose K: Decide the number of clusters, K, you want
to form in the dataset.
2.Initialize Centroids: Randomly select K data points
from the dataset as the initial centroids (cluster centers).
3.Assign Clusters: Assign each data point to the nearest
centroid, forming K clusters.
4.Update Centroids: Calculate the new centroids by
taking the mean of all data points assigned to each
cluster.
5.Repeat: Repeat steps 3 and 4 until the centroids no
longer change significantly or a maximum number of
iterations is reached.
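A compact sketch of these steps in NumPy on invented 2-D data (in practice a library implementation such as sklearn.cluster.KMeans would usually be used):

```python
# Bare-bones K-Means following steps 1-5 above (toy data, K = 2).
import numpy as np

rng = np.random.default_rng(0)
points = np.vstack([rng.normal(0, 1, (20, 2)),    # blob around (0, 0)
                    rng.normal(5, 1, (20, 2))])   # blob around (5, 5)

K = 2                                                            # step 1: choose K
centroids = points[rng.choice(len(points), K, replace=False)]    # step 2: init

for _ in range(100):                               # step 5: repeat until stable
    # Step 3: assign each point to its nearest centroid.
    dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    # Step 4: recompute each centroid as the mean of its assigned points.
    new_centroids = np.array([points[labels == k].mean(axis=0) for k in range(K)])
    if np.allclose(new_centroids, centroids):
        break
    centroids = new_centroids

print(centroids)   # should land near (0, 0) and (5, 5)
```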
