Datascience Presentation

Uploaded by

techiesid02


What is Data Science

• Data science, also known as data-driven science, is an interdisciplinary field about
scientific methods, processes, and systems to extract knowledge or insights from data in
various forms, either structured or unstructured, similar to data mining.
• The inventor of the World Wide Web, Tim Berners-Lee, is often quoted as having said,
“Data is not information, information is not knowledge, knowledge is not understanding,
understanding is not wisdom.”

• This quote describes a kind of pyramid, where data are the raw materials that make up the
foundation at the bottom of the pile, and information, knowledge, understanding and wisdom
represent higher and higher levels of the pyramid.

• The major goal of a data scientist is to help people to turn data into information and
onwards up the pyramid

• Data science is different from other areas such as mathematics or statistics. Data science is
an applied activity, and data scientists serve the needs and solve the problems of data users.

• Before you can solve a problem, you need to identify it, and this process is not always as
obvious as it might seem.
Business intelligence

• “Business intelligence is the process of transforming data into information and, through
discovery, transforming that information into knowledge.”
What Is Business Intelligence?
• How are sales year-to-date, and how do they compare to last year?
• Who is most likely to respond to my current marketing campaign, and how will they impact
revenue?
• What is the turnover in employees compared to the last five years?
• How is potential fraud cost being managed over time?
• What are my most profitable products by region, by year, and year-to-date?
This is a simple business question, but the actual query can be quite complex:

“What was the percentage change in revenue for a grouping of our top 20% products from one
year ago over a rolling three-month time period compared to this year for each region of the
world?”
Business users typically want to answer questions that include terms
such as what, where, who, and when.
For example, you find the following essential questions embedded in
the sample question:

• What products are selling best? (“…top 20%…”)
• Where are they selling? (“…each region of the world…”)
• When have they performed the best? (“…percentage change in revenue…”)
A few scenarios where data mining can be helpful:

• A retailer wants to increase revenues by identifying all potentially high-value customers
to offer incentives to them.

• The retailer also wants guidance in store layout by determining the products most likely
to be purchased together.

• A government agency wants faster and more accurate methods of highlighting possible
fraudulent activity for further investigation, and so on.
Change Is the Motivation

• Data has always been critical to organizations.

• Data creating business value is not a new idea; however, as times change and volume,
variety, and velocity increase, it has gained more importance.
• A change in approach is inevitable and probably more needed now than before.
• Competitive battles are fought on the basis of data, and its effective use is becoming the
core basis of competition and success.
Data Science Process
Most real-life projects that involve data can be broken down into several steps:

1. Data Acquisition - we need to find (or collect) the data, and get some representation
of it into the computer

2. Data Cleaning - Inevitably, there will be errors in the data, either because values were entered
incorrectly, because we misunderstood the nature of the data, or because records were duplicated or omitted.
Many times, data is presented for viewing, and extracting the data in some other form becomes a challenge.

3. Data Organization - Depending on what you want to do, you may need to reorganize your data. This is
especially true when you need to produce graphical representations of the data. Naturally, we need the
appropriate tools to do these tasks.

4. Data Modelling and Presentation - We may fit a statistical model to our data, or we may just produce a
graph that shows what we think is important. Often, a variety of models or graphs needs to be considered.
It's important to know what techniques are available and whether they are accepted within a particular
user community.
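
The four steps above can be sketched end to end on a toy dataset. This is only an illustration under stated assumptions: the CSV text, the column names, and the helper functions `acquire`, `clean`, `organize`, and `summarize` are all made up for this example, and a per-region mean stands in for a real statistical model.

```python
import csv
import io
from collections import defaultdict
from statistics import mean

# 1. Data acquisition: a tiny made-up CSV standing in for a real data source.
RAW = """region,month,sales
North,Jan,100
North,Feb,120
South,Jan,80
South,Jan,80
South,Feb,
North,Mar,bad
"""

def acquire(text):
    """Parse the CSV text into a list of row dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))

def clean(rows):
    """2. Data cleaning: drop exact duplicates and rows with missing or malformed sales."""
    seen, cleaned = set(), []
    for row in rows:
        key = (row["region"], row["month"], row["sales"])
        if key in seen:
            continue  # duplicated record
        seen.add(key)
        try:
            cleaned.append({**row, "sales": float(row["sales"])})
        except ValueError:
            continue  # empty or non-numeric value
    return cleaned

def organize(rows):
    """3. Data organization: group the sales figures by region."""
    by_region = defaultdict(list)
    for row in rows:
        by_region[row["region"]].append(row["sales"])
    return dict(by_region)

def summarize(by_region):
    """4. Modelling and presentation: here just the mean sales per region."""
    return {region: mean(values) for region, values in by_region.items()}

summary = summarize(organize(clean(acquire(RAW))))
for region, average in sorted(summary.items()):
    print(f"{region}: {average:.1f}")
```

In a real project each step would be far larger, but the shape (acquire, clean, organize, then model or plot) tends to stay the same.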
Overview of Data Science

• Data acquisition, profiling, preparation, and visualization
• Feature engineering
• Model training
• Model evaluation, explanation, and interpretation
• Model deployment
Who is a Data Scientist

A person who is better at statistics than any software engineer and
better at software engineering than any statistician.
Technologies Used In Data Science

• Big Data
• Data Cleansing
• Web Scraping
• Machine Learning
• Data Visualization

Big Data

Big data is a term for data sets that are so large or complex that
traditional data processing application software is inadequate to deal with
them.
Big Data Analytics (or data science) can be studied at three levels:

• A researcher level that focuses deeply on the underlying mathematics and computing

• A business level that focuses on interpretation and business applications

• An engineering level where the focus is on building working systems with known knowledge

Think analytically, rigorously and systematically about a business
problem and come up with a solution that leverages the available data.
Traditional Data Architecture
Modern Data Architecture
Big Data Technologies

• Apache Hadoop
• Microsoft HDInsight
• NoSQL
• Hive
• Sqoop
• PolyBase
• Big data in Excel
• Presto
Data Cleansing

Data cleansing, or data cleaning, is the process of detecting and correcting (or removing)
corrupt or inaccurate records from a record set. It involves identifying incomplete,
incorrect, inaccurate or irrelevant parts of the data and then replacing, modifying, or
deleting the dirty or coarse data.
Data Cleaning Techniques

• Data cleaning techniques are used to correct, transform, and organize data to improve its quality and accuracy. Here are
some of the most common data-cleaning techniques:
• Data Normalization: Normalization is the process of transforming data into a standard format, making it easier to process
and clean.
• Data Transformation: Data transformation is the process of converting data from one format to another, making it easier to
use and analyze.
• Data Integration: Data integration is the process of combining data from multiple sources into a single, consistent format.
• Data Reduction: Data reduction is the process of removing unnecessary data, such as duplicates or irrelevant information,
to simplify and improve data quality.
• Data Imputation: Data imputation is the process of filling in missing data with estimates or values derived from other data.
• Data Deduplication: Data deduplication is the process of removing duplicate data entries to ensure data accuracy and
consistency.
• Data Enrichment: Data enrichment is the process of adding additional information to data, such as geolocation data or
demographic information, to enhance its value.
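
Three of the techniques above (deduplication, imputation, and normalization) can be sketched on a handful of made-up records. The field names, the choice of mean imputation, and the choice of min-max scaling as the "standard format" are assumptions for this example; other choices are equally valid.

```python
from statistics import mean

# Made-up records: one duplicate and one missing income value.
records = [
    {"id": 1, "age": 35, "income": 35.0},
    {"id": 2, "age": 22, "income": None},   # missing income
    {"id": 1, "age": 35, "income": 35.0},   # duplicate of the first record
    {"id": 3, "age": 63, "income": 200.0},
]

def deduplicate(rows):
    """Data deduplication: keep the first record seen for each id."""
    seen, unique = set(), []
    for row in rows:
        if row["id"] not in seen:
            seen.add(row["id"])
            unique.append(row)
    return unique

def impute_mean(rows, field):
    """Data imputation: replace missing values with the mean of the known ones."""
    known = [row[field] for row in rows if row[field] is not None]
    fill = mean(known)
    return [{**row, field: fill if row[field] is None else row[field]} for row in rows]

def min_max_normalize(rows, field):
    """Data normalization: rescale a numeric field to the range [0, 1].

    Assumes the field has at least two distinct values (otherwise high == low).
    """
    values = [row[field] for row in rows]
    low, high = min(values), max(values)
    return [{**row, field: (row[field] - low) / (high - low)} for row in rows]

cleaned = min_max_normalize(impute_mean(deduplicate(records), "income"), "income")
for row in cleaned:
    print(row)
```

The order matters here: deduplicating first keeps the repeated record from being counted twice in the mean used for imputation.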
Web Scraping

Web scraping is a technique employed to extract large amounts of
data from websites, whereby the data is extracted and saved to a local
file on your computer.
• Web scraping is all about collecting content from websites. Scrapers
come in many shapes and forms and the exact details of what a
scraper will collect will vary greatly, depending on the use cases.

• A very common example is search engines, of course. They
continuously crawl and scrape the web for new and updated content,
to include in their search index.
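
As a rough illustration of what a scraper does once a page has been downloaded, the sketch below pulls every link out of an HTML document using Python's standard `html.parser` module. The page content and the `LinkScraper` class are made up for this example; a real scraper would also fetch the page over the network and respect the site's terms of use.

```python
from html.parser import HTMLParser

# A small made-up page standing in for a downloaded website.
PAGE = """
<html><body>
  <h1>Products</h1>
  <a href="/item/1">Laptop</a>
  <a href="/item/2">Phone</a>
  <p>No link here</p>
</body></html>
"""

class LinkScraper(HTMLParser):
    """Collects (href, text) pairs for every <a> tag that has an href."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None  # href of the <a> tag we are currently inside, if any

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")

    def handle_data(self, data):
        # Only text that appears inside a link is recorded.
        if self._href is not None:
            self.links.append((self._href, data.strip()))

    def handle_endtag(self, tag):
        if tag == "a":
            self._href = None

scraper = LinkScraper()
scraper.feed(PAGE)
print(scraper.links)
```

The collected pairs could then be written to a local file, for example with the standard `csv` module.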
Technologies Used In Data Science

ScrapingBee
ScrapeBox
ScreamingFrog
Scrapy
pyspider
Beautiful Soup
Diffbot
Common Crawl
Technologies Used In Data Science
Data Visualization

Data visualization is a general term that describes any effort to help people understand the significance of data by placing
it in a visual context. Patterns, trends and correlations that might go undetected in text-based data can be exposed and
recognized more easily with data visualization software.
Technologies Used In Data Science
Machine Learning Software
TensorFlow
PyTorch
Scikit-Learn
Keras
XGBoost
Apache Spark MLlib
Microsoft Azure Machine Learning
RapidMiner
Technologies Used In Data Science

Machine Learning

Machine learning is the subfield of computer science that, according to Arthur Samuel, gives "computers
the ability to learn without being explicitly programmed." Samuel, an American pioneer in the field of
computer gaming and artificial intelligence, coined the term "machine learning" in 1959 while at IBM.
Applications of Data Science
Top Data Science Trends For 2024

• Augmented Analytics
• Responsible AI
• Edge Computing for Data Science
• Quantum Computing Integration
• Continuous Learning Models
• Natural Language Processing (NLP) Advancements
• Federated Learning
• Blockchain in Data Science
Introduction to Machine Learning
What is Machine Learning?

Machine Learning is a field of study that gives computers the
ability to learn without being explicitly programmed.
What is Machine Learning?
The ability to perform a task in a situation which has never been encountered before (Learning
= Generalization)

Automating automation

Getting computers to program themselves

Writing software is the bottleneck

Let the data do the work instead!

A short story
Samuel's claim to fame was that, back in the 1950s, he wrote a checkers-playing
program, and the amazing thing about this program
was that Arthur Samuel himself wasn't a very good checkers player.
What he did was have the program play maybe tens of thousands of games against
itself. By watching what sorts of board positions tended to lead to wins and
what sorts of board positions tended to lead to losses,
the program learned over time what are good board positions and
what are bad board positions,
and eventually learned to play checkers better than Arthur Samuel
himself was able to.
This was a remarkable result.
A computer has the patience to play tens of thousands of
games against itself; no human has the patience to play that many games.
By doing this, the computer was able to get so much checkers-playing experience
that it eventually became a better checkers player than Samuel himself.
What is Machine Learning
• “A computer program is said to learn from experience E with respect to some
class of tasks T and performance measure P if its performance at tasks
in T, as measured by P, improves with experience E.” - Tom M. Mitchell
• Consider playing checkers:
• E = the experience of playing many games of checkers
• T = the task of playing checkers.
• P = the probability that the program will win the next game
• Classification problems, where the goal is to categorize objects into a
fixed set of categories:
• Face detection: Identify faces in images (or indicate if a face is
present).
• Email filtering: Classify emails into spam and not-spam.
• Medical diagnosis: Diagnose a patient as a sufferer or non-sufferer of
some disease.
• Weather prediction: Predict, for instance, whether or not it will rain
tomorrow.
• Facial recognition technology allows social media platforms to help
users tag and share photos of friends.
• Optical character recognition (OCR) technology converts images of
text into movable type
• Recommendation engines, powered by machine learning, suggest
what movies or television shows to watch next based on user
preferences.
• Self-driving cars that rely on machine learning to navigate may soon
be available to consumers.
Traditional Programming Vs Machine Learning
• Machine learning is the science of getting computers to act without being explicitly
programmed. In the past decade, machine learning has given us self-driving cars, practical
speech recognition, effective web search, and a vastly improved understanding of the human
genome. Machine learning is so pervasive today that you probably use it dozens of times a
day without knowing it.
Types of Learning

• Supervised Learning
• Unsupervised Learning
• Reinforcement Learning
Types of Learning
Algorithms
• Supervised learning
  • Prediction
  • Classification (discrete labels), Regression (real values)
• Unsupervised learning
  • Clustering
  • Probability distribution estimation
  • Finding association (in features)
  • Dimension reduction
Algorithms
• The success of a machine learning system also depends on the algorithms.

• The algorithms control the search to find and build the knowledge structures.

• The learning algorithms should extract useful information from training examples.
Algorithms

• Supervised learning
• Unsupervised learning
• Semi-supervised learning
Machine learning structure
• Supervised learning
• Regression analysis is a statistical technique used to find the relations between two or more
variables. In regression analysis, the impact of one or more independent variables on a
dependent variable is measured. When there is only one independent variable and one dependent
variable, we call it simple regression. On the other hand, when there are many independent variables
influencing one dependent variable, we call it multiple regression.
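
Simple regression can be illustrated with an ordinary least-squares fit of a straight line. The function name `fit_simple_regression` and the data are made up for this sketch; the slides do not say which fitting method is intended, so least squares is an assumption.

```python
def fit_simple_regression(xs, ys):
    """Least-squares fit of y = slope * x + intercept for one independent variable."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance of x and y, and variance of x (both unnormalized).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Made-up data that follows y = 2x + 1 exactly.
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]
slope, intercept = fit_simple_regression(xs, ys)
print(slope, intercept)
```

Multiple regression follows the same idea with several independent variables, at which point a linear algebra library is normally used instead of hand-written sums.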
Machine learning structure
• Unsupervised learning
Machine Learning Applications
Machine Learning: Problem Types

Game Playing (Reinforcement Learning)
ML in Big Data
The ability to learn from a large corpus of data is a real boon for ML.

Even simplistic ML models shine when they are trained on huge amounts
of data.

With big data toolsets, a wide variety of ML applications have started to
emerge beyond purely academic ones.

Big data is democratising ML for the general public.

The data science process
Machine Learning

K-Nearest Neighbours
Different Learning Methods

Eager Learning
• Explicit description of target function on the whole training set

Instance-based Learning
• Learning=storing all training instances
• Classification=assigning target function to a new instance
• Referred to as “Lazy” learning
Different Learning Methods
Eager Learning

Any random movement => It’s a mouse

I saw a mouse!
Instance Based Learning

It’s very similar to a desktop!!
Classification
• Given: dataset of instances with known categories
• Goal: using the “knowledge” in the dataset, classify a given instance
• predict the category of the given instance that is rationally consistent with the dataset
Instance Based Learning
K-Nearest Neighbor Algorithm
• Weighted Regression
• Case-based reasoning
K_Nearest Neighbours
• For a given instance T, get the top k dataset instances that are “nearest” to T
• Select a reasonable distance measure
• Inspect the category of these k instances, choose the category C that represents the most instances
• Conclude that T belongs to category C
K_Nearest Neighbours

Features
• All instances correspond to points in an n-dimensional Euclidean space
• Classification is delayed till a new instance arrives
• Classification done by comparing feature vectors of the different points
• Target function may be discrete or real-valued
K_Nearest Neighbour Classifier
Learning by Analogy
• Tell me who your friends are and I’ll tell you who you are
• A new example is assigned to the most common class among the (K) examples that are most similar to it
K_Nearest Neighbour Algorithm
To determine the class of a new example E:

• Calculate the distance between E and all examples in the training set
• Select K-nearest examples to E in the training set
• Assign E to the most common class among its K-nearest neighbors
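
The three steps above translate almost line for line into code. The sketch below is a minimal illustration, not a production implementation: the function names, the two-dimensional training points, and the labels "A" and "B" are made up, and Euclidean distance is assumed as the distance measure.

```python
import math
from collections import Counter

def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_classify(training, new_example, k):
    """Classify new_example from a list of (features, label) training pairs."""
    # Step 1: distance between the new example and every training example.
    distances = [(euclidean(features, new_example), label) for features, label in training]
    # Step 2: keep the k nearest examples.
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # Step 3: assign the most common class among them.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Made-up training set with two well-separated classes.
training = [
    ((1.0, 1.0), "A"), ((1.5, 2.0), "A"),
    ((5.0, 5.0), "B"), ((6.0, 5.5), "B"), ((5.5, 6.0), "B"),
]
print(knn_classify(training, (1.2, 1.4), k=3))
print(knn_classify(training, (5.2, 5.1), k=3))
```

Note that nothing is "trained" ahead of time: all the work happens inside `knn_classify` when a new example arrives.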

Distance Between Neighbors
Each example is represented with a set of numerical attributes

Customer   Age   Income   No. of credit cards
Jay        35    95K      3
Rina       41    215K     2

• “Closeness” is defined in terms of the Euclidean distance between two examples

• The Euclidean distance between X = (x1, x2, x3, …, xn) and Y = (y1, y2, y3, …, yn) is defined as:

$$D(X, Y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$$

$$\text{Distance}(\text{Jay}, \text{Rina}) = \sqrt{(35 - 41)^2 + (95{,}000 - 215{,}000)^2 + (3 - 2)^2}$$

K_Nearest Neighbours: Instance Based Learning
• No model is built: Store all training examples

• Any processing is delayed until a new instance must be classified

[Figure: training examples labeled “Response” and “No response”; the new instance is assigned Class: Response]

K_Nearest Neighbours: Example

Customer   Age   Income   No. credit cards   Response
Jay        35    35K      3                  No
Rina       22    50K      2                  Yes
Hema       63    200K     1                  No
Tommy      59    170K     1                  No
Neil       25    40K      4                  Yes
Dravid     37    50K      2                  ?
K_Nearest Neighbours: Example

Customer   Age   Income   No. credit cards   Response   Distance from Dravid
Jay        35    35K      3                  No         sqrt[(35−37)^2 + (35−50)^2 + (3−2)^2] = 15.17
Rina       22    50K      2                  Yes        15
Hema       63    200K     1                  No         152.24
Tommy      59    170K     1                  No         122
Neil       25    40K      4                  Yes        15.75
Dravid     37    50K      2                  ?          0
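
The distances in the table can be checked with a few lines of code, using age, income in thousands, and number of credit cards as the three features. The value k = 3 is an assumption made for this sketch; the slides do not state which k is used.

```python
import math
from collections import Counter

# (age, income in thousands, number of credit cards) and response, from the table above.
customers = {
    "Jay":   ((35, 35, 3), "No"),
    "Rina":  ((22, 50, 2), "Yes"),
    "Hema":  ((63, 200, 1), "No"),
    "Tommy": ((59, 170, 1), "No"),
    "Neil":  ((25, 40, 4), "Yes"),
}
dravid = (37, 50, 2)

def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Distance from every known customer to Dravid, printed nearest first.
distances = {name: euclidean(features, dravid) for name, (features, _) in customers.items()}
for name, d in sorted(distances.items(), key=lambda item: item[1]):
    print(f"{name:6s} {d:7.2f}")

k = 3  # assumed value, not given in the slides
nearest = sorted(distances, key=distances.get)[:k]
prediction = Counter(customers[name][1] for name in nearest).most_common(1)[0][0]
print(nearest, prediction)
```

With k = 3 the nearest neighbours are Rina, Jay, and Neil, two of whom responded, so Dravid is predicted to respond.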
K_Nearest Neighbours: Strengths and Weaknesses
Strengths
• Simple to implement and use
• Comprehensible: easy to explain prediction
• Robust to noisy data by averaging k-nearest neighbors
• Some appealing applications (will discuss next in personalization)

Weaknesses
• Need a lot of space to store all examples
• Takes more time to classify a new example than with a model (need to
calculate and compare distance from new example to all other examples)
• Each attribute is treated equally
K_Nearest Neighbours: Classifier

[Figure: a classification tree model (splits on Age > 40, Income > 100K, and No. cards > 2, with leaves Response / No response) shown beside a k-nearest neighbors plot in which the new instance is assigned Class: Response]
K_Nearest Neighbours: Strengths and Weaknesses

Customer   Age   Income   No. of credit cards
Jay        35    95K      3
Rina       41    215K     2

$$\text{Distance}(\text{Jay}, \text{Rina}) = \sqrt{(35 - 41)^2 + (95{,}000 - 215{,}000)^2 + (3 - 2)^2}$$

• Distance between neighbors could be dominated by some attributes with relatively large numbers
(e.g., income in our example)

• Important to normalize some features
(e.g., map numbers to numbers between 0-1)

Example: Income
Highest income = 500K
Jay’s income is normalized to 95/500, Rina’s income is normalized to 215/500, etc.
K_Nearest Neighbours: Strengths and Weaknesses

Normalization of Variables

Customer   Age             Income            No. credit cards   Response
Jay        35/63 = 0.56    35/200 = 0.175    3/4 = 0.75         No
Rina       22/63 = 0.35    50/200 = 0.25     2/4 = 0.5          Yes
Hema       63/63 = 1       200/200 = 1       1/4 = 0.25         No
Tommy      59/63 = 0.94    170/200 = 0.85    1/4 = 0.25         No
Neil       25/63 = 0.40    40/200 = 0.2      4/4 = 1            Yes
Dravid     37/63 = 0.59    50/200 = 0.25     2/4 = 0.5          Yes
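
The normalization in this table divides each attribute by the largest value in its column (63 for age, 200 for income in thousands, 4 for credit cards). A minimal sketch of that calculation follows; the function name `normalize_by_max` is made up for this example.

```python
# (age, income in thousands, number of credit cards), from the example table.
customers = {
    "Jay": (35, 35, 3),
    "Rina": (22, 50, 2),
    "Hema": (63, 200, 1),
    "Tommy": (59, 170, 1),
    "Neil": (25, 40, 4),
    "Dravid": (37, 50, 2),
}

def normalize_by_max(rows):
    """Divide every attribute by the maximum value found in its column."""
    columns = list(zip(*rows.values()))
    maxima = [max(column) for column in columns]
    return {
        name: tuple(value / maximum for value, maximum in zip(values, maxima))
        for name, values in rows.items()
    }

normalized = normalize_by_max(customers)
for name, values in normalized.items():
    print(name, [round(v, 2) for v in values])
```

After this step every attribute lies between 0 and 1, so income no longer dominates the Euclidean distance.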
K-Nearest Neighbor: Strengths & Weaknesses
• Distance works naturally with numerical attributes
$$d(\text{Jay}, \text{Dravid}) = \sqrt{(35 - 37)^2 + (35 - 50)^2 + (3 - 2)^2} \approx 15.17$$
• What if we have nominal attributes?

Example: Married

Customer   Married   Income   No. credit cards   Response
Jay        Yes       35K      3                  No
Rina       No        50K      2                  Yes
Hema       No        200K     1                  No
Tommy      Yes       170K     1                  No
Neil       No        40K      4                  Yes
Dravid     Yes       50K      2                  Yes
Non-Numeric Data
• Feature values are not always numbers
• Example
• Boolean values: Yes or no, presence or absence of an attribute
• Categories: Colors, educational attainment, gender
• How do these values factor into the computation of distance?
Dealing with Non-Numeric Data
• Boolean values => convert to 0 or 1
  • Applies to yes-no/presence-absence attributes
• Non-binary characterizations
  • Use natural progression when applicable; e.g., educational attainment: GS, HS, College, MS, PhD
    => 1,2,3,4,5
  • Assign arbitrary numbers but be careful about distances; e.g., color: red, yellow, blue => 1,2,3
• How about unavailable data?
  (0 value not always the answer)
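
The conversions above can be sketched as small helper functions. The function names are made up for this example, and the one-hot encoding at the end is an addition that is not in the slides: it is one common way to avoid the distance problem that arbitrary numbers such as red = 1, yellow = 2, blue = 3 introduce.

```python
EDUCATION_LEVELS = ["GS", "HS", "College", "MS", "PhD"]
COLORS = ["red", "yellow", "blue"]

def encode_boolean(value):
    """Yes/no attribute -> 1/0."""
    return 1 if value == "Yes" else 0

def encode_ordinal(value, levels):
    """Ordered category -> its 1-based position in the natural progression."""
    return levels.index(value) + 1

def encode_one_hot(value, categories):
    """Unordered category -> one 0/1 flag per category.

    Not from the slides: with this encoding every pair of different
    categories is the same distance apart, unlike arbitrary numbers.
    """
    return [1 if value == category else 0 for category in categories]

print(encode_boolean("Yes"))                    # married: Yes
print(encode_ordinal("MS", EDUCATION_LEVELS))   # educational attainment
print(encode_one_hot("yellow", COLORS))         # color
```

With the arbitrary numbering, red and blue end up twice as far apart as red and yellow, which is rarely what is meant.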
Preprocessing Your Dataset

• Dataset may need to be preprocessed to ensure more reliable data mining results
• Conversion of non-numeric data to numeric data
• Calibration of numeric data to reduce effects of disparate ranges
• Particularly when using the Euclidean distance metric
k-NN Variations
• Value of k
• Larger k increases confidence in prediction
• Note that if k is too large, decision may be skewed
• Weighted evaluation of nearest neighbors
• Plain majority may unfairly skew decision
• Revise algorithm so that closer neighbors have greater “vote weight”
• Other distance measures
Other Distance Measures
• City-block distance (Manhattan dist)
• Add absolute value of differences
• Cosine similarity
• Measure angle formed by the two samples (with the origin)
• Jaccard distance
• Determine the share of exact matches between the samples (not including unavailable data); this ratio is the Jaccard similarity J, and the Jaccard distance is 1 − J
A=(a,b,c,d) B=(a,c,f,g) J = |A∩B| / |A∪B| = 2/6 = 1/3
Mainly used in text mining
• Others
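
The three measures listed above can each be written in a few lines. This is a sketch with made-up function names; the cosine function assumes neither vector is all zeros.

```python
import math

def manhattan(a, b):
    """City-block distance: sum of absolute differences."""
    return sum(abs(x - y) for x, y in zip(a, b))

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors, measured from the origin."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def jaccard_similarity(a, b):
    """Size of the intersection divided by the size of the union."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

print(manhattan((35, 35, 3), (37, 50, 2)))    # Jay vs Dravid from the earlier example
print(cosine_similarity((1, 0), (0, 1)))      # perpendicular vectors
print(jaccard_similarity("abcd", "acfg"))     # the A and B sets from the slide
```

A Jaccard distance is obtained as one minus the similarity, so the slide's example gives 1 − 1/3 = 2/3.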
Distance-Weighted Nearest Neighbor Algorithm
• Assign weights to the neighbors based on their ‘distance’ from the query point
• Weight ‘may’ be inverse square of the distances (the farther away, the less weight the point has)
• All training points may influence a particular instance
• Shepard’s method
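
A distance-weighted vote can be sketched as follows, using the inverse square of the distance as the weight. The function name, the one-dimensional data, and the small `eps` term (added so an exact match does not divide by zero) are assumptions made for this example.

```python
import math
from collections import defaultdict

def weighted_knn_classify(training, query, k, eps=1e-9):
    """Classify query from (features, label) pairs, weighting votes by 1 / distance^2."""
    scored = sorted(
        (
            (math.sqrt(sum((x - y) ** 2 for x, y in zip(features, query))), label)
            for features, label in training
        ),
        key=lambda pair: pair[0],
    )[:k]
    votes = defaultdict(float)
    for distance, label in scored:
        # Closer neighbours get a larger vote; eps guards against distance == 0.
        votes[label] += 1.0 / (distance ** 2 + eps)
    return max(votes, key=votes.get)

# Made-up data: one close "red" point and two farther "blue" points.
training = [((0.0,), "red"), ((3.0,), "blue"), ((4.0,), "blue")]
print(weighted_knn_classify(training, (1.0,), k=3))
```

A plain majority vote with k = 3 would pick "blue" here (two votes to one), while the weighted vote picks "red" because the red point is much closer to the query.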
How to Choose ”K”?

• For k = 1, …,5 point x gets classified correctly

• red class
• For larger k classification of x is wrong
• blue class
How to Find the Optimal Value of “K”?
Cross-Validation
• Use p-fold cross-validation
• Divide the training data into p parts.
• Select (p-1) parts for training and the remaining 1 part for testing.
• There are p combinations of training and test set pairs, one for each held-out part
• For each of the p combinations, learn a K-NN model with each candidate K and compute the prediction
error on the test set.
• Compute the average test error for each K
• Select the K with the minimum average test error.

LOOCV = Leave One Out Cross Validation

Build the model on the whole data set except one instance, and then test the model on that one
instance (row/sample).
Thus, for a dataset of m instances, train on m-1 instances and test on the remaining one, and repeat this for
each instance.
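
Leave-one-out cross-validation for choosing K can be sketched as below. The function names, the toy dataset, and the candidate values of K are made up for this example, and ties between candidates are broken in favour of the first (smallest) K, which is an arbitrary choice.

```python
import math
from collections import Counter

def knn_predict(training, query, k):
    """Majority vote among the k training examples nearest to query."""
    def distance(item):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(item[0], query)))
    nearest = sorted(training, key=distance)[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def loocv_error(data, k):
    """Fraction of instances misclassified when each one is held out in turn."""
    errors = 0
    for i, (features, label) in enumerate(data):
        rest = data[:i] + data[i + 1:]  # train on the other m-1 instances
        if knn_predict(rest, features, k) != label:
            errors += 1
    return errors / len(data)

def choose_k(data, candidates):
    """Return the candidate K with the lowest LOOCV error, plus all the errors."""
    errors = {k: loocv_error(data, k) for k in candidates}
    best = min(errors, key=errors.get)  # ties go to the first candidate listed
    return best, errors

# Made-up data: two tight clusters of four points each.
data = [
    ((1, 1), "A"), ((1, 2), "A"), ((2, 1), "A"), ((2, 2), "A"),
    ((8, 8), "B"), ((8, 9), "B"), ((9, 8), "B"), ((9, 9), "B"),
]
best_k, errors = choose_k(data, [1, 3, 5, 7])
print(best_k, errors)
```

With K = 7 every held-out point is outvoted by the four points of the other cluster, which shows how an overly large K can skew the decision.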
K-NN: Computational Complexity

• Basic k-NN algorithm stores all examples. Suppose we have n examples, each of dimension d
• O(d) to compute distance to one example
• O(nd) to find one nearest neighbor
• O(knd) to find the k closest examples by repeating that scan k times
• Thus complexity is O(knd)
• This is prohibitively expensive for a large number of samples
• But we need a large number of samples for k-NN to work well!
