Data Science Presentation
• This describes a kind of pyramid, where data, the raw material, makes up the foundation at
the bottom of the pile, and information, knowledge, understanding and wisdom represent
higher and higher levels of the pyramid
• The major goal of a data scientist is to help people turn data into information and
onwards up the pyramid
• Data science is different from other areas such as mathematics or statistics: data science is
an applied activity, and data scientists serve the needs and solve the problems of data users
• Before you can solve a problem, you need to identify it, and this process is not always as
obvious as it might seem
Business intelligence
1. Data Acquisition - We need to find (or collect) the data and get some representation
of it into the computer.
2. Data Cleaning - Inevitably, there will be errors in the data, either because it was entered
incorrectly, we misunderstood its nature, or records were duplicated or omitted. Many times,
data is presented for viewing, and extracting it in some other form becomes a challenge.
3. Data Organization - Depending on what you want to do, you may need to reorganize your data. This is
especially true when you need to produce graphical representations of the data. Naturally, we need the
appropriate tools to do these tasks.
4. Data Modelling and Presentation - We may fit a statistical model to our data, or we may just produce a
graph that shows what we think is important. Often, a variety of models or graphs needs to be considered.
It's important to know what techniques are available and whether they are accepted within a particular
user community.
Overview of Data Science
Big Data
Data cleansing or data cleaning is the process of detecting and correcting (or removing)
corrupt or inaccurate records from a record set and refers to identifying incomplete,
incorrect, inaccurate or irrelevant parts of the data and then replacing, modifying, or
deleting the dirty or coarse data.
Data Cleaning Techniques
• Data cleaning techniques are used to correct, transform, and organize data to improve its quality and accuracy. Some of
the most common data-cleaning techniques are listed below; a short sketch of a few of them in code follows the list:
• Data Normalization: Normalization is the process of transforming data into a standard format, making it easier to process
and clean.
• Data Transformation: Data transformation is the process of converting data from one format to another, making it easier to
use and analyze.
• Data Integration: Data integration is the process of combining data from multiple sources into a single, consistent format.
• Data Reduction: Data reduction is the process of removing unnecessary data, such as duplicates or irrelevant information,
to simplify and improve data quality.
• Data Imputation: Data imputation is the process of filling in missing data with estimates or values derived from other data.
• Data Deduplication: Data deduplication is the process of removing duplicate data entries to ensure data accuracy and
consistency.
• Data Enrichment: Data enrichment is the process of adding additional information to data, such as geolocation data or
demographic information, to enhance its value.
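A minimal sketch of three of these techniques (imputation, deduplication, and normalization) using pandas; the DataFrame and its column names are invented for illustration:

import pandas as pd

df = pd.DataFrame({
    "age":    [35, 41, None, 59, 35],
    "income": [95_000, 215_000, 50_000, 170_000, 95_000],
    "city":   ["Pune", "Delhi", "Delhi", None, "Pune"],
})

# Data imputation: fill missing ages with the column median
df["age"] = df["age"].fillna(df["age"].median())

# Data deduplication: drop exact duplicate rows
df = df.drop_duplicates()

# Data normalization: rescale income to the [0, 1] range
lo, hi = df["income"].min(), df["income"].max()
df["income"] = (df["income"] - lo) / (hi - lo)

print(df)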
Data Cleansing
Technologies Used In Data Science
Web Scraping
ScrapingBee
ScrapeBox
ScreamingFrog
Scrapy
pyspider
Beautiful Soup
Diffbot
Common Crawl
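Of the tools above, Beautiful Soup is the easiest to show in a few lines. A minimal sketch with requests + Beautiful Soup; the URL is a placeholder, so substitute a page you are permitted to scrape:

import requests
from bs4 import BeautifulSoup

# Fetch a page and parse its HTML (placeholder URL)
html = requests.get("https://example.com").text
soup = BeautifulSoup(html, "html.parser")

# Extract the page title and every hyperlink
print(soup.title.string)
for link in soup.find_all("a"):
    print(link.get("href"))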
Technologies Used In Data Science
Data Visualization
Data visualization is a general term that describes any effort to help people understand the significance of data by placing
it in a visual context. Patterns, trends, and correlations that might go undetected in text-based data can be exposed and
recognized more easily with data visualization software.
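A minimal sketch with matplotlib; the age/income values below are invented:

import matplotlib.pyplot as plt

ages    = [22, 25, 35, 37, 41, 59, 63]
incomes = [30, 40, 95, 50, 215, 170, 200]   # in thousands

# A scatter plot can expose a correlation that a table of numbers hides
plt.scatter(ages, incomes)
plt.xlabel("Age")
plt.ylabel("Income (K)")
plt.title("Age vs. income")
plt.show()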
Technologies Used In Data Science
Machine Learning Software
TensorFlow
PyTorch
Scikit-Learn
Keras
XGBoost
Apache Spark MLlib
Microsoft Azure Machine Learning
RapidMiner
Technologies Used In Data Science
Machine Learning
Machine learning is the subfield of computer science that, according to Arthur Samuel, gives "computers
the ability to learn without being explicitly programmed." Samuel, an American pioneer in the field of
computer gaming and artificial intelligence, coined the term "machine learning" in 1959 while at IBM.
Applications of Data Science
Top Data Science Trends For 2024
• Augmented Analytics
• Responsible AI
• Edge Computing for Data Science
• Quantum Computing Integration
• Continuous Learning Models
• Natural Language Processing (NLP) Advancements
• Federated Learning
• Blockchain in Data Science
Introduction to Machine Learning
What is Machine Learning?
Automating automation
• The algorithms control the search to find and build the knowledge structures.
• The learning algorithms should extract useful information from training examples.
Algorithms
Semi-supervised learning
Machine learning structure
• Supervised learning
• Regression analysis is a statistical technique used to find the relationship between two or more
variables. In regression analysis, one variable is independent and its impact on the dependent
variables is measured. When there is only one dependent and one independent variable,
we call it simple regression. On the other hand, when there are many independent variables
influencing one dependent variable, we call it multiple regression (a short sketch of both follows)
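A minimal sketch of both cases with scikit-learn; the toy numbers are invented:

import numpy as np
from sklearn.linear_model import LinearRegression

income = np.array([30, 95, 215, 170, 200])   # dependent variable

# Simple regression: one independent variable (age)
age = np.array([[22], [35], [41], [59], [63]])
simple = LinearRegression().fit(age, income)

# Multiple regression: several independent variables (age, no. of credit cards)
X = np.array([[22, 2], [35, 3], [41, 2], [59, 1], [63, 1]])
multiple = LinearRegression().fit(X, income)

print(simple.coef_, multiple.coef_)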
Machine learning structure
• Unsupervised learning
Machine Learning Applications
Machine Learning: Problem Types
Game Playing (Reinforcement Learning)
ML in Big Data
The ability to learn on a large corpus of data is a real boon for ML.
Even simplistic ML models shine when they are trained on huge amounts
of data.
K-Nearest Neighbours
Different Learning Methods
Eager Learning
• Explicit description of target function on the whole training set
Instance-based Learning
• Learning = storing all training instances
• Classification = assigning target function to a new instance
• Referred to as “Lazy” learning
Different Learning Methods
Eager Learning
Instance Based Learning
Features
• All instances correspond to points in an n-dimensional Euclidean space
• Classification is delayed till a new instance arrives
• Classification done by comparing feature vectors of the different points
• Target function may be discrete or real-valued
K-Nearest Neighbour Classifier
Learning by Analogy
• Tell me who your friends are and I'll tell you who you are!
• A new example is assigned to the most common class among the (K) examples that are most similar to it
K-Nearest Neighbour Algorithm
To determine the class of a new example E:
• Calculate the distance between E and all examples in the training set
• Select K-nearest examples to E in the training set
• Assign E to the most common class among its K-nearest neighbors
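A from-scratch sketch of the three steps above, using Euclidean distance; the training data is invented:

import math
from collections import Counter

def knn_classify(E, training_set, k=3):
    # training_set is a list of (feature_vector, label) pairs
    # 1. Calculate the distance between E and every training example
    dists = [(math.dist(E, x), label) for x, label in training_set]
    # 2. Select the k nearest examples
    nearest = sorted(dists)[:k]
    # 3. Assign E the most common class among its k nearest neighbours
    return Counter(label for _, label in nearest).most_common(1)[0][0]

train = [((35, 3), "No"), ((22, 2), "Yes"), ((63, 1), "No"), ((25, 4), "Yes")]
print(knn_classify((37, 2), train, k=3))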
Distance Between Neighbors
Each example is represented with a set of numerical attributes
Jay: Age = 35, Income = 95K, No. of credit cards = 3
Rina: Age = 41, Income = 215K, No. of credit cards = 2
• The Euclidean distance between X = (x1, x2, x3, …, xn) and Y = (y1, y2, y3, …, yn) is defined as:

$D(X, Y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$
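Applied to Jay and Rina above (with income in thousands), a quick check in Python:

import math

jay  = (35,  95, 3)   # age, income (K), no. of credit cards
rina = (41, 215, 2)

# sqrt(6**2 + 120**2 + 1**2) = sqrt(14437) ≈ 120.15
print(math.dist(jay, rina))

Note how the income term dominates the sum, which is what motivates the normalization discussed later.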
Customer | Age | Income | No. of credit cards | Response
Jay      | 35  | 35K    | 3                   | No
Hema     | 63  | 200K   | 1                   | No
Tommy    | 59  | 170K   | 1                   | No
Dravid   | 37  | 50K    | 2                   | ?
K-Nearest Neighbours: Example
Customer | Age | Income | No. of credit cards | Response | Distance from Dravid
Dravid   | 37  | 50K    | 2                   | ?        | 0
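The remaining distances can be recomputed from the table above (income in thousands, no normalization):

import math

customers = {"Jay": (35, 35, 3), "Hema": (63, 200, 1), "Tommy": (59, 170, 1)}
dravid = (37, 50, 2)

# Distance from Dravid: Jay 15.17, Hema 152.24, Tommy 122.0
for name, features in customers.items():
    print(name, round(math.dist(features, dravid), 2))

With k = 3, all three of Dravid's nearest neighbours responded "No", so he is classified as "No".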
K-Nearest Neighbours: Strengths and Weaknesses
Strengths
• Simple to implement and use
Weaknesses
• Need a lot of space to store all examples
[Figure: a decision tree over the same data, splitting on Age > 40, Income > 100K, and No. of credit cards > 2 to separate Response from No response]
K-Nearest Neighbours: Strengths and Weaknesses
Jay: Age = 35, Income = 95K, No. of credit cards = 3
Rina: Age = 41, Income = 215K, No. of credit cards = 2
• Distance between neighbors could be dominated by some attributes with relatively large numbers
(e.g., income in our example)
Example: Income
Highest income = 500K
Jay's income is normalized to 95/500, Rina's income is normalized to 215/500, etc.
K-Nearest Neighbours: Strengths and Weaknesses
Normalization of Variables
• Dataset may need to be preprocessed to ensure more reliable data mining results
• Conversion of non-numeric data to numeric data (example: Married, Yes/No → 1/0)
• Calibration of numeric data to reduce effects of disparate ranges
• Particularly when using the Euclidean distance metric (a short sketch of both steps follows)
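A minimal sketch of both preprocessing steps in plain Python; the values are invented:

ages    = [22, 35, 41, 59, 63]
married = ["Yes", "No", "Yes", "No", "No"]

# Conversion of non-numeric data to numeric data (Married: Yes/No -> 1/0)
married_num = [1 if m == "Yes" else 0 for m in married]

# Calibration (min-max normalization) to reduce effects of disparate ranges
lo, hi = min(ages), max(ages)
ages_norm = [(a - lo) / (hi - lo) for a in ages]

print(married_num, ages_norm)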
k-NN Variations
• Value of k
• Larger k increases confidence in prediction
• Note that if k is too large, decision may be skewed
• Weighted evaluation of nearest neighbors
• Plain majority may unfairly skew decision
• Revise algorithm so that closer neighbors have greater “vote weight”
• Other distance measures
Other Distance Measures
• City-block distance (Manhattan dist)
• Add absolute value of differences
• Cosine similarity
• Measure angle formed by the two samples (with the origin)
• Jaccard distance
• Based on the fraction of exact matches between the samples (not including unavailable data):
the Jaccard similarity of sets A and B is J = |A ∩ B| / |A ∪ B|
A = {a, b, c, d}, B = {a, c, f, g}: J = 2/6 = 1/3, so the Jaccard distance is 1 − J = 2/3
Mainly used in text mining
• Others
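Quick sketches of the three measures above, on invented vectors and sets (standard library only):

import math

x, y = (35, 95, 3), (41, 215, 2)

# City-block (Manhattan) distance: sum of absolute differences
manhattan = sum(abs(a - b) for a, b in zip(x, y))

# Cosine similarity: cosine of the angle the two samples form with the origin
cosine = sum(a * b for a, b in zip(x, y)) / (math.hypot(*x) * math.hypot(*y))

# Jaccard similarity on sets: |A intersect B| / |A union B| = 2/6
A, B = {"a", "b", "c", "d"}, {"a", "c", "f", "g"}
jaccard = len(A & B) / len(A | B)

print(manhattan, cosine, jaccard)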
Distance-Weighted Nearest Neighbor Algorithm
• Assign weights to the neighbors based on their ‘distance’ from the query point
• Weight ‘may’ be inverse square of the distances (the farther away, the less weight the point has)
• All training points may influence a particular instance
• Shepard’s method
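A sketch of the distance-weighted variant, assuming the inverse-square weights suggested above; the training data is invented and the small epsilon guards against division by zero when a query point coincides with a training point:

import math
from collections import defaultdict

def weighted_knn(E, training_set, k=3):
    dists = sorted((math.dist(E, x), label) for x, label in training_set)
    votes = defaultdict(float)
    for d, label in dists[:k]:
        votes[label] += 1.0 / (d ** 2 + 1e-9)   # closer neighbours get greater vote weight
    return max(votes, key=votes.get)

train = [((35, 3), "No"), ((22, 2), "Yes"), ((63, 1), "No"), ((25, 4), "Yes")]
print(weighted_knn((37, 2), train, k=3))

Setting k to the size of the training set gives Shepard's method, where all training points influence the decision.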
How to Choose "K"?
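One common heuristic is to try several odd values of k and keep the one with the best held-out accuracy. A sketch with scikit-learn on an invented toy dataset:

from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=200, random_state=0)

# 5-fold cross-validated accuracy for each candidate k
for k in (1, 3, 5, 7, 9):
    acc = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5).mean()
    print(k, round(acc, 3))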