Machine Learning in Data Science & Big Data Handling

The document discusses the applications of machine learning (ML) in data science, highlighting areas such as real-time navigation, image recognition, product recommendation, and speech recognition. It also covers Python tools for ML, types of ML including supervised, unsupervised, and semi-supervised learning, and the modeling process involving feature engineering, model selection, validation, and prediction. Additionally, it addresses challenges in handling large datasets and provides programming tips for effective data management.

Uploaded by

hacker.792123

Unit II

• Applications of machine learning in data science
• Role of ML in DS
• Python tools like sklearn
• Modelling process: feature engineering, model selection, validation and prediction
• Types of ML
• Semi-supervised learning
• Handling large data: problems and general techniques for handling large data
• Programming tips for dealing with large data
• Case studies on DS projects for predicting malicious URLs and for building
recommender systems
What are the Applications of Machine Learning in Data Science?

1) Real-Time Navigation
Real-time navigation is one of the most popular applications of machine learning in
data science.
It uses machine learning algorithms to analyse data from sources such as GPS, LiDAR,
and cameras to provide users with real-time navigation guidance.

Machine learning algorithms are used to process this data and extract useful
information, such as the location and speed of vehicles, the location of obstacles, and
the flow of traffic.

This information is then used to provide real-time guidance to users, such as
turn-by-turn directions, traffic alerts, and real-time traffic updates.

Autonomous cars: Machine learning algorithms are used to evaluate data from various
sensors, including cameras, LiDAR, and radar, to understand the environment around the
vehicle.
The capacity of these algorithms to predict the behaviour of other cars, pedestrians,
and bicycles on the road is critical to the safe operation of autonomous vehicles.

2) Image Recognition
Image recognition is another popular application of machine learning in data science.

It uses machine learning algorithms to analyse and understand images, such as
photographs, videos, and live streams.

Machine learning algorithms are used to process this data and extract useful
information, such as the objects and people within an image and the scene or context
in which they are located.

This information can be used for various tasks, such as image search, object detection,
and image captioning.

One of the most popular approaches to image recognition is convolutional neural
networks (CNNs). CNNs are a class of deep learning models designed to process image
data and extract features.

3) Product Recommendation
Product recommendation is another popular application of machine learning in data
science.

Product recommendation aims to improve the customer experience by recommending
products that customers are more likely to be interested in.

Machine learning algorithms process customer and product data and extract useful
information, such as customer preferences, purchase history, and product features.

This information is then used to make personalized recommendations to customers.

One of the most popular approaches to product recommendation is collaborative
filtering.
Collaborative filtering is a technique that uses the past behaviour of customers, such
as their purchase history, to make recommendations to other customers who have
similar behaviour.

For example, if two customers have similar purchase histories, then the products one
customer has bought in the past may be recommended to the other customer.
Content-based filtering is another widely used strategy. The characteristics of items,
such as their category, brand, and price, are used in content-based filtering to
provide suggestions.

For instance, if a customer purchases a product from a particular brand, the customer
may be recommended additional goods from that brand.

4) Speech Recognition
Speech recognition is another popular application of machine learning in data science.
It involves using machine learning algorithms to analyse speech to convert it into text
or other forms of data.

The goal of speech recognition is to enable machines to understand and interpret
human speech so that it can be used for various tasks, such as voice commands,
transcription, and language translation.

Machine learning algorithms process this data and extract useful information, such as
spoken words and phrases and the speaker's intent.

This information is then used to convert speech into text or other forms of data, such
as commands or questions.
Python Tools

• SciPy is a library that integrates fundamental packages often used in scientific
computing, such as NumPy, matplotlib, Pandas, and SymPy.
• NumPy gives you access to powerful array functions and linear algebra functions.
• Matplotlib is a popular 2D plotting package with some 3D functionality.
• Pandas is a high-performance, but easy-to-use, data-wrangling package. It
introduces dataframes to Python, a type of in-memory data table. It’s a concept that
should sound familiar to regular users of R.
• SymPy is a package used for symbolic mathematics and computer algebra.
• StatsModels is a package for statistical methods and algorithms.
• Scikit-learn is a library filled with machine learning algorithms.
• RPy2 allows you to call R functions from within Python. R is a popular open
source statistics program.
• NLTK (Natural Language Toolkit) is a Python toolkit with a focus on text analytics.
• Numba and NumbaPro: These use just-in-time compilation to speed up
applications written directly in Python with a few annotations. NumbaPro also
allows you to use the power of your graphics processing unit (GPU).
• PyCUDA: This allows you to write code that will be executed on the GPU instead
of your CPU and is therefore ideal for calculation-heavy applications.
• Cython, or C for Python: This brings the C programming language to Python. C is
a lower-level language, so the code is closer to what the computer eventually
executes (machine code). The closer code is to bits and bytes, the faster it executes.
A computer is also faster when it knows the type of a variable in advance (called
static typing). Python wasn’t designed to do this, and Cython helps you to overcome
this shortfall.
• Blaze: Blaze gives you data structures that can be bigger than your computer’s
main memory, enabling you to work with large data sets.
• Dispy and IPCluster: These packages allow you to write code that can be
distributed over a cluster of computers.
• PP: Python is executed as a single process by default. With the help of PP you
can parallelize computations on a single machine or over clusters.
• Pydoop and Hadoopy: These connect Python to Hadoop, a common big data
framework.
• PySpark: This connects Python and Spark, an in-memory big data framework.
Types of ML:
Machine learning (ML) is a branch of artificial intelligence (AI) that focuses on using
data and algorithms to enable AI to imitate the way that humans learn.
(Or)
“Machine learning is a field of study that gives computers the ability to learn without
being explicitly programmed.”
Supervised learning:
• It takes labelled inputs and maps them to known outputs, which means the target
variable is already known.
• Supervised learning methods need external supervision to train models.
• They are used for classification and regression.
Algorithms used in supervised learning:
Regression:
In the example figure, experience (in years) is on the X-axis. For every experience
value, there is one salary (in rupees per month) on the Y-axis. The green dots are the
(X, Y) coordinates, that is, the input and output data.
A regression problem tries to find a continuous mapping function from the input to the
output variables.
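As a minimal sketch of such a mapping function, the snippet below fits a straight line by ordinary least squares in plain Python. The experience and salary values are hypothetical stand-ins for the points in the figure, and `fit_line` is an illustrative helper, not part of any library.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data: years of experience -> monthly salary in rupees.
experience = [1, 2, 3, 4, 5]
salary = [30000, 35000, 40000, 45000, 50000]

slope, intercept = fit_line(experience, salary)
predicted = slope * 6 + intercept  # predicted salary for 6 years of experience
print(slope, intercept, predicted)
```

In practice a library model such as scikit-learn's linear regression would usually replace the hand-written fit.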
Applications:
Unsupervised learning:
Here, we know the input data, but both the output and the mapping function are unknown.
In such scenarios, machine learning algorithms find a function that measures similarity
among different input data instances and groups them based on a similarity index;
these groups are the output of unsupervised learning.

• It finds patterns and trends in the data to discover the output.

• It doesn’t need any supervision to train the model.

• It is used for clustering and association.

Algorithms used in unsupervised learning:
Applications:

Recommendation Systems

Market Segmentation

Image and Document Clustering

Semi-supervised learning:
A semi-supervised algorithm classifies on its own to some extent and needs only a small
quantity of labelled data.
These algorithms operate on data that is mostly unlabelled, with only a few labels.
Algorithms:
Self-training
Co-training
Graph-based labelling
Pseudo-labelling is the process of using a model trained on the labelled data to
predict labels for unlabelled data.

For example, suppose there is a large dataset of which only a small part is labelled.
We can train a model using that small amount of labelled data and then predict on the
unlabelled dataset.
Prediction on the unlabelled dataset attaches a label to every data sample, with
limited accuracy; the result is termed a pseudo-labelled dataset.
A new model can then be trained on the mixture of the true-labelled dataset and the
pseudo-labelled dataset.
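The three pseudo-labelling steps can be sketched as follows. This is only an illustration under stated assumptions: the model is a tiny nearest-centroid classifier on one-dimensional data, and all data values and helper names (`centroids`, `predict`) are hypothetical.

```python
def centroids(points, labels):
    """Mean of the points in each class (a nearest-centroid model)."""
    sums, counts = {}, {}
    for p, lab in zip(points, labels):
        sums[lab] = sums.get(lab, 0.0) + p
        counts[lab] = counts.get(lab, 0) + 1
    return {lab: sums[lab] / counts[lab] for lab in sums}

def predict(model, p):
    """Label of the closest class centroid."""
    return min(model, key=lambda lab: abs(p - model[lab]))

# Hypothetical 1-D data: a few labelled points and more unlabelled ones.
labelled_x = [1.0, 2.0, 9.0, 10.0]
labelled_y = ["low", "low", "high", "high"]
unlabelled_x = [0.5, 1.5, 3.0, 8.0, 9.5, 11.0]

# 1. Train on the small labelled set.
first_model = centroids(labelled_x, labelled_y)
# 2. Predict pseudo-labels for the unlabelled set.
pseudo_y = [predict(first_model, x) for x in unlabelled_x]
# 3. Retrain on the true labels plus the pseudo-labels.
final_model = centroids(labelled_x + unlabelled_x, labelled_y + pseudo_y)
print(pseudo_y, final_model)
```

A common refinement, not shown here, is to keep only the pseudo-labels the first model is most confident about.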
Applications:
Web mining: classify web pages
Text mining: identify names in the text
Video mining: classify people in the video
Reinforcement learning (RL):

• The agent interacts with the environment and identifies the possible actions it
can perform.

• The primary goal of an agent in RL is to perform actions by observing the
environment and to collect the maximum positive reward.

• In RL the agent learns automatically from feedback without any labelled data,
unlike supervised learning.

• RL is used to solve specific types of problems where decision making is
sequential, such as game playing and robotics.
Algorithms used in RL:

What are the situations where RL can be used?

Consider the following grid game, where a robot can move.

Assume the starting node is E and the goal node is G; the game is about finding the
shortest path from the start state to the goal state.
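A grid game of this kind can be sketched with tabular Q-learning. The grid from the original figure is not available, so the 3x3 layout, the start and goal cells (stand-ins for E and G), the reward of -1 per move, and all hyperparameters below are assumptions made for illustration.

```python
import random

ROWS, COLS = 3, 3
START, GOAL = (0, 0), (2, 2)                   # hypothetical stand-ins for E and G
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]   # up, down, left, right

def step(state, action):
    """Move if the target cell is on the grid; every move costs -1."""
    r, c = state[0] + action[0], state[1] + action[1]
    if 0 <= r < ROWS and 0 <= c < COLS:
        state = (r, c)
    return state, -1.0

def train(episodes=500, gamma=0.9, epsilon=0.2, seed=0):
    rng = random.Random(seed)
    q = {((r, c), a): 0.0
         for r in range(ROWS) for c in range(COLS) for a in range(4)}
    for _ in range(episodes):
        state = START
        for _ in range(100):                   # cap on episode length
            if state == GOAL:
                break
            if rng.random() < epsilon:         # explore
                a = rng.randrange(4)
            else:                              # exploit current estimates
                a = max(range(4), key=lambda i: q[(state, i)])
            nxt, reward = step(state, ACTIONS[a])
            best_next = 0.0 if nxt == GOAL else max(q[(nxt, i)] for i in range(4))
            # The environment is deterministic, so a learning rate of 1 is used.
            q[(state, a)] = reward + gamma * best_next
            state = nxt
    return q

def greedy_path(q):
    """Follow the highest-valued action from START until GOAL is reached."""
    path, state = [START], START
    while state != GOAL and len(path) < 20:
        a = max(range(4), key=lambda i: q[(state, i)])
        state, _ = step(state, ACTIONS[a])
        path.append(state)
    return path

path = greedy_path(train())
print(path)
```

Because every move is penalised, the agent collects the most reward by reaching the goal in as few moves as possible, which is why the learned greedy policy traces a shortest path.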
Applications:
Modelling process: feature engineering, model selection, validation and prediction

Common Feature Types:

• Numerical Features: Values with numeric types (int, float, etc.). Examples: age,
salary, height.

• Categorical Features: Features that can take one of a limited number of values.
Examples: gender (male, female, X), color (red, blue, green).

• Ordinal Features: Categorical features that have a clear ordering. Examples:
T-shirt size (S, M, L, XL).

• Binary Features: A special case of categorical features with only two categories.
Examples: is_smoker (yes, no), has_subscription (true, false).

• Text Features: Features that contain textual data. Textual data typically requires
special preprocessing steps (like tokenization) to transform it into a format
suitable for machine learning models.

What Is Feature Engineering?

Feature engineering is the process of transforming raw data into features that are
suitable for machine learning models.
In other words, it is the process of selecting, extracting, and transforming the most
relevant features from the available data to build more accurate and efficient machine
learning models.

Feature engineering is required when working with machine learning models.

Regardless of the data or architecture, a poor feature will directly degrade your
model.

Feature Engineering Processes

Feature engineering consists of various processes:

• Feature creation
• Feature Transformation
• Feature extraction
• Feature selection
1. Feature creation: Feature creation is the process of generating new features
based on domain knowledge or by observing patterns in the data.
It is a form of feature engineering that can significantly improve the performance
of a machine learning model.

Types of Feature Creation:

1. Domain-Specific: Creating new features based on domain knowledge, such
as creating features based on business rules or industry standards.
2. Data-Driven: Creating new features by observing patterns in the data.
3. Synthetic: Generating new features by combining existing features.

2. Feature Transformation: It is the process of transforming the features into a
more suitable representation for the machine learning model.
This is done to ensure that the model can effectively learn from the data.

Types of Feature Transformation:
• Normalization: Rescaling the features to a similar range, such as
between 0 and 1, to prevent some features from dominating others.
• Scaling: Transforming numerical variables to a similar scale, so that
they can be compared more easily.
• Encoding: Transforming categorical features into a numerical
representation. Examples are one-hot encoding and label encoding.
• Transformation: Applying mathematical operations to the features to
change their scale or distribution. Examples are logarithmic, square
root, and reciprocal transformations.
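The transformation types listed above can be sketched in plain Python. The column values and the helper names (`min_max_normalize`, `one_hot_encode`, `log_transform`) are hypothetical; libraries such as scikit-learn provide equivalent, more robust transformers.

```python
import math

def min_max_normalize(values):
    """Rescale numeric values to the range [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def one_hot_encode(values):
    """One column per category (sorted alphabetically), 1 where it matches."""
    categories = sorted(set(values))
    return [[1 if v == c else 0 for c in categories] for v in values]

def log_transform(values):
    """log(1 + v), which compresses large values and is defined at 0."""
    return [math.log1p(v) for v in values]

ages = [20, 30, 40, 60]
colors = ["red", "blue", "green", "blue"]
incomes = [0, 9, 99, 999]

scaled_ages = min_max_normalize(ages)
encoded_colors = one_hot_encode(colors)   # columns: blue, green, red
logged_incomes = log_transform(incomes)
print(scaled_ages, encoded_colors, logged_incomes)
```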

3. Feature extraction:
• Feature extraction is the process of extracting features from a data set to identify
useful information.

• Feature extraction aims to reduce the number of features in a dataset while
retaining most of the relevant information.

• Feature extraction is used to improve accuracy, reduce the risk of overfitting,
speed up training, and improve data visualization.
• Types of Feature Extraction:

Dimensionality Reduction: Reducing the number of features by transforming
the data into a lower-dimensional space while retaining important information.

• Principal Component Analysis (PCA)
• Independent Component Analysis (ICA)
• Linear Discriminant Analysis (LDA)
• Locally Linear Embedding (LLE)
• t-distributed Stochastic Neighbor Embedding (t-SNE)

4. Feature selection: Feature selection is the process of isolating the most
consistent, non-redundant, and relevant features to use in model construction.
Reducing the size of datasets becomes more important as the size and variety of
datasets continue to grow.

The main goal of feature selection is to improve the performance of a predictive
model and reduce the computational cost of modeling.
Model Selection

In machine learning, model selection is the process of choosing the best model or
algorithm from a list of candidate models to address a given problem.
• Problem formulation: Clearly express the problem at hand, including the kind of
prediction or task that you'd like the model to carry out (for example,
classification, regression, or clustering).

• Candidate model selection: Pick a group of models that are appropriate for the
problem. These can include straightforward methods like decision trees or linear
regression as well as more sophisticated ones like deep neural networks, random
forests, or support vector machines.

• Performance evaluation: Establish metrics for measuring how well each model
performs. Common metrics include accuracy, precision, recall, and F1-score for
classification, and mean squared error for regression.
• Training and evaluation: Each candidate model should be trained using a subset
of the available data (the training set), and its performance should be assessed
using a different subset (the validation set).
• Model comparison: Compare the performance of the models and determine
which one performs best on the validation set.
• Hyperparameter tuning: Many models have hyperparameters, such as the learning
rate, the regularization strength, or the number of hidden layers in a neural
network, that must be configured before training; search over these settings to
find the best-performing combination.
• Final model selection: After the models have been analyzed and fine-tuned, pick
the model that performs the best. This model can then be used to make
predictions on fresh, unseen data.
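The training, evaluation, and comparison steps above can be sketched as a small loop over candidate models. The two candidates (a mean baseline and a straight-line fit), the data, and the helper names are hypothetical; the point is only the pattern of fitting on the training set and scoring on the validation set.

```python
def mse(y_true, y_pred):
    """Mean squared error."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def fit_mean(xs, ys):
    """Baseline: always predict the mean of the training targets."""
    mean_y = sum(ys) / len(ys)
    return lambda x: mean_y

def fit_line(xs, ys):
    """Least-squares straight line."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return lambda x: my + slope * (x - mx)

# Hypothetical training and validation sets.
train_x, train_y = [1, 2, 3, 4], [2.1, 3.9, 6.2, 7.8]
valid_x, valid_y = [5, 6], [10.1, 11.9]

candidates = {"mean baseline": fit_mean, "linear regression": fit_line}
scores = {}
for name, fit in candidates.items():
    model = fit(train_x, train_y)                      # train on the training set
    scores[name] = mse(valid_y, [model(x) for x in valid_x])  # score on validation

best = min(scores, key=scores.get)                     # lowest validation error wins
print(scores, best)
```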

Validation and Prediction

There are several types of validation techniques, including holdout validation,
leave-one-out cross-validation, stratified cross-validation, and k-fold
cross-validation.

1. Holdout Validation
The dataset is split once into a training data set and a testing data set.
Usually, the ratio of training data set to testing data set is 70:30 or 80:20.
The model is trained on the training data set and, once it is trained, it is tested
on the testing data set.
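A minimal sketch of an 80:20 holdout split in plain Python follows. The helper name `holdout_split`, the seed, and the ten-row dataset are assumptions for illustration; scikit-learn offers a ready-made splitter for real projects.

```python
import random

def holdout_split(rows, test_ratio=0.2, seed=42):
    """Shuffle a copy of the rows and cut off the first part as the test set."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_ratio))
    return shuffled[n_test:], shuffled[:n_test]

rows = list(range(10))                      # hypothetical dataset of 10 records
train_rows, test_rows = holdout_split(rows, test_ratio=0.2)
print(train_rows, test_rows)
```

Shuffling before the cut matters when the rows are stored in a meaningful order, for example sorted by date or by class.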

2. LOOCV (Leave One Out Cross Validation)

In this method, we train on the whole dataset except for a single data point, test on
that point, and then iterate over every data point.
In other words, the model is trained on (n-1) samples and tested on the one omitted
sample, repeating this process for each data point in the dataset.

3. Stratified Cross-Validation

Stratified cross-validation keeps the class proportions of the full dataset in every
fold. This is particularly important when dealing with imbalanced datasets, where
certain classes may be underrepresented. In this method,
1. The dataset is divided into k folds while maintaining the proportion of
classes in each fold.
2. During each iteration, one fold is used for testing, and the remaining folds
are used for training.
3. The process is repeated k times, with each fold serving as the test set
exactly once.
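The three steps can be sketched as follows. This simplified version deals the indices of each class round-robin into the folds and does no shuffling; the labels and the helper name `stratified_folds` are hypothetical.

```python
from collections import defaultdict

def stratified_folds(labels, k):
    """Return k lists of indices with similar class proportions."""
    by_class = defaultdict(list)
    for idx, lab in enumerate(labels):
        by_class[lab].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)      # deal each class round-robin
    return folds

# Hypothetical imbalanced labels: 3 spam, 9 ham.
labels = ["spam"] * 3 + ["ham"] * 9
folds = stratified_folds(labels, k=3)

# Each fold serves as the test set exactly once.
splits = [([i for f in folds if f is not test for i in f], test) for test in folds]
spam_per_fold = [sum(labels[i] == "spam" for i in fold) for fold in folds]
print(folds, spam_per_fold)
```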

4. K-Fold Cross Validation

Another type of cross-validation is K-fold cross-validation. The parameter for
this type is 'K', which refers to the number of subsets or folds obtained from the
data sample.

The first step is to divide the data sample into K subsets. Then, in each of K
rounds, one subset is held out as the testing data set and the model is trained
on the remaining K-1 subsets, so every subset is used for testing exactly once.
The K test scores are averaged to estimate how well the model generalizes.
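A minimal sketch of how the folds can be generated is shown below. The helper `k_fold_indices` is hypothetical and works on contiguous, unshuffled index blocks; setting k equal to the number of samples gives leave-one-out cross-validation.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for each of the k folds."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n_samples) if i < start or i >= start + size]
        yield train, test
        start += size

splits = list(k_fold_indices(n_samples=6, k=3))
loocv = list(k_fold_indices(n_samples=4, k=4))   # LOOCV: k equals n_samples
print(splits)
```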

If you’ve implemented the first three steps successfully, you now have a
performant model that generalizes to unseen data.
The process of applying your model to new data is called model scoring. In fact,
model scoring is something you implicitly did during validation, only now you don’t
know the correct outcome.

By now you should trust your model enough to use it for real.

To do so, you prepare a new data set with the same features the model was trained
on and apply the model to it; the result is a prediction.
Handling large data
Handling large data on a single computer
Large data sets can be difficult to analyze on a single computer.
To make it easier, there are a few things you can do:
1. Use parallel computing

2. Use cloud computing

3. Use distributed computing

4. Use data compression

5. Use data visualization

Use parallel computing: A parallel computing system consists of multiple processors
that communicate with each other using shared memory. Parallel computing is a
technique that allows you to split a large data set into smaller chunks and process
them simultaneously on multiple cores. This can greatly reduce the amount of time it
takes to analyze the data.
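The split-and-combine pattern can be sketched with Python's standard `concurrent.futures` module. The data and the helper names are hypothetical. A thread pool is used here only to keep the sketch simple; for CPU-bound work on multiple cores, `ProcessPoolExecutor` offers the same `map` interface.

```python
from concurrent.futures import ThreadPoolExecutor

def chunked(seq, size):
    """Split a list into consecutive chunks of at most `size` items."""
    return [seq[i:i + size] for i in range(0, len(seq), size)]

def sum_of_squares(chunk):
    """The work done on each chunk independently."""
    return sum(v * v for v in chunk)

data = list(range(1_000))          # hypothetical data set
chunks = chunked(data, 250)

with ThreadPoolExecutor(max_workers=4) as pool:
    partial_sums = list(pool.map(sum_of_squares, chunks))

total = sum(partial_sums)          # combine the per-chunk results
print(total)
```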
Use cloud computing: Cloud computing allows you to store large data sets in the cloud
and analyze them using virtual machines. This eliminates the need to have powerful
hardware in-house, and can significantly reduce the cost of data analysis.
Instead of storing files on a local storage device or hard drive, a user can save them
in the cloud, making it possible to access the files from anywhere, as long as they
have access to the web.
Use distributed computing: A distributed computing system contains multiple processors
connected by a communication network.
Distributed computing is a technique that allows you to spread large data sets across
multiple computers and analyze them in parallel. This can significantly reduce the
amount of time needed to analyze the data.
Use data compression: Data compression can reduce the size of large data sets, making them
easier to store and analyze on a single computer.
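As a small illustration, the standard-library `gzip` module can compress repetitive tabular data considerably. The rows below are hypothetical, and the achievable ratio depends heavily on how repetitive the real data is.

```python
import gzip

# Hypothetical, highly repetitive CSV rows.
rows = ["2024-01-01,sensor_a,0.0"] * 1000
raw = "\n".join(rows).encode("utf-8")

compressed = gzip.compress(raw)
restored = gzip.decompress(compressed)   # lossless: the original bytes come back

ratio = len(compressed) / len(raw)
print(len(raw), len(compressed), ratio)
```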
Use data visualization: Data visualization can help you get a better understanding of your data,
and can make it easier to analyze large data sets on a single computer.
The problems to face when handling large data
1. Data Storage

2. Data Cleaning

3. Data Analysis

4. Security

5. Computing Power

Large data is also commonly characterized by the following properties:
•Volume: How much data is collected

•Velocity: How fast data can be generated, gathered, and analyzed

•Variety: How many different types and sources of data are involved

•Veracity: How reliable the data is

Data Storage: Storing large data sets can be challenging due to the amount of
space and resources required. Data must be structured and organized to be
useful and efficient.
Data Cleaning: Large data sets often contain missing values, outliers, and
incorrect data types, making it difficult to get an accurate picture of the data.
Data cleaning is essential to ensure the accuracy of any analysis.
Data Analysis: Analyzing large data sets can be complex and time consuming.
Specialized techniques may be required to process, visualize, and interpret the data.
Security: Large data sets can contain sensitive information, making it important to maintain security
and privacy. Appropriate measures must be taken to protect the data from unauthorized access.
Computing Power: Large data sets require large amounts of
computing power to process and analyze. This can be expensive
and difficult to access.
General techniques for handling large volumes of data:

1. Use Distributed Computing: Distributed computing involves breaking down large tasks into smaller parts
and distributing them to different machines to be processed in parallel. This can greatly improve the speed
and efficiency of data processing, and is particularly helpful when dealing with large volumes of data.

2. Use a Database: Using a database to store and manage large volumes of data is a great way to ensure data
integrity and scalability. Most databases have built-in features to help with querying, sorting, and filtering
data, which can help make data analysis easier and more efficient.

3. Use Streaming Data: Streaming data is a type of data that is delivered in near real-time. This can be very
helpful in dealing with large volumes of data, since it allows for processing to occur as soon as the data is
received, rather than waiting for the entire dataset to be collected before beginning analysis.

4. Compress Data: Compression is a great way to reduce the size of large datasets, which can help reduce the
amount of time needed for processing. Compression algorithms can also help reduce the amount of storage
space needed to store large amounts of data.
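The streaming idea in item 3 can be sketched with a running mean and variance that are updated one value at a time (Welford's method), so the full dataset never has to be held in memory. The `RunningStats` class and the simulated `sensor_stream` are hypothetical.

```python
class RunningStats:
    """Incrementally updated mean and population variance (Welford's method)."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self._m2 = 0.0   # sum of squared deviations from the current mean

    def update(self, value):
        self.count += 1
        delta = value - self.mean
        self.mean += delta / self.count
        self._m2 += delta * (value - self.mean)

    @property
    def variance(self):
        return self._m2 / self.count if self.count else 0.0

def sensor_stream():
    """Stand-in for a real data stream; yields one reading at a time."""
    for reading in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:
        yield reading

stats = RunningStats()
for reading in sensor_stream():
    stats.update(reading)          # process each value as soon as it arrives

print(stats.count, stats.mean, stats.variance)
```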
General programming tips for dealing with large datasets:
1. Keep your data organized and structured. Use a database or spreadsheet
program to store, track and maintain your data.
2. Break up large datasets into smaller, more manageable chunks. This will help
you more easily find and access specific data points.
3. Utilize tools such as parallel computing, machine learning and data mining to
help analyze and process large datasets.
4. Make use of specialized software that is designed to handle large datasets.
5. Take advantage of cloud computing to store and manage large datasets.
6. Use data visualization tools to help you make sense of large datasets.
7. Utilize tools such as Apache Spark and Hadoop to help with processing large
datasets.
8. Regularly back up your data to protect against data loss.
9. Consider using data compression to reduce the size of datasets and make
them easier to store and manage.
10. Employ security measures to protect your data from unauthorized access.
1. Predicting Malicious URLs
Case Study: "Malicious URL Detection Using Machine Learning"
Objective: The goal of this project was to develop a machine
learning-based system capable of detecting malicious URLs to
enhance cybersecurity measures.
Approach:
• Data Collection: The dataset was composed of a mix of
malicious and benign URLs. Sources included public datasets
like PhishTank and Alexa for malicious and benign URLs,
respectively.
• Feature Engineering:
• Lexical Features: These included the length of the URL,
the number of dots in the URL, the presence of special
characters, and domain tokenization.
• Host-based Features: These involved attributes such as
WHOIS information (domain names, IP address blocks),
domain registration length, and whether the IP address
was blacklisted.
• Content-based Features: The HTML content of the
webpage was analysed to extract features like the
presence of suspicious scripts or links.
• Modelling:
• Algorithms like Random Forest, Support Vector Machines
(SVM), and Gradient Boosting Machines (GBM) were
used.
• The models were trained on a labelled dataset of
malicious and benign URLs.
• Cross-validation was used to fine-tune the models.
• Evaluation:
• The models were evaluated using metrics such as
accuracy, precision, recall, and F1-score.
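The lexical features described in the approach can be sketched with the standard library alone. The exact feature set, the list of special characters, and the example URL are assumptions for illustration; the case study does not specify them.

```python
from urllib.parse import urlparse

def lexical_features(url):
    """Simple lexical features computed from the URL string itself."""
    parsed = urlparse(url)
    host = parsed.hostname or ""
    return {
        "url_length": len(url),
        "num_dots": url.count("."),
        "num_special_chars": sum(url.count(ch) for ch in "@-_?=&%"),
        "host_tokens": host.split("."),                   # domain tokenization
        "host_is_ip": host.replace(".", "").isdigit(),    # raw IPv4 instead of a name
    }

features = lexical_features("http://login.example.com/account?id=42")
print(features)
```

Host-based and content-based features need network lookups or page downloads, so they are not covered by this sketch.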
Predicting malicious URLs
Data: The data in this case study was made available as part
of a research project. The project contains data from 120 days,
and each observation has approximately 3,200,000 features.
The target variable contains 1 if it’s a malicious website and
-1 otherwise.

Justin Ma, Lawrence K. Saul, Stefan Savage, and Geoffrey M. Voelker, “Beyond Blacklists: Learning to Detect
Malicious Web Sites from Suspicious URLs,” Proceedings of the ACM SIGKDD Conference, Paris (June 2009),
1245–53.

The Scikit-learn library: You should have this library installed
in your Python environment.

Step 1: Defining the research goal

The goal of our project is to detect whether certain URLs can
be trusted or not.
Step 2: Acquiring the URL data
Start by downloading the data from
http://sysnet.ucsd.edu/projects/url/#datasets and place it in
a folder. Choose the data in SVMLight format. SVMLight is a
text-based format with one observation per row. To save
space, it leaves out the zeros.
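To make the format concrete, the sketch below parses one SVMLight-style row of the form `<label> <index>:<value> ...` into a label and a dictionary of the non-zero features. The example row and the helper name are hypothetical, and optional elements such as `qid` fields are not handled; scikit-learn's `load_svmlight_file` is the usual way to read real files.

```python
def parse_svmlight_line(line):
    """Parse '<label> <index>:<value> ...' into (label, {index: value})."""
    line = line.split("#", 1)[0]        # drop an optional trailing comment
    parts = line.split()
    label = int(parts[0])
    features = {}
    for item in parts[1:]:
        index, value = item.split(":")
        features[int(index)] = float(value)   # only non-zero entries are listed
    return label, features

# Hypothetical row: a malicious URL (+1) with three non-zero features.
label, features = parse_svmlight_line("+1 4:0.5 17:1 3200000:0.25")
print(label, features)
```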

Step 3 of the data science process, data preparation and
cleansing, isn’t necessary in this case because the URLs come
pre-cleaned.

Step 4: Data exploration

We need to find out whether the data does indeed contain
lots of zeros. We can check this with the following piece of
code (X is assumed to be the SciPy sparse feature matrix loaded
from the downloaded files):

print("number of non-zero entries %2.6f" %
      (X.nnz / (X.shape[0] * X.shape[1])))

This outputs the following: number of non-zero entries 0.000033.
Data that contains little information compared to zeros is
called sparse data. This can be saved more compactly if you
store the data as [(0,0,1),(4,4,1)] instead of
[[1,0,0,0,0],[0,0,0,0,0],[0,0,0,0,0],[0,0,0,0,0],[0,0,0,0,1]]
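The compact form above is a list of (row, column, value) triplets for the non-zero entries. A minimal sketch of the conversion, using hypothetical helper names, is:

```python
def to_coordinate_list(dense):
    """Keep only the non-zero entries as (row, column, value) triplets."""
    return [(r, c, v)
            for r, row in enumerate(dense)
            for c, v in enumerate(row)
            if v != 0]

def density(dense):
    """Fraction of entries that are non-zero."""
    total = len(dense) * len(dense[0])
    return len(to_coordinate_list(dense)) / total

dense = [[1, 0, 0, 0, 0],
         [0, 0, 0, 0, 0],
         [0, 0, 0, 0, 0],
         [0, 0, 0, 0, 0],
         [0, 0, 0, 0, 1]]

triplets = to_coordinate_list(dense)
print(triplets, density(dense))
```

Here 2 of 25 entries are stored; with millions of mostly-zero features the saving is far larger.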
Step 5: Model building
Create a model to distinguish the malicious from the normal
URLs by using the stochastic gradient descent classifier
SGDClassifier().
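To show the idea behind a stochastic gradient descent classifier, the sketch below trains a linear model on the hinge loss, updating the weights after each sample. It is a simplified, from-scratch illustration on sparse dictionary features, not scikit-learn's SGDClassifier: there is no regularization or shuffling, and the toy samples and hyperparameters are assumptions.

```python
def score(weights, bias, features):
    """Linear score w.x + b for sparse features stored as {index: value}."""
    return bias + sum(weights.get(i, 0.0) * v for i, v in features.items())

def train_sgd(samples, epochs=20, learning_rate=0.1):
    """One weight update per sample whenever the hinge loss is active."""
    weights, bias = {}, 0.0
    for _ in range(epochs):
        for features, label in samples:
            if label * score(weights, bias, features) < 1:   # margin violated
                for i, v in features.items():
                    weights[i] = weights.get(i, 0.0) + learning_rate * label * v
                bias += learning_rate * label
    return weights, bias

def predict(weights, bias, features):
    return 1 if score(weights, bias, features) > 0 else -1

# Hypothetical sparse samples: label 1 = malicious, -1 = benign.
samples = [
    ({0: 1.0}, 1),
    ({0: 1.0, 2: 1.0}, 1),
    ({1: 1.0}, -1),
    ({1: 1.0, 2: 1.0}, -1),
]
weights, bias = train_sgd(samples)
predictions = [predict(weights, bias, f) for f, _ in samples]
print(weights, bias, predictions)
```

Because each update touches only the non-zero features of one sample, this style of training suits very wide, sparse data such as the URL dataset.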

With a precision of 0.97, only 3% (1 - 0.97) of the sites flagged as
malicious are falsely accused (precision is the quality of the positive
predictions). With a recall of 0.94, 6% (1 - 0.94) of the malicious
sites aren’t detected.
This is a decent result, so we can conclude that the
methodology works.
Recall is a metric that measures how often a machine learning model
correctly identifies positive instances (true positives) from all the actual
positive samples in the dataset.

Support refers to the number of actual occurrences of the class in the
dataset.
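These metrics can be computed directly from the true and predicted labels. The labels below are hypothetical and unrelated to the case-study results; the function name is illustrative.

```python
def precision_recall_support(y_true, y_pred, positive=1):
    """Precision, recall, F1-score and support for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0   # flagged sites that are malicious
    recall = tp / (tp + fn) if tp + fn else 0.0      # malicious sites that are flagged
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    support = tp + fn                                # actual positives in the data
    return precision, recall, f1, support

# Hypothetical labels: 1 = malicious, -1 = benign.
y_true = [1, 1, 1, 1, -1, -1, -1, -1]
y_pred = [1, 1, 1, -1, 1, -1, -1, -1]

precision, recall, f1, support = precision_recall_support(y_true, y_pred)
print(precision, recall, f1, support)
```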
Building Recommender Systems
Case Study: "Personalized Content Recommendations for
Streaming Services"
Objective: To create a personalized recommendation system for a
video streaming platform to enhance user engagement and
retention.
Approach:
• Data Collection:
• User interaction data including clicks, watch history,
ratings, and search queries.
• Metadata for each piece of content, including genre,
actors, director, and user ratings.
• Types of Recommendation Systems:
• Collaborative Filtering:
• User-based: Recommends items based on what
similar users have liked.
• Item-based: Recommends items similar to what a
user has liked before.
• Content-Based Filtering:
• Recommends items similar to those the user has
engaged with, based on content features.
• Hybrid Approach:
• Combines both collaborative and content-based
filtering to improve recommendations.
• Modelling:
• Matrix Factorization was used for collaborative filtering to
identify latent factors that explain user preferences.
• TF-IDF (term frequency-inverse document frequency) and
cosine similarity were used for content-based filtering to
match user profiles with content metadata.
• The hybrid model incorporated a weighted combination
of both approaches.
• Evaluation:
• The models were evaluated using metrics like Mean
Squared Error (MSE) for ratings prediction and Mean
Average Precision (MAP) for ranking performance.
• A/B testing was conducted to compare the
recommendation system's performance against a control
group.
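The content-based part of the approach can be sketched with a small TF-IDF and cosine-similarity implementation. The catalogue, the whitespace tokenization, and the plain log(N/df) weighting are assumptions for illustration; libraries use smoothed variants of the formula.

```python
import math
from collections import Counter

def tf_idf_vectors(documents):
    """One sparse {term: weight} vector per document."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors

def cosine_similarity(a, b):
    dot = sum(w * b.get(term, 0.0) for term, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Hypothetical catalogue: title -> metadata keywords.
catalogue = {
    "Space Battle": "sci-fi action space adventure",
    "Star Voyage": "sci-fi space drama",
    "Love in Paris": "romance drama comedy",
}
titles = list(catalogue)
vectors = tf_idf_vectors(list(catalogue.values()))

watched = titles.index("Space Battle")
similarity = {t: cosine_similarity(vectors[watched], vectors[i])
              for i, t in enumerate(titles) if i != watched}
ranked = sorted(similarity, key=similarity.get, reverse=True)
print(ranked, similarity)
```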

Case study 2: Building a recommender system inside a database

Tools and techniques needed
MySQL database
MySQL database connection Python library
The pandas Python library
TECHNIQUES
A simple recommender system looks for customers who’ve rented
movies similar to the ones you have rented and then suggests those that
the others have watched but you haven’t seen yet.
This technique is called k-nearest neighbors in machine learning.
• Your boss has stored the data in a MySQL database, and it’s up to
you to do the analysis.
• What is needed is a recommender system, an automated
system that learns people’s preferences and recommends movies
and other products the customers haven’t tried yet.
• The goal of our case study is to create a memory-friendly
recommender system.
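The nearest-neighbour idea can be sketched in memory, without the database. The rental data, the Jaccard similarity measure, and the helper names are assumptions chosen for illustration; the case study itself may measure similarity differently.

```python
def jaccard(a, b):
    """Overlap between two sets of rented movies, from 0 to 1."""
    return len(a & b) / len(a | b) if a | b else 0.0

def recommend(target, rentals, k=2):
    """Suggest movies rented by the k most similar customers but not by target."""
    others = [c for c in rentals if c != target]
    neighbours = sorted(others,
                        key=lambda c: jaccard(rentals[target], rentals[c]),
                        reverse=True)[:k]
    scores = {}
    for customer in neighbours:
        for movie in rentals[customer] - rentals[target]:
            scores[movie] = scores.get(movie, 0) + 1
    # Most frequently rented among the neighbours first, ties alphabetically.
    return sorted(scores, key=lambda m: (-scores[m], m))

# Hypothetical rental histories.
rentals = {
    "you":   {"Alien", "Blade Runner", "Dune"},
    "ana":   {"Alien", "Blade Runner", "Dune", "Arrival"},
    "ben":   {"Alien", "Dune", "Arrival", "Solaris"},
    "chloe": {"Notting Hill", "Amelie"},
}
suggestions = recommend("you", rentals, k=2)
print(suggestions)
```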

ML Aa
83 pages
Machine Learning
No ratings yet
Machine Learning
54 pages
POSC101 Study Guide - Midterm Exam 1
No ratings yet
POSC101 Study Guide - Midterm Exam 1
4 pages
(Ebooks PDF) Download Probability Theory and Statistical Inference Empirical Modeling With Observational Data 2nd Edition Spanos A Full Chapters
100% (2)
(Ebooks PDF) Download Probability Theory and Statistical Inference Empirical Modeling With Observational Data 2nd Edition Spanos A Full Chapters
55 pages
Strategic Management Essentials
No ratings yet
Strategic Management Essentials
7 pages
Swift Standards Category 7 Version 11 September 2006
No ratings yet
Swift Standards Category 7 Version 11 September 2006
245 pages
Kathrein 742213
No ratings yet
Kathrein 742213
2 pages
The Animated Film Encyclopedia A Complete Guide To American Shorts Features and Sequences 1900 1999 2nd Edition Graham Webb
100% (14)
The Animated Film Encyclopedia A Complete Guide To American Shorts Features and Sequences 1900 1999 2nd Edition Graham Webb
59 pages
Lac-Proposal-For-Tletvl Teachers
No ratings yet
Lac-Proposal-For-Tletvl Teachers
5 pages
Induction Program & University Orientation Program 2025
No ratings yet
Induction Program & University Orientation Program 2025
12 pages
Techies: Crack the Code!
No ratings yet
Techies: Crack the Code!
1 page
Lesson 14 - Business Etiquette & Personal Branding
No ratings yet
Lesson 14 - Business Etiquette & Personal Branding
14 pages
10 Realtime Python Automation Scripts
100% (2)
10 Realtime Python Automation Scripts
12 pages
Call and Put Option Valuation
No ratings yet
Call and Put Option Valuation
27 pages
Green Logistics - Ha Vi
No ratings yet
Green Logistics - Ha Vi
94 pages
Mine Survey Lab 2
No ratings yet
Mine Survey Lab 2
5 pages
FIA Course Textbooks Price List 2016
No ratings yet
FIA Course Textbooks Price List 2016
2 pages
KASANA - Product Catalogue 2024
No ratings yet
KASANA - Product Catalogue 2024
24 pages
Surge Arrester for Medium Voltage
No ratings yet
Surge Arrester for Medium Voltage
1 page
Bank Statement Overview
100% (1)
Bank Statement Overview
1 page
4
No ratings yet
4
2 pages
Company Profile and Financial Analysis of ITC LTD - 1
No ratings yet
Company Profile and Financial Analysis of ITC LTD - 1
44 pages
Design and Fabrication of Bending Machine
100% (1)
Design and Fabrication of Bending Machine
20 pages
SM II-HANDOUT 01-Organisational Learning
No ratings yet
SM II-HANDOUT 01-Organisational Learning
4 pages
En 353 - 2
No ratings yet
En 353 - 2
5 pages
DLC Practical
No ratings yet
DLC Practical
6 pages
Library Software Packages Available in India
No ratings yet
Library Software Packages Available in India
9 pages
Hosseini Resnik 2024 Guidance Needed For Using Artificial Intelligence To Screen Journal Submissions For Misconduct 1
No ratings yet
Hosseini Resnik 2024 Guidance Needed For Using Artificial Intelligence To Screen Journal Submissions For Misconduct 1
8 pages
106 Ignition
No ratings yet
106 Ignition
2 pages
Leading With Purpose UA
No ratings yet
Leading With Purpose UA
4 pages
Robotics, Monitoring and Control Systems Answers
No ratings yet
Robotics, Monitoring and Control Systems Answers
4 pages
High Court Case Listings Nov 2024
No ratings yet
High Court Case Listings Nov 2024
2 pages


