Unit II
• Applications of machine learning in Data science
• role of ML in DS
• Python tools like sklearn
• modelling process for feature engineering
• model selection, validation and prediction
• types of ML
• semi-supervised learning
• Handling large data: problems and general techniques for handling
large data
• programming tips for dealing with large data
• case studies on DS projects for predicting malicious URLs, for building
recommender systems
What are the Applications of Machine Learning in Data Science?
1) Real-Time Navigation
Real-time navigation is one of the most popular applications of machine learning in
data science.
It uses machine learning algorithms to analyse data from sources such as GPS, LiDAR,
and cameras to provide users with real-time navigation guidance.
Machine learning algorithms are used to process this data and extract useful
information, such as the location and speed of vehicles, the location of obstacles, and
the flow of traffic.
This information is then used to provide real-time guidance to users, such as
turn-by-turn directions, traffic alerts, and real-time traffic updates.
Autonomous cars: Machine learning algorithms are used to evaluate data from various
sensors, including cameras, LiDAR, and radar, to grasp the environment around the
vehicle.
The capacity of these algorithms to predict the behaviour of other cars, pedestrians,
and bicycles on the road is critical to the safe mobility of autonomous vehicles.
2) Image Recognition
Image recognition is another popular application of machine learning in data science.
It uses machine learning algorithms to analyse and understand images, such as
photographs, videos, and live streams.
Machine learning algorithms are used to process this data and extract useful
information, such as the objects and people within an image and the scene or context
in which they are located.
This information can be used for various tasks, such as image search, object detection,
and image captioning.
One of the most popular approaches to image recognition is convolutional neural
networks (CNNs). CNNs are a deep learning algorithm designed to process image data
and extract features.
3) Product Recommendation
Product recommendation is another popular application of machine learning in data
science.
Product recommendation aims to improve the customer experience by providing
recommendations for products they are more likely to be interested in.
Machine learning algorithms process customer data and extract useful information, such
as customer preferences, purchase history, and product features.
This information is then used to make personalized recommendations to customers.
One of the most popular approaches to product recommendation is the use of
collaborative filtering.
Collaborative filtering is a technique that uses the past behaviour of customers, such
as their purchase history, to make recommendations to other customers who have
similar behaviour.
For example, if two customers have similar purchase histories, then the products one
customer has bought in the past may be recommended to the other customer.
Content-based filtering is another popular approach. The characteristics of items, such
as their category, brand, and price, are used in content-based filtering to provide
suggestions.
For instance, if a consumer purchases a product from a particular brand, the customer
may be recommended additional goods from that brand.
4) Speech Recognition
Speech recognition is another popular application of machine learning in data science.
It involves using machine learning algorithms to analyse speech to convert it into text
or other forms of data.
The goal of speech recognition is to enable machines to understand and interpret
human speech so that it can be used for various tasks, such as voice commands,
transcription, and language translation.
Machine learning algorithms process this data and extract useful information, such as
spoken words and phrases and the speaker's intent.
This information is then used to convert speech into text or other forms of data, such
as commands or questions.
Python Tools
• SciPy is a scientific computing library that also lends its name to an ecosystem of
fundamental packages often used in scientific computing, such as NumPy, matplotlib,
Pandas, and SymPy.
• NumPy gives you access to powerful array functions and linear algebra functions.
• Matplotlib is a popular 2D plotting package with some 3D functionality.
• Pandas is a high-performance, but easy-to-use, data-wrangling package. It
introduces dataframes to Python, a type of in-memory data table. It’s a concept that
should sound familiar to regular users of R.
• SymPy is a package used for symbolic mathematics and computer algebra.
• StatsModels is a package for statistical methods and algorithms.
• Scikit-learn is a library filled with machine learning algorithms.
• RPy2 allows you to call R functions from within Python. R is a popular open
source statistics program.
• NLTK (Natural Language Toolkit) is a Python toolkit with a focus on text analytics.
• Numba and NumbaPro—These use just-in-time compilation to speed up
applications written directly in Python and a few annotations. NumbaPro also
allows you to use the power of your graphics processor unit (GPU).
• PyCUDA—This allows you to write code that will be executed on the GPU instead
of your CPU and is therefore ideal for calculation-heavy applications.
• Cython, or C for Python—This brings the C programming language to Python. C is
a lower-level language, so the code is closer to what the computer eventually
executes (machine code). The closer code is to bits and bytes, the faster it executes. A
computer is also faster when it knows the type of a variable (called static typing).
Python wasn’t designed to do this, and Cython helps you to overcome this
shortfall.
• Blaze—Blaze gives you data structures that can be bigger than your computer’s
main memory, enabling you to work with large data sets.
• Dispy and IPCluster—These packages allow you to write code that can be
distributed over a cluster of computers.
• PP—Python is executed as a single process by default. With the help of PP you
can parallelize computations on a single machine or over clusters.
• Pydoop and Hadoopy—These connect Python to Hadoop, a common big data
framework.
• PySpark—This connects Python and Spark, an in-memory big data framework.
Types of ML:
Machine learning (ML) is a branch of artificial intelligence (AI) that focuses on using
data and algorithms to enable AI to imitate the way that humans learn.
(Or)
“Machine learning is a field of study that gives computers the ability to learn without
being explicitly programmed.”
• Supervised learning takes labelled inputs and maps them to known outputs, which
means we already know the target variable.
• Supervised learning methods need external supervision to train models.
• These are used for classification and regression.
Algorithms used in supervised learning:
Regression:-
Taking the example of the image below, experience (in years) is on the X-axis.
For every experience value, there is one salary (in rupees per month) on the Y-axis. The
green dots are the (X, Y) coordinates, that is, the input and output data.
The regression problem tries to find the continuous mapping function from input to
output variables.
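The mapping from experience to salary can be sketched with scikit-learn's LinearRegression. This is only an illustration: the experience and salary values below are made up and are not read from the figure.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data: years of experience (input) and monthly salary in rupees (output).
experience = np.array([[1], [2], [3], [4], [5]])
salary = np.array([30000, 35000, 40000, 45000, 50000])

# Fit a straight line salary = slope * experience + intercept.
model = LinearRegression()
model.fit(experience, salary)

# Use the learned continuous mapping to predict the salary for 6 years of experience.
predicted = model.predict(np.array([[6]]))[0]
print(f"slope: {model.coef_[0]:.1f}, intercept: {model.intercept_:.1f}")
print(f"predicted salary for 6 years: {predicted:.1f}")
```

Because the made-up points lie exactly on a line, the fitted line passes through all of them; real salary data would scatter around the line.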
Applications:
In unsupervised learning, we know the input data, but both the output and the mapping
function are unknown.
In such scenarios, machine learning algorithms find a function that measures the
similarity among different input data instances and groups them based on that similarity
index; these groups are the output of unsupervised learning.
• Understands patterns and trends in the data and discovers the output.
• Doesn’t need any supervision to train the model.
• Used for clustering and association.
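Grouping by similarity can be sketched with k-means clustering from scikit-learn. The choice of k-means and the six 2-D points are assumptions made for illustration; the notes do not name a specific algorithm here.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical unlabelled inputs: two visibly separate groups of 2-D points.
points = np.array([
    [1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
    [8.0, 8.0], [8.2, 7.9], [7.8, 8.1],
])

# No target variable is given; the algorithm groups points by distance alone.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(points)

print(labels)                   # cluster index assigned to each point
print(kmeans.cluster_centers_)  # centre of each discovered group
```

The cluster numbers themselves are arbitrary; what matters is that the first three points share one label and the last three share the other.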
Algorithms used in unsupervised learning:
Applications:
Recommendation Systems
Market Segmentation
Image and Document Clustering
A semi-supervised algorithm classifies on its own to some extent and needs only a small
quantity of labelled data.
These algorithms operate on data that is mostly unlabelled and has only a few labels.
Algorithms:
Self-training
Co-training
Graph based labelling
Pseudo labelling is the process of using a model trained on the labelled data to predict
labels for unlabelled data.
For example, suppose there is a large chunk of unlabelled data (as in the image above)
and only a small labeled dataset.
We can train the model using that small amount of labeled data and then predict on the
unlabelled dataset.
Prediction on the unlabelled dataset attaches a label to every data sample, though with
limited accuracy; the result is termed a pseudo-labeled dataset.
Now a new model can be trained with the mixture of the true-labeled dataset and
pseudo-labeled dataset.
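The pseudo-labelling steps can be sketched as follows. The synthetic dataset, the 50/450 split, and the use of logistic regression are all assumptions chosen to keep the example small.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data: pretend only the first 50 samples come with labels.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_labelled, y_labelled = X[:50], y[:50]
X_unlabelled = X[50:]

# Step 1: train on the small labelled set.
base_model = LogisticRegression(max_iter=1000)
base_model.fit(X_labelled, y_labelled)

# Step 2: predict labels for the unlabelled samples (the pseudo-labels).
pseudo_labels = base_model.predict(X_unlabelled)

# Step 3: train a new model on the true labels plus the pseudo-labels.
X_mixed = np.vstack([X_labelled, X_unlabelled])
y_mixed = np.concatenate([y_labelled, pseudo_labels])
final_model = LogisticRegression(max_iter=1000)
final_model.fit(X_mixed, y_mixed)
```

In practice it is common to keep only the pseudo-labels the first model is confident about; that filtering is left out of this sketch.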
Applications:
Web mining: classify web pages
Text mining: identify names in the text
Video mining: classify people in the video
Reinforcement learning (RL):
• The agent interacts with the environment and identifies the possible actions it
can perform.
• The primary goal of an agent in RL is to perform actions by observing the
environment and to collect the maximum positive reward.
• In RL, the agent learns automatically from feedback without any labeled data,
unlike supervised learning.
• RL is used to solve specific types of problems where decision making is
sequential, such as game playing and robotics.
Algorithms used in RL:
What are the situations where RL can be used?
Consider the following grid game, where a robot can move.
Assume the starting node is E and the goal node is G. The game is about finding the
shortest path from the starting state to the goal state.
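Since the grid itself is not reproduced here, the sketch below uses a stand-in: a one-dimensional corridor of five cells with the start at cell 0 and the goal at cell 4. It uses tabular Q-learning, which is one possible RL algorithm and not necessarily the one intended by these notes; the reward values and learning parameters are likewise assumptions.

```python
import random

N_STATES = 5                 # cells 0..4; start at 0, goal at 4 (hypothetical layout)
ACTIONS = [-1, +1]           # action 0 moves left, action 1 moves right
ALPHA, GAMMA, EPSILON = 0.5, 0.9, 0.2

random.seed(0)
q = [[0.0, 0.0] for _ in range(N_STATES)]   # Q-table: one row per cell, one column per action

for episode in range(200):
    state = 0
    while state != N_STATES - 1:
        # Explore with probability EPSILON, otherwise take the best known action.
        if random.random() < EPSILON:
            action = random.randrange(2)
        else:
            action = 0 if q[state][0] > q[state][1] else 1
        next_state = min(max(state + ACTIONS[action], 0), N_STATES - 1)
        # Reaching the goal is rewarded; every other step costs a little,
        # which pushes the agent towards the shortest path.
        reward = 1.0 if next_state == N_STATES - 1 else -0.1
        best_next = max(q[next_state])
        q[state][action] += ALPHA * (reward + GAMMA * best_next - q[state][action])
        state = next_state

# Greedy policy learned for each non-goal cell.
policy = ["left" if q[s][0] > q[s][1] else "right" for s in range(N_STATES - 1)]
print(policy)
```

The agent is never told that the goal lies to the right; it learns this only from the rewards it receives.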
Applications:
Modelling process for feature engineering, model selection, validation and prediction
Common Feature Types:
• Numerical Features: Values with numeric types (int, float, etc.). Examples: age,
salary, height.
• Categorical Features: Features that can take one of a limited number of values.
Examples: gender (male, female, X), color (red, blue, green).
• Ordinal Features: Categorical features that have a clear ordering. Examples:
T-shirt size (S, M, L, XL).
• Binary Features: A special case of categorical features with only two categories.
Examples: is_smoker (yes, no), has_subscription (true, false).
• Text Features: Features that contain textual data. Textual data typically requires
special preprocessing steps (like tokenization) to transform it into a format
suitable for machine learning models.
What Is Feature Engineering?
Feature engineering is the process of transforming raw data into features that are
suitable for machine learning models.
In other words, it is the process of selecting, extracting, and transforming the most
relevant features from the available data to build more accurate and efficient machine
learning models.
Feature engineering is required when working with machine learning models.
Regardless of the data or architecture, a poor feature will directly degrade your
model.
Feature Engineering Processes
Feature engineering consists of various processes:
• Feature creation
• Feature Transformation
• Feature extraction
• Feature selection
1. Feature creation: Feature Creation is the process of generating new features
based on domain knowledge or by observing patterns in the data.
It is a form of feature engineering that can significantly improve the performance
of a machine-learning model.
Types of Feature Creation:
1. Domain-Specific: Creating new features based on domain knowledge, such
as creating features based on business rules or industry standards.
2. Data-Driven: Creating new features by observing patterns in the data.
3. Synthetic: Generating new features by combining existing features.
2. Feature Transformation: It is the process of transforming the features into a
more suitable representation for the machine learning model.
This is done to ensure that the model can effectively learn from the data.
Types of Feature Transformation:
• Normalization: Rescaling the features to have a similar range, such as
between 0 and 1, to prevent some features from dominating others.
• Scaling: Scaling is a technique used to transform numerical variables
to have a similar scale, so that they can be compared more easily.
• Encoding: Transforming categorical features into a numerical
representation.
Examples are one-hot encoding and label encoding.
• Transformation: Transforming the features using mathematical
operations to change the scale of the features.
Examples are logarithmic, square root, and reciprocal transformations.
3. Feature extraction:
• Feature extraction is the process of extracting features from a data set to identify
useful information.
• Feature extraction aims to reduce the number of features in a dataset with the
goal of maintaining most of the relevant information.
• Feature extraction is used to improve accuracy, reduce the risk of overfitting,
speed up training, and improve data visualization.
• Types of Feature Extraction:
Dimensionality Reduction: Reducing the number of features by transforming
the data into a lower-dimensional space while retaining important information.
• Principal Component Analysis (PCA)
• Independent Component Analysis (ICA)
• Linear Discriminant Analysis (LDA)
• Locally Linear Embedding (LLE)
• t-distributed Stochastic Neighbor Embedding (t-SNE)
4. Feature selection: Feature selection is the process of isolating the most
consistent, non-redundant, and relevant features to use in model construction.
Reducing the size of datasets becomes more important as the size and variety of
datasets continue to grow.
The main goal of feature selection is to improve the performance of a predictive
model and reduce the computational cost of modeling.
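The transformation, extraction, and selection steps described above can be sketched with scikit-learn. The small age and colour arrays are made up, and the iris dataset bundled with scikit-learn is used only as a convenient example; none of this comes from the original notes.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Feature transformation: rescale a numeric column to the range [0, 1].
ages = np.array([[20.0], [30.0], [40.0]])
scaled_ages = MinMaxScaler().fit_transform(ages)

# Feature transformation: one-hot encode a categorical column.
colors = np.array([["red"], ["blue"], ["green"], ["blue"]])
one_hot = OneHotEncoder().fit_transform(colors).toarray()  # columns: blue, green, red

# Feature extraction: project the 4 iris measurements onto 2 principal components.
X, y = load_iris(return_X_y=True)
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)
print("variance kept by 2 components:", pca.explained_variance_ratio_.sum())

# Feature selection: keep the 2 original columns with the highest ANOVA F-score.
selector = SelectKBest(score_func=f_classif, k=2)
X_selected = selector.fit_transform(X, y)
kept_columns = selector.get_support(indices=True)
print("columns kept:", kept_columns)
```

Note the difference between the last two steps: PCA builds new features as combinations of the originals, while feature selection keeps a subset of the original columns unchanged.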
Model Selection
In machine learning, the process of selecting the top model or algorithm from a list of
potential models to address a certain issue is referred to as model selection.
• Problem formulation: Clearly express the issue at hand, including the kind of
predictions or task that you'd like the model to carry out (for example,
classification, regression, or clustering).
• Candidate model selection: Pick a group of models that are appropriate for the
issue at hand. These models can include straightforward methods like decision
trees or linear regression as well as more sophisticated ones like deep neural
networks, random forests, or support vector machines.
• Performance evaluation: Establish metrics for measuring how well each model
performs. Common metrics include accuracy, precision, recall, F1-score, and mean
squared error.
• Training and evaluation: Each candidate model should be trained using a subset
of the available data (the training set), and its performance should be assessed
using a different subset (the validation set).
• Model comparison: Evaluate the performance of various models and determine
which one performs best on the validation set.
• Hyperparameter tuning: Before training, many models require that certain
hyperparameters be configured, such as the learning rate, the regularization
strength, or the number of hidden layers in a neural network.
• Final model selection: After the models have been analyzed and fine-tuned, pick
the model that performs the best. This model can then be used to make
predictions on new, unseen data.
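The selection steps above can be sketched as follows. The two candidate models, the iris dataset, and accuracy as the metric are assumptions picked for brevity; a real project would choose candidates and metrics to fit the problem.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Keep some data aside to play the role of new, unseen data.
X_train, X_new, y_train, y_new = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# Candidate models for a classification problem.
candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(random_state=0),
}

# Train and evaluate each candidate with 5-fold cross-validation (mean accuracy).
scores = {
    name: cross_val_score(model, X_train, y_train, cv=5).mean()
    for name, model in candidates.items()
}

# Compare the models, refit the best one on all training data, then predict.
best_name = max(scores, key=scores.get)
best_model = candidates[best_name].fit(X_train, y_train)
predictions = best_model.predict(X_new)
print(scores, "->", best_name)
```

Hyperparameter tuning is omitted here; scikit-learn's GridSearchCV is one way to add it to the same loop.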
Validation and Prediction
There are several types of cross-validation techniques, including holdout validation,
leave-one-out cross-validation, stratified cross-validation, and k-fold
cross-validation.
1. Holdout Validation
The data set is split once into a training data set and a testing data set. Usually, the
ratio of training data set to testing data set is 70:30 or 80:20.
The next step is to train the model with the training data set; once it is trained, the
model is tested with the testing data set.
2. LOOCV (Leave One Out Cross Validation)
In this method, we train on the whole dataset except for a single data point, which is
held out for testing, and then repeat this for every data point.
In LOOCV, the model is trained on (n-1) samples and tested on the one omitted
sample, repeating this process for each data point in the dataset.
3. Stratified Cross-Validation
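Stratified cross-validation ensures that each fold has the same class distribution as
the full dataset. This is particularly important when dealing with imbalanced datasets,
where certain classes may be underrepresented. In this method,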
1. The dataset is divided into k folds while maintaining the proportion of
classes in each fold.
2. During each iteration, one fold is used for testing, and the remaining folds
are used for training.
3. The process is repeated k times, with each fold serving as the test set
exactly once.
4. K-Fold Cross Validation
Another type of cross-validation is K-fold cross-validation. The parameter for
this type is 'K', which refers to the number of subsets, or folds, obtained from the
data sample.
The first step is to divide the data sample into 'k' subsets of roughly equal size.
The second step is to train the model on k-1 of the subsets and test it on the
remaining one. This is repeated k times, so that each subset serves as the testing
data set exactly once, and the k scores are averaged to estimate how well the
model performs.
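How k-fold and stratified k-fold divide the data can be seen on a toy dataset. The 12 samples and the 8:4 class imbalance below are made up; only the fold assignments are shown, and no model is trained.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

X = np.arange(12).reshape(12, 1)
y = np.array([0] * 8 + [1] * 4)   # imbalanced: 8 samples of class 0, 4 of class 1

# Plain k-fold: every sample lands in the test fold exactly once.
kfold_tests = [test.tolist() for _, test in KFold(n_splits=4).split(X)]
print(kfold_tests)

# Stratified k-fold: count the classes in each test fold to check the 2:1 ratio.
strat_counts = [
    np.bincount(y[test], minlength=2).tolist()
    for _, test in StratifiedKFold(n_splits=4).split(X, y)
]
print(strat_counts)
```

With plain k-fold on this ordered data, the last test fold contains only class 1 and the others only class 0, which is exactly the situation stratification avoids.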
If you’ve implemented the first three steps successfully, you now have a
performant model that generalizes to unseen data.
The process of applying your model to new data is called model scoring. In fact,
model scoring is something you implicitly did during validation, only now you don’t
know the correct outcome.
By now you should trust your model enough to use it for real.
Then you apply the model on this new data set, and this results in a prediction.
Handling large data
Handling large data on a single computer
Large data sets can be difficult to analyze on a single computer.
To make it easier, there are a few things you can do:
1. Use parallel computing
2. Use cloud computing
3. Use distributed computing
4. Use data compression
5. Use data visualization
Use parallel computing: A parallel computing system consists of multiple processors that
communicate with each other using a shared memory. Parallel computing is a technique that
allows you to split up a large data set into smaller chunks and process them simultaneously
on multiple cores. This can greatly reduce the amount of time it takes to analyze the data.
Use cloud computing: Cloud computing allows you to store large data sets in the cloud and analyze them using virtual
machines. This eliminates the need to have powerful hardware in-house, and can significantly reduce the cost of data analysis.
Instead of storing files on a storage device or hard drive, a user can save them on cloud, making it possible to access the files
from anywhere, as long as they have access to the web.
Use distributed computing:
A distributed computing system contains multiple processors connected by a communication
network.
Distributed computing is a technique that allows you to spread large data sets across multiple
computers and analyze them in parallel. This can significantly reduce the amount of time needed to
analyze the data.
Use data compression: Data compression can reduce the size of large data sets, making them
easier to store and analyze on a single computer.
Use data visualization: Data visualization can help you get a better understanding of your data,
and can make it easier to analyze large data sets on a single computer.
The problems to face when handling large data
1. Data Storage
2. Data Cleaning
3. Data Analysis
4. Security
5. Computing Power
• Volume: How much data is collected
• Velocity: How fast data can be generated, gathered, and analyzed
• Variety: How many points of reference are used to collect data
• Veracity: How reliable the data is
Data Storage: Storing large data sets can be challenging due to the amount of
space and resources required. Data must be structured and organized to be
useful and efficient.
Data Cleaning: Large data sets often contain missing values, outliers, and
incorrect data types, making it difficult to get an accurate picture of the data.
Data cleaning is essential to ensure the accuracy of any analysis.
Data Analysis: Analyzing large data sets can be complex and time consuming.
Specialized techniques may be required to process, visualize, and interpret the data.
Security: Large data sets can contain sensitive information, making it important to maintain security
and privacy. Appropriate measures must be taken to protect the data from unauthorized access.
Computing Power: Large data sets require large amounts of
computing power to process and analyze. This can be expensive
and difficult to access.
General techniques for handling large volumes of data:
1. Use Distributed Computing: Distributed computing involves breaking down large tasks into smaller parts
and distributing them to different machines to be processed in parallel. This can greatly improve the speed
and efficiency of data processing, and is particularly helpful when dealing with large volumes of data.
2. Use a Database: Using a database to store and manage large volumes of data is a great way to ensure data
integrity and scalability. Most databases have built-in features to help with querying, sorting, and filtering
data, which can help make data analysis easier and more efficient.
3. Use Streaming Data: Streaming data is a type of data that is delivered in near real-time. This can be very
helpful in dealing with large volumes of data, since it allows for processing to occur as soon as the data is
received, rather than waiting for the entire dataset to be collected before beginning analysis.
4. Compress Data: Compression is a great way to reduce the size of large datasets, which can help reduce the
amount of time needed for processing. Compression algorithms can also help reduce the amount of storage
space needed to store large amounts of data.
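The streaming idea in item 3 can be sketched with Python generators: each value is processed as soon as it arrives, and the full dataset is never held in memory. The sensor_readings function is a hypothetical stand-in for a live data source.

```python
def running_mean(stream):
    """Yield the mean of all values seen so far, one result per incoming value."""
    count, total = 0, 0.0
    for value in stream:
        count += 1
        total += value
        yield total / count


def sensor_readings():
    # Hypothetical stand-in for a live source that delivers values one at a time.
    for value in [10.0, 20.0, 30.0, 40.0]:
        yield value


# Only the running count and total are kept in memory, never the whole stream.
means = list(running_mean(sensor_readings()))
print(means)
```

The same pattern works for any statistic that can be updated incrementally, such as a count, a sum, or a maximum.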
General programming tips for dealing with large datasets:
1. Keep your data organized and structured. Use a database or spreadsheet
program to store, track and maintain your data.
2. Break up large datasets into smaller, more manageable chunks. This will help
you more easily find and access specific data points.
3. Utilize tools such as parallel computing, machine learning and data mining to
help analyze and process large datasets.
4. Make use of specialized software that is designed to handle large datasets.
5. Take advantage of cloud computing to store and manage large datasets.
6. Use data visualization tools to help you make sense of large datasets.
7. Utilize tools such as Apache Spark and Hadoop to help with processing large
datasets.
8. Regularly back up your data to protect against data loss.
9. Consider using data compression to reduce the size of datasets and make
them easier to store and manage.
10. Employ security measures to protect your data from unauthorized access.
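Tip 2, breaking a dataset into chunks, can be sketched with the chunksize parameter of pandas read_csv. The in-memory CSV below is a made-up stand-in for a file too large to load at once.

```python
import io

import pandas as pd

# Hypothetical stand-in for a large CSV file: one column holding the values 1..100.
csv_data = io.StringIO("value\n" + "\n".join(str(i) for i in range(1, 101)))

total = 0
rows = 0
# Read 25 rows at a time; only one chunk is held in memory at any moment.
for chunk in pd.read_csv(csv_data, chunksize=25):
    total += chunk["value"].sum()
    rows += len(chunk)

print(rows, total)
```

With a real file, the StringIO object would simply be replaced by the file path.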
1. Predicting Malicious URLs
Case Study: "Malicious URL Detection Using Machine Learning"
Objective: The goal of this project was to develop a machine
learning-based system capable of detecting malicious URLs to
enhance cybersecurity measures.
Approach:
• Data Collection: The dataset was composed of a mix of
malicious and benign URLs. Sources included public datasets
like PhishTank and Alexa for malicious and benign URLs,
respectively.
• Feature Engineering:
• Lexical Features: These included the length of the URL,
the number of dots in the URL, the presence of special
characters, and domain tokenization.
• Host-based Features: These involved attributes such as
WHOIS information (domain names, IP address blocks),
domain registration length, and whether the IP address
was blacklisted.
• Content-based Features: The HTML content of the
webpage was analysed to extract features like the
presence of suspicious scripts or links.
• Modelling:
• Algorithms like Random Forest, Support Vector Machines
(SVM), and Gradient Boosting Machines (GBM) were
used.
• The models were trained on a labelled dataset of
malicious and benign URLs.
• Cross-validation was used to fine-tune the models.
• Evaluation:
• The models were evaluated using metrics such as
accuracy, precision, recall, and F1-score.
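The lexical features listed above can be computed directly from the URL string. This is a minimal sketch: the function name, the set of characters treated as special, and the example URL are all assumptions, not details from the case study.

```python
from urllib.parse import urlparse

# Assumption: which characters count as "special" is a choice made for this sketch.
SPECIAL_CHARS = set("@-_?=&%")


def lexical_features(url):
    """Return simple lexical features of a URL as a dictionary."""
    parsed = urlparse(url)
    return {
        "url_length": len(url),
        "num_dots": url.count("."),
        "num_special_chars": sum(ch in SPECIAL_CHARS for ch in url),
        "domain_tokens": parsed.netloc.split("."),   # domain tokenization
    }


features = lexical_features("http://login.example.com/verify?user=1")
print(features)
```

Host-based and content-based features need network lookups or page downloads, so they are not covered by this sketch.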
Predicting malicious URLs
Data—The data in this case study was made available as part
of a research project. The project contains data from 120 days,
and each observation has approximately 3,200,000 features.
The target variable contains 1 if it’s a malicious website and
-1 otherwise.
Justin Ma, Lawrence K. Saul, Stefan Savage, and Geoffrey M. Voelker, “Beyond Blacklists: Learning to Detect
Malicious Web Sites from Suspicious URLs,” Proceedings of the ACM SIGKDD Conference, Paris (June 2009),
1245–53.
The Scikit-learn library—You should have this library installed
in your Python environment.
Step 1: Defining the research goal
The goal of our project is to detect whether certain URLs can
be trusted or not.
Step 2: Acquiring the URL data
Start by downloading the data from
http://sysnet.ucsd.edu/projects/url/#datasets and place it in
a folder. Choose the data in SVMLight format. SVMLight is a
text-based format with one observation per row. To save
space, it leaves out the zeros.
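To illustrate the SVMLight format, the sketch below loads two made-up observations with scikit-learn's load_svmlight_file. The sample rows and the n_features value are assumptions; for the real data you would pass the path of a downloaded file instead of the in-memory buffer.

```python
import io

from sklearn.datasets import load_svmlight_file

# Two hypothetical observations. Each row holds the label followed by
# index:value pairs for the non-zero features only.
sample = b"+1 3:1 10:1\n-1 2:1\n"

# load_svmlight_file returns a sparse feature matrix X and a label array y.
X, y = load_svmlight_file(io.BytesIO(sample), n_features=12)

print(X.shape)   # rows x features
print(X.nnz)     # number of stored (non-zero) entries
print(y)         # labels: 1 for malicious, -1 otherwise
```

Only three values are stored for the 2 x 12 matrix, which is why the format saves so much space on data with millions of mostly-zero features.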
Step 3 of the data science process, data preparation and
cleansing, isn’t necessary in this case because the URLs come
pre-cleaned.
Step 4: Data exploration
We need to find out whether the data does indeed contain
lots of zeros. We can check this with the following piece of
code:
# X is the sparse feature matrix loaded from the SVMLight files
print("number of non-zero entries %2.6f" %
      (X.nnz / (X.shape[0] * X.shape[1])))
This outputs the following: number of non-zero entries
0.000033.
Data that contains little information compared to zeros is
called sparse data. This can be saved more compactly if you
store the data as [(0,0,1),(4,4,1)] instead of
[[1,0,0,0,0],[0,0,0,0,0],[0,0,0,0,0],[0,0,0,0,0],[0,0,0,0,1]]
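The compact form shown above matches the coordinate (COO) layout in SciPy, which stores only the row, column, and value of each non-zero entry. The sketch rebuilds the same 5 x 5 example.

```python
import numpy as np
from scipy.sparse import coo_matrix

# Dense version of the example: a 5 x 5 matrix with only two non-zero entries.
dense = np.zeros((5, 5), dtype=int)
dense[0, 0] = 1
dense[4, 4] = 1

# The COO format keeps three parallel arrays: row index, column index, value.
sparse = coo_matrix(dense)
triplets = list(zip(sparse.row.tolist(), sparse.col.tolist(), sparse.data.tolist()))

print(triplets)
print("stored values:", sparse.nnz, "of", dense.size)
```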
Step 5: Model building
We create a model to distinguish the malicious from the normal
URLs by using a stochastic gradient descent classifier,
SGDClassifier().
With a precision of 0.97, only 3% (1 - 0.97) of the sites
detected as malicious are falsely accused (precision measures
the quality of the positive predictions). With a recall of 0.94,
6% (1 - 0.94) of the malicious sites aren’t detected.
This is a decent result, so we can conclude that the
methodology works.
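The model-building step can be sketched on synthetic data, since the URL files are not bundled with these notes. The generated dataset and its parameters are assumptions, so the scores printed here will differ from the figures quoted above for the real URL data.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the URL data, with labels recoded to 1 and -1.
X_dense, y01 = make_classification(
    n_samples=1000, n_features=20, n_informative=5,
    n_clusters_per_class=1, class_sep=2.0, random_state=0,
)
y = np.where(y01 == 1, 1, -1)
X = csr_matrix(X_dense)   # SGDClassifier accepts sparse input directly

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Linear classifier trained with stochastic gradient descent.
clf = SGDClassifier(random_state=0)
clf.fit(X_train, y_train)

accuracy = clf.score(X_test, y_test)
print(classification_report(y_test, clf.predict(X_test)))
```

For data that does not fit in memory, SGDClassifier also offers partial_fit, which updates the model one batch at a time.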
Recall is a metric that measures how often a machine learning model
correctly identifies positive instances (true positives) from all the actual
positive samples in the dataset.
Support refers to the number of actual occurrences of the class in the
dataset.
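Precision, recall, and support can be computed by hand from the true and predicted labels. The ten labels below are made up for illustration, with 1 meaning malicious and -1 meaning normal.

```python
# Hypothetical true and predicted labels for ten sites.
y_true = [1, 1, 1, 1, -1, -1, -1, -1, -1, -1]
y_pred = [1, 1, 1, -1, 1, 1, -1, -1, -1, -1]

pairs = list(zip(y_true, y_pred))
tp = sum(t == 1 and p == 1 for t, p in pairs)    # malicious and detected
fp = sum(t == -1 and p == 1 for t, p in pairs)   # normal but falsely accused
fn = sum(t == 1 and p == -1 for t, p in pairs)   # malicious but missed

precision = tp / (tp + fp)   # share of detected sites that really are malicious
recall = tp / (tp + fn)      # share of malicious sites that were detected
support = y_true.count(1)    # number of actual malicious sites

print(precision, recall, support)
```

Here 1 - precision is the share of detected sites that are falsely accused, and 1 - recall is the share of malicious sites that go undetected.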
Building Recommender Systems
Case Study: "Personalized Content Recommendations for
Streaming Services"
Objective: To create a personalized recommendation system for a
video streaming platform to enhance user engagement and
retention.
Approach:
• Data Collection:
• User interaction data including clicks, watch history,
ratings, and search queries.
• Metadata for each piece of content, including genre,
actors, director, and user ratings.
• Types of Recommendation Systems:
• Collaborative Filtering:
• User-based: Recommends items based on what
similar users have liked.
• Item-based: Recommends items similar to what a
user has liked before.
• Content-Based Filtering:
• Recommends items similar to those the user has
engaged with, based on content features.
• Hybrid Approach:
• Combines both collaborative and content-based
filtering to improve recommendations.
• Modelling:
• Matrix Factorization was used for collaborative filtering to
identify latent factors that explain user preferences.
• TF-IDF (term frequency-inverse document frequency) and
cosine similarity were used for content-based filtering to
match user profiles with content metadata.
• The hybrid model incorporated a weighted combination
of both approaches.
• Evaluation:
• The models were evaluated using metrics like Mean
Squared Error (MSE) for ratings prediction and Mean
Average Precision (MAP) for ranking performance.
• A/B testing was conducted to compare the
recommendation system's performance against a control
group.
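The content-based part can be sketched with TF-IDF vectors and cosine similarity from scikit-learn. The three content descriptions are invented for illustration and do not come from the case study.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical content descriptions (index 0 is what the user just watched).
descriptions = [
    "space adventure with robots and aliens",
    "romantic comedy set in paris",
    "robots and aliens fight in deep space",
]

# Turn each description into a TF-IDF vector, then compare every pair of vectors.
tfidf = TfidfVectorizer().fit_transform(descriptions)
similarity = cosine_similarity(tfidf)

# Recommend the other title whose description is closest to title 0.
most_similar_to_first = similarity[0, 1:].argmax() + 1
print(similarity.round(2))
print("recommend title", most_similar_to_first)
```

Titles 0 and 1 share no words, so their similarity is zero, while titles 0 and 2 overlap on several terms.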
Case study 2: Building a recommender system inside a database
Tools and techniques needed
MySQL database
MySQL database connection Python library
The pandas Python library
TECHNIQUES
A simple recommender system will look for customers who’ve rented
similar movies as you have and then suggest those that the others
have watched but you haven’t seen yet.
This technique is called k-nearest neighbors in machine learning.
• Your boss has stored the data in a MySQL database, and it’s up to
you to do the analysis.
• What your boss is asking for is a recommender system, an automated
system that learns people’s preferences and recommends movies
and other products the customers haven’t tried yet.
• The goal of our case study is to create a memory-friendly
recommender system.
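The nearest-neighbour idea can be sketched in plain Python before involving a database. The customers, movie titles, and the use of Jaccard similarity to compare rental histories are assumptions made for this sketch; the case study itself works on data stored in MySQL.

```python
# Hypothetical rental histories: customer name -> set of rented movies.
rentals = {
    "you": {"Alien", "Blade Runner", "Dune"},
    "ann": {"Alien", "Blade Runner", "Dune", "Solaris"},
    "bob": {"Alien", "Blade Runner", "Interstellar"},
    "carol": {"Notting Hill", "Amelie"},
}


def jaccard(a, b):
    """Similarity of two sets: size of the overlap divided by size of the union."""
    return len(a & b) / len(a | b)


def recommend(customer, k=2):
    """Suggest movies rented by the k most similar customers but not yet by `customer`."""
    seen = rentals[customer]
    others = [name for name in rentals if name != customer]
    # Rank the other customers by similarity and keep the k nearest neighbours.
    neighbours = sorted(
        others, key=lambda name: jaccard(seen, rentals[name]), reverse=True
    )[:k]
    suggestions = set()
    for name in neighbours:
        suggestions |= rentals[name] - seen
    return sorted(suggestions)


recommendations = recommend("you")
print(recommendations)
```

Sets keep the memory footprint small here; a version running inside the database would express the same comparison in SQL.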