Machine Learning
TRAINER:
1. Apply Data Pre-processing
1.4. Data cleaning is appropriately performed based on the provided dataset.
2. Develop Machine Learning Model
2.1. Machine Learning algorithm is properly selected based on the characteristics of the dataset.
2.2. Machine Learning models are properly trained based on a training set of data.
2.3. Machine Learning model performance is properly evaluated based on appropriate evaluation metrics.
2.4. Hyperparameters are properly fine-tuned based on evaluation results.
3. Perform Model Deployment
3.1. Deployment methods are clearly selected based on the requirements.
3.2. Model file is properly integrated in the system based on the deployment method (RESTful API guidelines).
3.3. Prediction responses are accurately delivered to the clients based on the model insights.
Learning outcome 1: Apply Data Pre-processing
Definition:
- Machine learning (ML) is a subset of artificial intelligence (AI) that focuses on developing algorithms and
statistical models that allow computers to improve their performance on tasks over time through experience.
Instead of being explicitly programmed to perform a specific task, machine learning models are trained on
data to recognize patterns and make decisions or predictions based on new, unseen data.
- Machine learning (ML) is a branch of artificial intelligence (AI) that involves the development of
algorithms and statistical models that enable computers to perform specific tasks without explicit
instructions. Instead, these systems learn from and make decisions based on data
- Machine learning is a field of study in artificial intelligence concerned with the development and study of
statistical algorithms that can learn from data and generalize to unseen data and thus perform tasks without
explicit instructions.
Key Components of Machine Learning
2. Data: Data is the cornerstone of all ML algorithms. Without data, ML algorithms cannot learn.
The data can be in various formats, such as text, images, videos or even sensor data.
3. Types of Learning: Some common types include supervised learning, unsupervised learning, and reinforcement
learning.
• Supervised Learning: The model is trained on labeled data (where the correct
answers are known). For example, a model might be trained to recognize cats and dogs
in images by being shown many examples of each with labels.
• Unsupervised Learning: The model works with unlabeled data and tries to find
patterns or groupings on its own. For instance, it might cluster customers into different
segments based on their purchasing behavior.
• Reinforcement Learning: The model learns by interacting with an environment and
receiving feedback in the form of rewards or penalties. This is often used in robotics
and game-playing.
4. Training: During training, the ML model adjusts its parameters to minimize errors or optimize
its performance based on the data it processes.
5. Evaluation: This involves assessing how well your model is performing. Common metrics for
this include accuracy, precision, recall and F1 score, depending on the problem you're solving
(e.g. classification or regression); a short example follows this list.
6. Deployment: Once trained and evaluated, the model can be deployed in real-world applications,
such as recommendation systems, speech recognition, and autonomous vehicles.
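To make the evaluation step concrete, here is a minimal sketch using scikit-learn's metrics module, assuming scikit-learn is installed; the true and predicted labels below are made up purely for illustration:
Python
# Minimal evaluation sketch: compare predicted labels against true labels.
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0]   # ground-truth labels (made up)
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]   # labels predicted by some model (made up)

print("Accuracy :", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall   :", recall_score(y_true, y_pred))
print("F1 score :", f1_score(y_true, y_pred))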
What is the Machine Learning lifecycle? The Machine Learning lifecycle is the end-to-end process that machine
learning models go through, from problem definition to model deployment and maintenance.
The machine learning life cycle involves seven major steps, which are given below:
o Gathering Data
o Data preparation
o Data Wrangling
o Analyse Data
o Train the model
o Test the model
o Deployment
In the complete life cycle process, we solve a problem by creating a machine learning system
called a "model", and this model is created through "training". To train a model, we
need data, hence the life cycle starts with collecting data.
1. Gathering Data:
Data gathering is the first step of the machine learning life cycle. The goal of this step is to
identify and obtain all the data related to the problem.
In this step, we need to identify the different data sources, as data can be collected from
various sources such as files, databases, the internet, or mobile devices. It is one of the most
important steps of the life cycle. The quantity and quality of the collected data determine
the efficiency of the output: the more data there is, the more accurate the
prediction will be.
o Integrate the data obtained from different sources
By performing the above tasks, we get a coherent set of data, also called a dataset. It will
be used in further steps.
2. Data preparation
After collecting the data, we need to prepare it for the further steps. Data preparation is a step
where we put our data into a suitable place and prepare it for use in our machine learning
training.
In this step, first, we put all data together and then randomize the ordering of the data. This
step can further be divided into two processes:
o Data exploration:
It is used to understand the nature of data that we have to work with. We need to
understand the characteristics, format, and quality of data.
A better understanding of data leads to an effective outcome. In this step, we find
correlations, general trends, and outliers.
o Data pre-processing:
Now the next step is preprocessing of data for its analysis.
3. Data Wrangling
Data wrangling is the process of cleaning and converting raw data into a usable format. It is
the process of cleaning the data, selecting the variables to use, and transforming the data into a
proper format to make it more suitable for analysis in the next step. It is one of the most
important steps of the complete process, and cleaning of data is required to address quality
issues.
The data we have collected is not always useful to us, as some of it may
not be relevant. In real-world applications, collected data may have various issues, including:
o Missing Values
o Duplicate data
o Invalid data
o Noise
It is mandatory to detect and address the above issues because they can negatively affect the
quality of the outcome.
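A minimal data wrangling sketch with pandas is shown below; the file name data.csv and the column names are placeholders introduced here for illustration, not part of the original material:
Python
# Minimal data wrangling sketch: load raw data and address common quality issues.
import pandas as pd

df = pd.read_csv("data.csv")                      # hypothetical raw dataset

df = df.drop_duplicates()                         # remove duplicate rows
df = df.dropna(subset=["target"])                 # drop rows missing the target column
df["age"] = df["age"].fillna(df["age"].median())  # impute missing numeric values
df = df[df["age"].between(0, 120)]                # filter out invalid values

print(df.head())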
4. Data Analysis
Now the cleaned and prepared data is passed on to the analysis step.
The aim of this step is to build a machine learning model that analyzes the data using various
analytical techniques and to review the outcome. It starts with determining the type of
problem, where we select machine learning techniques
such as classification, regression, cluster analysis, association, etc., then build the model
using the prepared data, and evaluate the model.
Hence, in this step, we take the data and use machine learning algorithms to build the model.
5. Train Model
The next step is to train the model. In this step, we train our model to improve its
performance and obtain a better outcome for the problem.
We use datasets to train the model using various machine learning algorithms. Training a
model is required so that it can learn the various patterns, rules, and features.
6. Test Model
Once our machine learning model has been trained on a given dataset, we test the model.
In this step, we check the accuracy of our model by providing a test dataset to it.
Testing the model determines the percentage accuracy of the model as
per the requirements of the project or problem.
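The training and testing steps can be sketched with scikit-learn as follows, assuming scikit-learn is installed; the bundled Iris dataset is used here only as a stand-in for your own prepared data:
Python
# Minimal train/test sketch: split the data, train a model, then test it.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)

# Hold out part of the data as the test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = DecisionTreeClassifier().fit(X_train, y_train)   # train the model

y_pred = model.predict(X_test)                            # test the model
print("Test accuracy:", accuracy_score(y_test, y_pred))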
7. Deployment
In the final step, the trained and evaluated model is deployed in a real-world system so that it
can serve predictions to users, for example in recommendation systems, speech recognition, or
other applications.
Applications of Machine Learning
1. Image Recognition:
Image recognition is one of the most common applications of machine learning. It is used to
identify objects, persons, places, digital images, etc. A popular use case of image
recognition and face detection is the automatic friend tagging suggestion:
it is based on the Facebook project named "Deep Face," which is responsible for face
recognition and person identification in pictures.
2. Speech Recognition
While using Google, we get the option of "Search by voice"; this comes under speech
recognition, and it is a popular application of machine learning.
Speech recognition is the process of converting voice instructions into text, and it is also known
as "speech to text" or "computer speech recognition." At present, machine learning
algorithms are widely used in various speech recognition applications. Google
Assistant, Siri, Cortana, and Alexa use speech recognition technology to follow
voice instructions.
3. Traffic prediction:
If we want to visit a new place, we take the help of Google Maps, which shows us the correct
path with the shortest route and predicts the traffic conditions.
It predicts traffic conditions, such as whether traffic is clear, slow-moving, or heavily
congested, in two ways:
o Real-time location of the vehicle from the Google Maps app and sensors
o Average time taken on past days at the same time.
Everyone who uses Google Maps helps to make the app better: it takes information
from users and sends it back to its database to improve performance.
4. Product recommendations:
Machine learning is widely used by various e-commerce and entertainment companies such
as Amazon, Netflix, etc., for product recommendations to users. Whenever we search for
a product on Amazon, we start getting advertisements for the same product
while surfing the internet in the same browser, and this is because of machine learning.
Google understands user interest using various machine learning algorithms and suggests
products according to customer interest.
Similarly, when we use Netflix, we get recommendations for series,
movies, etc., and this is also done with the help of machine learning.
5. Self-driving cars:
One of the most exciting applications of machine learning is self-driving cars. Machine
learning plays a significant role in self-driving cars. Tesla, a well-known car
manufacturer, is working on self-driving cars and uses machine learning methods
to train its models to detect people and objects while driving.
6. Email Spam and Malware Filtering:
Whenever we receive a new email, it is automatically filtered as important, normal, or spam,
and the technology behind this is machine learning. Below are some of the spam filters used by
email providers:
o Content Filter
o Header filter
o General blacklists filter
o Rules-based filters
o Permission filters
Some machine learning algorithms such as Multi-Layer Perceptron, Decision tree,
and Naïve Bayes classifier are used for email spam filtering and malware detection.
7. Virtual Personal Assistants:
We have various virtual personal assistants such as Google Assistant, Alexa, Cortana, and Siri.
These assistants record our voice instructions, send them to a server in the cloud, decode them
using ML algorithms, and act accordingly.
8. Automatic Language Translation:
Nowadays, if we visit a new place and are not aware of the language, it is not a
problem at all; here too machine learning helps us by converting the text into
languages we know. Google's GNMT (Google Neural Machine Translation) provides this
feature; it is a neural machine translation model that translates text into our familiar
language, and this is called automatic translation.
Advantages of Machine Learning
1. Automation
A key advantage of Machine Learning is its capacity to automate repetitive and time-
consuming tasks, leading to improved productivity, cost savings, and minimized errors within
organizations.
For example, ML-driven chatbots deployed in customer service streamline interactions by
promptly addressing inquiries, recommending products, and comparing prices, thereby
reducing waiting times and augmenting customer satisfaction.
2. Scope of Improvement
Machine Learning is going to be used extensively in the education sector, and it will
enhance the quality of education and the student experience.
This technology has a very wide range of applications. Machine learning plays a role in
almost every field, like hospitality, ed-tech, medicine, science, banking, and business, and it
creates more opportunities.
6. Pattern Identification:
Machine Learning excels in discerning intricate trends and patterns within vast and complex
datasets, catalyzing transformative advancements across various industries. In healthcare, ML
algorithms analyze diverse data sources such as medical images and patient records to
facilitate early disease detection and tailor treatment plans to individual patients.
7. Variety of Applications:
Machine Learning exhibits exceptional versatility, permeating virtually every sector and facet
of modern life. In finance, ML underpins risk assessment and fraud detection initiatives,
while healthcare leverages ML for tasks ranging from diagnosis to drug discovery and
personalized medicine.
Disadvantages of Machine Learning
Nothing is perfect in the world. Machine Learning has some serious limitations that need to be
considered.
1. Data Acquisition
The whole concept of machine learning is about identifying useful data. The outcome will be
incorrect if a credible data source is not provided. The quality of the data is also significant: if
higher-quality data is needed, the user or institution has to wait for it, which causes delays in
providing the output. So, machine learning significantly depends on the data and its quality.
2. Time and Resources
The data that machines process remains huge in quantity and differs greatly. Machines
require time so that their algorithms can adjust to the environment and learn from it. Trial runs are
held to check the accuracy and reliability of the machine. It requires massive and expensive
resources and high-quality expertise to set up that quality of infrastructure.
3. Results Interpretation
One of the biggest disadvantages of machine learning is that the results we obtain
cannot be a hundred percent accurate; they will always have some degree of inaccuracy. For a high
degree of accuracy, algorithms should be developed so that they give reliable results.
Errors committed during the initial stages are significant, and if not corrected at that time, they
create havoc later. Bias and incorrectness have to be dealt with separately; they are not
interconnected. Machine learning depends on two factors, i.e., data and algorithm. All
errors are dependent on these two variables, and any incorrectness in either variable has
huge repercussions on the output.
5. Social Changes
Machine learning is bringing numerous social changes. The role of machine
learning-based technology in society has increased manifold. It is influencing the thought
processes of society and creating unwanted problems, such as character assassination and the
misuse of sensitive details, which disturb the social fabric.
Automation, artificial intelligence, and machine learning have removed the human interface
from some work and eliminated employment opportunities. Now, all those tasks are
conducted with the help of artificial intelligence and machine learning.
With the advancement of machine learning, the nature of jobs is changing. Much of the
work is now done by machines, taking over jobs that were previously done by
humans, and it is difficult for those without a technical education to adjust to these changes.
8. Highly Expensive
Machine learning software and infrastructure are highly expensive, and not everybody can own
them. Government agencies, big private firms, and enterprises mostly own them. They need to
be made accessible to everybody for wide use.
9. Privacy Concern
One of the pillars of machine learning is data, and the collection of data has
raised fundamental questions about privacy. The way data is collected and used for
commercial purposes has always been a contentious issue.
Machine learning is still an evolving field. It has not yet seen major developments
that fully revolutionize any economic sector, and the area requires continuous research and
innovation.
Artificial Intelligence
Artificial Intelligence is basically the mechanism to incorporate human intelligence into
machines through a set of rules (algorithms). AI is a combination of two words: "Artificial",
meaning something made by humans or non-natural, and "Intelligence", meaning the
ability to understand or think accordingly. Another definition could be that "AI is basically
the study of training your machines (computers) to mimic a human brain and its
thinking capabilities".
AI focuses on three major aspects (skills): learning, reasoning, and self-correction, to obtain
the maximum efficiency possible.
Machine Learning:
Machine Learning is basically the study/process which enables a system (computer) to
learn automatically on its own through experience and to improve accordingly, without
being explicitly programmed. ML is an application or subset of AI. ML focuses on the
development of programs so that the system can access data and use it for itself. The entire
process makes observations on the data to identify possible patterns and make better
future decisions based on the examples provided. The major aim of ML is to allow
systems to learn by themselves through experience, without any kind of human
intervention or assistance.
Deep Learning:
Deep Learning is basically a sub-part of the broader family of Machine Learning which
makes use of neural networks (similar to the neurons working in our brain) to mimic
human-brain-like behaviour. DL algorithms focus on information-processing
patterns to identify patterns just as our human brain does and to
classify the information accordingly. DL works on larger sets of data than
ML, and the prediction mechanism is self-administered by the machine.
Below is a comparison of the differences between Artificial Intelligence, Machine Learning and Deep
Learning:
- Scope: AI is the broader family consisting of ML and DL as its components; ML is a subset of AI; DL is a subset of ML.
- Definition: AI is a computer algorithm which exhibits intelligent behaviour; ML is an AI algorithm which allows a system to learn from data; DL is an ML algorithm that uses deep (more than one layer) neural networks to analyze data.
- Aim: The aim of AI is basically to increase the chances of success and not accuracy; the aim of ML is to increase accuracy without caring much about the success ratio; DL attains the highest rank in terms of accuracy when it is trained with a large amount of data.
- Categories: Three broad categories/types of AI are Artificial Narrow Intelligence (ANI), Artificial General Intelligence (AGI) and Artificial Super Intelligence (ASI); three broad categories/types of ML are Supervised Learning, Unsupervised Learning and Reinforcement Learning; DL can be considered as neural networks with a large number of parameters and layers lying in one of the four fundamental network architectures: Unsupervised Pre-trained Networks, Convolutional Neural Networks, Recurrent Neural Networks and Recursive Neural Networks.
- Examples: Examples of AI applications include ride-sharing apps like Uber and Lyft and an AI autopilot; examples of ML applications include virtual assistants (Siri, Alexa, Google, etc.) and email spam filtering; examples of DL applications include news aggregation, image analysis and decision-making.
- Approach: AI systems can be rule-based, knowledge-based, or data-driven; in reinforcement learning (an ML approach), the algorithm learns by trial and error, receiving feedback in the form of rewards or punishments; DL networks consist of multiple layers of interconnected neurons that process data in a hierarchical manner, allowing them to learn increasingly complex representations of the data.
Examples of Artificial Intelligence:
Medical diagnosis: AI-powered medical diagnosis systems analyze medical
images and other patient data to help doctors make more accurate diagnoses and
treatment plans.
Autonomous vehicles: Self-driving cars and other autonomous vehicles use AI
algorithms and sensors to analyze their environment and make decisions about
speed, direction, and other factors.
Virtual Personal Assistants (VPA) like Siri or Alexa – these use natural
language processing to understand and respond to user requests, such as playing
music, setting reminders, and answering questions.
Autonomous vehicles – self-driving cars use AI to analyze sensor data, such as
cameras and lidar, to make decisions about navigation, obstacle avoidance, and
route planning.
Fraud detection – financial institutions use AI to analyze transactions and
detect patterns that are indicative of fraud, such as unusual spending patterns or
transactions from unfamiliar locations.
Image recognition – AI is used in applications such as photo organization,
security systems, and autonomous robots to identify objects, people, and scenes
in images.
Natural language processing – AI is used in chatbots and language translation
systems to understand and generate human-like text.
Predictive analytics – AI is used in industries such as healthcare and marketing
to analyze large amounts of data and make predictions about future events, such
as disease outbreaks or consumer behavior.
Game-playing AI – AI algorithms have been developed to play games such as
chess, Go, and poker at a superhuman level, by analyzing game data and making
predictions about the outcomes of moves.
Examples of Machine Learning:
Machine Learning (ML) is a subset of Artificial Intelligence (AI) that involves the use of
algorithms and statistical models to allow a computer system to “learn” from data and
improve its performance over time, without being explicitly programmed to do so.
Image recognition: Machine learning algorithms are used in image recognition systems to
identify objects, people, and scenes in images. These systems are used in a
variety of applications, such as self-driving cars, security systems, and medical
imaging.
Speech recognition: Machine learning algorithms are used in speech
recognition systems to transcribe speech and identify the words spoken. These
systems are used in virtual assistants like Siri and Alexa, as well as in call
centers and other applications.
Natural language processing (NLP): Machine learning algorithms are used in
NLP systems to understand and generate human language. These systems are
used in chatbots, virtual assistants, and other applications that involve natural
language interactions.
Recommendation systems: Machine learning algorithms are used in
recommendation systems to analyze user data and recommend products or
services that are likely to be of interest. These systems are used in e-commerce
sites, streaming services, and other applications.
Sentiment analysis: Machine learning algorithms are used in sentiment analysis
systems to classify the sentiment of text or speech as positive, negative, or
neutral. These systems are used in social media monitoring and other
applications.
Predictive maintenance: Machine learning algorithms are used in predictive
maintenance systems to analyze data from sensors and other sources to predict
when equipment is likely to fail, helping to reduce downtime and maintenance
costs.
Spam filters in email – ML algorithms analyze email content and metadata to
identify and flag messages that are likely to be spam.
Recommendation systems – ML algorithms are used in e-commerce websites
and streaming services to make personalized recommendations to users based on
their browsing and purchase history.
Predictive maintenance – ML algorithms are used in manufacturing to predict
when machinery is likely to fail, allowing for proactive maintenance and
reducing downtime.
Credit risk assessment – ML algorithms are used by financial institutions to
assess the credit risk of loan applicants, by analyzing data such as their income,
employment history, and credit score.
Customer segmentation – ML algorithms are used in marketing to segment
customers into different groups based on their characteristics and behavior,
allowing for targeted advertising and promotions.
Fraud detection – ML algorithms are used in financial transactions to detect
patterns of behavior that are indicative of fraud, such as unusual spending
patterns or transactions from unfamiliar locations.
Speech recognition – ML algorithms are used to transcribe spoken words into
text, allowing for voice-controlled interfaces and dictation software.
Examples of Deep Learning:
Deep Learning is a type of Machine Learning that uses artificial neural networks with
multiple layers to learn and make decisions.
Fraud detection – Deep Learning algorithms are used in financial transactions
to detect patterns of behavior that are indicative of fraud, such as unusual
spending patterns or transactions from unfamiliar locations.
Game-playing AI – Deep Learning algorithms have been used to develop game-
playing AI that can compete at a superhuman level, such as the AlphaGo AI that
defeated the world champion in the game of Go.
Time series forecasting – Deep Learning algorithms are used to forecast future
values in time series data, such as stock prices, energy consumption, and
weather patterns.
1.1.2 Types of Machine Learning
There are several types of machine learning, each with special characteristics and
applications. Some of the main types of machine learning algorithms are as follows:
1. Supervised Machine Learning
2. Unsupervised Machine Learning
3. Semi-Supervised Machine Learning
4. Reinforcement Learning
1. Supervised Machine Learning
Supervised learning is a type of machine learning in which the model is trained on a labelled dataset:
the learning algorithm learns to map inputs to the correct outputs, and both the
training and validation datasets are labelled.
Supervised Learning
There are two main types of supervised learning problems: classification and regression.
1. Classification
Classification deals with predicting categorical target variables, which represent discrete classes or
labels, for example classifying emails as spam or not spam. Classification algorithms learn to map
the input features to one of the discrete classes.
Here are some classification algorithms:
Logistic Regression
Support Vector Machine
Random Forest
Decision Tree
K-Nearest Neighbors (KNN)
Naive Bayes
2. Regression
Regression, on the other hand, deals with predicting continuous target variables, which
represent numerical values. For example, predicting the price of a house based on its size,
location, and amenities, or forecasting the sales of a product. Regression algorithms learn to
map the input features to a continuous numerical value.
Here are some regression algorithms:
Linear Regression
Polynomial Regression
Ridge Regression
Lasso Regression
Decision tree
Random Forest
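As a minimal sketch of the two problem types, the following scikit-learn snippet trains one classifier and one regressor on tiny made-up datasets (assuming scikit-learn is installed):
Python
# Minimal supervised-learning sketch: one classification and one regression model.
from sklearn.linear_model import LogisticRegression, LinearRegression

X = [[1], [2], [3], [4], [5], [6]]       # single input feature (made-up values)
y_class = [0, 0, 0, 1, 1, 1]             # categorical target -> classification
y_reg = [110, 190, 310, 390, 510, 590]   # continuous target  -> regression

clf = LogisticRegression().fit(X, y_class)
reg = LinearRegression().fit(X, y_reg)

print("Predicted class:", clf.predict([[3.5]]))
print("Predicted value:", reg.predict([[3.5]]))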
Advantages of Supervised Machine Learning
Supervised Learning models can have high accuracy as they are trained
on labelled data.
The process of decision-making in supervised learning models is often
interpretable.
It can often build on pre-trained models, which saves time and resources compared with
developing new models from scratch.
Disadvantages of Supervised Machine Learning
It has limitations in recognizing patterns and may struggle with unseen or
unexpected patterns that are not present in the training data.
It can be time-consuming and costly, as it relies on labelled data.
It may generalize poorly to new data.
Applications of Supervised Learning
Supervised learning is used in a wide variety of applications, including:
Image classification: Identify objects, faces, and other features in images.
Natural language processing: Extract information from text, such as sentiment,
entities, and relationships.
Speech recognition: Convert spoken language into text.
Recommendation systems: Make personalized recommendations to users.
Predictive analytics: Predict outcomes, such as sales, customer churn, and stock
prices.
Medical diagnosis: Detect diseases and other medical conditions.
Fraud detection: Identify fraudulent transactions.
Autonomous vehicles: Recognize and respond to objects in the environment.
Email spam detection: Classify emails as spam or not spam.
Quality control in manufacturing: Inspect products for defects.
Credit scoring: Assess the risk of a borrower defaulting on a loan.
Gaming: Recognize characters, analyze player behavior, and create NPCs.
Customer support: Automate customer support tasks.
Weather forecasting: Make predictions for temperature, precipitation, and other
meteorological parameters.
Sports analytics: Analyze player performance, make game predictions, and
optimize strategies.
2. Unsupervised Machine Learning
Unsupervised learning is a type of machine learning technique in which an algorithm
discovers patterns and relationships using unlabeled data. Unlike supervised learning,
unsupervised learning doesn’t involve providing the algorithm with labeled target outputs.
The primary goal of Unsupervised learning is often to discover hidden patterns, similarities,
or clusters within the data, which can then be used for various purposes, such as data
exploration, visualization, dimensionality reduction, and more.
Unsupervised Learning
Disadvantages of Unsupervised Machine Learning
- Without using labels, it may be difficult to judge the quality of the model's output.
- Cluster interpretability may not be clear, and the clusters may not have meaningful
interpretations.
That said, unsupervised learning offers techniques such as autoencoders and dimensionality
reduction that can be used to extract meaningful features from raw data.
Applications of Unsupervised Learning
Here are some common applications of unsupervised learning:
- Clustering: Group similar data points into clusters.
- Anomaly detection: Identify outliers or anomalies in data.
- Dimensionality reduction: Reduce the dimensionality of data while preserving its essential
information.
- Recommendation systems: Suggest products, movies, or content to users based on their
historical behavior or preferences.
- Topic modeling: Discover latent topics within a collection of documents.
- Density estimation: Estimate the probability density function of data.
- Image and video compression: Reduce the amount of storage required for multimedia
content.
- Data preprocessing: Help with data preprocessing tasks such as data cleaning, imputation
of missing values, and data scaling.
- Market basket analysis: Discover associations between products.
- Genomic data analysis: Identify patterns or group genes with similar expression
profiles.
- Image segmentation: Segment images into meaningful regions.
- Community detection in social networks: Identify communities or groups of individuals
with similar interests or connections.
- Customer behavior analysis: Uncover patterns and insights for better marketing
and product recommendations.
- Content recommendation: Classify and tag content to make it easier to recommend
similar items to users.
- Exploratory data analysis (EDA): Explore data and gain insights before defining
specific tasks.
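A minimal clustering sketch with scikit-learn is shown below; the customer numbers are made up purely to illustrate grouping unlabelled data:
Python
# Minimal unsupervised-learning sketch: k-means clustering with scikit-learn.
from sklearn.cluster import KMeans

# Tiny illustrative dataset: (annual spend, number of purchases) per customer
X = [[500, 5], [520, 6], [480, 4],
     [3000, 40], [3100, 42], [2900, 38]]

# Group the customers into two clusters without using any labels
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

print(kmeans.labels_)            # cluster assigned to each customer
print(kmeans.cluster_centers_)   # centre of each cluster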
3. Semi-Supervised Learning
Semi-supervised learning is a machine learning approach that sits between
supervised and unsupervised learning, so it uses both labelled and unlabelled data. It is
particularly useful when obtaining labelled data is costly, time-consuming, or resource-
intensive. Semi-supervised learning is chosen when labelling data requires skills and relevant
resources in order to train or learn from it.
We use these techniques when we are dealing with data that is a little bit labeled and the
rest large portion of it is unlabeled. We can use the unsupervised techniques to predict
labels and then feed these labels to supervised techniques. This technique is mostly
applicable in the case of image data sets where usually all images are not labeled.
Semi-Supervised Learning
There are a number of semi-supervised learning methods; some common ones include:
Graph-based semi-supervised learning: This approach uses a graph to
represent the relationships between the data points. The graph is then used to
propagate labels from the labeled data points to the unlabeled data points.
Label propagation: This approach iteratively propagates labels from the labeled
data points to the unlabeled data points, based on the similarities between the
data points.
Co-training: This approach trains two different machine learning models on
different subsets of the unlabeled data. The two models are then used to label
each other’s predictions.
Self-training: This approach trains a machine learning model on the labeled
data and then uses the model to predict labels for the unlabeled data. The model
is then retrained on the labeled data and the predicted labels for the unlabeled
data.
Generative adversarial networks (GANs): GANs are a type of deep learning
algorithm that can be used to generate synthetic data. GANs can be used to
generate unlabeled data for semi-supervised learning by training two neural
networks, a generator and a discriminator.
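A minimal sketch of the self-training idea with scikit-learn is shown below; the tiny dataset is made up, and unlabelled points are marked with -1 as scikit-learn expects:
Python
# Minimal semi-supervised sketch: self-training with scikit-learn.
from sklearn.semi_supervised import SelfTrainingClassifier
from sklearn.svm import SVC

X = [[1], [2], [3], [8], [9], [10], [4], [7]]
y = [0, 0, 0, 1, 1, 1, -1, -1]   # the last two points are unlabelled

# The base classifier must expose predict_proba, hence probability=True
model = SelfTrainingClassifier(SVC(probability=True, gamma="auto"))
model.fit(X, y)

print(model.predict([[2.5], [8.5]]))   # labels predicted for new points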
Advantages of Semi- Supervised Machine Learning
It leads to better generalization as compared to supervised learning, as it takes
both labeled and unlabeled data.
Can be applied to a wide range of data.
Disadvantages of Semi- Supervised Machine Learning
Semi-supervised methods can be more complex to implement compared to
other approaches.
It still requires some labelled data, which might not always be available or easy to
obtain.
Noisy or unrepresentative unlabelled data can negatively affect model performance.
Applications of Semi-Supervised Learning
Here are some common applications of semi-supervised learning:
Image Classification and Object Recognition: Improve the accuracy of models
by combining a small set of labeled images with a larger set of unlabeled
images.
Natural Language Processing (NLP): Enhance the performance of language
models and classifiers by combining a small set of labeled text data with a vast
amount of unlabeled text.
Speech Recognition: Improve the accuracy of speech recognition by leveraging
a limited amount of transcribed speech data and a more extensive set of
unlabeled audio.
Recommendation Systems: Improve the accuracy of personalized
recommendations by supplementing a sparse set of user-item interactions
(labeled data) with a wealth of unlabeled user behavior data.
Healthcare and Medical Imaging: Enhance medical image analysis by utilizing
a small set of labeled medical images alongside a larger set of unlabeled images.
4. Reinforcement Machine Learning
A reinforcement learning algorithm is a learning method that interacts with an
environment by producing actions and discovering errors. Trial and error and delayed reward are the
most relevant characteristics of reinforcement learning. In this technique, the model keeps
improving its performance using reward feedback to learn the behaviour or pattern.
These algorithms are tailored to a specific problem, e.g. the Google self-driving car, or
AlphaGo, where a bot competes with humans and even with itself to become a better and better
player in the game of Go. Each time the agent acts, it learns and adds the experience to its
knowledge, which serves as its training data; the more it learns, the better trained and more
experienced it becomes.
Here are some of the most common reinforcement learning algorithms:
a. Q-learning: Q-learning is a model-free RL algorithm that learns a Q-function,
which maps state-action pairs to expected rewards. The Q-function estimates the expected reward of
taking a particular action in a given state.
b. SARSA (State-Action-Reward-State-Action): SARSA is another model-free
RL algorithm that learns a Q-function. However, unlike Q-learning, SARSA
updates the Q-function for the action that was actually taken, rather than the
optimal action.
c. Deep Q-learning: Deep Q-learning is a combination of Q-learning and deep
learning. Deep Q-learning uses a neural network to represent the Q-function,
which allows it to learn complex relationships between states and actions.
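To make the Q-learning update concrete, here is a small self-contained sketch on a made-up 5-state corridor environment; the environment, rewards, and hyperparameters are purely illustrative:
Python
# Minimal tabular Q-learning sketch on a toy 5-state corridor.
# States are 0..4; taking "right" in state 4 ends the episode with reward 1.
import random

n_states, n_actions = 5, 2           # actions: 0 = left, 1 = right
alpha, gamma = 0.5, 0.9              # learning rate and discount factor
Q = [[0.0, 0.0] for _ in range(n_states)]

def step(state, action):
    """Apply an action and return (next_state, reward, done)."""
    if action == 1 and state == n_states - 1:
        return state, 1.0, True       # goal reached
    nxt = min(state + 1, n_states - 1) if action == 1 else max(state - 1, 0)
    return nxt, 0.0, False

for episode in range(300):
    state = 0
    for _ in range(100):              # cap the episode length
        action = random.randrange(n_actions)      # explore with a random policy
        nxt, reward, done = step(state, action)
        # Q-learning update: move Q towards reward + discounted best future value
        Q[state][action] += alpha * (reward + gamma * max(Q[nxt]) - Q[state][action])
        state = nxt
        if done:
            break

# After training, the greedy policy prefers moving right in every state
print([max(range(n_actions), key=lambda a: Q[s][a]) for s in range(n_states)])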
Reinforcement Machine Learning
- This technique is preferred for achieving long-term results, which are otherwise very difficult to achieve.
1.1.3 Machine Learning tools
Machine learning has witnessed exponential growth in tools and frameworks designed to help data scientists and
engineers efficiently build and deploy ML models. Below is a detailed overview of some of the top machine learning
tools, highlighting their key features.
1. Microsoft Azure Machine Learning
Microsoft Azure Machine Learning is a cloud-based environment you can use to train, deploy, automate, manage, and track ML
models. It is designed to help data scientists and ML engineers leverage their existing data processing and model
development skills & frameworks.
Key Features:
2. IBM Watson
IBM Watson is an enterprise-ready AI services, applications, and tooling suite. It provides various tools for data
analysis, natural language processing, and machine learning model development and deployment.
Key Features
3. TensorFlow
TensorFlow, an open-source software library, facilitates numerical computation through data flow graphs.
Developed by the Google Brain team's researchers and engineers, it is utilized both in research and production
activities within Google.
Key Features
4. Amazon Machine Learning
Amazon Machine Learning is a cloud service that makes it easy for professionals of all skill levels to use machine
learning technology. It provides visualization tools and wizards to create machine learning models without learning
complex ML algorithms and technology.
Key Features
Integration with Amazon S3, Redshift, and RDS for data storage.
5. OpenNN
OpenNN is an open-source neural network library written in C++. It is designed to implement neural networks
flexibly and robustly, focusing on advanced analytics.
Key Features
6. PyTorch
PyTorch, a machine learning framework that's open-source and built upon the Torch library, supports a wide range
of applications, including computer vision and natural language processing. It's celebrated for its adaptability and its
capacity to dynamically manage computational graphs.
Key Features
7. Vertex AI
Vertex AI is Google Cloud's AI platform. It consolidates its ML offerings into a unified API, client library, and user
interface. This enables ML engineers and data scientists to accelerate the development and maintenance of artificial
intelligence models.
Key Features
Unified tooling and workflow for model training, hosting, and deployment.
Integration with Google Cloud services for storage, data analysis, and more.
8. BigML
BigML is a machine learning platform that helps users create, deploy, and maintain machine learning models. It
offers a comprehensive environment for preprocessing, machine learning, and model evaluation tasks.
Key Features
9. Apache Mahout
Apache Mahout serves as a scalable linear algebra framework and offers a mathematically expressive Scala-based
domain-specific language (DSL). This design aims to facilitate the rapid development of custom algorithms by
mathematicians, statisticians, and data scientists. Its primary areas of application include filtering, clustering, and
classification, streamlining these processes for professionals in the field.
Key Features
10. Weka
Weka is an open-source software suite written in Java, designed for data mining tasks. It includes a variety of
machine learning algorithms geared towards tasks such as data pre-processing, classification, regression, clustering,
discovering association rules, and data visualization.
Key Features
11. Scikit-learn
Scikit-learn is a free, open-source library dedicated to machine learning within the Python ecosystem. It
is celebrated for its user-friendly nature and straightforwardness, offering an extensive array of supervised and
unsupervised learning algorithms. Built on foundational libraries such as NumPy, SciPy, and matplotlib, it
is a primary choice for data mining and analysis tasks.
Key Features
Comprehensive collection of algorithms for classification, regression, clustering, and dimensionality reduction.
Tools for model selection, evaluation, and preprocessing.
Google Cloud AutoML
Google Cloud AutoML offers a collection of machine learning tools designed to help developers with minimal ML
knowledge create tailored, high-quality models for their unique business requirements. It leverages Google's
advanced transfer learning and neural architecture search technologies.
Key Features
Integration with Google Cloud services for seamless deployment and scalability.
Colab
Colab, or Google Colaboratory, is a free cloud service based on Jupyter Notebooks that supports Python. It is
designed to facilitate ML education and research with no setup required. Colab provides an easy way to write and
execute arbitrary Python code through the browser.
Key Features
Integration with Google Drive for easy storage and access to notebooks.
KNIME
KNIME is an open-source data analytics, reporting, and integration platform allowing users to create data flows
visually, selectively execute some or all analysis steps, and inspect the results, models, and interactive views.
Key Features
Wide range of nodes for data integration, transformation, analysis, and visualization.
Keras
Keras, a Python-based open-source library for neural networks, facilitates swift experimentation in the realm of deep
learning. Serving as an interface for TensorFlow, it simplifies the construction and training of models.
Key Features
RapidMiner
RapidMiner serves as a comprehensive data science tool, offering a cohesive platform for tasks like data prep,
machine learning, deep learning, text mining, and predictive analytics. It caters to users of varying expertise,
accommodating both novices and seasoned professionals.
Key Features
Extensive collection of algorithms for data analysis.
Shogun
Shogun is a freely available machine learning library that encompasses a wide range of efficient and cohesive
techniques. Developed in C++, it features interfaces for several programming languages, including C++, Python, R,
Java, Ruby, Lua, and Octave.
Key Features
Supports many ML algorithms and frameworks for regression, classification, and clustering.
Integration with other scientific computing libraries.
Project Jupyter
Project Jupyter is a free, open-source initiative designed to enhance interactive data science and scientific computing
across various programming languages. Originating from the IPython project, it offers a comprehensive framework
for interactive computing, including notebooks, code, and data management.
Key Features
Amazon SageMaker
Amazon SageMaker empowers all developers and data scientists to create, train, and deploy
ML models with ease. It simplifies and streamlines every stage of the machine learning
workflow. Discover how to efficiently use Amazon SageMaker to develop, train, optimize,
and deploy machine learning models.
Key Features
Apache Spark
Apache Spark serves as an integrated analytics engine designed to process data on a large
scale. It offers advanced APIs for Java, Scala, Python, and R, alongside an efficient engine
that backs versatile computation graphs for data analysis. Engineered for rapid processing,
Spark enables in-memory computation and supports a range of machine learning algorithms
through its MLlib library.
Key Features
Installation of Python
The process of installing Python on the Windows operating system
is relatively easy and involves a few uncomplicated steps. This section
aims to take you through the process of downloading and installing
Python on your Windows computer.
How to Install Python in Windows?
We have provided step-by-step instructions to guide you and ensure a
successful installation. Whether you are new to programming or have
some experience, mastering how to install Python on Windows will
enable you to utilize this potent language and uncover its full range of
potential applications.
To download Python on your system, you can use the following steps
Step 1: Select Version to Install Python
Visit the official page for Python https://www.python.org/downloads/ on the Windows
operating system. Locate a reliable version of Python 3, preferably version 3.10.11, which
was used in testing this tutorial. Choose the correct link for your device from the options
provided: either Windows installer (64-bit) or Windows installer (32-bit) and proceed to
download the executable file.
Python Homepage
Python Installer
After Clicking the Install Now Button the setup will start installing Python on your
Windows system. You will see a window like this.
Python Setup
Step 3: Running the Executable Installer
After completing the setup, Python will be installed on your Windows system and you will see
a success message.
Python version
You can also check the version of Python by opening the IDLE application. Go to Start and
enter IDLE in the search bar, then click the IDLE app, for example IDLE (Python
3.10.11 64-bit). If you can see the Python IDLE window, then you have successfully
downloaded and installed Python on Windows.
Python IDLE
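You can also confirm the installation from a terminal (Command Prompt or PowerShell) by checking the interpreter version:
python --version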
Installation of Tools
In a terminal, run the following to install the flake8 linter and the black formatter referenced in the settings below:

python -m pip install flake8 black

Then add these entries to your Visual Studio Code settings.json so that flake8 is used for linting and black for formatting (the blackPath value should point to the black executable):

{
    "python.linting.enabled": true,
    "python.linting.flake8Enabled": true,
    "python.linting.flake8Path": "flake8",
    "python.formatting.provider": "black",
    "python.formatting.blackPath": "black",
    "python.formatting.blackArgs": []
}
Environment Testing
Think of how you might test the lights on a car. You would turn on the lights (known as the
test step) and go outside the car or ask a friend to check that the lights are on (known as the
test assertion). Testing multiple components is known as integration testing.
You have just seen two types of tests:
1. An integration test checks that components in your application operate with each
other.
2. A unit test checks a small component in your application.
You can write both integration tests and unit tests in Python. To write a unit test for the built-
in function sum(), you would check the output of sum() against a known output.
For example, here’s how you check that the sum() of the numbers (1, 2, 3) equals 6:
Python
>>> assert sum([1, 2, 3]) == 6, "Should be 6"
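To turn that assertion into a repeatable test, here is a minimal sketch using Python's built-in unittest module; the file name and test name are illustrative:
Python
# test_sum.py - a minimal unit test for the built-in sum() function.
import unittest

class TestSum(unittest.TestCase):
    def test_sum_of_list(self):
        # The unit under test is sum(); the assertion checks its output.
        self.assertEqual(sum([1, 2, 3]), 6, "Should be 6")

if __name__ == "__main__":
    unittest.main()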
Data: data is a distinct piece of information that is gathered and
translated for some purpose. Data is information that has been translated
into a form that is efficient for movement or processing.
Information: Information is a result of processing or transforming data
into a useful form. We understand information because it's more
organized and has context. Information can be in the form of graphs,
tables, or videos.
Dataset: A dataset is an organized collection of data. The most basic
representation of a dataset is data elements presented in tabular form.
Each column represents a particular variable. Each row corresponds to a
given value of that column's variable.
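As a small illustration of that tabular view (the column names and values below are made up):
Python
# A tiny tabular dataset: each column is a variable, each row is one record.
import pandas as pd

dataset = pd.DataFrame({
    "name": ["Alice", "Bob", "Carol"],
    "age": [34, 29, 41],
    "city": ["Kigali", "Huye", "Musanze"],
})
print(dataset)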
Data warehouse: A data warehouse (DW) is a digital storage system that
connects and harmonizes large amounts of data from many different
sources.
Big data: Big Data is a collection of data that is huge in volume and
growing exponentially with time. It is data of such large size and
complexity that no traditional data management tool can store it or
process it efficiently. In short, big data is data of a huge size.
Types Of Big Data
1. Structured
2. Unstructured
3. Semi-structured
Structured
Any data that can be stored, accessed and processed in a fixed format is termed
'structured' data.
However, nowadays we are foreseeing issues when the size of such data grows to a huge
extent; typical sizes are in the range of multiple zettabytes.
Do you know? 10^21 bytes equal 1 zettabyte, i.e. one billion terabytes form a zettabyte.
Looking at these figures one can easily understand why the name Big Data is given and
imagine the challenges involved in its storage and processing.
Do you know? Data stored in a relational database management system is one example of
a ‘structured’ data.
Unstructured
Any data with unknown form or the structure is classified as unstructured data. In addition to
the size being huge, un-structured data poses multiple challenges in terms of its processing
for deriving value out of it. A typical example of unstructured data is a heterogeneous data
source containing a combination of simple text files, images, videos etc.
Example Of Un-structured Data
Semi-structured
Semi-structured data can contain both forms of data. We can see semi-structured data as
structured in form, but it is actually not defined with, for example, a table definition as in a
relational DBMS. An example of semi-structured data is data represented in an XML file:
<rec><name>Prashant Rao</name><sex>Male</sex><age>35</age></rec>
<rec><name>Seema R.</name><sex>Female</sex><age>41</age></rec>
<rec><name>Satish Mane</name><sex>Male</sex><age>29</age></rec>
<rec><name>Subrato Roy</name><sex>Male</sex><age>26</age></rec>
<rec><name>Jeremiah J.</name><sex>Male</sex><age>35</age></rec>
Characteristics Of Big Data
Volume: The name Big Data itself is related to a size which is enormous. ‘Volume’
is one characteristic which needs to be considered while dealing with Big Data
solutions.
Variety: Variety refers to heterogeneous sources and the nature of data, both
structured and unstructured. Nowadays, data in the form of emails, photos, videos,
monitoring devices, PDFs, audio, etc. are also being considered in the analysis
applications. This variety of unstructured data poses certain issues for storage, mining
and analyzing data.
Velocity: The term ‘velocity’ refers to the speed of generation of data. How fast the
data is generated and processed to meet the demands, determines real potential in the
data. Big Data Velocity deals with the speed at which data flows in from sources like
business processes, application logs, networks, and social media sites,
sensors, Mobile devices, etc. The flow of data is massive and continuous.
Variability: This refers to the inconsistency which can be shown by the data at times,
thus hampering the process of being able to handle and manage the data effectively.
Before we dive into such topics as ML and Data Science, and try to explain how it works, we
should answer several questions:
What can we achieve in business or on the project with the help of ML? What
goals do I want to accomplish using ML?
Do I only want to hop on the trend, or will the use of ML really improve user
experience, increase profitability, or protect my product and its users?
Clearly outline the objectives of the data collection process and the specific research
questions you want to answer. This step will guide the entire process and ensure you
collect the right data to meet your goals.
Also, it is recommended to identify data sources. Determine the sources from which you
will collect data. These sources may include primary data (collected directly for your study)
or secondary data (previously collected by others). Common data sources include surveys,
interviews, existing databases, observation, experiments, and online platforms.
In this stage, it is better to start with the selection of data collection methods. Choose the
appropriate methods to collect data from the identified sources. The methods may vary
depending on the nature of the data and research objectives.
Common methods include, for example, sensor data collection: gathering data from sensors or IoT devices.
The next step is very crucial. Ensuring data quality means reviewing the collected data to
check for errors, inconsistencies, or missing values. Apply quality assurance techniques to
ensure the data is reliable and suitable for analysis.
The following step would be data storage and management. It will require organizing and
storing the collected data in a secure and accessible manner. Consider using databases or
other data management systems for efficient storage and retrieval.
1. Synthetic Data
Synthetic data is any information manufactured artificially that does not represent events or
objects in the real world. Algorithms create synthetic data used in model datasets for testing
or training purposes. This data can mimic operational or production data and help train ML
models or test mathematical models.
2. Active Learning
Active learning is a machine learning technique that focuses on selecting the most
informative data points to label or annotate from an unlabeled dataset. The aim of active
learning is to reduce the amount of labeled data required to build an accurate model by
strategically choosing which instances to query for labels. This is especially useful when
labeling data can be time-consuming or expensive.
The main steps of active learning are:
- Initial Data Collection: Initially, a small labeled dataset is collected through random sampling or any other standard method.
- Model Training: The initial labeled data is used to train a machine learning model.
- Uncertainty Estimation: The model is then used to predict the labels of the remaining unlabeled data points. During this process, the model's uncertainty about its predictions is often estimated. There are various ways to measure uncertainty, such as entropy, margin sampling, and least confidence.
- Query Strategy Selection: A query strategy is chosen to decide which data points to request labels for. The query strategy selects instances with high uncertainty, as these instances are likely to have the most impact on improving the model's performance. There are a few methods to apply query strategy selection: uncertainty sampling, diversity sampling, and representative sampling.
- Labeling New Data Points: The selected instances are then sent for labeling or annotation by domain experts.
- Model Update: The newly labeled data is added to the labeled dataset, and the model is retrained using the expanded labeled set.
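A minimal sketch of the uncertainty-sampling idea described above, assuming scikit-learn is installed; the data and the number of queried points are purely illustrative:
Python
# Minimal uncertainty-sampling sketch for active learning.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_labeled = np.array([[0.1], [0.2], [0.9], [1.0]])
y_labeled = np.array([0, 0, 1, 1])
X_unlabeled = rng.random((20, 1))        # pool of unlabelled points

# 1. Train on the small labelled set
model = LogisticRegression().fit(X_labeled, y_labeled)

# 2. Estimate uncertainty: least-confidence score for each unlabelled point
proba = model.predict_proba(X_unlabeled)
uncertainty = 1.0 - proba.max(axis=1)

# 3. Query strategy: pick the 3 most uncertain points to send for labelling
query_idx = np.argsort(uncertainty)[-3:]
print(X_unlabeled[query_idx])            # these points go to the human annotator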
3. Transfer Learning
Transfer learning is a popular approach for training models when there is not enough training data or time to
train from scratch. A common technique is to start from an existing, well-trained model
(also called the source task) and incrementally train a new model (the target task) that
already performs well.
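As an illustration, here is a hedged Keras sketch of the idea: reuse a pre-trained image model as a frozen feature extractor and train only a small new classification head. The dataset, input size, and number of target classes are placeholders, not taken from the original material:
Python
# Minimal transfer-learning sketch with Keras: freeze a pre-trained base model
# and train only a new classification head on the target task.
import tensorflow as tf

base = tf.keras.applications.MobileNetV2(
    input_shape=(160, 160, 3), include_top=False, weights="imagenet")
base.trainable = False                      # freeze the source-task weights

model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(2, activation="softmax"),   # new head for 2 target classes
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# model.fit(target_images, target_labels, epochs=5)   # train only the new head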
4. Open Source Datasets
Open source datasets are a valuable resource in data collection for various research and
analysis purposes. These datasets are typically publicly available and can be freely accessed,
used, and redistributed. Leveraging open source datasets can save time, resources, and effort,
as they often come pre-cleaned and curated, ready for analysis.
Common steps when working with open source datasets are:
- Identifying Suitable Datasets: Start by identifying the relevant open source datasets for your research or analysis needs. There are various platforms and repositories where you can find open source datasets, such as Kaggle, data.gov, UCI Machine Learning Repository, GitHub, Google Dataset Search, and many others.
- Data Exploration: Before using a dataset, it's essential to explore its contents to understand its structure, the variables available, and the quality of the data. This preliminary analysis will help you determine if the dataset meets your research requirements.
- Data Licensing: Pay close attention to the licensing terms associated with the open source dataset. Some datasets might have specific conditions for usage and redistribution, while others may be entirely open for any purpose. Make sure to adhere to the terms of use.
- Data Preprocessing: Although open source datasets are usually pre-cleaned, they may still require some preprocessing to fit your specific needs. This step could involve handling missing data, normalizing values, encoding categorical variables, and other data transformations.
- Ethical Considerations: Ensure that the data you are using does not contain sensitive or private information.
- Data Integration: In some cases, your research might require data from multiple sources. Open source datasets can be combined with proprietary data or other open source datasets to enhance the scope and depth of your analysis.
- Validation and Quality Control: Just like with any data, it's crucial to validate the open source dataset for accuracy and quality. Cross-referencing the data with other sources or performing sanity checks can help ensure the dataset's reliability.
- Citations and Attribution: When using open source datasets in your research or analysis, it's essential to give proper credit to the original creators or contributors. Follow the provided citation guidelines and acknowledge the source of the data appropriately.
5. Manual Data Generation
Manual data generation refers to the process of collecting data by hand, without the use of
automated tools or systems. Manual data generation can be time-consuming and resource-
intensive, but it can yield valuable and reliable data when performed carefully.
Common manual data generation methods include:
- Manual Extraction from Documents: When dealing with data that exists in physical forms, such as books or printed records, the relevant information is extracted by hand and entered into a digital format.
- Manual Labeling or Annotation: In machine learning and AI, manually annotating data with labels (for example, tagging images or categorizing text) is a common manual task.
- Diaries or Logs: Participants may be asked to keep diaries or logs of their activities, experiences, or behaviors over a certain period.
- Handwritten Surveys or Data Collection: In some cases, data might be collected using pen and paper, and then manually transcribed into digital formats for analysis.
6. Building Synthetic Datasets
Building synthetic datasets is one of the most common methods in data collection when real
data is limited or unavailable, or when privacy concerns prevent the use of actual data.
Synthetic datasets are artificially generated datasets that mimic the statistical properties
and patterns of real data without containing any sensitive or identifiable information.
Here’s a step-by-step guide on how to build synthetic datasets:
Define the Problem and Objectives: Clearly identify the purpose of the
synthetic dataset. Determine what specific features, relationships, and patterns you
want the synthetic data to capture. Understand the target domain and data
characteristics to ensure the synthetic dataset is meaningful and useful.
Understand the Real Data: If possible, analyze and understand the real data you
want to emulate. Identify the key statistical properties, distributions, and relationships
within the data. This will help inform the design of the synthetic dataset.
Choose a Data Generation Method: Several methods can be used to create synthetic
datasets (statistical methods, generative models, data augmentation, simulations).
Choose the Right Features: Identify the essential features from the real data that
need to be included in the synthetic dataset. Avoid including personally
identifiable information (PII) or any sensitive data that might compromise privacy.
Generate the Synthetic Data: Implement the chosen data generation method to
create the synthetic dataset. Ensure that the dataset follows the same format and
data types as the real data to be used seamlessly in analyses and modeling.
Validate and Evaluate: Assess the quality and accuracy of the synthetic dataset
by comparing it to the real data. Use metrics and visualizations to validate that the
synthetic data adequately captures the patterns and distributions present in the real
data.
Modify and Iterate: If the initial synthetic dataset does not meet your expectations,
refine the data generation method or adjust parameters until it better aligns with the
desired objectives.
Ensure Privacy and Ethics: Always prioritize privacy and ethical considerations
when generating synthetic datasets. Ensure that no individual or sensitive information
can be inferred from the synthetic data.
By following these steps, you can create synthetic datasets that can serve as valuable
substitutes for real data in various scenarios, contributing to better model development and
analysis in data-scarce or privacy-sensitive environments.
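As a minimal sketch of the statistical generation method mentioned above (using NumPy, with a tiny invented table of numeric values standing in for the real data), one simple approach is to estimate the mean and covariance of the real columns and then sample new rows from a multivariate normal distribution:

    import numpy as np

    # invented stand-in for the real numeric data (e.g., age and income columns)
    real_data = np.array([[25, 50000], [32, 64000], [41, 72000], [29, 58000]])

    mean = real_data.mean(axis=0)           # key statistics estimated from the real data
    cov = np.cov(real_data, rowvar=False)   # captures relationships between the columns

    rng = np.random.default_rng(seed=0)
    synthetic = rng.multivariate_normal(mean, cov, size=100)  # 100 synthetic rows

After generation, the synthetic columns would be compared against the real ones (for example, by plotting their distributions) as part of the Validate and Evaluate step.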
7. Federated Learning
Federated learning is an approach in which a model is trained across many decentralized devices or servers that each keep their data locally, so the raw data never has to be gathered in one central place. It is useful when privacy, regulation, or bandwidth constraints prevent centralizing the data.
Machine Learning Tools
TensorFlow
TensorFlow, an open-source machine learning framework, is renowned for its flexibility, making it ideal for crafting diverse models while remaining simple to use. With abundant resources and user-friendly interfaces, it makes working with data and building models easier.
PyTorch
PyTorch is an open-source machine learning framework known for its flexibility and its Pythonic, easy-to-debug style. Its dynamic approach to building models makes it popular for research and quick experimentation, and it is also widely used in production.
Scikit-learn
Scikit-learn is a Python library that provides simple, consistent tools for classical machine learning tasks such as classification, regression, and clustering. Its uniform interface and clear documentation make it a common starting point for beginners and for quick baseline models.
Keras
Keras makes it easy to create models and is great for quick experiments, especially with images or text. Its user-friendly interface makes it simple to try out ideas, whether you are working on image recognition or language understanding.
XGBoost
XGBoost is a fast, efficient library for gradient-boosted decision trees. It performs particularly well on structured (tabular) data and is a frequent choice when strong predictive accuracy is needed on such datasets.
Apache Spark MLlib
Apache Spark MLlib is a powerful tool designed for handling massive datasets, making it ideal for large-scale projects with extensive data. It simplifies complex data analysis tasks by providing a robust machine learning framework. When you are dealing with substantial amounts of information, Spark MLlib offers the scalability and efficiency needed to process extensive datasets.
Microsoft Azure Machine Learning
Microsoft Azure Machine Learning makes it easy to do machine learning in the cloud. It is simple, user-friendly, and works well for many different projects, making machine learning accessible and efficient.
Google Cloud AI Platform
Google Cloud AI Platform is a strong tool for running machine learning on Google Cloud. It is well suited to big projects, integrates easily with other Google tools, and provides detailed statistics and simple functions, making it a powerful option for large machine learning tasks.
H2O.ai
H2O.ai is a tool that makes machine learning easy to apply. It is suitable for many kinds of tasks and is backed by an approachable interface and a helpful, active community.
RapidMiner
RapidMiner is an all-in-one tool covering the entire machine learning workflow, well suited to exploring ideas and collaborating on large projects. It supports quick experimentation and seamless teamwork, making it a versatile tool across the stages of machine learning development.
Installation of Python
To download Python on your system, use the following steps:
Step 1: Select Version to Install Python
Visit the official page for Python, https://www.python.org/downloads/, on the Windows operating system. Locate a reliable version of Python 3, preferably version 3.10.11, which was used in testing this tutorial. Choose the correct link for your device from the options provided: either Windows installer (64-bit) or Windows installer (32-bit), and proceed to download the executable file.
Python Homepage
Step 2: Downloading the Python Installer
Python Installer
After clicking the Install Now button, the setup will start installing Python on your Windows system. You will see a window like this.
Python Setup
Step 3: Running the Executable Installer
After completing the setup, Python will be installed on your Windows system and you will see a success message.
Python version
You can also check the version of Python by opening the IDLE application. Go to Start, enter IDLE in the search bar, and then click the IDLE app, for example, IDLE (Python 3.10.11 64-bit). If you can see the Python IDLE window, then you have successfully downloaded and installed Python on Windows.
Python IDLE
Getting Started with Python
Python is comparatively easy to code and learn. Python programs can be written in any plain text editor such as Notepad or Notepad++. You can also use an online IDE to run Python code, or install one on your system to make writing code more convenient, because IDEs provide features such as an intuitive code editor, a debugger, and integrated execution. To begin writing Python code and performing various interesting and useful operations, you must have Python installed on your system.
Summary
To install Python on Windows, you need to download the Python installer from the official
Python website and run it on your system. The installation process is straightforward and
includes options to add Python to your system PATH.
1. Download the Installer:
o Download the Windows installer from the official Python website, https://www.python.org/downloads/.
2. Run the Installer:
o Check the box that says “Add Python to PATH” at the bottom of the installer window.
o Click Install Now. The installer will copy the necessary files and set up Python on your system.
To verify the installation:
1. Open Command Prompt:
o Press Win + R, type cmd, and press Enter to open the Command Prompt.
2. Check the Python version:
o Check the installed Python version from the Command Prompt; you should see something like Python 3.x.x.
3. Check pip:
o Confirm that pip, the Python package installer, is also installed correctly (the exact commands for both checks are shown just after this list).
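Both checks can be run with the following two commands in the Command Prompt; the first should print a version such as Python 3.10.11 (the version used earlier in this tutorial), and the second should report the installed pip version:

    python --version
    pip --version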
Environment variables are used to configure the environment in which processes run. For
Python, you often need to set the PATH environment variable so that you can run Python and
pip from the command line. This ensures that Python executables and scripts can be accessed
from any command line prompt without specifying their full path.
There are two ways to add Python to the PATH variable:
o During installation: when running the Python installer, ensure you check the box that says “Add Python to PATH.”
o Manually: open the Start menu, search for “Environment Variables,” and select “Edit the system environment variables.” Under “System variables,” find the Path variable and click “Edit.” Then click “New” and add the path to the Python installation directory (e.g., C:\Python39) and the Scripts directory (e.g., C:\Python39\Scripts).
To confirm that the PATH is configured correctly, open a new Command Prompt:
o Type python and press Enter to start the Python interpreter. If Python starts, the PATH is configured correctly.
o Type pip and press Enter to verify that pip can be called from the command line.
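If you also want to see exactly which executables the PATH resolves to, the Windows where command lists them (the paths shown will depend on your installation directory):

    where python
    where pip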
Installation of Tools
Anaconda is a convenient packaged solution which you can easily download and install on your computer. It will automatically install Python and some basic IDEs and libraries with it. Below, some steps are given to show the downloading and installing process of Anaconda and an IDE:
Step- 1: Download Anaconda Python:
o To download Anaconda on your system, firstly open your favorite browser and type Download Anaconda Python, and then click on the first link as given in the below image. Alternatively, you can directly download it by clicking on this link: https://www.anaconda.com/distribution/#download-section.
o After clicking on the first link, you will reach the download page of Anaconda, as shown in the below image:
o Since Anaconda is available for Windows, Linux, and Mac OS, you can download it as per your OS type by clicking on the available options shown in the below image. It provides Python 2.7 and Python 3.7 versions; since the latest version is 3.7, we will download the Python 3.7 version. After clicking on the download option, it will start downloading on your computer.
Note: In this topic, we are downloading Anaconda for Windows, but you can choose it as per your OS.
Step- 2: Install Anaconda Python (Python 3.7 version):
Once the downloading process gets completed, go to Downloads → double-click on the ".exe" file (Anaconda3-2019.03-Windows-x86_64.exe) of Anaconda. It will open a setup window for the Anaconda installation as given in the below image; then click on Next.
o It will open a License Agreement window; click on the "I Agree" option and move further.
o In the next window, you will get two options for installation as given in the below image. Select the first option (Just me) and click on Next.
o Now you will get a window for the installation location; here, you can leave it as default or change it by browsing to a location, and then click on Next. Consider the below image:
o Now select the second option, and click on install.
o Once the installation is completed, tick the checkbox if you want to learn more about Anaconda and Anaconda Cloud, then click on Finish to end the process.
Note: Here, we will use the Spyder IDE to run Python programs.
Step- 3: Open Anaconda Navigator
o Open Anaconda Navigator from the Start menu and launch the Spyder IDE from its home screen.
Run your Python program in Spyder IDE.
o Write your first program, and save it using the .py extension (see the short example after this list).
o Run the program using the triangle Run button.
o You can check the program's output in the console pane at the bottom right.
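As an illustration, a first program saved as first_program.py (the file name is just an example) could be as simple as the following; clicking Run should print both lines in the console pane:

    # first_program.py - a minimal first program to test the Spyder setup
    name = "Machine Learning"
    print("Hello,", name)
    print("2 + 3 =", 2 + 3)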
Step- 4: Close the Spyder IDE.
Machine Learning Data and Datasets
Machine learning data refers to the collection of information (or dataset) that is used by
machine learning models to learn, train, and make predictions. It can be structured or
unstructured and usually consists of features (input variables) and labels (output variables or
target values). The quality and nature of this data directly impact the performance and
accuracy of machine learning models.
Machine Learning information refers to the data, knowledge, and insights utilized or
generated in the process of training machine learning models. It includes everything from
raw datasets to model predictions, as well as the intermediate knowledge gained during
data analysis, feature extraction, model evaluation, and decision-making.
A machine learning dataset is a structured collection of data used to train, validate, and
test machine learning models. It consists of multiple examples or data points, where each
data point typically contains features (input variables) and may include a corresponding
label (output or target variable) in the case of supervised learning. The dataset is
essential for enabling machine learning algorithms to learn patterns, make predictions,
and generalize to new data.
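As a tiny illustration (with invented values), a supervised learning dataset pairs input features with a label for each data point:

    # hypothetical dataset: features = [hours_studied, attendance_rate], label = passed_exam
    X = [[12, 0.90],
         [3, 0.40],
         [8, 0.75]]    # input features (one row per data point)
    y = [1, 0, 1]      # labels / target values (1 = passed, 0 = failed)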
Big data for machine learning refers to extremely large, complex, and diverse datasets
that are generated at high velocity and volume. These datasets require advanced processing
techniques and technologies to extract useful insights and are used in machine learning
(ML) to improve model performance, accuracy, and scalability. Machine learning on big
data enables models to learn from vast and varied data sources, resulting in more accurate
predictions and better decision-making capabilities.