
NITML501: MACHINE LEARNING APPLICATION

Competence: APPLY MACHINE LEARNING FUNDAMENTALS

ICT AND MULTIMEDIA

Class: level 5 NIT

Module name: NITML501 MACHINE LEARNING APPLICATION

Competency: APPLY MACHINE LEARNING FUNDAMENTALS

Learning Hours: 120

TRAINER:

Elements of Competency and Performance Criteria

1. Apply Data Pre-processing
   1.1 Environment is properly prepared based on system requirements.
   1.2 Data is properly manipulated based on the Python libraries' functionalities.
   1.3 Visualization results are properly interpreted based on statistical analysis.
   1.4 Data cleaning is appropriately performed based on the provided dataset.

2. Develop Machine Learning Model
   2.1 Machine Learning algorithm is properly selected based on the characteristics of the dataset.
   2.2 Machine Learning models are properly trained based on a training set of data.
   2.3 Machine Learning model performance is properly evaluated based on appropriate evaluation metrics.
   2.4 Hyperparameters are properly fine-tuned based on evaluation results.

3. Perform Model Deployment
   3.1 Deployment methods are clearly selected based on the requirements.
   3.2 Model file is properly integrated in the system based on the deployment method (RESTful API guidelines).
   3.3 Prediction responses are accurately delivered to the clients based on the model insights.

2
Learning outcome 1.Apply Data Pre-processing

1.1 Description of Machine learning concepts

1.1.1 Machine Learning Overview

Definition:

- Machine learning (ML) is a subset of artificial intelligence (AI) that focuses on developing algorithms and
statistical models that allow computers to improve their performance on tasks over time through experience.
Instead of being explicitly programmed to perform a specific task, machine learning models are trained on
data to recognize patterns and make decisions or predictions based on new, unseen data.

- Machine learning (ML) is a branch of artificial intelligence (AI) that involves the development of algorithms and statistical models that enable computers to perform specific tasks without explicit instructions. Instead, these systems learn from and make decisions based on data.

- Machine learning is a field of study in artificial intelligence concerned with the development and study of
statistical algorithms that can learn from data and generalize to unseen data and thus perform tasks without
explicit instructions.

Key Components of Machine Learning

1. Models: A model in machine learning is a mathematical representation of a real-world process. The model is what learns from the data by adjusting its parameters to fit the observed data as closely as possible.

2. Data: Data is the cornerstone of all ML algorithms. Without data, ML algorithms cannot learn. The data can be in various formats, such as text, images, videos, or even sensor data.

3. Algorithms: These are the methods used to train models.

Some common types include supervised learning, unsupervised learning, and reinforcement
learning.

• Supervised Learning: The model is trained on labeled data (where the correct
answers are known). For example, a model might be trained to recognize cats and dogs
in images by being shown many examples of each with labels.
• Unsupervised Learning: The model works with unlabeled data and tries to find
patterns or groupings on its own. For instance, it might cluster customers into different
segments based on their purchasing behavior.
• Reinforcement Learning: The model learns by interacting with an environment and
receiving feedback in the form of rewards or penalties. This is often used in robotics
and game-playing.
4. Training: During training, the ML model adjusts its parameters to minimize errors or optimize
its performance based on the data it processes.
5. Evaluation: This involves assessing how well your model is performing. Common metrics include accuracy, precision, recall, and F1 score, depending on the problem you are solving (e.g. classification or regression).
6. Deployment: Once trained and evaluated, the model can be deployed in real-world applications,
such as recommendation systems, speech recognition, and autonomous vehicles.
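
To make these components concrete, below is a minimal sketch (assuming scikit-learn is installed; the iris dataset ships with it) that puts data, model, algorithm, training, and evaluation together:

# Minimal sketch of the key components: data, model, training, evaluation.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)                 # Data: features and labels
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

model = DecisionTreeClassifier()                  # Model built by a chosen algorithm
model.fit(X_train, y_train)                       # Training: adjust parameters to fit the data

predictions = model.predict(X_test)               # Predictions on new, unseen data
print("Accuracy:", accuracy_score(y_test, predictions))   # Evaluation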

Machine learning life cycle

What is the Machine Learning lifecycle? The Machine Learning lifecycle is the end-to-end process that machine
learning models go through, from problem definition to model deployment and maintenance.
Machine learning life cycle involves seven major steps, which are given below:

o Gathering Data
o Data preparation
o Data Wrangling
o Analyse Data
o Train the model
o Test the model
o Deployment

In the complete life cycle process, to solve a problem we create a machine learning system called a "model", and this model is created by providing "training". But to train a model we need data, so the life cycle starts with collecting data.

1. Gathering Data:

Data gathering is the first step of the machine learning life cycle. The goal of this step is to identify and obtain all the data the problem requires.

In this step, we need to identify the different data sources, as data can be collected from various sources such as files, databases, the internet, or mobile devices. It is one of the most important steps of the life cycle. The quantity and quality of the collected data determine the efficiency of the output: the more data we have, the more accurate the prediction will be.

This step includes the below tasks:

o Identify various data sources
o Collect data
o Integrate the data obtained from different sources

By performing the above tasks, we get a coherent set of data, also called a dataset. It will be used in the further steps.

2. Data preparation

After collecting the data, we need to prepare it for the further steps. Data preparation is the step where we put our data into a suitable form and prepare it for use in machine learning training.

In this step, we first put all the data together and then randomize the ordering of the data. This step can be further divided into two processes:

o Data exploration:
It is used to understand the nature of the data that we have to work with. We need to understand the characteristics, format, and quality of the data. A better understanding of the data leads to an effective outcome. In this step we look for correlations, general trends, and outliers (see the sketch below).
o Data pre-processing:
Now the next step is pre-processing the data for its analysis.
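
As a small sketch of data exploration with pandas (the file name data.csv is hypothetical):

# Data exploration sketch; "data.csv" is a hypothetical dataset.
import pandas as pd

df = pd.read_csv("data.csv")

print(df.head())                   # first rows: check format and characteristics
df.info()                          # column types and missing-value counts
print(df.describe())               # summary statistics: general trends
print(df.corr(numeric_only=True))  # correlations between numeric columns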

3. Data Wrangling

Data wrangling is the process of cleaning and converting raw data into a usable format. It is the process of cleaning the data, selecting the variables to use, and transforming the data into a proper format to make it more suitable for analysis in the next step. It is one of the most important steps of the complete process. Cleaning the data is required to address quality issues.

The data we have collected is not always useful, as some of it may be irrelevant. In real-world applications, collected data may have various issues, including:

o Missing values
o Duplicate data
o Invalid data
o Noise

So, we use various filtering techniques to clean the data; a short sketch follows below.

It is mandatory to detect and remove the above issues because they can negatively affect the quality of the outcome.
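
As a minimal data-cleaning sketch with pandas (the file data.csv and its age column are hypothetical):

# Data wrangling sketch: duplicates, missing values, and invalid entries.
import pandas as pd

df = pd.read_csv("data.csv")                        # hypothetical raw dataset
df = df.drop_duplicates()                           # remove duplicate rows
df["age"] = df["age"].fillna(df["age"].median())    # impute missing values (assumed column)
df = df[df["age"].between(0, 120)]                  # drop invalid, out-of-range values
df.to_csv("clean_data.csv", index=False)            # save the cleaned dataset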

4. Data Analysis

Now the cleaned and prepared data is passed on to the analysis step. This step involves:

o Selection of analytical techniques


o Building models
o Review the result

The aim of this step is to build a machine learning model to analyze the data using various analytical techniques and review the outcome. It starts with determining the type of problem, where we select machine learning techniques such as Classification, Regression, Cluster analysis, Association, etc.; we then build the model using the prepared data and evaluate it.

Hence, in this step, we take the data and use machine learning algorithms to build the model.

5. Train Model

Now the next step is to train the model. In this step we train our model to improve its performance and obtain a better outcome for the problem.

We use datasets to train the model using various machine learning algorithms. Training a model is required so that it can learn the various patterns, rules, and features.

6. Test Model

Once our machine learning model has been trained on a given dataset, we test the model. In this step, we check the accuracy of our model by providing a test dataset to it. Testing the model determines the percentage accuracy of the model as per the requirements of the project or problem.
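
As a minimal sketch of this step (assuming scikit-learn), we can hold out a test set and report the common metrics:

# Testing sketch: evaluate a trained model on a held-out test dataset.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(classification_report(y_test, model.predict(X_test)))  # accuracy, precision, recall, F1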

7. Deployment

The last step of the machine learning life cycle is deployment, where we deploy the model in a real-world system.

If the above-prepared model produces accurate results as per our requirements with acceptable speed, then we deploy the model in the real system. But before deploying the project, we check whether it is improving its performance using the available data or not. The deployment phase is similar to making the final report for a project.
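
As an illustrative sketch only (assuming Flask is installed and a trained model was saved to a hypothetical model.pkl file with joblib), serving predictions behind a small RESTful endpoint could look like this:

# Deployment sketch: a saved model served over a REST API.
import joblib
from flask import Flask, request, jsonify

app = Flask(__name__)
model = joblib.load("model.pkl")   # hypothetical trained model file

@app.route("/predict", methods=["POST"])
def predict():
    features = request.get_json()["features"]       # e.g. {"features": [5.1, 3.5, 1.4, 0.2]}
    prediction = model.predict([features])
    return jsonify({"prediction": prediction.tolist()})

if __name__ == "__main__":
    app.run(port=5000)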

Machine learning applications

1. Image Recognition:
Image recognition is one of the most common applications of machine learning. It is used to identify objects, persons, places, digital images, etc. A popular use case of image recognition and face detection is automatic friend tagging suggestions.

Facebook provides a feature of automatic friend tagging suggestions: whenever we upload a photo with our Facebook friends, we automatically get a tagging suggestion with names, and the technology behind this is machine learning's face detection and recognition algorithm.

It is based on the Facebook project named "DeepFace," which is responsible for face recognition and person identification in pictures.

2. Speech Recognition
While using Google, we get the option of "Search by voice"; this comes under speech recognition, and it is a popular application of machine learning.

Speech recognition is the process of converting voice instructions into text; it is also known as "speech to text" or "computer speech recognition." At present, machine learning algorithms are widely used in speech recognition applications. Google Assistant, Siri, Cortana, and Alexa use speech recognition technology to follow voice instructions.

3. Traffic prediction:
If we want to visit a new place, we take the help of Google Maps, which shows us the correct path with the shortest route and predicts the traffic conditions.

It predicts traffic conditions, such as whether traffic is clear, slow-moving, or heavily congested, with the help of two inputs:

o Real-time location of the vehicle from the Google Maps app and sensors
o Average time taken on past days at the same time of day

Everyone who uses Google Maps is helping to make the app better. It takes information from the user and sends it back to its database to improve performance.

4. Product recommendations:
Machine learning is widely used by various e-commerce and entertainment companies, such as Amazon and Netflix, for product recommendations to the user. Whenever we search for a product on Amazon, we start getting advertisements for the same product while surfing the internet in the same browser, and this is because of machine learning.

Google understands the user's interests using various machine learning algorithms and suggests products according to those interests.

Similarly, when we use Netflix, we find recommendations for entertainment series, movies, etc., and this is also done with the help of machine learning.

5. Self-driving cars:

One of the most exciting applications of machine learning is self-driving cars. Machine learning plays a significant role in self-driving cars. Tesla, the most popular car manufacturing company in this space, is working on self-driving cars, using machine learning methods to train the car models to detect people and objects while driving.

6. Email Spam and Malware Filtering:

Whenever we receive a new email, it is automatically filtered as important, normal, or spam. We always receive important mail in our inbox, marked with the important symbol, and spam emails in our spam box; the technology behind this is machine learning. Below are some spam filters used by Gmail:

o Content Filter
o Header filter
o General blacklists filter
o Rules-based filters
o Permission filters
Some machine learning algorithms, such as the Multi-Layer Perceptron, Decision Tree, and Naïve Bayes classifier, are used for email spam filtering and malware detection; a small sketch follows below.
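
Here is a minimal sketch of a Naïve Bayes spam filter (assuming scikit-learn; the tiny inline dataset is illustrative only):

# Spam-filtering sketch with a Naive Bayes classifier.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

emails = ["win a free prize now", "meeting at 10 tomorrow",
          "free money claim now", "project report attached"]
labels = [1, 0, 1, 0]                        # 1 = spam, 0 = not spam

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(emails)         # bag-of-words features

clf = MultinomialNB().fit(X, labels)
test = vectorizer.transform(["claim your free prize"])
print(clf.predict(test))                     # [1]: classified as spam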

7. Virtual Personal Assistant:


We have various virtual personal assistants, such as Google Assistant, Alexa, Cortana, and Siri. As the name suggests, they help us find information using voice instructions. These assistants can help us in various ways just through voice instructions, such as playing music, calling someone, opening an email, or scheduling an appointment.

These assistants record our voice instructions, send them to a server in the cloud, decode them using ML algorithms, and act accordingly.

8. Online Fraud Detection:


Machine learning is making our online transactions safe and secure by detecting fraudulent transactions. Whenever we perform an online transaction, there are various ways a fraudulent transaction can take place, such as fake accounts, fake IDs, and money being stolen in the middle of a transaction. To detect this, a Feed-Forward Neural Network helps us by checking whether a transaction is genuine or fraudulent.

9. Stock Market trading:


Machine learning is widely used in stock market trading. In the stock market there is always a risk of shares going up and down, so machine learning's Long Short-Term Memory (LSTM) neural network is used for predicting stock market trends.

10. Medical Diagnosis:


In medical science, machine learning is used for disease diagnosis. With it, medical technology is growing very fast and is able to build 3D models that can predict the exact position of lesions in the brain. It helps in finding brain tumors and other brain-related diseases more easily.

11. Automatic Language Translation:

Nowadays, if we visit a new place and are not aware of the language, it is not a problem at all; machine learning helps us here too by converting text into languages we know. Google's GNMT (Google Neural Machine Translation) provides this feature; it is a neural machine translation system that translates text into our familiar language, and this is called automatic translation.

Machine learning Advantages and disadvantages

Advantages of Machine Learning

1. Automation

A key advantage of Machine Learning is its capacity to automate repetitive and time-
consuming tasks, leading to improved productivity, cost savings, and minimized errors within
organizations.
For example, ML-driven chatbots deployed in customer service streamline interactions by
promptly addressing inquiries, recommending products, and comparing prices, thereby
reducing waiting times and augmenting customer satisfaction.
2. Scope of Improvement

Machine Learning is a field where things keep evolving. It offers many opportunities for improvement and can become the leading technology in the future. A lot of research and innovation is happening in this technology, which helps improve both software and hardware.
3. Learning Capability:
Machine Learning algorithms have the remarkable ability to learn from the data provided to
them. For instance, companies like Amazon and Walmart leverage ML algorithms to
scrutinize vast customer data sets, uncovering hidden correlations and preferences to suggest
personalized product recommendations, thereby enhancing customer satisfaction and driving
sales.

4. Enhanced Experience in Online Shopping and Quality Education

Machine Learning is going to be used extensively in the education sector, and it will enhance the quality of education and the student experience.

5. Wide Range of Applicability

This technology has a very wide range of applications. Machine learning plays a role in
almost every field, like hospitality, ed-tech, medicine, science, banking, and business. It
creates more opportunities.

6. Pattern Identification:
Machine Learning excels at discerning intricate trends and patterns within vast and complex datasets, catalyzing transformative advancements across various industries. In healthcare, ML algorithms analyze diverse data sources, such as medical images and patient records, to facilitate early disease detection and tailor treatment plans to individual patients.

7. Variety of Applications:
Machine Learning exhibits exceptional versatility, permeating virtually every sector and facet
of modern life. In finance, ML underpins risk assessment and fraud detection initiatives,
while healthcare leverages ML for tasks ranging from diagnosis to drug discovery and
personalized medicine.

Disadvantages of Machine Learning

Nothing is perfect in the world. Machine Learning has some serious limitations.

1. Data Acquisition

The whole concept of machine learning is about identifying useful data. The outcome will be incorrect if a credible data source is not provided. The quality of the data also matters: if the user or institution needs higher-quality data, they must wait for it, which causes delays in providing the output. So, machine learning depends significantly on the data and its quality.

2. Time and Resources

The data that machines process remains huge in quantity and differs greatly. Machines require time for their algorithms to adjust to the environment and learn from it. Trial runs are held to check the accuracy and reliability of the machine. Setting up infrastructure of that quality requires massive, expensive resources and high-quality expertise.

3. Results Interpretations

One of the biggest disadvantages of machine learning is that the results we get from interpreted data cannot be one hundred percent accurate; they will have some degree of inaccuracy. For a high degree of accuracy, algorithms should be developed so that they give reliable results.

4. High Error Chances

Errors committed during the initial stages are large, and if not corrected at that time they create havoc. Bias and incorrectness have to be dealt with separately; they are not interconnected. Machine learning depends on two factors, i.e., data and algorithm, and all errors depend on these two variables. Any incorrectness in either variable would have huge repercussions on the output.

5. Social Changes

Machine learning is bringing numerous social changes to society. The role of machine learning-based technology in society has increased manifold. It is influencing society's thought processes and creating unwanted problems; character assassination and the exposure of sensitive details are disturbing the social fabric.

6. Elimination of Human Interface

Automation, Artificial Intelligence, and Machine Learning have eliminated the human interface from some work, which has eliminated employment opportunities. Now, those tasks are carried out with the help of artificial intelligence and machine learning.

7. Changing Nature of Jobs

With the advancement of machine learning, the nature of jobs is changing. Much of the work once done by humans is now done by machines, which is eating into those jobs. It is difficult for those without technical education to adjust to these changes.

8. Highly Expensive

This software is highly expensive, and not everybody can own it. It is mostly owned by government agencies, big private firms, and enterprises. It needs to be made accessible to everybody for wide use.

9. Privacy Concern

As we know, one of the pillars of machine learning is data. The collection of data has raised fundamental questions of privacy. The way data is collected and used for commercial purposes has always been a contentious issue.

10. Research and Innovations

Machine learning is an evolving field. The area has not yet seen major developments that fully revolutionize any economic sector, and it requires continuous research and innovation.

Difference between machine learning, artificial intelligence, and deep learning

 Artificial Intelligence
Artificial Intelligence is basically the mechanism to incorporate human intelligence into machines through a set of rules (algorithms). AI is a combination of two words: "Artificial", meaning something made by humans or non-natural, and "Intelligence", meaning the ability to understand or think accordingly. Another definition could be that "AI is basically the study of training your machines (computers) to mimic a human brain and its thinking capabilities".
AI focuses on three major aspects (skills): learning, reasoning, and self-correction, to obtain the maximum efficiency possible.
 Machine Learning:

Machine Learning is basically the study/process which enables a system (computer) to learn automatically on its own through the experiences it has had, and to improve accordingly, without being explicitly programmed. ML is an application or subset of AI. ML focuses on the development of programs so that they can access data and use it for themselves. The entire process makes observations on data to identify possible patterns and make better future decisions as per the examples provided. The major aim of ML is to allow systems to learn by themselves through experience, without any kind of human intervention or assistance.
 Deep Learning:
Deep Learning is basically a sub-part of the broader family of Machine Learning which makes use of Neural Networks (similar to the neurons working in our brain) to mimic human-brain-like behavior. DL algorithms focus on information-processing patterns to identify patterns just like our human brain does, and classify the information accordingly. DL works on larger sets of data compared to ML, and the prediction mechanism is self-administered by the machine.
Below are the differences between Artificial Intelligence, Machine Learning, and Deep Learning, compared point by point:

Artificial Intelligence: AI stands for Artificial Intelligence, and is basically the study/process which enables machines to mimic human behaviour through a particular algorithm.
Machine Learning: ML stands for Machine Learning, and is the study that uses statistical methods enabling machines to improve with experience.
Deep Learning: DL stands for Deep Learning, and is the study that makes use of Neural Networks (similar to neurons present in the human brain) to imitate functionality just like a human brain.

Artificial Intelligence: AI is the broader family consisting of ML and DL as its components.
Machine Learning: ML is the subset of AI.
Deep Learning: DL is the subset of ML.

Artificial Intelligence: AI is a computer algorithm which exhibits intelligence through decision making.
Machine Learning: ML is an AI algorithm which allows systems to learn from data.
Deep Learning: DL is an ML algorithm that uses deep (more than one layer) neural networks to analyze data and provide output accordingly.

Artificial Intelligence: Search trees and much complex math are involved in AI.
Machine Learning: If you have a clear idea about the logic (math) involved and you can visualize complex functionalities like K-Means, Support Vector Machines, etc., then it defines the ML aspect.
Deep Learning: If you are clear about the math involved but do not know the features, so you break the complex functionalities into linear/lower-dimension features by adding more layers, then it defines the DL aspect.

Artificial Intelligence: The aim is to basically increase the chances of success, not accuracy.
Machine Learning: The aim is to increase accuracy, not caring much about the success ratio.
Deep Learning: It attains the highest rank in terms of accuracy when trained with a large amount of data.

Artificial Intelligence: Three broad categories/types of AI are: Artificial Narrow Intelligence (ANI), Artificial General Intelligence (AGI), and Artificial Super Intelligence (ASI).
Machine Learning: Three broad categories/types of ML are: Supervised Learning, Unsupervised Learning, and Reinforcement Learning.
Deep Learning: DL can be considered as neural networks with a large number of parameters and layers, lying in one of four fundamental network architectures: Unsupervised Pre-trained Networks, Convolutional Neural Networks, Recurrent Neural Networks, and Recursive Neural Networks.

Artificial Intelligence: The efficiency of AI is basically the efficiency provided by ML and DL respectively.
Machine Learning: Less efficient than DL, as it cannot work with higher dimensions or larger amounts of data.
Deep Learning: More powerful than ML, as it can easily work with larger sets of data.

Artificial Intelligence: Examples of AI applications include Google's AI-powered predictions, ridesharing apps like Uber and Lyft, commercial flights' AI autopilot, etc.
Machine Learning: Examples of ML applications include virtual personal assistants (Siri, Alexa, Google Assistant, etc.) and email spam and malware filtering.
Deep Learning: Examples of DL applications include sentiment-based news aggregation, image analysis and caption generation, etc.

Artificial Intelligence: AI refers to the broad field of computer science that focuses on creating intelligent machines that can perform tasks that would normally require human intelligence, such as reasoning, perception, and decision-making.
Machine Learning: ML is a subset of AI that focuses on developing algorithms that can learn from data and improve their performance over time without being explicitly programmed.
Deep Learning: DL is a subset of ML that focuses on developing deep neural networks that can automatically learn and extract features from data.

Artificial Intelligence: AI can be further broken down into various subfields such as robotics, natural language processing, computer vision, expert systems, and more.
Machine Learning: ML algorithms can be categorized as supervised, unsupervised, or reinforcement learning. In supervised learning, the algorithm is trained on labeled data, where the desired output is known. In unsupervised learning, the algorithm is trained on unlabeled data, where the desired output is unknown.
Deep Learning: DL algorithms are inspired by the structure and function of the human brain, and they are particularly well-suited to tasks such as image and speech recognition.

Artificial Intelligence: AI systems can be rule-based, knowledge-based, or data-driven.
Machine Learning: In reinforcement learning, the algorithm learns by trial and error, receiving feedback in the form of rewards or punishments.
Deep Learning: DL networks consist of multiple layers of interconnected neurons that process data in a hierarchical manner, allowing them to learn increasingly complex representations of the data.

AI vs. Machine Learning vs. Deep Learning Examples:

Artificial Intelligence (AI) refers to the development of computer systems that can perform tasks that would normally require human intelligence. There are numerous examples of AI applications across various industries. Here are some common examples:

 Speech recognition: Speech recognition systems use deep learning algorithms to recognize and classify spoken language. These systems are used in a variety of applications, such as virtual assistants, call centers, and dictation software.
 Personalized recommendations: E-commerce sites and streaming services like
Amazon and Netflix use AI algorithms to analyze users’ browsing and viewing
history to recommend products and content that they are likely to be interested
in.
 Predictive maintenance: AI-powered predictive maintenance systems analyze
data from sensors and other sources to predict when equipment is likely to fail,
helping to reduce downtime and maintenance costs.

 Medical diagnosis: AI-powered medical diagnosis systems analyze medical
images and other patient data to help doctors make more accurate diagnoses and
treatment plans.
 Autonomous vehicles: Self-driving cars and other autonomous vehicles use AI
algorithms and sensors to analyze their environment and make decisions about
speed, direction, and other factors.
 Virtual Personal Assistants (VPA) like Siri or Alexa – these use natural
language processing to understand and respond to user requests, such as playing
music, setting reminders, and answering questions.
 Autonomous vehicles – self-driving cars use AI to analyze sensor data, such as
cameras and lidar, to make decisions about navigation, obstacle avoidance, and
route planning.
 Fraud detection – financial institutions use AI to analyze transactions and
detect patterns that are indicative of fraud, such as unusual spending patterns or
transactions from unfamiliar locations.
 Image recognition – AI is used in applications such as photo organization,
security systems, and autonomous robots to identify objects, people, and scenes
in images.
 Natural language processing – AI is used in chatbots and language translation
systems to understand and generate human-like text.
 Predictive analytics – AI is used in industries such as healthcare and marketing
to analyze large amounts of data and make predictions about future events, such
as disease outbreaks or consumer behavior.
 Game-playing AI – AI algorithms have been developed to play games such as
chess, Go, and poker at a superhuman level, by analyzing game data and making
predictions about the outcomes of moves.
Examples of Machine Learning:
Machine Learning (ML) is a subset of Artificial Intelligence (AI) that involves the use of
algorithms and statistical models to allow a computer system to “learn” from data and
improve its performance over time, without being explicitly programmed to do so.

Here are some examples of Machine Learning:


 Image recognition: Machine learning algorithms are used in image recognition systems to classify images based on their contents. These systems are used in a variety of applications, such as self-driving cars, security systems, and medical imaging.
 Speech recognition: Machine learning algorithms are used in speech
recognition systems to transcribe speech and identify the words spoken. These
systems are used in virtual assistants like Siri and Alexa, as well as in call
centers and other applications.
 Natural language processing (NLP): Machine learning algorithms are used in
NLP systems to understand and generate human language. These systems are
used in chatbots, virtual assistants, and other applications that involve natural
language interactions.
 Recommendation systems: Machine learning algorithms are used in
recommendation systems to analyze user data and recommend products or
services that are likely to be of interest. These systems are used in e-commerce
sites, streaming services, and other applications.
 Sentiment analysis: Machine learning algorithms are used in sentiment analysis
systems to classify the sentiment of text or speech as positive, negative, or
neutral. These systems are used in social media monitoring and other
applications.
 Predictive maintenance: Machine learning algorithms are used in predictive
maintenance systems to analyze data from sensors and other sources to predict
when equipment is likely to fail, helping to reduce downtime and maintenance
costs.
 Spam filters in email – ML algorithms analyze email content and metadata to
identify and flag messages that are likely to be spam.
 Recommendation systems – ML algorithms are used in e-commerce websites
and streaming services to make personalized recommendations to users based on
their browsing and purchase history.
 Predictive maintenance – ML algorithms are used in manufacturing to predict
when machinery is likely to fail, allowing for proactive maintenance and
reducing downtime.
 Credit risk assessment – ML algorithms are used by financial institutions to
assess the credit risk of loan applicants, by analyzing data such as their income,
employment history, and credit score.

 Customer segmentation – ML algorithms are used in marketing to segment
customers into different groups based on their characteristics and behavior,
allowing for targeted advertising and promotions.
 Fraud detection – ML algorithms are used in financial transactions to detect
patterns of behavior that are indicative of fraud, such as unusual spending
patterns or transactions from unfamiliar locations.
 Speech recognition – ML algorithms are used to transcribe spoken words into
text, allowing for voice-controlled interfaces and dictation software.
Examples of Deep Learning:
Deep Learning is a type of Machine Learning that uses artificial neural networks with
multiple layers to learn and make decisions.

Here are some examples of Deep Learning:


 Image and video recognition: Deep learning algorithms are used in image and
video recognition systems to classify and analyze visual data. These systems are
used in self-driving cars, security systems, and medical imaging.
 Generative models: Deep learning algorithms are used in generative models to
create new content based on existing data. These systems are used in image and
video generation, text generation, and other applications.
 Autonomous vehicles: Deep learning algorithms are used in self-driving cars
and other autonomous vehicles to analyze sensor data and make decisions about
speed, direction, and other factors.
 Image classification – Deep Learning algorithms are used to recognize objects
and scenes in images, such as recognizing faces in photos or identifying items in
an image for an e-commerce website.
 Speech recognition – Deep Learning algorithms are used to transcribe spoken
words into text, allowing for voice-controlled interfaces and dictation software.
 Natural language processing – Deep Learning algorithms are used for tasks
such as sentiment analysis, language translation, and text generation.
 Recommender systems – Deep Learning algorithms are used in
recommendation systems to make personalized recommendations based on
users’ behavior and preferences.

 Fraud detection – Deep Learning algorithms are used in financial transactions
to detect patterns of behavior that are indicative of fraud, such as unusual
spending patterns or transactions from unfamiliar locations.
 Game-playing AI – Deep Learning algorithms have been used to develop game-
playing AI that can compete at a superhuman level, such as the AlphaGo AI that
defeated the world champion in the game of Go.
 Time series forecasting – Deep Learning algorithms are used to forecast future
values in time series data, such as stock prices, energy consumption, and
weather patterns.
1.1.2 Types of Machine Learning
There are several types of machine learning, each with special characteristics and
applications. Some of the main types of machine learning algorithms are as follows:
1. Supervised Machine Learning
2. Unsupervised Machine Learning
3. Semi-Supervised Machine Learning
4. Reinforcement Learning

1. Supervised Machine Learning

Supervised learning is defined as when a model gets trained on a "labelled dataset". Labelled datasets have both input and output parameters. In supervised learning, algorithms learn to map points between inputs and correct outputs. Both the training and validation datasets are labelled.

Supervised Learning

Let’s understand it with the help of an example.


Example: Consider a scenario where you have to build an image classifier to differentiate between cats and dogs. If you feed datasets of labelled dog and cat images to the algorithm, the machine will learn to classify a dog or a cat from these labelled images. When we input new dog or cat images that it has never seen before, it will use the learned model to predict whether it is a dog or a cat. This is how supervised learning works, and this particular task is image classification.
There are two main categories of supervised learning that are mentioned below:
- Classification
- Regression
1. Classification
Classification deals with predicting categorical target variables, which represent discrete
classes or labels. For instance, classifying emails as spam or not spam, or predicting
whether a patient has a high risk of heart disease. Classification algorithms learn to map the
input features to one of the predefined classes.
Here are some classification algorithms (a short sketch follows the list):
 Logistic Regression
 Support Vector Machine
 Random Forest
 Decision Tree
 K-Nearest Neighbors (KNN)
 Naive Bayes
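
As a short sketch of one listed algorithm, K-Nearest Neighbors, on a two-class problem (assuming scikit-learn and its bundled breast-cancer dataset):

# Classification sketch with K-Nearest Neighbors.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
print("Test accuracy:", knn.score(X_test, y_test))   # fraction of correctly predicted labels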
2. Regression
Regression, on the other hand, deals with predicting continuous target variables, which
represent numerical values. For example, predicting the price of a house based on its size,
location, and amenities, or forecasting the sales of a product. Regression algorithms learn to
map the input features to a continuous numerical value.
Here are some regression algorithms (a short sketch follows the list):
 Linear Regression
 Polynomial Regression
 Ridge Regression
 Lasso Regression
 Decision tree
 Random Forest
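
As a short sketch of linear regression on the house-price example (the tiny inline dataset is illustrative only; scikit-learn assumed):

# Regression sketch: predicting a continuous price from house size.
import numpy as np
from sklearn.linear_model import LinearRegression

sizes = np.array([[50], [70], [90], [120], [150]])            # square metres
prices = np.array([100000, 140000, 175000, 230000, 290000])   # currency units

reg = LinearRegression().fit(sizes, prices)
print(reg.predict([[100]]))        # estimated price of a 100-square-metre house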
Advantages of Supervised Machine Learning
 Supervised Learning models can have high accuracy as they are trained
on labelled data.
 The process of decision-making in supervised learning models is often
interpretable.
 Pre-trained models can often be reused, which saves time and resources compared with developing new models from scratch.
Disadvantages of Supervised Machine Learning
 It may struggle with unseen or unexpected patterns that are not present in the training data.
 It can be time-consuming and costly as it relies on labeled data only.
 It may lead to poor generalizations based on new data.
Applications of Supervised Learning
Supervised learning is used in a wide variety of applications, including:
 Image classification: Identify objects, faces, and other features in images.

 Natural language processing: Extract information from text, such as sentiment,
entities, and relationships.
 Speech recognition: Convert spoken language into text.
 Recommendation systems: Make personalized recommendations to users.
 Predictive analytics: Predict outcomes, such as sales, customer churn, and stock
prices.
 Medical diagnosis: Detect diseases and other medical conditions.
 Fraud detection: Identify fraudulent transactions.
 Autonomous vehicles: Recognize and respond to objects in the environment.
 Email spam detection: Classify emails as spam or not spam.
 Quality control in manufacturing: Inspect products for defects.
 Credit scoring: Assess the risk of a borrower defaulting on a loan.
 Gaming: Recognize characters, analyze player behavior, and create NPCs.
 Customer support: Automate customer support tasks.
 Weather forecasting: Make predictions for temperature, precipitation, and other
meteorological parameters.
 Sports analytics: Analyze player performance, make game predictions, and
optimize strategies.
2. Unsupervised Machine Learning
Unsupervised learning is a type of machine learning technique in which an algorithm
discovers patterns and relationships using unlabeled data. Unlike supervised learning,
unsupervised learning doesn’t involve providing the algorithm with labeled target outputs.
The primary goal of Unsupervised learning is often to discover hidden patterns, similarities,
or clusters within the data, which can then be used for various purposes, such as data
exploration, visualization, dimensionality reduction, and more.

Unsupervised Learning

Let’s understand it with the help of an example.


Example: Consider that you have a dataset that contains information about purchases made from a shop. Through clustering, the algorithm can group customers with similar purchasing behaviour, revealing customer segments without predefined labels. This type of information can help businesses identify target customers as well as outliers.
There are two main categories of unsupervised learning that are mentioned below:
 Clustering
 Association
1. Clustering
Clustering is the process of grouping data points into clusters based on their similarity. This
technique is useful for identifying patterns and relationships in data without the need for
labeled examples.
Here are some common unsupervised algorithms, including clustering and dimensionality-reduction methods (a clustering sketch follows the list):
 K-Means Clustering algorithm
 Mean-shift algorithm
 DBSCAN Algorithm
 Principal Component Analysis
 Independent Component Analysis
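
As a short sketch of K-Means on the customer-segmentation example (the tiny inline dataset of annual spend and monthly visits is illustrative only):

# Clustering sketch: grouping customers by purchasing behaviour.
import numpy as np
from sklearn.cluster import KMeans

customers = np.array([[200, 2], [220, 3], [800, 10],
                      [850, 12], [210, 2], [790, 11]])
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(customers)
print(kmeans.labels_)              # cluster assignment for each customer
print(kmeans.cluster_centers_)     # centre of each discovered segment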
2. Association
Association rule learning is a technique for discovering relationships between items in a dataset. It identifies rules indicating that the presence of one item implies the presence of another item with a specific probability.
Here are some association rule learning algorithms (a short sketch follows the list):
a. Apriori Algorithm
b. Eclat
c. FP-growth Algorithm
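
As a short sketch of Apriori-based association rules, assuming the third-party mlxtend library is installed (the tiny one-hot basket data is illustrative only):

# Association rule sketch with the Apriori algorithm (mlxtend assumed).
import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules

baskets = pd.DataFrame({                 # each row is one transaction
    "bread":  [1, 1, 0, 1],
    "butter": [1, 1, 0, 0],
    "beer":   [0, 0, 1, 1],
}).astype(bool)

frequent = apriori(baskets, min_support=0.5, use_colnames=True)
rules = association_rules(frequent, metric="confidence", min_threshold=0.7)
print(rules[["antecedents", "consequents", "support", "confidence"]])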
Advantages of Unsupervised Machine Learning
- It helps to discover hidden patterns and various relationships between the data.
- Used for tasks such as customer segmentation, anomaly detection, and data exploration.
- It does not require labeled data and reduces the effort of data labeling.

Disadvantages of Unsupervised Machine Learning
- Without labels, it may be difficult to assess the quality of the model's output.
- Clusters may lack interpretability and may not have meaningful interpretations.
- Extracting meaningful features from raw data often requires additional techniques, such as autoencoders and dimensionality reduction.
Applications of Unsupervised Learning
Here are some common applications of unsupervised learning:
- Clustering: Group similar data points into clusters.
- Anomaly detection: Identify outliers or anomalies in data.
- Dimensionality reduction: Reduce the dimensionality of data while preserving its essential
information.
- Recommendation systems: Suggest products, movies, or content to users based on their
historical behavior or preferences.
- Topic modeling: Discover latent topics within a collection of documents.
- Density estimation: Estimate the probability density function of data.
- Image and video compression: Reduce the amount of storage required for multimedia
content.
- Data preprocessing: Help with data preprocessing tasks such as data cleaning, imputation
of missing values, and data scaling.
- Market basket analysis: Discover associations between products.
- Genomic data analysis: Identify patterns or group genes with similar expression
profiles.
- Image segmentation: Segment images into meaningful regions.
- Community detection in social networks: Identify communities or groups of individuals
with similar interests or connections.
- Customer behavior analysis: Uncover patterns and insights for better marketing
and product recommendations.
- Content recommendation: Classify and tag content to make it easier to recommend
similar items to users.
- Exploratory data analysis (EDA): Explore data and gain insights before defining
specific tasks.

3. Semi-Supervised Learning
Semi-supervised learning is a machine learning approach that sits between supervised and unsupervised learning, so it uses both labelled and unlabelled data. It is particularly useful when obtaining labelled data is costly, time-consuming, or resource-intensive. Semi-supervised learning is chosen when labelling data requires skills and relevant resources.
We use these techniques when a small portion of the data is labelled and the large remainder is unlabelled. We can use unsupervised techniques to predict labels and then feed those labels to supervised techniques. This technique is mostly applicable to image datasets, where usually not all images are labelled.

Semi-Supervised Learning

Let’s understand it with the help of an example.


Example: Consider that we are building a language translation model; having labelled translations for every sentence pair can be resource-intensive. Semi-supervised learning allows the model to learn from both labelled and unlabelled sentence pairs, making it more accurate. This technique has led to significant improvements in the quality of machine translation services.
Types of Semi-Supervised Learning Methods
There are a number of different semi-supervised learning methods each with its own
characteristics. Some of the most common ones include:

 Graph-based semi-supervised learning: This approach uses a graph to
represent the relationships between the data points. The graph is then used to
propagate labels from the labeled data points to the unlabeled data points.
 Label propagation: This approach iteratively propagates labels from the labeled
data points to the unlabeled data points, based on the similarities between the
data points.
 Co-training: This approach trains two different machine learning models on different views (feature subsets) of the data. Each model then labels unlabelled examples for the other.
 Self-training: This approach trains a machine learning model on the labeled
data and then uses the model to predict labels for the unlabeled data. The model
is then retrained on the labeled data and the predicted labels for the unlabeled
data.
 Generative adversarial networks (GANs): GANs are a type of deep learning
algorithm that can be used to generate synthetic data. GANs can be used to
generate unlabeled data for semi-supervised learning by training two neural
networks, a generator and a discriminator.
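
As a minimal sketch of the self-training method above (assuming scikit-learn, which marks unlabelled points with the label -1):

# Semi-supervised sketch: self-training on partially labelled data.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.semi_supervised import SelfTrainingClassifier

X, y = load_iris(return_X_y=True)
y_partial = y.copy()
rng = np.random.RandomState(0)
y_partial[rng.rand(len(y)) < 0.7] = -1        # hide 70% of the labels

model = SelfTrainingClassifier(LogisticRegression(max_iter=1000))
model.fit(X, y_partial)                       # learns from labelled + unlabelled points
print("Accuracy on all data:", model.score(X, y))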
Advantages of Semi-Supervised Machine Learning
 It leads to better generalization as compared to supervised learning, as it takes
both labeled and unlabeled data.
 Can be applied to a wide range of data.
Disadvantages of Semi-Supervised Machine Learning
 Semi-supervised methods can be more complex to implement compared to
other approaches.
 It still requires some labeled data that might not always be available or easy to
obtain.
 Noisy or unrepresentative unlabelled data can degrade the model's performance.
Applications of Semi-Supervised Learning
Here are some common applications of semi-supervised learning:
 Image Classification and Object Recognition: Improve the accuracy of models
by combining a small set of labeled images with a larger set of unlabeled
images.

 Natural Language Processing (NLP): Enhance the performance of language
models and classifiers by combining a small set of labeled text data with a vast
amount of unlabeled text.
 Speech Recognition: Improve the accuracy of speech recognition by leveraging
a limited amount of transcribed speech data and a more extensive set of
unlabeled audio.
 Recommendation Systems: Improve the accuracy of personalized
recommendations by supplementing a sparse set of user-item interactions
(labeled data) with a wealth of unlabeled user behavior data.
 Healthcare and Medical Imaging: Enhance medical image analysis by utilizing
a small set of labeled medical images alongside a larger set of unlabeled images.
4. Reinforcement Machine Learning
A reinforcement learning algorithm is a learning method that interacts with an environment by producing actions and discovering errors. Trial, error, and delayed reward are the most relevant characteristics of reinforcement learning. In this technique, the model keeps improving its performance using reward feedback to learn the behaviour or pattern. These algorithms are tailored to a particular problem, e.g. the Google self-driving car, or AlphaGo, where a bot competes with humans and even itself to become a better and better player of the game of Go. The more experience the agent gathers, the better trained it becomes.
Here are some of the most common reinforcement learning algorithms (a Q-learning sketch follows the list):
a. Q-learning: Q-learning is a model-free RL algorithm that learns a Q-function,
which maps states to actions. The Q-function estimates the expected reward of
taking a particular action in a given state.
b. SARSA (State-Action-Reward-State-Action): SARSA is another model-free
RL algorithm that learns a Q-function. However, unlike Q-learning, SARSA
updates the Q-function for the action that was actually taken, rather than the
optimal action.
c. Deep Q-learning: Deep Q-learning is a combination of Q-learning and deep
learning. Deep Q-learning uses a neural network to represent the Q-function,
which allows it to learn complex relationships between states and actions.
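
As a toy sketch of the Q-learning update described above (illustrative only: a five-state corridor where reaching the last state yields reward 1):

# Tabular Q-learning sketch; actions are 0 = left, 1 = right.
import numpy as np

n_states, n_actions = 5, 2
Q = np.zeros((n_states, n_actions))
alpha, gamma, epsilon = 0.1, 0.9, 0.1     # learning rate, discount, exploration rate
rng = np.random.default_rng(0)

for episode in range(500):
    state = 0
    while state != n_states - 1:
        # epsilon-greedy action selection
        action = rng.integers(n_actions) if rng.random() < epsilon else int(Q[state].argmax())
        next_state = max(0, state - 1) if action == 0 else state + 1
        reward = 1.0 if next_state == n_states - 1 else 0.0
        # Q-learning update: move Q(s, a) toward reward + discounted best future value
        Q[state, action] += alpha * (reward + gamma * Q[next_state].max() - Q[state, action])
        state = next_state

print(Q)    # the learned values prefer action 1 (right) in every state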

Reinforcement Machine Learning

Let’s understand it with the help of examples.


Example: Consider that you are training an AI agent to play a game like chess. The agent explores different moves and receives positive or negative feedback based on the outcome. Reinforcement learning also finds applications in robotics, where agents learn to perform tasks by interacting with their surroundings.
Types of Reinforcement Machine Learning
There are two main types of reinforcement learning:
Positive reinforcement
- Rewards the agent for taking a desired action.
- Encourages the agent to repeat the behavior.
Examples: Giving a treat to a dog for sitting, providing a point in a game for a correct answer.
Negative reinforcement
- Removes an undesirable stimulus to encourage a desired behavior.
- Encourages the agent to repeat the desired behavior in order to avoid the unpleasant stimulus.
Examples: Turning off a loud buzzer when a lever is pressed, avoiding a penalty by completing a task.
Advantages of Reinforcement Machine Learning
- It supports autonomous decision-making and is well-suited for tasks that require learning a sequence of decisions, like robotics and game-playing.
- This technique is preferred for achieving long-term results that are otherwise very difficult to achieve.
- It is used to solve complex problems that cannot be solved by conventional techniques.


Disadvantages of Reinforcement Machine Learning
- Training reinforcement learning agents can be computationally expensive and time-consuming.
- Reinforcement learning is not preferable for solving simple problems.
- It needs a lot of data and a lot of computation, which can make it impractical and costly.
Applications of Reinforcement Machine Learning
Here are some applications of reinforcement learning:
- Game Playing: RL can teach agents to play games, even complex ones.
- Robotics: RL can teach robots to perform tasks autonomously.
- Autonomous Vehicles: RL can help self-driving cars navigate and make decisions.
- Recommendation Systems: RL can enhance recommendation algorithms by learning user
preferences.
- Healthcare: RL can be used to optimize treatment plans and drug discovery.
- Natural Language Processing (NLP): RL can be used in dialogue systems and chatbots.
- Finance and Trading: RL can be used for algorithmic trading.
- Supply Chain and Inventory Management: RL can be used to optimize supply chain
operations.
- Energy Management: RL can be used to optimize energy consumption.
- Game AI: RL can be used to create more intelligent and adaptive NPCs in video games.
- Adaptive Personal Assistants: RL can be used to improve personal assistants.
- Virtual Reality (VR) and Augmented Reality (AR): RL can be used to create immersive
and interactive experiences.
- Industrial Control: RL can be used to optimize industrial processes.
- Education: RL can be used to create adaptive learning systems.
- Agriculture: RL can be used to optimize agricultural operations.

1.1.3 Machine Learning tools

Top Machine Learning Tools

Machine learning has witnessed exponential growth in tools and frameworks designed to help data scientists and
engineers efficiently build and deploy ML models. Below is a detailed overview of some of the top machine learning
tools, highlighting their key features.

1. Microsoft Azure Machine Learning

Microsoft Azure is a cloud-based environment you can use to train, deploy, automate, manage, and track ML
models. It is designed to help data scientists and ML engineers leverage their existing data processing and model
development skills & frameworks.

Key Features:

Drag-and-drop visual interface (Azure ML Studio).

Support for popular ML frameworks and languages.

Scalable cloud resources for training and deployment.

2. IBM Watson

IBM Watson is an enterprise-ready AI services, applications, and tooling suite. It provides various tools for data
analysis, natural language processing, and machine learning model development and deployment.

Key Features

Pre-built applications for various industries.

Powerful natural language processing capabilities.

Robust toolset for building, training, and deploying models.

3. TensorFlow

TensorFlow, an open-source software library, facilitates numerical computation through data flow graphs.
Developed by the Google Brain team's researchers and engineers, it is utilized both in research and production
activities within Google.

Key Features

Extensive library for deep learning and machine learning.

Strong support for research and production projects.

Runs on CPUs, GPUs, and TPUs.

4. Amazon Machine Learning

Amazon Machine Learning is a cloud service that makes it easy for professionals of all skill levels to use machine
learning technology. It provides visualization tools and wizards to create machine learning models without learning
complex ML algorithms and technology.

Key Features

Easy to use for creating ML models.

Automatic data transformation and model evaluation.

Integration with Amazon S3, Redshift, and RDS for data storage.

5. OpenNN

OpenNN is an open-source neural network library written in C++. It is designed to implement neural networks
flexibly and robustly, focusing on advanced analytics.

Key Features

High performance and parallelization.

Comprehensive documentation and examples.

Designed for research and development in deep learning.

6. PyTorch

PyTorch, a machine learning framework that's open-source and built upon the Torch library, supports a wide range
of applications, including computer vision and natural language processing. It's celebrated for its adaptability and its
capacity to dynamically manage computational graphs.

Key Features

Dynamic computation graph that allows for flexibility in model architecture.

Strong support for deep learning and neural networks.

Large ecosystem of tools and libraries.

7. Vertex AI

Vertex AI is Google Cloud's AI platform. It consolidates its ML offerings into a unified API, client library, and user
interface. This enables ML engineers and data scientists to accelerate the development and maintenance of artificial
intelligence models.

Key Features

Unified tooling and workflow for model training, hosting, and deployment.

AutoML features for training high-quality models with minimal effort.

Integration with Google Cloud services for storage, data analysis, and more.

8. BigML

BigML is a machine learning platform that helps users create, deploy, and maintain machine learning models. It
offers a comprehensive environment for preprocessing, machine learning, and model evaluation tasks.

Key Features

Interactive visualizations for data analysis.

Automated model tuning and selection.

REST API for integration and model deployment.

9. Apache Mahout

Apache Mahout serves as a scalable linear algebra framework and offers a mathematically expressive Scala-based
domain-specific language (DSL). This design aims to facilitate the rapid development of custom algorithms by
mathematicians, statisticians, and data scientists. Its primary areas of application include filtering, clustering, and
classification, streamlining these processes for professionals in the field.

Key Features

Scalable machine learning library.

Support for multiple distributed backends (including Apache Spark).

Extensible and customizable for developing new ML algorithms.

10. Weka

Weka is an open-source software suite written in Java, designed for data mining tasks. It includes a variety of
machine learning algorithms geared towards tasks such as data pre-processing, classification, regression, clustering,
discovering association rules, and data visualization.

Key Features

User-friendly interface for exploring data and models.

Wide range of algorithms for data analysis tasks.

Suitable for developing new machine learning schemes.

11. Scikit-learn

Scikit-learn is a free, open-source library dedicated to machine learning within the Python ecosystem. It
is celebrated for its user-friendly nature and straightforwardness, offering an extensive array of supervised and
unsupervised learning algorithms. Built on foundational libraries such as NumPy, SciPy, and matplotlib, it
is a primary choice for data mining and analysis tasks.

Key Features

Comprehensive collection of algorithms for classification, regression, clustering, and dimensionality reduction.

Tools for model selection, evaluation, and preprocessing.

Extensive documentation and community support.
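
To make the library's typical workflow concrete, here is a minimal sketch, assuming scikit-learn is installed; the built-in Iris dataset and the choice of classifier are purely illustrative:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Load a small built-in dataset: features X and labels y
X, y = load_iris(return_X_y=True)

# Hold out 25% of the data for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

# Train a classifier and evaluate it on the held-out set
model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)
print("Accuracy:", accuracy_score(y_test, model.predict(X_test)))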

12. Google Cloud AutoML

Google Cloud AutoML offers a collection of machine learning tools designed to help developers with minimal ML
knowledge create tailored, high-quality models for their unique business requirements. It leverages Google's
advanced transfer learning and neural architecture search technologies.

Key Features

User-friendly interface for training custom models.

Supports various ML tasks such as vision, language, and structured data.

Integration with Google Cloud services for seamless deployment and scalability.

13. Colab

Colab, or Google Colaboratory, is a free cloud service based on Jupyter Notebooks that supports Python. It is
designed to facilitate ML education and research with no setup required. Colab provides an easy way to write and
execute arbitrary Python code through the browser.

Key Features

Free access to GPUs and TPUs for training.

Easy sharing of notebooks within the community.

Integration with Google Drive for easy storage and access to notebooks.

14. KNIME

KNIME is an open-source data analytics, reporting, and integration platform allowing users to create data flows
visually, selectively execute some or all analysis steps, and inspect the results, models, and interactive views.

Key Features

A graphical user interface for easy workflow assembly.

Wide range of nodes for data integration, transformation, analysis, and visualization.

Extensible through plugins and integration with other languages.

15. Keras

Keras, a Python-based open-source library for neural networks, facilitates swift experimentation in the realm of deep
learning. Serving as an interface for TensorFlow, it simplifies the construction and training of models.

Key Features

User-friendly, modular, and extensible.

Supports convolutional and recurrent networks, as well as combinations of the two.

Runs seamlessly on CPU and GPU.

16. RapidMiner

RapidMiner serves as a comprehensive data science tool, offering a cohesive platform for tasks like data prep,
machine learning, deep learning, text mining, and predictive analytics. It caters to users of varying expertise,
accommodating both novices and seasoned professionals.

Key Features

Visual workflow designer for easy creation of analysis processes.

Extensive collection of algorithms for data analysis.

Supports deployment of models in enterprise applications.

17. Shogun

Shogun is a freely available machine learning library that encompasses a wide range of efficient and cohesive
techniques. Developed in C++, it features interfaces for several programming languages, including C++, Python, R,
Java, Ruby, Lua, and Octave.

Key Features

Supports many ML algorithms and frameworks for regression, classification, and clustering.

Integration with other scientific computing libraries.

Focus on kernel methods and support vector machines.

18. Project Jupyter

Project Jupyter is a free, open-source initiative designed to enhance interactive data science and scientific computing
across various programming languages. Originating from the IPython project, it offers a comprehensive framework
for interactive computing, including notebooks, code, and data management.

Key Features

Supports interactive data visualization and sharing of live code.

Extensible with a large number of extensions and widgets.

Cross-language support, including Python, Julia, R, and many more.

19. Amazon SageMaker

Amazon SageMaker empowers developers and data scientists to create, train, and deploy
ML models with ease. It simplifies and streamlines every stage of the machine learning
workflow, from development and training through optimization and deployment.

Key Features

Built-in algorithms and support for custom algorithms.

One-click deployment and automatic model tuning.

Integration with AWS services for data processing and storage.

20. Apache Spark

Apache Spark serves as an integrated analytics engine designed to process data on a large
scale. It offers advanced APIs for Java, Scala, Python, and R, alongside an efficient engine
that backs versatile computation graphs for data analysis. Engineered for rapid processing,
Spark enables in-memory computation and supports a range of machine learning algorithms
through its MLlib library.

Key Features

Fast processing of large datasets.

Spark supports SQL queries and streaming data.

MLlib, a library of common machine learning algorithms.

Runs in standalone mode or scales up to thousands of nodes.

A very active community that contributes to its extensive ecosystem.
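
To make the MLlib workflow concrete, here is a minimal sketch of its DataFrame-based API, assuming PySpark is installed; the application name and the tiny inline dataset are illustrative:

from pyspark.sql import SparkSession
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.linalg import Vectors

spark = SparkSession.builder.appName("mllib-sketch").getOrCreate()

# A tiny labeled DataFrame; real workloads would load data at scale
train = spark.createDataFrame([
    (0.0, Vectors.dense([0.0, 1.1])),
    (1.0, Vectors.dense([2.0, 1.0])),
    (0.0, Vectors.dense([0.1, 1.3])),
    (1.0, Vectors.dense([1.9, 0.8])),
], ["label", "features"])

model = LogisticRegression(maxIter=10).fit(train)
model.transform(train).select("label", "prediction").show()

spark.stop()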

I.C 1.2 Preparing Machine Learning environment

 Installation of Python
Installing Python on the Windows operating system is relatively easy
and involves a few uncomplicated steps. This section takes you through
the process of downloading and installing Python on your Windows
computer.
How to Install Python in Windows?

We have provided step-by-step instructions to guide you and ensure a
successful installation. Whether you are new to programming or have
some experience, mastering how to install Python on Windows will
enable you to utilize this potent language and uncover its full range of
potential applications.
To download Python on your system, you can use the following steps

Step 1: Select Version to Install Python
Visit the official page for Python https://www.python.org/downloads/ on the Windows
operating system. Locate a reliable version of Python 3, preferably version 3.10.11, which
was used in testing this tutorial. Choose the correct link for your device from the options
provided: either Windows installer (64-bit) or Windows installer (32-bit) and proceed to
download the executable file.

[Figure: Python Homepage]

Step 2: Downloading the Python Installer


Once you have downloaded the installer, open the .exe file, such as python-3.10.11-
amd64.exe, by double-clicking it to launch the Python installer. Choose the option to
Install the launcher for all users by checking the corresponding checkbox, so that all users
of the computer can access the Python launcher application. Enable users to run Python
from the command line by checking the Add python.exe to PATH checkbox.

[Figure: Python Installer]

After Clicking the Install Now Button the setup will start installing Python on your
Windows system. You will see a window like this.

[Figure: Python Setup]

Step 3: Running the Executable Installer
After completing the setup. Python will be installed on your Windows system. You will see
a successful message.

[Figure: Python successfully installed]

Step 4: Verify the Python Installation in Windows


Close the window after successful installation of Python. You can check if the installation
of Python was successful by using either the command line or the Integrated Development
Environment (IDLE), which you may have installed. To access the command line, click
on the Start menu and type “cmd” in the search bar. Then click on Command Prompt.
python --version

[Figure: Python version]

You can also check the version of Python by opening the IDLE application. Go to Start and
enter IDLE in the search bar and then click the IDLE app, for example, IDLE (Python
3.10.11 64-bit). If you can see the Python IDLE window, then you have successfully
downloaded and installed Python on Windows.

[Figure: Python IDLE]

Getting Started with Python


Python is a lot easier to code and learn. Python programs can be written on any plain text
editor like Notepad, notepad++, or anything of that sort. One can also use an Online IDE to
run Python code or can even install one on their system to make it more feasible to write
these codes because IDEs provide a lot of features like an intuitive code editor, debugger,
compiler, etc. To begin with, writing Python Codes and performing various intriguing and
useful operations, one must have Python installed on their System.

 Installation of Tools
In a terminal, run:

$ python3 -m pip install python-dev-tools --user --upgrade

Installation with Visual Studio Code


 Follow the installation procedure for python-dev-tools
 Be sure to have the official Python extension installed in VS Code
 Open VS Code from within your activated virtual environment (in fact, make
sure that flake8 from python-dev-tools is in your PYTHON_PATH)
 In VS Code, open settings (F1 key, then type “Open Settings (JSON)”, then
enter)
 Add in the opened JSON file (before the closing }):

"python.linting.enabled": true,

"python.linting.flake8Enabled": true,

"python.linting.flake8Path": "flake8",

"python.formatting.provider": "black",

"python.formatting.blackPath": "whataformatter",

"python.formatting.blackArgs": [],

Environment Testing
Think of how you might test the lights on a car. You would turn on the lights (known as the
test step) and go outside the car or ask a friend to check that the lights are on (known as the
test assertion). Testing multiple components is known as integration testing.
You have just seen two types of tests:

1. An integration test checks that components in your application operate with each
other.
2. A unit test checks a small component in your application.

You can write both integration tests and unit tests in Python. To write a unit test for the built-in
function sum(), you would check the output of sum() against a known output.

For example, here’s how you check that the sum() of the numbers (1, 2, 3) equals 6:

>>> assert sum([1, 2, 3]) == 6, "Should be 6"
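
Beyond bare assert statements, Python's built-in unittest module gives tests a standard structure. Here is a minimal sketch that wraps the same check in a test case; the class and method names are illustrative:

import unittest

class TestSum(unittest.TestCase):
    def test_sum_of_list(self):
        # Test step and test assertion in one place
        self.assertEqual(sum([1, 2, 3]), 6, "Should be 6")

if __name__ == "__main__":
    unittest.main()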

Data Collection and Acquisition

 Description of key terms

Data: data is a distinct piece of information that is gathered and
translated for some purpose. Data is information that has been translated
into a form that is efficient for movement or processing.
Information: Information is a result of processing or transforming data
into a useful form. We understand information because it's more
organized and has context. Information can be in the form of graphs,
tables, or videos.
Dataset: A dataset is an organized collection of data. The most basic
representation of a dataset is data elements presented in tabular form.
Each column represents a particular variable, and each row corresponds
to a single record containing one value for each of those variables.
Data warehouse: A data warehouse (DW) is a digital storage system that
connects and harmonizes large amounts of data from many different
sources.
Big data: Big Data is a collection of data that is huge in volume, yet
growing exponentially with time. It is data of such large size and
complexity that no traditional data management tool can store or
process it efficiently. In short, big data is still data, but of enormous size.

Types Of Big Data

Following are the types of Big Data:

1. Structured
2. Unstructured
3. Semi-structured

Structured

Any data that can be stored, accessed, and processed in the form of a fixed format is termed
'structured' data.

However, nowadays we foresee issues when the size of such data grows to a huge extent;
typical sizes are in the range of multiple zettabytes.

Do you know? 10^21 bytes equal 1 zettabyte; in other words, one billion terabytes form a zettabyte.

Looking at these figures one can easily understand why the name Big Data is given and
imagine the challenges involved in its storage and processing.

Do you know? Data stored in a relational database management system is one example of
'structured' data.

Examples Of Structured Data

An ‘Employee’ table in a database is an example of Structured Data

Employee_ID Employee_Name Gender Department Salary_In_lacs


2365 Rajesh Kulkarni Male Finance 650000
3398 Pratibha Joshi Female Admin 650000
7465 Shushil Roy Male Admin 500000
7500 Shubhojit Das Male Finance 500000
7699 Priya Sane Female Finance 550000

Unstructured

Any data with unknown form or structure is classified as unstructured data. In addition to
its huge size, unstructured data poses multiple challenges when it comes to processing it
to derive value. A typical example of unstructured data is a heterogeneous data source
containing a combination of simple text files, images, videos, etc.

Examples Of Unstructured Data

The output returned by ‘Google Search’


Semi-structured

Semi-structured data can contain both forms of data. Semi-structured data appears structured
in form, but it is not actually defined with, for example, a table definition in a
relational DBMS. An example of semi-structured data is data represented in an XML file.

Examples Of Semi-structured Data

Personal data stored in an XML file-

<rec><name>Prashant Rao</name><sex>Male</sex><age>35</age></rec>
<rec><name>Seema R.</name><sex>Female</sex><age>41</age></rec>
<rec><name>Satish Mane</name><sex>Male</sex><age>29</age></rec>
<rec><name>Subrato Roy</name><sex>Male</sex><age>26</age></rec>
<rec><name>Jeremiah J.</name><sex>Male</sex><age>35</age></rec>
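
As a minimal sketch of how such semi-structured records can be processed in Python, assuming the <rec> records above are wrapped in a single root element and saved as a hypothetical file people.xml, the standard xml.etree.ElementTree module can parse them:

import xml.etree.ElementTree as ET

# Hypothetical file holding the <rec> records above inside one root element
tree = ET.parse("people.xml")
for rec in tree.getroot().findall("rec"):
    name = rec.findtext("name")
    sex = rec.findtext("sex")
    age = int(rec.findtext("age"))
    print(name, sex, age)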
Characteristics Of Big Data

Big data can be described by the following characteristics:

 Volume: The name Big Data itself is related to a size which is enormous. ‘Volume’
is one characteristic which needs to be considered while dealing with Big Data
solutions.
 Variety: Variety refers to heterogeneous sources and the nature of data, both
structured and unstructured. Nowadays, data in the form of emails, photos, videos,

monitoring devices, PDFs, audio, etc. are also being considered in the analysis
applications. This variety of unstructured data poses certain issues for storage, mining
and analyzing data.

 Velocity: The term ‘velocity’ refers to the speed of generation of data. How fast the
data is generated and processed to meet the demands, determines real potential in the
data. Big Data Velocity deals with the speed at which data flows in from sources like
business processes, application logs, networks, and social media sites,
sensors, Mobile devices, etc. The flow of data is massive and continuous.

 Variability: This refers to the inconsistency which can be shown by the data at times,
thus hampering the process of being able to handle and manage the data effectively.

 Identification of Source of data


IoT Sensors: IoT sensors are among the key components of IoT
devices; they collect data from their surroundings and transmit it over
networks.
Camera: A camera is an instrument used to capture and store images
and videos, either digitally via an electronic image sensor, or
chemically via a light-sensitive material such as photographic film.
Computer: The data might be located on the same computer as
the program, or on another computer somewhere on a network.
Smartphone: smartphones generate data continuously through built-in
sensors (such as GPS and accelerometers), cameras, installed apps,
and user interactions.
Social data: information that social media users publicly share,
including metadata such as the user's location, language spoken,
biographical data, and shared links.
Transactional data: Transactional data relates to the transactions of
the organization and includes data that is captured, for example, when
a product is sold or purchased.
Gathering Machine Learning dataset

Before we dive into such topics as ML and Data Science, and try to explain how it works, we
should answer several questions:

 What can we achieve in business or on the project with the help of ML? What
goals do I want to accomplish using ML?

 Do I only want to hop on the trend, or will the use of ML really improve user
experience, increase profitability, or protect my product and its users?

 Do I need the system to predict anything, or does it need to be able to detect anomalies?

Understanding the Data Collection Process

1. Defining the Problem Statement

Clearly outline the objectives of the data collection process and the specific research
questions you want to answer. This step will guide the entire process and ensure you
collect the right data to meet your goals.

Also, it is recommended to identify data sources. Determine the sources from which you
will collect data. These sources may include primary data (collected directly for your study)
or secondary data (previously collected by others). Common data sources include surveys,
interviews, existing databases, observation, experiments, and online platforms.

2. Planning Data Collection

In this stage, it is better to start with the selection of data collection methods. Choose the
appropriate methods to collect data from the identified sources. The methods may vary
depending on the nature of the data and research objectives.

Common methods include:

 Surveys: Structured questionnaires administered to a target group to gather


specific information.

 Interviews: Conducting one-on-one or group conversations to gain in-depth insights.

 Observation: Systematically observing and recording behaviors or events.

 Experiments: Controlling variables to study cause-and-effect relationships.

 Web scraping: Extracting data from websites and online sources.

 Sensor data collection: Gathering data from sensors or IoT devices.

3. Ensuring Data Quality

The next step is crucial: ensuring data quality means reviewing the collected data to
check for errors, inconsistencies, or missing values. Apply quality assurance techniques to
ensure the data is reliable and suitable for analysis.

The following step would be data storage and management. It will require organizing and
storing the collected data in a secure and accessible manner. Consider using databases or
other data management systems for efficient storage and retrieval.
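
In Python, a first pass at the quality checks described above is often done with pandas. Here is a minimal sketch, assuming pandas is installed and that data.csv is a hypothetical collected file, that flags missing values and duplicates:

import pandas as pd

df = pd.read_csv("data.csv")  # hypothetical collected dataset

# Count missing values per column
print(df.isnull().sum())

# Count exact duplicate rows
print("Duplicates:", df.duplicated().sum())

# Quick statistical summary to spot inconsistent or out-of-range values
print(df.describe())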

How to Start Collecting Data for ML: Data Collection Strategy

1. Synthetic Data Generation

Synthetic data is any information manufactured artificially which does not represent events or
objects in the real world. Algorithms create synthetic data used in model datasets for testing
or training purposes. This data can mimic operational or production data and help train ML
models or test out mathematical models.

2. Active Learning

Active learning is a machine learning technique that focuses on selecting the most
informative data points to label or annotate from an unlabeled dataset. The aim of active
learning is to reduce the amount of labeled data required to build an accurate model by
strategically choosing which instances to query for labels. This is especially useful when
labeling data can be time-consuming or expensive.

Active Learning in Data Collection: Steps

Initial Data Collection: Initially, a small labeled dataset is collected through random sampling or any other standard method.

Model Training: The initial labeled data is used to train a machine learning model.

Uncertainty Estimation: The model is then used to predict the labels of the remaining unlabeled data points. During this process, the model's uncertainty about its predictions is often estimated. There are various ways to measure uncertainty, such as entropy, margin sampling, and least confidence.

Query Strategy Selection: A query strategy is chosen to decide which data points to request labels for. The query strategy selects instances with high uncertainty, as these instances are likely to have the most impact on improving the model's performance. There are a few methods to apply query strategy selection: uncertainty sampling, diversity sampling, and representative sampling.

Labeling New Data Points: The selected instances are then sent for labeling or annotation by domain experts.

Model Update: The newly labeled data is added to the labeled dataset, and the model is retrained using the expanded labeled set.
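
The loop above can be sketched in a few lines with scikit-learn, using least-confidence uncertainty sampling; the synthetic dataset, model choice, and batch sizes are illustrative assumptions, and the known labels stand in for a human annotator:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic pool of data; a small random subset starts out labeled
X, y = make_classification(n_samples=1000, random_state=0)
labeled = np.zeros(len(X), dtype=bool)
labeled[np.random.RandomState(0).choice(len(X), 20, replace=False)] = True

model = LogisticRegression(max_iter=1000)
for round_no in range(5):
    model.fit(X[labeled], y[labeled])
    # Least confidence: 1 minus the probability of the most likely class
    proba = model.predict_proba(X[~labeled])
    uncertainty = 1 - proba.max(axis=1)
    # Query the 10 most uncertain points; y stands in for the expert labeler
    query = np.where(~labeled)[0][np.argsort(uncertainty)[-10:]]
    labeled[query] = True
    print(f"Round {round_no}: {labeled.sum()} labeled examples")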

3. Transfer Learning

It is a popular approach for training models when there is not enough training data or time to
train from scratch. A common technique is to start from an existing, well-trained model
(also called the source task) and incrementally train a new model (the target task) until it
performs well.
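
As one hedged sketch of this pattern in Keras, the code below freezes a source model pre-trained on ImageNet and adds a new head for the target task; the base model, input size, and class count are illustrative assumptions, and the first run downloads the pre-trained weights:

import tensorflow as tf

# Source task: a model already well trained on ImageNet, without its classifier head
base = tf.keras.applications.MobileNetV2(input_shape=(160, 160, 3),
                                         include_top=False, weights="imagenet")
base.trainable = False  # freeze the source model's weights

# Target task: a small new head, e.g. for 5 classes (an assumption)
model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(5, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(...) would then train only the new head on the target data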

4. Open Source Datasets

Open source datasets are a valuable resource in data collection for various research and
analysis purposes. These datasets are typically publicly available and can be freely accessed,
used, and redistributed. Leveraging open source datasets can save time, resources, and effort,
as they often come pre-cleaned and curated, ready for analysis.

Common Methods For Utilizing Open Source Datasets in Data Collection:

Identifying Suitable Datasets: Start by identifying the relevant open source datasets for your research or analysis needs. There are various platforms and repositories where you can find open source datasets, such as Kaggle, data.gov, UCI Machine Learning Repository, GitHub, Google Dataset Search, and many others.

Data Exploration: Before using a dataset, it's essential to explore its contents to understand its structure, the variables available, and the quality of the data. This preliminary analysis will help you determine if the dataset meets your research requirements.

Data Licensing: Pay close attention to the licensing terms associated with the open source dataset. Some datasets might have specific conditions for usage and redistribution, while others may be entirely open for any purpose. Make sure to adhere to the terms of use.

Data Preprocessing: Although open source datasets are usually pre-cleaned, they may still require some preprocessing to fit your specific needs. This step could involve handling missing data, normalizing values, encoding categorical variables, and other data transformations.

Ethical Considerations: Ensure that the data you are using does not contain sensitive or private information that could potentially harm individuals or organizations. Respect data privacy and consider anonymizing or de-identifying data if necessary.

Data Integration: In some cases, your research might require data from multiple sources. Open source datasets can be combined with proprietary data or other open source datasets to enhance the scope and depth of your analysis.

Validation and Quality Control: Just like with any data, it's crucial to validate the open source dataset for accuracy and quality. Cross-referencing the data with other sources or performing sanity checks can help ensure the dataset's reliability.

Citations and Attribution: When using open source datasets in your research or analysis, it's essential to give proper credit to the original creators or contributors. Follow the provided citation guidelines and acknowledge the source of the data appropriately.

Reproducibility: If your research involves publishing results or sharing analyses, make sure to share the exact details of the open source datasets you used. This ensures that others can replicate your work and verify your findings.

5. Manual Data Generation

Manual data generation refers to the process of collecting data by hand, without the use of
automated tools or systems. Manual data generation can be time-consuming and resource-
intensive, but it can yield valuable and reliable data when performed carefully.

Manual Data Generation Methods in Data Collection

Surveys and Questionnaires: Researchers design surveys or questionnaires to gather information directly from respondents. These can be administered in person, over the phone, via email, or through online platforms. Manual data entry may be required to record the responses.

Observations: Researchers directly observe and record data on certain behaviors, events, or phenomena. This approach is common in social sciences and ethnographic studies.

Interviews: Conducting interviews, either face-to-face or through phone calls, allows researchers to gather qualitative data directly from participants. Manual note-taking or recording of responses is typically necessary.

Content Analysis: Involves manually reviewing and categorizing data from various sources, such as documents, articles, or social media posts, to identify patterns or themes.

Manual Extraction from Physical Sources: When dealing with data that exists in physical forms, such as books, handwritten records, or photographs, manual transcription or data extraction may be necessary.

Manual Labeling or Annotation: In machine learning and AI, manually annotating data with labels or tags can be crucial for training algorithms in supervised learning tasks.

Field Studies: Researchers collect data in real-world settings, making direct observations and recording relevant information manually.

Diaries or Logs: Participants may be asked to keep diaries or logs of their activities, experiences, or behaviors over a certain period.

Handwritten Surveys or Data Collection: In some cases, data might be collected using pen and paper, and then manually transcribed into digital formats for analysis.

6. Building Synthetic Datasets

Building synthetic datasets is one of the most common methods in data collection when real
data is limited or unavailable, or when privacy concerns prevent the use of actual data.
Synthetic datasets are artificially generated datasets that mimic the statistical properties
and patterns of real data without containing any sensitive or identifiable information.

Here’s a step-by-step guide on how to build synthetic datasets:

 Define the Problem and Objectives: Clearly identify the purpose of the
synthetic dataset. Determine what specific features, relationships, and patterns you
want the synthetic data to capture. Understand the target domain and data
characteristics to ensure the synthetic dataset is meaningful and useful.

 Understand the Real Data: If possible, analyze and understand the real data you
want to emulate. Identify the key statistical properties, distributions, and relationships
within the data. This will help inform the design of the synthetic dataset.

 Choose a Data Generation Method: Several methods can be used to create synthetic
datasets (statistical methods, generative models, data augmentation, simulations).

 Choose the Right Features: Identify the essential features from the real data that
need to be included in the synthetic dataset. Avoid including personally
identifiable information (PII) or any sensitive data that might compromise privacy.

 Generate the Synthetic Data: Implement the chosen data generation method to
create the synthetic dataset. Ensure that the dataset follows the same format and
data types as the real data to be used seamlessly in analyses and modeling.

 Validate and Evaluate: Assess the quality and accuracy of the synthetic dataset
by comparing it to the real data. Use metrics and visualizations to validate that the
synthetic data adequately captures the patterns and distributions present in the real
data.

 Modify and Iterate: If the initial synthetic dataset does not meet your expectations,
refine the data generation method or adjust parameters until it better aligns with the
desired objectives.

 Use Case Considerations: Understand the limitations of synthetic datasets. They


might not fully capture rare events or extreme cases present in real data.
Consequently, synthetic datasets are best suited for certain use cases, such as
initial model development, testing, and sharing with third parties.

 Ensure Privacy and Ethics: Always prioritize privacy and ethical considerations
when generating synthetic datasets. Ensure that no individual or sensitive information
can be inferred from the synthetic data.

By following these steps, you can create synthetic datasets that can serve as valuable
substitutes for real data in various scenarios, contributing to better model development and
analysis in data-scarce or privacy-sensitive environments.
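
As one minimal sketch of the statistical-method approach, scikit-learn can generate a labeled synthetic dataset whose size, feature count, and class balance you control; all parameter values below are illustrative:

from sklearn.datasets import make_classification
import pandas as pd

# Generate 500 synthetic examples with 6 features and 2 imbalanced classes
X, y = make_classification(n_samples=500, n_features=6, n_informative=4,
                           n_classes=2, weights=[0.7, 0.3], random_state=42)

df = pd.DataFrame(X, columns=[f"feature_{i}" for i in range(6)])
df["label"] = y
print(df.head())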

7. Federated Learning

Federated learning is a privacy-preserving machine learning approach that enables multiple


parties to collaboratively build a global machine learning model without sharing their raw
data with a central server. This method is particularly useful in scenarios where data privacy
and security are major concerns, such as in healthcare, financial services, and other sensitive
industries.
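
A core building block of many federated learning systems is federated averaging: each party trains locally, and only model parameters, never raw data, are sent to a server for aggregation. Here is a minimal NumPy sketch of the aggregation step alone; the local training is abstracted away, and all values are illustrative:

import numpy as np

def federated_average(client_weights, client_sizes):
    # Weighted average of client parameters, weighted by local dataset size
    total = sum(client_sizes)
    return sum(w * (n / total) for w, n in zip(client_weights, client_sizes))

# Three hypothetical clients, each holding locally trained parameter vectors
client_weights = [np.array([0.2, 1.0]), np.array([0.4, 0.8]), np.array([0.3, 0.9])]
client_sizes = [100, 300, 600]  # local training examples per client

global_weights = federated_average(client_weights, client_sizes)
print(global_weights)  # the raw data never left the clients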

 Machine Learning tools

TensorFlow

TensorFlow, an open-source machine learning tool, is renowned for its flexibility, ideal
for crafting diverse models, and simple to use. With abundant resources and user-friendly
interfaces, it simplifies building models and making sense of data.

PyTorch

PyTorch is a user-friendly machine learning tool, facilitating seamless model


construction. Loved by researchers for its simplicity, it fosters easy idea testing and
error identification. Its intuitive design makes it a preferred choice, offering a smooth
and precise experience in model development.

Scikit-learn

Scikit-learn is a valuable tool for everyday machine-learning tasks, offering a plethora of


tools for tasks like pattern recognition and prediction. Its user-friendly interface and
extensive functionality make it accessible for various applications, whether you’re
identifying patterns in data or making accurate predictions.

Keras

Keras helps you easily create models and is great for quick experiments, especially with
images or text. It's user-friendly, making it simple to try out ideas, whether you're working
on recognizing images or understanding language.

XGBoost

XGBoost excels in analyzing tabular data, showcasing exceptional prowess in pattern
identification and prediction. This machine learning tool is particularly adept at discerning
trends and delivering accurate predictions, making it a standout performer, especially in
competitive scenarios such as machine learning contests.

Apache Spark MLlib

Apache Spark MLlib is a powerful tool designed for handling massive datasets, making it
ideal for large-scale projects with extensive data. It simplifies complex data analysis tasks
by providing a robust machine-learning framework. When you're dealing with substantial
amounts of information, Spark MLlib offers scalability and efficiency, making it a valuable
resource for projects requiring the processing of extensive data sets.

Microsoft Azure Machine Learning

Microsoft Azure Machine Learning makes it easy to do machine learning in the cloud.
It's simple, user-friendly, and works well for many different projects, making
machine learning accessible and efficient.

Google Cloud AI Platform

Google Cloud AI Platform is a strong tool for using machine learning on Google Cloud.
Great for big projects, it easily works with other Google tools. It provides detailed stats
and simple functions, making it a powerful option for large machine-learning tasks.

H2O.ai

H2O.ai is a tool that helps you apply machine learning easily. It suits many tasks and has
an active community. With H2O.ai, you can use machine learning effectively, thanks to its
straightforward interface and strong community support.

RapidMiner

RapidMiner is an all-round tool covering the entire machine learning workflow, ideal for
exploring concepts and collaborating on large projects. It enables trying out ideas and
supports seamless teamwork, making it a versatile tool for the various stages of machine
learning development.

 Preparing Machine Learning environment

 Installation of Python

Summary

How to Install Python on Windows?

To install Python on Windows, you need to download the Python installer from the official
Python website and run it on your system. The installation process is straightforward and
includes options to add Python to your system PATH.

What Are the Steps to Install Python 3 on Windows?

Steps to Install Python 3 on Windows:

1. Download the Installer:

 Visit the official Python website: python.org.

 Go to the Downloads section and click on “Download Python 3.x.x” (the


latest version).

2. Run the Installer:

 Locate the downloaded installer file (python-3.x.x.exe) and run it.

3. Select Installation Options:

 Check the box that says “Add Python to PATH” at the bottom of the installer
window.

 Choose “Install Now” for a standard installation or “Customize Installation”


to choose specific features and installation location.

4. Customize Installation (Optional):

 If you chose “Customize Installation,” select optional features like pip,


tcl/tk, and documentation.

 Choose the installation location or accept the default.

5. Complete the Installation:

 The installer will copy the necessary files and set up Python on your system.

 Once the installation is complete, you can close the installer.

How to Verify Python Installation on Windows?

Steps to Verify Python Installation on Windows:

1. Open Command Prompt:

 Press Win + R, type cmd, and press Enter to open the Command Prompt.

2. Check Python Version:

 Type python --version and press Enter.

 You should see the installed Python version, e.g., Python 3.x.x.

3. Check pip Version:

 Type pip --version and press Enter.

 This verifies that pip, the Python package installer, is also installed correctly.

What Are Environment Variables for Python on Windows?

Environment Variables for Python on Windows:

Environment variables are used to configure the environment in which processes run. For
Python, you often need to set the PATH environment variable so that you can run Python and
pip from the command line. This ensures that Python executables and scripts can be accessed
from any command line prompt without specifying their full path.

How to Configure Python Path on Windows?

Steps to Configure Python Path on Windows:

1. Add Python to PATH During Installation:

 When running the Python installer, ensure you check the box that says “Add
Python to PATH.”

2. Manually Add Python to PATH:

 Open the Start menu, search for “Environment Variables,” and select “Edit the
system environment variables.”

 In the System Properties window, click on the “Environment Variables”


button.

 Under “System variables,” find the Path variable and click “Edit.”

 Click “New” and add the path to the Python installation directory
(e.g., C:\Python39) and the Scripts directory (e.g., C:\Python39\Scripts).

 Click “OK” to close all windows.

3. Verify PATH Configuration:

 Open Command Prompt.

 Type python and press Enter to start the Python interpreter. If Python starts,
the PATH is configured correctly.

 Type exit() to exit the Python interpreter.

 Type pip and press Enter to verify that pip can be called from the command
line.

 Installation of Tools

The Anaconda distribution provides an installation of Python together with various IDEs and
tools such as Jupyter Notebook, Spyder, and the Anaconda prompt. Hence it is a very
convenient packaged solution that you can easily download and install on your computer.
It will automatically install Python and some basic IDEs and libraries with it.

Below, some steps are given to show the downloading and installing process of
Anaconda and the IDE:

Step-1: Download Anaconda Python:

o To download Anaconda on your system, first open your favorite browser and type
Download Anaconda Python, and then click on the first link as given in the below
image. Alternatively, you can directly download it by clicking on this link:
https://www.anaconda.com/distribution/#download-section.

o After clicking on the first link, you will reach the download page of
Anaconda, as shown in the below image:

o Since Anaconda is available for Windows, Linux, and Mac OS, you can download
it as per your OS type by clicking on the available options shown in the below
image. It provides Python 2.7 and Python 3.7 versions; since the latest version is
3.7, we will download the Python 3.7 version. After clicking on the download
option, it will start downloading on your computer.

Note: In this topic, we are downloading Anaconda for Windows; you can choose it as
per your OS.

Step-2: Install Anaconda Python (Python 3.7 version):

Once the downloading process is completed, go to Downloads and double-click on the
".exe" file (Anaconda3-2019.03-Windows-x86_64.exe) of Anaconda. It will open a
setup window for the Anaconda installation, as given in the below image; then click on Next.

o It will open a License Agreement window; click on the "I Agree" option and move
further.

o In the next window, you will get two options for installations as given in the
below image. Select the first option (Just me) and click on Next.

o Now you will get a window for the installation location; here, you can leave it as the
default or change it by browsing to a location, and then click on Next. Consider the
below image:

o Now select the second option, and click on install.

o Once the installation is complete, click on Next.

o When the whole installation has finished, tick the checkbox if you want to learn more
about Anaconda and Anaconda Cloud. Click on Finish to end the process.

Note: Here, we will use the Spyder IDE to run Python programs.
Step-3: Open Anaconda Navigator

o After successful installation of Anaconda, use Anaconda navigator to launch


a Python IDE such as Spyder and Jupyter Notebook.

o To open Anaconda Navigator, press the Windows key and search for
Anaconda Navigator, then click on it. Consider the below image:

o After opening the navigator, launch the Spyder IDE by clicking on the Launch
button given below Spyder. It will install the Spyder IDE on your system.

Run your Python program in Spyder IDE.

o Open the Spyder IDE; it will look like the below image:
o Write your first program, and save it using the .py extension (a minimal example follows this list).
o Run the program using the triangle Run button.
o You can check the program's output on the console pane at the bottom right side.

Step-4: Close the Spyder IDE.

 Data Collection and Acquisition

 Description of key terms

Machine learning data refers to the collection of information (or dataset) that is used by
machine learning models to learn, train, and make predictions. It can be structured or
unstructured and usually consists of features (input variables) and labels (output variables or
target values). The quality and nature of this data directly impact the performance and
accuracy of machine learning models.

Machine Learning information refers to the data, knowledge, and insights utilized or
generated in the process of training machine learning models. It includes everything from
raw datasets to model predictions, as well as the intermediate knowledge gained during
data analysis, feature extraction, model evaluation, and decision-making.

A machine learning dataset is a structured collection of data used to train, validate, and
test machine learning models. It consists of multiple examples or data points, where each
data point typically contains features (input variables) and may include a corresponding
label (output or target variable) in the case of supervised learning. The dataset is
essential for enabling machine learning algorithms to learn patterns, make predictions,
and generalize to new data.
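
As a small illustration (all values are made up), such a dataset can be represented as a feature matrix X and a label vector y:

import numpy as np

# Features: [hours_studied, hours_slept] for four students
X = np.array([[2, 9], [1, 5], [3, 6], [5, 8]])
# Labels: 1 = passed the exam, 0 = failed
y = np.array([0, 0, 1, 1])
print(X.shape, y.shape)  # four data points, two features each, one label each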

A machine learning data warehouse is a centralized repository designed to store large


volumes of structured, semi-structured, and unstructured data that can be used for machine
learning (ML) and analytics tasks. It provides the infrastructure to collect, manage, and
retrieve data efficiently for training and deploying machine learning models. Data
warehouses support complex queries and enable users to perform large-scale data
processing, making them essential for preparing high-quality datasets for machine learning
applications.

Big data for machine learning refers to extremely large, complex, and diverse datasets
that are generated at high velocity and volume. These datasets require advanced processing
techniques and technologies to extract useful insights and are used in machine learning
(ML) to improve model performance, accuracy, and scalability. Machine learning on big
data enables models to learn from vast and varied data sources, resulting in more accurate
predictions and better decision- making capabilities.

 Identification of Source of data
