Introduction To Data in Machine Learning

Data is a fundamental element in machine learning, influencing model performance through its quality and quantity. It is categorized into labeled and unlabeled data, with preprocessing being essential for effective model training. The document also discusses the advantages and disadvantages of using data in machine learning, the historical development of machine learning algorithms, and the significance of data in various applications.


ML | Introduction to Data in Machine Learning

Data is a crucial component in the field of Machine Learning. It
refers to the set of observations or measurements that can be used
to train a machine learning model. The quality and quantity of data
available for training and testing play a significant role in
determining the performance of a machine learning model. Data can
be in various forms such as numerical, categorical, or time-series
data and can come from various sources such as databases,
spreadsheets, or APIs. Machine learning algorithms use data to learn
patterns and relationships between input variables and target
outputs, which can then be used for prediction or classification
tasks.
Data is typically divided into two types: labeled and unlabeled.
Labeled data includes a label or target variable that the model is
trying to predict, whereas unlabeled data does not include a label or
target variable.
The data used in machine learning is typically numerical or
categorical. Numerical data includes values that can be ordered and
measured, such as age or income. Categorical data includes values
that represent categories, such as gender or type of fruit.
Data can be divided into training and testing sets. The training set is
used to train the model, and the testing set is used to evaluate the
performance of the model. It is important to ensure that the data is
split in a random and representative way.
Data preprocessing is an important step in the machine learning
pipeline. This step can include cleaning and normalizing the data,
handling missing values, and feature selection or engineering.
DATA: Any unprocessed fact, value, text, sound, or picture that has
not yet been interpreted and analyzed. Data is the most important
part of Data Analytics, Machine Learning, and Artificial
Intelligence. Without data, we cannot train any model, and all modern
research and automation would be in vain. Big enterprises spend a lot
of money just to gather as much reliable data as possible.
Example: Why did Facebook acquire WhatsApp by paying a huge
price of $19 billion?
The answer is very simple and logical – it is to have access to
information about users that Facebook may not have but WhatsApp does.
This information about their users is of paramount importance to
Facebook, as it facilitates the improvement of their services.
INFORMATION: Data that has been interpreted and manipulated so that
it now carries some meaningful inference for the users.
KNOWLEDGE: A combination of inferred information, experiences,
learning, and insights. It results in awareness or concept building
for an individual or organization.

How do we split data in Machine Learning?


 Training Data: The part of the data we use to train our model.
This is the data that your model actually sees (both input
and output) and learns from.
 Validation Data: The part of the data used for frequent
evaluation of the model as it is fit on the training dataset,
and for tuning hyperparameters (parameters set before the
model begins learning). This data plays its part while the
model is actually training.
 Testing Data: Once our model is completely trained, testing
data provides an unbiased evaluation. When we feed in the
inputs of the testing data, our model predicts values without
seeing the actual outputs. After prediction, we evaluate the
model by comparing its predictions with the actual outputs
present in the testing data. This is how we measure how much
our model has learned from the training data.
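As a sketch, the three-way split above can be produced with scikit-learn's `train_test_split` applied twice. The 60/20/20 ratio and the toy data are illustrative choices, not from the original text:

```python
# Sketch of a train/validation/test split with scikit-learn.
# Applying train_test_split twice is a common convention.
from sklearn.model_selection import train_test_split

X = list(range(100))              # toy inputs
y = [i % 2 for i in range(100)]   # toy labels

# First carve off 20% of the data as the test set.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Then take 25% of the remainder (20% of the total) for validation.
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=42)

print(len(X_train), len(X_val), len(X_test))  # 60 20 20
```

Setting `random_state` keeps the split reproducible while still being random and representative.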

Consider an example:
A shopping mart owner conducted a survey and now has a long list of
questions and answers collected from his customers; this list of
questions and answers is DATA. Whenever he wants to infer anything,
he cannot simply go through every question from thousands of
customers to find something relevant, as that would be time-consuming
and unhelpful. To reduce this overhead and make the work easier, the
data is manipulated through software, calculations, graphs, etc.; the
inference drawn from this manipulated data is Information. So, data
is a prerequisite for information. Knowledge, in turn, differentiates
between two individuals who have the same information. Knowledge is
not technical content but is linked to the human thought process.
Different Forms of Data
 Numeric Data: If a feature represents a characteristic
measured in numbers, it is called a numeric feature.
 Categorical Data: A categorical feature is an attribute that
can take on one of a limited, and usually fixed, number of
possible values on the basis of some qualitative property.
A categorical feature is also called a nominal feature.
 Ordinal Data: This denotes a nominal variable whose
categories fall in an ordered list. Examples include clothing
sizes such as small, medium, and large, or a measurement of
customer satisfaction on a scale from “not at all happy” to
“very happy”.
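A minimal sketch of how these three forms can be represented in code, using pandas (the column names and values are made up for illustration):

```python
# Numeric, categorical (nominal), and ordinal features in one table.
import pandas as pd

df = pd.DataFrame({
    "age": [25, 32, 47],                    # numeric: ordered, measurable
    "fruit": ["apple", "banana", "apple"],  # categorical: no inherent order
    "size": ["small", "large", "medium"],   # ordinal: ordered categories
})

# Nominal data is often one-hot encoded: one binary column per category.
one_hot = pd.get_dummies(df["fruit"])

# Ordinal data keeps its order, so an explicit mapping is enough.
size_order = {"small": 0, "medium": 1, "large": 2}
df["size_code"] = df["size"].map(size_order)

print(one_hot.columns.tolist())   # ['apple', 'banana']
print(df["size_code"].tolist())   # [0, 2, 1]
```

One-hot encoding avoids implying a false ordering between nominal categories, while the integer mapping preserves the real ordering of ordinal values.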

Properties of Data –
1. Volume: Scale of data. With the growing world population
and expanding technology, huge amounts of data are being
generated every millisecond.
2. Variety: Different forms of data – healthcare, images,
videos, audio clippings.
3. Velocity: Rate of data streaming and generation.
4. Value: Meaningfulness of data in terms of information that
researchers can infer from it.
5. Veracity: Certainty and correctness in data we are working
on.
Some facts about Data:
 Compared to 2005, 300 times more data – 40 zettabytes
(1 ZB = 10^21 bytes) – was projected to be generated by 2020.
 By 2011, the healthcare sector had accumulated 161 billion
gigabytes of data.
 About 400 million tweets are sent per day by roughly
200 million active users.
 Each month, users stream more than 4 billion hours of video.
 30 billion different pieces of content are shared every month
by users.
 It is reported that about 27% of data is inaccurate, so 1 in
3 business leaders do not trust the information on which they
base their decisions.
The above-mentioned facts are just a glimpse of the huge data
statistics that actually exist. In real-world terms, the amount of
data that currently exists, and that is generated at every moment,
is almost beyond imagining.
Example:
Imagine you’re working for a car manufacturing company and you
want to build a model that can predict the fuel efficiency of a car
based on the weight and the engine size. In this case, the target
variable (or label) is the fuel efficiency, and the features (or input
variables) are the weight and engine size. You will collect data from
different car models, with corresponding weight and engine size,
and their fuel efficiency. This data is labeled, in the form of
(weight, engine size, fuel efficiency) for each car. Once your data
is ready, you split it into two sets: a training set used to train
the model and a testing set used to evaluate the model’s performance.
Preprocessing may also be needed, for example to fill missing values
or handle outliers that might affect your model’s accuracy.
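A sketch of this workflow with made-up numbers: a missing weight is filled with the column mean via scikit-learn's SimpleImputer, then a model is fit. The original text names no particular model; linear regression is just one reasonable choice here:

```python
# Car example sketch: impute a missing value, then fit a linear model.
# All numbers are invented for illustration.
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression

# Features: (weight in kg, engine size in litres); one weight is missing.
X = np.array([[1200.0, 1.4], [np.nan, 2.0], [1800.0, 3.0], [1000.0, 1.0]])
y = np.array([18.0, 14.0, 10.0, 20.0])  # label: fuel efficiency (km/l)

# Preprocessing: replace the missing weight with the column mean.
X_filled = SimpleImputer(strategy="mean").fit_transform(X)

# Train on the cleaned data and predict for a new car.
model = LinearRegression().fit(X_filled, y)
predicted = model.predict([[1500.0, 2.0]])[0]
print(round(predicted, 1))
```

In a real project the imputer would be fit on the training set only, to avoid leaking information from the test set.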

Implementation:

Example 1
 Python3

# Example input data
X = [[1, 2], [2, 3], [3, 4], [4, 5], [5, 6]]
y = [0, 0, 1, 1, 1]

# Train a model
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
model.fit(X, y)

# Make a prediction for a new, unseen input
prediction = model.predict([[6, 7]])[0]
print(prediction)

Output:
1
The output is the prediction made by the model for the new input
[6, 7]. In the training data, every example with larger feature
values carries the label 1, so the model learns that inputs with
larger values are more likely to belong to class 1, and the
prediction for [6, 7] is 1.

Advantages and Disadvantages:
Advantages of using data in Machine Learning:

1. Improved accuracy: With large amounts of data, machine
learning algorithms can learn more complex relationships
between inputs and outputs, leading to improved accuracy
in predictions and classifications.
2. Automation: Machine learning models can automate
decision-making processes and can perform repetitive tasks
more efficiently and accurately than humans.
3. Personalization: With the use of data, machine learning
algorithms can personalize experiences for individual users,
leading to increased user satisfaction.
4. Cost savings: Automation through machine learning can
result in cost savings for businesses by reducing the need
for manual labor and increasing efficiency.

Disadvantages of using data in Machine Learning:

1. Bias: Data used for training machine learning models can be
biased, leading to biased predictions and classifications.
2. Privacy: Collection and storage of data for machine learning
can raise privacy concerns and can lead to security risks if
the data is not properly secured.
3. Quality of data: The quality of data used for training
machine learning models is critical to the performance of
the model. Poor quality data can lead to inaccurate
predictions and classifications.
4. Lack of interpretability: Some machine learning models can
be complex and difficult to interpret, making it challenging
to understand how they are making decisions.
What is Machine Learning?
Machine learning is a field of computer science that deals with the
development of algorithms that can learn from data. Machine learning has
revolutionized many areas of research over the past few decades- most
notably in fields like natural language processing (NLP) and image
recognition. It’s because machine learning algorithms are able to
improve efficiency and accuracy when it comes to tasks like
predicting outcomes or interpreting data.

This makes them incredibly useful in fields as varied as finance,
healthcare, retail, and manufacturing. One example of how SEO might
use machine learning is by using predictive modelling techniques to
predict how users will behave on a given page or site. This
information can then be used to adjust your website’s content or
design accordingly.

There are several different types of machine learning algorithms,
each with its own strengths and weaknesses. Some of the most popular
include supervised learning (where the algorithm is given a set of
labeled training data), unsupervised learning (where the algorithm is
given unlabeled data), reinforcement learning (where an agent learns
to associate positive and negative feedback with behaviors), deep
neural networks (DNNs), genetic algorithms, Bayesian networks, and
more. However, it’s important to note that there isn’t one single
“best” type of machine learning algorithm: each has its own
advantages and disadvantages. Consequently, it’s important to explore
all the different options available before making a final decision.

The early History of Machine Learning


Machine Learning has gone through many phases of development since
the inception of computers. In the following, we will take a closer look at
some of the most important events.

The early History of Machine Learning, Timeline 1943-1979

1943: The First Neural Network with Electric Circuits

The first neural network model built from electrical circuits was
proposed by Warren McCulloch and Walter Pitts in 1943. They showed
that networks of simple artificial neurons could, in principle,
compute logical functions.

This early model showed that complex computation could emerge from
simple connected units. It is important because it paved the way for
the development of machine learning.

1950: Turing Test


The Turing Test is a test of artificial intelligence proposed by
mathematician Alan Turing. It involves determining whether a machine
can act like a human, or whether a human judge can tell the
difference between answers given by a human and by a machine.

The goal of the test is to determine whether machines can behave
intelligently. It does not matter whether an answer is true or false,
only whether the questioner considers it human. There have been
several attempts to create an AI that passes the Turing Test, but no
machine has yet convincingly done so.

The Turing Test has been criticized because it measures how well a
machine can imitate a human rather than proving its true
intelligence.

1952: Computer Checkers

Arthur Samuel was a pioneer in machine learning and is credited with
creating the first computer program to play championship-level
checkers. His program, which he developed in 1952, searched the game
tree and pruned unpromising branches to estimate the chances of
winning a game. Techniques of this kind, such as minimax search with
alpha-beta pruning, are still widely used in game-playing programs
today.

1957: Frank Rosenblatt – The Perceptron

Frank Rosenblatt was a psychologist who is most famous for his work
on machine learning. In 1957, he developed the perceptron, one of the
first machine learning algorithms based on an artificial neural
network, a structure now widely used in machine learning.

It was designed to improve the accuracy of computer predictions. The
goal of the perceptron was to learn from data by adjusting its
weights until it reached a good solution, making it easier for
computers to learn from data and improving upon previous methods that
had had limited success.
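The weight-adjustment idea can be sketched in a few lines of Python. This is a generic perceptron update on a toy AND problem, not Rosenblatt's original implementation:

```python
# Minimal perceptron sketch: learn the logical AND function.
# On each mistake, weights are nudged toward the correct answer.
data = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
w = [0.0, 0.0]   # one weight per input
b = 0.0          # bias term
lr = 0.1         # learning rate

for _ in range(20):  # a few passes over the data suffice here
    for (x1, x2), target in data:
        predicted = 1 if w[0] * x1 + w[1] * x2 + b > 0 else 0
        error = target - predicted          # -1, 0, or +1
        w[0] += lr * error * x1             # perceptron update rule
        w[1] += lr * error * x2
        b += lr * error

results = [1 if w[0] * x1 + w[1] * x2 + b > 0 else 0
           for (x1, x2), _ in data]
print(results)  # [0, 0, 0, 1]
```

Because AND is linearly separable, the rule converges; for non-separable problems (like XOR) a single perceptron cannot succeed, which is a famous limitation of the model.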

Tip:

AI needs training data in order to learn how to do things on its own.
To train a machine learning algorithm, you need large quantities of
labelled data: data that has been annotated with information about
the different types of objects or events it contains. Unfortunately,
this is often difficult to come by. That’s where datasets by
clickworker come in: collections of carefully curated examples that
have been specifically prepared for use in machine learning research
or applications.

1967: The Nearest Neighbor Algorithm

The Nearest Neighbor Algorithm was developed as a way to
automatically identify patterns within large datasets. The idea is to
measure the similarity between items and assign a new item to the
category of the stored items it is closest to. This can be used for
things like finding relationships between different pieces of data or
predicting future events based on past events.

In 1967, Cover and Hart published an article on “Nearest neighbor
pattern classification.” The method classifies an input object into
the category shared by its nearest neighbors. It can handle objects
described by many attributes, both categorical and numerical, even
when their values overlap.
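As an illustrative sketch (not from the original article), nearest-neighbor classification is easy to try with scikit-learn; the toy points below form two well-separated clusters:

```python
# Nearest-neighbor classification on toy 2-D data.
from sklearn.neighbors import KNeighborsClassifier

# Two clusters of points, labeled 0 and 1.
X = [[1, 1], [1, 2], [2, 1], [8, 8], [8, 9], [9, 8]]
y = [0, 0, 0, 1, 1, 1]

# A new point gets the majority label of its 3 nearest neighbors.
model = KNeighborsClassifier(n_neighbors=3)
model.fit(X, y)

print(model.predict([[2, 2], [9, 9]]).tolist())  # [0, 1]
```

Choosing an odd `n_neighbors` avoids ties in two-class problems; larger values smooth the decision boundary at the cost of local detail.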

1974: The Backpropagation

Backpropagation was initially designed to help neural networks learn
how to recognize patterns. However, it has also been used in other
areas of machine learning, such as boosting performance and
generalizing from data sets to new instances. The goal of
backpropagation is to improve the accuracy of a model by adjusting
its weights so that it can more accurately predict future outputs.

Paul Werbos laid the foundation for this approach to machine learning in
his dissertation in 1974, which is included in the book “The Roots of
Backpropagation“.

1979: The Stanford Cart

The Stanford Cart is a mobile robot, first developed in the 1960s,
that learned to move independently through space. It reached an
important milestone in 1979, when “The Cart” succeeded for the first
time in traversing a room filled with chairs in about five hours,
avoiding the obstacles without human intervention.

The AI Winter in the History of Machine Learning


AI has seen a number of highs and lows over the years. The low point for
AI was known as the AI winter, which happened in the late 70s to the
90s. During this time, research funding dried up and many projects were
shut down due to their lack of success. It has been described as a series of
hype cycles that have led to disappointment and disillusionment among
developers, researchers, users, and media.
The Rise of Machine Learning in History
The rise of machine learning in the 21st century is a result of Moore’s Law
and its exponential growth. When computing power was becoming more
affordable, it became possible to train AI algorithms using more data,
which resulted in an increase of the accuracy and efficiency of these
algorithms.

History of Machine Learning Timeline, 1997-2017

1997: A Machine Defeats a Man in Chess

In 1997, the IBM supercomputer Deep Blue defeated chess grandmaster
Garry Kasparov in a match. It was the first time a machine had beaten
a world-class player at chess, and it caused great concern in the
chess community. This was a landmark event, as it showed that AI
systems could surpass human ability in complex tasks.

This marked a turning point in machine learning, because the world
now knew that mankind had created its own opponent: an artificial
intelligence that could learn and improve on its own.

2002: Software Library Torch

Torch is a software library for machine learning and scientific
computing. It was created in 2002 by Ronan Collobert, Samy Bengio,
and Johnny Mariéthoz as a free, open-source alternative to existing
libraries, which they felt did not meet their specific needs. It went
on to become one of the most popular machine learning libraries of
its time.

Keep in mind: Torch is no longer in active development; however,
PyTorch, which is based on the Torch library, can be used instead.

2006: Geoffrey Hinton, the father of Deep Learning

In 2006, Geoffrey Hinton, together with Simon Osindero and Yee-Whye
Teh, published “A Fast Learning Algorithm for Deep Belief Nets.” This
paper is often regarded as the birth of deep learning. It showed that
a deep belief network could be trained efficiently, layer by layer,
to recognize patterns in images.

Hinton’s paper described one of the first deep learning algorithms to
achieve strong performance on difficult and complex pattern
recognition tasks.

2011: Google Brain


Google Brain is a research group at Google devoted to artificial
intelligence and machine learning. The group was founded in 2011,
grew out of Google X, and is located in Mountain View, California.
The team works closely with other AI research groups, such as
DeepMind, which developed AlphaGo, an AI that defeated the world
champion at Go. Their goal is to build machines that can learn from
data, understand language, answer questions in natural language, and
reason with common sense.

As of 2021, the group is led by Geoffrey Hinton, Jeff Dean, and
Zoubin Ghahramani, and it focuses on deep learning, a family of
artificial neural network models capable of learning complex patterns
from data automatically, without being explicitly programmed.

2014: DeepFace

DeepFace is a deep learning face recognition system, originally
developed in 2014 at Facebook (now part of the company “Meta”). The
project received significant media attention after it achieved
near-human accuracy on the well-known “Labeled Faces in the Wild”
benchmark.

DeepFace is based on a deep neural network, which consists of many
layers of artificial neurons, with weights connecting each layer to
its neighboring ones. The algorithm takes as input a training set of
photographs, with each photo annotated with the identity of its
subject. The team has been very successful in recent years and has
published many papers on its research results, training several deep
neural networks that have achieved significant success in pattern
recognition and machine learning tasks.

Image and Face Recognition is on the rise.

2017: ImageNet Challenge – Milestone in the History of Machine
Learning

The ImageNet Challenge is a competition in computer vision that has
been running since 2010. The challenge focuses on the abilities of
programs to process patterns in images and recognize objects with
varying degrees of detail.

In 2017, a milestone was reached: 29 out of 38 teams achieved 95%
accuracy with their computer vision models. The improvement in image
recognition is immense.

Present: State-of-the-art Machine Learning


Machine learning is used in many different fields, from fashion to
agriculture. Machine Learning algorithms are able to learn patterns and
relationships between data, find predictive insights for complex problems
and extract information that is otherwise too difficult to find. Today’s
Machine Learning algorithms are able to handle large amounts of data
with accuracy in a relatively short amount of time.

ML in Robotics

Machine learning has been used in robotics for various purposes, the most
common of which are classification, clustering, regression, and anomaly
detection.

 In classification, robots are taught to distinguish between different objects
or categories.
 Clustering helps robots group similar objects together so they can be more
easily processed.
 Regression allows robots to learn how to control their movements by
predicting future values based on past data.
 Anomaly detection is used to identify unusual patterns in data so that they
can be investigated further.

One common use of machine learning in robotics is to improve the
performance of robots through experience. In this application, robots
are given a task and then allowed to learn how best to complete it by
observing the results of their own actions. This type of learning is
known as reinforcement learning.

Another use of machine learning in robotics is to help designers
create more accurate models for future robots. Data from past
experiments or simulations is used to train a machine learning
algorithm, which then helps to predict the results of future
experiments, allowing designers to make better predictions about how
their robots will behave.

Machine learning has been used in robotics for some time now to improve
the robots’ ability to interact with their environment. Robots are able to
learn how to do tasks more effectively as well as make better decisions
about what to do next. This allows the robots to be more efficient and
effective in completing their tasks.

ML in Healthcare

Despite the challenges, machine learning has already made a
significant impact in the healthcare industry. It is currently being
used to diagnose and treat diseases, identify patterns and
relationships in data, and help doctors make better decisions about
treatments for patients.

However, there is still much work to be done in order to realize the full
potential of ML in healthcare.

ML in Education
Machine learning is a process where computers are taught how to learn
from data. It can be used in a variety of ways in education, for
example to:

 Track the progress of students and track their overall understanding of the
material they are studying.
 Personalize the educational experience for each student by providing
personalized content and creating rich environments.
 Assess learners’ progress, identify their interests in order to give
appropriate support, and track learning progress to help students adjust
their course.
