Introduction To Data in Machine Learning
Consider an example:
A shopping mart owner conducts a survey and ends up with a long list of questions and answers collected from his customers. This list of questions and answers is data. Whenever he wants to infer something, he cannot go through each and every answer from thousands of customers; that would be time-consuming and unhelpful. To reduce this overhead and make the work easier, the data is manipulated through software, calculations, graphs, etc., and the inference drawn from the manipulated data is information. So data is a prerequisite for information. Knowledge, in turn, is what differentiates two individuals holding the same information: it is not technical content but is linked to the human thought process.
Different Forms of Data
Numeric Data: If a feature represents a characteristic measured in numbers, it is called a numeric feature.
Categorical Data: A categorical feature is an attribute that can take on one of a limited, and usually fixed, number of possible values on the basis of some qualitative property. A categorical feature is also called a nominal feature.
Ordinal Data: This denotes a nominal variable whose categories fall in an ordered list. Examples include clothing sizes such as small, medium, and large, or a measurement of customer satisfaction on a scale from “not at all happy” to “very happy”.
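To make the distinction concrete, here is a minimal sketch of these three forms using pandas; the column names and values are invented for illustration.
Python3
import pandas as pd

# Hypothetical survey data illustrating the three forms
df = pd.DataFrame({
    "age": [25, 34, 41],                # numeric feature
    "favorite_store": ["A", "B", "A"],  # categorical (nominal) feature
    "satisfaction": ["not at all happy", "very happy", "very happy"],
})

# Ordinal data: a nominal variable whose categories have a defined order
df["satisfaction"] = pd.Categorical(
    df["satisfaction"],
    categories=["not at all happy", "neutral", "very happy"],
    ordered=True,
)

print(df.dtypes)
print(df["satisfaction"].min())  # the ordering makes comparisons meaningful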
Properties of Data –
1. Volume: The scale of data. With the growing world population and the spread of technology, huge amounts of data are generated every millisecond.
2. Variety: The different forms of data – healthcare records, images, videos, audio clips.
3. Velocity: The rate at which data is streamed and generated.
4. Value: The meaningfulness of data in terms of the information researchers can infer from it.
5. Veracity: The certainty and correctness of the data we are working on.
Some facts about Data:
By 2020, roughly 40 zettabytes (1 ZB = 10^21 bytes) of data were expected to be generated – 300 times the amount generated in 2005.
By 2011, the healthcare sector had accumulated about 161 billion gigabytes of data.
About 200 million active users send roughly 400 million tweets per day.
Each month, users stream more than 4 billion hours of video.
Around 30 billion pieces of content are shared every month by users.
It is reported that about 27% of data is inaccurate, so 1 in 3 business leaders don’t trust the information on which they are making decisions.
The above-mentioned facts are just a glimpse of the huge volume of data that actually exists. In real-world terms, the amount of data that currently exists and is generated at every moment is beyond our ability to imagine.
Example:
Imagine you’re working for a car manufacturing company and you want to build a model that can predict the fuel efficiency of a car based on its weight and engine size. In this case, the target variable (or label) is the fuel efficiency, and the features (or input variables) are the weight and engine size. You would collect data from different car models, with the corresponding weight, engine size, and fuel efficiency, so the labeled data takes the form (weight, engine size, fuel efficiency) for each car. Once your data is ready, you split it into two sets: a training set used to train the model and a testing set used to evaluate its performance. Preprocessing may be needed, for example to fill missing values or handle outliers that might affect model accuracy.
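Below is a minimal sketch of this workflow using scikit-learn; the car measurements are made-up numbers, and LinearRegression stands in for whichever model you would actually choose.
Python3
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

# Made-up samples: (weight in kg, engine size in litres) -> fuel efficiency (km/l)
X = np.array([[1200, 1.4], [1500, 2.0], [1800, 2.5],
              [1100, 1.2], [1600, 2.2], [2000, 3.0]])
y = np.array([18.0, 14.5, 12.0, 19.5, 13.8, 10.5])

# Split the labeled data into a training set and a testing set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42)

# Train on the training set, evaluate on the held-out testing set
model = LinearRegression()
model.fit(X_train, y_train)
print(model.score(X_test, y_test))   # R^2 score on unseen cars
print(model.predict([[1400, 1.8]]))  # predicted fuel efficiency
For the preprocessing step mentioned above, utilities such as sklearn.impute.SimpleImputer can fill missing values before training.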
Implementation:
Example: 1
Python3
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy labeled data: two numeric features per sample, binary class labels
X = np.array([[1, 2], [2, 3], [3, 4], [5, 6], [6, 7], [7, 8]])
y = np.array([0, 0, 0, 1, 1, 1])

# Train a model
model = LogisticRegression()
model.fit(X, y)

# Make a prediction for a new sample
prediction = model.predict([[6, 7]])[0]
print(prediction)
Output:
1
Running the code above prints the model’s prediction, which will be either 0 or 1 depending on the parameters learned during training. With the toy data used here, the model learns that samples with larger feature values are more likely to have the label 1, so the prediction for [6, 7] is 1.
A Brief History of Machine Learning
The first neural network implemented with an electrical circuit was developed by Warren McCulloch and Walter Pitts in 1943. Their model showed that networks of simple artificial neurons could compute logical functions, demonstrating in principle how a machine might carry out reasoning-like operations. This event is important because it paved the way for machine learning development.
The Turing Test has been criticized because it measures how well a machine can imitate a human rather than proving its true intelligence.
Frank Rosenblatt was a psychologist most famous for his work on machine learning. In 1957, he developed the perceptron, one of the first learning algorithms based on artificial neural networks, which are now widely used in machine learning.
Tip:
Paul Werbos laid the foundation for backpropagation in his 1974 dissertation, which is included in the book “The Roots of Backpropagation”.
AI has seen a number of highs and lows over the years. The low point for AI was known as the AI winter, which lasted from the late 1970s into the 1990s. During this time, research funding dried up and many projects were shut down due to their lack of success. The period has been described as a series of hype cycles that led to disappointment and disillusionment among developers, researchers, users, and the media.
The Rise of Machine Learning in History
The rise of machine learning in the 21st century is a result of Moore’s Law and the exponential growth in computing power it describes. As computing power became more affordable, it became possible to train AI algorithms on more data, which increased the accuracy and efficiency of these algorithms.
Torch is a software library for machine learning and data science. It was created in 2002 by researchers at the IDIAP Research Institute in Switzerland as an alternative to existing libraries, which they believed did not meet their specific needs. It became one of the most popular machine learning libraries available and later served as the basis for PyTorch.
Hinton’s 2006 paper on deep belief networks described one of the first deep learning algorithms that could be trained efficiently and achieve strong performance on difficult and complex pattern recognition tasks.
As of 2021, Google’s deep learning research group, whose prominent figures include Geoffrey Hinton, Jeff Dean, and Zoubin Ghahramani, focuses on deep learning: a family of artificial neural network models capable of learning complex patterns from data automatically, without being explicitly programmed.
2014: DeepFace
In 2014, Facebook introduced DeepFace, a deep learning facial recognition system that approached human-level accuracy at identifying faces.
ML in Robotics
Machine learning has been used in robotics for various purposes, most commonly classification, clustering, regression, and anomaly detection.
It has been used for some time now to improve robots’ ability to interact with their environment: robots can learn to perform tasks more effectively and make better decisions about what to do next, allowing them to complete their tasks more efficiently.
ML in Healthcare
Machine learning is already used in healthcare for tasks such as analyzing medical images, supporting diagnosis, and predicting patient outcomes. However, there is still much work to be done in order to realize the full potential of ML in healthcare.
ML in Education
Machine learning is a process where computers are taught how to learn from data. This can be used in a variety of ways, one of which is in education. For example, machine learning can be used to:
Track the progress of students and their overall understanding of the material they are studying.
Personalize the educational experience for each student by providing tailored content and creating rich learning environments.
Assess learners’ progress and identify their interests in order to give appropriate support and help students adjust their course.