R LabManual 6-8 Pgms

6. Machine Learning Using R:

The term “Machine Learning” was first coined by Arthur Samuel in 1959.
Machine learning can be defined as the field that gives computers the
ability to learn without being explicitly programmed. Later, in 1997,
Tom Mitchell defined machine learning more formally: “A computer
program is said to learn from experience E with respect to some task T
and some performance measure P, if its performance on T, as measured
by P, improves with experience E”. Machine learning is considered one
of the most interesting fields of computer science.

How Does Machine Learning Work?

1. Clean the data obtained from the dataset
2. Select a proper algorithm for building a prediction model
3. Train the model to learn the patterns in the data
4. Predict results on new data with high accuracy
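The four steps above can be sketched in a few lines of R. This is a minimal illustration, not part of the original manual: it uses the built-in iris dataset and the rpart package (shipped with standard R distributions) to train a simple classification model; the 80/20 split and the seed are arbitrary choices for the example.

```r
# Step 1: clean the data (here: drop any rows with missing values)
data(iris)
clean_iris <- na.omit(iris)

# Step 2: select a proper algorithm: a decision tree from the rpart package
library(rpart)

# Hold out some data so we can test the model on unseen inputs
set.seed(42)
train_idx <- sample(nrow(clean_iris), 0.8 * nrow(clean_iris))
train <- clean_iris[train_idx, ]
test  <- clean_iris[-train_idx, ]

# Step 3: train the model to learn the patterns in the data
model <- rpart(Species ~ ., data = train, method = "class")

# Step 4: predict results on unseen data and measure accuracy
pred <- predict(model, test, type = "class")
accuracy <- mean(pred == test$Species)
print(accuracy)
```

The same four-step shape (clean, choose, train, predict) carries over to any other algorithm or dataset.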

Classification Of Machine Learning

Machine learning implementations are classified into 3 major categories,
depending on the nature of learning.
1. Supervised Learning: As the name itself suggests, supervised
learning happens under supervision. In short, in supervised
learning we teach the machine using labeled data, which
already contains the correct answer. The supervised algorithm
analyses this training data and learns to produce the correct
output for new, unseen data. For example, if we create a
dataset of fruits and label every fruit that has a round shape
with a dip on top and is red in color as an apple, then when we
ask the machine to identify the apple in a basket of fruits it will
use the previous labeling and identify an apple. Supervised
Learning is classified into two categories as below:
 Classification: A classification problem is one where the
output variable is a category, such as “Red” or “Orange”
or “countable” or “not countable”.
 Regression: Regression is used when the output
variable is a real value, such as “rupees” or “height”.
2. Unsupervised Learning: Unsupervised learning is the training
of machines using information that is not labeled, and it works
without any guidance. Here the main task of the machine is to
separate the data using similarities, differences, and patterns,
without any prior supervision; the machine has to find the
hidden structure in the unlabeled data by itself. For example, if
we provide a group of images of cats and dogs that it has never
seen before, the machine will separate the cats from the dogs
according to their features. When we later provide new pictures
of dogs and cats, it will classify them according to the grouping
it has made. Unsupervised Learning is classified into two
categories as below:
 Clustering: A clustering problem is where the machine
identifies the inherent groupings in the data, such as
grouping customers according to their visits to a shop.
 Association: An association problem is where we find
relations between events or items, such as people who
buy item A also tending to buy item B.
3. Reinforcement Learning: Reinforcement learning is all about
taking suitable actions to maximize reward in a particular
situation. The agent learns the best possible path to solve a
problem in a specific situation by trial and error. The difference
between reinforcement learning and supervised learning is that
in supervised learning the data carries the correct answer,
which the model uses to find the answer, whereas in
reinforcement learning the agent itself decides what to do to
perform the given task. For example, while traveling from one
place to another we always consider the shortest and best path
possible to reach the destination. Some main points in
reinforcement learning:
 Input: The input should be the initial state from which
the model actually starts.
 Output: There are multiple possible outputs to any problem.
 Training: The training is dependent on the input; the
model returns a state and the user decides whether to
reward or discard the model based on its output.
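As a quick sketch of the unsupervised category above: base R's kmeans() groups the iris measurements into clusters without ever seeing the species labels. The two feature columns and the cluster count of 3 are choices made for this illustration, not prescribed by the text.

```r
data(iris)
# Use only the numeric measurements: no labels are given to the algorithm
features <- iris[, c("Petal.Length", "Petal.Width")]

set.seed(42)
km <- kmeans(features, centers = 3)

# The machine has grouped the data by similarity on its own;
# compare the discovered clusters with the hidden species labels
print(table(km$cluster, iris$Species))
```

The cross-tabulation typically shows each discovered cluster aligning closely with one species, even though the species column was never used.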
The R language was originally developed by statisticians to help other
statisticians and developers work with data faster and more efficiently.
Since machine learning is essentially about working with large amounts
of data and statistics as a part of data science, the use of the R
language is always recommended. The R language is therefore becoming
handy for those working with machine learning, making tasks easier,
faster, and more innovative. Here are some top advantages of the R
language for implementing machine learning algorithms in R programming.

Advantages to Implement Machine Learning Using R Language

 It provides good explanatory code. For example, if you are at
the early stage of a machine learning project and you need to
explain the work you do, it is easier to do so in the R language
than in Python, as R provides proper statistical methods for
working with data in fewer lines of code.
 The R language is perfect for data visualization, and it provides
good facilities for prototyping machine learning models.
 The R language has excellent tools and library packages for
machine learning projects. Developers can use these packages
in the pre-modeling, modeling, and post-modeling stages of a
machine learning project. In many statistical areas the packages
for R are also more advanced and extensive than those for
Python, which makes R a strong choice for machine learning
projects.

Popular R Language Packages Used to Implement Machine Learning

 lattice: The lattice package supports the creation of graphs
displaying a variable, or the relation between multiple
variables, with conditioning.
 DataExplorer: This R package focuses on automating data
visualization and data handling so that the user can pay
attention to the data insights of the project.
 DALEX (Descriptive Machine Learning Explanations): This
package helps to provide various explanations for the relation
between the input variables and the output. It helps in
understanding complex machine learning models.
 dplyr: This R package is used to summarize the tabular data of
machine learning with rows and columns. It applies the “split-
apply-combine” approach.
 Esquisse: This R package is used to explore data quickly to
see the information it holds. It also allows you to plot bar
graphs, histograms, curves, and scatter plots.
 caret: This R package attempts to streamline the process for
creating predictive models.
 janitor: This R package has functions for examining and
cleaning dirty data. It is built to be user-friendly for beginners
and intermediate users.
 rpart: This R package helps to create the classification and
regression models using two-stage procedures. The resulting
models are represented as binary trees.
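To make the “split-apply-combine” approach mentioned for dplyr concrete, here is the same idea expressed with base R's aggregate() (dplyr's group_by() and summarise() express this more fluently, but the base version keeps the sketch dependency-free):

```r
data(iris)
# Split the rows by Species, apply mean() to each group's Sepal.Length,
# and combine the per-group results into one summary table
summary_tbl <- aggregate(Sepal.Length ~ Species, data = iris, FUN = mean)
print(summary_tbl)
```

Each row of the result is one group (a species) with its summarized value, which is exactly the split-apply-combine pattern dplyr streamlines.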

Application Of R in Machine Learning

There are many top companies, like Google, Facebook, and Uber, using
the R language for machine learning applications. The applications are:
 Social Network Analytics
 To analyze trends and patterns
 Getting insights for behaviour of users
 To find the relationships between the users
 Developing analytical solutions
 Accessing charting components
 Embedding interactive visual graphics
Example of Machine Learning Problems

 Web search like Siri, Alexa, Google, Cortana: Recognize the
user’s voice and fulfill the request made
 Social Media Service: Help people to connect all over the
world and also show recommendations of people we may know
 Online Customer Support: Provide high convenience for
customers and efficiency for support agents
 Intelligent Gaming: Use highly responsive and adaptive
non-player characters with human-like intelligence
 Product Recommendation: A software tool used to
recommend products that you might like to purchase or
engage with
 Virtual Personal Assistant: Software that can perform
tasks according to the instructions provided
 Traffic Alerts: Help to switch the traffic alerts according to the
situation at hand
 Online Fraud Detection: Check for unusual actions
performed by the user and detect fraud
 Healthcare: Machine learning can manage amounts of data
beyond the imagination of a normal human being and help
identify a patient’s illness from the symptoms
 Real-world example: When you search for some kind of
cooking recipe on YouTube, you will see recommendations
below with the title “You May Also Like This”. This is a common
use of machine learning.

Types of Machine Learning Problems

 Regression: The regression technique helps the machine
learning approach predict continuous values. For example,
the price of a house.
 Classification: The input is divided into one or more classes or
categories, and the learner produces a model that assigns
unseen inputs to these classes. For example, in the case of
email fraud, we can divide the emails into two classes, i.e.
“spam” and “not spam”.
 Clustering: This technique finds groups of similar entities in
the data. For example, we can group patients in a hospital by
their readings.
 Association: This technique finds co-occurring events or items.
For example, market-basket analysis.
 Anomaly Detection: This technique works by discovering
abnormal cases or behavior. For example, credit card fraud
detection.
 Sequence Mining: This technique predicts the next event in a
stream. For example, a click-stream event.
 Recommendation: This technique recommends items. For
example, songs or movies according to the celebrities in them.
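The regression type listed above can be sketched in one call to base R's lm(). The built-in mtcars dataset stands in for the house-price example: here we predict a continuous value (fuel consumption, mpg) from a car's weight (wt, in thousands of pounds).

```r
data(mtcars)
# Fit a linear regression: predict mpg (a continuous value) from weight
fit <- lm(mpg ~ wt, data = mtcars)

# The learned coefficients: an intercept and a slope for wt
print(coef(fit))

# Predict a continuous value for an unseen input (a 3000-lb car)
new_car <- data.frame(wt = 3.0)
print(predict(fit, new_car))
```

The negative slope captures the expected pattern that heavier cars tend to have lower fuel economy; the same recipe applies to any continuous target, such as a house price.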
7. Data Visualization Using Tableau:
Why Tableau for Data Visualization?
There are many tools you can choose from, like Excel, and Power BI (or even
Python’s data visualization packages), to create charts and dashboards for your
data. But Tableau stands out from the crowd as the preferred software that data
analysts and BI analysts use in their day-to-day workflow.

So, why choose Tableau for data visualization? Here are just some of the
reasons:

 Tableau Public is free to use and perfect for beginners in data
visualization
 It has a user-friendly interface that is easy to navigate
 You can quickly integrate it with other software or programming languages
like Python and SQL for more advanced analyses

There are many other benefits, including the fact that it is an extremely intuitive
platform able to recognize data types and make suggestions based on the inputs.
For example, it can offer appropriate chart types automatically. You then have
the option to customize these further, and thus minimize workload while
maximizing efficiency.

What Are the Types of Charts for Data Visualization in Tableau?

No data is the same, meaning each project will have its own unique
requirements. To get the most out of your work, you need to understand what
type of visualization will work best for every individual dataset. In this section,
we’ll outline some of the more common chart types you’ll use when visualizing
data in Tableau.

Bar Chart

Bar charts are one of the most popular ways to visualize your data, not just in
Tableau. You’d be hard-pressed not to find one during business meetings, science
seminars, or even news broadcasts. The reason is that bar charts are among the
clearest and most straightforward visual representations of information. As such,
they are also easily interpreted by an audience, regardless of their technical
qualifications or background.
It is highly likely that you will use this particular form of visualization regularly on
the job. Therefore, mastering the bar chart is one of the most fundamental skills
you can gain. Luckily, Tableau is quite intuitive (as we previously said) and it will
guide you through the process.

Pie Chart

Pie charts are another popular visualization type that represents your variables
as parts of a circle. Keep in mind, though, that they are best suited for displaying
data that sums up to 100%.

While they’re very common, we don’t recommend using pie charts. The reason
for this is the human eye is just not great at determining the size of non-
rectangular-shaped objects – it is often very difficult for the audience to visually
compare the different slices and order them from largest to smallest. If you can,
simply use a bar chart instead.

Now, although pie charts are usually not the best way to visualize data, being
able to create one is a must if you’re working as a data analyst, Tableau
developer, or in another data-visualization-related profession.

Line Chart

A line chart is exactly what it says on the tin – lines running across the graph that
represent the data at hand. More precisely, they show the evolution of one or
several quantities and their behavior over time. If your task is to follow a value’s
timeline or identify trends in a dataset, then this is the chart for you.

In data science terms, we can safely say that a line chart is most often used to
represent time series data. A great example is the visualization of financial
information – you’d be able to trace how stock market returns changed over a
specific timeframe.
Histogram

First and foremost, you shouldn’t confuse this with a bar chart – the 2 types
serve different purposes. In essence, a histogram shows the distribution of a
numeric variable. The range of values is split into intervals, represented by
different bars, more commonly known as bins.

Meanwhile, the height of these bins shows the number of observations within an
interval. Depending on the data, the histogram might skew to the left, the right, or
peak at the center.

It may sound complicated, but this is one of the most fundamental and useful
tools for understanding numerical data.

Scatter Plot

A scatter plot shows the relationship between 2 numerical features. The
observations are displayed as points on the graph where the “X coordinate” is
one of the variables, while the “Y coordinate” is the other.

This chart is extremely useful and has one big advantage: scatter plots, unlike
other visualizations, are able to display a large number of points. Whereas some
graphs are limited to clearly showing just a few elements, in a scatter you can
easily display a few hundred. Most notably, each observation is a distinct data
point on that scatter.
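This section builds its charts in Tableau's drag-and-drop interface, but for reference, all five chart types discussed above can also be produced in a few lines of base R. This sketch uses made-up fruit counts and the built-in iris data, and writes to a file name of our choosing ("charts.png") so it runs non-interactively:

```r
# Render the five chart types discussed above into one image file
png("charts.png", width = 900, height = 600)
par(mfrow = c(2, 3))

counts <- c(apples = 10, oranges = 6, pears = 4)
barplot(counts, main = "Bar chart")
pie(counts, main = "Pie chart")

set.seed(1)
plot(1:12, cumsum(rnorm(12, mean = 1)), type = "l",
     main = "Line chart (value over time)")

hist(rnorm(500), breaks = 20, main = "Histogram (binned distribution)")

plot(iris$Sepal.Length, iris$Petal.Length,
     main = "Scatter plot (2 numeric features)")

dev.off()
```

Note how the histogram and bar chart calls differ: barplot() draws pre-counted categories, while hist() bins a numeric variable itself, which mirrors the distinction the text draws between the two.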

8. R-Hadoop Integration For Analytics:
Hadoop is an open-source framework that was introduced by the
ASF (Apache Software Foundation). Hadoop is the most crucial
framework for coping with Big Data. Hadoop is written in Java,
and it is not based on OLAP (Online Analytical Processing).
The best part of this big data framework is that it is scalable and
can be deployed for data of various kinds: structured,
unstructured, and semi-structured. Hadoop is a middleware tool
that provides a platform for managing a large and complex cluster
of computers. Although Java is the main programming language for
Hadoop, other languages such as R, Python, or Ruby can be used
too.

The Hadoop framework includes:
 Hadoop Distributed File System (HDFS) – a robust
distributed file system. Hadoop also has a framework for job
scheduling and cluster resource management named YARN.
 Hadoop MapReduce – a system for the parallel processing
of large data sets that implements the MapReduce model of
distributed programming.
Hadoop provides easy distributed storage through HDFS and an
analysis system through MapReduce. It has a well-designed
architecture to scale the cluster up or down, as per the
requirements of the user, from one to hundreds or thousands of
computers, with a high degree of fault tolerance. Hadoop has
proved its worth and set standards in big data processing and
efficient storage management; it provides practically unlimited
scalability and is supported by major vendors in the software
industry.
Integration of Hadoop and R
As we know, data is precious and matters most to an organization,
and it is no exaggeration to say that data is its most valuable
asset. In order to deal with this huge volume of structured and
unstructured data we need an effective tool that can perform the
data analysis, and we get this tool by merging the features of
the R language and the Hadoop framework for big data analysis;
this merger increases R’s scalability. Hence, we need to integrate
the two before we can find better insights and results from data.
Soon we’ll go through the various methodologies that help to
integrate these two.
R is an open-source programming language that is extensively
used for statistical and graphical analysis. R supports a large
variety of statistical and mathematical libraries (for linear and
nonlinear modeling, classical statistical tests, time-series analysis,
data classification, data clustering, etc.) and graphical techniques
for processing data efficiently.
One major quality of R is that it produces well-designed, quality
plots with great ease, including mathematical symbols and
formulae where needed. If you need strong data-analytics and
visualization features, combining the R language with Hadoop in
your task will reduce its complexity. R is also a highly extensible
object-oriented programming language with strong graphical
capabilities.

Some reasons for which R is considered the best fit for data
analytics :
 A robust collection of packages
 Powerful data visualization techniques
 Commendable Statistical and graphical programming
features
 Object-oriented programming language
 A wide collection of operators for calculations on arrays,
matrices, etc.
 Graphical representation capability on display or on hard
copy.
The Main Motive behind R and Hadoop Integration :
Without doubt, R is the most-picked programming language for
statistical computing, graphical analysis of data, data analytics, and
data visualization. On the other hand, Hadoop is a powerful big data
framework that is capable of dealing with large amounts of data. In
all the processing and analysis of data, Hadoop’s distributed file
system (HDFS) plays a vital role. Hadoop applies the map-reduce
processing approach during data processing (provided by the rmr
package of RHadoop), which makes the data analysis process more
efficient and easier.
What would happen if the two collaborated with each other?
Obviously, the efficiency of the data management and analysis
process would increase many times over. So, in order to have
an efficient process of data analytics and visualization, we have
to combine R with Hadoop. After joining these two technologies,
R’s statistical computing power increases, and we are able to:
 Use Hadoop to execute R code.
 Use R to access the data stored in Hadoop.
Several ways in which one can integrate R and Hadoop:
The most popular and frequently picked methods are shown below;
there are some others (RODBC/RJDBC) that could be used but are
not as popular as the methods below. The general architecture of
analytics tools integrated with Hadoop is shown below along with
its different layers, as follows.
The first layer: the hardware layer. It consists of a cluster of
computer systems.
The second layer: the middleware layer of Hadoop. This layer
takes care of distributing the files flawlessly through HDFS and
of the MapReduce jobs.
The third layer: the interface layer that provides the interface
for the analysis of data. At this level, we can use an effective
tool like Pig, which provides a high-level platform for creating
MapReduce programs using a language called Pig Latin. We can
also use Hive, a data warehouse infrastructure developed by
Apache and built on top of Hadoop. Hive provides a number of
facilities for running complex queries and helps to analyze the
data using an SQL-like language called HiveQL; it also extends
support for implementing MapReduce tasks.
Besides Hive and Pig, we can use the Rhipe or RHadoop
libraries, which build an interface between Hadoop and R,
enabling users to access data from the Hadoop file system and
write their own scripts to implement the Map and Reduce jobs;
alternatively, we can use Hadoop streaming, a technology used
to integrate with Hadoop.
a) RHadoop: The RHadoop method includes four packages; two of
them are as follows:
 The rmr package – provides Hadoop MapReduce
functionality in R. The R programmer only has to divide the
logic of the application into map and reduce phases and
submit it with the rmr methods. The rmr package then makes
a call to Hadoop streaming and the MapReduce API, passing
job parameters such as the input directory, output directory,
mapper, reducer, and so on, to perform the R MapReduce job
over the Hadoop cluster (most of the components are similar
to Hadoop streaming).
 The rhbase package – allows R developers to connect
Hadoop HBase to R using the Thrift server. It also offers
functionality to read, write, and modify tables stored in
HBase from R.
A script that utilizes the RHadoop functionality looks like the
example shown below.
library(rmr)

map <- function(k, v) { ... }
reduce <- function(k, vv) { ... }

mapreduce(
  input = "data.txt",
  output = "output",
  textinputformat = rawtextinputformat,
  map = map,
  reduce = reduce
)
