Big Data Assignment 2
HDFS:-
The Hadoop Distributed File System (HDFS) is the primary data storage
system used by Hadoop applications. It employs a NameNode and
DataNode architecture to implement a distributed file system that provides
high-performance access to data across highly scalable Hadoop clusters.
HDFS is a key part of many Hadoop ecosystem technologies, as it
provides a reliable means for managing pools of big data and for supporting
related big data analytics applications.
3.4. Replication
Data replication is one of the most important and unique features of Hadoop
HDFS. In HDFS, data is replicated to protect against data loss under unfavorable
conditions such as the crash of a node, hardware failure, and so on, by copying the
blocks of each file across a number of machines in the cluster. HDFS maintains this
replication at regular intervals, creating replicas of user data on different machines
in the cluster. Hence, whenever any machine in the cluster crashes, the user can still
access their data from other machines that hold replicas of the blocks of that data,
so the risk of losing user data is minimal.
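For illustration, the cluster-wide default replication factor is controlled by the dfs.replication property in hdfs-site.xml, and the replication of an individual file can be changed from the command line with setrep; the values and path below are only a sketch, not part of this cluster's actual configuration:

<property>
  <name>dfs.replication</name>
  <value>3</value>
</property>

hdfs dfs -setrep -w 2 /user/hadoop/sample.txt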
3.5. Scalability
As HDFS stores data on multiple nodes in the cluster, we can scale the cluster when
requirements increase. There are two scalability mechanisms available: vertical
scalability – add more resources (CPU, memory, disk) to the existing nodes of
the cluster; and horizontal scalability – add more machines to the cluster.
The horizontal way is preferred, since we can scale the cluster from tens of
nodes to hundreds of nodes on the fly without any downtime.
Features of HBase
HBase is linearly scalable.
It has automatic failover support.
It provides consistent reads and writes.
It integrates with Hadoop, both as a source and a destination.
It has an easy Java API for clients.
It provides data replication across clusters.
15. What is metadata, what information does it provide, and explain the role of
the NameNode in an HDFS cluster.
Metadata means "data about data". Although the "meta" prefix means "after" or
"beyond", it is used to mean "about" in epistemology. Metadata is defined as the data
providing information about one or more aspects of the data; it is used to summarize
basic information about data which can make tracking and working with specific data
easier.
Examples of the information metadata can provide include:
· Standards used
· File size
· Data quality
Suppose a client wants to write a file into HDFS. So, the following steps will be
performed internally during the whole HDFS write process:-
The client will divide the files into blocks and will send a write request to the
NameNode.
For each block, the NameNode will provide the client a list containing the IP address
of DataNodes (depending on replication factor, 3 by default) where the data block has
to be copied eventually.
The client will copy the first block into the first DataNode and then the other copies
of the block will be replicated by the DataNodes themselves in a sequential manner.
NameNode works as Master in Hadoop cluster. Below listed are the main
function performed by NameNode:
1. Stores metadata of actual data. E.g. Filename, Path, No. of Data Blocks,
Block IDs, Block Location, No. of Replicas, Slave related configuration
2. Manages File system namespace.
3. Regulates client access requests for the actual file data.
4. Assigns work to the slaves (DataNodes).
5. Executes file system namespace operations like opening/closing files and
renaming files and directories.
6. As the NameNode keeps metadata in memory for fast retrieval, a huge
amount of memory is required for its operation. It should be hosted on
reliable hardware.
Advantages of NoSQL
Can be used as Primary or Analytic Data Source
Big Data Capability
No Single Point of Failure
Easy Replication
No Need for Separate Caching Layer
It provides fast performance and horizontal scalability.
Can handle structured, semi-structured, and unstructured data with
equal effect
Object-oriented programming which is easy to use and flexible
NoSQL databases don't need a dedicated high-performance server
Support Key Developer Languages and Platforms
Simpler to implement than an RDBMS
It can serve as the primary data source for online applications.
Handles big data which manages data velocity, variety, volume, and
complexity
Excels at distributed database and multi-data center operations
Eliminates the need for a specific caching layer to store data
Offers a flexible schema design which can easily be altered without
downtime or service disruption
Key-value pair storage databases store data as a hash table where each key
is unique, and the value can be a JSON, BLOB(Binary Large Objects), string,
etc.
For example, a key-value pair may contain a key like "Website" associated
with a value like "Guru99".
It is one of the most basic types of NoSQL databases. This kind of NoSQL
database is used as a collection, dictionaries, associative arrays, etc. Key
value stores help the developer to store schema-less data. They work best for
shopping cart contents.
Redis, Dynamo, and Riak are some examples of key-value store databases;
Dynamo and Riak are based on Amazon's Dynamo paper.
Column-based
Column-oriented databases work on columns and are based on the BigTable
paper by Google. Every column is treated separately, and the values of a single
column are stored contiguously.
Document-Oriented:
Document-Oriented NoSQL DB stores and retrieves data as a key value pair
but the value part is stored as a document. The document is stored in JSON
or XML formats. The value is understood by the DB and can be queried.
The document type is mostly used for CMS systems, blogging platforms, real-
time analytics, and e-commerce applications. It should not be used for complex
transactions that require multiple operations or queries against varying
aggregate structures.
Graph-Based
A graph type database stores entities as well the relations amongst those
entities. The entity is stored as a node with the relationship as edges. An
edge gives a relationship between nodes. Every node and edge has a unique
identifier.
Graph-based databases are mostly used for social networks, logistics, and spatial data.
i. NameNode
It is also known as the Master node. The NameNode does not store the actual
data or dataset. It stores metadata, i.e. the number of blocks, their locations, on
which rack and on which DataNode the data is stored, and other details. The file
system namespace it manages consists of files and directories.
Tasks of HDFS NameNode
Manage file system namespace.
Regulates client’s access to files.
Executes file system operations such as naming, closing, and opening files and
directories.
ii. DataNode
It is also known as the Slave node. The HDFS DataNode is responsible for storing
the actual data in HDFS and performs read and write operations as per the
requests of clients. Each block replica on a DataNode consists of 2 files on the
local file system: the first file holds the data itself and the second file records the
block's metadata, including checksums for the data. At startup, each DataNode
connects to its corresponding NameNode and performs a handshake, which
verifies the DataNode's namespace ID and software version. If a mismatch is
found, the DataNode shuts down automatically.
Tasks of HDFS DataNode
DataNode performs operations like block replica creation, deletion, and
replication according to the instruction of NameNode.
DataNode manages data storage of the system.
This was all about HDFS as a Hadoop Ecosystem component.
fsck
HDFS Command to check the health of the Hadoop file system.
ls
HDFS Command to display the list of Files and Directories in HDFS.
mkdir
HDFS Command to create the directory in HDFS.
touchz
HDFS Command to create a file in HDFS with file size 0 bytes.
du
HDFS Command to check the file size.
text
HDFS Command that takes a source file and outputs the file in text format.
copyFromLocal
HDFS Command to copy the file from a Local file system to HDFS.
copyToLocal
HDFS Command to copy the file from HDFS to Local File System.
put
HDFS Command to copy single source or multiple sources from local file system to the
destination file system.
get
HDFS Command to copy files from hdfs to the local file system.
count
HDFS Command to count the number of directories, files, and bytes under the paths that
match the specified file pattern.
rm
HDFS Command to remove the file from HDFS.
Usage: hdfs dfs -rm <path>
rm -r
HDFS Command to remove the entire directory and all of its content from HDFS.
cp
HDFS Command to copy files from source to destination. This command allows multiple
sources as well, in which case the destination must be a directory.
mv
HDFS Command to move files from source to destination. This command allows multiple
sources as well, in which case the destination needs to be a directory.
expunge
HDFS Command that makes the trash empty.
rmdir
HDFS Command to remove the directory.
usage
HDFS Command that returns the help for an individual command.
help
HDFS Command that displays help for given command or all commands if none is
specified.
Command: hdfs dfs -help
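For illustration, here is how a few of these commands might be combined in practice (the paths and file names are hypothetical):

hdfs fsck /                                   # check the health of the file system
hdfs dfs -mkdir /user/hadoop/demo             # create a directory
hdfs dfs -touchz /user/hadoop/demo/empty.txt  # create a zero-byte file
hdfs dfs -put sample.txt /user/hadoop/demo/   # copy a local file into HDFS
hdfs dfs -ls /user/hadoop/demo                # list files and directories
hdfs dfs -du -h /user/hadoop/demo             # check file sizes
hdfs dfs -count /user/hadoop/demo             # count directories, files, and bytes
hdfs dfs -get /user/hadoop/demo/sample.txt .  # copy the file back to the local file system
hdfs dfs -rm -r /user/hadoop/demo             # remove the directory and its contents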
For example:- Suppose 10 Map and 10 Reduce jobs are running with 10 +
10 slots to perform a computation. If all the Map jobs are busy with their tasks but
all the Reduce jobs are idle, we cannot use those idle Reduce slots for any other purpose.
6. Explain about the 5 V's.
Velocity
First, let's talk about velocity. Velocity refers to the speed at which
vast amounts of data are generated, collected, and analyzed. Every day
the number of emails, Twitter messages, photos, video clips, etc. increases at
lightning speed around the world. Every second of every day, data is
increasing. Not only must it be analyzed, but the speed of transmission and
access to the data must also remain near-instantaneous to allow for real-time
access to websites, credit card verification, and instant messaging. Big data
technology now allows us to analyze data while it is being generated,
without ever putting it into databases.
Volume
Volume refers to the incredible amounts of data generated each second from
social media, cell phones, cars, credit cards, M2M sensors, photographs,
video, etc. The vast amounts of data have become so large in fact that we
can no longer store and analyze data using traditional database
technology. We now use distributed systems, where parts of the data are
stored in different locations and brought together by software. On
Facebook alone, 10 billion messages are sent, the "like" button is pressed
4.5 billion times, and over 350 million new pictures are uploaded every
day. Collecting and analyzing this data is clearly an engineering challenge of
immense proportions.
Value
When we talk about value, we’re referring to the worth of the data being
extracted. Having endless amounts of data is one thing, but unless it can be
turned into value it is useless. While there is a clear link between data and
insights, this does not always mean there is value in Big Data. The most
important part of embarking on a big data initiative is to understand the costs
and benefits of collecting and analyzing the data to ensure that ultimately the
data that is reaped can be monetized.
Variety
Variety is defined as the different types of data we can now use. Data today
looks very different from data from the past. We no longer have only
structured data (name, phone number, address, financials, etc.) that fits
neatly into a data table. Much of today's data is unstructured; in fact, around 80% of all
the world's data fits into this category, including photos, video sequences,
social media updates, etc. New and innovative big data technology is now
allowing structured and unstructured data to be harvested, stored, and used
simultaneously.
Veracity
Last, but certainly not least there is veracity. Veracity is the quality or
trustworthiness of the data. Just how accurate is all this data? For example,
think about all the Twitter posts with hash tags, abbreviations, typos, etc., and
the reliability and accuracy of all that content. Gleaning loads and loads of
data is of no use if it is not accurate or trustworthy. Another
good example of this relates to the use of GPS data. Often the GPS will "drift"
off course as you pass through an urban area, because satellite signals are lost as
they bounce off tall buildings or other structures. When this happens, location
data has to be fused with another data source, like road data or data from an
accelerometer, to provide accurate data.
7. Write about challenges with big data.
It's no surprise, then, that the IDG report found, "Managing unstructured data is growing
as a challenge – rising from 31 percent in 2015 to 45 percent in 2016."
On the management and analysis side, enterprises are using tools like NoSQL
databases, Hadoop, Spark, big data analytics software, business intelligence
applications, artificial intelligence and machine learning to help them comb through their
big data stores to find the insights their companies need.
All of those goals can help organizations become more competitive — but only if they
can extract insights from their big data and then act on those insights quickly. PwC's
Global Data and Analytics Survey 2016 found, "Everyone wants decision-making to be
faster, especially in banking, insurance, and healthcare."
To achieve that speed, some organizations are looking to a new generation of ETL
and analytics tools that dramatically reduce the time it takes to generate reports. They
are investing in software with real-time analytics capabilities that allows them to
respond to developments in the marketplace immediately.
The 2017 Robert Half Technology Salary Guide reported that big data engineers were
earning between $135,000 and $196,000 on average, while data scientist salaries
ranged from $116,000 to $163,500. Even business intelligence analysts were very well
paid, making $118,000 to $138,750 per year.
In order to deal with talent shortages, organizations have a couple of options. First,
many are increasing their budgets and their recruitment and retention efforts. Second,
they are offering more training opportunities to their current staff members in an attempt
to develop the talent they need from within. Third, many organizations are looking to
technology. They are buying analytics solutions with self-service and/or machine
learning capabilities. Designed to be used by professionals without a data science
degree, these tools may help organizations achieve their big data goals even if they do
not have a lot of big data experts on staff.
5. Validating data
Closely related to the idea of data integration is the idea of data validation. Often
organizations are getting similar pieces of data from different systems, and the data in
those different systems doesn't always agree. For example, the ecommerce system
may show daily sales at a certain level while the enterprise resource planning (ERP)
system has a slightly different number. Or a hospital's electronic health record (EHR)
system may have one address for a patient, while a partner pharmacy has a different
address on record.
The process of getting those records to agree, as well as making sure the records are
accurate, usable and secure, is called data governance. And in the AtScale 2016 Big
Data Maturity Survey, the fastest-growing area of concern cited by respondents was
data governance.
However, most organizations seem to believe that their existing data security
methods are sufficient for their big data needs as well. In the IDG survey, less than half
of those surveyed (39 percent) said that they were using additional security measures for
their big data repositories or analyses. Among those who do use additional measures,
the most popular include identity and access control (59 percent), data encryption (52
percent) and data segregation (42 percent).
7. Organizational resistance
It is not only the technological aspects of big data that can be challenging — people can
be an issue too.
In the NewVantage Partners survey, 85.5 percent of those surveyed said that their firms
were committed to creating a data-driven culture, but only 37.1 percent said they had
been successful with those efforts. When asked about the impediments to that culture
shift, respondents pointed to three big obstacles within their organizations:
Insufficient organizational alignment (4.6 percent)
Lack of middle management adoption and understanding (41.0 percent)
Business resistance or lack of understanding (41.0 percent)
In order for organizations to capitalize on the opportunities offered by big data, they are
going to have to do some things differently. And that sort of change can be
tremendously difficult for large organizations.
One way to establish that sort of leadership is to appoint a chief data officer, a step that
NewVantage Partners said 55.9 percent of Fortune 1000 companies have taken. But
with or without a chief data officer, enterprises need executives, directors and
managers who are going to commit to overcoming their big data challenges if they
want to remain competitive in the increasingly data-driven economy.
2. Pseudo-distributed Mode
The pseudo-distributed mode is also known as a single-node cluster, where both the
NameNode and the DataNode reside on the same machine.
In pseudo-distributed mode, all the Hadoop daemons run on a single node.
Such a configuration is mainly used for testing, when we don't need to think about the
resources and other users sharing the resources.
In this architecture, a separate JVM is spawned for every Hadoop component, and the
components communicate with each other across network sockets, effectively producing
a fully functioning and optimized mini-cluster on a single host.
Here is the summarized view of pseudo distributed Mode-
• Single Node Hadoop deployment running on Hadoop is considered as pseudo
distributed mode
• All the master & slave daemons will be running on the same node
• Mainly used for testing purpose
• Replication Factor will be ONE for blocks
• Changes in configuration files will be required for all three files – mapred-site.xml,
core-site.xml, hdfs-site.xml (see the sketch below)
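A minimal sketch of the typical single-node settings, placed inside the <configuration> element of each file (the property names are standard Hadoop properties; the port and values shown are only illustrative):

core-site.xml:
<property>
  <name>fs.defaultFS</name>
  <value>hdfs://localhost:9000</value>
</property>

hdfs-site.xml:
<property>
  <name>dfs.replication</name>
  <value>1</value>
</property>

mapred-site.xml:
<property>
  <name>mapreduce.framework.name</name>
  <value>yarn</value>
</property>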
11. Below given is a sample data load command. We provide the file
location which can be a directory or a specific file. We select the load
function through which data is parsed from the file. PigStorage function
parses each line in the file and splits the data based on the argument
provided with the function to generate the fields. We provide the schema
(field names with data type) in the load function after the keyword 'as'.
12. modified_data = GROUP data BY <group_field>;
13. counts = FOREACH modified_data GENERATE group,
14. COUNT(data);
15. We can either dump the processed data or store it in a file based
upon the requirements. Using dump method, the processed data is
displayed on the standard output.
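For illustration, a minimal Pig Latin sketch of steps 11-15, assuming a hypothetical tab-delimited file /data/employees.txt with fields name and city:

data = LOAD '/data/employees.txt' USING PigStorage('\t') AS (name:chararray, city:chararray);
modified_data = GROUP data BY city;
counts = FOREACH modified_data GENERATE group, COUNT(data);
DUMP counts;                                               -- display on standard output
-- or: STORE counts INTO '/output/city_counts' USING PigStorage(',');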
20. In case of Java, the required data types are imported from the
respective classes and the custom function is written by extending the
respective class. In case of Jython, the script is registered using jython,
which imports the required scripts to interpret the Jython script. The output
schema for every function is specified so that Pig can parse the data. The
same goes for JavaScript. In case of Ruby, the 'pigudf' library is extended
and jruby is used to register the script. In case of a Python UDF, the Python
command line is used, and the data is streamed in and out of it to execute
the script.
23. To run the program using the script, run the following command.
The script can be stored in hdfs which can be distributed to other
machines in case the program is run in cluster mode.
24. $ pig <script-file>.pig
Complex Data Types in Hive:
ARRAY
MAP
STRUCT
UNION
i. ARRAY
An ordered collection of fields. The fields must all be of the same type.
Syntax: ARRAY<data_type>
E.g. array (1, 2)
ii. MAP
An unordered collection of key-value pairs. Keys must be primitives; values
may be any type. For a particular map, the keys must be the same type, and
the values must be the same type.
Syntax: MAP<primitive_type, data_type>
E.g. map('a', 1, 'b', 2).
iii. STRUCT
A collection of named fields. The fields may be of different types.
Syntax: STRUCT<col_name : data_type [COMMENT col_comment],…..>
E.g. struct('a', 1, 1.0), named_struct('col1', 'a', 'col2', 1, 'col3', 1.0)
iv. UNION
A value that may be one of a number of defined data. The value is tagged with
an integer (zero-indexed) representing its data type in the union.
Syntax: UNIONTYPE<data_type, data_type, …>
E.g. create_union(1, 'a', 63)
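For illustration, a hypothetical table definition that combines these complex types (the table and column names are assumptions, not taken from the assignment):

CREATE TABLE employee_complex (
  name STRING,
  phone_numbers ARRAY<STRING>,
  subject_scores MAP<STRING, INT>,
  address STRUCT<street:STRING, city:STRING, zip:INT>
);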
c. Column Data Types in Hive
Hive column data types are further divided into 6 categories:
Integral Type
Strings
Timestamp
Dates
Decimals
Union Types
Let us discuss these Hive column data types one by one.
i. Integral type
This category includes the following 4 data types:
TINYINT
SMALLINT
INT/INTEGER
BIGINT
By default, integral literals are assumed to be INT. When the data range
exceeds the range of INT, we need to use BIGINT. If the data range is smaller
than that of INT, we use SMALLINT, and TINYINT is smaller than SMALLINT.
Type / Postfix / Example
TINYINT / Y / 100Y
SMALLINT / S / 100S
BIGINT / L / 100L
ii. Strings
The string data types in Hive can be specified with either single quotes (') or
double quotes ("). Apache Hive uses C-style escaping within the strings.
Type / Length
VARCHAR / 1 to 65535
CHAR / 255
* VARCHAR
VARCHAR Hive data types are created with a length specifier (between 1 and
65535), which defines the maximum number of characters allowed in the
character string.
* CHAR
CHAR Hive data types are similar to VARCHAR, but they are fixed-length:
values shorter than the specified length are padded with spaces, and trailing
spaces are not significant during comparisons. 255 is the maximum fixed length.
iii. Timestamp
Hive supports the traditional UNIX timestamp with optional nanosecond
precision. Timestamps in text files use the format
"yyyy-mm-dd hh:mm:ss.fffffffff".
iv. Dates
DATE values are described in a particular year/month/day (YYYY-MM-DD)
format, e.g. DATE '2017-01-01'. These types don't have a time-of-day
component. This type supports a range of values from 0000-01-01 to 9999-12-31.
v. Decimals
In Hive, the DECIMAL type is similar to the Big Decimal format of Java. It
represents immutable arbitrary-precision decimal values. The syntax and an
example are given below.
In Apache Hive 0.11 and 0.12, the precision of the DECIMAL type is fixed and
limited to 38 digits.
Apache Hive 0.13 users can specify scale and precision when creating tables
with the DECIMAL data type using DECIMAL (precision, scale) syntax. If the
scale is not specified, then it defaults to 0 (no fractional digits). If no precision
is specified, then it defaults to 10.
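For illustration, a hypothetical column using the DECIMAL(precision, scale) syntax described above:

CREATE TABLE sales (item STRING, price DECIMAL(10,2));
-- price can hold up to 10 digits in total, 2 of them after the decimal point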
11. Role of driver code, mapper code, and reducer code in a MapReduce program.
Mapper code:-
We have created a class Map that extends the class Mapper, which is already
defined in the MapReduce framework.
Reducer code:-
We have created a class Reduce which extends class Reducer like that of Mapper.
We define the data types of input and output key/value pair after the class
declaration using angle brackets as done for Mapper.
Both the input and the output of the Reducer are key-value pairs.
Input:
o The key is nothing but the unique words which have been generated after
the sorting and shuffling phase: Text
o The value is a list of integers corresponding to each key: IntWritable
o Example – Bear, [1, 1], etc.
Output:
o The key is all the unique words present in the input text file: Text
o The value is the number of occurrences of each of the unique
words: IntWritable
o Example – Bear, 2; Car, 3, etc.
We have aggregated the values present in each of the list corresponding to each
key and produced the final answer.
In general, a single reducer is created for each of the unique words, but you can
specify the number of reducers in mapred-site.xml.
Driver code:-
In the driver class, we set the configuration of our MapReduce job to run in
Hadoop.
We specify the name of the job and the data types of the input/output of the mapper
and reducer.
We also specify the names of the mapper and reducer classes.
The path of the input and output folder is also specified.
The method setInputFormatClass() is used to specify how a Mapper will
read the input data, or what will be the unit of work. Here, we have chosen
TextInputFormat so that a single line is read by the mapper at a time from the input
text file.
The main () method is the entry point for the driver. In this method, we instantiate
a new Configuration object for the job.
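For illustration, here is a minimal word-count sketch showing how the mapper, reducer, and driver described above fit together; the class names and path arguments are generic, not the exact code of the assignment:

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Mapper: input key = byte offset of the line, value = the line itself;
    // output key = word, value = 1
    public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);   // emit (word, 1)
            }
        }
    }

    // Reducer: input key = word, values = list of 1s; output = (word, total count)
    public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();             // aggregate the list of counts for this word
            }
            context.write(key, new IntWritable(sum));
        }
    }

    // Driver: configures and submits the job
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");          // job name
        job.setJarByClass(WordCount.class);
        job.setMapperClass(Map.class);                          // mapper class
        job.setReducerClass(Reduce.class);                      // reducer class
        job.setOutputKeyClass(Text.class);                      // output key type
        job.setOutputValueClass(IntWritable.class);             // output value type
        job.setInputFormatClass(TextInputFormat.class);         // one line per map() call
        FileInputFormat.addInputPath(job, new Path(args[0]));   // input folder
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output folder
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}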
Machine Learning Technique #1: Regression
If you're looking for a great conversation starter at the next party you go to, you could
always start with “You know, machine learning is not so new; why, the concept
of regression was first described by Francis Galton, Charles Darwin’s half cousin, all the
way back in 1875”. Of course, it will probably be the last party you get an invite to for a
while.
But the concept is simple enough. Francis Galton was looking at the sizes of sweet peas
over many generations. We know that if you selectively breed peas for size, you can get
larger ones.
But if you let nature take its course, you see a variety of sizes. Eventually, even bigger
peas will produce smaller offspring and “regress to the mean”. Basically, there’s a typical
size for a pea and although things vary, they don’t “stay varied” (as long as you don’t
selectively breed).
The same principle applies to monkeys picking stocks. On more than one occasion there
have been stock-picking competitions (WSJ has done them, for example) where a
monkey will beat the pros. Great headline. But what happens next year or the year after
that? Chances are that monkey, which is just an entertaining way of illustrating
“random,” will not do so well. Put another way, its performance will eventually regress to
the mean.
What this means is that in this simple situation, you can predict what the next result will
be (with some kind of error). The next generation of pea will be the average size, with
some variability or uncertainty (accommodating smaller and larger peas). Of course, in
the real world things are a little more complicated than that.
In the image above, we don’t have a single mean value like pea size. We’ve got a
straight line with a slope and two values to work with, not just one. Instead of variability
around a single value, here we’ve got variability in a two-dimensional plane based on an
underlying line.
You can see all the various data points in blue, and that red line is the line that best fits
all that data. And based on that red line, you could make a prediction about what
would happen if, say, the next data point was a 70 on the X axis. (That prediction would
not be a single definitive value, but rather a projected value with some degree of
uncertainty, just like for the pea sizes we looked at earlier).
Regression algorithms are used to make predictions about numbers; with more data,
we can make similar predictions in many other situations.
Furthermore, I’ve given you a two-dimensional diagram there. If you were trying to
predict house prices, for example, you’d include many more factors than just two: size,
number of rooms, school scores, recent sales, size of garden, age of house and more.
Machine Learning Technique #2: Classification
While you kids don't know about fruit, the good news for you is that I do. You don't
have to guess (at least initially). I’m going to show you lots of pieces of fruit and tell you
what each one is. And so, like children in a preschool, you will learn how to classify fruit.
You’ll look at things like size, color, taste, firmness, smell, shape and whatever else
strikes your fancy as you attempt to figure out what it is that makes an apple, an apple,
as opposed to a banana.
Once I've gone through 70 percent to 80 percent of the basket, we can move onto the
next stage. I’ll show you a fruit (that I have already identified) and ask you “What is it?”
Based on the learning you’ve done, you should be able to classify that new fruit
correctly.
In fact, by testing you on fruit that I’ve already classified correctly, I can see how well
you’ve learned. If you do a good job, then you’re ready for some real work which in a
non-kindergarten situation, would mean deploying that trained model into production.
If, of course, the results of the test weren't good enough, that would mean the model
wasn't ready. Perhaps we need to start again with more data, better data, or a different
algorithm.
We call this approach “supervised learning” because we’re testing your ability to get the
right answers, and we have got lots of correct examples to work with since we have a
whole basket that has been correctly classified.
That idea of using part of the basket for training and the rest for testing is also
important. That's how techniques like this make sure that the training worked or,
alternatively, that the training didn't work and a new approach is needed.
Note that the basket of fruit we worked with had only four kinds of fruit: apples,
bananas, strawberries (you can't see them in the picture, but I assure you they are there)
and oranges. So, if you were presented with a cherry to classify it would be somewhat
unpredictable. It would depend what the algorithm found to be important in
differentiating the others. The point here of course, is that if you want to recognize
cherries then the model should be trained on them.
Here's an example of a chart showing a data set that has been grouped into two
different classes. We've got a few outliers in this diagram, a few colored points that are
on the wrong side of the line. I show this to emphasize the point that these algorithms
aren't magic and may not get everything right. It could also be the case that with
different approaches or algorithms, we could do a better job classifying these data
points and identifying them correctly.
Insurance companies pay out on claims and they've got a historical set of claims that
they have already classified into "good claims" and ones that need "further
investigation". Train a classification algorithm on all those old claims, and perhaps you
can do a better job of spotting dubious claims when they come in.
In all these cases, it’s important to have lots of data available to train on. The more data
you have, the better the training (more accurate, wider range of situations etc.). One of
the reasons (of course there are others) for building a data lake is to have easy access to
more data for machine learning algorithms.
Machine Learning Technique #3: Clustering
Alert readers should have noticed that this is the same bowl of fruit used in the
classification example. Yes, this was done on purpose. Same fruit, but a different
approach.
This time we’re going to do clustering, which is an example of unsupervised learning.
You're back in preschool and the same teacher is standing in front of you with the same
basket of fruit.
But this time, as I hand the stuff out, I'm not going to tell you "This is a banana." Instead
I'm effectively going to say, “Do these things have any kind of natural grouping?”
(Which is a complex concept for a pre-schooler, but work with me for a moment).
You’ll look at them and their various characteristics, and you might end up with several
piles of fruit that look like “squidgy red things”, “curved yellow things”, “small green
things” and “larger red or green things”.
To clarify, what you did (in your role as preschoolers/machine learning algorithm) is
group the fruits in that way. What the teacher (or the human supervising the machine
learning process) did was to come up with meaningful names for those different piles.
This is likely the same process used to do the customer segmentation mentioned in the
previous blog. Having found logical groupings of customers, somebody came up with a
shorthand way to name or describe each grouping.
Here's a real-world cluster diagram. With these data points you can see five separate
clusters. Those little arrows represent part of the process of calculating the clusters and
their boundaries: basically, pick arbitrary centers, calculate which points belong to which
cluster, then move each center to the actual center of its cluster, and repeat until the
movements of the centers are sufficiently small.
This approach is very common for customer segmentation. You could evaluate credit
risk, or even things like the similarity between written documents. Basically, if you look
at a mass of data and don’t know how to logically group it, then clustering is a good
place to start.
Machine Learning Technique #4: Anomaly Detection
Sometimes you're not trying to group like things together. Maybe you don't much care
about all the things that blend in with the flock. What you’re looking for is something
unusual, something different, something that stands out in some way.
This approach is called anomaly detection. You can use this to find things that are
different, even if you can’t say up front how they are different. It’s fairly easy to spot the
outliers here, but in the real world, those outliers might be harder to find.
One health provider used anomaly detection to look at claims for medical services and
found a dentist billing at the extraordinarily high rate of 85 fillings per hour. That's 42
seconds per patient to get the numbing shot, drill the bad stuff out and put the filling
in.
Clearly that's suspicious and needs further investigation. Just by looking at masses of
data (and there were millions of records) it would not have been obvious that you were
looking for something like that.
Of course, it might also throw up the fact that one doctor only ever billed on
Thursdays. Anomalous, yes. Relevant, probably not. Anomaly detection can throw up
the outliers for you to evaluate and decide whether they need further investigation.
Finding a dentist billing for too much work is a relatively simple anomaly. If you knew to
look at billing rates (which will not always be the case), you could find this kind of issue
using other techniques. But anomaly detection could also apply to more complex
scenarios. Perhaps you are responsible for some mechanical equipment where things
like pressure, flow rate and temperature are normally in sync with each other: one goes
up, they all go up; one goes down, they all go down. Anomaly detection could identify
the situation where two of those variables go up and the other one goes down. That
would be really hard to spot with any other technique.
Azure Machine Learning
What is it? Azure Machine Learning is a cloud service that you can use to develop and deploy
machine-learning models. You can track your models as you build, train, scale, and
manage them by using the Python SDK. Deploy models as containers and run them
in the cloud, on-premises, or on Azure IoT Edge.
How to use or run it: As a Python SDK and in the Azure CLI. Activate the conda
environment AzureML on the Windows edition or py36 on the Linux edition.
Link to samples: Sample Jupyter notebooks are included in the AzureML directory under notebooks.

H2O
What is it? An open-source AI platform that supports in-memory, distributed, fast, and
scalable machine learning.
Supported versions: Linux
How to use or run it: Connect to the VM by using X2Go. Start a new terminal, and run
java -jar /dsvm/tools/h2o/current/h2o.jar. Then start a web browser and connect
to http://localhost:54321.
Link to samples: Samples are available on the VM in Jupyter under the h2o directory.
Related tools: Apache Spark, MXNet, XGBoost, Sparkling Water, Deep Water

Rattle
How to use or run it: As a UI tool. On Windows, start a command prompt, run R, and then
inside R, run rattle(). On Linux, connect with X2Go, start a terminal, run R, and then inside
R, run rattle().
Link to samples: Rattle

Vowpal Wabbit

Weka
What is it? A collection of machine-learning algorithms for data-mining tasks. The algorithms can
be either applied directly to a data set or called from your own Java code. Weka
contains tools for data pre-processing, classification, regression, clustering, association
rules, and visualization.
How to use or run it: On Windows, search for Weka on the Start menu. On Linux, sign in with
X2Go, and then go to Applications > Development > Weka.

XGBoost
What is it? A fast, portable, and distributed gradient-boosting (GBDT, GBRT, or GBM)
library for Python, R, Java, Scala, C++, and more. It runs on a single machine, and
on Apache Hadoop and Spark.
How to use or run it: As a Python library (2.7 and 3.5), an R package, and an on-path
command-line tool (C:\dsvm\tools\xgboost\bin\xgboost.exe for Windows
and /dsvm/tools/xgboost/xgboost for Linux).
Market Basket Analysis is a modelling technique based upon the theory that if
you buy a certain group of items, you are more (or less) likely to buy another
group of items. For example, if you are in an English pub and you buy a pint
of beer and don't buy a bar meal, you are more likely to buy crisps (US. chips)
at the same time than somebody who didn't buy beer.
A major difficulty is that a large number of the rules found may be trivial for
anyone familiar with the business. Although the volume of data has been
reduced, we are still asking the user to find a needle in a haystack. Requiring
rules to have a high minimum support level and a high confidence level risks
missing any exploitable result we might have found. One partial solution to
this problem is differential market basket analysis, as described below.
How is it used?
As a first step, therefore, market basket analysis can be used in deciding the
location and promotion of goods inside a store. If, as has been observed,
purchasers of Barbie dolls are more likely to buy candy, then high-margin
candy can be placed near the Barbie doll display. Customers who
would have bought candy with their Barbie dolls had they thought of it will now
be suitably tempted.
But this is only the first level of analysis. Differential market basket
analysis can find interesting results and can also eliminate the problem of a
potentially high volume of trivial results.
If we observe that a rule holds in one store, but not in any other (or does not
hold in one store, but holds in all others), then we know that there is
something interesting about that store. Perhaps its clientele are different, or
perhaps it has organized its displays in a novel and more lucrative way.
Investigating such differences may yield useful insights which will improve
company sales.
Note that despite the terminology, there is no requirement for all the items to
be purchased at the same time. The algorithms can be adapted to look at a
sequence of purchases (or events) spread out over time. A predictive market
basket analysis can be used to identify sets of item purchases (or events) that
generally occur in sequence — something of interest to direct marketers,
criminologists and many others.
5. RDBMS to HBase
Once the MySQL service is started, enter the MySQL shell using the below command in the
terminal.
Login to the MySQL shell:
Login to MySQL shell:
mysql -u root -p
Password: cloudera
In the above command, -u represents the user name and -p the password. Here the
username is root and the password to the MySQL shell is cloudera.
Show databases:
show databases;
As we have mentioned earlier, we will be using the emp database in our example, which is
already available in the MySQL DB.
Use database emp:
Follow the below code to use database emp;
Use emp;
Show tables:
Let us use show tables command to list the tables which are present in the database emp.
Show tables;
We can observe that our example table employee is present in the database emp.
Describe table:
We can use below command to describe employee table schema.
Describe employee;
The DESCRIBE TABLE command lists the following information about each column:
Column name
Type schema
Type name
Length
Scale
Nulls (yes/no)
Display the contents of the table employee:
We can use below command to display all the columns present in the table employee.
select * from employee;
MySQL privileges are critical to the utility of the system, as they allow each of the users to
access and utilize only the areas needed to perform their work functions. This is meant to
prevent a user from accidentally accessing an area where he or she should not have access.
Additionally, this adds to the security of the MySQL server. When you connect to a MySQL
server, the host from which we connect and the user name we specify determine our
identity. With this information, the server then grants privileges based upon this identity.
The above step finishes the MySQL part.
Now, we need to create a new table in HBase to import the table contents from the MySQL
database. So, follow the below steps to import the contents from MySQL to HBase.
Enter Hbase shell:
Use below command to enter HBase shell.
hbase shell
Create table:
We can use create command to create a table in Hbase.
create 'Academp','emp_details'
We can observe that we have created a new table in HBase with the name Academp and
the column family emp_details.
Scan table:
We can use the scan command to see a table's contents in HBase.
scan ‘Academp’
We can observe that no contents are available yet in the table Academp.
Sqoop import command:
Now use the below command to import the MySQL employee table into the HBase Academp table.
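A typical Sqoop-to-HBase import for this example might look like the following sketch; the row-key column id is an assumption about the employee table's schema:

sqoop import --connect jdbc:mysql://localhost/emp --username root --password cloudera \
  --table employee --hbase-table Academp --column-family emp_details \
  --hbase-row-key id -m 1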
scan 'Academp'
Scanning the table again, we can observe that we have successfully imported the contents
of the MySQL table into the HBase table using Sqoop.
6. RDBMS to Hive
Now, we will discuss how we can efficiently import data from MySQL to Hive using Sqoop.
As before, the DESCRIBE TABLE command for the MySQL table Company1 lists the following information about each column:
Column name
Type schema
Type name
Length
Scale
Nulls (Yes/No)
Displaying the Table Contents
We can use the following command to display all the columns present in the
table Company1.
select * from Company1;
MySQL privileges are critical to the utility of the system as they allow each of their users to
access and utilize only those areas that are needed to perform their work functions. This is
meant to prevent a user from accidentally accessing an area which they should not have
access to.
Additionally, this adds to the security of the MySQL server.
Whenever someone connects to a MySQL server, their identities are determined by the host
used to connect them and the user name specified. With this information, the server grants
privileges based upon the identity determined.
The above step finishes the MySQL part.
Now, let us open a new terminal and enter Sqoop commands to import data from
MySQL to Hive table.
I. A Sqoop command is used to transfer selected columns from MySQL to Hive.
Now, use the following command to import selected columns from the
MySQL Company1 table to the Hive Company1Hive table.
sqoop import --connect jdbc:mysql://localhost:3306/db1 --username root --split-by
EmpId --columns EmpId,EmpName,City --table company1 --target-dir /myhive --hive-
import --create-hive-table --hive-table default.Company1Hive -m 1
The above Sqoop command will create a new table with the
name Company1Hive in the Hive default database and transfer the values of the 3
mentioned columns (EmpId, EmpName, and City) from the MySQL table Company1 to the
Hive table Company1Hive.
Displaying the Contents of the Table Company1Hive
Now, let us see the transferred contents in the table Company1Hive.
select * from Company1Hive;
II. Sqoop command for transferring a complete table data from MySQL to Hive.
In the previous example, we transferred only the 3 selected columns from the MySQL
table Company1 to the Hive default database table Company1Hive.
Now, let us go ahead and transfer the complete table from the table Company1 to a
new Hive table by following the command given here:
sqoop import --connect jdbc:mysql://localhost:3306/db1 --username root --table
Company1 --target-dir /myhive --hive-import --create-hive-table --hive-table
default.Company2Hive -m 1
The above Sqoop command will create a new table with the
name Company2Hive in the Hive default database and will transfer all the data
from the MySQL table Company1 to the Hive table Company2Hive.
Now, let us see the transferred contents in the table Company2Hive.
select * from Company2Hive;
We can observe that we have successfully transferred the table contents from MySQL
to a Hive table using Sqoop.
Next, we will do the reverse job, i.e., we will export table contents from the Hive table
to the MySQL table.
III. Export command for transferring the selected columns from Hive to MySQL.
In this example we will transfer the selected columns from Hive to MySQL. For this,
we need to create a table before transferring the data from Hive to the MySQL
database. We should follow the command given below to create a new table.
create table Company2(EmpId int, EmpName varchar(20), City varchar(15));
The above command creates a new table named Company2 in the MySQL database
with three columns: EmpId, EmpName, and City.
Let us use the select statement to see the contents of the table Company2.
Select * from Company2;
We can observe that the table is currently empty.
Let us use the Sqoop command to load the data from Hive into MySQL.
sqoop export --connect jdbc:mysql://localhost/db1 --username root -P --columns
EmpId,EmpName,City --table Company2 --export-dir
/user/hive/warehouse/company2hive --input-fields-terminated-by '\001' -m 1
The Sqoop command given above will transfer the 3 mentioned column (EmpId,
EmpName, and City) values from the Hive table Company2Hive to the MySQL
table Company2.
Displaying the Contents of the Table Company2
Now, let us see the transferred contents in the table Company2.
select * from Company2;
We can observe that the data has now been successfully transferred
from Hive to MySQL.
IV. Export command for transferring the complete table data from Hive to
MySQL.
Now, let us transfer this complete table from the Hive table Company2Hive to a
MySQL table by following the command given below:
create table Company2Mysql(EmpId int, EmpName varchar(20), Designation
varchar(15), DOJ varchar(15), City varchar(15), Country varchar(15));
Let us use the select statement to see the contents of the table Company2Mysql.
select * from Company2Mysql;
We can observe that the table is currently empty. Let us use a Sqoop command to load
the data from Hive into MySQL.
sqoop export --connect jdbc:mysql://localhost/db1 --username root -P --table
Company2Mysql --export-dir /user/hive/warehouse/company2hive --input-fields-
terminated-by '\001' -m 1
The above given Sqoop command will transfer the complete data from the Hive
table Company2Hive to the MySQL table Company2Mysql.
Displaying the Contents of the Table Company2Mysql
Now, let us see the transferred contents in the table Company2Mysql.
select * from Company2Mysql;
We can see that we have successfully exported the table contents from Hive to MySQL
using Sqoop. We can follow the above steps to transfer data between Apache Hive and
structured databases.
Market basket analysis is identifying items in a supermarket which customers are more likely to buy
together.
e.g., Customers who bought pampers also bought beer
This is important for supermarkets so that they can arrange their items in a consumer-convenient
manner, as well as come up with promotions that take item affinity into consideration.
Frequent item set mining is a sub-area of data mining that focuses on identifying frequently co-occurring
items. Once the frequent item sets are ready, we can come up with rules to derive associations
between items.
e.g., Frequent item set = {pampers, beer, milk}, association rule = {pampers, milk ---> beer}
There are two possible popular approaches for frequent item set mining and association rule
learning as given below:
Apriori algorithm
FP-Growth algorithm
To explain the above algorithms, let us consider an example with 4 customers making 4 transactions in
a supermarket, involving 7 items in total, as given below:
Transaction 1: Jana’s purchase: egg, beer, pampers, milk
Transaction 2: Abi’s purchase: carrot, milk, pampers, beer
Transaction 3: Mahesha’s purchase: perfume, tissues, carrot
Transaction 4: Jayani’s purchase: perfume, pampers, beer
Item index
1: egg, 2: beer, 3: pampers, 4: carrot, 5: milk, 6: perfume, 7: tissues
The Apriori algorithm identifies frequent item sets by starting with individual items and extending each
item set by one item at a time. This is known as the candidate generation step.
The algorithm relies on the property that any subset of a frequent item set is also frequent.
Transaction: Items
1: 1, 2, 3, 5
2: 4, 5, 3, 2
3: 6, 7, 4
4: 6, 3, 2
Minimum Support
Minimum support is used to prune the associations that are less frequent.
itemset: support
1: 0.25: eliminated
2: 0.75
3: 0.75
4: 0.5
5: 0.5
6: 0.5
7: 0.25: eliminated
remaining items: 2, 3, 4, 5, 6
itemset: support
2, 3: 0.75
2, 4: 0.25: eliminated
2, 5: 0.5
2, 6: 0.25: eliminated
3, 4: 0.25: eliminated
3, 5: 0.5
3, 6: 0.25: eliminated
4, 5: 0.25: eliminated
4, 6: 0.25: eliminated
5, 6: 0.0: eliminated
2, 3, 5: 0.5
So {beer, pampers, milk} survives as a frequent item set, and a rule such as
{pampers, milk ---> beer} has confidence support({2,3,5}) / support({3,5}) = 0.5 / 0.5 = 1.0.
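For illustration, a minimal Java sketch of these first two candidate-generation passes over the example transactions, using a minimum support of 0.5 (the class name and output format are my own, not part of the original walkthrough):

import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

public class AprioriSupport {
    public static void main(String[] args) {
        // The four transactions from the example, using the item index above
        List<Set<Integer>> transactions = List.of(
                Set.of(1, 2, 3, 5),   // Jana
                Set.of(4, 5, 3, 2),   // Abi
                Set.of(6, 7, 4),      // Mahesha
                Set.of(6, 3, 2));     // Jayani
        double minSupport = 0.5;
        int n = transactions.size();

        // Pass 1: support of individual items
        Map<Integer, Integer> itemCounts = new TreeMap<>();
        for (Set<Integer> t : transactions)
            for (int item : t)
                itemCounts.merge(item, 1, Integer::sum);

        List<Integer> frequentItems = new ArrayList<>();
        for (Map.Entry<Integer, Integer> e : itemCounts.entrySet()) {
            double support = (double) e.getValue() / n;
            System.out.println(e.getKey() + ": " + support
                    + (support < minSupport ? ": eliminated" : ""));
            if (support >= minSupport) frequentItems.add(e.getKey());
        }

        // Pass 2: support of candidate pairs built only from the frequent items
        for (int i = 0; i < frequentItems.size(); i++) {
            for (int j = i + 1; j < frequentItems.size(); j++) {
                int a = frequentItems.get(i);
                int b = frequentItems.get(j);
                int count = 0;
                for (Set<Integer> t : transactions)
                    if (t.contains(a) && t.contains(b)) count++;
                double support = (double) count / n;
                System.out.println(a + ", " + b + ": " + support
                        + (support < minSupport ? ": eliminated" : ""));
            }
        }
        // A further pass would extend the surviving pairs to triples such as {2, 3, 5}.
    }
}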