BDA Notes Unit-2
UNIT – II Notes
Big Data Technologies: Hadoop’s Parallel World – Data Discovery – Open Source Technology for Big Data Analytics – Cloud and Big Data – Predictive Analytics – Mobile Business Intelligence and Big Data
Big Data Technologies: Hadoop’s Parallel World
Brief History of Hadoop
There are many Big Data technologies that have been making an impact on the
new technology stacks for handling Big Data, but Apache Hadoop is one
technology that has been the darling of Big Data talk.
Hadoop is an open-source platform for storage and processing of diverse data
types that enables data-driven enterprises to rapidly derive the complete value
from all their data.
The original creators of Hadoop are Doug Cutting (formerly at Yahoo!, now at Cloudera) and Mike Cafarella.
Doug and Mike were building a project called “Nutch” with the goal of
creating a large Web index.
They saw the MapReduce and GFS papers from Google, which were obviously
super relevant to the problem Nutch was trying to solve.
Hadoop gives organizations the flexibility to ask questions across their
structured and unstructured data that were previously impossible to ask or
solve:
The scale and variety of data have permanently overwhelmed the ability to cost-
effectively extract value using traditional platforms.
The scalability and elasticity of free, open-source Hadoop running on
standard hardware allow organizations to hold onto more data than ever before,
at a transformationally lower TCO than proprietary solutions and thereby take
advantage of all their data to increase operational efficiency and gain a
competitive edge.
At one-tenth the cost of traditional solutions, Hadoop excels at supporting
complex analyses— including detailed, special-purpose computation—across
large collections of data.
Hadoop workloads
Hadoop handles a variety of workloads, including search, log processing,
recommendation systems, data warehousing, and video/image analysis.
Today’s explosion of data types and volumes means that Big Data equals big opportunities, and Apache Hadoop empowers organizations to build the most modern scale-out architectures on a clean-sheet data framework, without vendor lock-in.
Apache Hadoop is an open-source project administered by the Apache
Software Foundation.
The software was originally developed by the world’s largest Internet companies to capture and analyze the data that they generate.
Unlike traditional, structured platforms, Hadoop is able to store any kind
of data in its native format and to perform a wide variety of analyses and
transformations on that data.
Hadoop stores terabytes, and even petabytes, of data inexpensively. It is robust and reliable and handles hardware and system failures.
Hadoop runs on clusters of commodity servers and each of those servers
has local CPUs and disk storage that can be leveraged by the system.
Features of Hadoop
Both HDFS and MapReduce are designed to continue to work in the face
of system failures.
Hadoop Common: Includes the common utilities that support the other Hadoop modules.
Hadoop YARN: This technology is used for job scheduling and efficient management of cluster resources.
Apache Ambari: A tool for provisioning, managing, and monitoring Hadoop clusters. Apache Ambari supports HDFS and MapReduce programs.
Apache Spark: A highly agile, scalable, and secure Big Data compute engine, versatile enough to support a wide variety of applications such as real-time processing, machine learning, ETL, and so on.
Hive: A data warehouse tool used for querying, analyzing, and summarizing data on top of the Hadoop framework.
Sqoop: A framework for transferring data from relational databases into Hadoop, driven from a command-line interface.
Data Discovery
Data discovery is the term used to describe the new wave of business intelligence that enables users to explore data, make discoveries, and uncover insights in a dynamic and intuitive way, versus predefined queries and preconfigured drill-down dashboards.
Tableau Software and QlikTech are two business intelligence tools used for this kind of reporting.
Analytics and reporting are produced by the people using the
results. IT provides the infrastructure, but business people create
their own reports and dashboards.
The Tableau team offers a simple example of this kind of powerful visualization.
Example of Tableau Software
• A company uses an interactive dashboard to track the critical metrics
driving their business.
• Every day, the CEO and other executives are plugged in real-time to see
how their markets are performing in terms of sales and profit, what the
service quality scores look like against advertising investments, and
how products are performing in terms of revenue and profit.
• Interactivity is key: a click on any filter lets the executive look into
specific markets or products.
• She can click on any data point in any one view to show the related data
in the other views.
• Hovering over a data point reveals any unusual pattern or outlier by showing details on demand.
• Or she can click through to the underlying information in a split second.
“Business intelligence needs to work the way people’s minds work. Users need to navigate and interact with data any way they want to — asking and answering questions on their own and in big groups or teams.”
One capability that we have all become accustomed to is search, what many people refer to as “Googling.” This is a prime example of the way people’s minds work. QlikTech has designed a way for users to leverage direct — and indirect — search.
With QlikView search, users type relevant words or phrases
in any order and get instant, associative results.
With a global search bar, users can search across the entire
data set. With search boxes on individual list boxes, users can
confine the search to just that field.
Open Source Technology for Big Data Analytics
Proprietary Software:
• We have to pay to get this software and its commercial support is available
for maintenance.
The company gives users a valid, authenticated license to use the software, but the license also places some restrictions on how it can be used.
1. Hadoop
Even if you are a beginner in this field, we are sure that this is not the first time you’ve read
about Hadoop. It is recognized as one of the most popular big data tools for analyzing large data sets, as the platform can distribute data and processing across many servers. Another benefit of using Hadoop is that it can also run on a cloud infrastructure.
This open-source software framework is used when the data volume exceeds the available
memory. This big data tool is also ideal for data exploration, filtration, sampling, and
summarization. It consists of four parts:
Hadoop Distributed File System: This file system, commonly known as HDFS, is a distributed file system designed for very high aggregate bandwidth across the cluster.
MapReduce: A programming model for processing big data (a minimal word-count sketch follows this list).
YARN: The platform that manages and schedules all of Hadoop’s resources in its infrastructure.
Libraries: They allow other modules to work efficiently with Hadoop.
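To make the MapReduce idea concrete, here is a minimal word-count sketch written for Hadoop Streaming in Python. It is an illustrative example, not part of the official Hadoop distribution; the script name and the way it is wired into a streaming job are assumptions.

# word_count_streaming.py -- illustrative Hadoop Streaming word count (not from these notes)
# In practice the mapper and reducer are passed to the streaming jar, e.g.:
#   hadoop jar hadoop-streaming.jar -mapper "python word_count_streaming.py map" \
#       -reducer "python word_count_streaming.py reduce" -input <in> -output <out>
import sys
from itertools import groupby

def mapper(lines):
    # Map step: emit one tab-separated (word, 1) pair per word.
    for line in lines:
        for word in line.strip().split():
            print(f"{word}\t1")

def reducer(lines):
    # Reduce step: Hadoop delivers mapper output sorted by key, so group and sum.
    pairs = (line.strip().split("\t") for line in lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        print(f"{word}\t{sum(int(count) for _, count in group)}")

if __name__ == "__main__":
    role = sys.argv[1] if len(sys.argv) > 1 else "map"
    (mapper if role == "map" else reducer)(sys.stdin)

The mapper emits (word, 1) pairs, Hadoop sorts them by key, and the reducer sums the counts per word; this split-sort-aggregate pattern underlies most MapReduce jobs.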
2. Apache Spark
The next big buzz in the industry among big data tools is Apache Spark. The reason is that this open-source big data tool fills the gaps Hadoop leaves when it comes to data processing.
This big data tool is the most preferred tool for data analysis over other types of programs due
to its ability to store large computations in memory. It can run complicated algorithms, which
is a prerequisite for dealing with large data sets.
Proficient in handling batch and real-time data, Apache Spark is flexible to work with HDFS
and OpenStack Swift or Apache Cassandra. Often used as an alternative to MapReduce, Spark
can run tasks 100x faster than Hadoop’s MapReduce.
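As a rough illustration of why Spark is convenient for in-memory processing, the following PySpark sketch counts words in a text file. The input path and application name are placeholders (assumptions), and a local Spark installation is assumed.

# spark_wordcount.py -- a minimal PySpark sketch; the HDFS path is a placeholder
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("wordcount-sketch").getOrCreate()

# Read lines, split into words, and count them in memory across the cluster.
lines = spark.read.text("hdfs:///data/sample.txt").rdd.map(lambda row: row[0])
counts = (lines.flatMap(lambda line: line.split())   # one record per word
               .map(lambda word: (word, 1))          # pair each word with a count of 1
               .reduceByKey(lambda a, b: a + b))     # sum the counts per word

for word, count in counts.take(10):                  # pull a small sample back to the driver
    print(word, count)

spark.stop()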
3. Cassandra
Apache Cassandra is one of the best big data tools for processing structured data sets. Open-sourced in 2008 and now maintained by the Apache Software Foundation, it is recognized as one of the best open-source big data tools for scalability. This big data tool has proven fault tolerance on cloud infrastructure and commodity hardware, making it well suited for big data use cases.
It also offers capabilities that few other relational or NoSQL databases match. These include simple operations, availability across cloud regions, strong performance, and continuous availability as a data source, to name a few. Apache Cassandra is used by giants like Twitter, Cisco, and Netflix.
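A minimal sketch of talking to Cassandra from Python with the DataStax driver (cassandra-driver) is shown below; the contact point, keyspace, and table are made-up examples, not anything referenced in these notes.

# cassandra_sketch.py -- minimal DataStax Python driver example (pip install cassandra-driver);
# the contact point, keyspace, and table names below are made-up assumptions.
import uuid
from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])          # contact point(s) of the Cassandra cluster
session = cluster.connect()

session.execute("CREATE KEYSPACE IF NOT EXISTS demo "
                "WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}")
session.set_keyspace("demo")
session.execute("CREATE TABLE IF NOT EXISTS tweets (id uuid PRIMARY KEY, author text, body text)")

# Writes and reads use CQL, Cassandra's SQL-like query language.
session.execute("INSERT INTO tweets (id, author, body) VALUES (%s, %s, %s)",
                (uuid.uuid4(), "alice", "hello cassandra"))
for row in session.execute("SELECT author, body FROM tweets LIMIT 5"):
    print(row.author, row.body)

cluster.shutdown()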
4. MongoDB
Thanks to its ability to store data as documents, MongoDB is very flexible and easily adopted by companies. It can store any data type, be it integers, strings, Booleans, arrays, or objects. MongoDB is easy to learn and provides support for multiple technologies and platforms.
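The document model is easiest to see in code. The following PyMongo sketch stores and queries a document that mixes integers, strings, Booleans, arrays, and sub-objects; the connection string, database, and collection names are assumptions.

# mongo_sketch.py -- minimal PyMongo example (pip install pymongo);
# the connection string, database, and collection names are assumptions.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
db = client["retail"]

# A single document can mix integers, strings, Booleans, arrays, and nested objects.
db.orders.insert_one({
    "order_id": 1001,
    "customer": "alice",
    "paid": True,
    "items": [{"sku": "A-17", "qty": 2}, {"sku": "B-03", "qty": 1}],
})

# Query documents by a field nested inside the array.
for order in db.orders.find({"items.sku": "A-17"}):
    print(order["order_id"], order["customer"])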
5. Apache Hive
Hive is an open-source big data software tool. It allows programmers to analyze large data sets on Hadoop and helps with querying and managing large datasets quickly.
Features:
It supports an SQL-like query language for interaction and data modeling
It compiles queries into two main kinds of tasks: map and reduce
It allows these tasks to be defined using Java or Python
Hive is designed for managing and querying structured data only
Hive’s SQL-inspired language shields the user from the complexity of MapReduce programming
It offers a Java Database Connectivity (JDBC) interface (a small Python sketch follows this list).
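A small sketch of querying Hive from Python through the PyHive library is shown below; the host, port, table, and column names are assumptions, and Hive itself turns the SQL-like query into MapReduce (or Tez/Spark) jobs behind the scenes.

# hive_sketch.py -- minimal PyHive example (pip install pyhive); host, port, table,
# and column names are assumptions, not objects defined in these notes.
from pyhive import hive

conn = hive.Connection(host="localhost", port=10000, database="default")
cursor = conn.cursor()

# HiveQL looks like SQL; Hive compiles it into MapReduce (or Tez/Spark) jobs.
cursor.execute(
    "SELECT page, COUNT(*) AS hits FROM web_logs GROUP BY page ORDER BY hits DESC LIMIT 10"
)
for page, hits in cursor.fetchall():
    print(page, hits)

conn.close()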
6. Kaggle
Kaggle is an online platform and community for data science competitions and shared datasets: organizations post problems and data, and data scientists compete to build the best models (see the Crowdsourcing Analytics section later in this unit).
7. Apache HBase
Apache HBase is a free, open-source, distributed, column-oriented NoSQL database that runs on top of HDFS and is modeled after Google’s Bigtable. It provides real-time, random read and write access to very large tables.
Being open source, robust, and scalable, it is preferred by medium and large-scale organizations, and it continues to serve data even if individual nodes of the cluster die.
9. Apache Pig
Apache Pig is a high-level data flow platform for executing MapReduce programs of
Hadoop. The language used for Pig is Pig Latin.
The Pig scripts get internally converted to Map Reduce jobs and get executed on data
stored in HDFS. Apart from that, Pig can also execute its job in Apache Tez or Apache
Spark.
Pig can handle any type of data, i.e., structured, semi-structured, or unstructured, and stores the corresponding results in the Hadoop Distributed File System. Every task that can be achieved using Pig can also be achieved using Java in MapReduce.
10. Apache Flink
Apache Flink can ingest massive streaming data (up to several terabytes) from different sources and process it in a distributed fashion across multiple nodes, before pushing the derived streams to other services or applications such as Apache Kafka, databases, and Elasticsearch. Simply put, the basic building blocks of a Flink pipeline are input, processing, and output. Its runtime supports low-latency processing at extremely high throughput in a fault-tolerant manner. Flink enables real-time analytics on streaming data and fits well for continuous extract-transform-load (ETL) pipelines over streams and for event-driven applications.
The open-source projects are managed and supported by commercial companies, such
as Cloudera, that provide extra capabilities, training, and professional services that support
open-source projects such as Hadoop.
This is similar to what Red Hat has done for the open-source project Linux.
The advantages of the open-source stack are flexibility, extensibility, and lower cost.
“One of the great benefits of open source lies in the flexibility of the adoption model: you download and deploy it when you need it.” With open source, you can try it and adopt it at your own pace.
Cloud and Big Data
The cloud provides resources on demand, whether storage, computing, or networking. It follows a pay-per-usage model: you pay only for the amount of resources you use.
Cloud computing charges you only for the computing resources you actually use. For example, if you want to give a demo to a client on a cluster of more than 100 machines and you do not have that many machines currently available, cloud computing plays a very important role (a rough cost sketch follows below).
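A back-of-the-envelope calculation makes the pay-per-usage point: rent the cluster only for the demo and pay only for those machine-hours. The hourly rate below is an assumed, made-up figure, not a real provider’s price.

# cloud_cost_sketch.py -- rough pay-per-usage arithmetic; the hourly rate is an
# assumed illustrative figure, not a real provider's price.
machines = 100                 # cluster size rented only for the demo
hours = 3                      # how long the demo cluster runs
rate_per_machine_hour = 0.10   # assumed cost in dollars per machine-hour

cost = machines * hours * rate_per_machine_hour
print(f"Demo cluster cost: ${cost:.2f}")   # 100 * 3 * 0.10 = $30.00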
Cloud plays an important role within the Big Data world, by providing horizontally expandable
and optimized infrastructure that supports practical implementation of Big Data.
In cloud computing, all data is gathered in data centers and then distributed to the end-users.
Further, automatic backups and recovery of data is also ensured for business continuity, all
such resources are available in the cloud.
We do not know the exact physical location of the resources provided to us. We only need thin clients such as desktops, laptops, or phones and a network connection.
a. Scalability
Resources can be scaled out as data volumes and workloads grow, without buying hardware up front.
b. Elasticity
Customers use and pay for only as many resources as they actually consume.
In cloud computing, elasticity is defined as the degree to which a system is able to adapt to
workload changes in an autonomic manner, so that at any time the available resources match
the current demand as closely as possible.
c. Resource Pooling
The same resources can be shared by multiple organizations. Computing resources are pooled to serve multiple consumers via a multi-tenant model, with different resources dynamically assigned and reassigned according to consumer demand.
d. Self service
Customers are provided an easy-to-use interface through which they can choose the services they want. A consumer can unilaterally provision computing capabilities, such as server time and network storage, as needed without requiring human interaction.
e. Low Costs
You are charged only for the computing resources you use, and you do not need to buy expensive infrastructure. Pricing is usage-based, on a utility computing model, and fewer IT skills are required for implementation.
f. Fault Tolerance
If a server or service fails, workloads are shifted to the remaining healthy resources so that service continues with little or no interruption.
Public cloud – A cloud is called a “public cloud” when the services are open over
a network for public use.
Private Cloud – Private cloud is operated solely for a single organization,
whether managed internally or by a third-party, and hosted either internally or
externally.
Examples of IaaS are virtual machines, load balancers, and network attached
storage.
Examples of PaaS are Windows Azure and Google App Engine (GAE).
For SaaS to work, the infrastructure (IaaS) and the platform (PaaS) must be in
place.
IAAS in a public cloud: Using a cloud provider’s infrastructure for Big Data
services, gives access to almost limitless storage and compute power.
IaaS can be utilized by enterprise customers to create cost-effective and easily
scalable IT solutions where cloud providers bear the complexities and expenses
of managing the underlying hardware.
PAAS in a private cloud: PaaS vendors are beginning to incorporate Big Data technologies such as Hadoop and MapReduce into their PaaS offerings, which eliminates the need to deal with the complexities of managing individual software and hardware elements.
For example, web developers can use individual PaaS environments at every
stage of development, testing and ultimately hosting their websites. However,
businesses that are developing their own internal software can also utilize
Platform as a Service, particularly to create distinct ring-fenced development and
testing environments.
SAAS in a hybrid cloud: Many organizations feel the need to analyze the
customer’s voice, especially on social media. SaaS vendors provide the platform
for the analysis as well as the social media data.
Office software is the best example of businesses utilizing SaaS. Tasks related to accounting, sales, invoicing, and planning can all be performed through SaaS.
Businesses may wish to use one piece of software that performs all of these tasks
or several that each performs different tasks.
The software can be subscribed to over the internet and then accessed online via any computer in the office using a username and password. If needed, businesses can switch to software that fulfills their requirements better.
In SaaS, Google provides a suite that includes Google Docs, Gmail, Google Calendar, and Picasa.
IBM provides LotusLive iNotes, a web-based email service for messaging and
calendaring capabilities to business users.
Zoho provides online products similar to Microsoft office suite.
Predictive Analytics
Predictive analytics can be deployed across various industries for different business problems. Below are a few industry use cases to illustrate how predictive analytics can inform decision-making within real-world situations.
Banking: Financial services use machine learning and quantitative tools to predict credit
risk and detect fraud. Predictive analytics allows them to support dynamic market
changes in real time in addition to static market constraints. This use of technology allows them both to customize personal services for clients and to minimize risk.
Healthcare: Predictive analytics in health care is used to detect and manage the care of
chronically ill patients, as well as to track specific infections such as sepsis. Geisinger
Health used predictive analytics to mine health records to learn more about how sepsis is
diagnosed and treated. Geisinger created a predictive model based on health records for
more than 10,000 patients who had been diagnosed with sepsis in the past. The model
yielded impressive results, correctly predicting patients with a high rate of survival.
Human resources (HR): HR teams use predictive analytics and employee survey metrics
to match prospective job applicants, reduce employee turnover and increase employee
engagement. This combination of quantitative and qualitative data allows businesses to
reduce their recruiting costs and increase employee satisfaction, which is particularly
useful when labor markets are volatile.
Marketing and sales: While marketing and sales teams are very familiar with business
intelligence reports to understand historical sales performance, predictive analytics
enables companies to be more proactive in the way that they engage with their clients
across the customer lifecycle. For example, churn predictions can enable sales teams to
identify dissatisfied clients sooner, enabling them to initiate conversations to promote
retention. Marketing teams can leverage predictive data analysis for cross-sell strategies,
and this commonly manifests itself through a recommendation engine on a brand’s
website.
Supply chain: Businesses commonly use predictive analytics to manage product inventory and set pricing strategies. This type of predictive analysis helps companies anticipate demand and avoid both shortages and excess stock.
Fraud Detection
Financial services can use predictive analytics to examine transactions, trends, and
patterns. If any of this activity appears irregular, an institution can investigate it for
fraudulent activity. This may be done by analyzing activity between bank accounts
or analyzing when certain transactions occur.
Credit
Credit scoring makes extensive use of predictive analytics: information about a borrower and about similar borrowers is used to estimate the likelihood of future default before credit is extended.
Underwriting
Insurers use predictive analytics to assess the risk an applicant presents and to set premiums accordingly, based on the past behavior of similar policyholders.
Marketing
Individuals who work in this field look at how consumers have reacted to the overall economy when planning a new campaign. They can use shifts in demographics to determine whether the current mix of products will entice consumers to make a purchase.
Supply Chain
Supply chain analytics is used to predict and manage inventory levels and
pricing strategies. Supply chain predictive analytics use historical data and
statistical models to forecast future supply chain performance, demand, and
potential disruptions.
Human Resources
HR teams apply predictive models to employee and applicant data to forecast turnover, match candidates to roles, and plan future staffing needs.
Decision Trees
If you want to understand what leads to someone's decisions, then you may find
decision trees useful. This type of model places data into different sections based
on certain variables, such as price or market capitalization. Just as the name implies,
it looks like a tree with individual branches and leaves. Branches indicate the
choices available while individual leaves represent a particular decision.
Decision trees are the simplest models because they're easy to understand and
dissect. They're also very useful when you need to make a decision in a short period
of time.
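A minimal decision-tree sketch using scikit-learn is shown below; the features (price, market capitalization) and the buy/no-buy labels are invented purely to illustrate how branches and leaves are produced.

# decision_tree_sketch.py -- scikit-learn decision tree on invented data
from sklearn.tree import DecisionTreeClassifier, export_text

# Each row: [price, market_capitalization]; label 1 = the consumer buys, 0 = does not.
X = [[10, 200], [15, 50], [40, 300], [45, 80], [70, 500], [80, 120]]
y = [1, 1, 1, 0, 0, 0]

model = DecisionTreeClassifier(max_depth=2, random_state=0)
model.fit(X, y)

# The printed tree shows the branches (choices) and leaves (final decisions).
print(export_text(model, feature_names=["price", "market_cap"]))
print("new case ->", model.predict([[20, 150]]))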
Regression
This is the model that is used the most in statistical analysis. Use it when you want to determine patterns in large sets of data and when there is a roughly linear relationship between the inputs and the outcome. This method works by figuring out a formula that represents the relationship between all the inputs found in the dataset. For example, you can use regression to figure out how price and other key factors shape the performance of a security.
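The following scikit-learn sketch fits a simple linear regression on made-up numbers; the inputs (advertising spend, price) and outputs (units sold) are assumptions used only to show how the fitted formula relates inputs to an outcome.

# regression_sketch.py -- scikit-learn linear regression on made-up numbers
import numpy as np
from sklearn.linear_model import LinearRegression

# Inputs: advertising spend and price; output: units sold (all invented values).
X = np.array([[10, 5.0], [20, 4.5], [30, 4.0], [40, 3.5], [50, 3.0]])
y = np.array([100, 150, 210, 260, 320])

model = LinearRegression().fit(X, y)
print("coefficients:", model.coef_)                 # contribution of each input
print("intercept:", model.intercept_)
print("prediction:", model.predict([[35, 3.8]]))    # units sold for a new scenario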
Neural Networks
Neural networks model complex, non-linear relationships between inputs and outputs, and are useful when the patterns in the data are too intricate for decision trees or simple regression to capture.
Cluster Models
Clustering describes the method of aggregating data that share similar attributes.
Consider a large online retailer like Amazon.
Amazon can cluster sales based on the quantity purchased or it can cluster sales
based on the average account age of its consumer. By separating data into similar
groups based on shared features, analysts may be able to identify other
characteristics that define future activity.
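A minimal clustering sketch with scikit-learn’s k-means is given below; the two features (quantity purchased, account age in months) echo the Amazon example, but the numbers are invented.

# clustering_sketch.py -- scikit-learn k-means on invented sales records
import numpy as np
from sklearn.cluster import KMeans

# Each row: [quantity purchased, account age in months] for one sale.
sales = np.array([
    [1, 2], [2, 3], [1, 1],         # small orders from new accounts
    [8, 40], [9, 36], [10, 45],     # large orders from long-standing accounts
])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(sales)
print("cluster labels:", kmeans.labels_)            # which group each sale falls into
print("cluster centers:", kmeans.cluster_centers_)  # the "typical" sale in each group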
Time Series Models
Sometimes, data relates to time, and specific predictive analytics rely on the relationship between what happens and when. These types of models assess inputs at
specific frequencies such as daily, weekly, or monthly iterations. Then, analytical
models seek seasonality, trends, or behavioral patterns based on timing. This type
of predictive model can be useful to predict when peak customer service periods are
needed or when specific sales will be made.
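A small pandas sketch shows the kind of time-based aggregation these models start from; the daily ticket counts are made-up values used only to illustrate weekly totals and a rolling trend.

# time_series_sketch.py -- pandas aggregation over daily data (values invented)
import pandas as pd

days = pd.date_range("2024-01-01", periods=14, freq="D")
tickets = pd.Series([30, 28, 55, 60, 33, 20, 18, 31, 29, 58, 62, 35, 22, 19], index=days)

weekly_total = tickets.resample("W").sum()       # aggregate to a weekly frequency
rolling_mean = tickets.rolling(window=7).mean()  # smooth out day-to-day noise to see the trend

print(weekly_total)
print(rolling_mean.tail())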
Risk engines for a wide variety of business areas, including market and
credit risk, catastrophic risk, and portfolio risk.
Customer insight engines will be the backbone in online and set-top box
advertisement targeting, customer loyalty programs to maximize customer
lifetime value, optimizing marketing campaigns for revenue lift, and
targeting individuals or companies at the right time to maximize their
spend.
Software as a Service BI
Software-as-a-Service Business Intelligence (SaaS BI) is a business
intelligence (BI) delivery model in which applications are implemented
outside of a company and usually employed at a hosted location accessed
by an end user via protected Internet access.
SaaS BI generally implies a pay-as-you-go or subscription model, versus
the conventional software licensing model with annual maintenance or
license fees.
SaaS BI is also known as cloud BI or on-demand BI.
SaaS BI allows organizations to use BI tools without on-site installation
or maintenance, allowing customers to concentrate on generating analytic
queries and BI reports, rather than unnecessary tasks. The SaaS BI
approach also allows organizations to broaden their BI systems as usage
is increased. Heavy equipment purchases are not required because there
are no on-premise deployments.
SaaS BI can be a fit if there is no available budget to purchase BI software or related hardware. Because there is no upfront purchase expense or extra staffing demand for handling the BI system, the total cost of ownership can be significantly lower than that of on-premise BI.
Our first question for James was why his business was so successful: In addition to the
Omniture people, several other reasons stand out to me. They include:
Scaling the SaaS delivery model. We built Omniture from the ground up
to be SaaS and we understood the math better than the competition. We
invented a concept called the Magic Number. The Magic Number helps you look at your SaaS business and helps you understand the value you are creating when standard GAAP accounting numbers would lead you to believe the opposite.
3. The user experience. This is where we are putting all our marbles. Today’s BI is not designed for the end user. It’s not intuitive, it’s not accessible, it’s not real time, and it doesn’t meet the expectations of today’s consumers of technology, who expect a much more connected experience than enterprise software delivers.
Mobile Business Intelligence and Big Data
Mobile BI refers to the access and use of information via mobile devices.
With the increasing use of mobile devices for business – not only in
management positions – mobile BI is able to bring business intelligence and
analytics closer to the user when done properly.
Whether during a train journey, in the airport departure lounge or during a
meeting break, information can be consumed almost anywhere and anytime
with mobile BI.
Mobile BI – driven by the success of mobile devices – was considered by many
as a big wave in BI and analytics a few years ago. Nowadays, there is a level of
disillusion in the market and users attach much less importance to this trend.
Crowdsourcing Analytics
What is crowdsourcing in data analytics?
Crowdsourcing involves obtaining work, information, or opinions from a large group of people who submit their data via the Internet, social media, and smartphone apps. People involved in crowdsourcing sometimes work as paid freelancers, while others perform small tasks voluntarily.
Netflix already had an algorithm to solve its movie-recommendation problem but thought there was an opportunity to realize additional model “lift,” which would translate to huge top-line revenue.
How does it work?
Corporations, governments, and research laboratories are
confronted with complex statistical challenges.
They describe the problems to Kaggle and provide data sets.
Kaggle converts the problems and the data into contests that are
posted on its web site.
The contests feature cash prizes ranging in value from $100 to
$3 million.
Kaggle’s clients range in size from tiny start-ups to multinational
corporations such as Ford Motor Company and government
agencies such as NASA.
Inter- and Trans-Firewall Analytics
For example, there are instances where a retailer and a social media company can come together to share insights on consumer behavior that will benefit both players.
Some of the more progressive companies are taking this a step further and working on leveraging the large volumes of data outside the firewall, such as social data, location data, and so forth.
In other words, it will not be very long before internal data and insights from within the firewall are no longer a differentiator. We see this trend as the move from intra- to inter- and trans-firewall analytics.
Today they are doing intra-firewall analytics with data within the firewall.
Tomorrow they will be collaborating on insights with other companies to do inter-
firewall analytics as well as leveraging the public domain spaces to do trans-
firewall analytics.