
MODULE 1 - ST


INTRODUCTION TO BIG DATA PLATFORM

What is Data?

The quantities, characters, or symbols on which operations are performed by a computer,


which may be stored and transmitted in the form of electrical signals and recorded on
magnetic, optical, or mechanical recording media.

What is Big Data?

Big Data is a collection of data that is huge in volume and growing exponentially with time. It is data of such large size and complexity that none of the traditional data management tools can store or process it efficiently. In short, big data is also data, but of huge size.


Big Data is the term for collections of data sets so large and complex that they become difficult to process using on-hand database management tools or traditional data processing applications.
4. Veracity

Accuracy and trustworthiness of the generated data. The accuracy of analysis depends on the veracity of the source data.

5. Value

Mechanism to bring the correct meaning out of data.

CHALLENGES OF CONVENTIONAL SYSTEMS

The sharply increasing data deluge of the big data era brings huge challenges for data acquisition, storage, management and analysis. RDBMSs apply only to structured data, and traditional RDBMSs cannot handle the huge volume and heterogeneity of big data.

The key challenges are:

1. Data representation

Many datasets have certain levels of heterogeneity in type, structure, semantics, organization, granularity and accessibility.

Data representation aims to make data more meaningful for computer analysis and user interpretation. Efficient data representation should reflect data structure, class and type, as well as integrated technologies, so as to enable efficient operations on different data sets.

2. Redundancy reduction and Data compression

Redundancy reduction and data compression effectively reduce the indirect cost of the entire system, on the premise that the potential value of the data is not affected (a small compression sketch appears after this list of challenges).

3. Data lifecycle management

A data-importance principle related to analytical value should be developed to decide which data shall be stored and which data shall be discarded.

4. Analytical mechanism
The analytical system of big data shall process masses of heterogeneous data within a
limited time.

5. Data confidentiality

Most big data service providers or owners at present cannot effectively maintain and analyze huge data sets because of their limited capacity. They must rely on professionals or tools to analyze such data, which increases the potential security risks.

6. Energy management

With the increase in data volume and analytical demands, the processing, storage and transmission of big data will inevitably consume more energy.

7. Expandability and Scalability

The analytical system of big data must support present and future datasets. The
analytical algorithm must be able to process increasingly expanding and more complex
data sets.

8. Cooperation

Analysis of big data is interdisciplinary research, which requires experts from different fields to cooperate in order to harvest the potential of big data.
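
Returning to challenge 2, a minimal sketch of lossless data compression using Python's standard-library gzip module is shown below; the payload is a made-up example, but it illustrates how redundancy can be removed without affecting the value of the data.

    # A minimal sketch of lossless compression with the Python standard library.
    # The text payload is a made-up, highly redundant example.
    import gzip

    payload = ("sensor_id=42,temp=21.5;" * 1000).encode("utf-8")

    compressed = gzip.compress(payload)
    restored = gzip.decompress(compressed)

    # Redundancy is removed, yet the original data is fully recoverable.
    print(len(payload), "->", len(compressed), "bytes; lossless:", restored == payload)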

Figure: Data growth
Benefits of Big Data Processing

 Businesses can utilize outside intelligence while making decisions

Access to social data from search engines and sites like Facebook and Twitter is enabling organizations to fine-tune their business strategies.

 Improved customer service

Traditional customer feedback systems are getting replaced by new systems designed with Big Data technologies. In these new systems, Big Data and natural language processing technologies are being used to read and evaluate consumer responses.

 Early identification of risk to the product/services, if any


 Better operational efficiency

Big Data technologies can be used to create a staging area or landing zone for new data before identifying which data should be moved to the data warehouse, as shown in the toy sketch below. In addition, such integration of Big Data technologies and the data warehouse helps an organization offload infrequently accessed data.
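
As a toy sketch of such a staging area, the snippet below lets raw JSON files land in one directory and promotes only records that pass a basic check to a warehouse directory; the paths and the validation rule are hypothetical.

    # A toy "landing zone" sketch: raw JSON files arrive in landing_zone/ and
    # only records passing a basic check are promoted to warehouse/.
    # The paths and the validation rule are hypothetical placeholders.
    import json, pathlib, shutil

    landing = pathlib.Path("landing_zone")
    warehouse = pathlib.Path("warehouse")
    warehouse.mkdir(exist_ok=True)

    for path in landing.glob("*.json"):
        record = json.loads(path.read_text())
        if "customer_id" in record:        # keep only records worth warehousing
            shutil.move(str(path), str(warehouse / path.name))
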
MODERN DATA ANALYTIC TOOLS:

Following are some of the prominent big data analytics tools and techniques that are used
by analytics developers.

Cassandra:

This is one of the most applauded and widely used big data tools because it offers effective management of large and complex amounts of data. Cassandra is a database that offers high availability and scalability without compromising the performance of commodity hardware and cloud infrastructure. Its advantages include fault tolerance, decentralization, durability, performance, professional support, elasticity, and scalability, which is why it is so popular with analytics developers. Companies using the Cassandra big data analytics tool include eBay and Netflix.
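
As a small illustration, the sketch below uses the DataStax cassandra-driver Python package to connect to a local node and run a few CQL statements; the contact point, keyspace and table names are hypothetical placeholders.

    # A minimal sketch using the DataStax cassandra-driver package
    # (pip install cassandra-driver). Contact point, keyspace and table
    # names are hypothetical placeholders.
    from cassandra.cluster import Cluster

    cluster = Cluster(["127.0.0.1"])      # connect to a local Cassandra node
    session = cluster.connect()

    session.execute("""
        CREATE KEYSPACE IF NOT EXISTS demo
        WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}
    """)
    session.execute("CREATE TABLE IF NOT EXISTS demo.users (id int PRIMARY KEY, name text)")

    # Insert a row and read it back.
    session.execute("INSERT INTO demo.users (id, name) VALUES (1, 'Alice')")
    for row in session.execute("SELECT id, name FROM demo.users"):
        print(row.id, row.name)

    cluster.shutdown()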

Hadoop:

This is a striking product from Apache that has been used by many eminent companies. Hadoop is basically an open-source software framework, written in Java, for working with very large data sets. It is designed so that it can scale up from a single server to hundreds of machines. The most prominent feature of this software library is its superior processing of voluminous data sets, and many companies choose Hadoop because of these processing capabilities.
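
To give a flavour of how a job can be expressed, the sketch below shows a word-count mapper and reducer written in Python for Hadoop Streaming, which lets any executable act as the map and reduce steps; the file names are illustrative.

    # mapper.py -- emits "word<TAB>1" for every word read from stdin
    # (the Hadoop Streaming convention for map output).
    import sys

    for line in sys.stdin:
        for word in line.strip().split():
            print(word + "\t1")

    # reducer.py -- Hadoop Streaming delivers mapper output sorted by key,
    # so identical words arrive consecutively and can be summed in one pass.
    import sys

    current, count = None, 0
    for line in sys.stdin:
        word, value = line.rstrip("\n").split("\t")
        if word != current:
            if current is not None:
                print(current + "\t" + str(count))
            current, count = word, 0
        count += int(value)
    if current is not None:
        print(current + "\t" + str(count))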

Knime:

This is an open-source big data analytics tool. KNIME is a leading analytics platform that provides an open solution for data-driven innovation. With the help of this tool, you can discover the hidden potential of your data, mine it for fresh insights, and predict future outcomes by analysing the data. The tool supports many types of data, such as XML, JSON, images, documents, and more, and also provides advanced predictive and machine learning algorithms.

OpenRefine:

OpenRefine is used for large and voluminous data sets. The tool helps in cleaning data and transforming it from one format into another. It can also be used to link and extend your datasets with web services and other external data. OpenRefine was earlier known as Google Refine, but Google stopped supporting the project in 2012 and it was then rebranded as OpenRefine.

R language:

R is an open-source programming language that helps organizations manage and analyse large amounts of data effectively and aptly. The language was initially written by Ross Ihaka and Robert Gentleman, and it has received immense appreciation from the mathematicians, statisticians, data scientists and data miners working in data analytics. R is packed with a host of data analysis tools that make the analysis of data simpler for users. With R, businesses do not need to develop customized tools and can avoid time-consuming custom code. R is a prime data analysis environment, with innumerable algorithms designed for data retrieval, processing, analysis and high-end statistical graphics representations.

Plotly:

As a successful big data analytics tool, Plotly is used to create great dynamic visualizations even when the organization has inadequate time or skills for meeting big data needs. With the help of this tool, you can create stunning and informative graphics very effortlessly. Basically, Plotly is used for composing, editing, and sharing interactive data visualizations via the web.
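
As a small sketch, the snippet below uses the plotly Python package to build an interactive line chart; the revenue figures are made up purely for illustration.

    # A minimal sketch with the plotly Python package (pip install plotly).
    # The revenue figures are made-up illustrative data.
    import plotly.express as px

    sales = {"month": ["Jan", "Feb", "Mar", "Apr"],
             "revenue": [120, 135, 150, 170]}

    fig = px.line(sales, x="month", y="revenue",
                  title="Monthly revenue (illustrative data)")
    fig.show()   # opens an interactive chart in the browser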

Bokeh:

This tool has many resemblances to Plotly. It is very effective and useful if you want to create simple and informative visualizations. Bokeh is a Python interactive visualization library that helps you create astounding and meaningful visual presentations of data in web browsers. Thus, the tool is widely used by experienced big data analysts to create interactive data applications, dashboards, and plots quickly and easily. Bokeh is among the most progressive and effective visual data representation tools.
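
A comparable sketch with the bokeh library is shown below; the numbers plotted are again made up for illustration.

    # A minimal sketch with the bokeh Python library (pip install bokeh).
    # The plotted values are made up for illustration.
    from bokeh.plotting import figure, show

    p = figure(title="Response time per request (illustrative data)",
               x_axis_label="request", y_axis_label="milliseconds")
    p.line([1, 2, 3, 4, 5], [120, 90, 140, 100, 95], line_width=2)
    show(p)   # renders the interactive plot in a web browser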

Neo4j:

Neo4j is one of the leading big data analytics tools, as it takes big data business to the next level. Neo4j is a graph database management system developed by Neo4j Inc. The tool helps you work with data and the connections between data points. These connections drive modern intelligent applications, and Neo4j is the tool that turns such connections into competitive advantage. As per the DB-Engines ranking, Neo4j is the most popular graph database.
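
As an illustration, the sketch below uses the official neo4j Python driver to create and query a simple relationship; the connection URI, credentials and data are hypothetical placeholders.

    # A minimal sketch with the official neo4j Python driver (pip install neo4j).
    # URI, credentials and the Person data are hypothetical placeholders.
    from neo4j import GraphDatabase

    driver = GraphDatabase.driver("bolt://localhost:7687",
                                  auth=("neo4j", "password"))

    with driver.session() as session:
        # Create two people connected by a KNOWS relationship.
        session.run("MERGE (a:Person {name: $a}) "
                    "MERGE (b:Person {name: $b}) "
                    "MERGE (a)-[:KNOWS]->(b)", a="Alice", b="Bob")

        # Query the connections back out.
        for record in session.run("MATCH (a:Person)-[:KNOWS]->(b:Person) "
                                  "RETURN a.name AS a, b.name AS b"):
            print(record["a"], "knows", record["b"])

    driver.close()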

Rapidminer:

This is certainly one of the favourite tools of data specialists. Like KNIME, it is also an open-source data science platform that operates through visual programming. The tool can manipulate, analyse and model data and integrate it into business processes. RapidMiner helps data science teams become more productive by providing an open-source platform for data preparation, model deployment, and machine learning. Its unified data science platform accelerates the building of complete analytical workflows.

Wolfram Alpha:

This tool gives every minute detail of the data. It was developed by Wolfram Alpha LLC, a subsidiary of Wolfram Research. If you want to do advanced research on financial, historical, social, and other professional areas, this is the platform to use.
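
One common way to reach the platform programmatically is the community wolframalpha Python package, as in the hedged sketch below; it assumes you have registered for a Wolfram|Alpha App ID, and "YOUR_APP_ID" and the query text are placeholders.

    # A hedged sketch using the community wolframalpha package
    # (pip install wolframalpha). "YOUR_APP_ID" and the query are placeholders;
    # a registered Wolfram|Alpha App ID is assumed.
    import wolframalpha

    client = wolframalpha.Client("YOUR_APP_ID")
    result = client.query("GDP of France")

    # Print the text of the first answer pod, if one came back.
    answer = next(result.results, None)
    print(answer.text if answer else "No result returned")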

Orange:

Orange is an open-source data visualization and data analysis tool that can be used by both novices and experts in the field of data analytics. The tool provides interactive workflows with a large toolbox, with which you can create interactive workflows to analyse and visualize data.
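
Besides the visual workflows, Orange can also be scripted from Python; the sketch below loads the bundled iris sample dataset and fits a small decision tree, and should be read as an illustrative sketch rather than a canonical recipe.

    # A minimal sketch using Orange's Python scripting layer (pip install orange3).
    # The bundled "iris" sample dataset ships with Orange.
    import Orange

    data = Orange.data.Table("iris")                     # load the sample dataset
    model = Orange.classification.TreeLearner()(data)    # fit a small decision tree

    # Predict the class of the first instance and compare with the true label.
    print("predicted:", model(data[0]), "actual:", data[0].get_class())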

Node XL:

NodeXL is a data visualization and analysis software tool for relationships and networks. It offers exact calculations to its users. It is a free and open-source network analysis and visualization tool with a wide range of applications, and it is considered one of the best and latest statistical tools for data analysis, providing advanced network metrics, automation, access to social media network data importers, and much more.

Storm:

Storm has made its name as one of the popular data analytics tools because of its superior real-time streaming data processing capabilities. You can even integrate the tool with others such as Apache Slider in order to manage and secure your data. Organizations can use Storm in many cases, such as data monetization, cybersecurity analytics, threat detection, operational dashboards, and real-time customer management.

ANALYSIS Vs REPORTING:

Reporting: The process of organizing data into informational summaries in order to monitor
how different areas of a business are performing.

Analysis: The process of exploring data and reports in order to extract meaningful insights,
which can be used to better understand and improve business performance.

Reporting translates raw data into information. Analysis transforms data and information into insights. Reporting helps companies monitor their online business and be alerted when data falls outside of expected ranges. Good reporting should raise questions about the business from its end users. The goal of analysis is to answer those questions by interpreting the data at a deeper level and providing actionable recommendations.
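
As a toy illustration of the difference, the pandas sketch below first produces a report (a plain summary of the raw numbers) and then an analysis-style view that probes which channel drives revenue in each region; the sales figures are made up.

    # A toy reporting-vs-analysis illustration with pandas (pip install pandas).
    # The sales figures are made up.
    import pandas as pd

    sales = pd.DataFrame({
        "region":  ["North", "North", "South", "South"],
        "channel": ["web", "store", "web", "store"],
        "revenue": [120, 80, 95, 160],
    })

    # Reporting: organize raw data into an informational summary.
    print(sales.groupby("region")["revenue"].sum())

    # Analysis: probe the data for an insight, e.g. which channel
    # drives revenue in each region.
    print(sales.pivot_table(index="region", columns="channel",
                            values="revenue", aggfunc="sum"))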

There are five differences between reporting and analysis:

1. Purpose

Reporting has helped companies monitor their data since before digital technology boomed. Various organizations have long depended on the information it brings to their business, as reporting extracts that information and makes it easier to understand.

Analysis interprets data at a deeper level. While reporting can link data across channels, provide comparisons, and make information easier to understand, analysis interprets this information and provides recommendations on actions.
2. Tasks

Reporting includes building, configuring, consolidating, organizing, formatting, and summarizing. It is very similar to the tasks mentioned above, such as turning data into charts and graphs and linking data across multiple channels.

Analysis consists of questioning, examining, interpreting, comparing, and confirming. With big data, predicting is possible as well.

3. Outputs

Reporting and analysis have a push and pull effect on their users through their outputs. Reporting has a push approach: it pushes information to users, and outputs come in the form of canned reports, dashboards, and alerts.

Analysis has a pull approach, where a data analyst pulls information in order to probe further and answer business questions. Its outputs can take the form of ad hoc responses and analysis presentations. Analysis presentations are composed of insights, recommended actions, and a forecast of their impact on the company.

4. Delivery

Considering that reporting involves repetitive tasks, often with truckloads of data, automation has been a lifesaver, especially now with big data. It is not surprising that the first thing outsourced is data entry, since outsourcing companies are perceived as data reporting experts.

Analysis requires a more custom approach, with human minds doing the superior reasoning and analytical thinking needed to extract insights, and the technical skills needed to provide efficient steps towards a specific goal. This is why data analysts and scientists are in demand these days, as organizations depend on them to come up with recommendations that help leaders or business executives make decisions about their businesses.

5. Value

Both analysis and reporting are indispensable when looking at the big picture. Together they should help businesses grow, expand, move forward, and make more profit or increase their value.
