Big Data Summary

The document provides a comprehensive overview of data, including its definitions, characteristics, and classifications such as small, medium, and big data. It discusses the advantages and disadvantages of big data, the 3Vs (Volume, Velocity, Variety), and the nature of structured and unstructured data. Additionally, it highlights the importance of big data in driving business value and the need for appropriate technologies to manage and analyze it.

Introduction

What is Data?
Data can be defined as a representation of facts, concepts, or instructions in a formalized
manner.
Characteristics of Data

Accuracy: Is the information correct in every detail?

Completeness: How comprehensive is the information?

Reliability: Does the information contradict other trusted sources?

Relevance: Do you really need this information?

Timeliness: How up-to-date is the information? Can it be used for real-time reporting?

Differences between Small Data, Medium Data and Big Data


Data can be small, medium or big.
Small data is data in a volume and format that makes it accessible, informative and
actionable.
Medium data refers to data sets that are too large to fit on a single machine but don’t require
enormous clusters of thousands of nodes.
Big data is extremely large data sets that may be analysed computationally to reveal patterns,
trends, and associations, especially relating to human behaviour and interactions.
Small Data and Big Data Comparison

Definition
Small Data: Data that is ‘small’ enough for humans, in a volume and format that makes it accessible, informative and actionable.
Big Data: Data sets that are so large or complex that traditional data processing applications cannot deal with them.

Data Source
Small Data: Data from traditional enterprise systems such as Enterprise Resource Planning (ERP) and Customer Relationship Management (CRM).
Big Data: Purchase data from point-of-sale systems; clickstream data from websites; GPS stream data (mobility data sent to a server); social media (Facebook, Twitter).

Volume
Small Data: Most cases in a range of tens or hundreds of GB; in some cases a few TB (1 TB = 1,000 GB).
Big Data: More than a few terabytes (TB).

Velocity (rate at which data appears)
Small Data: Controlled and steady data flow; data accumulation is slow.
Big Data: Data can arrive at very fast speeds; enormous data can accumulate within very short periods of time.

Variety
Small Data: Structured data in tabular format with a fixed schema, and semi-structured data in JSON or XML format.
Big Data: High-variety data sets including tabular data, text files, images, video, audio, XML, JSON, logs, sensor data, etc.

Veracity (quality of data)
Small Data: Contains less noise, as data is collected in a controlled manner.
Big Data: Usually the quality of data is not guaranteed; rigorous data validation is required before processing.

Value
Small Data: Business intelligence, analysis, and reporting.
Big Data: Complex data mining for prediction, recommendation, pattern finding, etc.

Time Variance
Small Data: Historical data is equally valid, as it represents solid business interactions.
Big Data: In some cases, data gets old soon (e.g. fraud detection).

Data Location
Small Data: Databases within an enterprise, local servers, etc.
Big Data: Mostly in distributed storage on the cloud or in external file systems.

Infrastructure
Small Data: Predictable resource allocation; mostly vertically scalable hardware.
Big Data: More agile infrastructure with a horizontally scalable architecture; load on the system varies a lot.

Introduction to Big Data


Big data is data that exceeds the processing capacity of conventional database systems. The
data is too big, moves too fast, or doesn’t fit the strictures of your database architectures. To
gain value from this data, you must choose an alternative way to process it. Big data has to
deal with large and complex datasets that can be structured, semi-structured, or unstructured,
and that will typically not fit into memory to be processed.
Big data is a field that treats ways to analyze, systematically extract information from, or
otherwise deal with data sets that are too large or complex to be dealt with by traditional data-
processing application software. – Wikipedia
3Vs of Big Data

Examples of Big Data:


The New York Stock Exchange generates about one terabyte of new trade data per day.
Statistics show that 500+ terabytes of new data are ingested into the databases of the social
media site Facebook every day. This data is mainly generated from photo and video uploads,
message exchanges, posting of comments, etc.
A single jet engine can generate 10+ terabytes of data in 30 minutes of flight time. With many
thousands of flights per day, data generation reaches many petabytes.
Examples of Data Volumes

Unit            Value             Example
Kilobyte (KB)   1,000 bytes       a paragraph of a text document
Megabyte (MB)   1,000 kilobytes   a small novel
Gigabyte (GB)   1,000 megabytes   Beethoven’s 5th Symphony
Terabyte (TB)   1,000 gigabytes   all the X-rays in a large hospital
Petabyte (PB)   1,000 terabytes   half the contents of all US academic research libraries
Exabyte (EB)    1,000 petabytes   about one fifth of the words people have ever spoken
Zettabyte (ZB)  1,000 exabytes    as much information as there are grains of sand on all the world’s beaches
Yottabyte (YB)  1,000 zettabytes  as much information as there are atoms in 7,000 human bodies
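Since each decimal (SI) unit step is a factor of 1,000, conversions between these units are simple arithmetic. A minimal sketch in Python (the helper names are our own, not from any library):

```python
# Decimal (SI) byte units, as in the table above: each step is a factor of 1,000.
UNITS = ["B", "KB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB"]

def to_bytes(value, unit):
    """Convert a value in the given unit to bytes (1 KB = 1,000 B)."""
    return value * 1000 ** UNITS.index(unit)

def human_readable(num_bytes):
    """Render a byte count using the largest unit whose value is >= 1."""
    for i in reversed(range(len(UNITS))):
        if num_bytes >= 1000 ** i:
            return f"{num_bytes / 1000 ** i:g} {UNITS[i]}"
    return f"{num_bytes} B"

print(to_bytes(1, "TB"))            # 1000000000000
print(human_readable(5 * 10**14))   # 500 TB (Facebook's daily ingest from the example above)
```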

Advantages of using Big Data


1. Improved business processes
2. Fraud detection
3. Improved customer service
4. Better decision-making
5. Increased productivity
6. Reduced costs
7. Increased revenue
8. Increased agility
9. Greater innovation
10. Faster speed to market

Disadvantages of Big Data


1. Privacy and security concerns
2. Need for technical expertise
3. Need for talent
4. Data quality
5. Need for cultural change
6. Compliance
7. Cyber security risks
8. Rapid change
9. Hardware needs
10. Costs
11. Difficulty integrating legacy systems
Characteristics of Big Data (3 Vs of Big Data)
3Vs of Big Data = Volume, Velocity and Variety.
1. Volume:
Volume refers to the sheer size of the ever-exploding data of the computing world. It raises
the question of how much data is collected from different sources over the Internet.
2. Velocity:
Velocity refers to the speed at which data is generated and processed. In big data, data flows
in from sources like machines, networks, social media, mobile phones, etc., in a massive and
continuous stream. Velocity determines how fast data must be generated and processed to
meet demand.
3. Variety:
Variety refers to the types of data. In big data, the raw data is always collected in a variety of
forms: it can be structured, unstructured, or semi-structured. This is because the data is
gathered from various heterogeneous sources.
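The three forms can be made concrete with a small sketch (the records below are invented for illustration):

```python
# Invented records illustrating the three forms of data variety.
import csv, io, json

# Structured: fixed schema, tabular (e.g. a CSV export from a relational table)
csv_text = "id,name,age\n1,Asha,30\n2,Ben,25\n"
rows = list(csv.DictReader(io.StringIO(csv_text)))

# Semi-structured: self-describing but with a flexible schema (JSON)
record = json.loads('{"id": 3, "name": "Cara", "tags": ["vip", "mobile"]}')

# Unstructured: free text with no schema at all
text = "Cara posted that she loves the new app."

print(rows[0]["name"], record["tags"][0], len(text.split()))  # Asha vip 8
```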

Veracity is all about the trustworthiness of the data; if the data is collected from trusted,
reliable sources, this concern is reduced. Veracity refers to the inconsistency and uncertainty
in data: the data that is available can be messy, and its quality and accuracy are difficult to
control. Big data is also variable because of the multitude of data dimensions resulting from
multiple disparate data types and sources. For example, data in bulk can create confusion,
whereas too little data can convey half or incomplete information.
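In practice, veracity demands validating raw records before processing them. A minimal, hypothetical sketch (the field names are invented, not from any particular system):

```python
# A minimal, hypothetical veracity check: validate raw records before
# processing, separating clean rows from noisy ones.
def validate(record):
    """Return a list of problems found in one raw record (empty list = clean)."""
    problems = []
    if not record.get("user_id"):
        problems.append("missing user_id")
    amount = record.get("amount")
    if not isinstance(amount, (int, float)) or amount < 0:
        problems.append("bad amount")
    return problems

raw = [
    {"user_id": "u1", "amount": 19.99},
    {"user_id": "",   "amount": 5.00},   # missing ID -> rejected
    {"user_id": "u3", "amount": -2},     # negative amount -> rejected
]
clean = [r for r in raw if not validate(r)]
print(len(clean))  # 1
```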
Challenges of Conventional Systems:
Fundamental challenges:
– How to store voluminous data,
– how to work with voluminous data sizes,
– and, more importantly, how to understand the data and turn it into a competitive advantage.
How about Conventional system technology?
• CPU Speeds:
– 1990 - 44 MIPS at 40 MHz
– 2000 - 3,561 MIPS at 1.2 GHz
– 2010 - 147,600 MIPS at 3.3 GHz
• RAM Memory
– 1990 – 640K conventional memory (256K extended memory recommended)
– 2000 – 64MB memory
– 2010 - 8-32GB (and more)
• Disk Capacity
– 1990 – 20MB
– 2000 - 1GB
– 2010 – 1TB
• Disk Latency (speed of reads and writes) – not much improvement in the last 7-10 years,
currently around 70-80 MB/sec
How long will it take to read 1 TB of data?
• 1 TB (at 80 MB/sec):
  • 1 disk – 3.4 hours
  • 10 disks – 20 min
  • 100 disks – 2 min
  • 1,000 disks – 12 sec
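The arithmetic behind those figures can be reproduced directly (decimal units, matching the 80 MB/sec figure above):

```python
# Reading 1 TB at 80 MB/s per disk, with the work split evenly
# across N disks reading in parallel.
TB = 1_000_000_000_000   # bytes
RATE = 80_000_000        # bytes per second per disk

def read_seconds(total_bytes, disks):
    """Seconds to read total_bytes when each disk reads its share in parallel."""
    return total_bytes / (RATE * disks)

for n in (1, 10, 100, 1000):
    print(n, read_seconds(TB, n))  # 12500.0 s (~3.4 h), 1250.0, 125.0, 12.5
```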

What do we care about when we process data?


• Handle partial hardware failures without going down:
– If a machine fails, we should switch over to a standby machine
– If a disk fails, use RAID or a mirror disk
• Be able to recover from major failures:
– Regular backups
– Logging
– Mirroring the database at a different site
• Scalability:
– Increase capacity without restarting the whole system
– More computing power should mean faster processing
• Result consistency:
– Answers should be consistent (independent of component failures) and returned in a
reasonable amount of time
Nature of Data
Big data is a term thrown around in a lot of articles, and for those who understand what big
data means that is fine, but for those struggling to understand exactly what big data is, it can
get frustrating. There are several definitions of big data as it is frequently used as an all-
encompassing term for everything from actual data sets to big data technology and big data
analytics. However, this article will focus on the actual types of data that are contributing to
the ever growing collection of data referred to as big data. Specifically we focus on the data
created outside of an organization, which can be grouped into two broad categories:
structured and unstructured.

Structured Data
1. Created:
Created data is just that; data businesses purposely create, generally for market research. This
may consist of customer surveys or focus groups. It also includes more modern methods of
research, such as creating a loyalty program that collects consumer information or asking
users to create an account and login while they are shopping online.
2. Provoked:
A Forbes Article defined provoked data as, “Giving people the opportunity to express their
views.” Every time a customer rates a restaurant, an employee, a purchasing experience or a
product they are creating provoked data. Rating sites, such as Yelp, also generate this type of
data.

3. Transacted:
Transactional data is also fairly self-explanatory. Businesses collect data on every transaction
completed, whether the purchase is completed through an online shopping cart or in-store at
the cash register. Businesses also collect data on the steps that lead to a purchase online. For
example, a customer may click on a banner ad that leads them to the product pages which
then spurs a purchase.
As explained by the Forbes article, “Transacted data is a powerful way to understand exactly
what was bought, where it was bought, and when. Matching this type of data with other
information, such as weather, can yield even more insights. (We know that people buy more
Pop-Tarts at Walmart when a storm is predicted.)”
4. Compiled:
Compiled data is giant databases of data collected on every U.S. household. Companies like
Acxiom collect information on things like credit scores, location, demographics, purchases
and registered cars that marketing companies can then access for supplemental consumer
data.
5. Experimental:
Experimental data is created when businesses experiment with different marketing pieces and
messages to see which are most effective with consumers. You can also look at experimental
data as a combination of created and transactional data.
Unstructured Data
People in the business world are generally very familiar with the types of structured data
mentioned above. However, unstructured data is a little less familiar, not because there’s less
of it, but because before technologies like NoSQL and Hadoop came along, harnessing
unstructured data at scale wasn’t practical. In fact, most data being created today is
unstructured. Unstructured data, as the name suggests, lacks structure. It can’t be gathered
based on clicks, purchases or a barcode, so what is it exactly?
6. Captured:
Captured data is created passively due to a person’s behavior. Every time someone enters a
search term on Google that is data that can be captured for future benefit. The GPS info on
our smartphones is another example of passive data that can be captured with big data
technologies.
7. User-generated:
User-generated data consists of all of the data individuals are putting on the Internet every
day. From tweets, to Facebook posts, to comments on news stories, to videos put up on
YouTube, individuals are creating a huge amount of data that businesses can use to better
target consumers and get feedback on products.
Big data is made up of many different types of data. The seven listed above comprise types of
external data included in the big data spectrum. There are, of course, many types of internal
data that contribute to big data as well, but hopefully breaking down the types of data helps
you to better see why combining all of this data into big data is so powerful for business.
Sources of Big Data
Classification of Types of Big Data

1. Social Networks (human-sourced information): this information is the record of human
experiences, previously recorded in books and works of art, and later in photographs, audio
and video. Human-sourced information is now almost entirely digitized and stored
everywhere from personal computers to social networks. The data are loosely structured and
often ungoverned.
Social Networks: Facebook, Twitter, Tumblr etc.
Blogs and comments
Personal documents
Pictures: Instagram, Flickr, Picasa etc.
Videos: Youtube etc.
Internet searches
Mobile data content: text messages
User-generated maps
E-Mail
2. Traditional Business systems (process-mediated data): these processes record and
monitor business events of interest, such as registering a customer, manufacturing a product,
taking an order, etc. The process-mediated data thus collected is highly structured and
includes transactions, reference tables and relationships, as well as the metadata that sets its
context. Traditional business data makes up the vast majority of what IT has managed and
processed, in both operational and BI systems. It is usually structured and stored in relational
database systems. (Some sources belonging to this class may fall into the category of
"administrative data".)

3. Internet of Things (machine-generated data): derived from the phenomenal growth in
the number of sensors and machines used to measure and record the events and situations in
the physical world. The output of these sensors is machine-generated data, and from simple
sensor records to complex computer logs, it is well structured. As sensors proliferate and data
volumes grow, it is becoming an increasingly important component of the information stored
and processed by many businesses. Its well-structured nature is suitable for computer
processing, but its size and speed are beyond traditional approaches.
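Because machine-generated data is well structured, parsing it is mechanical; the difficulty is the volume and speed. A sketch with an invented log format (real sensor formats vary):

```python
# An invented sensor log line; the key=value layout is typical of
# well-structured, machine-generated data.
log_line = "2024-03-01T12:00:05Z sensor=42 temp_c=21.7 humidity=0.43"

def parse(line):
    """Split one well-structured log line into a field dictionary."""
    timestamp, *fields = line.split()
    record = {"timestamp": timestamp}
    for field in fields:
        key, value = field.split("=")
        record[key] = value
    return record

r = parse(log_line)
print(r["sensor"], r["temp_c"])  # 42 21.7
```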

BIG DATA ENTERPRISE ARCHITECTURE


The 5 V’s of Big Data:
Too often in the hype and excitement around Big Data, the conversation gets complicated
very quickly. Data scientists and technical experts bandy around terms like Hadoop, Pig,
Mahout, and Sqoop, making us wonder if we’re talking about information architecture or a
Dr. Seuss book. Business executives who want to leverage the value of Big Data analytics in
their organisation can get lost amidst this highly-technical and rapidly-emerging ecosystem.
In an effort to simplify Big Data, many experts have referenced the “3 V’s”: Volume,
Velocity, and Variety. In other words, is information being generated at a high volume (e.g.
terabytes per day), with a rapid rate of change, encompassing a broad range of sources
including both structured and unstructured data? If the answer is yes, then it falls into the Big
Data category along with sensor data from the “internet of things”, log files, and social media
streams. The ability to understand and manage these sources, and then integrate them into the
larger Business Intelligence ecosystem can provide previously unknown insights from data
and this understanding leads to the “4th V” of Big Data – Value.

There is a vast opportunity offered by Big Data technologies to discover new insights that
drive significant business value. Industries are seeing data as a market differentiator and have
started reinventing themselves as “data companies”, as they realise that information has
become their biggest asset. This trend is prevalent in industries such as telecommunications,
internet search firms, marketing firms, etc. who see their data as a key driver for monetisation
and growth. Insights such as footfall traffic patterns from mobile devices have been used to
assist city planners in designing more efficient traffic flows. Customer sentiment analysis
through social media and call logs have given new insights into customer satisfaction.
Network performance patterns have been analysed to discover new ways to drive efficiencies.
Customer usage patterns based on web click-stream data have driven innovation for new
products and services to increase revenue. The list goes on.
Key to success in any Big Data analytics initiative is to first identify the business needs and
opportunities, and then select the proper fit-for-purpose platform. With the array of new Big
Data technologies emerging at a rapid pace, many technologists are eager to be the first to
test the latest Dr. Seuss-termed platform. But each technology has a unique specialisation,
and might not be aligned to the business priorities. In fact, some identified use cases from the
business might be best suited by existing technologies such as a data warehouse while others
require a combination of existing technologies and new Big Data systems.
With this integration of disparate data systems comes the 5th V – Veracity, i.e. the
correctness and accuracy of information. Behind any information management practice lies
the core doctrines of Data Quality, Data Governance, and Metadata Management, along with
considerations for Privacy and Legal concerns. Big Data needs to be integrated into the entire
information landscape, not seen as a stand-alone effort or a stealth project done by a handful
of Big Data experts.
BIG DATA ANALYTICS

In the excitement and hype around Big Data analytics, it’s easy to see this emerging
technology as a “silver bullet” that can magically generate new insights solely through
powerful technology and smart data scientists. As in any age of change, however, core
principles still apply, and in order to gain insights from Big Data, you need to make sure your
“little data” is correct. Many of the “golden nuggets” of discovery are obtained through an
intersection of Big Data analytics with traditional sources such as data warehouses or
master data management hubs.

Customer sentiment analysis is a common use-case for Big Data analytics—i.e. what are our
customers saying about our products in social media and/or call log records? And how can
we leverage this information to improve our business? Unless you have a robust ‘single
source of record’ for customer information, new discoveries from Big Data analytics will be
of little use. Was it Jane R. Doe or Jane P. Doe complaining about the new luxury sedan
model? With data properly managed within an information management framework, the full
value of Big Data becomes apparent and “golden nuggets” of information can appear. For
example, not only did Jane R. Doe complain about the new luxury sedan, but she had five
service calls about her transmission. She has purchased five high-priced sedans from us in the
past ten years and has an income of over $750,000. Jane R. Doe recently followed our
competitor on Twitter and has asked several questions about new features. It might be worth
having a representative call her personally.
Big Data analytics is an exciting development in the field of information management and, if
used properly, can generate a wealth of opportunity. In order to discover the “golden
nuggets” in your organisation, remember these guiding principles:
• Start with your business goals and drivers and align them to fit-for-purpose technologies
(not the other way around)
• Integrate your Big Data initiatives with core information management practices
• Build your information management practice on a core framework that includes data
governance, data quality management, and the other principles that create a trusted source
of information
Lastly, have fun—this is an exciting time to be in information management. New
technologies are emerging almost daily that can add significant value to your organisation,
particularly in the Big Data space.
A big data architecture is designed to handle the ingestion, processing, and analysis of data
that is too large or complex for traditional database systems. The threshold at which
organizations enter into the big data realm differs, depending on the capabilities of the users
and their tools. For some, it can mean hundreds of gigabytes of data, while for others it means
hundreds of terabytes. Over the years, the data landscape has changed. What you can do, or
are expected to do, with data has changed. The cost of storage has fallen dramatically, while
the means by which data is collected keeps growing. Some data arrives at a rapid pace,
constantly demanding to be collected and observed. Other data arrives more slowly, but in
very large chunks, often in the form of decades of historical data. You might be facing an
advanced analytics problem, or one that requires machine learning. These are challenges that
big data architectures seek to solve.

Components of a Big Data Architecture


Big Data analytics is the natural result of four major global trends

Four Major Global Trends


Moore’s Law – basically says that technology always gets cheaper
Mobile Computing – the smartphone or mobile phone in your hand
Social Networking – Facebook, Foursquare, Pinterest (an American image-sharing social
media service), etc.
Cloud Computing – you don’t have to buy hardware or software; just rent or lease it
Big data analytics is the use of advanced analytic techniques against very large, diverse data
sets that include structured, semi-structured and unstructured data, from different sources, and
in different sizes from terabytes to zettabytes.
Big data is a term applied to data sets whose size or type is beyond the ability of
traditional relational databases to capture, manage and process the data with low latency. Big
data has one or more of the following characteristics: high volume, high velocity or high
variety. Artificial intelligence (AI), mobile, social and the Internet of Things (IoT) are
driving data complexity through new forms and sources of data. For example, big data comes
from sensors, devices, video/audio, networks, log files, transactional applications, web, and
social media — much of it generated in real time and at a very large scale.
APPLICATIONS – Where it is used
1. Life Sciences:
Clinical research is a slow and expensive process, with trials failing for a variety of reasons.
Advanced analytics, artificial intelligence (AI) and the Internet of Medical Things (IoMT)
unlocks the potential of improving speed and efficiency at every stage of clinical research by
delivering more intelligent, automated solutions.
2. Banking:
Financial institutions gather and access analytical insight from large volumes of unstructured
data in order to make sound financial decisions. Big data analytics allows them to access the
information they need when they need it, by eliminating overlapping, redundant tools and
systems.
3. Manufacturing:
For manufacturers, solving problems is nothing new. They wrestle with difficult problems on
a daily basis - from complex supply chains, to motion applications, to labor constraints and
equipment breakdowns. That's why big data analytics is essential in the manufacturing
industry, as it has allowed competitive organizations to discover new cost saving
opportunities and revenue opportunities.
4. Health Care:
Big data is a given in the health care industry. Patient records, health plans, insurance
information and other types of information can be difficult to manage – but are full of key
insights once analytics are applied. That’s why big data analytics technology is so important
to health care. By analyzing large amounts of information – both structured and unstructured –
quickly, health care providers can provide lifesaving diagnoses or treatment options almost
immediately.
5. Government:
Certain government agencies face a big challenge: tighten the budget without compromising
quality or productivity. This is particularly troublesome with law enforcement agencies,
which are struggling to keep crime rates down with relatively scarce resources. And that’s
why many agencies use big data analytics; the technology streamlines operations while
giving the agency a more holistic view of criminal activity.
6. Retail:
Customer service has evolved in the past several years, as savvier shoppers expect retailers to
understand exactly what they need, when they need it. Big data analytics technology helps
retailers meet those demands. Armed with endless amounts of data from customer loyalty
programs, buying habits and other sources, retailers not only have an in-depth understanding
of their customers, they can also predict trends, recommend new products – and boost
profitability.
Big Data Enterprise Model


The requirements of traditional enterprise data models for application, database, and storage
resources have grown over the years, and the cost and complexity of these models has
increased along the way to meet the needs of big data. This rapid change has prompted
changes in the fundamental models that describe the way that big data is stored, analyzed,
and accessed.
The new models are based on a scaled-out, shared-nothing architecture, bringing new
challenges to enterprises to decide what technologies to use, where to use them, and how.
One size no longer fits all, and the traditional model is now being expanded to incorporate
new building blocks that address the challenges of big data with new information processing
frameworks purpose-built to meet big data’s requirements. However, these purpose-built
systems also must meet the inherent requirement for integration into current business
models, data strategies, and network infrastructures.
Big Data Components
Two main building blocks are being added to the enterprise stack to accommodate big data:
● Hadoop: Provides storage capability through a distributed, shared-nothing file system, and
analysis capability through MapReduce
● NoSQL: Provides the capability to capture, read, and update, in real time, the large influx
of unstructured data and data without schemas; examples include click streams, social media,
log files, event data, mobility trends, and sensor and machine data
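The MapReduce model that Hadoop provides can be sketched in plain Python (a toy illustration of the programming model, not actual Hadoop code): map emits (key, value) pairs, the framework shuffles them into groups by key, and reduce aggregates each group. Word count is the classic example.

```python
# Toy MapReduce word count: map -> shuffle (group by key) -> reduce.
from collections import defaultdict

def map_phase(line):
    """Emit a (word, 1) pair for every word in the line."""
    return [(word, 1) for word in line.lower().split()]

def reduce_phase(key, values):
    """Aggregate all values emitted for one key."""
    return (key, sum(values))

lines = ["big data needs big tools", "data is the new oil"]

# Shuffle: group all mapped pairs by key (Hadoop's framework does this step).
groups = defaultdict(list)
for line in lines:
    for key, value in map_phase(line):
        groups[key].append(value)

counts = dict(reduce_phase(k, v) for k, v in groups.items())
print(counts["big"], counts["data"])  # 2 2
```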