Chapter 1
Introduction to Big Data
Introduction
1. What is Big Data?
2. Big Data Characteristics
3. Types of Big Data
4. Traditional vs. Big Data business approach
5. Case study of Big Data Solutions
What is Big Data?
• Massive datasets
• Collected from a variety of data sources
• E-business and social media create 2.5 exabytes of data per day
(1 exabyte = 10^18 bytes).
• Analyzed to reveal new insights for optimized decision making.
• Stored for analysis to reveal hidden correlations and patterns;
this is “Big Data Analytics”.
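As a rough check on these magnitudes, the sketch below converts between decimal storage units. The `UNITS` table and the helper names are illustrative and not from the slides, and the yearly figure simply assumes the daily rate holds for 365 days.

```python
# Decimal (SI) storage units expressed in bytes.
UNITS = {
    "MB": 10**6,
    "GB": 10**9,
    "TB": 10**12,
    "PB": 10**15,
    "EB": 10**18,
    "ZB": 10**21,
}


def to_bytes(value, unit):
    """Convert a quantity in the given unit to bytes."""
    return value * UNITS[unit]


def convert(value, from_unit, to_unit):
    """Convert a quantity from one unit to another."""
    return value * UNITS[from_unit] / UNITS[to_unit]


# 2.5 EB per day, as quoted above.
daily_bytes = to_bytes(2.5, "EB")

# Assumption: the same daily rate every day of a 365-day year.
yearly_zb = convert(2.5 * 365, "EB", "ZB")

print(f"{daily_bytes:.2e} bytes per day")
print(f"{yearly_zb:.4f} ZB per year")
```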
Trends of Data Generation
• 2006: 10 ZB
• 2010: 20 ZB
• 2017: 30 ZB
• 2020: 50 ZB
Big Data: Result of Three Computing Trends
• Social networking
• Cloud computing
• Mobile computing
Volume of Big Data
• ERP (megabytes): transactions, operations
• CRM (gigabytes): customer segmentation, support
• Web (terabytes): offer history, dynamic pricing, behavior, weblogs
• Big Data (petabytes): sensors, RFID, user clicks, mobile web
Characteristics of Big Data
1. Volume
2. Velocity
3. Variety
Five V’s of Big Data
Types of Big Data
What is Structured Data?
• Structured data usually resides in relational databases (RDBMS).
• Even text strings of variable length like names are contained in records,
making it a simple matter to search.
• Data may be human- or machine-generated as long as the data is created
within an RDBMS structure.
• This format is eminently searchable, both with human-generated queries and
via algorithms that use data types and field names, such as alphabetical or
numeric, currency or date.
• Common relational database applications with structured data include airline
reservation systems, inventory control, sales transactions, and ATM
activity.
• Structured Query Language (SQL) enables queries on this type of structured
data within relational databases.
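A minimal sketch of querying structured data with SQL, using Python's built-in `sqlite3` module and an in-memory database. The `sales` table, its columns, and the sample rows are hypothetical and chosen only to illustrate typed fields and a filtered, grouped query.

```python
import sqlite3

# In-memory relational database; nothing is written to disk.
conn = sqlite3.connect(":memory:")

# A fixed schema: every record has the same typed fields.
conn.execute(
    "CREATE TABLE sales ("
    "id INTEGER PRIMARY KEY, customer TEXT, amount REAL, sale_date TEXT)"
)

# Hypothetical sales transactions.
conn.executemany(
    "INSERT INTO sales (customer, amount, sale_date) VALUES (?, ?, ?)",
    [
        ("Asha", 120.0, "2024-01-05"),
        ("Ben", 75.5, "2024-01-06"),
        ("Asha", 30.0, "2023-12-30"),
        ("Ben", 24.5, "2024-02-01"),
    ],
)

# Total sales per customer from 2024 onward, searched by field name and type.
rows = conn.execute(
    "SELECT customer, SUM(amount) FROM sales "
    "WHERE sale_date >= ? GROUP BY customer ORDER BY customer",
    ("2024-01-01",),
).fetchall()

for customer, total in rows:
    print(customer, total)

conn.close()
```

Because the schema is known in advance, the query can filter on a date field and aggregate a numeric field without inspecting the content of each record.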
What is Unstructured Data?
• Unstructured data has internal structure but is not structured via pre-
defined data models or schema.
• It may be textual or non-textual, and human- or machine-generated.
• It may also be stored in a non-relational (NoSQL) database.
• Typical human-generated unstructured data includes:
1. Text files: Word processing, spreadsheets, presentations, email, logs.
2. Social Media: Data from Facebook, Twitter, LinkedIn.
3. Website: YouTube, Instagram, photo sharing sites.
4. Mobile data: Text messages, locations.
5. Communications: Chat, IM, phone recordings, collaboration software.
6. Media: MP3, digital photos, audio and video files.
7. Business applications: MS Office documents, productivity applications.
• Typical machine-generated unstructured data includes:
1. Satellite imagery: Weather data, land forms, military movements.
2. Scientific data: Oil and gas exploration, space exploration, seismic imagery,
atmospheric data.
3. Digital surveillance: Surveillance photos and video.
4. Sensor data: Traffic, weather, oceanographic sensors.
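To illustrate what "internal structure without a pre-defined schema" can mean in practice, the sketch below pulls order references out of free-text messages with a regular expression. The messages, the pattern, and the function name are hypothetical. Real unstructured data usually needs far more robust extraction than this.

```python
import re

# Hypothetical free-text customer messages with no fixed fields.
messages = [
    "Order 1042 arrived late, very disappointed!",
    "Loved the service. Will order again (ref 977).",
    "no reference here, just a comment",
]

# Assumed convention: a number follows the word "order" or "ref".
ORDER_REF = re.compile(r"\b(?:order|ref)\s+(\d+)", re.IGNORECASE)


def extract_refs(text):
    """Return all order reference numbers found in a message."""
    return [int(number) for number in ORDER_REF.findall(text)]


refs = [extract_refs(message) for message in messages]

for message, found in zip(messages, refs):
    print(found, "<-", message)
```

Unlike a database column, the reference number has no guaranteed position or name, so it has to be inferred from the text itself.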
What is Semi-structured Data?
• Semi-structured data maintains internal tags and markings that identify
separate data elements, which enables information grouping and
hierarchies.
• Both documents and databases can be semi-structured.
• Email is a very common example of a semi-structured data type.
• Examples of Semi-structured Data:
1. Markup language XML: XML is a set of document encoding rules that defines
a human- and machine-readable format.
2. Open standard JSON (JavaScript Object Notation): Its structure consists of
name/value pairs (an object, hash table, etc.) and ordered value lists (an array,
sequence, or list).
3. NoSQL: NoSQL databases differ from relational databases because they do not
separate the organization (schema) from the data. This also allows for easier data
exchange between databases. Newer NoSQL databases include MongoDB and
Couchbase.
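The JSON structure described above can be sketched with Python's built-in `json` module. The customer records and field names are hypothetical. Note that the second record omits the `tags` field, which a semi-structured format allows because the tags travel with the data instead of living in a separate schema.

```python
import json

# Hypothetical customer records; each carries its own field names (tags).
raw = """
[
  {"name": "Asha",
   "orders": [{"id": 1042, "total": 120.0}, {"id": 1077, "total": 30.0}],
   "tags": ["loyalty", "mobile"]},
  {"name": "Ben",
   "orders": [{"id": 977, "total": 75.5}]}
]
"""

records = json.loads(raw)

summary = {}
for record in records:
    # Name/value pairs become dict entries; ordered lists become Python lists.
    order_total = sum(order["total"] for order in record["orders"])
    # Optional field: fall back to an empty list when the tag is absent.
    tags = record.get("tags", [])
    summary[record["name"]] = (order_total, tags)

for name, (order_total, tags) in summary.items():
    print(name, order_total, tags)
```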
Traditional data management Approach
• Traditional data management stores structured data in data
marts and data warehouses, which are distributed
throughout the organization.
• Copying all the data from each of these systems to a
centralized location and keeping it updated is not an easy
task.
• Moreover, sampling the data does not serve the purpose of
extracting the required information.
• This approach could handle large volumes of
transactions, but only up to a point.
Big Data Approach
• Many IT tools are available for Big Data projects.
• Hadoop: storage
• Apache Spark: stream processing
• When used, these tools can dramatically reduce the time-to-value,
in most cases from more than 2 years to less than 4
months.
Advantages of using Hadoop:
1. Scalability
2. No pre-processing of data
3. Handles unstructured data
4. No limit on data volume or time
5. Protection against hardware failure
Beneficial Domains
• Insurance companies: To assess the likelihood of fraud by
accessing internal and external data while processing claims.
• Manufacturers and distributors: To detect supply chain issues
earlier, so that they can choose different logistical approaches and
avoid the additional costs of material delays, overstock, or
stock-out conditions.
• Hotels and telecommunications companies: To serve customers
better by gaining clearer insight into customer needs.
• Public services: Services such as traffic, ambulance, and
transportation can optimize their delivery mechanisms.
• Smart cities: To make cities more efficient and sustainable
and improve the lives of citizens.
Case Study
1. Clickstream Analytics
2. Feedback analysis using word count
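The slides do not detail the second case study, so the following is only a minimal sketch of what a word-count pass over feedback text might look like. The comments, the stop-word list, and the function name are hypothetical, and a real deployment would typically run the same counting logic on a distributed framework over far larger input.

```python
import re
from collections import Counter

# Hypothetical feedback comments; real input would come from surveys or reviews.
feedback = [
    "Great service and great prices",
    "Delivery was slow, service was fine",
    "Slow delivery but great support",
]

# Hypothetical list of common words to ignore.
STOP_WORDS = {"and", "was", "but", "the", "a"}


def word_counts(comments):
    """Count how often each non-stop word appears across all comments."""
    counts = Counter()
    for comment in comments:
        # Lowercase and split into alphabetic tokens.
        words = re.findall(r"[a-z']+", comment.lower())
        counts.update(word for word in words if word not in STOP_WORDS)
    return counts


counts = word_counts(feedback)

# The most frequent words hint at what customers talk about most.
print(counts.most_common(3))
```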
Thank you