Data Analytics and Big Data
Richard Lui
The Big Data Era
• Data: Any piece of information stored and/or processed by a computer or mobile device.
• Companies/Organizations are generating and keeping more and more data
• The term "Big Data" was coined by John Mashey in the 1990s to describe data that is too
vast and complex for traditional tools to handle.
1.44 megabytes (MB)
4,000-year-old clay disk
Video: Facebook Data Center
1 terabyte (TB) = 1,048,576 MB
Over a hundred petabytes (PB) of photo and video data (1 PB = 1,024 TB)
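The unit conversions above can be checked with a few lines of Python. This is a minimal sketch that assumes binary (factor of 1,024) units, as the slide does; the `convert` helper and the 100 PB figure (taken as a lower bound for "over a hundred petabytes") are illustrative assumptions.

```python
# Binary (base-1024) storage units, ordered from smallest to largest.
UNITS = ["MB", "GB", "TB", "PB"]


def convert(value, from_unit, to_unit):
    """Convert between MB, GB, TB and PB using factors of 1,024 (hypothetical helper)."""
    steps = UNITS.index(from_unit) - UNITS.index(to_unit)
    return value * 1024 ** steps


tb_in_mb = convert(1, "TB", "MB")   # megabytes in one terabyte
pb_in_tb = convert(1, "PB", "TB")   # terabytes in one petabyte

# How many 1.44 MB units would 100 PB fill? (100 PB is an assumed lower bound.)
units_of_1_44_mb = convert(100, "PB", "MB") / 1.44

print(tb_in_mb, pb_in_tb, round(units_of_1_44_mb))
```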
Where are the data coming from?
• Your every interaction with your computer or phone
• Your every interaction on social media
• Every time you walk down the street with a phone in your pocket, it
tracks your location through GPS sensors
• Every time you buy something with your credit card or Octopus card
• Every time you read an article online
• Every time you stream a song, movie or podcast
• …
Explosion of data
• Exponential growth of the Internet and World Wide Web
• User transactions and interactions with e-commerce and
mobile applications
• Social network activities
• E.g. YouTube, Facebook, Instagram, Twitter
• Companies collect and store a large volume of data from
different types of users
• E.g. Google, Baidu, Netflix, Uber
• Internet of Things (IoT) and wireless sensors
• Smart watches, thermostats, water heaters, smoke detectors, …
A chart which provides an overview of what happens online every minute
https://www.socialmediatoday.com/news/what-happens-on-the-internet-every-minute-2021-version-infographic/607586/
The 4 Vs of Big Data
• Volume
• A huge amount of data
• Velocity
• High speed and continuous flow of data
• Variety
• Different types of structured, semi-structured and unstructured data coming
from heterogeneous sources
• Veracity
• Data may be inconsistent, incomplete and messy
Data Analytics
• Data analytics refers to the technologies and processes that turn raw data into
insights for decision making and make it easier to draw conclusions from data
Clickstreams in an e-commerce website:
{
  "timestamp": "2022-08-12 03:01:58.732726",
  "user_id": "35",
  "click_id": "15cf179b9c9d483a…",
  "event_name": "Search",
  "user_ip": "11.22.33.44",
  "additional_data": {
    "engagement_time": 40,
    "product_id": 12345
  }
}
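As an illustration of turning a raw clickstream record into something usable for analysis, the sketch below parses the event with Python's standard `json` module and pulls out a few fields. The choice of fields in `summary` is an assumption made for the example, and the truncated `click_id` is kept exactly as shown on the slide.

```python
import json
from datetime import datetime

# The raw event from the slide (click_id is truncated in the source).
raw_event = """
{
  "timestamp": "2022-08-12 03:01:58.732726",
  "user_id": "35",
  "click_id": "15cf179b9c9d483a…",
  "event_name": "Search",
  "user_ip": "11.22.33.44",
  "additional_data": {"engagement_time": 40, "product_id": 12345}
}
"""

event = json.loads(raw_event)

# Parse the timestamp string into a datetime so it can be grouped by hour, day, etc.
when = datetime.strptime(event["timestamp"], "%Y-%m-%d %H:%M:%S.%f")

# Keep only the fields needed for a simple engagement report (illustrative choice).
summary = {
    "user_id": int(event["user_id"]),
    "event": event["event_name"],
    "hour": when.hour,
    "engagement_time": event["additional_data"]["engagement_time"],
}
print(summary)
```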
Structured vs. Unstructured data
• Structured data
• Data conforms to a data model or schema and is often stored in tabular form.
• Unstructured data
• Data that does not conform to a data model or data schema is known as unstructured data.
• Estimated to make up about 80% of the data within any given enterprise.
• Semi-structured data
• Data that is not tabular but still conforms to some level of structure (e.g. JSON or XML)
Figure: examples of unstructured, semi-structured and structured data
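To make the distinction concrete, the sketch below represents one hypothetical customer order in all three forms. The order data, field names and text are invented for illustration. Fields can be read directly from the structured and semi-structured forms, while the unstructured text can only be searched.

```python
import csv
import io
import json

# Structured: fixed columns, fits a table (hypothetical data).
structured = "order_id,customer,amount\n1001,Alice,25.50\n"

# Semi-structured: self-describing keys and nesting, but no fixed table schema.
semi_structured = '{"order_id": 1001, "customer": "Alice", "items": [{"sku": "A1", "qty": 2}]}'

# Unstructured: free text with no data model.
unstructured = "Hi, this is Alice. My order 1001 arrived late and the box was damaged."

row = next(csv.DictReader(io.StringIO(structured)))
doc = json.loads(semi_structured)

amount = float(row["amount"])           # read a column by name
first_qty = doc["items"][0]["qty"]      # navigate nested keys
mentions_order = "1001" in unstructured  # free text allows only text search here

print(amount, first_qty, mentions_order)
```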
Are the data structured/unstructured?
It’s estimated that 90% of the big data we generate is unstructured!
Four data analytics capabilities
Source: Gartner's 2017 Planning Guide for Data and Analytics.
Descriptive Analytics
• "What has happened?"
• Example
• What was the sales volume over the past 12 months?
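A descriptive question like the one above can be answered by summarizing historical records. The sketch below uses twelve hypothetical monthly sales figures (invented for illustration) and reports the total, the monthly average and the best month.

```python
from statistics import mean

# Hypothetical units sold in each of the past 12 months (month 1 = oldest).
monthly_sales = [120, 135, 128, 150, 142, 138, 160, 171, 165, 158, 180, 195]

total = sum(monthly_sales)
average = mean(monthly_sales)
best_month = monthly_sales.index(max(monthly_sales)) + 1  # 1-based month number

print(f"Total: {total}, monthly average: {average}, best month: {best_month}")
```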
Diagnostic Analytics
• Identifies the cause of a phenomenon that occurred in the past
• Example
• Why were Q2 sales less than Q1 sales?
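One common way to approach a diagnostic question is to break the total down by a dimension and see which segment explains the change. The sketch below does this by region; the regions and sales figures are hypothetical.

```python
# Hypothetical sales by region for two quarters.
q1 = {"North": 400, "South": 350, "East": 300}
q2 = {"North": 410, "South": 240, "East": 310}

# Change per region between Q1 and Q2.
change = {region: q2[region] - q1[region] for region in q1}
total_change = sum(change.values())

# The region with the most negative change contributes most to the drop.
main_driver = min(change, key=change.get)

print(change, total_change, main_driver)
```

In this made-up data, overall sales fell even though two regions grew, which points the analyst toward one region for further investigation.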
Predictive Analytics
• Generate future predictions based upon past events.
• Example
• What are the predicted sales for the next month?
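A very simple way to generate such a prediction is to fit a straight-line trend to past values and extend it one step. The sketch below implements ordinary least squares by hand on hypothetical monthly sales; real forecasting models usually also account for seasonality and other factors, so treat this only as an illustration of the idea.

```python
def fit_line(ys):
    """Least-squares line through the points (1, ys[0]), (2, ys[1]), ..."""
    n = len(ys)
    xs = range(1, n + 1)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    slope = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys)) / sum(
        (x - x_mean) ** 2 for x in xs
    )
    intercept = y_mean - slope * x_mean
    return slope, intercept


# Hypothetical units sold in each of the past 12 months.
monthly_sales = [120, 135, 128, 150, 142, 138, 160, 171, 165, 158, 180, 195]

slope, intercept = fit_line(monthly_sales)

# Extend the trend line to month 13 (the next month).
forecast = slope * (len(monthly_sales) + 1) + intercept
print(f"Forecast for next month: {forecast:.1f}")
```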
Visualization
• Creation and study of the visual representation of data
• One of the most important tools for data analytics
• Dashboard: A read-only snapshot of an analysis that you can share with other users for reporting
purposes.
Gapminder: https://www.gapminder.org/fw/world-health-chart
AWS QuickSight: https://aws.amazon.com/quicksight
Applications of Big Data
• Coca-Cola uses data to create new products, like Cherry Sprite, based on consumer preferences.
• Targeted advertising on platforms like Facebook is made possible through categorizing users based
on their data.
• The 2016 U.S. presidential campaign used Big Data to target specific groups of voters with
tailored ads.
• Netflix's algorithm recommends shows and movies based on user preferences.
• Google Maps uses real-time data from users' locations and speeds to predict traffic conditions.
• Alibaba's City Brain initiative in Hangzhou, China, uses data to manage city traffic and
infrastructure.
• Personalized medicine sequences a patient’s genome and predicts which medicine will have
the fewest side effects.
Video: Intro to Big Data: Crash Course Statistics #38
How does Facebook track your data?
• Facebook had 2.89 billion monthly active users as of the second quarter of 2021 (Source: Statista)
• Collects, stores and analyzes users' data and behavior
• Suggests posts and advertisements that match users’ preferences
• Collected data
• Age, gender, hobbies and recent experiences
• Posts and pages liked by user
• "People You May Know" feature
• Phone contacts and shared locations
• Users' political activities, such as protests and marches attended
• Facebook partners with data brokers to gather information about users' purchases.
• Even offline transactions, like credit card payments, can be linked to user profiles, leading to targeted ads.
Video: How Facebook Tracks Your Data
Example: Facebook advertising
https://www.facebook.com/help/794535777607370?ref=learn_more_ipl
Artwork Personalization at Netflix
• Artwork selection is crucial to encourage members to engage with unfamiliar titles.
• Netflix personalizes the image used to depict the movie “Good Will Hunting”
• Someone who has watched many romantic movies => show the artwork containing Matt Damon and
Minnie Driver
• A member who has watched many comedies => use the artwork containing Robin Williams, a well-known
comedian.
https://netflixtechblog.com/artwork-personalization-c589f074ad76
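The "Good Will Hunting" example can be sketched as a simple rule: show the artwork that matches the genre a member watches most. This is only an illustration of the idea on the slide, not Netflix's actual algorithm; the function name, genre labels and default artwork are assumptions.

```python
from collections import Counter

# Artwork variants from the slide, keyed by the genre they are meant to appeal to.
ARTWORK_BY_GENRE = {
    "romance": "artwork featuring Matt Damon and Minnie Driver",
    "comedy": "artwork featuring Robin Williams",
}
DEFAULT_ARTWORK = "default artwork"  # assumed fallback for other viewing profiles


def pick_artwork(viewing_history):
    """Return the artwork matching the member's most-watched genre (hypothetical rule)."""
    if not viewing_history:
        return DEFAULT_ARTWORK
    top_genre, _ = Counter(viewing_history).most_common(1)[0]
    return ARTWORK_BY_GENRE.get(top_genre, DEFAULT_ARTWORK)


print(pick_artwork(["romance", "romance", "comedy"]))
print(pick_artwork(["comedy", "comedy", "thriller"]))
```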
Data Analytics in Healthcare
• Metrics: patient falls with injury, average length of stay, and patient recommendations, etc.
• Create interactive dashboards
• Allow clinicians to analyze their performance and outcomes.
• Highlight areas of improvement in patient care.
• Deliver better and safer patient care.
The SEPTEE model
Video: What it's like to be a Healthcare Data Analyst
Predictive policing
• Video: How predictive policing software works
• The use of data to anticipate and prevent crime.
• Hotspot analysis
• Using data from past crimes to forecast the likelihood of crime in each grid cell during the next
shift
• Placing police officers in these hotspots to prevent future crimes.
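The hotspot idea above can be sketched as counting past incidents per grid cell and ranking the cells. This is a simplified, count-based illustration under assumed inputs (incident coordinates and a square grid); it is not how any particular predictive policing product works.

```python
from collections import Counter


def top_hotspots(incidents, cell_size, k):
    """Count incidents per square grid cell and return the k busiest cells."""
    counts = Counter(
        (int(x // cell_size), int(y // cell_size)) for x, y in incidents
    )
    return counts.most_common(k)


# Hypothetical (x, y) locations of past incidents, in kilometres from a map origin.
past_incidents = [
    (0.2, 0.3), (0.4, 0.1), (0.7, 0.9),              # cell (0, 0)
    (2.5, 2.5), (2.1, 2.9), (2.8, 2.2), (2.3, 2.6),  # cell (2, 2)
    (5.0, 1.0),                                      # cell (5, 1)
]

hotspots = top_hotspots(past_incidents, cell_size=1.0, k=2)
print(hotspots)
```

Because the ranking depends entirely on where past incidents were recorded, any bias in the historical data carries over directly into the predicted hotspots.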
Case Study: How Cops Are Using Algorithms to Predict Crimes
• The Los Angeles Police Department (LAPD) is using data-driven algorithms to forecast future crimes.
• Predicts violent crime occurrences and potential perpetrators using historical crime, arrest, and field data.
• PredPol: A predictive policing tool utilized by over 60 departments
• Identifies areas or "hotspots" with a higher likelihood of criminal activity
• Officers are directed to specific hotspots identified by PredPol's algorithm, which analyzes historical crime data
and creates hotspots.
• Drone surveillance and facial recognition-equipped body cameras
• The Stop LAPD Spying Coalition argues that such strategies disproportionately target low-income
communities and communities of color.
Video: How Cops Are Using Algorithms to Predict Crimes
Summary
• Data analytics refers to technologies and processes that turn raw data into insights for decision
making.
• "Big Data" describes large, complex datasets that are difficult for traditional tools to process.
• Volume, Velocity, Variety, Veracity.
• Structured vs. unstructured data. Unstructured data makes up an estimated 80% of enterprise data.
• 4 types of analytics: Descriptive, Diagnostic, Predictive, Prescriptive.
• Visualization is crucial for exploring and communicating insights from data.
• Applications of big data
• Targeted advertising, personalization, predictive policing, etc.