Facets of Data
• Past: explain (How? Why?)
• Future: explore potential future events
• Qualitative: intuition + analysis
• Quantitative: formulae + algorithms
Difference between data analyst and data scientist
Analyst
• Business administration
• Domain-specific responsibility, for example marketing analyst or financial analyst
Data Scientist
• Advanced algorithms and machine learning
• Scala Data
Data, data and data…
• Sources pictured: smartphones, smart cars, wearable pedometers
• We live in a world that’s drowning in data.
• Websites track every user’s every click.
• Your smartphone is building up a record of your location and speed every second of every day.
• “Quantified selfers” wear pedometers-on-steroids that are ever-recording their heart rates,
movement habits, diet, and sleep patterns.
• Smart cars collect driving habits, smart homes collect living habits, and smart marketers collect
purchasing habits.
• The Internet itself represents a huge graph of knowledge that contains (among other things) an
enormous cross-referenced encyclopedia; domain-specific databases about movies, music, sports
results, pinball machines, memes, and cocktails; and too many government statistics (some of them
nearly true!) from too many governments to wrap your head around.
• Buried in these data are answers to countless questions that no one’s ever thought to ask.
Facets Of Data
• In data science and big data you’ll come across many different types of
data, and each of them tends to require different tools and techniques.
• The main categories of data are these:
■ Structured
■ Unstructured
■ Natural language
■ Machine-generated
■ Graph-based
■ Audio, video, and images
■ Streaming
• Let’s explore all these interesting data types.
Structured data
• Structured data is data that depends on a data model and resides in a fixed field within a record.
• As such, it’s often easy to store structured data in tables within databases or Excel files.
• SQL, or Structured Query Language, is the preferred way to manage and query data that
resides in databases.
• You may also come across structured data that might give you a hard time storing it in a
traditional relational database.
• Hierarchical data such as a family tree is one such example.
• The world isn’t made up of structured data, though; structure is imposed upon it by humans and
machines.
• More often, data comes unstructured.
Example: Excel data of a bank
    age  job          marital   education  default  balance  housing
0   41   services     married   unknown    no            88  yes
1   56   technician   married   secondary  no          1938  no
2   30   services     single    secondary  no           245  no
3   34   management   single    tertiary   no          1396  yes
4   29   technician   single    secondary  no           -13  yes
5   26   unemployed   single    secondary  no           632  no
6   59   retired      married   primary    no          2074  no
7   46   management   married   tertiary   no             3  no
8   61   retired      married   secondary  no           569  no
9   35   blue-collar  divorced  secondary  no           336  yes
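To make the point about SQL concrete, here is a minimal sketch that loads the ten bank rows above into an in-memory SQLite database and queries them. The table name `bank` and the two questions asked are assumptions chosen for illustration; the slides do not prescribe a particular database.

```python
import sqlite3

# The ten rows of the bank example:
# (age, job, marital, education, default, balance, housing)
rows = [
    (41, "services", "married", "unknown", "no", 88, "yes"),
    (56, "technician", "married", "secondary", "no", 1938, "no"),
    (30, "services", "single", "secondary", "no", 245, "no"),
    (34, "management", "single", "tertiary", "no", 1396, "yes"),
    (29, "technician", "single", "secondary", "no", -13, "yes"),
    (26, "unemployed", "single", "secondary", "no", 632, "no"),
    (59, "retired", "married", "primary", "no", 2074, "no"),
    (46, "management", "married", "tertiary", "no", 3, "no"),
    (61, "retired", "married", "secondary", "no", 569, "no"),
    (35, "blue-collar", "divorced", "secondary", "no", 336, "yes"),
]

conn = sqlite3.connect(":memory:")
# "default" is an SQL keyword, so the column name is quoted.
conn.execute(
    'CREATE TABLE bank (age INTEGER, job TEXT, marital TEXT, education TEXT, '
    '"default" TEXT, balance INTEGER, housing TEXT)'
)
conn.executemany("INSERT INTO bank VALUES (?, ?, ?, ?, ?, ?, ?)", rows)

# Every value sits in a fixed, typed field, so each question is a short query.
avg_balance_by_marital = dict(
    conn.execute("SELECT marital, AVG(balance) FROM bank GROUP BY marital")
)
housing_loan_jobs = [
    job
    for (job,) in conn.execute(
        "SELECT job FROM bank WHERE housing = 'yes' ORDER BY age"
    )
]

print(avg_balance_by_marital)
print(housing_loan_jobs)
```

The same questions asked of free-form text would need far more work, which is the practical difference between structured and unstructured data.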
Unstructured Data
• Unstructured data is data that isn’t easy to fit into a data model because the content is context-specific or
varying.
• One example of unstructured data is your regular email.
• Although email contains structured elements such as the sender, title, and body text, it’s a challenge to find
the number of people who have written an email complaint about a specific employee, because so many
ways exist to refer to a person.
• The thousands of different languages and dialects out there further complicate this.
• Example : An email
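The split between the structured and unstructured parts of an email can be sketched with Python's standard `email` module. The message text, the addresses, and the employee name "John Smith" are all made up for illustration.

```python
from email import message_from_string

# A hypothetical complaint email; every name and address here is invented.
raw = """\
From: customer@example.com
To: support@example.com
Subject: Complaint

Hello, I spoke with Mr. J. Smith yesterday and he was unhelpful.
John from your billing team also never called me back.
"""

msg = message_from_string(raw)

# Structured elements: fixed header fields that are trivial to read.
sender = msg["From"]
subject = msg["Subject"]

# Unstructured element: the body is free text.
body = msg.get_payload()

# A naive search for the employee's full name finds nothing, even though
# the email refers to him twice ("Mr. J. Smith" and "John").
mentions_full_name = "John Smith" in body

print(sender, subject, mentions_full_name)
```

The headers parse cleanly, while counting complaints about one employee already requires knowing every way the writer might refer to that person.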
Natural language
• Natural language is a special type of unstructured data; it’s challenging to process because it
requires knowledge of specific data science techniques and linguistics.
• The natural language processing community has had success in entity recognition, topic
recognition, summarization, text completion, and sentiment analysis, but models trained
in one domain don’t generalize well to other domains.
• Even state-of-the-art techniques aren’t able to decipher the meaning of every piece of
text.
• This shouldn’t be a surprise though: humans struggle with natural language as well.
• It’s ambiguous by nature.
• The concept of meaning itself is questionable here.
• Have two people listen to the same conversation. Will they get the same meaning?
• The meaning of the same words can vary when coming from someone upset or joyous.
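A small sketch of why meaning is hard to pin down: the word-counting sentiment scorer below is a deliberately naive technique, not a method the slides propose, and its word lists and example sentences are assumptions made up for illustration.

```python
# Tiny hand-made word lists; real sentiment lexicons are far larger.
POSITIVE = {"good", "great", "helpful"}
NEGATIVE = {"bad", "terrible", "unhelpful"}


def naive_sentiment(text):
    """Count positive words minus negative words, ignoring context."""
    words = text.lower().replace(".", " ").replace(",", " ").split()
    positives = sum(word in POSITIVE for word in words)
    negatives = sum(word in NEGATIVE for word in words)
    return positives - negatives


plain = naive_sentiment("The service was great.")
negated = naive_sentiment("The service was not great.")
sarcastic = naive_sentiment("Oh great, another delay.")

# All three sentences get the same score, although a human reader
# would call only the first one positive.
print(plain, negated, sarcastic)
```

Negation and sarcasm change the meaning without changing the words being counted, which is one reason models that work in one setting fail in another.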
Machine-generated data
• Machine-generated data is information that’s automatically created
by a computer, process, application, or other machine without human
intervention.
• Machine-generated data is becoming a major data resource and will
continue to do so.
• Wikibon has forecast that the market value of the industrial Internet
will be approximately $540 billion in 2020.
• IDC (International Data Corporation) has estimated there will be 26 times
more connected things than people in 2020.
• This network is commonly referred to as the internet of things.
• The analysis of machine data relies on highly scalable tools, due to its high
volume and speed.
• Examples of machine data are web server logs, call detail records, network
event logs, and telemetry.
• Machine data like this fits nicely in a classic table-structured database.
• A table isn’t the best approach for highly interconnected or “networked”
data, where the relationships between entities have a valuable role to play.
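As a sketch of how web server logs map onto table-like records, the snippet below parses a few lines with a regular expression. The log lines are invented, and the pattern assumes the widely used Common Log Format; a real server may be configured differently.

```python
import re
from collections import Counter

# Pattern for the Common Log Format (an assumption about the log layout).
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" (?P<status>\d{3}) (?P<size>\d+|-)'
)

# Hypothetical log lines using documentation-only IP addresses.
log_lines = [
    '192.0.2.1 - - [10/Oct/2020:13:55:36 +0000] "GET /index.html HTTP/1.1" 200 2326',
    '192.0.2.7 - - [10/Oct/2020:13:55:37 +0000] "GET /missing HTTP/1.1" 404 153',
    '192.0.2.1 - - [10/Oct/2020:13:55:39 +0000] "POST /login HTTP/1.1" 200 512',
]

# Each matching line becomes one record with fixed fields, like a table row.
records = []
for line in log_lines:
    match = LOG_PATTERN.match(line)
    if match:
        records.append(match.groupdict())

status_counts = Counter(record["status"] for record in records)
print(status_counts)
```

Because the machine writes every line in the same layout, one pattern recovers all the fields; the difficulty with machine data is its volume and speed, not its shape.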
Graph-based or network data
• “Graph” in this case points to mathematical graph theory.
• In graph theory, a graph is a mathematical structure to model pair-wise
relationships between objects.
• Graph or network data is, in short, data that focuses on the relationship or
adjacency of objects.
• The graph structures use nodes, edges, and properties to represent and store
graphical data.
• Graph-based data is a natural way to represent social networks, and its structure
allows you to calculate specific metrics such as the influence of a person and the
shortest path between two people.
• Examples of graph-based data can be found on many social media websites.
• On LinkedIn you can see who you know at which company.
• Your follower list on Twitter is another example of graph-based data.
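The two metrics mentioned above can be sketched on a tiny friendship graph. The edges are hypothetical, the friend count is only a crude stand-in for "influence", and breadth-first search is one common way (not the only one) to find a shortest path in an unweighted graph.

```python
from collections import defaultdict, deque

# Hypothetical friendships; each pair is an undirected edge between two nodes.
edges = [
    ("Lisa", "Hitesh"),
    ("Hitesh", "Somya"),
    ("Somya", "Ritesh"),
    ("Lisa", "Pallavi"),
    ("Pallavi", "Ritesh"),
    ("Tarun", "Dinesh"),
]

# Adjacency list: node -> list of neighbours.
graph = defaultdict(list)
for a, b in edges:
    graph[a].append(b)
    graph[b].append(a)


def shortest_path(graph, start, goal):
    """Breadth-first search; returns a shortest path as a list of nodes, or None."""
    queue = deque([[start]])
    visited = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for neighbour in graph.get(node, []):
            if neighbour not in visited:
                visited.add(neighbour)
                queue.append(path + [neighbour])
    return None


# Number of direct friends per person, a very rough influence measure.
degree = {person: len(friends) for person, friends in graph.items()}

print(shortest_path(graph, "Lisa", "Ritesh"))
print(shortest_path(graph, "Lisa", "Tarun"))
print(degree)
```

Both questions are answered by following relationships rather than by filtering rows, which is why graph data gets its own tools.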
• The power and sophistication comes from multiple, overlapping
graphs of the same nodes.
• For example, imagine the connecting edges here to show “friends” on
Facebook.
• Imagine another graph with the same people which connects
business colleagues via LinkedIn.
• Imagine a third graph based on movie interests on Netflix.
• Overlapping the three different-looking graphs makes more
interesting questions possible
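Overlapping graphs over the same people can be sketched with plain sets of edges. All three edge lists are hypothetical, and the variable names are chosen only for this example.

```python
# Each graph is a set of undirected edges; frozenset makes (a, b) equal to (b, a).
friends = {
    frozenset(pair)
    for pair in [("Lisa", "Hitesh"), ("Lisa", "Pallavi"), ("Hitesh", "Somya")]
}
colleagues = {
    frozenset(pair)
    for pair in [("Lisa", "Hitesh"), ("Somya", "Ritesh")]
}
shared_movie_interest = {
    frozenset(pair)
    for pair in [("Lisa", "Hitesh"), ("Lisa", "Pallavi")]
}

# Pairs connected in all three graphs.
connected_everywhere = friends & colleagues & shared_movie_interest

# Pairs who are friends but do not work together.
friends_not_colleagues = friends - colleagues

print(connected_everywhere)
print(friends_not_colleagues)
```

Questions such as "which friends are also colleagues with similar taste in movies?" only become possible once the graphs are laid over each other.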
Figure: Friends in a social network are an example of graph-based data
Node labels in the figure: Harsha, Ronit, Lisa, Hitesh, Rahul, Somya, Pallavi, Ritesh, Rahul, Masaum, Linda, Tarun, Chandani, Dinesh, Saren, Rekha
Graph Databases
• Graph databases are used to store graph-based data and are queried
with specialized query languages such as SPARQL.
• Graph data poses its challenges, but for a computer, interpreting
audio and image data can be even more difficult.
Audio, image, and video
• Audio, image, and video are data types that pose specific challenges
to a data scientist.
• Tasks that are trivial for humans, such as recognizing objects in
pictures, turn out to be challenging for computers.
• MLBAM (Major League Baseball Advanced Media) announced in
2014 that they’ll increase video capture to approximately 7 TB per
game for the purpose of live, in-game analytics.
• High-speed cameras at stadiums will capture ball and athlete
movements to calculate in real time, for example, the path taken by a
defender relative to two baselines.
• Recently a company called DeepMind succeeded at creating an
algorithm that’s capable of learning how to play video games.
• This algorithm takes the video screen as input and learns to interpret
everything via a complex process of deep learning.
• It’s a remarkable feat that prompted Google to buy the company for
their own Artificial Intelligence (AI) development plans.
• The learning algorithm takes in data as it’s produced by the computer
game; it’s streaming data.
Streaming Data
• While streaming data can take almost any of the previous forms, it
has an extra property.
• The data flows into the system when an event happens instead of
being loaded into a data store in a batch.
• Although this isn’t really a different type of data, we treat it here as
such because you need to adapt your process to deal with this type of
information.
• Examples are the “What’s trending” on Twitter, live sporting or music
events, and the stock market.
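The difference between event-by-event and batch processing can be sketched with a Python generator. The hashtag feed is a hypothetical stand-in for a live source, and the running count is a simplified take on a "what's trending" list.

```python
from collections import Counter


def event_stream():
    """Hypothetical stand-in for a live feed that delivers one event at a time."""
    for tag in ["#cricket", "#music", "#cricket", "#stocks", "#cricket", "#music"]:
        yield tag


def trending(stream, top_n=2):
    """Update the counts as each event arrives and yield the current top tags."""
    counts = Counter()
    for tag in stream:
        counts[tag] += 1
        yield counts.most_common(top_n)


# A result is available after every event, not only once all data is loaded.
snapshots = list(trending(event_stream()))
print(snapshots[0])
print(snapshots[-1])
```

The process never needs the full data set in a store; it keeps only a small running state and adapts as events flow in.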
Summary
• Data science: the field of finding answers to questions
• Analyzing stored data for meaningful insight
• Data science vs. data analysis vs. data analytics
• Facets of data