Unstructured Data
ISP610 BUSINESS DATA ANALYTICS
Prepared by: Ruhaila Maskat (PhD)
References:
Wikipedia
SearchDataManagement
3pillarglobal
October 18 1
By the end of this lesson, you should know:
• What NoSQL databases are.
• How they differ from SQL databases.
• Types of NoSQL databases.
NoSQL databases
• Non-SQL, non-relational or "not only SQL".
• Store and retrieve data that is not modelled in rows and columns.
• "Not only SQL": may support SQL-like query languages.
Applications of NoSQL databases
• NoSQL distributed database infrastructure has been used to handle
some of the biggest data warehouses on the planet, e.g. those of
Google, Amazon, and the CIA.
• Airbus
http://medianetwork.oracle.com/video/player/4662924811001
NoSQL vs SQL
NoSQL
1. Non-relational model.
2. Stores data in JSON, key/value, graphs, columns.
3. New properties can be added on the fly.
4. Good for semi-structured, complex or nested data.
5. Relationships are captured by denormalizing data and presenting all data for an object in a single record.
6. Dynamic/flexible schema.
SQL
1. Relational model.
2. Stores data in a table.
3. Adding a new property may require altering schemas.
4. Good for structured data.
5. Relationships are captured in normalised model using joins to resolve references across tables.
6. Strict schema.
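The contrast between a strict and a flexible schema can be sketched in a few lines. This is only an illustration: the relational side uses Python's built-in sqlite3 module, the NoSQL side is imitated with a plain dict, and the table name, user and city are made up for the example.

```python
import sqlite3

# Relational side: the table has a fixed schema.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO users (id, name) VALUES (1, 'Mary')")

# Adding a new property means altering the schema first.
conn.execute("ALTER TABLE users ADD COLUMN city TEXT")
conn.execute("UPDATE users SET city = 'Shah Alam' WHERE id = 1")
sql_row = conn.execute("SELECT id, name, city FROM users").fetchone()

# NoSQL side (imitated with a dict): each record carries its own properties.
users = {1: {"name": "Mary"}}
users[1]["city"] = "Shah Alam"  # new property added on the fly, no schema change

print(sql_row)
print(users[1])
```

In the relational case every row of the table gains the new column; in the dict-based sketch only the record that needs the property carries it.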
Case study: Building a social media website
• Users can post articles with related media such as pictures, videos, or
even music.
• Users can comment on posts and give points for ratings.
• Users can see a feed of posts.
• Users can interact with the main website.
Relational model
NoSQL model
In general:
• One query.
• No JOINS.
• No schema is maintained.
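For the social media case study, the difference between the two models might look roughly like the sketch below. It is not a real database: plain Python dicts and lists stand in for tables and documents, and all names, titles and values are invented for illustration.

```python
# Relational style: data is split across "tables" and linked by ids.
users = {1: {"name": "Ali"}, 2: {"name": "Abu"}}
posts = {10: {"author_id": 1, "title": "My first post"}}
comments = [
    {"post_id": 10, "author_id": 2, "text": "Nice!", "points": 5},
]

def load_post_relational(post_id):
    """Rebuild one post by resolving references across three tables (joins)."""
    post = posts[post_id]
    return {
        "title": post["title"],
        "author": users[post["author_id"]]["name"],
        "comments": [
            {"author": users[c["author_id"]]["name"],
             "text": c["text"],
             "points": c["points"]}
            for c in comments
            if c["post_id"] == post_id
        ],
    }

# NoSQL (document) style: everything about the post lives in one record.
post_documents = {
    10: {
        "title": "My first post",
        "author": "Ali",
        "comments": [{"author": "Abu", "text": "Nice!", "points": 5}],
    }
}

def load_post_document(post_id):
    """One lookup, no joins."""
    return post_documents[post_id]

print(load_post_relational(10) == load_post_document(10))
```

The price of the single lookup is duplication: the author's name is copied into every post and comment instead of being stored once.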
SQL NoSQL
Types of NoSQL databases
• Key-value
• Column / BigTable
• Document
• Graph
Key-value database
• The most basic type and the backbone of NoSQL implementations.
• Built on a hash table in which a unique key points to a specific
item of data.
• Works by matching keys with values, like a dictionary.
• Given a key (e.g. the_answer_to_life), it returns the matching value
(e.g. 24).
• Database is a global collection of key-value pairs.
• As the volume of data increases, maintaining unique values as keys
may become more difficult.
• Examples: Riak, Amazon S3 (Dynamo), Oracle NoSQL.
Example data
Storage
• Any reads and writes of values use the key.
• Key can be synthetic or auto-generated.
• Value can be String, JSON, BLOB etc.
Basic reading and writing
• Get(key), returns the value associated with the provided key.
• Put(key, value), associates the value with the key.
• Multi-get(key1, key2, .., keyN), returns the list of values associated
with the list of keys.
• Delete(key), removes the entry for the key from the data store.
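The four operations can be imitated with a toy in-memory store. This is a minimal sketch, assuming a Python dict as the underlying hash table; the class name KeyValueStore and the key "user:1" are invented for illustration and do not belong to any real product's API.

```python
class KeyValueStore:
    """A toy in-memory key-value store backed by a dict (a hash table)."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        """Associate the value with the key, overwriting any old value."""
        self._data[key] = value

    def get(self, key):
        """Return the value for the key, or None if the key is absent."""
        return self._data.get(key)

    def multi_get(self, *keys):
        """Return the list of values for the given keys, in the same order."""
        return [self._data.get(key) for key in keys]

    def delete(self, key):
        """Remove the entry for the key; do nothing if it is absent."""
        self._data.pop(key, None)


store = KeyValueStore()
store.put("the_answer_to_life", 24)
store.put("user:1", {"name": "Mary"})  # the value can be any object
print(store.get("the_answer_to_life"))
print(store.multi_get("the_answer_to_life", "user:1"))
store.delete("user:1")
print(store.get("user:1"))
```

Every operation goes through the key; the store never looks inside the value.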
Column/BigTable
• Extend the simple key/value model.
• Do not require a pre-structured table to work with the data.
• Work by creating collections of one or more key/value pairs.
• Two-dimensional arrays in which each key has one or more key/value
pairs attached to it.
• Two groups: column-store and column-family store.
• Column-family store: Bigtable, HBase, Hypertable, and Cassandra.
• Column-store: Sybase IQ, C-store, Vertica, VectorWise, MonetDB,
ParAccel and Infobright.
[Figure: column-store, position-based; label: KEY]
[Figure: column-store, rowid-based; labels: KEY, VALUE]
Column-family
• The outermost keys 3PillarNoida, 3PillarCluj, 3PillarTimisoara and
3PillarFairfax are analogous to rows.
• ‘address’ and ‘details’ are called column families.
• The column family ‘address’ has columns ‘city’ and ‘pincode’.
• The column family ‘details’ has columns ‘strength’ and ‘projects’.
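A column-family layout like this can be pictured as nested dictionaries: row key, then column family, then column, then value. The sketch below only illustrates that shape. Every value in it is a made-up placeholder, only two of the four rows are shown, and the helper read_column is hypothetical rather than an API of any real store.

```python
# Row key -> column family -> column -> value.
# All values are placeholders invented for this sketch.
offices = {
    "3PillarNoida": {
        "address": {"city": "Noida", "pincode": "000000"},
        "details": {"strength": 100, "projects": 10},
    },
    "3PillarCluj": {
        # This row stores fewer columns: no pre-structured table is required.
        "address": {"city": "Cluj"},
        "details": {"strength": 50},
    },
}

def read_column(row_key, family, column):
    """Return one cell, or None if the row, family or column is absent."""
    return offices.get(row_key, {}).get(family, {}).get(column)

print(read_column("3PillarNoida", "address", "city"))
print(read_column("3PillarCluj", "address", "pincode"))
```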
Document database
• A collection of key-value pairs in which the stored values (referred to
as “documents”) carry some structure and encoding of the managed data,
e.g. XML, JSON, BSON. The unique key is a simple identifier (string,
URI, path).
• Embeds attribute metadata associated with the content, which provides
a way to query data based on its contents. An API is used to retrieve
data based on content, and also allows editing of content and metadata.
• While a key-value store requires the key to access a value, a document
store has metadata that allows an attribute to be accessed directly
instead of only through a key.
• Examples: CouchDB, MongoDB.
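The difference between access by key and access by content can be sketched as follows. This is an in-memory imitation, not the API of any real document database: the find helper, the key "book:978" and the second document are assumptions made for the example.

```python
# Key-value store: the value is opaque, so lookup is possible by key only.
kv = {"book:978": '{"Title": "The Linux Command Line", "Author": "William Shotts"}'}

# Document store: the structure of each value is visible to the store.
documents = [
    {"_id": 978, "Title": "The Linux Command Line", "Author": "William Shotts"},
    {"_id": 979, "Title": "Data Science", "Author": "William Jackson"},
]

def find(collection, **criteria):
    """Return the documents whose attributes match every name=value pair."""
    return [
        doc for doc in collection
        if all(doc.get(name) == value for name, value in criteria.items())
    ]

by_key = kv["book:978"]                               # must know the key
by_author = find(documents, Author="William Shotts")  # query on content
print(by_author)
```

A real document database would normally use indexes for such queries instead of scanning every document as this sketch does.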
Document database
• Document is the most basic unit of data.
• Documents are ordered sets of key-value pairs.
• Each document contains one or more name-value pairs.
Example (Document 1; _id is the key, the other entries are name-value pairs):
{
  _id : 978,
  "Title" : "The Linux Command Line",
  "Author" : "William Shotts"
}
Documents are gathered together in collections within the database.
Collections should group documents that belong together, e.g. books, webstore, retail store, fruits.
A document database does not enforce a fixed structure on the documents in a collection; it is schemaless.
Since we are so used to relational db…
Relational database NoSQL document database
Since we are so used to relational db…
Relational Databases    Document Databases
Databases               Databases or Buckets
Tables                  Collections or Type Signifiers
Rows                    Documents
Columns                 Attributes/Names
Index                   Index
Document database
• Documents with different schemas can reside in the same collection.
Example (one collection holding two documents):
Document 1:
{
  _id : 1,
  "ISBN" : "978",
  "Title" : "The Linux Command Line"
}
Document 2:
{
  _id : 2,
  "ASIN" : "B00J",
  "Item" : "Cherry Barbeque Sauce"
}
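Because no schema is enforced, code that reads a mixed collection has to check which attributes each document actually has. A minimal sketch, using a Python list as a stand-in for the collection; the helper with_attribute is hypothetical.

```python
# One collection, two documents with different attributes.
collection = [
    {"_id": 1, "ISBN": "978", "Title": "The Linux Command Line"},
    {"_id": 2, "ASIN": "B00J", "Item": "Cherry Barbeque Sauce"},
]

def with_attribute(docs, name):
    """Return only the documents that carry the given attribute."""
    return [doc for doc in docs if name in doc]

books = with_attribute(collection, "ISBN")

# The set of attribute names is discovered from the data, not from a schema.
attribute_names = sorted({name for doc in collection for name in doc})

print(books)
print(attribute_names)
```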
Document database
• Documents can have a more complicated structure.
Example (the "Author" attribute holds a list of values):
{
  _id : "978",
  "Title" : "Data Science",
  "Author" : ["William Jackson", "Ben Ten"]
}
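A list-valued attribute can be stored and queried without any extra table. The sketch below keeps the document as a Python dict and round-trips it through JSON using the standard json module; the helper has_author is an assumption made for illustration.

```python
import json

book = {
    "_id": "978",
    "Title": "Data Science",
    "Author": ["William Jackson", "Ben Ten"],  # a list of values
}

def has_author(doc, name):
    """True if name is the author or one of the authors of the document."""
    authors = doc.get("Author", [])
    if isinstance(authors, str):  # tolerate a single author stored as a string
        authors = [authors]
    return name in authors

encoded = json.dumps(book)    # text form, as it might be stored or sent
decoded = json.loads(encoded)

print(has_author(decoded, "Ben Ten"))
```

In a relational model the same data would typically need a separate author table and a join.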
Use cases
• Event logging
• Blogs and website content management
• Web analytics or real-time analytics
• E-commerce applications e.g. shopping cart.
Graph database
• Use graph structures with edges, nodes and properties.
• Nodes are organised based on their relationships with one another.
• These relationships are represented by edges between the nodes.
• Relationships define how nodes are connected, e.g. social connections.
• Both nodes and relationships have defined properties.
• Example: Neo4j.
Use cases
• People who like this product usually like that product.
• Mary is friends with George. George likes pizza. George has visited
Japan. Thus, we can ask: which friends of Mary’s friends like the food
that Mary’s friend likes but have not visited the place that Mary’s
friend has visited?
• You are more likely to become friends with Abu because you know Ali
and Abu is Ali’s friend.
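The Mary and George question can be answered by walking edges. The sketch below stores the graph as (source, relationship, target) triples in plain Python. It is not Neo4j or any graph query language: the relationship names, the helper functions and the extra facts about Ali and Abu are assumptions added to make the example runnable.

```python
# Edges as (source, relationship, target) triples.
edges = [
    ("Mary", "FRIEND", "George"),
    ("George", "LIKES", "pizza"),
    ("George", "VISITED", "Japan"),
    # Hypothetical extra facts, added only for this sketch:
    ("George", "FRIEND", "Ali"),
    ("George", "FRIEND", "Abu"),
    ("Ali", "LIKES", "pizza"),
    ("Abu", "LIKES", "pizza"),
    ("Abu", "VISITED", "Japan"),
]

def neighbours(node, relationship):
    """Nodes linked to node by the relationship; FRIEND is treated as mutual."""
    found = {t for s, r, t in edges if s == node and r == relationship}
    if relationship == "FRIEND":
        found |= {s for s, r, t in edges if t == node and r == relationship}
    return found

def suggestions(person):
    """Friends of friends who share a liked food with the friend but have
    visited none of the places that the friend has visited."""
    result = set()
    for friend in neighbours(person, "FRIEND"):
        foods = neighbours(friend, "LIKES")
        places = neighbours(friend, "VISITED")
        for candidate in neighbours(friend, "FRIEND") - {person}:
            shares_food = bool(neighbours(candidate, "LIKES") & foods)
            shares_place = bool(neighbours(candidate, "VISITED") & places)
            if shares_food and not shares_place:
                result.add(candidate)
    return result

print(suggestions("Mary"))
```

A graph database stores such edges directly, so each hop follows a relationship instead of performing a join.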
Graph database