NoSQL 2024 Assign2
NoSQL
These types of data stores may not require a fixed schema, avoid
join operations, and typically scale horizontally.
RDBMS
• Tight Consistency
Impedance Mismatch
The emergence of NoSQL
• No predefined schema
• CAP Theorem
SQL vs NoSQL
• NoSQL databases are commonly grouped into four categories of databases.
• Each of these categories has its own specific attributes and limitations.
• There is no single solution that is better than all the others; however,
some databases are better suited to solving specific problems.
• Key-value stores
• Column-oriented
• Document oriented
• Graph database
Key-value stores
• Key-value stores are the most basic type of NoSQL database.
• In key-value storage, the database stores data as a hash table where each
key is unique and the value can be a string, JSON, etc.
• For example a key-value pair might consist of a key like "Name" that is
associated with a value like "Robin".
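As a minimal sketch of this hash-table view (plain JavaScript notation; the "user:1" key and its JSON value are illustrative):

// a key-value store behaves like a hash table: unique keys, opaque values
const store = {};
store["Name"] = "Robin";                               // plain string value
store["user:1"] = { name: "Robin", city: "Chicago" };  // JSON value (illustrative)
store["Name"];                                         // -> "Robin"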
Column-oriented
• However, there are many scenarios where writes are rare, but we
often need to read a few columns of many rows at once.
• In this situation, it's better to store groups of columns for all rows
as the basic storage unit, which is why these databases are called
column stores.
Document oriented
• A collection of documents.
• A document is a key-value collection where the key allows access to its value.
• Documents are not typically forced to have a schema and therefore are
flexible and easy to change.
Graph database
• In the figure we have a web of information whose nodes are very small (nothing
more than a name) but there is a rich structure of interconnections between
them.
• With this structure, we can ask questions such as "find the books in the
Databases category that are written by someone whom a friend of mine
likes."
Brewer’s CAP Theorem
• The theorem states that within a large-scale distributed data system, there
are three requirements that have a relationship of sliding dependency:
Consistency, Availability, and Partition Tolerance.
• Consistency : All database clients will read the same value for the same
query, even given concurrent updates.
• Availability : All database clients will always be able to read and write
data.
• Partition Tolerance : The database can be split over multiple machines and
can continue functioning in the face of network segmentation breaks.
• The more consistency we demand from our system, for example, the less
partition-tolerant we are likely to be able to make it, unless we make some
concessions around availability.
• In distributed systems, however, it is very likely that we will have network
partitioning, and that at some point, machines will fail and cause others to
become unreachable.
• This leads us to the conclusion that a distributed system must do its best to
continue operating in the face of network partitions (to be Partition-
Tolerant), leaving us with only two real options to choose from: Availability
and Consistency.
• Figure: The CAP Theorem indicates that we can realize only two of these properties at once.
• Figure shows the general focus of some of the different databases.
• Graph databases such as Neo4J and the set of databases derived at least in
part from the design of Google’s Bigtable database (such as MongoDB,
HBase, Hypertable, and Redis) all are focused slightly less on Availability and
more on ensuring Consistency and Partition Tolerance.
• However, this does not mean that they dismiss Availability as unimportant.
• According to the Bigtable paper, the average percentage of server hours that
“some data” was unavailable is 0.0047%.
CP :
• Some data may not be accessible, but the rest is still consistent/accurate.
AP :
• The system is still available under partitioning, but some of the data returned
may be inaccurate.
How to Choose the Right NoSQL
Database for Your Application?
https://www.dataversity.net/choose-right-nosql-database-application/
Aggregate data models
• A data model is the model through which we perceive and manipulate our
data.
• For people using a database, the data model describes how we interact with
the data in the database.
• In the relational data model, each table has rows, with each row representing
some entity of interest.
• A column may refer to another row in a different table, which constitutes a
relationship between those entities.
Aggregates
• The relational model takes the information that we want to store and divides
it into tuples (rows).
• We often want to operate on data in units that have a more complex structure
than a set of tuples.
• Key-value, document, and column-family databases all make use of this more
complex record.
• However, there is no common term for this complex record; here we use the
term "aggregate."
• Aggregates are also often easier for application programmers to work with,
since they often manipulate data through aggregate structures.
Example of Relations and Aggregates
• Consider an example of building an e-commerce website;
• we are going to be selling items directly to customers over the web, and we
will have to store information about users, our product catalog, orders,
shipping addresses, billing addresses, and payment data.
• We can use this scenario to model the data using a relational data store as
well as NoSQL data stores and talk about their pros and cons.
For a relational database, we might start with a data model shown in Figure 2.1.
Figure 2.2 presents some sample data for this model
Figure 2.3 shows how the model might look when we think in more aggregate-oriented terms.
• Below is the sample data, shown in JSON format as a common representation for
data in NoSQL land.
// in customers
{
  "id": 1,
  "name": "Martin",
  "billingAddress": [ { "city": "Chicago" } ]
}

// in orders
{
  "id": 99,
  "customerId": 1,
  "orderItems": [
    { "productId": 27, "price": 32.45, "productName": "NoSQL Distilled" }
  ],
  "shippingAddress": [ { "city": "Chicago" } ],
  "orderPayment": [
    {
      "ccinfo": "1000-1000-1000-1000",
      "txnId": "abelif879rft",
      "billingAddress": { "city": "Chicago" }
    }
  ]
}
• In this model, we have two main aggregates: customer and order.
• The black-diamond composition marker in UML shows how data fits into the
aggregation structure.
• The customer contains a list of billing addresses; the order contains a list of
order items, a shipping address, and payments.
• A single logical address record appears three times in the example data, but
instead of using IDs it’s treated as a value and copied each time.
• With aggregates, we can copy the whole address structure into the
aggregate as we need to.
• Indeed we could draw our aggregate boundaries differently, putting all the
orders for a customer into the customer aggregate (Figure 2.4).
• Using the above data model, an example Customer and Order would look
like this:
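A sketch of the combined aggregate (field layout assumed from Figure 2.4), reusing the sample data from above with the orders nested inside the customer:

// in customers, with orders embedded in the customer aggregate
{
  "id": 1,
  "name": "Martin",
  "billingAddress": [ { "city": "Chicago" } ],
  "orders": [
    {
      "id": 99,
      "orderItems": [
        { "productId": 27, "price": 32.45, "productName": "NoSQL Distilled" }
      ],
      "shippingAddress": [ { "city": "Chicago" } ],
      "orderPayment": [
        {
          "ccinfo": "1000-1000-1000-1000",
          "txnId": "abelif879rft",
          "billingAddress": { "city": "Chicago" }
        }
      ]
    }
  ]
}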
• Like most things in modeling, there’s no universal answer for how to draw our
aggregate boundaries.
• Depending on our distribution model, we can get a data store that will give us
– the ability to handle larger quantities of data,
– the ability to process greater read or write traffic, or
– more availability in the face of network slowdowns or breakages.
• Broadly, there are two paths to data distribution: replication and sharding.
• Replication takes the same data and copies it over multiple nodes.
• In the ideal case, we have different users all talking to different server nodes.
• Each user only has to talk to one server, so gets rapid responses from that
server.
• The load is balanced out nicely between servers—for example, if we have ten
servers, each one only has to handle 10% of the load.
Sharding
• With sharding, we have to ensure that data that's accessed together is clumped
together on the same node and that these clumps are arranged on the nodes to
provide the best data access.
• The first part of this problem is how to clump the data up so that one user
mostly gets her data from a single server. This is where aggregate
orientation comes in really handy.
• When it comes to arranging the data on the nodes, there are several factors
that can help improve performance.
• If we have orders for someone who lives in Boston, we can place that data in
our eastern US data center.
• Sharding does little to improve resilience when used alone.
• Although the data is on different nodes, a node failure makes that shard’s
data unavailable just as surely as it does for a single-server solution.
• The resilience benefit it does provide is that only the users of the data on
that shard will suffer; however, it’s not good to have a database with part of
its data missing.
Master-Slave Replication
• The master is the authoritative source for the data and is usually
responsible for processing any updates to that data.
• We can scale horizontally to handle more read requests by adding more slave
nodes and ensuring that all read requests are routed to the slaves.
• We are still, however, limited by the ability of the master to process updates
and its ability to pass those updates on.
• Consequently, it isn't such a good scheme for datasets with heavy write traffic.
• A second advantage of master-slave replication is read resilience: Should the
master fail, the slaves can still handle read requests.
• The failure of the master does eliminate the ability to handle writes until
either the master is restored or a new master is appointed.
• Masters can be appointed manually or automatically.
Peer-to-Peer Replication
• All the replicas have equal weight, they can all accept writes, and the loss
of any of them doesn't prevent access to the data store.
• With a peer-to-peer replication cluster, we can ride over node failures
without losing access to data.
• When we can write to two different places, we run the risk that two people
will attempt to update the same record at the same time—a write-write
conflict.
• Inconsistencies on read lead to problems but at least they are relatively
transient. Inconsistent writes are forever.
Combining Sharding and Replication
• Replication and sharding are strategies that can be combined.
• If we use both master-slave replication and sharding (see Figure 4.4), this
means that we have multiple masters, but each data item only has a single
master.
• Should a node fail, then the shards on that node will be built on the other
nodes (see Figure 4.5).
MongoDB
• MongoDB is an open-source document database and a leading NoSQL database.
• Each database gets its own set of files on the file system.
Installing the Library
Using Composer (https://getcomposer.org/)
The preferred method of installing the MongoDB PHP Library is with Composer by
running the following command from your project root:
composer require mongodb/mongodb
(run on cmd)
<?php
// Load Composer's autoloader for the MongoDB PHP Library
require 'vendor/autoload.php';

// Connect to the default local MongoDB server (mongodb://127.0.0.1:27017)
$client = new MongoDB\Client;

// Select (and implicitly create) the companydb database
$companydb = $client->companydb;

// Explicitly create the empcollection collection
$result1 = $companydb->createCollection('empcollection');
var_dump($result1);
?>
The following example shows the document structure of a blog site, which is simply
comma-separated key-value pairs.
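A sketch of such a blog-post document (the field names anticipate the blog requirements listed below; all values are illustrative):

{
   _id: ObjectId("5f1a2b3c4d5e6f7a8b9c0d1e"),   // illustrative ObjectId
   title: 'MongoDB Overview',
   description: 'MongoDB is a NoSQL database',
   by: 'tutorials point',
   url: 'http://www.tutorialspoint.com',
   tags: ['mongodb', 'database', 'NoSQL'],
   likes: 100,
   comments: [
      {
         user: 'user1',
         message: 'My first comment',
         dateCreated: new Date(),
         likes: 0
      }
   ]
}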
• _id is a 12-byte value (displayed as a 24-character hexadecimal string) which
assures the uniqueness of every document.
• Of these 12 bytes, the first 4 bytes are the current timestamp, the next 3 bytes
a machine id, the next 2 bytes the process id of the MongoDB server, and the
remaining 3 bytes a simple incremental value.
MongoDB Help
MongoDB Statistics
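As a quick sketch of these two commands in the mongo shell:

db.help()    // lists help for database-level methods
db.stats()   // prints statistics (collections, data size, indexes, ...) for the current database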
• Suppose a client needs a database design for his blog/website; let us see the
differences between RDBMS and MongoDB schema design. The website has
the following requirements.
• Every post has the name of its publisher and the total number of likes.
• Every post has comments given by users along with their name, message,
date-time and likes.
In MongoDB, the default database is test. If you didn't create any database, then
collections will be stored in the test database.
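A short sketch of switching databases in the mongo shell (mydb is a hypothetical name):

use mydb      // switches to (and implicitly creates) mydb
db            // prints the name of the current database
show dbs      // lists databases; mydb appears here once it contains data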
The dropDatabase() Method
The createCollection() Method
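Minimal sketches of both methods (the database and collection names are illustrative):

use mydb
db.dropDatabase()                    // drops the current database (mydb)

db.createCollection("mycollection")  // explicitly creates an empty collection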
>db.mitcollection.insert({"name" : "ICT"})
>show collections
mycol
mycollection
system.indexes
mitcollection
>
The drop() Method
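A sketch of dropping a collection (the name mitcollection is taken from the session above):

db.mitcollection.drop()   // returns true if the collection existed and was dropped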
MongoDB - Datatypes
• String
• Integer
• Boolean
• Double
• Arrays
• Timestamp
• Object.
And more
The insert() Method
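A sketch of inserting a document into the post collection used in the examples below:

db.post.insert({
   title: 'MongoDB Overview',
   by: 'tutorials point',
   likes: 100,
   tags: ['mongodb', 'NoSQL']
})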
MongoDB - Query Document
The following example will show the documents that have likes greater than 10
and whose title is either 'MongoDB Overview' or whose by is 'tutorials point'.
The equivalent SQL where clause is 'where likes > 10 AND (by = 'tutorials point' OR
title = 'MongoDB Overview')'.
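A sketch of this query in the mongo shell, using the $gt and $or operators:

db.post.find({
   likes: { $gt: 10 },
   $or: [
      { by: 'tutorials point' },
      { title: 'MongoDB Overview' }
   ]
}).pretty()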
Update Document
MongoDB's update() and save() methods are used to update documents in a
collection.
The update() method updates the values in the existing document, while the
save() method replaces the existing document with the document passed to the
save() method.
db.post.updateOne(
   { _id: ObjectId('65a9ea971220ffbb76de9657') },
   { $set: { title: 'Developer\'s hub', topic: ['MongoDB Atlas', 'MongoDB Compass'] } },
   { upsert: true }
)

db.post.updateOne(
   { _id: ObjectId('65a9ea971220ffbb76de9657') },
   { $push: { tags: 'not structured' } }
)
• The updateOne() method accepts a filter document, an update document, and
an optional options object. MongoDB provides update operators and options
to help you update documents.
• The $set operator replaces the value of a field with the specified value.
• The upsert option creates a new document if no documents match the filter
criteria.
• The $push operator adds a new value to the tags array field.
• db.post.updateMany({},{$set:{application:["Se
rverless dev","Edge Computing","AI","IOT"]}})
Delete Document
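Sketches of the deleteOne() and deleteMany() methods (the filters are illustrative):

db.post.deleteOne({ title: 'MongoDB Overview' })   // removes the first matching document
db.post.deleteMany({ likes: { $lt: 10 } })         // removes every matching document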
Projection
In MongoDB, projection means selecting only the necessary fields rather than
the whole document. If a document has 5 fields and you need to show only 3,
then select only those 3 fields.
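A sketch that returns only the title and by fields of every post, suppressing _id:

db.post.find({}, { title: 1, by: 1, _id: 0 })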
The sort() Method
To specify the sorting order, 1 and -1 are used: 1 for ascending order and -1 for
descending order.
Please note, if you don't specify a sorting preference, the sort() method will
display the documents in ascending order.
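A sketch of sorting posts by title in descending order:

db.post.find().sort({ title: -1 })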
Count Documents
• db.collection.countDocuments( <query>, <options> )
• db.post.countDocuments({ likes: { $gt: 25 } })
Aggregation Pipeline
• Aggregation: Collection and summary of data
• Stage: One of the built-in methods that can be
completed on the data, but does not
permanently alter it
• Aggregation pipeline: A series of stages
completed on the data in order
db.collection.aggregate([
   {
      $stage1: { expression1, expression2, ... }
   },
   {
      $stage2: { expression1, ... }
   }
])
• $match and $group aggregation
• The $match stage filters for documents that
match specified conditions
• The $group stage groups documents by a
group key
{
   $match: {
      "field_name": "value"
   }
}

{
   $group: {
      _id: <expression>,                          // Group key
      <field>: { <accumulator> : <expression> }
   }
}
db.post.aggregate([
   { $match: { by: 'tutorials point' } },
   { $group: { _id: null, "totallikes": { $sum: "$likes" } } }
])

// result:
{ _id: null, totallikes: 440 }
db.post.aggregate([
   { $match: { by: 'tutorials point' } },
   { $group: { _id: "$title", "totallikes": { $sum: "$likes" } } }
])
Sort and limit
• The $sort stage sorts all input documents and
returns them to the pipeline in sorted order.
Use 1 to represent ascending order, and -1 to
represent descending order.
• The $limit stage returns only a specified
number of records.
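A sketch combining both stages to return the three most-liked posts:

db.post.aggregate([
   { $sort: { likes: -1 } },   // most likes first
   { $limit: 3 }               // keep only the top three
])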
$merge
• Merges the output with a specified collection
• The $merge stage provides more flexibility: it can merge the results of the
aggregation with an existing collection.
• It allows specifying how the merging should occur, with options like
overwriting existing documents, merging them, or keeping the existing ones
if there's a conflict.
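A minimal sketch, assuming a hypothetical likes_by_author collection as the merge target ($merge must be the last stage of the pipeline):

db.post.aggregate([
   { $group: { _id: "$by", totallikes: { $sum: "$likes" } } },
   {
      $merge: {
         into: "likes_by_author",    // target collection (created if absent)
         whenMatched: "replace",     // overwrite documents with the same _id
         whenNotMatched: "insert"    // otherwise insert a new document
      }
   }
])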
$lookup and $map
• The $lookup stage adds a new array field to
each input document.
• $map applies an expression to each item in an array and returns an array
with the applied results.
db.posts.insertMany([
   { _id: 1, title: "The Joy of MongoDB", description: "Introduction to MongoDB",
     url: "http://example.com/mongodb", likes: 100, post_by: "Author1" },
   { _id: 2, title: "Aggregation Framework", description: "Deep Dive into Aggregation",
     url: "http://example.com/aggregation", likes: 150, post_by: "Author2" },
   { _id: 3, title: "Sharding Strategies", description: "How to shard effectively",
     url: "http://example.com/sharding", likes: 75, post_by: "Author3" }
]);
db.comments.insertMany([
   // Comments for post with _id: 1
   { comment_id: 1, post_id: 1, by_user: "User1", message: "Great post!",
     date_time: new Date(), likes: 5 },
   { comment_id: 2, post_id: 1, by_user: "User2", message: "Very informative!",
     date_time: new Date(), likes: 3 }
   // ... (remaining comments truncated in the original)
]);
Why Replication?
• Replication also allows you to recover from hardware failure and service
interruptions.
A replica set is a group of mongod instances that host the same data set.
In a replica set, one node is the primary node, which receives all write operations.
All other instances (the secondaries) apply operations from the primary so that
they have the same data set.
• In a replica set, one node is the primary node and the remaining nodes are
secondaries.
• After the recovery of a failed node, it rejoins the replica set and works as
a secondary node.
A typical diagram of MongoDB replication shows the client application always
interacting with the primary node; the primary node then replicates the data to the
secondary nodes.
Replica Set: Adding the First Member using rs.initiate()
Step 1) Ensure that all mongod.exe instances which will be added to the
replica set are installed on different servers.
Step 2) Ensure that all mongo.exe instances can connect to each other. From
ServerA, issue the below 2 commands
Step 3) Start the first mongod.exe instance with the replSet option.
This option provides a grouping for all servers which will be part of this replica
set.
mongod --replSet "Replica1"
where "Replica1" is the name of your replica set. You can choose any
meaningful name for your replica set.
Step 4) Now that the first server is added to the replica set, the next step is to
initiate the replica set by issuing the following command
rs.initiate()
Step 5) Verify the replica set by issuing the command rs.conf() to ensure the
replica set is set up properly.
Replica Set: Adding a Secondary using rs.add()
The secondary servers can be added to the replica set by just using the
rs.add command.
This command takes in the name of the secondary servers and adds the
servers to the replication set.
Suppose you have ServerA, ServerB, and ServerC, which are required to be
part of your replica set, and ServerA is defined as the primary server in the
replica set.
To add ServerB and ServerC to the replica set issue the commands
rs.add("ServerB")
rs.add("ServerC")
E.g.: rs.add("localhost:27018"); rs.add("localhost:27019")
MongoDB - Sharding
Sharding is the process of storing data records across multiple machines and
it is MongoDB's approach to meeting the demands of data growth.
As the size of the data increases, a single machine may not be sufficient to
store the data nor provide an acceptable read and write throughput.
With sharding, you add more machines to support data growth and the
demands of read and write operations.
Sharding in MongoDB
Three main components are
Shards :
This is the basic component: a MongoDB instance which holds a subset of the
data.
They provide high availability and data consistency.
In production environments, all shards need to be part of replica sets.
Config Servers :
This is a MongoDB instance which holds metadata about the cluster, basically
information about the various MongoDB instances which will hold the shard
data.
Query Routers :
This is a MongoDB instance (mongos) which interfaces with the client application
and routes queries to the appropriate shard, so the cluster appears as a single
database to the client.
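A minimal sketch of enabling sharding from a mongos shell, assuming a running sharded cluster and the hypothetical companydb.post namespace:

sh.enableSharding("companydb")                            // allow sharding for the database
sh.shardCollection("companydb.post", { _id: "hashed" })   // shard on a hashed _id key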
https://www.tutorialspoint.com/redis/redis_quick_guide.htm