Mongo Notes
1. Introduction to MongoDB
● What is MongoDB?
● Features of MongoDB
● Types of Databases (SQL vs NoSQL)
● Why does MongoDB use BSON?
● BSON Advantages
● Alternatives to MongoDB (Cassandra, Redis, DynamoDB, HBase, OrientDB)
● Collections
● Insert vs Save
● Update vs UpdateOne vs UpdateMany
● Delete Operations (DeleteOne, DeleteMany)
● Basic Query Operations (find, findOne)
● Cursors
● Admin Database
● How to List Collections
● How to Modify a Collection Name
4. Indexing in MongoDB
● What is Indexing?
● Single Field Index
● Compound Index
● Multi-Key Index
● Geospatial Index
● Text Index
● Covered Queries
● How to Create Indexes (db.collection.createIndex())
● Indexing Best Practices
● Clustered Index vs Non-Clustered Index
● Clustered Collections
5. Aggregation Framework
6. Data Modeling
9. Replication
● What is Replication?
● Primary and Secondary Replica Set
● How Many Nodes in a Replica Set?
● Voting in Replication
● Difference Between GridFS and Sharding
10. Sharding
● What is Sharding?
● Components of Sharding
● Query Routing in Sharding
● Advantages and Disadvantages of Sharding
● Sharding vs Replication
● Sharding Best Practices
● CAP Theorem
● Capped Collections
● How to Create a Capped Collection
11. GridFS
● What is GridFS?
● Difference Between GridFS and Sharding
● GridFS vs Traditional File Storage
● Transactions in MongoDB
● ACID Compliance
● Batch Sizing
● Upsert Operations
● Use Cases for Transactions
● CAP Theorem
● TTL (Time to Live)
● Data Redundancy
● Clustered Collections
● Materialized Views
● View collections
● Decrement Operations
● Alternatives to MongoDB
o Cassandra
o Redis
o DynamoDB
o HBase
o OrientDB
1. Introduction to MongoDB
● What is MongoDB?
MongoDB is a modern, open-source NoSQL database that handles lots of unstructured data.
Instead of using tables like traditional databases, it stores data in flexible, JSON-like
documents called BSON. This means you can easily change the structure of your data
without any issues. MongoDB is great for applications that need to process large amounts of
data quickly, like real-time analytics and big data projects. It’s also easy to use and works
well with modern development tools, making it a popular choice for developers.
● Features of MongoDB
MongoDB's key features include a flexible document data model, rich indexing, a powerful aggregation framework, replication for high availability, sharding for horizontal scaling, and support for ad-hoc queries.
● Types of Databases (SQL vs NoSQL)
SQL Databases
● Structured Data: SQL databases store data in tables with rows and columns, similar to a
spreadsheet.
● Fixed Schema: You need to define the structure of your data (schema) before you can store it.
● Relational: Data is organized in a way that allows relationships between different tables.
● ACID Compliance: Ensures reliable transactions with properties like Atomicity, Consistency,
Isolation, and Durability.
● Vertical Scalability: Typically scaled by increasing the power of a single server (e.g., adding more
CPU, RAM).
NoSQL Databases
● Flexible Data: NoSQL databases store data in various formats like documents, key-value pairs, graphs, or wide-columns.
● Schema-less: You don’t need to define the structure of your data in advance, allowing for more
flexibility.
● Non-Relational: Data is often stored without strict relationships, making it easier to handle
unstructured data.
● High Scalability: Designed to scale out horizontally by adding more servers.
● Eventual Consistency: Some NoSQL databases prioritize availability and partition tolerance over
immediate consistency.
Key Differences
● Structure: SQL uses structured tables, while NoSQL uses flexible formats.
● Schema: SQL requires a predefined schema; NoSQL does not.
● Scalability: SQL scales vertically; NoSQL scales horizontally.
● Use Cases: SQL is great for complex queries and transactions; NoSQL is ideal for large volumes of
unstructured data and real-time applications.
● Why does MongoDB use BSON?
“MongoDB uses BSON (Binary JSON) because it is a binary format that is more efficient for storage and retrieval, supports a wider range of data types, and allows for faster parsing and flexibility in representing complex data structures.”
1. BSON is a binary format, which means it can store data more compactly than plain
text JSON. This helps in saving storage space.
2. BSON is designed to be fast to encode and decode. This makes data retrieval and
storage operations quicker.
3. BSON supports more data types than JSON, such as dates and binary data. This allows MongoDB to handle a wider variety of data efficiently.
4. BSON is designed to be traversable, meaning MongoDB can easily navigate through
the data to perform operations like queries and indexing.
5. BSON maintains the order of keys in documents, which can be important for certain applications.
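The compactness point can be illustrated with a plain-Python sketch (not actual BSON, just the fixed-width-integer idea behind it):

```python
import json
import struct

doc = {"age": 1234567}

# JSON stores the number as text: one byte per digit character.
json_bytes = json.dumps(doc).encode("utf-8")

# A BSON-style binary encoding stores an int32 in exactly 4 bytes,
# and the fixed width also makes it trivial to skip over while parsing.
binary_value = struct.pack("<i", doc["age"])

print(len(json_bytes))    # length of the full text encoding
print(len(binary_value))  # always 4, regardless of how many digits
```

A fixed-width binary value never needs character-by-character parsing, which is why traversal and decoding are faster than with text JSON.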
● BSON Advantages
1. Efficiency: BSON is a binary format, which makes it faster to read and write compared to text-based
formats like JSON
2. Compactness: It generally results in smaller file sizes, saving storage space and improving
transmission speeds
3. Rich Data Types: BSON supports a wider range of data types, including dates and binary data,
which JSON does not
4. Speed: The binary encoding allows for quicker parsing and efficient data traversal
5. Flexibility: It supports nested documents and arrays, making it easier to represent complex data
structures
Disadvantages of BSON
1. Space Efficiency: While BSON is compact, it can sometimes be less space-efficient than JSON due
to additional metadata.
2. Human Readability: BSON is not human-readable, which can make debugging and manual data
inspection more challenging.
3. Complexity: The binary format can be more complex to work with compared to the simpler,
text-based JSON.
MongoDB supports a variety of data types to handle different kinds of information. Here are some of the key data types:
● String: UTF-8 text, the most common type.
● Integer / Long / Double / Decimal128: numeric types of different sizes and precision.
● Boolean: true or false.
● Date: stored as milliseconds since the Unix epoch.
● ObjectId: a unique 12-byte document identifier.
● Array: a list of values, which may themselves be documents.
● Object (Embedded Document): a nested document.
● Null, Binary Data, Timestamp, and Regular Expression are also supported.
These data types allow MongoDB to handle a wide range of data and provide flexibility in how you store and manage your information.
● ObjectId
An ObjectId in MongoDB is a unique identifier for documents. It is 12 bytes in size and consists of the following components:
● 4-byte Timestamp: Represents the creation time of the ObjectId, measured in
seconds since the Unix epoch.
● 5-byte Random Value: Generated once per process, unique to the machine
and process.
● 3-byte Counter: An incrementing counter, initialized to a random value.
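The layout above can be decoded by hand. A small Python sketch (the ObjectId value below is made up for illustration) that splits the 12 bytes into the three components:

```python
import datetime

# A hypothetical 24-hex-character ObjectId (12 bytes).
oid = "65f2a1b3c9d4e5f6a7b8c9d0"
raw = bytes.fromhex(oid)

timestamp = int.from_bytes(raw[0:4], "big")  # 4-byte creation time (seconds since epoch)
random_part = raw[4:9]                       # 5-byte machine/process random value
counter = int.from_bytes(raw[9:12], "big")   # 3-byte incrementing counter

created = datetime.datetime.fromtimestamp(timestamp, tz=datetime.timezone.utc)
print(created)  # when this ObjectId was generated
print(counter)
```

Because the timestamp comes first, sorting by _id roughly sorts documents by creation time.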
● Embedded Documents
Embedded documents in MongoDB are documents stored within other documents, creating a
nested structure. This approach is useful for storing related data together, making it easier to access
and manage.
Example:
Imagine you have a user document that includes the user’s address. Instead of storing the address
in a separate collection, you can embed it directly within the user document:
JSON
{
"_id": 111111,
"email": "email@example.com",
"name": {
"given": "Jane",
"family": "Han"
},
"address": {
"street": "111 Elm Street",
"city": "Springfield",
"state": "Ohio",
"country": "US",
"zip": "00000"
}
}
Benefits:
● Related data is read and written together in a single operation, with no joins needed.
● Updates to a document and its embedded data are atomic.
When Not to Embed (use references instead):
● When the embedded data grows too large, making the document unwieldy.
● When the data has complex relationships that are better managed with references.
● Collections
A collection is a grouping of MongoDB documents, roughly analogous to a table in a relational database, except that documents in a collection do not need to share the same schema.
● Insert vs Save
o Insert: Adds a new document to a collection. If a document with the same _id already exists, the insert fails with a duplicate key error.
o Save: If the document has an _id field and it matches an existing document, save will replace that document. If there’s no match, it will insert the document as a new one. (save is deprecated in modern MongoDB; use insertOne or replaceOne instead.)
Comparison Operators
1. $eq: Matches values that are equal to a specified value.
2. $ne: Matches values that are not equal to a specified value.
3. $gt: Matches values that are greater than a specified value.
4. $gte: Matches values that are greater than or equal to a specified value.
5. $lt: Matches values that are less than a specified value.
6. $lte: Matches values that are less than or equal to a specified value.
7. $in: Matches any of the values specified in an array.
8. $nin: Matches none of the values specified in an array.
Logical Operators
1. $and: Joins query clauses with a logical AND, returning all documents that
match the conditions of both clauses.
2. $or: Joins query clauses with a logical OR, returning all documents that match
the conditions of either clause.
3. $not: Inverts the effect of a query expression and returns documents that do
not match the query expression.
4. $nor: Joins query clauses with a logical NOR, returning all documents that fail
to match both clauses.
Element Operators
1. $exists: Matches documents that have the specified field.
2. $type: Matches documents that have a field of the specified type.
Evaluation Operators
1. $regex: Matches documents where the value of a field matches a specified regular
expression.
2. $expr: Allows the use of aggregation expressions within the query language.
3. $jsonSchema: Validates documents against the given JSON Schema.
Array Operators
1. $all: Matches arrays that contain all elements specified in the query.
2. $elemMatch: Matches documents that contain an array field with at least one
element that matches all the specified query criteria.
3. $size: Matches any array with the specified number of elements.
Geospatial Operators
1. $geoWithin: Selects documents with geospatial data that exist entirely within a
specified shape.
2. $geoIntersects: Selects documents with geospatial data that intersect with a
specified shape.
3. $near: Returns documents in order of proximity to a specified point.
● Cursors
A cursor is an object that allows you to iterate over the results of a query. When you
use find, it returns a cursor, which you can use to access each document one by one.
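Conceptually a cursor is just lazy iteration. A plain-Python sketch using a generator (illustrative only, not driver code):

```python
def find(collection, predicate):
    """A toy cursor: yields matching documents one at a time, lazily."""
    for doc in collection:
        if predicate(doc):
            yield doc

users = [
    {"name": "Ann", "age": 31},
    {"name": "Bob", "age": 19},
    {"name": "Cy", "age": 45},
]

cursor = find(users, lambda d: d["age"] > 30)
print(next(cursor))                 # fetch the first matching document
print([d["name"] for d in cursor])  # iterate over the rest
```

Like a real cursor, nothing is materialized until you ask for it, so large result sets do not have to fit in memory all at once.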
● Admin Database
The admin database is a special database in MongoDB that holds administrative information
and commands. It’s used for tasks like managing users and roles, and performing
server-side operations.
● How to List Collections
To list all collections in a database, you can use the listCollections command or the show collections command in the MongoDB shell.
● How to Modify a Collection Name
db.oldCollectionName.renameCollection("newCollectionName")
4. Indexing in MongoDB
● What is Indexing?
● Indexes are special data structures that store a small portion of the collection’s data in
an easy-to-traverse form. They are similar to the index in a book, which helps you quickly
find the information you need without having to read through the entire book.
● Purpose: They make it faster to retrieve documents from a collection by reducing the
amount of data MongoDB needs to scan.
● Types: MongoDB supports various types of indexes, including single field, compound,
multi-key, text, and geospatial indexes.
● Creation: You can create an index on a collection using the createIndex method.
● Usage: When you query a collection, MongoDB uses the index to quickly locate the
required documents.
For example, if you have a collection of books and you frequently search by the author’s
name, you can create an index on the author field to speed up these queries.
Default _id Index:
Every collection automatically has a default index on the _id field. This index is created when the collection is created and ensures that each document in the collection has a unique identifier.
● Single Field Index
o A single field index is an index on one field of a document. For example, to index the name field in ascending order:
db.collection.createIndex({ name: 1 })
● Compound Index
o A compound index is an index on multiple fields. This is useful for queries that filter on multiple fields. For example:
db.collection.createIndex({ author: 1, title: 1 })
● Multi-Key Index
o A multi-key index is used for indexing fields that hold arrays. MongoDB creates an
index entry for each element of the array. For example:
db.collection.createIndex({ tags: 1 })
● Geospatial Index
o A geospatial index is used for querying geospatial data. MongoDB supports 2D and
2DSphere indexes for different types of geospatial queries. For example:
db.collection.createIndex({ location: "2dsphere" })
● Text Index
o A text index is used for text search queries. It indexes the content of string fields for efficient text search. For example:
db.collection.createIndex({ description: "text" })
● Covered Queries
A covered query is a query where all the fields in the query are part of an index. This means
MongoDB can satisfy the query using only the index, without scanning any documents. This
can significantly improve performance.
● Indexing Best Practices
o Analyze Query Patterns: Create indexes based on the fields that are frequently queried.
o Limit the Number of Indexes: Each index consumes disk space and affects write
performance.
o Use Compound Indexes Wisely: Ensure the order of fields in compound indexes
matches the query patterns.
o Monitor Index Usage: Use tools like MongoDB Atlas Performance Advisor to
monitor and optimize index usage.
● Clustered Index vs Non-Clustered Index
o Clustered Index: MongoDB does not support clustered indexes in the traditional sense. However, the _id field in MongoDB is automatically indexed and can be considered similar to a clustered index.
o Non-Clustered Index: All other indexes in MongoDB are non-clustered. They store a
reference to the actual data rather than the data itself.
5. Aggregation Framework
1. Aggregation Pipeline:
o The aggregation framework works like a pipeline where data passes
through various stages, with each stage performing an operation on the
data.
o The output of one stage becomes the input for the next stage.
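The stage-to-stage flow can be sketched in plain Python, with each stage a function whose output feeds the next (illustrative only, not MongoDB code):

```python
users = [
    {"name": "Ann", "country": "USA", "age": 34},
    {"name": "Bob", "country": "UK", "age": 28},
    {"name": "Cy", "country": "USA", "age": 41},
]

def match(docs, predicate):              # like $match
    return [d for d in docs if predicate(d)]

def sort_stage(docs, key, direction=1):  # like $sort
    return sorted(docs, key=lambda d: d[key], reverse=(direction == -1))

# The output of match flows into sort_stage, as in an aggregation pipeline.
result = sort_stage(match(users, lambda d: d["country"] == "USA"), "age", -1)
print([d["name"] for d in result])  # USA users, oldest first
```

Each stage sees only what the previous stage emitted, which is exactly how MongoDB chains $match, $group, $sort, and the other stages below.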
1. $match:
● The $match stage filters documents so that only those matching the specified condition pass to the next stage.
● Example: Keep only users from the USA.
db.user.aggregate([
{
$match: { "country": "USA" } // Filters documents where 'country' is 'USA'
}
])
2. $group:
● The $group stage groups documents by a specified field or fields and performs
aggregation operations such as $sum, $avg, $max, etc., on each group.
● Example: Group users by favoriteFruit and calculate the total count of
users in each group.
db.user.aggregate([
{
$group: {
_id: "$favoriteFruit", // Group by 'favoriteFruit'
count: { $sum: 1 } // Count the number of users in each group
}
}
])
3. $sort:
● The $sort stage sorts documents by a specified field or fields, either in
ascending (1) or descending (-1) order.
● Example: Sort users by age in descending order.
db.user.aggregate([
{
$sort: { "age": -1 } // Sort documents by 'age' in descending order
}
])
4. $project:
● The $project stage reshapes documents by including, excluding, or computing fields.
● Example: Return only the name and age fields.
db.user.aggregate([
{
$project: {
name: 1, // Include the 'name' field
age: 1 // Include the 'age' field
}
}
])
● You can also compute new fields. Example: Add 5 to age and return it as ageInFiveYears.
db.user.aggregate([
{
$project: {
name: 1, // Include 'name'
ageInFiveYears: { $add: ["$age", 5] } // 'age' plus 5, stored as 'ageInFiveYears'
}
}
])
5. $limit:
● The $limit stage restricts the number of documents passed to the next stage
of the pipeline. It is useful when you need to return a specific number of
documents, such as in pagination.
● Example: Limit the result to the first 5 documents.
db.user.aggregate([
{
$limit: 5 // Return only the first 5 documents
}
])
● Other Useful Query Operators:
● $or: Matches documents where at least one of the conditions in the array is
true.
o Example: Find users who are either from the USA or are older than
30.
db.user.find({
$or: [
{ country: "USA" },
{ age: { $gt: 30 } }
]
})
● $in: Matches any documents where the field’s value is in the specified array.
o Example: Find users whose favorite fruit is either "Apple" or
"Banana."
db.user.find({
favoriteFruit: { $in: ["Apple", "Banana"] }
})
● $exists: Matches documents based on whether the specified field is present.
o Example: Find users that have an email field.
db.user.find({
email: { $exists: true }
})
2. $facet
● $facet: Runs multiple aggregation sub-pipelines on the same input documents within a single stage, returning all results together.
o Example: Count users per favorite fruit and compute the average age in one pass.
db.user.aggregate([
{
$facet: {
fruitCounts: [
{ $group: { _id: "$favoriteFruit", count: { $sum: 1 } } }
],
averageAge: [
{ $group: { _id: null, avgAge: { $avg: "$age" } } }
]
}
}
])
3. $lookup
● $lookup: Performs a left outer join between two collections. Useful for joining
data from different collections in MongoDB.
o Example: Join orders collection with users collection to include user
details in each order.
db.orders.aggregate([
{
$lookup: {
from: "users", // The collection to join
localField: "userId", // Field from 'orders'
foreignField: "_id", // Field from 'users'
as: "userDetails" // Name for the resulting joined field
}
}
])
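The left-outer-join semantics of $lookup can be mimicked in plain Python (the field names are the same illustrative ones as above):

```python
users = [{"_id": 1, "name": "Ann"}, {"_id": 2, "name": "Bob"}]
orders = [{"orderId": 10, "userId": 1}, {"orderId": 11, "userId": 3}]

def lookup(left, right, local_field, foreign_field, as_field):
    """Left outer join: every left document survives; no match -> empty list."""
    index = {}
    for doc in right:
        index.setdefault(doc[foreign_field], []).append(doc)
    return [{**doc, as_field: index.get(doc[local_field], [])} for doc in left]

joined = lookup(orders, users, "userId", "_id", "userDetails")
print(joined[0]["userDetails"])  # the matching user document
print(joined[1]["userDetails"])  # [] -- no user with _id 3
```

Note that the order with no matching user still appears in the output, with an empty userDetails array; that is the "left outer" part.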
4. $merge
● $merge: Writes the results of the aggregation pipeline into a specified collection, creating it if needed.
o Example: Store the pipeline output in aggregatedResults.
db.orders.aggregate([
// ... Your aggregation pipeline ...
{ $merge: "aggregatedResults" } // Merge the output into 'aggregatedResults'
])
5. $unwind
● $unwind: Deconstructs an array field, outputting one document per array element.
o Example: Produce one document per hobby in each user’s hobbies array.
db.user.aggregate([
{ $unwind: "$hobbies" }
])
● $addToSet: Adds a value to an array, only if the value does not already exist
in the array (like a set).
o Example: Add a hobby to a user’s hobbies array, only if it doesn’t
already exist.
db.user.updateOne(
{ _id: userId },
{ $addToSet: { hobbies: "reading" } })
● $push: Appends a value to an array (duplicates allowed).
o Example: Add "gaming" to a user’s hobbies array.
db.user.updateOne(
{ _id: userId },
{ $push: { hobbies: "gaming" } }
)
● $pull: Removes all array elements that match a specified value or condition.
o Example: Remove "gaming" from a user’s hobbies array.
db.user.updateOne(
{ _id: userId },
{ $pull: { hobbies: "gaming" } }
)
● $pop: Removes the first or last element of an array.
o Example: Remove the last element of the hobbies array.
db.user.updateOne(
{ _id: userId },
{ $pop: { hobbies: 1 } } // Use -1 for the first element
)
● $all: Matches documents where the array field contains all the specified
elements.
o Example: Find users whose hobbies include both "reading" and
"traveling."
db.user.find({
hobbies: { $all: ["reading", "traveling"] }
})
● $nin: Matches documents where the field’s value is not in the specified array.
o Example: Find users whose favorite fruit is neither "Apple" nor
"Banana."
db.user.find({
favoriteFruit: { $nin: ["Apple", "Banana"] }
})
● $ne: Matches documents where the field’s value is not equal to the specified
value.
o Example: Find users who do not live in the USA.
db.user.find({
country: { $ne: "USA" }
})
8. $cond, $expr
● $cond: A ternary operator for aggregation expressions: if the condition is true, return one value, otherwise another.
o Example: Label each user "Adult" or "Minor" based on age.
db.user.aggregate([
{
$project: {
name: 1,
status: {
$cond: { if: { $gte: ["$age", 18] }, then: "Adult", else: "Minor" }
}
}
}
])
● $expr: Allows aggregation expressions inside a query, e.g. comparing two fields of the same document.
o Example: Find users whose age is greater than their yearsOfExperience.
db.user.find({
$expr: { $gt: ["$age", "$yearsOfExperience"] }
})
Map-Reduce
How it works: Map-Reduce involves two functions: map and reduce. The map function processes
each document and emits key-value pairs. The reduce function then processes these pairs to
aggregate the results.
Flexibility: It allows for complex operations using JavaScript, making it highly flexible.
Performance: Generally slower and less efficient compared to the Aggregation Framework,
especially for large datasets.
Use Cases: Suitable for complex data processing tasks that require custom JavaScript functions.
Example (assuming each sale document has category and amount fields):
db.sales.mapReduce(
function () { emit(this.category, this.amount); }, // map: emit key-value pairs
function (key, values) { return Array.sum(values); }, // reduce: combine values per key
{ out: "total_sales_by_category" }
);
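The two phases can also be sketched in plain Python, again assuming sale documents with category and amount fields:

```python
from collections import defaultdict

sales = [
    {"category": "books", "amount": 10},
    {"category": "toys", "amount": 25},
    {"category": "books", "amount": 5},
]

# Map phase: emit a (key, value) pair for each document.
emitted = [(doc["category"], doc["amount"]) for doc in sales]

# Reduce phase: combine all values emitted under the same key.
totals = defaultdict(int)
for key, value in emitted:
    totals[key] += value

print(dict(totals))  # total sales per category
```

The reduce function only ever sees values that share a key, which is why it can run independently (and in parallel) per key.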
Aggregation Framework
How it works: Uses a pipeline of stages to process data. Each stage transforms the documents as
they pass through the pipeline.
Built-in Operators: Includes a variety of built-in operators for filtering, grouping, sorting, and
transforming data.
Performance: More efficient and faster than Map-Reduce, especially for large datasets.
Use Cases: Ideal for most aggregation tasks due to its performance and ease of use.
Example (the aggregation equivalent of the same category totals, assuming category and amount fields):
db.sales.aggregate([
{ $group: { _id: "$category", total: { $sum: "$amount" } } },
{ $out: "total_sales_by_category" }
]);
Conclusion
Both Map-Reduce and the Aggregation Framework are powerful tools for data
aggregation in MongoDB. The choice between them depends on your specific use
case. For most standard data processing tasks, the Aggregation Framework is the
better option due to its performance and ease of use. However, for more complex or
highly customized data transformations, Map-Reduce may still be the appropriate
choice.
● Covered Query
A Covered Query in MongoDB is a type of query where MongoDB can get all the
information it needs from the index itself, without having to look at the actual documents in
the collection. This makes the query much faster because MongoDB doesn't need to read any
extra data from the disk.
How does it work?
To have a covered query, three things need to happen:
1. All the fields used in the query must be part of the index.
2. The query only asks for fields that are in the index (no extra fields).
3. The index is used for filtering, sorting, and retrieving the results.
Example:
Let's say you have a collection called users, and each document looks like this:
{
"name": "Alice",
"age": 25,
"email": "alice@example.com"
}
Now, you create an index on the name and age fields:
db.users.createIndex({ name: 1, age: 1 });
If you run the following query:
db.users.find({ name: "Alice" }, { name: 1, age: 1, _id: 0 });
This query:
● Filters by name.
● Projects (returns) only the name and age fields.
Since both name and age are part of the index, MongoDB can get the results directly from the
index without reading the full document. This makes the query a covered query.
Why is it good?
● Faster queries: Since MongoDB doesn’t need to fetch the actual documents, it saves time.
● Less data to process: MongoDB only works with the index, so it's quicker and uses fewer
resources.
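The index-only idea can be sketched in plain Python: the "index" below holds only the indexed fields, and the query never touches the full documents (illustrative only):

```python
documents = [
    {"name": "Alice", "age": 25, "email": "alice@example.com"},
    {"name": "Dan", "age": 31, "email": "dan@example.com"},
]

# The index stores only the indexed fields (name, age), keyed for lookup.
index = {d["name"]: {"name": d["name"], "age": d["age"]} for d in documents}

# A "covered" lookup: the projection {name, age} is answered entirely
# from the index; the documents list is never scanned.
result = index["Alice"]
print(result)
```

If the query projected email as well, the index alone could not answer it, and the full document would have to be fetched: that is exactly when a query stops being covered.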
6. Data Modeling
Normalization vs Denormalization
● Normalization: Splitting data into multiple collections linked by references to reduce redundancy. It’s like organizing your files into different folders to avoid duplicates.
● Denormalization: Combining related data into a single document to improve read performance. It’s like putting all your important documents in one folder for quick access.
CRUD stands for Create, Read, Update, and Delete—the four basic operations of
persistent storage in a database. In MongoDB, these operations are performed on
documents within collections.
BulkWrite Operations
BulkWrite operations allow you to perform multiple write operations (insert, update,
delete) in a single request. This can improve performance when dealing with large
numbers of documents.
db.collection.bulkWrite([
{ insertOne: { document: { name: "Bob", age: 30 } } },
{ updateOne: { filter: { name: "Alice" }, update: { $set: { age: 26 } } } },
{ deleteOne: { filter: { name: "John" } } }
]);
Upsert Operation
db.collection.updateOne(
{ name: "Charlie" },
{ $set: { age: 28 } },
{ upsert: true }
);
db.collection.aggregate([
{ $match: { age: { $gt: 20 } } },
{ $group: { _id: "$age", totalUsers: { $sum: 1 } } },
{ $sort: { totalUsers: -1 } }
]);
In this example:
● $match keeps only documents where age is greater than 20.
● $group groups the remaining documents by age and counts the users in each group.
● $sort orders the groups by totalUsers in descending order.
This is a powerful way to perform operations like filtering, grouping, and sorting in a single query.
1. $regex (Pattern Matching)
The $regex operator matches string fields against a regular expression.
Example: Find users whose name starts with "A", ignoring case:
db.users.find({ name: { $regex: "^A", $options: "i" } });
● ^A: Matches any string that starts with the letter "A".
● i: Case-insensitive matching.
2. $expr (Comparing Fields in the Same Document)
The $expr operator lets a query compare two fields of the same document.
Example: Find users whose age is greater than their score:
db.users.find({ $expr: { $gt: ["$age", "$score"] } });
● $gt: Checks if age is greater than score within the same document.
You can use any aggregation expression with $expr, including $add, $subtract, $and, etc.
3. $elemMatch (Matching Elements in an Array)
The $elemMatch operator is used to match documents that contain an array field, where at
least one element in the array matches the specified condition(s).
Example:
Find all users who have a score array with at least one score greater than 80:
db.users.find({ scores: { $elemMatch: { $gt: 80 } } });
In this example, MongoDB will return documents where the scores array has at least one
element greater than 80.
You can also match multiple conditions on a single element:
db.users.find({ scores: { $elemMatch: { $gt: 80, $lt: 90 } } });
This will return documents where the scores array has at least one element that is greater
than 80 but less than 90.
4. $exists (Checking Field Presence)
The $exists operator matches documents based on whether a field is present.
Example: Find users who have no phone field:
db.users.find({ phone: { $exists: false } });
● $exists: false: The field phone must be absent from the document.
9. Replication
Replication is achieved through Replica Sets, which are groups of MongoDB servers that
maintain the same data set. Replica Sets provide automatic failover, data redundancy, and
recovery options.
● Primary: The primary node in a replica set is the main server that receives all write
operations. It accepts updates, inserts, and deletes, and replicates these changes to the
secondary nodes. Applications connect to the primary node for all write operations.
● Secondary: Secondary nodes replicate data from the primary node. They hold
read-only copies of the data and can be used for read operations, improving query
performance. Secondary nodes help distribute the read load and act as backups in case
the primary node fails.
When the primary node fails, one of the secondary nodes is automatically elected as the new
primary.
Replication in MongoDB is a process that ensures data is copied and maintained across multiple
servers. This helps in achieving high availability and data redundancy, meaning your data is safe
even if one server fails.
Simple Explanation:
● Replication: The process of copying data from one MongoDB server (primary) to other servers
(secondaries).
● Replica Set: A group of MongoDB servers that maintain the same data set. It includes one primary
node and multiple secondary nodes.
How It Works:
1. Primary Node: Handles all write operations.
2. Secondary Nodes: Replicate the data from the primary node.
3. Failover: If the primary node fails, one of the secondary nodes is automatically elected as the new
primary.
Example:
Let’s say you have a users collection in your MongoDB database. You set up a replica set with one
primary and two secondary nodes.
Real-Life Example:
Imagine a popular e-commerce website. To ensure that the website remains available even during
server failures, the company uses MongoDB replication. They set up a replica set with servers
located in different geographical regions. This way, if one server fails due to a hardware issue or a
natural disaster, another server in a different location can take over, ensuring that customers can still
access the website and make purchases.
3. How Many Nodes in a Replica Set?
A minimum of 3 nodes is recommended so the set can elect a primary after a failure. However, a replica set can have up to 50 nodes, with a maximum of 7 voting members. The number of nodes can vary based on the need for redundancy, availability, and load balancing. For most production environments, a 3-node replica set is the most common setup.
4. Voting in Replication
In MongoDB replication, voting is part of the replica set's election process. When the
primary node becomes unavailable, an election is held to choose a new primary. Only the
voting members of the replica set participate in the election process.
Voting helps MongoDB ensure that there's a consistent primary node and that the replica set
remains operational.
5. Difference Between GridFS and Sharding
Both GridFS and Sharding are MongoDB features used for handling large data, but they serve different purposes:
● GridFS: GridFS is a specification for storing and retrieving large files, such as
images or videos, in MongoDB. When a file exceeds the BSON document size limit
(16MB), GridFS splits the file into smaller chunks and stores each chunk as a separate
document in a fs.chunks collection, with metadata stored in a fs.files collection.
GridFS is ideal for storing large files and handling media storage within MongoDB.
Example Use Case: Storing and retrieving large media files such as videos or images.
● Sharding: Sharding is a method for distributing data across multiple machines. It
allows MongoDB to scale horizontally by partitioning large datasets across multiple
servers (shards). Each shard holds a subset of the data, and MongoDB distributes
queries across all shards to balance the load.
Example Use Case: Distributing a large user database across multiple servers to
handle high volumes of read and write operations.
In summary, GridFS is used for storing large files, while Sharding is used for distributing
large datasets across multiple servers for scalability.
10. Sharding
What is Sharding?
Sharding is a method of distributing data across multiple servers to handle large datasets
and high throughput operations. It allows MongoDB to scale horizontally by splitting data
into smaller, more manageable pieces called shards.
Components of Sharding
1. Shards: Each shard holds a subset of the data. Shards are typically deployed as replica sets
for high availability.
2. mongos: Acts as a query router, directing client requests to the appropriate shard.
3. Config Servers: Store metadata and configuration settings for the cluster.
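Hashed shard-key routing, the decision mongos makes for a hashed shard key, can be sketched like this (the shard count and keys are made up):

```python
import hashlib

NUM_SHARDS = 3

def shard_for(shard_key: str) -> int:
    """Deterministically route a document to a shard by hashing its shard key."""
    digest = hashlib.sha256(shard_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_SHARDS

for user_id in ["u1", "u2", "u3", "u4"]:
    print(user_id, "-> shard", shard_for(user_id))
```

Because the routing is a pure function of the shard key, any mongos instance sends a given document to the same shard, and queries that include the shard key can be targeted to a single shard instead of broadcast to all of them.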
Advantages:
● Horizontal scalability: add servers to handle growing data and traffic.
● Higher throughput: reads and writes are spread across shards.
Disadvantages:
● Added operational complexity (config servers, mongos routers, balancing).
● A poorly chosen shard key can cause uneven data distribution (hot shards).
● Queries that span many shards (scatter-gather) are slower, and changing the shard key later is costly.
Sharding vs Replication
● Sharding: Distributes data across multiple servers to handle large datasets and high
throughput. It focuses on horizontal scaling.
● Replication: Duplicates data across multiple servers to ensure high availability and fault
tolerance. It focuses on data redundancy.
CAP Theorem
The CAP Theorem states that a distributed database can only guarantee two out of three
properties at the same time: Consistency, Availability, and Partition Tolerance. MongoDB
prioritizes availability and partition tolerance.
Capped Collections
Capped Collections are fixed-size collections that automatically overwrite the oldest data when they reach their size limit. They are useful for logging and caching scenarios.
How to Create a Capped Collection:
db.createCollection("logs", { capped: true, size: 100000, max: 1000 })
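The overwrite-oldest behavior resembles a bounded queue; a plain-Python sketch using collections.deque:

```python
from collections import deque

# A "capped collection" of at most 3 entries: appending past the cap
# silently evicts the oldest document, just as a capped collection does.
log = deque(maxlen=3)
for i in range(5):
    log.append({"event": i})

print(list(log))  # only the 3 most recent events survive
```

This is why capped collections suit logs: recent entries are always available, and old ones age out with no explicit cleanup.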
11. GridFS
What is GridFS?
GridFS is a specification in MongoDB for storing and retrieving large files, such as images, videos,
and documents, that exceed the BSON document size limit of 16 MB. Instead of storing a file in a
single document, GridFS divides the file into smaller chunks and stores each chunk as a separate
document. This allows for efficient storage and retrieval of large files.
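The chunking scheme can be sketched in plain Python (GridFS's default chunk size is 255 KB; a tiny size is used here so the split is visible):

```python
CHUNK_SIZE = 4  # bytes; GridFS defaults to 255 KB

def to_chunks(data: bytes, file_id: str):
    """Split a file into numbered chunk documents, GridFS-style."""
    return [
        {"files_id": file_id, "n": i, "data": data[start:start + CHUNK_SIZE]}
        for i, start in enumerate(range(0, len(data), CHUNK_SIZE))
    ]

chunks = to_chunks(b"hello gridfs!", "file-1")
print(len(chunks))                          # number of chunk documents
print(b"".join(c["data"] for c in chunks))  # the chunks reassemble the file
```

Because each chunk is numbered, a range of a large file (say, the middle of a video) can be read by fetching only the relevant chunks instead of the whole file.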
● GridFS: Used for storing large files by breaking them into smaller chunks. It is ideal for files that
exceed the 16 MB limit and allows for partial file retrieval without loading the entire file into
memory.
● Sharding: Distributes data across multiple servers to handle large datasets and high throughput. It
improves performance and scalability by dividing the data into smaller, more manageable pieces.
Transactions in MongoDB
Transactions in MongoDB allow you to group multiple read and write operations into a
single, atomic operation. This means that either all operations in the transaction succeed, or
none do. Transactions ensure data consistency and are useful for complex operations that
span multiple documents or collections.
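The all-or-nothing guarantee can be sketched in plain Python: work on a snapshot, and commit only if every step succeeds (illustrative, not driver code):

```python
accounts = {"alice": 100, "bob": 50}

def transfer(balances, src, dst, amount):
    """All-or-nothing: mutate a snapshot, commit only if every step succeeds."""
    snapshot = dict(balances)
    snapshot[src] -= amount
    snapshot[dst] += amount
    if snapshot[src] < 0:
        raise ValueError("insufficient funds")  # abort; original is untouched
    balances.update(snapshot)  # commit both changes together

transfer(accounts, "alice", "bob", 30)
print(accounts)  # both sides of the transfer applied

try:
    transfer(accounts, "bob", "alice", 999)
except ValueError:
    pass
print(accounts)  # unchanged: the failed transfer left no partial update
```

The key property is that an observer never sees money removed from one account but not yet added to the other, which is what a multi-document transaction provides.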
ACID Compliance
ACID stands for Atomicity, Consistency, Isolation, and Durability:
● Atomicity: All operations in a transaction succeed, or none do.
● Consistency: A transaction moves the database from one valid state to another.
● Isolation: Concurrent transactions do not see each other’s intermediate states.
● Durability: Once committed, a transaction’s changes survive crashes.
Batch Sizing
Batch Sizing in MongoDB controls the number of documents returned in each batch of a
query response. Adjusting the batch size can optimize performance:
● Large Batch Size: Reduces the number of network round trips but uses more memory.
● Small Batch Size: Uses less memory but increases the number of network round trips.
Upsert Operations
An Upsert operation in MongoDB is a combination of update and insert. If a document
matching the query criteria exists, it updates the document. If no matching document is
found, it inserts a new document. This is useful for ensuring that data is always up-to-date
without needing separate insert and update logic.
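The update-or-insert decision can be sketched in plain Python over a list of dicts (illustrative only):

```python
def upsert(collection, filter_doc, update_fields):
    """Update the first matching document; insert a new one if none match."""
    for doc in collection:
        if all(doc.get(k) == v for k, v in filter_doc.items()):
            doc.update(update_fields)  # matched: update in place
            return doc
    new_doc = {**filter_doc, **update_fields}  # no match: insert
    collection.append(new_doc)
    return new_doc

users = [{"name": "Alice", "age": 25}]
upsert(users, {"name": "Alice"}, {"age": 26})    # updates the existing doc
upsert(users, {"name": "Charlie"}, {"age": 28})  # inserts a new doc
print(users)
```

As in MongoDB's upsert: true, the inserted document combines the filter fields with the update fields, so the new document would match the same filter next time.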
Use Cases for Transactions
● Financial Transactions: Ensuring that all steps in a financial transaction (like transferring money between accounts) are completed successfully.
● Inventory Management: Ensuring that inventory levels are updated correctly when
processing orders.
● Order Processing: Ensuring that all parts of an order (like payment and inventory update) are
completed together.
Backup and Restore
● Backup: Use the mongodump command to create a backup of your MongoDB database.
o mongodump --db mydatabase --out /backup/directory
● Restore: Use the mongorestore command to restore a MongoDB database from a backup.
o mongorestore --db mydatabase /backup/directory/mydatabase
Security: Authentication and Authorization
● Authentication: Verifies the identity of a user or client. MongoDB supports various authentication mechanisms like SCRAM, x.509 certificates, LDAP, and Kerberos.
● Authorization: Determines what actions an authenticated user can perform. MongoDB uses
Role-Based Access Control (RBAC) to manage permissions
● RBAC: Assigns roles to users, and each role has specific permissions. Roles can be built-in (like readWrite, dbAdmin) or custom-defined.
● Roles: Control access to database resources and operations. Users can have multiple roles, and roles can inherit permissions from other roles.
Query Optimization
Query Optimization involves refining queries to reduce execution time and resource
consumption. This can be achieved by:
● Using indexes: Ensure queries use indexes to avoid full collection scans.
● Avoiding unnecessary data retrieval: Only fetch the fields you need.
● Optimizing joins and aggregations: Simplify complex queries and use efficient join
operations.
Caching Strategies
Caching stores frequently accessed data in a temporary storage area to reduce access time.
Common caching strategies include:
● Cache-Aside: The application checks the cache first before querying the database.
● Read-Through: The cache automatically loads data from the database on a cache miss.
● Write-Through: Data is written to the cache and the database simultaneously.
● Write-Back: Data is written to the cache first and then asynchronously to the database.
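The cache-aside pattern from the list above can be sketched in plain Python (the "database" is just a dict here):

```python
database = {"user:1": {"name": "Ann"}}
cache = {}
db_reads = 0

def get(key):
    """Cache-aside: try the cache first; on a miss, load from the DB and fill it."""
    global db_reads
    if key in cache:
        return cache[key]      # cache hit
    db_reads += 1
    value = database[key]      # cache miss: read from the database
    cache[key] = value         # populate the cache for next time
    return value

get("user:1")    # miss -> one database read
get("user:1")    # hit -> served from the cache
print(db_reads)  # only one database read despite two lookups
```

In cache-aside the application owns the caching logic; in read-through, the cache layer itself performs the database load on a miss.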
Load Balancing
Load Balancing distributes incoming network traffic across multiple servers to ensure no single server becomes overwhelmed. This improves application performance and reliability by:
● Spreading requests evenly so no single server becomes a bottleneck.
● Routing traffic away from unhealthy servers (failover).
● Allowing capacity to grow by simply adding servers.
Atlas Clustering
Atlas Clustering involves creating clusters that can be either replica sets or sharded clusters:
● Replica Sets: Provide high availability and redundancy by replicating data across multiple
nodes.
● Sharded Clusters: Distribute data across multiple shards to handle large datasets and high
throughput.
Atlas also provides several built-in security features:
1. Encryption in Transit: Uses TLS/SSL to encrypt data as it travels over the network.
2. Encryption at Rest: Encrypts data stored on disk to protect it from unauthorized access.
3. IP Access List: Restricts database access to specified IP addresses.
4. User Authentication and Authorization: Uses Role-Based Access Control (RBAC) to manage
permissions.
5. Network Isolation: Supports Virtual Private Cloud (VPC) peering and private endpoints for
secure network configurations.
6. Auditing: Tracks and logs database events for monitoring and compliance.
Schema Validation
MongoDB can enforce document structure with a $jsonSchema validator, for example:
db.createCollection("students", {
  validator: {
    $jsonSchema: {
      bsonType: "object",
      required: ["name", "age", "gpa"],
      properties: {
        name: {
          bsonType: "string",
          description: "must be a string and is required"
        },
        age: {
          bsonType: "int",
          minimum: 0,
          description: "must be an integer greater than or equal to 0"
        },
        gpa: {
          bsonType: "double",
          minimum: 0,
          maximum: 4,
          description: "must be a double between 0 and 4"
        }
      }
    }
  }
});
This rule ensures that every document in the students collection has a name (string), age (integer),
and gpa (double between 0 and 4).
CAP Theorem
The CAP Theorem states that in a distributed database system, you can only achieve two
out of the following three guarantees at the same time:
● Consistency: Every read receives the most recent write or an error.
● Availability: Every request receives a (non-error) response, even if some nodes are down.
● Partition Tolerance: The system keeps operating despite network partitions between nodes.
Data Redundancy
Data Redundancy refers to the practice of storing the same piece of data in multiple places.
This can be intentional for backup and recovery purposes or accidental due to inefficient
data management. While redundancy can improve data availability and fault tolerance, it
can also lead to data inconsistency and increased storage costs if not managed properly.
Clustered Collections
Clustered Collections in MongoDB store documents ordered by a clustered index key. This
means that the documents are physically stored in the order of the index key, which can
improve query performance for range queries and equality comparisons on the clustered
index key.
Materialized Views
A Materialized View is a database object that contains the results of a query. Unlike regular
views, which are virtual and recomputed each time they are accessed, materialized views
store the query results physically. This can significantly improve query performance,
especially for complex queries that are frequently executed.
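The compute-once-and-store idea can be simulated in plain JavaScript (an in-memory sketch, not MongoDB's on-demand materialized views; the collection and field names are made up):

```javascript
// Materialized-view sketch: an aggregation result is computed once, stored,
// read cheaply, and goes stale until explicitly refreshed.
const orders = [
  { product: "pen", qty: 2 },
  { product: "pen", qty: 3 },
  { product: "ink", qty: 1 },
];

let totalsView = null; // the "materialized" query result

function refreshView() {
  const totals = {};
  for (const { product, qty } of orders) {
    totals[product] = (totals[product] ?? 0) + qty;
  }
  totalsView = totals; // store the result physically instead of recomputing
}

refreshView();
console.log(totalsView.pen); // 5

orders.push({ product: "pen", qty: 1 });
console.log(totalsView.pen); // still 5: stale until the next refresh
refreshView();
console.log(totalsView.pen); // 6
```

This also shows the cost of materialization: reads are cheap, but the view reflects the data only as of its last refresh, so staleness has to be acceptable for the use case.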
Decrement Operations
Decrement Operations in MongoDB are used to decrease the value of a field. This can be
done using the $inc operator with a negative value. For example:
db.collection.updateOne(
  { _id: 1 },
  { $inc: { count: -1 } }
);
This command decreases the count field by 1.
● Alternatives to MongoDB
Cassandra
Apache Cassandra is a highly scalable, distributed NoSQL database designed to handle large
amounts of data across many commodity servers without a single point of failure. It is known for its
high availability, fault tolerance, and linear scalability.
Redis
Redis (Remote Dictionary Server) is an in-memory data structure store used as a database, cache,
and message broker. It supports various data structures such as strings, hashes, lists, sets, and
more. Redis is known for its high performance and low latency.
DynamoDB
Amazon DynamoDB is a fully managed NoSQL database service provided by AWS. It offers fast and
predictable performance with seamless scalability. DynamoDB is designed for applications that
require consistent, single-digit millisecond latency at any scale.
HBase
Apache HBase is an open-source, distributed, scalable, and NoSQL database modeled after
Google’s Bigtable. It is designed to handle large amounts of sparse data and is built on top of the
Hadoop Distributed File System (HDFS). HBase is known for its strong consistency and random,
real-time read/write access.
OrientDB
OrientDB is a multi-model NoSQL database that supports graph, document, key-value, and object
models. It is designed to be highly scalable and efficient, combining the flexibility of document
databases with the power of graph databases.
Scaling in MongoDB
Scaling in MongoDB is essential for handling increasing data volumes, user traffic, and processing
demands. There are two main methods for scaling MongoDB: vertical scaling and horizontal scaling.
Vertical Scaling
● Definition: Increasing the capacity of a single server by adding more resources (CPU, RAM, storage).
● Use Case: Suitable for applications with moderate growth where a single server can handle the
increased load.
● Example: Upgrading your server from 16GB RAM to 32GB RAM to handle more queries and data.
Horizontal Scaling
● Definition: Adding more servers to distribute the load and data across multiple machines.
● Use Case: Ideal for applications with significant growth, requiring more resources than a single
server can provide.
● Techniques:
o Replication: Creating copies of the database on multiple servers to ensure high availability and fault
tolerance.
o Sharding: Distributing data across multiple servers (shards) to balance the load and improve
performance.
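The routing idea behind sharding can be sketched with a hashed shard key (a toy hash, not MongoDB's actual hashed index; the key and shard count are illustrative):

```javascript
// Hashed shard-key routing sketch: the shard key is hashed, and the hash
// decides which shard stores the document, spreading data across servers.
function hashKey(key) {
  let h = 0;
  for (const ch of String(key)) h = (h * 31 + ch.charCodeAt(0)) >>> 0;
  return h;
}

function shardFor(key, shardCount) {
  return hashKey(key) % shardCount;
}

// The same key always routes to the same shard, so a query that includes
// the shard key can be sent to one shard instead of broadcast to all.
const shard = shardFor("user-42", 3);
console.log(shard === shardFor("user-42", 3)); // true
```

This determinism is why the choice of shard key matters so much: queries that include it are targeted, while queries that omit it must be scattered to every shard and gathered back.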