
Intro To Data Science - Week 10 - LAQ's

Apache Cassandra is an open-source, distributed NoSQL database management system known for its high availability, scalability, and performance, making it suitable for applications requiring continuous uptime. It features a decentralized architecture, write optimization, and tunable consistency, allowing it to handle large volumes of data effectively. Common use cases include real-time analytics, IoT data storage, and social media platforms, though it has limitations in complex queries and requires careful management.

Uploaded by

keerthana5958v

Cassandra

Apache Cassandra is an open-source, distributed NoSQL database management system
designed to handle large amounts of data across many commodity servers without any
single point of failure. It is known for its high availability, scalability, and performance,
making it well suited to applications that require continuous uptime and must handle huge
volumes of data.

Key Features of Cassandra:

1. Decentralized Architecture:

Cassandra follows a peer-to-peer architecture where all nodes are equal, meaning there is
no master-slave relationship.

Every node in the cluster can handle both read and write requests, ensuring that no single
point of failure exists.

2. Scalability:

Cassandra is designed to scale horizontally by adding more nodes to the cluster. This
allows the database to handle increasing amounts of data and traffic without performance
degradation.

It supports near-linear scalability, meaning that as you add more nodes, the throughput of
the system increases roughly in proportion.

3. High Availability:

Cassandra favors availability and partition tolerance (AP in the CAP theorem) over strict
consistency, and is eventually consistent by default. This keeps the database available
even in the event of node failures.

It replicates data across multiple nodes, and if one node goes down, the data is still
available on another replica.

4. Data Model:

Cassandra uses a wide-column store data model, which organizes data into tables with
rows and columns. Each row is identified by a primary key, whose first part (the partition
key) determines which nodes store the row. Tables were called column families in older
versions.

New columns can be added to an existing table without downtime, offering flexible schema
management.

5. Write Optimization:

Cassandra is optimized for write-heavy workloads. It uses a log-structured storage system:
each write is appended to a commit log and stored in memory. Periodically, the in-memory
data is flushed to disk in the form of SSTables (Sorted String Tables).

This approach provides high throughput for writes, especially in systems with heavy insert
or update operations.

6. Replication and Fault Tolerance:

Cassandra allows configurable replication strategies. Data can be replicated across
multiple data centers for fault tolerance and disaster recovery.

The replication factor (how many copies of data are kept) can be set per keyspace (a
collection of tables).

7. Tunable Consistency:

Cassandra provides tunable consistency levels, allowing the user to choose between
consistency and performance based on the use case. Consistency can be set at different
levels, such as:

ONE: Only one replica needs to acknowledge the read or write.

QUORUM: A majority of replicas need to acknowledge the read or write.

ALL: All replicas must acknowledge the read or write.

This flexibility lets applications decide the tradeoff between speed and consistency.
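
The arithmetic behind these levels can be sketched in a few lines of Python. This is an illustrative calculation rather than driver code: the function names are hypothetical, and only the three levels listed above are modeled. It relies on the usual rule that a read is guaranteed to see the latest acknowledged write when the number of replicas read (R) plus the number written (W) exceeds the replication factor (RF).

```python
def quorum(replication_factor: int) -> int:
    """Smallest majority of replicas: floor(RF / 2) + 1."""
    return replication_factor // 2 + 1


def replicas_required(level: str, replication_factor: int) -> int:
    """Replica acknowledgements needed for a consistency level (hypothetical helper)."""
    required = {
        "ONE": 1,
        "QUORUM": quorum(replication_factor),
        "ALL": replication_factor,
    }
    return required[level]


def reads_see_latest_write(write_level: str, read_level: str,
                           replication_factor: int) -> bool:
    """True when the read and write replica sets must overlap (R + W > RF)."""
    w = replicas_required(write_level, replication_factor)
    r = replicas_required(read_level, replication_factor)
    return r + w > replication_factor


if __name__ == "__main__":
    rf = 3
    for level in ("ONE", "QUORUM", "ALL"):
        print(level, replicas_required(level, rf))
    print("QUORUM/QUORUM overlap:", reads_see_latest_write("QUORUM", "QUORUM", rf))
    print("ONE/ONE overlap:", reads_see_latest_write("ONE", "ONE", rf))
```

With a replication factor of 3, QUORUM writes paired with QUORUM reads always overlap on at least one replica, while ONE paired with ONE does not, which is the speed-versus-consistency tradeoff in concrete terms.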

8. CQL (Cassandra Query Language):

Cassandra uses CQL, which is similar to SQL but designed for the NoSQL model. CQL is
used to interact with the database, define schemas, insert data, and query data.

9. Secondary Indexes:

Cassandra supports secondary indexes on columns, though they should be used
cautiously due to performance trade-offs. Secondary indexes allow queries to be
performed on columns that are not part of the primary key.

10. Compaction:

Over time, data in Cassandra undergoes compaction, which is the process of merging
SSTables and removing deleted or obsolete data. This helps manage disk space and
optimize read performance.
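
Compaction can be pictured with a toy model in Python. The sketch below is a simplification under stated assumptions: an SSTable is represented as a plain dictionary from key to (timestamp, value), a value of None stands in for a deletion marker (tombstone), and tombstones are dropped immediately, whereas a real cluster keeps them for a grace period before removing them.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical model: an SSTable maps key -> (write_timestamp, value).
# A value of None stands in for a tombstone (a deletion marker).
Cell = Tuple[int, Optional[str]]
SSTable = Dict[str, Cell]


def compact(sstables: List[SSTable]) -> SSTable:
    """Merge SSTables, keep the newest cell per key, and drop deleted keys."""
    merged: SSTable = {}
    for table in sstables:
        for key, cell in table.items():
            # Keep the cell with the highest write timestamp for each key.
            if key not in merged or cell[0] > merged[key][0]:
                merged[key] = cell
    # Simplification: tombstones are purged right away in this toy model.
    live = {key: cell for key, cell in merged.items() if cell[1] is not None}
    # SSTables are sorted by key, so return the result in key order.
    return dict(sorted(live.items()))


if __name__ == "__main__":
    older = {"a": (1, "x"), "b": (1, "y")}
    newer = {"a": (2, "z"), "b": (3, None), "c": (2, "w")}
    print(compact([older, newer]))
```

Here the obsolete value of "a" and the deleted key "b" disappear from the merged table, which is how compaction reclaims disk space and reduces the number of files a read has to consult.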

How Cassandra Works:


1. Write Process:

When a write request is received, it is first written to the commit log for durability. The data
is then stored in memory (in a structure called Memtable) and is eventually flushed to disk
as SSTables.

Data is replicated according to the replication factor defined, ensuring high availability.
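
The single-node part of this write path can be sketched as a toy Python class. Everything here is illustrative: ToyNode and memtable_limit are hypothetical names, the flush threshold counts keys rather than bytes, and replication to other nodes is left out.

```python
from typing import Dict, List, Tuple


class ToyNode:
    """Toy single-node write path: commit log -> memtable -> SSTable flush."""

    def __init__(self, memtable_limit: int = 3) -> None:
        self.memtable_limit = memtable_limit  # assumed flush threshold, in keys
        self.commit_log: List[Tuple[str, str]] = []
        self.memtable: Dict[str, str] = {}
        self.sstables: List[List[Tuple[str, str]]] = []  # each one sorted by key

    def write(self, key: str, value: str) -> None:
        self.commit_log.append((key, value))  # 1. append to the log for durability
        self.memtable[key] = value            # 2. update the in-memory table
        if len(self.memtable) >= self.memtable_limit:
            self.flush()                      # 3. flush once the memtable is full

    def flush(self) -> None:
        if not self.memtable:
            return
        # An SSTable is immutable and sorted by key once written.
        self.sstables.append(sorted(self.memtable.items()))
        self.memtable.clear()
        self.commit_log.clear()  # flushed data no longer needs the log


if __name__ == "__main__":
    node = ToyNode(memtable_limit=2)
    node.write("b", "1")
    node.write("a", "2")  # reaches the limit and triggers a flush
    print(node.sstables)
```

Because a write only appends to the log and updates memory, no random disk I/O happens on the request path, which is the reason the design favors write-heavy workloads.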

2. Read Process:

When a read request is received, Cassandra checks the Memtable and the SSTables on disk.
Bloom filters help it quickly rule out SSTables that cannot contain the requested data, so
fewer files are read. Results from the Memtable and SSTables are merged, with the most
recent write winning. Depending on the consistency level, the coordinator also compares
responses from multiple replicas and returns the newest data.
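
The role of the Bloom filter in a read can be sketched in Python. This is a toy model under stated assumptions: the class names are hypothetical, SHA-256 is used only as a convenient hash, and the lookup stops at the newest table that has the key instead of merging cells by timestamp as a real read does.

```python
import hashlib
from typing import Dict, List, Optional


class BloomFilter:
    """Tiny Bloom filter: may give false positives, never false negatives."""

    def __init__(self, size: int = 256, hashes: int = 3) -> None:
        self.size = size
        self.hashes = hashes
        self.bits = [False] * size

    def _positions(self, key: str) -> List[int]:
        positions = []
        for i in range(self.hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            positions.append(int.from_bytes(digest[:8], "big") % self.size)
        return positions

    def add(self, key: str) -> None:
        for pos in self._positions(key):
            self.bits[pos] = True

    def might_contain(self, key: str) -> bool:
        return all(self.bits[pos] for pos in self._positions(key))


class ToySSTable:
    """Hypothetical on-disk table guarded by a Bloom filter."""

    def __init__(self, rows: Dict[str, str]) -> None:
        self.rows = dict(rows)
        self.bloom = BloomFilter()
        for key in rows:
            self.bloom.add(key)
        self.disk_reads = 0  # counts how often the table body is consulted

    def get(self, key: str) -> Optional[str]:
        if not self.bloom.might_contain(key):
            return None  # definitely absent: skip the "disk" read
        self.disk_reads += 1
        return self.rows.get(key)


def read(key: str, memtable: Dict[str, str],
         sstables: List[ToySSTable]) -> Optional[str]:
    """Check the memtable first, then SSTables from newest to oldest."""
    if key in memtable:
        return memtable[key]
    for table in reversed(sstables):  # assumes the list is ordered oldest first
        value = table.get(key)
        if value is not None:
            return value
    return None
```

A filter that answers "definitely not here" lets the read skip that SSTable entirely, while a "maybe" answer still requires checking the table, since false positives are possible.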

3. Data Replication:

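Cassandra uses the Gossip protocol for node-to-node communication about cluster
membership and node health. Gossip does not copy data itself: the node that receives a
write (the coordinator) forwards it to the replicas chosen by the replication strategy.
Replicas that miss a write are brought back in sync by mechanisms such as hinted handoff,
read repair, and anti-entropy repair.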
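
How gossip spreads cluster state can be illustrated with a small Python sketch. It is a deliberately reduced model: GossipNode is a hypothetical class, each node tracks only a heartbeat counter per known node, and peers are chosen by hand instead of at random. Real gossip messages carry more state than this.

```python
from typing import Dict


class GossipNode:
    """Toy gossip participant tracking a heartbeat counter per known node."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.heartbeats: Dict[str, int] = {name: 0}

    def beat(self) -> None:
        """Advance this node's own heartbeat to signal that it is alive."""
        self.heartbeats[self.name] += 1

    def exchange(self, peer: "GossipNode") -> None:
        """Both sides keep the highest heartbeat seen for every node."""
        for name in set(self.heartbeats) | set(peer.heartbeats):
            newest = max(self.heartbeats.get(name, -1),
                         peer.heartbeats.get(name, -1))
            self.heartbeats[name] = newest
            peer.heartbeats[name] = newest


if __name__ == "__main__":
    a, b, c = GossipNode("a"), GossipNode("b"), GossipNode("c")
    a.beat()
    a.exchange(b)  # b learns about a
    b.exchange(c)  # c learns about a through b, without talking to a
    print(c.heartbeats)
```

State about node "a" reaches node "c" without the two ever communicating directly. A node whose heartbeat stops advancing in its peers' views can then be suspected of being down.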

4. Partitioning:

Data is partitioned across nodes using a partition key. Cassandra uses a consistent
hashing algorithm to determine which node stores a given piece of data. With a well-chosen
partition key, this spreads data evenly across the cluster and avoids “hot spots.”
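
Consistent hashing can be sketched as a token ring in Python. The sketch rests on several assumptions: Ring and token are hypothetical names, SHA-256 stands in for the Murmur3 hash used by Cassandra's default partitioner, each node gets a single token, and replicas are simply the next nodes clockwise on the ring with no awareness of racks or data centers.

```python
import bisect
import hashlib
from typing import List


def token(value: str) -> int:
    """Map a string onto the ring. SHA-256 is a stand-in hash for illustration."""
    return int.from_bytes(hashlib.sha256(value.encode()).digest()[:8], "big")


class Ring:
    """Toy token ring with one token per node."""

    def __init__(self, nodes: List[str]) -> None:
        self.entries = sorted((token(node), node) for node in nodes)
        self.tokens = [t for t, _ in self.entries]

    def replicas(self, partition_key: str, replication_factor: int = 1) -> List[str]:
        """Walk clockwise from the key's token and take the next RF nodes."""
        # First node whose token is >= the key's token; wraps around at the end.
        start = bisect.bisect_left(self.tokens, token(partition_key))
        count = min(replication_factor, len(self.entries))
        return [self.entries[(start + i) % len(self.entries)][1]
                for i in range(count)]


if __name__ == "__main__":
    ring = Ring(["node-a", "node-b", "node-c", "node-d"])
    print(ring.replicas("user:42", replication_factor=3))
```

Because a key's owner depends only on the nearest token clockwise, adding or removing a node moves only the keys in the neighboring token range; the rest of the data stays where it is.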

Use Cases:

Real-Time Analytics: Due to its fast write and read capabilities, Cassandra is ideal for
applications that require real-time analytics on massive datasets.

IoT and Time-Series Data: Cassandra is commonly used to store time-series data, as it is
capable of handling high write throughput and large volumes of data from IoT devices.

Social Media and Messaging: Cassandra is well-suited for systems that need to handle
high velocity and volume of messages, such as social media platforms.

Advantages:

High availability and fault tolerance.

Scalable architecture that handles large amounts of data.

Optimized for write-heavy workloads.

Flexible data model that allows dynamic schema changes.


Disadvantages:

Limited support for complex queries (e.g., JOINs and aggregations), making it less suited
for traditional relational use cases.

Tunable consistency puts the burden on the application: weaker levels can return stale
data, so ensuring data consistency across large clusters takes care.

Requires careful configuration and management for optimal performance, especially as
the dataset grows.

In summary, Apache Cassandra is a highly scalable, distributed database system that
excels in scenarios requiring high availability, fault tolerance, and large-scale data
processing. It is often chosen for applications that need to handle massive amounts of
data with low-latency reads and writes.
