Messaging Architectures: Messaging Models
1. Point to Point
2. Publish and Subscribe
Kafka is an example of the publish-and-subscribe messaging model.
Kafka Overview
• Kafka is a distributed publish-subscribe messaging system written in the Scala language. It runs on the Java Virtual Machine (JVM) and supports clients in multiple languages.
• Kafka relies on another service named Zookeeper – a distributed coordination system – to function.
• Kafka delivers high throughput and is built to scale out across multiple servers in a distributed model.
• Kafka persists messages on disk and can be used for batched consumption as well as real-time applications.
Key Terminology
• Kafka maintains feeds of messages in categories called topics.
• Processes that publish messages to a Kafka topic are called producers.
• Processes that subscribe to topics and process the feed of published messages are called consumers.
• Kafka runs as a cluster of one or more servers, each of which is called a broker.
• All components communicate through a simple, high-performance binary API over TCP.
Kafka Architecture
[Figure: a Kafka cluster containing four brokers, with two producers, two consumers, and Zookeeper.]
Understanding Kafka
• Kafka is based on the simple storage-abstraction concept called a log, an append-only totally-ordered sequence
of records ordered by time.
• Records are appended to the end of the log, and reads proceed from left to right (oldest to newest) in the log (or topic).
• Each entry is assigned a unique sequential log-entry number (an offset).
• The offset plays the role of a “timestamp” for the entry: it orders entries relative to one another, but it is decoupled from any physical clock, which is unreliable across the machines of a distributed system.
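The log abstraction above can be sketched in a few lines. This is a toy in-memory model, not Kafka's storage code; the class name `Log` and its methods are hypothetical and exist only to show how appending assigns sequential offsets and how reads start from an offset.

```python
class Log:
    """Toy append-only log (hypothetical; not Kafka's actual implementation)."""

    def __init__(self):
        self._records = []

    def append(self, record):
        """Append a record to the end of the log and return its offset."""
        self._records.append(record)
        return len(self._records) - 1

    def read(self, offset, max_records=1):
        """Return up to max_records records, starting at the given offset."""
        return self._records[offset:offset + max_records]


log = Log()
first = log.append("user signed up")
second = log.append("user logged in")
print(first, second)      # 0 1
print(log.read(0, 2))     # ['user signed up', 'user logged in']
```

Offsets here are just list positions, so ordering comes from the append sequence alone and no clock is involved.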
Kafka Key Design Concepts
• A log is analogous to a file or table in which records are appended and ordered by time.
• Conceptually, the log is a natural data-structure for handling data-flow between systems.
• Kafka is designed for centralizing an organization’s data into an enterprise log (message bus) for real-time
subscription by other subscribers or application consumers.
Kafka Conceptual Design
• Each logical data source can be modeled as a log corresponding to a topic or data feed in Kafka.
• Each subscribing application should read as quickly as it can from each topic, persist each record it reads into its own data store, and advance its offset to the next message entry to be read.
• Subscribers can be any type of data system or middleware system like a cache, Hadoop, a streaming system like
Spark or Storm, a search system, a web services provisioning system, a data warehouse, etc.
• In Kafka, partitioning is applied to the log/topic in order to allow horizontal scaling.
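The read, persist, and advance cycle described above can be sketched as follows. This is a simplified model under stated assumptions: the topic log is a plain Python list, the subscriber's data store is another list, and the function name `consume` and its `batch_size` parameter are hypothetical.

```python
def consume(log, offset, store, batch_size=2):
    """Read a batch starting at offset, persist it, and return the next offset."""
    batch = log[offset:offset + batch_size]
    store.extend(batch)            # persist into the subscriber's own data store
    return offset + len(batch)     # advance to the next unread entry


topic_log = ["r0", "r1", "r2", "r3", "r4"]   # stand-in for one topic's log
store = []                                    # stand-in for the subscriber's store
offset = 0
while offset < len(topic_log):
    offset = consume(topic_log, offset, store)

print(offset)   # 5
print(store)    # ['r0', 'r1', 'r2', 'r3', 'r4']
```

Because the subscriber tracks its own offset, it can stop and later resume from the same position without the log needing to know anything about it.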
Kafka Logical Design
• Each partition is a totally ordered log within a topic, and there is no global ordering between partitions.
• The publisher controls which partition each message is assigned to: it can choose the partition from a unique identification key, or it can let messages be assigned to partitions at random.
• Because partitions can be spread across brokers, partitioning allows throughput to scale roughly linearly with the Kafka cluster size.
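A minimal sketch of publisher-side partition selection, assuming a keyed message is hashed to a partition and an unkeyed message is placed at random. The function name `choose_partition` is hypothetical, and CRC32 is used only as a convenient stable hash; Kafka's own clients use a different hash function.

```python
import random
import zlib


def choose_partition(key, num_partitions, rng=random):
    """Pick a partition: hash the key if present, otherwise choose at random."""
    if key is None:
        return rng.randrange(num_partitions)
    # CRC32 is an illustrative stand-in for a stable hash of the key.
    return zlib.crc32(key.encode("utf-8")) % num_partitions


p1 = choose_partition("user-42", 4)
p2 = choose_partition("user-42", 4)
print(p1 == p2)                     # True: the same key maps to the same partition
print(choose_partition(None, 4))    # some partition in 0..3
```

Routing by key keeps all messages for one key in a single partition, so they stay in order relative to each other.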
Kafka Topics
• Kafka topics should have a small number of consumer groups assigned with each one representing a “logical
subscriber”.
• Kafka topic consumption can be scaled by increasing the number of consumer instances within the same group; Kafka automatically load-balances message consumption across them.
• Kafka partitions each topic so that the topic can be consumed in parallel.
• Partitions in a topic are assigned to the consumers within a consumer group.
• Within a consumer group, each partition is consumed by only one instance, so no more instances can be active than there are partitions in the topic; any extra instances sit idle.
• If consumers must see messages in the exact total order in which they were published, the topic must use a single partition, which in turn means only one active consumer process in the consumer group.
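The ordering trade-off above can be illustrated with a small sketch. It assumes messages are `(sequence_number, key)` pairs with small integer keys and that the partition is simply `key % num_partitions`; the function name `publish` is hypothetical.

```python
def publish(messages, num_partitions):
    """Distribute (sequence_number, key) messages over partitions by key."""
    partitions = [[] for _ in range(num_partitions)]
    for seq, key in messages:
        partitions[key % num_partitions].append(seq)
    return partitions


# Six messages published in order 0..5, with keys cycling through 0, 1, 2.
messages = [(seq, seq % 3) for seq in range(6)]

single = publish(messages, 1)
multi = publish(messages, 3)
print(single)   # [[0, 1, 2, 3, 4, 5]]
print(multi)    # [[0, 3], [1, 4], [2, 5]]
```

With one partition the publish order is preserved end to end. With three partitions each partition is still ordered internally, but there is no defined order across partitions, so a reader combining them may see messages in a different sequence than they were published.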
Kafka Topic Partitions
• A topic consists of partitions.
• Partition: ordered + immutable sequence of messages that is continually appended to
• The number of partitions in a topic is configurable.
• The number of partitions determines the maximum parallelism of a consumer group.
– Cf. parallelism of Storm’s KafkaSpout via builder.setSpout(,,N)
– Consumer group A, with 2 consumers, reads from a 4-partition topic
– Consumer group B, with 4 consumers, reads from the same topic
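The two consumer groups above can be modeled with a simple round-robin assignment. This is a sketch under stated assumptions: the function name `assign_partitions` and the consumer names are hypothetical, and Kafka's actual assignment strategy may distribute partitions differently.

```python
def assign_partitions(num_partitions, consumers):
    """Spread a topic's partitions round-robin over one group's consumers."""
    assignment = {consumer: [] for consumer in consumers}
    for partition in range(num_partitions):
        owner = consumers[partition % len(consumers)]
        assignment[owner].append(partition)
    return assignment


group_a = assign_partitions(4, ["a1", "a2"])
group_b = assign_partitions(4, ["b1", "b2", "b3", "b4"])
oversized = assign_partitions(4, ["c1", "c2", "c3", "c4", "c5"])

print(group_a)           # {'a1': [0, 2], 'a2': [1, 3]}
print(group_b)           # {'b1': [0], 'b2': [1], 'b3': [2], 'b4': [3]}
print(oversized["c5"])   # []
```

Each group receives every partition exactly once, so both groups see the full topic. In the oversized group the fifth consumer gets no partition, which is why the partition count caps a group's parallelism.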
Kafka Consumer Groups
• Kafka assigns the partitions in a topic to the consumer instances in a consumer group to provide ordering guarantees and load balancing over a pool of consumer processes. Note that a group cannot usefully have more consumer instances than the topic has partitions.