Why use Kafka in the first place?

Let’s look at the problem at LinkedIn that originally inspired Kafka. The problem is simple: LinkedIn was collecting a lot of logging data, such as log messages, metrics, events, and other monitoring/observability data, from multiple services. They wanted to use this data in two ways:

  • Have an online near-real-time system that can process and analyze this data.
  • Have an offline system that can process this data over a longer period.

Most of the processing was done for analysis, for example, analyzing user behavior and how people use LinkedIn.

Requirement Gathering

The problem is easy to understand, but the solution can seem pretty complex, because the problem itself comes with many constraints and requirements. Here are some of the requirements such a system must meet:

  • The system should be highly scalable. Popular products can generate tens or hundreds of TBs of events, metrics, and logs daily, which calls for an almost linearly scalable distributed system to handle such high throughput.
    This matters because the system must support extremely high traffic: easily hundreds of thousands of messages per second.
  • It should allow “producers” to send messages and “consumers” to subscribe to certain messages. This is important since there can be multiple consumers (like the online and offline systems we discussed) of the same message, and messages are generally processed asynchronously.
    Consumers should also be able to decide how and when to consume messages. For example, in the problem we discussed, we’d want one consumer to consume messages as soon as possible and the other to do so every few hours.
  • Messages can be immutable (there is no need to delete log data, after all), and transaction-like semantics and complex delivery guarantees aren’t important requirements.
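
The pull-based, multi-consumer model these requirements describe can be sketched as a toy in-memory log. The class and method names below are purely illustrative, not Kafka’s actual API: the key idea is that each consumer tracks its own read offset, so a near-real-time consumer and an offline batch consumer can read the same messages at entirely different paces.

```python
# Toy append-only log with pull-based consumers (illustrative names,
# not Kafka's real API): every consumer keeps its own offset, so the
# log can serve a real-time reader and a batch reader simultaneously.

class Log:
    def __init__(self):
        self.messages = []

    def append(self, msg):
        self.messages.append(msg)

    def read(self, offset, max_count=100):
        """Pull up to max_count messages starting at offset."""
        return self.messages[offset:offset + max_count]

class Consumer:
    def __init__(self, log):
        self.log = log
        self.offset = 0  # each consumer decides when and how far to read

    def poll(self, max_count=100):
        batch = self.log.read(self.offset, max_count)
        self.offset += len(batch)
        return batch

log = Log()
for i in range(5):
    log.append(f"event-{i}")

realtime = Consumer(log)   # polls frequently, in small batches
offline = Consumer(log)    # polls rarely, in large batches

print(realtime.poll(max_count=2))  # ['event-0', 'event-1']
print(offline.poll())              # all five events, independently
```

Because consumption state lives with the consumer rather than the broker, adding another consumer costs the log nothing: it is just another offset into the same immutable sequence.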

Message Brokers vs Kafka

It might seem that traditional message brokers such as RabbitMQ or ActiveMQ could solve the above problem, but they cannot. Let’s see why:

  • Message batching: Since the consumer pulls a lot of messages, it doesn’t make sense to pull them one by one. Most of the time, you’d want to batch messages; otherwise, most of your time would be wasted on network calls.
    Since message brokers aren’t really designed for such high throughput, they generally don’t provide good ways to batch messages.
  • Different consumers with different consumption requirements: We discussed having two types of consumers: an online system that processes messages in real time, and an offline system that might want to read the messages received in the past twelve or twenty-four hours.
    This pattern doesn’t work with most message brokers or queues. Some message brokers, like RabbitMQ, use a push-based model, pushing messages from the broker to the consumer. This leaves less flexibility for the consumer, since the consumer cannot decide how and when to consume messages.
  • Small and simple messages: Messages are generally larger in most message brokers. This isn’t a bug; it’s by design. Message brokers often support many features, such as different options for routing messages, message guarantees, and acknowledging every message individually, which lead to large per-message headers.
    Large messages are fine as long as you don’t have a lot of them and you don’t have to store them, but that is precisely what we want to do in our system.
  • Distributed high-throughput system: One of the most important requirements is very high throughput. We want to support hundreds of thousands of messages per second, even going up to millions per second. Running this system on a single node is infeasible.
    We need a distributed system that can support this throughput, something many message brokers don’t offer.
  • Large queues: Support for very large queue sizes varies across message brokers and configurations, and the internet is filled with reports of people hitting queue-size limits in practice.
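
The batching point above is worth quantifying. Here is a back-of-the-envelope sketch, with assumed, illustrative numbers for per-call overhead and batch size, showing how amortizing a fixed network cost across a batch shrinks total overhead roughly by the batch size:

```python
# Why batching matters at high throughput (the overhead and batch-size
# numbers below are assumptions for illustration, not measurements).

def round_trips(num_messages, batch_size):
    """Number of network calls needed to fetch num_messages."""
    return -(-num_messages // batch_size)  # ceiling division

NUM_MESSAGES = 100_000
PER_CALL_OVERHEAD_MS = 0.5  # assumed fixed cost per network call

one_by_one = round_trips(NUM_MESSAGES, 1) * PER_CALL_OVERHEAD_MS
batched = round_trips(NUM_MESSAGES, 500) * PER_CALL_OVERHEAD_MS

print(f"one-by-one: {one_by_one:.0f} ms of call overhead")  # 50000 ms
print(f"batched:    {batched:.0f} ms of call overhead")     # 100 ms
```

With these assumptions, fetching one message at a time spends fifty seconds purely on call overhead, versus a tenth of a second when pulling 500 at a time; at hundreds of thousands of messages per second, per-message calls simply don’t fit in the budget.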

So, with the above requirements in mind, let’s now look at what the architecture of such a system should be.

High-Level Design (HLD) of Apache Kafka

Apache Kafka is a distributed data store optimized for ingesting and processing streaming data in real time with low latency. It can handle the constant inflow of data generated sequentially and incrementally by thousands of data sources.


Kafka Storage Layout

Kafka has a very simple storage layout. Each partition of a topic corresponds to a logical log. Physically, a log is implemented as a set of segment files of approximately the same size (e.g., 1 GB). Every time a producer publishes a message to a partition, the broker simply appends the message to the last segment file.
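
This segment-file layout can be sketched in a few lines. The following is a toy in-memory model, with a deliberately tiny segment size for illustration, not Kafka’s real file-based implementation:

```python
# Minimal sketch of a partition's storage layout: a logical log split
# into roughly fixed-size segments, with appends always going to the
# last (active) segment. The segment size is tiny for illustration;
# Kafka's real segments are on the order of 1 GB on disk.

SEGMENT_SIZE = 64  # bytes per segment (illustrative)

class Partition:
    def __init__(self):
        self.segments = [[]]      # each segment is a list of messages
        self.segment_bytes = [0]  # bytes used in each segment

    def append(self, message: bytes):
        if self.segment_bytes[-1] + len(message) > SEGMENT_SIZE:
            self.segments.append([])   # roll over to a fresh segment
            self.segment_bytes.append(0)
        self.segments[-1].append(message)
        self.segment_bytes[-1] += len(message)

p = Partition()
for i in range(10):
    p.append(f"log-message-{i:02d}".encode())  # 14 bytes each

print(len(p.segments))                    # 3
print([len(s) for s in p.segments])       # [4, 4, 2]
```

Because writes only ever touch the tail segment, appends are sequential, and older segments become immutable files that can be read, retained, or expired independently.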
