My Tech Learnings: Kafka Partitioning Strategies

Why partition your data in Kafka?

If you have so much load that you need more than a single instance of your application, you need to partition your data. How you partition serves as your load balancing for the downstream application. The producer clients decide which topic partition that the data ends up in, but it’s what the consumer applications do with that data that drives the decision logic.

How does Kafka partitioning improve performance?

Kafka partitioning allows for

parallel data processing, (hence load distribution)
enabling multiple consumers to work on different partitions simultaneously.
This helps achieve higher throughput

What factors should you consider when determining the number of partitions?

Choose the number of partitions based on factors like

expected data volume,
the number of consumers,
and the desired level of parallelism.

It's essential to strike a balance to avoid over-partitioning or under-partitioning, which can impact performance.

https://www.confluent.io/blog/how-choose-number-topics-partitions-kafka-cluster/

● so if you know the average processing time per message, then you should be able to calculate the number of partitions required to keep up. For example if each message takes 100ms to process and you receive 5k a second then you'll need at least 50 partitions

● example, if you want to be able to read 1000MB/sec, but your consumer is only able process 50 MB/sec, then you need at least 20 partitions and 20 consumers in the consumer group. Similarly, if you want to achieve the same for producers, and 1 producer can only write at 100 MB/sec, you need 10 partitions. In this case, if you have 20 partitions, you can maintain 1 GB/sec for producing and consuming messages. You should adjust the exact number of partitions to number of consumers or producers, so that each consumer and producer achieve their target throughput.

So a simple formula could be:

#Partitions = max(NP, NC) where:
NP is the number of required producers determined by calculating: TT/TP
NC is the number of required consumers determined by calculating: TT/TC
TT is the total expected throughput for our system
TP is the max throughput of a single producer to a single partition
TC is the max throughput of a single consumer from a single partition

Strategies

It depends upon

Data producing pattern , consumption pattern (When structuring your data for Kafka it really depends on how it´s meant to be consumed)

the partition structure you choose will depend largely on how you want to process the event stream. Ideally you want a partition key which means that your event processing is partition-local.

~~Stateless vs Stateful~~
Desired level of parallel consumers

1) Random partitioning

- when to use ? when stateless & lot of parallelism ... Random partitioning results in the evenest spread of load for consumers, and thus makes scaling the consumers easier. It is particularly suited for stateless or “embarrassingly parallel” services.

This is effectively what you get when using the default partitioner while not manually specifying a partition or a message key. To get an efficiency boost, the default partitioner in Kafka from version 2.4 onwards uses a “sticky” algorithm, which groups all messages to the same random partition for a batch.

2) Aggregate Partitioning

example ,

If you care about the most popular pages on your website, you should partition by the :viewed page. Again, Consumer will be able to keep a count of a given page's views just by looking at the events in a single partition

If you care about users' average time-on-site, then you should partition by :user-id. That way, all the events related to a single user's site activity will be available within the same partition. This means that a stream processing engine such as Apache Samza can calculate average time-on-site for a given user just by looking at the events in a single partition. This avoids having to perform any kind of costly partition-global processing

3) Ordering Guarantee

example a streaming service showing data for multiple basketball matches (real time)

Consumer partition assignment

Whenever a consumer enters or leaves a consumer group, the brokers rebalance the partitions across consumers, meaning Kafka handles load balancing with respect to the number of partitions per application instance for you. This is great—it’s a major feature of Kafka. We use consumer groups on nearly all our services.

By default, when a rebalance happens, all consumers drop their partitions and are reassigned new ones (which is called the “eager” protocol). If you have an application that has a state associated with the consumed data, like our aggregator service, for example, you need to drop that state and start fresh with data from the new partition.

References

https://stackoverflow.com/questions/17205561/data-modeling-with-kafka-topics-and-partitions
https://newrelic.com/blog/best-practices/effective-strategies-kafka-topic-partitioning
https://stackoverflow.com/questions/61518436/spark-streaming-handle-skewed-kafka-partitions

Do not have 1 Streaming Job reading from a 1000 topics. Put those with biggest load into separate Streaming Job(s). Reconfigure, that simple. Load balancing, queuing theory.

https://www.sciencedirect.com/science/article/abs/pii/S0167739X17314784

My Tech Learnings

Wednesday, July 24, 2024

Kafka Partitioning Strategies

No comments:

About Me