Why partition your data in Kafka?
If you have more load than a single instance of your application can handle, you need to partition your data. How you partition serves as the load balancing for the downstream application. The producer clients decide which topic partition the data ends up in, but it's what the consumer applications do with that data that drives the decision logic.
How does Kafka partitioning improve performance?
Kafka partitioning allows for
- parallel data processing (and hence load distribution),
- multiple consumers working on different partitions simultaneously,
- and therefore higher overall throughput.
What factors should you consider when determining the number of partitions?
Choose the number of partitions based on factors like
- expected data volume,
- the number of consumers,
- and the desired level of parallelism.
It's essential to strike a balance: under-partitioning caps consumer parallelism, while over-partitioning adds broker overhead (more open file handles, longer leader elections, higher end-to-end latency). The Confluent post below covers these trade-offs in depth.
https://www.confluent.io/blog/how-choose-number-topics-partitions-kafka-cluster/
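For concreteness, here is a minimal sketch of setting an explicit partition count when creating a topic with the Java AdminClient. The topic name, partition count, replication factor, and broker address are all illustrative, not prescribed by the sources above:

```java
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;

// Sketch: the partition count is fixed at topic creation time, so the sizing
// decision discussed above has to be made up front (it can be increased later,
// but never decreased).
public class CreatePartitionedTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");  // placeholder

        try (AdminClient admin = AdminClient.create(props)) {
            // 20 partitions, replication factor 3 (illustrative values)
            NewTopic topic = new NewTopic("site-activity", 20, (short) 3);
            admin.createTopics(List.of(topic)).all().get();
        }
    }
}
```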
- If you know the average processing time per message, you can calculate the number of partitions required to keep up. For example, if each message takes 100 ms to process and you receive 5,000 messages a second, each consumer can only handle 10 messages a second, so you'll need at least 500 partitions (5,000 / 10; both this and the next example are worked in the code sketch after the formula below).
- For another example, if you want to be able to read 1,000 MB/sec but a single consumer can only process 50 MB/sec, you need at least 20 partitions and 20 consumers in the consumer group. Similarly, if you want to achieve the same for producers and one producer can only write at 100 MB/sec, you need 10 partitions. In this case, with 20 partitions you can maintain 1 GB/sec for both producing and consuming. You should match the exact number of partitions to the number of consumers or producers, so that each consumer and producer achieves its target throughput.
So a simple formula could be:
- #Partitions = max(NP, NC), where:
  - NP is the number of required producers, determined by calculating TT/TP
  - NC is the number of required consumers, determined by calculating TT/TC
  - TT is the total expected throughput for our system
  - TP is the max throughput of a single producer to a single partition
  - TC is the max throughput of a single consumer from a single partition
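A minimal sketch of this formula in code (plain arithmetic, no Kafka dependencies), checked against both examples above. Throughput can be in any unit, MB/sec or messages/sec, as long as it is consistent:

```java
// Minimal sketch of #Partitions = max(NP, NC) from the formula above.
public class PartitionSizing {
    // tt: total expected throughput; tp/tc: max throughput of one
    // producer/consumer against a single partition (same units as tt).
    static int requiredPartitions(double tt, double tp, double tc) {
        int np = (int) Math.ceil(tt / tp);  // NP = TT / TP
        int nc = (int) Math.ceil(tt / tc);  // NC = TT / TC
        return Math.max(np, nc);
    }

    public static void main(String[] args) {
        // Throughput example: TT = 1000 MB/s, TP = 100 MB/s, TC = 50 MB/s
        System.out.println(requiredPartitions(1000, 100, 50));    // max(10, 20) = 20

        // Latency example: 5,000 msg/s total; one consumer handles 10 msg/s
        // (100 ms per message). Assumption: a single producer can easily
        // exceed the incoming rate, so TP is not the bottleneck here.
        System.out.println(requiredPartitions(5_000, 5_000, 10)); // max(1, 500) = 500
    }
}
```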
Strategies
It depends upon
- the data producing and consumption patterns (when structuring your data for Kafka, it really depends on how it's meant to be consumed),
- the partition structure you choose, which will depend largely on how you want to process the event stream; ideally you want a partition key that makes your event processing partition-local,
- whether processing is stateless or stateful, and the desired level of consumer parallelism.
1) Random partitioning
- When to use? When processing is stateless and you need a lot of parallelism. Random partitioning results in the evenest spread of load across consumers, and thus makes scaling the consumers easier. It is particularly suited for stateless or “embarrassingly parallel” services.
This is effectively what you get when using the default partitioner while not manually specifying a partition or a message key. To get an efficiency boost, the default partitioner in Kafka from version 2.4 onwards uses a “sticky” algorithm, which groups all messages in a batch to the same random partition.
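For illustration, a minimal keyless producer sketch, assuming a local broker; the topic name is a placeholder. With no key and no explicit partition, the default (sticky) partitioner picks the partition:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class RandomPartitioningSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");  // placeholder
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // No key, no explicit partition: the default (sticky) partitioner
            // fills a batch on one random partition, then picks another.
            producer.send(new ProducerRecord<>("events", "payload"));
        }
    }
}
```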
2) Aggregate Partitioning
For example:
- If you care about the most popular pages on your website, you should partition by the viewed page. A consumer can then keep a count of a given page's views just by looking at the events in a single partition.
- If you care about users' average time-on-site, you should partition by user ID. That way, all the events related to a single user's site activity will be available within the same partition. This means that a stream processing engine such as Apache Samza can calculate average time-on-site for a given user just by looking at the events in a single partition, avoiding any kind of costly partition-global processing.
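The sketch below mirrors what Kafka's default partitioner does for keyed records, using the client library's own murmur2 helper; the partition count and user IDs are illustrative. The point it demonstrates: the same key always maps to the same partition, which is what makes partition-local aggregation possible.

```java
import java.nio.charset.StandardCharsets;
import org.apache.kafka.common.utils.Utils;

// Sketch: a non-null key is hashed (murmur2) and taken modulo the partition
// count. "user-1" maps to the same partition every time it appears.
public class KeyPartitioningSketch {
    public static void main(String[] args) {
        int numPartitions = 20;  // illustrative
        for (String userId : new String[]{"user-1", "user-2", "user-1"}) {
            byte[] key = userId.getBytes(StandardCharsets.UTF_8);
            int partition = Utils.toPositive(Utils.murmur2(key)) % numPartitions;
            System.out.println(userId + " -> partition " + partition);
        }
    }
}
```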
3) Ordering Guarantee
Kafka only guarantees ordering within a single partition, so when consumers must see events in the order they were produced, partition by the entity whose ordering matters. For example, a streaming service showing real-time data for multiple basketball matches can partition by match ID: all events for a given match land in one partition and are therefore consumed in order.
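A minimal sketch under those assumptions (topic name, match ID, and broker address are illustrative): keying by match ID pins each match to one partition, and enabling idempotence keeps that per-partition order intact across producer retries.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class OrderedMatchEvents {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");  // placeholder
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());
        props.put("enable.idempotence", "true");  // preserves order across retries
        props.put("acks", "all");                 // required by idempotence

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Same key ("match-1077") -> same partition -> consumed in order.
            producer.send(new ProducerRecord<>("match-events", "match-1077", "tip-off"));
            producer.send(new ProducerRecord<>("match-events", "match-1077", "2pt: home"));
        }
    }
}
```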
Consumer partition assignment
Whenever a consumer enters or leaves a consumer group, the brokers rebalance the partitions across consumers, meaning Kafka handles load balancing with respect to the number of partitions per application instance for you. This is great—it’s a major feature of Kafka. We use consumer groups on nearly all our services.
By default, when a rebalance happens, all consumers drop their partitions and are reassigned new ones (the “eager” protocol). If your application keeps state associated with the consumed data, as our aggregator service does, you need to drop that state and start fresh with data from the newly assigned partitions.
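To make this concrete, here is a minimal consumer sketch (group ID, topic, and broker address are illustrative) that registers a ConsumerRebalanceListener; the revoke callback is where a stateful service would flush or drop partition-local state:

```java
import java.time.Duration;
import java.util.Collection;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRebalanceListener;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringDeserializer;

public class RebalanceAwareConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");  // placeholder
        props.put("group.id", "aggregator");               // illustrative
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("site-activity"), new ConsumerRebalanceListener() {
                @Override
                public void onPartitionsRevoked(Collection<TopicPartition> partitions) {
                    // With the eager protocol this fires for ALL owned partitions:
                    // flush or discard any state tied to them here.
                    System.out.println("Revoked: " + partitions);
                }
                @Override
                public void onPartitionsAssigned(Collection<TopicPartition> partitions) {
                    // Rebuild state for the newly assigned partitions here.
                    System.out.println("Assigned: " + partitions);
                }
            });
            while (true) {
                consumer.poll(Duration.ofMillis(500)).forEach(r ->
                        System.out.println(r.partition() + ": " + r.value()));
            }
        }
    }
}
```

Since Kafka 2.4, an incremental “cooperative” rebalance protocol is also available via the CooperativeStickyAssignor, which avoids revoking every partition on each rebalance.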
References
- https://stackoverflow.com/questions/17205561/data-modeling-with-kafka-topics-and-partitions
- https://newrelic.com/blog/best-practices/effective-strategies-kafka-topic-partitioning
- https://stackoverflow.com/questions/61518436/spark-streaming-handle-skewed-kafka-partitions
  - (from that thread) Do not have one streaming job reading from 1,000 topics. Put those with the biggest load into separate streaming job(s). Reconfigure; it's that simple. Load balancing, queuing theory.
- https://www.sciencedirect.com/science/article/abs/pii/S0167739X17314784