Why partition your data in Kafka?
If you have so much load that you need more than a single instance of your application, you need to partition your data. How you partition serves as the load balancing for the downstream application. The producer clients decide which topic partition the data ends up in, but it is what the consumer applications do with that data that should drive the decision logic.
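As a minimal illustration of the producer side picking the partition, here is a Java producer sketch; the broker address, topic name, key, and value are hypothetical placeholders. With a record key, the default partitioner hashes the key, so all events with the same key land in the same partition.

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;
import java.util.Properties;

public class KeyedProducerExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder broker
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Keyed record: the default partitioner hashes "customer-42", so every
            // event for that customer is routed to the same partition.
            producer.send(new ProducerRecord<>("orders", "customer-42", "{\"amount\": 19.99}"));
        }
    }
}
```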
How does Kafka partitioning improve performance?
Kafka partitioning allows for:
- parallel data processing (and hence load distribution),
- enabling multiple consumers to work on different partitions simultaneously.
This helps achieve higher throughput.
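A minimal consumer sketch of that parallelism, assuming a hypothetical `orders` topic and `order-processors` group: run several copies of this process with the same `group.id`, and Kafka assigns each instance a disjoint subset of the partitions.

```java
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;
import java.time.Duration;
import java.util.List;
import java.util.Properties;

public class GroupConsumerExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder broker
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "order-processors"); // same group id on every instance
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("orders"));
            while (true) {
                // Each instance only sees records from the partitions assigned to it.
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("partition=%d key=%s value=%s%n",
                            record.partition(), record.key(), record.value());
                }
            }
        }
    }
}
```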
What factors should you consider when determining the number of partitions?
Choose the number of partitions based on factors like
- expected data volume,
- the number of consumers,
- and the desired level of parallelism.
It's essential to strike a balance to avoid over-partitioning or under-partitioning, which can impact performance.
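Once you have settled on a count, a sketch of creating the topic with an explicit partition count via the Kafka AdminClient; the topic name, partition count, and replication factor below are placeholder values you would derive from the factors above.

```java
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;
import java.util.List;
import java.util.Properties;

public class CreateTopicExample {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder broker

        try (AdminClient admin = AdminClient.create(props)) {
            // 12 partitions, replication factor 3 -- illustrative numbers only
            NewTopic topic = new NewTopic("orders", 12, (short) 3);
            admin.createTopics(List.of(topic)).all().get();
        }
    }
}
```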
https://www.confluent.io/blog/how-choose-number-topics-partitions-kafka-cluster/
- If you know the average processing time per message, you should be able to calculate the number of partitions required to keep up. For example, if each message takes 100 ms to process (10 messages/sec per consumer) and you receive 5,000 messages a second, you'll need at least 500 partitions.
- For example, if you want to be able to read 1000 MB/sec, but your consumer is only able to process 50 MB/sec, then you need at least 20 partitions and 20 consumers in the consumer group. Similarly, if you want to achieve the same for producers, and one producer can only write at 100 MB/sec, you need 10 partitions. In this case, if you have 20 partitions, you can maintain 1 GB/sec for producing and consuming messages. You should match the exact number of partitions to the number of consumers or producers, so that each consumer and producer achieves its target throughput.
So a simple formula could be:
- #Partitions = max(NP, NC), where:
  - NP is the number of partitions required to meet producer throughput: TT/TP
  - NC is the number of partitions required to meet consumer throughput: TT/TC
  - TT is the total expected throughput for our system
  - TP is the max throughput of a single producer to a single partition
  - TC is the max throughput of a single consumer from a single partition
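A worked sketch of that formula, plugging in the hypothetical numbers from the bullets above (1000 MB/sec target, 100 MB/sec per producer, 50 MB/sec per consumer), plus the latency-based check from the 100 ms example:

```java
public class PartitionCountEstimate {
    public static void main(String[] args) {
        double tt = 1000.0; // TT: total expected throughput, MB/sec
        double tp = 100.0;  // TP: max throughput of one producer to one partition, MB/sec
        double tc = 50.0;   // TC: max throughput of one consumer from one partition, MB/sec

        long np = (long) Math.ceil(tt / tp); // 10 partitions needed on the produce side
        long nc = (long) Math.ceil(tt / tc); // 20 partitions needed on the consume side
        System.out.println("#Partitions = max(NP, NC) = " + Math.max(np, nc)); // 20

        // Latency-based check from the earlier example:
        // 5,000 msg/sec * 0.100 sec/msg = 500 partitions to keep consumers caught up.
        long byLatency = (long) Math.ceil(5000 * 0.100);
        System.out.println("Partitions by processing latency = " + byLatency); // 500
    }
}
```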
- Data producing pattern and consumption pattern (when structuring your data for Kafka, it really depends on how it's meant to be consumed).
- The partition structure you choose will depend largely on how you want to process the event stream. Ideally you want a partition key so that your event processing is partition-local (see the partitioner sketch after the links below).
- Stateless vs. stateful processing, and the desired level of parallel consumers.
- https://stackoverflow.com/questions/17205561/data-modeling-with-kafka-topics-and-partitions
- https://newrelic.com/blog/best-practices/effective-strategies-kafka-topic-partitioning
- https://stackoverflow.com/questions/61518436/spark-streaming-handle-skewed-kafka-partitions
- Do not have one streaming job reading from 1,000 topics. Put the topics with the biggest load into separate streaming job(s) and reconfigure; it's that simple. This is load balancing and basic queuing theory.
- https://www.sciencedirect.com/science/article/abs/pii/S0167739X17314784
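Referenced above: a sketch of a custom partitioner, for cases where the default key hashing is not the partition-locality you want. The region-key idea is hypothetical; the point is that records sharing a key end up in one partition so per-key processing stays partition-local.

```java
import org.apache.kafka.clients.producer.Partitioner;
import org.apache.kafka.common.Cluster;
import java.util.Map;

// Hypothetical example: route all events with the same region key to one partition,
// so per-region processing stays partition-local. Assumes a non-null key.
public class RegionPartitioner implements Partitioner {
    @Override
    public int partition(String topic, Object key, byte[] keyBytes,
                         Object value, byte[] valueBytes, Cluster cluster) {
        int numPartitions = cluster.partitionsForTopic(topic).size();
        return (key.hashCode() & 0x7fffffff) % numPartitions;
    }

    @Override
    public void close() { }

    @Override
    public void configure(Map<String, ?> configs) { }
}
```

To use it, set the producer's `partitioner.class` config (`ProducerConfig.PARTITIONER_CLASS_CONFIG`) to this class; otherwise the default key-hash partitioner shown earlier applies.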