Wednesday, July 24, 2024

Kafka Metrics to Monitor

Kafka metrics can be broken down into three categories:

  • Kafka server (broker) metrics
  • Producer metrics
  • Consumer metrics

Because Kafka relies on ZooKeeper to maintain state, it’s also important to monitor ZooKeeper.

Broker Metrics

Broker metrics can be broken down into three classes:
  • Kafka-emitted metrics
  • Host-level metrics
  • JVM garbage collection metrics

Kafka-Emitted Metrics

  • UnderReplicatedPartitions: number of under-replicated partitions (type: Resource/Availability). Metric to watch.
  • OfflinePartitionsCount: number of offline partitions (type: Resource/Availability). Metric to alert on (controller only).
  • TotalTimeMs: total time (in ms) to serve the specified request (Produce/Fetch) (type: Performance). Metric to watch.
  • BytesInPerSec / BytesOutPerSec: aggregate incoming/outgoing byte rate (type: Throughput).
  • RequestsPerSec: number of (producer|consumer|follower) requests per second (type: Throughput). Metric to watch.

Host-Level Metrics

  • Disk usage (type: Utilization)
  • CPU usage (type: Utilization)
  • Network bytes sent/received (type: Utilization)

JVM Garbage Collection Metrics

  • Total number of GC processes (type: Utilization)
  • Total time spent in GC (type: Utilization)

Producer Metrics (JMX attributes)

  • response-rate: average number of responses received per second (type: Throughput)
  • request-rate: average number of requests sent per second (type: Throughput)
  • request-latency-avg: average request latency in ms (type: Throughput)
  • outgoing-byte-rate: average number of outgoing/incoming bytes per second (type: Throughput)
  • batch-size-avg: average number of bytes sent per partition per request (type: Throughput)

Consumer Metrics (JMX attributes)

  • bytes-consumed-rate: average number of bytes consumed per second for a specific topic or across all topics (type: Throughput)
  • records-consumed-rate: average number of records consumed per second for a specific topic or across all topics (type: Throughput)
  • fetch-rate: number of fetch requests per second from the consumer (type: Throughput)

ZooKeeper Metrics

  • outstanding_requests: number of requests queued (type: Saturation)
  • avg_latency: average time taken to respond to client requests (type: Throughput)
  • num_alive_connections: number of clients connected to ZooKeeper (type: Availability)

Kafka Partitioning Strategies

Why partition your data in Kafka?

If you have so much load that you need more than a single instance of your application, you need to partition your data. How you partition serves as your load balancing for the downstream application. The producer clients decide which topic partition the data ends up in, but it's what the consumer applications do with that data that drives the decision logic.

How does Kafka partitioning improve performance?

Kafka partitioning allows for parallel data processing (and hence load distribution) by enabling multiple consumers to work on different partitions simultaneously. This helps achieve higher throughput.

What factors should you consider when determining the number of partitions?

Choose the number of partitions based on factors like 

  • expected data volume, 
  • the number of consumers, 
  • and the desired level of parallelism. 

It's essential to strike a balance to avoid over-partitioning or under-partitioning, which can impact performance.

https://www.confluent.io/blog/how-choose-number-topics-partitions-kafka-cluster/ 

      So if you know the average processing time per message, you should be able to calculate the number of partitions required to keep up. For example, if each message takes 100 ms to process (so one partition's consumer handles 10 messages per second) and you receive 5k a second, you'll need at least 500 partitions.

       For example, if you want to be able to read 1000 MB/sec, but your consumer is only able to process 50 MB/sec, then you need at least 20 partitions and 20 consumers in the consumer group. Similarly, if you want to achieve the same for producers, and 1 producer can only write at 100 MB/sec, you need 10 partitions. In this case, if you have 20 partitions, you can maintain 1 GB/sec for producing and consuming messages. You should adjust the exact number of partitions to the number of consumers or producers, so that each consumer and producer achieves its target throughput.

So a simple formula could be:

  • #Partitions = max(NP, NC), where:
  • NP is the number of partitions required on the producer side, calculated as TT/TP
  • NC is the number of partitions required on the consumer side, calculated as TT/TC
  • TT is the total expected throughput for our system
  • TP is the max throughput of a single producer to a single partition
  • TC is the max throughput of a single consumer from a single partition
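This sizing arithmetic can be sketched in a few lines of Python (the function names and the rounding-up are my own; the inputs are the TT/TP/TC quantities from the formula above):

```python
import math

def min_partitions(total_throughput: float,
                   producer_throughput: float,
                   consumer_throughput: float) -> int:
    """#Partitions = max(NP, NC) = max(TT/TP, TT/TC), rounded up."""
    np_ = math.ceil(total_throughput / producer_throughput)   # NP = TT/TP
    nc_ = math.ceil(total_throughput / consumer_throughput)   # NC = TT/TC
    return max(np_, nc_)

def partitions_for_latency(msgs_per_sec: float, ms_per_msg: float) -> int:
    """Processing-time variant: partitions needed so consumers keep up
    with the arrival rate, given per-message processing time in ms."""
    return math.ceil(msgs_per_sec * ms_per_msg / 1000.0)

# the 1000 MB/s example: 100 MB/s per producer, 50 MB/s per consumer
print(min_partitions(1000, 100, 50))       # -> 20
# the processing-time example: 5000 msgs/s at 100 ms each
print(partitions_for_latency(5000, 100))   # -> 500
```

Rounding up matters in practice: a fractional partition count always means one more partition, not one fewer.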

Strategies

The right strategy depends on:
  • the data production and consumption patterns (when structuring your data for Kafka, it really depends on how it's meant to be consumed)
    • the partition structure you choose will depend largely on how you want to process the event stream; ideally you want a partition key that makes your event processing partition-local
  • stateless vs. stateful processing
  • the desired level of parallel consumers
 
1) Random partitioning
- When to use? When processing is stateless and highly parallel. Random partitioning results in the evenest spread of load across consumers, and thus makes scaling the consumers easier. It is particularly suited for stateless or "embarrassingly parallel" services.

This is effectively what you get when using the default partitioner while not manually specifying a partition or a message key. To get an efficiency boost, the default partitioner in Kafka from version 2.4 onwards uses a “sticky” algorithm, which groups all messages to the same random partition for a batch.
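A toy model of that sticky behavior (the class and method names here are illustrative, not the real client code, and Python's hash() stands in for the murmur2 hash Kafka applies to keyed records):

```python
import random

class StickyPartitioner:
    """Toy model of the post-2.4 default partitioner (KIP-480):
    keyless records all go to one random partition until the current
    batch is sent, then a new random partition is chosen."""

    def __init__(self, num_partitions: int):
        self.num_partitions = num_partitions
        self.sticky = random.randrange(num_partitions)

    def partition(self, key=None) -> int:
        if key is not None:
            # keyed records always hash to the same partition
            # (the real client uses murmur2 on the key bytes)
            return hash(key) % self.num_partitions
        # keyless records stick to the current batch's partition
        return self.sticky

    def on_new_batch(self) -> None:
        # batch completed: pick a fresh random partition for the next one
        self.sticky = random.randrange(self.num_partitions)
```

Grouping a whole batch onto one partition means fewer, fuller produce requests per broker, which is where the efficiency boost comes from.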

2) Aggregate partitioning
- When to use? When a consumer needs to aggregate over a key. For example:

If you care about the most popular pages on your website, you should partition by the viewed page. A consumer can then keep a count of a given page's views just by looking at the events in a single partition.

If you care about users' average time-on-site, then you should partition by user-id. That way, all the events related to a single user's site activity will be available within the same partition. This means that a stream processing engine such as Apache Samza can calculate average time-on-site for a given user just by looking at the events in a single partition, avoiding any kind of costly partition-global processing.
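A sketch of why keying by user-id makes the aggregation partition-local (the event data and the hash-based partitioner are made up for illustration; Kafka actually murmur2-hashes the key bytes):

```python
from collections import defaultdict

NUM_PARTITIONS = 4

def partition_for(key: str) -> int:
    # stand-in for Kafka's key hashing (murmur2(key) % num_partitions)
    return hash(key) % NUM_PARTITIONS

# hypothetical click-stream events: (user_id, seconds_on_page)
events = [("alice", 30), ("bob", 12), ("alice", 45), ("carol", 8), ("bob", 20)]

# keying by user_id routes all of a user's events to one partition
partitions = defaultdict(list)
for user_id, secs in events:
    partitions[partition_for(user_id)].append((user_id, secs))

# so each partition's consumer can compute per-user averages locally,
# with no cross-partition (partition-global) state
for part, evs in partitions.items():
    totals = defaultdict(lambda: [0, 0])
    for user_id, secs in evs:
        totals[user_id][0] += secs
        totals[user_id][1] += 1
    for user_id, (total, n) in totals.items():
        print(part, user_id, total / n)
```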

3) Ordering guarantee
- When to use? When consumers must see the events for an entity in order. For example, a streaming service showing real-time data for multiple basketball matches: partitioning by match keeps each match's events ordered within its own partition.


Consumer partition assignment
Whenever a consumer enters or leaves a consumer group, the brokers rebalance the partitions across consumers, meaning Kafka handles load balancing with respect to the number of partitions per application instance for you. This is great—it’s a major feature of Kafka. We use consumer groups on nearly all our services.

By default, when a rebalance happens, all consumers drop their partitions and are reassigned new ones (which is called the “eager” protocol). If you have an application that has a state associated with the consumed data, like our aggregator service, for example, you need to drop that state and start fresh with data from the new partition.
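One way to soften this (available since Kafka 2.4) is the cooperative rebalancing protocol, which only revokes the partitions that actually move, so unaffected consumers keep their partitions and their state. A hedged config sketch for a confluent-kafka consumer (the broker address and group name are placeholders):

```python
# consumer config sketch; would be passed to confluent_kafka.Consumer(conf)
conf = {
    "bootstrap.servers": "localhost:9092",   # placeholder broker address
    "group.id": "aggregator-service",        # placeholder group name
    # incremental ("cooperative") rebalancing instead of the eager protocol:
    # only the partitions that move are revoked during a rebalance
    "partition.assignment.strategy": "cooperative-sticky",
}
```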

References
  • https://stackoverflow.com/questions/17205561/data-modeling-with-kafka-topics-and-partitions
  • https://newrelic.com/blog/best-practices/effective-strategies-kafka-topic-partitioning
  • https://stackoverflow.com/questions/61518436/spark-streaming-handle-skewed-kafka-partitions
    • Do not have one Streaming Job reading from 1000 topics. Put those with the biggest load into separate Streaming Job(s). Reconfigure, that simple. Load balancing, queuing theory.
  • https://www.sciencedirect.com/science/article/abs/pii/S0167739X17314784