
Kafka-Streams – Tips on How to Decrease Re-Balancing Impact for Real-Time Event Processing On Highly Loaded Topics


Need to reduce re-balancing time for a Kafka consumer group during deployment and understand the pitfalls of Kafka Streams? Read this article to learn about the factors that affect re-balance latency.
A Kafka rebalance happens when a new consumer is either added to (joins) the consumer group or removed from (leaves) it. It becomes dramatic during an application deployment rollout, as multiple instances are restarted at the same time and rebalance latency increases significantly. During a rebalance, consumers stop processing messages for some period of time, and, as a result, events from a topic are processed with a delay. Some business cases can tolerate rebalancing, while others require real-time event processing, and delays of more than a few seconds are painful. Here we will try to figure out how to decrease rebalance impact for Kafka-Streams clients (even though some tips will be useful for other Kafka consumer clients as well).

Let's look at one existing use case. We have a micro-service with 45 application instances, deployed to Kubernetes as Docker containers with configured up-scaling and down-scaling (based on CPU load). This service consumes a single topic using Kafka-Streams (actually there are more consumed topics, but let's concentrate on a single one) with 180 partitions and a traffic of 20,000 messages per second. We use the Kafka-Streams configuration property num.stream.threads = 4, so a single app instance processes 4 partitions in 4 threads (45 instances with 4 threads each, which means each of the 180 partitions is processed by its own thread). As a result, a consumer handles around 110 messages per second from a single partition. In our case, processing a single message takes around 5 milliseconds and the stream is stateless (processing is both CPU and IO intensive, with some calls to databases and REST calls to other micro-services).

During deployment rollout, we had delays in consuming events of more than 1 minute at the 99th percentile, and that definitely impacts the business flow, as we need real-time processing. After understanding what was going on and tuning the configuration, we significantly decreased it, and now processing latency is up to 2-3 seconds at the 99th percentile (yes, we have metrics for that, and we will talk about them closer to the end of this article). So just by understanding and changing configuration, rebalance latency was decreased more than ten times, without any code changes. As of writing this article, we use the kafka-streams Maven dependency with version 2.8.0.

The following tips will allow you to significantly decrease rebalancing latency for your application.

Developers of Kafka-Streams and Kafka-Clients regularly improve the rebalancing protocol, and performance gets better and better over time (thanks to the amazing Confluent team) with new, smarter, and optimized logic. This can be seen by reviewing the release notes of each new version. A feature improvement worth highlighting is the incremental cooperative rebalancing protocol. With its introduction, Streams no longer requires all tasks to be revoked at the beginning of a rebalance (the stop-the-world effect). Instead, at the completion of the rebalance, only those tasks which are to be migrated to another consumer for overall load balance need to be closed and revoked. This significantly improves rebalance latency: the number of rebalances is now much higher, but each one is much shorter. Kafka-Streams has this feature out of the box (since version 2.4.0, with some improvements in 2.6.0), with the default partition assignor StreamsPartitionAssignor.
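For reference, a minimal Kafka-Streams configuration along these lines might look like the sketch below. The application id, bootstrap servers, and serdes are hypothetical placeholders; only num.stream.threads = 4 mirrors the setup described above.

```java
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsConfig;

public final class StreamsConfigExample {

    public static Properties buildConfig() {
        final Properties props = new Properties();
        // Hypothetical application id and bootstrap servers, for illustration only
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "event-processing-app");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        // 4 stream threads per instance: 45 instances * 4 threads = 180 threads,
        // one per partition of the 180-partition topic
        props.put(StreamsConfig.NUM_STREAM_THREADS_CONFIG, 4);
        return props;
    }
}
```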
Add the Kafka-Streams configuration property internal.leave.group.on.close = true to send a consumer leave-group request on app shutdown. By default, Kafka-Streams doesn't send a leave-group request on graceful shutdown, and, as a result, messages from the partitions that were assigned to the terminating app instance will not be processed until the consumer's session expires (after session.timeout.ms); only after that expiration is a new rebalance triggered. A configuration sketch is shown below.
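A minimal sketch of setting this property, assuming the Streams properties are built as in the previous example; since Kafka Streams does not expose a public constant for this internal config, the raw key string is used:

```java
import java.util.Properties;

public final class LeaveGroupOnCloseConfig {

    // Internal consumer config without a public constant in StreamsConfig,
    // so the raw key is spelled out here.
    private static final String INTERNAL_LEAVE_GROUP_ON_CLOSE = "internal.leave.group.on.close";

    public static Properties withLeaveGroupOnClose(final Properties streamsProps) {
        // Override the Kafka-Streams default (false) so the embedded consumers send
        // a LeaveGroup request on graceful shutdown; the group then rebalances
        // immediately instead of waiting for session.timeout.ms to expire.
        streamsProps.put(INTERNAL_LEAVE_GROUP_ON_CLOSE, true);
        return streamsProps;
    }
}
```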
