What are some recommendations to consider when running Apache Kafka® in production? Jun Rao, one of the original Kafka creators, as well as an ongoing committer and PMC member, shares the essential wisdom he's gained from developing Kafka and dealing with a large number of Kafka use cases.
Here are six recommendations for getting the most out of Kafka in production:
1. Nail Down the Operational Part
When setting up your cluster, in addition to dealing with the usual architectural issues, make sure to also invest time into alerting, monitoring, logging, and other operational concerns. Managing a distributed system can be tricky, and you have to make sure that all of its parts stay healthy together. This gives you a chance to catch cluster problems early, rather than after they have become full-blown crises.
2. Reason Properly About Serialization and Schemas Up Front
At the Kafka API level, events are just bytes, which gives your application the flexibility to use various serialization mechanisms. Avro has the benefit of decoupling schemas from data serialization, whereas Protobuf is often preferred by those experienced with remote procedure calls; JSON Schema is user friendly but verbose. When you are choosing your serialization, it's a good time to reason about schemas, which should be well-thought-out contracts between your publishers and subscribers. You should know who owns a schema as well as the path for evolving that schema over time.
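To make the "schemas are contracts" idea concrete, here is a minimal sketch of a producer-side serializer that validates an event against a shared schema before it becomes bytes. The `Order` event type and its fields are hypothetical, and the validation is deliberately simplified; in practice you would use an Avro, Protobuf, or JSON Schema library backed by a schema registry.

```python
import json
from dataclasses import dataclass, asdict

# Hypothetical contract between publishers and subscribers.
# Field names and types are illustrative, not from the episode.
ORDER_SCHEMA = {
    "name": "Order",
    "fields": {"order_id": str, "amount_cents": int},
}

@dataclass
class Order:
    order_id: str
    amount_cents: int

def serialize(event: Order) -> bytes:
    """Validate against the schema, then emit bytes (Kafka only sees bytes)."""
    record = asdict(event)
    for field, ftype in ORDER_SCHEMA["fields"].items():
        if not isinstance(record[field], ftype):
            raise TypeError(f"{field} must be {ftype.__name__}")
    return json.dumps(record).encode("utf-8")

def deserialize(payload: bytes) -> Order:
    """The subscriber side of the contract: bytes back into a typed event."""
    return Order(**json.loads(payload.decode("utf-8")))

# Round trip: the consumer recovers exactly what the producer sent.
evt = Order(order_id="o-123", amount_cents=4200)
assert deserialize(serialize(evt)) == evt
```

The point is that both sides depend on `ORDER_SCHEMA`, so evolving it (adding a field, changing a type) is a deliberate, owned decision rather than an accident discovered at read time.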
3. Use Kafka As a Central Nervous System Rather Than As a Single Cluster
Teams typically start out with a single, independent Kafka cluster, but they could benefit, even from the outset, by thinking of Kafka more as a central nervous system that they can use to connect disparate data sources. This enables data to be shared among more applications.
4. Utilize Dead Letter Queues (DLQs)
DLQs can keep service delays from blocking the processing of your messages. For example, instead of using a unique topic for each customer to which you need to send data (potentially millions of topics), you may prefer to use a shared topic, or a series of shared topics, that carry events for all of your customers. But if you are sending to multiple customers from a shared topic and one customer's REST API is down, then instead of stalling processing entirely, you can divert that customer's events into a dead letter queue and process them later from that queue.
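The routing described above can be sketched in a few lines. This is a simplified, broker-free simulation: `deliver_to_customer` stands in for a hypothetical REST call, and the DLQ is modeled as a plain list rather than a real Kafka topic.

```python
def route_events(events, deliver_to_customer, dlq):
    """Deliver each event; divert failures to the DLQ instead of blocking."""
    delivered = []
    for event in events:
        try:
            deliver_to_customer(event)
            delivered.append(event)
        except ConnectionError:
            dlq.append(event)  # park the event; retry later from the DLQ
    return delivered

# Simulate customer "b" having a downed API (hypothetical scenario).
def fake_delivery(event):
    if event["customer"] == "b":
        raise ConnectionError("customer b API down")

dlq = []
events = [
    {"customer": "a", "v": 1},
    {"customer": "b", "v": 2},
    {"customer": "a", "v": 3},
]
ok = route_events(events, fake_delivery, dlq)
# Events for "a" flow through uninterrupted; the "b" event lands in the DLQ.
```

The key property is that customer "a" keeps being served even while customer "b" is unreachable; a separate consumer can drain the DLQ once "b" recovers.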
5. Understand Compacted Topics
By default, Kafka topics retain data by time. But there is also another type of topic, a compacted topic, which stores data by key and retains only the latest value for each key, replacing old values as new ones arrive. This is particularly useful for working with data that is updateable, for example, data coming in through a change-data-capture log. A practical example of this would be a retailer that needs to update prices and product descriptions to send out to all of its locations.
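To illustrate the semantics (in Kafka, you enable this per topic with `cleanup.policy=compact`), here is a small sketch that approximates what compaction leaves behind: only the latest record per key, with surviving records kept in the order of their last write. The SKU-keyed price log is a hypothetical example in the spirit of the retailer scenario above.

```python
def compact(log):
    """Approximate Kafka log compaction: keep only the latest value per key,
    with surviving records ordered by their last write."""
    latest = {}
    for key, value in log:
        latest.pop(key, None)   # drop any earlier record for this key
        latest[key] = value     # the newest write survives, at its position
    return list(latest.items())

# A retailer's price updates keyed by SKU: only the newest price survives.
log = [("sku-1", 999), ("sku-2", 450), ("sku-1", 899)]
assert compact(log) == [("sku-2", 450), ("sku-1", 899)]
```

Note that a real compacted topic still serves a full changelog to consumers reading near the tail; compaction only prunes older segments, so this sketch shows the eventual state, not the exact on-disk layout.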
6. Imagine New Use Cases Enabled by Kafka's Recent Evolution
The biggest change in Kafka's recent history is its migration to the cloud. By running Kafka there, you can reserve your engineering talent for business logic. The unlimited storage enabled by the cloud also means that you can truly keep data forever at reasonable cost, so you don't have to build a separate system for your historical data needs.
EPISODE LINKS
The Pro’s Guide to Fully Managed Apache Kafka Services ft. Ricardo Ferreira
Kafka Screams: The Scariest JIRAs and How To Survive Them ft. Anna McDonald
Data Integration with Apache Kafka and Attunity
Distributed Systems Engineering with Apache Kafka ft. Colin McCabe
Apache Kafka on Kubernetes, Microsoft Azure, and ZooKeeper with Lena Hall
Improving Fairness Through Connection Throttling in the Cloud with KIP-402 ft. Gwen Shapira
Data Modeling for Apache Kafka – Streams, Topics & More with Dani Traphagen
MySQL, Cassandra, BigQuery, and Streaming Analytics with Joy Gao
Scaling Apache Kafka with Todd Palino
Understand What’s Flying Above You with Kafka Streams ft. Neil Buesing
KIP-500: Apache Kafka Without ZooKeeper ft. Colin McCabe and Jason Gustafson
Should You Run Apache Kafka on Kubernetes? ft. Balthazar Rouberol
Jay Kreps on the Last 10 Years of Apache Kafka and Event Streaming
Connecting to Apache Kafka with Neo4j
Ask Confluent #15: Attack of the Zombie Controller
Helping Healthcare with Apache Kafka and KSQL ft. Ramesh Sringeri
Contributing to Open Source with the Kafka Connect MongoDB Sink ft. Hans-Peter Grahsl
Teaching Apache Kafka Online with Stéphane Maarek
Connecting Apache Cassandra to Apache Kafka with Jeff Carpenter from DataStax
Transparent GDPR Encryption with David Jacot