Working towards GA of change data capture

Currently (end of 2019), the change data capture feature is in beta. This is a thread for figuring out what the next steps are. Some rough thoughts:

  • Publish a yugabytedb-cdc-subscriber library for CDC so that it is easy to build a consumer in any language. It should not be necessary to have a hard linkage to Kafka only.
  • Publish the Java artifacts of the above library to Maven Central.
  • Move CDC source code in Java to a separate repo along with example reference implementations.
  • Documentation on how to use CDC from various languages.
  • Failure testing of the CDC client.
  • Performance testing of the CDC client.

From the above points, I would love to have a Java library published that can be brought into projects/applications that would like to digest changes in yugabyte in real time. And I’m happy to help in getting that fully fleshed out. I’ve looked through the current Java code and have a pretty good idea on how that can currently be used to satisfy my goals, but it’s currently unpublished in Maven, which makes me think that it’s likely to change substantially.

Having access to a dependency like a yugabytedb-cdc-subscriber could be really nice.

I’ve viewed a recent presentation from @neha about CDC and noticed that a large percentage of it was devoted to CDC particularly in relation to cross DC replication (I’m also eager to use it for this feature).

I have two initial questions to help me in exploring the codebase and better understanding the implementation around CDC so far. Is it possible for me to see the code that was written for the Yugabyte to Kafka connector jar that’s in this codebase? And which area/files of the C++ implementation of Yugabyte should I look in if I want to better understand how the CDC producer process works? (I’ve been poking around the tablet server code so far, but don’t believe I’ve found the meat of it.)

Also, if a subscriber library is published for CDC, would it potentially have features such as a subscriber group, in which the changes from tablets could be distributed over the subscribers in the group (similar to Kafka’s consumer groups)?
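
To make the subscriber group idea concrete, here is a minimal sketch of how tablets might be divided among the members of a group. It is only an illustration under stated assumptions: the class `TabletAssignment`, the method `assign`, and the tablet/subscriber ID strings are all hypothetical and not part of any YugabyteDB API, and round-robin is just one simple strategy (Kafka's consumer groups offer several).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

// Hypothetical sketch: none of these names exist in YugabyteDB.
public class TabletAssignment {

    // Distribute tablet IDs across the subscribers of one group, round-robin.
    // Both inputs are sorted (and de-duplicated) first, so every member that
    // sees the same membership computes the same assignment.
    public static Map<String, List<String>> assign(List<String> tabletIds,
                                                   List<String> subscriberIds) {
        if (subscriberIds.isEmpty()) {
            throw new IllegalArgumentException("group has no subscribers");
        }
        List<String> subscribers = new ArrayList<>(new TreeSet<>(subscriberIds));
        List<String> tablets = new ArrayList<>(new TreeSet<>(tabletIds));

        Map<String, List<String>> assignment = new LinkedHashMap<>();
        for (String subscriber : subscribers) {
            assignment.put(subscriber, new ArrayList<>());
        }
        for (int i = 0; i < tablets.size(); i++) {
            String owner = subscribers.get(i % subscribers.size());
            assignment.get(owner).add(tablets.get(i));
        }
        return assignment;
    }

    public static void main(String[] args) {
        // Five tablets shared by a two-member group.
        Map<String, List<String>> assignment = assign(
            Arrays.asList("tablet-1", "tablet-2", "tablet-3", "tablet-4", "tablet-5"),
            Arrays.asList("sub-a", "sub-b"));
        System.out.println(assignment);
    }
}
```

Each tablet ends up owned by exactly one subscriber, so the group as a whole sees every change once. A real implementation would also have to handle members joining or leaving and tablet splits, which this sketch ignores.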

As another aside, server side filtering with CDC would be very attractive for certain use cases and an area that I believe Kafka is lacking in. Is there an opinion yet on whether this should be included with CDC?

Hi @eliahburns, great to see you trying out CDC!
Most of what you mentioned is in our roadmap:

  1. A published subscriber library
  2. Subscriber group feature
  3. Server side filtering

(1) is something that we plan to get started on next quarter, and (2) and (3) will come after that. Would love to get your help and input in fleshing these out :)

We’ll publish the code for Kafka client soon. In the meantime, you can take a look at our sample console client code (Java) here:

You can also go through the C++ CDC producer for more details: (mostly proto definitions) (Producer code)

I’d also recommend going over the 2DC consumer code. 2DC consumer is basically just a CDC consumer that we use for data center replication. This code is production ready and will give you an idea of the robustness and scalability that your consumer code will need to handle: (files of interest are cdc_* and twodc_*)

Excellent, thanks a bunch for the detailed response @neha!

I’m super excited to hear that 2 and 3 are on the roadmap.

The closest alternative to something like this that I’ve found so far is this hbase-connector-to-kafka. I imagine a fair number of folks on the team have seen it. It acts as a replication peer to HBase and funnels WAL events into a Kafka topic, so it’s a bit similar in some respects.

Thanks for pointing me to some good places in the codebase. I’ll continue looking through that and may have some more questions/comments soon.