Working towards GA of change data capture

karthik · December 27, 2019, 5:16pm

Currently (end of 2019), the change data capture (CDC) feature is in beta. This thread is for figuring out what the next steps are. Some rough thoughts:

Publish a yugabytedb-cdc-subscriber library for CDC so that it is easy to build a consumer in any language. It should not require a hard dependency on Kafka.
Publish the Java artifacts of the above library to Maven Central.
Move CDC source code in Java to a separate repo along with example reference implementations.
Documentation on how to use CDC from various languages.
Failure testing of the CDC client.
Performance testing of the CDC client.

eliahburns · January 2, 2020, 4:59am

From the above points, I would love to have a published Java library that can be brought into projects/applications that want to consume changes from Yugabyte in real time, and I’m happy to help get that fully fleshed out. I’ve looked through the current Java code and have a pretty good idea of how it can be used to satisfy my goals, but it’s currently unpublished in Maven, which makes me think that it’s likely to change substantially.

Having access to a dependency like a yugabytedb-cdc-subscriber could be really nice.

I’ve viewed a recent presentation from @neha about CDC and noticed that a large percentage of it was devoted to CDC particularly in relation to cross DC replication (I’m also eager to use it for this feature).

I have two initial questions to help me explore the codebase and better understand the implementation around CDC so far. Is it possible for me to see the code that was written for the Yugabyte to Kafka connector jar that’s in this codebase? And which area/files of the C++ implementation of Yugabyte should I look in if I want to better understand how the CDC producer process works? (I’ve been poking around the tablet server code so far, but don’t believe I’ve found the meat of it.)

eliahburns · January 2, 2020, 5:03am

Also, once a subscriber library is published for CDC, would it potentially have features such as a subscriber group, in which the changes from tablets could be distributed over the subscribers in the group (similar to Kafka’s consumer groups)?
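To make the subscriber group idea concrete, here is a minimal sketch of one way tablets could be spread over the subscribers in a group, using plain round-robin assignment. This is only an illustration: `SubscriberGroupSketch`, `assign`, and the tablet and subscriber IDs are hypothetical names, not part of any published Yugabyte library, and a real implementation would also need to handle rebalancing when subscribers join or leave.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: not part of any Yugabyte CDC API.
public class SubscriberGroupSketch {

    /**
     * Assigns each tablet ID to exactly one subscriber, round-robin,
     * so that every tablet's changes are consumed by a single member of the group.
     */
    static Map<String, List<String>> assign(List<String> tabletIds, List<String> subscribers) {
        if (subscribers.isEmpty()) {
            throw new IllegalArgumentException("a group needs at least one subscriber");
        }
        // LinkedHashMap keeps subscribers in the order they were given.
        Map<String, List<String>> assignment = new LinkedHashMap<>();
        for (String subscriber : subscribers) {
            assignment.put(subscriber, new ArrayList<>());
        }
        for (int i = 0; i < tabletIds.size(); i++) {
            String owner = subscribers.get(i % subscribers.size());
            assignment.get(owner).add(tabletIds.get(i));
        }
        return assignment;
    }
}
```

With three tablets and two subscribers, the first subscriber would own the first and third tablets and the second subscriber would own the second.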

As another aside, server-side filtering with CDC would be very attractive for certain use cases, and it is an area where I believe Kafka is lacking. Is there an opinion yet on whether this should be included with CDC?
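As a rough illustration of what filtering means here, the sketch below applies a predicate to a batch of change records and keeps only the matches. Everything in it is an assumption made for the example: the `Change` class is a hypothetical stand-in for a CDC record (not the actual proto definition), and the filter runs in-process. With true server-side filtering, the equivalent predicate would be evaluated by the producer so that non-matching changes are never sent to the subscriber.

```java
import java.util.List;
import java.util.function.Predicate;
import java.util.stream.Collectors;

// Hypothetical sketch: not part of any Yugabyte CDC API.
public class ChangeFilterSketch {

    /** Hypothetical change record: the table it belongs to, the operation, and the row key. */
    static final class Change {
        final String table;
        final String operation;
        final String key;

        Change(String table, String operation, String key) {
            this.table = table;
            this.operation = operation;
            this.key = key;
        }
    }

    /** Returns only the changes in the batch that satisfy the predicate, in their original order. */
    static List<Change> filter(List<Change> batch, Predicate<Change> keep) {
        return batch.stream().filter(keep).collect(Collectors.toList());
    }
}
```

A subscriber interested only in one table could pass a predicate such as `c -> c.table.equals("orders")`.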

neha · January 7, 2020, 9:36pm

Hi @eliahburns, great to see you trying out CDC!
Most of what you mentioned is in our roadmap:

1. A published subscriber library
2. Subscriber group feature
3. Server-side filtering

(1) is something that we plan to get started on next quarter, and (2) and (3) will come after that. Would love to get your help and input in fleshing these out.

We’ll publish the code for the Kafka client soon. In the meantime, you can take a look at our sample console client code (Java) here: https://github.com/yugabyte/yugabyte-db/tree/master/java/yb-cdc/src/main/java/org/yb/cdc

You can also go through the C++ CDC producer for more details:
https://github.com/yugabyte/yugabyte-db/tree/master/src/yb/cdc (mostly proto definitions)
https://github.com/yugabyte/yugabyte-db/tree/master/ent/src/yb/cdc (Producer code)

I’d also recommend going over the 2DC consumer code. The 2DC consumer is basically just a CDC consumer that we use for data center replication. This code is production ready and will give you an idea of the robustness and scalability requirements that your consumer code will need to handle: https://github.com/yugabyte/yugabyte-db/tree/master/ent/src/yb/tserver (files of interest are cdc_* and twodc_*)

eliahburns · January 10, 2020, 5:55am

Excellent, thanks a bunch for the detailed response @neha!

I’m super excited to hear that (2) and (3) are on the roadmap.

The closest alternative to something like this that I’ve found so far is this hbase-connector-to-kafka. I imagine a fair number of folks on the team have seen it: it acts as a replication peer to HBase and funnels WAL events into a Kafka topic. So it’s a bit similar in some respects.

Thanks for pointing me to some good places in the codebase. I’ll continue looking through that and may have some more questions/comments soon.
