The data source for backfill

Hi,
When doing a backfill, does each node in the cluster read only the main-table tablets located on that node?
For example, in a cluster with three nodes A, B, and C, does A backfill only from the main-table tablets on A, B only from the main-table tablets on B, and C likewise?
Thanks a lot.

Hi @ZhenNan2016

When you create an index, new tablets are created for it depending on its type (range or hash sharded), and they are distributed across the cluster.

No. For a given row in a table, the index entry for any given index may be on another node in the cluster.

The logic is like this: “on each node, for each tablet, a full-scan is started, and index-entries are pushed wherever the tablet-leader of the index is located in the cluster”.
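
A minimal sketch of that scan-and-push flow, assuming a toy in-memory cluster. The tablet layout, the `index_tablet_for` helper, and the use of CRC32 as the hash are all hypothetical stand-ins for illustration, not YugabyteDB's actual sharding or RPC code:

```python
import zlib
from collections import defaultdict

# Hypothetical main-table tablets: hosting node and (primary key, name) rows.
TABLE_TABLETS = {
    "t1": {"node": "A", "rows": [(1, "alice"), (2, "bob")]},
    "t2": {"node": "B", "rows": [(3, "carol"), (4, "dave")]},
    "t3": {"node": "C", "rows": [(5, "erin"), (6, "frank")]},
}

# Hypothetical index tablets: position i is the node hosting leader i.
INDEX_LEADERS = ["A", "B", "C"]


def index_tablet_for(value: str) -> int:
    """Pick an index tablet by hashing the indexed value (CRC32 is a stand-in)."""
    return zlib.crc32(value.encode()) % len(INDEX_LEADERS)


def backfill():
    """Scan every table tablet once and push each entry to its index tablet."""
    pushed = defaultdict(list)  # index tablet number -> list of entries
    for tablet in TABLE_TABLETS.values():
        # Full scan of this tablet's own rows only.
        for pk, name in tablet["rows"]:
            dest = index_tablet_for(name)
            # The destination leader may live on a different node than the source.
            pushed[dest].append((name, pk, tablet["node"]))
    return pushed


if __name__ == "__main__":
    for dest, entries in sorted(backfill().items()):
        print(f"index tablet {dest} on node {INDEX_LEADERS[dest]}: {entries}")
```

In this sketch the scan is local to each table tablet, while the destination of each index entry depends only on the indexed value, so an entry can land on any node.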


Hi @dorian_yugabyte
Thank you for your reply.
By full-scan here, do you mean a full scan of all the tablets distributed across nodes A, B, and C of the cluster?
Thanks a lot.

Hi @dorian_yugabyte
To add some background to my question: I can see two ways to understand the backfill logic. In the first, once the backfill is triggered, nodes A, B, and C each have their own backfill task, and each task is only responsible for pulling the main-table tablets on its own node. In the second, the backfill task on each node pulls main-table tablets from all nodes at the same time.
Which understanding is correct?
Thanks a lot.

Yes.

Please read the design doc: yugabyte-db/architecture/design/online-index-backfill.md (master branch of the yugabyte/yugabyte-db repository on GitHub).

Feel free to ask more questions if you’re still unclear after reading the design doc.

Hi @dorian_yugabyte
Judging from the red-boxed portion of the screenshot below, it looks like the correct understanding is that, once the backfill is triggered, nodes A, B, and C each have their own backfill task, and each task is only responsible for pulling the main-table tablets on its own node.

Before, I thought that the backfill task on each node pulled main-table tablets from all nodes at the same time.

Thanks a lot.

I don’t really understand the sentences in your latest comment. Can you restate them in clearer terms?

Example: I never mentioned “pull” in my comments above. I said each table-tablet does a scan & a push.

I’m really sorry, I expressed myself poorly.
What I actually wanted to ask is: does each table-tablet backfill task scan and push only the data of its own tablet, once, or does it do a full table scan that includes tablets on other nodes?

Only its own data. It wouldn’t make sense otherwise. Since tables are split into tablets, the work can be parallelized into per-tablet tasks.
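
To illustrate the per-tablet parallelism, here is a rough sketch that assumes a toy table split into three tablets. The names `TABLETS`, `backfill_tablet`, and `run_backfill` are made up for this example, and a thread pool only stands in for the independent tasks a real cluster would run:

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical table split into tablets: tablet id -> primary keys it owns.
TABLETS = {
    "tablet-1": [1, 2, 3],
    "tablet-2": [4, 5],
    "tablet-3": [6, 7, 8, 9],
}


def backfill_tablet(tablet_id):
    """Scan only the rows owned by this tablet and report what was scanned."""
    rows = TABLETS[tablet_id]  # never touches another tablet's rows
    return tablet_id, list(rows)


def run_backfill():
    """Run one independent backfill task per tablet, in parallel."""
    with ThreadPoolExecutor(max_workers=len(TABLETS)) as pool:
        return dict(pool.map(backfill_tablet, TABLETS))


if __name__ == "__main__":
    for tablet_id, scanned in run_backfill().items():
        print(tablet_id, "scanned", scanned)
```

Because the tablets partition the table, the union of the per-tablet scans covers every row exactly once, which is why no task needs to read another tablet's data.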

Hi @dorian_yugabyte
I got it now, thanks a lot.