Yugabyte so slow?


I’m running the docker-compose setup from Deploy local clusters using Docker Compose | YugabyteDB Docs, and it seems very slow.

  • I have a CSV file with 750000 rows that is 296391592 bytes.
  • I have a Python script that opens a connection to the YugabyteDB instance running via the docker-compose above.
  • I’m using psycopg2’s copy_from function to do insert-only loading of the rows in the CSV file.
  • I timed how long the copy_from call took to finish, and I’m getting rates below 2 MB/s.

I modified the Python script to connect to PostgreSQL 14.0 instead and got almost 40 MB/s.
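For reference, the load-and-time step described above can be sketched like this; the table name, connection parameters, and the MB/s helper are my placeholders, not taken from the original script:

```python
import os
import time


def load_csv(conn, path, table="t", sep="|"):
    """COPY the file at `path` into `table` via psycopg2's copy_from,
    returning the observed throughput in MB/s."""
    size_mb = os.path.getsize(path) / 1_000_000
    start = time.monotonic()
    with conn.cursor() as cur, open(path) as f:
        cur.copy_from(f, table, sep=sep)
    conn.commit()
    elapsed = time.monotonic() - start
    return size_mb / max(elapsed, 1e-9)  # guard against a zero-length timing


def connect_yugabyte():
    """Placeholder connection to the docker-compose cluster: 5433 is the
    default YSQL port, and `yugabyte` the default user/database."""
    import psycopg2  # pip install psycopg2-binary
    return psycopg2.connect(host="localhost", port=5433,
                            user="yugabyte", dbname="yugabyte")
```

With the cluster up, something like `print(load_csv(connect_yugabyte(), "data.csv"))` (file name is a placeholder) reproduces the measurement; pointing the connection at PostgreSQL instead gives the comparison number.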

Would appreciate any help. I’ve tried toggling different settings, but nothing helped…

Thank you in advance…

Hi @Captain_KBLS

Note that ingesting a ~300MB file (OLAP scenario) in 1 transaction is not something we’ve optimized for, so a single PostgreSQL node will probably do better. We’re mostly optimized for OLTP scenarios.

You can check out our benchmarks blog Performance Benchmarks Archives - The Distributed SQL Blog or run benchmarks yourself Benchmark YugabyteDB | YugabyteDB Docs

Hi, copy_from is the right way. Here is an example of the best way I know to ingest a CSV through Python: Bulk load into PostgreSQL / YugabyteDB - psycopg2 - DEV Community
Is your table created just with a primary key or are there indexes and foreign keys?

Thank you so much for your responses.

I created a very “simple” table.


id UUID default gen_random_uuid(),
var1 character varying(255),
var2 character varying(255),
var3 character varying(255),
var4 character varying(255),
var5 character varying(255),
var6 character varying(255),
var7 character varying(255),
var8 character varying(255),
var9 character varying(255),
var10 character varying(255),
var11 character varying(255),
var12 character varying(255),
var13 character varying(255),
val1 integer,
val2 integer,
val3 integer,
val4 integer,
val5 integer,
val6 integer);

Adding PRIMARY KEY (id) made little difference, since I’m only measuring insert-only rates…

I hardcoded a row into the Python script so it can generate the same row repeatedly (the id will be different for each row inserted) and create a CSV file:
row = “var1 | var2 | var3 | var4 | var5 | var6 | var7 | var8 | var9 | var10 | var11 | var12 | var13 | 1 | 2 | 3 | 4 | 5 | 6”

Create CSV file with desired number of rows.

Call the copy_from function to insert the CSV file.
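As a sketch, the row-generation step might look like this; the separator, file layout, and client-side UUID generation are my assumptions, since the post does not say whether the id comes from the client or from the column default:

```python
import uuid


def make_csv(path, n_rows, sep=" | "):
    """Write n_rows copies of the hardcoded row, each with a fresh UUID.
    Generating the id client-side (rather than via gen_random_uuid())
    is an assumption made for this sketch."""
    tail = sep.join([f"var{i}" for i in range(1, 14)] + list("123456"))
    with open(path, "w") as f:
        for _ in range(n_rows):
            f.write(str(uuid.uuid4()) + sep + tail + "\n")
```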

I created another Python script that can run copy_from on the same CSV file using threads (threading).

I capture the start and stop times and, once all copy_from calls have completed, calculate the total rate.
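A minimal sketch of that threaded driver, with the actual copy_from call abstracted as `copy_file` (a placeholder; each thread would open its own connection, since commands on a shared psycopg2 connection are serialized and the COPYs would not run in parallel):

```python
import time
from concurrent.futures import ThreadPoolExecutor


def timed_parallel_copy(copy_file, path, n_threads=8):
    """Run copy_file(path) once per thread and return the elapsed wall-clock
    seconds for the whole batch."""
    start = time.monotonic()
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        for fut in [pool.submit(copy_file, path) for _ in range(n_threads)]:
            fut.result()  # propagate any exception raised in a thread
    return time.monotonic() - start
```

The total rate is then n_threads × file size ÷ elapsed time.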

I ran the same test script connecting to YugabyteDB and to PostgreSQL, both set up with a single drive/node (the same SSD drive, /data01). The YugabyteDB rate is far lower: I just ran 1500 rows (5928200 bytes), and YugabyteDB got about 4 MB/s while PostgreSQL got about 31 MB/s. For both tests I ran 8 threads calling copy_from on the same CSV file, and before every test I delete all the data in the table.

Will try a very small file with many copy_from calls…

So you have a ratio of 1:7 between YugabyteDB and PostgreSQL on a single node. This is not surprising. Even when you run on a single node, the database is optimized for distributed transactions. Where you see the advantage of YugabyteDB is:

  1. when going to replication factor RF=3, because then you have no-data-loss HA, which you cannot have with PostgreSQL
  2. when adding nodes, because the load can be balanced over multiple machines, and then the rates will increase linearly

Running on a single node without HA protection is faster on PostgreSQL. But it cannot scale out.

I’m seeing what I believe is also slow performance, this time on a cluster with RF=3: essentially writing large blobs of data at 50 MB/s. Could someone comment on whether that seems ‘normal’ in their experience? Further details of my setup are here: Lost leader, then the yb-tserver - while testing long-running (low throughput) writes - #2 by roman

I’m planning to replicate my tests against traditional Postgres soon, but would appreciate hearing other voices.