We are using yugabyte database with YCQL driver(gocql)
We have below two queries:
selectQueryMessage= "Select col2, col3, col4, col5 from table1 " +
"WHERE primarykeyCol = ?"
insertQueryMessage = "INSERT INTO table1 (primarykeyCol, col2, col3, col4, col5) VALUES (?, ?, ?, ?, ?) " +
"IF NOT EXISTS RETURNS STATUS AS ROW"
In last 7 days, each SELECT query is taking 10 ms consistently where as each INSERT query takes 30-40 ms
Why INSERT query 3-4 times slower than SELECT?
If your RF > 1, insert operation goes through a distributed transaction cycle where by record is stored on majority of replicas (if RF=3, then 2; if RF=4, then 3, it RF=7, then 5). However in case of SELECT record is read from only one tablet (assuming you are retrieving 1 record).
yugaware shows Replication factor as 3
Yes select query retrieves one record
For INSERT query, What is distributed transaction cycle?
See this page it explains how inserts go through distributed transaction.
“Transactional IO path | YugabyteDB Docs” Transactional IO path | YugabyteDB Docs
Can I say,
if RF=3 in our cluster environment is changed to RF=1, then INSERT query would execute with less latency
Yes, in theory it will.
But you compromise on the resiliency. With RF=3, you can sustain 1 node failure, i.e. YB can continue serve query without interruption. But with RF=1, any node failure will mean interruption.
What is your current cluster configuration? i.e. Single-AZ, Multi-AZ, Multi-Region
Depending on where the nodes of your cluster are located it may cause increased latencies for writes due to networking between nodes (as Yogi mentioned each write has to be acknowledged by 2/3 peers with RF=3). Another aspect to consider if you are in a multi-region setup is where your users are. For example, if you are querying from US-East but the tablet-leader is in US-West you will see increased latencies.
We have 6 node cluster(2 regions 3 AZs)
4 nodes in central region US
2 nodes in eastern region US
Node1 (Master, TServer) US-east
Node2 (Master, TServer) US-central-1
Node3 (TServer) US-central-1
Node4 (TServer) US-east
Node5 (Master, TServer) US-central-2 (Leader)
Node6 (TServer) US-central-2
Application(installed in central region) uses YCL driver and talks to database instance of node 2
- What is the difference between Master leader vs TServer vs Master+TServer+LEADER ?
- Can a node just be a master but not tserver?
- Does INSERT query transaction run on all 6 nodes and then respond to client(Application) unlike SELECT query(which run transaction only on node2)?
There is only 1 master-leader server.
While on tservers, the master is chosen on a per-tablet basis. So it will depend on which tablet-leader the row that you’re trying to select/insert belongs in.
Yes, but you still need the minimum number of tservrs/masters.
The YCQL driver will contact directly to the server which has the tablet-leader for that row (assuming single row insert/select or a single partition), and the server will also replicate to another tserver before returning “success” to the driver.
As mentioned above,
application is running in central region.
and there are 4 nodes(of database cluster) in central region
Does the performance impact depend on the node(out of 4 nodes in central region) that application points to?
if yes, what would be the performance impact for INSERT query?
if no, does performance remain same, if application point to any of the 4 nodes in central region?
Yes, if the tablet-leader for the row you’re select/inserting is residing in the same node.
If the tablet-leader is not in the node you’re connecting to, it will involve 1 extra network hop. (the server you’re connecting will redirect the request to the tablet-leader, and the tablet-leader will replicate it to another server that has the tablet-peer, and return “ok” to the node you’re connecting, and that node returns “ok” to the client).
If there are two tables involved using SELECT query, then how tablet leaders are picked? Because we want to split our single table to two tables that has referential integrity.
Think you send the query to
node_x, assuming a complex query involving 2+ tablets, will identify the location of each tablet_leader on the cluster and query the data on them, like a coordinator.
So it will depend on which tablet_leader each row resides, and where each tablet_leader reside in the cluster.
So that means, from the performance perspective, it is not a good idea to have referential integrity constraints across tables
It is slower on INSERT because the db will check the row exists on the other table (the same with DELETE).
But not on SELECT.