I’m evaluating YugabyteDB on a very small database (a few MB) with RF=3. One of the servers is ~25ms away from the others, and I observe that the first query is always slower than repeating the same query. What causes this?
For example, when I run “select count(0) from projects” it takes 75ms on the first attempt and 25ms on subsequent attempts. It looks like the first query makes 3 round trips to the “distant” server, but why is that the case?
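For reference, I’m timing these from psql with \timing enabled; a minimal sketch of what I run (“projects” is one of my real tables):

\timing on
select count(0) from projects;  -- ~75 ms on a fresh connection
select count(0) from projects;  -- ~25 ms when repeated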
Is there a way to warm up YugabyteDB so that this doesn’t happen with live queries?
More generally, do you have any documentation that can help me understand the number of round trips required per query?
Some of my queries are timing out after 4s. I expect they involve many round trips, but I don’t really understand why. They are along the lines of “select * from projects where visibility=1 order by timestamp”. Yes, EXPLAIN shows they use a full table scan, but there are only 26 rows in projects altogether.
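In case it helps, I’ve been reading the plans with EXPLAIN; my understanding is that YSQL’s EXPLAIN also accepts a DIST option that reports storage read requests, which might be a way to count the round trips (a sketch, assuming my version supports it):

explain (analyze, dist) select * from projects where visibility=1 order by timestamp;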
FWIW, the warmup problem is not caused only by network latency. I moved all masters/tservers close to each other, but the first query still takes 200ms while subsequent ones take only 9ms. This applies separately to each table: for example, warming up “projects” does not warm up “users”.
The initial query execution (especially the first “preparation” of the query plan for a statement on a TServer) needs to do extra work to load the relevant information from the system catalog (e.g., to determine the table definitions, resolve non-fully-qualified names using the search path order, etc.).
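One possible workaround is to pre-warm each connection by planning one trivial statement per table, so the catalog entries get cached before live traffic arrives (a sketch, not an official feature; the table names are placeholders for whatever your app uses):

-- run once on each new connection, for each table the app touches
select 1 from projects limit 0;
select 1 from users limit 0;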
To understand why, say, your query on a 26-row table takes 200ms on the first execution vs. ~9ms on subsequent executions, would you mind sharing the output of:
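For instance, run twice on the same connection (a sketch; substitute your actual statement):

explain analyze select count(0) from projects;
explain analyze select count(0) from projects;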
Please see the EXPLAIN logs from the two executions below. This time it was 100ms vs 3ms, but if I run it from a follower node it is more pronounced (560ms vs 7ms).
See the planning times: your first query has Planning Time: 59.024 ms, while the second has Planning Time: 0.109 ms.
Execution time is about the same in both cases, ~2ms.
What is the hardware on each node?
One of the servers is ~25ms away from the others, and I observe that the first query is always slower than repeating the same query.
Please explain the purpose of this server being ~25ms away. It doesn’t really make sense unless you have some requirement that you haven’t specified.
There are several different factors at play.
In this case we have to understand why planning time is so high.
Have you tried using prepared statements on the client?
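For example, with SQL-level prepared statements the statement is parsed and planned once and later executions reuse that work (a minimal sketch; many client drivers can do this automatically via server-side prepared statements):

prepare q_count as select count(0) from projects;
execute q_count;
execute q_count;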
The more complex the setup, the harder it will be to know what’s going on.
The architecture section of the docs is a good starting point:
“Server leader/follower” applies only to yb-master. Leaders and followers are per-tablet on yb-tservers, unless it’s a colocated database and all your tables and indexes live in a single tablet.
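For reference, colocation is opted into when the database is created; a sketch, assuming a recent YSQL version (the database name is a placeholder, and the exact syntax may vary by version):

create database appdb with colocation = true;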
That option has been deprecated in this commit. It’s a yb-tserver flag, not a PostgreSQL option (so you can see it at http://yb-tserver-ip:7000/varz).
In its place we’ve added yb_fetch_row_limit, which does show up as a PostgreSQL option:
yugabyte=# show yb_fetch_row_limit;
yb_fetch_row_limit
--------------------
1024
(1 row)
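It can also be adjusted per session like any other PostgreSQL parameter (illustrative value):

yugabyte=# set yb_fetch_row_limit = 10240;
SET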
This should help identify additional tables that may need to be preloaded into the cache on each tserver to help mitigate the impact of the catalog fetches.
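If it helps, catalog preloading is controlled by yb-tserver flags; a sketch, assuming a flag along these lines exists in your version (verify the exact name on http://yb-tserver-ip:7000/varz):

--ysql_catalog_preload_additional_table_list=pg_operator,pg_amop,pg_cast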