Database schema in YugabyteDB for storing statistics collected periodically

Please answer the other questions too.

max-pool-size configuration parameter set to 750

On 3 VMs with 8 vCPUs and 16 GB RAM, opening 750 connections is probably too much. If they are all active, roughly 30 backend processes will have to share each vCPU thread (750 connections over 3 × 8 = 24 vCPUs). And if they each allocate 20 MB of working memory, that is about 30% of the cluster’s RAM (750 × 20 MB ≈ 15 GB out of 3 × 16 GB = 48 GB).

I assume that you are using the term “client-server” to refer to a VM where our application (the one that connects to YugabyteDB) is running, and not to the YugabyteDB server nodes. If that is the case, then we are currently using 1 “client-server” VM that has 1 vCPU and 16 GB RAM.
Since we are using a Wildfly datasource, I do not know whether the pgbouncer levels of “brutality” when rotating connections apply exactly to our case, but it seems to me that Transaction Pooling is closest to what the Wildfly datasource offers.

I should clarify a couple of points about our tests:

  • We recently upgraded the YugabyteDB server nodes to have 32 GB RAM instead of 16 GB, due to the memory issues we had seen in our initial tests.
  • The max-pool-size setting of 750 is nowhere near being reached in our current tests, since we are still testing with a relatively low number of “objects” in our database. In fact, right now the maximum number of connections simultaneously in use is 217 (see the sketch just below).
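
Concurrent connection usage can be checked from the database side with a query like the following against pg_stat_activity; this is just a generic sketch, nothing specific to our setup (and in YugabyteDB it shows the sessions on the node you are connected to):

SELECT state, count(*)
FROM pg_stat_activity
WHERE datname = current_database()
GROUP BY state;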

Yes. The server where your Java app is running.

Why? Is your app memory-constrained? If yes, why is that? Are you keeping a big memory cache or something? Are you using async?

Java can use multiple processes and threads. I would expect to use, say, 16 vCPU VMs for the client.
Even if the whole app is single-threaded, you can have 16 threads, and all of them will share the same connection pool in the same process.

So you have just 1 client-server VM with 1 vCPU? You didn’t answer: how many Wildfly client VMs do you have?

If your question is why we put so much RAM (16 GB) on our “client-server” VM, the answer is that our application is memory-intensive, and indeed we do keep a big memory cache within the application. As for why that VM has only 1 vCPU instead of more (e.g., the 16 you suggest): in our tests with YugabyteDB so far we find that less than 50% of that single vCPU is used in the worst case, and we had noticed the same in the past when we performed the corresponding tests with PostgreSQL for 50000 objects.

Regarding your question “how many wildfly clients VM you have?”: if you mean the number of clients connected to our single “client-server” VM, the answer is currently only 1, since we are not performing client stress tests at the moment. We are focusing on the performance of the back end first; once that has stabilized, we will perform stress tests with multiple clients.

Agree. I meant that it’s better to use fewer, beefier client VMs, because you can share client connections (and other things, like cache memory, etc.).

I am starting to implement the approach you suggested of using 40 tables (1 table for each day of the month, plus 9 as a buffer), and before I get too deep into the implementation I would like to ask the following:

  • For the purpose of the queries that we need to run to produce reports of the performance tables’ data, I suppose I need to define a view that selects from the 40 tables using UNION ALL. Currently we have a descending index on the timestamp field of each performance table, so that we can produce reports ordered by this field in descending order (a sketch of the per-day DDL follows this list). With the UNION ALL performed over the 40 tables, however, I assume that if we still want to present the report rows in the same order, the YugabyteDB back end will need to perform an extra sort on the timestamp field for the combined results, thus taking more time compared to using a single performance table. Is this true, and if so, is there any way we can overcome this?
  • I assume the extra “9 tables for buffer” are for cases where, for some reason, we have fallen many days behind in our data collection, correct?
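
For concreteness, the per-day table and index I have in mind look roughly like this (the column names are only placeholders; the real tables would have the same columns as the current stats table):

CREATE TABLE stats_1 (
    object_id bigint,
    metric text,
    value double precision,
    timestamp timestamptz
);

CREATE INDEX stats_1_timestamp_desc ON stats_1 (timestamp DESC);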

Your input is greatly appreciated.

Thank you

You have to show the full query.

This is true, and it was also true with partitions: PostgreSQL does the same thing underneath.

Yes.

Currently when the user opens a report for a performance table, the following query is executed:

SELECT *
FROM stats
WHERE timestamp >= (CAST (timeofDay() AS TIMESTAMP WITH TIME ZONE) - INTERVAL '1 DAY') 
ORDER BY timestamp DESC
LIMIT 50 OFFSET 0;

To implement the approach with the 40 tables, my thought was to use the same query to select the data, but instead of “stats” being a table it would now be a view defined as follows:

CREATE VIEW stats AS
    SELECT * FROM stats_1
UNION ALL
    SELECT * FROM stats_2
...
UNION ALL
    SELECT * FROM stats_40;
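
Whether the extra sort actually appears can then be checked by running the report query from above through EXPLAIN against the view, for example:

EXPLAIN (ANALYZE, COSTS OFF)
SELECT *
FROM stats
WHERE timestamp >= (CAST (timeofDay() AS TIMESTAMP WITH TIME ZONE) - INTERVAL '1 DAY')
ORDER BY timestamp DESC
LIMIT 50 OFFSET 0;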

I thought users would query the aggregated monthly tables?

Also, wouldn’t you just need to query only the last day or 2 days tables since you’re querying by timestamp and partitioning daily?

The users can specify filters and ranges for the fields in the report, among them the timestamp. If the user specifies a timestamp range that is, for example, older than a month ago, then indeed the aggregated monthly tables will be queried. However, our application by default sets the timestamp range of a report to the last 1 day (the user is of course allowed to change that), and I was referring to this default case. I agree that for the default case we could query only the last one or two daily tables, and it would be much more efficient.
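
For that default case, a rough sketch of what the query could look like against just the relevant daily tables, where stats_17 and stats_18 stand in for whichever two tables hold yesterday’s and today’s data:

SELECT *
FROM (
    SELECT * FROM stats_17
    UNION ALL
    SELECT * FROM stats_18
) AS last_two_days
WHERE timestamp >= (CAST (timeofDay() AS TIMESTAMP WITH TIME ZONE) - INTERVAL '1 DAY')
ORDER BY timestamp DESC
LIMIT 50 OFFSET 0;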

Thank you for your input; I will now proceed to implement the 40-table approach and then test it.