I have a problem with a PostgreSQL database that slows down when queries are run on old, rarely used data.
This database contains data on appointments, messages, etc. Everything is dated, and requests are mostly for current and future times. Data about old appointments and messages is rarely used, but still needed for accounting and history.
Under normal conditions, the server is very responsive. The web app shows a 40 ms response time at 15k requests per minute, and 80 ms on high-traffic days (20k requests per minute).
The database is about 120GB in size.
The Debian server's memory is fully used: roughly 2 GB by PostgreSQL and 29 GB as disk cache.
Disk I/O monitoring shows only disk writes and nearly no disk reads.
However, if I run a query on old data, for example statistics on appointments made two years ago, the server shows a massive disk-read spike (as expected), but meanwhile every other request slows down. The web app shows 250 ms+ response times for the duration of that query.
CPU usage doesn't really increase while this is happening, staying at 40-60%.
These lag spikes happen multiple times per day and are really annoying, even if not critical. Does anyone have an idea how to reduce or eliminate this problem?
That is normal. As long as you don't have enough memory to cache the whole database, rarely used data will not be cached and has to be read from disk. That is naturally slower.
To improve that, there are two options:
get enough RAM to cache the whole database
get faster storage
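To confirm that cache misses are really what hurts you, you can check how often PostgreSQL finds pages in its shared buffers; a rough indicator (it does not account for reads that are served from the operating system's page cache):

-- Fraction of block reads served from shared_buffers since the last statistics reset.
SELECT datname,
       blks_hit,
       blks_read,
       round(blks_hit::numeric / nullif(blks_hit + blks_read, 0), 4) AS buffer_hit_ratio
FROM pg_stat_database
WHERE datname = current_database();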
This can be solved programmatically. As you mentioned, your data includes hot data and cold data, divided by date. So just split the hot data and the cold data into two tables.
You can migrate cold data to an archive table automatically, and then every query should specify a time range to search. For example, if a request asks for current and future times, it only needs to touch the hot-data table.
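If you don't want to move rows between tables by hand, one way to get that hot/cold split is PostgreSQL's declarative partitioning (available since version 10); here is a minimal sketch with made-up table and column names and an arbitrary cutoff date:

-- Parent table partitioned by the appointment date.
CREATE TABLE appointments (
    id        bigserial,
    starts_at timestamptz NOT NULL,
    details   text
) PARTITION BY RANGE (starts_at);

-- Cold partition for historical rows, hot partition for current and future rows.
CREATE TABLE appointments_archive PARTITION OF appointments
    FOR VALUES FROM (MINVALUE) TO ('2024-01-01');
CREATE TABLE appointments_current PARTITION OF appointments
    FOR VALUES FROM ('2024-01-01') TO (MAXVALUE);

-- Queries that filter on starts_at can skip the archive partition entirely,
-- so the hot partition stays small enough to remain in cache.
SELECT count(*) FROM appointments WHERE starts_at >= '2024-01-01';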
I have a backend in nestjs using typeorm and postgres. This backend saves and reads data frequently from the database. In this database we are dealing with row counts of 10k+ at times that need to be updated, saved, or created.
In this particular case where I need some brain juice, I have a table (let's call it table A):
the backend fetches data from table A every few seconds
the content in table A needs to be updated frequently (properties and values overwritten). I am doing this updating task from a separate application backend solely for this use case.
Example case
Table A holds 100K records
The update service splits these 100K records into chunks of 5 and runs parallel updates of 25K records each. While doing so, the main application that retrieves data from the backend slows down.
What is the best way to get performant reads and writes in parallel? I assume the slowdown comes from locks (the main backend retrieves data while the update service tries to update), but I am not sure, as I don't have much experience working with databases.
Don't assume, assert.
While you are experiencing bad performance, check how the operating system's resources are doing; in this case, mostly CPU and disk. If one of them is maxed out, you know what is going on, and you either have to reduce the degree of parallelism or make the system stronger.
It is also interesting to look at wait events in PostgreSQL:
SELECT wait_event_type, wait_event, count(*)
FROM pg_stat_activity
WHERE state = 'active'
GROUP BY wait_event_type, wait_event;
That will show I/O related events if you are running out of disk bandwidth, but it will also show database-internal contention that you can potentially hit with very high degrees of parallelism.
In a Cassandra database, a write needs to be logged in the Write Ahead Log first and then added to the memtable in memory. Since the Write Ahead Log is on disk, even though it only performs sequential (append-only) writes, will it still be much slower than memory access and thus become the performance bottleneck for writes?
If I understand correctly, Cassandra has a mechanism to keep the Write Ahead Log in the OS cache and flush it to disk at a pre-configured interval (say, every 10 seconds). However, does that mean the data changes made within those 10 seconds could all be lost if the machine crashes?
You can control how the commit log is synced using the commitlog_sync configuration option. By default it is periodic, and the log is synced to disk every 10 seconds (controlled by the commitlog_sync_period_in_ms setting).
And yes, if you lose power, there is a risk that the data in the commit log is lost. But Cassandra relies on the fact that you have multiple replicas, and if you set things up correctly, each replica should be in a separate rack (at a minimum; better still if you have additional data centers) with separate power, etc.
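For reference, these are the relevant settings in cassandra.yaml (defaults shown; the exact names can differ between Cassandra versions, so check the documentation for the version you run):

commitlog_sync: periodic
commitlog_sync_period_in_ms: 10000
# Alternative: fsync before acknowledging each write, trading write latency for durability.
# commitlog_sync: batch
# commitlog_sync_batch_window_in_ms: 2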
I'm currently running MongoDB 2.6.7 on CentOS Linux release 7.0.1406 (Core).
Soon after the server starts, as well as (sometimes) after a period of inactivity of at least a few minutes, all queries take longer than usual. After a while, their speed increases and stabilizes at a much shorter duration (for one particular, more complex query, the difference is about 30 seconds initially versus about 7 seconds after the "warm-up" period).
After monitoring my VM using both top and various network traffic monitoring tools, I've noticed that the bottleneck is hit because of hard page faults (which had been my hunch from the beginning).
Given that my data takes up less than 2 GB and that my machine has 3.5 GB available, my collections should all fit in memory (even with their indexes). And they do end up being fetched, but only on demand, which can have a relatively negative impact on the user experience.
MongoDB uses memory-mapped files to operate on collections. Is there any way to force the operating system to prefetch the whole file into memory as soon as MongoDB starts up, instead of waiting for queries to trigger random page faults?
From the MongoDB docs:
The touch command loads data from the data storage layer into memory. touch can load the data (i.e., documents), the indexes, or both documents and indexes. Use this command to ensure that a collection, and/or its indexes, are in memory before another operation. By loading the collection or indexes into memory, mongod will ideally be able to perform subsequent operations more efficiently.
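A typical invocation from the mongo shell looks like this (the collection name is just an example; note that touch only works with the MMAPv1 storage engine and was removed in later MongoDB releases):

// Load both the documents and the indexes of the "records" collection into memory.
db.runCommand({ touch: "records", data: true, index: true })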
I'm asking this question because I would like to understand how I can run RethinkDB better, which means what kind of hardware it should be running on, what kind of filesystem, and other system configurations to maximize its throughput.
I'm trying to fill a table as fast as I can with documents that are {"n": <counter>, "rand": <Math.random()>}. I read somewhere that this is faster with batches of 200 documents, so that's what I'm inserting. I am also using soft durability. I started one nodejs process of this and I can insert on average 10k documents per second, pretty good.
But while this is happening, rethinkdb is using about 70% of one core (I have 8 virtual cores, it's an i7-4770) and the nodejs process is using 5%. So it seems that CPU is not the bottleneck.
As soon as I start another nodejs process doing the same thing, the inserts per second on both processes drop to about 4k-5k. Again, the CPU load stays the same.
I fired up iotop and I do see a lot of activity there, but not what I expected. I configured two SSDs in RAID 0, and a quick dd test says I can write and read at about 800 MBps. That's far above the actual read and write speeds iotop reports (average read ~14 MBps, average write ~50 MBps).
So how can I exhaust my machine's resources? What does RethinkDB need to run faster? Why doesn't it use more resources and achieve a higher throughput?
More information on what it's running on: it's an EX40SSD from Hetzner, two SSDs in a software RAID 0, ext4 filesystem (tomorrow I'll try to mount it with noatime to see if that's better). The RethinkDB configuration is all defaults, and the inserts are done to a table that has only one shard and one replica. Please feel free to ask about anything else relevant I might have forgotten to mention.
Thanks in advance.
What I suspect is going on here is lock contention on the actual btrees. When you're inserting a large batch of documents, the system grabs various parts of the btree in parallel to update it with the new documents. This is a set of read-write locks -- other parts of the system can still read, but if you insert another large batch in parallel, there is a high probability that it will touch similar parts of the btree, and therefore has to wait for the system to start unlocking as it inserts parts of the first batch. (This isn't specific to RethinkDB, but a problem in databases in general) This is likely why you're not hitting 100% CPU/disk throughput.
There are a few things you can try, but note that there are subtleties to various approaches. Benchmarking in general is hard.
You can try to shard the table into 32 shards and retry your benchmark (see the sketch after these suggestions). You don't actually have to create a cluster; you can shard into 32 shards on a single machine. This will result in multiple btrees, so you'll minimize contention and be able to use more system resources. Note that while this will likely increase throughput, increasing the number of shards also slightly increases latency, so you might need to significantly increase parallelism before you start seeing throughput gains.
You can try not to batch writes and instead write one document at a time (which generally approximates real-world use cases a little better). Then, start hundreds of parallel clients instead of just one or two, and have all of them write one document at a time in parallel. Note that you need to make sure the clients themselves aren't a bottleneck in this situation.
You can try to rerun your benchmark and also spin up clients that read from the database in parallel with the writes. In RethinkDB reads can usually go through even when you're writing to a particular document, so this will give you the opportunity to up CPU usage and bypass contention.
Pay attention to the IDs of the documents. If the database is large enough (say, millions of documents), and the IDs you're inserting are random, you're much less likely to touch the same parts of the btree so contention becomes less of an issue.
You can combine various approaches (sharding, reading + writing, various numbers of concurrent clients) to start getting a sense for how the database behaves in various scenarios.
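For the sharding experiment specifically, here is a minimal Node.js sketch using the official rethinkdb driver (reconfigure() is only available in newer RethinkDB releases, and the table name is a placeholder):

var r = require('rethinkdb');

// Split the benchmark table into 32 shards on a single server to reduce
// btree lock contention during parallel batch inserts.
r.connect({ host: 'localhost', port: 28015 }, function (err, conn) {
  if (err) throw err;
  r.table('test').reconfigure({ shards: 32, replicas: 1 }).run(conn, function (err, result) {
    if (err) throw err;
    console.log(result); // reports the new shard layout
    conn.close();
  });
});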
Note that there might be things going on that you wouldn't normally be aware of. For example, RethinkDB has a log-structured storage engine that does live compaction on disk, and this might use up some IO (and CPU) cycles that you'd be surprised by if you didn't know about live compaction. There are dozens of other components like this that might compound to surprising behavior, as these systems are typically very complex under the hood.
Hope this helps -- would love to hear about your progress on the benchmarks. We do a lot of them internally, and it's an art and a science to discover the boundaries of the system's performance on different use cases.
My guess is that the bottleneck here is the disk system, but not its throughput. What's more likely is that writes are happening in chunks that are too small to be efficient, or that there are delays due to latency between individual writes.
It's also possible that the latency between individual write queries coming from the client and their processing on the server slows the system down.
Here are a few things I recommend trying:
Increase the batch size further. Your documents are very small. Therefore I think you might get a significantly higher throughput with batches of 1,000-10,000 documents. This might work especially well in combination with the next point.
Run multiple concurrent clients. You mentioned that you have run 2 clients concurrently, but this might not be enough. I recommend running 16-32 if possible.
Check the cache size RethinkDB is using. By default, RethinkDB picks the cache size as a fraction of the available memory, but that is not always reliable. I recommend passing a --cache-size <MB> parameter to RethinkDB (or adding the cache-size=<MB> parameter to the configuration file, if you're using one). I can see that your server has 32 GB of RAM. I recommend using something in the range of 20000 MB (or even more) as the cache size. A larger cache reduces the number of reads, but up to a certain limit also increases the amount of unsaved data that RethinkDB can accumulate in RAM to make disk writes more efficient.
Experiment with the --io-threads <THREADS> parameter. The default is 64, but you can try increasing it to e.g. 128 and see if it has any effect.
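Putting the last two suggestions together, you could start the server along these lines (the numbers are just the examples from above, so tune them to your machine):

# 20 GB cache and a higher I/O thread count on the command line...
rethinkdb --cache-size 20000 --io-threads 128
# ...or, equivalently, in the instance's configuration file:
# cache-size=20000
# io-threads=128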
I ran into a very strange problem while testing Cassandra. I have a very simple column family that stores video data (keys point to a time period, and there is only one column containing ~2 MB of video for that period).
Use Case
I start loading data using the Hector API (round-robin) into 6 empty nodes (8 GB RAM for Cassandra). The load runs in 4 threads, each adding 4 rows per second.
After a while (running the load for an hour or so), roughly 100-200 GB have been added to the node (depending on the replication factor), and then one or several nodes become unreachable (they don't respond to ping; only a reboot helps).
Why Compaction
I do use tiered-level compaction, and monitoring the system (Debian) I can see that it is actually not the writes but the compaction that takes almost all the resources (disk, memory) and causes the server to refuse writes and then fail.
After about 30-40 minutes of the test, compaction tasks can no longer keep up and get queued. The interesting thing is that there are no deletes or updates, so compaction just reads and rewrites the same data again and again without bringing me any real value (it could just as well be compacted once in the evening).
When I slow down the pace, i.e. running 2 threads with a 1-second delay, things go better, but will it still work when I have 20 TB rather than 100 GB on a node?
Is Cassandra optimized for this type of workload? How are resources normally distributed between compaction and reads/writes?
Update
Updating the network driver solved the problem with the unreachable cluster.
Thanks,
Sergey.
Cassandra will use up to in_memory_compaction_limit_in_mb memory for a compaction. It is routine to have compaction running while reads and writes are served simultaneously. It is also normal that compaction can fall behind if you continue to throw writes at it as fast as possible; if your read workload requires that compaction be up to date or close to it at all times, then you'll need a larger cluster to spread the load around more machines.
Recommended amount of disk per node for online queries is up to 500GB, maybe 1TB if you're pushing it. Remember that this amount of data will have to be rebuilt if a node fails. Typical Cassandra workloads are CPU-bound or iops-bound, not disk-space bound, so you won't be able to make good use of that space anyway.
(It's also possible to do batch analytics against Cassandra, which we do with the Cassandra Filesystem, in which case higher disk:cpu ratios are desirable, but we use a custom compaction strategy for that as well.)
It's not clear from your report why a server would become unreachable. This is really an OS-level problem. (Are you swapping? Disabling swap would be a good first step.)
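To check the swapping hypothesis on the affected nodes, standard Linux tooling is enough (run as root where needed):

free -m       # a non-zero "used" value in the Swap row means the node has been swapping
vmstat 1 5    # sustained non-zero si/so columns mean it is swapping right now
swapoff -a    # turn swap off for the running system
# Also remove or comment out the swap entries in /etc/fstab so it stays off after a reboot.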