High I/O while running AQL in ArangoDB

I am investigating performance issues I am having with ArangoDB. I have noticed that it does heavy I/O (read operations specifically) while performing certain AQL queries. I have the following questions in mind:
What is actually loaded into physical memory? Is it the write-ahead log file, the datafile, or the journal?
If I assume that all three are loaded, then why should there be any I/O while reading the data?
Does AQL run on the journal/datafile only, or does it take the write-ahead log data into account as well?
If it runs only on the journal/datafile, then it is possible that we have new data in the write-ahead log which the AQL query will not return.
The system is not using any swap at all, but AQL run time is still increasing; it sometimes takes as long as 10 seconds to run. I ran iotop against ArangoDB and I see that this particular command (see below) does heavy reads, as high as 15 MB/s.
" 2320 be/4 arangodb 10.69 M/s 149.39 K/s 0.00 % 22.48 % arangod -c /etc/arangodb/arangod.conf --uid arangodb --gid arang~emp-path /var/tmp/arangod --log.tty --supervisor [[dispat_def]]"

ArangoDB keeps all of the above in memory, and AQL considers everything that has been written and committed to the database, regardless of whether it still resides in the write-ahead log or not.
Is it possible that your system has run out of main memory to map all this data? That could explain the high I/O.

As stj mentions:
AQL queries, even if read-only, allocate memory for their intermediate results. It is possible that the OS satisfies these memory allocation requests by unloading pages from arangod's collection datafiles. That may cause a lot of follow-up I/O when these pages are accessed again, either by the AQL query itself or by other operations on the data. Whether or not and when this happens at all depends a lot on the OS and VM configuration, so it's hard to tell from here.
It turned out to be an issue of main memory being exhausted by intermediate results.
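To make "intermediate results" concrete: any AQL construct that must materialize data before it can produce output (SORT, COLLECT, large subquery results) holds that data in query memory while the query runs. A minimal sketch of such a query, assuming the python-arango driver and a hypothetical events collection (connection details are placeholders):

from arango import ArangoClient  # assumes the python-arango driver is installed

db = ArangoClient(hosts="http://localhost:8529").db("_system", username="root", password="")

# The SORT forces every document in 'events' to be materialized as an
# intermediate result before the LIMIT can be applied (older ArangoDB versions
# do not push the limit into the sort), so peak memory grows with the
# collection size rather than with the 100 documents returned.
cursor = db.aql.execute("""
    FOR e IN events
      SORT e.timestamp DESC
      LIMIT 100
      RETURN e
""")
print(cursor.statistics())  # newer servers report peak memory usage here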

Related

When to use RocksDB and when to use MMFiles storage engine in ArangoDB?

We use ArangoDB to store telco data. The main goal of our application is to let users build certain types of reports very quickly. The reports are mostly based on the data we get from ArangoDB when we traverse different graphs. The business logic of the reports is not simple, which leads to very complex AQL queries with multiple nested traversals (sub-queries).
Quick Overview of the data we store in ArangoDB:
28 collections with documents (the biggest collection consists of 3500K documents; an average collection usually has between 100K and 1000K)
3 collections with edges (335K edges, 3500K edges and 15000K edges)
3 graphs (each graph is linked to one edge collection and the biggest graph has 23 from/to collections)
The overall data set takes about 28 GB of RAM when fully loaded (including indexes).
We have been using MMFiles for almost two years now and were very happy with the results, except for some problems:
unprecedented memory consumption which I described here
very slow restart (takes 1 hour 30 minutes before the database is fully responsive again)
the fact that we have to use very expensive VMs with 64 GB of RAM to be able to fit all the data into the RAM
After some research we started to look into a new RocksDB storage engine. I have read:
https://www.arangodb.com/why-arangodb/rocksdb-storage-engine/
https://docs.arangodb.com/3.4/Manual/Architecture/StorageEngines.html
From the documents and from the proposed answers to my question about the problem with RAM consumption I can see that RocksDB should be the way to go for us. All the documents say it is the new default engine for ArangoDB and that it should be used if you want to store more data than fits into RAM.
I installed the new ArangoDB 3.4.1 and converted our database from MMFiles to RocksDB (via arangodump and arangorestore). Then I ran some performance tests and found that all traversals became 2-6 times slower compared to what we had with the MMFiles engine. Some queries which took 20 seconds with the MMFiles engine now take 40 seconds with RocksDB, even if I run the same query multiple times (i.e. the data must already be cached).
Update 2/15/2019:
We run ArangoDB inside a Docker container on an m4.4xlarge instance on AWS with 16 vCPUs and 64 GB of RAM. We allocated 32 GB of RAM and 6144 CPU units for the ArangoDB container. Here is a short summary of our tests (the numbers show the time it took to execute a particular AQL traversal query, in HH:mm:ss format):
Note that in this particular table we do not have the 10-times performance degradation I mentioned in my original question. The maximum is 6 times slower, when we run the AQL right after a restart of ArangoDB (which I guess is OK). But most of the queries are 2 times slower compared to MMFiles, even when you run them a second time, when all the data must already be cached in RAM. The situation is even worse on Windows (that is where I saw performance degradation of 10 times and more). I will post the detailed spec of my Windows PC with the performance tests a bit later.
My question is: is it expected behavior that AQL traversals are much slower with the RocksDB engine? Are there any general recommendations on when to use the MMFiles engine and when to use the RocksDB engine, and in which cases RocksDB is not an option?
With ArangoDB 3.7, support for MMFiles has been dropped, hence this question can be answered with "use RocksDB".
It took us a while to mature the RocksDB-based storage engine in ArangoDB, but we now feel confident that it can fully handle all loads.
We demonstrate how to work with parts of the RocksDB storage system and what effects they have in this article.
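If you want to confirm which engine a given deployment is actually running, the server reports it via the /_api/engine endpoint. A minimal sketch, assuming the python-arango driver (which exposes that endpoint as engine(); connection details are placeholders):

from arango import ArangoClient  # assumes the python-arango driver is installed

db = ArangoClient(hosts="http://localhost:8529").db("_system", username="root", password="")

# Returns the storage engine details; from ArangoDB 3.7 onwards this is
# always "rocksdb".
print(db.engine())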

InfluxDB write performance statistics with Sensu

We are writing cluster performance metrics collected using Sensu to InfluxDB on a RHEL VM (16 GB). I want to collect the per-second write rates issued by the influxd process. My device is /dev/vda1 and the file location is /var/lib/influxDB/data.
The problem:
There is a substantial delay between the time Sensu collects the data and the time the data is written to InfluxDB. We suspect the disk I/O performance of InfluxDB may be the bottleneck, but we do not have concrete data to support the claim.
Tried things:
I have tried iostat, iotop and a bunch of other ways.
Using iotop, the influxd process shows an average write rate of 35 kB/s, which I am sure is far too low for the load we have. (I suspect it is NOT showing me the VM stats but the physical machine stats?)
Question:
1. Is there any other way I can collect the correct write rate metric for the influxd process?
2. Has someone else faced a similar issue with Sensu and InfluxDB? How did you solve it?
Thanks
You can use the _internal Influx database. It stores query times, disk usage, writes/reads, measurements, series cardinality and so on.
You could also install Telegraf on the data nodes and get disk IO, disk, CPU, network, memory and so on from the system.inputs section of Telegraf.
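As an illustration, the self-monitoring counters in _internal can be queried directly. A minimal sketch, assuming the influxdb-python client and an InfluxDB 1.x server; the measurement and field names used here ("write", "pointReq") may differ between versions, so adjust as needed:

from influxdb import InfluxDBClient  # assumes the influxdb-python client

client = InfluxDBClient(host="localhost", port=8086)

# Points written per 10-second window, derived from the cumulative "pointReq"
# counter that the monitor service stores in the "_internal" database.
result = client.query(
    'SELECT non_negative_derivative(max("pointReq"), 10s) '
    'FROM "write" WHERE time > now() - 1h GROUP BY time(10s)',
    database="_internal",
)
for point in result.get_points():
    print(point)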

Forcing MongoDB to prefetch memory

I'm currently running MongoDB 2.6.7 on CentOS Linux release 7.0.1406 (Core).
Soon after the server starts, as well as (sometimes) after a period of inactivity (at least a few minutes), all queries take longer than usual. After a while, their speeds increase and stabilize around a much shorter duration (for one particular, more complex query, the difference is from 30 seconds initially to about 7 after the "warm-up" period).
After monitoring my VM using both top and various network traffic monitoring tools, I've noticed that the bottleneck is hit because of hard page faults (which had been my hunch from the beginning).
Given that my data takes up less than 2 GB and that my machine has 3.5 GB available, my collections should all fit in memory (even with their indexes). And they do end up being fetched, but only on demand, which can have a relatively negative impact on the user experience.
MongoDB uses memory-mapped files to operate on collections. Is there any way to force the operating system to prefetch the whole file into memory as soon as MongoDB starts up, instead of waiting for queries to trigger random page faults?
From the MongoDB docs:
The touch command loads data from the data storage layer into memory. touch can load the data (i.e. documents), the indexes, or both documents and indexes. Use this command to ensure that a collection, and/or its indexes, are in memory before another operation. By loading the collection or indexes into memory, mongod will ideally be able to perform subsequent operations more efficiently.
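Run from Python against the MongoDB 2.6 server in the question, this could look like the following minimal sketch, assuming the pymongo driver and hypothetical database/collection names (note that the touch command was removed in later MongoDB releases):

from pymongo import MongoClient  # assumes the pymongo driver

db = MongoClient("mongodb://localhost:27017")["mydb"]

# Ask mongod to page the collection's documents and indexes into memory
# ahead of time, so the first queries do not pay the page-fault cost.
result = db.command("touch", "mycollection", data=True, index=True)
print(result)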

Massive inserts kill arangod (well, almost)

I was wondering if anyone has ever encountered this:
When inserting documents via AQL, I can easily kill my ArangoDB server. For example:
FOR i IN 1 .. 10
  FOR u IN users
    INSERT {
      _from: u._id,
      _to: CONCAT("posts/",CEIL(RAND()*2000)),
      displayDate: CEIL(RAND()*100000000)
    } INTO canSee
(where users contains 500000 entries), the following happens:
canSee becomes completely locked (also no more reads)
memory consumption goes up
arangosh or web console becomes unresponsive
fails [ArangoError 2001: Could not connect]
server is still running, accessing collection gives timeouts
it takes around 5-10 minutes until the server recovers and I can access the collection again
access to any other collection works fine
So OK, I'm creating a lot of entries, and AQL might be implemented in a way that does this in bulk. When doing the writes via the db.save method it works, but is much slower.
Also, I suspect this might have to do with the write-ahead log filling up.
But still, is there a way I can fix this? Writing a lot of entries to a database should not necessarily kill it.
Logs say
DEBUG [./lib/GeneralServer/GeneralServerDispatcher.h:411] shutdownHandler called, but no handler is known for task
DEBUG [arangod/VocBase/datafile.cpp:949] created datafile '/usr/local/var/lib/arangodb/journals/logfile-6623368699310.db' of size 33554432 and page-size 4096
DEBUG [arangod/Wal/CollectorThread.cpp:1305] closing full journal '/usr/local/var/lib/arangodb/databases/database-120933/collection-4262707447412/journal-6558669721243.db'
bests
The above query will insert 5M documents into ArangoDB in a single transaction. This will take a while to complete, and while the transaction is still ongoing, it will hold lots of (potentially needed) rollback data in memory.
Additionally, the above query will first build up all the documents to insert in memory, and once that's done, will start inserting them. Building all the documents will also consume a lot of memory. When executing this query, you will see the memory usage steadily increasing until at some point the disk writes will kick in when the actual inserts start.
There are at least two ways for improving this:
it might be beneficial to split the query into multiple, smaller transactions. Each transaction then won't be as big as the original one, and will not block that many system resources while ongoing.
for the query above, it technically isn't necessary to build up all documents to insert in memory first, and only after that insert them all. Instead, documents read from users could be inserted into canSee as they arrive. This won't speed up the query, but it will significantly lower memory consumption during query execution for result sets as big as above. It will also lead to the writes starting immediately, and thus write-ahead log collection starting earlier. Not all queries are eligible for this optimization, but some (including the above) are. I worked on a mechanism today that detects eligible queries and executes them this way. The change was pushed into the devel branch today, and will be available with ArangoDB 2.5.
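As an illustration of the first suggestion, the outer FOR i IN 1 .. 10 loop can be moved to the client, so that each pass over users becomes its own, smaller transaction. A minimal sketch, assuming the python-arango driver (connection details are placeholders):

from arango import ArangoClient  # assumes the python-arango driver is installed

db = ArangoClient(hosts="http://localhost:8529").db("_system", username="root", password="")

aql = """
FOR u IN users
  INSERT {
    _from: u._id,
    _to: CONCAT("posts/", CEIL(RAND() * 2000)),
    displayDate: CEIL(RAND() * 100000000)
  } INTO canSee
"""

# One INSERT pass per execute() call: each call is its own transaction, so the
# server never buffers all 5M documents (or their rollback data) at once.
for _ in range(10):
    db.aql.execute(aql)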

How to exhaust a machine's resources with RethinkDB?

I'm asking this question because I would like to understand how I can run RethinkDB better, which means what kind of hardware it should be running on, what kind of filesystem it should be running on, and other system configurations to maximize its throughput.
I'm trying to fill a table as fast as I can with documents that are {"n": <counter>, "rand": <Math.random()>}. I read somewhere that this is faster with batches of 200 documents, so that's what I'm inserting. I am also using soft durability. I started one Node.js process doing this and I can insert on average 10k documents per second, which is pretty good.
But while this is happening, RethinkDB is using about 70% of one core (I have 8 virtual cores; it's an i7-4770) and the Node.js process is using 5%. So it seems that CPU is not the bottleneck.
As soon as I start another Node.js process doing the same thing, the inserts per second on both processes drop to about 4k-5k. Again, the CPU load stays the same.
I fired up iotop and I do see a lot of action there, but not what I expected. I configured two SSDs in a RAID0, and a quick dd test says I can write and read at about 800 MB/s. That's far above the actual read and write speeds iotop reports (average read ~14 MB/s, average write ~50 MB/s).
So how can I exhaust my machine's resources? What does RethinkDB need to run faster? Why doesn't it use more resources and achieve higher throughput?
More information on what it's running on: it's an EX40SSD from Hetzner, two SSDs in a software RAID0, ext4 filesystem (tomorrow I'll try to mount it with noatime to see if that's better). The RethinkDB configuration is all defaults; the inserts are done to a table that has only one shard and one replica. Please feel free to ask anything else relevant I might have forgotten to mention.
Thanks in advance.
What I suspect is going on here is lock contention on the actual btrees. When you're inserting a large batch of documents, the system grabs various parts of the btree in parallel to update it with the new documents. This is a set of read-write locks -- other parts of the system can still read, but if you insert another large batch in parallel, there is a high probability that it will touch similar parts of the btree and therefore has to wait for the system to start unlocking as it inserts parts of the first batch. (This isn't specific to RethinkDB, but a problem in databases in general.) This is likely why you're not hitting 100% CPU/disk throughput.
There are a few things you can try, but note that there are subtleties to various approaches. Benchmarking in general is hard.
You can try to shard the table into 32 shards and retry your benchmark. You don't actually have to create a cluster, you can shard into 32 shards on a single machine. This will result in multiple btrees, so you'll minimize contention and will be able to use more system resources. Note that while this will likely increase the throughput, increasing the number of shards also slightly increases the latency, so you might need to significantly increase parallelism before you start seeing throughput increases.
You can try not to batch writes and instead write one document at a time (which generally approximates real-world use cases a little better). Then, start hundreds of parallel clients instead of just one or two and have all of them write one document at a time in parallel. Note that you need to make sure the clients themselves aren't a bottleneck in this situation.
You can try to rerun your benchmark and also spin up clients that read from the database in parallel with the writes. In RethinkDB reads can usually go through even when you're writing to a particular document, so this will give you the opportunity to up CPU usage and bypass contention.
Pay attention to the IDs of the documents. If the database is large enough (say, millions of documents), and the IDs you're inserting are random, you're much less likely to touch the same parts of the btree so contention becomes less of an issue.
You can combine various approaches (sharding, reading + writing, various numbers of concurrent clients) to start getting a sense for how the database behaves in various scenarios.
Note that there might be things going on that you wouldn't normally be aware of. For example, RethinkDB has a log-structured storage engine that does live compaction on disk, and this might use up some IO (and CPU) cycles that you'd be surprised by if you didn't know about live compaction. There are dozens of other components like this that might compound to surprising behavior, as these systems are typically very complex under the hood.
Hope this helps -- would love to hear about your progress on the benchmarks. We do a lot of them internally, and it's an art and a science to discover the boundaries of the system's performance on different use cases.
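As a concrete illustration of the sharding suggestion above, the table can be resharded in place. A minimal sketch, assuming the RethinkDB Python driver, a hypothetical test.t table, and a server version recent enough to support reconfigure():

import rethinkdb as r  # assumes the legacy "import rethinkdb as r" driver style

conn = r.connect("localhost", 28015)

# Split the table into 32 shards on the single server to reduce btree lock
# contention during parallel bulk inserts (1 replica, i.e. no redundancy).
print(r.db("test").table("t").reconfigure(shards=32, replicas=1).run(conn))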
My guess is that the bottleneck here is the disk system, but not its throughput. What's more likely is that writes are happening in chunks that are too small to be efficient, or that there are delays due to latency between individual writes.
It's also possible that the latency between individual write queries coming from the client and their processing on the server slows the system down.
Here are a few things I recommend trying:
Increase the batch size further. Your documents are very small. Therefore I think you might get a significantly higher throughput with batches of 1,000-10,000 documents. This might work especially well in combination with the next point.
Run multiple concurrent clients. You mentioned that you have run 2 clients concurrently, but this might not be enough. I recommend running 16-32 if possible.
Check the cache size RethinkDB is using. By default, RethinkDB picks the cache size as a fraction of the available memory, but that is not always reliable. I recommend passing a --cache-size <MB> parameter to RethinkDB (or adding the cache-size=<MB> parameter to the configuration file, if you're using one). I can see that your server has 32 GB of RAM. I recommend using in the range of 20000 MB (or even more) as the cache size. A larger cache reduces the number of reads, but up to a certain limit also increases the amount of unsaved data that RethinkDB can accumulate in RAM to make disk writes more efficient.
Experiment with the --io-threads <THREADS> parameter. The default is 64, but you can try increasing it to e.g. 128 and see if it has any effect.
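Combining the first two suggestions might look like the following minimal sketch, assuming the RethinkDB Python driver and a hypothetical test.t table (in the original setup the clients were Node.js processes, but the shape of the workload is the same):

import random
from multiprocessing import Pool

import rethinkdb as r  # assumes the legacy "import rethinkdb as r" driver style

BATCH_SIZE = 5000         # larger batches than the original 200
BATCHES_PER_CLIENT = 100
CLIENTS = 16              # several concurrent clients instead of one or two

def insert_batches(client_id):
    # Each client process opens its own connection and streams soft-durability
    # batches, mirroring the original workload but with more parallelism.
    conn = r.connect("localhost", 28015)
    for _ in range(BATCHES_PER_CLIENT):
        docs = [{"n": i, "rand": random.random()} for i in range(BATCH_SIZE)]
        r.db("test").table("t").insert(docs, durability="soft").run(conn)
    conn.close()

if __name__ == "__main__":
    with Pool(CLIENTS) as pool:
        pool.map(insert_batches, range(CLIENTS))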
