ArangoDB Java Batch mode insert performance

I'm using ArangoDB 3.0.5 with arangodb-java-driver 3.0.1. ArangoDB is running on a 3.5 GHz i7 with 24 GB RAM and an SSD.
Loading some simple vertex data from Apache Flink seems to be going very slowly, at roughly 1000 vertices/sec. Task Manager shows the ArangoDB process is CPU bound.
My connector calls startBatchMode, iterates through 500 calls to graphCreateVertex (with wait-for-sync set to false), and then calls executeBatch.
The system resources view in the management interface shows roughly 15000 (per sec?) while the load is running, with user CPU time pinned at 1. I'm new to ArangoDB and am not sure how to profile what is going on. Any help much appreciated!
Rob

Your performance result is the expected behavior. The point of batchMode is that all of your 500 calls are sent in one request and executed on the server in only one thread.
To gain better performance, you can use more than one thread in your client for creating your vertices. More requests in parallel allow the server to use more than one thread.
You can also use createDocument instead of graphCreateVertex. This skips the consistency checks on the graph and is a lot faster.
If you don't need these checks, you can also use importDocuments instead of batchMode + createDocument, which is even faster.

Related

Why does the response time curve of a NodeJS API become sinusoidal under load?

I am currently performing an API Load Test on my NodeJS API using JMeter and am completely new to the field. The API is deployed on an IBM Virtual Server with 4 vCPUs and 8GB of RAM.
One of my load tests includes stress testing the API with a 2500-thread (users) configuration and a ramp-up period of 2700 seconds (45 min) on an infinite loop. The goal is not to reach 2500 threads but rather to see at what point my API throws its first error.
I am only testing one endpoint of my API, which performs a bubble sort to simulate a CPU-intensive task. Using Matplotlib I plotted the results of the experiment: response time in ms against the number of active threads.
I am unsure why the response time curve becomes sinusoidal once it crosses roughly 1100 threads. I expected the response time curve to keep rising in the same manner it does at the beginning (0 - 1100 threads). Is there an explanation for the sinusoidal behaviour of the curve towards the end?
Thank you!
Graph (not shown): red = errors, blue = response time.
There could be 2 possible reasons for this:
Your application cannot handle such a big load and either performs frequent garbage collection to free up resources or queues up tasks because it cannot process them as fast as they arrive. You can try using e.g. the JMeter PerfMon Plugin to ensure that the system under test doesn't lack CPU or RAM.
JMeter by default comes with a relatively low JVM heap size and very little GC tuning (as described in the "Concurrent, High Throughput Performance Testing with JMeter" article, where the author has very similar symptoms), so it might be the case that JMeter cannot send requests fast enough. Make sure to follow the JMeter Best Practices and consider going for distributed testing if needed.

OutOfDirectMemoryError using Spring-Data-Redis with Lettuce in a multi-threading context

We are using spring-data-redis with the spring-cache abstraction and Lettuce as our Redis client.
Additionally, we use multi-threading and async execution on some methods.
An example workflow would look like this:
Main method A (main thread) --> calls method B (@Async), which is a proxy method so its logic can run asynchronously in another thread. --> Method B calls method C, which is @Cacheable. The @Cacheable annotation handles reading/writing to our Redis cache.
What's the problem?
Lettuce is Netty-based, and Netty relies on direct memory. Due to the @Async nature of our program, we have multiple threads using the LettuceConnection (and therefore Netty) at the same time.
By design all threads use the same (?) Netty instance, which shares the direct memory. Due to an apparently too small MaxDirectMemorySize, we get an OutOfDirectMemoryError when too many threads access Netty.
Example:
io.lettuce.core.RedisException: io.netty.handler.codec.EncoderException: io.netty.util.internal.OutOfDirectMemoryError: failed to allocate 8388352 byte(s) of direct memory (used: 4746467, max: 10485760)
What have we found so far?
We use the Cloud Foundry Java buildpack (https://docs.cloudfoundry.org/buildpacks/java/) and calculate the MaxDirectMemorySize using the java-buildpack-memory-calculator (https://github.com/cloudfoundry/java-buildpack-memory-calculator).
This leads to a MaxDirectMemorySize=10M. With 4 GB of memory actually available, the calculated MaxDirectMemorySize is probably way too conservative. This might be part of the problem.
Potential solutions to the problem
increase the MaxDirectMemorySize of the JVM --> but we are not sure that is sufficient
configure Netty not to use direct memory (noPreferDirect=true) --> Netty will then use the heap, but we are unsure whether this would slow down our application too much if Netty is hungry for memory
no idea if this would be an option or would even make the problem worse: configure Lettuce with shareNativeConnection=false --> which would lead to multiple connections to Redis
Our Question is: How do we solve this the correct way?
I'll happily provide more information on how we set up the configuration of our application (application.yml, LettuceConnection etc.), if any of these would help to fix the problem.
Thanks to the folks over at https://gitter.im/lettuce-io/Lobby, we got some clues on how to approach these issues.
As suspected, the 10M MaxDirectMemorySize is too conservative considering the total available memory.
The recommendation was to increase this value. Since we don't actually know how much memory Netty would need to run stably, we thought of the following steps.
First: We will disable Netty's preference for direct memory by setting noPreferDirect=true. Netty will then use the heap buffer.
Second: We will then monitor how much heap-memory Netty is going to consume during operation. Doing this, we'll be able to infer an average memory consumption for Netty.
Third: We will take the average memory consumption value and set this as the "new" MaxDirectMemorySize by setting it in the JVM option -XX:MaxDirectMemorySize. Then we'll re-enable Netty to use the DirectMemory by setting noPreferDirect=false.
Fourth: Monitor log-entries and exceptions and see if we still have a problem or if this did the trick.
[UPDATE]
We started with the steps mentioned above but realized that setting noPreferDirect=true does not completely stop Netty from using direct memory. For some use cases (NIO processes) Netty still uses direct memory.
So we had to increase the MaxDirectMemorySize.
For now we set the following JAVA_OPTS: -Dio.netty.noPreferDirect=true -XX:MaxDirectMemorySize=100M. This will probably fix our issue.

Perl threads to execute a Sybase stored proc in parallel

I have written a Sybase stored procedure to move data from certain tables (~50) on the primary db for a given id to the archive db. Since it's taking a very long time to archive, I am thinking of executing the same stored procedure in parallel with a unique input id for each call.
I manually ran the stored proc twice at the same time with different inputs and it seems to work. Now I want to use Perl threads (maximum 4 threads), with each thread executing the same procedure with a different input.
Please advise if this is recommended way or any other efficient way to achieve this. If the experts choice is threads, any pointers or examples would be helpful.
What you do in Perl does not really matter here: what matters is what happens on the side of the Sybase server. Assuming each client task creates its own connection to the database, then it's all fine, and how the client achieves this makes no difference to the Sybase server. But do not use a model where the different client tasks try to share the same client-server connection, as that will never run in parallel.
No 'answer' per se, but some questions/comments:
Can you quantify taking a very long time to archive? Assuming your archive process consists of a mix of insert/select and delete operations, do query plans and MDA data show fast, efficient operations? If you're seeing table scans, sort merges, deferred inserts/deletes, etc ... then it may be worth the effort to address said performance issues.
Can you expand on the comment that running two stored proc invocations at the same time seems to work? Again, any sign of performance issues for the individual proc calls? Any sign of contention (eg, blocking) between the two proc calls? If the archival proc isn't designed properly for parallel/concurrent operations (eg, eliminate blocking), then you may not be gaining much by running multiple procs in parallel.
How many engines does your dataserver have, and are you planning on running your archive process during a period of moderate-to-heavy user activity? If the current archive process runs at/near 100% cpu utilization on a single dataserver engine, then spawning 4 copies of the same process could see your archive process tying up 4 dataserver engines with heavy cpu utilization ... and if your dataserver doesn't have many engines ... combined with moderate-to-heavy user activity at the same time ... you could end up invoking the wrath of your DBA(s) and users. Net result is that you may need to make sure your archive process doesn't hog the dataserver.
One other item to consider, and this may require input from the DBAs ... if you're replicating out of either database (source or archive), increasing the volume of transactions per a given time period could have a negative effect on replication throughput (ie, an increase in replication latency); if replication latency needs to be kept at a minimum, then you may want to rethink your entire archive process from the point of view of spreading out transactional activity enough so as to not have an effect on replication latency (eg, single-threaded archive process that does a few insert/select/delete operations, sleeps a bit, then does another batch, then sleeps, ...).
It's been my experience that archive processes are not considered high-priority operations (assuming they're run on a regular basis, and before the source db fills up); this in turn means the archive process is usually designed so that it's efficient while at the same time putting a (relatively) light load on the dataserver (think: running as a trickle in the background) ... ymmv ...

Is reproducible benchmarking possible?

I need to test some node frameworks, or at least their routing part. That means measuring from when the request arrives at the node process until a route has been decided and a function/class with the business logic is about to be called, i.e. just before calling it. I have looked long and hard for a suitable approach, but concluded that it must be done directly in the code and not with an external benchmark tool, because I fear measuring the wrong attributes. I tried artillery and ab, but they measure a lot more attributes than I want to measure, like RTT, bad OS scheduling, random tasks executing in the OS and so on. My initial benchmarks of my custom routing code using process.hrtime() show approx. 0.220 ms (220 microseconds) execution time, but the external measurement shows 0.700 ms (700 microseconds), which is not an acceptable difference since the external figure is roughly 3.18x the internal one. Sometimes execution time jumps to 1.x seconds due to GC or system tasks. Now I wonder what a reproducible approach would look like? Maybe like this:
Use Docker with Scientific Linux to get a somewhat controlled environment.
A minimal docker container install, node enabled container only, no extras.
Store time results in global scope until test is done and then save to disk.
Terminate all applications with high/moderate disk I/O and/or CPU usage on the host OS.
Measure time as explained above and cross my fingers (a minimal sketch of this measurement is below).
Any other recommendations to take into consideration?
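For reference, the in-process measurement described above might look roughly like the following. This is only a sketch of the process.hrtime() approach under stated assumptions: routeRequest() is a hypothetical stand-in for the router under test, and the warm-up count, iteration count, and output file name are arbitrary placeholders.

const fs = require('fs');

// Hypothetical stand-in for the framework routing step being measured.
function routeRequest(req) {
  return req.url.split('/')[1];
}

const results = [];        // kept in global scope until the run is done
const WARMUP = 1000;       // let the optimizer settle before measuring
const ITERATIONS = 100000;
const fakeReq = { method: 'GET', url: '/users/42' };

for (let i = 0; i < WARMUP; i++) {
  routeRequest(fakeReq);
}

for (let i = 0; i < ITERATIONS; i++) {
  const start = process.hrtime();
  routeRequest(fakeReq);                  // measure only the routing step
  const [s, ns] = process.hrtime(start);
  results.push(s * 1e9 + ns);             // nanoseconds per call
}

// write to disk only after all measurements are finished
fs.writeFileSync('timings.json', JSON.stringify(results));

Reporting percentiles over many iterations, rather than single runs, also helps smooth over the occasional GC-driven outliers mentioned above.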

Node.js async parallel - what are the consequences?

There is code,
async.series(tasks, function (err) {
  return callback({message: 'tasks execution error', error: err});
});
where tasks is an array of functions, each of which performs an HTTP request (using the request module) and calls the MongoDB API to store the data (to a MongoHQ instance).
With my current input (~200 tasks to execute), it takes
[normal mode] collection cycle: 1356.843 sec. (22.61405 mins.)
But simply changing from series to parallel gives a magnificent benefit. Almost the same number of tasks runs in ~30 secs instead of ~23 mins.
But, knowing that nothing is free, I'm trying to understand the consequences of that change. Can I expect the number of open sockets to be much higher, more memory consumption, and a heavier hit on the DB servers?
The machine I run the code on is an Ubuntu box with only 1 GB of RAM, and I saw the app hang there once; could that be caused by a lack of resources?
Your intuition is correct that the parallelism doesn't come for free, but you certainly may be able to pay for it.
Using a load testing module (or collection of modules) like nodeload, you can quantify how this parallel operation is affecting your server to determine if it is acceptable.
Async.parallelLimit can be a good way of limiting server load if you need to, but first it is important to discover if limiting is necessary. Testing explicitly is the best way to discover the limits of your system (eachLimit has a different signature, but could be used as well).
Beyond this, common pitfalls of using async.parallel include wanting more complicated control flow than that function offers (which, from your description, doesn't seem to apply) and naively using parallel on too large a collection (which may, say, cause you to bump into your system's file descriptor limit if you are writing many files). With your ~200 request-and-save operations on 1 GB of RAM, I would imagine you would be fine as long as you aren't doing much massaging in the event handlers, but if you are experiencing server hangs, parallelLimit could be a good way out.
Again, testing is the best way to figure these things out.
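For illustration, a capped version of the call from the question might look like the sketch below, reusing the same tasks array and callback; the limit of 10 concurrent tasks is an arbitrary starting point to tune through testing.

var async = require('async');

// Run at most 10 of the ~200 tasks at a time instead of launching all of them at once.
async.parallelLimit(tasks, 10, function (err, results) {
  if (err) {
    return callback({message: 'tasks execution error', error: err});
  }
  return callback(null, results);
});

Running this with different limits while watching memory, socket counts, and DB connection counts is one way to find a value the 1 GB box can sustain.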
I would point out that async.parallel executes multiple functions concurrently, not (completely) in parallel. It is more like virtual parallelism.
Executing concurrently is like running different programs on a single CPU core via multitasking/scheduling. True parallel execution would be running a different program on each core of a multi-core CPU. This is important because node.js has a single-threaded architecture.
The best thing about node is that you don't have to worry about I/O. It handles I/O very efficiently.
In your case you are storing data to MongoDB, which is mostly I/O. So running the tasks concurrently will use up your network bandwidth, and if you are reading/writing from disk, disk bandwidth too. Your server will not hang because of CPU overload.
The consequence of this is that if you overburden your server, your requests may fail. You may get an EMFILE error (too many open files); each socket counts as a file. Usually connections are pooled, meaning that to establish a connection a socket is picked from the pool and returned to it when finished. You can increase the file descriptor limit with ulimit -n xxxx.
You may also get socket errors when overburdened, like ECONNRESET (Error: socket hang up), ECONNREFUSED or ETIMEDOUT, so handle them properly. Also check the maximum number of simultaneous connections for the MongoDB server.
Finally, the server can hang because of garbage collection. Garbage collection kicks in after your memory grows to a certain point, then runs periodically after some time. The maximum heap memory V8 can have is around 1.5 GB, so expect GC to run frequently if memory usage is high. Node will crash with "process out of memory" if it asks for more than that limit. So fix the memory leaks in your program. You can look at these tools.
The main downside you'll see here is a spike in database server load. That may or may not be okay depending on your setup.
If your database server is a shared resource then you will probably want to limit the parallel requests by using async.eachLimit instead.
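If the work is naturally "for each item, fetch it and save it", the eachLimit variant mentioned above might look roughly like this sketch; items, the limit of 5, and fetchAndSave are hypothetical placeholders for the real collection and per-item work.

var async = require('async');

// Hypothetical stand-ins: 'items' would be the ~200 things to collect, and
// fetchAndSave would do the HTTP request plus the MongoDB write for one item.
var items = [1, 2, 3];
function fetchAndSave(item, done) {
  setTimeout(done, 10);   // placeholder for request + save
}

// Process at most 5 items at a time instead of all of them at once.
async.eachLimit(items, 5, fetchAndSave, function (err) {
  if (err) {
    return console.error('tasks execution error', err);
  }
  console.log('all items processed');
});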
you'll notice the difference when multiple users connect:
in this case the processor can handle multiple operations
async tries to run the operations of multiple users roughly equally
T = task
U = user
(T1.U1 = task 1 of user 1)
T1.U1 => T1.U2 => T2.U1 => T8.U3 => T2.U2 => etc.
this is the opposite of atomicity (so maybe watch out for atomicity on special db operations - but that's another topic)
so it may happen that async runs:
T2.U1 before T1.U1
- this is no problem unless
T2.U1 depends on T1.U1
- this is preventable by using callbacks / that's what callbacks are for
...hope this is what you wanted to know... it's a bit late here
