Non-blocking gets on Hazelcast ReplicatedMap? - hazelcast

We are using a ReplicatedMap in a Hazelcast client. Client and server are both version 4.2.1.
The map is quite small (<10 entries, each value less than 100 bytes). The client only reads from the map, which is updated infrequently on the server.
We expected ReplicatedMap.get to be non-blocking, but during a long-running performance test we started getting warnings like the one below from Vert.x (which monitors for blocked threads). The first error came after 6 hours, so it is not easily reproduced.
Is there any way to do a non-blocking get? Or do we need to add an EntryListener which maintains a ConcurrentHashMap?
Clarification: The real issue here is not blocking Vert.x (which can be solved by moving the call to a Vert.x worker verticle), but rather avoiding the delay in the lookup itself. The business requirement is that we process messages in 50 ms or less, so even if we moved the call to a worker, we would be unable to fulfill that.
[vertx-blocked-thread-checker] WARN io.vertx.core.impl.BlockedThreadChecker - Thread Thread[vert.x-eventloop-thread-0,5,main] has been blocked for 12777 ms, time limit is 2000 ms
io.vertx.core.VertxException: Thread blocked
at jdk.internal.misc.Unsafe.park(Native Method) ~[?:?]
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:323) ~[?:?]
at com.hazelcast.spi.impl.AbstractInvocationFuture.manageParking(AbstractInvocationFuture.java:693) ~[hazelcast-4.2.1.jar!/:4.2.1]
at com.hazelcast.spi.impl.AbstractInvocationFuture.get(AbstractInvocationFuture.java:615) ~[hazelcast-4.2.1.jar!/:4.2.1]
at com.hazelcast.client.impl.spi.ClientProxy.invokeOnPartition(ClientProxy.java:188) ~[hazelcast-4.2.1.jar!/:4.2.1]
at com.hazelcast.client.impl.spi.ClientProxy.invoke(ClientProxy.java:182) ~[hazelcast-4.2.1.jar!/:4.2.1]
at com.hazelcast.client.impl.proxy.ClientReplicatedMapProxy.get(ClientReplicatedMapProxy.java:214) ~[hazelcast-4.2.1.jar!/:4.2.1]
at my.package.StateGetter.getState(StateGetter.java:44) ~[classes!/:1.5.189]

ReplicatedMap entries are replicated on all members of the cluster, but a client still needs to perform a remote call to a member to fetch an entry. For your requirements, I think the best way to achieve predictable latency is to set up a near cache on your client. That way entries will be cached locally on the client and only updated when necessary.
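A minimal sketch of that client-side near cache configuration (assuming Hazelcast 4.x; the map name "stateMap", the TTL and the invalidation settings are placeholders, and whether a near cache applies to your map type should be checked against the docs for your version):

import com.hazelcast.client.HazelcastClient;
import com.hazelcast.client.config.ClientConfig;
import com.hazelcast.config.NearCacheConfig;
import com.hazelcast.core.HazelcastInstance;

ClientConfig config = new ClientConfig();

// Cache entries of "stateMap" locally on the client; invalidation keeps the
// local copies in sync with the cluster, and the TTL is a safety net in case
// an invalidation event is missed.
NearCacheConfig nearCache = new NearCacheConfig("stateMap")
        .setInvalidateOnChange(true)
        .setTimeToLiveSeconds(300);
config.addNearCacheConfig(nearCache);

HazelcastInstance client = HazelcastClient.newHazelcastClient(config);
// Reads of "stateMap" are now served from the local near cache whenever possible.

A get that hits the near cache never leaves the client, which is what the 50 ms budget needs.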

The golden rule of Vert.x is to never block the event loop, and the blocking tolerance is very low (2 seconds). However, since we often need to execute blocking operations, Vert.x offers two solutions. The one I prefer to use is worker verticles (https://vertx.io/docs/vertx-core/java/#_verticles, under the Worker verticles section). You start them with the following code:
DeploymentOptions options = new DeploymentOptions().setWorker(true);
vertx.deployVerticle("com.verticles.HazelcastVerticle", options);
This type of verticle is designed for blocking operations and has a much higher tolerance before warnings start (60 seconds by default). If you expect the operation to take longer than that, you will probably need a separate thread or a different approach, but from your question I assume that is not the case, so I won't go in that direction.
The other approach is to use vertx.executeBlocking. You can find details about that here: https://dzone.com/articles/how-to-run-blocking-code-in-vertx.
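For completeness, here is a minimal sketch of the executeBlocking variant (Vert.x 4 style; hazelcastClient, "stateMap" and "someKey" are placeholders of mine, not from the question):

// Run the blocking ReplicatedMap.get on a worker thread; the second handler
// is invoked back on the event loop with the result.
vertx.<String>executeBlocking(promise -> {
    ReplicatedMap<String, String> map = hazelcastClient.getReplicatedMap("stateMap");
    promise.complete(map.get("someKey"));
}, res -> {
    if (res.succeeded()) {
        // use res.result() here, back on the event loop
    } else {
        // handle the failure (res.cause())
    }
});

Keep in mind this only moves the blocking off the event loop; as the clarification in the question says, it does not make the lookup itself any faster.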

Related

Knot Resolver: Parallelism and concurrency in modules

Context
Dear Knot Resolver users, I have a module that hooks into Knot's finish phase,
static knot_layer_api_t _layer = {
    .finish = &collect,
};
The purpose of the collect function, static int collect(knot_layer_t *ctx), is to ask an external oraculum via a REST API whether a particular domain is listed as hosting a malware or phishing campaign, and whether it should be resolved or sinkholed.
It works well as long as Knot Resolver is not targeted with hundreds of concurrent DNS requests.
When that happens, given that the oraculum's API response time varies and can occasionally reach tens to hundreds of milliseconds, clients start to temporarily perceive very long response times from Knot Resolver, far exceeding the hard timeout set on communication with the oraculum's API.
Possible problem
I think that the scaling-with-processes model actually renders the module very inefficient, because queries are queued and processed by the module one by one (within a particular process). That means that if n queries almost hit the oraculum API's timeout limit t, the client that sent the (n+1)-th query to this particular kresd process will perceive a very long response time, accumulating to roughly n*t.
Or would it? Am I completely off?
When I prototyped similar functionality in GoDNS using goroutines, the GoDNS server (at the cost of hideous CPU usage) let numerous DNS clients' queries talk to the oraculum and return to the clients "concurrently".
Question
Is it OK to use Apache Portable Runtime threading or OpenMP threading to start hiding the API's response time in the module? Or is that a complete Knot Resolver antipattern?
I'm caching the oraculum's API responses in a simple in-memory ephemeral LRU cache that resides in each kresd process. Would it be possible to use kresd's own MVCC cache instead for my arbitrary structure?
Is it possible that the problem is elsewhere, for instance, that Knot Resolver doesn't expect any blocking delay in the finish layer and thus some network queue fills up and subsequent DNS queries are rejected and/or intolerably delayed?
Thanks for pointers (pun intended)
A Knot Resolver developer here :-) (I also repeat some things answered by Jan already.)
Scaling with processes can work fine. Waiting for responses from name servers is done by libuv (via the event loop and callbacks, all within a single thread).
Due to the single-threaded style, no layer function should block (on I/O), as that would make everything block on it. AFAIK the only case where this can currently happen is when (part of) the cache gets swapped out.
There is the YIELD state (http://knot-resolver.readthedocs.io/en/latest/lib.html?highlight=yield). It's used when a sub-request is needed before processing of the layer can continue, but I currently don't know the details of how it works. I don't think it's directly applicable here, as resuming the layers currently seems to be triggered only by a sub-request finishing.
Cache: if you put your module before the rrcache module and you change the RRset, it will already get cached in its changed form.
Knot DNS developer here (not Resolver though). I think you are right. My understanding is that the layer code is executed synchronously in the daemon thread. The asynchrony appears only at the resolver network I/O level.
Internally the server runs a libuv loop which just executes callbacks for events on primitives provided by libuv (sockets, timers, signals, etc.). The problem is that you cannot suspend a running callback (a C function) at an arbitrary point, escape back to the libuv loop, and continue the callback execution at some later point.
In other words, asynchronous waiting for an event can happen only where it was expected, and the code driving the layers doesn't expect it.
Answers:
1. I'm not very familiar with libapr or OpenMP, but I don't think this can really be solved without reworking the layer interface and making it asynchronous.
2. The shared cache could certainly be used. If you cannot find the API, the jolly Knot DNS folks will happily accept a patch or help you write one.
3. This is exactly the case: Knot Resolver doesn't expect blocking code in the layer finish callback.

ArangoDB Java Batch mode insert performance

I'm using ArangoDB 3.0.5 with arangodb-java-driver 3.0.1. ArangoDB is running on a 3.5 GHz i7 with 24 GB RAM and an SSD.
Loading some simple Vertex data from Apache Flink seems to be going very slowly, at roughly 1000 vertices/sec. Task Manager shows it is CPU bound on the ArangoDB process.
My connector calls startBatchMode, iterates through 500 calls to graphCreateVertex (with waitForSync set to false), and then calls executeBatch.
System resources in the management interface show roughly 15000 (per sec?) while the load is running, with user CPU time pinned at 1. I'm new to ArangoDB and am not sure how to profile what is going on. Any help much appreciated!
Rob
Your performance result is the expected behavior. The point of batchMode is that all of your 500 calls are sent in one request and executed on the server in only one thread.
To gain better performance, you can use more than one thread in your client for creating your vertices. More requests in parallel allow the server to use more than one thread.
You can also use createDocument instead of graphCreateVertex. This skips the consistency checks on the graph and is a lot faster.
If you don't need these checks, you can also use importDocuments instead of batchMode + createDocument, which is even faster.
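A rough sketch of the multi-threaded client suggested above (the actual driver call inside insertChunk is left as a placeholder, since it depends on whether you use graphCreateVertex, createDocument or importDocuments):

import java.util.List;
import java.util.Map;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Fan the inserts out over several client threads so the server can use more
// than one thread as well; each chunk (e.g. 500 vertices) is one request.
void loadInParallel(List<List<Map<String, Object>>> chunks) throws InterruptedException {
    ExecutorService pool = Executors.newFixedThreadPool(4);
    for (List<Map<String, Object>> chunk : chunks) {
        pool.submit(() -> insertChunk(chunk));
    }
    pool.shutdown();
    pool.awaitTermination(1, TimeUnit.HOURS);
}

// Placeholder: issue the actual ArangoDB insert for this chunk here, e.g. a
// createDocument call per vertex or one bulk importDocuments call.
void insertChunk(List<Map<String, Object>> chunk) {
    // driver-specific insert code
}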

How Cassandra handle blocking execute statement in datastax java driver

The blocking execute method from com.datastax.driver.core.Session:
public ResultSet execute(Statement statement);
Comment on this method:
This method blocks until at least some result has been received from
the database. However, for SELECT queries, it does not guarantee that
the result has been received in full. But it does guarantee that some
response has been received from the database, and in particular
guarantee that if the request is invalid, an exception will be thrown
by this method.
The non-blocking execute method from com.datastax.driver.core.Session:
public ResultSetFuture executeAsync(Statement statement);
This method does not block. It returns as soon as the query has been
passed to the underlying network stack. In particular, returning from
this method does not guarantee that the query is valid or has even
been submitted to a live node. Any exception pertaining to the failure
of the query will be thrown when accessing the {@link ResultSetFuture}.
I have two questions about them, so it would be great if you could help me understand them.
Let's say I have 1 million records and I want all of them to arrive in the database (without any loss).
Question 1: Suppose I have n threads, each with the same number of records to send to the database, and all of them keep sending insert queries to Cassandra using the blocking execute call. If I increase n, will that also speed up the time it takes to insert all records into Cassandra?
Will this cause a performance problem for Cassandra? Does Cassandra have to make sure that, for every single inserted record, all the nodes in the cluster know about the new record immediately, in order to maintain data consistency? (I assume a Cassandra node won't even think about using the local machine time to control the record insertion time.)
Question 2: With the non-blocking execute, how can I ensure that all of the insertions are successful? The only way I know is to wait on the ResultSetFuture to check the execution of the insert query. Is there a better way? Is the non-blocking execute more likely to fail than the blocking execute?
Thank you very much for your help.
Suppose I have n threads, each with the same number of records to send to the database, and all of them keep sending insert queries to Cassandra using the blocking execute call. If I increase n, will that also speed up the time it takes to insert all records into Cassandra?
To some extent. Let's divorce this from the client implementation details a bit and look at things from the perspective of the number of concurrent requests, since you don't need a thread for each ongoing request if you use executeAsync. In my testing I have found that while there is a lot of value in having a high number of concurrent requests, there is a threshold beyond which there are diminishing returns or performance starts to degrade. My general rule of thumb is number of nodes * native_transport_max_threads (default: 128) * 2, but you may find more optimal results with more or less.
The idea here is that there is not much value in enqueuing more requests than Cassandra can handle at a time. By limiting the number of in-flight requests, you avoid unnecessary congestion on the connections between your driver client and Cassandra.
Question 2: With the non-blocking execute, how can I ensure that all of the insertions are successful? The only way I know is to wait on the ResultSetFuture to check the execution of the insert query. Is there a better way? Is the non-blocking execute more likely to fail than the blocking execute?
Waiting on the ResultSetFuture via get is one route, but if you are developing a fully async application, you want to avoid blocking as much as possible. Using Guava, your two best weapons are Futures.addCallback and Futures.transform.
Futures.addCallback allows you to register a FutureCallback that gets executed when the driver has received the response. onSuccess gets executed in the success case, onFailure otherwise.
Futures.transform allows you to effectively map the returned ResultSetFuture into something else. For example, if you only want the value of one column, you could use it to transform a ListenableFuture<ResultSet> into a ListenableFuture<String> without having to block on the ResultSetFuture in your code and then extract the String value.
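For example, pulling a single column out without blocking could look like this (a sketch assuming driver 3.x and a Guava version that still has the two-argument transform overload; newer Guava versions require an explicit Executor, and the "name" column is made up):

import com.datastax.driver.core.ResultSet;
import com.google.common.util.concurrent.Futures;
import com.google.common.util.concurrent.ListenableFuture;

ListenableFuture<ResultSet> resultFuture = session.executeAsync(statement);

// Map the ResultSet future into a future of the single value we care about
// (assumes the query returns at least one row).
ListenableFuture<String> nameFuture = Futures.transform(resultFuture,
        (ResultSet rs) -> rs.one().getString("name"));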
In the context of writing a dataloader program, you could do something like the following:
1. To keep things simple, use a Semaphore or some other construct with a fixed number of permits (that will be your maximum number of in-flight requests). Whenever you go to submit a query using executeAsync, acquire a permit. You should really only need one thread (though you may want to introduce a pool sized to the number of CPU cores) that acquires permits from the Semaphore and executes queries. It will just block on acquire until a permit is available.
2. Use Futures.addCallback on the future returned from executeAsync. The callback should call Semaphore.release() in both the onSuccess and onFailure cases. By releasing a permit, this allows the thread in step 1 to continue and submit the next request. (See the sketch after these steps.)
3. To further improve throughput, you might want to consider using BatchStatement and submitting requests in batches. This is a good option if you keep your batches small (50-250 statements is a good number) and if the inserts in a batch all share the same partition key.
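Putting steps 1 and 2 together, a minimal sketch (again driver 3.x plus Guava's two-argument addCallback; MAX_IN_FLIGHT, insertStmt, records and MyRecord are placeholders):

import java.util.concurrent.Semaphore;
import com.datastax.driver.core.BoundStatement;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.ResultSetFuture;
import com.google.common.util.concurrent.FutureCallback;
import com.google.common.util.concurrent.Futures;

final Semaphore permits = new Semaphore(MAX_IN_FLIGHT);   // maximum concurrent requests

for (MyRecord r : records) {
    permits.acquireUninterruptibly();                      // wait for a free slot
    BoundStatement bound = insertStmt.bind(r.getId(), r.getValue());
    ResultSetFuture future = session.executeAsync(bound);
    Futures.addCallback(future, new FutureCallback<ResultSet>() {
        @Override public void onSuccess(ResultSet rs) {
            permits.release();                             // free the slot for the next request
        }
        @Override public void onFailure(Throwable t) {
            permits.release();                             // still release, then log/retry the failure
        }
    });
}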
In addition to the above answer:
It looks like execute() calls executeAsync(statement).getUninterruptibly(), so whether you manage your own pool of n threads using execute() and block until execution completes (with at most n requests running), or use executeAsync() on all records, the Cassandra-side performance should be roughly the same, depending on execution time/count and timeouts.
The executions all run on connections borrowed from a pool. Each execution has a stream ID on the client side and notifies you via the future when the response for that stream ID comes back. Concurrency is limited by the number of requests allowed per connection on the client side and by the read threads on each node picked to execute your request; anything above that is buffered in a queue (not blocked), bounded by the connection's maxQueueSize and maxRequestsPerConnection, and anything beyond that should fail. The beauty of this is that executeAsync() does not run on a new thread per request/execution.
So there has to be a limit on how many requests can be running at once, whether via execute() or executeAsync(); with execute() you simply avoid going beyond those limits.
Performance-wise, you will start seeing a penalty beyond what each node can handle, so execute() with a reasonably sized pool makes sense to me. Even better, use a reactive architecture to avoid creating many threads that do nothing but wait, since a large number of threads causes wasted context switching on the client side. For a smaller number of requests, executeAsync() will be better since it avoids thread pools.
DefaultResultSetFuture future = new DefaultResultSetFuture(..., makeRequestMessage(statement, null));
new RequestHandler(this, future, statement).sendRequest();

synchronous vs asynchronous write/delete in Cassandra

What is the difference between synchronous and asynchronous write/delete in Cassandra?
If I use the executeAsynchronously() method instead of the execute() method of the CqlOperation class (DataStax driver), will it improve performance in terms of throughput (TPS)? In my application I am doing single inserts/deletes as well as batch inserts.
Until now I was using only the execute method (synchronous), and I am thinking of using the asynchronous execute to improve the application's performance in terms of TPS.
Async writes offer better performance per worker, but they add the overhead of callbacks and error handling.
I recently ran a test to measure the performance benefit, along with a callback implementation with error handling. Using a single worker writing 1M records, async was found to be 4 times as fast as synchronous writes. In-flight queries were limited to 1000; this number can be tuned to your environment (it is essentially the number of requests you want on the wire: with 200 ms network latency and 1 ms server response time, you might put 200 queries in flight, so the server is processing at least one query almost all the time, whereas a sync call would have left the server idle for 199 ms out of every 200 ms). Without such a restriction it will congest the network, with possible degradation in performance.
In some cases a synchronous query may be more suitable, especially if the result of the query is needed before the program can move ahead. But in most cases async suffices.
In short, the answer to your question is yes: I have measured a TPS increase of 4x.
Reference - performance evaluation using async writes
A sync write (or delete) to Cassandra will block code execution until the client receives confirmation that the operation has completed at the requested consistency level.
An async write (or delete), on the other hand, will send the query to Cassandra and then proceed with the code execution (it will not block). You then have to register some kind of callback that will inform you (asynchronously) that the write operation has completed.
All of the blocking adds up and can slow down your application. Because async queries return immediately, they allow you to send more queries right away instead of waiting for the first one to finish. This is where the performance increase occurs, especially if you are sending a lot of queries to Cassandra.
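A minimal side-by-side sketch of the two styles with the plain DataStax driver (boundInsert is a placeholder; the async variant uses Guava's two-argument addCallback as found in older Guava versions):

// Synchronous: the calling thread waits until the coordinator acknowledges the write.
session.execute(boundInsert);

// Asynchronous: returns immediately, so the thread can go on sending more queries;
// the callback fires once the write has been acknowledged (or has failed).
ResultSetFuture future = session.executeAsync(boundInsert);
Futures.addCallback(future, new FutureCallback<ResultSet>() {
    @Override public void onSuccess(ResultSet rs) { /* write acknowledged */ }
    @Override public void onFailure(Throwable t) { /* log or retry */ }
});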
It will definitely increase the performance.
I have not tried it myself, but the link below says the same (read the question there):
http://www.datastax.com/dev/blog/java-driver-async-queries

Node.js async parallel - what consequences are?

There is this code:
async.series(tasks, function (err) {
    return callback({message: 'tasks execution error', error: err});
});
where tasks is an array of functions, each of which performs an HTTP request (using the request module) and calls the MongoDB API to store the data (to a MongoHQ instance).
With my current input (~200 tasks to execute), it takes:
[normal mode] collection cycle: 1356.843 sec. (22.61405 mins.)
But simply changing from series to parallel gives a magnificent benefit: almost the same number of tasks runs in ~30 seconds instead of ~23 minutes.
But, knowing that nothing is free, I'm trying to understand the consequences of that change. Can I expect the number of open sockets to be much higher, more memory consumption, and more load on the DB servers?
The machine I run the code on is an Ubuntu box with only 1 GB of RAM, and I have seen the app hang there once. Could that be caused by a lack of resources?
Your intuition is correct that the parallelism doesn't come for free, but you certainly may be able to pay for it.
Using a load testing module (or collection of modules) like nodeload, you can quantify how this parallel operation is affecting your server to determine if it is acceptable.
async.parallelLimit can be a good way of limiting server load if you need to, but first it is important to discover whether limiting is necessary at all. Testing explicitly is the best way to discover the limits of your system (eachLimit has a different signature, but could be used as well).
Beyond this, common pitfalls of using async.parallel include wanting more complicated control flow than that function offers (which, from your description, doesn't seem to apply) and naively using parallel on too large a collection (which, say, may cause you to bump into your system's file descriptor limit if you are writing many files). With your ~200 request-and-save operations on 1 GB of RAM, I would imagine you would be fine as long as you aren't doing much massaging in the event handlers, but if you are experiencing server hangs, parallelLimit could be a good way out.
Again, testing is the best way to figure these things out.
I would point out that async.parallel executes multiple functions concurrently, not (completely) in parallel. It is more like virtual parallelism.
Executing concurrently is like running different programs on a single CPU core via multitasking/scheduling. True parallel execution would be running different programs on each core of a multi-core CPU. This is important because node.js has a single-threaded architecture.
The best thing about node is that you don't have to worry about I/O. It handles I/O very efficiently.
In your case you are storing data to MongoDB, which is mostly I/O. So running the tasks in parallel will use up your network bandwidth, and if you are reading/writing from disk, disk bandwidth too. Your server will not hang because of CPU overload.
The consequence of this is that if you overburden your server, your requests may fail. You may get an EMFILE error (too many open files); each socket counts as a file. Usually connections are pooled, meaning that to establish a connection a socket is picked from the pool and, when finished, returned to it. You can increase the file descriptor limit with ulimit -n xxxx.
You may also get socket errors when overburdened, like ECONNRESET (Error: socket hang up), ECONNREFUSED, or ETIMEDOUT, so handle them properly. Also check the maximum number of simultaneous connections for the MongoDB server.
Finally, the server can hang because of garbage collection. Garbage collection kicks in after your memory grows to a certain point and then runs periodically. The maximum heap V8 can have is around 1.5 GB, so expect GC to run frequently if memory usage is high. Node will crash with "process out of memory" if it asks for more than that limit. So fix the memory leaks in your program. You can look at these tools.
The main downside you'll see here is a spike in database server load. That may or may not be okay, depending on your setup.
If your database server is a shared resource, then you will probably want to limit the parallel requests by using async.eachLimit instead.
You'll notice the difference when multiple users connect: in that case the processor can handle multiple operations, and async tries to run the operations of multiple users roughly equally.
T = task, U = user (T1.U1 = task 1 of user 1):
T1.U1 => T1.U2 => T2.U1 => T8.U3 => T2.U2 => etc.
This is the opposite of atomicity (so you may want to watch for atomicity on particular DB operations, but that's another topic).
So it may be faster to run T2.U1 before T1.U1. That is no problem unless T2.U1 depends on T1.U1, and that is exactly what callbacks are there to prevent.
...hope this is what you wanted to know... it's a bit late here.
