I am experimenting with KeyDB to see whether, and by how much, performance can be improved, since Redis's single-threaded query model is a definite bottleneck. KeyDB claims to use "real" multithreading to run queries against the database in parallel, unlike Redis, which multithreads only the I/O and not the queries themselves.
From the KeyDB documentation:
Unlike Redis6 and Elasticache, KeyDB multithreads several aspects
including placing the event loop on multiple threads, with network IO, and query parsing done concurrently.
My simple test setup:
First, I install KeyDB on Ubuntu (WSL2) and get it running
I note that when starting KeyDB, 2 threads are active:
Thread 0 alive. Thread 1 alive.
I modify keydb.conf to disable some saving/persistence, but most importantly I set the server-threads option to 2: server-threads 2. Note: I have also tried skipping the config file and passing the command-line flag --server-threads 2 instead, and setting the thread count to 4; neither made a difference.
Then I run a simple script:
Create 1M entries in a hash with some simple JSON objects
Create a simple console app that uses two threads: one thread does very simple SETs (SET key1 1) or GETs (GET key1) in a loop, and the other does a "fetch all" from the hash (HGETALL testhash). The second thread waits 1 second before it starts its "long query".
GitHub repo (using StackExchange.Redis lib) can be found here.
What I expect:
I expect the simple quick SET/GETs to take approximately the same time every time, without any delays or throttling caused by KeyDB blocking while the long query is running.
What happens:
The simple quick SET/GETs are blocked/delayed for around 500-700 ms while the long query is running, indicating that only one thread is being used and thus blocking other operations. This is in line with how Redis works, and what I wanted to avoid with KeyDB.
Log:
The "Starting long query" is when we do the HGETALL and almost immediately after, the simple SET is throttled and takes over 500ms, when it should take 0-1 ms, as can be seen before and after.
Using ServiceStack Redis client:
10:50:55.336 GetValueFromHashAsync took 1
10:50:55.367 GetValueFromHashAsync took 1
10:50:55.397 GetValueFromHashAsync took 0
10:50:55.416 Starting long query
10:50:56.191 GetValueFromHashAsync took 766 <-- THROTTLED! Delayed with what I think is the actual query time, not the IO part, so at this point, the line fetching data has not completed yet
10:50:56.228 GetValueFromHashAsync took 0
10:50:56.261 GetValueFromHashAsync took 1
....
....
10:51:00.592 GetValueFromHashAsync took 1
10:51:00.620 GetValueFromHashAsync took 1
10:51:00.651 GetValueFromHashAsync took 1
10:51:00.663 Long query done in 5244 <-- The long query returns here, line is completed, total time was about 5 seconds, while the block was about 0.7 seconds
I have also tested a GET from the hash instead of a SET, with the same result.
Using StackExchange.Redis:
In the reproducible GitHub project, found here, I am using StackExchange.Redis instead of ServiceStack, and I get a different (worse!) behaviour:
11:27:12.084 HashGetAsync took 0
11:27:12.115 HashGetAsync took 0
11:27:12.146 HashGetAsync took 0
11:27:12.177 HashGetAsync took 1
11:27:12.183 Starting long query
11:27:14.877 Long query done in 2692
11:27:14.893 HashGetAsync took 2686 <-- THROTTLED! This time the other thread is delayed the entire time, query + IO.
11:27:14.929 HashGetAsync took 0
11:27:14.960 HashGetAsync took 0
11:27:14.992 HashGetAsync took 0
11:27:15.023 HashGetAsync took 0
11:27:15.053 HashGetAsync took 0
Conclusion
Regardless of what client library I use, KeyDB is throttling requests/queries while a "long query" is running, even though I have 2 threads. It does not matter if I start KeyDB with 4 threads, same behaviour.
I don't know why StackExchange behaves differently from ServiceStack, but that is not the main question right now.
In fact, KeyDB runs only the I/O and the Redis protocol parsing in parallel. It executes commands serially, one at a time, and the worker threads are synchronized with a spin lock.
That is why the simple SET/GET commands are blocked by a slow command. So even with KeyDB you should NOT run slow commands; the multithreading won't help with them.
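The effect of a single lock around command execution can be reproduced in a few lines. The following is only a minimal Python simulation, not KeyDB code: command_lock, run_command and demo are hypothetical names, and an ordinary mutex stands in for the spin lock. It shows that a trivial command issued on a second worker thread still waits for the slow command to release the lock.

```python
import threading
import time

# Hypothetical stand-in for the server-wide lock that serializes command execution.
command_lock = threading.Lock()


def run_command(duration_s):
    """Execute one simulated command under the global lock; return latency in ms."""
    start = time.perf_counter()
    with command_lock:          # only one worker may execute a command at a time
        time.sleep(duration_s)  # simulated execution time
    return (time.perf_counter() - start) * 1000


def demo(slow_s=0.3):
    """Start a slow command, then time a fast one issued from another thread."""
    slow_started = threading.Event()

    def slow_worker():
        with command_lock:
            slow_started.set()  # the slow command now holds the lock
            time.sleep(slow_s)  # e.g. an HGETALL over a very large hash

    worker = threading.Thread(target=slow_worker)
    worker.start()
    slow_started.wait()
    fast_latency_ms = run_command(0.001)  # a trivial SET/GET on another worker
    worker.join()
    return fast_latency_ms


if __name__ == "__main__":
    print(f"fast command latency: {demo():.0f} ms")
```

The fast command's latency comes out close to the slow command's duration, which matches the pattern in the logs above: extra threads help with I/O and parsing, not with execution.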
UPDATE
KeyDB can have multiple threads listening on the same IP:port (via SO_REUSEPORT), so it can accept multiple connections in parallel. It also reads from sockets (including parsing the received data into commands using the Redis protocol, RESP) and writes to sockets in parallel.
Redis, by contrast, has only a single thread, the main thread, listening on the IP:port. By default, Redis reads from and writes to sockets in that single thread. Since Redis 6.0, you can enable io-threads to make it write to sockets in parallel. If you also enable io-threads-do-reads, Redis will do the reading and protocol parsing in parallel as well.
Related
I'm using pika 1.1 and graph-tool 3.4 in my Python application. It consumes tasks from RabbitMQ, which are then used to build graphs with graph-tool and run some calculations.
Some of the calculations, such as betweenness, take a lot of CPU power, which pushes CPU usage to 100% for a long time. Sometimes the RabbitMQ connection drops, which causes the task to start again from the beginning.
Even though the calculations run in a separate process, my guess is that while the CPU is loaded at 100%, the application gets no opportunity to send a heartbeat to RabbitMQ, which causes the connection to terminate. This doesn't happen every time, which suggests that heartbeats do get through now and then by chance. This is only a guess; I am not sure what else could cause this.
I tried lowering the priority of the calculation process using nice(19), which didn't work. I'm assuming it doesn't affect the processes spawned by graph-tool, which parallelizes work on its own.
Since the calculation is just one line of code, graph.calculate_betweenness(..., I don't have a place to send heartbeats manually or to slow the execution down to create a chance for heartbeats.
Could my guess be correct that heartbeats are not being sent because the CPU is too busy?
If yes, how can I handle this scenario?
Answering your questions:
Yes, that's basically it.
Our solution is to create a separate process for the CPU-intensive tasks.
import time
from multiprocessing import Process

import pika


def work(body):
    # Stand-in for the CPU-intensive job; runs in a child process.
    # Defined at module level so it can be pickled under the "spawn" start method.
    time.sleep(60)  # If I remember well default HB is 30 seconds
    print(" [x] %r" % body)


def cpu_intensive_task(ch, method, properties, body):
    # Hand the job to a child process and return at once,
    # so the consumer loop keeps servicing the connection.
    p = Process(target=work, args=(body,))
    p.start()
    # Important: if you call p.join() here, you will have the same problem.


if __name__ == "__main__":
    connection = pika.BlockingConnection(
        pika.ConnectionParameters(host='localhost'))
    channel = connection.channel()
    channel.exchange_declare(exchange='logs', exchange_type='fanout')
    result = channel.queue_declare(queue='', exclusive=True)
    queue_name = result.method.queue
    channel.queue_bind(exchange='logs', queue=queue_name)
    channel.basic_consume(
        queue=queue_name, on_message_callback=cpu_intensive_task, auto_ack=True)
    channel.start_consuming()
I wonder whether this is the best solution to the problem, or whether RabbitMQ is the best tool for CPU-intensive tasks. (For really long CPU-intensive tasks, more than 30 minutes, you will also need to deal with the acknowledgement timeout if you send manual ACKs: https://www.rabbitmq.com/consumers.html#acknowledgement-timeout)
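The warning about p.join() applies to any wait that blocks the consumer indefinitely. If the parent does need to know when the work has finished, one option is to poll the child and give the connection a chance to be serviced between polls. The sketch below uses only the standard library and assumes a hypothetical on_tick hook where the application would service its RabbitMQ connection; it does not call pika, and run_in_child is an illustrative name.

```python
import subprocess
import sys
import time


def run_in_child(python_code, on_tick, poll_interval_s=0.05):
    """Run CPU-heavy code in a separate interpreter and poll until it exits.

    on_tick is a hypothetical hook: it is where the parent would service
    its broker connection so that heartbeats keep flowing.
    """
    child = subprocess.Popen([sys.executable, "-c", python_code])
    while child.poll() is None:      # child is still working
        on_tick()                    # parent stays free between polls
        time.sleep(poll_interval_s)
    return child.returncode


if __name__ == "__main__":
    ticks = []
    busy = "sum(i * i for i in range(3_000_000))"
    code = run_in_child(busy, lambda: ticks.append(time.monotonic()))
    print(f"child exit code {code}, parent ticked {len(ticks)} times")
```

Because the heavy work runs in another process, the parent's loop keeps running even while the child saturates a core.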
I remember that the default thread pool size for Node is 4 (or based on the CPU count). That leads to my question.
For a very basic, simplified case: I'm writing service1 in Node, which sends requests to service2, waits until it finishes the computation, and then continues. Service2, on another server, can handle 1000 requests at the same time; it takes time, and it is a blocking call (which is out of my control).
If I do it the Java way, I can create 1000 threads in GlassFish, so the first burst of 1000 requests can be processed at the same time. The 1001st may need to wait a little.
1000 incoming req -> java server1 -> 1000 threads -> 1000 outgoing req -> server2
But in Node, if the thread pool size is 4 on a 4-core machine, does that mean the Node app will be slower than Java in this case? What happens if I increase the pool size to 1000? Can I increase it to 1000?
1000 incoming req -> node server1 -> ~4 threads -> 1000 outgoing req -> server2
I don't see an easy way to do this in Node. Or could I let Node handle most things and, for the blocking call above, add a small Java server and dispatch the outgoing requests to it? Any suggestions?
UPDATE: I found this: "We use setTimeout( function(){} , 0 ); to create asynchronous functions in JavaScript!"
https://medium.com/from-the-scratch/javascript-writing-your-own-non-blocking-asynchronous-functions-60091ceacc79
I guess that converting the blocking call into an async function might solve my issue. I hope so!
Node hands its I/O tasks off to the operating system, which is generally multi-threaded. Node does not wait for requests to finish (which would block a thread), because waiting wastes time. Instead, it hands these tasks off and asks to be notified when they are done. There is a very good related question:
How, in general, does Node.js handle 10,000 concurrent requests?
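The same idea can be shown with Python's asyncio, used here only as an analogy to Node's event loop. In this sketch call_service2 is a hypothetical stand-in for the remote call, and asyncio.sleep simulates the network wait. A single thread keeps 1000 slow calls in flight at once because nothing blocks while waiting.

```python
import asyncio
import time


async def call_service2(request_id, latency_s=0.2):
    """Simulated slow remote call; awaiting yields the thread to other requests."""
    await asyncio.sleep(latency_s)  # network wait, no thread is blocked
    return request_id


async def handle_burst(n=1000, latency_s=0.2):
    """Issue n calls concurrently on one thread and measure the total time."""
    start = time.perf_counter()
    results = await asyncio.gather(
        *(call_service2(i, latency_s) for i in range(n)))
    return results, time.perf_counter() - start


if __name__ == "__main__":
    results, elapsed = asyncio.run(handle_burst())
    print(f"{len(results)} calls finished in {elapsed:.2f}s on one thread")
```

The total time stays close to the latency of a single call rather than 1000 times that, which is why the size of the thread pool is not the limiting factor for outgoing network requests.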
The problem seems simple: I have a huge number of operations to run, and the main thread can only proceed once all of them have returned their results. I first tried a single thread: each operation took roughly 2 to 10 seconds, and the whole run took about 2.5 minutes. I then tried Future tasks and submitted them all to an ExecutorService. All of them were processed at once, but each one took roughly 40 to 150 seconds, and the full process took about 2.1 minutes.
If I'm right, the threads were just a way of executing everything at once while sharing the processor's power. What I expected was for the processor to work hard enough that all the tasks ran at the same time, each taking as long as it takes to execute on a single thread.
The question is: is there a way to achieve this? (Maybe not with Future tasks, maybe with something else, I don't know.)
Detail: I don't need them to run at exactly the same time; what really matters to me is the performance.
You might have created far too many threads. As a consequence, the CPU was constantly switching between them, generating noticeable overhead.
You probably need to limit the number of running threads; then you can simply submit your tasks, and they will execute concurrently.
Something like:
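ExecutorService es = Executors.newFixedThreadPool(8);
List<Future<?>> futures = new ArrayList<>(runnables.size());
for (Runnable r : runnables) {
    futures.add(es.submit(r)); // keep the Future so we can wait on it below
}
// wait until they all finish:
for (Future<?> f : futures) {
    f.get(); // throws InterruptedException / ExecutionException
}
es.shutdown(); // release the pool's threads
// all done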
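To check that a fixed-size pool really caps concurrency, the pattern can be instrumented. The sketch below is a Python analogue of the Java snippet, using concurrent.futures.ThreadPoolExecutor; run_bounded and the peak counter are illustrative additions, not part of the answer above. Note that for CPU-bound work in CPython a process pool would usually be the better fit, since threads share the interpreter lock.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


def run_bounded(tasks, max_workers=8):
    """Run zero-argument callables on a fixed-size pool.

    Returns (results in submission order, peak number of concurrent tasks).
    """
    lock = threading.Lock()
    running = 0
    peak = 0

    def tracked(task):
        nonlocal running, peak
        with lock:
            running += 1
            peak = max(peak, running)
        try:
            return task()
        finally:
            with lock:
                running -= 1

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(tracked, t) for t in tasks]  # keep every Future
        results = [f.result() for f in futures]             # wait for each one
    return results, peak


if __name__ == "__main__":
    def make_task(i):
        def task():
            time.sleep(0.01)
            return i * i
        return task

    results, peak = run_bounded([make_task(i) for i in range(50)], max_workers=4)
    print(results[:5], "peak concurrent tasks:", peak)
```

All tasks are submitted up front, but at most max_workers of them run at any moment; the rest wait in the pool's queue instead of competing for the CPU.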
I have a CPU-intensive task (looping through some data and evaluating results). I want to make use of multiple cores for this, but my performance is consistently worse than when using a single core.
I've tried:
Creating multiple processes on different ports with express and sending the tasks to these processes
Using webworker-threads to run the tasks in different threads using the thread pool
I'm measuring the results by counting the total number of iterations I can complete and dividing by the amount of time I spent working on the problem. When using a single core, my results are significantly better.
some points of interest:
I can identify when I am just using one core and when I am using multiple cores through task manager. I am using the expected number of cores.
I have lots of ram
I've tried running on just 2 or 3 cores
I added nextTicks which doesn't seem to impact anything in this case
The tasks take several seconds each so I don't feel like I'm losing a lot to overhead
Any idea as to what is going on here?
Update for threads: I suspect a bug in webworker-threads
Skipping express for now, I think the issue may have to do with my thread loop. I am creating the threads and then trying to run them continuously while sending data back and forth between them. Even though both threads are using CPU, only thread 0 is returning values. My assumption was that emitting to any would generally deliver the message to the thread that had been idle the longest, but that does not seem to be the case. My setup looks like this:
Within threadtask.js
thread.on('init', function() {
    thread.emit('ready');
    thread.on('start', function(data) {
        console.log("THREAD " + thread.id + ": execute task");
        //...
        console.log("THREAD " + thread.id + ": emit result");
        thread.emit('result', otherData);
    });
});
main.js
var tp = Threads.createPool(NUM_THREADS);
tp.load(threadtaskjsFilePath);
var readyCount = 0;
tp.on('ready', function() {
    readyCount++;
    if(readyCount == tp.totalThreads()) {
        console.log('MAIN: Sending first start event');
        tp.all.emit('start', JSON.stringify(data));
    }
});
tp.on('result', function(eresult) {
    var result = JSON.parse(eresult);
    console.log('MAIN: result from thread ' + result.threadId);
    //...
    console.log('MAIN: emit start' + result.threadId);
    tp.any.emit('start' + result.threadId, data);
});
tp.all.emit("init", JSON.stringify(data2));
The output to this disaster
MAIN: Sending first start event
THREAD 0: execute task
THREAD 1: execute task
THREAD 1: emit result
MAIN: result from thread 1
THREAD 0: emit result
THREAD 0: execute task
THREAD 0: emit result
MAIN: result from thread 0
MAIN: result from thread 0
THREAD 0: execute task
THREAD 0: emit result
THREAD 0: execute task
THREAD 0: emit result
MAIN: result from thread 0
MAIN: result from thread 0
THREAD 0: execute task
THREAD 0: emit result
THREAD 0: execute task
THREAD 0: emit result
MAIN: result from thread 0
MAIN: result from thread 0
I did try another approach as well where I would emit all but then have each thread listen for a message that only it could answer. Eg, thread.on('start' + thread.id, function() { ... }). This doesn't work because in the result when I do tp.all.emit('start' + result.threadId, ... ), the message doesn't get picked up.
MAIN: Sending first start event
THREAD 0: execute task
THREAD 1: execute task
THREAD 1: emit result
THREAD 0: emit result
Nothing more happens after that.
Update for multiple express servers: I'm getting improvements but smaller than expected
I revisited this solution and had more luck. I think my original measurement may have been flawed. New results:
Single process: 3.3 iterations/second
Main process + 2 servers: 4.2 iterations/second
Main process + 3 servers: 4.9 iterations/second
One thing I find a little odd is that I'm not seeing around 6 iterations/second for 2 servers and 9 for 3. I get that there are some losses for networking but if I increase my task time to be sufficiently high, the network losses should be pretty minor I would think.
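The gap can be put into numbers. The sketch below computes speedup and per-server efficiency from the figures above, and additionally estimates the parallel fraction implied by Amdahl's law. Treating each server as one worker and applying Amdahl's law are assumptions made for illustration; scaling_report is a hypothetical helper name.

```python
def scaling_report(baseline, measured):
    """Summarize scaling for measured throughputs.

    baseline: single-process throughput (iterations/second).
    measured: dict mapping worker count -> throughput.
    """
    report = {}
    for workers, throughput in measured.items():
        speedup = throughput / baseline
        report[workers] = {
            "speedup": speedup,
            "efficiency": speedup / workers,
            # Amdahl's law, S = 1 / ((1 - p) + p / n), solved for p
            "parallel_fraction": (1 - 1 / speedup) / (1 - 1 / workers),
        }
    return report


if __name__ == "__main__":
    for n, row in scaling_report(3.3, {2: 4.2, 3: 4.9}).items():
        print(n, {k: round(v, 2) for k, v in row.items()})
```

With these measurements the speedup is roughly 1.27x for 2 servers and 1.48x for 3. Under the Amdahl assumption, that corresponds to less than half of the work actually running in parallel, which points to a serial portion rather than network cost alone.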
You shouldn't push your Node.js processes to run multiple threads for performance gains. On a quad-core processor, having one express process handle general requests and three express processes handle the CPU-intensive requests would probably be the most effective setup. That is why I suggest designing your express processes to refrain from using Web Workers and simply block until they produce a result. This gets each process down to a single thread, as per design, which will most likely yield the best results.
I do not know the intricacies of how the Web Workers package handles synchronization or how it affects the Node.js I/O thread pools that live in native code. In general, though, I believe you would introduce Web Workers in order to manage more blocking tasks at the same time without severely affecting other requests, namely those that need no threading or system I/O or that can otherwise be answered quickly. That does not necessarily mean it will improve performance for the particular tasks being performed. If you run 4 processes with 4 threads each that perform I/O, you might be locking yourself into wasting time on continuous thread context switches outside the application space.
We have an application that is undergoing performance testing. Today, I decided to take a dump of w3wp & load it in windbg to see what is going on underneath the covers. Imagine my surprise when I ran !threads and saw that there are 640 background threads, almost all of which seem to say the following:
OS Thread Id: 0x1c38 (651)
Child-SP RetAddr Call Site
0000000023a9d290 000007ff002320e2 Microsoft.Practices.EnterpriseLibrary.Caching.ProducerConsumerQueue.WaitUntilInterrupted()
0000000023a9d2d0 000007ff00231f7e Microsoft.Practices.EnterpriseLibrary.Caching.ProducerConsumerQueue.Dequeue()
0000000023a9d330 000007fef727c978 Microsoft.Practices.EnterpriseLibrary.Caching.BackgroundScheduler.QueueReader()
0000000023a9d380 000007fef9001552 System.Threading.ExecutionContext.runTryCode(System.Object)
0000000023a9dc30 000007fef72f95fd System.Threading.ExecutionContext.Run(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object)
0000000023a9dc80 000007fef9001552 System.Threading.ThreadHelper.ThreadStart()
If I had to guess, I'd say one of these threads is being spawned for each run of our app: we have 2 app servers and 20 concurrent users, and we ran the test approximately 30 times, so it's in the neighborhood.
Is this 'expected behavior', or perhaps have we implemented something improperly? The test ran hours ago, so i would have expected any timeouts to have occurred already.
Edit: Thank you all for your replies. It has been requested that more detail be shown about the callstack - here is the output of !mk from sosex.dll.
ESP RetAddr
00:U 0000000023a9cb38 00000000775f72ca ntdll!ZwWaitForMultipleObjects+0xa
01:U 0000000023a9cb40 00000000773cbc03 kernel32!WaitForMultipleObjectsEx+0x10b
02:U 0000000023a9cc50 000007fef8f5f595 mscorwks!WaitForMultipleObjectsEx_SO_TOLERANT+0xc1
03:U 0000000023a9ccf0 000007fef8f59f49 mscorwks!Thread::DoAppropriateAptStateWait+0x41
04:U 0000000023a9cd50 000007fef8e55b99 mscorwks!Thread::DoAppropriateWaitWorker+0x191
05:U 0000000023a9ce50 000007fef8e2efe8 mscorwks!Thread::DoAppropriateWait+0x5c
06:U 0000000023a9cec0 000007fef8f0dc7a mscorwks!CLREvent::WaitEx+0xbe
07:U 0000000023a9cf70 000007fef8fba72e mscorwks!Thread::Block+0x1e
08:U 0000000023a9cfa0 000007fef8e1996d mscorwks!SyncBlock::Wait+0x195
09:U 0000000023a9d0c0 000007fef9463d3f mscorwks!ObjectNative::WaitTimeout+0x12f
0a:M 0000000023a9d290 000007ff002321b3 *** ERROR: Module load completed but symbols could not be loaded for Microsoft.Practices.EnterpriseLibrary.Caching.DLL
Microsoft.Practices.EnterpriseLibrary.Caching.ProducerConsumerQueue.WaitUntilInterrupted()(+0x0 IL)(+0x11 Native)
0b:M 0000000023a9d2d0 000007ff002320e2 Microsoft.Practices.EnterpriseLibrary.Caching.ProducerConsumerQueue.Dequeue()(+0xf IL)(+0x18 Native)
0c:M 0000000023a9d330 000007ff00231f7e Microsoft.Practices.EnterpriseLibrary.Caching.BackgroundScheduler.QueueReader()(+0x9 IL)(+0x12 Native)
0d:M 0000000023a9d380 000007fef727c978 System.Threading.ExecutionContext.runTryCode(System.Object)(+0x18 IL)(+0x106 Native)
0e:U 0000000023a9d440 000007fef9001552 mscorwks!CallDescrWorker+0x82
0f:U 0000000023a9d490 000007fef8e9e5e3 mscorwks!CallDescrWorkerWithHandler+0xd3
10:U 0000000023a9d530 000007fef8eac83f mscorwks!MethodDesc::CallDescr+0x24f
11:U 0000000023a9d790 000007fef8f0cbd2 mscorwks!ExecuteCodeWithGuaranteedCleanupHelper+0x12a
12:U 0000000023a9da20 000007fef945e572 mscorwks!ReflectionInvocation::ExecuteCodeWithGuaranteedCleanup+0x172
13:M 0000000023a9dc30 000007fef7261722 System.Threading.ExecutionContext.Run(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object)(+0x60 IL)(+0x51 Native)
14:M 0000000023a9dc80 000007fef72f95fd System.Threading.ThreadHelper.ThreadStart()(+0x8 IL)(+0x2a Native)
15:U 0000000023a9dcd0 000007fef9001552 mscorwks!CallDescrWorker+0x82
16:U 0000000023a9dd20 000007fef8e9e5e3 mscorwks!CallDescrWorkerWithHandler+0xd3
17:U 0000000023a9ddc0 000007fef8eac83f mscorwks!MethodDesc::CallDescr+0x24f
18:U 0000000023a9e010 000007fef8f9ae8d mscorwks!ThreadNative::KickOffThread_Worker+0x191
19:U 0000000023a9e330 000007fef8f59374 mscorwks!TypeHandle::GetParent+0x5c
1a:U 0000000023a9e380 000007fef8e52045 mscorwks!SVR::gc_heap::make_heap_segment+0x155
1b:U 0000000023a9e450 000007fef8f66139 mscorwks!ZapStubPrecode::GetType+0x39
1c:U 0000000023a9e490 000007fef8e1c985 mscorwks!ILCodeStream::GetToken+0x25
1d:U 0000000023a9e4c0 000007fef8f594e1 mscorwks!Thread::DoADCallBack+0x145
1e:U 0000000023a9e630 000007fef8f59399 mscorwks!TypeHandle::GetParent+0x81
1f:U 0000000023a9e680 000007fef8e52045 mscorwks!SVR::gc_heap::make_heap_segment+0x155
20:U 0000000023a9e750 000007fef8f66139 mscorwks!ZapStubPrecode::GetType+0x39
21:U 0000000023a9e790 000007fef8e20e15 mscorwks!ThreadNative::KickOffThread+0x401
22:U 0000000023a9e7f0 000007fef8e20ae7 mscorwks!ThreadNative::KickOffThread+0xd3
23:U 0000000023a9e8d0 000007fef8f814fc mscorwks!Thread::intermediateThreadProc+0x78
24:U 0000000023a9f7a0 00000000773cbe3d kernel32!BaseThreadInitThunk+0xd
25:U 0000000023a9f7d0 00000000775d6a51 ntdll!RtlUserThreadStart+0x1d
Yes, the caching block has some - issues - with regard to the scavenger threads in older versions of Entlib, particularly if things are coming in faster than the scavenging settings let them come out.
This was completely rewritten in Entlib 5, so that now you'll never have more than two threads sitting in the caching block, regardless of the load, and usually it'll only be one.
Unfortunately there's no easy tweak to change the behavior in earlier versions. The best you can do is change the cache settings so that each scavenge will clean out more items at a time so not as many scavenge requests need to get scheduled.
640 threads is very bad for performance. If they are all waiting for something, then I'd say it's a fair bet that you have a deadlock and they will never exit. If they are all running (not waiting)... well, with 600+ threads on a 2 or 4 core processor none of them will get enough time slices to run very far! ;>
If your app is set up with a main thread that waits on the thread handles to find out when the threads exit, and the background threads get caught up in a loop or in a wait state and never exit the thread proc, then the process and all of its threads will never exit.
Check your thread code to make sure that every threadproc has a clear path to exit the threadproc. It's bad form to write an infinite loop in a background thread on the assumption that the thread will be forcibly terminated when the process shuts down.
If the background thread code spins in a loop waiting for an event handle to signal, make sure that you have some way to signal that event so that the thread can perform a normal orderly exit. Otherwise, you need to write the background thread to wait on multiple events and unblock when any one of the events signals. One of those events can be the activity that the background thread is primarily interested in and the other can be a shutdown event.
From the names of things in the stack dump you posted, it would appear that the thread is waiting for something to appear in the ProducerConsumerQueue. Investigate how that queue object is supposed to be shut down, probably on the producer side, and whether shutting down the queue will automatically release all consumers that are waiting on that queue.
My guess is that either the queue is not being shut down correctly or shutting it down does not implicitly release the consumers that are waiting on it. If the latter case, you may need to pump a terminate message through the queue to wake up all the consumers waiting on that queue and tell them to break out of their wait loop and exit.
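The "terminate message" approach looks like this in outline. This is a generic Python sketch of the sentinel pattern using queue.Queue, not the Enterprise Library API; _STOP, consumer and run_and_shutdown are illustrative names.

```python
import queue
import threading

_STOP = object()  # sentinel: the "terminate" message pumped through the queue


def consumer(work_queue, handled):
    """Block on the queue until work arrives; exit cleanly on the sentinel."""
    while True:
        item = work_queue.get()   # blocks while the queue is empty
        if item is _STOP:
            break                 # orderly exit path out of the wait loop
        handled.append(item)


def run_and_shutdown(items, n_consumers=4):
    """Process items with n_consumers threads, then shut all of them down."""
    work_queue = queue.Queue()
    handled = []
    threads = [
        threading.Thread(target=consumer, args=(work_queue, handled))
        for _ in range(n_consumers)
    ]
    for t in threads:
        t.start()
    for item in items:
        work_queue.put(item)
    for _ in threads:
        work_queue.put(_STOP)     # one sentinel per consumer wakes them all
    for t in threads:
        t.join(timeout=5)
    return handled, [t.is_alive() for t in threads]


if __name__ == "__main__":
    handled, alive = run_and_shutdown(range(10))
    print(sorted(handled), "threads still alive:", any(alive))
```

Each consumer takes exactly one sentinel and exits, so no thread is left parked on the queue after shutdown.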
You have a major issue. Every thread occupies 1 MB of stack (so 640 threads is roughly 640 MB of stack), and there is a significant cost to context-switching each thread in and out. It gets worse with managed code: every time the GC runs, it has to walk each thread's stack to look for roots, and when those stacks have been paged out to disk, reading them back is expensive, which adds to the performance problem.
Creating threads is bad unless you know what you are doing. Jeffrey Richter has written about this in detail.
To solve the issue, I would look at what these threads are blocked on and also put a breakpoint on thread creation (for example, sxe ct in windbg).
Later, rearchitect to avoid creating threads and use the thread pool instead.
It would have been nice to see some call stacks of these threads.
In Microsoft Enterprise Library 4.1, the BackgroundScheduler class creates a new thread each time an object is instantiated. It will be fixed in version 5.0. I do not know enough of this Microsoft Library to advise you how to avoid that behavior, but you may try the beta version: http://entlib.codeplex.com/wikipage?title=EntLib5%20Beta2