Delphi PPL Task Priority - multithreading

I tried to google this, but was not able to find an example. I also tried searching the PPL library itself for Priority, but only found one commented-out line about it :)
My "problem" is: I have 4 threads, and I run these on a 4-core CPU. For a brief moment I peak at 100%; the entire process takes no longer than 20 seconds.
Is there a way to set the priority for the threads?
My reason is that I have MSSQL installed on this computer, and I am not 100% sure whether the SQL Server gets slower while my threads peak at 100% for 20 seconds.
Thank you.

Never mind; it seems it is self-tuning.
I just downloaded CPU-Z and started the stress test, and my own threads automatically slowed down. For my current situation it works as expected.
Thank you.
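For reference, if explicit control is ever needed: as far as the PPL source shows (the commented-out line mentioned above), there is no per-task priority knob, but the general idea of lowering the busy process's scheduling priority so that SQL Server wins contested CPU time can be illustrated in any language. A minimal sketch in Python using the third-party psutil package, purely as a cross-language illustration, not the Delphi API:

import os
import psutil  # third-party: pip install psutil

def lower_my_priority():
    # Lower this process's scheduling priority so that normal-priority
    # processes (such as SQL Server) win any contested CPU time.
    me = psutil.Process(os.getpid())
    if psutil.WINDOWS:
        me.nice(psutil.BELOW_NORMAL_PRIORITY_CLASS)
    else:
        me.nice(10)  # positive nice value = lower priority on Unix

The effect only matters under contention: when nothing else wants the CPU, a below-normal process still gets 100%.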

Related

How can I make FutureProducer perform at least near the performance of ThreadedProducer in rust rdkafka?

I'm just playing around with the examples, and I tried to use FutureProducer with tokio::spawn, and I'm getting about 11 ms per produce:
1000 messages in 11000ms (11 seconds).
Meanwhile, ThreadedProducer produced 1,000,000 (1 million) messages in about 4.5 seconds (dev) and 2.6 seconds (--release)!!! This is an insane difference between the two; maybe I missed something, or I'm not doing something right.
Why use FutureProducer if such a big speed difference exists?
Maybe someone can shed some light to help me understand and learn about FutureProducer.
The Kafka topic name is "my-topic" and it has 3 partitions.
Maybe my code is not written in a suitable way (for FutureProducer); I need to produce 1,000,000 messages in less than 10 seconds using FutureProducer.
My attempts are written in the following gists (I updated this question to add new gists).
Note:
After I wrote my question, I tried to solve my issue with different ideas until I succeeded on the 7th attempt:
1- spawn blocking:
https://gist.github.com/arkanmgerges/cf1e43ce0b819ebdd1b383d6b51bb049
2- threaded producer
https://gist.github.com/arkanmgerges/15011348ef3f169226f9a47db78c48bd
3- future producer
https://gist.github.com/arkanmgerges/181623f380d05d07086398385609e82e
4- os threads with base producer
https://gist.github.com/arkanmgerges/1e953207d5a46d15754d58f17f573914
5- os thread with future producer
https://gist.github.com/arkanmgerges/2f0bb4ac67d91af0d8519e262caed52d
6- os thread with spawned tokio tasks for the future producer
https://gist.github.com/arkanmgerges/7c696fef6b397b9235564f1266443726
7- tokio multithreading using #[tokio::main] with FutureProducer
https://gist.github.com/arkanmgerges/24e1a1831d62f9c5e079ee06e96a6329
In my 5th example, I needed to use OS threads (thanks to the discussion with @BlackBeans), and inside each OS thread I used a tokio runtime with 4 worker threads, which blocks in its OS thread.
The example used 100 OS threads, each with a tokio runtime of 4 worker threads.
Each OS thread produces 10,000 messages.
The code is not optimized, and I ran it as a dev build.
My 7th attempt is a new example in which I used #[tokio::main], which by default uses block_on; when I spawn a new task, it can be put on another OS thread (I made a separate test to check this using #[tokio::main]) under the main scheduler (inside block_on). It produced 1 million messages in 2.93 seconds (dev build) and 2.29 seconds (release build).
I think I went through a similar journey: starting with FutureProducer because it seemed like a good place to start, totally terrible performance. Switching to ThreadedProducer: very fast.
I know Kafka quite well, but I'm a noob at Rust. FutureProducer is broken, as far as I can see, in that every await you call will flush and wait for a confirmation.
That is simply not how Kafka is intended to be used; what makes Kafka fast is that you can keep pumping messages and only occasionally, and asynchronously, get acks for the current offsets.
I like how you managed to improve throughput by using many threads, but that is more complex than it should be, and I suppose also much more demanding on both the broker and the client.
If there were at least a batch variant, the performance would be bearable, but as I see it now it is suitable for low volume only.
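For what it's worth, here is a minimal sketch of that "keep pumping, collect acks asynchronously" pattern. It is in Python with confluent-kafka (which wraps the same librdkafka as rust-rdkafka), not Rust, so take it purely as an illustration of the pattern; the broker address is a placeholder, and the topic name comes from the question above.

import time
from confluent_kafka import Producer

producer = Producer({"bootstrap.servers": "localhost:9092"})  # placeholder broker

def on_delivery(err, msg):
    # Called from poll()/flush(); acks arrive asynchronously, in batches.
    if err is not None:
        print(f"delivery failed: {err}")

start = time.perf_counter()
for i in range(1_000_000):
    payload = f"message {i}".encode()
    try:
        # produce() only enqueues into librdkafka's local buffer; it does not wait.
        producer.produce("my-topic", value=payload, on_delivery=on_delivery)
    except BufferError:
        # Local queue is full: let librdkafka drain for a moment, then retry once.
        producer.poll(0.1)
        producer.produce("my-topic", value=payload, on_delivery=on_delivery)
    producer.poll(0)  # service pending delivery callbacks without blocking

producer.flush()  # wait for all outstanding acks once, at the very end
print(f"done in {time.perf_counter() - start:.2f}s")

The point is that there is exactly one blocking wait, at the very end; everything before it just feeds the background sender, which is how BaseProducer/ThreadedProducer get their speed.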
Did you have any insights since you tried this?

Sleep() Methods and OS - Scheduler (Camunda/Groovy)

I've got a question for you guys, and it's not as specific as usual, which could make it a little annoying to answer.
The tool I'm working with is Camunda in combination with Groovy scripts, and the goal is to reduce the maximum CPU load (or peak load). I'm doing this by "stretching" the workload over a certain time frame, since the platform seems to be unhappy with huge workload inputs in a short amount of time. The resulting problem is that Camunda won't react smoothly when someone tries to operate it at the UI level.
So I wrote a small script which basically just lets each individual process determine its own "time to sleep" before running, if a certain threshold is exceeded. This is based on how many processes are trying to run at the same time as that process.
It looks like:
Process wants to start -> Process asks how many other processes are running ->
waitingTime = numberOfProcesses * timeToSleep * iterationOfMeasures
[Figure: CPU usage curves 1 and 3 without the script; curves 2 and 4 with the script]
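As a rough sketch of what that looks like in code (shown here in Python rather than Groovy; the threshold, base delay, and the process-count lookup are made-up placeholders):

import time

TIME_TO_SLEEP = 0.5        # base delay per competing process (placeholder value)
THRESHOLD = 4              # only throttle above this many concurrent processes

def wait_before_running(number_of_processes: int, iteration_of_measures: int) -> None:
    # Sleep proportionally to the current contention before doing the real work.
    if number_of_processes > THRESHOLD:
        waiting_time = number_of_processes * TIME_TO_SLEEP * iteration_of_measures
        time.sleep(waiting_time)

# Hypothetical usage: count_running_processes() would query the engine.
# wait_before_running(count_running_processes(), iteration)
# run_process()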
Testing it, I saw that I could stretch the workload and smooth out the UI behavior. But now I need to describe exactly why this works.
The Questions are:
What does a sleep method do, exactly?
What does the sleep method do at the CPU level?
How does an OS scheduler react to a sleep method?
Namely: does the scheduler reschedule, or does it simply "wait" for the given time?
How can I recreate and test the questions given above?
The main goal is not for you to answer this, but could you give me a hint for finding the right literature to answer these questions? Maybe you remember a book which helped you understand this kind of thing, or a professor recommended something to you. (Mine won't answer, and I can't blame him.)
I'm grateful for hints and/or recommendations!
I'm sure you could use a timer event:
https://docs.camunda.org/manual/7.15/reference/bpmn20/events/timer-events/
It allows you to postpone the next task trigger for some time, defined by an expression.
About sleep in Java/Groovy: https://www.javamex.com/tutorials/threads/sleep.shtml
Using sleep blocks the current thread in Groovy/Java/Camunda, so instead of doing something useful, the thread is just blocked.
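To recreate and test the scheduler questions yourself, one simple experiment (a Python sketch; the same idea works with Thread.sleep in Groovy/Java) is to measure how much later than requested a sleeping thread actually wakes up. sleep marks the thread as not runnable and hands the CPU back to the scheduler, which only resumes the thread at the next scheduling opportunity after the timeout, so the measured delay is always at least the requested one:

import time

requested = 0.005  # ask for 5 ms of sleep
overshoots = []
for _ in range(200):
    start = time.perf_counter()
    time.sleep(requested)
    elapsed = time.perf_counter() - start
    overshoots.append(elapsed - requested)  # how late the scheduler woke us up

print(f"avg overshoot: {sum(overshoots) / len(overshoots) * 1e3:.3f} ms")
print(f"max overshoot: {max(overshoots) * 1e3:.3f} ms")

Running this on an idle versus a loaded machine makes the scheduler's influence visible: under load the overshoot grows, because other runnable tasks get the CPU first.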

PowerShell Foreach -Parallel not reaching throttle limit

I have a script using workflows and Foreach -Parallel -ThrottleLimit 15, which works fine if the code inside the loop is something simple like Write-Output or some basic math and a sleep statement, but when I swap in the actual logic (updating a product for a particular client) it caps at 5 running simultaneously.
Unfortunately, I can't post the full source, but at a high level the update process reads from some config files occasionally, writes to log files (separate logs for each thread), and reads/writes to a couple of databases; nothing that causes CPU/RAM/IO to max out, though, and to the best of our knowledge there is no contention going on between threads over these resources.
Things we've looked into:
PowerShell 3 had a hard limit of 5 (we are using PowerShell 5 and are able to run more than 5 at a time when replacing the actual work with simple math and a Write-Output)
CPU/RAM/disk IO all appear to be pretty stable (not maxed out or spiking a lot)
Looked for MS documentation on this parameter (not much is available, and it doesn't mention any strange behavior that I've seen)
I'm at a loss as to where else to look to troubleshoot this. At first we thought maybe the ThrottleLimit was a maximum and the actual number was dynamic based on resources, but our limited testing seemed to indicate that it uses the full 15 even when CPU and memory are pretty high. Also, the very first iteration is capped at 5, long before any resources are actually being used heavily (there is a Start-Sleep of 15 minutes right away in the update script, as we give our users a warning to finish what they are doing before we continue to take down the DB for the update).
Does anyone have any insight into what else we can check or look into that might cause this? Our previous parallel logic used Jobs, and I wanted to avoid returning to that when this seems like it should work perfectly fine.

Do not use all MATLAB pool workers

I have set up a local MATLAB (R2015b) pool of workers according to my CPU configuration (quad-core with multithreading => 8 workers in total).
I have simulations that last 24h, but I want to be able to use my computer at the same time. Therefore, I limit myself to 4 simulations a day (sent via batch) so that I can keep working at the same time.
My question is this: how can I queue several jobs without eating up all 8 workers? A related question: if I reduce the size of the pool to 4 workers, will I still be able to run MATLAB smoothly?
Thank you very much for your answer.
I would say that the best solution to your problem is to do it via bash instead of MATLAB. In bash you have a command called nice, which allows you to deprioritize the simulation. This means that if you are using the computer, you will get the power, and if you are not using it, the power goes to the computations.
Regarding the second part of your question: the easiest way to queue all the jobs is to make a bash script, something like the following:
# Run each matching MATLAB script in turn, at low priority (nice -n 10).
# Note the quoted pattern, so the shell does not expand the * before find sees it.
for f in $(find . -name 'name_of_matlab_script*')
do
    nice -n 10 matlab -nodisplay < "$f"
done
where the MATLAB scripts are all named with the same base, and the star (*) will take care of the rest. It will then run the scripts one after another, while giving priority to whatever else you use your computer for.
If you want more advanced scheduling software, I normally use Slurm.
Regarding 4 workers instead of 8: as Ander Biguri says in the comments, use as few as possible, as long as you do not add too much extra time.

Puzzled by the cpu-shares setting on Docker

I wrote a test program 'cputest.py' in Python like this:
import time

while True:
    # busy-spin for a while...
    for _ in range(10120 * 40):
        pass
    # ...then briefly yield the CPU
    time.sleep(0.008)
This costs about 80% CPU when running in a container (without interference from other running containers).
Then I ran this program in two containers using the following two commands:
docker run -d -c 256 --cpuset=1 IMAGENAME python /cputest.py
docker run -d -c 1024 --cpuset=1 IMAGENAME python /cputest.py
and used 'top' to view their CPU costs. It turned out that they cost about 30% and 67% CPU respectively. I'm pretty puzzled by this result. Would anyone kindly explain it for me? Many thanks!
I sat down last night and tried to figure this out on my own, but ended up not being able to explain the 70 / 30 split either. So, I sent an email to some other devs and got this response, which I think makes sense:
I think you are slightly misunderstanding how task scheduling works - which is why the maths doesn't work. I'll try and dig out a good article, but at a basic level the kernel assigns slices of time to each task that needs to execute, and allocates slices to tasks according to the given priorities.
So with those priorities and tight-looped code (no sleep), the kernel assigns 4/5 of the slots to A and 1/5 to B, because cpu-shares are relative weights: 1024 / (1024 + 256) = 4/5 and 256 / (1024 + 256) = 1/5. Hence the 80/20 split.
However, when you add in sleep it gets more complex. Sleep basically tells the kernel to yield the current task, and execution will return to that task after the sleep time has elapsed. It could be longer than the time given, especially if there are higher-priority tasks running. When nothing else is running, the kernel just sits idle for the sleep time.
But when you have two tasks, the sleeps allow them to interleave. So when one sleeps, the other can execute. This likely leads to a complex execution pattern which you can't model with simple maths. Feel free to prove me wrong there!
I think another reason for the 70/30 split is the way you are producing "80% load". The numbers you have chosen for the loop and the sleep just happen to work on your PC with a single task executing. You could try making the loop based on elapsed time instead: loop for 0.8 s, then sleep for 0.2 s. That might give you something closer to 80/20, but I don't know.
So, in essence, your time.sleep() call is skewing your expected numbers; removing the time.sleep() brings the CPU load far closer to what you'd expect.
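For completeness, a minimal sketch of that elapsed-time variant (the 0.8 s busy / 0.2 s sleep split is the example from the email above; the values are otherwise arbitrary):

import time

BUSY = 0.8   # busy-spin for 0.8 s per cycle...
IDLE = 0.2   # ...then sleep for 0.2 s, i.e. an 80% duty cycle

while True:
    deadline = time.perf_counter() + BUSY
    while time.perf_counter() < deadline:
        pass              # burn CPU until the busy window elapses
    time.sleep(IDLE)      # yield the CPU for the idle window

Unlike the fixed-iteration loop, the busy window here is defined by wall-clock time rather than by how fast the machine happens to execute the loop body, which is what the suggestion above is getting at.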
