linux RT scheduling - multithreading

Our product runs Linux 2.6.32, and we have some user-space processes that run periodically - 'keep-alive'-like processes. We don't place hard requirements on them - they just need to run once every few seconds and refresh a watchdog.
We gave these processes a scheduling class of SCHED_RR or SCHED_FIFO with maximum priority, and yet we see many false positives - it seems they don't get the CPU for several seconds at a time.
I find this very odd because I know that Linux, while not being an RT OS, can still deliver very good latency (I see people quoting figures on the order of a few milliseconds) - yet I can't even get the process to run once in 5 seconds.
The logic of the Linux RT scheduler seems very straightforward and simple, so I suspected that these processes were being blocked by something else - I/O contention, interrupts or kernel threads taking too long - but now I'm not so sure:
I wrote a really basic program to simulate such a process - it wakes up every second and measures the time spent since the last time it finished running. As far as I understand, the measurement doesn't include blocking on any I/O, so the results printed by this process reflect the behavior of the scheduler:
#include <sched.h>
#include <stdio.h>
#include <sys/time.h>
#include <sys/param.h>
#include <time.h>

#define MICROSECONDSINASEC 1000000
#define MILLISECONDSINASEC 1000

int main()
{
    struct sched_param schedParam;
    struct timeval now, start;
    int spent_time = 0;
    time_t current_time;

    /* Request SCHED_RR at the maximum real-time priority. */
    schedParam.sched_priority = sched_get_priority_max(SCHED_RR);
    int retVal = sched_setscheduler(0, SCHED_RR, &schedParam);
    if (retVal != 0)
    {
        printf("failed setting RT sched\n");
        return 0;
    }

    gettimeofday(&start, 0);
    /* Borrow one second into the microsecond field. */
    start.tv_sec -= 1;
    start.tv_usec += MICROSECONDSINASEC;

    while (1)
    {
        sleep(1);
        gettimeofday(&now, 0);
        now.tv_sec -= 1;
        now.tv_usec += MICROSECONDSINASEC;
        spent_time = MILLISECONDSINASEC * (now.tv_sec - start.tv_sec) + ((now.tv_usec - start.tv_usec) / MILLISECONDSINASEC);

        FILE *fl = fopen("output_log.txt", "a"); /* append mode ("aw" is not a valid mode string) */
        if (fl != NULL)
        {
            if (spent_time > 1100)
            {
                time(&current_time);
                fprintf(fl, "\n (%s) - had a gap of %d [msec] instead of 1000\n", ctime(&current_time), spent_time);
            }
            fclose(fl);
        }
        gettimeofday(&start, 0);
    }
    return 0;
}
I ran this process overnight on several machines - including ones that don't run our product (just plain Linux) - and I still saw gaps of several seconds, even though I made sure the process DID get the RT priority. I can't figure out why - technically this process should preempt any other running process, so how can it wait so long to run?
A few notes:
I ran these processes mostly on virtual machines, so the hypervisor may be interfering. But I have seen this behavior on physical machines in the past as well.
Making the process RT did improve the results drastically, but did not eliminate the problem entirely.
There are no other RT processes running on the machine except for the kernel's migration and watchdog threads (which I don't believe can starve my processes).
What can I do? I feel like I'm missing something very basic here.
thanks!

Related

mmap: performance when using multithreading

I have a program which performs some operations on a lot of files (> 10 000). It spawns N worker threads and each thread mmaps some file, does some work and munmaps it.
The problem I am facing right now is that whenever I use just 1 process with N worker threads, it has worse performance than spawning 2 processes, each with N/2 worker threads. I can see this in iotop: 1 process + N threads uses only around 75% of the disk bandwidth, whereas 2 processes + N/2 threads use the full bandwidth.
Some notes:
This happens only if I use mmap()/munmap(). I have tried replacing it with fopen()/fread() and it worked just fine. But since the mmap()/munmap() calls come from a 3rd-party library, I would like to use them in their original form.
madvise() is called with MADV_SEQUENTIAL, but removing it or changing the advice argument doesn't seem to change anything (or it just slows things down).
Thread affinity doesn't seem to matter. I have tried limiting each thread to a specific core. I have also tried limiting threads to core pairs (Hyper-Threading). No results so far.
The load reported by htop seems to be the same in both cases.
So my questions are:
Is there anything about mmap() I am not aware of when used in multithreaded environment?
If so, why do 2 processes have better performance?
EDIT:
As pointed out in the comments, it is running on a server with 2 CPUs (two sockets). I should probably try to set thread affinities so that the work always stays on the same CPU, but I think I already tried that and it didn't work.
Here is a piece of code with which I can reproduce the same issue as with my production software.
#include <condition_variable>
#include <cstdint>
#include <deque>
#include <filesystem>
#include <iostream>
#include <mutex>
#include <string>
#include <thread>
#include <utility>
#include <vector>

#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

#ifndef WORKERS
#define WORKERS 16
#endif

bool stop = false;
std::mutex queue_mutex;
std::condition_variable queue_cv;

// Map a whole file read-only; returns the mapping and its size (or {nullptr, 0}).
std::pair<const std::uint8_t*, std::size_t> map_file(const std::string& file_path)
{
    int fd = open(file_path.data(), O_RDONLY);
    if (fd != -1)
    {
        auto dir_ent = std::filesystem::directory_entry{file_path.data()};
        if (dir_ent.is_regular_file())
        {
            auto size = dir_ent.file_size();
            auto data = mmap(nullptr, size, PROT_READ, MAP_PRIVATE, fd, 0);
            madvise(data, size, MADV_SEQUENTIAL);
            close(fd);
            return { reinterpret_cast<const std::uint8_t*>(data), size };
        }
        close(fd);
    }
    return { nullptr, 0 };
}

void unmap_file(const std::uint8_t* data, std::size_t size)
{
    munmap((void*)data, size);
}

int main(int argc, char* argv[])
{
    std::deque<std::string> queue;
    std::vector<std::thread> threads;

    // Worker threads: pop a path from the queue, map the file, XOR all its bytes, unmap.
    for (std::size_t i = 0; i < WORKERS; ++i)
    {
        threads.emplace_back(
            [&]() {
                std::string path;
                while (true)
                {
                    {
                        std::unique_lock<std::mutex> lock(queue_mutex);
                        while (!stop && queue.empty())
                            queue_cv.wait(lock);
                        if (stop && queue.empty())
                            return;
                        path = queue.front();
                        queue.pop_front();
                    }
                    auto [data, size] = map_file(path);
                    std::uint8_t b = 0;
                    for (auto itr = data; itr < data + size; ++itr)
                        b ^= *itr;
                    unmap_file(data, size);
                    std::cout << (int)b << std::endl;
                }
            }
        );
    }

    // Producer: walk the directory tree and queue every regular file.
    for (auto& p : std::filesystem::recursive_directory_iterator{argv[1]})
    {
        std::unique_lock<std::mutex> lock(queue_mutex);
        if (p.is_regular_file())
        {
            queue.push_back(p.path().native());
            queue_cv.notify_one();
        }
    }

    stop = true;
    queue_cv.notify_all();
    for (auto& t : threads)
        t.join();
    return 0;
}
Is there anything about mmap() I am not aware of when used in multithreaded environment?
Yes. mmap() requires significant virtual memory manipulation - effectively single-threading your process in places. Per this post from one Linus Torvalds:
... playing games with the virtual memory mapping is very expensive
in itself. It has a number of quite real disadvantages that people tend
to ignore because memory copying is seen as something very slow, and
sometimes optimizing that copy away is seen as an obvious improvement.
Downsides to mmap:
quite noticeable setup and teardown costs. And I mean noticeable.
It's things like following the page tables to unmap everything
cleanly. It's the book-keeping for maintaining a list of all the
mappings. It's the TLB flush needed after unmapping stuff.
page faulting is expensive. That's how the mapping gets populated,
and it's quite slow.
Note that much of the above also has to be single-threaded across the entire machine, such as the actual mapping of physical memory.
So the virtual memory manipulation that mapping files requires is not only expensive, it also can't really be done in parallel - there's only one chunk of actual physical memory the kernel has to keep track of, and multiple threads can't parallelize changes to a process's virtual address space.
You'd almost certainly get better performance by reusing a memory buffer for each file, where each buffer is created once and is large enough to hold any file read into it, and then reading the file with low-level POSIX read() call(s). You might also want to experiment with page-aligned buffers and direct IO, calling open() with the (Linux-specific) O_DIRECT flag to bypass the page cache - since you apparently never re-read any data, any caching is a waste of memory and CPU cycles.
Reusing the buffer also completely eliminates any munmap() or delete/free().
You'd have to manage the buffers, though. Perhaps prepopulating a queue with N precreated buffers, and returning a buffer to the queue when done with a file?
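For illustration, here is a minimal C sketch of that buffer-reuse approach (BUF_SIZE and the XOR "work" are placeholders standing in for your real processing; the worker-thread plumbing from your reproduction code is omitted, the idea being one such buffer per worker thread):
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define BUF_SIZE (64 * 1024 * 1024)  /* assumed to be at least as large as the biggest file */

/* Placeholder for the real per-file work: XOR all bytes, like the reproduction code. */
static unsigned char process_buffer(const unsigned char *buf, size_t len)
{
    unsigned char b = 0;
    for (size_t i = 0; i < len; i++)
        b ^= buf[i];
    return b;
}

/* Read one file into a preallocated, reused buffer with plain read() calls.
 * To try direct IO, add O_DIRECT to the open() flags - that requires the buffer,
 * offsets and sizes to be suitably aligned (e.g. allocate with posix_memalign). */
static int process_file(const char *path, unsigned char *buf)
{
    int fd = open(path, O_RDONLY);
    if (fd == -1)
        return -1;

    ssize_t total = 0;
    for (;;) {
        ssize_t n = read(fd, buf + total, BUF_SIZE - total);
        if (n <= 0)              /* 0 = EOF, -1 = error */
            break;
        total += n;
    }
    close(fd);

    if (total > 0)
        printf("%d\n", process_buffer(buf, (size_t)total));
    return 0;
}

int main(int argc, char *argv[])
{
    /* One buffer, allocated once and reused for every file. */
    unsigned char *buf = malloc(BUF_SIZE);
    if (buf == NULL)
        return 1;
    for (int i = 1; i < argc; i++)
        process_file(argv[i], buf);
    free(buf);
    return 0;
}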
As far as
If so, why do 2 processes have better performance?
The use of two processes splits the process-specific virtual memory manipulations caused by mmap() calls into two separable sets that can run in parallel.
A few notes:
Try running your application with perf stat -ddd <app> and have a look at context-switches, cpu-migrations and page-faults numbers.
The threads probably contend for vm_area_struct in the kernel process structure on mmap and page faults. Try passing the MAP_POPULATE or MAP_LOCKED flag into mmap to minimize page faults (a sketch follows after these notes). Alternatively, try mmap with the MAP_POPULATE or MAP_LOCKED flag in the main thread only (you may like to ensure that all threads run on the same NUMA node in this case).
You may also like to experiment with MAP_HUGETLB and one of MAP_HUGE_2MB, MAP_HUGE_1GB flags.
Try binding threads to the same NUMA node with numactl to make sure that threads only access local NUMA memory. E.g. numactl --membind=0 --cpunodebind=0 <app>.
Lock the mutex before stop = true, otherwise the condition variable notification can get lost and deadlock the waiting thread forever.
p.is_regular_file() check doesn't require the mutex to be locked.
std::deque can be replaced with std::list and use splice to push and pop elements to minimize the time the mutex is locked.
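As a rough illustration of the MAP_POPULATE note above, the question's map_file could be rendered in C along these lines - a sketch only; whether pre-faulting actually helps your workload is exactly what perf stat should tell you:
#include <fcntl.h>
#include <stdint.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Map a whole file read-only and pre-fault its pages with MAP_POPULATE, so the
 * page faults are paid up front in one mmap() call instead of being scattered
 * across the sequential scan. Returns NULL (and size 0) on any failure. */
static const uint8_t *map_file_populate(const char *path, size_t *size_out)
{
    *size_out = 0;

    int fd = open(path, O_RDONLY);
    if (fd == -1)
        return NULL;

    struct stat st;
    if (fstat(fd, &st) == -1 || !S_ISREG(st.st_mode) || st.st_size == 0) {
        close(fd);
        return NULL;
    }

    void *data = mmap(NULL, (size_t)st.st_size, PROT_READ,
                      MAP_PRIVATE | MAP_POPULATE, fd, 0);
    close(fd);                       /* the mapping stays valid after close() */
    if (data == MAP_FAILED)
        return NULL;

    *size_out = (size_t)st.st_size;
    return (const uint8_t *)data;
}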

High availability computing: How to deal with a non-returning system call, without risking false positives?

I have a process that's running on a Linux computer as part of a high-availability system. The process has a main thread that receives requests from the other computers on the network and responds to them. There is also a heartbeat thread that sends out multicast heartbeat packets periodically, to let the other processes on the network know that this process is still alive and available -- if they don't hear any heartbeat packets from it for a while, one of them will assume this process has died and will take over its duties, so that the system as a whole can continue to work.
This all works pretty well, but the other day the entire system failed, and when I investigated why I found the following:
Due to (what is apparently) a bug in the box's Linux kernel, there was a kernel "oops" induced by a system call that this process's main thread made.
Because of the kernel "oops", the system call never returned, leaving the process's main thread permanently hung.
The heartbeat thread, OTOH, continued to operate correctly, which meant that the other nodes on the network never realized that this node had failed, and none of them stepped in to take over its duties... and so the requested tasks were not performed and the system's operation effectively halted.
My question is, is there an elegant solution that can handle this sort of failure? (Obviously one thing to do is fix the Linux kernel so it doesn't "oops", but given the complexity of the Linux kernel, it would be nice if my software could also handle other future kernel bugs more gracefully.)
One solution I don't like would be to put the heartbeat generator into the main thread, rather than running it as a separate thread - or to tie it to the main thread in some other way - so that if the main thread gets hung up indefinitely, heartbeats won't get sent. The reason I don't like this is that the main thread is not a real-time thread, so doing this would introduce the possibility of occasional false positives where a slow-to-complete operation is mistaken for a node failure. I'd like to avoid false positives if I can.
Ideally there would be some way to ensure that a failed syscall either returns an error code, or if that's not possible, crashes my process; either of those would halt the generation of heartbeat packets and allow a failover to proceed. Is there any way to do that, or does an unreliable kernel doom my user process to unreliability as well?
My second suggestion is to use ptrace to find the current instruction pointer. You can have a parent thread that ptraces your process and interrupts it every second to check the current RIP value. This is somewhat complex, so I've written a demonstration program: (x86_64 only, but that should be fixable by changing the register names.)
#define _GNU_SOURCE
#include <unistd.h>
#include <sched.h>
#include <stdlib.h>
#include <stdio.h>
#include <sys/syscall.h>
#include <sys/ptrace.h>
#include <sys/wait.h>
#include <sys/types.h>
#include <linux/ptrace.h>
#include <sys/user.h>
#include <time.h>

// this number is arbitrary - find a better one.
#define STACK_SIZE (1024 * 1024)

int main_thread(void *ptr) {
    // "main" thread is now running under the monitor
    printf("Hello from main!\n");
    while (1) {
        int c = getchar();
        if (c == EOF) { break; }
        nanosleep(&(struct timespec) {0, 200 * 1000 * 1000}, NULL);
        putchar(c);
    }
    return 0;
}

int main(int argc, char *argv[]) {
    void *vstack = malloc(STACK_SIZE);
    pid_t v;
    // you'll want to check these flags
    if (clone(main_thread, vstack + STACK_SIZE, CLONE_PARENT_SETTID | CLONE_FILES | CLONE_FS | CLONE_IO, NULL, &v) == -1) {
        perror("failed to spawn child task");
        return 3;
    }
    printf("Target: %d; %d\n", v, getpid());

    long ptv = ptrace(PTRACE_SEIZE, v, NULL, NULL);
    if (ptv == -1) {
        perror("failed monitor seize");
        exit(1);
    }

    struct user_regs_struct regs;
    fprintf(stderr, "beginning monitor...\n");
    while (1) {
        sleep(1);
        long ptv = ptrace(PTRACE_INTERRUPT, v, NULL, NULL);
        if (ptv == -1) {
            perror("failed to interrupt main thread");
            break;
        }
        int status;
        if (waitpid(v, &status, __WCLONE) == -1) {
            perror("target wait failed");
            break;
        }
        if (!WIFSTOPPED(status)) { // this section is messy. do it better.
            fputs("target wait went wrong", stderr);
            break;
        }
        if ((status >> 8) != (SIGTRAP | PTRACE_EVENT_STOP << 8)) {
            fputs("target wait went wrong (2)", stderr);
            break;
        }
        ptv = ptrace(PTRACE_GETREGS, v, NULL, &regs);
        if (ptv == -1) {
            perror("failed to peek at registers of thread");
            break;
        }
        fprintf(stderr, "%ld -> RIP %llx RSP %llx\n", (long)time(NULL),
                (unsigned long long)regs.rip, (unsigned long long)regs.rsp);
        ptv = ptrace(PTRACE_CONT, v, NULL, NULL);
        if (ptv == -1) {
            perror("failed to resume main thread");
            break;
        }
    }
    return 2;
}
Note that this is not production-quality code. You'll need to do a bunch of fixing things up.
Based on this, you should be able to figure out whether or not the program counter is advancing, and could combine this with other pieces of information (such as /proc/PID/status) to find if it's busy in a system call. You might also be able to extend the usage of ptrace to check what system calls are being used, so that you can check if it's a reasonable one to be waiting on.
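For instance, the monitor could sample /proc/<pid>/wchan (the name of the kernel function the task is currently blocked in) next to the RIP readings; a task that keeps reporting the same non-zero wchan across many samples is a good candidate for "stuck in a system call". A small sketch, assuming you already know the target PID:
#include <stdio.h>
#include <sys/types.h>

/* Read /proc/<pid>/wchan into buf: the kernel function the task is blocked in,
 * or "0" if it is currently runnable. Returns 0 on success, -1 on failure. */
static int read_wchan(pid_t pid, char *buf, size_t len)
{
    char path[64];
    snprintf(path, sizeof(path), "/proc/%d/wchan", (int)pid);

    FILE *f = fopen(path, "r");
    if (f == NULL)
        return -1;

    size_t n = fread(buf, 1, len - 1, f);
    buf[n] = '\0';
    fclose(f);
    return 0;
}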
This is a hacky solution, but I don't think that you'll find a non-hacky solution for this problem. Despite the hackiness, I don't think (this is untested) that it would be particularly slow; my implementation pauses the monitored thread once per second for a very short amount of time - which I would guess would be in the 100s of microseconds range. That's around 0.01% efficiency loss, theoretically.
I think you need a shared activity marker.
Have the main thread (or in a more general application, all worker threads) update the shared activity marker with the current time (or clock tick, e.g. by computing the "current" nanosecond from clock_gettime(CLOCK_MONOTONIC, ...)), and have the heartbeat thread periodically check when this activity marker was last updated, cancelling itself (and thus stopping the heartbeat broadcast) if there has not been any activity update within a reasonable time.
This scheme can easily be extended with a state flag if the workload is very sporadic. The main work thread sets the flag and updates the activity marker when it begins a unit of work, and clears the flag when the work has completed. If there is no work being done then the heartbeat is sent without checking the activity marker. If work is being done then the heartbeat is stopped if the time since the activity marker was updated exceeds the maximum processing time allowed for a unit of work. (Multiple worker threads each need their own activity marker and flag in this case, and the heartbeat thread can be designed to stop when any one worker thread gets stuck, or only when all worker threads get stuck, depending on their purposes and importance to the overall system).
(The activity marker value (and the work flag) will of course have to be protected by a mutex that must be acquired before reading or writing the value.)
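A minimal sketch of such an activity marker, using CLOCK_MONOTONIC and a mutex as described (the 5-second limit and the function names are placeholders you would adapt to your request loop):
#include <pthread.h>
#include <stdbool.h>
#include <time.h>

#define MAX_STALL_SECONDS 5   /* placeholder for the maximum allowed processing time */

static pthread_mutex_t marker_lock = PTHREAD_MUTEX_INITIALIZER;
static struct timespec last_activity;

/* Called by the main/worker thread at startup and at the start of each unit of work. */
void touch_activity_marker(void)
{
    pthread_mutex_lock(&marker_lock);
    clock_gettime(CLOCK_MONOTONIC, &last_activity);
    pthread_mutex_unlock(&marker_lock);
}

/* Called by the heartbeat thread before sending each heartbeat packet.
 * Returns true only if the marker was updated recently enough. */
bool worker_is_alive(void)
{
    struct timespec now, last;

    pthread_mutex_lock(&marker_lock);
    last = last_activity;
    pthread_mutex_unlock(&marker_lock);

    clock_gettime(CLOCK_MONOTONIC, &now);
    return (now.tv_sec - last.tv_sec) <= MAX_STALL_SECONDS;
}
The heartbeat loop then simply skips (or stops) its multicast send whenever worker_is_alive() returns false; the work-flag extension described above only adds a boolean next to the timestamp under the same mutex.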
Perhaps the heartbeat thread can also cause the whole process to commit suicide (e.g. kill(getpid(), SIGQUIT)) so that it can be restarted by having it be called in a loop in a wrapper script, especially if a process restart clears the condition in the kernel which would cause the problem in the first place.
One possible method would be to have another set of heartbeat messages from the main thread to the heartbeat thread. If it stops receiving messages for a certain amount of time, it stops sending them out as well. (And could try other recovery such as restarting the process.)
To solve the issue of the main thread actually just being in a long sleep, have a (properly-synchronized) flag that the heartbeat thread sets when it has decided that the main thread must have failed - and the main thread should check this flag at appropriate times (e.g. after the potential wait) to make sure that it hasn't been reported as dead. If it has, it stops running, because its job would have already been taken up by a different node.
The main thread can also send I-am-alive events to the heartbeat thread at other times than once around the loop - for example, if it's going into a long-running operation. Without this, there's no way to tell the difference between a failed main thread and a sleeping main thread.

why doesn't the System Monitor show correct CPU affinity?

I have searched for questions/answers on CPU affinity and read the results, but I still cannot get my threads pinned to a single CPU.
I am working on an application that will run on a dedicated Linux box, so I am not concerned about other processes, only my own. This app currently spawns off one pthread, and then the main thread enters a while loop to process control messages using POSIX message queues. This while loop blocks waiting for a control message to come in and then processes it, so the main thread is very simple and non-critical. My code is working very well: I can send this app messages and it processes them just fine. All control messages are very small in size and are used just to control the functionality of the application; that is, only a few control messages are ever sent/received.
Before I enter this while loop, I use sched_getaffinity() to log all of the CPUs available. Then I use sched_setaffinity() to set this process to a single CPU. Then I call sched_getaffinity() again to check if it is set to run on only one CPU and it is indeed correct.
The single pthread that was spawned off does a similar thing. The first thing I do in the newly created pthread is call pthread_getaffinity_np() and check the available CPUs, then call pthread_setaffinity_np() to set it to a different CPU then call pthread_getaffinity_np() to check if it is set as desired and it is indeed correct.
This is what is confusing. When I run the app and view the CPU History in System Monitor, I see no difference from when I run the app without any of this affinity code. The load still runs for a couple of seconds on each of the 4 CPUs of this quad-core box, so it appears that the scheduler is ignoring my affinity settings.
Am I wrong in expecting to see some proof that the main thread and the pthread are each actually running on their own single CPU? Or have I forgotten to do something more to get this to work as I intend?
Thanks,
-Andres
You have no answers yet, so I will give you what I can: some partial help.
Assuming you checked the return values from pthread_setaffinity_np:
How you assign your cpuset is very important: create it in the main thread, for what you want, and it will propagate to the threads created afterwards. Did you check the return codes?
The cpuset you actually get will be the intersection of hardware available cpus and the cpuset you define.
min.h in the code below is a generic build include file. You have to define _GNU_SOURCE - please note the comment on the last line of the code. CPU_ZERO, CPU_SET, CPU_ISSET and CPU_SETSIZE are macros; they come from <sched.h> when _GNU_SOURCE is defined.
#define _GNU_SOURCE
#include "min.h"        /* author's generic build include (stdio, stdlib, ...) */
#include <pthread.h>

int
main(int argc, char **argv)
{
    int s, j;
    cpu_set_t cpuset;
    pthread_t tid = pthread_self();

    // Set affinity mask to include CPUs 0 & 1
    CPU_ZERO(&cpuset);
    for (j = 0; j < 2; j++)
        CPU_SET(j, &cpuset);

    s = pthread_setaffinity_np(tid, sizeof(cpu_set_t), &cpuset);
    if (s != 0)
    {
        fprintf(stderr, "%d ", s);
        perror(" pthread_setaffinity_np");
        exit(1);
    }

    // let's see what we really have in the actual affinity mask assigned to our thread
    s = pthread_getaffinity_np(tid, sizeof(cpu_set_t), &cpuset);
    if (s != 0)
    {
        fprintf(stderr, "%d ", s);
        perror(" pthread_getaffinity_np");
        exit(1);
    }

    printf("my cpuset has:\n");
    for (j = 0; j < CPU_SETSIZE; j++)
        if (CPU_ISSET(j, &cpuset))
            printf(" CPU %d\n", j);

    // #Andres note: any pthread_create call from here on creates a thread with the identical
    // cpuset - you do not have to call it in every thread.
    return 0;
}
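If you want direct proof of where your threads actually run, rather than inferring it from the System Monitor graphs, each thread can report sched_getcpu() (a glibc extension, needs _GNU_SOURCE). A small sketch - return values left unchecked for brevity, and CPU 1 is assumed to exist:
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>
#include <unistd.h>

/* Worker that prints which CPU it is actually executing on. With the affinity
 * set in main() before pthread_create(), every line should show the same CPU. */
static void *worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < 5; i++) {
        printf("thread %lu running on CPU %d\n",
               (unsigned long)pthread_self(), sched_getcpu());
        sleep(1);
    }
    return NULL;
}

int main(void)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(1, &set);                          /* pin everything to CPU 1 */
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);

    pthread_t t;
    pthread_create(&t, NULL, worker, NULL);    /* inherits the cpuset set above */
    pthread_join(t, NULL);
    return 0;
}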

Parallel threads in Linux

#include <iostream>
#include <time.h>
#include <pthread.h>
using namespace std;

void* genFunc2(void* val)
{
    int i, j, k;
    for (i = 0; i < (1 << 15); i++)
    {
        clock_t t1 = clock();
        for (j = 0; j < (1 << 20); j++)
        {
            for (k = 0; k < (1 << 10); k++)
            {
            }
        }
        clock_t t2 = clock();
        cout << "t1:" << t1 << " t2:" << t2 << " t2-t1:" << (t2 - t1) / CLOCKS_PER_SEC << endl;
    }
    return NULL;
}

int main()
{
    cout << "begin" << endl;
    pthread_t ntid1; pthread_t ntid2; pthread_t ntid3; pthread_t ntid4;
    pthread_create(&ntid1, NULL, genFunc2, NULL);
    pthread_create(&ntid2, NULL, genFunc2, NULL);
    pthread_create(&ntid3, NULL, genFunc2, NULL);
    pthread_create(&ntid4, NULL, genFunc2, NULL);
    pthread_join(ntid1, NULL); pthread_join(ntid2, NULL);
    pthread_join(ntid3, NULL); pthread_join(ntid4, NULL);
    return 0;
}
My example is shown above. When I create just one thread, it prints each timing in about 2 seconds. However, when I create four threads, each thread only prints its result after about 15 seconds. Why?
This kind of algorithm can easily be parallelized using OpenMP, I suggest you check into it to simplify your code.
That being said, you use the clock() function to compute the execution time of your runs. This doesn't show the wall-clock time of your execution but the number of clock ticks your CPU was busy executing your program. This can be surprising because it may, for example, show 4 seconds while only 1 second has passed. That is perfectly logical on a 4-core machine: if the 4 cores were all 100% busy in your threads, you used 4 seconds of computing time (in core⋅seconds). This is because you divide by the CLOCKS_PER_SEC constant, which is correct only for a single core. Each of your cores runs at CLOCKS_PER_SEC, which explains most of the discrepancy between your experiments.
Furthermore, two notes to take into account with your code:
You should deactivate any kind of optimization (e.g. -O0 on gcc), otherwise your inner loops may be removed entirely, depending on the compiler and other circumstances such as parallelization.
If your computer only has two real cores with Hyper-Threading enabled (thus showing 4 cores to your OS), that may explain the remaining difference between your runs and my previous explanation.
To measure wall-clock time with high resolution, use clock_gettime(CLOCK_MONOTONIC, &timer), as explained in this answer.
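For example, a wall-clock version of the timing around the inner loops might look like this sketch (with glibc older than 2.17 you also need to link with -lrt):
#include <stdio.h>
#include <time.h>

/* Elapsed wall-clock time in seconds between two CLOCK_MONOTONIC samples,
 * independent of how many cores were busy in the meantime. */
static double elapsed_seconds(const struct timespec *start, const struct timespec *end)
{
    return (end->tv_sec - start->tv_sec) + (end->tv_nsec - start->tv_nsec) / 1e9;
}

int main(void)
{
    struct timespec t1, t2;

    clock_gettime(CLOCK_MONOTONIC, &t1);
    /* ... the work being measured goes here ... */
    clock_gettime(CLOCK_MONOTONIC, &t2);

    printf("wall time: %.3f s\n", elapsed_seconds(&t1, &t2));
    return 0;
}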

How to increase CPU frequency of newly spawned process

I've been working on a hobby project for a while (written in C), and it's still far from complete. It's very important that it be fast, so I recently decided to do some benchmarking to verify that my way of solving the problem wouldn't be inefficient.
$ time ./old
real 1m55.92
user 0m54.29
sys 0m33.24
I redesigned parts of the program to remove unnecessary operations and to reduce memory cache misses and branch mispredictions. The wonderful Callgrind tool was showing me more and more impressive numbers. Most of the benchmarking was done without forking external processes.
$ time ./old --dry-run
real 0m00.75
user 0m00.28
sys 0m00.24
$ time ./new --dry-run
real 0m00.15
user 0m00.12
sys 0m00.02
Clearly I was at least doing something right. Yet running the program for real told a different story.
$ time ./new
real 2m00.29
user 0m53.74
sys 0m36.22
As you might have noticed, the time is mostly dependent on the external processes. I don't know what caused the regression. There's nothing really weird about it; just a traditional vfork/execve/waitpid done by a single thread, running the same programs in the same order.
Something had to be causing forking to be slow, so I made a small test (similar to the one below) that would only spawn the new processes and have none of the overhead associated with my program. Obviously this had to be the fastest.
#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/wait.h>
#include <unistd.h>

int main(int argc, const char **argv)
{
    static const char *const _argv[] = {"/usr/bin/md5sum", "test.c", 0};

    /* Discard md5sum's output so it doesn't dominate the measurement. */
    int fd = open("/dev/null", O_WRONLY);
    dup2(fd, STDOUT_FILENO);
    close(fd);

    /* Spawn and wait for 100000 child processes, one at a time. */
    for (int i = 0; i < 100000; i++)
    {
        int pid = vfork();
        int status;
        if (!pid)
        {
            execve("/usr/bin/md5sum", (char *const *)_argv, environ);
            _exit(1);
        }
        waitpid(pid, &status, 0);
    }
    return 0;
}
$ time ./test
real 1m58.63
user 0m68.05
sys 0m30.96
I guess not.
At this point I decided to switch the cpufreq governor to performance, and the times got better:
$ for i in 0 1 2 3 4 5 6 7; do sudo sh -c "echo performance > /sys/devices/system/cpu/cpu$i/cpufreq/scaling_governor";done
$ time ./test
real 1m03.44
user 0m29.30
sys 0m10.66
It seems like every new process gets scheduled on a separate core, and it takes a while for that core to switch to a higher frequency. I can't say why the old version ran faster. Maybe it was lucky. Maybe it (due to its inefficiency) caused the CPU to choose a higher frequency earlier.
A nice side effect of changing the governor was that compile times improved too. Apparently compiling requires forking many new processes. It's not a workable solution, though, as this program will have to run on other people's desktops (and laptops).
The only way I found to improve the original times was to restrict the program (and child processes) to a single CPU by adding this code at the beginning:
cpu_set_t mask;
CPU_ZERO(&mask);
CPU_SET(0, &mask);
sched_setaffinity(0, sizeof(mask), &mask);
Which actually was the fastest despite using the default "ondemand" governor:
$ time ./test
real 0m59.74
user 0m29.02
sys 0m10.67
Not only is it a hackish solution, but it doesn't work well in case the launched program uses multiple threads. There's no way for my program to know that.
Does anyone have any idea how to get the spawned processes to run at a high CPU clock frequency? It has to be automated and must not require su privileges. Though I've only tested this on Linux so far, I intend to port this to more or less all popular and unpopular desktop OSes (and it will also run on servers). Any idea on any platform is welcome.
CPU frequency is seen (by most OSes) as a system property, so you can't change it without root rights. There is some research on extensions that allow adaptation for specific programs; however, since the energy/performance model differs even within the same general architecture, you will hardly find a general solution.
In addition, be aware that in order to guarantee fairness, the Linux scheduler shares the execution time of parent and child processes for the first epoch of the child. This might have an impact on your problem.
