How to execute longest tasks first with TBB - multithreading

I have 10000 tasks that I am trying to schedule with tbb across N threads. 9900 tasks take O(1) unit time for execution whereas the remaining 100 tasks take O(100)-O(1000) time for execution. I want tbb to schedule these tasks such that the top 100 longest tasks are scheduled first on the threads, so that we maximize efficiency. If some threads finish faster, they can then run the shorter jobs at the end.
I have one (hypothetical) example: Out of 10000 tasks, I have one super long task which takes 1111 units, remaining 9999 tasks all take just 1 unit, and I have 10 threads. I want thread j to run this super long task in 1111 units of time, and the other 9 threads run the remaining 9999 tasks which take 1 unit each so those 9 threads run 9999/9=1111 tasks in 1111 units of time. Which means I am using my threads at 100% efficiency (ignore any overhead).
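The arithmetic in this example can be checked with a small simulation. This is a minimal sketch, not TBB code: it assumes a simple greedy scheduler that hands each task, in list order, to whichever thread becomes free first. The class and method names (`LongestFirstDemo`, `makespan`, `tasks`) are made up for illustration.

```java
import java.util.Arrays;
import java.util.PriorityQueue;

final class LongestFirstDemo {
    // Greedy list scheduling: each task, in the given order, goes to the
    // thread that becomes free first. Returns the time at which the last
    // thread finishes (the makespan).
    static long makespan(long[] durations, int threads) {
        PriorityQueue<Long> freeAt = new PriorityQueue<>();
        for (int i = 0; i < threads; i++) {
            freeAt.add(0L);
        }
        long end = 0;
        for (long d : durations) {
            long finish = freeAt.poll() + d; // earliest-free thread takes the task
            end = Math.max(end, finish);
            freeAt.add(finish);
        }
        return end;
    }

    // 9999 unit tasks plus one 1111-unit task, placed first or last in the list.
    static long[] tasks(boolean longFirst) {
        long[] t = new long[10_000];
        Arrays.fill(t, 1L);
        t[longFirst ? 0 : t.length - 1] = 1111L;
        return t;
    }

    public static void main(String[] args) {
        System.out.println("long task first: " + makespan(tasks(true), 10));  // 1111
        System.out.println("long task last:  " + makespan(tasks(false), 10)); // 2110
    }
}
```

Under these assumptions, starting the long task first keeps all 10 threads busy for 1111 units, while starting it last takes 2110 units, because the long task only begins after the short ones have been spread over all threads.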
What I have is a function which does something like this:
bool run( Worker& worker, size_t numTasks, WorkerData& workerData ) {
    xtbbTaskData<Worker,WorkerData> taskData( worker, workerData, numThreads );
    arena.execute( [&]{
        tbb::parallel_for( tbb::blocked_range<size_t>( 0, numTasks ), xtbbTask<Worker,WorkerData>( taskData ) );
    } );
}
where I have a xtbb::arena arena created with numThreads. Worker is my worker class with 10000 tasks, workerData is the class with the data needed to run each task, and I have a template class xtbbTaskData which takes the worker, workerdata and eventually has the operator() which calls run on each task in the worker.
What syntax should I use to schedule these tasks so that the longest tasks get scheduled first? I have come across task priorities, tbb arenas, enqueue, etc., but I am not finding good examples of how to code this.
Do I need to create multiple arenas? Or multiple workers? Or put the longest tasks at the end of the vector in worker? Or something else?
If someone can point me to examples or templates that are already doing this, that would be great.

A task_group_context represents a group of tasks that can be canceled, or have their priority level set, together.
Refer to page 378 of the textbook Pro TBB: C++ Parallel Programming with Threading Building Blocks for examples.
We can also define priority as an attribute of the task arena.
Refer to page 494 of the same textbook for an example.

Related

Why ExecutorService is much faster than Coroutines in this example? [Solved]

Update:
I made 2 silly mistakes!
I submitted only 1 task in the executor service example
I forgot to await the completion of the tasks.
Fixing the test led to all 3 examples having around 190-200 ms/op latency.
I created a benchmark comparison using kotlinx-benchmark (uses jmh) to compare coroutines and a threadpool when making a blocking call.
My rationale behind this benchmark is:
Coroutines will block the underlying thread when making a blocking call.
A network call is generally blocking.
In an average service, I need to make a million network calls.
In such a scenario, will I get any benefit if I use coroutines?
The benchmark I created simulates the blocking call using Thread.sleep(10) // 10 ms block, and I need to make 1000 of them. I created 3 examples, with the following results:
Dispatchers.IO
Used Dispatchers.IO, which is the recommended way to handle IO operations.
@Benchmark
fun withCoroutines() {
    runBlocking {
        val coroutines = (0 until 1000).map {
            CoroutineScope(Dispatchers.IO).async {
                sleep(10)
            }
        }
        coroutines.joinAll()
    }
}
Avg time: 188.418 ms/op
Fixed Threadpool
Dispatchers.IO created 64 threads (the exact number cannot be determined statically), so I used 60 threads for a comparable scenario.
@Benchmark
fun withExecutorService() {
    val executors = Executors.newFixedThreadPool(60)
    executors.submit { sleep(10) }
    executors.shutdown()
}
Avg time: 0.054 ms/op
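The 0.054 ms/op figure matches the two mistakes listed in the update: only one task is submitted, and nothing waits for it. Below is a minimal Java sketch of what a corrected measurement might look like, using only java.util.concurrent; the class and method names (`ExecutorSleepDemo`, `runAll`) are hypothetical and this is not the author's fixed benchmark. With 60 threads and 1000 sleeps of 10 ms, at least ceil(1000/60) = 17 rounds are needed, so roughly 170 ms is the floor, which is in line with the 190-200 ms/op reported after the fix.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class ExecutorSleepDemo {
    // Submits `tasks` blocking calls of `sleepMs` each to a fixed pool and
    // waits for every one of them. Returns elapsed wall-clock milliseconds.
    static long runAll(int threads, int tasks, long sleepMs) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        long start = System.nanoTime();
        try {
            List<Future<?>> futures = new ArrayList<>();
            for (int i = 0; i < tasks; i++) {
                futures.add(pool.submit(() -> {
                    try {
                        Thread.sleep(sleepMs); // stands in for the blocking call
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                }));
            }
            for (Future<?> f : futures) {
                f.get(); // await completion, the step missing in the original test
            }
        } catch (Exception e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
        return (System.nanoTime() - start) / 1_000_000;
    }

    public static void main(String[] args) {
        System.out.println("elapsed ms: " + runAll(60, 1000, 10));
    }
}
```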
Threadpool Dispatcher
Since the results were shocking, I decided to use the same thread pool as above as the dispatcher:
Executors.newFixedThreadPool(60).asCoroutineDispatcher()
Avg time: 206.260 ms/op
Questions
Why are coroutines performing exceptionally badly here?
With the limitedParallelism(10) option, coroutines performed much better, at 30 ms/op. The default number of threads used by Dispatchers.IO is 64. Does that mean that the coroutine scheduler is causing too many context switches, leading to poor performance? Still, the performance is not close to that of thread pools.
Am I correct to assume that network calls are always blocking? Both the executor service and coroutines schedule execution over underlying threads while not blocking the main thread, so they are direct competitors.
Notes:
I am running jmh with
@State(Scope.Benchmark)
@Fork(1)
@Warmup(iterations = 50)
@Measurement(iterations = 5, time = 1000, timeUnit = TimeUnit.MILLISECONDS)
@OutputTimeUnit(TimeUnit.MILLISECONDS)
@BenchmarkMode(Mode.AverageTime)
The code can be found here

Measuring Semaphore wait times with Micrometer

We have a throttling implementation that essentially boils down to:
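Semaphore s = new Semaphore(1);
...
void callMethod() throws Exception {
    s.acquire();
    try {
        // expensiveMethod() is a placeholder for the expensive call
        timer.recordCallable(() -> expensiveMethod());
    } finally {
        s.release(); // release the permit even if the call throws
    }
}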
I would like to gather metrics about the impact the semaphore has on the overall response time of the method. For example, I would like to know the number of threads that were waiting to acquire it, the time spent waiting, etc. What I am looking for, I guess, is a gauge that also captures timing information?
How do I measure the Semaphore stats?
There are multiple things you can do depending on your needs and situation.
LongTaskTimer is a timer that measures tasks that are currently in progress. The in-progress part is key here: after the task has finished, you will no longer see its effect on the timer. That is why it is meant for long-running tasks; I'm not sure it fits your use case.
The other thing you can do is to have a Timer and a Gauge: the timer measures the time it took to acquire the Semaphore, while the gauge tracks the number of threads currently waiting on it, incremented and decremented around the acquire.
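The timer-plus-gauge idea can be sketched with the JDK alone. In the sketch below, the `MeasuredSemaphore` wrapper and all its members are hypothetical names, and the plain counters stand in for the metrics: in a Micrometer setup the elapsed wait would be recorded into a Timer and the waiting counter would be exposed through a Gauge. Those registrations are left out so the example runs without extra dependencies. java.util.concurrent.Semaphore also offers getQueueLength(), which returns an estimate of the number of waiting threads.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Semaphore;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.LongAdder;
import java.util.function.Supplier;

final class MeasuredSemaphore {
    private final Semaphore semaphore;
    private final AtomicInteger waiting = new AtomicInteger(); // would back a gauge
    private final LongAdder totalWaitNanos = new LongAdder();  // would feed a timer
    private final LongAdder acquisitions = new LongAdder();

    MeasuredSemaphore(int permits) {
        this.semaphore = new Semaphore(permits);
    }

    // Runs body while holding a permit; records how long the acquire took.
    <T> T call(Supplier<T> body) throws InterruptedException {
        long start = System.nanoTime();
        waiting.incrementAndGet();
        try {
            semaphore.acquire();
        } finally {
            waiting.decrementAndGet(); // also runs if the wait is interrupted
        }
        totalWaitNanos.add(System.nanoTime() - start); // successful acquires only
        acquisitions.increment();
        try {
            return body.get();
        } finally {
            semaphore.release();
        }
    }

    int waitingThreads() { return waiting.get(); }
    long totalWaitMillis() { return totalWaitNanos.sum() / 1_000_000; }
    long acquisitionCount() { return acquisitions.sum(); }

    private static void sleepQuietly(long ms) {
        try {
            Thread.sleep(ms);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    // Demo: one thread holds the only permit for about 100 ms while the
    // calling thread waits for it.
    static MeasuredSemaphore demo() {
        MeasuredSemaphore ms = new MeasuredSemaphore(1);
        CountDownLatch holderRunning = new CountDownLatch(1);
        Thread holder = new Thread(() -> {
            try {
                ms.call(() -> { holderRunning.countDown(); sleepQuietly(100); return null; });
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        holder.start();
        try {
            holderRunning.await(); // the holder now owns the permit
            ms.call(() -> "done"); // blocks until the holder releases
            holder.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return ms;
    }

    public static void main(String[] args) {
        MeasuredSemaphore ms = demo();
        System.out.println("acquisitions: " + ms.acquisitionCount());
        System.out.println("total wait ms: " + ms.totalWaitMillis());
    }
}
```

Timing the acquire separately from the guarded call keeps the throttling cost apart from the cost of the expensive method itself.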

Parallel execution with data-provider-thread-count and thread-count

I have used DataProviders in my tests. I want to execute them in parallel (@DataProvider(parallel = true)).
My configuration is parallel = methods, data-provider-thread-count = 1, thread-count = 2.
The total number of threads I want executing at a given time is 2. I want the DataProvider to pick up the next input whenever there is an idle thread. Currently the DataProvider uses the same thread for one input after another, which is effectively sequential.
If I set data-provider-thread-count = 2 and thread-count = 2, then 2 x 2 = 4 threads run in parallel. This will increase the load when there are 100 DataProvider tests.
Is there a way to stop the DataProvider threads from spawning a separate thread pool, so that the inputs can still be picked up for parallel execution?

What is the strategy by thread assignment in Kafka streams?

My code has more or less the following setup:
// loop over the inTopicName(s) {
KStream<String, String> stringInput = kBuilder.stream( STRING_SERDE, STRING_SERDE, inTopicName );
stringInput.filter( streamFilter::passOrFilterMessages ).map( processor_i ).to( outTopicName );
// } end of loop
streams = new KafkaStreams( kBuilder, streamsConfig );
streams.cleanUp();
streams.start();
If there is e.g. num.stream.threads > 1, how tasks are assigned to the prepared and assigned (in the loop) threads?
I suppose (I am not sure) there is a thread pool and tasks are assigned to threads with some kind of round-robin policy, but this could be done fully dynamically at runtime or once at the beginning, when the filtering/mapping structure is created.
I am especially interested in the situation where one topic receives compute-intensive tasks and the other does not. Is it possible that the application will starve because all threads are assigned to the time-consuming processor?
Let's play a bit with a scenario: num.stream.threads=2, no. partitions=4 per topic, no. topics=2 (huge_topic and slim_topic).
The loop in my question runs once at startup of the app. In the loop I define 2 topics, and I know that heavyweight messages come from one topic (huge_topic) and lightweight messages come from the other (slim_topic).
Is it possible that both threads from num.stream.threads will be busy only with tasks coming from huge_topic, and messages from slim_topic will have to wait for processing?
Internally, Kafka Streams creates tasks based on partitions. Going with your loop example, assume you have 3 input topics A, B, C with 2, 4, and 3 partitions respectively. For this, you will get 4 tasks (i.e., the max number of partitions over all topics) with the following partition-to-task assignment:
t0: A-0, B-0, C-0
t1: A-1, B-1, C-1
t2:        B-2, C-2
t3:        B-3
Partitions are grouped "by number" and assigned to the corresponding task. This is determined at runtime (i.e., after you call KafkaStreams#start()) because before that, the number of partitions per topic is unknown.
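The grouping rule described above can be mirrored in a few lines. This is a simplified illustration, not Kafka Streams code: the class and method names (`PartitionGroupingDemo`, `group`, `example`) are made up, and the sketch only reproduces the "by number" rule for the A/B/C example.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class PartitionGroupingDemo {
    // Groups partitions "by number": task i receives partition i of every
    // topic that has a partition with that number.
    static List<List<String>> group(Map<String, Integer> partitionsPerTopic) {
        int taskCount = 0;
        for (int n : partitionsPerTopic.values()) {
            taskCount = Math.max(taskCount, n); // max partitions over all topics
        }
        List<List<String>> tasks = new ArrayList<>();
        for (int p = 0; p < taskCount; p++) {
            List<String> task = new ArrayList<>();
            for (Map.Entry<String, Integer> e : partitionsPerTopic.entrySet()) {
                if (p < e.getValue()) {
                    task.add(e.getKey() + "-" + p);
                }
            }
            tasks.add(task);
        }
        return tasks;
    }

    // The example from the text: topics A, B, C with 2, 4 and 3 partitions.
    static Map<String, Integer> example() {
        Map<String, Integer> topics = new LinkedHashMap<>();
        topics.put("A", 2);
        topics.put("B", 4);
        topics.put("C", 3);
        return topics;
    }

    public static void main(String[] args) {
        List<List<String>> tasks = group(example());
        for (int i = 0; i < tasks.size(); i++) {
            System.out.println("t" + i + ": " + tasks.get(i));
        }
    }
}
```

For the example input this yields four tasks, with t3 holding only B-3, as in the assignment listed above.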
It is not recommended to mess with the partition grouper if you don't understand all the internal details of Kafka Streams -- you can very easily break stuff! (This interface was deprecated already and will be removed in the upcoming 3.0 release.)
With regard to threads: tasks limit the number of threads. For our example, this implies that you can have at most 4 threads (if you have more, those threads will be idle, as there is no task left for thread assignment). How you "distribute" those threads is up to you. You can either have 4 single-threaded application instances or one single application instance with 4 threads (or anything in between).
If you have fewer threads than tasks, tasks will be assigned in a load-balanced way, based on the number of tasks (all tasks are assumed to have the same load).
If there is e.g. num.stream.threads > 1, how tasks are assigned to the prepared and assigned (in the loop) threads?
Tasks are assigned to threads with the usage of a partition grouper. You can read about it here. AFAIK it's called after a rebalance, so it's not a very dynamic process. That said, I'd argue that there is no option for starvation.

Maximum number of tasks supported in AUTOSAR

What is the maximum number of tasks supported in AUTOSAR compliant systems?
In Linux, I can check the maximum process IDs supported to get the maximum number of tasks supported.
However, I couldn't find any source that states the maximum number of tasks supported by AUTOSAR.
Thank you very much for your help!
Well, we are still in an embedded automotive world and not on a PC.
There is usually a tradeoff between the number of tasks you have and what it takes to schedule them and what RAM/ROM and runtime resources your configuration uses.
As already said, if you just need a simple timed loop with some interrupts in between, one task may be ok.
It might be also enough, to have e.g. 3 tasks running at 5ms, 10ms and 20ms cycle. But you could also schedule this in simple cases like this with a single 5ms task:
TASK(TASK_5ms)
{
    static uint8 cnt = 0;
    cnt++;
    // XXX and YYY Mainfunctions shall only be called every 10ms
    // but do a load balancing, that does not run 3 functions every 10ms
    // and 1 every 5ms, but only two every 5ms
    if (cnt & 1)
    {
        XXX_Mainfunction_10ms();
    }
    else
    {
        YYY_Mainfunction_10ms();
    }
    ZZZ_Mainfunction_5ms();
}
So, if you need something to be run every 5, 10 or 20ms, you put these runnables into the corresponding tasks.
The old OSEK also had a notion of BASIC vs EXTENDED tasks, where only extended tasks were able to react to OsEvents. These tasks might not run cyclically, but only on configured OsEvents. You would have an OS waitpoint there, where the task is more or less stopped and only woken up by the OS on the arrival of an event. There are also OsAlarms, which can trigger the activation of an OsTask either directly or indirectly via an event, so you could e.g. wait at the same waitpoint on both a cyclic event from an OsAlarm and an OsEvent set by something else, e.g. by another task or from an ISR.
TASK(TASK_EXT)
{
    EventMaskType evt;
    for(;;)
    {
        WaitEvent(EVT_XXX_START | EVT_YYY_START | EVT_YYY_FINISHED);
        GetEvent(TASK_EXT, &evt);
        // Start XXX only if it was triggered and YYY has reported to be finished
        if ((evt & (EVT_XXX_START | EVT_YYY_FINISHED)) == (EVT_XXX_START | EVT_YYY_FINISHED))
        {
            ClearEvent(EVT_XXX_START);
            XXX_Start();
        }
        // Start YYY if triggered, will report later to start XXX
        if (evt & EVT_YYY_START)
        {
            ClearEvent(EVT_YYY_START);
            YYY_Start();
        }
    }
}
This direct handling of scheduling is now mostly done/generated within the RTE based on the events you have configured for your SWCs and the Event to Task Mapping etc.
Tasks are scheduled mainly by their priority, which is why they can be interrupted at any time by a higher-priority task. The exception is if you configure your OS and tasks to be cooperative rather than preemptive. Then it might be necessary to also use Schedule() points in your code to give up the CPU.
On bigger systems and also on MultiCore systems with a MultiCore OS, there will be higher numbers of Tasks, because Tasks are bound to a Core, though the Tasks on different Cores run independently, except maybe for the Inter-Core-Synchronization. This can also have a negative performance impact (Spinlocks can stop the whole system).
E.g. there could be some Cyclic Tasks for normal BaseSW components and one specific Task only for Communication components (CAN Stack and Comm-Services).
We usually separate the communication part, since they need a certain cycle time like 5..10ms, since this cycle is used by the Comm-Stack for message transmission scheduling and also reception timeout monitoring.
Then there might be a task to handle the memory stack (Ea/Fls, Eep/Fee, NvM).
There might be also some kind of Event based Tasks to trigger certain HW-control and processing chains of measured data, since they might be put on different cores, and can be scheduled by start or finished events of each other.
On the other hand, for all your cyclic tasks, you should also make sure that the functions run within such a task do not run longer than your task cycle; otherwise you get an OS shutdown due to multiple activation of the same task, since your task is started again before it has actually finished. And you might have some constraints that require some tasks to finish within your application's expected measurement cycle.
In safety-relevant systems (ASIL-A .. ASIL-D) you'll also have at least one task for each safety level to get freedom from interference. In AUTOSAR, you already specify that on the OsApplication to which the tasks are assigned, which also allows you to configure the MemoryProtection (e.g. write access to memory partitions by QM, ASIL-A, ASIL-B applications and tasks). That is then another thing the OS has to do at runtime: reconfigure the MPU according to the OsApplication's MemoryAccess settings.
But again, the more tasks you create, the higher the usage of RAM, ROM and runtime.
RAM - runtime scheduling structures and different task stacks
ROM - the actual task and event configurations
Runtime - the context switches of the tasks and also the scheduling itself
It seems to vary. I found that ETAS RTA offers 1024 tasks*, whereas Vector's MICROSAR OS has 65535.
For task handling, OSEK/ASR provides the following functions:
StatusType ActivateTask (TaskType TaskID)
StatusType TerminateTask (void)
StatusType Schedule (void)
StatusType GetTaskID (TaskRefType TaskID)
StatusType GetTaskState (TaskType TaskID, TaskStateRefType State)
*Link might change in future, but it is easy to search ETAS page directly for manuals etc.: https://www.etas.com/en/products/download_center.php
Formally, you can have an unlimited number of OsTasks: according to the spec, the configuration of the Os can have 0..* OsTask.
Apart from that, the (OS) software uses the data type TaskType for task-index variables. Therefore, if TaskType is a uint16, you cannot have more than 65535 tasks.
Besides that, if you have a lot of tasks, you might re-think your design.
