ClientConfig.setHeavyweight - Volt Network 1 & Volt Reaper Thread - voltdb

The ClientConfig Javadoc says that the client will use Runtime.coresAvailable / 2 threads (say N = cores / 2). Does that mean there will be N Volt Network threads and N Volt Reaper threads, or N/2 Volt Network threads and N/2 Volt Reaper threads?

I work at VoltDB. There is a discrepancy in the Javadoc: the formula is actually max(1, cores/4) for the number of Volt Network threads. The thread names will be "Volt Client Network - 1" and subsequent numbers. Each client will have a single "VoltDB Client Reaper Thread".
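For concreteness, the thread-count rule above can be written down directly. This is just an illustration; `voltNetworkThreads` is a hypothetical helper, not part of the VoltDB client API:

```cpp
#include <algorithm>

// Number of "Volt Client Network" threads, per the formula above:
// max(1, cores / 4). Hypothetical helper, not part of the client API.
unsigned voltNetworkThreads(unsigned cores) {
    return std::max(1u, cores / 4);
}
```

So a 4-core machine gets one network thread and a 16-core machine gets four, and every client additionally runs exactly one reaper thread.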

Related

How to execute longest tasks first with TBB

I have 10000 tasks that I am trying to schedule with TBB across N threads. 9900 tasks take O(1) unit time to execute, whereas the remaining 100 tasks take O(100)-O(1000) time. I want TBB to schedule these tasks such that the 100 longest tasks are scheduled first on the threads, so that we maximize efficiency; threads that finish early can then run the shorter jobs at the end.
A (hypothetical) example: out of 10000 tasks, one super long task takes 1111 units, the remaining 9999 tasks take 1 unit each, and I have 10 threads. I want one thread to run the super long task in 1111 units of time, while the other 9 threads run the remaining 9999 tasks at 1 unit each, i.e. 9999/9 = 1111 tasks per thread in 1111 units of time. That means I am using my threads at 100% efficiency (ignoring any overhead).
What I have is a function which does something like this:
bool run( Worker& worker, size_t numTasks, WorkerData& workerData ) {
    xtbbTaskData<Worker, WorkerData> taskData( worker, workerData, numThreads );
    arena.execute( [&] {
        tbb::parallel_for( tbb::blocked_range<size_t>( 0, numTasks ),
                           xtbbTask<Worker, WorkerData>( taskData ) );
    } );
    return true;
}
where I have an xtbb::arena arena created with numThreads. Worker is my worker class with the 10000 tasks, workerData is the class with the data needed to run each task, and xtbbTaskData is a template class that takes the worker and workerData and whose operator() calls run on each task in the worker.
What syntax should I use to schedule these tasks such that the longest tasks get scheduled first? I have come across task priorities, tbb::task_arena, enqueue, etc., but I am not finding good examples of how to code this.
Do I need to create multiple arenas? Or multiple workers? Or put the longest tasks at the end of the vector in the worker? Or something else?
If someone can point me to examples or templates that already do this, that would be great.
A task_group_context represents a group of tasks that can be cancelled, or have their priority level set, together. Refer to page 378 of the Pro TBB: C++ Parallel Programming with Threading Building Blocks textbook for examples.
You can also set priority as an attribute of a task_arena. Refer to page 494 of the same book for an example.
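Independent of TBB specifics, the core of the trick is to present the work in descending-cost order and let idle workers pull from the front. Below is a minimal sketch using plain std::thread and an atomic counter (`runLongestFirst` is a made-up name, not a TBB API); with TBB, the analogous move is to sort an index array the same way and iterate over it inside your parallel_for body:

```cpp
#include <algorithm>
#include <atomic>
#include <cstddef>
#include <numeric>
#include <thread>
#include <vector>

// Run tasks longest-first: sort task indices by descending estimated
// cost, then let each worker claim the next unstarted index from a
// shared atomic counter. Returns the claim order (for inspection);
// `result[i]` records the "work" done for task i.
std::vector<size_t> runLongestFirst(const std::vector<int>& cost,
                                    unsigned numThreads,
                                    std::vector<long long>& result) {
    std::vector<size_t> order(cost.size());
    std::iota(order.begin(), order.end(), 0);
    std::sort(order.begin(), order.end(),
              [&](size_t a, size_t b) { return cost[a] > cost[b]; });

    std::atomic<size_t> next{0};
    result.assign(cost.size(), 0);
    std::vector<std::thread> pool;
    for (unsigned t = 0; t < numThreads; ++t)
        pool.emplace_back([&] {
            for (size_t i; (i = next.fetch_add(1)) < order.size();) {
                size_t task = order[i];
                result[task] = cost[task];  // stand-in for real work
            }
        });
    for (auto& th : pool) th.join();
    return order;
}
```

The atomic counter guarantees each task runs exactly once, and because the indices are sorted by descending cost, the longest tasks are claimed first while short tasks fill in the tail — which is exactly the 100%-utilization schedule from the hypothetical example above.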

Precise Throughput Timer stuck with simple setup

I have a similar issue as in Synchronizing timer hangs with simple setup, but with the Precise Throughput Timer, which is supposed to replace the Synchronizing Timer:
Certain cases might be solved via Synchronizing Timer, however Precise Throughput Timer has native way to issue requests in packs. This behavior is disabled by default, and it is controlled with "Batched departures" settings
Number of threads in the batch (threads). Specifies the number of samples in a batch. Note the overall number of samples will still be in line with Target Throughput
Delay between threads in the batch (ms). For instance, if set to 42, and the batch size is 3, then threads will depart at x, x+42ms, x+84ms
I'm setting Number of Threads to 10, Ramp-up to 1 and Loop Count to 1.
I'm adding only 1 HTTP Request (response under 1 second), preceded by a Test Action with a Precise Throughput Timer as a child, with the following setup:
Threads get stuck after 5 threads have succeeded:
EDIT 1
Following #Dimitri T's solution:
I changed Duration to 100 and added the line to the logging configuration, and got 5 errors:
2018-03-12 15:43:42,330 INFO o.a.j.t.JMeterThread: Stopping Thread: org.apache.jorphan.util.JMeterStopThreadException: The thread is scheduled to stop in -99886 ms and the throughput timer generates a delay of 20004077. JMeter (as of 4.0) does not support interrupting of sleeping threads, thus terminating the thread manually.
EDIT 2
Following #Dimitri T's solution, setting "Loop Count" to -1 executed 10 threads, but if I change the number of threads in the batch from 2 to 5 it executes only 3 threads and stops:
INFO o.a.j.t.JMeterThread: Stopping Thread: org.apache.jorphan.util.JMeterStopThreadException: The thread is scheduled to stop in -89233 ms and the throughput timer generates a delay of 19999450. JMeter (as of 4.0) does not support interrupting of sleeping threads, thus terminating the thread manually.
Set "Duration (seconds)" in your Thread Group to something non-zero (e.g. 100)
Depending on what you're trying to achieve you might also want to set "Loop Count" to -1
You can also add the following line to log4j2.xml file:
<Logger name="org.apache.jmeter.timers" level="debug" />
This way you will be able to see what's going on with your timer(s) in the jmeter.log file

Finding requests per second for distributed system - a textbook query

I found a question in Pradeep K. Sinha's book.
From my understanding it is safe to assume any number of threads is available. But how do we compute the time?
Single-threaded:
We want to figure out how many requests per second the system can support; this is represented as n below.
1 second
= 1000 milliseconds
= 0.7n(20) + 0.3n(100)
Since 70% of the requests hit the cache, we represent the time spent handling requests that hit the cache with 0.7n(20). We represent the requests that miss the cache with 0.3n(100). Since the thread goes to sleep when there is a cache miss and it contacts the file server, we don't need to worry about interleaving the handling for the next request with the current one.
Solving for n:
1000
= 0.7n(20) + 0.3n(100)
= 0.7n(20) + 1.5n(20)
= 2.2n(20)
= 44n
=> n = 1000/44 = 22.73.
Therefore, a single thread can handle 22.73 requests per second.
Multi-threaded:
The question does not give much detail about the multi-threaded case, apart from the context-switch cost. The answer depends on several factors:
How many cores does the computer have?
How many threads can exist at once?
When there is a cache miss, how much time does the computer spend servicing the request and how much time does the computer spend sleeping?
I am going to make the following assumptions:
There is 1 core.
There is no bound on how many threads can exist at once.
On a cache miss, the computer spends 20 milliseconds servicing the request (e.g. checking the cache, contacting the file server, and forwarding the response to the client) and 80 milliseconds sleeping.
I can now solve for n:
1000 milliseconds
= 0.7n(20) + 0.3n(20).
On a cache miss, a thread spends 20 milliseconds doing work and 80 milliseconds sleeping. When the thread is sleeping, another thread can run and do useful work. Thus, on a cache miss, the thread only uses the CPU for 20 milliseconds, whereas when the process was single-threaded, the next request was blocked from being serviced for 100 milliseconds.
Solving for n:
1000 milliseconds
= 0.7n(20) + 0.3n(20)
= 1.0n(20)
= 20n
=> n = 1000/20 = 50.
Therefore, a multi-threaded process can handle 50 requests per second given the assumptions above.

What is the strategy by thread assignment in Kafka streams?

I am doing more or less the following setup in the code:
// loop over the inTopicName(s) {
KStream<String, String> stringInput = kBuilder.stream( STRING_SERDE, STRING_SERDE, inTopicName );
stringInput.filter( streamFilter::passOrFilterMessages ).map( processor_i ).to( outTopicName );
// } end of loop
streams = new KafkaStreams( kBuilder, streamsConfig );
streams.cleanUp();
streams.start();
If there is e.g. num.stream.threads > 1, how are tasks assigned to the prepared and assigned (in the loop) threads?
I suppose (I am not sure) there is a thread pool and the tasks are assigned to threads with some kind of round-robin policy, but is that done fully dynamically at runtime, or once at the beginning when the filtering/mapping structure is created?
I am especially interested in the situation where one topic gets compute-intensive tasks and the other does not. Is it possible that the application will starve because all threads are assigned to the time-consuming processor?
Let's play a bit with a scenario: num.stream.threads=2, 4 partitions per topic, 2 topics (huge_topic and slim_topic).
The loop in my question is executed once at startup of the app. In the loop I define 2 topics, and I know that one topic carries heavyweight messages (huge_topic) and the other carries lightweight messages (slim_topic).
Is it possible that both threads from num.stream.threads will be busy only with tasks coming from huge_topic, and messages from slim_topic will have to wait for processing?
Internally, Kafka Streams creates tasks based on partitions. Going with your loop example, assume you have 3 input topics A, B, C with 2, 4, and 3 partitions respectively. For this, you will get 4 tasks (i.e., the max number of partitions over all topics) with the following partition-to-task assignment:
t0: A-0, B-0, C-0
t1: A-1, B-1, C-1
t2:        B-2, C-2
t3:        B-3
Partitions are grouped "by number" and assigned to the corresponding task. This is determined at runtime (i.e., after you call KafkaStreams#start()), because before that the number of partitions per topic is unknown.
It is not recommended to mess with the partition grouper if you don't understand all the internal details of Kafka Streams -- you can very easily break stuff! (This interface was already deprecated and will be removed in the upcoming 3.0 release.)
With regard to threads: the tasks limit the number of useful threads. For our example, this implies that you can have at most 4 threads (if you have more, those threads will be idle, as there is no task left to assign to them). How you "distribute" those threads is up to you: you can either have 4 single-threaded application instances or one single application instance with 4 threads (or anything in between).
If you have fewer threads than tasks, tasks will be assigned in a load-balanced way, based on the number of tasks (all tasks are assumed to have the same load).
If there is e.g. num.stream.threads > 1, how tasks are assigned to the
prepared and assigned (in the loop) threads?
Tasks are assigned to threads by a partition grouper. You can read about it here. AFAIK it's called after a rebalance, so it's not a very dynamic process. That said, I'd argue that there is no risk of starvation.
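The grouping rule from the first answer (task i takes partition i of every topic that has at least i+1 partitions) is easy to reproduce. A small illustration, in C++ for brevity even though Kafka Streams itself is Java; `groupByPartitionNumber` is a made-up name, not a Streams API:

```cpp
#include <algorithm>
#include <map>
#include <string>
#include <vector>

// Sketch of the *default* Kafka Streams grouping described above:
// task i gets partition i of every topic with at least i+1 partitions.
// (Kafka Streams is Java; this merely illustrates the rule.)
std::vector<std::vector<std::string>>
groupByPartitionNumber(const std::map<std::string, int>& partitionsPerTopic) {
    int numTasks = 0;  // = max partition count over all topics
    for (const auto& [topic, parts] : partitionsPerTopic)
        numTasks = std::max(numTasks, parts);
    std::vector<std::vector<std::string>> tasks(numTasks);
    for (const auto& [topic, parts] : partitionsPerTopic)
        for (int p = 0; p < parts; ++p)
            tasks[p].push_back(topic + "-" + std::to_string(p));
    return tasks;
}
```

For the A/B/C example above (2, 4, and 3 partitions), this yields exactly the four tasks t0 through t3 shown in the answer.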

Excessive Linux Latency

Is a latency of 50 ms normal on a Linux system?
I have a program with many threads; one thread controls the movement of an object with a motor and photocells.
I have done many things to get minimum latency, but I always get 50 ms, which causes a position error in the object.
Things I did:
- nice value set to -20
- Priority of the photocell control thread: SCHED_FIFO, 99
- Kernel configuration: CONFIG_PREEMPT=y
- mlockall (MCL_CURRENT | MCL_FUTURE);
Many times I lose 50 ms waiting for a photocell. I think the problem is not another of my threads, but a process in the kernel.
Is it possible to reduce this latency? Is it possible to find out what is taking these extra 50 ms?
The thread that controls the photocells makes many read() calls. Can this cause problems?
/**********/
The situation now is:
There is only one thread, running an infinite empty loop that only reads the time at the start and at the end of the loop.
No access to disk, no access to GPIO, no serial ports, nothing.
The loop takes 50 milliseconds much of the time.
I have not set CPU affinity; my processor has only one core.
I have been testing my program. This is the code in the main function, before the program starts the threads, that causes the 50 ms latency:
struct sched_param lsPrio;
lsPrio.sched_priority = 1;
if (sched_setscheduler (0, SCHED_FIFO, &lsPrio) != 0)
    printf ("FALLO sched_set\n");
If I comment these lines out, the latency is reduced to about 1 ms.
Why do these lines cause latency?
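One way to find out where the 50 ms goes is to measure it directly: spin on CLOCK_MONOTONIC and record the worst gap between consecutive iterations, the same idea as the cyclictest tool from the rt-tests suite. A minimal sketch (`maxGapNs` is a made-up helper):

```cpp
#include <ctime>

// Latency probe: spin reading CLOCK_MONOTONIC and record the largest
// gap (in ns) between consecutive iterations. On an idle system gaps
// stay in the microsecond range; a recurring ~50 ms spike means some
// other task or kernel work preempted this thread for that long.
long long maxGapNs(int iterations) {
    timespec prev{}, now{};
    clock_gettime(CLOCK_MONOTONIC, &prev);
    long long worst = 0;
    for (int i = 0; i < iterations; ++i) {
        clock_gettime(CLOCK_MONOTONIC, &now);
        long long gap = (now.tv_sec - prev.tv_sec) * 1000000000LL
                      + (now.tv_nsec - prev.tv_nsec);
        if (gap > worst) worst = gap;
        prev = now;
    }
    return worst;
}
```

Running the probe with and without the sched_setscheduler call would show whether the SCHED_FIFO setting itself correlates with the spikes.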
