I'm using Runtime.getRuntime().availableProcessors() to create my thread pool. In our own physical data center it always returned 8 CPUs, and that number never changed. Every time a (data-intensive) request came in, we partitioned the data using the same availableProcessors() API, and each thread of that thread pool processed one of those partitions.
However, we recently switched to AWS (with Docker as the container), and now every time a data-intensive request arrives and calls that API to partition the data, it returns a different number. Example:
Num Partitions Used: [X = 12] [Y = 15] [Z = 12] [W = 5] [Q = 15] [P = 3]
This is not good. What if availableProcessors() is 3 at the time we create the thread pool, but 15 when the request comes in and we partition the data?
I know that the API documentation says:
Returns the number of processors available to the Java virtual machine.
This value may change during a particular invocation of the virtual machine. Applications that are sensitive to the number of available processors should therefore occasionally poll this property and adjust their resource usage appropriately.
But my questions are:
Why did it never change before (always returning 8), why does it change so drastically on AWS, and is there a way to fix it?
If there is no way to fix it, how could I modify the application to make sure the thread pool and the partitions stay compatible?
Thanks
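One way to keep the thread pool and the partitioning compatible (a minimal sketch, not necessarily the only fix; the class and method names are illustrative): read availableProcessors() exactly once at startup and derive both the pool size and the partition count from that single captured value, so the two can never disagree even if the container's CPU allowance changes later.

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class PartitionedProcessor {
    // Capture the processor count once, at class load time.
    private static final int PARALLELISM = Runtime.getRuntime().availableProcessors();

    // Size the pool from the captured constant...
    private final ExecutorService pool = Executors.newFixedThreadPool(PARALLELISM);

    // ...and partition by the same constant, never by a fresh
    // call to availableProcessors() at request time.
    public int partitionCount() {
        return PARALLELISM;
    }
}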
Atomicity can be achieved with machine-level instructions such as compare-and-swap (CAS).
It can also be achieved by using a mutex/lock around a large block of code, with the OS providing help.
On the other hand, we also have the concept of a memory model. Some machines have a relaxed model, like ARM, which can reorder loads/stores within a single thread, and some have a stricter model, like x86.
I want to confirm my understanding of the term synchronization. Is it essentially the promise of both atomicity and the memory model? I.e., does using only atomic ops on a thread not necessarily make it synchronized with other threads?
Something atomic is indivisible. Things that are synchronized are happening together in time.
Atomicity
I like to think of this like having a data structure representing a 2-dimensional point with x, y coordinates. For my purposes, in order for my data to be considered "valid" it must always be a point along the x = y line. x and y must always be the same.
Suppose that initially I have a point { x = 10, y = 10 } and I want to update my data structure so that it represents the point {x = 20, y = 20}. And suppose that the implementation of the update operation is basically these two separate steps:
x = 20
y = 20
If my implementation writes x and y separately like that, then some other thread could potentially observe my point after step 1 but before step 2. If it is allowed to read the point after I change x but before I change y, it might observe the value {x = 20, y = 10}.
In fact there are three values that could be observed
{x = 10, y = 10} (the original value) [VALID]
{x = 20, y = 10} (x is modified but y is not yet modified) [INVALID x != y]
{x = 20, y = 20} (both x and y are modified) [VALID]
I need a way of updating the two values together so that it is impossible for an outside observer to observe {x = 20, y = 10}.
I don't really care when the other observer looks at the value of my point. It is fine if it observes { x = 10, y = 10 } and it is also fine if it observes { x = 20, y = 20 }. Both have the property x == y, which makes them valid in my scenario.
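A minimal Java sketch of one way to get that guarantee: make the point immutable and swap the whole object atomically, so an observer can only ever see a complete {x, y} pair (the class names are illustrative):

import java.util.concurrent.atomic.AtomicReference;

// Immutable point: once constructed, x and y can never be seen mid-update.
record Point(int x, int y) {}

class PointHolder {
    private final AtomicReference<Point> point =
            new AtomicReference<>(new Point(10, 10));

    // The reference swap is atomic: observers see either the old
    // {10, 10} or the new {v, v}, never a mixed {20, 10}.
    void moveTo(int v) {
        point.set(new Point(v, v));
    }

    Point read() {
        return point.get();
    }
}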
Simplest atomic operation
The simplest atomic operation is a test-and-set of a single bit. This operation atomically reads the value of a bit and overwrites it with a 1, returning the state of the bit we overwrote. We are guaranteed that once our operation has concluded, we hold the value we overwrote and any other observer will observe a 1. If many agents attempt this operation simultaneously, only one agent will return 0, and the others will all return 1. Even if it's two CPUs writing on the exact same clock tick, something in the electronics guarantees that the operation concludes logically atomically, according to our rules.
That's all there is to logical atomicity. That's all atomic means: you have the capability of performing an uninterrupted update, with valid data before and after the update, where the data cannot be observed by another observer in any intermediate state it may take on during the update. It may be a single bit or it may be an entire database.
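In Java, for example, AtomicBoolean.getAndSet exposes exactly this primitive; a minimal sketch:

import java.util.concurrent.atomic.AtomicBoolean;

public class TestAndSet {
    private static final AtomicBoolean bit = new AtomicBoolean(false);

    public static void main(String[] args) {
        // Atomically overwrite the bit with `true` and get the previous
        // value back. If many threads race on this line, exactly one of
        // them observes `false`; every other thread observes `true`.
        boolean previous = bit.getAndSet(true);
        if (!previous) {
            System.out.println("won the race: flipped 0 -> 1");
        }
    }
}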
x86 Example
A good example of something that can be done on x86 atomically is the 32-bit interlocked increment.
Here a 32-bit (4-byte) value must be incremented by 1. This could potentially require modifying more than one byte to work correctly. If the value is to be modified from 0x000000FF to 0x00000100, it's important that the 0xFF becomes 0x00 and the neighboring 0x00 becomes 0x01 atomically. Otherwise I risk observing the value 0x00000000 (if the low byte is modified first) or 0x000001FF (if the carry byte is modified first).
The hardware guarantees that we can test and modify 4 bytes at a time to achieve this. The CPU and memory provide a mechanism by which this operation can be performed even if there are other CPUs sharing the same memory. The CPU can assert a lock condition that prevents other CPUs from interfering with this interlocked operation.
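In Java this is exposed as AtomicInteger, whose incrementAndGet typically maps to the hardware's interlocked increment (a locked xadd on x86); a small sketch:

import java.util.concurrent.atomic.AtomicInteger;

public class InterlockedIncrement {
    private static final AtomicInteger counter = new AtomicInteger(0x000000FF);

    public static void main(String[] args) {
        // All four bytes move from 0x000000FF to 0x00000100 as one
        // indivisible step; no observer can see 0x00000000 or 0x000001FF.
        int after = counter.incrementAndGet();
        System.out.printf("0x%08X%n", after); // prints 0x00000100
    }
}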
Synchronization
Synchronization just talks about how things happen together in time. In the context you propose, it's about the order in which various sections of our program get executed and the order in which various components of our system change state. Without synchronization, we risk corruption (entering an invalid, semantically meaningless, or incorrect state of our program's execution or its data).
Let's say we want an interlocked increment of a 64-bit number, and suppose that the hardware does not offer a way to atomically change 64 bits at a time. We will have to accomplish what we want with a more complex structure, which means that even when just reading we can't simply read the most-significant 32 bits and the least-significant 32 bits of our 64-bit number separately: we'd risk observing one half of our 64-bit value changing separately from the other half. We must adhere to some kind of protocol when reading (or writing) this 64-bit value.
To implement this, we need an atomic test and set bit operation and a clear bit operation. (FYI, technically, what we need are two operations commonly referred to as P and V in computer science, but let's keep it simple.) Before reading or writing our data, we perform an atomic test-and-set operation on a single (shared) bit (commonly referred to as a "lock"). If we read a zero, then we know we are the only one that saw a zero and everyone else must have seen a 1. If we see a 1, then we assume someone else is using our shared data, and therefore we have no choice but to just try again. So we loop and keep testing and setting the bit until we observe it as a 0. (This is called a spin lock, and is the best we can do without getting help from the operating system's scheduler.)
When we eventually see a 0, then we can safely read both 32-bit parts of our 64-bit value individually. Or, if we're writing, we can safely write both 32-bit parts of our 64-bit value individually. Once both halves have been read or written, we clear the bit back to 0, permitting access by someone else.
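Here is a minimal Java sketch of that protocol, guarding two 32-bit halves with a single atomic bit (in real Java you would just use AtomicLong; this only illustrates the mechanism):

import java.util.concurrent.atomic.AtomicBoolean;

public class SpinLocked64 {
    private final AtomicBoolean lock = new AtomicBoolean(false); // the shared bit
    private int high; // most-significant 32 bits
    private int low;  // least-significant 32 bits

    public void write(long value) {
        while (lock.getAndSet(true)) { /* saw a 1: someone else is inside, spin */ }
        try {
            // Safe: we are the only thread inside the protocol.
            high = (int) (value >>> 32);
            low  = (int) value;
        } finally {
            lock.set(false); // clear the bit, permitting access by someone else
        }
    }

    public long read() {
        while (lock.getAndSet(true)) { /* spin */ }
        try {
            return ((long) high << 32) | (low & 0xFFFFFFFFL);
        } finally {
            lock.set(false);
        }
    }
}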
Any such combination of cleverness and use of atomic operations to avoid corruption in this manner constitutes synchronization because we are governing the order in which certain sections of our program can run. And we can achieve synchronization of any complexity and of any amount of data so long as we have access to some kind of atomic data.
Once we have created a program that uses a lock to share a data structure in a conflict-free way, we could also refer to that data structure as being logically atomic. C++ provides std::atomic to achieve this, for example.
Remember that synchronization at this level (with a lock) is achieved by adhering to a protocol (protecting your data with a lock). Other forms of synchronization, such as what happens when two CPUs try to access the same memory on the same clock tick, are resolved in hardware by the CPUs and the motherboard, memory, controllers, etc. But fundamentally something similar is happening, just at the motherboard level.
I am doing more or less the following setup in the code:
// loop over the inTopicName(s) {
KStream<String, String> stringInput = kBuilder.stream( STRING_SERDE, STRING_SERDE, inTopicName );
stringInput.filter( streamFilter::passOrFilterMessages ).map( processor_i ).to( outTopicName );
// } end of loop
streams = new KafkaStreams( kBuilder, streamsConfig );
streams.cleanUp();
streams.start();
If there is e.g. num.stream.threads > 1, how are tasks assigned to the threads prepared and assigned (in the loop)?
I suppose (I am not sure) there is a thread pool and the tasks are assigned to threads with some kind of round-robin policy, but is that done fully dynamically at runtime, or once at the beginning when the filtering/mapping structure is created?
I am especially interested in the situation where one topic gets compute-intensive tasks and the other does not. Is it possible that the application will starve because all threads are assigned to the time-consuming processor?
Let's play a bit with a scenario: num.stream.threads=2, no. partitions=4 per topic, no. topics=2 (huge_topic and slim_topic).
The loop in my question is done once at startup of the app. In the loop I define 2 topics, and I know that from one topic come heavyweight messages (huge_topic) and from the other come lightweight messages (slim_topic).
Is it possible that both threads from num.stream.threads will be busy only with tasks coming from huge_topic, and messages from slim_topic will have to wait for processing?
Internally, Kafka Streams creates tasks based on partitions. Going with your loop example, assume you have 3 input topics A, B, and C with 2, 4, and 3 partitions respectively. For this, you will get 4 tasks (i.e., the max number of partitions over all topics) with the following partition-to-task assignment:
t0: A-0, B-0, C-0
t1: A-1, B-1, C-1
t2: B-2, C-2
t3: B-3
Partitions are grouped "by number" and assigned to the corresponding task. This is determined at runtime (i.e., after you call KafkaStreams#start()), because before that the number of partitions per topic is unknown.
It is not recommended to mess with the partition grouper if you don't understand all the internal details of Kafka Streams -- you can very easily break stuff! (This interface has already been deprecated and will be removed in the upcoming 3.0 release.)
With regard to threads: tasks limit the number of threads. For our example, this implies that you can have at most 4 threads (if you have more, the extra threads will be idle, as there is no task left to assign to them). How you "distribute" those threads is up to you. You can either have 4 single-threaded application instances or one single application instance with 4 threads (or anything in between).
If you have fewer threads than tasks, tasks will be assigned in a load-balanced way, based on the number of tasks (all tasks are assumed to carry the same load).
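For reference, the thread count is just a config property per application instance; a small sketch (the application id and broker address are placeholders):

import java.util.Properties;
import org.apache.kafka.streams.StreamsConfig;

public class ThreadConfigExample {
    static Properties streamsProperties() {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "my-app");          // placeholder id
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092");  // placeholder broker
        // Two threads in this instance: with 4 tasks total, each thread is
        // assigned 2 tasks. Starting a second identical instance triggers a
        // rebalance, after which each of the 4 threads owns one task.
        props.put(StreamsConfig.NUM_STREAM_THREADS_CONFIG, 2);
        return props;
    }
}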
If there is e.g. num.stream.threads > 1, how are tasks assigned to the threads prepared and assigned (in the loop)?
Tasks are assigned to threads through a partition grouper. You can read about it here. AFAIK it's called after a rebalance, so it's not a very dynamic process. That said, I'd argue that there is no risk of starvation.
I need to use task_scheduler_init to limit the number of threads to a number below the core count; however, TBB ignores the number and always uses the number of cores (8 in this case).
This doesn't look like normal behavior to me. Please note that using a different version of TBB is not an option for me.
Snippet:
#include <tbb/task_scheduler_init.h>
#include <tbb/parallel_for.h>
#include <tbb/blocked_range.h>
#include <sys/syscall.h>
#include <unistd.h>

// nb_thread, size, dragon, dragon_width, dragon_height and limits
// are defined elsewhere in the application.
tbb::task_scheduler_init scheduler(nb_thread);
tbb::parallel_for(
    tbb::blocked_range<size_t>(0, size),
    [&](const tbb::blocked_range<size_t>& subrange) {
        int tid = syscall(SYS_gettid); // OS id of the worker thread
        dragon_draw_raw(
            subrange.begin(),
            subrange.end(),
            dragon,
            dragon_width,
            dragon_height,
            limits,
            tid
        );
    });
If the task_scheduler_init instance was not the first call to TBB on this thread, that explains the behavior you see. [Almost] any call to TBB triggers auto-initialization, and the number of worker threads is fixed at that time; any further call to task_scheduler_init means basically nothing except an additional reference to the internal task scheduler object.
If you are in control of the entire application and can make the task_scheduler_init constructor the first call to TBB, that will fix your problem. But if you are writing a component or a library, look at the task_arena feature, which allows you to control concurrency limits regardless of the current settings.
I've been struggling with this issue for two days. We have a solution where multiple worker threads try to select job requests from a single database table by setting a flag on the selected requests, thus effectively blocking the other workers from selecting the same requests.
I created a Java test application to test my queries. While in normal situations the test executes without issue, in high-contention situations (e.g. 1 table entry with 50 threads; no delays or processing) I still have threads which obtain the same request/entry; interestingly, it happens right when the test starts. I cannot understand why. I've read all the relevant Postgres locking and isolation documentation... While it's possible that the issue is with the test application itself, I suspect that I'm missing something about how SELECT FOR UPDATE works in a READ COMMITTED isolation context.
So the question would be: can SELECT FOR UPDATE (with READ COMMITTED isolation) guarantee that a general concurrency issue like I described can be safely solved?
Acquire query:
UPDATE mytable SET status = 'LOCK'
WHERE ctid IN (SELECT ctid FROM mytable
WHERE status = 'FREE'
ORDER BY id
LIMIT %d
FOR UPDATE)
RETURNING id, status;
Release query:
UPDATE mytable SET status = 'FREE'
WHERE id = %d AND status = 'LOCK'
RETURNING id, status;
So would you consider these two queries safe, or is there some weird case possible that would allow two threads to acquire the same row? I'd like to mention that I also tried SERIALIZABLE isolation and it didn't help.
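For context, this is roughly how each worker thread in my Java test runs the Acquire query over JDBC (a simplified sketch; the class name, helper method, and connection handling are illustrative):

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class Worker {
    // Runs the Acquire query on an autocommit connection at the default
    // READ COMMITTED isolation level and prints the rows it locked.
    static void acquire(Connection conn, int limit) throws Exception {
        String sql =
            "UPDATE mytable SET status = 'LOCK' " +
            "WHERE ctid IN (SELECT ctid FROM mytable " +
            "               WHERE status = 'FREE' ORDER BY id LIMIT ? FOR UPDATE) " +
            "RETURNING id, status";
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setInt(1, limit);
            try (ResultSet rs = ps.executeQuery()) { // RETURNING yields a result set
                while (rs.next()) {
                    System.out.println("acquired id=" + rs.getLong("id"));
                }
            }
        }
    }
}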
It turns out (how could it be otherwise?) that I made a mistake in my test. I didn't respect the resource acquire/release order. The test was registering the release (decrementing a counter) after the Release query, which let another thread acquire the resource and register it in the meantime. An error from the category you know how to solve but cannot see even after looking several times, because you wrote the code yourself... Peer review helped in the end.
I suppose at this time I have a test to prove that:
The above two queries are safe
You don't need SERIALIZABLE isolation to solve problems with DB acquire/release as long as you use row locking as in SELECT... FOR UPDATE
You must ORDER (BY) results when using row locking (even if you use LIMIT 1!), otherwise you end up with deadlocks
It is safe to acquire multiple resources with one query (LIMIT 2 and above)
Using ctid in the query is safe; it's actually a little faster, but this is insignificant in real world applications
I'm not sure if this will be helpful to others, but I was getting desperate. So all is good with Postgres 9.3 :)
Another aspect I'd like to share regards the speed of the Acquire query with LIMIT 2. See the test result:
Starting test...
DB setup done
All threads created & connections made
All threads started
Thread[36] 186/190/624=1000
Thread[19] 184/201/615=1000
Thread[12] 230/211/559=1000
Thread[46] 175/200/625=1000
Thread[ 9] 205/211/584=1000
...
Thread[ 4] 189/232/579=1000
Thread[ 3] 185/198/617=1000
Thread[49] 218/204/578=1000
Thread[ 1] 204/203/593=1000
...
Thread[37] 177/163/660=1000
Thread[31] 168/199/633=1000
Thread[18] 174/187/639=1000
Thread[42] 178/229/593=1000
Thread[29] 201/229/570=1000
...
Thread[10] 203/198/599=1000
Thread[25] 215/210/575=1000
Thread[27] 248/191/561=1000
...
Thread[17] 311/192/497=1000
Thread[ 8] 365/198/437=1000
Thread[15] 389/176/435=1000
All threads finished
Execution time: 31408
Test done; exiting
Compare the above with this query:
UPDATE mytable SET status = 'LOCK'
WHERE id IN (SELECT t1.id FROM (SELECT id FROM mytable
WHERE status = 'FREE' ORDER BY id LIMIT 2) AS t1
FOR UPDATE)
RETURNING id, status;
and the result:
Starting test...
DB setup done
All threads created & connections made
All threads started
Thread[29] 32/121/847=1000
Thread[22] 61/151/788=1000
Thread[46] 36/114/850=1000
Thread[41] 57/132/811=1000
Thread[24] 49/146/805=1000
Thread[13] 47/135/818=1000
...
Thread[20] 48/118/834=1000
Thread[47] 65/152/783=1000
Thread[18] 51/146/803=1000
Thread[ 8] 69/158/773=1000
Thread[14] 56/158/786=1000
Thread[ 0] 66/161/773=1000
Thread[38] 60/148/792=1000
Thread[27] 69/158/773=1000
...
Thread[45] 78/177/745=1000
Thread[30] 96/162/742=1000
...
Thread[32] 162/167/671=1000
Thread[17] 329/156/515=1000
Thread[33] 337/178/485=1000
Thread[37] 381/172/447=1000
All threads finished
Execution time: 15490
Test done; exiting
Conclusion
The test prints, for each thread, how many times the Acquire query returned 2, 1, or 0 resources, totalling the number of test loops (1000).
From the above results we can conclude that we can speed up the query (halving the time!) at the cost of increased thread contention. This means the Acquire query will return 0 resources more often. Technically this is not a problem, because we need to handle that situation anyway.
Of course, the situation changes if you add a wait time (sleeping) when no resources are returned, but choosing a correct value for the wait time depends on the application's performance requirements...
I've got some code that tries to create 100 threaded HTTP calls. It seems to be getting capped at about 40.
When I do threadJoin I'm only getting 38-40 sets of results from my HTTP calls, despite the loop running from 1 to 100.
// thread http calls
pages = 100;
for (page="1";page <= pages; page++) {
thread name="req#page#" {
grabber.setURL('http://site.com/search.htm');
// request headers
grabber.addParam(type="url",name="page",value="#page#");
results = grabber.send().getPrefix();
arrayAppend(VARIABLES.arrResults,results.fileContent);
}
}
// rejoin threads
for (page="2";page <= pages; page++) {
threadJoin('req#page#',10000);
}
Is there a limit to the number of threads that CF can create? Is it to do with Java running in the background? Or can it not handle that many http requests?
Is there just a much better way for me to do this than threaded HTTP calls?
The result you're seeing is likely because your variables aren't thread safe.
grabber.addParam(type="url",name="page",value="#page#");
That line is accessing Variables.Page, which is shared by all of the spawned threads. Because threads start at different times, the value of page is often different from the value you think it is. This will lead to multiple threads having the same value for page.
Instead, if you pass page as an attribute to the thread, then each thread will have its own copy of the variable, and you will end up with 100 unique values (1-100).
Additionally, you're writing to a shared variable.
arrayAppend(VARIABLES.arrResults,results.fileContent);
ArrayAppend is not thread safe, and you will be overwriting versions of VARIABLES.arrResults with other versions of itself instead of appending each element.
You want to set the result to a thread variable, and then access that once the joins are complete.
thread name="req#page#" page=Variables.page {
    grabber.setURL('http://site.com/search.htm');
    // request headers
    grabber.addParam(type="url",name="page",value="#Attributes.page#");
    results = grabber.send().getPrefix();
    thread.Result = results.fileContent;
}
And the join:
// rejoin threads
for (page="1";page <= pages; page++) {
    threadJoin('req#page#',10000);
    arrayAppend(VARIABLES.arrResults, CFThread['req#page#'].Result);
}
In the ColdFusion Administrator, there's a setting for how many threads will run concurrently; mine defaulted to 10. The rest apparently are queued. As Phantom42 mentions, you can up the number of running CF threads; however, with 100 or more threads you may run into other problems.
In a 32-bit process, your whole process can only use 2 GB of memory. Each thread uses an amount of stack memory, which isn't part of the heap. We've had problems with running out of memory with high numbers of threads, as Java binary + heap + non-heap (PermGen) + (threads × 512 KB) can easily go over the 2 GB limit; 100 threads alone claim about 50 MB of address space before any heap allocation.
You'd also have to allow enough threads to handle your code above, as well as other requests coming into your app, which may bog down the app as a whole.
I would suggest changing your code to create N threads, each of which does more than one request. It's more work, but you break the N requests = N threads coupling. There are a couple of approaches you can take:
If you think that each request is going to take roughly the same time, then you can split up the work and give each thread a portion to work on before you start each one up.
Or each thread picks a URL off a list and processes it, and you then join all N threads. You'd need to put locking around whatever counter you use to track progress, though; see the sketch below.
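Since ColdFusion runs on the JVM, here's the shape of that second approach sketched in Java: a fixed pool of N workers draining a shared thread-safe queue of URLs, so the queue itself plays the role of the locked counter (the names and the fetch placeholder are illustrative):

import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class UrlWorkerPool {
    public static void main(String[] args) throws InterruptedException {
        ConcurrentLinkedQueue<String> urls = new ConcurrentLinkedQueue<>();
        for (int page = 1; page <= 100; page++) {
            urls.add("http://site.com/search.htm?page=" + page);
        }

        int nThreads = 10; // fixed worker count, independent of the request count
        ExecutorService pool = Executors.newFixedThreadPool(nThreads);
        for (int i = 0; i < nThreads; i++) {
            pool.submit(() -> {
                String url;
                // poll() hands each URL to exactly one worker,
                // so no request is fetched twice or skipped.
                while ((url = urls.poll()) != null) {
                    fetch(url);
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(60, TimeUnit.SECONDS); // the "join" step
    }

    private static void fetch(String url) {
        System.out.println("fetching " + url); // placeholder for the real HTTP call
    }
}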
Check your Maximum number of running JRun threads setting in ColdFusion Administrator under the Request Tuning tab. The default is 50.