LongAdder Striped64 wasUncontended implementation detail - multithreading

This is not a question about how LongAdder works; it's about an intriguing implementation detail that I can't figure out.
Here is the code from Striped64 (I've cut out some parts and left the relevant parts for the question):
final void longAccumulate(long x, LongBinaryOperator fn,
                          boolean wasUncontended) {
    int h;
    if ((h = getProbe()) == 0) {
        ThreadLocalRandom.current(); // force initialization
        h = getProbe();
        wasUncontended = true;
    }
    boolean collide = false;             // True if last slot nonempty
    for (;;) {
        Cell[] as; Cell a; int n; long v;
        if ((as = cells) != null && (n = as.length) > 0) {
            if ((a = as[(n - 1) & h]) == null) {
                // logic to insert the Cell in the array
            }
            // CAS already known to fail
            else if (!wasUncontended) {
                wasUncontended = true;   // Continue after rehash
            }
            else if (a.cas(v = a.value, ((fn == null) ? v + x : fn.applyAsLong(v, x)))) {
                break;
            }
            // ... (remaining branches of the loop omitted)
A lot of things in this code are clear to me, except for this part:
// CAS already known to fail
else if (!wasUncontended) {
    wasUncontended = true; // Continue after rehash
}
Where does the certainty come from that the following CAS will fail?
This is really confusing, for me at least, because this check only makes sense in a single case: when some thread enters the longAccumulate method for the n-th time (n > 1) and the busy spin is at its first cycle.
It's like this code is saying: if you (some thread) have been here before and you had contention on a particular Cell slot, don't try to CAS your value onto the existing one, but instead rehash the probe.
I honestly hope this makes some sense to someone.

It's not that it will fail; it's that it has already failed. The call to this method is made by the LongAdder add method.
public void add(long x) {
    Cell[] as; long b, v; int m; Cell a;
    if ((as = cells) != null || !casBase(b = base, b + x)) {
        boolean uncontended = true;
        if (as == null || (m = as.length - 1) < 0 ||
            (a = as[getProbe() & m]) == null ||
            !(uncontended = a.cas(v = a.value, v + x)))
            longAccumulate(x, null, uncontended);
    }
}
The first set of conditionals is related to the existence of the Cells. If the necessary cell doesn't exist, it tries to accumulate uncontended (as there has been no attempt to add yet) by atomically adding the necessary cell and then adding.
If the cell does exist, try to add (v + x). If the add failed, then there was some form of contention; in that case, try to do the accumulating optimistically/atomically (spin until successful).
So why does it have
wasUncontended = true; // Continue after rehash
My best guess is that with heavy contention, it will try to give the running thread time to catch up and will force a retry of the existing cells.
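For reference, in the full Striped64 source (not shown in the snippet above) the bottom of this retry loop calls h = advanceProbe(h), so after the wasUncontended = true pass the thread really does retry against a different cell. A minimal sketch of that xorshift rehash, with the write-back to the Thread's probe field omitted:

// Simplified sketch of the probe rehash: one xorshift step, so that the next
// (n - 1) & h index points the thread at a different Cell slot on the retry.
static int advanceProbeSketch(int probe) {
    probe ^= probe << 13;   // xorshift
    probe ^= probe >>> 17;
    probe ^= probe << 5;
    return probe;
}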

Is this Peterson's solution correct for N threads?

bool lock[N];
int turn = 0;
int offset = 0;
int M = N - 1;
int pidToN(int pid); // returns a unique number between (0, N-1) for a given pid; mapping pids

void critical()
{
    int pidn = pidToN(getpid());
    lock[pidn] = true;
    turn = M - pidn;
    if (turn == pidn)
    {
        val = 1;
        turn += val % N;
    }
    else
        val = 0;
    while (lock(M - pidn + val) && turn == (M - pidn + val) && lock(M - pidn - val) && turn == (M - pidn - val));
    // critical section
    lock[pidn] = false;
}
Does this implementation work? Essentially thread[i] tries to pass to thread[N-1-i] and vice versa. If i = N/2 (the thread in the middle, if it exists, which passes to itself), then I increment it by a certain val (1 in this case), and it then waits.
I couldn't come up with any race conditions.
Any help would be appreciated.
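For anyone who wants to experiment with interleavings, here is a direct Java transcription of the pseudocode above (my reading only: lock(...) inside the while is taken as lock[...], val as a local int, and an AtomicIntegerArray stands in for bool lock[N] so the flags are visible across threads). It reproduces the algorithm exactly as posted and does not claim it is correct:

import java.util.concurrent.atomic.AtomicIntegerArray;

// Direct transcription of the pseudocode above, for experimentation only.
final class NThreadPeterson {
    private final int n;                    // N
    private final int m;                    // M = N - 1
    private final AtomicIntegerArray lock;  // 1 = true, 0 = false
    private volatile int turn = 0;

    NThreadPeterson(int n) {
        this.n = n;
        this.m = n - 1;
        this.lock = new AtomicIntegerArray(n);
    }

    void enter(int pidn) {                  // pidn = pidToN(getpid())
        lock.set(pidn, 1);
        turn = m - pidn;
        int val = 0;
        if (turn == pidn) {                 // the middle thread passes to itself
            val = 1;
            turn += val % n;
        }
        while (lock.get(m - pidn + val) == 1 && turn == (m - pidn + val)
            && lock.get(m - pidn - val) == 1 && turn == (m - pidn - val)) {
            // busy wait
        }
        // critical section follows
    }

    void exit(int pidn) {
        lock.set(pidn, 0);
    }
}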

Why is code analysis warning "Using logical && when bitwise & was probably intended" being raised?

Code:
BOOL CCreateReportDlg::CanSwapBrothers()
{
    BOOL b1in2 = FALSE, b2in1 = FALSE;
    CStringArray aryStrNames;

    // Must have valid data
    if (!IsSwapBrotherInit())
        return FALSE;

    // Get cell pointers
    auto pCell1 = GetSwapBrotherCell(1);
    auto pCell2 = GetSwapBrotherCell(2);
    if (pCell1 != nullptr && pCell2 != nullptr)
    {
        // Look for brother (cell 1) in cell 2 array
        auto strName = pCell1->GetText();
        pCell2->GetOptions(aryStrNames);
        const auto iNumNames = aryStrNames.GetSize();
        for (auto iName = 0; iName < iNumNames; iName++)
        {
            if (aryStrNames[iName] == strName)
            {
                b1in2 = TRUE;
                break;
            }
        }

        if (b1in2)
        {
            // Look for brother (cell 2) in cell 1 array
            auto strName = pCell2->GetText();
            pCell1->GetOptions(aryStrNames);
            const auto iNumNames = aryStrNames.GetSize();
            for (auto iName = 0; iName < iNumNames; iName++)
            {
                if (aryStrNames[iName] == strName)
                {
                    b2in1 = TRUE;
                    break;
                }
            }
        }
    }
    return b1in2 && b2in1;
}
The line of interest is the return statement:
return b1in2 && b2in1;
I am getting a code analysis warning:
lnt-logical-bitwise-mismatch
Using logical && when bitwise & was probably intended.
As far as I am concerned my code is correct. Why is this being raised?
The compiler sees && applied to integer operands and an implicit conversion of the result to an integer type. BOOL has multiple bits; it isn't the same as the built-in type bool.
As noted in the page you linked to, "A logical operator was used with integer values" will cause this warning, and that condition is certainly present here.
"MFC" coding styles violate modern recommendations in a lot of ways; using a non-standard boolean type is just one of the smaller issues. CStringArray is also a code smell: modern C++ uses templated containers and has powerful algorithms for manipulating them, so you should never be writing search code yourself.

Optimal algorithm for this string decompression

I have been working on an exercise from Google's dev tech guide. It is called Compression and Decompression; you can check the following link for the description of the problem: Challenge Description.
Here is my code for the solution:
public static String decompressV2(String string, int start, int times) {
    String result = "";
    for (int i = 0; i < times; i++) {
        inner:
        {
            for (int j = start; j < string.length(); j++) {
                if (isNumeric(string.substring(j, j + 1))) {
                    String num = string.substring(j, j + 1);
                    int times2 = Integer.parseInt(num);
                    String temp = decompressV2(string, j + 2, times2);
                    result = result + temp;
                    int next_j = find_next(string, j + 2);
                    j = next_j;
                    continue;
                }
                if (string.substring(j, j + 1).equals("]")) { // if it's a closing bracket
                    break inner;
                }
                result = result + string.substring(j, j + 1);
            }
        }
    }
    return result;
}

public static int find_next(String string, int start) {
    int count = 0;
    for (int i = start; i < string.length(); i++) {
        if (string.substring(i, i + 1).equals("[")) {
            count = count + 1;
        }
        if (string.substring(i, i + 1).equals("]") && count > 0) {
            count = count - 1;
            continue;
        }
        if (string.substring(i, i + 1).equals("]") && count == 0) {
            return i;
        }
    }
    return -111111;
}
I will explain a little bit about the inner workings of my approach. It is a basic solution that involves simple recursion and loops.
So, let's start from the beginning with a simple decompression:
DevTech.decompressV2("2[3[a]b]", 0, 1);
As you can see, the 0 indicates that it has to start iterating over the string at index 0, and the 1 indicates that the string has to be evaluated only once: 1[ 2[3[a]b] ]
The core idea is that every time you encounter a number you call the algorithm again (recursively) and then continue where the string inside its brackets ends; that's what the find_next function is for.
When it finds a closing bracket, the inner loop breaks; that's the way I chose to signal the stop.
I think that is the main idea behind the algorithm; if you read the code closely you'll get the full picture.
So here are some of my concerns about the way I've written the solution:
I could not find a cleaner way to tell the algorithm where to go next when it finds a number, so I kind of hardcoded it with the find_next function. Is there a cleaner way to do this inside the decompress function?
About performance: it wastes a lot of time doing the same work again when there is a number bigger than 1 at the beginning of a bracket.
I am relatively new to programming, so maybe this code also needs improvement not in the idea but in the way it's written. I would be very grateful for suggestions.
This is the approach I figured out, but I am sure there are a couple more; I could not think of any, so it would be great if you could share your ideas.
The description tells you some things you should be aware of when developing the solution: handling non-repeated strings, handling repetitions inside, not doing the same job twice, and not copying too much. Are these covered by my approach?
The last point is about test cases. I know that confidence is very important when developing solutions, and the best way to give confidence to an algorithm is test cases. I tried a few and they all worked as expected. But what techniques do you recommend for developing test cases? Is there any software for it?
That would be all. I am new to the community, so I am open to suggestions on how to improve the quality of the question. Cheers!
Your solution involves a lot of string copying that really slows it down. Instead of returning strings that you concatenate, you should pass a StringBuilder into every call and append substrings onto that.
That means you can use your return value to indicate the position to continue scanning from.
You're also parsing repeated parts of the source string more than once.
My solution looks like this:
public static String decompress(String src)
{
    StringBuilder dest = new StringBuilder();
    _decomp2(dest, src, 0);
    return dest.toString();
}

private static int _decomp2(StringBuilder dest, String src, int pos)
{
    int num = 0;
    while (pos < src.length()) {
        char c = src.charAt(pos++);
        if (c == ']') {
            break;
        }
        if (c >= '0' && c <= '9') {
            num = num * 10 + (c - '0');
        } else if (c == '[') {
            int startlen = dest.length();
            pos = _decomp2(dest, src, pos);
            if (num < 1) {
                // 0 repetitions -- delete it
                dest.setLength(startlen);
            } else {
                // copy output num-1 times
                int copyEnd = startlen + (num - 1) * (dest.length() - startlen);
                for (int i = startlen; i < copyEnd; ++i) {
                    dest.append(dest.charAt(i));
                }
            }
            num = 0;
        } else {
            // regular char
            dest.append(c);
            num = 0;
        }
    }
    return pos;
}
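For a quick sanity check, here is a small (hypothetical) driver using the same inputs as the JavaScript tests further down; the output each call produces is shown in the comments:

public static void main(String[] args) {
    System.out.println(decompress("2[3[a]b]"));      // aaabaaab
    System.out.println(decompress("10[a]"));         // aaaaaaaaaa
    System.out.println(decompress("3[abc]4[ab]c"));  // abcabcabcababababc
    System.out.println(decompress("2[2[a]g2[r]]"));  // aagrraagrr
}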
I would try to return a tuple that also contains the next index where decompression should continue from. Then we can have a recursion that concatenates the current part with the rest of the block in the current recursion depth.
Here's JavaScript code. It takes some thought to encapsulate the order of operations that reflects the rules.
function f(s, i=0){
  if (i == s.length)
    return ['', i];

  // We might start with a multiplier
  let m = '';
  while (!isNaN(s[i]))
    m = m + s[i++];

  // If we have a multiplier, we'll
  // also have a nested expression
  if (s[i] == '['){
    let result = '';
    const [word, nextIdx] = f(s, i + 1);
    for (let j=0; j<Number(m); j++)
      result = result + word;
    const [rest, end] = f(s, nextIdx);
    return [result + rest, end]
  }

  // Otherwise, we may have a word,
  let word = '';
  while (isNaN(s[i]) && s[i] != ']' && i < s.length)
    word = word + s[i++];

  // followed by either the end of an expression
  // or another multiplier
  const [rest, end] = s[i] == ']' ? ['', i + 1] : f(s, i);
  return [word + rest, end];
}

var strs = [
  '2[3[a]b]',
  '10[a]',
  '3[abc]4[ab]c',
  '2[2[a]g2[r]]'
];

for (const s of strs){
  console.log(s);
  console.log(JSON.stringify(f(s)));
  console.log('');
}

Inverse the input Expression using stacks

I had a coding challenge as part of the recruitment process for a company. In that coding challenge, one of the questions was to inverse an expression.
For Example,
Input : 14-3*2/5
Output : 5/2*3-14
I used a stack to push each number (say 14 or 3) and each operator, and then popped them out again to form the output.
The input format is: num op num op num op num
So we need not worry about the input being -2.
num can be between -10^16 and 10^16. I was dealing entirely with strings, so even if a number exceeds the 10^16 limit, my algorithm wouldn't have a problem.
My algorithm passed 7 test cases and failed 2 of them.
I couldn't figure out what the corner case would be, and I couldn't see the test cases either. Any idea what it might be? I know there isn't enough information, but unfortunately I don't have it either.
// Complete the reverse function below.
static String reverse(String expression) {
expression = expression.trim();
if(expression == ""){
return "";
}
Stack<String> stack = new Stack<String>();
String num = "";
for(int i=0; i<expression.length(); i++){
char c = expression.charAt(i);
if(c==' '){
continue;
}
if(c == '+' || c == '-' || c == '*' || c == '/'){
if(num != "") {
stack.push(num);
}
num = "";
stack.push(Character.toString(c));
} else{
num += c;
}
}
if(num != "") {
stack.push(num);
}
String revExp = "";
while(! stack.empty()){
revExp = revExp + stack.pop();
}
return revExp;
}
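For what it's worth, a minimal (hypothetical) driver for the example from the problem statement, reusing the reverse method above:

public static void main(String[] args) {
    System.out.println(reverse("14-3*2/5"));   // prints 5/2*3-14
}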

How to model a queue in Promela?

Ok, so I'm trying to model a CLH-RW lock in Promela.
The way the lock works is simple, really:
The queue consists of a tail, to which both readers and writers enqueue a node containing a single bool succ_must_wait. They do so by creating a new node and CAS-ing it with the tail.
The tail thereby becomes the node's predecessor, pred.
Then they spin-wait on pred.succ_must_wait until it is false.
Readers first increment a reader counter ncritR and then set their own flag to false, allowing multiple readers in the critical section at the same time. Releasing a read lock simply means decrementing ncritR again.
Writers wait until ncritR reaches zero, then enter the critical section. They do not set their flag to false until the lock is released.
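For reference, here is a rough Java sketch of the lock as I read the description above (an illustration only, not the code being modelled; getAndSet stands in for the CAS loop on the tail):

import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicReference;

final class ClhRwLockSketch {
    static final class Node {
        volatile boolean succMustWait = true;   // successors spin on this flag
    }

    private final AtomicReference<Node> tail;
    private final AtomicInteger ncritR = new AtomicInteger();

    ClhRwLockSketch() {
        Node sentinel = new Node();
        sentinel.succMustWait = false;          // the sentinel never blocks anyone
        tail = new AtomicReference<>(sentinel);
    }

    private Node enqueue() {
        Node me = new Node();                   // a fresh node per acquisition
        Node pred = tail.getAndSet(me);         // swap onto the tail
        while (pred.succMustWait) { /* spin on the predecessor */ }
        return me;
    }

    void readLock() {
        Node me = enqueue();
        ncritR.incrementAndGet();               // register as a reader first...
        me.succMustWait = false;                // ...then immediately release the successor
    }

    void readUnlock() {
        ncritR.decrementAndGet();
    }

    Node writeLock() {
        Node me = enqueue();
        while (ncritR.get() != 0) { /* wait for readers to drain */ }
        return me;                              // flag stays set until writeUnlock
    }

    void writeUnlock(Node me) {
        me.succMustWait = false;
    }
}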
I'm kind of struggling to model this in Promela, though.
My current attempt (see below) tries to make use of arrays, where each node basically consists of a number of array entries.
This fails because let's say A enqueues itself, then B enqueues itself. Then the queue will look like this:
S <- A <- B
Where S is a sentinel node.
The problem now is that when A runs to completion and re-enqueues, the queue will look like
S <- A <- B <- A'
In actual execution, this is absolutely fine because A and A' are distinct node objects. And since A.succ_must_wait will have been set to false when A first released the lock, B will eventually make progress, and therefore A' will eventually make progress.
What happens in the array-based promela model below, though, is that A and A' occupy the same array positions, causing B to miss the fact that A has released the lock, thereby creating a deadlock where B is (wrongly) waiting for A' instead of A and A' is waiting (correctly) for B.
A possible "solution" to this could be to have A wait until B acknowledges the release. But that would not be true to how the lock works.
Another "solution" would be to wait for a CHANGE in pred.succ_must_wait, where a release would increment succ_must_wait, rather than reset it to 0.
But I'm intending to model a version of the lock, where pred may change (i.e. where a node may be allowed to disregard some of its predecessors), and I'm not entirely convinced something like the increasing version wouldn't cause an issue with this change.
So what's the "smartest" way to model an implicit queue like this in promela?
/* CLH-RW Lock */
/* pid: 0 = init, 1-2 = reader, 3-4 = writer */
ltl liveness {
    ([]<> reader[1]#progress_reader)
    && ([]<> reader[2]#progress_reader)
    && ([]<> writer[3]#progress_writer)
    && ([]<> writer[4]#progress_writer)
}

bool initialised = 0;
byte ncritR;
byte ncritW;
byte tail;
bool succ_must_wait[5]
byte pred[5]

init {
    assert(_pid == 0);
    ncritR = 0;
    ncritW = 0;
    /* sentinel node */
    tail = 0;
    pred[0] = 0;
    succ_must_wait[0] = 0;
    initialised = 1;
}

active [2] proctype reader()
{
    assert(_pid >= 1);
    (initialised == 1)
    do
    :: else ->
        succ_must_wait[_pid] = 1;
        atomic {
            pred[_pid] = tail;
            tail = _pid;
        }
        (succ_must_wait[pred[_pid]] == 0)
        ncritR++;
        succ_must_wait[_pid] = 0;
        atomic {
            /* freeing previous node for garbage collection */
            pred[_pid] = 0;
        }
        /* CRITICAL SECTION */
progress_reader:
        assert(ncritR >= 1);
        assert(ncritW == 0);
        ncritR--;
        atomic {
            /* necessary to model the fact that the next access creates a new queue node */
            if
            :: tail == _pid -> tail = 0;
            :: else ->
            fi
        }
    od
}

active [2] proctype writer()
{
    assert(_pid >= 1);
    (initialised == 1)
    do
    :: else ->
        succ_must_wait[_pid] = 1;
        atomic {
            pred[_pid] = tail;
            tail = _pid;
        }
        (succ_must_wait[pred[_pid]] == 0)
        (ncritR == 0)
        atomic {
            /* freeing previous node for garbage collection */
            pred[_pid] = 0;
        }
        ncritW++;
        /* CRITICAL SECTION */
progress_writer:
        assert(ncritR == 0);
        assert(ncritW == 1);
        ncritW--;
        succ_must_wait[_pid] = 0;
        atomic {
            /* necessary to model the fact that the next access creates a new queue node */
            if
            :: tail == _pid -> tail = 0;
            :: else ->
            fi
        }
    od
}
First of all, a few notes:
You don't need to initialize your variables to 0, since "the default initial value of all variables is zero" (see the docs).
You don't need to enclose a single instruction inside an atomic {} statement, since any elementary statement is executed atomically. For better efficiency of the verification process, whenever possible, you should use d_step {} instead. Here you can find a related Stack Overflow Q/A on the topic.
init {} is guaranteed to have _pid == 0 when one of the two following conditions holds:
no active proctype is declared
init {} is declared before any other active proctype appearing in the source code
Active processes, including init {}, are spawned in order of appearance inside the source code. All other processes are spawned in order of appearance of the corresponding run ... statement.
I identified the following issues in your model:
the instruction pred[_pid] = 0 is useless because that memory location is only read after the assignment pred[_pid] = tail
When you release the successor of a node, you only set succ_must_wait[_pid] to 0 and you don't invalidate the node instance that your successor is waiting on. This is the problem that you identified in your question but were unable to solve. The solution I propose is to add the following code:
pid j;
for (j: 1..4) {
    if
    :: pred[j] == _pid -> pred[j] = 0;
    :: else -> skip;
    fi
}
This should be enclosed in an atomic {} block.
You correctly set tail back to 0 when you find that the node that has just left the critical section is also the last node in the queue, and you correctly enclose this operation in an atomic {} block. However, it may happen that, just before you enter this atomic {} block, some other process that was still waiting in some idle state decides to execute its initial atomic block and copies the current value of tail (which corresponds to the node that has just expired) into its own pred[_pid] memory location. If the node that has just exited the critical section now attempts to join it once again, setting its own succ_must_wait[_pid] to 1, you get another instance of circular wait among processes. The correct approach is to merge this part with the code releasing the successor.
The following inline function can be used to release the successor of a given node:
inline release_succ(i)
{
    d_step {
        pid j;
        for (j: 1..4) {
            if
            :: pred[j] == i ->
                pred[j] = 0;
            :: else ->
                skip;
            fi
        }
        succ_must_wait[i] = 0;
        if
        :: tail == _pid -> tail = 0;
        :: else -> skip;
        fi
    }
}
The complete model, follows:
byte ncritR;
byte ncritW;
byte tail;
bool succ_must_wait[5];
byte pred[5];

init
{
    skip
}

inline release_succ(i)
{
    d_step {
        pid j;
        for (j: 1..4) {
            if
            :: pred[j] == i ->
                pred[j] = 0;
            :: else ->
                skip;
            fi
        }
        succ_must_wait[i] = 0;
        if
        :: tail == _pid -> tail = 0;
        :: else -> skip;
        fi
    }
}

active [2] proctype reader()
{
loop:
    succ_must_wait[_pid] = 1;
    d_step {
        pred[_pid] = tail;
        tail = _pid;
    }
trying:
    (succ_must_wait[pred[_pid]] == 0)
    ncritR++;
    release_succ(_pid);
    // critical section
progress_reader:
    assert(ncritR > 0);
    assert(ncritW == 0);
    ncritR--;
    goto loop;
}

active [2] proctype writer()
{
loop:
    succ_must_wait[_pid] = 1;
    d_step {
        pred[_pid] = tail;
        tail = _pid;
    }
trying:
    (succ_must_wait[pred[_pid]] == 0) && (ncritR == 0)
    ncritW++;
    // critical section
progress_writer:
    assert(ncritR == 0);
    assert(ncritW == 1);
    ncritW--;
    release_succ(_pid);
    goto loop;
}
I added the following properties to the model:
p0: the writer with _pid equal to 4 goes through its progress state infinitely often, provided that it is given the chance to execute some instruction infinitely often:
ltl p0 {
    ([]<> (_last == 4)) ->
    ([]<> writer[4]#progress_writer)
};
This property should be true.
p1: there is never more than one reader in the critical section:
ltl p1 {
    ([] (ncritR <= 1))
};
Obviously, we expect this property to be false in a model that matches your specification.
p2: there is never more than one writer in the critical section:
ltl p2 {
    ([] (ncritW <= 1))
};
This property should be true.
p3: there isn't any node that is the predecessor of two other nodes at the same time, unless such node is node 0:
ltl p3 {
    [] (
        (((pred[1] != 0) && (pred[2] != 0)) -> (pred[1] != pred[2])) &&
        (((pred[1] != 0) && (pred[3] != 0)) -> (pred[1] != pred[3])) &&
        (((pred[1] != 0) && (pred[4] != 0)) -> (pred[1] != pred[4])) &&
        (((pred[2] != 0) && (pred[3] != 0)) -> (pred[2] != pred[3])) &&
        (((pred[2] != 0) && (pred[4] != 0)) -> (pred[2] != pred[4])) &&
        (((pred[3] != 0) && (pred[4] != 0)) -> (pred[3] != pred[4]))
    )
};
This property should be true.
p4: it is always true that whenever writer with _pid equal to 4 tries to access the critical section then it will eventually get there:
ltl p4 {
    [] (writer[4]#trying -> <> writer[4]#progress_writer)
};
This property should be true.
The outcome of the verification matches our expectations:
~$ spin -search -ltl p0 -a clhrw_lock.pml
...
Full statespace search for:
never claim + (p0)
assertion violations + (if within scope of claim)
acceptance cycles + (fairness disabled)
invalid end states - (disabled by never claim)
State-vector 68 byte, depth reached 3305, errors: 0
...
~$ spin -search -ltl p1 -a clhrw_lock.pml
...
Full statespace search for:
never claim + (p1)
assertion violations + (if within scope of claim)
acceptance cycles + (fairness disabled)
invalid end states - (disabled by never claim)
State-vector 68 byte, depth reached 1692, errors: 1
...
~$ spin -search -ltl p2 -a clhrw_lock.pml
...
Full statespace search for:
never claim + (p2)
assertion violations + (if within scope of claim)
acceptance cycles + (fairness disabled)
invalid end states - (disabled by never claim)
State-vector 68 byte, depth reached 3115, errors: 0
...
~$ spin -search -ltl p3 -a clhrw_lock.pml
...
Full statespace search for:
never claim + (p3)
assertion violations + (if within scope of claim)
acceptance cycles + (fairness disabled)
invalid end states - (disabled by never claim)
State-vector 68 byte, depth reached 3115, errors: 0
...
~$ spin -search -ltl p4 -a clhrw_lock.pml
...
Full statespace search for:
never claim + (p4)
assertion violations + (if within scope of claim)
acceptance cycles + (fairness disabled)
invalid end states - (disabled by never claim)
State-vector 68 byte, depth reached 3115, errors: 0
...
