why does it use "ErrCounter >= limit" in the condition? - uml

In the state machine diagram, I don't understand why the condition is ErrCounter >= limit. I think it would be better to write ErrCounter == limit.

ErrCounter >= limit is stronger than ErrCounter == limit. You have a gain with no risk.
This is to be on the safe side. The problem is that something else might also increment the ErrCounter while in one of the states (or even during a transition), or the ErrCounter might already equal the limit when the process starts (BTW this should lead to rejection anyway, but never mind).
Let's make it a real-life example. Imagine these two scenarios (say limit = 3):
The card holder has already tried three times at some other point (e.g. in a shop), failing to enter the correct PIN. Now ErrCounter = 3. The card holder decides to give it another try at the ATM. The ATM reads the ErrCounter (as part of Authentication), and as CheckPin fails (automatically, due to too many earlier tries) the ErrCounter is incremented again (so ErrCounter = 4). With the weak condition (==) the counter is now past the limit, the check never fires, and you can try again and again in an infinite loop.
The card is duplicated (these days it can be emulated through any NFC phone, for example). Imagine two people want to withdraw a large amount, so they work simultaneously on two ATMs. The bad luck is that both of them mistype the PIN twice. Let's say each ATM reads the current ErrCounter as part of Authentication. So we have (resulting ErrCounter in brackets):
partner 1 enters an incorrect PIN on ATM1 (ErrCounter = 1)
partner 2 enters an incorrect PIN on ATM2 (ErrCounter = 2)
partner 1 enters an incorrect PIN on ATM1 (ErrCounter = 3). Partner 1's try (with the phone) is now rejected
partner 2 enters an incorrect PIN on ATM2 (ErrCounter = 4). If the condition weren't >=, this would again allow an infinite loop of tries. With the stronger inequality this try is also rejected.
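To put the guard into code, a hypothetical sketch (errCounter and limit are just illustrative names, not taken from the diagram):

bool shouldRejectCard(int errCounter, int limit)
{
    // ">=" also catches a counter that has already jumped past the limit;
    // "==" would let errCounter = 4 with limit = 3 slip through forever.
    return errCounter >= limit;
}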

Related

What is thread synchronization and how does it differ from atomicity?

Atomicity can be achieved with machine level instructions such as compare and swap (CS).
It could also be achieved with the use of a mutex/lock for large blocks of code, with the OS providing help.
On the other hand, we also have the concept of a memory model. Some machines have a relaxed model, like ARM, which can re-order the loads/stores issued by a single thread (as observed by other threads), and some have a stricter model, like x86.
I want to confirm my understanding of the term synchronization. Is it pretty much the promise of both atomicity and the memory model? I.e., only using atomic ops on a thread doesn't necessarily make it synchronized with other threads?
Something atomic is indivisible. Things that are synchronized are happening together in time.
Atomicity
I like to think of this like having a data structure representing a 2-dimensional point with x, y coordinates. For my purposes, in order for my data to be considered "valid" it must always be a point along the x = y line. x and y must always be the same.
Suppose that initially I have a point { x = 10, y = 10 } and I want to update my data structure so that it represents the point {x = 20, y = 20}. And suppose that the implementation of the update operation is basically these two separate steps:
Step 1: x = 20
Step 2: y = 20
If my implementation writes x and y separately like that, then some other thread could potentially observe my point data structure after step 1 but before step 2. If it is allowed to read the value of the point after I change x but before I change y, then that other observer might observe the value {x = 20, y = 10}.
In fact there are three values that could be observed
{x = 10, y = 10} (the original value) [VALID]
{x = 20, y = 10} (x is modified but y is not yet modified) [INVALID x != y]
{x = 20, y = 20} (both x and y are modified) [VALID]
I need a way of updating the two values together so that it is impossible for an outside observer to observe {x = 20, y = 10}.
I don't really care when the other observer looks at the value of my point. It is fine if it observes { x = 10, y = 10 } and it is also fine if it observes { x = 20, y = 20 }. Both have the property of x == y, which makes them valid in my scenario.
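A minimal C++ sketch of the idea, assuming a mutex is an acceptable tool for the job: taking the lock around both writes (and around the read) makes the update observably indivisible, so a reader can only ever see x == y.

#include <mutex>

struct Point { int x = 10, y = 10; };

std::mutex m;   // guards p
Point p;

void update(int v)                         // writer: keep the point on the x == y line
{
    std::lock_guard<std::mutex> lock(m);
    p.x = v;                               // no reader can run between these two writes,
    p.y = v;                               // so {x = 20, y = 10} is never observable
}

Point read_point()                         // reader: takes the same lock before looking
{
    std::lock_guard<std::mutex> lock(m);
    return p;
}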
Simplest atomic operation
The simplest atomic operation is a test-and-set of a single bit. This operation atomically reads the value of a bit and overwrites it with a 1, returning the state of the bit we overwrote. We are offered the guarantee that if our operation has concluded, then we have the value that we overwrote and any other observer will observe a 1. If many agents attempt this operation simultaneously, only one agent will return 0, and the others will all return 1. Even if it's two CPUs writing on the exact same clock tick, something in the electronics will guarantee that the operation is concluded logically atomically according to our rules.
That's all there is to logical atomicity. That's all atomic means. It means you have the capability of performing an uninterrupted update with valid data before and after the update, and the data cannot be observed by another observer in any intermediate state it may take on during the update. It may be a single bit or it may be an entire database.
x86 Example
A good example of something that can be done on x86 atomically is the 32-bit interlocked increment.
Here a 32-bit (4-byte) value must be incremented by 1. This could potentially need to modify all 4 bytes for this to work correctly. If the value is to be modified from 0x000000FF to 0x00000100, it's important that the low byte changes from 0xFF to 0x00 and the byte above it changes from 0x00 to 0x01 atomically. Otherwise I risk observing the value 0x00000000 (if the low byte is modified first) or 0x000001FF (if the higher byte is modified first).
The hardware guarantees that we can test and modify 4 bytes at a time to achieve this. The CPU and memory provide a mechanism by which this operation can be performed even if there are other CPUs sharing the same memory. The CPU can assert a lock condition that prevents other CPUs from interfering with this interlocked operation.
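In C++, that interlocked increment is usually spelled with std::atomic; a minimal sketch (on x86 the fetch_add typically compiles down to a single LOCK-prefixed instruction):

#include <atomic>
#include <cstdint>

std::atomic<std::uint32_t> counter{0x000000FF};

void bump()
{
    // On x86 this typically becomes a single "lock xadd" / "lock inc", so no thread
    // can ever observe the torn values 0x00000000 or 0x000001FF in between.
    counter.fetch_add(1);
}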
Synchronization
Synchronization just talks about how things happen together in time. In the context you propose, it's about the order in which various sections of our program get executed and the order in which various components of our system change state. Without synchronization, we risk corruption (entering an invalid, semantically meaningless or incorrect state of execution of our program or its data).
Let's say we want to have an interlocked increment of a 64-bit number, and let's suppose that the hardware does not offer a way to atomically change 64 bits at a time. We will have to accomplish what we want with a more complex arrangement, which means that even when just reading we can't simply read the most-significant 32 bits and the least-significant 32 bits of our 64-bit number separately. We'd risk observing one half of our 64-bit value changing separately from the other half. It means that we must adhere to some kind of protocol when reading (or writing) this 64-bit value.
To implement this, we need an atomic test and set bit operation and a clear bit operation. (FYI, technically, what we need are two operations commonly referred to as P and V in computer science, but let's keep it simple.) Before reading or writing our data, we perform an atomic test-and-set operation on a single (shared) bit (commonly referred to as a "lock"). If we read a zero, then we know we are the only one that saw a zero and everyone else must have seen a 1. If we see a 1, then we assume someone else is using our shared data, and therefore we have no choice but to just try again. So we loop and keep testing and setting the bit until we observe it as a 0. (This is called a spin lock, and is the best we can do without getting help from the operating system's scheduler.)
When we eventually see a 0, then we can safely read both 32-bit parts of our 64-bit value individually. Or, if we're writing, we can safely write both 32-bit parts of our 64-bit value individually. Once both halves have been read or written, we clear the bit back to 0, permitting access by someone else.
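A rough C++ sketch of that protocol, assuming (artificially) that only 32-bit halves can be stored atomically; in real code you would normally just use std::atomic<uint64_t> or a std::mutex instead:

#include <atomic>
#include <cstdint>

std::atomic_flag lock_bit = ATOMIC_FLAG_INIT;   // the shared "lock" bit
volatile std::uint32_t lo = 0, hi = 0;          // the two halves of our 64-bit value

void write64(std::uint64_t v)
{
    while (lock_bit.test_and_set(std::memory_order_acquire)) { /* spin until we saw the 0 */ }
    lo = static_cast<std::uint32_t>(v);          // now it's safe to touch the halves separately
    hi = static_cast<std::uint32_t>(v >> 32);
    lock_bit.clear(std::memory_order_release);   // put the bit back to 0 for the next agent
}

std::uint64_t read64()
{
    while (lock_bit.test_and_set(std::memory_order_acquire)) { /* spin */ }
    std::uint64_t v = (static_cast<std::uint64_t>(hi) << 32) | lo;
    lock_bit.clear(std::memory_order_release);
    return v;
}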
Any such combination of cleverness and use of atomic operations to avoid corruption in this manner constitutes synchronization because we are governing the order in which certain sections of our program can run. And we can achieve synchronization of any complexity and of any amount of data so long as we have access to some kind of atomic data.
Once we have created a program that uses a lock to share a data structure in a conflict-free way, we could also refer to that data structure as being logically atomic. C++ provides std::atomic to achieve this, for example.
Remember that synchronization at this level (with a lock) is achieved by adhering to a protocol (protecting your data with a lock). Other forms of synchronization, such as what happens when two CPUs try to access the same memory on the same clock tick, are resolved in hardware by the CPUs and the motherboard, memory, controllers, etc. But fundamentally something similar is happening, just at the motherboard level.

QThread::idealThreadCount() always returning "2"

I need to display, in a QSpinBox, the number of cores, or threads, that the CPU has. The problem is:
QThread cpuInfo(this); //get CPU info
ui->spnBx_nmb_nodes->setValue(cpuInfo.idealThreadCount()); //get thread count
This is always returning "2". I tried on a "2 cores/4 threads" notebook, a "4 cores/8 threads" computer and a "12 cores/24 threads" server. In all cases, it returns "2" as the ideal thread count.
Can someone please shed some light on this?
idealThreadCount()'s implementation is different on different OS's:
On Windows, QThread::idealThreadCount() calls the Win32 function GetNativeSystemInfo() and from its results, returns the dwNumberOfProcessors value from the SYSTEM_INFO struct that call populates.
On Linux (and most other Unix-y OS)'s, QThread::idealThreadCount() calls sysconf(_SC_NPROCESSORS_ONLN) and returns that value.
On MacOS/X (and BSD and iOS), QThread::idealThreadCount() calls sysctl(CTL_HW, HW_NCPU) and returns the value it receives from there.
QThread::idealThreadCount() also contains some other back-end implementations for less-commonly used OS's, which I won't attempt to summarize here; if you need to look for yourself, the code is at lines 461-515 of qtbase/src/corelib/thread/qthread_unix.cpp.
Given all of the above, the question devolves to: why is the OS call (that Qt is delegating to) returning 2 instead of a more appropriate number? It sounds like a bug to me, although one other possibility is that idealThreadCount() is returning the correct number, but your QSpinBox is clamping that number down to 2 for some reason. If you haven't done so already, I suggest printing out the value returned by cpuInfo.idealThreadCount() directly, in addition to passing it to setValue(), just to be sure.
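Something along these lines, dropped in where your two lines currently are (a sketch; note that idealThreadCount() is static, so no QThread instance is needed, and printing the spin box's maximum() will reveal any clamping):

#include <QThread>
#include <QDebug>

qDebug() << "idealThreadCount:" << QThread::idealThreadCount();
qDebug() << "spin box maximum:" << ui->spnBx_nmb_nodes->maximum(); // a too-low maximum would clamp setValue()
ui->spnBx_nmb_nodes->setValue(QThread::idealThreadCount());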
Try the following code:
auto const value = 8;
auto *nmb_nodes = ui->spnBx_nmb_nodes;
nmb_nodes->setValue(value);
Q_ASSERT(nmb_nodes->value() == value);
My bet is that the assertion will not be fulfilled. So your problem is likely not what you think it is.

If one thread writes to a location and another thread is reading, can the second thread see the new value then the old?

Start with x = 0. Note there are no memory barriers in any of the code below.
volatile int x = 0;
Thread 1:
while (x == 0) {}
print "Saw non-zer0"
while (x != 0) {}
print "Saw zero again!"
Thread 2:
x = 1
Is it ever possible to see the second message, "Saw zero again!", on any (real) CPU? What about on x86_64?
Similarly, in this code:
volatile int x = 0;
Thread 1:
while (x == 0) {}
x = 2
Thread 2:
x = 1
Is the final value of x guaranteed to be 2? Or could the CPU caches update main memory in some arbitrary order, so that although x = 1 gets into one CPU's cache where thread 1 can see it, thread 1 then gets moved to a different CPU where it writes x = 2 into that CPU's cache, and the x = 2 gets written back to main memory before the x = 1?
Yes, it's entirely possible. The compiler could, for example, have just written x to memory but still have the value in a register. One while loop could check memory while the other checks the register.
It doesn't happen due to CPU caches because cache coherency hardware logic makes the caches invisible on all CPUs you are likely to actually use.
Theoretically, the write race you talk about could happen due to posted write buffering and read prefetching. Miraculous tricks were used to make this impossible on x86 CPUs to avoid breaking legacy code. But you shouldn't expect future processors to do this.
Leaving aside for a second tricks done by the compiler (even ones allowed by language standards), I believe you're asking how the micro-architecture could behave in such a scenario. Keep in mind that the code would most likely expand into a busy-wait loop of cmp [x] + jz or something similar, which hides a load inside it. This means that [x] is likely to live in the cache of the core running thread 1.
At some point, thread 2 would come and perform the store. If it resides on a different core, the line would first be invalidated completely from the first core. If these are 2 threads running on the same physical core - the store would immediately affect all chronologically younger loads.
Now, the most likely thing to happen on a modern out-of-order machine is that all the loads in the pipeline at this point belong to different iterations of the same first loop (since a branch predictor facing so many repeated "taken" resolutions is likely to assume the branch will keep being taken until proven wrong). So the first load to encounter the new value written by the other thread will cause the matching branch to flush the entire pipe of all younger operations, without the 2nd loop ever getting a chance to execute.
However, it's possible that for some reason you did get to the 2nd loop (let's say the predictor issued a not-taken prediction just at the right moment, when the loop condition check saw the new value). In that case, the question boils down to this scenario:
Time -->
----------------------------------------------------------------
thread 1
    cmp [x],0     execute              ; first loop's check - performed late, sees the 1
    je ...        execute (not taken)  ; first loop exits
    ...
    cmp [x],0     execute              ; second loop's check - performed early, out of order, still sees 0
    jne ...       execute (not taken)  ; second loop exits
Can_We_Get_Here:
    ...
thread 2
    store [x],1   execute              ; lands between the two loads as they were actually performed
In other words, given that most modern CPUs may execute instructions out of order, can a younger load be evaluated before an older one to the same address, allowing the store (from another thread) to change the value so that it is observed inconsistently by the loads?
My guess is that the above timeline is quite possible given the nature of out-of-order execution engines today, as they simply arbitrate and perform whatever operation is ready. However, on most x86 implementations there are safeguards to protect against such a scenario, since the memory ordering rules strictly say -
8.2.3.2 Neither Loads Nor Stores Are Reordered with Like Operations
Such mechanisms may detect this scenario and flush the machine to prevent the stale/wrong values from becoming visible. So the answer is: no, it should not be possible, unless of course the software or the compiler changes the nature of the code so that the hardware cannot notice the relation. Then again, memory ordering rules are sometimes flaky, and I'm not sure all x86 manufacturers adhere to the exact same wording, but this is a pretty fundamental example of consistency, so I'd be very surprised if one of them missed it.
The answer seems to be, "this is exactly the job of the CPU cache coherency." x86 processors implement the MESI protocol, which guarantees that the second thread can't see the new value and then the old.
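For portability (rather than relying on x86's strong ordering), the usual fix is to make x a std::atomic<int>; a minimal sketch of the two threads with the default sequentially consistent ordering, under which the second message can never appear:

#include <atomic>
#include <cstdio>

std::atomic<int> x{0};

void thread1()
{
    while (x.load() == 0) {}        // default memory_order_seq_cst load
    std::puts("Saw non-zer0");
    while (x.load() != 0) {}        // never exits: nothing ever stores 0 after the 1
    std::puts("Saw zero again!");   // unreachable
}

void thread2()
{
    x.store(1);                     // default memory_order_seq_cst store
}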

Are Data Races bad?

I'd like to settle a theoretical computing argument.
Assume everything is initially 0.
Thread0 Thread1
x=1 | y=x
Here we have a data race. As far as I understand (assuming that x fits in the architecture's word-size and is aligned on the word boundary, which it normally would be), the result is either x=1 ^ y=0 or x=1 ^ y=1.
Now my second example uses explicit locking (assume that lock() gets some global lock), and as far as I understand this is not a data race condition anymore.
Thread0 Thread1
lock() | lock()
x=1 | y=x
unlock() | unlock()
However, I would argue that both programs are equivalent: they produce identical output and have identical race issues. Somehow, however, people are trying to convince me that a data race is bad, and I don't see why my first program would be worse than my second.
Edit. The full quote from Wikipedia is:
C++11 introduced formal support for multithreading, and defined a data race strictly as a race condition between non-atomic variables. While race conditions in general will continue to exist, a "data race" must be avoided by the programmer, who must assure that only one thread at a time may access any variable if the access is for writing.
Now, assuming this is correct (it's wikipedia, which tends to be reasonably good on programming but can often be very wrong indeed), it's defining "data race" in this context purely as one of the clearly bad cases; those which can cause shearing of values. Such cases obviously must be avoided, so clearly data-races—defined as they are here—must be avoided.
And by this definition, neither program in your question has a data race.
I leave my original answer on race conditions generally:
The second example has a data-race too. Indeed, it has the exact same data-race as the first one.
Is this bad? That depends. One note before any of the rest: not only are many cases bad, as I'll describe below, but the cases that are bad tend to be particularly hard to find and fix, which in itself should lean one towards assuming the worst.
An obvious case where a data race is bad is where it corrupts data. Let's say we change your example so that x and y were larger than the architecture's word size and we're setting x = -1. We'll also assume two's-complement. Now the possible values for y are not just -1 and 0, but also -4294967296 and 4294967295.
In this case, the locking you suggest wouldn't remove the data-race completely, but would remove that part of it that could cause shearing: The only possible values of y would again be -1 and 0.
Another question is serialisation. It's often necessary to be able to consider a sequence of concurrent events as having been one of a limited set of sequential events.
For example, consider we start with X = 0 and then have:
Thread 0 Thread 1
++x x = -50
Now, there's still the risk of shearing here that could result in a possible bogus value.
Assuming that x is word-size or smaller, we still might have an issue. There are two possible values if the operations were not concurrent. Either x could be equal to -50 (increment, then assign -50) or x could be equal to -49 (assign -50, then increment). However, concurrently it's possible for us to end up with x having a value of 1, because thread 0 reads 0, thread 1 assigns -50, and then thread 0 increments and assigns 1.
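Spelling that interleaving out (a sketch; the exact machine code varies, but ++x on a non-atomic int typically decomposes into a separate load, add and store):

int x = 0;

void thread0_increment()    // what "++x" really does
{
    int tmp = x;            // (1) thread 0 reads 0
                            // (2) thread 1 runs "x = -50" at this point
    tmp = tmp + 1;          // (3) thread 0 increments its stale copy: tmp == 1
    x = tmp;                // (4) thread 0 writes 1 back, overwriting the -50
}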
Now, it's quite possible that this is perfectly okay. It's very likely though that it isn't.
As programmers we've got four possibilities:
1. Identify the data-race. Determine that it is harmless (or relatively harmless*), and let it be.
2. Identify the data-race. Determine that it can cause problems, and fix it.
3. Identify the data-race. Just fix it, because that way we can't make a mistake in determining it is harmless when it actually isn't.
4. Identify the data-race. Determine that it can cause problems. Change the code so the race doesn't cause problems.
The importance of case number 2 is obvious: we turn code that has a bug into code that doesn't.
The importance of case number 3 comes down to time and provability. We might well be making code less efficient (many methods for stopping data-races have at least some overhead), but it often takes less developer time to remove a race than to prove it harmless, and the cost of being wrong in this direction is marginally slower code, whereas the cost of being wrong in the other direction is a hard-to-fix bug.
The importance of case number 1 is more complicated: it can be important in some very low-level concurrent code to avoid locking, so there are cases where we want to tolerate races. Case number 4 is a way to turn something from case 2 into case 1, and comes up either when the data-race is inherent to the problem (we can't remove it) or when we're doing the sort of low-level concurrency that case 1 involves.
Here's an interesting example in C#:
public static SomeResource GetTheResource()
{
    if(_theResource == null)
        _theResource = CreateTheResource();
    return _theResource;
}
The data-race should be obvious: until _theResource is set and all CPUs' caches see the update, we might assign to it several times from different threads. Is this a bug? Many people would say it is, but actually it depends. It's possible that it's safe to have a brief period where different versions of _theResource are used, and all we really lose is some efficiency in the beginning from the multiple calls to CreateTheResource(). In code with a high requirement for performance we might decide to tolerate this initial lower efficiency for the long-term efficiency gain of no locking. Or it might be vital that we lock. Or we might just lock because we don't have that pressing a need to avoid it, and it's simpler just to assume that there might be a problem.
Important Point 1: If you do decide to tolerate a race like this, you should add a comment to that effect and why. Otherwise every time someone comes across this code they'll have to check again that it's safe, rather than at most check your stated reasoning.
Important Point 2: While the principle here is language-agnostic, the details in each case often are not. In this case tolerating the race depends not just on the temporary multiple copies being safe, but also on garbage collection cleaning those excess copies up. If we were instead assigning a heap-allocated pointer in C++, the above would at best be leaky, even if otherwise safe.
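For reference, in C++ the usual way to remove this particular race entirely (rather than tolerate it) is a function-local static or std::call_once; a minimal sketch, with SomeResource and CreateTheResource standing in for the real types:

struct SomeResource { /* ... */ };

SomeResource* CreateTheResource() { return new SomeResource(); }   // stand-in for the real factory

SomeResource* GetTheResource()
{
    // C++11 guarantees this initialisation runs exactly once, even if several
    // threads call GetTheResource() concurrently for the first time.
    static SomeResource* theResource = CreateTheResource();
    return theResource;
}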
A more complicated case is something like this (again a C# example, but applicable to other languages):
using System.Threading;

internal sealed class LockFreeQueue<T>
{
    private sealed class Node
    {
        public readonly T Item;
        public Node Next;
        public Node(T item)
        {
            Item = item;
        }
    }
    private volatile Node _head;
    private volatile Node _tail;
    public LockFreeQueue()
    {
        _head = _tail = new Node(default(T));
    }
#pragma warning disable 420 // volatile semantics not lost as only by-ref calls are interlocked
    public void Enqueue(T item)
    {
        Node newNode = new Node(item);
        for(;;)
        {
            Node curTail = _tail;
            if (Interlocked.CompareExchange(ref curTail.Next, newNode, null) == null) //append to the tail if it is indeed the tail.
            {
                Interlocked.CompareExchange(ref _tail, newNode, curTail); //CAS in case we were assisted by an obstructed thread.
                return;
            }
            else
            {
                Interlocked.CompareExchange(ref _tail, curTail.Next, curTail); //assist obstructing thread.
            }
        }
    }
    public bool TryDequeue(out T item)
    {
        for(;;)
        {
            Node curHead = _head;
            Node curTail = _tail;
            Node curHeadNext = curHead.Next;
            if (curHead == curTail)
            {
                if (curHeadNext == null)
                {
                    item = default(T);
                    return false;
                }
                else
                    Interlocked.CompareExchange(ref _tail, curHeadNext, curTail); // assist obstructing thread
            }
            else
            {
                item = curHeadNext.Item;
                if (Interlocked.CompareExchange(ref _head, curHeadNext, curHead) == curHead)
                {
                    return true;
                }
            }
        }
    }
#pragma warning restore 420
}
This code doesn't prevent data-races, but rather it reacts to them. If an operation is affected by another thread, then rather than error or return an incorrect result, the thread deals with the race and returns something else (and indeed even helps the other thread in some cases).
So in summary, data-races are not in and of themselves bad things. They are, though, complicating things, and those complications can cause problems. When you have a data-race you have a choice between proving it's not a problem, changing your code to tolerate the race so that it's no longer a problem, or changing your code to remove the race. Of these, just removing the race is often the easiest choice.
*I don't mean "relatively harmless" in a vague way here, but relative to the alternative. E.g. if we decide to leave the race in the C# example given, it's because we've decided that the cost of redundant object creation is less harmful than the relative cost of preventing it.
I thank everybody for their answers; although valuable, they did not actually answer the question I was hoping I had asked. The answers did allow me to reason better about what I was actually asking, and in the end to find something of an answer online:
http://software.intel.com/en-us/blogs/2013/01/06/benign-data-races-what-could-possibly-go-wrong
So I guess my question should have been:
The C(++)11 standard defines my first example as a data race (if I don't use the "atomic" keyword), but not the second one. The first one therefore has undefined behaviour (even though there don't seem to be compiler implementations that would result in anything but x==1 && y==0|1, according to the standard any resulting value for x and y would be correct compiler behaviour). I was wondering why this is. I think the Intel document answers that question pretty elaborately.
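For concreteness, here is the first example written so that it is not a C++11 data race; a sketch using std::atomic (the C++ spelling of that "atomic" keyword). The outcome is still racy, y may end up 0 or 1, but the behaviour is defined:

#include <atomic>
#include <thread>

std::atomic<int> x{0};
int y = 0;

int main()
{
    std::thread t0([] { x.store(1); });    // Thread0: x = 1
    std::thread t1([] { y = x.load(); });  // Thread1: y = x
    t0.join();
    t1.join();
    // Still a race condition on the outcome (y may end up 0 or 1), but not a
    // data race: both accesses to x are atomic and y is only written by one thread.
    return 0;
}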
If x and y fit into a machine register then assignment is atomic by default so locks won't change the outcome. It's equally possible to get y = 0 or y = 1 in the second case as well.

Is it possible to create a binary analysis software which would sort out all possible vulnerabilities and bugs in other software?

I often find myself wondering whether it is possible to design software which would load another piece of software, try to emulate all possible outcomes from it, and figure out the bugs and vulnerabilities in the software being analyzed.
Theoretically, it could load any piece of software and keep an internal representation of the underlying system (CPU registers, memory, etc.), like virtual machine software does. For the analysis, it would start fetching and emulating instructions, proceeding linearly until it finds a conditional jump.
To make it simple to understand: when it finds a conditional jump, it would take a snapshot of the current representational state of the system and follow the jump, continuing to evaluate instructions; at some point it would restore that snapshot, this time not follow the jump, and go past it evaluating the next instructions, and so on.
Such software would also be smart enough to emulate user-supplied input.
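Very roughly, the core loop I have in mind would look something like this toy sketch (made-up State and Insn types, nothing close to a real emulator):

#include <vector>

struct State {            // snapshot of the emulated machine
    int pc = 0;           // instruction pointer
    // registers, memory, symbolic user input ...
};

enum class Op { Plain, CondJump, Halt };
struct Insn { Op op; int target = 0; };

void analyse(const std::vector<Insn>& program, State s)
{
    while (s.pc < static_cast<int>(program.size())) {
        const Insn& i = program[s.pc];
        if (i.op == Op::Halt) return;
        if (i.op == Op::CondJump) {
            State taken = s;           // snapshot the current state
            taken.pc = i.target;
            analyse(program, taken);   // follow the jump...
            ++s.pc;                    // ...then restore the snapshot and fall through
            continue;
        }
        // emulate the plain instruction, record anything interesting (overflows, exec calls, ...)
        ++s.pc;
    }
}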
To make things clearer, let's imagine we are analyzing the following (pseudo?) C code:
char* gets(char *s)
{
    int i = 0;
    while( (s[i] = _getche()) != VK_RETURN ) i++;   // read until the user presses Enter
    s[i] = '\0';                                    // terminate the string
    return s;
}
void main() {
    char buf[8];
    char is_admin = FALSE;
    do {
        gets( buf );
        if( _strcmp(buf, "s3cr3t!") == 0 )
            is_admin = TRUE;
        else
        {
            if( is_admin )
                super_user.exec( buf );
            else
                unprivileged_user.exec( buf );
        }
    } while( _strcmp(buf, "exit") != 0 );
}
It just keeps polling for user commands and executes them until the user input is "exit". If the user inputs the password "s3cr3t!" then it will execute the following commands as a super user; otherwise it will just impersonate an unprivileged user.
Moving on, we could ask our analysis software to detect and sort out the ways it would be possible to execute commands as a super user in the code being analyzed.
By going through each instruction, it will come to conditional jumps and test both cases: when the jump is taken and when it is not. So after a few iterations it would know that if a user inputs the string "s3cr3t!", it will later come to execute commands as a super user. It would not try every possible string combination until it eventually hits "s3cr3t!"; it would be smart enough to see there is a comparison against that string and what it changes in the program flow.
Then, it would also be able to see that any user input of 8 or more characters would overflow the space allocated for the buf char array, thereby corrupting memory. In this particular case, assuming the stack layout puts the is_admin variable right next to the buf char array, that would make is_admin evaluate to TRUE, and the program would then execute commands as a super user.
It would also be able to spot that the overflow in that gets() function could corrupt stack memory in a way that ends up overwriting the RETURN address of a function call, figuring out an exploitation scenario where the user inputs shellcode and, by overwriting the RETURN address, makes execution jump to that shellcode, which would also execute commands as a super user.
So... I know I could not go into much detail on the inner workings, but overall I think I made my point. Does anyone see something wrong with that approach, or think it would not work?
I am thinking about going for an open project on this. I would appreciate any considerations.
If I understand you correctly, there is such a thing. Search for static analysis, control flow graphs and the like. So generally, your idea is good.
However, writing a program that will find all the bugs in some program is impossible. The proof is by reduction from the Halting problem. So obviously, it is impossible to use your approach to find them all.
However, it might be possible to find all the bugs of some family.
For example: I can define the "bug family" of crashing within one minute when only one ASCII char is given as input. Of course you can check this (at least for deterministic programs; for probabilistic programs, a simple check will give a probability that there is no bug).
So for specific bugs your approach might work.
And one last thing: notice that this approach might have a high time complexity.

Resources