uses for state machines [closed] - state-machine

In what areas of programming would I use state machines? Why? How could I implement one?
EDIT: please provide a practical example, if it's not too much to ask.

In what areas of programming would I use a state machine?
Use a state machine to represent a (real or logical) object that can exist in a limited number of conditions ("states") and progresses from one state to the next according to a fixed set of rules.
Why would I use a state machine?
A state machine is often a very compact way to represent a set of complex rules and conditions, and to process various inputs. You'll see state machines in embedded devices that have limited memory. Implemented well, a state machine is self-documenting because each logical state represents a physical condition. A state machine can be embodied in a tiny amount of code in comparison to its procedural equivalent and runs extremely efficiently. Moreover, the rules that govern state changes can often be stored as data in a table, providing a compact representation that can be easily maintained.
How can I implement one?
Trivial example:
typedef enum {        // Define the states in the state machine.
    NO_PIZZA,         // Exit state machine.
    COUNT_PEOPLE,     // Ask user for # of people.
    COUNT_SLICES,     // Ask user for # slices.
    SERVE_PIZZA,      // Validate and serve.
    EAT_PIZZA         // Task is complete.
} STATE;

STATE state = COUNT_PEOPLE;
int nPeople, nSlices, nSlicesPerPerson;

// Serve slices of pizza to people, so that each person gets
// the same number of slices.
while (state != NO_PIZZA) {
    switch (state) {
    case COUNT_PEOPLE:
        if (promptForPeople(&nPeople))  // If input is valid..
            state = COUNT_SLICES;       // .. go to next state..
        break;                          // .. else remain in this state.

    case COUNT_SLICES:
        if (promptForSlices(&nSlices))
            state = SERVE_PIZZA;
        break;

    case SERVE_PIZZA:
        if (nSlices % nPeople != 0)     // Can't divide the pizza evenly.
        {
            getMorePizzaOrFriends();    // Do something about it.
            state = COUNT_PEOPLE;       // Start over.
        }
        else
        {
            nSlicesPerPerson = nSlices / nPeople;
            state = EAT_PIZZA;
        }
        break;

    case EAT_PIZZA:
        // etc...
        state = NO_PIZZA;               // Exit the state machine.
        break;
    } // switch
} // while
Notes:
The example uses a switch() with explicit case/break states for simplicity. In practice, a case will often "fall through" to the next state.
For ease of maintaining a large state machine, the work done in each case can be encapsulated in a "worker" function. Get any input at the top of the while(), pass it to the worker function, and check the return value of the worker to compute the next state.
For compactness, the entire switch() can be replaced with an array of function pointers. Each state is embodied by a function whose return value is a pointer to the next state. Warning: This can either simplify the state machine or render it totally unmaintainable, so consider the implementation carefully! (A sketch follows these notes.)
An embedded device may be implemented as a state machine that exits only on a catastrophic error, after which it performs a hard reset and re-enters the state machine.
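To make the function-pointer variant concrete, here is a minimal C sketch (hypothetical state names, not from the original answer; the small struct wrapper works around C's lack of a directly self-referential function-pointer type):

#include <stdio.h>

/* Each state is a function; its return value names the next state.
   The struct indirection lets a state function effectively "return a
   pointer to the next state" despite C's type rules. */
typedef struct state_fn { struct state_fn (*fn)(void); } state_fn;

static state_fn state_a(void);
static state_fn state_b(void);
static state_fn state_done(void);

static state_fn state_a(void)    { printf("in A\n"); return (state_fn){ state_b }; }
static state_fn state_b(void)    { printf("in B\n"); return (state_fn){ state_done }; }
static state_fn state_done(void) { return (state_fn){ NULL }; }

int main(void)
{
    state_fn s = { state_a };
    while (s.fn != NULL)
        s = s.fn();   /* run the current state, move to the next */
    return 0;
}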

Some great answers already. For a slightly different perspective, consider searching for a text within a larger string. Someone has already mentioned regular expressions, and this is really just a special case, albeit an important one.
Consider the following method call:
very_long_text = "Bereshit bara Elohim et hashamayim ve'et ha'arets." …
word = "Elohim"
position = find_in_string(very_long_text, word)
How would you implement find_in_string? The easy approach would use a nested loop, something like this:
for i in 0 … length(very_long_text) - length(word):
    found = true
    for j in 0 … length(word) - 1:
        if very_long_text[i + j] != word[j]:
            found = false
            break
    if found: return i
return -1
Apart from the fact that this is inefficient, it forms a state machine! The states here are somewhat hidden; let me rewrite the code slightly to make them more visible:
state = 0
for i in 0 … length(very_long_text) - 1:
    if very_long_text[i] == word[state]:
        state += 1
        if state == length(word): return i - length(word) + 1
    else:
        state = 0
return -1
The different states here directly represent all different positions in the word we search for. There are two transitions for each node in the graph: if the letters match, go to the next state; for every other input (i.e. every other letter at the current position), go back to zero.
This slight reformulation has a huge advantage: it can now be tweaked to yield better performance using some basic techniques. In fact, every advanced string searching algorithm (discounting index data structures for the moment) builds on top of this state machine and improves some aspects of it.
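As a rough C rendering of the pseudocode above (find_in_string is the name assumed by the question; this sketch keeps the naive fall-back-to-zero transition, which can still miss matches whose prefix overlaps a failed attempt; the failure function of KMP is exactly the fix for that):

#include <string.h>

/* State = number of characters of `word` matched so far. */
int find_in_string(const char *text, const char *word)
{
    size_t n = strlen(text), m = strlen(word), state = 0;
    for (size_t i = 0; i < n; i++) {
        if (text[i] == word[state]) {
            state++;
            if (state == m)
                return (int)(i - m + 1);   /* start index of the match */
        } else {
            state = 0;
            if (text[i] == word[0])        /* re-try current char from state 0 */
                state = 1;
        }
    }
    return -1;
}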

What sort of task?
Any task, but from what I have seen, parsing of any sort is frequently implemented as a state machine.
Why?
Parsing a grammar is generally not a straightforward task. During the design phase it is fairly common that a state diagram is drawn to test the parsing algorithm. Translating that to a state machine implementation is a fairly simple task.
How?
Well, you are limited only by your imagination.
I have seen it done with case statements and loops.
I have seen it done with labels and goto statements.
I have even seen it done with structures of function pointers which represent the current state. When the state changes, one or more function pointers are updated.
I have seen it done in code only, where a change of state simply means that you are running in a different section of code (no state variables, and redundant code where necessary). This can be demonstrated with a very simple sort, which is useful for only very small sets of data:
int a[10] = {some unsorted integers};
int z;

not_sorted_state:
z = -1;
while (z < (int)(sizeof(a) / sizeof(a[0])) - 2)
{
    z = z + 1;
    if (a[z] > a[z + 1])
    {
        // ASSERT: the array is not in order
        swap(a[z], a[z + 1]);    // make the array more sorted
        goto not_sorted_state;   // change state to sort the array
    }
}
// ASSERT: the array is in order
There are no state variables, but the code itself represents the state.

The State design pattern is an object-oriented way to represent the state of an object by means of a finite state machine. It usually helps to reduce the logical complexity of that object's implementation (nested if's, many flags, etc.)
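The pattern as described is object-oriented, but the core idea (the context delegates behavior to an object representing its current state) can be approximated in C with a struct of function pointers. A minimal sketch, using a hypothetical turnstile:

#include <stdio.h>

struct turnstile;

/* A "state object" is a vtable of event handlers; handlers change
   behavior by swapping the context's current-state pointer. */
typedef struct state {
    void (*coin)(struct turnstile *t);  /* coin inserted */
    void (*push)(struct turnstile *t);  /* bar pushed    */
} state;

typedef struct turnstile { const state *current; } turnstile;

static const state locked_state, unlocked_state;

static void locked_coin(turnstile *t)   { puts("unlock");  t->current = &unlocked_state; }
static void locked_push(turnstile *t)   { puts("blocked"); (void)t; }
static void unlocked_coin(turnstile *t) { puts("refund");  (void)t; }
static void unlocked_push(turnstile *t) { puts("lock");    t->current = &locked_state; }

static const state locked_state   = { locked_coin,   locked_push   };
static const state unlocked_state = { unlocked_coin, unlocked_push };

int main(void)
{
    turnstile t = { &locked_state };
    t.current->coin(&t);   /* unlock  */
    t.current->push(&t);   /* lock    */
    t.current->push(&t);   /* blocked */
    return 0;
}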

Most workflows can be implemented as state machines. For example, processing leave applications or orders.
If you're using .NET, try Windows Workflow Foundation. You can implement a state machine workflow quite quickly with it.

If you're using C#, any time you write an iterator block you're asking the compiler to build a state machine for you (keeping track of where you are in the iterator etc).
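There is no direct C equivalent, but the transformation the C# compiler performs can be hand-rolled to show what it does: the iterator's locals move into a struct, and a state field records where to resume on the next call. A minimal sketch of a hypothetical "yield 0..n-1" iterator:

#include <stdbool.h>
#include <stdio.h>

/* What an iterator block compiles down to, by hand: locals become
   struct fields, and `state` records where to resume. */
typedef struct {
    int state;   /* 0 = not started, 1 = mid-iteration, 2 = done */
    int i, n;    /* former locals */
} range_iter;

static bool range_next(range_iter *it, int *out)
{
    switch (it->state) {
    case 0:
        it->i = 0;
        it->state = 1;
        /* fall through */
    case 1:
        if (it->i < it->n) {
            *out = it->i++;   /* "yield return i" */
            return true;
        }
        it->state = 2;
        /* fall through */
    default:
        return false;
    }
}

int main(void)
{
    range_iter it = { 0, 0, 3 };
    int v;
    while (range_next(&it, &v))
        printf("%d\n", v);    /* prints 0, 1, 2 */
    return 0;
}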

Here is a tested and working example of a state machine. Say you are on a serial stream (serial port, tcp/ip data, or file are typical examples). In this case I am looking for a specific packet structure that can be broken into three parts, sync, length, and payload. I have three states, one is idle, waiting for the sync, the second is we have a good sync the next byte should be length, and the third state is accumulate the payload.
The example is purely serial with only one buffer. As written here it will recover from a bad byte or packet, possibly discarding a packet but eventually recovering. You could do other things, such as keep a sliding window, to allow for immediate recovery; this matters where, say, a partial packet is cut short and then a new complete packet starts: the code below won't detect this and will throw away the partial as well as the whole packet, recovering on the next one. A sliding window would save you there if you really needed to process all the whole packets.
I use this kind of state machine all the time, be it for serial data streams, tcp/ip, or file i/o. Or perhaps for tcp/ip protocols themselves: say you want to send an email; open the port, wait for the server to send a response, send HELO, wait for the server to send a packet, send a packet, wait for the reply, etc. Essentially, in that case as well as in the case below, you may be idling waiting for the next byte/packet to come in. To remember what you were waiting for, and to re-use the code that waits for something, you can use state variables, the same way that state machines are used in logic (waiting for the next clock: what was I waiting for?).
Just like in logic, you may want to do something different for each state. In this case, if I have a good sync pattern I reset the offset into my storage as well as the checksum accumulator. The packet length state demonstrates a case where you may want to abort out of the normal control path. Not all, in fact many, state machines may jump around or loop around within the normal path; the one below is pretty much linear.
I hope this is useful and wish that state machines were used more in software.
The test data has intentional problems with it that the state machine recovers from. There is some garbage data after the first good packet, a packet with a bad checksum, and a packet with an invalid length. My output was:
good packet:FA0712345678EB
Invalid sync pattern 0x12
Invalid sync pattern 0x34
Invalid sync pattern 0x56
Checksum error 0xBF
Invalid packet length 0
Invalid sync pattern 0x12
Invalid sync pattern 0x34
Invalid sync pattern 0x56
Invalid sync pattern 0x78
Invalid sync pattern 0xEB
good packet:FA081234567800EA
no more test data
The two good packets in the stream were extracted despite the bad data. And the bad data was detected and dealt with.
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

unsigned char testdata[] =
{
    0xFA,0x07,0x12,0x34,0x56,0x78,0xEB,
    0x12,0x34,0x56,
    0xFA,0x07,0x12,0x34,0x56,0x78,0xAA,
    0xFA,0x00,0x12,0x34,0x56,0x78,0xEB,
    0xFA,0x08,0x12,0x34,0x56,0x78,0x00,0xEA
};
unsigned int testoff=0;

//packet structure
// [0] packet header 0xFA
// [1] bytes in packet (n)
// [2] payload
// ... payload
// [n-1] checksum
//
unsigned int state;
unsigned int packlen;
unsigned int packoff;
unsigned char packet[256];
unsigned int checksum;

void process_packet ( unsigned char *data, unsigned int len )
{
    unsigned int ra;
    printf("good packet:");
    for(ra=0;ra<len;ra++) printf("%02X",data[ra]);
    printf("\n");
}

int getbyte ( unsigned char *d )
{
    //check peripheral for a new byte
    //or serialize a packet or file
    if(testoff<sizeof(testdata))
    {
        *d=testdata[testoff++];
        return(1);
    }
    else
    {
        printf("no more test data\n");
        exit(0);
    }
    return(0);
}

int main ( void )
{
    unsigned char b;

    state=0; //idle
    while(1)
    {
        if(getbyte(&b))
        {
            switch(state)
            {
                case 0: //idle
                    if(b!=0xFA)
                    {
                        printf("Invalid sync pattern 0x%02X\n",b);
                        break;
                    }
                    packoff=0;
                    checksum=b;
                    packet[packoff++]=b;
                    state++;
                    break;

                case 1: //packet length
                    checksum+=b;
                    packet[packoff++]=b;
                    packlen=b;
                    if(packlen<3)
                    {
                        printf("Invalid packet length %u\n",packlen);
                        state=0;
                        break;
                    }
                    state++;
                    break;

                case 2: //payload
                    checksum+=b;
                    packet[packoff++]=b;
                    if(packoff>=packlen)
                    {
                        state=0;
                        checksum=checksum&0xFF;
                        if(checksum)
                        {
                            printf("Checksum error 0x%02X\n",checksum);
                        }
                        else
                        {
                            process_packet(packet,packlen);
                        }
                    }
                    break;
            }
        }
        //do other stuff, handle other devices/interfaces
    }
    return(0);
}

State machines are everywhere. State machines are key in communications interfaces where a message needs to be parsed as it is received. Also, there have been many times in embedded systems development that I've needed to separate a task into multiple tasks because of strict timing constraints.

A lot of digital hardware design involves creating state machines to specify the behaviour of your circuits. It comes up quite a bit if you're writing VHDL.

QA infrastructure, intended to screen-scrape or otherwise run through a process under test. (This is my particular area of experience; I built a state machine framework in Python for my last employer, with support for pushing the current state onto a stack and using various methods of state handler selection, for use in all our TTY-based screen scrapers.) The conceptual model fits well: running through a TTY application, it goes through a limited number of known states and can be moved back into old ones (think about using a nested menu). This has been released (with said employer's permission); use Bazaar to check out http://web.dyfis.net/bzr/isg_state_machine_framework/ if you want to see the code.
Ticket-, process-management and workflow systems -- if your ticket has a set of rules determining its movement between NEW, TRIAGED, IN-PROGRESS, NEEDS-QA, FAILED-QA and VERIFIED (for example), you've got a simple state machine.
Building small, readily provable embedded systems -- traffic light signaling is a key example where the list of all possible states has to be fully enumerated and known.
Parsers and lexers are heavily state-machine based, because how incoming streamed data is interpreted depends on where you are at the time.

A FSM is used everywhere you have multiple states and need to transition to a different state on stimulus.
(turns out that this encompasses most problems, at least theoretically)

Regular expressions are another example of where finite state machines (or "finite state automata") come into play.
A compiled regexp is a finite state machine, and
the sets of strings that regular expressions can match are exactly the languages that finite state automata can accept (called "regular languages").
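For instance, the regular expression a*b (matched against the whole input) corresponds to a three-state DFA. A hand-compiled sketch in C; a real regex engine derives this same kind of transition logic from the expression automatically:

#include <stdbool.h>

/* DFA for "a*b" over the whole input: state 0 accepts more 'a's or a
   final 'b'; state 1 is the accept state; state 2 is a dead (reject)
   state. */
static bool match_a_star_b(const char *s)
{
    int state = 0;
    for (; *s != '\0'; s++) {
        switch (state) {
        case 0: state = (*s == 'a') ? 0 : (*s == 'b') ? 1 : 2; break;
        case 1: state = 2; break;   /* anything after 'b' rejects */
        case 2: break;              /* dead state stays dead */
        }
    }
    return state == 1;              /* accept iff we ended on 'b' */
}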

I have an example from a current system I'm working on. I'm in the process of building a stock trading system. The process of tracking the state of an order can be complex, but if you build a state diagram for the life cycle of an order it makes applying new incoming transactions to the existing order much simpler. There are many fewer comparisons necessary in applying that transaction if you know from its current state that the new transaction can only be one of three things rather than one of 20 things. It makes the code much more efficient.

I didn't see anything here that actually explained the reason I see them used.
For practical purposes, a programmer usually has to add one when he is forced to return a thread/exit right in the middle of an operation.
For instance, if you have a multi-state HTTP request, you might have server code that looks like this:
Show form 1
process form 1
show form 2
process form 2
The thing is, every time you show a form, you have to quit out of your entire thread on the server (in most languages), even if your code all flows together logically and uses the same variables.
The act of putting a break in the code and returning the thread is usually done with a switch statement and creates what is called a state machine (Very Basic Version).
As you get more complex, it can get really difficult to figure out what states are valid. People usually then define a "State Transition Table" to describe all the state transitions.
I wrote a state machine library, the main concept being that you can actually implement your state transition table directly. It was a really neat exercise, not sure how well it's going to go over though...
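That library isn't shown here, but a minimal sketch of the direct state-transition-table idea, with the table as plain data and the engine reduced to a lookup (hypothetical door example):

/* A state transition table as data: next_state[state][event]. */
enum state { CLOSED, OPEN, LOCKED, NUM_STATES };
enum event { EV_OPEN, EV_CLOSE, EV_LOCK, EV_UNLOCK, NUM_EVENTS };

static const enum state next_state[NUM_STATES][NUM_EVENTS] = {
    /*            EV_OPEN  EV_CLOSE  EV_LOCK  EV_UNLOCK */
    /* CLOSED */ { OPEN,   CLOSED,   LOCKED,  CLOSED },
    /* OPEN   */ { OPEN,   CLOSED,   OPEN,    OPEN   },
    /* LOCKED */ { LOCKED, LOCKED,   LOCKED,  CLOSED },
};

static enum state step(enum state s, enum event e)
{
    return next_state[s][e];   /* the whole "engine" is one lookup */
}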

Finite state machines can be used for morphological parsing in any natural language.
Theoretically, this means that morphology and syntax are split up between computational levels, one being at most finite-state, and the other being at most mildly context sensitive (thus the need for other theoretical models to account for word-to-word rather than morpheme-to-morpheme relationships).
This can be useful in the area of machine translation and word glossing. Ostensibly, they're low-cost features to extract for less trivial machine learning applications in NLP, such as syntactic or dependency parsing.
If you're interested in learning more, you can check out Finite State Morphology by Beesley and Karttunen, and the Xerox Finite State Toolkit they designed at PARC.

State driven code is a good way to implement certain types of logic (parsers being an example). It can be done in several ways, for example:
State driving which bit of code is actually being executed at a given point (i.e. the state is implicit in the piece of code you are writing). Recursive descent parsers are a good example of this type of code.
State driving what to do in a conditional such as a switch statement.
Explicit state machines such as those generated by parser generating tools such as Lex and Yacc.
Not all state driven code is used for parsing. A general state machine generator is smc. It inhales a definition of a state machine (in its language) and it will spit out code for the state machine in a variety of languages.

Good answers. Here's my 2 cents. Finite State Machines are a theoretical idea that can be implemented multiple different ways, such as a table or a while-switch (but don't tell anybody that's just a way of saying goto; the horror!). It is a theorem that any FSM corresponds to a regular expression, and vice versa. Since a regular expression corresponds to a structured program, you can sometimes just write a structured program to implement your FSM. For example, a simple parser of numbers could be written along the lines of:
/* implement dd*[.d*] */
if (isdigit(*p)) {
    while (isdigit(*p)) p++;
    if (*p == '.') {
        p++;
        while (isdigit(*p)) p++;
    }
    /* got it! */
}
You get the idea. And, if there's a way that runs faster, I don't know what it is.

A typical use case is traffic lights.
On an implementation note: Java 5's enums can have abstract methods, which is an excellent way to encapsulate state-dependent behavior.
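The enum trick is Java-specific, but the traffic-light machine itself is tiny in any language. A sketch in C, where each state has exactly one successor, so the transition "table" collapses to a single array:

#include <stdio.h>

enum light { RED, GREEN, YELLOW };

static const enum light next[] = { GREEN, YELLOW, RED };  /* indexed by enum light */
static const char *name[]     = { "red", "green", "yellow" };

int main(void)
{
    enum light l = RED;
    for (int tick = 0; tick < 6; tick++) {
        printf("%s\n", name[l]);
        l = next[l];          /* advance on each timer tick */
    }
    return 0;
}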

Related

Can I 'pretend' I have a one-to-one producer-consumer problem instead of many-to-one?

I suspect this is a relatively obscure and specific question, so I'll try to be as clear as possible in describing it. All of my code is being done in C++ in pursuit of a computer vision project.
I have a primary thread, acting as a consumer.
This thread generates N producers that need to use a hodgepodge of proprietary software to acquire images from their camera buffers. Each of them does a quick conversion to cv::Mats before passing them along to the consumer for actual processing.
Here's where I trip up. Is it possible to twist this multiple-producer-single-consumer problem into a set of N single-producer-single-consumer problems? Something like this?
while (
    0 == pipeBuffer1.read_available() ||
    0 == pipeBuffer2.read_available() ||
    ...
    0 == pipeBuffern.read_available()
) { // wait
}
// do a fifo pop for each buffer and fuse them all into a single frame
I ask because, by my best understanding, I can transition over to lock-free ring buffers in the single-single version of the problem. I can't immediately see a reason why I can't treat my design issue this way, but my experience is slim to nil with multithreading and I figured I might as well ask before wasting my time trying to implement something that can't work.

How/when to release memory in wait-free algorithms

I'm having trouble figuring out a key point in wait-free algorithm design. Suppose a data structure has a pointer to another data structure (e.g. linked list, tree, etc.); how can one find the right time to release a data structure?
The problem is this: there are separate operations that can't be executed atomically without a lock. For example, one thread reads the pointer to some memory and increments the use count for that memory to prevent it being freed while this thread is using the data, which might take long; and even if it doesn't, it's a race condition. What prevents another thread from reading the pointer, decrementing the use count, determining that the memory is no longer used, and freeing it, all before the first thread has incremented the use count?
The main issue is that current CPUs only have a single-word CAS (compare & swap). Alternatively, the problem is that I'm clueless about wait-free algorithms and data structures, and after reading some papers I'm still not seeing the light.
IMHO garbage collection can't be the answer, because either the GC would have to be prevented from running while any thread is inside an atomic block (which would mean it can't be guaranteed that the GC will ever run again), or the problem is simply pushed to the GC, in which case, please explain how the GC would figure out whether the data is in the silly state (a pointer has been read [e.g. stored in a local variable] but the use count hasn't been incremented yet).
PS, references to advanced tutorials on wait-free algorithms for morons are welcome.
Edit: You should assume that the problem is being solved in a non-managed language, like C or C++. After all, if it were Java, we'd have no need to worry about releasing memory. Further assume that the compiler may generate code that stores temporary references to objects in registers (invisible to other threads) right before the usage counter increment, and that a thread can be interrupted between loading the object address and incrementing the counter. This of course doesn't mean that the solution must be limited to C or C++; rather, the solution should give a set of primitives allowing the implementation of wait-free algorithms on linked data structures. I'm interested in the primitives and how they solve the problem of designing wait-free algorithms. With such primitives a wait-free algorithm can be implemented equally well in C++ and Java.
After some research I learned this.
The problem is not trivial to solve, and there are several solutions, each with advantages and disadvantages. The reason for the complexity comes from inter-CPU synchronization issues. If not done right, it might appear to work correctly 99.9% of the time, which isn't enough, or it might fail under load.
The solutions I found are: 1) hazard pointers, 2) quiescence-period-based reclamation (used by the Linux kernel in the RCU implementation), 3) reference counting techniques, 4) others, and 5) combinations of these.
Hazard pointers work by saving the currently active references in a well-known per-thread location, so any thread deciding to free memory (when the counter appears to be zero) can check whether the memory is still in use by anyone. An interesting improvement is to buffer requests to release memory in a small array and free them in a batch when the array is full. The advantage of using hazard pointers is that they can actually guarantee an upper bound on unreclaimed memory. The disadvantage is that they place an extra burden on the reader.
Quiescence period based reclamation works by delaying the actual release of the memory until it's known that each thread has had a chance to finish working on any data that may need to be released. The way to know that this condition is satisfied is to check whether each thread has passed through a quiescent period (not in a critical section) after the object was removed. In the Linux kernel this means something like each task making a voluntary task switch. In a user space application it would be the end of a critical section. This can be achieved with a simple counter: while the counter is even the thread is not in a critical section (reading shared data), and while it is odd the thread is inside one; to move into or out of a critical section, all the thread needs to do is atomically increment the counter. Based on this, the "garbage collector" can determine whether each thread has had a chance to finish. There are several approaches; a simple one is to queue up the requests to free memory (e.g. in a linked list or an array), each tagged with the current generation (managed by the GC). When the GC runs, it checks the state of the threads (their state counters) to see whether each has passed to the next generation (its counter is higher than last time, or is the same and even); any memory can be reclaimed one generation after it was freed. The advantage of this approach is that it places the least burden on the reading threads. The disadvantage is that it can't guarantee an upper bound on the memory waiting to be released (e.g. one thread spending 5 minutes in a critical section while the data keeps changing and memory isn't released), but in practice it works out all right.
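As a sketch of the even/odd counter just described, using C11 atomics (names are hypothetical and the actual reclamation scan is elided; this only shows the bookkeeping):

#include <stdatomic.h>
#include <stdbool.h>

/* Counter even = thread outside a critical section, odd = inside.
   A reclaimer snapshots every thread's counter after unlinking an
   object; the object may be freed once each thread's snapshot was
   even or its counter has changed since the snapshot. */
typedef struct { _Atomic unsigned epoch; } reader_state;

static void reader_enter(reader_state *r)  /* begin reading shared data */
{
    atomic_fetch_add(&r->epoch, 1);        /* counter becomes odd */
}

static void reader_exit(reader_state *r)   /* done reading */
{
    atomic_fetch_add(&r->epoch, 1);        /* counter becomes even */
}

/* Has this thread passed through a quiescent state since `snap`? */
static bool quiesced_since(reader_state *r, unsigned snap)
{
    unsigned now = atomic_load(&r->epoch);
    return (snap % 2 == 0) || (now != snap);
}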
There are a number of reference counting solutions; many of them require a double compare-and-swap, which some CPUs don't support, so they can't be relied upon. The key problem remains, though: taking a reference before updating the counter. I didn't find enough information explaining how this can be done simply and reliably. So .....
There are of course a number of "Other" solutions; it's a very important topic of research with tons of papers out there. I didn't examine all of them; I only need one.
And of course the various approaches can be combined; for example, hazard pointers can solve the problems of reference counting. But there's a nearly infinite number of combinations, and in some cases a spin lock might theoretically break wait-freedom yet not hurt performance in practice. Somewhat like another tidbit I found in my research: it's theoretically not possible to implement wait-free algorithms using only compare-and-swap, because in theory (purely in theory) a CAS-based update might keep failing for non-deterministically long times (imagine a million threads on a million cores, each trying to increment and decrement the same counter using CAS). In reality, however, it rarely fails more than a few times (I suspect it's because the CPUs spend more clocks away from the CAS than there are CPUs, although if the algorithm returned to the same CAS on the same location every 50 clocks and there were 64 cores, there could be a chance of a major problem; then again, who knows, I don't have a hundred-core machine to try this). Another result of my research is that designing and implementing wait-free algorithms and data structures is VERY challenging (even if some of the heavy lifting is outsourced, e.g. to a garbage collector [e.g. Java]), and they might perform less well than a similar algorithm with carefully placed locks.
So, yeah, it's possible to free memory even without delays. It's just tricky. And if you forget to make the right operations atomic, or to place the right memory barrier, oh, well, you're toast. :-) Thanks everyone for participating.
I think atomic operations for increment/decrement and compare-and-swap would solve this problem.
Idea:
All resources have a counter which is modified with atomic operations. The counter is initially zero.
Before using a resource: "Acquire" it by atomically incrementing its counter. The resource can be used if and only if the incremented value is greater than zero.
After using a resource: "Release" it by atomically decrementing its counter. The resource should be disposed/freed if and only if the decremented value is equal to zero.
Before disposing: Atomically compare-and-swap the counter value with the minimum (negative) value. Dispose will not happen if a concurrent thread "Acquired" the resource in between.
You haven't specified a language for your question. Here goes an example in C#:
class MyResource
{
    // Counter is initially zero. Resource will not be disposed until it has
    // been acquired and released.
    private int _counter;

    public bool Acquire()
    {
        // Atomically increment counter.
        int c = Interlocked.Increment(ref _counter);

        // Resource is available if the resulting value is greater than zero.
        return c > 0;
    }

    public bool Release()
    {
        // Atomically decrement counter.
        int c = Interlocked.Decrement(ref _counter);

        // We should never reach a negative value.
        Debug.Assert(c >= 0, "Resource was released without being acquired");

        // Dispose when we reach zero.
        if (c == 0)
        {
            // Mark as disposed by setting the counter to its minimum value.
            // Only do this if the counter remains at zero. Atomic compare-and-swap operation.
            if (Interlocked.CompareExchange(ref _counter, int.MinValue, c) == c)
            {
                // TODO: Run dispose code (free stuff)
                return true; // tell caller that resource is disposed
            }
        }
        return false; // released but still in use
    }
}
Usage:
// "r" is an instance of MyResource
bool acquired = false;
try
{
    if (acquired = r.Acquire())
    {
        // TODO: Use resource
    }
}
finally
{
    if (acquired)
    {
        if (r.Release())
        {
            // Resource was disposed.
            // TODO: Nullify variable or similar to let GC collect it.
        }
    }
}
I know this is not the best way, but it works for me:
for shared dynamic data-structure lists I use a usage counter per item,
for example:
struct _data
{
    DWORD cnt;      // usage counter
    bool  deleted;  // item slot is free / deleted
    // here add your data
    _data() { cnt=0; deleted=true; }
};
const int MAX = 1024;
_data data[MAX];
now when an item starts to be used somewhere, then
// start use of data[i]
data[i].cnt++;
after it is no longer used, then
// stop use of data[i]
data[i].cnt--;
if you want to add a new item to the list, then
// add item
for (i=0;i<MAX;i++) // find first deleted item
    if (data[i].deleted)
    {
        data[i].deleted=false;
        data[i].cnt=0;
        // copy/set your data
        break;
    }
and now in the background, once in a while (on a timer or whatever),
scan data[]: any undeleted item with cnt == 0 is set as deleted (+ free its dynamic memory if it has any)
[Note]
to avoid multi-thread access problems, implement a single global lock per data list,
and program it so you cannot scan data while any data[i].cnt is changing;
one bool and one DWORD suffice for this if you do not want to use OS locks
// globals
bool data_cnt_locked=false;
DWORD data_cnt=0;
now wrap any change of data[i].cnt like this:
// start use of data[i]
while (data_cnt_locked) Sleep(1);
data_cnt++;
data[i].cnt++;
data_cnt--;
and modify the delete scan like this:
while (data_cnt) Sleep(1);
data_cnt_locked=true;
Sleep(1);
if (data_cnt==0) // just to be sure
    for (i=0;i<MAX;i++) // here scan for items to delete ...
        if (!data[i].cnt)
            if (!data[i].deleted)
            {
                data[i].deleted=true;
                data[i].cnt=0;
                // release your dynamic data ...
            }
data_cnt_locked=false;
PS.
Do not forget to play with the sleep times a little to suit your needs;
lock-free algorithm sleep times are sometimes dependent on the OS task scheduler.
This is not really a lock-free implementation,
because while the GC is at work everything is locked,
but other than that multi-threaded access is not blocking,
so if you do not run the GC too often you are fine.

Are Data Races bad?

I'd like to settle a theoretical computing argument.
Assume everything is initially 0:
Thread0 Thread1
x=1 | y=x
Here we have a data race. As far as I understand (assuming that x fits in the architecture's word-size and is aligned on the word boundary, which it normally would be), the result is either x=1 ^ y=0 or x=1 ^ y=1.
Now my second example uses explicit locking (assume that lock() gets some global lock), and as far as I understand this is not a data race condition anymore.
Thread0 Thread1
lock() | lock()
x=1 | y=x
unlock() | unlock()
However, I would argue that both programs are identical: they produce identical output and have identical race issues. Somehow, however, people are trying to convince me that a data race condition is bad, and I don't see why my first program would be worse than my second.
Edit. The full quote from Wikipedia is:
C++11 introduced formal support for multithreading, and defined a data race strictly as a race condition between non-atomic variables. While race conditions in general will continue to exist, a "data race" must be avoided by the programmer, who must assure that only one thread at a time may access any variable if the access is for writing.
Now, assuming this is correct (it's wikipedia, which tends to be reasonably good on programming but can often be very wrong indeed), it's defining "data race" in this context purely as one of the clearly bad cases; those which can cause shearing of values. Such cases obviously must be avoided, so clearly data-races—defined as they are here—must be avoided.
And by this definition, neither program in your question has a data race.
I leave my original answer on race conditions generally:
The second example has a data-race too. Indeed, it has the exact same data-race as the first one.
Is this bad? That depends. Note, before any of the rest: not only are many cases bad, as I'll describe more below, but those cases that are bad tend to be particularly hard to find and fix, which in itself should lean one towards assuming the worst.
An obvious case where a data race is bad is where it corrupts data. Let's say we change your example so that x and y are larger than the architecture's word size and we're setting x = -1. We'll also assume two's complement. Now the possible values for y are not just -1 and 0, but also -4294967296 and 4294967295.
In this case, the locking you suggest wouldn't remove the data-race completely, but would remove that part of it that could cause shearing: The only possible values of y would again be -1 and 0.
Another question is serialisation. It's often necessary to be able to consider a sequence of concurrent events as having been one of a limited set of sequential events.
For example, consider we start with x = 0 and then have:
Thread 0 Thread 1
++x x = -50
Now, there's still the risk of shearing here that could result in a possible bogus value.
Assuming that x is word-size or smaller, we still might have an issue. There are two possible values if the operations were not concurrent: either x could be equal to -50 (increment, then assign -50) or x could be equal to -49 (assign -50, then increment). Concurrently, however, it's possible for us to end up with x having a value of 1, because thread 0 reads 0, thread 1 assigns -50, and then thread 0 increments and assigns 1.
Now, it's quite possible that this is perfectly okay. It's very likely though that it isn't.
As programmers we've got four possibilities:
Identify the data-race. Determine that it is harmless (or relatively harmless*), and let it be.
Identify the data-race. Determine that it can cause problems, and fix it.
Identify the data-race. Just fix it because that way we can't make a mistake in determining it is harmless when it actually isn't.
Identify the data-race. Determine that it can cause problems. Change the code so the race doesn't cause problems.
The importance of case number 2 is obvious - we turn code that has a bug into code that doesn't.
The importance of case number 3 comes down to time and provability. We might well be making code less efficient (many methods for stopping data-races have at least some overhead), but it often takes less developer time to remove a race than to prove it harmless, and the cost of being wrong in one direction is marginally slower code, whereas the cost of being wrong in the other direction is a hard-to-fix bug.
The importance of number 1 is more complicated, it can be important in some very low-level concurrent code to avoid locking, so there are cases where we want to tolerate races. Number 4 is a way to turn something from number 2 into number 1, and comes up when either the data-race is inherent to the problem (we can't remove it) or we're doing the sort of low-level concurrency that number 1 involves.
Here's an interesting example in C#:
public static SomeResource TheResource
{
    get
    {
        if (_theResource == null)
            _theResource = CreateTheResource();
        return _theResource;
    }
}
The data-race should be obvious: until _theResource is set and all CPUs' caches see the update, we might assign to it several times from different threads. Is this a bug? Many people would say it is, but actually it depends. It's possible that it's safe to have a brief period where different versions of _theResource are used, and all we really lose is some efficiency in the beginning from the multiple calls to CreateTheResource(). In code with a high requirement for performance, we might decide to tolerate this initial lower efficiency for the long-term efficiency gain of no locking. Or it might be vital that we lock. Or we might just lock because we don't have that pressing a need to avoid it, and it's simpler just to assume that there might be a problem.
Important Point 1: If you do decide to tolerate a race like this, you should add a comment to that effect and why. Otherwise every time someone comes across this code they'll have to check again that it's safe, rather than at most check your stated reasoning.
Important Point 2: While the principle here is language-agnostic, the details in each case often are not. In this case tolerating the race depends not just on the temporary multiple copies being safe, but also on garbage collection cleaning those excess copies up. If we were instead assigning a pointer to the heap in C++ the above would at the very best be leaky, even if otherwise safe.
A more complicated case is something like this (again a C# example, but applicable to other languages):
internal sealed class LockFreeQueue<T>
{
    private sealed class Node
    {
        public readonly T Item;
        public Node Next;
        public Node(T item)
        {
            Item = item;
        }
    }

    private volatile Node _head;
    private volatile Node _tail;

    public LockFreeQueue()
    {
        _head = _tail = new Node(default(T));
    }

#pragma warning disable 420 // volatile semantics not lost as only by-ref calls are interlocked

    public void Enqueue(T item)
    {
        Node newNode = new Node(item);
        for(;;)
        {
            Node curTail = _tail;
            if (Interlocked.CompareExchange(ref curTail.Next, newNode, null) == null) // append to the tail if it is indeed the tail.
            {
                Interlocked.CompareExchange(ref _tail, newNode, curTail); // CAS in case we were assisted by an obstructed thread.
                return;
            }
            else
            {
                Interlocked.CompareExchange(ref _tail, curTail.Next, curTail); // assist obstructing thread.
            }
        }
    }

    public bool TryDequeue(out T item)
    {
        for(;;)
        {
            Node curHead = _head;
            Node curTail = _tail;
            Node curHeadNext = curHead.Next;
            if (curHead == curTail)
            {
                if (curHeadNext == null)
                {
                    item = default(T);
                    return false;
                }
                else
                    Interlocked.CompareExchange(ref _tail, curHeadNext, curTail); // assist obstructing thread
            }
            else
            {
                item = curHeadNext.Item;
                if (Interlocked.CompareExchange(ref _head, curHeadNext, curHead) == curHead)
                {
                    return true;
                }
            }
        }
    }

#pragma warning restore 420
}
This code doesn't prevent data-races, but rather it reacts to them. If an operation is affected by another thread, then rather than error or return an incorrect result, the thread deals with the race and returns something else (and indeed even helps the other thread in some cases).
So in summary, data-races are not in and of themselves bad things. They are, though, complicating things, and those complications can cause problems. When you have a data-race you have a choice between proving it's not a problem, changing your code to tolerate the race so that it's no longer a problem, or changing your code to remove the race. Of these, just removing the race is often the easiest choice.
*I don't mean "relatively harmless" in a vague way here, but relative to the alternative. E.g. if we decide to leave the race in the C# example given, it's because we've decided that the cost of redundant object creation is less harmful than the relative cost of preventing it.
I thank everybody for their answers, although valuable they did not actually answer the question I was hoping I asked. The answers did allow me to reason better about what I was actually asking, and in the end find something of an answer online:
http://software.intel.com/en-us/blogs/2013/01/06/benign-data-races-what-could-possibly-go-wrong
So I guess my question should have been:
The C(++)11 standard defines my first example as a data race (if I don't use the "atomic" keyword), and the second one not. The first one therefore has undefined behaviour (even though there don't seem to be compiler implementations that would result in anything but x==1 && y==0|1; according to the standard, any resulting value for x and y would be correct compiler behaviour). I was wondering why this is. I think the Intel document answers that question pretty elaborately.
If x and y fit into a machine register then assignment is atomic by default so locks won't change the outcome. It's equally possible to get y = 0 or y = 1 in the second case as well.

Is it possible to create a binary analysis software which would sort out all possible vulnerabilities and bugs in other software?

I find myself often questioning myself whether it is possible to design a software which would load up another software and try to emulate all possible outcomes from it and figure out bugs and vulnerabilities on the software being analyzed.
Theoretically, it could load any piece of software and have an internal representation of the underlying system (CPU registers, memory, etc.), like a virtual machine; the analysis would start fetching and emulating instructions, proceeding linearly until it finds a conditional jump.
To make it simple to understand, when it finds a conditional jump, it would take a snapshot of the current representational state of the system and follow that conditional jump, it would keep evaluating the instructions and at some point would restore that snapshot and do not follow the conditional jump, going past over it and evaluating the next instructions, and so on.
Such software would be smart enough to emulate user supplied input.
To make things clearer lets imagine we are analyzing the following (pseudo?) C code:
char* gets(char *s)
{
    int i = 0;
    while( (s[i] = _getche()) != VK_RETURN ) i++;
    s[i] = '\0';
    return s;
}

void main()
{
    char buf[8];
    char is_admin = FALSE;

    do {
        gets( buf );
        if( _strcmp(buf, "s3cr3t!") == 0 )
            is_admin = TRUE;
        else
        {
            if( is_admin )
                super_user.exec( buf );
            else
                unprivileged_user.exec( buf );
        }
    } while( _strcmp(buf, "exit") != 0 );
}
It just keeps polling for user commands and executes them until the user input is "exit". If the user inputs the password "s3cr3t!" then it will execute the following commands as a super user; otherwise it will just impersonate an unprivileged user.
Moving on, we could ask our analysis software to detect and sort out the ways it would be possible to execute commands as a super user in the code being analyzed.
By going through each instruction, it will come to conditional jumps and test both cases: when the jump is taken and when it is not. So after a few iterations it would know that if a user inputs the string "s3cr3t!", it will later come to execute commands as a super user. It would not try every possible string combination until it eventually comes to "s3cr3t!"; it would be smart enough to see there is a comparison against that string, and to see what it changes in the program flow.
Then, it would also be able to see that any user input string of more than 8 letters would overflow the space allocated for the buf char array, thereby corrupting memory. In this particular case, assuming a stack memory layout in which the is_admin variable sits right next to the buf char array, the overflow would set is_admin to evaluate to TRUE, and the program would then come to execute commands as a super user.
It would also be able to spot an integer overflow in that gets() function, if it could corrupt stack memory in a way that ends up changing the RETURN address of a function call, figuring out a scenario for exploitation where the user inputs shellcode and, by overwriting the RETURN address, jumps to that shellcode, which would also execute commands as a super user.
So... I know I could not go into much detail on the inner workings, but overall I think I made my point. Does anyone see something wrong with that approach or thinks it would not work?
I am thinking about going for an open project on this. I would appreciate any considerations.
If I understand you correctly, there is such a thing. Search for static analysis, control flow graphs and such things. So generally, your idea is good.
However, writing a program that will find all the bugs in some program is impossible. The proof is by reduction from the Halting problem. So obviously, it is impossible to use your approach to find them all.
However, it might be possible to find all the bugs of some family.
For example: I can define the "bug family" of crashing within one minute when only one ASCII char is given as input. Of course you can check this (at least for deterministic programs; for probabilistic programs, a simple check will give the probability that there is no bug).
So for specific bugs your approach might work.
And one last thing: notice that this approach might have high time complexity.

Design of a high-performance sorted data structure read by many threads and written by few

I have an interesting data structure design problem that is beyond my current expertise. I'm seeking data structure or algorithm answers about tackling this problem.
The requirements:
- Store a reasonable number of (pointer address, size) pairs (effectively two numbers; the first is useful as a sorting key) in one location.
- In a highly threaded application, many threads will look up values, to see if a specific pointer is within one of the (address, size) pairs; that is, treating each pair as a memory range, whether the pointer falls within any range in the list. Threads will much more rarely add or remove entries from this list.
- Reading or searching for values must be as fast as possible, happening hundreds of thousands to millions of times a second.
- Adding or removing values, i.e. mutating the list, happens much more rarely; performance is not as important.
- It is acceptable but not ideal for the list contents to be out of date, i.e. for a thread's lookup code not to find an entry that should exist, so long as at some point the entry will exist.
I am keen to avoid a naive implementation such as having a critical section to serialize access to a sorted list or tree. What data structures or algorithms might be suitable for this task?
Tagged with Delphi since I am using that language for this task. Language-agnostic answers are very welcome. However, I probably cannot use any of the standard libraries in any language without a lot of care. The reason is that memory access (allocation, freeing, etc. of objects and their internal memory, e.g. tree nodes) is strictly controlled and must go through my own functions. My current code elsewhere in the same program uses red/black trees and a bit trie, and I've written these myself. Object and node allocation runs through custom memory allocation routines. It's beyond the scope of the question, but is mentioned here to avoid an answer like 'use STL structure foo.' I'm keen for an algorithmic or structure answer that, so long as I have the right references or textbooks, I can implement myself.
I would use a TDictionary<Pointer, Integer> (from Generics.Collections) combined with a TMREWSync (from SysUtils) for multi-read exclusive-write access. TMREWSync allows multiple readers simultaneous access to the dictionary, as long as no writer is active. The dictionary itself provides O(1) lookup of pointers.
If you don't want to use the RTL classes the answer becomes: use a hash map combined with a multi-read exclusive-write synchronization object.
EDIT: Just realized that your pairs really represent memory ranges, so a hash map does not work. In this case you could use a sorted list (sorted by memory address) and then use binary search to quickly find a matching range. That makes the lookup O(log n) instead of O(1), though.
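The question is tagged Delphi, but the range lookup itself is language-agnostic. A sketch in C of the suggested sorted list with binary search, storing each range's end rather than its size (which also avoids the repeated address + size addition a later answer mentions):

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Ranges sorted by start address and non-overlapping. Binary search
   locates the last range starting at or below p, then a single
   comparison against its end decides membership. */
typedef struct { uintptr_t start, end; } range;   /* [start, end) */

static bool contains(const range *a, size_t n, uintptr_t p)
{
    size_t lo = 0, hi = n;               /* search window [lo, hi) */
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (a[mid].start <= p)
            lo = mid + 1;                /* a[mid] is a candidate */
        else
            hi = mid;
    }
    /* lo is the index just past the last range with start <= p */
    return lo > 0 && p < a[lo - 1].end;
}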
Exploring the replication idea a bit...
From the correctness point of view, reader/writer locks will do the work. However, in practice, while readers may be able to proceed concurrently and in parallel with accessing the structure, they will create huge contention on the lock, for the obvious reason that locking even for read access involves writing to the lock itself. This will kill the performance in a multi-core system and even more in a multi-socket system.
The reason for the low performance is the cache line invalidation/transfer traffic between cores/sockets. (As a side note, here's a very recent and very interesting study on the subject: Everything You Always Wanted to Know About Synchronization but Were Afraid to Ask.)
Naturally, we can avoid the inter-core cache transfers triggered by readers by making a copy of the structure on each core and restricting the reader threads to accessing only the copy local to the core they are currently executing on. This requires some mechanism for a thread to obtain its current core id. It also relies on the operating system scheduler not to move threads gratuitously across cores, i.e. to maintain core affinity to some extent. AFAICT, most current operating systems do it.
As for the writers, their job would be to update all the existing replicas, by obtaining each lock for writing. Updating one tree (apparently the structure should be some tree) at a time does mean a temporary inconsistency between replicas. From the problem description this seems acceptable. When a writer works, it will block readers on a single core, but not all readers. The drawback is that a writer has to perform the same work many times - as many times as there are cores or sockets in the system.
PS.
Maybe, just maybe, another alternative is some RCU-like approach, but I don't know this well, so I'll just stop after mentioning it :)
With replication you could have:
- one copy of your data structure (list w/ binary search, the interval tree mentioned, ...), say the "original" one, that is used only for the lookup (read access);
- a second copy, the "update" one, created when the data is to be altered (write access), so the write is made to the update copy.
Once writing completes, change some "current" pointer from the "original" to the "update" version. Keeping an access counter on the "original" copy, that copy can be destroyed when the counter decrements back to zero readers.
In pseudo-code:
// read:
data = get4Read();
... do the lookup
release4Read(data);
// write
data = get4Write();
... alter the data
release4Write(data);
// implementation:
// current is the datat structure + a 'readers' counter, initially set to '0'
get4Read() {
lock(current_lock) { // exclusive access to current
current.readers++; // one more reader
return current;
}
}
release4Read(copy) {
lock(current_lock) { // exclusive access to current
if(0 == --copy.readers) { // last reader
if(copy != current) { // it was the old, "original" one
delete(copy); // destroy it
}
}
}
}
get4Write() {
aquire_writelock(update_lock); // blocks concurrent writers!
var copy_from = get4Read();
var copy_to = deep_copy(copy_from);
copy_to.readers = 0;
return copy_to;
}
release4Write(data) {
lock(current_lock) { // exclusive access to current
var copy_from = current;
current = data;
}
release4Read(copy_from);
release_writelock(update_lock); // next write can come
}
To complete the answer regarding the actual data structure to use: given the fixed size of the data entries (a two-integer tuple), which is also quite small, I would use an array for storage and binary search for the lookup. (An alternative would be a balanced tree, mentioned in the comment.)
Talking about performance: as I understand it, the 'address' and 'size' define ranges, so a lookup for a given address being within such a range would involve an addition of 'address' + 'size' (to compare the queried address with the range's upper bound) over and over again. It may be more performant to store the start and end addresses explicitly, instead of the start address and size, to avoid this repeated addition.
Read the LMDB design papers at http://symas.com/mdb/ . An MVCC B+tree with lockless reads and copy-on-write writes. Reads are always zero-copy, writes may optionally be zero-copy as well. Can easily handle millions of reads per second in the C implementation. I believe you should be able to use this in your Delphi program without modification, since readers also do no memory allocation. (Writers may do a few allocations, but it's possible to avoid most of them.)
As a side note, here's a good read about memory barriers: Memory Barriers: a Hardware View for Software Hackers
This is just to answer a comment by @fast; the comment space is not big enough...
@chill: Where do you see the need to place any 'memory barriers'?
Everywhere you access shared storage from two different cores.
For example, a writer comes, makes a copy of the data and then calls release4Write. Inside release4Write, the writer does the assignment current = data to update the shared pointer with the location of the new data, decrements the counter of the old copy to zero and proceeds with deleting it. Now a reader intervenes and calls get4Read, and inside get4Read it does copy = current. Since there's no memory barrier, this happens to read the old value of current. For all we know, the write may be reordered after the delete call, or the new value of current may still reside in the writer's store queue, or the reader may not yet have seen and processed a corresponding cache invalidation request, and whatnot... Now the reader happily proceeds to search in that copy of the data that the writer is deleting or has just deleted. Oops!
But, wait, there's more! :D
With proper use of the get..() and release..() functions, where do you see the problem of accessing deleted data or of multiple deletion?
See the following interleaving of reader and writer operations.
Reader                               Shared data     Writer
======                               ===========     ======
                                     current = A:0
data = get4Read():
    var copy = A:0
    copy.readers++;                  current = A:1
    return A:1
data = A:1
... do the lookup
release4Read(copy == A:1):
    --copy.readers                   current = A:0
    0 == copy.readers -> true
                                                     data = get4Write():
                                                         aquire_writelock(update_lock)
                                                         var copy_from = get4Read():
                                                             var copy = A:0
                                                             copy.readers++;
                                     current = A:1
                                                             return A:1
                                                         copy_from == A:1
                                                         var copy_to = deep_copy(A:1);
                                                         copy_to == B:1
                                                         return B:1
                                                     data == B:1
                                                     ... alter the data
                                                     release4Write(data = B:1):
                                                         var copy_from = current;
                                                         copy_from == A:1
                                                         current = B:1
                                     current = B:1
    A:1 != B:1 -> true
    delete A:1
                                                     !!! release4Read(A:1) !!!
