Given sufficient memory, are locks unnecessary when there is only a single dedicated writer thread?

For a scenario with multiple reader threads and a single writer thread, where the readers are allowed to read slightly outdated data, I've concocted a lockless control flow as shown below in its most basic form in pseudocode:
GLOBAL_ATOMIC_POINTER shared_pointer

// Only called by the reader threads.
read()
    THREAD_LOCAL_POINTER read_pointer := shared_pointer
    return read_data_at(read_pointer)

// Only called by the writer thread.
write(input)
    THREAD_LOCAL_ARRAY array
    THREAD_LOCAL_POINTER write_pointer := shared_pointer
    if write_pointer == location_of_last_element(array)
        write_pointer := location_of_first_element(array)
    else
        write_pointer := location_of_next_element(array, write_pointer)
    write_data_at(write_pointer, input)
    shared_pointer := write_pointer
Let's call MAX_READING_DURATION the maximum period of time that a call to read() can take to complete, and MIN_WRITING_DURATION the minimum period of time that a call to write() can take to complete.
Now, with shared_pointer guaranteed to be atomic, as long as MAX_READING_DURATION < ELEMENT_COUNT(ARRAY) * MIN_WRITING_DURATION, this scheme should be perfectly safe.
Or have I overlooked something? If not, I'm sure this is a well-known thing, and I'd like to know what the proper terminology is, so I can use it when I explain/advocate this approach to others.
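For concreteness, here is what the scheme above might look like in C++ with std::atomic (a minimal sketch; the class and member names are illustrative, not part of the pseudocode). The release store / acquire load pair constrains the memory reordering discussed in the answer below, and the timing constraint from the question still applies: a reader that is lapped by the writer can observe a slot while it is being overwritten.

#include <array>
#include <atomic>
#include <cstddef>

template <typename T, std::size_t N>
class SingleWriterRing {
    std::array<T, N> slots_{};
    std::atomic<std::size_t> latest_{0};  // index of the last published slot
public:
    // Called only by the single writer thread.
    void write(const T& value) {
        std::size_t next =
            (latest_.load(std::memory_order_relaxed) + 1) % N;
        slots_[next] = value;                            // fill the slot first
        latest_.store(next, std::memory_order_release);  // then publish it
    }
    // Called by any reader thread; may return slightly stale data.
    T read() const {
        return slots_[latest_.load(std::memory_order_acquire)];
    }
};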

Sufficient memory and the total number of writing threads are not the criteria for determining what can and can't be lockless.
One important feature of lock-free programming is that if you suspend a single thread it will never prevent other threads from making progress through their own lock-free operations.
But, more importantly: the main property your (single-writer) code needs in order to be safe without locks is sequential consistency.
Sequential consistency means that all threads agree on the order in which memory operations occurred, and that this order is consistent with the order of operations in the program source code.
If the code can't guarantee sequential consistency, it must prevent memory reordering. (More on memory reordering here: http://preshing.com/20120515/memory-reordering-caught-in-the-act/ )
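As a concrete illustration, here is the classic reordering experiment from the linked article, sketched with C++ atomics (my own translation, not code from the article): with relaxed ordering, both loads can observe 0, an outcome that sequential consistency forbids.

#include <atomic>
#include <thread>

std::atomic<int> X{0}, Y{0};
int r1, r2;

void thread1() {
    X.store(1, std::memory_order_relaxed);
    r1 = Y.load(std::memory_order_relaxed);
}
void thread2() {
    Y.store(1, std::memory_order_relaxed);
    r2 = X.load(std::memory_order_relaxed);
}

int main() {
    std::thread a(thread1), b(thread2);
    a.join(); b.join();
    // r1 == 0 && r2 == 0 is a permitted outcome here; with
    // std::memory_order_seq_cst on every access it would not be.
}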
Finally, I'd recommend checking out these resources to dig deeper into lock-free multi-threaded programming concepts:
http://concurrencykit.org/presentations/lockfree_introduction/#/
http://www.drdobbs.com/lock-free-data-structures/184401865
Good luck!

Related

Is it necessary to do Multi-thread protection for a Boolean property in Delphi?

I found a Delphi library named EventBus and I think it will be very useful, since Observer is my favorite design pattern.
While studying its source code, I found a piece of code that seems to exist for multithreading-safety reasons, shown below (property Active's getter and setter methods).
TSubscription = class(TObject)
private
  FActive: Boolean;
  procedure SetActive(const Value: Boolean);
  function GetActive: Boolean;
  // ... other members
public
  constructor Create(ASubscriber: TObject;
    ASubscriberMethod: TSubscriberMethod);
  destructor Destroy; override;
  property Active: Boolean read GetActive write SetActive;
  // ... other methods
end;

function TSubscription.GetActive: Boolean;
begin
  TMonitor.Enter(Self);
  try
    Result := FActive;
  finally
    TMonitor.Exit(Self);
  end;
end;

procedure TSubscription.SetActive(const Value: Boolean);
begin
  TMonitor.Enter(Self);
  try
    FActive := Value;
  finally
    TMonitor.Exit(Self);
  end;
end;
Could you please tell me whether the lock protection for FActive is necessary, and why?
Summary
Let me start by making this point as clear as possible: Do not attempt to distill multi-threaded development into a set of "simple" rules. It is essential to understand how the data is shared in order to evaluate which of the available concurrency protection techniques would be correct for a particular situation.
The code you have presented suggests the original authors had only a superficial understanding of multi-threaded development. So it serves as a lesson in what not to do.
First, locking the Boolean for read/write access in that way serves no purpose at all. I.e. each read or write is already atomic.
Furthermore, in cases where the property does need protection for concurrent access: it fails abysmally to provide any protection at all.
The net effect is redundant ineffective code that can trigger pointless wait states.
Thread-safety
In order to evaluate 'thread-safety', the following concepts should be understood:
If 2 threads 'race' for the opportunity to access a shared memory location, one will be first, and the other second. In the absence of other factors, you have no control over which thread would 'start' its access first.
Your only control is to block the 'second' thread from concurrent access if the 'first' thread hasn't finished its critical work.
The word "critical" has loaded meaning and may take some effort to fully understand. Take note of the explanation later about why a Boolean variable might need protection.
Critical work refers to all the processing required for the operation on the shared data to be deemed complete.
It's related to concepts of atomic operations or transactional integrity.
The 'second' thread could either be made to wait for the 'first' thread to finish or to skip its operation altogether.
Note that if the shared memory is accessed concurrently by both threads, then there's the possibility of inconsistent behaviour based on the exact ordering of the internal sub-steps of each thread's processing.
This is the fundamental risk and area of concern when thinking about thread-safety. It is the base principle from which other principles are derived.
'Simple' reads and writes are (usually) atomic
No concurrent operations can interfere with the reading/writing of a single byte of data. You will always either get the value in its entirety or replace the value in its entirety.
This concept extends to multiple bytes, up to the machine architecture's bit size; but it does have a caveat, known as tearing.
When a value is not aligned to its size, its bytes can span the end of one aligned location into the beginning of the next.
This means that reading/writing those bytes may take 2 operations at the machine level.
As a result 2 concurrent threads could interleave their sub-steps resulting in invalid/incorrect values being read. E.g.
Suppose one thread writes $ffff over an existing value of $0000 while another reads.
"Valid" reads would return either $0000 or $ffff depending on which thread is 'first'.
If the sub-steps run concurrently, then the reading thread could return invalid values of $ff00 or $00ff.
(Note that some platforms might still guarantee atomicity in this situation, but I don't have the knowledge to comment in detail on this.)
To reiterate: single byte values (including Boolean) cannot span aligned memory locations. So they're not subject to the tearing issue above. And this is why the code in the question that attempts to protect the Boolean is completely pointless.
When protection is needed
Although reads and writes in isolation are atomic, it's important to note that when a value is read and impacts a write decision, then this cannot be assumed to be thread-safe. This is best explained by way of a simple example.
Suppose 2 threads invert a shared boolean value: FBool := not FBool;
2 threads means this happens twice and once both threads have finished, the boolean should end up having its starting value. However, each is a multi-step operation:
Read FBool into a location local to the thread (either stack or register).
Invert the value.
Write the inverted value back to the shared location.
If there's no thread-safety mechanism employed then the sub-steps can run concurrently. And it's possible that both threads:
Read FBool; both getting the starting value.
Both threads invert their local copies.
Both threads write the same inverted value to the shared location.
And the end result is that the value is inverted when it should have been reverted to its starting value.
Basically the critical work is clearly more than simply reading or writing the value. To properly protect the boolean value in this situation, the protection must start before the read, and end after the write.
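For illustration, here is one way to make the inversion a single indivisible step, sketched in C++ rather than Delphi (identifiers are mine). compare_exchange_weak reloads `expected` on failure, so the loop always retries against the current value.

#include <atomic>

void toggle(std::atomic<bool>& b) {
    bool expected = b.load();
    // Retry until we swap in the inverse of the value we actually read;
    // on failure, `expected` is refreshed with the current value.
    while (!b.compare_exchange_weak(expected, !expected)) {
    }
}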
The important lesson to take away from this is that thread-safety requires understanding how the data is shared. It's not feasible to produce an arbitrary generic safety mechanism without this understanding.
And this is why any such attempt as in the EventBus code in the question is almost certainly doomed to be deficient (or even an outright failure).

Memory barrier in the implementation of single producer single consumer

The following implementation from Wikipedia:
volatile unsigned int produceCount = 0, consumeCount = 0;
TokenType buffer[BUFFER_SIZE];

void producer(void) {
    while (1) {
        while (produceCount - consumeCount == BUFFER_SIZE)
            sched_yield(); // buffer is full
        buffer[produceCount % BUFFER_SIZE] = produceToken();
        // a memory_barrier should go here, see the explanation above
        ++produceCount;
    }
}

void consumer(void) {
    while (1) {
        while (produceCount - consumeCount == 0)
            sched_yield(); // buffer is empty
        consumeToken(buffer[consumeCount % BUFFER_SIZE]);
        // a memory_barrier should go here, the explanation above still applies
        ++consumeCount;
    }
}
says that a memory barrier must be used between the line that accesses the buffer and the line that updates the Count variable.
This is done to prevent the CPU from reordering the instructions above the fence with those below it. The count variable shouldn't be incremented before it is used to index into the buffer.
If a fence is not used, won't this kind of reordering violate the correctness of code? The CPU shouldn't perform increment of Count before it is used to index into buffer. Does the CPU not take care of data dependency while instruction reordering?
Thanks
If a fence is not used, won't this kind of reordering violate the correctness of code? The CPU shouldn't perform increment of Count before it is used to index into buffer. Does the CPU not take care of data dependency while instruction reordering?
Good question.
In C++, unless some form of memory barrier is used (atomic, mutex, etc.), the compiler assumes that the code is single-threaded. In that case, the as-if rule says that the compiler may emit whatever code it likes, provided that the overall observable effect is 'as if' your code was executed sequentially.
As mentioned in the comments, volatile does not necessarily alter this, being merely an implementation-defined hint that the variable may change between accesses (this is not the same as being modified by another thread).
So if you write multi-threaded code without memory barriers, you get no guarantees that changes to a variable in one thread will even be observed by another thread, because as far as the compiler is concerned that other thread should not be touching the same memory, ever.
What you will actually observe is undefined behaviour.
It seems that your question is: "can incrementing Count and the assignment to buffer be reordered without changing the code's behavior?"
Consider the following code transformation:
int count1 = produceCount++;
buffer[count1 % BUFFER_SIZE] = produceToken();
Notice that this code behaves exactly like the original: one read from a volatile variable, one write to a volatile variable, and the read happens before the write, so the single-threaded state of the program is the same. However, other threads will see a different picture regarding the order of the produceCount increment and the buffer modification.
Both the compiler and the CPU can perform that transformation without memory fences, so you need to force those two operations into the correct order.
If a fence is not used, won't this kind of reordering violate the correctness of code?
Nope. Can you construct any portable code that can tell the difference?
The CPU shouldn't perform increment of Count before it is used to index into buffer. Does the CPU not take care of data dependency while instruction reordering?
Why shouldn't it? What would the payoff be for the costs incurred? Things like write combining and speculative fetching are huge optimizations and disabling them is a non-starter.
If you're thinking that volatile alone should do it, that's simply not true. The volatile keyword has no defined thread synchronization semantics in C or C++. It might happen to work on some platforms and it might happen not to work on others. In Java, volatile does have defined thread synchronization semantics, but they don't include providing ordering for accesses to non-volatiles.
However, memory barriers do have well-defined thread synchronization semantics. We need to make sure that no thread can see that data is available before it sees that data. And we need to make sure that a thread that marks data as able to be overwritten is not seen before the thread is finished with that data.
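To make that concrete, here is a sketch of the same single-producer/single-consumer queue using C++11 atomics (my own translation; names are illustrative). The release store that publishes each count plays the role of the barrier the Wikipedia comments call for, and the matching acquire load makes the buffer contents visible on the other side. As with the original, the index arithmetic assumes SIZE is a power of two so that counter wrap-around is benign.

#include <atomic>
#include <cstddef>

template <typename T, std::size_t SIZE>
class SpscQueue {
    T buffer_[SIZE];
    std::atomic<unsigned> produceCount_{0}, consumeCount_{0};
public:
    // Producer thread only.
    bool try_push(const T& item) {
        unsigned p = produceCount_.load(std::memory_order_relaxed);
        if (p - consumeCount_.load(std::memory_order_acquire) == SIZE)
            return false;                                      // buffer is full
        buffer_[p % SIZE] = item;
        produceCount_.store(p + 1, std::memory_order_release); // barrier here
        return true;
    }
    // Consumer thread only.
    bool try_pop(T& item) {
        unsigned c = consumeCount_.load(std::memory_order_relaxed);
        if (produceCount_.load(std::memory_order_acquire) - c == 0)
            return false;                                      // buffer is empty
        item = buffer_[c % SIZE];
        consumeCount_.store(c + 1, std::memory_order_release); // barrier here
        return true;
    }
};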

How/when to release memory in wait-free algorithms

I'm having trouble figuring out a key point in wait-free algorithm design. Suppose a data structure has a pointer to another data structure (e.g. linked list, tree, etc.): how can the right time for releasing that data structure be determined?
The problem is this, there are separate operations that can't be executed atomically without a lock. For example one thread reads the pointer to some memory, and increments the use count for that memory to prevent free while this thread is using the data, which might take long, and even if it doesn't, it's a race condition. What prevents another thread from reading the pointer, decrementing the use count and determining that it's no longer used and freeing it before the first thread incremented the use count?
The main issue is that current CPUs only have a single word CAS (compare & swap). Alternatively the problem is that I'm clueless about waitfree algorithms and data structures and after reading some papers I'm still not seeing the light.
IMHO garbage collection can't be the answer: either the GC would have to be prevented from running while any thread is inside an atomic block (which would mean it can't be guaranteed that the GC will ever run again), or the problem is simply pushed to the GC, in which case, please explain how the GC would figure out whether the data is in the silly state (a pointer has been read [e.g. stored in a local variable] but the use count hasn't been incremented yet).
PS, references to advanced tutorials on wait-free algorithms for morons are welcome.
Edit: You should assume that the problem is being solved in a non-managed language, like C or C++. After all, if it were Java, we'd have no need to worry about releasing memory. Further assume that the compiler may generate code that stores temporary references to objects in registers (invisible to other threads) right before the usage counter increment, and that a thread can be interrupted between loading the object address and incrementing the counter. This of course doesn't mean that the solution must be limited to C or C++, rather that the solution should give a set of primitives that allows the implementation of wait-free algorithms on linked data structures. I'm interested in the primitives and how they solve the problem of designing wait-free algorithms. With such primitives a wait-free algorithm could be implemented equally well in C++ and Java.
After some research I learned this.
The problem is not trivial to solve and there are several solutions each with advantages and disadvantages. The reason for the complexity comes from inter CPU synchronization issues. If not done right it might appear to work correctly 99.9% of the time, which isn't enough, or it might fail under load.
The solutions I found are: 1) hazard pointers, 2) quiescence-period based reclamation (used by the Linux kernel in its RCU implementation), 3) reference counting techniques, 4) other approaches, and 5) combinations of the above.
Hazard pointers work by saving the currently active references in a well-known per-thread location, so any thread deciding to free memory (when the counter appears to be zero) can check whether the memory is still in use by anyone. An interesting improvement is to buffer requests to release memory in a small array and free them in a batch when the array is full. The advantage of using hazard pointers is that they can actually guarantee an upper bound on unreclaimed memory. The disadvantage is that they place an extra burden on the reader.
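A bare-bones C++ illustration of the idea (a sketch under simplifying assumptions: a fixed number of threads with one hazard slot each; real implementations add per-thread retire lists, batching, and dynamic registration). The re-check loop is what closes the window the question asks about, between reading the pointer and announcing its use:

#include <atomic>

constexpr int MAX_THREADS = 8;
std::atomic<void*> hazard[MAX_THREADS];  // one published pointer per thread

// Reader: publish the pointer, then re-check that it is still current;
// if the source changed after we published, retry with the new value.
template <typename T>
T* protect(int tid, const std::atomic<T*>& src) {
    T* p;
    do {
        p = src.load();
        hazard[tid].store(p);
    } while (p != src.load());
    return p;
}

void clear(int tid) { hazard[tid].store(nullptr); }

// Reclaimer: free only if no thread has published the pointer as hazardous;
// a real implementation keeps failed candidates on a retire list and retries.
template <typename T>
bool try_free(T* p) {
    for (int i = 0; i < MAX_THREADS; ++i)
        if (hazard[i].load() == p)
            return false;  // still in use by thread i
    delete p;
    return true;
}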
Quiescence-period based reclamation works by delaying the actual release of the memory until it's known that each thread has had a chance to finish working on any data that may need to be released. The way to know that this condition is satisfied is to check whether each thread has passed through a quiescent period (not in a critical section) after the object was removed. In the Linux kernel this means something like each task making a voluntary task switch. In a user-space application it would be the end of a critical section. This can be achieved with a simple counter: whenever the counter is even the thread is not in a critical section (reading shared data), and whenever it is odd the thread is inside one; to move into or out of a critical section, all the thread needs to do is atomically increment the counter. Based on this, the "garbage collector" can determine whether each thread has had a chance to finish. There are several approaches; a simple one is to queue up the requests to free memory (e.g. in a linked list or an array), each tagged with the current generation (managed by the GC). When the GC runs, it checks the state of the threads (their state counters) to see whether each has passed to the next generation (its counter is higher than last time, or is the same and even); any memory can then be reclaimed one generation after it was freed. The advantage of this approach is that it places the least burden on the reading threads. The disadvantage is that it can't guarantee an upper bound on the memory waiting to be released (e.g. one thread spending 5 minutes in a critical section while the data keeps changing and memory isn't released), but in practice it works out all right.
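A sketch of the even/odd counter just described, in C++ (illustrative names; a full reclaimer would also manage the generation queue):

#include <atomic>

struct ThreadState {
    std::atomic<unsigned> epoch{0};  // even = quiescent, odd = in critical section
};

void enter_read(ThreadState& t) { t.epoch.fetch_add(1); }  // even -> odd
void exit_read(ThreadState& t)  { t.epoch.fetch_add(1); }  // odd -> even

// True if thread t has passed through a quiescent period since the
// snapshot `seen` was taken: it was quiescent then, or has moved on since.
bool quiesced(const ThreadState& t, unsigned seen) {
    unsigned now = t.epoch.load();
    return seen % 2 == 0 || now != seen;
}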
There are a number of reference counting solutions; many of them require double compare-and-swap, which some CPUs don't support, so they can't be relied upon. The key problem remains, though: taking a reference before updating the counter. I didn't find enough information to explain how this can be done simply and reliably. So.....
There are of course a number of "Other" solutions, it's a very important topic of research with tons of papers out there. I didn't examine all of them. I only need one.
And of course the various approaches can be combined; for example, hazard pointers can solve the problems of reference counting. But there's a nearly infinite number of combinations, and in some cases a spin lock might theoretically break wait-freedom yet not hurt performance in practice. Another tidbit I found in my research: it's theoretically not possible to implement wait-free algorithms using compare-and-swap alone, because in theory (purely in theory) a CAS-based update might keep failing for a non-deterministically long time (imagine a million threads on a million cores each trying to increment and decrement the same counter using CAS). In reality, however, it rarely fails more than a few times (I suspect it's because the CPUs spend more clocks away from the CAS than there are CPUs, but I think that if the algorithm returned to the same CAS on the same location every 50 clocks and there were 64 cores, there could be a chance of a major problem; then again, who knows, I don't have a hundred-core machine to try this). Another result of my research is that designing and implementing wait-free algorithms and data structures is VERY challenging (even if some of the heavy lifting is outsourced, e.g. to a garbage collector [e.g. Java]), and they might perform less well than a similar algorithm with carefully placed locks.
So, yeah, it's possible to free memory even without delays. It's just tricky. And if you forget to make the right operations atomic, or to place the right memory barrier, oh, well, you're toast. :-) Thanks everyone for participating.
I think atomic operations for increment/decrement and compare-and-swap would solve this problem.
Idea:
All resources have a counter which is modified with atomic operations. The counter is initially zero.
Before using a resource: "Acquire" it by atomically incrementing its counter. The resource can be used if and only if the incremented value is greater than zero.
After using a resource: "Release" it by atomically decrementing its counter. The resource should be disposed/freed if and only if the decremented value is equal to zero.
Before disposing: Atomically compare-and-swap the counter value with the minimum (negative) value. Dispose will not happen if a concurrent thread "Acquired" the resource in between.
You haven't specified a language for your question. Here is an example in C#:
class MyResource
{
    // Counter is initially zero. Resource will not be disposed until it has
    // been acquired and released.
    private int _counter;

    public bool Acquire()
    {
        // Atomically increment counter.
        int c = Interlocked.Increment(ref _counter);
        // Resource is available if the resulting value is greater than zero.
        return c > 0;
    }

    public bool Release()
    {
        // Atomically decrement counter.
        int c = Interlocked.Decrement(ref _counter);
        // We should never reach a negative value.
        Debug.Assert(c >= 0, "Resource was released without being acquired");
        // Dispose when we reach zero.
        if (c == 0)
        {
            // Mark as disposed by setting the counter to its minimum value.
            // Only do this if the counter remains at zero: atomic
            // compare-and-swap operation.
            if (Interlocked.CompareExchange(ref _counter, int.MinValue, c) == c)
            {
                // TODO: Run dispose code (free stuff)
                return true; // tell caller that resource is disposed
            }
        }
        return false; // released but still in use
    }
}
Usage:
// "r" is an instance of MyResource
bool acquired = false;
try
{
if (acquired = r.Acquire())
{
// TODO: Use resource
}
}
finally
{
if (acquired)
{
if (r.Release())
{
// Resource was disposed.
// TODO: Nullify variable or similar to let GC collect it.
}
}
}
I know this is not the best way, but it works for me: for shared dynamic data-structure lists I use a usage counter per item. For example:
struct _data
{
    DWORD cnt;      // usage counter
    bool deleted;   // true when the slot is unused
    // here add your data
    _data() { cnt=0; deleted=true; }
};
const int MAX = 1024;
_data data[MAX];
Now when an item starts to be used somewhere:
// start use of data[i]
data[i].cnt++;
and after it is no longer used:
// stop use of data[i]
data[i].cnt--;
If you want to add a new item to the list:
// add item
for (i=0;i<MAX;i++) // find first deleted item
    if (data[i].deleted)
    {
        data[i].deleted=false;
        data[i].cnt=0;
        // copy/set your data
        break;
    }
And now in the background, once in a while (on a timer or whatever), scan data[] and mark all undeleted items with cnt == 0 as deleted (+ free their dynamic memory if they have any).
[Note]
To avoid multi-thread access problems, implement a single global lock per data list, and program it so you cannot scan data while any data[i].cnt is changing. One bool and one DWORD suffice for this if you do not want to use OS locks:
// globals
bool data_cnt_locked=false;
DWORD data_cnt=0;
Now wrap any change of data[i].cnt like this:
// start use of data[i]
while (data_cnt_locked) Sleep(1);
data_cnt++;
data[i].cnt++;
data_cnt--;
and modify the delete scan like this:
while (data_cnt) Sleep(1);
data_cnt_locked=true;
Sleep(1);
if (data_cnt==0) // just to be sure
    for (i=0;i<MAX;i++) // here scan for items to delete ...
        if (!data[i].cnt)
            if (!data[i].deleted)
            {
                data[i].deleted=true;
                data[i].cnt=0;
                // release your dynamic data ...
            }
data_cnt_locked=false;
PS.
Do not forget to play with the sleep times a little to suit your needs; lock-free algorithm sleep times are sometimes dependent on the OS task scheduler. This is not really a lock-free implementation, because while the GC is at work everything is locked, but other than that, concurrent accesses do not block each other, so if you do not run the GC too often you are fine.

Delphi threading - which parts of code need to be protected/synchronized?

So far I thought that any operation done on a "shared" object (common to multiple threads) must be protected with "synchronize", no matter what. Apparently I was wrong: in the code I've been studying recently there are plenty of classes (thread-safe ones, as the author claims) and only one of them uses a critical section, for almost every method.
How do I find which parts/methods of my code need to be protected with a critical section (or any other method) and which do not?
So far I haven't stumbled upon any interesting explanation / article / blog post; all Google results are:
a) examples of synchronization between a thread and the GUI. From a simple progress bar to the most complex cases, the lesson is always the same: each time you access/modify a property of a GUI component, do it in "Synchronize". But nothing more.
b) articles explaining critical sections, mutexes, etc. Just different approaches to protection/synchronization.
c) examples of very, very simple thread-safe classes (a thread-safe stack or list). They all do the same thing: implement lock/unlock methods that enter/leave a critical section, and return the actual stack/list pointer on locking.
Now I'm looking for an explanation of which parts of the code should be protected.
It could be in the form of code ;) but please don't give me one more "using Synchronize to update a progress bar" example... ;)
thank you!
You are asking for specific answers to a very general question.
Basically, apart from UI operations, you should protect every access to shared memory/resources to prevent two potentially competing threads from:
reading inconsistent memory
writing memory at the same time
trying to use the same resource at the same time from more than one thread... unless the resource is itself thread-safe.
Generally, I consider any other operation thread-safe, including operations that access non-shared memory or non-shared objects.
For example, consider this object:
type
  TThrdExample = class
  private
    FValue: Integer;
  public
    procedure Inc;
    procedure Dec;
    function Value: Integer;
    procedure ThreadInc;
    procedure ThreadDec;
    function ThreadValue: Integer;
  end;

threadvar
  ThreadValue: Integer;
Inc, Dec and Value are methods which operate on the FValue field. These methods are not thread-safe until you protect them with some synchronization mechanism: it could be a TMultiReadExclusiveWriteSynchronizer for the Value function and a critical section for the Inc and Dec methods.
The ThreadInc and ThreadDec methods operate on the ThreadValue variable, which is declared as threadvar, so I consider them thread-safe because the memory they access is not shared between threads: each call from a different thread will access a different memory address.
If you know that, by design, a class should be used only in one thread or inside other synchronization mechanisms, you're free to consider that thread safe by design.
If you want more specific answers, I suggest you try with a more specific question.
Best regards.
EDIT: Maybe someone will say the Integer field is a bad example because you can consider Integer reads and writes atomic on Intel/Windows, and thus there's no need to protect it... but I hope you get the idea.
You misunderstood the TThread.Synchronize method.
The TThread.Synchronize and TThread.Queue methods execute the protected code in the context of the main (GUI) thread. That is why you should use Synchronize or Queue to update GUI controls (like a progress bar): normally only the main thread should access GUI controls.
Critical Sections are different - the protected code is executed in the context of the thread that acquired critical section, and no other thread is permitted to acquire the critical section until the former thread releases it.
You use critical section in case there's a need for a certain set of objects to be updated atomically. This means, they must at all times be either already updated completely or not yet updated at all. They must never be accessible in a transitional state.
For example, with a simple integer reading/writing this is not the case. The operation of reading integer as well as the operation of writing it are atomic already: you cannot read integer in the middle of processor writing it, half-updated. It's either old value or new value, always.
But if you want to increment the integer atomically, you have not one but three operations to do at once: read the old value into the processor's cache, increment it, and write it back to memory. Each operation is atomic, but the three of them together are not.
One thread might read the old value (say, 200), increment it by 5 in cache, and at the same time another thread might read the value too (still 200). Then the first thread writes back 205, while the second thread increments its cached value of 200 to 203 and writes back 203, overwriting 205. The result of two increments (+5 and +3) should be 208, but it's 203 due to non-atomicity of operations.
So, you use critical sections when:
A variable, set of variables, or any resource is used from several threads and needs to be updated atomically.
and it's not atomic by itself (for example, a call to a function whose body is guarded by a critical section is effectively atomic already).
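For concreteness, here are two C++ sketches of protecting the whole read-modify-write from the +5/+3 example above (illustrative names; either approach works, the point is that the protection spans all three steps):

#include <atomic>
#include <mutex>

std::atomic<int> counter{0};
void atomic_add(int by) { counter.fetch_add(by); }  // one indivisible RMW

int plain = 0;
std::mutex m;
void locked_add(int by) {
    std::lock_guard<std::mutex> g(m);  // protection spans read, add, write
    plain += by;
}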
Have a read of this documentation
http://www.eonclash.com/Tutorials/Multithreading/MartinHarvey1.1/ToC.html
If you use messaging to communicate between threads then you can basically ignore synchronisation primitives completely, because each thread only accesses its internal structures and the messages themselves. In essence this is a far easier and more scalable architecture than using synchronisation primitives.
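As a sketch of that style in C++ (my own illustration, not from the linked tutorial): each thread owns its data and receives work through a small mailbox. The mailbox itself needs internal synchronization, but application code never touches shared state directly.

#include <condition_variable>
#include <mutex>
#include <queue>

template <typename Msg>
class Mailbox {
    std::queue<Msg> q_;
    std::mutex m_;
    std::condition_variable cv_;
public:
    // Any thread may post a message.
    void post(Msg msg) {
        { std::lock_guard<std::mutex> g(m_); q_.push(std::move(msg)); }
        cv_.notify_one();
    }
    // The owning thread blocks until a message arrives.
    Msg wait() {
        std::unique_lock<std::mutex> g(m_);
        cv_.wait(g, [this] { return !q_.empty(); });
        Msg msg = std::move(q_.front());
        q_.pop();
        return msg;
    }
};

A worker thread would simply loop on wait() and act on each message, never sharing its internal structures with other threads.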

Primitive synchronization primitives -- safe?

On constrained devices, I often find myself "faking" locks between 2 threads with 2 bools. Each is only read by one thread, and only written by the other. Here's what I mean:
bool quitted = false, is_paused = false;
bool should_quit = false, should_pause = false;

void downloader_thread() {
    quitted = false;
    while(!should_quit) {
        fill_buffer(bfr);
        if(should_pause) {
            is_paused = true;
            while(should_pause) sleep(50);
            is_paused = false;
        }
    }
    quitted = true;
}

void ui_thread() {
    // new Thread(downloader_thread).start();
    // ...
    should_pause = true;
    while(!is_paused) sleep(50);
    // resize buffer or something else non-thread-safe
    should_pause = false;
}
Of course on a PC I wouldn't do this, but on constrained devices, it seems reading a bool value would be much quicker than obtaining a lock. Of course I trade off for slower recovery (see "sleep(50)") when a change to the buffer is needed.
The question -- is it completely thread-safe? Or are there hidden gotchas I need to be aware of when faking locks like this? Or should I not do this at all?
Using bool values to communicate between threads can work as you intend, but there are indeed two hidden gotchas as explained in this blog post by Vitaliy Liptchinsky:
Cache Coherency
A CPU does not always fetch memory values from RAM. Fast memory caches on the die are one of the tricks used by CPU designers to work around the Von Neumann bottleneck. On some multi-cpu or multi-core architectures (like Intel's Itanium) these CPU caches are not shared or automatically kept in sync. In other words, your threads may be seeing different values for the same memory address if they run on different CPU's.
To avoid this you need to declare your variables as volatile (C++, C#, Java), do explicit volatile reads/writes, or make use of locking mechanisms.
Compiler Optimizations
The compiler or JITter may perform optimizations which are not safe if multiple threads are involved. See the linked blog post for an example. Again, you must use the volatile keyword or other mechanisms to inform your compiler.
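In portable C++ terms, one fix for both gotchas is to declare the handshake flags as std::atomic<bool> (a sketch; the default sequentially consistent ordering is the conservative choice):

#include <atomic>

// The question's flags, declared atomic: the compiler can no longer cache
// or reorder accesses to them, and stores become visible across cores.
std::atomic<bool> should_quit{false}, should_pause{false};
std::atomic<bool> quitted{false}, is_paused{false};

// The loops then read and write them directly, e.g. in ui_thread:
//   should_pause.store(true);
//   while (!is_paused.load()) sleep(50);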
Unless you understand the memory architecture of your device in detail, as well as the code generated by your compiler, this code is not safe.
Just because it seems that it would work, doesn't mean that it will. "Constrained" devices, like the unconstrained type, are getting more and more powerful. I wouldn't bet against finding a dual-core CPU in a cell phone, for instance. That means I wouldn't bet that the above code would work.
Concerning the sleep call, you could always just do sleep(0), or the equivalent call that pauses your thread and lets the next one in line have a turn.
Concerning the rest, this is thread safe if you know the implementation details of your device.
Answering the questions.
Is this completely thread-safe? I would answer no, this is not thread-safe, and I would just not do this at all. Without knowing the details of your device and compiler, if this is C++, the compiler is free to reorder and optimize things away as it sees fit. E.g. you wrote:
is_paused = true;
while(should_pause) sleep(50);
is_paused = false;
but the compiler may choose to reorder this into something like this:
sleep(50);
is_paused = false;
This probably won't work even on a single-core device, as others have said.
Rather than taking a lock, you may do better to just do less on the UI thread rather than yielding in the middle of processing UI messages. If you think that you have spent too much time on the UI thread, then find a way to cleanly exit and register an asynchronous callback.
If you call sleep on a UI thread (or try to acquire a lock, or do anything that may block) you open the door to hangs and glitchy UIs. A 50 ms sleep is enough for a user to notice. And if you try to acquire a lock or do any other blocking operation (like I/O), you need to deal with the reality of waiting an indeterminate amount of time for it to complete, which tends to turn a glitch into a hang.
This code is unsafe under almost all circumstances. On multi-core processors you will not have cache coherency between cores for these flags, because plain bool reads and writes are not synchronizing operations. This means each core is not guaranteed to have the same value in its cache, or even from memory, if the cache from the last write hasn't been flushed.
However, even on resource-constrained single-core devices this is not safe, because you do not have control over the scheduler. Here is an example; for simplicity I'm going to pretend these are the only two threads on the device.
When the ui_thread runs, the following lines of code could be run in the same timeslice.
// new Thread(downloader_thread).start();
// ...
should_pause = true;
The downloader_thread runs next, and in its time slice the following lines are executed:
quitted = false;
while(!should_quit) {
    fill_buffer(bfr);
The scheduler preempts the downloader_thread before fill_buffer returns, and then activates the ui_thread, which runs:
while(!is_paused) sleep(50);
// resize buffer or something else non-thread-safe
should_pause = false;
The resize-buffer operation is done while the downloader_thread is in the process of filling the buffer. This means the buffer is corrupted and you'll likely crash soon. It won't happen every time, but the fact that you fill the buffer before setting is_paused makes it more likely to happen; even if you switched the order of those two operations in the downloader_thread you would still have a race condition, though you'd likely deadlock instead of corrupting the buffer.
Incidentally, this is a type of spinlock; it just doesn't work. Spinlocks aren't good for wait times that are likely to span many time slices, because the spinning occupies the processor. Your implementation does sleep, which is a bit nicer, but the scheduler still has to run your thread, and thread context switches aren't cheap. If you wait on a critical section or semaphore instead, the scheduler doesn't activate your thread again until the resource has become free.
You might be able to get away with this in some form on a specific platform/architecture, but it is really easy to make a mistake that is very hard to track down.
