Cross (POSIX) platform analog for sigtimedwait() - Linux

The use case is the need to mask SIGPIPE in pthreads that do their own write() and/or SSL_write(), and to have it compile on current POSIX-ish systems like Linux, macOS, BSD, etc. The typical approach on Linux is explained quite nicely here, and there is plenty of good additional discussion on the topic here.
The typical signal(SIGPIPE, SIG_IGN) does work everywhere I have tried, but (I believe) there should be a more surgical solution that avoids globally ignoring SIGPIPE. It would also be nice to avoid platform-specific pragmas if possible.
The sigtimedwait() function does not appear to exist in (current?) versions of macOS, so a cross-platform solution does not look likely using that approach.
The sigwait() function seems to exist everywhere, but it will block forever if the particular signal is not actually pending. So the next best approach appears to be to use sigpending() to see what is pending, and then sigwait() to service it, both of which appear to be available.
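Concretely, the sequence I have in mind looks something like this (a sketch only, with error handling omitted; the function name is mine):

    #include <pthread.h>
    #include <signal.h>
    #include <errno.h>
    #include <unistd.h>

    /* Perform a write() with SIGPIPE blocked in this thread only,
     * draining any SIGPIPE it generated before restoring the mask. */
    ssize_t write_no_sigpipe(int fd, const void *buf, size_t len)
    {
        sigset_t pipe_set, pending, old_set;
        sigemptyset(&pipe_set);
        sigaddset(&pipe_set, SIGPIPE);

        /* Block SIGPIPE for the calling thread only. */
        pthread_sigmask(SIG_BLOCK, &pipe_set, &old_set);

        ssize_t n = write(fd, buf, len);
        int saved_errno = errno;

        /* If the write raised SIGPIPE, it is now pending; consume it
         * with sigwait() so it cannot fire once unblocked. */
        sigpending(&pending);
        if (sigismember(&pending, SIGPIPE)) {
            int sig;
            sigwait(&pipe_set, &sig);
        }

        /* Restore the previous per-thread mask. */
        pthread_sigmask(SIG_SETMASK, &old_set, NULL);
        errno = saved_errno;
        return n;
    }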
What has me concerned is that there is virtually nothing (that I can find) written on this particular problem, which is usually a sign that I am missing something painfully obvious.
So, is pthread_sigmask() / sigpending() / sigwait() / pthread_sigmask() a good choice for the above use case? Or are there (non?)obvious pitfalls I should be aware of?

So, is pthread_sigmask() / sigpending() / sigwait() / pthread_sigmask() a good choice for the above use case? Or are there (non?)obvious pitfalls I should be aware of?
There's the fact that sigwait() and sigtimedwait() were released in the same version of POSIX. If you're looking to achieve portability by relying on standards, and if macOS fails to conform by omitting the latter, then you should be concerned about how else it fails to conform. Indeed, there are other areas of nonconformance that may bite you, though not necessarily with your particular proposed series of function calls.
For best portability I would suggest going for the simplest solutions possible. In this case I would simply ignore the signal (that is, set its disposition to SIG_IGN). I infer that you understand that signal dispositions are per-process characteristics, not per-thread characteristics, but so what? All of your write()s should be checking their return values to detect short writes and error conditions anyway, and if they do that correctly then they will take appropriate action without any need to receive a signal.
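For illustration, a minimal sketch of that simpler approach (the function names here are mine):

    #include <errno.h>
    #include <signal.h>
    #include <stdio.h>
    #include <unistd.h>

    /* Call once at startup, before any threads are created. */
    void ignore_sigpipe(void)
    {
        signal(SIGPIPE, SIG_IGN);
    }

    /* With SIGPIPE ignored, a write to a broken pipe simply fails
     * with EPIPE instead of killing the process. */
    void send_all(int fd, const char *buf, size_t len)
    {
        while (len > 0) {
            ssize_t n = write(fd, buf, len);
            if (n < 0) {
                if (errno == EINTR)
                    continue;           /* interrupted, retry */
                if (errno == EPIPE) {
                    /* peer went away: tear down this connection */
                    fprintf(stderr, "peer closed connection\n");
                    return;
                }
                perror("write");
                return;
            }
            buf += n;                   /* short write: advance, retry */
            len -= (size_t)n;
        }
    }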

Related

POSIX condition variables VS Win32 Event Objects (about spurious wakeup problem)

In POSIX, because of the "spurious wakeup" problem, programmers are forced to use a while loop instead of an if when checking the condition.
I think spurious wakeup is an unintuitive and confusing problem, but I thought it was an inevitable one.
Recently, I found that Win32 event objects don't have the "spurious wakeup" problem.
Why do POSIX systems and others still use condition variables that have the "spurious wakeup" problem, despite the fact that it can be solved?
You ask:
Why do POSIX systems and others still use condition variables that have the "spurious wakeup" problem, despite the fact that it can be solved?
Basically, it's faster than the alternative.
The RATIONALE section of the POSIX.1-2017 treatment of pthread_cond_broadcast and pthread_cond_signal specifically has this to say about "Multiple Awakenings by Condition Signal":
While [the "spurious wakeup"] problem could be resolved, the loss of efficiency for a fringe condition that occurs only rarely is unacceptable, especially given that one has to check the predicate associated with a condition variable anyway. Correcting this problem would unnecessarily reduce the degree of concurrency in this basic building block for all higher-level synchronization operations.
Emphasis added.
The text further observes that forcing "a predicate-testing-loop around the condition wait" is a more robust coding practice than the alternative, because the application will necessarily tolerate superfluous broadcasts and signals from elsewhere in the code.
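That predicate-testing loop is the familiar pattern below (a minimal sketch; the variable names are illustrative):

    #include <pthread.h>
    #include <stdbool.h>

    pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;
    bool ready = false;   /* the predicate */

    void wait_until_ready(void)
    {
        pthread_mutex_lock(&lock);
        /* A while loop, not an if: the wait can wake spuriously, and
         * another waiter may have already consumed the state. */
        while (!ready)
            pthread_cond_wait(&cond, &lock);
        pthread_mutex_unlock(&lock);
    }

    void signal_ready(void)
    {
        pthread_mutex_lock(&lock);
        ready = true;
        pthread_mutex_unlock(&lock);
        pthread_cond_signal(&cond);
    }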

Difference of FICLONE vs FICLONERANGE vs copy_file_range (for copy-on-write support)

I wonder about an efficient way to copy files (on Linux, on a FS which supports copy-on-write (COW)).
Specifically, I want my implementation to use copy-on-write if possible, but otherwise fall back to other efficient variants. I also care about server-side copy (supported by SMB, NFS and others), and about zero-copy (i.e. bypassing the CPU or memory if possible).
(This question is not really specific to any programming language. It could be C or C++, but also any other like Python, Go or whatever has bindings to the OS syscalls, or has any way to do a syscall. If this is confusing to you, just answer in C.)
It looks like ioctl_ficlonerange, ioctl_ficlone (i.e. ioctl with FICLONE or FICLONERANGE) support copy-on-write (COW). Specifically FICLONE is used by GNU cp (here, via --reflink).
Then there is also copy_file_range, which also seems to support COW, and server-side-copy.
(LWN about copy_file_range.)
It sounds as if copy_file_range is more generic (e.g. it supports server-side-copy; not sure if that is supported by FICLONE).
However, copy_file_range seems to have some issues.
E.g. here, Paul Eggert comments:
[copy_file_range]'s man page says it uses a size_t (not off_t) to count the number of bytes to be copied, which is a strange choice for a file-copying API.
Are there situations where FICLONE would work better/different than copy_file_range?
Are there situations where FICLONE would work better/different than FICLONERANGE?
Specifically, assume the underlying FS supports this, and assume you want to copy a file. I ask about these functions' support for:
Copy-on-write support
Server-side copy support
Zero-copy support
Are they (FICLONE, FICLONERANGE, copy_file_range) always performing exactly the same operation? (Assuming the underlying FS supports copy-on-write, and/or server-side copy.)
Or are there situations where it makes sense to use copy_file_range instead of FICLONE? (E.g. COW only works with copy_file_range but not with FICLONE, or the other way around. Or can this never happen?)
Or formulating the same question differently: Would copy_file_range always be fine, or are there situations where I would want to use FICLONE instead?
Why does GNU cp use FICLONE and not copy_file_range? (Is there a technical reason, or is this just historic?)
Related: GNU cp originally did not use reflink by default (see comment by the GNU coreutils maintainer Pádraig Brady).
However, that was changed recently (this commit, bug report 24400), i.e. COW behavior is the default now (if possible) (--reflink=auto).
Related question about Python for COW support.
Related discussion about FICLONE vs copy_file_range by Python developers. I.e. this seems to be a valid question, and it's not totally clear whether to use FICLONE or copy_file_range.
Related Syncthing documentation about the choice of methods for copying data between files, and
Syncthing issue about copy_file_range and others for efficient file copying, e.g. with COW support.
It also suggests that it is not so clear that FICLONE would do the same as copy_file_range, so their solution is to just try all of them, falling back to the next in this order:
ioctl (with FICLONE), copy_file_range, sendfile, duplicate_extents, standard.
Related issue by Go developers on the usage of copy_file_range.
It sounds as if they agree that copy_file_range is always to be preferred over sendfile.
(Question copied from here, but I don't see how it is not focused enough. This question is very focused and asks a very specific thing (whether FICLONE and copy_file_range behave the same), and should be extremely clear. I formulated the question in multiple different ways to make it even clearer. It is also well researched, and should already be valuable to the community as-is with all the references. I would have been very happy to find such a question, even without answers, when I started researching the differences between FICLONE and copy_file_range.)
See the Linux vfs doc about copy_file_range, remap_file_range, FICLONERANGE, FICLONE and FIDEDUPERANGE.
Then see vfs_copy_file_range, which first tries to call remap_file_range if possible.
FICLONE calls ioctl_file_clone (here), and FICLONERANGE calls ioctl_file_clone_range. ioctl_file_clone_range calls the more generic ioctl_file_clone (here). ioctl_file_clone calls vfs_clone_file_range (here). vfs_clone_file_range calls do_clone_file_range, which calls remap_file_range (here).
I.e. that answers the question: copy_file_range is more generic, and in any case it tries to call remap_file_range (i.e. the same path as FICLONE/FICLONERANGE) first internally.
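For reference, a minimal sketch of the usual copy_file_range() loop (glibc 2.27 or later; the loop is needed because each call may copy fewer bytes than requested):

    #define _GNU_SOURCE
    #include <unistd.h>
    #include <sys/types.h>

    /* Copy len bytes from fd_in to fd_out using the file offsets
     * of both descriptors. Returns 0 on success, -1 on error/EOF. */
    int copy_all(int fd_in, int fd_out, off_t len)
    {
        while (len > 0) {
            ssize_t n = copy_file_range(fd_in, NULL, fd_out, NULL,
                                        (size_t)len, 0);
            if (n <= 0)
                return -1;   /* error, or unexpected end of input */
            len -= n;
        }
        return 0;
    }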
I think the copy_file_range syscall is slightly newer than FICLONE though, i.e. it might be possible that copy_file_range is not available in your kernel but FICLONE is.
In any case, if copy_file_range is available, it should be the best solution.
The order done by Syncthing (ioctl (with FICLONE), copy_file_range, sendfile, duplicate_extents, standard) makes sense.
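A sketch of the first two steps of that chain, reusing copy_all() from above (the particular errno values treated as "not supported" are an assumption and may need tuning per kernel/FS):

    #define _GNU_SOURCE
    #include <sys/ioctl.h>
    #include <linux/fs.h>    /* FICLONE */
    #include <errno.h>
    #include <unistd.h>

    /* Try a whole-file COW clone first; if the FS or kernel cannot
     * do it, fall back to the copy_file_range() loop. */
    int clone_or_copy(int fd_in, int fd_out, off_t len)
    {
        if (ioctl(fd_out, FICLONE, fd_in) == 0)
            return 0;                        /* reflink succeeded */
        if (errno != EOPNOTSUPP && errno != ENOTTY &&
            errno != EXDEV && errno != EINVAL)
            return -1;                       /* a real error */
        return copy_all(fd_in, fd_out, len); /* generic fallback */
    }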

Does a plain read of a variable that is updated with interlocked functions always return the latest value?

If you only change a MyInt: Integer variable in one or more threads with one of the interlocked functions, let's say InterlockedIncrement, can we guarantee that after the InterlockedIncrement is executed, a plain read of the variable in any thread will return the latest updated value? Yes or no, and why?
If not, is it possible to achieve that in Delphi? Note that I'm talking about only one variable, no need to worry about consistency about two or more variables.
The root problem and doubt seem essentially the same as in this SO post, but that one is targeted at C#, and I'm using Delphi 2007, so I have no access to volatile, nor to newer versions of Delphi. In that discussion, two major problems were raised that seem to affect Delphi as well:
The cache of the processor reading the variable may not be updated.
The compiler may optimize the code in a way that causes problems to read.
If this is really a problem, I'm very worried about using even a simple counter with InterlockedIncrement, or solutions like the lock-free initialization proposed here, and would rather go with plain critical sections or multi-reader/single-writer locks for safety.
Initial analysis
This is what I've found so far, but feel free to address the problems in other ways if appropriate, or even to raise other unknown problems, so that the objective of the question can be achieved:
For problem 1: I expected that the "full fence" would also force the caches of other processors to be updated... but reading around, that seems not to be the case. It looks like the cache would only be updated if a "read barrier" (or whatever it is called) were issued on the processor that will read the variable. If this is true, is there a way to issue such a "read barrier" in Delphi, just before reading the variable? A full fence seems to imply both read and write barriers, so that would also be OK. Since there is no InterlockedRead function, according to the discussion in the first post, could we (just speculating) work around this with something like InterlockedCompareExchange (ugh... writing the variable just to be able to read it smells bad), or maybe low-level "lock"-prefixed assembly (which could be encapsulated)?
For problem 2: would Delphi optimizations have an impact on this matter? Is there any way to avoid it?
Edit: The solution must work in D2007, but preferably I'd like not to make a possible future migration to newer Delphi any harder, and to be able to use the same piece of code on ARM as well (this became clear to me after David's comments). So, if possible, it would be nice if the solution were not coupled to the x86/64 memory model. It would be nice if I only needed to replace the plain Windows.pas interlocked functions with whatever provides the same interlocked functionality in newer Delphi/ARM, without having to review the logic for ARM (one less concern).
But do the interlocked functions have enough abstraction from the CPU architecture in this case? Problem 1 suggests they don't, but I'm not sure if that would affect ARM Delphi. Is there any way around it that keeps things simple and still allows relevantly better performance than critical sections and similar sync objects?
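(For comparison only, since I can't express this in D2007: in C11 atomics, the missing "interlocked read" would just be an acquire load, which is the kind of operation I'm looking for a Delphi equivalent of.)

    #include <stdatomic.h>

    atomic_int counter;      /* updated only via atomic operations */

    void bump(void)          /* roughly InterlockedIncrement */
    {
        atomic_fetch_add_explicit(&counter, 1, memory_order_seq_cst);
    }

    int read_current(void)   /* the "InterlockedRead" that Win32 lacks */
    {
        /* An acquire load pairs with the interlocked updates; a
         * plain read of a non-atomic variable would be a data race. */
        return atomic_load_explicit(&counter, memory_order_acquire);
    }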

What happened to libgreen?

As far as I understand, libgreen is not part of the Rust standard library anymore. Also, I can't find a separate libgreen package. There are a few alternatives - coroutine, which does not provide actual green threads for now, and green-rs, which is broken. Do I understand correctly that, for now, there are no lightweight Go-like processes in Rust?
You are correct that there's no lightweight tasking library in std (or the rest of the main distribution), that green doesn't compile and that coroutine doesn't seem to fully handle the threading aspect yet. I do not know of any other library in this space.
As for what happened: the RFC linked to by that issue—RFC 230—is the canonical source of information. The summary is that it was found that the method by which green threading/IO was handled (std tried to abstract across both models, allowing them to be used interoperably automagically) was not worth the downsides. Now, std aims to just provide a minimum base-line of useful support: for IO/threading, that means "thin", safe wrappers for operating system functionality.
Read this https://aturon.github.io/blog/2016/08/11/futures/ and also Steve Klabnik's response in the comments:
In the beginning, Rust had only green threads. Eventually, it was decided that a systems language without systems threads is... strange. So we needed to add them. Why not add choice? Since the interfaces could be the same, why not abstract over them, and you could just choose which one you wanted?
At the same time, the problems with green threads by default were becoming issues. Segmented stacks cause slow C interop. You need a runtime to manage them, etc. Furthermore, the overall abstraction was causing an unacceptable cost. The green threads weren't very green. Plus, with the need to actually release someday looming, decisions needed to be made regarding tradeoffs. And since Rust is supposed to be a systems language, having 1:1 threads and basically no runtime makes more sense than N:M threads and a runtime. So libgreen was removed, and the interface was re-done to be 1:1 thread centric.
The 'release someday looming' is a big part of it. We want to be really stable with Rust, and with all the things to do to actually ship a 1.0, we didn't want to crystallize an interface we weren't happy with. Heck, we pulled out a lot of libraries that are even less important for similar reasons, like rand. Engineering is all about tradeoffs, and we decided to choose minimalism.
mio is a non starter for us, as are most of the other async i/o frameworks for Rust, because we need Windows, and besides, we don't want to get locked into an expensive-to-replace library which may get orphaned.
Totally understood here, especially in the general case. In the specific case, mio is going to either have Windows support, or a Windows-specific version of mio is going to be released, with a higher-level package providing the features for all platforms. And in this case, it's maintained by one of the people who's currently using Rust heavily in production, so it's not likely to go away anytime soon. But, unless you're actively involved, it's hard to know things like that, which is, of itself, an issue.
One of the reasons we were comfortable removing libgreen is that you can write your own libraries to do different kinds of IO. 1.0 is a strong core that we feel good about stabilizing forever, not the final bit. Libraries like https://github.com/carllerche/mio can test out different ways of handling things like async IO, and, when they're mature enough, we can always pull them back into the standard library if need be. But in the meantime, it's just one line in your Cargo.toml to add them.
And this text from reddit:
Unfortunately they ended up canning the greenlet support because theirs were slower than kernel threads, which in turn demonstrates someone didn't understand how to get a language compiler to generate stackless coroutines effectively (not surprising, the number of engineers wired the right way is not many in this world, but see http://www.reddit.com/r/rust/comments/2l0a4b/do_rust_web_servers_use_libuv_through_libgreen_or/ for more detail). And they canned the async i/o because libuv is "slow" (which it is only because it is single threaded only, plus forces a malloc + free per async operation as the buffers must last until completion occurs, plus it enforces a penalty over synchronous i/o; see http://blog.kazuhooku.com/2014/09/the-reasons-why-i-stopped-using-libuv.html), which was a real shame - they should have taken the opportunity to replace libuv with something better (hint: ASIO + AFIO, and yes I know they are both C++, but Rust could do with much better C++ interop than the presently none it currently has) instead of canning always-async-everything in what could have been an amazing step up from C++ with most of the benefits of Erlang without the disadvantages of Erlang.
For newcomers, there is now may, a crate that implements green threads similar to goroutines.

Using static boolean vs. critical section for concurrency

So I'm designing a new software interface for a USB HID device, and I have a question about concurrency protection. I assume that I will have to add concurrency protection around the calls to ReadFile and WriteFile (please correct me if I'm wrong), as these may be called from different threads in my design.
In the past I have sometimes used static booleans to implement thread safety, adding a loop with a 1 ms wait until the bool indicated that the code was safe to enter. I have also used CriticalSections. Could anyone tell me if CriticalSections are fundamentally better than using a static bool? I know that I won't have to code up a waiting loop, but what polling rate do they use in VC++ to check the state of the lock? Are they hooked into the OS in some way that makes them better? Is using a bool for a concurrency check not always safe? Etc.
I don't know much about C++, but concurrency usually isn't implemented by polling, and generally shouldn't be—polling wastes processor time and wastes energy. The two main low-level approaches are
Blocking on a lock, perhaps with a timeout.
Using a lock-free primitive, most likely either compare-and-set or compare-and-swap.
These approaches are both supported by typical modern hardware. They lead to very different "flavors" of interaction. Writing lock-free data structures is best left to the experts (they tend to be complicated, and, more importantly, it tends to be difficult to see whether they are correct and whether they guarantee progress in the face of contention—their descriptions are usually accompanied by pages of proofs). Fortunately, you can get libraries of them, and they are faster and better-behaved than blocking ones in many but not all cases.
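For the concrete use case in the question, blocking on a lock would look roughly like this on Windows (a sketch; g_device and the function names are hypothetical):

    #include <windows.h>

    static CRITICAL_SECTION g_dev_lock;
    static HANDLE g_device;   /* opened elsewhere with CreateFile() */

    void dev_lock_init(void)
    {
        InitializeCriticalSection(&g_dev_lock);
    }

    BOOL dev_write(const void *buf, DWORD len)
    {
        DWORD written = 0;
        /* Blocks efficiently in the OS until the lock is free; no
         * hand-rolled 1 ms polling loop needed. */
        EnterCriticalSection(&g_dev_lock);
        BOOL ok = WriteFile(g_device, buf, len, &written, NULL);
        LeaveCriticalSection(&g_dev_lock);
        return ok && written == len;
    }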
Well, never mind. I'll just stick with critical sections. I was hoping to get some interesting information about how critical sections work, or at least a reference to an article about that, i.e., why they are different from just writing your own polling loop. I see from this discussion (std::mutex performance compared to win32 CRITICAL_SECTION) that there is some confusion about how std::mutex works, but I'm thinking that it is better to use CRITICAL_SECTIONs, as that seems to be the surest way to get the fastest concurrency protection on Windows.
Thanks anyways.
