Multithreading: Read from / write to a pipe

I write some data to a pipe - possibly lots of data and at random intervals. How do I read the data from the pipe?
Is this ok:
in the main thread (current process) create two more threads (2 and 3)
the second thread sometimes writes to the pipe (and flushes it?)
the third thread has an infinite loop which reads from the pipe (and then sleeps for some time)
Is this so far correct?
Now, there are a few things I don't understand:
do I have to lock (mutex?) the pipe on write?
IIRC, when writing to a pipe and its buffer gets full, the write end will block until I read the already-written data, right? How do I check for readable data in the pipe, not too often and not too rarely, so that the second thread won't block? Is there something like select for pipes?
Is it possible to set the pipe to unbuffered mode, or do I have to flush it regularly - which one is better?
Should I create one more thread, just for flushing the pipe after a write? Because flush blocks as well when the buffer is full, right? I just don't want the 1st and 2nd threads to block...
[Edit]
Sorry, I thought the question was platform-agnostic, but just in case: I'm looking at this from a Win32 perspective, possibly MinGW C...

I'm not answering all of your questions here because there are a lot of them, but in answer to:
do I have to lock (mutex?) the pipe on write?
The answer to this question is platform specific, but in most cases I would guess yes.
It comes down to whether the write/read operations on the pipe are atomic. If either the read or write operation is non-atomic (most likely the write) then you will need to lock the pipe on writing and reading to prevent race conditions.
For example, let's say a write to the pipe compiles down to 2 instructions in machine code:
INSTRUCTION 1
INSTRUCTION 2
Let's say you get a thread context switch between these 2 instructions, and your reading thread attempts to read the pipe while it is in an intermediate state. This could result in a crash, or (worse) data corruption, which can often manifest itself as a crash somewhere else in the code. Race conditions like this are often non-deterministic and difficult to diagnose or reproduce.
In general, unless you can guarantee that all threads will be accessing the shared resource using an atomic instruction set, you must use mutexes or critical sections.
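As a rough sketch of what that looks like (POSIX threads are assumed here purely for illustration; the OP's Win32 build would use a CRITICAL_SECTION or a mutex HANDLE in exactly the same pattern, and pipe_write_fd / locked_pipe_write are made-up names):

#include <pthread.h>
#include <unistd.h>

static int pipe_write_fd;                      /* write end of the pipe */
static pthread_mutex_t pipe_lock = PTHREAD_MUTEX_INITIALIZER;

/* All writer threads go through this function, so two writes can never
   interleave in the middle of each other. */
static ssize_t locked_pipe_write(const void *buf, size_t len)
{
    pthread_mutex_lock(&pipe_lock);
    ssize_t n = write(pipe_write_fd, buf, len);
    pthread_mutex_unlock(&pipe_lock);
    return n;
}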

Related

Thread Block Operation

I asked two questions (1, 2) about reading a file using a thread so that the main thread doesn't get blocked. My problem wasn't so much writing and starting a thread; my problem was that I didn't understand which operations were blocking. I've been told before that reading from a file is blocking, and that the for loop in my second example is blocking. I don't really understand why, or how to spot it when looking at a piece of code.
So my question obviously is, how do you spot or determine when an operation is blocking a thread, and how do you fix it?
So my question obviously is, how do you spot or determine when an operation is blocking a thread
There's no magic way to do it; in general you have to read the documentation for whatever functions you call in order to get an idea about whether they are guaranteed to return quickly or whether they might block for an extended period of time.
If you're looking at a running program and want to know what its threads are currently doing, you can either watch them using a debugger, or insert print statements at various locations so that you can tell (by seeing what text gets printed to stdout and what text doesn't) roughly where the thread is at and what it is doing.
, and how do you fix it?
Blocking is not "broken", so there's nothing to fix. Blocking is intentional behavior, so that e.g. when you call a function that reads from disk, it can provide you with some useful data when it returns. (Consider an alternative non-blocking read, which would always return immediately, but in most cases wouldn't be able to provide you with any data, since the hard drive hasn't had time to load in any data yet -- not terribly useful.)
That said, for network operations you can set your socket(s) to non-blocking mode so that calls to send(), recv(), etc are guaranteed never to block (they will return an error code instead). That only works for networking though; most OS's don't support non-blocking I/O for disk access.
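A minimal sketch of switching a POSIX socket into non-blocking mode (set_nonblocking is a made-up helper; on Windows the equivalent call would be ioctlsocket(sock, FIONBIO, &mode)):

#include <fcntl.h>

/* After this call, send()/recv() on 'sock' return immediately with
   EAGAIN/EWOULDBLOCK instead of blocking when no data can be moved. */
static int set_nonblocking(int sock)
{
    int flags = fcntl(sock, F_GETFL, 0);
    if (flags < 0)
        return -1;
    return fcntl(sock, F_SETFL, flags | O_NONBLOCK);
}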

Meaning of atomicity of POSIX pipe write

According to the POSIX standard, writes to a pipe are guaranteed to be atomic (if the data size is less than PIPE_BUF).
As far as I understand, this means that any thread trying to write to the pipe will never access the pipe in the middle of another thread's write. What's not clear to me is how this is achieved and whether this atomicity guarantee has other implications.
Does this simply mean that the writing thread acquires a lock somewhere inside the write function?
Is the thread that's writing to the pipe guaranteed to never be scheduled out of context during the write operation?
Pipe writes are atomic up to PIPE_BUF bytes. Let's assume PIPE_BUF is 4 KB; then writes are atomic as long as data_size < 4 KB. On POSIX systems, the kernel uses internal mutexes and locks the file descriptors for the pipe, then allows the requesting thread to write. If any other thread requests a write at that point, it has to wait for the first thread. After that the file descriptors are unlocked, so the other waiting threads can write to the pipe. So yes, the kernel will not allow more than one thread to write to the pipe at the same time.
However, there is an edge case to think about. If data close to the 4 KB limit has already been written and has not been read yet, a further write could push the total past that limit. The atomicity guarantee only covers individual writes of up to PIPE_BUF bytes; it says nothing about the combined contents of the pipe, and a writer in that situation will simply block until the reader drains some data.
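A small sketch of what the guarantee buys you in practice (write_record is a made-up helper; the point is only that a single write() of at most PIPE_BUF bytes arrives as one contiguous record, even with several concurrent writers):

#include <limits.h>   /* PIPE_BUF */
#include <string.h>
#include <unistd.h>

static int write_record(int pipe_fd, const char *msg)
{
    size_t len = strlen(msg);
    if (len > PIPE_BUF)
        return -1;    /* larger writes may be split and interleaved */
    return (int)write(pipe_fd, msg, len);
}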

Semaphores & threads - what is the point?

I've been reading about semaphores and came across this article:
www.csc.villanova.edu/~mdamian/threads/posixsem.html
So, this page states that if there are two threads accessing the same data, things can get ugly. The solution is to allow only one thread to access the data at the same time.
This is clear and I understand the solution, only why would anyone need threads to do this? What is the point? If the threads are blocked so that only one can execute, why use them at all? There is no advantage. (Or maybe this is just a dumb example; in that case please point me to a sensible one.)
Thanks in advance.
Consider this:
#include <semaphore.h>

/* binary semaphore protecting the shared counter; initialised elsewhere
   with sem_init( &g_shared_variable_mutex, 0, 1 ) */
sem_t g_shared_variable_mutex;
int   g_shared_variable;

void update_shared_variable() {
    sem_wait( &g_shared_variable_mutex );
    g_shared_variable++;
    sem_post( &g_shared_variable_mutex );
}

void thread1() {
    do_thing_1a();
    do_thing_1b();
    do_thing_1c();
    update_shared_variable(); // may block
}

void thread2() {
    do_thing_2a();
    do_thing_2b();
    do_thing_2c();
    update_shared_variable(); // may block
}
Note that all of the do_thing_xx functions still happen simultaneously. The semaphore only comes into play when the threads need to modify some shared (global) state or use some shared resource. So a thread will only block if another thread is trying to access the shared thing at the same time.
Now, if the only thing your threads are doing is working with one single shared variable/resource, then you are correct - there is no point in having threads at all (it would actually be less efficient than just one thread, due to context switching.)
When you are using multithreading, not every piece of code that runs will be blocking. For example, if you had a queue and two threads reading from that queue, you would make sure that no two threads read from it at the same time, so that part would be blocking - but that's the part that will probably take the least time. Once you have retrieved the item to process from the queue, all the rest of the code can run in parallel.
The idea behind threads is to allow simultaneous processing. A shared resource must be governed to avoid things like deadlocks or starvation. If something can take a while to process, then why not create multiple instances of those processes to allow them to finish faster? The bottleneck is just what you mentioned: when a process has to wait for I/O.
When the time spent blocked waiting for the shared resource is small compared to the processing time, that is when you want to use multiple threads.
This is of course an SSCCE (Short, Self-Contained, Correct Example).
Let's say you have 2 worker threads that do a lot of work and write the results to a file.
You only need to lock access to the file (the shared resource).
The problem with trivial examples....
If the problem you're trying to solve can be broken down into pieces that can be executed in parallel then threads are a good thing.
A slightly less trivial example - imagine a for loop where the data being processed in each iteration is different every time. In that circumstance you could execute each iteration of the for loop simultaneously in separate threads. And indeed some compilers, like Intel's, will convert suitable for loops to threads automatically for you. In that particular circumstance no semaphores are needed, because the iterations' data are independent.
But say you were wanting to process a stream of data, and that processing had two distinct steps, A and B. The threadless approach would involve reading in some data, doing A, then B, and then outputting the data before reading more input. Or you could have one thread reading and doing A, and another thread doing B and the output. So how do you get the interim result from the first thread to the second?
One way would be to have a memory buffer to contain the interim result. The first thread could write the interim result to a memory buffer and the second could read from it. But with two threads operating independently there's no way for the first thread to know if it's safe to overwrite that buffer, and there's no way for the second to know when to read from it.
That's where you can use semaphores to synchronise the action of the two threads. The first thread takes a semaphore that I'll call empty, fills the buffer, and then posts a semaphore called filled. Meanwhile the second thread will take the filled semaphore, read the buffer, and then post empty. So long as filled is initialised to 0 and empty is initialised to 1 it will work. The second thread will process the data only after the first has written it, and the first won't write it until the second has finished with it.
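A rough sketch of that scheme with POSIX semaphores (the stage_a/stage_b names and the single int buffer are placeholders for illustration, not anything from the question):

#include <pthread.h>
#include <semaphore.h>

static sem_t empty, filled;   /* sem_init(&empty, 0, 1); sem_init(&filled, 0, 0); */
static int buffer;            /* the interim result passed from A to B */

static void *stage_a(void *arg)          /* first thread: does step A */
{
    (void)arg;
    for (;;) {
        int result = 0;                  /* ... compute step A's result ... */
        sem_wait(&empty);                /* wait until the buffer is free */
        buffer = result;
        sem_post(&filled);               /* tell stage B the buffer is full */
    }
    return NULL;
}

static void *stage_b(void *arg)          /* second thread: does step B */
{
    (void)arg;
    for (;;) {
        sem_wait(&filled);               /* wait until stage A has written */
        int value = buffer;
        sem_post(&empty);                /* buffer may now be overwritten */
        (void)value;                     /* ... do step B with value ... */
    }
    return NULL;
}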
It's only worth it of course if the amount of time each thread spends processing data outweighs the amount of time spent waiting for semaphores. This limits the extent to which splitting code up into threads yields a benefit. Going beyond that tends to mean that the overall execution is effectively serial.
You can do multithreaded programming without semaphores at all. There's the Actor model or Communicating Sequential Processes (the one I favour). It's well worth looking up JCSP on Wikipedia.
In these programming styles data is shared between threads by sending it down communication channels. So instead of using semaphores to grant another thread access to data, it is sent a copy of that data down something a bit like a network socket, or a pipe. The advantage of CSP (in which a send only completes once the receiver has read the data) is that it stops you falling into the many, many pitfalls that plague multithreaded programs. It sounds inefficient (copying data is inefficient), but actually it's not so bad with Intel's QPI architecture or AMD's HyperTransport. And it means that the 'channel' really could be a network connection; scalability is built in by design.

Sharing stdout among multiple threads/processes

I have a Linux program (the language doesn't matter) which prints its log to stdout.
The log IS needed for monitoring of the process.
Now I'm going to parallelize it by fork'ing or using threads.
The problem: the resulting stdout will contain an unreadable mix of unrelated lines...
And finally the question: how would you re-construct the output logic for parallel processes?
Sorry for answering myself...
The definite solution was to use the GNU parallel utility.
It can serve as a replacement for the well-known xargs utility, but runs the commands in parallel, separating the output into groups.
So I just left my simple one-process, one-thread utility as-is and piped its call through parallel like this:
generate-argument-list | parallel [options] my-utility
This, depending on parallel's options, can produce nicely grouped output for multiple calls of my-utility.
If you're in C++ I'd consider using Pantheios or Boost::Log, or have a look at Logging In C++ : Part 2.
If you're in another language then file locking around the I/O operations is probably the way to go (see File Locks). You can achieve the same results using semaphores or any other process-control mechanism, but for me file locks are the easiest.
You could also consider using syslog if this monitoring is considered as system wide.
Another approach, which we use, is to delegate logging to a dedicated logger thread. All other threads wishing to log send their messages to the logger thread. This method gives you flexibility, as the formatting of logs can be done in a single place, which can also be made configurable. If you don't want to worry about locks, you can use sockets for the message passing.
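A minimal sketch of such a logger thread (pthreads, a fixed-size queue and the names log_message / logger_thread are assumptions for illustration; the answer above doesn't prescribe an implementation):

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define QSIZE 128

static char *queue[QSIZE];
static int head, tail, count;
static pthread_mutex_t qlock     = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  not_empty = PTHREAD_COND_INITIALIZER;
static pthread_cond_t  not_full  = PTHREAD_COND_INITIALIZER;

void log_message(const char *msg)        /* called from any worker thread */
{
    pthread_mutex_lock(&qlock);
    while (count == QSIZE)               /* naive back-pressure: wait if full */
        pthread_cond_wait(&not_full, &qlock);
    queue[tail] = strdup(msg);
    tail = (tail + 1) % QSIZE;
    count++;
    pthread_cond_signal(&not_empty);
    pthread_mutex_unlock(&qlock);
}

void *logger_thread(void *arg)           /* the only thread that touches stdout */
{
    (void)arg;
    for (;;) {
        pthread_mutex_lock(&qlock);
        while (count == 0)
            pthread_cond_wait(&not_empty, &qlock);
        char *msg = queue[head];
        head = (head + 1) % QSIZE;
        count--;
        pthread_cond_signal(&not_full);
        pthread_mutex_unlock(&qlock);
        printf("%s\n", msg);             /* formatting happens in one place */
        free(msg);
    }
    return NULL;
}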
If it's multithreaded, then you'll need to mutex-protect printing/writing to the stdout log. The most common way to do this on Linux in C/C++ is with pthread_mutex. Additionally, if it's C++, Boost has synchronization primitives that could be used.
To implement it, you should probably encapsulate all the logging in one function or object, and internally lock and unlock the mutex.
If the blocking cost of logging becomes prohibitive, you could consider buffering the log messages (in the aforementioned object or function) and only writing to stdout when the buffer is full. You'll still need mutex protection when buffering, but buffering will be faster than writing to stdout.
If each thread has its own logging messages, they will all still need to share the same mutex for writing to stdout. In this case it would probably be best for each thread to buffer its respective log messages and only write to stdout when its buffer is full, thus only acquiring the mutex for the actual write to stdout.
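A minimal sketch of the mutex-protected logging function described above (the name locked_log is made up; the buffering variant would append to a per-thread buffer and call something like this only when flushing):

#include <pthread.h>
#include <stdarg.h>
#include <stdio.h>

static pthread_mutex_t log_lock = PTHREAD_MUTEX_INITIALIZER;

/* Every thread logs through this one function, so whole lines reach stdout
   without being interleaved with lines from other threads. */
void locked_log(const char *fmt, ...)
{
    va_list ap;
    va_start(ap, fmt);
    pthread_mutex_lock(&log_lock);
    vfprintf(stdout, fmt, ap);
    fputc('\n', stdout);
    pthread_mutex_unlock(&log_lock);
    va_end(ap);
}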

When should I use critical sections?

Here's the deal. My app has a lot of threads that do the same thing - read specific data from huge files (>2 GB), parse the data and eventually write to that file.
The problem is that sometimes one thread may read X from file A while a second thread writes to X of that same file A. Would a problem occur?
The I/O code uses TFileStream for every file. I split the I/O code out to be local (a static class) because I'm afraid there will be a problem. Since it's split, there should be critical sections.
Every case below is local (static) code that is not instantiated.
Case 1:
procedure Foo(obj: TObject);
begin ... end;
Case 2:
procedure Bar(obj: TObject);
var i: Integer;
begin
  for i := 0 to X do ... {something}
end;
Case 3:
function Foo(obj: TObject; j: Integer): TSomeObject;
var i: Integer;
begin
  for i := 0 to X do
    for j := 0 to Y do
      Result := {something}
end;
Question 1: In which case do I need critical sections so there are no problems if >1 threads call it at same time?
Question 2: Will there be a problem if Thread 1 reads X (an entry) from file A while Thread 2 writes to X (the same entry) of file A?
When should I use critical sections? I try to imagine it in my head, but it's hard - only one thread :))
EDIT
Is this going to suit it?
{a class for every 2GB file}
TSpecificFile = class
  cs: TCriticalSection;
  ...
end;

TFileParser = class
  TheFile: TSpecificFile;
  procedure ParseThis;
  procedure ParseThat;
  ...
end;

function Read(aFile: TSpecificFile): TSomeObject;
begin
  aFile.cs.Enter;
  try
    ... // read
  finally
    aFile.cs.Leave;
  end;
end;

function Write(aFile: TSpecificFile): TSomeObject;
begin
  aFile.cs.Enter;
  try
    ... // write
  finally
    aFile.cs.Leave;
  end;
end;
Now will there be a problem if two threads call Read with:
case 1: same TSpecificFile
case 2: different TSpecificFile?
Do I need another critical section?
In general, you need a locking mechanism (critical sections are a locking mechanism) whenever multiple threads may access a shared resource at the same time, and at least one of the threads will be writing to / modifying the shared resource.
This is true whether the resource is an object in memory or a file on disk.
And the reason the locking is necessary is that if a read operation happens concurrently with a write operation, the read operation is likely to obtain inconsistent data, leading to unpredictable behaviour.
Stephen Cheung has mentioned the platform-specific considerations with regard to file handling, and I'll not repeat them here.
As a side note, I'd like to highlight another concurrency concern that may be applicable in your case.
Suppose one thread reads some data and starts processing.
Then another thread does the same.
Both threads determine that they must write a result to position X of File A.
At best the values to be written are the same, and one of the threads effectively did nothing but waste time.
At worst, the calculation of one of the threads is overwritten, and the result is lost.
You need to determine whether this would be a problem for your application. And I must point out that if it is, just locking the read and write operations will not solve it. Furthermore, trying to extend the duration of the locks leads to other problems.
Options
Critical Sections
Yes, you can use critical sections.
You will need to choose the best granularity of the critical sections: One per whole file, or perhaps use them to designate specific blocks within a file.
The decision would require a better understanding of what your application does, so I'm not going to answer for you.
Just be aware of the possibility of deadlocks:
Thread 1 acquires lock A
Thread 2 acquires lock B
Thread 1 desires lock B, but has to wait
Thread 2 desires lock A - causing a deadlock, because each thread is now waiting for a lock that the other will never release.
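The usual way out of that scenario is to make every thread acquire the locks in the same fixed order. A tiny sketch (in C with pthreads purely for illustration; the same idea applies to TCriticalSection):

#include <pthread.h>

static pthread_mutex_t lock_a = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t lock_b = PTHREAD_MUTEX_INITIALIZER;

/* Every thread that needs both resources takes A before B, so the
   circular wait described above cannot arise. */
void use_both_resources(void)
{
    pthread_mutex_lock(&lock_a);
    pthread_mutex_lock(&lock_b);
    /* ... work with both shared resources ... */
    pthread_mutex_unlock(&lock_b);
    pthread_mutex_unlock(&lock_a);
}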
I'm also going to suggest 2 other tools for you to consider in your solution.
Single-Threaded
What a shocking thing to say! But seriously, if your reason for going multi-threaded was "to make the application faster", then you went multi-threaded for the wrong reason. Most people who do that actually end up making their applications more difficult to write, less reliable, and slower!
It is a far too common misconception that multiple threads speed up applications. If a task requires X clock cycles to perform, it will take X clock cycles! Multiple threads don't speed up tasks; they permit multiple tasks to be done in parallel. But this can be a bad thing! ...
You've described your application as being highly dependent on reading from disk, parsing what's read and writing to disk. Depending on how CPU-intensive the parsing step is, you may find that all your threads spend the majority of their time waiting for disk I/O operations. In that case, the multiple threads generally only serve to shunt the disk heads to the far 'corners' of your (ummm, round) disk platters. Disk I/O is still the bottleneck, and the threads make it behave as if the files are maximally fragmented.
Queueing Operations
Let's suppose your reasons for going multi-threaded are valid, and you do still have threads operating on shared resources. Instead of using locks to avoid concurrency issues, you could queue your shared-resource operations onto specific threads.
So instead of Thread 1:
Reading position X from File A
Parsing the data
Writing to position Y in file A
Create another thread; the FileA thread:
the FileA thread has a queue of instructions
When it gets to the instruction to read position X, it does so.
It sends the data to Thread 1
Thread 1 parses its data --- while FileA thread continues processing instructions
Thread 1 places an instruction to write its result to position Y at the back of FileA thread's queue --- while FileA thread continues to process other instructions.
Eventually the FileA thread will write the data as required by Thread 1.
Synchronization is only needed for shared data that can cause a problem (or an error) if more than one agent is doing something with it.
Obviously the file-writing operation should be wrapped in a critical section for that file only, if you don't want other writer processes to trample on the new data before the write is completed -- the file may no longer be consistent if half of the new data is modified by another process that does not see the other half (which hasn't been written out by the original writer process yet). Therefore you'll have a collection of CS's, one for each file. Each CS should be released as soon as possible when you're done writing.
In certain cases, e.g. memory-mapped files or sparse files, the O/S may allow you to write to different portions of the file at the same time. Therefore, in such cases, your CS will have to be on a particular segment of the file. Thus you'll have a collection of CS's (one for each segment) for each file.
If you write to a file and read it at the same time, the reader may get inconsistent data. On some O/S's, a read is allowed to happen simultaneously with a write (perhaps the read is served from cached buffers), but what you read may not be correct. If you need consistent data on reads, then the reader should also be subject to the critical section.
In certain cases, if you write to one segment and read from another segment, the O/S may allow it. However, whether this returns correct data usually cannot be guaranteed, because you can't always tell whether two segments of the file reside in the same disk sector, or how other low-level O/S details come into play.
So, in general, the advice is to wrap any file operation in a CS, per file.
Theoretically, you should be able to read simultaneously from the same file, but locking it in a CS only allows one reader at a time. In that case, you'll need to separate your implementation into "read locks" and "write locks" (similar to a database system). This is highly non-trivial, though, as you'll then have to deal with promoting between the different lock levels.
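A sketch of that read-lock/write-lock split using a POSIX read-write lock (illustration only; in Delphi the closest ready-made equivalent would be something like TMultiReadExclusiveWriteSynchronizer):

#include <pthread.h>

static pthread_rwlock_t file_lock = PTHREAD_RWLOCK_INITIALIZER;

void read_from_file(void)
{
    pthread_rwlock_rdlock(&file_lock);   /* shared: many readers at once */
    /* ... read ... */
    pthread_rwlock_unlock(&file_lock);
}

void write_to_file(void)
{
    pthread_rwlock_wrlock(&file_lock);   /* exclusive: blocks readers and writers */
    /* ... write ... */
    pthread_rwlock_unlock(&file_lock);
}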
After note: the kind of thing you're trying to do (reading and writing huge, GB-sized data sets simultaneously in segments) is what is typically done in a database. You should look into breaking your data files into database records. Otherwise, you either suffer from non-optimized read/write performance due to locking, or you end up re-inventing the relational database.
Conclusion first
You don't need TCriticalSection. You should implement a Queue-based algorithm that guarantees no two threads are working on the same piece of data, without blocking.
How I got to that conclusion
First of all, Windows (Win 7?) will allow you to write simultaneously to a file as many times as you see fit. I have no idea what it does with the writes, and I'm clearly not saying it's a good idea, but I've just done the following test to prove that Windows allows simultaneous multiple writes to the same file:
I made a thread that opens a file for writing (with "share deny none") and keeps writing random stuff to a random offset for 30 seconds. Here's a pastebin with the code.
Why a TCriticalSection would be bad
A critical section only allows one thread to access the protected resource at any given time. You have two options: only hold the lock for the duration of the read/write operation, or hold the lock for the entire time required to process the given resource. Both have serious problems.
Here's what might happen if a thread holds the lock only for the duration of the read/write operations:
Thread 1 acquires the lock, reads the data, releases the lock
Thread 2 acquires the lock, reads the same data, releases the lock
Thread 1 finishes processing, acquires the lock, writes the data, releases the lock
Thread 2 acquires the lock, writes the data, and here's the oops: Thread 2 has been working on old data, since Thread 1 made changes in the background!
Here's what might happen if a thread holds the lock for the entire round-trip read & write operation:
Thread 1 acquires the lock, starts reading data
Thread 2 tries to acquire the same lock, gets blocked...
Thread 1 finishes reading the data, processes the data, writes the data back to file, releases the lock
Thread 2 acquires the lock and starts processing the same data again!
The Queue solution
Since you're multi-threading, and you can have multiple threads simultaneously processing data from the same file, I assume data is somehow "context free": You can process the 3rd part of a file before processing the 1st. This must be true, because if it's not, you can't multi-thread (or are limited to 1 thread per file).
Before you start processing you can prepare a number of "Jobs", that look like this:
File 'file1.raw', offset 0, 1024 Kb
File 'file1.raw', offset 1024, 1024 kb.
...
File 'fileN.raw', offset 99999999, 1024 kb
Put all those "jobs" in a queue. Have your threads dequeue one Job from the queue and process it. Since no two jobs overlap, threads don't need to synchronize with each other, so you don't need the critical section. You only need the critical section to protect access to the Queue itself. Windows makes sure threads can read and write to/from the files just fine, as long as they stick to the allocated "Job".
