Synchronizing file processing threads across servers in Linux

I need to build a linux service/daemon which processes files. This daemon will most likely be multi-threaded and most likely be running on more than one node. What would be the best way to synchronize the threads of all daemons such that no two threads are processing the same file?
A couple of ideas came to mind, but I'm wondering whether there is a better approach, as I'm new to Linux.
Create a directory structure such that only one daemon processes a directory. The daemon itself should be able to easily synchronize the threads within it such that no two threads are processing the same file.
Devise some mechanism using open() and perhaps file attributes such that a process can open a file exclusively only while the file is in some initial state (say, some attribute not yet set); the process then changes that state by setting the attribute, and that daemon can process the file knowing that no other daemon will (a rough sketch of this idea appears after these ideas).
Come up with a naming convention such that file names are roughly evenly distributed over some numeric range. Each daemon could then be configured to process a particular modulo value.
Example: file name = 987654321
We have a daemon running on two nodes. The configuration for each daemon would indicate the number of daemons and which modulo the daemon should process. Therefore one daemon would process modulo value 0 and the other would process modulo value 1.
987654321 % 2 = 1 so it would be processed by the daemon processing modulo 1.
I guess we could have a single daemon which divvies out the work to the processing daemons. The processing daemons could communicate with this single daemon which I'll call the "work manager" via some IPC mechanism.
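For idea 2, a minimal sketch of one way to do it in C: atomically create a claim marker with O_CREAT|O_EXCL before processing. The marker suffix (.claim) and the helper name are made up for illustration:

    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    /* Try to claim `path` for this daemon by atomically creating a
     * marker file next to it. O_CREAT|O_EXCL fails with EEXIST if any
     * other daemon or thread created the marker first. Reliable on
     * local filesystems; verify O_EXCL semantics before relying on it
     * over old NFS versions. */
    static int try_claim(const char *path)
    {
        char marker[4096];
        snprintf(marker, sizeof marker, "%s.claim", path);

        int fd = open(marker, O_CREAT | O_EXCL | O_WRONLY, 0644);
        if (fd < 0)
            return 0;        /* someone else holds the claim */
        close(fd);
        return 1;            /* we own the file now; process it */
    }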
Thanks,
Nick

If you're going to implement the logic in Python, you can use Python's queue.Queue class.
It is designed for exchanging data between multiple threads.
You can put your files and/or directories in the queue; each thread then takes its next file from the queue. That way, two threads will never hold the same object.
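If the daemon ends up in C instead, the same pattern is a work queue guarded by a mutex and condition variable. A minimal sketch with made-up names and a fixed-capacity ring buffer; real code would also handle a full queue and orderly shutdown:

    #include <pthread.h>
    #include <string.h>

    #define QCAP 64

    /* A tiny blocking work queue: a dispatcher pushes file names,
     * worker threads pop them. Each name goes to exactly one thread. */
    struct file_queue {
        char *items[QCAP];
        int head, tail, count;
        pthread_mutex_t lock;
        pthread_cond_t nonempty;
    };

    void fq_init(struct file_queue *q)
    {
        memset(q, 0, sizeof *q);
        pthread_mutex_init(&q->lock, NULL);
        pthread_cond_init(&q->nonempty, NULL);
    }

    void fq_push(struct file_queue *q, char *path)   /* assumes not full */
    {
        pthread_mutex_lock(&q->lock);
        q->items[q->tail] = path;
        q->tail = (q->tail + 1) % QCAP;
        q->count++;
        pthread_cond_signal(&q->nonempty);
        pthread_mutex_unlock(&q->lock);
    }

    char *fq_pop(struct file_queue *q)   /* blocks until an item arrives */
    {
        pthread_mutex_lock(&q->lock);
        while (q->count == 0)
            pthread_cond_wait(&q->nonempty, &q->lock);
        char *path = q->items[q->head];
        q->head = (q->head + 1) % QCAP;
        q->count--;
        pthread_mutex_unlock(&q->lock);
        return path;
    }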

Related

How to start a process with multiple threads right from the beginning?

As far as I know, a process is a common container for all the threads it hosts. Multiple threads can easily share resources when they run in the same process, since all the threads in a process share a common address space. A thread, on the other hand, is the unit of execution of the program.
The scheduler in the operating system schedules threads, not processes (1). A process is said to be actively running if any one of its threads is running; otherwise the process is waiting. The scheduler cannot simply schedule a process.
Also, priorities aside, all threads in a process are equal from the OS's perspective, even the main thread (2) (some applications may assign application-specific roles to particular threads, which I'm ignoring here).
Based on (1) and (2), there seems to be no requirement that every process start with one thread which then spawns child threads as needed. So technically it should be possible to start a process with multiple threads from the beginning, where none of the threads started the others. When that process starts, the scheduler can simply schedule any one of the many starter threads. But I can't figure out how to do it!
So, how does one start a process with multiple threads right from the beginning? This question is not asked in relation to any specific OS. Also, if languages that mandate main as the entry point make it hard to give an example, I can (or will try to) understand x86-64 assembly code.

How to identify if a long-running process died?

I'm working on a daemon that communicates with several processes. The daemon can't monitor the processes all the time, but it must be able to reliably detect when a process dies, so that it can release the scarce resources it holds for it.
The processes can communicate with the daemon, giving it some information at the start, but not vice versa. So the daemon can't just ask a process its identity.
The simplest form would be to use just their PID. But eventually another process could be assigned the same PID without my tool noticing.
A better approach would be to use the PID plus the time the process started: a new process with the same PID would have a distinct start time. But I couldn't find a way to get a process's start time in a POSIX way. Using ps or looking at /proc/<pid>/stat does not seem portable enough.
A more complicated idea that seems POSIX-compliant would be:
Each process creates a temporary file.
Locks it using flock.
Tells my daemon "my identity is connected with this file".
At any time, the daemon can check the temporary file: if it's locked, the process is alive; if it's not, the process is dead.
But this seems unnecessarily complicated.
Is there a better, or standard way?
Edit: The daemon must be able to resume after a restart, so it's not possible to keep a persistent connection for each process.
But I couldn't find a way how to get the process start time in a POSIX way.
Try the standard "etime" format specifier: LC_ALL=C ps -o etime= -p "$PIDS"
In fairness, I would probably construct my own table of live processes rather than relying on the process table and elapsed time. That's fundamentally your file-locking approach, though I'd probably aggregate all the lockfiles together in a known place and name them by PID, e.g., /var/run/my-app/8819.lock. Indeed, this might even be retrofitted onto the long-running processes, since file locks on file descriptors can be inherited across exec().
(Of course, if the long-running processes I cared about had a common parent, then I'd rather query the common parent, who can be a reliable authority on which processes are running and which are not.)
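A minimal sketch of that check, assuming each process creates and flock()s its own /var/run/my-app/<pid>.lock at startup and holds the lock for its whole lifetime; the helper name is made up:

    #include <errno.h>
    #include <fcntl.h>
    #include <sys/file.h>
    #include <unistd.h>

    /* Returns 1 if the lockfile's owner is alive, 0 if it is dead,
     * -1 on error (e.g. the process never registered a lockfile). */
    int is_alive(const char *lockpath)
    {
        int fd = open(lockpath, O_RDONLY);
        if (fd < 0)
            return -1;

        if (flock(fd, LOCK_EX | LOCK_NB) == 0) {
            /* We got the lock, so the owner no longer holds it: dead. */
            flock(fd, LOCK_UN);
            close(fd);
            return 0;
        }
        int alive = (errno == EWOULDBLOCK);  /* still held: owner alive */
        close(fd);
        return alive ? 1 : -1;
    }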
The standard way is the unnecessarily complicated one. That's life in a POSIX-compliant environment...
Methods other than the file exist and have various benefits and tradeoffs - most of the "standard" IPC mechanisms would work for this as well: a socket, pipe, message queue, shared memory... Basically, pick one mechanism that allows your application to announce to the daemon that it has started (and maybe that it's exiting, for an orderly shutdown). In between, it could send periodic "I'm still here" messages and the daemon could notice when it doesn't get one, or the daemon could poll periodically, or something similar... There are quite a few ways to accomplish what you want, but without knowing more about the exact architecture you're trying to achieve, it's difficult to point at the "one best way"...

When are clone() and fork() better than pthreads?

I am a beginner in this area.
I have studied fork(), vfork(), clone(), and pthreads.
I have noticed that pthread_create() creates a thread with less overhead than creating a new process with fork(). Additionally, the thread shares file descriptors, memory, etc. with its parent process.
But when are fork() and clone() better than pthreads? Can you please explain it to me with a real-world example?
Thanks in Advance.
clone(2) is a Linux-specific syscall mostly used to implement threads (in particular, it is used by pthread_create). With various arguments, clone can also behave like fork(2). Very few people use clone directly; using the pthread library is more portable. You probably need to call the clone(2) syscall directly only if you are implementing your own thread library - a competitor to POSIX threads - and this is very tricky (in particular because locking may require using the futex(2) syscall in machine-tuned, assembly-coded routines; see futex(7)). You don't want to use clone or futex directly because pthreads are much simpler to use.
(The other pthread functions require some book-keeping to be done internally in libpthread.so after a clone during a pthread_create)
As Jonathon answered, processes have their own address space and file descriptor set. A process can execute a new program with the execve syscall, which basically reinitializes the address space, the stack, and the registers for the new program (but file descriptors may be kept open across it, unless the close-on-exec flag is set, e.g. through O_CLOEXEC for open).
On Unix-like systems, all processes (except the very first process, usually init, of pid 1) are created by fork (or variants like vfork; you could, but don't want to, use clone in such a way that it behaves like fork).
(Technically, on Linux, there are a few weird exceptions which you can ignore, notably kernel processes or threads and some rare kernel-initiated starts of processes like /sbin/hotplug...)
The fork and execve syscalls are central to Unix process creation (with waitpid and related syscalls).
A multi-threaded process has several threads (usually created by pthread_create) all sharing the same address space and file descriptors. You use threads when you want to work in parallel on the same data within the same address space, but then you should care about synchronization and locking. Read a pthread tutorial for more.
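A minimal sketch of that usage, with a made-up worker function and shared array; compile with -pthread:

    #include <pthread.h>
    #include <stdio.h>

    #define NTHREADS 4

    /* The workers share the process's address space, so `counts`
     * is visible to every thread. */
    static int counts[NTHREADS];

    static void *worker(void *arg)
    {
        int id = *(int *)arg;
        counts[id] = id * id;    /* toy work on shared data */
        return NULL;
    }

    int main(void)
    {
        pthread_t tid[NTHREADS];
        int ids[NTHREADS];

        for (int i = 0; i < NTHREADS; i++) {
            ids[i] = i;
            pthread_create(&tid[i], NULL, worker, &ids[i]);
        }
        for (int i = 0; i < NTHREADS; i++)
            pthread_join(tid[i], NULL);   /* wait for all workers */

        for (int i = 0; i < NTHREADS; i++)
            printf("counts[%d] = %d\n", i, counts[i]);
        return 0;
    }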
I suggest you read a good Unix programming book such as Advanced Unix Programming and/or the (freely available) Advanced Linux Programming.
The strength and weakness of fork (and company) is that they create a new process that's a clone of the existing process.
This is a weakness because, as you pointed out, creating a new process has a fair amount of overhead. It also means communication between the processes has to be done via some "approved" channel (pipes, sockets, files, shared-memory region, etc.)
This is a strength because it provides (much) greater isolation between the parent and the child. If, for example, a child process crashes, you can kill it and start another fairly easily. By contrast, if a child thread dies, killing it is problematic at best -- it's impossible to be certain what resources that thread held exclusively, so you can't clean up after it. Likewise, since all the threads in a process share a common address space, one thread that ran into a problem could overwrite data being used by all the other threads, so just killing that one thread wouldn't necessarily be enough to clean up the mess.
In other words, using threads is a little bit of a gamble. As long as your code is all clean, you can gain some efficiency by using multiple threads in a single process. Using multiple processes adds a bit of overhead, but can make your code quite a bit more robust, because it limits the damage a single problem can cause, and makes it much easier to shut down and replace a process if it does run into a major problem.
As far as concrete examples go, Apache might be a pretty good one. It will use multiple threads per process, but to limit the damage in case of problems (among other things), it limits the number of threads per process, and can/will spawn several separate processes running concurrently as well. On a decent server you might have, for example, 8 processes with 8 threads each. The large number of threads helps it service a large number of clients in a mostly I/O bound task, and breaking it up into processes means if a problem does arise, it doesn't suddenly become completely un-responsive, and can shut down and restart a process without losing a lot.
These are totally different things. fork() creates a new process. pthread_create() creates a new thread, which runs under the context of the same process.
Threads share the same virtual address space, memory (for good or for bad), and set of open file descriptors, among other things.
Processes are (essentially) totally separate from each other and cannot modify each other.
You should read this question:
What is the difference between a process and a thread?
As for an example: if I am your shell (e.g. bash), when you enter a command like ls, I am going to fork() a new process and then exec() the ls executable. (And then I wait() on the child process, but that's getting out of scope.) This happens in an entirely different address space, and if ls blows up, I don't care, because I am still executing in my own process.
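A minimal sketch of that shell pattern - fork, then execlp, then waitpid - with error handling mostly omitted:

    #include <stdio.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int main(void)
    {
        pid_t pid = fork();
        if (pid == 0) {
            /* Child: replace this image with ls; on success execlp
             * never returns. */
            execlp("ls", "ls", "-l", (char *)NULL);
            perror("execlp");       /* only reached if exec failed */
            _exit(127);
        }
        /* Parent (the "shell"): wait for the child, then carry on. */
        int status;
        waitpid(pid, &status, 0);
        if (WIFEXITED(status))
            printf("ls exited with status %d\n", WEXITSTATUS(status));
        return 0;
    }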
On the other hand, say I am a math program, and I have been asked to multiply two 100x100 matrices. We know that matrix multiplication is an embarrassingly parallel problem. So, I have the matrices in memory. I spawn N threads, each of which operates on the same source matrices, putting its results in the appropriate location in the result matrix. Remember, these operate in the context of the same process, so I need to make sure they are not stamping on each other's data. If N is 8 and I have an eight-core CPU, I can effectively calculate each part of the matrix simultaneously.
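A sketch of that decomposition: each thread computes one band of rows of the result, so the threads write disjoint parts of C and need no locking. The sizes and helper names are made up; compile with -pthread:

    #include <pthread.h>
    #include <stdio.h>

    #define N    100   /* matrix dimension */
    #define NTHR 8     /* worker threads   */

    static double A[N][N], B[N][N], C[N][N];

    struct band { int row_lo, row_hi; };   /* half-open row range */

    static void *multiply_band(void *arg)
    {
        struct band *b = arg;
        for (int i = b->row_lo; i < b->row_hi; i++)
            for (int j = 0; j < N; j++) {
                double sum = 0.0;
                for (int k = 0; k < N; k++)
                    sum += A[i][k] * B[k][j];
                C[i][j] = sum;   /* each thread owns its rows of C */
            }
        return NULL;
    }

    int main(void)
    {
        pthread_t tid[NTHR];
        struct band bands[NTHR];

        for (int i = 0; i < N; i++)        /* toy input data */
            for (int j = 0; j < N; j++) {
                A[i][j] = i + j;
                B[i][j] = (i == j);        /* identity: C should equal A */
            }

        for (int t = 0; t < NTHR; t++) {
            bands[t].row_lo = t * N / NTHR;
            bands[t].row_hi = (t + 1) * N / NTHR;
            pthread_create(&tid[t], NULL, multiply_band, &bands[t]);
        }
        for (int t = 0; t < NTHR; t++)
            pthread_join(tid[t], NULL);

        printf("C[7][7] = %g (expected %g)\n", C[7][7], A[7][7]);
        return 0;
    }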
The process creation mechanism on Unix using fork() (and family) is very efficient.
Moreover, some Unix systems historically did not support kernel-level threads, i.e. a thread was not an entity recognized by the kernel. On such systems threads could not benefit from CPU scheduling at the kernel level; the scheduling was done by the pthread library itself, in user space, within the process.
On such systems, pthreads were implemented as light-weight processes (old Linux, for example, built threads on top of clone() in the LinuxThreads library).
So on such systems, threading had little point except portability.
(For what it's worth, Solaris and Windows have long had kernel-level threads, and modern Linux does too, via NPTL since kernel 2.6, so this concern is largely historical.)
With processes, pipes and Unix domain sockets are very efficient IPC mechanisms without synchronization issues.
I hope this clarifies why and when threads should be used in practice.

Is it possible to completely manage the life cycle of a process and its forks?

Consider a system that manages user-defined programs:
A program can be anything. Its command line is defined by non-privileged users in some configuration file. It could be /bin/ls, it could be /usr/sbin/apache; the user may specify whatever he is permitted to start.
Each program is run as a non-root user.
Any given user can configure any number of programs.
Each program runs for as long as it wants.
Each program may call fork(), exec() etc.
Each program may set itself as a session leader (ie., setsid()).
The system that starts the programs might not run continuously. It starts a program, then quits.
The action "stop all of program P's processes, including children/forks" must be possible.
The action "find all processes belonging to program P" must be possible.
Here's the question: How can one provide such a system within the Linux process model?
The naive method:
Start program with fork(), exec(), setuid(), etc..
Write the child PID (plus its start timestamp from /proc/<pid>/stat, to uniquely and permanently identify it) to a file.
To stop a single process, send SIGTERM to its PID.
To find all processes, inspect /proc and build the process hierarchy based on PIDs.
This method has a big hole: Any process may fork and break out of its process group. It's not sufficient to look at the process hierarchy. After a program has created new processes, it's not possible to trace their origin back to the original program.
A workaround would be to ensure that each program is started with a unique UID. This is not desirable or particularly workable, since a (human) user may define any number of programs; the system would then have to programmatically create new, unique users for each program.
My only idea so far is to inject a special, reserved environment variable into the program's initial process, i.e., run the program with env PROGRAM=myprogram <command line>. The system could then mandate that all processes must inherit their parent's environment. At regular intervals, the system could trawl /proc and forcibly kill any process missing the PROGRAM environment variable.
Are there any secrets in the Linux syscall API that I could use?
(1) The action "stop all of program P's processes, including children/forks" must be possible. (2) The action "find all processes belonging to program P" must be possible.
cgroups implement this, and systemd is perhaps their heaviest user to date, using (2) to achieve (1). You can break out of process groups, but not out of cgroups.
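A minimal sketch of that approach, assuming a cgroup v2 hierarchy mounted at /sys/fs/cgroup, sufficient privileges, and a kernel with cgroup.kill (Linux 5.14+); the cgroup name and helper are made up:

    /* Place the current process in a fresh cgroup, so every future
     * fork()/exec() descendant stays in it, then (elsewhere) list or
     * kill the whole group. Error handling mostly omitted. */
    #include <stdio.h>
    #include <sys/stat.h>
    #include <unistd.h>

    #define CG "/sys/fs/cgroup/program-P"

    static void write_file(const char *path, const char *text)
    {
        FILE *f = fopen(path, "w");
        if (!f) { perror(path); return; }
        fputs(text, f);
        fclose(f);
    }

    int main(void)
    {
        char pid[32];

        mkdir(CG, 0755);                        /* create the cgroup   */
        snprintf(pid, sizeof pid, "%d", getpid());
        write_file(CG "/cgroup.procs", pid);    /* move ourselves in   */

        /* ... fork/exec the user-defined program here; children
         * inherit the cgroup and cannot escape it unprivileged ... */

        /* "Find all processes of program P": read CG "/cgroup.procs". */
        /* "Stop all of program P's processes": write "1" to           */
        /* CG "/cgroup.kill" (cgroup v2, Linux 5.14+).                 */
        return 0;
    }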

Strange descriptor closing in some linux programs

While stracing some Linux daemons (e.g. sendmail), I noticed that some of them call close() on a number of descriptors (usually ranging from 3 to 255) right at the beginning. Is this done on purpose, or is it a side effect of doing something else?
It is usually done as part of making a process a daemon.
All file descriptors are closed so that the long-running daemon does not unnecessarily hold any resources. For example, if a daemon were to inherit an open file and the daemon did not close it then the file could not be deleted (the storage for it would remain allocated until close) and the filesystem that the file is on could not be unmounted.
Daemonizing a process will also take a number of other actions, but those actions are beyond the scope of this question.
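A minimal sketch of the idiom as it commonly appears in daemonizing code; the helper name is made up, and the 256 fallback mirrors the 3-255 range observed in strace:

    #include <sys/resource.h>
    #include <unistd.h>

    /* Close every inherited descriptor above stdin/stdout/stderr. */
    static void close_inherited_fds(void)
    {
        struct rlimit rl;
        int maxfd = 256;    /* fallback guess if the limit is unknown */

        if (getrlimit(RLIMIT_NOFILE, &rl) == 0 && rl.rlim_cur != RLIM_INFINITY)
            maxfd = (int)rl.rlim_cur;

        for (int fd = 3; fd < maxfd; fd++)
            close(fd);      /* EBADF for fds that weren't open: harmless */
    }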
