How do I avoid race conditions when appending to a file? - linux

I'm looking at using PipelineDB for analytics. For data warehousing I want to append all new data to a file, and tail -F it into psql like the examples on the website.
I have multiple data sources, so to get deterministic results, I'd like to append them all to the same input file, where they'll stay in the same order.
Is there a simple, idiomatic way of avoiding race conditions? Something like a single-file server I can pipe data to?
Edit:
Actually, a race condition is exactly what I want. But each line must be atomic, so no single line is ever corrupted. Lines may be interleaved, though.

You can simulate a mutex by using mkdir, which is an atomic create-and-check operation (this is ensured at the kernel level):
# locking example -- CORRECT
# Bourne
lockdir=/tmp/myscript.lock
if mkdir "$lockdir"
then
    # directory did not exist, but was created successfully
    echo >&2 "successfully acquired lock: $lockdir"
    # continue script
else
    echo >&2 "cannot acquire lock, giving up on $lockdir"
    exit 0
fi
For more information (and other solutions) take a look at the FAQ:
http://mywiki.wooledge.org/BashFAQ/045
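Applied to the original append use case, a minimal sketch could look like the following (the lock directory /tmp/append.lock and the input file allinput.txt are hypothetical names, and the sleep interval is arbitrary):
# minimal sketch: serialize appends with a mkdir lock
lockdir=/tmp/append.lock
line="one record from this data source"
until mkdir "$lockdir" 2>/dev/null
do
    sleep 1    # another writer holds the lock; retry
done
trap 'rmdir "$lockdir"' EXIT    # release the lock even if we die mid-write
printf '%s\n' "$line" >> allinput.txt
rmdir "$lockdir"
trap - EXIT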

You could prepend/wrap all your writes with a mutex by using GNU Parallel like this:
sem --id atomicwrite echo hi >> file
So, to test it, run this loop in several terminals at the same time:
for i in {0..999}; do sem --id atomicwrite echo hi >> file ; done
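If GNU Parallel is not available, flock(1) from util-linux (which the answers further down also use) can serialize the appends in the same spirit. A rough equivalent of the test loop above, using /tmp/append.lock as an arbitrary lock file; note the redirection happens inside the locked command:
# sketch: the same test, serialized with flock instead of sem
for i in {0..999}; do
    flock /tmp/append.lock sh -c 'echo hi >> file'
done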

Related

Synchronizing four shell scripts to run one after another in unix

I have 4 shell scripts that generate a file (let's say param.txt) which is used by another tool (Informatica); once the tool is done with its processing, it deletes param.txt.
The intent is that all four scripts can be invoked at different times, let's say 12:10 am, 12:13 am, 12:16 am and 12:17 am. The first script runs at 12:10 am, creates param.txt and triggers the Informatica process that uses param.txt. The Informatica process takes another 5-10 minutes to complete and then deletes param.txt. The 2nd script, invoked at 12:13 am, waits for param.txt to disappear, and as soon as the Informatica process deletes it, script 2 creates a new param.txt and triggers the same Informatica process again. The same goes for the other two scripts.
I am using until and sleep in all 4 shell scripts to check for the absence of param.txt, like below:
until [ ! -f "$paramfile" ]
do
    sleep 10
done
<create param.txt file>
The issue is that sometimes, when all 4 scripts have started, the first one succeeds and generates param.txt (as there was no param.txt before) and the others wait; but when the Informatica process completes and deletes param.txt, the remaining 3 scripts (or 2 of them) check for its absence at the same time, and although only one of them actually creates it, they all proceed as if they had. I have tried different combinations of sleep intervals between the four scripts, but this situation occurs almost every time.
You are experiencing a classic race condition. To solve this issue, you need a shared "lock" (or similar) between your 4 scripts.
There are several ways to implement this. One way to do this in bash is by using the flock command, and an agreed-upon filename to use as a lock. The flock man page has some usage examples which resemble this:
(
    flock -x 200    # try to acquire an exclusive lock on the file
    # do whatever check you want. You are guaranteed to be the only one
    # holding the lock
    if [ -f "$paramfile" ]; then
        :    # do something
    fi
) 200>/tmp/lock-life-for-all-scripts
# The lock is automatically released when the above block is exited
You can also ask flock to fail right away if the lock can't be acquired, or to fail after a timeout (e.g. to print "still trying to acquire the lock" and restart).
Depending on your use case, you could also put the lock on the 'informatica' binary (be sure to use 200< in that case, to open the file for reading instead of (over)writing).
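For completeness, a sketch of the "fail right away" and "fail after a timeout" variants mentioned above, using the same lock file as in the example (the messages and the 60-second timeout are made up):
(
    flock -n 200 || { echo >&2 "lock is busy, giving up"; exit 1; }
    # critical section
) 200>/tmp/lock-life-for-all-scripts

(
    flock -w 60 200 || { echo >&2 "could not get the lock within 60 seconds"; exit 1; }
    # critical section
) 200>/tmp/lock-life-for-all-scripts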
You can use GNU Parallel as a counting semaphore or a mutex, by invoking it as sem instead of as parallel. Scroll down to Mutex on this page.
So, you could use:
sem --id myGlobalId 'create input file; run informatica'
sem --id myGlobalId 'create input file; run informatica'
sem --id myGlobalId 'create input file; run informatica'
sem --id myGlobalId 'create input file; run informatica'
Note I have specified a global id in case you run the jobs from different terminals or cron. This is not necessary if you are starting all jobs from one terminal.
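As a small aside (not from the original answer): if the jobs are started from cron and something later needs to know that the last of them has finished, sem can also block on the same id:
sem --wait --id myGlobalId    # returns once every job queued under this id has finished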
Thanks for your valuable suggestions. They did help me think in another direction. However, I failed to mention that I am using Solaris UNIX, where I couldn't find an equivalent of flock or a similar function. I could have asked the team to install such a utility, but in the meantime I found a workaround for this issue.
I read that mkdir is atomic in nature, whereas creating a file with 'touch' is not (I still don't have a complete explanation of how it works). That means only 1 of the 4 scripts can create/delete the directory 'lockdir' at a time, and the other 3 have to wait.
while true
do
    if mkdir "$lockdir"; then
        < create param file >
        break
    fi
    sleep 30
done

shell difference between redirect position

Is there any difference between these 2 lines?
for i in $(seq 1 10); do echo $i - `date`; sleep 1; done >> /tmp/output.txt
for i in $(seq 1 10); do echo $i - `date` >> /tmp/output.txt ; sleep 1; done
Because Robert told me that the first one only does the I/O operation outside the for loop.
But if I run tail -f /tmp/output.txt, both behave exactly the same way.
They do the same if they succeed. However, there might be notable differences if they fail for whatever reason.
The first one:
for ...; do
# things
done >> file
This redirects the whole loop's output to the file; the data is supposed to land there once the loop is done, but in practice it can arrive whenever Bash decides to flush the buffer.
Imagine something fails during iteration number 3: you cannot tell what has actually ended up in the file.
The second one:
for ...; do
# things >> file
done
This will redirect to the file on every iteration.
Imagine something fails during iteration number 3: you are sure the first two iterations have been stored properly in the file.
From How to redirect output from an infinite-loop program:
If your program is using the standard output functions (e.g. puts, printf and friends from stdio.h in C, cout << … in C++, print in many high-level languages), then its output is buffered: characters accumulate in a memory zone which is called a buffer; when there's too much data in the buffer, the contents of the buffer is printed (it's “flushed”) and the buffer becomes empty (ready to be filled again). If your program doesn't produce much output, it may not have filled its buffer yet.
Also, from the answer you link:
Placing the redirection operator outside the loop doubles the
performance when writing 500000 lines (on my system).
This makes sense: if you have to flush on every iteration, it takes more time than letting Bash flush whenever it finds it convenient. It is easier to write five lines at a time than one line every single time.
There's another important difference that has not been mentioned: with the redirection inside the loop, >> opens the file for writing on every iteration. This can affect performance noticeably.
Also, if /tmp/output.txt gets deleted while the loop is running, echo ... >> /tmp/output.txt will re-create the file with new contents, while for ... done >> /tmp/output.txt will continue adding data to the same file.
This is something important to remember, especially if we are dealing with hard links or temporary files (in general, we unlink temporary files soon after creating them, to avoid having stale files if the Bash script exits unexpectedly).
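To see the difference described in the last two paragraphs, a hypothetical test: run the loop in one terminal and delete the file from another terminal while it is running.
# terminal 1: redirection outside the loop
for i in $(seq 1 10); do echo $i - `date`; sleep 1; done >> /tmp/output.txt
# terminal 2, while the loop is running:
rm /tmp/output.txt
# result: the already-open file is unlinked, the remaining output goes to the
# deleted inode, and /tmp/output.txt does not exist when the loop finishes

# terminal 1: redirection inside the loop
for i in $(seq 1 10); do echo $i - `date` >> /tmp/output.txt; sleep 1; done
# result: the next echo re-creates /tmp/output.txt, so the file reappears
# containing only the iterations that ran after the rm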

How to implement multithreaded access to file-based queue in bash script

I have a bit of a problem designing a multiprocess bash script that goes through websites, follows found links and does some processing on every new page (it actually gathers email addresses, but that's an unimportant detail to the problem).
The script is supposed to work like this:
Downloads a page
Parses out all links and adds them to queue
Does some unimportant processing
Pops a URL from the queue and starts from 1)
That by itself would be quite simple to program; the problem arises from two restrictions and a feature the script needs to have.
The script must not process one URL twice
The script must be able to process n (supplied as argument) pages at once
The script must be POSIX compliant (with the exception of curl) -> so no fancy locks
Now, I've managed to come up with an implementation that uses two files as queues: one stores all URLs that have already been processed, and the other stores URLs that have been found but not yet processed.
The main process simply spawns a bunch of child processes that all share the queue files and (in a loop, until the to-be-processed queue is empty) pop a URL off the top of the to-be-processed queue, process the page, try to add every newly found link to the already-processed queue, and if that succeeds (the URL is not already there) add it to the to-be-processed queue as well.
The issue lies in the fact that you can't (AFAIK) make the queue-file operations atomic and therefore locking is necessary. And locking in a POSIX compliant way is ... terror... slow terror.
The way I do it is following:
# Pops the first element from a file ($1) and prints it to stdout; if the file is empty, prints nothing and returns 1
fuPop(){
    if [ -s "$1" ]; then
        sed -n '1p' "$1"    # print the first line
        sed -i '1d' "$1"    # then delete it in place
        return 0
    else
        return 1
    fi
}
# Appends a line ($1) to a file ($2) and returns 0 if it's not in it yet; if it is, just returns 1
fuAppend(){
    if grep -Fxq "$1" < "$2"; then
        return 1
    else
        echo "$1" >> "$2"
        return 0
    fi
}
# There're multiple processes running this function.
prcsPages(){
    while [ -s "$todoLinks" ]; do
        luAckLock "$linksLock"
        linkToProcess="$(fuPop "$todoLinks")"
        luUnlock "$linksLock"
        prcsPage "$linkToProcess"
        ...
    done
    ...
}
# prcsPage downloads the page, does some magic and then calls prcsNewLinks and prcsNewEmails, which both get a list of new emails / new URLs in $1
# $doneEmails, ..., contain file paths; $mailsLock, ..., contain dir paths
prcsNewEmails(){
    luAckLock "$mailsLock"
    for newEmail in $1; do
        if fuAppend "$newEmail" "$doneEmails"; then
            echo "$newEmail"
        fi
    done
    luUnlock "$mailsLock"
}
prcsNewLinks(){
    luAckLock "$linksLock"
    for newLink in $1; do
        if fuAppend "$newLink" "$doneLinks"; then
            fuAppend "$newLink" "$todoLinks"
        fi
    done
    luUnlock "$linksLock"
}
The problem is that my implementation is slow (like, really slow), almost so slow that it doesn't make sense to use more than 2-10 (decreasing the lock waiting helps a great deal) child processes. You can actually disable the locks (just comment out the luAckLock and luUnlock bits) and it works quite OK (and much faster), but there are race conditions every once in a while with the sed -i calls and it just doesn't feel right.
The worst part (as I see it) is the locking in prcsNewLinks, as it takes quite a lot of time (most of the run time, basically) and practically prevents the other processes from starting to process a new page (as that requires popping a new URL from the (currently locked) $todoLinks queue).
Now my question is, how to do it better, faster, and nicer?
The whole script is here (it contains some signal magic, a lot of debug outputs, and not that good code generally).
BTW: Yes, you're right, doing this in bash - and what's more in a POSIX compliant way - is insane! But it's a university assignment so I kinda have to do it.
// Though I feel it's not really expected of me to resolve these problems (as the race conditions arise more frequently only with 25+ threads, which is probably not something a sane person would test).
Notes on the code:
Yes, the wait should have (and already has) a random time. The code shared here was just a proof of concept written during a real analysis lecture.
Yes, the number of debug outputs and their formatting is terrible, and there should be a standalone logging function. That's, however, not the point of my problem.
First of all, do you need to implement your own HTML/HTTP spidering? Why not let wget or curl recurse through a site for you?
You could abuse the filesystem as a database, and have your queues be directories of one-line files. (or empty files where the filename is the data). That would give you producer-consumer locking, where producers touch a file, and consumers move it from the incoming to the processing/done directory.
The beauty of this is that multiple threads touching the same file Just Works. The desired result is the url appearing once in the "incoming" list, and that's what you get when multiple threads create files with the same name. Since you want de-duplication, you don't need locking when writing to the incoming list.
1) Downloads a page
2) Parses out all links and adds them to queue
For each link found,
grep -ql "$url" already_checked || : > "incoming/$url"
or
[[ -e "done/$url" ]] || : > "incoming/$url"
3) Does some unimportant processing
4) Pops a URL from the queue and starts from 1)
# mostly untested, you might have to debug / tweak this
local inc=( incoming/* )
# edit: this can make threads exit sooner than desired.
# See the comments for some ideas on how to make threads wait for new work
while [[ $inc != "incoming/*" ]]; do
    # $inc is shorthand for "${inc[0]}", the first array entry
    mv "$inc" "done/" || { rm -f "$inc"; continue; }  # another thread got that link, or that url already exists in done/
    url=${inc#incoming/}
    download "$url"
    for newurl in $(link_scan "$url"); do
        [[ -e "done/$newurl" ]] || : > "incoming/$newurl"
    done
    process "$url"
    inc=( incoming/* )
done
edit: encoding URLs into strings that don't contain / is left as an exercise for the reader. Although probably urlencoding / to %2F would work well enough.
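For that %2F idea, plain parameter expansion is probably enough. A small sketch (the example.com URL is made up):
# encode "/" so the URL can be used as a file name in incoming/ and done/
url='http://example.com/some/page.html'
fname=${url//\//%2F}      # -> http:%2F%2Fexample.com%2Fsome%2Fpage.html
: > "incoming/$fname"
# decode it again when the original URL is needed
url=$(printf '%s\n' "$fname" | sed 's|%2F|/|g')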
I was thinking of moving URLs to a "processing" list per thread, but actually if you don't need to be able to resume from interruption, your "done" list can instead be a "queued & done" list. I don't think it's actually ever useful to mv "$url" "threadqueue.$$/" or something.
The "done/" directory will get pretty big, and start to slow down with maybe 10k files, depending on what filesystem you use. It's probably more efficient to maintain the "done" list as a file of one url per line, or a database if there's a database CLI interface that's fast for single commands.
Maintaining the done list as a file isn't bad, because you never need to remove entries. You can probably get away without locking it, even for multiple processes appending it. (I'm not sure what happens if thread B writes data at EOF between thread A opening the file and thread A doing a write. Thread A's file position might be the old EOF, in which case it would overwrite thread B's entry, or worse, overwrite only part of it. If you do need locking, maybe flock(1) would be useful. It gets a lock, then executes the commands you pass as args.)
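A sketch of what that flock(1) call could look like, with done.list and /tmp/done.lock as hypothetical names:
# check-and-append to the shared done list while holding the lock
flock /tmp/done.lock sh -c \
    'grep -Fxq -- "$1" done.list 2>/dev/null || printf "%s\n" "$1" >> done.list' sh "$url"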
If broken files from the lack of write locking don't actually happen, then you might not need write locking. The occasional duplicate entry in the "done" list will be a tiny slowdown compared to having to lock for every check/append.
If you need strictly-correct avoidance of downloading the same URL multiple times, you need readers to wait for the writer to finish. If you can just sort -u the list of emails at the end, it's not a disaster for a reader to read an old copy while the list is being appended. Then writers only need to lock each other out, and readers can just read the file. If they hit EOF before a writer manages to write a new entry, then so be it.
I'm not sure whether it matters if a thread adds entries to the "done" list before or after it removes them from the incoming list, as long as it adds them to "done" before processing. I was thinking that one way or the other might make races more likely to cause duplicate done entries, and less likely to make duplicate downloads / processing, but I'm not sure.

How to create a virtual command-backed file in Linux?

What is the most straightforward way to create a "virtual" file in Linux that would allow the read operation on it, always returning the output of some particular command (run every time the file is read from)? So, every read operation would cause an execution of a command, capturing its output and passing it as the "content" of the file.
There is no way to create such a so-called "virtual file". On the other hand, you would be able to achieve this behaviour by implementing a simple synthetic filesystem in userspace via FUSE. Moreover, you don't have to use C; there are bindings even for scripting languages such as Python.
Edit: And chances are that something like this already exists: see for example scriptfs.
This is a great answer that I copied below.
Basically, named pipes let you do this in scripting, and FUSE lets you do it easily in Python.
You may be looking for a named pipe.
mkfifo f
{
    echo 'V cebqhpr bhgchg.'
    sleep 2
    echo 'Urer vf zber bhgchg.'
} >f
rot13 < f
Writing to the pipe doesn't start the listening program. If you want to process input in a loop, you need to keep a listening program running.
while true; do rot13 <f >decoded-output-$(date +%s.%N); done
Note that all data written to the pipe is merged, even if there are multiple processes writing. If multiple processes are reading, only one gets the data. So a pipe may not be suitable for concurrent situations.
A named socket can handle concurrent connections, but this is beyond the capabilities of basic shell scripts.
At the most complex end of the scale are custom filesystems, which let you design and mount a filesystem where each open, write, etc., triggers a function in a program. The minimum investment is tens of lines of nontrivial coding, for example in Python. If you only want to execute commands when reading files, you can use scriptfs or fuseflt.
No one mentioned this, but if you can choose the path to the file, you can use standard input, /dev/stdin.
Every time the cat program runs, it ends up reading the output of the program writing to the pipe, which here is simply echo my input:
for i in 1 2 3; do
echo my input | cat /dev/stdin
done
outputs:
my input
my input
my input
I'm afraid this is not easily possible. When a process reads from a file, it uses system calls like open, fstat, read. You would need to intercept these calls and output something different from what they would return. This would require writing some sort of kernel module, and even then it may turn out to be impossible.
However, if you simply need to trigger something whenever a certain file is accessed, you could play with inotifywait:
#!/bin/bash
while inotifywait -qq -e access /path/to/file; do
echo "$(date +%s)" >> /tmp/access.txt
done
Run this as a background process, and you will get an entry in /tmp/access.txt each time your file is being read.

How do I create a file descriptor in linux that can be read from multiple processes without consuming the data?

I'd like to create a file descriptor that, when written to, can be read from multiple processes without consuming the data. I'm aware of named pipes, but since it's a FIFO, only one process can ever get the data.
My use case is the following. With git, hooks use stdin to pass data to be processed into the hook. I want to be able to call multiple sub-hooks from a parent hook. Each sub-hook should get the same stdin data as the parent receives. If I'm not mistaken when I use a pipe, then each subprocess will not get the same stdin. Instead, the first hook to read stdin would consume the data. Is this correct?
The only option I really see being viable at this point is writing stdin to a file then reading that file from each subprocess. Is there another way?
Perhaps you could try to use tee.
From man tee:
tee - read from standard input and write to standard output and files
In your case, you would chain the processes in this way: tee >(parent) >(hook1) >(hook2) >(hookn)
(where each hook is a different process, command, shell, whatever you want).
Here an example:
#!/bin/bash
while read stdinstream
do
    echo -n "${stdinstream}" | tee >(parent) >(hook2) >(hook1)
done
EDIT:
In your case I do not think you are going to need the while loop; this could be enough:
read stdinstream
echo -n "${stdinstream}" | tee >(parent) >(hook2) >(hook1)
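A possibly simpler variant of the same idea, assuming the sub-hooks are ordinary executables (the paths below are made up): pipe the parent hook's stdin straight through tee, which avoids the word splitting and lost newlines of the read/echo loop:
#!/bin/bash
# parent hook (sketch): duplicate stdin byte-for-byte to each sub-hook
tee >(.git/hooks/sub-hook-1) >(.git/hooks/sub-hook-2) > /dev/null
Each process substitution gets its own copy of everything the parent hook receives on stdin, and redirecting tee's own stdout to /dev/null keeps the duplicated data out of the parent hook's output.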
Hopefully this will help you.
