While reading the APUE (3rd edition) book, I came across the open system call and its ability to open a file for atomic append operations using the O_APPEND flag, meaning that multiple processes can write to the same file and the kernel ensures that the data written by the different processes doesn't overlap and all lines stay intact.
Upon experimenting with the open system call in a C/C++ program, I was able to validate this, and it works just as the book describes. I launched multiple processes that all wrote to a single file, and every line could be accounted for with respect to its process PID.
I was hoping to observe the same behavior with Perl's sysopen, as I have some tasks at work that could benefit from it. I tried it out, but it did not work. When I analyzed the output file, I saw signs of a race condition (probably), as lines were frequently interleaved.
Question: Is Perl's sysopen call not the same as Linux's open system call? Is it possible to achieve this kind of atomic write operation from multiple processes to a single file?
EDIT: adding C code, and perl code used for testing.
C/C++ code
#include <cstdio>
#include <cstring>
#include <string>
#include <fcntl.h>
#include <unistd.h>

#define MAXLINES 100000

int main(void)
{
    int fd, n;
    if ((fd = open("outfile.txt", O_WRONLY|O_CREAT|O_APPEND, 0644)) == -1) {
        printf("failed to create outfile! exiting!\n");
        return -1;
    }
    for (int counter{1}; counter <= MAXLINES; counter++)
    { /* write string 'line' for MAXLINES no. of times */
        std::string line = std::to_string(getpid())
            + " This is a sample data line ";
        line += std::to_string(counter) + " \n";
        if ((n = write(fd, line.c_str(), strlen(line.c_str()))) == -1) {
            printf("Failed to write to outfile!\n");
        }
    }
    close(fd);
    return 0;
}
Perl code
#!/usr/bin/perl
use Fcntl;
use strict;
use warnings;
my $maxlines = 100000;
sysopen (FH, "testfile", O_CREAT|O_WRONLY|O_APPEND) or die "failed sysopen\n";
while ($maxlines != 0) {
print FH "($$) This is sample data line no. $maxlines\n";
$maxlines--;
}
close (FH);
__END__
Update (after initial troubleshooting):
Thanks to the information provided in the answer below, I was able to get it working. I did run into an issue of some missing lines, which turned out to be caused by each process opening the file with O_TRUNC; I shouldn't have done that, but I missed it initially. After some careful analysis I spotted the issue and corrected it. As always, Linux never fails you :).
Here is a bash script I used to launch the processes:
#!/bin/bash
# basically we spawn "$1" instances of the same
# executable which should append to the same output file.
max=$1
[[ -z $max ]] && max=6
echo "creating $max processes for appending into same file"
# this is our output file collecting all
# the lines from all the processes.
# we truncate it before we start
>testfile
for i in $(seq 1 $max)
do
echo $i && ./perl_read_write_with_syscalls.pl 2>>_err &
done
# end.
Verification from output file:
[compuser#lenovoe470:07-multiple-processes-append-to-a-single-file]$ ls -lrth testfile
-rw-rw-r--. 1 compuser compuser 252M Jan 31 22:52 testfile
[compuser#lenovoe470:07-multiple-processes-append-to-a-single-file]$ wc -l testfile
6000000 testfile
[compuser#lenovoe470:07-multiple-processes-append-to-a-single-file]$ cat testfile |cut -f1 -d" "|sort|uniq -c
1000000 (PID: 21118)
1000000 (PID: 21123)
1000000 (PID: 21124)
1000000 (PID: 21125)
1000000 (PID: 21126)
1000000 (PID: 21127)
[compuser#lenovoe470:07-multiple-processes-append-to-a-single-file]$
Observations:
To my surprise, there wasn't any I/O wait or elevated load average on the system at all. I was not expecting that. I believe the kernel must somehow have taken care of it, but I don't know how it works. I would be interested to know more about it.
What could be the possible applications of this?
I do a lot of file-to-file reconciliations, and at work we always need to parse huge data files (around 30 GB to 50 GB each). With this working, I can now do these operations in parallel instead of my previous approach, which consisted of hashing file1, then hashing file2, and then comparing the key/value pairs from the two files. Now I can do the hashing part in parallel and bring the time down even further.
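For illustration, here is a rough sketch of what that parallel hashing step could look like: each child process hashes one input and appends its records to a shared output opened with O_APPEND, one syswrite per record. The file names, the comma-separated record layout, and the MD5 digest are assumptions for the example, not my actual job code:
#!/usr/bin/perl
# Rough sketch: hash two inputs in parallel, appending results to one file.
use strict;
use warnings;
use Fcntl qw(O_WRONLY O_CREAT O_APPEND);
use Digest::MD5 qw(md5_hex);

my @inputs = ("file1.dat", "file2.dat");   # hypothetical input files

for my $input (@inputs) {
    defined(my $pid = fork()) or die "fork failed: $!\n";
    next if $pid;                          # parent: move on to the next file
    sysopen(my $out, "digests.txt", O_WRONLY | O_CREAT | O_APPEND)
        or die "sysopen failed: $!\n";
    open(my $in, '<', $input) or die "open $input failed: $!\n";
    while (my $line = <$in>) {
        chomp $line;
        my ($key, $value) = split /,/, $line, 2;   # assumed record layout
        my $record = "$input $key " . md5_hex($value // '') . "\n";
        defined(syswrite($out, $record)) or die "syswrite failed: $!\n";
    }
    exit 0;                                # child is done
}
wait() for @inputs;                        # reap both children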
Thanks
It doesn't matter whether you use open or sysopen; the key is using syswrite and sysread instead of print/printf/say/etc and readline/read/eof/etc.
syswrite maps to a single write(2) call, while print/printf/say/etc can result in multiple calls to write(2) (even if autoflush is enabled).[1]
sysread maps to a single read(2) call, while readline/read/eof/etc can result in multiple calls to read(2).
So, by using syswrite and sysread, you are subject to all the assurances that POSIX gives about those calls (whatever they might be) if you're on a POSIX system.
[1] If you use print/printf/say/etc, and limit your writes to less than the size of the buffer between (explicit or automatic) flushes, you'll get a single write(2) call. The buffer size was 4 KiB in older versions of Perl, and it's 8 KiB by default in newer versions of Perl. (The size is decided when perl is built.)
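To make that concrete, here is a minimal sketch of the question's append loop rewritten with syswrite, so each record goes out in a single write(2) call (same filename and line count as in the question):
#!/usr/bin/perl
# Same append loop as in the question, but each line is emitted with one
# syswrite(), i.e. one write(2), so O_APPEND keeps concurrent writers intact.
use strict;
use warnings;
use Fcntl qw(O_WRONLY O_CREAT O_APPEND);

my $maxlines = 100000;
sysopen(my $fh, "testfile", O_WRONLY | O_CREAT | O_APPEND)
    or die "failed sysopen: $!\n";

while ($maxlines != 0) {
    my $line = "($$) This is sample data line no. $maxlines\n";
    my $wrote = syswrite($fh, $line);
    die "syswrite failed: $!\n"
        unless defined $wrote && $wrote == length($line);
    $maxlines--;
}
close($fh);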
Related
I'm working on a system running Ubuntu. I'm reading basic data like CPU frequency and temperature out of the thermal zones provided in /sys/class/thermal.
Unfortunately, I've got around 100 thermal_zones from which I need to read the data. I do it with:
for SENSOR_NODE in /sys/class/thermal/thermal_zone*; do printf "%s: %s\n" $(cat ${SENSOR_NODE}/type) $(cat ${SENSOR_NODE}/temp); done
Collecting all the data takes ~2.5-3 seconds, which is way too long.
Since I want to collect the data every second, my question is whether there is a way to "read" or "collect" the data faster.
Thank you in advance
There's only so much you can do while writing your code in shell, but let's start with the basics.
Command substitutions, $(...), are expensive: they require fork()ing off a new subprocess, connecting a pipe to that subprocess's stdout, reading from the pipe, and waiting for the commands running in that subshell to exit.
External commands, like cat, are expensive: they require linking and loading a separate executable, and unless you start them with exec (in which case they take over the shell's process and PID), they also require a new process to be fork()ed off.
All POSIX-compliant shells give you a read command:
for sensor_node in /sys/class/thermal/thermal_zone*; do
read -r sensor_type <"$sensor_node/type" || continue
read -r sensor_temp <"$sensor_node/temp" || continue
printf '%s: %s\n' "$sensor_type" "$sensor_temp"
done
...which lets you avoid the command substitution overhead and the overhead of cat. However, read reads content only one byte at a time; so while you're not paying that overhead, it's still relatively slow.
If you switch from /bin/sh to bash, you get a faster alternative:
for sensor_node in /sys/class/thermal/thermal_zone*; do
printf '%s: %s\n' "$(<"$sensor_node/type)" "$(<"$sensor_node/temp")"
done
...as $(<file) doesn't need to do the one-byte-at-a-time reads that read does. That's faster, but only by bash's standards; it doesn't mean it's actually fast. There's a reason modern production monitoring systems are typically written in Go or with a JavaScript runtime like Node.
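In the same spirit (moving the loop out of the shell entirely), here is a minimal sketch of the loop in a single Perl process, which avoids the per-sensor fork()/exec() overhead altogether; paths as in the question:
#!/usr/bin/perl
# Read every thermal zone's type and temp in one process: no command
# substitutions, no cat, no per-sensor fork()/exec().
use strict;
use warnings;

sub slurp_line {
    my ($path) = @_;
    open(my $fh, '<', $path) or return undef;
    my $line = <$fh>;
    chomp $line if defined $line;
    return $line;
}

for my $zone (glob "/sys/class/thermal/thermal_zone*") {
    my $type = slurp_line("$zone/type");
    my $temp = slurp_line("$zone/temp");
    printf "%s: %s\n", $type, $temp if defined $type && defined $temp;
}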
I have been doing some research and ran into this situation: if you want to write to STDOUT (the screen), a multithreaded script won't print the data any faster than a simple single-threaded script. But if you write to a file like this:
myPrinter.perl > myPrint
the results change, and you can see that the multithreaded approach gets a better time. My doubt is: since STDOUT (the screen) and the output file are both shared resources, wouldn't the access times be similar? Why does the multithreaded approach only perform better when writing to a file?
The Perl scripts I used in the experiments are:
Single thread
for my $i (1..100000000){
print("things\n");
}
Multithread
use threads;
use Thread::Queue 3.01 qw( );
use constant NUM_WORKERS => 4;
sub worker {
for my $i (1 .. 25000000){
print("things\n");
}
}
my $q = Thread::Queue->new(); #::any
async { while (defined( my $job = $q->dequeue() )) { worker($job); } }
for 1..NUM_WORKERS;
for my $i (1 .. 4){
$q->enqueue($i);
}
$q->end();
$_->join for threads->list;
Credits: the queue implementation was taken from one of ikegami's answers.
This could be explained if writing to STDOUT requires some form of locking internally.
When STDOUT is connected to a terminal, the output is flushed after every newline. Otherwise, STDOUT is only flushed every 4 KiB or 8 KiB (depending on your version of Perl). The latter scenario presumably required fewer or shorter locks.
You could use |cat instead of >file to get the same effect.
If your actual worker spends a much smaller proportion of time writing to STDOUT, this problem should go away.
An example, following up on my comment. I understand from the question that you compare STDOUT prints that wind up on the terminal to those that are redirected to a file.
Timed to print to console, and to file
time perl -we'print "hello\n" for 1..1_000_000'
Time: 0.209u 1.562s 0:17.65 9.9% 0+0k 0+0io 0pf+0w (tcsh)
time perl -we'print "hello\n" for 1..1_000_000' > many_writes.out
Time: 0.104u 0.005s 0:00.11 90.9% 0+0k 0+11720io 0pf+0w
That is 17.65 seconds vs. 0.11 seconds. Printing to a terminal is very, very slow.
With multiple threads I expect the difference to be even more pronounced.
How fast you can output data is restricted by the performance of the target. If you write to a local file, the performance is restricted by the underlying OS, the file system, and the speed of the disk. If you write to a file on a network file system, it is further restricted by the speed of the network, the performance of the file server, and so on. Some OS-level buffering helps make this faster.
If you write to STDOUT, it depends on what the target of STDOUT is. STDOUT can be redirected to a file, piped into another process, or printed to a terminal. In all of these cases the write speed again depends on the target medium. Terminals are usually very slow to write to compared to a local file. But again, this is not a question of STDOUT vs. file, but of where STDOUT ends up.
I have a file with very, very long lines. A single line will not fit in memory. I need to process each line separately, so I'd like to write each line to a FIFO node, sending EOF between lines. This model works well for what I'm doing, and would be ideal.
The lines cannot be stored in RAM in their entirety, so the read built-in is not an option. Whatever command I use needs to write directly--unbuffered--to the FIFO while reading from the multi-gigabyte source file.
How can I achieve this, preferably with pure bash and basic Linux utilities, and no special programs that I have to install? I've tried things like sed -u 'p', but unfortunately, I can't seem to get any common programs to send EOF between lines.
bash version:
GNU bash, version 4.2.45(1)-release (x86_64-pc-linux-gnu)
Update
I'd like to avoid utilities/tricks that read "line number X" from the file. This file has thousands of very long lines; repeatedly evaluating the same line breaks would take far too long. For this to complete in a reasonable timeframe, whatever program is reading lines will need to read each line in sequence, saving its previous position, much like read + pipe.
Let's think about the question "hashing lines that don't fit in memory", because that's a straightforward problem that you can solve with only a few lines of Python.
import sys
import hashlib
def hash_lines(hashname, file):
hash = hashlib.new(hashname)
while True:
data = file.read(1024 * 1024)
if not data:
break
while True:
i = data.find('\n')
if i >= 0:
hash.update(data[:i]) # See note
yield hash.hexdigest()
data = data[i+1:]
hash = hashlib.new(hashname)
else:
hash.update(data)
break
yield hash.hexdigest()
for h in hash_lines('md5', sys.stdin):
print h
It's a bit wacky because most languages do not have good abstractions for objects that do not fit in memory (an exception is Haskell; this would probably be four lines of Haskell).
Note: Use i+1 if you want to include the line break in the hash.
Haskell version
The Haskell version is like a pipe (read right to left). Haskell supports lazy IO, which means that it only reads the input as necessary, so the entire line doesn't need to be in memory at once.
A more modern version would use conduits instead of lazy IO.
module Main (main) where
import Crypto.Hash.MD5
import Data.ByteString.Lazy.Char8
import Data.ByteString.Base16
import Prelude hiding (interact, lines, unlines)
main = interact $ unlines . map (fromStrict . encode . hashlazy) . lines
The problem was that I should've been using a normal file, not a FIFO. I was looking at this wrong. read works the same way as head: it doesn't remember where it is in the file. The stream remembers for it. I don't know what I was thinking. head -n 1 will read one line from a stream, starting at whatever position the stream is already at.
#!/bin/bash
# So we don't leave a giant temporary file hanging around:
tmp=
trap '[ -n "$tmp" ] && rm -f "$tmp" &> /dev/null' EXIT
tmp="$(mktemp)" || exit $?
while head -n 1 > "$tmp" && [ -s "$tmp" ]; do
    # Run the $tmp file through several programs, like md5sum:
    md5sum "$tmp"
done < "$1"
This works quite effectively. head -n 1 reads each line in sequence from a basic file stream. I don't have to worry about background tasks or blocking when reading from empty FIFOs, either.
Hashing long lines should be perfectly feasible, but probably not so easily in bash.
If I may suggest Python, it could be done like this:
# open FIFO somehow. If it is stdin (as a pipe), fine. If not, simply open it.
# I suggest f is the opened FIFO.
def read_blockwise(f, blocksize):
while True:
data = f.read(blocksize)
if not data: break
yield data
hasher = some_object_which_does_the_hashing()
for block in read_blockwise(f, 65536):
spl = block.split('\n', 1)
hasher.update(spl[0])
if len(spl) > 1:
hasher.wrap_over() # or whatever you need to do if a new line comes along
new_hasher.update(spl[1])
This is merely Python-oriented pseudo-code that shows the general direction of how to do what you seem to want to do.
Note that it cannot be done entirely without memory, but I don't think that matters. The chunks are very small (and could be made even smaller) and are processed as they come along.
Encountering a line break terminates processing of the old line and starts processing a new one.
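For what it's worth, the same block-wise idea is straightforward to make runnable in Perl; a sketch reading from stdin, with MD5 chosen arbitrarily as the hash:
#!/usr/bin/perl
# Read the input in fixed-size chunks and hash it line by line,
# never holding a whole line in memory.
use strict;
use warnings;
use Digest::MD5;

my $hasher = Digest::MD5->new;

while (1) {
    my $n = sysread(STDIN, my $block, 65536);
    die "sysread failed: $!\n" unless defined $n;
    last if $n == 0;                           # EOF
    while ((my $i = index($block, "\n")) >= 0) {
        $hasher->add(substr($block, 0, $i));   # finish the current line
        print $hasher->hexdigest, "\n";
        $hasher = Digest::MD5->new;            # start hashing the next line
        $block = substr($block, $i + 1);
    }
    $hasher->add($block);                      # partial line, keep accumulating
}
print $hasher->hexdigest, "\n";                # last line (no trailing newline)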
I have a project to code a virus scanner using the ClamAV signature database.
To increase speed, I use threads (a combination of & and wait).
How my code works:
It reads all files in a folder and its sub-folders:
function recursive_files()
{
files=$(find $folder_path -type f)
for f in $files
do
raw_and_scan "$f" &
done
wait
}
As you can see, for each file there is a thread.
function raw_and_scan()
{
raw_test_file $1
read_signature_db_by_line $1
}
read_signature_db_by_line reads each line of the signature database:
function read_signature_db_by_line()
{
cat $signature_path | (while read LINE ; do
stringtokenizer_line_db $LINE $1 $raw_file &
done
wait
) }
As you can see, for each line of DB there is a thread.
I went with the double-threaded implementation because I saw a huge performance gain (benchmarked with time).
When I scan 50 files with 50 lines in the DB, it works fine.
But when I scan my home folder (800 files), it doesn't work; worse, I get a warning (can't fork() anymore), my computer freezes, and it needs a reboot.
Watching the processes in htop, it works up to about 5000 tasks.
You can find my project at https://github.com/peondusud/Bash.antivir
In the end, I would like to scan folders against a database of 65000 lines.
If you have any ideas on how to limit the number of threads, or something like that, please share.
Thanks.
The fact that you see a huge improvement going from a single process (not thread) to two does not mean you will go super fast using 5000 processes! It is actually the opposite: if you plan to have processes doing intensive work, you should limit them to about 2x the number of CPU cores in your system (a generic rule of thumb).
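To illustrate that rule of thumb, here is a minimal sketch (in Perl rather than bash, purely to keep it short; the cap of 8 and the scan_file stub are placeholders) of keeping at most a fixed number of children running at once:
#!/usr/bin/perl
# Sketch: process a list of files with at most $max_workers concurrent children.
use strict;
use warnings;

my $max_workers = 8;          # roughly 2x the number of CPU cores
my @files       = @ARGV;      # files to scan
my $running     = 0;

sub scan_file { my ($file) = @_; sleep 1 }   # placeholder for the real scan

for my $file (@files) {
    if ($running >= $max_workers) {
        wait();               # block until one child finishes
        $running--;
    }
    defined(my $pid = fork()) or die "fork failed: $!\n";
    if ($pid == 0) { scan_file($file); exit 0 }
    $running++;
}
wait() while $running-- > 0;  # reap the remaining children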
I have been trying for about an hour now to find an elegant solution to this problem. My goal is basically to write a bandwidth control pipe command which I could re-use in various situations (not just for network transfers, I know about scp -l 1234). What I would like to do is:
Delay for X seconds.
Read Y amount of data (or less than Y if there isn't enough) from the pipe.
Write the read data to standard output.
Where:
X could be 1..n.
Y could be 1 Byte up to some high value.
My problem is:
It must support binary data which Bash can't handle well.
Roads I've taken or at least thought of:
Using a while read data construct: read filters out whitespace characters, depending on the encoding you're using.
Using dd bs=1 count=1 and looping: dd doesn't seem to have different exit codes for when there was something in if and when there wasn't, which makes it harder to know when to stop looping. This method should work if I redirect standard error to a temporary file, read it to check whether something was transferred (it's in the statistics printed on stderr), and repeat. But I suspect it would be extremely slow on large amounts of data, and if possible I'd like to avoid creating any temporary files.
Any ideas or suggestions on how to solve this as cleanly as possible using Bash?
Maybe pv -qL RATE?
-L RATE, --rate-limit RATE
Limit the transfer to a maximum of RATE bytes per second. A
suffix of "k", "m", "g", or "t" can be added to denote kilobytes
(*1024), megabytes, and so on.
It's not very elegant, but you can use a redirection trick to capture the number of bytes copied by dd and then use it as the exit condition for a while loop:
while [ -z "$byte_copied" ] || [ "$byte_copied" -ne 0 ]; do
sleep $X;
byte_copied=$(dd bs=$Y count=1 2>&1 >&4 | awk '$2 == "byte"{print $1}');
done 4>&1
However, if your intent is to limit the transfer throughput, I suggest you use pv.
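If pv isn't available, the delay-X-seconds / read-Y-bytes loop from the question is also easy to express in Perl, which has no trouble with binary data; a sketch with X and Y as placeholder values:
#!/usr/bin/perl
# Sketch: every $x seconds, read up to $y bytes from stdin and copy to stdout.
use strict;
use warnings;

my ($x, $y) = (1, 4096);      # delay in seconds, chunk size in bytes
binmode(STDIN);
binmode(STDOUT);

while (1) {
    sleep $x;
    my $n = sysread(STDIN, my $buf, $y);
    die "sysread failed: $!\n" unless defined $n;
    last if $n == 0;                          # EOF
    my $off = 0;
    while ($off < $n) {                       # cope with short writes
        my $w = syswrite(STDOUT, $buf, $n - $off, $off);
        die "syswrite failed: $!\n" unless defined $w;
        $off += $w;
    }
}
Dropped into the middle of a pipeline, it behaves like the dd loop above, but without temporary files or stderr parsing.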
Do you have to do it in bash? Can you just use an existing program such as cstream?
cstream meets your goal of a bandwidth controlled pipe command, but doesn't necessarily meet your other criteria with regard to your specific algorithm or implementation language.
What about using head -c ?
cat /dev/zero | head -c 10 > test.out
Gives you a nice 10-byte file.