Bash while read loop extremely slow compared to cat, why?

A simple test script here:
while read LINE; do
LINECOUNT=$(($LINECOUNT+1))
if [[ $(($LINECOUNT % 1000)) -eq 0 ]]; then echo $LINECOUNT; fi
done
When I do cat my450klinefile.txt | myscript, the CPU locks up at 100% and it can process about 1000 lines a second. That is about 5 minutes to process what cat my450klinefile.txt >/dev/null does in half a second.
Is there a more efficient way to do essentially this? I just need to read a line from stdin, count the bytes, and write it out to a named pipe. But the speed of even this example is impossibly slow.
Every 1 GB of input lines I need to do a few more complex scripting actions (close and open some pipes that the data is being fed to).

The reason while read is so slow is that the shell is required to make a system call for every byte. It cannot read a large buffer from the pipe, because the shell must not read more than one line from the input stream and therefore must compare each character against a newline. If you run strace on a while read loop, you can see this behavior. This behavior is desirable, because it makes it possible to reliably do things like:
while read size; do test "$size" -gt 0 || break; dd bs="$size" count=1 of=file$(( i++ )); done
in which the commands inside the loop are reading from the same stream that the shell reads from. If the shell consumed a big chunk of data by reading large buffers, the inner commands would not have access to that data. An unfortunate side-effect is that read is absurdly slow.
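You can confirm this yourself, assuming strace is installed: with stdin coming from a pipe, every read() the shell issues inside the loop asks for exactly one byte:
cat my450klinefile.txt | strace -e trace=read bash -c 'while read line; do :; done' 2>&1 | head -n 20
# expect a long run of lines like: read(0, "a", 1) = 1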

It's because the bash script is interpreted and not really optimised for speed in this case. You're usually better off using one of the external tools such as:
awk 'NR%1000==0{print}' inputFile
which matches your "print every 1000 lines" sample.
If you wanted to (for each line) output the line count in characters followed by the line itself, and pipe it through another process, you could also do that:
awk '{print length($0)" "$0}' inputFile | someOtherProcess
Tools like awk, sed, grep, cut and the more powerful perl are far more suited to these tasks than an interpreted shell script.
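Since the actual goal is to read each line, count its bytes, and write the result to a named pipe, the whole job can stay inside awk. A minimal sketch, assuming a hypothetical pipe /tmp/out.fifo that some other process is already reading from:
mkfifo /tmp/out.fifo 2>/dev/null      # create the named pipe if it does not exist yet
LC_ALL=C awk '{ print length($0), $0 }' my450klinefile.txt > /tmp/out.fifo
# LC_ALL=C makes length($0) count bytes rather than multibyte characters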

A perl solution for counting the bytes of each line:
perl -p -e '
use Encode;
print length(Encode::encode_utf8($_))."\n";$_=""'
for example:
dd if=/dev/urandom bs=1M count=100 |
perl -p -e 'use Encode;print length(Encode::encode_utf8($_))."\n";$_=""' |
tail
runs at about 7.7 MB/s for me.
For comparison, reading the same input with no script at all:
dd if=/dev/urandom bs=1M count=100 >/dev/null
runs at about 9.1 MB/s,
so the script is not that slow :)

I'm not really sure what your script is supposed to do, so this might not be an answer to your question but more of a generic tip.
Don't cat your file and pipe it to your script; instead, when reading from a file in a bash script, redirect it into the loop like this:
while read line
do
echo $line
done <file.txt
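For what it's worth, a more robust version of that loop clears IFS (preserving leading whitespace), passes -r (preserving backslashes), and quotes the expansion; printf is also safer than echo for arbitrary data:
while IFS= read -r line
do
printf '%s\n' "$line"
done <file.txt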

Related

How to make many edits to files, without writing to the harddrive very often, in BASH?

I often need to make many edits to text files. The files are typically 20 MB in size and require ~500,000 individual edits, all which must be made in a very specific order. Here is a simple example of a script I might need to use:
while read -r line
do
...
(20-100 lines of BASH commands preparing $a and $b)
...
sed -i "s/$a/$b/g" ./editfile.txt
...
done < ./readfile.txt
As many other lines of code appear before and after the sed script, it seems the only option for editing the file is sed with the -i option. Many have warned me against using sed -i, as that makes too many writes to the file. Recently, I had to replace two computers, as the hard drives stopped working after running the scripts. I need to find a solution that does not damage my computer's hardware.
Is there some way to send files somewhere else, such as storing the whole file into a BASH variable, or into RAM, where I sed, grep, and awk, can make the edits without making millions of writes to the hard drive?
Don't use sed -i once per transform. A far better approach -- leaving you with more control -- is to construct a pipeline (if you can't use a single sed with multiple -e arguments to perform multiple operations within a single instance), and redirect to or from disk at only the beginning and end.
This can even be done recursively, if you use a FD other than stdin for reading from your file:
editstep() {
read -u 3 -r # read the next line from readfile into REPLY
if [[ $REPLY ]]; then # we read something new from readfile
sed ... | editstep # perform the edits, then a recursive call!
else
cat
fi
}
editstep <editfile.txt >editfile.txt.new 3<readfile.txt
Better than that, though, is to consolidate to a single sed instance.
sed_args=( )
while read -r line; do
sed_args+=( -e "s/in/out/" )
done <readfile.txt
sed -i "${sed_args[@]}" editfile.txt
...or, for edit lists too long to pass in on the command line:
sed_args=( )
while read -r line; do
sed_args+=( "s/in/out/" )
done <readfile.txt
sed -i -f <(printf '%s\n' "${sed_args[@]}") editfile.txt
(Please don't read the above as an endorsement of sed -i, which is a non-POSIX extension and has its own set of problems; the POSIX-specified editor intended for in-place rather than streaming operations is ex, not sed).
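For the record, here is a sketch of the same kind of in-place substitution done with ex instead; the in/out pair is a placeholder, just as above:
printf '%s\n' 'g/in/s//out/g' 'x' | ex -s editfile.txt
# 'x' writes the buffer back to editfile.txt and exits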
Even better? Don't use sed at all, but keep all the operations inline in native bash.
Consider the following:
content=$(<editfile.txt)
while IFS= read -r; do
# put your own logic here to set `in` and `out`
content=${content//$in/$out}
done <readfile.txt
printf '%s\n' "$content" >editfile.new
One important caveat: This approach treats in as a literal string, not a regular expression. Depending on the edits you're actually making, this may actually improve correctness over the original code... but in any event, it's worth being aware of.
Another caveat: Reading the file's contents into a bash string is not necessarily a lossless operation; expect content to be truncated at the first NUL byte (if any exist), and a trailing newline to be added at the end of the file if none existed before.
Simple...
Instead of trying so many tricks, you can simply copy all your files and directories to /dev/shm.
This is a RAM-backed filesystem (tmpfs). When you are done editing, copy everything back to the original destination. Do not forget to run sync after you are done :-)
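A rough sketch of that workflow (the paths here are hypothetical):
cp -r /data/project /dev/shm/project        # stage the files on the RAM-backed tmpfs
cd /dev/shm/project
# ... run the edit loop here; all the sed -i writes now hit RAM only ...
cp -r /dev/shm/project /data/               # copy the results back once, at the end
sync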

Use I/O redirection between two scripts without waiting for the first to finish

I have two scripts, let's say long.sh and simple.sh: one is very time consuming, the other is very simple. The output of the first script should be used as input of the second one.
As an example, the "long.sh" could be like this:
#!/bin/sh
for line in `cat LONGIFLE.dat` do;
# read line;
# do some complicated processing (time consuming);
echo $line
done;
And the simple one is:
#!/bin/sh
while read a; do
# simple processing;
echo $a + "other stuff"
done;
I want to pipeline the two scripts this:
sh long.sh | sh simple.sh
Using a pipeline, simple.sh has to wait for the end of the long script before it can start.
I would like to know if, in the bash shell, it is possible to see the output of simple.sh line by line, so that I can see at runtime which line is being processed at any given moment.
I would prefer not to merge the two scripts together, nor to call the simple.sh inside long.sh.
Thank you very much.
stdout is normally buffered. You want line-buffered. Try
stdbuf -oL sh long.sh | sh simple.sh
Note that this loop
for line in `cat LONGIFLE.dat`; do # see where I put the semi-colon?
reads words from the file. If you only have one word per line, you're OK. Otherwise, to read by lines, use while IFS= read -r line; do ...; done < LONGFILE.dat
Always quote your variables (echo "$line") unless you know specifically when not to.
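Putting those two points together, a hedged rewrite of long.sh could look like this (the time-consuming processing is left as a comment, as in the question):
#!/bin/sh
while IFS= read -r line; do
# do some complicated processing (time consuming)
echo "$line"
done < LONGIFLE.dat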

Remove a specific line from a file WITHOUT using sed or awk

I need to remove a specific line number from a file using a bash script.
I get the line number from the grep command with the -n option.
I cannot use sed for a variety of reasons, least of which is that it is not installed on all the systems this script needs to run on and installing it is not an option.
awk is out of the question because in testing, on different machines with different UNIX/Linux OS's (RHEL, SunOS, Solaris, Ubuntu, etc.), it gives (sometimes wildly) different results on each. So, no awk.
The file in question is just a flat text file, with one record per line, so nothing fancy needs to be done, except for remove the line by number.
If at all possible, I need to avoid doing something like extracting the contents of the file, not including the line I want gone, and then overwriting the original file.
Since you have grep, the obvious thing to do is:
$ grep -v "line to remove" file.txt > /tmp/tmp
$ mv /tmp/tmp file.txt
$
But it sounds like you don't want to use any temporary files - I assume the input file is large and this is an embedded system where memory and storage are in short supply. I think you ideally need a solution that edits the file in place. I think this might be possible with dd but haven't figured it out yet :(
Update - I figured out how to edit the file in place with dd. Also grep, head and cut are needed. If these are not available then they can probably be worked around for the most part:
#!/bin/bash
# get the line number to remove
rline=$(grep -n "$1" "$2" | head -n1 | cut -d: -f1)
# number of bytes before the line to be removed
hbytes=$(head -n$((rline-1)) "$2" | wc -c)
# number of bytes to remove
rbytes=$(grep "$1" "$2" | wc -c)
# original file size
fsize=$(cat "$2" | wc -c)
# dd will start reading the file after the line to be removed
ddskip=$((hbytes + rbytes))
# dd will start writing at the beginning of the line to be removed
ddseek=$hbytes
# dd will move this many bytes
ddcount=$((fsize - hbytes - rbytes))
# the expected new file size
newsize=$((fsize - rbytes))
# move the bytes with dd. strace confirms the file is edited in place
dd bs=1 if="$2" skip=$ddskip seek=$ddseek conv=notrunc count=$ddcount of="$2"
# truncate the remainder bytes of the end of the file
dd bs=1 if="$2" skip=$newsize seek=$newsize count=0 of="$2"
Run it thusly:
$ cat > file.txt
line 1
line two
line 3
$ ./grepremove "tw" file.txt
7+0 records in
7+0 records out
0+0 records in
0+0 records out
$ cat file.txt
line 1
line 3
$
Suffice to say that dd is a very dangerous tool. You can easily unintentionally overwrite files or entire disks. Be very careful!
Try ed. The here-document-based example below deletes line 2 from test.txt
ed -s test.txt <<!
2d
w
!
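Since the question obtains the line number from grep -n at runtime, the same ed commands can be generated from a variable; a small sketch, assuming the number is in a shell variable n:
printf '%dd\nw\nq\n' "$n" | ed -s test.txt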
You can do it without grep, using POSIX shell builtins, which should be available on any *nix.
while read LINE || [ "$LINE" ];do
case "$LINE" in
*thing_you_are_grepping_for*)continue;;
*)echo "$LINE";;
esac
done <infile >outfile
If n is the line you want to omit:
{
head -n $(( n-1 )) file
tail -n +$(( n+1 )) file
} > newfile
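For example, to drop line 3 of file.txt and then replace the original (note this still goes through an intermediate file, which the question hoped to avoid):
n=3
{
head -n $(( n-1 )) file.txt
tail -n +$(( n+1 )) file.txt
} > newfile && mv newfile file.txt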
Given dd is deemed too dangerous for this in-place line removal, we need some other method where we have fairly fine-grained control over the file system calls. My initial urge is to write something in C, but while possible, I think that is a bit of overkill. Instead it is worth looking to common scripting (not shell-scripting) languages, as these typically have fairly low-level file APIs which map to the file syscalls in a fairly straightforward manner. I'm guessing this can be done using python, perl, Tcl or one of many other scripting languages that might be available. I'm most familiar with Tcl, so here we go:
#!/bin/sh
# \
exec tclsh "$0" "$@"
package require Tclx
set removeline [lindex $argv 0]
set filename [lindex $argv 1]
set infile [open $filename RDONLY]
for {set lineNumber 1} {$lineNumber < $removeline} {incr lineNumber} {
if {[eof $infile]} {
close $infile
puts "EOF at line $lineNumber"
exit
}
gets $infile line
}
set bytecount [tell $infile]
gets $infile rmline
set outfile [open $filename RDWR]
seek $outfile $bytecount start
while {[gets $infile line] >= 0} {
puts $outfile $line
}
ftruncate -fileid $outfile [tell $outfile]
close $infile
close $outfile
Note on my particular box I have Tcl 8.4, so I had to load the Tclx package in order to use the ftruncate command. In Tcl 8.5, there is chan truncate which could be used instead.
You can pass the line number you want to remove and the filename to this script.
In short, the script does this:
open the file for reading
read the first n-1 lines
get the offset of the start of the next line (line n)
read line n
open the file with a new FD for writing
move the file location of the write FD to the offset of the start of line n
continue reading the remaining lines from the read FD and write them to the write FD until the whole read FD is read
truncate the write FD
The file is edited exactly in place. No temporary files are used.
I'm pretty sure this can be re-written in python or perl or ... if necessary.
Update
OK, so in-place line removal can be done in almost-pure bash, using techniques similar to those in the Tcl script above. But the big caveat is that you need to have the truncate command available. I do have it on my Ubuntu 12.04 VM, but not on my older Redhat-based box. Here is the script:
#!/bin/bash
n=$1
filename=$2
exec 3<> "$filename"
exec 4<> "$filename"
linecount=1
bytecount=0
while IFS="" read -r line <&3 ; do
if [[ $linecount == $n ]]; then
echo "omitting line $linecount: $line"
else
echo "$line" >&4
((bytecount += ${#line} + 1))
fi
((linecount++))
done
exec 3>&-
exec 4>&-
truncate -s "$bytecount" "$filename"
#### or if you can tolerate dd, just to do the truncate:
# dd of="$filename" bs=1 seek=$bytecount count=0
#### or if you have python
# python -c "open(\"$filename\", \"ab\").truncate($bytecount)"
I would love to hear of a more generic (bash-only?) way to do the partial truncate at the end and complete this answer. Of course the truncate can be done with dd as well, but I think that was already ruled out for my earlier answer.
And for the record this site lists how to do an in-place file truncation in many different languages - in case any of these could be used in your environment.
If you can indicate under which circumstances on which platform(s) the most obvious Awk script is failing for you, perhaps we can devise a workaround.
awk "NR!=$N" infile >outfile
Of course, obtaining $N with grep just to feed it to Awk is pretty bass-ackwards. This will delete the line containing the first occurrence of foo:
awk '/foo/ { if (!p++) next } 1' infile >outfile
Based on Digital Trauma's answer, I found an improvement that just needs grep and echo, but no temp file:
echo $(grep -v PATTERN file.txt) > file.txt
Depending on the kind of lines your file contains and whether your pattern requires more complex syntax or not, you can wrap the grep command in double quotes:
echo "$(grep -v PATTERN file.txt)" > file.txt
(useful when deleting from your crontab)

Bash script that prints out contents of a binary file, one word at a time, without xxd

I'd like to create a BASH script that reads a binary file, word (32-bits) by word and pass that word to an application called devmem.
Right now, I have:
...
for (( i=0; i<${num_words}; i++ ))
do
val=$(dd if=${file_name} skip=${i} count=1 bs=4 2>/dev/null)
echo -e "${val}" # Weird output...
devmem ${some_address} 32 ${val}
done
...
${val} has some weird (ASCII?) format character representations that look like a diamond with a question mark.
If I replace the "val=" line with:
val=$(dd ... | xxd -r -p)
I get my desired output.
What is the easiest way of replicating the functionality of xxd using BASH?
Note: I'm working on an embedded Linux project where the requirements don't allow me to install xxd.
This script is performance driven; please correct me if I'm wrong in my approach, but for this reason I chose (dd -> binary word -> devmem) instead of (hexdump -> file, parse file -> devmem).
Regardless of the optimal route to my end goal, this exercise has been giving me some trouble and I'd very much appreciate someone helping me figure out how to do this.
Thanks!
As far as I can see, you shouldn't pass the binary words through echo or a command substitution, because those are text-oriented tools: NUL bytes and trailing newlines get mangled, which is where the diamond-and-question-mark characters come from. Instead I think '>' (redirecting the raw bytes to a file) could work for you.
I don't know exactly what input devmem accepts, so I would try something like this:
for (( i=0; i<${num_words}; i++ ))
do
dd if=${file_name} skip=${i} count=1 bs=4 2>/dev/null 1> binary_file
# echo -e "${val}" # Weird output...
devmem ${some_address} 32 $(cat binary_file)
done
"As cat simply catenates streams of bytes, it can be also used to concatenate binary files, where it will just concatenate sequence of bytes." wiki
Or you can alter devmem to accept a file as input... I hope this helps!
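Since the question mentions hexdump is available, another option is to let hexdump render each 4-byte word as hex text and pass that to devmem; a sketch, assuming devmem accepts a 0x-prefixed hex value and that host byte order is what you want:
for (( i=0; i<${num_words}; i++ ))
do
# read one 32-bit word and format it as 8 hex digits (host byte order)
val=$(dd if="${file_name}" skip=${i} count=1 bs=4 2>/dev/null | hexdump -e '1/4 "%08x"')
devmem ${some_address} 32 "0x${val}"
done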

Pipe continuous stream to multiple files

I have a program which will be running continuously writing output to stdout, and I would like to continuously take the last set of output from it and write this to a file. Each time a new set of output comes along, I'd like to overwrite the same file.
An example would be the following command:
iostat -xkd 5
Every 5 seconds this prints some output, with a blank line after each one. I'd like the last "set" to be written to a file, and I think this should be achievable with something simple.
So far I've tried using xargs, which I could only get to do this with a single line, rather than a group of lines delimited by something.
I think it might be possible with awk, but I can't figure out how to get it to either buffer the data as it goes along so it can write it out using system, or get it to close and reopen the same output file.
EDIT: To clarify, this is a single command which will run continuously. I do NOT want to start a new command and read the output of that.
SOLUTION
I adapted the answer from @Bittrance to get the following:
iostat -xkd 5 | (while read r; do
if [ -z "$r" ]; then
mv -f /tmp/foo.out.tmp /tmp/foo.out
else
echo "$r" >> /tmp/foo.out.tmp
fi
done)
This was basically the same, apart from detecting the end of a "section" and writing to a temp file so that whenever an external process tried to read the output file it would be complete.
Partial answer:
./your_command | (while read r ; do
if ... ; then
rm -f foo.out
fi
echo $r >> foo.out
done)
If there is a condition (the ...) such that you know that you are receiving the first line of the "set" this would work.
Why not:
while :; do
iostat -xkd > FILE
sleep 5
done
If you're set on using awk the following writes each output from iostat to a numbered file:
iostat -xkd 5 | awk -v RS=$'\n'$'\n' '{ print >NR; close(NR) }'
Note: the first record is the header that iostat outputs.
Edit: In reply to a comment, I tested this perl solution and it works:
#!/usr/bin/perl
use strict;
use warnings;
$|=1;
open(STDOUT, '>', '/tmp/iostat.running') or die $!;
while(<>)
{
seek (STDOUT, 0, 0) if (m/^$/gio);
print
}
So now you can
iostat -xkd 5 | myscript.pl&
And watch the snapshots:
watch -n10 cat /tmp/iostat.running
If you wanted to make sure that enough space is overwritten at the end (since the output length might vary slightly each time), you could print some padding at the end, like e.g.:
print "***"x40 and seek (STDOUT, 0, 0) if (m/^$/gio);
(snipped old answer text since apparently it was confusing people)
