How to optimize reading from a large file in a while loop in shell script - linux

I was going through some random articles on the internet about how to optimize file input to a loop and tried to test things myself. They claim that file-descriptor manipulation can in most cases be faster and more efficient than reading directly from a file into a loop. I tried to test it by doing this:
First read from the file directly into the loop:
time while read a ; do :;done < testfile
The time this command took to run is:
real 0m8.782s
user 0m1.292s
sys 0m0.399s
Now I try some file-descriptor manipulation, as one of the articles suggested:
I first duplicate file descriptor 0 onto file descriptor 3: exec 3<&0
I then redirect testfile to file descriptor 0: exec 0<testfile
And at the end of the loop I restore standard input with exec 0<&3, which duplicates file descriptor 3 back onto 0. So the complete line is:
exec 3<&0;exec 0<testfile; time for i in $(seq 1 20);do while read a; do :;done; done; exec 0<&3
This gives me a time as:
real 0m8.792s
user 0m1.258s
sys 0m0.430s
But I see the time to be almost the same in both cases; in fact it is a tad slower when I use file descriptors. The file testfile is 6 MB with close to 400k lines, each 20-25 characters at most.
In fact, for even bigger files the direct read from the file is actually faster than the file-descriptor manipulation.

Use C. This is the fastest you can get if you really care about speed.
You can write your own program that getline()s from the input stream and then calls system() on each line. This may be slower because of the fork() and exec() calls, but may be way faster if you can put your per-line operations into C code.
You can write your own shell builtin. The shell's read builtin just calls read(); browse the bash sources. A custom builtin that loops a command over the input could be faster than the default read builtin, e.g. my_read_builtin 'file' -- 'command to run on each line'.
To make your post reproducible I created a big file:
$ for ((i=0;i<1200000;++i)); do echo ${RANDOM}; done >/tmp/1
$ du -hs /tmp/1
6.5M /tmp/1
Then run:
$ time ( printf '#include<errno.h>\n#include<stdlib.h>\n#include<stdio.h>\nint main(){char*b=0;size_t n=0;ssize_t r;while((r=getline(&b,&n,stdin))>0);if(errno)abort();return 0;}\n' | gcc -Ofast -Wall -xc -o/tmp/a.out -; /tmp/a.out </tmp/1; )
real 0m0.095s
user 0m0.064s
sys 0m0.031s
$ time ( cat >/dev/null; ) </tmp/1
real 0m0.007s
user 0m0.001s
sys 0m0.006s
$ time ( while read l; do :; done </tmp/1; )
real 0m6.994s
user 0m5.222s
sys 0m1.731s
$ time ( exec 3</tmp/1; while read -u3 l; do :; done; )
real 0m7.953s
user 0m5.965s
sys 0m1.949s
$ time xargs -a /tmp/1 -n1 true
< very, very slow, got impatient and CTRL+C it >

Related

Update Bash commands every 2 seconds (without re-running code every time)

For my first bash project I am developing a simple bash script that shows basic information about my system:
#!/bin/sh
UPTIME=$(w)
MHZ=$(cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq)
TEMP=$(cat /sys/class/thermal/thermal_zone0/temp)
#UPTIME shows the uptime of the device
#MHZ shows the overclocked specs
#TEMP shows the current CPU Temperature
echo "$UPTIME" #displays uptime
echo "$MHZ" #displays overclocked specs
echo "$TEMP" #displays CPU Temperature
MY QUESTION: How can I code this so that the uptime and CPU temperature refresh every 2 seconds without re-running the whole script each time? (I just want these two variables to update without having to enter the file path again and re-run everything.)
This code is already working fine on my system, but after it executes on the command line the information doesn't update: the script has run its commands and is standing by for the next command instead of updating variables such as UPTIME in real time.
I hope someone understands what I am trying to achieve; sorry about my poor wording of this idea.
Thank you in advance...
I think this will help you: you can use the watch command to refresh the output every two seconds, without any loop.
watch ./filename.sh
It will refresh the output of that command every two seconds.
watch - execute a program periodically, showing output fullscreen
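watch's default interval is already the two seconds you asked for, and it can also be made explicit with -n:
watch -n 2 ./filename.sh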
Not sure I really understand the main goal, but here's an answer to the basic question "How can I code this so that the uptime and CPU temperature refresh every two seconds?":
#!/bin/sh
while :; do
UPTIME=$(w)
MHZ=$(cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq)
TEMP=$(cat /sys/class/thermal/thermal_zone0/temp)
#UPTIME shows the uptime of the device
#MHZ shows the overclocked specs
#TEMP shows the current CPU Temperature
echo "$UPTIME" #displays uptime
echo "$MHZ" #displays overclocked specs
echo "$TEMP" #displays CPU Temperature
sleep 2
done
Let me suggest some modifications.
For such a simple job I recommend not using external utilities. So instead of $(cat file) you could use $(<file). This is cheaper, as bash does not have to launch cat.
On the other hand, if reading those devices returns only one line, you can use the bash builtin read, like read ENV_VAR <single_line_file. It is even cheaper. If there are more lines and, for example, you want to read the 2nd line, you could use something like this: { read line_1; read line_2;} <file.
As I see it, w provides much more information than you need, and I assume you only want the header line. That is exactly what uptime prints. The external utility uptime reads the /proc/uptime pseudo file, so to avoid calling externals you can read this pseudo file directly.
The looping part also uses the external sleep(1) utility. The timeout feature of the read builtin can be used instead.
So in short the script would look like this:
while :; do
# /proc/uptime has two fields, uptime and idle time
read UPTIME IDLE </proc/uptime
# Not having these pseudo files on my system, the whole line is read
# Maybe some formatting is needed. For MHZ /proc/cpuinfo may be used
read MHZ </sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq
read TEMP </sys/class/thermal/thermal_zone0/temp
# Bash supports only integer arithmetic, so chomp off float
UPTIME_SEC=${UPTIME%.*}
UPTIME_HOURS=$((UPTIME_SEC/3600))
echo "Uptime: $UPTIME_HOURS hours"
echo $MHZ
echo $TEMP
# read -t doubles as the sleep; it reads stdin, so pressing ENTER makes it return immediately
read -t 2
done
This does not call any external utility and does not fork at all. So instead of executing three external utilities (with the expensive fork and execve system calls) every 2 seconds, it executes none. Far fewer system resources are used.
You could use while [ : ] (or simply while :) and sleep 2.
You need the awesome power of loops! Something like this should be a good starting point:
while true ; do
echo 'Uptime:'
w 2>&1 | sed 's/^/ /'
echo 'Clocking:'
sed 's/^/ /' /sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq
echo 'Temperature:'
sed 's/^/ /' /sys/class/thermal/thermal_zone0/temp
echo '=========='
sleep 2
done
That should give you your three sections, with the data of each nicely indented.

How to make an expect script to input commands into GDB?

I want to write an Expect script that will simply input commands into GDB regardless of its output. Then I want to take certain parts of the output of GDB and extract information from it using shell commands such as grep and sed. Then I want to use this information to input more commands into GDB.
For example, I would initiate a back trace by sending the command "bt" to GDB from the expect script. Then I would grep for a word such as "pardrivr" and get the line number associated with it. Then I would input "f lineNumberOfPardrivr" into GDB. This process would be repeated until the correct information is eventually extracted.
Is this possible? If so, what is the best way to go about doing this?
Thanks
My $0.02: I'd use a coprocess or named pipe under ksh/bash/zsh. Much easier. See: https://unix.stackexchange.com/questions/86270/how-do-you-use-the-command-coproc-in-bash
Also, consider tee'ing the output of gdb into a named pipe that you cat in another xterm. Makes it much easier to debug what your script is reading if you can see a copy of the gdb output.
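A rough sketch of that idea (the pipe name /tmp/gdbout and ./prog are arbitrary; note that tee will block until something opens the FIFO for reading):
mkfifo /tmp/gdbout
# in another xterm:  cat /tmp/gdbout
gdb ./prog 2>&1 | tee /tmp/gdbout | grep 'pardrivr'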
Edited to add:
Still can't post comments. *sigh*
gdb in batch mode, or via a simple shell redirect, won't let us define commands on the fly based upon current gdb output. A coprocess or named pipe approach is much the same technique, but it lets us create new input dynamically at will based upon gdb's output processed through grep/sed/awk/perl/whatever. Python or Perl might be even easier to use with their facilities for regular expressions and subprocesses. E.g. (perl) open("|gdb ...")
http://perldoc.perl.org/functions/open.html
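A bash coproc version of the same idea might look roughly like this (a sketch only; ./prog stands for whatever you are debugging, and coproc needs bash 4+):
coproc GDB { gdb -q ./prog 2>&1; }        # ${GDB[0]}: read gdb's output, ${GDB[1]}: write to gdb
echo 'bt' >&"${GDB[1]}"                   # send a command
while IFS= read -r -t 1 -u "${GDB[0]}" line; do
echo "$line"                              # here you would grep/sed the output and decide what to send next
done
echo 'quit' >&"${GDB[1]}"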
Edited again to add:
A named pipe is a FIFO (first in first out) that exists much like a file in the filesystem. It's not really a file of course. It's just something that can be used like a file. Anything that you write to it can be read back out, within the limits of the OS buffering. (Otherwise writes will block.)
FIFOs are available under Unix, Linux, and macOS, but not Windows. You create them with mkfifo. Any process can write to one; any process can read from it. From that link I posted up above:
mkfifo in out
cmd <in >out &
exec 3> in 4< out
echo data >&3
read var <&4
From my own playing around to demo this...
#in BASH
mkfifo IN OUT
#or mkfifo IN OUT ERR
gdb < IN > OUT 2>&1 &
#or gdb < IN > OUT 2> ERR &
#or gdb < IN > OUT &
exec 3> IN
exec 4< OUT
echo "help bt" >&3
while read -t 0.001 var <&4 ; do echo $var; done
echo "help stack" >&3
while read -t 0.001 var <&4 ; do echo $var; done
#don't forget to kill the gdb process when you are done...
echo "quit" >&3
while read -t 0.001 var <&4 ; do echo $var; done
I want to write an Expect script that will simply input commands into GDB regardless of its output.
For non-interactive control you don't need expect, as gdb has a -batch mode and is able to read commands from a file (-x).
Moreover, as gdb reads input from stdin and produces output on stdout, standard redirection might do the trick.
For example, I wrote a simple C program:
sh$ cat hello.c
#include <stdio.h>
int main() {
    char msg[] = "Hello world";
    printf("%s\n", msg);
    return 0;
}
sh$ gcc -ggdb hello.c -o hello
I'm able to "script" the gdb session like that:
sh$ gdb -q hello <<EOF | awk '$2=="$1" { print "Var was #" $NF }'
br 6
r
print &msg
c
quit
EOF
warning: no loadable sections found in added symbol-file system-supplied DSO at 0x7ffff7ffa000
Var was #0x7fffffffe230
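For completeness, the -batch/-x route mentioned above would look roughly like this (cmds.gdb is just a name I made up for the command file):
sh$ cat > cmds.gdb <<EOF
br 6
r
print &msg
EOF
sh$ gdb -batch -x cmds.gdb hello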

How to create a system-generated alert tone in Linux, without using any external audio file?

I have a Linux shell script which I need to run at 30-minute intervals, so I'm using an infinite for loop like this:
for ((i=0;;i++))
do
clear;
./myscript.sh;
sleep 30m;
done
I want to add a line of code before clear, so that each time before the script executes the system generates an alert tone or sound, which will make the user aware that the 30-minute interval has passed and the script is executing again. Any way to do that?
I currently don't have any external .wav file on my system which could be played as an alert. Kindly suggest.
For a simple tone you can use the beep command. You can find it in the Debian repositories; the package name is beep.
It is a simple utility and you can use it from the command line.
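For example (flag names per the Debian beep package; adjust frequency and duration to taste), you could put this right before the clear in your loop:
beep -f 800 -l 300    # 800 Hz tone lasting 300 ms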
You can emit a BEL character (character code 7, ^G) to /dev/console. Beware that only root is allowed to do so.
With proper escape codes, you can even control the length and frequency of a beep.
For example, the code below, run as a cron job, constitutes a cuckoo clock :-)
#!/bin/bash
h=$(((`date +%k`+11)%12+1))
echo -en "\033[11;200]" >/dev/console
for ((i=0; i<h; i++)); do
echo -en "\033[10;600]\a"
usleep 200000
echo -en "\033[10;490]\a"
usleep 500000
done >/dev/console
echo -en "\033[10;750]\033[11;100]" >/dev/console

redirecting stdin _and_ stdout to a pipe

I would like to run a program "A", have its output go to the input of another program "B", and also have stdin go to the input of "B". If program "A" closes, I'd like "B" to continue running.
I can redirect A output to B input easily:
./a | ./b
And I can combine stderr into the output if I'd like:
./a 2>&1 | ./b
But I can't figure out how to combine stdin into the output. My guess would be:
./a 0>&1 | ./b
but it doesn't work.
Here's a test that doesn't require us to write any test programs:
$ echo ls 0>&1 | /bin/sh -i
$ a b info.txt
$
/bin/sh: Cannot set tty process group (No such process)
If possible, I'd like to do this using only bash redirection on the command line (I don't want to write a C program to fork off child processes and do something complicated every time I want to redirect stdin to a pipe).
This cannot be done without writing an auxiliary program.
In general, stdin could be a read-only file descriptor (heck, it might refer to a read-only file). So you cannot "insert" anything into it.
You will need to write a "helper" program that monitors two file descriptors (say, 0 and 3) in order to read from both and "merge" them. A simple select or poll loop would be sufficient, and you could write it in most scripting languages, but not the shell, I don't think.
Then you can use shell redirection to feed your program's output to descriptor 3 of the "helper".
Since what you want is basically the opposite of "tee", I might call it "eet"...
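Usage of such a helper would then look something like this (purely hypothetical, since "eet" is a program you would still have to write; it is assumed to merge whatever arrives on fd 0 and fd 3 onto its stdout):
./eet 3< <(./a) | ./b     # terminal stdin on fd 0, A's output on fd 3, merged stream into B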
[edit]
If only you could launch "cat" in the background...
But that will fail because background processes with a controlling terminal cannot read from stdin. So if you could just detach "cat" from its controlling terminal and run it in the background...
On Linux, "setsid cat" should do it, roughly. But (a) I could not get it to work very well and (b) I really do not have time for this today and (c) it is non-standard anyway.
I would just write the helper program.
[edit 2]
OK, this seems to work:
{ seq 5 ; sleep 2 ; seq 5 ; } | /bin/bash -c 'set -m ; setsid cat ; echo HELLO'
The set -m thing forces bash to enable job control, which apparently is needed to prevent the shell from redirecting stdin from /dev/null.
Here, the echo HELLO represents your "program A". The seq commands (with the sleep in the middle) are just to provide some input. And yes, you can pipe this whole thing to process B.
About as ugly and non-portable a solution as you could ask for...
A pipe has two ends. One is for writing, and whatever is written there appears at the other end, which is for reading.
It's a pipe, not a T or Y junction.
I don't think your scenario is possible. Having "stdin going to input of" anything doesn't make sense.
If I understand your requirements correctly, you want this set up (ASCII art to the fore):
o----+----->| A |----+---->| B |---->o
     |               ^
     |               |
     +---------------+
with the additional constraint that if process A closes up shop, process B should be able to continue with the input stream going to B.
This is a non-standard setup, as you realize, and can only be achieved by using an auxiliary program to drive the input to A and B. You end up with some interesting synchronization issues but it will all work remarkably well as long as your messages are short enough.
The plumbing necessary to achieve this is notable - you'll need two pipes, one for the input to A and the other for the input to B, and the output of A will be connected to the input of B as well.
o---->| C |----+------>| A |----+---->| B |---->o
               |                ^
               |                |
               +----------------+
Note that C will be writing the data twice, once to A and once to B. Note, too, that the pipe from C to B is the same pipe as the pipe from A to B.
To make the given test case work you have to while ... read from the controlling terminal device /dev/tty inside a sh -c '...' construct.
Note the use of eval (could it be avoided here?) and that multi-line commands on input> will fail.
echo 'ls; export var=myval' | (
stdin="$(</dev/stdin)"
/bin/sh -i -c '
eval "$1";
while IFS="" read -e -r -p "input> " line; do
history -s "${line}"
eval "${line}";
done </dev/tty
' argv0 "${stdin}"
)
input> echo $var
For a similar problem and the use of named pipes see here:
BASH: Best architecture for reading from two input streams
This can't be done exactly as shown, but to perform your example you can make use of cat's ability to join files together:
cat <(echo ls) - | /bin/sh
(You can do -i, but then you'll have to have another process kill the /bin/sh, as your attempts to Ctrl-C and Ctrl-D out will fail.)
This assumes that you want to pass in your piped input and then accept from stdin. You can also make it so that it does something after stdin is done, or on both sides -- but it won't merge input character-by-character or line-by-line.
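A couple of variants of the same trick, for the "after stdin" and "both sides" cases (the echoed commands are only examples):
cat - <(echo ls) | /bin/sh                 # stdin first, injected command afterwards
cat <(echo ls) - <(echo exit) | /bin/sh    # injected input before and after stdin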
This seems to do what you want:
$ ( ./a <&-; cat ) | ./b
(It's not clear to me if you want a to get input...this solution sends all input to b)
Of course, in this case the inputs to b are strictly ordered: all of the output of a
is sent to b first, then a terminates, then input goes to b. If you want things
interleaved, try:
$ ( ./a <&- & cat ) | ./b
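A toy run of the interleaved variant, with stand-ins for ./a and ./b:
$ ( seq 3 <&- & cat ) | tr 'a-z' 'A-Z'
Here seq plays the role of ./a and tr plays ./b; lines typed at the terminal are interleaved with seq's output and uppercased by tr.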

Multi-threaded BASH programming - generalized method?

Ok, I was running POV-Ray on all the demos, but POV's still single-threaded and wouldn't utilize more than one core. So, I started thinking about a solution in BASH.
I wrote a general function that takes a list of commands and runs them in the designated number of sub-shells. This actually works but I don't like the way it handles accessing the next command in a thread-safe multi-process way:
It takes, as an argument, a file with commands (one per line).
To get the "next" command, each process ("thread"):
Waits until it can create a lock file, with: ln $CMDFILE $LOCKFILE
Reads the command from the file,
Modifies $CMDFILE by removing the first line,
Removes the $LOCKFILE.
Is there a cleaner way to do this? I couldn't get the sub-shells to read a single line from a FIFO correctly.
Incidentally, the point of this is to enhance what I can do on a BASH command line, and not to find non-bash solutions. I tend to perform a lot of complicated tasks from the command line and want another tool in the toolbox.
Meanwhile, here's the function that handles getting the next line from the file. As you can see, it modifies an on-disk file each time it reads/removes a line. That's what seems hackish, but I'm not coming up with anything better, since FIFO's didn't work w/o setvbuf() in bash.
#
# Get/remove the first line from FILE, using LOCK as a semaphore (with
# short sleep for collisions). Returns the text on standard output,
# returns zero on success, non-zero when file is empty.
#
parallel__nextLine()
{
local line rest file=$1 lock=$2
# Wait for lock...
until ln "${file}" "${lock}" 2>/dev/null
do sleep 1
[ -s "${file}" ] || return $?
done
# Open, read one "line" save "rest" back to the file:
exec 3<"$file"
read line <&3 ; rest=$(cat<&3)
exec 3<&-
# After last line, make sure file is empty:
( [ -z "$rest" ] || echo "$rest" ) > "${file}"
# Remove lock and 'return' the line read:
rm -f "${lock}"
[ -n "$line" ] && echo "$line"
}
#adjust these as required
args_per_proc=1 #1 is fine for long running tasks
procs_in_parallel=4
xargs -n$args_per_proc -P$procs_in_parallel povray < list
Note that the nproc command (now part of coreutils) will auto-determine the number of available processing units, which can then be passed to -P.
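So a typical invocation (sticking with the hypothetical povray job list from above) might be:
xargs -n1 -P"$(nproc)" povray < list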
If you need real thread safety, I would recommend migrating to a better scripting system.
With Python, for example, you can create real threads with safe synchronization using semaphores/queues.
Sorry to bump this after so long, but I pieced together a fairly good solution for this, IMO.
It doesn't work perfectly, but it limits the script to a certain number of running child tasks and then waits for all of them at the end.
#!/bin/bash
pids=()
thread() {
local this
while [ ${#} -gt 6 ]; do
this=${1}
wait "$this"
shift
done
pids=($1 $2 $3 $4 $5 $6)
}
for i in 1 2 3 4 5 6 7 8 9 10
do
sleep 5 &
pids=( "${pids[@]}" $! )   # append the PID of the job just started
thread "${pids[@]}"
done
for pid in "${pids[@]}"
do
wait "$pid"
done
It seems to work great for what I'm doing (handling parallel uploads of a bunch of files at once) and keeps it from overwhelming my server, while still making sure all the files get uploaded before the script finishes.
I believe you're actually forking processes here, not threading. I would recommend looking for threading support in a different scripting language like Perl, Python, or Ruby.
