How does data get processed across pipes? - linux

I used this command-line program that I found in another post on SO describing how to spider a website.
wget --spider --force-html -r -l2 http://example.com 2>&1 | grep '^--' | awk '{ print $3 }' | grep -v '\.\(css\|js\|png\|gif\|jpg\)$' > wget.out
When I crawl a large site, it takes a long time to finish. Meanwhile the wget.out file on disk shows zero size. So when does the piped data get processed and written to the file on disk? Is it after each stage in the pipe has run to completion? In that case, will wget.out only fill up after the entire crawl is over?
How do I make the program write intermittently to disk, so that even if the crawling stage is interrupted, I have some output saved?

There is buffering in each pipe, and maybe in the stdio layers of each program. Data will not make it to the disk until the final grep has processed enough lines to cause its buffers to fill to the point of being spilled to disk.
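If the goal is to see partial results in wget.out sooner, you can also ask each stage to flush per line instead of per buffer. A minimal sketch, assuming GNU grep (--line-buffered) and coreutils stdbuf are available:
# line-buffer every stage so URLs land in wget.out as they are found
wget --spider --force-html -r -l2 http://example.com 2>&1 \
  | grep --line-buffered '^--' \
  | stdbuf -oL awk '{ print $3 }' \
  | grep --line-buffered -v '\.\(css\|js\|png\|gif\|jpg\)$' > wget.out
With each stage line-buffered, every URL is written to wget.out as soon as it passes the final grep, so an interrupted crawl still leaves partial output on disk.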
If you run your pipeline on the command line and then hit Ctrl-C, SIGINT will be sent to every process in the pipeline, terminating each and losing any pending output.
Either:
Ignore SIGINT in all processes but the first. Bash hackery follows:
$ wget --spider --force-html -r -l2 http://example.com 2>&1 | grep '^--' |
{ trap '' int; awk '{ print $3 }'; } |
∶
Simply deliver the keyboard interrupt to the first process. Interactively you can discover the pid with jobs -l and then kill that. (Run the pipeline in the background.)
$ jobs -l
[1]+ 10864 Running wget
3364 Running | grep
13500 Running | awk
∶
$ kill -int 10864
Play around with the disown bash builtin.
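For the disown route, a rough sketch: start the pipeline as a background job, note the wget PID from jobs -l, then detach the job so it keeps running (and keeps writing to wget.out) even if the shell goes away:
# run the whole pipeline in the background
wget --spider --force-html -r -l2 http://example.com 2>&1 \
  | grep '^--' | awk '{ print $3 }' \
  | grep -v '\.\(css\|js\|png\|gif\|jpg\)$' > wget.out &
jobs -l        # note the wget PID here in case you want to kill -INT it later
disown -h %1   # the job no longer gets SIGHUP when this shell exits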

Related

Why does xtrace show piped commands executing out of order?

Here's a simple reproducer:
cat >sample_pipeline.sh << EOF
set -x
head /dev/urandom | awk '{\$2=\$3=""; print \$0}' | column -t | grep 123 | wc -l >/dev/null
EOF
watch -g -d -n 0.1 'bash sample_pipeline.sh 2>&1 | tee -a /tmp/watchout'
wc -l /tmp/watchout
tail /tmp/watchout
As expected, usually the commands execute in the order they are written in:
+ head /dev/urandom
+ awk '{$2=$3=""; print $0}'
+ column -t
+ grep 123
+ wc -l
...but some of the time the order is different, e.g. awk before head:
+ awk '{$2=$3=""; print $0}'
+ head /dev/urandom
+ column -t
+ grep 123
+ wc -l
I can understand if the shell pre-spawns processes waiting for input, but why wouldn't it spawn them in order?
Replacing bash with dash (the default /bin/sh on Ubuntu) seems to give the same behavior.
When you write a pipeline like that, the shell will fork off a bunch of child processes to run all of the subcommands. Each child process will then print the command it is executing just before it calls exec. Since each child is an independent process, the OS might schedule them in any order, and various things going on (cache misses/thrashing between CPU cores) might delay some children and not others. So the order the messages come out is unpredictable.
This happens because of how pipelines are implemented. The shell first forks N subshells, and each subshell (when scheduled) prints its xtrace output and invokes its command. The order of the output is therefore the result of a race.
You can see a similar effect with
for i in {1..5}; do echo "$i" & done
Even though each echo command is spawned in order, the output may still be 3 4 5 2 1.

Bash script optimization for waiting for a particular string in log files

I am using a bash script that calls multiple processes which have to start up in a particular order, and certain actions have to be completed (they then print out certain messages to the logs) before the next one can be started. The bash script has the following code which works really well for most cases:
tail -Fn +1 "$log_file" | while read line; do
    if echo "$line" | grep -qEi "$search_text"; then
        echo "[INFO] $process_name process started up successfully"
        pkill -9 -P $$ tail
        return 0
    elif echo "$line" | grep -qEi '^error\b'; then
        echo "[INFO] ERROR or Exception is thrown listed below. $process_name process startup aborted"
        echo " ($line) "
        echo "[INFO] Please check $process_name process log file=$log_file for problems"
        pkill -9 -P $$ tail
        return 1
    fi
done
However, when we set the processes to print logging in DEBUG mode, they print so much logging that this script cannot keep up, and it takes about 15 minutes after the process is complete for the bash script to catch up. Is there a way of optimizing this, like changing 'while read line' to 'while read 100 lines', or something like that?
How about not forking up to two grep processes per log line?
tail -Fn +1 "$log_file" | grep -Ei "$search_text|^error\b" | while read line; do
So a single long-running grep process does the pre-filtering, if you will.
Edit: As noted in the comments, it is safer to add --line-buffered to the grep invocation.
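With that flag added, the pre-filter stage looks like this (a sketch, assuming GNU grep; the rest of the loop stays the same as above):
tail -Fn +1 "$log_file" | grep --line-buffered -Ei "$search_text|^error\b" | while read line; do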
Some tips relevant for this script:
Checking that the service is doing its job is a much better check for daemon startup than looking at the log output
You can use grep ... <<<"$line" instead of echo "$line" | grep ... to avoid spawning extra processes per line.
You can use tail -f | grep -q ... to avoid the while loop entirely, by stopping as soon as there's a matching line (see the sketch after this list).
If you can avoid -i on grep it might be significantly faster to process the input.
Thou shalt not kill -9.
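A rough sketch of the tail | grep -q tip, reusing $log_file, $search_text and $process_name from the original script; it uses -m1 instead of -q so the matching line itself is kept (GNU and BSD grep both support -m). grep exits on the first line matching either pattern, and tail then exits on its next write (SIGPIPE), which happens quickly when the log is chatty:
first_match=$(tail -Fn +1 "$log_file" | grep -m1 -Ei "$search_text|^error\b")
if grep -qEi "$search_text" <<<"$first_match"; then
    echo "[INFO] $process_name process started up successfully"
else
    echo "[INFO] ERROR during $process_name startup: ($first_match)"
    echo "[INFO] Please check $process_name process log file=$log_file for problems"
fi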

How can I use a pipe or redirect in a qsub command?

There are some commands I'd like to run on a grid using qsub (SGE 8.1.3, CentOS 5.9) that need to use a pipe (|) or a redirect (>). For example, let's say I have to parallelize the command
echo 'hello world' > hello.txt
(Obviously a simplified example: in reality I might need to redirect the output of a program like bowtie directly to samtools). If I did:
qsub echo 'hello world' > hello.txt
the resulting content of hello.txt would look like
Your job 123454321 ("echo") has been submitted
Similarly if I used a pipe (echo "hello world" | myprogram), that message is all that would be passed to myprogram, not the actual stdout.
I'm aware I could write small bash scripts that each contain the command with the pipe/redirect, and then do qsub ./myscript.sh. However, I'm trying to run many parallelized jobs at the same time using a script, so I'd have to write many such bash scripts, each with a slightly different command. When scripted, this solution starts to feel very hackish. An example of such a script in Python:
for i, (infile1, infile2, outfile) in enumerate(files):
    command = ("bowtie -S %s %s | " +
               "samtools view -bS - > %s\n") % (infile1, infile2, outfile)
    script = "job" + str(i) + ".sh"
    open(script, "w").write(command)
    os.system("chmod 755 %s" % script)
    os.system("qsub -cwd ./%s" % script)
This is frustrating for a few reasons, among them that my program can't even delete the many jobXX.sh scripts afterwards to clean up after itself, since I don't know how long the job will be waiting in the queue, and the script has to be there when the job starts.
Is there a way to provide my full echo 'hello world' > hello.txt command to qsub without having to create another file containing the command?
You can do this by turning it into a bash -c command, which lets you put the | in a quoted statement:
qsub bash -c "cmd <options> | cmd2 <options>"
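Applied to the bowtie/samtools example, the Python loop above could then submit each piped command directly instead of writing jobXX.sh files. A sketch in shell (the file list and its format are illustrative; depending on the SGE version you may also need -b y, as the next answer notes):
# each line of file_list.txt is assumed to hold: infile1 infile2 outfile
while read -r infile1 infile2 outfile; do
    qsub -cwd bash -c "bowtie -S $infile1 $infile2 | samtools view -bS - > $outfile"
done < file_list.txt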
As @spuder has noted in the comments, it seems that in other versions of qsub (not SGE 8.1.3, which I'm using), one can solve the problem with:
echo "cmd <options> | cmd2 <options>" | qsub
as well.
Although my answer is a bit late, I am adding it for any incoming viewers. To use a pipe/redirect and submit that as a qsub job, you need to do a couple of things. But first, using qsub at the end of a pipe like you're doing will only result in one job being sent to the queue (i.e. your code will run serially rather than in parallel).
Run qsub in binary mode, since by default qsub expects a script rather than a command line. For that, pass the -b y flag to qsub; this avoids errors of the sort "command required for a binary mode" or "script length does not match declared length".
Echo each call to qsub and then pipe that to a shell.
Suppose you have a file params-query.txt which holds several bowtie commands piped to samtools, of the following form:
bowtie -q query -1 param1 -2 param2 ... | samtools ...
To send each query as a separate job, feed each line from the file to qsub via xargs. Notice that the quotes around the braces are important when you are submitting a command with piped parts; that way your entire query is treated as a single unit.
cat params-query.txt | xargs -i echo qsub -b y -o output_log -e error_log -N job_name \"{}\" | sh
If that doesn't work as expected, you're probably better off having bowtie write an intermediate output file and then calling samtools on that intermediate output. You won't need to change the qsub call through xargs, but the commands in params-query.txt should look like:
bowtie -q query -o intermediate_query_out -1 param1 -2 param2 && samtools read_from_intermediate_query_out
This page has interesting qsub tricks you might like:
grep http *.job | awk -F: '{print $1}' | sort -u | xargs -I {} qsub {}

How to log output in bash and see it in the terminal at the same time?

I have some scripts where I need to see the output and log the result to a file, with the simplest example being:
$ update-client > my.log
I want to be able to see the output of the command while it's running, but also have it logged to the file. I also log stderr, so I would want to be able to log the error stream while seeing it as well.
update-client 2>&1 | tee my.log
2>&1 redirects standard error to standard output, and tee sends its standard input to standard output and the file.
Just use tail to watch the file as it's updated. Background your original process by adding & after the command above. Then, once the command is running, just use
$ tail -f my.log
It will update continuously. (Note that tail won't tell you when the command has finished, so you can have the command write something to the log when it's done. Press Ctrl-C to exit tail.)
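Putting that together, a minimal sketch (update-client is the example command from the question; 2>&1 also captures stderr, as the question asks):
update-client > my.log 2>&1 &   # run in the background, logging stdout and stderr
tail -f my.log                  # watch the log grow; Ctrl-C stops tail, not update-client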
You can use the tee command for that:
command | tee /path/to/logfile
The equivalent without showing the output in the terminal would be:
command > /path/to/logfile
If you want to append (>>) and show the output in the shell, use the -a option:
command | tee -a /path/to/logfile
Please note that the pipe catches stdout only; errors written to stderr are not passed to tee. If you want to log errors (from stderr) as well, use:
command 2>&1 | tee /path/to/logfile
This means: run command and redirect the stderr stream (2) to stdout (1). That will be passed to the pipe with the tee application.
Learn more about this on the Ask Ubuntu site.
Another option is to use block-based output capture from within the script (not sure if that is the correct technical term).
Example
#!/bin/bash
{
echo "I will be sent to screen and file"
ls ~
} 2>&1 | tee -a /tmp/logfile.log
echo "I will be sent to just terminal"
I like to have more control and flexibility, so I prefer this approach.

Suppress Notice of Forked Command Being Killed

Let's suppose I have a bash script (foo.sh) that in a very simplified form, looks like the following:
echo "hello"
sleep 100 &
ps ax | grep sleep | grep -v grep | awk '{ print $1 } ' | xargs kill -9
echo "bye"
The third line imitates pkill, which I don't have by default on Mac OS X, but you can think of it as the same as pkill. However, when I run this script, I get the following output:
hello
foo: line 4: 54851 Killed sleep 100
bye
How do I suppress the line in the middle so that all I see is hello and bye?
While disown may have the side effect of silencing the message, this is how you start the process in a way that the message is truly silenced, without having to give up job control of the process:
{ command & } 2>/dev/null
If you still want the command's own stderr (silencing just the shell's message on stderr), you'll need to send the process's stderr to the real stderr:
{ command 2>&3 & } 3>&2 2>/dev/null
To learn about how redirection works:
From the BashGuide: http://mywiki.wooledge.org/BashGuide/TheBasics/InputAndOutput#Redirection
An illustrated tutorial: http://bash-hackers.org/wiki/doku.php/howto/redirection_tutorial
And some more info: http://bash-hackers.org/wiki/doku.php/syntax/redirection
And by the way; don't use kill -9.
I also feel obligated to comment on your:
ps ax | grep sleep | grep -v grep | awk '{ print $1 } ' | xargs kill -9
This will scorch the eyes of any UNIX/Linux user with a clue. Moreover, every time you parse ps, a fairy dies. Do this:
kill $!
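Applied to the script from the question, a sketch that combines $! with the { ... & } 2>/dev/null construct from the answer above:
echo "hello"
{ sleep 100 & } 2>/dev/null   # background it with the shell's kill notice silenced
kill "$!"                     # signal the child we just started; no ps parsing needed
echo "bye"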
Even tools such as pgrep are essentially broken by design. While they do a better job of matching processes, the fundamental flaws are still there:
Race: By the time you get a PID as output, parse it back in, and use it for something else, the PID might already have disappeared or even been reused by a completely unrelated process.
Responsibility: In the UNIX process model, it is the responsibility of a parent to manage its child, nobody else should. A parent should keep its child's PID if it wants to be able to signal it and only the parent can reliably do so. UNIX kernels have been designed with the assumption that user programs will adhere to this pattern, not violate it.
How about disown? This mostly works for me on Bash on Linux.
echo "hello"
sleep 100 &
disown
ps ax | grep sleep | grep -v grep | awk '{ print $1 } ' | xargs kill -9
echo "bye"
Edit: Matched the poster's code better.
The message is real. The code killed the grep process as well.
Run ps ax | grep sleep and you should see your grep process on the list.
What I usually do in this case is ps ax | grep sleep | grep -v grep
EDIT: This is an answer to an older form of the question, where the author omitted the exclusion of grep from the kill sequence. I hope I still get some rep for answering the first half.
Yet another way to disable job termination messages is to put your command to be backgrounded in a sh -c 'cmd &' construct.
And as already pointed out, there is no need to imitate pkill; you may store the value of $! in another variable instead.
echo "hello"
sleep_pid=`sh -c 'sleep 30 & echo ${!}' | head -1`
#sleep_pid=`sh -c '(exec 1>&-; exec sleep 30) & echo ${!}'`
echo kill $sleep_pid
kill $sleep_pid
echo "bye"
Have you tried deactivating job control? It's a non-interactive shell, so I would guess it's off by default, but it does not hurt to try... It's controlled by the -m (monitor) shell option (set -m / set +m).

Resources