How can I use a pipe or redirect in a qsub command? - linux

There are some commands I'd like to run on a grid using qsub (SGE 8.1.3, CentOS 5.9) that need to use a pipe (|) or a redirect (>). For example, let's say I have to parallelize the command
echo 'hello world' > hello.txt
(Obviously a simplified example: in reality I might need to redirect the output of a program like bowtie directly to samtools). If I did:
qsub echo 'hello world' > hello.txt
the resulting content of hello.txt would look like
Your job 123454321 ("echo") has been submitted
Similarly if I used a pipe (echo "hello world" | myprogram), that message is all that would be passed to myprogram, not the actual stdout.
I'm aware I could write small bash scripts that each contain the command with the pipe/redirect, and then do qsub ./myscript.sh. However, I'm trying to run many parallelized jobs at the same time using a script, so I'd have to write many such bash scripts, each with a slightly different command. When scripted, this solution starts to feel very hackish. An example of such a script in Python:
import os

for i, (infile1, infile2, outfile) in enumerate(files):
    command = ("bowtie -S %s %s | " +
               "samtools view -bS - > %s\n") % (infile1, infile2, outfile)
    script = "job" + str(i) + ".sh"
    with open(script, "w") as f:  # close the file so it is fully written before qsub runs
        f.write(command)
    os.system("chmod 755 %s" % script)
    os.system("qsub -cwd ./%s" % script)
This is frustrating for a few reasons, among them that my program can't even delete the many jobXX.sh scripts afterwards to clean up after itself, since I don't know how long the job will be waiting in the queue, and the script has to be there when the job starts.
Is there a way to provide my full echo 'hello world' > hello.txt command to qsub without having to create another file containing the command?

You can do this by turning it into a bash -c command, which lets you put the | in a quoted statement:
qsub bash -c "cmd <options> | cmd2 <options>"
As @spuder has noted in the comments, it seems that in other versions of qsub (not SGE 8.1.3, which I'm using), one can solve the problem with:
echo "cmd <options> | cmd2 <options>" | qsub
as well.
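Applied to the question's example, the bash -c pattern might look like this (a sketch; the -b y flag, discussed in the next answer, tells SGE the argument is a command rather than a job script, and -cwd runs the job in the current directory):
qsub -b y -cwd bash -c "echo 'hello world' > hello.txt"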

Although my answer is a bit late, I am adding it for any incoming viewers. To use a pipe/redirect and submit that as a qsub job, you need to do a couple of things. But first, using qsub at the end of a pipe like you're doing will only result in one job being sent to the queue (i.e., your code will run serially rather than in parallel).
First, run qsub in binary mode, since by default qsub expects a job script rather than a command line. For that you use the "-b y" flag, and you'll avoid errors of the sort "command required for a binary mode" or "script length does not match declared length".
Second, echo each call to qsub and then pipe that to a shell.
Suppose you have a file params-query.txt which holds several bowtie commands piped into samtools calls, of the following form:
bowtie -q query -1 param1 -2 param2 ... | samtools ...
To send each query as a separate job, first build the per-job command lines from STDIN using xargs. Notice that the escaped quotes around the braces are important if you are submitting a command made of piped parts; that way your entire query is treated as a single unit.
cat params-query.txt | xargs -i echo qsub -b y -o output_log -e error_log -N job_name \"{}\" | sh
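For example, if params-query.txt holds two such lines, the stage before the final | sh emits qsub calls of roughly this form (you can inspect them by simply dropping the trailing | sh):
qsub -b y -o output_log -e error_log -N job_name "bowtie -q query1 -1 param1 -2 param2 ... | samtools ..."
qsub -b y -o output_log -e error_log -N job_name "bowtie -q query2 -1 param1 -2 param2 ... | samtools ..."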
If that doesn't work as expected, you're probably better off writing an intermediate output file from bowtie and then having samtools read that intermediate output. You won't need to change the qsub call through xargs, but the commands in params-query.txt should look like:
bowtie -q query -o intermediate_query_out -1 param1 -2 param2 && samtools read_from_intermediate_query_out
This page has interesting qsub tricks you might like

For example, to submit every .job file that mentions http (grep prints filename:line matches, awk keeps just the filename, and sort -u removes duplicates):
grep http *.job | awk -F: '{print $1}' | sort -u | xargs -I {} qsub {}

Related

Why does xtrace show piped commands executing out of order?

Here's a simple reproducer:
cat >sample_pipeline.sh << EOF
set -x
head /dev/urandom | awk '{\$2=\$3=""; print \$0}' | column -t | grep 123 | wc -l >/dev/null
EOF
watch -g -d -n 0.1 'bash sample_pipeline.sh 2>&1 | tee -a /tmp/watchout'
wc -l /tmp/watchout
tail /tmp/watchout
As expected, usually the commands execute in the order they are written in:
+ head /dev/urandom
+ awk '{$2=$3=""; print $0}'
+ column -t
+ grep 123
+ wc -l
...but some of the time the order is different, e.g. awk before head:
+ awk '{$2=$3=""; print $0}'
+ head /dev/urandom
+ column -t
+ grep 123
+ wc -l
I can understand if the shell pre-spawns processes waiting for input, but why wouldn't it spawn them in order?
Replacing bash with dash (the default /bin/sh on Ubuntu) seems to show the same behavior.
When you write a pipeline like that, the shell will fork off a bunch of child processes to run all of the subcommands. Each child process will then print the command it is executing just before it calls exec. Since each child is an independent process, the OS might schedule them in any order, and various things going on (cache misses/thrashing between CPU cores) might delay some children and not others. So the order the messages come out is unpredictable.
This happens because of how pipelines are implemented. The shell will first fork N subshells, and each subshell will (when scheduled) print its xtrace output and invoke its command. The order of the output is therefore the result of a race.
You can see a similar effect with
for i in {1..5}; do echo "$i" & done
Even though each echo command is spawned in order, the output may still be 3 4 5 2 1.
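A related sketch (not from the answers above): each stage of a pipeline reports its own $BASHPID, which shows that the stages really are separate, independently scheduled processes:
echo "writer PID: $BASHPID" | { cat; echo "reader PID: $BASHPID"; }
The two PIDs differ because each stage runs in its own subshell.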

How to run shell script commands in an sh file in parallel?

I'm trying to take backup of tables in my database server.
I have around 200 tables. I have a shell script that contains commands to take backups of each table like:
backup.sh
psql -u username ..... table1 ... file1;
psql -u username ..... table2 ... file2;
psql -u username ..... table3 ... file3;
I can run the script and create backups on my machine. But as there are 200 tables, it runs the commands sequentially and takes a lot of time.
I want to run the backup commands in parallel. I have seen articles wherein they suggested using && after each command, or using the nohup or wait commands.
But I don't want to edit the script and include around 200 such commands.
Is there any way to run this list of shell commands in parallel, something like Node.js does? Is it possible to do it? Or am I looking at it wrong?
Sample command in the script:
psql --host=somehost --port=5490 --username=user --dbname=db -c '\copy dbo.tablename TO "/home/username/Desktop/PostgresFiles/tablename.csv" with DELIMITER ","';
You can leverage xargs to run commands in parallel AND control the number of concurrent jobs. Running 200 backup jobs at once might overwhelm your database and result in less than optimal performance.
Assuming you have backup.sh with one backup command per line:
xargs -P5 -I{} bash -c "{}" < backup.sh
The commands in backup.sh should be modified to allow quoting (using single quote when possible, escaping double quote):
psql --host=somehost --port=5490 --username=user --dbname=db -c '\copy dbo.tablename TO \"/home/username/Desktop/PostgresFiles/tablename.csv\" with DELIMITER \",\"';
Here -P5 controls the number of concurrent jobs. This will be able to process command lines WITHOUT double quotes; for the above script, you would change "\copy ..." to '\copy ...'.
A simpler alternative is to use a helper script backup-table.sh, which takes two parameters (table, file), and use:
xargs -P5 -I{} backup-table.sh "{}" < tables.txt
And put all the complex quoting into the backup-table.sh
# Alternative using GNU parallel: an exported bash function, one backup job per table
# (the 'sql' helper used below is distributed with GNU parallel)
doit() {
    table=$1
    psql --host=somehost --port=5490 --username=user --dbname=db -c '\copy dbo.'$table' TO "/home/username/Desktop/PostgresFiles/'$table'.csv" with DELIMITER ","';
}
export -f doit
sql --listtables -n postgresql://user:pass@host:5490/db | parallel -j0 doit
Is there any logic in the script other than the individual commands (e.g., any ifs or processing of output)?
If it's just a file with a list of commands, you could write a wrapper for the script (or run a loop from the CLI), e.g.:
$ cat help.txt
echo 1
echo 2
echo 3
$ while read -r i;do bash -c "$i" &done < help.txt
[1] 18772
[2] 18773
[3] 18774
1
2
3
[1] Done bash -c "$i"
[2]- Done bash -c "$i"
[3]+ Done bash -c "$i"
$ while read -r i;do bash -c "$i" &done < help.txt
[1] 18820
[2] 18821
[3] 18822
2
3
1
[1] Done bash -c "$i"
[2]- Done bash -c "$i"
[3]+ Done bash -c "$i"
Each line of help.txt contains a command, and I run a loop that takes each command and runs it in a subshell. (This is a simple example where I just background each job. You could get more sophisticated using something like xargs -P or parallel, but this is a starting point.)
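For example, a sketch of the xargs variant mentioned above, which also caps how many jobs run at once (here at four; the number is arbitrary):
xargs -P4 -I{} bash -c "{}" < help.txt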

linux strace: How to filter system calls that take more than a second

I'm using "strace -f -T -tt -o foo.txt -p 1234" to print the time spent in system calls. This makes the output file huge, is there a way to just print the system calls that took greater than 1second. I can grep it out from the file later, but is there a better way?
If we simply omit the -o foo.txt argument, the trace output goes to standard error. We can redirect it to standard output, pipe it through grep, and redirect to the file:
strace -f -T -tt -p 1234 2>&1 | grep pattern > foo.txt
To watch the output at the same time:
strace -f -T -tt -p 1234 2>&1 | grep pattern | tee foo.txt
If a command prints only to a file that is passed as an argument, and we want to filter/redirect its output, the first step is to check whether it implements the dash convention: can you specify standard input or output using - as a filename argument:
some_command - | our_pipe > file.txt
If not, then the recourse is to use Bash's process substitution syntax: >(output command) and <(input command):
some_command >(our_pipe > file.txt)
The process substitution syntax expands into a token that is suitable as a filename argument for a command or function. When the program opens that token, it gets a file descriptor to the command's input or output, depending on direction.
With process substitution, we can redirect the input or output of stubborn programs that work only with files passed by name as arguments, and which do not support any convention for requesting that standard input or output be used in place of a file.
The token used by process substitution is platform-dependent; we can see what it is using echo. For instance on GNU/Linux, Bash takes advantage of the /dev/fd operating system feature:
$ echo <(ls -l)
/dev/fd/63
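Applied back to the strace question, the same trick lets you keep -o while filtering on the fly (a sketch; it assumes your strace accepts any writable path for -o, which the /dev/fd token satisfies):
strace -f -T -tt -o >(grep pattern > foo.txt) -p 1234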
You can use the following command:
strace -T command 2>&1 >/dev/null | awk '{gsub(/[<>]/,"",$NF)}$NF+.0>1.0'
Explanation:
strace -T adds the time spent in each syscall at the end of the line, enclosed in <...>
2>&1 >/dev/null | awk pipes stderr to awk while discarding the command's own stdout. (strace writes its output to stderr!)
The awk command removes the <> from the last field $NF and prints lines where the time spent is higher than a second.
Probably you'll also want to pass the threshold as a variable to the awk command:
strace -T command 2>&1 >/dev/null \
| awk -v thres=0.001 '{gsub(/[<>]/,"",$NF)}$NF+.0>thres+.0'

/usr/bin/time only timing first component of pipeline

I am trying to time the following command and eventually want to output the result to a file (hence the use of /usr/bin/time). However, the command being timed is always seq 1 2 rather than the entire pipeline:
/usr/bin/time -v seq 1 2 | parallel ssh-keygen -G /folder/{}.candidate -b 768
Is there a way to make sure the whole command is timed and not just the first little part?
If you are using bash it is easier to use the bash built-in:
$ time cmd1 | cmd2 | cmd3
will time the entire pipe.
If you really want to use the external time command, you have to run the pipe in a sub-shell:
$ /usr/bin/time sh -c 'cmd1 | cmd2 | cmd3'
or else you'd be piping the time instead of timing the pipe.
This syntax also has the nice side effect that it is easy to separate the output of the command and that of time:
$ /usr/bin/time sh -c 'cmd1 | cmd2 > cmdout.txt 2> cmderr.txt' 2>> time.txt
However, it introduces an extra difficulty with quoting: if your command itself contains quotes, you may need to escape them in hard-to-read ways.
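Applied to the question's pipeline, the quoting works out with plain single quotes, since the {} placeholder is meant for parallel rather than the outer shell (a sketch using the asker's paths and options):
/usr/bin/time -v sh -c 'seq 1 2 | parallel ssh-keygen -G /folder/{}.candidate -b 768'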

Does there exist a dummy command that doesn't change input and output?

Does there exist a dummy command in Linux that doesn't change the input and output, and takes arguments that don't do anything?
cmd1 | dummy --tag="some unique string here" | cmd2
My interest in such a dummy command is to be able to identify the processes when I have multiple cmd1 | cmd2 pipelines running.
You can use a variable for that:
cmd1 | USELESS_VAR="some unique string here" cmd2
This has the advantage, which Jonathan pointed out in the comments, of not performing an extra copy of all the data.
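The tag then lives in cmd2's environment rather than on its command line, so on Linux you can locate it later with something like the following (a sketch; reading another user's /proc/PID/environ requires appropriate permissions):
grep -lz 'USELESS_VAR=some unique string here' /proc/[0-9]*/environ 2>/dev/null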
To create the command that you want, make a file called dummy with the contents:
#!/bin/sh
cat
Place the file in your PATH and make it executable (chmod a+x dummy). Because dummy ignores its arguments, you may place any argument on the command line that you choose. Because the body of dummy runs cat (with no arguments), dummy will echo stdin to stdout.
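With such a dummy in place, the tag shows up in the process listing, so a particular pipeline can be picked out by matching the full command line (a sketch; pgrep -a needs a reasonably recent procps):
pgrep -af 'dummy --tag=some unique string here'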
Alternative
My interest in such a dummy command is to be able to identify the processes when I have multiple cmd1 | cmd2 pipelines running.
As an alternative, consider
pgrep cmd1
It will show a process ID for every copy of cmd1 that is running. As an example:
$ pgrep getty
3239
4866
4867
4893
This shows that four copies of getty are running.
I'm not quite clear on your use case, but if you want to be able to "tag" different invocations of a command for the convenience of a sysadmin looking at a ps listing, you might be able to rename argv[0] to something suitably informative:
$:- perl -E 'exec {shift} "FOO", #ARGV' sleep 60 &
$:- perl -E 'exec {shift} "BAR", #ARGV' sleep 60 &
$:- perl -E 'exec {shift} "BAZ", #ARGV' sleep 60 &
$:- ps -o cmd=
-bash
FOO 60
BAR 60
BAZ 60
You don't have to use perl's exec. There are other conveniences for this, too, such as DJB's argv0, RedHat's doexec, and bash itself.
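For instance, bash's version of the trick is exec -a, which sets argv[0] for the program it runs (a sketch; FOO and sleep 60 are purely illustrative):
bash -c 'exec -a FOO sleep 60' &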
You should find a better way of identifying the processes, but anyway, you can use:
ln -s /dev/null /tmp/some-unique-file
cmd1 | tee /tmp/some-unique-file | cmd2
