GNU parallel - keep output colored - colors

I'm parallelizing some zero-argument commands (scripts/whatever) which have colored output, but when parallel prints the output it's colorless (unless I use the -u option, but then it's unordered).
Is there a way to change that?
The line I'm using (illustration):
echo "script1 script2 script3" | tr " " "\n" | parallel -j3 'echo {}":\n\n"; eval {}'
BTW, I'm using a local version of GNU parallel, but it's supposed to be more or less the same.
Thanks

The reason is that your command-line tools detect that they are not printing to a terminal (GNU Parallel saves the output into temporary files before printing it to the terminal). For some tools you can force colors even when the output is not a terminal:
parallel 'echo {} | grep --color=always o' ::: joe
You can ask GNU Parallel to give the script the tty:
parallel --tty -j+0 'echo {} | grep o' ::: joe
--tty defaults to -j1, so you have to override that explicitly. It also has the problem that GNU Parallel cannot kill the jobs. This will run for 10 seconds:
parallel --tty --timeout 5 sleep ::: 10
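If your tools have such a force-color option, you can combine it with GNU Parallel's normal (ordered, non --tty) output. A hedged sketch with two common GNU tools, where the directory names and PATTERN are just placeholders:
parallel -j3 'ls --color=always {}; grep --color=always -r PATTERN {}' ::: dir1 dir2 dir3
The color escape sequences then end up in the temporary files and survive being replayed to the terminal in order.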

Related

Why does xtrace show piped commands executing out of order?

Here's a simple reproducer:
cat >sample_pipeline.sh << EOF
set -x
head /dev/urandom | awk '{\$2=\$3=""; print \$0}' | column -t | grep 123 | wc -l >/dev/null
EOF
watch -g -d -n 0.1 'bash sample_pipeline.sh 2>&1 | tee -a /tmp/watchout'
wc -l /tmp/watchout
tail /tmp/watchout
As expected, usually the commands execute in the order they are written in:
+ head /dev/urandom
+ awk '{$2=$3=""; print $0}'
+ column -t
+ grep 123
+ wc -l
...but some of the time the order is different, e.g. awk before head:
+ awk '{$2=$3=""; print $0}'
+ head /dev/urandom
+ column -t
+ grep 123
+ wc -l
I can understand if the shell pre-spawns processes waiting for input, but why wouldn't it spawn them in order?
Replacing bash with dash (the default /bin/sh on Ubuntu) seems to show the same behavior.
When you write a pipeline like that, the shell will fork off a bunch of child processes to run all of the subcommands. Each child process will then print the command it is executing just before it calls exec. Since each child is an independent process, the OS might schedule them in any order, and various things going on (cache misses/thrashing between CPU cores) might delay some children and not others. So the order the messages come out is unpredictable.
This happens because of how pipelines are implemented. The shell will first fork N subshells, and each subshell will (when scheduled) print out its xtrace output and invoke its command. The order of the output is therefore the result of a race.
You can see a similar effect with
for i in {1..5}; do echo "$i" & done
Even though each echo command is spawned in order, the output may still be 3 4 5 2 1.
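A related sketch that makes the race visible for a pipeline (bash-specific: $BASHPID expands to the PID of the subshell running each stage, so you can see that both stages start as independent processes):
{ echo "left stage is PID $BASHPID" >&2; } | { echo "right stage is PID $BASHPID" >&2; cat >/dev/null; }
Run it a number of times and the two announcements can appear in either order, for the same reason the xtrace lines do.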

Running a regression with shell script and make utility

I want to run a regression with a shell script which should start each test via make command. The following is a simple version of my script:
#!/bin/sh
testlist="testlist.txt"
while read line;
do
test_name=$(echo $line | awk '{print $1}')
program_path=$(echo $line | awk '{print $2}')
make sim TEST_NAME=$test_name PROGRAM_PATH=$program_path
done < "$testlist"
The problem with the above script is that when the make command starts a program, the script goes on to the next iteration without waiting for that program to complete, and reads the next line from the file.
Is there any option in the make utility to make sure it waits for the completion of the program? Or maybe I'm missing something else.
This is the related part of Makefile:
sim:
	vsim -i -novopt \
	-L $(QUESTA_HOME)/uvm-1.1d \
	-L questa_mvc_lib \
	-do "add wave top/AL_0/*;log -r /*" \
	-G/path=$(PROGRAM_PATH) \
	+UVM_TESTNAME=$(TEST_NAME) +UVM_VERBOSITY=UVM_MEDIUM -sv_seed random
As the OP stated in a comment, vsim forks a few other processes that are still running after vsim itself has finished, so he needs to wait until those processes are finished as well.
I emulated your vsim command with a script that forks some sleep processes:
#!/bin/bash
forksome() {
sleep 3&
sleep 2&
sleep 5&
}
echo "Forking some sleep"
forksome
I made a Makefile that shows your problem with make normal.
When you know which processes are forked, you can build a solution, as demonstrated with make new.
normal: sim
	ps -f

new: sim mywait
	ps -f

sim:
	./vsim

mywait:
	echo "Waiting for process(es):"
	while pgrep sleep; do sleep 1; done
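Translated back to the regression script, the same wait can be put directly after the make call. A hedged sketch where "vsimk" is only a placeholder for whatever process name vsim actually leaves running; adjust the pgrep pattern to match yours:
#!/bin/sh
testlist="testlist.txt"
while read -r test_name program_path _
do
    make sim TEST_NAME="$test_name" PROGRAM_PATH="$program_path"
    # block until the processes forked by vsim have exited
    while pgrep -f vsimk >/dev/null; do sleep 1; done
done < "$testlist"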

Does there exist a dummy command that doesn't change input and output?

Does there exist a dummy command in Linux that doesn't change the input and output, and that takes arguments which don't do anything?
cmd1 | dummy --tag="some unique string here" | cmd2
My interest in such a dummy command is to be able to identify the processes when I have multiple cmd1 | cmd2 pipelines running.
You can use a variable for that:
cmd1 | USELESS_VAR="some unique string here" cmd2
This has the advantage, which Jonathan commented on, of not passing all the data through an extra process.
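If you later need to find the tagged pipeline again, the variable is visible in cmd2's environment. A hedged sketch assuming a Linux /proc filesystem (you can only read the environ of your own processes):
# prints /proc/<pid>/environ for every process whose environment contains the tag
grep -l "USELESS_VAR=some unique string here" /proc/[0-9]*/environ 2>/dev/null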
To create the command that you want, make a file called dummy with the contents:
#!/bin/sh
cat
Place the file in your PATH and make it executable (chmod a+x dummy). Because dummy ignores its arguments, you may place any argument on the command line that you choose. Because the body of dummy runs cat (with no arguments), dummy will echo stdin to stdout.
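Used with the pipeline from the question, the tag then shows up in dummy's argument list, so it can be found in the process table. A small usage sketch (note that the shell strips the quotes, so search for the bare string):
cmd1 | dummy --tag="some unique string here" | cmd2
# from another terminal: print the PID(s) of the tagged dummy process
pgrep -f 'some unique string here'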
Alternative
My interest in such a dummy command is to be able to identify the processes when I have multiple cmd1 | cmd2 pipelines running.
As an alternative, consider
pgrep cmd1
It will show a process ID for every copy of cmd1 that is running. As an example:
$ pgrep getty
3239
4866
4867
4893
This shows that four copies of getty are running.
I'm not quite clear on your use case, but if you want to be able to "tag" different invocations of a command for the convenience of a sysadmin looking at a ps listing, you might be able to rename argv[0] to something suitably informative:
$:- perl -E 'exec {shift} "FOO", @ARGV' sleep 60 &
$:- perl -E 'exec {shift} "BAR", @ARGV' sleep 60 &
$:- perl -E 'exec {shift} "BAZ", @ARGV' sleep 60 &
$:- ps -o cmd=
-bash
FOO 60
BAR 60
BAZ 60
ps -o cmd=
You don't have to use perl's exec. There are other conveniences for this, too, such as DJB's argv0, RedHat's doexec, and bash itself.
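For example, bash's exec builtin takes an -a flag that sets argv[0] for the program it runs; a minimal sketch:
bash -c 'exec -a FOO sleep 60' &
ps -o cmd=          # the background job now shows up as "FOO 60"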
You should find a better way of identifying the processes, but anyway you can use:
ln -s /dev/null /tmp/some-unique-file
cmd1 | tee /tmp/some-unique-file | cmd2
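Because the unique filename then appears on tee's command line, the pipeline can be located afterwards with, for example:
pgrep -f /tmp/some-unique-file     # matches the "tee /tmp/some-unique-file" process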

How can I use a pipe or redirect in a qsub command?

There are some commands I'd like to run on a grid using qsub (SGE 8.1.3, CentOS 5.9) that need to use a pipe (|) or a redirect (>). For example, let's say I have to parallelize the command
echo 'hello world' > hello.txt
(Obviously a simplified example: in reality I might need to redirect the output of a program like bowtie directly to samtools). If I did:
qsub echo 'hello world' > hello.txt
the resulting content of hello.txt would look like
Your job 123454321 ("echo") has been submitted
Similarly if I used a pipe (echo "hello world" | myprogram), that message is all that would be passed to myprogram, not the actual stdout.
I'm aware I could write small bash scripts that each contain the command with the pipe/redirect, and then do qsub ./myscript.sh. However, I'm trying to run many parallelized jobs at the same time using a script, so I'd have to write many such bash scripts, each with a slightly different command. When scripting, this solution can start to feel very hackish. An example of such a script in Python:
import os

for i, (infile1, infile2, outfile) in enumerate(files):
    command = ("bowtie -S %s %s | " +
               "samtools view -bS - > %s\n") % (infile1, infile2, outfile)
    script = "job" + str(i) + ".sh"
    open(script, "w").write(command)
    os.system("chmod 755 %s" % script)
    os.system("qsub -cwd ./%s" % script)
This is frustrating for a few reasons, among them that my program can't even delete the many jobXX.sh scripts afterwards to clean up after itself, since I don't know how long the job will be waiting in the queue, and the script has to be there when the job starts.
Is there a way to provide my full echo 'hello world' > hello.txt command to qsub without having to create another file containing the command?
You can do this by turning it into a bash -c command, which lets you put the | in a quoted statement:
qsub bash -c "cmd <options> | cmd2 <options>"
As @spuder has noted in the comments, it seems that in other versions of qsub (not SGE 8.1.3, which I'm using), one can solve the problem with:
echo "cmd <options> | cmd2 <options>" | qsub
as well.
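Applied to the bowtie/samtools case from the question, that would look roughly like the line below; the file names are placeholders, -cwd and -b y are optional depending on your installation, and quoting behavior can vary between grid engine versions, so treat this as a sketch rather than a guaranteed recipe:
qsub -cwd -b y bash -c "bowtie -S reads_1.fq reads_2.fq | samtools view -bS - > out.bam"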
Although my answer is a bit late, I am adding it for any incoming viewers. To use a pipe/redirect and submit that as a qsub job, you need to do a couple of things. But first, using qsub at the end of a pipe like you're doing will only result in one job being sent to the queue (i.e. your code will run serially rather than be parallelized).
Run qsub with binary mode enabled, since by default qsub expects a script rather than a command line. For that you use the "-b y" flag to qsub, and you'll avoid errors of the sort "command required for a binary mode" or "script length does not match declared length".
echo each call to qsub and then pipe that to a shell.
Suppose you have a file params-query.txt which holds several bowtie commands and piped calls to samtools of the following form:
bowtie -q query -1 param1 -2 param2 ... | samtools ...
To send each query as a separate job, first prepare your command-line units from STDIN using xargs. Notice that the quotes around the braces are important if you are submitting a command made of piped parts; that way your entire query is treated as a single unit.
cat params-query.txt | xargs -i echo qsub -b y -o output_log -e error_log -N job_name \"{}\" | sh
If that doesn't work as expected, then you're probably better off having bowtie write an intermediate output and calling samtools on that intermediate file. You won't need to change the qsub call through xargs, but the lines in params-query.txt should look like:
bowtie -q query -o intermediate_query_out -1 param1 -2 param2 && samtools read_from_intermediate_query_out
This page has interesting qsub tricks you might like:
grep http *.job | awk -F: '{print $1}' | sort -u | xargs -I {} qsub {}

KSH: constraint the number of thread that can run at one time

I have a script that loops, and each iteration invokes a thread that runs in the background, like below:
xn_run_process.sh
...
for each in `ls ${INPUT_DIR}/MDX*.txt`
do
    java -Xms256m -Xmx1024m -cp ${CLASSPATH} com.wf.xn.etcc.Main -config=${CONFIG_FILE}
    ...
    for SCALE_PDF in `ls ${PROCESS_DIR}/*.pdf`
    do
        OUTPUT_AFP=${OUTPUT_DIR}/`basename ${SCALE_PDF}`
        OUTPUT_AFP=`print ${OUTPUT_AFP} | sed s/pdf/afp/g`
        ${PROJ_DIR}/myscript.sh -i ${SCALE_PDF} -o ${OUTPUT_AFP} &
        sleep 30
    done
done
When I wrote this, I thought only about 5 instances of myscript.sh would be executing concurrently at any one time; however, things have changed, and this list now launches 30 threads, each running a quite heavy process. How do I constrain the number of concurrent processes to 5?
While this is possible in pure shell scripting, the easiest approach would be using a parallelization tool like GNU parallel or GNU make. Makefile example:
SOURCES = ${SOME_LIST}
STAMPS = $(SOURCES:=.did-run-stamp)

all : $(STAMPS)

%.did-run-stamp : %
	/full/path/myscript.sh -f $<
and then calling make as make -j 5.
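For completeness, the pure-shell route mentioned above can also be done without any extra tools by launching the jobs in batches of 5 and calling wait between batches. A hedged sketch using the asker's variable names (cruder than a rolling limit of 5, but plain POSIX sh):
i=0
for SCALE_PDF in "${PROCESS_DIR}"/*.pdf
do
    OUTPUT_AFP="${OUTPUT_DIR}/$(basename "${SCALE_PDF}" .pdf).afp"
    "${PROJ_DIR}"/myscript.sh -i "${SCALE_PDF}" -o "${OUTPUT_AFP}" &
    i=$((i + 1))
    if [ "$i" -ge 5 ]; then
        wait    # let the current batch of 5 finish before starting more
        i=0
    fi
done
wait    # wait for the final, possibly smaller, batch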
Use GNU Parallel (adjust -j as you see fit; remove it to default to one job per CPU core):
for each in `ls ${INPUT_DIR}/MDX*.txt`
do
    java -Xms256m -Xmx1024m -cp ${CLASSPATH} com.wf.xn.etcc.Main -config=${CONFIG_FILE}
    ...
    for SCALE_PDF in `ls ${PROCESS_DIR}/*.pdf`
    do
        OUTPUT_AFP=${OUTPUT_DIR}/`basename ${SCALE_PDF}`
        OUTPUT_AFP=`print ${OUTPUT_AFP} | sed s/pdf/afp/g`
        sem --id myid -j 5 ${PROJ_DIR}/myscript.sh -i ${SCALE_PDF} -o ${OUTPUT_AFP}
    done
done
sem --wait --id myid
sem is part of GNU Parallel.
This will keep 5 jobs running until there are only 5 jobs left; then it will allow your java to run while the last 5 finish. The sem --wait will wait until those last 5 are finished, too.
Alternatively:
for each ...
java ...
...
ls ${PROCESS_DIR}/*.pdf |
parallel -j 5 ${PROJ_DIR}/myscript.sh -i {} -o ${OUTPUT_DIR}/{/.}.afp
done
This will run 5 jobs in parallel and only let java run when all the jobs are finished.
Alternatively you can use the queue trick described in GNU Parallel's man page: https://www.gnu.org/software/parallel/man.html#example__gnu_parallel_as_queue_system_batch_manager
echo >jobqueue; tail -f jobqueue | parallel -j5 &
for each ...
...
ls ${PROCESS_DIR}/*.pdf |
parallel echo ${PROJ_DIR}/myscript.sh -i {} -o ${OUTPUT_DIR}/{/.}.afp >> jobqueue
done
echo killall -TERM parallel >> jobqueue
wait
This will run java, then add the jobs to be run to a queue. After adding the jobs, java will be run again immediately. At all times 5 jobs will be run from the queue until the queue is empty.
You can install GNU Parallel simply by:
wget http://git.savannah.gnu.org/cgit/parallel.git/plain/src/parallel
chmod 755 parallel
cp parallel sem
Watch the intro videos to learn more: https://www.youtube.com/playlist?list=PL284C9FF2488BC6D1 and walk through the tutorial (man parallel_tutorial). Your command line will love you for it.
If you have ksh93, check whether JOBMAX is available:
JOBMAX
This variable defines the maximum number of background jobs that can run at a time. When this limit is reached, the shell will wait for a job to complete before starting a new job.
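A minimal ksh93 sketch of the inner loop using JOBMAX, assuming your ksh93 build supports it (variable names are the asker's):
#!/bin/ksh93
JOBMAX=5                                   # at most 5 background jobs at a time
for SCALE_PDF in "${PROCESS_DIR}"/*.pdf
do
    OUTPUT_AFP=${OUTPUT_DIR}/$(basename "${SCALE_PDF}" .pdf).afp
    "${PROJ_DIR}"/myscript.sh -i "${SCALE_PDF}" -o "${OUTPUT_AFP}" &
done
wait                                       # wait for the remaining jobs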
