How to make gnu-parallel split multiple input files

I have a script which takes three arguments and is run like this:
myscript.sh input1.fa input2.fa out.txt
The script reads one line each from input1.fa and input2.fa, does some comparison, and writes the result to out.txt. The two inputs are required to have the same number of lines, and out.txt will also have the same number of lines after the script finishes.
Is it possible to parallelize this using GNU parallel?
I do not care that the output has a different order from the inputs, but I do need to compare the ith line of input1.fa with the ith line of input2.fa. Also, it is acceptable if I get multiple output files (like output{#}) instead of one -- I'll just cat them together.
I found this topic, but the answer wasn't quite what I wanted.
I know I can split the two input files and process them in parallel in pairs using xargs, but would like to do this in one line if possible...

If you can change myscript.sh so that it reads from a pipe and writes to a pipe, you can do:
paste input1.fa input2.fa | parallel --pipe myscript.sh > out.txt
In myscript.sh you will then need to read from STDIN and split each line on TAB to recover the corresponding lines from input1.fa and input2.fa.
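A minimal sketch of what such a rewritten myscript.sh could look like (the comparison step is only a placeholder; it assumes paste's default TAB delimiter and writes its result to STDOUT so parallel can collect it):
#!/bin/bash
# Each incoming line is "line-from-input1<TAB>line-from-input2", as produced by paste.
while IFS=$'\t' read -r line1 line2; do
    # ... replace this with the real comparison of "$line1" and "$line2" ...
    printf '%s\t%s\tresult-placeholder\n' "$line1" "$line2"
done
With the script reading STDIN and writing STDOUT like this, the one-liner above splits the pasted stream into chunks, runs one copy of the script per chunk, and concatenates the results into out.txt.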

Related

Streaming split

I am trying to split the output of a program into smaller files. This is a long-running program that prints its output to stderr and I'd like to capture the logs in a series of smaller files rather than in one gigantic file. So what I have is:
program 2>&1 | split -l100 &
... but to my dismay I found that split doesn't actually write any files to disk until its input ends. What I want is a tool that copies its input to the output files as it arrives, without waiting for the source stream to finish. I've also tried split's -u option, but it only seems to take effect together with -n, and -n doesn't really apply in my case because the number of generated files could be arbitrarily high. Is there a Unix tool that might let me do this?
Barmar's suggestion to add a call to fflush() after every iteration in the awk script worked for me. This was preferable to calling close() on each file once it is done, since that would only flush when each file is full, whereas I wanted line-buffered behavior. I also had to make the program's output line-buffered, so in the end the command looks like this:
stdbuf -oL -eL command 2>&1 | awk -v number=1 '++i>1000 {++number; i=0} {print > "file" number; fflush("file" number)}'

Creating a fake file that shows the output of a program

I am trying to "stich" two programs together. The first program, which I can change as I want, generates an output with some data. The second program cannot be changed, and expects to read the data that is generated by the first program.
This second program exects a file, I cannot use a pipe. I don't want to regenerate the file every x seconds.
Is there a way on linux to create a "fake" file that fetches the first program output every time it's opened for reading? This would be transparent to the second program. Is it doable with fuse?
If you're using bash, you can use process substitution:
program2 <(program1)
If you're not using a shell with process substitution, you can use a named pipe.
mkfifo /tmp/pipe
program1 > /tmp/pipe &
program2 /tmp/pipe
Many programs that require a filename argument for their input also allow that filename to be -, which they interpret to mean standard input. This allows you to pipe to them:
program1 | program2 -

Is there some way in Linux to create a "virtual file" that is the concatenation of two files?

I have two data sets that I want to run a program on. I want to compare the results to running the analysis on each set individually, and on the combined data. Since they're large files, I'd prefer not to create a whole new file that's the two data sets concatenated, doubling the disk space required. Is there some way in Linux I can create something like a symbolic link, but that points to two files, so that when it's read it will read two other files in sequence, as if they're concatenated into one file? I'm guessing probably not, but I thought I'd check.
Can your program read from standard input?
cat file-1 file-2 | yourprogram
If your program can only read from a file that is named on the command line, then this might work:
yourprogram <(cat file-1 file-2)
I think you need to be running the /bin/bash shell for the second example to work. The shell replaces <(foobar) with the name of a named pipe that your program can open and read like a file. The shell runs the foobar command in another process, and sends its output into the other end of the pipe.
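You can see what the shell substitutes in by echoing it; the exact path varies by system (on Linux it is typically a /dev/fd entry):
echo <(cat file-1 file-2)
# prints something like: /dev/fd/63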

How to split a large variable?

I'm working with large variables and it can be very slow to "loop" through them with while read line; I found that the smaller the variable, the faster it works.
How can I split large variable into smaller variables and then read them one by one?
For example, this is what I would like to achieve:
bigVar=$(echo "$bigVar" | split_var)
for var in "${bigVar[@]}"; do
    while read line; do
        ...
    done <<< "${var}"
done
or maybe split it into bigVar1, bigVar2, bigVar3, etc., and then read them one by one.
Instead of doing
bigVar=$(someCommand)
while read line
do
...
done <<< "$bigVar"
Use
while read line
do
...
done < <(someCommand)
This way, you avoid the problem with big variables entirely, and someCommand can output gigabyte after gigabyte with no problem.
If the reason you put it in a variable was to do work in multiple steps on it, rewrite it as a pipeline.
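For example (a purely illustrative pipeline; someCommand, the grep/sort steps and result.txt are placeholders), instead of capturing everything in a variable and post-processing it in several passes, chain the steps directly:
someCommand \
  | grep -v '^#' \
  | sort \
  | while read -r line; do
        # ... per-line work goes here ...
        printf '%s\n' "$line"
    done > result.txt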
If bigVar is made of words, you could use xargs to split it into lines no longer than the maximum length of a command line, usually 32 kB or 64 kB:
someCommand | xargs | while read line
do
...
done
In this case xargs uses its default command, which is echo.
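A quick illustration of that default behaviour (the input words are made up):
printf '%s\n' one two three four | xargs
# prints: one two three four   (the words re-joined onto one line, up to the length limit)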
I'm curious about what you want to do in the while loop, as it may be optimized with a pipeline.

Is there an efficient way to read line input in bash?

I want to split large, compressed CSV files into multiple smaller gzip files, split on line boundaries.
I'm trying to pipe gunzip output to a bash script with a while read LINE loop. That script writes to a named pipe where a background gzip process recompresses it. Every X characters read, I close the FD and start a new gzip process for the next split.
But in this scenario the script, with while read LINE, consumes 90% of the CPU because read is so inefficient here (I understand that it makes a system call to read one character at a time).
Any thoughts on doing this efficiently? I would expect gzip to consume the majority of the CPU.
Use split with the -l option to specify how many lines you want per output file. With the --filter option, $FILE is the name split would have used for that output file (the filter string has to be in single quotes so the shell does not expand $FILE too early):
zcat doc.gz | split -l 1000 --filter='gzip > $FILE.gz'
If you need any additional processing, just write a script that accepts the filename as an argument and processes standard input accordingly, and use that instead of plain gzip.
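A minimal sketch of such a filter script (the name filter.sh and the tr step are only illustrative assumptions; substitute your real processing):
#!/bin/bash
# filter.sh -- the chunk's lines arrive on standard input;
# $1 is the output name that split passes in via --filter.
tr ',' '\t' | gzip > "$1.gz"
Invoked as: zcat doc.gz | split -l 1000 --filter='./filter.sh $FILE' (the single quotes again keep $FILE from being expanded before split runs the filter).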
How about using the split command with the -l option?
gzcat large.csv.gz | split -l 1000 - xxx
gzip xxx*
