Is there an efficient way to read line input in bash? - linux

I want to split large, compressed CSV files into multiple smaller gzip files, split on line boundaries.
I'm trying to pipe gunzip to a bash script with a while read LINE loop. That script writes to a named pipe where a background gzip process recompresses it. Every X characters read, I close the FD and start a new gzip process for the next split.
But in this scenario the script, with its while read LINE loop, consumes 90% of the CPU because read is so inefficient here (I understand that it makes a system call to read one character at a time).
Any thoughts on doing this efficiently? I would expect gzip to consume the majority of the CPU.

Use split with the -l option to specify how many lines you want, together with the --filter option; $FILE is the name split would have used for its output file (it has to be single-quoted to keep the shell from expanding it too early):
zcat doc.gz | split -l 1000 --filter='gzip > $FILE.gz'
If you need any additional processing, just write a script that accepts the filename as an argument and processes standard input accordingly, and use that instead of plain gzip.
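For example, a minimal filter script of that kind might look like this (the chunk-filter.sh name and the CSV header line are just illustrative assumptions, not something from the answer above):

#!/usr/bin/env bash
# $1 is the output name that split passes in via $FILE
# stdin is the chunk of lines for this split
{ echo "id,name,value"; cat; } | gzip > "$1.gz"

which you would then plug in as:

zcat doc.gz | split -l 1000 --filter='./chunk-filter.sh $FILE'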

How about using the split command with the -l option?
gzcat large.csv.gz | split -l 1000 - xxx
gzip xxx*

Related

Streaming split

I am trying to split the output of a program into smaller files. This is a long-running program that prints its output to stderr and I'd like to capture the logs in a series of smaller files rather than in one gigantic file. So what I have is:
program 2>&1 | split -l100 &
... but to my dismay I found that split doesn't actually write any files out to disk until its input ends. What I want is a streaming tool that copies its input to the output files as it arrives, without waiting for the source stream to end, which is unnecessary in my case. I've also tried split's -u option, but it doesn't seem to work unless you also choose the -n option, and that option doesn't really apply here because the number of generated files could be arbitrarily high. Is there a Unix tool that might let me do this?
Barmar's suggestion to add a call to fflush() after every iteration in the awk script worked for me. This was preferable to calling close() on each file when it's done, since that would only flush once each file is full, whereas I wanted line-buffered behavior. I also had to make the output pipe line-buffered, so the command ends up looking like this:
stdbuf -oL -eL command 2>&1 | awk -v number=1 '++i>1000 {++number; i=0} {print > "file" number; fflush("file" number)}'
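If you want to reassure yourself that the chunks really are appearing while the program is still running, something like this in a second terminal will do (assuming the file1, file2, ... names produced by the awk command above):

watch -n1 'wc -l file*'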

Is there some way in Linux to create a "virtual file" that is the concatenation of two files?

I have two data sets that I want to run a program on. I want to compare the results to running the analysis on each set individually, and on the combined data. Since they're large files, I'd prefer not to create a whole new file that's the two data sets concatenated, doubling the disk space required. Is there some way in Linux I can create something like a symbolic link, but that points to two files, so that when it's read it will read two other files in sequence, as if they're concatenated into one file? I'm guessing probably not, but I thought I'd check.
Can your program read from standard input?
cat file-1 file-2 | yourprogram
If your program can only read from a file that is named on the command line, then this might work:
yourprogram <(cat file-1 file-2)
I think you need to be running the /bin/bash shell for the second example to work. The shell replaces <(foobar) with the name of a named pipe that your program can open and read like a file. The shell runs the foobar command in another process and sends its output into the other end of the pipe.
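A quick way to see the mechanism, using wc as a stand-in for yourprogram (the file names are just placeholders):

wc -l <(cat file-1 file-2)

wc will report a name like /dev/fd/63 instead of a real file name -- that is the pipe bash substituted for <(...).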

When does the writer of a named pipe do its work?

I'm trying to understand how a named pipe behaves in terms of performance. Say I have a large file I am decompressing that I want to write to a named pipe (/tmp/data):
gzip --stdout -d data.gz > /tmp/data
and then, some time later, I run a program that reads from the pipe:
wc -l /tmp/data
When does gzip actually decompress the data, when I run the first command, or when I run the second and the reader attaches to the pipe? If the former, is the data stored on disk or in memory?
Pipes (named or otherwise) have only a very small buffer if any -- so if nothing is reading, then nothing (or very little) can be written.
In your example, gzip will do very little until wc is run, because before that point its attempts to write output will block. Out of the box there is no sizeable buffer, either on disk or in memory, though tools exist that will implement such a buffer for you should you want one -- see pv with its -B argument, or the no-longer-maintained (and, sadly, removed from Debian by folks who didn't understand its function) bfr.
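A minimal way to observe this yourself, using the same file names as in the question:

mkfifo /tmp/data                        # create the named pipe
gzip --stdout -d data.gz > /tmp/data &  # blocks almost immediately: nothing is reading yet
wc -l /tmp/data                         # gzip only makes real progress once this attaches

And if you do want a sizeable in-memory buffer between writer and reader, pv can provide one (the 64m buffer size here is just an example):

gzip --stdout -d data.gz | pv -q -B 64m > /tmp/data &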

How to split a large variable?

I'm working with large variables, and it can be very slow to "loop" through them with while read line; I found that the smaller the variable, the faster it works.
How can I split large variable into smaller variables and then read them one by one?
For example, what I would like to achieve:
bigVar=$(echo "$bigVar" | split_var)
for var in "${bigVar[#]}"; do
while read line; do
...
done <<< "${var}"
done
or maybe split it into bigVar1, bigVar2, bigVar3, etc., and then read them one by one.
Instead of doing
bigVar=$(someCommand)
while read line
do
...
done <<< "$bigVar"
Use
while read line
do
...
done < <(someCommand)
This way, you avoid the problem with big variables entirely, and someCommand can output gigabyte after gigabyte with no problem.
If the reason you put it in a variable was to do work in multiple steps on it, rewrite it as a pipeline.
If bigVar is made of words, you could use xargs to split it into lines no longer than the maximum length of a command line, usually 32 KB or 64 KB:
someCommand | xargs | while read line
do
...
done
In this case xargs uses its default command, which is echo.
I'm curious about what you want to do in the while loop, as it may be optimized with a pipeline.
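For instance, if the body of the loop only filters lines or pulls out a field, both the loop and the variable can disappear entirely; a sketch, assuming a hypothetical someCommand and that you only want the second field of lines matching a pattern:

someCommand | awk '/pattern/ { print $2 }'

Everything stays streaming, so memory use remains flat no matter how much someCommand outputs.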

How to make gnu-parallel split multiple input files

I have a script which takes three arguments and is run like this:
myscript.sh input1.fa input2.fa out.txt
The script reads one line each from input1.fa and input2.fa, does some comparison, and writes the result to out.txt. The two inputs are required to have the same number of lines, and out.txt will also have the same number of lines after the script finishes.
Is it possible to parallelize this using GNU parallel?
I do not care that the output has a different order from the inputs, but I do need to compare the ith line of input1.fa with the ith line of input2.fa. Also, it is acceptable if I get multiple output files (like output{#}) instead of one -- I'll just cat them together.
I found this topic, but the answer wasn't quite what I wanted.
I know I can split the two input files and process them in parallel in pairs using xargs, but would like to do this in one line if possible...
If you can change myscript.sh so that it reads from a pipe and writes to a pipe, you can do:
paste input1.fa input2.fa | parallel --pipe myscript.sh > out.txt
So in myscript you will need to read from STDIN and split on TAB to get the input from input1 and input2.
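A minimal sketch of what that pipe-reading myscript.sh could look like (compare_lines is a purely hypothetical placeholder for whatever comparison the real script performs):

#!/usr/bin/env bash
# Each incoming line is "line-from-input1<TAB>line-from-input2", courtesy of paste.
while IFS=$'\t' read -r line1 line2; do
    compare_lines "$line1" "$line2"   # hypothetical: write the result to stdout
done

Since the script writes to standard output, parallel --pipe collects the results and the final redirection gathers everything into out.txt.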
