Merge sort gzipped files - linux

I have 40 files of 2GB each, stored on an NFS architecture. Each file contains two columns: a numeric id and a text field. Each file is already sorted and gzipped.
How can I merge all of these files so that the resulting output is also sorted?
I know sort -m -k 1 should do the trick for uncompressed files, but I don't know how to do it directly with the compressed ones.
PS: I don't want the simple solution of uncompressing the files to disk, merging them, and compressing them again, as I don't have sufficient disk space for that.

This is a use case for process substitution. Say you have two files to sort, sorta.gz and sortb.gz. You can give the output of gunzip -c FILE.gz to sort for both of these files using the <(...) shell operator:
sort -m -k1 <(gunzip -c sorta.gz) <(gunzip -c sortb.gz) >sorted
Process substitution substitutes a command with a file name that represents the output of that command, and is typically implemented with either a named pipe or a /dev/fd/... special file.
For 40 files, you will want to create the command with that many process substitutions dynamically, and use eval to execute it:
cmd="sort -m -k1 "
for input in file1.gz file2.gz file3.gz ...; do
cmd="$cmd <(gunzip -c '$input')"
done
eval "$cmd" >sorted # or eval "$cmd" | gzip -c > sorted.gz

#!/bin/bash
FILES=file*.gz        # list of your 40 gzip files
                      # (e.g. file1.gz ... file40.gz)
WORK1="merged.gz"     # first temp file and the final file
WORK2="tempfile.gz"   # second temp file

> "$WORK1"            # create empty final file
> "$WORK2"            # create empty temp file
gzip -qc "$WORK2" > "$WORK1"   # compress the empty second file into the first
                               # temp file, so it is a valid (empty) gzip stream

for I in $FILES; do
    echo "current file: $I"
    sort -k 1 -m <(gunzip -c "$I") <(gunzip -c "$WORK1") | gzip -c > "$WORK2"
    mv "$WORK2" "$WORK1"
done
Fill $FILES most easily with bash globbing (file*.gz) or with a list of the 40 filenames separated by spaces. The files in $FILES stay unchanged.
In the end, the 80 GB of data are compressed in $WORK1. While this script runs, no uncompressed data is written to disk.
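Optionally (this check is my addition, not part of the script), you can verify that the merged result is still sorted, again without writing uncompressed data to disk:
zcat merged.gz | sort -c -k 1 && echo "merged.gz is sorted"   # sort -c only checks order; it reports the first out-of-order line, if any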

Adding a differently flavoured multi-file merge within a single pipeline: it takes all (pre-sorted) files in $OUT/uniques, sort-merges them, and compresses the output. lz4 is used for its speed:
find $OUT/uniques -name '*.lz4' |
awk '{print "<( <" $0 " lz4cat )"}' |
tr "\n" " " |
(echo -n sort -m -k3b -k2 " "; cat -; echo) |
bash |
lz4 \
> $OUT/uniques-merged.tsv.lz4
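For illustration (the input names a.lz4 and b.lz4 are made up), with two files the pipeline above assembles and hands bash a command roughly like this:
sort -m -k3b -k2 <( <$OUT/uniques/a.lz4 lz4cat ) <( <$OUT/uniques/b.lz4 lz4cat )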

It is true that there are zgrep and other common utilities that work with compressed files, but in this case you need to sort/merge the uncompressed data and then compress the result.

Related

How to pipe the output of `ls` to `mplayer`

I want to run mplayer on all files in a folder, sorted by size.
I tried the following commands
ls -1S folder | mplayer
ls -1S folder | xargs mplayer
ls -1S folder | xargs -print0 mplayer
but none of these works.
What's the right way to do it?
Don’t parse the output of ls.
Instead, use e.g. for to loop over the files and call stat to get the file sizes. To avoid issues with spaces or newlines in filenames, use zero-terminated strings to sort etc.:
for file in folder/*; do
    printf "%s %s\0" "$(stat -c %s "$file")" "$file"
done \
    | sort -z -n -k1,1 -t ' ' \
    | cut -z -f2- -d ' ' \
    | xargs -0 mplayer
To call mplayer individually for each file (rather than only once, passing all files as arguments), you'll need a while loop fed by the above. Unfortunately | doesn't combine well with while here (the loop would run in a subshell), so process substitution is used instead:
while IFS= read -r -d '' file; do
    mplayer "$file"
done < <(
    for file in folder/*; do
        printf "%s %s\0" "$(stat -c %s "$file")" "$file"
    done \
        | sort -z -n -k1,1 -t ' ' \
        | cut -z -f2- -d ' '
)
Note that the above is Bash code and uses GNU extensions; it works on Linux but won't work without changes on e.g. macOS (BSD cut has no -z flag, and stat -c %s needs to be changed to stat -f %z).
I created a Python script that does exactly what I want. Here is the complete Python code:
import os
import sys
import glob
import shlex

dir_name = sys.argv[1]

# Get a list of files (file paths) in the given directory
list_of_files = filter(os.path.isfile, glob.glob(dir_name + '/*'))

# Sort the list of files by size
list_of_files = sorted(list_of_files, key=lambda x: os.stat(x).st_size)

# Iterate from largest to smallest and print one mplayer command per file,
# quoting each filename for the shell
for elem in list_of_files[::-1]:
    print(f"mplayer {shlex.quote(elem)}")
Redirect the output to a file and execute it. And voilà: mplayer plays the files in order from largest to smallest.
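A usage sketch (the script name sort_by_size.py and the folder path are placeholders of my own):
python3 sort_by_size.py /path/to/folder > play.sh   # writes one mplayer command per file
bash play.sh                                        # plays them from largest to smallest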

How to create large file (require long compress time) on Linux

I am building a parallel job, so I'm trying to create a dummy file and compress it in the background, like this:
Create dummy file
for in ()
do
    Compress that file &
done
wait
I need to create the dummy data, so I tried
fallocate -l 1g test.txt
and
tar cfv test.txt
but the compression job finishes in just 5 seconds.
How can I create dummy data that is big and takes a long time to compress (3 to 5 minutes)?
There are two things going on here. The first is that tar won't compress anything unless you pass it the z flag (and name both the archive and the file to put in it) to trigger gzip compression:
tar czvf test.txt.tar.gz test.txt
For a very similar effect, you can invoke gzip directly:
gzip test.txt
The second issue is that with most compression schemes, a gigantic string of zeros, which is likely what you generate, is very easy to compress. You can fix that by supplying random data. On a Unix-like system you can use the pseudo-file /dev/urandom. This answer gives three options in decreasing order of preference, depending on what works:
A head that understands suffixes like G for gibibyte:
head -c 1G < /dev/urandom > test.txt
A head that needs the size spelled out:
head -c 1073741824 < /dev/urandom > test.txt
No head at all, so use dd, where file size is block size (bs) times count (1073741824 = 1024 * 1048576):
dd bs=1024 count=1048576 < /dev/urandom > test.txt
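To see why random data is the better choice, a quick comparison sketch (the file names are made up; GNU head is assumed for the size suffix):
head -c 100M /dev/zero    > zeros.bin    # highly compressible
head -c 100M /dev/urandom > random.bin   # essentially incompressible
gzip -c zeros.bin  > zeros.bin.gz
gzip -c random.bin > random.bin.gz
ls -l zeros.bin.gz random.bin.gz         # the zeros shrink to almost nothing; the random data barely shrinks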
Something like this may work; it uses some Bash-specific operators.
#!/bin/bash

function createCompressDelete()
{
    _rdmfile="$1"
    cat /dev/urandom > "$_rdmfile" &   # This writes to the file in the background
    pidcat=$!                          # Save the backgrounded pid for later use
    echo "createCompressDelete::$_rdmfile::pid[$pidcat]"
    sleep 2
    while [ -f "$_rdmfile" ]
    do
        fsize=$(du "$_rdmfile" | awk '{print $1}')
        if (( $fsize < (1024*1024) )); then   # Wait until the file reaches ~1G (du reports 1K blocks)
            sleep 10
            echo -n "...$fsize"
        else
            kill "$pidcat"                             # Kill the pid
            tar czvf "${_rdmfile}".tar.gz "$_rdmfile"  # compress
            rm -f "${_rdmfile}"                        # delete the created file
            rm -f "${_rdmfile}".tar.gz                 # delete the tarball
        fi
    done
}

# Run for any number of files
for i in file1 file2 file3 file4
do
    createCompressDelete "$i" &> "$i".log &   # run it in the background
done

Split a .gz file into multiple 1GB compressed(.gz) files

I have a 250GB gzipped file on Linux and I want to split it into 250 1GB files and compress the generated part files on the fly (as soon as one part is generated, it should be compressed).
I tried using this -
zcat file.gz | split -b 1G - file.gz.part
But this generates uncompressed files, and rightly so. I modified it to look like this, but got an error:
zcat file.gz | split -b 1G - file.gz.part | gzip
gzip: compressed data not written to a terminal. Use -f to force compression.
For help, type: gzip -h
I also tried this, and it did not throw any error, but it did not compress the part files as soon as they were generated. I assume it will compress each file when the whole split is done (or it may pack all part files into a single .gz file once the split completes; I am not sure).
zcat file.gz | split -b 1G - file.gz.part && gzip
I read here that there is a filter option, but my version of split is (GNU coreutils) 8.4, hence the filter is not supported.
$ split --version
split (GNU coreutils) 8.4
Please advise a suitable way to achieve this, preferably a one-liner (if possible); a shell (bash/ksh) script would also work.
Newer versions of split support filter commands. Use this:
zcat file.gz | split - -b 1G --filter='gzip > $FILE.gz' file.part.
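As an optional sanity check (my addition; it relies on split's default aa, ab, … suffixes sorting in order), the decompressed parts should reproduce the original stream:
cat file.part.*.gz | zcat | cksum   # checksum of the reassembled parts
zcat file.gz | cksum                # should print the same checksum and size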
It's definitely suboptimal, but I tried to write it in bash just for fun (I haven't actually tested it, so there may be some minor mistakes):
GB_IN_BLOCKS=`expr 2048 \* 1024`
GB=`expr $GB_IN_BLOCKS \* 512`

COMPLETE_SIZE=`zcat asdf.gz | wc -c`
PARTS=`expr $COMPLETE_SIZE \/ $GB`

for i in `seq 0 $PARTS`
do
    zcat asdf.gz | dd skip=`expr $i \* $GB_IN_BLOCKS` count=$GB_IN_BLOCKS | gzip > asdf.gz.part$i
done

grep - limit number of files read

I have a directory with over 100,000 files. I want to know if the string "str1" exists as part of the content of any of these files.
The command:
grep -l 'str1' * takes too long as it reads all of the files.
How can I ask grep to stop reading any further files if it finds a match? Any one-liner?
Note: I have tried grep -l 'str1' * | head but the command takes just as much time as the previous one.
Passing 100,000 filenames as command-line arguments is going to cause a problem; it probably exceeds the maximum command-line length.
But you don't have to name all the files if you use the recursive option with just the name of the directory the files are in (which is . if you want to search files in the current directory):
grep -l -r 'str1' . | head -1
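As an aside (not part of the original answer), you can check the argument-length limit on your system with getconf:
getconf ARG_MAX   # maximum total size, in bytes, of the arguments and environment passed to exec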
Use grep -m 1 so that grep stops after finding the first match in a file. It is extremely efficient for large text files.
grep -m 1 str1 * /dev/null | head -1
If there is only a single file, the /dev/null above ensures that grep still prints the file name in the output.
If you want to stop after finding the first match in any file:
for file in *; do
    if grep -q -m 1 str1 "$file"; then
        echo "$file"
        break
    fi
done
The for loop also saves you from the too many arguments issue when you have a directory with a large number of files.

wc gzipped files?

I have a directory with both uncompressed and gzipped files and want to run wc -l on this directory. wc will report a line count for the compressed files that is not accurate (since it seems to count newlines in the gzipped version of the file). Is there a way to create a zwc script, similar to zgrep, that will detect the gzipped files and count the uncompressed lines?
Try this zwc script:
#! /bin/bash --
for F in "$@"; do
    echo "$(zcat -f <"$F" | wc -l) $F"
done
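A usage sketch (file names are illustrative; zcat -f passes non-gzipped files through unchanged, so plain and .gz files can be mixed):
chmod +x zwc
./zwc notes.txt access_log.gz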
You can use zgrep to count lines as well (or rather, the beginnings of lines):
zgrep -c ^ file.txt
I also use "cat file_name | gzip -d | wc -l".

Resources