wc gzipped files? - linux

I have a directory with both uncompressed and gzipped files and want to run wc -l on this directory. wc reports a line count for the compressed files that is not accurate (it seems to count newlines in the gzipped bytes rather than in the original file). Is there a way to create a zwc script, similar to zgrep, that will detect the gzipped files and count the uncompressed lines?

Try this zwc script:
#! /bin/bash --
for F in "$@"; do
echo "$(zcat -f <"$F" | wc -l) $F"
done
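Saved as zwc and made executable, it takes any mix of plain and gzipped files on the command line (the filenames here are just examples):
chmod +x zwc
./zwc notes.txt archive.log.gz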

You can use zgrep to count lines as well (or rather, beginnings of lines):
zgrep -c ^ file.txt

I also use "cat file_name | gzip -d | wc -l".

Related

Bash script to move first N files with specific name

I'm trying to move only 100 files with a specific extension (from the current directory to the parent directory), but my attempt below does not work:
for file in $(ls -U | grep *.txt | tail -100)
do
mv $file ../
done
Can you point me to the correct approach?
Since you didn't quote *.txt, the shell expanded it to all the filenames ending in .txt. So your command is something like:
ls -U | grep file1.txt file2.txt file3.txt ... | tail -100
Since grep has filename arguments, it ignores its standard input. It outputs all the lines matching file1.txt in the remaining files. There are probably no matches, so nothing is piped to tail -100. And even if there were matches, the output would be the matching lines from the files, not filenames, so it wouldn't be useful for the mv command.
You can loop over the filenames directly, and use a counter variable to stop after 100 files.
counter=0
for file in *.txt
do
if (( counter >= 100 ))
then break
fi
mv "$file" ../
((counter++))
done
This avoids the pitfalls of parsing the output of ls.
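If you prefer to skip the counter altogether, a bash array plus a slice does the same job; this is just a sketch and assumes the first 100 names in glob order are the ones you want to move:
shopt -s nullglob
files=( *.txt )
(( ${#files[@]} )) && mv -- "${files[@]:0:100}" ../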
This will do the job:
ls -U *.txt | tail -100 | while read filename; do mv "$filename" ../; done
The while read filename loop respects spaces in filenames.
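A slightly more careful variant of the same one-liner, preserving leading whitespace and backslashes in names (still a sketch, and it still cannot handle newlines in filenames):
ls -U *.txt | tail -100 | while IFS= read -r filename; do mv -- "$filename" ../; done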
Run this in the text file directory:
#!/bin/bash
for txt_file in ./*.txt; do
((c++==100)) && break
mv "$txt_file" ../
done

Printing the number of lines

I have a directory that contains only .txt files. I want to print the number of lines for every file. When I write cat file.txt | wc -l the number of lines appears, but when I try to do it in a script it gets more complicated. I have this code:
for fis in `ls -R $1`
do
echo `cat $fis | wc -l`
done
I tried wc -l $fis, and also awk and grep, and it doesn't work. It tells me:
cat: fis1: No such file or directory
0
How can I print the number of lines?
To find files recursively in subdirectories, use the find command, not ls -R, which is mainly intended for human reading.
find "$1" -type f -exec wc -l {} +
The problems with looping over the output of ls -R are:
Filenames with whitespace won't be parsed correctly.
It prints other output beside just the filenames.
Not the problem here, but the echo wrapper is also unnecessary. You can simply use
wc -l "${fis}"
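Putting the pieces together for a single directory, a minimal non-recursive sketch (it assumes $1 is the directory passed to the script, as in the question):
for fis in "$1"/*.txt
do
wc -l "$fis"
done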
What goes wrong?
You have a subdirectory called fis1. Look at the output of ls:
# ls -R fis1
fis1:
file1_in_fis1.txt
When you are parsing this output, your script will try
echo `cat fis1: | wc -l`
The cat will tell you No such file or directory and wc counts 0.
As @Barmar explained, ls prints additional output you do not want.
Do not try to patch your attempt with | grep .txt or if [ -f "${fis}" ]; then ..; these will fail on filename with spaces.txt. Use find or shopt instead (and accept the answer of @Barmar or @Cyrus).
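For the recursive case via shopt, a minimal sketch (assumes bash 4+ for globstar; $1 is the directory, as in the question):
shopt -s globstar nullglob
for fis in "$1"/**/*.txt
do
wc -l "$fis"
done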

Merge sort gzipped files

I have 40 files of 2GB each, stored on an NFS architecture. Each file contains two columns: a numeric id and a text field. Each file is already sorted and gzipped.
How can I merge all of these files so that the resulting output is also sorted?
I know sort -m -k 1 should do the trick for uncompressed files, but I don't know how to do it directly with the compressed ones.
PS: I don't want the simple solution of uncompressing the files onto disk, merging them, and compressing again, as I don't have enough disk space for that.
This is a use case for process substitution. Say you have two files to sort, sorta.gz and sortb.gz. You can give the output of gunzip -c FILE.gz to sort for both of these files using the <(...) shell operator:
sort -m -k1 <(gunzip -c sorta.gz) <(gunzip -c sortb.gz) >sorted
Process substitution substitutes a command with a file name that represents the output of that command, and is typically implemented with either a named pipe or a /dev/fd/... special file.
For 40 files, you will want to create the command with that many process substitutions dynamically, and use eval to execute it:
cmd="sort -m -k1 "
for input in file1.gz file2.gz file3.gz ...; do
cmd="$cmd <(gunzip -c '$input')"
done
eval "$cmd" >sorted # or eval "$cmd" | gzip -c > sorted.gz
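If you would rather avoid eval, the same 40-way merge can be wired up with explicit named pipes instead of <(...) process substitution. This is only a sketch; the file*.gz glob, the sorted.gz output name and the temporary directory are assumptions:
dir=$(mktemp -d)   # holds one FIFO per input file
i=0
fifos=()
for input in file*.gz
do
fifo="$dir/fifo$((i++))"
mkfifo "$fifo"
gunzip -c "$input" > "$fifo" &   # each writer blocks until sort opens its FIFO
fifos+=("$fifo")
done
sort -m -k1 "${fifos[@]}" | gzip -c > sorted.gz
wait
rm -r "$dir"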
#!/bin/bash
FILES=file*.gz # list of your 40 gzip files
# (e.g. file1.gz ... file40.gz)
WORK1="merged.gz" # first temp file and the final file
WORK2="tempfile.gz" # second temp file
> "$WORK1" # create empty final file
> "$WORK2" # create empty temp file
gzip -qc "$WORK2" > "$WORK1" # compress content of empty second
# file to first temp file
for I in $FILES; do
echo current file: "$I"
sort -k 1 -m <(gunzip -c "$I") <(gunzip -c "$WORK1") | gzip -c > "$WORK2"
mv "$WORK2" "$WORK1"
done
Fill $FILES most easily with bash globbing (file*.gz) or with a list of 40 filenames separated by spaces. The files in $FILES stay unchanged.
Finally, the 80 GB of data end up compressed in $WORK1. While this script runs, no uncompressed data is written to disk.
Adding a differently flavoured multi-file merge within a single pipeline: it takes all (pre-sorted) files in $OUT/uniques, sort-merges them and compresses the output; lz4 is used due to its speed:
find $OUT/uniques -name '*.lz4' |
awk '{print "<( <" $0 " lz4cat )"}' |
tr "\n" " " |
(echo -n sort -m -k3b -k2 " "; cat -; echo) |
bash |
lz4 \
> $OUT/uniques-merged.tsv.lz4
It is true there are zgrep and other common utilities that play with compressed files, but in this case you need to sort/merge uncompressed data and compress the result.

How to count lines in compressed file in a directory and print the number of lines per file?

I am counting lines in several files in a directory without decompressing, and dividing the result by 4 as below:
gunzip -c *.fastq.gz | echo $((`wc -l`/4))
Everything looks good, except that it gives me the total number of lines across all files. I would like to print the line count per file. Can anyone help? I am using Darwin (macOS). Thank you.
Just make it a small script?
#!/bin/bash
for i in *.fastq.gz
do
echo "$i" $(gunzip -c "$i" | echo `wc -l`/4 | bc -l)
done
If you want a one-liner:
for i in *.fastq.gz; do echo "$i" $(gunzip -c "$i" | echo `wc -l`/4 | bc -l); done
gzip usually comes with zcat (maybe gzcat), which is essentially gzip -dc. So this invocation should work for a single file:
echo $(( $(zcat file.gz | wc -l) / 4 ))
If for some reason you don't have zcat, gzip -dc works just fine in its place.
Wrapping that in a for loop to handle different files should be relatively straightforward...
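For instance, a hedged sketch of such a loop (the *.fastq.gz glob comes from the question; gunzip -c is used because plain zcat on macOS may expect a .Z suffix, as noted above):
for f in *.fastq.gz
do
echo "$f" $(( $(gunzip -c "$f" | wc -l) / 4 ))
done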

Problems with Grep Command in bash script

I'm having some rather unusual problems using grep in a bash script. Below is an example of the bash script code that I'm using that exhibits the behaviour:
UNIQ_SCAN_INIT_POINT=1
cat "$FILE_BASENAME_LIST" | uniq -d >> $UNIQ_LIST
sed '/^$/d' $UNIQ_LIST >> $UNIQ_LIST_FINAL
UNIQ_LINE_COUNT=`wc -l $UNIQ_LIST_FINAL | cut -d \ -f 1`
while [ -n "`cat $UNIQ_LIST_FINAL | sed "$UNIQ_SCAN_INIT_POINT"'q;d'`" ]; do
CURRENT_LINE=`cat $UNIQ_LIST_FINAL | sed "$UNIQ_SCAN_INIT_POINT"'q;d'`
CURRENT_DUPECHK_FILE=$FILE_DUPEMATCH-$CURRENT_LINE
grep $CURRENT_LINE $FILE_LOCTN_LIST >> $CURRENT_DUPECHK_FILE
MATCH=`grep -c $CURRENT_LINE $FILE_BASENAME_LIST`
CMD_ECHO="$CURRENT_LINE matched $MATCH times," cmd_line_echo
echo "$CURRENT_DUPECHK_FILE" >> $FILE_DUPEMATCH_FILELIST
let UNIQ_SCAN_INIT_POINT=UNIQ_SCAN_INIT_POINT+1
done
On numerous occasions, when grepping for the current line in the file location list, it has put no output to the current dupechk file even though there have definitely been matches to the current line in the file location list (I ran the command in terminal with no issues).
I've rummaged around the internet to see if anyone else has had similar behaviour, and thus far all I have found is that it is something to do with buffered and unbuffered outputs from other commands operating before the grep command in the Bash script....
However no one seems to have found a solution, so basically I'm asking you guys if you have ever come across this, and any idea/tips/solutions to this problem...
Regards
Paul
The 'problem' is the standard I/O library. When it is writing to a terminal
it is unbuffered, but if it is writing to a pipe then it sets up buffering.
Try changing
CURRENT_LINE=`cat $UNIQ_LIST_FINAL | sed "$UNIQ_SCAN_INIT_POINT"'q;d'`
to
CURRENT_LINE=`sed "$UNIQ_SCAN_INIT_POINT"'q;d' $UNIQ_LIST_FINAL`
Are there any directories with spaces in their names in $FILE_LOCTN_LIST? Because if there are, those spaces will need to be escaped somehow. Some combination of find and xargs can usually deal with that for you, especially xargs -0.
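On the quoting front, a hedged tweak to the grep lines from the question, reusing its variable names, treats the current line as a literal string so spaces and regex metacharacters don't bite:
grep -F -- "$CURRENT_LINE" "$FILE_LOCTN_LIST" >> "$CURRENT_DUPECHK_FILE"
MATCH=`grep -Fc -- "$CURRENT_LINE" "$FILE_BASENAME_LIST"`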
A small bash script using md5sum and sort that detects duplicate files in the current directory:
md5sum * |
sort |
while read -r sum filename
do
[[ $CURRENT == "$sum" ]] && echo "$filename is a duplicate"
CURRENT=$sum
done
You tagged linux, so I assume you have tools like GNU find, md5sum, uniq, sort, etc. Here's a simple example to find duplicate files:
$ echo "hello world">file
$ md5sum file
6f5902ac237024bdd0c176cb93063dc4 file
$ cp file file1
$ md5sum file1
6f5902ac237024bdd0c176cb93063dc4 file1
$ echo "blah" > file2
$ md5sum file2
0d599f0ec05c3bda8c3b8a68c32a1b47 file2
$ find . -type f -exec md5sum "{}" \; |sort -n | uniq -w32 -D
6f5902ac237024bdd0c176cb93063dc4 ./file
6f5902ac237024bdd0c176cb93063dc4 ./file1
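A slight variation on the same idea, batching files into fewer md5sum calls and dropping the numeric sort (the hashes are hex strings, not numbers); the output is the same:
find . -type f -exec md5sum {} + | sort | uniq -w32 -D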