How to count lines in compressed files in a directory and print the number of lines per file? - linux

I am counting lines in several files in a directory without decompressing, and dividing the result by 4 as below:
gunzip -c *.fastq.gz | echo $((`wc -l`/4))
Everything looks good, except that it gives me the total number of lines across all files. I would like to print the line count per file. Can anyone help? I am using Darwin (macOS). Thank you.

Just make it a small script?
#!/bin/bash
# Print "<file> <reads>" for each compressed FASTQ in the directory
for i in *.fastq.gz
do
    echo "$i" $(gunzip -c "$i" | echo $(wc -l)/4 | bc -l)
done
If you want a one-liner:
for i in *.fastq.gz; do echo "$i" $(gunzip -c "$i" | echo $(wc -l)/4 | bc -l); done

gzip usually comes with zcat (called gzcat on macOS/BSD, where plain zcat expects .Z files), which is essentially gzip -dc. So this invocation should work for a single file:
echo $(( $(zcat file.gz | wc -l) / 4 ))
If for some reason you don't have zcat, gzip -dc works just fine in its place.
Wrapping that in a for loop to handle different files should be relatively straightforward...
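For example, a minimal sketch of that loop (same *.fastq.gz naming as in the question; gzip -dc is used so it also works where zcat expects .Z files, e.g. macOS):
# Prints "<file> <reads>" for every .fastq.gz file in the current directory
for f in *.fastq.gz; do
    echo "$f" $(( $(gzip -dc "$f" | wc -l) / 4 ))
done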

Related

How to touch all files that are returned by a sorted ls?

If I have the following:
ls|sort -n
How would I touch all those files in the order of the sorted files? Something like:
ls|sort -n|touch
What would be the proper syntax? Note that I need to touch the files in the exact order they are sorted, as I'm trying to order these files for a FAT reader that does minimal metadata reading.
ls -1tr | while read file; do touch "$file"; sleep 1; done
If you want to preserve distance in modification time from one file to the next then call this instead:
upmodstamps() {
  # Seconds elapsed since the oldest file was last modified
  oldest_elapsed=$(( $(date +%s) - $(stat -c %Y "$(ls -1tr | head -1)") ))
  for file in *; do
    oldstamp=$(stat -c %Y "$file")
    newstamp=$(( oldstamp + oldest_elapsed ))
    newstamp_fmt=$(date --date=@"${newstamp}" +'%Y%m%d%H%M.%S')
    touch -t "${newstamp_fmt}" "$file"
  done
}
Note: the stat and date usage assumes GNU coreutils.
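If you are on BSD/macOS instead, a rough equivalent (untested sketch; stat -f %m and date -r replace the GNU stat -c %Y and date --date=@ calls above) might look like:
upmodstamps_bsd() {
  # BSD stat -f %m prints the mtime as epoch seconds; BSD date -r formats epoch seconds
  oldest_elapsed=$(( $(date +%s) - $(stat -f %m "$(ls -1tr | head -1)") ))
  for file in *; do
    oldstamp=$(stat -f %m "$file")
    newstamp=$(( oldstamp + oldest_elapsed ))
    newstamp_fmt=$(date -r "$newstamp" +'%Y%m%d%H%M.%S')
    touch -t "${newstamp_fmt}" "$file"
  done
}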
You can use this:
ls | sort -n >> list.txt
touch $(cat list.txt)
OR
touch $(ls /path/to/dir | sort -n)
OR, if you want to copy the files somewhere instead of creating empty files, use:
cp $(cat list.txt) ./DirectoryWhereYouWantToCopy
Try it like this:
touch $(ls | sort -n)
Can you give a few file names?
If you have file names with numbers, such as 1file, 10file, 11file ... 20file, then you need to use --general-numeric-sort:
ls | sort --general-numeric-sort --output=../workingDirectory/sortedFiles.txt
cat sortedFiles.txt
1file
10file
11file
12file
20file
and move sortedFiles.txt into your working directory or wherever you want.
touch $(cat ../workingDirectory/sortedFiles.txt)
This will create empty files with the exact same names.
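If the goal is for the modification times themselves to reflect the sort order (as the FAT-reader use case suggests), here is a hedged sketch that combines the numeric sort with the one-second-step touch from the earlier answer:
# Touch files in numeric-sort order with strictly increasing mtimes
ls | sort -n | while read -r file; do
    touch "$file"
    sleep 1
done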

awk doesn't use the first value of a column

First of all, thank you for your help. I have a problem with awk and a while read loop. I have a file separated into two columns, and each column has 8 values. My script is supposed to select the second column, download 8 different files, and decompress them. The problem is that my script doesn't download the first value of the column.
This is my script
#!/bin/bash
cat $1 | while read line
do
echo "Downloading fasta files from NCBI..."
awk '{print $2}' | wget -i- 2>> log
gzip -d *.gz
done
This is the file I am using
Salmonella_enterica_subsp_enterica_Typhi https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/003/717/755/GCF_003717755.1_ASM371775v1/GCF_003717755.1_ASM371775v1_translated_cds.faa.gz
Salmonella_enterica_subsp_enterica_Paratyphi_A https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/818/115/GCF_000818115.1_ASM81811v1/GCF_000818115.1_ASM81811v1_translated_cds.faa.gz
Salmonella_enterica_subsp_enterica_Paratyphi_B https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/018/705/GCF_000018705.1_ASM1870v1/GCF_000018705.1_ASM1870v1_translated_cds.faa.gz
Salmonella_enterica_subsp_enterica_Infantis https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/011/182/555/GCA_011182555.2_ASM1118255v2/GCA_011182555.2_ASM1118255v2_translated_cds.faa.gz
Salmonella_enterica_subsp_enterica_Typhimurium_LT2 https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/006/945/GCF_000006945.2_ASM694v2/GCF_000006945.2_ASM694v2_translated_cds.faa.gz
Salmonella_enterica_subsp_diarizonae https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/003/324/755/GCF_003324755.1_ASM332475v1/GCF_003324755.1_ASM332475v1_translated_cds.faa.gz
Salmonella_enterica_subsp_arizonae https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/900/635/675/GCA_900635675.1_31885_G02/GCA_900635675.1_31885_G02_translated_cds.faa.gz
Salmonella_bongori https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/006/113/225/GCF_006113225.1_ASM611322v2/GCF_006113225.1_ASM611322v2_translated_cds.faa.gz
The problem is not the download. Check the output of
#!/bin/bash
cat "$1" | while read line
do
awk '{print $2}'
done
This also prints only 7 of the 8 URLs. When entering the loop, read reads the first line into the variable line. However, you never use that variable, so the line is lost. Then awk reads the remaining 7 lines from stdin in one go, and the loop body only runs once.
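A quick way to see the effect in isolation (demo.txt is a made-up three-line example, not part of the question):
printf 'a 1\nb 2\nc 3\n' > demo.txt
while read -r line; do
    awk '{print $2}'   # awk inherits stdin and consumes everything that is left
done < demo.txt
# Prints only 2 and 3: the first line sits unused in $line, and the loop body runs once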
You probably wanted to write
#!/bin/bash
cat "$1" | while read -r line
do
echo "Downloading fasta files from NCBI..."
echo "$line" | awk '{print $2}' | wget -i- 2>> log
gzip -d *.gz
done
But there is an easier and safer way:
awk '{print $2}' "$1" | wget -i- 2>> log
gzip -d *.gz
Since the command cut is made to select a column, why not simply issue the following (note that cut assumes tab-separated fields by default, so add -d' ' if your file is space-separated):
#!/bin/bash
for url in $(cut -f2 "$1")
do
    wget "$url" 2>> log
done
gzip -d *.gz

How do I insert a new line before concatenating?

I have about 80000 files which I am trying to concatenate. This one:
cat files_*.raw >> All
is extremely fast whereas the following:
for f in `ls files_*.raw`; do cat $f >> All; done;
is extremely slow. For this reason, I am trying to stick with the first option, except that I need to be able to insert a newline after each file is concatenated to All. Is there any fast way of doing this?
What about
ls files_*.raw | xargs -L1 sed -e '$s/$/\n/' >> All
That will insert an extra newline at the end of each file as you concat them.
And a parallel version if you don't care about the order of concatenation:
find ./ -name "*.raw" -print | xargs -n1 -P4 sed -e '$s/$/\n/' >>All
The second command might be slow because you are opening the 'All' file for append 80000 times vs. 1 time in the first command. Try a simple variant of the second command:
for f in files_*.raw; do cat "$f"; echo ''; done >> All
I don't know why it would be slow, but I don't think you have much choice:
for f in files_*.raw; do cat "$f" >> All; echo '' >> All; done
Each time awk starts reading another file, FNR resets to 1, so this prints a blank line between files:
awk 'FNR==1 && NR>1 {print ""} {print}' files_*.raw >> All
Note, it's all done in one awk process. Performance should be close to the cat command from the question.
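If you have GNU awk (gawk), the ENDFILE special pattern offers an alternative sketch that also adds the newline after the last file, which matches the question more literally:
# gawk only: ENDFILE runs after the last record of each input file
gawk '{ print } ENDFILE { print "" }' files_*.raw >> All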

Problems with Grep Command in bash script

I'm having some rather unusual problems using grep in a bash script. Below is an example of the bash script code that I'm using that exhibits the behaviour:
UNIQ_SCAN_INIT_POINT=1
cat "$FILE_BASENAME_LIST" | uniq -d >> $UNIQ_LIST
sed '/^$/d' $UNIQ_LIST >> $UNIQ_LIST_FINAL
UNIQ_LINE_COUNT=`wc -l $UNIQ_LIST_FINAL | cut -d \ -f 1`
while [ -n "`cat $UNIQ_LIST_FINAL | sed "$UNIQ_SCAN_INIT_POINT"'q;d'`" ]; do
CURRENT_LINE=`cat $UNIQ_LIST_FINAL | sed "$UNIQ_SCAN_INIT_POINT"'q;d'`
CURRENT_DUPECHK_FILE=$FILE_DUPEMATCH-$CURRENT_LINE
grep $CURRENT_LINE $FILE_LOCTN_LIST >> $CURRENT_DUPECHK_FILE
MATCH=`grep -c $CURRENT_LINE $FILE_BASENAME_LIST`
CMD_ECHO="$CURRENT_LINE matched $MATCH times," cmd_line_echo
echo "$CURRENT_DUPECHK_FILE" >> $FILE_DUPEMATCH_FILELIST
let UNIQ_SCAN_INIT_POINT=UNIQ_SCAN_INIT_POINT+1
done
On numerous occasions, when grepping for the current line in the file location list, it has put no output to the current dupechk file even though there have definitely been matches to the current line in the file location list (I ran the command in terminal with no issues).
I've rummaged around the internet to see if anyone else has had similar behaviour, and thus far all I have found is that it is something to do with buffered and unbuffered outputs from other commands operating before the grep command in the Bash script....
However, no one seems to have found a solution, so basically I'm asking you guys if you have ever come across this, and whether you have any ideas/tips/solutions for this problem...
Regards
Paul
The `problem' is the standard I/O library. When it is writing to a terminal it is unbuffered, but if it is writing to a pipe then it sets up buffering.
Try changing
CURRENT_LINE=`cat $UNIQ_LIST_FINAL | sed "$UNIQ_SCAN_INIT_POINT"'q;d'`
to
CURRENT_LINE=`sed "$UNIQ_SCAN_INIT_POINT"'q;d' $UNIQ_LIST_FINAL`
Are there any directories with spaces in their names in $FILE_LOCTN_LIST? Because if there are, those spaces will need to be escaped somehow. Some combination of find and xargs can usually deal with that for you, especially xargs -0.
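As a rough illustration of that idea (the quoting fix uses the variables from the script above; the find line's search root is an assumption, not taken from the original):
# Quote the variables so spaces in $CURRENT_LINE or the list path don't split words
grep -- "$CURRENT_LINE" "$FILE_LOCTN_LIST" >> "$CURRENT_DUPECHK_FILE"
# If you need to search the files themselves, NUL-separated names survive spaces
find /path/to/files -type f -print0 | xargs -0 grep -l -- "$CURRENT_LINE"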
A small bash script using md5sum and sort that detects duplicate files in the current directory:
CURRENT="" md5sum * |
sort |
while read md5sum filename;
do
[[ $CURRENT == $md5sum ]] && echo $filename is duplicate;
CURRENT=$md5sum;
done
You tagged linux, so I assume you have tools like GNU find, md5sum, uniq, sort, etc. Here's a simple example to find duplicate files:
$ echo "hello world">file
$ md5sum file
6f5902ac237024bdd0c176cb93063dc4 file
$ cp file file1
$ md5sum file1
6f5902ac237024bdd0c176cb93063dc4 file1
$ echo "blah" > file2
$ md5sum file2
0d599f0ec05c3bda8c3b8a68c32a1b47 file2
$ find . -type f -exec md5sum "{}" \; |sort -n | uniq -w32 -D
6f5902ac237024bdd0c176cb93063dc4 ./file
6f5902ac237024bdd0c176cb93063dc4 ./file1

wc gzipped files?

I have a directory with both uncompressed and gzipped files and want to run wc -l on this directory. wc will provide a line count for the compressed files which is not accurate (since it seems to count newlines in the gzipped version of the file). Is there a way to create a zwc script, similar to zgrep, that will detect the gzipped files and count the uncompressed lines?
Try this zwc script:
#! /bin/bash --
# zwc: zcat -f decompresses gzipped files and passes plain files through untouched
for F in "$@"; do
    echo "$(zcat -f <"$F" | wc -l) $F"
done
You can use zgrep to count lines as well (or rather the beginning of lines)
zgrep -c ^ file.txt
I also use "cat file_name | gzip -d | wc -l"
