How to count some data in a file which is inside an archived folder? - linux

I have an archive which contains some files. From one of those files I want to get a count based on the 31st pipe-delimited field. How can I get the count without unpacking the archive?
Archive name = mug.tar, file name = APR_17
Below is how I take the count:
| awk -F "|" '{print $31}' | grep "40411" | sort -n | uniq -c | wc -l

Untar the wanted file from the archive to stdout and pipe it to your awk:
$ tar -xOf mug.tar APR_17 | awk ...
man tar:
-x, --extract, --get
extract files from an archive
-O, --to-stdout
extract files to standard output
-f, --file ARCHIVE
use archive file or device ARCHIVE
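Putting that together with the counting pipeline from the question (a sketch; field 31 and the value 40411 are taken from the question above):
$ tar -xOf mug.tar APR_17 | awk -F "|" '{print $31}' | grep "40411" | sort -n | uniq -c | wc -l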

Related

How can I get the md5sum for all files inside a zip without extracting

Is there any way to get the md5sum of all (or any one) of the files inside a zip without extracting the zip?
I can extract the needed files using unzip <.zip>,
but I need to get the md5sum without extracting the zip.
This may not be exactly what you are looking for, but it will get you closer. You wouldn't be extracting the entire zip, only streaming one file at a time to md5sum to get its checksum. Without reading the contents of a file, md5sum cannot generate a hash.
Let's say you have 3 files with these MD5 sums:
b1946ac92492d2347c6235b4d2611184 a.txt
591785b794601e212b260e25925636fd b.txt
6f5902ac237024bdd0c176cb93063dc4 c.txt
You zip them into a single file using zip final.zip a.txt b.txt c.txt
When you list the files, you see there are 3 files.
unzip -l final.zip
Archive:  final.zip
  Length      Date    Time    Name
---------  ---------- -----   ----
        6  2021-08-08 17:20   a.txt
        6  2021-08-08 17:20   b.txt
       12  2021-08-08 17:20   c.txt
---------                     -------
       24                     3 files
To get the MD5 of each of the files without extracting the entire zip, you can do this:
unzip -p final.zip a.txt | md5sum
b1946ac92492d2347c6235b4d2611184 -
unzip -p final.zip b.txt | md5sum
591785b794601e212b260e25925636fd -
unzip -p final.zip c.txt | md5sum
6f5902ac237024bdd0c176cb93063dc4 -
Alternative
You can do md5sum *.txt > checksums to get the hash of all the files and store them in a checksums file. Add that file to the zip so you know the md5 of each file as it was when it was added to the zip.
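A minimal sketch of that approach, using the three example files above (the file name checksums.md5 is just an illustration):
md5sum a.txt b.txt c.txt > checksums.md5         # record the hashes
zip final.zip a.txt b.txt c.txt checksums.md5    # archive the files together with the hash list
unzip -p final.zip checksums.md5                 # later: read the recorded hashes without extracting to disk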
For all the files in a zip you can use this:
File='final.zip' ; unzip -lqq $File | while read L ; do unzip -p $File ${L##*[[:space:]]} | md5sum | sed "s/-/${L##*[[:space:]]}/" ; done
Gives:
b1946ac92492d2347c6235b4d2611184 a.txt
591785b794601e212b260e25925636fd b.txt
6f5902ac237024bdd0c176cb93063dc4 c.txt
Based on @MartinMann's answer, here is a version that works correctly even if the file names contain spaces or special characters:
ZIPFILE="final.zip"; unzip -Z1 "$ZIPFILE" | grep -v '/$' | while read L; do "$(unzip -p "$ZIPFILE" "$L" | md5sum | cut '-d ' -f1) $L" ; done
Gives:
b1946ac92492d2347c6235b4d2611184 a.txt
591785b794601e212b260e25925636fd b.txt
6f5902ac237024bdd0c176cb93063dc4 path/to/file with spaces.txt

Would like to tar files ending with the same timestamp into a single tar

I have a list of log files and all of these files end with a timestamp.
For each day I have a bunch of log files, all ending with the same timestamp.
For a week I have a long list of files, all with timestamps.
The challenge is that I would like to use the tar command to archive each set of files ending with the same date/time stamp as one tar file,
hence ending up with one tar file for every day.
How can I achieve this, please? Some sort of string matching or wildcard? I'm new to Linux, help please.
File Examples:
(screenshot of example file names with timestamps)
First, get a list of unique timestamps. Then, for each timestamp archive all files with that timestamp:
printf %s\\n *.log | grep -Eo '\.[0-9]{8}_' | tr -d ._ | sort -u | while read timestamp; do
    tar cf "$timestamp.tar" ./*"$timestamp"*.log
done
Here I assumed that the timestamps always have 8 digits, always start with . and always end with _ (as shown in your screenshot).
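To sanity-check which timestamps the pipeline will group on before creating any archives, you can run just the timestamp-extraction part on its own (same assumptions about the file names):
printf %s\\n *.log | grep -Eo '\.[0-9]{8}_' | tr -d ._ | sort -u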
# get all unique dates
all_date=`find -type f | awk -F '_' '{print $2}' | sort -u`
# make a dir to save tar files
mkdir tarfiles
# archive
for d in $all_date ; do
    tar zcvf "tarfiles/$d.tar.gz" *"$d"*
done

Download using rsync and extract using gunzip, and put it all together into a pipe

I have "gz" files that I am downloading using "rsync". Then, as these files are compressed, I need to extract them using gunzip (I am open to any other alternative for gunzip). I want to put all these commands together into a pipe to have something like that rsync file | gunzip
My original command is the following:
awk -F "\t" '$5~/^(reference genome|representative genome)$/ {sub("ftp", "rsync", $20); b=$20"/*genomic.fna.gz"; print b" viral/." }' assembly_summary_viral.txt | xargs -l1 rsync --copy-links --times --recursive --verbose --exclude="*rna*" --exclude="*cds*"
It looks a little bit complicated, but it's downloading the files that I need, and there is no problem with it. I added | gunzip, however the extraction of the compressed files is not working; it is only downloading them.
Any suggestion?
A pipe takes the stdout of the left command and sends it to the stdin of the right command. Here we have to take the stdout of rsync and pipe to the stdin of gunzip.
rsync doesn't really output much without the -v flag so you'll have to add that. It will now spit out to stdout something like the following:
>rsync -rv ./ ../viral
sending incremental file list
file1
file2
file3
test1_2/
test1_2/file1
test1_2/file2
sent 393 bytes received 123 bytes 1,032.00 bytes/sec
total size is 0 speedup is 0.00
We can pipe that to awk first to grab only the file path/name and prepend ../viral/ to the front of it so that it gunzips the files that you just rsync'd TO (instead of the ones FROM which you rsync'd):
rsync -rv ./ ../viral | awk '!NF{endFileList=1} NR>1 && endFileList!=1{print "../viral/"$0}'
Now we have rsync and awk spitting out a list of filenames that are being sent to the TO directory. Now we need to get gunzip to process that list. Unfortunately, gunzip can't take in a list of files. If you send gunzip something on its stdin, it will assume that the stream is a gzipped stream and will attempt to gunzip it.
Instead we'll employ the xargs method you have above to take the stdin and feed it into gunzip as the parameter (filename) that it needs:
rsync -rv ./ ../viral | awk '!NF{endFileList=1} NR>1 && endFileList!=1{print "../viral/"$0}' | xargs -l1 gunzip
Most likely you will have to tweak this a bit to ensure you are gunzipping the right files (either your FROM location files or your TO location files). This gets trickier if you are rsyncing to a remote computer over SSH, obviously. Not sure if that can just be piped.
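If a single pipeline is not a hard requirement, a simpler sketch is to decompress the destination directory in a second step once the transfer has finished (assuming ../viral is the destination used above):
rsync -rv ./ ../viral
find ../viral -name '*.gz' -exec gunzip {} +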

Total size of a bunch of gzip files in a folder

I am trying to figure out the total size of a bunch of gz files inside a folder.
I know that we can use gzip -l to get the uncompressed size, but if you run
awk '{print $2}' on that output it returns the header word uncompressed as well.
If I have 10 gz files inside a folder, what would be the best way to get the total uncompressed size?
If you just want to remove the word "uncompressed" from the output, there are a few ways to do it.
Most simply, make the AWK command only process lines that aren't the first.
gzip -l *.gz | awk 'NR > 1 { print $2; }'
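If what you actually want is the grand total rather than the per-file sizes, one minimal sketch is to take the second field of the last line, since gzip -l prints a trailing totals line when given more than one file:
gzip -l *.gz | awk 'END { print $2 }'
Keep in mind that gzip -l reads sizes from the gzip trailer, so the values it reports are only reliable for files smaller than 4 GiB.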

Merge sort gzipped files

I have 40 files of 2GB each, stored on an NFS architecture. Each file contains two columns: a numeric id and a text field. Each file is already sorted and gzipped.
How can I merge all of these files so that the resulting output is also sorted?
I know sort -m -k 1 should do the trick for uncompressed files, but I don't know how to do it directly with the compressed ones.
PS: I don't want the simple solution of uncompressing the files to disk, merging them, and compressing them again, as I don't have sufficient disk space for that.
This is a use case for process substitution. Say you have two files to sort, sorta.gz and sortb.gz. You can give the output of gunzip -c FILE.gz to sort for both of these files using the <(...) shell operator:
sort -m -k1 <(gunzip -c sorta.gz) <(gunzip -c sortb.gz) >sorted
Process substitution substitutes a command with a file name that represents the output of that command, and is typically implemented with either a named pipe or a /dev/fd/... special file.
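You can see the mechanism directly in bash; for example:
echo <(true)     # prints a path such as /dev/fd/63, the file name handed to the command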
For 40 files, you will want to create the command with that many process substitutions dynamically, and use eval to execute it:
cmd="sort -m -k1 "
for input in file1.gz file2.gz file3.gz ...; do
cmd="$cmd <(gunzip -c '$input')"
done
eval "$cmd" >sorted # or eval "$cmd" | gzip -c > sorted.gz
#!/bin/bash
FILES=file*.gz       # list of your 40 gzip files
                     # (e.g. file1.gz ... file40.gz)
WORK1="merged.gz"    # first temp file and the final file
WORK2="tempfile.gz"  # second temp file

> "$WORK1"           # create empty final file
> "$WORK2"           # create empty temp file
gzip -qc "$WORK2" > "$WORK1"   # compress the content of the empty second
                               # file into the first temp file
for I in $FILES; do
    echo current file: "$I"
    sort -k 1 -m <(gunzip -c "$I") <(gunzip -c "$WORK1") | gzip -c > "$WORK2"
    mv "$WORK2" "$WORK1"
done
The easiest way to fill $FILES is with bash globbing (file*.gz) or with a list of the 40 filenames separated by spaces. The files in $FILES stay unchanged.
Finally, the 80 GB of data end up compressed in $WORK1. While this script runs, no uncompressed data is written to disk.
Adding a differently flavoured multi-file merge within a single pipeline: it takes all (pre-sorted) files in $OUT/uniques, sort-merges them and compresses the output; lz4 is used due to its speed:
find $OUT/uniques -name '*.lz4' |
  awk '{print "<( <" $0 " lz4cat )"}' |
  tr "\n" " " |
  (echo -n sort -m -k3b -k2 " "; cat -; echo) |
  bash |
  lz4 \
  > $OUT/uniques-merged.tsv.lz4
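To see the command this pipeline builds before it is handed to bash, drop the final bash and lz4 stages; the intermediate output is one long sort invocation of the same shape as the eval approach above, roughly like this (the file names here are made up):
sort -m -k3b -k2  <( <$OUT/uniques/part1.lz4 lz4cat ) <( <$OUT/uniques/part2.lz4 lz4cat )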
It is true there are zgrep and other common utilities that play with compressed files, but in this case you need to sort/merge uncompressed data and compress the result.
