I have two lists of files with their md5sum checksums, and the lists have different paths for the same files.
Example of content in the first file with checksums (server.list):
2c03ff18a643a1437ec0cf051b8b7b9d /tmp/fastq1_L001_R1_001.fastq.gz
c430f587aba1aa9f4fdf69aeb4526621 /tmp/fastq1_L001_R2_001.fastq.gz
6e6bcd84f264233cf7c428c0cfdc0c03 tmp/fastq1_L002_R1_001.fastq.gz
Example of content in the second file with checksums (downloaded.list):
2c03ff18a643a1437ec0cf051b8b7b9d /home/projects/fastq1_L001_R1_001.fastq.gz
c430f587aba1aa9f4fdf69aeb4526621 /home/projects/fastq1_L001_R2_001.fastq.gz
6e6bcd84f264233cf7c428c0cfdc0c03 /home/projects/fastq1_L002_R1_001.fastq.gz
When I run the following command, I get this output:
awk -F"/" 'FNR==NR{filearray[$1]=$NF; next }!($1 in filearray){printf "%s has a different md5sum\n",$NF}' downloaded.list server.list
fastq1_L001_R1_001.fastq.gz has a different md5sum
fastq1_L001_R2_001.fastq.gz has a different md5sum
fastq1_L002_R1_001.fastq.gz has a different md5sum
Why am I getting this message when the first column is the same in both files? Can someone enlighten me on this issue?
Edit:
If I remove the path and leave only the file name, it works just fine.
Edit 2:
As pointed out, there is another possible file path form, one that does not start with /. In that case, I cannot use / as the field separator.
Assumptions:
filename (sans path) and md5sum have to match
filenames may not be listed in the same order
filenames may not exist in both files
Sample data:
$ head downloaded.list server.list
==> downloaded.list <==
2c03ff18a643a1437ec0cf051b8b7b9d /home/projects/fastq1_L001_R1_001.fastq.gz # match
YYYYf587aba1aa9f4fdf69aeb4526621 /home/projects/fastq1_L001_R5_911.fastq.gz # different md5sum
c430f587aba1aa9f4fdf69aeb4526621 /home/projects/fastq1_L001_R2_001.fastq.gz # match
MNOPf587aba1aa9f4fdf69aeb4526621 /home/projects/fastq1_L001_R8_abc.fastq.gz # filename does not exist in other file
ABCDf587aba1aa9f4fdf69aeb4526621 /home/projects/fastq1_L001_R9_004.fastq.gz # different filename but matching md5sum (vs last line of other file)
==> server.list <==
2c03ff18a643a1437ec0cf051b8b7b9d /tmp/fastq1_L001_R1_001.fastq.gz # match
c430f587aba1aa9f4fdf69aeb4526621 /tmp/fastq1_L001_R2_001.fastq.gz # match
XXXXf587aba1aa9f4fdf69aeb4526621 /tmp/fastq1_L001_R5_911.fastq.gz # different md5sum
TUVWff18a643a1437ec0cf051b8b7b9d /tmp/fastq1_L999_R6_922.fastq.gz # filename does not exist in other file
ABCDf587aba1aa9f4fdf69aeb4526621 /tmp/fastq1_L001_R7_933.fastq.gz # different filename but matching md5sum (vs last line of other file)
One awk idea that addresses the whitespace issue and also verifies that filenames match:
awk ' # stick with default field delimiter of white space but ...
{ md5sum=$1
n=split($2,arr,"/") # split 2nd field on "/" delimiter
fname=arr[n]
if (FNR==NR)
filearray[fname]=md5sum
else {
if (fname in filearray && filearray[fname] == $1)
next
printf "%s has a different md5sum\n",fname
}
}
' downloaded.list server.list
This generates:
fastq1_L001_R5_911.fastq.gz has a different md5sum
fastq1_L999_R6_922.fastq.gz has a different md5sum
fastq1_L001_R7_933.fastq.gz has a different md5sum
The whitespace that ends up in $1 when / is the field separator is causing problems, since with -F"/" the first field is the checksum plus the blanks that precede the first /. Removing it:
awk -F"/" '{gsub(/ /, "", $1)} FNR==NR{filearray[$1]=$NF; next} !($1 in filearray){printf "%s has a different md5sum\n", $NF}' downloaded.list server.list
Related
I would like to clean up a folder with videos. I have a bunch of videos that were downloaded with different resolutions, so each file will start with the same name and then end with "_480p" or "_720p" etc.
I just want to keep the largest file of each such set.
So I am looking for a way to delete files based on
check if name before "_" is identical
if true, then delete all files except largest one
Thinking of a flexible and fast way to approach the problem: gather a list of files whose names end in "[[:digit:]]+p", then provide the names on stdin to awk. Let awk index an array with the file prefix (path plus the part of the name before '_'), which is unique per set of files, so the resolution number for each prefix can be obtained and stored at that index.
Then it's a simple matter of comparing the stored resolution number for the file against the current file's number and deleting the lesser of the two.
Your find command to locate all such files below the current directory, recursively, could be:
find ./tmp -type f -regex "^.*[0-9]+p$"
What I would do is pipe the filename output to a short awk script where an array stores the last seen resolution number for a given file prefix. If the current record's (line's) resolution number is bigger than the value stored in the array, a filename is rebuilt from the stored number and that file is deleted with system() using rm filename. If the current line's resolution number is less than what is already stored in the array for the prefix, you simply delete the current file.
You can do that as:
#!/usr/bin/awk -f
BEGIN { FS = "/" }
{
num = $NF # last field holds number up to 'p'
prefix = $0 # prefix is name up to "_[[:digit:]]+p"
sub (/^.*_/, "", num) # isolate number
sub (/p$/, "", num) # remove 'p' at end
sub (/_[[:digit:]]+p$/, "", prefix) # isolate path and name prefix
if (prefix in a) { # current file in array a[] ?
rmfile = $0 # set file to remove to current
if (num + 0 > a[prefix] + 0) { # current number > array number
rmfile = prefix "_" a[prefix] "p" # rebuild the filename to remove from the stored number
a[prefix] = num # update array with higher num
}
system ("rm " rmfile); # delete the file
}
else
a[prefix] = num # if no num for prefix in array, store first
}
(Note: the field separator splits the fields on the directory separator, so you have all the file path components to work with.)
Example Use/Output
With a representative set of files in a tmp/ directory below the current one, e.g.
$ ls -1 tmp
a_480p
a_720p
b_1080p
b_480p
c_1080p
c_720p
Running the find command piped to the awk script named awkparse.sh would be as follows (don't forget to make the awk script executable):
$ find ./tmp -type f -regex "^.*[0-9]+p$" | ./awkparse.sh
Looking at the directory after piping the results of find to the awk script, the tmp/ directory now only contains the highest resolution (largest) files for any given filename, e.g.
$ ls -1
a_720p
b_1080p
c_1080p
This would be highly efficient. It could also handle all files in a nested directory structure where multiple directory levels hold files you need to clean out. Look things over and let me know if you have questions.
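If you would rather preview the deletions first, a dry-run variant is a one-line change to the script above (a sketch, not part of the original): swap the system() call for a print so each command is shown instead of executed.
print "rm " rmfile # dry run: show the command instead of running it
Once the printed commands look right, restore the system() line.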
This shell script might be what you want:
previous_prefix=
for file in *_[0-9]*[0-9]p*; do
prefix=${file%_*} # strip the shortest "_*" suffix: name before the last "_"
resolution=${file##*_} # keep the text after the last "_", e.g. "480p"
resolution=${resolution%%p*} # strip from the first "p" on, leaving the bare number
if [ "$prefix" = "$previous_prefix" ]; then
if [ "$resolution" -gt "$greater_resolution" ]; then
file_to_be_removed=$greater_file
greater_file=$file
greater_resolution=$resolution
else
file_to_be_removed=$file
fi
echo rm -- "$file_to_be_removed"
else
greater_resolution=$resolution
greater_file=$file
previous_prefix=$prefix
fi
done
Drop the echo if the output looks good. (The script relies on the glob expanding in sorted order, so files sharing a prefix are processed consecutively.)
I would try to:
list all non-smallest files (non-480p): *_720p* and *_1080p*
for each of them replace *_720p*/*_1080p* in the name with all possible smaller resolutions
and try to delete those files with rm -f, whether they exist or not
#!/bin/bash -e
shopt -s nullglob
for file in *_1080p*; do
rm -f -- "${file//_1080p/_720p}"
rm -f -- "${file//_1080p/_480p}"
done
for file in *_720p*; do
rm -f -- "${file//_720p/_480p}"
done
And here is a Bash script using nested loops to automate the above:
#!/bin/bash -e
shopt -s nullglob
res=(_1080p _720p _480p _240p)
for r in "${res[@]}"; do
res=("${res[@]:1}") # remove the first element in res array
for file in *$r*; do
for r2 in "${res[@]}"; do
rm -f -- "${file//$r/$r2}"
done
done
done
The Situation:
I have hundreds of zip files with an arbitrary date/time mixed into each name (4-6-2021 12-34-09 AM.zip). I need to get all of these files in date order and renamed sequentially (0.zip, 1.zip, 2.zip, etc.) on a Linux CLI system.
What I've tried:
I've tried ls -tr | while read i; do n=$((n+1)); mv -- "$i" "$(printf '%03d' "$n").zip"; done, which almost does what I want but still seems to be out of order. I think it's taking the order in which the files were created rather than the filename, which is what I need.
If I can get this done, my next step would be to rename the file (yes a single file) in each zip to the name of the zip file. I'm not sure how I'd go about this either.
tl;dr
I have these files named with a weird date system. I need them ordered by date and renamed sequentially, like 0.zip, 1.zip, 2.zip, etc. It's 3:00 AM and I don't know why I'm still up trying to solve this, and I have no idea how I'll rename the files in the zips to that sequential number (read above for more detail on this).
Thanks in advance!
GNU awk is an option here, feeding the file listing into awk via process substitution:
awk '{
fil=$0; # Set a variable fil to the line
gsub("-"," ",$1); # Replace "-" for " " in the first space delimited field
split($1,map," "); # Split the first field into the array map, using " " as the delimiter
if (length(map[1])==1) {
map[1]="0"map[1] # If the length of the day is 1, pad out with "0"
};
if (length(map[2])==1) {
map[2]="0"map[2] # Do the same for month
}
$1=map[3]" "map[2]" "map[1]; # Rebuild the first field in the "YYYY MM DD" order that mktime expects (map[1]=day, map[2]=month per the padding above; swap them if the names are month-first)
gsub("-"," ",$2); # Change "-" for " " in time
map1[mktime($1" "$2)]=fil # Get epoch format of date/time using mktime function and use this as an index for array map1 with the original line (fil) as the value
}
END {
PROCINFO["sorted_in"]="@ind_num_asc"; # At the end of processing, set the array traversal order to index number ascending
cnt=0; # Initialise a cnt variable
for (i in map1) {
print "mv \""map1[i]"\" \""cnt".zip\""; # Loop through map1 array printing values and using these values along with cnt to generate and print move command
cnt++
}
}' <(for fil in *AM.zip;do echo "$fil";done)
Once you are happy with the mv commands that are printed, pipe the result into bash, like so:
awk '{ fil=$0;gsub("-"," ",$1);split($1,map," ");if (length(map[1])==1) { map[1]="0"map[1] };if (length(map[2])==1) { map[2]="0"map[2] };$1=map[3]" "map[2]" "map[1];gsub("-"," ",$2);map1[mktime($1" "$2)]=fil} END { PROCINFO["sorted_in"]="@ind_num_asc";cnt=0;for (i in map1) { print "mv \""map1[i]"\" \""cnt".zip\"";cnt++ } }' <(for fil in *AM.zip;do echo "$fil";done) | bash
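For the second part of the question, renaming the single file inside each zip to match the zip's (new) name, one option is Info-ZIP's zipnote, which can rename entries in place. A minimal sketch, assuming each archive really does contain exactly one file and that the zips have already been renamed to 0.zip, 1.zip, etc.:
for z in *.zip; do
    inner=$(zipinfo -1 "$z") # name of the single entry inside the archive
    printf '@ %s\n@=%s\n' "$inner" "${z%.zip}" | zipnote -w "$z" # rename the entry to match the zip
done
zipnote -w rewrites the archive from the "@ oldname" / "@=newname" directives fed to it on stdin.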
I have multiple text files in this format. I would like to extract lines matching this pattern "pass filters and QC".
File1:
Before main variant filters, 309 founders and 0 nonfounders present.
0 variants removed due to missing genotype data (--geno).
9302015 variants removed due to minor allele threshold(s)
(--maf/--max-maf/--mac/--max-mac).
7758518 variants and 309 people pass filters and QC.
Calculating allele frequencies... done.
I was able to grep the line, but when I tried to assign it to the line variable, it just doesn't work:
grep 'people pass filters and QC' File1
line="$(echo grep 'people pass filters and QC' File1)"
I am new to shell script and would appreciate if you could help me do this.
I want to create a tab separated file with just
"File1" "7758518 variants" "309 people"
GNU awk
gawk '
BEGIN { patt = "([[:digit:]]+ variants) .* ([[:digit:]]+ people) pass filters and QC" }
match($0, patt, m) {printf "\"%s\"\t\"%s\"\t\"%s\"\n", FILENAME, m[1], m[2]}
' File1
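Since the program keys off FILENAME, it extends to many logs with no changes. A hypothetical usage, with the program above saved as extract.awk:
gawk -f extract.awk File* > summary.tsv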
You are almost there; just remove echo from your command (with the echo, the command substitution returns the command text itself instead of running grep). The double quotes are unnecessary in an assignment:
line=$(grep 'people pass filters and QC' File1)
Now view the value stored in variable:
echo "$line"
And if your file structure is the same, i.e. the line is always of the form 7758518 variants and 309 people pass filters and QC, you can use awk to get selected columns from the output. Note that the awk field numbering below assumes grep output prefixed with the filename (File1:...), which you get when grep is run with several files or with -H. So the complete command would be like below:
OIFS=$IFS;IFS=$'\n';for i in $line;do echo $i;echo '';done | awk -F "[: ]" '{print $1"\t"$2" "$3"\t"$5" "$6}';IFS=$OIFS
Explanation:
IFS means internal field separator, and we are setting it to the newline character because we need to iterate over whole lines in the for loop.
But before that, we take a backup of it in another variable, OIFS, so we can restore it later.
We are using a for loop to iterate through all the matched lines, and awk to select the 1st, 2nd, 3rd, 5th, and 6th fields as per your requirement.
But please note, if your file structure varies, we may need to use a different technique to extract the "7758518 variants" and "309 people" parts.
I've looked everywhere and I'm out of luck.
I am trying to count the files in my current directory and all subdirectories so that when I run the shell script count_files.sh it will produce output similar to:
$
2 sh
4 html
1 css
2 noexts
(EDIT: the above output should have each count and extension on its own line)
$
where noexts are either files without any period as an extension (ex: fileName ) or files with a period but no extension (ex: fileName. ).
this pipeline:
find * | awk -F . '{print $NF}'
gives me a comprehensive list of all the files, and I've figured out how to remove files without any period (ex: fileName ) using sed '/\//d'
MY ISSUE is that I cannot remove from the output of the above pipeline the files that end in a period but have nothing (NULL) after it (ex: fileName. ), as the last '.'-delimited field is empty.
How can I use sed like above to remove a null character from a pipe input?
I understand this could be a quick fix, but I've been googling like a madman with no luck. Thanks in advance.
Chip
To filter out filenames that end with ., since filenames are the whole input line in find's output, you could use
sed '/\.$/d'
Where \. matches a literal dot and $ matches the end of the line.
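Dropped into the asker's pipeline, the filter sits between find and awk; a sketch (assuming, as the question does, that directory names contain no dots):
find * | sed '/\.$/d' | awk -F . '{print $NF}'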
However, I think I'd do the whole thing in awk, since sorting does not appear to be necessary.
EDIT: Found a nicer way to do it with awk and find's -printf action:
find . -type f -printf '%f\n' | awk -F. '!/\./ || $NF == "" { ++count["noext"]; next } { ++count[$NF] } END { for(k in count) { print count[k] " " k } }'
Here we pass -printf '%f\n' to find to make it print only the file name without the preceding directory, which makes it much easier to work with for our purposes -- this way there's no need to worry about periods in directory names (such as /etc/somethingorother.d). The field separator is '.', the awk code is
!/\./ || $NF == "" { # if the line (the filename) does not contain
# a period or there's nothing after the last .
++count["noext"] # increment the "noext" counter
# note that this will be collated with files that
# have ".noext" as filename extension. see below.
next # go to the next line
}
{ # in all other lines
++count[$NF] # increment the counter for the file extension
}
END { # in the very end:
for(k in count) { # print the counters.
print count[k] " " k
}
}
Note that this way, if there is a file "foo.noext", it will be counted among the files without a filename extension. If this is a worry, use a special counter for files without an extension -- either apart from the array or with a key that cannot be a filename extension (such as one that includes a . or the empty string).
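If you do want the counts sorted, piping the one-liner's output through sort is enough (a minor variation, not required by the question):
find . -type f -printf '%f\n' | awk -F. '!/\./ || $NF == "" { ++count["noext"]; next } { ++count[$NF] } END { for(k in count) print count[k] " " k }' | sort -rn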
I have a huge set of files, 64,000, and I want to create a Bash script that lists the file names using
ls -1 > file.txt
for every 4,000 files and stores each resulting file.txt in a separate folder. So every 4,000 files have their names listed in a text file that is stored in a folder. The result is
folder01 contains file.txt that lists files #1-#4000
folder02 contains file.txt that lists files #4001-#8000
folder03 contains file.txt that lists files #8001-#12000
.
.
.
folder16 contains file.txt that lists files #60001-#64000
Thank you very much in advance
You can try
ls -1 | awk '
{
if (! ((NR-1)%4000)) {
if (j) close(fnn)
fn=sprintf("folder%02d",++j)
system("mkdir "fn)
fnn=fn"/file.txt"
}
print >> fnn
}'
Explanation:
NR is the current record number in awk, that is: the current line number.
NR starts at 1, on the first line, so we subtract 1 such that the if statement is true for the first line
system calls an operating system function from within awk
print in itself prints the current line to standard output, we can redirect (and append) the output to the file using >>
All uninitialized variables in awk will have a zero value, so we do not need to say j=0 in the beginning of the program
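A quick sanity check after running it: each file.txt should hold at most 4,000 names.
wc -l folder*/file.txt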
This will get you pretty close:
ls -1 | split -l 4000 -d - folder
Run the result of ls through split, breaking every 4000 lines (-l 4000), using numeric suffixes (-d), reading from standard input (-), and naming the output files starting with folder.
Results in folder00, folder01, ...
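If you need the exact layout from the question (a file literally named file.txt inside each folder), the split output can be reshuffled afterwards. A sketch, using hypothetical *.d directory names so mkdir does not clash with the files split just created:
for f in folder[0-9][0-9]; do
    mkdir "$f.d" && mv "$f" "$f.d/file.txt"
done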
Here is an exact solution using awk:
ls -1 | awk '
(NR-1) % 4000 == 0 {
dir = sprintf("folder%02d", ++nr)
system("mkdir -p " dir);
}
{ print >> (dir "/file.txt") } '
There are already some good answers above, but I would also suggest you take a look at the watch command. This will re-run a command every n seconds, so you can, well, watch the output.
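For example, to watch the folders fill up while one of the scripts above runs (illustrative only):
watch -n 5 'wc -l folder*/file.txt'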