How do I compare 80 md5sums with each other in bash - linux

I have to compare md5sums of 80 copies of same file with each other and report a failure on a mismatch. How do I do it effectively in bash? I am looking for an elegant algorithm to do it.

md5sum FILES | sed 's/ .*$//' | sort -u
If you get more than one line of output, you have a mismatch.
(This doesn't tell you where the mismatch is.)
Putting it together, and replacing the sed command with a somewhat less terse awk command:
count=$(md5sum "$@" | awk '{print $1}' | sort -u | wc -l)
if [ "$count" -eq 1 ] ; then
echo "Everything matches"
else
echo "Nope"
fi
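If you also need to know which files deviate, here is a hedged sketch (not from the answer above; it assumes GNU coreutils, filenames without spaces, and that the most common checksum is the correct one):
# Find the most common checksum, then flag every file that differs.
ref=$(md5sum "$@" | awk '{print $1}' | sort | uniq -c | sort -n | tail -1 | awk '{print $2}')
md5sum "$@" | awk -v ref="$ref" '$1 != ref {print "MISMATCH:", $2}'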

The output of:
md5sum $files | sort -k 1,2
is a list of the checksums in sorted order, with the corresponding file names afterwards. If you need to eyeball the results, this might be sufficient. If you need to identify odd-ball results, you have to decide on the presentation. You say you've got 80 copies of 'the same file'. Suppose there are actually 10 copies of each of 8 versions of 'the file'. How are you going to decide which is correct and which is bogus? What if you have 41 with one hash and 39 with another - are you sure the 39 are wrong and the 41 correct? Clearly, it is likely that one hash will predominate, but you'll have to worry about those pesky boundary conditions.
You can also do fancier things, such as:
md5sum $files | sort -k 1,2 > sorted.md5
sed 's/ .*//' sorted.md5 | uniq -c | sed 's/^ *\([0-9][0-9]*\) \(.*\)/\2 \1/' > counted.md5
join -j 1 -o 1.1,2.2,1.2 sorted.md5 counted.md5
This gives you an output consisting of the MD5 checksum, repetition count, and file name. The first sed script could be replaced by awk '{print $1}' if you prefer. The second would be replaced by awk '{printf "%s %s\n", $2, $1}', which is probably clearer (and is shorter). The reason for that futzing around is to get rid of the leading spaces in the output of uniq -c which confuse join.
md5sum $files | sort -k 1,2 > sorted.md5
awk '{print $1}' sorted.md5 | uniq -c | awk '{printf "%s %s\n", $2, $1}' > counted.md5
join -j 1 -o 1.1,2.2,1.2 sorted.md5 counted.md5
I created some files x1.h, x2.h and x3.h by copying dbatools.h, and set files=$(ls *.h). The output was:
0763af91756ef24f3d8f61131eb8f8f2 1 dblbac.h
10215826449a3e0f967a4c436923cffa 1 dbatool.h
37f48869409c2b0554d83bd86034c9bf 4 dbatools.h
37f48869409c2b0554d83bd86034c9bf 4 x1.h
37f48869409c2b0554d83bd86034c9bf 4 x2.h
37f48869409c2b0554d83bd86034c9bf 4 x3.h
5a48695c6b8673373d30f779ccd3a3c2 1 dbxglob.h
7b22f7e2373422864841ae880aad056d 1 dbstringlist.h
a5b8b19715f99c7998c4519cd67f0230 1 dbimglob.h
f9ef785a2340c7903b8e1ae4386df211 1 dbmach11.h
This can be further processed as necessary (for example, with sort -k2,3nr to get the counts in decreasing order, so the deviant files appear last). You have the names of the duplicate files grouped together, along with a count telling you how many copies there are of each. What you do next is up to you.
A real production script would use temporary file names instead of hard-coded names, of course, and would clean up after itself.
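A minimal sketch of what that production version might look like, using mktemp for the scratch files and a trap for cleanup:
sorted=$(mktemp)
counted=$(mktemp)
trap 'rm -f "$sorted" "$counted"' EXIT    # clean up on exit
md5sum $files | sort -k 1,2 > "$sorted"
awk '{print $1}' "$sorted" | uniq -c | awk '{printf "%s %s\n", $2, $1}' > "$counted"
join -j 1 -o 1.1,2.2,1.2 "$sorted" "$counted"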

md5sum FILES > MD5SUMS.md5
cut -c1-32 < MD5SUMS.md5 | sort | uniq -c | sort -n
will return something like this:
1 485fd876eef8e941fcd6fc19643e5e59
1 585fd876eef8e941fcd6fc19643e5e59
5 385fd876eef8e941fcd6fc19643e5e59
Reading: five files have the same checksum; two others have "individual" checksums. I assume that the majority is right, so an additional
| tail -1 | cut -c 9-
returns the checksum of the last line. Now filter everything else (and put the parts together):
md5sum FILES > MD5SUMS.md5
grep -v "$(cut -c1-32 < MD5SUMS.md5 | sort | uniq -c | sort -n | tail -1 | cut -c 9-)" MD5SUMS.md5 | cut -c35-
This will print the filenames of the non-majority files.
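Wrapped up as a script, so automation can key off the exit status (a sketch; it assumes the majority checksum is the correct one, and note that the uniq -c column positions can vary between implementations, as discussed further down this page):
#!/bin/sh
# Exit non-zero and list the deviant files, if any.
md5sum "$@" > MD5SUMS.md5
majority=$(cut -c1-32 < MD5SUMS.md5 | sort | uniq -c | sort -n | tail -1 | cut -c 9-)
bad=$(grep -v "$majority" MD5SUMS.md5 | cut -c35-)
[ -z "$bad" ] || { printf '%s\n' "$bad"; exit 1; }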

Related

Awk or cut how to output the count of one unique column and other column values

Right now I have
grep "\sinstalled" combined_dpkg.log | awk -F ' ' '{print $5}' | sort | uniq -c | sort -rn
grep "\sinstalled" combined_dpkg.log | sort -k1 | awk '!a[$5]++' | cut -d " " -f1,5,6
and I would like to combine the two into one query that includes the count of $5 along with -f1,5,6, if there is such a way to do so, or a way to retain values to be output after the final pipe.
The head -3 result of the first bash command above:
11 man-db:amd64
10 libc-bin:amd64
9 mime-support:all
And of the second bash command:
2015-11-10 linux-headers-4.2.0-18-generic:amd64 4.2.0-18.22
2015-11-10 linux-headers-4.2.0-18:all 4.2.0-18.22
2015-11-10 linux-signed-image-4.2.0-18-generic:amd64 4.2.0-18.22
File format looks like:
2015-11-05 13:23:53 upgrade firefox:amd64 41.0.2+build2-0ubuntu1 42.0+build2-0ubuntu0.15.10.1
2015-11-05 13:23:53 status half-configured firefox:amd64 41.0.2+build2-0ubuntu1
2015-11-05 13:23:53 status unpacked firefox:amd64 41.0.2+build2-0ubuntu1
2015-11-05 13:23:53 status half-installed firefox:amd64 41.0.2+build2-0ubuntu1
grep "\sinstalled" combined_dpkg.log | sort -k1 | awk '!a[$5]' | cut -d " " -f1,5,6 | uniq -c
Based on your comment: "For each package find the earliest (first) version ever installed. Print the package name, the version and the total number of times it was installed."
I guess this awk would do.
awk '$0!~/ installed/{next} !($5 in a){a[$5]=$1 FS $5 FS $6; count[$5]++; next} count[$5]>0 && a[$5]~$6{count[$5]++} END{for (i in a) print a[i],count[i]}' file
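For readability, here is the same one-liner spread over lines with comments (identical logic, just reformatted, with the question's file name filled in):
awk '
  $0 !~ / installed/ { next }                   # keep only " installed" lines
  !($5 in a) {                                  # first (earliest) install of a package
    a[$5] = $1 FS $5 FS $6                      # remember date, package, version
    count[$5]++
    next
  }
  count[$5] > 0 && a[$5] ~ $6 { count[$5]++ }   # later installs whose version matches (as a regex)
  END { for (i in a) print a[i], count[i] }
' combined_dpkg.log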

How to find frequencies of pairs of strings within a Unix terminal

how can I compute the frequency of each line in the following input from within the Unix terminal, and then store the results in a file?
4F8D-AA87-D9EC8805DFDA,3a58538d510c66b98ad7bb3cb9768de08e1ae30b91302add63f7b115
4F8D-AA87-D9EC8805DFDA,3a58538d510c66b98ad7bb3cb9768de08e1ae30b91302add63f7b115
4F8D-AA87-D9EC8805DFDA,3a58538d510c66b98ad7bb3cb9768de08e1ae30b9130dsasdadsadss
49FB-A855-3EED46E0BF2E,3a58538d510c66b98ad7bb3cb9768de08e1ae30b9130dsasdadsadss
Desired output:
4F8D-AA87-D9EC8805DFDA,3a58538d510c66b98ad7bb3cb9768de08e1ae30b91302add63f7b115, 2
4F8D-AA87-D9EC8805DFDA,3a58538d510c66b98ad7bb3cb9768de08e1ae30b9130dsasdadsadss, 1
49FB-A855-3EED46E0BF2E,3a58538d510c66b98ad7bb3cb9768de08e1ae30b9130dsasdadsadss, 1
EDIT:
OK, I think I got it:
cat lol | cut -f 1,2 -d ',' | sort | uniq -c > lol2
My only problem now is that the first column of the output file should, in fact, be at the end, and also that the output file should be CSV compatible. Any ideas?
Would it be a problem to simply count unique lines instead? If not, the uniq command is your friend - see its manpage, but be sure to sort the list first so that all repetitions come one after another:
sort myfile.txt | uniq -c
For your example data, this returns:
2 4F8D-AA87-D9EC8805DFDA,3a58538d510c66b98ad7bb3cb9768de08e1ae30b91302add63f7b115
1 4F8D-AA87-D9EC8805DFDA,3a58538d510c66b98ad7bb3cb9768de08e1ae30b9130dsasdadsadss
1 49FB-A855-3EED46E0BF2E,3a58538d510c66b98ad7bb3cb9768de08e1ae30b9130dsasdadsadss
To redirect into a file, append > outfile.txt:
sort myfile.txt | uniq -c > outfile.txt
If you need an output similar to the one in your question, you can use awk to reorder columns and sed to change delimiters:
sort myfile.txt | uniq -c | awk '{ print $2 " " $1 }' | sed 's/ /,/'
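If you'd rather skip the sort/uniq/awk/sed chain, a single awk sketch (assuming the pairs contain no embedded spaces; output order is arbitrary, so add a final sort if you need one) produces the comma-separated result directly:
# Count each distinct line, then print "pair, count" per line.
awk '{count[$0]++} END {for (line in count) print line ", " count[line]}' myfile.txt > outfile.txt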

Trying to get the file with the most lines to print with its line count

So I've been goofing with this since last night and I can get a lot of things to happen, just not what I want.
I need code to find the file with the most lines in a directory and then print the name of the file and the number of lines that file has.
I can get the entire directory's lines to print but can't seem to narrow the field so to speak.
Any help for a fool of a learner?
wc -l $1/* 2>/dev/null
| grep -v ' total$'
| sort -n -k1
| tail -1l
After some pro help in another question, this is where I got to, but it returns them all, and doesn't print their line counts.
The following awk command should do the job for you, avoiding all the redundant piped commands:
wc -l $1/* | awk '$2 != "total"{if($1>max){max=$1;fn=$2}} END{print max, fn}'
UPDATE: To avoid the last line of wc's output, this might be a better awk command:
wc -l $1/* | awk '{arr[cnt++]=$0} END {for (i=0; i<length(arr)-1; i++)
{split(arr[i], a, " "); if(a[1]>max) {max=a[1]; fn=a[2]}} print max, fn}'
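Another way to avoid wc's trailing total line is to drop it by position rather than by name (a sketch; head -n -1 is a GNU coreutils extension, and wc only prints a total line when given more than one file):
wc -l "$1"/* | head -n -1 | sort -n | tail -1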
you can try:
wc -l $1/* | grep -v total | sort -g | tail -1
actually, to avoid the grep, which would also remove files whose names contain "total":
for f in $1/*; do wc -l $f; done | sort -g | tail -1
or even better, as suggested in comments:
wc -l $1/* | sort -rg | sed -n '2p'
you can even make it a function:
function get_biggest_file() {
wc -l $* | sort -rg | sed -n '2p'
}
% ls -l
... 0 Jun 12 17:33 a
... 0 Jun 12 17:33 b
... 0 Jun 12 17:33 c
... 0 Jun 12 17:33 d
... 25 Jun 12 17:33 total
% get_biggest_file ./*
5 total
EDIT2: using the function I gave, you can simply output what you need as follows:
get_biggest_file $1/* | awk '{print "The file \"" $2 "\" has the maximum number of lines: " $1}'
EDIT: if you tried to write the command as you've written it in the question, you should add a line-continuation character at the end of each line, as follows, or your shell will think you're trying to issue 4 commands:
wc -l $1/* 2>/dev/null \
| grep -v ' total$' \
| sort -n -k1 \
| tail -1l
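An alternative sketch in plain shell, which sidesteps the total line entirely by tracking the maximum in a loop (assumes ordinary, readable files):
max=-1
maxfile=
for f in "$1"/*; do
  n=$(wc -l 2>/dev/null < "$f") || continue   # skip directories and unreadable entries
  if [ "$n" -gt "$max" ]; then
    max=$n
    maxfile=$f
  fi
done
echo "$maxfile has $max lines"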

Awk: Words frequency from one text file, how to output into myFile.txt?

Given a .txt file with space-separated words such as:
But where is Esope the holly Bastard
But where is
And the awk command:
cat /pathway/to/your/file.txt | tr ' ' '\n' | sort | uniq -c | awk '{print $2"#"$1}'
I get the following output in my console:
1 Bastard
1 Esope
1 holly
1 the
2 But
2 is
2 where
How do I get this printed into myFile.txt?
I actually have 300,000 lines and nearly 2 million words, so it's better to output the result to a file.
EDIT: Used answer (by #Sudo_O):
$ awk '{a[$1]++}END{for(k in a)print a[k],k}' RS=" |\n" myfile.txt | sort > myfileout.txt
Your pipeline isn't very efficient; you should do the whole thing in awk instead:
awk '{a[$1]++}END{for(k in a)print a[k],k}' RS=" |\n" file > myfile
If you want the output in sorted order (note that a multi-character RS treated as a regular expression is not POSIX, though GNU awk and mawk support it):
awk '{a[$1]++}END{for(k in a)print a[k],k}' RS=" |\n" file | sort > myfile
The actual output given by your pipeline is:
$ tr ' ' '\n' < file | sort | uniq -c | awk '{print $2"#"$1}'
Bastard#1
But#2
Esope#1
holly#1
is#2
the#1
where#2
Note: using cat is useless here; we can just redirect the input with <. The awk script doesn't make much sense either: it's just reversing the order of the words and their frequencies and separating them with a #. If we drop the awk script, the output is closer to the desired output (notice the leading spaces, however, and that it's unsorted):
$ tr ' ' '\n' < file | sort | uniq -c
1 Bastard
2 But
1 Esope
1 holly
2 is
1 the
2 where
We could sort again and remove the leading spaces with sed:
$ tr ' ' '\n' < file | sort | uniq -c | sort | sed 's/^\s*//'
1 Bastard
1 Esope
1 holly
1 the
2 But
2 is
2 where
But as I mentioned at the start, let awk handle it:
$ awk '{a[$1]++}END{for(k in a)print a[k],k}' RS=" |\n" file | sort
1 Bastard
1 Esope
1 holly
1 the
2 But
2 is
2 where
Just redirect output to a file.
cat /pathway/to/your/file.txt | tr ' ' '\n' | sort | uniq -c | \
awk '{print $2"#"$1}' > myFile.txt
Just use shell redirection:
echo "test" > overwrite-file.txt
echo "test" >> append-to-file.txt
Tips
A useful command is tee, which allows redirecting to a file while still seeing the output:
echo "test" | tee overwrite-file.txt
echo "test" | tee -a append-file.txt
Sorting and locale
I see you are working with Asian script; you need to be careful with the locale used by your system, as the resulting sort order might not be what you expect:
* WARNING * The locale specified by the environment affects sort order. Set LC_ALL=C to get the traditional sort order that uses native byte values.
And have a look at the output of:
locale
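For example, to force the traditional byte-value order for a single pipeline without changing your session's locale (a sketch based on the warning quoted above):
tr ' ' '\n' < /pathway/to/your/file.txt | LC_ALL=C sort | uniq -c > myFile.txt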

Need to remove the count from the output when using "uniq -c" command

I am trying to read a file and sort it by the number of occurrences of a particular field. Suppose I want to find the most repeated date in a log file; then I use the uniq -c option and sort in descending order, something like this:
uniq -c | sort -nr
This will produce some output like this -
809 23/Dec/2008:19:20
The first field, which is actually the count, is the problem for me. I want to get only the date from the above output but am not able to. I tried to use the cut command and did this:
uniq -c | sort -nr | cut -d' ' -f2
but this just prints a blank space. Please can someone help me get the date only and chop off the count? I want only:
23/Dec/2008:19:20
Thanks
The count from uniq is preceded by spaces unless there are more than 7 digits in the count, so you need to do something like:
uniq -c | sort -nr | cut -c 9-
to get columns (character positions) 9 upwards. Or you can use sed:
uniq -c | sort -nr | sed 's/^.\{8\}//'
or:
uniq -c | sort -nr | sed 's/^ *[0-9]* //'
This second option is robust in the face of a repeat count of 10,000,000 or more; if you think that might be a problem, it is probably better than the cut alternative. And there are undoubtedly other options available too.
Caveat: the counts were determined by experimentation on Mac OS X 10.7.3 but using GNU uniq from coreutils 8.3. The BSD uniq -c produced 3 leading spaces before a single digit count. The POSIX spec says the output from uniq -c shall be formatted as if with:
printf("%d %s", repeat_count, line);
which would not have any leading blanks. Given this possible variance in output formats, the sed script with the [0-9] regex is the most reliable way of dealing with the variability in observed and theoretical output from uniq -c:
uniq -c | sort -nr | sed 's/^ *[0-9]* //'
Instead of cut -d' ' -f2, try
awk '{$1="";print}'
Maybe you need to remove one more blank at the beginning:
awk '{$1="";print}' | sed 's/^.//'
or completely with sed, preserving the original whitespace:
sed -r 's/^[^0-9]*[0-9]+ //'
The following awk may help you here.
awk '{a[$0]++} END{for(i in a){print a[i],i | "sort -k2"}}' Input_file
Second solution: in case you want the order of output to be the same as the input rather than sorted.
awk '!a[$0]++{b[++count]=$0} {c[$0]++} END{for(i=1;i<=count;i++){print c[b[i]],b[i]}}' Input_file
an alternative solution is this:
uniq -c | sort -nr | awk '{print $1, $2}'
You may also easily print a single field; use (since you used -f2 in the cut in your question):
cat file |sort |uniq -c | awk '{ print $2; }'
If you want to work with the count field downstream, the following command will reformat it to a 'pipe friendly' tab-delimited format without the left padding:
.. | sort | uniq -c | sed -r 's/^ +([0-9]+) /\1\t/'
For the original task it is a bit of overkill, but after reformatting, cut can be used to remove the field, as the OP intended:
.. | sort | uniq -c | sed -r 's/^ +([0-9]+) /\1\t/' | cut -d $'\t' -f2-
Add tr -s to the pipe chain to "squeeze" multiple spaces into one space delimiter:
uniq -c | tr -s ' ' | cut -d ' ' -f3
tr is very useful in some obscure places. Unfortunately, it doesn't get rid of the first leading space, hence the -f3.
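A small variant sketch that strips the leading blanks first, so a plain -f2 works:
uniq -c | sed 's/^ *//' | cut -d ' ' -f2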
You could make use of sed to strip both the leading spaces and the numbers printed by uniq -c
sort file | uniq -c | sed 's/^ *[0-9]* //'
I'll illustrate this with an example. Consider a file:
winebottles.mkv
winebottles.mov
winebottles.xges
winebottles.xges~
winebottles.mkv
winebottles.mov
winebottles.xges
winebottles.xges~
The command
sort file | uniq -c | sed 's/^ *[0-9]* //'
would return
winebottles.mkv
winebottles.mov
winebottles.xges
winebottles.xges~
first solution
Just use sort when repetitions in the input don't need to be taken into consideration; sort has the unique option -u:
sort -u file
sort -u < file
Ex.:
$ cat > file
a
b
c
a
a
g
d
d
$ sort -u file
a
b
c
d
g
second solution
If sorting based on repetition count is important:
sort txt | uniq -c | sort -k1 -nr | sed 's/^ \+[0-9]\+ //g'
sort txt | uniq -c | sort -k1 -nr | perl -lpe 's/^ +[\d]+ +//g'
which has this output:
a
d
g
c
b
