MD5 comparison between two text files - linux

I just started learning Linux shell scripting. I have to compare these two files in a shell script for version control. Example:
file1.txt
275caa62391ff4f3096b1e8a4975de40 apple
awd6s54g64h6se4h6se45wahae654j6 ball
e4rby1s6y4653a46h153a41bqwa54tvi cat
r53aghe4354hr35a4hr65a46eeh5j45ro castor
file2.txt
275caa62391ff4f3096b1e8a4975de40 apple
js65fg4a64zgr65f4w65ea465fa65gh7 ball
wroghah4a65ejdtse5z4g6sa7H658aw7 candle
wagjh54hr5ae454zrwrh354aha4564re castor
How do I sort these text files into newly added (present in file 2 but not in file 1), deleted (present in file 1 but not in file 2) and changed files (same name but different checksum)?
I tried using diff, bcompare and vimdiff, but I am not getting proper output as a text file.
Thanks in advance.

I don't know if such a command exists, but I've taken the liberty of writing a sorting mechanism in Bash. Although I've tried to optimise it, I suggest you recreate it in a language of your own choice.
#! /bin/bash

# Set the word-splitting delimiter to a newline
IFS=$'\n'

# If $1 is empty, default to 'file1.txt'. Same for $2.
FILE1=${1:-file1.txt}
FILE2=${2:-file2.txt}

DELETED=()
ADDED=()
CHANGED=()

# Loop over the array whose name is passed as $1 and print its content
function array_print {
    # -n creates a "pointer" (nameref) to an array. This
    # way you can pass large arrays to functions.
    local -n array=$1
    echo "$1: "
    for i in "${array[@]}"; do
        echo "$i"
    done
}

# This function loops over the entries in file_in and checks
# whether they exist in file_tst. Unless an identical line is
# found, a callback is executed.
function array_sort {
    local file_in="$1"
    local file_tst="$2"
    local callback=${3:-true}
    local -n arr0=$4
    local -n arr1=$5

    while read -r line; do
        tst_hash=$(grep -Eo '^[^ ]+' <<< "$line")
        tst_name=$(grep -Eo '[^ ]+$' <<< "$line")
        hit=$(grep "$tst_name" "$file_tst")
        # If the identical line is found, skip it. Nothing has changed.
        [[ $hit != "$line" ]] || continue
        # Run the callback
        $callback "$hit" "$line" arr0 arr1
    done < "$file_in"
}
# If tst is empty, line will be added to not_found. For file 1 this
# means that file doesn't exist in file2, thus is deleted. Otherwise
# the file is changed.
function callback_file1 {
    local tst=$1
    local line=$2
    local -n not_found=$3
    local -n found=$4

    if [[ -z $tst ]]; then
        not_found+=("$line")
    else
        found+=("$line")
    fi
}

# If tst is empty, line will be added to not_found. For file 2 this
# means that the file doesn't exist in file1, thus it was added. Since the
# callback for file 1 already collected all the changed files, we do
# nothing with the fourth parameter.
function callback_file2 {
    local tst=$1
    local line=$2
    local -n not_found=$3

    if [[ -z $tst ]]; then
        not_found+=("$line")
    fi
}

array_sort "$FILE1" "$FILE2" callback_file1 DELETED CHANGED
array_sort "$FILE2" "$FILE1" callback_file2 ADDED CHANGED

array_print ADDED
array_print DELETED
array_print CHANGED

exit 0
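With the two example files from the question, running the script should print something along these lines (ADDED and DELETED each hold one entry, and CHANGED holds the file1 versions of the entries whose checksum differs):
ADDED:
wroghah4a65ejdtse5z4g6sa7H658aw7 candle
DELETED:
e4rby1s6y4653a46h153a41bqwa54tvi cat
CHANGED:
awd6s54g64h6se4h6se45wahae654j6 ball
r53aghe4354hr35a4hr65a46eeh5j45ro castor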
Since it might be hard to follow the code above, here is the same logic written out without the functions. I hope it helps :-)
while read -r line; do
    tst_hash=$(grep -Eo '^[^ ]+' <<< "$line")
    tst_name=$(grep -Eo '[^ ]+$' <<< "$line")
    hit=$(grep "$tst_name" "$FILE2")
    # If the identical line is found, skip it. Nothing has changed.
    [[ $hit != "$line" ]] || continue
    # If the name does not occur, the file was deleted (it exists in
    # file1, but not in file2)
    if [[ -z $hit ]]; then
        DELETED+=("$line")
    else
        # If the name does occur, the file was changed. Otherwise we would
        # not get here, because of the previous check.
        CHANGED+=("$line")
    fi
done < "$FILE1"

while read -r line; do
    tst_hash=$(grep -Eo '^[^ ]+' <<< "$line")
    tst_name=$(grep -Eo '[^ ]+$' <<< "$line")
    hit=$(grep "$tst_name" "$FILE1")
    # If the identical line is found, skip it. Nothing has changed.
    [[ $hit != "$line" ]] || continue
    # If the name does not occur, the file was added (it exists in
    # file2, but not in file1)
    if [[ -z $hit ]]; then
        ADDED+=("$line")
    fi
done < "$FILE2"

Files which are only in file1.txt:
awk 'NR==FNR{a[$2];next} !($2 in a)' file2.txt file1.txt > only_in_file1.txt
Files which are only in file2.txt:
awk 'NR==FNR{a[$2];next} !($2 in a)' file1.txt file2.txt > only_in_file2.txt
Then something like this answer:
awk compare columns from two files, impute values of another column
e.g.:
awk 'FNR==NR{a[$1]=$1;next}{print $0,a[$1]?a[$2]:"NA"}' file2.txt file1.txt | grep NA | awk '{print $1,$2}' > md5sdiffer.txt
You'll need to decide how you want to present these, though.
There might be a more elegant way to handle the final example (as opposed to finding the lines marked NA and then re-filtering), but it should be enough to go on.
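For what it's worth, the three lists could also be produced with three plain awk passes (a minimal sketch along the same lines as above; the output names added.txt, deleted.txt and changed.txt are just examples):
# added: name present in file2.txt but not in file1.txt
awk 'NR==FNR{seen[$2];next} !($2 in seen)' file1.txt file2.txt > added.txt
# deleted: name present in file1.txt but not in file2.txt
awk 'NR==FNR{seen[$2];next} !($2 in seen)' file2.txt file1.txt > deleted.txt
# changed: name present in both files, but with a different checksum
awk 'NR==FNR{sum[$2]=$1;next} ($2 in sum) && sum[$2]!=$1' file1.txt file2.txt > changed.txt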

Related

Replace filename to a string of the first line in multiple files in bash

I have multiple fasta files, where the first line always contains a > with multiple words, for example:
File_1.fasta:
>KY620313.1 Hepatitis C virus isolate sP171215 polyprotein gene, complete cds
File_2.fasta:
>KY620314.1 Hepatitis C virus isolate sP131957 polyprotein gene, complete cds
File_3.fasta:
>KY620315.1 Hepatitis C virus isolate sP127952 polyprotein gene, complete cds
I would like to take the word starting with sP* from each file and rename each file to this string (for example: File_1.fasta to sP171215.fasta).
So far I have this:
$ for match in "$(grep -ro '>')";do
fname=$("echo $match|awk '{print $6}'")
echo mv "$match" "$fname"
done
But it doesn't work, I always get the error:
grep: warning: recursive search of stdin
I hope you can help me!
You can use something like this:
grep '>' *.fasta | while read -r line ; do
    new_name="$(echo "$line" | cut -d' ' -f 6)"
    old_name="$(echo "$line" | cut -d':' -f 1)"
    mv "$old_name" "$new_name.fasta"
done
It searches the *.fasta files and handles every matching line:
it splits each grep result on spaces and takes the 6th field as the new name
it splits each grep result on : and takes the first field as the old filename
it moves/renames the old filename to the new filename
There are several things going on with this code.
For a start, I don't actually get this particular error, which might be down to different versions.
It might come down to grep interpreting '>' the same as an unquoted > if the expansion goes wrong, so I would suggest maybe going for "\>".
Secondly:
fname=$("echo $match|awk '{print $6}'")
The quotes inside serve an unintended purpose (they turn the whole pipeline into a single command name). If anything, your code should look like this:
fname="$(echo $match|awk '{print $6}')"
Lastly, to properly retrieve your data, something like this should be your final code:
grep -Hr '^>' . | while read -r match; do
    fname="$(echo "$match" | cut -d: -f1)"
    new_fname="$(echo "$match" | grep -o "sP[^ ]*")".fasta
    echo mv "$fname" "$new_fname"
done
Explanations:
grep -H -> you want your grep to explicitly use "Include Filename", just in case other shell environments decide to alias grep to grep -h (no filenames)
you don't want to be doing grep -o on your file search, as you want to have both the filename and the "new filename" in one data entry.
Although, I don't see why you would search for '>' rather than directly for 'sP', as such:
grep -Hro "sP[0-9]*" .
This is not the exact same behaviour, and has different edge cases, but it just might work for you.
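A minimal sketch of that idea, reading grep's output as filename:match pairs (this assumes GNU grep, identifiers of the form sP followed by digits that occur once per file, and filenames without ':' or newlines):
grep -Hro 'sP[0-9]*' --include='*.fasta' . | while IFS=: read -r fname id; do
    # fname is the file grep matched in, id is the sP identifier it found
    echo mv "$fname" "${id}.fasta"   # drop the echo once the output looks right
done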
Quite straightforward in (g)awk:
create a file "script.awk":
FNR == 1 {
    for (i=1; i<=NF; i++) {
        if (index($i, "sP")==1) {
            print "mv", FILENAME, $i ".fasta"
            nextfile
        }
    }
}
Use it:
awk -f script.awk *.fasta > cmmd.txt
Check the content of the output:
mv File_1.fasta sP171215.fasta
mv File_2.fasta sP131957.fasta
If it looks OK, execute the renames by sourcing the file: . cmmd.txt
For all fasta files in the directory, search the first line for the first word starting with sP and rename the file using that word as the basename.
Using a bash array:
for f in *.fasta; do
    arr=( $(head -1 "$f") )
    for word in "${arr[@]}"; do
        [[ "$word" =~ ^sP ]] && echo mv "$f" "${word}.fasta" && break
    done
done
or using grep:
for f in *.fasta; do
    word=$(head -1 "$f" | grep -o "\bsP\w*")
    [ -z "$word" ] || echo mv "$f" "${word}.fasta"
done
Note: remove echo after you are ok with testing.

How to clean up multiple file names using bash?

I have a directory with ~250 .txt files in it. Each of these files has a name like this:
Abraham Lincoln [December 01, 1862].txt
George Washington [October 25, 1790].txt
etc...
However, these are terrible file names for reading into python and I want to iterate over all of them to change them to a more suitable format.
I've tried similar things for changing single variables that are shared across many files. But I can't wrap my head around how I should iterate over these files and change the formatting of their names while still keeping the same information.
The ideal output would be something like
1862_12_01_abraham_lincoln.txt
1790_10_25_george_washington.txt
etc...
Please try the straightforward (tedious) bash script:
#!/bin/bash
declare -A map=(["January"]="01" ["February"]="02" ["March"]="03" ["April"]="04" ["May"]="05" ["June"]="06" ["July"]="07" ["August"]="08" ["September"]="09" ["October"]="10" ["November"]="11" ["December"]="12")
pat='^([^[]+) \[([A-Za-z]+) ([0-9]+), ([0-9]+)]\.txt$'
for i in *.txt; do
    if [[ $i =~ $pat ]]; then
        newname="$(printf "%s_%s_%s_%s.txt" "${BASH_REMATCH[4]}" "${map["${BASH_REMATCH[2]}"]}" "${BASH_REMATCH[3]}" "$(tr 'A-Z ' 'a-z_' <<< "${BASH_REMATCH[1]}")")"
        mv -- "$i" "$newname"
    fi
done
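With the sample names from the question, this should rename, for example, "Abraham Lincoln [December 01, 1862].txt" to "1862_12_01_abraham_lincoln.txt".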
for file in *.txt; do
    # extract the parts of the filename to be formatted differently, using a regex match
    [[ $file =~ (.*)\[(.*)\] ]] || { echo "invalid file $file"; exit; }
    # format the extracted strings and generate the new filename
    formatted_date=$(date -d "${BASH_REMATCH[2]}" +"%Y_%m_%d")
    name="${BASH_REMATCH[1]// /_}"   # replace spaces in the name with underscores
    f="${formatted_date}_${name,,}"  # convert the name to lower case and append it to the date string
    new_filename="${f::-1}.txt"      # remove the trailing underscore and add the .txt extension
    # do what you need here
    echo "$new_filename"
    # mv "$file" "$new_filename"
done
I like to pull the filename apart, then put it back together.
Also, GNU date can parse the date portion for us, which is simpler than using sed or a big case statement to convert "October" to "10".
#! /usr/bin/bash
if [ "$1" == "" ] || [ "$1" == "--help" ]; then
echo "Give a filename like \"Abraham Lincoln [December 01, 1862].txt\" as an argument"
exit 2
fi
filename="$1"
# remove the brackets
filename=`echo "$filename" | sed -e 's/[\[]//g;s/\]//g'`
# cut out the name
namepart=`echo "$filename" | awk '{ print $1" "$2 }'`
# cut out the date
datepart=`echo "$filename" | awk '{ print $3" "$4" "$5 }' | sed -e 's/\.txt//'`
# format up the date (relies on GNU date)
datepart=`date --date="$datepart" +"%Y_%m_%d"`
# put it back together with underscores, in lower case
final=`echo "$namepart $datepart.txt" | tr '[A-Z]' '[a-z]' | sed -e 's/ /_/g'`
echo mv \"$1\" \"$final\"
EDIT: converted to BASH, from Bourne shell.
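For example, saving this as rename_one.sh (the name is only for illustration) and running
./rename_one.sh "Abraham Lincoln [December 01, 1862].txt"
should print something like
mv "Abraham Lincoln [December 01, 1862].txt" "abraham_lincoln_1862_12_01.txt"
Note that this puts the name before the date; swap $namepart and $datepart in the line that builds $final if you want the date-first format shown in the question.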

Find file with largest number of lines in single directory

I'm trying to create a function that only outputs the file with the largest number of lines in a directory (and not any sub-directories). I'm being asked to make use of the wc function but don't really understand how to read each file individually and then sort them just to find the largest. Here is what I have so far:
#!/bin/bash
function sort {
    [ $# -ne 1 ] && echo "Invalid number of arguments">&2 && exit 1;
    [ ! -d "$1" ] && echo "Invalid input: not a directory">&2 && exit 1;
    # Insert function here ;
}
# prompt if wanting current directory
# if yes
# sort $PWD
# if no
#sort $directory
This solution is almost pure Bash (wc is the only external command used):
shopt -s dotglob # Include filenames with initial '.' in globs
shopt -s nullglob # Make globs produce nothing when nothing matches
dir=$1
maxlines=-1
maxfile=
for file in "$dir"/* ; do
[[ -f $file ]] || continue # Skip non-files
[[ -L $file ]] && continue # Skip symlinks
numlines=$(wc -l < "$file")
if (( numlines > maxlines )) ; then
maxfile=$file
maxlines=$numlines
fi
done
[[ -n "$maxfile" ]] && printf '%s\n' "$maxfile"
Remove the shopt -s dotglob if you don't want to process files whose names begin with a dot. Remove the [[ -L $file ]] && continue if you want to process symlinks to files.
This solution should handle all filenames (ones containing spaces, ones containing glob characters, ones beginning with '-', ones containing newlines, ...), but it runs wc for each file so it may be unacceptably slow compared to solutions that feed many files to wc at once if you need to handle directories that have large numbers of files.
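For reference, a rough sketch of the "feed many files to wc at once" approach (assuming GNU findutils/coreutils, and filenames that contain no newlines and don't themselves end in " total"):
# find hands the regular files to wc in large batches; wc prints one
# "total" line per batch, which is filtered out before sorting.
find "$dir" -maxdepth 1 -type f -exec wc -l {} + |
    grep -v ' total$' |
    sort -n |
    tail -n 1
This prints the line count along with the filename; strip the count with awk or sed if you only want the name.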
How about this:
wc -l * | sort -nr | head -2 | tail -1
wc -l counts the lines (you get an error for directories, though), then sort -nr sorts in reverse order treating the first column as a number, then head -2 takes the first two lines and tail -1 keeps only the second of them, since the first is the "total" line we need to skip.
wc -l * 2>/dev/null | sort -nr | head -2 | tail -1
The 2>/dev/null throws away all the errors, if you want a neater output.
Use a function like this:
my_custom_sort() {
    for i in "${1+$1/}"*; do
        [[ -f "$i" ]] && wc -l "$i"
    done | sort -n | tail -n1 | cut -d" " -f2
}
And use it with or without a directory argument (in the latter case, it uses the current directory):
my_custom_sort /tmp
helloworld.txt

Finding index for new folder

I am given a name and I am supposed to make a dir with this name. If this dir already exists, the name of the folder should have _$number as its suffix.
Number is calculated as highest value + 1. Examples:
Name:awesome
Files: dummy awesome awesome_2 awesome_4 dummy_3
New folder: awesome_5
Name:awesome
Files: dummy dummy_1
New folder: awesome
My solution for finding the highest value works only for names without special characters. Should the name be, for example, "$#&*!(#)(%+#$ asdasd \ ^ sad", it fails.
function max_item() {
    local prefix="$1"
    local max="0"
    shopt -s nullglob
    for in_file in * ; do
        if [[ "$in_file" =~ ^"$prefix"_(-{0,1}[0-9][0-9]*)$ ]]; then
            num="${BASH_REMATCH[1]}";
            [[ "$max" -lt "$num" ]] && max="$num";
        fi
    done
    echo "$max"
    shopt -u nullglob
    return 0
}
I guess it has something to do with special characters in regex but I have exhausted all my ideas.
Since you are looking for a number at the end of the name, prefixed by an _, you could do this instead:
max=0
number='^[[:digit:]]+$'
for in_file in "${prefix}_"* ; do
    num="${in_file##*_}"
    [[ "$num" =~ $number ]] && [[ "$max" -lt "$num" ]] && max="$num"
done
num=$((max + 1))
I have incorporated @Jens's excellent suggestion to loop over just the matching files.
Looping in shell code is notoriously slow.
For small numbers, codeforester's solution is fine, but starting at around 30 items (the exact number depends on many factors), the external-utility-based solution below will be faster and scale much better.
(For fewer items, an external-utility solution is slower, but that will rarely matter).
The solution below has the added advantage of being more concise:
max_index() {
    printf '%d\n' "$(shopt -s nullglob;
                     printf '%s\n' "$1_"* |
                     awk -F_ '{print $NF}' |
                     sort -rn | head -n 1)"
}
Note: The reasonable assumption is made that your filenames have no embedded newlines.
shopt -s nullglob ensures that if a globbing pattern ("$1_"* in this case) matches nothing, it expands to the null (empty) string.
printf '%s\n' "$1_"* prints all matching filesystem items line by line.
awk -F_ '{print $NF}' outputs the last _-based token on each line, i.e., the trailing number.
Note: cut -d_ -f2 would work too, but makes the assumption that only one _ is present in the filename.
sort -rn sorts the trailing numbers numerically (-n), in reverse (-r).
head -n 1 then extracts only the 1st output line, which is by definition the highest number (if any).
Note that printf '%d\n' '' outputs 0, which is effectively what happens if no existing _<number> suffixes are found.
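To tie this back to the original task, here is a sketch of how max_index could be used to actually create the folder (make_next_dir is an illustrative name of mine, not part of the answer):
make_next_dir() {
    local name=$1
    if [[ -e $name ]]; then
        # name is taken: append the highest existing suffix + 1
        mkdir -- "${name}_$(( $(max_index "$name") + 1 ))"
    else
        mkdir -- "$name"
    fi
}
make_next_dir 'awesome'   # creates awesome, or awesome_<highest+1> if awesome already exists
Because the glob in max_index quotes "$1", names containing special characters are matched literally, which was the original sticking point.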

Bash loop to compare files

I'm obviously missing something simple, and I know the problem is that it's producing blank output, which is why the comparison fails. However, if someone could shed some light on this it would be great - I haven't isolated it.
Ultimately, I'm trying to compare the md5sums from a list stored in a txt file to those of the files stored on the server. If there are errors, I need it to report that. Here's the output:
root@vps [~/testinggrounds]# cat md5.txt | while read a b; do
> md5sum "$b" | read c d
> if [ "$a" != "$c" ] ; then
> echo "md5 of file $b does not match"
> fi
> done
md5 of file file1 does not match
md5 of file file2 does not match
root@vps [~/testinggrounds]# md5sum file*
2a53da1a6fbfc0bafdd96b0a2ea29515 file1
bcb35cddc47f3df844ff26e9e2167c96 file2
root@vps [~/testinggrounds]# cat md5.txt
2a53da1a6fbfc0bafdd96b0a2ea29515 file1
bcb35cddc47f3df844ff26e9e2167c96 file2
Not directly answering your question, but md5sum(1):
-c, --check
read MD5 sums from the FILEs and check them
Like:
$ ls
1.txt 2.txt md5.txt
$ cat md5.txt
d3b07384d113edec49eaa6238ad5ff00 1.txt
c157a79031e1c40f85931829bc5fc552 2.txt
$ md5sum -c md5.txt
1.txt: OK
2.txt: OK
The problem that you are having is that your inner read is executed in a subshell. In bash, a subshell is created when you pipe a command. Once the subshell exits, the variables $c and $d are gone. You can use process substitution to avoid the subshell:
while read -r -u3 sum filename; do
    read -r cursum _ < <(md5sum "$filename")
    if [[ $sum != $cursum ]]; then
        printf 'md5 of file %s does not match\n' "$filename"
    fi
done 3<md5.txt
The redirection 3<md5.txt causes the file to be opened as file descriptor 3. The -u 3 option to read causes it to read from that file descriptor. The inner read still reads from stdin.
I'm not going to argue with that; I simply try to avoid a second read inside a loop.
#! /bin/bash

cat md5.txt | while read -r sum file
do
    file_sum=$(md5sum "$file" | awk '{print $1}')
    if [ "$sum" != "$file_sum" ]
    then
        echo "md5 of file $file does not match"
    else
        echo "$file is fine"
    fi
done
