Compare two files in bash - string

I have two files tmp1.txt and tmp2.txt
tmp1.txt has
aaa.txt
bbb.txt
ccc.txt
ddd.txt
tmp2.txt has
/tmp/test1/aaa.txt
/tmp/test1/aac.txt
/tmp/test2/bbb.txt
/tmp/test1/ccc.txt
I want to check whether each file listed in tmp1.txt exists in tmp2.txt, and if it does, display which directory it is in, so the output looks something like this:
aaa.txt: test1
bbb.txt: test2
ccc.txt: test1
Thanks

Using awk:
awk -F/ 'FNR==NR {a[$1];next} $NF in a {print $NF ": " $(NF-1)}' tmp1.txt tmp2.txt
aaa.txt: test1
bbb.txt: test2
ccc.txt: test1
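For readers newer to awk, the same one-liner can be expanded with comments (the sample files are recreated here from the question):

```shell
# Recreate the sample files from the question.
printf '%s\n' aaa.txt bbb.txt ccc.txt ddd.txt > tmp1.txt
printf '%s\n' /tmp/test1/aaa.txt /tmp/test1/aac.txt \
              /tmp/test2/bbb.txt /tmp/test1/ccc.txt > tmp2.txt

awk -F/ '
    FNR == NR { want[$1]; next }   # 1st file: remember each bare name
    $NF in want {                  # later lines: $NF is the basename,
        print $NF ": " $(NF-1)     # $(NF-1) is the containing directory
    }
' tmp1.txt tmp2.txt
```

Because the lines of tmp1.txt contain no "/", splitting on "/" leaves each whole line in $1 there, while for tmp2.txt the last two fields are the directory and the file name.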

I would like to propose a solution using the standard tools diff and basename:
while read filename
do
basename "$filename"
done < tmp2.txt > tmp2.basenames.txt
diff -u tmp1.txt tmp2.basenames.txt
The main advantage of this solution is its simplicity. The output looks a little different, though: lines only in tmp1.txt are marked with -, lines only in tmp2.basenames.txt with +, and lines present in both with neither:
--- tmp1.txt 2014-09-17 17:09:43.000000000 +0200
+++ tmp2.basenames.txt 2014-09-17 17:13:12.000000000 +0200
@@ -1,4 +1,4 @@
aaa.txt
+aac.txt
bbb.txt
ccc.txt
-ddd.txt

Bash solution (the grep test goes in the loop body, so a name missing from tmp2.txt doesn't end the loop early; the outer basename strips /tmp from dirname's output):
#!/bin/bash
while read -r file; do
    if a=$(grep -Fw "$file" tmp2.txt); then
        echo "$(basename "$a"): $(basename "$(dirname "$a")")"
    fi
done < tmp1.txt

If you don't want to use awk, here's a little bash loop:
while read -r f; do
    isFound=$(grep "/$f" tmp2.txt 2>/dev/null)
    if [ -n "$isFound" ]; then
        theDir=$(echo "$isFound" | cut -d'/' -f3)
        echo "$f: $theDir"
    fi
done < tmp1.txt

Related

How to return string count in multiple files in Linux

I have multiple xml files and I want to count occurrences of a string in them.
How do I get the count, together with the file names, in Linux?
The string I want to count is InvoıceNo.
The result should be:
test.xml InvoiceCount:2
test1.xml InvoiceCount:5
test2.xml InvoiceCount:10
You can probably use the following code
PATTERN=InvoiceNo
for file in *.xml
do
count=$(grep -o "$PATTERN" "$file" | wc -l)
echo "$file" InvoiceCount:$count
done
Output
test.xml InvoiceCount:1
test1.xml InvoiceCount:2
test2.xml InvoiceCount:3
Referred from: https://unix.stackexchange.com/questions/6979/count-total-number-of-occurrences-using-grep
The following awk may also help; since you haven't shown any sample input, it is not tested. Note that it counts matching lines, not occurrences, and the END block is needed so the last file is printed too:
awk 'FNR==1{if(count){print value, "InvoiceCount:", count; count=0}; value=FILENAME} /InvoiceNo/{count++} END{if(count) print value, "InvoiceCount:", count}' *.xml
Use grep -c to get the count of matching lines (note that it counts lines, not occurrences):
for file in *.xml ; do
    count=$(grep -c "$PATTERN" "$file")
    if [ "$count" -gt 0 ]; then
        echo "$file $PATTERN: $count"
    fi
done
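The difference between the two grep approaches matters when a line contains the pattern more than once; a quick illustration with a made-up demo.xml:

```shell
# grep -c counts matching *lines*; grep -o | wc -l counts *occurrences*.
printf 'InvoiceNo InvoiceNo\nInvoiceNo\n' > demo.xml

grep -c InvoiceNo demo.xml          # matching lines: 2
grep -o InvoiceNo demo.xml | wc -l  # occurrences:    3
```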
First the test files:
$ cat foo.xml
InvoiceCount InvoiceCount
InvoiceCount
$ cat bar.xml
InvoiceCount
GNU awk, using gsub for counting (ENDFILE is a gawk extension):
$ awk '{
c+=gsub(/InvoiceCount/,"InvoiceCount")
}
ENDFILE {
print FILENAME, "InvoiceCount: " c
c=0
}' foo.xml bar.xml
foo.xml InvoiceCount: 3
bar.xml InvoiceCount: 1
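Since ENDFILE is gawk-only, a portable variant for any POSIX awk prints when the filename changes and once more at END (same test files as above):

```shell
printf 'InvoiceCount InvoiceCount\nInvoiceCount\n' > foo.xml
printf 'InvoiceCount\n' > bar.xml

# Print the running count whenever a new file starts, and at END
# for the last file; gsub returns the number of substitutions made.
awk '
    FNR == 1 && NR > 1 { print prev, "InvoiceCount: " c; c = 0 }
    { prev = FILENAME; c += gsub(/InvoiceCount/, "&") }
    END { if (NR) print prev, "InvoiceCount: " c }
' foo.xml bar.xml
# foo.xml InvoiceCount: 3
# bar.xml InvoiceCount: 1
```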
A little shell script will do what you want:
#!/bin/bash
for file in "$@"
do
    awk '{count+=gsub(/InvoıceNo/,"")}
    END {print FILENAME, "InvoiceCount:" count}' "$file"
done
Put the code in a file (e.g. counter.sh) and run it like this:
./counter.sh test.xml test1.xml test2.xml

Add a random number at the end of each line of file

I want to append a different random number to the end of each line of a file. I have to repeat the process a few times; each file contains about 20k lines, and each line contains about 500k characters.
The only solution I came up with so far is
file="example.txt"
for lineIndex in $(seq 1 "$(wc -l ${file})")
do
lineContent=$(sed "${lineIndex}q;d" ${file})
echo "${lineContent} $RANDOM" >> tmp.txt
done
mv tmp.txt ${file}
Is there a faster solution?
You can do this much more simply, without reopening the input and output files or spawning new processes for every line:
while IFS= read -r line
do
    echo "$line $RANDOM"
done < "$file" > tmp.txt
You could use awk:
awk '{ print $0, int(32768 * rand()) }' "$file" > tmp && \
mv tmp "$file"
Using awk:
awk -v seed=$RANDOM 'BEGIN{srand(seed)} {print $0, int(rand() * 10^5+1)}' file
If you have gnu awk then you can use inplace saving of file:
awk -i inplace -v seed=$RANDOM 'BEGIN{srand(seed)} {print $0, int(rand() * 10^5+1)}' file
Your script could be rewritten as:
file="example.txt"
cat "${file}" | while IFS= read -r line; do
    echo "${line} $RANDOM"
done > tmp.txt
mv tmp.txt "${file}"

Comparing two files script and finding the unmatched data

I have two .txt files with data stored in this format:
1.txt
ASF001-AS-ST73U12
ASF001-AS-ST92U14
ASF001-AS-ST105U33
ASF001-AS-ST107U20
and
2.txt
ASF001-AS-ST121U21
ASF001-AS-ST130U14
ASF001-AS-ST73U12
ASF001-AS-ST92U14
I need to find the files which are in 1.txt but not in 2.txt.
I tried to use
diff -a --suppress-common-lines -y 1.txt 2.txt > finaloutput
but it didn't work
Rather than diff you can use comm here:
comm -23 <(sort 1.txt) <(sort 2.txt)
ASF001-AS-ST105U33
ASF001-AS-ST107U20
Or this awk will also work:
awk 'FNR==NR {a[$1];next} $1 in a{delete a[$1]} END {for (i in a) print i}' 1.txt 2.txt
ASF001-AS-ST107U20
ASF001-AS-ST105U33
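grep can also do this directly: -f reads patterns from a file, -F treats them as fixed strings, -x requires whole-line matches, and -v inverts the match (sample files recreated from the question):

```shell
printf '%s\n' ASF001-AS-ST73U12 ASF001-AS-ST92U14 \
              ASF001-AS-ST105U33 ASF001-AS-ST107U20 > 1.txt
printf '%s\n' ASF001-AS-ST121U21 ASF001-AS-ST130U14 \
              ASF001-AS-ST73U12 ASF001-AS-ST92U14 > 2.txt

# Lines of 1.txt that do not appear as a whole line anywhere in 2.txt;
# unlike comm, the inputs do not need to be sorted.
grep -Fxvf 2.txt 1.txt
```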
A relatively simple bash script can do what you need:
#!/bin/bash
while read -r line || test -n "$line"; do
    grep -qFx "$line" "$2" || echo "$line"
done < "$1"
exit 0
output:
$ ./uniquef12.sh dat/1.txt dat/2.txt
ASF001-AS-ST105U33
ASF001-AS-ST107U20

splitting folder names with awk in a directory

There are some directories in the working directory with this template
cas-2-32
sat-4-64
...
I want to loop over the directory names and grab the second and third parts of the folder names. I have written this script; the body shows what I want to do, but the awk command seems to be wrong:
#!/bin/bash
for file in `ls`; do
if [ -d $file ]; then
arg2=`awk -F "-" '{print $2}' $file`
echo $arg2
arg3=`awk -F "-" '{print $3}' $file`
echo $arg3
fi
done
but it says
awk: cmd. line:1: fatal: cannot open file `cas-2-32' for reading (Invalid argument)
awk expects a filename as input. Since you have said cas-2-32 etc. are directories, awk fails when it tries to open them.
Feed the directory names to awk using echo instead:
#!/bin/bash
for file in `ls`; do
if [ -d $file ]; then
arg2=$(echo $file | awk -F "-" '{print $2}')
echo $arg2
arg3=$(echo $file | awk -F "-" '{print $3}')
echo $arg3
fi
done
A simple command: ls | awk -F- '{ print $2" "$3 }'
(Note that setting FS inside the action, as in awk '{ FS="-"; ... }', only takes effect from the second input line onward, so pass it with -F instead.) If you want the values on separate lines, print "\n" instead of a space.
When executed like this
awk -F "-" '{print $2}' $file
awk treats $file's value as the file to be parsed, instead of parsing $file's value itself.
The minimal fix is to use a here-string which can feed the value of a variable into stdin of a command:
awk -F "-" '{print $2}' <<< "$file"
By the way, you don't need ls if you merely want a list of files in current directory, use * instead, i.e.
for file in *; do
One way:
#!/bin/bash
for file in *; do
if [ -d "$file" ]; then
tmp="${file#*-}"
arg2="${tmp%-*}"
arg3="${tmp#*-}"
echo "$arg2"
echo "$arg3"
fi
done
The other:
#!/bin/bash
IFS="-"
for file in *; do
if [ -d "$file" ]; then
set -- $file
arg2="$2"
arg3="$3"
echo "$arg2"
echo "$arg3"
fi
done
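A third pure-bash way is to let read do the splitting via a here-string, scoping IFS to the read itself (sample directory names taken from the question):

```shell
#!/bin/bash
for file in cas-2-32 sat-4-64; do
    # IFS=- applies only to this read; _ discards the first field.
    IFS=- read -r _ arg2 arg3 <<< "$file"
    echo "$arg2"
    echo "$arg3"
done
```

Unlike setting IFS globally, this leaves word splitting elsewhere in the script untouched.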

Bash loop to compare files

I'm obviously missing something simple, and I know the problem is that it's creating a blank output, which is why the comparison fails. However, if someone could shed some light on this it would be great; I haven't isolated it.
Ultimately, I'm trying to compare the md5sum from a list stored in a txt file, to that stored on the server. If errors, I need it to report that. Here's the output:
root#vps [~/testinggrounds]# cat md5.txt | while read a b; do
> md5sum "$b" | read c d
> if [ "$a" != "$c" ] ; then
> echo "md5 of file $b does not match"
> fi
> done
md5 of file file1 does not match
md5 of file file2 does not match
root#vps [~/testinggrounds]# md5sum file*
2a53da1a6fbfc0bafdd96b0a2ea29515 file1
bcb35cddc47f3df844ff26e9e2167c96 file2
root#vps [~/testinggrounds]# cat md5.txt
2a53da1a6fbfc0bafdd96b0a2ea29515 file1
bcb35cddc47f3df844ff26e9e2167c96 file2
Not directly answering your question, but md5sum(1):
-c, --check
read MD5 sums from the FILEs and check them
Like:
$ ls
1.txt 2.txt md5.txt
$ cat md5.txt
d3b07384d113edec49eaa6238ad5ff00 1.txt
c157a79031e1c40f85931829bc5fc552 2.txt
$ md5sum -c md5.txt
1.txt: OK
2.txt: OK
The problem that you are having is that your inner read is executed in a subshell. In bash, a subshell is created when you pipe a command. Once the subshell exits, the variables $c and $d are gone. You can use process substitution to avoid the subshell:
while read -r -u3 sum filename; do
read -r cursum _ < <(md5sum "$filename")
if [[ $sum != $cursum ]]; then
printf 'md5 of file %s does not match\n' "$filename"
fi
done 3<md5.txt
The redirection 3<md5.txt causes the file to be opened as file descriptor 3. The -u 3 option to read causes it to read from that file descriptor. The inner read still reads from stdin.
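The subshell effect is easy to demonstrate in bash:

```shell
#!/bin/bash
echo hello | read c          # read runs in a pipeline subshell
echo "c is: '$c'"            # prints: c is: ''

read c < <(echo hello)       # process substitution: read runs in this shell
echo "c is: '$c'"            # prints: c is: 'hello'
```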
I'm not going to argue; I simply try to avoid a second read inside the loop:
#! /bin/bash
cat md5.txt | while read -r sum file
do
    prev_sum=$(md5sum "$file" | awk '{print $1}')
    if [ "$sum" != "$prev_sum" ]
    then
        echo "md5 of file $file does not match"
    else
        echo "$file is fine"
    fi
done
