I have two .txt files with data stored in the following format:
1.txt
ASF001-AS-ST73U12
ASF001-AS-ST92U14
ASF001-AS-ST105U33
ASF001-AS-ST107U20
and
2.txt
ASF001-AS-ST121U21
ASF001-AS-ST130U14
ASF001-AS-ST73U12
ASF001-AS-ST92U14
I need to find the lines which are in 1.txt but not in 2.txt.
I tried to use
diff -a --suppress-common-lines -y 1.txt 2.txt > finaloutput
but it didn't work.
Rather than diff you can use comm here (-2 suppresses lines unique to 2.txt, -3 suppresses lines common to both, and the process substitutions provide the sorted input comm requires):
comm -23 <(sort 1.txt) <(sort 2.txt)
ASF001-AS-ST105U33
ASF001-AS-ST107U20
Or this awk will also work:
awk 'FNR==NR {a[$1];next} $1 in a{delete a[$1]} END {for (i in a) print i}' 1.txt 2.txt
ASF001-AS-ST107U20
ASF001-AS-ST105U33
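Note that the for (i in a) loop visits array keys in arbitrary order, which is why these two lines come out in a different order than in 1.txt. Here is the same one-liner expanded with comments (behavior unchanged):
awk '
    FNR==NR { a[$1]; next }       # first file: remember every line
    $1 in a { delete a[$1] }      # second file: drop lines that also appear there
    END { for (i in a) print i }  # whatever remains is unique to 1.txt
' 1.txt 2.txt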
A relatively simple bash script can do what you need:
#!/bin/bash
while read -r line || test -n "$line"; do
    grep -qFx "$line" "$2" || echo "$line"   # -F literal match, -x whole line only
done < "$1"
exit 0
output:
$ ./uniquef12.sh dat/1.txt dat/2.txt
ASF001-AS-ST105U33
ASF001-AS-ST107U20
I have this data:
cat >data1.txt <<'EOF'
2020-01-27-06-00;/dev/hd1;100;/
2020-01-27-12-00;/dev/hd1;100;/
2020-01-27-18-00;/dev/hd1;100;/
2020-01-27-06-00;/dev/hd2;200;/usr
2020-01-27-12-00;/dev/hd2;200;/usr
2020-01-27-18-00;/dev/hd2;200;/usr
EOF
cat >data2.txt <<'EOF'
2020-02-27-06-00;/dev/hd1;120;/
2020-02-27-12-00;/dev/hd1;120;/
2020-02-27-18-00;/dev/hd1;120;/
2020-02-27-06-00;/dev/hd2;230;/usr
2020-02-27-12-00;/dev/hd2;230;/usr
2020-02-27-18-00;/dev/hd2;230;/usr
EOF
cat >data3.txt <<'EOF'
2020-03-27-06-00;/dev/hd1;130;/
2020-03-27-12-00;/dev/hd1;130;/
2020-03-27-18-00;/dev/hd1;130;/
2020-03-27-06-00;/dev/hd2;240;/usr
2020-03-27-12-00;/dev/hd2;240;/usr
2020-03-27-18-00;/dev/hd2;240;/usr
EOF
I would like to create a .txt file for each filesystem (so hd1.txt, hd2.txt, hd3.txt and hd4.txt) and put in each .txt file the sum of the values for that filesystem from each dataX.txt. It is hard for me to explain in English exactly what I want, so here is an example of the desired result.
Expected content for the output file hd1.txt:
2020-01;/dev/hd1;300;/
2020-02;/dev/hd1;360;/
2020-03;/dev/hd1;390;/
Expected content for the file hd2.txt:
2020-01;/dev/hd2;600;/usr
2020-02;/dev/hd2;690;/usr
2020-03;/dev/hd2;720;/usr
The implementation I've currently tried:
for i in $(cat *.txt | awk -F';' '{print $2}' | cut -d '/' -f3| uniq)
do
cat *.txt | grep -w $i | awk -F';' -v date="$(cat *.txt | awk -F';' '{print $1}' | cut -d'-' -f-2 | uniq )" '{sum+=$3} END {print date";"$2";"sum}' >> $i
done
But it doesn't work...
Can you show me how to do that?
Because the format is so regular, you can split the input on several separators and parse it easily in awk:
awk -v FS='[-;/]' '   # split fields on "-", ";" and "/" ("-" listed first so it stays literal)
prev != $9 {          # the size column is constant within a group, so a change starts a new group
    if (length(output)) {
        print output >> fileoutput   # flush the completed group
    }
    prev = $9
    sum = 0
}
{
    sum += $9         # accumulate the size column
    # $1-$2 = year-month, $7/$8 = dev/hdX, $11 = mount point (empty for the root path)
    output = sprintf("%s-%s;/%s/%s;%d;/%s", $1, $2, $7, $8, sum, $11)
    fileoutput = $8 ".txt"           # e.g. hd1.txt
}
END {
    print output >> fileoutput       # flush the final group
}
' *.txt
Tested on repl; it generates:
+ cat hd1.txt
2020-01;/dev/hd1;300;/
2020-02;/dev/hd1;360;/
2020-03;/dev/hd1;390;/
+ cat hd2.txt
2020-01;/dev/hd2;600;/usr
2020-02;/dev/hd2;690;/usr
2020-03;/dev/hd2;720;/usr
Alternatively, you could use -v FS=';' and then split() the first and second columns to extract the year and month and the hdX number.
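A minimal sketch of that alternative (the grouping key and array names are my own, and the END loop emits lines in arbitrary order, so sort the output files afterwards if order matters):
awk -F';' '
{
    split($1, d, "-")                  # d[1]=year, d[2]=month
    n = split($2, dev, "/")            # dev[n] = "hd1", "hd2", ...
    key = d[1] "-" d[2] ";" $2 ";" $4  # group by year-month and filesystem
    sum[key] += $3
    name[key] = dev[n]
}
END {
    for (k in sum) {
        split(k, f, ";")
        print f[1] ";" f[2] ";" sum[k] ";" f[3] >> (name[k] ".txt")
    }
}' *.txt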
If you seek a bash solution, I suggest inverting the loops: first iterate over the files, then over the identifiers in the second column.
for file in *.txt; do
prev=
output=
while IFS=';' read -r date dev num path; do
hd=$(basename "$dev")
if [[ "$hd" != "${prev:-}" ]]; then
if ((${#output})); then
printf "%s\n" "$output" >> "$fileoutput"
fi
sum=0
prev="$hd"
fi
sum=$((sum + num))
output=$(
printf "%s;%s;%d;%s" \
"$(cut -d'-' -f1-2 <<<"$date")" \
"$dev" "$sum" "$path"
)
fileoutput="${hd}.txt"
done < "$file"
printf "%s\n" "$output" >> "$fileoutput"
done
You could also translate the awk almost 1:1 into bash by using IFS='-;/' in the while read loop.
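For illustration, here is how such a read would split each record (the variable names are mine; the underscore placeholders absorb the empty fields produced by the ";/" sequences):
while IFS='-;/' read -r year month day hh mm _ _ hd size _ mount; do
    # for "2020-01-27-06-00;/dev/hd2;200;/usr" this yields:
    # year=2020 month=01 hd=hd2 size=200 mount=usr (mount is empty for "/")
    printf '%s-%s %s %s /%s\n' "$year" "$month" "$hd" "$size" "$mount"
done < data1.txt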
I want to append a different random number to the end of each line of a file. I have to repeat the process a few times, and each file contains about 20k lines of about 500k characters each.
The only solution I came up with so far is
file="example.txt"
for lineIndex in $(seq 1 "$(wc -l < "${file}")")
do
lineContent=$(sed "${lineIndex}q;d" ${file})
echo "${lineContent} $RANDOM" >> tmp.txt
done
mv tmp.txt ${file}
Is there a faster solution?
You can do it much more simply, without reopening the input and output files and without spawning new processes on every line, like this:
while IFS= read -r line   # IFS= and -r preserve whitespace and backslashes
do
echo "$line $RANDOM"
done < "$file" > tmp.txt
You could use awk (note that without srand(), rand() produces the same sequence on every run):
awk '{ print $0, int(32768 * rand()) }' "$file" > tmp && \
mv tmp "$file"
Using awk:
awk -v seed=$RANDOM 'BEGIN{srand(seed)} {print $0, int(rand() * 10^5+1)}' file
If you have GNU awk (4.1 or later) then you can use in-place editing of the file:
awk -i inplace -v seed=$RANDOM 'BEGIN{srand(seed)} {print $0, int(rand() * 10^5+1)}' file
Your script could be rewritten as:
file="example.txt"
cat "${file}" | while IFS= read -r line; do
echo "${line} $RANDOM"
done > tmp.txt
mv tmp.txt ${file}
I want to delete specific strings from a file.
I tried to use:
for line3 in $(cat 2.txt)
do
if grep -Fxq $line3 4.txt
then
sed -i /$line3/d 4.txt
fi
done
I want this code to delete lines from 4.txt if they are also in 2.txt, but this loop deletes all lines from 4.txt and I have no idea why. Can someone tell me what is wrong with this code?
2.txt:
a
ab
abc
4.txt:
a
abc
abcdef
You can do this with a single awk command:
awk 'ARGV[1] == FILENAME && FNR==NR {a[$1];next} !($1 in a)' 2.txt 4.txt
abcdef
To store output back to 4.txt use:
awk 'ARGV[1] == FILENAME && FNR==NR {a[$1];next} !($1 in a)' 2.txt 4.txt > _tmp && mv _tmp 4.txt
PS: Added ARGV[1] == FILENAME && to take care of the empty-file case, as noted by @pjh below.
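For readability, here is the same command expanded with comments (behavior unchanged); the ARGV[1] == FILENAME guard keeps the FNR==NR test from matching lines of 4.txt when 2.txt is empty:
awk '
    ARGV[1] == FILENAME && FNR==NR { a[$1]; next }  # while reading 2.txt: remember each line
    !($1 in a)                                      # while reading 4.txt: print lines not stored above
' 2.txt 4.txt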
grep -F -v -x -f 2.txt 4.txt
or
grep -Fvxf 2.txt 4.txt
or
fgrep -vxf 2.txt 4.txt
(-F treats the patterns as fixed strings, -v inverts the match, -x matches whole lines only, and -f reads the patterns from 2.txt.)
Using just Bash (4) builtins:
declare -A found
while IFS= read -r line || [[ $line ]] ; do found[$line]=1 ; done <2.txt
while IFS= read -r line || [[ $line ]] ; do
(( ${found[$line]-0} )) || printf '%s\n' "$line"
done <4.txt
The '[[ $line ]]' tests are to handle files with unterminated last lines.
Use 'printf' instead of 'echo' in case any of the output lines begin with 'echo' options.
Look ma', only sed...
sed $( sed 's,^, -e /^,;s,$,$/d,' 2.txt ) 4.txt
Transform each line of 2.txt into a sed command, e.g. abc -> -e /^abc$/d.
Give the resulting list of sed commands to an instance of sed operating on 4.txt.
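With the sample 2.txt above, the inner sed expands the outer command to:
sed -e /^a$/d -e /^ab$/d -e /^abc$/d 4.txt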
To store output back to 4.txt use:
sed -i $( sed 's,^, -e /^,;s,$,$/d,' 2.txt ) 4.txt
Edit: while I love my answer on an aesthetic basis, please don't try this at home! See pjh's comment below for a detailed rationale of the many ways in which my microscript may fail.
I have two files tmp1.txt and tmp2.txt
tmp1.txt has
aaa.txt
bbb.txt
ccc.txt
ddd.txt
tmp2.txt has
/tmp/test1/aaa.txt
/tmp/test1/aac.txt
/tmp/test2/bbb.txt
/tmp/test1/ccc.txt
I want to check if the files in tmp1.txt exist in tmp2.txt and, if they do, display which directory each one is in, so it displays something similar to this:
aaa.txt: test1
bbb.txt: test2
ccc.txt: test1
Thanks
Using awk:
awk -F/ 'FNR==NR {a[$1];next} $NF in a {print $NF ": " $(NF-1)}' tmp1.txt tmp2.txt
aaa.txt: test1
bbb.txt: test2
ccc.txt: test1
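The same one-liner expanded with comments (behavior unchanged):
awk -F/ '
    FNR==NR { a[$1]; next }              # tmp1.txt: remember each bare filename
    $NF in a { print $NF ": " $(NF-1) }  # tmp2.txt: with -F/ the last field is the
                                         # filename, the one before it the directory
' tmp1.txt tmp2.txt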
I would like to propose a solution using the standard tools diff and basename:
while read -r filename
do
basename "$filename"
done < tmp2.txt > tmp2.basenames.txt
diff -u tmp1.txt tmp2.basenames.txt
The main advantage of this solution is its simplicity. The output will look a little different though, differentiating between lines only in tmp1.txt (-), only in tmp2.txt (+), or in both (no prefix):
--- tmp1.txt 2014-09-17 17:09:43.000000000 +0200
+++ tmp2.basenames.txt 2014-09-17 17:13:12.000000000 +0200
@@ -1,4 +1,4 @@
aaa.txt
+aac.txt
bbb.txt
ccc.txt
-ddd.txt
Bash Solution:
#!/bin/bash
while read -r file; do
    # print only when the filename occurs in tmp2.txt
    if a=$(grep -Fw "$file" tmp2.txt); then
        echo "$(basename "$a"): $(basename "$(dirname "$a")")"
    fi
done < tmp1.txt
If you don't want to use awk, there's a little bash loop:
while read -r f; do
    isFound="$(grep "/$f" tmp2.txt 2>/dev/null)"
    if [ -n "$isFound" ]; then
        theDir=$(echo "$isFound" | cut -d'/' -f3)
        echo "$f: $theDir"
    fi
done <tmp1.txt
Hi, I have two files which contain paths. I want to compare the two files and show only the uncommon part of each line.
1.txt:
/home/folder_name/abc
2.txt:
/home/folder_name/abc/pqr/xyz/mnp
Output I want:
/pqr/xyz/mnp
How can I do this?
This bit of awk does the job:
$ awk 'NR==FNR {a[++i]=$0; next}
{
b[++j]=$0;
if(length(a[j])>length(b[j])) {t=a[j]; a[j]=b[j]; b[j]=t}
sub(a[j],"",b[j]);
print b[j]
}' 1.txt 2.txt # or 2.txt 1.txt, it doesn't matter
Write the line from the first file to the array a.
Write the line from the second to b.
Swap a[j] and b[j] if a[j] is longer than b[j] (this might not be necessary if the longer text is always in b).
Remove the part found in a[j] from b[j] and print b[j].
This is a general solution; it makes no assumption that the match is at the start of the line, or that the contents of one file's line should be removed from the other. If you can afford to make those assumptions, the script can be simplified.
If the match may occur more than once on the line, you can use gsub rather than sub to perform a global substitution.
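For example, replacing the sub line with
gsub(a[j], "", b[j]);
removes every occurrence of the shorter string instead of only the first.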
Assuming the corresponding lines of 1.txt and 2.txt pair up, the following code will do the job (paste joins each pair and read a b splits on whitespace, so the paths must not contain spaces):
paste 1.txt 2.txt |
while read a b;
do
if [[ ${#a} -gt ${#b} ]];
then
echo ${a/$b};
else
echo ${b/$a};
fi;
done;
This is how it works on my system,
shiplu#:~/test/bash$ cat 1.txt
/home/shiplu/test/bash
/home/shiplu/test/bash/hello/world
shiplu#:~/test/bash$ cat 2.txt
/home/shiplu/test/bash/good/world
/home/shiplu/test/bash
shiplu#:~/test/bash$ paste 1.txt 2.txt |
> while read a b;
> do
> if [[ ${#a} -gt ${#b} ]];
> then
> echo ${a/$b};
> else
> echo ${b/$a};
> fi;
> done;
/good/world
/hello/world
This script compares the files line by line and outputs only the change in each line.
First it counts the number of lines in the first file.
Then it starts a loop that iterates that many times.
Each iteration reads the same line number from both files into two variables.
It compares the lines and, if they are the same, reports that.
If they are not, it strips the duplicated part of the longer string with sed (effectively removing it).
I used : as the separator in sed because your variables contain /. If they may also contain :, you will want to choose a different separator.
Probably not the most efficient solution, but it works.
#!/bin/bash
NUMOFLINES=$(wc -l < "1.txt")
echo $NUMOFLINES
for ((i = 1 ; i <= $NUMOFLINES ; i++)); do
f1=$(sed -n $i'p' 1.txt)
f2=$(sed -n $i'p' 2.txt)
if [[ $f1 < $f2 ]]; then
echo -n "Line $i:"
sed 's:'"$f1"'::' <<< "$f2"
elif [[ $f1 > $f2 ]]; then
echo -n "Line $i:"
sed 's:'"$f2"'::' <<< "$f1"
else
echo "Line $i: Both lines are the same"
fi
echo ""
done
If you happen to use bash, you could try this one:
echo $(diff <(grep -o . 1.txt) <(grep -o . 2.txt) \
| sed -n '/^[<>]/ {s/^..//;p}' | tr -d '\n')
It does a character-by-character comparison using diff (grep -o . emits one line per character, which is what the line-wise diff consumes), keeps only the difference lines (those starting with the markers < or >), strips the two-character marker prefix, and joins everything back together with tr.
If you have multiple lines in your input (which you did not mention in your question) then try something like this (where % is a character not contained in your input):
diff <(cat 1.txt | tr '\n' '%' | grep -o .) \
<(cat 2.txt | tr '\n' '%' | sed -e 's/%/%%/g' | grep -o .) \
| sed -n '/^[<>]/ {s/^..//;p}' | tr -d '\n' | tr '%' '\n'
This extends the single-line solution by adding line-end markers (e.g. %); making them differ between the two sides (% on the left, %% on the right) forces diff to keep them in its output, so the final tr can restore the original line breaks.
If both files always contain a single line each, then the below works:
perl -lne '$a=$_ if($.==1);print $1 if(/$a(.*)/ && $.==2)' 1.txt 2.txt
Tested Below:
> cat 1.txt
/home/folder_name/abc
> cat 2.txt
/home/folder_name/abc/pqr/xyz/mnp
> perl -lne '$a=$_ if($.==1);print $1 if(/$a(.*)/ && $.==2)' 1.txt 2.txt
/pqr/xyz/mnp
>