Comparing 2 files in Linux for a different word

I have two files like the ones below.
file1 contains these words:
word23.cs
test.cs
only12.cs
file2 contains these words:
word231.cs
test.cs
only12.cs
The words above may change. I need to compare the two files, using a script or a Linux command, to get the differing word; comparing file2 against file1 should give the output word23.cs.
Thank you

Use the "diff" command to compare 2 files:
$ diff a.txt b.txt
Or, for a unified diff:
$ diff -u a.txt b.txt
Use -u0 for a unified diff without context.
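With the example files from the question saved as file1 and file2 (names assumed here), the default output marks the one changed pair of lines:
$ diff file1 file2
1c1
< word23.cs
---
> word231.cs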

You can use the comm, diff, or cmp commands to find the differing word between the files.
The same trick also works with grep: to list the lines of file1 that are missing from file2 (which yields word23.cs here), invert the match:
grep -vFwf file2 file1
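If the goal is exactly the single word word23.cs (present in file1 but absent from file2), comm on sorted copies also does it directly; a minimal sketch, again assuming the files are named file1 and file2:
$ comm -23 <(sort file1) <(sort file2)
word23.cs
comm -23 suppresses the lines unique to the second file and the lines common to both, leaving only the lines unique to the first.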

Related

Is there a way to compare N files at once, and only leave lines unique to each file?

Background
I have five files that I am trying to make unique relative to each other. In other words, I want to make it so that the lines of text in each file have no commonality with each other.
Attempted solution
So far, I have been able to run the grep -vf command comparing one file with the other four, like so:
grep -vf file2.txt file1.txt
grep -vf file3.txt file1.txt
...
This prints the lines in file1 that are not in file2, nor in file3, etc. However, this becomes cumbersome: to truly reduce each file to the lines found only in that file, I would have to feed every pairing of files into the grep -vf command. Given how cumbersome that sounds, I wanted to know...
Question
What command (or series of commands) in Linux finds the lines of text in each file that are mutually exclusive to all the other files?
You could just do:
awk '!a[$0]++ { out=sprintf("%s.out", FILENAME); print > out}' file*
This writes the lines that are unique in each file to file.out (e.g. file1.out for file1). Each line is written to the output file of the input file in which it first appears, and subsequent duplicates of that same line are suppressed.
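Note that a line shared by two files still lands in the output of the file where it was seen first. If you need strict mutual exclusivity (a shared line should appear in no output at all), one option, not part of the original answer, is a two-pass awk using the classic pass=1 ... pass=2 command-line-assignment idiom: the first pass counts how many files contain each line, the second prints only lines found in exactly one file (duplicates within that one file are kept):
awk 'pass == 1 { if (!seen[FILENAME, $0]++) cnt[$0]++; next }
     cnt[$0] == 1 { print > (FILENAME ".out") }' \
    pass=1 file* pass=2 file*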

Matching file content with other filenames to extract and merge contents

I have two directories.
In directory_1, I have many .txt files
The content of these files (for example file1.txt) is a list of file names.
file1.txt
--
rer_098
dfrkk9
In directory_2, I have many files; two of them are ‘rer_098’ and ‘dfrkk9’.
The content of these files is as follows:
rer_098
--
>123_nbd
sasert
>456_nbd
ffjko
dfrkk9
--
>789_nbd
figyi
>012_nbd
jjjygk
Now, in a separate output directory (directory_3), I want output files like this for the example above:
file1.txt
--
>123_nbd
sasert
>456_nbd
ffjko
>789_nbd
figyi
>012_nbd
jjjygk
and so on for file2.txt
Thanks!
This might work for you (GNU parallel):
parallel 'cat {} | parallel -I## cat dir_2/## > dir_3/{/}' ::: dir_1/*.txt
Use two invocations of parallel: the first traverses dir_1 and pipes its output into a second parallel, which cats the listed input files and writes the result to dir_3, keeping the original name from the first invocation.
N.B. the -I option renames the inner parallel's parameter delimiter from the default {} to ## so it does not clash with the outer one.
Pretty easy to do with just shell. Something like
for fullname in directory_1/*.txt; do
    file=$(basename "$fullname")
    while read -r line; do
        cat "directory_2/$line"
    done <"$fullname" >"directory_3/$file"
done
Or with awk: the first pass (NR==FNR) appends the names listed in $file to ARGV so awk goes on to read those files from directory_2, and the bare 1 prints every line of them:
for file in directory_1/*.txt; do
    awk 'NR==FNR{ARGV[ARGC++]="directory_2/"$0; next} 1' "$file" > "directory_3/${file##*/}"
done

When comparing directories with diff is there a way to exclude certain file differences from the output?

I am running this command:
diff --recursive a.git b.git
It shows differences for some files that do not concern me. Is there a way to, for example, not have it show lines that end in:
".xaml.g.cs"
or lines that start in:
"Binary files"
or lines that have this text in them:
".git/objects"
A simple but effective way is to pipe output to grep. With grep -Ev you can ignore lines using regular expressions.
diff --recursive a.git b.git | grep -Ev "\.xaml\.g\.cs$" | grep -v "^Binary files" | grep -v "\.git/objects"
This ignores all lines matching one of the patterns. As for the regular expressions: ^ anchors the start of a line, $ anchors the end, and the dots are escaped so they match literally. Note that diff's content lines normally start with < or >, so you may have to adjust the anchors to that output format.
Also note that diff provides a flag --ignore-matching-lines=RE, but it might not work as you would expect, as mentioned in this question/answer. Because it does not work as I would expect, I would rather use grep for filtering.
man diff
gives following examples:
...
-x, --exclude=PAT
    exclude files that match PAT
-X, --exclude-from=FILE
    exclude files that match any pattern in FILE
...
Did you try using those switches (e.g. diff -x "*.xaml.g.cs" --recursive a.git b.git)?

Comparing list of directories to list in a file and writing another file for comparison [duplicate]

Ok, I have two related lists on my linux box in text files:
/tmp/oldList
/tmp/newList
I need to compare these lists to see what lines got added and what lines got removed. I then need to loop over these lines and perform actions on them based on whether they were added or removed.
How do I do this in bash?
Use the comm(1) command to compare the two files. They both need to be sorted, which you can do beforehand if they are large, or you can do it inline with bash process substitution.
comm can take a combination of the flags -1, -2 and -3 indicating which file to suppress lines from (unique to file 1, unique to file 2 or common to both).
To get the lines only in the old file:
comm -23 <(sort /tmp/oldList) <(sort /tmp/newList)
To get the lines only in the new file:
comm -13 <(sort /tmp/oldList) <(sort /tmp/newList)
You can feed that into a while read loop to process each line:
while read old; do
    ...do stuff with $old
done < <(comm -23 <(sort /tmp/oldList) <(sort /tmp/newList))
and similarly for the new lines.
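For a quick sanity check with hypothetical lists (oldList holding a and b, newList holding b and c):
$ printf 'a\nb\n' > /tmp/oldList
$ printf 'b\nc\n' > /tmp/newList
$ comm -23 <(sort /tmp/oldList) <(sort /tmp/newList)
a
$ comm -13 <(sort /tmp/oldList) <(sort /tmp/newList)
c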
The diff command will do the comparing for you.
e.g.,
$ diff /tmp/oldList /tmp/newList
See the diff man page for more information. This should take care of the first part of your problem.
Consider using Ruby if your scripts need readability.
To get the lines only in the old file:
ruby -e "puts File.readlines('/tmp/oldList') - File.readlines('/tmp/newList')"
To get the lines only in the new file:
ruby -e "puts File.readlines('/tmp/newList') - File.readlines('/tmp/oldList')"
You can feed that into a while read loop to process each line:
while read old; do
    ...do stuff with $old
done < <(ruby -e "puts File.readlines('/tmp/oldList') - File.readlines('/tmp/newList')")
This is old, but for completeness we should say that if you have a really large set, the fastest solution would be to use diff to generate a script and then source it, like this:
#!/bin/bash

line_added() {
    # code to be run for each added line; $* is the line
    :   # no-op placeholder so the empty stub parses
}

line_removed() {
    # code to be run for each removed line; $* is the line
    :
}

line_same() {
    # code to be run for each line that is the same; $* is the line
    :
}

sort /tmp/oldList >/tmp/oldList.sorted
sort /tmp/newList >/tmp/newList.sorted

diff >/tmp/diff_script.sh \
    --new-line-format="line_added %L" \
    --old-line-format="line_removed %L" \
    --unchanged-line-format="line_same %L" \
    /tmp/oldList.sorted /tmp/newList.sorted

source /tmp/diff_script.sh
Lines changed will appear as deleted and added. If you don't like this, you can use --changed-group-format. Check the diff manual page.
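To see what gets sourced: with the same hypothetical lists as above (old: a, b; new: b, c), /tmp/diff_script.sh would contain something like
line_removed a
line_same b
line_added c
so sourcing it simply calls your three functions once per line, in input order.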
I typically use:
diff /tmp/oldList /tmp/newList | grep -v "Common subdirectories"
The grep -v option inverts the match:
-v, --invert-match
    Selected lines are those not matching any of the specified patterns.
So in this case it takes the diff results and omits those that are common.
Have you tried diff?
$ diff /tmp/oldList /tmp/newList
$ man diff

How to compare two text files for the same exact text using BASH?

Let's say I have two text files that I need to extract data out of. The text of the two files is as follows:
File 1:
1name - randomemail@email.com
2Name - superrandomemail@email.com
3Name - 123random@email.com
4Name - random123@email.com
File 2:
email.com
email.com
email.com
anotherwebsite.com
File 2 is File 1's list of domain names, extracted from the email addresses.
These are not the same domain names by any means, and are quite random.
How can I get, from File 1, the lines whose domain names match File 2?
Thank you in advance!
Assuming that order does not matter,
grep -F -f FILE2 FILE1
should do the trick. (This works because of a little-known fact: the -F option to grep doesn't just mean "match this fixed string," it means "match any of these newline-separated fixed strings.")
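With the sample files above (assuming they are saved as file1 and file2), this prints every line of File 1, since each address ends in email.com and File 2 lists that domain:
$ grep -F -f file2 file1
1name - randomemail@email.com
2Name - superrandomemail@email.com
3Name - 123random@email.com
4Name - random123@email.com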
The recipe:
join <(sed 's/^.*@//' file1 | sort -u) <(sort -u file2)
It will output the intersection of the domain names in file1 and file2.
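Stepping through with the sample files: the sed strips everything up to the @, sort -u deduplicates each side, and join keeps the names present in both sorted lists:
$ sed 's/^.*@//' file1 | sort -u
email.com
$ sort -u file2
anotherwebsite.com
email.com
$ join <(sed 's/^.*@//' file1 | sort -u) <(sort -u file2)
email.com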
See BashFAQ/036 for the list of usual solutions to this type of problem.
Use the vimdiff command; it gives a nice presentation of the differences.
If I understood you right, you want to filter for all addresses whose host is mentioned in File 2.
You could then just loop over File 2 and grep for @<line>, accumulating the results in a new file or something similar.
Example:
sort -u file2 | while read -r host; do grep "@$host" file1; done > filtered
