How to compare two text files for the same exact text using BASH? - linux

Let's say I have two text files that I need to extract data out of. The text of the two files is as follows:
File 1:
1name - randomemail#email.com
2Name - superrandomemail#email.com
3Name - 123random#email.com
4Name - random123#email.com
File 2:
email.com
email.com
email.com
anotherwebsite.com
File 2 is File 1's list of domain names, extracted from the email addresses.
These are not the same domain names by any means, and are quite random.
How can I get the results of the domain names that match File 2 from File 1?
Thank you in advance!

Assuming that order does not matter,
grep -F -f FILE2 FILE1
should do the trick. (This works because -f reads the patterns from FILE2, one per line, and because of a little-known fact: the -F option to grep doesn't just mean "match this fixed string," it means "match any of these newline-separated fixed strings.")
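A minimal sketch of that command on made-up data, assuming the same '#' separator as in the question. The file contents and the temporary directory are hypothetical; note that -F matches the string anywhere in the line, so a domain like "notemail.com" would also match "email.com".

```shell
# Hypothetical sample data; '#' stands in for '@' as in the question.
tmp=$(mktemp -d)
printf '%s\n' \
  '1name - randomemail#email.com' \
  '2Name - someone#elsewhere.org' \
  '3Name - 123random#email.com' > "$tmp/file1"
printf '%s\n' 'email.com' 'email.com' 'anotherwebsite.com' > "$tmp/file2"

# -f reads one pattern per line from file2; -F treats each as a literal string.
matches=$(grep -F -f "$tmp/file2" "$tmp/file1")
printf '%s\n' "$matches"

rm -rf "$tmp"
```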

The recipe:
join <(sed 's/^.*#//' file1|sort -u) <(sort -u file2)
This outputs the intersection of the domain names in file1 and file2.
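If process substitution is not available, the same recipe can probably be written with intermediate files. This is only a sketch: the sample lines and the names domains1 and domains2 are made up, and LC_ALL=C is set so that sort and join agree on the ordering.

```shell
# Hypothetical sample data in the question's format.
tmp=$(mktemp -d)
printf '%s\n' 'a - x#email.com' 'b - y#email.com' 'c - z#example.net' > "$tmp/file1"
printf '%s\n' 'email.com' 'email.com' 'anotherwebsite.com' > "$tmp/file2"

# Strip everything up to the '#', then sort and deduplicate both lists.
sed 's/^.*#//' "$tmp/file1" | LC_ALL=C sort -u > "$tmp/domains1"
LC_ALL=C sort -u "$tmp/file2" > "$tmp/domains2"

# join prints the lines whose first field appears in both sorted inputs.
common=$(LC_ALL=C join "$tmp/domains1" "$tmp/domains2")
printf '%s\n' "$common"

rm -rf "$tmp"
```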

See BashFAQ/036 for the list of usual solutions to this type of problem.

Use the vimdiff command; it gives a nice presentation of the differences.

If I got you right, you want to filter for all addresses with the host mentioned in File 2.
You could then just loop over File 2 and grep for #<line>, accumulating the result in a new file or something similar.
Example:
sort -u file2 | while IFS= read -r host; do grep -F "#$host" file1; done > filtered

Related

Shell Script - How to merge two text files without repeating lines

My case is apparently easy, but I couldn't do it in a simple way, and I need to because the real files are very large.
So, I have two txt files and I would like to generate a new file containing the content of both without duplicating any lines. Something like this:
file1.txt
192.168.0.100
192.168.0.101
192.168.0.102
file2.txt
192.168.0.100
192.168.0.101
192.168.1.200
192.168.1.201
I would like to merge these files above and generate another one like this:
result.txt
192.168.0.100
192.168.0.101
192.168.0.102
192.168.1.200
192.168.1.201
Any simple suggestions?
Thank you
If changing the order is not an issue:
sort -u file1.txt file2.txt > result.txt
This first sorts the lines of both files together (GNU sort spills to temporary files when the input does not fit in memory), then runs through them and outputs each distinct line only once (the -u flag).
There's a semi-standard idiom in awk for removing duplicates:
awk '!a[$0]++ {print}' file1.txt file2.txt
The array a counts occurrences of each line; a line is printed only the first time it is seen (i.e., when a[$0] is 0 before it is incremented).
This is asymptotically faster than sorting the input (and preserves the input order), but requires more memory.
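To see the difference in ordering between the two approaches, here is a small sketch on made-up data (the file contents are hypothetical, and the array is named seen instead of a for readability):

```shell
# Hypothetical input: "apple" appears in both files.
tmp=$(mktemp -d)
printf '%s\n' pear apple > "$tmp/file1.txt"
printf '%s\n' apple fig > "$tmp/file2.txt"

# sort -u: output is in sorted order.
sorted=$(LC_ALL=C sort -u "$tmp/file1.txt" "$tmp/file2.txt")

# awk: a line passes only while its counter is still 0, so first-seen order is kept.
merged=$(awk '!seen[$0]++' "$tmp/file1.txt" "$tmp/file2.txt")

printf 'sort -u:\n%s\n' "$sorted"
printf 'awk:\n%s\n' "$merged"

rm -rf "$tmp"
```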

Linux: Comparing two files but not caring what line only content

I am trying to use the comm or diff Linux commands to compare two different files. Each file has a list of volume names. File A has 1500 volumes and file B has those same 1500 volumes plus another 200, for a total of 1700. I am looking for a way to find just those 200 volumes. I don't care if the volumes match and are on different lines; I only want the mismatched volumes, but the diff and comm commands seem to only compare line by line. Does anyone know another command, or a way to use the comm or diff command, to find these 200 volumes?
First 5 lines of both files: (BTW there is only one volume on each line so File A has 1500 lines and File B has 1700 lines)
File A:
B00004
B00007
B00010
B00011
B00013
File B:
B00003
B00004
B00007
B00008
B00010
So I would want the command to show me B00003 and B00008 just from the first 5 lines because those volumes are not in File A
awk can also help.
awk 'NR==FNR {a[$1]=$1; next}!($1 in a) {print $0}' fileA fileB
Try
comm -23 <( sort largerFile) <(sort smallerFile)
This assumes that your Vol name will be the first "field" in the data. If not, check man sort for ways to sort files on alternate fields (and combinations of fields).
The <( ....) construct is known as process substitution. If you're using a really old shell/unix or reduced functionality shell (dash?), process substitution may not be available. Then you'll have to sort your files before you run comm and manage what you do with the unsorted file.
Note that in comm -23, -2 suppresses column 2 (lines found only in the second file) and -3 suppresses column 3 (lines common to both files), so the remaining output is the lines of the first file that are not in the second. This is why I list largerFile first.
IHTH
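Both suggestions above can be tried on the five sample lines from the question. This is a sketch that writes the sorted copies to temporary files instead of using process substitution; the temporary file names are made up, and the awk array is called seen here.

```shell
# The first five lines of each file, as given in the question.
tmp=$(mktemp -d)
printf '%s\n' B00004 B00007 B00010 B00011 B00013 > "$tmp/fileA"
printf '%s\n' B00003 B00004 B00007 B00008 B00010 > "$tmp/fileB"

# comm needs sorted input; file B (the larger list) goes first.
LC_ALL=C sort "$tmp/fileA" > "$tmp/A.sorted"
LC_ALL=C sort "$tmp/fileB" > "$tmp/B.sorted"
only_in_b=$(LC_ALL=C comm -23 "$tmp/B.sorted" "$tmp/A.sorted")

# awk: remember every volume from file A, then print the file B lines not seen.
only_in_b_awk=$(awk 'NR == FNR { seen[$1]; next } !($1 in seen)' "$tmp/fileA" "$tmp/fileB")

printf '%s\n' "$only_in_b"

rm -rf "$tmp"
```

The awk version does not need sorted input, but it holds all of file A in memory.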

What is the usage of sorted command?

I have read most of the examples that come with the sort command. However, I am not sure what the sort command does when used in this style:
sort <word> sorted
That would just be two file names, as in
sort file1 file2 file3...
If you pass multiple file names, sort concatenates them and sorts all of them together.
If you're asking how to sort a string with the sort command:
echo "tatoine" | grep -o . | sort | tr -d "\n"
aeinott
This works because sort operates on lines, so you have to split the string into multiple lines with one letter on each (grep -o .), and after sorting you delete the newlines again with the tr command.
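Wrapped in a small function, that pipeline might look like the sketch below. The name sort_chars is made up for illustration, and it assumes a grep that supports -o (GNU and BSD grep both do).

```shell
# sort_chars is a hypothetical helper: prints the characters of $1 in sorted order.
sort_chars() {
  # grep -o . emits one character per line; tr removes the newlines after sorting.
  printf '%s\n' "$1" | grep -o . | sort | tr -d '\n'
}

sorted_word=$(sort_chars "tatoine")
printf '%s\n' "$sorted_word"
```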
Are those < and > symbols literal, or do they indicate a parameter that is to be replaced? If the former, then you're reading from a file called "word" and writing the sorted data to a file called "sorted".
Are you trying to save the content in a sorted order?
Let's say you have a file name.txt with the following content.
Zoe
John
Amy
Mary
Mark
Peter
You can use the sort command "sort name.txt", and the output goes to the console.
You can save the output using "sort name.txt -o sortedname.txt",
e.g.
Amy
John
Mark
Mary
Peter
Zoe
You can find more options with the commands "man sort" and "info sort".
rojomoke was right about the > and < symbols. Those are redirection operators.
A command usually reads its data from standard input (stdin), and its output goes to standard output (stdout), which is normally the screen.
< means get the data from somewhere else, e.g. a file.
> means redirect the output to somewhere else, e.g. a file.
So the command above, "sort name.txt -o sortedname.txt", could also have been written as follows.
sort < name.txt > sortedname.txt
You can read more about the redirection in this wiki entry.
https://en.wikipedia.org/wiki/Redirection_(computing)
Operators like | and >> will come in handy down the road.
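A quick way to convince yourself that the two forms are equivalent is the sketch below. The output file names by_option.txt and by_redirect.txt are made up, and LC_ALL=C is set only to make the ordering predictable.

```shell
# The name list from the example above.
tmp=$(mktemp -d)
printf '%s\n' Zoe John Amy Mary Mark Peter > "$tmp/name.txt"

# Form 1: let sort write the file itself with -o.
LC_ALL=C sort -o "$tmp/by_option.txt" "$tmp/name.txt"

# Form 2: let the shell connect stdin and stdout to files.
LC_ALL=C sort < "$tmp/name.txt" > "$tmp/by_redirect.txt"

# cmp -s is silent and succeeds only if the two files are byte-identical.
if cmp -s "$tmp/by_option.txt" "$tmp/by_redirect.txt"; then same=yes; else same=no; fi
first=$(head -n 1 "$tmp/by_option.txt")
printf 'identical: %s, first line: %s\n' "$same" "$first"

rm -rf "$tmp"
```

One practical difference: "sort -o name.txt name.txt" can safely sort a file in place, whereas "sort < name.txt > name.txt" empties the file, because the shell truncates it before sort reads anything.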

cat | sort csv file by name in bash

I have a bunch of CSV files that I want to save into one file, ordered by name.
I use
cat *.csv | sort -t\ -k2 -n *.csv > output.csv
This works well for names like a001, a002, a010, a100,
but in my files the names are a bit messed up, so they look like a1, a2, a10, a100,
and the command I wrote arranges things like this:
cn201
cn202
cn202
cn203
cn204
cn99
cn98
cn97
cn96
..
cn9
Can anyone please help me?
Thanks
If I understand correctly, you want to use the -V (version-sort) flag instead of -n. This is only available on GNU sort, but that's probably the one you are using.
However, it depends how you want the prefixes to be sorted.
If you don't have the -V option, sort allows you to be more precise about what characters constitute a sort key.
sort -t\ -k2.3n *.csv > output.csv
The .3 tells sort that the key to sort on starts with the 3rd character of the second field, effectively skipping the cn prefix. You can put the n directly in the field specifier, which saves you two whole characters, but more importantly for more complex sorts, allows you to treat just that key as a number, rather than applying -n globally (which is only an issue if you specify multiple keys with several uses of -k).
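Here is a sketch of the key-offset approach on made-up data. It assumes a space-separated layout in which the name is the second field; the file data.txt and its rows are hypothetical.

```shell
# Hypothetical two-column, space-separated data; the name is the second field.
tmp=$(mktemp -d)
printf '%s\n' 'r1 cn201' 'r2 cn99' 'r3 cn9' 'r4 cn96' > "$tmp/data.txt"

# -k2.3n: the key starts at character 3 of field 2 (after "cn") and is compared numerically.
ordered=$(sort -t ' ' -k2.3n "$tmp/data.txt")
printf '%s\n' "$ordered"

rm -rf "$tmp"
```

Because this relies only on -t and -k, it should also work with sort implementations that lack -V.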
The sort version on the live server is 5.97 from 2006,
so a few things did not work correctly.
However, the code below is my solution:
#!/bin/bash
# Concatenate every CSV in this directory into clusters.csv, each preceded by its file name.
echo "This script reads all CSVs into a single file (clusters.csv) in this directory"
for filers in *.csv
do
    # Skip the output file so a second run does not append it to itself.
    [ "$filers" = "clusters.csv" ] && continue
    {
        echo ""
        echo "--------------------------------"
        echo "$filers"
        echo "--------------------------------"
        cat "$filers"
    } >> clusters.csv
done
Or, if you want to keep it simple, a single command (note that FNR > 1 skips the first line of every file):
awk 'FNR > 1' *.csv > clusters.csv
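If the first line of each CSV is a header and you would like to keep one copy of it, a common variation is sketched below. The files a.csv and b.csv and their contents are made up for illustration.

```shell
# Hypothetical CSV files that share the same header line.
tmp=$(mktemp -d)
printf '%s\n' 'name,value' 'cn1,10' > "$tmp/a.csv"
printf '%s\n' 'name,value' 'cn2,20' > "$tmp/b.csv"

# NR == 1 keeps the very first line (the header of the first file);
# FNR > 1 keeps every non-header line of each file.
merged=$(awk 'NR == 1 || FNR > 1' "$tmp"/*.csv)
printf '%s\n' "$merged"

rm -rf "$tmp"
```

When writing the result to a file, it is safer to put it outside the directory matched by *.csv (or give it a different extension), so that a later run does not read its own output.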

Diff-ing files with Linux command

What Linux command allows me to check whether all the lines in file A exist in file B? (It's almost like a diff, but not quite.) Also, file A has unique lines, as does file B.
The comm command compares two sorted files, line by line, and is part of GNU coreutils.
Are you looking for a better diff tool?
https://stackoverflow.com/questions/12625/best-diff-tool
So, what if A has
a
a
b
and b has
a
b
What would you want the output to be (yes or no)?
Use the diff command.
if cat A A B | sort | uniq -c | grep -E -e '^[[:space:]]*2[[:space:]]' > /dev/null; then
    echo "A has lines that are not in B."
fi
If you do not redirect the output, you will get a list of all the lines that are in A but not in B (except that each line will have a 2 in front of it). This relies on the lines in A being unique, and the lines in B being unique.
If they aren't, and you don't care about counting duplicates, it's relatively simple to transform each file into a list of unique lines using sort and uniq.
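To see why a count of 2 identifies the lines missing from B, here is a sketch on made-up files. The awk step assumes single-word lines, and the grep variant at the end is an alternative technique, not part of the answer above.

```shell
# Hypothetical files: "c" is in A but not in B, "d" is only in B.
tmp=$(mktemp -d)
printf '%s\n' a b c > "$tmp/A"
printf '%s\n' a b d > "$tmp/B"

# With A listed twice: lines in both files occur 3 times, lines only in A
# occur 2 times, and lines only in B occur once.
missing=$(cat "$tmp/A" "$tmp/A" "$tmp/B" | sort | uniq -c | awk '$1 == 2 { print $2 }')

# Alternative: print the lines of A that do not exactly match (-x) any fixed string (-F) from B.
missing_grep=$(grep -F -x -v -f "$tmp/B" "$tmp/A")

printf '%s\n' "$missing"

rm -rf "$tmp"
```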

Resources