check list of email addresses against other list - string

I have two files with email addresses (one per line): file1 and file2.
How can I remove all the emails in file1 which also exist in file2? Looking for a bash answer, but any other scripting language is fine as well.
If it helps, each file contains only unique email addresses.

join -v1 <(sort file1) <(sort file2)
This tells join to print the lines (emails) in file1 that do not appear in file2. Both inputs have to be sorted, hence the <(sort ...) process substitutions.
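For illustration, with two hypothetical one-address-per-line files:
$ printf 'alice@example.com\nbob@example.com\ncarol@example.com\n' > file1
$ printf 'bob@example.com\n' > file2
$ join -v1 <(sort file1) <(sort file2)
alice@example.com
carol@example.com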

If you must preserve the original order for whatever reason, and want to be thorough about case sensitivity and carriage returns (^M), you can try:
perl -e '%e=();while(<>){s/[\r\n]//g;$e{lc($_)}=1}open($so,"<","file1");while(<$so>){s/[\r\n]//g;print "$_\n" if(!exists($e{lc($_)}))}close($so)' file2
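A rough shell equivalent of the same idea (a sketch, assuming GNU grep; -F, -x, -i and -v give literal, whole-line, case-insensitive, inverted matching, and file2.patterns is just an arbitrary temp-file name):
# Strip carriage returns first so ^M doesn't break the comparison
tr -d '\r' < file2 > file2.patterns
tr -d '\r' < file1 | grep -Fxivf file2.patterns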

Related

How to compare two text files for the same exact text using BASH Script?

Let's say I have two text files that I need to extract data out of. The text of the two files is as follows:
File1.txt
ami-1234567
ami-1234567654
ami-23456
File-2.txt
ami-1234567654
ami-23456
ami-2345678965
I want all the lines of File-2.txt that also appear in File1.txt.
This is literally my first answer, so I hope it works,
but you can try using diff:
diff file1.txt file2.txt
Did you try join?
join -o 0 File1.txt File2.txt
ami-1234567654
ami-23456
Remark: for join to work correctly, it needs your files to be sorted, which seems to be the case with your sample.
Just another option:
$ comm -1 -2 <(sort file1.txt) <(sort file2.txt)
The options specify that "unique" lines from the first file (-1) and the second file (-2) should be omitted.
This is basically the same as
$ join <(sort file1.txt) <(sort file2.txt)
Note that the sorting in both examples happens without creating an intermediate temp file, if you don't want to bother creating one.
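For reference, comm without options prints three columns: lines only in the first file, lines only in the second, and lines in both; the numeric options suppress the corresponding columns. With the sample files above:
$ comm -12 <(sort File1.txt) <(sort File2.txt)
ami-1234567654
ami-23456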
I don't know if I understood you correctly,
but you can try sorting the files first:
sort file1 > file1.sorted
sort file2 > file2.sorted
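Once both files are sorted, the common lines can be pulled out with comm, as in the other answers (a sketch continuing this approach):
comm -12 file1.sorted file2.sorted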

How to compare 2 files and replace words for matched lines in file2 in bash?

FILE1 (/var/widenet.jcml) holds the LAN's server entries while FILE2 (hosts.out) contains a list of IPs. My idea is to use FILE2 to search for IPs in FILE1 and update the entries based on matched IPs.
This is how FILE1 looks
[romolo#remo11g ~]$ grep -F -f hosts.out /var/widenet.jcml |head -2
2548,0,00:1D:55:00:D4:D1,10.0.209.76,wd18-209-76-man 91.widenet.lan,10.0.101.2,255.255.0.0,NULL,NULL,NULL,NULL,NULL,NULL,NAS,ALL
2549,0,00:1D:55:00:D4:D2,10.0.209.77,wd18-209-77-man 91.widenet.lan,10.0.101.2,255.255.0.0,NULL,NULL,NULL,NULL,NULL,NULL,NAS,ALL
While FILE2 is essentially a list of IPs, one IP per line
cat hosts.out
10.0.209.76
10.0.209.77
10.0.209.158
10.0.209.105
10.0.209.161
10.0.209.169
10.0.209.228
Basically FILE2 contains 160 IPs whose entries in /var/widenet.jcml need to be updated. Specifically, the word NAS in column 14 of /var/widenet.jcml needs to be replaced with SAS.
I came up with the following syntax; however, instead of replacing the word NAS only for the matched IPs, it replaces it in every entry in FILE1 that contains the word NAS, ignoring the list of IPs from FILE2.
grep -F -f hosts.out /var/widenet.jcml |awk -F"," '{print$4,$14}' |xargs -I '{}' sed -i 's/NAS/SAS/g' /var/widenet.jcml
I spent hours googling for an answer but I couldn't find any examples that cover search and replace between two text files. Thanks
Assuming file2 doesn't really have leading blanks (if it does it's an easy tweak to fix):
$ awk 'BEGIN{FS=OFS=","} NR==FNR{ips[$1];next} $4 in ips{$14="SAS"} 1' file2 file1
2548,0,00:1D:55:00:D4:D1,10.0.209.76,wd18-209-76-man 91.widenet.lan,10.0.101.2,255.255.0.0,NULL,NULL,NULL,NULL,NULL,NULL,SAS,ALL
2549,0,00:1D:55:00:D4:D2,10.0.209.77,wd18-209-77-man 91.widenet.lan,10.0.101.2,255.255.0.0,NULL,NULL,NULL,NULL,NULL,NULL,SAS,ALL
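To write the result back, redirect to a temporary file and move it over the original, with the question's real file names in place of file2/file1 (the temp-file name here is arbitrary):
awk 'BEGIN{FS=OFS=","} NR==FNR{ips[$1];next} $4 in ips{$14="SAS"} 1' hosts.out /var/widenet.jcml > widenet.jcml.tmp && mv widenet.jcml.tmp /var/widenet.jcml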
If I understand the question correctly, you only want to change NAS to SAS for the IP addresses found in hosts.out?
while read -r line
do
    grep "$line" file1 | sed 's/NAS/SAS/g' >> results
done < hosts.out
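Note that this collects the rewritten lines in results rather than updating file1 itself. A sketch that edits the file in place instead, assuming GNU sed for -i (the dots in the IPs are regex metacharacters, which is harmless with this data):
while read -r ip; do
    sed -i "/,${ip},/ s/,NAS,/,SAS,/" file1
done < hosts.out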

comparing two files diff awk else

Could you please quickly help me with this? I have two files with one column in each. I need to compare fileA to fileB, find out which items in FILEA are already in FILEB, and print them out to another file. So basically I'd like to find out which names they have in common.
so I have something like this
FILEA
MATT.1
HANNA.1
OTTOO.2
MARK.2
SAM.3
FILEB
SAM.3
MATT.1
JEFF.6
ALI.8
The result file should be
SAM.3
MATT.1
I was thinking of writing a shell script to cat one file and do a line-by-line comparison, but there must be a better and easier way to do this using one of the many available commands. Can you help?
Regards
This is a job for comm. The input files need to be sorted, though:
comm -12 <(sort file1) <(sort file2)
will give you the common lines.
An awk answer:
awk 'NR==FNR {f[$0]=1; next} $0 in f' fileb filea
Put the smaller file as the first argument to limit the amount of memory required.
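With the sample files above this prints the common names, in the order they appear in FILEA:
$ awk 'NR==FNR {f[$0]=1; next} $0 in f' FILEB FILEA
MATT.1
SAM.3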
This returns the lines from filea that match any line in fileb:
$ grep -Ff fileb filea
MATT.1
SAM.3
-F tells grep to look for fixed patterns, not regular expressions.
-f tells grep to get the list of patterns from a file which, in this case, is fileb.
More options
We can make the matches more restrictive with these options:
-w would tell grep to match only whole words.
-x would tell grep to match only whole lines.
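For example, with -x a pattern like MATT.1 from fileb can no longer match a hypothetical line MATT.12 in filea as a substring:
$ grep -Fxf fileb filea
MATT.1
SAM.3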

remove lines from csv file

I have 2 csv files: file1 contains 1000 email addresses and file2 contains 150 email addresses which already exist in file1.
I wonder if there is a Linux command to remove the 150 emails from file1?
I tested this:
grep -vf file2.csv file1.csv > file3.csv
and it works.
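One caveat: without further options, grep treats each line of file2.csv as a regular expression and matches substrings, so a hypothetical ann@x.com in file2 would also remove joann@x.com from file1. Adding -F and -x forces literal, whole-line matching:
grep -Fxvf file2.csv file1.csv > file3.csv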
This should work, with the added benefit of providing sorted output:
comm -23 <(sort file1) <(sort file2)

how to subtract the two files in linux

I have two files like below:
file1
"Connect" CONNECT_ID="12"
"Connect" CONNECT_ID="11"
"Connect" CONNECT_ID="122"
"Connect" CONNECT_ID="109"
file2
"Quit" CONNECT_ID="12"
"Quit" CONNECT_ID="11"
The file contents are not exactly the same as above, but similar, and there are at least 100,000 records.
Now I want to get the result shown below into file1 (meaning the final result should end up in file1):
"Connect" CONNECT_ID="122"
"Connect" CONNECT_ID="109"
I have used a while loop something like below:
awk {'print $2'} file2 | sed "s/CONNECTION_ID=//g" > sample.txt
while read actual; do
    grep -w -v $actual file1 > file1_tmp
    mv -f file1_tmp file1
done < sample.txt
Here I have adjusted my code according to the example, so it may or may not work.
My problem is that the loop takes more than an hour to complete.
So can anyone suggest how to achieve the same result in some other way, e.g. using diff, comm, sed or awk or any other Linux command that runs faster?
Mainly I want to eliminate this big while loop.
Most UNIX tools are line based, and as you don't have whole-line matches, grep, comm and diff are out the window. To extract field-based information like you want, awk is perfect:
$ awk 'NR==FNR{a[$2];next}!($2 in a)' file2 file1
"Connect" CONNECT_ID="122"
"Connect" CONNECT_ID="109"
To store the results back into file1 you'll need to redirect the output to a temporary file and then move the temporary file over file1, like so:
$ awk 'NR==FNR{a[$2];next}!($2 in a)' file2 file1 > tmp && mv tmp file1
Explanation:
The awk variable NR increments for every record read, that is, for each line in every file. The FNR variable also increments for every record, but gets reset at the start of every file.
NR==FNR # This condition is only true while reading the first file given, here file2
a[$2] # Add the second field of file2 into the array as a lookup table
next # Get the next line of file2 (skips any following blocks)
!($2 in a) # We are now reading file1; if the second field is not in the lookup
# array, execute the default block, i.e. print the line
To modify this command you just need to change the fields that are matched. In your real case, if you want to match field 1 of file2 against field 4 of file1, you would do:
$ awk 'NR==FNR{a[$1];next}!($4 in a)' file2 file1
This might work for you (GNU sed):
sed -r 's|\S+\s+(\S+)|/\1/d|' file2 | sed -f - -i file1
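To see what it does, run just the first sed on its own: each line of file2 becomes a delete command, and the second sed applies that generated script to file1 in place:
$ sed -r 's|\S+\s+(\S+)|/\1/d|' file2
/CONNECT_ID="12"/d
/CONNECT_ID="11"/d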
The tool best suited to this job is join(1). It joins two files based on values in a given column of each file. Normally it just outputs the lines that match across the two files, but it also has a mode to output the lines from one of the files that do not match the other file.
join requires that the files be sorted on the field(s) you are joining on, so either pre-sort the files, or use process substitution (a bash feature - as in the example below) to do it on the one command line:
$ join -j 2 -v 1 -o "1.1 1.2" <(sort -k2,2 file1) <(sort -k2,2 file2)
"Connect" CONNECT_ID="122"
"Connect" CONNECT_ID="109"
-j 2 says to join the files on the second field for both files.
-v 1 says to only output fields from file 1 that do not match any in file 2
-o "1.1 1.2" says to order the output with the first field of file 1 (1.1) followed by the second field of file 1 (1.2). Without this, join will output the join column first followed by the remaining columns.
You may want to scan file2 first and add every ID that appears to a cache (e.g. a set in memory),
then scan file1 line by line and check whether each ID is in the cache.
Python code like this:
#!/usr/bin/python
# -*- coding: utf-8 -*-
import re

p = re.compile(r'CONNECT_ID="(.*)"')

quit_ids = set()
for line in open('file2'):
    m = p.search(line)
    if m:
        quit_ids.add(m.group(1))

output = open('output_file', 'w')
for line in open('file1'):
    m = p.search(line)
    if m and m.group(1) not in quit_ids:
        output.write(line)
output.close()
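Since the question wants the final result in file1, move the output over it afterwards (the script name here is hypothetical):
python filter_ids.py && mv output_file file1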
The main bottleneck is not really the while loop, but the fact that you rewrite the output file thousands of times.
In your particular case, you might be able to get away with just this:
cut -f2 file2 | grep -Fwvf - file1 >tmp
mv tmp file1
(I don't think the -w option to grep is useful here, but since you had it in your example, I retained it.)
This presupposes that file2 is tab-delimited; if not, the awk '{ print $2 }' file2 you had there is fine.
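Putting those two pieces together for a space-delimited rather than tab-delimited file2:
awk '{ print $2 }' file2 | grep -Fwvf - file1 > tmp && mv tmp file1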
