How to compare 2 files and replace words for matched lines in file2 in bash? - linux

FILE1 (/var/widenet.jcml) holds the LAN's server entries while FILE2 (hosts.out) contains a list of IPs. My idea is to use FILE2 to search for IPs on FILE1 and update the entries based on matched IPs.
This is how FILE1 looks
[romolo@remo11g ~]$ grep -F -f hosts.out /var/widenet.jcml |head -2
2548,0,00:1D:55:00:D4:D1,10.0.209.76,wd18-209-76-man 91.widenet.lan,10.0.101.2,255.255.0.0,NULL,NULL,NULL,NULL,NULL,NULL,NAS,ALL
2549,0,00:1D:55:00:D4:D2,10.0.209.77,wd18-209-77-man 91.widenet.lan,10.0.101.2,255.255.0.0,NULL,NULL,NULL,NULL,NULL,NULL,NAS,ALL
While FILE2 is essentially a list of IPs, one IP per line
cat hosts.out
10.0.209.76
10.0.209.77
10.0.209.158
10.0.209.105
10.0.209.161
10.0.209.169
10.0.209.228
In total, FILE2 contains 160 IPs whose entries in /var/widenet.jcml need to be updated. Specifically, the word NAS in column 14 of /var/widenet.jcml needs to be replaced with SAS.
I came up with the following command. However, instead of replacing NAS only on the lines with matched IPs, it replaces NAS on every line of FILE1 that contains it, ignoring the list of IPs from FILE2.
grep -F -f hosts.out /var/widenet.jcml |awk -F"," '{print$4,$14}' |xargs -I '{}' sed -i 's/NAS/SAS/g' /var/widenet.jcml
I spent hours googling for an answer but I couldn't find any examples that cover search and replace between two text files. Thanks

Assuming file2 doesn't really have leading blanks (if it does it's an easy tweak to fix):
$ awk 'BEGIN{FS=OFS=","} NR==FNR{ips[$1];next} $4 in ips{$14="SAS"} 1' file2 file1
2548,0,00:1D:55:00:D4:D1,10.0.209.76,wd18-209-76-man 91.widenet.lan,10.0.101.2,255.255.0.0,NULL,NULL,NULL,NULL,NULL,NULL,SAS,ALL
2549,0,00:1D:55:00:D4:D2,10.0.209.77,wd18-209-77-man 91.widenet.lan,10.0.101.2,255.255.0.0,NULL,NULL,NULL,NULL,NULL,NULL,SAS,ALL
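The command above only prints the updated records; to keep the change, the output has to be written to a temporary file that then replaces the original. A minimal sketch on throwaway data, assuming a scratch directory and made-up host names (host-a, host-b) in place of the real ones:

```shell
#!/bin/sh
# Work in a scratch directory so nothing real is touched (hypothetical sample data).
tmpdir=$(mktemp -d)
cd "$tmpdir" || exit 1

# file2: one IP per line. file1: comma-separated records, IP in field 4, type in field 14.
printf '%s\n' 10.0.209.76 10.0.209.158 > file2
printf '%s\n' \
  '2548,0,00:1D:55:00:D4:D1,10.0.209.76,host-a,10.0.101.2,255.255.0.0,NULL,NULL,NULL,NULL,NULL,NULL,NAS,ALL' \
  '2549,0,00:1D:55:00:D4:D2,10.0.209.77,host-b,10.0.101.2,255.255.0.0,NULL,NULL,NULL,NULL,NULL,NULL,NAS,ALL' \
  > file1

# First file (NR==FNR): remember every IP from file2.
# Second file: if field 4 is a remembered IP, overwrite field 14; print every line.
# The result replaces file1 only if awk succeeded.
awk 'BEGIN{FS=OFS=","} NR==FNR{ips[$1];next} $4 in ips{$14="SAS"} 1' file2 file1 > file1.new \
  && mv file1.new file1

changed=$(awk -F, '$4=="10.0.209.76"{print $14}' file1)
untouched=$(awk -F, '$4=="10.0.209.77"{print $14}' file1)
echo "10.0.209.76 -> $changed, 10.0.209.77 -> $untouched"
```

Because the comparison is on field 4 as a whole, an IP such as 10.0.209.7 in file2 would not touch the 10.0.209.76 record, which a substring search with grep would.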

If I understand the question correctly, you only want to change NAS to SAS per IP address found in hosts.out?
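# Note: results receives only the updated lines for the listed IPs, not the whole file
while IFS= read -r line
do
    grep -F -- "$line" file1 | sed 's/NAS/SAS/g' >> results
done < hosts.out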

Related

How to compare the columns of file1 to the columns of file2, select matching values, and output to new file using grep or unix commands

I have two files, file1 and file2, where target_id is the first column in both.
I want to compare file1 to file2, and only keep the rows of file1 which match the target_id in file2.
file2:
target_id
ENSMUST00000128641.2
ENSMUST00000185334.7
ENSMUST00000170213.2
ENSMUST00000232944.2
Any help would be appreciated.
% grep -x -f file1 file2 resulted in no output in my terminal
Sample data that actually shows overlaps between the files.
file1.csv:
target_id,KO_1D_7dpi,KO_2D_7dpi
ENSMUST00000178537.2,0,0
ENSMUST00000178862.2,0,0
ENSMUST00000196221.2,0,0
ENSMUST00000179664.2,0,0
ENSMUST00000177564.2,0,0
file2.csv
target_id
ENSMUST00000178537.2
ENSMUST00000196221.2
ENSMUST00000177564.2
Your grep command, but swapped:
$ grep -F -f file2.csv file1.csv
target_id,KO_1D_7dpi,KO_2D_7dpi
ENSMUST00000178537.2,0,0
ENSMUST00000196221.2,0,0
ENSMUST00000177564.2,0,0
Edit: we can add the -F option since this is a fixed-string search. It also stops the . from being treated as a regex wildcard that matches any character. Thanks to @Sundeep for the recommendation.
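One thing to keep in mind is that grep -F still matches each ID as a substring anywhere on the line. If that is a concern, comparing the first column for exact equality with awk is one option. A small sketch in a scratch directory; the row with the ID ending in .21 is invented purely to show the difference:

```shell
#!/bin/sh
tmpdir=$(mktemp -d)
cd "$tmpdir" || exit 1

printf '%s\n' target_id ENSMUST00000178537.2 ENSMUST00000177564.2 > file2.csv
printf '%s\n' \
  'target_id,KO_1D_7dpi,KO_2D_7dpi' \
  'ENSMUST00000178537.2,0,0' \
  'ENSMUST00000178537.21,5,5' \
  'ENSMUST00000177564.2,0,0' \
  > file1.csv

# grep -F finds the ID as a substring, so the invented ".21" row is kept as well.
grep_rows=$(grep -c -F -f file2.csv file1.csv)

# awk keeps a row only if its first field is exactly one of the IDs
# (the header survives because "target_id" is also a line of file2.csv).
awk_rows=$(awk -F, 'NR==FNR{ids[$1];next} $1 in ids{n++} END{print n+0}' file2.csv file1.csv)

echo "grep kept $grep_rows rows, awk kept $awk_rows rows"
```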

delete lines based on one file contain to another

I'm trying to find a way to speed up a delete process.
Currently I have two files, file1.txt and file2.txt.
file1 contains 20-digit records, about 10,000 lines.
file2 contains records 6,500 characters long, about 2 million lines.
My goal is to delete the lines in file2 that match records in file1.
To do this I created a sed script from the records in the first file, like this:
File1:
/^20606516000100070004/d
/^20630555000100030001/d
/^20636222000800050001/d
Command used: sed -i -f file1 file2
The command works fine, but it takes about 4 hours to delete the 10,000 lines from file2.
I'm looking for a solution that can speed up the delete process.
Additional information:
Every record in file1 is definitely present in file2.
Each line of file2 always starts with a 20-digit number, which may or may not match a record in file1.
To illustrate the point above, here is a line from file2 (not the entire line; as explained, each record in file2 is 6,500 characters long):
20606516000100070004XXXXXXX19.202107.04.202105.03.202101.11.202001.11.2020WWREABBXBOU
Thanks in advance.
All you need is this, using any awk in any shell on every Unix box:
awk 'NR==FNR{a[$0]; next} !(substr($0,1,20) in a)' file1 file2
and with files such as you described on a reasonable processor it'll run in a couple of seconds rather than 4 hours.
Just make sure file1 only contains the numbers you want to match on, not a sed script using those numbers, e.g.:
$ head file?
==> file1 <==
20606516000100070004
20630555000100030001
20636222000800050001
==> file2 <==
20606516000100070004XXXXXXX19.202107.04.202105.03.202101.11.202001.11.2020WWREABBXBOU
99906516000100070004XXXXXXX19.202107.04.202105.03.202101.11.202001.11.2020WWREABBXBOU
$ awk 'NR==FNR{a[$0]; next} !(substr($0,1,20) in a)' file1 file2
99906516000100070004XXXXXXX19.202107.04.202105.03.202101.11.202001.11.2020WWREABBXBOU
You can read the first file (containing the 20-digit codes of the files to remove) like this:
while IFS= read -r code; do
< ... process the current code ... >
done < first_file.txt
To process the current code, you only need to read the first 20 characters of each file. To read these characters you could use:
var=$(head -c 20 "$curfile")
Then you can test whether the code read from the first file ($code) matches the first 20 characters read from $curfile.
if [ "$code" = "$var" ] ; then rm -v "$curfile" ; fi
Reading only the 1st 20 characters of every big file is likely to be much faster.
With GNU awk, you could try following solution too.
awk 'FNR==NR{arr[$0];next} !($1 in arr)' file1 FPAT="^.{20}" file2
Explanation: This prints the lines of file2 that have no match in file1, comparing only the first 20 characters of each file2 line against the complete lines of file1.

grep between two files

I want to find matching lines from file 2 when compared to file 1.
file2 contains multiple columns and column one contains information that could match file1.
I tried below commands and they didn't give any matching results (contents in file1 are definitely in file2) . I have used these commands previously to compare between different files and they worked.
grep -f file1 file2
grep -Fwf file1 file2
When I tried to grep for whatever does not match, I did get results:
grep -vf file1 file2
file1 contains list of genes (754 genes) , one line each
ATM
ATP5B
ATR
ATRIP
ATRX
I have a feeling the problem is with my file1. When I tried to type several items manually in my file1 just to test, and do grep with file2, I get the matching lines from file2.
When I copied the contents of file1 (originally in excel) into notepad making a .txt file, I didn't get any matching results.
I can't see any problem with my file1. Any suggestion?
You said,
I copied the contents of file1 (originally in excel) into notepad making a .txt file
It's likely that the txt file contains carriage-return/linefeed pairs which are screwing up the grep. As I suggested in a comment, try this:
tr -d '\015' < file1 > file1a
grep -Fwf file1a file2
The tr invocation deletes all the carriage returns, giving you a proper Unix/Linux text file with only newlines (\n) as line terminators.
You said:
I can't see any problem with my file1.
Here's how to see the extra-carriage-return problem:
cat -v file1
Those little ^M markers at the end of each line are cat -v's way of showing you the carriage return control codes.
Addendum:
Carriage Return (CR) is decimal 13, hex 0x0d, octal 015, \r in C.
Line Feed (LF) is decimal 10, hex 0x0a, octal 012, \n in C.
Because it's an old-school utility, tr accepts octal (base 8) notation for control characters.
(I think in some versions tr -d '\r' would work, but I'm not sure, and anyway I'm not sure what version you have. tr -d '\015' should be universal.)
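To see the effect end to end, here is a small sketch that builds a file1 with CR/LF line endings the way Notepad would save it, then compares the match count before and after stripping the carriage returns. The file contents and the scratch directory are made up for the demonstration:

```shell
#!/bin/sh
tmpdir=$(mktemp -d)
cd "$tmpdir" || exit 1

# file1 with Windows line endings (CR LF); file2 is an ordinary Unix text file.
printf 'ATM\r\nATR\r\n' > file1
printf '%s\n' 'ATM 11 x' 'ATP5B 12 y' 'ATR 3 z' > file2

# The patterns are really "ATM<CR>" and "ATR<CR>", which never occur in file2.
before=$(grep -c -Fwf file1 file2)

# Strip the carriage returns, then retry with the cleaned pattern file.
tr -d '\015' < file1 > file1a
after=$(grep -c -Fwf file1a file2)

echo "matches before: $before, after: $after"
```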
Simple shell script that runs grep for every line of file1.txt:
#!/bin/bash
# Read file1.txt line by line and record each entry that occurs in file2.txt.
while IFS= read -r content; do
    if grep -q -- "$content" file2.txt; then
        echo "$content was found in file2" >> results.txt
    fi
done < file1.txt
Let's suppose this is file2:
$ cat file2
a b ATM
c d e
f ATR g
Using grep and process substitution
We can get lines from file1 that match any of the columns in file2 via:
$ grep -wFf <(sed 's/[[:space:]]/\n/g' file2) file1
ATM
ATR
This works because it converts file2 to a form that grep understands:
$ sed 's/[[:space:]]/\n/g' file2
a
b
ATM
c
d
e
f
ATR
g
Using awk
$ awk 'FNR==NR{for (i=1;i<=NF;i++) seen[$i]; next} $0 in seen' file2 file1
ATM
ATR
Here, awk keeps track of every column that it sees in file2 and then prints only those lines in file1 that match one of those columns.
Try the command
comm
It compares two sorted files line by line and can report the lines they have in common, which makes it roughly the reverse of diff.
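A rough sketch of how comm could be used. Note that it compares whole lines and needs sorted input, so it only fits when both files hold one gene per line; with a multi-column file2 the gene column would have to be extracted first. The file names and contents here are invented:

```shell
#!/bin/sh
tmpdir=$(mktemp -d)
cd "$tmpdir" || exit 1

printf '%s\n' ATR ATM ATRX > genes
printf '%s\n' ATM XYZ ATR > other

# comm requires both inputs sorted in the same collation order.
LC_ALL=C sort genes > genes.sorted
LC_ALL=C sort other > other.sorted

# -1 hides lines unique to the first file and -2 hides lines unique to the second,
# leaving only the lines present in both.
common=$(LC_ALL=C comm -12 genes.sorted other.sorted)

# -1 and -3 together leave only the lines unique to the second file.
only_other=$(LC_ALL=C comm -13 genes.sorted other.sorted)

printf 'common:\n%s\nonly in other:\n%s\n' "$common" "$only_other"
```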

grep a large list against a large file

I am currently trying to grep a large list of ids (~5000) against an even larger csv file (3.000.000 lines).
I want all the csv lines, that contain an id from the id file.
My naive approach was:
cat the_ids.txt | while read line
do
cat huge.csv | grep $line >> output_file
done
But this takes forever!
Are there more efficient approaches to this problem?
Try
grep -f the_ids.txt huge.csv
Additionally, since your patterns seem to be fixed strings, supplying the -F option might speed up grep.
-F, --fixed-strings
Interpret PATTERN as a list of fixed strings, separated by
newlines, any of which is to be matched. (-F is specified by
POSIX.)
Use grep -f for this:
grep -f the_ids.txt huge.csv > output_file
From man grep:
-f FILE, --file=FILE
Obtain patterns from FILE, one per line. The empty file contains zero
patterns, and therefore matches nothing. (-f is specified by POSIX.)
If you provide some sample input maybe we can even improve the grep condition a little more.
Test
$ cat ids
11
23
55
$ cat huge.csv
hello this is 11 but
nothing else here
and here 23
bye
$ grep -f ids huge.csv
hello this is 11 but
and here 23
grep -f filter.txt data.txt can become slow when filter.txt is larger than a couple of thousand lines, so it isn't the best choice for such a situation. Even when using grep -f, we need to keep a few things in mind:
use -x option if there is a need to match the entire line in the second file
use -F if the first file has strings, not patterns
use -w to prevent partial matches while not using the -x option
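The difference between these options is easiest to see on a tiny example. The sketch below uses a single made-up pattern, 1.1, and four made-up data lines, and counts how many lines each variant selects:

```shell
#!/bin/sh
tmpdir=$(mktemp -d)
cd "$tmpdir" || exit 1

printf '%s\n' '1.1' > filter.txt
printf '%s\n' '1.1' '1x1' 'v 1.1 ok' '21.15' > data.txt

# Regex, substring: the dot matches any character, so all four lines match.
regex=$(grep -c -f filter.txt data.txt)

# Fixed string, substring: "1x1" drops out, but "21.15" still contains "1.1".
fixed=$(grep -c -F -f filter.txt data.txt)

# Fixed string, whole word: "21.15" drops out because digits surround the match.
word=$(grep -c -F -w -f filter.txt data.txt)

# Fixed string, whole line: only the line that is exactly "1.1" remains.
line=$(grep -c -F -x -f filter.txt data.txt)

echo "regex=$regex fixed=$fixed word=$word line=$line"
```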
This post has a great discussion on this topic (grep -f on large files):
Fastest way to find lines of a file from another larger file in Bash
And this post talks about grep -vf:
grep -vf too slow with large files
In summary, the best way to handle grep -f on large files is:
Matching entire line:
awk 'FNR==NR {hash[$0]; next} $0 in hash' filter.txt data.txt > matching.txt
Matching a particular field in the second file (using ',' delimiter and field 2 in this example):
awk -F, 'FNR==NR {hash[$1]; next} $2 in hash' filter.txt data.txt > matching.txt
and for grep -vf:
Matching entire line:
awk 'FNR==NR {hash[$0]; next} !($0 in hash)' filter.txt data.txt > not_matching.txt
Matching a particular field in the second file (using ',' delimiter and field 2 in this example):
awk -F, 'FNR==NR {hash[$1]; next} !($2 in hash)' filter.txt data.txt > not_matching.txt
You may get a significant search speedup with ugrep to match the strings in the_ids.txt in your large huge.csv file:
ugrep -F -f the_ids.txt huge.csv
This works with GNU grep too, but I expect ugrep to run several times faster.

how to subtract the two files in linux

I have two files like below:
file1
"Connect" CONNECT_ID="12"
"Connect" CONNECT_ID="11"
"Connect" CONNECT_ID="122"
"Connect" CONNECT_ID="109"
file2
"Quit" CONNECT_ID="12"
"Quit" CONNECT_ID="11"
The file contents are not exactly the same as above but similar, and there are at least 100,000 records.
Now I want to get the result shown below into file1 (that is, the final result should end up in file1):
"Connect" CONNECT_ID="122"
"Connect" CONNECT_ID="109"
I have used a while loop something like below:
awk {'print $2'} file2 | sed "s/CONNECTION_ID=//g" > sample.txt
while read actual; do
grep -w -v $actual file1 > file1_tmp
mv -f file1_tmp file1
done < sample.txt
Here I have adjusted my code to fit the example, so it may or may not work.
My problem is that the loop takes more than 1 hour to complete.
So can anyone suggest how to achieve the same in some other way, such as using diff, comm, sed, awk or any other Linux command that will run faster?
Mainly I want to eliminate this big while loop.
Most UNIX tools are line based and as you don't have whole line matches that means grep, comm and diff are out the window. To extract field based information like you want awk is perfect:
$ awk 'NR==FNR{a[$2];next}!($2 in a)' file2 file1
"Connect" CONNECT_ID="122"
"Connect" CONNECT_ID="109"
To store the results back to file1 you'll need to redirect the output to a temporary file and then move that file over file1, like so:
$ awk 'NR==FNR{a[$2];next}!($2 in a)' file2 file1 > tmp && mv tmp file1
Explanation:
The awk variable NR increments for every record read, that is each line in every file. The FNR variable increments for every record but gets reset for every file.
NR==FNR # This condition is only true while reading the first file named (file2)
a[$2] # Add the second field of that line to the array as a lookup table
next # Get the next line (skips any following blocks)
!($2 in a) # We are now reading the second file named (file1); if the second field is not in
# the lookup array, execute the default block, i.e. print the line
To adapt this command you just need to change the fields being compared. In your real case, if you want to match field 1 of file2 against field 4 of file1, you would do:
$ awk 'NR==FNR{a[$1];next}!($4 in a)' file2 file1
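If the NR==FNR test still feels opaque, it can help to print both counters while awk reads two files. A small sketch with made-up file names (first, second) in a scratch directory:

```shell
#!/bin/sh
tmpdir=$(mktemp -d)
cd "$tmpdir" || exit 1

printf '%s\n' a b > first
printf '%s\n' c d e > second

# For every record, print the file name, the global record number (NR),
# the per-file record number (FNR) and whether NR==FNR currently holds.
trace=$(awk '{print FILENAME, NR, FNR, (NR==FNR ? "first-file" : "later-file")}' first second)
printf '%s\n' "$trace"
```

The two counters agree only while the first file named on the command line is being read. One caveat: if that first file is empty, NR==FNR is also true for the second file, so the lookup table never gets used as intended.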
This might work for you (GNU sed):
sed -r 's|\S+\s+(\S+)|/\1/d|' file2 | sed -f - -i file1
The tool best suited to this job is join(1). It joins two files based on values in a given column of each file. Normally it just outputs the lines that match across the two files, but it also has a mode to output the lines from one of the files that do not match the other file.
join requires that the files be sorted on the field(s) you are joining on, so either pre-sort the files, or use process substitution (a bash feature - as in the example below) to do it on the one command line:
$ join -j 2 -v 1 -o "1.1 1.2" <(sort -k2,2 file1) <(sort -k2,2 file2)
"Connect" CONNECT_ID="122"
"Connect" CONNECT_ID="109"
-j 2 says to join the files on the second field for both files.
-v 1 says to only output lines from file 1 that do not match any in file 2
-o "1.1 1.2" says to order the output with the first field of file 1 (1.1) followed by the second field of file 1 (1.2). Without this, join will output the join column first followed by the remaining columns.
You could scan file2 first and store every ID that appears in it in a cache (e.g. in memory).
Then scan file1 line by line and check whether each ID is in the cache.
Python code like this:
#!/usr/bin/python
# -*- coding: utf-8 -*-
import re
p = re.compile(r'CONNECT_ID="(.*)"')
quit_ids = set([])
for line in open('file2'):
    m = p.search(line)
    if m:
        quit_ids.add(m.group(1))
output = open('output_file', 'w')
for line in open('file1'):
    m = p.search(line)
    if m and m.group(1) not in quit_ids:
        output.write(line)
output.close()
The main bottleneck is not really the while loop, but the fact that you rewrite the output file thousands of times.
In your particular case, you might be able to get away with just this:
cut -f2 file2 | grep -Fwvf - file1 >tmp
mv tmp file1
(I don't think the -w option to grep is useful here, but since you had it in your example, I retained it.)
This presupposes that file2 is tab-delimited; if not, the awk '{ print $2 }' file2 you had there is fine.
