Diff 2 settings files and replace the difference - linux

I have 2 files with settings, file1.txt and file2.txt, shown here side by side (a "." marks a line that is absent from that file):
file1.txt    file2.txt
A=1          A=2
B=3          B=3
C=5          C=4
D=6          .
.            E=7
I am looking for the best approach to replace the values in file1.txt with the differing values from file2.txt, so that file1.txt would look like:
file1.txt:
A=2
B=3
C=4
D=6
E=7
I haven't written any code yet, but the only approach I can think of is a bash script that diffs both files (provided as positional arguments) and uses sed to replace the non-matching strings. Something in this vein:
./diffreplace.bash file1.txt file2.txt > NEWfile1.txt
I wonder whether something more elegant already exists?

All of the following solutions may change the order of assignments. I assumed that would be ok.
Lazy Solution
If you use these assignments in some way that allows overwriting, then you can simply append file2 to the end of file1. All old values will be overwritten by the new ones when you execute the result.
cat old new > result
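If, for example, the result is later sourced by a shell, the second assignment of each variable simply wins. A minimal demonstration, assuming the files use POSIX-shell assignment syntax as in the sample:
$ cat file1.txt file2.txt > result
$ . ./result
$ echo "$A $C"
2 4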
Slightly Better Solution
Extending the previous approach, you can pipe the combined lines through awk and, for every variable, keep only the first assignment seen (listing the new file first, so its values win):
cat new old |
awk -F= '!seen[$1]++'
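On the sample files (new being file2.txt, old being file1.txt), this prints:
A=2
B=3
C=4
E=7
D=6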
Alternative Solution
Use join to combine both files, then use cut to drop the values that came from the old file (join prints the new file's value first). If your files are already sorted, use
join -t= -a1 -a2 new old | cut -d= -f1,2
if not, use
join -t= -a1 -a2 <(sort new) <(sort old) | cut -d= -f1,2
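On the sample files this yields exactly the desired result:
A=2
B=3
C=4
D=6
E=7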

I'm a little puzzled by your comment that the structure of the file must remain untouched. sort mixes the order, so I'm assuming that the As are always on line 1, or line 1 is ".", etc.:
$ awk '
BEGIN { RS="\r?\n" } # in case of Windows line-endings
$0!="." { # we dont store . (change it to null if you need to)
a[FNR]=$0 # hash using line number as key
}
END { # after all that hashing
for(i=1;i<=FNR;i++) # iterate in line number order
print a[i] # output the last version stored
}' file1 file2 # mind the file order
Output:
A=2
B=3
C=4
D=6
E=7
Edit: A version with a whitelist:
$ cat whitelist
A
B
E
Script:
$ awk -F= '
NR==FNR { # process the whitelist
a[FNR]=$1 # in a, the key is the line number and the record is the value
b[$1]=FNR # in b, the record is the key and the line number is the value
n=FNR # remember the count for END
next
} # process file1 and file2 ... filen
($1 in b) { # if record is found in b
a[b[$1]]=$0 # we set the record to a[linenumber]=record
}
END {
for(i=1;i<=n;i++) # here we loop on linenumbers, 1 to n
print a[i]
}' whitelist file1 file2
Output:
A=2
B=3
E=7

Related

compare two columns in two different files in shell script

There is a file1 as below:
21,2018042100
22,2018042101
87,2018042102
98,2018042103
There is a file2 as below:
45,2018042100
86,2018042102
87,2018042103
What I need is (file3):
2018042100,21,45
2018042101,22,0
2018042102,87,86
2018042103,98,87
In row 2 of file3, the data for 2018042101 exists in file1 but not in file2, so 0 is inserted in column 3, which belongs to file2.
Kindly assist me in finding out how I can create a file like file3.
Thanks.
join seems made for that problem:
join -t',' -a 1 -a 2 -j 2 file1 file2
2018042100,21,45
2018042101,22
2018042102,87,86
2018042103,98,87
except for the missing ",0" in line 2. Maybe you can find a solution to that in the man page too; otherwise you can use sed to correct the issue:
join -t',' -a 1 -a 2 -e "0" -j 2 file1 file2 | sed -r 's/^[^,]+,[^,]+$/&,0/'
2018042100,21,45
2018042101,22,0
2018042102,87,86
2018042103,98,87
Another using awk:
$ awk 'BEGIN{FS=OFS=","}NR==FNR{a[$2]=$1;next}{print $2,$1,(a[$2]+0)}' file2 file1
2018042100,21,45
2018042101,22,0
2018042102,87,86
2018042103,98,87
Explained:
$ awk '
BEGIN {
FS=OFS="," # set field separators
}
NR==FNR { # process first file
a[$2]=$1 # hash value on date
next # process next record in first file
}
{ # process second file
print $2,$1,(a[$2]+0) # output date, value from file1, and the value hashed from file2 (0 if absent)
}' file2 file1 # mind the file order
Note that (a[$2]+0) expects the first field value to be a number, as in your example. Any other value will produce 0.
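If the first field may be non-numeric, a variant with an explicit membership test (a sketch, not from the original answer) avoids the numeric coercion:
$ awk 'BEGIN{FS=OFS=","}NR==FNR{a[$2]=$1;next}{print $2,$1,($2 in a?a[$2]:0)}' file2 file1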

How to add data beside each other in a csv file

If I have 3 csv files, and I want to merge the data all into one, but beside each other, how would I do it? For example:
Initial Merged file:
,,,,,,,,,,,,
File 1:
20,09/05,5694
20,09/06,3234
20,09/08,2342
File 2:
20,09/05,2341
20,09/06,2334
20,09/09,342
File 3:
20,09/05,1231
20,09/08,3452
20,09/10,2345
20,09/11,372
Final merged File:
09/05,5694,,,09/05,2341,,,09/05,1231
09/06,3234,,,09/06,2334,,,09/08,3452
09/08,2342,,,09/09,342,,,09/10,2345
,,,,,,,,09/11,372
Basically data from each file goes into a specific column of the merged file.
I know awk can be used for this, but I have no clue how to start
EDIT: Only the 2nd and 3rd Columns of each file are being printed. I was using this to print out the 2nd and 3rd columns:
awk -v f="${i}" -F, 'match ($0,f) { print $2","$3 }' file3.csv > d$i.csv
However, if, say, file1 and file2 were null in that row, the data for that row would be shifted to the left, so I came up with this to account for the shift:
awk -v x="${i}" -F, 'match($0,x) { if ($2 == "NULL") print ","; else print $2 "," $3 }' alld.csv > d$i.csv
Using GNU awk for ARGIND:
$ gawk '{ a[FNR,ARGIND]=$0; maxFnr=(FNR>maxFnr?FNR:maxFnr) }
END {
for (i=1;i<=maxFnr;i++) {
for (j=1;j<ARGC;j++)
printf "%s%s", (j==1?"":",,,"), (a[i,j]?a[i,j]:",")
print ""
}
}
' file1 file2 file3
09/05,5694,,,09/05,2341,,,09/05,1231
09/06,3234,,,09/06,2334,,,09/08,3452
09/08,2342,,,09/09,342,,,09/10,2345
,,,,,,,,09/11,372
If you don't have GNU awk, just add an initial line that says FNR==1{ARGIND++}.
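For reference, a sketch of that portable variant (the FNR==1{ARGIND++} line emulates GNU awk's ARGIND in any POSIX awk):
$ awk 'FNR==1{ARGIND++} { a[FNR,ARGIND]=$0; maxFnr=(FNR>maxFnr?FNR:maxFnr) }
END {
for (i=1;i<=maxFnr;i++) {
for (j=1;j<ARGC;j++)
printf "%s%s", (j==1?"":",,,"), (a[i,j]?a[i,j]:",")
print ""
}
}
' file1 file2 file3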
Commented version per request:
$ gawk '
{ a[FNR,ARGIND]=$0; # Store the current line in a 2-D array `a` indexed by
# the current line number `FNR` and file number `ARGIND`.
maxFnr=(FNR>maxFnr?FNR:maxFnr) # save the max FNR value
}
END{
for (i=1;i<=maxFnr;i++) { # Loop from 1 to max number of lines
# seen across all files and for each:
for (j=1;j<ARGC;j++) # Loop from 1 to total number of files parsed and:
printf "%s%s", # Print 2 strings, specifically:
(j==1?"":",,,"), # A field separator - empty if were printing
# the first field, three commas otherwise.
(a[i,j]?a[i,j]:",") # The value stored in the array if it was
# present in the files, a comma otherwise.
print "" # Print a newline
}
}
' file1 file2 file3
I originally was using an array fnr[FNR] to track the max value of FNR, but IMHO that's kinda obscure, and it has a flaw: if no file had, say, a 2nd line, then a loop of for (i=1;i in fnr;i++) in the END section would bail out before getting to the 3rd line.
paste is made for this.
$ paste -d";" f1 f2 f3 | sed 's/;/,,,/g'
09/05,5694,,,09/05,2341,,,09/05,1231
09/06,3234,,,09/06,2334,,,09/08,3452
09/08,2342,,,09/09,342,,,09/10,2345
,,,,,,09/11,372
Note that the paste alone will output just one comma:
$ paste -d, f1 f2 f3
09/05,5694,09/05,2341,09/05,1231
09/06,3234,09/06,2334,09/08,3452
09/08,2342,09/09,342,09/10,2345
,,09/11,372
So to get multiple commas we can use another delimiter like ";" and then replace it with ",,," using sed:
$ paste -d";" f1 f2 f3 | sed 's/;/,,,/g'
09/05,5694,,,09/05,2341,,,09/05,1231
09/06,3234,,,09/06,2334,,,09/08,3452
09/08,2342,,,09/09,342,,,09/10,2345
,,,,,,09/11,372
Using pr:
$ pr -mts',,,' file[1-3]
09/05,5694,,,09/05,2341,,,09/05,1231
09/06,3234,,,09/06,2334,,,09/08,3452
09/08,2342,,,09/09,342,,,09/10,2345
,,,,,,09/11,372

Scalable way of deleting all lines from a file where the line starts with one of many values

Given an input file of variable values (example):
A
B
D
What is a script to remove all lines from another file which start with one of the above values? For example, the file contents:
A
B
C
D
Would end up being:
C
The input file is of the order of 100,000 variable values. The file to be mangled is of the order of several million lines.
awk '
NR==FNR { # IF this is the first file in the arg list THEN
list[$0] # store the contents of the current record as an index of array "list"
next # skip the rest of the script and so move on to the next input record
} # ENDIF
{ # This MUST be the second file in the arg list
for (i in list) # FOR each index "i" in array "list" DO
if (index($0,i) == 1) # IF "i" starts at the 1st char on the current record THEN
next # move on to the next input record
}
1 # Specify a true condition and so invoke the default action of printing the current record.
' file1 file2
An alternative approach to building up an array and then doing a string comparison on each element would be to build up a Regular Expression, e.g.:
...
list = list "|" $0
...
and then doing an RE comparison:
...
if ($0 ~ list)
next
...
but I'm not sure that'd be any faster than the loop and you'd then have to worry about RE metacharacters appearing in file1.
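For completeness, a full version of that sketch (assuming the values in file1 contain no RE metacharacters), with the expression anchored to the start of the line:
awk '
NR==FNR { list = (list=="" ? $0 : list "|" $0); next } # build e.g. A|B|D
$0 !~ ("^(" list ")") # print lines not starting with any listed value
' file1 file2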
If all of your values in file1 are truly single characters, though, then this approach of creating a character list to use in an RE comparison might work well for you:
awk 'NR==FNR{list = list $0; next} $0 !~ "^[" list "]"' file1 file2
You can also achieve this using egrep:
egrep -vf <(sed 's/^/^/' file1) file2
Let's see it in action:
$ cat file1
A
B
$ cat file2
Asomething
B1324
C23sd
D2356A
Atext
CtestA
EtestB
Bsomething
$ egrep -vf <(sed 's/^/^/' file1) file2
C23sd
D2356A
CtestA
EtestB
This would remove lines that start with one of the values in file1.
You can use comm to display the lines that are not common to both files, like this:
comm -3 file1 file2
Will print (with a leading tab, because C comes from the second column):
C
Notice that for this to work, both files have to be sorted; if they aren't, you can get around that using
comm -3 <(sort file1) <(sort file2)

extracting data from two lists using a shell script

I am trying to create a shell script that pulls a line from a file and checks another file for an instance of the same. If it finds an entry, then it adds it to another file, and it loops through the first list until it has gone through the whole file. The data in the first file looks like this -
email@address.com;
email2@address.com;
and so on
The other file in which I am looking for a match and placing the match in the blank file looks like this -
12334 email@address.com;
32213 email2@address.com;
I want it to retain the numbers as well as the matching data. I have an idea of how this should work but need to know how to implement it.
My Idea
#!/bin/bash
read -p "enter first file name:" file1
read -p "enter second file name:" file2
FILE_DATA=( $( /bin/cat $file1))
FILE_DATA1=( $( /bin/cat $file2))
for I in $((${#FILE_DATA[@]}))
do
echo $FILE_DATA[$i] | grep $FILE_DATA1[$i] >> output.txt
done
I want the output to look like this but only for addresses that match -
12334 email@address.com;
32213 email2@address.com;
Thank You
I quite like manipulating text the SQL way:
$ cat file1
b@address.com
a@address.com
c@address.com
d@address.com
$ cat file2
10712 e@address.com
11457 b@address.com
19985 f@address.com
22519 d@address.com
$ join -1 1 -2 2 <(sort file1) <(sort -k2 file2) | awk '{print $2,$1}'
11457 b@address.com
22519 d@address.com
make keys sorted (we use emails as keys here)
join on keys (file1.column1, file2.column2)
format output (use awk to reverse the columns)
As you've learned about diff and comm, now it's time to learn about another tool in the unix toolbox, join.
Join does just what the name indicates, it joins together 2 files. The way you join is based on keys embedded in the file.
The number one constraint on using join is that the data must be sorted in both files on the join column.
file1
a abc
b bcd
c cde
file2
a rec1
b rec2
c rec3
join file1 file2
a abc rec1
b bcd rec2
c cde rec3
You can consult the join man page for how to reduce and reorder the columns of output. For example:
$ join -o 1.1,2.2 file1 file2
a rec1
b rec2
c rec3
You can use your code for file name input to turn this into a generalizable script.
Your solution using a pipeline inside a for loop will work for small sets of data, but as the size of data grows, the cost of starting a new process for each word you are searching for will drag down the run time.
I hope this helps.
Read file1.txt line by line, assigning each line to the var ADDR; grep file2.txt for the content of ADDR and append the output to file_result.txt:
(while read ADDR; do grep "${ADDR}" file2.txt >> file_result.txt; done) < file1.txt
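Because email addresses contain dots, which grep treats as regex metacharacters, a fixed-string variant is safer (a sketch; -F makes the match literal and read -r keeps backslashes intact):
(while read -r ADDR; do grep -F "${ADDR}" file2.txt; done) < file1.txt > file_result.txt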
This awk one-liner can help you do that -
awk 'NR==FNR{a[$1]++;next}($2 in a){print $0 > "f3.txt"}' f1.txt f2.txt
NR and FNR are awk's built-in variables that store the line numbers. NR does not get reset to 0 when moving on to the second file; FNR does. So while NR==FNR is true, we add everything to the array a. Once the first file is completed, we check the second column of the second file. If a match is present in the array, we put the entire line in the file f3.txt; if not, we ignore it.
Using data from Kev's solution:
[jaypal:~/Temp] cat f1.txt
b@address.com
a@address.com
c@address.com
d@address.com
[jaypal:~/Temp] cat f2.txt
10712 e@address.com
11457 b@address.com
19985 f@address.com
22519 d@address.com
[jaypal:~/Temp] awk 'NR==FNR{a[$1]++;next}($2 in a){print $0 > "f3.txt"}' f1.txt f2.txt
[jaypal:~/Temp] cat f3.txt
11457 b@address.com
22519 d@address.com

How to remove the lines which appear on file B from another file A?

I have a large file A (consisting of emails), one line for each mail. I also have another file B that contains another set of mails.
Which command would I use to remove all the addresses that appear in file B from file A?
So, if file A contained:
A
B
C
and file B contained:
B
D
E
Then file A should be left with:
A
C
Now, I know this question may have been asked before, but the only command I found online gave me an error about a bad delimiter.
Any help would be much appreciated! Somebody will surely come up with a clever one-liner, but I'm no shell expert.
If the files are sorted (they are in your example):
comm -23 file1 file2
-23 suppresses the lines that are in both files, or only in file 2. If the files are not sorted, pipe them through sort first...
See the comm man page for details.
grep -Fvxf <lines-to-remove> <all-lines>
works on non-sorted files (unlike comm)
maintains the order
is POSIX
Example:
cat <<EOF > A
b
1
a
0
01
b
1
EOF
cat <<EOF > B
0
1
EOF
grep -Fvxf B A
Output:
b
a
01
b
Explanation:
-F: use literal strings instead of the default BRE
-x: only consider matches that match the entire line
-v: print non-matching
-f file: take patterns from the given file
This method is slower on pre-sorted files than other methods, since it is more general. If speed matters as well, see: Fast way of finding lines in one file that are not in another?
Here's a quick bash automation for in-place operation:
remove-lines() (
remove_lines="$1"
all_lines="$2"
tmp_file="$(mktemp)"
grep -Fvxf "$remove_lines" "$all_lines" > "$tmp_file"
mv "$tmp_file" "$all_lines"
)
usage:
remove-lines lines-to-remove remove-from-this-file
See also: https://unix.stackexchange.com/questions/28158/is-there-a-tool-to-get-the-lines-in-one-file-that-are-not-in-another
awk to the rescue!
This solution doesn't require sorted inputs. You have to provide fileB first.
awk 'NR==FNR{a[$0];next} !($0 in a)' fileB fileA
returns
A
C
How does it work?
The NR==FNR{a[$0];next} idiom stores the first file in an associative array, as keys, for a later "contains" test.
NR==FNR checks whether we're scanning the first file, where the global line counter (NR) equals the current file's line counter (FNR).
a[$0] adds the current line to the associative array as key, note that this behaves like a set, where there won't be any duplicate values (keys)
!($0 in a): now we're in the next file(s); in is a containment test, here checking whether the current line is in the set we populated from the first file, and ! negates the condition. What is missing here is the action, which defaults to {print} and is usually not written explicitly.
Note that this can now be used to remove blacklisted words.
$ awk '...' badwords allwords > goodwords
With a slight change it can clean multiple lists and create cleaned versions:
$ awk 'NR==FNR{a[$0];next} !($0 in a){print > FILENAME".clean"}' bad file1 file2 file3 ...
Another way to do the same thing (also requires sorted input):
join -v 1 fileA fileB
In Bash, if the files are not pre-sorted:
join -v 1 <(sort fileA) <(sort fileB)
If your files are not sorted, you can use diff:
diff file-a file-b --new-line-format="" --old-line-format="%L" --unchanged-line-format="" > file-a.tmp && mv file-a.tmp file-a
(Redirecting straight back to file-a would truncate it before diff reads it, hence the temporary file.)
--new-line-format is for lines that are in file b but not in a
--old-.. is for lines that are in file a but not in b
--unchanged-.. is for lines that are in both.
%L makes it so the line is printed exactly.
See man diff for more details.
This refinement of @karakfa's nice answer may be noticeably faster for very large files. As with that answer, neither file need be sorted, but speed is assured by virtue of awk's associative arrays. Only the lookup file is held in memory.
This formulation also allows for the possibility that only one particular field ($N) in the input file is to be used in the comparison.
# Print lines in the input unless the value in column $N
# appears in a lookup file, $LOOKUP;
# if $N is 0, then the entire line is used for comparison.
awk -v N=$N -v lookup="$LOOKUP" '
BEGIN { while ( getline < lookup ) { dictionary[$0]=$0 } }
!($N in dictionary) {print}'
(Another advantage of this approach is that it is easy to modify the comparison criterion, e.g. to trim leading and trailing white space.)
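For instance, to drop input lines whose second field appears in a lookup file (blacklist.txt and input.txt are illustrative names):
awk -v N=2 -v lookup="blacklist.txt" '
BEGIN { while ( getline < lookup ) { dictionary[$0]=$0 } }
!($N in dictionary) {print}' input.txt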
You can use Python:
python -c '
lines_to_remove = set()
with open("file B", "r") as f:
    for line in f.readlines():
        lines_to_remove.add(line.strip())
with open("file A", "r") as f:
    for line in [line.strip() for line in f.readlines()]:
        if line not in lines_to_remove:
            print(line)
'
You can use -
diff fileA fileB | grep "^<" | cut -c3- > fileA.tmp && mv fileA.tmp fileA
This will work for files that are not sorted as well. (grep "^<" keeps the lines unique to fileA; writing straight back to fileA would truncate it while diff is still reading it.)
Just to add to the Python answer above, here is a faster solution:
python -c '
lines_to_remove = None
with open("partial file") as f:
    lines_to_remove = {line.rstrip() for line in f.readlines()}
remaining_lines = None
with open("full file") as f:
    remaining_lines = {line.rstrip() for line in f.readlines()} - lines_to_remove
with open("output file", "w") as f:
    for line in remaining_lines:
        f.write(line + "\n")
'
Harnessing the power of set subtraction.
To get the file after removing the lines which appear in another file:
comm -23 <(sort bigFile.txt) <(sort smallfile.txt) > diff.txt
Here is a one-liner that pipes the output of a website through lynx and removes the navigation elements using grep! You can replace lynx with cat FileA and unwanted-elements.txt with FileB.
lynx -dump -accept_all_cookies -nolist -width 1000 https://stackoverflow.com/ | grep -Fxvf unwanted-elements.txt
To remove common lines between two files, you can use the grep, comm, or join commands.
grep only works for small files. Use -v along with -f.
grep -vf file2 file1
This displays lines from file1 that do not match any line in file2.
comm is a utility command that works on lexically sorted files. It
takes two files as input and produces three text columns as output:
lines only in the first file; lines only in the second file; and lines
in both files. You can suppress printing of any column by using -1, -2
or -3 option accordingly.
comm -1 -3 file2 file1
This displays lines from file1 that do not match any line in file2.
Finally, there is join, a utility command that performs an equality
join on the specified files. Its -v option also allows removing
common lines between two files.
join -v1 -v2 file1 file2
