Manipulating files using awk in Linux

I have a 1.txt file (with field separator as ||o||):
aidagolf6#gmail.com||o||bb1e6b92d60454122037f302359d8a53||o||Aida||o||Aida||o||Muji?
aidagolf6#gmail.com||o||bcfddb5d06bd02b206ac7f9033f34677||o||Aida||o||Aida||o||Muji?
aidagolf6#gmail.com||o||bf6265003ae067b19b88fa4359d5c392||o||Aida||o||Aida||o||Garic Gara
aidagolf6#gmail.com||o||d3a6a8b1ed3640188e985f8a1efbfe22||o||Aida||o||Aida||o||Muji?
aidagolfa#hotmail.com||o||14f87ec1e760d16c0380c74ec7678b04||o||Aida||o||Aida||o||Rodriguez Puerto
2.txt (with field separator as :):
bf6265003ae067b19b88fa4359d5c392:hyworebu:#
14f87ec1e760d16c0380c74ec7678b04:sujycugu
I want a result.txt file, which matches the 2nd column of 1.txt against the 1st column of 2.txt and, where they match, replaces the 2nd column of 1.txt with everything after the first ':' of the matching 2.txt line:
aidagolf6#gmail.com||o||hyworebu:#||o||Aida||o||Aida||o||Garic Gara
aidagolfa#hotmail.com||o||sujycugu||o||Aida||o||Aida||o||Rodriguez Puerto
And a left.txt file, which contains the rows of 1.txt that have no match in 2.txt:
aidagolf6#gmail.com||o||d3a6a8b1ed3640188e985f8a1efbfe22||o||Aida||o||Aida||o||Muji?
aidagolf6#gmail.com||o||bb1e6b92d60454122037f302359d8a53||o||Aida||o||Aida||o||Muji?
aidagolf6#gmail.com||o||bcfddb5d06bd02b206ac7f9033f34677||o||Aida||o||Aida||o||Muji?
The script I am trying is:
awk -F '[|][|]o[|][|]' -v s1="||o||" '
NR==FNR {
a[$2] = $1;
b[$2]= $3s1$4s1$5;
next
}
($1 in a){
$1 = "";
sub(/:/, "")
print a[$1]s1$2s1b[$1] > "result.txt";
next
}' 1.txt 2.txt
The problem is that the script applies the ||o|| field separator to 2.txt as well, which gives wrong results.
EDIT
Modified script:
awk -v s1="||o||" '
NR==FNR {
a[$2] = $1;
b[$2]= $3s1$4s1$5;
next
}
($1 in a){
$1 = "";
sub(/:/, "")
print a[$1]s1$2s1b[$1] > "result.txt";
next
}' FS = "||o||" 1.txt FS = ":" 2.txt
Now I am getting the following error:
awk: fatal: cannot open file `FS' for reading (No such file or
directory)
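The error happens because the space-separated tokens FS, =, and "||o||" are parsed as ordinary arguments, so awk tries to open a file named FS. Written without spaces, per-file FS assignments do work; here is a minimal sketch with hypothetical sample files (note the first FS must be given as a regex, since | is a metacharacter):

```shell
# Per-file variable assignments: no spaces around '=' is mandatory.
printf '%s\n' 'user@example.com||o||h1||o||Aida||o||Aida||o||M' > sample1.txt
printf '%s\n' 'h1:secret' > sample2.txt
merged=$(awk -v s1="||o||" '
NR==FNR { a[$2] = $1; b[$2] = $3 s1 $4 s1 $5; next }   # sample1.txt, split on ||o||
($1 in a) { print a[$1] s1 $2 s1 b[$1] }               # sample2.txt, split on :
' FS='[|][|]o[|][|]' sample1.txt FS=':' sample2.txt)
echo "$merged"
```

This only keeps the text up to the second ':' of a 2.txt line; lines with extra colons still need the substr/index handling shown in the answers below.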

I've modified your original script:
awk -F'[|][|]o[|][|]' -v s1="||o||" '
NR == FNR {
a[$2] = $1;
b[$2] = $3 s1 $4 s1 $5;
c[$2] = $0; # keep the line for left.txt
}
NR != FNR {
split($0, d, ":");
r = substr($0, index($0, ":") + 1); # right side of the 1st ":"
if (a[d[1]] != "") {
print a[d[1]] s1 r s1 b[d[1]] > "result.txt";
c[d[1]] = ""; # drop from the list of left.txt
}
}
END {
for (var in c) {
if (c[var] != "") {
print c[var] > "left.txt"
}
}
}' 1.txt 2.txt
The next version changes the order of file reading to reduce memory consumption (the smaller 2.txt is held in memory instead of 1.txt):
awk -F'[|][|]o[|][|]' -v s1="||o||" '
NR == FNR {
split($0, a, ":");
r = substr($0, index($0, ":") + 1); # right side of the 1st ":"
map[a[1]] = r;
}
NR != FNR {
if (map[$2] != "") {
print $1 s1 map[$2] s1 $3 s1 $4 s1 $5 > "result.txt";
} else {
print $0 > "left.txt"
}
}' 2.txt 1.txt
and the final version uses a file-based database (Berkeley DB via DB_File), which minimizes memory consumption, although I'm not sure whether Perl is acceptable on your system.
perl -e '
use DB_File;
$file1 = "1.txt";
$file2 = "2.txt";
$result = "result.txt";
$left = "left.txt";
my $dbfile = "tmp.db";
tie(%db, "DB_File", $dbfile, O_CREAT|O_RDWR, 0644) or die "$dbfile: $!";
open(FH, $file2) or die "$file2: $!";
while (<FH>) {
chop;
@_ = split(/:/, $_, 2);
$db{$_[0]} = $_[1];
}
close FH;
open(FH, $file1) or die "$file1: $!";
open(RESULT, "> $result") or die "$result: $!";
open(LEFT, "> $left") or die "$left: $!";
while (<FH>) {
@_ = split(/\|\|o\|\|/, $_);
if (defined $db{$_[1]}) {
$_[1] = $db{$_[1]};
print RESULT join("||o||", @_);
} else {
print LEFT $_;
}
}
close FH;
untie %db;
'
rm tmp.db


Merge two files using AWK with conditions

I am new to bash scripting and need help with the question below. I parsed a log file to get the data shown and am now stuck on the later part.
I have a file1.csv with content as:
mac-test-1,10.32.9.12,15
mac-test-2,10.32.9.13,10
mac-test-3,10.32.9.14,11
mac-test-4,10.32.9.15,13
and second file2.csv has below content:
mac-test-3,10.32.9.14
mac-test-4,10.32.9.15
I want to do a file comparison: if a line in the second file matches a line in the first file, then change the content of file1 as below:
mac-test-1,10.32.9.12, 15, no match
mac-test-2,10.32.9.13, 10, no match
mac-test-3,10.32.9.14, 11, matched
mac-test-4,10.32.9.15, 13, matched
I tried this
awk -F "," 'NR==FNR{a[$1]; next} $1 in a {print $0",""matched"}' file2.csv file1.csv
but it prints the following and doesn't include the non-matching records:
mac-test-3,10.32.9.14,11,matched
mac-test-4,10.32.9.15,13,matched
Also, in some cases file2 can be empty, in which case the result should be like this:
mac-test-1,10.32.9.12,15, no match
mac-test-2,10.32.9.13,10, no match
mac-test-3,10.32.9.14,11, no match
mac-test-4,10.32.9.15,13, no match
With your shown samples, please try the following awk code. You need not check the condition first and then print, because with a $1 in a guard the non-matching lines never enter the block at all. It is better to print the whole line of file1.csv unconditionally and append its status, matched or not-matched, based on whether it exists in the array.
awk '
BEGIN { FS=OFS="," }
FNR==NR{
arr[$0]
next
}
{
print $0,(($1 OFS $2) in arr)?"Matched":"Not-matched"
}
' file2.csv file1.csv
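A side note on the print line: the parentheses around the ternary matter, because print takes a comma-separated list and the conditional must be a single expression. A tiny check of the same pattern with made-up data:

```shell
# print $0, (cond ? a : b) -- the parenthesized ternary is one output item
tagged=$(echo 'mac-test-9' | awk 'BEGIN{OFS=","} {print $0, ($1 == "mac-test-9" ? "Matched" : "Not-matched")}')
echo "$tagged"
```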
EDIT: Adding a solution to handle the scenario where file2.csv is an empty file. It is the same concept as above, with an extra guard for the empty-file case.
awk -v lines="$(wc -l < file2.csv)" '
BEGIN { FS=OFS=","}
(lines==0){
print $0,"Not-Matched"
next
}
FNR==NR{
arr[$0]
next
}
{
print $0,(($1 OFS $2) in arr)?"Matched":"Not-matched"
}
' file2.csv file1.csv
You are not printing the else case:
awk -F "," 'NR==FNR{a[$1]; next}
{
if ($1 in a) {
print $0 ",matched"
} else {
print $0 ",no match"
}
}' file2.csv file1.csv
Output
mac-test-1,10.32.9.12,15,no match
mac-test-2,10.32.9.13,10,no match
mac-test-3,10.32.9.14,11,matched
mac-test-4,10.32.9.15,13,matched
Or in short, without manually printing the comma but using OFS:
awk 'BEGIN{FS=OFS=","} NR==FNR{a[$1];next}{ print $0 OFS (($1 in a)?"":"no ")"match"}' file2.csv file1.csv
Edit
I found a solution on this page for handling FNR==NR with an empty file. When file2.csv is empty, awk reads no records from it, so NR==FNR is also true while reading file1.csv and every line would land in the array; all output lines then become:
mac-test-1,10.32.9.12,15,no match
Example
awk -F "," '
ARGV[1] == FILENAME{a[$1];next}
{
if ($1 in a) {
print $0 ",matched"
} else {
print $0 ",no match"
}
}' file2.csv file1.csv
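To see why the plain NR==FNR test misfires, compare the two conditions against an empty first file (a quick sketch with made-up files):

```shell
# With an empty first file, awk reads no records from it, so NR==FNR
# stays true on every line of the second file; ARGV[1]==FILENAME does not.
: > empty.csv
printf 'x\ny\n' > data.csv
broken=$(awk 'NR==FNR{n++} END{print n+0}' empty.csv data.csv)
fixed=$(awk 'ARGV[1]==FILENAME{n++} END{print n+0}' empty.csv data.csv)
echo "broken=$broken fixed=$fixed"
```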
Each of @RavinderSingh13's and @Thefourthbird's answers contains large parts of the solution, but here it is all together:
awk '
BEGIN { FS=OFS="," }
{ key = $1 FS $2 }
FILENAME == ARGV[1] {
arr[key]
next
}
{
print $0, ( key in arr ? "matched" : "no match")
}
' file2.csv file1.csv
or if you prefer:
awk '
BEGIN { FS=OFS="," }
{ key = $1 FS $2 }
!f {
arr[key]
next
}
{
print $0, ( key in arr ? "matched" : "no match")
}
' file2.csv f=1 file1.csv

Awk script to sum multiple columns if value in column1 is duplicate

I need your help resolving the query below.
I want to sum the values in columns 3, 5, 6, 7, 9 and 10 whenever the value in column 1 is duplicated.
Duplicate rows also need to be collapsed into a single row in the output file, and the value of column 1 needs to be appended as an extra final column of the output.
input file
a|b|c|d|e|f|g|h|i|j
IN27201800024099|a|2.01|ad|5|56|6|rr|1|5
IN27201800023963|b|3|4|rt|67|6|61|ty|6
IN27201800024099|a|4|87|ad|5|6|1|rr|7.45
IN27201800024099|a|5|98|ad|5|6|1|rr|8
IN27201800023963|b|7|7|rt|5|5|1|ty|56
IN27201800024098|f|80|67|ty|6|6|1|4rght|765
output file
a|b|c|d|e|f|g|h|i|j|k
IN27201800024099|a|11.01|190|ad|66|18|3|rr|20.45|IN27201800024099
IN27201800023963|b|10|11|rt|72|11|62|ty|62|IN27201800023963
IN27201800024098|f|80|67|ty|6|6|1|4rght|765|IN27201800024098
I tried the code below, but it is not working, and I have no clue how to complete it to get the correct output:
awk 'BEGIN {FS=OFS="|"} FNR==1 {a[$1]+= (f3[key]+=$3;f5[key]+=$5;f6[key]+=$6;f7[key]+=$7;f9[key]+=$9;f10[key]+=$10;)} input.txt > output.txt
$ cat tst.awk
BEGIN {
FS=OFS="|"
}
NR==1 {
print $0, "h"
next
}
{
keys[$1]
for (i=2; i<=NF; i++) {
sum[$1,i] += $i
}
}
END {
for (key in keys) {
printf "%s", key
for (i=2; i<=NF; i++) {
printf "%s%s", OFS, sum[key,i]
}
print OFS key
}
}
$ awk -f tst.awk file
a|b|c|d|e|f|g|h
IN27201800023963|10|11|72|11|62|62|IN27201800023963
IN27201800024098|80|67|6|0|1|765|IN27201800024098
IN27201800024099|11.01|190|66|18|3|20.45|IN27201800024099
The above outputs the lines in random order. If you want them output in the same order as the key values were read in, it's just a couple more lines of code:
$ cat tst.awk
BEGIN {
FS=OFS="|"
}
NR==1 {
print $0, "h"
next
}
!seen[$1]++ {
keys[++numKeys] = $1
}
{
for (i=2; i<=NF; i++) {
sum[$1,i] += $i
}
}
END {
for (keyNr=1; keyNr<=numKeys; keyNr++) {
key = keys[keyNr]
printf "%s", key
for (i=2; i<=NF; i++) {
printf "%s%s", OFS, sum[key,i]
}
print OFS key
}
}
$ awk -f tst.awk file
a|b|c|d|e|f|g|h
IN27201800024099|11.01|190|66|18|3|20.45|IN27201800024099
IN27201800023963|10|11|72|11|62|62|IN27201800023963
IN27201800024098|80|67|6|0|1|765|IN27201800024098

Merge two files using awk in Linux

I have a 1.txt file:
betomak#msn.com||o||0174686211||o||7880291304ca0404f4dac3dc205f1adf||o||Mario||o||Mario||o||Kawati
zizipi#libero.it||o||174732943.0174732943||o||e10adc3949ba59abbe56e057f20f883e||o||Tiziano||o||Tiziano||o||D'Intino
frankmel#hotmail.de||o||0174844404||o||8d496ce08a7ecef4721973cb9f777307||o||Melanie||o||Melanie||o||Kiesel
apoka-paris#hotmail.fr||o||0174847613||o||536c1287d2dc086030497d1b8ea7a175||o||Sihem||o||Sihem||o||Sousou
sofianomovic#msn.fr||o||174902297.0174902297||o||9893ac33a018e8d37e68c66cae23040e||o||Nabile||o||Nabile||o||Nassime
donaldduck#yahoo.com||o||174912161.0174912161||o||0c770713436695c18a7939ad82bc8351||o||Donald||o||Donald||o||Duck
cernakova#centrum.cz||o||0174991962||o||d161dc716be5daf1649472ddf9e343e6||o||Dagmar||o||Dagmar||o||Cernakova
trgsrl#tiscali.it||o||0175099675||o||d26005df3e5b416d6a39cc5bcfdef42b||o||Esmeralda||o||Esmeralda||o||Trogu
catherinesou#yahoo.fr||o||0175128896||o||2e9ce84389c3e2c003fd42bae3c49d12||o||Cat||o||Cat||o||Sou
ermimurati24#hotmail.com||o||0175228687||o||a7766a502e4f598c9ddb3a821bc02159||o||Anna||o||Anna||o||Beratsja
cece_89#live.fr||o||0175306898||o||297642a68e4e0b79fca312ac072a9d41||o||Celine||o||Celine||o||Jacinto
kendinegel39#hotmail.com||o||0175410459||o||a6565ca2bc8887cde5e0a9819d9a8ee9||o||Adem||o||Adem||o||Bulut
A 2.txt file:
9893ac33a018e8d37e68c66cae23040e:134:#a1
536c1287d2dc086030497d1b8ea7a175:~~#!:/92\
8d496ce08a7ecef4721973cb9f777307:demodemo
FS for 1.txt is "||o||" and for 2.txt is ":"
I want to merge the two files into a single file result.txt on the condition that the 3rd column of 1.txt matches the 1st column of 2.txt; where they match, the 3rd column of 1.txt should be replaced by the remainder of the 2.txt line (everything after the first ':').
The expected output will contain all the matching lines; I am showing one of them:
sofianomovic#msn.fr||o||174902297.0174902297||o||134:#a1||o||Nabile||o||Nabile||o||Nassime
I tried the script:
awk -F"||o||" 'NR==FNR{s=$0; sub(/:[^:]*$/, "", s); a[s]=$NF;next} {s = $5; for (i=6; i<=NF; ++i) s = s "," $i; if (s in a) { NF = 5; $5=a[s]; print } }' FS=: <(tr -d '\r' < 2.txt) FS="||o||" OFS="||o||" <(tr -d '\r' < 1.txt) > result.txt
But I am getting an empty file as the result. Any help would be highly appreciated.
If your actual input files are the same as the shown samples, then the following awk may help:
awk -v s1="||o||" '
FNR==NR{
a[$9]=$1 s1 $5;
b[$9]=$13 s1 $17 s1 $21;
next
}
($1 in a){
print a[$1] s1 $2 FS $3 s1 b[$1]
}
' FS="|" 1.txt FS=":" 2.txt
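The $5, $9, $13... indexing above works because FS="|" splits each "||o||" into several separators with empty fields between them, so consecutive logical columns land 4 awk-fields apart. A quick check with a made-up line:

```shell
# 'a||o||b||o||c' with FS='|' -> fields: a, "", o, "", b, "", o, "", c
fields=$(printf '%s\n' 'a||o||b||o||c' | awk -F'|' '{print NF, $1, $5, $9}')
echo "$fields"
```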
EDIT: Since the OP has changed the requirement a bit, here is code for the new ask. It additionally creates two files: one with the ids present in 1.txt but NOT in 2.txt, and the other vice versa.
awk -v s1="||o||" '
FNR==NR{
a[$9]=$1 s1 $5;
b[$9]=$13 s1 $17 s1 $21;
c[$9]=$0;
next
}
($1 in a){
val=$1;
$1="";
sub(/:/,"");
print a[val] s1 $0 s1 b[val];
d[val]=$0;
next
}
{
print > "NOT_present_in_2.txt"
}
END{
for(i in d){
delete c[i]
};
for(j in c){
print j,c[j] > "NOT_present_in_1.txt"
}}
' FS="|" 1.txt FS=":" OFS=":" 2.txt
You can use this awk to get your output:
awk -F ':' 'NR==FNR{a[$1]=$2 FS $3; next} FNR==1{FS=OFS="||o||"; gsub(/[|]/, "\\\\&", FS)}
$3 in a{$3=a[$3]; print}' file2 file1 > result.txt
cat result.txt
frankmel#hotmail.de||o||0174844404||o||demodemo:||o||Melanie||o||Melanie||o||Kiesel
apoka-paris#hotmail.fr||o||0174847613||o||~~#!:/92\||o||Sihem||o||Sihem||o||Sousou
sofianomovic#msn.fr||o||174902297.0174902297||o||134:#a1||o||Nabile||o||Nabile||o||Nassime
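One general awk subtlety behind the FNR==1 trick above: assigning FS inside the program only affects records read afterwards; the record that triggered the assignment keeps its old splitting. That is harmless for this sample, since the first line of 1.txt contains no ':' and cannot match. A quick check of the behavior:

```shell
# First line is still split by the default FS (whitespace), so NF==1;
# the second line is split by the newly assigned ":" FS, so NF==2.
nf=$(printf 'a:b\nc:d\n' | awk 'NR==1{FS=":"} {printf "%s%s", sep, NF; sep=" "}')
echo "$nf"
```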

Group the consecutive numbers in shell

$ foo="1,2,3,6,7,8,11,13,14,15,16,17"
In shell, how can I group the numbers in $foo as 1-3,6-8,11,13-17?
Given the following function:
build_range() {
local range_start= range_end=
local -a result
end_range() {
: range_start="$range_start" range_end="$range_end"
[[ $range_start ]] || return
if (( range_end == range_start )); then
# single number; just add it directly
result+=( "$range_start" )
elif (( range_end == (range_start + 1) )); then
# emit 6,7 instead of 6-7
result+=( "$range_start" "$range_end" )
else
# larger span than 2; emit as start-end
result+=( "$range_start-$range_end" )
fi
range_start= range_end=
}
# use the first number to initialize both values
range_start= range_end=
result=( )
for number; do
: number="$number"
if ! [[ $range_start ]]; then
range_start=$number
range_end=$number
continue
elif (( number == (range_end + 1) )); then
(( range_end += 1 ))
continue
else
end_range
range_start=$number
range_end=$number
fi
done
end_range
(IFS=,; printf '%s\n' "${result[*]}")
}
...called as follows:
# convert your string into an array
IFS=, read -r -a numbers <<<"$foo"
build_range "${numbers[@]}"
...we get the output:
1-3,6-8,11,13-17
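The function's last line uses a subshell so the IFS change stays local; inside it, "${result[*]}" joins the array elements with the first character of IFS. The same trick in isolation, with a made-up array:

```shell
# Join array elements with commas without touching the caller's IFS.
parts=(1-3 6-8 11 13-17)
joined=$( (IFS=,; printf '%s' "${parts[*]}") )
echo "$joined"
```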
awk solution for an extended sample:
foo="1,2,3,6,7,8,11,13,14,15,16,17,19,20,33,34,35"
awk -F',' '{
r = nxt = 0;
for (i=1; i<=NF; i++)
if ($i+1 == $(i+1)){ if (!r) r = $i"-"; nxt = $(i+1) }
else { printf "%s%s", (r)? r nxt : $i, (i == NF)? ORS : FS; r = 0 }
}' <<<"$foo"
The output:
1-3,6-8,11,13-17,19-20,33-35
As an alternative, you can use this awk command:
cat series.awk
function prnt(delim) {
printf "%s%s", s, (p > s ? "-" p : "") delim
}
BEGIN {
RS=","
}
NR==1 {
s = p = $1  # initialize p too, so the first number never looks like a sequence break
}
p < $1-1 {
prnt(RS)
s = $1
}
{
p = $1
}
END {
prnt(ORS)
}
Now run it as:
$> foo="1,2,3,6,7,8,11,13,14,15,16,17"
$> awk -f series.awk <<< "$foo"
1-3,6-8,11,13-17
$> foo="1,3,6,7,8,11,13,14,15,16,17"
$> awk -f series.awk <<< "$foo"
1,3,6-8,11,13-17
$> foo="1,3,6,7,8,11,13,14,15,16,17,20"
$> awk -f series.awk <<< "$foo"
1,3,6-8,11,13-17,20
Here is an one-liner for doing the same:
awk 'function prnt(delim){printf "%s%s", s, (p > s ? "-" p : "") delim}
BEGIN{RS=","} NR==1{s = p = $1} p < $1-1{prnt(RS); s = $1} {p = $1}END {prnt(ORS)}' <<< "$foo"
In this awk command we keep 2 variables:
p stores the previous number
s stores the start of the range that needs to be printed
How it works:
When NR==1 we set s to the first number.
When p is less than the current number minus 1 ($1-1), there is a break in the sequence and we need to print the range.
The function prnt does the printing. It accepts a single argument, the end delimiter: when called from the p < $1-1 {...} block it is passed RS (a comma), and when called from the END{...} block it is passed ORS (a newline).
Inside p < $1-1 {...} we reset s (the range start) to $1.
After processing each record we store $1 in the variable p.
prnt uses printf for formatted output. It always prints the starting number s first, then checks whether p > s and, if so, prints a hyphen followed by p.
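The other piece doing quiet work here is RS=",": it makes awk treat each comma-separated number as its own record, so NR counts numbers rather than lines. A quick check:

```shell
# With RS="," the input "1,2,3" is three records, not one line.
count=$(printf '%s' '1,2,3' | awk 'BEGIN{RS=","} END{print NR}')
echo "$count"
```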

Comparing two CSV files in Linux

I have two CSV files with me in the following format:
File1:
No.1, No.2
983264,72342349
763498,81243970
736493,83740940
File2:
No.1,No.2
"7938493","7364987"
"2153187","7387910"
"736493","83740940"
I need to compare the two files and output the matched and unmatched values.
I did it through awk:
#!/bin/bash
awk 'BEGIN {
FS = OFS = ","
}
if (FNR==1){next}
NR>1 && NR==FNR {
a[$1];
next
}
FNR>1 {
print ($1 in a) ? $1 FS "Match" : $1 FS "In file2 but not in file1"
delete a[$1]
}
END {
for (x in a) {
print x FS "In file1 but not in file2"
}
}' file1 file2
But the output is:
"7938493",In file2 but not in file1
"2153187",In file2 but not in file1
"8172470",In file2 but not in file1
7938493,In file1 but not in file2
2153187,In file1 but not in file2
8172470,In file1 but not in file2
Can you please tell me where I am going wrong?
Here are some corrections to your script:
BEGIN {
# FS = OFS = ","
FS = "[,\"]+"
OFS = ", "
}
# if (FNR==1){next}
FNR == 1 {next}
# NR>1 && NR==FNR {
NR==FNR {
a[$1];
next
}
# FNR>1 {
$2 in a {
# print ($1 in a) ? $1 FS "Match" : $1 FS "In file2 but not in file1"
print ($2 in a) ? $2 OFS "Match" : $2 OFS "In file2 but not in file1"
delete a[$2]
}
END {
for (x in a) {
print x, "In file1 but not in file2"
}
}
This is an awk script, so you can run it like awk -f script.awk file1 file2. Doing so gives these results:
$ awk -f script.awk file1 file2
736493, Match
763498, In file1 but not in file2
983264, In file1 but not in file2
The main problem with your script was that it didn't correctly handle the double quotes around the numbers in file2. I changed the input field separator so that the double quotes are treated as part of the separator to deal with this. As a result, the first field $1 in the second file is empty (it is the bit between the start of the line and the first "), so you need to use $2 to refer to the first value you're interested in. Aside from that, I removed some redundant conditions from your other blocks and used OFS rather than FS in your first print statement.
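To make the field numbering concrete, here is how FS="[,\"]+" splits one of the quoted lines (a sketch): the leading double quote is itself part of a separator, so $1 is empty and the first number lands in $2.

```shell
line='"7938493","7364987"'
second=$(printf '%s\n' "$line" | awk 'BEGIN{FS="[,\"]+"} {print $2}')
first_empty=$(printf '%s\n' "$line" | awk 'BEGIN{FS="[,\"]+"} {print ($1 == "") ? "yes" : "no"}')
echo "second=$second first_empty=$first_empty"
```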
