Implementing an Excel VLOOKUP-like function with awk

I have a question about implementing a VLOOKUP-like function with awk. I have a CSV file of id-score pairs like this (say, 1.csv):
id,score
1,16
3,12
5,13
11,8
13,32
17,37
23,74
29,7
31,70
41,83
There are "unscored" guys. I also have a csv file including all registered guys both scored and unscored like this (say, 2.csv) (I transposed for the want of space)
id,1,3,5,7,11,13,17,19,23,29,31,37,41
I would like to generate id-score pairs according to the 2nd CSV file, so that both scored and unscored guys are included. For unscored guys, NAN would be used instead of a score.
In other words, final result is desired to be like this:
id,score
1,16
3,12
5,13
7,NAN
11,8
13,32
17,37
19,NAN
23,74
29,7
31,70
37,NAN
41,83
When I tried to create a new table with the following awk command, it did not work for me. Thanks in advance for any advice.
awk 'FNR==NR{a[$1]++; next} {print $0, (a[$1]) ? a[$2] : "NAN"}' 1.csv 2.csv

Here is your script with fixes: set the field separators, save the score value for each id, and print the looked-up value, falling back to NAN when the id is missing.
$ awk 'BEGIN {FS=OFS=","}
FNR==NR {a[$1]=$2; next}
{print $1, (($1 in a)?a[$1]:"NAN")}' file1 file2
id,score
1,16
3,12
5,13
7,NAN
11,8
13,32
17,37
19,NAN
23,74
29,7
31,70
37,NAN
41,83

With bash and join:
echo "id,score"
join --header -j 1 -t ',' <(sort 1.csv | grep -v '^id') <(tr ',' '\n' < 2.csv | grep -v '^id' | sort) -e "NAN" -a 2 -o 2.1,1.2 | sort -n
Output:
id,score
1,16
3,12
5,13
7,NAN
11,8
13,32
17,37
19,NAN
23,74
29,7
31,70
37,NAN
41,83
See: man join
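For reference, a minimal illustration of the join options used above (my own tiny made-up inputs, assuming GNU coreutils join):
$ join -t ',' -a 2 -e NAN -o 2.1,1.2 <(printf '1,16\n3,12\n') <(printf '1\n3\n7\n')
1,16
3,12
7,NAN
Here -t ',' sets the field separator, -a 2 also prints lines of the second file that have no match, and -e NAN fills in the missing fields requested by -o (field 1 of file 2, then field 2 of file 1).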

With awk, could you please try the following, written with your shown samples in GNU awk. It assumes (like your shown samples) that both Input_files have headers in their first line, and it relies on the ids appearing in the same order in both files.
awk -v counter=2 '
FNR==1{
next
}
FNR==NR{
a[FNR]=$0
b[FNR]=$1
next
}
{
if($0==b[counter]){
print a[counter]
counter++
}
else{
print $0",NA"
}
}
' FS="," 1.csv <(tr ',' '\n' < 2.csv)
Explanation: Adding detailed explanation for above.
awk -v counter=2 ' ##Starting awk program from here and setting counter as 2.
FNR==1{ ##Checking condition if line is 1st then do following.
next ##next will skip all further statements from here.
}
FNR==NR{ ##Checking condition if FNR==NR which will be TRUE when Input_file 1.csv is being read.
a[FNR]=$0 ##Creating array a with index FNR and value of current line.
b[FNR]=$1 ##Creating array b with index FNR and value of 1st field of current line.
next ##next will skip all further statements from here.
}
{
if($0==b[counter]){ ##Checking condition if the current line equals array b's value at index counter; if so, do the following.
print a[counter] ##Printing array a with index of counter here.
counter++ ##Increasing count of counter by 1 each time cursor comes here.
}
else{ ##Else part of the above if condition starts here.
print $0",NAN" ##Printing the current line followed by ",NAN" here.
}
}
' FS="," 1.csv <(tr ',' '\n' < 2.csv) ##Setting FS as , for Input_file 1.csv and sending 2.csv output by changing comma to new line to awk.

An awk solution that reads the transposed one-line 2.csv directly, looping over its id fields, could be:
awk -v FS=, -v OFS=, '
NR == 1 { print; next }
NR == FNR { score[$1] = $2; next }
{ for (i = 2; i <= NF; ++i)
    print $i, (score[$i] == "" ? "NAN" : score[$i]) }
' 1.csv 2.csv

Related

Shell script to find increase in occurrence count between two files

File1.log
2000 apple
2333 cat
5343 dog
1500 lion
File2.log
2500 apple
2333 cat
1700 lion
Need a shell script to output as below:
500 apple
200 lion
I have tried a lot of solutions, but nothing worked out since each line contains both a number and text. Could someone help with this? Thanks
EDIT (by RavinderSingh13): Added the OP's efforts, which the OP had shown in the comments, to the post:
#!/bin/bash
input1="./File1.log"
input2="./File2.log"
while IFS= read -r line2
do
while IFS= read -r line1
do
echo "$line1"
done < "$input1"
echo "$line2"
done < "$input2"
Could you please try the following.
awk 'FNR==NR{a[$2]=$1;next} $2 in a && ($1-a[$2])>0{print $1-a[$2],$2}' file1 file2
Adding a non-one-liner form of the above solution:
awk '
FNR==NR{
a[$2]=$1
next
}
($2 in a) && ($1-a[$2])>0{
print $1-a[$2],$2
}
' Input_file1 Input_file2
Explanation: Adding a detailed explanation for above solution here.
awk ' ##Starting awk program from here.
FNR==NR{ ##Checking condition if FNR==NR which will be TRUE once file1 is being read then do following.
a[$2]=$1 ##Creating an array a whose index is $2 and value is $1 of current line.
next ##Using awk's next statement to skip all further statements for this line.
} ##Closing condition BLOCK for FNR==NR here.
($2 in a) && ($1-a[$2])>0{ ##Checking condition if $2 is present in array a AND difference of $1 and array a with index $2 is greater than 0 then do following.
print $1-a[$2],$2 ##Printing difference between $1 and array a with index $2 along with current $2 here.
} ##Closing BLOCK for above condition here.
' file1 file2 ##Mentioning Input_file names here.
awk '{if (!($2 in entry)) { entry[$2]=$1 } else { delta=$1-entry[$2]; if (delta!=0) {print delta,$2} } }' FILE_1 FILE_2
You can also put this into a file, e.g. delta.awk:
{
if (!($2 in entry)) {
entry[$2]=$1
} else {
delta=$1-entry[$2]
if (delta !=0) { # Only output lines of non-zero increment/decrement
print delta,$2
}
}
}
Invoke via awk -f delta.awk FILE_1.txt FILE_2.txt.
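As a quick check (not from the original answer; file names assumed to match the question's samples), this should print the requested increments:
$ awk -f delta.awk File1.log File2.log
500 apple
200 lion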

Search array index with double quotes string using awk

File1:
"1"|data|er
"2"|text|rq
""|test2|req
"3"|test4|teq
File2:
1
2
3
Expected Output should be (file3.txt)
"1"|data|er
"2"|text|rq
"3"|test4|teq
awk -F''$Delimeter'' '{print $1}' file1.txt | awk '{gsub(/"/, "", $1); print $1}' | awk 'NF && !seen[$1]++' | sort -n > file2.txt
I am able to extract the ids 1, 2, 3 from file1, remove the double quotes, and write them into file2, but I need to search for these ids in file1.txt, where they appear as "1", "2", "3". The problem is that the search does not match because of the double quotes in the file.
awk 'BEGIN {FS=OFS="|"} NR==FNR{a[$1]; next} \"$1\" in a' file2.txt file1.txt > file3.txt
Could you please try the following.
awk -v s1='"' '
FNR==NR{
val=s1 $0 s1
a[val]
next
}
($1 in a)
' Input_file2 FS='|' Input_file1
Explanation: Adding detailed explanation for above code.
awk -v s1='"' ' ##Starting awk program from here and creating variable s1 whose value is ".
FNR==NR{ ##Checking condition FNR==NR which will be TRUE when first Input_file named Input_file2 is being read.
val=s1 $0 s1 ##Creating variable val whose value is s1, then the current line, then s1 again (i.e. the id wrapped in double quotes).
a[val] ##Creating an array named a whose index is variable val.
next ##next will skip all further statements from here.
} ##Closing FNR==NR BLOCK of this code here.
($1 in a) ##Checking condition if $1 of current line is present in array a then print that line of Input_file1.
' Input_file2 FS='|' Input_file1 ##Mentioning Input_file2 then setting FS as pipe and mentioning Input_file1 name here.
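An equivalent approach (just a sketch, not part of the original answer) is to strip the double quotes from $1 of Input_file1 instead of adding them to the lookup keys:
awk '
FNR==NR{ a[$0]; next }           ##Store the bare ids from Input_file2 as array indexes.
{ key=$1; gsub(/"/, "", key) }   ##Copy $1 and remove its double quotes.
(key in a)                       ##Print the line when the unquoted id is a known id.
' Input_file2 FS='|' Input_file1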
Let's say that your input is
"1"|data|er
"2"|text|rq
""|test2|req
"3"|test4|teq
And from this input you want 2 types of data:
The ids
The lines containing an id
The easiest way to achieve this is, I think, to first get the lines that have an id, and then retrieve the ids from those lines.
To do so:
$ awk -F'|' '$0 ~ /"[0-9]+"/' input1 >input3; cat input3
"1"|data|er
"2"|text|rq
"3"|test4|teq
$ sed 's/^"//; s/".*$//' input3 >input2; cat input2
1
2
3
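If you prefer a single pass (a sketch with the same file names, not from the original answer), both steps can be combined in one awk call:
$ awk -F'|' '$1 ~ /^"[0-9]+"$/ { print > "input3"; id=$1; gsub(/"/, "", id); print id > "input2" }' input1
This writes the quoted-id lines to input3 and the bare ids to input2 in one read of input1.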

Comparison of two files using awk and printing non-matched records

I am comparing two files, file1 and file2. I need to print the records of file2 that were updated relative to file1, i.e. the changed records of file2 as well as the newly added ones.
File1:
1|footbal|play1
2|cricket|play2
3|tennis|play3
5|golf|play5
File2:
1|footbal|play1
2|cricket|play2
3|tennis|play3
4|soccer|play4
5|golf|play6
output file:
4|soccer|play4
5|golf|play6
I have tried the below solution, but it does not give the expected output.
awk -F'|' 'FNR == NR { a[$3] = $3; a[$1]=$1; next; } { if ( !($3 in a) && !($1 in a) ) { print $0; } }' file1.txt file2.txt
I have compared column 1 and column 3 from both files.
Could you please try the following. It indexes on the pair ($1,$3), so a file2 record is dropped only when both fields match the same file1 record; your attempt stored $1 and $3 as independent keys, so a match on either field anywhere in file1 was enough to suppress a record.
awk 'BEGIN{FS="|"}FNR==NR{a[$1,$3];next} !(($1,$3) in a)' Input_file1 Input_file2
Or a non-one-liner form of the solution:
awk '
BEGIN{
FS="|"
}
FNR==NR{
a[$1,$3]
next
}
!(($1,$3) in a)
' Input_file1 Input_file2
Explanation: Adding detailed explanation for above code.
awk ' ##Starting awk program from here.
BEGIN{ ##Starting BEGIN section of this program from here.
FS="|" ##Setting FS as pipe here as per Input_file(s).
} ##Closing BEGIN block for this awk code here.
FNR==NR{ ##Checking condition FNR==NR which will be TRUE when 1st Input_file named file1 is being read.
a[$1,$3] ##Creating an array named a with index of $1,$3 of the current line.
next ##next will skip all further statements.
}
!(($1,$3) in a) ##Checking condition if $1,$3 are NOT present in array a then print that line from Input_file2.
' Input_file1 Input_file2 ##mentioning Input_file names here.
Output will be as follows.
4|soccer|play4
5|golf|play6
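If any change in any column should count as an update (an assumption about the requirement, not part of the original answer), the same idea works with the whole record as the array index:
awk 'BEGIN{FS="|"} FNR==NR{a[$0];next} !($0 in a)' Input_file1 Input_file2
With the shown samples this prints the same two lines, since both differ from file1 as whole records.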

extract data using sed or awk in linux

I am trying to merge data from 2 text files based on some condition.
I have two files:
1.txt
gera077||o||emi_riv_90#hotmail.com||||200.45.113.254||o||0f8caa3ced5dc172901a427410d20540
okan1993||||killa-o#hotmail.de||||84.141.125.140||o||69c1cb5ddbc66cceebe0dddba3eddf68
Tosiunia||||tosia_19#amorki.pl||o||83.22.193.86|||||ddcbba2076646980391cb4971b8030
DREP
glen-666||o||glen-666#hotmail.com||||84.196.42.167||o||f139d8b49085d012af9048bb1cba3534
Page 1
Sheyes1 ||||summer_faerie_dustyrose#yahoo.com|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
.
BenPhynix||||BenPhynix#aol.de||||| 62.226.181.57||||11dea24f1caebb012e11285579050f38
menopause |||totoche#wanadoo.fr||o||83.193.209.52||o||d7ca4d78fc79a795695ae1c161ce82ea
jonof.|o||joflem#medi3.no||o||213.161.242.106||o||239f33743e4a070b728d4dcbd1091f1a
2.txt
f139d8b49085d012af9048bb1cba3534: 12883 #: "#
d7ca4d78fc79a795695ae1c161ce82ea: 123422
0f8caa3ced5dc172901a427410d20540 :: demo
result.txt (the matching lines from 1.txt, with the hash replaced by its corresponding value from 2.txt):
gera077||o||emi_riv_90#hotmail.com||||200.45.113.254||o||: demo
glen-666||o||glen-666#hotmail.com||||84.196.42.167||o||12883 #: "#
menopause |||totoche#wanadoo.fr||o||83.193.209.52||o||123422
left.txt (the non-matching lines from 1.txt):
okan1993||||killa-o#hotmail.de||||84.141.125.140||o||69c1cb5ddbc66cceebe0dddba3eddf68
Tosiunia||||tosia_19#amorki.pl||o||83.22.193.86|||||ddcbba2076646980391cb4971b8030
DREP
Page 1
Sheyes1 ||||summer_faerie_dustyrose#yahoo.com|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
.
BenPhynix||||BenPhynix#aol.de||||| 62.226.181.57||||11dea24f1caebb012e11285579050f38
jonof.|o||joflem#medi3.no||o||213.161.242.106||o||239f33743e4a070b728d4dcbd1091f1a
The script I am trying is:
awk -v s1="||o||" '
FNR==NR{
a[$9]=$1 s1 $5;
b[$9]=$13 s1 $17 s1 $21;
c[$9]=$0;
next
}
($1 in a){
val=$1;
$1="";
sub(/:/,"");
print a[val] s1 $0 s1 b[val];
d[val]=$0;
next
}
END{
for(i in d){
delete c[i]
};
for(j in c){
print c[j] > "left.txt"
}}
' FS="|" 1.txt FS=":" OFS=":" 2.txt > result.txt
But it is giving me an empty result.txt, and I am having difficulty debugging the issue. Any help would be highly appreciated.
Try the following awk (based entirely on your shown Input_file(s), and assuming that 2.txt has no duplicate hashes) and let me know if this helps you.
awk 'FNR==NR{a[$NF]=$0;next} $1~/:/{sub(/:/,"",$1);flag=1} ($1 in a){val=$1;if($0 ~ /:/ && !flag){sub(/[^:]*/,"");sub(/:/,"")};print a[val] OFS $0 > "result.txt";flag="";delete a[val]} END{for(i in a){print a[i]>"left.txt"}}' FS="|" 1.txt FS=" " OFS="||o||" 2.txt
The output will be two files named result.txt and left.txt. Here is a non-one-liner form of the above code with an explanation.
awk '
FNR==NR{ ##FNR and NR are both built-in awk variables denoting line numbers in the Input_file(s); the difference is that FNR is reset when awk finishes reading one Input_file, while NR keeps increasing until all Input_file(s) are read.
a[$NF]=$0; ##Creating an array named a whose index is $NF(value of last field of current line) and value is current line.
next ##next is awk out of the box keyword which will skip all further statements now.
}
$1~/:/{ ##Checking condition here if the current line's 1st field has a colon in it, then do the following:
sub(/:/,"",$1); ##Using sub function of awk which will substitute the colon in the 1st field of the current line with NULL.
flag=1 ##Setting a variable named flag here (basically to mark that the 1st colon has already been substituted, so there is no need for another colon removal).
}
($1 in a){ ##Checking a condition here if current line $1 is present in array a then do following:
val=$1; ##Setting variable named val value to $1 here.
if($0 ~ /:/ && !flag){ ##Checking condition here if the current line has a colon and variable flag is NOT set, then do the following:
sub(/[^:]*/,""); ##Substituting everything from the start of the line up to (but not including) the first colon with NULL.
sub(/:/,"")}; ##Then substituting only 1 colon here.
print a[val] OFS $0 > "result.txt"; ##Printing the value of array a whose index is variable val, then OFS (output field separator), then the current line, to the output file named result.txt here.
flag=""; ##Unsetting the value of variable flag here.
delete a[val] ##Deleting the value of array a whose index is variable val here.
}
END{ ##Starting the END section of this awk program here, which will be executed once all Input_file(s) have been read.
for(i in a){ ##Traversing through the array a now.
print a[i]>"left.txt"} ##Printing the value of array a(which will basically provide those values which are NOT matched in both files) in left.txt file.
}
' FS="|" 1.txt FS=" " OFS="||o||" 2.txt ##Setting FS="|" for 1.txt Input_file and then setting FS=" " and OFS="||o||" for 2.txt Input_file, 1.txt and 2.txt are Input_files for this program to run.
This awk script may also help.
$ awk 'BEGIN{FS="\|";OFS="|"}NR==FNR{data[$1]=$2;}
NR!=FNR{if($NF in data){
$NF=data[$NF];print >"result.txt"
}else{
print >"left.txt"}
}' <( sed 's/\s*:\s*/|/' 2.txt) 1.txt 2>/dev/null
Output
$ cat result.txt
gera077||o||emi_riv_90#hotmail.com||||200.45.113.254||o||: demo
glen-666||o||glen-666#hotmail.com||||84.196.42.167||o||12883 #: "#
menopause |||totoche#wanadoo.fr||o||83.193.209.52||o||123422
$ cat left.txt
okan1993||||killa-o#hotmail.de||||84.141.125.140||o||69c1cb5ddbc66cceebe0dddba3eddf68
Tosiunia||||tosia_19#amorki.pl||o||83.22.193.86|||||ddcbba2076646980391cb4971b8030
DREP
Page 1
Sheyes1 ||||summer_faerie_dustyrose#yahoo.com|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
.
BenPhynix||||BenPhynix#aol.de||||| 62.226.181.57||||11dea24f1caebb012e11285579050f38
jonof.|o||joflem#medi3.no||o||213.161.242.106||o||239f33743e4a070b728d4dcbd1091f1a
We preprocessed 2.txt (the first file passed to awk) with sed to make its field delimiter |, and used process substitution to pass the result to awk.
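For example (a quick illustration with the 2.txt shown above, assuming GNU sed's \s), the sed step turns each line into hash|value, so that inside awk $1 is the hash and $2 is the replacement text:
$ sed 's/\s*:\s*/|/' 2.txt
f139d8b49085d012af9048bb1cba3534|12883 #: "#
d7ca4d78fc79a795695ae1c161ce82ea|123422
0f8caa3ced5dc172901a427410d20540|: demo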

Matching files using awk in linux

I have 2 files:
1.txt:
e10adc3949ba59abbe56e057f20f883e
f8b46e989c5794eec4e268605b63eb59
e3ceb5881a0a1fdaad01296d7554868d
2.txt:
e10adc3949ba59abbe56e057f20f883e:1111
679ab793796da4cbd0dda3d0daf74ec1:1234
f8b46e989c5794eec4e268605b63eb59:1#/233:
I want 2 files as output:
One is result.txt, which contains the lines from 2.txt whose hash appears in 1.txt,
and the other is left.txt, which contains the lines from 1.txt whose hash does not appear in 2.txt.
Expected output of both files is below:
result.txt
e10adc3949ba59abbe56e057f20f883e:1111
f8b46e989c5794eec4e268605b63eb59:1#/233:
left.txt
e3ceb5881a0a1fdaad01296d7554868d
I tried a couple of approaches with awk but did not succeed. Any help would be highly appreciated.
My script:
awk '
FNR==NR{
val=$1;
sub(/[^:]*/,"");
sub(/:/,"");
a[val]=$0;
next
}
!($NF in a){
print > "left.txt";
next
}
{
print $1,$2,a[$NF]> "result.txt"
}
' FS=":" 2.txt FS=":" OFS=":" 1.txt
The following awk may help you with the same.
awk 'FNR==NR{a[$1]=$0;next} ($0 in a){print a[$0] > "results.txt";next} {print > "left.txt"}' FS=":" OFS=":" 2.txt FS=" " OFS=":" 1.txt
EDIT: Adding explanation of code too here.
awk '
FNR==NR{ ##FNR==NR condition will be TRUE when first Input_file is being read by awk. Where FNR and NR are the out of the box variables for awk.
a[$1]=$0; ##Creating an array named a whose index is $1 and whose value is the whole line ($0) from the 2.txt Input_file.
next ##next is out of the box keyword from awk and will skip all further statements of awk.
}
($0 in a){ ##Checking here condition if current line of Input_file 1.txt is present in array named a then do following.
print a[$0] > "results.txt"; ##Printing the current line into output file named results.txt, since current line is coming in array named a(which was created by 1st file).
next ##next is awk keyword which will skip further statements for awk code now.
}
{
print > "left.txt" ##Printing all lines which skip above condition(which means they did not come into array a) to output file named left.txt as per OP need.
}
' FS=":" OFS=":" 2.txt FS=" " OFS=":" 1.txt ##Setting FS(field separator) as colon for 2.txt and Setting FS to space for 1.txt here. yes, we could set multiple field separators for different Input_file(s).
How about this one:
awk 'BEGIN{ FS = ":" }NR==FNR{ a[$0]; next }$1 in a{ print $0 > "results.txt"; delete a[$1]; next }END{ for ( i in a ) print i > "left.txt" }' 1.txt 2.txt
Output:
results.txt
e10adc3949ba59abbe56e057f20f883e:1111
f8b46e989c5794eec4e268605b63eb59:1#/233:
left.txt
e3ceb5881a0a1fdaad01296d7554868d
