AWK - 3 for loops in END statement not desired result - linux

New to AWK. I have a file with the following content:
FirstName,LastName,Email,ID,Number,IDToBeMatched
John,Smith,js#.com,js30,4,kt78
George,Haynes,gh#.com,gh67,3,re201
Mary,Dewar,md#.com,md009,4,js30
Kevin,Pan,kp#.com,kp41,2,md009
,,,,,ti10
,,,,,qwe909
,,,,,md009
,,,,,kor28
,,,,,gh67
The idea is to check whether any of the fields below the header ID matches any of the fields below IDToBeMatched and, if there is a match, to print the whole record except the last field (i.e. IDToBeMatched). So my final output should look like:
FirstName,LastName,Email,ID,Number
John,Smith,js#.com,js30,4
George,Haynes,gh#.com,gh67,3
Mary,Dewar,md#.com,md009,4
My code so far:
awk 'BEGIN{
  FS=OFS=",";SUBSEP=",";
}
{
  # all[$1,$2,$3,$4,$5]
  a[$4]++;
  b[$6]++;
}
END{ #for(k in all){
  for(i in a){
    for(j in b){
      if(i==j){
        print i #k
      }
    }
  }
  #}
}' inputfile
This prints the match only. If, however, I try to introduce another loop by uncommenting the lines in the above script in order to get the whole line for the matching field, things get messy. I understand why, but I cannot find the solution. I thought to introduce a next statement, but it's not allowed in an END block. My awk defaults to gawk, and I would prefer a (g)awk-only solution.
Thank you in advance.
The last field has more records because it was copied/pasted from an ID "pool" which does not necessarily have the same number of records as the files it was pasted into.

$ awk -F, 'NR==FNR{a[$6];next} (FNR==1)||($4 in a){sub(/,[^,]+$/,"");print}' file file
FirstName,LastName,Email,ID,Number
John,Smith,js#.com,js30,4
George,Haynes,gh#.com,gh67,3
Mary,Dewar,md#.com,md009,4
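For reference, here is the same two-pass approach written out with comments (a sketch; file is your input, named twice so awk reads it once to build the lookup and once to print):
awk -F, '
NR==FNR { a[$6]; next }       # pass 1: remember every IDToBeMatched value
(FNR==1) || ($4 in a) {       # pass 2: header line, or ID seen in pass 1
  sub(/,[^,]+$/,"")           # strip the trailing ,IDToBeMatched field
  print
}' file file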

Related

AWK how to process multiple files and comparing them IN CONTROL FILE! (not command line one-liner)

I read all the answers to similar problems, but they do not work for me because my files are not uniform: they contain several control headers, and in such a case it is safer to create a script than a one-liner, yet all the answers focus on one-liners. In theory one-liner commands should be convertible to a script, but I am struggling to achieve:
printing the control headers
printing only the records starting with 16 in <file 1> whose column 2 value does NOT exist in column 2 of <file 2>
I end up with this:
BEGIN {
  FS="\x01";
  OFS="\x01";
  RS="\x02\n";
  ORS="\x02\n";
  file1=ARGV[1];
  file2=ARGV[2];
  count=0;
}
/^#/ {
  print;
  count++;
}
# reset counters after control headers
NR=1;
FNR=1;
# Below gives syntax error
/^16/ AND NR==FNR {
  a[$2];next; 'FNR==1 || !$2 in a' file1 file2
}
END {
}
Googling only gives me results for command-line processing, and the documentation is also silent in that regard. Does it mean it cannot be done?
Perhaps try:
script.awk:
BEGIN {
  OFS = FS = "\x01"
  ORS = RS = "\x02\n"
}
NR==FNR {
  if (/^16/) a[$2]
  next
}
/^16/ && !($2 in a) || /^#/
Note the parentheses: !$2 in a would be parsed as (!$2) in a
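A quick way to see the difference (a sketch; a is never assigned, so it is an empty array):
$ echo 'foo' | awk '{ print ((!$1) in a), (!($1 in a)) }'
0 1
(!$1) in a tests whether the key "0" exists in a (it does not), while !($1 in a) tests whether "foo" is absent from a (it is).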
Invoke with:
awk -f script.awk FILE2 FILE1
Note order of FILE1 / FILE2 is reversed; FILE2 must be read first to pre-populate the lookup table.
First of all, the short answer to my question should be "NOT POSSIBLE"; anyone who read the question carefully and knew AWK in full would find that obvious. I wish I had known it sooner instead of wasting a few days trying to write the script.
Also, there is no such thing as a minimal reproducible example (this was always a constant pain in the TeX groups): I need a full working example; if it works on 1 row there is no guarantee it works on 2 rows, and my number of rows is ~127 million.
If you read the code carefully you would know what is not working: I marked in a comment which line gives the syntax error. Anyway, as #Daweo suggested, there is no way to use a logical operator in the pattern section. So, because we don't need printing for the first file, the whole trick is to do the conditional inside the second set of brackets:
awk -F, 'BEGIN{} NR==FNR{a[$1];next} !($1 in a) { if (/^16/) print $0} ' set1.txt set2.txt
assuming in the above example that the separator is a comma. I don't know where the assumption that multi-character RS is supported only in GNU awk came from; on macOS BSD awk it works exactly the same, but in fact RS="\x02\n" is a single separator, not two separators.
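Written out with comments, the same idea reads (a sketch; set1.txt supplies the exclusion keys and must be read first):
awk -F, '
NR==FNR { a[$1]; next }       # first file: remember column 1 values
!($1 in a) {                  # second file: column 1 not seen above
  if (/^16/) print $0         # keep only records starting with 16
}' set1.txt set2.txt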

Convert key:value to CSV file

I found the following bash script for converting a file with key:value information to a CSV file:
awk -F ":" -v OFS="," '
BEGIN { print "category","recommenderSubtype", "resource", "matchesPattern", "resource", "value" }
function printline() {
  print data["category"], data["recommenderSubtype"], data["resource"], data["matchesPattern"], data["resource"], data["value"]
}
{data[$1] = $2}
NF == 0 {printline(); delete data}
END {printline()}
' file.yaml
But after executing it, it only converts the first group of data (only the first 6 rows), like this:
category,recommenderSubtype,resource,matchesPattern,resource,value
COST,CHANGE_MACHINE_TYPE,instance-1,f1-micro,instance-1,g1-small
My original file is like this (with 1000 rows and more):
category:COST
recommenderSubtype:CHANGE_MACHINE_TYPE
resource:portal-1
matchesPattern:f1-micro
resource:portal-1
value:g1-small
category:PERFORMANCE
recommenderSubtype:CHANGE_MACHINE_TYPE
resource:old-3
matchesPattern:n1-standard-4
resource:old-3
value:n1-highmem-2
Is there any command I am missing?
The problem with the original script are these lines:
NF == 0 {printline(); delete data}
END {printline()}
The first line means: call printline() if the current line has no fields (i.e. it is empty). The second line means: call printline() once after all data has been processed.
The difficulty with this input data format is that it does not give a clear indicator of when to output the next record. In the following, I have simply changed the script to output the data every six input lines. In case there can be duplicate keys, the criterion for output might be "all fields populated" or similar, which would need to be programmed slightly differently.
#!/bin/sh -e
awk -F ":" -v OFS="," '
BEGIN {
  records_in = 0
  print "category","recommenderSubtype", "resource", "matchesPattern", "resource", "value"
}
{
  data[$1] = $2
  records_in++
  if(records_in == 6) {
    records_in = 0;
    print data["category"], data["recommenderSubtype"], data["resource"], data["matchesPattern"], data["resource"], data["value"]
  }
}
' file.yaml
Other comments
I have removed the delete statement because I am unsure what it does: the POSIX specification for awk only defines delete for single array elements, and in case the whole array should be deleted, it recommends looping over the elements. If all fields are always present in every group, it might as well be possible to eliminate it altogether.
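If blank-line separators cannot be relied on, one alternative (a sketch; it assumes every group starts with a category line) is to flush the previous group whenever a new category key appears:
awk -F ":" -v OFS="," '
BEGIN { print "category","recommenderSubtype","resource","matchesPattern","resource","value" }
function printline() {
  print data["category"], data["recommenderSubtype"], data["resource"], data["matchesPattern"], data["resource"], data["value"]
}
$1 == "category" && ("category" in data) { printline(); split("", data) }  # new group: flush the old one
{ data[$1] = $2 }
END { if ("category" in data) printline() }  # flush the final group
' file.yaml
Here split("", data) is the portable idiom for emptying an array between groups.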
Welcome to SO (I am new here as well). Next time you are asking, I would recommend tagging the question awk rather than bash because AWK is really the scripting language used in this question with bash only being responsible for calling awk with suitable parameters :)

How to extract data with field value greater than particular number

I am trying to extract the values/methods which take more than a particular number of milliseconds, but I am unable to provide the correct field separator:
awk -F'=' '$3>960' file
awk -F'=||' '$3>960' file
This is a sample line:
logAlias=Overall|logDurationMillis=34|logTimeStart=2019-09-12_05:22:02.602|logTimeStop=2019-09-12_05:22:02.636|logTraceUID=43adbcaf55de|getMethod1=26|getMethod2=0|getMethod3=0|getMethod4=1|getMethod5=8
I either see no result, or it gives me all the transactions.
Here is a generic, robust and easily extensible way:
awk -F'|' '{
  for(i=1;i<=NF;++i) {     # walk every |-separated key=value pair
    split($i,kv,"=")       # kv[1] is the key, kv[2] the value
    f[kv[1]]=kv[2]         # index each pair by its key name
  }
}
f["logDurationMillis"]>960' file
You may use
awk -F'[=|]' '$4>960' file
Note that [=|] is a regex matching either = or | and the value you want to compare against appears in the fourth field.
See online demo:
s="logAlias=Overall|logDurationMillis=34|logTimeStart=2019-09-12_05:22:02.602|logTimeStop=2019-09-12_05:22:02.636|logTraceUID=43adbcaf55de|getMethod1=26|getMethod2=0|getMethod3=0|getMethod4=1|getMethod5=8
logAlias=Overall|logDurationMillis=980|logTimeStart=2019-09-12_05:22:02.602|logTimeStop=2019-09-12_05:22:02.636|logTraceUID=43adbcaf55de|getMethod1=26|getMethod2=0|getMethod3=0|getMethod4=1|getMethod5=8"
awk -F'[=|]' '$4>960' <<< "$s"
Output:
logAlias=Overall|logDurationMillis=980|logTimeStart=2019-09-12_05:22:02.602|logTimeStop=2019-09-12_05:22:02.636|logTraceUID=43adbcaf55de|getMethod1=26|getMethod2=0|getMethod3=0|getMethod4=1|getMethod5=8
Could you please try the following; it may help in case your string does not appear in a fixed place.
awk '
match($0,/logDurationMillis=[0-9]+/){
  # "logDurationMillis=" is 18 characters long, so skip past it
  # and keep only the digits that follow; +0 forces a numeric compare
  if(substr($0,RSTART+18,RLENGTH-18)+0>960){
    print
  }
}
' Input_file
2nd Solution:
awk '
match($0,/logDurationMillis=[0-9]+/){
  val=substr($0,RSTART,RLENGTH)   # e.g. "logDurationMillis=34"
  sub(/.*=/,"",val)               # strip everything up to and including "="
  if(val+0>960){                  # +0 forces a numeric comparison
    print
  }
}
' Input_file
Here is how I do it:
awk -F'logDurationMillis=' '{split($2,a,"[|]")} a[1]>960' file
If it's the log duration logDurationMillis you are looking for, I set it as the field separator. This way I know for sure the next field begins with the value to get. Then split that field on | to get the number at the front of it. Then a[1] has your value and you can test it against whatever you need. No loop, so it should be fast.
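Reusing the two-line sample $s from the demo above, this approach selects the same record:
$ awk -F'logDurationMillis=' '{split($2,a,"[|]")} a[1]>960' <<< "$s"
logAlias=Overall|logDurationMillis=980|logTimeStart=2019-09-12_05:22:02.602|logTimeStop=2019-09-12_05:22:02.636|logTraceUID=43adbcaf55de|getMethod1=26|getMethod2=0|getMethod3=0|getMethod4=1|getMethod5=8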

Reformat data using awk

I have a dataset that contains rows of UUIDs followed by locations and transaction IDs. The UUIDs are separated by a semi-colon (';') and the transactions are separated by tabs, like the following:
01234;LOC_1=ABC LOC_1=BCD LOC_2=CDE
56789;LOC_2=DEF LOC_3=EFG
I know all of the location codes in advance. What I want to do is transform this data into a format I can load into SQL/Postgres for analysis, like this:
01234;LOC_1=ABC
01234;LOC_1=BCD
01234;LOC_2=CDE
56789;LOC_2=DEF
56789;LOC_3=EFG
I'm pretty sure I can do this easily using awk (or similar) by looking up location IDs from a file (ex. LOC_1) and matching any instance of the location ID and printing that out next to the UUID. I haven't been able to get it right yet, and any help is much appreciated!
My locations file is named location and my dataset is data. Note that I can edit the original file or write the results to a new file, either is fine.
awk without using split: use semicolon or tab as the field separator
awk -F'[;\t]' -v OFS=';' '{for (i=2; i<=NF; i++) print $1,$i}' file
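With the sample data above saved as file, this prints:
$ awk -F'[;\t]' -v OFS=';' '{for (i=2; i<=NF; i++) print $1,$i}' file
01234;LOC_1=ABC
01234;LOC_1=BCD
01234;LOC_2=CDE
56789;LOC_2=DEF
56789;LOC_3=EFG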
I don't think you need to match against a known list of locations; you should be able to just print each line as you go:
$ awk '{print $1; split($1,a,";"); for (i=2; i<=NF; ++i) print a[1] ";" $i}' file
01234;LOC_1=ABC
01234;LOC_1=BCD
01234;LOC_2=CDE
56789;LOC_2=DEF
56789;LOC_3=EFG
Your comment about knowing the locations and the mapping file makes me suspect that what your example seems to have done isn't exactly what is being asked - but it seems like you want to reformat each set of tab-delimited LOC= values into rows with their UUID in front.
If so, this will do the trick:
awk ' BEGIN {OFS=FS=";"} {split($2,locs,"\t"); for (n in locs) { print $1,locs[n]}}'
Given:
$ cat -A data.txt
01234;LOC_1=ABC^ILOC_1=BCD^ILOC_2=CDE$
56789;LOC_2=DEF^ILOC_3=EFG$
Then:
$ awk ' BEGIN {OFS=FS=";"} {split($2,locs,"\t"); for (n in locs) { print $1,locs[n]}}' data.txt
01234;LOC_1=ABC
01234;LOC_1=BCD
01234;LOC_2=CDE
56789;LOC_2=DEF
56789;LOC_3=EFG
The BEGIN {OFS=FS=";"} block sets the input and output delimiter to ;.
For each row, we then split the second field into an array named locs, splitting on tab, via - split($2,locs,"\t")
And then loop through locs printing the UUID and each loc value - for (n in locs) { print $1,locs[n]}
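One caveat: for (n in locs) does not guarantee that the elements come out in numeric order. If output order matters, loop by index instead, using the element count that split returns (a sketch):
$ awk 'BEGIN {OFS=FS=";"} {n=split($2,locs,"\t"); for (i=1; i<=n; i++) print $1,locs[i]}' data.txt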
How about one without a loop or a split, as follows (considering that Input_file is the same as the shown samples only):
awk 'BEGIN{FS=OFS=";"}{gsub(/[[:space:]]+/,"\n"$1 OFS)} 1' Input_file
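To unpack the trick: gsub replaces every run of whitespace in the record with a newline followed by the UUID ($1) and the OFS, so each tab-separated value lands on its own line already prefixed. The same one-liner with comments (a sketch):
awk 'BEGIN{ FS=OFS=";" }
{
  # turn "01234;LOC_1=ABC\tLOC_1=BCD" into
  # "01234;LOC_1=ABC\n01234;LOC_1=BCD"
  gsub(/[[:space:]]+/, "\n" $1 OFS)
}
1' Input_file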
This might work for you (GNU sed):
sed -r 's/((.*;)\S+)\s+(\S+)/\1\n\2\3/;P;D' file
Repeatedly replace the white space between locations with a newline, followed by the UUID and a ;, printing/deleting each line as it appears.

bash script and awk to sort a file

so I have a project for uni, and I can't get through the first exercise. Here is my problem:
I have a file, and I want to select some data inside of it and 'display' it in another file. But the data I'm looking for is a little bit scattered in the file, so I need several awk commands in my script to get it all.
Query= fig|1240086.14.peg.1
Length=76
Score E
Sequences producing significant alignments: (Bits) Value
fig|198628.19.peg.2053 140 3e-42
> fig|198628.19.peg.2053
Length=553
In the sample above, you can see that there are two kinds of 'Length=' lines, and I only want to 'catch' the 'Length=' that comes just after a 'Query='.
I have to use awk, so I tried this:
awk '{if(/^$/ && $(NR+1)/^Length=/) {split($(NR+1), b, "="); print b[2]}}'
but it doesn't work... does anyone have an idea?
You need to understand how Awk works. It reads a line, evaluates the script, then starts over, reading one line at a time. So there is no way to say "the next line contains this". What you can do is "if this line contains, then remember this until ..."
awk '/Query=/ { q=1; next } /Length/ && q { print } /./ { q=0 }' file
This sets the flag q to 1 (true) when we see Query= and then skips to the next line. If we see Length and we recently saw Query= then q will be 1, and so we print. In all other cases, q is set back to "not recently seen" on any non-empty line. (I put in the non-empty condition to allow for empty lines anywhere without affecting the overall logic.)
awk solution:
awk '/^Length=/ && r~/^Query/{ sub(/^[^=]+=/,""); printf "%s ",$0 }
NF{ r=$0 }END{ print "" }' file
NF{ r=$0 } - remembers the most recent non-empty line
/^Length=/ && r~/^Query/ - on encountering a Length= line whose previous non-empty line started with Query (ensured by r~/^Query/), strip everything up to the = and print the value
It sounds like this is what you want for the first part of your question:
$ awk -F'=' '!NF{next} f && ($1=="Length"){print $2} {f=($1=="Query")}' file
76
but I don't know what the second part is about, since there are no "data" lines in your input and only one valid output from your sample input, as best I can tell.
