I found the following bash script for converting a file with key:value information to CSV file:
awk -F ":" -v OFS="," '
BEGIN { print "category","recommenderSubtype", "resource", "matchesPattern", "resource", "value" }
function printline() {
print data["category"], data["recommenderSubtype"], data["resource"], data["matchesPattern"], data["resource"], data["value"]
}
{data[$1] = $2}
NF == 0 {printline(); delete data}
END {printline()}
' file.yaml
But after executing it, it only converts the first group of data (the first 6 rows of input), like this:
category,recommenderSubtype,resource,matchesPattern,resource,value
COST,CHANGE_MACHINE_TYPE,instance-1,f1-micro,instance-1,g1-small
My original file looks like this (with 1000 rows or more):
category:COST
recommenderSubtype:CHANGE_MACHINE_TYPE
resource:portal-1
matchesPattern:f1-micro
resource:portal-1
value:g1-small
category:PERFORMANCE
recommenderSubtype:CHANGE_MACHINE_TYPE
resource:old-3
matchesPattern:n1-standard-4
resource:old-3
value:n1-highmem-2
Is there any command I am missing?
The problem with the original script is these lines:
NF == 0 {printline(); delete data}
END {printline()}
The first line means: call printline() when the current line has no fields, i.e. when it is blank. The second line means: call printline() once after all data has been processed. Since your input contains no blank lines, the NF == 0 rule never fires, and only the END rule produces a data row.
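To see the two rules in action: if the groups were separated by blank lines, NF == 0 would fire once per group. A minimal sketch with a reduced two-key record (made-up data, just for illustration):

```shell
# Hypothetical input with a blank line between groups: NF == 0 fires
# once per group, and END flushes the final group.
printf 'category:COST\nvalue:g1-small\n\ncategory:PERFORMANCE\nvalue:n1-highmem-2\n' |
awk -F ":" -v OFS="," '
  { data[$1] = $2 }
  NF == 0 { print data["category"], data["value"] }
  END     { print data["category"], data["value"] }
'
```

With no blank lines in the real input, only the END rule ever prints, which is why you see a single data row.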
The difficulty with this input format is that it gives no good indicator of when to output the next record. In the following, I have simply changed the script to output the data every six input lines. If there can be duplicate keys, the criterion for output might instead be "all fields populated" or similar, which would need to be programmed slightly differently.
#!/bin/sh -e
awk -F ":" -v OFS="," '
BEGIN {
records_in = 0
print "category","recommenderSubtype", "resource", "matchesPattern", "resource", "value"
}
{
data[$1] = $2
records_in++
if(records_in == 6) {
records_in = 0;
print data["category"], data["recommenderSubtype"], data["resource"], data["matchesPattern"], data["resource"], data["value"]
}
}
' file.yaml
Other comments
I have removed the delete statement because I am unsure what it does: the POSIX specification for awk only defines delete for single array elements, and recommends a loop over the elements when the whole array should be deleted. If all fields are always present, it can be eliminated altogether.
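For reference, the portable way to clear the whole array is the loop the specification recommends; a small sketch (made-up data) that clears the array and then counts what remains:

```shell
# POSIX-safe whole-array clear: delete each element in a loop
# (`delete data` without a subscript is a widespread extension, not POSIX).
printf 'a:1\nb:2\n' | awk -F ":" '
  { data[$1] = $2 }
  END {
    for (k in data) delete data[k]
    n = 0
    for (k in data) n++
    print n
  }
'
```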
Welcome to SO (I am new here as well). Next time you ask, I would recommend tagging the question awk rather than bash, because AWK is really the scripting language used in this question, with bash only responsible for calling awk with suitable parameters :)
I read all the answers to similar problems, but they do not work for me because my files are not uniform: they contain several control headers, so in such a case it is safer to create a script than a one-liner, and all the answers focus on one-liners. In theory one-liner commands should be convertible to a script, but I am struggling to achieve:
printing the control headers
printing only the records starting with 16 in <file 1> whose column 2 value does NOT exist in column 2 of <file 2>
I end up with this:
BEGIN {
FS="\x01";
OFS="\x01";
RS="\x02\n";
ORS="\x02\n";
file1=ARGV[1];
file2=ARGV[2];
count=0;
}
/^#/ {
print;
count++;
}
# reset counters after control headers
NR=1;
FNR=1;
# Below gives syntax error
/^16/ AND NR==FNR {
a[$2];next; 'FNR==1 || !$2 in a' file1 file2
}
END {
}
Googling only gives me results for command-line processing, and the documentation is also silent in that regard. Does it mean it cannot be done?
Perhaps try:
script.awk:
BEGIN {
OFS = FS = "\x01"
ORS = RS = "\x02\n"
}
NR==FNR {
if (/^16/) a[$2]
next
}
/^16/ && !($2 in a) || /^#/
Note the parentheses: !$2 in a would be parsed as (!$2) in a
Invoke with:
awk -f script.awk FILE2 FILE1
Note order of FILE1 / FILE2 is reversed; FILE2 must be read first to pre-populate the lookup table.
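A quick way to convince yourself is a throwaway run (plain comma separators here instead of the \x01/\x02 delimiters, and made-up file names, purely for readability):

```shell
# FILE2 (the lookup) is read first; FILE1 is then filtered against it.
printf '16,a\n16,b\n' > /tmp/file2.txt
printf '#header\n16,a\n16,c\n17,b\n' > /tmp/file1.txt
awk -F, '
  NR==FNR { if (/^16/) a[$2]; next }    # first file: build lookup of column 2
  /^16/ && !($2 in a) || /^#/           # second file: 16-records not in lookup, plus headers
' /tmp/file2.txt /tmp/file1.txt
```

Here 16,a is suppressed because its column 2 appears in the lookup, while the header and 16,c pass through.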
First of all, the short answer to my question should be "NOT POSSIBLE"; anyone who read the question carefully and knew AWK in full would see that is the obvious answer. I wish I had known it sooner instead of wasting a few days trying to write the script.
Also, there is no such thing as a minimal reproducible example (this was always a constant pain on TeX groups). I need a full working example: if it works on 1 row there is no guarantee it works on 2 rows, and my number of rows is ~127 million.
If you read the code carefully then you would know what is not working; I marked in a comment what gives the syntax error. Anyway, as @Daweo suggested, there is no way to use a logic operator in the pattern section. So, because we don't need printing in the first file, the whole trick is to do the conditional in the second action block:
awk -F, 'BEGIN{} NR==FNR{a[$1];next} !($1 in a) { if (/^16/) print $0} ' set1.txt set2.txt
assuming in the above example that the separator is a comma. I don't know where the assumption that multiple-character RS is supported only in GNU awk came from; on macOS BSD awk it works exactly the same, and in fact RS="\x02\n" is a single separator, not two separators.
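A throwaway check of that one-liner, with single-column files and made-up names (with no commas on a line, $1 is the whole line):

```shell
# set1.txt holds the keys to exclude; set2.txt is filtered against them,
# keeping only lines that start with 16.
printf '16x\n99y\n' > /tmp/set1.txt
printf '16x\n16z\n99y\n17w\n' > /tmp/set2.txt
awk -F, 'BEGIN{} NR==FNR{a[$1];next} !($1 in a) { if (/^16/) print $0 }' /tmp/set1.txt /tmp/set2.txt
```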
So I have a project for uni, and I can't get through the first exercise. Here is my problem:
I have a file, and I want to select some data inside of it and 'display' it in another file. But the data I'm looking for is a little bit scattered in the file, so I need several awk commands in my script to get them.
Query= fig|1240086.14.peg.1
Length=76
Score E
Sequences producing significant alignments: (Bits) Value
fig|198628.19.peg.2053 140 3e-42
> fig|198628.19.peg.2053
Length=553
As you can see above, there are 2 types of 'Length=' lines, and I only want to 'catch' the "Length=" that comes just after a "Query=".
I have to use awk so I tried this :
awk '{if(/^$/ && $(NR+1)/^Length=/) {split($(NR+1), b, "="); print b[2]}}'
but it doesn't work... does anyone have an idea?
You need to understand how Awk works. It reads a line, evaluates the script, then starts over, reading one line at a time. So there is no way to say "the next line contains this". What you can do is "if this line contains, then remember this until ..."
awk '/Query=/ { q=1; next } /Length/ && q { print } /./ { q=0 }' file
This sets the flag q to 1 (true) when we see Query= and then skips to the next line. If we see Length and we recently saw Query= then q will be 1, and so we print. In other cases, set q back to "not recently seen" on any non-empty line. (I put in the non-empty condition to allow for empty lines anywhere without affecting the overall logic.)
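Running that on the sample from the question (fed in via printf here):

```shell
# Only the Length= line that directly follows a Query= line is printed;
# the Length=553 after the "> fig|..." line is skipped.
printf 'Query= fig|1240086.14.peg.1\nLength=76\n\nScore E\n> fig|198628.19.peg.2053\nLength=553\n' |
awk '/Query=/ { q=1; next } /Length/ && q { print } /./ { q=0 }'
```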
awk solution:
awk '/^Length=/ && r~/^Query/{ sub(/^[^=]+=/,""); printf "%s ",$0 }
NF{ r=$0 }END{ print "" }' file
NF{ r=$0 } - capture the whole non-empty line
/^Length=/ && r~/^Query/ - fires on a Length line whose previous non-empty line started with Query (ensured by r~/^Query/)
It sounds like this is what you want for the first part of your question:
$ awk -F'=' '!NF{next} f && ($1=="Length"){print $2} {f=($1=="Query")}' file
76
but I don't know what the second part is about, since there are no "data" lines in your input and only one valid output from your sample input, as best I can tell.
New to AWK. I have a file with the following content:
FirstName,LastName,Email,ID,Number,IDToBeMatched
John,Smith,js#.com,js30,4,kt78
George,Haynes,gh#.com,gh67,3,re201
Mary,Dewar,md#.com,md009,4,js30
Kevin,Pan,kp#.com,kp41,2,md009
,,,,,ti10
,,,,,qwe909
,,,,,md009
,,,,,kor28
,,,,,gh67
The idea is to check whether any of the fields below the header ID matches any of the fields below IDToBeMatched and, if there is a match, to print the whole record except for the last field (i.e. IDToBeMatched). So my final output should look like:
FirstName,LastName,Email,ID,Number
John,Smith,js#.com,js30,4
George,Haynes,gh#.com,gh67,3
Mary,Dewar,md#.com,md009,4
My code so far
awk 'BEGIN{
FS=OFS=",";SUBSEP=",";
}
{
# all[$1,$2,$3,$4,$5]
a[$4]++;
b[$6]++;
}
END{ #for(k in all){
for(i in a){
for(j in b){
if(i==j){
print i #k
}
}
}
#}
}' inputfile
This prints only the match. If, however, I try to introduce another loop by uncommenting the lines in the above script in order to get the whole line for the matching field, things get messy. I understand why, but I cannot find the solution. I thought of introducing a next statement, but it is not allowed in END. My AWK defaults to GAWK and I would prefer a (G)AWK-only solution.
Thank you in advance.
The last field has more records because it was copied/pasted from an ID "pool" which does not necessarily have the same number of records as the file it was pasted into.
$ awk -F, 'NR==FNR{a[$6];next} (FNR==1)||($4 in a){sub(/,[^,]+$/,"");print}' file file
FirstName,LastName,Email,ID,Number
John,Smith,js#.com,js30,4
George,Haynes,gh#.com,gh67,3
Mary,Dewar,md#.com,md009,4
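The same file is given twice: the first pass (NR==FNR) collects every IDToBeMatched value into a, and the second pass prints the header plus any record whose ID is in that set, with sub() stripping the last field. Re-running it on the sample (written to a scratch file here) confirms the output:

```shell
# Write the sample to a scratch file and pass it to awk twice.
cat > /tmp/inputfile <<'EOF'
FirstName,LastName,Email,ID,Number,IDToBeMatched
John,Smith,js#.com,js30,4,kt78
George,Haynes,gh#.com,gh67,3,re201
Mary,Dewar,md#.com,md009,4,js30
Kevin,Pan,kp#.com,kp41,2,md009
,,,,,ti10
,,,,,qwe909
,,,,,md009
,,,,,kor28
,,,,,gh67
EOF
awk -F, 'NR==FNR{a[$6];next} (FNR==1)||($4 in a){sub(/,[^,]+$/,"");print}' /tmp/inputfile /tmp/inputfile
```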
On day one I may receive large CSV output such as:
this,is,a,test
bob,is,your,uncle
sound,one,"Zen proverb",clapping
On day two I may receive output such as:
test,this,is,a
clapping,one,sound,"Zen proverb"
uncle,bob,is,your
Neo,the,Matrix,"Has you"
The column and row I'm interested in will always be random; I'll never know which field the output will come to me in. But I'm only interested in the vertical column containing a certain string, for example 'uncle'.
test
clapping
uncle
Neo
I'm a newbie to awk and Perl, but I imagine awk would be able to print the first column matching a given string (up and down the column). Does anyone know how I should go about parsing this kind of data?
It sounds like you want the following: given a string and a comma separated file, find the first match of the string and output that field for each record in the file. Make 2 passes on the file, with the first pass looking for the match:
s=uncle
awk 'NR==FNR && /'$s'/ { for( i=1; i<=NF; i++ ) if( $i ~ /'$s'/ ) { a=i; nextfile; } }
NR!=FNR{ print $a}' FS=, input input
Note that if the string is not in the file, the second pass will print the whole record ($a is $0 when a is unset). Also, nextfile is not standard awk, but does exist in gawk. Rather than nextfile, you can use NR==FNR && /'$s'/ && !a as the first pattern, or just invoke awk twice, with the first pass finding the column to output and the second doing the printing.
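One way to avoid splicing the shell variable directly into the awk program (fragile if it contains regex metacharacters or quotes) is to pass it with -v. A sketch of the same two-pass idea, run here on the day-two sample (file name made up):

```shell
# First pass: find the column whose field matches s; second pass: print
# that column for every record.
printf 'test,this,is,a\nclapping,one,sound,"Zen proverb"\nuncle,bob,is,your\nNeo,the,Matrix,"Has you"\n' > /tmp/day2.csv
awk -v s=uncle -F, '
  NR==FNR { if (!found) for (i=1; i<=NF; i++) if ($i ~ s) { col=i; found=1 }; next }
  { print $col }
' /tmp/day2.csv /tmp/day2.csv
```

As with the original, if s never matches, col stays 0 and $0 (the whole record) is printed.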
I am struggling with this awk code, which should emulate the tail command:
num=$1;
{
vect[NR]=$0;
}
END{
for(i=NR-num;i<=NR;i++)
print vect[$i]
}
So what I'm trying to achieve here is a tail command emulated by awk.
For example, cat somefile | awk -f tail.awk 10
should print the last 10 lines of a text file. Any suggestions?
All of these answers store the entire source file. That's a horrible idea and will break on larger files.
Here's a quick way to store only the number of lines to be output (note that the more efficient tail will always be faster because it doesn't read the entire source file!):
awk -vt=10 '{o[NR%t]=$0}END{i=(NR<t?0:NR);do print o[++i%t];while(i%t!=NR%t)}'
more legibly (and with less code golf):
awk -v tail=10 '
{
output[NR % tail] = $0
}
END {
if(NR < tail) {
i = 0
} else {
i = NR
}
do {
i = (i + 1) % tail;
print output[i]
} while (i != NR % tail)
}'
Explanation of legible code:
This uses the modulo operator to store only the desired number of items (the tail variable). As each line is parsed, it is stored on top of older array values (so line 11 gets stored in output[1]).
The END stanza sets an increment variable i to either zero (if we've got fewer than the desired number of lines) or else the number of lines, which tells us where to start recalling the saved lines. Then we print the saved lines in order. The loop ends when we've returned to that first value (after we've printed it).
You can replace the if/else stanza (or the ternary clause in my golfed example) with just i = NR if you don't care about getting blank lines to fill the requested number (echo "foo" |awk -vt=10 … would have nine blank lines before the line with "foo").
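A quick check with seq as a stand-in input: 15 lines in, the last 10 out.

```shell
# The ring buffer keeps only the last t lines (here lines 6 through 15).
seq 15 | awk -v t=10 '{o[NR%t]=$0} END{i=(NR<t?0:NR); do print o[++i%t]; while (i%t!=NR%t)}'
```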
for(i=NR-num;i<=NR;i++)
print vect[$i]
In awk, $ is the field-access operator, so vect[$i] indexes the array with the value of field i of the last record read, not with i itself. Use just plain i:
for(i=NR-num;i<=NR;i++)
print vect[i]
The full code that worked for me is:
#!/usr/bin/awk -f
BEGIN{
num=ARGV[1];
# Make that arg empty so awk doesn't interpret it as a file name.
ARGV[1] = "";
}
{
vect[NR]=$0;
}
END{
for(i=NR-num;i<=NR;i++)
print vect[i]
}
You should probably add some code to the END to handle the case when NR < num.
You need to add -v num=10 to the awk command line to set the value of num. And start at NR-num+1 in your final loop, otherwise you'll end up with num+1 lines of output.
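Putting those fixes together with a clamp for short inputs, a corrected sketch (using seq as stand-in input):

```shell
# Corrected tail logic: num comes in via -v, the loop starts at NR-num+1,
# and the start index is clamped at 1 when the file has fewer than num lines.
seq 15 | awk -v num=10 '
  { vect[NR] = $0 }
  END {
    start = NR - num + 1
    if (start < 1) start = 1
    for (i = start; i <= NR; i++) print vect[i]
  }
'
```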
This might work for you:
awk '{a=a b $0;b=RS;if(NR<=v)next;a=substr(a,index(a,RS)+1)}END{print a}' v=10
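To unpack it: a accumulates the lines joined by RS (b is empty for the first line), and once NR exceeds v the oldest line is trimmed off the front with substr/index, so at END a holds exactly the last v lines. For example:

```shell
# 15 lines in, the last 10 out.
seq 15 | awk '{a=a b $0;b=RS;if(NR<=v)next;a=substr(a,index(a,RS)+1)}END{print a}' v=10
```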