GAWK Script - Print filename in BEGIN section - linux

I am writing a gawk script that begins
#!/bin/gawk -f
BEGIN { print FILENAME }
I am calling the file via ./script file1.html but the script just returns nothing. Any ideas?

You can use ARGV[1] instead of FILENAME if you really want the file name in the BEGIN block:
awk 'BEGIN{print ARGV[1]}' file

You can print the file name when you reach line 1 of each file:
FNR == 1
If you want something less cryptic and easier to understand:
FNR == 1 {print}
UPDATE
My first two solutions were incorrect. Thank you Dennis for pointing it out. His way is correct:
FNR == 1 {print FILENAME}

Straight from the man page (slightly reformatted):
FILENAME: The name of the current input file. If no files are specified on the command line, the value of FILENAME is “-”. However, FILENAME is undefined inside the BEGIN block (unless set by getline).

Building on Hai Vu's answer, I suggest that if you only want the file name printed once per file, the print needs to be wrapped in a conditional inside the action block:
if(FNR == 1) { print FILENAME };
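The three approaches discussed in this thread can be compared side by side. The following is a minimal sketch; the scratch file from mktemp and the variable names are made up for the demonstration. POSIX leaves FILENAME undefined in BEGIN, so the first result is typically empty but is not guaranteed to be.

```shell
# Hypothetical scratch file, used only for this demonstration.
tmp=$(mktemp)
printf 'line one\nline two\n' > "$tmp"

# FILENAME has not been set yet in BEGIN (typically prints an empty line).
in_begin=$(awk 'BEGIN { print FILENAME }' "$tmp")

# ARGV[1] already holds the first command-line operand in BEGIN.
from_argv=$(awk 'BEGIN { print ARGV[1] }' "$tmp")

# FILENAME is set once the first record of the file has been read.
from_rule=$(awk 'FNR == 1 { print FILENAME }' "$tmp")

printf 'BEGIN with FILENAME: [%s]\n' "$in_begin"
printf 'BEGIN with ARGV[1]:  [%s]\n' "$from_argv"
printf 'FNR == 1 rule:       [%s]\n' "$from_rule"

rm -f "$tmp"
```

Note that ARGV[1] is only the file name when the first operand really is a file; a leading var=value assignment on the command line would appear there instead.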

Related

I need to make an awk script to parse text in a file. I am not sure if I am doing it correctly

Hi, I need to make an awk script in order to parse a csv file and sort it in bash.
I need to get a list of presidents from Wikipedia and sort their years in office by year.
When it is all sorted out, each year needs to be in a text file.
I'm not sure I am doing it correctly.
Here is a portion of my csv file:
28,Woodrow Wilson,http:..en.wikipedia.org.wiki.Woodrow_Wilson,4.03.1913,4.03.1921,Democratic ,WoodrowWilson.gif,thmb_WoodrowWilson.gif,New Jersey
29,Warren G. Harding,http:..en.wikipedia.org.wiki.Warren_G._Harding,4.03.1921,2.8.1923,Republican ,WarrenGHarding.gif,thmb_WarrenGHarding.gif,Ohio
I want to include $2, which I think is the name, and sort by $4, which I think is the date the president took office.
Here is my actual awk file:
#!/usr/bin/awk -f
-F, '{
if (substr($4,length($4)-3,2) == "17")
{ print $2 > Presidents1700 }
else if (substr($4,length($4)-3,2) == "18")
{ print $2 > Presidents1800 }
else if (substr($4,length($4)-3,2) == "19")
{ print $2 > Presidents1900 }
else if (substr($4,length($4)-3,2) == "20")
{ print $2 > Presidents2000 }
}'
Here is my function running it:
SplitFile() {
printf "Task 4: Spliting file based on century\n"
awk -f $AFILE ${custFolder}/${month}/$DFILE
}
Where $AFILE is my awk file, and the directories listed on the right lead to my actual file.
Here is a portion of my output, it's actually several hundred lines long but in the
end this is what a portion of it looks like:
awk: presidentData/10/presidents.csv:47: 46,Joseph Biden,http:..en.wikipedia.org.wiki.Joe_Biden,20.01.2021,Incumbent , Democratic , Joe_Biden.jpg,thmb_Joe_Biden.jpg,Pennsilvania
awk: presidentData/10/presidents.csv:47: ^ syntax error
awk: presidentData/10/presidents.csv:47: 46,Joseph Biden,http:..en.wikipedia.org.wiki.Joe_Biden,20.01.2021,Incumbent , Democratic , Joe_Biden.jpg,thmb_Joe_Biden.jpg,Pennsilvania
awk: presidentData/10/presidents.csv:47: ^ syntax error
awk: presidentData/10/presidents.csv:47: 46,Joseph Biden,http:..en.wikipedia.org.wiki.Joe_Biden,20.01.2021,Incumbent , Democratic , Joe_Biden.jpg,thmb_Joe_Biden.jpg,Pennsilvania
awk: presidentData/10/presidents.csv:47: ^ syntax error
awk: presidentData/10/presidents.csv:47: 46,Joseph Biden,http:..en.wikipedia.org.wiki.Joe_Biden,20.01.2021,Incumbent , Democratic , Joe_Biden.jpg,thmb_Joe_Biden.jpg,Pennsilvania
awk: presidentData/10/presidents.csv:47:
I know the output is not very helpful; I would rather post a screenshot but I can't. I tried getting help, but these online classes can be really hard and getting help at a distance is tough. The syntax errors above seem to be pointing to commas in the csv file.
After the edits, it's clear you are trying to classify the presidents by century, writing each name to a file for the century in which the president took office.
As stated in my comments above, you don't include single quotes or command-line arguments in an awk script file. You use the BEGIN {...} rule to set the field separator FS = ",". Then there are several ways to split the fourth field; split() is just as easy as anything else.
That leaves the year in which the president took office in the third element of arr (split() fills arr[1], arr[2] and arr[3] with the day, month and year; arrays created by split() start at index 1). Then it is just a matter of comparing against the largest year first and working downward, redirecting the output to the file for that century.
Continuing with what you started, your awk script will look similar to:
#!/usr/bin/awk -f
BEGIN { FS = "," }
{
split ($4, arr, ".")
if (arr[3] >= 2000)
print $2 > "Presidents2000"
else if (arr[3] >= 1900)
print $2 > "Presidents1900"
else if (arr[3] >= 1800)
print $2 > "Presidents1800"
else if (arr[3] >= 1700)
print $2 > "Presidents1700"
}
Now make it executable (for convenience). Presuming the script is in the file pres.awk:
$ chmod +x pres.awk
Now simply call the awk script passing the .csv filename as the argument, e.g.
$ ./pres.awk my.csv
Now list the files named Presid* and see what is created:
$ ls -al Presid*
-rw-r--r-- 1 david david 33 Oct 8 22:28 Presidents1900
And verify the contents are what you needed:
$ cat Presidents1900
Woodrow Wilson
Warren G. Harding
Presuming that is the output you are looking for based on your attempt.
(note: you need to quote the output file name to ensure, e.g. Presidents1900 isn't taken as a variable that hasn't been set yet)
Let me know if you have further questions.
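As a variation on the if/else chain above, the output file name can be computed from the year itself. This is only a sketch under assumptions: the scratch directory, the file name presidents.csv and the variables century and out are made up for the example, and it assumes the date in field 4 always ends in a four-digit year.

```shell
# Work in a scratch directory so the generated files are easy to inspect.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Two rows copied from the sample CSV in the question.
cat > presidents.csv <<'EOF'
28,Woodrow Wilson,http:..en.wikipedia.org.wiki.Woodrow_Wilson,4.03.1913,4.03.1921,Democratic ,WoodrowWilson.gif,thmb_WoodrowWilson.gif,New Jersey
29,Warren G. Harding,http:..en.wikipedia.org.wiki.Warren_G._Harding,4.03.1921,2.8.1923,Republican ,WarrenGHarding.gif,thmb_WarrenGHarding.gif,Ohio
EOF

awk -F, '{
    n = split($4, arr, ".")             # day, month, year; arr[n] is the year
    century = int(arr[n] / 100) * 100   # e.g. 1913 becomes 1900
    out = "Presidents" century          # build the file name as a string first
    print $2 > out
}' presidents.csv

cat Presidents1900
```

Building the name in a variable before the redirection avoids the precedence surprises that some awk implementations show with an unparenthesized concatenation after >.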

AWK how to process multiple files and comparing them IN CONTROL FILE! (not command line one-liner)

I read all of the answers to similar problems, but they do not work for me because my files are not uniform: they contain several control headers. In such a case it is safer to write a script than a one-liner, yet all the answers focus on one-liners. In theory one-liner commands should be convertible to a script, but I am struggling to achieve:
printing the control headers
print only the records started with 16 in <file 1> where value of column 2 NOT EXISTS in column 2 of the <file 2>
I ended up with this:
BEGIN {
FS="\x01";
OFS="\x01";
RS="\x02\n";
ORS="\x02\n";
file1=ARGV[1];
file2=ARGV[2];
count=0;
}
/^#/ {
print;
count++;
}
# reset counters after control headers
NR=1;
FNR=1;
# Below gives syntax error
/^16/ AND NR==FNR {
a[$2];next; 'FNR==1 || !$2 in a' file1 file2
}
END {
}
Googling only gives me results for command line processing and documentation is also silent in that regard. Does it mean it cannot be done?
Perhaps try:
script.awk:
BEGIN {
OFS = FS = "\x01"
ORS = RS = "\x02\n"
}
NR==FNR {
if (/^16/) a[$2]
next
}
/^16/ && !($2 in a) || /^#/
Note the parentheses: !$2 in a would be parsed as (!$2) in a
Invoke with:
awk -f script.awk FILE2 FILE1
Note order of FILE1 / FILE2 is reversed; FILE2 must be read first to pre-populate the lookup table.
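To see the two-pass logic on something small, here is a sketch that makes simplifying assumptions: it uses | as the field separator and ordinary newlines as record separators instead of the \x01 and \x02 control characters, and the sample records and the array name seen are invented for illustration.

```shell
workdir=$(mktemp -d)

# file1: records to filter. file2: supplies the column-2 keys to exclude.
cat > "$workdir/file1" <<'EOF'
#header for file1
16|A|keep me
16|B|drop me
17|C|not a 16 record
EOF

cat > "$workdir/file2" <<'EOF'
#header for file2
16|B|present in file2
EOF

result=$(awk -F'|' '
    NR == FNR {                 # true only while reading the first operand
        if (/^16/) seen[$2]     # remember column 2 of its 16-records
        next                    # never print anything from the first operand
    }
    /^#/ || (/^16/ && !($2 in seen))   # no action given, so matching records are printed
' "$workdir/file2" "$workdir/file1")

printf '%s\n' "$result"
```

With this sample data only the header of file1 and the record with key A are printed.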
First of all, the short answer to my question should be "NOT POSSIBLE". If anyone had read the question carefully and knew AWK in full, that would be the obvious answer. I wish I had known it sooner instead of wasting a few days trying to write the script.
Also, there is no such thing as a minimal reproducible example (this was always a constant pain on TeX groups). I need a full working example: if it works on 1 row there is no guarantee that it works on 2 rows, and my number of rows is about 127 million.
If you read the code carefully, then you would know what is not working: I marked in a comment what is giving the syntax error. Anyway, as @Daweo suggested, the word AND cannot be used as a logical operator in the pattern section. So, because we don't need to print anything from the first file, the whole trick is to put the conditional inside the second pair of braces:
awk -F, 'BEGIN{} NR==FNR{a[$1];next} !($1 in a) { if (/^16/) print $0} ' set1.txt set2.txt
This assumes the separator in the example above is a comma. I don't know where the assumption came from that a multi-character RS is supported only in GNU awk. On macOS BSD awk it works exactly the same, but in fact RS="\x02\n" is a single separator, not two separators.

awk sub() not doing replacement

The example text file.
in2.txt file:
{commented out line} KeyWordName = KeyWordValueAA
KeyWordName = KeyWordValueAA
The regexp finds the string:
awk '/^KeyWordName[[:blank:]]=[[:blank:]].*$/' in2.txt
Output:
KeyWordName = KeyWordValueAA
Command that results in no errors or changed text:
awk '{sub(/^KeyWordName[[:blank:]]=[[:blank:]].*$/, "KeyWordValueBB")}' in2.txt
Removing the double quotes does not work either.
Reading the manpage did not help either.
You need to explicitly print to see any output:
A common awk idiom is to add a 1 after the action:
$ awk '{sub(/^KeyWordName[[:blank:]]=[[:blank:]].*$/, "KeyWordValueBB")}1' file
{commented out line} KeyWordName = KeyWordValueAA
KeyWordValueBB
The 1 is an always-true pattern and since there's no corresponding action, the default action of print is performed (on all input lines).
Or you can just use print:
$ awk '{sub(/^KeyWordName[[:blank:]]=[[:blank:]].*$/, "KeyWordValueBB")} {print}' file
{commented out line} KeyWordName = KeyWordValueAA
KeyWordValueBB
To change the file when you don't have the -i inplace option (GNU awk), save to a temporary file and rename it to your file (that's what -i inplace does for you, anyway). By using && we are sure the mv command will only be executed if awk terminates with success. Still, you might want to save a copy of your original file first in case awk is "successful" from the OS point of view but doesn't do what you expected!
awk '{ ... }' file > tmp && mv tmp file
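Put together, the backup, edit and rename steps might look like the sketch below. The scratch directory and the .bak and .tmp suffixes are arbitrary choices for the example, and a literal space stands in for [[:blank:]] in the pattern so that it also runs on older awk implementations without POSIX character classes.

```shell
workdir=$(mktemp -d)
file="$workdir/in2.txt"

# Recreate the two-line sample file from the question.
printf '%s\n' '{commented out line} KeyWordName = KeyWordValueAA' \
              'KeyWordName = KeyWordValueAA' > "$file"

cp "$file" "$file.bak"      # keep a copy in case the edit is not what you wanted

# sub() modifies $0; the trailing 1 prints every line, edited or not.
# mv only runs if awk exits successfully.
awk '{ sub(/^KeyWordName = .*$/, "KeyWordValueBB") } 1' "$file" > "$file.tmp" &&
    mv "$file.tmp" "$file"

cat "$file"
```

Comparing the result against the .bak copy (for example with diff) is a cheap way to confirm the substitution did what was intended before deleting the backup.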

bash script and awk to sort a file

So I have a project for uni, and I can't get through the first exercise. Here is my problem:
I have a file, and I want to select some data inside of it and 'display' it in another file. But the data I'm looking for is a little bit scattered in the file, so I need several awk commands in my script to get them.
Query= fig|1240086.14.peg.1
Length=76
Score E
Sequences producing significant alignments: (Bits) Value
fig|198628.19.peg.2053 140 3e-42
> fig|198628.19.peg.2053
Length=553
In the excerpt above, you can see that there are 2 types of 'Length=', and I only want to 'catch' the "Length=" lines that come just after a "Query=".
I have to use awk so I tried this :
awk '{if(/^$/ && $(NR+1)/^Length=/) {split($(NR+1), b, "="); print b[2]}}'
but it doesn't work... does anyone have an idea?
You need to understand how Awk works. It reads a line, evaluates the script, then starts over, reading one line at a time. So there is no way to say "the next line contains this". What you can do is "if this line contains, then remember this until ..."
awk '/Query=/ { q=1; next } /Length/ && q { print } /./ { q=0 }' file
This sets the flag q to 1 (true) when we see Query= and then skips to the next line. If we see Length and we recently saw Query= then q will be 1, and so we print. In other cases, set q back to "not recently seen" on any non-empty line. (I put in the non-empty condition to allow for empty lines anywhere without affecting the overall logic.)
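The same remember-then-test idea can be tried on a small sample. This is a sketch under assumptions: the input is modelled on the excerpt in the question, the blank lines between records are a guess at the real file layout, and the flag name armed is made up. Unlike the command above, it prints only the value after the equals sign.

```shell
# Sample input modelled on the excerpt in the question (blank lines assumed).
input='Query= fig|1240086.14.peg.1

Length=76

Sequences producing significant alignments:

> fig|198628.19.peg.2053
Length=553'

lengths=$(printf '%s\n' "$input" | awk '
    /^Query=/           { armed = 1; next }              # remember that a Query line was seen
    armed && /^Length=/ { sub(/^Length=/, ""); print }   # the Length that follows it
    NF                  { armed = 0 }                    # any other non-empty line clears the flag
')

printf '%s\n' "$lengths"
```

The second Length= line is ignored because the lines between it and the Query= line have already cleared the flag.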
awk solution:
awk '/^Length=/ && r~/^Query/{ sub(/^[^=]+=/,""); printf "%s ",$0 }
NF{ r=$0 }END{ print "" }' file
NF{ r=$0 } - remember each non-empty line in r, so r holds the previous non-empty line when the next record is processed
/^Length=/ && r~/^Query/ - match a Length line whose previous non-empty line started with Query (ensured by r~/^Query/)
It sounds like this is what you want for the first part of your question:
$ awk -F'=' '!NF{next} f && ($1=="Length"){print $2} {f=($1=="Query")}' file
76
but I don't know what the second part is about, since there are no "data" lines in your input and only 1 valid output from your sample input, as best I can tell.

awk script to process 400 .txt files

I have a nice .awk script that takes the 2nd field ($2) and prints it. Because the data in the .txt files only go down 8192 lines, any lines after that are irrelevant (the script takes care of that). I have 400+ .tst files that need to have the same thing done and have the outputs placed into a single file. So how would I go through every .tst file in the current directory? I tried piping the cat output to a single-line version of the script but it only processed the first file. Any suggestions?
BEGIN{
}
{
print $2 "\n";
if (NR==8192)
exit;
}
END {
print NR "\n";
}
This should work -
awk 'FNR<=8192{ print $2 }' *.tst > finalfile
Just glob all the .tst files in the current directory and redirect the output to outfile:
$ awk 'FNR<=8192{print $2"\n";next}{print FNR"\n";nextfile}' *.tst > outfile
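The reason both answers test FNR rather than NR is that FNR restarts at 1 for every input file, while NR keeps counting across all of them. A small sketch makes the difference visible; the two three-line files and the limit of 2 are stand-ins for the real .tst files and 8192.

```shell
workdir=$(mktemp -d)
printf 'a 1\na 2\na 3\n' > "$workdir/one.tst"
printf 'b 1\nb 2\nb 3\n' > "$workdir/two.tst"

# limit stands in for 8192 so the effect shows up on tiny files.
by_fnr=$(awk -v limit=2 'FNR <= limit { print $2 }' "$workdir"/*.tst)
by_nr=$(awk -v limit=2 'NR <= limit { print $2 }' "$workdir"/*.tst)

printf 'FNR: %s\n' "$(printf '%s' "$by_fnr" | tr '\n' ' ')"
printf 'NR:  %s\n' "$(printf '%s' "$by_nr" | tr '\n' ' ')"
```

The FNR version yields the first two values from each file, whereas the NR version stops contributing after the first file, which matches the "only processed the first file" symptom described in the question.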
