AWK - using a string containing the required fields

I thought it would be easy to define a string such as "1 2 3" and use it within AWK (GAWK) to extract the required fields; how wrong I have been.
I have tried creating AWK arrays, BASH arrays, splitting, string substitution etc., but could not find any method to use the resulting 'chunks' (i.e. the column/field numbers) in a print statement.
I believe Akshay Hegde has provided an excellent solution with the get_cols function, here,
but it was over 8 years ago, and I am really struggling to work out 'how it works', namely, what this is doing:
s = length(s) ? s OFS $(C[i]) : $(C[i])
I am unable to post a comment asking for clarification due to my lack of reputation (and it is an old post).
Is someone able to explain how the solution works?
NB I don't think I need the sub as I am using the following to clean up (replace all non-numeric characters with a comma, i.e. the separator, and sort numerically):
Columns=$(echo $Input_string | sed 's/[^0-9]\+/,/g')
Columns=$(echo $Columns | xargs -n1 | sort -n | xargs)
(using this string, the awk would be executed as awk -v cols=$Columns -f test.awk infile in the given solution)
Given the informative answer from @Ed Morton, with a nice worked example, I have attempted to remove the need for a function (and also an additional awk program file). The intention is to have this within a shell script, and I would rather it be self-contained; it also serves as further investigation into 'how it works'.
Fields="1 2 3"
echo $Fields | awk -F "," '{n=split($0,Column," "); for(i=1;i<=n;i++) s = length(s) ? s OFS $(Column[i]) : $(Column[i])}END{print "s="s " arr1="Column[1]" arr2="Column[2]" arr3="Column[3]}'
The results have surprised me (taking note of my comment to Ed):
s=1 2 3 arr1=1 arr2=2 arr3=3
The above clearly shows the split into the array has worked, but I thought s would include a $ for each ternary-operator concatenation, i.e. "$1 $2 $3".
Moreover, I was hoping to append the actual file to the above command, having found that echo $string | awk '{program}' file.name is accepted.
NB it is a little insulting that my question has been marked as -1, indicating little research effort, as I have spent days trying to work this out.
Taking all the information above, I think s results in "1 2 3", but the print doesn't accept this in the same way as it does when called from a function; it simply tries to 'print 1 2 3' in relation to the file, which seems to be how all my efforts have ended up.
This really confuses me, as Ed's 'diagonal' example works from the command line, indicating that the concept of 'print s' is absolutely fine when used with a file name as input.
Can anyone suggest how this (example below) can work?
I don't know if piping from echo while also appending the file name is strictly allowed, but it appears to work (?!?!?!)
(failed result)
echo $Fields | awk -F "," '{n=split($0,Column," "); for(i=1;i<=n;i++) s = length(s) ? s OFS $(Column[i]) : $(Column[i])}END{print s}' myfile.txt
This appears to go through myfile.txt and output all lines, each containing many comma-separated values, i.e. the whole file (I haven't included the values; this is for illustration only):
,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
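For reference, here is a minimal self-contained sketch (my assumption: GNU awk and a space-separated field list) that sidesteps the stdin/file conflict by passing the field list in with -v, as in the linked answer; when a filename operand is present, awk reads that file and ignores the pipe, so the field numbers cannot arrive via echo:
Fields="1 2 3"
awk -v cols="$Fields" '
    BEGIN { n = split(cols, Column, " ") }    # parse the field list once, before any input line
    {
        s = ""                                # reset for every input line
        for (i = 1; i <= n; i++)
            s = length(s) ? s OFS $(Column[i]) : $(Column[i])
        print s                               # e.g. fields 1, 2 and 3 of each line of myfile.txt
    }
' myfile.txt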

what this is doing: s = length(s) ? s OFS $(C[i]) : $(C[i])
You have encountered a ternary operator; it has the following syntax:
condition ? valueiftrue : valueiffalse
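For instance, a throwaway sketch:
awk 'BEGIN { x = 5; print (x > 3 ? "big" : "small") }'    # prints "big"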
The length function, when provided with a single argument, returns the number of characters. In GNU AWK the integer 0 is considered false and other integers are considered true, so in this case it is an is-not-empty check. When s is not empty (it might also not be initialized yet, in which case GNU AWK will assume an empty string), it is concatenated with the output field separator (OFS, a space by default) and the C[i]-th field value, and the result is assigned to s; when s is empty, just the C[i]-th field value is assigned. Used multiple times, this allows building a string of values separated by OFS. Consider the following simple example: let's say you want to get the diagonal of a 2D matrix, stored in file.txt with the following content
1 2 3 4 5
6 7 8 9 10
11 12 13 14 15
16 17 18 19 20
21 22 23 24 25
then you might do
awk '{s = length(s) ? s OFS $(NR) : $(NR)}END{print s}' file.txt
which will output
1 7 13 19 25
Explanation: NR is the row number, so for the 1st row $(NR) is the 1st field, for the 2nd row it is the 2nd field, for the 3rd the 3rd field, and so on.
(tested in GNU Awk 5.0.1)

Isolate product names from strings by matching string after (including) first letter in a variable

I have a bunch of strings of following pattern in a text file:
201194_2012110634 Appliance 130 AB i Some optional (Notes )
300723_2017050006(2016111550) Device 16 AB i Note
The first part is the serial, the second is the date. The Device/Appliance name and model (about 10 possible different names) is the string after the date number and before (and including) AB i.
I was able to isolate dates and serials using
SERIAL=${line:0:6}
YEAR=${line:7:4}
I'm trying to isolate Device name and note after that:
#!/bin/bash
while IFS= read -r line || [[ -n $line ]]; do
NAME=${line#*[a-zA-Z]}
STRINGAP='Appliance '"${line/#*Appliance/}"
The first approach is to take everything after the first letter appearing in the line, which gives me
NAME = ppliance 130 AB i Some optional (Notes )
The second approach is to write tests for each of the ~10 possible appliance/device names and then append the appliance name after the subtracted test, then check which test actually matched Appliance / Device (or another name) and use that as input for the database.
Is it possible to write a line that would select everything, including the first letter, in a line of the text file? Then I would subtract everything after AB i to get the notes, and everything before AB i would become the appliance name.
Remove the ${line#*[a-zA-Z]} line (which will, as you see, remove the first letter of the name), and instead use
STRINGAP=$(echo "$line" | sed 's/^[0-9_]* \(.*\) AB i.*/\1/')
This drops the leading digits and underscore, and everything from " AB i" to the end.
Edit: The details are unclear - do you want to keep the "AB i", and will it always be "AB i"? If you want to keep it, change the line to
STRINGAP=$(echo "$line" | sed 's/^[0-9_]* \(.* AB i\).*/\1/')
I also forgot the double quotes around the text line.
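As a quick check (a sketch against the first sample line), the first form should print just the name-and-model part:
$ echo '201194_2012110634 Appliance 130 AB i Some optional (Notes )' | sed 's/^[0-9_]* \(.*\) AB i.*/\1/'
Appliance 130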
You can use sed and read to give you more control of parsing.
tmp> line2="300723_2017050006(2016111550) Device 16 AB i Note"
tmp> read serial date type val <<<$(echo $line2 |
       sed 's/\([0-9]*\)_\([0-9]*\)[^A-Z]*\(Device\|Appliance\) \([0-9]*\).*/\1 \2 \3 \4/')
tmp> echo "$serial|$date|$type|$val"
300723|2017050006|Device|16
Basically, read allows you to assign multiple variables in one line. The sed statement parses the line and gives you space-delimited output of its results. You can also read each variable separately if you don't mind running sed a few extra times:
device="$(echo $line2 | sed -e 's/^.*Device \([0-9]*\).*/\1/;t;d')"
appliance="$(echo $line2 | sed -e 's/^.*Appliance \([0-9]*\).*/\1/;t;d')"
This way $device is populated with the device number if present, and is blank otherwise (note the -e and the ;t;d at the end of the regex, which prevent it from dumping the line if it doesn't match).
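A quick way to see the ;t;d behaviour in isolation (a sketch):
$ line2="300723_2017050006(2016111550) Device 16 AB i Note"
$ echo $line2 | sed -e 's/^.*Appliance \([0-9]*\).*/\1/;t;d'    # no match, so d deletes: no output
$ echo $line2 | sed -e 's/^.*Device \([0-9]*\).*/\1/;t;d'       # match, so t skips the d
16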
Your question isn't clear but it seems like you might be trying to parse strings into substrings. Try this with GNU awk for the 3rd arg to match() and let us know if there's something else you were looking for:
$ awk 'match($0,/^([0-9]+)_([0-9]+)(\([0-9]+\))?\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(.*)/,a) {
    for (i=1; i<=8; i++) {
        print i, a[i]
    }
    print "---"
}' file
1 201194
2 2012110634
3
4 Appliance
5 130
6 AB
7 i
8 Some optional (Notes )
---
1 300723
2 2017050006
3 (2016111550)
4 Device
5 16
6 AB
7 i
8 Note
---
If you wanted a CSV output, for example, then it'd just be:
$ awk -v OFS=',' 'match($0,/^([0-9]+)_([0-9]+)(\([0-9]+\))?\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(.*)/,a) {
    for (i=1; i<=8; i++) {
        printf "%s%s", a[i], (i<8?OFS:ORS)
    }
}' file
201194,2012110634,,Appliance,130,AB,i,Some optional (Notes )
300723,2017050006,(2016111550),Device,16,AB,i,Note
Massage to suit...

AWK - do a simple subtraction and find the minimum value

I have this matrix:
{{1,4},{6,8}}
and I want to subtract the second value from the first value, like 4-1 and 8-6,
and then compare both and show the minimum value of the two, in this case: 8-6=2.
All of this using AWK in the terminal.
You seem a little confused about whether you want to subtract the first from the second or the second from the first. Also, about whether your data is in a file or a variable. However, this should get you started...
If we replace any mad braces or commas with spaces:
echo "{{1,4},{6,8}}" | awk '{gsub(/[{},]/," "); print}'
1 4 6 8
Now we can access the fields as $1 through $4 and do what you want:
echo "{{1,4},{6,8}}" | awk '{gsub(/[{},]/," "); x=$2-$1; y=$4-$3; if(x<y)print x; else print y}'
2
As a, maybe more elegant, alternative suggested by @3161993 in the comments, you could set the field separator to be one or more open or close braces or commas, like this:
awk -F '[,{}]+' '{x=$3-$2; y=$5-$4; if(x<y) print x; else print y}' <<< "{{1,4},{6,8}}"
2
And, as @EdMorton kindly pointed out, it can be made a bit more succinct with a ternary operator, like this:
awk -F '[,{}]+' '{x=$3-$2; y=$5-$4; print (x<y ? x : y)}' <<< "{{1,4},{6,8}}"

How to extract specific value using grep and awk?

I am facing a problem extracting a specific value from a .txt file using grep and awk.
I show below an excerpt from the .txt file:
"-
bravais-lattice index = 2
lattice parameter (alat) = 10.0000 a.u.
unit-cell volume = 250.0000 (a.u.)^3
number of atoms/cell = 2
number of atomic types = 1
number of electrons = 28.00
number of Kohn-Sham states= 18
kinetic-energy cutoff = 60.0000 Ry
charge density cutoff = 300.0000 Ry
convergence threshold = 1.0E-09
mixing beta = 0.7000
I also defined some variables: ELEMENT and lat.
I want to extract the "unit-cell volume" value which is equal to 250.00.
I tried the following to extract the value using grep and awk:
volume=`grep "unit-cell volume" ./latt.10/$ELEMENT.scf.latt_$lat.out | awk '{printf "%15.12f\n",$5}'`
However, when I run the bash file I always get 00.000000 as a result instead of the correct value of 250.00.
Can anyone help, please?
Thanks in advance.
awk '{printf "%15.12f\n",$5}'
You're asking awk to print out the fifth field of the line ($5).
unit-cell  volume  =  250.0000  (a.u.)^3
    1         2    3      4         5
The fifth field is (a.u.)^3, which you are then asking awk to interpret as a number via the %f format code. It's not a number, though (or actually, doesn't start with a number), and when awk is asked to treat a non-numeric string as a number, it uses 0 instead. Thus it prints 0.
Solution: use $4 instead.
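You can see that coercion in isolation (a throwaway sketch):
$ awk 'BEGIN { printf "%15.12f\n", "(a.u.)^3" }'
 0.000000000000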
By the way, you can skip invoking grep by using awk itself to select the line, e.g.
awk '/^ unit-cell/ {...}'
The /^ unit-cell/ is a regular expression that matches "unit-cell" (with a leading space) at the beginning of the line. Adjust as necessary if you have other lines that start with unit-cell which you don't want to select.
You never need grep when you're using awk since awk can do anything useful that grep can do. It sounds like this is all you need:
$ awk -F'=' '/unit-cell volume/{printf "%.2f\n",$2}' file
250.00
The above works because when FS is = that means $2 is <spaces>250.0000 (a.u.)^3, and when awk is asked to convert a string to a number it strips off leading spaces and anything after the numeric part, so that leaves 250.0000 to be converted to a number by %.2f.
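To see that conversion on its own (a sketch):
$ awk 'BEGIN { printf "%.2f\n", "   250.0000 (a.u.)^3" }'
250.00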
In the script you posted $5 was failing because the 5th space-separated field in:
    $1         $2     $3      $4         $5
<unit-cell> <volume> <=> <250.0000> <(a.u.)^3>
is (a.u.)^3 - you could have just added print $5 to see that.
Since you are processing key-value pairs where the key can have a variable amount of space in it, you need to tune that field number ($4, $5 etc.) separately for each record you want to process, unless you set the field separator (FS) appropriately to FS=" *= *". Then the key will always be in $1 and the value in $2.
Then use split to split the value and unit parts from each other.
Also, you can lose that grep by defining in awk a pattern (or condition, /unit-cell volume/) for that print action:
$ awk 'BEGIN{FS=" *= *"} /unit-cell volume/{split($2,a," +");print a[1]}' file
250.0000
Explained:
$ awk '
BEGIN { FS=" *= *" }     # set appropriate field separator
/unit-cell volume/ {     # pattern or condition
    split($2,a," +")     # split value part to value and possible unit parts
    print a[1]           # output value part
}' file

How can I append a string at the end of a line and keep doing it after a specific number of lines?

I want to add the symbol " >>" at the end of the 1st line, then the 5th line, and so on: 1, 5, 9, 13, 17, ... I was searching the web and went through the article below, but I'm unable to achieve it. Please help.
How can I append text below the specific number of lines in sed?
retentive
good at remembering
The child was very sharp, and her memory was extremely retentive.
— Rowlands, Effie Adelaide
unconscionable
greatly exceeding bounds of reason or moderation
For generations in the New York City public schools, this has become the norm with devastating consequences rooted in unconscionable levels of student failure.
— New York Times (Nov 4, 2011)
Output should be like-
retentive >>
good at remembering
The child was very sharp, and her memory was extremely retentive.
— Rowlands, Effie Adelaide
unconscionable >>
greatly exceeding bounds of reason or moderation
For generations in the New York City public schools, this has become the norm with devastating consequences rooted in unconscionable levels of student failure.
— New York Times (Nov 4, 2011)
You can do it with awk:
awk '{if ((NR-1) % 5) {print $0} else {print $0 " >>"}}'
We check if the line number minus 1 is a multiple of 5, and if it is we output the line followed by a >>; otherwise, we just output the line.
Note: the above code outputs the suffix every 5 lines, because that's what is needed for your example (with its blank separator lines) to work.
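A quick check of the arithmetic on numbered lines (a sketch using seq):
$ seq 8 | awk '{if ((NR-1) % 5) {print $0} else {print $0 " >>"}}'
1 >>
2
3
4
5
6 >>
7
8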
You can do it multiple ways. sed is kind of odd when it comes to selecting lines, but it's doable. E.g.:
sed:
sed -i -e 's/$/ >>/;n;n;n;n' file
You can do it also as perl one-liner:
perl -pi.bak -e 's/(.*)/$1 >>/ if not (( $. - 1 ) % 5)' file
You're thinking about this wrong. You should append to the end of the first line of every paragraph; don't worry about how many lines there happen to be in any given paragraph. That's just:
$ awk -v RS= -v ORS='\n\n' '{sub(/\n/," >>&")}1' file
retentive >>
good at remembering
The child was very sharp, and her memory was extremely retentive.
— Rowlands, Effie Adelaide
unconscionable >>
greatly exceeding bounds of reason or moderation
For generations in the New York City public schools, this has become the norm with devastating consequences rooted in unconscionable levels of student failure.
— New York Times (Nov 4, 2011)
This might work for you (GNU sed):
sed -i '1~4s/$/ >>/' file
There's a couple more:
$ awk 'NR%5==1 && sub(/$/," >>") || 1' foo
$ awk '$0=$0(NR%5==1?" >>":"")' foo
Here is a non-numeric way in Awk. This works if we have an Awk that supports the RS variable being more than one character long. We break the data into records based on the blank line separation: "\n\n". Inside these records, we break fields on newlines. Thus $1 is the word, $2 is the definition, $3 is the quote and $4 is the source:
awk 'BEGIN {OFS=FS="\n";ORS=RS="\n\n"} $1=$1" >>"'
We use the same output separators as input separators. Our only pattern/action step is then to edit $1 so that it has >> on it. The default action is { print }, which is what we want: print each record. So we can omit it.
Shorter: Initialize RS from catenation of FS.
awk 'BEGIN {OFS=FS="\n";ORS=RS=FS FS} $1=$1" >>"'
This is nicely expressive: it says that the format uses two consecutive field separators to separate records.
What if we use a flag, initially reset, which is reset on every blank line? This solution still doesn't depend on a hard-coded number, just the blank line separation. The rule fires on the first line, because C evaluates to zero, and then after every blank line, because we reset C to zero:
awk 'C++?1:$0=$0" >>";!NF{C=0}'
Shorter version of the accepted Awk solution:
awk '(NR-1)%5?1:$0=$0" >>"'
We can use a ternary conditional expression cond ? then : else as a pattern, leaving the action empty so that it defaults to {print}, which of course means {print $0}. If the zero-based record number is not congruent to 0, modulo 5, then we produce 1 to trigger the print action. Otherwise we evaluate $0=$0" >>" to add the required suffix to the record. The result of this expression is also a Boolean true, which triggers the print action.
Shave off one more character: we don't have to subtract 1 from NR and then test for congruence to zero. Basically whenever the 1-based record number is congruent to 1, modulo 5, then we want to add the >> suffix:
awk 'NR%5==1?$0=$0" >>":1'
Though we have to add ==1 (+3 chars), we win because we can drop two parentheses and -1 (-4 chars).
We can do better (with some assumptions): Instead of editing $0, what we can do is create a second field which contains >> by assigning to the parameter $2. The implicit print action will print this, offset by a space:
awk 'NR%5==1?$2=">>":1'
But this only works when the suffixed line contains one word. If any of the words in this dictionary are compound nouns (separated by space, not hyphenated), this fails. If we try to repair this flaw, we are sadly brought back to the same length:
awk 'NR%5==1?$++NF=">>":1'
Slight variation on the approach: Instead of trying to tack >> onto the record or last field, why don't we conditionally install >>\n as ORS, the output record separator?
awk 'ORS=(NR%5==1?" >>\n":"\n")'
Not the tersest, but worth mentioning. It shows how we can dynamically play with some of these variables from record to record.
A different way of testing NR == 1 (mod 5): namely, a regexp!
awk 'NR~/[16]$/?$0=$0" >>":1'
Again, not the tersest, but it seems worth mentioning. We can treat NR as a string representing the integer as decimal digits. If it ends with 1 or 6 then it is congruent to 1, mod 5. Obviously this is not easy to modify for other moduli, not to mention computationally disgusting.

Linux/bash parse text output, select fields, ignore nulls in one field only

I've done my requisite 20 searches but I can't quite find an example that includes the 'ignore null' part of what I'm trying to do. Working on a Linux-ish system that uses bash and has grep/awk/sed/perl and the other usual suspects. Output from a job is in the format:
Some Field I Dont Care About = Nothing Interesting
Another Field That Doesnt Matter = 216
Name = The_Job_name
More Useless Stuff = Blah blah
Estimated Completion = Aug 13, 2015 13:30 EDT
Even Yet Still More Nonsense = Exciting value
...
Jobs not currently active will have a null value for estimated completion time. The field names are long, and multi-word names contain spaces as shown. The delimiter is always "=" and it always appears in the same column, padded with spaces on either side. There may be dozens of jobs listed, and there are about 36 fields for each job. At any given time there are only one or two active, and those are the ones I care about.
I am trying to get the value for the 'Name' field and the value of the 'Estimated Completion' field on a single line for each record that is currently active, hence ignoring nulls, like this:
Job_04 Aug 13, 2015 13:30 EDT
Job_21 Aug 09, 2015 10:10 EDT
...
I started with <command> | grep '^Name\|^Estimated', which got me the lines I care about.
I have moved on to awk -F"=" '/^Name|^Estimated/ {print $2}', which gets the values by themselves. This is where it starts to go awry - I tried to join every other line using awk -F"=" '/^Name|^Estimated/ {print $2}' | sed 'N;s/\n/ /', but the output from that is seriously wonky. Add to this that I am not sure whether I should be looking for blank lines and eliminating them (and the preceding line) to get rid of the nulls at this point, or if it is better to read the values into variables and printf them.
I'm not a Perl guy, but if that would be a better approach I'd be happy to shift gears and go in that direction. Any thoughts or suggestions appreciated. Thanks!
Some Field I Dont Care About = Nothing Interesting
Another Field That Doesnt Matter = 216
Name = Job_4119
More Useless Stuff = Blah blah
Estimated Completion =
Even Yet Still More Nonsense = Exciting value
...
I can't comment, not enough reputation...
But I think something like this will work in your print command:
{printf "%s,",$2;next}{print;}
Or use the paste command?
paste -s -d",\n" file
You can do something like:
awk -F"=" '/^Name/ {name=$2} /^Estimated/ { print name, $2}' file
if they always come in the same order: name first, estimate next.
You can then add a NULL check for the last field and not print the line if it matches, like:
awk -F"=" '/^Name/ {name=$2} /^Estimated/ { if($2 != "") {print name, $2}}' file
$ awk -F'\\s*=\\s*' '{a[$1]=$2} /^Estimated/ && $2{print a["Name"], $2}' file
The_Job_name Aug 13, 2015 13:30 EDT
Replace \\s with [[:space:]] if you aren't using gawk, i.e.:
$ awk -F'[[:space:]]*=[[:space:]]*' '{a[$1]=$2} /^Estimated/ && $2{print a["Name"], $2}' file
and if your awk doesn't even support character classes then GET A NEW AWK, but in the meantime:
$ awk -F'[ \t]*=[ \t]*' '{a[$1]=$2} /^Estimated/ && $2{print a["Name"], $2}' file
