increment variable in awk based on two columns - linux

I am writing an awk script that parses a CSV file, compares one column containing a date and another column containing an activity type, and then prints the count of a particular activity.
The code I have written is:
NOW=$(date --date="5 days ago" +"%Y%m%d")
awk -F "," -v mydate=$NOW '{
    var_1=1;
    var_2=1;
} {
    if ( substr($8,2,8) == mydate ) {
        if ( $6 == 1001 ) {
            var_1++;
        }
        else if ( $6 == 1003 ) {
            var_2++;
        }
    }
    print var_1 var_2
}' *.csv
The output I get is
11
11
11
11
11
11
I believe the issue is something to do with the way I have defined var_1 and var_2; they seem to be reinitialized on every line.
I also want to print only the final values of var_1 and var_2; at the moment, they get printed on every line awk processes.
Any advice?

You have two blocks that are executed on each line of data:
{ var_1=1; var_2=1; } which sets the variables to 1 on each pass.
{
    if ( substr($8,2,8) == mydate ) {
        if ( $6 == 1001 ) {
            var_1++;
        }
        else if ( $6 == 1003 ) {
            var_2++;
        }
    }
    print var_1 var_2
} which prints the values of var_1 and var_2 as concatenated strings (hence no space between the 1 and 1).
It appears that either the substr() condition or the $6 condition is not being matched, ever.
You probably wanted BEGIN before the first block, but why you'd start at 1 rather than 0 is not obvious. If you started the counts at 0, you wouldn't need a BEGIN block. You should probably use print var_1, var_2 to separate the two values.
As for why the matches aren't matching, there's no way to say without any sample data to work on, but you could debug by printing out $8 and $6 for each line (and mydate, too; and maybe substr($8,2,8)), so you can see what is happening.
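For example, a quick diagnostic pass over the same files (a sketch reusing your own variables and field positions) could be:

awk -F "," -v mydate="$NOW" '{ print NR, $6, $8, substr($8,2,8), mydate }' *.csv | head

If substr($8,2,8) and mydate never agree in that output, you have found why the counters never move.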
If you only want the values to print at the end, then (once you've debugged what's happening during the main action), you can place the print in an END block:
END { print var_1, var_2 }
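Putting those pieces together, a corrected version of the whole script might look like this (a sketch keeping your original field positions; the +0 makes an unset counter print as 0 when nothing matched):

NOW=$(date --date="5 days ago" +"%Y%m%d")
awk -F "," -v mydate="$NOW" '
    substr($8,2,8) == mydate && $6 == 1001 { var_1++ }   # count activity 1001
    substr($8,2,8) == mydate && $6 == 1003 { var_2++ }   # count activity 1003
    END { print var_1+0, var_2+0 }                       # print once, after all input
' *.csv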


How to print output in table format in shell script

I am new to shell scripting. I want to distribute all the data of a file into a table format and redirect the output into another file.
I have the below input file, File.txt:
Fruit_label:1 Fruit_name:Apple
Color:Red
Type: S
No.of seeds:10
Color of seeds :brown
Fruit_label:2 fruit_name:Banana
Color:Yellow
Type:NS
I want it to look like this:
Fruit_label| Fruit_name |color| Type |no.of seeds |Color of seeds
1 | apple | red | S | 10 | brown
2 | banana| yellow | NS
I want to read all the data line by line from the text file, make the header (fruit_label, fruit_name, color, type, no. of seeds, color of seeds), and then print all the assigned values in rows. The data differs from fruit to fruit; for example, a banana doesn't have seeds, so I want to keep that row value blank.
Can anyone help me here?
Another approach is "Decorate & Process". What is "Decorate & Process"? To Decorate is to take the text you have and decorate it with another separator to make field-splitting easier -- in your case your fields can contain embedded whitespace along with the ':' separator between the field-names and values, and the inconsistent whitespace around ':' makes it a nightmare to process simply.
So instead of worrying about what the separator is, think about "What should the fields be?" and then add a new separator (Decorate) between the fields and then Process with awk.
Here sed is used to Decorate your input with '|' as the separator (a second call eliminates the '|' after the last field), and then a simpler awk process uses split() on ':' to obtain the field-name and field-value; the field-value is simply printed and the field-names are stored in an array. When a duplicate field-name is found, it is used as a "seen" marker to detect the change between records, e.g.
sed -E 's/([^:]+:[[:blank:]]*[^[:blank:]]+)[[:blank:]]*/\1|/g' file |
sed 's/|$//' |
awk '
    BEGIN { FS = "|" }
    {
        for (i=1; i<=NF; i++) {
            if (split($i, parts, /[[:blank:]]*:[[:blank:]]*/)) {
                if (! n || parts[1] in fldnames) {
                    printf "%s %s", n ? "\n" : "", parts[2]
                    delete fldnames
                    n = 1
                }
                else
                    printf " | %s", parts[2]
                fldnames[parts[1]]++
            }
        }
    }
    END { print "" }
'
Example Output
With your input in file you would have:
1 | Apple | Red | S | 10 | brown
2 | Banana | Yellow | NS
You will also see "Decorate-Sort-Undecorate" used to sort data on a new, non-existent column of values by "Decorating" your data with a new last field, sorting on that field, and then "Undecorating" to remove the additional field when the sorting is done. This allows sorting by data that may be the sum (or combination) of any two columns, etc.
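As a minimal sketch of Decorate-Sort-Undecorate (assuming a hypothetical three-column numeric CSV, sorted by the sum of columns 2 and 3):

awk -F',' -v OFS=',' '{ print $0, $2 + $3 }' file |   # Decorate: append the sum as a new last field
sort -t',' -k4,4n |                                   # Sort numerically on the decoration
sed 's/,[^,]*$//'                                     # Undecorate: strip the extra field again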
Here is my solution. It is a New Year's gift; usually you have to demonstrate what you have tried so far and we help you, rather than do it for you.
Disclaimer: some guru will probably come up with a simpler awk version, but this works.
File script.awk
# Remove space prefix
function ltrim(s) { sub(/^[ \t\r\n]+/, "", s); return s }
# Remove space suffix
function rtrim(s) { sub(/[ \t\r\n]+$/, "", s); return s }
# Remove both suffix and prefix spaces
function trim(s) { return rtrim(ltrim(s)); }
# Initialise or reset a fruit array
function array_init() {
    for (i = 0; i <= 6; ++i) {
        fruit[i] = ""
    }
}
# Print the content of the fruit
function array_print() {
    # Keep track of whether something was printed: if so, finish with a
    # line break, but avoid printing one for an empty array.
    printedsomething = 0
    for (i = 0; i <= 6; ++i) {
        # Do not print if the content is empty
        if (fruit[i] != "") {
            printedsomething = 1
            if (i == 1) {
                # The first field must be further split, to remove "Fruit_name"
                # Split on the space
                split(fruit[i], temparr, / /)
                printf "%s", trim(temparr[1])
            }
            else {
                printf " | %s", trim(fruit[i])
            }
        }
    }
    if ( printedsomething == 1 ) {
        print ""
    }
}
BEGIN {
    FS = ":"
    print "Fruit_label| Fruit_name |color| Type |no.of seeds |Color of seeds"
    array_init()
}
/Fruit_label/ {
    array_print()
    array_init()
    fruit[1] = $2
    fruit[2] = $3
}
/Color:/ {
    fruit[3] = $2
}
/Type/ {
    fruit[4] = $2
}
/No.of seeds/ {
    fruit[5] = $2
}
/Color of seeds/ {
    fruit[6] = $2
}
END { array_print() }
To execute, call awk -f script.awk File.txt
awk processes a file line by line, so the idea is to store the information for one fruit in an array.
Every time the line "Fruit_label:....." is found, print the current fruit and start a new one.
Since each line is read in sequence, you tell awk what to do with each line, based on a pattern.
The patterns are what are enclosed between // characters at the beginning of each section of code.
One difficulty: since the first line of each fruit contains two pieces of information and the lines are cut on the ':' character, the Fruit_label field will include "Fruit_name".
I.e.: the first line is cut like this: $1 = Fruit_label, $2 = 1 Fruit_name, $3 = Apple
This is why the array_print() function is so complicated.
The trim functions are there to remove spaces; for the Apple, for example, "Type: S" split on ':' gives " S", which trim() reduces to "S".
If it meets your requirements, please see https://stackoverflow.com/help/someone-answers to accept it.

Modify multiple columns value based on specific column values in Linux

I have a file that has the following data
"col1","col2","col3","col4","col5","col6"
"CACR","0","SO2","50","6","2.0"
"FF","15","CO2","20","4","3"
"CACR","25","NOx","30","10",
"CACR","50","CO","40","5","0"
I want to find every line that contains CACR and then divide the col2, col4, col5, and col6 values by the respective cell of col6 (skipping the division if col6 is 0 or blank) using the Linux terminal. So my output should look like the following:
"col1","col2","col3","col4","col5","col6"
"CACR","0","SO2","25","3","1"
"CACR","25","NOX","30","10",
"CACR","50","CO","40","5","0"
I am trying to use grep and awk:
grep CACR file.csv | awk -F "," '$6 != 0; $6 == "" {$2 = $2/$6; $4= $4/$6; $5 = $5/$6; $6 = $6/$6}1'
But I couldn't get the desired output.
As outlined in a comment, the primary problem is that the double quotes around the fields mean that when a field is interpreted as a number (e.g. with a division), the value is zero. I think you need to write Awk functions to remove and reinstate the double quotes. With those in place, it's mostly a SMOP — a Simple Matter of Programming.
Here's my version. It could be written more succinctly (fewer newlines, fewer spaces), but I prefer clarity over brevity.
script.awk
function strip_quotes(s)
{
    gsub(/"/, "", s)
    return s
}

function add_quotes(s)
{
    return sprintf("\"%s\"", s)
}

BEGIN { FS = "," }

NR == 1 { print; next }

$0 !~ /CACR/ { next }

$6 == "" || $6 == "\"0\"" { print; next }

{
    div = strip_quotes($6)
    printf("%s,%s,%s,%s,%s,%s\n",
           $1,
           add_quotes(strip_quotes($2) / div),
           $3,
           add_quotes(strip_quotes($4) / div),
           add_quotes(strip_quotes($5) / div),
           add_quotes(strip_quotes($6) / div))
}
data
"col1","col2","col3","col4","col5","col6"
"CACR","0","SO2","50","6","2.0"
"FF","15","CO2","20","4","3"
"CACR","25","NOx","30","10",
"CACR","50","CO","40","5","0"
Output
$ awk -f script.awk data
"col1","col2","col3","col4","col5","col6"
"CACR","0","SO2","25","3","1"
"CACR","25","NOx","30","10",
"CACR","50","CO","40","5","0"
$
Variant script3.awk
This code sets the output field separator OFS to comma too, and resets the values of $2, $4, $5, and $6 before using print to print the modified $0.
function strip_quotes(s)
{
    gsub(/"/, "", s)
    return s
}

function add_quotes(s)
{
    return sprintf("\"%s\"", s)
}

BEGIN { FS = ","; OFS = "," }

NR == 1 { print; next }

$0 !~ /CACR/ { next }

$6 == "" || $6 == "\"0\"" { print; next }

{
    div = strip_quotes($6)
    $2 = add_quotes(strip_quotes($2) / div)
    $4 = add_quotes(strip_quotes($4) / div)
    $5 = add_quotes(strip_quotes($5) / div)
    $6 = add_quotes(strip_quotes($6) / div)
    print
}
Data validation
Both versions of the script could be more stringent, validating that there are 5 or 6 columns (rejecting lines with more columns or fewer columns or complaining about them). The check for the headings could insist on 6 columns. It might be sensible to check that div is a non-zero number. It might be sensible to check that each of $2, $4, $5 and $6 is a number. The divisors (column 6) in the sample data are convenient; you might need to do some work if the number is not as simple, such as 7, where the result could have many decimal places. You'd need to decide how such numbers should be formatted (the default might be OK, or it might not). It might also be worth checking that the data in each field matches the regex /^"[^"]*"$/ (so each value is surrounded by double quotes).
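As a sketch of how a couple of those checks could be bolted on (extra rules placed ahead of the main action; these encode assumptions about your data and are not part of the scripts above):

NF < 5 || NF > 6 {
    printf "%s:%d: expected 5 or 6 fields, got %d\n", FILENAME, NR, NF > "/dev/stderr"
    next
}
$2 !~ /^"[^"]*"$/ {
    printf "%s:%d: malformed data in column 2\n", FILENAME, NR > "/dev/stderr"
    next
}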
Trailing white space
The rule $6 == "" || $6 == "\"0\"" { print; next } does not handle trailing white space very well. It can be revised to:
$6 ~ /^[[:space:]]*$/ || $6 == "\"0\"" { print; next }
That recognizes trailing white space and treats it as zero. It would be possible, and probably sensible, to add:
if (div == 0) { print; next }
after the assignment to div. If the value found is zero, there is a problem. It would be possible to complain too — to produce an error message diagnosing 'malformed data'.
How much of the validation and error prevention is worthwhile depends on how unruly your input data is. If you're dealing with human-generated data, you have to deal with humans' propensity for varying the rules and giving erratic or erroneous data to programs, and you probably need to handle (diagnose) unexpected inputs. If you're dealing with machine-generated data, it is typically more uniform, and you can get away with less validation work.
Most solutions that depend on regexes have to strike a balance between working sufficiently well and breaking on erratic inputs. The more erratic the inputs, the harder it is to devise bomb-proof (fool-proof) regexes. As the saying goes, "if you make something idiot-proof, someone will just make a better idiot".

Split file with multiple delimited entries in some columns into separate lines

I have a very large file with the following basic format, with a number of additional fields:
posA,id1,id2,posB,id3,name,(n additional fields)
1,ENST7,ENSP93,1,ENSG92,Gene1
2,ENST25;ENST76;ENST35,ENSP91;ENSP77;ENSP78,515;544;544,ENSG765,Gene2
3,ENST25;ENST76;ENST35,ENSP91;ENSP77;ENSP78,515;544;544,ENSG765,Gene2
4,ENST54;ENST93,ENSP83;ENSP36,1864;722,ENSG48,Gene3
5,ENST54;ENST93,ENSP83;ENSP36,1864;722,ENSG48,Gene3
6,ENST54;ENST93,ENSP83;ENSP36,1864;722,ENSG48,Gene3
Line one (posA=1) has a single entry for each column, and does not need to be modified. For lines with a variable number of multiple entries for some columns, for the third line (posA=2), the first entry for "id1" (ENST25) is paired with the first entry for "id2" (ENSP91) and the first entry for "posB" (515), and so on, but the columns with a single entry (eg, "posA", "id3", "name") apply to all of the paired entries in columns 2-4. Some fields in addition to columns 2-4 also rarely contain multiple entries.
I want to split the columns with multiple entries into separate lines, while retaining the data from the other columns, like so:
posA,id1,id2,posB,id3,name,(n additional fields)
1,ENST7,ENSP93,1,ENSG92,Gene1
2,ENST25,ENSP91,515,ENSG765,Gene2
2,ENST76,ENSP77,544,ENSG765,Gene2
2,ENST35,ENSP78,544,ENSG765,Gene2
3,ENST25,ENSP91,515,ENSG765,Gene2
3,ENST76,ENSP77,544,ENSG765,Gene2
3,ENST35,ENSP78,544,ENSG765,Gene2
4,ENST54,ENSP83,1864,ENSG48,Gene3
4,ENST93,ENSP36,722,ENSG48,Gene3
...
What is the best approach for this problem?
Thanks!
Taking your example as given, with at most two compound attributes per field, you can accomplish what you intend fairly easily using simple parameter expansion with substring removal, e.g.
#!/bin/bash

while IFS=, read -r p a1 a2 a3; do
    [[ $a1 =~ ';' ]] && {
        printf "%s,%s,%s,%s\n" "$p" "${a1%;*}" "${a2%;*}" "$a3"
        printf "%s,%s,%s,%s\n" "$p" "${a1#*;}" "${a2#*;}" "$a3"
    } || printf "%s,%s,%s,%s\n" "$p" "$a1" "$a2" "$a3"
done < "$1"
Where [[ $a1 =~ ';' ]] checks for a ';' in $a1 and if found then picks off the first attribute in $a1 and $a2 with ${a1%;*} and ${a2%;*}. Then for the second attribute in each, ${a1#*;} and ${a2#*;} are used.
If no ';' is contained in $a1, the attributes are printed unchanged. IFS=, ensures the parameters are word-split on ','.
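To see the two expansions in isolation (a quick sketch at the prompt):

a1='c;d'
echo "${a1%;*}"   # -> c  (shortest suffix matching ';*' removed)
echo "${a1#*;}"   # -> d  (shortest prefix matching '*;' removed)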
(Note: you should add validation that the filename is valid, etc., to your final script. You can also use echo instead of printf if you like.)
Example Use/Output
$ splitattrib.sh file
Pos,Attribute1,Attribute2,Attribute3
1,a,b,-
2,c,e,+
2,d,f,+
The best approach is to break it into three parts.
You have 3 line patterns. After splitting on both ',' and ';', one has 6 fields, another has 12, and the last has 9.
6 fields => 1 output line
12 fields => 3 output lines
9 fields => 2 output lines
Your 6-field lines should not be modified, which leaves the 12- and 9-field cases. You can separate them with if, else if and else, like:
if( columns == 6 ){...}
else if( columns == 12 ){...}
else {...}
And here is a Perl one-liner solution:
perl -a -F",|;" -lne '$s=scalar @F;if($s==6){print join ",",@F}elsif($s==12){print join",",@F[0,1,4,7,-2,-1];print join",",@F[0,2,5,8,-2,-1];print join",",@F[0,3,6,9,-2,-1];}else{print join",",@F[0,1,3,5,-2,-1];print join",",@F[0,2,4,6,-2,-1]}' file
and for your input, the output is:
1,ENST7,ENSP93,1,ENSG92,Gene1
2,ENST25,ENSP91,515,ENSG765,Gene2
2,ENST76,ENSP77,544,ENSG765,Gene2
2,ENST35,ENSP78,544,ENSG765,Gene2
3,ENST25,ENSP91,515,ENSG765,Gene2
3,ENST76,ENSP77,544,ENSG765,Gene2
3,ENST35,ENSP78,544,ENSG765,Gene2
4,ENST54,ENSP83,1864,ENSG48,Gene3
4,ENST93,ENSP36,722,ENSG48,Gene3
5,ENST54,ENSP83,1864,ENSG48,Gene3
5,ENST93,ENSP36,722,ENSG48,Gene3
6,ENST54,ENSP83,1864,ENSG48,Gene3
6,ENST93,ENSP36,722,ENSG48,Gene3
Assuming your multiple entries are separated with a semicolon ';', here is an awk version.
BEGIN {
    FS = "[,]"
}
{
    if ($0 ~ /^[0-9].*/) {
        end_split_field = 0
        for (f=2; f<=NF; f++) {
            if ($f ~ /.*;.*/) {
                end_split_field = f
            }
        }
        if (end_split_field == 0) {
            print $0
        } else {
            for (f=2; f<=end_split_field; f++) {
                n = split($f, a, ";")   # split and return the number of entries
                for (i=1; i<=n; i++) {
                    b[f, i] = a[i]
                }
            }
            # n entries per compound field => n output lines
            for (i=1; i<=n; i++) {
                printf $1","
                for (j=2; j<=end_split_field; j++) {
                    printf b[j, i]","
                }
                for (k=end_split_field+1; k<NF; k++) {
                    printf $k","
                }
                printf $NF"\n"
            }
        }
    } else {
        print $0
    }
}
Save the content above as input.awk. Example input and output:
$ cat input
Pos,Attribute1,Attribute2,Attribute3
1,a,b,-
2,c;d,e;f,+
3,g;h;i,j;k;l,-
We then get the split output:
$ awk -f input.awk input
Pos,Attribute1,Attribute2,Attribute3
1,a,b,-
2,c,e,+
2,d,f,+
3,g,j,-
3,h,k,-
3,i,l,-

Using awk on large txt to extract specific characters of fields

I have a large txt file ("," as delimiter) with some data and string:
2014:04:29:00:00:58:GMT: subject=BMRA.BM.T_GRIFW-1.FPN, message={SD=2014:04:29:00:00:00:GMT,SP=5,NP=3,TS=2014:04:29:01:00:00:GMT,VP=4.0,TS=2014:04:29:01:29:00:GMT,VP=4.0,TS=2014:04:29:01:30:00:GMT,VP=3.0}
2014:04:29:00:00:59:GMT: subject=BMRA.BM.T_GRIFW-2.FPN, message={SD=2014:04:29:00:00:00:GMT,SP=5,NP=2,TS=2014:04:29:01:00:00:GMT,VP=3.0,TS=2014:04:29:01:30:00:GMT,VP=3.0}
I would like to find lines that contain 'T_GRIFW' and then print the $1 field from 'subject' onwards, and only the times and floats from $2 onwards. Furthermore, I want to incorporate an if statement so that if field $4 == 'NP=3', only fields $5,$6,$9,$10 are printed after the previous fields, and if $4 == 'NP=2', all following fields are printed (times and floats only).
For instance, the result of the two sample lines will be:
subject=BMRA.BM.T_GRIFW-1.FPN,2014:04:29:00:00:00,5,3,2014:04:29:01:00:00,4.0,2014:04:29:01:30:00,3.0
subject=BMRA.BM.T_GRIFW-2.FPN,2014:04:29:00:00:00,5,2,2014:04:29:01:00:00,3.0,2014:04:29:01:30:00,3.0
I know this is complex and I have tried my best to be thorough in my description. The basic code I have thus far is:
awk 'BEGIN {FS=","}{OFS=","} /T_GRIFW-1.FPN/ {print $1}' tib_messages.2014-04-29
THANKS A MILLION!
Here's an awk executable file that'll create your desired output:
#!/usr/bin/awk -f
# use a more complicated FS => field numbers counted differently
BEGIN { FS="=|,"; OFS="," }
$2 ~ /T_GRIFW/ && $8=="NP" {
    str = "subject=" $2 OFS
    # strip ":GMT" from dates and "}" from everywhere
    gsub( /:GMT|[\}]/, "" )
    # append common fields to str with OFS
    for (i=5; i<=13; i+=2) str = str $i OFS
    # print the remaining fields and line separator
    if ($9==3) { print str $19, $21 }
    else if ($9==2) { print str $15, $17 }
}
Placing that in a file called awko, chmod'ing it, and then running awko data yields:
subject=BMRA.BM.T_GRIFW-1.FPN,2014:04:29:00:00:00,5,3,2014:04:29:01:00:00,4.0,2014:04:29:01:30:00,3.0
subject=BMRA.BM.T_GRIFW-2.FPN,2014:04:29:00:00:00,5,2,2014:04:29:01:00:00,3.0,2014:04:29:01:30:00,3.0
I've placed comments in the script, but here are some things that could be spelled out better:
Using a more complicated FS means you don't have to reparse for = to work with the field data
I "cheated" and just hard-coded subject (which now falls at the end of $1) for str
:GMT and } appeared to be the only data that needed to be forcibly removed
With this FS, dates and numbers are two fields apart from each other, but still loop-able
In either final print call, the str already ends in an OFS, so the comma between it and the next field can be skipped
If I understand your requirements, the following will work:
BEGIN {
    FS = ","
    OFS = ","
}
/T_GRIFW/ {
    split($1, subject, " ")
    result = subject[2] OFS
    delete arr
    counter = 1
    for (i = 2; i <= NF; i++) {
        add = 0
        if ($4 == "NP=3") {
            if (i == 5 || i == 6 || i == 9 || i == 10) {
                add = 1
            }
        }
        else if ($4 == "NP=2") {
            add = 1
        }
        if (add) {
            counter = counter + 1
            split($i, field, "=")
            if (match(field[2], "[0-9]*\\.[0-9]+|GMT")) {
                arr[counter] = field[2]
            }
        }
    }
    # iterate numerically so the values stay in input order
    # (for (i in arr) does not guarantee any order)
    for (i = 2; i <= counter; i++) {
        if (i in arr) {
            gsub(/{|}/, "", arr[i])   # remove curly braces
            result = result arr[i] OFS
        }
    }
    print substr(result, 1, length(result) - 1)
}
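Assuming the rules above are saved in a file (a hypothetical name, script2.awk), they can be run with awk -f:

awk -f script2.awk tib_messages.2014-04-29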

One-Liners Script to (1) Delete a line and the following one, (2) Change a value in the file

In Linux I have a file with the following structure:
[a/b/c]
value = 1
[a/b/d]
value = 0
[a/b/e]
value = 0
The keys '[x/y/z]' are unique.
Before and after the equals sign I have \t tabulation (please post the solution even with no \t).
(e.g. value\t\t=\t0)
How can I do the following operations?
(1) Delete a line '[x/y/z]' and the following one 'value = 0', given the key
(2) Change a value of a key in the file:
FROM
[a/b/c]
value = 0
TO
[a/b/c]
value = 1
Here's the delete operation as a nice, easy-to-read awk one-liner:
awk -v delkey="[a/b/d]" '{ if ($1 == delkey) { i=2 }; if ( i>0 ) { i-- } else { print }}' file.txt
Using that example, see if you can figure out the value change operation. Awk is very rewarding to learn - well worth the effort.
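Following the same flag pattern, a sketch of the value-change operation (with a hypothetical key and new value) might be:

awk -v key="[a/b/c]" -v newval="1" '
    setval { sub(/[0-9]+[[:space:]]*$/, newval); setval = 0 }   # rewrite the value on the line after the key
    $1 == key { setval = 1 }                                    # the next line holds the value for this key
    { print }
' file.txt

Because sub() only replaces the trailing digits, the \t tabulation around the equals sign is preserved.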
