Modify multiple column values based on specific column values in Linux

I have a file that has the following data
"col1","col2","col3","col4","col5","col6"
"CACR","0","SO2","50","6","2.0"
"FF","15","CO2","20","4","3"
"CACR","25","NOx","30","10",
"CACR","50","CO","40","5","0"
I want to find every line that contains CACR and then divide the col2, col4, col5, and col6 values by the col6 value on the same line (skipping the division if col6 is 0 or blank), using the Linux terminal. My output should look like the following:
"col1","col2","col3","col4","col5","col6"
"CACR","0","SO2","25","3","1"
"CACR","25","NOX","30","10",
"CACR","50","CO","40","5","0"
I am trying to use grep and awk
grep CACR file.csv | awk -F "," '$6 != 0; $6 == "" {$2 = $2/$6; $4= $4/$6; $5 = $5/$6; $6 = $6/$6}1'
But I couldn't get the desired output.

As outlined in a comment, the primary problem is that the double quotes around the fields mean that when a field is interpreted as a number (e.g. with a division), the value is zero. I think you need to write Awk functions to remove and reinstate the double quotes. With those in place, it's mostly a SMOP — a Simple Matter of Programming.
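A quick way to see the effect described above. This is only a minimal sketch on a made-up one-line input (the function name quoted_demo is mine, not part of the answer):

```shell
# Show how awk evaluates a quoted field in a numeric context.
# The sample line is made up for illustration.
quoted_demo() {
    printf '%s\n' '"50","2.0"' |
    awk -F, '{
        print $1 / 2            # $1 still carries its quotes, so it converts to 0
        gsub(/"/, "", $1)       # strip the double quotes from the field
        print $1 / 2            # now $1 is 50, so this prints 25
    }'
}
quoted_demo
```

The first print shows 0 and the second shows 25, which is why the quotes have to be removed before dividing and put back afterwards.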
Here's my version. It could be written more succinctly (fewer newlines, fewer spaces), but I prefer clarity over brevity.
script.awk
function strip_quotes(s)
{
gsub(/"/, "", s)
return s
}
function add_quotes(s)
{
return sprintf("\"%s\"", s)
}
BEGIN { FS = "," }
NR == 1 { print; next }
$0 !~ /CACR/ { next }
$6 == "" || $6 == "\"0\"" { print; next }
{
div = strip_quotes($6)
printf("%s,%s,%s,%s,%s,%s\n",
$1,
add_quotes(strip_quotes($2) / div),
$3,
add_quotes(strip_quotes($4) / div),
add_quotes(strip_quotes($5) / div),
add_quotes(strip_quotes($6) / div))
}
data
"col1","col2","col3","col4","col5","col6"
"CACR","0","SO2","50","6","2.0"
"FF","15","CO2","20","4","3"
"CACR","25","NOx","30","10",
"CACR","50","CO","40","5","0"
Output
$ awk -f script.awk data
"col1","col2","col3","col4","col5","col6"
"CACR","0","SO2","25","3","1"
"CACR","25","NOx","30","10",
"CACR","50","CO","40","5","0"
$
Variant script3.awk
This code sets the output field separator OFS to comma too, and resets the values of $2, $4, $5, and $6 before using print to print the modified $0.
function strip_quotes(s)
{
gsub(/"/, "", s)
return s
}
function add_quotes(s)
{
return sprintf("\"%s\"", s)
}
BEGIN { FS = ","; OFS = "," }
NR == 1 { print; next }
$0 !~ /CACR/ { next }
$6 == "" || $6 == "\"0\"" { print; next }
{
div = strip_quotes($6)
$2 = add_quotes(strip_quotes($2) / div)
$4 = add_quotes(strip_quotes($4) / div)
$5 = add_quotes(strip_quotes($5) / div)
$6 = add_quotes(strip_quotes($6) / div)
print
}
Data validation
Both versions of the script could be more stringent, validating that there are 5 or 6 columns (rejecting lines with more columns or fewer columns or complaining about them). The check for the headings could insist on 6 columns. It might be sensible to check that div is a non-zero number. It might be sensible to check that each of $2, $4, $5 and $6 is a number. The divisors (column 6) in the sample data are convenient; you might need to do some work if the number is not as simple, such as 7, where the result could have many decimal places. You'd need to decide how such numbers should be formatted (the default might be OK, or it might not). It might also be worth checking that the data in each field matches the regex /^"[^"]*"$/ (so each value is surrounded by double quotes).
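Some of these checks could be sketched as a separate pass over the data. The function name validate_csv, the messages, and the exact rules below are assumptions for illustration, not part of the scripts above:

```shell
# Sketch of the stricter checks suggested above; the messages and the
# exact rules are assumptions, not part of the original scripts.
validate_csv() {
    awk -F, '
        NF < 5 || NF > 6 {
            printf "line %d: expected 5 or 6 fields, got %d\n", NR, NF
            bad = 1
            next
        }
        {
            for (i = 1; i <= NF; i++) {
                # An empty field is tolerated; anything else must be quoted.
                if ($i != "" && $i !~ /^"[^"]*"$/) {
                    printf "line %d: field %d is not quoted: %s\n", NR, i, $i
                    bad = 1
                }
            }
        }
        END { exit bad + 0 }    # non-zero exit status if any line was rejected
    ' "$@"
}

# Example: the second line has an unquoted 25 and is reported.
printf '%s\n' '"CACR","0","SO2","50","6","2.0"' '"CACR",25,"NOx","30","10",' |
    validate_csv || echo "input rejected"
```

Whether a failed check should stop processing or only warn depends on how much you trust the input.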
Trailing white space
The rule $6 == "" || $6 == "\"0\"" { print; next } does not handle trailing white space very well. It can be revised to:
$6 ~ /^[[:space:]]*$/ || $6 == "\"0\"" { print; next }
That recognizes trailing white space and treats it as zero. It would be possible, and probably sensible, to add:
if (div + 0 == 0) { print; next }
after the assignment to div (the + 0 forces a numeric comparison, so a value such as 0.0 is caught as well as 0). If the value found is zero, there is a problem. It would be possible to complain too, producing an error message that diagnoses 'malformed data'.
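Putting the two refinements together might look like the sketch below. It is the variant script wrapped in a hypothetical shell function (divide_cacr); the div + 0 form is my assumption about how to force a numeric comparison so that a divisor such as "0.0" is also skipped:

```shell
# Sketch of the variant script with both refinements applied.
# Assumption: "div + 0" is used so that the zero test is numeric.
divide_cacr() {
    awk '
        function strip_quotes(s) { gsub(/"/, "", s); return s }
        function add_quotes(s)   { return sprintf("\"%s\"", s) }
        BEGIN { FS = OFS = "," }
        NR == 1 { print; next }                    # header line
        $0 !~ /CACR/ { next }                      # keep CACR lines only
        $6 ~ /^[[:space:]]*$/ { print; next }      # blank divisor: leave as is
        {
            div = strip_quotes($6)
            if (div + 0 == 0) { print; next }      # zero or non-numeric divisor
            $2 = add_quotes(strip_quotes($2) / div)
            $4 = add_quotes(strip_quotes($4) / div)
            $5 = add_quotes(strip_quotes($5) / div)
            $6 = add_quotes(strip_quotes($6) / div)
            print
        }
    ' "$@"
}

# Example on two lines of the sample data.
printf '%s\n' '"col1","col2","col3","col4","col5","col6"' \
    '"CACR","0","SO2","50","6","2.0"' | divide_cacr
```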
How much of the validation and error prevention is worthwhile depends on how unruly your input data is. If you're dealing with human-generated data, you have to deal with humans' propensity for varying the rules and giving erratic or erroneous data to programs, and you probably need to handle (diagnose) unexpected inputs. If you're dealing with machine-generated data, it is typically more uniform, and you can get away with less validation work.
Most solutions that depend on regexes have to strike a balance between working sufficiently well and breaking on erratic inputs. The more erratic the inputs, the harder it is to devise bomb-proof (fool-proof) regexes. As the saying goes, "if you make something idiot-proof, someone will just make a better idiot".

Related

How to print output in table format in shell script

I am new to shell scripting. I want to distribute all the data of a file in a table format and redirect the output into another file.
I have below input file File.txt
Fruit_label:1 Fruit_name:Apple
Color:Red
Type: S
No.of seeds:10
Color of seeds :brown
Fruit_label:2 fruit_name:Banana
Color:Yellow
Type:NS
I want it to looks like this
Fruit_label| Fruit_name |color| Type |no.of seeds |Color of seeds
1 | apple | red | S | 10 | brown
2 | banana| yellow | NS
I want to read all the data line by line from the text file, make a header like fruit_label, fruit_name, color, type, no.of seeds, color of seeds, and then print all the assigned values in rows. The data is different for different fruits; for example, a banana doesn't have seeds, so I want to keep that value blank in its row.
Can anyone help me here.
Another approach is "Decorate & Process". What is "Decorate & Process"? To Decorate is to take the text you have and add another separator to it to make field-splitting easier. In your case the fields can contain embedded whitespace, and the ':' separator between the field-names and values has inconsistent whitespace around it, which makes the input a nightmare to process simply.
So instead of worrying about what the separator is, think about "What should the fields be?" and then add a new separator (Decorate) between the fields and then Process with awk.
Here sed is used to Decorate your input with '|' as separators (a second call eliminates the '|' after the last field), and then a simpler awk script is used to split() the fields on ':' to obtain the field-name and field-value. The field-value is simply printed and the field-names are stored in an array. When a duplicate field-name is found, it is used as a "seen" marker to designate the change between records, e.g.
sed -E 's/([^:]+:[[:blank:]]*[^[:blank:]]+)[[:blank:]]*/\1|/g' file |
sed 's/|$//' |
awk '
BEGIN { FS = "|" }
{
for (i=1; i<=NF; i++) {
if (split ($i, parts, /[[:blank:]]*:[[:blank:]]*/)) {
if (! n || parts[1] in fldnames) {
printf "%s %s", n ? "\n" : "", parts[2]
delete fldnames
n = 1
}
else
printf " | %s", parts[2]
fldnames[parts[1]]++
}
}
}
END { print "" }
'
Example Output
With your input in file you would have:
1 | Apple | Red | S | 10 | brown
2 | Banana | Yellow | NS
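To see what the Decorate step alone produces, you can run just the two sed stages. A small sketch (the function name decorate is mine; it assumes a sed that supports -E, as the command above already does):

```shell
# Run only the two sed stages from the pipeline above to see the decorated text.
decorate() {
    sed -E 's/([^:]+:[[:blank:]]*[^[:blank:]]+)[[:blank:]]*/\1|/g' "$@" |
    sed 's/|$//'
}

# Example on the first record of the sample input.
printf '%s\n' 'Fruit_label:1 Fruit_name:Apple' 'Color:Red' 'Type: S' \
    'No.of seeds:10' 'Color of seeds :brown' | decorate
```

Only the first line changes visibly (it becomes Fruit_label:1|Fruit_name:Apple), because it is the only line holding two fields; the other lines hold one field each, and their trailing '|' is removed again by the second sed.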
You will also see "Decorate-Sort-Undecorate" used to sort data on a new, not-yet-existing column of values: you "Decorate" your data with a new last field, sort on that field, and then "Undecorate" to remove the additional field when sorting is done. This allows sorting by data that may be the sum (or combination) of any two columns, etc.
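A minimal sketch of Decorate-Sort-Undecorate, assuming made-up two-column numeric data separated by single spaces and a hypothetical function name sort_by_sum:

```shell
# Decorate-Sort-Undecorate: order lines by the sum of their two columns.
sort_by_sum() {
    awk '{ print $0, $1 + $2 }' "$@" |      # decorate: append the sum as a last field
    sort -k3,3n |                           # sort numerically on the decorated field
    awk '{ sub(/ [^ ]+$/, ""); print }'     # undecorate: remove the last field again
}

# Example: sums are 10, 6 and 3, so the lines come out as 3 0, 5 1, 1 9.
printf '%s\n' '1 9' '5 1' '3 0' | sort_by_sum
```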
Here is my solution. It is a new year gift; usually you have to demonstrate what you have tried so far and we help you, not do it for you.
Disclaimer: some guru will probably come up with a simpler awk version, but this works.
File script.awk
# Remove space prefix
function ltrim(s) { sub(/^[ \t\r\n]+/, "", s); return s }
# Remove space suffix
function rtrim(s) { sub(/[ \t\r\n]+$/, "", s); return s }
# Remove both suffix and prefix spaces
function trim(s) { return rtrim(ltrim(s)); }
# Initialise or reset a fruit array
function array_init() {
for (i = 0; i <= 6; ++i) {
fruit[i] = ""
}
}
# Print the content of the fruit
function array_print() {
# Track whether anything was printed, so that the final newline
# is not printed for an empty array.
printedsomething = 0
for (i = 0; i <= 6; ++i) {
# Do not print if the content is empty
if (fruit[i] != "") {
printedsomething = 1
if (i == 1) {
# The first field must be further split, to remove "Fruit_name"
# Split on the space
split(fruit[i], temparr, / /)
printf "%s", trim(temparr[1])
}
else {
printf " | %s", trim(fruit[i])
}
}
}
if ( printedsomething == 1 ) {
print ""
}
}
BEGIN {
FS = ":"
print "Fruit_label| Fruit_name |color| Type |no.of seeds |Color of seeds"
array_init()
}
/Fruit_label/ {
array_print()
array_init()
fruit[1] = $2
fruit[2] = $3
}
/Color:/ {
fruit[3] = $2
}
/Type/ {
fruit[4] = $2
}
/No.of seeds/ {
fruit[5] = $2
}
/Color of seeds/ {
fruit[6] = $2
}
END { array_print() }
To execute, call awk -f script.awk File.txt
awk processes a file line by line. So the idea is to store fruit information into an array.
Every time the line "Fruit_label:....." is found, print the current fruit and start a new one.
Since each line is read in sequence, you tell awk what to do with each line, based on a pattern.
The patterns are what are enclosed between // characters at the beginning of each section of code.
Difficulty: the first line holds two pieces of information for each fruit, and because I split the lines on the : character, the Fruit_label value also includes the text "Fruit_name".
I.e. the first line is split like this: $1 = Fruit_label, $2 = 1 Fruit_name, $3 = Apple
This is why the array_print() function is so complicated.
The trim functions are there to remove spaces.
For example, for the Apple, Type: S split on the : yields " S" with a leading space.
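You can check the split described above with a short one-off; this is only an illustration (split_demo is a hypothetical name) and is not needed by the script:

```shell
# Print each field of the first input line when splitting on ":".
split_demo() {
    printf '%s\n' 'Fruit_label:1 Fruit_name:Apple' |
    awk -F: '{ for (i = 1; i <= NF; i++) printf "$%d = [%s]\n", i, $i }'
}
split_demo
```

The brackets make it visible that $2 holds both the label value and the text Fruit_name.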
If it meets your requirements, please see https://stackoverflow.com/help/someone-answers to accept it.

file manipulation with command line tools on linux

I want to transform a file from this format
1;a;34;34;a
1;a;34;23;d
1;a;34;23;v
1;a;4;2;r
1;a;3;2;d
2;f;54;3;f
2;f;34;23;e
2;f;23;5;d
2;f;23;23;g
3;t;26;67;t
3;t;34;45;v
3;t;25;34;h
3;t;34;23;u
3;t;34;34;z
to this format
1;a;34;34;a;34;23;d;34;23;v;4;2;r;3;2;d
2;f;54;3;f;34;23;e;23;5;d;23;23;g;;;
3;t;26;67;t;34;45;v;25;34;h;34;23;u;34;34;z
These are CSV files, so it should work with awk or sed ... but I have failed till now. If the first value is the same, I want to add the last three values to the first line. And this will run till the last entry in the file.
Here some code in awk, but it does not work:
#!/usr/bin/awk -f
BEGIN{ FS = " *; *"}
{ ORS = "\;" }
{
x = $1
print $0
}
{ if (x == $1)
print $3, $4, $5
else
print "\n"
}
END{
print "\n"
}
$ cat tst.awk
BEGIN { FS=OFS=";" }
{ curr = $1 FS $2 }
curr == prev {
sub(/^[^;]*;[^;]*/,"")
printf "%s", $0
next
}
{
printf "%s%s", (NR>1?ORS:""), $0
prev = curr
}
END { print "" }
$ awk -f tst.awk file
1;a;34;34;a;34;23;d;34;23;v;4;2;r;3;2;d
2;f;54;3;f;34;23;e;23;5;d;23;23;g
3;t;26;67;t;34;45;v;25;34;h;34;23;u;34;34;z
If I understand you correctly that you want to build a line from fields 3-5 of all lines with the same first two fields (preceded by those two fields), then
awk -F \; 'key != $1 FS $2 { if(NR != 1) print line; key = $1 FS $2; line = key } { line = line FS $3 FS $4 FS $5 } END { print line }' filename
That is
key != $1 FS $2 { # if the key (first two fields) changed
if(NR != 1) print line; # print the line (except at the very
# beginning, to not get an empty line there)
key = $1 FS $2 # remember the new key
line = key # and start building the next line
}
{
line = line FS $3 FS $4 FS $5 # take the value fields from each line
}
END { # and at the very end,
print line # print the last line (that the block above
} # cannot handle)
You got good answers in awk. Here is one in perl:
perl -F';' -lane'
$key = join ";", @F[0..1]; # Establish your key
$seen{$key}++ or push @rec, $key; # Remember the order
push @{ $h{$key} }, @F[2..$#F] # Build your data structure
}{
$, = ";"; # Set the output list separator
print $_, @{ $h{$_} } for @rec' file # Print as per order
This is going to seem a lot more complicated than the other answers, but it's adding a few things:
It computes the maximum number of fields from all built up lines
Appends any missing fields as blanks to the end of the built up lines
The POSIX awk on a Mac doesn't maintain the order of array elements when using the for (key in array) syntax, even when the keys are numbered. To maintain the output order, you can keep track of it as I've done or pipe to sort afterwards.
Having matching numbers of fields in the output appears to be a requirement per the specified output. Without knowing what it should be, this awk script is built to load all the lines first, compute the maximum number of fields in an output line then output the lines with any adjustments in order.
#!/usr/bin/awk -f
BEGIN {FS=OFS=";"}
{
key = $1
# create an order array for the mac's version of awk
if( key != last_key ) {
order[++key_cnt] = key
last_key = key
}
val = a[key]
# build up an output line in array a for the given key
start = (val=="" ? $1 OFS $2 : val)
a[key] = start OFS $3 OFS $4 OFS $5
# count number of fields for each built up output line
nf_a[key] += 3
}
END {
# compute the max number of fields per any built up output line
for(k in nf_a) {
nf_max = (nf_a[k]>nf_max ? nf_a[k] : nf_max)
}
for(i=1; i<=key_cnt; i++) {
key = order[i]
# compute the number of blank flds necessary
nf_pad = nf_max - nf_a[key]
blank_flds = nf_pad!=0 ? sprintf( "%*s", nf_pad, OFS ) : ""
gsub( / /, OFS, blank_flds )
# output lines along with appended blank fields in order
print a[key] blank_flds
}
}
If the desired number of fields in the output lines is known ahead of time, simply appending the blank fields on key switch without all these arrays would work and make a simpler script.
I get the following output:
1;a;34;34;a;34;23;d;34;23;v;4;2;r;3;2;d
2;f;54;3;f;34;23;e;23;5;d;23;23;g;;;
3;t;26;67;t;34;45;v;25;34;h;34;23;u;34;34;z
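For the simpler case mentioned above, where the number of output fields is known ahead of time, a sketch could look like this. The function name join_and_pad and the maxnf parameter (17 for the sample data) are hypothetical, and the input is assumed to be grouped by the first field already:

```shell
# Sketch: build one line per key and pad with empty fields up to a known width.
# maxnf is a hypothetical parameter; it is not part of the script above.
join_and_pad() {
    maxnf=$1; shift
    awk -v maxnf="$maxnf" '
        function emit(    n, tmp) {
            if (line == "") return
            for (n = split(line, tmp, FS); n < maxnf; n++)
                line = line OFS                            # append one empty field
            print line
        }
        BEGIN { FS = OFS = ";" }
        $1 != key { emit(); key = $1; line = $1 OFS $2 }   # key switch
        { line = line OFS $3 OFS $4 OFS $5 }               # collect the value fields
        END { emit() }
    ' "$@"
}

# Example: pad every output line to 8 fields.
printf '%s\n' '1;a;34;34;a' '1;a;34;23;d' '2;f;54;3;f' | join_and_pad 8
```

Because each line is printed as soon as the key changes, no arrays and no second pass are needed.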

Joining text to the previous line

Currently working on Solaris
I need to search for a specific string in a text file and, if found, join it to the previous line. For example:
if logical condition
then
i = i + 1
Would become
if logical condition then
i = i + 1.
I'm sure I can do this with awk using a hold space of some sort but my awk skills are a little rusty.
Addendum: Apologies, I should have been more specific. The match is triggered by the appearance of the string "then". I have no knowledge of the contents of the previous line - it could be anything. Whatever it is I need to concatenate the "then" to it.
$ awk '{printf "%s%s", (NR>1?(/then/?FS:RS):""), $0} END{print ""}' file
if logical condition then
i = i + 1
Try this:
awk '
function dump(trailing_line) {
for (i = 0; i < length(previous) - 1; i++)
print previous[i]
if (trailing_line) {
print previous[i] trailing_line
} else {
print previous[i]
}
i = 0
delete previous
}
/then/ {
dump(" then")
next
}
{
previous[i++] = $0
}
END {
dump("")
}
' your_input_file
If you truly have no knowledge of what comes before the then, you will have to store each line into an array and then dump them out before you emit the then. Since you will also want to do this within the END clause, it's easiest to make this "dumping" action into a function.
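If only the immediately preceding line ever needs to change, holding a single line instead of an array may be enough. A sketch under that assumption (join_then is a hypothetical name; like the script above it appends a fixed " then" and matches any line containing then):

```shell
# Hold one line back; glue " then" onto it when the next line matches.
join_then() {
    awk '
        /then/ { prev = prev " then"; have = 1; next }   # glue onto the held line
        have   { print prev }                            # release the held line
               { prev = $0; have = 1 }                   # hold the current line
        END    { if (have) print prev }                  # release the last line
    ' "$@"
}

# Example with the lines from the question.
printf '%s\n' 'if logical condition' 'then' 'i = i + 1' | join_then
```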

Using awk on large txt to extract specific characters of fields

I have a large txt file ("," as delimiter) with some data and string:
2014:04:29:00:00:58:GMT: subject=BMRA.BM.T_GRIFW-1.FPN, message={SD=2014:04:29:00:00:00:GMT,SP=5,NP=3,TS=2014:04:29:01:00:00:GMT,VP=4.0,TS=2014:04:29:01:29:00:GMT,VP=4.0,TS=2014:04:29:01:30:00:GMT,VP=3.0}
2014:04:29:00:00:59:GMT: subject=BMRA.BM.T_GRIFW-2.FPN, message={SD=2014:04:29:00:00:00:GMT,SP=5,NP=2,TS=2014:04:29:01:00:00:GMT,VP=3.0,TS=2014:04:29:01:30:00:GMT,VP=3.0}
I would like to find lines that contain 'T_GRIFW' and then print the $1 field from 'subject' onwards and only the times and floats from $2 onwards. Furthermore, I want to incorporate an if statement so that if field $4 == 'NP=3', only fields $5,$6,$9,$10 are printed after the previous fields and if $4 == 'NP=2', all following fields are printed (times and floats only)
For instance, the result of the two sample lines will be:
subject=BMRA.BM.T_GRIFW-1.FPN,2014:04:29:00:00:00,5,3,2014:04:29:01:00:00,4.0,2014:04:29:01:30:00,3.0
subject=BMRA.BM.T_GRIFW-2.FPN,2014:04:29:00:00:00,5,2,2014:04:29:01:00:00,3.0,2014:04:29:01:30:00,3.0
I know this is complex and I have tried my best to be thorough in my description. The basic code I have thus far is:
awk 'BEGIN {FS=","}{OFS=","} /T_GRIFW-1.FPN/ {print $1}' tib_messages.2014-04-29
THANKS A MILLION!
Here's an awk executable file that'll create your desired output:
#!/usr/bin/awk -f
# use a more complicated FS => field numbers counted differently
BEGIN { FS="=|,"; OFS="," }
$2 ~ /T_GRIFW/ && $8=="NP" {
str="subject=" $2 OFS
# strip ":GMT" from dates and "}" from everywhere
gsub( /:GMT|[\}]/, "")
# append common fields to str with OFS
for(i=5;i<=13;i+=2) str=str $i OFS
# print the remaining fields and line separator
if($9==3) { print str $19, $21 }
else if($9==2) { print str $15, $17 }
}
Placing that in a file called awko and chmod'ing it then running awko data yields:
subject=BMRA.BM.T_GRIFW-1.FPN,2014:04:29:00:00:00,5,3,2014:04:29:01:00:00,4.0,2014:04:29:01:30:00,3.0
subject=BMRA.BM.T_GRIFW-2.FPN,2014:04:29:00:00:00,5,2,2014:04:29:01:00:00,3.0,2014:04:29:01:30:00,3.0
I've placed comments in the script, but here are some things that could be spelled out better:
Using a more complicated FS means you don't have to reparse for = to work with the field data
I "cheated" and just hard-coded subject (which now falls at the end of $1) for str
:GMT and } appeared to be the only data that needed to be forcibly removed
With this FS Dates and numbers are two apart from each other but still loop-able
In either final print call, the str already ends in an OFS, so the comma between it and next field can be skipped
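To check the field numbering that this FS produces, it can help to list the fields of one line. A small sketch (show_fields is a hypothetical helper, not part of the script):

```shell
# List field numbers and contents when splitting on "=" or ",".
show_fields() {
    awk 'BEGIN { FS = "=|," } { for (i = 1; i <= NF; i++) printf "%d\t%s\n", i, $i }' "$@"
}

# Example on the second sample line from the question.
printf '%s\n' '2014:04:29:00:00:59:GMT: subject=BMRA.BM.T_GRIFW-2.FPN, message={SD=2014:04:29:00:00:00:GMT,SP=5,NP=2,TS=2014:04:29:01:00:00:GMT,VP=3.0,TS=2014:04:29:01:30:00:GMT,VP=3.0}' |
    show_fields
```

In the listing, field 2 is the subject value, field 8 is NP and field 9 its value, which is what the script's conditions test.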
If I understand your requirements, the following will work:
BEGIN {
FS=","
OFS=","
}
/T_GRIFW/ {
split($1, subject, " ")
result = subject[2] OFS
delete arr
counter = 1
for (i = 2; i <= NF; i++) {
add = 0
if ($4 == "NP=3") {
if (i == 5 || i == 6 || i == 9 || i == 10) {
add = 1
}
}
else if ($4 == "NP=2") {
add = 1
}
if (add) {
counter = counter + 1
split($i, field, "=")
if (match(field[2], /[0-9]*\.[0-9]+|GMT/)) {
arr[counter] = field[2]
}
}
}
for (i in arr) {
gsub(/{|}/,"", arr[i]) # remove curly braces
result = result arr[i] OFS
}
print substr(result, 1, length(result)-1)
}

increment variable in awk based on two columns

I am writing an awk script that parses a CSV file, compares one column containing date, and another column containing activity type, and then prints the count of a particular activity.
The code I have written is:
NOW=$(date --date="5 days ago" +"%Y%m%d")
awk -F "," -v mydate=$NOW '{
var_1=1;
var_2=1;} {
if ( substr($8,2,8) == mydate ) {
if ( $6 == 1001 ) {
var_1++;
}
else if ( $6 == 1003 ) {
var_2++;
}
}
print var_1 var_2
}' *.csv
The output I get is
11
11
11
11
11
11
I believe the issue is something to do with the way I have defined var_1 and var_2; they are reinitialized or something.
Also I want to only print the final value of var_1 and var_2; at the moment, it gets printed with every iteration of awk.
Any advice?
You have two blocks that are executed on each line of data:
{ var_1=1; var_2=1; } which sets the variables to 1 on each pass.
{
if ( substr($8,2,8) == mydate ) {
if ( $6 == 1001 ) {
var_1++;
}
else if ( $6 == 1003 ) {
var_2++;
}
}
print var_1 var_2
} which prints the values of var_1 and var_2 as concatenated strings (hence no space between the 1 and 1).
It appears that either the substr() condition or the $6 condition is not being matched, ever.
You probably wanted BEGIN before the first block, but why you'd start at 1 rather than 0 is not obvious. If you started the counts at 0, you wouldn't need a BEGIN block. You should probably use print var_1, var_2 to separate the two values.
As for why the matches aren't matching, there's no way to say without any sample data to work on, but you could debug by printing out $8 and $6 for each line (and mydate, too; and maybe substr($8,2,8)), so you can see what is happening.
If you only want the values to print at the end, then (once you've debugged what's happening during the main action), you can place the print in an END block:
END { print var_1, var_2 }
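Putting these points together, the script might end up looking like the sketch below. The function name count_activities is hypothetical, and it assumes, as the original code does, that column 6 holds an unquoted activity code and that the date starts at the second character of column 8:

```shell
# Count activity types 1001 and 1003 on a given date (YYYYMMDD).
# Usage: count_activities DATE [file ...]
count_activities() {
    mydate=$1; shift
    awk -F, -v mydate="$mydate" '
        substr($8, 2, 8) == mydate {
            if ($6 == 1001) var_1++
            else if ($6 == 1003) var_2++
        }
        END { print var_1 + 0, var_2 + 0 }   # + 0 prints 0 rather than an empty string
    ' "$@"
}

# Example with made-up data: two 1001 rows and one 1003 row on the chosen date.
printf '%s\n' \
    'a,b,c,d,e,1001,g,"20240101 10:00"' \
    'a,b,c,d,e,1001,g,"20240101 11:00"' \
    'a,b,c,d,e,1003,g,"20240101 12:00"' \
    'a,b,c,d,e,1001,g,"20240102 09:00"' | count_activities 20240101
```

The counters start at 0 without a BEGIN block, and the totals are printed once, in END.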
