Joining text to the previous line - text

Currently working on Solaris
I need to search for a specific string in a text file and, if found, join it to the previous line. For example:
if logical condition
then
i = i + 1
Would become
if logical condition then
i = i + 1.
I'm sure I can do this with awk using a hold space of some sort but my awk skills are a little rusty.
Addendum: Apologies, I should have been more specific. The match is triggered by the appearance of the string "then". I have no knowledge of the contents of the previous line - it could be anything. Whatever it is I need to concatenate the "then" to it.

$ awk '{printf "%s%s", (NR>1?(/then/?FS:RS):""), $0} END{print ""}' file
if logical condition then
i = i + 1

Try this:
awk '
# Print the buffered lines, appending trailing_line to the last one.
# n counts the buffered lines; j is a local loop variable.
function dump(trailing_line,    j) {
    for (j = 0; j < n - 1; j++)
        print previous[j]
    if (n > 0)
        print previous[n - 1] trailing_line
    else if (trailing_line != "")
        print trailing_line
    n = 0
}
/then/ {
    dump(" then")
    next
}
{
    previous[n++] = $0
}
END {
    dump("")
}
' your_input_file
If you truly have no knowledge of what comes before the then, you will have to store each line into an array and then dump them out before you emit the then. Since you will also want to do this within the END clause, it's easiest to make this "dumping" action into a function.
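If only the immediately preceding line matters, holding a single line may be enough. The following is a minimal sketch under that assumption, not the method used above: the shell function name join_then is made up for illustration, and a line counts as a match only when it consists of the word then, optionally padded with blanks.

```shell
# Hypothetical helper: join each "then" line onto the line before it.
join_then() {
    awk '
    # A line consisting only of "then" (optionally padded with blanks).
    /^[ \t]*then[ \t]*$/ && have {
        held = held " then"     # append to the held line
        next
    }
    {
        if (have) print held    # flush the previous line
        held = $0               # hold the current line
        have = 1
    }
    END { if (have) print held }
    ' "$@"
}

printf 'if logical condition\nthen\ni = i + 1\n' | join_then
```

Because only one line is held at a time, memory use stays constant regardless of file size.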

Related

How to print output in table format in shell script

I am new to shell scripting. I want to distribute all the data of a file in a table format and redirect the output into another file.
I have below input file File.txt
Fruit_label:1 Fruit_name:Apple
Color:Red
Type: S
No.of seeds:10
Color of seeds :brown
Fruit_label:2 fruit_name:Banana
Color:Yellow
Type:NS
I want it to look like this:
Fruit_label| Fruit_name |color| Type |no.of seeds |Color of seeds
1 | apple | red | S | 10 | brown
2 | banana| yellow | NS
I want to read all the data line by line from the text file, build a header like fruit_label, fruit_name, color, type, no.of seeds, color of seeds, and then print all the assigned values in rows. The data differs between fruits; for example, a banana doesn't have seeds, so I want to keep those values in its row blank.
Can anyone help me here?
Another approach is "Decorate & Process". To Decorate is to take the text you have and add another separator to it so that field-splitting becomes easier. In your case the fields can contain embedded whitespace as well as the ':' separator between the field-names and values, and the inconsistent whitespace around ':' makes the input a nightmare to process simply.
So instead of worrying about what the separator is, think about "What should the fields be?" and then add a new separator (Decorate) between the fields and then Process with awk.
Here sed is used to Decorate your input with '|' as separators (a second call eliminates the '|' after the last field) and then a simpler awk process is used to split() the fields on ':' to obtain the field-name and field-value. The field-value is simply printed and the field-names are stored in an array. When a duplicate field-name is found, it is used as the signal that a new record has started, e.g.
sed -E 's/([^:]+:[[:blank:]]*[^[:blank:]]+)[[:blank:]]*/\1|/g' file |
sed 's/|$//' |
awk '
BEGIN { FS = "|" }
{
for (i=1; i<=NF; i++) {
if (split ($i, parts, /[[:blank:]]*:[[:blank:]]*/)) {
if (! n || parts[1] in fldnames) {
printf "%s %s", n ? "\n" : "", parts[2]
delete fldnames
n = 1
}
else
printf " | %s", parts[2]
fldnames[parts[1]]++
}
}
}
END { print "" }
'
Example Output
With your input in file you would have:
1 | Apple | Red | S | 10 | brown
2 | Banana | Yellow | NS
You will also see "Decorate-Sort-Undecorate" used to sort data on a new column of values that does not exist in the input: you "Decorate" your data with a new last field, sort on that field, and then "Undecorate" to remove the additional field when sorting is done. This allows sorting by data that may be the sum (or combination) of any two columns, etc.
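As a rough illustration of Decorate-Sort-Undecorate, the sketch below sorts rows by the sum of their second and third columns. The function name sort_by_sum and the sample rows are invented for the example, and the key is prepended rather than appended here only because cut can then strip it with a fixed field range.

```shell
# Hypothetical helper: sort rows by the sum of columns 2 and 3.
# Step 1 (decorate): awk prepends the computed key to each row.
# Step 2 (sort): sort orders the rows numerically on that key.
# Step 3 (undecorate): cut drops the key again.
sort_by_sum() {
    awk '{ print $2 + $3, $0 }' "$@" |
    sort -n -k1,1 |
    cut -d" " -f2-
}

printf 'a 5 5\nb 1 2\nc 3 1\n' | sort_by_sum
```

The rows come back in their original form, ordered by a value that never appears in the output.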
Here is my solution. It is a New Year gift; usually you have to demonstrate what you have tried so far and we help you, rather than doing it for you.
Disclaimer: some guru will probably come up with a simpler awk version, but this works.
File script.awk
# Remove space prefix
function ltrim(s) { sub(/^[ \t\r\n]+/, "", s); return s }
# Remove space suffix
function rtrim(s) { sub(/[ \t\r\n]+$/, "", s); return s }
# Remove both suffix and prefix spaces
function trim(s) { return rtrim(ltrim(s)); }
# Initialise or reset a fruit array
function array_init() {
for (i = 0; i <= 6; ++i) {
fruit[i] = ""
}
}
# Print the content of the fruit
function array_print() {
# Track whether anything was printed, so that the terminating newline
# is only emitted for a non-empty array.
printedsomething = 0
for (i = 0; i <= 6; ++i) {
# Do not print if the content is empty
if (fruit[i] != "") {
printedsomething = 1
if (i == 1) {
# The first field must be further split, to remove "Fruit_name"
# Split on the space
split(fruit[i], temparr, / /)
printf "%s", trim(temparr[1])
}
else {
printf " | %s", trim(fruit[i])
}
}
}
if ( printedsomething == 1 ) {
print ""
}
}
BEGIN {
FS = ":"
print "Fruit_label| Fruit_name |color| Type |no.of seeds |Color of seeds"
array_init()
}
/Fruit_label/ {
array_print()
array_init()
fruit[1] = $2
fruit[2] = $3
}
/Color:/ {
fruit[3] = $2
}
/Type/ {
fruit[4] = $2
}
/No.of seeds/ {
fruit[5] = $2
}
/Color of seeds/ {
fruit[6] = $2
}
END { array_print() }
To execute, call awk -f script.awk File.txt
awk processes a file line per line. So the idea is to store fruit information into an array.
Every time the line "Fruit_label:....." is found, print the current fruit and start a new one.
Since each line is read in sequence, you tell awk what to do with each line, based on a pattern.
The patterns are what are enclosed between // characters at the beginning of each section of code.
Difficulty: since the first line contains two pieces of information for every fruit, and I cut the lines on the : character, the Fruit_label value will include "Fruit_name".
I.e.: the first line is cut like this: $1 = Fruit_label, $2 = 1 Fruit_name, $3 = Apple
This is why the array_print() function is so complicated.
Trim functions are there to remove spaces.
For example, for the Apple, Type: S when split on the : results in " S" (with a leading space), which trim() reduces to "S".
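To see the splitting described above, a small sketch like the following can help. The function name show_fields is made up for illustration; it prints each field between square brackets so that leading spaces are visible.

```shell
# Hypothetical helper: show how FS=":" splits each input line.
show_fields() {
    awk -F: '{
        for (i = 1; i <= NF; i++)
            printf "$%d=[%s]\n", i, $i   # brackets expose stray spaces
    }' "$@"
}

printf 'Fruit_label:1 Fruit_name:Apple\nType: S\n' | show_fields
```

For the first line the second field comes out as "1 Fruit_name", which is why array_print() has to split it again on the space.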
If it meets your requirements, please see https://stackoverflow.com/help/someone-answers to accept it.

Modify multiple columns value based on specific column values in Linux

I have a file that has the following data
"col1","col2","col3","col4","col5","col6"
"CACR","0","SO2","50","6","2.0"
"FF","15","CO2","20","4","3"
"CACR","25","NOx","30","10",
"CACR","50","CO","40","5","0"
I want to find every line that contains CACR and then divide col2, col4, col5, and col6 values by the respective cells of col6 (ignore the divide calculation if col6 has 0 or blank cells) using Linux terminal. So, my output looks like following:
"col1","col2","col3","col4","col5","col6"
"CACR","0","SO2","25","3","1"
"CACR","25","NOX","30","10",
"CACR","50","CO","40","5","0"
I am trying to use grep and awk
grep CACR file.csv | awk -F "," '$6 != 0; $6 == "" {$2 = $2/$6; $4= $4/$6; $5 = $5/$6; $6 = $6/$6}1'
But couldn't get any desired output.
As outlined in a comment, the primary problem is that the double quotes around the fields mean that when a field is interpreted as a number (e.g. with a division), the value is zero. I think you need to write Awk functions to remove and reinstate the double quotes. With those in place, it's mostly a SMOP — a Simple Matter of Programming.
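The effect of the quotes can be checked with a short sketch. The function name quote_demo is invented for the example; it prints the numeric value of the first field as it stands, followed by its value once the quotes are removed.

```shell
# Hypothetical helper: show a quoted field before and after stripping quotes.
quote_demo() {
    awk -F, '{
        raw = $1 + 0          # the leading quote makes the numeric value 0
        s = $1
        gsub(/"/, "", s)      # remove the double quotes
        print raw, s + 0
    }' "$@"
}

printf '"50","6"\n' | quote_demo
```

For the sample line this prints 0 followed by 50, which is why any arithmetic has to happen on the stripped value.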
Here's my version. It could be written more succinctly (fewer newlines, fewer spaces), but I prefer clarity over brevity.
script.awk
function strip_quotes(s)
{
gsub(/"/, "", s)
return s
}
function add_quotes(s)
{
return sprintf("\"%s\"", s)
}
BEGIN { FS = "," }
NR == 1 { print; next }
$0 !~ /CACR/ { next }
$6 == "" || $6 == "\"0\"" { print; next }
{
div = strip_quotes($6)
printf("%s,%s,%s,%s,%s,%s\n",
$1,
add_quotes(strip_quotes($2) / div),
$3,
add_quotes(strip_quotes($4) / div),
add_quotes(strip_quotes($5) / div),
add_quotes(strip_quotes($6) / div))
}
data
"col1","col2","col3","col4","col5","col6"
"CACR","0","SO2","50","6","2.0"
"FF","15","CO2","20","4","3"
"CACR","25","NOx","30","10",
"CACR","50","CO","40","5","0"
Output
$ awk -f script.awk data
"col1","col2","col3","col4","col5","col6"
"CACR","0","SO2","25","3","1"
"CACR","25","NOx","30","10",
"CACR","50","CO","40","5","0"
$
Variant script3.awk
This code sets the output field separator OFS to comma too, and resets the values of $2, $4, $5, and $6 before using print to print the modified $0.
function strip_quotes(s)
{
gsub(/"/, "", s)
return s
}
function add_quotes(s)
{
return sprintf("\"%s\"", s)
}
BEGIN { FS = ","; OFS = "," }
NR == 1 { print; next }
$0 !~ /CACR/ { next }
$6 == "" || $6 == "\"0\"" { print; next }
{
div = strip_quotes($6)
$2 = add_quotes(strip_quotes($2) / div)
$4 = add_quotes(strip_quotes($4) / div)
$5 = add_quotes(strip_quotes($5) / div)
$6 = add_quotes(strip_quotes($6) / div)
print
}
Data validation
Both versions of the script could be more stringent, validating that there are 5 or 6 columns (rejecting lines with more columns or fewer columns or complaining about them). The check for the headings could insist on 6 columns. It might be sensible to check that div is a non-zero number. It might be sensible to check that each of $2, $4, $5 and $6 is a number. The divisors (column 6) in the sample data are convenient; you might need to do some work if the number is not as simple, such as 7, where the result could have many decimal places. You'd need to decide how such numbers should be formatted (the default might be OK, or it might not). It might also be worth checking that the data in each field matches the regex /^"[^"]*"$/ (so each value is surrounded by double quotes).
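Some of these checks could be sketched as a separate pass over the data. The function name check_csv and the message wording are made up here, and the sketch assumes that exactly 6 comma-separated fields are expected and that only the last one may be empty, as in the sample data.

```shell
# Hypothetical helper: report lines that do not look like 6 quoted fields.
check_csv() {
    awk -F, '
    NF != 6 {
        printf "line %d: expected 6 fields, got %d\n", NR, NF
        next
    }
    {
        for (i = 1; i <= NF; i++) {
            # An empty last field is tolerated, as in the sample data.
            if (i == 6 && $i == "") continue
            if ($i !~ /^"[^"]*"$/)
                printf "line %d: field %d is not quoted: %s\n", NR, i, $i
        }
    }' "$@"
}

printf '"CACR","25","NOx","30","10",\n"FF",15,"CO2","20","4","3"\n"a","b"\n' | check_csv
```

Well-formed lines produce no output, so an empty result means the file passed these particular checks.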
Trailing white space
The rule $6 == "" || $6 == "\"0\"" { print; next } does not handle trailing white space very well. It can be revised to:
$6 ~ /^[[:space:]]*$/ || $6 == "\"0\"" { print; next }
That recognizes trailing white space and treats it as zero. It would be possible, and probably sensible, to add:
if (div == 0) { print; next }
after the assignment to div. If the value found is zero, there is a problem. It would be possible to complain too — to produce an error message diagnosing 'malformed data'.
How much of the validation and error prevention is worthwhile depends on how unruly your input data is. If you're dealing with human-generated data, you have to deal with humans' propensity for varying the rules and giving erratic or erroneous data to programs, and you probably need to handle (diagnose) unexpected inputs. If you're dealing with machine-generated data, it is typically more uniform, and you can get away with less validation work.
Most solutions that depend on regexes have to strike a balance between working sufficiently well and breaking on erratic inputs. The more erratic the inputs, the harder it is to devise bomb-proof (fool-proof) regexes. As the saying goes, "if you make something idiot-proof, someone will just make a better idiot".

Using awk on large txt to extract specific characters of fields

I have a large txt file ("," as delimiter) with some data and string:
2014:04:29:00:00:58:GMT: subject=BMRA.BM.T_GRIFW-1.FPN, message={SD=2014:04:29:00:00:00:GMT,SP=5,NP=3,TS=2014:04:29:01:00:00:GMT,VP=4.0,TS=2014:04:29:01:29:00:GMT,VP=4.0,TS=2014:04:29:01:30:00:GMT,VP=3.0}
2014:04:29:00:00:59:GMT: subject=BMRA.BM.T_GRIFW-2.FPN, message={SD=2014:04:29:00:00:00:GMT,SP=5,NP=2,TS=2014:04:29:01:00:00:GMT,VP=3.0,TS=2014:04:29:01:30:00:GMT,VP=3.0}
I would like to find lines that contain 'T_GRIFW' and then print the $1 field from 'subject' onwards and only the times and floats from $2 onwards. Furthermore, I want to incorporate an if statement so that if field $4 == 'NP=3', only fields $5,$6,$9,$10 are printed after the previous fields and if $4 == 'NP=2', all following fields are printed (times and floats only)
For instance, the result of the two sample lines will be:
subject=BMRA.BM.T_GRIFW-1.FPN,2014:04:29:00:00:00,5,3,2014:04:29:01:00:00,4.0,2014:04:29:01:30:00,3.0
subject=BMRA.BM.T_GRIFW-2.FPN,2014:04:29:00:00:00,5,2,2014:04:29:01:00:00,3.0,2014:04:29:01:30:00,3.0
I know this is complex and I have tried my best to be thorough in my description. The basic code I have thus far is:
awk 'BEGIN {FS=","}{OFS=","} /T_GRIFW-1.FPN/ {print $1}' tib_messages.2014-04-29
THANKS A MILLION!
Here's an awk executable file that'll create your desired output:
#!/usr/bin/awk -f
# use a more complicated FS => field numbers counted differently
BEGIN { FS="=|,"; OFS="," }
$2 ~ /T_GRIFW/ && $8=="NP" {
str="subject=" $2 OFS
# strip ":GMT" from dates and "}" from everywhere
gsub( /:GMT|[\}]/, "")
# append common fields to str with OFS
for(i=5;i<=13;i+=2) str=str $i OFS
# print the remaining fields and line separator
if($9==3) { print str $19, $21 }
else if($9==2) { print str $15, $17 }
}
Placing that in a file called awko and chmod'ing it then running awko data yields:
subject=BMRA.BM.T_GRIFW-1.FPN,2014:04:29:00:00:00,5,3,2014:04:29:01:00:00,4.0,2014:04:29:01:30:00,3.0
subject=BMRA.BM.T_GRIFW-2.FPN,2014:04:29:00:00:00,5,2,2014:04:29:01:00:00,3.0,2014:04:29:01:30:00,3.0
I've placed comments in the script, but here are some things that could be spelled out better:
Using a more complicated FS means you don't have to reparse for = to work with the field data
I "cheated" and just hard-coded subject (which now falls at the end of $1) for str
:GMT and } appeared to be the only data that needed to be forcibly removed
With this FS Dates and numbers are two apart from each other but still loop-able
In either final print call, the str already ends in an OFS, so the comma between it and next field can be skipped
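To check which field number a value lands in under this FS, a small sketch such as the following can be used. The function name number_fields is invented, and the sample line is a shortened, made-up stand-in for the real messages (S1, D0 and D1 replace the long names and timestamps).

```shell
# Hypothetical helper: list every field with its index under FS="=|,".
number_fields() {
    awk 'BEGIN { FS = "=|," }
    { for (i = 1; i <= NF; i++) printf "%d\t%s\n", i, $i }' "$@"
}

printf 'x: subject=S1, message={SD=D0:GMT,SP=5,NP=3,TS=D1:GMT,VP=4.0}\n' | number_fields
```

In the listing, the subject value is field 2, NP and its value are fields 8 and 9, and the dates and numbers sit at the odd positions from 5 onwards, which is what the loop in the script relies on.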
If I understand your requirements, the following will work:
BEGIN {
FS=","
OFS=","
}
/T_GRIFW/ {
split($1, subject, " ")
result = subject[2] OFS
delete arr
counter = 1
for (i = 2; i <= NF; i++) {
add = 0
if ($4 == "NP=3") {
if (i == 5 || i == 6 || i == 9 || i == 10) {
add = 1
}
}
else if ($4 == "NP=2") {
add = 1
}
if (add) {
counter = counter + 1
split($i, field, "=")
if (field[2] ~ /[0-9]*\.[0-9]+|GMT/) {
arr[counter] = field[2]
}
}
}
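# for (i in arr) has no guaranteed order, so walk the indices numerically
for (i = 2; i <= counter; i++) {
if (i in arr) {
gsub(/[{}]/, "", arr[i]) # remove curly braces
result = result arr[i] OFS
}
}
# substr() positions start at 1; drop the trailing OFS
print substr(result, 1, length(result)-1)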
}

Parse columns with awk

I am new at AWK programming and I was wondering how to filter the following text:
Goedel - Declarative language for AI, based on many-sorted logic. Strongly
typed, polymorphic, declarative, with a module system. Supports bignums
and sets. "The Goedel Programming Language", P. M. Hill et al, MIT Press
1994, ISBN 0-262-08229-2. Goedel 1.4 - partial implementation in SICStus
Prolog 2.1.
ftp://ftp.cs.bris.ac.uk/goedel
info: goedel#compsci.bristol.ac.uk
Just to print this:
Goedel
I have used the following sentence but it just does not work as I wished:
awk -F " - " "/ - /{ print $1 }"
It shows the following:
Goedel
1994, ISBN 0-262-08229-2. Goedel 1.4
Could somebody tell me what I have to modify so I can get what I want?
Thanks in advance
awk 'BEGIN { RS = "" } { print $1 }' your_file.txt
This splits the input into paragraphs at empty lines, splits each paragraph into words on the default separator (whitespace), and finally prints the first word ($1) of every paragraph.
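A quick way to try this paragraph mode is sketched below. The function name first_words is made up, and the second entry in the sample input is invented purely to show that each blank-line-separated entry yields one word.

```shell
# Hypothetical helper: print the first word of every blank-line-separated entry.
first_words() {
    awk 'BEGIN { RS = "" } { print $1 }' "$@"
}

printf 'Goedel - Declarative language\ntyped, polymorphic\n\nProlog - Logic language\nmore text\n' | first_words
```

With RS set to the empty string, newlines inside an entry also act as field separators, so $1 is still the first word of the entry's first line.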
this one-liner could work for your requirement:
awk -F ' - ' 'NF>1{print $1;exit}'
awk -F ' - ' '{ if (FNR % 4 != 1) next; print $1; }'
If the format is exactly the same as below, then the code above should work:
1 Author - ...
2 Year ...
3 URL
4 Extra info ...
5 Author - ...
6..N etc.
If there is a blank line between entries, you can set RS to a null string and $1 will be the author as long as the value for -F (the FS variable in an awk script) is the same. This has the advantage that if you don't have "info: ..." or a URL, you can still distinguish between entries, assuming it is not "Author - ...{newline}Year ...{newline}{newline}info: ...{newline}{newline}Author - ..." (you can't have an empty line between parts of an entry if an empty line is what separates entries.) For example:
# A blank line is what separates each entry.
BEGIN { RS = ""; }
{ print $1; }
If you have an awk that supports it, you can make RS a multiple character string if necessary (e.g. RS = "\n--\n" for entries separated by "--" on a line by itself). If you need a regex or simply don't have an awk that supports multiple character record separators, you're forced to use something like the following:
BEGIN { found_sep = 1; }
{ if (found_sep) { print $1; found_sep = 0; } }
# Entry separator is "--\n"
/^--$/ { found_sep = 1; }
More sample input will be required for something more complicated.

Implement tail with awk

I am struggling with this awk code which should emulate the tail command
num=$1;
{
vect[NR]=$0;
}
END{
for(i=NR-num;i<=NR;i++)
print vect[$i]
}
So what I'm trying to achieve here is a tail command emulated by awk.
For example consider cat somefile | awk -f tail.awk 10
should print the last 10 lines of a text file, any suggestions?
All of these answers store the entire source file. That's a horrible idea and will break on larger files.
Here's a quick way to store only the number of lines to be outputted (note that the more efficient tail will always be faster because it doesn't read the entire source file!):
awk -vt=10 '{o[NR%t]=$0}END{i=(NR<t?0:NR);do print o[++i%t];while(i%t!=NR%t)}'
more legibly (and with less code golf):
awk -v tail=10 '
{
output[NR % tail] = $0
}
END {
if(NR < tail) {
i = 0
} else {
i = NR
}
do {
i = (i + 1) % tail;
print output[i]
} while (i != NR % tail)
}'
Explanation of legible code:
This uses the modulo operator to store only the desired number of items (the tail variable). As each line is parsed, it is stored on top of older array values (so line 11 gets stored in output[1]).
The END stanza sets an increment variable i to either zero (if we've got fewer than the desired number of lines) or else the number of lines, which tells us where to start recalling the saved lines. Then we print the saved lines in order. The loop ends when we've returned to that first value (after we've printed it).
You can replace the if/else stanza (or the ternary clause in my golfed example) with just i = NR if you don't care about getting blank lines to fill the requested number (echo "foo" |awk -vt=10 … would have nine blank lines before the line with "foo").
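The same modulo storage can also be read back with a plain counting loop. This is a variant sketch rather than the code above: the wrapper name awk_tail and the variable start are made up, and it assumes the requested count is a positive integer passed as the first argument.

```shell
# Hypothetical wrapper around the ring-buffer approach: keep the last n lines.
awk_tail() {
    awk -v tail="$1" '
    { output[NR % tail] = $0 }          # overwrite the oldest slot
    END {
        # Start at the oldest stored line, but never before line 1.
        start = (NR > tail) ? NR - tail + 1 : 1
        for (i = start; i <= NR; i++)
            print output[i % tail]
    }'
}

printf '%s\n' 1 2 3 4 5 6 7 8 9 10 11 12 | awk_tail 10
```

Counting from start to NR prints nothing for empty input and never emits padding lines when the input is shorter than the requested count.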
for(i=NR-num;i<=NR;i++)
print vect[$i]
In awk, $i refers to field number i of the current record, not to the variable i. Use just plain i:
for(i=NR-num;i<=NR;i++)
print vect[i]
The full code that worked for me is:
#!/usr/bin/awk -f
BEGIN{
num=ARGV[1];
# Make that arg empty so awk doesn't interpret it as a file name.
ARGV[1] = "";
}
{
vect[NR]=$0;
}
END{
for(i=NR-num+1;i<=NR;i++)
print vect[i]
}
You should probably add some code to the END to handle the case when NR < num.
You need to add -v num=10 to the awk commandline to set the value of num. And start at NR-num+1 in your final loop, otherwise you'll end up with num+1 lines of output.
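Put together, those two changes might look like the sketch below. The wrapper name tail_all and the variable start are made up for illustration, and the guard for short inputs is an addition that the advice above does not mention.

```shell
# Hypothetical wrapper: store every line, then print the last num of them.
tail_all() {
    awk -v num="$1" '
    { vect[NR] = $0 }
    END {
        start = NR - num + 1
        if (start < 1) start = 1        # fewer than num lines in the input
        for (i = start; i <= NR; i++)
            print vect[i]
    }'
}

printf '%s\n' one two three four five | tail_all 2
```

This version still keeps the whole input in memory, so it is only suitable for files that fit comfortably in RAM.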
This might work for you:
awk '{a=a b $0;b=RS;if(NR<=v)next;a=substr(a,index(a,RS)+1)}END{print a}' v=10
