Maths on multiple fields on separate lines using awk - linux

I've been doing some maths on a 3-field x 2-line file like this:
3216.01 2724.81 1708.25
1762.48 617.436 1650.79
My question is: how do I refer to the first field on the first line and, in the same calculation, refer to the first field on the second line?
And just for completeness: I'm planning on taking $1 (line 1) and subtracting $1 (line 2), then squaring the result, doing the same for the other columns, and finally summing the values.

This one-liner meets your requirement:
awk 'NR==1{for(i=1;i<=NF;i++)a[i]=$i}
NR==2{for(i=1;i<=NF;i++)s+=(a[i]-$i)^2; printf "sum: %.3f",s}' file
result:
sum: 6557076.288
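For the sample data that is (3216.01-1762.48)^2 + (2724.81-617.436)^2 + (1708.25-1650.79)^2 = 1453.53^2 + 2107.374^2 + 57.46^2 ≈ 6557076.288, which matches.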
Note
this should work with a dynamic number of columns, but exactly two lines
the output format is %.3f; you can change it if you like
the code could be shortened, since there are two similar for-loop structures
EDIT
As suggested by EdMorton, the above code could be written as:
awk 'NR==1{split($0,a)}
NR==2{for(i=1;i<=NF;i++)s+=(a[i]-$i)^2; printf "sum: %.3f\n",s}' file
Very good suggestion, I didn't think of split()... thanks, Ed!

I would normally store it in a temporary variable:
awk 'NR>1 {print $1-a} {a=$1}' inputfile
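Extending that idea to every field (a sketch, not from the original answer: it stores the previous line's fields and prints the sum of squared differences for each consecutive pair of lines, so it also copes with more than two lines):
awk '{ if (NR > 1) {
         s = 0
         for (i = 1; i <= NF; i++) s += (prev[i] - $i)^2  # squared difference per column
         printf "sum: %.3f\n", s
       }
       for (i = 1; i <= NF; i++) prev[i] = $i             # remember this line for the next one
     }' inputfile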

Related

insert column with same row content to csv in cli

I have a CSV to which I need to add a new column at the end, putting a certain string into that newly added column for every row.
Example csv:
os,num1,alpha1
Unix,10,A
Linux,30,B
Solaris,40,C
Fedora,20,D
Ubuntu,50,E
I tried using an awk command and did not get the expected result. I am not sure whether my field indexing or column counting is right.
awk -F'[[:null:]]' '$2 && !$1{ $4="NA" }1'
Expected result is:
os,num1,alpha1,code
Unix,10,A,NA
Linux,30,B,NA
Solaris,40,C,NA
Fedora,20,D,NA
Ubuntu,50,E,NA
You can use sed:
sed 's/$/,NA/' db1.csv > db2.csv
then edit the first line containing the column titles.
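If you'd rather not edit the header by hand afterwards, a single sed call can treat line 1 separately (a sketch along the same lines, keeping the same file names):
sed '1s/$/,code/; 2,$s/$/,NA/' db1.csv > db2.csv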
I'm not quite sure how you came up with that awk statement of yours, or why you'd think that your file has NUL-terminated lines or that [[:null:]] is a valid character class ...
The following, however, will do your bidding:
awk 'NR==1{print $0",code"}; NR>1{print $0",NA"}' example.csv
os,num1,alpha1,code
Unix,10,A,NA
Linux,30,B,NA
Solaris,40,C,NA
Fedora,20,D,NA
Ubuntu,50,E,NA

Awk script to put value in column basis on another column value

I am trying to use the script below to replace column values.
But the script is huge, with around 33,000 such lines,
so when I run it I get the error "Argument list too long".
Please let me know another way to do it.
if($33="100000000"){$36="EA"}
if($33="100000001"){$36="EA"}
if($33="100000002"){$36="EA"}
if($33="100000003"){$36="EA"}
if($33="100000004"){$36="EA"}
if($33="100000005"){$36="EA"}
if($33="100000006"){$36="EA"}
if($33="100000007"){$36="EA"}
if($33="100000008"){$36="EA"}
if($33="100000009"){$36="EA"}
if($33="100000010"){$36="EA"}
if($33="100000011"){$36="EA"}
if($33="100000012"){$36="EA"}
if($33="100000013"){$36="EA"}
if($33="100000014"){$36="EA"}
if($33="100000015"){$36="EA"}
if($33="100000016"){$36="EA"}
if($33="100000017"){$36="EA"}
if($33="100000018"){$36="EA"}
if($33="100000019"){$36="EA"}
if($33="100000020"){$36="EA"}
sample input file
SourceIifier|SourleName|GntCode|Dision|Suvision|ProfitCe1|Profie2|Plade|Retuiod|SuppliN|DocType|Suppe|Docummber|Docte|Originer|OrigDate|CRDST|LineNumber|CustoN|UINorComposition|OriginaN|Custoame|Custoe|BillTe|Shite|POS|PortCode|ShippingBillNumber|ShippingBillDate|FOB|ExportDuty|HSNorSAC|ProductCode|ProductDescription|Categorduct|UnitOement|Quantity|Taxabue|Integratede|Integratount|Centraate|CentralTt|StaURate|StateUTTaxAmount|CessRateAdvalorem|CessAmountAdvalorem|CessRateSpecific|CessAmountSpecific|Invoalue|ReverseChargeFlag|TCSFlag|eComGSTIN|ITCFlag|ReasonForCreditDebitNote|AccountingVoucmber|Accountinate|Userdefinedfield1|Userdefinedfield2|Userdefinedfield3|Additionalfield1|Additionalfield2|Additlfield3|Additionalfield4|Additionalfield5
SAP|SAP_OSR_INV|||||||date+%m%Y|08AAACT2T|IN|EXPWT|262881626|02.02.2018||||10||||TVVAHALI|1151040011|8|8|8||||||9984|EVD0|EVDCOCOaterial|||0|8.47|0|0|9|0.76|9|0.76|||||||||||1301312397||ZEVD|1210||||||0
SAP|SAP_OSR_INV|||||||date+%m%Y|08AAACZT|IV|EXPWT|2627|02.02.2018||||10||||TVVHALI|1151040011|8|8|8||||||9984|EVD0|EVDCOAMaterial|||0|8.47|0|0|9|0.76|9|0.76|||||||||||130139||ZEVD|1210||||||0
SAP|SAP_OSR_INV|||||||date+%m%Y|08AAAZT|NV|AN|2628|02.02.2018||||20||||TVHVAISHALI|1151040011|8|8|8||||||9984|EVD0|EVDCOCOCDMAMaterial|||0|8.47|0|0|9|0.76|9|0.76|||||||||||13014||ZEVD|1210||||||0
My code :
awk -F"|" -v OFS="|" '{
if($33="100000000"){$36="EA"}
if($33="100000001"){$36="EA"}
if($33="100000002"){$36="EA"}
if($33="100000003"){$36="EA"}
if($33="100000004"){$36="EA"}
if($33="100000005"){$36="EA"}
if($33="100000006"){$36="EA"}
if($33="100000007"){$36="EA"}
if($33="100000008"){$36="EA"}
if($33="100000009"){$36="EA"}
if($33="100000010"){$36="EA"}1' inputfile > outputfile
The code above is just a sample; the actual script has around 33,000 such rows.
Below is a sample of the awk script file:
BEGIN {
FS="|";
OFS="|";
}
{
if($33="100000000"){$36="EA"}
if($33="100000001"){$36="EA"}
if($33="100000002"){$36="EA"}
if($33="100000003"){$36="EA"}
if($33="100000004"){$36="EA"}
if($33="100000005"){$36="EA"}1 inputfile > outputfile
and I called it like this:
awk -f script.awk
Below is the error from calling the awk script:
awk: fpostp.awk:33445: if($36=="M") {$36="MTR"}} TFinaloutputp7_6_3_d_OYMNC_w.csv > TFinaloutputt_w36.csv
awk: fpostp.awk:33445: ^ syntax error
awk: fpostp.awk:33445: if($36=="M") {$36="MTR"}} TFinaloutputp7_6_3_d_OYMNC_w.csv > TFinaloutputt_w36.csv
awk: fpostp.awk:33445: ^ syntax error
Can't I redirect output to some other file when executing with awk -f script.awk?
Programming is not like that; if writing it would get boring, there is bound to be another way.
awk -F"|" -v OFS="|" '$33>=100 && $33<200{$36="EA";print} $33>=200 && $33<300{$36="FB";print}' inputfile > outputfile
First, awk is a pattern-matching language: on rows where the pattern outside the curly braces matches, it does what is inside the curly braces.
There is no need for the if syntax, as it is inherent.
The patterns can be compound, and awk knows what numbers are without being told (and does math arbitrarily well).
I shortened the values in $33 and made up what $36 becomes and where;
but in general, make one statement per new value of $36, each covering a range of $33.
If that is not your goal, the question will need some refining.
Edit:
Maybe you are setting $36 to a constant based on an arbitrary condition involving
$33 which only you know, and there are lots of them ... in a file somewhere.
(I am pretending you have the list isolated in a file named filter.list.)
So maybe something like:
awk -F'|' -v OFS='|' 'FNR==NR{filter[$1]++; next} $33 in filter{$36="EA"} 1' filter.list inputfile > outputfile
FNR is the record number within the current file and NR is the record number across all input files;
they are only equal while reading the first file,
so it is used here to treat the first file differently from the second.
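For illustration, filter.list would simply hold the $33 values from the original if-chain, one per line:
100000000
100000001
100000002
100000003
and so on for all of them; only the first field of each line is used to build the filter array.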

Get list of all duplicates based on first column within large text/csv file in linux/ubuntu

I am trying to extract all the duplicates based on the first column/index of my very large text/csv file (7+ GB / 100+ Million lines). Format is like so:
foo0:bar0
foo1:bar1
foo2:bar2
The first column is any lowercase UTF-8 string and the second column is any UTF-8 string. I have been able to sort my file based on the first column, and only the first column, with:
sort -t':' -k1,1 filename.txt > output_sorted.txt
I have also been able to drop all duplicates with:
sort -t':' -u -k1,1 filename.txt > output_uniq_sorted.txt
These operations take 4-8 min.
I am now trying to extract all duplicates based on the first column, and only the first column, so I can check whether all the corresponding entries in the second column match.
I think I can achieve this with awk with this code:
BEGIN { FS = ":" }
{
count[$1]++;
if (count[$1] == 1){
first[$1] = $0;
}
if (count[$1] == 2){
print first[$1];
}
if (count[$1] > 1){
print $0;
}
}
running it with:
awk -f awk.dups input_sorted.txt > output_dup.txt
Now the problem is this takes way too long: 3+ hours and it is still not done. I know uniq can get all duplicates with something like:
uniq -D sorted_file.txt > output_dup.txt
The problem is specifying the delimiter and using only the first column. I know uniq has a -f N option to skip the first N fields. Is there a way to get these results without having to change/process my data? Is there another tool that could accomplish this? I have already tried python + pandas with read_csv to get the duplicates, but this leads to errors (segmentation fault) and it is not efficient, since I shouldn't have to load all the data into memory when the data is already sorted. I have decent hardware:
i7-4700HQ
16GB ram
256GB ssd samsung 850 pro
Anything that can help is welcome,
Thanks.
SOLUTION FROM BELOW
Using:
awk -F: '{if(p!=$1){p=$1; c=0; p0=$0} else c++} c==1{print p0} c'
With the time command I get the following performance:
real 0m46.058s
user 0m40.352s
sys 0m2.984s
If your file is already sorted you don't need to store more than one line; try this:
$ awk -F: '{if(p!=$1){p=$1; c=0; p0=$0} else c++} c==1{print p0} c' sorted.input
If you try this please post the timings...
I have changed the awk script slightly because I couldn't fully understand what was happening in the answer above.
awk -F: '{if(p!=$1){p=$1; c=0; p0=$0} else c++} c>=1{if(c==1){print p0;} print $0}' sorted.input > duplicate.entries
I have tested and this produces the same output as the above but might be easier to understand.
{if(p!=$1){p=$1; c=0; p0=$0} else c++}
If the first token on the line is not the same as the previous one, we save it in p, set c to 0, and save the whole line into p0. If it is the same, we increment c.
c>=1{if(c==1){print p0;} print $0}
When a key repeats, we check whether it is the first repeat. If it is, we print the saved line and the current line; if not, we just print the current line.
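For readability, here is the same logic written out with comments (an equivalent sketch, assuming the input is sorted on the first column as above):
awk -F: '
$1 != p  { p = $1; c = 0; p0 = $0; next }  # new key: remember it and its first line
++c == 1 { print p0 }                      # second occurrence of a key: emit its stored first line
{ print }                                  # print the current duplicate line
' sorted.input > duplicate.entries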

How can I append any string at the end of line and keep doing it after specific number of lines?

I want to add the symbol " >>" at the end of the 1st line, then the 5th line, and so on: 1, 5, 9, 13, 17, .... I searched the web and went through the article below, but I'm unable to achieve it. Please help.
How can I append text below the specific number of lines in sed?
retentive
good at remembering
The child was very sharp, and her memory was extremely retentive.
— Rowlands, Effie Adelaide
unconscionable
greatly exceeding bounds of reason or moderation
For generations in the New York City public schools, this has become the norm with devastating consequences rooted in unconscionable levels of student failure.
— New York Times (Nov 4, 2011)
Output should be like-
retentive >>
good at remembering
The child was very sharp, and her memory was extremely retentive.
— Rowlands, Effie Adelaide
unconscionable >>
greatly exceeding bounds of reason or moderation
For generations in the New York City public schools, this has become the norm with devastating consequences rooted in unconscionable levels of student failure.
— New York Times (Nov 4, 2011)
You can do it with awk:
awk '{if ((NR-1) % 5) {print $0} else {print $0 " >>"}}'
We check whether the line number minus 1 is a multiple of 5 and, if it is, we output the line followed by a >>; otherwise we just output the line.
Note: The above code outputs the suffix every 5 lines, because that's what is needed for your example to work.
You can do it multiple ways. sed is kind of odd when it comes to selecting lines but it's doable. E.g.:
sed:
sed -i -e 's/$/ >>/;n;n;n;n' file
You can do it also as perl one-liner:
perl -pi.bak -e 's/(.*)/$1 >>/ if not (( $. - 1 ) % 5)' file
You're thinking about this wrong. You should append to the end of the first line of every paragraph; don't worry about how many lines there happen to be in any given paragraph. That's just:
$ awk -v RS= -v ORS='\n\n' '{sub(/\n/," >>&")}1' file
retentive >>
good at remembering
The child was very sharp, and her memory was extremely retentive.
— Rowlands, Effie Adelaide
unconscionable >>
greatly exceeding bounds of reason or moderation
For generations in the New York City public schools, this has become the norm with devastating consequences rooted in unconscionable levels of student failure.
— New York Times (Nov 4, 2011)
This might work for you (GNU sed):
sed -i '1~4s/$/ >>/' file
Here are a couple more:
$ awk 'NR%5==1 && sub(/$/,">>>") || 1 ' foo
$ awk '$0=$0(NR%5==1?">>>":"")' foo
Here is a non-numeric way in Awk. This works if we have an Awk that supports the RS variable being more than one character long. We break the data into records based on the blank line separation: "\n\n". Inside these records, we break fields on newlines. Thus $1 is the word, $2 is the definition, $3 is the quote and $4 is the source:
awk 'BEGIN {OFS=FS="\n";ORS=RS="\n\n"} $1=$1" >>"'
We use the same output separators as input separators. Our only pattern/action step is then to edit $1 so that it has >> on it. The default action is { print }, which is what we want: print each record. So we can omit it.
Shorter: Initialize RS from catenation of FS.
awk 'BEGIN {OFS=FS="\n";ORS=RS=FS FS} $1=$1" >>"'
This is nicely expressive: it says that the format uses two consecutive field separators to separate records.
What if we use a flag, initially reset, which is reset on every blank line? This solution still doesn't depend on a hard-coded number, just the blank line separation. The rule fires on the first line, because C evaluates to zero, and then after every blank line, because we reset C to zero:
awk 'C++?1:$0=$0" >>";!NF{C=0}'
Shorter version of accepted Awk solution:
awk '(NR-1)%5?1:$0=$0" >>"'
We can use a ternary conditional expression cond ? then : else as a pattern, leaving the action empty so that it defaults to {print}, which of course means {print $0}. If the zero-based record number is not congruent to 0, modulo 5, then we produce 1 to trigger the print action. Otherwise we evaluate $0=$0" >>" to add the required suffix to the record. The result of this expression is also Boolean true, which triggers the print action.
Shave off one more character: we don't have to subtract 1 from NR and then test for congruence to zero. Basically whenever the 1-based record number is congruent to 1, modulo 5, then we want to add the >> suffix:
awk 'NR%5==1?$0=$0" >>":1'
Though we have to add ==1 (+3 chars), we win because we can drop two parentheses and -1 (-4 chars).
We can do better (with some assumptions): Instead of editing $0, what we can do is create a second field which contains >> by assigning to the parameter $2. The implicit print action will print this, offset by a space:
awk 'NR%5==1?$2=">>":1'
But this only works when the definition line contains one word. If any of the words in this dictionary are compound nouns (separated by space, not hyphenated), this fails. If we try to repair this flaw, we are sadly brought back to the same length:
awk 'NR%5==1?$++NF=">>":1'
Slight variation on the approach: Instead of trying to tack >> onto the record or last field, why don't we conditionally install >>\n as ORS, the output record separator?
awk 'ORS=(NR%5==1?" >>\n":"\n")'
Not the tersest, but worth mentioning. It shows how we can dynamically play with some of these variables from record to record.
Different way for testing NR == 1 (mod 5): namely, regexp!
awk 'NR~/[16]$/?$0=$0" >>":1'
Again, not tersest, but seems worth mentioning. We can treat NR as a string representing the integer as decimal digits. If it ends with 1 or 6 then it is congruent to 1, mod 5. Obviously, not easy to modify to other moduli, not to mention computationally disgusting.

Bash CSV sorting and unique-ing

A Linux question: I have the CSV file data.csv with the following fields and values:
KEY,LEVEL,DATA
2.456,2,aaa
2.456,1,zzz
0.867,2,bbb
9.775,4,ddd
0.867,1,ccc
2.456,0,ttt
...
The field KEY is a float value, while LEVEL is an integer. I know that the first field can have repeated values, as well as the second one, but if you take them together you have a unique pair.
What I would like to do is to sort the file according to the column KEY and then, for each unique value under KEY, keep only the row having the highest value under LEVEL.
Sorting is not a problem:
$> sort -t, -k1,2 data.csv # fields: KEY,LEVEL,DATA
0.867,1,ccc
0.867,2,bbb
2.456,0,ttt
2.456,1,zzz
2.456,2,aaa
9.775,4,ddd
...
but then how can I filter the rows so that I get what I want, which is:
0.867,2,bbb
2.456,2,aaa
9.775,4,ddd
...
Is there a way to do it using command line tools like sort, uniq, awk and so on? Thanks in advance
try this line:
your sort...|awk -F, 'k&&k!=$1{print p}{p=$0;k=$1}END{print p}'
output:
kent$ echo "0.867,1,bbb
0.867,2,ccc
2.456,0,ttt
2.456,1,zzz
2.456,2,aaa
9.775,4,ddd"|awk -F, 'k&&k!=$1{print p}{p=$0;k=$1}END{print p}'
0.867,2,ccc
2.456,2,aaa
9.775,4,ddd
The idea is: because your file is already sorted, just go through the input from the top; whenever the first column (KEY) changes, print the last saved line, which holds the highest LEVEL for the previous KEY.
Try it with your real data; it should work.
Also, the whole logic (including your sort) could be done by awk in a single process, as sketched below.
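A minimal sketch of that single-process version (assuming the header line should be skipped and that the distinct keys fit in memory; it keeps the row with the largest LEVEL per KEY and then sorts by KEY):
awk -F, 'NR > 1 && $2+0 >= lvl[$1]+0 { lvl[$1] = $2; row[$1] = $0 }  # keep the highest LEVEL per KEY
         END { for (k in row) print row[k] }' data.csv | sort -t, -k1,1n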
Use:
$> sort -r data.csv | uniq -w 5 | sort
given your floats are formatted "0.000"-"9.999"
Perl solution:
perl -aF, -ne '$h{$F[0]} = [@F[1,2]] if $F[1] > $h{$F[0]}[0]
}{
print join ",", $_, @{$h{$_}} for sort {$a<=>$b} keys %h' data.csv
Note that the result is different from the one you requested: the first line contains bbb, not ccc.
