Add a number to a column in a text file with awk (Linux)

I'm attempting to add the number 128 to every value in column 6 of my file below_zn.pdb, which contains 128 lines and 12 columns separated by spaces (not tabs). When I use
awk '{ $6+=128; print }' below_zn.pdb
I am able to add 128 to column 6, but the formatting of my file changes. My output looks as follows:
ATOM 1 ZN ZN2 H 129 -13.264 34.400 10.700 1.00 0.00 HETA
ATOM 2 ZN ZN2 H 130 -13.264 25.273 10.700 1.00 0.00 HETA
ATOM 3 ZN ZN2 H 131 -13.264 43.527 10.700 1.00 0.00 HETA
ATOM 4 ZN ZN2 H 132 -13.264 52.654 10.700 1.00 0.00 HETA
ATOM 5 ZN ZN2 H 133 -13.175 29.836 14.467 1.00 0.00 HETA
ATOM 6 ZN ZN2 H 134 -13.175 38.963 14.467 1.00 0.00 HETA
ATOM 7 ZN ZN2 H 135 -13.175 48.090 14.467 1.00 0.00 HETA
ATOM 8 ZN ZN2 H 136 -13.175 57.217 14.467 1.00 0.00 HETA
ATOM 9 ZN ZN2 H 137 -10.679 34.400 -15.527 1.00 0.00 HETA
ATOM 10 ZN ZN2 H 138 -10.679 25.273 -15.527 1.00 0.00 HETA
ATOM 11 ZN ZN2 H 139 -10.679 43.527 -15.527 1.00 0.00 HETA
ATOM 12 ZN ZN2 H 140 -10.679 52.654 -15.527 1.00 0.00 HETA
ATOM 13 ZN ZN2 H 141 -10.590 29.836 -11.760 1.00 0.00 HETA
ATOM 14 ZN ZN2 H 142 -10.590 38.963 -11.760 1.00 0.00 HETA
ATOM 15 ZN ZN2 H 143 -10.590 48.090 -11.760 1.00 0.00 HETA
ATOM 16 ZN ZN2 H 144 -10.590 57.217 -11.760 1.00 0.00 HETA
ATOM 17 ZN ZN2 H 145 -9.288 34.400 1.958 1.00 0.00 HETA
ATOM 18 ZN ZN2 H 146 -9.288 25.273 1.958 1.00 0.00 HETA
ATOM 19 ZN ZN2 H 147 -9.288 43.527 1.958 1.00 0.00 HETA
ATOM 20 ZN ZN2 H 148 -9.288 52.654 1.958 1.00 0.00 HETA
I need to keep the formatting for my file to be useful. I have tried
awk -F'()' '{ $6+=128; print }' below_zn.pdb
but instead of adding 128 to every value in column 6, I get a new column at the far right that contains the number 128 on every line, as seen below:
ATOM 1 ZN ZN2 H 1 -13.264 34.400 10.700 1.00 0.00 HETA 128
ATOM 2 ZN ZN2 H 2 -13.264 25.273 10.700 1.00 0.00 HETA 128
ATOM 3 ZN ZN2 H 3 -13.264 43.527 10.700 1.00 0.00 HETA 128
ATOM 4 ZN ZN2 H 4 -13.264 52.654 10.700 1.00 0.00 HETA 128
ATOM 5 ZN ZN2 H 5 -13.175 29.836 14.467 1.00 0.00 HETA 128
ATOM 6 ZN ZN2 H 6 -13.175 38.963 14.467 1.00 0.00 HETA 128
Is there a way to use awk/sed/grep or any other Linux command to add 128 to the numbers in column 6 while keeping the formatting as follows:
ATOM 1 ZN ZN2 H 1 -13.264 34.400 10.700 1.00 0.00 HETA
ATOM 2 ZN ZN2 H 2 -13.264 25.273 10.700 1.00 0.00 HETA
ATOM 3 ZN ZN2 H 3 -13.264 43.527 10.700 1.00 0.00 HETA
ATOM 4 ZN ZN2 H 4 -13.264 52.654 10.700 1.00 0.00 HETA
ATOM 5 ZN ZN2 H 5 -13.175 29.836 14.467 1.00 0.00 HETA
ATOM 6 ZN ZN2 H 6 -13.175 38.963 14.467 1.00 0.00 HETA
ATOM 7 ZN ZN2 H 7 -13.175 48.090 14.467 1.00 0.00 HETA
ATOM 8 ZN ZN2 H 8 -13.175 57.217 14.467 1.00 0.00 HETA
ATOM 9 ZN ZN2 H 9 -10.679 34.400 -15.527 1.00 0.00 HETA
ATOM 10 ZN ZN2 H 10 -10.679 25.273 -15.527 1.00 0.00 HETA
ATOM 11 ZN ZN2 H 11 -10.679 43.527 -15.527 1.00 0.00 HETA
ATOM 12 ZN ZN2 H 12 -10.679 52.654 -15.527 1.00 0.00 HETA
ATOM 13 ZN ZN2 H 13 -10.590 29.836 -11.760 1.00 0.00 HETA
ATOM 14 ZN ZN2 H 14 -10.590 38.963 -11.760 1.00 0.00 HETA
ATOM 15 ZN ZN2 H 15 -10.590 48.090 -11.760 1.00 0.00 HETA
ATOM 16 ZN ZN2 H 16 -10.590 57.217 -11.760 1.00 0.00 HETA
ATOM 17 ZN ZN2 H 17 -9.288 34.400 1.958 1.00 0.00 HETA
ATOM 18 ZN ZN2 H 18 -9.288 25.273 1.958 1.00 0.00 HETA
ATOM 19 ZN ZN2 H 19 -9.288 43.527 1.958 1.00 0.00 HETA
ATOM 20 ZN ZN2 H 20 -9.288 52.654 1.958 1.00 0.00 HETA
.
.
.
An important note is that columns 7 through 9 can have up to 7 characters (a whole number with a period followed by the decimals), and there is one space separating those columns.
My file has the following format
column 1 - 4 characters
1 space
column 2 - 1 character
1 space
column 3 - 2 characters
1 space
column 4 - 1 character
1 space
column 5 - 3 characters
1 space
column 6 - 1,2,or 3 characters
5 spaces
column 7 - up to 7 characters
1 space
column 8 - up to 7 characters
1 space
column 9 - up to 7 characters
2 spaces
column 10 - 4 characters
2 spaces
column 11 - 4 characters
6 spaces
column 12 - 4 characters
end of file
Thank you!

Assumptions:
input is using fixed-width spacing
white space only shows up as a column delimiter (ie, no column values contain white space)
the values in column 6 are left-justified
Adding a new row to demonstrate a wider value for column 6:
$ cat below_zn.pdb
ATOM 1 ZN ZN2 H 1 -13.264 34.400 10.700 1.00 0.00 HETA
ATOM 2 ZN ZN2 H 2 -13.264 25.273 10.700 1.00 0.00 HETA
ATOM 3 ZN ZN2 H 3 -13.264 43.527 10.700 1.00 0.00 HETA
ATOM 4 ZN ZN2 H 4 -13.264 52.654 10.700 1.00 0.00 HETA
ATOM 5 ZN ZN2 H 5 -13.175 29.836 14.467 1.00 0.00 HETA
BUBBLE 206 ZN ZN2 H 7000 -13.175 29.836 14.467 1.00 0.00 HETA-HETA
One awk idea:
awk '
BEGIN { regex1="^([^[:space:]]+[[:space:]]+){5}" # match 1st 5 columns plus trailing white space
regex2="[^[:space:]]+" # match non-white space characters (aka 6th column)
}
{ oldline=$0
match(oldline,regex1) # find 1st 5 columns
newline=substr(oldline,1,RSTART+RLENGTH-1) # save 1st 5 columns for new line
oldline=substr(oldline,RSTART+RLENGTH) # strip off 1st 5 columns
match(oldline,regex2) # match 1st column of shortened line (aka 6th column of original line)
newval=substr(oldline,1,RLENGTH) + 128 # extract column and add 128
newlen=length(newval) # get length of new value
newline=newline newval substr(oldline,RSTART+newlen) # append new value and rest of line to newline
print newline # print newline to stdout
}
' below_zn.pdb
This generates:
ATOM 1 ZN ZN2 H 129 -13.264 34.400 10.700 1.00 0.00 HETA
ATOM 2 ZN ZN2 H 130 -13.264 25.273 10.700 1.00 0.00 HETA
ATOM 3 ZN ZN2 H 131 -13.264 43.527 10.700 1.00 0.00 HETA
ATOM 4 ZN ZN2 H 132 -13.264 52.654 10.700 1.00 0.00 HETA
ATOM 5 ZN ZN2 H 133 -13.175 29.836 14.467 1.00 0.00 HETA
BUBBLE 206 ZN ZN2 H 7128 -13.175 29.836 14.467 1.00 0.00 HETA-HETA
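The same overwrite-in-place idea can also be sketched in Python. This is only a minimal sketch under the assumptions listed above (fixed-width input, left-justified integer values in column 6, no white space inside column values, lines passed without their trailing newline); the function name add_to_col6 is made up for this illustration.

```python
import re

# First five whitespace-delimited columns (with their separators),
# then column 6 and the padding that follows it.
FIELD6 = re.compile(r"^((?:\S+\s+){5})(\S+)(\s*)")

def add_to_col6(line, delta=128):
    """Add delta to column 6 while keeping the later columns in place."""
    m = FIELD6.match(line)
    if m is None:
        return line  # fewer than 6 columns (e.g. TER, END): leave untouched
    head, value, pad = m.groups()
    new_value = str(int(value) + delta)  # assumes column 6 is an integer
    width = len(value) + len(pad)        # space originally used by column 6
    # Keep at least one separator if the new value fills the whole slot.
    min_width = len(new_value) + (1 if pad else 0)
    return head + new_value.ljust(max(width, min_width)) + line[m.end():]

# Hypothetical usage; "below_zn.pdb" is the file name from the question.
# with open("below_zn.pdb") as fh:
#     for raw in fh:
#         print(add_to_col6(raw.rstrip("\n")))
```

The new value is written over the padding that followed the old one, so the remaining columns only shift if the new value no longer fits into the original slot.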

I would use GNU AWK for this task in the following way. Let the content of file.txt be
ATOM 1 ZN ZN2 H 1 -13.264 34.400 10.700 1.00 0.00 HETA
ATOM 2 ZN ZN2 H 2 -13.264 25.273 10.700 1.00 0.00 HETA
ATOM 3 ZN ZN2 H 3 -13.264 43.527 10.700 1.00 0.00 HETA
ATOM 4 ZN ZN2 H 4 -13.264 52.654 10.700 1.00 0.00 HETA
ATOM 5 ZN ZN2 H 5 -13.175 29.836 14.467 1.00 0.00 HETA
then
awk 'BEGIN{FPAT="[^[:space:]]+[[:space:]]*";OFS=""}{$6=($6+128) " ";print}' file.txt
gives output
ATOM 1 ZN ZN2 H 129 -13.264 34.400 10.700 1.00 0.00 HETA
ATOM 2 ZN ZN2 H 130 -13.264 25.273 10.700 1.00 0.00 HETA
ATOM 3 ZN ZN2 H 131 -13.264 43.527 10.700 1.00 0.00 HETA
ATOM 4 ZN ZN2 H 132 -13.264 52.654 10.700 1.00 0.00 HETA
ATOM 5 ZN ZN2 H 133 -13.175 29.836 14.467 1.00 0.00 HETA
Explanation: I tell GNU AWK that a field is one or more (+) non-whitespace ([^[:space:]]) characters followed by zero or more (*) whitespace characters, so the trailing whitespace becomes part of each field, and that the output field separator (OFS) is the empty string. Then, for each line, I increase the value of the 6th column by 128 and concatenate it with the trailing spaces before printing the line. Feel free to adjust the number of spaces as required.
(tested in gawk 4.2.1)

Related

copy the segment of the file

I have a multi-line pdb file in the following format
ATOM 2381 CG2 THR A 304 3.359 -8.466 -13.379 1.00 34.89 C
ATOM 2380 OG1 THR A 304 5.073 -10.157 -13.609 1.00 36.00 O
...
ATOM 2380 OG1 THR A 304 5.073 -10.157 -13.609 1.00 36.00 O
TER
HETATM 2382 O HOH A 572 2.739 5.289 20.202 1.00 33.02 O
HETATM 2389 H01 HOH A 572 2.967 5.272 19.270 1.00 33.02 H
HETATM 2390 H02 HOH A 572 2.017 5.906 20.344 1.00 33.02 H
HETATM 2383 O HOH A 619 9.589 -1.213 21.275 1.00 28.34 O
HETATM 2391 H01 HOH A 619 9.100 -1.521 22.041 1.00 28.34 H
HETATM 2392 H03 HOH A 619 9.669 -0.257 21.309 1.00 28.34 H
HETATM 2384 O HOH A 634 8.859 1.214 21.216 1.00 27.10 O
HETATM 2393 H01 HOH A 634 9.495 1.911 21.394 1.00 27.10 H
HETATM 2394 H02 HOH A 634 8.631 0.771 22.037 1.00 27.10 H
HETATM 2385 O HOH A 660 10.309 -1.469 23.867 1.00 43.45 O
HETATM 2395 H01 HOH A 660 9.648 -1.616 24.547 1.00 43.45 H
HETATM 2396 H02 HOH A 660 10.465 -0.527 23.770 1.00 43.45 H
END
Using some utility, I need to copy all lines after the TER record (they can be identified as the lines starting with HETATM) and save them in a separate file containing:
HETATM 2382 O HOH A 572 2.739 5.289 20.202 1.00 33.02 O
HETATM 2389 H01 HOH A 572 2.967 5.272 19.270 1.00 33.02 H
HETATM 2390 H02 HOH A 572 2.017 5.906 20.344 1.00 33.02 H
HETATM 2383 O HOH A 619 9.589 -1.213 21.275 1.00 28.34 O
HETATM 2391 H01 HOH A 619 9.100 -1.521 22.041 1.00 28.34 H
HETATM 2392 H03 HOH A 619 9.669 -0.257 21.309 1.00 28.34 H
HETATM 2384 O HOH A 634 8.859 1.214 21.216 1.00 27.10 O
HETATM 2393 H01 HOH A 634 9.495 1.911 21.394 1.00 27.10 H
HETATM 2394 H02 HOH A 634 8.631 0.771 22.037 1.00 27.10 H
HETATM 2385 O HOH A 660 10.309 -1.469 23.867 1.00 43.45 O
HETATM 2395 H01 HOH A 660 9.648 -1.616 24.547 1.00 43.45 H
HETATM 2396 H02 HOH A 660 10.465 -0.527 23.770 1.00 43.45 H
What Unix utility would be useful for this?
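One possible way to do this, sketched in Python rather than with a shell utility: collect every HETATM line that appears after the first TER record. The function name and the file names in the usage comment are hypothetical.

```python
def hetatm_after_ter(lines):
    """Return the HETATM records that follow the first TER record."""
    seen_ter = False
    kept = []
    for line in lines:
        if line.startswith("TER"):
            seen_ter = True
        elif seen_ter and line.startswith("HETATM"):
            kept.append(line)
    return kept

# Hypothetical usage; both file names are placeholders.
# with open("input.pdb") as src, open("waters.pdb", "w") as dst:
#     dst.writelines(hetatm_after_ter(src))
```

Because the lines are passed through unchanged, the fixed-width layout of the selected records is preserved.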

sed: change text in the last column

I would like to ask how to change the letter A to C in the last column using sed.
Input for example:
HETATM 18 H UNK 0 12.447 20.851 23.373 0.00 0.00 0.167 HD
HETATM 19 C UNK 0 11.406 19.947 21.942 0.00 0.00 0.033 A
HETATM 20 C UNK 0 10.684 20.899 21.181 0.00 0.00 0.030 A
HETATM 21 C UNK 0 9.503 20.541 20.507 0.00 0.00 0.019 A
HETATM 22 C UNK 0 9.032 19.211 20.545 0.00 0.00 0.032 A
HETATM 23 C UNK 0 9.772 18.248 21.264 0.00 0.00 0.019 A
HETATM 24 C UNK 0 10.946 18.613 21.948 0.00 0.00 0.030 A
HETATM 25 C UNK 0 7.833 18.846 19.889 0.00 0.00 0.253 C
HETATM 26 O UNK 0 7.856 18.994 18.642 0.00 0.00 -0.267 OA
Output:
HETATM 18 H UNK 0 12.447 20.851 23.373 0.00 0.00 0.167 HD
HETATM 19 C UNK 0 11.406 19.947 21.942 0.00 0.00 0.033 C
HETATM 20 C UNK 0 10.684 20.899 21.181 0.00 0.00 0.030 C
HETATM 21 C UNK 0 9.503 20.541 20.507 0.00 0.00 0.019 C
HETATM 22 C UNK 0 9.032 19.211 20.545 0.00 0.00 0.032 C
HETATM 23 C UNK 0 9.772 18.248 21.264 0.00 0.00 0.019 C
HETATM 24 C UNK 0 10.946 18.613 21.948 0.00 0.00 0.030 C
HETATM 25 C UNK 0 7.833 18.846 19.889 0.00 0.00 0.253 C
HETATM 26 O UNK 0 7.856 18.994 18.642 0.00 0.00 -0.267 OA
I tried sed like this:
sed 's/[A*]$/C/'
But the output looks like this:
HETATM 26 O UNK 0 7.856 18.994 18.642 0.00 0.00 -0.267 OC
Simple sed approach:
sed 's/\<A[[:space:]]*$/C/' file
\< - word boundary (assuming A char occurs only as standalone char)
[[:space:]]* - match possible whitespace(s) at the end of the string $
The output:
HETATM 18 H UNK 0 12.447 20.851 23.373 0.00 0.00 0.167 HD
HETATM 19 C UNK 0 11.406 19.947 21.942 0.00 0.00 0.033 C
HETATM 20 C UNK 0 10.684 20.899 21.181 0.00 0.00 0.030 C
HETATM 21 C UNK 0 9.503 20.541 20.507 0.00 0.00 0.019 C
HETATM 22 C UNK 0 9.032 19.211 20.545 0.00 0.00 0.032 C
HETATM 23 C UNK 0 9.772 18.248 21.264 0.00 0.00 0.019 C
HETATM 24 C UNK 0 10.946 18.613 21.948 0.00 0.00 0.030 C
HETATM 25 C UNK 0 7.833 18.846 19.889 0.00 0.00 0.253 C
HETATM 26 O UNK 0 7.856 18.994 18.642 0.00 0.00 -0.267 OA
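For comparison, the same word-boundary idea can be sketched with Python's re module. This is an illustration only; it assumes each line is passed without its trailing newline, and the function name is made up here.

```python
import re

# A standalone "A" at the end of the line, optionally followed by white space.
LAST_A = re.compile(r"\bA\s*$")

def a_to_c(line):
    """Replace a standalone trailing 'A' with 'C'; 'OA' or 'HD' stay as they are."""
    return LAST_A.sub("C", line)
```

As in the sed command, any white space after the final A is dropped by the substitution, and a value such as OA is left alone because there is no word boundary between O and A.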

Optimal class weight parameter for the following SVC?

Hello, I am working with sklearn to train a classifier. I have the following distribution of labels:
label : 0 frecuency : 119
label : 1 frecuency : 1615
label : 2 frecuency : 197
label : 3 frecuency : 70
label : 4 frecuency : 203
label : 5 frecuency : 137
label : 6 frecuency : 18
label : 7 frecuency : 142
label : 8 frecuency : 15
label : 9 frecuency : 182
label : 10 frecuency : 986
label : 12 frecuency : 73
label : 13 frecuency : 27
label : 14 frecuency : 81
label : 15 frecuency : 168
label : 18 frecuency : 107
label : 21 frecuency : 125
label : 22 frecuency : 172
label : 23 frecuency : 3870
label : 25 frecuency : 2321
label : 26 frecuency : 25
label : 27 frecuency : 314
label : 28 frecuency : 76
label : 29 frecuency : 116
One thing that clearly stands out is that I am working with an unbalanced data set: classes 25, 23, 1 and 10 have many more samples than the rest. I am getting bad results after training, as follows:
precision recall f1-score support
0 0.00 0.00 0.00 31
1 0.61 0.23 0.34 528
2 0.00 0.00 0.00 70
3 0.67 0.06 0.11 32
4 0.00 0.00 0.00 62
5 0.78 0.82 0.80 39
6 0.00 0.00 0.00 3
7 0.00 0.00 0.00 46
8 0.00 0.00 0.00 5
9 0.00 0.00 0.00 62
10 0.14 0.01 0.02 313
12 0.00 0.00 0.00 30
13 0.31 0.57 0.40 7
14 0.00 0.00 0.00 35
15 0.00 0.00 0.00 56
18 0.00 0.00 0.00 35
21 0.00 0.00 0.00 39
22 0.00 0.00 0.00 66
23 0.41 0.74 0.53 1278
25 0.28 0.39 0.33 758
26 0.50 0.25 0.33 8
27 0.29 0.02 0.03 115
28 1.00 0.61 0.76 23
29 0.00 0.00 0.00 42
avg / total 0.33 0.39 0.32 3683
I am getting many zeros, and the SVC is not able to learn several of the classes. The hyperparameters I am using are the following:
from sklearn import svm
clf2= svm.SVC(kernel='linear')
In order to overcome this issue, I built a dictionary with a weight for each class as follows:
weight = {}
for i, v in enumerate(uniqLabels):
    weight[v] = labels_cluster.count(uniqLabels[i]) / len(labels_cluster)
for i, v in weight.items():
    print(i, v)
print(weight)
These are the resulting weights. For each label I take the number of elements with that label divided by the total number of elements in the label set, so the weights sum to 1:
0 0.010664037996236221
1 0.14472622994892015
2 0.01765391164082803
3 0.006272963527197778
4 0.018191594228873554
5 0.012277085760372793
6 0.0016130477641365713
7 0.012725154583744062
8 0.0013442064701138096
9 0.01630970517071422
10 0.0883591719688144
12 0.0065418048212205395
13 0.002419571646204857
14 0.007258714938614571
15 0.015055112465274667
18 0.009588672820145173
21 0.011201720584281746
22 0.015413567523971682
23 0.34680526928936284
25 0.20799354780894344
26 0.0022403441168563493
27 0.028138722107715744
28 0.006810646115243301
29 0.01039519670221346
Trying again with this dictionary of weights as follows:
from sklearn import svm
clf2= svm.SVC(kernel='linear',class_weight=weight)
I got:
precision recall f1-score support
0 0.00 0.00 0.00 31
1 0.90 0.19 0.31 528
2 0.00 0.00 0.00 70
3 0.00 0.00 0.00 32
4 0.00 0.00 0.00 62
5 0.00 0.00 0.00 39
6 0.00 0.00 0.00 3
7 0.00 0.00 0.00 46
8 0.00 0.00 0.00 5
9 0.00 0.00 0.00 62
10 0.00 0.00 0.00 313
12 0.00 0.00 0.00 30
13 0.00 0.00 0.00 7
14 0.00 0.00 0.00 35
15 0.00 0.00 0.00 56
18 0.00 0.00 0.00 35
21 0.00 0.00 0.00 39
22 0.00 0.00 0.00 66
23 0.36 0.99 0.52 1278
25 0.46 0.01 0.02 758
26 0.00 0.00 0.00 8
27 0.00 0.00 0.00 115
28 0.00 0.00 0.00 23
29 0.00 0.00 0.00 42
avg / total 0.35 0.37 0.23 3683
Since I am not getting good results, I would really appreciate suggestions on how to automatically adjust the weight of each class and express that in the SVC. I don't have much experience dealing with unbalanced problems, so all suggestions are welcome.
It seems that you are doing the opposite of what you should be doing. You want to put higher weights on the smaller classes, so that the classifier is penalized more for errors on those classes during training. A good starting point would be setting class_weight="balanced".
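To make the direction of the weighting concrete, here is a minimal sketch that computes inverse-frequency weights by hand. It uses the n_samples / (n_classes * count) heuristic, which is what the scikit-learn documentation describes for the "balanced" mode; the function name is made up for this illustration, and labels_cluster in the usage comment is the list from the question.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {label: n_samples / (n_classes * count)
            for label, count in counts.items()}

# Hypothetical usage with the variables from the question:
# weight = balanced_class_weights(labels_cluster)
# clf2 = svm.SVC(kernel='linear', class_weight=weight)
```

With this scheme a class that is rare gets a weight above 1 and a class that is common gets a weight below 1, which is the reverse of the frequency-based dictionary in the question.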

VIM replacing text in 2 columns

Below is part of a column-sensitive file, lines 23 to 34. Please look at columns 25 and 26. Lines 23 to 28 are correct, as the numbers are supposed to be sequential.
HETATM 21 O HOH 7 -1.609 5.551 -4.296 1.00 0.00 WAT O
HETATM 22 H HOH 7 -1.594 5.971 -3.395 1.00 0.00 WAT H
HETATM 23 H HOH 7 -1.048 4.730 -4.281 1.00 0.00 WAT H
HETATM 24 O HOH 8 -4.693 5.472 -0.557 1.00 0.00 WAT O
HETATM 25 H HOH 8 -3.881 4.900 -0.521 1.00 0.00 WAT H
HETATM 26 H HOH 8 -4.819 5.805 -1.485 1.00 0.00 WAT H
HETATM 27 O HOH 1 0.289 -5.035 5.663 1.00 0.00 WAT O
HETATM 28 H HOH 10 0.241 -4.604 -5.564 1.00 0.00 WAT H
HETATM 29 H HOH 1 -0.399 -5.750 5.605 1.00 0.00 WAT H
HETATM 30 O HOH 11 -1.741 -5.167 0.877 1.00 0.00 WAT O
HETATM 31 H HOH 0 -2.612 -4.754 0.636 1.00 0.00 WAT H
HETATM 32 H HOH 0 -1.819 -5.599 1.769 1.00 0.00 WAT H
However, columns 25 and 26 in lines 29 to 34 (and in lines beyond 34 that are not included here) need to be edited. They hold the ID number of the water molecules in the file. Columns 25 and 26 in lines 29-31 are supposed to be ' 9' instead of ' 1' or '10', and columns 25 and 26 in lines 32-34 are supposed to be '10' instead of '11' or ' 0'. All lines after 34 suffer from a similar problem, and I also want to change the contents of columns 25 and 26 to '12', '13', etc. for each group of 3 lines. The final result is expected to look like this:
HETATM 21 O HOH 7 -1.609 5.551 -4.296 1.00 0.00 WAT O
HETATM 22 H HOH 7 -1.594 5.971 -3.395 1.00 0.00 WAT H
HETATM 23 H HOH 7 -1.048 4.730 -4.281 1.00 0.00 WAT H
HETATM 24 O HOH 8 -4.693 5.472 -0.557 1.00 0.00 WAT O
HETATM 25 H HOH 8 -3.881 4.900 -0.521 1.00 0.00 WAT H
HETATM 26 H HOH 8 -4.819 5.805 -1.485 1.00 0.00 WAT H
HETATM 27 O HOH 9 0.289 -5.035 5.663 1.00 0.00 WAT O
HETATM 28 H HOH 9 0.241 -4.604 -5.564 1.00 0.00 WAT H
HETATM 29 H HOH 9 -0.399 -5.750 5.605 1.00 0.00 WAT H
HETATM 30 O HOH 10 -1.741 -5.167 0.877 1.00 0.00 WAT O
HETATM 31 H HOH 10 -2.612 -4.754 0.636 1.00 0.00 WAT H
HETATM 32 H HOH 10 -1.819 -5.599 1.769 1.00 0.00 WAT H
So far I couldn't come up with a good pattern to replace those wrong numbers with 9, 10, etc. It would be great if I could fix all these groups of 3 lines with a single vim command instead of doing it group by group, as there are 50-60 groups with this problem. What I did earlier was simply :26,28s/HOH 1/HOH 8, which is clearly not the most efficient way.
Sorry for not being clear in the first attempt at the question, but your help would be appreciated. Thank you.
Your question is not clear, but from what I understand, selecting a rectangular block in visual mode might help you. Use ctrl-v on OS X or Linux, or ctrl-q on Windows (in normal mode).
Actually, I'd like to thank everyone for your time, and sorry for causing confusion. I found a way to do it with Python's string formatting, as the pattern is really fuzzy and I'm not used to regex patterns, so I couldn't figure out a simple way to do it in Vim.
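The asker did not post the Python solution, so the following is only a guess at what such a string-formatting approach could look like. It assumes that the list contains only the water records, that every water has exactly three atom lines, that the ID sits in columns 25-26 (1-based), and that IDs fit into two characters; the function name and its parameters are hypothetical.

```python
def renumber_groups(lines, group_size=3, first_id=1, start=24, end=26):
    """Overwrite columns start..end (0-based slice) with a sequential group ID."""
    width = end - start
    out = []
    for index, line in enumerate(lines):
        group_id = first_id + index // group_size
        # Right-justify the ID so ' 9' and '10' occupy the same two columns.
        out.append(line[:start] + f"{group_id:>{width}}" + line[end:])
    return out
```

Every line keeps its original length, so the columns to the right of the ID are not shifted.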

Linux text: add line to previous line of a pattern

I would like to add a specific line "TER" to several variable text files:
Input:
[...]
ATOM 4149 C LEU C 9 136.820 120.050 53.540 1.00 0.00
ATOM 4150 O LEU C 9 136.600 118.860 53.240 1.00 0.00
ATOM 4151 O LEU C 9 137.310 120.340 54.650 1.00 0.00
ATOM 4154 N LYS D 2 115.050 134.940 61.060 1.00 0.00
ATOM 4155 H1 LYS D 2 115.660 134.160 61.180 1.00 0.00
ATOM 4156 H2 LYS D 2 114.760 135.000 60.100 1.00 0.00
[...]
Output:
[...]
ATOM 4149 C LEU C 9 136.820 120.050 53.540 1.00 0.00
ATOM 4150 O LEU C 9 136.600 118.860 53.240 1.00 0.00
ATOM 4151 O LEU C 9 137.310 120.340 54.650 1.00 0.00
TER
ATOM 4154 N LYS D 2 115.050 134.940 61.060 1.00 0.00
ATOM 4155 H1 LYS D 2 115.660 134.160 61.180 1.00 0.00
ATOM 4156 H2 LYS D 2 114.760 135.000 60.100 1.00 0.00
[...]
So the pattern is: if a " D " is found for the first time after a " C ", add a "TER" line before the " D " line (after the " C " line). All other numbers and characters can be variable.
I found some examples with the sed command; however, I do not know how to add a line before the matched line.
With awk:
$ awk 'last_c5=="C" && $5=="D" {print "TER"}; last_c5=$5' file
ATOM 4149 C LEU C 9 136.820 120.050 53.540 1.00 0.00
ATOM 4150 O LEU C 9 136.600 118.860 53.240 1.00 0.00
ATOM 4151 O LEU C 9 137.310 120.340 54.650 1.00 0.00
TER
ATOM 4154 N LYS D 2 115.050 134.940 61.060 1.00 0.00
ATOM 4155 H1 LYS D 2 115.660 134.160 61.180 1.00 0.00
ATOM 4156 H2 LYS D 2 114.760 135.000 60.100 1.00 0.00
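It keeps track of the previous line's 5th column value by storing it in the last_c5 variable. If the previous value was C and the current one is D, it prints TER. The last_c5=$5 assignment also serves as a pattern that is true whenever $5 is non-empty (and not 0), so every such line is printed.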
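The same previous-versus-current comparison can be sketched in Python. This is an illustrative sketch, assuming the chain ID is the 5th whitespace-separated field as in the awk command; the function name and parameters are made up. Unlike the awk one-liner, it passes every line through, including lines with fewer than five fields.

```python
def insert_ter(lines, before="C", after="D", column=4):
    """Insert a TER line wherever the chain ID changes from `before` to `after`."""
    out = []
    previous = None
    for line in lines:
        fields = line.split()
        current = fields[column] if len(fields) > column else None
        if previous == before and current == after:
            out.append("TER")
        out.append(line)
        previous = current
    return out
```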

Resources