Copy a segment of a file - Linux

I have a multi-line PDB file in the following format:
ATOM 2381 CG2 THR A 304 3.359 -8.466 -13.379 1.00 34.89 C
ATOM 2380 OG1 THR A 304 5.073 -10.157 -13.609 1.00 36.00 O
...
ATOM 2380 OG1 THR A 304 5.073 -10.157 -13.609 1.00 36.00 O
TER
HETATM 2382 O HOH A 572 2.739 5.289 20.202 1.00 33.02 O
HETATM 2389 H01 HOH A 572 2.967 5.272 19.270 1.00 33.02 H
HETATM 2390 H02 HOH A 572 2.017 5.906 20.344 1.00 33.02 H
HETATM 2383 O HOH A 619 9.589 -1.213 21.275 1.00 28.34 O
HETATM 2391 H01 HOH A 619 9.100 -1.521 22.041 1.00 28.34 H
HETATM 2392 H03 HOH A 619 9.669 -0.257 21.309 1.00 28.34 H
HETATM 2384 O HOH A 634 8.859 1.214 21.216 1.00 27.10 O
HETATM 2393 H01 HOH A 634 9.495 1.911 21.394 1.00 27.10 H
HETATM 2394 H02 HOH A 634 8.631 0.771 22.037 1.00 27.10 H
HETATM 2385 O HOH A 660 10.309 -1.469 23.867 1.00 43.45 O
HETATM 2395 H01 HOH A 660 9.648 -1.616 24.547 1.00 43.45 H
HETATM 2396 H02 HOH A 660 10.465 -0.527 23.770 1.00 43.45 H
END
Using some utility, I need to copy all lines after the TER record (they can also be defined as the lines starting with HETATM) and save them in a separate file containing:
HETATM 2382 O HOH A 572 2.739 5.289 20.202 1.00 33.02 O
HETATM 2389 H01 HOH A 572 2.967 5.272 19.270 1.00 33.02 H
HETATM 2390 H02 HOH A 572 2.017 5.906 20.344 1.00 33.02 H
HETATM 2383 O HOH A 619 9.589 -1.213 21.275 1.00 28.34 O
HETATM 2391 H01 HOH A 619 9.100 -1.521 22.041 1.00 28.34 H
HETATM 2392 H03 HOH A 619 9.669 -0.257 21.309 1.00 28.34 H
HETATM 2384 O HOH A 634 8.859 1.214 21.216 1.00 27.10 O
HETATM 2393 H01 HOH A 634 9.495 1.911 21.394 1.00 27.10 H
HETATM 2394 H02 HOH A 634 8.631 0.771 22.037 1.00 27.10 H
HETATM 2385 O HOH A 660 10.309 -1.469 23.867 1.00 43.45 O
HETATM 2395 H01 HOH A 660 9.648 -1.616 24.547 1.00 43.45 H
HETATM 2396 H02 HOH A 660 10.465 -0.527 23.770 1.00 43.45 H
What Unix utility may be useful for this?
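One way to approach this, sketched in Python rather than with a shell utility: keep only the lines that start with HETATM and write them to a second file. The function names `extract_hetatm` and `copy_hetatm` and both file paths are hypothetical, and the sketch assumes every wanted record really does begin with HETATM, as the question states.

```python
def extract_hetatm(lines):
    """Return only the records that start with HETATM, in their original order."""
    return [line for line in lines if line.startswith("HETATM")]


def copy_hetatm(src_path, dst_path):
    """Write the HETATM records of src_path to dst_path; return how many were kept."""
    with open(src_path) as src:
        kept = extract_hetatm(src)
    with open(dst_path, "w") as dst:
        dst.writelines(kept)
    return len(kept)
```

Calling `copy_hetatm("input.pdb", "waters.pdb")` would leave the ATOM, TER and END records behind. On the command line, `grep '^HETATM' input.pdb > waters.pdb` applies the same filter.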

Related

Separating the coordinates for each cluster in DBSCAN using python

The script below gives me the coordinates of each cluster in separate txt files, but I want to change the content of the files as shown below.
Usually the coordinates are printed as follows:
0.64 0.30 0.29
0.27 0.24 0.92
0.34 0.62 0.92
0.05 0.48 0.60
0.26 0.77 0.62
0.15 0.23 0.14
0.35 0.26 0.64
But I need them printed as below, with these integers, letters and words added to each line:
HETATM 1 O HOH 1 W 0.64 0.30 0.29 1.00 43.38
HETATM 2 O HOH 2 W 0.27 0.24 0.92 1.00 43.38
HETATM 3 O HOH 3 W 0.34 0.62 0.92 1.00 43.38
HETATM 4 O HOH 4 W 0.05 0.48 0.60 1.00 43.38
HETATM 5 O HOH 5 W 0.15 0.23 0.14 1.00 43.38
HETATM 6 O HOH 6 W 0.15 0.23 0.14 1.00 43.38
HETATM 7 O HOH 7 W 0.15 0.23 0.14 1.00 43.38
HETATM 8 O HOH 8 W 0.15 0.23 0.14 1.00 43.38
HETATM 9 O HOH 9 W 0.15 0.23 0.14 1.00 43.38
HETATM 10 O HOH 10 W 0.15 0.23 0.14 1.00 43.38
This is similar to the format of PDB files (.pdb) for proteins.
Does anybody know how to do this?
Below is my script:
from sklearn.cluster import DBSCAN
import numpy as np
data = np.random.rand(500,3)
db = DBSCAN(eps=0.12, min_samples=1).fit(data)
labels = db.labels_
from collections import Counter
Counter(labels)
from collections import defaultdict
clusters = defaultdict(list)
for i,c in enumerate(db.labels_):
    clusters[c].append(data[i])
for k,v in clusters.items():
    np.savetxt('cluster{}.txt'.format(k), v, delimiter=",", fmt="%1.2f %1.2f %1.2f")
You can modify the two for loops this way:
for i,c in enumerate(db.labels_):
    l = np.concatenate([['HETATM {}'.format(i), 'O HOH {} W'.format(i)],data[i],[1.00, 43.38]], axis=0)
    clusters[c].append(l)
for k,v in clusters.items():
    np.savetxt('cluster{}.txt'.format(k), v, delimiter=",", fmt='%s')
Each line then carries the index of the sample in your dataset, for example:
HETATM 2,O HOH 2 W,0.27035681984544035,0.25141288216432167,0.44097961252275675,1.0,43.38
HETATM 21,O HOH 21 W,0.2905981520836243,0.2680383230921106,0.47545544921372906,1.0,43.38
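The output above keeps full float precision, uses commas, and numbers rows by dataset index. A minimal sketch of an alternative that matches the layout requested in the question (space-separated, two decimals, numbering restarting at 1 inside each cluster) is given here. The helper names `hetatm_line` and `cluster_lines` are hypothetical, and the result is a whitespace-separated line like the one in the question, not a strict fixed-column PDB record.

```python
def hetatm_line(serial, xyz, occupancy=1.00, bfactor=43.38):
    """Build one whitespace-separated HETATM-style line for a water oxygen."""
    x, y, z = xyz
    return "HETATM {n} O HOH {n} W {x:.2f} {y:.2f} {z:.2f} {occ:.2f} {b:.2f}".format(
        n=serial, x=x, y=y, z=z, occ=occupancy, b=bfactor)


def cluster_lines(points):
    """Number the points of one cluster from 1 and format each as a line."""
    return [hetatm_line(i, p) for i, p in enumerate(points, start=1)]


if __name__ == "__main__":
    demo = [(0.64, 0.30, 0.29), (0.27, 0.24, 0.92)]
    print("\n".join(cluster_lines(demo)))
```

In the original script, the second loop could write `"\n".join(cluster_lines(v))` to each cluster file in place of the `np.savetxt` call; rows of a NumPy array unpack into `x, y, z` the same way the tuples do here.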

SED change last column text

I would like to ask how to change the letter A to C in the last column using sed.
Input for example:
HETATM 18 H UNK 0 12.447 20.851 23.373 0.00 0.00 0.167 HD
HETATM 19 C UNK 0 11.406 19.947 21.942 0.00 0.00 0.033 A
HETATM 20 C UNK 0 10.684 20.899 21.181 0.00 0.00 0.030 A
HETATM 21 C UNK 0 9.503 20.541 20.507 0.00 0.00 0.019 A
HETATM 22 C UNK 0 9.032 19.211 20.545 0.00 0.00 0.032 A
HETATM 23 C UNK 0 9.772 18.248 21.264 0.00 0.00 0.019 A
HETATM 24 C UNK 0 10.946 18.613 21.948 0.00 0.00 0.030 A
HETATM 25 C UNK 0 7.833 18.846 19.889 0.00 0.00 0.253 C
HETATM 26 O UNK 0 7.856 18.994 18.642 0.00 0.00 -0.267 OA
Output:
HETATM 18 H UNK 0 12.447 20.851 23.373 0.00 0.00 0.167 HD
HETATM 19 C UNK 0 11.406 19.947 21.942 0.00 0.00 0.033 C
HETATM 20 C UNK 0 10.684 20.899 21.181 0.00 0.00 0.030 C
HETATM 21 C UNK 0 9.503 20.541 20.507 0.00 0.00 0.019 C
HETATM 22 C UNK 0 9.032 19.211 20.545 0.00 0.00 0.032 C
HETATM 23 C UNK 0 9.772 18.248 21.264 0.00 0.00 0.019 C
HETATM 24 C UNK 0 10.946 18.613 21.948 0.00 0.00 0.030 C
HETATM 25 C UNK 0 7.833 18.846 19.889 0.00 0.00 0.253 C
HETATM 26 O UNK 0 7.856 18.994 18.642 0.00 0.00 -0.267 OA
I tried sed like this:
sed 's/[A*]$/C/'
But the output looks like this:
HETATM 26 O UNK 0 7.856 18.994 18.642 0.00 0.00 -0.267 OC
Simple sed approach:
sed 's/\<A[[:space:]]*$/C/' file
\< - matches the start of a word (a GNU extension), so only a standalone A is matched and the A inside OA is left alone
[[:space:]]* - matches any trailing whitespace before the end of the line ($)
The output:
HETATM 18 H UNK 0 12.447 20.851 23.373 0.00 0.00 0.167 HD
HETATM 19 C UNK 0 11.406 19.947 21.942 0.00 0.00 0.033 C
HETATM 20 C UNK 0 10.684 20.899 21.181 0.00 0.00 0.030 C
HETATM 21 C UNK 0 9.503 20.541 20.507 0.00 0.00 0.019 C
HETATM 22 C UNK 0 9.032 19.211 20.545 0.00 0.00 0.032 C
HETATM 23 C UNK 0 9.772 18.248 21.264 0.00 0.00 0.019 C
HETATM 24 C UNK 0 10.946 18.613 21.948 0.00 0.00 0.030 C
HETATM 25 C UNK 0 7.833 18.846 19.889 0.00 0.00 0.253 C
HETATM 26 O UNK 0 7.856 18.994 18.642 0.00 0.00 -0.267 OA
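For comparison, the same idea can be sketched with Python's `re` module, where `\b` plays the role of the word boundary. The function name `retype_last_column` is hypothetical. As in the sed command, any spaces or tabs after the final A are replaced together with it.

```python
import re

# A standalone "A" as the last field, optionally followed by spaces or tabs.
LAST_A = re.compile(r"\bA[ \t]*$")


def retype_last_column(line):
    """Change a trailing standalone A to C; other atom types (OA, HD, C) are untouched."""
    return LAST_A.sub("C", line)
```

The original attempt `[A*]$` is a bracket expression matching a single `A` or `*` character at the end of the line, which is why `OA` became `OC`.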

Pattern-based substitution in the txt via AWK

I have a long text file where, somewhere near the end, there is one line whose 3rd column is OXT.
ATOM 2439 O LEU 300 -4.699 34.599 65.335 1.00 83.23 O
ATOM 2440 N LEU 301 -6.822 33.898 65.057 1.00 19.70 N
ATOM 2441 CA LEU 301 -7.080 34.965 64.138 1.00 19.70 C
ATOM 2442 CB LEU 301 -8.165 34.630 63.101 1.00 19.70 C
ATOM 2443 CG LEU 301 -7.762 33.478 62.162 1.00 19.70 C
ATOM 2444 CD1 LEU 301 -8.849 33.207 61.110 1.00 19.70 C
ATOM 2445 CD2 LEU 301 -6.376 33.719 61.543 1.00 19.70 C
ATOM 2446 C LEU 301 -7.556 36.168 64.946 1.00 19.70 C
ATOM 2447 O LEU 301 -8.657 36.695 64.633 1.00 19.70 O
ATOM 2448 OXT LEU 301 -6.821 36.580 65.884 1.00 19.70 O
TER 2449 LEU 301
HETATM 2450 NA NA 302 -13.016 13.036 54.214 1.00 44.33 NA
HETATM 2451 O WAT 303 -18.411 13.587 59.094 1.00 27.41 O
HETATM 2452 O WAT 304 -11.894 17.279 58.575 1.00 18.35 O
HETATM 2453 O WAT 305 -15.811 12.728 54.157 1.00 39.81 O
I need to modify the line containing OXT in the following way: in the third column, substitute "OXT" with "N" padded to the same width; in the fourth column, substitute the residue name (LEU in this sample) with NHE; in the last column, substitute O with N. Importantly, after the substitutions the spacing between the columns must stay the same as in the rest of the file:
ATOM 2439 O LEU 300 -4.699 34.599 65.335 1.00 83.23 O
ATOM 2440 N LEU 301 -6.822 33.898 65.057 1.00 19.70 N
ATOM 2441 CA LEU 301 -7.080 34.965 64.138 1.00 19.70 C
ATOM 2442 CB LEU 301 -8.165 34.630 63.101 1.00 19.70 C
ATOM 2443 CG LEU 301 -7.762 33.478 62.162 1.00 19.70 C
ATOM 2444 CD1 LEU 301 -8.849 33.207 61.110 1.00 19.70 C
ATOM 2445 CD2 LEU 301 -6.376 33.719 61.543 1.00 19.70 C
ATOM 2446 C LEU 301 -7.556 36.168 64.946 1.00 19.70 C
ATOM 2447 O LEU 301 -8.657 36.695 64.633 1.00 19.70 O
ATOM 2448 N NHE 301 -6.821 36.580 65.884 1.00 19.70 N
TER
HETATM 2450 NA NA 302 -13.016 13.036 54.214 1.00 44.33 NA
HETATM 2451 O WAT 303 -18.411 13.587 59.094 1.00 27.41 O
HETATM 2452 O WAT 304 -11.894 17.279 58.575 1.00 18.35 O
HETATM 2453 O WAT 305 -15.811 12.728 54.157 1.00 39.81 O
I have tried to use:
awk '$3=="OXT"{ f=1; rn=NR; $3=$NF="N"; $4="NHE" }/TER/ && f && NR-rn == 1{ $0=$1 }1' file
It does the right substitutions, but the modified line now has a single space between columns, which is the wrong format:
ATOM 2410 N NHE 299 -17.563 -15.711 -15.915 1.00 76.42 N
However I need to keep the original format of the spacings between the columns as in the rest of the file:
ATOM 2448 N NHE 301 -6.821 36.580 65.884 1.00 19.70 N
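Quick and very dirty:
#!/bin/bash
# Replace OXT with a padded N and reduce the following line to a bare TER.
skip=0
while IFS= read -r line    # IFS= and -r keep the original spacing intact
do
    third=$(echo "$line" | awk '{print $3}')
    if [ "$skip" -eq 1 ]
    then
        echo "TER"
        skip=0
        continue
    fi
    if [ "$third" == "OXT" ]
    then
        # "N" plus two spaces has the same width as "OXT"
        echo "$line" | sed 's/OXT/N  /'
        skip=1
        continue
    fi
    echo "$line"
done < /tmp/list
Here /tmp/list is the file with all the values. Note that this only handles the OXT replacement in the third column and the TER line that follows it.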
You can pipe the result of your command through column -t, which re-aligns the fields into evenly padded columns:
$ awk '$3=="OXT"{ f=1; rn=NR; $3=$NF="N"; $4="NHE" }/TER/ && f && NR-rn == 1{ $0=$1 }1' f | column -t
ATOM 2439 O LEU 300 -4.699 34.599 65.335 1.00 83.23 O
ATOM 2440 N LEU 301 -6.822 33.898 65.057 1.00 19.70 N
ATOM 2441 CA LEU 301 -7.080 34.965 64.138 1.00 19.70 C
ATOM 2442 CB LEU 301 -8.165 34.630 63.101 1.00 19.70 C
ATOM 2443 CG LEU 301 -7.762 33.478 62.162 1.00 19.70 C
ATOM 2444 CD1 LEU 301 -8.849 33.207 61.110 1.00 19.70 C
ATOM 2445 CD2 LEU 301 -6.376 33.719 61.543 1.00 19.70 C
ATOM 2446 C LEU 301 -7.556 36.168 64.946 1.00 19.70 C
ATOM 2447 O LEU 301 -8.657 36.695 64.633 1.00 19.70 O
ATOM 2448 N NHE 301 -6.821 36.580 65.884 1.00 19.70 N
TER
HETATM 2450 NA NA 302 -13.016 13.036 54.214 1.00 44.33 NA
HETATM 2451 O WAT 303 -18.411 13.587 59.094 1.00 27.41 O
HETATM 2452 O WAT 304 -11.894 17.279 58.575 1.00 18.35 O
HETATM 2453 O WAT 305 -15.811 12.728 54.157 1.00 39.81 O
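Assigning to a field in awk rebuilds the record with single spaces, which is what collapses the layout. A minimal sketch of a different approach, in Python: split the line while keeping the whitespace runs, swap the chosen fields, and pad the gap after each one so later columns stay where they were. The names `replace_fields` and `fix_oxt` are hypothetical. The sketch assumes lines do not start with whitespace and that gaps are made of spaces, and it leaves the TER line alone.

```python
import re


def replace_fields(line, changes):
    """Replace 1-based fields listed in `changes`, keeping later columns in place."""
    parts = re.split(r"(\s+)", line.rstrip("\n"))  # fields at even indices, gaps at odd
    for field_no, new in changes.items():
        i = 2 * (field_no - 1)
        old = parts[i]
        parts[i] = new
        if i + 1 < len(parts):
            # Widen (or shrink) the following gap by the length difference,
            # but never below a single space.
            pad = len(parts[i + 1]) + len(old) - len(new)
            parts[i + 1] = " " * max(pad, 1)
    return "".join(parts)


def fix_oxt(line):
    """Apply the OXT -> N, residue -> NHE, last column -> N edit to matching lines."""
    fields = line.split()
    if len(fields) >= 4 and fields[2] == "OXT":
        return replace_fields(line, {3: "N", 4: "NHE", len(fields): "N"})
    return line.rstrip("\n")
```

Replaced values are left-aligned at the position of the old value, which suits names such as `OXT` or `LEU`. Right-aligned numeric fields would need different padding.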

VIM replacing text in 2 columns

Below is part of a column-sensitive file, lines 23 to 34. Please look at columns 25 and 26. Lines 23 to 28 are correct, as the numbers are supposed to be sequential.
HETATM 21 O HOH 7 -1.609 5.551 -4.296 1.00 0.00 WAT O
HETATM 22 H HOH 7 -1.594 5.971 -3.395 1.00 0.00 WAT H
HETATM 23 H HOH 7 -1.048 4.730 -4.281 1.00 0.00 WAT H
HETATM 24 O HOH 8 -4.693 5.472 -0.557 1.00 0.00 WAT O
HETATM 25 H HOH 8 -3.881 4.900 -0.521 1.00 0.00 WAT H
HETATM 26 H HOH 8 -4.819 5.805 -1.485 1.00 0.00 WAT H
HETATM 27 O HOH 1 0.289 -5.035 5.663 1.00 0.00 WAT O
HETATM 28 H HOH 10 0.241 -4.604 -5.564 1.00 0.00 WAT H
HETATM 29 H HOH 1 -0.399 -5.750 5.605 1.00 0.00 WAT H
HETATM 30 O HOH 11 -1.741 -5.167 0.877 1.00 0.00 WAT O
HETATM 31 H HOH 0 -2.612 -4.754 0.636 1.00 0.00 WAT H
HETATM 32 H HOH 0 -1.819 -5.599 1.769 1.00 0.00 WAT H
However, columns 25 and 26 in lines 29 to 34 (and in lines beyond 34 that are not included here) need to be edited. They hold the ID number of each water molecule in the file. Columns 25 and 26 in lines 29-31 should be ' 9' instead of ' 1' or '10', and in lines 32-34 they should be '10' instead of '11' or ' 0'. All lines after 34 have a similar problem, and I also want to change columns 25 and 26 there to '12', '13', etc. for each group of 3 lines. The expected final result looks like this:
HETATM 21 O HOH 7 -1.609 5.551 -4.296 1.00 0.00 WAT O
HETATM 22 H HOH 7 -1.594 5.971 -3.395 1.00 0.00 WAT H
HETATM 23 H HOH 7 -1.048 4.730 -4.281 1.00 0.00 WAT H
HETATM 24 O HOH 8 -4.693 5.472 -0.557 1.00 0.00 WAT O
HETATM 25 H HOH 8 -3.881 4.900 -0.521 1.00 0.00 WAT H
HETATM 26 H HOH 8 -4.819 5.805 -1.485 1.00 0.00 WAT H
HETATM 27 O HOH 9 0.289 -5.035 5.663 1.00 0.00 WAT O
HETATM 28 H HOH 9 0.241 -4.604 -5.564 1.00 0.00 WAT H
HETATM 29 H HOH 9 -0.399 -5.750 5.605 1.00 0.00 WAT H
HETATM 30 O HOH 10 -1.741 -5.167 0.877 1.00 0.00 WAT O
HETATM 31 H HOH 10 -2.612 -4.754 0.636 1.00 0.00 WAT H
HETATM 32 H HOH 10 -1.819 -5.599 1.769 1.00 0.00 WAT H
So far I haven't come up with a good pattern to replace those odd numbers with 9, 10, etc. It would be great if I could fix all these groups of 3 lines with a single Vim command instead of going group by group, as there are 50-60 groups with this problem. What I did earlier was simply :26,28s/HOH 1/HOH 8, which is clearly not the most efficient way.
Sorry for not being clear in the first version of the question. Your help would be appreciated. Thank you.
Your question is not clear, but from what I understand, trying to select a rectangular block in visual mode might help you. Use ctrl-v in OS X or Linux or ctrl-q in Windows (in normal mode).
Thank you, everyone, for your time, and sorry for causing confusion. I found a way to do it with Python's string formatting: the pattern is really fuzzy and I'm not used to regex patterns, so I couldn't figure out a simple way to do it in Vim.
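The follow-up mentions Python string formatting without showing the code. A minimal sketch of what such a fix might look like (not the asker's actual script): overwrite columns 25-26 of each line with a running ID that increases every three lines. The function name `renumber_waters` and its parameters are hypothetical, and the sketch assumes every water has exactly three records and that the IDs fit in the two-character field.

```python
def renumber_waters(lines, first_id, group_size=3, start_col=25, end_col=26):
    """Overwrite columns start_col..end_col (1-based, inclusive) with a running ID.

    The ID starts at first_id and increases by one every group_size lines.
    Returned lines have no trailing newline.
    """
    width = end_col - start_col + 1
    out = []
    for index, line in enumerate(lines):
        water_id = first_id + index // group_size
        field = "{:>{w}}".format(water_id, w=width)  # right-aligned, e.g. ' 9' or '10'
        padded = line.rstrip("\n").ljust(end_col)    # make sure the columns exist
        out.append(padded[:start_col - 1] + field + padded[end_col:])
    return out
```

Applied to the six broken lines in the question with `first_id=9`, the first three would get ` 9` and the next three `10`, with the rest of each line left as it was.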

Linux text: add line to previous line of a pattern

I would like to add a specific line, "TER", to several text files with varying content:
Input:
[...]
ATOM 4149 C LEU C 9 136.820 120.050 53.540 1.00 0.00
ATOM 4150 O LEU C 9 136.600 118.860 53.240 1.00 0.00
ATOM 4151 O LEU C 9 137.310 120.340 54.650 1.00 0.00
ATOM 4154 N LYS D 2 115.050 134.940 61.060 1.00 0.00
ATOM 4155 H1 LYS D 2 115.660 134.160 61.180 1.00 0.00
ATOM 4156 H2 LYS D 2 114.760 135.000 60.100 1.00 0.00
[...]
Output:
[...]
ATOM 4149 C LEU C 9 136.820 120.050 53.540 1.00 0.00
ATOM 4150 O LEU C 9 136.600 118.860 53.240 1.00 0.00
ATOM 4151 O LEU C 9 137.310 120.340 54.650 1.00 0.00
TER
ATOM 4154 N LYS D 2 115.050 134.940 61.060 1.00 0.00
ATOM 4155 H1 LYS D 2 115.660 134.160 61.180 1.00 0.00
ATOM 4156 H2 LYS D 2 114.760 135.000 60.100 1.00 0.00
[...]
So the pattern is: the first time a " D " line follows a " C " line, add a "TER" line between them (after the " C " line and before the " D " line). All other numbers and characters can vary.
I found some examples using the sed command, but I do not know how to add a line before the matched one.
With awk:
$ awk 'last_c5=="C" && $5=="D" {print "TER"}; last_c5=$5' file
ATOM 4149 C LEU C 9 136.820 120.050 53.540 1.00 0.00
ATOM 4150 O LEU C 9 136.600 118.860 53.240 1.00 0.00
ATOM 4151 O LEU C 9 137.310 120.340 54.650 1.00 0.00
TER
ATOM 4154 N LYS D 2 115.050 134.940 61.060 1.00 0.00
ATOM 4155 H1 LYS D 2 115.660 134.160 61.180 1.00 0.00
ATOM 4156 H2 LYS D 2 114.760 135.000 60.100 1.00 0.00
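It tracks the previous line's 5th-column value in the variable last_c5. If the previous value was C and the current one is D, it prints TER. The second pattern, last_c5=$5, is an assignment whose value is true for any non-empty, non-zero 5th field, so it both updates the variable and prints every such line.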
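The same bookkeeping can be sketched in Python as a generator that remembers the previous chain ID. The function name `insert_ter` and its parameters are hypothetical, and the sketch assumes whitespace-separated fields with the chain ID in the fifth one, as in the sample. Unlike the awk one-liner, it passes every input line through, including lines with fewer than five fields.

```python
def insert_ter(lines, chain_field=5, before="C", after="D"):
    """Yield the input lines, adding a TER record where the chain ID switches."""
    previous = None
    for line in lines:
        fields = line.split()
        # Lines with too few fields have no chain ID.
        current = fields[chain_field - 1] if len(fields) >= chain_field else None
        if previous == before and current == after:
            yield "TER\n"
        yield line
        previous = current
```

Writing the result could be as simple as `out.writelines(insert_ter(src))` with two open file objects.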
