I would like to ask how to change the letter A to C in the last column using sed.
Input for example:
HETATM 18 H UNK 0 12.447 20.851 23.373 0.00 0.00 0.167 HD
HETATM 19 C UNK 0 11.406 19.947 21.942 0.00 0.00 0.033 A
HETATM 20 C UNK 0 10.684 20.899 21.181 0.00 0.00 0.030 A
HETATM 21 C UNK 0 9.503 20.541 20.507 0.00 0.00 0.019 A
HETATM 22 C UNK 0 9.032 19.211 20.545 0.00 0.00 0.032 A
HETATM 23 C UNK 0 9.772 18.248 21.264 0.00 0.00 0.019 A
HETATM 24 C UNK 0 10.946 18.613 21.948 0.00 0.00 0.030 A
HETATM 25 C UNK 0 7.833 18.846 19.889 0.00 0.00 0.253 C
HETATM 26 O UNK 0 7.856 18.994 18.642 0.00 0.00 -0.267 OA
Output:
HETATM 18 H UNK 0 12.447 20.851 23.373 0.00 0.00 0.167 HD
HETATM 19 C UNK 0 11.406 19.947 21.942 0.00 0.00 0.033 C
HETATM 20 C UNK 0 10.684 20.899 21.181 0.00 0.00 0.030 C
HETATM 21 C UNK 0 9.503 20.541 20.507 0.00 0.00 0.019 C
HETATM 22 C UNK 0 9.032 19.211 20.545 0.00 0.00 0.032 C
HETATM 23 C UNK 0 9.772 18.248 21.264 0.00 0.00 0.019 C
HETATM 24 C UNK 0 10.946 18.613 21.948 0.00 0.00 0.030 C
HETATM 25 C UNK 0 7.833 18.846 19.889 0.00 0.00 0.253 C
HETATM 26 O UNK 0 7.856 18.994 18.642 0.00 0.00 -0.267 OA
I tried sed like this:
sed 's/[A*]$/C/'
But the output looks like this:
HETATM 26 O UNK 0 7.856 18.994 18.642 0.00 0.00 -0.267 OC
Simple sed approach:
sed 's/\<A[[:space:]]*$/C/' file
\< - start-of-word boundary, so A is matched only as a standalone token (not the A at the end of OA)
[[:space:]]* - matches any trailing whitespace before the end of the line ($)
The output:
HETATM 18 H UNK 0 12.447 20.851 23.373 0.00 0.00 0.167 HD
HETATM 19 C UNK 0 11.406 19.947 21.942 0.00 0.00 0.033 C
HETATM 20 C UNK 0 10.684 20.899 21.181 0.00 0.00 0.030 C
HETATM 21 C UNK 0 9.503 20.541 20.507 0.00 0.00 0.019 C
HETATM 22 C UNK 0 9.032 19.211 20.545 0.00 0.00 0.032 C
HETATM 23 C UNK 0 9.772 18.248 21.264 0.00 0.00 0.019 C
HETATM 24 C UNK 0 10.946 18.613 21.948 0.00 0.00 0.030 C
HETATM 25 C UNK 0 7.833 18.846 19.889 0.00 0.00 0.253 C
HETATM 26 O UNK 0 7.856 18.994 18.642 0.00 0.00 -0.267 OA
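For comparison, the original attempt `[A*]$` is a bracket expression that matches a single character, either `A` or a literal `*`, at the end of the line, which is why `OA` became `OC`. The same fix can be sketched in Python. This is only an illustration of the word-boundary idea, not a replacement for the sed command; the function name `replace_last_column` is an assumption made here.

```python
import re

# A standalone "A" at the end of the line, optionally followed by whitespace.
# \b plays the role of sed's \< here: there is no boundary between "O" and "A".
LAST_COLUMN_A = re.compile(r"\bA\s*$")


def replace_last_column(text):
    """Replace a standalone trailing A with C on every line of text."""
    # splitlines() drops the newlines, so \s* never swallows a line break.
    return "\n".join(LAST_COLUMN_A.sub("C", line) for line in text.splitlines())


sample = (
    "HETATM 24 C UNK 0 10.946 18.613 21.948 0.00 0.00 0.030 A\n"
    "HETATM 26 O UNK 0 7.856 18.994 18.642 0.00 0.00 -0.267 OA"
)
print(replace_last_column(sample))
```

Like the sed version, this also drops any whitespace that follows the trailing A.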
The script below gives me the coordinates of each cluster in separate txt files, but I want to edit the content of the files as shown below.
Usually the coordinates get printed as follows:
0.64 0.30 0.29
0.27 0.24 0.92
0.34 0.62 0.92
0.05 0.48 0.60
0.26 0.77 0.62
0.15 0.23 0.14
0.35 0.26 0.64
But I need them printed as below, with all these integers, letters and words on each line.
HETATM 1 O HOH 1 W 0.64 0.30 0.29 1.00 43.38
HETATM 2 O HOH 2 W 0.27 0.24 0.92 1.00 43.38
HETATM 3 O HOH 3 W 0.34 0.62 0.92 1.00 43.38
HETATM 4 O HOH 4 W 0.05 0.48 0.60 1.00 43.38
HETATM 5 O HOH 5 W 0.15 0.23 0.14 1.00 43.38
HETATM 6 O HOH 6 W 0.15 0.23 0.14 1.00 43.38
HETATM 7 O HOH 7 W 0.15 0.23 0.14 1.00 43.38
HETATM 8 O HOH 8 W 0.15 0.23 0.14 1.00 43.38
HETATM 9 O HOH 9 W 0.15 0.23 0.14 1.00 43.38
HETATM 10 O HOH 10 W 0.15 0.23 0.14 1.00 43.38
This is like the format of PDB files (.pdb) for proteins.
Does anybody know how to do this?
Below is my script:
from sklearn.cluster import DBSCAN
import numpy as np
data = np.random.rand(500,3)
db = DBSCAN(eps=0.12, min_samples=1).fit(data)
labels = db.labels_
from collections import Counter
Counter(labels)
from collections import defaultdict
clusters = defaultdict(list)
for i,c in enumerate(db.labels_):
    clusters[c].append(data[i])
for k,v in clusters.items():
    np.savetxt('cluster{}.txt'.format(k), v, delimiter=",", fmt="%1.2f %1.2f %1.2f")
You can modify the two for loops this way:
for i,c in enumerate(db.labels_):
    l = np.concatenate([['HETATM {}'.format(i), 'O HOH {} W'.format(i)],data[i],[1.00, 43.38]], axis=0)
    clusters[c].append(l)
for k,v in clusters.items():
    np.savetxt('cluster{}.txt'.format(k), v, delimiter=",", fmt='%s')
Each line then carries the index of the sample in your dataset, for example:
HETATM 2,O HOH 2 W,0.27035681984544035,0.25141288216432167,0.44097961252275675,1.0,43.38
HETATM 21,O HOH 21 W,0.2905981520836243,0.2680383230921106,0.47545544921372906,1.0,43.38
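The output above is comma-separated and keeps full float precision, while the question asks for space-separated fields with two decimals. A minimal sketch using plain string formatting is shown here. It assumes the records in each file are numbered from 1 and that occupancy and B-factor are the constants 1.00 and 43.38 from the question; the function name `hetatm_lines` is hypothetical. Note that real PDB files use fixed column positions, whereas this mirrors the space-separated layout shown in the question.

```python
def hetatm_lines(coords, occupancy=1.00, b_factor=43.38):
    """Format (x, y, z) triples as HETATM-style water records.

    Numbering starts at 1 for each call (an assumption based on the
    desired output in the question).
    """
    lines = []
    for serial, (x, y, z) in enumerate(coords, start=1):
        lines.append(
            f"HETATM {serial} O HOH {serial} W "
            f"{x:.2f} {y:.2f} {z:.2f} {occupancy:.2f} {b_factor:.2f}"
        )
    return lines


# With the clusters dict from the question, each file could be written as:
#     for k, v in clusters.items():
#         with open('cluster{}.txt'.format(k), 'w') as fh:
#             fh.write("\n".join(hetatm_lines(v)) + "\n")
print("\n".join(hetatm_lines([(0.64, 0.30, 0.29), (0.27, 0.24, 0.92)])))
```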
As the title suggests, I am using a simple RPC system between a PC (Windows x64) and an embedded Linux PC running Ubuntu. The embedded Linux PC is the RPC server and the Windows PC is the RPC client. The RPC framework is erpc.
I have noticed that the transaction rate I am getting is particularly low - on the order of 20 transactions/sec.
The issue is definitely not hardware related as I have an alternate RPC system (which I'm trying to replace with the contentious one) which can easily get over 1000 transactions/sec using the exact same hardware configuration.
To further prove this, I also wrote a simple Python script which acts as a simple socket client or server depending on a switch. I run it on the embedded machine as a server and on the PC as a client. The script simply has the client send some random data to the server, which in turn sends the data back. The client does this a few hundred times and determines the transaction rate based on this. The amount of data transmitted is of the same order as what erpc uses. Using this setup I can get 3000+ transactions/sec.
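A minimal sketch of such an echo benchmark, not the original script. The helper names (`recv_exact`, `serve_once`, `measure_rate`), the 16-byte payload and the 400 rounds are assumptions chosen to resemble the test described above.

```python
import os
import socket
import threading
import time


def recv_exact(sock, n):
    """Read exactly n bytes, since recv may return fewer than requested."""
    buf = bytearray()
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("peer closed the connection")
        buf.extend(chunk)
    return bytes(buf)


def serve_once(listener, payload_size):
    """Accept one client and echo fixed-size messages until it disconnects."""
    conn, _ = listener.accept()
    with conn:
        while True:
            try:
                data = recv_exact(conn, payload_size)
            except ConnectionError:
                break
            conn.sendall(data)


def measure_rate(host, port, payload_size=16, rounds=400):
    """Send random payloads, wait for each echo, return transactions/sec."""
    with socket.create_connection((host, port)) as sock:
        start = time.perf_counter()
        for _ in range(rounds):
            payload = os.urandom(payload_size)
            sock.sendall(payload)
            if recv_exact(sock, payload_size) != payload:
                raise RuntimeError("echoed data does not match")
        elapsed = time.perf_counter() - start
    return rounds / elapsed


if __name__ == "__main__":
    # Loopback demo; on real hardware the server half runs on the other machine.
    listener = socket.create_server(("127.0.0.1", 0))
    port = listener.getsockname()[1]
    threading.Thread(target=serve_once, args=(listener, 16), daemon=True).start()
    print("transactions/sec:", measure_rate("127.0.0.1", port))
```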
The RPC system in question is half duplex. Only a single thread is used. Server recvs, processes the request and sends the response in a loop.
Only a single socket is used for the duration of the test. I.e. no close and accepts occur during the loop. No other IO occurs. Or at least, I have refactored it for the purposes of these tests to not do any other IO.
On the Windows client side, I have a python unit test which I have run with profiling on. The results don't seem to indicate that the problem is on the client.
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.000 0.000 23.998 23.998 runner.py:105(pytest_runtest_call)
1 0.000 0.000 23.998 23.998 python.py:1313(runtest)
1 0.000 0.000 23.998 23.998 __init__.py:603(__call__)
1 0.000 0.000 23.998 23.998 __init__.py:219(_hookexec)
1 0.000 0.000 23.998 23.998 __init__.py:213(<lambda>)
1 0.000 0.000 23.998 23.998 callers.py:151(_multicall)
1 0.000 0.000 23.998 23.998 python.py:183(pytest_pyfunc_call)
1 0.003 0.003 23.998 23.998 test_static_if.py:4(test_read_version)
400 0.014 0.000 23.993 0.060 client.py:16(get_version)
400 0.017 0.000 23.942 0.060 client.py:79(perform_request)
400 0.006 0.000 23.828 0.060 transport.py:75(receive)
800 0.016 0.000 23.820 0.030 transport.py:139(_base_receive)
800 23.803 0.030 23.803 0.030 {method 'recv' of '_socket.socket' objects}
400 0.007 0.000 0.061 0.000 transport.py:65(send)
400 0.002 0.000 0.053 0.000 transport.py:135(_base_send)
400 0.050 0.000 0.050 0.000 {method 'sendall' of '_socket.socket' objects}
400 0.012 0.000 0.032 0.000 basic_codec.py:113(start_read_message)
400 0.006 0.000 0.015 0.000 basic_codec.py:39(start_write_message)
1600 0.007 0.000 0.015 0.000 basic_codec.py:130(_read)
800 0.002 0.000 0.012 0.000 basic_codec.py:156(read_uint32)
The server is a C++ application. I have tried profiling it with gprof but the results of that show practically no time consumed by the application at all. After reading up a bit more about how gprof works and how gprof doesn't accumulate time spent in system calls, this indicates that the program is (obviously) IO bound and that the vast majority of time is spent in blocking system calls.
I won't add the entire output here for brevity, but below is an excerpt:
Flat profile:
Each sample counts as 0.01 seconds.
no time accumulated
% cumulative self self total
time seconds seconds calls Ts/call Ts/call name
0.00 0.00 0.00 2407 0.00 0.00 erpc::MessageBuffer::get()
0.00 0.00 0.00 2400 0.00 0.00 erpc::MessageBuffer::setUsed(unsigned short)
0.00 0.00 0.00 2000 0.00 0.00 erpc::MessageBuffer::getUsed() const
0.00 0.00 0.00 1600 0.00 0.00 erpc::MessageBuffer::Cursor::write(void const*, unsigned int)
0.00 0.00 0.00 1201 0.00 0.00 erpc::Codec::getBuffer()
0.00 0.00 0.00 803 0.00 0.00 erpc::MessageBuffer::Cursor::set(erpc::MessageBuffer*)
0.00 0.00 0.00 803 0.00 0.00 erpc::MessageBuffer::getLength() const
0.00 0.00 0.00 802 0.00 0.00 erpc::Codec::reset()
0.00 0.00 0.00 801 0.00 0.00 erpc::TCPTransport::underlyingReceive(unsigned char*, unsigned int)
0.00 0.00 0.00 800 0.00 0.00 erpc::TCPTransport::underlyingSend(unsigned char const*, unsigned int)
0.00 0.00 0.00 800 0.00 0.00 erpc::BasicCodec::read(unsigned int*)
0.00 0.00 0.00 800 0.00 0.00 erpc::BasicCodec::write(int)
0.00 0.00 0.00 800 0.00 0.00 erpc::BasicCodec::write(unsigned int)
0.00 0.00 0.00 800 0.00 0.00 erpc::MessageBuffer::Cursor::read(void*, unsigned int)
0.00 0.00 0.00 800 0.00 0.00 erpc::Service::getServiceId() const
0.00 0.00 0.00 403 0.00 0.00 erpc::Service::getNext()
0.00 0.00 0.00 401 0.00 0.00 erpc::SimpleServer::runInternal(erpc::Codec*)
0.00 0.00 0.00 401 0.00 0.00 erpc::TCPTransport::accept()
0.00 0.00 0.00 401 0.00 0.00 erpc::TCPTransport::receive(erpc::MessageBuffer*)
0.00 0.00 0.00 401 0.00 0.00 erpc::FramedTransport::receive(erpc::MessageBuffer*)
0.00 0.00 0.00 400 0.00 0.00 write_p_version_t_struct(erpc::Codec*, p_version_t const*)
0.00 0.00 0.00 400 0.00 0.00 StaticIF_service::handleInvocation(unsigned int, unsigned int, erpc::Codec*, erpc::MessageBufferFactory*)
0.00 0.00 0.00 400 0.00 0.00 StaticIF_service::get_version_shim(erpc::Codec*, erpc::MessageBufferFactory*, unsigned int)
0.00 0.00 0.00 400 0.00 0.00 erpc::BasicCodec::endReadMessage()
0.00 0.00 0.00 400 0.00 0.00 erpc::BasicCodec::endWriteStruct()
0.00 0.00 0.00 400 0.00 0.00 erpc::BasicCodec::endWriteMessage()
0.00 0.00 0.00 400 0.00 0.00 erpc::BasicCodec::startReadMessage(erpc::_message_type*, unsigned int*, unsigned int*, unsigned int*)
0.00 0.00 0.00 400 0.00 0.00 erpc::BasicCodec::startWriteStruct()
0.00 0.00 0.00 400 0.00 0.00 erpc::BasicCodec::startWriteMessage(erpc::_message_type, unsigned int, unsigned int, unsigned int)
0.00 0.00 0.00 400 0.00 0.00 erpc::FramedTransport::send(erpc::MessageBuffer*)
0.00 0.00 0.00 400 0.00 0.00 erpc::MessageBufferFactory::prepareServerBufferForSend(erpc::MessageBuffer*)
0.00 0.00 0.00 400 0.00 0.00 erpc::Server::processMessage(erpc::Codec*, erpc::_message_type&)
0.00 0.00 0.00 400 0.00 0.00 erpc::Server::findServiceWithId(unsigned int)
0.00 0.00 0.00 400 0.00 0.00 get_version
0.00 0.00 0.00 5 0.00 0.00 erpc::ManuallyConstructed<erpc::SimpleServer>::get()
0.00 0.00 0.00 4 0.00 0.00 operator new(unsigned int, void*)
0.00 0.00 0.00 3 0.00 0.00 erpc::ManuallyConstructed<erpc::SimpleServer>::operator->()
0.00 0.00 0.00 2 0.00 0.00 erpc::ManuallyConstructed<erpc::TCPTransport>::get()
0.00 0.00 0.00 2 0.00 0.00 erpc::ManuallyConstructed<erpc::BasicCodecFactory>::get()
0.00 0.00 0.00 2 0.00 0.00 erpc::Server::addService(erpc::Service*)
0.00 0.00 0.00 2 0.00 0.00 erpc::Service::Service(unsigned int)
0.00 0.00 0.00 2 0.00 0.00 erpc::Service::~Service()
0.00 0.00 0.00 2 0.00 0.00 erpc_add_service_to_server
0.00 0.00 0.00 1 0.00 0.00 _GLOBAL__sub_I__Z5usagev
Using strace, the problem becomes apparent in the first recv of every request. For context, an initial header is transmitted first which indicates the amount of data the request proper contains.
Here's a couple of excerpts from the output (the full output is 2000 lines).
I used the -r, -T and -C switches, which respectively show a relative timestamp for each call, print the time spent in each call, and show the summary.
In the transaction loop:
0.000161 recv(4, "\10\0", 2, 0) = 2 <0.059478>
0.059589 recv(4, "q\1\1\2\0\1\0\0", 8, 0) = 8 <0.000047>
0.000167 send(4, "\20\0", 2, 0) = 2 <0.000073>
0.000183 send(4, "q\1\1\2\2\1\0\0\235\256\322\2664\22\0\0", 16, 0) = 16 <0.000050>
0.000160 recv(4, "\10\0", 2, 0) = 2 <0.059513>
0.059625 recv(4, "r\1\1\2\0\1\0\0", 8, 0) = 8 <0.000046>
0.000167 send(4, "\20\0", 2, 0) = 2 <0.000071>
0.000182 send(4, "r\1\1\2\2\1\0\0\235\256\322\2664\22\0\0", 16, 0) = 16 <0.000049>
0.000161 recv(4, "\10\0", 2, 0) = 2 <0.059059>
0.059172 recv(4, "s\1\1\2\0\1\0\0", 8, 0) = 8 <0.000047>
0.000183 send(4, "\20\0", 2, 0) = 2 <0.000073>
0.000183 send(4, "s\1\1\2\2\1\0\0\235\256\322\2664\22\0\0", 16, 0) = 16 <0.000049>
0.000161 recv(4, "\10\0", 2, 0) = 2 <0.059330>
0.059441 recv(4, "t\1\1\2\0\1\0\0", 8, 0) = 8 <0.000046>
0.000166 send(4, "\20\0", 2, 0) = 2 <0.000072>
0.000182 send(4, "t\1\1\2\2\1\0\0\235\256\322\2664\22\0\0", 16, 0) = 16 <0.000050>
0.000163 recv(4, "\10\0", 2, 0) = 2 <0.059506>
0.059618 recv(4, "u\1\1\2\0\1\0\0", 8, 0) = 8 <0.000046>
0.000166 send(4, "\20\0", 2, 0) = 2 <0.000070>
0.000181 send(4, "u\1\1\2\2\1\0\0\235\256\322\2664\22\0\0", 16, 0) = 16 <0.000049>
0.000160 recv(4, "\10\0", 2, 0) = 2 <0.059359>
0.059488 recv(4, "v\1\1\2\0\1\0\0", 8, 0) = 8 <0.000048>
0.000175 send(4, "\20\0", 2, 0) = 2 <0.000077>
0.000189 send(4, "v\1\1\2\2\1\0\0\235\256\322\2664\22\0\0", 16, 0) = 16 <0.000051>
0.000165 recv(4, "\10\0", 2, 0) = 2 <0.059496>
0.059612 recv(4, "w\1\1\2\0\1\0\0", 8, 0) = 8 <0.000046>
0.000170 send(4, "\20\0", 2, 0) = 2 <0.000074>
0.000182 send(4, "w\1\1\2\2\1\0\0\235\256\322\2664\22\0\0", 16, 0) = 16 <0.000050>
The summary:
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
98.59 0.010000 12 801 recv
1.41 0.000143 0 800 send
0.00 0.000000 0 12 read
0.00 0.000000 0 3 write
0.00 0.000000 0 25 19 open
0.00 0.000000 0 7 close
0.00 0.000000 0 1 execve
0.00 0.000000 0 8 lseek
0.00 0.000000 0 6 6 access
0.00 0.000000 0 3 brk
0.00 0.000000 0 1 readlink
0.00 0.000000 0 1 munmap
0.00 0.000000 0 2 setitimer
0.00 0.000000 0 1 uname
0.00 0.000000 0 9 mprotect
0.00 0.000000 0 5 writev
0.00 0.000000 0 2 rt_sigaction
0.00 0.000000 0 16 mmap2
0.00 0.000000 0 16 15 stat64
0.00 0.000000 0 6 fstat64
0.00 0.000000 0 1 socket
0.00 0.000000 0 1 bind
0.00 0.000000 0 1 listen
0.00 0.000000 0 1 accept
0.00 0.000000 0 1 setsockopt
0.00 0.000000 0 1 set_tls
------ ----------- ----------- --------- --------- ----------------
100.00 0.010143 1731 40 total
In passing, I am not sure I completely understand the summary. The summary suggests that recv happens very quickly compared to the time indicated in each call to recv.
It looks like the time spent in the first recv is what is killing the RPC system, at nearly 60ms per call. Am I misreading this? I am not sure of the units, so I am guessing seconds.
So, after profiling both the client and the server, it appears the vast majority of the time is spent in recv.
If we assume that the extra time spent in the initial recv on the server side was because the client was still processing something and hadn't sent it yet, that should have shown up when profiling the client.
Any suggestions you may have as to how to further debug this would be greatly appreciated.
Thanks!
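As a sanity check on the units: strace -T reports the time spent in each call in seconds, so the first recv of each request blocks for about 0.0595 s. The sketch below sums the per-call times of one transaction from the excerpt above and converts that to a rate. It ignores user-space time between calls, and the function name `transaction_rate` is an assumption.

```python
import re

# One full transaction copied from the strace excerpt in the question.
STRACE_EXCERPT = r"""
0.000161 recv(4, "\10\0", 2, 0) = 2 <0.059478>
0.059589 recv(4, "q\1\1\2\0\1\0\0", 8, 0) = 8 <0.000047>
0.000167 send(4, "\20\0", 2, 0) = 2 <0.000073>
0.000183 send(4, "q\1\1\2\2\1\0\0\235\256\322\2664\22\0\0", 16, 0) = 16 <0.000050>
"""

# The -T switch appends the time spent in the call as <seconds>.
DURATION = re.compile(r"<(\d+\.\d+)>\s*$", re.MULTILINE)


def transaction_rate(trace_text, calls_per_transaction=4):
    """Estimate transactions/sec from the -T durations in a strace log."""
    durations = [float(d) for d in DURATION.findall(trace_text)]
    transactions = len(durations) / calls_per_transaction
    return transactions / sum(durations)


print(round(transaction_rate(STRACE_EXCERPT), 1))
```

The result is roughly 17 transactions/sec, which is in line with the client-side profile (400 calls in 23.998 s, about 16.7 per second). Both ends therefore agree that each request costs about 60 ms of waiting in recv.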
Hello, I am working with sklearn to train a classifier, and I have the following distribution of labels:
label : 0 frecuency : 119
label : 1 frecuency : 1615
label : 2 frecuency : 197
label : 3 frecuency : 70
label : 4 frecuency : 203
label : 5 frecuency : 137
label : 6 frecuency : 18
label : 7 frecuency : 142
label : 8 frecuency : 15
label : 9 frecuency : 182
label : 10 frecuency : 986
label : 12 frecuency : 73
label : 13 frecuency : 27
label : 14 frecuency : 81
label : 15 frecuency : 168
label : 18 frecuency : 107
label : 21 frecuency : 125
label : 22 frecuency : 172
label : 23 frecuency : 3870
label : 25 frecuency : 2321
label : 26 frecuency : 25
label : 27 frecuency : 314
label : 28 frecuency : 76
label : 29 frecuency : 116
One thing that clearly stands out is that I am working with an unbalanced data set: I have many samples for classes 25, 23, 1 and 10. I am getting bad results after training, as follows:
precision recall f1-score support
0 0.00 0.00 0.00 31
1 0.61 0.23 0.34 528
2 0.00 0.00 0.00 70
3 0.67 0.06 0.11 32
4 0.00 0.00 0.00 62
5 0.78 0.82 0.80 39
6 0.00 0.00 0.00 3
7 0.00 0.00 0.00 46
8 0.00 0.00 0.00 5
9 0.00 0.00 0.00 62
10 0.14 0.01 0.02 313
12 0.00 0.00 0.00 30
13 0.31 0.57 0.40 7
14 0.00 0.00 0.00 35
15 0.00 0.00 0.00 56
18 0.00 0.00 0.00 35
21 0.00 0.00 0.00 39
22 0.00 0.00 0.00 66
23 0.41 0.74 0.53 1278
25 0.28 0.39 0.33 758
26 0.50 0.25 0.33 8
27 0.29 0.02 0.03 115
28 1.00 0.61 0.76 23
29 0.00 0.00 0.00 42
avg / total 0.33 0.39 0.32 3683
I am getting many zeros, and the SVC is not able to learn several of the classes. The hyperparameters that I am using are the following:
from sklearn import svm
clf2= svm.SVC(kernel='linear')
In order to overcome this issue I built a dictionary with a weight for each class, as follows:
weight={}
for i,v in enumerate(uniqLabels):
    weight[v]=labels_cluster.count(uniqLabels[i])/len(labels_cluster)
for i,v in weight.items():
    print(i,v)
print(weight)
This is the output. For each label I am just taking the number of elements with that label divided by the total number of elements in the label set, so the weights sum to 1:
0 0.010664037996236221
1 0.14472622994892015
2 0.01765391164082803
3 0.006272963527197778
4 0.018191594228873554
5 0.012277085760372793
6 0.0016130477641365713
7 0.012725154583744062
8 0.0013442064701138096
9 0.01630970517071422
10 0.0883591719688144
12 0.0065418048212205395
13 0.002419571646204857
14 0.007258714938614571
15 0.015055112465274667
18 0.009588672820145173
21 0.011201720584281746
22 0.015413567523971682
23 0.34680526928936284
25 0.20799354780894344
26 0.0022403441168563493
27 0.028138722107715744
28 0.006810646115243301
29 0.01039519670221346
Trying again with this dictionary of weights, as follows:
from sklearn import svm
clf2= svm.SVC(kernel='linear',class_weight=weight)
I got:
precision recall f1-score support
0 0.00 0.00 0.00 31
1 0.90 0.19 0.31 528
2 0.00 0.00 0.00 70
3 0.00 0.00 0.00 32
4 0.00 0.00 0.00 62
5 0.00 0.00 0.00 39
6 0.00 0.00 0.00 3
7 0.00 0.00 0.00 46
8 0.00 0.00 0.00 5
9 0.00 0.00 0.00 62
10 0.00 0.00 0.00 313
12 0.00 0.00 0.00 30
13 0.00 0.00 0.00 7
14 0.00 0.00 0.00 35
15 0.00 0.00 0.00 56
18 0.00 0.00 0.00 35
21 0.00 0.00 0.00 39
22 0.00 0.00 0.00 66
23 0.36 0.99 0.52 1278
25 0.46 0.01 0.02 758
26 0.00 0.00 0.00 8
27 0.00 0.00 0.00 115
28 0.00 0.00 0.00 23
29 0.00 0.00 0.00 42
avg / total 0.35 0.37 0.23 3683
Since I am not getting good results, I would really appreciate suggestions on how to automatically adjust the weight of each class and express that in the SVC. I don't have much experience dealing with unbalanced problems, so all suggestions are welcome.
It seems that you are doing the opposite of what you should be doing. In particular, what you want is to put higher weights on the smaller classes, so that the classifier is penalized more during training for errors on these classes. A good starting point would be setting class_weight="balanced".
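To make the difference concrete, here is a small sketch of the inverse-frequency weighting that scikit-learn documents for class_weight="balanced", namely n_samples / (n_classes * count) per class. The function name `balanced_class_weights` is an assumption; it only reproduces the formula so the weights can be inspected.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Inverse-frequency weights: n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {
        label: n_samples / (n_classes * count)
        for label, count in counts.items()
    }


# Toy example: class 1 is three times rarer, so it gets three times the weight.
weights = balanced_class_weights([0, 0, 0, 1])
print(weights)
```

The resulting dictionary can be passed as `svm.SVC(kernel='linear', class_weight=weights)`, which should behave like passing `class_weight="balanced"` directly. In contrast, the dictionary in the question is proportional to the class frequency, so it gives the largest weight to class 23 and the smallest to class 8.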
I am parsing data from Excel files, and the columns of the resulting DataFrame may or may not align with a base DataFrame onto which I want to stack several parsed DataFrames.
Let's call the DataFrame I parse from the data A, and the base DataFrame df_A.
I read an Excel sheet, resulting in A=
Index AGUB AGUG MUEB MUEB SIL SIL SILB SILB
2012-01-01 00:00:00 0.00 0 0.00 50.78 0.00 0.00 0.00 0.00
2012-01-01 01:00:00 0.00 0 0.00 53.15 0.00 53.15 0.00 0.00
2012-01-01 02:00:00 0.00 0 0.00 0.00 53.15 53.15 53.15 53.15
2012-01-01 03:00:00 0.00 0 0.00 0.00 0.00 55.16 0.00 0.00
2012-01-01 04:00:00 0.00 0 0.00 0.00 0.00 0.00 0.00 0.00
2012-01-01 05:00:00 48.96 0 0.00 0.00 0.00 0.00 0.00 0.00
2012-01-01 06:00:00 0.00 0 0.00 0.00 0.00 0.00 0.00 0.00
2012-01-01 07:00:00 0.00 0 0.00 0.00 0.00 0.00 0.00 0.00
2012-01-01 08:00:00 0.00 0 0.00 0.00 0.00 0.00 0.00 0.00
2012-01-01 09:00:00 52.28 0 0.00 0.00 0.00 0.00 0.00 0.00
2012-01-01 10:00:00 0.00 0 0.00 0.00 0.00 0.00 0.00 0.00
2012-01-01 11:00:00 36.93 0 0.00 0.00 0.00 0.00 0.00 0.00
2012-01-01 12:00:00 0.00 0 0.00 0.00 0.00 0.00 0.00 0.00
2012-01-01 13:00:00 0.00 0 0.00 0.00 0.00 0.00 0.00 50.00
2012-01-01 14:00:00 0.00 0 0.00 0.00 0.00 0.00 0.00 34.01
2012-01-01 15:00:00 0.00 0 0.00 0.00 0.00 0.00 0.00 0.00
2012-01-01 16:00:00 0.00 0 0.00 0.00 0.00 0.00 0.00 0.00
2012-01-01 17:00:00 53.00 0 0.00 0.00 0.00 0.00 0.00 0.00
2012-01-01 18:00:00 0.00 75 0.00 75.00 0.00 75.00 0.00 0.00
2012-01-01 19:00:00 0.00 70 0.00 70.00 0.00 0.00 0.00 0.00
2012-01-01 20:00:00 0.00 0 0.00 0.00 0.00 0.00 0.00 0.00
2012-01-01 21:00:00 0.00 0 0.00 0.00 0.00 0.00 0.00 0.00
2012-01-01 22:00:00 0.00 0 0.00 0.00 0.00 0.00 0.00 0.00
2012-01-01 23:00:00 0.00 0 53.45 53.45 0.00 0.00 0.00 0.00
I create the base dataframe:
units = ['MUE', 'MUEB', 'SIL', 'SILB', 'AGUG', 'AGUB', 'MUEBP', 'MUELP']
df_A = pd.DataFrame(columns=units)
df_A = pd.concat([df_A, A], axis=0)
Usually with concat, if A had fewer columns than df_A it would be fine, but in this case the only difference in the columns is the order. The concatenation leads to the following error:
ValueError: Plan shapes are not aligned
I'd like to know how to concatenate the two dataframes with the column order given by df_A.
I've tried this and it doesn't matter whether there are more columns in the source, or target defined DataFrame - either way, the result is a dataframe that consists of a union of all supplied columns (with empty columns specified in the target, but not populated by the source populated with NaN).
Where I have been able to reproduce your error is where the column names in either the source or target dataframe include a duplicate name (or empty column names).
In your example, various columns appear more than once in your source file. I don't think concat copes very well with these kinds of duplicate columns.
import pandas as pd
s1 = [0,1,2,3,4,5]
s2 = [0,0,0,0,1,1]
A = pd.DataFrame([s2,s1],columns=['A','B','C','D','E','F'])
Resulting in:
A B C D E F
-----------
0 0 0 0 1 1
0 1 2 3 4 5
Take a subset of columns and use them to create a new dataframe called B
B = A[['A','C','E']]
A C E
-----
0 0 1
0 2 4
Create a new empty target dataframe
col_names = ['D','A','C','B']
Z = pd.DataFrame(columns=col_names)
D A C B
-------
And concatenate the two:
Z = pd.concat([B,Z],axis=0)
A C D E
0 0 NaN 1
0 2 NaN 4
Works fine!
But if I recreate the empty dataframe with a duplicated column name, like so:
col_names = ['D','A','C','D']
Z = pd.DataFrame(columns=col_names)
D A C D
And try to concatenate:
Z = pd.concat([B,Z],axis=0)
Then I get the error you describe.
It's because of the duplicate columns in the data (SIL). See: Pandas concat gives error ValueError: Plan shapes are not aligned
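One possible way to get the column order of the base DataFrame is sketched below. It assumes that keeping the first of each pair of duplicated columns is acceptable, which the question does not state; the helper name `align_to_base` is hypothetical. It uses `Index.duplicated` to drop the repeated names and `DataFrame.reindex` to impose the base order, filling missing columns with NaN.

```python
import pandas as pd


def align_to_base(frame, base_columns):
    """Drop repeated column names, then order columns like the base frame."""
    # Assumption: keep the first occurrence of each duplicated column name.
    deduplicated = frame.loc[:, ~frame.columns.duplicated()]
    # reindex adds the columns that are missing from the parsed data as NaN.
    return deduplicated.reindex(columns=base_columns)


units = ['MUE', 'MUEB', 'SIL', 'SILB', 'AGUG', 'AGUB', 'MUEBP', 'MUELP']

# First two rows of the parsed sheet from the question.
A = pd.DataFrame(
    [[0.0, 0, 0.0, 50.78, 0.0, 0.0, 0.0, 0.0],
     [0.0, 0, 0.0, 53.15, 0.0, 53.15, 0.0, 0.0]],
    columns=['AGUB', 'AGUG', 'MUEB', 'MUEB', 'SIL', 'SIL', 'SILB', 'SILB'],
)

# Stack several parsed frames once each has unique, aligned columns.
stacked = pd.concat([align_to_base(A, units), align_to_base(A, units)], axis=0)
print(stacked)
```

If the duplicated columns carry different data that must be preserved (for example the two MUEB columns), they would need to be renamed or combined before this step instead of dropped.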
Below is part of a column-sensitive file, lines 23 to 34. Please look at columns 25 and 26. Lines 23 to 28 are correct, as the IDs are supposed to be sequential.
HETATM 21 O HOH 7 -1.609 5.551 -4.296 1.00 0.00 WAT O
HETATM 22 H HOH 7 -1.594 5.971 -3.395 1.00 0.00 WAT H
HETATM 23 H HOH 7 -1.048 4.730 -4.281 1.00 0.00 WAT H
HETATM 24 O HOH 8 -4.693 5.472 -0.557 1.00 0.00 WAT O
HETATM 25 H HOH 8 -3.881 4.900 -0.521 1.00 0.00 WAT H
HETATM 26 H HOH 8 -4.819 5.805 -1.485 1.00 0.00 WAT H
HETATM 27 O HOH 1 0.289 -5.035 5.663 1.00 0.00 WAT O
HETATM 28 H HOH 10 0.241 -4.604 -5.564 1.00 0.00 WAT H
HETATM 29 H HOH 1 -0.399 -5.750 5.605 1.00 0.00 WAT H
HETATM 30 O HOH 11 -1.741 -5.167 0.877 1.00 0.00 WAT O
HETATM 31 H HOH 0 -2.612 -4.754 0.636 1.00 0.00 WAT H
HETATM 32 H HOH 0 -1.819 -5.599 1.769 1.00 0.00 WAT H
However, columns 25 and 26 in lines 29 to 34 (and also in lines beyond 34 that are not included here) need to be edited. They represent the ID number of the water molecules in the file. So columns 25 and 26 in lines 29-31 are supposed to be ' 9' instead of ' 1' or '10', and columns 25 and 26 in lines 32-34 are supposed to be '10' instead of '11' or ' 0'. All lines after 34 suffer from a similar problem, and I also want to change the contents of columns 25 and 26 to '11', '12', etc. for each group of 3 lines. So the final result is expected to look like this:
HETATM 21 O HOH 7 -1.609 5.551 -4.296 1.00 0.00 WAT O
HETATM 22 H HOH 7 -1.594 5.971 -3.395 1.00 0.00 WAT H
HETATM 23 H HOH 7 -1.048 4.730 -4.281 1.00 0.00 WAT H
HETATM 24 O HOH 8 -4.693 5.472 -0.557 1.00 0.00 WAT O
HETATM 25 H HOH 8 -3.881 4.900 -0.521 1.00 0.00 WAT H
HETATM 26 H HOH 8 -4.819 5.805 -1.485 1.00 0.00 WAT H
HETATM 27 O HOH 9 0.289 -5.035 5.663 1.00 0.00 WAT O
HETATM 28 H HOH 9 0.241 -4.604 -5.564 1.00 0.00 WAT H
HETATM 29 H HOH 9 -0.399 -5.750 5.605 1.00 0.00 WAT H
HETATM 30 O HOH 10 -1.741 -5.167 0.877 1.00 0.00 WAT O
HETATM 31 H HOH 10 -2.612 -4.754 0.636 1.00 0.00 WAT H
HETATM 32 H HOH 10 -1.819 -5.599 1.769 1.00 0.00 WAT H
So far I couldn't really come up with a nice pattern to replace those funky numbers with 9, 10, etc. It would be great if I could fix all these groups of 3 lines in a single vim command instead of having to do it group by group, as there are 50-60 groups with this problem. What I did earlier was simply :26,28s/HOH 1/HOH 8 and this is clearly not the most efficient way.
Sorry for not being clear in the first version of the question; your help would be appreciated. Thank you.
Your question is not clear, but from what I understand, selecting a rectangular block in visual mode might help you. Use ctrl-v on OS X or Linux, or ctrl-q on Windows (in normal mode).
Actually, I'd like to thank everyone for your time, and sorry for causing the confusion. I found a way to do it with Python's string formatting, as the pattern is really fuzzy and I'm not so used to regex patterns, so I couldn't figure out a simple way to do it in Vim.
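The Python approach mentioned above was not posted, so the following is only a sketch of what it might look like. It assumes fixed-width lines of at least 26 characters, IDs that fit in two characters, and exactly three atoms per water molecule; the function name `renumber_residues` and its parameters are hypothetical.

```python
def renumber_residues(lines, first_id, group_size=3, start_col=25, end_col=26):
    """Overwrite columns start_col..end_col (1-based) with sequential IDs.

    Every group_size consecutive lines gets the same ID, starting at first_id.
    """
    width = end_col - start_col + 1
    fixed = []
    for index, line in enumerate(lines):
        residue_id = first_id + index // group_size
        field = f"{residue_id:>{width}}"  # right-aligned, e.g. ' 9' or '10'
        if len(field) > width:
            raise ValueError(f"ID {residue_id} does not fit in {width} columns")
        # Keep everything before and after the ID field untouched.
        fixed.append(line[:start_col - 1] + field + line[end_col:])
    return fixed


# Usage sketch: pass only the lines that need fixing, e.g. from line 29 onward.
#     broken = all_lines[28:]
#     all_lines[28:] = renumber_residues(broken, first_id=9)
```

Slicing by column position avoids the need for a regex, since the wrong values (' 1', '10', '11', ' 0') follow no pattern but always sit in the same two columns.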