Read mixed text file in MATLAB - string

I have a text file which contains numbers and characters, and, more importantly, it also has * which means repetition. For example:
data
-- comment
34*0.00 0.454 0.223
0.544 5*4.866
/
The example above starts with 34 zeros (0.00), followed by 0.454, 0.223 and 0.544, and then 4.866 repeated 5 times, which gives 34 + 1 + 1 + 1 + 5 = 42 numeric values. What is the best way to write general code that can read such text files? Nothing else in the text file matters; only the numbers are relevant.

The first step is to read the data in. I'm assuming that the contents of your file look like this:
-- comment
34*0.00 0.454 0.223
0.544 5*4.866
For that format, you can use textscan like so:
fid = fopen('data.txt');
data = textscan(fid, '%s', 'CommentStyle', '--');
fclose(fid);
data = data{1};
And data will look like this when displayed:
data =
5×1 cell array
'34*0.00'
'0.454'
'0.223'
'0.544'
'5*4.866'
Now, there are a few different ways you could try to convert this into numeric data of the format you need. One (potentially horrifying) way is to use regexprep like so:
>> data = regexprep(data, '([\d\.]+)\*([\d\.]+)', ...
'${repmat([$2 blanks(1)], 1, str2num($1))}')
data =
5×1 cell array
'0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.0…'
'0.454'
'0.223'
'0.544'
'4.866 4.866 4.866 4.866 4.866 '
As you can see, it replicates each string in place as needed. Now, we can convert each cell of the cell array to a numeric value and concatenate them all together like so, using cellfun and str2num:
>> num = cellfun(@str2num, data, 'UniformOutput', false);
>> num = [num{:}]
num =
Columns 1 through 14
0 0 0 0 0 0 0 0 0 0 0 0 0 0
Columns 15 through 28
0 0 0 0 0 0 0 0 0 0 0 0 0 0
Columns 29 through 42
0 0 0 0 0 0 0.4540 0.2230 0.5440 4.8660 4.8660 4.8660 4.8660 4.8660
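For readers who want to check the expansion logic outside MATLAB, the same idea can be sketched in Python. This is only an illustration under assumptions: the function name expand_repeats is hypothetical, comments are assumed to start with --, and any token that does not parse as a number (such as the data keyword or the closing /) is simply skipped.

```python
def expand_repeats(text):
    """Return the numeric values in `text`, expanding tokens of the form N*value."""
    values = []
    for line in text.splitlines():
        # Assumption: everything after "--" on a line is a comment.
        line = line.split("--", 1)[0]
        for token in line.split():
            count, star, value = token.partition("*")
            try:
                if star:
                    # "34*0.00" means the value 0.00 repeated 34 times.
                    values.extend([float(value)] * int(count))
                else:
                    values.append(float(token))
            except ValueError:
                # Non-numeric tokens such as "data" or "/" are ignored.
                continue
    return values


sample = """data
-- comment
34*0.00 0.454 0.223
0.544 5*4.866
/"""

numbers = expand_repeats(sample)
print(len(numbers))
```

With the sample from the question this yields the 42 values counted there.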


Pandas - group by function and sum columns to extract rows where sum of other columns is 0

I have a data frame with over three million rows. I am trying to group values in the Bar_Code column and extract only those rows where the sum of all rows in SOH, Cost and Sold_Date is zero.
My dataframe looks like this:
Location Bar_Code SOH Cost Sold_Date
1 00000003589823 0 0.00 NULL
2 00000003589823 0 0.00 NULL
3 00000003589823 0 0.00 NULL
1 0000000151818 -102 0.00 NULL
2 0000000151818 0 8.00 NULL
3 0000000151818 0 0.00 2020-10-06T16:35:25.000
1 0000131604108 0 0.00 NULL
2 0000131604108 0 0.00 NULL
3 0000131604108 0 0.00 NULL
1 0000141073505 -53 3.00 2020-10-06T16:35:25.000
2 0000141073505 0 0.00 NULL
3 0000141073505 -20 20.00 2020-09-25T10:11:30.000
I have tried the below code:
df.groupby(['Bar_Code','SOH','Cost','Sold_Date']).sum()
but I am getting the below output:
Bar_Code SOH Cost Sold_Date
0000000151818 -102.0 0.0000 2021-12-13T10:01:59.000
0.0 8.0000 2020-10-06T16:35:25.000
0000131604108 0.0 0.0000 NULL
0000141073505 -53.0 0.0000 2021-11-28T16:57:59.000
3.0000 2021-12-05T11:23:02.000
0.0 0.0000 2020-04-14T08:02:45.000
0000161604109 -8.0 4.1000 2020-09-25T10:11:30.000
00000003589823 0 0.00 NULL
I need to check whether it is possible to get the desired output below, containing only the rows where the sum of SOH, Cost and Sold_Date is 0 or NULL. It's safe for the code to ignore the first column (Location):
Bar_Code SOH Cost Sold_Date
00000003589823 0 0.00 NULL
0000131604108 0.0 0.0000 NULL
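The idea is to keep only the groups in which SOH, Cost and Sold_Date are all 0 or NaN. First select the rows that do not match (any non-zero, non-missing value) and collect their Bar_Code values, then invert an isin mask to drop every group that contains such a row: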
g = df.loc[df[['SOH','Cost','Sold_Date']].fillna(0).ne(0).any(axis=1), 'Bar_Code']
df1 = df[~df['Bar_Code'].isin(g)].drop_duplicates('Bar_Code').drop('Location', axis=1)
print (df1)
Bar_Code SOH Cost Sold_Date
0 00000003589823 0 0.0 NaN
6 0000131604108 0 0.0 NaN
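The same filter can also be expressed with a per-group flag, which may be easier to read. The sketch below rebuilds the sample from the question as a small DataFrame and assumes that NULL was read in as a missing value (None/NaN) rather than as the literal string "NULL"; the names row_has_value and group_has_value are illustrative only.

```python
import pandas as pd

# Sample data from the question; NULL is assumed to be a missing value.
df = pd.DataFrame({
    "Location": [1, 2, 3] * 4,
    "Bar_Code": (["00000003589823"] * 3 + ["0000000151818"] * 3
                 + ["0000131604108"] * 3 + ["0000141073505"] * 3),
    "SOH": [0, 0, 0, -102, 0, 0, 0, 0, 0, -53, 0, -20],
    "Cost": [0.0, 0.0, 0.0, 0.0, 8.0, 0.0, 0.0, 0.0, 0.0, 3.0, 0.0, 20.0],
    "Sold_Date": [None, None, None, None, None, "2020-10-06T16:35:25.000",
                  None, None, None, "2020-10-06T16:35:25.000", None,
                  "2020-09-25T10:11:30.000"],
})

# True for every row that carries a non-zero number or a sale date.
row_has_value = df["SOH"].ne(0) | df["Cost"].ne(0) | df["Sold_Date"].notna()

# Broadcast "any row in this Bar_Code has a value" back to each row.
group_has_value = row_has_value.groupby(df["Bar_Code"]).transform("any")

# Keep one row per all-empty group and drop the Location column.
df1 = (df[~group_has_value]
       .drop_duplicates("Bar_Code")
       .drop(columns="Location"))
print(df1)
```

On three million rows this stays vectorised, since transform("any") avoids a Python-level loop over the groups.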

Finding the source of latency in an RPC system

As the title suggests, I am using a simple RPC system between a PC (Windows x64) and an embedded Linux PC running Ubuntu. The embedded Linux PC is the RPC server and the PC is the RPC client. The RPC framework is erpc.
I have noticed that the transaction rate I am getting is particularly low - on the order of 20 transactions/sec.
The issue is definitely not hardware related as I have an alternate RPC system (which I'm trying to replace with the contentious one) which can easily get over 1000 transactions/sec using the exact same hardware configuration.
To further prove this, I also wrote a simple Python script which acts as a simple socket client or server depending on a switch. I run it on the embedded machine as a server and as a client on the PC. The script simply has the client send some random data to the server, which in turn sends the data back. The client does this a few hundred times and determines the transaction rate based on this. The amount of data transmitted is of the same order as what erpc uses. Using this setup I can get 3000+ transactions/sec.
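A script of the kind described might look roughly like the following. It is a sketch, not the original script: the function names, the 16-byte payload and the 400 rounds are assumptions, and for convenience the server runs in a thread on the loopback interface instead of on a second machine.

```python
import os
import socket
import threading
import time


def recv_exact(sock, n):
    """Read exactly n bytes, since recv may return fewer than requested."""
    buf = bytearray()
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("peer closed the connection")
        buf.extend(chunk)
    return bytes(buf)


def echo_server(listener, size):
    """Accept one client and echo fixed-size messages until it disconnects."""
    conn, _ = listener.accept()
    with conn:
        while True:
            try:
                data = recv_exact(conn, size)
            except ConnectionError:
                break
            conn.sendall(data)


def measure_rate(host, port, size=16, rounds=400):
    """Send random data, wait for the echo, and return transactions/sec."""
    payload = os.urandom(size)
    with socket.create_connection((host, port)) as sock:
        start = time.perf_counter()
        for _ in range(rounds):
            sock.sendall(payload)
            assert recv_exact(sock, size) == payload
        elapsed = time.perf_counter() - start
    return rounds / elapsed


def run_loopback_demo(size=16, rounds=400):
    """Run server and client in one process over 127.0.0.1."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    listener.listen(1)
    port = listener.getsockname()[1]
    server = threading.Thread(target=echo_server, args=(listener, size),
                              daemon=True)
    server.start()
    rate = measure_rate("127.0.0.1", port, size, rounds)
    server.join(timeout=5)
    listener.close()
    return rate
```

Calling run_loopback_demo() returns the measured rate. To reproduce the two-machine setup, echo_server and measure_rate would be started separately on the server and the client.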
The RPC system in question is half duplex. Only a single thread is used. Server recvs, processes the request and sends the response in a loop.
Only a single socket is used for the duration of the test. I.e. no close and accepts occur during the loop. No other IO occurs. Or at least, I have refactored it for the purposes of these tests to not do any other IO.
On the Windows client side, I have a python unit test which I have run with profiling on. The results don't seem to indicate that the problem is on the client.
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.000 0.000 23.998 23.998 runner.py:105(pytest_runtest_call)
1 0.000 0.000 23.998 23.998 python.py:1313(runtest)
1 0.000 0.000 23.998 23.998 __init__.py:603(__call__)
1 0.000 0.000 23.998 23.998 __init__.py:219(_hookexec)
1 0.000 0.000 23.998 23.998 __init__.py:213(<lambda>)
1 0.000 0.000 23.998 23.998 callers.py:151(_multicall)
1 0.000 0.000 23.998 23.998 python.py:183(pytest_pyfunc_call)
1 0.003 0.003 23.998 23.998 test_static_if.py:4(test_read_version)
400 0.014 0.000 23.993 0.060 client.py:16(get_version)
400 0.017 0.000 23.942 0.060 client.py:79(perform_request)
400 0.006 0.000 23.828 0.060 transport.py:75(receive)
800 0.016 0.000 23.820 0.030 transport.py:139(_base_receive)
800 23.803 0.030 23.803 0.030 {method 'recv' of '_socket.socket' objects}
400 0.007 0.000 0.061 0.000 transport.py:65(send)
400 0.002 0.000 0.053 0.000 transport.py:135(_base_send)
400 0.050 0.000 0.050 0.000 {method 'sendall' of '_socket.socket' objects}
400 0.012 0.000 0.032 0.000 basic_codec.py:113(start_read_message)
400 0.006 0.000 0.015 0.000 basic_codec.py:39(start_write_message)
1600 0.007 0.000 0.015 0.000 basic_codec.py:130(_read)
800 0.002 0.000 0.012 0.000 basic_codec.py:156(read_uint32)
The server is a C++ application. I have tried profiling it with gprof but the results of that show practically no time consumed by the application at all. After reading up a bit more about how gprof works and how gprof doesn't accumulate time spent in system calls, this indicates that the program is (obviously) IO bound and that the vast majority of time is spent in blocking system calls.
I won't add the entire output here for brevity but below is an excerpt:
Flat profile:
Each sample counts as 0.01 seconds.
no time accumulated
% cumulative self self total
time seconds seconds calls Ts/call Ts/call name
0.00 0.00 0.00 2407 0.00 0.00 erpc::MessageBuffer::get()
0.00 0.00 0.00 2400 0.00 0.00 erpc::MessageBuffer::setUsed(unsigned short)
0.00 0.00 0.00 2000 0.00 0.00 erpc::MessageBuffer::getUsed() const
0.00 0.00 0.00 1600 0.00 0.00 erpc::MessageBuffer::Cursor::write(void const*, unsigned int)
0.00 0.00 0.00 1201 0.00 0.00 erpc::Codec::getBuffer()
0.00 0.00 0.00 803 0.00 0.00 erpc::MessageBuffer::Cursor::set(erpc::MessageBuffer*)
0.00 0.00 0.00 803 0.00 0.00 erpc::MessageBuffer::getLength() const
0.00 0.00 0.00 802 0.00 0.00 erpc::Codec::reset()
0.00 0.00 0.00 801 0.00 0.00 erpc::TCPTransport::underlyingReceive(unsigned char*, unsigned int)
0.00 0.00 0.00 800 0.00 0.00 erpc::TCPTransport::underlyingSend(unsigned char const*, unsigned int)
0.00 0.00 0.00 800 0.00 0.00 erpc::BasicCodec::read(unsigned int*)
0.00 0.00 0.00 800 0.00 0.00 erpc::BasicCodec::write(int)
0.00 0.00 0.00 800 0.00 0.00 erpc::BasicCodec::write(unsigned int)
0.00 0.00 0.00 800 0.00 0.00 erpc::MessageBuffer::Cursor::read(void*, unsigned int)
0.00 0.00 0.00 800 0.00 0.00 erpc::Service::getServiceId() const
0.00 0.00 0.00 403 0.00 0.00 erpc::Service::getNext()
0.00 0.00 0.00 401 0.00 0.00 erpc::SimpleServer::runInternal(erpc::Codec*)
0.00 0.00 0.00 401 0.00 0.00 erpc::TCPTransport::accept()
0.00 0.00 0.00 401 0.00 0.00 erpc::TCPTransport::receive(erpc::MessageBuffer*)
0.00 0.00 0.00 401 0.00 0.00 erpc::FramedTransport::receive(erpc::MessageBuffer*)
0.00 0.00 0.00 400 0.00 0.00 write_p_version_t_struct(erpc::Codec*, p_version_t const*)
0.00 0.00 0.00 400 0.00 0.00 StaticIF_service::handleInvocation(unsigned int, unsigned int, erpc::Codec*, erpc::MessageBufferFactory*)
0.00 0.00 0.00 400 0.00 0.00 StaticIF_service::get_version_shim(erpc::Codec*, erpc::MessageBufferFactory*, unsigned int)
0.00 0.00 0.00 400 0.00 0.00 erpc::BasicCodec::endReadMessage()
0.00 0.00 0.00 400 0.00 0.00 erpc::BasicCodec::endWriteStruct()
0.00 0.00 0.00 400 0.00 0.00 erpc::BasicCodec::endWriteMessage()
0.00 0.00 0.00 400 0.00 0.00 erpc::BasicCodec::startReadMessage(erpc::_message_type*, unsigned int*, unsigned int*, unsigned int*)
0.00 0.00 0.00 400 0.00 0.00 erpc::BasicCodec::startWriteStruct()
0.00 0.00 0.00 400 0.00 0.00 erpc::BasicCodec::startWriteMessage(erpc::_message_type, unsigned int, unsigned int, unsigned int)
0.00 0.00 0.00 400 0.00 0.00 erpc::FramedTransport::send(erpc::MessageBuffer*)
0.00 0.00 0.00 400 0.00 0.00 erpc::MessageBufferFactory::prepareServerBufferForSend(erpc::MessageBuffer*)
0.00 0.00 0.00 400 0.00 0.00 erpc::Server::processMessage(erpc::Codec*, erpc::_message_type&)
0.00 0.00 0.00 400 0.00 0.00 erpc::Server::findServiceWithId(unsigned int)
0.00 0.00 0.00 400 0.00 0.00 get_version
0.00 0.00 0.00 5 0.00 0.00 erpc::ManuallyConstructed<erpc::SimpleServer>::get()
0.00 0.00 0.00 4 0.00 0.00 operator new(unsigned int, void*)
0.00 0.00 0.00 3 0.00 0.00 erpc::ManuallyConstructed<erpc::SimpleServer>::operator->()
0.00 0.00 0.00 2 0.00 0.00 erpc::ManuallyConstructed<erpc::TCPTransport>::get()
0.00 0.00 0.00 2 0.00 0.00 erpc::ManuallyConstructed<erpc::BasicCodecFactory>::get()
0.00 0.00 0.00 2 0.00 0.00 erpc::Server::addService(erpc::Service*)
0.00 0.00 0.00 2 0.00 0.00 erpc::Service::Service(unsigned int)
0.00 0.00 0.00 2 0.00 0.00 erpc::Service::~Service()
0.00 0.00 0.00 2 0.00 0.00 erpc_add_service_to_server
0.00 0.00 0.00 1 0.00 0.00 _GLOBAL__sub_I__Z5usagev
Using strace, the problem becomes apparent in the first recv of every request. For context, an initial header is transmitted first which indicates the amount of data the request proper contains.
Here's a couple of excerpts from the output (the full output is 2000 lines).
I used the -r, -T and -C switches which, respectively, show a relative timestamp for each call, print the time spent in each call, and show a summary.
In the transaction loop:
0.000161 recv(4, "\10\0", 2, 0) = 2 <0.059478>
0.059589 recv(4, "q\1\1\2\0\1\0\0", 8, 0) = 8 <0.000047>
0.000167 send(4, "\20\0", 2, 0) = 2 <0.000073>
0.000183 send(4, "q\1\1\2\2\1\0\0\235\256\322\2664\22\0\0", 16, 0) = 16 <0.000050>
0.000160 recv(4, "\10\0", 2, 0) = 2 <0.059513>
0.059625 recv(4, "r\1\1\2\0\1\0\0", 8, 0) = 8 <0.000046>
0.000167 send(4, "\20\0", 2, 0) = 2 <0.000071>
0.000182 send(4, "r\1\1\2\2\1\0\0\235\256\322\2664\22\0\0", 16, 0) = 16 <0.000049>
0.000161 recv(4, "\10\0", 2, 0) = 2 <0.059059>
0.059172 recv(4, "s\1\1\2\0\1\0\0", 8, 0) = 8 <0.000047>
0.000183 send(4, "\20\0", 2, 0) = 2 <0.000073>
0.000183 send(4, "s\1\1\2\2\1\0\0\235\256\322\2664\22\0\0", 16, 0) = 16 <0.000049>
0.000161 recv(4, "\10\0", 2, 0) = 2 <0.059330>
0.059441 recv(4, "t\1\1\2\0\1\0\0", 8, 0) = 8 <0.000046>
0.000166 send(4, "\20\0", 2, 0) = 2 <0.000072>
0.000182 send(4, "t\1\1\2\2\1\0\0\235\256\322\2664\22\0\0", 16, 0) = 16 <0.000050>
0.000163 recv(4, "\10\0", 2, 0) = 2 <0.059506>
0.059618 recv(4, "u\1\1\2\0\1\0\0", 8, 0) = 8 <0.000046>
0.000166 send(4, "\20\0", 2, 0) = 2 <0.000070>
0.000181 send(4, "u\1\1\2\2\1\0\0\235\256\322\2664\22\0\0", 16, 0) = 16 <0.000049>
0.000160 recv(4, "\10\0", 2, 0) = 2 <0.059359>
0.059488 recv(4, "v\1\1\2\0\1\0\0", 8, 0) = 8 <0.000048>
0.000175 send(4, "\20\0", 2, 0) = 2 <0.000077>
0.000189 send(4, "v\1\1\2\2\1\0\0\235\256\322\2664\22\0\0", 16, 0) = 16 <0.000051>
0.000165 recv(4, "\10\0", 2, 0) = 2 <0.059496>
0.059612 recv(4, "w\1\1\2\0\1\0\0", 8, 0) = 8 <0.000046>
0.000170 send(4, "\20\0", 2, 0) = 2 <0.000074>
0.000182 send(4, "w\1\1\2\2\1\0\0\235\256\322\2664\22\0\0", 16, 0) = 16 <0.000050>
The summary:
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
98.59 0.010000 12 801 recv
1.41 0.000143 0 800 send
0.00 0.000000 0 12 read
0.00 0.000000 0 3 write
0.00 0.000000 0 25 19 open
0.00 0.000000 0 7 close
0.00 0.000000 0 1 execve
0.00 0.000000 0 8 lseek
0.00 0.000000 0 6 6 access
0.00 0.000000 0 3 brk
0.00 0.000000 0 1 readlink
0.00 0.000000 0 1 munmap
0.00 0.000000 0 2 setitimer
0.00 0.000000 0 1 uname
0.00 0.000000 0 9 mprotect
0.00 0.000000 0 5 writev
0.00 0.000000 0 2 rt_sigaction
0.00 0.000000 0 16 mmap2
0.00 0.000000 0 16 15 stat64
0.00 0.000000 0 6 fstat64
0.00 0.000000 0 1 socket
0.00 0.000000 0 1 bind
0.00 0.000000 0 1 listen
0.00 0.000000 0 1 accept
0.00 0.000000 0 1 setsockopt
0.00 0.000000 0 1 set_tls
------ ----------- ----------- --------- --------- ----------------
100.00 0.010143 1731 40 total
In passing, I am not sure I completely understand the summary. The summary suggests that recv happens very quickly compared to the time indicated in each call to recv.
It looks like the time spent in the first recv is what is killing the RPC system at nearly 60ms per call. Am I misreading this? I am not sure of the units, so I am guessing seconds.
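On the units: the value that strace -T prints in angle brackets is the wall-clock time spent in the call, in seconds, so 0.059478 is indeed about 60 ms. The summary table, by default, counts system CPU time rather than wall-clock time (strace has a -w option to summarise wall-clock time instead), which would explain why a blocking recv looks cheap there. As a rough cross-check, the per-call durations can be summed with a small script. This is a sketch: the regular expression is an assumption about the line layout shown above, and the sample is the first transaction from the excerpt.

```python
import re
from collections import defaultdict

# Assumed layout: "<relative ts> <name>(<args>) = <ret> <<seconds>>"
LINE = re.compile(
    r"^\s*(?P<rel>\d+\.\d+)\s+(?P<name>\w+)\(.*<(?P<dur>\d+\.\d+)>\s*$"
)


def wall_time_per_syscall(lines):
    """Sum the -T durations (seconds) and count the calls per syscall name."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for line in lines:
        match = LINE.match(line)
        if match:
            totals[match.group("name")] += float(match.group("dur"))
            counts[match.group("name")] += 1
    return dict(totals), dict(counts)


# One request/response cycle copied from the strace excerpt.
sample = r'''
0.000161 recv(4, "\10\0", 2, 0) = 2 <0.059478>
0.059589 recv(4, "q\1\1\2\0\1\0\0", 8, 0) = 8 <0.000047>
0.000167 send(4, "\20\0", 2, 0) = 2 <0.000073>
0.000183 send(4, "q\1\1\2\2\1\0\0\235\256\322\2664\22\0\0", 16, 0) = 16 <0.000050>
'''.splitlines()

totals, counts = wall_time_per_syscall(sample)
per_transaction = sum(totals.values())
print(totals, counts, 1.0 / per_transaction)
```

For this one cycle almost all of the roughly 0.06 s is spent in the first recv, which corresponds to about 17 transactions/sec. That matches the client profile (400 calls in 23.998 s, or 0.060 s per call).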
So, after profiling both the client and the server, it appears the vast majority of the time is spent in recv.
If we assume that the extra time spent in the initial recv on the server side is because the client was still processing something and hadn't sent it yet, that should have shown up when profiling the client.
Any suggestions you may have as to how to further debug this would be greatly appreciated.
Thanks!

SED change last column text

I would like to ask how to change the letter A to C in the last column using sed.
Input for example:
HETATM 18 H UNK 0 12.447 20.851 23.373 0.00 0.00 0.167 HD
HETATM 19 C UNK 0 11.406 19.947 21.942 0.00 0.00 0.033 A
HETATM 20 C UNK 0 10.684 20.899 21.181 0.00 0.00 0.030 A
HETATM 21 C UNK 0 9.503 20.541 20.507 0.00 0.00 0.019 A
HETATM 22 C UNK 0 9.032 19.211 20.545 0.00 0.00 0.032 A
HETATM 23 C UNK 0 9.772 18.248 21.264 0.00 0.00 0.019 A
HETATM 24 C UNK 0 10.946 18.613 21.948 0.00 0.00 0.030 A
HETATM 25 C UNK 0 7.833 18.846 19.889 0.00 0.00 0.253 C
HETATM 26 O UNK 0 7.856 18.994 18.642 0.00 0.00 -0.267 OA
Output:
HETATM 18 H UNK 0 12.447 20.851 23.373 0.00 0.00 0.167 HD
HETATM 19 C UNK 0 11.406 19.947 21.942 0.00 0.00 0.033 C
HETATM 20 C UNK 0 10.684 20.899 21.181 0.00 0.00 0.030 C
HETATM 21 C UNK 0 9.503 20.541 20.507 0.00 0.00 0.019 C
HETATM 22 C UNK 0 9.032 19.211 20.545 0.00 0.00 0.032 C
HETATM 23 C UNK 0 9.772 18.248 21.264 0.00 0.00 0.019 C
HETATM 24 C UNK 0 10.946 18.613 21.948 0.00 0.00 0.030 C
HETATM 25 C UNK 0 7.833 18.846 19.889 0.00 0.00 0.253 C
HETATM 26 O UNK 0 7.856 18.994 18.642 0.00 0.00 -0.267 OA
I tried sed like this:
sed 's/[A*]$/C/'
But the output looks like this:
HETATM 26 O UNK 0 7.856 18.994 18.642 0.00 0.00 -0.267 OC
Simple sed approach:
sed 's/\<A[[:space:]]*$/C/' file
\< - word boundary (assuming A char occurs only as standalone char)
[[:space:]]* - match possible whitespace(s) at the end of the string $
The output:
HETATM 18 H UNK 0 12.447 20.851 23.373 0.00 0.00 0.167 HD
HETATM 19 C UNK 0 11.406 19.947 21.942 0.00 0.00 0.033 C
HETATM 20 C UNK 0 10.684 20.899 21.181 0.00 0.00 0.030 C
HETATM 21 C UNK 0 9.503 20.541 20.507 0.00 0.00 0.019 C
HETATM 22 C UNK 0 9.032 19.211 20.545 0.00 0.00 0.032 C
HETATM 23 C UNK 0 9.772 18.248 21.264 0.00 0.00 0.019 C
HETATM 24 C UNK 0 10.946 18.613 21.948 0.00 0.00 0.030 C
HETATM 25 C UNK 0 7.833 18.846 19.889 0.00 0.00 0.253 C
HETATM 26 O UNK 0 7.856 18.994 18.642 0.00 0.00 -0.267 OA
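To see why the original attempt also rewrote OA, the two patterns can be compared side by side. The sketch below uses Python's re module rather than sed, so it is an approximation: Python has no \< anchor, and \b is used as a stand-in word boundary. The function names are illustrative.

```python
import re

lines = [
    "HETATM 24 C UNK 0 10.946 18.613 21.948 0.00 0.00 0.030 A",
    "HETATM 25 C UNK 0 7.833 18.846 19.889 0.00 0.00 0.253 C",
    "HETATM 26 O UNK 0 7.856 18.994 18.642 0.00 0.00 -0.267 OA",
]


def first_attempt(line):
    # [A*] is a bracket expression: ONE character that is 'A' or '*'.
    # It therefore matches the final 'A' of "OA" as well.
    return re.sub(r"[A*]$", "C", line)


def standalone_only(line):
    # \b requires a word boundary before 'A', so the 'A' in "OA" is skipped.
    # \s* also swallows trailing whitespace, as in the sed answer.
    return re.sub(r"\bA\s*$", "C", line)


for line in lines:
    print(first_attempt(line)[-3:], "|", standalone_only(line)[-3:])
```

Only the standalone A is replaced by the second pattern, while the first pattern also turns OA into OC.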

concatenate dataframes with different column ordering [duplicate]

I am parsing data from Excel files, and the columns of the resulting DataFrame may or may not align with a base DataFrame where I want to stack several parsed DataFrames.
Let's call the DataFrame I parse from data A, and the base DataFrame df_A.
I read an Excel sheet, resulting in A =
Index AGUB AGUG MUEB MUEB SIL SIL SILB SILB
2012-01-01 00:00:00 0.00 0 0.00 50.78 0.00 0.00 0.00 0.00
2012-01-01 01:00:00 0.00 0 0.00 53.15 0.00 53.15 0.00 0.00
2012-01-01 02:00:00 0.00 0 0.00 0.00 53.15 53.15 53.15 53.15
2012-01-01 03:00:00 0.00 0 0.00 0.00 0.00 55.16 0.00 0.00
2012-01-01 04:00:00 0.00 0 0.00 0.00 0.00 0.00 0.00 0.00
2012-01-01 05:00:00 48.96 0 0.00 0.00 0.00 0.00 0.00 0.00
2012-01-01 06:00:00 0.00 0 0.00 0.00 0.00 0.00 0.00 0.00
2012-01-01 07:00:00 0.00 0 0.00 0.00 0.00 0.00 0.00 0.00
2012-01-01 08:00:00 0.00 0 0.00 0.00 0.00 0.00 0.00 0.00
2012-01-01 09:00:00 52.28 0 0.00 0.00 0.00 0.00 0.00 0.00
2012-01-01 10:00:00 0.00 0 0.00 0.00 0.00 0.00 0.00 0.00
2012-01-01 11:00:00 36.93 0 0.00 0.00 0.00 0.00 0.00 0.00
2012-01-01 12:00:00 0.00 0 0.00 0.00 0.00 0.00 0.00 0.00
2012-01-01 13:00:00 0.00 0 0.00 0.00 0.00 0.00 0.00 50.00
2012-01-01 14:00:00 0.00 0 0.00 0.00 0.00 0.00 0.00 34.01
2012-01-01 15:00:00 0.00 0 0.00 0.00 0.00 0.00 0.00 0.00
2012-01-01 16:00:00 0.00 0 0.00 0.00 0.00 0.00 0.00 0.00
2012-01-01 17:00:00 53.00 0 0.00 0.00 0.00 0.00 0.00 0.00
2012-01-01 18:00:00 0.00 75 0.00 75.00 0.00 75.00 0.00 0.00
2012-01-01 19:00:00 0.00 70 0.00 70.00 0.00 0.00 0.00 0.00
2012-01-01 20:00:00 0.00 0 0.00 0.00 0.00 0.00 0.00 0.00
2012-01-01 21:00:00 0.00 0 0.00 0.00 0.00 0.00 0.00 0.00
2012-01-01 22:00:00 0.00 0 0.00 0.00 0.00 0.00 0.00 0.00
2012-01-01 23:00:00 0.00 0 53.45 53.45 0.00 0.00 0.00 0.00
I create the base dataframe:
units = ['MUE', 'MUEB', 'SIL', 'SILB', 'AGUG', 'AGUB', 'MUEBP', 'MUELP']
df_A = pd.DataFrame(columns=units)
df_A = pd.concat([df_A, A], axis=0)
Usually with concat, if A had fewer columns than df_A it would be fine, but in this case the only difference in the columns is the order. The concatenation leads to the following error:
ValueError: Plan shapes are not aligned
I'd like to know how to concatenate the two dataframes with the column order given by df_A.
I've tried this, and it doesn't matter whether there are more columns in the source or in the target DataFrame: either way, the result is a DataFrame consisting of the union of all supplied columns (columns specified in the target but not populated by the source are filled with NaN).
I have been able to reproduce your error when the column names in either the source or target dataframe include a duplicate name (or empty column names).
In your example, various columns appear more than once in your source file. I don't think concat copes very well with these kinds of duplicate columns.
import pandas as pd
s1 = [0,1,2,3,4,5]
s2 = [0,0,0,0,1,1]
A = pd.DataFrame([s2,s1],columns=['A','B','C','D','E','F'])
Resulting in:
A B C D E F
-----------
0 0 0 0 1 1
0 1 2 3 4 5
Take a subset of columns and use them to create a new dataframe called B
B = A[['A','C','E']]
A C E
-----
0 0 1
0 2 4
Create a new empty target dataframe
col_names = ['D','A','C','B']
Z = pd.DataFrame(columns=col_names)
D A C B
-------
And concatenate the two:
Z = pd.concat([B,Z],axis=0)
A C D E
0 0 NaN 1
0 2 NaN 4
Works fine!
But if I recreate the empty dataframe with a duplicate column name, like so:
col_names = ['D','A','C','D']
Z = pd.DataFrame(columns=col_names)
D A C D
And try to concatenate:
Z = pd.concat([B,Z],axis=0)
Then I get the error you describe.
It's because of the duplicate columns in the data (SIL). See: Pandas concat gives error ValueError: Plan shapes are not aligned
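One possible way to work around the duplicates is to make the column labels unique before aligning each parsed frame to the base column order. The sketch below is an illustration under loud assumptions: it keeps only the first of each repeated column (which may not be the right choice for your data), the helper name align_to_units is hypothetical, and A is a two-row toy stand-in for the parsed sheet.

```python
import pandas as pd

units = ['MUE', 'MUEB', 'SIL', 'SILB', 'AGUG', 'AGUB', 'MUEBP', 'MUELP']


def align_to_units(frame, units):
    """Drop repeated column labels (keeping the first) and reorder to `units`."""
    # Assumption: the first occurrence of a repeated label is the one to keep.
    unique = frame.loc[:, ~frame.columns.duplicated()]
    # reindex adds missing columns as NaN and applies the base column order.
    return unique.reindex(columns=units)


# Toy stand-in for a parsed sheet with repeated MUEB and SIL columns.
A = pd.DataFrame(
    [[0.0, 0, 0.0, 50.78, 0.0, 0.0],
     [48.96, 0, 0.0, 0.0, 53.15, 53.15]],
    columns=['AGUB', 'AGUG', 'MUEB', 'MUEB', 'SIL', 'SIL'],
)

parts = [align_to_units(A, units), align_to_units(A, units)]
df_A = pd.concat(parts, axis=0, ignore_index=True)
print(df_A)
```

Because every part already has unique labels in the same order, concat no longer has to reconcile duplicate columns.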

Haskell: Leaking memory from ST / GC not collecting?

I have a computation inside ST which allocates memory through a Data.Vector.Unboxed.Mutable. The vector is never read or written, nor is any reference retained to it outside of runST (to the best of my knowledge). The problem I have is that when I run my ST computation multiple times, I sometimes seem to keep the memory for the vector around.
Allocation statistics:
5,435,386,768 bytes allocated in the heap
5,313,968 bytes copied during GC
134,364,780 bytes maximum residency (14 sample(s))
3,160,340 bytes maximum slop
518 MB total memory in use (0 MB lost due to fragmentation)
Here I call runST 20x with different values for my computation and a 128MB vector (again - unused, not returned or referenced outside of ST). The maximum residency looks good, basically just my vector plus a few MB of other stuff. But the total memory use indicates that I have four copies of the vector active at the same time. This scales perfectly with the size of the vector, for 256MB we get 1030MB as expected.
Using a 1GB vector runs out of memory (4x1GB + overhead > 32bit). I don't understand why the RTS keeps seemingly unused, unreferenced memory around instead of just GC'ing it, at least at the point where an allocation would otherwise fail.
Running with +RTS -S reveals the following:
Alloc Copied Live GC GC TOT TOT Page Flts
bytes bytes bytes user elap user elap
134940616 13056 134353540 0.00 0.00 0.09 0.19 0 0 (Gen: 1)
583416 6756 134347504 0.00 0.00 0.09 0.19 0 0 (Gen: 0)
518020 17396 134349640 0.00 0.00 0.09 0.19 0 0 (Gen: 1)
521104 13032 134359988 0.00 0.00 0.09 0.19 0 0 (Gen: 0)
520972 1344 134360752 0.00 0.00 0.09 0.19 0 0 (Gen: 0)
521100 828 134360684 0.00 0.00 0.10 0.19 0 0 (Gen: 0)
520812 592 134360528 0.00 0.00 0.10 0.19 0 0 (Gen: 0)
520936 1344 134361324 0.00 0.00 0.10 0.19 0 0 (Gen: 0)
520788 1480 134361476 0.00 0.00 0.10 0.20 0 0 (Gen: 0)
134438548 5964 268673908 0.00 0.00 0.19 0.38 0 0 (Gen: 0)
586300 3084 268667168 0.00 0.00 0.19 0.38 0 0 (Gen: 0)
517840 952 268666340 0.00 0.00 0.19 0.38 0 0 (Gen: 0)
520920 544 268666164 0.00 0.00 0.19 0.38 0 0 (Gen: 0)
520780 428 268666048 0.00 0.00 0.19 0.38 0 0 (Gen: 0)
520820 2908 268668524 0.00 0.00 0.19 0.38 0 0 (Gen: 0)
520732 1788 268668636 0.00 0.00 0.19 0.39 0 0 (Gen: 0)
521076 564 268668492 0.00 0.00 0.19 0.39 0 0 (Gen: 0)
520532 712 268668640 0.00 0.00 0.19 0.39 0 0 (Gen: 0)
520764 956 268668884 0.00 0.00 0.19 0.39 0 0 (Gen: 0)
520816 420 268668348 0.00 0.00 0.20 0.39 0 0 (Gen: 0)
520948 1332 268669260 0.00 0.00 0.20 0.39 0 0 (Gen: 0)
520784 616 268668544 0.00 0.00 0.20 0.39 0 0 (Gen: 0)
521416 836 268668764 0.00 0.00 0.20 0.39 0 0 (Gen: 0)
520488 1240 268669168 0.00 0.00 0.20 0.40 0 0 (Gen: 0)
520824 1608 268669536 0.00 0.00 0.20 0.40 0 0 (Gen: 0)
520688 1276 268669204 0.00 0.00 0.20 0.40 0 0 (Gen: 0)
520252 1332 268669260 0.00 0.00 0.20 0.40 0 0 (Gen: 0)
520672 1000 268668928 0.00 0.00 0.20 0.40 0 0 (Gen: 0)
134553500 5640 402973292 0.00 0.00 0.29 0.58 0 0 (Gen: 0)
586776 2644 402966160 0.00 0.00 0.29 0.58 0 0 (Gen: 0)
518064 26784 134342772 0.00 0.00 0.29 0.58 0 0 (Gen: 1)
520828 3120 134343528 0.00 0.00 0.29 0.59 0 0 (Gen: 0)
521108 756 134342668 0.00 0.00 0.30 0.59 0 0 (Gen: 0)
Here it seems we have 'live bytes' exceeding ~128MB.
The +RTS -hy profile basically just says we allocate 128MB:
http://imageshack.us/a/img69/7765/45q8.png
I tried reproducing this behavior in a simpler program, but even when replicating the exact setup with ST, a Reader containing the Vector, the same monad/program structure etc., the simple test program doesn't show this. When simplifying my big program, the behavior also eventually stops after removing apparently completely unrelated code.
Qs:
Am I really keeping this vector around 4 times out of 20?
If yes, how do I actually tell, since +RTS -hy and maximum residency claim I'm not, and what can I do to stop this behavior?
If no, why is Haskell not GC'ing it and running out of address space / memory, and what can I do to stop this behavior?
Thanks!
I suspect this is a bug in GHC and/or the RTS.
First, I'm confident there is no actual space leak or anything like that.
Reasons:
The vector is never used anywhere. Not read, not written, not referenced. It should be collected once runST is done. Even when the ST computation returns a single Int which is immediately printed out to evaluate it, the memory issue still exists. There is no reference to that data.
Every profiling mode the RTS offers is in violent agreement that I never actually have more than a single vector's worth of memory allocated/referenced. Every statistic and pretty chart says that.
Now, here's the interesting bit. If I manually force the GC by calling System.Mem.performGC after every run of my function, the problem goes away, completely.
So we have a case where the runtime has GBs worth of memory which (demonstrably!) can be reclaimed by the GC and even according to its own statistic is not held by anybody anymore. When running out of its memory pool the runtime does not collect, but instead asks the OS for more memory. And even when that finally fails, the runtime still does not collect (which would reclaim GBs of memory, demonstrably) but instead chooses to terminate the program with an out-of-memory error.
I'm no expert on Haskell, GHC or GC. But this does look awfully broken to me. I'll report this as a bug.
