pandarallel package on Windows infinite loop bug - python-3.x

So this is not really a question but rather a bug report for the pandarallel package.
This is the end of my code:
...
print('Calculate costs NEG...')
for i, group in tqdm(df_mol_neg.groupby('DELIVERY_DATE')):
    srl_slice = df_srl.loc[df_srl['DATE'] == i]
    srl_slice['srl_soll'] = srl_slice['srl_soll'].copy() * -1
    df_aep_neg.loc[df_aep_neg['DATE'] == i, 'SRL_cost'] = srl_slice['srl_soll'].parallel_apply(lambda x: get_cost_of_nearest_mol(group, x)).sum()
What happens here is that instead of running the parallel_apply, the script loops back to the start of my code and repeats everything again. The exact same code works fine on my remote Linux machine, so I see two possible error sources:
Since pandarallel itself already has some difficulties with Windows, it might just be a Windows problem.
The other candidate is that I currently use the early access version of PyCharm (223.7401.13) with the debugger attached, which might also be a source of the problem.
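A side note on the first suspect: the behaviour described above (the whole script running again from the top) is exactly what happens on Windows when multiprocessing's "spawn" start method re-imports the main module and there is no if __name__ == '__main__' guard. This is only a guess at the cause; the sketch below is a minimal stand-alone example of the guard, not my actual code:
import pandas as pd
from pandarallel import pandarallel

def main():
    pandarallel.initialize()  # set up the worker pool inside the guard
    df = pd.DataFrame({'x': range(10)})
    print(df['x'].parallel_apply(lambda v: v * 2).sum())

if __name__ == '__main__':
    # without this guard, every spawned worker on Windows re-executes the module
    # from top to bottom, which looks like the script looping back to the start
    main()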
Other than this bug, I can highly recommend the pandarallel package (at least for Linux users). It's super easy to use, and if you have some cores it can really shave off some time; in my case it shaved off a cool 90%.
(Also, if there is a better way to report bugs, please let me know.)

Related

Python too many subprocesses?

I'm trying to start a lot of Python processes on a single machine.
Here is a code snippet:
import subprocess

fout = open(path, 'w')  # one log file per child process
p = subprocess.Popen((python_path, module_name), stdout=fout, bufsize=-1)
After about 100 processes I'm getting the error below:
Running on Windows 10 64-bit, Python 3.5. I already tried splitting the start (launching from two scripts) as well as adding a sleep between launches; after a certain number of processes, the error still shows up. Any idea what that might be? Thanks a lot for any hint!
PS:
Some background. Each process opens database connections as well as does some requests using the requests package. Then some calculations are done using numpy, scipy etc.
PPS: Just discovered this error message:
dll load failed the paging file is too small for this operation to complete python (when calling scipy)
Issue solved by reinstalling numpy and scipy and installing MKL.
The strange thing about this error was that it only appeared after a certain number of processes. I'd love to hear if anybody knows why this happened!
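One workaround (not from the original thread) for this kind of failure is to cap how many children run at once, since every extra process maps its own copies of the numpy/scipy DLLs and reserves commit charge against the page file. A minimal sketch; jobs, python_path, module_name and out_path are placeholders, not names from the question:
import subprocess
import time

MAX_CONCURRENT = 20  # assumption: tune to the available RAM / page file size

def launch(python_path, module_name, out_path):
    fout = open(out_path, 'w')  # one log file per child, as in the question
    return subprocess.Popen((python_path, module_name), stdout=fout, bufsize=-1)

running = []
for python_path, module_name, out_path in jobs:  # jobs is a hypothetical list of work items
    while len([p for p in running if p.poll() is None]) >= MAX_CONCURRENT:
        time.sleep(0.5)  # wait until a slot frees up
    running.append(launch(python_path, module_name, out_path))

for p in running:
    p.wait()  # let the remaining children finish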

Python3 scapy/kamene extremely slow

I was trying to use Pcap.net for some PCAP file analysis, which took around five seconds to loop through all available packets in a 1GB pcap file.
I'm now trying to use Scapy on Python3, which for whatever reason is called Kamene, but it's taking literally forever to parse the file, and CPU activity hits 100%, so I'm clearly doing something wrong. Here's the code:
from kamene.all import *
packetCount = 0
with PcapReader("C:\\Testing\\pcap\\maccdc2012_00000.pcap") as reader:
    for packet in reader:
        packetCount += 1
print(packetCount)
When running that, I get:
WARNING: No route found for IPv6 destination :: (no default route?).
This affects only IPv6
<UNIVERSAL><class 'kamene.asn1.asn1.ASN1_Class_metaclass'>
That UNIVERSAL message just gets repeated over and over, and after it had run for five minutes, I gave up. Does anyone have any idea what is going on? Am I being dumb?
I've tried this on both Ubuntu and within Visual Studio on Windows (both virtualised).
First of all, you’re not using Scapy :/
from https://scapy.net
An independent fork of Scapy was created from v2.2.0 in 2015, aimed at
supporting only Python3 (scapy3k). The fork diverged, did not follow
evolutions and fixes, and has had its own life without contributions
back to Scapy. Unfortunately, it has been packaged as python3-scapy in
some distributions, and as scapy-python3 on PyPI leading to confusion
amongst users. It should not be the case anymore soon. Scapy supports
Python3 in addition to Python2 since 2.4.0. Scapy v2.4.0 should be
favored as the official Scapy code base. The fork has been renamed as
kamene.
Uninstalling kamene and running pip install scapy or pip3 install scapy (or getting it from GitHub) should help.
Once you've done that, you will find tips on how to speed up Scapy (starting from 2.4.4) in the Performance section of the docs.
That being said, Scapy isn’t designed to handle very large amounts of data (it is rather aimed at being easy to use), so it will probably take some time to process 1 GB anyway :/ (Also, Python is slower than languages like C at tasks such as packet dissection, so you will probably never match Wireshark's speed in Python.)
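One more note, not part of the original answer: if all you need is a packet count, Scapy's RawPcapReader iterates over the raw records without dissecting each packet, which is usually far faster than PcapReader for this kind of job. A minimal sketch, reusing the file path from the question:
from scapy.utils import RawPcapReader  # upstream Scapy (>= 2.4.0), not kamene

packetCount = 0
for pkt_data, pkt_metadata in RawPcapReader("C:\\Testing\\pcap\\maccdc2012_00000.pcap"):
    packetCount += 1  # pkt_data is just the raw bytes; nothing is dissected here
print(packetCount)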

MPI4PY strange OS error

I have a complex MPI4PY script, that gives a seemingly impossible error.
The important part of the script:
for rnd in range(50):
    if rnd > 0:
        WEIGHT_FILE = '{}/weights_{}.wts'.format(WORK_DIR, rnd - 1)
    WORK_DIR = '{}'.format(rnd)
    if PROCESS_NUM == 0:
        if not os.path.isdir(WORK_DIR):
            os.mkdir(WORK_DIR)
    ....
So after the second iteration I get an OSError: cannot create directory, directory exists. How is this possible? If the directory exists, it should not try to create it. PROCESS_NUM is the MPI rank, so only one process should try to create it. Is there some kind of race condition or locking error? Any idea?
You need to create the full path name before checking:
if not os.path.isdir(os.path.join(full_path, WORK_DIR)):
Let's use:
os.makedirs(WORK_DIR, exist_ok=True)
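If the underlying problem really is several ranks racing on the same directory, a common pattern (not from the original answers) is to let only rank 0 create it with exist_ok=True and then synchronize everyone with a barrier before the directory is used. A minimal sketch assuming mpi4py:
import os
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

for rnd in range(50):
    work_dir = str(rnd)
    if rank == 0:
        os.makedirs(work_dir, exist_ok=True)  # tolerant even if a stray process won the race
    comm.Barrier()  # no rank proceeds until the directory exists
    # ... per-round work using work_dir goes here ...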
I seem to have found the answer, and it was deep in the architecture, not related to Python.
I was using the SLURM distribution manager with MPICH, and on one of the nodes there was an installation of Open MPI alongside MPICH, causing some trouble. The numbering of the cores on that node was 0/1 for all allocations, causing a race condition in the script because multiple processes got the same PROCESS_NUM.

"Out of Memory Error (Java)" when using R and XLConnect package

I tried to load a ~30 MB Excel spreadsheet into R using the XLConnect package.
This is what I wrote:
wb <- loadWorkbook("largespreadsheet.xlsx")
And after about 15 seconds, I got the following error:
Error: OutOfMemoryError (Java): GC overhead limit exceeded.
Is this a limitation of the XLConnect package or is there a way to tweak my memory settings to allow for larger files?
I appreciate any solutions/tips/advice.
Follow the advice from their website:
options(java.parameters = "-Xmx1024m")
library(XLConnect)
If you still have problems importing XLSX files, you can use this option. The answer with "-Xmx1024m" didn't work for me, so I changed it to "-Xmx4g":
options(java.parameters = "-Xmx4g" )
library(XLConnect)
This link was useful.
Use read.xlsx() in the openxlsx package. It has no dependency on rJava and thus only has the memory limitations of R itself. I have not explored it in much depth for writing and formatting XLSX files, but it has some promising-looking vignettes. For reading large spreadsheets, it works well.
Hat tip to @Brad-Horn. I've just turned his comment into an answer because I also found this to be the best solution!
In case someone encounters this error when reading not one huge file but many files, I managed to solve it by freeing Java Virtual Machine memory with xlcFreeMemory(), like this:
files <- list.files(path, pattern = "*.xlsx")
for (i in seq_along(files)) {
  wb <- loadWorkbook(...)
  ...
  rm(wb)
  xlcFreeMemory() # <= free Java Virtual Machine memory!
}
This appears to be the case when you keep using the same R session over and over again without restarting RStudio. Restarting RStudio can give the program a fresh memory heap; it worked for me right away.
Whenever you are using a library that relies on rJava (such as RWeka in my case), you are bound to hit the default heap space limit (512 MB) some day. When you are using Java, we all know the JVM argument to use (-Xmx2048m if you want 2 GB of RAM). Here it's just a matter of how to specify it in the R environment.
options(java.parameters = "-Xmx2048m")
library(rJava)
As suggested here, make sure to run the options() call on the first line of your code. In my case, it worked only after I restarted the R session and ran it as the first line:
options(java.parameters = "-Xmx4g" )
library(XLConnect)

wxCriticalSection under Linux/Unix

I discovered that a wxCriticalSection is not recursive (it deadlocks when a thread grabs the section more than once) under Linux. Looking at the sources, I found that wxCriticalSection is implemented using a wxMutex under Linux, but without wxMUTEX_RECURSIVE. I have a codebase that runs well under Windows and Mac, and I want to port it to Linux, but I get deadlocks in some places where I did not avoid recursion.
Now I have two possibilities:
Changing and rebuilding wxWidgets for my purposes (brrr - I want to avoid that if at all possible, since I do not know much about the design decisions behind it)
Debugging each and every one of my possible code paths (brrr - that will take days and is horribly bug-prone)
Is there a third way: replacing or extending wxCriticalSection with a construct that behaves the same under Mac/Win/Unix?
PS: Could someone explain the design decision to me? Mr. Vadim Z says ...
I had temporarily forgot the reason I was against this (making wxCriticalSections recursive) but I did recall it 30 seconds later (after sending my message, of course ). Please see my follow-up
But there was never a follow-up ...
In version 2.9.1, it appears that the default should be recursive. In file \wxWidgets-2.9.1\include\wx\thread.h:
inline wxCriticalSection::wxCriticalSection( wxCriticalSectionType critSecType )
: m_mutex( critSecType == wxCRITSEC_DEFAULT ? wxMUTEX_RECURSIVE : wxMUTEX_DEFAULT ) { }
And in class wxCriticalSection the constructor declaration is
wxCRITSECT_INLINE wxCriticalSection( wxCriticalSectionType critSecType = wxCRITSEC_DEFAULT );
I don't use Linux, so I can't verify that wxCriticalSection is actually recursive when compiled.
