Python too many subprocesses? - python-3.x

I'm trying to start a lot of Python processes on a single machine.
Here is a code snippet:
import subprocess
# Redirect each child's stdout to its own file
fout = open(path, 'w')
p = subprocess.Popen((python_path, module_name), stdout=fout, bufsize=-1)
After about 100 processes I'm getting the error below:
Running on Windows 10 64-bit, Python 3.5. I have already tried splitting the launches across two scripts as well as adding a sleep between starts, but after a certain number of processes the error still shows up. Any idea what might be causing this? Thanks a lot for any hint!
PS: Some background: each process opens database connections and makes some requests using the requests package; then some calculations are done using numpy, scipy, etc.
PPS: I just discovered this error message (when calling scipy):
DLL load failed: The paging file is too small for this operation to complete.

The issue was solved by reinstalling numpy and scipy and installing MKL.
The strange thing about this error was that it only appeared after a certain number of processes. I would love to hear if anybody knows why this happened!
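For anyone hitting a similar wall, here is a minimal sketch of throttling the launches, assuming the goal is simply to cap how many children run at once. The names max_children, python_path and module_names are placeholders, not from the original code:
import subprocess
import time

max_children = 50          # hypothetical cap; tune for your machine
running = []

for module_name in module_names:
    # Wait until a slot frees up before starting the next child
    while len(running) >= max_children:
        running = [p for p in running if p.poll() is None]
        time.sleep(0.5)
    fout = open(module_name + '.log', 'w')
    running.append(subprocess.Popen((python_path, module_name), stdout=fout, bufsize=-1))

# Wait for the remaining children to finish
for p in running:
    p.wait()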

Related

pandarallel package on windows infinite loop bug

So this is not really a question but rather a bug report for the pandarallel package.
This is the end of my code:
...
print('Calculate costs NEG...')
for i, group in tqdm(df_mol_neg.groupby('DELIVERY_DATE')):
    srl_slice = df_srl.loc[df_srl['DATE'] == i]
    srl_slice['srl_soll'] = srl_slice['srl_soll'].copy() * -1
    df_aep_neg.loc[df_aep_neg['DATE'] == i, 'SRL_cost'] = srl_slice['srl_soll'].parallel_apply(lambda x: get_cost_of_nearest_mol(group, x)).sum()
What happens here is that instead of executing the parallel_apply call, it loops back to the start of my code and repeats it all again. The exact same code works fine on my remote Linux machine, so I see two possible error sources:
Since pandarallel itself already has some difficulties with Windows, it might just be a Windows problem.
The other thing is that I currently use the early access version of PyCharm (223.7401.13) and its debugger, which might also be a problem source.
Other than this bug I can highly recommend the pandarallel package (at least for Linux users). It's super easy to use, and if you have some cores it can really shave off some time; in my case it shaved off a cool 90% of the runtime.
(Also, if there is a better way to report bugs, please let me know.)
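A guess rather than a confirmed fix: the "loops back to the start" symptom is what you typically see on Windows when a multiprocessing-based library (which pandarallel is) spawns workers and each worker re-imports a main module that lacks a main guard. A minimal sketch of the usual protection, assuming the loop above lives in a top-level script:
from pandarallel import pandarallel

def main():
    pandarallel.initialize()
    # ... the groupby loop with parallel_apply goes here ...

if __name__ == '__main__':
    # On Windows, multiprocessing re-imports this module in each worker;
    # the guard keeps the workers from re-running the whole script.
    main()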

Jupyter Lab shutting down and system being logged off on running Python code

I am using Jupyter Lab with Python for data analysis of cosmological data sets.
I use a Dell Vostro 5515 laptop with 16 GB RAM and a Ryzen 7 processor. My OS is Fedora 36 with KDE and Xfce environments.
The problem is that after running my .ipynb notebook for some time, it shuts down abruptly if I am in KDE. If I am in Xfce, it also closes all applications and logs out my session.
The crash happens mostly while running a function called compute_full_master from the pymaster library, but it has also happened occasionally while running some other functions.
I have tried to get error messages by running jupyter lab in --debug mode, but when the crash happens the terminal is also closed. I do not know how else to get the crash details.
I have tried running the code in Firefox, Chrome, and VS Code.
I am sorry if I have not provided all the necessary details; I am happy to provide more if anyone can help!
EDIT:
A simple example:
import numpy as np
import matplotlib.pyplot as plt

arr_len = 8394753
x = np.arange(arr_len)
# y_1 ... y_4 are complex arrays of length arr_len, defined in earlier cells
plt.figure(figsize=(25, 15))
plt.plot(x, y_1 - y_2)
plt.plot(x, y_1 - y_3)
plt.plot(x, y_1 - y_4)
plt.ylim((-1e-6, 1e-6))
The arrays y_1, y_2, y_3 and y_4 have length arr_len and are complex; the imaginary part does not matter. The notebook has already run some code in previous cells, but running this plotting cell a few times has caused the shutdown many times.
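Not a diagnosis, but a sketch of one way to rule out memory pressure from the plots themselves: draw a strided subset instead of all ~8.4 million points per line. The stride of 100 is an arbitrary choice, and arr_len and y_1 ... y_4 are assumed to be the names from the cell above:
import numpy as np
import matplotlib.pyplot as plt

step = 100                      # plot every 100th point instead of all ~8.4M
sl = slice(None, None, step)
x = np.arange(arr_len)

plt.figure(figsize=(25, 15))
plt.plot(x[sl], (y_1 - y_2).real[sl])   # imaginary part does not matter, so take .real explicitly
plt.plot(x[sl], (y_1 - y_3).real[sl])
plt.plot(x[sl], (y_1 - y_4).real[sl])
plt.ylim((-1e-6, 1e-6))
plt.show()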

Python3 scapy/kamene extremely slow

I was trying to use Pcap.net for some PCAP file analysis, which took around five seconds to loop through all available packets in a 1GB pcap file.
I'm now trying to use Scapy on Python3, which for whatever reason is called Kamene, but it's taking literally forever to parse the file, and CPU activity hits 100%, so I'm clearly doing something wrong. Here's the code:
from kamene.all import *

packetCount = 0
with PcapReader("C:\\Testing\\pcap\\maccdc2012_00000.pcap") as reader:
    for packet in reader:
        packetCount += 1
print(packetCount)
When running that, I get:
WARNING: No route found for IPv6 destination :: (no default route?).
This affects only IPv6
<UNIVERSAL><class 'kamene.asn1.asn1.ASN1_Class_metaclass'>
That UNIVERSAL message just gets repeated over and over, and after running for five minutes, I gave up. Does anyone have any idea on what is going on? Am I being dumb?
I've tried this both on Ubuntu and within Visual Studio on Windows (both virtualised).
First of all, you're not using Scapy :/
from https://scapy.net
An independent fork of Scapy was created from v2.2.0 in 2015, aimed at
supporting only Python3 (scapy3k). The fork diverged, did not follow
evolutions and fixes, and has had its own life without contributions
back to Scapy. Unfortunately, it has been packaged as python3-scapy in
some distributions, and as scapy-python3 on PyPI leading to confusion
amongst users. It should not be the case anymore soon. Scapy supports
Python3 in addition to Python2 since 2.4.0. Scapy v2.4.0 should be
favored as the official Scapy code base. The fork has been renamed as
kamene.
Uninstalling kamene and running pip install scapy or pip3 install scapy (or getting it from GitHub) might help.
Once you've done that, you will find tips on how to speed up Scapy, starting from 2.4.4, in the Performance section of the docs.
That being said, Scapy isn't designed to handle very large amounts of data (it is aimed at being easy to use rather than fast). It will probably take some time to handle 1 GB anyway :/ (Also, Python is slower than languages like C at tasks such as packet dissection; you will probably never match Wireshark's speed in Python.)
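If the goal is just to count packets rather than dissect them, a sketch using Scapy's RawPcapReader (which yields raw bytes plus capture metadata and skips dissection entirely) should be considerably faster. This assumes Scapy proper (2.4.x or later), not kamene:
from scapy.utils import RawPcapReader

packet_count = 0
with RawPcapReader("C:\\Testing\\pcap\\maccdc2012_00000.pcap") as reader:
    for _pkt_data, _pkt_metadata in reader:   # raw bytes + metadata, no per-packet dissection
        packet_count += 1
print(packet_count)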

matplotlib.pyplot.hist() hangs if size of bins is too large?

I am plotting histograms and I found this on Stack Exchange, which works great:
histogram for discrete values
Here is the code posted there:
import matplotlib.pyplot as plt
import numpy as np
data = range(11)
data = np.array(data)
d = np.diff(np.unique(data)).min()
left_of_first_bin = data.min() - float(d)/2
right_of_last_bin = data.max() + float(d)/2
plt.hist(data, np.arange(left_of_first_bin, right_of_last_bin + d, d))
plt.show()
I am using it with a case where d = 2.84e-5; the output of np.arange() above is then 68704 elements long. If I run this from the Python interpreter (Python 3.5) on Ubuntu 14.04 in an Anaconda environment, the system hangs and I cannot recover without Ctrl-C, which kills the interpreter. I am wondering if there is a limit on the number of bins in plt.hist(), or if there is something inherently wrong with this approach. If it were a limitation, I would expect an error rather than a hang. The code works fine if d is not too small. The length of my data might be a factor as well; it was 22289. I guess it could just be churning and I am not waiting long enough?
I searched for matplotlib.pyplot.hist limitations and other variations and could not find anything. The documentation from what I can tell does not mention a limit. Thank you.
It looks like there is no real hang; it just takes a very long time because the data set is large and the bin widths are so small. I noted that with d = .001 it took about 30 seconds on my machine to render the plot. Sorry for the trouble, I thought I had found a potential bug and, as a newbie, got excited.
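For what it's worth, a sketch of one way to keep very fine bins responsive: compute the counts with np.histogram and draw a single step line, instead of letting plt.hist build tens of thousands of rectangle patches (same toy data as the snippet above):
import matplotlib.pyplot as plt
import numpy as np

data = np.array(range(11))
d = np.diff(np.unique(data)).min()
bins = np.arange(data.min() - d / 2, data.max() + d / 2 + d, d)

counts, edges = np.histogram(data, bins=bins)   # fast, pure-NumPy binning
plt.step(edges[:-1], counts, where='post')      # one artist instead of one patch per bin
plt.show()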

"Out of Memory Error (Java)" when using R and XLConnect package

I tried to load a ~30 MB Excel spreadsheet into R using the XLConnect package.
This is what I wrote:
wb <- loadWorkbook("largespreadsheet.xlsx")
And after about 15 seconds, I got the following error:
Error: OutOfMemoryError (Java): GC overhead limit exceeded.
Is this a limitation of the XLConnect package or is there a way to tweak my memory settings to allow for larger files?
I appreciate any solutions/tips/advice.
Follow the advice from their website:
options(java.parameters = "-Xmx1024m")
library(XLConnect)
If you still have problems importing XLSX files, you can use this option. The answer with "-Xmx1024m" didn't work for me, so I changed it to "-Xmx4g":
options(java.parameters = "-Xmx4g" )
library(XLConnect)
This link was useful.
Use read.xlsx() from the openxlsx package. It has no dependency on rJava and thus only has the memory limitations of R itself. I have not explored it in much depth for writing and formatting XLSX files, but it has some promising-looking vignettes. For reading large spreadsheets, it works well.
Hat tip to #Brad-Horn. I've just turned his comment into an answer because I also found this to be the best solution!
In case someone encounters this error when reading not one huge file but many files, I managed to solve it by freeing Java Virtual Machine memory with xlcFreeMemory(), thus:
library(XLConnect)

files <- list.files(path, pattern = "*.xlsx")
for (i in seq_along(files)) {
    wb <- loadWorkbook(...)
    ...
    rm(wb)
    xlcFreeMemory()  # <= free Java Virtual Machine memory!
}
This appears to be the case when you keep using the same R session over and over again without restarting RStudio. Restarting RStudio helps allocate a fresh memory heap to the program. It worked for me right away.
Whenever you are using a library that relies on rJava (such as RWeka in my case), you are bound to hit the default heap space limit (512 MB) some day. Now, when you are using Java, we all know the JVM argument to use (-Xmx2048m if you want 2 gigabytes of RAM). Here it's just a matter of how to specify it in the R environment:
options(java.parameters = "-Xmx2048m")
library(rJava)
As suggested here, make sure to run the options() call in the first line of your code. In my case, it worked only when I restarted the R session and ran it in the first line:
options(java.parameters = "-Xmx4g" )
library(XLConnect)
