Edit 2/5/17: After some work with our systems department, it seems that this happens when Linux is low on inodes. My question is, therefore: why are the two related, and how could I have known this from the error message?
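For anyone who hits the same thing: df -i on the relevant filesystem is the quick way to check inode usage. The same check can be made programmatically; here is a minimal C++ sketch, assuming a POSIX system (the path to check is whatever directory parpool uses for its job storage, which depends on your cluster profile):

#include <sys/statvfs.h>
#include <cstdio>

// Report available vs. total inodes for the filesystem containing `path`.
// Programs that create many small files (as parpool does for its job
// storage) start failing in odd ways when this approaches zero.
int main(int argc, char** argv) {
    const char* path = (argc > 1) ? argv[1] : ".";
    struct statvfs vfs;
    if (statvfs(path, &vfs) != 0) {
        std::perror("statvfs");
        return 1;
    }
    std::printf("%s: %lu of %lu inodes available\n", path,
                (unsigned long)vfs.f_favail, (unsigned long)vfs.f_files);
    return 0;
}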
Problem Details:
I run Matlab R2016b on Linux (CentOS 6.3), and for the past couple of days I keep getting an unfamiliar error when I try to open parallel threads. Specifically, writing
parpool(3);
yields, as always,
>>starting parallel pool (parpool) using the local profile ....
But then, after a short while, I get
>> Caught unexpected fl::except::IInternalException
and it crashes. (The double 'I' in Internal is intentional).
Thanks.
First and foremost: I am completely unable to create an MCVE, as I can only reproduce this when running the full code; any attempt to measure or replicate the error in a simpler environment makes it disappear. TL;DR: I suspect it's not a code problem but a configuration problem.
I have a piece of code that does some mathematics in CUDA kernels. I have a Windows machine (Win10 x64, GTX 1050, CUDA 9.2) and a Linux machine (Ubuntu 17.04, 2x GTX 1080 Ti, CUDA 9.1).
My code runs fine on the Windows machine. It is long-running (~700 ms per kernel call for big samples), so I needed to increase the TDR value in Windows. The code also (for now) forces everything to run on one GPU, the first one, selected with cudaSetDevice(0).
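(For reference, the TDR timeout lives in the registry under HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\GraphicsDrivers, as a REG_DWORD value named TdrDelay, in seconds.)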
When I copy the same input data and code to the Linux machine (I am using git; it is the same code), I get either
an illegal memory access was encountered
or
unspecified launch failure
in my error checking after the GPU call.
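For context, the error checking follows the usual pattern of querying the CUDA runtime right after the launch; a minimal sketch, not my exact code (myKernel is a placeholder):

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Generic CUDA error-checking helper: report and abort on any
// non-success status from a runtime API call.
#define CUDA_CHECK(call)                                               \
    do {                                                               \
        cudaError_t err = (call);                                      \
        if (err != cudaSuccess) {                                      \
            std::fprintf(stderr, "CUDA error: %s at %s:%d\n",          \
                         cudaGetErrorString(err), __FILE__, __LINE__); \
            std::exit(EXIT_FAILURE);                                   \
        }                                                              \
    } while (0)

// Usage after a launch:
//   myKernel<<<grid, block>>>(args...);
//   CUDA_CHECK(cudaGetLastError());       // catches launch errors
//   CUDA_CHECK(cudaDeviceSynchronize());  // catches execution errors such
//                                         // as "illegal memory access"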
If I change the kernel so that, instead of doing the math, it just writes a number to the output, the kernel executes properly. Other CUDA code (different functions that I have) works fine too. All this leads me to think that the problem is not with the code itself, nor with the general configuration of the drivers/environment variables, but somewhere else.
I read that xorg.conf can have an effect on the timeout of the kernels. I generated an xorg.conf (I had none) and removed the devices from it, as suggested here. I am connecting to the server remotely and have no monitor plugged in. This changes nothing in the behavior; my kernels still error.
My question is: what else should I look at? What Linux-specific configuration should I check to pinpoint the cause of the kernel halts?
The error ended up being, indeed, an illegal memory access.
It was caused by the fact that sizeof(unsigned long) is machine specific: my Linux machine returns 8, while my Windows machine returns 4. Because this code is called from MATLAB, and MATLAB (like some other high-level languages, such as Python) defines variable sizes in bits (e.g. uint32(1)), there was a size mismatch on the Linux machine when doing memcpys. It turned out that this happened in a variable holding an index, so the kernels were reading garbage (due to the bad memcpy) and then trying to access another array at that location, producing the illegal memory access.
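To illustrate the fix, a minimal sketch (the variable names are made up, not my actual code):

#include <cstdint>
#include <cstring>
#include <cstdio>

int main() {
    // LP64 Linux: 8; LLP64 Windows: 4 -- the root of the mismatch.
    std::printf("sizeof(unsigned long) = %zu bytes\n", sizeof(unsigned long));

    uint32_t from_matlab = 42;  // what MATLAB's uint32(42) hands over

    // Buggy pattern (reads 4 bytes past the source on Linux):
    //   unsigned long idx;
    //   std::memcpy(&idx, &from_matlab, sizeof(unsigned long));

    // Fix: a fixed-width destination, so both sides agree on the size.
    uint32_t idx = 0;
    std::memcpy(&idx, &from_matlab, sizeof idx);
    std::printf("idx = %u\n", idx);
    return 0;
}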
Too specific? Yeah.
I've been trying to install MIT Scheme on a 64-bit Windows 10 installation; however, whenever I try to start the program I get the following error message:
>>The system has trapped within critical section "band load".
>>The trap is an ACCESS_VIOLATION trap.
>>Successful recovery is unlikely.
Then I'm presented with the option to try to recover, but the program then crashes with another ACCESS_VIOLATION.
I have already tried installing it in different directories and drives, setting the heap size, running with and without --edit and running it in multiple compatibility modes.
GDB reports this:
Thread 1 received signal SIGSEGV, Segmentation fault.
0x779f2a4c in win32u!NtUserMessageCall () from C:\WINDOWS\SysWOW64\win32u.dll
Is there a way to fix this problem or should I just not bother and try another implementation?
Thank you for your help!
I recently ran this on my system:
ulimit -c unlimited
and, as designed, it created a core file for the program I use. Ever since, I have had random crashes in my program, but I haven't had a chance to check the core dump to see what errors it gave: the program restarts daily, so I assume the previous core files are gone. If they are not, please tell me so I can look them up.
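(For what it's worth: unless /proc/sys/kernel/core_pattern says otherwise, cores are usually written to the process's working directory, and an old one can be opened with gdb /path/to/program /path/to/core followed by the bt command to get a backtrace.)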
But my question is: is there any possible way that this new ulimit command could be the cause of the server crashes? For years I've run the same program with no crashes, and since this command I have had random crashes from time to time; it somewhat feels like it loops for around 5 minutes and then restarts the program.
Any help is appreciated, as I cannot reproduce the issue.
I have a very strange problem with one of my systems. There are two components:
uClinux running on a NIOS board.
A PowerPC running an old CentOS.
There is an open socket between the two boards with constant text commands passing back and forth. I have several systems with this setup.
However, one of them has this strange error: the socket disconnects around midnight, throwing a broken pipe error. Does anyone know what particular setting configures this behaviour? I doubt it is my software, because it works just fine on several other systems.
So, to summarize the results: I couldn't find what was causing the broken pipe error right at midnight, but I was able to mitigate its effects by ignoring the SIGPIPE signal, as suggested by this post.
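In practice that amounts to something like the following; a minimal C++ sketch, with a hypothetical wrapper (send_command) to show the recovery path:

#include <csignal>
#include <cerrno>
#include <cstdio>
#include <sys/types.h>
#include <sys/socket.h>

// With SIGPIPE ignored, writing to a socket whose peer has gone away
// returns -1 with errno == EPIPE instead of killing the process, so
// the failure can be handled like any other error.
void setup_signals() {
    std::signal(SIGPIPE, SIG_IGN);
}

// Hypothetical wrapper around send() showing the recovery path.
bool send_command(int fd, const char* buf, size_t len) {
    ssize_t n = send(fd, buf, len, 0);
    if (n < 0 && errno == EPIPE) {
        std::fprintf(stderr, "peer closed the connection; reconnect\n");
        return false;  // caller reconnects and retries
    }
    return n == (ssize_t)len;
}

(On Linux, passing MSG_NOSIGNAL to send() achieves the same per call, without changing the process-wide signal disposition.)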
I'm running Tomcat 7 with Apache 2.2 and mod_jk 1.2.26 on a Debian Lenny x64 server with 2 GB of RAM.
I have a strange problem with my server: every several hours, and sometimes (under load) every several minutes, my Tomcat AJP connector pauses with a memory leak error. This error also seems to affect some other parts of the system (some other running applications stop working too), and I have to reboot the server to solve the problem for a while.
I've checked catalina.out for several days, but it seems there is no unique error pattern just before the AJP connector pauses with this message:
INFO: Pausing ProtocolHandler ["ajp-bio-8009"]
Sometimes there is this message before pausing:
Exception in thread "ajp-bio-8009-Acceptor-0" java.lang.OutOfMemoryError: unable to create new native thread
at java.lang.Thread.start0(Native Method)
at java.lang.Thread.start(Thread.java:597)...
& sometimes this one:
INFO: Reloading Context with name [] has started
Exception in thread "ContainerBackgroundProcessor[StandardEngine[Catalina]]" java.lang.OutOfMemoryError: unable to create new native thread
at java.lang.Thread.start0(Native Method)
at java.lang.Thread.start(Thread.java:597)
at org.apache.catalina.core.StandardContext.stopInternal(StandardContext.java:5482)
at org.apache.catalina.util.LifecycleBase.stop(LifecycleBase.java:230)
at org.apache.catalina.core.StandardContext.reload(StandardContext.java:3847)
at org.apache.catalina.loader.WebappLoader.backgroundProcess(WebappLoader.java:424)
at org.apache.catalina.core.ContainerBase.backgroundProcess(ContainerBase.java:1214)
at org.apache.catalina.core.ContainerBase$ContainerBackgroundProcessor.processChildren(ContainerBase.java:1400)
at org.apache.catalina.core.ContainerBase$ContainerBackgroundProcessor.processChildren(ContainerBase.java:1410)
at org.apache.catalina.core.ContainerBase$ContainerBackgroundProcessor.processChildren(ContainerBase.java:1410)
at org.apache.catalina.core.ContainerBase$ContainerBackgroundProcessor.run(ContainerBase.java:1389)
at java.lang.Thread.run(Thread.java:619)
java.sql.SQLException: null, message from server: "Can't create a new thread (errno 11); if you are not out of available memory, you can consult the manual for a possible OS-dependent bug"...
and at other times the output contains messages related to other parts of the program.
I've checked my application source code, and I don't believe it causes the problem; I've also checked memory usage using JConsole. The curious point is that when the server fails, it shows a lot of free memory in both the heap and non-heap JVM memory spaces. As I said before, after the server crashes, many other applications also fail, and when I try to restart them I get a 'resource temporarily unavailable' message (I've also checked my limits.conf file).
I have been really confused by this serious problem for many days now and have run out of ideas. Can anybody please give me any kind of suggestion for solving this complicated and unknown problem?
What could be the most likely reason for this error?
What are your limits for the number of processes?
Check them with ulimit -a and look at the maximum number of user processes. If it's 1024, increase it.
Also, check the same thing for the user you use to start it (for example, if you are using the nobody user for your stuff, run su -c "ulimit -a" -s /bin/sh nobody to see what that user actually sees as its limits). That should show you the problem (I had this a couple of days ago and completely forgot to check it).
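The same limit can also be read from inside a process; a minimal C++ sketch for Linux (this is RLIMIT_NPROC, the limit that "unable to create new native thread" usually points at, rather than the Java heap):

#include <sys/resource.h>
#include <cstdio>

// Print the process/thread limit exactly as this process sees it.
int main() {
    struct rlimit rl;
    if (getrlimit(RLIMIT_NPROC, &rl) != 0) {
        std::perror("getrlimit");
        return 1;
    }
    std::printf("RLIMIT_NPROC: soft=%llu hard=%llu\n",
                (unsigned long long)rl.rlim_cur,
                (unsigned long long)rl.rlim_max);
    return 0;
}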
At the moment this starts happening, you can also count all running threads and processes for that user (or, even better, monitor it using rrdtool or something else) with ps -eLf | wc -l, which gives you a simple count of all processes and threads running on your system. This information, together with the limits for each particular user, should solve your issue.
Use jvisualvm to check the heap usage of your JVM. If you see it slowly climbing over a period of time, that is a memory leak. Sometimes a memory leak is short-term and eventually gets cleared up, only to start again.
If you see a sawtooth pattern, take a heap dump near the peak of the sawtooth; otherwise, take a heap dump after the JVM has been running long enough to be at high risk of an OOM error. Then copy that .hprof file to another machine and use the Eclipse MAT (Memory Analyzer Tool) to open it and identify likely culprits. You will still need to spend some time following references through the data structures, and reading some Javadocs, to figure out just what is using that HashMap or List that is growing out of control. The sorting options are also useful for focusing on the most likely problem areas.
There are no easy answers.
Note that there is also a command-line tool included with the Sun JVM that can trigger a heap dump. And if you have a good profiler, that can also be of use, because memory leaks are usually in a piece of code that is executed frequently, and will therefore show up as a hot spot in the profiler.
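(On Sun/Oracle JDKs that tool is jmap; something along the lines of jmap -dump:format=b,file=heap.hprof <pid> writes an .hprof file that MAT can open.)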
I finally found the problem: it was not actually a memory leak; rather, the limit on the number of threads allowed for the VPS was causing it. My server was a Xen VPS with a default limit of 256 threads, so when it reached the maximum, the supervisor killed some of the running threads (which is why some of my running processes were stopping). By increasing the number of allowed threads to 512, the problem was totally solved (of course, if I increase maxThreads in the Tomcat settings, the problem will obviously appear again).