skb allocation failures in 2.6.32 - linux

We are running CentOS 6.3 (based on 2.6.32) and under high load we receive order 0 allocation failures when allocating skb. This problem is not inspected on CentOS 5.4 (based on 2.6.18).
This problem is very similar to https://bugzilla.kernel.org/show_bug.cgi?id=14141.
The bug 14141 is closed by some patch (related to tty), that I don't sure is related to the problem.
Here is a very long discussion in lkml regarding the bug (see https://lkml.org/lkml/2009/10/2/86), which ends by some possible leads regarding congestion_wait() and more, but without patch.
I saw some proposals to inrcease vm.min_free_kbytes. IMHO it's not a solution, but rather workaround.
I suppose the problem was solved, but can't find when and how.
Any ideas?
Thanks

Related

FAISS search fails with vague error: "Illegal instruction" or kernel crash

Currently trying to run a basic similarity search via FAISS with reproducible code from that link. However, every time I run the code in the following venues, I have these problems:
Jupyter notebook - kernel crashes
VS Code - receive "Illegal Instruction" message in the terminal with no further documentation
I've got similar code working in Kaggle, so I suppose the problem is with my particular setup.
Based on the print statements, it appears that the error occurs during the call of the .search method. Because of how vague this error is, I've not been able to find much information on the problem. It seems that some people mentioned older processors may have a problem (AVX/AVX2 flags being the culprit?), though admittedly I didn't quite understand the connections.
Problem: Can I get some help understanding this error, and if possible, a potential solution?
Current setup:
WSL2
VSCODE (v. 1.49.0)
Jupyter-client (v. 6.1.7)
Jupyter-core (v. 4.6.3)
FAISS-cpu (v. 1.6.3)
Numpy (v. 1.19.2)
Older machine (AMD FX-8350 with 16GB RAM)
For anyone that runs across this error, the problem (in my case) was that my CPU was old enough that it doesn't support AVX2. To determine this, I used this SO post.
Once I ran the code in Colab or on a newer machine, all was well.

Understanding kernel patch submitting check list? related to testing patch?

What does this mean by following checklist rule
Tested after it has been merged into the -mm patchset to make sure
that it still works with all of the other queued patches and various
changes in the VM, VFS, and other subsystems.
mm patchset means memory management related patches ?
Please correct my understanding.
Once upon a time, the -mm patchset indeed contained memory management related patches for future verions of the kernel.
That checklist entry is outdated; you should test against the linux-next tree.

Netlogo 5.1 (and 5.05) Behavior Space Memory Leak

I have posted on this before, but thought I had tracked it down to the NW extension, however, memory leakage still occurs in the latest version. I found this thread, which discusses a similar issues, but attributes it to Behavior Space:
http://netlogo-users.18673.x6.nabble.com/Behaviorspace-Memory-Leak-td5003468.html
I have found the same symptoms. My model starts out at around 650mb, but over each run the private working set memory rises, to the point where it hits the 1024 limit. I have sufficient memory to raise this, but in reality it will only delay the onset. I am using the table output, as based on previous discussions this helps, and it does, but it only slows the rate of increase. However, eventually the memory usage rises to a point where the PC starts to struggle. I am clearing all data between runs so there should be no hangover. I noticed in the highlighted thread that they were going to run headless. I will try this, but I wondered if anyone else had noticed the issue? My other option is to break the BehSpc simulation into a few batches so the issues never arises, bit i would be nice to let the model run and walk away as it takes around 2 hours to go through.
Some possible next steps:
1) Isolate the exact conditions under which the problem does or not occur. Can you make it happen without involving the nw extension, or not? Does it still happen if you remove some of the code from your model? What if you keep removing code — when does the problem go away? What is the smallest code that still causes the problem? Almost any bug can be demonstrated with only a small amount of code — and finding that smallest demonstration is exactly what is needed in order to track down the cause and fix it.
2) Use standard memory profiling tools for the JVM to see what kind of objects are using the memory. This might provide some clues to possible causes.
In general, we are not receiving other bug reports from users along these lines. It's routine, and has been for many years now, for people to use BehaviorSpace (both headless and not) and do experiments that last for hours or even for days. So whatever it is you're experiencing almost certainly has a more specific cause -- mostly likely, in the nw extension -- that could be isolated.

Reloading Flash 17 times causes error #2046 and requires a browser restart

I am encountering some very strange behaviour with a Flex 4.1 app I am writing which gets in the way of testing. It seems that I can reload the app 16 times and then on the 17th, the loading process fails with
Error #2046: The loaded file did not have a valid signature
It seems to be consistently happening on the 17th reload on both Firefox 5.0 and Chrome 12. I am not sure if it's relevant, but I am running Flash Player v10.2.159.1 (also happens with 10.3.181.34) on Ubuntu 10.04. Happens with both regular and debugger versions of the player. When I run the app on Windows FF5, it doesn't seem to happen. Closing the current browser window does not seem to fix it. The only way around it is to completely close all browser windows and restart the browser. And then again after 16 successful loads, the 17th fails.
At this point I'm thinking of chalking it as a Linux Flash bug but I'd like to make sure and check if anyone knows if there's something I should be doing to prevent this.
The user from this post seems to have had the same problem but I guess he didn't notice the pattern I have.
Any help will be greatly appreciated.
Ruy
== UPDATE ==
I just realized that after my app starts throwing the 2046 error, trying to load any other Flash that uses signed RSLs also shows the 2046 error (e.g. this app), which means the problem is not specific to my app and most likely related to the Flash cache or something of the sort.
Disclosure: I am a Flash Player Developer at Adobe.
This is unlikely to get much attention as it is Linux only and kind of an edge case: Probably annoying during dev work but very few users will reload the same page more than 16 times. It might also be a browser issue. But it is probably us :) I will look at the jira tomorrow and see if I can bump it up a bit, but I'll be honest in that it is really an edge case and unlikely to get much love. If you want to increase your chances make sure to add the most simple .swf test case you can make to the bug. Also please double check if it still happens with the latest beta.
I also just took a look at the earlier bug reports and forum posts, you probably should post this as a Flash Player bug, not as Flex.
Long shot guess, but it sounds similar to a problem we had.... in the project properties - Flex Build Path - Framework Linkage - change to "merged into code". This fixes a problem very similar to what you are describing, though I wish I knew exactly what the cause is. Good luck!
tl;dr: No idea on the cause, posting random possibility in hope it might give someone else an idea or two for testing.
Considering that it seems to be an unresolved bug in Adobe issue tracker, its unlikely that you will get any definitive answer here. Considering it occurs on both Firefox & Chrome, let's rule out browser bugs and assume it is in either some common library (Flash) or OS API (Linux kernel implementation). A comment in one of the jira issues specifically mentions killing Flash process fixes it, so its a Flash issue and not OS bug.
The most interesting thing I can see here is your observation that it succeeds for exactly 16 times before failing to load. Time for some speculation here, from someone who's never worked on kernel or crypto dev:
With a 2048 bit RSA key and 32k cache for storing them, 16 keys would fit before adding another one fails - so one conjecture is that each time this file is loaded, Flash is caching the signed value (possibly a hashed version) for some reason - maybe to keep track of allowed & used security permissions etc.? If this entry is not removed, then once it is full all file loads will fail if caching the signature is part of checking it.
Things you can experiment with:
Reduce size of app to see if page can be reloaded more often (as suggested by stackfish)
Count number of signed RSLs used and if its a power/multiple of 2 (maybe others get the error after 32 page loads if they use half the no. of signed libs?)
Check if Linux Flash plugin has some option to increase credentials cache or something (or decrease it, just to see if it impacts the no. of loads - if so, could be related to the problem)
I expect that to actually find a solution, you'd have to dive into the library loading code and look at all constants related to loading signed libs that are 4, 16, or a multiple of 16 to see if they might be responsible - in short, unlikely to be soluble by others outside of Flash dev team imho :/
This behavior could be related to a memory leak caused either by the Flex implementation, or the browser plugin. Firefox is notorious for not cleaning up memory anyway and the footprint will continue to grow the longer you have the same browser window open.
If you reduce the size of your flex app to produce something very tiny, does the number of times you can reload the page go up?
Error #2046 on win vista, 64 bit machine wit 1000 mb ati radeon videocard
problem occurs only in msn video sofar
I meet a same problem when I use ppt on icourse163.org . when I open the course site, I can't see the ppt but I use chrome can do it .there are the same flash version(32.0.0.344),and Then ,I copy the tar.gz file that downloaded from adobe.
usr/* to /usr.I solved it.wish can help you .

Best strategy to check harddisk issues on Red Had Linux

What is the best strategy(and utility) to check hard disk related issues on red hat Linux ?
Depends on the issue. If it's some sort of failure-related thing, check dmesg for error messages and smartctl for info on what might have gone wrong. If it's performance problems, check smartctl for remapped blocks and sar -d 1 0 to see how hard the disk is actually being thrashed.
Beyond that, what issues are you actually having? More detailed questions elicit more detailed answers.

Resources