Sysmgr volatile memory bug - Cisco

As you may know, there is a bug on Nexus switches called SYSMGR-2-VOLATILE_DB_FULL that affects system versions below 5.0(2)N2(1); it causes a switch to crash and reboot once the /dev/shm directory reaches 100% utilization, unless the switch is upgraded to a later version.
To fill the directory, you can run long commands such as "show run" (the output needs to be over 190 lines) and then check how usage increases by running:
show system internal flash
show system internal dir /dev/shm | i csm_acfg | count
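(For reference, polling that count remotely might look like the sketch below; the Netmiko library, the management IP, and the credentials are all assumptions for illustration, not details from the original post.)

    import time
    from netmiko import ConnectHandler   # third-party: pip install netmiko

    # Placeholder device details -- substitute your own.
    device = {
        "device_type": "cisco_nxos",
        "host": "10.0.0.1",
        "username": "admin",
        "password": "secret",
    }

    # Poll the csm_acfg file count in /dev/shm once a minute to watch it grow.
    conn = ConnectHandler(**device)
    try:
        while True:
            out = conn.send_command("show system internal dir /dev/shm | i csm_acfg | count")
            print(f"{time.strftime('%H:%M:%S')} csm_acfg files: {out.strip()}")
            time.sleep(60)
    finally:
        conn.disconnect()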
I was wondering if there is a similar bug on 4500 switches running:
Catalyst 4500 L3 Switch Software (cat4500es8-UNIVERSALK9-M), Version 03.11.00.E RELEASE SOFTWARE (fc3)
So what exactly happened...
I have a script that runs from time to time, pulls over 190 lines of output from all of our switches, and performs some actions remotely. Recently, a few minutes after the script ran, we had a massive outage: our core switch lost power (at least, that is what I was able to see from the logs). The thing is, there are two 4500 chassis configured with SSO redundancy, so the failover should have been near-instantaneous; instead, everything was down for about 8 minutes before the standby switch became active.
Can anyone please advise whether there is a similar bug on 4500 switches?
Thank you.

After analysing the crashinfo, I was able to find some things that caused the crash, although I won't be able to say with 100% certainty exactly what brought it down.
There are a few errors called VFETQINTERRUPT and VFETQTOOMANYPARITYERRORS. Basically, VFETQINTERRUPT counts rapidly accruing errors, and VFETQTOOMANYPARITYERRORS will crash and reboot the switch if it exceeds 100 errors in a short period of time, which can indicate a hardware fault.
And this is pretty much what happened in our environment: something caused 100+ errors and the switch crash-rebooted.
There is a command to stop it from crash-rebooting, but I'm not sure it should be used: if there is a hardware issue, it is better to fail over to the other supervisor.
platform fw-asic dbl hash memory parity-error reload never

Related

Is it unhealthy for an SSD if I write a 'vital signal' to check that a Python program is running?

A Python program that I'm building used to die for no apparent reason. I couldn't figure out why, so my workaround was to add a few lines that write the current time to a 'vitality' file every time a certain line in the program is executed, which happens about every 0.1 seconds.
A separate script reads the 'vitality' file every second, and when the vital sign hasn't updated for, say, 10 seconds, the script kills the program and restarts it.
So far this workaround has been working great for the original problem, but now I'm rather concerned about whether it will degrade the SSD.
Does writing a 10-digit Unix timestamp to a file every 0.1 s have a negligible effect on SSD health, or would it wear the SSD out quickly?
Doing that will degrade the SSD and destroy it over time.
In my last job, the SSD health tool (smartctl) indicated that the 15 SSDs in our cluster product were wearing rapidly and had only months of life left. The team found that a third-party software package (etcd) was syncing a small amount of data to a filesystem on the SSD once per second, and each sync wrote at least one entire 16K block. Luckily, the problem was found early enough that we could patch it in a software update before suffering too many customer returns.
Write the 'vitality' file somewhere else. It could live on a tmpfs such as /var/run/user/. Or use a different vitality mechanism; something like supervisord can manage your task, run health checks, and restart it on failure.
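For what it's worth, here is a minimal sketch of that heartbeat pair with the file moved to a tmpfs path, so the writes stay in RAM (the /dev/shm path, the 10-second timeout, and the function names are assumptions for illustration):

    import os
    import subprocess
    import time

    HEARTBEAT = "/dev/shm/vitality"   # tmpfs: lives in RAM, so no SSD wear
    TIMEOUT = 10                      # seconds without an update before restarting

    def beat():
        # Call this from the monitored program (about every 0.1 s in the question).
        with open(HEARTBEAT, "w") as f:
            f.write(str(int(time.time())))

    def watchdog(cmd):
        # Restart cmd whenever the heartbeat file goes stale or the process dies.
        proc = subprocess.Popen(cmd)
        while True:
            time.sleep(1)
            try:
                stale = time.time() - os.path.getmtime(HEARTBEAT) > TIMEOUT
            except OSError:   # heartbeat file not written yet
                stale = False
            if stale or proc.poll() is not None:
                proc.kill()
                proc.wait()
                open(HEARTBEAT, "w").close()   # reset mtime so we don't loop on restarts
                proc = subprocess.Popen(cmd)

Checking the file's mtime avoids even having to read its contents, and supervisord gives you the same restart-on-failure behavior without writing the watchdog yourself.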

Running perf record with the Intel PT event on compiled binaries from SPEC CPU2006 crashes the server machine

I am having a recurring problem when using perf with the Intel PT event. I am currently profiling on an Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz machine, with x86_64 architecture and 32 hardware threads, with virtualization enabled. I specifically use programs/source code from SPEC CPU2006 for profiling.
I am specifically observing that the first time I profile one of the compiled binaries from SPEC CPU2006, everything works fine and the perf.data file gets generated, as expected with Intel PT. As SPEC CPU2006 programs are computationally intensive (using 100% of a CPU at any time), the perf.data files are understandably large for most programs; I obtain roughly 7-10 GB perf.data files for most of the profiled programs.
However, when I try to profile the same compiled binary a second time, after the first run has finished successfully, my server machine freezes. Sometimes this happens on the third or fourth attempt instead (after the earlier runs completed successfully); the behavior is highly unpredictable. Once it happens, I cannot profile any more binaries until I have restarted the machine.
I have also posted the server error logs that I captured once I saw the machine had stopped responding.
Server error logs
Clearly there is an error message saying Fixing recursive fault but reboot is needed!
This happens particularly with the larger SPEC CPU2006 binaries, those that take more than a minute to run without perf.
Is there any particular reason why this might happen? It should not be due to high CPU usage, as running the programs without perf, or with perf using any other hardware event (anything listed by perf list), completes successfully. It only seems to happen with Intel PT.
Please guide me through the steps to solve this problem. Thanks.
It seems I have resolved this issue now, so I will post an answer.
The server crashed because of a NULL pointer dereference on a specific member of the perf_event structure; the member perf_event->handle was the culprit. This information, as suggested by @osgx, was obtained from the /var/log/syslog file. A portion of the error message was:
Apr 19 04:49:15 ###### kernel: [582411.404677] BUG: unable to handle kernel NULL pointer dereference at 00000000000000ea
Apr 19 04:49:15 ###### kernel: [582411.404747] IP: [] perf_event_aux_event+0x2e/0xf0
One possible scenario where this structure member turns out to be NULL is if I start capturing packets again before an earlier run of perf record has finished releasing all of its resources. This is properly handled in kernel version 4.10; I was using kernel version 4.4.
I upgraded my kernel to the newer version and it works fine now!
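As a belt-and-braces check, one could verify the kernel version before tracing and make sure each perf record fully exits before the next one starts; here is a sketch (the binary name is a placeholder, and intel_pt// is perf's event syntax for Intel PT):

    import platform
    import subprocess

    MIN_KERNEL = (4, 10)   # the race on perf_event->handle is handled from 4.10 on

    def kernel_ok():
        # platform.release() returns e.g. "4.4.0-116-generic"
        major_minor = platform.release().split("-")[0].split(".")[:2]
        return tuple(int(x) for x in major_minor) >= MIN_KERNEL

    def trace(binary):
        # subprocess.run blocks until perf has exited and released its resources,
        # so back-to-back calls never overlap a previous tracing session.
        subprocess.run(["perf", "record", "-e", "intel_pt//", "--", binary], check=True)

    if __name__ == "__main__":
        if not kernel_ok():
            raise SystemExit("kernel older than 4.10; upgrade before repeated Intel PT runs")
        trace("./placeholder_speccpu_binary")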

javaw.exe consumes memory on starting STS

At first I thought my program had memory leaks. But I terminated all Java processes and restarted Spring Tool Suite, keeping an eye on the Task Manager. In just a few minutes, javaw.exe had grown to 2,000,000 K of memory. The memory keeps going up without my issuing any commands in STS. STS has literally ONLY been opened; I have no tabs open in it. The error log doesn't show any memory-related errors. Upon closing STS, javaw.exe DOES disappear from Task Manager, and opening STS restarts the process again at around 150,000K, quickly jumping to 600,000K, then slowly growing and growing until it has consumed all my memory.
Any thoughts what might be causing this? I'm running a full system scan now just in case I've been compromised.
--edit--
This problem started around 10 AM Eastern and mysteriously went away at noon, when the security scan completed. No items were detected by the scan that would explain either the problem or its mysterious resolution. As of now javaw.exe is hovering at or around 700,000K. Very strange!
Sounds like a 2-hour bug! Be thankful it's gone, but be sure to document it thoroughly if it occurs again. Sounds like a rough 2 hours you went through.
That is not completely unusual, unfortunately. Because Eclipse is made up of a bunch of plug-ins, sometimes a plug-in can go wild and start consuming memory and/or CPU. Using VisualVM (http://visualvm.java.net/) you can determine what is causing Eclipse to freak out. Depending on what it is, you might be able to disable that functionality. Because it could be any of so many different plug-ins, it doesn't surprise me that you could not find any answers googling or looking here on StackOverflow.
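If the growth does come back and you want it documented (as the first comment suggests), one option is to sample javaw.exe's memory from a small script; here is a sketch using the third-party psutil package (the 30-second interval is an arbitrary choice):

    import time
    import psutil   # third-party: pip install psutil

    # Log javaw.exe's resident memory periodically so a recurrence is on record.
    while True:
        for proc in psutil.process_iter(["name", "memory_info"]):
            if proc.info["name"] == "javaw.exe":
                rss_kb = proc.info["memory_info"].rss // 1024
                print(f"{time.strftime('%H:%M:%S')} pid={proc.pid} rss={rss_kb}K")
        time.sleep(30)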

Memory leak with apache, tomcat & mod_jk & mysql

I'm running Tomcat 7 with Apache 2.2 and mod_jk 1.2.26 on a Debian Lenny x64 server with 2GB of RAM.
I have a strange problem with my server: every several hours, and sometimes (under load) every several minutes, my Tomcat AJP connector pauses with a memory leak error. The error also seems to affect some other parts of the system (e.g. some other running applications also stop working), and I have to reboot the server to solve the problem for a while.
I've checked catalina.out for several days, but there seems to be no single error pattern just before the AJP connector pauses with this message:
INFO: Pausing ProtocolHandler ["ajp-bio-8009"]
Sometimes there is this message before pausing:
Exception in thread "ajp-bio-8009-Acceptor-0" java.lang.OutOfMemoryError: unable to create new native thread
at java.lang.Thread.start0(Native Method)
at java.lang.Thread.start(Thread.java:597)...
And sometimes this one:
INFO: Reloading Context with name [] has started
Exception in thread "ContainerBackgroundProcessor[StandardEngine[Catalina]]" java.lang.OutOfMemoryError: unable to create new native thread
at java.lang.Thread.start0(Native Method)
at java.lang.Thread.start(Thread.java:597)
at org.apache.catalina.core.StandardContext.stopInternal(StandardContext.java:5482)
at org.apache.catalina.util.LifecycleBase.stop(LifecycleBase.java:230)
at org.apache.catalina.core.StandardContext.reload(StandardContext.java:3847)
at org.apache.catalina.loader.WebappLoader.backgroundProcess(WebappLoader.java:424)
at org.apache.catalina.core.ContainerBase.backgroundProcess(ContainerBase.java:1214)
at org.apache.catalina.core.ContainerBase$ContainerBackgroundProcessor.processChildren(ContainerBase.java:1400)
at org.apache.catalina.core.ContainerBase$ContainerBackgroundProcessor.processChildren(ContainerBase.java:1410)
at org.apache.catalina.core.ContainerBase$ContainerBackgroundProcessor.processChildren(ContainerBase.java:1410)
at org.apache.catalina.core.ContainerBase$ContainerBackgroundProcessor.run(ContainerBase.java:1389)
at java.lang.Thread.run(Thread.java:619)
java.sql.SQLException: null, message from server: "Can't create a new thread (errno 11); if you are not out of available memory, you can consult the manual for a possible OS-dependent bug"...
And at other times the messages relate to other parts of the program.
I've checked my application source code, and I don't believe it is causing the problem. I've also checked memory usage using JConsole. The puzzling thing is that when the server fails, it shows plenty of free memory in both the heap and non-heap JVM memory spaces. As I said before, after the server crashes, many other applications also fail, and when I try to restart them I get a 'resource temporarily unavailable' message (I've also checked my limits.conf file).
So I have been really confused by this serious problem for many days now, and I have no more ideas about it. Can anybody please give me any kind of suggestion for solving this complicated and unknown problem?
What could be the most likely reason for this error?
What are your limits on the number of processes?
Check them with ulimit -a and look at the maximum number of user processes. If it's 1024, increase it.
Also check the same thing for the user you use to start it (for example, if you are running your stuff as the nobody user, run su -c "ulimit -a" -s /bin/sh nobody to see what that user actually sees as its limits). That should show you the problem (I had this a couple of days ago and totally missed checking it).
The moment this starts happening, you can also count all running threads and processes for that user (or, even better, monitor it over time using rrdtool or something similar) with ps -eLf | wc -l, which gives you a simple count of all processes and threads running on your system. This information, together with the limits for each particular user, should solve your issue.
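If you don't have rrdtool handy, a throwaway monitor along those lines could be as simple as the following sketch (the interval and log path are arbitrary assumptions):

    import subprocess
    import time

    # ps -eLf prints one line per thread, so the line count (minus the header)
    # approximates the system-wide thread/process count.
    while True:
        out = subprocess.run(["ps", "-eLf"], capture_output=True, text=True).stdout
        count = len(out.splitlines()) - 1
        with open("/tmp/threadcount.log", "a") as log:
            log.write(f"{time.strftime('%Y-%m-%d %H:%M:%S')} {count}\n")
        time.sleep(60)

Watching that number climb toward the limit right before the OutOfMemoryError appears would confirm the thread-limit theory.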
Use jvisualvm to check the heap usage of your JVM. If you see it slowly climbing over a period of time, that is a memory leak. Sometimes a memory leak is short-term and eventually gets cleared up, only to start again.
If you see a sawtooth pattern, take a heap dump near the peak of the sawtooth; otherwise take a heap dump after the JVM has been running long enough to be at high risk of an OOM error. Then copy that .hprof file to another machine and use the Eclipse Memory Analyzer (MAT) to open it and identify likely culprits. You will still need to spend some time following references in the data structures and reading some Javadocs to figure out just what is using that HashMap or List that is growing out of control. The sorting options are also useful for focusing on the most likely problem areas.
There are no easy answers.
Note that there is also a command-line tool included with the Sun JVM which can trigger a heap dump. And if you have a good profiler, that can also be of use, because memory leaks are usually in a piece of code that is executed frequently, and that code will therefore show up as a hot spot in the profiler.
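The command-line tool referred to is presumably jmap, which ships with the Sun/Oracle JDK; a typical invocation (the PID is a placeholder) is:
jmap -dump:live,format=b,file=heap.hprof <pid>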
I finally found the problem: it was not actually a memory leak. Rather, a limit on the number of threads allowed for the VPS was causing the problem. My server was a Xen VPS with a default limit of 256 threads, so when it reached the maximum, the hypervisor killed some of the running threads (which is why some of my running processes stopped). Increasing the number of allowed threads to 512 solved the problem completely (of course, if I raise maxThreads in the Tomcat settings, the problem will obviously come back).

High %wa CPU load when running PHP as CLI

Sorry for the vague question, but I've just written some PHP code that executes itself as a CLI process, and I'm pretty sure it's misbehaving. When I run top on the command line, it shows very little resource use by any individual process, but between 40-98% going to iowait time (%wa). I usually have about 0.7% distributed between %us and %sy, with the remaining resources going idle (somewhere between 20-50% usually).
This server is executing MySQL queries in, easily, 300x the time it takes other servers to run the same queries, and it even takes what seems like forever to log in via SSH... so despite there being some idle CPU time left over, it seems clear that something very bad is happening. Whatever scripts are running are updating my MySQL database, but they seem to be getting exponentially slower than when they started.
I need some ideas to serve as launch points for me to diagnose what's going on.
Some things that I would like to know are:
How can I confirm how many instances of the script are actually running?
Is there any way to confirm that these scripts are actually shutting down when they are done, and not just hanging around taking up CPU time and memory?
What kinds of bottlenecks should I be checking to make sure I don't create too many instances of this script, so this doesn't happen again?
I realize this is probably a huge question, but I'm more than willing to follow any links provided and read up on this... I just need to know where to start looking.
High iowait means that your disk bandwidth is saturated. This might be simply because you're flooding your MySQL server with too many queries and it's maxing out the disk trying to load the data to execute them.
Alternatively, you might be running low on physical memory, causing large amounts of disk IO for swapping.
To start diagnosing, run vmstat 60 for 5 minutes and check the output: the si and so columns show swap-in and swap-out, and the bi and bo columns show other IO. (Edit your question and paste the output in for more assistance.)
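If you would rather automate the check than eyeball it, here is a small sketch that samples vmstat and flags swap or heavy disk traffic (both thresholds are arbitrary assumptions):

    import subprocess

    # Take 5 one-minute samples; any nonzero si/so means the box is swapping.
    out = subprocess.run(["vmstat", "60", "5"], capture_output=True, text=True).stdout
    for line in out.splitlines()[2:]:            # skip the two header lines
        fields = line.split()
        si, so, bi, bo = (int(fields[i]) for i in (6, 7, 8, 9))
        if si > 0 or so > 0:
            print(f"swapping: si={si} so={so}")
        if bi + bo > 10000:                      # arbitrary blocks/s threshold
            print(f"heavy disk IO: bi={bi} bo={bo}")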
High iowait may also mean you have a slow or defective disk. Try checking it with a S.M.A.R.T. disk monitor.
http://www.linuxjournal.com/magazine/monitoring-hard-disks-smart
ps auxww | grep SCRIPTNAME
Why are you running more than one instance of your script to begin with?
