Why I'm getting the Gatling error: "[gatling-http-1-14] WARN io.netty.channel.nio.NioEventLoop" and Gatling is killing a VU after any error - performance-testing

I'm experiencing a couple of problems while trying to run a Gatling simulation I've done. basically, I have two problems:
When any API under test in the Gatling simulation returns 400 or any other error code (not part of the check.status), Gatling kills the virtual user, instead of restarting the scenario and re-use that virtual user, so per each error, I'm losing a virtual user, as my test was planned only with 16 users for 30 mins, I'm having finished the test in the first few minutes without completing anything It happens every time I try to run the test and is annoying as I've been following all the documentation and I don't know what do I have wrong on the code. The simulation config for the users and ramp looks like the next code:
val scnCreation = scenario("My Scenario").during(conf.getDuration("test.duration")) {
exec(Pattern.methodToExecute)
}
setUp(scnCreation.inject(
rampUsers(conf.getInt("test.vu")).during(conf.getDuration("test.ramp"))
)
I'm getting this error while the test is running, but my resources are not overwhelmed (CPU is about ~30% and memory is about ~50%):
[gatling-http-1-13] WARN io.netty.channel.nio.NioEventLoop - Selector.select() returned prematurely 512 times in a row;
Any ideas on how to increase the "limits" of the Gatling engine and how to fix that bad-behavior in which Gatling kills one virtual user after an error?

Your statement that, on error, virtual users escape your loop and terminate is most likely wrong, unless you've implemented your test this way with some exit condition (edit: you did).
If you don't want your users to exit your loop, don't use exitHereIfFailed. Wrap your loop content with exitBlockOnFail instead. Have a look at the documentation.
Then, in any case, Gatling doesn't "recycle" virtual users. You're using rampUsers which defines an open workload model where you define the arrival rate of your users.
This is a NIO/OS bug. Upgrade your Java to the latest release of Java 8 at least, and if you're running on Linux, check how old your kernel is and most likely consider upgrading.
Also, I suspect you're using an old version of Gatling as modern ones don't use Java NIO on Linux. You should upgrade (latest is 3.5.1 as of now)

Related

Stopping a Running Spark Application (Databricks Interactive Cluster)

I'm using databricks with an interactive cluster. If I review their management user-interface, there is only one "application" listed. And when I try to kill it, I always get this message
HTTP ERROR 405
Problem accessing /app/kill/. Reason:
Method Not Allowed
The end result is that I'm forced to restart the entire cluster. I use their "cluster pool" feature which makes the wait time a bit less. but it still involves waiting for about a minute before I'm able to get back to work.
The reason I need to restart the application is to swap fresh jar's into the spark environment. Otherwise when I repeatedly use addJar(), I run into some annoying jar-hell issues (class not found errors and such).
Why does Databricks only list one application at a time in their "interactive" cluster?
Why doesn't databricks have a way to stop one application and start another in its place (without restarting the whole cluster)?
This affects development productivity when we are forced to sit around waiting an extra minute for no good reason. It is already pretty hard to be productive with spark.

WSO2 APIM - Tunning

I have performed some performance tests on WSO2 APIM on both WebServices (WSDL) and Gateway interfaces. Everything went good on the gateway one, however I am facing an odd behavior when using the WebServices one.
Basically I created a test that add, change password and delete a user and run a test plan using 64 threads. At the very beggining my throughput increases a lot up until reach all 64 threads (throughput peak was 1600 req/seg). However, after that the throughput start to decrease with no reason.
All 64 threads are still active and running, and the machine hosting the wso2am reduce CPU usage. It seems that APIM is given up of handling the request even though it has threads and processors for that.
The picture below shows the vmstat result for processor (user, system and idle) and the context switch and interruptions. It is possible to cpu/context switch follows the throughput.
And the next picture illustrate the jmeter test result after at the end (after decrease throughput).
Basically what I need is a clue on what may be the reason for such behavior. I have already tried to increase the pool of threads on both wso2am and tomcat, however it has no effect. It is like the requests were not arriving at all. Even though jmeter is full of power and had already send a bigger throughput before.
I would bet that a simple configuration on tomcat or wso2 is the answer for that. Any help is appreciate.
Thanks and Regards
It may be due to JMeter not being able to send the requests fast enough, try the following steps:
Upgrade JMeter to the latest version (3.1 as of now), you can get the most recent JMeter distribution from JMeter download page
Run your test in command-line non-GUI mode. JMeter GUI can be used for tests development and/or debugging only, it is not designed for running load tests.
Remove (or disable) all the listeners during test execution. Later on you can open JMeter GUI, add the listener of your choice, load .jtl results file and perform analysis or create an HTML Reporting Dashboard out of results file
See 9 Easy Solutions for a JMeter Load Test “Out of Memory” Failure article for above points explained in details and few more tips on configuring JMeter for maximum performance and throughput

Application pool disabling

I have an application in the Production environment which is Windows Server 2012/IIS 8 and is load balanced.
Recently out of nowhere, the website app pool suddenly started gettig disabled. The System Windows Logs logged the following error message by the Resource-Exhaustion-Detector ...
Application Pool 'x' is being automatically disabled due to a series of failures in the process(es) serving that application pool.
Windows successfully diagnosed a low virtual memory condition. The following programs consumed the most virtual memory: w3wp.exe (6604) consumed 5080641536 bytes, w3wp.exe (1572) consumed 477335552 bytes, and w3wp.exe (352) consumed 431423488 bytes.
Anyone got any idea how I figure out what is happening? We've never come across this issue before and the application has been running for a good couple of years.
Also, this isn't something that happens regularly but instead seems to happen one every day or so, and even that is at a random time. The Virtual Memory was initially 4GB but because of the issue above, we increased it to 8GB. Recently it spiked at using about 6.8GB out of 8GB, which it has no reason to do so.
Any help would be really appreciated!
The answer is easy here, obviously and certainly you have two issues here
1- You have a serious bug in your process/code that happens intermittently "you need to debug it to find how/when that happens" or at least run a ProcDump
such that you keep it listening on the server on the process W3WP till an exception happens and then analyze this dump to find where the code get stuck and consume that memory/otherwise just debug the code and see what changes were made in last few months "not days"
2- the application get stopped because you have configured/it is configured by default to get disabled break after a certain number of failure repeats, and that's a normal behavior but the main issue as I said is not the application pool itself, its inside the process
please let me know if you need a further explanation or help on this matter

Jenkins running at very high CPU usage

I recently upgraded from Jenkins 1.6 to 2.5. After I did this, I noticed very high CPU usage, sometimes over 300% (there are only 4 cores, so I don't think it could go over 400%). I'm not sure where to begin debugging this, but here's a thread dump and some screenshots from top/htop
htop
top:
As it turned out, my issue was that several jobs had thousands of old builds. This was fine in Jenkins 1.6 but it's a problem in 2.5 (I guess maybe Jenkins tries to load all the builds into memory when you view the job overview page). To fix it, I just deleted most of the old builds from the problem jobs using this strategy and then reloaded jenkins. Worked like a charm!
I also set the "discard old builds" plugin to keep only the 50 most recent builds, to prevent this from happening again.
Whenever a request comes in, Jenkins will spawn some threads to serve the request. After upgrading Jenkins, it might have invoked at high throttle at that time. Plz check the CPU and memory usage of Jenkins server while the following scenarios :
Jenkins is idle and no other apps are running on the server.
Scheduled a build and no other apps are running on the server.
And compare the behaviors which could help you out to determine whether Jenkins or running jenkins in parallel with other apps are really making trouble.
As #vlp said, try to monitor the jenkins application via JVisualVM with Jstad configuration to hook in. Refer this link to Configure JvisualVM with Jstad.
I have noticed a couple of reasons for abnormal CPU usage with my Jenkins install on Windows 7 Ultimate.
I had recently upgraded from v2.138 to v2.140 plus added a few additional plugins. I started noticing a problem with the Jenkins java executable taking up to 60% of my CPU time every time a job would trigger. None of the jobs were CPU bound, just grabbing data from external servers, so it didn't make any sense. It was fixed with a simple restart of the Jenkins service. I assume the upgrade just didn't finish cleanly.
Java Garbage Collection was throwing errors and hogging the CPU when running with the default memory settings. It was probably overkill, but I went wild and upped the Java Heap Space for Jenkins from the default 256mb to 4gb; which solved this problem for me.See this solution for instructions:
https://stackoverflow.com/a/8122566/4479786
2.5 seems to be a development release, while 1.6 is their Long Term Support version. Thus it seems logical that you should expect some regressions when using the bleeding edge version. The bounty on this question is proof that other users are experiencing this as well. The solution is to report a bug on the Jenkins bug tracker. You can temporarily downgrade to the known good version for now.
Try passwing following argument to jenkins:
-Dhudson.util.AtomicFileWriter.DISABLE_FORCED_FLUSH=true
as mentioned here: https://issues.jenkins-ci.org/browse/JENKINS-52150

IIS7: Faulting application w3wp.exe, what is the root cause of these crashes?

Our Website is in .NET but with some old ASP and 32bits libraries too in it. It had been working fine for a while (2 years). But for the past month, we have seen the following error on our IIS7 server, which we have been unable to track down and fix:
"Faulting application w3wp.exe, version 7.0.6001.18000, time stamp 0x47919413, faulting module kernel32.dll, version 6.0.6001.18215, time stamp 0x4995344f, exception code 0xe053534f, fault offset 0x0002f328, process id 0x%9, application start time 0x%10."
We are able to reproduce the error:
One of our .ASPX pages starts loading, executing code and queries (we have response.flush() all over the page to track where the code breaks), then it suddenly stops and we get the above error in IIS.
The page stops loading and, without the response.flush(), it's not redirecting to our error.aspx page (as configured in web.config)
The error does NOT happen all the time. Sometimes, it happens 3 times in a row, then it's working fine for 15 minutes non-stop with a proper redirection to error.aspx.
The error we get then is a classic: "Either BOF or EOF is True, or the current record has been deleted."
When the error occurs, the page hangs and all other session on the same computer from any browsers have hanging web pages as well (BTW, we only allow 1 worker process while we are testing). From other computers, the site loads fine.
I can recycle the Application Pool, kill w3wp.exe, restart IIS. Nothing will do. The only way to successfully load the page again is to Restart MS SQL which handles our Session States. I don't know why this is, but we guessed that the Session Cookies on the users browsers points to a thread which was not terminated properly (due to the above crash) and IIS is waiting for it to terminate to process more code (?). If someone can explain this better, that would be really helpful. Is there a timeout which we can set to "terminate" threads? Is it a MS SQL related issue?
I have also looked at the Private and Virtual Memory usages, because I think our code is not the most effective and I am certain we have remaining memory leaks. However, I saw the page crash even though both Private and Virtual Memories were still quite low (under 100MB each).
I have used Debug Diag and WinDbg as indicated here: http://blogs.msdn.com/b/tess/archive/2009/03/20/debugging-a-net-crash-with-rules-in-debug-diag.aspx, but we are not able to make windbg work, this is what we are trying to do at the moment.
If someone could help us or point us toward the right direction that would be really great, thank you.
"Either BOF or EOF is True, or the current record has been deleted" means the table is empty and you are attempting to do a MoveNext. So check for eof before you do any moves.
IIS is notorious for throwing kernel errors in w3wp.exe like this one. All your errors in session state are just symptoms of the crashed process. Multiple APP pools won't help much - they just spread the error around.
I''d wager it is SQL deadlocks due to your user environment changing. This will cause a 10-second lag as SQL tries to determine which query to kill off. One wins, one loses. The loser gets back a pointer to an unexpectedly empty table and you try a move and subsequent crash. You maybe could point your DB to an ODBC connection and turn on tracing, or figure out a way to get SQL to log it.
I had all the same symptoms as above in Perl. I was able to make a wrapper fn() to do all SQL queries and log all sql, + params and any errors to disk to track down the problem. It was deadlocks, then we were able to code in auto-retry, and eventually we recoded the query order and scanned columns to eliminate the deadlocks.
It's entirely possible one of your referenced/linked assemblies somewhere has randomly gone corrupt (it can happen) on disk. Can you try a replicate the problem on a new, clean machine with the same stats, fresh installs of the latest xyz drivers you're using?
I solved a mysterious problem that took me months to isolate this way. It seemed clean, new machines with the same specs and prerequired drivers would work just fine - only some older machines with the same specs were failing consistently. I ended up uninstalling everything (IIS, ASP.NET, .NET, database and client) and starting from scratch. The end cause when I isolated it was that the db client driver was corrupt on the older machines (and all the older machines were clones of each other, so I assume they were cloned after the corruption occured), and it seemed to be messing with the .NET memory space even when I wasn't calling it directly. I have yet to even reply to my "help me debug this monster" post with this answer because I doubted it would ever help anyone.
We started receiving this error after installing windows updates on a Windows Server 2008R2 machine. Windows Process Activation Service (WAS) installs some additional site bindings that caused issues for our setup.
We removed net.tcp, net.pipe, net.msmq, and msmq.formatname bindings from our website and no longer got the faulting application exception.
This is probably an edge case, but just in case someone is coming here and they are using MVCMailer , I was getting this same error due to the .SendAsync() method on the mailers.
I switched them all to .Send() and the crashing stopped.
See this SO answer for ways to use the mailer async and avoid the crash (allegedly, I did not personally implement it)

Resources