Hazelcast 3.2 slow when deployed using TomEE 1.6.0

I want to upgrade from TomEE 1.5.1 to TomEE 1.6.0.
I have some hazelcast maps that are populated during server startup.
When deployed on TomEE 1.5.1 it works fast (less than a second to populate and index 2k items, including some processing in between).
When I deploy the exact same WARs to TomEE 1.6.0, the same task takes ~4 seconds.
To complete the picture, when running unit tests with openejb.home pointing to OpenEJB 4.6.0, it runs perfectly well.
Any ideas?
===== edit =====
I realize this is a bit vague as stated.
Here's a link to a simple war that puts 50000 items to the map.
https://drive.google.com/file/d/0B3Xw6Xt1YU4bVy16NE9Xc295LTA/edit?usp=sharing
I deployed it in apache-tomee-plus-1.5.1 and in apache-tomee-jaxrs-1.6.0. The times were ~2.5 sec and ~10 sec, respectively.
There is emphasized output in the TomEE log to indicate the timing.
Sources are included.
I hope it helps in understanding and solving the issue.
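For reference, the population logic in the sample WAR is essentially of this shape; this is a simplified, standalone sketch, and the map name and class here are illustrative rather than the exact ones from the attached sources:

    import com.hazelcast.core.Hazelcast;
    import com.hazelcast.core.HazelcastInstance;
    import com.hazelcast.core.IMap;

    public class MapLoadTimer {
        public static void main(String[] args) {
            HazelcastInstance hz = Hazelcast.newHazelcastInstance();
            IMap<Integer, String> map = hz.getMap("items");

            long start = System.currentTimeMillis();
            for (int i = 0; i < 50000; i++) {
                map.put(i, "item-" + i); // every put goes through Hazelcast serialization and invocation
            }
            long elapsed = System.currentTimeMillis() - start;
            System.out.println("Populated " + map.size() + " items in " + elapsed + " ms");

            hz.shutdown();
        }
    }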

Basically you are stuck in Hazelcast:
at com.hazelcast.spi.impl.BasicInvocation$InvocationFuture.waitForResponse(BasicInvocation.java:721)
- locked <0x00000007c58b50c0> (a com.hazelcast.spi.impl.BasicInvocation$InvocationFuture)
at com.hazelcast.spi.impl.BasicInvocation$InvocationFuture.get(BasicInvocation.java:695)
at com.hazelcast.spi.impl.BasicInvocation$InvocationFuture.get(BasicInvocation.java:674)
at com.hazelcast.map.proxy.MapProxySupport.invokeOperation(MapProxySupport.java:239)
at com.hazelcast.map.proxy.MapProxySupport.putInternal(MapProxySupport.java:200)
at com.hazelcast.map.proxy.MapProxyImpl.put(MapProxyImpl.java:71)
at com.hazelcast.map.proxy.MapProxyImpl.put(MapProxyImpl.java:57)
You can take thread dumps in both instances to compare, but TomEE didn't change enough on its own to justify such a difference.
Do you use the exact same network config?

I didn't get any exception either. I just took a thread dump at startup to see whether TomEE was the bottleneck or not. Since the time is spent in Hazelcast, TomEE shouldn't be the cause of it, so you need to compare both instances.
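If you want to capture comparable dumps from inside the webapp instead of with jstack, a minimal sketch using the standard JMX thread API (nothing TomEE- or Hazelcast-specific) would be:

    import java.lang.management.ManagementFactory;
    import java.lang.management.ThreadInfo;
    import java.lang.management.ThreadMXBean;

    public class ThreadDumper {
        // Print all live threads; run this on both TomEE versions while the maps are loading.
        public static void dump() {
            ThreadMXBean threads = ManagementFactory.getThreadMXBean();
            for (ThreadInfo info : threads.dumpAllThreads(true, true)) {
                System.out.print(info); // note: ThreadInfo.toString() truncates very deep stacks
            }
        }
    }

Compare where the Hazelcast operation threads spend their time in 1.5.1 versus 1.6.0.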

The problem was solved by changing the objects that Hazelcast serializes.
Originally those objects implemented the java.io.Serializable interface, and that is where the performance difference occurred.
I changed them to implement com.hazelcast.nio.serialization.DataSerializable, and now things run faster and are consistent on both servers.
So, although my problem is solved, I still don't understand the difference in behavior.
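For anyone hitting the same thing, the change amounts to swapping java.io.Serializable for Hazelcast's own interface. A minimal sketch of what a cached value class looks like after the change (the field names are made up for illustration):

    import com.hazelcast.nio.ObjectDataInput;
    import com.hazelcast.nio.ObjectDataOutput;
    import com.hazelcast.nio.serialization.DataSerializable;

    import java.io.IOException;

    public class Item implements DataSerializable {
        private String id;
        private long value;

        public Item() {
            // no-arg constructor required by Hazelcast for deserialization
        }

        public Item(String id, long value) {
            this.id = id;
            this.value = value;
        }

        @Override
        public void writeData(ObjectDataOutput out) throws IOException {
            out.writeUTF(id);
            out.writeLong(value);
        }

        @Override
        public void readData(ObjectDataInput in) throws IOException {
            id = in.readUTF();
            value = in.readLong();
        }
    }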

It can be a classloading difference; TomEE 1.6 tolerates a few more classes from the webapp by default. Playing with openejb.classloader.forced-skip=package1,package2,... to exclude classes shared between the webapp and tomee/lib can make it faster.
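If you want to try that, the property can be set in TomEE's conf/system.properties; the package names below are placeholders, list whatever your webapp actually shares with tomee/lib:

    # conf/system.properties (illustrative)
    openejb.classloader.forced-skip = org.example.shared,org.example.model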

Related

Do all Apache Cassandra nodes need to use the same Garbage Collector?

I recently upgraded our Cassandra cluster from 3.11 to 4.0, with the long-term goal of also upgrading the Java version. I did not want to do both of these things at once for obvious reasons; however, we have been running on Cassandra 4 for just over two weeks now, and I'm looking to upgrade the Java version from JDK 8 to JDK 11 and also move from the CMS garbage collector to G1GC.
We wanted to get an idea of what the impact of moving to G1GC would be before going big bang across all nodes.
Is it safe to use a different garbage collector on different nodes, or should this be something set up in a test environment and monitored?
Thanks in advance.
Yes! That is actually the recommended practice when changing/testing new GC types, assuming that you cannot fully simulate production workloads in a lower environment.
I'd advise making the switch on one or two nodes, and then monitor their performance relative to the CMS nodes.
Logically you can do it, since they are different Java processes running on different machines. But since the actual intention behind this activity is to test, you should analyze the impact in a test environment first and then apply the changes in production if the test results look suitable.
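As a concrete illustration of the canary approach: on the one or two nodes you switch first, the change is only the GC flags in the JVM options file Cassandra reads at startup (in 4.0 these live in conf/jvm8-server.options or conf/jvm11-server.options depending on the JDK in use; check the file names and the flags already present on your own install). Roughly:

    # illustrative only: comment out the CMS section and enable the G1 section on the canary nodes
    #-XX:+UseConcMarkSweepGC
    -XX:+UseG1GC
    -XX:MaxGCPauseMillis=500

Restart only those nodes, then compare their GC pause times and read/write latencies against the CMS nodes before rolling the change out cluster-wide.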

Hazelcast exception

We recently added Hazelcast to one of our applications and noticed this NPE appearing in our logs for no obvious reason.
We are using Hazelcast 3.11 and there are twenty members in the cluster running on four physical servers.
We use Hazelcast to share some locks and a map across different JVMs.
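For context, that usage is roughly the following (names are illustrative; in Hazelcast 3.x the lock comes straight from the instance):

    import com.hazelcast.core.Hazelcast;
    import com.hazelcast.core.HazelcastInstance;
    import com.hazelcast.core.ILock;
    import com.hazelcast.core.IMap;

    public class SharedState {
        public static void main(String[] args) {
            HazelcastInstance hz = Hazelcast.newHazelcastInstance();
            ILock lock = hz.getLock("shared-lock"); // distributed lock visible to every member JVM
            lock.lock();
            try {
                IMap<String, String> shared = hz.getMap("shared-map");
                shared.put("key", "value");
            } finally {
                lock.unlock();
            }
            hz.shutdown();
        }
    }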
[24/08/19 17:50:10:586 EST] 000000ba ExecutionServ E com.hazelcast.spi.ExecutionService [SERVERNAME]:5701 [xyz] [3.11.3] Failed to execute java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask#b20b531
java.lang.NullPointerException
at com.hazelcast.crdt.CRDTReplicationTask.replicate(CRDTReplicationTask.java:101)
at com.hazelcast.crdt.CRDTReplicationTask.run(CRDTReplicationTask.java:67)
at com.hazelcast.spi.impl.executionservice.impl.DelegateAndSkipOnConcurrentExecutionDecorator$DelegateDecorator.run(DelegateAndSkipOnConcurrentExecutionDecorator.java:77)
at com.hazelcast.util.executor.CachedExecutorServiceDelegate$Worker.run(CachedExecutorServiceDelegate.java:227)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:906)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:929)
at java.lang.Thread.run(Thread.java:773)
at com.hazelcast.util.executor.HazelcastManagedThread.executeRun(HazelcastManagedThread.java:64)
at com.hazelcast.util.executor.HazelcastManagedThread.run(HazelcastManagedThread.java:80)
Given that our application is very critical, I would like to understand what could potentially cause this and what the consequences would be. The application seems to be working normally in the places where we use Hazelcast.
Thank you in advance for your inputs.
This issue seems to have been logged with Hazelcast and fixed in September 2018:
https://github.com/hazelcast/hazelcast/pull/13706
But it looks like the fix never made it into one of the Hazelcast releases. See the release notes; there is no mention of issue 13706:
https://docs.hazelcast.org/docs/rn/index.html#3-12-2
I asked on the Hazelcast pull request (first link above) if/when this fix will be released, if it hasn't been already.
One thing you could try, just in case they pulled the fix into a release, would be to test with Hazelcast 3.12.2 (the latest release); maybe they pulled in the fix but didn't mention it in the release notes.

Is Apache Zeppelin stable enough to be used in production?

I am using an AWS EMR cluster. I have been experimenting with Spark drivers and the Apache Zeppelin REST APIs to run jobs. I have run several hundred ad hoc jobs with Zeppelin and didn't have any concerns. Given that, I am considering using the Zeppelin REST APIs in production; jobs will be submitted through the REST APIs.
Has anyone experienced stability issues with Zeppelin in Production?
I have Zeppelin running in production in a multiuser environment (roughly 15 users) and it hasn't been very stable. To make it more stable, I now run Zeppelin on its own node, no longer on the master node.
Anyway, I found the following problems:
In releases before 0.7.2, Zeppelin created a lot of zombie processes, which caused memory problems after heavy usage.
User libraries can break Zeppelin; this was the case in versions prior to 0.7.0. For example, Jackson libraries could make Zeppelin unable to communicate with the Spark interpreter. In 0.7.0 and up this problem has been mitigated.
There are random freezes when there are a lot of users. The only way to fix this is to restart the service. (All versions)
Sometimes when a user starts their interpreter and the local repo is empty, Zeppelin doesn't download all the libraries specified in the interpreter config. It then won't download them again; the only way to mitigate this is to delete the contents of the interpreter's local repo. (All versions)
Sometimes changes to notebooks don't get saved, which causes users to lose code.
In version 0.6.0 Spark interpreters shared a context, which caused users to overwrite each other's variables.
Problems are difficult to debug; the logging is not that great yet. Some bugs seem to break the logging, and sometimes running an interpreter in debug mode fixes the problem.
So, I wouldn't put it in a production setting yet, where people depend on it. But for testing and data discovery it would be fine. Zeppelin is clearly still in a beta stage.
Also, don't run it on the master node; set up your own instance and let it connect remotely to the cluster. This makes it much more stable. Put it on a beefy node and restart it overnight.
Most of the bugs I encountered are already on the Jira and the developers are working hard to make things better. Stability gets better with every release, and I see the maintenance load going down with every version, so it certainly has potential.
I have used Zeppelin now for more than a year. It gets you going quickly when you are just starting, but it is not a good candidate for production use cases, especially with more than 10 users, and it depends on your cluster resources. These were my overall concerns with Zeppelin:
By default you can't have more than one job running at a time; you will need to change the configuration to make that happen.
If you are loading additional libraries from S3 or external environments, you can only do that at the beginning, or you will have to restart Zeppelin.
The Spark context is pre-created and there are only a few settings you can change.
The editor itself doesn't resize well when your output is large.
I am moving on to Jupyter for my use cases, which looks much stronger in my initial assessment.
As of the time of this answer, end of February 2019, my answer would be: NO, plain and simple. Zeppelin keeps crashing, hanging, and becoming unresponsive; notebooks tend to become unloadable due to size errors; execution is very slow compared to Jupyter; and there are many limitations around integrating third-party display engines (although much effort has been made towards this).
I experienced these issues on a decently sized and provisioned cluster, with a single user. I would never, ever advise it as a production tool, at least not as it is today, unless you have an admin at hand able to restart the whole thing regularly, track down and fix errors, and be in charge of integration.
We moved back to Jupyter, and everything worked smoothly out of the box from day one, after struggling to stabilize Zeppelin for weeks.

Jenkins running at very high CPU usage

I recently upgraded from Jenkins 1.6 to 2.5. After I did this, I noticed very high CPU usage, sometimes over 300% (there are only 4 cores, so I don't think it could go over 400%). I'm not sure where to begin debugging this, but here's a thread dump and some screenshots from top/htop.
As it turned out, my issue was that several jobs had thousands of old builds. This was fine in Jenkins 1.6, but it's a problem in 2.5 (I guess Jenkins tries to load all the builds into memory when you view the job overview page). To fix it, I just deleted most of the old builds from the problem jobs using this strategy and then reloaded Jenkins. Worked like a charm!
I also set the "discard old builds" plugin to keep only the 50 most recent builds, to prevent this from happening again.
Whenever a request comes in, Jenkins will spawn some threads to serve it. After upgrading Jenkins, it might have been running at high throttle at that time. Please check the CPU and memory usage of the Jenkins server in the following scenarios:
Jenkins is idle and no other apps are running on the server.
A build is scheduled and no other apps are running on the server.
Then compare the behavior, which could help you determine whether Jenkins itself, or running Jenkins in parallel with other apps, is really causing the trouble.
As #vlp said, try to monitor the Jenkins application via JVisualVM with a jstatd configuration to hook in. Refer to this link to configure JVisualVM with jstatd.
I have noticed a couple of reasons for abnormal CPU usage with my Jenkins install on Windows 7 Ultimate.
I had recently upgraded from v2.138 to v2.140 plus added a few additional plugins. I started noticing a problem with the Jenkins java executable taking up to 60% of my CPU time every time a job would trigger. None of the jobs were CPU bound, just grabbing data from external servers, so it didn't make any sense. It was fixed with a simple restart of the Jenkins service. I assume the upgrade just didn't finish cleanly.
Java garbage collection was throwing errors and hogging the CPU when running with the default memory settings. It was probably overkill, but I went wild and upped the Java heap space for Jenkins from the default 256 MB to 4 GB, which solved this problem for me. See this solution for instructions:
https://stackoverflow.com/a/8122566/4479786
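On a Windows install like the one described above, the heap setting usually lives on the arguments line of jenkins.xml next to jenkins.war; the exact file location and existing flags vary by install, so treat this as an illustration only:

    <!-- jenkins.xml (illustrative): raise the maximum heap to 4 GB -->
    <arguments>-Xrs -Xmx4g -jar "%BASE%\jenkins.war" --httpPort=8080</arguments>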
2.5 seems to be a development release, while 1.6 is their Long Term Support version. Thus it seems logical that you should expect some regressions when using the bleeding edge version. The bounty on this question is proof that other users are experiencing this as well. The solution is to report a bug on the Jenkins bug tracker. You can temporarily downgrade to the known good version for now.
Try passing the following argument to Jenkins:
-Dhudson.util.AtomicFileWriter.DISABLE_FORCED_FLUSH=true
as mentioned here: https://issues.jenkins-ci.org/browse/JENKINS-52150

Stopping the classloader leak when using log4j in JBoss

We have an old version of JBoss running multiple apps and we get perm gen errors after multiple deploys. I believe it is due to a classloader leak. It turns out that this is due to a bug that they have decided to not fix:
https://issues.apache.org/bugzilla/show_bug.cgi?id=46221
The short and skinny of that link is that you get a classloader leak simply from using log4j and they aren't fixing it.
So is there a way for me to fix the classloader leak so I don't need to restart the server every two weeks?
I'm hoping to get around upgrading the server, but if I can change configurations, apply some sort of patch, or perhaps reset the log file somehow, that would be great.
The bug has an attached patch. Did you try that? Going from JBoss 4 to 5 is not that painful; it would probably be easier to upgrade than to play around with a patch.
