Hazelcast exception

We recently added Hazelcast to one of our applications and noticed this NPE appearing in our logs for no obvious reason.
We are using Hazelcast 3.11 and there are twenty members in the cluster running on four physical servers.
We use Hazelcast to share some locks and a map across different JVMs.
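For context, here is a minimal sketch of the kind of usage described above (illustrative only; it uses the Hazelcast 3.x API with made-up names, not our actual code):

import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.core.ILock;
import com.hazelcast.core.IMap;

public class SharedStateSketch {
    public static void main(String[] args) {
        HazelcastInstance hz = Hazelcast.newHazelcastInstance();

        // Distributed lock shared across JVMs (3.x API)
        ILock lock = hz.getLock("example-lock");
        lock.lock();
        try {
            // Distributed map shared across JVMs
            IMap<String, String> map = hz.getMap("example-map");
            map.put("key", "value");
        } finally {
            lock.unlock();
        }
    }
}
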
[24/08/19 17:50:10:586 EST] 000000ba ExecutionServ E com.hazelcast.spi.ExecutionService [SERVERNAME]:5701 [xyz] [3.11.3] Failed to execute java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask#b20b531
java.lang.NullPointerException
at com.hazelcast.crdt.CRDTReplicationTask.replicate(CRDTReplicationTask.java:101)
at com.hazelcast.crdt.CRDTReplicationTask.run(CRDTReplicationTask.java:67)
at com.hazelcast.spi.impl.executionservice.impl.DelegateAndSkipOnConcurrentExecutionDecorator$DelegateDecorator.run(DelegateAndSkipOnConcurrentExecutionDecorator.java:77)
at com.hazelcast.util.executor.CachedExecutorServiceDelegate$Worker.run(CachedExecutorServiceDelegate.java:227)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:906)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:929)
at java.lang.Thread.run(Thread.java:773)
at com.hazelcast.util.executor.HazelcastManagedThread.executeRun(HazelcastManagedThread.java:64)
at com.hazelcast.util.executor.HazelcastManagedThread.run(HazelcastManagedThread.java:80)
Given that our application is very critical, I would like to understand what could potentially cause this and what the consequences might be. Our application seems to be working normally around the places where we use Hazelcast.
Thank you in advance for your inputs.

This issue seems to have been logged with Hazelcast and fixed in September 2018:
https://github.com/hazelcast/hazelcast/pull/13706
But it looks like the fix never made it into a Hazelcast release. See the release notes; there is no mention of PR 13706:
https://docs.hazelcast.org/docs/rn/index.html#3-12-2
I asked on the Hazelcast pull request (first link above) if/when this fix will be released, if it hasn't been already.
One thing you could try is testing with Hazelcast 3.12.2 (the latest release); maybe they pulled in the fix but didn't mention it in the release notes.
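If you do try 3.12.2, a hypothetical sanity check (assuming the 3.x Cluster API; verify the methods against the Javadoc for your version) is to log the cluster version and member list so you can confirm every member picked up the new release:

import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;

public class ClusterVersionCheck {
    public static void main(String[] args) {
        HazelcastInstance hz = Hazelcast.newHazelcastInstance();
        // Cluster version and member list, to confirm the upgrade took effect everywhere
        System.out.println("Cluster version: " + hz.getCluster().getClusterVersion());
        System.out.println("Members: " + hz.getCluster().getMembers());
    }
}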

Related

Do all Apache Cassandra nodes need to use the same Garbage Collector?

I have recently upgraded our Cassandra cluster from 3.11 to 4.0, with the long-term goal of also upgrading the Java version. I did not want to do both of these things at once for obvious reasons. However, we have now been running on Cassandra 4.0 for just over two weeks, and I'm looking to upgrade the Java version from JDK 8 to JDK 11 and also move from the CMS garbage collector to G1GC.
We wanted to get an idea of what the impact of moving to G1GC would be before going big bang across all nodes.
Is it safe to use a different garbage collector on different nodes, or should this be set up in a test environment and monitored first?
Thanks in advance.
Yes! That is actually the recommended practice when changing/testing new GC types, assuming that you cannot fully simulate production workloads in a lower environment.
I'd advise making the switch on one or two nodes and then monitoring their performance relative to the CMS nodes.
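As a lightweight way to compare the switched nodes with the CMS nodes, you can read the standard GC MXBeans, which is roughly the same information a JMX dashboard shows per node. A minimal illustrative sketch (it reports on whatever JVM it runs in, so in practice you would read these beans from the Cassandra process over JMX):

import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

public class GcReport {
    public static void main(String[] args) {
        // Lists each active collector (e.g. the CMS or G1 beans) with counts and total pause time
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.printf("%s: %d collections, %d ms total%n",
                    gc.getName(), gc.getCollectionCount(), gc.getCollectionTime());
        }
    }
}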
Logically you can do it, since they are different Java processes running on different machines. But since the intention behind this activity is testing, you should analyze the impact in a test environment first and then apply the change in production if the test results look good.

Azure WebApps leaking handles "out of nothing"

I have 6 WebApps (ASP.NET, Windows) running on Azure and they have been running for years. I do tweak things from time to time, but no major changes.
About a week ago, all of them started to leak handles, as shown in the image: the chart covers just the last 30 days, but the constant curve goes back "forever". Now, while I did make some minor changes to some of the sites, there are at least 3 sites that I did not touch at all.
But still, the major leakage started for all sites a week ago. Any ideas what could be causing this?
I would like to add that one of the sites has only a single aspx page, and another site does not have any code at all; it's just there to run a webjob containing the letsencrypt script. That hasn't changed for several months.
So basically, I'm looking for any pointers, but I doubt this can have anything to do with my code, given that 2 of the sites do not contain any of my code and still show the same symptom.
Final information from the product team:
The Microsoft Azure Team has investigated the issue you experienced and which resulted in increased number of handles in your application. The excessive number of handles can potentially contribute to application slowness and crashes.
Upon investigation, engineers discovered that the recent upgrade of Azure App Service with improvements for monitoring of the platform resulted into a leak of registry key handles in application worker processes. The registry key handle in question is not properly closed by a module which is owned by platform and is injected into every Web App. This module ensures various basic functionalities and features of Azure App Service like correct processing HTTP headers, remote debugging (if enabled and applicable), correct response returning through load-balancers to clients and others. This module has been recently improved to include additional information passed around within the infrastructure (not leaving the boundary of Azure App Service, so this mentioned information is not visible to customers). This information includes versions of modules which processed every request so internal detection of issues can be easier and faster when caused by component version changes. The issue is caused by not closing a specific registry key handle while reading the version information from the machine’s registry.
As a workaround/mitigation in case customers see any issues (like an application increased latency), it is advised to restart a web app which resets all handles and instantly cleans up all leaks in memory.
Engineers prepared a fix which will be rolled out in the next regularly scheduled upgrade of the platform. There is also a parallel rollout of a temporary fix which should finish by 12/23. Any apps restarted after this temporary fix is rolled out shouldn’t observe the issue anymore as the restarted processes will automatically pick up a new version of the module in question.
We are continuously taking steps to improve the Azure Web App service and our processes to ensure such incidents do not occur in the future, and in this case it includes (but is not limited to):
• Fixing the registry key handle leak in the platform module
• Fix the gap in test coverage and monitoring to ensure that such regression will not happen again in the future and will be automatically detected before they are rolled out to customers
So it appears this is a problem with Azure. Here is the relevant part of the current response from Azure technical support:
==>
We had discussed with PG team directly and we had observed that, few other customers are also facing this issue and hence our product team is actively working on it to resolve this issue at the earliest possible. And there is a good chance, that the fixes should be available within few days unless something unexpected comes in and prevent us from completing the patch.
<==
Will add more info as it comes available.

Is Apache Zeppelin stable enough to be used in Production

I am using an AWS EMR cluster. I have been experimenting with Spark drivers and the Apache Zeppelin REST APIs to run jobs. I have run several hundred ad-hoc jobs with Zeppelin and didn't have any concerns. Given that, I am considering using the Zeppelin REST APIs in production; we will be submitting jobs using the REST APIs.
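For illustration, a minimal sketch of the kind of REST call involved (host, port and note ID are placeholders; the endpoint path should be checked against the Zeppelin REST API docs for the version running on the cluster):

import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;

public class ZeppelinJobSubmit {
    public static void main(String[] args) throws IOException {
        String noteId = "2ABCDEFGH";  // placeholder note ID
        // POST /api/notebook/job/{noteId} asks Zeppelin to run all paragraphs of the note
        URL url = new URL("http://zeppelin-host:8080/api/notebook/job/" + noteId);
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("POST");
        conn.setDoOutput(true);
        conn.getOutputStream().close();          // send an empty request body
        System.out.println("HTTP " + conn.getResponseCode());
        conn.disconnect();
    }
}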
Has anyone experienced stability issues with Zeppelin in Production?
I have Zeppelin running in production in a multiuser environment (+/- 15 users) and it hasn't been very stable. To make it more stable I now run Zeppelin on its own node, no longer on the master node.
Anyway, I found the following problems:
In releases before 0.7.2, Zeppelin created a lot of zombie processes, which caused memory problems after heavy usage.
User libraries can break Zeppelin; this was the case in versions prior to 0.7.0. E.g. Jackson libraries made Zeppelin unable to communicate with the Spark interpreter. In 0.7.0 and up this problem has been mitigated.
There are random freezes when there are a lot of users. The only way to fix this is a restart of the service. (All versions)
Sometimes when a user starts their interpreter and the local repo is empty, Zeppelin doesn't download all the libraries specified in the interpreter config. It then won't download them again; the only way to mitigate this is to delete the contents of the interpreter's local repo. (All versions)
Sometimes changes to notebooks don't get saved, which causes users to lose code.
In version 0.6.0 Spark interpreters shared a context, which caused users to overwrite each other's variables.
Problems are difficult to debug; the logging is not that great yet. Some bugs seem to break the logging, and sometimes running an interpreter in debug mode fixes the problem.
So I wouldn't put it in a production setting yet where people depend on it, but for testing and data discovery it would be fine. Zeppelin is clearly still in a beta stage.
Also, don't run it on the master node; set up your own instance and let it connect remotely to the cluster. This makes it much more stable. Put it on a beefy node and restart it overnight.
Most of the bugs I encountered are already in Jira and the developers are working hard to make things better. Stability improves with every release and I see the maintenance load going down with every version, so it certainly has potential.
I have used Zeppelin for more than a year now. It gets you going quickly when you are just starting, but it is not a good candidate for production use cases, especially with more than 10 users, and it depends on your cluster resources. These were my overall concerns with Zeppelin:
By default you can't have more than one job running at a time; you will need to change the configuration to make that happen.
If you are loading additional libraries from S3 or external environments, you can do that only at the beginning, or you will have to restart Zeppelin.
The Spark context is pre-created and there are only a few settings you can change.
The editor itself doesn't resize well when your output is large.
I am moving on to Jupyter for my use cases; in my initial assessment it is much stronger.
As of the time of this answer, end of February 2019, my answer would be: NO. Plain and simple. Zeppelin keeps crashing, hanging and getting unresponsive, notebooks tend to become unloadable due to size errors, execution is very slow compared to Jupyter, and there are many limitations around integrating third-party display engines (although much effort has been made towards this).
I experienced these issues on a decently sized and well-resourced cluster, with a single user. I would never, ever advise it as a production tool, not as it is today at least, unless you have an admin at hand able to restart the whole thing regularly, track down and fix errors, and be in charge of integration.
We moved back to Jupyter, and everything worked smoothly out-of-the box from day one, after struggling to stabilize Zeppelin for weeks.

Hazelcast 3.2 slow when deployed using TomEE 1.6.0

I want to upgrade from TomEE 1.5.1 to TomEE 1.6.0.
I have some hazelcast maps that are populated during server startup.
When deployed on TomEE 1.5.1 it works fast (less than a second to populate and index 2k items, including some processing in between).
When deploying the exact same WARs to TomEE 1.6.0, the same task takes ~4 seconds.
To complete the picture, when running unit tests with openejb.home pointing to OpenEJB 4.6.0, it runs perfectly well.
Any ideas?
===== edit =====
I realize that this is a bit up in the air.
Here's a link to a simple war that puts 50000 items to the map.
https://drive.google.com/file/d/0B3Xw6Xt1YU4bVy16NE9Xc295LTA/edit?usp=sharing
I deployed it in apache-tomee-plus-1.5.1 and in apache-tomee-jaxrs-1.6.0. The times were ~2.5 sec and ~10 sec, respectively.
There is emphasized output in the TomEE log to indicate the times.
Sources are included.
I hope it helps in understanding and solving the issue.
Basically you are stuck in Hazelcast:
at com.hazelcast.spi.impl.BasicInvocation$InvocationFuture.waitForResponse(BasicInvocation.java:721)
- locked <0x00000007c58b50c0> (a com.hazelcast.spi.impl.BasicInvocation$InvocationFuture)
at com.hazelcast.spi.impl.BasicInvocation$InvocationFuture.get(BasicInvocation.java:695)
at com.hazelcast.spi.impl.BasicInvocation$InvocationFuture.get(BasicInvocation.java:674)
at com.hazelcast.map.proxy.MapProxySupport.invokeOperation(MapProxySupport.java:239)
at com.hazelcast.map.proxy.MapProxySupport.putInternal(MapProxySupport.java:200)
at com.hazelcast.map.proxy.MapProxyImpl.put(MapProxyImpl.java:71)
at com.hazelcast.map.proxy.MapProxyImpl.put(MapProxyImpl.java:57)
You can take some thread dumps in both instances to compare, but TomEE didn't change enough to justify such a difference on its own.
Do you use the exact same network config?
I didn't get any exception either. I just took a thread dump after startup to see whether TomEE was the bottleneck or not. Since the time is spent in Hazelcast, TomEE shouldn't be the cause, so you need to compare both instances.
The problem was solved by editing the objects that hazelcast serialize.
For TomEE 1.5.1, those objects were implementing the java.io.Serializable interface, and the performance difference occurred.
I changed them to implement com.hazelcast.nio.serialization.DataSerializable and things run faster and are consistent on both servers.
So, although my problem was solved, I still don't understand the behavior differences.
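For reference, a minimal sketch of the kind of change described (Hazelcast 3.x API; the class and field names are illustrative, not from the actual application):

import java.io.IOException;
import com.hazelcast.nio.ObjectDataInput;
import com.hazelcast.nio.ObjectDataOutput;
import com.hazelcast.nio.serialization.DataSerializable;

public class Item implements DataSerializable {
    private String name;
    private int quantity;

    public Item() {
        // no-arg constructor required for deserialization
    }

    public Item(String name, int quantity) {
        this.name = name;
        this.quantity = quantity;
    }

    @Override
    public void writeData(ObjectDataOutput out) throws IOException {
        out.writeUTF(name);
        out.writeInt(quantity);
    }

    @Override
    public void readData(ObjectDataInput in) throws IOException {
        name = in.readUTF();
        quantity = in.readInt();
    }
}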
It can be a classloading difference; TomEE 1.6 tolerates a bit more classes from the webapp by default. Playing with openejb.classloader.forced-skip=package1,package2,... to exclude classes shared between the webapp and tomee/lib can make it faster.

Stopping the classloader leak when using log4j in JBoss

We have an old version of JBoss running multiple apps and we get perm gen errors after multiple deploys. I believe it is due to a classloader leak. It turns out that this is due to a bug that they have decided to not fix:
https://issues.apache.org/bugzilla/show_bug.cgi?id=46221
The short and skinny of that link is that you get a classloader leak simply from using log4j and they aren't fixing it.
So is there a way for me to fix the classloader leak so I don't need to restart the server every two weeks?
I'm hoping to get around upgrading the server, but if I can change configurations, apply some sort of patch, or perhaps reset the log file somehow, that would be great.
The bug has an attached patch. Did you try that? Going from JBoss 4 to 5 is not that painful; it would probably be easier to upgrade than to play around with a patch.
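One mitigation sometimes tried for log4j-related classloader leaks, offered here only as a sketch (assuming log4j 1.x and the servlet API; it may or may not help with the specific bug linked above): shut log4j down when the webapp is undeployed so its appenders and references are released. The listener would be registered with a <listener> entry in web.xml.

import javax.servlet.ServletContextEvent;
import javax.servlet.ServletContextListener;
import org.apache.log4j.LogManager;

public class Log4jShutdownListener implements ServletContextListener {

    public void contextInitialized(ServletContextEvent sce) {
        // nothing to do on startup
    }

    public void contextDestroyed(ServletContextEvent sce) {
        // Flush and close appenders and release log4j resources held by this webapp's classloader
        LogManager.shutdown();
    }
}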
