RetryableHazelcastException when launching cluster using Hazelcast 3.7

I have a cluster with two members that use map loaders backed by a database.
Version 3.6.1 starts up without issues. However, after upgrading to 3.7 I get many exceptions like the one below, and the cluster fails to start.
Any idea what this means?
Thanks
14:32:50.613), waitTimeout=-1, callTimeout=60000, name=TRADE_SETTLEMENT}, tryCount=250, tryPauseMillis=500, invokeCount=240, callTimeoutMillis=60000, firstInvocationTimeMs=1473427838152, firstInvocationTime='2016-09-09 14:30:38.152', lastHeartbeatMillis=0, lastHeartbeatTime='1970-01-01 01:00:00.000', target=[xxx.co.uk]:5702, pendingResponse={VOID}, backupsAcksExpected=0, backupsAcksReceived=0, connection=null}, Reason: com.hazelcast.spi.exception.RetryableHazelcastException: Map TRADE_SETTLEMENT is still loading data from external store
Sep 09, 2016 2:32:50 PM com.hazelcast.spi.impl.operationservice.impl.Invocation

Related

One of our Hazelcast clusters crashed and took down all other clusters as well

Today one of our web application servers crashed (not Hazelcast related),
and that crash took down all other Hazelcast clusters for several minutes as well.
We are using Hazelcast for session replication, caching and distributed statistics.
How can we minimize the impact when one cluster crashes?
Our stack:
Hazelcast 3.7.4
Spring Boot 1.5.1.RELEASE
Spring Framework 4.3.6.RELEASE
Spring Websocket 4.3.6.RELEASE
Apache Tomcat 8.5.11
Java 1.8 112
Windows Server 2012 R2
Logs of a hazelcast cluster:
2017-07-07 14:41:20.177 ERROR 1600 --- [https-jsse-nio-8443-exec-20] c.p.p.r.GlobalControllerExceptionHandler : Unhandled Exception, null
com.hazelcast.core.OperationTimeoutException: GetOperation invocation failed to complete due to operation-heartbeat-timeout. Current time: 2017-07-07 14:41:20.177. Total elapsed time: 121392 ms. Last operation heartbeat: never. Last operation heartbeat from member: 2017-07-07 14:39:09.296. Invocation{op=com.hazelcast.map.impl.operation.GetOperation{serviceName='hz:impl:mapService', identityHash=790688984, partitionId=11, replicaIndex=0, callId=0, invocationTime=1499438359178 (2017-07-07 14:39:19.178), waitTimeout=-1, callTimeout=60000, name=subscription-type-by-subscription-level}, tryCount=250, tryPauseMillis=500, invokeCount=1, callTimeoutMillis=60000, firstInvocationTimeMs=1499438358785, firstInvocationTime='2017-07-07 14:39:18.785', lastHeartbeatMillis=0, lastHeartbeatTime='1970-01-01 00:00:00.000', target=[10.0.0.4]:5702, pendingResponse={VOID}, backupsAcksExpected=0, backupsAcksReceived=0, connection=Connection[id=6, /10.0.0.7:5702->/10.0.0.4:49193, endpoint=[10.0.0.4]:5702, alive=true, type=MEMBER]}
at com.hazelcast.spi.impl.operationservice.impl.InvocationFuture.newOperationTimeoutException(InvocationFuture.java:150)
at com.hazelcast.spi.impl.operationservice.impl.InvocationFuture.resolve(InvocationFuture.java:98)
at com.hazelcast.spi.impl.operationservice.impl.InvocationFuture.resolveAndThrow(InvocationFuture.java:74)
at com.hazelcast.spi.impl.AbstractInvocationFuture.get(AbstractInvocationFuture.java:158)
at com.hazelcast.map.impl.proxy.MapProxySupport.invokeOperation(MapProxySupport.java:376)
at com.hazelcast.map.impl.proxy.MapProxySupport.getInternal(MapProxySupport.java:307)
at com.hazelcast.map.impl.proxy.MapProxyImpl.get(MapProxyImpl.java:94)
...
2017-07-07 14:41:40.181 WARN 1600 --- [https-jsse-nio-8443-exec-32] c.h.map.impl.query.MapQueryEngineImpl : [10.0.0.7]:5702 [app-v19] [3.7.4] Could not get results
java.util.concurrent.ExecutionException: QueryOperation invocation failed to complete due to operation-heartbeat-timeout. Current time: 2017-07-07 14:41:40.181. Total elapsed time: 120624 ms. Last operation heartbeat: never. Last operation heartbeat from member: 2017-07-07 14:39:09.296. Invocation{op=com.hazelcast.map.impl.query.QueryOperation{serviceName='hz:impl:mapService', identityHash=2056544024, partitionId=-1, replicaIndex=0, callId=0, invocationTime=1499438379950 (2017-07-07 14:39:39.950), waitTimeout=-1, callTimeout=60000, name=online-messaging-session-containers}, tryCount=250, tryPauseMillis=500, invokeCount=1, callTimeoutMillis=60000, firstInvocationTimeMs=1499438379557, firstInvocationTime='2017-07-07 14:39:39.557', lastHeartbeatMillis=0, lastHeartbeatTime='1970-01-01 00:00:00.000', target=[10.0.0.4]:5702, pendingResponse={VOID}, backupsAcksExpected=0, backupsAcksReceived=0, connection=Connection[id=6, /10.0.0.7:5702->/10.0.0.4:49193, endpoint=[10.0.0.4]:5702, alive=true, type=MEMBER]}
at com.hazelcast.spi.impl.operationservice.impl.InvocationFuture.newOperationTimeoutException(InvocationFuture.java:150)
at com.hazelcast.spi.impl.operationservice.impl.InvocationFuture.resolve(InvocationFuture.java:98)
at com.hazelcast.spi.impl.operationservice.impl.InvocationFuture.resolveAndThrow(InvocationFuture.java:74)
at com.hazelcast.spi.impl.AbstractInvocationFuture.get(AbstractInvocationFuture.java:158)
...
2017-07-07 14:42:58.385 WARN 1600 --- [https-jsse-nio-8443-exec-191] c.h.map.impl.query.MapQueryEngineImpl : [10.0.0.7]:5702 [app-v19] [3.7.4] Could not get results
com.hazelcast.core.MemberLeftException: Member [10.0.0.4]:5702 - 14a2452c-45bc-40c9-bf77-ce4d73bf6f7e has left cluster!
at com.hazelcast.spi.impl.operationservice.impl.InvocationMonitor$OnMemberLeftTask.run0(InvocationMonitor.java:379)
at com.hazelcast.spi.impl.operationservice.impl.InvocationMonitor$MonitorTask.run(InvocationMonitor.java:221)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)

(bdutil) Unable to get hadoop/spark cluster working with a fresh install

I'm setting up a tiny cluster in GCE to play around with it, but although the instances are created, some failures prevent it from working. I'm following the steps in https://cloud.google.com/hadoop/downloads
So far I'm using the latest versions (as of now) of gcloud (143.0.0) and bdutil (1.3.5), freshly installed.
./bdutil deploy -e extensions/spark/spark_env.sh
using debian-8 as the image (as bdutil still uses debian-7-backports).
At some point I got
Fri Feb 10 16:19:34 CET 2017: Command failed: wait ${SUBPROC} on line 326.
Fri Feb 10 16:19:34 CET 2017: Exit code of failed command: 1
full debug output is in https://gist.github.com/jlorper/4299a816fc0b140575ed70fe0da1f272
(project id and bucket names changed)
Instances are created, but Spark is not even installed. Digging a bit, I managed to run the Spark installation and the Hadoop start commands on the master after connecting over SSH. But it fails badly when starting the spark-shell:
17/02/10 15:53:20 INFO gcs.GoogleHadoopFileSystemBase: GHFS version: 1.4.5-hadoop1
17/02/10 15:53:20 INFO gcsio.FileSystemBackedDirectoryListCache: Creating '/hadoop_gcs_connector_metadata_cache' with createDirectories()...
java.lang.RuntimeException: java.lang.RuntimeException: java.nio.file.AccessDeniedException: /hadoop_gcs_connector_metadata_cache
at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:522)
and I'm not able to import sparkSQL. From what I've read, everything should be started automatically.
At this point I'm a bit lost and don't know what else to do.
Am I missing a step? Are any of the commands faulty? Thanks in advance.
Update: solved
As pointed out in the accepted solution, I cloned the repo and the cluster was created without issues. When trying to start the spark-shell, though, it gave
java.lang.RuntimeException: java.io.IOException: GoogleHadoopFileSystem has been closed or not initialized.
That sounded to me like the connectors were not initialized properly, so after running
./bdutil --env_var_files extensions/spark/spark_env.sh,bigquery_env.sh run_command_group install_connectors
it worked as expected.

The latest version of bdutil on https://cloud.google.com/hadoop/downloads is a bit stale. I'd instead recommend using the version of bdutil at head on GitHub: https://github.com/GoogleCloudPlatform/bdutil

Cassandra upgrade from 2.2.1 to 3.0.0 fails with NullPointerException

I tried upgrading Cassandra from 2.2.1 to 3.0.0 but Cassandra doesn't start:
ERROR [main] 2015-11-30 15:44:50,164 CassandraDaemon.java:702 - Exception encountered during startup
java.lang.NullPointerException: null
at org.apache.cassandra.io.util.FileUtils.delete(FileUtils.java:374) ~[apache-cassandra-3.0.0.jar:3.0.0]
at org.apache.cassandra.db.SystemKeyspace.migrateDataDirs(SystemKeyspace.java:1341) ~[apache-cassandra-3.0.0.jar:3.0.0]
at org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:180) [apache-cassandra-3.0.0.jar:3.0.0]
at org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:561) [apache-cassandra-3.0.0.jar:3.0.0]
at org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:689) [apache-cassandra-3.0.0.jar:3.0.0]
Has anyone else faced this problem?
I also raised an issue here:
https://issues.apache.org/jira/browse/CASSANDRA-10788
Here is TRACE level logging:
TRACE [MemtablePostFlush:1] 2015-12-01 16:47:52,675 ColumnFamilyStore.java:868 - forceFlush requested but everything is clean in schema_columns
TRACE [main] 2015-12-01 16:47:52,675 ColumnFamilyStore.java:1563 - Snapshot for Keyspace(name='system') keyspace data file /data/system/schema_columns-296e9c049bec3085827dc17d3df2122a/la-46-big-Data.db created in /data/system/schema_columns-296e9c049bec3085827dc17d3df2122a/snapshots/1448984872341-upgrade-2.2.1-3.0.0
TRACE [main] 2015-12-01 16:47:52,676 ColumnFamilyStore.java:1563 - Snapshot for Keyspace(name='system') keyspace data file /data/system/schema_columns-296e9c049bec3085827dc17d3df2122a/la-45-big-Data.db created in /data/system/schema_columns-296e9c049bec3085827dc17d3df2122a/snapshots/1448984872341-upgrade-2.2.1-3.0.0
TRACE [main] 2015-12-01 16:47:52,676 ColumnFamilyStore.java:1563 - Snapshot for Keyspace(name='system') keyspace data file /data/system/schema_columns-296e9c049bec3085827dc17d3df2122a/la-47-big-Data.db created in /data/system/schema_columns-296e9c049bec3085827dc17d3df2122a/snapshots/1448984872341-upgrade-2.2.1-3.0.0
TRACE [main] 2015-12-01 16:47:52,676 SystemKeyspace.java:1327 - Checking directory /data for old files
ERROR [main] 2015-12-01 16:47:52,751 CassandraDaemon.java:702 - Exception encountered during startup
java.lang.NullPointerException: null
at org.apache.cassandra.io.util.FileUtils.delete(FileUtils.java:374) ~[apache-cassandra-3.0.0.jar:3.0.0]
at org.apache.cassandra.db.SystemKeyspace.migrateDataDirs(SystemKeyspace.java:1341) ~[apache-cassandra-3.0.0.jar:3.0.0]
at org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:180) [apache-cassandra-3.0.0.jar:3.0.0]
at org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:561) [apache-cassandra-3.0.0.jar:3.0.0]
at org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:689) [apache-cassandra-3.0.0.jar:3.0.0]
It looks like the code expects legacy files under the /data directory, but it can't find any and calls FileUtils.delete(null).
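To illustrate how a null can reach a delete call like this: in plain Java, File.listFiles() returns null (not an empty array) when the path is not a directory or cannot be read. The sketch below is not Cassandra's code. The class name and both helper methods are hypothetical, and it only shows the general failure mode and a null guard.

```java
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;

class ListFilesNullDemo {

    // Hypothetical stand-in for a varargs delete helper; NOT Cassandra's FileUtils.
    static int deleteAll(File... files) {
        int deleted = 0;
        for (File f : files) { // throws NullPointerException when files == null
            if (f.delete()) {
                deleted++;
            }
        }
        return deleted;
    }

    // Guarded variant: a null listing is treated as "nothing to delete".
    static int deleteAllSafely(File[] files) {
        return files == null ? 0 : deleteAll(files);
    }

    public static void main(String[] args) throws IOException {
        // A regular file, not a directory, so listFiles() returns null.
        File notADirectory = Files.createTempFile("legacy", ".db").toFile();
        File[] listing = notADirectory.listFiles();
        System.out.println("listing is null: " + (listing == null));

        try {
            deleteAll(listing);
        } catch (NullPointerException e) {
            System.out.println("unguarded call threw NullPointerException");
        }

        System.out.println("guarded call deleted " + deleteAllSafely(listing) + " files");
        notADirectory.delete();
    }
}
```

Whether this is the exact path taken in SystemKeyspace.migrateDataDirs is an assumption; the stack trace only confirms that the NullPointerException is raised inside FileUtils.delete.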

One thing that may impact your migration is that upgrading directly from 2.2.1 to 3.0 is not recommended by DataStax.
Cassandra 3.0.x restrictions
Upgrade from Cassandra 2.1 versions greater or equal to 2.1.9 or from
Cassandra 2.2 versions greater or equal to 2.2.2.
*Emphasis added
The null pointer could be happening due to non-existent files or a duplicated delete call.
I would try upgrading to at least 2.2.2 before the 3.0 migration.

I was able to fix the problem by using this patch:
https://github.com/stef1927/cassandra/commit/1c464adf097d323320ce11db6daf05e1a31c62b6
More details:
https://issues.apache.org/jira/browse/CASSANDRA-10788
And thanks, I'll see if upgrading from 2.2.1 to 2.2.2 first helps.

Unasked Mojarra initializing

I'm using the Tapestry 5.3 framework in my Java 1.7 web app, and I run it under Tomcat 7.x.
Suddenly today I found these lines in my app's startup logs:
jul 31, 2014 12:12:46 PM com.sun.faces.config.ConfigureListener contextInitialized
INFO: Initializing Mojarra 2.2.7 ( 20140610-1547 https://svn.java.net/svn/mojarra~svn/tags/2.2.7#13362) for context '/XXX'
jul 31, 2014 12:12:46 PM com.sun.faces.spi.InjectionProviderFactory createInstance
INFO: JSF1048: PostConstruct/PreDestroy annotations present. ManagedBeans methods marked with these annotations will have said annotations processed.
I have never seen these log lines before in my app. Furthermore, the "Initializing Mojarra" step takes an appreciable amount of time.
So I suspected that one of our project developers had added some libraries, but there aren't any JSF or Faces libraries in the build.
I have no idea what has changed. How can I remove Mojarra from my project? Any help would be appreciated.

The problem was jsf-*.jar libraries mistakenly placed in Tomcat's /lib folder. They were initialized during Tomcat startup. Removing them solved the problem.
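For anyone who wants to check for the same situation, here is a small sketch that lists files matching jsf-*.jar in a given directory. The class name and the command-line argument are assumptions for illustration. Pass the path to your own Tomcat lib folder.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

class FindJsfJars {

    // Returns the names of files directly inside libDir that match the glob jsf-*.jar.
    static List<String> findJsfJars(Path libDir) throws IOException {
        List<String> found = new ArrayList<>();
        try (DirectoryStream<Path> jars = Files.newDirectoryStream(libDir, "jsf-*.jar")) {
            for (Path jar : jars) {
                found.add(jar.getFileName().toString());
            }
        }
        return found;
    }

    public static void main(String[] args) throws IOException {
        // Assumption: the Tomcat lib directory is passed as the first argument.
        if (args.length == 0) {
            System.out.println("usage: java FindJsfJars <tomcat-lib-dir>");
            return;
        }
        Path libDir = Paths.get(args[0]);
        if (!Files.isDirectory(libDir)) {
            System.out.println(libDir + " is not a directory");
            return;
        }
        for (String name : findJsfJars(libDir)) {
            System.out.println(name);
        }
    }
}
```

It only looks at the top level of the directory and does not remove anything. Deleting the jars is left as a manual step.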

Unable to start Titan server with embedded Cassandra and Rexster

I am trying to run Titan with embedded Cassandra and Rexster. I downloaded the Titan distribution titan-all-0.3.2 and unpacked it on a Linux box. After unpacking, this is the command I ran:
$ ./bin/titan.sh config/titan-server-rexster.xml config/titan-server-cassandra.properties
This is what I see in the logs:
After starting the RexPro services, it is unable to deploy and start Grizzly. Has anyone had this issue?
Exception stack trace:
13/10/18 14:51:31 INFO server.RexProRexsterServer: RexPro serving on port: [8184]
Oct 18, 2013 2:51:31 PM org.glassfish.grizzly.servlet.WebappContext deploy
INFO: Starting application [jersey] ...
Oct 18, 2013 2:51:31 PM org.glassfish.grizzly.servlet.WebappContext deploy
SEVERE: [jersey] Exception deploying application. See stack trace for details.
java.lang.RuntimeException: com.sun.jersey.api.container.ContainerException: No WebApplication provider is present
at org.glassfish.grizzly.servlet.WebappContext.initServlets(WebappContext.java:1479)
at org.glassfish.grizzly.servlet.WebappContext.deploy(WebappContext.java:265)

There were some packaging problems in some of the 0.3.2 zip files. You basically need to replace a jar file or two around Jersey to get it to work (or I think use the titan-cassandra distribution instead of titan-all).
You can read more about the issue here and its solution (also reported here), but the answer is:
You should be able to patch 0.3.2 by replacing this jar file in the
Titan lib directory:
jersey-core-1.8.jar
with:
jersey-core-1.17.jar
(http://repo1.maven.org/maven2/com/sun/jersey/jersey-core/1.17/jersey-core-1.17.jar)
