Trouble adding a Hadoop node in DataStax Enterprise - Cassandra

I'm having a difficult time adding a Hadoop node in DataStax Enterprise 4.5.1. I have an existing Cassandra virtual DC with two nodes, using vnodes. I am using OpsCenter, and I start up a Hadoop node, setting the initial_token value to 0. OpsCenter installs everything just fine (i.e. I get past the 5 green dots), but about a minute later the node dies. The system.log file has this exception:
INFO [main] 2014-12-28 05:40:37,931 StorageService.java (line 1007) JOINING: Starting to bootstrap...
ERROR [main] 2014-12-28 05:40:37,998 CassandraDaemon.java (line 513) Exception encountered during startup
java.lang.IllegalStateException: No sources found for (-1,0]
at org.apache.cassandra.dht.RangeStreamer.getAllRangesWithSourcesFor(RangeStreamer.java:159)
at org.apache.cassandra.dht.RangeStreamer.addRanges(RangeStreamer.java:117)
at org.apache.cassandra.dht.BootStrapper.bootstrap(BootStrapper.java:72)
at org.apache.cassandra.service.StorageService.bootstrap(StorageService.java:1035)
at org.apache.cassandra.service.StorageService.joinTokenRing(StorageService.java:797)
at org.apache.cassandra.service.StorageService.initServer(StorageService.java:614)
at org.apache.cassandra.service.StorageService.initServer(StorageService.java:504)
at org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:378)
at com.datastax.bdp.server.DseDaemon.setup(DseDaemon.java:394)
at org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:496)
at com.datastax.bdp.server.DseDaemon.main(DseDaemon.java:574)
INFO [StorageServiceShutdownHook] 2014-12-28 05:40:38,015 Gossiper.java (line 1279) Announcing shutdown
INFO [Thread-1] 2014-12-28 05:40:38,015 DseDaemon.java (line 477) DSE shutting down...
INFO [Thread-1] 2014-12-28 05:40:38,022 PluginManager.java (line 317) All plugins are stopped.
INFO [Thread-1] 2014-12-28 05:40:38,023 CassandraDaemon.java (line 463) Cassandra shutting down...
ERROR [Thread-1] 2014-12-28 05:40:38,023 CassandraDaemon.java (line 199) Exception in thread Thread[Thread-1,5,main]
java.lang.NullPointerException
at org.apache.cassandra.service.CassandraDaemon.stop(CassandraDaemon.java:464)
at com.datastax.bdp.server.DseDaemon.stop(DseDaemon.java:480)
at com.datastax.bdp.server.DseDaemon$1.run(DseDaemon.java:384)
INFO [StorageServiceShutdownHook] 2014-12-28 05:40:40,015 MessagingService.java (line 683) Waiting for messaging service to quiesce
INFO [ACCEPT-/172.31.19.81] 2014-12-28 05:40:40,017 MessagingService.java (line 923) MessagingService has terminated the accept() thread
I have a keyspace that looks like this:
CREATE KEYSPACE mykeyspace WITH replication = {
'class': 'NetworkTopologyStrategy',
'Analytics': '1',
'Cassandra': '1'
};
I'm wondering if it is because I am using vnodes in my Cassandra DC and not in the Analytics DC? The DataStax documentation mentions that this type of mixed architecture is OK. My snitch is set to DSEDelegateSnitch, which in turn uses DSESimpleSnitch, the default. I've run a node repair, but to no avail. One other detail is that in OpsCenter I get a warning that I am using two different versions of DataStax Enterprise: 4.5.1 in my Cassandra DC and 2.0.8.39 in the Analytics DC. In addition, OpsCenter lists the Hadoop DC as "unknown." Any help at all would be appreciated.

Related

Driver stops executors without a reason

I have an application based on Spark Structured Streaming 3 with Kafka, which is processing some user logs; after some time the driver starts killing the executors and I don't understand why.
The executor logs don't contain any errors. Below are the logs from the executors and the driver.
On executor 1:
20/08/31 10:01:31 INFO executor.Executor: Finished task 5.0 in stage 791.0 (TID 46411). 1759 bytes result sent to driver
20/08/31 10:01:33 INFO executor.YarnCoarseGrainedExecutorBackend: Driver commanded a shutdown
On executor 2:
20/08/31 10:14:33 INFO executor.YarnCoarseGrainedExecutorBackend: Driver commanded a shutdown
20/08/31 10:14:34 INFO memory.MemoryStore: MemoryStore cleared
20/08/31 10:14:34 INFO storage.BlockManager: BlockManager stopped
20/08/31 10:14:34 INFO util.ShutdownHookManager: Shutdown hook called
On the driver:
20/08/31 10:01:33 ERROR cluster.YarnScheduler: Lost executor 3 on xxx.xxx.xxx.xxx: Executor heartbeat timed out after 130392 ms
20/08/31 10:53:33 ERROR cluster.YarnScheduler: Lost executor 2 on xxx.xxx.xxx.xxx: Executor heartbeat timed out after 125773 ms
20/08/31 10:53:33 ERROR cluster.YarnScheduler: Ignoring update with state FINISHED for TID 129308 because its task set is gone (this is likely the result of receiving duplicate task finished status updates) or its executor has been marked as failed.
20/08/31 10:53:33 ERROR cluster.YarnScheduler: Ignoring update with state FINISHED for TID 129314 because its task set is gone (this is likely the result of receiving duplicate task finished status updates) or its executor has been marked as failed.
20/08/31 10:53:33 ERROR cluster.YarnScheduler: Ignoring update with state FINISHED for TID 129311 because its task set is gone (this is likely the result of receiving duplicate task finished status updates) or its executor has been marked as failed.
20/08/31 10:53:33 ERROR cluster.YarnScheduler: Ignoring update with state FINISHED for TID 129305 because its task set is gone (this is likely the result of receiving duplicate task finished status updates) or its executor has been marked as failed.
Has anyone had the same problem and solved it?
Looking at the information at hand:
no errors
Driver commanded a shutdown
YARN logs showing "state FINISHED"
this seems to be expected behavior.
This typically happens if you forget to await the termination of the Spark streaming query. If you do not conclude your code with
query.awaitTermination()
your streaming application will just shut down after all available data has been processed.
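For reference, a minimal Scala sketch of that pattern (the Kafka servers, topic, and console sink are placeholders, not taken from the question's code):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("user-logs").getOrCreate()

// Illustrative Kafka source; bootstrap servers and topic are placeholders.
val logs = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:9092")
  .option("subscribe", "user-logs")
  .load()

val query = logs.writeStream
  .format("console")
  .start()

// Without this call the main method returns, the SparkContext is stopped,
// and the driver tells every executor to shut down ("Driver commanded a shutdown").
query.awaitTermination()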

Cassandra decommission loss of data

We are running a Cassandra cluster. Initially, the cluster had only one node, and since that node was running out of space, we decided we could add another node to the cluster.
Info on the cluster:
Keyspace with replication factor 1 using the SimpleStrategy class on a single datacenter
Node 1 - 256 tokens, almost no space available (1TB occupied by Cassandra data)
Node 2 - connected with 256 tokens, had 13TB available
First we added node 2 to the cluster and then realized that to stream the data to node 2, we'd have to decommission node 1.
So we decided to decommission, empty and reconfigure node 1 (we wanted node 1 to hold only 32 tokens) and re-add node 1 to the cluster datacenter.
When we launched the decommission process, it created a stream of 29 files totaling almost 600 GB. That stream copied successfully (we checked the logs and used nodetool netstats), and we expected a second stream to follow since we had 1 TB on node 1. But nothing else happened: the node reported itself as decommissioned and node 2 reported the data stream as complete.
The log from node 2 related to the copy stream:
INFO [STREAM-INIT-/10.131.155.200:48267] 2018-10-08 16:05:55,636 StreamResultFuture.java:116 - [Stream #a248d100-cb0b-11e8-a427-37a119a8af0a ID#0] Creating new streaming plan for Unbootstrap
INFO [STREAM-INIT-/10.131.155.200:48267] 2018-10-08 16:05:55,648 StreamResultFuture.java:123 - [Stream #a248d100-cb0b-11e8-a427-37a119a8af0a, ID#0] Received streaming plan for Unbootstrap
INFO [STREAM-INIT-/10.131.155.200:57298] 2018-10-08 16:05:55,648 StreamResultFuture.java:123 - [Stream #a248d100-cb0b-11e8-a427-37a119a8af0a, ID#0] Received streaming plan for Unbootstrap
INFO [STREAM-IN-/10.131.155.200:57298] 2018-10-08 16:05:55,663 StreamResultFuture.java:173 - [Stream #a248d100-cb0b-11e8-a427-37a119a8af0a ID#0] Prepare completed. Receiving 29 files(584.444GiB), sending 0 files(0.000KiB)
INFO [StreamReceiveTask:2] 2018-10-09 16:55:33,646 StreamResultFuture.java:187 - [Stream #a248d100-cb0b-11e8-a427-37a119a8af0a] Session with /10.131.155.200 is complete
INFO [StreamReceiveTask:2] 2018-10-09 16:55:33,709 StreamResultFuture.java:219 - [Stream #a248d100-cb0b-11e8-a427-37a119a8af0a] All sessions completed
After clearing the Cassandra data folder (we should've backed it up, we know), we started Cassandra again on node 1 and it successfully joined the cluster.
The cluster is functional with:
Node 1 - 32 tokens
Node 2 - 256 tokens
But we seem to have lost a lot of data. We were doing this as instructed in the Cassandra documentation.
We tried doing nodetool repair on both nodes, but to no avail (both reported no data to be recovered).
What did we miss here? Is there a way to recover this lost data?
Thank you all!

Rolling upgrade from 1.2.19 to 2.0.15

We have 3 datacenters with 12 Cassandra nodes in each. The current Cassandra version is 1.2.19. We want to migrate to Cassandra 2.0.15. We cannot have a full downtime and we need to do a rolling upgrade. As a preliminary check we ran two experiments:
Experiment 1
Created a new 2.0.15 node and tried to bootstrap it into the cluster with 10% of the token interval of an already existing node.
The node was unable to join the cluster by producing: "java.lang.RuntimeException: Unable to gossip with any seeds"
CassandraDaemon.java (line 584) Exception encountered during startup
java.lang.RuntimeException: Unable to gossip with any seeds
at org.apache.cassandra.gms.Gossiper.doShadowRound(Gossiper.java:1296)
at org.apache.cassandra.service.StorageService.checkForEndpointCollision(StorageService.java:457)
at org.apache.cassandra.service.StorageService.prepareToJoin(StorageService.java:671)
at org.apache.cassandra.service.StorageService.initServer(StorageService.java:623)
at org.apache.cassandra.service.StorageService.initServer(StorageService.java:515)
at org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:437)
at org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:567)
at org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:656)
Experiment 2
Added one 1.2.19 node into the cluster with 10% of the token interval of an already existing node.
When the node was up we stopped it and upgraded to 2.0.15, then started again (minor downtime). This time the node joined the cluster and started serving requests correctly.
To check how it behaves under heavier load, we tried to move the token to cover 15% of a normal node. Unfortunately, the move operation failed with the following exception:
INFO [RMI TCP Connection(1424)-192.168.1.100] 2015-07-10 11:37:05,235 StorageService.java (line 982) MOVING: fetching new ranges and streaming old ranges
INFO [RMI TCP Connection(1424)-192.168.1.100] 2015-07-10 11:37:05,262 StreamResultFuture.java (line 87) [Stream #fc3f4290-26f7-11e5-9988-afe392008597] Executing streaming plan for Moving
INFO [RMI TCP Connection(1424)-192.168.1.100] 2015-07-10 11:37:05,262 StreamResultFuture.java (line 91) [Stream #fc3f4290-26f7-11e5-9988-afe392008597] Beginning stream session with /192.168.1.101
INFO [StreamConnectionEstablisher:1] 2015-07-10 11:37:05,263 StreamSession.java (line 218) [Stream #fc3f4290-26f7-11e5-9988-afe392008597] Starting streaming to /192.168.1.101
INFO [StreamConnectionEstablisher:1] 2015-07-10 11:37:05,274 StreamResultFuture.java (line 173) [Stream #fc3f4290-26f7-11e5-9988-afe392008597] Prepare completed. Receiving 0 files(0 bytes), sending 112 files(6538607891 bytes)
ERROR [STREAM-IN-/192.168.1.101] 2015-07-10 11:37:05,303 StreamSession.java (line 467) [Stream #fc3f4290-26f7-11e5-9988-afe392008597] Streaming error occurred
java.io.IOException: Connection reset by peer
at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
at sun.nio.ch.IOUtil.read(IOUtil.java:197)
at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:384)
at org.apache.cassandra.streaming.messages.StreamMessage.deserialize(StreamMessage.java:51)
at org.apache.cassandra.streaming.ConnectionHandler$IncomingMessageHandler.run(ConnectionHandler.java:239)
at java.lang.Thread.run(Thread.java:745)
ERROR [STREAM-OUT-/192.168.1.101] 2015-07-10 11:37:05,312 StreamSession.java (line 467) [Stream #fc3f4290-26f7-11e5-9988-afe392008597] Streaming error occurred
java.io.IOException: Connection reset by peer
at sun.nio.ch.FileDispatcherImpl.write0(Native Method)
at sun.nio.ch.SocketDispatcher.write(SocketDi
Questions
Q1. Is it normal for Cassandra 2.0.15 to not bootstrap into a 1.2.19 cluster, as in experiment 1? (Here I mean that it might not be supposed to work by design.)
Q2. Is the move token operation supposed to work for a Cassandra 2.0.15 node which operates in a 1.2.19 cluster?
Q3. Are there any workarounds/recommendations for doing a proper rolling upgrade in our case?

Spark on cluster: I would like to know the meaning of the following errors and possible causes

I have the following errors/warnings:
1) WARN AkkaRpcEndpointRef: Error sending message [message = Heartbeat(2,[Lscala.Tuple2;#58149ee3,BlockManagerId(2, 192.168.0.171, 49714))] in 1 attempts
java.util.concurrent.TimeoutException: Futures timed out after [120 seconds]
2) ERROR CoarseGrainedExecutorBackend: Driver 192.168.0.131:41837 disassociated! Shutting down.
I'm running a Spark (v. 1.4.0) app in a cluster of 4 machines in which the driver has less memory (4 GB) than the workers (8 GB each). Is it possible that the driver produces the error due to its workload?
The driver was not able to respond to the executors since it was under stress during the computation.
The problem was solved simply by adding more RAM to the driver.
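For context, the driver heap has to be set before the driver JVM starts, so the extra RAM is usually given at submit time (for example spark-submit --driver-memory 8g) rather than in application code. Below is a minimal Scala sketch using the SparkLauncher API (available since Spark 1.4.0); the jar path, main class, master URL, and the 8g value are placeholders:

import org.apache.spark.launcher.SparkLauncher

// Hypothetical submission wrapper: raises the driver heap before the driver JVM is launched.
val process = new SparkLauncher()
  .setAppResource("/path/to/my-app.jar")       // placeholder jar
  .setMainClass("com.example.MyApp")           // placeholder main class
  .setMaster("spark://192.168.0.131:7077")     // placeholder master URL
  .setConf(SparkLauncher.DRIVER_MEMORY, "8g")  // example value; the question's driver had 4 GB
  .launch()
process.waitFor()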

Cassandra Streaming error - Unknown keyspace system_traces

In our dev cluster, which has been running smoothly until now, the following failure occurs when we replace a node (which we have been doing constantly) and prevents the replacement node from joining.
The Cassandra version is 2.0.7.
What can be done about it?
ERROR [STREAM-IN-/10.128.---.---] 2014-11-19 12:35:58,007 StreamSession.java (line 420) [Stream #9cad81f0-6fe8-11e4-b575-4b49634010a9] Streaming error occurred
java.lang.AssertionError: Unknown keyspace system_traces
at org.apache.cassandra.db.Keyspace.<init>(Keyspace.java:260)
at org.apache.cassandra.db.Keyspace.open(Keyspace.java:110)
at org.apache.cassandra.db.Keyspace.open(Keyspace.java:88)
at org.apache.cassandra.streaming.StreamSession.addTransferRanges(StreamSession.java:239)
at org.apache.cassandra.streaming.StreamSession.prepare(StreamSession.java:436)
at org.apache.cassandra.streaming.StreamSession.messageReceived(StreamSession.java:368)
at org.apache.cassandra.streaming.ConnectionHandler$IncomingMessageHandler.run(ConnectionHandler.java:289)
at java.lang.Thread.run(Thread.java:745)
I got the same error while I was trying to set up my cluster. As I was experimenting with different switches in cassandra.yaml, I restarted the service multiple times and removed the system dir under the data directory (/var/lib/cassandra/data, as mentioned here).
I guess for some reason Cassandra tries to load the system_traces keyspace (the other dir under /var/lib/cassandra/data) and fails, and nodetool throws this error. You can just remove both system and system_traces before starting the Cassandra service, or even better delete all content of the commitlog, data and saved_caches directories there.
This obviously only works if you don't have any data in the system yet.