Unable to start Cassandra

I am trying to start Cassandra, so I ran:
sudo ./cassandra
and came across this error:
Error: Exception thrown by the agent : java.net.MalformedURLException: Local host name unknown: java.net.UnknownHostException: node24.nise.local: node24.nise.local
So I did what was suggested in the linked question about problems starting Cassandra and changed the /etc/hosts file.
Then the startup got stuck after this:
INFO 22:27:14,227 CFS(Keyspace='system', ColumnFamily='local') liveRatio is 33.904761904761905 (just-counted was 33.904761904761905). calculation took 110ms for 3 cells
INFO 22:27:14,260 Enqueuing flush of Memtable-local@726006040(84/840 serialized/live bytes, 4 ops)
INFO 22:27:14,262 Writing Memtable-local@726006040(84/2848 serialized/live bytes, 4 ops)
INFO 22:27:14,280 Completed flushing /var/lib/cassandra/data/system/local/system-local-jb-50-Data.db (116 bytes) for commitlog position ReplayPosition(segmentId=1401859631027, position=500327)
WARN 22:27:14,327 setting live ratio to maximum of 64.0 instead of Infinity
INFO 22:27:14,327 Enqueuing flush of Memtable-local@1689909512(10100/101000 serialized/live bytes, 259 ops)
INFO 22:27:14,328 CFS(Keyspace='system', ColumnFamily='local') liveRatio is 64.0 (just-counted was 64.0). calculation took 0ms for 0 cells
INFO 22:27:14,350 Writing Memtable-local@1689909512(10100/101000 serialized/live bytes, 259 ops)
INFO 22:27:14,386 Completed flushing /var/lib/cassandra/data/system/local/system-local-jb-51-Data.db (5278 bytes) for commitlog position ReplayPosition(segmentId=1401859631027, position=512328)
INFO 22:27:14,493 Node localhost/127.0.0.1 state jump to normal
No other line was printed after this. Can anyone help me understand why exactly this happened?

I was getting the same error. You just need to run this command in the terminal:
hostname localhost (or the hostname of the machine where Cassandra is running)
I believe this will solve your problem.
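For illustration, here is roughly what the /etc/hosts fix amounts to (the hostname node24.nise.local comes from the error above; the 127.0.1.1 mapping follows the common Debian-style convention, so treat this as a sketch rather than the only valid layout):

# /etc/hosts
127.0.0.1   localhost
127.0.1.1   node24.nise.local   node24

# alternatively, set the hostname directly (requires root):
sudo hostname localhost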

I think that after this statement:
INFO 22:27:14,493 Node localhost/127.0.0.1 state jump to normal
your server is running normally. To verify, run jps and check whether CassandraDaemon is running.
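For example, a healthy node shows up in the jps output something like this (the PIDs are illustrative):

$ jps
4211 CassandraDaemon
5002 Jps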

Related

Why am I getting "Removing worker because we got no heartbeat in 60 seconds" on Spark master

I think I might have stumbled across a bug and wanted to get other people's input. I am running a PySpark application using Spark 2.2.0 in standalone mode. I am doing a somewhat heavy transformation in Python inside a flatMap, and the driver keeps killing the workers.
Here is what I am seeing:
After 60 seconds without seeing any heartbeat message from the workers, the master prints this message to the log:
Removing worker [worker name] because we got no heartbeat in 60 seconds
Removing worker [worker name] on [IP]:[port]
Telling app of lost executor: [executor number]
I then see in the driver log the following message:
Lost executor [executor number] on [executor IP]: worker lost
The worker then terminates and I see this message in its log:
Driver commanded a shutdown
I have looked at the Spark source code, and from what I can tell, as long as the executor is alive it should send a heartbeat message back, since it uses a ThreadUtils.newDaemonSingleThreadScheduledExecutor to do this.
One other thing I noticed while running top on one of the workers is that the executor JVM seems to be suspended throughout this process. There are as many Python processes as I specified in the SPARK_WORKER_CORES env variable, and each is consuming close to 100% of the CPU.
Anyone have any thoughts on this?
I was facing this same issue; increasing the timeout intervals worked.
Excerpt from start-all.sh logs:
INFO Utils: Successfully started service 'sparkMaster' on port 7077.
INFO Master: Starting Spark master at spark://master:7077
INFO Master: Running Spark version 3.0.1
INFO Utils: Successfully started service 'MasterUI' on port 8080.
INFO MasterWebUI: Bound MasterWebUI to 0.0.0.0, and started at http://master:8080
INFO Master: I have been elected leader! New state: ALIVE
INFO Master: Registering worker slave01:41191 with 16 cores, 15.7 GiB RAM
INFO Master: Registering worker slave02:37853 with 16 cores, 15.7 GiB RAM
WARN Master: Removing worker-20210618205117-slave01-41191 because we got no heartbeat in 60 seconds
INFO Master: Removing worker worker-20210618205117-slave01-41191 on slave01:41191
INFO Master: Telling app of lost worker: worker-20210618205117-slave01-41191
WARN Master: Removing worker-20210618204723-slave02-37853 because we got no heartbeat in 60 seconds
INFO Master: Removing worker worker-20210618204723-slave02-37853 on slave02:37853
INFO Master: Telling app of lost worker: worker-20210618204723-slave02-37853
WARN Master: Got heartbeat from unregistered worker worker-20210618205117-slave01-41191. This worker was never registered, so ignoring the heartbeat.
WARN Master: Got heartbeat from unregistered worker worker-20210618204723-slave02-37853. This worker was never registered, so ignoring the heartbeat.
Solution: add the following configs to $SPARK_HOME/conf/spark-defaults.conf:
spark.network.timeout 50000
spark.executor.heartbeatInterval 5000
spark.worker.timeout 5000
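If you'd rather test the values before editing spark-defaults.conf, the first two can also be passed per job with --conf (a sketch; the application script name is a placeholder):

spark-submit \
  --conf spark.network.timeout=50000 \
  --conf spark.executor.heartbeatInterval=5000 \
  my_app.py

Note that spark.worker.timeout is read by the standalone master/worker daemons, so as far as I know that one still has to live in the configuration picked up by start-all.sh rather than be passed to spark-submit.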

WARN [main] 2017-12-01 18:17:40,014 Gossiper.java:1415 - Unable to gossip with any seeds but continuing since node is in its own seed list

I installed Cassandra and I see all 6 nodes up and running, but during startup I see this warning: "WARN [main] 2017-12-01 18:17:40,014 Gossiper.java:1415 - Unable to gossip with any seeds but continuing since node is in its own seed list". Can anyone explain why we get this warning during startup?
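For context, the warning fires when the only seed a node can reach is itself. The relevant section of cassandra.yaml looks roughly like this (the IPs are placeholders); pointing seeds at one or two other nodes' addresses instead of only the node's own address usually makes it go away:

seed_provider:
  - class_name: org.apache.cassandra.locator.SimpleSeedProvider
    parameters:
      - seeds: "10.0.0.1,10.0.0.2"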

giraph.numInputThreads: execution time for the "input superstep" is the same using 1 or 8 threads, how is this possible?

I'm doing a BFS search through Wikipedia (Spanish edition). I converted the dump into a file that Giraph can read.
Using 1 worker, a 1 GB file took 452 seconds. I executed Giraph with this command:
/home/hadoop/bin/yarn jar /home/hadoop/giraph/giraph.jar ar.edu.info.unlp.tesina.lectura.grafo.BusquedaDeCaminosNavegacionalesWikiquote -vif ar.edu.info.unlp.tesina.vertice.estructuras.IdTextWithComplexValueInputFormat -vip /user/hduser/input/grafo-wikipedia.txt -vof ar.edu.info.unlp.tesina.vertice.estructuras.IdTextWithComplexValueOutputFormat -op /user/hduser/output/caminosNavegacionales -w 1 -yh 120000 -ca giraph.metrics.enable=true,giraph.useOutOfCoreMessages=true
Container logs:
16/08/24 21:17:02 INFO master.BspServiceMaster: generateVertexInputSplits: Got 8 input splits for 1 input threads
16/08/24 21:17:02 INFO master.BspServiceMaster: createVertexInputSplits: Starting to write input split data to zookeeper with 1 threads
16/08/24 21:17:02 INFO master.BspServiceMaster: createVertexInputSplits: Done writing input split data to zookeeper
16/08/24 21:17:02 INFO yarn.GiraphYarnTask: [STATUS: task-0] MASTER_ZOOKEEPER_ONLY checkWorkers: Done - Found 1 responses of 1 needed to start superstep -1
16/08/24 21:17:02 INFO netty.NettyClient: Using Netty without authentication.
16/08/24 21:17:02 INFO netty.NettyClient: connectAllAddresses: Successfully added 1 connections, (1 total connected) 0 failed, 0 failures total.
16/08/24 21:17:02 INFO partition.PartitionUtils: computePartitionCount: Creating 1, default would have been 1 partitions.
...
16/08/24 21:25:40 INFO netty.NettyClient: stop: Halting netty client
16/08/24 21:25:40 INFO netty.NettyClient: stop: reached wait threshold, 1 connections closed, releasing resources now.
16/08/24 21:25:43 INFO netty.NettyClient: stop: Netty client halted
16/08/24 21:25:43 INFO netty.NettyServer: stop: Halting netty server
16/08/24 21:25:43 INFO netty.NettyServer: stop: Start releasing resources
16/08/24 21:25:44 INFO bsp.BspService: process: cleanedUpChildrenChanged signaled
16/08/24 21:25:47 INFO netty.NettyServer: stop: Netty server halted
16/08/24 21:25:47 INFO bsp.BspService: process: masterElectionChildrenChanged signaled
16/08/24 21:25:47 INFO master.MasterThread: setup: Took 0.898 seconds.
16/08/24 21:25:47 INFO master.MasterThread: input superstep: Took 452.531 seconds.
16/08/24 21:25:47 INFO master.MasterThread: superstep 0: Took 64.376 seconds.
16/08/24 21:25:47 INFO master.MasterThread: superstep 1: Took 1.591 seconds.
16/08/24 21:25:47 INFO master.MasterThread: shutdown: Took 6.609 seconds.
16/08/24 21:25:47 INFO master.MasterThread: total: Took 526.006 seconds.
As you can see, the first line tells us that the input superstep is executing with only one thread, and it took 452 seconds to finish.
I did another test using giraph.numInputThreads=8, trying to do the input superstep with 8 threads:
/home/hadoop/bin/yarn jar /home/hadoop/giraph/giraph.jar ar.edu.info.unlp.tesina.lectura.grafo.BusquedaDeCaminosNavegacionalesWikiquote -vif ar.edu.info.unlp.tesina.vertice.estructuras.IdTextWithComplexValueInputFormat -vip /user/hduser/input/grafo-wikipedia.txt -vof ar.edu.info.unlp.tesina.vertice.estructuras.IdTextWithComplexValueOutputFormat -op /user/hduser/output/caminosNavegacionales -w 1 -yh 120000 -ca giraph.metrics.enable=true,giraph.useOutOfCoreMessages=true,giraph.numInputThreads=8
The result was the following:
16/08/24 21:54:00 INFO master.BspServiceMaster: generateVertexInputSplits: Got 8 input splits for 8 input threads
16/08/24 21:54:00 INFO master.BspServiceMaster: createVertexInputSplits: Starting to write input split data to zookeeper with 1 threads
16/08/24 21:54:00 INFO master.BspServiceMaster: createVertexInputSplits: Done writing input split data to zookeeper
...
16/08/24 22:10:07 INFO master.MasterThread: setup: Took 0.093 seconds.
16/08/24 22:10:07 INFO master.MasterThread: input superstep: Took 891.339 seconds.
16/08/24 22:10:07 INFO master.MasterThread: superstep 0: Took 66.635 seconds.
16/08/24 22:10:07 INFO master.MasterThread: superstep 1: Took 1.837 seconds.
16/08/24 22:10:07 INFO master.MasterThread: shutdown: Took 6.605 seconds.
16/08/24 22:10:07 INFO master.MasterThread: total: Took 966.512 seconds.
So, my question is: how is it possible that Giraph takes 452 seconds with one input thread and 891 seconds with eight? It should be exactly the opposite, right?
The cluster used for this was one master and one slave, both r3.8xlarge EC2 instances on AWS.
The problem is related to HDFS access. There are 8 threads accessing the same resource, which can only be read sequentially, so they contend with one another. For best performance, giraph.numInputThreads should be 1 or 2.
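In other words, reusing the -ca flag from the commands above, the fix is just (command abridged to the relevant part):

/home/hadoop/bin/yarn jar /home/hadoop/giraph/giraph.jar ... \
  -ca giraph.metrics.enable=true,giraph.useOutOfCoreMessages=true,giraph.numInputThreads=1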

Spark Shell Yarn Client Mode - Akka AssociationError

When I launch Spark Shell using:
spark-shell --master yarn --deploy-mode client
I'm getting the following error:
16/03/21 20:52:29 ERROR ErrorMonitor: AssociationError [akka.tcp://sparkDriver@ipaddress10:47915] -> [akka.tcp://sparkExecutor@hostname02:48703]: Error [Association failed with [akka.tcp://sparkExecutor@hostname02:48703]] [
akka.remote.EndpointAssociationException: Association failed with [akka.tcp://sparkExecutor@hostname02:48703]
Caused by: akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2: No route to host
]
akka.event.Logging$Error$NoCause$
16/03/21 20:52:29 ERROR ErrorMonitor: AssociationError [akka.tcp://sparkDriver@ipaddress10:47915] -> [akka.tcp://sparkExecutor@hostname02:48703]: Error [Association failed with [akka.tcp://sparkExecutor@hostname02:48703]] [
akka.remote.EndpointAssociationException: Association failed with [akka.tcp://sparkExecutor@hostname02:48703]
Caused by: akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2: No route to host
]
akka.event.Logging$Error$NoCause$
16/03/21 20:52:32 ERROR YarnScheduler: Lost executor 3 on hostname01: remote Rpc client disassociated
16/03/21 20:52:32 INFO DAGScheduler: Executor lost: 3 (epoch 0)
16/03/21 20:52:32 INFO BlockManagerMasterEndpoint: Trying to remove executor 3 from BlockManagerMaster.
16/03/21 20:52:32 INFO BlockManagerMasterEndpoint: Removing block manager BlockManagerId(3, hostname01, 37497)
16/03/21 20:52:32 INFO BlockManagerMaster: Removed 3 successfully in removeExecutor
16/03/21 20:52:32 INFO ExecutorAllocationManager: Existing executor 3 has been removed (new total is 0)
Firewall & Iptables are turned off. Machines in the cluster are mutually ping-able on all the ports.
But i'm puzzled why I'm still getting "akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2: No route to host"
Any help please.
You probably have a name resolution issue. You should try using IP addresses in your settings (for instance in the slaves file) rather than names to confirm this hypothesis.
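As a sketch of that suggestion (the addresses are placeholders), $SPARK_HOME/conf/slaves would list raw IPs instead of hostnames:

# $SPARK_HOME/conf/slaves
192.168.1.11
192.168.1.12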
I have experienced the same problem before. I found that I had mistyped some environment variables regarding SPARK_LOCAL_IP and SPARK_PUBLIC_DNS.
To resolve your problem, you have to:
In all your NodeManager nodes, check the .bashrc and .bash_profile files and make sure the env variables SPARK_LOCAL_IP and SPARK_PUBLIC_DNS are set to the right values, then restart your NodeManager(s).
In your client machine (where you issue the spark-shell command), set the same env variables to your client machine's IP and hostname.
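Roughly, that means lines like these in each node's .bashrc (or in conf/spark-env.sh); the address and hostname are placeholders for each machine's own values:

# assumption: this node's address is 192.168.1.10, resolvable as node1.example.com
export SPARK_LOCAL_IP=192.168.1.10
export SPARK_PUBLIC_DNS=node1.example.com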

Cannot start Cassandra with "bin/cassandra -f"

I have a problem using Cassandra: I can start it with "bin/cassandra", but I cannot start it with "bin/cassandra -f". Does anyone know the reason?
Here is the detailed info:
root@server1:~/cassandra# bin/cassandra -f
INFO 10:51:31,500 JNA not found. Native methods will be disabled.
INFO 10:51:31,740 DiskAccessMode 'auto' determined to be mmap, indexAccessMode is mmap
INFO 10:51:32,043 Deleted /var/lib/cassandra/data/system/LocationInfo-61-Data.db
INFO 10:51:32,044 Deleted /var/lib/cassandra/data/system/LocationInfo-62-Data.db
INFO 10:51:32,052 Deleted /var/lib/cassandra/data/system/LocationInfo-63-Data.db
INFO 10:51:32,053 Deleted /var/lib/cassandra/data/system/LocationInfo-64-Data.db
INFO 10:51:32,063 Sampling index for /var/lib/cassandra/data/system/LocationInfo-65-Data.db
INFO 10:51:32,117 Sampling index for /var/lib/cassandra/data/Keyspace1/Standard2-5-Data.db
INFO 10:51:32,118 Sampling index for /var/lib/cassandra/data/Keyspace1/Standard2-6-Data.db
INFO 10:51:32,120 Sampling index for /var/lib/cassandra/data/Keyspace1/Standard2-7-Data.db
INFO 10:51:32,131 Replaying /var/lib/cassandra/commitlog/CommitLog-1285869561954.log
INFO 10:51:32,143 Finished reading /var/lib/cassandra/commitlog/CommitLog-1285869561954.log
INFO 10:51:32,145 Creating new commitlog segment /var/lib/cassandra/commitlog/CommitLog-1286301092145.log
INFO 10:51:32,153 Standard2 has reached its threshold; switching in a fresh Memtable at CommitLogContext(file='/var/lib/cassandra/commitlog/CommitLog-1286301092145.log', position=121)
INFO 10:51:32,155 Enqueuing flush of Memtable-Standard2@1811560891(29 bytes, 1 operations)
INFO 10:51:32,156 Writing Memtable-Standard2@1811560891(29 bytes, 1 operations)
INFO 10:51:32,200 Completed flushing /var/lib/cassandra/data/Keyspace1/Standard2-8-Data.db
INFO 10:51:32,203 Compacting [org.apache.cassandra.io.SSTableReader(path='/var/lib/cassandra/data/Keyspace1/Standard2-5-Data.db'),org.apache.cassandra.io.SSTableReader(path='/var/lib/cassandra/data/Keyspace1/Standard2-6-Data.db'),org.apache.cassandra.io.SSTableReader(path='/var/lib/cassandra/data/Keyspace1/Standard2-7-Data.db'),org.apache.cassandra.io.SSTableReader(path='/var/lib/cassandra/data/Keyspace1/Standard2-8-Data.db')]
INFO 10:51:32,214 Recovery complete
INFO 10:51:32,214 Log replay complete
INFO 10:51:32,230 Saved Token found: 47408016217042861442279446207060121025
INFO 10:51:32,230 Saved ClusterName found: Test Cluster
INFO 10:51:32,231 Saved partitioner not found. Using org.apache.cassandra.dht.RandomPartitioner
INFO 10:51:32,250 LocationInfo has reached its threshold; switching in a fresh Memtable at CommitLogContext(file='/var/lib/cassandra/commitlog/CommitLog-1286301092145.log', position=345)
INFO 10:51:32,250 Enqueuing flush of Memtable-LocationInfo@1120194637(95 bytes, 2 operations)
INFO 10:51:32,251 Writing Memtable-LocationInfo@1120194637(95 bytes, 2 operations)
INFO 10:51:32,307 Completed flushing /var/lib/cassandra/data/system/LocationInfo-66-Data.db
INFO 10:51:32,316 Starting up server gossip
INFO 10:51:32,329 Compacted to /var/lib/cassandra/data/Keyspace1/Standard2-9-Data.db. 1670/1440 bytes for 6 keys. Time: 125ms.
INFO 10:51:32,366 Binding thrift service to /172.24.0.80:9160
INFO 10:51:32,369 Cassandra starting up...
I can't see any problems here; the log shows a normal startup. (-f is short for 'foreground': it simply keeps Cassandra attached to the terminal instead of detaching.)
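If you want to double-check that the foreground instance really is serving, you can probe it from another terminal (a sketch; 9160 is the Thrift port from the log above, and nodetool's flag spelling varies a bit across Cassandra versions):

# is the Thrift port listening?
netstat -ln | grep 9160

# does the ring see this node?
bin/nodetool -h 172.24.0.80 ring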
