How do I restart the JMX server in Cassandra? - cassandra

I killed the process that was using port 7199, then tried to start Cassandra with
cassandra -f -R
but got this output:
INFO 05:45:43 Initializing system.schema_functions
INFO 05:45:43 Initializing system.schema_aggregates
INFO 05:45:43 Not submitting build tasks for views in keyspace system as storage service is not initialized
INFO 05:45:43 Configured JMX server at: service:jmx:rmi://127.0.0.1/jndi/rmi://127.0.0.1:7199/jmxrmi
Exception (java.lang.RuntimeException) encountered during startup: java.util.concurrent.ExecutionException: FSWriteError in
java.lang.RuntimeException: java.util.concurrent.ExecutionException: FSWriteError in at org.apache.cassandra.utils.FBUtilities.waitOnFuture(FBUtilities.java:403)
I want to get the process that uses port 7199 running again;
I killed it because I was getting a message that port 7199 was already in use.

Try to kill the process completely. If it is a standalone install, use:
$ ps auwx | grep cassandra
$ sudo kill <pid>
or $ sudo service cassandra stop if you have a local setup

Follow these steps:
$ jps
You see some processes running. For example:
9107 Jps
1112 CassandraDaemon
Then kill the CassandraDaemon process using the process id shown by jps. In this example, the CassandraDaemon process id is 1112:
$ kill -9 1112
Then check the processes again after a while:
$ jps
You will see that CassandraDaemon is no longer listed:
9170 Jps
Then remove your saved_caches directory and start Cassandra again.
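Put together, a minimal sketch of that sequence (assuming a package install with saved_caches under /var/lib/cassandra; check saved_caches_directory in your cassandra.yaml):
#!/bin/bash
# Find the CassandraDaemon pid via jps, kill it, clear saved_caches, then restart.
pid=$(jps | awk '/CassandraDaemon/ {print $1}')
if [ -n "$pid" ]; then
    kill -9 "$pid"
    sleep 5                      # give the process a moment to go away
fi
jps                              # CassandraDaemon should no longer be listed
sudo rm -rf /var/lib/cassandra/saved_caches/*
cassandra -f -R                  # start Cassandra in the foreground again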

Related

Cassandra restart issues while restoring to a new cluster

I am restoring to a fresh new Cassandra 2.2.5 cluster consisting of 3 nodes.
Initial cluster health of the NEW cluster:
-- Address Load Tokens Owns Host ID Rack
UN 10.40.1.1 259.31 KB 256 ? d2b29b08-9eac-4733-9798-019275d66cfc uswest1adevc
UN 10.40.1.2 230.12 KB 256 ? 5484ab11-32b1-4d01-a5fe-c996a63108f1 uswest1adevc
UN 10.40.1.3 248.47 KB 256 ? bad95fe2-70c5-4a2f-b517-d7fd7a32bc45 uswest1cdevc
As part of the restore instructions in the DataStax docs, I do the following on the new cluster:
1) cassandra stop on all three nodes, one by one.
2) Edit cassandra.yaml on all three nodes with the backed-up token ring information. [Step 2 from docs]
3) Remove the contents of /var/lib/cassandra/data/system/* [Step 4 from docs]
4) cassandra start on nodes 10.40.1.1, 10.40.1.2, 10.40.1.3 respectively.
Result:
10.40.1.1 restarts back successfully:
-- Address Load Tokens Owns Host ID Rack
UN 10.40.1.1 259.31 KB 256 ? 2d23add3-9eac-4733-9798-019275d125d3 uswest1adevc
But the second and third nodes fail to restart, stating:
java.lang.RuntimeException: A node with address 10.40.1.2 already exists, cancelling join. Use cassandra.replace_address if you want to replace this node.
at org.apache.cassandra.service.StorageService.checkForEndpointCollision(StorageService.java:546) ~[apache-cassandra-2.2.5.jar:2.2.5]
at org.apache.cassandra.service.StorageService.prepareToJoin(StorageService.java:766) ~[apache-cassandra-2.2.5.jar:2.2.5]
at org.apache.cassandra.service.StorageService.initServer(StorageService.java:693) ~[apache-cassandra-2.2.5.jar:2.2.5]
at org.apache.cassandra.service.StorageService.initServer(StorageService.java:585) ~[apache-cassandra-2.2.5.jar:2.2.5]
at org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:300) [apache-cassandra-2.2.5.jar:2.2.5]
at org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:516) [apache-cassandra-2.2.5.jar:2.2.5]
at org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:625) [apache-cassandra-2.2.5.jar:2.2.5]
INFO [StorageServiceShutdownHook] 2016-08-09 18:13:21,980 Gossiper.java:1449 - Announcing shutdown
java.lang.RuntimeException: A node with address 10.40.1.3 already exists, cancelling join. Use cassandra.replace_address if you want to replace this node.
...
Eventual cluster health:
-- Address Load Tokens Owns Host ID Rack
UN 10.40.1.1 259.31 KB 256 ? 2d23add3-9eac-4733-9798-019275d125d3 uswest1adevc
DN 10.40.1.2 230.12 KB 256 ? 6w2321ad-32b1-4d01-a5fe-c996a63108f1 uswest1adevc
DN 10.40.1.3 248.47 KB 256 ? 9et4944d-70c5-4a2f-b517-d7fd7a32bc45 uswest1cdevc
I understand that the Host ID of a node might change after the system directories are removed.
My question is:
Do I need to explicitly tell each node on startup to replace itself (e.g. via cassandra.replace_address)? Are the docs incomplete, or am I missing something in my steps?
Turns out there were stale commitlog and saved_caches directories which I had missed deleting earlier. The instructions work correctly once those directories are deleted as well.
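For reference, the per-node cleanup that makes the restore go through looks roughly like this (a sketch assuming the default /var/lib/cassandra layout; run it on each node while Cassandra is stopped):
$ sudo service cassandra stop
$ sudo rm -rf /var/lib/cassandra/data/system/*      # step 4 from the docs
$ sudo rm -rf /var/lib/cassandra/commitlog/*        # stale directory I had missed
$ sudo rm -rf /var/lib/cassandra/saved_caches/*     # stale directory I had missed
$ sudo service cassandra start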
Usually in a situation like this, after I do a
$ systemctl stop cassandra
and then run
$ ps awxs | grep cassandra
I will notice Cassandra still has some processes up.
I usually do a
$ kill -9 <cassandra-pid>
and
$ rm -rf /var/lib/cassandra/data/* /var/lib/cassandra/commitlog/*
java.lang.RuntimeException: A node with address 10.40.1.3 already exists, cancelling join. Use cassandra.replace_address if you want to replace this node.
If you are still facing the above error, it means your Cassandra process is still running on that node. Log in to node 10.40.1.3 first, then follow these steps:
$ jps
You see some processes running. For example:
9107 Jps
1112 CassandraDaemon
Then kill the CassandraDaemon process using the process id shown by jps. In this example, the CassandraDaemon process id is 1112:
$ kill -9 1112
Then check the processes again after a while:
$ jps
You will see that CassandraDaemon is no longer listed:
9170 Jps
Then remove your saved_caches and commitlog directories and start Cassandra again.
Do this on every node that shows the error above; a loop for that is sketched below.
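If several nodes are affected, a small loop can apply the same cleanup to each of them (a sketch; the node list, ssh user, and /var/lib/cassandra paths are assumptions to adapt):
#!/bin/bash
for node in 10.40.1.2 10.40.1.3; do      # nodes failing with "already exists"
    ssh someuser@"$node" 'sudo service cassandra stop;
        sudo rm -rf /var/lib/cassandra/saved_caches/* /var/lib/cassandra/commitlog/*;
        sudo service cassandra start'
done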

How to know if app is in RUNNING state to kill spark-submit process?

I am creating a shell script which will be executed from Jenkins, because we have many streaming jobs and it seems easier to manage them from Jenkins. So I have created the script below.
#!/bin/bash
spark-submit "spark parameters here" > /dev/null 2>&1 &
processId=$!
echo $processId
sleep 5m
kill $processId
If I don't have a sleep, the spark-submit process is killed immediately and no Spark application is submitted. With the sleep, the spark-submit process gets enough time to submit the Spark application.
My question is: is there a better way to know whether the Spark application is in the RUNNING state, so that the spark-submit process can then be killed?
Spark 1.6.0 with YARN
You should spark-submit your Spark application and then use yarn application -status <ApplicationId>, as described in the application section of the YARN commands documentation:
Prints the status of the application.
You could get <ApplicationId> from the logs of spark-submit (in client deploy mode) or use yarn application -list -appType SPARK -appStates RUNNING.
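A sketch of how the Jenkins script could use that instead of a fixed sleep (the log file name, the grep patterns, and the exact "State :" wording are assumptions that may need adjusting for your Hadoop version):
#!/bin/bash
# Submit in the background, keeping the client-mode log so the application id can be grepped out.
spark-submit "spark parameters here" > submit.log 2>&1 &
processId=$!

# In client deploy mode the log contains the YARN id, e.g. application_1470123456789_0042.
appId=""
while [ -z "$appId" ]; do
    sleep 5
    appId=$(grep -o 'application_[0-9_]*' submit.log | head -n 1)
done

# Poll YARN until the app reports RUNNING, then the submitter process can be killed.
while ! yarn application -status "$appId" 2>/dev/null | grep -q 'State : RUNNING'; do
    sleep 5
done
kill $processId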
I don't know what Spark version you are using or if you are running in standalone mode, but anyway, you can use the REST API for submitting/killing your apps. The last time I checked it was pretty much undocumented, but it worked properly.
When you submit an application, you will get a submissionId which you can use later for either getting the current state or killing it. The possible states are documented here:
// SUBMITTED: Submitted but not yet scheduled on a worker
// RUNNING: Has been allocated to a worker to run
// FINISHED: Previously ran and exited cleanly
// RELAUNCHING: Exited non-zero or due to worker failure, but has not yet started running again
// UNKNOWN: The state of the driver is temporarily not known due to master failure recovery
// KILLED: A user manually killed this driver
// FAILED: The driver exited non-zero and was not supervised
// ERROR: Unable to run or restart due to an unrecoverable error (e.g. missing jar file)
This is especially useful for long-running apps (e.g. streaming), since you don't have to babysit the shell script.
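For a standalone master with the REST submission server enabled (port 6066 by default), the status and kill calls look roughly like this; the master hostname and submissionId are made-up examples:
# Returns JSON including the current driverState (SUBMITTED, RUNNING, FINISHED, ...).
$ curl http://spark-master:6066/v1/submissions/status/driver-20170101000000-0000
# Kill the driver once you are done with it.
$ curl -X POST http://spark-master:6066/v1/submissions/kill/driver-20170101000000-0000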

What causes BitBake worker processes to exit unexpectedly?

I have a BitBake build process that runs in a Docker container (CentOS 7). The build fails during recipe gcc-cross-i586-5.2.0-r0, task do_compile, on every run I try.
An example of bitbake's output:
NOTE: recipe gcc-cross-i586-5.2.0-r0: task do_compile: Started
ERROR: Worker process (367) exited unexpectedly (-9), shutting down...
ERROR: Worker process (367) exited unexpectedly (-9), shutting down...
ERROR: Worker process (367) exited unexpectedly (-9), shutting down...
ERROR: Worker process (367) exited unexpectedly (-9), shutting down...
NOTE: Tasks Summary: Attempted 1538 tasks of which 17 didn't need to be rerun and all succeeded.
Is this a problem with recipe gcc-cross-i586-5.2.0-r0: task do_compile? Perhaps an out-of-memory error? I don't know what the -9 refers to or how to find out more information about it.
Try:
$ bitbake -c cleansstate gcc-cross ; bitbake -k gcc-cross
How much RAM do you have?
Post the error log here.
This worked for me:
Edit conf/local.conf and decrease the number of BitBake worker threads by adding the following to your conf/local.conf file (under the build directory):
BB_NUMBER_THREADS = "6"
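If the -9 turns out to be the OOM killer (the question already suspects an out-of-memory error), it can also help to cap make's parallelism alongside BitBake's; both settings go in conf/local.conf, and the values here are just examples for a smaller machine:
# conf/local.conf -- lower parallelism to reduce peak memory use during do_compile
BB_NUMBER_THREADS = "6"
PARALLEL_MAKE = "-j 6"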
Just a long shot: the -9 is a signal number, meaning the worker was killed by SIGKILL, which on a memory-constrained machine is often the kernel's OOM killer (errno 9 would be EBADF, bad file number, but that is not what an exit status of -9 means). Is it possible you have done some operations as root so that some files are not accessible during the build? Is the issue reproducible, i.e. if you rm -rf tmp does it happen again? Make sure you don't have any permission issues in your project directory and the associated file system(s).

How to connect master and slaves in Apache-Spark? (Standalone Mode)

I'm using the Spark Standalone Mode tutorial page to install Spark in standalone mode.
1- I have started a master by:
./sbin/start-master.sh
2- I have started a worker by:
./bin/spark-class org.apache.spark.deploy.worker.Worker spark://ubuntu:7077
Note: spark://ubuntu:7077 is my master name, which I can see in the Master WebUI.
Problem: With the second command a worker starts successfully, but it cannot associate with the master. It retries repeatedly and then gives this message:
15/02/08 11:30:04 WARN Remoting: Tried to associate with unreachable remote address [akka.tcp://sparkMaster@ubuntu:7077]. Address is now gated for 5000 ms, all messages to this address will be delivered to dead letters. Reason: Connection refused: ubuntu/127.0.1.1:7077
15/02/08 11:30:04 INFO RemoteActorRefProvider$RemoteDeadLetterActorRef: Message [org.apache.spark.deploy.DeployMessages$RegisterWorker] from Actor[akka://sparkWorker/user/Worker#-1296628173] to Actor[akka://sparkWorker/deadLetters] was not delivered. [20] dead letters encountered. This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'.
15/02/08 11:31:15 ERROR Worker: All masters are unresponsive! Giving up.
What is the problem?
Thanks
I usually start from the spark-env.sh template and set the properties that I need. For a simple cluster you need:
SPARK_MASTER_IP
Then create a file called "slaves" in the same directory as spark-env.sh and list the slaves' IPs in it (one per line). Make sure you can reach all slaves through ssh.
Finally, copy this configuration to every machine in your cluster. Then start the entire cluster by executing the start-all.sh script and try spark-shell to check your configuration.
> sbin/start-all.sh
> bin/spark-shell
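Concretely, the two files could look like this (a sketch; the IP address and worker hostnames are made up):
conf/spark-env.sh:
export SPARK_MASTER_IP=192.168.1.10    # address the master binds to and advertises to workers
conf/slaves:
worker-host-1
worker-host-2
Running sbin/start-all.sh from the master then ssh-es into each host listed in conf/slaves and starts a worker pointed at spark://192.168.1.10:7077.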
You can set export SPARK_LOCAL_IP="<your-ip>" in $SPARK_HOME/conf/spark-env.sh to set the IP address Spark binds to on this node.
In my case, using Spark 2.4.7 in standalone mode, I had created a passwordless ssh key using ssh-keygen, but was still asked for the worker's password when starting the cluster.
What I did was follow the instructions here
https://www.cyberciti.biz/faq/how-to-set-up-ssh-keys-on-linux-unix/
This line solved the problem:
ssh-copy-id -i $HOME/.ssh/id_rsa.pub user@server-ip
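For completeness, the sequence is roughly this (user and server-ip are placeholders for the worker's login and host):
$ ssh-keygen -t rsa                                    # generate the key pair, if not done already
$ ssh-copy-id -i $HOME/.ssh/id_rsa.pub user@server-ip
$ ssh user@server-ip true                              # should now succeed without a password prompt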

Running Mesos via Monit

I am trying to run Mesos (without ZooKeeper), using Monit to keep the slaves running.
I use the following scripts to start and stop mesos slaves:
start-slave.sh:
#!/bin/bash
nohup /home/someuser/mesos/build/bin/mesos-slave.sh \
    --master=192.168.0.241:5050 \
    --strict=false \
    --log_dir=/home/someuser/mesos/logs < /dev/null &
sleep 1
pidof lt-mesos-slave > /home/someuser/run/mesos-slave.pid
stop-slave.sh:
#!/bin/bash
cat /home/someuser/run/mesos-slave.pid | xargs kill -9
When I start the scripts via ssh they work great. However, when I use them via Monit as follows, the slaves register (I can see them in the web interface), but when I try to run a computation using Spark it fails, in the sense that most of the tasks are lost.
Monit setup:
check process mesos-slave with pidfile /home/someuser/run/mesos-slave.pid
start program = "/home/someuser/run/start-mesos.sh"
as uid someuser
stop program = "/home/someuser/run/stop-mesos.sh"
as uid someuser
if failed port 5051 then restart
Log excerpt:
I0925 14:06:21.461169 10633 slave.cpp:2413] Executor '20140924-160043-4043352256-5050-7966-0' of framework 20140925-140255-4043352256-5050-11608-0000 has terminated with signal Killed
E0925 14:06:21.461323 10639 slave.cpp:2686] Failed to unmonitor container for executor 20140924-160043-4043352256-5050-7966-0 of framework 20140925-140255-4043352256-5050-11608-0000: Not monitored
I0925 14:06:21.462224 10633 slave.cpp:2018] Handling status update TASK_LOST (UUID: 8258a34e-7831-4e5d-ba55-6df2b905b5ba) for task 0 of framework 20140925-140255-4043352256-5050-11608-0000 from @0.0.0.0:0
Am I using Monit correctly?
