Running Mesos via Monit - apache-spark

I am trying to run Mesos (without ZooKeeper), using monit to keep the slaves running.
I use the following scripts to start and stop the Mesos slaves:
start-slave.sh:
#!/bin/bash
# Start the Mesos slave in the background and record its pid.
nohup /home/someuser/mesos/build/bin/mesos-slave.sh \
--master=192.168.0.241:5050 \
--strict=false \
--log_dir=/home/someuser/mesos/logs < /dev/null &
sleep 1
pidof lt-mesos-slave > /home/someuser/run/mesos-slave.pid
stop-slave.sh:
#!/bin/bash
cat /home/someuser/run/mesos-slave.pid | xargs kill -9
When I start the scripts via ssh, they work fine. However, when I run them via monit as follows, the slaves register (I can see them in the web interface), but when I try to run a computation using Spark it fails: most of the tasks are lost.
Monit setup:
check process mesos-slave with pidfile /home/someuser/run/mesos-slave.pid
    start program = "/home/someuser/run/start-mesos.sh"
        as uid someuser
    stop program = "/home/someuser/run/stop-mesos.sh"
        as uid someuser
    if failed port 5051 then restart
Log excerpt:
I0925 14:06:21.461169 10633 slave.cpp:2413] Executor '20140924-160043-4043352256-5050-7966-0' of framework 20140925-140255-4043352256-5050-11608-0000 has terminated with signal Killed
E0925 14:06:21.461323 10639 slave.cpp:2686] Failed to unmonitor container for executor 20140924-160043-4043352256-5050-7966-0 of framework 20140925-140255-4043352256-5050-11608-0000: Not monitored
I0925 14:06:21.462224 10633 slave.cpp:2018] Handling status update TASK_LOST (UUID: 8258a34e-7831-4e5d-ba55-6df2b905b5ba) for task 0 of framework 20140925-140255-4043352256-5050-11608-0000 from #0.0.0.0:0
Am I using Monit correctly?
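One thing worth checking: monit starts programs with a minimal, non-login environment, so variables the slave (and the executors it spawns) inherits over ssh may be absent under monit, which can show up as executors dying and tasks being lost. A hedged sketch of an environment-explicit start script (the exported values are assumptions to adapt):
#!/bin/bash
# start-slave.sh variant: export what the slave and its executors need,
# since monit does not provide a login-shell environment.
export PATH=/usr/local/bin:/usr/bin:/bin
export HOME=/home/someuser
# export JAVA_HOME=/usr/lib/jvm/java-8   # assumption: point at your JDK

nohup /home/someuser/mesos/build/bin/mesos-slave.sh \
--master=192.168.0.241:5050 \
--strict=false \
--log_dir=/home/someuser/mesos/logs < /dev/null &
sleep 1
pidof lt-mesos-slave > /home/someuser/run/mesos-slave.pid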

Related

mpirun launching a runtime script for thread binding

I am running a program in an MPI process; the executable file is myapp_mpi. The command line below launches the app correctly:
mpirun --bind-to none -np 128 myapp_mpi apprun -resethway -noconfout -nsteps 8000 -s benchData.tpr -cpo state.cpt -e ener.edr -dlb no -pin off -v
I now wish to constrain the thread binding with a script, pin.sh. The command line below then produces an error:
mpirun --bind-to none -np 128 pin.sh myapp_mpi apprun -resethway -noconfout -nsteps 8000 -s benchData.tpr -cpo state.cpt -e ener.edr -dlb no -pin off -v
I get the following error:
--------------------------------------------------------------------------
Open MPI tried to fork a new process via the "execve" system call but
failed. Open MPI checks many things before attempting to launch a
child process, but nothing is perfect. This error may be indicative
of another problem on the target host, or even something as silly as
having specified a directory for your application. Your job will now
abort.
Local host: machine001
Working dir: /home/user/myapp/bin
Application name: /home/user/myapp/bin/pin.sh
Error: Exec format error
--------------------------------------------------------------------------
mpirun: Forwarding signal 18 to job
mpirun: Forwarding signal 18 to job
--------------------------------------------------------------------------
mpirun was unable to start the specified application as it encountered an
error:
Error code: 1
Error name: (null)
Node: machine001
when attempting to start process rank 0.
--------------------------------------------------------------------------
125 total processes failed to start
The file locations are correct, a priori. Any clue?
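For what it's worth, execve failing with "Exec format error" on a script usually means the interpreter line is missing (or the file isn't executable on the remote node). A hedged sketch of the shape pin.sh likely needs, with the actual binding logic left as a placeholder:
#!/bin/bash
# pin.sh -- wrapper launched by mpirun. The #!/bin/bash line above is what
# execve needs to run a script; "Exec format error" is the classic symptom
# of its absence.
# (set up thread binding here -- e.g. taskset/numactl/env vars; placeholder)
exec "$@"
Also make sure every node sees an executable copy of the script: chmod +x pin.sh.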

Gracefully killing container and pods in kubernetes with spark exception

What is the recommended way to gracefully kill a container and driver pods in Kubernetes when an application fails or hits an exception? Currently, when my application runs into an exception, my pods and executors continue to run, and I noticed that my container doesn't get killed unless an explicit exit 1 is used. For some reason my Spark application doesn't cause an exit 1 status or a SIGTERM signal to be sent to the container or pod.
I tried adding the following to the YAML spec based on recommendations, but the driver pod and executors still don't terminate:
spec:
  terminationGracePeriodSeconds: 0
  driver:
    lifecycle:
      preStop:
        exec:
          command:
            - /bin/bash
            - -c
            - touch /var/run/killspark && sleep 65
The preStop lifecycle hook you added won't have any effect, since it is only triggered
before a container is terminated due to an API request or management
event such as liveness probe failure, preemption, resource contention
and others
https://kubernetes.io/docs/concepts/containers/container-lifecycle-hooks/
I suspect what you really have to figure out is why your container's main process keeps running despite the exception.
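A pod is only torn down when its main process exits, so a hedged sketch of a container entrypoint that runs the driver in the foreground and propagates its exit status (the script name and spark-submit flags are placeholders):
#!/bin/bash
# entrypoint.sh (hypothetical): run the driver in the foreground and exit
# with its status so a failed application actually terminates the container.
spark-submit --deploy-mode client "$@"
status=$?
echo "spark-submit exited with status $status"
exit $status
Once the driver container exits non-zero, the driver pod fails, and its executor pods should be cleaned up along with it.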

How do I restart the JMX server service in Cassandra?

I killed the process that was using port 7199, then I wanted to run Cassandra using:
cassandra -f -R
But I had this message :
INFO 05:45:43 Initializing system.schema_functions
INFO 05:45:43 Initializing system.schema_aggregates
INFO 05:45:43 Not submitting build tasks for views in keyspace system as storage service is not initialized
INFO 05:45:43 Configured JMX server at: service:jmx:rmi://127.0.0.1/jndi/rmi://127.0.0.1:7199/jmxrmi
Exception (java.lang.RuntimeException) encountered during startup: java.util.concurrent.ExecutionException: FSWriteError in
java.lang.RuntimeException: java.util.concurrent.ExecutionException: FSWriteError in at org.apache.cassandra.utils.FBUtilities.waitOnFuture(FBUtilities.java:403)
I want to run the process that uses port 7199; I killed it because I got a message that port 7199 was already in use.
Try killing the process completely. If it is standalone, use this:
$ ps auwx | grep cassandra
$ sudo kill pid
or $ sudo service cassandra stop if you have a local setup
Follow these steps:
$ jps
You see some processes running. For example:
9107 Jps
1112 CassandraDaemon
Then kill the CassandraDaemon process using the process id shown by jps. In my example, the process id for CassandraDaemon is 1112.
$ kill -9 1112
Then check the processes again after a while:
$ jps
You will see CassandraDaemon will no longer be available.
9170 Jps
Then remove your saved_caches and start Cassandra again.
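A minimal sketch of those steps combined (the saved_caches path is the packaged default and an assumption here; check saved_caches_directory in cassandra.yaml):
# kill the daemon, clear the saved caches, restart in the foreground
pgrep -f CassandraDaemon | xargs -r kill -9
rm -rf /var/lib/cassandra/saved_caches/*
cassandra -f -R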

How to know if app is in RUNNING state to kill spark-submit process?

I am creating a shell script which will be executed from Jenkins, because we have many streaming jobs and it seems easier to manage them from Jenkins. So I have created the script below.
#!/bin/bash
spark-submit "spark parameters here" > /dev/null 2>&1 &
processId=$!
echo $processId
sleep 5m
kill $processId
Without the sleep, the spark-submit process is killed immediately and no Spark application is submitted; with it, spark-submit gets enough time to submit the application.
My question is: is there a better way to know that the Spark application is in the RUNNING state, so that the spark-submit process can be killed?
Spark 1.6.0 with YARN
You should spark-submit your Spark application and then use yarn application -status <ApplicationId>, as described in the application section:
Prints the status of the application.
You could get <ApplicationId> from the logs of spark-submit (in client deploy mode) or use yarn application -list -appType SPARK -appStates RUNNING.
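A hedged sketch of that flow (parsing the application id out of the client-mode logs and the exact format of the status output are assumptions; adjust to what your spark-submit actually prints):
#!/bin/bash
# Submit in the background, wait until YARN reports RUNNING, then exit
# instead of sleeping a fixed 5 minutes.
spark-submit "spark parameters here" > submit.log 2>&1 &
submitPid=$!

# In client deploy mode the application id appears in the submit log.
appId=""
while [ -z "$appId" ]; do
  sleep 5
  appId=$(grep -o 'application_[0-9]*_[0-9]*' submit.log | head -n 1)
done

# Poll the application report until the State line reads RUNNING.
while true; do
  state=$(yarn application -status "$appId" 2>/dev/null \
    | grep -E '^[[:space:]]*State' | awk '{print $NF}')
  [ "$state" = "RUNNING" ] && break
  sleep 5
done

kill $submitPid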
I don't know which Spark version you are using or whether you are running in standalone mode, but either way you can use the REST API for submitting/killing your apps. The last time I checked it was pretty much undocumented, but it worked properly.
When you submit an application, you get back a submissionId which you can use later to either query the current state or kill it. The possible states are documented here:
// SUBMITTED: Submitted but not yet scheduled on a worker
// RUNNING: Has been allocated to a worker to run
// FINISHED: Previously ran and exited cleanly
// RELAUNCHING: Exited non-zero or due to worker failure, but has not yet started running again
// UNKNOWN: The state of the driver is temporarily not known due to master failure recovery
// KILLED: A user manually killed this driver
// FAILED: The driver exited non-zero and was not supervised
// ERROR: Unable to run or restart due to an unrecoverable error (e.g. missing jar file)
This is especially useful for long-running apps (e.g. streaming), since you don't have to babysit the shell script.
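For example, against a standalone master's REST endpoint (the host, the default port 6066, and the submission id below are assumptions for illustration):
# query the current state of a submission
curl http://spark-master:6066/v1/submissions/status/driver-20170101000000-0000
# kill the submission
curl -X POST http://spark-master:6066/v1/submissions/kill/driver-20170101000000-0000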

What causes BitBake worker processes to exit unexpectedly?

I have a BitBake build process that runs in a Docker container (CentOS 7). BitBake fails during recipe gcc-cross-i586-5.2.0-r0, task do_compile, on every run I try.
An example of bitbake's output:
NOTE: recipe gcc-cross-i586-5.2.0-r0: task do_compile: Started
ERROR: Worker process (367) exited unexpectedly (-9), shutting down...
ERROR: Worker process (367) exited unexpectedly (-9), shutting down...
ERROR: Worker process (367) exited unexpectedly (-9), shutting down...
ERROR: Worker process (367) exited unexpectedly (-9), shutting down...
NOTE: Tasks Summary: Attempted 1538 tasks of which 17 didn't need to be rerun and all succeeded.
Is this a problem with recipe gcc-cross-i586-5.2.0-r0: task do_compile? Perhaps an out-of-memory error? I don't know what the -9 refers to or how to find out more information about it.
Try:
$ bitbake -c cleansstate gcc-cross ; bitbake -k gcc-cross
How much RAM do you have? Post the error log here.
This worked for me: decrease the number of worker threads by adding the following to your conf/local.conf file (under the build directory):
BB_NUMBER_THREADS = "6"
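If the failures are memory pressure during do_compile, capping make-level parallelism alongside it may help too; PARALLEL_MAKE is the standard local.conf variable for that (the value below is a guess to tune against your RAM):
PARALLEL_MAKE = "-j 4"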
Just a long shot: -9 in kernel land means EBADF (bad file number). Is it possible you have done some operations as root, so that some files are not accessible during the build? Is the issue reproducible, i.e. can you rm -rf tmp and does it happen again? Make sure you don't have any permission issues in your project directory and associated file system(s).
