Dkron job failed to execute at the scheduled time

I'm using dkron 0.10.4. I created scheduled jobs for dkron and they were working fine previously, but suddenly the jobs are not executed at the scheduled time. The output shows
rpc error: code = Unknown desc = exit status 1
and the job status is "False".

You could open dkron.yml in /etc/dkron and add a debug line like
log-level: debug
Then run the dkron agent:
# dkron agent --config /etc/dkron/dkron.yml
You should see debug lines with more information about your rpc error code.
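For reference, a minimal end-to-end sketch of that advice; the systemd unit name and the grep filter are assumptions, not part of the answer:
# Stop any dkron service that is already running so the foreground run below
# can bind to the same ports (the unit name "dkron" is an assumption).
sudo systemctl stop dkron
# Enable debug logging as described above (assumes the key is not already set).
echo 'log-level: debug' | sudo tee -a /etc/dkron/dkron.yml
# Run the agent in the foreground and watch for the failing job's execution;
# the filter is only a convenience for spotting the rpc error.
sudo dkron agent --config /etc/dkron/dkron.yml 2>&1 | grep -iE 'error|rpc|exit status'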

Related

How to get a basic direvent watcher working?

I have read through the direvent documentation and am trying to get a simple watch working. Since I am having so much trouble with it, I am wondering if the issue has to do with the fact that the system I am using is NixOS.
Here is the simple watcher file, watcher, I've created:
watcher {
    path ./dir;
    command "echo $file";
}
I run it in the foreground, so I can see the output, with direvent --foreground watcher. Once it's running, I create a file in dir, thus creating an event for it to respond to. However, it fails with the following output:
$ direvent --foreground watcher
direvent: [INFO] direvent 5.2 started
direvent: [ERROR] process 8552 failed with status 127
direvent: [ERROR] process 8555 failed with status 127
direvent: [ERROR] process 8557 failed with status 127
Since 127 usually means 'command not found', I tried specifying the path to echo, i.e. running this watcher instead:
watcher {
    path ./dir;
    command "/run/current-system/sw/bin/echo $file";
}
Then the output still gives an error, albeit a different one:
$ direvent --foreground watcher
direvent: [INFO] direvent 5.2 started
direvent: [ERROR] process 8645 failed with status 1
direvent: [ERROR] process 8651 failed with status 1
direvent: [ERROR] process 8652 failed with status 1
So the failure is now status 1, and I am not sure what to try next. I'm wondering if this issue is due to the fact that I am running NixOS. Does anyone know what I might try to get direvent working?
direvent has two other flags that may be useful for you.
--debug (-d) gives extra information.
There's also --lint (-t), which checks the configuration file for errors, but I suspect this isn't your issue if direvent is running.
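Putting those together, a minimal sketch of the suggested invocations; repeating -d to raise the debug level is an assumption based on the linked manual:
# Check the watcher configuration file for errors first (--lint / -t).
direvent --lint watcher
# Then run in the foreground with extra debug output.
direvent --foreground -d -d watcher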
Source: https://www.gnu.org.ua/software/direvent/manual/direvent.html

I get the following error on the "Confirm Hosts" step in Ambari (creating cluster):

The Registration log contains the following error:
==========================
Running setup agent script...
==========================
Command start time 2017-11-26 12:05:31
  File "/usr/lib/python2.6/site-packages/ambari_agent/main.py", line 355, in main
    (retries, connected, stopped) = netutil.try_to_connect(server_url, MAX_RETRIES, logger)
UnboundLocalError: local variable 'server_url' referenced before assignment
INFO 2017-11-26 12:00:19,304 ExitHelper.py:53 - Performing cleanup before exiting...
I tried it a few times from scratch and I get the same result. Any ideas on how to fix this?

Stopping a job in Spark

I'm using Spark version 1.3. I have a job that's taking forever to finish.
To fix it, I made some optimizations to the code, and started the job again. Unfortunately, I launched the optimized code before stopping the earlier version, and now I cannot stop the earlier job.
Here are the things I've tried to kill this app:
Through the Web UI
result: The Spark UI has no "kill" option for apps (I'm assuming spark.ui.killEnabled has not been enabled; I'm not the owner of this cluster).
Through the command line: spark-class org.apache.spark.deploy.Client kill mymasterURL app-XXX
result: I get this message:
Driver app-XXX has already finished or does not exist
But I see in the web UI that it is still running, and the resources are still occupied.
Through the command line via spark-submit: spark-submit --master mymasterURL --deploy-mode cluster --kill app-XXX
result: I get this error:
Error: Killing submissions is only supported in standalone mode!
I tried to retrieve the Spark context to stop it (via SparkContext.stop() or cancelAllJobs()), but have been unsuccessful, as .getOrCreate is not available in 1.3. I have not been able to retrieve the Spark context of the initial app.
I'd appreciate any ideas!
Edit: I've also tried killing the app through yarn by executing: yarn application -kill app-XXX
result: I got this error:
Exception in thread "main" java.lang.IllegalArgumentException:
Invalid ApplicationId prefix: app-XX. The valid ApplicationId should
start with prefix application
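Aside, for illustration: the error above means yarn application -kill expects YARN's own id format, application_<clusterTimestamp>_<sequence>, rather than Spark's app-XXX id. Assuming the app really was submitted to YARN, the lookup would be something like the following (placeholder id):
# Find the YARN-side id of the running application.
yarn application -list -appStates RUNNING
# Kill it using the full YARN application id.
yarn application -kill application_1448000000000_0001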

Why is Spark application's final status FAILED while it finishes successfully?

My Spark 2.0.0 application runs on YARN 2.7.2. It finishes successfully, but YARN marks it as failed with this error:
Final app status: FAILED, exitCode: 16, (reason: Shutdown hook called before final status was reported.)
I see no errors on the executors or the driver, and the application writes the data it is supposed to.
This seems to be caused by calling System.exit(0) explicitly in my code. After removing it, the problem is gone.

How to know if app is in RUNNING state to kill spark-submit process?

I am creating a shell script which will be executed from Jenkins, because we have many streaming jobs and it seems easier to manage them from Jenkins. So I have created the script below.
#!/bin/bash
spark-submit "spark parameters here" > /dev/null 2>&1 &
processId=$!
echo $processId
sleep 5m
kill $processId
If I don't have a sleep, the spark-submit process is killed immediately and no Spark application is submitted. If there is a sleep, the spark-submit process gets enough time to submit the Spark application.
My question is, is there a better way to know if the spark application is in RUNNING state so that the spark-submit process can be killed ?
Spark 1.6.0 with YARN
You should spark-submit your Spark application and use yarn application -status <ApplicationId>, as described in the application section of the YARN commands documentation:
Prints the status of the application.
You could get <ApplicationId> from the logs of spark-submit (in client deploy mode) or use yarn application -list -appType SPARK -appStates RUNNING.
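Building on that, a rough sketch of how the Jenkins-side script could wait for the RUNNING state instead of sleeping blindly. It assumes cluster deploy mode (so the local spark-submit process is only a launcher); the log file name, grep patterns and polling intervals are assumptions:
#!/bin/bash
# Sketch only: launch spark-submit in the background and capture its output.
spark-submit "spark parameters here" > spark-submit.log 2>&1 &
submitPid=$!

# Wait for the YARN application id to show up in the launcher output.
appId=""
while [ -z "$appId" ]; do
  sleep 5
  appId=$(grep -oE 'application_[0-9]+_[0-9]+' spark-submit.log | head -n 1)
done

# Poll YARN until the application reports RUNNING; the output parsing may
# need adjusting for your YARN version.
state=""
until [ "$state" = "RUNNING" ]; do
  sleep 10
  state=$(yarn application -status "$appId" 2>/dev/null | grep 'State :' | grep -v 'Final-State' | awk '{print $3}')
done

# The launcher process is no longer needed once the application is running.
kill "$submitPid"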
I don't know what Spark version you are using or if you are running in standalone mode, but anyway, you can use the REST API for submitting/killing your apps. The last time I checked it was pretty much undocumented, but it worked properly.
When you submit an application, you will get a submissionId which you can use later for either getting the current state or killing it. The possible states are documented here:
// SUBMITTED: Submitted but not yet scheduled on a worker
// RUNNING: Has been allocated to a worker to run
// FINISHED: Previously ran and exited cleanly
// RELAUNCHING: Exited non-zero or due to worker failure, but has not yet started running again
// UNKNOWN: The state of the driver is temporarily not known due to master failure recovery
// KILLED: A user manually killed this driver
// FAILED: The driver exited non-zero and was not supervised
// ERROR: Unable to run or restart due to an unrecoverable error (e.g. missing jar file)
This is especially useful for long-running apps (e.g. streaming), since you don't have to babysit the shell script.
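For example, against a standalone master's REST endpoint (port 6066 by default), the status and kill calls might look like this; the host, port, submission id, and exact endpoint paths are assumptions based on how this mostly undocumented API is commonly used:
# Query the current state of a submission by its submissionId.
curl http://mymasterURL:6066/v1/submissions/status/driver-20171126120531-0001
# Ask the master to kill that submission.
curl -X POST http://mymasterURL:6066/v1/submissions/kill/driver-20171126120531-0001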
