I'm using dkron 0.10.4. I created scheduled jobs in dkron and they were working fine previously, but suddenly the jobs are no longer executed at their scheduled times. The output shows
rpc error: code = Unknown desc = exit status 1
and the status is "False".
You could open dkron.yml in /etc/dkron and add a debug line like
log-level: debug
Then run the dkron agent:
# dkron agent --config /etc/dkron/dkron.yml
You should see debug lines with more information about your rpc error code.
I have read through the direvent documentation and am trying to get a simple watch working. Since I am having so much trouble with it, I am wondering if the issue has to do with the fact that the system I am using is NixOS.
Here is the simple watcher file, watcher, I've created:
watcher {
    path ./dir;
    command "echo $file";
}
I run it in the foreground, so I can see the output, with direvent --foreground watcher. Once it's running, I create a file in dir, thus creating an event for it to respond to. However, it fails with the following output:
$ direvent --foreground watcher
direvent: [INFO] direvent 5.2 started
direvent: [ERROR] process 8552 failed with status 127
direvent: [ERROR] process 8555 failed with status 127
direvent: [ERROR] process 8557 failed with status 127
Since 127 usually means 'command not found', I tried specifying the path to echo, i.e. running this watcher instead:
watcher {
    path ./dir;
    command "/run/current-system/sw/bin/echo $file";
}
Then the output still gives an error, albeit a different one:
$ direvent --foreground watcher
direvent: [INFO] direvent 5.2 started
direvent: [ERROR] process 8645 failed with status 1
direvent: [ERROR] process 8651 failed with status 1
direvent: [ERROR] process 8652 failed with status 1
So the failure is now with status 1. I am not sure what to try next. I'm wondering if this issue is due to the fact that I am running NixOS. Does anyone know what I might try next to get direvent working?
direvent has two other flags that may be useful for you.
--debug (-d) gives extra information.
There's also --lint (-t), which checks the configuration file for errors, but I suspect this isn't your issue if direvent is running.
Source: https://www.gnu.org.ua/software/direvent/manual/direvent.html
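For example, an invocation using those flags might look like this (a sketch; watcher is the configuration file from the question):
# Check the configuration file for errors and exit
direvent --lint watcher
# Run in the foreground with extra debugging output
direvent --foreground --debug watcher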
The Registration log contains the following error:
==========================
Running setup agent script...
==========================
Command start time 2017-11-26 12:05:31
File "/usr/lib/python2.6/site-packages/ambari_agent/main.py", line 355, in main
    (retries, connected, stopped) = netutil.try_to_connect(server_url, MAX_RETRIES, logger)
UnboundLocalError: local variable 'server_url' referenced before assignment
INFO 2017-11-26 12:00:19,304 ExitHelper.py:53 - Performing cleanup before exiting...
I tried it a few times from scratch and I get the same result. Any ideas on how to fix this?
I'm using Spark version 1.3. I have a job that's taking forever to finish.
To fix it, I made some optimizations to the code, and started the job again. Unfortunately, I launched the optimized code before stopping the earlier version, and now I cannot stop the earlier job.
Here are the things I've tried to kill this app:
Through the Web UI
result: The Spark UI has no "kill" option for apps (I'm assuming spark.ui.killEnabled has not been enabled; I'm not the owner of this cluster).
Through the command line: spark-class org.apache.spark.deploy.Client kill mymasterURL app-XXX
result: I get this message:
Driver app-XXX has already finished or does not exist
But I see in the web UI that it is still running, and the resources are still occupied.
Through the command line via spark-submit: spark-submit --master mymasterURL --deploy-mode cluster --kill app-XXX
result: I get this error:
Error: Killing submissions is only supported in standalone mode!
I tried to retrieve the Spark context to stop it (via SparkContext.stop() or cancelAllJobs()), but have been unsuccessful, as getOrCreate is not available in 1.3. I have not been able to retrieve the Spark context of the initial app.
I'd appreciate any ideas!
Edit: I've also tried killing the app through YARN by executing: yarn application -kill app-XXX
result: I got this error:
Exception in thread "main" java.lang.IllegalArgumentException:
Invalid ApplicationId prefix: app-XX. The valid ApplicationId should
start with prefix application
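As the error suggests, YARN expects its own id format (application_<clusterTimestamp>_<sequence>) rather than Spark's app-XXX. A sketch, with a made-up id:
# Find the YARN id of the running application
yarn application -list
# Kill it using the YARN-style id
yarn application -kill application_1511700000000_0001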
My Spark 2.0.0 application runs on YARN 2.7.2. It finishes successfully, but YARN marks it as failed with the error:
Final app status: FAILED, exitCode: 16, (reason: Shutdown hook called before final status was reported.)
I see no errors on the executors or the driver, and the application writes the data it is supposed to.
This seems to be caused by calling System.exit(0) in my code. After removing it, the problem is gone.
I am creating a shell script which will be executed from Jenkins, because we have many streaming jobs and it seems easier to manage them from Jenkins. So I have created the below script.
#!/bin/bash
# Start spark-submit in the background and capture its PID
spark-submit "spark parameters here" > /dev/null 2>&1 &
processId=$!
echo $processId
# Give spark-submit time to submit the application before killing the process
sleep 5m
kill $processId
If I don't have a sleep, the spark-submit process is killed immediately and no spark application is submitted. And if there is a sleep the spark-submit process gets enough time to submit the spark application.
My question is: is there a better way to know whether the Spark application is in the RUNNING state, so that the spark-submit process can be killed?
Spark 1.6.0 with YARN
You should spark-submit your Spark application and use yarn application -status <ApplicationId> as described in the application section of the YARN commands documentation:
Prints the status of the application.
You could get <ApplicationId> from the logs of spark-submit (in client deploy mode) or use yarn application -list -appType SPARK -appStates RUNNING.
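A sketch of how that could look in the shell script (the log-parsing pattern, file name, and sleep intervals are assumptions; adjust them to your setup, and add a timeout so a failed submission doesn't loop forever):
#!/bin/bash
# Submit in the background, keeping the output so we can find the YARN id
spark-submit "spark parameters here" > submit.log 2>&1 &

# Wait until the application id shows up in the spark-submit output
appId=""
while [ -z "$appId" ]; do
  sleep 5
  appId=$(grep -o 'application_[0-9]*_[0-9]*' submit.log | head -1)
done

# Poll yarn application -status until the reported state is RUNNING
state=""
while [ "$state" != "RUNNING" ]; do
  sleep 5
  state=$(yarn application -status "$appId" | grep -v 'Final-State' | grep 'State' | awk '{print $3}')
done
echo "$appId is RUNNING"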
I don't know what Spark version you are using or whether you are running in standalone mode, but either way you can use the REST API for submitting/killing your apps. The last time I checked it was pretty much undocumented, but it worked properly.
When you submit an application, you will get a submissionId which you can use later for either getting the current state or killing it. The possible states are documented in the Spark source:
// SUBMITTED: Submitted but not yet scheduled on a worker
// RUNNING: Has been allocated to a worker to run
// FINISHED: Previously ran and exited cleanly
// RELAUNCHING: Exited non-zero or due to worker failure, but has not yet started running again
// UNKNOWN: The state of the driver is temporarily not known due to master failure recovery
// KILLED: A user manually killed this driver
// FAILED: The driver exited non-zero and was not supervised
// ERROR: Unable to run or restart due to an unrecoverable error (e.g. missing jar file)
This is especially useful for long-running apps (e.g. streaming), since you don't have to babysit the shell script.
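A sketch of what those calls could look like against a standalone master (the host and submissionId are placeholders; 6066 is the default port of the standalone REST submission server):
# Check the current state of a submission
curl http://master-host:6066/v1/submissions/status/driver-20170101000000-0000

# Kill a submission
curl -X POST http://master-host:6066/v1/submissions/kill/driver-20170101000000-0000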