Is there a way to set a timeout for a step in Amazon AWS EMR?
I'm running a batch Apache Spark job on EMR and I would like the job to stop with a timeout if it doesn't finish within 3 hours.
I cannot find a way to set a timeout in Spark, in Yarn, or in the EMR configuration.
Thanks for your help!
I would like to offer an alternative approach that avoids any timeout/shutdown logic in the application itself, which would make it more complex than needed. Although I am obviously quite late to the party, maybe it proves useful for someone in the future.
You can:
write a Python script and use it as a wrapper around regular Yarn commands
execute those Yarn commands via subprocess lib
parse their output according to your will
decide which Yarn applications should be killed
More details about what I am talking about follow...
Python wrapper script and running the Yarn commands via subprocess lib
import subprocess
running_apps = subprocess.check_output(['yarn', 'application', '--list', '--appStates', 'RUNNING'], universal_newlines=True)
This snippet would give you output similar to this:
Total number of applications (application-types: [] and states: [RUNNING]):1
Application-Id Application-Name Application-Type User Queue State Final-State Progress Tracking-URL
application_1554703852869_0066 HIVE-645b9a64-cb51-471b-9a98-85649ee4b86f TEZ hadoop default RUNNING UNDEFINED 0% http://ip-xx-xxx-xxx-xx.eu-west-1.compute.internal:45941/ui/
You can then parse this output (beware, there might be more than one app running) and extract the application-id values.
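As a minimal sketch of that parsing step (it relies on the fact that, in the sample listing above, data rows start with "application_"; the helper name is my own):

```python
def extract_app_ids(yarn_list_output):
    """Extract application ids from 'yarn application --list' output."""
    app_ids = []
    for line in yarn_list_output.splitlines():
        line = line.strip()
        # Data rows begin with the id, e.g. application_1554703852869_0066
        if line.startswith('application_'):
            app_ids.append(line.split()[0])
    return app_ids
```

Header and summary lines are skipped automatically, since only the data rows start with the application-id prefix.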
Then, for each of those application ids, you can invoke another yarn command to get more details about the specific application:
app_status_string = subprocess.check_output(['yarn', 'application', '--status', app_id], universal_newlines=True)
Output of this command should be something like this:
Application Report :
Application-Id : application_1554703852869_0070
Application-Name : com.organization.YourApp
Application-Type : HIVE
User : hadoop
Queue : default
Application Priority : 0
Start-Time : 1554718311926
Finish-Time : 0
Progress : 10%
State : RUNNING
Final-State : UNDEFINED
Tracking-URL : http://ip-xx-xxx-xxx-xx.eu-west-1.compute.internal:40817
RPC Port : 36203
AM Host : ip-xx-xxx-xxx-xx.eu-west-1.compute.internal
Aggregate Resource Allocation : 51134436 MB-seconds, 9284 vcore-seconds
Aggregate Resource Preempted : 0 MB-seconds, 0 vcore-seconds
Log Aggregation Status : NOT_START
Diagnostics :
Unmanaged Application : false
Application Node Label Expression : <Not set>
AM container Node Label Expression : CORE
Having this, you can also extract the application's start time, compare it with the current time, and see how long it has been running.
If it has been running for longer than some threshold number of minutes (180, for example), you kill it.
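A sketch of that check (assuming the Start-Time line format shown in the report above, with epoch milliseconds; the 180-minute default threshold and the function name are illustrative):

```python
import re
import time

def is_running_too_long(app_status_string, threshold_minutes=180):
    """Check whether a Yarn app has run longer than the threshold.

    Start-Time in the 'yarn application --status' output is epoch time
    in milliseconds, as in the sample report above.
    """
    match = re.search(r'Start-Time\s*:\s*(\d+)', app_status_string)
    if not match:
        return False
    elapsed_ms = time.time() * 1000 - int(match.group(1))
    return elapsed_ms / 60000 > threshold_minutes
```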
How do you kill it?
Easy.
kill_output = subprocess.check_output(['yarn', 'application', '--kill', app_id], universal_newlines=True)
This should be it, from the killing of the step/application perspective.
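Put together, the whole wrapper might look roughly like this (a sketch under the assumptions above; the function name, the 180-minute threshold, and the parsing details are illustrative, not authoritative):

```python
import re
import subprocess
import time

THRESHOLD_MINUTES = 180  # kill anything running for longer than 3 hours

def kill_long_running_apps():
    # List all RUNNING Yarn applications
    listing = subprocess.check_output(
        ['yarn', 'application', '--list', '--appStates', 'RUNNING'],
        universal_newlines=True)
    app_ids = [line.split()[0]
               for line in listing.splitlines()
               if line.strip().startswith('application_')]
    for app_id in app_ids:
        # Fetch the detailed status to get the start time (epoch millis)
        status = subprocess.check_output(
            ['yarn', 'application', '--status', app_id],
            universal_newlines=True)
        match = re.search(r'Start-Time\s*:\s*(\d+)', status)
        if not match:
            continue
        elapsed_minutes = (time.time() * 1000 - int(match.group(1))) / 60000
        if elapsed_minutes > THRESHOLD_MINUTES:
            # Over the threshold - kill it
            subprocess.check_output(
                ['yarn', 'application', '--kill', app_id],
                universal_newlines=True)
```

Calling kill_long_running_apps() once per cron invocation is all the scheduling logic the script needs.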
Automating the approach
AWS EMR has a wonderful feature called "bootstrap actions".
It runs a set of actions on EMR cluster creation and can be utilized for automating this approach.
Add a bash script to the bootstrap actions which is going to:
download the Python script you just wrote to the cluster (master node)
add the Python script to a crontab
That should be it.
P.S.
I assumed Python 3 is at our disposal for this purpose.
Well, as many have already answered, an EMR step cannot be killed/stopped/terminated via an API call at this moment.
But to achieve your goal, you can introduce a timeout as part of your application code itself. When you submit EMR steps, a child process is created to run your application - be it a MapReduce application, a Spark application, etc. - and the step's completion is determined by the exit code this child process (which is your application) returns.
For example, if you are submitting a MapReduce application, you can use something like the below:
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));

final Runnable stuffToDo = new Thread() {
    @Override
    public void run() {
        job.submit();
    }
};

final ExecutorService executor = Executors.newSingleThreadExecutor();
final Future future = executor.submit(stuffToDo);
executor.shutdown(); // This does not cancel the already-scheduled task.

try {
    future.get(180, TimeUnit.MINUTES);
}
catch (InterruptedException ie) {
    /* Handle the interruption. Or ignore it. */
}
catch (ExecutionException ee) {
    /* Handle the error. Or ignore it. */
}
catch (TimeoutException te) {
    /* Handle the timeout. Or ignore it. */
}

System.exit(job.waitForCompletion(true) ? 0 : 1);
Reference - Java: set timeout on a certain block of code?.
Hope this helps.
I am very new to NodeJS and trying to develop an application which acts as a scheduler that fetches data from one ELK stack and sends the processed data to another ELK stack. I am able to achieve the expected behaviour, but after completing all the processing, the scheduler job does not exit and instead waits for the next scheduled job to come up.
Note: This scheduler runs every 3 minutes.
job.js
const self = module.exports = {
    async schedule() {
        if (process.env.SCHEDULER == "MinuteFrequency") {
            var timenow = moment().seconds(0).milliseconds(0).valueOf();
            var endtime = timenow - 60000;
            var starttime = endtime - 60000 * 3;
            // sendData is an async method
            reports.sendData(starttime, endtime, "SCHEDULER");
        }
    }
}
I tried various solutions such as Promise.allSettled(...), Promise.resolve(true), etc., but was not able to fix this.
As per my requirement, I want the scheduler to complete its processing and exit so that I can save some resources, as I am planning to deploy the application using Kubernetes cron jobs.
When all your work is done, you can call process.exit() to cause your application to exit.
In this particular code, you may need to know when reports.sendData() is actually done before exiting. We would have to see the code for sendData() to know when it is done. Just because it's an async function doesn't mean it's written properly to return a promise that resolves when it's done. If you want further help, show us the code for sendData() and any code that it calls too.
I am trying to deploy to a list of servers in parallel to save some time. The names of the servers are listed in a collection: serverNames
The original code was:
serverNames.each({
    def server = new Server([steps: steps, hostname: it, domain: "test"])
    server.stopTomcat()
    server.ssh("rm -rf ${WEB_APPS_DIR}/pc*")
    PLMScriptUtils.secureCopy(steps, warFileLocation, it, WEB_APPS_DIR)
})
Basically I want to stop Tomcat, remove some files, and then copy a war file to a location using the following lines:
server.stopTomcat()
server.ssh("rm -rf ${WEB_APPS_DIR}/pc*")
PLMScriptUtils.secureCopy(steps, warFileLocation, it, WEB_APPS_DIR)
The original code was working properly: it took one server at a time from the collection serverNames and performed the 3 lines to do the deployment.
But now I have a requirement to run the deployment to the servers listed in serverNames in parallel.
Below is my new modified code:
def threads = []
def th
serverNames.each({
def server = new Server([steps: steps, hostname: it, domain: "test"])
th = new Thread({
steps.echo "doing deployment"
server.stopTomcat()
server.ssh("rm -rf ${WEB_APPS_DIR}/pc*")
PLMScriptUtils.secureCopy(steps, warFileLocation, it, WEB_APPS_DIR)
})
threads << th
})
threads.each {
steps.echo "joining thread"
it.join()
}
threads.each {
steps.echo "starting thread"
it.start()
}
The echo statements were added to visualize the flow.
With this the output is coming as:
joining thread
joining thread
joining thread
joining thread
starting thread
starting thread
starting thread
starting thread
The number of servers in the collection was 4, hence the thread is being added and started 4 times. But it is not executing the 3 lines I want to run in parallel, which means "doing deployment" is not being printed at all, and later the build fails with an exception.
Note that I am running this Groovy code as a pipeline through Jenkins. This whole piece of code is actually a function called deploy of the class deployment, and my pipeline in Jenkins creates an object of the class deployment and then calls the deploy function.
Can anyone help me with this? I am stuck like hell with this one. :-(
Have a look at the parallel step. In scripted pipelines (which you seem to be using), you can pass it a map of thread name to action (as a Groovy closure) which is then run in parallel.
deployActions = [
    Server1: {
        // stop tomcat etc.
    },
    Server2: {
        ...
    }
]
parallel deployActions
It is much simpler and the recommended way of doing it.
The issue is I need to assign the output of a Linux command to a Chef attribute, but I am unable to do it.
I am using the below code and not getting the result. Kindly help with what I'm missing.
ruby_block "something" do
  block do
    Chef::Resource::RubyBlock.send(:include, Chef::Mixin::ShellOut)
    node.default['foo'] = shell_out("echo Hello world").stdout
  end
  action :create
end

log "demo" do
  message lazy { node['foo'] }
end
Below is the Run logs:
Starting Chef Client, version 13.9.1
resolving cookbooks for run list: ["sample_repo"]
Synchronizing Cookbooks:
- sample_repo (0.1.4)
Installing Cookbook Gems:
Compiling Cookbooks...
Converging 2 resources
Recipe: sample_repo::default
* ruby_block[something] action create
- execute the ruby block something
* log[demo] action write
Running handlers:
Running handlers complete
Chef Client finished, 2/2 resources updated in 02 seconds
Thanks in advance
Your code is fine; the log message is not showing because the default level on the log resource is :info, and by default chef-client doesn't show info-level log messages when run interactively. That said, this kind of code where you store stuff in node attributes is very brittle and probably shouldn't be used unless specifically needed. Better is to do this:
message lazy { shell_out("echo Hello world").stdout }
Also, you don't need any funky mutating include stuff like you have there, AFAIK; the shell_out helpers are available in most contexts by default. And you should usually use shell_out!() rather than shell_out(): the ! version automatically raises an exception if the command fails. Unless you specifically want to allow a failed command, use the ! version.
I'm attempting to use the Jenkins Job DSL plugin for the first time to create some basic job "templates" before getting into more complex stuff.
Jenkins is running on a Windows 2012 server. The Jenkins version is 1.650 and we are using the Job DSL plugin version 1.51.
Ideally what I would like is for the seed job to be parameterised so that when it is being run the user can enter four things: the Job DSL script location, the name of the generated job, a Slack channel for failure notifications, and an email address for failure notifications.
The first two are fine: I can call the parameters in the groovy script, for example the script understands job("${JOB_NAME}") and takes the name I enter for the job when I run the seed job.
However when I try to do the same thing with a Slack channel the groovy script doesn't seem to want to play. Note that if I specify a Slack channel rather than trying to call a parameter it works fine.
My Job DSL script is here:
job("${JOB_NAME}") {
    triggers {
        cron("@daily")
    }
    steps {
        shell("echo 'Hello World'")
    }
    publishers {
        slackNotifier {
            room("${SLACK_CHANNEL}")
            notifyAborted(true)
            notifyFailure(true)
            notifyNotBuilt(false)
            notifyUnstable(true)
            notifyBackToNormal(true)
            notifySuccess(false)
            notifyRepeatedFailure(false)
            startNotification(false)
            includeTestSummary(false)
            includeCustomMessage(false)
            customMessage(null)
            buildServerUrl(null)
            sendAs(null)
            commitInfoChoice('NONE')
            teamDomain(null)
            authToken(null)
        }
    }
    logRotator {
        numToKeep(3)
        artifactNumToKeep(3)
        publishers {
            extendedEmail {
                recipientList('me@mydomain.com')
                defaultSubject('Seed job failed')
                defaultContent('Something broken')
                contentType('text/html')
                triggers {
                    failure()
                    fixed()
                    unstable()
                    stillUnstable {
                        subject('Subject')
                        content('Body')
                        sendTo {
                            developers()
                            requester()
                            culprits()
                        }
                    }
                }
            }
        }
    }
}
But starting the seed job fails and gives me this output:
Started by user
Building on master in workspace D:\data\jenkins\workspace\tutorial-job-dsl-2
Disk space threshold is set to :5Gb
Checking disk space Now
Total Disk Space Available is: 28Gb
Node Name: master
Running Prebuild steps
Processing DSL script jobBuilder.groovy
ERROR: (jobBuilder.groovy, line 10) No signature of method: javaposse.jobdsl.plugin.structs.DescribableContext.room() is applicable for argument types: (org.codehaus.groovy.runtime.GStringImpl) values: [#dev]
Possible solutions: wait(), find(), dump(), grep(), any(), wait(long)
[BFA] Scanning build for known causes...
[BFA] No failure causes found
[BFA] Done. 0s
Started calculate disk usage of build
Finished Calculation of disk usage of build in 0 seconds
Started calculate disk usage of workspace
Finished Calculation of disk usage of workspace in 0 seconds
Finished: FAILURE
This is the first time I have tried to do anything with Groovy and I'm sure it's a basic error but would appreciate any help.
Hm, that's a bug in Job DSL, see JENKINS-39153.
You actually do not need to use the template string syntax "${FOO}" if you just want to use the value of FOO. All parameters are string variables which can be used directly:
job(JOB_NAME) {
    // ...
    publishers {
        slackNotifier {
            room(SLACK_CHANNEL)
            notifyAborted(true)
            notifyFailure(true)
            notifyNotBuilt(false)
            notifyUnstable(true)
            notifyBackToNormal(true)
            notifySuccess(false)
            notifyRepeatedFailure(false)
            startNotification(false)
            includeTestSummary(false)
            includeCustomMessage(false)
            customMessage(null)
            buildServerUrl(null)
            sendAs(null)
            commitInfoChoice('NONE')
            teamDomain(null)
            authToken(null)
        }
    }
    // ...
}
This syntax is more concise and does not trigger the bug.
I have a MultiJob Project (made with the Jenkins Multijob plugin), with a series of MultiJob Phases. Let's say one of these jobs is called SubJob01. The jobs that are built are each configured with the "Restrict where this project can be run" option to be tied to one node. SubJob01 is tied to Slave01.
I would like it if these jobs would fail fast when the node is offline, instead of saying "(pending—slave01 is offline)". Specifically, I want there to be a record of the build attempt in SubJob01, with the build being marked as failed. This way, I can configure my MultiJob project to handle the situation as I'd like, instead of using the Jenkins build timeout plugin to abort the whole thing.
Does anyone know of a way to fail-fast a build if all nodes are offline? I could intersperse the MultiJob project with system Groovy scripts to check whether the desired nodes are offline, but that seems like it'd be reinventing, in the wrong place, what should already be a feature.
I ended up creating this solution which has worked well. The first build step of SubJob01 is an Execute system Groovy script, and this is the script:
import java.util.regex.Matcher
import java.util.regex.Pattern

int exitcode = 0
println("Looking for Offline Slaves:");
for (slave in hudson.model.Hudson.instance.slaves) {
    if (slave.getComputer().isOffline()) {
        println(' * Slave ' + slave.name + " is offline!");
        if (slave.name == "Slave01") {
            println(' !!!! This is Slave01 !!!!');
            exitcode++;
        } // if slave.name
    } // if slave offline
} // for slave in slaves
println("\n\n");

println "Slave01 is offline: " + hudson.model.Hudson.instance.getNode("Slave01").getComputer().isOffline().toString();
println("\n\n");

if (exitcode > 0) {
    println("The Slave01 slave is offline - we can not possibly continue....");
    println("Please contact IT to resolve the slave down issue before retrying the build.");
    return 1;
} // if
println("\n\n");
The Jenkins pipeline option 'beforeAgent true' can be used so that the when condition is evaluated before entering the agent.
stage('Windows') {
    when {
        beforeAgent true
        expression { return ("${TARGET_NODES}".contains("windows")) }
    }
    agent { label 'win10' }
    steps {
        cleanWs()
        ...
    }
}
Ref:
https://www.jenkins.io/doc/book/pipeline/syntax/
https://www.jenkins.io/blog/2018/04/09/whats-in-declarative/