Spark StreamingContext awaitTerminationOrTimeout - apache-spark

I'm calling streamingContext.awaitTerminationOrTimeout(timeout), but I want to make timeout environment dependent.
This means that I want to stop my job if my environment is UAT, but I don't want it to time out at all if my environment is production.
I have checked the documentation here, but I can't find any references to whether passing 0 or -1 as timeout will just make the job run forever (or until it fails).
Are there any possible timeout values that could do the trick? Or any other alternatives that don't imply calling awaitTermination for my production environment and awaitTerminationOrTimeout for my UAT environment?

After taking a look at the library, I can see that any negative value will do.
streamingContext.awaitTerminationOrTimeout(timeout) internally calls ContextWaiter.waitForStopOrError(timeout).
Inside waitForStopOrError there is a check: if the timeout is lower than zero, it simply waits until the context is stopped or an error is reported, with no deadline.
TL;DR: -1 will never let the job time out.
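Based on that, a minimal sketch of making the timeout environment-dependent might look like the following; the ENVIRONMENT variable and the one-hour UAT budget are assumptions, and ssc is an already-configured and started StreamingContext:

    import os

    def await_with_env_timeout(ssc):
        # Assumed convention: an ENVIRONMENT variable distinguishes UAT from production,
        # and UAT gets a one-hour budget. Adjust both to taste.
        timeout = 60 * 60 if os.environ.get("ENVIRONMENT") == "UAT" else -1
        # Per the ContextWaiter behaviour above, a negative timeout waits until
        # the context stops or an error is reported.
        stopped = ssc.awaitTerminationOrTimeout(timeout)
        if not stopped:
            # Only reachable in UAT: the window elapsed, so shut the job down.
            ssc.stop()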

Related

How can I pass arguments to a program at runtime?

For example, I have a script that runs in a loop and makes a request to the API every 10 minutes. I want to change the parameters of the next request without stopping it.
The command line is blocked while the script is running, so I can't pass new arguments that way.
Reading arguments from an environment file or a database doesn't seem safe to me: while the values are being changed, the script may read them before they are fully written and cause errors.
Any thoughts?
Thanks
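One hedged sketch of the file-based approach the question worries about: re-read a small parameter file at the top of each cycle, and have the writer replace it atomically so the reader never sees a half-written file. The 10-minute cycle comes from the question; the file name, the JSON format, and make_api_request are assumptions:

    import json
    import os
    import tempfile
    import time

    PARAMS_PATH = "params.json"            # assumed parameter file

    def write_params(params):
        # Writer side: write a temp file in the same directory, then swap it in
        # atomically so a concurrent reader never sees a partial file.
        fd, tmp_path = tempfile.mkstemp(dir=os.path.dirname(PARAMS_PATH) or ".")
        with os.fdopen(fd, "w") as f:
            json.dump(params, f)
        os.replace(tmp_path, PARAMS_PATH)  # atomic replacement

    def read_params():
        with open(PARAMS_PATH) as f:
            return json.load(f)

    if __name__ == "__main__":
        while True:
            params = read_params()         # picks up any change made since the last cycle
            # make_api_request(params)     # placeholder for the actual request
            time.sleep(10 * 60)            # the 10-minute cycle from the question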

Threshold for allowed amount of failed Hyperdrive runs

Because "reasons", we know that when we use azureml-sdk's HyperDriveStep we expect a number of HyperDrive runs to fail -- normally around 20%. How can we handle this without failing the entire HyperDriveStep (and then all downstream steps)? Below is an example of the pipeline.
I thought there would be a HyperDriveRunConfig param to allow for this, but it doesn't seem to exist. Perhaps this is controlled on the Pipeline itself with the continue_on_step_failure param?
The workaround we're considering is to catch the failed run within our train.py script and manually log the primary_metric as zero.
Thanks for your question.
I'm assuming that HyperDriveStep is one of the steps in your Pipeline and that you want the remaining Pipeline steps to continue when the HyperDriveStep fails; is that correct?
Enabling continue_on_step_failure should allow the rest of the pipeline steps to continue when any single step fails.
Additionally, the HyperDrive run consists of multiple child runs, controlled by the HyperDriveConfig. If the first 3 child runs explored by HyperDrive fail (e.g. with user script errors), the system automatically cancels the entire HyperDrive run, in order to avoid further wasting resources.
Are you looking to continue other Pipeline steps when the HyperDriveStep fails? Or are you looking to continue other child runs within the HyperDrive run when the first 3 child runs fail?
Thanks!
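A hedged sketch of the train.py workaround mentioned above; the metric name "accuracy" and the train_and_evaluate helper are placeholders, not anything prescribed by the SDK:

    # train.py -- if training fails, log a sentinel primary metric instead of
    # letting the child run fail outright.
    from azureml.core import Run

    run = Run.get_context()

    def train_and_evaluate():
        # Placeholder: replace with the real training and evaluation code.
        raise NotImplementedError

    try:
        accuracy = train_and_evaluate()
    except Exception as exc:
        run.log("accuracy", 0.0)            # stands in for the real primary_metric_name
        run.log("failure_reason", str(exc))
    else:
        run.log("accuracy", accuracy)

On the pipeline side, passing continue_on_step_failure=True when submitting the pipeline (e.g. pipeline.submit(experiment_name, continue_on_step_failure=True)) is what lets the remaining steps run if the HyperDriveStep itself still fails.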

Why in kubernetes cron job two jobs might be created, or no job might be created?

The k8s Cron Job Limitations documentation mentions that there is no guarantee that a job will be executed exactly once:
A cron job creates a job object about once per execution time of its schedule. We say "about" because there are certain circumstances where two jobs might be created, or no job might be created. We attempt to make these rare, but do not completely prevent them. Therefore, jobs should be idempotent.
Could anyone explain:
Why could this happen?
What are the probabilities/statistics of this happening?
Will it be fixed in some reasonable future in k8s?
Are there any workarounds to prevent such behavior (if the running job can't be implemented as idempotent)?
Do other cron-related services suffer from the same issue? Maybe it is a core cron problem?
The controller:
https://github.com/kubernetes/kubernetes/blob/master/pkg/controller/cronjob/cronjob_controller.go
starts with a comment that lays the groundwork for an explanation:
I did not use watch or expectations. Those add a lot of corner cases, and we aren't expecting a large volume of jobs or scheduledJobs. (We are favoring correctness over scalability.)
If we find a single controller thread is too slow because there are a lot of Jobs or CronJobs, we can parallelize by Namespace. If we find the load on the API server is too high, we can use a watch and UndeltaStore.
Just periodically list jobs and SJs, and then reconcile them.
Periodically means every 10 seconds:
https://github.com/kubernetes/kubernetes/blob/master/pkg/controller/cronjob/cronjob_controller.go#L105
The documentation following the quoted limitations also has some useful color on some of the circumstances under which 2 jobs or no jobs may be launched on a particular schedule:
If startingDeadlineSeconds is set to a large value or left unset (the default) and if concurrencyPolicy is set to Allow, the jobs will always run at least once.
Jobs may fail to run if the CronJob controller is not running or broken for a span of time from before the start time of the CronJob to start time plus startingDeadlineSeconds, or if the span covers multiple start times and concurrencyPolicy does not allow concurrency. For example, suppose a cron job is set to start at exactly 08:30:00 and its startingDeadlineSeconds is set to 10, if the CronJob controller happens to be down from 08:29:00 to 08:42:00, the job will not start. Set a longer startingDeadlineSeconds if starting later is better than not starting at all.
At a higher level, solving for exactly-once execution in a distributed system is hard:
https://bravenewgeek.com/you-cannot-have-exactly-once-delivery/
Clocks and time synchronization in a distributed system is also hard:
https://8thlight.com/blog/rylan-dirksen/2013/10/04/synchronization-in-a-distributed-system.html
To the questions:
Why could this happen?
For instance, the node hosting the CronJob controller fails at the time a job is supposed to run.
What are the probabilities/statistics of this happening?
Very unlikely for any given run. For a large enough number of runs, very unlikely to escape having to face this issue.
Will it be fixed in some reasonable future in k8s?
There are no idempotency-related issues under the area/batch label in the k8s repo, so one would guess not.
https://github.com/kubernetes/kubernetes/issues?q=is%3Aopen+is%3Aissue+label%3Aarea%2Fbatch
Are there any workarounds to prevent such behavior (if the running job can't be implemented as idempotent)?
Think more about the specific definition of idempotent, and the particular points in the job where there are commits. For instance, jobs can be made to support more-than-once execution if they save state to staging areas, and then there is an election process to determine whose work wins.
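As a hedged illustration of that staging-then-commit idea (the output path, the hourly schedule slot, and compute_report are all made up for the sketch), a job can key its result by the schedule slot and commit it with an atomic rename, so a duplicate run for the same slot is harmless:

    # Sketch: key each run's output by its schedule slot and commit via atomic
    # rename. A duplicate run for the same slot overwrites the same file ("last
    # writer wins"), so executing twice has the same effect as executing once.
    import os
    import tempfile
    from datetime import datetime, timezone

    OUTPUT_DIR = "/var/data/reports"   # assumed shared output location

    def compute_report():
        return "report body"           # placeholder for the job's real work

    def run_job():
        # Assuming an hourly schedule: truncate to the hour so two runs launched
        # for the same slot collide on the same key.
        slot = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:00")
        final_path = os.path.join(OUTPUT_DIR, "report-%s.json" % slot)

        body = compute_report()

        # Stage the result, then promote it atomically.
        fd, staging_path = tempfile.mkstemp(dir=OUTPUT_DIR)
        with os.fdopen(fd, "w") as f:
            f.write(body)
        os.replace(staging_path, final_path)

    if __name__ == "__main__":
        run_job()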
Do other cron-related services suffer from the same issue? Maybe it is a core cron problem?
Yes, it's a core distributed systems problem.
For most users, the k8s documentation gives perhaps a more precise and nuanced answer than is necessary. If your scheduled job is controlling some critical medical procedure, it's really important to plan for failure cases. If it's just doing some system cleanup, missing a scheduled run doesn't much matter. By definition, nearly all users of k8s CronJobs fall into the latter category.

Timeout a pyspark job

TL;DR
Is there a way to timeout a pyspark job? I want a spark job running in cluster mode to be killed automatically if it runs longer than a pre-specified time.
Longer Version:
The cryptic timeouts listed in the documentation are at most 120s, except for one that is infinite, but that one is only used if spark.dynamicAllocation.enabled is set to true, and by default (I haven't touched any config parameters on this cluster) it is false.
I want to know because I have code that, for a particular pathological input, will run extremely slowly. For expected input the job will terminate in under an hour. Detecting the pathological input is as hard as trying to solve the problem, so I don't have the option of doing clever preprocessing. The details of the code are boring and irrelevant, so I'm going to spare you having to read them =)
I'm using pyspark, so I was going to decorate the function causing the hang-up like this, but it seems that this solution doesn't work in cluster mode. I call my spark code via spark-submit from a bash script, but as far as I know bash "goes to sleep" while the spark job is running and only gets control back once the spark job terminates, so I don't think this is an option.
Actually, the bash thing might be a solution if I did something clever, but I'd have to get the driver id for the job like this, and by now I'm thinking "this is too much thought and typing for something as simple as a timeout, which ought to be built in."
You can set a classic Python alarm (signal.alarm). Then, in the handler function, you can raise an exception or call sys.exit() to finish the driver code. As the driver finishes, YARN kills the whole application.
You can find example usage in documentation: https://docs.python.org/3/library/signal.html#example
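For example, a minimal sketch of that approach, assuming the driver runs on a Unix-like OS (signal.alarm is not available on Windows) and that the two-hour budget is just a placeholder:

    import signal
    import sys

    TIMEOUT_SECONDS = 2 * 60 * 60       # assumed budget; pick your own limit

    def _timeout_handler(signum, frame):
        # Exiting the driver makes YARN tear down the whole application.
        sys.exit("Spark job exceeded %d seconds; aborting." % TIMEOUT_SECONDS)

    def run_spark_job():
        pass                            # placeholder for the real pyspark driver code

    signal.signal(signal.SIGALRM, _timeout_handler)
    signal.alarm(TIMEOUT_SECONDS)       # schedule SIGALRM after the budget elapses
    try:
        run_spark_job()
    finally:
        signal.alarm(0)                 # cancel the alarm if we finished in time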

How to find the time when a Puppet manifest is executed

I'm wondering if anyone knows a good way to get the date and time when a portion of code in a Puppet manifest is actually executed. Sometimes my manifests take a long time to run, and I need to schedule a task to occur soon after the end of the run, no matter when that occurs.
I have tried the time() function, setting a variable using generate() (using the date function on the Puppet master), and even creating a custom fact, but everything I've tried gets evaluated when the manifests are parsed on the server, rather than when they actually execute on the client.
Any ideas? The clients are all Windows, FWIW.
Thanks in advance!
I am not sure I understand what you mean, but you can't get this information during catalog compilation (obviously), so you can't use it to change the way the catalog will be applied.
If you need to trigger another process on the same host, then you should use any IPC mechanism you have available. You can exec anything, and have it happen just after any other resource is applied, so it is just a matter of finding the proper command.
