I'm trying to spread out my Puppet agents' checkins, to avoid thundering herds and such. The timing settings don't seem to work, though, or at least they don't work as I expect.
In /etc/puppet/puppet.conf, I have (among other lines) these:
[agent]
server = myforemanserver.myorg.org
report = true
runinterval = 25m
splaylimit = 10m
splay = true
The intent of the above lines is to stagger the reports, so that the agent checks in every 25-35 minutes (some random value therein). The splay and splaylimit settings don't seem to be honored, though; the servers on which I've installed this new config are just checking in every 25 minutes exactly. (Since it's checking in every 25 minutes, not 30, I know it's read this new configuration; previously, there was no runinterval, or splay, specified.)
This is Puppet open-source, version 3.8.4, running as a RHEL service.
Are there known issues with the splay settings when running in daemon mode, or is there something I'm overlooking in these settings?
Yes, the splay settings work when running in daemon mode. They are intended for daemon mode. They just don't work as you thought they would.
Splaying produces a random delay before the first run, thus offsetting the whole schedule of future runs. Each agent will still check in on a fixed schedule.
This is useful for load averaging in the event that many machines may start at about the same time, such as collocated VMs at the startup of their host, or machines that automatically power on at a scheduled time.
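The behaviour described above can be sketched as a small model. This is only an illustration of "one random delay, then a fixed interval", not Puppet's actual implementation; the function name `splayedSchedule` and its parameters are hypothetical.

```javascript
// Sketch only: models the behaviour described above, not Puppet's real code.
// splayedSchedule and its parameters are hypothetical names for illustration.
function splayedSchedule(runIntervalMin, splayLimitMin, runs, random = Math.random) {
  // A single random delay is drawn once, before the first run.
  const initialDelay = random() * splayLimitMin;
  const times = [];
  for (let i = 0; i < runs; i++) {
    // Every later run is a fixed runinterval after the previous one.
    times.push(initialDelay + i * runIntervalMin);
  }
  return times;
}

// runinterval = 25m and splaylimit = 10m, as in the question.
// The random source is pinned to 0.5 so the result is reproducible.
const runTimes = splayedSchedule(25, 10, 4, () => 0.5);
console.log(runTimes); // [ 5, 30, 55, 80 ]
```

The gap between consecutive runs is always 25 minutes; only the starting offset differs from agent to agent.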
First Example
Suppose I have a CRON job
30 2 * * * ....
Then this would run every time when it is 2:30 at night (local time).
Now suppose I have the time zone Europe/Germany and it's 2017-10-29 (the day when DST is switched). Then this CRON job would run twice, right?
Second Example
Suppose I have the time zone Europe/Germany and the CRON job
30 11 * * * ....
As Germany never had a DST change at 11:30, this will not interfere. But the user could change the local time. To be super clear: This question is NOT about DST.
For the following test cases, I would like to know if/how often the CRON job gets scheduled:
At 11:29:58.0, the user sets the time to 11:31:00
At 11:29:59.1, the user sets the time to 11:31:00
At 11:29:59.6, the user sets the time to 11:31:00
At 11:30:01.0, the user sets the time to 11:29:59.7 - is CRON executed directly afterwards?
They boil down to "How quickly is CRON triggered?", where the fourth one also asks whether CRON remembers that it has already executed the job for that minute.
Another variant of the same question:
At 11:29:59, the NTP service corrects the time to 11:31:00 - will the job be executed that day at all?
The easiest way to answer this with confidence is to take a look at the source for the cron daemon. There are a few versions online like this, or you can use apt-get source cron.
The tick cycle in cron is to repeatedly sleep for a minute, or less if there is a job coming up. Immediately after emerging from the sleep, it checks the time and treats the result as one of these wakeupKind values:
Expected time - run any jobs we were expecting
Small jump forwards (up to 5 minutes) - run the jobs for the intervening minutes
Medium jump forwards (up to 3 hours, so this would include DST starting in spring) - run any wildcard jobs first (because the catch up could take more than a minute), then catch up on the intervening fixed time jobs
Large jump (3 hours or more either way) - start over with the current time
Jump backwards (up to 3 hours, so including the end of DST) - because any fixed time jobs have 'probably' already run, only run any wildcard jobs until the time is caught up again
If in doubt, the source comments these wakeupKind values clearly.
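As a rough sketch, the classification in the list above could look like this. The thresholds come from that list; the function name, the return labels, and the choice to measure the difference against the expected wake-up time are assumptions for illustration, not cron's actual identifiers or code.

```javascript
// Sketch only: thresholds follow the list above; names are hypothetical.
// diffMinutes = (time observed after waking) - (time cron expected), in minutes.
function classifyWakeup(diffMinutes) {
  const THREE_HOURS = 180;
  // 3 hours or more in either direction: start over with the current time.
  if (Math.abs(diffMinutes) >= THREE_HOURS) return 'large';
  // Clock went back by less than 3 hours (e.g. end of DST).
  if (diffMinutes < 0) return 'backwards';
  if (diffMinutes === 0) return 'expected';
  // Forward by up to 5 minutes: run the jobs for the intervening minutes.
  if (diffMinutes <= 5) return 'small';
  // Forward by more than 5 minutes but less than 3 hours (e.g. start of DST).
  return 'medium';
}

console.log(classifyWakeup(1));   // small  (e.g. 11:29:58 set to 11:31:00)
console.log(classifyWakeup(60));  // medium (DST starting)
console.log(classifyWakeup(-60)); // backwards (DST ending)
```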
Edit
To follow up on whether sleep() could be affected by a clock change, it looks like the answer is indirectly there in a couple of the Linux man pages.
Firstly, the notes for the sleep() function confirm that it is implemented by nanosleep().
The notes for nanosleep() say Linux measures the time using the CLOCK_MONOTONIC clock (even though POSIX.1 says it shouldn't).
Scroll down a bit in the docs for clock_settime() to see the explanation of CLOCK_MONOTONIC, which explains it is not affected by jumps in the system time, but it would be affected by incremental NTP style clock sync adjustments.
So in summary, a system admin style clock change will have no effect on the sleep(). But for example if an NTP adjustment came in and said to 'gently' advance the clock, cron would experience a series of slightly short sleep() function calls.
There are many implementations of cron systems (See here). One of the most commonly used crons is Vixie cron, and its man page states:
Daylight Saving Time and other time changes
Local time changes of less than three hours, such as those caused by the Daylight Saving Time changes, are handled in a special way. This only applies to jobs that run at a specific time and jobs that run with a granularity greater than one hour. Jobs that run more frequently are scheduled normally.
If time was adjusted one hour forward, those jobs that would have run in the interval that has been skipped will be run immediately. Conversely, if time was adjusted backwards, running the same job twice is avoided.
Time changes of more than 3 hours are considered to be corrections to the clock or the timezone, and the new time is used immediately.
source: man 8 cron
I believe this answers most of your points.
In addition to point five:
At 11:29:59, the NTP service corrects the time to 11:31:00 - will the job be executed that day at all?
First off, if NTP corrects the time by more than a minute, you have a very bad clock! This should not happen too often. Generally, you might see such a step when you first enable NTP, but after that the corrections should be much smaller.
In any case, if the offset is not too high, generally below 128 ms (the default ntpd step threshold), your system will slew the time. Slewing means changing the virtual frequency of the software clock to make it run faster or slower until the requested correction is achieved. Slewing the clock by a larger amount may take some time, too. For example, standard Linux adjusts the time at a rate of 0.5 ms per second.
This implies (under the assumption of Vixie cron, and probably many others):
If NTP jumps more than 3 hours, the job is skipped
If NTP jumps less than 3 hours but more than 128 ms, Vixie cron handles the job nicely using its time-jump handling described above.
If NTP corrects the time by less than 128 ms, cron does not notice the time jump because the clock is slewed.
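To get a feel for how long slewing takes, here is a back-of-the-envelope sketch. It assumes the 0.5 ms per second rate mentioned above is applied constantly, which is a simplification; `slewDurationSeconds` is a hypothetical helper, not part of any NTP tool.

```javascript
// Sketch only: assumes a constant slew rate (0.5 ms of correction per second).
// slewDurationSeconds is a hypothetical helper for illustration.
function slewDurationSeconds(offsetMs, slewRateMsPerSec = 0.5) {
  // The sign of the offset only decides the direction, not the duration.
  return Math.abs(offsetMs) / slewRateMsPerSec;
}

// A 100 ms offset is corrected in 200 seconds.
console.log(slewDurationSeconds(100)); // 200

// A full minute would take 120000 seconds (over 33 hours) to slew.
console.log(slewDurationSeconds(60 * 1000)); // 120000
```

Under this assumption, cron would only ever see sleeps that are off by a fraction of a millisecond per second while a small correction is being slewed.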
Interesting information:
RFC5905: Network Time Protocol Version 4: Protocol and Algorithms Specification
The NTP FAQ and Howto
https://wiki.gentoo.org/wiki/Cron/en
You're actually asking two related questions. The general answer is it depends[1], but I'll answer based on the Debian Linux installation I'm on right now:
How does cron handle DST changes and other 'special' time-related events?
On my Debian Linux system cron handles 'DST and other time-related changes/fixes' (per the man page) so that jobs don't get run twice or skipped due to changes like DST. (See https://debian-handbook.info/browse/stable/sect.task-scheduling-cron-atd.html for more specifics) Related to the 5th point raised in your second question, I would expect these same facilities to deal with NTP-related time jumps but don't know for certain.
How often is cron triggered and how quickly does it pick up my crontab changes?
Again, on my Debian Linux system the cron daemon wakes up once a minute and will detect and use any crontab changes made since the previous check/run one minute ago. Note that there is no guarantee that cron fires at 12:00:00 or 12:00:59 or any specific time in between (only that it fires when the time is 12:00:??), so if you change a crontab at 12:00:17 but cron fired at 12:00:13, your changes will not be picked up until the next run (most likely at 12:01:13, though there might be a slight amount of variance due to the Linux scheduler).
[1] It Depends...
The precise answer absolutely depends both on the platform (Linux/Unix/BSD/OS X/Windows) and the particular implementation of cron (there have been several over the decades with derivatives of Vixie cron being prevalent on Linux and BSD per https://en.wikipedia.org/wiki/Vixie_cron). If you're running something other than Linux, the man page / documentation for your implementation should provide details as to the specifics of how often it runs, picks up modified crontabs, DST handling etc. If you really need to know the specific details, df778899 is right in that you should look at the source code for your implementation as needed... because sometimes software/documentation is buggy.
On macOS:
$> man cron
...
Available options:
-s Enable special handling of situations when the GMT offset of the local timezone changes, such as the switches between the standard time and daylight saving time.
The jobs run during the GMT offset changes time as intuitively expected. If a job falls into a time interval that disappears (for example, during the switch from standard time) to daylight saving time
or is duplicated (for example, during the reverse switch), then it is handled in one of two ways:
The first case is for the jobs that run every at hour of a time interval overlapping with the disappearing or duplicated interval. In other words, if the job had run within one hour before the GMT
offset change (and cron was not restarted nor the crontab(5) changed after that) or would run after the change at the next hour. They work as always, skip the skipped time or run in the added time as
usual.
The second case is for the jobs that run less frequently. They are executed exactly once, they are not skipped nor executed twice (unless cron is restarted or the user's crontab(5) is changed during
such a time interval). If an interval disappears due to the GMT offset change, such jobs are executed at the same absolute point of time as they would be in the old time zone. For example, if exactly
one hour disappears, this point would be during the next hour at the first minute that is specified for them in crontab(5).
-o Disable the special handling of situations when the GMT offset of the local timezone changes, to be compatible with the old (default) behavior. If both options -o and -s are specified, the option
specified last wins.
The k8s Cron Job Limitations section mentions that there is no guarantee that a job will be executed exactly once:
A cron job creates a job object about once per execution time of its
schedule. We say “about” because there are certain circumstances where
two jobs might be created, or no job might be created. We attempt to
make these rare, but do not completely prevent them. Therefore, jobs
should be idempotent
Could anyone explain:
why this could happen?
what are the probabilities/statistic this could happen?
will it be fixed in some reasonable future in k8s?
are there any workarounds to prevent such a behavior (if the running job can't be implemented as idempotent)?
do other cron related services suffer with the same issue? Maybe it is a core cron problem?
The controller:
https://github.com/kubernetes/kubernetes/blob/master/pkg/controller/cronjob/cronjob_controller.go
starts with a comment that lays the groundwork for an explanation:
I did not use watch or expectations. Those add a lot of corner cases, and we aren't expecting a large volume of jobs or scheduledJobs. (We are favoring correctness over scalability.)
If we find a single controller thread is too slow because there are a lot of Jobs or CronJobs, we can parallelize by Namespace. If we find the load on the API server is too high, we can use a watch and UndeltaStore.)
Just periodically list jobs and SJs, and then reconcile them.
Periodically means every 10 seconds:
https://github.com/kubernetes/kubernetes/blob/master/pkg/controller/cronjob/cronjob_controller.go#L105
The documentation following the quoted limitations also has some useful color on some of the circumstances under which 2 jobs or no jobs may be launched on a particular schedule:
If startingDeadlineSeconds is set to a large value or left unset (the default) and if concurrencyPolicy is set to AllowConcurrent, the jobs will always run at least once.
Jobs may fail to run if the CronJob controller is not running or broken for a span of time from before the start time of the CronJob to start time plus startingDeadlineSeconds, or if the span covers multiple start times and concurrencyPolicy does not allow concurrency. For example, suppose a cron job is set to start at exactly 08:30:00 and its startingDeadlineSeconds is set to 10, if the CronJob controller happens to be down from 08:29:00 to 08:42:00, the job will not start. Set a longer startingDeadlineSeconds if starting later is better than not starting at all.
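The 08:30:00 example from the quoted documentation can be sketched as a simple check. This only models the rule as quoted, not the real controller logic; `hms` and `missedRunStillStarts` are hypothetical helpers, and times are expressed as seconds since midnight.

```javascript
// Sketch only: models the quoted rule, not the actual CronJob controller.
// Hypothetical helper: wall-clock time as seconds since midnight.
function hms(h, m, s) {
  return h * 3600 + m * 60 + s;
}

// Returns true if a run scheduled at scheduledAt can still be started when the
// controller comes back at controllerBackAt.
function missedRunStillStarts(scheduledAt, controllerBackAt, startingDeadlineSeconds) {
  // Per the quote, an unset deadline means the run is never considered too late.
  if (startingDeadlineSeconds == null) return true;
  return controllerBackAt <= scheduledAt + startingDeadlineSeconds;
}

// Controller down from 08:29:00 to 08:42:00, job scheduled for 08:30:00.
console.log(missedRunStillStarts(hms(8, 30, 0), hms(8, 42, 0), 10));   // false
console.log(missedRunStillStarts(hms(8, 30, 0), hms(8, 42, 0), 900));  // true
console.log(missedRunStillStarts(hms(8, 30, 0), hms(8, 42, 0), null)); // true
```

With a deadline of 10 seconds the window closes at 08:30:10, long before the controller returns; a 15-minute deadline keeps it open until 08:45:00.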
Higher level, solving for only-once in a distributed system is hard:
https://bravenewgeek.com/you-cannot-have-exactly-once-delivery/
Clocks and time synchronization in a distributed system is also hard:
https://8thlight.com/blog/rylan-dirksen/2013/10/04/synchronization-in-a-distributed-system.html
To the questions:
why this could happen?
For instance, the node hosting the CronJob controller fails at the time a job is supposed to run.
what are the probabilities/statistic this could happen?
Very unlikely for any given run. For a large enough number of runs, very unlikely to escape having to face this issue.
will it be fixed in some reasonable future in k8s?
There are no idempotency-related issues under the area/batch label in the k8s repo, so one would guess not.
https://github.com/kubernetes/kubernetes/issues?q=is%3Aopen+is%3Aissue+label%3Aarea%2Fbatch
are there any workarounds to prevent such a behavior (if the running job can't be implemented as idempotent)?
Think more about the specific definition of idempotent, and the particular points in the job where there are commits. For instance, jobs can be made to support more-than-once execution if they save state to staging areas, and then there is an election process to determine whose work wins.
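A minimal sketch of the "staging area plus election" idea follows. The two in-memory maps stand in for whatever durable storage a real job would use, and the "lowest worker name wins" rule is an arbitrary choice for the example; all names here are hypothetical.

```javascript
// Sketch only: in-memory stand-ins for a staging area and a committed store.
const staging = new Map();   // scheduledTime -> [{ worker, result }]
const committed = new Map(); // scheduledTime -> winning result

// Each execution writes its output to staging, keyed by the scheduled time,
// so duplicate executions of the same run land in the same bucket.
function stageResult(scheduledTime, worker, result) {
  if (!staging.has(scheduledTime)) staging.set(scheduledTime, []);
  staging.get(scheduledTime).push({ worker, result });
}

// Commit at most once per scheduled time; later calls return the same winner.
function electWinner(scheduledTime) {
  if (committed.has(scheduledTime)) return committed.get(scheduledTime);
  const candidates = staging.get(scheduledTime) || [];
  if (candidates.length === 0) return undefined;
  // Deterministic election rule (an assumption): the lowest worker name wins.
  const winner = candidates.slice().sort((a, b) => (a.worker < b.worker ? -1 : 1))[0];
  committed.set(scheduledTime, winner.result);
  return winner.result;
}

// The same 08:30 run was accidentally started twice.
stageResult('08:30', 'pod-b', 'report-from-b');
stageResult('08:30', 'pod-a', 'report-from-a');
console.log(electWinner('08:30')); // report-from-a
```

In a real deployment the commit step itself would need to be atomic in the backing store, which this single-process sketch does not demonstrate.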
do other cron related services suffer with the same issue? Maybe it is a core cron problem?
Yes, it's a core distributed systems problem.
For most users, the k8s documentation gives perhaps a more precise and nuanced answer than is necessary. If your scheduled job is controlling some critical medical procedure, it's really important to plan for failure cases. If it's just doing some system cleanup, missing a scheduled run doesn't much matter. By definition, nearly all users of k8s CronJobs fall into the latter category.
I have a continuous webjob and sometimes it can take a REALLY, REALLY long time to process (i.e. several days). I'm not interested in partitioning it into smaller chunks to get it done faster (by doing it more parallel). Having it run slow and steady is fine with me. I was looking at the documentation about webjobs here where it lists out all the settings but it doesn't specify the defaults or maximums for these values. I was curious if anybody knew.
Since the docs say
"WEBJOBS_RESTART_TIME - Timeout in seconds between when a continuous job's process goes down (for any reason) and the time we re-launch it again (Only for continuous jobs)."
it doesn't matter how long your process runs.
Please clarify your question, as most of it is irrelevant to what you're asking at the end.
If you want to know the min - I'd say try 0. For max try MAX_INT (2147483647), that's 68 years. That should do it ;).
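The "68 years" figure is easy to check, assuming MAX_INT means the largest signed 32-bit integer and the value is interpreted as seconds:

```javascript
// Quick arithmetic check: 2^31 - 1 seconds expressed in years.
const MAX_INT = 2147483647;                      // largest signed 32-bit integer
const SECONDS_PER_YEAR = 365.25 * 24 * 60 * 60;  // average year, leap days included
const years = MAX_INT / SECONDS_PER_YEAR;
console.log(Math.floor(years)); // 68
```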
There is no "max run time" for a continuous WebJob. Note that, in practice, there aren't any assurances on how long a given instance of your Web App hosting the WebJob is going to exist, and thus your WebJob may restart anyway. It's always good design to have your continuous job idempotent; meaning it can be restarted many times, and pick back up where it left off.
I want to design a job scheduler cluster, which contains several hosts that do cron job scheduling. For example, when a job which needs to run every 5 minutes is submitted to the cluster, the cluster should decide which host fires the next run, making sure of:
Disaster tolerance: as long as not all of the hosts are down, the job should be fired successfully.
Validity: only one host fires the next job run.
Because of disaster tolerance, a job cannot be bound to a specific host. One way is to have all the hosts poll a DB table (with a lock, of course), which guarantees that only one host gets the next job run. Since this locks the table often, is there a better design?
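For illustration, the polling design can be sketched with a per-run claim instead of a table-wide lock. The Map below is only an in-memory stand-in for the DB table, and `tryClaim` and `completeRun` are hypothetical names; in a real database the claim would have to be a single atomic conditional update (for example an UPDATE that only matches while the row is unclaimed).

```javascript
// Sketch only: in-memory stand-in for the DB table, jobId -> row.
const jobTable = new Map([['job-1', { nextRun: 300, claimedBy: null }]]);

// A host claims one specific run. The claim succeeds only if that exact run
// is still unclaimed, so at most one host wins per run.
function tryClaim(jobId, runTime, host) {
  const row = jobTable.get(jobId);
  if (!row || row.nextRun !== runTime || row.claimedBy !== null) return false;
  row.claimedBy = host;
  return true;
}

// After firing the job, the winner schedules the next run and releases the claim.
function completeRun(jobId, intervalSeconds) {
  const row = jobTable.get(jobId);
  row.nextRun += intervalSeconds;
  row.claimedBy = null;
}

console.log(tryClaim('job-1', 300, 'host-a')); // true
console.log(tryClaim('job-1', 300, 'host-b')); // false
completeRun('job-1', 300);
console.log(tryClaim('job-1', 600, 'host-b')); // true
```

This sketch does not cover the case where the winning host dies after claiming; a real design would also need a claim timeout so another host can take over.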
Use the Quartz framework for that. It has a cron like syntax, can be clustered and only one of the hosts in the cluster will do one job at a time. If a host or job fails, another host will retry the pending job.
I found Dkron (a distributed job scheduling system) by googling. It has a REST API and looks good. I plan to try it.
Dkron site
I'm not sure how to design one, but there are open-source products that do that which can serve as an example. One is Quartz scheduler that is mentioned above.
But apparently WalmartLabs evaluated Quartz, found it not good enough, and so created and open-sourced what they consider a better alternative, called BigBen. Perhaps you could also look at that one.
Consider using AWS Simple Workflow Service if you are OK with using AWS web services. The benefit over something like Quartz is that it doesn't depend on database which you have to host and it can provide much more than scheduling. For example it can run some activities that fix your cluster or page you if scheduling is not possible for any reason. Here is an example of a cron workflow.
I needed something like this long ago, when synchronisation was done with floppy disks. You should be clear about three things, which seem simple, but in a distributed environment they aren't :-)
"Synchronisation Sections"
If you get a net split, which means your cluster is split into two separate sections which can communicate inside each section but not between the two, then "fire the job exactly once" can only be achieved per synchronisation section.
"Disaster"
If almost all of the time all computers are up and running, only very seldom one fails, and the failure of two is almost unthinkable, that is a completely different thing from a setup where every host runs only part time, the connections are unstable, or the synchronisation is done by dial-up connections or by floppies. If you even want to deal with a net split, it becomes really, really complicated.
If you want to deal with malicious hosts, you have another problem.
"Validity"
Fire every job exactly once... you have to synchronize faster than the job firing interval.
Edit: a tip for designing the scheduler's tasks. I have a big text file that contains lines. Every line is a job task, starting with the job type, then the time to execute, then the command, and last but not least an optional resubmission interval for repeating tasks. Syncing means merging. Executed tasks are deleted. If resubmission is on, then a new task is inserted or appended.
In an ideal world, where every host is always connected to the others, I would implement something like a token ring. If there is no master, one is selected by the hosts, and the master is expected to schedule everything until it stops sending heartbeats for some time. If there are two masters, they negotiate for one of them to become master (maybe the lower MAC address... whatever).
If you have to deal with malicious hosts, you can use some solution to the Byzantine generals problem. The selection of the master is already pretty well protected against malicious hosts. With a little bit of RSA crypto the selected master can sign every command, and replay attacks can be handled with timestamps or growing indices... voilà.
This is only a story from an old programmer, not intended for today's everything-is-always-connected-to-the-internet world:
My big problem about 20 years ago was that the hosts were synchronized anywhere from once an hour or once a day to once a week or once a month. So the solution was to have different commands:
1. execute on every host at a given date (which is far enough in the future for synchronisation)
2. execute on a host where "whoami" contains a certain substring.
3. execute on a random host with a small probability, and send an acknowledgement to all others that it has already been executed.
The third command type does something like "fire only once", provided that synchronisation happens much more often than a host is likely to execute the task. It needs no master-slave architecture and it works pretty well if you know the synchronisation intervals.
Check out Chronos (https://mesos.github.io/chronos/), which runs on top of the Mesos resource scheduler (https://mesos.apache.org/).
I'm learning node.js and just set up an empty Linux Virtual Machine and installed node.
I'm running a function constantly every minute
var request = require('request');
// Interval in milliseconds: 1 minute
var minutes = 1, the_interval = minutes * 60 * 1000;
setInterval(function() {
  // Run code
}, the_interval);
I'm also considering adding some other functions based on the current time (e.g. run a function if the date/time is Sunday at noon).
My question is are there any disadvantages to running a set up like this compared to a traditional cron job set up?
Keep in mind I have to run this function in node every minute anyways.
My question is are there any disadvantages to running a set up like this compared to a traditional cron job set up?
As long as //run the code isn't a CPU-bound thing like cryptography, stick with 1 node process, at least to start. Since you are requiring request I guess you might be making an HTTP request, which is IO, which means this will be fine.
It's just simpler to have 1 thing to install/launch/start/stop/upgrade/connect-a-debugger than to deal with an app server as well as a separate cron-managed process. For what it's worth, keeping it in javascript makes it portable across platforms, although that probably doesn't really matter.
There is also a handy node-cron module which I have used as well as approximately one bazillion other alternatives.
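Without any extra module, the "Sunday at noon" check from the question could be sketched with plain Date methods. The function names here are hypothetical, and the sketch assumes the once-a-minute timer actually fires during the 12:00 minute; a delayed timer could in principle miss it.

```javascript
// Sketch only: gate extra work on the current local time inside the minute tick.
function isSundayNoon(date) {
  // getDay() returns 0 for Sunday; hours and minutes are in local time.
  return date.getDay() === 0 && date.getHours() === 12 && date.getMinutes() === 0;
}

// Called once per tick; returns the names of the tasks that should run now.
function tasksDue(now) {
  const due = ['everyMinute'];
  if (isSundayNoon(now)) due.push('sundayNoon');
  return due;
}

// Not started automatically here; call startScheduler(runTask) to begin ticking.
function startScheduler(runTask) {
  return setInterval(function () {
    tasksDue(new Date()).forEach(runTask);
  }, 60 * 1000);
}

console.log(tasksDue(new Date(2017, 9, 29, 12, 0))); // [ 'everyMinute', 'sundayNoon' ]
console.log(tasksDue(new Date(2017, 9, 30, 12, 0))); // [ 'everyMinute' ]
```

A cron-expression library does the same kind of matching with less hand-written date logic, which becomes worthwhile once there are more than a couple of schedules.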
It depends on how strictly you have to adhere to that minute interval and if your node script is doing anything else in the meantime. If the only thing the script does is run something every X, I would strongly consider just having your node script do X instead, and scheduling it using the appropriate operating system scheduler.
If you build and run this in node, you have to manage the lifecycle of the app and make sure it's running, recover from crashes, etc. Just executing once a minute via CRON is much more straightforward and in my opinion conforms more to the Unix Philosophy.
Cron, unless your app is really small and simple.
Also, the in-memory setInterval would not work if you ever end up behind a load balancer. You might have two or more instances of your node server running, and thus your function/script would run multiple times instead of once.