I have a few CronJobs configured and running in Kubernetes. How can I set up email alerts for CronJob success or failure in Kubernetes?
This could be as easy as a bash script that uses kubectl and sends an email when it sees a Job in the Failed state:
while true; do if kubectl get jobs myjob -o jsonpath='{.status.conditions[?(@.type=="Failed")].status}' | grep -q True; then mail -s jobfailed email@address </dev/null; break; else sleep 1; fi; done
or on newer K8s:
while true; do if kubectl wait --for=condition=failed job/myjob; then mail -s jobfailed email@address </dev/null; break; fi; done
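Since the question asks about success as well as failure, the polling loop can be extended to report either outcome. The following is only a sketch: the function names job_state and notify_when_done are made up for illustration, and it assumes a working mail command and that the Job exposes the usual Complete and Failed conditions.

```shell
# Print "complete", "failed", or "running" for a Job (hypothetical helper).
job_state() {
    local job="$1" cond
    for cond in Complete Failed; do
        # Same jsonpath query as above, once per condition type.
        if kubectl get job "$job" \
            -o jsonpath="{.status.conditions[?(@.type==\"$cond\")].status}" \
            | grep -q True; then
            echo "$cond" | tr '[:upper:]' '[:lower:]'
            return 0
        fi
    done
    echo running
}

# Poll until the Job finishes, send a single mail, then stop (hypothetical helper).
notify_when_done() {
    local job="$1" to="$2" interval="${3:-10}" state
    while :; do
        state=$(job_state "$job")
        if [ "$state" != running ]; then
            echo "Job $job finished: $state" | mail -s "job $job $state" "$to"
            return 0
        fi
        sleep "$interval"
    done
}

# Example call (address is a placeholder):
#   notify_when_done myjob email@address
```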
How to tell whether a Job is complete: Kubernetes - Tell when Job is Complete
You can also setup something like Prometheus with Alertmanager in your Kubernetes cluster to monitor your Jobs.
Some useful info here and here.
I have several celery tasks that execute via beat. In development, I used a single command to set this up, like:
celery worker -A my_tasks -n XXXXX@%h -Q for_mytasks -c 1 -E -l INFO -B -s ./config/celerybeat-schedule --pidfile ./config/celerybeat.pid
On moving to production, I inserted this into a script that activated my venv, set the PYTHONPATH, removed old beat files, cd to correct directory and then run celery. This works absolutely fine. However, in production I want to separate the worker from the beat scheduler, like:
celery worker -A my_tasks -n XXXXX@%h -Q for_mytasks -c 1 -E -l INFO -f ./logs/celeryworker.log
celery beat -A my_tasks -s ./config/celerybeat-schedule --pidfile ./config/celerybeat.pid -l INFO -f ./logs/celerybeat.log
Now this all works fine when put into the relevant bash scripts. However, I need these to be run on server start-up. I encountered several problems:
1) In crontab -e, @reboot my_script will not work. I have to insert a delay to allow rabbitmq to fully start, i.e. @reboot sleep 60 && my_script. This seems a little 'messy' to me, but I can live with it.
2) celery worker takes several seconds to finish before celery beat can be run properly. I tried all manner of cron directives to accomplish beat being run after worker has executed successfully but couldn't get the beat to run. My current solution in crontab is something like this:
@reboot sleep 60 && my_script_worker
@reboot sleep 120 && my_script_beat
So basically, ubuntu boots, waits 60 seconds and runs celery worker then waits another 60 seconds before running celery beat. This works fine but it seems even more 'messy' to me. In an ideal world I would like to flag when rabbitmq is ready to run worker, then flag when worker has executed successfully so that I can run beat.
My question is : has anybody encountered this problem and if so do they have a more elegant way of kicking off celery worker & beat on server reboot?
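One way to replace the fixed sleeps with the "flag when ready" behaviour described above is a small retry helper that waits until a probe command succeeds. This is a sketch only: wait_until is a hypothetical helper, and the probe commands in the comments (rabbitmqctl status, celery inspect ping) are assumptions about what counts as "ready" on this setup. They may need extra permissions or a different check.

```shell
# Retry a command once per second until it succeeds or the timeout
# (in seconds) expires. Returns 0 on success, 1 on timeout.
wait_until() {
    local timeout="$1" waited=0
    shift
    until "$@" >/dev/null 2>&1; do
        waited=$((waited + 1))
        [ "$waited" -ge "$timeout" ] && return 1
        sleep 1
    done
}

# Possible use in a single boot script (probes are assumptions):
#   wait_until 120 rabbitmqctl status && my_script_worker
#   wait_until 120 celery -A my_tasks inspect ping && my_script_beat
```

With this approach a single @reboot entry can call one script, and each step starts as soon as its dependency responds rather than after a guessed delay.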
EDIT: 24/09/2019
Thanks to DejanLekic & Greenev
I have spent some hours converting from cron to systemd. Yes, I agree totally that this is a far more robust solution. My celery worker & beat are now started as services by systemd on reboot.
There is one tip I have for people trying this that is not mentioned in the celery documentation. The template beat command will create a 'celery beat database' file called celerybeat-schedule in your working directory. If you restart your beat service, this file will cause spurious celery tasks to be spawned that don't seem to fit with your actual celery schedule. The solution is to delete this file each time the beat service starts. I also delete the pid file, if it's there. I did this by adding 2 ExecStartPre and a -s option to the beat service :
ExecStartPre=/bin/sh -c 'rm -f ${CELERYBEAT_DB_FILE}'
ExecStartPre=/bin/sh -c 'rm -f ${CELERYBEAT_PID_FILE}'
ExecStart=/bin/sh -c '${CELERY_BIN} beat \
-A ${CELERY_APP} --pidfile=${CELERYBEAT_PID_FILE} \
-s ${CELERYBEAT_DB_FILE} \
--logfile=${CELERYBEAT_LOG_FILE} --loglevel=${CELERYD_LOG_LEVEL}'
Thanks guys.
To daemonize the celery worker we use systemd, so the worker and beat run as separate services and are configured to start on server reboot simply by enabling those services.
What you really want is to run the Celery beat process as a systemd or SysV service. This is described in depth in the Daemonization section of the Celery documentation. In fact, the same goes for the worker process.
Why? Unlike your solution, which relies on a crontab @reboot line, systemd, for example, can check the health of the service and restart it if needed. All Linux services on your Linux boxes are started this way because it was made for this particular purpose.
I know how to put a pod to sleep by command:
kubectl -n logging patch sts <sts name> --type='json' -p='[{"op": "add", "path": "/spec/template/spec/containers/0/command", "value": ["sleep", "infinity"] }]'
What's the command to wake up the pod?
What you are actually doing is to update the statefulset, changing the command parameter for its pods. The command parameter sets the entrypoint for the container, in other words, the command that is executed when starting the container.
You are setting that command to sleep infinity. Thus to wake up the pod, just update the statefulset and set the command to the original one.
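As a sketch of what "set the command to the original one" could look like: the helper below (wake_patch, a name invented here) prints a JSON patch. With no arguments it removes the command override, which is only correct if the container originally had no command and relied on the image's own entrypoint. With arguments it restores them as the command. Both cases assume the container is still at index 0.

```shell
# Build a JSON patch that undoes the "sleep infinity" override.
# No arguments: remove the command so the image entrypoint is used again.
# Arguments: restore them as the command (they must not contain quotes).
wake_patch() {
    local path=/spec/template/spec/containers/0/command items="" arg
    if [ "$#" -eq 0 ]; then
        printf '[{"op":"remove","path":"%s"}]\n' "$path"
        return
    fi
    for arg in "$@"; do
        items="$items${items:+,}\"$arg\""
    done
    printf '[{"op":"replace","path":"%s","value":[%s]}]\n' "$path" "$items"
}

# Example call, mirroring the patch command from the question:
#   kubectl -n logging patch sts <sts name> --type='json' -p="$(wake_patch)"
```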
The best solution would be to simply scale the statefulset to 0 replicas with:
kubectl -n logging scale sts <sts name> --replicas 0
And scale up to the original replicas number with:
kubectl -n logging scale sts <sts name> --replicas <original number>
This way you don't have any pods running sleep infinity in your cluster, and you save costs by not having useless pods wasting resources.
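To avoid having to remember the original replica count by hand, the scale-down step can record it first. This is a minimal sketch: sts_pause, sts_resume and the STATE_DIR location are invented for illustration, and storing the count in a local file is just one possible choice.

```shell
# Assumption: replica counts are remembered in small files under STATE_DIR.
STATE_DIR="${STATE_DIR:-/tmp}"

# Record the current replica count, then scale the statefulset to 0.
sts_pause() {
    local ns="$1" name="$2" replicas
    replicas=$(kubectl -n "$ns" get sts "$name" -o jsonpath='{.spec.replicas}') || return 1
    echo "$replicas" > "$STATE_DIR/$ns-$name.replicas"
    kubectl -n "$ns" scale sts "$name" --replicas 0
}

# Scale the statefulset back to the recorded replica count.
sts_resume() {
    local ns="$1" name="$2" replicas
    replicas=$(cat "$STATE_DIR/$ns-$name.replicas") || return 1
    kubectl -n "$ns" scale sts "$name" --replicas "$replicas"
}

# Example calls (statefulset name is a placeholder):
#   sts_pause logging my-sts
#   sts_resume logging my-sts
```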
I can install a new cron job using crontab command.
For example,
crontab -e
0 0 * * * root rdate -s time.bora.net && clock -w > /dev/null 2>&1
Now I have 100 machines in my cluster, and I want to install the above cron task on all of them.
How can I distribute cron jobs to the cluster machines?
Thanks,
Method 1: Ansible
Ansible can distribute machine configuration to remote machines. Please refer to:
https://www.ansible.com/
Method 2: Distributed Cron
You can use a distributed cron system to assign cron jobs. It has a master node, and you can configure your jobs easily and monitor their results.
https://github.com/shunfei/cronsun
The crontab is stored in /var/spool/cron/(username),
so you can write your own cron jobs and distribute them to that location after acquiring root permission.
But if another user edits the crontab at the same time, you can never be sure when it will get changed.
The links below may help:
https://ubuntuforums.org/showthread.php?t=2173811
https://forums.linuxmint.com/viewtopic.php?t=144841
https://serverfault.com/questions/347318/is-it-bad-to-edit-cron-file-manually
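If copying files by hand is the chosen route, a drop-in file under /etc/cron.d avoids touching per-user spool files that someone may be editing. The entry in the question already includes a user field (root), which is the format used by /etc/crontab and /etc/cron.d rather than by crontab -e. The sketch below assumes passwordless root SSH to every machine, a hosts.txt with one hostname per line, and a cron file named timesync. All three are illustrative, and push_cron_file is a made-up helper.

```shell
# Copy one cron file into /etc/cron.d on every host listed in a file.
# Blank lines and lines starting with '#' in the host list are skipped.
# Returns 1 if any copy failed.
push_cron_file() {
    local hosts_file="$1" cron_file="$2" host failed=0
    # Read hosts on fd 3 so scp/ssh cannot swallow the list via stdin.
    while IFS= read -r host <&3; do
        case "$host" in ''|'#'*) continue ;; esac
        if ! scp -q "$cron_file" "root@$host:/etc/cron.d/"; then
            echo "FAILED: $host" >&2
            failed=1
        fi
    done 3< "$hosts_file"
    return "$failed"
}

# Example call (file names are placeholders):
#   push_cron_file hosts.txt timesync
```

For 100 machines a configuration tool such as Ansible is likely easier to maintain. The loop above is mainly useful as a one-off.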
Ansible already has a cron module:
https://docs.ansible.com/ansible/2.7/modules/cron_module.html
I am trying to automate shutting down and starting VM instances in my Google Cloud account via crontab. The OS is Ubuntu 12 LTS, and it is set up with a Google service account so it can read from and write to my Google Cloud account.
My actual code is in this file /home/ubu12lts/cronfiles/resetvm.sh
#!/bin/bash
echo Y | gcloud compute instances stop my-vm-name --zone us-central1-a
sleep 120s
gcloud compute instances start my-vm-name --zone us-central1-a
echo "completed"
When I call the above file like this,
$ bash /home/ubu12lts/cronfiles/resetvm.sh
It works perfectly and does the job.
Now I want to set this up in cron so it runs automatically every hour. So I did
$ sudo crontab -e
And added this code in cron
0 * * * * /bin/sh /home/ubu12lts/cronfiles/resetvm.sh >>/home/ubu12lts/cron.log
And made script executable
chmod +x /home/ubu12lts/cronfiles/resetvm.sh
I also tested the crontab by adding a sample command that creates a .txt file with a sample message, and it worked perfectly.
But the above code for the gcloud SDK doesn't work through cron. The VM neither stops nor starts in my GC Compute Engine.
Can anyone help, please?
Thank you so much.
You have added the entry to root's crontab, while your Cloud SDK installation is set up for a different user (I am guessing ubu12lts).
You should add the entry to ubu12lts's crontab using:
crontab -u ubu12lts -e
Additionally your entry is currently scheduled to run on the 0th minute every hour. Is that what you intended?
I have run into a similar issue before. I fixed it by sourcing the profile in my script.sh, which loads the gcloud environment variables. Example below:
#!/bin/bash
source /etc/profile
echo Y | gcloud compute instances stop my-vm-name --zone us-central1-a
sleep 120s
gcloud compute instances start my-vm-name --zone us-central1-a
echo "completed"
This also helped me resize node count in GKE.
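Both answers come down to cron running with a different user and a much smaller environment than an interactive shell. A variant of the script that does not depend on the login profile is sketched below. It assumes the GCLOUD variable can be pointed at wherever the SDK is installed for that user (that location is not given in the question), and it uses gcloud's --quiet flag in place of piping Y into the prompt. reset_vm is a made-up function name.

```shell
#!/bin/bash
# Assumption: set GCLOUD to the full path of the gcloud binary if cron's
# PATH does not include it; the default relies on PATH.
GCLOUD="${GCLOUD:-gcloud}"

# Stop a VM, wait, and start it again. Stops early if the stop call fails.
reset_vm() {
    local name="$1" zone="$2" pause="${3:-120}"
    "$GCLOUD" --quiet compute instances stop "$name" --zone "$zone" || return 1
    sleep "$pause"
    "$GCLOUD" --quiet compute instances start "$name" --zone "$zone"
}

# Example call matching the question:
#   reset_vm my-vm-name us-central1-a && echo "completed"
```

Redirecting stderr as well (2>&1) in the crontab entry makes a "command not found" or credentials error show up in cron.log instead of disappearing.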
I am using a Torque+MAUI cluster.
The cluster's utilization is currently ~10 of the 40 available nodes, yet many jobs are queued and cannot be started.
I submitted the following PBS script using qsub:
#!/bin/bash
#
#PBS -S /bin/bash
#PBS -o STDOUT
#PBS -e STDERR
#PBS -l walltime=500:00:00
#PBS -l nodes=1:ppn=32
#PBS -q zone0
cd /somedir/workdir/
java -Xmx1024m -Xms256m -jar client_1_05.jar
The job gets R(un) status immediately, but I had this abnormal information from qstat -n
8655.cluster.local user zone0 run.sh -- 1 32 -- 500:00:00 R 00:00:31
z0-1/0+z0-1/1+z0-1/2+z0-1/3+z0-1/4+z0-1/5+z0-1/6+z0-1/7+z0-1/8+z0-1/9
+z0-1/10+z0-1/11+z0-1/12+z0-1/13+z0-1/14+z0-1/15+z0-1/16+z0-1/17+z0-1/18
+z0-1/19+z0-1/20+z0-1/21+z0-1/22+z0-1/23+z0-1/24+z0-1/25+z0-1/26+z0-1/27
+z0-1/28+z0-1/29+z0-1/30+z0-1/31
The abnormal part is the -- in run.sh -- 1 32: the session ID is missing, and evidently the script does not run at all, i.e. there is no trace of the java program ever being started.
After ~5 minutes in this strange state, the job is set back to Q(ueued) status and seemingly will not be run again (I monitored this for ~1 week, and it did not run even after reaching the top of the queue).
I tried submitting the same job 14 times and monitored the nodes in qstat -n. 7 copies ran successfully on various nodes, but every job allocated to z0-1/* got stuck with this strange startup behavior.
Does anyone know a solution to this issue?
As a temporary workaround, how can I tell the PBS script NOT to use those strange nodes?
It sounds like something is wrong with those nodes. One solution would be to offline the nodes that aren't working: pbsnodes -o <node name> and allow the cluster to continue to work. You may need to release the holds on any jobs. I believe you can run releasehold ALL to accomplish this in Maui.
Once you take care of that I'd investigate the logs on those nodes (start with the pbs_mom logs and the syslogs) and figure out what is wrong with them. Once you figure out and correct what is wrong with them, you can put the nodes back online: pbsnodes -c <node_name>. You may also want to look into setting up some node health scripts to proactively detect and handle these situations.
For users: contact your administrator and, in the meantime, run the job using this workaround:
1) Use pbsnodes to check for free and healthy nodes.
2) Modify the PBS directive: #PBS -l nodes=<freenode1>:ppn=<ppn1>+<freenode2>:ppn=<ppn2>+...
3) Submit the job using qsub.
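The user-side workaround can be scripted roughly as follows. This is a sketch under assumptions: it relies on pbsnodes -l free printing one "name state" pair per line, on a -l option given on the qsub command line taking precedence over the #PBS -l nodes directive in the script, and pick_free_node is a made-up helper. It only checks that a node is reported free, not that it has 32 idle cores.

```shell
# Print the first free node whose name is not in the exclude list
# (a space-separated string). Returns non-zero if none is found.
pick_free_node() {
    local exclude="$1"
    pbsnodes -l free | awk -v bad="$exclude" '
        BEGIN { n = split(bad, names, " "); for (i = 1; i <= n; i++) skip[names[i]] = 1 }
        !($1 in skip) { print $1; found = 1; exit }
        END { exit !found }'
}

# Example: avoid z0-1 and submit to the first other free node.
#   node=$(pick_free_node "z0-1") && qsub -l "nodes=$node:ppn=32" run.sh
```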