CoreOS alternative to /usr/lib/systemd/system-shutdown/ - coreos

I recently stumbled across the fact that on shutdown/reboot any executable in /usr/lib/systemd/system-shutdown gets run by systemd-shutdown, immediately before the system actually halts or reboots.
Paraphrasing - https://www.freedesktop.org/software/systemd/man/systemd-halt.service.html
With the /usr filesystem being read-only on CoreOS, I cannot put any of my shutdown scripts in /usr/lib/systemd/system-shutdown. I'm hoping someone more knowledgeable about CoreOS and systemd knows an alternate directory path on CoreOS nodes that would give me the same results, or a configuration setting that I can adjust to point the directory to /etc/systemd/system-shutdown or something else.
Alternatively, any pointers on creating a custom service that does the same thing as systemd-shutdown would be welcome.
My use case is that I have a few scripts that I want to execute when a node shuts down. For example: remove the node from the monitoring system, unschedule the node in Kubernetes, and drain any running pods while allowing in-flight transactions to finish.

Related

Export Memory Dump Azure Kubernetes

I need to export a memory dump from an AKS cluster and save it in some location.
How can I do it? Is it easy to export to a storage account? Is there another solution? Can someone give me a step-by-step guide?
EDIT: the previous answer was wrong; I didn't notice that you needed a dump. You'll actually need to get it from Boot Diagnostics or from the command line:
https://learn.microsoft.com/en-us/azure/virtual-machines/troubleshooting/boot-diagnostics#enable-boot-diagnostics-on-existing-virtual-machine
This question is quite old, but let me nevertheless share how I realized it:
Linux has an internal setting called RLIMIT_CORE which limits the size of the core dump you'll receive when your application crashes - this is what you find quite quickly.
Next, you have to define the location of where core files are saved, which is done in the file /proc/sys/kernel/core_pattern. The given path can either be a relative file name (saved next to the binary which crashed), an absolute path (absolute to the mounted namespace) or - here is where it gets interesting - a pipe followed by an absolute path to an executable (application or script). This script will (according to the docs - see headline Piping core dumps to a program) be started as user and group root - but furthermore, it will (according to this post in the Linux mailing list) also be executed in the global namespace - in other words, outside of the container.
If you are like me, and you do not have access to the image used for new nodes on your AKS cluster, you will want to set these values using a DaemonSet, which runs one pod on every node.
Armed with all this knowledge, you can do the following:
Create a DaemonSet - a pod running on every machine performing the initial setup.
This DaemonSet will run as a privileged container to allow it to switch to the root namespace.
After having switched namespaces successfully, it can change the value of /proc/sys/kernel/core_pattern.
The value should be something like |/bin/dd of=/core/%h.%e.%p.%t (dd will take the stdin, the core file, and save it to the location defined by the parameter of). Core files will now be saved at /core/. The name of the file can be explained by the variables found in the docs for core files.
Knowing that the files will be saved to /core/ in the root namespace, we can mount our storage there - in my case Azure File Storage. Here's a tutorial of how to mount Azure File Storage.
Pods in a DaemonSet have the restartPolicy set to Always. Since the job of your pod is done once the setup has run, and you don't want it to restart automatically, let it remain running using sleep infinity.
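To make the file naming from the steps above concrete, here is a minimal Python sketch of how a template such as /core/%h.%e.%p.%t turns into a file name. It only handles the four specifiers used in this answer (plus %%); the function name and the sample values are made up for illustration, and the real expansion is done by the kernel, not by user code.

```python
def expand_core_pattern(template, hostname, executable, pid, timestamp):
    """Expand %h, %e, %p, %t and %% in a core_pattern-style template.

    Illustrative only: the kernel supports more specifiers than this sketch.
    """
    values = {
        "h": hostname,             # %h: hostname
        "e": executable,           # %e: name of the crashed executable
        "p": str(pid),             # %p: PID of the dumped process
        "t": str(int(timestamp)),  # %t: time of dump, seconds since the epoch
        "%": "%",                  # %%: a literal percent sign
    }
    out = []
    i = 0
    while i < len(template):
        ch = template[i]
        if ch != "%":
            out.append(ch)
            i += 1
        elif i + 1 < len(template):
            # Specifiers this sketch does not know are simply dropped.
            out.append(values.get(template[i + 1], ""))
            i += 2
        else:
            # A single trailing % is dropped.
            i += 1
    return "".join(out)


def is_piped(core_pattern):
    """A core_pattern starting with '|' pipes the dump to a program."""
    return core_pattern.startswith("|")


name = expand_core_pattern("/core/%h.%e.%p.%t", "aks-node-0", "myapp", 4711, 1700000000)
print(name)  # /core/aks-node-0.myapp.4711.1700000000
```

With the piped variant from the answer, the expanded string is the argument given to dd, so the resulting file lands under /core/ with the same name.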
This writeup is almost a copy of what I discovered while in contact with Microsoft support. Here's the thread in their forum, which contains an almost finished configuration for a DaemonSet.
I'll leave some links here which I used during my research:
how to generate core file in docker container?
How to access docker host filesystem from privileged container
https://medium.com/@patnaikshekhar/initialize-your-aks-nodes-with-daemonsets-679fa81fd20e
Sidenote:
I could also just have mounted the Azure File Storage share into every container and set the value of /proc/sys/kernel/core_pattern to just /core/%h.%e.%p.%t, but this would require me to declare the mount on every container. Going this way, I could free the pod configuration of this administrative task and put it where it (in my opinion) belongs: in the initial machine setup.

How to create a livenessprobe for a node.js container that is not a server?

I have to create a readiness and liveness probe for a node.js container (docker) in kubernetes. My problem is that the container is NOT a server, so I cannot use an http request to see if it is live.
My container runs a node-cron process that downloads some CSV files every 12 h, parses them and inserts the results into Elasticsearch.
I know I could add express.js but I would rather not do that just for a probe.
My question is:
Is there a way to use some kind of liveness command probe? If it is possible, what command can I use?
Inside the container, I have pm2 running the process. Can I use it in any way for my probe and, if so, how?
Liveness command
You can use a liveness command as you describe. However, I would recommend designing your job/task for Kubernetes.
Design for Kubernetes
My container runs a node-cron process that downloads some CSV files every 12 h, parses them and inserts the results into Elasticsearch.
Your job does not run very often, so if you deploy it as a service it will take up resources all the time. And since you write that you want to use pm2 for your process, I would recommend another design. As far as I understand, PM2 is a process manager, but Kubernetes is also a process manager in a way.
Kubernetes native CronJob
Instead of handling a process with pm2, implement your process as a container image and schedule your job/task with a Kubernetes CronJob, where you specify your image in the jobTemplate. With this design you don't need a livenessProbe, and your task will be restarted if it fails, e.g. if it fails to insert the results into Elasticsearch due to a network problem.
First, you should certainly consider a Kubernetes CronJob for this workload. That said, it may not be appropriate for your job, for example if your job takes the majority of the time between scheduled runs to run, or you need more complex interactions between error handling in your job and scheduling. Finally, you may even want a liveness probe running for the container spawned by the CronJob if you want to check that the job is making progress as it runs -- this uses the same syntax as you would use with a normal job.
I'm less familiar with pm2, though I don't think you should need to use the additional job management inside of Kubernetes, which should already provide most of what you need.
That said, it is certainly possible to use an arbitrary command for your liveness probe, and as you noted it is even explicitly covered in the Kubernetes liveness/readiness probe documentation.
You just add an exec member to the livenessProbe stanza for the container, like so:
livenessProbe:
  exec:
    command:
    - cat
    - /tmp/healthy
  initialDelaySeconds: 5
  periodSeconds: 5
If the command returns 0 (i.e. succeeds), then the kubelet considers the container to be alive and healthy. In this trivial example, the container is considered healthy only while /tmp/healthy exists.
In your case, I can think of several possibilities to use. As one example, the job could probably be configured to drop a sentinel file that indicates it is making progress in some way. For example, append the name and timestamp of the last file copied. The liveness command would then be a small script that could read that file and ensure that there has been adequate progress (e.g. in the cron job case, that a file has been copied within the last few minutes).
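As a rough illustration of the sentinel-file idea, the check could look something like the sketch below. It is written in Python for brevity (the same logic ports directly to a small Node script if Python is not in your image), and the path /tmp/last-progress and the 13-hour threshold are hypothetical values, not something taken from your setup.

```python
import os
import time


def is_making_progress(sentinel_path, max_age_seconds, now=None):
    """Return True if the sentinel file exists and was modified recently enough."""
    now = time.time() if now is None else now
    try:
        mtime = os.path.getmtime(sentinel_path)
    except OSError:
        # Missing or unreadable sentinel counts as "no progress".
        return False
    return (now - mtime) <= max_age_seconds


def probe_exit_code(sentinel_path="/tmp/last-progress", max_age_seconds=13 * 60 * 60):
    """Map the check onto the exit codes an exec probe understands: 0 = healthy."""
    return 0 if is_making_progress(sentinel_path, max_age_seconds) else 1
```

The script's entry point would then end with sys.exit(probe_exit_code()), and the probe's exec command would invoke that script in place of cat /tmp/healthy. The job itself has to touch or rewrite the sentinel file whenever it makes progress.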
Readiness probes probably don't make sense in the context of the service you describe, since they're more about not sending the application traffic, but they can also have a similar stanza, just for readinessProbe rather than livenessProbe.

Running command on EC2 launch and shutdown in auto-scaling group

I'm running a Docker swarm deployed on AWS. The setup is an auto-scaling group of EC2 instances that each act as Docker swarm nodes.
When the auto-scaling group scales out (spawns new instance) I'd like to run a command on the instance to join the Docker swarm (i.e. docker swarm join ...) and when it scales in (shuts down instances) to leave the swarm (docker swarm leave).
I know I can do the first one with user data in the launch configuration, but I'm not sure how to act on shutdown. I'd like to make use of lifecycle hooks, and the docs mention I can run custom actions on launch/terminate, but it is never explained just how to do this. It should be possible to do without sending SQS/SNS/Cloudwatch events, right?
My AMI is a custom one based off of Ubuntu 16.04.
Thanks.
One of the core issues is that removing a node from a Swarm is currently a 2 or 3-step action when done gracefully, and some of those actions can't be done on the node that's leaving:
docker node demote, if leaving-node is a manager
docker swarm leave on leaving-node
docker node rm on a manager
Step 3 is the tricky one, because it requires you to do one of three things to complete the removal process:
(not recommended) Put something on a worker that would let it do things on a manager remotely (ssh to a manager with sudo perms, or docker manager API access). This breaks the security model of "workers can't do manager things" and greatly increases risk. We want our managers to stay secure, and our workers to have no control or visibility into the swarm.
(best if possible) Set up an external solution so that on an EC2 node removal, a job is run to SSH or API into a manager and remove the node from the swarm. I've seen people do this, but can't remember a link/repo for full details on using a lambda, etc. to deal with the lifecycle hook.
Set up a simple cron on a single manager (or preferably as a manager-only service running a cron container) that removes workers that are marked down. This is a sort of blunt approach and has edge cases where you could potentially delete a node that still exists but is considered down/unhealthy by swarm, but I've not heard of that happening. If it was fancy, it could maybe validate with AWS that the node is indeed gone before removing it.
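For the cron option, the selection logic might look roughly like this Python sketch. It assumes the cron job feeds it the output of docker node ls --format '{{json .}}' (one JSON object per line) and that the ID, Status and ManagerStatus fields look as they do on the Docker versions I've seen; check that against your own output before relying on it. The function names are made up for illustration.

```python
import json


def down_workers(node_ls_json_lines):
    """Return the IDs of worker nodes that the swarm reports as Down.

    Expects one JSON object per line, as printed by
    docker node ls --format '{{json .}}' (assumed field names).
    """
    ids = []
    for line in node_ls_json_lines.splitlines():
        line = line.strip()
        if not line:
            continue
        node = json.loads(line)
        # Workers have an empty ManagerStatus; never touch managers here.
        is_worker = not node.get("ManagerStatus")
        if is_worker and node.get("Status", "").lower() == "down":
            ids.append(node["ID"])
    return ids


def removal_commands(node_ids):
    """Build one 'docker node rm' command (as an argv list) per node."""
    return [["docker", "node", "rm", node_id] for node_id in node_ids]
```

The cron job would pass each argv list to subprocess.run on a manager. Validating against AWS that the instance is really gone would slot in between the two functions.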
WORST CASE, if a node goes down hard and doesn't do any of the above, it's not horrible, just not ideal for graceful management of user/db connections. After 30s a node is considered down and Service tasks will be re-created on healthy nodes. A long list of workers marked down in the swarm node list doesn't have an effect on your Services really, it's just unsightly (as long as there are enough healthy workers).
THERE'S A FEATURE REQUEST in GitHub to make this removal easier. I've commented on what I'm seeing in the wild. Feel free to post your story and use case in the SwarmKit repo.

Automatically suspend VMs after shutdown of Proxmox host

I'm looking for a way to suspend my VMs when the Proxmox host restarts. With Hyper-V, it's possible to define an action for each VM, such as suspend or restart, that is applied to the VM when the host reboots. Proxmox by default shuts down the VMs together with the host. I couldn't find any config option for this, only one to let Proxmox automatically start a VM after shutdown.
I found this article: http://8086.support/content/13/75/en/how-do-i-configure-kvm-to-suspend_restore-virtual-machines-when-the-host-is-rebooted.html It seems to be exactly what I need, but the file /etc/sysconfig/libvirt-guests doesn't exist. This file is part of the libvirt-client package, which is not installed and so is not part of Proxmox. So I'm not sure if it's a good idea to use Proxmox together with another management solution, which libvirt seems to be. According to this entry, it's not even possible.
Isn't there a native way in Proxmox to suspend a VM on host shutdown?
Have you tried posting on the Proxmox forums? They're the experts on their product so I'd recommend it.
Even if there's not an easy "built in" way to configure that by default, it's still possible. Proxmox is Debian under the hood, so you could write a script to do what you want on shutdown/reboot.
The built-in pvesh allows you to interact with your PVE server from the command line and do tons of different things (including suspend and start). It interacts with the PVE RESTful API. Info on pvesh is here and the full API docs are here.
Once you've written a script that will suspend or restart your VMs, you can then leverage systemd to launch your script at the appropriate time. E.g. the CLI part of this
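As a starting point for such a script, here is a small Python sketch that only builds the pvesh commands; it does not run them. It assumes the API paths /nodes/{node}/qemu (list VMs) and /nodes/{node}/qemu/{vmid}/status/suspend, and the todisk parameter for suspending to disk, as I understand them from the API docs; verify both against your PVE version. The node name and function names are placeholders.

```python
import json


def running_vmids(qemu_list_json):
    """Extract the IDs of running VMs from the JSON printed by
    pvesh get /nodes/<node>/qemu --output-format json (assumed shape)."""
    vms = json.loads(qemu_list_json)
    return sorted(int(vm["vmid"]) for vm in vms if vm.get("status") == "running")


def suspend_commands(node, vmids, to_disk=True):
    """Build one pvesh suspend command (as an argv list) per VM.

    to_disk=True asks for suspend-to-disk, since a VM that is merely
    paused in RAM would not survive a host reboot.
    """
    commands = []
    for vmid in vmids:
        cmd = ["pvesh", "create", f"/nodes/{node}/qemu/{vmid}/status/suspend"]
        if to_disk:
            cmd += ["--todisk", "1"]
        commands.append(cmd)
    return commands
```

A shutdown script would run each argv list with subprocess.run and wait for the suspend tasks to finish before letting the host go down.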

Weird "Stale file handle, errno=116" on remote cluster after dozens of hours running

I'm now running a simulation code called CMAQ on a remote cluster. I first ran a benchmark test in serial to see the performance of the software. However, the job always runs for dozens of hours and then crashes with the following "Stale file handle, errno=116" error message:
PBS Job Id: 91487.master.cluster
Job Name: cmaq_cctm_benchmark_serial.sh
Exec host: hs012/0
An error has occurred processing your job, see below.
Post job file processing error; job 91487.master.cluster on host hs012/0Unknown resource type REJHOST=hs012.cluster MSG=invalid home directory '/home/shangxin' specified, errno=116 (Stale file handle)
This is very strange because I never modify the home directory and this "/home/shangxin/" is surely my permanent directory where the code is....
Also, in the standard output .log file, the following message is always shown when the job fails:
Bus error
100247.930u 34.292s 27:59:02.42 99.5% 0+0k 16480+0io 2pf+0w
What does this message mean specifically?
I initially thought this error was caused by the job exhausting the RAM, i.e. a memory overflow issue. However, when I logged into the compute node during the run and checked the memory usage with "free -m" and "htop", I noticed that both RAM and swap usage never exceeded 10%, so memory usage is not the problem.
Because I used "tee" to record the job output to a log file, this file can contain up to tens of thousands of lines and its size is over 1MB. To test whether this standard output overwhelms the cluster system, I ran the same job again without the standard output log file. The new job still failed with the same "Stale file handle, errno=116" error after dozens of hours, so the standard output is not the reason either.
I also tried running the job in parallel with multiple cores; it still failed with the same error after dozens of hours of running.
I am confident that the code I'm using has no problem, because it finishes successfully on other clusters. The administrator of this cluster is looking into the issue but has not been able to find the specific cause so far.
Has anyone ever run into this weird error? What should we do to fix this problem on the cluster? Any help is appreciated!
On academic clusters, home directories are frequently mounted via NFS on each node in the cluster to give you a uniform experience across all of the nodes. If this were not the case, each node would have its own version of your home directory, and you would have to take explicit action to copy relevant files between worker nodes and/or the login node.
It sounds like the NFS mount of your home directory on the worker node probably failed while your job was running. This isn't a problem you can fix directly unless you have administrative privileges on the cluster. If you need to make a work-around and cannot wait for sysadmins to address the problem, you could:
Try using a different network drive on the worker node (if one is available). On clusters I've worked on, there is often scratch space or other NFS mounts directly under the root directory /. You might get lucky and find an NFS mount that is more reliable than your home directory.
Have your job work in a temporary directory local to the worker node, and write all of its output files and logs to that directory. At the end of your job, you would need to make it copy everything to your home directory on the login node on the cluster. This could be difficult with ssh if your keys are in your home directory, and could require you to copy keys to the temporary directory which is generally a bad idea unless you restrict access to your keys with file permissions.
Try getting assigned to a different node of the cluster. In my experience, academic clusters often have some nodes that are flakier than others. Depending on local settings, you may be able to request certain nodes directly, or potentially request resources that are only available on stable nodes. If you can track which nodes are unstable, and you find your job assigned to an unstable node, you could resubmit your job, then cancel the job that is on the unstable node.
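The second workaround (work in a node-local directory, copy back at the end) can be sketched in Python roughly as follows. This is a generic pattern under assumptions, not something specific to CMAQ or PBS: the job callable, the destination path and the scratch location are placeholders, and it needs Python 3.8 or later for dirs_exist_ok.

```python
import shutil
import tempfile
from pathlib import Path


def run_in_local_scratch(job, final_dir, scratch_root=None):
    """Run job(workdir) in a node-local temporary directory, then copy
    everything it wrote to final_dir (e.g. somewhere under your home).

    scratch_root is where the temporary directory is created; None means
    the system default, which is usually local disk.
    """
    final_dir = Path(final_dir)
    with tempfile.TemporaryDirectory(dir=scratch_root) as tmp:
        workdir = Path(tmp)
        try:
            job(workdir)
        finally:
            # Copy back even if the job failed, so logs are not lost.
            final_dir.mkdir(parents=True, exist_ok=True)
            shutil.copytree(workdir, final_dir, dirs_exist_ok=True)
    return final_dir
```

The network file system is then only touched once, at the very end, instead of for the whole multi-hour run. If the mount is stale at that moment the copy will still fail, but the window in which that can happen is much smaller.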
The easiest solution is to work with the cluster administrators, but I understand they don't always work on your schedule.

Resources