Check if Kubernetes deployment was successful in CI/CD pipeline - Azure

I have an AKS cluster with Kubernetes version 1.14.7.
I have built CI/CD pipelines to deploy newly created images to the cluster.
I am using kubectl apply to update a specific deployment with the new image. Sometimes, and for many reasons, the deployment fails, for example with ImagePullBackOff.
Is there a command to run after the kubectl apply command to check whether the pod creation and deployment were successful?

For this purpose Kubernetes has kubectl rollout, and you should use the status option.
By default 'rollout status' will watch the status of the latest rollout until it's done. If you don't want to wait for the rollout to finish then you can use --watch=false. Note that if a new rollout starts in-between, then 'rollout status' will continue watching the latest revision. If you want to pin to a specific revision and abort if it is rolled over by another revision, use --revision=N where N is the revision you need to watch for.
You can read the full description here
If you use kubectl apply -f myapp.yaml and check the rollout status you will see:
$ kubectl rollout status deployment myapp
Waiting for deployment "myapp" rollout to finish: 0 of 3 updated replicas are available…
Waiting for deployment "myapp" rollout to finish: 1 of 3 updated replicas are available…
Waiting for deployment "myapp" rollout to finish: 2 of 3 updated replicas are available…
deployment "myapp" successfully rolled out

There is another way to wait for a deployment to become available, with a configured timeout:
kubectl wait --for=condition=available --timeout=60s deploy/myapp
Otherwise kubectl rollout status can be used, but in some rare cases it may get stuck forever, which will require manual cancellation of the pipeline if it happens.

You can parse the output through jq:
kubectl get pod -o=json | jq '.items[]|select(any( .status.containerStatuses[]; .state.waiting.reason=="ImagePullBackOff"))|.metadata.name'
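As a sketch, that same filter can be turned into a pipeline check that fails the build when any pod is stuck in ImagePullBackOff (the exact failure handling is an assumption; adjust it to your pipeline):
# List pods whose containers are waiting with reason ImagePullBackOff
failing=$(kubectl get pod -o=json | jq -r '.items[]|select(any( .status.containerStatuses[]; .state.waiting.reason=="ImagePullBackOff"))|.metadata.name')
if [ -n "$failing" ]; then
  echo "Pods stuck in ImagePullBackOff: $failing"
  exit 1
fi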

It looks like the kubediff tool is a perfect match for your task:
Kubediff is a tool for Kubernetes to show you the differences between your running configuration and your version controlled configuration.
The tool can be used from the command line and as a Pod in the cluster that continuously compares YAML files in the configured repository with the current state of the cluster.
$ ./kubediff
Usage: kubediff [options] <dir/file>...
Compare yaml files in <dir> to running state in kubernetes and print the
differences. This is useful to ensure you have applied all your changes to the
appropriate environment. This tools runs kubectl, so unless your
~/.kube/config is configured for the correct environment, you will need to
supply the kubeconfig for the appropriate environment.
kubediff prints the status to stdout and returns a non-zero exit code when a difference is found. You can change this behavior using command line arguments.
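Because of that exit-code behaviour, a pipeline step can use kubediff as a gate. A minimal sketch, assuming ~/.kube/config already points at the right cluster and your manifests live in a manifests/ directory:
# Fail the build if the running cluster state has drifted from the manifests
if ! ./kubediff manifests/; then
  echo "Cluster state differs from the manifests in version control"
  exit 1
fi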
You may also want to check this good article about validating YAML files:
Validating Kubernetes Deployment YAMLs

Related

Node container unable to locate Hashicorp Vault secrets file on startup on AWS EKS 1.24

We have a small collection of Kubernetes pods which run react/next.js UIs in a node 16 alpine container (node:16.18.1-alpine3.15 to be precise). All of this runs in AWS EKS 1.23. We make use of annotations on these pods in order to inject secrets from Hashicorp Vault at startup. The annotations pull the desired secrets from Vault and write them to a file on the pod. Example of said annotations below:
vault.hashicorp.com/agent-inject: "true"
vault.hashicorp.com/agent-init-first: "true"
vault.hashicorp.com/agent-pre-populate-only: "true"
vault.hashicorp.com/role: "onejourney-ui"
vault.hashicorp.com/agent-inject-secret-config: "secret/data/onejourney-ui"
vault.hashicorp.com/agent-inject-template-config: |
  {{- with secret "secret/data/onejourney-ui" -}}
  export AUTH0_CLIENT_ID="{{ .Data.data.auth0_client_id }}"
  export SENTRY_DSN="{{ .Data.data.sentry_admin_dsn }}"
  {{- end }}
When the pod starts up, we source this file (which is created by default at /vault/secrets/config) to set environment variables and then delete the file. We do that with the following pod arguments in our Helm chart:
node:
  args:
    - /bin/sh
    - -c
    - source /vault/secrets/config; rm -rf /vault/secrets/config; yarn start-admin;
We recently upgraded some of our AWS EKS clusters from 1.23 to 1.24. After doing so, we noted that our node applications were failing to start and entering a crash loop. Looking in the logs of these containers, the problem seemed to be that the pod was unable to locate the secrets file anymore.
Interestingly, the Vault init container completed successfully and shows that the file was successfully created...
Out of curiosity, I removed the node args that source the file, which allowed the container to start successfully, and when exec'ing into the pod I found the file WAS in fact present and had the content I was expecting. The file also had the correct owner and permissions, as we see in a good working instance on EKS 1.23.
We have other containers (php-fpm) which consume secrets in the same manner; however, these were not affected on 1.24, only node containers were affected. I saw no namespace, pod or deployment annotations added that would have been a possible cause. After rolling the cluster back down to EKS 1.23, the deployment worked as expected.
I'm left scratching my head as to why the pod is unable to source that file on 1.24. Any suggestions on what to check or a possible cause would be greatly appreciated.

Argo CD stuck in deadlock when deleting its installation via Helm with Terraform

I am installing Argo CD in a cluster using a Helm release via Terraform. It installs the Argo CD CRDs and automatically adds a custom project and an initial application. This application monitors a path with other applications (app of apps).
If I want to remove Argo CD, all its apps, and the argocd namespace from the cluster using terraform destroy, it usually gets stuck in a deadlock when deleting the apps.
I understand that having the finalizers below will make the deletion of the app cascade and delete all its resources when using kubectl delete app ..., which is good:
metadata:
  finalizers:
    - resources-finalizer.argocd.argoproj.io
However, in my app definitions I also have the sync policy with prune and selfHeal set to true, so that if an app is accidentally deleted it's recreated immediately, like this:
syncPolicy:
  automated:
    prune: true
    selfHeal: true
Am I right to assume that these two definitions are conflicting and causing the deadlock when destroying Argo CD with Terraform? The destroy command tries to delete the app, which gets stuck in a loop because the app is recreated every time.
If that's the case, how can I achieve that Argo CD will recreate an app if it is deleted accidentally (via UI, kubectl or argocd CLI) but will be completely pruned with terraform destroy?

Auto rollback Kubernetes deployment if the deployment fails in Azure DevOps

We have a release pipeline in Azure DevOps to deploy the microservice to AKS and send the log of the microservice once it's deployed.
We are using the below command to deploy the deployment, with the kubectl template and the command set to "-f /home/admin/builds/$(build.buildnumber)/Myservice_Deployment.yml --record".
Here we noticed that the task is not waiting for the existing pod to terminate and the new pod to be created, but is just continuing and finishing the job.
Our expected scenario
1) Deploy the microservice using kubectl apply -f /home/admin/builds/$(build.buildnumber)/Myservice_Deployment.yml --record
2) Wait for the existing pod to terminate and ensure that the new pod is in Running status.
3) Once the new pod is in Running status, collect the log of the pod with the kubectl logs command and send it to the team.
4) If the pod is not in Running state, roll back to the previous stable state.
I tried different shell scripts to achieve this in Azure DevOps, but didn't succeed.
Ex:
ATTEMPTS=0
ROLLOUT_STATUS_CMD="kubectl --kubeconfig /home/admin/kubernetes/Dev-kubeconfig rollout status deployment/My-service"
until $ROLLOUT_STATUS_CMD || [ $ATTEMPTS -eq 60 ]; do
  $ROLLOUT_STATUS_CMD
  ATTEMPTS=$((ATTEMPTS + 1))
  sleep 10
done
I also need to get the log of the microservice using the kubectl logs command; the log file should include the date in its name and needs to be shared over mail.
You have several questions mixed up in a single question, but you'd need to configure your deployment with a liveness probe for your desired behaviour to happen.
Reading: https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/#define-a-liveness-command
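For the wait / collect logs / roll back part of the scenario, a rough sketch of a shell step is below (the deployment name and kubeconfig path are taken from the question; the timeout, the dated log file name, and the use of kubectl logs against the deployment, which only picks one of its pods, are assumptions):
# Wait for the rollout; --timeout keeps the step from hanging forever
if kubectl --kubeconfig /home/admin/kubernetes/Dev-kubeconfig rollout status deployment/My-service --timeout=600s; then
  # Collect the log of the new pod into a dated file, to be mailed by a later task
  kubectl --kubeconfig /home/admin/kubernetes/Dev-kubeconfig logs deployment/My-service > "My-service-$(date +%Y-%m-%d).log"
else
  # Roll back to the previous stable revision and fail the pipeline step
  kubectl --kubeconfig /home/admin/kubernetes/Dev-kubeconfig rollout undo deployment/My-service
  exit 1
fi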

How to change the schedule of a Kubernetes cronjob or how to start it manually?

Is there a simple way to change the schedule of a kubernetes cronjob like kubectl change cronjob my-cronjob "10 10 * * *"? Or any other way without needing to do kubectl apply -f deployment.yml? The latter can be extremely cumbersome in a complex CI/CD setting because manually editing the deployment yaml is often not desired, especially not if the file is created from a template in the build process.
Alternatively, is there a way to start a cronjob manually? For instance, a job is scheduled to start in 22 hours, but I want to trigger it manually once now without changing the cron schedule for good (for testing or an initial run)?
You can update only the selected field of a resource by patching it:
patch -h
Update field(s) of a resource using strategic merge patch, a JSON merge patch, or a JSON patch.
JSON and YAML formats are accepted.
Please refer to the models in
https://htmlpreview.github.io/?https://github.com/kubernetes/kubernetes/blob/HEAD/docs/api-reference/v1/definitions.html
to find if a field is mutable.
As provided in a comment, for reference:
kubectl patch cronjob my-cronjob -p '{"spec":{"schedule": "42 11 * * *"}}'
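To confirm the patch took effect, you can read the schedule back, for example:
kubectl get cronjob my-cronjob -o jsonpath='{.spec.schedule}'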
Also, in current kubectl versions, to launch a one-time execution of a declared cronjob, you can manually create a job that adheres to the cronjob spec with
kubectl create job <job-name> --from=cronjob/mycron
The more recent versions of k8s (from 1.10 on) support the following command:
$ kubectl create job my-one-time-job --from=cronjobs/my-cronjob
Source is this solved k8s github issue.
From @SmCaterpillar's answer above, kubectl patch cronjob my-cronjob -p '{"spec":{"schedule": "42 11 * * *"}}',
I was getting the error: unable to parse "'{spec:{schedule:": yaml: found unexpected end of stream
If someone else is facing a similar issue, replace the last part of the command with:
"{\"spec\":{\"schedule\": \"42 11 * * *\"}}"
I have a friend who developed a kubectl plugin that answers exactly that!
It takes an existing cronjob and just creates a job out of it.
See https://github.com/vic3lord/cronjobjob
Look into the README for installation instructions.
And if you want to patch a k8s cronjob schedule with the Python kubernetes library, you can do it like this:
from kubernetes import client, config

config.load_kube_config()
v1 = client.BatchV1beta1Api()
body = {"spec": {"schedule": "@daily"}}
ret = v1.patch_namespaced_cron_job(
    namespace="default", name="my-cronjob", body=body
)
print(ret)

gcloud app deploy does not remove previous versions

I am running a Node.js app on Google App Engine, using the following command to deploy my code:
gcloud app deploy --stop-previous-version
My desired behavior is for all instances running previous versions to be terminated, but they always seem to stick around. Is there something I'm missing?
I realize they are not receiving traffic, but I am still paying for them and they cause some background telemetry noise. Is there a better way of running this command?
Example output of the gcloud app instances list:
As you can see I have two different versions running.
We accidentally blew through our free Google App Engine credit in less than 30 days because of an errant flexible instance that wasn't cleared by subsequent deployments. When we pinpointed it as the cause it had scaled up to four simultaneous instances that were basically idling away.
tl;dr: Use the --version flag when deploying to specify a version name. An existing instance with the same version will be replaced the next time you deploy.
That led me down the rabbit hole that is --stop-previous-version. Here's what I've found out so far:
--stop-previous-version doesn't seem to be supported anymore. It's mentioned under Flags on the gcloud app deploy reference page, but if you look at the top of the page where all the flags are listed, it's nowhere to be found.
I tried deploying with that flag set to see what would happen but it seemingly had no effect. A new version was still created, and I still had to go in and manually delete the old instance.
There's an open Github issue on the gcloud-maven-plugin repo that specifically calls this out as an issue with that plugin but the issue has been seemingly ignored.
At this point our best bet is to add --version=staging or whatever to gcloud app deploy. The reference docs for that flag seem to indicate that it'll replace an existing instance that shares that "version":
--version=VERSION, -v VERSION
The version of the app that will be created or replaced by this deployment. If you do not specify a version, one will be generated for you.
(emphasis mine)
Additionally, Google's own reference documentation on app.yaml (the link's for the Python docs but it's still relevant) specifically calls out the --version flag as the "preferred" way to specify a version when deploying:
The recommended approach is to remove the version element from your app.yaml file and instead, use a command-line flag to specify your version ID
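So in practice the deploy command becomes something like this (the version name is just an example):
gcloud app deploy --version=staging --promote --quiet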
As far as I can tell, for Standard Environment with automatic scaling at least, it is normal for old versions to remain "serving", though they should hopefully have zero instances (even if your scaling configuration specifies a nonzero minimum). At least that's what I've seen. I think (I hope) that those old "serving" instances won't result in any charges, since billing is per instance.
I know most of the above answers are for Flexible Environment, but I thought I'd include this here for people who are wondering.
(And it would be great if someone from Google could confirm.)
I had the same problem as the OP. Using the flex environment (some of this also applies to the standard environment) with Docker (runtime: custom in app.yaml), I've finally solved this! I tried a lot of things and I'm not sure which one fixed it (or whether it was a combination), so I'll list the things I did here, with the most likely solutions listed first.
SOLUTION 1) Ensure that cloud storage deletes old versions
What does cloud storage have to do with anything? (I hear you ask)
Well there's a little tooltip (Google Cloud Platform Web UI (GCP) > App Engine > Versions > Size) that when you hover over it says:
(Google App Engine) Flexible environment code is stored and billed from Google Cloud Storage ... yada yada yada
So based on this info and this answer I visited GCP > Cloud Storage > Browser and found my storage bucket AND a load of other storage buckets I didn't know existed. It turns out that some of the buckets store cached cloud functions code, some store cached docker images and some store other cached code/stuff (you can tell which is which by browsing the buckets).
So I added a deletion policy to all the buckets (except the cloud functions bucket) as follows:
Go to GCP > Cloud Storage > Browser and click the link (for the relevant bucket) in the Lifecycle Rules column > Click ADD A RULE > THEN:
For SELECT ACTION choose "Delete Object" and click continue
For SELECT OBJECT choose "Number of newer versions" and enter 1 in the input
Click CREATE
This will return you to the table view and you should now see the rule in the lifecycle rules column.
REPEAT this process for all relevant buckets (the relevant buckets were described earlier).
THEN delete the contents of the relevant buckets. WARNING: Some buckets warn you NOT to delete the bucket itself, only the contents!
Now re-deploy and your latest version should now get deployed and hopefully you will never have this problem again!
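For reference, a similar lifecycle rule can also be applied from the command line with gsutil; a rough sketch (the bucket name is a placeholder, and numNewerVersions: 1 mirrors the "Number of newer versions" = 1 console setting described above):
cat > lifecycle.json <<'EOF'
{
  "rule": [
    {
      "action": {"type": "Delete"},
      "condition": {"numNewerVersions": 1}
    }
  ]
}
EOF
gsutil lifecycle set lifecycle.json gs://my-app-engine-bucket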
SOLUTION 2) Use deploy flags
I added these flags
gcloud app deploy --quiet --promote --stop-previous-version
This probably doesn't help since these flags seem to be the default but worth adding just in case.
Note that for the standard environment only (I heard on the grapevine) you can also use the --no-cache flag which might help but with flex, this flag caused the deployment to fail (when I tried).
SOLUTION 3)
This probably does not help at all, but I added:
COPY app.yaml .
to the Dockerfile
TIP 1)
This is probably more of a helpful / useful debug approach than a fix.
Visit GCP > App Engine > Versions
This shows all versions of your app (1 per deployment) and it also shows which version each instance is running (instances are configured in app.yaml).
Make sure all instances are running the latest version. This should happen by default. Probably worth deleting old versions.
You can determine your version from the gcloud app deploy logs (at the start of the logs) but it seems that the versions are listed by order of deployment anyway (most recent at top).
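Old versions can also be listed and cleaned up from the command line, which is handy in CI (the version IDs below are placeholders):
gcloud app versions list
gcloud app versions delete VERSION_ID_1 VERSION_ID_2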
TIP 2)
Visit GCP > App Engine > Instances
SSH into an instance. This is just a matter of clicking a few buttons (see screenshot below). Once you have SSH'd in run:
docker exec -it gaeapp /bin/bash
Which will get you into the docker container running your code. Now you can browse around to make sure it has your latest code.
Well I think my answer is long enough now. If this helps, don't thank me, J-ES-US is the one you should thank ;) I belong to Him ^^
Google may have updated their documentation cited in @IAmKale's answer:
Note that if the version is running on an instance of an auto-scaled service, using --stop-previous-version will not work and the previous version will continue to run because auto-scaled service instances are always running.
Seems like that flag only works with manually scaled services.
This is a supplementary and optional answer in addition to my other main answer.
In addition to my other answer, I am now auto-incrementing the version on every deploy using a script.
My script contents are below.
Basically, the script auto-increments the version every time you deploy. I am using Node.js, so the script uses npm version to bump the version, but this line could easily be tweaked to whatever language you use.
The script requires a clean git working directory for deployment.
The script assumes that when the version is bumped, this will result in file changes (e.g. changes to package.json version) that need pushing.
The script essentially tries to find your SSH key, and if it finds it, it starts an SSH agent and uses your SSH key to git commit and git push the file changes. Otherwise it just does a git commit without a push.
It then does a deploy using the --version flag ... --version="${deployVer}"
Thought this might help someone, especially since the top answer talks a lot about using the --version flag on a deploy.
#!/usr/bin/env bash
projectName="vehicle-damage-inspector-app-engine"
# Find SSH key
sshFile1=~/.ssh/id_ed25519
sshFile2=~/Desktop/.ssh/id_ed25519
sshFile3=~/.ssh/id_rsa
sshFile4=~/Desktop/.ssh/id_rsa
if [ -f "${sshFile1}" ]; then
  sshFile="${sshFile1}"
elif [ -f "${sshFile2}" ]; then
  sshFile="${sshFile2}"
elif [ -f "${sshFile3}" ]; then
  sshFile="${sshFile3}"
elif [ -f "${sshFile4}" ]; then
  sshFile="${sshFile4}"
fi
# If SSH key found then fire up SSH agent
if [ -n "${sshFile}" ]; then
  pub=$(cat "${sshFile}.pub")
  for i in ${pub}; do email="${i}"; done
  name="Auto Deploy ${projectName}"
  git config --global user.email "${email}"
  git config --global user.name "${name}"
  echo "Git SSH key = ${sshFile}"
  echo "Git email = ${email}"
  echo "Git name = ${name}"
  eval "$(ssh-agent -s)"
  ssh-add "${sshFile}" &>/dev/null
  sshKeyAdded=true
fi
# Bump version and git commit (and git push if SSH key added) and deploy
if [ -z "$(git status --porcelain)" ]; then
  echo "Working directory clean"
  echo "Bumping patch version"
  ver=$(npm version patch --no-git-tag-version)
  git add -A
  git commit -m "${projectName} version ${ver}"
  if [ -n "${sshKeyAdded}" ]; then
    echo ">>>>> Bumped patch version to ${ver} with git commit and git push"
    git push
  else
    echo ">>>>> Bumped patch version to ${ver} with git commit only, please git push manually"
  fi
  deployVer="${ver//"."/"-"}"
  gcloud app deploy --quiet --promote --stop-previous-version --version="${deployVer}"
else
  echo "Working directory unclean, please commit changes"
fi
For Node.js users: if you call the script deploy.sh, you should add
"deploy": "sh deploy.sh"
to your package.json scripts and deploy with npm run deploy.
