Heptio Velero unable to take backup of persistent volumes in Azure AKS - azure

I am using:
- AKS
- Kubernetes version 1.12.5
- Velero version v0.11.0
- the documents referred from the link
I installed Velero on the server:
Installed the prereqs, i.e. 00-prereqs.yaml. It installs the velero namespace,
the velero service account, RBAC rules, etc.
Created an Azure storage account and a container
in it (I used Terraform to create the storage account and
the AZ CLI to create the storage container). It is all based on their
available documentation.
Created the secret:
kubectl create secret generic cloud-credentials \
  --namespace velero \
  --from-literal AZURE_SUBSCRIPTION_ID="" \
  --from-literal AZURE_TENANT_ID="" \
  --from-literal AZURE_CLIENT_ID="" \
  --from-literal AZURE_CLIENT_SECRET="" \
  --from-literal AZURE_RESOURCE_GROUP="name-of-resource-group-where-my-vm etc created typically starts with MC_ in azure"
I applied the remaining Kubernetes resources present at the linked location and executed the backup commands.
I observed that this created the files for my backup in my storage account as well, and a similar structure was created for the other backups too.
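The backup commands themselves are not shown above; for context, a minimal sketch of the kind of invocation used (the backup name matches one seen in the logs below, and the namespace selection is an assumption):
# assumed invocation; d042203191536 is one of the backup names visible in the logs
velero backup create d042203191536 --include-namespaces default
# inspect the result and its logs
velero backup describe d042203191536
velero backup logs d042203191536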
While checking the pod logs, I observed the following information:
time="2019-03-22T14:38:02Z" level=info msg="Executing takePVSnapshot"
backup=velero/d042203191536 group=v1 groupResource=pods
logSource="pkg/backup/item_backupper.go:378"
name=pvc-6dd56a3d-4c90-11e9-bc92-1297bc38e414 namespace=default
time="2019-03-22T14:38:02Z" level=info msg="label
\"failure-domain.beta.kubernetes.io/zone\" is not present on
PersistentVolume"
and again:
level=error msg="Error getting block store for volume snapshot
time="2019-03-22T14:38:02Z" level=info msg="PersistentVolume is not a supported volume type for snapshots, skipping." backup=velero/d042203191536 group=v1 groupResource=pods logSource="pkg/backup/item_backupper.go:436"
and the following error as well:
level=error msg="backup failed" controller=backup
error="[clusterroles.rbac.authorization.k8s.io
\"system:auth-delegator\" not found,
clusterroles.rbac.authorization.k8s.io \"system:auth-delegator\" not
found]" key=velero/d042203191618
logSource="pkg/controller/backup_controller.go:202"
I observed all these logs after executing backups at multiple time intervals.
I am not sure if I am missing anything; any pointers to resolve these problems would be really helpful.

These are the currently supported volume providers:

| Provider | Owner | Contact |
|----------|-------|---------|
| [Azure Managed Disks][3] | Ark Team | [Slack][10], [GitHub Issue][11] |
| [Google Compute Engine Disks][4] | Ark Team | [Slack][10], [GitHub Issue][11] |
| [Restic][1] | Ark Team | [Slack][10], [GitHub Issue][11] |
| [Portworx][6] | Portworx | |
| [DigitalOcean][7] | StackPointCloud | |
Make sure your volume type is compatible with the Velero plugins.
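A quick way to confirm what actually backs the PV from the logs above (a sketch; the PV name is assumed to be the pvc-... name reported there):
# an Azure Managed Disk PV exposes spec.azureDisk; an empty result suggests an unsupported type (e.g. azureFile)
kubectl get pv pvc-6dd56a3d-4c90-11e9-bc92-1297bc38e414 -o jsonpath='{.spec.azureDisk.diskURI}'
# or inspect the full spec
kubectl describe pv pvc-6dd56a3d-4c90-11e9-bc92-1297bc38e414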

Related

Node container unable to locate Hashicorp Vault secrets file on startup on AWS EKS 1.24

We have a small collection of Kubernetes pods which run React/Next.js UIs in a Node 16 Alpine container (node:16.18.1-alpine3.15 to be precise). All of this runs in AWS EKS 1.23. We make use of annotations on these pods in order to inject secrets from Hashicorp Vault at startup. The annotations pull the desired secrets from Vault and write them to a file on the pod. An example of said annotations is below:
vault.hashicorp.com/agent-inject: "true"
vault.hashicorp.com/agent-init-first: "true"
vault.hashicorp.com/agent-pre-populate-only: "true"
vault.hashicorp.com/role: "onejourney-ui"
vault.hashicorp.com/agent-inject-secret-config: "secret/data/onejourney-ui"
vault.hashicorp.com/agent-inject-template-config: |
  {{- with secret "secret/data/onejourney-ui" -}}
  export AUTH0_CLIENT_ID="{{ .Data.data.auth0_client_id }}"
  export SENTRY_DSN="{{ .Data.data.sentry_admin_dsn }}"
  {{- end }}
When the pod starts up, we source this file (which is created by default at /vault/secrets/config) to set environment variables and then delete the file. We do that with the following pod arguments in our helm chart:
node:
  args:
    - /bin/sh
    - -c
    - source /vault/secrets/config; rm -rf /vault/secrets/config; yarn start-admin;
We recently upgraded some of our AWS EKS clusters from 1.23 to 1.24. After doing so, we noted that our Node applications were failing to start and entering a crash loop. Looking at the logs of these containers, the problem seemed to be that the pod was unable to locate the secrets file anymore.
Interestingly, the Vault init container completed successfully and shows that the file was successfully created...
Out of curiosity, I removed the node args that source the file, which allowed the container to start successfully, but I found when exec'ing into the pod that the file WAS in fact present and had the content I was expecting. The file also had the correct owner and permissions, as we see in a good working instance on EKS 1.23.
We have other containers (php-fpm) which consume secrets in the same manner; however, these were not affected on 1.24, only Node containers were affected. I saw no namespace, pod, or deployment annotations added which would have been a possible cause. After rolling the cluster back down to EKS 1.23, the deployment worked as expected.
I'm left scratching my head as to why the pod is unable to source that file on 1.24. Any suggestions on what to check or a possible cause would be greatly appreciated.
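For reference, the manual check described above (exec'ing into a running pod and inspecting the injected file) was along these lines; the pod and namespace names here are placeholders:
# hypothetical pod/namespace names; confirm the injected file exists, then inspect ownership and contents
kubectl exec -it <onejourney-ui-pod> -n <namespace> -- ls -la /vault/secrets/
kubectl exec -it <onejourney-ui-pod> -n <namespace> -- cat /vault/secrets/config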

Installing nginx ingress controller into AKS cluster - can't pull image from Azure Container Registry - 401 Unauthorized

I'm trying to install an nginx ingress controller into an Azure Kubernetes Service cluster using helm. I'm following this Microsoft guide. It's failing when I use helm to try to install the ingress controller, because it needs to pull a "kube-webhook-certgen" image from a local Azure Container Registry (which I created and linked to the cluster), but the kubernetes pod that's initially scheduled in the cluster fails to pull the image and shows the following error when I use kubectl describe pod [pod_name]:
failed to resolve reference "letsencryptdemoacr.azurecr.io/jettech/kube-webhook-certgen#sha256:f3b6b39a6062328c095337b4cadcefd1612348fdd5190b1dcbcb9b9e90bd8068": failed to authorize: failed to fetch anonymous token: unexpected status: 401 Unauthorized]
The relevant section of the guide describes using helm to create an ingress controller.
The guide describes creating an Azure Container Registry and linking it to a Kubernetes cluster, which I've done successfully using:
az aks update -n myAKSCluster -g myResourceGroup --attach-acr <acr-name>
I then imported the required 3rd party repositories successfully into my 'local' Azure Container Registry as detailed in the guide. I checked that the cluster has access to the Azure Container Registry using:
az aks check-acr --name MyAKSCluster --resource-group myResourceGroup --acr letsencryptdemoacr.azurecr.io
I also used the Azure Portal to check permissions on the Azure Container Registry and the specific repository that has the issue. It shows that both the cluster and the repository have the AcrPull permission.
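For completeness, the same check can presumably be done from the CLI by listing the role assignments on the registry (registry name taken from this question):
# look up the ACR resource ID, then list which principals hold which roles on it
ACR_ID=$(az acr show --name letsencryptdemoacr --query id --output tsv)
az role assignment list --scope "$ACR_ID" --query "[].{principal:principalName, role:roleDefinitionName}" --output table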
When I run the helm script to create the ingress controller, it fails at the point where it's trying to create a kubernetes pod named nginx-ingress-ingress-nginx-admission-create in the ingress-basic namespace that I created. When I use kubectl describe pod [pod_name_here], it shows the following error, which prevents creation of the ingress controller from continuing:
Failed to pull image "letsencryptdemoacr.azurecr.io/jettech/kube-webhook-certgen:v1.5.1#sha256:f3b6b39a6062328c095337b4cadcefd1612348fdd5190b1dcbcb9b9e90bd8068": [rpc error: code = NotFound desc = failed to pull and unpack image "letsencryptdemoacr.azurecr.io/jettech/kube-webhook-certgen#sha256:f3b6b39a6062328c095337b4cadcefd1612348fdd5190b1dcbcb9b9e90bd8068": failed to resolve reference "letsencryptdemoacr.azurecr.io/jettech/kube-webhook-certgen#sha256:f3b6b39a6062328c095337b4cadcefd1612348fdd5190b1dcbcb9b9e90bd8068": letsencryptdemoacr.azurecr.io/jettech/kube-webhook-certgen#sha256:f3b6b39a6062328c095337b4cadcefd1612348fdd5190b1dcbcb9b9e90bd8068: not found, rpc error: code = Unknown desc = failed to pull and unpack image "letsencryptdemoacr.azurecr.io/jettech/kube-webhook-certgen#sha256:f3b6b39a6062328c095337b4cadcefd1612348fdd5190b1dcbcb9b9e90bd8068": failed to resolve reference "letsencryptdemoacr.azurecr.io/jettech/kube-webhook-certgen#sha256:f3b6b39a6062328c095337b4cadcefd1612348fdd5190b1dcbcb9b9e90bd8068": failed to authorize: failed to fetch anonymous token: unexpected status: 401 Unauthorized]
This is the helm script that I run in a linux terminal:
helm install nginx-ingress ingress-nginx/ingress-nginx \
  --namespace ingress-basic \
  --set controller.replicaCount=1 \
  --set controller.nodeSelector."kubernetes\.io/os"=linux \
  --set controller.image.registry=$ACR_URL \
  --set controller.image.image=$CONTROLLER_IMAGE \
  --set controller.image.tag=$CONTROLLER_TAG \
  --set controller.image.digest="" \
  --set controller.admissionWebhooks.patch.nodeSelector."kubernetes\.io/os"=linux \
  --set controller.admissionWebhooks.patch.image.registry=$ACR_URL \
  --set controller.admissionWebhooks.patch.image.image=$PATCH_IMAGE \
  --set controller.admissionWebhooks.patch.image.tag=$PATCH_TAG \
  --set defaultBackend.nodeSelector."kubernetes\.io/os"=linux \
  --set defaultBackend.image.registry=$ACR_URL \
  --set defaultBackend.image.image=$DEFAULTBACKEND_IMAGE \
  --set defaultBackend.image.tag=$DEFAULTBACKEND_TAG \
  --set controller.service.loadBalancerIP=$STATIC_IP \
  --set controller.service.annotations."service\.beta\.kubernetes\.io/azure-dns-label-name"=$DNS_LABEL
I'm using the following relevant environment variables:
$ACR_URL=letsencryptdemoacr.azurecr.io
$PATCH_IMAGE=jettech/kube-webhook-certgen
$PATCH_TAG=v1.5.1
How do I fix the authorization?
It seems like the issue is caused by the new ingress-nginx/ingress-nginx helm chart release. I have fixed it by using version 3.36.0 instead of the latest (4.0.1).
helm upgrade -i nginx-ingress ingress-nginx/ingress-nginx \
--version 3.36.0 \
...
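If you need to see which chart versions are available before pinning one, something like this should work:
# list published versions of the ingress-nginx chart
helm repo update
helm search repo ingress-nginx/ingress-nginx --versions | head -20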
Azure support identified and provided a solution to this, and essentially confirmed that the documentation in the Microsoft tutorial is now outdated against the current Helm release for the ingress controller.
The full error message I was getting was similar to the following, which indicates that the first error encountered is actually that the image is NotFound; the message about Unauthorized is misleading. The issue appears to be that the install references 'digests' for a couple of the images it requires (a digest is basically a unique identifier for an image). The install was using the digests of the Docker images from their original location, not the digests of the copies I imported into the Azure Container Registry. This obviously doesn't work, as the digests of the images the install is trying to pull don't match the digests of the images imported into my Azure Container Registry.
Failed to pull image 'letsencryptdemoacr.azurecr.io/jettech/kube-webhook-certgen:v1.5.1#sha256:f3b6b39a6062328c095337b4cadcefd1612348fdd5190b1dcbcb9b9e90bd8068': [rpc error: code = NotFound desc = failed to pull and unpack image 'letsencryptdemoacr.azurecr.io/jettech/kube-webhook-certgen#sha256:f3b6b39a6062328c095337b4cadcefd1612348fdd5190b1dcbcb9b9e90bd8068': failed to resolve reference 'letsencryptdemoacr.azurecr.io/jettech/kube-webhook-certgen#sha256:f3b6b39a6062328c095337b4cadcefd1612348fdd5190b1dcbcb9b9e90bd8068': letsencryptdemoacr.azurecr.io/jettech/kube-webhook-certgen#sha256:f3b6b39a6062328c095337b4cadcefd1612348fdd5190b1dcbcb9b9e90bd8068: not found, rpc error: code = Unknown desc = failed to pull and unpack image 'letsencryptdemoacr.azurecr.io/jettech/kube-webhook-certgen#sha256:f3b6b39a6062328c095337b4cadcefd1612348fdd5190b1dcbcb9b9e90bd8068': failed to resolve reference 'letsencryptdemoacr.azurecr.io/jettech/kube-webhook-certgen#sha256:f3b6b39a6062328c095337b4cadcefd1612348fdd5190b1dcbcb9b9e90bd8068': failed to authorize: failed to fetch anonymous token: unexpected status: 401 Unauthorized]
The generated digests for the images that I'd imported into my local Azure Container Registry needed to be specified as additional arguments to the helm install:
--set controller.image.digest="sha256:e9fb216ace49dfa4a5983b183067e97496e7a8b307d2093f4278cd550c303899" \
--set controller.admissionWebhooks.patch.image.digest="sha256:950833e19ade18cd389d647efb88992a7cc077abedef343fa59e012d376d79b7" \
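The digests of the imported images can presumably be read back out of the registry before passing them to helm, for example (registry and image names as used in this question):
# read the digest of the imported admission webhook patch image from the local ACR
az acr repository show \
  --name letsencryptdemoacr \
  --image jettech/kube-webhook-certgen:v1.5.1 \
  --query digest --output tsv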
I then had a second issue where the ingress controller pod was going into CrashLoopBackOff. I fixed this by re-importing a different version of the ingress controller image than the one referenced in the tutorial, as follows:
Set the environment variable used to identify the tag to pull for the ingress controller image:
CONTROLLER_TAG=v1.0.0
Delete the ingress repository from the Azure Container Registry (I did this manually via the portal), then re-import it using the following (the values of the other variables should be as specified in the Microsoft tutorial):
az acr import --name $REGISTRY_NAME --source $CONTROLLER_REGISTRY/$CONTROLLER_IMAGE:$CONTROLLER_TAG --image $CONTROLLER_IMAGE:$CONTROLLER_TAG
Make sure you set all the digests to empty:
--set controller.image.digest=""
--set controller.admissionWebhooks.patch.image.digest=""
--set defaultBackend.image.digest=""
Basically, this will pull the image <your-registry>.azurecr.io/ingress-nginx/controller:<version> without the appended digest.
The other problem: if you use the latest chart version, the deployment will crash into CrashLoopBackOff status. Checking the live log of the pod, you will see a problem with flags, e.g. Unknown flag: --controller-class. To resolve this, you can specify the --version flag in the helm install command to use version 3.36.0. All deployment problems should then be resolved.
I faced the same issue on AWS, and using an older version of the helm chart helped.
I used version 3.36.0 and it worked fine.

How to create a session for Azure IotHub with Azure CLI?

In the past I used IoTHub Explorer for logging in and creating a session to then do further operations (like calling device methods). IoTHub Explorer has been deprecated by Microsoft. (I'm doing some application-level test automation)
How can I create sessions as I did with the explorer using the azure CLI az?
Here is what I did in the past:
iothub-explorer login "HostName=..."
iothub-explorer device-method <device> "<method>" ...
Here is what I do now:
az iot hub invoke-device-method -l "HostName=..." -n <hub-name> -d <device> --method-name <method>
As can be seen, I have to provide the -l option to every call to az iot. Ideally I could avoid this by creating a session.
I tried to use az login, which opens a website, not ideal for test automation. And even then, calling az iot hub invoke-device-method without -l leads to an exception: AttributeError: 'IotHubResourceOperations' object has no attribute 'config'
I tried to generate a sas-token but I'm not sure what to do with it.
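On the az login point: for test automation, a non-interactive service principal login avoids the browser flow entirely (a sketch; the credential values are placeholders):
# non-interactive login suitable for CI/test automation
az login --service-principal \
  --username "<app-id>" \
  --password "<client-secret>" \
  --tenant "<tenant-id>"
# with a working login, -n <hub-name> alone should be enough, without -l on every call
az iot hub invoke-device-method -n <hub-name> -d <device> --method-name <method>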
It turns out my azure-cli environment was not properly set up; refer to https://github.com/Azure/azure-cli/issues/15461. Do not mix Debian/system packages of azure-cli (do not use them at all, actually) with pip-installed ones. Do everything with pip, either as a user or as root.
I created a new virtualenv to clean it:
$ virtualenv ~/python-venv/azure-venv
$ . ~/python-venv/azure-venv/bin/activate
(azure-venv) $ pip install azure-cli
(azure-venv) $ az login
(azure-venv) $ az iot hub generate-sas-token --duration 3600 -n <hubname> -l <login-string>
(azure-venv) $ az iot hub invoke-device-method -n <hub-name> -d <device> --method-name <method>
And it works.

Check if Kubernetes deployment was successful in CI/CD pipeline

I have an AKS cluster with Kubernetes version 1.14.7.
I have built CI/CD pipelines to deploy newly created images to the cluster.
I am using kubectl apply to update a specific deployment with the new image. Sometimes, and for many reasons, the deployment fails, for example with ImagePullBackOff.
Is there a command to run after the kubectl apply command to check whether the pod creation and deployment was successful?
For this purpose, Kubernetes has kubectl rollout, and you should use the status option.
By default 'rollout status' will watch the status of the latest rollout until it's done. If you don't want to wait for the rollout to finish then you can use --watch=false. Note that if a new rollout starts in-between, then 'rollout status' will continue watching the latest revision. If you want to pin to a specific revision and abort if it is rolled over by another revision, use --revision=N where N is the revision you need to watch for.
You can read the full description here
If you use kubectl apply -f myapp.yaml and check the rollout status, you will see:
$ kubectl rollout status deployment myapp
Waiting for deployment "myapp" rollout to finish: 0 of 3 updated replicas are available…
Waiting for deployment "myapp" rollout to finish: 1 of 3 updated replicas are available…
Waiting for deployment "myapp" rollout to finish: 2 of 3 updated replicas are available…
deployment "myapp" successfully rolled out
There is another way to wait for the deployment to become available, with a configured timeout:
kubectl wait --for=condition=available --timeout=60s deploy/myapp
Otherwise kubectl rollout status can be used, but it may get stuck forever in some rare cases and will require manual cancellation of the pipeline if that happens.
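Putting that together, a minimal pipeline step could look like this (assuming a deployment named myapp and a 120-second budget):
# apply the manifest, then fail the pipeline if the rollout does not complete in time
kubectl apply -f myapp.yaml
if ! kubectl rollout status deployment/myapp --timeout=120s; then
  echo "Deployment of myapp failed or timed out"
  exit 1
fi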
You can parse the output through jq:
kubectl get pod -o=json | jq '.items[]|select(any( .status.containerStatuses[]; .state.waiting.reason=="ImagePullBackOff"))|.metadata.name'
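If you want that check to actually fail the pipeline, a small wrapper around the same query might look like this:
# collect the names of pods stuck in ImagePullBackOff and exit non-zero if any are found
FAILED=$(kubectl get pod -o json | jq -r '.items[] | select(any(.status.containerStatuses[]?; .state.waiting.reason=="ImagePullBackOff")) | .metadata.name')
if [ -n "$FAILED" ]; then
  echo "Pods failing to pull images: $FAILED"
  exit 1
fi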
It looks like kubediff tool is a perfect match for your task:
Kubediff is a tool for Kubernetes to show you the differences between your running configuration and your version controlled configuration.
The tool can be used from the command line and as a Pod in the cluster that continuously compares YAML files in the configured repository with the current state of the cluster.
$ ./kubediff
Usage: kubediff [options] <dir/file>...
Compare yaml files in <dir> to running state in kubernetes and print the
differences. This is useful to ensure you have applied all your changes to the
appropriate environment. This tool runs kubectl, so unless your
~/.kube/config is configured for the correct environment, you will need to
supply the kubeconfig for the appropriate environment.
kubediff prints the status to stdout and returns a non-zero exit code when a difference is found. You can change this behavior using command-line arguments.
You may also want to check this good article about validating YAML files:
Validating Kubernetes Deployment YAMLs

Azure VM extension update failure

I tried to add a custom script to a VM through extensions. I have observed that when the VM is created, a Microsoft.Azure.Extensions.CustomScript extension is created with the name "cse-agent" by default. So I tried to update the extension by encoding the file into the script property:
az vm extension set \
--resource-group test_RG \
--vm-name aks-agentpool \
--name CustomScript \
--subscription ${SUBSCRIPTION_ID} \
--publisher Microsoft.Azure.Extensions \
--settings '{"script": "'"$value"'"}'
$value represents the script file encoded in base64.
Doing that gives me an error:
Deployment failed. Correlation ID: xxxx-xxxx-xxx-xxxxx.
VM has reported a failure when processing extension 'cse-agent'.
Error message: "Enable failed: failed to get configuration: invalid configuration:
'commandToExecute' and 'script' were both specified, but only one is validate at a time"
From the documentation, when the script attribute is present there is no need for commandToExecute. As you can see above, I haven't specified commandToExecute; it's somehow taking it from the previous extension. Is there a way to update the extension without deleting it? It would also be interesting to know what impact deleting the cse-agent extension would have.
FYI: I have tried deleting the 'cse-agent' extension from the VM and adding my extension. It worked.
The cse-agent VM extension is crucial and manages all of the post-install steps needed to configure the nodes to be valid Kubernetes nodes. Removing this CSE will break the VMs and render your cluster inoperable.
If you are interested in applying changes to nodes in an existing cluster, while not officially supported, you could leverage the following project:
https://github.com/juan-lee/knode
This allows you to configure the nodes using a DaemonSet, which helps when your node pools have the auto-scaling feature enabled.
For simple node filesystem alterations, a privileged pod with a hostPath volume will also work:
https://dev.to/dannypsnl/privileged-pod-debug-kubernetes-node-5129
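A minimal sketch of that privileged hostPath pod approach (pod and node names are placeholders; this gives you a shell on the node's root filesystem):
# create a privileged pod pinned to the target node with the host root mounted at /host
kubectl apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: node-shell
spec:
  nodeName: <target-node-name>
  hostPID: true
  containers:
  - name: shell
    image: busybox
    command: ["sleep", "3600"]
    securityContext:
      privileged: true
    volumeMounts:
    - name: host-root
      mountPath: /host
  volumes:
  - name: host-root
    hostPath:
      path: /
EOF
# then chroot into the node's filesystem
kubectl exec -it node-shell -- chroot /host sh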
