Kubernetes - Wait for available does not work as expected - azure

I'm using a Gitlab CI/CD pipeline to deploy a few containers to a Kubernetes environment. The script (excerpt) basically just deploys a few resources like this:
.deploy-base:
  # For deploying, we need an image that can interact with k8s
  image:
    name: registry.my-org.de/public-projects/kubectl-gettext:1-21-9
    entrypoint: ['']
  variables:
    # Define k8s namespace and domain used for deployment:
    NS: $KUBE_NAMESPACE
  before_script:
    - echo $NS
    - cp $KUBECONFIG ~/.kube/config
    - export CI_ENVIRONMENT_DOMAIN=$(echo "$CI_ENVIRONMENT_URL" | sed -e 's/[^/]*\/\/\([^#]*#\)\?\([^:/]*\).*/\2/')
  script:
    - kubectl config get-contexts
    - kubectl config use-context org-it-infrastructure/org-fastapi-backend:azure-aks-agent
    # Make Docker credentials available for deployment:
    - kubectl create secret -n $NS docker-registry gitlab-registry-secret --docker-server=$CI_REGISTRY --docker-username=$CI_DEPLOY_USER --docker-password=$CI_DEPLOY_PASSWORD --docker-email=$GITLAB_USER_EMAIL -o yaml --dry-run | kubectl replace --force -n $NS -f -
    - kubectl -n $NS patch serviceaccount default -p '{"imagePullSecrets":[{"name":"gitlab-registry-secret"}]}'
    # Create config map for container env variables
    - envsubst < dev/config-map.yml | kubectl -n $NS replace --force -f -
    # Start and expose deployment, set up ingress:
    - envsubst < dev/backend-deploy.yml | kubectl -n $NS replace --force -f -
    # Set up ingress with env var expansion from template:
    - envsubst < dev/ingress.yml | kubectl -n $NS replace --force -f -
    # Wait for pod
    - kubectl -n $NS wait --for=condition=available deployment/backend --timeout=180s
The last command should wait for the deployment to become available and return as soon as it does. Since the latest GitLab 15 update and the switch from certificate-based to agent-based authentication against Kubernetes, it no longer works and yields the following error message:
error: timed out waiting for the condition on deployments/backend
It also takes way longer than the specified 180s; it's more like 15-20 minutes.
The application is available and works as expected, and the deployment looks good:
$ kubectl -n org-fastapi-backend-development describe deployment backend
Name:                   backend
Namespace:              org-fastapi-backend-development
CreationTimestamp:      Thu, 02 Jun 2022 14:15:18 +0200
Labels:                 app=app
Annotations:            deployment.kubernetes.io/revision: 1
Selector:               app=app
Replicas:               1 desired | 1 updated | 1 total | 1 available | 0 unavailable
StrategyType:           RollingUpdate
MinReadySeconds:        0
RollingUpdateStrategy:  25% max unavailable, 25% max surge
Pod Template:
  Labels:  app=app
  Containers:
   app:
    Image:        registry.my-org.de/org-it-infrastructure/org-fastapi-backend:development
    Port:         80/TCP
    Host Port:    0/TCP
    Environment Variables from:
      backend-config  ConfigMap  Optional: false
    Environment:  <none>
    Mounts:       <none>
  Volumes:        <none>
Conditions:
  Type           Status  Reason
  ----           ------  ------
  Available      True    MinimumReplicasAvailable
  Progressing    True    NewReplicaSetAvailable
OldReplicaSets:  <none>
NewReplicaSet:   backend-6bb4f4bcd5 (1/1 replicas created)
Events:
  Type    Reason             Age   From                   Message
  ----    ------             ----  ----                   -------
  Normal  ScalingReplicaSet  83s   deployment-controller  Scaled up replica set backend-6bb4f4bcd5 to 1
As you can see, the Available condition has its status set to True, yet the wait command still does not return successfully.
Both kubectl and the Kubernetes environment (it's Azure AKS) are running version 1.21.9.
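For reference, a hedged way to double-check what the API server itself reports for that condition (using the same namespace and deployment name as above) is to query the status directly; rollout status tracks the same readiness information:
# Print the conditions that `kubectl wait` polls
kubectl -n org-fastapi-backend-development get deployment backend \
  -o jsonpath='{range .status.conditions[*]}{.type}={.status}{"\n"}{end}'
# Alternative to `wait`: blocks until the rollout completes or times out
kubectl -n org-fastapi-backend-development rollout status deployment/backend --timeout=180s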

Related

Azure private registry for docker image

Below is my YAML file to create a container group with two containers, named fluentd and mapp.
For the mapp container I want to pull the image from a private repository. I am not using Azure Container Registry and have no experience with it either.
I want to push the logs to Log Analytics.
apiVersion: 2019-12-01
location: eastus2
name: mycontainergroup003
properties:
  containers:
  - name: mycontainer003
    properties:
      environmentVariables: []
      image: fluent/fluentd
      ports: []
      resources:
        requests:
          cpu: 1.0
          memoryInGB: 1.5
  - name: mapp-log
    properties:
      image: reg-dev.rx.com/gl/xg/iss/mapp/com.corp.mapp:1.0.0-SNAPSHOT_latest
      resources:
        requests:
          cpu: 1
          memoryInGb: 1.5
      ports:
      - port: 80
      - port: 8080
      command:
      - /bin/sh
      - -c
      - >
        i=0;
        while true;
        do
        echo "$i: $(date)" >> /var/log/1.log;
        echo "$(date) INFO $i" >> /var/log/2.log;
        i=$((i+1));
        sleep 1;
        done
  imageRegistryCredentials:
  - server: reg-dev.rx.com
    username: <username>
    password: <password>
  osType: Linux
  restartPolicy: Always
  diagnostics:
    logAnalytics:
      workspaceId: <id>
      workspaceKey: <key>
tags: null
type: Microsoft.ContainerInstance/containerGroups
I am executing the command below to run the YAML:
>az container create -g rg-np-tp-ip01-deployt-docker-test --name mycontainergroup003 --file .\azure-deploy-aci-2.yaml
(InaccessibleImage) The image 'reg-dev.rx.com/gl/xg/iss/mapp/com.corp.mapp:1.0.0-SNAPSHOT_latest' in container group 'mycontainergroup003' is not accessible. Please check the image and registry credential.
Code: InaccessibleImage
Message: The image 'reg-dev.rx.com/gl/xg/iss/mapp/com.corp.mapp:1.0.0-SNAPSHOT_latest' in container
group 'mycontainergroup003' is not accessible. Please check the image and registry credential.
How can I make the image registry reg-dev.rx.com accessible from Azure? Until now I used the same image registry in every YAML and ran 'kubectl apply', but now I am trying to run the YAML via the Azure CLI.
Can someone please help?
The error you are getting usually occurs when the login server name, the credentials, or the image reference you are trying to pull is wrong.
I could not test against the private registry you are using, but the same thing can be achieved with Azure Container Registry. I tested it in my environment and it works fine; you can apply the same approach in yours.
You can push your existing image into ACR using the commands below.
Step 1: Log in to Azure
az login
Step 2: Create a container registry
az acr create -g "<resource group>" -n "TestMyAcr90" --sku Basic --admin-enabled true
Step 3: Tag the Docker image in the format loginserver/imagename
docker tag 0e901e68141f testmyacr90.azurecr.io/my_nginx
Step 4: Log in to ACR
docker login testmyacr90.azurecr.io
Step 5: Push the Docker image to the container registry
docker push testmyacr90.azurecr.io/my_nginx
YAML FILE
apiVersion: 2019-12-01
location: eastus2
name: mycontainergroup003
properties:
  containers:
  - name: mycontainer003
    properties:
      environmentVariables: []
      image: fluent/fluentd
      ports: []
      resources:
        requests:
          cpu: 1.0
          memoryInGB: 1.5
  - name: mapp-log
    properties:
      image: testmyacr90.azurecr.io/my_nginx:latest
      resources:
        requests:
          cpu: 1
          memoryInGb: 1.5
      ports:
      - port: 80
      - port: 8080
      command:
      - /bin/sh
      - -c
      - >
        i=0;
        while true;
        do
        echo "$i: $(date)" >> /var/log/1.log;
        echo "$(date) INFO $i" >> /var/log/2.log;
        i=$((i+1));
        sleep 1;
        done
  imageRegistryCredentials:
  - server: testmyacr90.azurecr.io
    username: TestMyAcr90
    password: SJ9I6XXXXXXXXXXXZXVSgaH
  osType: Linux
  restartPolicy: Always
  diagnostics:
    logAnalytics:
      workspaceId: dc742888-fd4d-474c-b23c-b9b69de70e02
      workspaceKey: ezG6IXXXXX_XXXXXXXVMsFOosAoR+1zrCDp9ltA==
tags: null
type: Microsoft.ContainerInstance/containerGroups
You can get the login server name, username, and password of the ACR from the registry's Access keys page in the Azure portal.
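Alternatively (a hedged sketch using the Azure CLI and the registry name from step 2), the same values can be read from the command line:
# Login server, e.g. testmyacr90.azurecr.io
az acr show --name TestMyAcr90 --query loginServer --output tsv
# Admin username and passwords (requires --admin-enabled true, as in step 2)
az acr credential show --name TestMyAcr90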
The file then runs successfully and creates the container group with the two containers declared in it.

Taking Thread dump/ Heap dump of Azure Kubernetes pods

We are running our Kafka Streams application, written in Java, on Azure Kubernetes Service. We are new to Kubernetes. To debug an issue we want to take a thread dump of the running pod.
Below are the steps we are following to take the dump.
We build our application with the Dockerfile below:
FROM mcr.microsoft.com/java/jdk:11-zulu-alpine
RUN apk update && apk add --no-cache gcompat
RUN addgroup -S user1 && adduser -S user1 -G user1
USER user1
WORKDIR .
COPY target/my-application-1.0.0.0.jar .
We submit the image with the deployment YAML file below:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-application-v1.0.0.0
spec:
  replicas: 1
  selector:
    matchLabels:
      name: my-application-pod
      app: my-application-app
  template:
    metadata:
      name: my-application-pod
      labels:
        name: my-application-pod
        app: my-application-app
    spec:
      nodeSelector:
        agentpool: agentpool1
      containers:
      - name: my-application-0
        image: myregistry.azurecr.io/my-application:v1.0.0.0
        imagePullPolicy: Always
        command: ["java","-jar","my-application-1.0.0.0.jar","input1","$(connection_string)"]
        env:
        - name: connection_string
          valueFrom:
            configMapKeyRef:
              name: my-application-configmap
              key: connectionString
        resources:
          limits:
            cpu: "4"
          requests:
            cpu: "0.5"
To get a shell to a Running container you can run the command below:
kubectl exec -it <POD_NAME> -- sh
To get a thread dump we run the command below:
jstack PID > threadDump.tdump
but we get a permission denied error.
Can someone suggest how to solve this, or the steps to take thread/heap dumps?
Thanks in advance.
Since you likely need the thread dump locally, you can bypass creating the file in the pod and just stream it directly to a file on your local computer:
kubectl exec -i POD_NAME -- jstack 1 > threadDump.tdump
If your thread dumps are large you may want to consider piping to pv first to get a nice progress bar.
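The same streaming approach can be extended (a hedged sketch, assuming pv is installed locally, jmap is available in the image, and the JVM is PID 1 inside the container, as in the jstack example):
# Progress bar for large thread dumps
kubectl exec -i POD_NAME -- jstack 1 | pv > threadDump.tdump
# Heap dump: write to a file inside the pod, then copy it out
kubectl exec POD_NAME -- jmap -dump:live,format=b,file=/tmp/heap.hprof 1
kubectl cp POD_NAME:/tmp/heap.hprof ./heap.hprof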

gitlab job failed - image pull failed

I am trying to do a Docker scan using Trivy, integrated into GitLab; the pipeline passes.
However, the job fails and I am not sure why.
The Docker image is valid.
Update: new error after enabling the shared runner.
gitlab.yml
Trivy_container_scanning:
  stage: test
  image: docker:stable-git
  variables:
    # Override the GIT_STRATEGY variable in your `.gitlab-ci.yml` file and set it to `fetch` if you want to provide a `clair-whitelist.yml`
    # file. See https://docs.gitlab.com/ee/user/application_security/container_scanning/index.html#overriding-the-container-scanning-template
    # for details
    GIT_STRATEGY: none
    IMAGE: "$CI_REGISTRY_IMAGE:$CI_COMMIT_SHA"
  allow_failure: true
  before_script:
    - export TRIVY_VERSION=${TRIVY_VERSION:-v0.20.0}
    - apk add --no-cache curl docker-cli
    - docker login -u "$CI_REGISTRY_USER" -p "$CI_REGISTRY_PASSWORD" $CI_REGISTRY
    - curl -sfL https://raw.githubusercontent.com/aquasecurity/trivy/main/contrib/install.sh | sh -s -- -b /usr/local/bin ${TRIVY_VERSION}
    - curl -sSL -o /tmp/trivy-gitlab.tpl https://github.com/aquasecurity/trivy/raw/${TRIVY_VERSION}/contrib/gitlab.tpl
  script:
    # Template files are passed to trivy with an @ prefix
    - trivy --exit-code 0 --cache-dir .trivycache/ --no-progress --format template --template "@/tmp/trivy-gitlab.tpl" -o gl-container-scanning-report.json $IMAGE
    #- ./trivy --exit-code 0 --severity HIGH --no-progress --auto-refresh trivy-ci-test
    #- ./trivy --exit-code 1 --severity CRITICAL --no-progress --auto-refresh trivy-ci-test
  cache:
    paths:
      - .trivycache/
  artifacts:
    reports:
      container_scanning: gl-container-scanning-report.json
  dependencies: []
  only:
    refs:
      - branches
Dockerfile
FROM composer:1.7.2
RUN git clone https://github.com/aquasecurity/trivy-ci-test.git && cd trivy-ci-test && rm Cargo.lock && rm Pipfile.lock
CMD apk add --no-cache mysql-client
ENTRYPOINT ["mysql"]
job error:
Running with gitlab-runner 13.2.4 (264446b2)
on gitlab-runner-gitlab-runner-76f48bbd84-8sc2l GCJviaG2
Preparing the "kubernetes" executor
30:00
Using Kubernetes namespace: gitlab-managed-apps
Using Kubernetes executor with image docker:stable-git ...
Preparing environment
30:18
Waiting for pod gitlab-managed-apps/runner-gcjviag2-project-1020-concurrent-0pgp84 to be running, status is Pending
Waiting for pod gitlab-managed-apps/runner-gcjviag2-project-1020-concurrent-0pgp84 to be running, status is Pending
Waiting for pod gitlab-managed-apps/runner-gcjviag2-project-1020-concurrent-0pgp84 to be running, status is Pending
Waiting for pod gitlab-managed-apps/runner-gcjviag2-project-1020-concurrent-0pgp84 to be running, status is Pending
Waiting for pod gitlab-managed-apps/runner-gcjviag2-project-1020-concurrent-0pgp84 to be running, status is Pending
Waiting for pod gitlab-managed-apps/runner-gcjviag2-project-1020-concurrent-0pgp84 to be running, status is Pending
ERROR: Job failed (system failure): prepare environment: image pull failed: Back-off pulling image "docker:stable-git". Check https://docs.gitlab.com/runner/shells/index.html#shell-profile-loading for more information
another error:
Running with gitlab-runner 13.2.4 (264446b2)
on gitlab-runner-gitlab-runner-76f48bbd84-8sc2l GCJviaG2
Preparing the "kubernetes" executor
30:00
Using Kubernetes namespace: gitlab-managed-apps
Using Kubernetes executor with image $CI_REGISTRY/devops/docker-alpine-sdk:19.03.15 ...
Preparing environment
30:03
Waiting for pod gitlab-managed-apps/runner-gcjviag2-project-1020-concurrent-0t7plc to be running, status is Pending
ERROR: Job failed (system failure): prepare environment: image pull failed: Failed to apply default image tag "/devops/docker-alpine-sdk:19.03.15": couldn't parse image reference "/devops/docker-alpine-sdk:19.03.15": invalid reference format. Check https://docs.gitlab.com/runner/shells/index.html#shell-profile-loading for more information
The root cause was actually that no variable had been set up in the GitLab CI/CD variables.
After defining the registry credentials, everything works.
This is followed by gitlab-org/gitlab-runner issue 27664
either a GitLab infrastructure issue
or (comment from Bruce Lau)
After some trial and error, our team figured out the issue was that the runner failed to use the service account secret to pull images.
To solve this, we use a custom config which specifies image_pull_secrets (pointing to a secret in .dockercfg format) so images can be pulled successfully.
Content of runner-custom-config-map:
kind: ConfigMap
apiVersion: v1
metadata:
  name: runner-custom-config-map
  namespace: runner-namespace
data:
  config.toml: |-
    [[runners]]
      [runners.kubernetes]
        image_pull_secrets = ["secret_to_docker_cfg_file_with_sa_token"]
Used in the runner operator spec:
spec:
  concurrent: 1
  config: runner-custom-config-map
  gitlabUrl: 'https://example.gitlab.com'
  imagePullPolicy: Always
  serviceaccount: kubernetes-service-account
  token: gitlab-runner-registration-secret
With secret_to_docker_cfg_file_with_sa_token:
kind: Secret
apiVersion: v1
metadata:
  name: secret_to_docker_cfg_file_with_sa_token
  namespace: plt-gitlab-runners
data:
  .dockercfg: >-
    __docker_cfg_file_with_pull_token__
type: kubernetes.io/dockercfg
June 2022: the issue is closed by MR 3399 for GitLab 15.0:
"Check serviceaccount and imagepullsecret availability before creating pod"
This prevents pod creation when the needed resources are not available.
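For completeness, a hedged sketch of creating such an image pull secret with kubectl (the registry URL and secret name here are placeholders; the resulting kubernetes.io/dockerconfigjson secret also works as an image pull secret, and its name is what goes into image_pull_secrets):
kubectl -n runner-namespace create secret docker-registry gitlab-runner-pull-secret \
  --docker-server=registry.example.com \
  --docker-username=<username> \
  --docker-password=<token>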

How to change the file-system watcher limit in Kubernetes (fs.inotify.max_user_watches)

I'm using pm2 to watch the directory holding the source code for my app-server's NodeJS program, running within a Kubernetes cluster.
However, I am getting this error:
ENOSPC: System limit for number of file watchers reached
I searched on that error, and found this answer: https://stackoverflow.com/a/55763478
# insert the new value into the system config
echo fs.inotify.max_user_watches=524288 | sudo tee -a /etc/sysctl.conf && sudo sysctl -p
However, I tried running that in a pod on the target k8s node, and it says the command sudo was not found. If I remove the sudo, I get this error:
sysctl: setting key "fs.inotify.max_user_watches": Read-only file system
How can I modify the file-system watcher limit from the 8192 found on my Kubernetes node, to a higher value such as 524288?
I found a solution: use a privileged DaemonSet that runs on each node in the cluster and has the ability to modify the fs.inotify.max_user_watches variable.
Add the following to a node-setup-daemon-set.yaml file and include it in your Kubernetes cluster:
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-setup
  namespace: kube-system
  labels:
    k8s-app: node-setup
spec:
  selector:
    matchLabels:
      name: node-setup
  template:
    metadata:
      labels:
        name: node-setup
    spec:
      containers:
        - name: node-setup
          image: ubuntu
          command: ["/bin/sh","-c"]
          args: ["/script/node-setup.sh; while true; do echo Sleeping && sleep 3600; done"]
          env:
            - name: PARTITION_NUMBER
              valueFrom:
                configMapKeyRef:
                  name: node-setup-config
                  key: partition_number
          volumeMounts:
            - name: node-setup-script
              mountPath: /script
            - name: dev
              mountPath: /dev
            - name: etc-lvm
              mountPath: /etc/lvm
          securityContext:
            allowPrivilegeEscalation: true
            privileged: true
      volumes:
        - name: node-setup-script
          configMap:
            name: node-setup-script
            defaultMode: 0755
        - name: dev
          hostPath:
            path: /dev
        - name: etc-lvm
          hostPath:
            path: /etc/lvm
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: node-setup-config
  namespace: kube-system
data:
  partition_number: "3"
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: node-setup-script
  namespace: kube-system
data:
  node-setup.sh: |
    #!/bin/bash
    set -e
    # change the file-watcher max-count on each node to 524288
    # insert the new value into the system config
    sysctl -w fs.inotify.max_user_watches=524288
    # check that the new value was applied
    cat /proc/sys/fs/inotify/max_user_watches
Note: The file above could probably be simplified quite a bit. (I was basing it on this guide, and left in a lot of stuff that's probably not necessary for simply running the sysctl command.) If others succeed in trimming it further, while confirming that it still works, feel free to make/suggest those edits to my answer.
You do not want to run your container as a privileged container if you can help it.
The solution here is to set the following kernel parameters on the node, then restart your container(s). The containers will pick up the values from the kernel they run on, because containers do not run separate kernels on Linux hosts (they share the host's kernel).
fs.inotify.max_user_watches=10485760
fs.aio-max-nr=10485760
fs.file-max=10485760
kernel.pid_max=10485760
kernel.threads-max=10485760
Paste the above into /etc/sysctl.conf on the node.
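A hedged sketch of applying these on a node (assuming root access to the node, e.g. over SSH):
# Append the settings and reload them without a reboot
cat <<'EOF' | sudo tee -a /etc/sysctl.conf
fs.inotify.max_user_watches=10485760
fs.aio-max-nr=10485760
fs.file-max=10485760
kernel.pid_max=10485760
kernel.threads-max=10485760
EOF
sudo sysctl -p
# Verify the new value
cat /proc/sys/fs/inotify/max_user_watches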

Docker VotingApp build/release Jenkins on Kubernetes not idempotent

I'm trying out deployments on Kubernetes via Jenkins with the Docker Voting App. I use Azure Container Registry as a repository for the Docker images. On the first try everything deploys OK.
When I re-run the pipeline without changing anything, I get the following error:
Redis service definition:
---
apiVersion: v1
kind: Service
metadata:
  creationTimestamp: null
  labels:
    app: redis
    version: alpine
  name: redis
  selfLink: /api/v1/namespaces//services/redis
spec:
  clusterIP:
  ports:
  - name:
    port: 6379
    protocol: TCP
    targetPort: 6379
  selector:
    app: redis
    version: alpine
  sessionAffinity: None
  type: ClusterIP
status:
  loadBalancer: {}
---
The Docker images are built with the "latest" tag.
stage 'Checkout'
node {
    git 'https://github.com/*****/example-voting-app.git' // Checks out example voting app repository

    stage 'Docker Builds'
    docker.withRegistry('https://*****.azurecr.io', 'private-login') {
        parallel(
            "Build Worker App": { def myEnv = docker.build('*****.azurecr.io/example-voting-app-worker:latest', 'worker').push('latest') },
            "Build Result App": { def myEnv = docker.build('*****.azurecr.io/example-voting-app-result:latest', 'result').push('latest') },
            "Build Vote App":   { def myEnv = docker.build('*****.azurecr.io/example-voting-app-vote:latest', 'vote').push('latest') }
        )
    }

    stage 'Kubernetes Deployment'
    sh 'kubectl apply -f kubernetes/basic-full-deployment.yml'
    sh 'kubectl delete pods -l app=vote'
    sh 'kubectl delete pods -l app=result'

    stage 'Smoke Test'
    sh 'kubectl get deployments'
}
Your definition contains fields that are auto-generated/managed by the apiserver. Some of them are created at the time of object creation and can't be updated afterwards. Remove the following fields from your file to make it happy:
metadata:
  creationTimestamp: null
  selfLink: /api/v1/namespaces//services/redis
status:
  loadBalancer: {}
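As a hedged alternative (assuming only port 6379 needs to be exposed; the output path is illustrative, and older kubectl versions use --dry-run instead of --dry-run=client), a clean manifest without the server-managed fields can be regenerated and committed to the repo. Note that the extra version: alpine label/selector from the original would need to be re-added by hand:
# Emit a minimal redis Service manifest with no server-managed fields
kubectl create service clusterip redis --tcp=6379:6379 \
  --dry-run=client -o yaml > kubernetes/redis-service.yml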
