AKS fails to pull image - failed size validation - azure

I have set up a custom Docker image registry on GitLab, and for some reason AKS fails to pull the image from it.
The error being thrown is:
Failed to pull image "{registry}/{image}:latest": rpc error: code = FailedPrecondition desc =
failed to pull and unpack image "{registry}/{image}:latest": failed commit on ref "layer-sha256:e1acddbe380c63f0de4b77d3f287b7c81cd9d89563a230692378126b46ea6546": "layer-sha256:e1acddbe380c63f0de4b77d3f287b7c81cd9d89563a230692378126b46ea6546" failed size validation: 0 != 27145985: failed precondition
What is interesting is that the image does not have the layer with id
sha256:e1acddbe380c63f0de4b77d3f287b7c81cd9d89563a230692378126b46ea6546
Perhaps something is cached on the AKS side? I deleted the pod along with the deployment before redeploying.
I couldn't find much about this kind of error and I have no idea what may be causing it. Pulling the same image from a local Docker environment works flawlessly.
Any tip would be much appreciated!

• You can try scaling up the registry to run on all nodes. The Kubernetes controller tries to be smart and routes node requests internally instead of sending traffic to the load balancer IP. The issue, though, is that if there is no registry instance on that node, the packets go nowhere. So, scale up, or route through a non-AKS load balancer.
• Also, clean the image layer cache folder at ${containerd folder}/io.containerd.content.v1.content/ingest. containerd does not clean this cache automatically when some layer data is broken, so try purging the contents of that path (a cleanup sketch follows after the link below).
• This might also be a TCP connection issue between the AKS cluster and the Docker image registry on GitLab, for example a proxy in between that closes the connection after 'X' bytes of data are transferred. The retry of the pull then starts over at 0% for the layer and ends in the same error, because the connection is closed again before the layer is pulled completely. So it is recommended to use a registry located near the cluster to get higher throughput.
• Also try restarting the communication pipeline between the AKS cluster and the Docker image registry on GitLab; this fixes the issue for the time being, until it re-occurs.
Please see the link below for more information:
https://docs.gitlab.com/ee/user/packages/container_registry/
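
As a rough illustration of the cache cleanup in the second bullet, here is a minimal sketch to run on the affected node. The root directory /var/lib/containerd is an assumption standing in for ${containerd folder}; adjust it to your installation.
# Assumption: containerd's root is /var/lib/containerd (the "${containerd folder}" above).
# Purge only the ingest area, which holds partially downloaded layers.
sudo rm -rf /var/lib/containerd/io.containerd.content.v1.content/ingest/*
# Restart containerd so the next pull starts from a clean state.
sudo systemctl restart containerd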

Related

Kubernetes Persistent Volume does not show the real capacity

I have a persistent volume in my cluster (an Azure disk) with a capacity of 8Gi.
I resized it to 9Gi, then changed my PV YAML to 9Gi as well (since it is not updated automatically), and everything worked fine.
Then, as a test, I changed the YAML of my PV to 1000Gi (expecting to see an error) and received an error from the PVC that claims this PV: "NodeExpand failed to expand the volume : rpc error: code = Internal desc = resize requested for 10, but after resizing volume size was 9"
However, when I run kubectl get pv, it still looks like this PV's capacity is 1000Gi (and of course in Azure it is still 9Gi, since I did not resize it).
Any advice?
As a general rule: you should not have to change anything on your PersistentVolumes.
When you request more space by editing a PersistentVolumeClaim, a controller (either CSI or an in-tree driver/kube-controller) implements that change against your storage provider (Ceph, AWS, ...).
Once the backend volume has been expanded, that same controller updates the corresponding PV. At that point, you may (or may not) have to restart the Pods attached to your volume for its filesystem to be grown.
While I'm not certain how to fix the error you saw, one way to avoid such errors is to refrain from editing PVs.
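For illustration, here is a minimal sketch of that flow with a hypothetical claim name my-pvc (it assumes the StorageClass has allowVolumeExpansion: true). You request the new size on the PVC and let the controller update the PV:
# Hypothetical claim name "my-pvc"; the StorageClass must allow volume expansion.
kubectl patch pvc my-pvc -p '{"spec":{"resources":{"requests":{"storage":"9Gi"}}}}'
# Watch the claim; the controller resizes the Azure disk and then updates the bound PV's capacity.
kubectl get pvc my-pvc -w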

Unable to push to Azure container registry

While trying to push new containers to the Azure container registry, I get the following errors.
Successfully built b5f5a0e4c64b
Successfully tagged dekkiotest1.azurecr.io:5000/c4module:0.0.1-amd64
The push refers to repository [dekkiotest1.azurecr.io:5000/c4module]
Get https://dekkiotest1.azurecr.io:5000/v2/: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
I have verified that this only occurs for new images I'm trying to push. I can update existing images alright. I have verified that the registry has memory available.
@Charles Xu,
Thanks for pointing it out. Yes, removing the port number fixed my error.
For the benefit of others: when you create a project using Visual Studio Code, it defaults to localhost:5000 as the registry to store the container. If you are using an MSFT container registry, remember to remove the port number.
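As an illustration using the tags shown above, retag the image without the port and push that tag (a sketch; the image name is the one from this question):
# Retag without the :5000 port, which does not belong in an Azure Container Registry reference.
docker tag dekkiotest1.azurecr.io:5000/c4module:0.0.1-amd64 dekkiotest1.azurecr.io/c4module:0.0.1-amd64
docker push dekkiotest1.azurecr.io/c4module:0.0.1-amd64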

docker pull from Artifactory results in "net/http: request canceled" inconsistently

We are running Artifactory 5.11.0 (we just updated to 6.0.2 today and haven't yet seen this) in a Docker container, and when our automation executes a docker pull from Artifactory, 9 times out of 10 it is successful. Sometimes, even when running the docker pull from the machine hosting Artifactory, the docker pull fails with:
Pulling 'docker.{artifactory url}/staging:latest'...
Error response from daemon: Get http://docker.{artifactory url}/v2/staging/manifests/latest: Get http://docker.{artifactory url}:80/artifactory/api/docker/docker/v2/token?account=admin&scope=repository%3Astaging%3Apull&service=docker.{artifactory url}%3A80:
net/http: request canceled (Client.Timeout exceeded while awaiting
headers)
Like I said, most of the time this works perfectly, but roughly 1 time in 10 (probably less) we get the above error during our automated builds. I tried running the docker pull in a while loop overnight until it hit a failure, and there was no failure. Ran ping overnight and no packets were lost.
OS: Debian 9 x64
Docker version 17.09.0-ce, build afdb6d4; it seems to happen more frequently with Docker version 18.03.1~ce-0~debian, but I have no direct evidence to suggest the client is at fault.
Here is what JFrog provided me to try to resolve this issue. (Note: we were on an older version of Artifactory at the time and they did recommend that we update it to the latest as there were several updates that could help).
The RAM value -Xmx2g was the default value provided by Artifactory. We can increase it by going into the Docker container ("docker exec -it artifactory bash") and editing $Artifactory_Home/bin/artifactory.default (usually /opt/jfrog/artifactory/bin/artifactory.default), where the RAM value can be changed accordingly.
We should also increase the access max threads count; to do that, edit $Artifactory_Home/tomcat/config/server.xml and change the connector to:
<Connector port="8040" sendReasonPhrase="true" maxThreads="200"/>
Also add the line below to /var/opt/jfrog/artifactory/etc/artifactory.system.properties:
artifactory.access.client.max.connections=200
To deal with heavy loads, we also need to append the line below to /var/opt/jfrog/artifactory/etc/db.properties:
pool.max.active=200
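A minimal sketch of applying the two property changes from outside the container, assuming the container is named artifactory as in the docker exec command above (the heap change in artifactory.default is edited by hand); restart Artifactory afterwards so the settings take effect:
# Append the settings mentioned above (paths as given in this answer).
docker exec artifactory sh -c 'echo "artifactory.access.client.max.connections=200" >> /var/opt/jfrog/artifactory/etc/artifactory.system.properties'
docker exec artifactory sh -c 'echo "pool.max.active=200" >> /var/opt/jfrog/artifactory/etc/db.properties'
# Restart the container so Artifactory picks up the new configuration.
docker restart artifactory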
Also, they told me to be sure that we were using the API key when authenticating the Docker client with Artifactory instead of a user/password login, since the latter will go through our LDAP authentication and the former will not:
One thing to try would be to use an API Key instead of the plain text password, as using an API key will not reach out to the LDAP server.
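For reference, a sketch of what that looks like on the client side; the registry host is the same placeholder as above, and API_KEY stands in for a key generated in the Artifactory user profile:
# Log in with the API key as the password instead of the LDAP password (placeholder values).
docker login docker.{artifactory url} -u admin -p "${API_KEY}"
docker pull docker.{artifactory url}/staging:latest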
We were already doing this, so this had no impact on the issue.
Also posted here: https://www.jfrog.com/jira/browse/RTFACT-17919?focusedCommentId=66063&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-66063
I hope this helps as it helped us.

App Service unavailable for unknown reason

We're running App Service for Linux in docker container.
When things work, they work really well. But occasionally our site becomes unavailable for no clear reason.
Our health status reports look like this: [health status screenshot]
Now, after some time, the app becomes completely unavailable. The health check reports Available, but in our docker log we find records like this:
2017-11-18 08:01:50.060 ERROR - Container for --- site ---is unhealthy. Stopping site.
2017-11-18 08:32:49.295 INFO - Issuing docker login to sever: http://---
2017-11-18 08:32:49.837 INFO - docker login to http://--- succeeded
2017-11-18 08:32:49.858 INFO - Issuing docker pull ---
2017-11-18 08:39:49.096 INFO - docker pull returned STDOUT>> 40: Pulling from ---
The only thing that helps is restarting the app. Then it comes back to normal and all works as expected.
I emphasise that the site doesn't hang on every 'Unavailable' report from the health check. It hangs randomly. CPU/memory are at normal levels, nothing unusual there and no crazy spikes.
The application itself has a general exception filter, and no uncaught exceptions escape the app.
Any ideas why it might happen?
Depending on the size of your Docker image, the application goes offline while it's pulling and initializing the new image. I noticed that our deploy took nearly 20 minutes before coming back up.

RabbitMQ cluster fails when one node is not reachable

I created a RabbitMQ cluster via Docker and Docker Cloud. I am running two RabbitMQ containers on two separate nodes (both hosted on AWS).
The output of rabbitmqctl cluster_status is:
Cluster status of node 'rabbit@rabbitmq-cluster-2' ...
[{nodes,[{disc,['rabbit@rabbitmq-cluster-1','rabbit@rabbitmq-cluster-2']}]},
{running_nodes,['rabbit@rabbitmq-cluster-1','rabbit@rabbitmq-cluster-2']},
{cluster_name,<<"rabbit@rabbitmq-cluster-1">>},
{partitions,[]}]
However, when I stop one container/node, my messages cannot get delivered and get queued in .dlx.
I am using senecajs with NodeJS.
Did anybody have the same problem, and can you point me in a direction?
To answer my own question:
The problem was that Docker caches DNS results after starting and is not able to connect to a new host. So if one cluster node fails, Docker still tries to connect to that same one instead of trying another.
The solution was to write my own function for connecting to RabbitMQ. I first check with net.createConnection whether the host is online. If it is, I connect to it; if not, I try a different one.
Every time a RabbitMQ node is down, my service fails, restarts and calls the "try this host" function.
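The same idea can be sketched at the shell level as well; the original answer does this inside Node with net.createConnection, so treat this as an analogous illustration with placeholder host names and the default AMQP port 5672:
# Pick the first reachable RabbitMQ host before the service connects (placeholder hosts).
for host in rabbitmq-cluster-1 rabbitmq-cluster-2; do
  if nc -z -w 2 "$host" 5672; then
    export RABBITMQ_HOST="$host"
    break
  fi
done
echo "Connecting to ${RABBITMQ_HOST:?no RabbitMQ host reachable}"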
