docker pull from Artifactory results in "net/http: request canceled" inconsistently - linux

We are running Artifactory 5.11.0 (we just updated to 6.0.2 today and haven't yet seen this) in a Docker container, and when our automation executes a docker pull from Artifactory, 9 times out of 10 it is successful. Sometimes, even when running the docker pull from the machine hosting Artifactory, the pull fails with:
Pulling 'docker.{artifactory url}/staging:latest'...
Error response from daemon: Get http://docker.{artifactory url}/v2/staging/manifests/latest: Get http://docker.{artifactory url}:80/artifactory/api/docker/docker/v2/token?account=admin&scope=repository%3Astaging%3Apull&service=docker.{artifactory url}%3A80:
net/http: request canceled (Client.Timeout exceeded while awaiting
headers)
Like I said, most of the time this works perfectly, but for that roughly 1 in 10 pulls (probably fewer) we get the above error during our automated builds. I tried running the docker pull in a while loop overnight until it hit a failure, and there was no failure. I ran ping overnight and no packets were lost.
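The while loop was roughly of this shape (a sketch; the image name is the placeholder used above):
while docker pull docker.{artifactory url}/staging:latest; do sleep 60; done   # stops on the first failed pull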
OS: Debian 9 x64
Docker version 17.09.0-ce, build afdb6d4. It seems to happen more frequently with Docker version 18.03.1~ce-0~debian, but I have no direct evidence to suggest the client is at fault.

Here is what JFrog provided me to try to resolve this issue. (Note: we were on an older version of Artifactory at the time, and they did recommend updating to the latest release, as several fixes in it could help.)
The RAM value -Xmx2g was the default value provided by Artifactory. We can increase that value by going into the Docker container ("docker exec -it artifactory bash"), opening $ARTIFACTORY_HOME/bin/artifactory.default (usually /opt/jfrog/artifactory/bin/artifactory.default), and changing the heap value there accordingly.
We should also increase the Access max threads count, which we can do by going to $ARTIFACTORY_HOME/tomcat/conf/server.xml and changing the connector to:
<Connector port="8040" sendReasonPhrase="true" maxThreads="200"/>
Also add the line below to /var/opt/jfrog/artifactory/etc/artifactory.system.properties:
artifactory.access.client.max.connections=200
To deal with heavy loads we also need to append the line below to /var/opt/jfrog/artifactory/etc/db.properties:
pool.max.active=200
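Taken together, the changes look roughly like this (a sketch using the paths quoted above; the -Xmx4g value is only an example, and Artifactory needs a restart afterwards):
docker exec -it artifactory bash                 # open a shell inside the Artifactory container
# 1. raise the JVM heap in /opt/jfrog/artifactory/bin/artifactory.default, e.g. -Xmx2g -> -Xmx4g
# 2. set maxThreads="200" on the port 8040 connector in $ARTIFACTORY_HOME/tomcat/conf/server.xml
# 3. append the two properties:
echo "artifactory.access.client.max.connections=200" >> /var/opt/jfrog/artifactory/etc/artifactory.system.properties
echo "pool.max.active=200" >> /var/opt/jfrog/artifactory/etc/db.properties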
Also, they told me to make sure we were using the API key when authenticating the Docker client with Artifactory instead of the user/pass login, since the latter goes through our LDAP authentication and the former does not:
One thing to try would be to use an API Key instead of the plain text password, as using an API key will not reach out to the LDAP server.
We were already doing this, so this had no impact on the issue.
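For reference, authenticating the Docker client with the API key instead of the password looks like this (a sketch; the host and key are placeholders):
docker login -u admin -p <API_KEY> docker.{artifactory url}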
Also posted here: https://www.jfrog.com/jira/browse/RTFACT-17919?focusedCommentId=66063&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-66063
I hope this helps as it helped us.

Related

AKS fails to pull image - failed size validation

I have set up a custom Docker image registry on GitLab, and for some reason AKS fails to pull the image from there.
The error that is thrown is:
Failed to pull image "{registry}/{image}:latest": rpc error: code = FailedPrecondition desc =
failed to pull and unpack image "{registry}/{image}:latest": failed commit on ref "layer-sha256:e1acddbe380c63f0de4b77d3f287b7c81cd9d89563a230692378126b46ea6546": "layer-sha256:e1acddbe380c63f0de4b77d3f287b7c81cd9d89563a230692378126b46ea6546" failed size validation: 0 != 27145985: failed precondition
What is interesting is that the image does not have the layer with id
sha256:e1acddbe380c63f0de4b77d3f287b7c81cd9d89563a230692378126b46ea6546
Perhaps something is cached on AKS side? I deleted the pod along with the deployment before redeploying.
I couldn't find much about this kind of error and I have no idea what may be causing it. Pulling the same image from my local Docker environment works flawlessly.
Any tip would be much appreciated!
• You can try scaling up the registry to run on all nodes. The Kubernetes controller tries to be smart and routes node requests internally instead of sending traffic to the load-balancer IP. The issue, though, is that if there is no registry service on that node, the packets go nowhere. So scale up, or route through a non-AKS load balancer.
• Also, clean the image layer cache folder in ${containerd folder}/io.containerd.content.v1.content/ingest; containerd will not clean this cache automatically when some layer data is broken (see the sketch after this list).
• This might also be a TCP connection issue between the AKS cluster and the Docker image registry on GitLab, for example a proxy between them that closes the connection after 'X' bytes of data are transferred. The retried pull then starts over at 0% for the layer and ends with the same error, because after some time the connection is closed again and the layer is still not pulled completely. So I would recommend using a registry located near the cluster to get higher throughput.
• Also try restarting the communication path between the AKS cluster and the Docker image registry on GitLab; that fixes the issue for the time being, until it re-occurs.
See the link below for more information:
https://docs.gitlab.com/ee/user/packages/container_registry/
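For the cache-cleanup suggestion above, a minimal sketch (assuming containerd's default root of /var/lib/containerd; run it on the affected AKS node):
sudo ls /var/lib/containerd/io.containerd.content.v1.content/ingest        # list partially-downloaded layer data
sudo rm -rf /var/lib/containerd/io.containerd.content.v1.content/ingest/*  # remove broken ingests so the next pull starts clean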

Keep connection alive using --sysctl with Docker run

I currently have a container that runs node services. The code running there creates a subscription to the Salesforce Change Data Capture event bus using CometD. This has been working well, but after some time the service stops receiving events from Salesforce.
I am thinking this is happening because the Alpine Linux container could be marking the connection as broken after no data has been received for a while. I have verified that the CometD libraries create the connection with keep-alive set to true.
Right now I am trying to increase the keep-alive time by running the container with the command:
docker run --sysctl net.ipv4.tcp_keepalive_time=10800 --sysctl net.ipv4.tcp_keepalive_intvl=60 --sysctl net.ipv4.tcp_keepalive_probes=20 -p 80:3000 <imageid>
My thinking behind this is:
net.ipv4.tcp_keepalive_time=10800
This means that the keepalive routines wait for three hours (10800 secs)
before sending the first keepalive probe
net.ipv4.tcp_keepalive_intvl=60
Resend the probe every 60 seconds
net.ipv4.tcp_keepalive_probes=20
If no ACK response is received for 20 consecutive times, the connection is
marked as broken.
I guess what I am asking is whether this is the correct way to run the Docker container so that these sysctl settings actually take effect.
I am new to Docker, so I'm sure I did something that doesn't make sense. Thank you for any suggestions.
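One way to confirm whether the --sysctl values took effect is to read them back from inside the running container (a sketch; <container> is a placeholder for the container name or ID):
docker exec <container> cat /proc/sys/net/ipv4/tcp_keepalive_time /proc/sys/net/ipv4/tcp_keepalive_intvl /proc/sys/net/ipv4/tcp_keepalive_probes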

Unable to push to Azure container registry

While trying to push new containers to the Azure container registry, I get the following errors.
Successfully built b5f5a0e4c64b
Successfully tagged dekkiotest1.azurecr.io:5000/c4module:0.0.1-amd64
The push refers to repository [dekkiotest1.azurecr.io:5000/c4module]
Get https://dekkiotest1.azurecr.io:5000/v2/: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
I have verified that this only occurs for new images I'm trying to push. I can update existing images alright. I have verified that the registry has memory available.
@Charles Xu,
Thanks for pointing it out. Yes, removing the port number fixed my error.
For the benefit of others: when you create a project using Visual Studio Code, it defaults to localhost:5000 as the registry to store the container. If you are using a Microsoft container registry, remember to remove the port number.
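As an illustration (a sketch using the image name from the question), retagging and pushing without the port suffix looks like this:
docker tag dekkiotest1.azurecr.io:5000/c4module:0.0.1-amd64 dekkiotest1.azurecr.io/c4module:0.0.1-amd64
docker push dekkiotest1.azurecr.io/c4module:0.0.1-amd64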

App Service unavailable for unknown reason

We're running App Service for Linux in a Docker container.
When things work, they work really well. But occasionally our site becomes unavailable for no clear reason.
Our health status reports look like this:
Now, after some time, the app becomes completely unavailable. The health check reports Available, but in our Docker log we find records like this:
2017-11-18 08:01:50.060 ERROR - Container for --- site ---is unhealthy. Stopping site.
2017-11-18 08:32:49.295 INFO - Issuing docker login to sever: http://---
2017-11-18 08:32:49.837 INFO - docker login to http://--- succeeded
2017-11-18 08:32:49.858 INFO - Issuing docker pull ---
2017-11-18 08:39:49.096 INFO - docker pull returned STDOUT>> 40: Pulling from ---
The only thing that helps is restarting the app. Then it comes back to normal and all works as expected.
I emphasise that the site doesn't hang on every 'Unavailable' report from the health check; it hangs randomly. CPU/memory are at normal levels, nothing unusual there and no crazy spikes.
The application itself has a general exception filter, and no uncaught exceptions escape the app.
Any ideas why it might happen?
Depending on the size of your Docker image, the application goes offline while it's pulling and initializing the new image. I noticed that our deploy took nearly 20 minutes before coming back up.

Uninitialized Constant on custom Puppet Function

I've got a function that I'm trying to run on my puppetmaster on each client run. It runs just fine on the puppetmaster itself, but it causes the agent runs to fail on the nodes because of the following error:
Error: Could not retrieve catalog from remote server: Error 400 on SERVER: uninitialized constant Puppet::Parser::Functions::<my_module>
I'm not really sure why. I enabled debug logging on the master via config.ru, but I see the same error in the logs with no more useful messages.
What other steps can I take to debug this?
Update:
Adding some extra details.
Puppet Community edition, with a Foreman-connected puppetmaster running on Apache2 with Passenger / Rack
Both client and master are running Puppet 3.7.5
Both client and master are Ubuntu 14.04
Both client and master are using Ruby 1.9.3p484 (2013-11-22 revision 43786) [x86_64-linux]
Pluginsync is enabled
The custom function works fine on the puppetmaster when run as part of the puppetmaster's own manifest (it's a client of itself) or when using puppet apply directly on the server.
The function is present on the clients, and when I update it for debugging purposes I do see the file appear on the client side.
I cannot paste the function here unfortunately because it is proprietary code. It does rely on the aws-sdk, but I have verified that the Ruby gem is present on both the client and the master, with the same version in both places. I've even tried surrounding the entire function with:
begin
rescue LoadError
end
and got the same result.
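For context, checking that the gem is visible to the Ruby that Puppet uses looks roughly like this on both master and agent (a sketch; paths and Ruby versions may differ on your systems):
gem list aws-sdk                                      # show the installed versions of the gem
ruby -e "require 'aws-sdk'; puts 'aws-sdk loads OK'"  # confirm it actually loads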
This is embarrassingly stupid, but it turns out I had somehow never noticed that I hadn't actually included this line in my function:
require 'aws-sdk'
And so the error I was receiving:
uninitialized constant Puppet::Parser::Functions::Aws
Was actually referring to the AWS SDK being missing, and not a problem with the Puppet module itself (which was also confusingly named aws), which is how I had been interpreting it. Basically, I banged my head against the wall for several days over a painfully silly mistake. Apologies to all who tried to help me :)
