I'm running Apache Airflow on Kubernetes and running into a strange error when trying to pull log files.
*** Failed to fetch log file from worker. HTTPConnectionPool(host='geometrical-galaxy-7364-worker-0.geometrical-galaxy-7364-worker.astronomer-geometrical-galaxy-7364.svc.cluster.local', port=8793): Max retries exceeded with url: /log/FILE/begin/2018-12-06T00:00:00/1.log (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f7e86dab7b8>: Failed to establish a new connection: [Errno 111] Connection refused',))
It looks to me like there are too many requests being made against the StatefulSet: if I exec into the pod that holds the log files, they are all there, but they never make it to the UI that is trying to pull them.
Is there somewhere that a limit on HTTP requests to a StatefulSet or a pod gets set?
There is no way to set a limit on the number of HTTP requests at the Kubernetes level for pods. You can review the full breakdown of the StatefulSet spec here and you will see that there is no field for such a limit.
Any limit on new HTTP requests would come from the container image you are using; as an example, the Apache web server's limits can be found here. The limitation is most likely built into the Airflow container you are running, but unfortunately I can't find documentation that discusses this limit or how to increase it.
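To illustrate the kind of per-server limit meant here, this is how a cap on concurrent requests looks in Apache httpd. It is purely an example of a web-server-level limit, not something confirmed to exist in the Airflow image:

    # Apache httpd (event MPM): MaxRequestWorkers caps the number of
    # simultaneous requests the server will handle; ServerLimit caps processes.
    <IfModule mpm_event_module>
        ServerLimit          16
        MaxRequestWorkers    400
    </IfModule>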
I'm fairly certain the error you're seeing comes from Airflow trying to fetch task logs from a worker via the requests library, which uses urllib3 and retries failed HTTP requests.
Your webserver is attempting to fetch the logs, being refused by the worker, and erroring out. Make sure you're running airflow serve-logs on all workers and that the log-serving port (8793, as in your error) is open from the webserver to each worker.
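If you want to verify this from inside the cluster, a quick check along these lines can help. The worker pod, namespace, and hostname below are taken from your error message, the webserver pod name is a placeholder, and curl is assumed to be available in the image:

    # On each worker, make sure the log server is running (it listens on 8793 by default):
    airflow serve-logs

    # From the Airflow webserver pod, check that the worker's log port is reachable:
    kubectl exec -n astronomer-geometrical-galaxy-7364 <webserver-pod> -- \
      curl -sv "http://geometrical-galaxy-7364-worker-0.geometrical-galaxy-7364-worker.astronomer-geometrical-galaxy-7364.svc.cluster.local:8793/"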
I'm running a Logstash instance which is connected to an ES cluster behind a load balancer.
The load balancer has an idle timeout of 5 minutes.
Logstash is configured with the ES URL pointing at the load balancer IP.
Normally everything works fine, but after a period of request inactivity the next request processed by Logstash fails with the following:
[2018-10-30T08:15:00,757][WARN ][logstash.outputs.elasticsearch] Marking url as dead. Last error: [LogStash::Outputs::ElasticSearch::HttpClient::Pool::HostUnreachableError] Elasticsearch Unreachable: [http://10.100.24.254:9200/][Manticore::SocketTimeout] Read timed out {:url=>http://10.100.24.254:9200/, :error_message=>"Elasticsearch Unreachable: [http://10.100.24.254:9200/][Manticore::SocketTimeout] Read timed out", :error_class=>"LogStash::Outputs::ElasticSearch::HttpClient::Pool::HostUnreachableError"}
[2018-10-30T08:15:00,759][ERROR][logstash.outputs.elasticsearch] Attempted to send a bulk request to elasticsearch' but Elasticsearch appears to be unreachable or down! {:error_message=>"Elasticsearch Unreachable: [http://10.100.24.254:9200/][Manticore::SocketTimeout] Read timed out", :class=>"LogStash::Outputs::ElasticSearch::HttpClient::Pool::HostUnreachableError", :will_retry_in_seconds=>2}
[2018-10-30T08:15:02,760][WARN ][logstash.outputs.elasticsearch] UNEXPECTED POOL ERROR {:e=>#<LogStash::Outputs::ElasticSearch::HttpClient::Pool::NoConnectionAvailableError: No Available connections>}
[2018-10-30T08:15:02,760][ERROR][logstash.outputs.elasticsearch] Attempted to send a bulk request to elasticsearch, but no there are no living connections in the connection pool. Perhaps Elasticsearch is unreachable or down? {:error_message=>"No Available connections", :class=>"LogStash::Outputs::ElasticSearch::HttpClient::Pool::NoConnectionAvailableError", :will_retry_in_seconds=>4}
[2018-10-30T08:15:05,651][INFO ][logstash.outputs.elasticsearch] Running health check to see if an Elasticsearch connection is working {:healthcheck_url=>http://10.100.24.254:9200/, :path=>"/"}
Logstash eventually recovers, but it takes more than a minute, and this is not acceptable for our SLA.
I suspect this is due to the load balancer closing connections after 5 minutes of inactivity.
I've tried setting:
timeout => 3
which makes things better: the request is retried after 3 seconds, but this is still not good enough.
What's the best set of configuration options to make sure the connections are validated as healthy before requests are attempted, so that I experience no delay at all?
Try the validate_after_inactivity setting, as described here.
Alternatively, you can enable TCP keep-alive on your Logstash server so that Logstash notices the connection has been severed when the load balancer hits its idle timeout and opens a new connection, instead of sending requests over the old, stale one.
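For example, in the elasticsearch output something along these lines should make the client re-check connections that have sat idle, well before the load balancer's 5-minute timeout kicks in (the values are illustrative, not tuned):

    output {
      elasticsearch {
        hosts => ["http://10.100.24.254:9200"]
        # Re-validate a pooled connection if it has been idle for more than 10 s.
        validate_after_inactivity => 10000   # milliseconds
        # Fail fast on a dead connection so the retry happens quickly.
        timeout => 3                         # seconds
      }
    }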
I have a very simple piece of code written in Node.js which runs on Kubernetes and AWS. The app just makes POST/GET requests to create and fetch data from other services: service1 --> service2 --> service3.
Service1 gets the POST request and calls service2; service2 creates a new row in the Postgres DB (using Sequelize) and then calls service3; service3 reads the data from the DB and returns the response to service2, which returns it to service1.
Most of the time it works, but roughly once in 4-5 attempts under concurrency the request drops and I get a timeout (ESOCKETTIMEDOUT). The problem is that service1 does receive the response (according to the logs and network traces), yet it seems the connection is dropped somewhere between the services and the call times out.
I've tried replacing request.js with node-fetch.
I've tried NewRelic/Elastic APM.
I've tried node --prof and analyzed the output with node --prof-process, with no conclusions.
Is it possible Kubernetes drops my connection?
Hard to tell without debugging, but since some connections are getting dropped as you add load and concurrency, it's likely that you need more replicas in your Kubernetes Deployments, and possibly to adjust the resources in your pod specs.
If this turns out to be the case you can also configure an HPA (Horizontal Pod Autoscaler) to handle your load.
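As a rough sketch (the deployment name service2 is just taken from your description), an HPA can be created with something like:

    # Scale the deployment between 2 and 10 replicas, targeting ~70% CPU utilisation.
    kubectl autoscale deployment service2 --min=2 --max=10 --cpu-percent=70

    # Check what the autoscaler is doing:
    kubectl get hpa service2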
We have a setup with several RESTful APIs on the same VM in Azure.
The websites run in Kestrel on IIS.
They are protected by an Azure Application Gateway with firewall.
We now have requests that would run for at least 20 minutes.
The request runs its full length uninterrupted in Kestrel (visible in the logs), but the sender either gets "socket hang up" after exactly 5 minutes or waits forever, even when the request has finished in Kestrel. The request continues in Kestrel even if the connection was interrupted for the sender.
What I have done:
Wrote a small example application that returns after a set number of seconds, to rule out our websites being the problem.
Ran the request inside the VM (to localhost): no problems, the response was received.
Ran the request within Azure from one VM to another: the request ran forever.
Ran the request from outside Azure: the request terminates after 5 minutes with "socket hang up".
Checked the configured timeouts: Kestrel: 50 min, IIS: 4000 s, Application Gateway HTTP settings: 3600 s.
Requests were tested with Postman.
Is there another request or connection timeout hidden somewhere in Azure?
"We now have requests that would run for at least 20 minutes."
This is a horrible architecture and it should be rewritten to be async. Don't take this personally, it is what it is. Consider returning a 202 Accepted with a Location header to poll for the result.
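A rough sketch of that exchange, with made-up paths, in one common variant:

    POST /reports HTTP/1.1            -> HTTP/1.1 202 Accepted
                                         Location: /reports/123/status

    GET /reports/123/status HTTP/1.1  -> HTTP/1.1 200 OK        (while still processing)
                                         {"state": "processing"}

    GET /reports/123/status HTTP/1.1  -> HTTP/1.1 303 See Other (when finished)
                                         Location: /reports/123/result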
You're most probably hitting the Azure SNAT idle timeout. You can change it under the Configuration blade for the Public IP.
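You can also change it from the CLI; something like this should work (resource names are placeholders, and the timeout is given in minutes, capped at 30):

    # Raise the TCP idle timeout on the public IP from the default 4 minutes to 30 minutes.
    az network public-ip update \
      --resource-group <your-resource-group> \
      --name <your-public-ip-name> \
      --idle-timeout 30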
So I ran into something like this a little while back:
For us the issue was probably the timeout, as the other answer suggests, but the solution (instead of increasing the timeout) was to add PgBouncer in front of our Postgres database to manage the connections and make sure a new one is started before the timeout fires.
I'm not sure what your backend connection looks like, but something similar (a backend DB proxy) could give you more room to tune connection and reconnection behaviour on your side.
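If you go that route, a minimal PgBouncer config could look roughly like this; host names and pool sizes are placeholders, and the key setting is the idle timeout on server connections, kept below the LB/SNAT timeout:

    [databases]
    ; Point the app at PgBouncer (port 6432) instead of Postgres directly.
    mydb = host=postgres.internal port=5432 dbname=mydb

    [pgbouncer]
    listen_addr = 0.0.0.0
    listen_port = 6432
    pool_mode = session
    default_pool_size = 20
    ; Close server connections idle longer than ~4 minutes, so a fresh one is
    ; opened instead of reusing a connection the load balancer silently dropped.
    server_idle_timeout = 240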
We were running AKS (Azure Kubernetes Service), but all Azure public IPs obey the same rules that cause issues like this one.
While it isn't a full answer, note that there are also two types of public IP address: the "Basic" SKU doesn't have the same configurability as "Standard", so the difference between Basic and Standard public IPs / load balancers could be related.
NodeJs: v0.12.4
Couchbase: 2.0.8
Service deployed with PM2
The bucket instance is created once per service rather than once per call, based on a recommendation from Couchbase support, since instantiating and connecting a bucket is expensive.
Under load everything seems to be in order, with a near-zero failure rate.
After a couple of days of the service being barely (if at all) in use, the client fails to connect to the bucket with the following error:
{"message":"Client-Side timeout exceeded for operation. Inspect network conditions or increase the timeout","code":23}
Recycling the node.js process using 'pm2 restart' resolves the issue.
Any ideas/suggestions short of re-creating the bucket instance and re-connecting to it?
Here is my application cloud environment.
I have an ELB with sticky sessions -> 2 HAProxy instances -> 1 machine which hosts my application on JBoss.
I am processing a request which takes more than 1 minute, and I log IP addresses at the start of request processing.
When I send this request through the browser, I see a duplicate request being logged after 1 minute and a few seconds. If the first request is routed through HAProxy1, the second one comes through HAProxy2. In the browser I get an HttpStatus=0 response after about 2.1 minutes.
My hypothesis is that the ELB is triggering this duplicate request.
Kindly help me verify this hypothesis.
When I use the Apache HttpClient for the same request, I do not see a duplicate request being triggered; instead I get an exception after 1 minute and a few seconds.
org.apache.http.NoHttpResponseException: The target server failed to respond
Kindly help me understand what is happening here.
By ELB I presume you are referring to Amazon AWS's Elastic Load Balancer.
Elastic Load Balancer has an idle timeout of 60 seconds by default (it has since become configurable, but the default is still 60 seconds). The browser has retry logic, which is why you're seeing two requests, but your server processes them as two separate, unrelated requests, so this actually makes matters worse. With HttpClient there is no retry, so the timeout surfaces as the NoHttpResponseException instead.
The solution is either to improve the performance of the request on the server, or to have the initial request kick off a background task and then poll for completion with a supplemental request (possibly via AJAX).