Elasticsearch "No Available connections" error in Logstash

I'm running a Logstash instance which is connected to an ES cluster behind a load balancer.
The load balancer has an idle timeout of 5 minutes.
Logstash is configured with the ES URL corresponding to the load balancer IP.
Normally everything works fine, but after a period of request inactivity, the next request processed by Logstash fails with the following:
[2018-10-30T08:15:00,757][WARN ][logstash.outputs.elasticsearch] Marking url as dead. Last error: [LogStash::Outputs::ElasticSearch::HttpClient::Pool::HostUnreachableError] Elasticsearch Unreachable: [http://10.100.24.254:9200/][Manticore::SocketTimeout] Read timed out {:url=>http://10.100.24.254:9200/, :error_message=>"Elasticsearch Unreachable: [http://10.100.24.254:9200/][Manticore::SocketTimeout] Read timed out", :error_class=>"LogStash::Outputs::ElasticSearch::HttpClient::Pool::HostUnreachableError"}
[2018-10-30T08:15:00,759][ERROR][logstash.outputs.elasticsearch] Attempted to send a bulk request to elasticsearch' but Elasticsearch appears to be unreachable or down! {:error_message=>"Elasticsearch Unreachable: [http://10.100.24.254:9200/][Manticore::SocketTimeout] Read timed out", :class=>"LogStash::Outputs::ElasticSearch::HttpClient::Pool::HostUnreachableError", :will_retry_in_seconds=>2}
[2018-10-30T08:15:02,760][WARN ][logstash.outputs.elasticsearch] UNEXPECTED POOL ERROR {:e=>#<LogStash::Outputs::ElasticSearch::HttpClient::Pool::NoConnectionAvailableError: No Available connections>}
[2018-10-30T08:15:02,760][ERROR][logstash.outputs.elasticsearch] Attempted to send a bulk request to elasticsearch, but no there are no living connections in the connection pool. Perhaps Elasticsearch is unreachable or down? {:error_message=>"No Available connections", :class=>"LogStash::Outputs::ElasticSearch::HttpClient::Pool::NoConnectionAvailableError", :will_retry_in_seconds=>4}
[2018-10-30T08:15:05,651][INFO ][logstash.outputs.elasticsearch] Running health check to see if an Elasticsearch connection is working {:healthcheck_url=>http://10.100.24.254:9200/, :path=>"/"}
Logstash eventually recovers, but it takes more than a minute, which is not acceptable for our SLA.
I suspect this is due to the load balancer closing connections after 5 minutes of inactivity.
I've tried setting:
timeout => 3
which makes things better: the request is retried after 3 seconds, but this is still not good enough.
What is the best set of configuration options to ensure connections are validated as healthy before requests are attempted, so that I experience no delay at all?

Try the validate_after_inactivity setting of the elasticsearch output plugin, as described in its documentation.
Alternatively, you can enable TCP keepalive on your Logstash server so that Logstash notices the connection has been severed when the load balancer hits its idle timeout, and starts a new connection instead of sending requests over the old stale one.
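For illustration, a minimal sketch of what that might look like in the elasticsearch output (values are examples only, not tuned recommendations; validate_after_inactivity is in milliseconds and should stay well below the load balancer's 5-minute idle timeout):
output {
  elasticsearch {
    hosts                     => ["http://10.100.24.254:9200"]   # the load balancer address from the logs above
    timeout                   => 3                               # fail a stuck request quickly
    validate_after_inactivity => 10000                           # re-check a pooled connection that has been idle for 10 s before reusing it
  }
}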

Related

Spark yarn-client mode behind a load balancer that drops inactive TCP connections

I am running Spark on YARN in client mode. The driver is separated from the ApplicationMaster by a load balancer that kills inactive TCP connections after 5 minutes. This kills even active YARN jobs after 5 minutes.
This is because the ApplicationMaster opens an RPC connection to the driver and sends the RegisterClusterManager message. After that, it only sends messages across this connection if the number of executors increases or decreases, which might not happen every five minutes. Five minutes after the last RPC call, the load balancer kills the RPC connection due to inactivity, the onDisconnected method is called on the RPC connection, and the YARN job is killed.
This is a corporate environment, and I have no way of changing the load balancer's behavior of dropping inactive TCP sessions. I can live with YARN jobs timing out after they are inactive for 5 minutes, but running jobs should not terminate.
I suspect the correct way would be to use OS-level TCP keepalive for the connections. However, the version of Spark I am using does not offer this feature yet.
Is there any way to solve this without rolling my own version of Spark that manually implements a keepalive or heartbeat mechanism in that RPC session?
I was able to work around this problem by routing traffic past the load balancer.

Setting HTTP request limits on Kubernetes pods

I'm running Apache Airflow on Kubernetes and running into a strange error when trying to pull log files.
*** Failed to fetch log file from worker. HTTPConnectionPool(host='geometrical-galaxy-7364-worker-0.geometrical-galaxy-7364-worker.astronomer-geometrical-galaxy-7364.svc.cluster.local', port=8793): Max retries exceeded with url: /log/FILE/begin/2018-12-06T00:00:00/1.log (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f7e86dab7b8>: Failed to establish a new connection: [Errno 111] Connection refused',))
It looks to me like there are too many requests being made on the stateful set (if I jump into the pod that holds the log files they are all there, but they don't get pulled into the UI that's trying to pull them).
Is there somewhere that a limit for HTTP requests for a stateful set or a pod gets set?
There is nowhere at the Kubernetes level to set a limit on the number of HTTP requests for pods. If you review the full breakdown of the StatefulSet spec, you will see that there is no field for limiting these requests.
The limiting factor for new HTTP requests is more likely the container image you are using; web servers such as Apache have their own connection limits, for example. The limit is probably built into the Airflow container you are using, but unfortunately I can't find documentation that discusses it or how to increase it.
I'm fairly certain the error you're seeing comes from Airflow trying to fetch task logs from a worker via the requests library (which uses urllib3 and retries failed HTTP requests).
Your webserver is attempting to fetch the logs, the worker is refusing the connection, and the fetch errors out. Make sure you're running airflow serve-logs on all workers and that the log port is open from your webserver to each of them.
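As a quick sanity check (the host and path below are taken from the error message; depending on the Airflow version the CLI command is spelled serve_logs or serve-logs), you can confirm the log server is running and reachable:
# on each worker: start the log server (8793 is the default worker_log_server_port)
airflow serve_logs &

# from the webserver pod: try fetching a log file from the worker directly
curl http://geometrical-galaxy-7364-worker-0.geometrical-galaxy-7364-worker.astronomer-geometrical-galaxy-7364.svc.cluster.local:8793/log/FILE/begin/2018-12-06T00:00:00/1.log
If the curl also gives "Connection refused", the problem is a missing log server or a blocked port, not an HTTP request limit.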

Azure HTTP connection gets interrupted after 5 minutes

We have a setup with several RESTful APIs on the same VM in Azure.
The websites run on Kestrel behind IIS.
They are protected by an Azure Application Gateway with the firewall enabled.
We now have requests that would run for at least 20 minutes.
The requests run to completion uninterrupted in Kestrel (visible in the logs), but the sender either gets "socket hang up" after exactly 5 minutes or waits forever, even after the request has finished in Kestrel. The request keeps running in Kestrel even when the connection to the sender has been interrupted.
What I have done:
Wrote a small example application that returns after a set number of seconds, to rule out our websites as the problem.
Ran the request inside the VM (to localhost): no problems, the response was received.
Ran the request within Azure from one VM to another: the request ran forever.
Ran the request from outside Azure: the request terminated after 5 minutes with "socket hang up".
Checked the configured timeouts: Kestrel: 50 min, IIS: 4000 s, Application Gateway HTTP settings: 3600 s.
Requests were tested with Postman.
Is there another request or connection timeout hidden somewhere in Azure?
We now have requests that would run for at least 20 minutes.
This is a horrible architecture and it should be rewritten to be async. Don't take this personally, it is what it is. Consider returning a 202 Accepted with a Location header to poll for the result.
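For illustration, a rough ASP.NET Core sketch of that pattern (the controller, route, and IJobStore abstraction are hypothetical, not taken from the question):
using Microsoft.AspNetCore.Mvc;

public interface IJobStore
{
    string Enqueue();              // starts the long-running work in the background, returns a job id
    bool IsDone(string jobId);
    object Result(string jobId);
}

[ApiController]
[Route("api/reports")]
public class ReportsController : ControllerBase
{
    private readonly IJobStore _jobs;
    public ReportsController(IJobStore jobs) => _jobs = jobs;

    [HttpPost]
    public IActionResult Start()
    {
        var jobId = _jobs.Enqueue();
        return Accepted($"/api/reports/{jobId}");   // 202 + Location header; the client polls this URL
    }

    [HttpGet("{jobId}")]
    public IActionResult Status(string jobId) =>
        _jobs.IsDone(jobId)
            ? Ok(_jobs.Result(jobId))               // finished: return the result
            : Accepted($"/api/reports/{jobId}");    // still running: poll again later
}
This keeps every individual HTTP request short, so no intermediate idle timeout can cut off the long-running work.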
You're most probably hitting the Azure SNAT layer idle timeout.
You can change it under the Configuration blade for the Public IP.
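If you prefer the CLI over the portal, something along these lines should work (resource names are placeholders; verify the exact parameter name against your az version):
# raise the idle timeout on the public IP (minutes, 4-30)
az network public-ip update \
    --resource-group MyResourceGroup \
    --name MyPublicIp \
    --idle-timeout 30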
So I ran into something like this a little while back:
For us the issue was probably the timeout, as the other answer suggests, but the solution (instead of increasing the timeout) was to add PgBouncer in front of our Postgres database to manage the connections and make sure a new one is started before the timeout fires.
I'm not sure what your backend connection looks like, but something similar (a backend DB proxy) could give you more ability to tune connection and reconnection behavior on your side.
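For reference, a minimal pgbouncer.ini sketch of that idea (host and database names are made up; the point is to retire server connections well before the idle timeout can kill them):
[databases]
mydb = host=10.0.0.5 port=5432 dbname=mydb

[pgbouncer]
listen_addr = 0.0.0.0
listen_port = 6432
pool_mode = transaction
; close server connections that have sat idle for 2 minutes
server_idle_timeout = 120
; recycle every server connection after at most 3 minutes
server_lifetime = 180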
We were running AKS (Azure Kubernetes Service), but all Azure public IPs obey the same rules that cause issues like this one.
While it isn't a full answer, note that there are also two types of public IP addresses: the one considered "Basic" doesn't have the same configurability, so the difference between Basic and Standard public IPs / load balancers could also be a factor.

Connection to Redis cache fails after restart - Azure

We are using following code to connect to our caches (in-memory and Redis):
settings
.WithSystemRuntimeCacheHandle()
.WithExpiration(CacheManager.Core.ExpirationMode.Absolute, defaultExpiryTime)
.And
.WithRedisConfiguration(CacheManagerRedisConfigurationKey, connectionString)
.WithMaxRetries(3)
.WithRetryTimeout(100)
.WithJsonSerializer()
.WithRedisBackplane(CacheManagerRedisConfigurationKey)
.WithRedisCacheHandle(CacheManagerRedisConfigurationKey, true)
.WithExpiration(CacheManager.Core.ExpirationMode.Absolute, defaultExpiryTime);
It works fine, but sometimes the machine is restarted (automatically by Azure, where we host it), and after the restart the connection to Redis fails with the following exception:
Connection to '{connection string}' failed.
at CacheManager.Core.BaseCacheManager`1..ctor(String name, ICacheManagerConfiguration configuration)
at CacheManager.Core.BaseCacheManager`1..ctor(ICacheManagerConfiguration configuration)
at CacheManager.Core.CacheFactory.Build[TCacheValue](String cacheName, Action`1 settings)
at CacheManager.Core.CacheFactory.Build(Action`1 settings)
According to the Azure Redis FAQ (https://learn.microsoft.com/en-us/azure/redis-cache/cache-faq), under "Why was my client disconnected from the cache?", this can happen after a redeploy.
The questions are:
Is there any mechanism to restore the connection after a redeploy?
Is anything wrong with the way we initialize the connection?
We are sure the connection string is OK.
Most clients (including StackExchange.Redis) connect and re-connect automatically after a connection break. However, your connect timeout setting needs to be large enough for the re-connect to happen successfully. Remember, you only connect once, so it's alright to give the system enough time to reconnect. A higher connect timeout is especially useful when a burst of connections or re-connections after a blip causes CPU to spike, and some connections might otherwise not complete in time.
In this case, I see RetryTimeout set to 100. If this is the connection timeout, check whether it is in milliseconds; 100 milliseconds is too low. You might want to make it more like 10 seconds (remember, it's a one-time thing, so you want to give it enough time to connect).
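A hedged sketch of what that could look like with the CacheManager setup from the question (the host name is a placeholder; connectTimeout and abortConnect are StackExchange.Redis connection-string options, with connectTimeout in milliseconds):
// placeholder connection string; the important parts are abortConnect=false and a generous connectTimeout
var connectionString =
    "contoso.redis.cache.windows.net:6380,ssl=True,abortConnect=False,connectTimeout=10000,password=<access-key>";

settings
    .WithRedisConfiguration(CacheManagerRedisConfigurationKey, connectionString)
    .WithMaxRetries(3)
    .WithRetryTimeout(10000)   // ~10 seconds instead of 100 ms, so a post-restart reconnect has time to succeed
    .WithJsonSerializer()
    .WithRedisBackplane(CacheManagerRedisConfigurationKey)
    .WithRedisCacheHandle(CacheManagerRedisConfigurationKey, true);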

Catching auth error on Redis / Heroku / Node.js

I'm running a redis / node.js server and had a
[Error: Auth error: Error: ERR max number of clients reached]
My current setup is that I have a connection manager that adds connections until the maximum number of concurrent connections for my Heroku app (256, or 128 per dyno) is reached. Once the limit is hit, it simply hands out an already existing connection. It's ultra fast and it's working.
However, last night I got this error and I'm not able to reproduce it. It may be a rare error, but I'm not sleeping well knowing it's out there, because once the error is thrown my app is no longer reachable.
So my questions would be:
is that kind of a connection manager a good idea?
would it be a better idea to use that manager to wait for 'idle' to be called and then close the connection, meaning I'd have to re-establish a connection every time a request comes in (which is what I wanted to avoid)?
how can I stop my app from going down? Should I just flush the connection pool whenever an error occurs?
What are your general strategies for handling multiple concurrent connections with a given maximum?
In case somebody is reading along:
The error was caused by a messed-up redis 0.8.x client that I had deployed to production:
https://github.com/mranney/node_redis/issues/251
I was smart enough to remove the failed connections from the connection pool, but forgot to call .quit() on them, so the connections were out there in the wild but still open.
