What's the best way to programmatically add or remove individual backend servers to/from a Varnish director without downtime? I've been looking for a good example of this and cannot find one.
I would like to be able to scale my backend servers up and down with demand.
Thanks!
Sam
Although it's not the most elegant or even dynamic way of adding backends, I would approach this by defining the backends in a separate VCL file and including it in default.vcl with
include "backend.vcl";
In backend.vcl you define your backends and a director. For example:
probe healthcheck {
    .url = "/online";
    .interval = 15s;
    .timeout = 0.3s;
    .window = 3;
    .threshold = 1;
    .initial = 1;
}

backend web1 {
    .host = "10.1.2.1";
    .port = "80";
    .connect_timeout = 300s;
    .first_byte_timeout = 5000s;
    .between_bytes_timeout = 300s;
    .probe = healthcheck;
}

director backendpool round-robin {
    { .backend = web1; }
}
And use backendpool as the backend. Then, using shell scripting, ssh, a custom daemon, or whichever method best suits your needs, update backend.vcl by adding or removing backends and issue a reload for Varnish, as sketched below.
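A minimal reload sketch using varnishadm could look like the following; the file paths, VCL naming scheme, and admin-socket defaults are assumptions you would adapt to your installation. Discarding superseded VCLs once the new one is active also lets Varnish drop their backends.
#!/bin/sh
# Regenerate backend.vcl first, then hot-load the new configuration without
# restarting Varnish. Assumes varnishadm can reach the management interface
# with its default socket/secret settings.
NAME="backends_$(date +%s)"

varnishadm vcl.load "$NAME" /etc/varnish/default.vcl   # compile default.vcl (which includes backend.vcl)
varnishadm vcl.use "$NAME"                             # switch traffic to the new VCL, no restart needed
varnishadm vcl.list                                    # review loaded VCLs
# varnishadm vcl.discard <old_name>                    # optionally drop superseded VCLs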
The problem with this approach is that Varnish doesn't actually remove the backends that have been deleted from backend.vcl. Even though they are no longer used, Varnish keeps probing them. This may lead to unexpected behaviour in the long run; at the very least, the backend health probe results can get confusing if backend names are re-used. For example, after renaming the web1 backend from the example above a couple of times and then pointing its host at an invalid address, these are the Backend_health polling results after reverting to the original configuration with only a valid web1 backend defined:
0 Backend_health - web1 Still healthy 4--X-RH 3 1 3 0.001141 0.001055 HTTP/1.1 200 OK
0 Backend_health - web1 Still sick ------- 0 1 3 0.000000 0.000000
0 Backend_health - web2 Still healthy 4--X-RH 3 1 3 0.001061 0.001111 HTTP/1.1 200 OK
0 Backend_health - web3 Still healthy 4--X-RH 3 1 3 0.001007 0.001021 HTTP/1.1 200 OK
There is a patch for more granular backend handling for Varnish 2.1, but to my knowledge it is not available for Varnish 3.
Background:
I'm currently hosting an ASP.NET application in Azure with the following specs:
ASP.NET Core 2.2
Using Flurl for HTTP requests
Kestrel Webserver
Docker (Linux - mcr.microsoft.com/dotnet/core/aspnet:2.2 runtime)
Azure App Service on P2V2 tier app service plan
I have a couple of background jobs that run on the service and make a lot of outbound HTTP calls to a third-party service.
Issue:
Under a small load (approximately 1 call per 10 seconds), all requests complete in under a second with no issue. The issue I'm having is that under heavy load, when the service can make up to 3 to 4 calls in a 10-second span, some of the requests will randomly time out and throw an exception. When I was using RestSharp the exception read "The operation has timed out". Now that I'm using Flurl, the exception reads "The call timed out".
Here's the kicker: if I run the same job from my laptop running Windows 10 / Visual Studio 2017, this problem does NOT occur. This leads me to believe I'm hitting some limit or running out of some resource in my hosted environment. It's unclear whether that is connection/socket or thread related.
Things I've tried:
Ensure all code paths to the request are using async/await to prevent lockouts
Ensure Kestrel defaults allow unlimited connections (they do by default)
Ensure Docker's default connection limits are sufficient (2000 by default, more than enough)
Configure ServicePointManager settings for connection limits
Here is the code in my startup.cs that I'm currently using to try and prevent this issue:
public class Startup
{
    public Startup(IHostingEnvironment hostingEnvironment)
    {
        ...
        // ServicePointManager setup
        ServicePointManager.UseNagleAlgorithm = false;
        ServicePointManager.Expect100Continue = false;
        ServicePointManager.DefaultConnectionLimit = int.MaxValue;
        ServicePointManager.EnableDnsRoundRobin = true;
        ServicePointManager.ReusePort = true;

        // Set service point timeouts
        var sp = ServicePointManager.FindServicePoint(new Uri("https://placeholder.thirdparty.com"));
        sp.ConnectionLeaseTimeout = 15 * 1000; // 15 seconds
        FlurlHttp.ConfigureClient("https://placeholder.thirdparty.com", cli => cli.Settings.ConnectionLeaseTimeout = new TimeSpan(0, 0, 15));
    }
}
Has anyone else run into a similar issue to this? I'm open to any suggestions on how to best debug this situation, or possible methods to correct the issue. I'm at a complete loss after researching this for several days.
Thank you in advance.
I had similar issues. Take a look at Asp.net Core HttpClient has many TIME_WAIT or CLOSE_WAIT connections. Debugging via netstat helped identify the problem for me. As one possible solution, I suggest you use IHttpClientFactory. You can get more info from https://learn.microsoft.com/en-us/aspnet/core/fundamentals/http-requests?view=aspnetcore-2.2 and it should be fairly easy to use as described in Flurl client lifetime in ASP.Net Core 2.1 and IHttpClientFactory.
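For reference, a minimal IHttpClientFactory registration might look like the sketch below; the "thirdparty" client name and base URL are placeholders rather than anything from the question, and wiring the factory into Flurl is covered in the linked answer.
// Sketch only: register a named client (requires the Microsoft.Extensions.Http package).
using System;
using System.Net.Http;
using System.Threading.Tasks;
using Microsoft.Extensions.DependencyInjection;

public class Startup
{
    public void ConfigureServices(IServiceCollection services)
    {
        // The factory pools and recycles the underlying message handlers,
        // which avoids socket exhaustion under load.
        services.AddHttpClient("thirdparty", client =>
        {
            client.BaseAddress = new Uri("https://placeholder.thirdparty.com");
            client.Timeout = TimeSpan.FromSeconds(30);
        });
    }
}

// Consumers take IHttpClientFactory instead of new-ing up HttpClient per call.
public class ThirdPartyJob
{
    private readonly IHttpClientFactory _factory;

    public ThirdPartyJob(IHttpClientFactory factory) => _factory = factory;

    public Task<string> GetAsync(string path)
    {
        var client = _factory.CreateClient("thirdparty");
        return client.GetStringAsync(path);
    }
}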
I can't understand the behaviour of Varnish in the case of a 500 error from the backend.
- Why does it increment the MAIN.n_object counter? I thought it would cache only 2xx responses and redirects.
- If the first request finishes with a 500 response from the backend, all subsequent requests to the same URL are not cached, even if the backend starts returning 200 responses.
Help me understand this logic.
If you're really using the default VCL, then the default logic is as you describe, but you're missing that it does start caching after a while, typically 2 minutes.
Varnish sees a 500 status -> it talks to the backend and does not cache the page for 2 minutes.
Later Varnish sees a 200 status -> Varnish caches the page and from then on delivers it from the cache.
This is required to implement hit-for-pass. My understanding is the following: by default Varnish will coalesce requests to the backend rather than send them as they arrive, as an optimization. When Varnish sees that something is not cacheable (a 500 status, etc.), it skips this coalescing behaviour and talks to the backend directly (hit-for-pass).
If you want to decrease the amount of time that pages are marked as hit-for-pass, you need to add some VCL so that the built-in VCL with its 120-second value is not run. The following marks a page with a 500 status as uncacheable for 10 seconds:
sub vcl_backend_response {
    if (beresp.status == 500) {
        set beresp.ttl = 10s;
        set beresp.uncacheable = true;
        return (deliver);
    }
}
Sometimes my Dovecot log reports:
service(imap-login): process_limit (512) reached, client connections are being dropped
I can increase process_limit in the Dovecot config file, but I don't understand how it will affect the system.
How can I diagnose why the process count gets so high? I have around 50 users on my postfix+dovecot+roundcube system.
My configuration:
FreeBSD 10.0-stable
Postfix 2.10
Dovecot 2.2.12
Dovecot has two modes for login processes.
The first is called secure mode, where each client connection is served by its own process.
The second is called performance mode, where a single process serves many clients.
In fact performance mode is not that insecure; rather, secure mode is paranoid.
You have to set the desired mode in the config:
service imap-login {
  inet_listener imap {
    port = 143
  }
  inet_listener imaps {
    port = 993
    ssl = yes
  }
  # service_count = 0 # Performance mode
  service_count = 1 # Secure mode
  process_min_avail = 1
}
In my case performance mode serves 1k+ users.
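If you do switch to performance mode, a sketch like the one below raises the connection capacity; the numbers are purely illustrative (roughly, total connections = process_limit * client_limit), not tuned values.
service imap-login {
  inet_listener imap {
    port = 143
  }
  inet_listener imaps {
    port = 993
    ssl = yes
  }
  service_count = 0        # performance mode: long-lived login processes
  process_min_avail = 4    # e.g. one per CPU core (illustrative)
  process_limit = 4
  client_limit = 1000      # connections handled per process (illustrative)
}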
I have a VPN concentrator VM that runs Linux 2.6.18 (RHEL version 2.6.18-274.12.1.el5) with ipsec-tools 0.7.3.
I have a bunch of connections to various concentrators, but there is one that keeps dying on me. The remote is a Cisco ASA.
Phase 1 and phase 2 come up correctly, and everything seems to go fine, but suddenly the remote stops responding. I can see IPsec packets going out but no responses coming back. DPD seems to be working fine up until that point (I see packets being sent every 10 seconds). This doesn't happen all the time either; sometimes the tunnel stays up for a long time.
On the remote, the tunnel is no longer active at that point, but racoon still thinks it has phase 1 + phase 2 going. Is there some message that an ASA sends that racoon ignores?
What I also don't understand is why the DPD logic doesn't kill the connection.
Here's my racoon.conf:
remote x.x.x.x {
    exchange_mode main;
    lifetime time 8 hours;
    dpd_delay 10;
    proposal {
        authentication_method pre_shared_key;
        encryption_algorithm aes 256;
        hash_algorithm sha1;
        dh_group 2;
    }
    proposal_check obey;
}

sainfo subnet y.y.y.y/32[0] any subnet z.z.z.0/26 any {
    pfs_group 2;
    lifetime time 1 hour;
    encryption_algorithm aes 256;
    authentication_algorithm hmac_sha1;
    compression_algorithm deflate;
}
It's been a while since this was asked, but you might try a newer version of ipsec-tools; there have been a number of protocol interoperability fixes in recent releases. Also, double-check that your parameters match the ASA, particularly the various lifetime settings. I've also had good success with "rekey force" in racoon's "remote" sections. Here are the relevant config sections I use for interoperating with ASAs:
remote w.x.y.z
{
    exchange_mode main;
    lifetime time 28800 seconds;
    proposal_check obey;
    rekey force;
    proposal {
        encryption_algorithm aes 256;
        hash_algorithm sha1;
        authentication_method pre_shared_key;
        dh_group 2;
    }
}

sainfo subnet a.b.c.d/n any subnet e.f.g.h/n any
{
    lifetime time 1 hour;
    encryption_algorithm aes 256;
    authentication_algorithm hmac_sha1;
    compression_algorithm deflate;
}
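On the DPD question: racoon also has dpd_retry and dpd_maxfail settings in the "remote" block that control when a dead peer is actually declared dead. A sketch with illustrative values, added to the remote section above:
remote w.x.y.z
{
    # ... existing settings ...
    dpd_delay 10;      # send R-U-THERE probes every 10 seconds
    dpd_retry 5;       # seconds to wait for a reply before retrying
    dpd_maxfail 5;     # declare the peer dead after this many failures
}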
We currently have an application hosted on an Azure VM instance.
This application sometimes processes long-running and idle HTTP requests. This is causing an issue because Azure will close all connections that have been idle for longer than a few minutes.
I've seen some suggestions about setting a lower TCP keepalive rate. I've tried setting this rate to around 45 seconds, but my HTTP requests are still being closed.
Any suggestions? Our VM is running Server 2008 R2.
As a simple workaround, I had my script send a newline character every 5 seconds or so to keep the connection alive.
Example:
set_time_limit(60 * 30);
ini_set("zlib.output_compression", 0);
ini_set("implicit_flush", 1);

// Flush all output buffers so the keep-alive bytes actually reach the client.
function flushBuffers()
{
    ob_end_flush();
    ob_flush();
    flush();
    ob_start();
}

// Emit a harmless character so Azure doesn't treat the connection as idle.
function azureWorkaround($char = "\n")
{
    echo $char;
    flushBuffers();
}

$html = '';
$employees = getEmployees();
foreach ($employees as $employee) {
    $html .= getReportHtmlForEmployee($employee);
    azureWorkaround();
}
echo $html;
The Azure Load Balancer now supports configurable TCP Idle timeout for your Cloud Services and Virtual Machines. This feature can be configured using the Service Management API, PowerShell or the service model.
For more information check the announcement at http://azure.microsoft.com/blog/2014/08/14/new-configurable-idle-timeout-for-azure-load-balancer/
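For example, with the classic (Service Management) Azure PowerShell module, the idle timeout of an existing VM endpoint can reportedly be raised roughly like this; the service, VM, and endpoint names are placeholders, and the -IdleTimeoutInMinutes parameter requires a module version from mid-2014 or later.
# Placeholder names; verify the exact syntax against the linked announcement.
Get-AzureVM -ServiceName "myCloudService" -Name "myVM" |
    Set-AzureEndpoint -Name "web" -Protocol tcp -LocalPort 80 -PublicPort 80 -IdleTimeoutInMinutes 15 |
    Update-AzureVM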