So in my vcl_recv I have this header being set
set req.http.Grace = "NONE";
and when the backend is up, every response gets the Grace: NONE header, which is great... and then we have
sub vcl_hit {
    # Called when a cache lookup is successful.
    if (obj.ttl >= 0s) {
        # A pure unadulterated hit, deliver it
        return (deliver);
    }
    if (std.healthy(req.backend_hint)) {
        # Backend is healthy. Limit age to 10s.
        if (obj.ttl + 10s > 0s) {
            set req.http.Grace = "normal(limited)";
            return (deliver);
        } else {
            # No candidate for grace. Fetch a fresh object.
            return (fetch);
        }
    } else {
        # Backend is sick - use full grace.
        if (obj.ttl + obj.grace > 0s) {
            set req.http.Grace = "full";
            return (deliver);
        } else {
            # No graced object.
            return (fetch);
        }
    }
    # Fetch & deliver once we get the result.
    return (fetch); # Dead code, keep as a safeguard
}
So, I understand that full grace applies when the backend is down, and I get that when the backend isn't down we don't adjust the grace. But when exactly will that normal(limited) block kick in? It seems like when the backend is up it serves everything with Grace: NONE, and if I stop nginx it goes straight to Grace: full. I just don't know when
if (obj.ttl + 10s > 0s) {
set req.http.Grace = "normal(limited)";
should kick in since I can't seem to make it, at least according to that header being set...
My vcl_backend_response has these values (for testing, but yea)
# A TTL of 60s (normally 24h; shortened for testing)
set beresp.ttl = 60s;
# Define the default grace period to serve cached content
set beresp.grace = 6h;
The block in question will kick in for the first request that comes in for an expired object within 10 seconds after its expiration.
E.g., you request an object at 00:00:00, it gets fetched from the backend and gets stored with a TTL of 60 seconds. If you request the same object at 00:01:07, you should receive the (now-expired) cached object and see the "normal(limited)" header.
Assuming this VCL is running on Varnish 4.x, hitting an expired object in grace should trigger a background refresh, so any subsequent requests should receive a freshly cached object.
In a nutshell, this rule is saying:
Store all objects for 6 hours and 1 minute
Serve objects younger than 60 seconds from cache
Serve objects between 60 and 70 seconds old from cache, but refresh the cached object in the background
Only serve objects older than 70 seconds from cache if the backend health check is failing
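If you want to verify this from the client side, one option (a minimal sketch, assuming you're not already copying the header somewhere else) is to echo the request header you set in vcl_recv/vcl_hit onto the response in vcl_deliver, so it shows up in curl -I output:
sub vcl_deliver {
    # Expose the grace marker on the response, for debugging only
    set resp.http.Grace = req.http.Grace;
}
With that in place, requesting the same object again between 60 and 70 seconds after it was cached (with the backend healthy) should show Grace: normal(limited).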
UPDATE:
You pretty much got it. Objects are stored - kept in memory - for the sum of TTL and grace. That's how we arrive at the maximum storage duration of 6 hours and 1 minute - 6 hours grace and 1 minute TTL.
TTL is how long you consider an object to be "fresh" for, meaning that it can be served from cache without checking if it might have changed on the origin server. Grace, on the other hand, kicks in when an object is no longer "fresh", but you want to serve it anyway - usually, for one of two reasons:
Your backend is failing, and serving a "stale", expired object is better than serving an error.
For example, think of a CMS that shows articles and comments. Normally, you'd like to keep the TTL short so that new comments are displayed in a timely manner. However, if your CMS crashes, you'd rather serve the article with old comments rather than an "Oops, the server's dead" page.
The object expired recently, so it's not a huge deal to serve the expired object - and it's preferable to serve a slightly stale object instantly rather than waiting for the backend application to return a response.
In this case, think of an application that aggregates third-party feeds - you'd rather serve a slightly stale set of feed data, and then refresh the cached object in the background, than make the user wait until all calls to the third-party applications complete and the data is aggregated.
Related
We are facing issues with Varnish hitting its max threads limit, along with spikes in backend and session connections. We are not sure about the cause, but what we have observed is that it happens when the origin servers have high response times and eventually return uncacheable (502) responses.
Varnish usage :
We've configured Varnish behind an nginx proxy, so incoming requests first hit nginx and are then consistently balanced across n Varnish instances. On a cache miss, Varnish calls the origin nginx host, here example.com.
In our case, we only cache HTTP GET requests and all of them have JSON payload in response, size ranging from 0.001 MB to 2 MB.
Example request :
HTTP GET : http://test.com/test/abc?arg1=val1&arg2=val2
Expected xkey : test/abc
Response : Json payload
Approx QPS : 60-80 HTTP GET Requests
Avg obj ttl : 2d
Avg obj grace : 1d
Attaching the VCL file, statistics and Varnish run command for debugging purposes.
Monitoring stats (graphs attached): requests, cache status, sessions, threads, backend connections, objects expired.
Varnish and VCL Configuration :
Varnish version : Linux,5.4.0,x86_64,varnish-6.5.1
varnishd -F -j unix,user=nobody -a :6081 -T localhost:6082 -f /etc/varnish/default.vcl -s file,/opt/varnishdata/cache,750G
vcl 4.0;
import xkey;
import std;
acl purgers {
    "localhost";
}
backend default {
    .host = "example.com";
    .port = "80";
}
sub vcl_recv {
    unset req.http.Cookie;
    if (req.method == "PURGE") {
        if (client.ip !~ purgers) {
            return (synth(403, "Forbidden"));
        }
        if (req.http.xkey) {
            set req.http.n-gone = xkey.softpurge(req.http.xkey);
            return (synth(200, "Invalidated " + req.http.n-gone + " objects"));
        } else {
            return (purge);
        }
    }
    # Remove request id from request
    set req.url = regsuball(req.url, "reqid=[-_A-z0-9+()%.]+&?", "");
    # Remove trailing ? or &
    set req.url = regsub(req.url, "[?|&]+$", "");
    # Set hostname for backend request
    set req.http.host = "example.com";
}
sub vcl_backend_response {
    # Sets a default TTL in case the backend does not send a caching-related header
    set beresp.ttl = std.duration(beresp.http.X-Cache-ttl, 2d);
    # Grace period to keep serving stale entries
    set beresp.grace = std.duration(beresp.http.X-Cache-grace, 1d);
    # Extract xkey
    if (bereq.url ~ "/some-string/") {
        set beresp.http.xkey = regsub(bereq.url, ".*/some-string/([^?]+).*", "\1");
    }
    # This block makes sure that if the upstream returns a 5xx, but we have the response in the cache (even if it's expired),
    # we fall back to the cached value (until the grace period is over).
    if (beresp.status != 200 && beresp.status != 422) {
        # This check is important. If is_bgfetch is true, it means that we've found and returned the cached object to the client,
        # and triggered an asynchronous background update. In that case, if it was a 5xx, we have to abandon, otherwise the previously cached object
        # would be erased from the cache (even if we set uncacheable to true).
        if (bereq.is_bgfetch) {
            return (abandon);
        }
        # We should never cache a 5xx response.
        set beresp.uncacheable = true;
    }
}
sub vcl_deliver {
    unset resp.http.X-Varnish;
    unset resp.http.Via;
    set resp.http.X-Cached = req.http.X-Cached;
}
sub vcl_hit {
    if (obj.ttl >= 0s) {
        set req.http.X-Cached = "HIT";
        return (deliver);
    }
    if (obj.ttl + obj.grace > 0s) {
        set req.http.X-Cached = "STALE";
        return (deliver);
    }
    set req.http.X-Cached = "MISS";
}
sub vcl_miss {
    set req.http.X-Cached = "MISS";
}
Please let us know if there are any suggestions to improve the current configuration or anything else required to debug the same.
Thanks
Abhishek Surve
Measure thread shortage and increase thread count
If you run out of threads, from a firefighting point of view it makes sense to increase the threads per thread pool.
Here's a varnishstat command that displays realtime thread consumption and potential thread limits:
varnishstat -f MAIN.threads -f MAIN.threads_limited
Press the d key to display fields with a zero value.
If the MAIN.threads_limited increases, we know you have exceeded the maximum threads per pool that is set by the thread_pool_max runtime parameter.
It makes sense to display the current thread_pool_max value by executing the following command:
varnishadm param.show thread_pool_max
You can use varnishadm param.set to set a new thread_pool_max value at runtime, but it is not persisted and won't survive a restart.
The best way is to set it through a -p parameter in your systemd service file.
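For example (5000 is purely illustrative, not a sizing recommendation for your workload):
varnishadm param.set thread_pool_max 5000
To persist it, append a -p flag to the varnishd command line in your systemd unit (or a systemctl edit varnish override), mirroring your current startup command (the /usr/sbin/varnishd path is an assumption and may differ on your system):
ExecStart=/usr/sbin/varnishd -F -j unix,user=nobody -a :6081 -T localhost:6082 -f /etc/varnish/default.vcl -s file,/opt/varnishdata/cache,750G -p thread_pool_max=5000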
Watch out with file storage
I noticed you're using the file stevedore to store large volumes of data. We strongly advise against using it, because it is very sensitive to disk fragmentation. It can slow down Varnish when it has to perform too many disk seeks and relies too much on the kernel's page cache to be efficient.
On open source Varnish, -s malloc is still your best bet. You can increase your cache capacity through horizontal scaling and having 2 tiers of Varnish.
The most reliable way to use disk for large volumes of data is Varnish Enterprise's Massive Storage Engine. It's not free and open source, but it was built specifically to counter the poor performance of the file stevedore.
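For reference, your current startup command with a malloc stevedore instead of file storage would look like this (the 64G figure is purely illustrative; the cache has to fit in RAM, with some per-object overhead on top):
varnishd -F -j unix,user=nobody -a :6081 -T localhost:6082 -f /etc/varnish/default.vcl -s malloc,64G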
Looking for uncached content
Based on how you're describing the problem, it looks like Varnish has to spend too much time dealing with uncached responses. This requires a backend connection.
Luckily Varnish lets go of the backend thread and allows client threads to deal with other tasks while Varnish is waiting for the backend to respond.
But if we can limit the number of backend fetches, maybe we can improve the overall performance of Varnish.
I'm not too concerned about cache misses, because a cache miss is a hit that hasn't happened yet. However, we can look at the requests that cause the most cache misses by running the following command:
varnishtop -g request -i requrl -q "VCL_Call eq 'MISS'"
This will list the URLs with the most misses. You can then drill down on individual requests and figure out why they cause cache misses so often.
You can use the following command to inspect the logs for a specific URL:
varnishlog -g request -q "ReqUrl eq '/my-page'"
Please replace /my-page with the URL of the endpoint you're inspecting.
For cache misses, we care about their TTL. Maybe the TTL was set too low. The TTL tag will show you which TTL value is used.
Also keep an eye on the Timestamp tags, because they can highlight any potential slowdown.
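If you want the log output narrowed down to just those tags for one endpoint, something along these lines should do it (the URL is a placeholder):
varnishlog -g request -q "ReqUrl eq '/my-page'" -i ReqUrl -i TTL -i Timestamp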
Looking for uncacheable content
Uncacheable content is more dangerous than uncached content. A cache miss will eventually result in a hit, whereas a cache bypass will always be uncacheable and will always require a backend fetch.
The following command will list your top cache bypasses by URL:
varnishtop -g request -i requrl -q "VCL_Call eq 'PASS'"
Again, you can drill down using the following command:
varnishlog -g request -q "ReqUrl eq '/my-page'"
It's important to understand why Varnish would bypass the cache for certain requests. The built-in VCL describes this process. See https://www.varnish-software.com/developers/tutorials/varnish-builtin-vcl/ for more information about the built-in VCL.
Typical things you should look for:
HTTP requests with a request method other than GET or HEAD
HTTP requests with an Authorization header
HTTP requests with a Cookie header
HTTP responses with a Set-Cookie header
HTTP responses with an s-maxage=0 or a max-age=0 directive in the Cache-Control header
HTTP responses with a private, no-cache or no-store directive in the Cache-Control header
HTTP responses that contain a Vary: * header
You can also run the following command to figure out how many passes take place on your system:
varnishstat -f MAIN.s_pass
If that is too high, you might want to write some VCL that handles Authorization headers, Cookie headers and Set-Cookie headers.
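What that VCL looks like depends entirely on your application; as a rough sketch (the /test/ prefix mirrors your example URL and is only a placeholder, and you should only strip these headers where the response genuinely doesn't depend on them):
sub vcl_recv {
    # If the Authorization header is e.g. a static API key that doesn't change
    # the response, dropping it keeps the request cacheable
    if (req.http.Authorization) {
        unset req.http.Authorization;
    }
}
sub vcl_backend_response {
    # Don't let a stray Set-Cookie turn cacheable JSON responses into cache bypasses
    if (bereq.url ~ "^/test/") {
        unset beresp.http.Set-Cookie;
    }
}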
The conclusion can also be that you need to optimize your Cache-Control headers.
If you've done all the optimization you can and you still get a lot of cache bypasses, you need to scale out your platform a bit more.
Be on the lookout for zero TTL
One line of VCL that caught my eye is the following:
set beresp.ttl = std.duration(beresp.http.X-Cache-ttl, 2d);
You are using an X-Cache-ttl response header to set the TTL. Why would you do that if there is a conventional Cache-Control header for that?
An extra risk is the fact that the built-in VCL cannot handle this and cannot properly mark these requests as uncacheable.
The most dangerous thing that can happen is that you set beresp.ttl = 0 through this header and that you hit a scenario where set beresp.uncacheable = true is reached in your VCL.
If the beresp.ttl remains zero at that point, Varnish will not be able to store Hit-For-Miss objects in the cache for these situations. This means that subsequent requests for this resource will be added to the waiting list. But because we're dealing with uncacheable content, these requests will never be satisfied by Varnish's request coalescing mechanism.
The result is that the waiting list will be processed serially and this will increase the waiting time, which might result in exceeding the available threads.
My advice is to add set beresp.ttl = 120s right before you set beresp.uncacheable = true;. This will ensure Hit-For-Miss objects are created for uncacheable content.
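Applied to the 5xx block in your vcl_backend_response, that would look roughly like this:
if (beresp.status != 200 && beresp.status != 422) {
    if (bereq.is_bgfetch) {
        return (abandon);
    }
    # Give the Hit-For-Miss object a lifetime so later requests for this resource
    # don't pile up on the waiting list
    set beresp.ttl = 120s;
    set beresp.uncacheable = true;
}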
Use s-maxage & stale-while-revalidate
To build on the entire conventional header argument, please remove the following lines of code from your VCL:
# Sets a default TTL in case the backend does not send a caching-related header
set beresp.ttl = std.duration(beresp.http.X-Cache-ttl, 2d);
# Grace period to keep serving stale entries
set beresp.grace = std.duration(beresp.http.X-Cache-grace, 1d);
Replace this logic with the proper use of Cache-Control headers.
Here's an example of a Cache-Control header with a 3600s TTL and a 1 day grace:
Cache-Control: public, s-maxage=3600, stale-while-revalidate=86400
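Since your origin is fronted by nginx, one place to emit that header is there (a sketch; the location block, upstream address and values are placeholders, and ideally the application itself should decide them):
location / {
    proxy_pass http://127.0.0.1:8080;
    add_header Cache-Control "public, s-maxage=3600, stale-while-revalidate=86400";
}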
This feedback is not related to your problem, but is just a general best practice.
Conclusion
At this point it's not really clear what the root cause of your problem is. You talk about threads and slow backends.
On the one hand I have given you ways to inspect the thread pool usage and a way to increase the threads per pool.
On the other hand, we need to look at potential cache misses and cache bypass scenarios that might disrupt the balance on the system.
If certain headers cause unwanted cache bypasses, we might be able to improve the situation by writing the proper VCL.
And finally, we need to ensure you are not adding requests to the waitlist if they are uncacheable.
I'm trying to build a REST API, but Varnish always returns the first cached response and I have no idea why.
If I open a page with a browser, Varnish returns HTML -> which is OK.
If I curl the same page with curl -i https://example.com -H "Accept: application/json", Varnish also returns HTML -> which is wrong.
As far as I can see, Varnish always returns the first cached item: if that was JSON, Varnish returns JSON; if it was HTML, Varnish returns HTML.
Without Varnish everything works like expected.
If you're serving different content types on the same URL, you might want to tell Varnish to partition the cache accordingly.
In fact, Varnish doesn't do anything special here and behaves like other proxies would: if it sees a URL with no information about how the resource's cache should be partitioned, then no matter whether it is a JSON or a regular request, the first response will be cached and served to everyone, irrespective of request type.
So you need to tell Varnish how to partition the cache for a resource.
The "Vary" header
The most straightforward way, and the one compatible with other HTTP proxies in the wild, is the Vary response header.
It tells the proxy cache (Varnish in this case) to partition, or vary, the cache for a resource based on a header value coming from the client.
E.g. the client sends the header X: some-value and your app sends the header Vary: X; that is all it takes to keep separate cache entries for different values of X.
For Varnish 3, there is an example with Accept-Encoding.
The article details an implementation challenge with Vary - different clients may send quite different values for the varied header, resulting in a severely partitioned cache. So you typically want to normalize the varying header's value to a set of known, expected values.
In your case you want to Vary on (and normalize) the Accept header. So something along the lines of this (in the vcl_recv subroutine):
if (req.http.Accept) {
    if (req.http.Accept ~ "application/json") {
        set req.http.Accept = "application/json";
    } else {
        set req.http.Accept = "text/html";
    }
}
Next you need to have your app actually send Vary: Accept (inside your app source files). Alternatively, you can throw in some Varnish VCL instead, if modifying the app source files is not feasible:
sub vcl_fetch {
    if (!beresp.http.Vary) { # no Vary at all
        set beresp.http.Vary = "Accept";
    } elseif (beresp.http.Vary !~ "Accept") { # add to existing Vary
        set beresp.http.Vary = beresp.http.Vary + ", Accept";
    }
}
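Once the normalization and the Vary header are both in place, you can verify the split from the command line; after warming both variants, each of these should be a cache hit with a different content type and body:
curl -s -o /dev/null -w "%{content_type}\n" https://example.com -H "Accept: text/html"
curl -s -o /dev/null -w "%{content_type}\n" https://example.com -H "Accept: application/json"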
I would like to know how to use JavaScript to achieve my use case.
My app receives a POST request, increments a memcache key, and then publishes the incremented value straight away to users (a mobile app) using a third-party API.
E.g. after the first request the value becomes 1, publish 1.
After the second request the value becomes 2, publish 2, and so on.
It works fine with fewer than 2k requests within 30 seconds.
If the number of requests goes up to 10k, users (the mobile app) may receive too many messages from the publisher (which drains the battery).
So I have to throttle the publishing calls: instead of publishing per request, I want to publish the value once per second. At second 1 the value might be 1, so publish 1. At second 2 the value might be 100, so publish 100. That way I save 99 publish calls.
When requests are no longer coming in, I don't want a worker to keep running every second.
Each time it increments, cache the new value to a global variable and post it to clients using setInterval. Here is a simple example:
var key = 0;

// Update the cache to the present
// value on application start
memcache.get('key', updateKey);

// Handle increment request and
// save the new value
app.post('/post', function (req, res) {
    memcache.incr('key', updateKey);
    res.end(); // respond so the request doesn't hang
});

// Update the cached key
function updateKey(err, val) {
    key = val;
}

// Publish to clients once
// a second
function publish() {
    clients.emit(key);
}

setInterval(publish, 1000);
Starting and stopping this routine is a little more involved and may depend on how you're serving requests / incrementing the value.
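One way to handle the starting and stopping is to publish only when the value has changed since the last tick, and stop the interval after a tick with no change; the next incoming request starts it again. A sketch that replaces the updateKey and publish/setInterval pieces above (memcache, clients and app are assumed to be the same objects as in that example):
var key = 0;
var lastPublished = 0;
var timer = null;

function tick() {
    if (key !== lastPublished) {
        lastPublished = key;
        clients.emit(key); // same publish mechanism as above
    } else {
        // nothing changed since the last tick: stop until the next request arrives
        clearInterval(timer);
        timer = null;
    }
}

function updateKey(err, val) {
    key = val;
    if (!timer) {
        // (re)start the once-per-second publisher lazily
        timer = setInterval(tick, 1000);
    }
}
This keeps the publish rate at one call per second at most, and no timer runs while the app is idle.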
Take a look at node-rate-limiter
You can implement it in a number of ways to solve your problem...
How do I expire the administrator session after a period of inactivity in SilverStripe 3.1.x? Is there a config option for this?
I searched and found the following code snippet, which, when placed in the Page_Controller class, works for frontend users but is totally ineffective in the administration area.
public function init() {
    parent::init();
    self::logoutInactiveUser();
}

public static function logoutInactiveUser() {
    $inactivityLimit = 1; // in minutes - deliberately set to 1 minute for testing purposes
    $inactivityLimit = $inactivityLimit * 60; // converted to seconds
    $sessionStart = Session::get('session_start_time');

    if (isset($sessionStart)) {
        $elapsed_time = time() - Session::get('session_start_time');
        if ($elapsed_time >= $inactivityLimit) {
            $member = Member::currentUser();
            if ($member) $member->logOut();
            Session::clear_all();
            $this->redirect(Director::baseURL() . 'Security/login');
        }
    }
    Session::set('session_start_time', time());
}
After over 1 minute of inactivity, the admin user is still logged in and the session has not timed out.
For people like myself still searching for a solution to this, there's a much simpler alternative. As it turns out, the only good solution at the moment is indeed to disable LeftAndMain.session_keepalive_ping and simon_w's solution will not work precisely because of this ping. Also, disabling this ping should not cause data loss (at least not for SilverStripe 3.3+) because the user will be presented with an overlay when they attempt to submit their work. After validating their credentials, their data will be submitted to the server as usual.
Also, for anyone who (like myself) was looking for a solution on how to override the CMS ping via LeftAndMain.session_keepalive_ping using _config.yml keep reading.
Simple Fix: In your mysite/_config.php, simply add:
// Disable back-end AJAX calls to /Security/ping
Config::inst()->update('LeftAndMain', 'session_keepalive_ping', false);
This will prevent the CMS from refreshing the session, which will naturally expire on its own behind the scenes (and will not be submitted on the next request). That way, the setting you may already have in _config.yml dictating the session timeout will actually be respected, allowing you to log out a user who's been inactive in the CMS. Again, data should not be lost for the reasons mentioned in the first paragraph.
You can optionally manually override the session timeout value in mysite/_config/config.yml to help ensure it actually expires at some explicit time (e.g. 30min below):
# Set session timeout to 30min.
Session:
  timeout: 1800
You may ask: Why is this necessary?
Because, while the bug (or functionality?) preventing you from overriding the LeftAndMain.session_keepalive_ping setting to false was supposedly fixed in framework PR #3272, it was actually reverted soon thereafter in PR #3275.
I hope this helps anyone else confused by this situation like I was!
This works, but I would love to hear from the core devs as to whether or not this is best practice.
In mysite/code I created a file called MyLeftAndMainExtension.php with the following code:
<?php
class MyLeftAndMainExtension extends Extension {

    public function onAfterInit() {
        self::logoutInactiveUser();
    }

    public static function logoutInactiveUser() {
        $inactivityLimit = 1; // in minutes - deliberately set to 1 minute for testing
        $inactivityLimit = $inactivityLimit * 60; // converted to seconds
        $sessionStart = Session::get('session_start_time');

        if (isset($sessionStart)) {
            $elapsed_time = time() - Session::get('session_start_time');
            if ($elapsed_time >= $inactivityLimit) {
                $member = Member::currentUser();
                if ($member) $member->logOut();
                Session::clear_all();
                Controller::curr()->redirect(Director::baseURL() . 'Security/login');
            }
        }
        Session::set('session_start_time', time());
    }
}
Then I added the following line to mysite/_config.php
LeftAndMain::add_extension('MyLeftAndMainExtension');
That seemed to do the trick. If you prefer to do it through yml, you can add this to mysite/_config/config.yml :
LeftAndMain:
  extensions:
    - MyLeftAndMainExtension
The Session.timeout config option is available for setting an inactivity timeout for sessions. However, setting it to anything greater than 5 minutes isn't going to work in the CMS out of the box.
Having a timeout in the CMS isn't productive, and your content managers will end up ruing the timeout. This is because it is possible (and fairly common) to be active in the CMS, while appearing inactive from the server's perspective (say, you're writing a lengthy article). As such, the CMS is designed to send a ping back to the server every 5 minutes to ensure users are logged in. While you can stop this behaviour by setting the LeftAndMain.session_keepalive_ping config option to false, I strongly recommended against doing so.
We recently put Varnish in front of our Drupal site because the server was suffering under heavy load, and in general we are very pleased.
The only remaining problem is that we sometimes have an infinite redirection loop in the cached data. We found this through our HTTP monitoring: we check the front page every minute, and the cached page sometimes contains the full front page but with a Location header set that sends the user to the front page again.
We are not quite sure what could cause this, and also have no clue how we could track it down. Of course, the best way to handle this would be on the Drupal side, but we can't really tell why it happens.
Is there a way to log the cases when this happens? Or is it possible to detect this in varnish and mark the current cache content as invalid?
Of course, we don't want to always pass intentional redirects to the origin server, but the ones that would cause an infinite loop.
I hope to hear some ideas on how we can track this down further. Many thanks in advance for all kinds of hints.
I have found a workaround for this:
sub vcl_fetch {
    // Fix a strange problem: HTTP 301 redirects to the same page sometimes go into a loop
    if (beresp.http.Location == "http://" + req.http.host + req.url) {
        if (req.restarts > 2) {
            unset beresp.http.Location;
            #set beresp.http.X-Restarts = req.restarts;
        } else {
            return (restart);
        }
    }
}
I give the backend a second (and third) chance to return a proper page. If that fails as well, the Location header is removed. This works because the proper page is served with just an additional, invalid Location header.
The accepted answer by @philip, updated for Varnish 4:
sub vcl_backend_response {
    # Fix a strange problem: HTTP 301 redirects to the same page sometimes go into a loop
    if (beresp.http.Location == "http://" + bereq.http.host + bereq.url) {
        if (bereq.retries > 2) {
            unset beresp.http.Location;
            #set beresp.http.X-Restarts = bereq.retries;
        } else {
            return (retry);
        }
    }
}
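If you also want to log the cases where this safeguard kicks in (per the original question), one option is to un-comment the X-Restarts line above and then watch for it with a VSL query (Varnish 4+ varnishlog syntax; on Varnish 3 you'd have to grep the classic varnishlog output instead):
varnishlog -g request -q "RespHeader ~ 'X-Restarts'"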