I'm trying to build a REST API, but Varnish always returns the first cached response and I have no idea why.
If I open a page in a browser, Varnish returns HTML -> that's fine.
If I curl the same page with curl -i https://example.com -H "Accept: application/json", Varnish also returns HTML -> which is wrong.
As far as I can tell, Varnish always returns whatever was cached first: if that was JSON, Varnish returns JSON; if it was HTML, Varnish returns HTML.
Without Varnish everything works as expected.
If you're serving different content types on the same URL, you need to tell Varnish to partition the cache accordingly.
In fact, Varnish doesn't do anything special here; it behaves like other proxies would. If a proxy sees a URL without any information specifying how the resource's cache should be partitioned, then the first response, whether JSON or HTML, is cached and served for every subsequent request regardless of request type.
So you need to tell Varnish how to partition the cache for a resource.
The "Vary" header
The most straightforward way, and the one also understood by other HTTP proxies in the wild, is the Vary response header.
It tells the proxy cache (Varnish in this case) to partition, or vary, the cache for a resource based on the value of a header coming from the client.
E.g. if the client sends the header X: some-value and your app responds with Vary: X, the cache will keep separate objects for each distinct value of X.
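For example, such an exchange could look like this (the header name X and its value are purely illustrative):

GET /resource HTTP/1.1
Host: example.com
X: some-value

HTTP/1.1 200 OK
Vary: X
Content-Type: application/json

With that response header in place, the cache keeps one object per distinct value of X and serves the matching one on each request.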
For Varnish 3, there is an example with Accept-Encoding.
The article details an implementation challenge with Vary: different clients may send quite different values for the varied header, resulting in a severely fragmented cache. So you typically want to normalize the varying header's value to a small set of known, expected values.
In your case you want to Vary on (and normalize) the Accept header. So something along these lines (in the vcl_recv subroutine):
if (req.http.Accept) {
    if (req.http.Accept ~ "application/json") {
        set req.http.Accept = "application/json";
    } else {
        set req.http.Accept = "text/html";
    }
}
Next you need to have your app actually send Vary: Accept (inside your app source files). Alternatively, you can add some Varnish VCL instead, if modifying the app source files is not feasible:
sub vcl_fetch {
    if (!beresp.http.Vary) { # no Vary at all
        set beresp.http.Vary = "Accept";
    } elseif (beresp.http.Vary !~ "Accept") { # add to existing Vary
        set beresp.http.Vary = beresp.http.Vary + ", Accept";
    }
}
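Note that vcl_fetch is Varnish 3 syntax. On Varnish 4 and newer, the same logic goes into vcl_backend_response; a direct translation of the snippet above would be:

sub vcl_backend_response {
    if (!beresp.http.Vary) { # no Vary at all
        set beresp.http.Vary = "Accept";
    } elseif (beresp.http.Vary !~ "Accept") { # add to existing Vary
        set beresp.http.Vary = beresp.http.Vary + ", Accept";
    }
}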
We are facing issues with Varnish hitting its max threads limit, along with spikes in backend connections and sessions. We are not sure about the cause, but what we have observed is that it happens when the origin servers have high response times and eventually return uncacheable (502) responses.
Varnish usage:
We've configured Varnish behind an nginx proxy, so incoming requests first hit nginx and are then consistently balanced across n Varnish instances. On a cache miss, Varnish calls the origin nginx host, here example.com.
In our case, we only cache HTTP GET requests, and all of them have a JSON payload in the response, ranging in size from 0.001 MB to 2 MB.
Example request:
HTTP GET: http://test.com/test/abc?arg1=val1&arg2=val2
Expected xkey: test/abc
Response: JSON payload
Approx QPS: 60-80 HTTP GET requests
Avg obj TTL: 2d
Avg obj grace: 1d
Attaching the VCL file, statistics, and the varnish run command for debugging purposes.
Monitoring stats (graphs attached): requests, cache status, sessions, threads, backend connections, objects expired.
Varnish and VCL configuration:
Varnish version: Linux,5.4.0,x86_64,varnish-6.5.1
varnishd -F -j unix,user=nobody -a :6081 -T localhost:6082 -f /etc/varnish/default.vcl -s file,/opt/varnishdata/cache,750G
vcl 4.0;
import xkey;
import std;
acl purgers {
    "localhost";
}
backend default {
    .host = "example.com";
    .port = "80";
}
sub vcl_recv {
    unset req.http.Cookie;
    if (req.method == "PURGE") {
        if (client.ip !~ purgers) {
            return (synth(403, "Forbidden"));
        }
        if (req.http.xkey) {
            set req.http.n-gone = xkey.softpurge(req.http.xkey);
            return (synth(200, "Invalidated " + req.http.n-gone + " objects"));
        } else {
            return (purge);
        }
    }
    # remove request id from request
    set req.url = regsuball(req.url, "reqid=[-_A-Za-z0-9+()%.]+&?", "");
    # remove trailing ? or &
    set req.url = regsub(req.url, "[?|&]+$", "");
    # set hostname for backend request
    set req.http.host = "example.com";
}
sub vcl_backend_response {
    # Sets a default TTL in case the backend does not send a caching-related header
    set beresp.ttl = std.duration(beresp.http.X-Cache-ttl, 2d);
    # Grace period to keep serving stale entries
    set beresp.grace = std.duration(beresp.http.X-Cache-grace, 1d);
    # extract xkey
    if (bereq.url ~ "/some-string/") {
        set beresp.http.xkey = regsub(bereq.url, ".*/some-string/([^?]+).*", "\1");
    }
    # This block makes sure that if the upstream returns a 5xx, but we have the response in the cache
    # (even if it's expired), we fall back to the cached value (until the grace period is over).
    if (beresp.status != 200 && beresp.status != 422) {
        # This check is important. If is_bgfetch is true, it means that we've found and returned the
        # cached object to the client, and triggered an asynchronous background update. In that case,
        # if it was a 5xx, we have to abandon, otherwise the previously cached object would be erased
        # from the cache (even if we set uncacheable to true).
        if (bereq.is_bgfetch) {
            return (abandon);
        }
        # We should never cache a 5xx response.
        set beresp.uncacheable = true;
    }
}
sub vcl_deliver {
    unset resp.http.X-Varnish;
    unset resp.http.Via;
    set resp.http.X-Cached = req.http.X-Cached;
}
sub vcl_hit {
    if (obj.ttl >= 0s) {
        set req.http.X-Cached = "HIT";
        return (deliver);
    }
    if (obj.ttl + obj.grace > 0s) {
        set req.http.X-Cached = "STALE";
        return (deliver);
    }
    set req.http.X-Cached = "MISS";
}
sub vcl_miss {
    set req.http.X-Cached = "MISS";
}
Please let us know if there are any suggestions to improve the current configuration or anything else required to debug this.
Thanks
Abhishek Surve
Measure thread shortage and increase thread count
If you run out of threads, from a firefighting point of view it makes sense to increase the number of threads per thread pool.
Here's a varnishstat command that displays realtime thread consumption and potential thread limits:
varnishstat -f MAIN.threads -f MAIN.threads_limited
Press the d key to display fields with a zero value.
If MAIN.threads_limited increases, we know you have exceeded the maximum threads per pool that is set by the thread_pool_max runtime parameter.
It makes sense to display the current thread_pool_max value by executing the following command:
varnishadm param.show thread_pool_max
You can use varnishadm param.set to set a new thread_pool_max value at runtime, but it is not persisted and won't survive a restart.
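For example (the value here is purely illustrative; size it to your traffic):

varnishadm param.set thread_pool_max 10000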
The best way is to set it through a -p parameter in your systemd service file.
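Based on the varnishd command you posted (and assuming the usual /usr/sbin/varnishd path), the ExecStart line in the unit file could look something like this, with the thread value again being illustrative:

ExecStart=/usr/sbin/varnishd -F -j unix,user=nobody -a :6081 -T localhost:6082 -f /etc/varnish/default.vcl -s file,/opt/varnishdata/cache,750G -p thread_pool_max=10000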
Watch out with file storage
I noticed you're using the file stevedore to store large volumes of data. We strongly advise against using it: it is very sensitive to disk fragmentation, it can slow Varnish down when it has to perform too many disk seeks, and it relies too much on the kernel's page cache to be efficient.
On open source Varnish, -s malloc is still your best bet. You can increase your cache capacity through horizontal scaling and having 2 tiers of Varnish.
The most reliable way to use disk for large volumes of data is Varnish Enterprise's Massive Storage Engine. It's not free and open source, but it was built specifically to counter the poor performance of the file stevedore.
Looking for uncached content
Based on how you're describing the problem, it looks like Varnish has to spend too much time dealing with uncached responses. This requires a backend connection.
Luckily Varnish lets go of the backend thread and allows client threads to deal with other tasks while Varnish is waiting for the backend to respond.
But if we can limit the number of backend fetches, maybe we can improve the overall performance of Varnish.
I'm not too concerned about cache misses as such, because a cache miss is just a hit that hasn't happened yet. However, we can look at the requests that cause the most cache misses by running the following command:
varnishtop -g request -i requrl -q "VCL_Call eq 'MISS'"
This will list the URLs that generate the most misses. You can then drill down on individual requests and figure out why they cause cache misses so often.
You can use the following command to inspect the logs for a specific URL:
varnishlog -g request -q "ReqUrl eq '/my-page'"
Please replace /my-page with the URL of the endpoint you're inspecting.
For cache misses, we care about their TTL. Maybe the TTL was set too low. The TTL tag will show you which TTL value is used.
Also keep an eye on the Timestamp tags, because they can highlight any potential slowdown.
Looking for uncacheable content
Uncacheable content is more dangerous than uncached content. A cache miss will eventually result in a hit, whereas a cache bypass will always be uncacheable and will always require a backend fetch.
The following command will list your top cache bypasses by URL:
varnishtop -g request -i requrl -q "VCL_Call eq 'PASS'"
Then again, you can drill down using the following command:
varnishlog -g request -q "ReqUrl eq '/my-page'"
It's important to understand why Varnish would bypass the cache for certain requests. The built-in VCL describes this process. See https://www.varnish-software.com/developers/tutorials/varnish-builtin-vcl/ for more information about the built-in VCL.
Typical things you should look for:
HTTP requests with a request method other than GET or HEAD
HTTP requests with an Authorization header
HTTP requests with a Cookie header
HTTP responses with a Set-Cookie header
HTTP responses with an s-maxage=0 or a max-age=0 directive in the Cache-Control header
HTTP responses with a private, no-cache or no-store directive in the Cache-Control header
HTTP responses that contain a Vary: * header
You can also run the following command to figure out how many passes take place on your system:
varnishstat -f MAIN.s_pass
If that is too high, you might want to write some VCL that handles Authorization headers, Cookie headers and Set-Cookie headers.
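As a minimal sketch of that kind of VCL (the URL patterns are hypothetical; adapt them to the parts of your site that genuinely need cookies or authorization):

sub vcl_recv {
    # Hypothetical: only /account actually needs cookies or authorization
    if (req.url !~ "^/account") {
        unset req.http.Cookie;
        unset req.http.Authorization;
    }
}

sub vcl_backend_response {
    # Hypothetical: static assets should never set cookies
    if (bereq.url ~ "\.(css|js|png|jpg|gif|ico|woff2?)$") {
        unset beresp.http.Set-Cookie;
    }
}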
The conclusion can also be that you need to optimize your Cache-Control headers.
If you've done all the optimization you can and you still get a lot of cache bypasses, you need to scale out your platform a bit more.
Be on the lookout for zero TTL
One line of VCL that caught my eye is the following:
set beresp.ttl = std.duration(beresp.http.X-Cache-ttl, 2d);
You are using an X-Cache-ttl response header to set the TTL. Why would you do that if there is a conventional Cache-Control header for that?
An extra risk is the fact that the built-in VCL cannot handle this and cannot properly mark these requests as uncacheable.
The most dangerous thing that can happen is that you set beresp.ttl = 0 through this header and then hit a scenario where set beresp.uncacheable = true is reached in your VCL.
If the beresp.ttl remains zero at that point, Varnish will not be able to store Hit-For-Miss objects in the cache for these situations. This means that subsequent requests for this resource will be added to the waiting list. But because we're dealing with uncacheable content, these requests will never be satisfied by Varnish's request coalescing mechanism.
The result is that the waiting list will be processed serially and this will increase the waiting time, which might result in exceeding the available threads.
My advice is to add set beresp.ttl = 120s right before you set beresp.uncacheable = true;. This will ensure Hit-For-Miss objects are created for uncacheable content.
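Applied to the error-handling block in your vcl_backend_response, that looks like this:

if (beresp.status != 200 && beresp.status != 422) {
    if (bereq.is_bgfetch) {
        return (abandon);
    }
    # Keep a Hit-For-Miss object around so that requests for this resource
    # are not serialized on the waiting list while it stays uncacheable.
    set beresp.ttl = 120s;
    set beresp.uncacheable = true;
}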
Use s-maxage & stale-while-revalidate
To build on the conventional-header argument, please remove the following lines of code from your VCL:
# Sets a default TTL in case the backend does not send a caching-related header
set beresp.ttl = std.duration(beresp.http.X-Cache-ttl, 2d);
# Grace period to keep serving stale entries
set beresp.grace = std.duration(beresp.http.X-Cache-grace, 1d);
Replace this logic with the proper use of Cache-Control headers.
Here's an example of a Cache-Control header with a 3600s TTL and a 1 day grace:
Cache-Control: public, s-maxage=3600, stale-while-revalidate=86400
Varnish honors stale-while-revalidate natively since version 6.1, so on your varnish-6.5.1 setup this single header covers both the TTL and the grace behavior. This feedback is not related to your problem; it is just a general best practice.
Conclusion
At this point it's not really clear what the root cause of your problem is; you talk about threads and slow backends.
On the one hand, I have given you ways to inspect thread pool usage and a way to increase the threads per pool.
On the other hand, we need to look at potential cache miss and cache bypass scenarios that might disrupt the balance of the system.
If certain headers cause unwanted cache bypasses, we might be able to improve the situation by writing the proper VCL.
And finally, we need to ensure you are not adding requests to the waiting list if they are uncacheable.
I am using Node.js as a reverse proxy, mostly using the http and http-proxy modules. While sending a request to Node.js to redirect to one of my applications, I have to pass request headers, which are all in upper case. However, Node.js (or rather the http module) converts them all to lower case, which makes one of my application's validations fail.
My code snippet is:
var http = require('http');
var url = require('url');
var httpProxy = require('http-proxy');

var proxy = httpProxy.createProxyServer({});

http.createServer(function (request, response) {
    var redirection = 'http://localhost:8000';
    var path = url.parse(request.url).path;
    switch (path) {
        case '/health':
            proxy.web(request, response, { target: redirection });
            break;
    }
}).listen(8080);
Request headers passed are:
curl -H "X-AUTH: PBxqcEm5sU743Cpk" -X GET http://localhost:8080/health
Now what is happening is that the header "X-AUTH" is getting transformed into "x-auth", and my application is not able to validate it. In my application the header matching is case-sensitive.
The request headers printed from node js request object are:
{ host: 'localhost:8080',
'user-agent': 'curl/7.47.1',
accept: '*/*',
'x-auth': 'PBxqcEm5sU743Cpk' }
My requirement is to retain the upper case of the headers passed in the request so that my application can validate and authorize them.
Please let me know if there is any way to achieve this.
Thanks a lot
FWIW, HTTP header field names are case-insensitive, so the case really should not matter.
However, node does provide access to the raw headers (including duplicates) via req.rawHeaders. Since req.rawHeaders is an array (format is [name1, value1, name2, value2, ...]), you will need to iterate over it to find the header(s) you are looking for.
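For example, a small helper along these lines (a sketch; findRawHeader is an illustrative name, not a built-in) recovers the original casing:

// req.rawHeaders is a flat array: [name1, value1, name2, value2, ...]
function findRawHeader(req, name) {
    for (var i = 0; i < req.rawHeaders.length; i += 2) {
        if (req.rawHeaders[i].toLowerCase() === name.toLowerCase()) {
            return { name: req.rawHeaders[i], value: req.rawHeaders[i + 1] };
        }
    }
    return null;
}

// e.g. findRawHeader(request, 'x-auth')
// -> { name: 'X-AUTH', value: 'PBxqcEm5sU743Cpk' }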
I was wondering if there was any way to re-order the HTTP headers that our browser sends, before they are sent on to the web server?
Since the order of the headers leaves some kind of "fingerprint" (see this post and this post), I was thinking about using mitmproxy (with inline scripting, I guess) to modify the headers on the fly. Is this possible?
How would one achieve that?
Note: I'm looking for a method that could be scripted, not a method using a graphical tool like the Burp Suite (although Burp is known to be able to re-order headers)
I'm open to suggestions. Perhaps NGINX might come to the rescue as well?
EDIT: I should be more specific, by giving an example...
Let's say I'm using Firefox. With the use of a funky add-on, I'm spoofing my user-agent to "look" like a Chrome browser. But then if I test my browser with ip-check.info, the "signature" of my browser remains that of Firefox, even though my spoofed user-agent shows "Chrome".
So the solution, in this specific case, should be to re-order the HTTP headers in the same manner as Chrome does.
How can this be done?
For the record, the order of the HTTP headers should not matter at all according to RFC 7230. But now that you have asked... this can be done in mitmproxy as follows:
import random

def request(context, flow):
    # flow.request.headers.fields is a tuple of (name, value) header tuples.
    h = list(flow.request.headers.fields)
    random.shuffle(h)
    flow.request.headers.fields = tuple(h)
See the mitmproxy documentation on netlib.http.Headers for more details.
There are tons of ways to reorder them as you wish:
def reorder(headers, header_order=["Host", "User-Agent", "Accept"]):
    fields = []
    for name in header_order:  # add existing headers in the specified order
        if name in headers:
            # get_all() returns values only, so rebuild the (name, value) tuples
            fields.extend((name, value) for value in headers.get_all(name))
            del headers[name]
    fields.extend(headers.fields)  # all other headers keep their relative order
    return tuple(fields)

request.headers.fields = reorder(request.headers)
I'm testing a revised Varnish config and I need to see whether certain URLs are hitting the cache or not. The config doesn't seem to handle multiple parameters.
The Varnish config change is to not treat URLs with certain parameters as unique content. E.g.
/news/tech
/news/tech?itq=1001
/news/tech?itq=1002&ito=3553
should all be equivalent.
Scenario 1
Requesting a page that hasn't been cached yet:
curl -I 'http://example.com/news/tech'
Result:
X-Varnish-Cache: MISS
Sending the same request a second time gives this result:
X-Varnish-Cache: HIT
Scenario 2
Requesting the above URL again, but with a parameter:
curl -I 'http://example.com/news/tech?itq=1001'
That is one of the parameters that should not be treated as unique content.
Result:
X-Varnish-Cache: HIT
Scenario 3
Requesting with a second parameter:
curl -I 'http://example.com/news/tech?itq=1001&ito=3553'
Response:
X-Varnish-Cache: MISS
It seems like the Varnish config works for ? but not for &.
Here's the relevant line in my Varnish config:
set req.url = regsuball(req.url, "([\?|\&])+(utm_campaign|utm_content|utm_medium|utm_source|utm_term|ITO|et_cid|et_rid|qs|itq|ito|itx\[idio\])=[^&\s]*&?", "\1");
I guess this is only running once, so it won't strip out multiple parameters. How would I do that?
After a bit of experimentation, I found a way to do this.
# Strip out query parameters that do not affect the page content
set req.url = regsuball(req.url, "([\?|\&])+(utm_campaign|utm_content|utm_medium|utm_source|utm_term|ITO|et_cid|et_rid|qs|itq|ito|itx\[idio\])=[^&\s]+", "\1");
# Get rid of trailing & or ?
set req.url = regsuball(req.url, "[\?|&]+$", "");
# Replace ?& with ?
set req.url = regsub(req.url, "(\?\&)", "\?");
The 2nd and 3rd commands are just cleanup, but this does seem to work.
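For example, tracing two URLs through the three statements (page=2 is a made-up parameter that should survive):

# /news/tech?itq=1001&ito=3553  ->  /news/tech?&        (parameters stripped, delimiters kept)
#                               ->  /news/tech          (trailing ?& removed)
# /news/tech?itq=1001&page=2    ->  /news/tech?&page=2  (itq stripped)
#                               ->  /news/tech?page=2   (?& collapsed to ?)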
The implementation from #thirtyish gives issues when it is used in combination with additional GET parameters.
E.g. ?utm_campaign=1&utm_source=2&my_add_parameter=3 is not working.
If we change the order to ?my_add_parameter=3&utm_campaign=1&utm_source=2, it is working.
And by "not working" I mean it generates multiple & signs in the URL query string.
I updated the regex to fix that.
set req.url = regsuball(req.url, "[\?\&](utm_\w+|hsa_\w+|gclid|fbclid|pc)=[^&\s]+", "");
# Remove trailing & or ?
set req.url = regsuball(req.url, "[\?|&]+$", "");
# Turn the first remaining & into a ? (only when no ? precedes it)
set req.url = regsub(req.url, "^([^?&]*)&", "\1?");
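With this version the delimiter is removed together with the parameter, and the last regsub repairs the query string when the first surviving delimiter is a &. For example (my_add_parameter is the made-up parameter from above):

# /path?utm_campaign=1&utm_source=2&my_add_parameter=3
#   -> /path&my_add_parameter=3  (utm_* removed together with their ?/&)
#   -> /path?my_add_parameter=3  (leading & rewritten to ?)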
I'm very new to Linux, working with Node.js. It's just my 2nd day. I use node-curl for cURL requests. In the link below I have found an example with a GET request. Can anybody provide me a POST request example using node-curl?
https://github.com/jiangmiao/node-curl/blob/master/examples/low-level.js
You need to use setopt in order to specify POST options for a cURL request. The options you should start looking at first are CURLOPT_POST and CURLOPT_POSTFIELDS. From the libcurl documentation linked from node-curl:
CURLOPT_POST
A parameter set to 1 tells the library to do a regular HTTP post. This will also make the library use a "Content-Type: application/x-www-form-urlencoded" header. (This is by far the most commonly used POST method).
Use one of CURLOPT_POSTFIELDS or CURLOPT_COPYPOSTFIELDS options to specify what data to post and CURLOPT_POSTFIELDSIZE or CURLOPT_POSTFIELDSIZE_LARGE to set the data size.
Optionally, you can provide data to POST using the CURLOPT_READFUNCTION and CURLOPT_READDATA options but then you must make sure to not set CURLOPT_POSTFIELDS to anything but NULL. When providing data with a callback, you must transmit it using chunked transfer-encoding or you must set the size of the data with the CURLOPT_POSTFIELDSIZE or CURLOPT_POSTFIELDSIZE_LARGE option. To enable chunked encoding, you simply pass in the appropriate Transfer-Encoding header, see the post-callback.c example.
CURLOPT_POSTFIELDS
... [this] should be the full data to post in a HTTP POST operation. You must make sure that the data is formatted the way you want the server to receive it. libcurl will not convert or encode it for you. Most web servers will assume this data to be url-encoded.
This POST is a normal application/x-www-form-urlencoded kind (and libcurl will set that Content-Type by default when this option is used), which is the most commonly used one by HTML forms. See also the CURLOPT_POST. Using CURLOPT_POSTFIELDS implies CURLOPT_POST.
If you want to do a zero-byte POST, you need to set CURLOPT_POSTFIELDSIZE explicitly to zero, as simply setting CURLOPT_POSTFIELDS to NULL or "" just effectively disables the sending of the specified string. libcurl will instead assume that you'll send the POST data using the read callback!
With that information, you should be able to add the following options to the low-level example to have it make a POST request:
var fieldsStr = '{}';
curl.setopt('CURLOPT_POST', 1); // true?
curl.setopt('CURLOPT_POSTFIELDS', fieldsStr);
You will need to tweak the contents of fieldsStr to match the format the server is expecting. Per the documentation, you may also need to url-encode the data, which should be as simple as using encodeURIComponent, according to this post.
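Putting it together with the linked low-level example, a complete POST request could look something like the sketch below. The require path, setopt names and event API are taken from that example and from the answer above; the URL and form data are placeholders you'd replace with your own:

var Curl = require('node-curl/lib/Curl');

var curl = new Curl();
var fieldsStr = 'name=value&other=123'; // url-encoded form data (placeholder)

curl.setopt('URL', 'http://localhost:8000/endpoint'); // placeholder URL
curl.setopt('CURLOPT_POST', 1);
curl.setopt('CURLOPT_POSTFIELDS', fieldsStr);

curl.on('data', function (chunk) {
    // response body arrives here in chunks
    console.log(chunk.toString());
});

curl.on('end', function () {
    curl.close();
});

curl.perform();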