Troubleshooting 500 Error Due to Cookie Size - varnish

Visitors to a website get a 500 Internal Server Error after browsing for a bit, because a tracking cookie pushes the overall cookie size for our domain over 4 KB (it's a page-view cookie, so it appends the page name each time you visit a new page).
I can reproduce the issue using curl with a very large cookie payload. In doing this I've been able to verify exactly where the 500 is coming from (we go from Cloudflare to Varnish to the backend web server). I've verified that the failing requests never reach the web server, so I believe Varnish is the one serving the 500s. I have also watched varnishlog and seen the 500s come through.
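For anyone reproducing something similar, a curl invocation along these lines works (a sketch: example.com is a placeholder for the real site, and _loc_ is the cookie named further down):
# Build ~8 KB of junk and send it as the tracking cookie's value
BIGVAL=$(head -c 65536 /dev/urandom | tr -dc 'a-z0-9' | head -c 8192)
curl -s -o /dev/null -w '%{http_code}\n' "https://example.com/" -H "Cookie: _loc_=${BIGVAL}"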
This is an example response from varnishlog:
-- VCL_return hash
-- VCL_call HASH
-- VCL_return lookup
-- Hit 57254162
-- VCL_call HIT
-- VCL_return deliver
-- RespProtocol HTTP/1.1
-- RespStatus 200
-- RespReason OK
-- RespHeader X-Powered-By: Express
-- RespHeader Date: Thu, 01 Aug 2019 23:05:52 GMT
-- RespHeader Content-Type: application/json; charset=utf-8
-- RespHeader Content-Length: 1174
-- RespHeader X-Varnish: 57156196 57519178
-- RespHeader Age: 86
-- RespHeader Via: 1.1 varnish-v4
-- VCL_call DELIVER
-- RespHeader X-Cache: HIT
-- RespUnset X-Powered-By: Express
-- VCL_return deliver
-- Timestamp Process: 1564700838.564547 0.000354 0.000354
-- RespHeader Accept-Ranges: bytes
-- Debug "RES_MODE 2"
-- RespHeader Connection: keep-alive
-- Error workspace_client overflow
-- RespProtocol HTTP/1.1
-- RespStatus 500
-- RespReason Internal Server Error
-- Timestamp Resp: 1564700838.564580 0.000387 0.000033
-- ReqAcct 10063 0 10063 0 0 0
-- End
Here is what I'd added to the vcl_recv section to remove the offending cookie:
set req.http.Cookie = regsuball(req.http.Cookie, "_loc_[^;]+(; )?", "");
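To confirm the rewrite actually fires, the live log can be filtered for transactions that arrive with the cookie (a sketch; the query syntax below is Varnish 4+ VSL):
# Show transactions whose incoming Cookie contains _loc_; the ReqUnset /
# ReqHeader records that follow show what the VCL rewrote the header to.
varnishlog -g request -i ReqHeader -i ReqUnset -q 'ReqHeader:Cookie ~ "_loc_"'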
I don't understand the significance of the two RespStatus entries here. Why is it 200, and then 500? I've also noticed that if I use curl, which uses HTTP/1.1, I get the 500, but if I use HTTPie, which uses HTTP/2, I get a 200. Is that expected? Would Varnish handle the cookie size differently depending on the HTTP version?
Edited: I think I've figured out that the difference between the two response statuses is that the first is the delivery of the content to Varnish, and the second is the delivery of the content to the client.

As the log says ("workspace_client overflow"), the workspace is too small to accommodate the transaction (notably the headers); try increasing it:
varnishadm param.set workspace_client 128k
For a longer explanation: Varnish uses a "workspace" for each transaction. This is a chunk of memory used to allocate data, and the whole chunk is wiped at the end of the transaction. The headers, notably, are copied into the workspace, and every time you add or modify a header, that goes there too.
The issue here is that you don't have enough space. Earlier versions would just panic, but Varnish is now smarter and produces a synthetic response with a 500 status instead. The trick is that it only notices the lack of workspace after the initial response has been copied, so you see both responses in the log.
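One caveat: param.set takes effect immediately but is not persistent, so to survive a restart the parameter should also be added to the varnishd startup options (a sketch; the paths and other flags here are illustrative, not your actual config):
# e.g. in the systemd unit or init script that launches varnishd
/usr/sbin/varnishd -a :80 -f /etc/varnish/default.vcl -p workspace_client=128k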

Related

Missing rate limiting headers in Azure API responses

My app is being throttled by the Azure Management API, and it would be much easier to investigate which requests are causing trouble if I had access to the rate-limiting headers. I thought they were returned by default, but they're not, at least not for me. I've followed the steps listed in this article, but none of the rate-limiting headers are returned. I even upgraded the Azure CLI to the latest version, but that didn't help either.
So the question is whether I need to enable something in the Azure portal to retrieve this kind of information, or maybe it's something else. Anyway, I'd be very thankful for any help, as I feel a bit confused.
Here's an example response:
GET https://management.azure.com/subscriptions/$SUBSCRIPTION_ID/resourcegroups?api-version=2021-04-01
HTTP/2 200
cache-control: no-cache
pragma: no-cache
content-type: application/json; charset=utf-8
expires: -1
x-ms-request-id: d093ff31-9b06-41dd-9c3a-6e7b7275e02
x-ms-correlation-request-id: d093ff31-9b06-41dd-9c3a-6e7b7275e023
x-ms-routing-request-id: EASTUS2:20211024T140342Z:d093ff31-9b06-41dd-9c3a-6e7b7275e023
strict-transport-security: max-age=31536000; includeSubDomains
x-content-type-options: nosniff
date: Sun, 24 Oct 2021 14:03:42 GMT
content-length: 289
<JSON_DATA>
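For what it's worth, calling ARM directly with a bearer token shows every header it sends back, which makes it easy to check whether any x-ms-ratelimit-* headers are present at all (a sketch, assuming a logged-in Azure CLI session; SUBSCRIPTION_ID is yours):
# Get a token from the current Azure CLI login
TOKEN=$(az account get-access-token --query accessToken -o tsv)
# -i includes all response headers; filter for the rate-limit ones
curl -si -H "Authorization: Bearer ${TOKEN}" \
  "https://management.azure.com/subscriptions/${SUBSCRIPTION_ID}/resourcegroups?api-version=2021-04-01" \
  | grep -i 'x-ms-ratelimit' || echo "no rate-limit headers in this response"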

Rewrite rules defined on Azure application gateway do not seem to work on TRACE methods

I have defined a rewrite rule on my Azure application gateway that rewrites a response header (Server=Unknown). I can see that the rule is correctly executed on GET, OPTIONS, and DELETE methods (returning either HTTP 200 or 405); however, the rule does not seem to fire on the TRACE method.
I wanted to address a finding from penetration tests stating that the server discloses technical information that allows an attacker to identify the reverse proxy in use.
Below is the response to an HTTP DELETE request:
HTTP/1.1 405
Date: Mon, 02 Nov 2020 14:47:18 GMT
Content-Type: text/plain
Content-Length: 0
Connection: keep-alive
X-FRAME-OPTIONS: SAMEORIGIN
X-XSS-Protection: 1; mode=block
X-Content-Type-Options: nosniff
Cache-Control: no-cache,no-store,must-revalidate
Pragma: no-cache
Allow: GET
Server: Unknown
And below the same call using TRACE:
HTTP/1.1 405 Not Allowed
Server: Microsoft-Azure-Application-Gateway/v2
Date: Mon, 02 Nov 2020 14:47:50 GMT
Content-Type: text/html
Content-Length: 183
Connection: close
Also, to me, the fact that the TRACE response does not contain as many headers as the DELETE is evidence that the call never reaches the web server (which is fine with me), but then I would expect the application gateway to fire the same rewrite rule as for any other method.
I also tried removing the header instead of setting it to "Unknown", but this has the same effect (the header is removed on all methods except TRACE).
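For reference, the per-method difference is easy to see from outside (a sketch; gw.example.com stands in for the real gateway frontend):
# Dump only the Server header for each method; on TRACE the gateway
# answers the request itself, so the rewrite rule never runs.
for m in GET OPTIONS DELETE TRACE; do
  echo "== $m =="
  curl -sk -o /dev/null -D - -X "$m" "https://gw.example.com/" | grep -i '^server'
done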
The TRACE method has not yet been added to the list of methods that rewrite rules are applied to. We have this on our roadmap, but with no ETA. Please follow the Azure Updates page for further updates.

How to determine why a flush error occurs?

We have an ISAPI module and filter that inspects and modifies responses. We have a scenario where Firefox with HTTP/2 enabled sends a request that fails within IIS, and a second request is immediately re-introduced into the pipeline (perhaps resent by the Firefox client). The two requests are very similar, except that the first one had a TE: trailer header and Connection: close. Looking at the Failed Request Trace, we see that the flush on the first request fails with 'the parameter is incorrect' (below). Is there a way to track down more information on why the flush failed? I tried to track it down within the managed pipeline but wasn't able to; it seems it may be occurring within native code, or maybe it's a communication error back to the client(?). If Firefox has HTTP/2 disabled, the flush error doesn't occur. If we don't have the ISAPI module and filter, the first request succeeds.
0 ms
Verbose
GENERAL_RESPONSE_ENTITY_BUFFER
Buffer
HTTP/1.1 302 Found
Content-Length: 192
Content-Type: text/html; charset=utf-8
Location: https://SERVER-NAME/VDIR/PATH/FILE.aspx?url=https%3a%2f%2fSERVER-NAME%2fVDIR
Server: Microsoft-IIS/10.0
request-id: b8945a72-543a-4474-9837-9420b3176c5b
X-Powered-By: ASP.NET
X-X-Server: SERVER-NAME
<html><head><title>Object moved</title></head><body>
<h2>Object moved to here.</h2>
</body></html>
0 ms
Informational
GENERAL_FLUSH_RESPONSE_END
BytesSent
0
ErrorCode
The parameter is incorrect.
(0x80070057)
0 ms
GENERAL_REQUEST_END
BytesSent
0
BytesReceived
733
HttpStatus
302
HttpSubStatus
0
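In case it helps others reproduce this outside Firefox, curl can approximate both variants of the request (a sketch; the URL is the placeholder from the trace above, and it assumes a curl build with HTTP/2 support):
# Approximation of the failing first request: HTTP/2 with TE: trailers
curl -vk --http2 -H "TE: trailers" "https://SERVER-NAME/VDIR/PATH/FILE.aspx"
# The same request over HTTP/1.1, where the flush error reportedly does not occur
curl -vk --http1.1 "https://SERVER-NAME/VDIR/PATH/FILE.aspx"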

Cloudant "case_clause" error with pouchdb when replicating

I am working with PouchDB and Cloudant, and when my web app starts up, it does a replication from Cloudant down to my PouchDB in the browser. I have an idea of how PouchDB works internally, and this is how I believe the process works (high level):
1. Replication starts.
2. Gets a checkpoint doc from the Cloudant db (this contains the latest sequence number retrieved from the server; if it doesn't exist, the sequence number is assumed to be 0, which is my case).
3. Grabs the changes from the changes feed starting at that sequence number (it grabs up to 25 changes).
4. Writes (or updates) the checkpoint doc back to the Cloudant server with the new sequence number (this way, if a network error occurs, it can continue where it left off, on this or the next replication).
5. Repeats until no changes are left.
6. Replication complete.
The problem is at step 4: when PouchDB tries to write that doc to the Cloudant server (for the first time), the server returns a 'case_clause' error. I am thinking the issue might be an invalid id sent to Cloudant (i.e., Cloudant doesn't accept ids of this format), because the id of the doc written to the server is _local/799c37dfaefb3774a04f55c7f8cee947 (or other random characters at the end). I don't know whether that is a valid doc id for Cloudant (it is valid for PouchDB), so I guess I am asking: is that the issue (an unacceptable id for Cloudant), or is there some other issue, based on the error the Cloudant server returns?
Here is the doc being written:
{
_id: "_local/799c37dfaefb3774a04f55c7f8cee947",
last_seq: "63"
}
Here is the full error output from the Chrome debugger:
{
error: "case_clause"
reason: "{{case_clause,{ok,{error,[{{doc,>,
{338,
[>]},
{[{>,>}]},
[],false,[]},
{error,internal_server_error}}]}}},
[{fabric,update_doc,3},{chttpd_db,'-update_doc/6-fun-0-',3}]}"
stack: Array[4]
0: "chttpd_db:update_doc/6"
1: "chttpd:handle_request/1"
2: "mochiweb_http:headers/5"
3: "proc_lib:init_p_do_apply/3"
length: 4
__proto__: Array[0]
status: 500
}
Note: When I go into Cloudant's Futon and manually enter the URL for the checkpoint doc using its id, it does not exist.
Thanks
EDIT:
Header Info from the above request using Chrome debugger:
Request URL:http://lessontrek.toddbluhm.c9.io/db/ilintindingreseseldropec/_local%2F799c37dfaefb3774a04f55c7f8cee947
Request Method:PUT
Status Code:500 Internal Server Error
Request Headers
PUT /db/ilintindingreseseldropec/_local%2F799c37dfaefb3774a04f55c7f8cee947 HTTP/1.1
Host: lessontrek.toddbluhm.c9.io
Connection: keep-alive
Content-Length: 111
Accept: application/json
Origin: http://lessontrek.toddbluhm.c9.io
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/28.0.1500.71 Safari/537.36
Content-Type: application/json
Referer: http://lessontrek.toddbluhm.c9.io/app
Accept-Encoding: gzip,deflate,sdch
Accept-Language: en-US,en;q=0.8
Cookie: connect.sid=s%3A8MVBFmbizTX4VNOqZNtIuxQI.TZ9yKRqNv0ePbTB%2FmSpJsncYszJ8qBSD5EWHzxQYIbg; AuthSession=(removed for security purposes, but valid); db_name=ilintindingreseseldropec; __utma=200306492.386329876.1368934655.1375164160.1375252679.55; __utmc=200306492; __utmz=200306492.1372711539.22.2.utmcsr=google|utmccn=(organic)|utmcmd=organic|utmctr=(not%20provided); c9.live.proxy=(removed for security purposes, but valid)
Request Payload
{"_id":"_local/799c37dfaefb3774a04f55c7f8cee947","last_seq":"63","_rev":"338-7db9750558e43e2076a3aa720a6de47b"}
Response Headers
HTTP/1.1 500 Internal Server Error
x-powered-by: Express
vary: Accept-Encoding
x-couch-request-id: 7d2ca9fc
server: CouchDB/1.0.2 (Erlang OTP/R14B)
date: Wed, 31 Jul 2013 07:29:23 GMT
content-type: application/json
cache-control: must-revalidate
content-encoding: gzip
transfer-encoding: chunked
via: 1.1 project-livec993c2dc8b8c.rhcloud.com (node-web-proxy/0.4)
X-C9-Server: proxy_subdomain_collab-bus2_01
Cloudant, like CouchDB, expects all _local revs to begin with "0-". PouchDB should not be generating rev values of this form. If you try this PUT against CouchDB, you get the same stack trace.
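To illustrate the accepted shape, a checkpoint write Cloudant accepts looks roughly like this (a sketch: HOST is a placeholder; the db and doc id are taken from the question):
# A first write of a _local doc typically needs no _rev
curl -X PUT "https://HOST/ilintindingreseseldropec/_local/799c37dfaefb3774a04f55c7f8cee947" \
  -H "Content-Type: application/json" \
  -d '{"last_seq":"63"}'
# Later updates carry the current rev, which for _local docs starts with "0-"
# (e.g. "_rev":"0-1"), not the "338-..." value from the failing request above.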

JSON content compression stops working after some time with Azure WebRole

I have JSON compression configured for my Web API in Azure, following the MSDN article Use AppCmd.exe to Configure IIS at Startup.
I publish my roles and start testing, and all is well according to Fiddler.
Here is an example request header:
GET http://x.cloudapp.net:8080/api/xyz HTTP/1.1
Accept: application/json
Host: x.cloudapp.net:8080
Accept-Encoding: gzip
Here is an example response header:
HTTP/1.1 200 OK
Cache-Control: no-cache
Pragma: no-cache
Content-Type: application/json; charset=utf-8
Content-Encoding: gzip
Expires: -1
Vary: Accept-Encoding
Server: Microsoft-IIS/8.0
X-AspNet-Version: 4.0.30319
X-Powered-By: ASP.NET
Date: Thu, 18 Jul 2013 22:27:38 GMT
Content-Length: 2472
Just a few Web API calls later (like 6 seconds later) all responses are no longer compressed.
Request header:
GET http://xyz HTTP/1.1
Accept: application/json
Host: sp-test-server2012.cloudapp.net:8080
Accept-Encoding: gzip
Response header:
HTTP/1.1 200 OK
Cache-Control: no-cache
Pragma: no-cache
Content-Type: application/json; charset=utf-8
Expires: -1
Server: Microsoft-IIS/8.0
X-AspNet-Version: 4.0.30319
X-Powered-By: ASP.NET
Date: Thu, 18 Jul 2013 22:27:44 GMT
Content-Length: 16255
Note the missing Content-Encoding in the second response.
So I get a few hundred calls that are compressed, and then most of the rest are uncompressed. Every now and again I can see that another response is compressed. Or, if I stop testing for a while and then resume, it seems that compression starts again.
Is compression in IIS 8 'throttled' or something? Say, if the CPU is nearly maxed out, does IIS stop compressing?
In monitoring my WebRole in Azure, my CPU usage can go above 90% during my heavy load testing. It is hard to tell if this is correlated with the lack of compression on the results. Memory usage does not appear to be an issue at all.
I would like this to be more reliable and predictable!
Well, apparently yesterday my Google-fu failed me. I found the answer today, and it is true that IIS will or will not dynamically compress content based on CPU usage; see HTTP Compression.
There are two settings that control dynamic compression. One specifies when it is disabled: dynamicCompressionDisableCpuUsage, default 90%. The other specifies when it is re-enabled: dynamicCompressionEnableCpuUsage, default 50%.
The things you learn.
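For anyone wanting to tune those thresholds from a WebRole startup task, an AppCmd sketch in the spirit of the article linked in the question (the values are illustrative, not recommendations):
REM Keep dynamic compression enabled until CPU hits 100%, re-enable at 50%
%windir%\system32\inetsrv\appcmd.exe set config -section:system.webServer/httpCompression /dynamicCompressionDisableCpuUsage:100 /commit:apphost
%windir%\system32\inetsrv\appcmd.exe set config -section:system.webServer/httpCompression /dynamicCompressionEnableCpuUsage:50 /commit:apphost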
This article might be helpful to force compression:
ASP.NET Web API GZip compression ActionFilter with 8 lines of code
Of course they'll charge for CPU time in heavy load situations.
