Reasons for Cloudfront ClientCommError and ways to get around it - amazon-cloudfront

I am serving a few website assets from CloudFront (backed by S3) and periodically see errors like this:
2022-02-09 21:20:48 LAX3-C4 0 208.48.9.194 GET my_distribution.cloudfront.net /my/assets/3636.23f5cbf8445b5e7edb3f.js 000 https://my.site.com/ Mozilla/5.0%20(Windows%20NT%2010.0;%20Win64;%20x64;%20rv:96.0)%20Gecko/20100101%20Firefox/96.0 - - Error 7z652evl8PjlvQ65TxEtHHK3qoTU7Tf9F6CW3yHGYxRUYFGxjTlKAw== my_distribution.cloudfront.net https 61 0.003 - TLSv1.2 ECDHE-RSA-AES128-GCM-SHA256 Error HTTP/2.0 - - 62988 0.000 ClientCommError - - - -
CloudFront's explanation of ClientCommError: "The response to the viewer was interrupted due to a communication problem between the server and the viewer."
I have already introduced retries that attempt to load the resource 3 times before giving up, but for the most part it doesn't help. Also, looking at the locations the resources are requested from, they are often close by (not overseas, often even on the same US coast), and my files are pretty small (e.g. 475 B), so the issue can't be the size of a file.
What are ways to mitigate such load errors and ensure all resources can be downloaded?
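For reference, the kind of client-side retry described above might look roughly like this; the URL, attempt count, and back-off delay are illustrative, not the asker's actual code:

```typescript
// Hypothetical sketch of retrying a static asset fetch a few times before giving up.
async function fetchWithRetry(url: string, attempts = 3, delayMs = 500): Promise<Response> {
  let lastError: unknown;
  for (let i = 0; i < attempts; i++) {
    try {
      const res = await fetch(url);
      if (res.ok) return res;
      lastError = new Error(`HTTP ${res.status}`);
    } catch (err) {
      // Network-level failures (the viewer-side counterpart of ClientCommError) land here.
      lastError = err;
    }
    if (i < attempts - 1) {
      // Simple linear back-off before the next attempt.
      await new Promise((resolve) => setTimeout(resolve, delayMs * (i + 1)));
    }
  }
  throw lastError;
}

// fetchWithRetry("https://my_distribution.cloudfront.net/my/assets/3636.23f5cbf8445b5e7edb3f.js");
```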

I wasted two hours on the same thing... It turns out I had naively used curl to test it, and since curl (sensibly) refuses to dump binary data to the console, nothing was actually pulled from S3 into CloudFront. Once I added --output to the curl command, I started getting cache hits from CloudFront.
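For reference, roughly the same check scripted in Node.js (18+): it downloads the full body to a file, the equivalent of curl's --output, and prints CloudFront's X-Cache header so you can see whether the object was a Hit or a Miss. The URL and output path are placeholders, not part of the original answer:

```typescript
// Download the asset end to end and show whether CloudFront served it from cache.
import { writeFile } from "node:fs/promises";

async function pullThroughCloudFront(url: string, outPath: string): Promise<void> {
  const res = await fetch(url);
  const body = Buffer.from(await res.arrayBuffer()); // read the entire body
  await writeFile(outPath, body);
  console.log(res.status, res.headers.get("x-cache")); // e.g. "Miss from cloudfront", then "Hit from cloudfront"
}

pullThroughCloudFront(
  "https://my_distribution.cloudfront.net/my/assets/3636.23f5cbf8445b5e7edb3f.js",
  "asset.js",
);
```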

Related

How to Troubleshoot IIS Error 502.3 Codes 12030 and 12152 Sporadically Occurring

I have set up a server using IIS on Windows Server 2016, running a Python FastAPI app with uvicorn. Sending individual queries works well, but I have moved on to testing the API with parallel queries using k6. I have been sending queries with 30 VUs over 1 min with a random sleep in between, resulting in around 2.1 requests/sec. However, I noticed that the service was returning sporadic 502.3 errors about 15% of the time.
The error codes tagged to it were: 12030 and 12152. According to https://learn.microsoft.com/en-us/windows/win32/winhttp/error-messages:
ERROR_WINHTTP_CONNECTION_ERROR (12030): The connection with the server has been reset or terminated, or an incompatible SSL protocol was encountered. For example, WinHTTP version 5.1 does not support SSL2 unless the client specifically enables it.
ERROR_WINHTTP_INVALID_SERVER_RESPONSE (12152): The server response cannot be parsed.
The failure percentage seems to scale with the number of requests per second.
I checked the httperr logs under C:\Windows\System32\LogFiles\HTTPERR, but only saw Timer_ConnectionIdle entries, which from what I've read are not an issue.
How else can I troubleshoot these 502.3 errors to find out what the issue is?
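For reference, the k6 profile described above (30 VUs for about a minute with a random sleep, roughly 2 requests/sec overall) would look something like this; the target URL and sleep range are assumptions, not the actual script:

```typescript
// Rough shape of the described load test. 30 VUs sleeping 10-20 s each works out to ~2 req/s.
import http from "k6/http";
import { check, sleep } from "k6";

export const options = {
  vus: 30,
  duration: "1m",
};

export default function () {
  const res = http.get("http://localhost:8000/api/predict"); // placeholder endpoint
  check(res, { "status is 200": (r) => r.status === 200 });  // a 502.3 response fails this check
  sleep(10 + Math.random() * 10); // random think time between queries
}
```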
UPDATE 2022/12/20:
Managed to get the Failed Request Trace (FRT) for one of the occurrences. How should I proceed with troubleshooting? It only seems to indicate the 502.3 error:
Event: MODULE_SET_RESPONSE_ERROR_STATUS
ModuleName: httpPlatformHandler
Notification: EXECUTE_REQUEST_HANDLER
HttpStatus: 502
HttpReason: Bad Gateway
HttpSubStatus: 3
ErrorCode: 2147954552

Slow response times from free web app server every day at same time

Every day at about 3:00PM-4:00PM GMT the response times start to increase (with no increase in memory or CPU).
There is an Azure availability test hitting the server every 10 minutes.
As this is a dev site, there is no traffic to it other than me (at the odd time) and the availability test.
I log the startup time to an internal variable, and this shows that the site is not restarting.
When this starts happening, the first request via a browser is very slow (2 minutes - probably some timeout).
After that it runs perfectly. That looks like the site is shutting down and then starting up on the first request, but the pings are keeping it alive, so the site is not shutting down (as far as I know).
From the odd log entry I do get, I seem to be getting 502 errors - but I can't confirm this, as the FREB logs are usually off at this time.
FREB logs turn off automatically after 1 hour, and as this is the middle of the night for me (NZDT) I don't get a chance to turn them on.
See attached images - as you can see, the response times just increase at the same time each day.
Ignore the requests where they are above 20 - that's me going to the site via a browser.
I always check the Azure dashboard BEFORE viewing the site in a browser.
Just got this error (from the web browser, randomly, while repeatedly accessing the same page):
502: The specified CGI application encountered an error and the server terminated the process.
Other relevant info (perhaps):
When I first noticed this happening, the availability test was pinging a /ping endpoint that only returned a 200 and an empty string.
It now points to the site's homepage to see if that changes anything - still the same.
I'm assuming the database is not the issue, as the /ping endpoint doesn't touch the database - it's just a straight controller return.
Internal exception handling is catching nothing.
Service: Azure Free Web App (Development)
There are no web jobs or timed events on this site
Azure Dashboard Initial
Current tests:
Uploading as a new site to a Basic 1 Small instance
Restarting the dev site 12 hours before the issue (it usually has about 20 hours of uptime by then)
Results:
Restarting the free web app ~12 hours before the issue - same result at the same time - so it's not the app slowly overloading, or it would happen much later
Basic 1 Small: no problems - could it be something with the dev server?
Azure Dashboard From Today
Observations:
Same behavior with the /ping endpoint (just returns an empty string with 200 OK) and the main home page endpoint (database lookups [with caching] / Razor)
If anyone has any ideas about what might be going on, I would very much appreciate it
:-)
Update:
It seems to have stopped (on its own) at about 11/1/2016 1:50:49 AM GMT - my internal timestamp says it restarted - and then the errors started again at the same time as usual. Note: no one is using the app. The Basic 1 Small server is still going fine.
Sorry, I can't add any more images (not enough rep).
By default, web apps are unloaded if they are idle for some period of time, which could cause slow responses while the site spins back up. Also, there is an article about troubleshooting the HTTP "502 Bad Gateway" and HTTP "503 Service Unavailable" errors in Azure web apps that is worth reading; from that article, scaling the web app can mitigate the issue.

Load testing bottleneck on nodejs with Google Compute Engine

I cannot figure out the cause of the bottleneck on this site: response times get very bad once about 400 users are reached. The site is on Google Compute Engine, using an instance group with network load balancing. We created the project with Sails.js.
I have been doing load testing with Google Container Engine using Kubernetes, running the locust.py script.
The main results for one of the tests are:
RPS: 30
Spawn rate: 5 p/s
TOTAL USERS: 1000
AVG (response time): 27,500 ms (27.5 seconds!)
The response time is initially great, below one second, but at around 400 users it starts to jump massively.
I have tested obvious factors that can influence that response time, results below:
Compute Engine instances
(2 x standard-n2, 200 GB disk, 7.5 GB RAM per instance):
Only about 20% CPU utilization
Outgoing network bytes: 340k bytes/sec
Incoming network bytes: 190k bytes/sec
Disk operations: 1 op/sec
Memory: below 10%
MySQL:
Max_used_connections: 41 (below the total possible)
Connection errors: 0
All other MySQL results also seem fine; nothing that points to a bottleneck.
I tried the same test with a freshly created Sails.js project; it did better, but the results were still bad: 5 seconds response time at about 2000 users.
What else should I test? What could be the bottleneck?
Are you doing any file reading/writing? This is a major obstacle in Node.js and will always cause some issues. Caching read files, or removing the need for such code, should be done as much as possible. In my own experience, serving files like images, CSS, and JS through my Node server started causing trouble as the number of concurrent requests increased. The solution was to serve all of this through a CDN.
Another problem could be the MySQL driver. We had some problems with connections not being closed correctly (not using Sails.js, but I think it used the same driver at the time I encountered this), which caused problems on the MySQL server, resulting in long delays when fetching data from the database. You should time/track the number of MySQL queries and make sure they aren't delayed.
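To make that concrete, here is a rough sketch with the plain Node.js "mysql" driver: use a pool, always release connections back to it, and time every query so slow database calls show up. The connection settings and query are placeholders, and Sails.js manages its own pool, so this only shows the shape of the check:

```typescript
import mysql from "mysql";

const pool = mysql.createPool({
  host: "localhost",
  user: "app",
  password: "secret",
  database: "mydb",
  connectionLimit: 10,
});

function timedQuery(sql: string): void {
  const started = Date.now();
  pool.getConnection((err, connection) => {
    if (err) throw err;
    connection.query(sql, (queryErr, rows) => {
      connection.release(); // hand the connection back instead of leaving it open
      console.log(`query took ${Date.now() - started} ms`, queryErr ?? rows);
    });
  });
}

timedQuery("SELECT 1");
```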
Lastly, it could be some issue specific to Sails.js and Google Compute Engine. You should make sure there aren't any open issues on either of these about the same problem you are experiencing.

Azure WebSites / App Service Unexplained 502 errors

We have a stateless WebApp (with a shared Azure Redis Cache) that we would like to scale automatically via the Azure auto-scale service. When I activate auto-scale-out, or even when I activate 3 fixed instances for the WebApp, I get the opposite effect: response times increase exponentially or I get HTTP 502 errors.
This happens whether I use our configured Traffic Manager URL (which worked fine for months with single instances) or the native URL (.azurewebsites.net). Could this have something to do with Traffic Manager? If so, where can I find info on this combination (I have searched)? And how do I properly leverage auto-scale with Traffic Manager failover/performance routing? I have tried putting Traffic Manager in both failover and performance mode with no evident effect. I can gladly provide links via private channels.
UPDATE: We have now reproduced the situation the "other way around": on the account where we were getting the frequent 5XX errors, we removed all load-balanced servers (only one server per app now) and the problem disappeared. And on the other account, we started to balance across 3 servers (no Traffic Manager configured) and soon got the frequent 502 and 503 show-stoppers.
Related hypothesis here: https://ask.auth0.com/t/health-checks-response-with-500-http-status/446/8
Possibly the cause? Any takers?
UPDATE
After reverting all WebApps to single instances to rule out any relationship to load balancing, things ran fine for a while. Then the same "502" behavior reappeared across all servers for a period of approx. 15 min on 04.Jan.16, then disappeared again.
UPDATE
Problem reoccurred for a period of 10 min at 12.55 UTC/GMT on 08.Jan.16 and then disappeared again after a few minutes. Checking logfiles now for more info.
UPDATE
Problem reoccurred for a period of 90 min at roughly 11.00 UTC/GMT on 19.Jan.16, also on the .scm. page. This is the "reference-client" Web App on the account with a Web App named "dummy1015". "502 - Web server received an invalid response while acting as a gateway or proxy server."
I don't think Traffic Manager is the issue here. Since Traffic Manager works at the DNS level, it cannot be the source of the 5XX errors you are seeing. To confirm, I suggest the following:
1. Check whether the increased response times are coming from the DNS lookup or from the web request (see the sketch below).
2. Introduce Traffic Manager while keeping your single-instance / non-load-balanced setup, and confirm that the problem does not reappear.
This will help confirm whether the issue relates to Traffic Manager or to some other aspect of the load balancing.
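For the first check, a minimal Node.js (18+) sketch that times the DNS lookup on its own and then the full HTTP request (which includes its own, usually cached, lookup); the hostname is a placeholder for the Traffic Manager or .azurewebsites.net name:

```typescript
import { promises as dns } from "node:dns";

async function timeLookupAndRequest(hostname: string): Promise<void> {
  let t = Date.now();
  await dns.lookup(hostname);
  console.log(`DNS lookup: ${Date.now() - t} ms`);

  t = Date.now();
  const res = await fetch(`https://${hostname}/`);
  await res.arrayBuffer(); // drain the body so the timing covers the full response
  console.log(`HTTP request: ${Date.now() - t} ms (status ${res.status})`);
}

timeLookupAndRequest("example.trafficmanager.net");
```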
Regards,
Jonathan Tuliani
Program Manager
Azure Networking - DNS and Traffic Manager

Site scraping: why am I getting DNS issues after multiple hits?

I am scraping a site for data every 50-90 seconds (randomized) using a C# console application running on .NET 4.5. There are a couple of values I post to the site, and based on the returned value I kick off some other process. The problem is that after roughly a thousand hits or so I get what looks like a DNS error. I am trying to work out the source of the problem first, before trying to fix it. Below are some of the errors I see in my logs:
The remote name could not be resolved
Unable to connect to the remote server
Unexpected character encountered while parsing value <. Path '', line 0, position 0.
Unable to read data from the transport connection: An existing connection was forcibly closed by the remote host.
Unable to read data from the transport connection: An established connection was aborted by the software in your host machine.
About 60% of the time I get the first error. The remaining 40% is divided between the rest of the errors listed above. Are these issues caused by the website I am scraping, by the DNS servers on my end, or by something else? For all practical purposes the website I am scraping is OK with it, as long as I keep the interval between automated hits above 45 seconds, which I am doing. The data I am downloading is on average about 30 KB per hit. Please help me understand what could be going wrong and what I could try to fix it.
I'd say you're running up against an automated system designed to protect the site against a DDoS attack (http://en.wikipedia.org/wiki/Denial-of-service_attack).
It sees that the same IP address is hitting it repeatedly in a short space of time and is simply blocking your resolution of the eventual server.
