Downloading an image sometimes times out when requested from AWS Lambda - node.js

Background
I have an AWS Lambda function (Node, currently v16) that has been doing its job for many months. It is actively maintained. One of the things it does is download a number of images whose URLs are supplied via the payload.
Problem
Invocations suddenly started failing sporadically. It quickly became clear that some of the requests for downloading the images were timing out. The reason for these timeouts, however, is a mystery.
Potentially useful details
A list of implementation details, debugging steps I have tried, and other random info that could shed some light on this.
For each invocation, there are typically between five and fifteen images to be downloaded.
Most invocations are successful, i.e. all images are downloaded quickly and without problems.
Invocations fail about 10-15% of the time. In most instances, only a single image fails to download. But I've also seen two or three images time out.
When retrying invocations, it is always the same image that times out. Sooner or later though, it will suddenly work. Sometimes it works on the second try, sometimes I have to wait a little longer.
Even when one invocation with a particular payload is failing, others can be successful at the same time.
When running similar code locally (same payload, same image downloading code) it never fails. So even when the images time out when the request comes from AWS Lambda, it'll still work from my machine.
Any ideas?
I am open to any suggestions, theories, or debugging ideas.
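One debugging idea, sketched under assumptions: record how long each phase of the request takes (DNS lookup, TCP connect, TLS handshake, first byte), so a timeout can at least be attributed to a phase. `timedDownload` is a hypothetical helper, not code from the function described above, and it relies on the default (non keep-alive) agent so that every request opens a fresh socket.

```javascript
const https = require('https');

// Hypothetical helper: downloads `url` and records, in milliseconds since the
// request started, when each phase completed.
function timedDownload(url, timeoutMs = 10000) {
  return new Promise((resolve, reject) => {
    const start = process.hrtime.bigint();
    const phases = {};
    const mark = (name) => {
      phases[name] = Number(process.hrtime.bigint() - start) / 1e6;
    };

    // `timeout` is a socket idle timeout, not a limit on the total duration.
    const req = https.get(url, { timeout: timeoutMs }, (res) => {
      mark('firstByte');
      const chunks = [];
      res.on('data', (chunk) => chunks.push(chunk));
      res.on('end', () => {
        mark('done');
        resolve({ status: res.statusCode, body: Buffer.concat(chunks), phases });
      });
      res.on('error', reject);
    });

    req.on('socket', (socket) => {
      socket.on('lookup', () => mark('dns'));        // DNS resolution finished
      socket.on('connect', () => mark('tcp'));       // TCP connection established
      socket.on('secureConnect', () => mark('tls')); // TLS handshake finished
    });

    // Node only emits 'timeout'; the request has to be destroyed explicitly.
    req.on('timeout', () => {
      req.destroy(new Error(`socket idle for ${timeoutMs} ms`));
    });

    req.on('error', (err) => {
      err.phases = phases; // shows how far the request got before failing
      reject(err);
    });
  });
}
```

If a failing download rejects with an empty `phases` object, the request never got past DNS; if `dns` is present but `tcp` is not, the connection itself is stalling, and so on.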

Related

Using child_process.exec in Google Functions is very slow and random

I am trying to execute arbitrary user provided code in a reasonably secure and controlled way. I have been doing that using child_process.exec within a Google Cloud Function.
However, I'm finding that the execution time can vary pretty dramatically.
Running a single console.log inside a Cloud Function directly, vs within child_process.exec inside a Cloud Function, results in an overhead of 500-4000 ms in execution time.
It seems a little crazy that it both:
Can vary so widely.
Can take over 4 seconds extra to run in a separate process.
My guess is that this is because they are essentially only allocating one thread to the Cloud Function, and my process has to wait around for another one on that machine to free up.
Is there something I can do to even this out?
UPDATE:
So I've been able to reproduce this consistently. It's definitely something causing require statements to take a while inside child_process.exec in Cloud Functions when the dependencies are medium/large.
I was originally able to reproduce it just by using Mocha to execute an empty unit test.
But I created a whole repo to reproduce it better here
And a blog post talking about my results here
I'd be interested if someone could explain this.
For the moment, the issue appears to be that require calls to medium/large dependencies, made inside a child_process.exec call, can sometimes take a while.
Not sure why.
Cloud Run does not have this issue.
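A minimal sketch of how the require overhead could be measured, assuming nothing about the linked repo: it times a fresh Node child process with and without a `require` call. `timeChild` and `compare` are hypothetical names, and `execFile` is swapped in for `exec` to avoid shell quoting, so shell startup is not part of the measurement.

```javascript
const { execFile } = require('child_process');

// Runs `script` in a fresh Node process and resolves with the wall-clock
// time (in ms) from spawn until the child has exited.
function timeChild(script) {
  return new Promise((resolve, reject) => {
    const start = process.hrtime.bigint();
    execFile(process.execPath, ['-e', script], (err, stdout) => {
      if (err) return reject(err);
      const ms = Number(process.hrtime.bigint() - start) / 1e6;
      resolve({ ms, stdout });
    });
  });
}

// Compares a bare console.log against the same script preceded by a require.
async function compare(moduleName) {
  const baseline = await timeChild('console.log("hi")');
  const withRequire = await timeChild(
    `require(${JSON.stringify(moduleName)}); console.log("hi")`
  );
  return {
    baselineMs: baseline.ms,
    withRequireMs: withRequire.ms,
    requireOverheadMs: withRequire.ms - baseline.ms,
  };
}

// Example (the module name is a placeholder for a medium/large dependency):
// compare('mocha').then(console.log);
```

Running `compare` several times in a row, both inside the function and on a local machine, would show whether the variance comes from the child process itself or from loading the dependency.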

spotify models.player.load(...) promise not resolving

Has anyone ever noticed that the Promise from models.player.load("context","playing"); simply doesn't come back (which is to say, neither its fail nor its done callback ever fires)?
This doesn't happen every time I try the operation, only sporadically.
Working with API version 1.0.0, on Spotify for Windows.
Yep, on timeouts neither fail nor done seems to fire. At least I assume they're timeouts, because of their sporadic appearance and the different behavior for different users.
See later comments on my answer here: Intermittent issue with tracks snapshot for current user top list
Can be pretty frustrating. Only option I know of at this point is a success flag and a retry timer.
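The "success flag and a retry timer" idea could look roughly like the sketch below. `loadWithRetry` is a hypothetical helper written in plain callback style; it assumes only that the operation reports back through a success and a failure callback, as the done/fail pair mentioned above does.

```javascript
// Starts an operation and retries it if neither callback fires within
// `timeoutMs`. Gives up after `maxAttempts` attempts.
function loadWithRetry(startOperation, options, callback) {
  var timeoutMs = options.timeoutMs;
  var maxAttempts = options.maxAttempts;
  var attempt = 0;
  var finished = false; // the "success flag": set once an answer arrived
  var timer = null;     // the "retry timer" for the current attempt

  function finish(err, value) {
    if (finished) return; // ignore late answers from earlier attempts
    finished = true;
    clearTimeout(timer);
    callback(err, value);
  }

  function tryOnce() {
    attempt += 1;
    timer = setTimeout(function () {
      if (finished) return;
      if (attempt >= maxAttempts) {
        finish(new Error('no response after ' + attempt + ' attempts'));
      } else {
        tryOnce();
      }
    }, timeoutMs);

    startOperation(
      function (value) { finish(null, value); },
      function (err) { finish(err); }
    );
  }

  tryOnce();
}

// Usage sketch (assumes the promise exposes done/fail as described above):
// loadWithRetry(function (ok, fail) {
//   models.player.load('context', 'playing').done(ok).fail(fail);
// }, { timeoutMs: 5000, maxAttempts: 3 }, function (err, player) { /* ... */ });
```

Note that a late failure from an earlier attempt ends the whole operation in this sketch; whether that is desirable depends on the call being retried.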

"General permanent error" during libspotify search

I've recently started programming in C on my Raspberry Pi. I have downloaded libspotify (I have the correct version), and have managed it pretty well.
Just recently (about 2 hours ago, around 18:00 on 30/12/2013), libspotify started to return SP_ERROR_OTHER_PERMANENT when checking for a search error in the search_complete_cb callback.
Before the error started occurring, I had built and started the program quite a few times (and thus logged in many times during only a short period of time), and to test my 'Search' feature, I had used the same query every time. Then, without making any changes to my program, suddenly there were no results returned after calling sp_search_create.
I am worried that the developer account has been somehow suspended, either for repeatedly logging in or because it seemed weird to the Spotify crew that I would search for the same query all the time. I don't really know what causes the problem. No emails or warnings have been sent to the address connected to the account. The problem has lasted for a while now, so it doesn't seem to be going away on its own.
Additional details
log_message tells me there is a ChannelError(4, 0, search). I have also seen ChannelError(5, 0, search), but only once.
I can still play music from the official Spotify desktop client for Windows.
I have an earlier version of the program, before I rewrote it to get a bit more structure, that works. The same API key and same credentials are used in both programs, so that excludes a ban. The rewrite does log in, but no results are returned from searching. In the old version, I get a lot of results. All working. I have rebooted the Raspberry Pi several times, but that doesn't seem to help.
If you need any code or other information, I'll be happy to share. Just point out what's needed, because the code is split over a lot of files.
Well, if your old one is working then the problem will be in your rewrite. Don't pay too much notice to the error messages, they're pretty much par for the course, and can be triggered by something as benign as a cache miss. Unless you're actually getting an error callback somewhere, the log messages are meaningless.
As for your problem, I can't really make any guesses without seeing your code. One thing to check, and the most common cause of a permanent error, is whether you're actually logged in. The login process is asynchronous, and any functionality that requires you to be logged in (searching is one of them) will fail before login is completed.

JSF2 slow page loading

I'm working with a JSF2 webapp. When I'm navigating between different pages they normally load fast; less than 100 ms. Sometimes though, for no apparent reason, it takes several seconds.
I've been trying to find some common denominator for when this occurs, but it happens regardless of page and regardless if I have visited the page several times before. Also, after a page has been slow to load, the next time I load it, it will load fast again for some time.
It all seems to happen randomly.
I have tried to find out which part of the application takes the time. I've timed more or less everything I can think of, and it's not the database calls, the logic in my classes, or anything like that. Instead, judging by the "network" graph in Chrome, the initial call to the page is the time thief: on those occasions, the latency for that first call is several seconds.
Had this been due to my own bad code, I could at least have timed it and found out where I had made mistakes. Seeing that this seems to happen before my own code is even reached, I have no idea about how to solve this problem.
This may not be the actual reason for the problem, but I noticed that my internet connection was going up and down, which seems to affect the application even though I'm running a local server.
If I have made a request to the application and the internet connection goes down, the requested page won't load, and as soon as the connection is back, the page loads.
I didn't think that this would affect the application at all, as the server is local and I can disable the internet connection and still access the application.

Node.js app has periodic slowness and/or timeouts (does not accept incoming requests)

This problem is killing the stability of my production servers.
To recap, the basic idea is that my node server(s) sometimes intermittently slow down, sometimes resulting in Gateway Timeouts. As best as I can tell from my logs, something is blocking the node thread (meaning that the incoming request is not accepted), but I cannot for the life of me figure out what.
The problem ranges in severity. Sometimes what should be <100ms requests take ~10 seconds to complete; sometimes they never even get accepted by the node server at all. In short, it is as though some random task is working and blocking the node thread for a period of time, thus slowing down (or even blocking) incoming requests; the one thing I can say for sure is that the need-to-fix-symptom is a "Gateway Timeout".
The issue comes and goes without warning. I have not been able to correlate it against CPU usage, RAM usage, uptime, or any other relevant statistic. I've seen the servers handle a large load fine, and then have this error with a small load, so it does not even appear to be load-related. It is not unusual to see the error around 1am PST, which is the smallest load time of the day! Restarting the node app does seem to maybe make the problem go away for a while, but that really doesn't tell me much. I do wonder if it might be a bug in node.js... not very comforting, considering it is killing my production servers.
The first thing I did was to make sure I had upgraded node.js to the latest (0.8.12), as well as all my modules (here they are). Of course, I also have plenty of error catchers in place. I'm not doing anything funky like printing out lots to the console or writing to lots of files.
At first, I thought it was outbound HTTP requests blocking the incoming socket, because the express middleware was not even picking up the inbound request, but I gave up the theory because it looks like the node thread itself became busy.
Next, I went through all my code with JSHint and fixed literally every single warning, including a few accidental globals (forgetting to write "var"), but this didn't help.
After that, I assumed that perhaps I was running out of memory. But, my heap snapshots via nodetime are looking pretty good now (described below).
Still thinking that memory might be an issue, I took a look at garbage collection. I enabled the --nouse-idle-notification flag and did some more code optimization to NULL objects when they were not needed.
Still convinced that memory was the issue, I added the --expose-gc flag and executed the gc(); command every minute. This did not change anything, except to occasionally make requests a bit slower perhaps.
In a desperate attempt, I set up the "cluster" module to use 2 workers and automatically restart them every 30 min. Still, no luck.
I increased the ulimit to over 10,000 and kept an eye on the open files. There seem to be < 300 open files (or sockets) per node.js app, and increasing the ulimit thus had no impact.
I've been logging my server with nodetime and here's the gist of it:
CentOS 5.2 running on the Amazon Cloud (m1.large instance)
Greater than 5000 MB free memory at all times
Less than 150 MB heap size at all times
CPU usage is less than 60% at all times
I've also checked my MongoDB servers, which have <5% CPU usage and no requests are taking > 100ms to complete, so I highly doubt there's a bottleneck.
I've wrapped (almost) all my code using Q-promises (see code sample), and of course have avoided Sync() calls like the plague. I've tried to replicate the issue on my testing server (OSX), but have had little luck. Of course, this may be just because the production servers are being used by so many people in so many unpredictable ways that I simply cannot replicate via stress tests...
Many months after I first asked this question, I found the answer.
In a nutshell, the problem was that I was not piping a big asset when transferring it from one server to another. In other words, I was downloading an image from one server before uploading it to an S3 bucket. Instead of streaming the download into the upload, I downloaded the file into memory, and then uploaded it.
I'm not sure why this did not show up as a memory spike, or elsewhere in my statistics.
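The difference between buffering and streaming can be sketched with Node's `stream.pipeline`, which is available in current Node versions (it did not exist in the 0.8 release used at the time). `streamInto` is a hypothetical helper; the source and destination in the usage comment are placeholders.

```javascript
const { pipeline } = require('stream');

// Pipes `source` into `destination` chunk by chunk, so the whole asset is
// never held in memory at once. Resolves when the destination has finished;
// rejects (and destroys both streams) if either side fails.
function streamInto(source, destination) {
  return new Promise((resolve, reject) => {
    pipeline(source, destination, (err) => (err ? reject(err) : resolve()));
  });
}

// Usage sketch (imageUrl and uploadRequest are placeholders):
// https.get(imageUrl, (res) => {
//   streamInto(res, uploadRequest).then(() => console.log('transferred'));
// });
```

Because `pipeline` respects backpressure, a slow upload pauses the download instead of letting chunks pile up in memory.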
My guess is Mongoose. If you are storing large payloads in Mongo, Mongoose can be pretty slow due to how it builds the Mongoose objects. See https://github.com/LearnBoost/mongoose/issues/950 for more details on the problem. If this is the problem you wouldn't see it in Mongo itself since the query returns quickly, but object instantiation could take 75x the query time.
Try setting up timers (process.hrtime()) before and after the Mongoose objects are created to see if that might be the problem. If it is, I would switch to using the node Mongo driver directly instead of going through Mongoose.
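A small sketch of such a timer, built on `process.hrtime()`. `startTimer` is a hypothetical helper, and the model and query names in the usage comment are placeholders.

```javascript
// Returns a function that reports the milliseconds elapsed since startTimer()
// was called, using the monotonic high-resolution clock.
function startTimer() {
  const start = process.hrtime();
  return function elapsedMs() {
    const [seconds, nanoseconds] = process.hrtime(start);
    return seconds * 1e3 + nanoseconds / 1e6;
  };
}

// Usage sketch (Model and query are placeholders):
// const elapsed = startTimer();
// Model.find(query, (err, docs) => {
//   console.log('query + object creation took', elapsed().toFixed(1), 'ms');
// });
```

Timing the same query once normally and once with Mongoose's `.lean()` (which skips building Mongoose documents) would separate the query time from the object-creation time.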
You are heavily leaking memory, try setting every object to null as soon as you don't need it anymore! Read this.
More information about hunting down memory leaks can be found here.
Give special attention to having multiple references to the same object and check if you have circular references, those are a pain to debug but will help you very much.
Try invoking the garbage collector manually every minute or so (I don't know if you can do this in node.js because I'm more of a C++ and PHP coder). From my years of experience working with C++ I can tell you the most likely cause of your application slowing down over time is memory leaks; find them and plug them, and you'll be OK!
Also, assuming you're not caching and/or processing images, audio or video in memory or anything like that, a 150M heap is a lot! That could be hundreds of thousands or even millions of small objects.
You don't have to be running out of memory for your application to slow down... just searching for free memory with that many objects already allocated is a huge job for the memory allocator, it takes a lot of time to allocate each new object and as you leak more and more memory that time only increases.
Is "--nouse-idle-connection" a mistake? Do you really mean "--nouse_idle_notification"?
I think it may be a GC issue caused by too many tiny objects.
Node runs in a single process, so watching the busiest CPU core is much more important than watching the overall load.
When your program is slow, you can run "gdb node pid" and then "bt" to see what node is busy doing.
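A complementary check from inside the process, sketched under assumptions: if something is blocking the node thread, a repeating timer will fire late, and the delay can be logged together with a timestamp to correlate with the slow requests. `monitorEventLoopLag` is a hypothetical helper.

```javascript
// Calls onLag(lagMs) whenever a timer that should fire every `intervalMs`
// fires more than `thresholdMs` late, which means the event loop was blocked.
// Returns a function that stops the monitor.
function monitorEventLoopLag(intervalMs, thresholdMs, onLag) {
  let expected = Date.now() + intervalMs;
  const timer = setInterval(() => {
    const now = Date.now();
    const lag = now - expected;
    if (lag > thresholdMs) onLag(lag);
    expected = now + intervalMs;
  }, intervalMs);
  timer.unref(); // do not keep the process alive just for monitoring
  return () => clearInterval(timer);
}

// Usage sketch:
// monitorEventLoopLag(100, 200, (lag) => {
//   console.warn(new Date().toISOString(), 'event loop blocked for ~' + lag + ' ms');
// });
```

If Gateway Timeouts occur without any lag being reported, the node thread was probably not the bottleneck at that moment.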
What I'd do is set up a parallel node instance on the same server with some kind of echo service and test that one. If it runs fine, you narrow down your problem to your program code (and not a scheduler/OS-level problem). Then, step by step, include the modules and test again. Certainly this is a lot of work and takes a long time, and I don't know if it is doable on your system.
If you need to get this working now, you can go the NASA redundancy route:
Bring up a second copy of your production servers, and put a proxy in front of them which routes each request to both stacks and returns the first response. I don't recommend this as a perfect long-term solution but it should help significantly reduce issues in production now, and help you gather log data that you could replay to recreate the issues on non-production servers.
Obviously, this is straight-forward for read requests, but more complex for commands which write to the db.
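The core of that routing idea, reduced to a sketch: send the same read request to every stack and use whichever answers successfully first. `firstResponse` and `sendTo` are hypothetical names, `Promise.any` needs Node 15 or newer, and the sketch deliberately ignores write requests.

```javascript
// Sends the request to every backend via `sendTo(backend)` (which must return
// a promise) and resolves with the first successful response. Rejects only if
// all backends fail.
function firstResponse(backends, sendTo) {
  return Promise.any(backends.map((backend) => sendTo(backend)));
}

// Usage sketch (stackA, stackB and forwardGet are placeholders):
// firstResponse([stackA, stackB], (backend) => forwardGet(backend, req.url))
//   .then((response) => res.end(response.body));
```

Slower responses are simply discarded here; logging them instead would provide the replayable data mentioned above.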
We had a similar problem with our Node.js server. It didn't scale well for weeks and we tried almost everything you did. Our problem was the implicit backlog value, which is set very low for highly concurrent environments.
http://nodejs.org/api/http.html#http_server_listen_port_hostname_backlog_callback
Setting the backlog to a significantly higher value (e.g. 10000), as well as tuning networking in our kernel (/etc/sysctl.conf on Linux) as described in the manual section, helped a lot. Since then we haven't had any timeouts in our Node.js server.
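For illustration, a sketch of passing an explicit backlog to `server.listen`. `startServer` is a hypothetical wrapper and the handler is a stand-in for the real application.

```javascript
const http = require('http');

// Starts an HTTP server with an explicit backlog (the maximum length of the
// queue of pending connections). Without it, Node uses a default of 511.
function startServer(port, backlog) {
  const server = http.createServer((req, res) => {
    res.end('ok\n');
  });
  server.listen({ port, backlog });
  return server;
}

// Usage sketch:
// const server = startServer(8080, 10000);
```

On Linux the kernel caps the effective value at net.core.somaxconn, which is why the kernel tuning mentioned above has to accompany the larger backlog.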

Resources