Make AWS Lambda send back a file after several minutes of processing - node.js

I have a Node app that takes a URL, scrapes some text with Puppeteer, and translates it using DeepL before sending me back the result in a txt file. It works as expected locally, but since I have a lot of URLs to visit and want to learn, I'm trying to make this app work with AWS Lambda and a Docker image.
I was thinking about using a GET/POST request to send the URL to API Gateway to trigger my Lambda and wait for it to send me back the txt file. The issue is that the whole process takes 2-3 minutes to complete and send back the file. That's not a problem locally, but I know you should not have an HTTP request wait 3 minutes before returning.
I don't really know how to tackle this problem. Should I create a local server and make the Lambda post a request to my IP address once it is done?
I'm at a loss here.
Thanks in advance!

There are a few alternatives for what is essentially an asynchronous processing concern.
One option is to invoke the Lambda with the data it needs (via an API, the SDK, or the CLI) and have it write its results to an S3 bucket. You could then poll the S3 bucket for the results asynchronously and pull them down; obviously this requires some scripting.
Another approach would be to have the Lambda post the results to an SNS topic that you've subscribed to.
That said, I'm not entirely sure what is meant by local IP, but I would avoid pushing data directly to a self-managed server (or your local IP). Instead, I would use one of the AWS "decoupling" services like SNS, SQS, or even S3 to split apart the processing steps. This way you can make many requests and pull the data down as needed.
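A minimal sketch of the first option (asynchronous invoke plus S3 polling), assuming the AWS SDK for JavaScript (v2); the function name, bucket, and key scheme are placeholders:

```js
// Invoke the Lambda asynchronously, then poll S3 until the result file appears.
const AWS = require('aws-sdk');

const lambda = new AWS.Lambda({ region: 'us-east-1' });
const s3 = new AWS.S3({ region: 'us-east-1' });

const BUCKET = 'my-scraper-results';          // hypothetical bucket name
const FUNCTION_NAME = 'scrape-and-translate'; // hypothetical Lambda name

async function requestTranslation(url) {
  // "Event" invocation returns immediately; the Lambda is expected to write
  // its output to s3://BUCKET/results/<jobId>.txt when it finishes.
  const jobId = Date.now().toString();
  await lambda.invoke({
    FunctionName: FUNCTION_NAME,
    InvocationType: 'Event',
    Payload: JSON.stringify({ url, jobId }),
  }).promise();
  return jobId;
}

async function waitForResult(jobId, { intervalMs = 15000, maxTries = 20 } = {}) {
  const key = `results/${jobId}.txt`;
  for (let i = 0; i < maxTries; i++) {
    try {
      const obj = await s3.getObject({ Bucket: BUCKET, Key: key }).promise();
      return obj.Body.toString('utf8'); // the translated text
    } catch (err) {
      if (err.code !== 'NoSuchKey' && err.code !== 'NotFound') throw err;
      await new Promise((r) => setTimeout(r, intervalMs)); // not ready yet, wait
    }
  }
  throw new Error(`Result for job ${jobId} not found after ${maxTries} tries`);
}
```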

Related

How can my node.js server on AWS handle multiple users generating images when node calls Python program?

I have containerized Node.js code that runs on ECS. When multiple users use Node.js to call a .py image-generating program, only one user gets the image; the rest get errors. I wonder if it is appropriate to use Lambda so that the image generation can run concurrently.
For some reason, the containerized code, which uses Docker, works locally, but not on AWS when multiple users access the .py function.
If users send images from a mobile application, you can use the aws-sdk to upload images from the mobile device to AWS S3 and configure a Lambda trigger for the image upload.
This Lambda will process the image data and return the result.
Since Lambda is serverless, it can handle a very large number of concurrent invocations.
So, if you/your team can add the aws-sdk to the mobile application, it's a nice approach to upload the image directly from the device to S3, trigger a Lambda for image processing, and update the user's data in some storage.
If you have an untrusted environment, like a user's browser, it's not OK to upload the image directly from the browser to S3, since to do that you would have to expose AWS access keys.
So, in that case, it's fine to upload the image to your server and transfer it from the server to S3.
After this, the logic stays the same: trigger AWS Lambda, process the data, and update it in storage.
This approach reduces load on the server and lets you work on features instead of image storage and other concerns that would otherwise burden your server.
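For reference, a minimal sketch of the Lambda side of this, i.e. a handler fired by an S3 upload event; the processing step and output key scheme are placeholders:

```js
// Lambda handler triggered by S3 "ObjectCreated" events.
const AWS = require('aws-sdk');
const s3 = new AWS.S3();

exports.handler = async (event) => {
  // An S3 event can contain one or more records
  for (const record of event.Records) {
    const bucket = record.s3.bucket.name;
    const key = decodeURIComponent(record.s3.object.key.replace(/\+/g, ' '));

    // Download the uploaded image
    const { Body } = await s3.getObject({ Bucket: bucket, Key: key }).promise();

    // ... process the image here (resize, analyze, etc.) ...
    const processed = Body; // placeholder for the real processing result

    // Write the result somewhere the application can read it
    await s3.putObject({
      Bucket: bucket,
      Key: `processed/${key}`,
      Body: processed,
    }).promise();
  }
};
```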

NodeJS API sync uploaded files through SFTP

I have a NodeJS REST API which has endpoints for users to upload assets (mostly images). I distribute my assets through a CDN. The way I do it right now: I call my endpoint /assets/upload with a multipart form, the API creates the DB resource for the asset and then uses SFTP to transfer the image to the CDN origin. Upon success I respond with the URL of the uploaded asset.
I noticed that the most expensive operation for relatively small files is the connection to the origin through SFTP.
So my first question is:
1. Is it a bad idea to always keep the connection alive so that I can always reuse it to sync my files?
My second question is:
2. Is it a bad idea to have my API handle the SFTP transfer to the CDN origin, or should I consider having a CDN origin that could handle the HTTP request itself?
Short answer: (1) it is not a bad idea to keep the connection alive, but it comes with complications; I recommend trying without reusing connections first. And (2) the upload should go through the API, but there may be ways to optimize how the API-to-CDN transfer happens.
Long Answer:
1. Is it a bad idea to always keep the connection alive so that I can always reuse it to sync my files.
It is generally not a bad idea to keep the connection alive. Reusing connections can improve site performance, generally speaking.
However, it does come with some complications. You need to make sure the connection is up. You need to make sure that if the connection went down you recreate it. There are cases where the SFTP client thinks that the connection is still alive, but it actually isn't, and you need to do a retry. You also need to make sure that while one request is using a connection, no other requests can do so. You would possibly want a pool of connections to work with, so that you can service multiple requests at the same time.
If you're lucky, the SFTP client library already handles this (see if it supports connection pools). If you aren't, you will have to do it yourself.
My recommendation: try it without reusing the connection first and see if the site's performance is acceptable. If it isn't, then consider reusing connections. Be careful, though.
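If you do end up reusing connections, here is a rough sketch of a small connection pool, assuming the ssh2-sftp-client and generic-pool packages (host, credentials, and pool sizes are illustrative, and retry handling is omitted):

```js
// Pool a few SFTP connections so concurrent requests don't share one session.
const SftpClient = require('ssh2-sftp-client');
const genericPool = require('generic-pool');

const sftpPool = genericPool.createPool(
  {
    create: async () => {
      const client = new SftpClient();
      await client.connect({
        host: 'cdn-origin.example.com', // hypothetical CDN origin
        port: 22,
        username: process.env.SFTP_USER,
        password: process.env.SFTP_PASS,
      });
      return client;
    },
    destroy: (client) => client.end(),
  },
  { max: 5, min: 1 } // small pool so several uploads can run in parallel
);

async function uploadAsset(localPath, remotePath) {
  const client = await sftpPool.acquire();
  try {
    await client.put(localPath, remotePath);
  } finally {
    // Always return the connection to the pool, even on failure
    await sftpPool.release(client);
  }
}
```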
2. Is it a bad idea to have my API handle the SFTP transfer to the CDN origin, or should I consider having a CDN origin that could handle the HTTP request itself?
It is generally a good idea to have the HTTP request go through the API for a couple of reasons:
For security reasons, you want your CDN upload credentials to be stored on your API, not on your client (website or mobile app). You should assume that your website code can be seen (via view source) and that people can generally decompile or reverse engineer mobile apps, so they would be able to see your credentials in the code.
This hides implementation details from the client, so you can change this in the future without the client code needing to change.
@tarun-lalwani's suggestion is actually a good one - use S3 to store the image, and use a Lambda trigger to upload it to the CDN. There are a couple of Node.js libraries that allow you to stream the image through your API's HTTP request directly to the S3 bucket. This means that you don't have to worry about disk space on your machine instance.
Regarding your question in reply to @tarun-lalwani's comment - one way to do it is to use the S3 image URL path until the Lambda function is finished. S3 can serve images too, if given the proper permissions. Then, after the Lambda function has finished uploading to the CDN, you just replace the image path in your DB.
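As an example of the streaming approach mentioned above, here is a sketch assuming the multer and multer-s3 packages (multer-s3 v2, which works with the v2 aws-sdk); the bucket name, form field, and key scheme are placeholders:

```js
// Stream multipart uploads from the HTTP request straight to S3,
// without buffering the file on local disk.
const express = require('express');
const multer = require('multer');
const multerS3 = require('multer-s3');
const AWS = require('aws-sdk');

const app = express();
const s3 = new AWS.S3();

const upload = multer({
  storage: multerS3({
    s3,
    bucket: 'my-asset-uploads', // hypothetical staging bucket
    key: (req, file, cb) => cb(null, `uploads/${Date.now()}-${file.originalname}`),
  }),
});

app.post('/assets/upload', upload.single('asset'), (req, res) => {
  // multer-s3 exposes the resulting object URL on req.file.location
  res.json({ url: req.file.location });
});

app.listen(3000);
```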

get amazon S3 to send http request upon file upload

I need my Node.js application to receive an HTTP request with a file name when a file is uploaded to my S3 bucket.
I would like some recommendations on the most simple/straightforward way to achieve this.
So far I see 3 ways to do this, but I feel I'm overthinking it, and there surely exist better options:
1/ file uploaded to S3 -> S3 sends a notification to SNS -> SNS sends an HTTP request to my application
2/ file uploaded to S3 -> a Lambda function is triggered and sends an HTTP request to my application
3/ make my application watch the bucket on a regular basis and do something when a file is uploaded
thanks
PS: yes, I'm really new to Amazon services :)
SNS: Will work OK, but you'll have to manage the SNS topic subscription. You also won't have any control over the HTTP post's format.
Lambda: This is what I would go with. It gives you the most control.
Polling the bucket: how would you efficiently check for new objects, exactly? This isn't a good solution.
You could also have S3 post the new object events to SQS, and configure your application to poll the SQS queue instead of listening for an HTTP request.
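A rough sketch of that SQS polling approach, assuming the AWS SDK for JavaScript (v2) and a hypothetical queue URL:

```js
// Long-poll an SQS queue that receives S3 "new object" event notifications.
const AWS = require('aws-sdk');
const sqs = new AWS.SQS({ region: 'us-east-1' });

const QUEUE_URL = 'https://sqs.us-east-1.amazonaws.com/123456789012/s3-upload-events'; // hypothetical

async function pollQueue() {
  while (true) {
    const { Messages = [] } = await sqs.receiveMessage({
      QueueUrl: QUEUE_URL,
      MaxNumberOfMessages: 10,
      WaitTimeSeconds: 20, // long polling: wait up to 20s for messages
    }).promise();

    for (const msg of Messages) {
      const body = JSON.parse(msg.Body);
      // S3 event notifications carry one or more records
      for (const record of body.Records || []) {
        const fileName = record.s3.object.key;
        console.log('New file uploaded:', fileName);
        // ... hand the file name to the application here ...
      }
      // Delete the message once it has been processed
      await sqs.deleteMessage({
        QueueUrl: QUEUE_URL,
        ReceiptHandle: msg.ReceiptHandle,
      }).promise();
    }
  }
}

pollQueue().catch(console.error);
```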
SNS - If you want to call multiple services when S3 is updated, then I would suggest SNS. You can create an SNS topic, and that topic can have multiple subscribers. Later, if you want to add more HTTP endpoints, it would be as simple as subscribing them to the topic.
Lambda - If you need to send the notification to only one HTTP endpoint, then I would strongly recommend this.
SQS - You don't need SQS in this scenario. SQS is mainly for decoupling components and would be the best fit for microservices, but you can use it with other messaging systems as well.
You don't need to build something of your own to regularly monitor the bucket for changes, as there are already services for this like Lambda, SNS, etc.
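For completeness, a minimal sketch of the Lambda option: a function triggered by the S3 upload that POSTs the file name to your application. The endpoint URL is a placeholder:

```js
// Lambda triggered by an S3 upload; notifies the application over HTTPS.
const https = require('https');

function postJson(url, payload) {
  const body = JSON.stringify(payload);
  const { hostname, pathname } = new URL(url);
  return new Promise((resolve, reject) => {
    const req = https.request(
      {
        hostname,
        path: pathname,
        method: 'POST',
        headers: {
          'Content-Type': 'application/json',
          'Content-Length': Buffer.byteLength(body),
        },
      },
      (res) => {
        res.on('data', () => {});
        res.on('end', () => resolve(res.statusCode));
      }
    );
    req.on('error', reject);
    req.write(body);
    req.end();
  });
}

exports.handler = async (event) => {
  for (const record of event.Records) {
    const fileName = decodeURIComponent(record.s3.object.key.replace(/\+/g, ' '));
    // Notify the application that a new file has arrived
    await postJson('https://myapp.example.com/s3-uploads', { fileName }); // hypothetical URL
  }
};
```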

Serverless Framework Facebook Bot Slow (AWS Lambda)

I'm working on a Facebook chat bot, and I'm developing it using the Serverless Framework (Node.js) and deploying it to AWS Lambda. For the first few weeks, I just ran a local Lambda simulator using the serverless-offline plugin and everything was working great. Yesterday, I finally decided to deploy it to AWS Lambda, and now I see a significant drop in performance and consistency. Sometimes the bot takes 10 seconds to respond and sometimes it is instantaneous. The weird part is that the Lambda logs always say the function completes in around 150 ms, which seems super fast, but the Facebook bot simply doesn't mirror that speed. I am hitting a database, but the queries are definitely not taking anywhere near 10 seconds to run.
UPDATE:
I decided to test the bot by manually sending requests to the API endpoint using Postman (which is basically curl). Every time the API responded instantly, even when I sent the exact same request body that Messenger does. So it seems like the request is just taking a long time to reach the Lambda API, but once it gets there it runs as it should. Any ideas of how to fix this?
If the API is responding quickly to your curl request, then the problem isn't on AWS's end. Try comparing when you send your request via Facebook to your app with when your app receives it.
If it's getting held up on Facebook's end, I'm afraid there isn't much you can do to solve it.
Another issue could be the data center your Lambda is running in versus where Facebook is. For example, using chkutil.com, you can see that facebook.com seems particularly slow from the Asia-Pacific data centers.
As it turns out, Facebook was experiencing DNS issues and has since resolved them.

Use case of AWS lambda to nodejs project on Elastic Beanstalk

I have a thumbnailing function running on Lambda, and I want to deploy it on Elastic Beanstalk. Lambda did a lot of background work for me, so when I deploy my function to Elastic Beanstalk it doesn't work properly, as I expected.
My Lambda function can thumbnail all images in a given folder of a given S3 bucket and store different-sized versions in the same location when it's triggered. However, when I deploy it to Beanstalk, it is not triggered by any S3 events.
I know the rough steps to fix it, but I need to know a few specific things:
Before creating a Lambda function, we need to configure event sources.
I want to know if I can somehow pass them in Beanstalk; I'm thinking about passing a JSON into my Node.js function, but I don't know exactly how.
I don't know if I should run my function in an infinite loop to monitor event notifications from S3.
I want to combine this Node.js function with another independent Node.js service using Express, and I want to display a summary message in the browser about how many images have been thumbnailed. But currently, with the Lambda package structure, I'm exporting a function handler to other JS files. How can I export internal data to another static hjs/jade page?
How can I get the notifications from S3?
In brief, if it isn't worth adding such complexity to deploy the Lambda function to Beanstalk, should I just leave it as a Lambda function?
Regarding Elastic Beanstalk vs. AWS Lambda, I think Lambda is going to be more scalable, as well as cheaper, for this sort of task. And I think saving status information to a DynamoDB table would be a quick and easy way to make statistics available that you can display in your web application, while preventing those statistics from disappearing if you redeploy or restart your application. Saving that data in DynamoDB would also allow you to have more than one EC2 instance serving your website in Elastic Beanstalk without having to worry about synchronizing that data across servers somehow.
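As a rough sketch of that DynamoDB idea, assuming the aws-sdk DocumentClient and a hypothetical thumbnail-stats table keyed on statName:

```js
// Keep a running "images thumbnailed" counter in DynamoDB so the stat
// survives redeploys and is shared across instances.
const AWS = require('aws-sdk');
const dynamo = new AWS.DynamoDB.DocumentClient({ region: 'us-east-1' });

const TABLE = 'thumbnail-stats'; // hypothetical table with partition key "statName"

// Called from the Lambda after each image is thumbnailed
async function incrementThumbnailCount(count = 1) {
  await dynamo.update({
    TableName: TABLE,
    Key: { statName: 'imagesThumbnailed' },
    UpdateExpression: 'ADD #total :inc', // atomic counter update
    ExpressionAttributeNames: { '#total': 'total' },
    ExpressionAttributeValues: { ':inc': count },
  }).promise();
}

// Called from the Express app to render the summary page
async function getThumbnailCount() {
  const { Item } = await dynamo.get({
    TableName: TABLE,
    Key: { statName: 'imagesThumbnailed' },
  }).promise();
  return Item ? Item.total : 0;
}
```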
Regarding sending S3 notifications to Elastic Beanstalk, you would need to do the following:
Configure S3 to send notifications to an SNS topic
Configure the SNS topic to send notifications to an HTTP endpoint
Configure an HTTP endpoint in your Beanstalk application to receive and process those notifications
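A minimal sketch of that last step, an Express endpoint that confirms the SNS subscription and handles incoming notifications; the route and port are placeholders, and production code should also verify the SNS message signature:

```js
// Express endpoint in the Beanstalk app that receives SNS notifications
// carrying S3 events.
const express = require('express');
const https = require('https');

const app = express();
// SNS posts JSON with a text/plain content type, so parse the raw body as text
app.use(express.text({ type: '*/*' }));

app.post('/sns/s3-events', (req, res) => {
  const message = JSON.parse(req.body);

  if (message.Type === 'SubscriptionConfirmation') {
    // Confirm the subscription by visiting the SubscribeURL once
    https.get(message.SubscribeURL);
  } else if (message.Type === 'Notification') {
    // The S3 event notification is itself JSON inside the SNS "Message" field
    const s3Event = JSON.parse(message.Message);
    for (const record of s3Event.Records || []) {
      const key = record.s3.object.key;
      console.log('New upload to thumbnail:', key);
      // ... trigger the thumbnailing logic here ...
    }
  }

  res.sendStatus(200);
});

app.listen(process.env.PORT || 8080);
```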
