Boto3 Pre-Sign Upload URL takes very long to generate - python-3.x

I have a Python application that needs to generate many presigned upload URLs for a single bucket in our AWS account.
I noticed some strange behavior: every so often, generating a URL takes around 5 seconds instead of the nanoseconds it usually takes. I vaguely remember that generating a URL is a local operation, but I couldn't find any documentation on this.
I'm running a Falcon app under Gunicorn with boto3~=1.20.54. Gunicorn is configured with multiple sync workers and the same number of threads.
It should be noted that this is a rare occurrence, but it is completely random and I can't explain it. I did find some explanations that the boto3 client is somewhat lazy, in that it loads metadata about the S3 bucket on the first operation (see the Warning section). I added a no-op call to list_buckets as the first thing the app does on load, but it didn't affect the behavior.
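For reference, this is roughly the pattern in use, reduced to a minimal sketch (the bucket name and the make_upload_url helper are placeholders, not the actual application code):

```python
import boto3

# The client is created once at start-up; boto3 resolves credentials, region and
# endpoint lazily, so the no-op call below forces that work to happen early.
s3 = boto3.client("s3")
s3.list_buckets()

def make_upload_url(key: str) -> str:
    # Signing itself is a local operation once the client is fully initialized.
    return s3.generate_presigned_url(
        "put_object",
        Params={"Bucket": "my-bucket", "Key": key},  # placeholder bucket name
        ExpiresIn=3600,
    )
```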
Any help will be appreciated.

Related

Media conversion on AWS

I have an API written in Node.js (/api/uploadS3) which is a PUT request and accepts a video file and a URL (an AWS S3 URL in this case). Once called, its task is to upload the file to the S3 URL.
Now, users are uploading files to this Node API in different formats (thanks to different browsers recording videos in different formats), and I want to convert all these videos to mp4 and then store them in S3.
I wanted to know what the best approach to this is.
I have two solutions so far:
1. Convert on the Node server using ffmpeg -
The issue with this is that ffmpeg can only execute a single operation at a time, and since I have only one server I will have to implement a queue for multiple requests, which can lead to longer waiting times for users at the end of the queue. Apart from that, I am worried that my Node server's traffic-handling capability will be affected while a video conversion is in progress.
Can someone help me understand what the effect of other requests coming to my server will be while a video conversion is going on? How will it impact RAM, CPU usage, and the speed of processing other requests?
2. Using an AWS Lambda function -
To avoid load on my Node server, I was thinking of using AWS Lambda: my Node API would upload the file to S3 in the format provided by the user. Once done, S3 would trigger a Lambda function which takes that S3 file, converts it to .mp4 using ffmpeg or AWS MediaConvert, and uploads the mp4 file to a new S3 path. I don't want the output path to be just any S3 path, but the path that was received by the Node API in the first place.
Moreover, I want the user to wait while all this happens, as I have to enable other UI features based on the success or failure of this upload.
The question here is: is it possible to do this using just a single API like /api/uploadS3 which --> uploads to S3 --> triggers Lambda --> converts the file --> uploads the mp4 version --> returns success or error?
Currently, if I upload to S3 the request ends then and there. So is there a way to defer the API response until all the operations have completed?
Also, how will the Lambda function access the path of the output S3 bucket which was passed to the Node API?
Any better approach would also be welcome.
PS - the S3 path received by the Node API is different for each user.
Thanks for your post. The output S3 bucket generates file events when a new file arrives (i.e., when it is delivered by AWS MediaConvert).
This file event can trigger a second Lambda function which can move the file elsewhere using any of the supported transfer protocols, retry if necessary, log a status to AWS CloudWatch and/or AWS SNS, and then send a final API response based on success/completion of the move.
AWS has a Step Functions feature which can maintain state across successive Lambda functions, for automating simple workflows. This should work for what you want to accomplish; see https://docs.aws.amazon.com/step-functions/latest/dg/tutorial-creating-lambda-state-machine.html
Note that any one Lambda function has a 15-minute maximum runtime, so any single transcoding or file-copy operation must complete within 15 minutes. The alternative is to run on EC2.
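A minimal sketch of that second, event-triggered Lambda, written here in Python for illustration (the poster's stack is Node.js; the destination bucket is a placeholder):

```python
import boto3

s3 = boto3.client("s3")

def handler(event, context):
    # Invoked by the S3 "object created" event fired when MediaConvert
    # delivers the converted .mp4 into the output bucket.
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]

        # Move the file to its final location (placeholder bucket), then clean up.
        s3.copy_object(
            Bucket="final-destination-bucket",
            Key=key,
            CopySource={"Bucket": bucket, "Key": key},
        )
        s3.delete_object(Bucket=bucket, Key=key)
    return {"status": "done"}
```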
I hope this helps you out!

Process Redis KUE jobs within multiple kubernetes pods/instances

I'm using Sails.js for an API which I deploy from a Dockerfile to a Google Cloud Kubernetes cluster and scale the workload across 3-5 pods. The API provides endpoints to upload single image files and bigger zip files, which I extract directly on the current API pod/instance.
Both the single image files and the extracted archive content (100-1000 files, altogether 15-85 MB of content) have to be uploaded to various storage buckets. This is where Redis kue comes into play. To make sure the API doesn't block the request for too long, I create delayed kue jobs to move all the uploaded files and folders to storage buckets, or chain jobs and create thumbnails with the help of ImageMagick first.
All this can take some time, depending on the current workload of the cluster, sometimes more and sometimes less.
All this works fine with a single instance, but within a cluster it's a different story. Since the Kubernetes instance serving the API can change from request to request, the uploads can land on instance A while the job for those files is processed by instance B (the worker, as well as the API itself, runs on the same instance!), which might not have the uploads available, which leads to a failed job.
It takes time for Google to keep the pods in sync and to spread the uploads to all the other pods.
What I have tried is the following:
Since the name of the current pod is available via the env variable HOSTNAME, I store the HOSTNAME with every kue job and check within the worker whether the HOSTNAME from the job matches the HOSTNAME of the current environment, and only process the job if both HOSTNAMEs match (see the sketch below).
Uploads need to be available ASAP, which is why I can't add a job delay of a few minutes and hope that by the time the job is processed, Google has synchronized its pods.
Pending jobs whose HOSTNAME doesn't match I push back to the queue with an added delay.
What I want is a queue which doesn't have to care about hostnames and conditional checks to successfully process its jobs in a cluster like mine.
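For reference, the HOSTNAME workaround described above boils down to the pattern below (sketched in Python for consistency with the rest of this page; the real worker uses Node.js and kue, and both helpers are hypothetical stand-ins):

```python
import os

def requeue_with_delay(job: dict, delay_seconds: int) -> None:
    """Hypothetical helper: push the job back onto the queue with a delay."""
    ...

def move_uploads_to_buckets(paths: list) -> None:
    """Hypothetical helper: copy the locally stored uploads to the storage buckets."""
    ...

def process_job(job: dict) -> None:
    # Each job was created with the pod name (HOSTNAME) of the instance
    # that actually received the upload.
    if job["data"]["hostname"] != os.environ["HOSTNAME"]:
        # Wrong pod: the uploaded files are not on this instance's filesystem,
        # so push the job back with a delay instead of letting it fail.
        requeue_with_delay(job, delay_seconds=30)
        return
    move_uploads_to_buckets(job["data"]["paths"])
```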
For this part, "which might not have the uploads available, which leads to a failed job", could you please consider using "Persistent Volumes"?
In this case your jobs could work independently, looking for the extracted archive content in shared storage.
Hope this helps. Please share your findings.

AWS SDK calls from a Lambda take longer than 30 seconds

I have a Node.js Lambda function in AWS which needs to read some data. As a data source we've tried two options: S3 and DynamoDB. Both of them have the same issue: when we conduct load testing (10 req/sec for 100 sec), some requests to S3/DynamoDB fail to complete within 30 seconds, which is our Lambda timeout. The requests themselves are very light: for S3 it is a 1 KB file and for DynamoDB it is a table with only one record in it. On average those requests take less than 100 ms, but sometimes we get the very long peaks I'm talking about.
The rate of such long requests is quite small, less than 1%, but this is still not acceptable for us. Moreover, I don't see any reason why we get such long responses.
Another thing we've noticed is that those 30 sec+ requests usually happen after long periods (4 hours or more) of not calling those S3/DynamoDB resources.
The only reason I can think of is that after long inactivity periods the AWS infrastructure is unable to create the required number of ENIs fast enough. ENIs are needed because both S3 and DynamoDB are called over HTTP by the aws-sdk. But this is just a guess, and I don't know how to validate it.
Currently, I'm thinking of warming up the ENIs by making periodic requests to S3/DynamoDB, but I haven't tried it yet.
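That warm-up idea could look roughly like the sketch below, shown in Python/boto3 for illustration (the poster's function is Node.js; the bucket, key and table names are placeholders), invoked on a fixed schedule such as an EventBridge rule:

```python
import boto3

s3 = boto3.client("s3")
dynamodb = boto3.client("dynamodb")

def handler(event, context):
    # Lightweight calls that keep the S3/DynamoDB endpoints exercised
    # during otherwise idle periods.
    s3.head_object(Bucket="my-bucket", Key="my-key")  # the ~1 KB object
    dynamodb.get_item(
        TableName="my-table",
        Key={"id": {"S": "warmup"}},  # the single-record table; key name is a placeholder
    )
    return "warm"
```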
If anybody has had similar issues, I would appreciate any suggestions on how to fix this.
P.S. Increasing the Lambda timeout is not an option for us; 30 seconds is more than enough to make such simple calls.

Google Cloud Platform : Running several hours scraping script

I have a Node.js script that scrapes URLs every day.
The requests are throttled to be kind to the server. This results in my script running for a fairly long time (several hours).
I have been looking for a way to deploy it on GCP. Because it was previously run via cron, I naturally had a look at how to run a cron job on Google Cloud. However, according to the docs, the script has to be exposed as an API, and HTTP calls to that API can only run for up to 60 minutes, which doesn't fit my needs.
I had a look at this S.O. question, which recommends using a Cloud Function. However, I am unsure this approach would be suitable in my case, as my script requires a lot more processing than the simple server-monitoring job described there.
Does anyone have experience doing this on GCP?
N.B.: To clarify, I want to avoid deploying it on a VPS.
Edit:
I reached out to Google; here is their reply:
Thank you for your patience. Currently, it is not possible to run a cron script for 6 to 7 hours in a row, since the current limitation for cron in App Engine is 60 minutes per HTTP request.
If it is possible for your use case, you can spread the 7 hours across recurring tasks, for example every 10 minutes or 1 hour. A cron job request is subject to the same limits as those for push task queues. Free applications can have up to 20 scheduled tasks. You may refer to the documentation for the cron schedule format.
Also, it is possible to still use Postgres and Redis with this. However, kindly take note that Postgres is still in beta.
As I can't spread the task, I had to keep managing a dokku VPS for this.
I would suggest combining two services, GAE Cron Jobs and Cloud Tasks.
Use GAE cron jobs to publish a list of sites and ranges to scrape to Cloud Tasks. This initialization process doesn't need to be 'kind' to the server yet; it can simply publish all chunks of work to the Cloud Tasks queue and consider itself finished when that's done.
Follow that up with a task queue, and use the queue's rate-limiting configuration option as the method of limiting the overall request rate to the endpoint you're scraping. If you need less than 1 qps, add a sleep statement in your code directly. If you're really queueing millions or billions of jobs, follow their advice of having one queue fan out into another (a sketch of the enqueue side follows at the end of this answer):
Large-scale/batch task enqueues
When a large number of tasks, for example millions or billions, need to be added, a double-injection pattern can be useful. Instead of creating tasks from a single job, use an injector queue. Each task added to the injector queue fans out and adds 100 tasks to the desired queue or queue group. The injector queue can be sped up over time, for example start at 5 TPS, then increase by 50% every 5 minutes.
That should be pretty hands-off, and only requires you to think through how the cron job pulls the next desired sites and pages, and how small the workloads should be broken down.
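A minimal sketch of the enqueue side, using the google-cloud-tasks client (the project, region, queue name and scraper endpoint are all placeholders):

```python
from google.cloud import tasks_v2

client = tasks_v2.CloudTasksClient()
# Placeholders: substitute your own project, region and queue name.
parent = client.queue_path("my-project", "us-central1", "scrape-queue")

def enqueue_scrape_tasks(urls):
    # The cron/initialization job simply fans the work out; the queue's
    # rate-limit settings then throttle how fast the scraper endpoint is hit.
    for url in urls:
        task = {
            "http_request": {
                "http_method": tasks_v2.HttpMethod.POST,
                "url": "https://example.com/scrape",  # placeholder worker endpoint
                "body": url.encode(),
            }
        }
        client.create_task(parent=parent, task=task)
```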
I'm also working on this kind of task. I need to crawl websites and have the same problem.
Instead of running the main crawler task on the VM, I moved the task to Google Cloud Functions. The task consists of getting the target URL, scraping the page, saving the result to Datastore, and then returning the result to the caller.
This is how it works: I have a long-running application that can be called the master. The master knows which URLs we are going to access, but instead of accessing the target website itself, it sends the URL to a crawler function in GCF. The crawling is then done there and the result is sent back to the master. In this case, the master only requests and receives a small amount of data and never touches the target website, leaving the rest to GCF. You can offload your master and crawl the website in parallel via GCF. You can also use other methods to trigger GCF instead of an HTTP request.
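A rough sketch of such a crawler function in the Python Cloud Functions runtime (the request shape, Datastore kind, and field names are assumptions for illustration):

```python
import json

import requests
from google.cloud import datastore

datastore_client = datastore.Client()

def crawl(request):
    # HTTP-triggered Cloud Function: the "master" posts {"url": "..."} here.
    url = request.get_json()["url"]  # assumed request shape
    response = requests.get(url, timeout=30)

    # Persist the scraped result so the master only receives a small summary.
    entity = datastore.Entity(key=datastore_client.key("ScrapeResult"))  # assumed kind
    entity.update({"url": url, "status": response.status_code})
    datastore_client.put(entity)

    return json.dumps({"url": url, "status": response.status_code})
```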

AWS Step/Lambda - storing variable between runs

In my first foray into any computing in the cloud, I was able to follow Mark West's instructions on how to use AWS Rekognition to process images from a security camera that are dumped into an S3 bucket and provide a notification if a person was detected. His code was set up for the Raspberry Pi camera, but I was able to adapt it to my IP camera by having it FTP the triggered images to my Synology NAS and using CloudSync to mirror them to the S3 bucket. A Step Function calls Lambda functions per the figure below, and I get an email within 15 seconds with a list of labels detected and the image attached.
The problem is that the camera uploads one image per second for as long as the trigger condition holds, and if there is a lot of activity in front of the camera, I can quickly rack up a few hundred emails.
I'd like to insert a function between make-alert-decision and nodemailer-send-notification that checks whether an email notification was sent within the last minute; if not, it proceeds to nodemailer-send-notification right away, and if so, it stores the list of labels and the path to the attachment in an array and then sends a single email with all of the attachments once 60 seconds have passed.
I know I have to store the data externally, and I came across this article explaining the benefits of different methods of caching data. I also thought that I could examine the timestamps of the files uploaded to S3 and compare the time elapsed between the two most recently uploaded files to decide whether to proceed or to batch the file for later.
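The timestamp idea could be sketched roughly like this with boto3 (the bucket name is a placeholder, and a paginator would be needed for large buckets):

```python
import boto3

s3 = boto3.client("s3")

def seconds_between_last_two_uploads(bucket: str = "camera-uploads") -> float:
    # List the objects and sort by LastModified, newest first.
    objects = s3.list_objects_v2(Bucket=bucket).get("Contents", [])
    newest = sorted(objects, key=lambda o: o["LastModified"], reverse=True)[:2]
    if len(newest) < 2:
        return float("inf")  # not enough uploads to compare yet
    return (newest[0]["LastModified"] - newest[1]["LastModified"]).total_seconds()
```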
Being completely new to AWS, I am looking for advice on which method makes the most sense from a complexity and cost perspective. I can live with the lag involved in any of the methods discussed in the article; I just don't know how to proceed, as I've never used or even heard of any of the services.
Thanks!
You can use an SQS queue to which the make-alert-decision Lambda sends a message with each label and the path to the attachment.
The nodemailer-send-notification Lambda would be a consumer of that queue, but executed on a regular schedule.
You can schedule that Lambda to run every minute, reading all the messages from the queue (deleting them right away, or setting a suitable visibility timeout and deleting them afterwards) to collect the list of attachments and send a single email. You would then have a single email with all the attachments every 60 seconds.
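A minimal sketch of that scheduled consumer in Python/boto3 (the queue URL and message shape are placeholders, and the email helper is a hypothetical stand-in for the existing nodemailer step):

```python
import json

import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/alert-queue"  # placeholder

def send_single_email(labels, attachments):
    """Hypothetical helper: build and send one email with all attachments."""
    ...

def handler(event, context):
    # Runs on a one-minute schedule; drains the queue and sends one combined email.
    labels, attachments, receipts = [], [], []
    while True:
        resp = sqs.receive_message(QueueUrl=QUEUE_URL, MaxNumberOfMessages=10)
        messages = resp.get("Messages", [])
        if not messages:
            break
        for msg in messages:
            body = json.loads(msg["Body"])  # assumed shape: {"labels": [...], "path": "..."}
            labels.extend(body["labels"])
            attachments.append(body["path"])
            receipts.append(msg["ReceiptHandle"])

    if attachments:
        send_single_email(labels, attachments)
        # Only delete after the email has gone out, so unsent messages stay visible.
        for receipt in receipts:
            sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=receipt)
```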
