I have a CSV file in S3 and I want to run a Python script using the data in it. The S3 file changes once a week. I need to pass an input argument to my Python script, which loads the S3 file into Pandas and does some calculations to return the result.
Currently I am loading this S3 file using Boto3 on my server for each input argument. This process takes too long to return the result, and my nginx responds with a 504 Gateway Timeout.
I am hoping some AWS service can do this in the cloud. Can anyone point me in the right direction as to which AWS service is suitable here?
You have several options:
Use AWS Lambda, but note that Lambda has limited local storage (512 MB in /tmp) and memory (about 3 GB), with a 15-minute maximum run time. A minimal handler sketch is shown after this list.
Since you mentioned Pandas, I recommend AWS Glue, which has the ability to:
Detect new files
Support large memory and CPU configurations
Build visual data flows
Work with Spark DataFrames
Query data directly from your CSV files
Connect to different database engines
We currently use AWS Glue for our data-parsing processes.
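If you go the Lambda route and the weekly CSV fits comfortably in memory, a minimal handler sketch could look like the following (the bucket, key, and column names are hypothetical, and pandas is assumed to be packaged with the function or supplied via a layer):

    import io

    import boto3
    import pandas as pd

    s3 = boto3.client("s3")

    def lambda_handler(event, context):
        # Hypothetical event shape: the caller passes the input argument plus
        # the location of the weekly CSV.
        bucket = event.get("bucket", "my-data-bucket")
        key = event.get("key", "weekly/data.csv")
        arg = event["input_arg"]

        obj = s3.get_object(Bucket=bucket, Key=key)
        df = pd.read_csv(io.BytesIO(obj["Body"].read()))

        # Placeholder calculation: filter on the input argument and aggregate.
        result = df[df["category"] == arg]["value"].sum()
        return {"result": float(result)}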
I have an API written in Node.js (/api/uploadS3) which is a PUT request that accepts a video file and a URL (an AWS S3 URL in this case). Once called, its task is to upload the file to the S3 URL.
Now, users are uploading files to this Node API in different formats (thanks to different browsers recording videos in different formats), and I want to convert all these videos to MP4 and then store them in S3.
I want to know what the best approach is for this.
I have two solutions so far:
1. Convert on the Node server using ffmpeg
The issue with this is that ffmpeg can only execute a single operation at a time. Since I have only one server, I will have to implement a queue for multiple requests, which can lead to longer waiting times for users at the end of the queue. Apart from that, I am worried that my Node server's traffic-handling capability will be affected while a video conversion is in progress.
Can someone help me understand what the effect of other requests coming to my server will be while a video conversion is going on? How will it impact RAM and CPU usage, and the speed of processing other requests?
2. Use an AWS Lambda function
To avoid load on my Node server, I was thinking of using AWS Lambda: my Node API uploads the file to S3 in the format provided by the user. Once done, S3 triggers a Lambda function which takes that S3 file, converts it to .mp4 using ffmpeg or AWS MediaConvert, and then uploads the MP4 file to a new S3 path. I don't want the output path to be just any S3 path, but the path that was received by the Node API in the first place.
Moreover, I want the user to wait while all this happens, as I have to enable other UI features based on the success or failure of this upload.
The question here is: is it possible to do this using just a single API like /api/uploadS3, which uploads to S3 --> triggers Lambda --> converts the file --> uploads the MP4 version --> returns success or error?
Currently, if I upload to S3, the request ends then and there. Is there a way to defer the API response until all the operations have completed?
Also, how will the Lambda function access the output S3 path that was passed to the Node API?
Any better approach is welcome.
PS: the S3 path received by the Node API is different for each user.
Thanks for your post. The output S3 bucket generates File Events when a new file arrives (i.e., is delivered from AWS MediaConvert).
This file event can trigger a second Lambda function which can move the file elsewhere using any of the supported transfer protocols, retry if necessary, log a status to AWS CloudWatch and/or AWS SNS, and then send a final API response based on the success/completion of the move.
AWS has a Step Functions feature which can maintain state across successive Lambda functions, for automating simple workflows. This should work for what you want to accomplish. See https://docs.aws.amazon.com/step-functions/latest/dg/tutorial-creating-lambda-state-machine.html
Note that any one Lambda function has a 15-minute maximum runtime, so any single transcoding or file-copy operation must complete within 15 minutes. The alternative is to run on EC2.
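As a rough sketch of that second Lambda function in Python (the metadata convention for carrying the per-user destination path is hypothetical and only for illustration; you could equally look the destination up in a database keyed by the object key):

    import boto3

    s3 = boto3.client("s3")

    def lambda_handler(event, context):
        # Triggered by the S3 ObjectCreated event on the MediaConvert output bucket.
        record = event["Records"][0]
        src_bucket = record["s3"]["bucket"]["name"]
        src_key = record["s3"]["object"]["key"]

        # Hypothetical convention: the per-user destination was recorded as
        # object metadata on the output object.
        meta = s3.head_object(Bucket=src_bucket, Key=src_key)["Metadata"]
        dest_bucket = meta["dest-bucket"]
        dest_key = meta["dest-key"]

        # Move the converted MP4 to the user-specific path.
        s3.copy_object(
            Bucket=dest_bucket,
            Key=dest_key,
            CopySource={"Bucket": src_bucket, "Key": src_key},
        )
        s3.delete_object(Bucket=src_bucket, Key=src_key)

        # From here you can publish to SNS or update a status record that the
        # Node API (or a Step Functions workflow) is waiting on.
        return {"status": "moved", "destination": f"s3://{dest_bucket}/{dest_key}"}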
I hope this helps you out!
I'm developing an AWS Lambda function in Python which is triggered by API Gateway, and the Lambda connects to my Snowflake account. I'll process a few CSV files via API Gateway to get some data from Snowflake. Currently I'm using the Python connector to connect to Snowflake.
My issue is that if my CSV has 100 records, the function processes the records one by one and connects to Snowflake from Lambda each time to process each record, which is hurting performance.
Is there any method or mechanism by which Lambda can create a session for a certain period of time and process all records over a single connection?
As far as I know, connect() will automatically create a session that lasts for a period of time. Once connected, you can use the cursor to execute multiple commands without needing to call connect() every time. Docs here. But I'm guessing you know this, and what you want is a single command instead of having to issue multiple INSERTs.
This is also possible, using a STAGE and a COPY INTO command instead of INSERT. You can find an example of bulk loading from AWS S3 in the Snowflake documentation here.
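As a rough sketch of the connection-reuse point (the credentials, table name, and event shape below are hypothetical): open the connection once outside the handler so warm Lambda invocations reuse it, and send the records as one batch instead of one INSERT per record.

    import snowflake.connector  # snowflake-connector-python

    # Created once per Lambda container, so warm invocations reuse the session
    # instead of reconnecting for every record.
    conn = snowflake.connector.connect(
        user="MY_USER",            # hypothetical credentials
        password="MY_PASSWORD",
        account="MY_ACCOUNT",
        warehouse="MY_WAREHOUSE",
        database="MY_DATABASE",
        schema="PUBLIC",
    )

    def lambda_handler(event, context):
        records = event["records"]  # hypothetical: CSV rows already parsed into dicts
        cur = conn.cursor()
        try:
            # One statement for the whole batch instead of one INSERT per record.
            cur.executemany(
                "INSERT INTO my_table (col_a, col_b) VALUES (%s, %s)",
                [(r["a"], r["b"]) for r in records],
            )
        finally:
            cur.close()
        return {"inserted": len(records)}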
I have a use case where I want to upload a local file to an S3 bucket and then continue with the next process. So I am using the amazon s3 out node. It uploads the file properly, but it doesn't allow me to perform further operations, because the node has only a one-way connection (no output).
How can I perform the next operation using this node? Any hint would be appreciated.
You have three options:
1. Modify the node to have an output. If you do this, you should consider raising a pull request against the project so the authors can decide if they want to include your changes.
2. Split the flow into two parts, and then use the AWS watch node (included in the same package as the AWS s3 output node you are using) to watch for the event that matches the file being uploaded. You will have to store any data that is included with the file you are uploading in the context so it can be retrieved by the second flow.
3. Pretty much the same as option 2, but using the Status node instead to watch for changes in the status of the AWS output node to trigger the later steps.
Out of the three options, I'd probably try them in that order.
I've mounted a public S3 bucket to an AWS EC2 instance using Goofys (similar to s3fs), which lets me access files in the S3 bucket on my EC2 instance as if they were local paths. I want to use these files in my AWS Lambda function (written in Python), passing these local paths in via the event parameter. Given that AWS Lambda has a storage limit of 512 MB, is there a way I can give AWS Lambda access to the files on my EC2 instance?
AWS Lambda works really well for my purpose (I'm trying to calculate a statistical correlation between two files, which takes 1-1.5 seconds), so it'd be great if anyone knows a way to make this work.
Appreciate the help.
EDIT:
In my AWS lambda function, I am using the python library pyranges, which expects local paths to files.
You have a few options:
Have your Lambda function first download the files locally to the /tmp folder, using boto3, before invoking pyranges (see the sketch after this list).
Possibly use S3Fs to emulate file handles for S3 objects.
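A minimal sketch of the first option (the event shape and file names below are hypothetical, and the pyranges calls are shown commented out since the exact reader depends on your file format):

    import os

    import boto3
    # import pyranges as pr  # assumed to be packaged with the function or in a layer

    s3 = boto3.client("s3")

    def lambda_handler(event, context):
        # Hypothetical event shape: pass bucket/keys instead of Goofys-mounted paths.
        bucket = event["bucket"]
        local_paths = []
        for key in event["keys"]:
            local_path = os.path.join("/tmp", os.path.basename(key))
            s3.download_file(bucket, key, local_path)  # /tmp is capped at 512 MB
            local_paths.append(local_path)

        # pyranges can now be pointed at the local copies, e.g.:
        # gr1 = pr.read_bed(local_paths[0])
        # gr2 = pr.read_bed(local_paths[1])
        # ...compute the correlation here...
        return {"downloaded": local_paths}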
I am trying to follow this documentation: https://aws.amazon.com/blogs/database/stream-changes-from-amazon-rds-for-postgresql-using-amazon-kinesis-data-streams-and-aws-lambda/
in order to stream changes in a Postgres database (running in Amazon RDS) to AWS Kinesis. When I run the given code on EC2 (or on my local system), it works and prints any CRUD operation in the terminal. However, option 2, using Lambda, does not work. Nowhere is it mentioned how the Lambda is supposed to be triggered. Also, a Lambda can run for a maximum of 15 minutes. I am highly confused and would really like some help with this.