I need to copy a folder within the same bucket and then run various logic on the copied content.
I wanted to know if there's a way to copy the entire folder without using listObjects and then copying each file separately.
Otherwise this would mean running listObjects, copying each file, running listObjects again on the new folder, and then running my logic on each file.
So basically I'm trying to save IO and avoid multiple loops.
Please advise.
You can use the --recursive flag of the AWS CLI's aws s3 cp command to accomplish this. Combined with the --include and --exclude flags, which accept wildcards, you can achieve your goal. See the aws s3 cp page of the CLI documentation.
something like:
aws s3 cp s3://mybucket/logs/ s3://mybucket/logs2/ --recursive --exclude "*" --include "*.log"
Amazon S3 does not provide a command that 'copies a folder'. Instead, each object must be individually copied via its own API request.
This means that you will first need to obtain a listing of the objects. This can be obtained via:
A call to ListObjects (note: it can return a maximum of 1000 objects per API call)
OR
Use Amazon S3 Inventory to generate a list of existing objects in CSV format, and then use that list to generate the Copy requests
If you have a large number of objects, you could consider using Amazon S3 Batch Operations, which can copy files or invoke an AWS Lambda function for each object.
You could also configure Amazon S3 to trigger an AWS Lambda function whenever an object is Created (including when it is copied). Thus, the creation of the object can directly trigger the logic that you want to run.
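A rough sketch of that list-then-copy approach in Python with boto3 (the bucket and prefixes are placeholders; note that CopyObject handles objects up to 5 GB, beyond which a multipart copy is needed):

import boto3

s3 = boto3.client("s3")
bucket = "mybucket"    # placeholder
src_prefix = "logs/"   # folder to copy
dst_prefix = "logs2/"  # destination folder

# list_objects_v2 returns at most 1000 keys per call, so use a paginator
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=bucket, Prefix=src_prefix):
    for obj in page.get("Contents", []):
        src_key = obj["Key"]
        dst_key = dst_prefix + src_key[len(src_prefix):]
        # Each object needs its own CopyObject request
        s3.copy_object(
            Bucket=bucket,
            Key=dst_key,
            CopySource={"Bucket": bucket, "Key": src_key},
        )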
I have a bucket that has multiple users, and I would like to pre-sign URLs for the client to upload to S3 (some files can be large, so I'd rather they not pass through the Node server). My question is this: until the Mongo database is hit, there is no Mongo ObjectId to use as a prefix for the file. I'm separating the files in this structure (UserID/PostID/resource) so you can check all of a user's pictures by looking under /UserID, and you can target a specific post by also adding the PostID. Conversely, there is no object URL until the client uploads the file, so I'm at a bit of an impasse.
Is it bad practice to rename files after they touch the bucket? I just can't know the ObjectId in advance (the post has to be created in Mongo first), but the user has to select which files they want to upload before the object is created. I was thinking the best flow could be one of two situations:
Client selects files -> Mongo creates the document -> server responds to the client with the ObjectId and pre-signed URLs for each file, with the key set to /UserID/PostID/name. After a successful upload, the client triggers an update function on the server to edit the URLs of the post; after the update, send success to the client.
Client uploads files to the root of the bucket -> Mongo doc is created storing the URLs of the uploaded S3 files -> iterate over the list and prepend the UserID and newly created PostID, updating the Mongo document -> success response to the client
Is there another approach that I don't know about?
Answering your question:
Is it bad practice to rename files after they touch the server?
If you are planning to use S3 to store your files, there is no server, so there is no problem with changing these files after you upload them.
The only thing you need to understand is that renaming an object takes two requests:
copy the object with a new name
delete the old object with the old name
This means it could become a cost/latency problem if you have a huge number of changes (but I can say that for most cases this will not be a problem).
I can say that the first option will be a good one for you, and the only thing that I would change is adding serverless processing for your objects/files; the AWS Lambda service will be a good option for this.
In this case, instead of updating the files on the server, you will update them using a Lambda function. You only need to add a trigger on your bucket for the PutObject event on S3; this way you can change the names of your files with the best processing time for your client and at low cost.
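A minimal sketch of such a Lambda handler, assuming the /UserID/PostID/name key layout from the question (the prefix derivation is placeholder logic you would replace with your own):

import urllib.parse
import boto3

s3 = boto3.client("s3")

def handler(event, context):
    # Triggered by the s3:ObjectCreated:Put event on the bucket.
    # Scope the trigger (e.g. with a prefix filter) so the renamed copy
    # does not fire this function again.
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
        # Placeholder: derive the real UserID/PostID however your app stores them
        new_key = "UserID/PostID/" + key.split("/")[-1]
        # A rename is a copy followed by a delete (the two requests mentioned above)
        s3.copy_object(Bucket=bucket, Key=new_key,
                       CopySource={"Bucket": bucket, "Key": key})
        s3.delete_object(Bucket=bucket, Key=key)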
I have a feeling the answer to my question will be the correct Google term that I am missing, but here we go.
I need to trigger all objects in an S3 bucket without uploading them. The reason is that I have a Lambda that gets triggered on PutObject and I want to reprocess all those files again. They are huge images and re-uploading does not sound like a good idea.
I am trying to do this in Node.js, but any language that anyone is comfortable with will help and I will translate.
Thanks
An Amazon S3 Event can trigger an AWS Lambda function when an object is created/deleted/replicated.
However, it is not possible to "trigger the object" -- the object would need to be created/deleted/replicated to cause the Amazon S3 Event to be generated.
As an alternative, you could create a small program that lists the objects in the bucket, and then directly invokes the AWS Lambda function, passing the object details in the event message to make it look like it came from Amazon S3. There is a sample S3 Event in the Lambda 'test' function -- you could copy this template and have your program insert the appropriate bucket and object key. Your Lambda function would then process it exactly as if an S3 Event had triggered the function.
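A rough sketch of that approach with boto3, listing the keys and invoking the function asynchronously with a synthetic S3-style event (the bucket and function names are placeholders, and the event only contains the fields a typical handler reads):

import json
import boto3

s3 = boto3.client("s3")
lam = boto3.client("lambda")
bucket = "my-bucket"                  # placeholder
function_name = "my-image-processor"  # placeholder

paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=bucket):
    for obj in page.get("Contents", []):
        # Minimal event shaped like an S3 Put notification
        event = {
            "Records": [{
                "eventSource": "aws:s3",
                "eventName": "ObjectCreated:Put",
                "s3": {
                    "bucket": {"name": bucket},
                    "object": {"key": obj["Key"], "size": obj["Size"]},
                },
            }]
        }
        lam.invoke(
            FunctionName=function_name,
            InvocationType="Event",   # asynchronous, like a real S3 trigger
            Payload=json.dumps(event),
        )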
In addition to what is explained above, you can use AWS S3 Batch Operations.
We used this to encrypt existing objects in the S3 bucket which were not encrypted earlier.
This was the easiest out of the box solution available in the S3 console itself.
You could also loop through all objects in the bucket and add a tag. Next, adjust your trigger event to include tag changes. Code sample in bash to follow after I test it.
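Until that bash sample lands, here is a rough Python/boto3 sketch of the same tagging idea (the bucket name and the tag key/value are placeholders; note that put_object_tagging replaces any existing tags on the object):

import boto3

s3 = boto3.client("s3")
bucket = "my-bucket"  # placeholder

paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=bucket):
    for obj in page.get("Contents", []):
        # Adding/changing a tag emits an s3:ObjectTagging:Put event,
        # which the adjusted trigger can pick up
        s3.put_object_tagging(
            Bucket=bucket,
            Key=obj["Key"],
            Tagging={"TagSet": [{"Key": "reprocess", "Value": "true"}]},
        )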
My scenario: I am currently using the AWS CLI to upload my directory contents to an S3 bucket using the following command:
aws s3 sync results/foo s3://bucket/
Now I need to replace this with Python code. I am exploring the boto3 documentation to find the right way to do it. I see some options such as:
https://boto3.amazonaws.com/v1/documentation/api/1.9.42/reference/services/s3.html#S3.Client.upload_file
https://boto3.amazonaws.com/v1/documentation/api/1.9.42/reference/services/s3.html#S3.ServiceResource.Object
Could someone suggest which is the right approach?
I am aware that I would have to get the credentials by calling boto3.client('sts').assume_role(role, session) and use them subsequently.
The AWS CLI is actually written in Python and uses the same API calls you can use.
The important thing to realize is that Amazon S3 only has an API call to upload/download one object at a time.
Therefore, your Python code would need to:
Obtain a list of files to copy
Loop through each file and upload it to Amazon S3
Of course, if you want sync functionality (which only copies new/modified files), then your program will need more intelligence to figure out which files to copy.
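A bare-bones sketch of that list-and-upload loop with boto3 (the bucket and directory mirror the CLI example above, and there is no sync logic, so every file is uploaded):

import os
import boto3

s3 = boto3.client("s3")
bucket = "bucket"          # placeholder, as in the CLI example
local_dir = "results/foo"  # the directory being synced

for root, _, files in os.walk(local_dir):
    for name in files:
        local_path = os.path.join(root, name)
        # Key relative to the directory root, mirroring `aws s3 sync results/foo s3://bucket/`
        key = os.path.relpath(local_path, local_dir).replace(os.sep, "/")
        s3.upload_file(local_path, bucket, key)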
Boto3 has two general types of methods:
client methods that map 1:1 with API calls, and
resource methods that are more Pythonic but might make multiple API calls in the background
Which type you use is your own choice. Personally, I find the client methods easier for uploading/downloading objects, and the resource methods are good when having to loop through resources (e.g. "for each EC2 instance, for each EBS volume, check each tag").
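For example, the same single-object upload in both styles (the bucket, key, and file names are placeholders):

import boto3

# Client method
boto3.client("s3").upload_file("results/foo/report.csv", "bucket", "report.csv")

# Resource method, wrapping the same managed upload
boto3.resource("s3").Object("bucket", "report.csv").upload_file("results/foo/report.csv")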
I am trying to run a python script that takes a folder of pdfs as an input, and outputs an excel file in the current directory. In terminal I would enter the line below and an excel file would appear in the current directory.
$python3 script.py folder
I was wondering how to run this script with a folder located in an AWS S3 bucket as the input, without having to download the folder because it is pretty big. I believe you have to use an EC2 instance, but I am unclear about the whole process, especially how to have the S3 folder object be the input parameter for the Python script.
You can use the AWS SDK for Python (Boto3) to list the contents of an S3 bucket, get each object, and perform operations on it.
Here's how you normally do it:
Get an S3 client handler
List the S3 bucket's objects
Iterate over the list and get each object
Perform whatever operations you're looking for on each object
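A minimal sketch of those steps with boto3 (the bucket name and PDF prefix are placeholders):

import boto3

s3 = boto3.client("s3")  # S3 client handler
bucket = "my-bucket"     # placeholder

# The paginator yields pages lazily, which keeps memory use low
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=bucket, Prefix="folder-of-pdfs/"):
    for obj in page.get("Contents", []):
        body = s3.get_object(Bucket=bucket, Key=obj["Key"])["Body"]
        pdf_bytes = body.read()
        # ... run the PDF-to-Excel processing on pdf_bytes here ...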
Moreover, you can use generators with Python to keep memory usage low while iterating over the list.
Note: If you're using EC2, it's a best practice to attach an IAM role with permissions scoped to the specific bucket you're trying to list.
Thanks!
You would use the AWS SDK for Python (Boto3) to list the contents of an S3 location and stream the contents of each S3 object. The parameter you would pass to the script would be an S3 url like s3://my-bucket/my-folder. You would have to replace all the local file system I/O calls in the script with Boto3 S3 API calls. There would be no requirement to run the script on an EC2 instance, although it would generally have a faster connection to S3 than your local computer would.
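A hedged sketch of that pattern, parsing an s3:// URL and streaming each object's contents instead of reading local files (the URL and the helper name are made up for illustration):

from urllib.parse import urlparse
import boto3

def iter_s3_objects(s3_url):
    # Yield (key, streaming body) for every object under an s3://bucket/prefix URL
    parsed = urlparse(s3_url)  # e.g. s3://my-bucket/my-folder
    bucket, prefix = parsed.netloc, parsed.path.lstrip("/")
    s3 = boto3.client("s3")
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            yield obj["Key"], s3.get_object(Bucket=bucket, Key=obj["Key"])["Body"]

# The script would call this instead of os.listdir()/open()
for key, body in iter_s3_objects("s3://my-bucket/my-folder"):
    data = body.read()  # or body.iter_chunks() to stream in pieces
    # ... existing PDF-processing logic goes here ...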
I have a huge .csv file on my local machine. I want to load that data in a DynamoDB (eu-west-1, Ireland). How would you do that?
My first approach was:
Iterate the CSV file locally
Send a row to AWS via a curl -X POST -d '<row>' .../connector/mydata
Process the previous call within a lambda and write in DynamoDB
I do not like that solution because:
There are too many requests
If I send data without the CSV header information I have to hardcode the lambda
If I send data with the CSV header there is too much traffic
I was also considering putting the file in an S3 bucket and process it with a lambda, but the file is huge and the lambda's memory and time limits scare me.
I am also considering doing the job on an EC2 machine, but I lose reactivity (if I turn off the machine while not used) or I lose money (if I do not turn off the machine).
I was told that Kinesis may be a solution, but I am not convinced.
Please tell me what would be the best approach to get the huge CSV file in DynamoDB if you were me. I want to minimise the workload for a "second" upload.
I prefer using Node.js or R. Python may be acceptable as a last solution.
If you want to do it the AWS way, then AWS Data Pipeline may be the best approach:
Here is a tutorial that does a bit more than you need, but should get you started:
The first part of this tutorial explains how to define an AWS Data Pipeline pipeline to retrieve data from a tab-delimited file in Amazon S3 to populate a DynamoDB table, use a Hive script to define the necessary data transformation steps, and automatically create an Amazon EMR cluster to perform the work.
http://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-importexport-ddb-part1.html
If all your data is in S3, you can use AWS Data Pipeline's predefined template to 'import DynamoDB data from S3'. It should be straightforward to configure.