Need to upload directory content to S3 bucket - python-3.x

My scenario is that I am currently using the AWS CLI to upload the contents of a directory to an S3 bucket with the following command:
aws s3 sync results/foo s3://bucket/
Now I need to replace this with Python code. I am exploring the boto3 documentation to find the right way to do it. I see some options, such as:
https://boto3.amazonaws.com/v1/documentation/api/1.9.42/reference/services/s3.html#S3.Client.upload_file
https://boto3.amazonaws.com/v1/documentation/api/1.9.42/reference/services/s3.html#S3.ServiceResource.Object
Could someone suggest which is the right approach?
I am aware that I would have to get the credentials by calling boto3.client('sts').assume_role(role, session) and use them subsequently.
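For reference, something along these lines is what I mean by using the assumed-role credentials (the role ARN and session name are just placeholders):

    import boto3

    sts = boto3.client('sts')
    creds = sts.assume_role(
        RoleArn='arn:aws:iam::123456789012:role/my-upload-role',  # placeholder role
        RoleSessionName='s3-upload',
    )['Credentials']

    # Build an S3 client from the temporary credentials returned by STS
    s3 = boto3.client(
        's3',
        aws_access_key_id=creds['AccessKeyId'],
        aws_secret_access_key=creds['SecretAccessKey'],
        aws_session_token=creds['SessionToken'],
    )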

The AWS CLI is actually written in Python and uses the same API calls you can use.
The important thing to realize is that Amazon S3 only has an API call to upload/download one object at a time.
Therefore, your Python code would need to:
Obtain a list of files to copy
Loop through each file and upload it to Amazon S3
Of course, if you want sync functionality (which only copies new/modified files), then your program will need more intelligence to figure out which files to copy.
Boto3 has two general types of methods:
client methods that map 1:1 with API calls, and
resource methods that are more Pythonic but might make multiple API calls in the background
Which type you use is your own choice. Personally, I find the client methods easier for uploading/downloading objects, and the resource methods are good when having to loop through resources (e.g. "for each EC2 instance, for each EBS volume, check each tag").
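For example, here is a minimal sketch of that upload loop using the client-style upload_file (the directory and bucket names are taken from the question; there is no sync logic, so every file is re-uploaded):

    import os
    import boto3

    s3 = boto3.client('s3')
    local_dir = 'results/foo'   # directory the CLI command was syncing
    bucket = 'bucket'           # target bucket from the question

    # Walk the directory and upload each file, mirroring the relative path as the key
    for root, _dirs, files in os.walk(local_dir):
        for name in files:
            local_path = os.path.join(root, name)
            key = os.path.relpath(local_path, local_dir).replace(os.sep, '/')
            s3.upload_file(local_path, bucket, key)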

Related

AWS S3 copy folder without using listobject

I need to copy a folder within the same bucket and then run various logic on the copied content.
I wanted to know if there's a way to copy the entire folder without using listObjects and then copying each file separately.
Otherwise this would mean running listObjects, copying each file, then running listObjects again on the new folder, and then running the logic on each file.
So basically I'm trying to save IO and avoid multiple loops.
Please advise.
You can use the --recursive flag with the AWS CLI to accomplish this. Combine it with the --include and --exclude flags, which accept wildcards, and you can achieve your goal. See the aws s3 cp page of the CLI documentation.
Something like:
aws s3 cp s3://mybucket/logs/ s3://mybucket/logs2/ --recursive --exclude "*" --include "*.log"
Amazon S3 does not provide a command that 'copies a folder'. Instead, each object must be individually copied via its own API request.
This means that you will first need to obtain a listing of the objects. This can be obtained via:
A call to ListObjects (note: it can only return a maximum of 1000 objects per API call)
OR
Use Amazon S3 Inventory to generate a list of existing objects in CSV format, and then use that list to generate the Copy requests
If you have a large number of objects, you could consider using Amazon S3 Batch Operations, which can copy files or invoke an AWS Lambda function for each object.
You could also configure Amazon S3 to trigger an AWS Lambda function whenever an object is Created (including when it is copied). Thus, the creation of the object can directly trigger the logic that you want to run.
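As a rough sketch of that list-then-copy loop in Python with boto3 (the folder names follow the CLI example above and the bucket name is a placeholder), using a paginator so the 1000-object limit is handled for you:

    import boto3

    s3 = boto3.client('s3')
    bucket = 'mybucket'   # placeholder

    # List every object under logs/ and copy it under logs2/ in the same bucket
    paginator = s3.get_paginator('list_objects_v2')
    for page in paginator.paginate(Bucket=bucket, Prefix='logs/'):
        for obj in page.get('Contents', []):
            new_key = 'logs2/' + obj['Key'][len('logs/'):]
            s3.copy_object(
                Bucket=bucket,
                Key=new_key,
                CopySource={'Bucket': bucket, 'Key': obj['Key']},
            )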

Trigger S3 Object without upload

I have a feeling the answer to my question is just the right Google term I am missing, but here we go.
I need to trigger the event for all objects in an S3 bucket without uploading them again. The reason is that I have a Lambda that gets triggered on PutObject and I want to reprocess all those files. They are huge images and re-uploading them does not sound like a good idea.
I am trying to do this in Node.js, but any language anyone is comfortable with will help and I will translate.
Thanks
An Amazon S3 Event can trigger an AWS Lambda function when an object is created/deleted/replicated.
However, it is not possible to "trigger the object" -- the object would need to be created/deleted/replicated to cause the Amazon S3 Event to be generated.
As an alternative, you could create a small program that lists the objects in the bucket, and then directly invokes the AWS Lambda function, passing the object details in the event message to make it look like it came from Amazon S3. There is a sample S3 Event in the Lambda 'test' function -- you could copy this template and have your program insert the appropriate bucket and object key. Your Lambda function would then process it exactly as if an S3 Event had triggered the function.
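A minimal Python (boto3) sketch of that approach, in case it helps with the translation; the bucket and function names are placeholders, and your Lambda may read more of the S3 event fields than the ones shown:

    import json
    import boto3

    s3 = boto3.client('s3')
    lambda_client = boto3.client('lambda')

    paginator = s3.get_paginator('list_objects_v2')
    for page in paginator.paginate(Bucket='my-bucket'):
        for obj in page.get('Contents', []):
            # Minimal shape of an S3 Put event record; copy the full template from the
            # Lambda console's S3 test event if your function needs more fields
            event = {
                'Records': [{
                    'eventSource': 'aws:s3',
                    'eventName': 'ObjectCreated:Put',
                    's3': {
                        'bucket': {'name': 'my-bucket'},
                        'object': {'key': obj['Key'], 'size': obj['Size']},
                    },
                }]
            }
            lambda_client.invoke(
                FunctionName='my-image-function',   # placeholder function name
                InvocationType='Event',             # asynchronous, fire-and-forget
                Payload=json.dumps(event),
            )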
In addition to what is explained above, you can use AWS S3 Batch Operations.
We used this to encrypt existing objects in the S3 bucket which were not encrypted earlier.
This was the easiest out of the box solution available in the S3 console itself.
You could also loop through all objects in the bucket and add a tag. Next, adjust your trigger event to include tag changes. Code sample in bash to follow after I test it.
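As a rough sketch of that tagging loop (in Python/boto3 rather than bash; the bucket name and tag are placeholders, and note that put_object_tagging replaces an object's existing tag set):

    import boto3

    s3 = boto3.client('s3')

    paginator = s3.get_paginator('list_objects_v2')
    for page in paginator.paginate(Bucket='my-bucket'):
        for obj in page.get('Contents', []):
            # Adding a tag emits an s3:ObjectTagging:Put event, which can re-trigger the Lambda
            s3.put_object_tagging(
                Bucket='my-bucket',
                Key=obj['Key'],
                Tagging={'TagSet': [{'Key': 'reprocess', 'Value': 'true'}]},
            )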

Run python script located on computer with aws s3 folder object as input parameter in an ec2 instance

I am trying to run a Python script that takes a folder of PDFs as an input and outputs an Excel file in the current directory. In the terminal I would enter the line below and an Excel file would appear in the current directory.
$python3 script.py folder
I was wondering how to run this script with a folder located in an AWS S3 bucket as the input, without having to download the folder because it is pretty big. I believe you have to use an EC2 instance but am unclear about the whole process, especially how to have the S3 folder object be the input parameter for the Python script.
You can use the AWS SDK for Python (Boto3) to list the contents of the S3 bucket, get each object, and perform operations on it.
Here's how you normally do it:
Get a Boto3 S3 client.
List the S3 bucket's objects.
Iterate over the list and get each object.
Perform whatever operations you're looking for on each object.
Moreover, you can use Python generators to keep memory usage down while iterating over the listing.
Note: if you're running on EC2, it's a best practice to attach an IAM role with permissions scoped to the specific bucket you're trying to list.
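A small boto3 sketch of those steps using a generator, as suggested (the bucket name and prefix are placeholders):

    import boto3

    def iter_s3_objects(bucket, prefix=''):
        # Yield object keys one at a time so the full listing never sits in memory
        s3 = boto3.client('s3')
        paginator = s3.get_paginator('list_objects_v2')
        for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
            for obj in page.get('Contents', []):
                yield obj['Key']

    for key in iter_s3_objects('my-bucket', 'my-folder/'):
        print(key)   # replace the print with your per-object processing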
Thanks!
You would use the AWS SDK for Python (Boto3) to list the contents of an S3 location and stream the contents of each S3 object. The parameter you would pass to the script would be an S3 URL like s3://my-bucket/my-folder. You would have to replace all the local file system I/O calls in the script with Boto3 S3 API calls. There would be no requirement to run the script on an EC2 instance, although it would generally have a faster connection to S3 than your local computer would.
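For example, a rough sketch of that approach using the resource-style API (the s3:// URL is a placeholder, and what you do with each object's bytes depends on your script):

    import boto3
    from urllib.parse import urlparse

    def process_s3_folder(s3_url):
        # Split s3://my-bucket/my-folder into bucket name and key prefix
        parsed = urlparse(s3_url)
        bucket_name, prefix = parsed.netloc, parsed.path.lstrip('/')

        bucket = boto3.resource('s3').Bucket(bucket_name)
        for obj in bucket.objects.filter(Prefix=prefix):
            body = obj.get()['Body']   # streaming body, nothing is written to disk
            data = body.read()         # or read in chunks for very large objects
            # ... run the PDF-to-Excel logic on data here ...

    process_s3_folder('s3://my-bucket/my-folder')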

Better/best approach to load huge CSV file into DynamoDb

I have a huge .csv file on my local machine. I want to load that data into DynamoDB (eu-west-1, Ireland). How would you do that?
My first approach was:
Iterate the CSV file locally
Send a row to AWS via a curl -X POST -d '<row>' .../connector/mydata
Process the previous call within a lambda and write in DynamoDB
I do not like that solution because:
There are too many requests
If I send data without the CSV header information I have to hardcode the lambda
If I send data with the CSV header there is too much traffic
I was also considering putting the file in an S3 bucket and processing it with a Lambda, but the file is huge and Lambda's memory and time limits scare me.
I am also considering doing the job on an EC2 machine, but I lose reactivity (if I turn off the machine while not used) or I lose money (if I do not turn off the machine).
I was told that Kinesis may be a solution, but I am not convinced.
Please tell me what the best approach would be to get the huge CSV file into DynamoDB if you were me. I want to minimise the workload for a "second" upload.
I prefer using Node.js or R. Python may be acceptable as a last solution.
If you want to do it the AWS way, then AWS Data Pipeline may be the best approach.
Here is a tutorial that does a bit more than you need, but should get you started:
The first part of this tutorial explains how to define an AWS Data Pipeline pipeline to retrieve data from a tab-delimited file in Amazon S3 to populate a DynamoDB table, use a Hive script to define the necessary data transformation steps, and automatically create an Amazon EMR cluster to perform the work.
http://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-importexport-ddb-part1.html
If all your data is in S3, you can use AWS Data Pipeline's predefined template to 'import DynamoDB data from S3'. It should be straightforward to configure.

aws s3 putObject vs sync

I need to upload a large file to an AWS S3 bucket. Every 10 minutes my code deletes the old file from the source directory and generates a new one. The file size is around 500 MB. I currently use the s3.putObject() method to upload each file after it is created. I have also heard about aws s3 sync; it comes with the AWS CLI and is used for uploading files to an S3 bucket.
I use the aws-sdk for Node.js for the S3 upload, and it does not contain an s3 sync method. Is s3 sync better than the s3.putObject() method? I need a faster upload.
There's always more than one way to do one thing, so to upload a file into an S3 bucket you can:
use the AWS CLI and run aws s3 cp ...
use the AWS CLI and run aws s3api put-object ...
use an AWS SDK (your language of choice)
You can also use the sync method, but for a single file there's no need to sync a whole directory, and when looking for better performance it's generally better to start multiple cp instances to benefit from multiple threads, versus sync's single thread.
Basically, all these methods are wrappers around the AWS S3 API calls. From the Amazon documentation:
Making REST API calls directly from your code can be cumbersome. It requires you to write the necessary code to calculate a valid signature to authenticate your requests. We recommend the following alternatives instead:
Use the AWS SDKs to send your requests (see Sample Code and Libraries). With this option, you don't need to write code to calculate a signature for request authentication because the SDK clients authenticate your requests by using access keys that you provide. Unless you have a good reason not to, you should always use the AWS SDKs.
Use the AWS CLI to make Amazon S3 API calls. For information about setting up the AWS CLI and example Amazon S3 commands see the following topics:
Set Up the AWS CLI in the Amazon Simple Storage Service Developer Guide.
Using Amazon S3 with the AWS Command Line Interface in the AWS Command Line Interface User Guide.
So Amazon recommends using the SDKs. At the end of the day, I think it's really a matter of what you're most comfortable with and how you will integrate this piece of code into the rest of your program. For one-time actions, I always go for the CLI.
In terms of performance, though, using one or the other will not make a difference, as again they're just wrappers around the AWS API calls. For transfer optimization, you should look at Amazon S3 Transfer Acceleration and see if you can enable it.
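For what it's worth, the SDKs also expose managed transfers that split a large upload into parts and send them concurrently, which is usually the bigger win for a single 500 MB file. A rough boto3 sketch (shown in Python since the other posts here use it; the file path, bucket, and tuning values are placeholders):

    import boto3
    from boto3.s3.transfer import TransferConfig

    s3 = boto3.client('s3')

    # Multipart upload with several concurrent threads; values are examples, tune for your link
    config = TransferConfig(
        multipart_threshold=64 * 1024 * 1024,   # switch to multipart above 64 MB
        multipart_chunksize=64 * 1024 * 1024,
        max_concurrency=8,
    )
    s3.upload_file('path/to/generated-file', 'my-bucket', 'generated-file', Config=config)

The aws-sdk for Node.js has a comparable managed helper, s3.upload(), which also does multipart with configurable partSize and queueSize, so you don't need the CLI's sync for a single file.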
