Because Glacier Deep Archive's handling of small objects is expensive, I am writing an archiver. It would be most helpful to be able to ask boto3 for a list of objects in the bucket that are not already in the desired storage class. Thanks to this answer, I know I can do this in a shell:
aws s3api list-objects --bucket $BUCKETNAME --query 'Contents[?StorageClass!=`DEEP_ARCHIVE`]'
Is there a way to pass that query parameter into boto3? I haven't dug into the source yet, but I thought boto3 was essentially a wrapper around the command-line tools. However, I can't find docs or examples anywhere that use this technique.
Is there a way to pass that query parameter into boto3?
Sadly, you can't do this, as the --query option is specific to the AWS CLI. But boto3 is the AWS SDK for Python, so you can very easily post-process its output to obtain the same results as from the CLI.
The --query option is based on JMESPath, so if you really want to use JMESPath expressions in your Python code, you can use the jmespath package.
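For example, a minimal sketch that reproduces the CLI query with boto3 and the jmespath package (the bucket name is hypothetical) could look like this:

import boto3
import jmespath

# Hypothetical bucket name; replace with your own.
BUCKET = "my-archive-bucket"

s3 = boto3.client("s3")
paginator = s3.get_paginator("list_objects_v2")

# Collect objects that are not yet in DEEP_ARCHIVE, page by page.
not_archived = []
for page in paginator.paginate(Bucket=BUCKET):
    matches = jmespath.search("Contents[?StorageClass != 'DEEP_ARCHIVE']", page)
    if matches:
        not_archived.extend(matches)

for obj in not_archived:
    print(obj["Key"], obj["StorageClass"])

Using the paginator also handles buckets with more than 1,000 objects, which a single list-objects call would not.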
Query S3 Inventory size column with Athena.
https://docs.aws.amazon.com/AmazonS3/latest/userguide/storage-inventory-athena-query.html
Related
I have a feeling the answer to my question will be the correct Google term that I am missing, but here we go.
I need to trigger the event for all objects in an S3 bucket without uploading them again. The reason is that I have a Lambda function triggered on PutObject, and I want to reprocess all those files. They are huge images, and re-uploading does not sound like a good idea.
I am trying to do this in Node.js, but any language that anyone is comfortable with will help, and I will translate.
Thanks
An Amazon S3 Event can trigger an AWS Lambda function when an object is created/deleted/replicated.
However, it is not possible to "trigger the object" -- the object would need to be created/deleted/replicated to cause the Amazon S3 Event to be generated.
As an alternative, you could create a small program that lists the objects in the bucket, and then directly invokes the AWS Lambda function, passing the object details in the event message to make it look like it came from Amazon S3. There is a sample S3 Event in the Lambda 'test' function -- you could copy this template and have your program insert the appropriate bucket and object key. Your Lambda function would then process it exactly as if an S3 Event had triggered the function.
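A rough sketch of that approach in Python with boto3 (the bucket and function names are hypothetical, and the event carries only the fields most handlers read) might look like this:

import json
import boto3

# Hypothetical names; replace with your own bucket and function.
BUCKET = "my-image-bucket"
FUNCTION_NAME = "my-image-processor"

s3 = boto3.client("s3")
lambda_client = boto3.client("lambda")

paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=BUCKET):
    for obj in page.get("Contents", []):
        # Minimal S3-style event; add fields if your function reads more of them.
        event = {
            "Records": [
                {
                    "eventSource": "aws:s3",
                    "eventName": "ObjectCreated:Put",
                    "s3": {
                        "bucket": {"name": BUCKET},
                        "object": {"key": obj["Key"], "size": obj["Size"]},
                    },
                }
            ]
        }
        lambda_client.invoke(
            FunctionName=FUNCTION_NAME,
            InvocationType="Event",  # asynchronous, fire-and-forget
            Payload=json.dumps(event).encode("utf-8"),
        )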
In addition to what is explained above, you can use Amazon S3 Batch Operations.
We used this to encrypt existing objects in the S3 bucket that were not encrypted earlier.
It was the easiest out-of-the-box solution, available in the S3 console itself.
You could also loop through all objects in the bucket and add a tag. Next, adjust your trigger event to include tag changes. Code sample in bash to follow after I test it.
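In the meantime, here is a rough sketch of that tagging idea using boto3 rather than bash (the bucket name and tag are hypothetical):

import boto3

# Hypothetical bucket and tag; replace with your own.
BUCKET = "my-image-bucket"

s3 = boto3.client("s3")
paginator = s3.get_paginator("list_objects_v2")

for page in paginator.paginate(Bucket=BUCKET):
    for obj in page.get("Contents", []):
        # Writing tags emits an s3:ObjectTagging:Put event, which your
        # trigger configuration can include. Note that put_object_tagging
        # replaces any existing tags on the object.
        s3.put_object_tagging(
            Bucket=BUCKET,
            Key=obj["Key"],
            Tagging={"TagSet": [{"Key": "reprocess", "Value": "true"}]},
        )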
My scenario is that I am currently using the AWS CLI to upload my directory contents to an S3 bucket with the following command:
aws s3 sync results/foo s3://bucket/
Now I need to replace this with Python code. I am exploring the boto3 documentation to find the right way to do it. I see some options, such as:
https://boto3.amazonaws.com/v1/documentation/api/1.9.42/reference/services/s3.html#S3.Client.upload_file
https://boto3.amazonaws.com/v1/documentation/api/1.9.42/reference/services/s3.html#S3.ServiceResource.Object
Could someone suggest which is the right approach?
I am aware that I would have to get the credentials by calling boto3.client('sts').assume_role(role, session) and use them subsequently.
The AWS CLI is actually written in Python and uses the same API calls you can use.
The important thing to realize is that Amazon S3 only has an API call to upload/download one object at a time.
Therefore, your Python code would need to:
Obtain a list of files to copy
Loop through each file and upload it to Amazon S3
Of course, if you want sync functionality (which only copies new/modified files), then your program will need more intelligence to figure out which files to copy.
Boto3 has two general types of methods:
client methods that map 1:1 with API calls, and
resource methods that are more Pythonic but might make multiple API calls in the background
Which type you use is your own choice. Personally, I find the client methods easier for uploading/downloading objects, and the resource methods are good when having to loop through resources (e.g. "for each EC2 instance, for each EBS volume, check each tag").
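As a rough sketch of that loop using client methods (the local directory, bucket, and key prefix below are assumptions, not taken from your setup):

import os
import boto3

# Hypothetical values; adjust to your environment.
LOCAL_DIR = "results/foo"
BUCKET = "bucket"
PREFIX = ""  # optional key prefix in the bucket

s3 = boto3.client("s3")

# Walk the local directory and upload each file, preserving relative paths as keys.
for root, _dirs, files in os.walk(LOCAL_DIR):
    for name in files:
        local_path = os.path.join(root, name)
        relative = os.path.relpath(local_path, LOCAL_DIR)
        key = os.path.join(PREFIX, relative).replace(os.sep, "/")
        s3.upload_file(local_path, BUCKET, key)
        print(f"Uploaded {local_path} -> s3://{BUCKET}/{key}")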
Has anyone found a way to hide boto3 credentials in a Python script that gets called from AWS Glue?
Right now I have my key and access_key embedded within my script, and I am pretty sure that this is not good practice...
I found the answer! When I established the IAM role for my Glue services, I didn't realize I was opening that up to boto3 as well.
The answer is that I don't need to pass my credentials. I simply use this:
mySession = boto3.Session(region_name='my_region_name')
and it works like a charm!
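For example, a client created from that session picks up the Glue job's role credentials automatically; a small sketch (the bucket name is hypothetical):

import boto3

# No keys needed: the Glue job's IAM role supplies temporary credentials.
mySession = boto3.Session(region_name='my_region_name')
s3 = mySession.client('s3')

# Hypothetical bucket name, just to show the client in use.
response = s3.list_objects_v2(Bucket='my-bucket')
for obj in response.get('Contents', []):
    print(obj['Key'])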
I need to upload a large file to an AWS S3 bucket. Every 10 minutes my code deletes the old file from the source directory and generates a new one. The file size is around 500 MB. Currently I use the s3.putObject() method to upload each file after it is created. I have also heard about aws s3 sync, which comes with the AWS CLI and is used to upload files to an S3 bucket.
I use the aws-sdk for Node.js for S3 uploads, but it does not contain an s3 sync method. Is s3 sync better than the s3.putObject() method? I need faster uploads.
There's always more than one way to do a thing, so to upload a file into an S3 bucket you can:
use the AWS CLI and run aws s3 cp ...
use the AWS CLI and run aws s3api put-object ...
use an AWS SDK (your language of choice)
You can also use the sync method, but for a single file there is no need to sync a whole directory; and generally, when looking for better performance, it is better to start multiple cp instances to benefit from multiple threads, versus sync's single thread.
Basically, all of these methods are wrappers around the AWS S3 API calls. From the Amazon documentation:
Making REST API calls directly from your code can be cumbersome. It requires you to write the necessary code to calculate a valid signature to authenticate your requests. We recommend the following alternatives instead:
Use the AWS SDKs to send your requests (see Sample Code and Libraries). With this option, you don't need to write code to calculate a signature for request authentication because the SDK clients authenticate your requests by using access keys that you provide. Unless you have a good reason not to, you should always use the AWS SDKs.
Use the AWS CLI to make Amazon S3 API calls. For information about setting up the AWS CLI and example Amazon S3 commands see the following topics:
Set Up the AWS CLI in the Amazon Simple Storage Service Developer Guide.
Using Amazon S3 with the AWS Command Line Interface in the AWS Command Line Interface User Guide.
So Amazon recommends using the SDK. At the end of the day, I think it's really a matter of what you're most comfortable with and how you will integrate this piece of code into the rest of your program. For a one-time action, I always go with the CLI.
In terms of performance, though, using one or the other will not make a difference, as again they are just wrappers around the AWS API calls. For transfer optimization, you should look at Amazon S3 Transfer Acceleration and see if you can enable it.
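If you end up going the SDK route, here is a minimal sketch (in Python with boto3, purely as an illustration of what the SDKs offer; the bucket, key, and local path are assumptions) of a multipart upload with several parallel threads, which is usually the main lever for speeding up a ~500 MB upload:

import boto3
from boto3.s3.transfer import TransferConfig

# Hypothetical names; adjust to your environment.
BUCKET = "my-bucket"
KEY = "results/latest.dat"
LOCAL_PATH = "/tmp/latest.dat"

s3 = boto3.client("s3")

# Multipart upload with ~50 MB parts and several concurrent threads.
config = TransferConfig(
    multipart_threshold=50 * 1024 * 1024,
    multipart_chunksize=50 * 1024 * 1024,
    max_concurrency=8,
    use_threads=True,
)

s3.upload_file(LOCAL_PATH, BUCKET, KEY, Config=config)

The aws-sdk for Node.js has a comparable managed-upload facility; whichever SDK you use, the idea is the same: split the file into parts and upload them concurrently.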
I'm working on a few Node.js projects where Amazon S3 buckets will be set up for each user who creates an account. I want to limit the total size of what a user can store.
I've been looking at the Knox client https://github.com/learnboost/knox which should make working with Amazon S3 easy when developing Node.js applications.
But after much research, I can't seem to find a way of getting back a bucket's total size efficiently, on which I could base user account limits and so on.
I'm hoping this is the right approach in the first place, but maybe not? Conceptually, I want to store each user's uploaded media files on Amazon S3, and I want to limit how much the user can upload and use in total.
Many thanks,
James
There is no exposed API to get the size of a bucket. The only way to do it is to get all the keys, iterate through them, and sum up the size of all objects in the bucket.
As Elf stated, there is no direct way to get the total size of a bucket from any of the S3 operations. The best you can do is loop through all of the items in the bucket and sum their respective sizes.
Here's a complete example of a program which lists all of the objects in a bucket and prints out a summary of file count and byte count at the end:
https://github.com/appsattic/node-awssum-scripts/blob/master/bin/amazon-s3-list.js
Feel free to copy it and change it to your requirements.
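If you are not tied to Node.js, a minimal sketch of that loop-and-sum approach in Python with boto3 (the bucket name is hypothetical) could look like this:

import boto3

# Hypothetical bucket name; replace with your own.
BUCKET = "my-bucket"

s3 = boto3.client("s3")
paginator = s3.get_paginator("list_objects_v2")

total_bytes = 0
total_objects = 0
for page in paginator.paginate(Bucket=BUCKET):
    for obj in page.get("Contents", []):
        total_bytes += obj["Size"]
        total_objects += 1

print(f"{total_objects} objects, {total_bytes} bytes total")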
I know you asked about Node.js, but for those who just want ANY way to do this, check out the s3cmd tool: http://s3tools.org/s3cmd
Once you get that set up, you can run s3cmd du s3://bucket-name
Try it on a small bucket first to be sure it is working. This command still loops through everything, so big bucket = big time.
I got to this page because I was looking for the same solution. The AWS CLI has a way of doing this:
aws s3api list-objects --bucket $bucket --output json --query "[sum(Contents[].Size)]"
I wrote a super simple wrapper that will convert the bytes to KB, MB, or GB.
I am not the most elegant coder on the planet, but it works by running:
s3du my-bucket-name g (for GB, k for KB, m for MB)
The larger the bucket, the longer this will take, but it works:
https://github.com/defenestratexp/s3du.git
Obviously, you have to have the AWS CLI properly installed for this method to work. Cheers :D