How to download a specific number of files/objects from an S3 bucket using an AWS CLI command? - aws-cli

Suppose I want to download only 10 files from the bucket. How do I pass 10 as an argument?

The easiest way is to write a Python script that you can run every 30 minutes. Here is Python code that does this:
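import boto3
import random
s3 = boto3.client('s3')
keys = []
# list_objects_v2 returns at most 1,000 keys per call, so paginate
paginator = s3.get_paginator('list_objects_v2')
for page in paginator.paginate(Bucket='bucket_name'):
    for obj in page.get('Contents', []):
        keys.append(obj['Key'])
# pick up to 10 distinct keys at random (no repeats, no out-of-range index)
for key in random.sample(keys, min(10, len(keys))):
    # the key is reused as the local file name; keys containing "/" need that local directory to exist
    s3.download_file('bucket_name', key, key)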
In the final loop, the number 10 sets how many random files are downloaded; you can replace it with a value passed in as an argument. If you want the script to run automatically every 30 minutes, wrap the code above in a function and use Python's "sched" module to call it repeatedly. Code for that is in the linked question:
What is the best way to repeatedly execute a function every x seconds in Python?

Your use case appears to be:
Every 30 minutes
Download 10 random files from Amazon S3
Presumably, these 10 files should not be files previously downloaded.
There is no in-built S3 functionality to download a random selection of files. Instead, you will need to:
Obtain a listing of files from your desired S3 bucket and optional path
Randomly select which files you want to download
Download the selected files
This is easily done in a programming language (e.g. Python): obtain a list of filenames, shuffle it, then loop through the list and download each file.
You can also do it in a shell script by calling the AWS Command-Line Interface (CLI) to obtain the listing (aws s3 ls) and to copy the files (aws s3 cp).
Alternatively, you could choose to synchronize ALL the files to your local machine (aws s3 sync) and then select random local files to process.
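As a rough sketch of the selection step in the Python approach, including skipping files that were already downloaded, something like the following could work. The helper names (pick_new_keys, download_batch) and the "seen" set are hypothetical, and the set would have to be saved somewhere between runs if the script is restarted every 30 minutes.

```python
import random


def pick_new_keys(all_keys, already_downloaded, count, rng=random):
    """Return up to `count` distinct keys that are not in `already_downloaded`."""
    candidates = [k for k in all_keys if k not in already_downloaded]
    return rng.sample(candidates, min(count, len(candidates)))


def download_batch(s3_client, bucket, keys, seen, count=10):
    """Download a random batch and record the chosen keys in `seen` (a set)."""
    chosen = pick_new_keys(keys, seen, count)
    for key in chosen:
        # Assumption: flatten any "folder" part of the key into the local file name.
        s3_client.download_file(bucket, key, key.replace("/", "_"))
        seen.add(key)
    return chosen
```

Here s3_client would be a boto3.client('s3') and keys the result of listing the bucket.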
Try the above steps. If you experience difficulties, post your code and the error/problem you are experiencing and we can assist.

Related

Reading file from G Drive via Apache Beam

I'm trying to fetch a file from Google Drive using Apache Beam. I tried:
filenames = ['https://drive.google.com/file/d/<file_id>']
with beam.Pipeline() as pipeline:
    lines = (pipeline | beam.Create(filenames))
    print(lines)
This returns a string like PCollection[[19]: Create/Map(decode).None]
I need to read a file from Google Drive and write it into a GCS bucket. How can I read a file from Google Drive with Apache Beam?
If you don’t have complex transformations to apply, I think it’s better not to use Beam in this case.
Solution 1
You can instead use Google Colab (Jupyter notebooks hosted on Google servers), mount your Google Drive, and use the gcloud CLI to copy the files.
You can check the following links :
google-drive-to-gcs
stackoverflow-copy-file-from-google-drive-to-gcs
Solution 2
You can also use APIs to retrieve files from Google Drive and copy them to Cloud Storage.
You can for example develop a Python script using Python Google clients and the following packages :
google-api-python-client
google-auth-httplib2
google-auth-oauthlib
google-cloud-storage
This article shows an example.
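A minimal sketch of such a script, under several assumptions: application default credentials are available and carry the Drive read-only scope, the file is a regular binary file rather than a Google Docs document (those need an export call instead), and the file is small enough to buffer in memory. The function names are hypothetical.

```python
import io
import re


def extract_file_id(url):
    """Pull the file ID out of a https://drive.google.com/file/d/<file_id> link."""
    match = re.search(r"/file/d/([^/?#]+)", url)
    if match is None:
        raise ValueError(f"no Drive file ID found in {url!r}")
    return match.group(1)


def copy_drive_file_to_gcs(url, bucket_name, blob_name):
    """Download one Drive file into memory and upload it to Cloud Storage."""
    # Imported here so the helper above works without the Google packages.
    import google.auth
    from google.cloud import storage
    from googleapiclient.discovery import build
    from googleapiclient.http import MediaIoBaseDownload

    # Assumption: the default credentials are allowed to use this scope.
    creds, _ = google.auth.default(
        scopes=["https://www.googleapis.com/auth/drive.readonly"])
    drive = build("drive", "v3", credentials=creds)

    buffer = io.BytesIO()
    request = drive.files().get_media(fileId=extract_file_id(url))
    downloader = MediaIoBaseDownload(buffer, request)
    done = False
    while not done:
        _, done = downloader.next_chunk()  # fetch the file chunk by chunk

    buffer.seek(0)
    storage.Client().bucket(bucket_name).blob(blob_name).upload_from_file(buffer)
```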
If you want to use Beam for this, you could write a function
def read_from_gdrive_and_yield_records(path):
    ...
and then use it like
import apache_beam as beam
filenames = ['https://drive.google.com/file/d/<file_id>']
with beam.Pipeline() as pipeline:
    paths = pipeline | beam.Create(filenames)
    records = paths | beam.FlatMap(read_from_gdrive_and_yield_records)
    records | beam.io.WriteToText('gs://...')
Though as mentioned, unless you have a lot of files, this may be overkill.

Downloading S3 files in Google Colab

I am working on a project where some of the data is provided through an S3FileSystem. I can read it with S3FileSystem.open(path), but there are more than 360 files and each one takes at least 3 minutes to read. Is there a way to download these files to my system and read them from there, instead of reading them directly from the S3FileSystem? There is another reason: whenever my Colab session reconnects I have to re-read all the files, which takes a lot of time. I am using the following code to read the files:
import s3fs
import xarray as xr
fs_s3 = s3fs.S3FileSystem(anon=True)
s3path = 'file_name'
remote_file_obj = fs_s3.open(s3path, mode='rb')
ds = xr.open_dataset(remote_file_obj, engine='h5netcdf')
Is there any way of downloading those files?
You can use a different tool that is also called s3fs (the FUSE-based one) to mount the bucket, then copy the files to Colab.
how to mount
After mounting, you can
!cp /s3/yourfile.zip /content/
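Without mounting, the Python s3fs object from the question can also copy a file to local disk with its get method. The sketch below skips files that already exist locally; the helper name download_if_missing and the target directory are hypothetical, and pointing the directory at a mounted Google Drive folder so that copies survive a session reconnect is an assumption to verify for your setup.

```python
import os


def download_if_missing(fs, remote_path, local_dir):
    """Copy remote_path to local_dir unless a local copy already exists."""
    os.makedirs(local_dir, exist_ok=True)
    local_path = os.path.join(local_dir, os.path.basename(remote_path))
    if not os.path.exists(local_path):
        fs.get(remote_path, local_path)  # s3fs/fsspec: copy remote -> local
    return local_path


# Intended use with the question's objects:
# local = download_if_missing(fs_s3, s3path, '/content/data')
# ds = xr.open_dataset(local, engine='h5netcdf')
```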

boto3 - Getting files only uploaded in the past month in S3

I am writing a Python 3 Lambda function which needs to return all of the files that were uploaded to an S3 bucket in the 30 days before the function is run.
How should I approach this? Ideally, I want to iterate only through the files from the past 30 days and nothing else. There are thousands upon thousands of files in the S3 bucket, and maybe 100 at most will be updated or uploaded per month, so it would be very inefficient to iterate through every file and compare dates. There is also a 29-second time limit for AWS API Gateway.
Any help would be greatly appreciated. Thanks!
You will need to iterate through the list of objects (sample code: List s3 buckets with its size in csv format) and compare the date within the Python code (sample code: Get day old filepaths from s3 bucket).
There is no filter when listing objects (aside from Prefix).
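A minimal sketch of that listing-and-comparing approach, assuming the cutoff is a timezone-aware UTC datetime (boto3 returns LastModified as timezone-aware). The function names are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def recent_keys(pages, cutoff):
    """Yield keys from list_objects_v2-style pages modified at or after cutoff."""
    for page in pages:
        for obj in page.get("Contents", []):  # empty pages have no "Contents"
            if obj["LastModified"] >= cutoff:
                yield obj["Key"]


def keys_from_last_30_days(bucket, prefix=""):
    """List the bucket with boto3 and keep only recently modified keys."""
    import boto3  # imported lazily so recent_keys can be used without it

    cutoff = datetime.now(timezone.utc) - timedelta(days=30)
    paginator = boto3.client("s3").get_paginator("list_objects_v2")
    return list(recent_keys(paginator.paginate(Bucket=bucket, Prefix=prefix), cutoff))
```

This still lists every object under the prefix, so it does not remove the cost of the full listing; it only keeps the date comparison in one place.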
An alternative is to use Amazon S3 Inventory, which can provide a daily CSV file listing the contents of a bucket. You could parse that CSV instead of listing the objects.
A more extreme option is to keep a separate database of objects, which would need to be updated whenever objects are added/deleted. This could be done via Amazon S3 Events that trigger an AWS Lambda function. Lots of work, though.
I can't give you a 100% answer, since you asked for the upload date, but if you can live with the 'last modified' value, this code snippet should do the job:
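import boto3
import datetime
paginator = boto3.client('s3').get_paginator('list_objects_v2')
# LastModified is in UTC, so build the cutoff in UTC as well
date = (datetime.datetime.now(datetime.timezone.utc) - datetime.timedelta(days=30)).replace(microsecond=0)
# JMESPath compares the timestamps as strings
query = f"Contents[?to_string(LastModified)>='\"{date}\"']"
# pages without Contents yield None, so skip those
filtered_files = (obj['Key'] for obj in paginator.paginate(Bucket="bucketname").search(query) if obj)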
For filtering I used JMESPath.
From an architecture perspective
The bottleneck is whether you can iterate over all objects within 30 seconds. If there are simply too many files, there are a few more options you can use:
Create an AWS Lambda function, triggered by the S3:PutObject event, that stores the S3 key and last_modified_at information in DynamoDB (an AWS key-value NoSQL database). Then you can easily query DynamoDB to filter the S3 keys and retrieve those S3 objects accordingly.
Create an AWS Lambda function, triggered by the S3:PutObject event, that moves the file to a partitioned S3 key schema location such as s3://bucket/datalake/year=${year}/month=${month}/day=${day}/your-file.csv. Then you can easily use the partition information to locate the subset of your objects, which fits within the 30-second hard limit.
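To illustrate the partitioned key option, here is a small sketch that builds such a key and lists the day-level prefixes covering the last 30 days. The function names, the "datalake" root default, and the zero-padded month and day are assumptions; the scheme above does not specify padding.

```python
from datetime import date, timedelta


def partitioned_key(filename, day, root="datalake"):
    """Build a key like datalake/year=2024/month=03/day=07/<filename>."""
    return f"{root}/year={day:%Y}/month={day:%m}/day={day:%d}/{filename}"


def prefixes_for_last_days(today, days=30, root="datalake"):
    """Return one day-level prefix per day in the window, newest first."""
    return [
        partitioned_key("", today - timedelta(days=offset), root)
        for offset in range(days)
    ]
```

Each returned prefix can be passed as Prefix when listing objects, so only about 30 small listings are needed instead of a scan of the whole bucket.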
From a programming perspective
Here's a code snippet that solves your problem using the s3pathlib library:
from datetime import datetime, timedelta
from s3pathlib import S3Path
# define a folder
p_dir = S3Path("bucket/my-folder/")
# find the datetime one month ago
now = datetime.utcnow()
one_month_ago = now - timedelta(days=30)
# filter by last modified
for p in p_dir.iter_objects().filter(
    # any Filterable Attribute can be used for filtering
    S3Path.last_modified_at >= one_month_ago
):
    # do whatever you like
    print(p.console_url)  # click the link to open it in the console and inspect
If you want to filter on other S3Path attributes, use other comparators, or even define a custom filter, you can follow this document.

Use images in s3 with SageMaker without .lst files

I am trying to create (what I thought was) a simple image classification pipeline between s3 and SageMaker.
Images are stored in an s3 bucket with their class labels in their file names currently, e.g.
My-s3-bucket-dir
cat-1.jpg
dog-1.jpg
cat-2.jpg
..
I've been trying to leverage several related example .py scripts, but most seem to download data sets that are already in .rec format or that contain special manifest or annotation files I don't have.
All I want is to pass the images from s3 to the SageMaker image classification algorithm that's located in the same region, IAM account, etc. I suppose this means I need a .lst file.
When I try to manually create the .lst it doesn't seem to like it and it also takes too long doing manual work to be a good practice.
How can I automatically generate the .lst file (or otherwise send the images/classes for training)?
Things I read made it sound like im2rec.py was a solution, but I don't see how. The example I'm working with now is
Image-classification-fulltraining-highlevel.ipynb
but it seems to download the data as .rec,
download('http://data.mxnet.io/data/caltech-256/caltech-256-60-train.rec')
download('http://data.mxnet.io/data/caltech-256/caltech-256-60-val.rec')
which just skips working with the .jpeg files. I found another that converts them to .rec but again it has essentially the .lst already as .json and just converts it.
I have mostly been working in a Python Jupyter notebook within the AWS console (in my browser) but I have also tried using their GUI.
How can I simply and automatically generate the .lst or otherwise get the data/class info into SageMaker without manually creating a .lst file?
Update
It looks like im2rec.py can't be run against s3. You'd have to completely download everything from all s3 buckets into the notebook's storage...
Please note that [...] im2rec.py is running locally,
therefore cannot take input from the S3 bucket. To generate the list
file, you need to download the data and then use the im2rec tool. - AWS SageMaker Team
There are 3 options to provide annotated data to the Image Classification algo: (1) packing labels in recordIO files, (2) storing labels in a JSON manifest file ("augmented manifest" option), (3) storing labels in a list file. All options are documented here: https://docs.aws.amazon.com/sagemaker/latest/dg/image-classification.html.
The augmented manifest and .lst options are quick, since they only require you to create an annotation file, usually with a short for loop. RecordIO requires the im2rec.py tool, which is a little more work.
For a .lst file, you can write the annotations with a quick for loop, like this:
# assuming train_index, train_class, train_pics store the pic index, class and path
with open('train.lst', 'a') as file:
    for index, cl, pic in zip(train_index, train_class, train_pics):
        file.write(str(index) + '\t' + str(cl) + '\t' + pic + '\n')
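For file names like cat-1.jpg from the question, the index, class, and path values could be derived from the S3 keys themselves. This is only a sketch: the function name is hypothetical, and it assumes the label is the part of the file name before the last "-" and that class ids are assigned in alphabetical order of the labels.

```python
import os


def build_lst_lines(keys):
    """Turn image keys such as 'cat-1.jpg' into .lst lines (index, class id, path)."""
    # Label = file-name text before the last "-"; this naming rule is an assumption.
    labels = [os.path.basename(k).rsplit("-", 1)[0] for k in keys]
    class_ids = {name: i for i, name in enumerate(sorted(set(labels)))}
    return [
        f"{index}\t{class_ids[label]}\t{key}"
        for index, (key, label) in enumerate(zip(keys, labels))
    ]
```

The keys can come from listing the bucket, so no image has to be downloaded just to write the list file; joining the lines with newlines gives the content of train.lst.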

How to append files in GCS with the same schema?

Is there any way to append two files in GCS? Suppose the first file is a full load and the second file is an incremental load. How can we append the two?
Secondly, gsutil compose appends the two files including the attribute names (the header row) as well, whereas in the final file I want only the data of the two files.
You can append two separate files using compose in the Google Cloud Shell and rename the output file as the first file, like this:
gsutil compose gs://bucket/obj1 [gs://bucket/obj2 ...] gs://bucket/obj1
This command is meant for parallel uploads, in which you divide a large file into smaller objects. They are uploaded to Google Cloud Storage and then composed to recreate the original file. You can find more information in Composite Objects and Parallel Uploads.
I've come up with two possible solutions:
Google Cloud Function solution
The option I would go for is a Cloud Function, doing something like the following:
Create an empty bucket like append_bucket.
Upload the first file.
Create a Cloud Function to be triggered by newly uploaded files in the bucket.
Upload the second file.
Read the first and the second file (you will have to download them as strings first).
Make the append operation.
Upload the result to the bucket.
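The read, append, and upload steps could look roughly like the sketch below. It assumes both files are text with an identical first (header) line and are small enough to hold in memory; the function names are hypothetical, and the Cloud Function trigger wiring is left out.

```python
def append_without_header(full_text, incremental_text):
    """Append the incremental rows to the full load, keeping a single header row."""
    full_lines = full_text.splitlines()
    extra_lines = incremental_text.splitlines()
    if full_lines and extra_lines and extra_lines[0] == full_lines[0]:
        extra_lines = extra_lines[1:]  # drop the duplicated header
    return "\n".join(full_lines + extra_lines) + "\n"


def append_objects(bucket_name, full_name, incremental_name):
    """Read both objects as text, merge them, and overwrite the full-load object."""
    from google.cloud import storage  # lazy import: needs google-cloud-storage

    bucket = storage.Client().bucket(bucket_name)
    merged = append_without_header(
        bucket.blob(full_name).download_as_text(),
        bucket.blob(incremental_name).download_as_text(),
    )
    bucket.blob(full_name).upload_from_string(merged)
```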
Google Dataflow solution
You can also do it with Dataflow for BigQuery (keep in mind it’s still in beta).
Create a BigQuery dataset and table.
Create a Dataflow job from the Cloud Storage Text to BigQuery template.
Create a JavaScript file with the logic to transform the text.
Upload your files in JSON format to the bucket.
Dataflow will read the JSON file, execute the JavaScript code, and append the new data to the BigQuery table.
Finally, export the BigQuery query result to Cloud Storage.

Resources