Boto3 Script to List EBS Snapshots Older than 30 days while exporting instance data to CSV - python-3.x

Looking for a boto3 script that identifies EBS snapshots older than 30 days, reports the Instance ID, Volume ID, Volume Name, Volume Size, and Volume Type, and then exports that data to a CSV.
Our plan is ultimately to delete the snapshots older than 30 days, but we are looking to start with just identification.
Thank you!!

Here's a way to get that information.
For an explanation of the fields accessed, see describe_snapshots(). That documentation also shows sample output, which makes it relatively easy to code this type of script.
import boto3
from datetime import datetime, timezone

ec2_client = boto3.client('ec2')

snapshot_response = ec2_client.describe_snapshots(OwnerIds=['self'])

for snapshot in snapshot_response['Snapshots']:
    print(snapshot['SnapshotId'])
    print(snapshot['VolumeId'])
    print(snapshot['VolumeSize'])
    print(snapshot['StartTime'])

    days_old = (datetime.now(timezone.utc) - snapshot['StartTime']).days
    print(days_old)

    volume_response = ec2_client.describe_volumes(VolumeIds=[snapshot['VolumeId']])
    volume = volume_response['Volumes'][0]
    print(volume['VolumeType'])

    for attachment in volume['Attachments']:
        print(attachment['InstanceId'])
The call to describe_volumes() was required to retrieve the VolumeType and InstanceId because they are attributes of the volume from which the Snapshot was produced. If you are merely deleting snapshots based on their creation date, you should not need to call describe_volumes().
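For the CSV part of the question, the values printed in the loop above can be collected into dicts and written out with the standard csv module. This is a minimal sketch; the field names and the output filename (snapshots.csv) are my own choices, not anything AWS-defined:

```python
import csv

def write_snapshot_report(rows, out_path="snapshots.csv"):
    """Write a list of per-snapshot dicts to a CSV file."""
    fieldnames = ["SnapshotId", "InstanceId", "VolumeId",
                  "VolumeSize", "VolumeType", "DaysOld"]
    with open(out_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)

# Example row, shaped like the values printed in the loop above
write_snapshot_report([{
    "SnapshotId": "snap-0123456789abcdef0",
    "InstanceId": "i-0123456789abcdef0",
    "VolumeId": "vol-0123456789abcdef0",
    "VolumeSize": 8,
    "VolumeType": "gp3",
    "DaysOld": 45,
}])
```

Inside the loop you would append one such dict per snapshot (guarding for snapshots whose volume no longer exists or is unattached) and call the writer once at the end.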

Related

Add the creation date of a parquet file into a DataFrame

Currently I load multiple parquet file with this code :
df = spark.read.parquet("/mnt/dev/bronze/Voucher/*/*")
(Into the Voucher folder, there is one folder by date, and one parquet file inside it)
How can I add the creation date of each parquet file into my DataFrame ?
Thanks
EDIT 1:
Thanks rainingdistros, I wrote this:
import os
from datetime import datetime, timedelta
Path = "/dbfs/mnt/dev/bronze/Voucher/2022-09-23/"
fileFull = Path + 'XXXXXX.parquet'
statinfo = os.stat(fileFull)
create_date = datetime.fromtimestamp(statinfo.st_ctime)
display(create_date)
Now I must find a way to loop through all the files and add a column in the DataFrame.
The information returned by os.stat might not be accurate unless adding the additional column with the creation time is the first operation you perform on these files.
Each time a file is modified, both st_mtime and st_ctime are updated to the modification time. When I modify a file, the change can be observed in the information returned by os.stat.
So, if adding this column is the first operation that is going to be performed on these files, you can use the following code to add the date as a column:
import os
from datetime import datetime

import pandas as pd

path = "/dbfs/mnt/repro/2022-12-01"
fileinfo = os.listdir(path)

for file in fileinfo:
    pdf = pd.read_csv(f"{path}/{file}")
    pdf.display()

    statinfo = os.stat(f"{path}/{file}")
    create_date = datetime.fromtimestamp(statinfo.st_ctime)
    pdf['creation_date'] = [create_date.date()] * len(pdf)

    pdf.to_csv(f"{path}/{file}", index=False)
These files will have the new column after running the code.
Since the date is already available in the folder name, it might be better to take the value directly from the path and add it as a column to the files in a similar manner as in the above code.
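Extracting the date from the folder name can be sketched like this. The /…/Voucher/<yyyy-mm-dd>/file.parquet layout is assumed from the question:

```python
import re
from datetime import date

def creation_date_from_path(file_path):
    """Pull the yyyy-mm-dd folder name out of a path like
    /dbfs/mnt/dev/bronze/Voucher/2022-09-23/data.parquet."""
    match = re.search(r"/(\d{4})-(\d{2})-(\d{2})/", file_path)
    if match is None:
        raise ValueError(f"no date folder in {file_path!r}")
    year, month, day = (int(part) for part in match.groups())
    return date(year, month, day)

print(creation_date_from_path("/dbfs/mnt/dev/bronze/Voucher/2022-09-23/data.parquet"))
# 2022-09-23
```

The returned date can then be assigned to the creation_date column instead of the os.stat value, which avoids the st_ctime caveat above entirely.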
See if below steps help....
Refer to the link to get the list of files in DBFS - SO - Loop through Files in DBFS
Once you have the files, loop through them and for each file use the code you have written in your question.
Please note that dbutils exposes the mtime of a file. The os module provides a way to read the ctime, i.e. the time of the most recent metadata change on Unix; ideally st_birthtime would give the creation time, but that did not work in my trials. Hope it works for you.

boto3 - Getting files only uploaded in the past month in S3

I am writing a python3 lambda function which needs to return all of the files that were uploaded to an S3 bucket in the past 30 days from the time that the function is run.
How should I approach this? Ideally, I want to iterate only through the files from the past 30 days and nothing else - there are thousands upon thousands of files in the S3 bucket that I am iterating through, and maybe 100 max will be updated/uploaded per month. It would be very inefficient to have to iterate through every file and compare dates like that. There is also a 29-second time limit for Amazon API Gateway.
Any help would be greatly appreciated. Thanks!
You will need to iterate through the list of objects (sample code: List s3 buckets with its size in csv format) and compare the date within the Python code (sample code: Get day old filepaths from s3 bucket).
There is no filter when listing objects (aside from Prefix).
An alternative is to use Amazon S3 Inventory, which can provide a daily CSV file listing the contents of a bucket. You could parse that CSV instead of listing the objects.
A more extreme option is to keep a separate database of objects, which would need to be updated whenever objects are added/deleted. This could be done via Amazon S3 Events that trigger an AWS Lambda function. Lots of work, though.
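The S3 Inventory approach can be sketched as follows. This assumes an inventory CSV with Bucket, Key, and LastModified columns (the actual columns depend on how the inventory is configured), filtering for objects modified in the last 30 days:

```python
import csv
import io
from datetime import datetime, timedelta, timezone

def recent_keys(inventory_csv, days=30, now=None):
    """Return keys whose LastModified falls within the last `days` days."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=days)
    keys = []
    for row in csv.DictReader(io.StringIO(inventory_csv)):
        last_modified = datetime.fromisoformat(row["LastModified"])
        if last_modified >= cutoff:
            keys.append(row["Key"])
    return keys

# Sample inventory content, just for illustration
sample = (
    "Bucket,Key,LastModified\n"
    "my-bucket,new/file.txt,2024-06-01T12:00:00+00:00\n"
    "my-bucket,old/file.txt,2023-01-01T12:00:00+00:00\n"
)
print(recent_keys(sample, now=datetime(2024, 6, 10, tzinfo=timezone.utc)))
# ['new/file.txt']
```

In a Lambda you would download the inventory object from S3 first; the filtering itself then runs in memory and is far cheaper than paginating through thousands of ListObjects results.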
I can't give you a 100% answer, since you have asked for the upload date, but if you can live with the 'last modified' value, this code snippet should do the job:
import boto3
import datetime
paginator = boto3.client('s3').get_paginator('list_objects')
date = datetime.datetime.now() - datetime.timedelta(30)

filtered_files = (
    obj['Key']
    for obj in paginator.paginate(Bucket="bucketname").search(
        f"Contents[?to_string(LastModified) >= '\"{date}\"']"
    )
)
For filtering I used JMESPath.
From the architecture perspective
The bottleneck is whether you can iterate over all objects within 30 seconds. If there are simply too many files, there are a few more options you can use:
Create an AWS Lambda function triggered by the S3:PutObject event that stores the S3 key and last_modified_at information in DynamoDB (an AWS key-value NoSQL database). You can then easily use DynamoDB to filter the S3 keys and retrieve those S3 objects accordingly.
Create an AWS Lambda function triggered by the S3:PutObject event that moves the file to a partitioned S3 key schema location such as s3://bucket/datalake/year=${year}/month=${month}/day=${day}/your-file.csv. You can then easily use the partition information to locate the subset of your objects, which fits within the 30-second hard limit.
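Building such a partitioned key can be sketched as below; the datalake/ prefix mirrors the example path above and is just an illustration:

```python
from datetime import date

def partitioned_key(d, filename, prefix="datalake"):
    """Build a year=/month=/day= partitioned S3 key for a given date."""
    return f"{prefix}/year={d.year}/month={d.month:02d}/day={d.day:02d}/{filename}"

key = partitioned_key(date(2024, 6, 1), "your-file.csv")
print(key)
# datalake/year=2024/month=06/day=01/your-file.csv
```

Listing with Prefix="datalake/year=2024/month=06/day=01/" then returns only that day's objects, so a 30-day window is at most 30 small listings instead of one scan over the whole bucket.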
From the programming perspective
Here's a code snippet that solves your problem using the s3pathlib library:
from datetime import datetime, timedelta
from s3pathlib import S3Path

# define a folder
p_dir = S3Path("bucket/my-folder/")

# find the datetime one month ago
now = datetime.utcnow()
one_month_ago = now - timedelta(days=30)

# filter by last modified
for p in p_dir.iter_objects().filter(
    # any filterable attribute can be used for filtering
    S3Path.last_modified_at >= one_month_ago
):
    # do whatever you like
    print(p.console_url)  # click the link to open it in the console and inspect
If you want to use other S3Path attributes for filtering, use other comparators, or even define your own custom filter, you can follow this document.

Lambda Function to convert csv to parquet from s3

I have a requirement -
1. To convert a parquet file present in s3 to csv format and place it back in s3. The process should exclude the use of EMR.
2. The parquet file has more than 100 cols; I need to extract just 4 cols from it and create the csv in s3.
Does anyone have a solution to this?
Note - Cannot use EMR or AWS Glue
Assuming you want to keep things easy within the AWS environment, and not using Spark (Glue / EMR), you could use AWS Athena in the following way:
Let's say your parquet files are located in S3://bucket/parquet/.
You can create a table in the Data Catalog (i.e. using Athena or a Glue Crawler), pointing to that parquet location. For example, running something like this in the Athena SQL console:
CREATE EXTERNAL TABLE parquet_table (
    col_1 string,
    ...
    col_100 string)
PARTITIONED BY (date string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
LOCATION 's3://bucket/parquet/';
Once you can query your parquet_table table, which will be reading parquet files, you should be able to create the CSV files in the following way, using Athena too and choosing only the 4 columns you're interested in:
CREATE TABLE csv_table
WITH (
    format = 'TEXTFILE',
    field_delimiter = ',',
    external_location = 's3://bucket/csv/'
)
AS SELECT col_1, col_2, col_3, col_4
FROM parquet_table;
After this, you can actually drop the temporary csv_table and just use the CSV files under s3://bucket/csv/, and do more with them, for example by having an S3-triggered Lambda function do something else.
Remember that all of this can be achieved from Lambda, interacting with Athena (example here). Also bear in mind that Athena has an ODBC connector and PyAthena to use it from Python, among other options, so using Athena through Lambda or the AWS Console is not the only choice you have, in case you want to automate this in a different way.
I hope this helps.
Additional edit, on Sept 25th, 2019:
Answering your question about doing this in Pandas: I think the best way would be using Glue Python Shell, but you mentioned you didn't want to use it. So, if you decide to, here is a basic example of how to:
import sys

import boto3
import pandas as pd
from awsglue.utils import getResolvedOptions

args = getResolvedOptions(sys.argv,
                          ['region',
                           's3_bucket',
                           's3_input_folder',
                           's3_output_folder'])

## Variables used for now. Job input parameters to be used.
s3Bucket = args['s3_bucket']
s3InputFolderKey = args['s3_input_folder']
s3OutputFolderKey = args['s3_output_folder']

## AWS job settings
s3_resource = boto3.resource('s3')
s3_client = boto3.client('s3')
s3_bucket = s3_resource.Bucket(s3Bucket)

for s3_object in s3_bucket.objects.filter(Prefix=s3InputFolderKey):
    s3_key = s3_object.key
    s3_file = s3_client.get_object(Bucket=s3Bucket, Key=s3_key)
    df = pd.read_csv(s3_file['Body'], sep=';')

    # partKey_variable, year_variable, month_variable, day_variable and
    # s3_file_name must be defined for your own partitioning scheme
    partitioned_path = 'partKey={}/year={}/month={}/day={}'.format(
        partKey_variable, year_variable, month_variable, day_variable)
    s3_output_file = '{}/{}/{}'.format(s3OutputFolderKey, partitioned_path, s3_file_name)

    # Writing the new dataset to S3 as CSV
    put_response = s3_resource.Object(s3Bucket, s3_output_file).put(
        Body=df.to_csv(index=False))
Carlos.
It all depends on your business requirement and what sort of action you want to take, i.e. an asynchronous or synchronous call.
You can trigger a lambda github example on an s3 bucket asynchronously when a parquet file arrives in the specified bucket. aws s3 doc
You can also configure the s3 service to send a notification to SNS or SQS when an object is added to or removed from the bucket, which in turn can invoke a lambda to process the file. Triggering a Notification.
You can run a lambda asynchronously every 5 minutes by scheduling AWS CloudWatch Events. The finest resolution using a cron expression is a minute.
Invoke a lambda synchronously over HTTPS (REST API endpoint) using API Gateway.
Also worth checking how big your Parquet file is, as lambda can run for a max of 15 min, i.e. 900 sec.
Worth checking this page as well Using AWS Lambda with Other Services
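For the scheduled option above, an EventBridge (CloudWatch Events) schedule expression for "every 5 minutes" can be written in either of the two supported forms:

```
rate(5 minutes)
cron(0/5 * * * ? *)
```

Either form is passed as the ScheduleExpression of the rule that triggers the lambda.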
It is also worth taking a look at CTAS queries in Athena, which were added recently: https://docs.aws.amazon.com/athena/latest/ug/ctas-examples.html
We can store the query results in a different format using CTAS.

AWS Lambda Nodejs: Get all objects created in the last 24hours from a S3 bucket

I have a requirement whereby I need to convert all the JSON files in my bucket into one newline-delimited JSON file for a 3rd party to consume. However, I need to make sure that each newly created newline-delimited JSON file only includes files that were received in the last 24 hours, to avoid picking up the same files over and over again. Can this be done inside the s3.getObject(getParams, function(err, data)) callback? Any advice regarding a different approach is appreciated.
Thank you
You could try the S3 ListObjects operation and filter the result by the LastModified metadata field. For new objects, the LastModified attribute will contain the time the file was created; for changed files, the time of the last modification.
https://docs.aws.amazon.com/AWSJavaScriptSDK/latest/AWS/S3.html#listObjectsV2-property
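Once the recent objects are identified, combining their parsed JSON contents into one newline-delimited JSON file can be sketched as follows (shown in Python for illustration, although the question's Lambda is Node.js; the format is the same either way):

```python
import json

def to_ndjson(records):
    """Serialize a list of dicts as newline-delimited JSON: one object per line."""
    return "\n".join(json.dumps(record) for record in records) + "\n"

print(to_ndjson([{"id": 1}, {"id": 2}]), end="")
# {"id": 1}
# {"id": 2}
```

Each source file's parsed body becomes one record; the resulting string is uploaded back to S3 as the combined file for the 3rd party.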
There is a more complicated approach, using Amazon Athena with AWS Glue services, but this requires to modify your S3 Object keys to split into partitions, where partitions will be the key of date-time.
For example:
s3://bucket/reports/date=2019-08-28/report1.json
s3://bucket/reports/date=2019-08-28/report2.json
s3://bucket/reports/date=2019-08-28/report3.json
s3://bucket/reports/date=2019-08-29/report1.json
This approach can be implemented in two ways, depending on your file schema. If all your JSON files have the same format/properties/schema, you can create a Glue Table, add the root reports path as a source for this table, add the date partition value (2019-08-28), and query the data with Amazon Athena using a regular SELECT * FROM reports WHERE date='2019-08-28'. If not, create a Glue crawler with a JSON classifier, which will populate your tables, and then use Athena in the same way to query the data into a combined JSON file.
https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-python-samples-legislators.html

Running Python code (Boto3) remotely (AWS)

I have code that moves items from one s3 bucket to another. I am running it locally on my computer. However, it will take a long time to finish running as there are many items in the bucket.
import boto3

# Get the S3 resource
s3 = boto3.resource('s3')

# Get references to the buckets
src = s3.Bucket('src')
dst = s3.Bucket('dst')

# Iterate through the items in the source bucket
for item in src.objects.all():
    # Identify the source object (this does not copy anything yet)
    copy_source = {
        'Bucket': 'src',
        'Key': item.key
    }
    # Place a copy of the item in the destination bucket
    dst.copy(copy_source, 'Images/' + item.key)
Is there any way I can run this code remotely so that I would not have to monitor it? I have tried AWS Lambda, but it has a maximum run time of 15 minutes. Is there something like that I could use, but for a longer time?
You could use a Data Pipeline.
A data pipeline spawns an EC2 instance where you can run your job.
You can schedule the pipeline to run as often as every 15 minutes (but not more frequently).
There is also the option to create a pipeline that you can run on demand.
It also offers a console where you can view the jobs and their outcome and have the opportunity to rerun failed jobs.
For this kind of activity you should probably use this:
https://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-object-shellcommandactivity.html
Another option is to just start an EC2 instance, run your job, and then stop it.
