I have written the python boto3 code to take the ec2 inventory in AWS and in that same code, I am modifying to take multiple AWS accounts ec2 inventory list into csv but in that I am getting the ec2 output details only for last value. someone help me with the below script to generate multiple AWS ec2 inventory to csv.
import boto3
import csv
profiles = ['dev','stag']
for name in profiles:
aws_mag_con=boto3.session.Session(profile_name=name)
ec2_con_re=aws_mag_con.resource(service_name="ec2",region_name="ap-southeast-1")
cnt=1
csv_ob=open("inventory_info.csv","w",newline='')
csv_w=csv.writer(csv_ob)
csv_w.writerow(["S_NO","Instance_Id",'Instance_Type','Architecture','LaunchTime','Privat_Ip'])
for each in ec2_con_re.instances.all():
print(cnt,each,each.instance_id,each.instance_type,each.architecture,each.launch_time.strftime("%Y-%m-%d"),each.private_ip_address)
csv_w.writerow([cnt,each.instance_id,each.instance_type,each.architecture,each.launch_time.strftime("%Y-%m-%d"),each.private_ip_address])
cnt+=1
csv_ob.close()
above script I am getting the output of stag aws account only.
This is because your indentation is incorrect. The loop only accounts for the first line and everything else will be executed for the last element of profiles (when the for loop finishes). It should be:
import boto3
import csv
profiles = ['dev','stag']
for name in profiles:
aws_mag_con=boto3.session.Session(profile_name=name)
ec2_con_re=aws_mag_con.resource(service_name="ec2",region_name="ap-southeast-1")
cnt=1
csv_ob=open("inventory_info.csv","w",newline='')
csv_w=csv.writer(csv_ob)
csv_w.writerow(["S_NO","Instance_Id",'Instance_Type','Architecture','LaunchTime','Privat_Ip'])
for each in ec2_con_re.instances.all():
print(cnt,each,each.instance_id,each.instance_type,each.architecture,each.launch_time.strftime("%Y-%m-%d"),each.private_ip_address)
csv_w.writerow([cnt,each.instance_id,each.instance_type,each.architecture,each.launch_time.strftime("%Y-%m-%d"),each.private_ip_address])
cnt+=1
csv_ob.close()
Related
I need to list all these ( ( DBSnapshotIdentifier , DBInstanceIdentifier, AvailabilityZone , SnapshotType , Encrypted , SnapshotCreateTime) properties of both automated and manual snapshots of all the available clusters and import this data into a spreadsheet.
I tried aws rds describe-db-snapshots this command but I need to get those properties alone and also import those to a spreadsheet
Use the regular cli command and save it into a json file. aws rds describe-db-snapshots > filename.json and then convert JSON to CSV via the Command Line Using JQ
#Im working on below script to pull the information of AWS ec2 instances with details.
#What it does now is that it pull the information that I am after and create a csv file in the same patch of the script. I am working on expanding this to make it a Lambda function to pull down the information from all AWS accounts and instead of creating CSV on local machine push the data (CSV) format to S3 location.
#Now challenge ahead that I need help on how to modify the script to be able to write to S3 bucket instead of writing to the local drive?
import boto3
import csv
from pprint import pprint
session = boto3.Session(profile_name='#ProfileName')
ec2_re=session.resource(service_name="ec2",region_name="ap-southeast-2")
ec2_cli=session.client(service_name="ec2")
fo=open('EC2_Details.csv','w',newline='')
data_obj=csv.writer(fo) data_obj.writerow(["InstanceID","InstanceType","InstanceName","InstanceLunchTime","Instance_Private_IP","InstanceState","InstanceTags"])
cnt=1
response = ec2_cli.describe_instances()
for reservation in response["Reservations"]:
for instance in reservation["Instances"]:
data_obj.writerow([instance["InstanceId"], instance["InstanceType"], instance["KeyName"], instance["LaunchTime"], instance["PrivateIpAddress"], instance["State"]["Name"], instance["Tags"]])
cnt+=1
fo.close()
Thank you
Trivial way, you upload the file you created on disk.
There are advanced way that avoid writing on disk, you can find them on SO.
import boto3
import csv
from pprint import pprint
#Creating Session With Boto3.
session = boto3.Session(
profile_name="your profile",
)
session = boto3.Session(profile_name='#ProfileName')
ec2_re=session.resource(service_name="ec2",region_name="ap-southeast-2")
ec2_cli=session.client(service_name="ec2")
fo=open('EC2_Details.csv','w',newline='')
data_obj=csv.writer(fo) data_obj.writerow(["InstanceID","InstanceType","InstanceName","InstanceLunchTime","Instance_Private_IP","InstanceState","InstanceTags"])
cnt=1
response = ec2_cli.describe_instances()
for reservation in response["Reservations"]:
for instance in reservation["Instances"]:
data_obj.writerow([instance["InstanceId"], instance["InstanceType"], instance["KeyName"], instance["LaunchTime"], instance["PrivateIpAddress"], instance["State"]["Name"], instance["Tags"]])
cnt+=1
fo.close()
#Creating S3 Resource From the Session.
s3 = session.resource('s3')
bucket = s3.Bucket("your-bucket-name")
bucket.upload_file('EC2_Details.csv')
I am writing a python3 lambda function which needs to return all of the files that were uploaded to an S3 bucket in the past 30 days from the time that the function is ran.
How should I approach this? Ideally, I want to only iterate through the files from the past 30 days and nothing else - there are thousands upon thousands of files in the S3 bucket that I am iterating through, and maybe 100 max will be updated/uploaded per month. It would be very inefficient to have to iterate through every file and compare dates like that. There is also a 29 second time limit for AWS API gateway.
Any help would be greatly appreciated. Thanks!
You will need to iterate through the list of objects (sample code: List s3 buckets with its size in csv format) and compare the date within the Python code (sample code: Get day old filepaths from s3 bucket).
There is no filter when listing objects (aside from Prefix).
An alternative is to use Amazon S3 Inventory, which can provide a daily CSV file listing the contents of a bucket. You could parse that CSV instead of the listing objects.
A more extreme option is to keep a separate database of objects, which would need to be updated whenever objects are added/deleted. This could be done via Amazon S3 Events that trigger an AWS Lambda function. Lots of work, though.
I can't give you an 100% answer, since you have asked for the upload date, but if you can live with the 'last modified' value, this code snippet should do the job:
import boto3
import datetime
paginator = boto3.resource('s3').meta.client.get_paginator('list_objects')
date = datetime.datetime.now() - datetime.timedelta(30)
filtered_files = (page['Key'] for page in paginator.paginate(Bucket="bucketname").search(f"Contents[?to_string(LastModified)>='\"{date}\"']"))
For filterting I used JMESPath
From the architect perspective
The bottle neck is that whether if you can iterate all objects with in 30 seconds. If natively there are too many files, there are a few more options you can use:
Create a aws lambda function that triggered by S3:PutObject event, and store the S3 key, and last_modified_at information into Dynamodb (A AWS Key Value NoSQL database). Then you can easily use Dynamodb to filter the S3 key and retrieve those S3 object accordingly.
Crreate a aws lambda function that triggered by S3:PutObject event, and move the file to a partitioned S3 Key schema location such as s3://bucket/datalake/year=${year}/month=${month}/day=${day}/your-file.csv. Then you can easily use the partition information to locate the subset of your objects, which fits in 30 seconds hard limit.
From programming perspective
Here's the code snippet solves your problem using this library s3pathlib:
from datetime import datetime, timedelta
from s3pathlib import S3path
# define a folder
p_dir = S3Path("bucket/my-folder/")
# find one month ago datetime
now = datetime.utcnow()
one_month_ago = now - timedelta(days=30)
# filter by last modified
for p in p_bucket.iter_objects().filter(
# any Filterable Attribute can be used for filtering
S3Path.last_modified_at >= one_month_ago
):
# do whatever you like
print(p.console_url) # click link to open it in console, inspect
If you want to use other S3Path attributes for filtering, and use other comparators, or even define your custom filter, you can follow this document
I have a requirement -
1. To convert parquet file present in s3 to csv format and place it back in s3. The process should exclude the use of EMR.
2. Parquet file has more than 100 cols i need to just extract 4 cols from that parquet file and create the csv in s3.
Does anyone has any solution to this?
Note - Cannot use EMR or AWS Glue
Assuming you want to keep things easy within the AWS environment, and not using Spark (Glue / EMR), you could use AWS Athena in the following way:
Let's say your parquet files are located in S3://bucket/parquet/.
You can create a table in the Data Catalog (i.e. using Athena or a Glue Crawler), pointing to that parquet location. For example, running something like this in the Athena SQL console:
CREATE EXTERNAL TABLE parquet_table (
col_1 string,
...
col_100 string)
PARTITIONED BY (date string)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
LOCATION 's3://bucket/parquet/' ;
Once you can query your parquet_table table, which will be reading parquet files, you should be able to create the CSV files in the following way, using Athena too and choosing only the 4 columns you're interested in:
CREATE TABLE csv_table
WITH (
format = 'TEXTFILE',
field_delimiter = ',',
external_location = 's3://bucket/csv/'
)
AS SELECT col_1, col_2, col_3, col_4
FROM parquet_table ;
After this, you can actually drop the csv temporary table and only use the CSV files, under s3://bucket/csv/, and do more, for example by having an S3-trigger Lambda function and doing something else or similar.
Remember that all this can be achieved from Lambda, interacting with Athena (example here) and also, bear in mind it has an ODBC connector and PyAthena to use it from Python, or more options, so using Athena through Lambda or the AWS Console is not the only option you have, in case you want to automate this in a different way.
I hope this helps.
Additional edit, on Sept 25th, 2019:
Answering to your question, about doing this in Pandas, I think the best way would be using Glue Python Shell, but you mentioned you didn't want to use it. So, if you decide to, here it is a basic example of how to:
import pandas as pd
import boto3
from awsglue.utils import getResolvedOptions
from boto3.dynamodb.conditions import Key, Attr
args = getResolvedOptions(sys.argv,
['region',
's3_bucket',
's3_input_folder',
's3_output_folder'])
## #params and #variables: [JOB_NAME]
## Variables used for now. Job input parameters to be used.
s3Bucket = args['s3_bucket']
s3InputFolderKey = args['s3_input_folder']
s3OutputFolderKey = args['s3_output_folder']
## aws Job Settings
s3_resource = boto3.resource('s3')
s3_client = boto3.client('s3')
s3_bucket = s3_resource.Bucket(s3Bucket)
for s3_object in s3_bucket.objects.filter(Prefix=s3InputFolderKey):
s3_key = s3_object.key
s3_file = s3_client.get_object(Bucket=s3Bucket, Key=s3_key)
df = pd.read_csv(s3_file['Body'], sep = ';')
partitioned_path = 'partKey={}/year={}/month={}/day={}'.format(partKey_variable,year_variable,month_variable,day_variable)
s3_output_file = '{}/{}/{}'.format(s3OutputFolderKey,partitioned_path,s3_file_name)
# Writing file to S3 the new dataset:
put_response = s3_resource.Object(s3Bucket,s3_output_file).put(Body=df)
Carlos.
It all depends upon your business requirement, what sort of action you want to take like Asynchronously or Synchronous call.
You can trigger a lambda github example on s3 bucket asynchronously when a parquet file arrives in the specified bucket. aws s3 doc
You can configure s3 service to send a notification to SNS or SQS as well when an object is added/removed form the bucket which in turn can then invoke a lambda to process the file Triggering a Notification.
You can run a lambda Asynchronously every 5 minutes by scheduling the aws cloudwatch events The finest resolution using a cron expression is a minute.
Invoke a lambda Synchronously over HTTPS (REST API endpoint) using API Gateway.
Also worth checking how big is your Parquet file as lambda can run max 15 min i.e. 900 sec.
Worth checking this page as well Using AWS Lambda with Other Services
It is worth taking a look at CTAS queries in Athena recently: https://docs.aws.amazon.com/athena/latest/ug/ctas-examples.html
We can store the query results in a different format using CTAS.
I've been tasked with creating a script to delete all the current S3 buckets and create some new ones. This is something that they want to do on an ongoing basis. So far I have all the preliminaries:
import boto
from boto.s3.key import Key
import boto.s3.connection
from __future__ import print_function
conn = boto.s3.connect_to_region('us-east-1',
aws_access_key_id='my_access_key', aws_secret_access_key='my_secret_key')
ls = conn.get_all_buckets()
print(*ls,sep='\n')
This gives me a list of all the current buckets. Now if I want to remove the buckets my understanding is that they have to be emptied first, using a method something like:
db = conn.get_bucket('bucket_name')
for key in db.list():
key.delete()
And then I could do:
conn.delete_bucket('bucket_name')
I want to set it up such that it pulls each bucket name from 'ls', but I'm not sure how to go about this. I tried this:
for i in ls:
db = conn.get_bucket('i')
for key in db.list():
key.delete()
But I get an error "S3ResponseError: 400 Bad Request". I'm getting a sneaking suspicion that it's not pulling the separate elements from the list. Do I maybe have to get data frames involved? As far as I know, boto doesn't have an option to just nuke all the folders outright.
I'd recommend using boto3
The following should do the trick, though it's untested (I don't want to delete all my buckets :))
import boto3
client = session.client('s3')
s3 = boto3.resource('s3')
buckets = client.list_buckets()
for bucket in buckets['Buckets']:
s3_bucket = s3.Bucket(bucket['Name'])
s3_bucket.objects.all().delete()
s3_bucket.delete()