I wrote a script to count the number of objects in each S3 bucket and the total size of each bucket. The code works when I run it against a few test buckets, but times out when I include all of the production buckets, which contain thousands of objects.
import boto3

s3 = boto3.resource('s3')
bucket_size = {}  # bucket name -> [object count, total bytes]
skip_list = ('some-test-bucket',)  # trailing comma so this is a tuple, not a string

for bu in s3.buckets.all():
    if bu.name not in skip_list:
        bucket_size[bu.name] = [0, 0]
        print(bu.name)
        for obj in bu.objects.all():
            bucket_size[bu.name][0] += 1
            bucket_size[bu.name][1] += obj.size

print("{0:30} {1:15} {2:10}".format("bucket", "count", "size"))
for name, (count, size) in bucket_size.items():
    print("{0:30} {1:15} {2:10}".format(name, count, size))
It starts to run, moves along, and then hangs on certain buckets with this error:
botocore.exceptions.ConnectTimeoutError: Connect timeout on endpoint URL:
Is there no quick way to get metadata like this? In a sense this script does it the hard way - counting every object.
So I'm asking whether there's a better script, not why it times out. When I click through some of the buckets that timed out, I noticed there are some .gz files in them; I don't know why that would matter.
Of course I looked at the documentation, but I find it hard to pull meaningful, actionable information out of it:
https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/s3.html
If you just wish to know the number of objects in a bucket and its total size, you can use metrics from Amazon CloudWatch.
From Monitoring Metrics with Amazon CloudWatch - Amazon Simple Storage Service:
BucketSizeBytes
The amount of data in bytes stored in a bucket in the STANDARD storage class, INTELLIGENT_TIERING storage class, Standard-Infrequent Access (STANDARD_IA) storage class, One Zone-Infrequent Access (ONEZONE_IA) storage class, Reduced Redundancy Storage (RRS) class, Deep Archive Storage (DEEP_ARCHIVE) class, or Glacier (GLACIER) storage class. This value is calculated by summing the size of all objects in the bucket (both current and noncurrent objects), including the size of all parts for all incomplete multipart uploads to the bucket.
NumberOfObjects
The total number of objects stored in a bucket for all storage classes except for the GLACIER storage class. This value is calculated by counting all objects in the bucket (both current and noncurrent objects) and the total number of parts for all incomplete multipart uploads to the bucket.
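For example, a minimal sketch of pulling those two metrics with boto3 (this assumes you only care about the STANDARD class for the size metric, since BucketSizeBytes is reported per storage class; these daily metrics can also lag by a day or so):
import boto3
from datetime import datetime, timedelta, timezone

cloudwatch = boto3.client('cloudwatch')
s3 = boto3.resource('s3')

def bucket_metric(bucket_name, metric_name, storage_type):
    # Daily S3 storage metrics live in the AWS/S3 namespace, keyed by bucket and storage type
    stats = cloudwatch.get_metric_statistics(
        Namespace='AWS/S3',
        MetricName=metric_name,
        Dimensions=[
            {'Name': 'BucketName', 'Value': bucket_name},
            {'Name': 'StorageType', 'Value': storage_type},
        ],
        StartTime=datetime.now(timezone.utc) - timedelta(days=2),
        EndTime=datetime.now(timezone.utc),
        Period=86400,
        Statistics=['Average'],
    )
    points = sorted(stats['Datapoints'], key=lambda p: p['Timestamp'])
    return points[-1]['Average'] if points else 0

print("{0:30} {1:15} {2:15}".format("bucket", "count", "size"))
for bucket in s3.buckets.all():
    count = bucket_metric(bucket.name, 'NumberOfObjects', 'AllStorageTypes')
    size = bucket_metric(bucket.name, 'BucketSizeBytes', 'StandardStorage')
    print("{0:30} {1:15.0f} {2:15.0f}".format(bucket.name, count, size))
This avoids listing any objects at all, so it runs in a few API calls per bucket regardless of how many objects the buckets hold.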
Related
I'm trying to copy some files from one bucket to another (same region) and getting a speed of around 315 MB/s. However, I'm running this in Lambda, which has a 15-minute timeout limit, so bigger files hit the timeout.
Below is the code snippet I'm using (in Python). Is there any other way I can speed it up? Any input is welcome.
s3_client = boto3.client(
's3',
aws_access_key_id=access_key,
aws_secret_access_key=secret_key,
aws_session_token=session_token,
config=Config(signature_version='s3v4')
)
s3_client.copy(bucket_pair["input"],
               bucket_pair["output"]["Bucket"],
               bucket_pair["output"]["Key"])
I saw many posts about passing a chunk size and similar options, but I don't see them in ALLOWED_COPY_ARGS. Thanks.
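For what it's worth, the chunk-size and concurrency knobs aren't ExtraArgs (so they never show up in ALLOWED_COPY_ARGS); they go on a TransferConfig passed via the Config= parameter of the managed copy. A sketch reusing the s3_client and bucket_pair from the snippet above, with purely illustrative values:
from boto3.s3.transfer import TransferConfig

transfer_config = TransferConfig(
    multipart_threshold=64 * 1024 * 1024,   # switch to multipart copy above 64 MB
    multipart_chunksize=256 * 1024 * 1024,  # size of each copied part
    max_concurrency=20,                     # parts copied in parallel
)

s3_client.copy(
    bucket_pair["input"],
    bucket_pair["output"]["Bucket"],
    bucket_pair["output"]["Key"],
    Config=transfer_config,
)
Whether this gets a single large copy under 15 minutes depends on the object size, so the approaches below may still be needed.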
You can use a Step Functions state machine to iterate over all objects and copy them. To increase throughput you can use a Map state:
https://docs.aws.amazon.com/step-functions/latest/dg/amazon-states-language-map-state.html
If you don't want to use Step Functions, you can use one producer Lambda to write all object keys into an SQS queue, and a consumer Lambda that reads from the queue and copies the objects to the respective target.
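A rough sketch of that producer/consumer pattern (the queue URL and bucket names are placeholders, and the SQS trigger wiring and IAM permissions are left out):
import json
import boto3

s3 = boto3.client('s3')
sqs = boto3.client('sqs')

QUEUE_URL = 'https://sqs.us-east-1.amazonaws.com/123456789012/copy-queue'  # placeholder
SOURCE_BUCKET = 'source-bucket'   # placeholder
TARGET_BUCKET = 'target-bucket'   # placeholder

def producer_handler(event, context):
    """Enqueue every key in the source bucket."""
    paginator = s3.get_paginator('list_objects_v2')
    for page in paginator.paginate(Bucket=SOURCE_BUCKET):
        for obj in page.get('Contents', []):
            sqs.send_message(QueueUrl=QUEUE_URL,
                             MessageBody=json.dumps({'Key': obj['Key']}))

def consumer_handler(event, context):
    """Triggered by SQS: copy each key to the target bucket."""
    for record in event['Records']:
        key = json.loads(record['body'])['Key']
        s3.copy({'Bucket': SOURCE_BUCKET, 'Key': key}, TARGET_BUCKET, key)
Because each consumer invocation only handles a small batch of keys, no single Lambda has to stay under the 15-minute limit for the whole copy job.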
A different option would be to use S3 object replication
https://docs.aws.amazon.com/AmazonS3/latest/userguide/replication.html
But I'm not sure whether that fits your use case.
I have a vast dataset comprising ~2.4 million JSON files, each of which contains several records. I've created a simple apache-beam data pipeline (shown below) that follows these steps:
1. Read data from a GCS bucket using a glob pattern.
2. Extract records from the JSON data.
3. Transform the data: convert dictionaries to JSON strings, parse timestamps, and so on.
4. Write to BigQuery.
# Imports needed by this snippet
import apache_beam as beam
from apache_beam.io import ReadFromText, WriteToBigQuery
from apache_beam.options.pipeline_options import PipelineOptions, SetupOptions

# Pipeline
pipeline_options = PipelineOptions(pipeline_args)
pipeline_options.view_as(SetupOptions).save_main_session = save_main_session
p = beam.Pipeline(options=pipeline_options)

# Read
files = p | 'get_data' >> ReadFromText(files_pattern)

# Transform
output = (files
          | 'extract_records' >> beam.ParDo(ExtractRecordsFn())
          | 'transform_data' >> beam.ParDo(TransformDataFn()))

# Write
output | 'write_data' >> WriteToBigQuery(table=known_args.table,
                                         create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER,
                                         write_disposition=beam.io.BigQueryDisposition.WRITE_EMPTY,
                                         insert_retry_strategy='RETRY_ON_TRANSIENT_ERROR',
                                         temp_file_format='NEWLINE_DELIMITED_JSON')

# Run
result = p.run()
result.wait_until_finish()
I've tested this pipeline with a minimal sample dataset and it works as expected. But I'm doubtful about the optimal use of BigQuery resources and quotas. The batch load quotas are very restrictive, and due to the massive number of files to parse and load, I want to know if I'm missing some settings that could guarantee the pipeline will respect the quotas and run optimally. I don't want to exceed the quotas, as I am running other loads into BigQuery in the same project.
I haven't finished understanding some parameters of the WriteToBigQuery() transform, specifically batch_size, max_file_size, and max_files_per_bundle, or whether they could help optimize the load jobs to BigQuery. Could you help me with this?
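For reference, a sketch of how those three parameters would be passed to the existing write step (reusing output and known_args from the pipeline above; the values and the comments reflect my reading of the Beam docs, so treat them as assumptions rather than tuning advice):
import apache_beam as beam
from apache_beam.io import WriteToBigQuery

output | 'write_data' >> WriteToBigQuery(
    table=known_args.table,
    create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER,
    write_disposition=beam.io.BigQueryDisposition.WRITE_EMPTY,
    insert_retry_strategy='RETRY_ON_TRANSIENT_ERROR',
    temp_file_format='NEWLINE_DELIMITED_JSON',
    batch_size=500,                   # rows per request on the streaming-insert path
    max_file_size=256 * 1024 * 1024,  # bytes per temp file handed to a load job
    max_files_per_bundle=20)          # temp files written concurrently per worker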
Update
I'm not only concerned about BigQuery quotas; the GCP quotas of other resources used by this pipeline are also a concern.
I tried to run my simple pipeline over the target data (~2.4 million files), but I'm receiving the following warning message:
Project [my-project] has insufficient quota(s) to execute this workflow with 1 instances in region us-central1. Quota summary (required/available): 1/16 instances, 1/16 CPUs, 250/2096 disk GB, 0/500 SSD disk GB, 1/99 instance groups, 1/49 managed instance groups, 1/99 instance templates, 1/0 in-use IP addresses. Please see https://cloud.google.com/compute/docs/resource-quotas about requesting more quota.
I don't completely understand that message. The process activated 8 workers successfully and is using 8 of the 8 available in-use IP addresses. Is this a problem? How could I fix it?
If you're worried about load job quotas, you can try streaming data into BigQuery, which comes with a less restrictive quota policy.
To achieve what you want to do, you can try the Google provided templates or just refer to their code.
Cloud Storage Text to BigQuery (Stream) [code]
Cloud Storage Text to BigQuery (Batch)
And last but not least, more detailed information can be found in the documentation for the Google BigQuery I/O connector.
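As a sketch of the streaming suggestion above, the write step from the question could switch methods like this (reusing output and known_args from the pipeline in the question; note that, as I understand it, streaming inserts only append rows, so WRITE_EMPTY becomes WRITE_APPEND, and streaming inserts are billed differently from load jobs):
import apache_beam as beam
from apache_beam.io import WriteToBigQuery

output | 'write_data' >> WriteToBigQuery(
    table=known_args.table,
    create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER,
    write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
    insert_retry_strategy='RETRY_ON_TRANSIENT_ERROR',
    method=WriteToBigQuery.Method.STREAMING_INSERTS,
    batch_size=500)  # rows per streaming API request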
I am trying to transform data from CSV to JSON in AWS Lambda (using Python 3). The file is 65 MB, so the function times out before completing the process and the entire execution fails.
I need to know how to handle a case like this, where AWS Lambda should process as much of the data as it can within the timeout period and keep the remaining payload in an S3 bucket.
Below is the transformation code
import json
import boto3
import csv
import os

json_content = {}

def lambda_handler(event, context):
    s3_source = boto3.resource('s3')
    if event:
        fileObj = event['Records'][0]
        fileName = str(fileObj['s3']['object']['key'])
        eventTime = fileObj['eventTime']
        fileObject = s3_source.Object('inputs3', fileName)

        # Read the whole object into memory and split it into lines for the CSV reader
        data = fileObject.get()['Body'].read().decode('utf-8-sig').splitlines()
        arr = []
        csvreader = csv.DictReader(data)
        newFile = getFile_extensionName(fileName, extension_type)  # helper and variable defined elsewhere in the original code
        for row in csvreader:
            arr.append(dict(row))
        json_content['Employees'] = arr
        print("Json Content is", json_content)

        s3_source.Object('s3-output', "output.json").put(
            Body=bytes(json.dumps(json_content).encode('utf-8-sig')))
        print("File Uploaded")

        return {
            'statusCode': 200,
            'fileObject': eventTime,
        }
AWS Lambda function configuration:
Memory: 640 MB
Timeout: 15 min
Since your function is timing out, you only have two options:
Increase the amount of assigned memory. This will also increase the amount of CPU assigned to the function, so it should run faster. However, this might not be enough to avoid the timeout.
or
Don't use AWS Lambda.
The most common use-case for AWS Lambda functions is for small microservices, sometimes only running for a few seconds or even a fraction of a second.
If your use-case runs for over 15 minutes, then it probably isn't a good candidate for AWS Lambda.
You can look at alternatives such as running your code on an Amazon EC2 instance or using a Fargate container.
It looks like your function is running out of memory:
Memory Size: 1792 MB Max Memory Used: 1792
Also, it only ran for 12 minutes:
Duration: 723205.42 ms
(723 seconds ≈ 12 minutes)
Therefore, you should either:
Increase memory (but this costs more), or
Change your program so that, instead of accumulating the JSON string in memory, it continually writes the output to a local file under /tmp/ and then uploads the resulting file to Amazon S3 (see the sketch after this list)
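A minimal sketch of that second option, reusing the bucket names from the question (iter_lines() streams the object body instead of reading it all into memory, and rows are written to /tmp as they are converted):
import csv
import json
import boto3

def lambda_handler(event, context):
    s3 = boto3.resource('s3')
    record = event['Records'][0]
    key = record['s3']['object']['key']

    body = s3.Object('inputs3', key).get()['Body']
    # Stream the CSV one line at a time rather than loading the whole file
    lines = (line.decode('utf-8-sig') for line in body.iter_lines())
    reader = csv.DictReader(lines)

    out_path = '/tmp/output.json'
    with open(out_path, 'w', encoding='utf-8') as out:
        out.write('{"Employees": [')
        for i, row in enumerate(reader):
            if i:
                out.write(',')
            out.write(json.dumps(row))  # one row at a time, never the full list
        out.write(']}')

    s3.Object('s3-output', 'output.json').upload_file(out_path)
    return {'statusCode': 200}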
However, the maximum disk storage space provided to an AWS Lambda function is 512 MB, and it appears that your output file is bigger than this. Therefore, increasing memory would be the only option. The increased expense of assigning more resources to the Lambda function suggests that you might be better off using EC2 or Fargate rather than Lambda.
I need to know how many write units a certain object will consume before inserting it to DynamoDB.
There are some docs from AWS explaining how to estimate write units; however, I would like to check whether there is a built-in function in boto3 before writing my own.
import boto3
some_object = { ... }
dynamo = boto3.resource('dynamodb')
# get kb size my object will consume
size = dynamo.get_size_of(some_object)
# or even how many write units
writers = dynamo.get_writers(some_object)
There is nothing in boto3 for calculating the size of items you're going to write. We sometimes use the code from this blog post to check Python data structures before writing them.
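If a rough local estimate is enough, something along these lines approximates the documented sizing rules (attribute names and strings count their UTF-8 bytes, numbers take roughly one byte per two significant digits plus one, and standard writes consume 1 WCU per 1 KB). This is an approximation written for this answer, not an official boto3 or DynamoDB API:
import math
from decimal import Decimal

def _value_size(value):
    if isinstance(value, str):
        return len(value.encode('utf-8'))
    if isinstance(value, bool) or value is None:   # bool before int: bool is a subclass of int
        return 1
    if isinstance(value, (int, float, Decimal)):
        # Roughly 1 byte per two significant digits, plus 1 byte
        digits = len(str(abs(value)).replace('.', '').strip('0')) or 1
        return math.ceil(digits / 2) + 1
    if isinstance(value, (bytes, bytearray)):
        return len(value)
    if isinstance(value, (list, tuple, set)):
        return 3 + sum(1 + _value_size(v) for v in value)
    if isinstance(value, dict):
        return 3 + sum(1 + len(k.encode('utf-8')) + _value_size(v) for k, v in value.items())
    raise TypeError('Unsupported type: {}'.format(type(value)))

def estimate_item_size(item):
    """Approximate stored size in bytes of a top-level DynamoDB item (a dict)."""
    return sum(len(k.encode('utf-8')) + _value_size(v) for k, v in item.items())

def estimate_write_units(item):
    """Standard writes consume 1 WCU per 1 KB, rounded up."""
    return math.ceil(estimate_item_size(item) / 1024)
Expect the result to differ slightly from what DynamoDB actually bills, but it is usually close enough for capacity planning.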
Background: I am very new to the AWS Management Console, and I just created a very simple AWS Lambda function (Python 3.7) that deletes EBS volume snapshots older than a time limit. I also created a CloudWatch Events rule to trigger the function every hour. This works well for a few volumes, but as volumes are added it causes cost and speed issues.
Question: I am trying to update my EBS snapshot deletion Lambda function to find the rate limit for requests, so I can avoid throttling and build an exponential backoff/retry model around it (making the program scalable no matter how many snapshots there are). I would assume there is a special API call that can help me with this, but I have not been able to find any concrete information online. Any help would be much appreciated! My current code is below:
import boto3
from datetime import datetime, timezone, timedelta

ec2 = boto3.resource('ec2')  # resource, higher level
snapshots = ec2.snapshots.filter(OwnerIds=['self'])  # all snapshots owned by me

def lambda_handler(event, context):
    for i in snapshots:  # for each snapshot
        start_time = i.start_time  # timestamp when the snapshot was initiated
        delete_time = datetime.now(tz=timezone.utc) - timedelta(days=1)  # current UTC time minus 1 day
        if delete_time > start_time:  # snapshot is more than a day old
            i.delete()  # delete that snapshot
            print('Snapshot with Id = {snaps} is deleted'.format(snaps=i.snapshot_id))  # report the deleted ID
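In case it helps, a sketch of one approach: there is no API that reports the throttle rate itself, but botocore's built-in retry modes ('standard' and 'adaptive', set via botocore.config.Config) already perform exponential backoff on throttling errors, and adaptive mode adds client-side rate limiting, so the backoff does not have to be hand-rolled. The rest of the code simply mirrors the existing function:
import boto3
from botocore.config import Config
from datetime import datetime, timezone, timedelta

# 'adaptive' (or 'standard') retry mode makes botocore retry throttled calls
# with exponential backoff; max_attempts caps how many times it tries.
retry_config = Config(retries={'max_attempts': 10, 'mode': 'adaptive'})
ec2 = boto3.resource('ec2', config=retry_config)

def lambda_handler(event, context):
    cutoff = datetime.now(tz=timezone.utc) - timedelta(days=1)
    for snap in ec2.snapshots.filter(OwnerIds=['self']):
        if snap.start_time < cutoff:  # older than one day
            snap.delete()
            print('Snapshot with Id = {} is deleted'.format(snap.snapshot_id))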