How to calculate md5 for file in s3 bucket - python-3.x

I have a requirement to calculate MD5 values for files that are held in S3 buckets. I know that I could download them to an on-prem server and do it there, but I want to keep my on-prem server as small as possible, and some of my S3 files are large (500 MB+). So I have started developing a Python Lambda function to handle this, but I can't figure out how to chunk through the file so I can generate the MD5 value. Here is the code; I look forward to any assistance.
def s3_md5sum(bucket_name, object_key):
    try:
        md5Object = s3object.Object(bucket_name, object_key)
        body = md5Object.get()['Body'].read()
    except ClientError:
        raise
    else:
        md5_obj = hashlib.md5()
        while True:
            buffer = body.read(8096)
            if not buffer:
                break
            md5_obj.update(buffer)
        hash_code = md5_obj.hexdigest()
        md5 = str(hash_code).lower()
        return md5

You can read the file as a stream instead of trying to read the entire file into memory, and then use the hashlib library to build the MD5 from the chunks of the stream. An example of this can be found in this SO question.
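A minimal sketch of that streaming approach, assuming boto3; the StreamingBody returned by get() supports read(n), so the object is hashed in fixed-size chunks without ever being loaded fully (the 8192-byte chunk size is arbitrary):
import hashlib
import boto3
from botocore.exceptions import ClientError

s3 = boto3.resource('s3')

def s3_md5sum(bucket_name, object_key):
    try:
        body = s3.Object(bucket_name, object_key).get()['Body']
    except ClientError:
        raise
    md5_obj = hashlib.md5()
    while True:
        chunk = body.read(8192)  # read the stream chunk by chunk
        if not chunk:
            break
        md5_obj.update(chunk)
    return md5_obj.hexdigest().lower()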

Related

compress .txt file on s3 location to .gz file

I need to compress a .txt file that is in an S3 location to .gz and then upload it to a different S3 bucket. I have written the following code but it's not working as expected:
def upload_gzipped(bucket, key, fp, compressed_fp=None, content_type='text/plain'):
    with gzip.GzipFile(fileobj=compressed_fp, mode='wb') as gz:
        shutil.copyfileobj(fp, gz)
    compressed_fp.seek(0)
    print(compressed_fp)
    bucket.upload_fileobj(
        compressed_fp,
        key,
        {'ContentType': content_type, 'ContentEncoding': 'gzip'})

source_bucket = event['Records'][0]['s3']['bucket']['name']
file_key_name = event['Records'][0]['s3']['object']['key']
response = s3.get_object(Bucket=source_bucket, Key=file_key_name)
original = BytesIO(response['Body'].read())
original.seek(0)
upload_gzipped(source_bucket, file_key_name, original)
Can someone please help here, or suggest any other approach to gzip the file in the S3 location?
It would appear that you are writing an AWS Lambda function.
A simpler program flow would probably be (a sketch follows this list):
Download the file to /tmp/ using s3_client.download_file()
Gzip the file
Upload the file to S3 using s3_client.upload_file()
Delete the files in /tmp/
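A minimal sketch of that flow, assuming the Lambda is triggered by an S3 event and that names like DEST_BUCKET are placeholders for your own configuration:
import gzip
import os
import shutil
import boto3

s3_client = boto3.client('s3')
DEST_BUCKET = 'my-destination-bucket'  # hypothetical target bucket

def lambda_handler(event, context):
    for record in event['Records']:
        bucket = record['s3']['bucket']['name']
        key = record['s3']['object']['key']
        local_path = '/tmp/' + os.path.basename(key)
        gz_path = local_path + '.gz'

        # 1. Download the object to /tmp/
        s3_client.download_file(bucket, key, local_path)

        # 2. Gzip the file on disk
        with open(local_path, 'rb') as f_in, gzip.open(gz_path, 'wb') as f_out:
            shutil.copyfileobj(f_in, f_out)

        # 3. Upload the compressed file to the destination bucket
        s3_client.upload_file(gz_path, DEST_BUCKET, key + '.gz',
                              ExtraArgs={'ContentType': 'text/plain',
                                         'ContentEncoding': 'gzip'})

        # 4. Clean up /tmp/ so repeated invocations do not fill it
        os.remove(local_path)
        os.remove(gz_path)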
Also, please note that the AWS Lambda function might be invoked with multiple objects passed via the event. However, your code currently processes only the first record with event['Records'][0]. The program should loop through these records like this:
for record in event['Records']:
    source_bucket = record['s3']['bucket']['name']
    file_key_name = record['s3']['object']['key']
    ...
Instead of writing the file into your /tmp folder, it might be better to read it into an in-memory buffer, since the /tmp folder has limited storage space.
buffer = BytesIO(file.get()["Body"].read())
For gzipping you can simply use something like this:
gzipped_content = gzip.compress(f_in.read())
destinationbucket.upload_fileobj(io.BytesIO(gzipped_content),
                                 final_file_path,
                                 ExtraArgs={"ContentType": "text/plain"}
                                 )
There's a similar tutorial here for a Lambda function: https://medium.com/p/f7bccf0099c9

Use checksum to verify integrity of uploaded and downloaded files from AWS S3 via Django

Using Django I am trying to upload multiple files to AWS S3. The file size may vary from 500 MB to 2 GB. I need to check the integrity of both uploaded and downloaded files.
I have seen that using the PUT operation I can upload a single object up to 5 GB in size. They also provide a 'ContentMD5' option to verify the file. My questions are:
Should I use the PUT option if I upload a file larger than 1 GB? Generating the MD5 checksum of such a file may exceed system memory. How can I solve this issue, or is there a better solution available for this task?
To download the file with a checksum, AWS has the get_object() function. My questions are:
Is it okay to use this function to download multiple files?
How can I use it to download multiple files with checksums from S3? I looked for examples but there is nothing much. I am currently using the following code to download and serve multiple files as a zip in Django. I want to serve the files as a zip with a checksum to validate the transfer.
s3 = boto3.resource('s3', aws_access_key_id=base.AWS_ACCESS_KEY_ID, aws_secret_access_key=base.AWS_SECRET_ACCESS_KEY)
bucket = s3.Bucket(base.AWS_STORAGE_BUCKET_NAME)
s3_file_path = bucket.objects.filter(Prefix='media/{}/'.format(url.split('/')[-1]))

# set up zip folder
zip_subdir = url.split('/')[-1]
zip_filename = zip_subdir + ".zip"

byte_stream = BytesIO()
zf = ZipFile(byte_stream, "w")
for path in s3_file_path:
    s3_url = f"https://%s.s3.%s.amazonaws.com/%s" % (base.AWS_STORAGE_BUCKET_NAME, base.AWS_S3_REGION_NAME, path.key)
    file_response = requests.get(s3_url)
    if file_response.status_code == 200:
        try:
            tmp = tempfile.NamedTemporaryFile()
            print(tmp.name)
            tmp.name = path.key.split('/')[-1]
            f1 = open(tmp.name, 'wb')
            f1.write(file_response.content)
            f1.close()
            zip_path = os.path.join('/'.join(path.key.split('/')[1:-1]), tmp.name)
            zf.write(tmp.name, zip_path)
        finally:
            os.remove(tmp.name)
zf.close()
response = HttpResponse(byte_stream.getvalue(), content_type="application/x-zip-compressed")
response['Content-Disposition'] = 'attachment; filename=%s' % zip_filename
I am learning about AWS S3 and this is the first time I am using it. I would appreciate any kind of suggestion regarding this problem.
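A minimal sketch of the chunked-MD5 / ContentMD5 idea raised in the question above, assuming boto3 and a hypothetical local file; the digest is computed in fixed-size chunks so memory use stays flat, and S3 rejects the upload if the body does not match the supplied ContentMD5:
import base64
import hashlib
import boto3

def file_md5(path, chunk_size=8 * 1024 * 1024):
    # Hash in 8 MB chunks so even a 2 GB file never sits fully in memory.
    md5 = hashlib.md5()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(chunk_size), b''):
            md5.update(chunk)
    return md5

s3 = boto3.client('s3')
digest = file_md5('large_upload.bin')  # hypothetical file
with open('large_upload.bin', 'rb') as body:
    s3.put_object(Bucket='my-bucket', Key='media/large_upload.bin',
                  Body=body,
                  ContentMD5=base64.b64encode(digest.digest()).decode())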

How do we send a file (accepted as part of a multipart request) to MinIO object storage in Python without saving the file in local storage?

I am trying to write an API in Python (Falcon) that accepts a file from a multipart-form parameter and puts it into MinIO object storage. The problem is that I want to send the file to MinIO without saving it in any temp location.
The MinIO Python client has a function with which we can send the file:
`put_object(bucket_name, object_name, data, length)`
where data is the file data and length is the total length of the object.
For more explanation: https://docs.min.io/docs/python-client-api-reference.html#put_object
I am having trouble producing the values for the "data" and "length" arguments of the put_object function.
The file accepted in the API class is of type falcon_multipart.parser.Parser, which cannot be sent to MinIO directly.
I can make it work if I write the file to a temp location, then read it back from there and send it.
Can anyone help me finding a solution to this?
I tried reading the file data from the Parser object and converting it to an io.BytesIO buffer, but it did not work.
def on_post(self, req, resp):
    file = req.get_param('file')
    file_data = file.file.read()
    file_data = io.BytesIO(file_data)
    bucket_name = req.get_param('bucket_name')
    self.upload_file_to_minio(bucket_name, file, file_data)

def upload_file_to_minio(self, bucket_name, file, file_data):
    minioClient = Minio("localhost:9000", access_key='minio', secret_key='minio', secure=False)
    try:
        file_stat = sys.getsizeof(file_data)
        #file_stat = file_data.getbuffer().nbytes
        minioClient.put_object(bucket_name, "SampleFile", file, file_stat)
    except ResponseError as err:
        print(err)
Traceback (most recent call last):
  File "/home/user/.local/lib/python3.6/site-packages/minio/helpers.py", line 382, in is_non_empty_string
    if not input_string.strip():
AttributeError: 'NoneType' object has no attribute 'strip'
A very late answer to your question. As of Falcon 3.0, this should be possible by leveraging the framework's native multipart/form-data support.
There is an example of how to perform the same task with AWS S3: How can I save POSTed files (from a multipart form) directly to AWS S3?
However, as I understand it, MinIO requires either the total length, which is unknown here, or alternatively requires you to wrap the upload as a multipart upload. That should be doable by reading reasonably large (e.g., 8 MiB or similar) chunks into memory and wrapping them as multipart upload parts without storing anything on disk.
IIRC Boto3's transfer manager does something like that under the hood for you too.
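A minimal sketch along the lines of the commented-out idea in the question's own code, assuming it is acceptable to hold the uploaded part in memory; the length that put_object needs is taken from the BytesIO buffer, so nothing touches disk (bucket and object names are placeholders):
import io
from minio import Minio

def upload_part_to_minio(bucket_name, object_name, falcon_file_param):
    client = Minio("localhost:9000", access_key='minio',
                   secret_key='minio', secure=False)
    # Read the multipart part into an in-memory buffer (no temp file)
    data = io.BytesIO(falcon_file_param.file.read())
    length = data.getbuffer().nbytes  # total object length required by put_object
    data.seek(0)
    client.put_object(bucket_name, object_name, data, length)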

Read n rows from csv in Google Cloud Storage to use with Python csv module

I have a variety of very large (~4GB each) csv files that contain different formats. These come from data recorders from over 10 different manufacturers. I am attempting to consolidate all of these into BigQuery. In order to load these up on a daily basis I want to first load these files into Cloud Storage, determine the schema, and then load into BigQuery. Due to the fact that some of the files have additional header information (from 2 - ~30 lines) I have produced my own functions to determine the most likely header row and the schema from a sample of each file (~100 lines), which I can then use in the job_config when loading the files to BQ.
This works fine when I am working with files from local storage directly to BQ, as I can use a context manager and then Python's csv module, specifically the Sniffer and reader objects. However, there does not seem to be an equivalent method of using a context manager directly from Storage. I do not want to bypass Cloud Storage in case any of these files are interrupted when loading into BQ.
What I can get to work:
# initialise variables
with open(csv_file, newline='', encoding=encoding) as datafile:
    dialect = csv.Sniffer().sniff(datafile.read(chunk_size))
    reader = csv.reader(datafile, dialect)
    sample_rows = []
    row_num = 0
    for row in reader:
        sample_rows.append(row)
        row_num += 1
        if row_num > 100:
            break
sample_rows
# Carry out schema and header investigation...
With Google Cloud Storage I have attempted to use download_as_string and download_to_file, which provide binary representations of the data, but then I cannot get the csv module to work with it. I have attempted to use .decode('utf-8'), which returns a long string containing \r\n's. I then used splitlines() to get a list of the data, but the csv functions still produce a dialect and reader that split the data into single characters per entry.
Has anyone managed to find a workaround to use the csv module with files stored in Cloud Storage without downloading the whole file?
After having a look at the csv source code on GitHub, I have managed to use the io module and csv module in Python to solve this problem. The io.BytesIO and TextIOWrapper were the two key functions to use. Probably not a common use case but thought I would post the answer here to save some time for anyone that needs it.
# Set up storage client and create a blob object from csv file that you are trying to read from GCS.
content = blob.download_as_string(start = 0, end = 10240) # Read a chunk of bytes that will include all header data and the recorded data itself.
bytes_buffer = io.BytesIO(content)
wrapped_text = io.TextIOWrapper(bytes_buffer, encoding = encoding, newline = newline)
dialect = csv.Sniffer().sniff(wrapped_text.read())
wrapped_text.seek(0)
reader = csv.reader(wrapped_text, dialect)
# Do what you will with the reader object
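As a usage example, the reader can then feed the same 100-row sampling logic used for local files earlier in the question:
sample_rows = []
for row_num, row in enumerate(reader, start=1):
    sample_rows.append(row)
    if row_num > 100:
        break
# sample_rows now holds the header candidates and data rows for schema detection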

Getting a data stream from a zipped file sitting in a S3 bucket using boto3 lib and AWS Lambda

I am trying to create a serverless processor for my cron job. In this job I receive a zipped file in my S3 bucket from one of my clients. The file is around 50 MB in size, but once you unzip it, it becomes 1.5 GB, and there is a hard limit of 500 MB on the space available to AWS Lambda, so I cannot download this file from the S3 bucket and unzip it on my Lambda. I was able to unzip the file and stream the content line by line from S3 using funzip in a Unix script:
for x in $files ; do echo -n "$x: " ; timeout 5 aws s3 cp $monkeydir/$x - | funzip ; done
My Bucket Name:MonkeyBusiness
Key:/Daily/Business/Banana/{current-date}
Object:banana.zip
Now that I am trying to achieve the same output using boto3: how can I stream the zipped content to system I/O, unzip the stream, save the content in separate files of 10,000 lines each, and upload the chunked files back to S3?
Need guidance as I am pretty new to AWS and boto3.
Please let me know if you need more details about the job.
The suggested solution given below is not applicable here, because the zlib documentation clearly states that the library handles the gzip file format, while my question is about the zip file format.
import zlib

def stream_gzip_decompress(stream):
    dec = zlib.decompressobj(32 + zlib.MAX_WBITS)  # offset 32 to skip the header
    for chunk in stream:
        rv = dec.decompress(chunk)
        if rv:
            yield rv
So I used BytesIO to read the compressed file into a buffer object, then used zipfile to open the compressed buffer and read the uncompressed data, and I was able to get the data line by line.
import io
import zipfile
import boto3
import sys

s3 = boto3.resource('s3', 'us-east-1')

def stream_zip_file():
    count = 0
    obj = s3.Object(
        bucket_name='MonkeyBusiness',
        key='/Daily/Business/Banana/{current-date}/banana.zip'
    )
    buffer = io.BytesIO(obj.get()["Body"].read())
    print(buffer)
    z = zipfile.ZipFile(buffer)
    foo2 = z.open(z.infolist()[0])
    print(sys.getsizeof(foo2))
    line_counter = 0
    for _ in foo2:
        line_counter += 1
    print(line_counter)
    z.close()

if __name__ == '__main__':
    stream_zip_file()
This is not an exact answer, but you can try this out.
First, adapt the answer mentioned above about handling a gzip file with limited memory; that method lets you stream the file chunk by chunk. Also, boto3's S3 put_object() and upload_fileobj seem to allow streaming.
You need to mix and adapt the above-mentioned code with the following decompression:
import io
import gzip
import boto3

s3_client = boto3.client('s3')

# Load the gzip-compressed bytes into an in-memory stream
stream = io.BytesIO(s3_data)

with gzip.GzipFile(fileobj=stream) as decompressor:
    # upload_fileobj reads the decompressor in chunks, so the
    # uncompressed data is streamed rather than held in memory at once
    s3_client.upload_fileobj(decompressor, "bucket", "key")
I cannot guarantee the above code will work; it is just to give you the idea of decompressing the file and re-uploading it in chunks. You might even need to pipe the decompressed data into a BytesIO buffer before passing it to upload_fileobj. There is a lot of testing to do.
If you don't need to decompress the file ASAP, my suggestion is to use Lambda to put a message about the file into an SQS queue. When there are "enough" files, trigger a Spot instance (which will be pretty cheap) that reads the queue and processes the files, as in the sketch below.
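A minimal sketch of that queueing step from a Lambda handler, assuming a hypothetical queue URL and an S3 event trigger:
import json
import boto3

sqs = boto3.client('sqs')
QUEUE_URL = 'https://sqs.us-east-1.amazonaws.com/123456789012/zip-files'  # hypothetical

def lambda_handler(event, context):
    for record in event['Records']:
        # Queue a pointer to the object; a Spot worker can read the queue,
        # download the zip and process it later.
        sqs.send_message(
            QueueUrl=QUEUE_URL,
            MessageBody=json.dumps({
                'bucket': record['s3']['bucket']['name'],
                'key': record['s3']['object']['key'],
            })
        )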
