I need to create a CSV and upload it to an S3 bucket. Since I'm creating the file on the fly, it would be better if I could write it directly to the S3 bucket as it is being created, rather than writing the whole file locally and then uploading it at the end.
Is there a way to do this? My project is in Python and I'm fairly new to the language. Here is what I tried so far:
import csv
import io
import boto
from boto.s3.key import Key
conn = boto.connect_s3()
bucket = conn.get_bucket('dev-vs')
k = Key(bucket)
k.key = 'foo/foobar'
fieldnames = ['first_name', 'last_name']
writer = csv.DictWriter(io.StringIO(), fieldnames=fieldnames)
k.set_contents_from_stream(writer.writeheader())
I received this error: BotoClientError: s3 does not support chunked transfer
UPDATE: I found a way to write directly to S3, but I can't find a way to clear the buffer without actually deleting the lines I already wrote. So, for example:
conn = boto.connect_s3()
bucket = conn.get_bucket('dev-vs')
k = Key(bucket)
k.key = 'foo/foobar'
testDict = [{
"fieldA": "8",
"fieldB": None,
"fieldC": "888888888888"},
{
"fieldA": "9",
"fieldB": None,
"fieldC": "99999999999"}]
f = io.StringIO()
fieldnames = ['fieldA', 'fieldB', 'fieldC']
writer = csv.DictWriter(f, fieldnames=fieldnames)
writer.writeheader()
k.set_contents_from_string(f.getvalue())
for row in testDict:
    writer.writerow(row)
    k.set_contents_from_string(f.getvalue())
f.close()
This writes 3 lines to the file; however, I'm unable to release memory, which I need to do in order to write a big file. If I add:
f.seek(0)
f.truncate(0)
to the loop, then only the last line of the file is written. Is there any way to release resources without deleting lines from the file?
I did find a solution to my question, which I will post here in case anyone else is interested. I decided to do this as parts in a multipart upload. You can't stream to S3. There is also a package available that changes your streaming file over to a multipart upload which I used: Smart Open.
import smart_open
import io
import csv
testDict = [{
"fieldA": "8",
"fieldB": None,
"fieldC": "888888888888"},
{
"fieldA": "9",
"fieldB": None,
"fieldC": "99999999999"}]
fieldnames = ['fieldA', 'fieldB', 'fieldC']
f = io.StringIO()
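with smart_open.smart_open('s3://dev-test/bar/foo.csv', 'wb') as fout:
    writer = csv.DictWriter(f, fieldnames=fieldnames)
    writer.writeheader()
    # The target is opened in binary mode, so encode the text before writing
    fout.write(f.getvalue().encode('utf-8'))
    for row in testDict:
        # Empty the buffer so it only ever holds the current row
        f.seek(0)
        f.truncate(0)
        writer.writerow(row)
        fout.write(f.getvalue().encode('utf-8'))
f.close()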
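The buffer-reuse pattern above can be pulled out into a small generator. This is only a sketch using the standard library; `csv_chunks` is a hypothetical helper name, not part of smart_open or boto. It yields the header and then each row as encoded bytes, so only one row is ever held in memory and every chunk can be handed to whatever writer performs the upload.

```python
import csv
import io


def csv_chunks(rows, fieldnames, encoding="utf-8"):
    """Yield the CSV header and then each row as bytes, reusing one buffer."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames)

    def drain():
        # Return what is buffered, then empty the buffer for the next row
        data = buffer.getvalue().encode(encoding)
        buffer.seek(0)
        buffer.truncate(0)
        return data

    writer.writeheader()
    yield drain()
    for row in rows:
        writer.writerow(row)
        yield drain()
```

With a writer such as `fout` from the answer above, usage would be `for chunk in csv_chunks(testDict, fieldnames): fout.write(chunk)`.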
Here is a complete example using boto3
import boto3
import io
session = boto3.Session(
aws_access_key_id="...",
aws_secret_access_key="..."
)
s3 = session.resource("s3")
buff = io.BytesIO()
buff.write("test1\n".encode())
buff.write("test2\n".encode())
s3.Object(bucket, keypath).put(Body=buff.getvalue())
We were trying to upload file contents to s3 when it came through as an InMemoryUploadedFile object in a Django request. We ended up doing the following because we didn't want to save the file locally. Hope it helps:
@action(detail=False, methods=['post'])
def upload_document(self, request):
    # The uploaded file is already a file-like object, so pass it straight through
    document = request.data.get('image').file
    s3.upload_fileobj(document, BUCKET_NAME,
                      DESIRED_NAME_OF_FILE_IN_S3,
                      ExtraArgs={"ServerSideEncryption": "aws:kms"})
According to the docs, it's possible:
s3.Object('mybucket', 'hello.txt').put(Body=open('/tmp/hello.txt', 'rb'))
so we can use StringIO in the ordinary way.
Update: the smart_open lib from @inquiring minds' answer is a better solution.
There's an interesting code solution mentioned in a GitHub smart_open issue (#82) that I've been meaning to try out. Copy-pasting here for posterity... looks like boto3 is required:
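import csv
import gzip
import io
import boto3
# On Python 3 the csv module writes text, so buffer it in StringIO and encode before compressing
csv_data = io.StringIO()
writer = csv.writer(csv_data)
writer.writerows(my_data)
gz_stream = io.BytesIO()
with gzip.GzipFile(fileobj=gz_stream, mode="w") as gz:
    gz.write(csv_data.getvalue().encode("utf-8"))
# Rewind so upload_fileobj reads from the start of the compressed data
gz_stream.seek(0)
s3 = boto3.client('s3')
s3.upload_fileobj(gz_stream, bucket_name, key)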
This specific example is streaming to a compressed S3 key/file, but it seems like the general approach -- using the boto3 S3 client's upload_fileobj() method in conjunction with a target stream, not a file -- should work.
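As a variation on the same idea, the intermediate CSV buffer can be skipped by letting the csv module write through a text-mode gzip wrapper. This is a sketch rather than the code from the issue: `gzip_csv_stream` is a hypothetical helper, and the upload itself is left out so the block runs without AWS access.

```python
import csv
import gzip
import io


def gzip_csv_stream(rows):
    """Return a BytesIO holding the gzip-compressed CSV for `rows`, rewound to 0."""
    gz_stream = io.BytesIO()
    # gzip.open accepts an existing file object; "wt" wraps it for text writes.
    # Closing the wrapper finishes the gzip stream but leaves gz_stream open.
    with gzip.open(gz_stream, "wt", encoding="utf-8", newline="") as text:
        csv.writer(text).writerows(rows)
    gz_stream.seek(0)
    return gz_stream
```

The returned stream could then be passed to `s3.upload_fileobj(gz_stream, bucket_name, key)` exactly as in the snippet above.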
There's a well supported library for doing just this:
pip install s3fs
s3fs is really trivial to use:
import s3fs
# Keep a reference to the filesystem object; it is what provides open()
s3 = s3fs.S3FileSystem(anon=False)
with s3.open('mybucket/new-file', 'wb') as f:
    f.write(2*2**20 * b'a')
    f.write(2*2**20 * b'a')
Incidentally there's also something built into boto3 (backed by the AWS API) called MultiPartUpload.
This isn't factored as a python stream which might be an advantage for some people. Instead you can start an upload and send parts one at a time.
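A rough sketch of that part-by-part flow, assuming `client` behaves like `boto3.client("s3")`. The helper name `multipart_upload` and its parameters are made up for illustration. Chunks are buffered until they reach the part size, because S3 requires every part except the last to be at least 5 MiB. The sketch also assumes at least one non-empty chunk is supplied.

```python
import io

MIN_PART_SIZE = 5 * 1024 * 1024  # every part except the last must be >= 5 MiB


def multipart_upload(client, bucket, key, chunks, part_size=MIN_PART_SIZE):
    """Upload an iterable of bytes chunks as a single object, one part at a time."""
    upload_id = client.create_multipart_upload(Bucket=bucket, Key=key)["UploadId"]
    parts = []
    buffer = io.BytesIO()

    def flush():
        # Send the buffered bytes as the next part and remember its ETag
        part_number = len(parts) + 1
        response = client.upload_part(
            Bucket=bucket, Key=key, UploadId=upload_id,
            PartNumber=part_number, Body=buffer.getvalue(),
        )
        parts.append({"ETag": response["ETag"], "PartNumber": part_number})
        buffer.seek(0)
        buffer.truncate(0)

    try:
        for chunk in chunks:
            buffer.write(chunk)
            if buffer.tell() >= part_size:
                flush()
        if buffer.tell() > 0:
            flush()  # the final part may be smaller than part_size
        client.complete_multipart_upload(
            Bucket=bucket, Key=key, UploadId=upload_id,
            MultipartUpload={"Parts": parts},
        )
    except Exception:
        # Abort so the already-uploaded parts do not linger in the bucket
        client.abort_multipart_upload(Bucket=bucket, Key=key, UploadId=upload_id)
        raise
    return len(parts)
```

Memory use stays bounded by roughly one part, whatever the total size of the object.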
To write a string to an S3 object, use:
s3.Object('my_bucket', 'my_file.txt').put(Body='Hello there')
So convert the stream to string and you're there.
Related
My AWS OBJECT Lambda Function gets an unencrypted PDF via the Object Lambda inputS3Url. I want to use PyPDF2 to convert this to encrypted PDF, and send back via s3.write_get_object_response. How do I do this?
s3_url = object_get_context["inputS3Url"]
url=s3_url
response = requests.get(url)
my_raw_data = response.content
[SAVE ENCRYPTED my_raw_data TO VARIABLE so it can returned via S3.write_get_object_response - HOW?]
s3 = boto3.client('s3')
s3.write_get_object_response(
Body= [WHAT WOULD GO HERE?]
RequestRoute=request_route,
RequestToken=request_token)
The docs got you! Encrypting PDFs and Streaming Data is what you need (at least if I got you right; let me know if you want to achieve something else than getting a password-protected PDF on S3)
Not tested, but something like this
import boto3
from PyPDF2 import PdfReader, PdfWriter
from io import BytesIO
reader = PdfReader(BytesIO(my_raw_data))
writer = PdfWriter()
# Add all pages to the writer
for page in reader.pages:
    writer.add_page(page)
# Add a password to the new PDF
writer.encrypt("my-secret-password")
# Write the new PDF to an in-memory buffer instead of a file
with BytesIO() as bytes_stream:
    writer.write(bytes_stream)
    bytes_stream.seek(0)
    # Send the response while the buffer is still open
    s3 = boto3.client('s3')
    s3.write_get_object_response(
        Body=bytes_stream,
        RequestRoute=request_route,
        RequestToken=request_token
    )
I wanted to compare the content of two files compressed from the same file with the gzip.open() method from the Python 3 standard library. Below is the code creating the two compressed files:
import shutil
import gzip
with open('file1.txt', 'rb') as f_in:
    with gzip.open('output1.gz', "wb") as f_out:
        shutil.copyfileobj(f_in, f_out)
with open('file1.txt', 'rb') as f_in:
    with gzip.open('output2.gz', "wb") as f_out:
        shutil.copyfileobj(f_in, f_out)
Then I compared the gzipped-files with the filecmp module from the standard library:
import filecmp
print(filecmp.cmp("output1.gz", "output2.gz", shallow=False))
This returns False because of the different timestamps and filenames in the headers of the files, as pointed out by Kris in this thread.
So according to the RFC 1952 the filename section of the headers that comes after the timestamp section is terminated with a null byte. So I created a function to look for this byte and start hashing the rest of the bytes with the md5() method from the hashlib module from the standard library. And then compare the hashes of the files.
import hashlib
def compareContent(filename):
    md5 = hashlib.md5()
    with open(filename, 'rb') as f_in:
        # Skip the header up to the first null byte
        while True:
            text = f_in.read(1)
            if text == b'\x00':
                break
        # Hash everything after it
        while True:
            data = f_in.read(8192)
            if not data:
                break
            md5.update(data)
    return md5.hexdigest()
hash_one = compareContent("output1.gz")
hash_two = compareContent("output2.gz")
print(hash_one == hash_two)
This finally returns True and seems to work well.
I'm just a little bit surprised that I could not find an already existing function to do this work.
As a beginner, my questions are:
Am I missing something? Does such a function already exist?
If not, Is there any better way to achieve this or to improve my code?
Here is the code available in a Colab notebook: compare gzipped-files
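One possible alternative, sketched here with the standard library only and not taken from the post: hash the decompressed content rather than the raw file bytes, so the header fields (timestamp, filename) never enter the comparison. The name `gzip_content_digest` is made up for illustration. It also avoids relying on the first null byte, which could in principle occur inside the timestamp field before the filename.

```python
import gzip
import hashlib


def gzip_content_digest(path, chunk_size=8192):
    """Return the MD5 hex digest of the decompressed content of a gzip file."""
    md5 = hashlib.md5()
    with gzip.open(path, "rb") as f_in:
        # Read the decompressed stream in chunks to keep memory use flat
        for chunk in iter(lambda: f_in.read(chunk_size), b""):
            md5.update(chunk)
    return md5.hexdigest()
```

Usage would mirror the code above: `gzip_content_digest("output1.gz") == gzip_content_digest("output2.gz")`.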
I am trying to read bucket files without saving them to disk:
import boto3
import botocore
from io import StringIO
import pandas as pd
s3 = boto3.resource('s3',config=botocore.config.Config(signature_version=botocore.UNSIGNED))
bucket = self.s3.Bucket('deutsche-boerse-xetra-pds')
objects = self.bucket.objects.filter(Prefix= date)
file = pd.read_csv(StringIO(self.bucket.Object(key=object.key).get().get('Body').read().decode('utf-8')))
This code works quite well. However, I would like to use concurrency (Python asyncio) to speed up the reading process. I searched the documentation, but I could only find something for the download function, not for the get function.
Do you have any suggestion?
Thanks in advance.
I found a solution that works with multiprocessing, since my final goal was to reduce the processing time.
The code is as follows:
import multiprocessing
from typing import List

def generate_bucket():
    # Each worker process builds its own unsigned (anonymous) resource
    s3_resource = boto3.resource('s3', config=botocore.config.Config(signature_version=botocore.UNSIGNED))
    xetra_bucket = s3_resource.Bucket('deutsche-boerse-xetra-pds')
    return s3_resource, xetra_bucket

def read_csv(key):
    s3local, bucket_local = generate_bucket()
    return pd.read_csv(StringIO(bucket_local.Object(key=key).get().get('Body').read().decode('utf-8')))

def import_raw_data(date: List[str]) -> pd.DataFrame:
    s3local, bucket_local2 = generate_bucket()
    objects = [i.key for i in bucket_local2.objects.filter(Prefix=date[0])]
    # Read the objects in parallel and concatenate the resulting frames
    with multiprocessing.Pool(multiprocessing.cpu_count()) as p:
        df = pd.concat(p.map(read_csv, objects))
    return df
For me it works, but I am sure that there could be the possibility to improve this code. I'm open to suggestions.
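Because fetching objects is largely network-bound, a thread pool may be a lighter alternative to processes. The sketch below is an assumption rather than something measured here: `fetch_all` is a hypothetical helper, and `fetch_one` stands for any callable that downloads and parses a single key (for example, the `read_csv` function above).

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_all(fetch_one, keys, max_workers=8):
    """Call fetch_one(key) for every key concurrently; results keep the input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch_one, keys))
```

A call such as `pd.concat(fetch_all(read_csv, objects))` would then replace the `multiprocessing.Pool` block. Creating the boto3 object inside `fetch_one`, as `read_csv` already does, avoids sharing a resource across threads.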
I am trying to create an XML file with io.StringIO() and pack it into a ZipFile, but the output is an empty zip file. Where is the mistake?
string_xml_buffer = io.StringIO()
string_xml_buffer.write('<MyContent>')
string_xml_buffer.write('</MyContent>')
bytes_zip_buffer = io.BytesIO()
zf = ZipFile(bytes_zip_buffer, mode = 'w')
zf.writestr('filename.xml', string_xml_buffer.getvalue())
# Django response
response = HttpResponse(zf, content_type='application/zip')
response['Content-Disposition'] = 'attachment; filename="f.zip"'
return response
The problem was the missing zf.close(). With this line added, the code works great.
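A minimal sketch of why closing matters: the zip central directory is only written when the archive is closed, so a `with` block is a convenient way to guarantee it. `build_zip` is a hypothetical helper name, and the returned bytes are what would be handed to the HTTP response.

```python
import io
from zipfile import ZipFile, ZIP_DEFLATED


def build_zip(files):
    """Return the bytes of a zip archive built from a {filename: text} mapping."""
    bytes_zip_buffer = io.BytesIO()
    # Leaving the with-block closes the archive and writes the central directory
    with ZipFile(bytes_zip_buffer, mode="w", compression=ZIP_DEFLATED) as zf:
        for name, text in files.items():
            zf.writestr(name, text)
    return bytes_zip_buffer.getvalue()
```

For example, `build_zip({'filename.xml': string_xml_buffer.getvalue()})` yields a complete archive for the XML from the question.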
Python runs like a charm on Google Cloud Functions, except for temp files. Here's my simplified code:
FILE_PATH = "{}/report.pdf".format(tempfile.gettempdir())
pdf.output(FILE_PATH)
...
with open(FILE_PATH, 'rb') as f:
    data = f.read()
    f.close()
encoded = base64.b64encode(data).decode()
attachment = Attachment()
attachment.content = str(encoded)
attachment.type = "application/pdf"
attachment.filename = "report"
attachment.disposition = "attachment"
attachment.content_id = "Report"
mail = Mail(from_email, subject, to_email, content)
mail.add_attachment(attachment)
Error is: [Errno 2] No such file or directory: '/tmp/report.pdf'
It works perfectly fine locally. The docs unfortunately only show the Node version. Workarounds for sending that PDF would also be fine.
It is a little difficult to find official Google documentation on writing to the temporary folder. In my case, I needed to write a file to a temporary directory and upload it to Google Cloud Storage using GCF.
Writing to the temporary directory of Google Cloud Functions consumes the memory provisioned for the function.
After creating and using the file, it is recommended to remove it from the temporary directory. I used this code snippet to write a CSV into a temp dir in GCF (Python 3.7).
import pandas as pd
import os
import tempfile
from werkzeug.utils import secure_filename
def get_file_path(filename):
    file_name = secure_filename(filename)
    return os.path.join(tempfile.gettempdir(), file_name)

def write_temp_dir():
    data = [['tom', 10], ['nick', 15]]
    df = pd.DataFrame(data, columns=['Name', 'Age'])
    name = 'example.csv'
    path_name = get_file_path(name)
    df.to_csv(path_name, index=False)
    # Remove the file once it is no longer needed to free the memory it uses
    os.remove(path_name)
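The same write-use-remove cycle can also be expressed with `tempfile.TemporaryDirectory`, which deletes the directory and its contents even if an exception is raised. This is a standard-library sketch that uses the csv module instead of pandas to stay self-contained; `write_and_read_temp_csv` is a hypothetical name.

```python
import csv
import os
import tempfile


def write_and_read_temp_csv(rows, header, name="example.csv"):
    """Write a CSV into a throwaway directory, read it back, and clean up."""
    with tempfile.TemporaryDirectory() as tmp_dir:
        path = os.path.join(tmp_dir, name)
        with open(path, "w", newline="") as f:
            writer = csv.writer(f)
            writer.writerow(header)
            writer.writerows(rows)
        # Use the file (for example, upload it) while the directory still exists
        with open(path, "rb") as f:
            data = f.read()
    # The directory and the file are gone by this point
    return data, path
```

In a cloud function, the upload call would sit where the file is read back, before the `with` block exits.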