getting scheduling error when forwarding s3 object to flask response - python-3.x

I'm using the following to pull a data file from an S3-compliant server:
try:
    buf = io.BytesIO()
    client.download_fileobj(project_id, path, buf)
    body = buf.getvalue().decode("utf-8")
except botocore.exceptions.ClientError as e:
    if defaultValue is not None:
        return defaultValue
    else:
        raise S3Error(project_id, path, e) from e
else:
    return body
The code generates this error:
RuntimeError: cannot schedule new futures after interpreter shutdown
In general, I'm simply trying to read an s3-compliant file into the body of a response object. The caller of the above snippet is as follows:
data = read_file(project_id, f"{PATH}/data.csv")
response = Response(
    data,
    mimetype="text/csv",
    headers=[
        ("Content-Type", "application/octet-stream; charset=utf-8"),
        ("Content-Disposition", "attachment; filename=data.csv")
    ],
    direct_passthrough=True
)
Playing with the code, when I don't get the runtime error, the request simply hangs: no response is ever returned.
Thank you to anyone with guidance.

I'm not sure how general this answer will be; however, the combination of boto and the DigitalOcean implementation of S3 does not "strictly" permit using an object key that starts with /. Once I removed the offending leading character, the files downloaded as expected.
I base the boto specificity on the fact that I was able to read the same files using Haskell's amazonka library.
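For illustration, a minimal hedged sketch of the fix applied to the snippet from the question (client, S3Error, and the defaultValue parameter are assumed to be the same objects used above; the only change is normalizing the key):
import io

import botocore.exceptions

def read_file(project_id, path, defaultValue=None):
    # DigitalOcean Spaces rejects object keys with a leading "/", so strip it
    # before handing the key to boto3.
    key = path.lstrip("/")
    try:
        buf = io.BytesIO()
        client.download_fileobj(project_id, key, buf)
        body = buf.getvalue().decode("utf-8")
    except botocore.exceptions.ClientError as e:
        if defaultValue is not None:
            return defaultValue
        raise S3Error(project_id, path, e) from e
    return body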

Related

Cannot upload data to S3 through Lambda

I'm trying to extract data from Trusted Advisor through a Lambda function and upload it to S3. One part of the function appends the check data to a list, but that block throws an error. The specific block is:
try:
    check_summary = support_client.describe_trusted_advisor_check_summaries(
        checkIds=[checks['id']])['summaries'][0]
    if check_summary['status'] != 'not_available':
        checks_list[checks['category']].append(
            [checks['name'], check_summary['status'],
             str(check_summary['resourcesSummary']['resourcesProcessed']),
             str(check_summary['resourcesSummary']['resourcesFlagged']),
             str(check_summary['resourcesSummary']['resourcesSuppressed']),
             str(check_summary['resourcesSummary']['resourcesIgnored'])
             ])
    else:
        print("unable to append checks")
except:
    print('Failed to get check: ' + checks['name'])
    traceback.print_exc()
The error log shows:
unable to append checks
I'm new to Python, so I'm unsure how to check the traceback stack under the else: statement. Also, am I doing anything wrong in the above? Please help.
You are not calling the s3_upload function anywhere; the code is also invalid, since it references a file_name variable that is never initialized.
A few observations on your script:
traceback.print_exc() should be executed before the return statement so that the traceback is actually printed and the errors can be identified.
if __name__ == '__main__':
    lambda_handler
This only executes code when the file is run directly, and not when it is imported.
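If the intent is to exercise the handler locally, it also needs to actually be called; a hedged sketch (the empty event and None context are placeholders, not taken from the question):
if __name__ == '__main__':
    # Invoke the handler with a placeholder event when running this file directly.
    sample_event = {}
    print(lambda_handler(sample_event, None))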
According to the documentation, the put_object method's signature starts as follows:
def put_object(self, bucket_name, object_name, data, length,
Fix your put_object parameters accordingly.
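For illustration, a minimal hedged sketch of a call matching that signature (the client object, the bucket and object names, and the csv_content variable holding the report text are all placeholders, not taken from the question):
import io

csv_bytes = csv_content.encode("utf-8")  # csv_content: the report text built earlier (hypothetical)
client.put_object(
    bucket_name="my-report-bucket",            # placeholder bucket
    object_name="trusted-advisor/report.csv",  # placeholder object key
    data=io.BytesIO(csv_bytes),
    length=len(csv_bytes),
)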
You're not using s3_upload anywhere in your lambda.

How to mock an API that returns a file

I have a function (api_upload) that accepts a file and HTTP headers, it then sends the file along with the HTTP headers to an API service (example.com/upload), which processes the file, and returns a URL of the processed file.
I am trying to write a unit test for this function (api_upload), and I am feeling a bit lost since this is my first time writing unit tests. I have installed requests-mock (pip install requests-mock) and pytest-mock (pip install pytest-mock). I guess my question is: how do I mock example.com/upload? I don't need to actually call example.com/upload in my testing, but I need to test that the function api_upload is "working" as expected.
def api_upload(input_file, headers):
    # convert the input_file into a dictionary, since the requests library wants a dictionary
    input_file = {'file': input_file}
    url = 'https://example.com/upload'
    files = {'file': open(input_file['file'], 'rb')}
    response = requests.post(url, headers=headers, files=files)
    saved_file_url = response.json()['url']
    return saved_file_url
What I tried is:
def test_api_upload(requests_mock):
    requests_mock.get("https://example.com/upload", text='What do I put here?')
    # I am not sure what else to write here?
The first thing you need to alter is the HTTP method you mock: api_upload() does a POST request, so that is what the requests_mock call needs to match.
The second thing is that you need to mock the response that the mock will return for your requests.post() call. api_upload() apparently expects JSON consisting of an object with url as one of its attributes, so that is what you need to return.
As to what you should do inside your test function, I'd say:
Mock the POST call and the expected response.
Call api_upload() and give it a mock file (see tmp_path that returns an instance of pathlib.Path).
Assert that the returned URL is what you mocked.
An untested example:
def test_api_upload(requests_mock, tmp_path):
    file = tmp_path / "a.txt"
    file.write_text("aaa")
    mocked_response = {"url": "https://example.com/file/123456789"}
    requests_mock.post("https://example.com/upload", json=mocked_response)
    assert api_upload(input_file=str(file), headers={}) == mocked_response["url"]
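If you also want to confirm the mocked endpoint was actually hit, the requests_mock fixture records the requests it intercepted; an optional hedged addition at the end of the test body above:
    # extra assertions, appended inside test_api_upload after the assert above
    assert requests_mock.called
    assert requests_mock.call_count == 1
    assert requests_mock.last_request.url == "https://example.com/upload"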

Using wand in Python to convert PDF to JPEG raises wand.exceptions.WandRuntimeError in Docker

I want to convert the first page of a PDF to an image. The code below works well in my local environment (Ubuntu 18), but when I run it in a Docker environment, it fails and raises:
wand.exceptions.WandRuntimeError: MagickReadImage returns false, but
did not raise ImageMagick exception. This can occurs when a delegate is
missing, or returns EXIT_SUCCESS without generating a raster.
Am I missing a dependency? Or something else? I don't know what it's referring to as 'delegate'.
I looked at the source code; it fails here (wand/image.py, around line 7873):
if blob is not None:
    if not isinstance(blob, abc.Iterable):
        raise TypeError('blob must be iterable, not ' +
                        repr(blob))
    if not isinstance(blob, binary_type):
        blob = b''.join(blob)
    r = library.MagickReadImageBlob(self.wand, blob, len(blob))
elif filename is not None:
    filename = encode_filename(filename)
    r = library.MagickReadImage(self.wand, filename)
if not r:
    self.raise_exception()
    msg = ('MagickReadImage returns false, but did not raise ImageMagick '
           'exception. This can occurs when a delegate is missing, or '
           'returns EXIT_SUCCESS without generating a raster.')
    raise WandRuntimeError(msg)
The line r = library.MagickReadImageBlob(self.wand, blob, len(blob)) returns true in my local environment, but in Docker it returns false. Moreover, the arguments blob and len(blob) are the same in both environments.
def pdf2img(fp, page=0):
    """
    Convert PDF to JPEG image.
    :param fp: a file-like object
    :param page: page number to convert
    :return: (Bool, File). If False, `fp` is not a PDF; if True, `File` is a file-like object
             containing the `jpeg` format data
    """
    try:
        reader = PdfFileReader(fp, strict=False)
    except Exception as e:
        fp.seek(0)
        return False, None
    else:
        bytes_in = io.BytesIO()
        bytes_out = io.BytesIO()
        writer = PdfFileWriter()
        writer.addPage(reader.getPage(page))
        writer.write(bytes_in)
        bytes_in.seek(0)
        im = Image(file=bytes_in, resolution=120)
        im.format = 'jpeg'
        im.save(file=bytes_out)
        bytes_out.seek(0)
        return True, bytes_out
I don't know what it's referring to as 'delegate'.
With ImageMagick, a 'delegate' refers to any shared library, utility, or external program that does the actual encoding and decoding of a file type; specifically, turning a file format into a raster.
Am I missing a dependency?
Most likely. For PDF, you would need Ghostscript installed in the Docker image.
Or something else?
Possible, but hard to determine without an error message. The WandRuntimeError exception is a catch-all: it exists because a raster could not be generated from the PDF and neither Wand nor ImageMagick can determine why. Usually there would be an exception if the delegate failed, a security-policy message, or an OS error.
The best thing would be to run a few gs commands to see if Ghostscript is working correctly:
gs -sDEVICE=pngalpha -o page-%03d.png -r120 input.pdf
If the above works, then try again with just ImageMagick:
convert -density 120 input.pdf page-%03d.png
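If you would rather check from inside Python before attempting the conversion, here is a minimal hedged sketch (it only assumes the standard gs binary name used in the command above; Python 3.7+ for capture_output):
import shutil
import subprocess

def ghostscript_available():
    """Return True if a working Ghostscript binary is on PATH."""
    gs = shutil.which("gs")
    if gs is None:
        return False
    # A healthy install exits 0 and prints its version.
    return subprocess.run([gs, "--version"], capture_output=True).returncode == 0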

How to have more than one handler in AWS Lambda Function?

I have a very large Python file that consists of multiple defined functions. If you're familiar with AWS Lambda, when you create a Lambda function you specify a handler, which is a function in the code that AWS Lambda invokes when the service executes your code, as represented below in the my_handler.py file:
def handler_name(event, context):
    ...
    return some_value
Link Source: https://docs.aws.amazon.com/lambda/latest/dg/python-programming-model-handler-types.html
However, as I mentioned above, I have multiple defined functions in my_handler.py that each have their own event and context. Therefore, this will result in an error. Are there any ways around this in Python 3.6?
Your single handler function will need to be responsible for parsing the incoming event and determining the appropriate route to take. For example, let's say your other functions are called helper1 and helper2. Your Lambda handler function will inspect the incoming event and then, based on one of the fields in the incoming event (let's call it EventType), call either helper1 or helper2, passing in both the event and context objects.
def handler_name(event, context):
    if event['EventType'] == 'helper1':
        helper1(event, context)
    elif event['EventType'] == 'helper2':
        helper2(event, context)

def helper1(event, context):
    pass

def helper2(event, context):
    pass
This is only pseudo-code, and I haven't tested it myself, but it should get the concept across.
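A slightly more extensible variant of the same routing idea uses a dispatch dictionary; this is a hedged sketch, not from the original answer, and the EventType field is the same assumed field as above:
def helper1(event, context):
    # ... existing helper logic ...
    return {"handled_by": "helper1"}

def helper2(event, context):
    # ... existing helper logic ...
    return {"handled_by": "helper2"}

# Map EventType values to the functions that handle them.
HANDLERS = {
    "helper1": helper1,
    "helper2": helper2,
}

def handler_name(event, context):
    event_type = event.get("EventType")
    if event_type not in HANDLERS:
        raise ValueError(f"Unknown EventType: {event_type!r}")
    return HANDLERS[event_type](event, context)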
Little late to the game, but I thought it wouldn't hurt to share. Best practices suggest that one separate the handler from the Lambda's core logic. Not only is it okay to add additional definitions, it can lead to more legible code and reduce waste, e.g. multiple API calls to S3. So, although it can get out of hand, I disagree with some of those critiques of your initial question. It's effective to use your handler as a logical interface to the additional functions that will accomplish your various work. In data architecture and engineering land it's often less costly and more efficient to work this way, particularly if you are building out ETL pipelines following service-oriented architectural patterns.
Admittedly, I'm a bit of a maverick and some may find this unruly/egregious, but I've gone so far as to build classes into my Lambdas for various reasons (e.g. centralized, data-lake-ish S3 buckets that accommodate a variety of file types, reducing unnecessary requests, etc.), and I stand by it. Here's an example of one of my handler files from a CDK example project I put on the hub a while back. Hopefully it'll give you some useful ideas, or at the very least you won't feel alone in wanting to beef up your Lambdas.
import requests
import json
from requests.exceptions import Timeout
from requests.exceptions import HTTPError
from botocore.exceptions import ClientError
from base64 import b64decode  # used by get_secret below
from datetime import date
import csv
import os
import boto3
import logging

logger = logging.getLogger()
logger.setLevel(logging.DEBUG)


class Asteroids:
    """Client to NASA API and execution interface to branch data processing by file type.

    Notes:
        This class doesn't look like a normal class. It is a simple example of how one might
        workaround AWS Lambda's limitations of class use in handlers. It also allows for
        better organization of code to simplify this example. If one planned to add
        other NASA endpoints or process larger amounts of Asteroid data for both .csv and .json formats,
        asteroids_json and asteroids_csv should be modularized and divided into separate lambdas
        where stepfunction orchestration is implemented for a more comprehensive workflow.
        However, for the sake of this demo I'm keeping it lean and easy.
    """

    def execute(self, format):
        """Serves as Interface to assign class attributes and execute class methods

        Raises:
            Exception: If file format is not of .json or .csv file types.
        Notes:
            Have fun!
        """
        self.file_format = format
        self.today = date.today().strftime('%Y-%m-%d')
        # method call below used when Secrets Manager integrated. See get_secret.__doc__ for more.
        # self.api_key=get_secret('nasa_api_key')
        self.api_key = os.environ["NASA_KEY"]
        self.endpoint = f"https://api.nasa.gov/neo/rest/v1/feed?start_date={self.today}&end_date={self.today}&api_key={self.api_key}"
        self.response_object = self.nasa_client(self.endpoint)
        self.processed_response = self.process_asteroids(self.response_object)
        if self.file_format == "json":
            self.asteroids_json(self.processed_response)
        elif self.file_format == "csv":
            self.asteroids_csv(self.processed_response)
        else:
            raise Exception("FILE FORMAT NOT RECOGNIZED")
        self.write_to_s3()

    def nasa_client(self, endpoint):
        """Client component for API call to NASA endpoint.

        Args:
            endpoint (str): Parameterized url for API call.
        Raises:
            Timeout: If connection not made in 5s and/or data not retrieved in 15s.
            HTTPError & Exception: Self-explanatory
        Notes:
            See Cloudwatch logs for debugging.
        """
        try:
            response = requests.get(endpoint, timeout=(5, 15))
        except Timeout as timeout:
            print(f"NASA GET request timed out: {timeout}")
        except HTTPError as http_err:
            print(f"HTTP error occurred: {http_err}")
        except Exception as err:
            print(f'Other error occurred: {err}')
        else:
            return json.loads(response.content)

    def process_asteroids(self, payload):
        """Process old, and create new, data object with content from response.

        Args:
            payload (b'str'): Binary string of asteroid data to be processed.
        """
        near_earth_objects = payload["near_earth_objects"][f"{self.today}"]
        asteroids = []
        for neo in near_earth_objects:
            asteroid_object = {
                "id": neo['id'],
                "name": neo['name'],
                "hazard_potential": neo['is_potentially_hazardous_asteroid'],
                "est_diameter_min_ft": neo['estimated_diameter']['feet']['estimated_diameter_min'],
                "est_diameter_max_ft": neo['estimated_diameter']['feet']['estimated_diameter_max'],
                "miss_distance_miles": [item['miss_distance']['miles'] for item in neo['close_approach_data']],
                "close_approach_exact_time": [item['close_approach_date_full'] for item in neo['close_approach_data']]
            }
            asteroids.append(asteroid_object)
        return asteroids

    def asteroids_json(self, payload):
        """Creates json object from payload content then writes to .json file.

        Args:
            payload (b'str'): Binary string of asteroid data to be processed.
        """
        json_file = open(f"/tmp/asteroids_{self.today}.json", 'w')
        json_file.write(json.dumps(payload, indent=4))
        json_file.close()

    def asteroids_csv(self, payload):
        """Creates .csv object from payload content then writes to .csv file.
        """
        csv_file = open(f"/tmp/asteroids_{self.today}.csv", 'w', newline='\n')
        fields = list(payload[0].keys())
        writer = csv.DictWriter(csv_file, fieldnames=fields)
        writer.writeheader()
        writer.writerows(payload)
        csv_file.close()

    def get_secret(self):
        """Gets secret from AWS Secrets Manager

        Notes:
            Have yet to integrate into the CDK. Leaving as example code.
        """
        secret_name = os.environ['TOKEN_SECRET_NAME']
        region_name = os.environ['REGION']
        session = boto3.session.Session()
        client = session.client(service_name='secretsmanager', region_name=region_name)
        try:
            get_secret_value_response = client.get_secret_value(SecretId=secret_name)
        except ClientError as e:
            raise e
        else:
            if 'SecretString' in get_secret_value_response:
                secret = get_secret_value_response['SecretString']
            else:
                secret = b64decode(get_secret_value_response['SecretBinary'])
            return secret

    def write_to_s3(self):
        """Uploads both .json and .csv files to s3
        """
        s3 = boto3.client('s3')
        s3.upload_file(f"/tmp/asteroids_{self.today}.{self.file_format}", os.environ['S3_BUCKET'], f"asteroid_data/asteroids_{self.today}.{self.file_format}")


def handler(event, context):
    """Instantiates class and triggers execution method.

    Args:
        event (dict): Lists a custom dict that determines interface control flow--i.e. `csv` or `json`.
        context (obj): Provides methods and properties that contain invocation, function and
            execution environment information.
            *Not used herein.
    """
    asteroids = Asteroids()
    asteroids.execute(event)

How to check if boto3 S3.Client.upload_fileobj succeeded?

I want to save the result of a long-running job on S3. The job is implemented in Python, so I'm using boto3. The user guide says to use S3.Client.upload_fileobj for this purpose, which works fine, except I can't figure out how to check whether the upload succeeded. According to the documentation, the method doesn't return anything and doesn't raise an error. The Callback param seems to be intended for progress tracking rather than error checking. It is also unclear whether the method call is synchronous or asynchronous.
If the upload failed for any reason, I would like to save the contents to the disk and log an error. So my question is: How can I check if a boto3 S3.Client.upload_fileobj call succeeded and do some error handling if it failed?
I use a combination of head_object and wait_until_exists.
import boto3
from botocore.exceptions import ClientError, WaiterError
session = boto3.Session()
s3_client = session.client('s3')
s3_resource = session.resource('s3')
def upload_src(src, filename, bucketName):
    success = False
    try:
        bucket = s3_resource.Bucket(bucketName)
    except ClientError as e:
        bucket = None
    try:
        # In case filename already exists, get current etag to check if the
        # contents change after upload
        head = s3_client.head_object(Bucket=bucketName, Key=filename)
    except ClientError:
        etag = ''
    else:
        etag = head['ETag'].strip('"')
    try:
        s3_obj = bucket.Object(filename)
    except (ClientError, AttributeError):
        s3_obj = None
    try:
        s3_obj.upload_fileobj(src)
    except (ClientError, AttributeError):
        pass
    else:
        try:
            s3_obj.wait_until_exists(IfNoneMatch=etag)
        except WaiterError as e:
            pass
        else:
            head = s3_client.head_object(Bucket=bucketName, Key=filename)
            success = head['ContentLength']
    return success
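A hypothetical usage of upload_src above (the file path, object key, and bucket name are placeholders):
# Upload a local artifact and keep the local copy if the upload cannot be verified.
with open("results.bin", "rb") as src:
    size = upload_src(src, "results/results.bin", "my-output-bucket")
    if not size:
        print("upload could not be verified; keeping the local copy")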
There is a wait_until_exists() helper function that seems to be for this purpose in the boto3.resource object.
This is how we are using it:
s3_client.upload_fileobj(file, BUCKET_NAME, file_path)
s3_resource.Object(BUCKET_NAME, file_path).wait_until_exists()
I would recommend performing the following operations:
try:
    response = upload_fileobj()
except Exception as e:
    # save the contents to disk and log an error
if response is None:
    # poll every 10 s with head_object() to check whether the file uploaded successfully:
    # if head_object() returns a success response: break
    # if you get an error accessing the object: save the contents to disk and log an error
So, basically, poll using head_object().
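A runnable hedged sketch of that approach (the bucket, key, and local fallback path are placeholders; it treats any exception from upload_fileobj as a failure and then polls head_object to confirm the object is visible):
import time
import logging

import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")

def upload_with_check(fileobj, bucket, key, fallback_path, attempts=3, delay=10):
    try:
        s3.upload_fileobj(fileobj, bucket, key)
    except Exception:
        logging.exception("upload failed; saving contents locally")
        # rewind so we can persist what we tried to upload (assumes a seekable file object)
        fileobj.seek(0)
        with open(fallback_path, "wb") as f:
            f.write(fileobj.read())
        return False
    # Poll with head_object to confirm the object is visible.
    for _ in range(attempts):
        try:
            s3.head_object(Bucket=bucket, Key=key)
            return True
        except ClientError:
            time.sleep(delay)
    logging.error("object %s/%s not visible after upload", bucket, key)
    return False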
