multiprocessing.Pool: cannot iterate over IMapIterator object in AWS Batch because of PicklingError - python-3.x

I need to request a huge bulk of data from an API endpoint, and I want to use multiprocessing (vs. multithreading, due to company framework limitations).
I have a multiprocessing.Pool with predefined concurrency CONCURRENCY in a class called Batcher. The class looks like this:
class Batcher:
    def __init__(self, concurrency: int = 8):
        self.concurrency = concurrency

    def _interprete_response_to_succ_or_err(self, resp: requests.Response) -> str:
        if isinstance(resp, str):
            if "Error:" in resp:
                return "dlq"
            else:
                return "err"
        if isinstance(resp, requests.Response):
            if resp.status_code == 200:
                return "succ"
            else:
                return "err"

    def _fetch_dat_data(self, id: str) -> requests.Response:
        try:
            resp = requests.get(API_ENDPOINT)
            return resp
        except Exception as e:
            return f"ID {id} -> Error: {str(e)}"

    def _dispatch_batch(self, batch: list) -> dict:
        pool = MPool(self.concurrency)
        results = pool.imap(self._fetch_dat_data, batch)
        pool.close()
        pool.join()
        return results

    def _run_batch(self, id):
        return self._dispatch_batch(id)

    def start(self, id_list: list):
        """ In real class, this function will create smaller
        batches from bigger chunks of data """
        results = self._run_batch(id_list)
        print(
            [
                res.text
                for res in results
                if self._interprete_response_to_succ_or_err(res) == "succ"
            ]
        )
This class is called in a file like this:
if __name__ == "__main__":
    """
    the source of ids is a csv file with single column in s3 that contains list
    of columns with single id per line
    """
    id_list = boto3_get_object_body(my_file_name).decode().split("\n")  # custom function, works
    batcher = Batcher()
    batcher.start(id_list)
This script is part of an AWS Batch job that is triggered via the CLI. The same function runs perfectly on my local machine with the same environment as in AWS Batch. It throws
_pickle.PicklingError: Can't pickle <class 'boto3.resources.factory.s3.ServiceResource'>: attribute lookup s3.ServiceResource on boto3.resources.factory failed
in the line where I try to iterate over the IMapIterator object results that is generated by pool.imap().
Relevant Traceback:
  for res in results
  File "/usr/local/lib/python3.9/multiprocessing/pool.py", line 870, in next
    raise value
  File "/usr/local/lib/python3.9/multiprocessing/pool.py", line 537, in _handle_tasks
    put(task)
  File "/usr/local/lib/python3.9/multiprocessing/connection.py", line 211, in send
    self._send_bytes(_ForkingPickler.dumps(obj))
  File "/usr/local/lib/python3.9/multiprocessing/reduction.py", line 51, in dumps
    cls(buf, protocol).dump(obj)
_pickle.PicklingError: Can't pickle <class 'boto3.resources.factory.s3.ServiceResource'>: attribute lookup s3.ServiceResource on boto3.resources.factory failed
I am wondering if I am missing something blatantly obvious, or if this issue is related to the EC2 instance spun up by the Batch job. I'd appreciate any kind of lead for root cause analysis.

This error happens because multiprocessing could not import the relevant datatype for duplicating data or calling the target function in the new process it started. This usually happens when an object needed by the target function is created someplace the child process does not know about (for example, a class created inside the if __name__ == ... block of the main module; see the sketch below), or when the object's __qualname__ attribute has been fiddled with (you might see this with something like functools.wraps, or with monkey-patching in general).
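A minimal, self-contained repro of the first failure mode (hypothetical names, not from the question's code):

import multiprocessing

def work(obj):
    return obj.value

def make_class():
    class Hidden:  # lives only inside make_class's scope, so it is not importable
        value = 42
    return Hidden

if __name__ == "__main__":
    obj = make_class()()
    with multiprocessing.Pool(2) as pool:
        # Raises _pickle.PicklingError: Can't pickle <class
        # '__main__.make_class.<locals>.Hidden'>: attribute lookup ... failed
        print(pool.map(work, [obj]))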
Therefore, to actually "fix" this, you need to dig into your code and check whether the above is true. A good place to start is the class that is raising the issue (in this case boto3.resources.factory.s3.ServiceResource): can you import it in the main module before the if __name__... block runs?
However, most of the time you can get away with simply reducing the data required to start the target function (less data = fewer chances for faults to occur). In this case, the target function you are submitting to the pool is an instance method. To start this function in a new process, multiprocessing needs to pickle all the instance attributes, which might have their own instance attributes, and so on. Not only does this add overhead, it is also possible that the problem lies in one particular instance attribute. Therefore, as good practice, if your target function can run independently but is currently an instance method, change it to a staticmethod instead.
In this case, that would mean changing _fetch_dat_data to a staticmethod and submitting it to the pool as type(self)._fetch_dat_data instead.
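A minimal sketch of that refactor, assuming _fetch_dat_data needs nothing from self (MPool and API_ENDPOINT as in the question); note it also consumes the imap iterator while the pool is still alive, which the original _dispatch_batch does not:

class Batcher:
    def __init__(self, concurrency: int = 8):
        self.concurrency = concurrency

    @staticmethod
    def _fetch_dat_data(id: str):
        # No reference to self, so the task no longer drags the whole
        # Batcher instance (and anything it holds) through the pickler.
        try:
            return requests.get(API_ENDPOINT)
        except Exception as e:
            return f"ID {id} -> Error: {str(e)}"

    def _dispatch_batch(self, batch: list) -> list:
        with MPool(self.concurrency) as pool:
            # Bind through the class and materialize results before the pool closes.
            return list(pool.imap(type(self)._fetch_dat_data, batch))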

Related

Python multiprocess: run several instances of a class, keep all child processes in memory

First, I'd like to thank the StackOverflow community for the tremendous help it provided me over the years, without me having to ask a single question.
I could not find anything that I can relate to my problem, though it is probably due to my lack of understanding of the subject, rather than the absence of a response on the website. My apologies in advance if this is a duplicate.
I am relatively new to multiprocessing; some time ago I succeeded in using multiprocessing pools in a very simple way, where I didn't need any feedback between the child processes.
Now I am facing a much more complicated problem, and I am just lost in the documentation about multiprocessing. I hence ask for your help, your kindness and your patience.
I am trying to build a parallel tempering monte-carlo algorithm, from a class.
The basic class very roughly goes as follows:
import numpy as np

class monte_carlo:
    def __init__(self):
        self.x = np.ones((1000, 3))
        self.E = np.mean(self.x)
        self.Elist = []

    def simulation(self, temperature):
        self.T = temperature
        for i in range(3000):
            self.MC_step()
            if i % 10 == 0:
                self.Elist.append(self.E)
        return

    def MC_step(self):
        x = self.x.copy()
        k = np.random.randint(1000)
        x[k] = (x[k] + np.random.uniform(-1, 1, 3))
        temp_E = np.mean(self.x)
        if np.random.random() < np.exp((self.E - temp_E) / self.T):
            self.E = temp_E
            self.x = x
        return
Obviously, I simplified a great deal (the actual class is 500 lines long!) and built fake functions for simplicity: __init__ takes a bunch of parameters as arguments, there are many more measurement lists besides self.Elist, and also many arrays derived from self.x that I use to compute them. The key point is that each instance of the class contains a lot of information that I want to keep in memory, and that I don't want to copy over and over again, to avoid dramatic slowdown. Otherwise I would just use the multiprocessing.pool module.
Now, the parallelization I want to do, in pseudo-code:
def proba(dE, pT):
    return np.exp(-dE/pT)

Tlist = [1.1, 1.2, 1.3]
N = len(Tlist)
G = []
for _ in range(N):
    G.append(monte_carlo())

for _ in range(5):
    for i in range(N):  # this loop should be run in multiprocess
        G[i].simulation(Tlist[i])
    for i in range(N//2):
        dE = G[i].E - G[i+1].E
        pT = G[i].T + G[i+1].T
        p = proba(dE, pT)  # (proba is a function, giving a probability depending on dE)
        if np.random.random() < p:
            T_temp = G[i].T
            G[i].T = G[i+1].T
            G[i+1].T = T_temp
Synthesis: I want to run several instances of my monte_carlo class in parallel child processes, with different values for a parameter T, then periodically pause everything to change the different T's, and resume the child processes/class instances from where they paused.
Doing this, I want each class-instance/child-process to stay independent from one another, save its current state with all internal variables while it is paused, and do as few copies as possible. This last point is critical, as the arrays inside the class are quite big (some are 1000x1000), and a copy will therefore very quickly become quite time-costly.
Thanks in advance, and sorry if I am not clear...
Edit:
I am using a distant machine with many (64) CPUs, running on Debian GNU/Linux 10 (buster).
Edit2:
I made a mistake in my original post: in the end, the temperatures must be exchanged between the class-instances, and not inside the global Tlist.
Edit3: Charchit answer works perfectly for the test code, on both my personal machine and the distant machine I am usually using for running my codes. I hence check this as the accepted answer.
However, I want to report here that, inserting the actual, more complicated code, instead of the oversimplified monte_carlo class, the distant machine gives me some strange errors:
Unable to init server: Could not connect: Connection refused
(CMC_temper_all.py:55509): Gtk-WARNING **: ##:##:##:###: Locale not supported by C library.
Using the fallback 'C' locale.
Unable to init server: Could not connect: Connection refused
(CMC_temper_all.py:55509): Gdk-CRITICAL **: ##:##:##:###:
gdk_cursor_new_for_display: assertion 'GDK_IS_DISPLAY (display)' failed
(CMC_temper_all.py:55509): Gdk-CRITICAL **: ##:##:##:###: gdk_cursor_new_for_display: assertion 'GDK_IS_DISPLAY (display)' failed
The "##:##:##:###" are (or seems like) IP adresses.
Without the call to set_start_method('spawn') this error shows only once, in the very beginning, while when I use this method, it seems to show at every occurrence of result.get()...
The strangest thing is that the code seems otherwise to work fine, does not crash, produces the datafiles I then ask it to, etc...
I think this would deserve to publish a new question, but I put it here nonetheless in case someone has a quick answer.
If not, I will resort to adding, one by one, the variables, methods, etc. that are present in my actual code but not in the test example, to try to find the origin of the bug. My best guess for now is that the memory space required by each child process with the actual code is too large for the distant machine to accept, due to some restrictions implemented by the admin.
What you are looking for is sharing state between processes. As per the documentation, you can either create shared memory, which is restrictive about the data it can store and is not thread-safe, but offers better speed and performance; or you can use server processes through managers. The latter is what we are going to use, since you want to share whole objects of user-defined datatypes. Keep in mind that using managers will impact the speed of your code, depending on the complexity of the arguments that you pass to, and receive from, the managed objects.
Managers, proxies and pickling
As mentioned, managers create server processes to store objects and allow access to them through proxies. I have answered a question with better details on how they work, and how to create a suitable proxy, here. We are going to use the same proxy defined in the linked answer, with some variations: namely, I have replaced the factory functions inside __getattr__ with something that can be pickled using pickle. This means that you can run instance methods of managed objects created with this proxy without resorting to using multiprocess. The result is this modified proxy:
from multiprocessing.managers import NamespaceProxy, BaseManager
import types
import numpy as np

class A:
    def __init__(self, name, method):
        self.name = name
        self.method = method

    def get(self, *args, **kwargs):
        return self.method(self.name, args, kwargs)

class ObjProxy(NamespaceProxy):
    """Returns a proxy instance for any user-defined data-type. The proxy instance will have the namespace and
    functions of the data-type (except private/protected callables/attributes). Furthermore, the proxy will be
    picklable and its state can be shared among different processes."""

    def __getattr__(self, name):
        result = super().__getattr__(name)
        if isinstance(result, types.MethodType):
            return A(name, self._callmethod).get
        return result
Solution
Now we only need to make sure that when we are creating objects of monte_carlo, we do so using managers and the above proxy. For that, we create a class constructor called create. All objects for monte_carlo should be created with this function. With that, the final code looks like this:
from multiprocessing import Pool
from multiprocessing.managers import NamespaceProxy, BaseManager
import types
import numpy as np

class A:
    def __init__(self, name, method):
        self.name = name
        self.method = method

    def get(self, *args, **kwargs):
        return self.method(self.name, args, kwargs)

class ObjProxy(NamespaceProxy):
    """Returns a proxy instance for any user-defined data-type. The proxy instance will have the namespace and
    functions of the data-type (except private/protected callables/attributes). Furthermore, the proxy will be
    picklable and its state can be shared among different processes."""

    def __getattr__(self, name):
        result = super().__getattr__(name)
        if isinstance(result, types.MethodType):
            return A(name, self._callmethod).get
        return result

class monte_carlo:
    def __init__(self):
        self.x = np.ones((1000, 3))
        self.E = np.mean(self.x)
        self.Elist = []
        self.T = None

    def simulation(self, temperature):
        self.T = temperature
        for i in range(3000):
            self.MC_step()
            if i % 10 == 0:
                self.Elist.append(self.E)
        return

    def MC_step(self):
        x = self.x.copy()
        k = np.random.randint(1000)
        x[k] = (x[k] + np.random.uniform(-1, 1, 3))
        temp_E = np.mean(self.x)
        if np.random.random() < np.exp((self.E - temp_E) / self.T):
            self.E = temp_E
            self.x = x
        return

    @classmethod
    def create(cls, *args, **kwargs):
        # Register class
        class_str = cls.__name__
        BaseManager.register(class_str, cls, ObjProxy, exposed=tuple(dir(cls)))
        # Start a manager process
        manager = BaseManager()
        manager.start()
        # Create and return this proxy instance. Using this proxy allows sharing of state between processes.
        inst = eval("manager.{}(*args, **kwargs)".format(class_str))
        return inst

def proba(dE, pT):
    return np.exp(-dE/pT)

if __name__ == "__main__":
    Tlist = [1.1, 1.2, 1.3]
    N = len(Tlist)
    G = []

    # Create our managed instances
    for _ in range(N):
        G.append(monte_carlo.create())

    for _ in range(5):
        # Run simulations in the manager server
        results = []
        with Pool(8) as pool:
            for i in range(N):  # this loop should be run in multiprocess
                results.append(pool.apply_async(G[i].simulation, (Tlist[i], )))

            # Wait for the simulations to complete
            for result in results:
                result.get()

        for i in range(N // 2):
            dE = G[i].E - G[i + 1].E
            pT = G[i].T + G[i + 1].T
            p = proba(dE, pT)  # (proba is a function, giving a probability depending on dE)
            if np.random.random() < p:
                T_temp = Tlist[i]
                Tlist[i] = Tlist[i + 1]
                Tlist[i + 1] = T_temp

    print(Tlist)
This meets the criteria you wanted. It does not create any copies of the instances at all; rather, all arguments to the simulation method call are serialized inside the pool and sent to the manager server, where the object is actually stored. The method is executed there, and the results (if any) are serialized and returned to the main process. All of this using only the builtins!
Output
[1.2, 1.1, 1.3]
Edit
Since you are using Linux, I encourage you to use multiprocessing.set_start_method inside the if __name__ ... clause to set the start method to "spawn". Doing this will ensure that the child processes do not have access to variables defined inside the clause.
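A minimal sketch of that change (only the set_start_method call is new; the rest of the main block stays as in the solution above):

from multiprocessing import Pool, set_start_method

if __name__ == "__main__":
    # "spawn" starts each child with a fresh interpreter, so children do not
    # inherit objects defined inside this block (unlike the default "fork" on Linux).
    set_start_method("spawn")
    Tlist = [1.1, 1.2, 1.3]
    # ... create the managed instances and run the pool exactly as above ...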

How to log the return value of a POST method after returning the response?

I'm working on my first ever REST API, so apologies in advance if I've missed something basic. I have a function that takes a JSON request from another server, processes it (makes a prediction based on the data), and returns another JSON with the results. I'd like to keep a log on the server's local disk of all requests to this endpoint along with their results, for evaluation purposes and for retraining the model. However, to minimise the latency of returning the result to the user, I'd like to return the response data first, and only then write it to the local disk. It's not obvious to me how to do this properly, as the FastAPI paradigm necessitates that the result of a POST method be the return value of the decorated function, so anything I want to do with the data has to be done before it is returned.
Below is a minimal working example of what I think is my closest attempt at getting it right so far, using a custom object with a log decorator - my idea was just to assign the result to the log object as a class attribute, then use another method to write it to disk, but I can't figure out how to make sure that that function gets called after get_data every time.
import json
import uvicorn
from fastapi import FastAPI, Request
from functools import wraps
from pydantic import BaseModel

class Blob(BaseModel):
    id: int
    x: float

def crunch_numbers(data: Blob) -> dict:
    # does some stuff
    return {'foo': 'bar'}

class PostResponseLogger:
    def __init__(self) -> None:
        self.post_result = None

    def log(self, func, *args, **kwargs):
        @wraps(func)
        def func_to_log(*args, **kwargs):
            post_result = func(*args, **kwargs)
            self.post_result = post_result
            # how can this be done outside of this function ???
            self.write_data()
            return post_result
        return func_to_log

    def write_data(self):
        if self.post_result:
            with open('output.json', 'w') as f:
                json.dump(self.post_result, f)

def main():
    app = FastAPI()
    logger = PostResponseLogger()

    @app.post('/get_data/')
    @logger.log
    def get_data(input_json: dict, request: Request):
        result = crunch_numbers(input_json)
        return result

    uvicorn.run(app=app)

if __name__ == '__main__':
    main()
Basically, my question boils down to: "is there a way, in the PostResponseLogger class, to automatically call self.write_data after every call to self.log?", but if I'm using the wrong approach altogether, any other suggestions are also welcome.
You could have a Background Task for that purpose. A background task "will run only once the response has been sent" (see Starlette documentation). "This is useful for operations that need to happen after a request, but that the client doesn't really have to be waiting for the operation to complete before receiving the response" (see FastAPI documentation).
You can define a task function to run in the background for writing the log data, as shown below:
def write_log_data():
    logger.write_data()
Then, import BackgroundTasks and define a parameter in your endpoint with a type declaration of BackgroundTasks. Inside of your endpoint, pass your task function (i.e., write_log_data, as defined above) to the background_tasks object with the method .add_task():
from fastapi import BackgroundTasks

@app.post('/get_data/')
@logger.log
def get_data(input_json: dict, request: Request, background_tasks: BackgroundTasks):
    result = crunch_numbers(input_json)
    background_tasks.add_task(write_log_data)
    return result
The same principle could be applied if a middleware was used to capture and log the response data, as described in this answer, or a custom APIRoute class, as demonstrated in this answer.
For future reference, if you (or anyone) ever need to use async/await syntax, and run into concurrency issues (such as the event loop getting blocked) while performing some heavy background computation, please have a look at this answer, which explains the difference between defining an endpoint or a background task function with async def and def (briefly, async def endpoints/background tasks will run in the event loop, whereas def functions will run in an external threadpool that is then awaited), as well as provides solutions when it comes to running blocking I/O-bound or CPU-bound operations in such functions.
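As an illustration of that distinction, here is a hedged sketch of the same logging task written as an async def background task; it assumes the third-party aiofiles package for non-blocking file I/O (not part of the question's code):

import json
import aiofiles  # assumed dependency for async file I/O
from fastapi import BackgroundTasks, FastAPI

app = FastAPI()

async def write_log_data_async(result: dict):
    # Runs in the event loop after the response is sent; awaiting the
    # file write yields control instead of blocking other requests.
    async with aiofiles.open('output.json', 'w') as f:
        await f.write(json.dumps(result))

@app.post('/get_data/')
async def get_data(input_json: dict, background_tasks: BackgroundTasks):
    result = {'foo': 'bar'}  # stand-in for crunch_numbers(input_json)
    background_tasks.add_task(write_log_data_async, result)
    return result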

subclassing a message to add additional behavior

Not sure why this isn't working, I want to subclass a message and add additional behavior:
import data_pb2 as pb2

class Status(pb2.Status):
    def __init__(self, streamer, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.streamer = streamer

    def __setattr__(self, key, value):
        super().__setattr__(key, value)
        self.streamer.send_update()
When someone changes the pb2.Status message I want send_update to be called.
This is the unhelpful error message I'm getting:
Traceback (most recent call last):
  File "server.py", line 62, in <module>
    class Status(pb2.Status):
  File "C:\AppData\Local\conda\conda\envs\lib\site-packages\google\protobuf\internal\python_message.py", line 126, in __new__
    descriptor = dictionary[GeneratedProtocolMessageType._DESCRIPTOR_KEY]
KeyError: 'DESCRIPTOR'
Just discovered the unfortunate truth that we're not meant to extend the message classes:
https://developers.google.com/protocol-buffers/docs/pythontutorial
Protocol Buffers and O-O Design Protocol buffer classes are basically dumb data holders (like structs in C); they don't make good first class citizens in an object model. If you want to add richer behaviour to a generated class, the best way to do this is to wrap the generated protocol buffer class in an application-specific class. Wrapping protocol buffers is also a good idea if you don't have control over the design of the .proto file (if, say, you're reusing one from another project). In that case, you can use the wrapper class to craft an interface better suited to the unique environment of your application: hiding some data and methods, exposing convenience functions, etc. You should never add behaviour to the generated classes by inheriting from them. This will break internal mechanisms and is not good object-oriented practice anyway.
I've come up with a solution that works. When the message is updated, I have a threading event whose set method gets called.
import threading

class Status:
    def __init__(self, *args, **kwargs):
        self.status = pb2.Status(*args, **kwargs)
        self.event = None

    def __setattr__(self, key, value):
        if key == 'status' or key == 'event':
            super().__setattr__(key, value)
        else:
            super().__getattribute__('status').__setattr__(key, value)
            super().__getattribute__('event').set()

    def __getattr__(self, item):
        if item == 'event' or item == 'status':
            return super().__getattribute__(item)
        else:
            return super().__getattribute__('status').__getattribute__(item)

event = threading.Event()
status = Status(version="1")
status_streamer = StatusStreamer(status, event)
status.event = event

# this triggers set() to be called inside __setattr__, which results in the threads in StatusStreamer streaming the update
status.version = str(int(status.version) + 1)
It's a bit hacky, but since we cannot subclass the message, this is acceptable. status is the message and event is the threading event; when those two attributes are assigned, they don't trigger the event being set. However, when any other attribute is assigned, that triggers .set(), which yields the update to the clients.

How to have more than one handler in AWS Lambda Function?

I have a very large Python file that consists of multiple defined functions. If you're familiar with AWS Lambda, when you create a Lambda function you specify a handler, which is a function in the code that AWS Lambda can invoke when the service executes my code, as represented below in the my_handler.py file:
def handler_name(event, context):
    ...
    return some_value
Link Source: https://docs.aws.amazon.com/lambda/latest/dg/python-programming-model-handler-types.html
However, as I mentioned above, I have multiple defined functions in my_handler.py that have their own events and contexts. Therefore, this will result in an error. Are there any ways around this in python3.6?
Your single handler function will need to be responsible for parsing the incoming event, and determining the appropriate route to take. For example, let's say your other functions are called helper1 and helper2. Your Lambda handler function will inspect the incoming event and then, based on one of the fields in the incoming event (ie. let's call it EventType), call either helper1 or helper2, passing in both the event and context objects.
def handler_name(event, context):
    if event['EventType'] == 'helper1':
        helper1(event, context)
    elif event['EventType'] == 'helper2':
        helper2(event, context)

def helper1(event, context):
    pass

def helper2(event, context):
    pass
This is only pseudo-code, and I haven't tested it myself, but it should get the concept across.
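As a quick usage sketch (hypothetical event payloads, invoked locally rather than by the Lambda service):

# Each event carries the routing field the handler inspects:
handler_name({'EventType': 'helper1', 'payload': {'x': 1}}, None)  # dispatches to helper1
handler_name({'EventType': 'helper2', 'payload': {'x': 2}}, None)  # dispatches to helper2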
Little late to the game, but I thought it wouldn't hurt to share. Best practices suggest that one separate the handler from the Lambda's core logic. Not only is it okay to add additional definitions, it can lead to more legible code and reduce waste, e.g. multiple API calls to S3. So, although it can get out of hand, I disagree with some of those critiques of your initial question. It's effective to use your handler as a logical interface to the additional functions that will accomplish your various work. In data architecture and engineering land, it's often less costly and more efficient to work in this manner. Particularly if you are building out ETL pipelines following service-oriented architectural patterns. Admittedly, I'm a bit of a maverick and some may find this unruly/egregious, but I've gone so far as to build classes into my Lambdas for various reasons (e.g. centralized, data-lake-ish S3 buckets that accommodate a variety of file types, reduce unnecessary requests, etc.), and I stand by it. Here's an example of one of my handler files from a CDK example project I put on the hub a while back. Hopefully it'll give you some useful ideas, or at the very least not make you feel alone in wanting to beef up your Lambdas.
import requests
import json
from requests.exceptions import Timeout
from requests.exceptions import HTTPError
from botocore.exceptions import ClientError
from base64 import b64decode
from datetime import date
import csv
import os
import boto3
import logging

logger = logging.getLogger()
logger.setLevel(logging.DEBUG)

class Asteroids:
    """Client to NASA API and execution interface to branch data processing by file type.

    Notes:
        This class doesn't look like a normal class. It is a simple example of how one might
        work around AWS Lambda's limitations on class use in handlers. It also allows for
        better organization of code to simplify this example. If one planned to add
        other NASA endpoints or process larger amounts of Asteroid data for both .csv and .json formats,
        asteroids_json and asteroids_csv should be modularized and divided into separate lambdas
        where stepfunction orchestration is implemented for a more comprehensive workflow.
        However, for the sake of this demo I'm keeping it lean and easy.
    """

    def execute(self, format):
        """Serves as interface to assign class attributes and execute class methods.

        Raises:
            Exception: If file format is not of .json or .csv file types.

        Notes:
            Have fun!
        """
        self.file_format = format
        self.today = date.today().strftime('%Y-%m-%d')
        # method call below used when Secrets Manager integrated. See get_secret.__doc__ for more.
        # self.api_key = get_secret('nasa_api_key')
        self.api_key = os.environ["NASA_KEY"]
        self.endpoint = f"https://api.nasa.gov/neo/rest/v1/feed?start_date={self.today}&end_date={self.today}&api_key={self.api_key}"
        self.response_object = self.nasa_client(self.endpoint)
        self.processed_response = self.process_asteroids(self.response_object)
        if self.file_format == "json":
            self.asteroids_json(self.processed_response)
        elif self.file_format == "csv":
            self.asteroids_csv(self.processed_response)
        else:
            raise Exception("FILE FORMAT NOT RECOGNIZED")
        self.write_to_s3()

    def nasa_client(self, endpoint):
        """Client component for API call to NASA endpoint.

        Args:
            endpoint (str): Parameterized url for API call.

        Raises:
            Timeout: If connection not made in 5s and/or data not retrieved in 15s.
            HTTPError & Exception: Self-explanatory.

        Notes:
            See CloudWatch logs for debugging.
        """
        try:
            response = requests.get(endpoint, timeout=(5, 15))
        except Timeout as timeout:
            print(f"NASA GET request timed out: {timeout}")
        except HTTPError as http_err:
            print(f"HTTP error occurred: {http_err}")
        except Exception as err:
            print(f'Other error occurred: {err}')
        else:
            return json.loads(response.content)

    def process_asteroids(self, payload):
        """Process old, and create new, data object with content from response.

        Args:
            payload (b'str'): Binary string of asteroid data to be processed.
        """
        near_earth_objects = payload["near_earth_objects"][f"{self.today}"]
        asteroids = []
        for neo in near_earth_objects:
            asteroid_object = {
                "id": neo['id'],
                "name": neo['name'],
                "hazard_potential": neo['is_potentially_hazardous_asteroid'],
                "est_diameter_min_ft": neo['estimated_diameter']['feet']['estimated_diameter_min'],
                "est_diameter_max_ft": neo['estimated_diameter']['feet']['estimated_diameter_max'],
                "miss_distance_miles": [item['miss_distance']['miles'] for item in neo['close_approach_data']],
                "close_approach_exact_time": [item['close_approach_date_full'] for item in neo['close_approach_data']]
            }
            asteroids.append(asteroid_object)
        return asteroids

    def asteroids_json(self, payload):
        """Creates json object from payload content then writes to .json file.

        Args:
            payload (b'str'): Binary string of asteroid data to be processed.
        """
        json_file = open(f"/tmp/asteroids_{self.today}.json", 'w')
        json_file.write(json.dumps(payload, indent=4))
        json_file.close()

    def asteroids_csv(self, payload):
        """Creates .csv object from payload content then writes to .csv file."""
        csv_file = open(f"/tmp/asteroids_{self.today}.csv", 'w', newline='\n')
        fields = list(payload[0].keys())
        writer = csv.DictWriter(csv_file, fieldnames=fields)
        writer.writeheader()
        writer.writerows(payload)
        csv_file.close()

    def get_secret(self):
        """Gets secret from AWS Secrets Manager.

        Notes:
            Have yet to integrate into the CDK. Leaving as example code.
        """
        secret_name = os.environ['TOKEN_SECRET_NAME']
        region_name = os.environ['REGION']
        session = boto3.session.Session()
        client = session.client(service_name='secretsmanager', region_name=region_name)
        try:
            get_secret_value_response = client.get_secret_value(SecretId=secret_name)
        except ClientError as e:
            raise e
        else:
            if 'SecretString' in get_secret_value_response:
                secret = get_secret_value_response['SecretString']
            else:
                secret = b64decode(get_secret_value_response['SecretBinary'])
            return secret

    def write_to_s3(self):
        """Uploads both .json and .csv files to s3."""
        s3 = boto3.client('s3')
        s3.upload_file(f"/tmp/asteroids_{self.today}.{self.file_format}", os.environ['S3_BUCKET'], f"asteroid_data/asteroids_{self.today}.{self.file_format}")

def handler(event, context):
    """Instantiates class and triggers execution method.

    Args:
        event (dict): Lists a custom dict that determines interface control flow--i.e. `csv` or `json`.
        context (obj): Provides methods and properties that contain invocation, function and
            execution environment information.
            *Not used herein.
    """
    asteroids = Asteroids()
    asteroids.execute(event)

Wrapping all possible method calls of a class in a try/except block

I'm trying to wrap all methods of an existing Class (not of my creation) into a try/except suite. It could be any Class, but I'll use the pandas.DataFrame class here as a practical example.
So if the invoked method succeeds, we simply move on. But if it should generate an exception, it is appended to a list for later inspection/discovery (although the below example just issues a print statement for simplicity).
(Note that the kinds of data-related exceptions that can occur when a method on the instance is invoked, isn't yet known; and that's the reason for this exercise: discovery).
This post was quite helpful (particularly @martineau's Python 3 answer), but I'm having trouble adapting it. Below, I expected the second call to the (wrapped) info() method to emit print output but, sadly, it doesn't.
#!/usr/bin/env python3
import functools, sys, types, pandas

def method_wrapper(method):
    @functools.wraps(method)
    def wrapper(*args, **kwargs):  # Note: args[0] points to 'self'.
        try:
            print('Calling: {}.{}()... '.format(args[0].__class__.__name__,
                                                method.__name__))
            return method(*args, **kwargs)
        except Exception:
            print('Exception: %r' % sys.exc_info())  # Something trivial.
            # <Actual code would append that exception info to a list>.
    return wrapper

class MetaClass(type):
    def __new__(mcs, class_name, base_classes, classDict):
        newClassDict = {}
        for attributeName, attribute in classDict.items():
            if type(attribute) == types.FunctionType:  # Replace it with a
                attribute = method_wrapper(attribute)  # decorated version.
            newClassDict[attributeName] = attribute
        return type.__new__(mcs, class_name, base_classes, newClassDict)

class WrappedDataFrame2(MetaClass('WrappedDataFrame',
                                  (pandas.DataFrame, object,), {}),
                        metaclass=type):
    pass

print('Unwrapped pandas.DataFrame().info():')
pandas.DataFrame().info()

print('\n\nWrapped pandas.DataFrame().info():')
WrappedDataFrame2().info()
print()
This outputs:
Unwrapped pandas.DataFrame().info():
<class 'pandas.core.frame.DataFrame'>
Index: 0 entries
Empty DataFrame
Wrapped pandas.DataFrame().info(): <-- Missing print statement after this line.
<class '__main__.WrappedDataFrame2'>
Index: 0 entries
Empty WrappedDataFrame2
In summary,...
>>> unwrapped_object.someMethod(...)
# Should be mirrored by ...
>>> wrapping_object.someMethod(...)
# Including signature, docstring, etc. (i.e. all attributes); except that it
# executes inside a try/except suite (so I can catch exceptions generically).
long time no see. ;-) In fact it's been such a long time you may no longer care, but in case you (or others) do...
Here's something I think will do what you want. I've never answered your question before now because I don't have pandas installed on my system. However, today I decided to see if there was a workaround for not having it and created a trivial dummy module to mock it (only as far as I needed). Here's the only thing in it:
mockpandas.py:

""" Fake pandas module. """

class DataFrame:
    def info(self):
        print('pandas.DataFrame.info() called')
        raise RuntimeError('Exception raised')
Below is code that seems to do what you need by implementing @Blckknght's suggestion of iterating through the MRO (while ignoring the limitations noted in his answer that could arise from doing it that way). It ain't pretty, but, as I said, it seems to work with at least the mocked pandas library I created.
import functools
import mockpandas as pandas  # mock the library
import sys
import traceback
import types

def method_wrapper(method):
    @functools.wraps(method)
    def wrapper(*args, **kwargs):  # Note: args[0] points to 'self'.
        try:
            print('Calling: {}.{}()... '.format(args[0].__class__.__name__,
                                                method.__name__))
            return method(*args, **kwargs)
        except Exception:
            print('An exception occurred in the wrapped method {}.{}()'.format(
                args[0].__class__.__name__, method.__name__))
            traceback.print_exc(file=sys.stdout)
            # (Actual code would append that exception info to a list)
    return wrapper

class MetaClass(type):
    def __new__(meta, class_name, base_classes, classDict):
        """ See if any of the base classes were created by the with_metaclass() function. """
        marker = None
        for base in base_classes:
            if hasattr(base, '_marker'):
                marker = getattr(base, '_marker')  # remember class name of temp base class
                break  # quit looking

        if class_name == marker:  # temporary base class being created by with_metaclass()?
            return type.__new__(meta, class_name, base_classes, classDict)

        # Temporarily create an unmodified version of class so its MRO can be used below.
        TempClass = type.__new__(meta, 'TempClass', base_classes, classDict)

        newClassDict = {}
        for cls in TempClass.mro():
            for attributeName, attribute in cls.__dict__.items():
                if isinstance(attribute, types.FunctionType):
                    # Convert it to a decorated version.
                    attribute = method_wrapper(attribute)
                newClassDict[attributeName] = attribute

        return type.__new__(meta, class_name, base_classes, newClassDict)

def with_metaclass(meta, classname, bases):
    """ Create a class with the supplied bases and metaclass, that has been tagged with a
        special '_marker' attribute.
    """
    return type.__new__(meta, classname, bases, {'_marker': classname})

class WrappedDataFrame2(
        with_metaclass(MetaClass, 'WrappedDataFrame', (pandas.DataFrame, object))):
    pass

print('Unwrapped pandas.DataFrame().info():')
try:
    pandas.DataFrame().info()
except RuntimeError:
    print('  RuntimeError exception was raised as expected')

print('\n\nWrapped pandas.DataFrame().info():')
WrappedDataFrame2().info()
Output:
Unwrapped pandas.DataFrame().info():
pandas.DataFrame.info() called
RuntimeError exception was raised as expected
Wrapped pandas.DataFrame().info():
Calling: WrappedDataFrame2.info()...
pandas.DataFrame.info() called
An exception occurred in the wrapped method WrappedDataFrame2.info()
Traceback (most recent call last):
File "test.py", line 16, in wrapper
return method(*args, **kwargs)
File "mockpandas.py", line 9, in info
raise RuntimeError('Exception raised')
RuntimeError: Exception raised
As the above illustrates, the method_wrapper() decorated version is being used by the methods of the wrapped class.
Your metaclass only applies your decorator to the methods defined in classes that are instances of it. It doesn't decorate inherited methods, since they're not in the classDict.
I'm not sure there's a good way to make it work. You could try iterating through the MRO and wrapping all the inherited methods as well as your own, but I suspect you'd get into trouble if there were multiple levels of inheritance after you start using MetaClass (as each level will decorate the already-decorated methods of the previous class). One way to guard against that is sketched below.
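One possible guard (an untested sketch, not from either answer): tag each wrapper so the metaclass skips functions that are already decorated, avoiding re-wrapping across multiple inheritance levels:

def method_wrapper(method):
    # Skip functions that have already been wrapped (relevant when the
    # MRO walk, or a subclass using MetaClass, sees them a second time).
    if getattr(method, '_is_wrapped', False):
        return method

    @functools.wraps(method)
    def wrapper(*args, **kwargs):
        try:
            return method(*args, **kwargs)
        except Exception:
            traceback.print_exc(file=sys.stdout)

    wrapper._is_wrapped = True
    return wrapper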
