Sharing a complex python object in memory between separate processes - python-3.x

I have a complex Python object, ~36 GB in memory, which I would like to share between multiple separate Python processes. It is stored on disk as a pickle file, which I currently load separately for every process. I want to share this object so I can run more processes in parallel within the available memory.
This object is used, in a sense, as a read-only database. Every process initiates multiple access requests per second, and every request is just for a small portion of the data.
I looked into solutions like Redis, but I saw that eventually the data needs to be serialized into a simple textual form. Also, memory-mapping the pickle file itself should not help, because every process would still need to unpickle it. So I thought about two other possible solutions:
Using shared memory, where every process can access the address at which the object is stored. The problem here is that a process would only see a bulk of bytes, which it cannot interpret as the object.
Writing a service that holds this object and manages retrieval of the data through API calls. Here, I wonder about the performance of such a solution in terms of speed.
Is there a simple way to implement either of these solutions? Perhaps there is a better solution for this situation?
Many thanks!

For complex objects there isn't a readily available method to directly share memory between processes. If you have simple ctypes objects you can use C-style shared memory, but it won't map directly to Python objects.
There is a simple solution that works well if you only need a portion of your data at any one time, not the entire 36 GB. For this you can use a SyncManager from multiprocessing.managers. Using this, you set up a server that serves a proxy class for your data (your data isn't stored in the class; the proxy only provides access to it). Your client then attaches to the server using a BaseManager and calls methods on the proxy class to retrieve the data.
Behind the scenes, the Manager classes take care of pickling the data you ask for and sending it through the open port from server to client. Because you're pickling data with every call, this isn't efficient if you need your entire dataset. In the case where you only need a small portion of the data in the client, the method saves a lot of time, since the data only needs to be loaded once, by the server.
The solution is comparable to a database speed-wise, but it can save you a lot of complexity and DB learning if you'd prefer to keep a purely Pythonic solution.
Here's some example code that is meant to work with GloVe word vectors.
Server
#!/usr/bin/python
import sys
from multiprocessing.managers import SyncManager
import numpy

# Global for storing the data to be served
gVectors = {}

# Proxy class to be shared with different processes.
# Don't put the big vector data in here, since that would force it to
# be piped to the other process when instantiated there. Instead, just
# return the global vector data, from this process, when requested.
class GloVeProxy(object):
    def __init__(self):
        pass

    def getNVectors(self):
        global gVectors
        return len(gVectors)

    def getEmpty(self):
        global gVectors
        return numpy.zeros_like(next(iter(gVectors.values())))

    def getVector(self, word, default=None):
        global gVectors
        return gVectors.get(word, default)

# Class to encapsulate the server functionality
class GloVeServer(object):
    def __init__(self, port, fname):
        self.port = port
        self.load(fname)

    # Load the vectors into gVectors (global)
    @staticmethod
    def load(filename):
        global gVectors
        with open(filename, 'r') as f:
            for line in f:
                vals = line.rstrip().split(' ')
                gVectors[vals[0]] = numpy.array(vals[1:]).astype('float32')

    # Run the server
    def run(self):
        class myManager(SyncManager):
            pass

        myManager.register('GloVeProxy', GloVeProxy)
        mgr = myManager(address=('', self.port), authkey=b'GloVeProxy01')
        server = mgr.get_server()
        server.serve_forever()

if __name__ == '__main__':
    port = 5010
    fname = '/mnt/raid/Data/Misc/GloVe/WikiGiga/glove.6B.50d.txt'
    print('Loading vector data')
    gs = GloVeServer(port, fname)
    print('Serving data. Press <ctrl>-c to stop.')
    gs.run()
Client
from multiprocessing.managers import BaseManager
import psutil  # 3rd-party module for process info (not strictly required)

# Grab the shared proxy class. All methods in that class will be available here
class GloVeClient(object):
    def __init__(self, port):
        assert self._checkForProcess('GloVeServer.py'), 'Must have GloVeServer running'

        class myManager(BaseManager):
            pass

        myManager.register('GloVeProxy')
        self.mgr = myManager(address=('localhost', port), authkey=b'GloVeProxy01')
        self.mgr.connect()
        self.glove = self.mgr.GloVeProxy()

    # Return the instance of the proxy class
    @staticmethod
    def getGloVe(port):
        return GloVeClient(port).glove

    # Verify the server is running
    @staticmethod
    def _checkForProcess(name):
        for proc in psutil.process_iter():
            if proc.name() == name:
                return True
        return False

if __name__ == '__main__':
    port = 5010
    glove = GloVeClient.getGloVe(port)
    for word in ['test', 'cat', '123456']:
        print('%s = %s' % (word, glove.getVector(word)))
Note that the psutil library is just used to check whether the server is running; it's not required. Be sure to name the server script GloVeServer.py, or change the psutil check in the code so it looks for the correct name.

Related

Flask requests are not thread safe when run with gunicorn [duplicate]

In my application, the state of a common object is changed by making requests, and the response depends on the state.
class SomeObj():
    def __init__(self, param):
        self.param = param

    def query(self):
        self.param += 1
        return self.param

global_obj = SomeObj(0)

@app.route('/')
def home():
    flash(global_obj.query())
    render_template('index.html')
If I run this on my development server, I expect to get 1, 2, 3 and so on. If requests are made from 100 different clients simultaneously, can something go wrong? The expected result would be that the 100 different clients each see a unique number from 1 to 100. Or will something like this happen:
Client 1 queries. self.param is incremented by 1.
Before the return statement can be executed, the thread switches over to client 2. self.param is incremented again.
The thread switches back to client 1, and the client is returned the number 2, say.
Now the thread moves to client 2 and returns him/her the number 3.
Since there were only two clients, the expected results were 1 and 2, not 2 and 3. A number was skipped.
Will this actually happen as I scale up my application? What alternatives to a global variable should I look at?
You can't use global variables to hold this sort of data. Not only is it not thread safe, it's not process safe, and WSGI servers in production spawn multiple processes. Not only would your counts be wrong if you were using threads to handle requests, they would also vary depending on which process handled the request.
Use a data source outside of Flask to hold global data. A database, memcached, or redis are all appropriate separate storage areas, depending on your needs. If you need to load and access Python data, consider multiprocessing.Manager. You could also use the session for simple data that is per-user.
The development server may run in a single thread and process. You won't see the behavior you describe, since each request is handled synchronously. Enable threads or processes and you will see it: app.run(threaded=True) or app.run(processes=10). (In Flask 1.0 the development server is threaded by default.)
Some WSGI servers may support gevent or another async worker. Global variables are still not thread safe because there's still no protection against most race conditions. You can still have a scenario where one worker gets a value, yields, another modifies it, yields, then the first worker also modifies it.
If you need to store some global data during a request, you may use Flask's g object. Another common case is some top-level object that manages database connections. The distinction for this type of "global" is that it's unique to each request, not used between requests, and there's something managing the set up and teardown of the resource.
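For illustration, a minimal sketch of that per-request pattern with Flask's g object; sqlite3 and the example.db file are used purely as stand-ins for whatever resource you manage:

import sqlite3
from flask import Flask, g

app = Flask(__name__)

def get_db():
    # g is unique to each request, so the connection is created lazily per request
    if "db" not in g:
        g.db = sqlite3.connect("example.db")  # illustrative database file
    return g.db

@app.teardown_appcontext
def close_db(exc):
    # Flask calls this when the application context ends; release the resource
    db = g.pop("db", None)
    if db is not None:
        db.close()

@app.route("/")
def index():
    value = get_db().execute("SELECT 1").fetchone()[0]
    return str(value)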
This is not really an answer to thread safety of globals.
But I think it is important to mention sessions here.
You are looking for a way to store client-specific data. Every connection should have access to its own pool of data, in a threadsafe way.
This is possible with server-side sessions, and they are available in a very neat flask plugin: https://pythonhosted.org/Flask-Session/
If you set up sessions, a session variable is available in all your routes and it behaves like a dictionary. The data stored in this dictionary is individual for each connecting client.
Here is a short demo:
from flask import Flask, session
from flask_session import Session

app = Flask(__name__)
# Check Configuration section for more details
SESSION_TYPE = 'filesystem'
app.config.from_object(__name__)
Session(app)

@app.route('/')
def reset():
    session["counter"] = 0
    return "counter was reset"

@app.route('/inc')
def routeA():
    if not "counter" in session:
        session["counter"] = 0
    session["counter"] += 1
    return "counter is {}".format(session["counter"])

@app.route('/dec')
def routeB():
    if not "counter" in session:
        session["counter"] = 0
    session["counter"] -= 1
    return "counter is {}".format(session["counter"])

if __name__ == '__main__':
    app.run()
After pip install Flask-Session, you should be able to run this. Try accessing it from different browsers, you'll see that the counter is not shared between them.
Another example of a data source external to requests is a cache, such as what's provided by Flask-Caching or another extension.
Create a file common.py and place in it the following:
from flask_caching import Cache
# Instantiate the cache
cache = Cache()
In the file where your flask app is created, register your cache with the following code:
# Import cache
from pathlib import Path

from common import cache

# ...
app = Flask(__name__)
cache.init_app(app=app, config={"CACHE_TYPE": "filesystem", "CACHE_DIR": Path("/tmp")})
Now use throughout your application by importing the cache and executing as follows:
# Import cache
from common import cache
# store a value
cache.set("my_value", 1_000_000)
# Get a value
my_value = cache.get("my_value")
While totally accepting the previous upvoted answers, and discouraging use of global variables for production and scalable Flask storage, for the purpose of prototyping or really simple servers, running under the flask 'development server'...
...
Python's built-in data types (and I have personally used and tested a global dict) are, per the Python documentation, thread safe. They are not process safe.
The insertions, lookups, and reads from such a (server global) dict will be OK from each (possibly concurrent) Flask session running under the development server.
When such a global dict is keyed with a unique Flask session key, it can be rather useful for server-side storage of session specific data otherwise not fitting into the cookie (max size 4 kB).
Of course, such a server-global dict should be carefully guarded against growing too large, since it lives in memory. Some sort of expiry of 'old' key/value pairs can be coded during request processing.
Again, it is not recommended for production or scalable deployments, but it is possibly OK for local task-oriented servers where a separate database is too much for the given task.
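As a rough illustration of that idea (not production code), here is a sketch of a server-global dict keyed by a per-client session key, with a simple time-based eviction during request handling; the route names, the expiry window, and the secret key are made up for the example:

import time
from uuid import uuid4

from flask import Flask, session

app = Flask(__name__)
app.secret_key = "dev-only-secret"  # required for the session cookie

SERVER_STORE = {}        # {session_key: (timestamp, payload)}
MAX_AGE_SECONDS = 3600   # illustrative expiry window

def evict_stale():
    # Drop entries older than MAX_AGE_SECONDS so the dict doesn't grow forever
    cutoff = time.time() - MAX_AGE_SECONDS
    for key in [k for k, (ts, _) in SERVER_STORE.items() if ts < cutoff]:
        SERVER_STORE.pop(key, None)

@app.route("/save/<value>")
def save(value):
    evict_stale()
    if "key" not in session:
        session["key"] = uuid4().hex  # unique key per connecting client
    SERVER_STORE[session["key"]] = (time.time(), value)
    return "stored"

@app.route("/load")
def load():
    entry = SERVER_STORE.get(session.get("key"))
    return entry[1] if entry else "nothing stored"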
...

How to run a threaded function that returns a variable?

Working with Python 3.6, what I’m looking to accomplish is to create a function that continuously scrapes dynamic/changing data from a webpage, while the rest of the script executes, and is able to reference the data returned from the continuous function.
I know this is likely a threading task, but I'm not super knowledgeable in it yet. Pseudo-code for what I have in mind looks something like this:

def continuous_scraper():
    # Pull data from webpage
    scraped_table = pd.read_html(url)
    return scraped_table

# start the continuous scraper function here, to run either indefinitely, or
# preferably stop after a predefined amount of time
scraped_table = thread(continuous_scraper)

# the rest of the script is run here, making use of the updating "scraped_table"
while True:
    print(scraped_table["Col_1"].iloc[0])
Here is a fairly simple example using some stock market page that seems to update every couple of seconds.
import threading, time
import pandas as pd

# A lock is used to ensure only one thread reads or writes the variable at any one time
scraped_table_lock = threading.Lock()

# Initially set to None so we know when its value has changed
scraped_table = None

# This bad-boy will be called only once in a separate thread
def continuous_scraper():
    # Tell Python this is a global variable, so it rebinds scraped_table
    # instead of creating a local variable that is also named scraped_table
    global scraped_table
    url = r"https://tradingeconomics.com/australia/stock-market"
    while True:
        # Pull data from webpage
        result = pd.read_html(url, match="Dow Jones")[0]
        # Acquire the lock to ensure thread-safety, then assign the new result
        # This is done after read_html returns so it doesn't hold the lock for so long
        with scraped_table_lock:
            scraped_table = result
        # You don't wanna flog the server, so wait 2 seconds after each
        # response before sending another request
        time.sleep(2)

# Make the thread daemonic, so the thread doesn't continue to run once the
# main script and any other non-daemonic threads have ended
scraper_thread = threading.Thread(target=continuous_scraper, daemon=True)

# start the continuous scraper function here, to run either indefinitely, or
# preferably stop after a predefined amount of time
scraper_thread.start()

# the rest of the script is run here, making use of the updating "scraped_table"
for _ in range(100):
    print("Time:", time.time())
    # Acquire the lock to ensure thread-safety
    with scraped_table_lock:
        # Check if it has been changed from the default value of None
        if scraped_table is not None:
            print(" ", scraped_table)
        else:
            print("scraped_table is None")
    # You probably don't wanna flog your stdout, either, dawg!
    time.sleep(0.5)
Be sure to read about multithreaded programming and thread safety. It's easy to make mistakes. If there is a bug, it often only manifests in rare and seemingly random occasions, making it difficult to debug.
I recommend looking into the multiprocessing library and its Pool class.
The docs have multiple examples of how to use it.
The question itself is too general for a simple answer.
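For reference, a minimal Pool sketch under the assumption that the scraping can be fanned out over several URLs; the fetch function and URLs are illustrative stand-ins, not part of the question's code:

from multiprocessing import Pool

def fetch(url):
    # Stand-in for the real work, e.g. pd.read_html(url)
    return "scraped " + url

if __name__ == "__main__":
    urls = ["https://example.com/a", "https://example.com/b", "https://example.com/c"]
    with Pool(processes=4) as pool:
        results = pool.map(fetch, urls)  # blocks until all workers finish
    print(results)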

How is it possible to execute python code during deserialization?

I was reading about pickling in the context of persisting instances, and ran across this snippet:
Pickle files can be hacked. If you receive a raw pickle file over the network, don't trust it! It could have malicious code in it, that would run arbitrary python when you try to de-pickle it. [1]
My understanding is that pickling turns a data-structure into an array of bytes, and the pickle library also contains methods to take a pickled byte array and rebuild a python instance from it.
I tested some code to see if simply putting code into the class or init method would run it:
import pickle

class A:
    print('class')

    def __init__(self):
        print('instance')

a = A()

print('pickling...')
with open('/home/usrname/Desktop/pfile', 'wb') as pfile:
    pickle.dump(a, pfile, pickle.HIGHEST_PROTOCOL)

print('de-pickling...')
with open('/home/usrname/Desktop/pfile', 'rb') as pfile:
    a2 = pickle.load(pfile)
However this only yields
class
instance
pickling...
de-pickling...
suggesting that the __init__ method doesn't actually get run when the instance is unpickled. So I'm still confused about how you would make code run during that process.
Really thorough writeup here: https://intoli.com/blog/dangerous-pickles/
From what I understand, it has to do with how pickles are interpreted and run by the Pickle Machine (PM). You can craft a pickle file that causes the PM to evaluate, via eval(), whatever statements you provide.
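Another well-documented way to demonstrate the same thing, beyond the eval-based payload described in that writeup, is to define __reduce__ on a class: pickle stores the callable and arguments that __reduce__ returns, and the unpickler calls them at load time. A harmless sketch:

import pickle

class Payload:
    def __reduce__(self):
        # pickle records (callable, args); the unpickler calls callable(*args),
        # so arbitrary code runs during loading. A malicious file could return
        # something like (os.system, ("some shell command",)) instead of print.
        return (print, ("code executed during unpickling!",))

data = pickle.dumps(Payload())
pickle.loads(data)  # prints the message even though we never call print ourselves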

How to properly memoize when using a ProcessPoolExecutor?

I suspect that something like:
@memoize
def foo():
    return something_expensive

def main():
    with ProcessPoolExecutor(10) as pool:
        futures = {pool.submit(foo, arg): arg for arg in args}
        for future in concurrent.futures.as_completed(futures):
            arg = futures[future]
            try:
                result = future.result()
            except Exception as e:
                sys.stderr.write("Failed to run foo() on {}\nGot {}\n".format(arg, e))
            else:
                print(result)
won't work (assuming @memoize is a typical dict-based cache), because I am using a multiprocessing pool and the processes don't share much. At least it doesn't seem to work.
What is the correct way to memoize in this scenario? Ultimately I'd also like to pickle the cache to disk and load it on subsequent runs.
You can use a Manager.dict from multiprocessing, which uses a Manager to proxy values between processes and store them in a shared dict that can be pickled. I decided to use multithreading instead, because it's an IO-bound app and the shared memory space of threads means I don't need all that manager machinery; I can just use a plain dict.
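A minimal sketch of the Manager.dict approach with ProcessPoolExecutor; foo, its arguments, and the cache filename are illustrative stand-ins for the question's expensive function:

import pickle
from concurrent.futures import ProcessPoolExecutor
from multiprocessing import Manager

def foo(arg, cache):
    # cache is a Manager.dict proxy, so all worker processes see the same data
    if arg in cache:
        return cache[arg]
    result = arg * arg  # stand-in for something expensive
    cache[arg] = result
    return result

if __name__ == "__main__":
    with Manager() as manager:
        cache = manager.dict()
        with ProcessPoolExecutor(4) as pool:
            futures = [pool.submit(foo, arg, cache) for arg in [1, 2, 3, 2, 1]]
            results = [f.result() for f in futures]
        # Persist the memoized values for subsequent runs
        with open("memo_cache.pkl", "wb") as fh:
            pickle.dump(cache.copy(), fh)
    print(results)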

Why do some widgets not update on Qt5?

I am trying to create a PyQt5 application, where I have used certain labels for displaying status variables. To update them, I have implemented custom pyqtSignal manually. However, on debugging I find that the value of GUI QLabel have changed but the values don't get reflected on the main window.
Some answers suggested calling QApplication().processEvents() occasionally. However, this instantly crashes or freezes the application.
Here's a sample code (all required libraries are imported, it's just the part creating problem, the actual code is huge):
from multiprocessing import Process

def sub(signal):
    i = 0
    while True:
        if i % 5 == 0:
            signal.update(i)

class CustomSignal(QObject):
    signal = pyqtSignal(int)

    def update(self, value):
        self.signal.emit(value)

class MainApp(QWidget):
    def __init__(self):
        super().__init__()
        self.label = QLabel("0")
        self.customSignal = CustomSignal()
        self.subp = Process(target=sub, args=(self.customSignal,))
        self.subp.start()
        self.customSignal.signal.connect(self.updateValue)

    def updateValue(self, value):
        print("old value", self.label.text())
        self.label.setText(str(value))
        print("new value", self.label.text())
The output of the print statements is as expected. However, the text in label does not change.
The update function in CustomSignal is called by some thread.
I've applied the same method to update progress bar which works fine.
Is there any other fix for this, other than processEvents()?
The OS is Ubuntu 16.04.
The key problem lies in the very concept behind the code.
Processes have their own address space and don't share data with other processes unless some inter-process communication mechanism is used. It looks like the multiprocessing module was used here instead of the threading module, presumably to avoid Python's GIL and speed up the program; however, a subprocess cannot access the data of its parent process.
I have tested two solutions to this case, and they seem to work.
threading module: Even though threading in Python is limited by the GIL, it's still sufficient for basic concurrency needs. Note the difference between concurrency and speedup.
QThread: Since you are using PyQt, there isn't any issue in using QThread, which is a better option because it uses the operating system's native threads through Qt rather than going through Python in the middle.
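For reference, a minimal sketch of the threading-based variant of the question's code; the layout setup and the one-second delay are just for the example:

import sys
import threading
import time

from PyQt5.QtCore import QObject, pyqtSignal
from PyQt5.QtWidgets import QApplication, QLabel, QVBoxLayout, QWidget

class CustomSignal(QObject):
    signal = pyqtSignal(int)

def worker(custom_signal):
    # Runs in a background thread; emitting the signal is safe because the
    # connected slot is invoked in the GUI thread via a queued connection.
    i = 0
    while True:
        if i % 5 == 0:
            custom_signal.signal.emit(i)
        i += 1
        time.sleep(1)

class MainApp(QWidget):
    def __init__(self):
        super().__init__()
        self.label = QLabel("0")
        layout = QVBoxLayout(self)
        layout.addWidget(self.label)
        self.customSignal = CustomSignal()
        self.customSignal.signal.connect(self.updateValue)
        threading.Thread(target=worker, args=(self.customSignal,), daemon=True).start()

    def updateValue(self, value):
        self.label.setText(str(value))

if __name__ == "__main__":
    app = QApplication(sys.argv)
    window = MainApp()
    window.show()
    sys.exit(app.exec_())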
Try adding
self.label.repaint()
immediately after updating the text, like this:
self.label.setText(str(value))
self.label.repaint()
