How to execute parallel queries with PyGreSQL? - python-3.x

I am trying to run multiple queries in parallel with PyGreSQL and multiprocessing, but below code hangs without returning:
from pg import DB
from multiprocessing import Pool
from functools import partial
def create_query(table_name):
return f"""create table {table_name} (id integer);
CREATE INDEX ON {table_name} USING BTREE (id);"""
my_queries = [ create_query('foo'), create_query('bar'), create_query('baz') ]
def execute_query(conn_string, query):
con = DB(conn_string)
con.query(query)
con.close()
rs_conn_string = "host=localhost port=5432 dbname=postgres user=postgres password="
pool = Pool(processes=len(my_queries))
pool.map(partial(execute_query,rs_conn_string), my_queries)
Is there any way to make it work? Also is it possible make the 3 running queries in same "transaction" in case one query fails and the other get rolled back?

One obvious problem is that you always run the pool.map, not only in the main process, but also when the interpreters used in the parallel sub-processes import the script. You should do something like this instead:
def run_all():
with Pool(processes=len(my_queries)) as pool:
pool.map(partial(execute_query,rs_conn_string), my_queries)
if __name__ == '__main__':
run_all()
Regarding your second question, that's not possible since the transaction are per connection, which live in separate processes if you do it like that.
Asynchronous command processing might be what you want, but it is not yet supported by PyGreSQL. Psygopg + aiopg is probably better suited for doing things like that.

PyGreSql added async with the connection.poll() method. As far as pooling, I like to override MySQL.connectors pooling wrappers to handle pgdb connection objects. There’s a few ‘optional’ connection method calls that will fail that you have to comment out (I.e. checking connection status, etc. these can be implemented on the Pgdb connection object level if you want them, but the calls don’t match MySQL.connectors api interface). There’s probably some low-level bugs associated as the libs are only abstracted similarly, but this solution has been running in prod for a few months now without any problems.

Related

how to use Flask with multiprocessing

Concretely, I'm using Flask to process a request, pseudocode like this:
from flask import Flask, request
app = Flask(__name__)
#app.route("/foo", methods=["POST"])
def foo():
data = request.get_json() # {"request_id": "abc", "data": "some text"}
result_a = do_task_a(data) # returns {"result_a": "a"}, maybe about 1 second to finish
result_b = do_task_b(data) # returns {"result_b": "b"}, maybe about 1 second to finish
result_c = do_task_c(data) # returns {"result_c": "c"}, maybe about 1 second to finish
result = {
"result_a": result_a["result_a"],
"result_b": result_b["result_b"],
"result_c": result_c["result_c"]}
return result
app.run(host='0.0.0.0', port=4000, threaded=False)
Here, do_task_a, do_task_b, do_task_c are completely independent subtasks, I know I can use multiprocessing.Process to create processes to finish these three subtasks, and use join() to wait for subtask done, But I don't know it's proper way to create Process for every request?
Maybe I can use multiprocessing.Queue to help, But I don't find a good way.
I search for multiprocessing, but can't figure out a good solution.
I'm not a python guy, but indeed creating processes is sn expensive operation
If its possible - create threads they're cheaper than processes.
If you run the request multiple times - you can do even better than that, because creating threads per request is still quite expensive
Even more advanced setup is to create a "pre-loaded" thread pool. Like N threads that you always keep in memory ready for running arriving task.
In terms of technical solution I've found This article that explains how to create thread pools in python 3.2+

How do I retrieve data from a Django DB before sending off Celery task to remote worker?

I have a celery shared_task that is scheduled to run at certain intervals. Every time this task is run, it needs to first retrieve data from the Django DB in order to complete the calculation. This task may or may not be sent to a celery worker that is on a separate machine, so in the celery task I can't make any queries to a local celery database.
So far I have tried using signals to accomplish it, since I know that functions with the wrapper #before_task_publish are executed before the task is even published in the message queue. However, I don't know how I can actually get the data to the task.
#shared_task
def run_computation(data):
perform_computation(data)
#before_task_publish.connect
def receiver_before_task_publish(sender=None, headers=None, body=None, **kwargs):
data = create_data()
# How do I get the data to the task from here?
Is this the right way to approach this in the first place? Or would I be better off making an API route that the celery task can get to retrieve the data?
I'm posting the solution that worked for me, thanks for the help #solarissmoke.
What works best for me is utilizing Celery "chain" callback functions and separate RabbitMQ queues for designating what would be computed locally and what would be computed on the remote worker.
My solution looks something like this:
#app.task
def create_data_task():
# this creates the data to be passed to the analysis function
return create_data()
#app.task
def perform_computation_task(data):
# This performs the computation with given data
return perform_computation(data)
#app.task
def store_results(result):
# This would store the result in the DB here, but for now we just print it
print(result)
#app.task
def run_all_computation():
task = signature("path.to.tasks.create_data_task", queue="default") | signature("path.to.tasks.perform_computation_task", queue="remote_computation") | signature("path.to.tasks.store_results", queue="default")
task()
Its important to note here that these tasks were not run serially; they were in fact separate tasks that are run by the workers and therefore do not block a single thread. The other tasks are only activated by a callback function from the others. I declared two celery queues in RabbitMQ, a default one called default, and one specifically for remote computation called "remote_computation". This is described explicitly here including how to point celery workers at created queues.
It is possible to modify the task data in place, from the before_task_publish handler, so that it gets passed to the task. I will say upfront that there are many reasons why this is not a good idea:
#before_task_publish.connect
def receiver_before_task_publish(sender=None, headers=None, body=None, **kwargs):
data = create_data()
# Modify the body of the task data.
# Body is a tuple, the first entry of which is a tuple of arguments to the task.
# So we replace the first argument (data) with our own.
body[0][0] = data
# Alternatively modify the kwargs, which is a bit more explicit
body[1]['data'] = data
This works, but it should be obvious why it's risky and prone to breakage. Assuming you have control over the task call sites I think it would be better to drop the signal altogether and just have a simple function that does the work for you, i.e.,:
def create_task(data):
data = create_data()
run_computation.delay(data)
And then in your calling code, just call create_task(data) instead of calling the task directly (which is what you presumably do right now).

flask requests are not thread safe when we run it with gunicorn [duplicate]

In my application, the state of a common object is changed by making requests, and the response depends on the state.
class SomeObj():
def __init__(self, param):
self.param = param
def query(self):
self.param += 1
return self.param
global_obj = SomeObj(0)
#app.route('/')
def home():
flash(global_obj.query())
render_template('index.html')
If I run this on my development server, I expect to get 1, 2, 3 and so on. If requests are made from 100 different clients simultaneously, can something go wrong? The expected result would be that the 100 different clients each see a unique number from 1 to 100. Or will something like this happen:
Client 1 queries. self.param is incremented by 1.
Before the return statement can be executed, the thread switches over to client 2. self.param is incremented again.
The thread switches back to client 1, and the client is returned the number 2, say.
Now the thread moves to client 2 and returns him/her the number 3.
Since there were only two clients, the expected results were 1 and 2, not 2 and 3. A number was skipped.
Will this actually happen as I scale up my application? What alternatives to a global variable should I look at?
You can't use global variables to hold this sort of data. Not only is it not thread safe, it's not process safe, and WSGI servers in production spawn multiple processes. Not only would your counts be wrong if you were using threads to handle requests, they would also vary depending on which process handled the request.
Use a data source outside of Flask to hold global data. A database, memcached, or redis are all appropriate separate storage areas, depending on your needs. If you need to load and access Python data, consider multiprocessing.Manager. You could also use the session for simple data that is per-user.
The development server may run in single thread and process. You won't see the behavior you describe since each request will be handled synchronously. Enable threads or processes and you will see it. app.run(threaded=True) or app.run(processes=10). (In 1.0 the server is threaded by default.)
Some WSGI servers may support gevent or another async worker. Global variables are still not thread safe because there's still no protection against most race conditions. You can still have a scenario where one worker gets a value, yields, another modifies it, yields, then the first worker also modifies it.
If you need to store some global data during a request, you may use Flask's g object. Another common case is some top-level object that manages database connections. The distinction for this type of "global" is that it's unique to each request, not used between requests, and there's something managing the set up and teardown of the resource.
This is not really an answer to thread safety of globals.
But I think it is important to mention sessions here.
You are looking for a way to store client-specific data. Every connection should have access to its own pool of data, in a threadsafe way.
This is possible with server-side sessions, and they are available in a very neat flask plugin: https://pythonhosted.org/Flask-Session/
If you set up sessions, a session variable is available in all your routes and it behaves like a dictionary. The data stored in this dictionary is individual for each connecting client.
Here is a short demo:
from flask import Flask, session
from flask_session import Session
app = Flask(__name__)
# Check Configuration section for more details
SESSION_TYPE = 'filesystem'
app.config.from_object(__name__)
Session(app)
#app.route('/')
def reset():
session["counter"]=0
return "counter was reset"
#app.route('/inc')
def routeA():
if not "counter" in session:
session["counter"]=0
session["counter"]+=1
return "counter is {}".format(session["counter"])
#app.route('/dec')
def routeB():
if not "counter" in session:
session["counter"] = 0
session["counter"] -= 1
return "counter is {}".format(session["counter"])
if __name__ == '__main__':
app.run()
After pip install Flask-Session, you should be able to run this. Try accessing it from different browsers, you'll see that the counter is not shared between them.
Another example of a data source external to requests is a cache, such as what's provided by Flask-Caching or another extension.
Create a file common.py and place in it the following:
from flask_caching import Cache
# Instantiate the cache
cache = Cache()
In the file where your flask app is created, register your cache with the following code:
# Import cache
from common import cache
# ...
app = Flask(__name__)
cache.init_app(app=app, config={"CACHE_TYPE": "filesystem",'CACHE_DIR': Path('/tmp')})
Now use throughout your application by importing the cache and executing as follows:
# Import cache
from common import cache
# store a value
cache.set("my_value", 1_000_000)
# Get a value
my_value = cache.get("my_value")
While totally accepting the previous upvoted answers, and discouraging use of global variables for production and scalable Flask storage, for the purpose of prototyping or really simple servers, running under the flask 'development server'...
...
The Python built-in data types, and I personally used and tested the global dict, as per Python documentation are thread safe. Not process safe.
The insertions, lookups, and reads from such a (server global) dict will be OK from each (possibly concurrent) Flask session running under the development server.
When such a global dict is keyed with a unique Flask session key, it can be rather useful for server-side storage of session specific data otherwise not fitting into the cookie (max size 4 kB).
Of course, such a server global dict should be carefully guarded for growing too large, being in-memory. Some sort of expiring the 'old' key/value pairs can be coded during request processing.
Again, it is not recommended for production or scalable deployments, but it is possibly OK for local task-oriented servers where a separate database is too much for the given task.
...

mariadb.DatabaseError after long inactivity on mariadb connection in Python 3

So I am developing this online telnet-like game and it's not very popular (who knows, one day), so the database connection of my game engine is not used for hours at night. It is one script that waits for events, so it keeps running.
The first time a query is done after several hours of inactivity, I receive the mariadb.DatabaseError when trying to execute the cursor. If I redo the query, it works again. So while the function throws the exception that the connection is lost, it does repair it.
My question: how should I handle this?
These are things I see as possible solutions, but in my opinion, they are not very good:
wrapping every query inside a try-except structure, makes the code bulky with mostly unnecessary and repetitive code
writing my own 'decorator' function to execute a query, which will then reinitialize the database when I get mariadb.DatabaseError, which seems better, but makes me write wrapper functions around (almost) perfectly working library functions
doing a mostly pointless 'ping' query every N minutes, which is stressing on the db which is useless 99.9% of the time.
Here is some code to illustrate:
import mariadb
class Db:
...
def __init__(self):
self.conn = mariadb.connect(user=self.__db_user, password=self.__db_pass, host=self.__db_host, port=self.__db_port, database=self.__db_name)
def one_of_many_functions(self, ...):
cur = self.conn.cursor()
cur.execute('SELECT ...') # Here is where the mariadb.DatabaseError happens after long inactivity, and otherwise runs fine
...
I actually really don't understand why python's mariadb implementation doesn't handle this. When the connection is lost, cur.execute will throw a mariadb.DatabaseError, but no action is to be taken, because if I requery with that same database connection, it works again. So the connection does repair itself. Why does the component make me requery while it 'repairs' the connection itself and could query again?
But as it is what it is, my question is: what is the nicest way to handle this?
If you set a long time out value, there is even no guarantee, that the connection will drop due to other reasons (client timeout, 24 hr disconnect, ...)
An option would be to set auto_reconnect, as in the following example:
import mariadb
conn1= mariadb.connect()
conn2= mariadb.connect()
# Force MariaDB/Connector Python to reconnect
conn2.auto_reconnect= True
cursor1= conn1.cursor()
print("Connid of connection 2: %s" % conn2.connection_id);
# Since we don't want to wait, we kill the conn2 intentionally
cursor1.execute("KILL %s" % conn2.connection_id)
cursor2= conn2.cursor()
cursor2.execute("select connection_id()")
row= cursor2.fetchall()
print("Connid of connection 2: %s" % conn2.connection_id);
print(row)
Output:
Connid of connection 2: 174
Connid of connection 2: 175
[(175,)]
So after connection 2 was killed, next cursor.execute will establish a new connection before executing the statement. This solution will not work if you use an existing open cursor, since the internal statement handle becomes invalid.
Are you using a socket or TCP/IP for connection?
TCP/IP connections are designed to be cleaned up after a period of no traffic. You might say it's idiotic, but there's really no better way to know if a program crashes.
For the same reason, databases have their own timeout mechanism. For MySQL it's called wait_timeout.
Normally, a connection object (or its wrapper) would take care of running some no-op query if there is nothing else going on with the connection, something like select 1. This is a standard practice. Check the documentation for your connection object - it might already be there, you just need to configure it. Use something like 30-60 seconds.
If not, you will have to implement it yourself. It doesn't matter how, the point is that you cannot expect connections to stay open forever. Either make connections short-lived (open it only when you need it and close it afterwards), or implement a timer that will insert some no-op query periodically. In the latter case note that you will need to implement synchronization mechanism to make sure that your application query never runs at the same time as no-op query.
Have you considered using a connection pool.
# Create Connection Pool
pool = mariadb.ConnectionPool(
#...,
pool_size=1
)
Then in your connection method.
try:
pconn = pool.get_connection()
except mariadb.PoolError as e:
# Report Error
print(f"Error opening connection from pool: {e}")
The documentation doesn't say what happens when connections are closed or broken. I expect that it takes care of that, and always tries to provide a valid connection ( as long as your not asking for more connections than are in the pool.)
I got the code from their docs

Cherrypy_handling requests

I've been searching for a while now but can't find an answere.
I know that cherrypy creates a new thread for handling requests (GET, PUT, POST, DELETE etc).
Now i fetch the parameters like this:
...
#cherrypy.tools.json_in()
#cherrypy.tools.json_out()
def POST(self):
Forum.lock_post.acquire()
conn = self.io.psqlConnect(self.dict_psql)
cur = conn.cursor(cursor_factory = psycopg2.extras.RealDictCursor)
params = cherrypy.request.json
...
return some_dict
As you can see im locking the thread to avoid race condition on the variable params. But is this really necessary? I'm asking cos if i do it like this all the other requests on POST will have to wait. Is there any better solution without locking the whole POST? I'm using params several times along the code.
First a clarification, CherryPy doesn't create a new thread for each requests, it has a predetermined pool of threads (10 by default), from which indeed one thread can be used to handle a single request at a time.
As for if you should lock cherrypy.request.json. You really don't, there is a concept called "thread locals" on which you can have multiple references to different objects depending on which thread is accessing such object. (python docs).
Having said that... you should make sure that the code that you write doesn't interfere with the state of the other threads (you can use the cherrypy.thread_data as a quick fix).
Take a look into the cherrypy plugin architecture, if you want a resource to be shared among threads usually a plugin is the way to: http://docs.cherrypy.org/en/latest/extend.html#plugins

Resources