Is Autocommit mode efficient?

I am using Python 3.7 and the sqlite3 module in "auto commit" mode (i.e. creating the connection object with isolation_level set to None). I want to check whether this is as efficient as possible. This is how I created my connection object.
conn = sqlite3.connect(DATABASE_NAME, check_same_thread=False, isolation_level=None)

The Python sqlite3 module has some odd behaviour with respect to transactions, and the documentation on the subject is rather confusing. Its reference to "autocommit mode" is especially confusing - I think this term has a meaning in Python which is different from the meaning of the same term in the underlying SQLite library. I'll just avoid that term altogether in this answer.
If isolation_level is NOT None (indeed the default is "" which is not None) then transactions will sometimes be started automatically. It's up to you to commit them or roll them back, either by calling COMMIT or ROLLBACK explicitly or by using the connection as a context manager:
# Approach 1: transactions start automatically
conn = sqlite3.connect(path_to_db)
with conn:
    values = [(row[0], row[1]) for row in conn.execute("SELECT a, b FROM Foo")]
    # transaction starts here - in between the two statements!
    conn.execute("UPDATE Foo SET a=? WHERE b=?", (values[0][0] + 1, values[0][1]))
# At this point, with conn: will automatically commit/rollback the transaction
(By the way this is a very poor way to increment a value. You don't need to read all rows, and you don't even need separate SELECT and UPDATE statements! But this is meant to illustrate how the transactions work, not how to write good SQL.)
In the above example, Python will NOT start a transaction when it gets to the SELECT statement, but WILL start a transaction when it gets to the UPDATE statement. The with conn: context manager will ensure the transaction is either committed (if there's no exception) or rolled back (if there is an exception) at the end of the block.
I could say more about when and why transactions are started automatically, but my advice is to set isolation_level to None, which stops Python from ever starting transactions automatically. You can then begin them manually for yourself, as progmatico's answer said:
# Approach 2: you manually start transactions in your code
conn = sqlite3.connect(path_to_db, isolation_level=None)
with conn:
    conn.execute("BEGIN")  # Starts the transaction
    values = [(row[0], row[1]) for row in conn.execute("SELECT a, b FROM Foo")]
    conn.execute("UPDATE Foo SET a=? WHERE b=?", (values[0][0] + 1, values[0][1]))
# At this point, with conn: will automatically commit/rollback the transaction
Or don't use them at all:
# Approach 3: no explicit transactions at all
conn = sqlite3.connect(path_to_db, isolation_level=None)
values = [(row[0], row[1]) for row in conn.execute("SELECT a, b FROM Foo")]
conn.execute("UPDATE Foo SET a=? WHERE b=?", (values[0][0] + 1, values[0][1]))
In the last code snippet, we never started a transaction (and neither did Python!), so the underlying SQLite library automatically put each of the two statements in its own implicit transaction. (This is what the underlying SQLite library documentation calls "autocommit mode". Oops, I did mention that term again!)
So, hopefully you can see that it doesn't make sense to ask whether isolation_level=None makes your code more efficient. It lets you choose when to start and finish transactions, if at all, as in approach 2 vs approach 3. But it's still up to you when you do that, and that's what affects the efficiency of your program. If you're inserting a lot of records, it's more efficient to do them in one transaction, like approach 2. But you don't want to keep transactions open for too long because that could block other threads/processes trying to use that database. And there's typically not much speed benefit to executing SELECT statements in a transaction, so approach 3 might be the way to go (but a transaction might be necessary for correctness anyway, depending on the application).
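For example, here is a minimal sketch of approach 2 applied to a bulk insert, reusing the example table Foo(a, b) from above (the database file name is a placeholder): wrapping all the INSERTs in one explicit transaction is typically far faster than letting each statement run in its own implicit transaction.
import sqlite3
conn = sqlite3.connect("example.db", isolation_level=None)  # hypothetical database file
conn.execute("CREATE TABLE IF NOT EXISTS Foo (a INTEGER, b INTEGER)")
rows = [(i, i * 2) for i in range(10000)]
with conn:
    conn.execute("BEGIN")  # one transaction for the whole batch
    conn.executemany("INSERT INTO Foo (a, b) VALUES (?, ?)", rows)
# with conn: commits here (or rolls back if an exception was raised)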

The answer to your question is literally written here.
It basically says that you can stop the sqlite3 Python module from implicitly starting transactions on any data-modification statement you send, allowing you, the user, to take control of the transactions occurring in the underlying SQLite library with BEGIN, ROLLBACK, SAVEPOINT, and RELEASE statements in your code.
This is useful for grouping several statements that should be executed as part of a single transaction, instead of having every single modification statement execute in its own implicit transaction.


Is there a way to pass a variable as an argument to a trigger function in YugabyteDB?

[Disclaimer: This question was posed by one of our YugabyteDB users on our yugabyte.com/slack channel]
The documentation for triggers below gives an example of attaching a trigger to the update of an employee table.
https://docs.yugabyte.com/latest/explore/ysql-language-features/triggers/
Suppose you have a procedure that only allows a manager to transfer an employee if the employee is a direct report to the manager, something like the following:
-- transfer_employee(manager_no, employee_no, department)
CREATE OR REPLACE PROCEDURE transfer_employee(integer, integer, text)
LANGUAGE plpgsql
AS $$
BEGIN
  -- If the employee reports to the manager, allow the manager to transfer the person
  IF EXISTS (SELECT employee_no FROM mgr_table WHERE mgr_id = $1 AND employee_no = $2) THEN
    UPDATE employees
    SET department = $3
    WHERE employee_no = $2;
  END IF;
  COMMIT;
END;
$$;
Is there a way in YugabyteDB for the trigger to gain access to, or be passed, the state variables of the stored procedure, so that you could log in a table values such as the manager_id who made the change, and not just the new or old department? (This is a variable that only exists in the context of the stored procedure.)
If it is possible, the syntax to do this is unclear to me from this example.
There is nothing here that I can see that is unique to yugabyte, so this is all postgres. This also means you can use the postgres documentation.
The postgres documentation states that the trigger function must be declared with no arguments. (https://www.postgresql.org/docs/11/plpgsql-trigger.html) In other words: a variable cannot be passed as an argument to the trigger function.
Since the body of the trigger is a function, the possibilities are practically unlimited. But I would strongly advise making sure data and logic are not set up in a way that makes their behaviour non-obvious.
So, to provide an answer to your question: yes, that is possible. Here is a Stack Overflow answer that provides the means to arrange the requested functionality: How to use variable settings in trigger functions?
(using custom variables set in the session settings)
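To make that concrete, here is a minimal sketch of the custom-setting idea, driven from Python/psycopg2 since that is used elsewhere on this page. All names here (department_audit, log_department_change, myapp.manager_id, the DSN, the employee values) are hypothetical, not taken from the YugabyteDB docs:
import psycopg2

conn = psycopg2.connect("dbname=test")  # placeholder DSN

DDL = """
CREATE TABLE IF NOT EXISTS department_audit (
    employee_no integer,
    old_department text,
    new_department text,
    changed_by text
);
CREATE OR REPLACE FUNCTION log_department_change() RETURNS trigger
LANGUAGE plpgsql AS $$
BEGIN
    -- Read the custom setting that the calling code is expected to have set.
    INSERT INTO department_audit (employee_no, old_department, new_department, changed_by)
    VALUES (NEW.employee_no, OLD.department, NEW.department,
            current_setting('myapp.manager_id', true));
    RETURN NEW;
END;
$$;
DROP TRIGGER IF EXISTS department_change ON employees;
CREATE TRIGGER department_change
    AFTER UPDATE OF department ON employees
    FOR EACH ROW EXECUTE FUNCTION log_department_change();
"""

with conn, conn.cursor() as cur:
    cur.execute(DDL)

# In one transaction: expose the manager id as a transaction-local setting,
# then perform the update; the trigger reads the value via current_setting().
with conn, conn.cursor() as cur:
    cur.execute("SELECT set_config('myapp.manager_id', %s, true)", ("17",))
    cur.execute("UPDATE employees SET department = %s WHERE employee_no = %s",
                ("Sales", 42))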
Please do not stop reading here. By doing this, the only way to make the triggered functionality work is to manually craft the session state the trigger expects. That makes it really hard (if not impossible) to work with the data in general, especially if this type of trigger is added in more places.
In general, the way to do this is by creating procedures that an application must use (an API) in order to manipulate the data, so that any rules that require more information than a table-level function like a trigger can see can be handled there.
That way the database objects can have all their data-scope rules (primary/foreign keys, check constraints, not null), but do not require anything beyond the tables themselves for data and database administration.
Adding to Frits's answer:
In general the way to do this is by creating procedures that an application must use (an API) in order to manipulate the data… 
Yes, 100% agree. Stepping up a level, the question becomes "How to model session state in PostgreSQL?". In Oracle Database, the usual paradigm is to use package state (by all means visible only through a setter procedure and a getter function). Sad to say, there's no simple way to do this in PostgreSQL (which has no packages). The only choices are tables or what the Stack Exchange piece that Frits referred to describes.
Tables are problematic: performance; distinguishing your rows from other sessions' rows in a regular table; and there is no such thing as a global temporary table in PostgreSQL, so a session temporary table somehow needs to be created at the start of each session.
Here’s how I built a stopwatch for use across several server calls. Cumbersome. But it does work.
create procedure admin.start_stopwatch()
language plpgsql
as $body$
declare
 -- Make a memo of the current wall-clock time.
 start_time constant text not null := clock_timestamp()::text;
begin
 execute 'set stopwatch.start_time to '''||start_time||'''';
end;
$body$;
create function admin.stopwatch_reading()
 returns text
 -- It's critical to use "volatile". Else wrong results.
 volatile
language plpgsql
as $body$
declare
 -- Read the starting wall-clock time from the memo.
 start_time constant timestamptz not null := current_setting('stopwatch.start_time');
 -- Read the current wall-clock time.
 curr_time constant timestamptz not null := clock_timestamp();
 diff constant interval not null := curr_time - start_time;
begin
 return ...
end;
$body$;

Threading in Python 3

I write Python 3 code in which I have 2 functions. The first function, insertBlock(), inserts data in MongoDB collection 1; the second function, insertTransactionData(), takes data from collection 1 and inserts it into collection 2. The amount of data is very large, so I use threading to increase performance. But when I use threading, it takes more time to insert data than without threading. I am confused about how exactly threading works in my code and how to increase performance. Here is the main function:
if __name__ == '__main__':
    t1 = threading.Thread(target=insertBlock())
    t1.start()
    t2 = threading.Thread(target=insertTransactionData())
    t2.start()
From the python documentation for threading:
target is the callable object to be invoked by the run() method. Defaults to None, meaning nothing is called.
So the correct usage is
threading.Thread(target=insertBlock)
(without the () after insertBlock), because otherwise insertBlock is called immediately, executed normally (blocking the main thread), and target is set to its return value None. This causes t1.start() not to do anything, and you don't get any performance improvement.
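A corrected sketch of the main block, with the usual join() calls added so the main thread waits for both workers:
import threading

if __name__ == '__main__':
    t1 = threading.Thread(target=insertBlock)            # pass the function itself...
    t2 = threading.Thread(target=insertTransactionData)  # ...not the result of calling it
    t1.start()
    t2.start()
    t1.join()  # wait for both workers to finish
    t2.join()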
Warning:
Be aware that multithreading gives you no guarantee on what the order of execution in different threads will be. You can not rely on the data that insertBlock has inserted into the database inside the insertTransactionData function, because at the time insertTransactionData uses this data, you can not be sure that it was already inserted. So, maybe multithreading does not work at all for this code or you need to restructure your code and only parallelize those parts that do not depend on each other.
I solved this problem by merging these two functionalities into one new function, insertBlockAndTransaction(startrange, endrange). As the two functionalities depend on each other, I insert the transaction information immediately after the point where the block information is inserted (the block number was common and needed for both). Then I did multithreading by creating 10 threads for the single function:
for i in range(10):
    print('thread:', i)
    t1 = threading.Thread(target=insertBlockAndTransaction,
                          args=(5000000 + i*10000, 5000000 + (i+1)*10000))
    t1.start()
This helped me deal with the increasing execution time for more than 100,000 (1 lakh) records.
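One more note on the loop above: it reuses the same variable t1 for every thread and never joins them, so the code cannot tell when all of the workers have finished. A variant that keeps a reference to each thread and waits for all of them might look like this (same ranges, sketched):
import threading

threads = []
for i in range(10):
    t = threading.Thread(target=insertBlockAndTransaction,
                         args=(5000000 + i*10000, 5000000 + (i+1)*10000))
    t.start()
    threads.append(t)
for t in threads:
    t.join()  # wait for every worker before exiting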

Postgresql FOR UPDATE SKIP LOCKED still selects duplicated rows

I am using PostgreSQL as a job queue. Following is my query to retrieve a job and update its state:
UPDATE requests AS re
SET
started_at = NOW(),
finished_at = NULL
FROM (
SELECT
_re.*
FROM requests AS _re
WHERE
_re.state = 'pending'
AND
_re.started_at IS NULL
LIMIT 1
FOR UPDATE SKIP LOCKED
) AS sub
WHERE re.id = sub.id
RETURNING
sub.*
Now, I have several machines. On each machine I have 1 process with several threads, and on each thread I have a worker. All workers in the same process share a connection pool, typically having 10 - 20 connections.
The problem is, the above query will return some rows more than once!
I cannot find any reasons. Could anyone help?
To be more detailed, I am using Python3 and psycopg2.
Update:
I have tried a_horse_with_no_name's answer, but it does not seem to work.
I noticed that one request is retrieved by two queries, with started_at updated to:
2016-04-21 14:23:06.970897+08
and
2016-04-21 14:23:06.831345+08
which differ by only 0.14 s.
I am wondering whether, at the time those two connections execute the inner SELECT subquery, neither lock has been established yet?
Update:
To be more precise, I have 200 workers (i.e. 200 threads) in 1 process on 1 machine.
Please also note that it's essential that each thread has its own connection if you do not want them to get in each other's way.
If your application uses multiple threads of execution, they cannot
share a connection concurrently. You must either explicitly control
access to the connection (using mutexes) or use a connection for each
thread. If each thread uses its own connection, you will need to use
the AT clause to specify which connection the thread will use.
from: http://www.postgresql.org/docs/9.5/static/ecpg-connect.html
All kinds of weird things happen if two threads share the same connection. I believe this is what is happening in your case. If you take a lock with one connection, all other threads that use the same connection will have access to the locked objects.
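A minimal sketch of the one-connection-per-thread setup with psycopg2 (the DSN, the worker count, and the abbreviated claim query are placeholders based on the question):
import threading
import psycopg2

DSN = "dbname=jobs"  # placeholder

CLAIM = """
UPDATE requests AS re
SET started_at = NOW(), finished_at = NULL
FROM (SELECT id FROM requests
      WHERE state = 'pending' AND started_at IS NULL
      LIMIT 1 FOR UPDATE SKIP LOCKED) AS sub
WHERE re.id = sub.id
RETURNING re.id;
"""

def worker():
    conn = psycopg2.connect(DSN)  # each thread opens its own connection
    with conn, conn.cursor() as cur:
        cur.execute(CLAIM)
        row = cur.fetchone()      # None when no pending request could be locked
    conn.close()
    return row

threads = [threading.Thread(target=worker) for _ in range(20)]
for t in threads:
    t.start()
for t in threads:
    t.join()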
Permit me to suggest an alternative approach that is really simple: the use of Redis as a queue. You can either simply make use of redis-py and the lpush/rpop methods or use python-rq.
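A minimal sketch of the redis-py variant (the queue name and payload format are made up for the example):
import json
import redis

r = redis.Redis()  # assumes a local Redis instance

# Producer: push a job onto the queue.
r.lpush("requests", json.dumps({"id": 123, "state": "pending"}))

# Consumer: block for up to 5 seconds waiting for a job.
item = r.brpop("requests", timeout=5)
if item is not None:
    _key, payload = item
    job = json.loads(payload)
    # ... process the job ...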
There is a chance a locking transaction is not yet issued at the time of the select, or the lock is lost by the time the results of the select are ready and the update statement begins. Have you tried explicitly beginning a transaction?
BEGIN;
WITH req AS (
SELECT id
FROM requests AS _re
WHERE _re.state = 'pending' AND _re.started_at IS NULL
LIMIT 1 FOR UPDATE SKIP LOCKED
)
UPDATE requests SET started_at = NOW(), finished_at = NULL
FROM req
WHERE requests.id = req.id;
COMMIT;
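Since the question uses psycopg2, note that unless autocommit is enabled, psycopg2 already wraps statements in a transaction and holds it open until commit(), so the BEGIN/COMMIT above roughly corresponds to this sketch:
import psycopg2

conn = psycopg2.connect("dbname=jobs")  # placeholder DSN; autocommit is off by default
with conn.cursor() as cur:
    # psycopg2 sends BEGIN before this first statement; the row lock is held
    # until conn.commit() ends the transaction.
    cur.execute("""
        WITH req AS (
            SELECT id FROM requests
            WHERE state = 'pending' AND started_at IS NULL
            LIMIT 1 FOR UPDATE SKIP LOCKED
        )
        UPDATE requests SET started_at = NOW(), finished_at = NULL
        FROM req
        WHERE requests.id = req.id
        RETURNING requests.id
    """)
    job = cur.fetchone()
conn.commit()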

Safe to use unsafeIOToSTM to read from database?

In this pseudocode block:
atomically $ do
  if valueInLocalStorage key
    then readValueFromLocalStorage key
    else do
      value <- unsafeIOToSTM $ fetchValueFromDatabase key
      writeValueToLocalStorage key value
Is it safe to use unsafeIOToSTM? The docs say:
The STM implementation will often run transactions multiple times, so you need to be prepared for this if your IO has any side effects.
Basically, if a transaction fails, it is because some other thread wrote the value to local storage, and when the transaction is retried it will return the stored value instead of fetching it from the database again.
The STM implementation will abort transactions that are known to be invalid and need to be restarted. This may happen in the middle of unsafeIOToSTM, so make sure you don't acquire any resources that need releasing (exception handlers are ignored when aborting the transaction). That includes doing any IO using Handles, for example. Getting this wrong will probably lead to random deadlocks.
This worries me the most. Logically, if fetchValueFromDatabase doesn't open a new connection (i.e. an existing connection is used) everything should be fine. Are there other pitfalls I am missing?
The transaction may have seen an inconsistent view of memory when the IO runs. Invariants that you expect to be true throughout your program may not be true inside a transaction, due to the way transactions are implemented. Normally this wouldn't be visible to the programmer, but using unsafeIOToSTM can expose it.
key is a single value, no invariants to break.
I would suggest that doing I/O from an STM transaction is just a bad idea.
Presumably what you want is to avoid two threads doing the DB lookup at the same time. What I would do is this:
See if the item is already in the cache. If it is, we're done.
If it isn't, mark it with an "I'm fetching this" flag, commit the STM transaction, go get it from the DB, and do a second STM transaction to insert it into the cache (and remove the flag).
If the item is already flagged, retry the transaction. This blocks the calling thread until the first thread inserts the value from the DB.
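Not Haskell, but to make the shape of that concrete, here is roughly the same three-step pattern sketched in Python, with a condition variable standing in for STM retry (fetch_value_from_database is a hypothetical stand-in for the DB call):
import threading

_cache = {}          # key -> value
_in_flight = set()   # keys some thread is currently fetching (the "I'm fetching this" flag)
_cond = threading.Condition()

def get(key):
    with _cond:
        while True:
            if key in _cache:          # step 1: already cached, we're done
                return _cache[key]
            if key not in _in_flight:  # step 2: nobody is fetching it yet, so claim it
                _in_flight.add(key)
                break
            _cond.wait()               # step 3: another thread is fetching; block (like retry)
    try:
        value = fetch_value_from_database(key)  # the DB I/O happens outside the lock
        with _cond:
            _cache[key] = value
            return value
    finally:
        with _cond:
            _in_flight.discard(key)    # clear the flag and wake any waiting threads
            _cond.notify_all()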

Concurrency issue with SELECT FOR UPDATE in Postgres 9.3

I've been struggling with this issue for two days. We have a solution where multiple worker threads try to select job requests from a single database/table, by setting a flag on the selected requests and thus effectively blocking the other workers from selecting the same requests.
I created a Java test application to test my queries, but while in normal situations the test executes without issue, in high-contention situations (e.g. 1 table entry with 50 threads; no delays or processing) I still have threads that obtain the same request/entry; interestingly, it happens just as the test starts. I cannot understand why. I've read all the relevant Postgres documentation on locking and isolation... While it is possible that the issue is with the test application itself, I suspect that I am missing something about how SELECT FOR UPDATE works in a READ COMMITTED isolation context.
So the question would be: can SELECT FOR UPDATE (with READ COMMITTED isolation) guarantee that a general concurrency issue like the one I described can be safely solved?
Acquire query:
UPDATE mytable SET status = 'LOCK'
WHERE ctid IN (SELECT ctid FROM mytable
               WHERE status = 'FREE'
               ORDER BY id
               LIMIT %d
               FOR UPDATE)
RETURNING id, status;
Release query:
UPDATE mytable SET status = 'FREE'
WHERE id = %d AND status = 'LOCK'
RETURNING id, status;
So would you consider these two queries safe, or is there some weird case possible that would allow two threads to acquire the same row? I'd like to mention that I also tried SERIALIZABLE isolation and it didn't help.
It turns out (how could it be different?) that I made a mistake in my test. I didn't respect the resource acquire/release order. The test was registering the release (decrementing a counter) after the Release query, which led another thread to acquire the resource and register it in the meantime. It's the kind of error you know how to solve but cannot see even if you look several times, because you wrote the code... Peer review helped in the end.
I suppose at this time I have a test to prove that:
The above two queries are safe
You don't need SERIALIZABLE isolation to solve problems with DB acquire/release as long as you use row locking as in SELECT... FOR UPDATE
You must ORDER (BY) results when using row locking (even if you use LIMIT 1!), otherwise you end up with deadlocks
Safe to acquire multiple resources with one query (LIMIT 2 and above)
Using ctid in the query is safe; it's actually a little faster, but this is insignificant in real world applications
I'm not sure if this will be helpful to others, but I was getting desperate. So all is good with Postgres 9.3 :)
Another aspect I'd like to share is the speed of the Acquire query with LIMIT 2. See the test result:
Starting test...
DB setup done
All threads created & connections made
All threads started
Thread[36] 186/190/624=1000
Thread[19] 184/201/615=1000
Thread[12] 230/211/559=1000
Thread[46] 175/200/625=1000
Thread[ 9] 205/211/584=1000
...
Thread[ 4] 189/232/579=1000
Thread[ 3] 185/198/617=1000
Thread[49] 218/204/578=1000
Thread[ 1] 204/203/593=1000
...
Thread[37] 177/163/660=1000
Thread[31] 168/199/633=1000
Thread[18] 174/187/639=1000
Thread[42] 178/229/593=1000
Thread[29] 201/229/570=1000
...
Thread[10] 203/198/599=1000
Thread[25] 215/210/575=1000
Thread[27] 248/191/561=1000
...
Thread[17] 311/192/497=1000
Thread[ 8] 365/198/437=1000
Thread[15] 389/176/435=1000
All threads finished
Execution time: 31408
Test done; exiting
Compare the above with this query:
UPDATE mytable SET status = 'LOCK'
WHERE id IN (SELECT t1.id FROM (SELECT id FROM mytable
                                WHERE status = 'FREE' ORDER BY id LIMIT 2) AS t1
             FOR UPDATE)
RETURNING id, status;
and the result:
Starting test...
DB setup done
All threads created & connections made
All threads started
Thread[29] 32/121/847=1000
Thread[22] 61/151/788=1000
Thread[46] 36/114/850=1000
Thread[41] 57/132/811=1000
Thread[24] 49/146/805=1000
Thread[13] 47/135/818=1000
...
Thread[20] 48/118/834=1000
Thread[47] 65/152/783=1000
Thread[18] 51/146/803=1000
Thread[ 8] 69/158/773=1000
Thread[14] 56/158/786=1000
Thread[ 0] 66/161/773=1000
Thread[38] 60/148/792=1000
Thread[27] 69/158/773=1000
...
Thread[45] 78/177/745=1000
Thread[30] 96/162/742=1000
...
Thread[32] 162/167/671=1000
Thread[17] 329/156/515=1000
Thread[33] 337/178/485=1000
Thread[37] 381/172/447=1000
All threads finished
Execution time: 15490
Test done; exiting
Conclusion
The test prints for each thread how many times the Acquire query returned 2, 1 or 0 resources totalling the number of test loops (1000).
From the above results we can conclude that we can speed up the query (halving the time!) at the cost of increased thread contention. This means we will more often receive 0 resources back from the Acquire query. Technically this is not a problem, because we need to handle that situation anyway.
Of course, the situation changes if you add a wait time (sleeping) when no resources are returned, but choosing a correct value for the wait time depends on the application's performance requirements...
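For what it's worth, a rough sketch of such a worker loop (in Python/psycopg2 here, although the original test was a Java application; the wait time is an arbitrary placeholder):
import time
import psycopg2

ACQUIRE = """
UPDATE mytable SET status = 'LOCK'
WHERE ctid IN (SELECT ctid FROM mytable
               WHERE status = 'FREE'
               ORDER BY id
               LIMIT 2
               FOR UPDATE)
RETURNING id, status;
"""
RELEASE = "UPDATE mytable SET status = 'FREE' WHERE id = %s AND status = 'LOCK'"

def worker(dsn, wait_seconds=0.05):
    conn = psycopg2.connect(dsn)      # one connection per worker
    while True:
        with conn, conn.cursor() as cur:
            cur.execute(ACQUIRE)
            rows = cur.fetchall()     # 0, 1 or 2 acquired resources
        if not rows:
            time.sleep(wait_seconds)  # nothing acquired: back off instead of retrying immediately
            continue
        for resource_id, _status in rows:
            # ... do the actual work with the resource here ...
            with conn, conn.cursor() as cur:
                cur.execute(RELEASE, (resource_id,))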
