PostgreSQL FOR UPDATE SKIP LOCKED still selects duplicate rows - multithreading

I am using PostgreSQL as a job queue. Following is my query to retrieve a job and update its state:
UPDATE requests AS re
SET
    started_at = NOW(),
    finished_at = NULL
FROM (
    SELECT _re.*
    FROM requests AS _re
    WHERE
        _re.state = 'pending'
        AND _re.started_at IS NULL
    LIMIT 1
    FOR UPDATE SKIP LOCKED
) AS sub
WHERE re.id = sub.id
RETURNING sub.*
Now, I have several machines; on each machine I have one process with several threads, and each thread runs a worker. All workers in the same process share a connection pool, typically with 10-20 connections.
The problem is, the above query will return some rows more than once!
I cannot find any reason for this. Could anyone help?
To be more detailed, I am using Python 3 and psycopg2.
Update:
I have tried @a_horse_with_no_name's answer, but it does not seem to work.
I noticed that one request was retrieved by two queries, with started_at updated to:
2016-04-21 14:23:06.970897+08
and
2016-04-21 14:23:06.831345+08
which differ by only 0.14 s.
I am wondering whether, at the time those two connections executed the inner SELECT subquery, neither lock had been established yet.
Update:
To be more precise, I have 200 workers (i.e. 200 threads) in 1 process on 1 machine.

Please also note that it's essential that each thread has its own connection if you do not want them to get in each other's way.
If your application uses multiple threads of execution, they cannot
share a connection concurrently. You must either explicitly control
access to the connection (using mutexes) or use a connection for each
thread. If each thread uses its own connection, you will need to use
the AT clause to specify which connection the thread will use.
from: http://www.postgresql.org/docs/9.5/static/ecpg-connect.html
All kinds of weird things happen if two threads share the same connection. I believe this is what is happening in your case. If you take a lock with one connection, all other threads that use the same connection will have access to the locked objects.
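As a concrete illustration, here is a minimal sketch (assuming psycopg2; the DSN is hypothetical) of giving each worker thread its own connection from a ThreadedConnectionPool, so that no two threads ever share one connection, and therefore one transaction:
import threading
from psycopg2.pool import ThreadedConnectionPool

pool = ThreadedConnectionPool(minconn=1, maxconn=20,
                              dsn="dbname=jobs user=worker")  # hypothetical DSN

def worker():
    conn = pool.getconn()              # this thread's private connection
    try:
        with conn:                     # one transaction per unit of work
            with conn.cursor() as cur:
                cur.execute("SELECT 1")  # placeholder for the real job query
    finally:
        pool.putconn(conn)             # hand the connection back when done

threads = [threading.Thread(target=worker) for _ in range(10)]
for t in threads:
    t.start()
for t in threads:
    t.join()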
Permit me to suggest an alternative approach that is really simple: using Redis as a queue. You can either simply use redis-py and the lpush/rpop methods, or use python-rq.
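A minimal sketch of that idea with redis-py (the queue name and payload format are my own choices; I use the blocking brpop rather than plain rpop to avoid busy polling). A Redis list pop is atomic, so no two workers can ever receive the same item:
import json
import redis

r = redis.Redis()

# Producer side: enqueue a job.
r.lpush("jobs", json.dumps({"id": 42}))

# Worker side: block until a job is available, then process it.
_key, raw = r.brpop("jobs")
job = json.loads(raw)
print("processing job", job["id"])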

There is a chance that no locking transaction has been opened at the time of the SELECT, or that the lock is lost by the time the results of the SELECT are ready and the UPDATE statement begins. Have you tried explicitly beginning a transaction?
BEGIN;
WITH req AS (
    SELECT id
    FROM requests AS _re
    WHERE _re.state = 'pending' AND _re.started_at IS NULL
    LIMIT 1
    FOR UPDATE SKIP LOCKED
)
UPDATE requests
SET started_at = NOW(), finished_at = NULL
FROM req
WHERE requests.id = req.id;
COMMIT;
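Since the question uses psycopg2, here is a hedged sketch of running that statement with an explicit transaction per attempt (psycopg2 opens a transaction implicitly on the first statement; conn.commit() ends it and releases the row lock). The table and column names are taken from the question; the DSN is an assumption:
import psycopg2

conn = psycopg2.connect("dbname=jobs")  # hypothetical DSN

CLAIM_SQL = """
WITH req AS (
    SELECT id
    FROM requests AS _re
    WHERE _re.state = 'pending' AND _re.started_at IS NULL
    LIMIT 1
    FOR UPDATE SKIP LOCKED
)
UPDATE requests
SET started_at = NOW(), finished_at = NULL
FROM req
WHERE requests.id = req.id
RETURNING requests.id;
"""

def claim_one(conn):
    with conn.cursor() as cur:
        cur.execute(CLAIM_SQL)
        row = cur.fetchone()   # None when every pending row is locked or taken
    conn.commit()              # ends the implicit transaction, releasing the lock
    return row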

Related

mariadb.DatabaseError after long inactivity on mariadb connection in Python 3

So I am developing this online telnet-like game and it's not very popular (who knows, one day), so the database connection of my game engine is not used for hours at night. It is one script that waits for events, so it keeps running.
The first time a query is run after several hours of inactivity, I receive a mariadb.DatabaseError when executing the cursor. If I redo the query, it works again. So although the call throws an exception saying the connection is lost, it does repair the connection.
My question: how should I handle this?
These are things I see as possible solutions, but in my opinion, they are not very good:
wrapping every query inside a try-except structure makes the code bulky, with mostly unnecessary and repetitive code
writing my own 'decorator' function to execute a query, which reinitializes the database connection when I get mariadb.DatabaseError (see the sketch after this list), which seems better, but makes me write wrapper functions around (almost) perfectly working library functions
doing a mostly pointless 'ping' query every N minutes, which puts load on the database and is useless 99.9% of the time.
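For illustration, a minimal sketch of that decorator idea (the names, and the reconnect() helper, are my own assumptions): retry the wrapped method once after reconnecting on mariadb.DatabaseError.
import functools
import mariadb

def reconnect_on_failure(method):
    @functools.wraps(method)
    def wrapper(self, *args, **kwargs):
        try:
            return method(self, *args, **kwargs)
        except mariadb.DatabaseError:
            self.reconnect()   # assumed helper that rebuilds self.conn
            return method(self, *args, **kwargs)
    return wrapper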
Here is some code to illustrate:
import mariadb

class Db:
    ...
    def __init__(self):
        self.conn = mariadb.connect(user=self.__db_user, password=self.__db_pass,
                                    host=self.__db_host, port=self.__db_port,
                                    database=self.__db_name)

    def one_of_many_functions(self, ...):
        cur = self.conn.cursor()
        cur.execute('SELECT ...')  # Here the mariadb.DatabaseError happens after long inactivity; otherwise it runs fine
        ...
I actually really don't understand why Python's mariadb implementation doesn't handle this. When the connection is lost, cur.execute throws a mariadb.DatabaseError, but no recovery action is needed: if I re-run the query on that same database connection, it works again. So the connection does repair itself. Why does the component make me re-query when it 'repairs' the connection itself and could run the query again?
But as it is what it is, my question is: what is the nicest way to handle this?
Even if you set a long timeout value, there is no guarantee that the connection won't drop for other reasons (client timeout, 24-hour disconnect, ...).
An option would be to set auto_reconnect, as in the following example:
import mariadb

conn1 = mariadb.connect()
conn2 = mariadb.connect()

# Force MariaDB Connector/Python to reconnect
conn2.auto_reconnect = True

cursor1 = conn1.cursor()
print("Connid of connection 2: %s" % conn2.connection_id)

# Since we don't want to wait, we kill conn2 intentionally
cursor1.execute("KILL %s" % conn2.connection_id)

cursor2 = conn2.cursor()
cursor2.execute("select connection_id()")
row = cursor2.fetchall()
print("Connid of connection 2: %s" % conn2.connection_id)
print(row)
Output:
Connid of connection 2: 174
Connid of connection 2: 175
[(175,)]
So after connection 2 was killed, next cursor.execute will establish a new connection before executing the statement. This solution will not work if you use an existing open cursor, since the internal statement handle becomes invalid.
Are you using a socket or TCP/IP for connection?
TCP/IP connections are designed to be cleaned up after a period of no traffic. You might say it's idiotic, but there's really no better way to know if a program crashes.
For the same reason, databases have their own timeout mechanism. For MySQL it's called wait_timeout.
Normally, a connection object (or its wrapper) would take care of running some no-op query if there is nothing else going on with the connection, something like select 1. This is a standard practice. Check the documentation for your connection object - it might already be there, you just need to configure it. Use something like 30-60 seconds.
If not, you will have to implement it yourself. It doesn't matter how; the point is that you cannot expect connections to stay open forever. Either make connections short-lived (open one only when you need it and close it afterwards), or implement a timer that runs a no-op query periodically. In the latter case, note that you will need a synchronization mechanism to make sure your application queries never run at the same time as the no-op query.
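A minimal sketch of that periodic no-op approach (the interval and names are my own): a timer re-arms itself every 30 seconds and pings the connection, sharing a lock with the application's own queries so the two never run concurrently.
import threading
import mariadb

conn = mariadb.connect(user="app", database="game")  # hypothetical credentials
conn_lock = threading.Lock()

def keepalive():
    with conn_lock:
        cur = conn.cursor()
        cur.execute("SELECT 1")   # no-op query that keeps the connection alive
        cur.fetchall()
    threading.Timer(30, keepalive).start()   # re-arm the timer

threading.Timer(30, keepalive).start()

# Application queries take the same lock so they never overlap the ping:
with conn_lock:
    cur = conn.cursor()
    cur.execute("SELECT NOW()")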
Have you considered using a connection pool?
# Create a connection pool
pool = mariadb.ConnectionPool(
    # ...,
    pool_size=1
)
Then, in your connection method:
try:
    pconn = pool.get_connection()
except mariadb.PoolError as e:
    # Report the error
    print(f"Error opening connection from pool: {e}")
The documentation doesn't say what happens when connections are closed or broken. I expect that it takes care of that and always tries to provide a valid connection (as long as you're not asking for more connections than are in the pool).
I got the code from their docs.

How to sync Delphi event while running DB operations in a background thread?

Using Delphi 7 & UIB, I'm running database operations in a background thread to eliminate problems like:
Timeout
Priority
Immediate Force-reconnect after network-loss
Non-blocked UI
Keeping an opened DB connection alive
User canceling
I've read ALL related topics here, and realized that using while isMyThreadStillRuning and not UserCanceled do sleep(100); end; isn't the recommended way to do this; rather, use TEvent.WaitFor(3000)...
The solutions here are either about sending signals FROM or TO... the thread, or doing it with messages, but never both ways.
Reading the help file, I've also found TSimpleEvent, which seems to be easier to use.
So what is the recommended way to communicate between Main-UI + DB-Thread in both ways?
Should I simply create 2+2 TSimpleEvent?
to start a new transaction (thread should stop sleeping)
force-STOP execution
to signal back when it's moved to a new stage (transaction started / executed / committed=done)
to signal back if any error happened
or should there be only 1 TEvent?
Update 2:
First tests show:
2x TSimpleEvent is enough (1 for the thread + 1 for the GUI)
Both are created as public properties of the background thread
Force-terminating the thread does not work (too many errors that are impossible to handle)
Better to set a variable like Stop_yourself and let the thread cancel and free itself (while creating a new instance of the same class and trying again)
(still work in progress...)
You should move the query to a TThread. Unfortunately, anonymous threads are not available in D7, so you need to write your own TThread-derived class. Inside it, use its own DB connection to avoid shared resources. From the caller method, you can wait for the thread to end. The results should be stored somewhere in the caller class. Ensure that access to the query's parameters and to the stored result is thread-safe, using a TMutex or TMonitor.

JDBC LockRegistry accross JVMS

Is my application service, which obtains a lock using the JDBC LockRepository, supposed to run inside an @Transactional method?
We have a sample application service that updates a JDBCRepository, and since this application can run on multiple JVMs (headless), we needed a global lock to serialize those updates.
I looked at your test and was hoping my use case would work too. ... JdbcLockRegistryDifferentClientTests
My config has a DefaultLockRepository and a JdbcLockRegistry.
I launched (java -jar boot.jar) my application in two terminals to simulate this. When I obtain a lock and issue tryLock() without @Transactional on my application service, both of them get the lock (albeit one after the other, almost immediately). I expected one of them to NOT get it for at least 10 seconds (the default expiry).
Service (Instance-1) {
    Obtain("KEY-1")
    tryLock()
    DoWork()
    unlock();
    close();
}
Service (Instance-2) {
    Obtain("KEY-1")
    tryLock()  <-- waits until the lock expires or the unlock happens
    DoWork()
    unlock();
    close();
}
I also noticed here DefaultLockRepository that the transaction scope (if not inherited) is only around the JDBC operation.
When I change my service to
@Transactional
Service (Instance-1) {
    Obtain("KEY-1")
    tryLock()
    DoWork()
    unlock();
    close();
}
It works as expected.
I am quite sure I missed something. But I expect my lock operation to honor global locks (the fact that a lock exists in a JDBC store with an expiration) until an unlock or expiration.
Is my understanding incorrect?
This works as designed. I didn't configure the DefaultLockRepository correctly, and the default TTL was shorter than my service's (artificial wait) lock duration. My apologies. :) Josh Long helped me figure this out :)
You have to use different client ids; the same id means the same client, which is for a special use case. Use different client ids, as they are different instances.
The behavior here is subtle (or obvious once you see how it works) and the general lack of documentation is unhelpful, so here's my experience.
I created a lock table by looking at the SQL in DefaultLockRepository, which appeared to imply a composite primary key of REGION, LOCK_KEY and CLIENT_ID - THIS WAS WRONG.
I subsequently found the SQL script in the spring-integration-jdbc JAR, where I could see that the composite primary key MUST be on just REGION and LOCK_KEY, as @ArtemBilan says.
The reason is that the lock doesn't care about the client, obviously, so the primary key must be just the REGION and LOCK_KEY columns. These columns are used when acquiring a lock, and it is the resulting key violation, when another client attempts to insert the same lock row, that keeps other client IDs out.
This also implies that, again as @ArtemBilan says, each client instance must have a unique ID, which is the default behavior when no ID is specified at construction time.

Concurency issue with SELECT FOR UPDATE in Postgres 9.3

I've been struggling with this issue for two days. We have a solution where multiple worker threads try to select job requests from a single database/table, setting a flag on the selected requests and thus effectively preventing the other workers from selecting the same requests.
I created a Java test application to test my queries, and while in normal situations the test executes without issue, in high-contention situations (e.g. 1 table entry with 50 threads; no delays or processing) I still have threads that obtain the same request/entry; interestingly, it happens just as the test starts. I cannot understand why. I've read all the relevant PostgreSQL locking and isolation documentation... While it is possible that the issue is with the test application itself, I suspect that I'm missing something about how SELECT FOR UPDATE works in a READ COMMITTED isolation context.
So the question is: can SELECT FOR UPDATE (with READ COMMITTED isolation) guarantee that a general concurrency issue like the one I described is safely solved?
Acquire query:
UPDATE mytable SET status = 'LOCK'
WHERE ctid IN (SELECT ctid FROM mytable
               WHERE status = 'FREE'
               ORDER BY id
               LIMIT %d
               FOR UPDATE)
RETURNING id, status;
Release query:
UPDATE mytable SET status = 'FREE'
WHERE id = %d AND status = 'LOCK'
RETURNING id, status;
So would you consider these two queries safe, or is there some weird case that would allow two threads to acquire the same row? I'd like to mention that I also tried SERIALIZABLE isolation and it didn't help.
It turns out (how could it be otherwise?) that I made a mistake in my test. I didn't respect the resource acquire/release order. The test was registering the release (decrementing a counter) after the Release query, which let another thread acquire the resource and register it in the meantime. An error from the category you know how to solve but cannot see even after looking several times, because you wrote the code... Peer review helped in the end.
I suppose at this point I have a test to prove that:
The above two queries are safe
You don't need SERIALIZABLE isolation to solve DB acquire/release problems, as long as you use row locking as in SELECT ... FOR UPDATE
You must ORDER BY the results when using row locking (even if you use LIMIT 1!), otherwise you end up with deadlocks
It is safe to acquire multiple resources with one query (LIMIT 2 and above)
Using ctid in the query is safe; it's actually a little faster, but this is insignificant in real-world applications
I'm not sure if this will be helpful to others, but I was getting desperate. So all is good with Postgres 9.3 :)
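For completeness, a hedged sketch of driving the acquire/release cycle from Python with psycopg2 (the post's test was in Java; the table and status values are from the post, the connection details are assumptions):
import psycopg2

conn = psycopg2.connect("dbname=test")  # hypothetical DSN

def acquire(n):
    # Claim up to n FREE rows; returns the ids actually claimed.
    with conn, conn.cursor() as cur:
        cur.execute("""
            UPDATE mytable SET status = 'LOCK'
            WHERE ctid IN (SELECT ctid FROM mytable
                           WHERE status = 'FREE'
                           ORDER BY id
                           LIMIT %s
                           FOR UPDATE)
            RETURNING id, status;
        """, (n,))
        return [row[0] for row in cur.fetchall()]

def release(res_id):
    # Mark one previously claimed row FREE again.
    with conn, conn.cursor() as cur:
        cur.execute("""
            UPDATE mytable SET status = 'FREE'
            WHERE id = %s AND status = 'LOCK'
            RETURNING id, status;
        """, (res_id,))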
Another aspect I'd like to share concerns the speed of the Acquire query with LIMIT 2. See the test result:
Starting test...
DB setup done
All threads created & connections made
All threads started
Thread[36] 186/190/624=1000
Thread[19] 184/201/615=1000
Thread[12] 230/211/559=1000
Thread[46] 175/200/625=1000
Thread[ 9] 205/211/584=1000
...
Thread[ 4] 189/232/579=1000
Thread[ 3] 185/198/617=1000
Thread[49] 218/204/578=1000
Thread[ 1] 204/203/593=1000
...
Thread[37] 177/163/660=1000
Thread[31] 168/199/633=1000
Thread[18] 174/187/639=1000
Thread[42] 178/229/593=1000
Thread[29] 201/229/570=1000
...
Thread[10] 203/198/599=1000
Thread[25] 215/210/575=1000
Thread[27] 248/191/561=1000
...
Thread[17] 311/192/497=1000
Thread[ 8] 365/198/437=1000
Thread[15] 389/176/435=1000
All threads finished
Execution time: 31408
Test done; exiting
Compare the above with this query:
UPDATE mytable SET status = 'LOCK'
WHERE id IN (SELECT t1.id
             FROM (SELECT id FROM mytable
                   WHERE status = 'FREE' ORDER BY id LIMIT 2) AS t1
             FOR UPDATE)
RETURNING id, status;
and the result:
Starting test...
DB setup done
All threads created & connections made
All threads started
Thread[29] 32/121/847=1000
Thread[22] 61/151/788=1000
Thread[46] 36/114/850=1000
Thread[41] 57/132/811=1000
Thread[24] 49/146/805=1000
Thread[13] 47/135/818=1000
...
Thread[20] 48/118/834=1000
Thread[47] 65/152/783=1000
Thread[18] 51/146/803=1000
Thread[ 8] 69/158/773=1000
Thread[14] 56/158/786=1000
Thread[ 0] 66/161/773=1000
Thread[38] 60/148/792=1000
Thread[27] 69/158/773=1000
...
Thread[45] 78/177/745=1000
Thread[30] 96/162/742=1000
...
Thread[32] 162/167/671=1000
Thread[17] 329/156/515=1000
Thread[33] 337/178/485=1000
Thread[37] 381/172/447=1000
All threads finished
Execution time: 15490
Test done; exiting
Conclusion
The test prints, for each thread, how many times the Acquire query returned 2, 1 or 0 resources, totalling the number of test loops (1000).
From the above results we can conclude that we can speed up the query (halving the time!) at the cost of increased thread contention. This means we will more often get 0 resources back from the Acquire query. Technically this is not a problem, because we need to handle that situation anyway.
Of course, the situation changes if you add a wait time (sleeping) when no resources are returned, but choosing a correct value for the wait time depends on the application's performance requirements...
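A minimal sketch of that wait-time idea (the values are illustrative, and acquire/process stand in for the queries above): back off exponentially when nothing is returned, and reset on success.
import time

def worker_loop(acquire, process, max_wait=2.0):
    wait = 0.05                      # initial sleep after an empty result
    while True:
        ids = acquire(2)             # try to grab up to two resources
        if ids:
            for res_id in ids:
                process(res_id)
            wait = 0.05              # got work, so reset the backoff
        else:
            time.sleep(wait)         # contention: wait before retrying
            wait = min(wait * 2, max_wait)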

Delphi threads and variables

So this is my question. Threads are so confusing for me. Let's say I have 5 threads, and 50 or 100 or more sites. As far as I've learned about threads, I can make a constructor create(link: string) and start new threads with different links, but then I would need to make as many threads as the number of links I need to parse. So how can I make the variable link shared between the threads, so that when thread one downloads link listbox1.items[0], it tells the others that number 0 is downloaded, and the next thread asks which link it should download and gets the answer listbox1.items[1], and so on until all links are downloaded, at which point the threads should terminate?
Can anyone provide a simple example of how this can be done? Threads are killing me :(
You could have a thread-safe list of URLs to process, and a static-sized pool of worker threads, each taking one unprocessed URL from the list at a time, processing it (downloading and parsing) and adding any newly found URLs to the list, in a loop, as long as there are unprocessed items in the list. Keep finished URLs in the list, only mark them as done, to avoid recursion.
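A simplified version of that pattern with a fixed URL list, sketched in Python for brevity (the structure maps directly onto a fixed pool of Delphi TThread workers; the URLs are hypothetical):
import queue
import threading

urls = queue.Queue()
for u in ["http://a.example", "http://b.example"]:
    urls.put(u)

def worker():
    while True:
        try:
            url = urls.get_nowait()   # take the next unprocessed URL
        except queue.Empty:
            return                    # list exhausted: this thread terminates
        print("downloading", url)     # placeholder for download + parse
        urls.task_done()

pool = [threading.Thread(target=worker) for _ in range(5)]
for t in pool:
    t.start()
urls.join()                           # wait until every URL is processed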
Sounds like you simply need to set up a critical section.
This needs to be set up around the code segment that reads the next URL. To do this you would typically place a semaphore at the start of the code, so that only one thread can enter it at any time. The semaphore is reset at the end of the code. When a thread sees that the URL list is exhausted, it terminates.
Typically semaphores are boolean, but they can be integers for example if you want to allow a specific number of threads to enter the region at any time.
In your case you can simply set up a global boolean variable (visible to all threads), say "fSemaphore".
At the start of the region, the thread checks the flag. If it is false it sets it to true and enters the region (to get the next URL).
If it is true, then it loops - e.g. repeat sleep(0) until (not fSemaphore).
When it exits the region, it sets fSemaphore := False;
Obviously you need to make sure you guard against a possible infinite loop scenario...
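The same critical-section idea, sketched in Python with a real lock primitive rather than a hand-rolled boolean flag (a plain flag can race between the check and the set; a lock makes the check-and-take atomic). The names are illustrative:
import threading

url_list = ["http://a.example", "http://b.example"]  # hypothetical URLs
next_index = 0
region = threading.Lock()

def get_next_url():
    global next_index
    with region:                     # only one thread inside at a time
        if next_index >= len(url_list):
            return None              # list exhausted: caller should terminate
        url = url_list[next_index]
        next_index += 1
        return url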
Define a 'TURI' class for the request URI, result, error message and anything else needed for the web query, except for the component used for the URI access. Descending from TObject should be fine. Create and initialize 100 of these and push them onto a producer-consumer queue (TObjectQueue, TCriticalSection and a semaphore should do fine). Hang a few TThreads off the queue that loop around and process the TURI instances until the queue is empty, whereupon they block.
You do not say what needs to happen with the processed TURIs - they will need freeing somewhere. If you wish to notify the main thread, PostMessage the completed URIs and free them in the message handler.
Terminate the threads? Sure, if you really have to, then queue up some object that signals them to commit suicide (a NIL, maybe - the threads can check with 'Assigned' just after popping the queue). When doing something like this, I often just leave the threads lying around even if I don't need to process any more URIs during the app run - it's not worth the typing to terminate them.
Sadly, the Delphi examples and, I'm afraid, many textbooks, don't get much further than suspend/resume control of threads (don't do it) and TThread.Synchronize, TThread.WaitFor and TThread.OnTerminate. If you get a textbook like that, take it outside and burn it - you will learn next to nothing good.