Should I close connection after every insert batch? - asyncpg

I need to insert a couple of dozen rows remotely once every 10 seconds. I am not sure whether I should wrap each insert in async with connection, closing and re-opening the connection every 10 s, or just call conn = await ... once and keep a handle on an open connection forever.
Please explain at what point the decision would change for other row counts and insert frequencies.
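For reference, the two options being compared can be sketched roughly as follows. This is only an illustration of the question, not a recommendation: the table in INSERT_SQL, the function names, and the connect and next_batch callables are all hypothetical. With asyncpg, connect could be something like lambda: asyncpg.connect(dsn); it is passed in as a parameter here so the sketch does not depend on a live database.

```python
import asyncio

# Hypothetical table and columns, used only for illustration.
INSERT_SQL = "INSERT INTO readings (sensor, value) VALUES ($1, $2)"


async def insert_with_fresh_connection(connect, rows):
    """Option 1: open a connection, insert one batch, close it again."""
    conn = await connect()
    try:
        await conn.executemany(INSERT_SQL, rows)
    finally:
        await conn.close()


async def insert_on_persistent_connection(connect, next_batch,
                                          interval=10.0, max_batches=None):
    """Option 2: open once and reuse the same connection for every batch.

    next_batch is a hypothetical callable returning the rows to insert.
    max_batches exists only so the loop can be stopped in a demo.
    """
    conn = await connect()
    sent = 0
    try:
        while max_batches is None or sent < max_batches:
            await conn.executemany(INSERT_SQL, next_batch())
            sent += 1
            await asyncio.sleep(interval)
    finally:
        await conn.close()
    return sent
```

Note that the second function has no reconnect logic, so a dropped connection would surface as an exception from executemany.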

Related

Slow performance of basic Parse server queries

We have Parse Server using PostgreSQL as its database. The problem is that it runs simple queries extremely slowly compared to running the SQL directly or running it from JS via pg.
For example, getting all users (~5k rows in the table) takes a couple of seconds. Getting sessions and users takes from 3 seconds up to 8-10 seconds in extreme cases. Example:
let Session = Parse.Object.extend("_Session");
let sessionQuery = new Parse.Query(Session);
sessionQuery.include("user");
sessionQuery.limit(100000);
let sessions = await sessionQuery.find({ useMasterKey: true });
This segment is slow: it takes 2-3 seconds, sometimes up to 8, while running the SQL takes 100 ms or so. There are only ~5k users and 9k sessions. We tried setting the direct access variable, and we checked that it is indeed properly set inside the parseServer.js source. Currently we are moving all select-type queries to SQL, but it would be preferable to find a solution.

How to use synchronous messages on rabbit queue?

I have a Node.js function that needs to be executed for each order in my application. In this function my app gets an order number from an Oracle database, processes the order, and then adds 1 to that number in the database (this has to be the last step in the function, because an order can fail, in which case the number is not used).
If all orders received at time T are processed at the same time (asynchronously), the same order number will be used for multiple orders, and I don't want that.
So I used RabbitMQ to try to remedy this, since it is a queue. The processes seem to finish in the order they should, but a second process does NOT wait for the first one to finish (ack) before it begins, so in the end I have the same problem of using the same order number multiple times.
Is there any way I can configure my queue to process one message at a time, so that process n+1 only starts once process n has been acknowledged?
This would be a life saver to me!
If the problem is to avoid duplicate order numbers, then use an Oracle sequence, or use an identity column when you insert into a table to generate the order number:
CREATE TABLE mytab (
id NUMBER GENERATED BY DEFAULT ON NULL AS IDENTITY(START WITH 1),
data VARCHAR2(20));
INSERT INTO mytab (data) VALUES ('abc');
INSERT INTO mytab (data) VALUES ('def');
SELECT * FROM mytab;
This will give:
ID DATA
---------- --------------------
1 abc
2 def
If the problem is that you want orders to be processed sequentially, then don't pull an order from the queue until the previous one is finished. This will limit your throughput, so you need to understand your requirements and make some architectural decisions.
Overall, it sounds like Oracle Advanced Queuing would be a good fit. See the node-oracledb documentation on AQ.
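The sequential option described above (do not pull the next order until the previous one is finished) can be sketched as a single consumer loop. This is a minimal illustration in Python using an in-process asyncio.Queue as a stand-in for the message queue and an in-memory counter as a stand-in for the number stored in the database; process_orders, handle_order, and the None sentinel are hypothetical names chosen for the sketch.

```python
import asyncio


async def process_orders(queue, handle_order, first_number=1):
    """Take one order at a time; fetch the next only after the current one ends."""
    next_number = first_number
    used = []
    while True:
        order = await queue.get()
        if order is None:              # sentinel: no more orders
            break
        try:
            await handle_order(order, next_number)
        except Exception:
            # Failed order: the number is not consumed and is reused next time.
            continue
        used.append((order, next_number))
        next_number += 1               # increment only after success
    return used
```

Because there is only one consumer and it awaits each order before calling queue.get() again, two orders can never see the same number. As the answer notes, the cost is throughput.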

Running a repetitive task in Node.js for each row in a postgres table on a different interval for each row

What would be a good approach to running a repetitive task in Node.js for each row in a large Postgres table, on a different interval for each row?
To give you some more context, here's a quick description of the application:
It's a chat based customer support app.
It consists of teams, which can be either a client team or a support team. Teams have users, which can be either client users or support users.
Client users send messages to a support team and wait for one of that team's users to answer their question.
When there's an unanswered client message waiting for a response, every agent for the receiving support team will receive a notification every n seconds (n being set on a per-team basis by the team admin).
So this task needs to infinitely loop through the rows in the teams table and send notifications if:
The team has messages waiting to be answered.
N seconds have passed since the last notification was sent (N being the number of seconds set by the team admin).
There might be a better approach to this condition altogether.
So my questions are:
What is an efficient way to infinitely loop through a Postgres table with no upper limit on the number of rows?
Should I load 1 row at a time? Several at a time?
What would be a good way to do this in Node?
I'm using Knex. Does Knex provide a mechanism for lazy loading a table and iterating through the rows?
A) A repetitive task can be run in Node with the built-in JavaScript function 'setInterval'.
// run intervalFnc() every 5 seconds
const timerId = setInterval(intervalFnc, 5000);
function intervalFnc() { console.log("Hello"); }
// to stop running it:
clearInterval(timerId);
Your interval function can then do the actual work. An alternative would be to use cron (Linux) or another OS process scheduler to trigger the function. I would use setInterval if the task should run every minute and a cron job if it should run every hour; for intervals in between, the choice is more debatable.
B) An efficient way...
B-1) Retrieving a block of records from a DB will be more efficient than one at a time. Knex has .offset and .limit clauses to choose a group of records to retrieve. A sample from the knex doc:
knex.select('*').from('users').limit(10).offset(30)
B-2) Indexed access is important for performance if your tables are very large. I would recommend adding a status flag field to your table to mark which records are 'in-process', plus a "next-review-timestamp" field, and indexing both fields. Retrieve the records that have status_flag='in-process' AND next_review_timestamp <= now(). Sample:
knex('users').where('status_flag', 'in-process').whereRaw('next_review_timestamp <= now()')
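The same idea (fetch only the rows that are due, in batches, then push their next review time forward) can be sketched end to end. The example below uses Python's built-in sqlite3 module purely as a stand-in so that it runs without a Postgres server; the teams table, the interval_seconds column, and the function names are assumptions made for the illustration, while status_flag and next_review_timestamp follow the fields suggested above.

```python
import sqlite3


def fetch_due_teams(db, now, batch_size=100):
    """Return up to batch_size teams whose next review time has passed."""
    return db.execute(
        "SELECT id, interval_seconds FROM teams "
        "WHERE status_flag = 'in-process' AND next_review_timestamp <= ? "
        "ORDER BY next_review_timestamp LIMIT ?",
        (now, batch_size),
    ).fetchall()


def reschedule(db, team_id, interval_seconds, now):
    """Move a team's next review time forward by its own interval."""
    db.execute(
        "UPDATE teams SET next_review_timestamp = ? WHERE id = ?",
        (now + interval_seconds, team_id),
    )


def make_demo_db():
    """Build a small in-memory table with an index on both filter columns."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE teams (id INTEGER PRIMARY KEY, status_flag TEXT, "
        "interval_seconds INTEGER, next_review_timestamp INTEGER)"
    )
    db.execute(
        "CREATE INDEX teams_due ON teams (status_flag, next_review_timestamp)"
    )
    db.executemany(
        "INSERT INTO teams VALUES (?, ?, ?, ?)",
        [(1, "in-process", 30, 100),
         (2, "in-process", 60, 500),
         (3, "idle", 30, 50)],
    )
    return db
```

Each pass of the periodic task would call fetch_due_teams, send the notifications, and call reschedule for every row it handled, so a row is not returned again until its own interval has elapsed.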
Hope this helps!

PYODBC Connection Not Closing

I'm using threading to execute multiple SQL queries simultaneously. I first append the connections to a list called connections as such:
import pyodbc
connections = []
num_connections = 100
for i in range(num_connections):
    connections.append(pyodbc.connect('connection_string'))
This works and so does threading multiple queries. However, if I run this process many times, I get the error
('HY000', '[HY000] [Oracle][ODBC][Ora]ORA-12519: TNS:no appropriate service handler found
(12519) (SQLDriverConnect); [HY000] [Oracle][ODBC][Ora]ORA-12519: TNS:no appropriate service handler found (12519)')
I'm fairly sure this is because the number of ODBC connections becomes too high. When I try closing them with
pyodbc.pooling = False
for i in connections:
    i.close()
    del i
del connections
the list connections is deleted. However, it doesn't appear it closed any connections because the next time I run pyodbc.connect('connection_string') I immediately get the same error. Any ideas on what this could be?
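One way to make sure every connection in such a list is closed, even when a query or a later connect call raises, is to register each one with contextlib as it is opened. The sketch below is not a diagnosis of the ORA-12519 error; run_with_connections and its parameters are hypothetical names, and connect stands for something like lambda: pyodbc.connect('connection_string'). Separately, pyodbc's documentation describes pyodbc.pooling as a setting that has to be assigned before the first connection is made, so assigning it just before closing may have no effect.

```python
from contextlib import ExitStack, closing


def run_with_connections(connect, count, work):
    """Open `count` connections, run `work(connections)`, then close them all.

    `connect` is any zero-argument callable returning an object with close().
    """
    with ExitStack() as stack:
        # closing() calls conn.close() when the stack unwinds, including when
        # a later connect() or the work itself raises an exception.
        connections = [
            stack.enter_context(closing(connect())) for _ in range(count)
        ]
        return work(connections)
```

With this shape, the explicit close loop and the del statements are no longer needed, because the connections are closed when the with block exits.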

Cassandra COPY consistently fails

I was trying to import a CSV with about 20 million rows.
I did a pilot run with a CSV of a few hundred rows, just to check that the columns were in order and that there were no parsing errors. All went well.
Every time I tried importing the 20 million row CSV, it failed after varying amounts of time. On my local machine it failed after 90 minutes with the following error. On the server box it fails within 10 minutes:
Processed 4050000 rows; Write: 624.27 rows/ss
code=1100 [Coordinator node timed out waiting for replica nodes' responses] message="Operation timed out - received only 0 responses." info=
{'received_responses': 0, 'required_responses': 1, 'write_type': 0, 'consistency': 1}
Aborting import at record #4050617. Previously-inserted values still present.
4050671 rows imported in 1 hour, 26 minutes, and 43.649 seconds.
Cassandra: Coordinator node timed out waiting for replica nodes' responses (it is a one-node cluster with a replication factor of 1, so why it is waiting for other nodes is another question).
Then, based on a recommendation in another thread, I changed the write timeout, though I was not convinced it was the root cause.
write_request_timeout_in_ms: 20000
(Also tried changing it to 300000)
But it still eventually fails.
So now, I have chopped the original CSV into many 500,000 line CSVs.
This has a better success rate (compared to 0!). But even this fails 2 of 5 times for various reasons.
Sometimes I get the following error:
Processed 460000 rows; Write: 6060.32 rows/ss
Connection heartbeat failure
Aborting import at record #443491. Previously inserted records are still present, and some records after that may be present as well.
Other times it just stops updating the progress on console and the only way out is to abort using Ctrl+C
I've spent most of the day like this. My RDBMS is running happily with 5 billion rows. I wanted to test Cassandra with 10 times as much data but I'm having trouble even importing a million rows at a time.
One observation about how the COPY command proceeds: once the command is entered, it starts writing at a rate of about 10,000 rows per second. It sustains this speed until it has inserted about 80,000 rows. Then there is a pause of about 30 seconds, after which it consumes another 70,000 to 90,000 rows, pauses for another 30 seconds, and so on until it either finishes all rows in the CSV, fails midway with an error, or simply hangs.
I need to get to the root of this. I really hope to find that I am doing something silly and it's not something I have to accept and work around.
I am using Cassandra 2.2.3
A lot of people have trouble with the COPY command. It seems to work for small datasets, but it starts to fail when you have a lot of data.
The documentation recommends the SSTable loader if you have a few million rows to import. I used it at my company and had a lot of consistency problems.
I have tried everything, and for me the safest way to import a large amount of data into Cassandra is to write a small script that reads your CSV and then executes async queries. Python does this very well.
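A minimal sketch of such a script, assuming the DataStax Python driver: session would be a cassandra.cluster.Session and insert_statement a prepared INSERT, both created elsewhere. The function name load_csv and the max_in_flight parameter are hypothetical. The sketch keeps a bounded number of writes pending so the node is not flooded, which is an added design choice rather than something the answer specifies.

```python
import csv
from collections import deque


def load_csv(session, insert_statement, csv_lines, max_in_flight=100):
    """Insert each CSV row with execute_async, keeping a bounded number pending.

    csv_lines is any iterable of text lines, for example an open file.
    Values arrive as strings; convert them first if the columns are not text.
    """
    pending = deque()
    inserted = 0
    for row in csv.reader(csv_lines):
        pending.append(session.execute_async(insert_statement, row))
        if len(pending) >= max_in_flight:
            # Block until the oldest write finishes; result() raises on failure.
            pending.popleft().result()
            inserted += 1
    while pending:                     # drain the remaining writes
        pending.popleft().result()
        inserted += 1
    return inserted
```

Because result() raises when a write fails, the script stops at the first error; a retry policy would have to be added on top of this.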
Will is correct. COPY is meant for small data sets and usually struggles when you start hitting the millions of rows. In addition to SSTable loader, there's this utility: https://github.com/brianmhess/cassandra-loader which I find to have very good performance with some added convenience.
