I am trying to run several hundred thousand sql update queries using node/mssql. I am trying to:
insert each record individually (if one fails I don't want the batch to fail)
batch the queries so I don't overload the SQL server (I can open a new connection for every query but the server explodes if I do that)
With my existing code (which works 99% of the time) I occasionally get: operation timed out for an unknown reason and I'm hoping someone can suggest a fix, or improvements.
this is what I have:
try {
const sql = require("mssql");
let pool=await new sql.connect(CONFIG_OBJ)
let batchSize=1000
let queries=[
`update xxx set [AwsCoID]='10118' where [PrimaryKey]='10118-78843' IF ##ROWCOUNT=0 insert into xxx([AwsCoID]) values('10118')`,
`update or insert 2`,
`update or insert 3`,....]
for (let i = 0; i < queries.length; i += batchSize) {
let prom = queries
.slice(i, i + batchSize)
.map((qq) => pool.request().query(qq));
for (let p of await (Promise as any).allSettled(prom)) {
//make sure connection is still active after batch finishes
pool=await new sql.connect(cc)
//console.error(`promerr:`, p);
let status: "fulfilled" | "rejected" = p.status;
let value = p.value as SqlResult;
if (status != "fulfilled" || !value.isSuccess) {
console.log(`batchRunSqlCommands() promERR:`, value);
errs.push(value);
}
}
}
} catch (e) {
console.log(`batchSqlCommand err:`, e);
} finally {
pool.close();
}
For anyone else who writes something like I did, the issue is that SQL server does a table lock of the affected rows when doing an upsert. The fix is to add a clustered index that ensures each record being updated is in its own cluster, so the cluster gets locked but only one row is modified within the cluster at a time.
TLDR: set a "line unique" column (eg PrimaryKey) as the clustered index on the table.
This is not good for DB performance, but will quickly and simply solve the issue. You could also intelligently cluster groups of data, but then you would need to ensure your batch update only touched each cluster once and finished before trying to access it again.
Related
I have a question regarding massive calls to PostgreSQL.
This is the scenario:
I have a simple Nodejs app that makes queries to PostgreSQL in a short period of time.
Everything is fine, but sometimes these calls get rejected due to Postgresql maximum pool connections setting, which is equal to 100.
I have in mind to make queue consumption app style, which means adding every query to a queue and then consuming an element every second. By consequence a query to PostgreSQL every second.
But my problem is, Idk where to start. This is the part where I am getting problems with, at some point, I have a lot of calls and I get lots of "ERROR IN QUERY EXECUTION" for the reason explained before.
const pool3 = new Pool(credentialsPostGres);
let res = [];
let sql_call = "select colum1 from table2 where x = y"; //the real query is a bit more complex, but you get the idea.
poll_query.query(sql_call,(err,results) => {
if (err) {
pool3.end();
console.log(err + " ERROR IN QUERY EXECUTION");
} else {
res.push({ data: Object.values(JSON.parse(JSON.stringify(results.rows))) });
pool3.end();
return callback(res,data);
}
})
How I should manage this part into a queue? I am a bit lost.
Help!
I am using Nodejs with SQL Server database. I am using this node-mssql package for writing queries from Nodejs.
I have a route which has a if condition and query structure as below:
let checkPartExists = await pool.query(`Select * from Parts WHERE partID = ${partID}`);
if (checkPartExists.recordset.length == 0){
await pool.query(`INSERT INTO Parts(PartID, Quantity) VALUES(${partID}, ${quantity})`)
}
else{
await pool.query(`UPDATE Parts SET Quantity = ${quantity} WHERE PartID = ${partID}`)
}
Now, if the single threaded Nodejs didn't have any event loop, I could safely assume that this would always work. But I know that that is not the case. I just had an instance where the same partID has been inserted twice.
My understanding is that:
User 1 makes a post request to that route
It executes the select query, finds that this partID does not exist in the parts table and reaches the insert portion
However, before it finishes with the insert User 2 (or maybe the same user) makes a post request and the select query is executed which also thinks that a part with that partID does not exist.
This will insert the same partID twice. Is this called a race condition? How do I prevent such situation?
I know I can make PartID a unique key in the database and throw an error when this happens, but I feel like there has to be a way of handling this through code as well.
Please let me know how you guys/girls are handling such situations.
This is a job for a Transaction. Something like this.
const transaction = new sql.Transaction()
await transaction.begin()
const request = new sql.Reqest(transaction)
let checkPartExists = await request.query(`
Select * from Parts WHERE partID = ${partID}`);
if (checkPartExists.recordset.length == 0){
await request.query(`INSERT INTO Parts(PartID, Quantity) VALUES(${partID}, ${quantity})`)
} else {
await request.query(`UPDATE Parts SET Quantity = ${quantity} WHERE PartID = ${partID}`)
}
await transaction.commit()
This serializes access to the table. Transactions are the standard way of avoiding race conditions.
If only one cluster of node.js application is running, then async-mutex can be the solution. If you are running multiple clusters then distributed deadlocking could be a solution.
Mutex is a design pattern so the resource can not be shared among instances.
I'm doing a few tutorials on CosmosDB. I've got the database set up with the Core (SQL) API, and using Node.js to interface with it. for development, I'm using the emulator.
This is the bit of code that I'm running:
const CosmosClient = require('#azure/cosmos').CosmosClient
process.env.NODE_TLS_REJECT_UNAUTHORIZED = "0";
const options = {
endpoint: 'https://localhost:8081',
key: REDACTED,
userAgentSuffix: 'CosmosDBJavascriptQuickstart'
};
const client = new CosmosClient(options);
(async () => {
let cost = 0;
let i = 0
while (i < 2000) {
i += 1
console.log(i+" Creating record, running cost:"+cost)
let response = await client.database('TestDB').container('TestContainer').items.upsert({}).catch(console.log);
cost += response.requestCharge;
}
})()
This, without fail, stops at around iteration 1565, and doesn't continue. I've tried it with different payloads, without much difference (it may do a few more or a few less iterations, but seems to almsot always be around that number)
On the flipside, a similar .NET Core example works great to insert 10,000 documents:
double cost = 0.0;
int i = 0;
while (i < 10000)
{
i++;
ItemResponse<dynamic> resp = await this.container.CreateItemAsync<dynamic>(new { id = Guid.NewGuid() });
cost += resp.RequestCharge;
Console.WriteLine("Created item {0} Operation consumed {1} RUs. Running cost: {2}", i, resp.RequestCharge, cost);
}
So I'm not sure what's going on.
So, after a bit of fiddling, this doesn't seem to have anything to do with CosmosDB or it's library.
I was running this in the debugger, and Node would just crap out after x iterations. I noticed if I didn't use a console.log it would actually work. Also, if I ran the script with node file.js it also worked. So there seems to be some sort of issue with debugging the script while also printing to the console. Not exactly sure whats up with that, but going to go ahead and mark this as solved
I have an application which checks for new entries in DB2 every 15 seconds on the iSeries using IBM's idb-connector. I have async functions which return the result of the query to socket.io which emits an event with the data included to the front end. I've narrowed down the memory leak to the async functions. I've read multiple articles on common memory leak causes and how to diagnose them.
MDN: memory management
Rising Stack: garbage collection explained
Marmelab: Finding And Fixing Node.js Memory Leaks: A Practical Guide
But I'm still not seeing where the problem is. Also, I'm unable to get permission to install node-gyp on the system which means most memory management tools are off limits as memwatch, heapdump and the like need node-gyp to install. Here's an example of what the functions basic structure is.
const { dbconn, dbstmt } = require('idb-connector');// require idb-connector
async function queryDB() {
const sSql = `SELECT * FROM LIBNAME.TABLE LIMIT 500`;
// create new promise
let promise = new Promise ( function(resolve, reject) {
// create new connection
const connection = new dbconn();
connection.conn("*LOCAL");
const statement = new dbstmt(connection);
statement.exec(sSql, (rows, err) => {
if (err) {
throw err;
}
let ticks = rows;
statement.close();
connection.disconn();
connection.close();
resolve(ticks.length);// resolve promise with varying data
})
});
let result = await promise;// await promise
return result;
};
async function getNewData() {
const data = await queryDB();// get new data
io.emit('newData', data)// push to front end
setTimeout(getNewData, 2000);// check again in 2 seconds
};
Any ideas on where the leak is? Am i using async/await incorrectly? Or else am i creating/destroying DB connections improperly? Any help on figuring out why this code is leaky would be much appreciated!!
Edit: Forgot to mention that i have limited control on the backend processes as they are handled by another team. I'm only retrieving the data they populate the DB with and adding it to a web page.
Edit 2: I think I've narrowed it down to the DB connections not being cleaned up properly. But, as far as i can tell I've followed the instructions suggested on their github repo.
I don't know the answer to your specific question, but instead of issuing a query every 15 seconds, I might go about this in a different way. Reason being that I don't generally like fishing expeditions when the environment can tell me an event occurred.
So in that vein, you might want to try a database trigger that loads the key to the row into a data queue on add, or even change or delete if necessary. Then you can just put in an async call to wait for a record on the data queue. This is more real time, and the event handler is only called when a record shows up. The handler can get the specific record from the database since you know it's key. Data queues are much faster than database IO, and place little overhead on the trigger.
I see a couple of potential advantages with this method:
You aren't issuing dozens of queries that may or may not return data.
The event would fire the instant a record is added to the table, rather than 15 seconds later.
You don't have to code for the possibility of one or more new records, it will always be 1, the one mentioned in the data queue.
yes you have to close connection.
Don't make const data. you don't need promise by default statement.exec is async and handles it via return result;
keep setTimeout(getNewData, 2000);// check again in 2 seconds
line outside getNewData otherwise it becomes recursive infinite loop.
Sample code
const {dbconn, dbstmt} = require('idb-connector');
const sql = 'SELECT * FROM QIWS.QCUSTCDT';
const connection = new dbconn(); // Create a connection object.
connection.conn('*LOCAL'); // Connect to a database.
const statement = new dbstmt(dbconn); // Create a statement object of the connection.
statement.exec(sql, (result, error) => {
if (error) {
throw error;
}
console.log(`Result Set: ${JSON.stringify(result)}`);
statement.close(); // Clean up the statement object.
connection.disconn(); // Disconnect from the database.
connection.close(); // Clean up the connection object.
return result;
});
*async function getNewData() {
const data = await queryDB();// get new data
io.emit('newData', data)// push to front end
setTimeout(getNewData, 2000);// check again in 2 seconds
};*
change to
**async function getNewData() {
const data = await queryDB();// get new data
io.emit('newData', data)// push to front end
};
setTimeout(getNewData, 2000);// check again in 2 seconds**
First thing to notice is possible open database connection in case of an error.
if (err) {
throw err;
}
Also in case of success connection.disconn(); and connection.close(); return boolean values that tell is operation successful (according to documentation)
Always possible scenario is to pile up connection objects in 3rd party library.
I would check those.
This was confirmed to be a memory leak in the idb-connector library that i was using. Link to github issue Here. Basically there was a C++ array that never had it's memory deallocated. A new version was added and the commit can viewed Here.
I have a node.js program calling a Postgres (Amazon RDS micro instance) function, get_jobs within a transaction, 18 times a second using the node-postgres package by brianc.
The node code is just an enhanced version of brianc's basic client pooling example, roughly like...
var pg = require('pg');
var conString = "postgres://username:password#server/database";
function getJobs(cb) {
pg.connect(conString, function(err, client, done) {
if (err) return console.error('error fetching client from pool', err);
client.query("BEGIN;");
client.query('select * from get_jobs()', [], function(err, result) {
client.query("COMMIT;");
done(); //call `done()` to release the client back to the pool
if (err) console.error('error running query', err);
cb(err, result);
});
});
}
function poll() {
getJobs(function(jobs) {
// process the jobs
});
setTimeout(poll, 55);
}
poll(); // start polling
So Postgres is getting:
2016-04-20 12:04:33 UTC:172.31.9.180(38446):XXX#XXX:[5778]:LOG: statement: BEGIN;
2016-04-20 12:04:33 UTC:172.31.9.180(38446):XXX#XXX:[5778]:LOG: execute <unnamed>: select * from get_jobs();
2016-04-20 12:04:33 UTC:172.31.9.180(38446):XXX#XXX:[5778]:LOG: statement: COMMIT;
... repeated every 55ms.
get_jobs is written with temp tables, something like this
CREATE OR REPLACE FUNCTION get_jobs (
) RETURNS TABLE (
...
) AS
$BODY$
DECLARE
_nowstamp bigint;
BEGIN
-- take the current unix server time in ms
_nowstamp := (select extract(epoch from now()) * 1000)::bigint;
-- 1. get the jobs that are due
CREATE TEMP TABLE jobs ON COMMIT DROP AS
select ...
from really_big_table_1
where job_time < _nowstamp;
-- 2. get other stuff attached to those jobs
CREATE TEMP TABLE jobs_extra ON COMMIT DROP AS
select ...
from really_big_table_2 r
inner join jobs j on r.id = j.some_id
ALTER TABLE jobs_extra ADD PRIMARY KEY (id);
-- 3. return the final result with a join to a third big table
RETURN query (
select je.id, ...
from jobs_extra je
left join really_big_table_3 r on je.id = r.id
group by je.id
);
END
$BODY$ LANGUAGE plpgsql VOLATILE;
I've used the temp table pattern because I know that jobs will always be a small extract of rows from really_big_table_1, in hopes that this will scale better than a single query with multiple joins and multiple where conditions. (I used this to great effect with SQL Server and I don't trust any query optimiser now, but please tell me if this is the wrong approach for Postgres!)
The query runs in 8ms on small tables (as measured from node), ample time to complete one job "poll" before the next one starts.
Problem: After about 3 hours of polling at this rate, the Postgres server runs out of memory and crashes.
What I tried already...
If I re-write the function without temp tables, Postgres doesn't run out of memory, but I use the temp table pattern a lot, so this isn't a solution.
If I stop the node program (which kills the 10 connections it uses to run the queries) the memory frees up. Merely making node wait a minute between polling sessions doesn't have the same effect, so there are obviously resources that the Postgres backend associated with the pooled connection is keeping.
If I run a VACUUM while polling is going on, it has no effect on memory consumption and the server continues on its way to death.
Reducing the polling frequency only changes the amount of time before the server dies.
Adding DISCARD ALL; after each COMMIT; has no effect.
Explicitly calling DROP TABLE jobs; DROP TABLE jobs_extra; after RETURN query () instead of ON COMMIT DROPs on the CREATE TABLEs. Server still crashes.
Per CFrei's suggestion, added pg.defaults.poolSize = 0 to the node code in an attempt to disable pooling. The server still crashed, but took much longer and swap went much higher (second spike) than all the previous tests which looked like the first spike below. I found out later that pg.defaults.poolSize = 0 may not disable pooling as expected.
On the basis of this: "Temporary tables cannot be accessed by autovacuum. Therefore, appropriate vacuum and analyze operations should be performed via session SQL commands.", I tried to run a VACUUM from the node server (as some attempt to make VACUUM an "in session" command). I couldn't actually get this test working. I have many objects in my database and VACUUM, operating on all objects, was taking too long to execute each job iteration. Restricting VACUUM just to the temp tables was impossible - (a) you can't run VACUUM in a transaction and (b) outside the transaction the temp tables don't exist. :P EDIT: Later on the Postgres IRC forum, a helpful chap explained that VACUUM isn't relevant for temp tables themselves, but can be useful to clean up the rows created and deleted from pg_attributes that TEMP TABLES cause. In any case, VACUUMing "in session" wasn't the answer.
DROP TABLE ... IF EXISTS before the CREATE TABLE, instead of ON COMMIT DROP. Server still dies.
CREATE TEMP TABLE (...) and insert into ... (select...) instead of CREATE TEMP TABLE ... AS, instead of ON COMMIT DROP. Server dies.
So is ON COMMIT DROP not releasing all the associated resources? What else could be holding memory? How do I release it?
I used this to great effect with SQL Server and I don't trust any query optimiser now
Then don't use them. You can still execute queries directly, as shown below.
but please tell me if this is the wrong approach for Postgres!
It is not a completely wrong approach, it's just a very awkward one, as you are trying to create something that's been implemented by others for a much easier use. As a result, you are making many mistakes that can lead to many problems, including memory leaks.
Compare to the simplicity of the exact same example that uses pg-promise:
var pgp = require('pg-promise')();
var conString = "postgres://username:password#server/database";
var db = pgp(conString);
function getJobs() {
return db.tx(function (t) {
return t.func('get_jobs');
});
}
function poll() {
getJobs()
.then(function (jobs) {
// process the jobs
})
.catch(function (error) {
// error
});
setTimeout(poll, 55);
}
poll(); // start polling
Gets even simpler when using ES6 syntax:
var pgp = require('pg-promise')();
var conString = "postgres://username:password#server/database";
var db = pgp(conString);
function poll() {
db.tx(t=>t.func('get_jobs'))
.then(jobs=> {
// process the jobs
})
.catch(error=> {
// error
});
setTimeout(poll, 55);
}
poll(); // start polling
The only thing that I didn't quite understand in your example - the use of a transaction to execute a single SELECT. This is not what transactions are generally for, as you are not changing any data. I assume you were trying to shrink a real piece of code you had that changes some data also.
In case you don't need a transaction, your code can be further reduced to:
var pgp = require('pg-promise')();
var conString = "postgres://username:password#server/database";
var db = pgp(conString);
function poll() {
db.func('get_jobs')
.then(jobs=> {
// process the jobs
})
.catch(error=> {
// error
});
setTimeout(poll, 55);
}
poll(); // start polling
UPDATE
It would be a dangerous approach, however, not to control the end of the previous request, which also may create memory/connection issues.
A safe approach should be:
function poll() {
db.tx(t=>t.func('get_jobs'))
.then(jobs=> {
// process the jobs
setTimeout(poll, 55);
})
.catch(error=> {
// error
setTimeout(poll, 55);
});
}
Use CTEs to create partial result sets instead of temp tables.
CREATE OR REPLACE FUNCTION get_jobs (
) RETURNS TABLE (
...
) AS
$BODY$
DECLARE
_nowstamp bigint;
BEGIN
-- take the current unix server time in ms
_nowstamp := (select extract(epoch from now()) * 1000)::bigint;
RETURN query (
-- 1. get the jobs that are due
WITH jobs AS (
select ...
from really_big_table_1
where job_time < _nowstamp;
-- 2. get other stuff attached to those jobs
), jobs_extra AS (
select ...
from really_big_table_2 r
inner join jobs j on r.id = j.some_id
)
-- 3. return the final result with a join to a third big table
select je.id, ...
from jobs_extra je
left join really_big_table_3 r on je.id = r.id
group by je.id
);
END
$BODY$ LANGUAGE plpgsql VOLATILE;
The planner will evaluate each block in sequence the way I wanted to achieve with temp tables.
I know this doesn't directly solve the memory leak issue (I'm pretty sure there's something wrong with Postgres' implementation of them, at least the way they manifest on the RDS configuration).
However, the query works, it is query planned the way I was intending and the memory usage is stable now after 3 days of running the job and my server doesn't crash.
I didn't change the node code at all.