Does anybody have experience building concurrent AWS Lambda functions with Postgres?
I have to build a Lambda cron that will ingest thousands of invoices into a Postgres database, and I have to call the ingestion Lambda function concurrently for each invoice. The problem is that, because it is concurrent, each instance of the ingestion function creates its own connection to the database. That means that if I have 1,000 invoices to ingest, each invoice invokes a Lambda function, which creates 1,000 database connections. This exhausts the maximum number of connections that Postgres can handle, and some instances of the invoked Lambda function return an error saying that there are no more connections available.
Any tips on how to handle this problem?
Here are some snippets of my code:
ingestInvoiceList.js
var AWS = require('aws-sdk');
var sftp = require('ssh2-sftp-client');
var lambda = new AWS.Lambda();
exports.handler = async (event) => {
...
let folder_contents;
try {
// fetch list of Zip format invoices
folder_contents = await sftp.list(client_folder);
} catch (err) {
console.log(`[${client}]: ${err.toString()}`);
throw new Error(`[${client}]: ${err.toString()}`);
}
let invoiceCount = 0;
let funcName = 'ingestInvoice';
for (let item of folder_contents) {
if (item.type === '-') {
let payload = JSON.stringify({
invoice: item.name
});
let params = {
FunctionName: funcName,
Payload: payload,
InvocationType: 'Event'
};
// invoke ingestInvoice concurrently
let result = await new Promise((resolve) => {
lambda.invoke(params, (err, data) => {
if (err) resolve(err);
else resolve(data);
});
});
console.log('result: ', result);
invoiceCount++;
}
}
...
}
ingestInvoice.js
var AWS = require('aws-sdk');
var sftp = require('ssh2-sftp-client');
var fs = require('fs');
var JSZip = require('jszip');
var DBClient = require('./db.js');
var lambda = new AWS.Lambda();
exports.handler = async (event) => {
...
let invoice = event.invoice;
let client = 'client name';
let db = new DBClient();
try {
console.log(`[${client}]: Extracting documents from ${invoice}`);
try {
// get zip file from sftp server
await sftp.fastGet(invoice, '/tmp/tmp.zip', {});
} catch (err) {
throw err;
}
let zip;
try {
// extract the zip file...
zip = await new Promise((resolve, reject) => {
fs.readFile("/tmp/tmp.zip", async function (err, data) {
if (err) return reject(err);
let unzippedData;
try {
unzippedData = await JSZip.loadAsync(data);
} catch (err) {
return reject(err);
}
return resolve(unzippedData);
});
});
} catch (err) {
throw err;
}
let unibillRegEx = /unibill.+\.txt/;
let files = {};
zip.forEach(async (path, entry) => {
if (unibillRegEx.exec(entry.name)) {
files['unibillObj'] = entry;
} else {
files['pdfObj'] = entry;
}
});
// await db.getClient().connect();
await db.setSchema(client);
console.log('Schema has been set.');
let unibillStr = await files.unibillObj.async('string');
console.log('ingesting ', files.unibillObj.name);
//Do ingestion queries here...
...
await uploadInvoiceDocsToS3(client, files);
} catch (err) {
console.error(err.stack);
throw err;
} finally {
try {
// console.log('Disconnecting from database...');
// await db.endClient();
console.log('Disconnecting from SFTP...');
await sftp.end();
} catch (err) {
console.log('ERROR: ' + err.toString());
throw err;
}
}
...
}
db.js
var { Pool } = require('pg');
module.exports = class DBClient {
constructor() {
this.pool = new Pool();
}
async setSchema(schema) {
await this.execQuery(`SET search_path TO ${schema}`);
}
async execQuery(sql) {
return await this.pool.query(sql);
}
}
Any answer would be appreciated, thank you!
I see two ways to handle this. Ultimately it depends on how fast you want to process this data.
Change the concurrency setting for your Lambda to use reserved concurrency.
This will allow you to limit the number of concurrent Lambdas running (see this link for more details).
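If it helps, reserved concurrency can also be set from the AWS CLI; the function name below matches your snippets and the limit of 20 is only an illustrative value:
aws lambda put-function-concurrency \
    --function-name ingestInvoice \
    --reserved-concurrent-executions 20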
Change your code to queue the work to be done in an SQS queue. From there you would have to create another Lambda, triggered by the queue, that processes messages as needed. That Lambda could decide how much to pull off the queue at a time, and it too would likely need to be limited in concurrency. But you could tune it to, for example, run for the maximum 15 minutes, which may be enough to empty the queue without killing the DB. Or, if you had a max concurrency of 100, you would process quickly without killing the DB. A sketch of the enqueueing side follows.
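A rough sketch of what the enqueueing side could look like (the QUEUE_URL environment variable is an assumption, not something from your setup):
var AWS = require('aws-sdk');
var sqs = new AWS.SQS();

// Instead of lambda.invoke(), push each invoice onto an SQS queue.
// process.env.QUEUE_URL is a placeholder for your actual queue URL.
async function enqueueInvoice(invoiceName) {
    await sqs.sendMessage({
        QueueUrl: process.env.QUEUE_URL,
        MessageBody: JSON.stringify({ invoice: invoiceName })
    }).promise();
}
The ingestion Lambda is then attached to the queue as an event source with a small batch size and a concurrency limit low enough that the total number of open database connections stays under the Postgres maximum.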
First, you have to initialize your connection outside the handler, so that each time your warm Lambda is executed it won't open a new one:
const db = new DBClient();
exports.handler = async (event) => {
...
await db.query(...)
...
}
If it is node-pg, there is a package that keeps track of all the idle connections, kills them if necessary, and retries in case of the error "sorry, too many clients already":
https://github.com/MatteoGioioso/serverless-pg
Any other custom implemented retry mechanism with backoff will work as well.
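For example, a minimal retry-with-backoff wrapper around node-pg could look like this (a sketch only; the attempt count, delays, and error matching are assumptions to adapt to your setup):
const { Pool } = require('pg');
const pool = new Pool(); // created once, outside the handler

// Retry a query when Postgres refuses new connections,
// backing off exponentially between attempts.
async function queryWithRetry(sql, params, attempts = 5) {
    for (let i = 0; i < attempts; i++) {
        try {
            return await pool.query(sql, params);
        } catch (err) {
            const retriable = /too many clients|remaining connection slots/i.test(err.message);
            if (!retriable || i === attempts - 1) throw err;
            await new Promise(r => setTimeout(r, 100 * 2 ** i)); // 100ms, 200ms, 400ms, ...
        }
    }
}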
There is also one for MySQL: https://github.com/jeremydaly/serverless-mysql
These days a good solution to consider for this problem, on AWS, is RDS Proxy, which acts as a transparent proxy between your lambda(s) and database:
Amazon RDS Proxy allows applications to pool and share connections established with the database, improving database efficiency, application scalability, and security.
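No application-level pooling logic is needed for this; you point your existing pg Pool at the proxy endpoint instead of the database host (the endpoint below is a made-up example):
const { Pool } = require('pg');

// The proxy multiplexes many Lambda invocations onto a small,
// shared set of real Postgres connections.
const pool = new Pool({
    host: 'my-db-proxy.proxy-abc123xyz.us-east-1.rds.amazonaws.com', // hypothetical RDS Proxy endpoint
    database: process.env.DB_NAME,
    user: process.env.DB_USER,
    password: process.env.DB_PASSWORD
});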
Related
This one has had me spinning my tires for about 2 days now, I'm ready to reach out for help :)
I have a Google Firebase Functions app, running as middleware for an Angular SPA. Hoping to avoid some of the pay-by-use cost of Azure SQL, I wanted to implement a caching option for the most common queries.
I thought I knew Redis; I've worked with it before, and there's a simple enough example on the repo: https://www.npmjs.com/package//redis
Everything works fine if it is at the top level.
But the way my application is built, I need the ability to set a cache value from within the .then of a Promise, and when I try to do that, the whole operation just stops, with no identifiable error logging or even a response from Redis. Even in Azure Insights I'm not getting much feedback, only that the 'set' operation isn't being counted in the metrics.
So, just to clarify, this works:
// "cache", the object, set globally
export const testCache = functions.https.onRequest(
async (req: any, res) => {
await cache.connect();
var redisKey = 'testing_global';
var result = await cache.get(redisKey);
await cache.set(redisKey, 'testing new class')
console.log("\nDone");
cache.disconnect();
res.send('done');
})
But, this does not:
import { createClient } from 'redis';
const cache = createClient({
url: "rediss://" + process.env.REDIS_HOST_NAME + ":6380",
password: process.env.REDIS_KEY,
});
export const getValues = functions.https.onRequest(
(req: any, response) => {
cors(req, response, async () => {
response.set('Access-Control-Allow-Origin', origin);
var searchText = req.body['search'];
var offset = req.body['offset'];
var fetch = req.body['fetch'];
var x = req.body['x'];
var y = req.body['y'];
var districts = req.body['districtFilter'];
var sort = req.body['sort'];
if (offset) {
if (!/^\d+$/.test(offset))
throw new Error('bad number');
} else {
offset = 0;
}
if (fetch) {
if (!/^\d+$/.test(fetch))
throw new Error('bad number');
} else {
fetch = 200;
}
if (x) {
if (!/^-?\d+$/.test(x))
throw new Error('bad number');
} else {
x = null;
}
if (y) {
if (!/^-?\d+$/.test(y))
throw new Error('bad number');
} else {
y = null;
}
if (districts && districts.length > 0) {
districts = sanitizeStringArray(districts);
} else {
districts = null;
}
if (sort) {
switch (sort) {
case "scoreAsc":
case "scoreDesc":
case "priceAsc":
case "priceDesc":
break;
default:
sort = '';
}
}
var redisKey = `get_values_${searchText}_${offset}_${fetch}_${x}_${y}_${districts}_${sort}`;
var cacheResult: any = null;
await cache.connect();
// because I want to end up in the 'else', for testing
cacheResult = await cache.getFromCache('someOtherKey');
if (null !== cacheResult) {
response.send({
"status": "success",
"totalCount": cacheResult.totalCount,
"data": cacheResult.result
});
cache.disconnect();
} else {
var connection = new Connection(sqlConfig);
var totalCount: number = 0;
connection.on('connect', function (err: any) {
// If no error, then good to proceed.
console.log("Connected");
var sql = `EXEC SomeSPC;`;
const sqlRequest = new Request(sql, function (err: any) {
if (err) {
console.log(err);
}
});
const countRequest = new Request(
`EXEC SomeOtherSPC;`
, function (err: any) {
if (err) {
console.log(err);
}
}
)
sqlRequest.connection = connection;
countRequest.connection = connection;
var result: any[] = [];
sqlRequest.on('row', function (columns: any[]) {
var rowResult: any = {};
columns.forEach(function (column: any) {
rowResult[column['metadata']['colName']] = column['value'];
});
result.push(rowResult);
});
sqlRequest.on("requestCompleted", function (rowCount: any, more: any) {
console.log(rowCount + ' rows returned');
connection.execSql(countRequest);
countRequest.on('row', function (columns: any[]) {
totalCount = columns[0]['value'];
});
countRequest.on('requestCompleted', async function (rowCount: any, more: any) {
connection.close();
cacheResult = {
totalCount: totalCount
, result: result
};
// ******************************************************************
// Does Not Work, Just Fails, Without Much to Go On
await cache.set(redisKey, 'in Promise')
response.send({
"status": "success",
"totalCount": totalCount,
"data": result
});
})
});
connection.execSql(sqlRequest);
});
connection.on('infoMessage', infoError);
connection.on('errorMessage', infoError);
connection.on('end', end);
connection.on('debug', debug);
connection.connect();
console.log("Reading rows from the Table...");
}
})
}
)
Update: there is no longer a fair amount of pseudo-code here, so there is no need to disregard any inconsistent lines. I went ahead and put in the full function, including all the fluff, since trimming the fat seemed to make it difficult for others to understand what was being asked.
The SQL stuff all works. If I take out the cache.set() call everything is fine, but that one line, in the result handler of the Promise, just fails, and I can't figure out why.
I've tried using the cache locally and globally, extracting the cache operations to a function and then to a separate class, and in all cases I'm getting the same result.
Is there a known reason this wouldn't work?
As far as I can understand, given you didn't provide a reproducible example, your code binds the requestCompleted handler for the second request only after it runs execSql(). I would suggest moving the binding block before that call, otherwise the event may be missed.
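Concretely, something along these lines (a sketch using only the names from your snippet; the value passed to cache.set is left exactly as in the question):
sqlRequest.on('requestCompleted', function (rowCount, more) {
    console.log(rowCount + ' rows returned');

    // Register the handlers for the second request first...
    countRequest.on('row', function (columns) {
        totalCount = columns[0]['value'];
    });
    countRequest.on('requestCompleted', async function () {
        connection.close();
        await cache.set(redisKey, 'in Promise');
        response.send({ "status": "success", "totalCount": totalCount, "data": result });
    });

    // ...and only then execute the count query.
    connection.execSql(countRequest);
});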
I am new to MongoDB. I am trying to query from different collections, and when I fetch data from the category collection (effectively running a select * on the collection), it throws the error MongoError: pool destroyed.
As per my understanding, this is because some find({}) call is creating a pool and that pool is being destroyed.
The code I am using inside the model is below:
const MongoClient = require('mongodb').MongoClient;
const dbConfig = require('../configurations/database.config.js');
export const getAllCategoriesApi = (req, res, next) => {
return new Promise((resolve, reject ) => {
let finalCategory = []
const client = new MongoClient(dbConfig.url, { useNewUrlParser: true });
client.connect(err => {
const collection = client.db(dbConfig.db).collection("categories");
debugger
if (err) throw err;
let query = { CAT_PARENT: { $eq: '0' } };
collection.find(query).toArray(function(err, data) {
if(err) return next(err);
finalCategory.push(data);
resolve(finalCategory);
// db.close();
});
client.close();
});
});
}
My finding here is that when I am using
let query = { CAT_PARENT: { $eq: '0' } };
collection.find(query).toArray(function(err, data) {})
find(query) returns data, but with {} or $gte/$gt it throws the pool error.
The code I have written in the controller is below:
import { getAllCategoriesListApi } from '../models/fetchAllCategory';
const redis = require("redis");
const client = redis.createClient(process.env.REDIS_PORT);
export const getAllCategoriesListData = (req, res, next, query) => {
// Try fetching the result from Redis first in case we have it cached
return client.get(`allstorescategory:${query}`, (err, result) => {
// If that key exists in the Redis store
if (false) {
res.send(result)
} else {
// Key does not exist in Redis store
getAllCategoriesListApi(req, res, next).then( function ( data ) {
const responseJSON = data;
// Save the Wikipedia API response in Redis store
client.setex(`allstorescategory:${query}`, 3600, JSON.stringify({ source: 'Redis Cache', responseJSON }));
res.send(responseJSON)
}).catch(function (err) {
console.log(err)
})
}
});
}
Can anyone tell me what mistake I am making here and how I can fix the pool issue?
Thank you in advance.
I assume that toArray is asynchronous, i.e. it invokes the callback passed to it as results become available (are read from the network).
If this is true, the client.close() call is going to be executed before the results have been read, which would likely yield your error.
The close call needs to be made after you have finished iterating the results.
Separately from this, you should probably not be creating the client instance in the request handler like this. Client instances are expensive to create (they must talk to all of the servers in the deployment before they can actually perform queries) and generally should be created per running process rather than per request.
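A minimal reshuffle of the snippet above illustrating both points (a sketch only, keeping your names; error handling is left out for brevity):
// Create the client once per process, not per request.
const client = new MongoClient(dbConfig.url, { useNewUrlParser: true });
const clientReady = client.connect(); // connect() with no callback returns a promise

export const getAllCategoriesApi = async (req, res, next) => {
    await clientReady;
    const collection = client.db(dbConfig.db).collection('categories');
    const query = { CAT_PARENT: { $eq: '0' } };
    const data = await collection.find(query).toArray();
    return [data]; // no client.close() here; the client is reused across requests
};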
Eight times out of ten everything connects well. That said, I sometimes get a MongoClient must be connected before calling MongoClient.prototype.db error. How should I change my code so it works reliably (100%)?
I tried a code snippet from one of the creators of the Zeit Now platform.
My handler
const { send } = require('micro');
const { handleErrors } = require('../../../lib/errors');
const cors = require('../../../lib/cors')();
const qs = require('micro-query');
const mongo = require('../../../lib/mongo');
const { ObjectId } = require('mongodb');
const handler = async (req, res) => {
let { limit = 5 } = qs(req);
limit = parseInt(limit);
limit = limit > 10 ? 10 : limit;
const db = await mongo();
const games = await db
.collection('games_v3')
.aggregate([
{
$match: {
removed: { $ne: true }
}
},
{ $sample: { size: limit } }
])
.toArray();
send(res, 200, games);
};
module.exports = handleErrors(cors(handler));
My mongo script that reuses the connection in case the lambda is still warm:
// Based on: https://spectrum.chat/zeit/now/now-2-0-connect-to-database-on-every-function-invocation~e25b9e64-6271-4e15-822a-ddde047fa43d?m=MTU0NDkxODA3NDExMg==
const MongoClient = require('mongodb').MongoClient;
if (!process.env.MONGODB_URI) {
throw new Error('Missing env MONGODB_URI');
}
let client = null;
module.exports = function getDb(fn) {
if (client && !client.isConnected) {
client = null;
console.log('[mongo] client discard');
}
if (client === null) {
client = new MongoClient(process.env.MONGODB_URI, {
useNewUrlParser: true
});
console.log('[mongo] client init');
} else if (client.isConnected) {
console.log('[mongo] client connected, quick return');
return client.db(process.env.MONGO_DB_NAME);
}
return new Promise((resolve, reject) => {
client.connect(err => {
if (err) {
client = null;
console.error('[mongo] client err', err);
return reject(err);
}
console.log('[mongo] connected');
resolve(client.db(process.env.MONGO_DB_NAME));
});
});
};
I need my handler to be 100% reliable.
if (client && !client.isConnected) {
client = null;
console.log('[mongo] client discard');
}
This code can cause problems! Even though you're setting client to null, that client still exists, will continue connecting to Mongo, and will not be garbage collected; its connection callback will still run, but inside that callback client will refer to the next client that gets created, which is not necessarily connected.
A common pattern for this kind of code is to only ever return a single promise from the getDB call:
let clientP = null;
function getDb(fn) {
if (clientP) return clientP;
clientP = new Promise((resolve, reject) => {
const client = new MongoClient(process.env.MONGODB_URI, {
useNewUrlParser: true
});
client.connect(err => {
if (err) {
console.error('[mongo] client err', err);
return reject(err);
}
console.log('[mongo] connected');
resolve(client.db(process.env.MONGO_DB_NAME));
});
});
return clientP;
};
I had the same issue. In my case it was caused by calling getDb() before a previous getDb() call had returned. In this case, I believe that 'client.isConnected' returns true, even though it is still connecting.
This was caused by forgetting to put an 'await' before the getDb() call in one location. I tracked down which call it was by outputting a call stack from getDb using:
console.log(new Error().stack);
I don't see the same issue in the sample code in the question, though it could be triggered by another bit of code that isn't shown.
I have written this article about serverless, Lambda, and DB connections. It covers some concepts which could help you find the root cause of your problem, along with examples and use cases for mitigating connection pool issues.
Just by looking at your code I can tell it is missing this:
context.callbackWaitsForEmptyEventLoop = false;
Serverless: Dynamodb x Mongodb x Aurora serverless
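For context, the flag is set on the context object at the top of the handler, assuming an AWS Lambda-style signature (a minimal sketch, not your exact handler):
module.exports.handler = async (event, context) => {
    // Tell Lambda not to wait for the open MongoDB socket to close
    // before freezing the container and returning the response.
    context.callbackWaitsForEmptyEventLoop = false;

    const db = await mongo(); // the getDb helper from the question
    // ... run queries as before and return the result
};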
I've been using node-oracledb for a few months and I've managed to achieve what I have needed to so far.
I'm currently working on a search app that could potentially return about 2m rows of data from a single call. To ensure I don't get a disconnect between the browser and the server, I thought I would try queryStream so that there is a constant flow of data back to the client.
I implemented the queryStream example as-is, and this worked fine for a few hundred thousand rows. However, when the number of returned rows is greater than one million, Node runs out of memory. By logging and watching both client and server log events, I can see that the client is way behind the server in terms of rows sent and received. So, it looks like Node is falling over because it's buffering so much data.
It's worth noting that at this point, my queryStream implementation is within a req/res function called via Express.
To return the data, I do something like....
stream.on('data', function (data) {
rowcount++;
let obj = new myObjectConstructor(data);
res.write(JSON.stringify(obj.getJson()));
});
I've been reading about how streams and pipe can help with flow, so what I'd like to be able to do is to be able to pipe the results from the query to a) help with flow and b) to be able to pipe the results to other functions before sending back to the client.
E.g.
function getData(req, res){
var stream = myQueryStream(connection, query);
stream
.pipe(toSomeOtherFunction)
.pipe(yetAnotherFunction)
.pipe(res);
}
I've spent a few hours trying to find a solution or example that allows me to pipe results, but I'm stuck and need some help.
Apologies if I'm missing something obvious, but I'm still getting to grips with Node and especially streams.
Thanks in advance.
There's a bit of an impedance mismatch here. The queryStream API emits rows of JavaScript objects, but what you want to stream to the client is a JSON array. You basically have to add an open bracket to the beginning, a comma after each row, and a close bracket to the end.
I'll show you how to do this in a controller that uses the driver directly as you have done, instead of using separate database modules as I advocate in this series.
const oracledb = require('oracledb');
async function get(req, res, next) {
try {
const conn = await oracledb.getConnection();
const stream = await conn.queryStream('select * from employees', [], {outFormat: oracledb.OBJECT});
res.writeHead(200, {'Content-Type': 'application/json'});
res.write('[');
stream.on('data', (row) => {
res.write(JSON.stringify(row));
res.write(',');
});
stream.on('end', () => {
res.end(']');
});
stream.on('close', async () => {
try {
await conn.close();
} catch (err) {
console.log(err);
}
});
stream.on('error', async (err) => {
next(err);
try {
await conn.close();
} catch (err) {
console.log(err);
}
});
} catch (err) {
next(err);
}
}
module.exports.get = get;
Once you get the concepts, you can simplify things a bit with a reusable Transform class which allows you to use pipe in the controller logic:
const oracledb = require('oracledb');
const { Transform } = require('stream');
class ToJSONArray extends Transform {
constructor() {
super({objectMode: true});
this.push('[');
}
_transform (row, encoding, callback) {
if (this._prevRow) {
this.push(JSON.stringify(this._prevRow));
this.push(',');
}
this._prevRow = row;
callback(null);
}
_flush (done) {
if (this._prevRow) {
this.push(JSON.stringify(this._prevRow));
}
this.push(']');
delete this._prevRow;
done();
}
}
async function get(req, res, next) {
try {
const toJSONArray = new ToJSONArray();
const conn = await oracledb.getConnection();
const stream = await conn.queryStream('select * from employees', [], {outFormat: oracledb.OBJECT});
res.writeHead(200, {'Content-Type': 'application/json'});
stream.pipe(toJSONArray).pipe(res);
stream.on('close', async () => {
try {
await conn.close();
} catch (err) {
console.log(err);
}
});
stream.on('error', async (err) => {
next(err);
try {
await conn.close();
} catch (err) {
console.log(err);
}
});
} catch (err) {
next(err);
}
}
module.exports.get = get;
Rather than writing your own logic to create a JSON stream, you can use JSONStream to convert an object stream to (stringified) JSON before piping it to its destination (res, process.stdout, etc.). This saves the need to muck around with .on('data', ...) events.
In the example below, I've used pipeline from Node's stream module rather than the .pipe method: the effect is similar (with better error handling, I think). To get objects from oracledb.queryStream, you can specify the option {outFormat: oracledb.OUT_FORMAT_OBJECT} (docs). Then you can make arbitrary modifications to the stream of objects produced. This can be done using a transform stream, made perhaps using through2-map, or, if you need to drop or split rows, through2. Below, the stream is sent to process.stdout after being stringified as JSON, but you could equally send it to Express's res.
require('dotenv').config() // config from .env file
const JSONStream = require('JSONStream')
const oracledb = require('oracledb')
const { pipeline } = require('stream')
const map = require('through2-map') // see https://www.npmjs.com/package/through2-map
oracledb.getConnection({
user: process.env.DB_USER,
password: process.env.DB_PASSWORD,
connectString: process.env.CONNECT_STRING
}).then(connection => {
pipeline(
connection.queryStream(`
select dual.*,'test' as col1 from dual
union select dual.*, :someboundvalue as col1 from dual
`
,{"someboundvalue":"test5"} // binds
,{
prefetchRows: 150, // for tuning
fetchArraySize: 150, // for tuning
outFormat: oracledb.OUT_FORMAT_OBJECT
}
)
,map.obj((row,index) => {
row.arbitraryModification = index
return row
})
,JSONStream.stringify() // false gives ndjson
,process.stdout // or send to express's res
,(err) => { if(err) console.error(err) }
)
})
// [
// {"DUMMY":"X","COL1":"test","arbitraryModification":0}
// ,
// {"DUMMY":"X","COL1":"test5","arbitraryModification":1}
// ]
I would like to know if it's possible to run a series of SQL statements and have them all committed in a single transaction.
The scenario I am looking at is where an array has a series of values that I wish to insert into a table, not individually but as a unit.
I was looking at the following item which provides a framework for transactions in node using pg. The individual transactions appear to be nested within one another so I am unsure of how this would work with an array containing a variable number of elements.
https://github.com/brianc/node-postgres/wiki/Transactions
var pg = require('pg');
var rollback = function(client, done) {
client.query('ROLLBACK', function(err) {
//if there was a problem rolling back the query
//something is seriously messed up. Return the error
//to the done function to close & remove this client from
//the pool. If you leave a client in the pool with an unaborted
//transaction weird, hard to diagnose problems might happen.
return done(err);
});
};
pg.connect(function(err, client, done) {
if(err) throw err;
client.query('BEGIN', function(err) {
if(err) return rollback(client, done);
//as long as we do not call the `done` callback we can do
//whatever we want...the client is ours until we call `done`
//on the flip side, if you do call `done` before either COMMIT or ROLLBACK
//what you are doing is returning a client back to the pool while it
//is in the middle of a transaction.
//Returning a client while its in the middle of a transaction
//will lead to weird & hard to diagnose errors.
process.nextTick(function() {
var text = 'UPDATE account SET money = money + $1 WHERE id = $2';
client.query(text, [100, 1], function(err) {
if(err) return rollback(client, done);
client.query(text, [-100, 2], function(err) {
if(err) return rollback(client, done);
client.query('COMMIT', done);
});
});
});
});
});
My array logic is:
banking.forEach(function(batch) {
client.query(text, [batch.amount, batch.id], function(err, result) {});
});
pg-promise offers a very flexible support for transactions. See Transactions.
It also supports partial nested transactions, aka savepoints.
The library implements transactions automatically, which is what should be used these days, because too many things can go wrong if you try organizing a transaction manually, as you do in your example.
See a related question: Optional INSERT statement in a transaction
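A rough sketch of the array case with pg-promise (assuming db is a pg-promise database object created elsewhere and text is the parameterized statement from your example):
await db.tx(t => {
    // One query per array element; t.batch settles them all inside
    // the same transaction and rolls back if any of them fails.
    const queries = banking.map(batch => t.none(text, [batch.amount, batch.id]));
    return t.batch(queries);
});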
Here's a simple TypeScript solution to avoid pg-promise
import { PoolClient } from "pg"
import { pool } from "../database"
const tx = async (callback: (client: PoolClient) => Promise<void>) => {
const client = await pool.connect();
try {
await client.query('BEGIN')
try {
await callback(client)
await client.query('COMMIT')
} catch (e) {
await client.query('ROLLBACK')
throw e
}
} finally {
client.release()
}
}
export { tx }
Usage:
...
let result;
await tx(async client => {
const { rows } = await client.query<{ cnt: string }>('SELECT COUNT(*) AS cnt FROM users WHERE username = $1', [username]);
result = parseInt(rows[0].cnt) > 0;
});
return result;