Why does a simple SQL query cause a significant slowdown in my Lambda function? - node.js

I built a basic node.js API hosted on AWS Lambda and served over AWS API Gateway. This is the code:
'use strict';
// Require and initialize outside of your main handler
const mysql = require('serverless-mysql')({
  config: {
    host     : process.env.ENDPOINT,
    database : process.env.DATABASE,
    user     : process.env.USERNAME,
    password : process.env.PASSWORD
  }
});
// Import the Dialogflow module from the Actions on Google client library.
const {dialogflow} = require('actions-on-google');
// Instantiate the Dialogflow client.
const app = dialogflow({debug: true});
// Handle the Dialogflow intent named 'trip name'.
// The intent collects a parameter named 'tripName'.
app.intent('trip name', async (conv, {tripName}) => {
  // Run your query
  let results = await mysql.query('SELECT * FROM tablename where field = ? limit 1', tripName);
  // Respond with the user's lucky number and end the conversation.
  conv.close('Your lucky number is ' + results[0].id);
  // Run clean up function
  await mysql.end();
});
// Set the DialogflowApp object to handle the HTTPS POST request.
exports.fulfillment = app;
It receives a parameter (a trip name), looks it up in MySQL and returns the result.
My issue is that the API takes more than 5 seconds to respond, which is slow.
I'm not sure why it's slow. The database is a powerful Amazon Aurora instance and Node.js is supposed to be fast.
I tested the function from the same AWS region as the MySQL database (Mumbai) and it still times out, so I don't think the distance between regions is the cause.
The slowness comes from carrying out any SQL query, even a dead simple SELECT. It does bring back the correct result, but slowly.
When I remove the SQL part it becomes blazing fast. I increased the Lambda memory to the maximum and redeployed Aurora on a far more powerful instance.

Lambda functions will run faster if you configure more memory. The less memory configured, the worse the performance.
This means if you have configured your function to use 128MB, it's going to run on very low-profile hardware.
On the other hand, if you configure it to use 3GB, it will run on a very decent machine.
At 1792MB, your function will run on hardware with a dedicated core, which will speed up your code significantly considering you are making IO calls (network requests, for example). You can see this information here.
There's no magic formula, though. You have to run a few tests and see what memory configuration best suits your application. I would start with 3GB and gradually decrease it in chunks of 128MB until you find the right configuration.
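If you want to script that tuning rather than click through the console, here is a minimal sketch using the AWS SDK for JavaScript (v2); the function name, region, and memory size are assumptions you would replace with your own values.
// tune-memory.js - bump a Lambda function's memory allocation (hypothetical function name)
const AWS = require('aws-sdk');

const lambda = new AWS.Lambda({ region: 'ap-south-1' }); // Mumbai, as in the question

async function setMemory(functionName, memorySize) {
  // MemorySize must be a multiple of 64MB between 128 and 3008 (at the time of writing)
  const result = await lambda.updateFunctionConfiguration({
    FunctionName: functionName,
    MemorySize: memorySize
  }).promise();
  console.log(`${result.FunctionName} now has ${result.MemorySize}MB`);
}

setMemory('my-fulfillment-function', 1792).catch(console.error);
Re-run your timing test after each change; because CPU scales with memory, the query latency should drop noticeably well before you reach the maximum setting.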

Related

Firebase Functions: How to maintain 'app-global' API client?

How can I achieve an 'app-wide' global variable that is shared across Cloud Function instances and function invocations? I want to create a truly 'global' object that is initialized only once for the lifetime of all my functions.
Context:
My app's entire backend is Firestore + Firebase Cloud Functions. That is, I use a mix of background (Firestore) triggers and HTTP functions to implement backend logic. Additionally, I rely on a 3rd-party location service to continually listen to location updates from sensors. I want just a single instance of the client on which to subscribe to these updates.
The problem is that Firebase/Google Cloud Functions are stateless, meaning that function instances don't share memory/objects/state. If I call functionA, functionB, functionC, there's going to be at least 3 instances of locationService clients created, each listening separately to the 3rd party service so we end up with duplicate invocations of the location API callback.
Sample code:
// index.js
const functions = require("firebase-functions");
exports.locationService = require('./location_service');
this.locationService.initClient();
// define callable/HTTP functions & Firestore triggers
...
and
// location_service.js
var tracker = require("third-party-tracker-js");

const self = (module.exports = {
  initClient: function () {
    tracker.initialize('apiKey')
      .then((client) => {
        client.setCallback(async function (payload) {
          console.log("received location update: ", payload);
          // process the payload ...
          // with multiple function instances running at once, we receive as many callbacks for each location update
        });
        client.subscribeProject()
          .then((subscription) => {
            subscription.subscribe()
              .then((subscribeMsg) => {
                console.log("subscribed to project with message: ", subscribeMsg); // success
              });
            // subscription.unsubscribe(); // ??? at what point should we unsubscribe?
          })
          .catch((err) => {
            throw (err);
          });
      })
      .catch((err) => {
        throw (err);
      });
  },
});
I realize what I'm trying to do is roughly equivalent to implementing a daemon in a single-process environment, and it appears that serverless environments like Firebase/Google Cloud Functions aren't designed to support this need because each instance runs as its own process. But I'd love to hear any contrary ideas and possible workarounds.
Another idea...
Inspired by this related SO post and the official GCF docs on stateless functions, I thought about using Firestore to persist a tracker value that allows us to conditionally initialize the API client. Roughly like this:
// read value from db; only initialize the client if there's no valid subscription
let locSubscriberActive = await getSubscribeStatusFromDb();
if (!locSubscriberActive) {
  this.locationService.initClient();
}
// in `location_service.js`, do setSubscribeStatusToDb(); // set flag to true when we call subscribe(). reset when we get terminated
The problem I face: at what point do I unset/reset that value? Intuitively, I would do so the moment the function instance that initialized the client gets recycled/killed. However, it appears that it is not possible to know when a Firebase Cloud Function instance is terminated. I searched everywhere but couldn't find docs on how to detect such an event...
What you're trying to do is not at all supported in Cloud Functions. It's important to realize that there may be any number of server instances allocated for each deployed function. That's how Cloud Functions scales up and down to match the load on the function in a cost-effective way. These instances might be terminated at any time for any reason. You have no indication when an instance terminates.
Also, instances are not capable of performing any computation when they are idle. CPU resources are clamped down after a function terminates, and are spun up again when the next function is invoked on that instance. You can't have any "daemon" code running when a function is not actively being invoked. I don't know what your locationService does, but it is certainly doing nothing at all after a function terminates, regardless of how it terminated.
For any sort of long-running or daemon-like code, Cloud Functions is not a suitable product. You should instead consider using another product that lets you run code 24/7 without disruption. App Engine and Compute Engine are viable alternatives, and you will have to think carefully about whether and how you want their server instances to scale with load.
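As an illustration of that split, here is a minimal sketch of moving the listener into a small always-on Node process (on Compute Engine, for example) that writes each update into Firestore, so your Cloud Functions can react to it via ordinary Firestore triggers. The tracker API calls mirror the ones in the question; the collection name and environment variable are assumptions.
// daemon.js - runs 24/7 on an always-on VM, NOT inside Cloud Functions (sketch)
const admin = require("firebase-admin");
var tracker = require("third-party-tracker-js"); // same 3rd-party SDK as in the question

admin.initializeApp(); // uses Application Default Credentials on GCP
const db = admin.firestore();

tracker.initialize(process.env.TRACKER_API_KEY)
  .then((client) => {
    client.setCallback(async (payload) => {
      // Persist the update once; Firestore-triggered functions can fan out from here.
      await db.collection("locationUpdates").add({
        payload: payload,
        receivedAt: admin.firestore.FieldValue.serverTimestamp(),
      });
    });
    return client.subscribeProject()
      .then((subscription) => subscription.subscribe());
  })
  .catch((err) => {
    console.error("tracker init failed", err);
    process.exit(1); // let the VM's process manager restart us
  });
With this layout there is exactly one subscriber regardless of how many function instances are running, which is the property the question is after.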

What is the best way to stream data in real time into Big Query (using Node)?

I want to stream HTTP requests into BigQuery, in real time (or near real time).
Ideally, I would like to use a tool that provides an endpoint to stream HTTP requests to and allows me to write simple Node such that:
1. I can add the appropriate insertId so BigQuery can dedupe requests if necessary and
2. I can batch the data so I don't send a single row at a time (which would result in unnecessary GCP costs)
I have tried using AWS Lambdas or Google Cloud Functions but the necessary setup for this problem on those platforms far exceeds the needs of the use case here. I assume many developers have this same problem and there must be a better solution.
Since you are looking for a way to stream HTTP requests to BigQuery and also to send them in batches to minimize Google Cloud Platform costs, you might want to take a look at the public documentation where this is explained.
You can also find a Node.js template on how to perform the stream insert into BigQuery:
// Imports the Google Cloud client library
const {BigQuery} = require('@google-cloud/bigquery');

/**
 * TODO(developer): Uncomment the following lines before running the sample.
 */
// const projectId = "your-project-id";
// const datasetId = "my_dataset";
// const tableId = "my_table";
// const rows = [{name: "Tom", age: 30}, {name: "Jane", age: 32}];

// Creates a client
const bigquery = new BigQuery({
  projectId: projectId,
});

// Inserts data into a table
async function insertRows() {
  await bigquery
    .dataset(datasetId)
    .table(tableId)
    .insert(rows);
  console.log(`Inserted ${rows.length} rows`);
}

insertRows();
As for the batch part, the recommended batch size is around 500 rows per request, even though the limit is 10,000 rows. More information about the quotas and limits for streaming inserts can be found in the public documentation.
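Since the question also asks about insertId for best-effort deduplication, here is a minimal sketch of passing it through using the Node.js client's raw row format; the table columns and the idea of deriving insertId from a per-request ID are assumptions.
// Sketch: batched insert with explicit insertId values for best-effort deduplication
const {BigQuery} = require('@google-cloud/bigquery');
const bigquery = new BigQuery();

async function insertBatch(datasetId, tableId, requests) {
  // Each raw row carries its own insertId; BigQuery uses it to drop duplicates
  // that arrive within its deduplication window.
  const rawRows = requests.map((req) => ({
    insertId: req.requestId,               // hypothetical unique ID per HTTP request
    json: { path: req.path, ts: req.ts },  // hypothetical table columns
  }));

  await bigquery
    .dataset(datasetId)
    .table(tableId)
    .insert(rawRows, { raw: true }); // raw: true means the rows already include insertId
}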
You can make use of Cloud Functions. With Cloud Functions you can build your own API in Node.js and use it to stream data into BigQuery.
The target architecture for STREAM would look like this:
Pub/Sub subscription (push type) -> Google Cloud Function -> Google BigQuery
You can use the same API in batch mode as well, with the help of Cloud Composer (i.e. Apache Airflow) or Cloud Scheduler to schedule it as per your requirements.
The target architecture for BATCH would look like this:
Cloud Scheduler/Cloud Composer -> Google Cloud Function -> Google BigQuery
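A minimal sketch of the streaming leg of that architecture (a background Cloud Function subscribed to a Pub/Sub topic that writes each message into BigQuery) might look like the following; the dataset, table, and message shape are assumptions.
// Background Cloud Function triggered by a Pub/Sub topic (sketch)
const {BigQuery} = require('@google-cloud/bigquery');
const bigquery = new BigQuery();

exports.pubsubToBigQuery = async (message, context) => {
  // Pub/Sub delivers the payload base64-encoded in message.data
  const payload = JSON.parse(Buffer.from(message.data, 'base64').toString());

  await bigquery
    .dataset('http_events')        // hypothetical dataset
    .table('requests')             // hypothetical table
    .insert([{ insertId: context.eventId, json: payload }], { raw: true });
};
Using the Pub/Sub event ID as insertId gives you deduplication for free if Pub/Sub redelivers a message.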

Node app that fetches, processes, and formats data for consumption by a frontend app on another server

I currently have a frontend-only app that fetches 5-6 different JSON feeds, grabs some necessary data from each of them, and then renders a page based on said data. I'd like to move the data fetching / processing part of the app to a server-side node application which outputs one simple JSON file which the frontend app can fetch and easily render.
There are two noteworthy complications for this project:
1) The new backend app will have to live on a different server than its frontend counterpart
2) Some of the feeds change fairly often, so I'll need the backend processing to constantly check for changes (every 5-10 seconds). Currently with the frontend-only app, the browser fetches the latest versions of the feeds on load. I'd like to replicate this behavior as closely as possible
My thought process for solving this took me in two directions:
The first is to set up an Express application that uses setTimeout to constantly check for new data to process. This data is then sent as the response to a simple GET request:
const express = require('express');
let app = express();
let processedData = {};
const port = process.env.PORT || 3000;

const getData = () => {...} // returns a promise that fetches and processes data

/* use an immediately invoked function with setTimeout to fetch the data
 * when the program starts and then once every 5 seconds after that */
(function refreshData() {
  getData().then((data) => {
    processedData = data;
  });
  setTimeout(refreshData, 5000);
})();

app.get('/', (req, res) => {
  res.send(processedData);
});

app.listen(port, () => {
  console.log(`Started on port ${port}`);
});
I would then run a simple get request from the client (after properly adjusting CORS headers) to get the JSON object.
My questions about this approach are pretty generic: Is this even a good solution to this problem? Will this drive up hosting costs based on processing / client GET requests? Is setTimeout a good way to have a task run repeatedly on the server?
The other solution I'm considering would deal with setting up an AWS Lambda that writes the resulting JSON to an s3 bucket. It looks like the minimum interval for scheduling an AWS Lambda function is 1 minute, however. I imagine I could set up 3 or 4 identical Lambda functions and offset them by 10-15 seconds, however that seems so hacky that it makes me physically uncomfortable.
Any suggestions / pointers / solutions would be greatly appreciated. I am not yet a super experienced backend developer, so please ELI5 wherever you deem fit.
A few pointers.
Use cron tasks for periodic processing of the data. This is far preferable, especially if you are formatting a lot of data.
Don't set up multiple Lambda functions for the same task. It's going to be messy to maintain all those functions.
After fetching and processing the feeds, you can store the resulting JSON file on your own server or in S3. Note that if it's S3, you are paying for and waiting on a network operation. You can read the file from your Express app and just send the response back to your clients.
Depending on the file size and the load on your server, you might want to add a caching layer so that you can cache the response until new JSON data is available.
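For illustration, here is a minimal sketch of that approach under a few assumptions: the node-cron package for scheduling, the AWS SDK v2 for the S3 upload, and a hypothetical bucket, key, and fetchAndProcessFeeds() helper standing in for your existing processing code.
// refresh-feeds.js (sketch)
const cron = require('node-cron');
const AWS = require('aws-sdk');

const s3 = new AWS.S3();

// Placeholder for your existing fetch/process logic; returns a plain object.
async function fetchAndProcessFeeds() {
  return { updatedAt: new Date().toISOString() /* ...processed feed data... */ };
}

// Every 10 seconds (node-cron accepts an optional seconds field).
cron.schedule('*/10 * * * * *', async () => {
  try {
    const data = await fetchAndProcessFeeds();
    await s3.putObject({
      Bucket: 'my-feed-bucket',        // hypothetical bucket
      Key: 'processed/feeds.json',     // hypothetical key
      Body: JSON.stringify(data),
      ContentType: 'application/json',
    }).promise();
  } catch (err) {
    console.error('feed refresh failed', err);
  }
});
The frontend then fetches a static JSON object from S3 (or your server), which keeps the per-request cost on your side close to zero.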

Google cloud datastore slow (>800ms) with simple query from compute engine

When I try to query the Google Cloud Datastore from a (micro) compute engine, it usually takes >800ms to get a reply. The best I got was 450ms, the worst was >3 seconds.
I was under the impression that latency should be much, much lower (like 20-80ms), so I'm guessing I'm doing something wrong.
This is the (node.js) code I'm using to query (from a simple datastore with just a single entity):
const Datastore = require('@google-cloud/datastore');
const projectId = '<my-project-id>';

const datastoreClient = Datastore({
  projectId: projectId
});

var query = datastoreClient.createQuery('Test').limit(1);

console.time('query');
query.run(function (err, test) {
  if (err) {
    console.log(err);
    return;
  }
  console.timeEnd('query');
});
Not sure if it's relevant, but my app-engine project is in the US-Central region, as is the compute engine I'm running the query from.
UPDATE
After some more testing I found out that the default authentication (token?) that you get when using the Node.js library provided by Google expires after about 4 minutes.
So in other words: if you use the same process, but you wait 4 minutes or more between requests, query times are back to >800ms.
I also tried authenticating using a keyfile, and that seemed to do better: subsequent requests are still faster, but the initial request only takes half the time (>400ms).
The latency you see on your initial requests to Datastore is most likely due to caches being warmed up. Datastore uses a distributed architecture to manage scaling, which allows your queries to scale with the size of your result set. The more often you perform the same query, the better prepared Datastore is to serve it, and the more consistent your response times will be.
If you want similar speeds at low Datastore access rates, it is recommended to configure your own caching layer. Google App Engine provides Memcache, which is optimized for use with Datastore. Since you are making requests from Compute Engine, you can use other third-party solutions such as Redis or Memcached.
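As an illustration of that caching layer, here is a minimal sketch using the node-redis v4 client in front of the Datastore query; the cache key, TTL, and Redis host are assumptions, and the Datastore call uses the current client API rather than the older style in the question.
// cached-query.js (sketch) - Redis in front of a Datastore query
const { createClient } = require('redis');
const { Datastore } = require('@google-cloud/datastore');

const datastore = new Datastore();
const redis = createClient({ url: 'redis://localhost:6379' }); // hypothetical Redis host

async function getFirstTestEntity() {
  await redis.connect();

  const cached = await redis.get('test:first');
  if (cached) {
    await redis.quit();
    return JSON.parse(cached);
  }

  const [entities] = await datastore.runQuery(datastore.createQuery('Test').limit(1));
  await redis.set('test:first', JSON.stringify(entities[0]), { EX: 60 }); // cache for 60s
  await redis.quit();
  return entities[0];
}

getFirstTestEntity().then(console.log).catch(console.error);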

MongoDB connections from AWS Lambda

I'm looking to create a RESTful API using AWS Lambda/API Gateway connected to a MongoDB database. I've read that connections to MongoDB are relatively expensive so it's best practice to retain a connection for reuse once its been established rather than making new connections for every new query.
This is pretty straightforward for normal applications, as you can establish a connection during start-up and reuse it for the application's lifetime. But since Lambda is designed to be stateless, retaining this connection seems less straightforward.
Therefore, I'm wondering what would be the best way to approach this database connection issue? Am I forced to make new connections every time a Lambda function is invoked or is there a way to pool/cache these connections for more efficient queries?
Thanks.
AWS Lambda functions should be defined as stateless functions, so they can't hold state like a connection pool.
This issue was also raised in this AWS forum post. On Oct 5, 2015 AWS engineer Sean posted that you should not open and close a connection on each request; instead, create a pool at code initialization, outside the handler block. But two days later the same engineer posted that you should not do this.
The problem is that you don't have control over Lambda's runtime environment. We do know that these environments (or containers) are reused, as described in the blog post by Tim Wagner. But the lack of control can lead you to drain all your resources, like reaching a connection limit in your database. But it's up to you.
Instead of connecting to MongoDB from your Lambda function, you can use RESTHeart to access the database through HTTP. The connection pool to MongoDB is maintained by RESTHeart instead. Keep in mind that, in regards to performance, you'll be opening a new HTTP connection to RESTHeart on each request, not using an HTTP connection pool like you could in a traditional application.
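For illustration, a call from the Lambda handler to a RESTHeart endpoint could look roughly like the sketch below, using only Node's built-in https module; the RESTHeart host, database/collection path, and query parameters are assumptions about your setup.
// Sketch: query MongoDB through a RESTHeart HTTP endpoint instead of a driver connection
const https = require('https');

function getJson(url) {
  return new Promise((resolve, reject) => {
    https.get(url, (res) => {
      let body = '';
      res.on('data', (chunk) => { body += chunk; });
      res.on('end', () => resolve(JSON.parse(body)));
    }).on('error', reject);
  });
}

exports.handler = async (event) => {
  // Hypothetical RESTHeart URL: /<db>/<collection> returns the collection's documents
  const docs = await getJson('https://restheart.example.com/mydb/trips?pagesize=1');
  return { statusCode: 200, body: JSON.stringify(docs) };
};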
You should assume Lambdas to be stateless, but the reality is that most of the time the VM is simply frozen and does maintain some state. It would be inefficient for Amazon to spin up a new process for every request, so they often re-use the same process, and you can take advantage of this to avoid thrashing connections.
To avoid connecting for every request (in cases where the Lambda process is re-used):
Write the handler assuming the process is re-used, so that you connect to the database and have the Lambda re-use the connection pool (the db promise returned from MongoClient.connect).
In order for the Lambda not to hang waiting for you to close the db connection (db.close()) after servicing a request, tell it not to wait for an empty event loop.
Example:
const { MongoClient } = require('mongodb');
const MongoURI = process.env.MONGO_URI; // connection string, e.g. from an environment variable

var db = MongoClient.connect(MongoURI);

module.exports.targetingSpec = (event, context, callback) => {
  context.callbackWaitsForEmptyEventLoop = false;
  db.then((db) => {
    // use db
  });
};
From the documentation about context.callbackWaitsForEmptyEventLoop:
callbackWaitsForEmptyEventLoop
The default value is true. This property is useful only to modify the default behavior of the callback. By default, the callback will wait until the Node.js runtime event loop is empty before freezing the process and returning the results to the caller. You can set this property to false to request AWS Lambda to freeze the process soon after the callback is called, even if there are events in the event loop. AWS Lambda will freeze the process, any state data and the events in the Node.js event loop (any remaining events in the event loop processed when the Lambda function is called next and if AWS Lambda chooses to use the frozen process). For more information about callback, see Using the Callback Parameter.
RESTHeart is a REST-based server that runs alongside MongoDB. It maps most CRUD operations in Mongo to GET, POST, etc. requests, with extensible support when you need to write a custom handler (e.g., a specialized geoNear or geoSearch query).
I ran some tests executing Java Lambda functions connecting to MongoDB Atlas.
As already stated by other posters, Amazon does reuse the instances; however, these may get recycled and the exact behaviour cannot be determined, so one could end up with stale connections. I'm collecting data every 5 minutes and pushing it to the Lambda function every 5 minutes.
The Lambda basically does:
Build up or reuse the connection
Query one record
Write or update one record
Close the connection or leave it open
The actual amount of data is quite low; depending on the time of day it varies from 1-5 kB. I only used 128 MB.
The Lambdas ran in N. Virginia, as this is the region the free tier is tied to.
When opening and closing the connection each time, most calls take between 4500-9000 ms. When reusing the connection, most calls are between 300-900 ms. Checking the Atlas console, the connection count stays stable. For this case, reusing the connection is worth it. Building up a connection, and even disconnecting from a replica set, is rather expensive with the Java driver.
For a large scale deployment one should run more comprehensive tests.
Yes, there is a way to cache/retain a connection to MongoDB, and its name is connection pooling. You can use it with Lambda functions as well, like this:
For more information you can follow these links:
Using Mongoose With AWS Lambda
Optimizing AWS Lambda (a bit out of date)
const mongoose = require('mongoose');

let conn = null;

const uri = 'YOUR CONNECTION STRING HERE';

exports.handler = async function (event, context) {
  // Make sure to add this so you can re-use `conn` between function calls.
  context.callbackWaitsForEmptyEventLoop = false;

  const models = [{ name: 'User', schema: new mongoose.Schema({ name: String }) }];
  conn = await createConnection(conn, models);

  // e.g.
  const doc = await conn.model('User').findOne({});
  console.log('doc: ', doc);
};

const createConnection = async (conn, models) => {
  // Because `conn` is in the global scope, Lambda may retain it between
  // function calls thanks to `callbackWaitsForEmptyEventLoop`.
  // This means your Lambda function doesn't have to go through the
  // potentially expensive process of connecting to MongoDB every time.
  // Reconnect only if there is no connection yet, or it is disconnected (0) / disconnecting (3).
  if (conn == null || [0, 3].includes(conn.readyState)) {
    conn = await mongoose.createConnection(uri, {
      // Buffering means mongoose will queue up operations if it gets
      // disconnected from MongoDB and send them when it reconnects.
      // With serverless, better to fail fast if not connected.
      bufferCommands: false, // Disable mongoose buffering
      bufferMaxEntries: 0, // and MongoDB driver buffering
      useNewUrlParser: true,
      useUnifiedTopology: true,
      useCreateIndex: true
    });
    for (const model of models) {
      const { name, schema } = model;
      conn.model(name, schema);
    }
  }
  return conn;
};
Unfortunately, you may have to create your own RESTful API to answer MongoDB requests until AWS comes up with one. So far they only have what you need for their own DynamoDB.
The short answer is yes: you need to create a new connection, AND close it before the Lambda finishes.
The long answer is that during my tests you can actually pass your DB connection down through your handler like so (a MySQL example, as that's what I've got to hand). You can't rely on the connection still being there, so check for it as in my example below; it may be that once your Lambdas haven't been executed for a while, the handler does lose its state (cold start). I need to do more tests to find out, but I have noticed that if a Lambda is getting a lot of traffic, with the example below it doesn't create a new connection.
// MySQL.database.js
import * as mysql from 'mysql'

export default mysql.createConnection({
  host: 'mysql db instance address',
  user: 'MYSQL_USER',
  password: 'PASSWORD',
  database: 'SOMEDB',
})
Then in your handler import it and pass it down to the lambda that's being executed.
// handler.js
import MySQL from './MySQL.database.js'

const funcHandler = (func) => {
  return (event, context, callback) => {
    func(event, context, callback, MySQL)
  }
}

const handler = {
  someHandler: funcHandler(someHandler),
}

export default handler
Now in your Lambda you do...
export default (event, context, callback, MySQL) => {
  context.callbackWaitsForEmptyEventLoop = false
  // Check if there is a MySQL connection; if not, then open one.
  // Do ya thing, query away etc etc
  callback(null, responder.success())
}
The responder example can be found here. Sorry it's ES5, because that's how the question was asked.
Hope this helps!
Official Best Practice for Connecting from AWS Lambda
You should define the client to the MongoDB server outside the AWS Lambda handler function. Don't define a new MongoClient object each time you invoke your function. Doing so causes the driver to create a new database connection with each function call. This can be expensive and can result in your application exceeding database connection limits.
As an alternative, do the following:
Create the MongoClient object once.
Store the object so your function can reuse the MongoClient across function invocations.
Step 1
Isolate the call to the MongoClient.connect() function into its own module so that the connections can be reused across functions. Let's create a file mongo-client.js for that:
mongo-client.js:
const { MongoClient } = require('mongodb');
// Export a module-scoped MongoClient promise. By doing this in a separate
// module, the client can be shared across functions.
const client = new MongoClient(process.env.MONGODB_URI);
module.exports = client.connect();
Step 2
Import the new module and use it in function handlers to connect to database.
some-file.js:
const clientPromise = require('./mongo-client');

// Handler
module.exports.handler = async function (event, context) {
  // Get the MongoClient by calling await on the connection promise. Because
  // this is a promise, it will only resolve once.
  const client = await clientPromise;

  // Use the connection to return the name of the connected database, for example.
  return client.db().databaseName;
};
Resources
For more info, check the docs.
We tested an AWS Lambda that connected every minute to our self managed MongoDB.
The connections were unstable and the Lambda failed.
We resolved the issue by putting MongoDB behind an Nginx reverse proxy using the stream module:
How to setup MongoDB behind Nginx Reverse Proxy
stream {
    server {
        listen <your incoming Mongo TCP port>;
        proxy_connect_timeout 1s;
        proxy_timeout 3s;
        proxy_pass stream_mongo_backend;
    }

    upstream stream_mongo_backend {
        server <localhost:your local Mongo TCP port>;
    }
}
In addition to saving the connection for reuse, increase the memory allocation for the Lambda function. AWS allocates CPU proportionally to the memory allocation, and when changing from 128MB to 1.5GB the connection time dropped from 4s to 0.5s when connecting to MongoDB Atlas.
Read more here: https://aws.amazon.com/lambda/faqs/
I was facing the same issue a while ago, but I resolved it by putting my MongoDB on an EC2 instance in the same AWS account.
I created the MongoDB instance on the same AWS account where my Lambda function resides.
Now I can access MongoDB from the Lambda function via its private IP.
