Fetching bot answers from a database - node.js

I'm using Azure Cosmos DB with MongoDB for storing the answers that my Microsoft Bot Framework-based chatbot will give to different dialogs.
My issue is that I don't know whether it's better to run a query for each response, or to run one large query that fetches everything in the DB once the code runs and stores the results in arrays.
Azure Cosmos DB pricing is based on Request Units per second (RU/s).
In terms of cost and speed, I'm thinking of doing one query whenever the bot service is run (in my case, that would be when app.js is run on my Azure Web App).
This query fetches all the data in my database and stores the results in different arrays in my code. Inside my bot.dialog()s I will use these arrays to fetch the answer that I want the bot to return to the end user.

I would load all the data from the DB into the bot when the app starts up, and if you manipulate the data you can write it back to the DB when the bot shuts down. This would mean that you have one single big query at the beginning of your bot's life and another one at the end. But this also depends on the amount of memory your app has allocated and how big the DB is.

From a Cosmos DB perspective, fewer requests that yield larger result sets will typically be faster and cheaper in terms of RUs than more requests fetching smaller result sets; round trips are expensive. But it depends on the complexity of the queries too: aggregation pipelines are more expensive than find() with filters. Everything else should be a client-side consideration.
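For illustration, a minimal sketch of the load-at-startup idea with the official mongodb Node.js driver; the connection string, database/collection names, and the dialogKey/answer fields are assumptions, not taken from the question:

const { MongoClient } = require('mongodb');

// In-memory cache of answers, keyed by dialog name (hypothetical field names).
const answersByDialog = new Map();

async function loadAnswers() {
  // One big read at startup instead of one query per dialog turn.
  const client = await MongoClient.connect(process.env.COSMOS_CONNECTION_STRING);
  const docs = await client.db('botdb').collection('answers').find({}).toArray();
  for (const doc of docs) {
    answersByDialog.set(doc.dialogKey, doc.answer);
  }
  await client.close();
}

// Call once when app.js starts, before the bot begins handling dialogs;
// bot.dialog() handlers then read from answersByDialog instead of querying Cosmos DB.
loadAnswers().then(() => console.log(`cached ${answersByDialog.size} answers`));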

Related

Is writing multiple INSERTS versus UPDATE faster for temporary POSTGRES databases?

I am re-designing a project I built a year ago when I was just starting to learn how to code. I used the MEAN stack back then and want to convert it to a PERN stack now. My AWS knowledge has also grown a bit and I'd like to expand on these new skills.
The application receives real-time data from an API, which I clean up and write to a database as well as broadcast to connected clients.
To better conceptualize this question I will refer to the following items:
api-m1: this receives the incoming data and passes it through my schema; I then send it to my socket-server.
socket-server: handles the WSS connection to the application's front-end clients. It also writes the data it gets from Scraper and api-m1 to a Postgres database. I would like to turn this into clusters eventually, as I am using Node.js, and will incorporate Redis. Then I will run it behind an ALB using sticky sessions, etc., for multiple EC2 instances.
RDS: postgres table which socket-server writes incoming scraper and api-m1 data to. RDS is used to fetch the most recent data stored along with user profile config data. NOTE: RDS main data table will have max 120-150 UID records with 6-7 columns
From a database perspective, what would be the quickest way to write my data to RDS, assuming during peak times 20-40 records/s from api-m1 plus another 20-40 records/s from the scraper? After each day I tear down the database using a Lambda function and start again (as the data is only temporary and does not need to be saved for any prolonged period of time).
1. Should I INSERT each record using a SERIAL id, then from the frontend fetch the most recent rows based off of the UID?
2.a Should I UPDATE each UID so I'd have a fixed N rows of data which I just search and update? (I can see this bottlenecking with my Postgres client.)
2.b Still use UPDATE but do BATCHED updates (a sketch of this is below). What issues will I run into if I make multiple clusters, i.e. will I run into concurrency problems where table record XYZ has an older value overwrite a more recent value because I'm using BATCH UPDATE with Node clusters?
My concern is that UPDATEs are slower than INSERTs, but I don't need to make it as fast as possible. This section of the application isn't CPU heavy, and the rt-data isn't that intensive.
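For reference, a rough sketch of what the batched-update idea in 2.b could look like with node-postgres; the table and column names (live_data, uid, price, ts) are made-up placeholders, and the timestamp guard is one way to address the stale-overwrite concern:

const { Pool } = require('pg');
const pool = new Pool(); // reads PG* environment variables

// Flush a buffer of {uid, price, ts} records in one round trip.
// ON CONFLICT keeps a single row per uid (uid needs a unique constraint),
// so the table stays at roughly N rows.
async function flushBatch(records) {
  const values = [];
  const params = [];
  records.forEach((r, i) => {
    const base = i * 3;
    values.push(`($${base + 1}, $${base + 2}, $${base + 3})`);
    params.push(r.uid, r.price, r.ts);
  });
  await pool.query(
    `INSERT INTO live_data (uid, price, ts)
     VALUES ${values.join(', ')}
     ON CONFLICT (uid) DO UPDATE
       SET price = EXCLUDED.price, ts = EXCLUDED.ts
       WHERE EXCLUDED.ts > live_data.ts`, // an older batch cannot overwrite newer data
    params
  );
}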
To make my comments an answer:
You don't seem to need SQL semantics for anything here, so I'd just toss RDS and use e.g. Redis (or DynamoDB, I guess) for that data store.
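If you go that route, a minimal sketch with ioredis, keeping one hash per UID (key and field names are placeholders, not from the question):

const Redis = require('ioredis');
const redis = new Redis(process.env.REDIS_URL); // e.g. an ElastiCache endpoint

// Keep only the latest record per UID; 40-80 writes/s is well within Redis territory.
async function saveRecord(record) {
  await redis.hset(`uid:${record.uid}`, 'price', record.price, 'ts', record.ts);
}

// Frontend fetch of the most recent data for one UID.
async function getRecord(uid) {
  return redis.hgetall(`uid:${uid}`);
}

Tearing the data down at the end of the day could then be a FLUSHDB (or per-key TTLs) instead of a Lambda that drops tables.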

Azure Cache Redis Slow Performance rather than DB query

We have implemented Azure Cache Redis for our project.
But the problem is that the Azure Cache query performance is slower than the DB query.
For one query, the Redis response averages 115ms while the DB query averages 60ms.
For another query, the Redis response averages 200ms while the DB query averages 210ms.
I expected Redis queries to return in around 50ms for all requests.
Is this normal, or are we missing something?
Maybe speed is not the deciding factor all the time?
The performance of Azure Cache for Redis queries depends on various criteria:
First, check the source from where you are querying the Redis cache. If the source and the Redis cache resource are in different regions, there may be significant network latency.
The pricing tier of the Redis cache also plays a crucial role in the performance.
Use the redis-benchmark.exe utility to check the throughput and characteristics of your cache.
You can also consider scaling the cache to improve its performance.
Possible reasons why query run time varies:
Different data volumes
Different system load
Different query plans
This can also depend on the type of query, in this case a SQL database vs Redis.
A website tested this and came to the following conclusion:
Especially if your request volume is large, it is better to use the DB and Redis together. Otherwise the two are nearly equal, and their timings vary per query.
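As a quick sanity check, it can help to measure the round trip from the same app server and region that serves production traffic; a small sketch with ioredis, where the host, access key, and key name are placeholders:

const Redis = require('ioredis');

// Azure Cache for Redis listens on 6380 with TLS by default.
const redis = new Redis({
  host: process.env.REDIS_HOST,
  port: 6380,
  password: process.env.REDIS_KEY,
  tls: { servername: process.env.REDIS_HOST },
});

// Time a single GET round trip. If this runs in a different region than the
// cache, the number mostly reflects network latency, not cache performance.
async function timeGet(key) {
  const start = process.hrtime.bigint();
  await redis.get(key);
  const elapsedMs = Number(process.hrtime.bigint() - start) / 1e6;
  console.log(`GET ${key}: ${elapsedMs.toFixed(1)} ms`);
}

timeGet('test:key');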
Related:
Why does the same query takes different amount of time to run?
Redis vs mySQL

How to avoid database from being hit hard when API is getting bursted?

I have an API which allows other microservices to call on to check whether a particular product exists in the inventory. The API takes in only one parameter which is the ID of the product.
The API is served through API Gateway in Lambda and it simply queries against a Postgres RDS to check for the product ID. If it finds the product, it returns the information about the product in the response. If it doesn't, it just returns an empty response. The SQL is basically this:
SELECT * FROM inventory where expired = false and product_id = request.productId;
However, the problem is that many services are calling this particular API very heavily to check the existence of products. Not only that, the calls often come in bursts. I assume those services loop through a list of product IDs and check for their existence individually, hence the burst.
The number of concurrent calls on the API has resulted in it making many queries to the database. The rate can burst beyond 30 queries per second and there can be a few hundred thousand requests to fulfil. The queries are mostly the same, except for the product ID in the WHERE clause. The column has been indexed and the query takes an average of only 5-8ms to complete. Still, the connection to the database occasionally times out when the rate gets too high.
I'm using Sequelize as my ORM and the error I get when it times out is SequelizeConnectionAcquireTimeoutError. There is a good chance that the burst rate was too high and it maxed out the pool too.
Some options I have considered:
Using a cache layer. But I have noticed that, most of the time, 90% of the product IDs in the requests are not repeated. This would mean that 90% of the time it would be a cache miss and the API would still query against the database.
Auto-scaling up the database. But because the calls are bursty and I don't know when they may come, the autoscaling won't complete in time to avoid the timeouts. Moreover, the query is a very simple SELECT statement and the CPU of the RDS instance hardly crosses 80% during the bursts, so I doubt scaling would do much either.
What other techniques can I use to avoid the database being hit hard when the API gets burst calls that are mostly unique and difficult to cache?
Use a cache at boot time
You can load all necessary columns into an in-memory data store (Redis). Every update to the database (via a cron job) also updates the cached data.
Problems: memory overhead of keeping the cache updated
Limit DB calls
Create a buffer for IDs. Store n IDs and then make one query for all of them, or empty the buffer every m seconds (see the sketch after this list).
Problems: added client response time, extra processing of the query result
Change your database
Use a NoSQL database for this data. According to this article and this one, I think choosing a NoSQL database is a better idea.
Problems: multiple data stores
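A rough sketch of the buffering idea above ("Limit DB calls"), assuming node-postgres; the inventory table and its columns come from the question's query, while the 50 ms window and function names are placeholders:

const { Pool } = require('pg');
const pool = new Pool();

let buffer = new Map(); // productId -> callbacks waiting for that product's row

// Callers still get a per-product promise, but queries are batched.
function lookupProduct(productId) {
  return new Promise((resolve) => {
    if (!buffer.has(productId)) buffer.set(productId, []);
    buffer.get(productId).push(resolve);
  });
}

// Every 50 ms, resolve all buffered IDs with one query instead of one query each.
setInterval(async () => {
  if (buffer.size === 0) return;
  const pending = buffer;
  buffer = new Map();
  const ids = [...pending.keys()];
  const { rows } = await pool.query(
    'SELECT * FROM inventory WHERE expired = false AND product_id = ANY($1)',
    [ids]
  );
  const byId = new Map(rows.map((r) => [r.product_id, r]));
  for (const [id, waiters] of pending) {
    waiters.forEach((resolve) => resolve(byId.get(id) || null)); // null = empty response
  }
}, 50);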
Start with a covering index to handle your query. You might create an index like this for your table:
CREATE INDEX inv_lkup ON inventory (product_id, expired) INCLUDE (col, col, col);
Mention all the columns in your SELECT in the index, either in the main list of indexed columns or in the INCLUDE clause. Then the DBMS can satisfy your query completely from the index. It's faster.
You could start using AWS Lambda throttling to handle this problem. But for that to work, the consumers of your API will need to retry when they get 429 responses. That might be super inconvenient.
Sorry to say, you may need to stop using Lambda. Ordinary web servers have good stuff in them to manage burst workloads.
They have an incoming connection (TCP/IP listen) queue. Each new request coming in lands in that queue, where it waits until the server software accepts the connection. When the server is busy, requests wait in that queue. When there's a high load, the requests wait a bit longer in that queue. In Node.js's case, if you use clustering there's just one of these incoming connection queues, and all the processes in the cluster use it.
The server software you run (to handle your API) has a pool of connections to your DBMS. That pool has a maximum number of connections in it. As your server software handles each request, it awaits a connection from the pool. If no connection is immediately available, the request handling pauses until one is available, then proceeds. This too smooths out the requests to the DBMS. (Be aware that each process in a Node.js cluster has its own pool.)
Paradoxically, a smaller DBMS connection pool can improve overall performance, by avoiding too many concurrent SELECTs (or other queries) on the DBMS.
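For example, since the question already uses Sequelize, the pool can be capped explicitly in its options; the numbers below are illustrative, not recommendations:

const { Sequelize } = require('sequelize');

const sequelize = new Sequelize(process.env.DATABASE_URL, {
  pool: {
    max: 10,        // at most 10 concurrent connections per process
    min: 0,
    acquire: 30000, // ms to wait for a free connection before SequelizeConnectionAcquireTimeoutError
    idle: 10000,    // release connections idle for this long
  },
});

Requests beyond the cap simply wait for a pooled connection instead of piling more concurrent queries onto the database.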
This kind of server configuration can be scaled out: a load balancer will do. So will a server with more cores and more nodejs cluster processes. An elastic load balancer can also add new server VMs when necessary.

MongoDB Multiple database vs single database

I have a NodeJS project that uses MongoDB as its main database.
Normally, I just use one database to contain all the information (users, organizations, messages, ...).
But now I need to store one more thing, log data, which grows very, very fast.
So I am considering storing the logs in another database to keep the current database safe and fast.
Does anyone have experience with this? Is that better than a single database?
Not a real question, the mods will certainly say. You have a few options, depending on your log data and how, and how often, you want to access it.
Capped collections if you don't need to store the logs for a long time (see the sketch after this list)
Something like Redis to delay writing to the log and keep the app responding fast
Use a replica set to distribute the database load.
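If capped collections fit, creating one with the Node.js MongoDB driver looks roughly like this; the database name, collection name, and size are placeholders:

const { MongoClient } = require('mongodb');

async function setupLogCollection() {
  const client = await MongoClient.connect(process.env.MONGO_URL);
  // Capped collection: fixed size, oldest entries are discarded automatically,
  // so fast-growing log data cannot bloat the main database.
  await client.db('myapp').createCollection('logs', {
    capped: true,
    size: 100 * 1024 * 1024, // 100 MB
  });
  return client;
}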

Where / When to close Mongo db connection in a for each iteration context

I have a node.js app running on heroku backed by a Mongo db which breaks down like this:
Node app connects to db and stores db and collection into "top level" variables (not sure if Global is the right word)
App iterates through each document in the db using the forEach() function in the node mongo driver.
Each iteration sends the document id to another function that uses the id to access fields on that document and take actions based on that data. In this case it's making requests against APIs from Amazon and Walmart to get updated pricing info. This function is also being throttled so as not to make too many requests too quickly.
My question is this: how can I know it's safe to close the db connection? My best idea is to get a count of the documents, multiply that by the number of external API hits per document, increment a variable by one each time an API transaction finishes, and then test that number against the total expected and, if it hits that, close the connection. This sounds so hackish there has to be a better way. Any ideas?
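One way to know, sketched with the Node.js MongoDB driver: keep a promise per document's work and only close once they have all settled. updatePricing here stands in for the throttled Amazon/Walmart calls and is purely hypothetical:

const { MongoClient } = require('mongodb');

async function run() {
  const client = await MongoClient.connect(process.env.MONGO_URL);
  const collection = client.db('mydb').collection('products');

  const tasks = [];
  // Queue one unit of (throttled) work per document instead of fire-and-forget.
  await collection.find({}).forEach((doc) => {
    tasks.push(updatePricing(doc._id)); // hypothetical pricing lookups
  });

  await Promise.allSettled(tasks); // every external API transaction has finished
  await client.close();            // safe to close now
}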
