Proper way to handle QLDB Session - node.js

I want to know how to handle QLDB sessions in a Node.js application.
Should I create one session for the entire scope of the app, or should I make a new session before each batch of transactions?
Right now I'm creating a session before each transaction, and I'm getting some OCC conflicts when running unit tests (a new session is created for each test).

You should use as many sessions as needed to achieve the level of throughput you require. Each session can run a single transaction at a time, and each transaction has a certain latency. So, for example, if your transactions take 10ms, one session can do 100 transactions per second (1s = 1000ms, and 1000/10 = 100). If you need to achieve 1000 TPS, you would then need 10 sessions.
The driver comes with a "pool" of sessions. So, each transaction should request a session from the pool. The pool will grow/shrink as required.
Each session can live no longer than ~15 minutes (with some jitter). Thus, you should handle the case where using a session throws an exception (invalid session) and retry your operation (get a new session, run the transaction).
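As a minimal sketch (assuming a recent version of the amazon-qldb-driver-nodejs package; the ledger and table names are illustrative), the driver's executeLambda wraps this whole pattern for you:

const { QldbDriver } = require("amazon-qldb-driver-nodejs");

// One driver per process; it maintains the session pool internally.
const driver = new QldbDriver("my-ledger");

async function getPerson(id) {
  // executeLambda acquires a pooled session, runs the transaction,
  // and transparently retries on InvalidSessionException and OCC conflicts.
  return driver.executeLambda(async (txn) => {
    const result = await txn.execute("SELECT * FROM People WHERE id = ?", id);
    return result.getResultList();
  });
}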
In terms of OCC, I think that is most likely unrelated to your usage of sessions. An OCC conflict means that data you read in your transaction was changed by the time you tried to commit. Usually this means you haven't set up the right indexes, so your reads are scanning all the items in a table and thus conflict with any concurrent write to it.
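For example (hypothetical table and field), a PartiQL index on the field you filter by keeps each read from scanning the whole table:

CREATE INDEX ON People (id)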

Related

How to keep the database from being hit hard when the API gets bursts of calls?

I have an API which allows other microservices to call on to check whether a particular product exists in the inventory. The API takes in only one parameter which is the ID of the product.
The API is served through API Gateway backed by Lambda, and it simply queries a Postgres RDS instance to check for the product ID. If it finds the product, it returns the product's information in the response. If it doesn't, it returns an empty response. The SQL is basically this:
SELECT * FROM inventory where expired = false and product_id = request.productId;
However, the problem is that many services are calling this particular API very heavily to check the existence of products. Not only that, the calls often come in bursts. I assume those services loop through a list of product IDs and check for their existence individually, hence the burst.
The number of concurrent calls on the API results in many queries to the database. The rate can burst beyond 30 queries per second, and there can be a few hundred thousand requests to fulfil. The queries are mostly the same, except for the product ID in the WHERE clause. The column is indexed, and the query takes an average of only 5-8ms to complete. Still, the connection to the database occasionally times out when the rate gets too high.
I'm using Sequelize as my ORM, and the error I get when it times out is SequelizeConnectionAcquireTimeoutError. There is a good chance the burst rate is too high and it maxes out the pool too.
Some options I have considered:
Using a cache layer. But I have noticed that, most of the time, 90% of the product IDs in the requests are not repeated. This means that 90% of the time it would be a cache miss and the request would still query the database.
Autoscaling the database. But because the calls are bursty and I don't know when they may come, the autoscaling won't complete in time to avoid the timeouts. Moreover, the query is a very simple SELECT statement, and the CPU of the RDS instance hardly crosses 80% during the bursts, so I doubt scaling would do much either.
What other techniques can I use to keep the database from being hit hard when the API gets bursts of calls that are mostly unique and difficult to cache?
Use a cache at boot time
You can load all the necessary columns into an in-memory data store (Redis) at startup, and have every database update propagated to the cached data (e.g. by a cron job). A sketch follows below.
Problems: memory overhead; keeping the cache up to date.
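A minimal sketch of the boot-time warm-up, assuming the node-redis client and a Sequelize model named Inventory (both illustrative):

const { createClient } = require("redis");
const redis = createClient();

async function warmCache() {
  if (!redis.isOpen) await redis.connect();
  // Load only the columns the API returns, keyed by product_id.
  const rows = await Inventory.findAll({
    where: { expired: false },
    attributes: ["product_id", "name", "quantity"],
  });
  const multi = redis.multi();
  for (const row of rows) {
    multi.set(`product:${row.product_id}`, JSON.stringify(row));
  }
  await multi.exec();
}
// Re-run warmCache() from the cron job so database updates reach the cache.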
Limit DB calls
Create a buffer for IDs: store n IDs and then make one query for all of them, or empty the buffer every m seconds (see the sketch below).
Problems: added client response time; extra work to split the batched query result back out per caller.
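Here's a minimal sketch of that buffering idea (a Sequelize model named Inventory is assumed; the n = 100 and m = 50ms limits are illustrative). Individual lookups are coalesced into one SELECT ... WHERE product_id IN (...):

const pending = new Map(); // productId -> pending resolve callbacks

function lookupProduct(productId) {
  return new Promise((resolve) => {
    if (!pending.has(productId)) pending.set(productId, []);
    pending.get(productId).push(resolve);
    if (pending.size >= 100) flush(); // n ids reached: query now
  });
}

setInterval(flush, 50); // or empty the buffer every m milliseconds

async function flush() {
  if (pending.size === 0) return;
  const batch = new Map(pending);
  pending.clear();
  // One round trip for the whole batch instead of one query per request.
  const rows = await Inventory.findAll({
    where: { expired: false, product_id: Array.from(batch.keys()) },
  });
  const found = new Map(rows.map((r) => [r.product_id, r]));
  for (const [id, resolvers] of batch) {
    resolvers.forEach((resolve) => resolve(found.get(id) || null));
  }
}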
Change your database
Use a NoSQL database for this data. According to this article and this one, I think choosing a NoSQL database is a better idea.
Problems: multiple data stores.
Start with a covering index to handle your query. You might create an index like this for your table:
CREATE INDEX inv_lkup ON inventory (product_id, expired) INCLUDE (col, col, col);
Mention all the columns in your SELECT in the index, either in the main list of indexed columns or in the INCLUDE clause. Then the DBMS can satisfy your query entirely from the index (an index-only scan), which is faster.
You could also use AWS Lambda throttling to handle this problem. But for that to work, the consumers of your API will need to retry when they get 429 responses, which might be super inconvenient.
Sorry to say, you may need to stop using Lambda. Ordinary web servers have good machinery built in for managing bursty workloads.
They have an incoming connection (TCP/IP listen) queue. Each new request lands in that queue, where it waits until the server software accepts the connection. When the server is busy, requests wait in that queue; under high load they simply wait a bit longer. In Node.js's case, if you use clustering there's just one of these incoming connection queues, and all the processes in the cluster share it.
The server software you run (to handle your API) has a pool of connections to your DBMS. That pool has a maximum number of connections in it. As your server software handles each request, it awaits a connection from the pool. If no connection is immediately available, the request handling pauses until one is, then proceeds. This too smooths out the flow of requests to the DBMS. (Be aware that each process in a Node.js cluster has its own pool.)
Paradoxically, a smaller DBMS connection pool can improve overall performance, by avoiding too many concurrent SELECTs (or other queries) on the DBMS.
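In Sequelize's case these knobs live in the constructor's pool option (the values below are illustrative; acquire is the timeout behind SequelizeConnectionAcquireTimeoutError):

const { Sequelize } = require("sequelize");

const sequelize = new Sequelize(process.env.DATABASE_URL, {
  pool: {
    max: 5,         // small per-process pool; often better under burst load
    min: 0,
    acquire: 30000, // ms to wait for a connection before throwing
                    // SequelizeConnectionAcquireTimeoutError
    idle: 10000,    // ms before an unused connection is released
  },
});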
This kind of server configuration can be scaled out: a load balancer will do. So will a server with more cores and more nodejs cluster processes. An elastic load balancer can also add new server VMs when necessary.

Are Rails Thread.current variables isolated to a single user request?

I am using Thread.current to store the current user id so that I can see who made various updates to our database. However, after some usage in production, it is returning user ids other than those who could have been updating this data. Locally and on lesser-used QA instances, the saved user ids are correct.
We are using Rails 5.1, ruby 2.5.1 with Puma. RAILS_MAX_THREADS=1, but we do have a RAILS_POOL_SIZE=5. Any ideas what might cause this issue or how to fix it? Specifically, does a single Thread.current variable last longer than a single user request?
Why would Thread.current be limited to a request?
The same threads are used for multiple requests. Threads aren't killed at the end of a request; each one just picks up the next request from the queue (or waits for a request to arrive in the queue).
It would be different if you used the Timeout middleware, since timeouts actually use a separate thread to count the passage of time (and stop processing)... but creating a new thread per request introduces performance costs.
Sidenote
Depending on your database usage (blocking IO), RAILS_MAX_THREADS might need to be significantly higher. The more database calls you make and the more data you move, the more time threads will spend blocked on database IO (essentially sleeping).
By limiting the thread pool to a single thread, you are limiting request concurrency in a significant way. The CPU could be handling other requests while waiting for the database to return the data.

Is there any way to differentiate sessions of each client in cassandra QueryHandler?

My aim is to log unique queries per session by writing a custom QueryHandler implementation, since logging all queries causes a performance hit in our case.
Consider this case: one user connects to the Cassandra cluster with the Java client and performs "select * from users where id = ?" 100 times, and another user connects from cqlsh and performs the same query 50 times. I want to log only two queries in this case. For that I need a unique session id per login.
Cassandra provides the interface below, where all requests land, but none of its APIs expose a session id that would differentiate the two sessions described above.
org.apache.cassandra.cql3.QueryHandler
Note: I am able to get the remote address/port, but I want some id that is created when a user logs in and destroyed when they disconnect.
In queryState.getClientState().getRemoteAddress(), the address + port will be unique per TCP connection in the session's connection pool. There can be multiple concurrent requests over each connection, though, and a session can have multiple connections per host. There is also no guarantee on the client side that the same TCP connection will be used from one request to the next.
However, a single session cannot be connected as two different users (authentication is part of connection initialization), so the scenario you describe isn't possible from the perspective of a single Session object. I think using the address as the key for uniqueness is all you can do, given how the protocol/driver works. It will at least dedup things a little.
Are you actually processing your logging inline, or are you pushing it off async? If you're using logback, it should be using an async appender; and if you're posting events synchronously to another server, it might be better to throw all the events on a queue and dedup them in another thread so you don't hurt latency.

Cache model for often requesting items

I have a bunch of user-generated messages with timestamps, message text, profile images, and other stuff. All clients (phones) using my Web API request the latest messages first, then scroll down and request older items. Obviously, the top messages are the hottest data in the whole list, so I want to build a cache with a caching policy and a clear way to decide whether newly requested messages are hot or not.
I created a stateless service with MemoryCache and now use it for this purpose. Are there any pitfalls I should take into account while working with it? Apart, of course, from the fact that I have five nodes, so a user can make a request to a service instance that has no cache inside; in that case the service goes to the data-layer service and loads the data from it.
UPD #1
Forgot to mention that this list of messages is updated from time to time with new entries.
UPD #2
I wrapped MemoryCache in an IReliableDictionary implementation and exposed it from a stateful service with my own StateManager implementation. Every time a request doesn't find an item in the collection, I go to Azure Storage and retrieve the actual data. After finishing, I realized the experiment wasn't useful, because there is no way to scale this approach: if my app uses a fixed number of partitions for the Reliable Service acting as a cache, I cannot grow the cache by scaling out my Service Fabric cluster. If load increases over time, that will eventually hit me in the face :)
I still don't know how to cache my super-hot, most-read messages more efficiently. And I still have doubts about the Reliable Actors approach: it creates a huge amount of replicated data.
I think this is an ideal use for an actor: one actor per user. The actor will be garbage collected after a period of inactivity, so the data won't stay in memory.

Is this MEAN stack design-pattern suitable at the 1,000-10,000 user scale?

Let's say that when a user logs into a webapp, he sees a list of information.
Let's say that list of information is served by one of two dynos (via Heroku), but that the list of information originates from a single Mongo database (i.e., the Node.js dynos are just passing the Mongo data to a user when he logs into the webapp).
Question: Suppose I want to make it possible for a user to both modify and add to that list of information.
At a scale of 1,000-10,000 users, is the following strategy suitable:
The user modifies/adds to the data; an HTTP POST with the updated data is sent to one of the two Node.js dynos.
The dyno (whichever one it may be) takes the modification/addition and makes a direct query against the Mongo database to update the data.
The dyno sends confirmation back to the client that the update was successful.
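A minimal sketch of those three steps, assuming Express and the official mongodb driver (collection and field names are illustrative):

const express = require("express");
const { MongoClient, ObjectId } = require("mongodb");

const app = express();
app.use(express.json());

// The driver keeps its own connection pool, shared across requests.
const client = new MongoClient(process.env.MONGO_URL);

app.post("/items/:id", async (req, res) => {
  const items = client.db("app").collection("items");
  // Step 2: one direct, targeted update against the database.
  const result = await items.updateOne(
    { _id: new ObjectId(req.params.id) }, // assumes ObjectId-style ids
    { $set: { text: req.body.text } },
    { upsert: true } // also covers the "add to the list" case
  );
  // Step 3: confirm the update back to the client.
  res.json({ ok: true, modified: result.modifiedCount });
});

client.connect().then(() => app.listen(3000));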
Is this OK? Would I likely have to add more dynos (Heroku)? I'm basically worried that if a bunch of users try to access a single database at once, it will be slow, or that I'm somehow risking corrupting the entire database at the 1,000-10,000 user scale. Is this fear reasonable?
Short answer: yes, it's a reasonable fear. Longer answer: it depends.
MongoDB will queue the requests and handle them in the order it receives them. Depending on how much of the data is being served from memory, it may or may not be fast enough.
Node.js follows the same design pattern: it queues the requests it can't process immediately and executes them when resources become available.
The only way to tell whether performance is suffering is to monitor it and see whether resources consistently hit a threshold you're uncomfortable passing. On the upside, during your discovery phase your clients will probably only notice a few milliseconds of delay.
The proper way to handle growth is to spin up a new instance as resources get consumed by the traffic.
Your database likely won't corrupt, but if your data is important (and why would you collect it if it weren't?), you should be running a replica set. I would go with a replica set before a second Node instance.
