AWS Redis Reader endpoint and ioredis - node.js

We want our Redis to be more scalable and we want to be able to add more read instances.
I am trying to use this new Reader endpoint: https://aws.amazon.com/about-aws/whats-new/2019/06/amazon-elasticache-launches-reader-endpoint-for-redis
However, I don't see any easy or automated way to make ioredis use that approach, where I can set which endpoint is used for writes and which for reads. Even here I can see the recommended approach at the end is to "manually split": https://github.com/luin/ioredis/issues/387
Do you know of any existing solution or good approach where I can set which endpoint will be used for writes and which one for reads?
The most straightforward option for me right now is some kind of "proxy" layer, where I create two Redis instances and send all writes to the primary endpoint and all reads to the Reader endpoint. However, I would prefer a better (or well-tested) approach.
PS: I tried to "hack it" with the Cluster functionality of ioredis, but even a simple connection with no extra functionality and a single node (the primary endpoint) fails with ClusterAllFailedError: Failed to refresh slots cache.
(To have the Reader endpoint available, Cluster mode must be off.)

Just a note about how it ended up.
We had two instances (or reused the same instance if the URLs were the same):
const Redis = require('ioredis')

// Primary connection for writes; a separate reader connection only if the URLs differ
const redis = new Redis(RKT_REDIS_URL.href, redisOptions)
let redisro
if (RKT_REDIS_READER_URL.href === RKT_REDIS_URL.href) {
  redisro = redis
} else {
  redisro = new Redis(RKT_REDIS_READER_URL.href, redisOptions)
}
And then used the first for writes and the other for reads:
redis.hmset(key, update)
redisro.hmget(key, field)
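For completeness, a minimal sketch of the "proxy" idea mentioned above, assuming only a handful of read commands need special-casing (the command list here is illustrative, not exhaustive):
// Route read commands to the reader connection, everything else to the primary
const READ_COMMANDS = new Set(['get', 'mget', 'hget', 'hmget', 'hgetall', 'smembers'])
function redisFor(command) {
  return READ_COMMANDS.has(command.toLowerCase()) ? redisro : redis
}
redisFor('hmset').hmset(key, update) // primary endpoint
redisFor('hmget').hmget(key, field)  // Reader endpoint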
However, after some time we adopted clustered Redis, and it is much better; I can recommend it. The ioredis npm module is also able to use it seamlessly (you don't have to configure anything, you just point it at the configuration endpoint that e.g. AWS provides and that's it).
This was our configuration:
redisOptions.scaleReads = 'master'
redis = new Redis.Cluster([RKT_REDIS_URL.href], redisOptions)
The options for scaleReads (quoted from the ioredis README) are:
scaleReads is "master" by default, which means ioredis will never send any queries to slaves. There are other three available options:
"all": Send write queries to masters and read queries to masters or slaves randomly.
"slave": Send write queries to masters and read queries to slaves.
https://github.com/luin/ioredis
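For reference, a minimal sketch of that cluster setup with reads scaled out to replicas (RKT_REDIS_URL is assumed to hold the cluster configuration endpoint that AWS provides; the key names are just examples):
const Redis = require('ioredis')
const cluster = new Redis.Cluster([RKT_REDIS_URL.href], {
  scaleReads: 'slave' // writes go to masters, reads go to replicas
})
async function example() {
  await cluster.hmset('user:1', { name: 'alice' })   // routed to a master
  const name = await cluster.hmget('user:1', 'name') // routed to a replica
  console.log(name)
}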

Related

Is there a way to listen for operations / change events at the container level rather than the individual DDS level in the Fluid Framework?

Scenario:
I have a service running that is keeping a global search or query index up to date for all containers in my “system”. This service is notified any time a container is opened by a client and opens its own reference to that container to listen for changes to content in that container so it can update the “global container index” storage. The container is potentially large and partitioned into many individual DDS entities, and I would like to avoid loading every DDS in the container in order to listen for changes in each of those DDSes.
Ideally I would be able to listen for any “operations / changes” at the container level and dynamically load the impacted DDS to be able to transcribe the information that was updated into this external index storage.
I originally left this as a comment to SamBroner's response but it got too long.
The ContainerRuntime raises an "op" event for every op, so you can listen to that to implement something similar to #1. This is missing in the docs currently so it's not obvious.
I think interpreting ops without loading the DDS code itself might be possible for DDSes with simpler merge logic, like SharedMap, but very challenging for SharedSequence, for example.
I guess it depends on the granularity of information you're trying to glean from the ops with general purpose code. Knowing just that a DDS was edited may be feasible, but knowing its resulting state... more difficult.
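A minimal sketch of that, assuming you already hold a reference to the ContainerRuntime (for example from your runtime factory); the exact event payload may vary between Fluid versions:
// "op" is raised for every sequenced operation the runtime processes
containerRuntime.on("op", (op) => {
    // op carries ISequencedDocumentMessage-style fields: type, clientId, contents, ...
    console.log(`op type=${op.type} from client=${op.clientId}`);
});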
There are actually two questions here: 1. How do I listen to container-level operations? 2. How do I load just one DDS?
How do I listen to operations?
This is certainly possible, but is not included as a capability in the reference implementation of the service. There are multiple ways of architecting a solution to this.
Use the EventEmitter on the Container object itself. The sequencedDocumentMessage will have a type property. When type === "op", the message contents will include metadata about a change to data. You can use the below to get a feel for this.
const container = await getTinyliciousContainer(documentId, DiceRollerContainerRuntimeFactory, createNew);

// Listen for sequenced messages emitted by the container
container.on("op", (sequencedDocumentMessage) => {
    if (sequencedDocumentMessage.type === "op") {
        console.log(sequencedDocumentMessage.contents);
    }
});
If you're looking for all of the message types and message interfaces, the enum and the generic interface for ISequencedDocumentMessage are both here.
Listen directly to the Total Order Broadcast with a bespoke lambda
If you're running the reference implementation of Fluid, you can just add a new lambda that listens directly to Kafka (the default Total Order Broadcast) and does this job. The lambdas that are already running are located here: server/routerlicious/packages/lambdas. Deli is actually doing a fairly similar job already by listening to operations from Kafka and labeling them by their Container.
Use the Foreman "lambda" in R11S specifically for spawning jobs
I'd prefer an architecture where the lambda is actually just a "job runner". This would give you something along the lines of "Fluid Lambdas", where these lambdas can just react to operations coming off the Kafka stream. Functionality like this is included, but poorly documented and tested, in the Foreman lambda.
Critically, listening to just the ops is not a very good way to know the current state of a Distributed Data Structure. The Distributed Data Structures manage the merging of new operations into their state. Therefore the easiest way to get the current state of a DDS is to load the DDS.
How do I load just one DDS?
This is actually fairly straightforward, if not well documented. You'll need to provide a requestHandler that can fetch just the DDS from the container. Ultimately, the Container does have the ability to virtualize everything that isn't requested. You'll need to load the container, but just request the specific DDS.
In pseudocode...
const container = await loader.resolve({ url: "fluid://URIToContainer" });
const dds = await container.request({ url: "/ddspaths/uuid" });
dds.getIndexData();

How does a MongoDB replica set work with Node.js and Mongoose?

Tech stack used: Node.js, Mongoose, MongoDB
I'm working on a product that handles many DB requests. At the beginning of every month the DB requests are high due to heavy read/write traffic (bulk data processing). The number of records in each collection targeted by these read/write requests is quite high. Reads are high, but writes are not that high.
So the CPU utilization on the instance on which MongoDB is running reaches the danger zone (above 90%) during these times. The only thing that gets me through these times is HOPE (yes, hoping that the instance will not crash).
Rather than scaling vertically, I'm looking for solutions to scale horizontally (not a revolutionary thought). I looked at replica sets and sharding. This question is only related to replica sets.
I went through the documents and I feel like my understanding of replica sets may not match the way they actually work.
I have configured my replica set with the configuration below. I simply want to add one more instance because, as per my current understanding, if I add one more instance then my database can handle more read requests by distributing the load, which could reduce the CPU utilization on the primary node by at least 30%. Is this understanding correct or wrong? Please share your thoughts.
var configuration = {
  _id: "testReplicaDB",
  members: [
    { _id: 0, host: "localhost:12017" },
    { _id: 1, host: "localhost:12018", arbiterOnly: true, buildIndexes: false },
    { _id: 2, host: "localhost:12019" }
  ]
}
When I brought up the replica set with the above config and ran my Node.js/Mongoose code, I ran into this issue. The resolution they propose is to change the above config into
var configuration = {
  _id: "testReplicaDB",
  members: [
    { _id: 0, host: "validdomain.com:12017" },
    { _id: 1, host: "validdomain.com:12018", arbiterOnly: true, buildIndexes: false },
    { _id: 2, host: "validdomain.com:12019" }
  ]
}
Question 1 (related to the code in the Node.js project, using the mongoose library for handling the DB, that connects to the replica set)
const URI = `mongodb://167.99.21.9:12017,167.99.21.9:12019/${DB}`;
I have to specify both URIs of my MongoDB instances in the mongoose connection URI string.
When I look at my Node.js/Mongoose code that will connect to the replica set, I have many doubts about how it might handle multiple nodes.
How does mongoose know which IP is the primary node?
Let's assume 167.99.21.9:12019 is the primary node and rs.slaveOk(false) is set on the secondary replica, so the secondary node cannot serve read requests.
In this situation, does mongoose send the request to the first URI (167.99.21.9:12017) and that instance redirects it to the primary node, or does the request come back to mongoose and then mongoose sends another request to 167.99.21.9:12019?
Question 2
This doc link mentions that data redundancy enables handling high read traffic. Let's assume reads are enabled for the secondary node, and
let's assume the case where mongoose sends a request to the primary node while the primary node is being bombarded with read/write requests but the secondary node is free (doing nothing). Will MongoDB automatically redirect the request to the secondary node, or will this request fail and come back to mongoose, so that the burden is on mongoose to send another request to the next available node?
Can mongoose automatically know which node in the replica set is free?
Question 3
Assuming both the 167.99.21.9:12017 & 167.99.21.9:12019 instances are available for read requests with ReadPreference.SecondaryPreferred or ReadPreference.nearest, will the load get distributed when the secondary node gets bombarded with read requests and the primary node is at, say, 20% utilization? Is this the case, or is my understanding wrong? Can the replica set act as a load balancer? If not, how do I make it balance the load?
Question 4
var configuration = {
  _id: "testReplicaDB",
  members: [
    { _id: 0, host: "validdomain.com:12017" },
    { _id: 1, host: "validdomain.com:12018", arbiterOnly: true, buildIndexes: false },
    { _id: 2, host: "validdomain.com:12019" }
  ]
}
You can see the DNS name in the configuration. Does this mean that when the primary node redirects a request to the secondary node, DNS resolution will happen and then, using the IP that corresponds to the secondary node, the request will be redirected to it? Is my understanding correct or wrong? (If my understanding is correct, this is going to fire up another set of questions.)
:|
I could've missed many details while reading the docs. This is my last hope of getting answers, so please share if you know the answers to any of these.
if this is the case, then how does mongoose know which ip is the primaryReplicaset?
There is no "primary replica set", there can be however a primary in a replica set.
Each MongoDB driver queries all of the hosts specified in the connection string to discover the members of the replica set (in case one or more of the hosts is unavailable for whatever reason). When any member of the replica set responds, it does so with the full list of current members of the replica set. The driver then knows what the replica set members are, and which of them is currently primary (if any).
secondaryReplica cannot serve readRequests
This is not at all true. Any data-bearing node can fulfill read requests, IF the application provided a suitable read preference.
In this situation, does mongoose trigger to the first uri(167.99.21.9:12017) and this instance would redirect to the primaryReplicaset or will the request comeback to mongoose and then mongoose will trigger another request to the 167.99.21.9:12019 ?
mongoose does not directly talk to the database. It uses the driver (node driver for MongoDB) to do so. The driver has connections to all replica set members, and sends the requests to the appropriate node.
For example, if you specified a primary read preference, the driver would send that query to the primary if one exists. If you specified a secondary read preference, the driver would send that query to a secondary if one exists.
i'm assuming that when both 167.99.21.9:12017 & 167.99.21.9:12019 instances are available for read requests with ReadPreference.SecondaryPreferred or ReadPreference.nearest
Correct, any node can fulfill those.
the load could get distributed across
Yes and no. In general replicas may have stale data. If you require current data, you must read from the primary. If you do not require current data, you may read from secondaries.
how to make it balance the load?
You can make your application balance the load by using secondary or nearest reads, assuming it is OK for your application to receive stale data.
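As an illustration, a minimal sketch of a mongoose connection that opts into secondary reads via the connection string (the hosts, replica set name, and database name are taken from the question's config and are placeholders):
const mongoose = require('mongoose')
// readPreference=secondaryPreferred lets the driver send reads to a secondary
// when one is available, accepting possibly stale data
const uri = 'mongodb://validdomain.com:12017,validdomain.com:12019/mydb'
  + '?replicaSet=testReplicaDB&readPreference=secondaryPreferred'
mongoose.connect(uri)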
if mongoose triggers a request to primaryReplica and primaryReplica is bombarded with read/write requests and secondaryReplica is free(doing nothing) , then will mongodb automatically redirect the request to secondaryReplica?
No, a primary read will not be changed to a secondary read.
Especially in the scenario you are describing, the secondary is likely to be stale, thus a secondary read is likely to produce wrong results.
can mongoose automatically know which replica is free?
mongoose does not track deployment state; the driver is responsible for this. There is limited support in drivers for choosing a "less loaded" node, although this is measured based on network latency, not CPU/memory/disk load, and only applies to the nearest read preference.

Connecting from AWS Lambda to MongoDB

I'm working on a Node.js project and using a pretty common AWS setup, it seems. My API Gateway receives a call and triggers Lambda A; then Lambda A triggers other lambdas, say B or C, depending on params passed from API Gateway.
Lambda A needs to access MongoDB and, to avoid the hassle of running MongoDB myself, I decided to use mLab. At the moment Lambda A accesses MongoDB using the Node.js driver.
Now, to avoid starting a connection with every Lambda A execution, I keep a connection pool inside the Lambda A code but outside of the handler, which allows me to reuse connections when Lambda A is invoked multiple times.
This seems to work fine.
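For reference, a minimal sketch of that pattern using the current MongoDB Node.js driver API, with the client cached in module scope so warm invocations of the same Lambda instance reuse the connection (the URI env var, database, and collection names are placeholders):
const { MongoClient } = require('mongodb')
// Cached outside the handler: survives across invocations of a warm Lambda instance
let cachedClient = null
async function getClient() {
  if (!cachedClient) {
    cachedClient = new MongoClient(process.env.MONGODB_URI)
    await cachedClient.connect()
  }
  return cachedClient
}
exports.handler = async (event) => {
  const client = await getClient()
  return client.db('mydb').collection('items').findOne({ _id: event.id })
}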
However, I'm not sure how to deal with connections when Lambda A is invoking Lambda B and Lambda B needs to access mLab's MongoDB database.
Is it possible to pass the connection pool somehow, or would Lambda B have to keep its own connection pool?
I was thinking of using mLab's Data API, which exposes most of the operations of the MongoDB driver, so I could use HTTP calls, e.g. GET and POST, to run commands against the database. It seems similar to RESTHeart.
I'm leaning towards option 2, but mLab's Data API documentation clearly states to avoid using the REST API unless you cannot connect using the MongoDB driver directly:
The first method—the one we strongly recommend whenever possible for added performance and functionality—is to connect using one of the available MongoDB drivers. You do not need to use our API if you use the driver. The second method, documented in this article, is to connect via mLab’s RESTful Data API. Use this method only if you cannot connect using a MongoDB driver.
Given all this, how would it be best to approach it? 1 or 2, or is there another option I should consider?
Unfortunately you won't be able to 'share' a mongo connection across lambdas, because ultimately there's a 'physical' socket for the connection which is specific to that instance.
I think both of your solutions are good depending on usage.
If you tend to have steady average concurrency on both Lambda A and B across an hour period (which is a bit of a rule of thumb for how long AWS keeps a lambda instance alive), then having them both own their own static connections is a good solution. This is because the chances are that a request will reach an already started and connected lambda. I would also guess that Node drivers for 'vanilla' Mongo are more mature than those for the RESTful Data API.
However, if you get spiky or uneven load, then you might use the RESTful Data API. This is because you'll be centralising the responsibility for managing the number of open connections to your instances in a single place, which under these conditions means you're less likely to be opening unneeded connections, or using all of your current capacity and having to wait for a new connection to be established.
Ultimately it's a game of probabilistic load balancing: either you 'pool' all your connections in a central place (the Data API) and become less affected by the usage of a single function at the expense of greater latency on individual operations, or you pool at a function level but are more exposed to cold starts opening connections under uneven concurrency.

Need more insight into Hazelcast Client and the ideal scenario to use it

There is already a question on the difference between Hazelcast Instance and Hazelcast client.
And it is mentioned that
HazelcastInstance = HazelcastClient + AnotherFeatures
So is it right to say the client just reads from and writes to the cluster that has been formed, without being involved in the cluster itself, i.e. the client does not store the data?
This is important to know since we can configure JVM memory according to usage. The instances forming the cluster will be allocated more memory than the ones that are just connecting as clients.
It is a little bit more complicated than that. The Hazelcast Lite Member is a full-blown cluster member, just without partitions assigned to it. That said, it doesn't store any data but otherwise behaves like a normal member.
Clients, on the other hand, are simple proxies that have to forward everything to a cluster member to get any operation done. You can imagine a Hazelcast client to be something like a JDBC client, which has just enough code to connect to the cluster and redirect requests / retrieve responses.
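For what it's worth, a minimal sketch of a Node.js Hazelcast client (the cluster name and member address are assumptions); the client holds no data itself and simply forwards each operation to a cluster member:
const { Client } = require('hazelcast-client')
async function example() {
  const client = await Client.newHazelcastClient({
    clusterName: 'dev',                              // assumed cluster name
    network: { clusterMembers: ['127.0.0.1:5701'] }  // assumed member address
  })
  const map = await client.getMap('example-map') // proxy to a distributed map
  await map.put('key', 'value')                  // executed on a member, not locally
  console.log(await map.get('key'))
  await client.shutdown()
}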

DynamoDB Application Architecture

We are using DynamoDB with node.js and Express to create REST APIs. We have started to go with Dynamo on the backend, for simplicity of operations.
We have started to use the DynamoDB Document SDK from AWS Labs to simplify usage, and make it easy to work with JSON documents. To instantiate a client to use, we need to do the following:
var AWS = require('aws-sdk');
var Doc = require('dynamodb-doc');
var Dynamodb = new AWS.DynamoDB();
var DocClient = new Doc.DynamoDB(Dynamodb);
My question is, where do those last two steps need to take place in order to ensure data integrity? I'm concerned about an object that is waiting for something to happen in Dynamo being taken over by another process and getting its data swapped, resulting in incorrect data being sent back to a client, or incorrect data being written to the database.
We have three parts to our REST API. We have the main server.js file, which starts Express and the HTTP server, assigns resources to it, sets up logging, etc. We do the first two steps of creating the connection to Dynamo, the AWS and Doc requires, at that point. Those vars are global in the app. Then, depending on the route being followed through the API, we call a controller that parses the input from the REST call. It then calls a model file that does the interacting with Dynamo and provides the response back to the controller, which formats the return package along with any errors and sends it to the client. The model is simply a group of methods that essentially cover the same area of the app. We would have a user model, for instance, that covers things like login and account creation.
I have done the last two steps above for creating the Dynamo object in two places. One, I have simply placed them in one spot, at the top of each model file. I do not re-instantiate them in the methods below, I simply use them. I have also instantiated them within the methods, when we are preparing to make the call to Dynamo, making them entirely local to the method, and passing them to a secondary function if needed. This second method has always struck me as the safest way to do it. However, under load testing, I have run into situations where we seem to have overwhelmed the outgoing network connections, and I start getting errors telling me that the DynamoDB endpoint is unavailable in the region I'm running in. I believe this is from the additional calls required to make the connections.
So, the question is: is creating those objects local to the model file safe, or do they need to be created locally in the method that uses them? Any thoughts would be much appreciated.
You should be safe creating just one instance of those clients and sharing them in your code, but that isn't related to your underlying concern.
Concurrent access to various records in DynamoDB is still something you have to deal with. It is possible to have different requests attempt writes to the same object at the same time. This is possible if you have concurrent requests on a single server, but is especially true when you have multiple servers.
Writes to DynamoDB are atomic only at the individual item level. This means that if your logic requires multiple updates to separate items, potentially in separate tables, there is no way to guarantee all or none of those changes are made. It is possible that only some of them are made.
DynamoDB natively supports conditional writes, so it is possible to ensure specific conditions are met, such as specific attributes still having certain values; otherwise the write will fail.
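For illustration, a minimal sketch of a conditional write using the aws-sdk DocumentClient (the table, key, and version attribute are made-up examples); the update succeeds only while the stored version still matches what this process last read:
var AWS = require('aws-sdk');
var docClient = new AWS.DynamoDB.DocumentClient();
docClient.update({
  TableName: 'Users',                         // hypothetical table
  Key: { userId: 'abc123' },
  UpdateExpression: 'SET #name = :name, version = :next',
  ConditionExpression: 'version = :expected', // optimistic locking check
  ExpressionAttributeNames: { '#name': 'name' },
  ExpressionAttributeValues: { ':name': 'Alice', ':next': 2, ':expected': 1 }
}, function (err) {
  if (err && err.code === 'ConditionalCheckFailedException') {
    // Another writer changed the item first; re-read and retry or report a conflict
  }
});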
With respect to making too many requests to DynamoDB... unless you are overwhelming your machine, there shouldn't be any way to overwhelm the DynamoDB API. If you are performing more reads/writes than you have provisioned, you will receive errors indicating provisioned throughput has been exceeded, but the API itself is still functioning as intended under these conditions.

Resources