Redis Streams: How to manage perpetual subscription and BLOCK behaviour?

I am using Redis in an Express application. My app is both a publisher and consumer of streams, using a single redis connection (redis.createClient).
I have a question on the best way to manage a perpetual subscription (with xreadgroup).
Currently I am doing this:
const readStream = () => xreadgroup('GROUP', appId, consumerId, 'BLOCK', 1, 'COUNT', 1, 'STREAMS', key, '>')
  .then(handleData)
  .then(() => setImmediate(readStream));
Where xreadgroup is just a promisified version of node-redis' xreadgroup.
My question - What is the appropriate usage of BLOCK? If I block indefinitely or for a long period, then my client cannot publish any messages (with xadd) until it is unblocked, or the block times out. Since I must use some sort of loop/recursion in order to keep reading events, BLOCK appears to be fairly unnecessary; can I just leave it off and is this the expected usage?
Likewise, is using setImmediate appropriate or would process.nextTick or an async loop be preferred?
There is very little documentation in node-redis and the few examples simply read the messages once after blocking and do not produce/consume on the same client.

Not an expert on the subject, but I'd like to share some thoughts that might help.
I'm not sure whether node-redis can "stack" multiple commands, that is, whether it can fire new commands while waiting for XREADGROUP to complete.
From your description, it seems that it cannot: other commands wait behind the blocked call. In that case, I suggest you create a dedicated connection for calling XREADGROUP. This way you can publish and listen without one blocking the other.
You don't need to use BLOCK, but if your goal is to listen to all events and wait for those not yet published, using it might be wise: it gives you better performance while making fewer calls to redis.
setImmediate is probably fine, especially when using BLOCK. If you don't use BLOCK, it might be good to add a short delay between calls, because without BLOCK each call returns almost immediately. You can check this for more details.
Friendly reminder: don't forget to ACK your messages or use NOACK (might be ok depending on your use case):
consumer groups require explicit acknowledgment of the messages successfully processed by the consumer, via the XACK command.
The NOACK subcommand can be used to avoid adding the message to the PEL in cases where reliability is not a requirement and the occasional message loss is acceptable.
Source: https://redis.io/commands/xreadgroup
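A minimal sketch of the dedicated-connection idea with an async loop, assuming two separate connections that expose promisified xreadgroup/xack/xadd wrappers like the one in the question. The names reader, writer, consume, publish, handle and isRunning are hypothetical, and the 5000 ms BLOCK and COUNT of 10 are arbitrary values, not recommendations.

```javascript
// Sketch only: `reader` and `writer` stand for two separate Redis connections
// exposing promisified xreadgroup/xack/xadd (hypothetical wrappers, as in the question).
async function consume({ reader, group, consumer, key, handle, isRunning }) {
  while (isRunning()) {
    // BLOCK 5000: wait up to 5 s for new entries; the reply is null on timeout.
    const reply = await reader.xreadgroup(
      'GROUP', group, consumer, 'BLOCK', 5000, 'COUNT', 10, 'STREAMS', key, '>'
    );
    if (!reply) continue; // timed out, so loop and block again
    // Reply shape: [[streamKey, [[id, [field, value, ...]], ...]], ...]
    for (const [, entries] of reply) {
      for (const [id, fields] of entries) {
        await handle(id, fields);
        await reader.xack(key, group, id); // acknowledge only after handling succeeded
      }
    }
  }
}

// Publishing goes through the other connection, so it never waits behind BLOCK.
const publish = (writer, key, fields) => writer.xadd(key, '*', ...fields);
```

Because the loop awaits each blocking read, no setImmediate or process.nextTick is needed to keep it going, and publish stays responsive because it uses the second connection.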

Related

RabbitMQ one-consumer/one-publisher pattern

I'm a bit confused/stuck. I've been reading around for RabbitMQ best practices, and a lot of articles come up stating that you should have two connections -- one separated for your subscriber and one for your publisher.
So what I've ended up doing is essentially started my server as so:
amqp.connect(RABBIT_URL, (err, conn) => {
  conn.createChannel((amqpErr, subscribingChannel) => {
    // .....
  });
});
amqp.connect(RABBIT_URL, (err, conn) => {
  conn.createChannel((amqpErr, publishingChannel) => {
    // .......
  });
});
...but I'm 99% certain this isn't the correct way to do this. So this is where my first question is. How do I maintain this rule within one service?
Also, it doesn't really work out since after a task that is done (i.e. finished scraping a page), at the end of that event, I need to fire an event that'll trigger a parsing in another service. I was pretty much doing a
ch.publish(...)
right before I'd do the ack. This isn't pure as the channel I had solely for consuming is now publishing events to trigger a parse from the other service.
This type of 'event/action' order carries on through 2 other services
(1. web/app --> 2. scrape --> 3. parse --> 4. analytics )
My plan is just to trigger an event after the completion at each service. Is this the correct way to do this?
I guess there are two questions.
Thank you.. thank you.. thank you so much to whoever can help me here. Lost an entire weekend just dabbling around. :(

I've been reading around for RabbitMQ best practices, and a lot of articles come up stating that you should have two connections -- one separated for your subscriber and one for your publisher.
Multiple connections and channels as a best practice do not necessarily translate well to Node. Because the event loop is single-threaded, you don't gain any real benefit from opening multiple channels unless you are using child processes or some other form of threading that lets you use both channels at once.
Also, it doesn't really work out since after a task that is done (i.e. finished scraping a page), at the end of that event, I need to fire an event that'll trigger a parsing in another service. I was pretty much doing a ch.publish(...)
right before I'd do the ack. This isn't pure as the channel I had solely for consuming is now publishing events to trigger a parse from the other service.
This is not really an issue if you're using a single channel, which as I mentioned before, is not really a problem in Node.
My plan is just to trigger an event after the completion at each service. Is this the correct way to do this?
I see no problem with this method, and have used it quite extensively in my own Node microservices. Rabbit is a very robust and flexible system, and everyone's architecture will differ depending on their own application/service needs.
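A rough sketch of the "publish the next event, then ack" step on a single channel. It assumes an amqplib-style channel offering consume, sendToQueue, ack and nack; the function name startStage, the queue names and the work callback are hypothetical.

```javascript
// Sketch only: `channel` is assumed to be an amqplib-style channel
// (consume / sendToQueue / ack / nack); queue names and `work` are hypothetical.
function startStage(channel, { inQueue, outQueue, work }) {
  return channel.consume(inQueue, async (msg) => {
    if (msg === null) return; // the consumer was cancelled
    try {
      const result = await work(JSON.parse(msg.content.toString()));
      // Trigger the next service first, then acknowledge the message just handled.
      channel.sendToQueue(outQueue, Buffer.from(JSON.stringify(result)));
      channel.ack(msg);
    } catch (err) {
      // Reject without requeueing; whether to requeue depends on the use case.
      channel.nack(msg, false, false);
    }
  });
}
```

Each service in the chain (scrape, parse, analytics) could call startStage once with its own queue pair, for example startStage(channel, { inQueue: 'scrape', outQueue: 'parse', work: scrapePage }), where scrapePage is a placeholder for the service's own logic.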

If Redis is single Threaded, how can it be so fast?

I'm currently trying to understand some basic implementation things of Redis. I know that redis is single-threaded and I have already stumbled upon the following Question: Redis is single-threaded, then how does it do concurrent I/O?
But I still think I didn't understand it right. Afaik Redis uses the reactor pattern with one single thread. So if I understood this right, there is a watcher (which handles FDs and incoming/outgoing connections) that delegates the work to be done to its registered event handlers. They do the actual work and hand their responses back as events to the watcher, which transfers the responses back to the clients. But what happens if a request (R1) from one client takes, let's say, about 1 minute, and another client creates another (fast) request (R2)? Then, since redis is single-threaded, R2 cannot be delegated to the right handler until R1 is finished, right? In a multithreaded environment you could just start each handler in its own thread, so the "main" thread is just accepting and responding to I/O connections and all other work is carried out in other threads.
If it really just queues the I/O handling and handler logic, it could never be as fast as it is. What am I missing here?

You're not missing anything, besides perhaps the fact that most operations in Redis complete in less than a couple of microseconds. Long-running operations indeed block the server during their execution.
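The R1/R2 scenario can be illustrated with a toy model. This is not Redis itself, only a simulation of one thread serving commands in arrival order. The function name completionTimes and the time values are made up for illustration (think of the unit as microseconds).

```javascript
// Toy model, not Redis itself: commands run one at a time in arrival order.
// Each request has an arrival time and a service time in the same unit.
function completionTimes(requests) {
  let clock = 0;
  return requests.map(({ name, arrival, service }) => {
    const start = Math.max(clock, arrival); // wait until the single thread is free
    clock = start + service;
    return { name, start, done: clock, waited: start - arrival };
  });
}

// Typical case: both commands are short, so R2 barely waits.
const fast = completionTimes([
  { name: 'R1', arrival: 0, service: 2 },
  { name: 'R2', arrival: 1, service: 2 },
]);

// Pathological case: R1 takes one minute (60e6 microseconds), so R2 is stuck behind it.
const slow = completionTimes([
  { name: 'R1', arrival: 0, service: 60e6 },
  { name: 'R2', arrival: 1, service: 2 },
]);

console.log(fast[1].waited, slow[1].waited); // prints: 1 59999999
```

With short commands the queueing delay is negligible. A single long command delays everything queued behind it.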

Suppose 10,000 users were pulling live data with hmget, at 10 seconds each, while on the other side the server was broadcasting with hmset: redis could only execute the set once it reached the front of the queue.
Redis is only good for queuing and for handling limited processing, like lazily recording last-login info, but not for live info broadcasting. In that case, memcached would be the right choice. Redis is single-threaded and behaves like a FIFO.

Implementing general purpose long polling

I've been trying to implement a simple long polling service for use in my own projects and maybe release it as a SAAS if I succeed. These are the two approaches I've tried so far, both using Node.js (polling PostgreSQL in the back).
1. Periodically check all the clients in the same interval
Every new connection is pushed onto a queue of connections, which is being walked through in an interval.
var queue = [];
function acceptConnection(req, res) {
res.setTimeout(5000);
queue.push({ req: req, res: res });
}
function checkAll() {
queue.forEach(function(client) {
// respond if there is something new for the client
});
}
// this could be replaced with a timeout after all the clients are served
setInterval(checkAll, 500);
2. Check each client at a separate interval
Every client gets his own ticker, which checks for new data.
function acceptConnection(req, res) {
// something which periodically checks data for the client
// and responds if there is anything new
new Ticker(req, res);
}
While this keeps the minimum latency for each client lower, it also introduces overhead by setting a lot of timeouts.
Conclusion
Both of these approaches solve the problem quite easily, but I don't feel that this will scale up easily to something like 10 million open connections, especially since I'm polling the database on every check for every client.
I thought about doing this without the database and just immediately broadcast new messages to all open connections, but that will fail if a client's connection dies for a few seconds while the broadcast is happening, because it is not persistent. Which means I basically need to be able to look up messages in history when the client polls for the first time.
I guess one step up here would be to have a data source where I can subscribe to new data coming in (CouchDB change notifications?), but maybe I'm missing something in the big picture here?
What is the usual approach for doing highly scalable long polling? I'm not specifically bound to Node.js, I'd actually prefer any other suggestion with a reasoning why.

Not sure if this answers your question, but I like the approach of PushPin (+ explanation of concepts).
I love the idea (using a reverse proxy and communicating with return codes + delayed REST return requests), but I do have reservations about the implementation. I might be underestimating the problem, but it seems to me that the technologies used are a bit of an overkill. Not sure yet whether I will use it; I would prefer a more lightweight solution, but I find the concept phenomenal.
Would love to hear what you used eventually.

Since you mentioned scalability, I have to get a little bit theoretical, as the only practical measure is load testing. Therefore, all I can offer is advice.
Generally speaking, once-per-anything is bad for scalability, especially once-per-connection or once-per-request, since that makes part of your app proportional to the amount of traffic. Node.js removed the thread-per-connection dependency with its single-threaded asynchronous I/O model. Of course, you can't completely eliminate having something per connection, like a request and response object and a socket.
I suggest avoiding anything that opens a database connection for every HTTP connection. This is what connection pools are for.
As for choosing between your two options above, I would personally go for the second choice because it keeps each connection isolated. The first option uses a loop over connections, which means actual execution time per connection. It's probably not a big deal given that I/O is asynchronous, but given a choice between an iteration-per-connection and the mere existence of an object-per-connection, I would prefer to just have an object. Then I have less to worry about when suddenly there are 10,000 connections.
The C10K problem seems like a good reference for this, though this is really personal judgement to be honest.
http://www.kegel.com/c10k.html
http://en.wikipedia.org/wiki/C10k_problem
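The question's idea of subscribing to new data while still being able to look up history could be sketched roughly as follows. This is an in-memory illustration only: the Topic class and its methods are hypothetical, and a real service would back the log with a persistent store. There is one small object per waiting client and no periodic database polling.

```javascript
// Sketch only: an in-memory log plus parked waiters. All names are hypothetical.
class Topic {
  constructor() {
    this.log = [];     // every message in order; a message's id is its index + 1
    this.waiters = []; // wake-up callbacks for polls that found nothing new
  }

  publish(message) {
    this.log.push(message);
    const parked = this.waiters;
    this.waiters = [];
    parked.forEach((wake) => wake()); // wake every parked poll once per publish
  }

  // Resolves with the messages newer than `sinceId`, or [] after `timeoutMs`.
  poll(sinceId, timeoutMs) {
    if (this.log.length > sinceId) {
      return Promise.resolve(this.log.slice(sinceId)); // history lookup, no waiting
    }
    return new Promise((resolve) => {
      const wake = () => {
        clearTimeout(timer);
        resolve(this.log.slice(sinceId));
      };
      const timer = setTimeout(() => {
        this.waiters = this.waiters.filter((w) => w !== wake);
        resolve([]); // nothing new; the client is expected to poll again
      }, timeoutMs);
      this.waiters.push(wake);
    });
  }
}
```

A client that reconnects after a dropped connection passes the id of the last message it saw, so anything published in the meantime is returned immediately from the log.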

How to design a scalable rpc call listener?

I have to listen for rpc calls, stack them somewhere, process them, and answer. The thing is that they are not run as soon as they come. The response is an ACK for each rpc call received.
The problem is that I want to design it in a way that I can have many listening servers writing to the same stack of calls, piling them up as they come.
My objective is to listen to as many calls as possible. How should i achieve this?
My main technology is Perl and node.js but would use any open source software for this task.

It sounds like any kind of job queue will do what you need it to; I'm personally a big fan of using Redis for this kind of thing. Since Redis lists maintain insertion order, you can simply LPUSH your RPC call info on to the end of the list from any number of web servers listening to the RPC calls, and somewhere else (in another process/on another machine, I assume) RPOP (or BRPOP) them off and process them.
Since Node.js uses fully asynchronous IO, assuming you're not doing a lot of processing in your RPC listeners (that is, you're only listening for requests, sending an ACK, and pushing onto Redis), my guess is that Node would be exceedingly efficient at this.
An aside on using Redis for a queue: if you want to ensure that, in the event of a catastrophic failure, jobs are not lost, you'll need to implement a little more logic; from the RPOPLPUSH documentation:
Pattern: Reliable queue
Redis is often used as a messaging server to implement processing of background jobs or other kinds of messaging tasks. A simple form of queue is often obtained pushing values into a list in the producer side, and waiting for this values in the consumer side using RPOP (using polling), or BRPOP if the client is better served by a blocking operation.
However in this context the obtained queue is not reliable as messages can be lost, for example in the case there is a network problem or if the consumer crashes just after the message is received but it is still to process.
RPOPLPUSH (or BRPOPLPUSH for the blocking variant) offers a way to avoid this problem: the consumer fetches the message and at the same time pushes it into a processing list. It will use the LREM command in order to remove the message from the processing list once the message has been processed.
An additional client may monitor the processing list for items that remain there for too much time, and will push those timed out items into the queue again if needed.
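The reliable queue pattern quoted above could look roughly like this in Node. The sketch assumes a redis object exposing promisified lpush, rpoplpush and lrem wrappers around the Redis commands of the same names. The function names enqueue and processOne and the list names are hypothetical.

```javascript
// Sketch only: `redis` is assumed to expose promisified lpush / rpoplpush / lrem
// (hypothetical wrappers around the Redis commands of the same names).
const enqueue = (redis, queue, job) => redis.lpush(queue, JSON.stringify(job));

async function processOne(redis, queue, processing, handle) {
  // Atomically move the oldest job from the queue to the processing list.
  const raw = await redis.rpoplpush(queue, processing);
  if (raw === null) return false; // the queue was empty
  await handle(JSON.parse(raw));
  // Remove the job from the processing list only after it was handled successfully.
  await redis.lrem(processing, 1, raw);
  return true;
}
```

If handle throws or the process crashes, the job stays in the processing list, where the monitoring client described in the quote can find it and push it back onto the queue.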

Redis and Node.js and Socket.io Questions

I have just been learning redis and node.js. There are two questions I have for which I couldn't find any satisfying answer.
My first question is about reusing redis clients within the node.js. I have found this question and answer: How to reuse redis connection in socket.io? , but it didn't satisfy me enough.
Now, if I create the redis client within the connection event, it will be spawned for each connection. So, if I have 20k concurrent users, there will be 20k redis clients.
If I put it outside of the connection event, it will be spawned only once.
The answer is saying that he creates three clients for each function, outside of the connection event.
However, from what I know of MySQL, when writing an application which spawns child processes and runs in parallel, you need to create your MySQL client within the function in which you are creating child instances. If you create it outside of it, MySQL will give an error of "MySQL server has gone away" as child processes will try to use the same connection. It should be created for each child process separately.
So, even if you create three different redis clients for each function, if you have 30k concurrent users who send 2k messages concurrently, you should run into the same problem, right? So, every "user" should have their own redis client within the connection event. Am I right? If not, how do node.js and redis handle concurrent requests differently than MySQL? If it has its own mechanism and creates something like child processes within the redis client, why do we need to create three different redis clients then? One should be enough.
I hope the question was clear.
-- UPDATE --
I have found an answer for the following question. http://howtonode.org/control-flow
No need to answer but my first question is still valid.
-- UPDATE --
My second question is this. I am also not that good at JS and Node.js. So, from what I know, if you need to wait for an event, you need to encapsulate the second function within the first function. (I don't know the terminology yet). Let me give an example;
socket.on('startGame', function() {
  getUser();
  socket.get('game', function (gameErr, gameId) {
    socket.get('channel', function (channelErr, channel) {
      console.log(user);
      client.get('games:' + channel + '::' + gameId + ':owner', function (err, owner) { //games:channel.32:game.14
        if (owner === user.uid) {
          //do something
        }
      });
    });
  });
});
So, if I am learning it correctly, I need to run every function within the previous function if I need to wait for an I/O answer. Otherwise, node.js's non-blocking mechanism will allow the first function to run, in this case fetching the result in parallel, but the second function might not have the result if it takes time to arrive. So, if you are getting a result from redis, for example, and you will use the result within the second function, you have to encapsulate it within the redis get function. Otherwise the second function will run without getting the result.
So, in this case, if I need to run 7 different functions and the 8th function will need the result of all of them, do I need to write them like this, recursively? Or am I missing something?
I hope this was clear too.
Thanks a lot,

So, every "user" should have their own redis client within the connection event.
Am I right?
Actually, you are not :)
The thing is that node.js is very unlike, for example, PHP. node.js does not spawn child processes on new connections, which is one of the main reasons it can easily handle large amounts of concurrent connections, including long-lived connections (Comet, Websockets, etc.). node.js processes events sequentially using an event queue within one single process. If you want to use several processes to take advantage of multi-core servers or multiple servers, you will have to do it manually (how to do so is beyond the scope of this question, though).
Therefore, it is a perfectly valid strategy to use one single Redis (or MySQL) connection to serve a large quantity of clients. This avoids the overhead of instantiating and terminating a database connection for each client request.
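A small sketch of the single shared connection, assuming a socket.io-style server object and a node-redis-style client with a callback get. Both are passed in as parameters, and the function name attachHandlers and the event name getOwner are hypothetical.

```javascript
// Sketch only: `io` is a socket.io-style server and `client` a node-redis-style
// client with a callback `get`. `client` is created once, outside the connection
// event, and every connected user shares it.
function attachHandlers(io, client) {
  io.on('connection', (socket) => {
    // No new Redis client here: the handler closes over the shared one.
    socket.on('getOwner', (gameKey, reply) => {
      client.get(gameKey, (err, owner) => reply(err, owner));
    });
  });
}
```

However many users connect, only the one client passed to attachHandlers is used, and requests from different sockets are simply interleaved on it.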

So, every "user" should have their own redis client within the
connection event. Am I right?
You shouldn't make a new Redis client for each connected user, that's not the proper way to do it. Instead just create 2-3 clients max and use them.
For more information, check out this question:
How to reuse redis connection in socket.io?

As for the first question:
The "right answer" might make you think you are good with one Connection.
In reality, whenever you do something that waits on I/O, a timer, etc., you are actually making node put the waiting callback on the queue. Hence, if you use only a single connection, you will limit the performance of the thread you are working on (a single CPU) to the speed of redis, which is probably a few hundred callbacks per second (non-redis callbacks will still proceed). While this is not poor performance, there's no reason to create this kind of limitation. It is recommended to create a few (5-10) connections to avoid this issue in its entirety. This number goes up for slower databases, e.g. MySQL, but depends on the type of queries and the code specifics.
Do note, that you should run a few workers on your server, per the number of CPUs you have, for best performance.
In regards to the 2nd Question:
It is much better practice to name the functions, one after the other, and use the names in the code rather than defining them inline as you go. In some situations, this will also reduce memory consumption.
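For the case of 7 lookups followed by an 8th function that needs all their results, one alternative not mentioned in the answers above is to promisify the callback API and wait on all lookups with Promise.all. The sketch below assumes client.get is a node-style callback function as in the question. The names loadAll, startGame and finalStep are hypothetical.

```javascript
const { promisify } = require('util');

// Sketch only: `client.get` is assumed to be a node-style callback API,
// as in the question; the keys are whatever the application needs.
async function loadAll(client, keys) {
  const get = promisify(client.get).bind(client);
  // Start every lookup at once and wait for all of them, with no nesting.
  return Promise.all(keys.map((key) => get(key)));
}

async function startGame(client, keys, finalStep) {
  const results = await loadAll(client, keys); // results are in the same order as `keys`
  return finalStep(results);                   // the "8th function"
}
```

If any single lookup fails, Promise.all rejects and finalStep is not called, so errors can be handled in one place by the caller.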
