Read/write request waiting queue and waiting time in Apache Cassandra

I'm new to Apache Cassandra and I'm doing research on it, especially on the waiting queue length and waiting time for read/write requests.
Apache Cassandra is based on SEDA (staged event-driven architecture). That means each request, for example a read request, is put in a queue to be processed. Based on this, there should be metrics for the waiting queue and waiting time of each request.
According to the post at this link: https://www.pythian.com/blog/guide-to-cassandra-thread-pools/, the queue is part of the messaging service.
Meanwhile, I did some searching and found that my question is still an open ticket for the Cassandra developers: https://issues.apache.org/jira/browse/CASSANDRA-8398. The ticket has been there for a long time.
I also noticed that I can get some of this information with nodetool tpstats. However, that means running the command in a terminal and parsing its printed output, which is not precise enough for my case.
I'm wondering if anyone has hints on where I should start to get the metrics for the waiting time and queue length of each request.
Thanks!
Steven
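As a starting point (this is not from the thread itself): the thread pool statistics that tpstats prints are also exposed continuously as JMX MBeans under org.apache.cassandra.metrics, e.g. type=ThreadPools, path=request, scope=ReadStage, name=PendingTasks for the read-stage queue length. A minimal polling sketch in TypeScript, assuming a Jolokia agent has been attached to the Cassandra JVM to expose JMX over HTTP on port 8778 (both the agent and the port are assumptions, not something the thread mentions):

// Polls Cassandra's read-stage queue length through an (assumed) Jolokia agent.
const JOLOKIA = "http://localhost:8778/jolokia/read"; // assumed agent endpoint
const MBEAN =
  "org.apache.cassandra.metrics:type=ThreadPools,path=request,scope=ReadStage,name=PendingTasks";

async function readPendingTasks(): Promise<number> {
  const res = await fetch(`${JOLOKIA}/${MBEAN}/Value`);
  if (!res.ok) throw new Error(`Jolokia request failed: ${res.status}`);
  const body = (await res.json()) as { value: number };
  return body.value; // current length of the read-stage queue
}

// Sampling the gauge over time approximates queue behaviour; true per-request
// waiting time is exactly what CASSANDRA-8398 asks to be instrumented.
setInterval(async () => {
  console.log(new Date().toISOString(), "ReadStage pending:", await readPendingTasks());
}, 1000);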

Related

How to perform long event processing in Node JS with a message queue?

I am building an email processing pipeline in Node JS with Google Pub/Sub as a message queue. The message queue has a limitation where it needs an acknowledgment for a sent message within 10 minutes. However, the jobs it's sending to the Node JS server might take an hour to complete. So the same job might run multiple times till one of them finishes. I'm worried that this will block the Node JS event loop and slow down the server too.
My questions are:
Should I be using a message queue to start this long-running job given that the message queue expects a response in 10 mins or is there some other architecture I should consider?
If multiple such jobs start, should I be worried about the Node JS event loop being blocked? Each job basically iterates through a MongoDB cursor, creating hundreds of thousands of emails.
Well, it sounds like you either should not be using that queue (with the timeout you can't change) or you should break up your jobs into something that easily finishes long before the timeouts. It sounds like a case of you just need to match the tool with the requirements of the job. If that queue doesn't match your requirements, you probably need a different mechanism. I don't fully understand what you need from Google's pub/sub, but creating a queue of your own or finding a generic queue on NPM is generally fairly easy if you just want to serialize access to a bunch of jobs.
I rather doubt you have nodejs event loop blockage issues as long as all your I/O is using asynchronous methods. Nothing you're doing sounds CPU-heavy and that's what blocks the event loop (long running CPU-heavy operations). Your whole project is probably limited by both MongoDB and whatever you're using to send the emails so you should probably make sure you're not overwhelming either one of those to the point where they become sluggish and lose throughput.
To answer the original question:
Should I be using a message queue to start this long-running job given that the message queue expects a response in 10 mins, or is there some other architecture I should consider?
Yes, a message queue works well for dealing with these kinds of events. The important thing is to make sure the final action is idempotent, so that even if you process duplicate events by accident, the final result is applied once. This guide from Google Cloud is a helpful resource on making your subscriber idempotent.
To get around the 10 min limit of Pub/Sub, I ended up creating an in-memory table that tracked active jobs. If a job was actively being processed and Pub/Sub sent the message again, it would do nothing. If the server restarts and loses the job, the in-memory table also disappears, so the job can be processed once again if it was incomplete.
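A minimal sketch of that dedupe table, assuming the @google-cloud/pubsub client, a subscription named email-jobs, and a jobId attribute set by the publisher (all three are assumptions, not details from the question):

import { PubSub, Message } from "@google-cloud/pubsub";

// In-memory table of jobs currently being processed. It is lost on restart,
// which is fine: an incomplete job's message is simply re-delivered and re-run.
const activeJobs = new Set<string>();

// Placeholder for the actual long-running job; it must be idempotent.
async function runLongEmailJob(jobId: string): Promise<void> { /* ... */ }

const subscription = new PubSub().subscription("email-jobs");

subscription.on("message", async (message: Message) => {
  const jobId = message.attributes.jobId;
  if (activeJobs.has(jobId)) {
    message.nack(); // duplicate delivery while the job is running: do nothing with it
    return;
  }
  activeJobs.add(jobId);
  try {
    await runLongEmailJob(jobId);
    message.ack(); // acknowledge only once the work is actually done
  } finally {
    activeJobs.delete(jobId);
  }
});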
If multiple such jobs start, should I be worried about the Node JS event loop being blocked? Each job is basically iterating through a MongoDB cursor creating hundreds of thousands of emails.
I have ignored this for now as per the comment left by jfriend00. You can also rate-limit the number of jobs being processed.

Limiting the number of requests in Cassandra without the timeout starting to tick

Version 4 of the DataStax Cassandra driver has a request throttling feature.
The documentation states:
Similarly, the request timeout encompasses throttling: the timeout starts ticking before the throttler has started processing the request; a request may time out while it is still in the throttler's queue, before the driver has even tried to send it to a node.
Great. However, let's say I have a dynamic list of ids and I want to execute select requests against Cassandra in parallel (using executeAsync()) for all ids in the list. If the list is too large, I will eventually face timeouts because requests sit in the throttler's queue too long.
How can I overcome this issue? Is there any built-in rate limiting technique so that I don't have to care about how many requests I can execute in parallel, but can just throw all of them at Cassandra and then wait until they have all completed?
UPD: I am not interested in custom code solutions, as of course we are capable of implementing our own rate limiting. I am asking specifically about the driver's built-in mechanisms to achieve this.
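For reference (driver configuration, not an answer from the thread): in the Java driver 4, the built-in throttler the quoted documentation describes is enabled through application.conf. A sketch based on the driver's reference configuration; verify the exact option names against your driver version:

datastax-java-driver.advanced.throttler {
  # or ConcurrencyLimitingRequestThrottler to cap in-flight requests instead of rate
  class = RateLimitingRequestThrottler
  max-requests-per-second = 1000
  max-queue-size = 10000
  drain-interval = 10 milliseconds
}

Note that, per the documentation quoted above, the request timeout still includes time spent in this queue, so the built-in throttler alone does not avoid the timeouts the question describes; the timeout (basic.request.timeout) would have to be raised to cover the expected queueing delay.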

Cancel a running query

I have an application where users are running a geospatial query against a mongo database. The query can return many thousands of results (~50k). These results are then streamed to the client over a websocket. However, users can abort a request mid result set and execute a new query. Users will frequently start, abort, and re-start requests on the order of several times per minute. Sometimes they even cancel/restart every couple of seconds.
The question is, when a user aborts a request, how do I cancel the query on the server so it doesn't continue to tie up resources streaming back thousands of unneeded results? I'm currently calling destroy() on the cursor, but it's not clear that this is actually stopping the query from executing on the server.
What's the best practice in this case?
Have you tried this?
db.currentOp()
db.killOp(<opid returned above>)
This is a good example.
The answer is that it depends on a lot of your implementation details.
If your server is in the middle of streaming results (e.g. still hasn't sent or queued everything) when the server receives some sort of other message that the previous results should be cancelled, then it is possible for you to communicate with that other stream and tell it to stop sending. How exactly you would do that depends entirely upon your code and you would have to show us your code for us to know.
Chances are the db query is long since complete and what is going on is that the server is in the process of streaming results to the client. So, if that's the case, it isn't the db you're looking at; it's the code that streams the response to the client. Since node.js is single-threaded, the only time another request would actually get run on the server would be while the streaming code was in some async write operation, waiting for that to finish. You would probably have to set some flag that was uniquely associated with a particular user, and your stream code would have to check for that flag before each chunk of data was sent. If it saw the cancel flag, it could abandon sending the rest of the results.
You could make things more cancellable by explicitly chunking your results (say 500 at a time) and checking for a cancel flag between the sending of each chunk (a sketch of this approach follows below).
If, on the other hand, all the data has already been buffered up by the TCP layer on the server, then the only way to stop that from being sent is to tear down the webSocket and force the client to reconnect.
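A minimal sketch of the chunk-and-check approach, assuming the mongodb and ws packages and a per-request id scheme invented for the example:

import { WebSocket } from "ws";
import { FindCursor } from "mongodb";

// Cancel flags keyed by a request id (hypothetical scheme).
const cancelled = new Set<string>();

export function cancelQuery(requestId: string): void {
  cancelled.add(requestId);
}

export async function streamResults(
  requestId: string,
  cursor: FindCursor,
  ws: WebSocket,
  chunkSize = 500
): Promise<void> {
  try {
    let chunk: unknown[] = [];
    for await (const doc of cursor) {
      if (cancelled.has(requestId)) return; // abandon the remaining results
      chunk.push(doc);
      if (chunk.length >= chunkSize) {
        ws.send(JSON.stringify(chunk)); // one chunk per websocket frame
        chunk = [];
      }
    }
    if (chunk.length > 0) ws.send(JSON.stringify(chunk));
  } finally {
    cancelled.delete(requestId);
    await cursor.close(); // also kills the server-side MongoDB cursor
  }
}

Because node.js is single-threaded, the cancel flag can only be observed at the await points (cursor batch fetches and writes), which is exactly the behaviour described above.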

Background jobs that run on every request on Heroku and node.js

I have an app that needs to run a very long process (takes 30-60 seconds for each request). After the processing, the result is then returned to the request as a response. This works fine locally, but it crashes my Heroku instance.
What I'd like to happen instead is:
User comes on site, request sent to backend
Backend returns immediately, and kicks off another process/task/job that does the processing
When the processing ends, the response is returned to the correct user.
I am not sure what I need for this. Based on an hour of research, it seems like I can use Redis as a queue and a worker can poll it every x minutes. But what I can't understand is how to figure out which request to send the response to after processing ends.
Is there a sample Express/node.js implementation for this? Any pointers are helpful.
Like you found in your research, setting up a worker queue using Redis is a good approach for long running processes. A nice library for this is kue (https://github.com/learnboost/kue).
When it comes to responding to a request with the results of the job, having an outstanding request hang while waiting for a response is not a good way to go about it (and may not work; Heroku kills requests that have been idle for a certain period of time).
What you could do is: when the request is made, start the background job and respond to the request right away with the job ID. The client can then poll the server for the status of the job; when the job is complete, it can fetch the needed result.
Kue (from @mattetre's answer) is no longer maintained. Kue's GitHub page suggests Bull as a good alternative. It is a fast and reliable Redis-based queue for Node.js.
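A minimal sketch of the job-ID-and-poll pattern using Bull and Express (the queue name, endpoints, and doSlowWork step are invented for the example):

import express from "express";
import Queue from "bull";

// Hypothetical stand-in for the 30-60 second task.
async function doSlowWork(data: unknown): Promise<unknown> { /* ... */ return data; }

const jobQueue = new Queue("long-jobs", "redis://127.0.0.1:6379");

// The worker can run in the web process or, better on Heroku, in a worker dyno.
jobQueue.process(async (job) => doSlowWork(job.data));

const app = express();

// Kick off the job and respond immediately with its id.
app.post("/jobs", express.json(), async (req, res) => {
  const job = await jobQueue.add(req.body);
  res.status(202).json({ jobId: job.id });
});

// The client polls this until the state is "completed", then reads the result.
app.get("/jobs/:id", async (req, res) => {
  const job = await jobQueue.getJob(req.params.id);
  if (!job) return res.status(404).end();
  res.json({ state: await job.getState(), result: job.returnvalue });
});

app.listen(3000);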

Does setting visibilitytimeout to 7 days mean automatic deletion for an Azure queue?

It seems that the new Azure SDK extends the visibilitytimeout to <= 7 days. I know that by default, when I add a message to an Azure queue, its time-to-live is 7 days. If I get a message out and set the visibilitytimeout to 7 days, does that mean I don't need to delete the message if I don't care about message reliability? The message will disappear after 7 days anyway.
I want to take this approach because DeleteMessage is very slow. If I don't delete messages, does that have any impact on the performance of GetMessage?
Based on the documentation for Get Messages, I believe it is certainly possible to set the VisibilityTimeout period to 7 days so that messages are fetched only once. However, I see some issues with this approach compared to just deleting the message once the process is done (a sketch of the delete-after-processing flow follows this list):
What happens when you get the message, start processing it, and somehow the process fails? If you set the visibility timeout to 7 days, then the message would never appear in the queue again, and thus the work it was supposed to trigger never gets done.
Even though the message is hidden, it is still in the queue, so you keep incurring storage charges for it. The cost is trivial, but why keep a message you don't really need?
A lot of systems rely on the Approximate Messages Count property of a queue to check the health of the processes performed by messages in that queue. Note that even though you make a message hidden, it is still in the queue and thus is included in the total message count. So if you're building a system that relies on this for health checks, it will always look unhealthy because you're never deleting messages.
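For contrast, the delete-after-processing flow looks like this with the current @azure/storage-queue SDK (a newer SDK than the one in the thread; the queue name, env var, and handle step are invented for the example):

import { QueueServiceClient } from "@azure/storage-queue";

// Hypothetical processing step.
async function handle(messageText: string): Promise<void> { /* ... */ }

const queue = QueueServiceClient
  .fromConnectionString(process.env.AZURE_STORAGE_CONNECTION_STRING!)
  .getQueueClient("jobs");

async function processOne(): Promise<void> {
  // Hide the message only for as long as processing should take (here 5 minutes),
  // so a failed run reappears and can be retried: the opposite of the 7-day idea.
  const { receivedMessageItems } = await queue.receiveMessages({
    numberOfMessages: 1,
    visibilityTimeout: 300, // seconds
  });
  for (const msg of receivedMessageItems) {
    await handle(msg.messageText);
    // Delete once done so the queue and its approximate message count stay accurate.
    await queue.deleteMessage(msg.messageId, msg.popReceipt);
  }
}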
I'm curious to know why you find deleting messages to be very slow. In my experience this is quite fast. How are you monitoring message deletion?
Rather than hacking around the problem, I think you should drill into understanding why the deletes are slow. Have you enabled logging and looked at the E2ELatency and ServerLatency numbers across all your queue operations? Ideally you shouldn't see a large difference between the two for your queue operations. If you do see a large difference, it implies something is happening on the client that you should investigate further.
For more information on logging take a look at the following articles:
http://blogs.msdn.com/b/windowsazurestorage/archive/tags/analytics+2d00+logging+_2600_amp_3b00_+metrics/
http://msdn.microsoft.com/en-us/library/azure/hh343262.aspx
Information on client-side logging can be found in this blog post: http://blogs.msdn.com/b/windowsazurestorage/archive/2013/09/07/announcing-storage-client-library-2-1-rtm.aspx
Please let me know what you find.
Jason
