I read through the official documentation and the official whitepaper, but I couldn't find a satisfying answer to how Thrift handles failures in the following scenario:
Say you have a client sending a method call to a server to insert an entry in some data structure residing in that server (it doesn't really matter what it is). Suppose the server has processed the call and inserted the entry but the client couldn't receive a response due to a network failure. In such a case, how should the client handle this? A simple retry of sending the call would possibly result in a duplicate entry being inserted. Does the Thrift library persist the response somewhere so that it can resend to the client when it is back online? Or is it the application's responsibility to do so?
Would appreciate it if someone could point out the details of how it works, besides directing to its source code.
The question is an interesting one, but it is by no means limited to Thrift. A better name would be
Handling failures in asynchronous or remote calls in general
because that's, in essence, what it is. Although in the specific case of an RPC-style API like, for example, a Thrift service, the client blocks and the call seems synchronous, it really isn't that way.
The whole problem can be rephrased to the more general question about
Designing robust distributed systems
So what is the main problem that we have to deal with? We have to assume that every call we make may fail. In particular, it can fail in three ways:
request died
request sent, server processing successful, response died
request sent, server processing failed, response died
In some cases this is not a big deal, regardless of which of these cases we hit. If the client just wants to retrieve some values, it can simply re-query and will eventually get results if it tries often enough.
In other cases, especially when the client modifies data on the server, it becomes more problematic. The general recommendation in such cases is to make the service calls idempotent, meaning: regardless of how often I make the same call, the end result is always the same. This can be achieved by various means and depends more or less on the use case.
For example, one method is to send some logical "ticket" value along with each request, so the server can filter out duplicated or outdated requests. The server keeps track of and/or checks these tickets before processing starts. Again, whether that method suits your needs depends on your use case.
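A minimal sketch of the ticket idea in Node.js, with entirely hypothetical names (the server remembers the highest ticket it has processed per client and skips anything it has already seen):

var lastTicketByClient = {}; // clientId -> highest ticket processed so far

function handleInsert(clientId, ticket, entry) {
  var last = lastTicketByClient[clientId] || 0;
  if (ticket <= last) {
    // duplicate or outdated request: skip the state change
    return { status: 'already-processed' };
  }
  insertEntry(entry);                    // the actual state change (hypothetical)
  lastTicketByClient[clientId] = ticket;
  return { status: 'ok' };
}

The client increments its ticket for each new logical request but reuses the same ticket when retrying, so a retry after a lost response cannot insert the entry twice.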
The Command and Query Responsibility Segregation (CQRS) pattern is another approach to deal with the complexity. It basically splits the API into setters and getters. I'd recommend looking into that topic, though it is not useful for every scenario. I'd also recommend the Data Consistency Primer article. Last but not least, the CAP theorem is always a good read.
Good service/API design is not simple, and the fact that we have to deal with a distributed parallel system does not make it easier, quite the opposite.
Let me try to give a straight answer.
... is it the application's responsibility to do so?
Yes.
There are four types of exceptions involved in Thrift RPC: TTransportException, TProtocolException, TApplicationException, and user-defined exceptions.
According to the book Programmer's Guide to Apache Thrift, the former two are local exceptions, while the latter two are not.
As the names imply, TTransportException covers errors like NOT_OPEN and TIMED_OUT, and TProtocolException covers INVALID_DATA, BAD_VERSION, etc. These exceptions are not propagated from the server to the client and act much like normal language exceptions.
TApplicationExceptions involve problems such as calling a method that isn’t implemented or failing to provide the necessary arguments to a method.
User-defined Exceptions are defined in IDL files and raised by the user code.
For all of these exceptions, no retry is performed by the Thrift RPC framework itself. Instead, they should be handled appropriately by the application code.
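For illustration, a client-side retry wrapper in Node.js might look like the sketch below. This is not a Thrift feature; insertEntry and its requestId parameter are assumptions, i.e. an application-level addition to the IDL so the server can deduplicate retries:

function insertWithRetry(client, requestId, entry, retriesLeft, done) {
  client.insertEntry(requestId, entry, function(err, result) {
    if (err && retriesLeft > 0) {
      // the response may have been lost after the server already processed
      // the call, so retry with the SAME requestId and let the server
      // filter out the duplicate
      return insertWithRetry(client, requestId, entry, retriesLeft - 1, done);
    }
    done(err, result);
  });
}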
Related
I just read this article from Node.js: Don't Block the Event Loop
The Ask
I'm hoping that someone can read over the use case I describe below and tell me whether or not I'm understanding how the event loop gets blocked, and whether or not I'm blocking it. Also, any tips on how I can find this information out for myself would be useful.
My use case
I think I have a use case in my application that could potentially cause problems. I have a functionality which enables a group to add members to their roster. Each member that doesn't represent an existing system user (the common case) gets an account created, including a dummy password.
The password is hashed with argon2 (using the default hash type), which means that even before I need to wait on a DB promise to resolve (with a Prisma transaction), I have to wait for each member's password to be generated.
I'm using Prisma for the ORM and Sendgrid for the email service and no other external packages.
A take-away that I get from the article is that this is blocking the event loop. Since there could potentially be hundreds of records generated (such as importing contacts from a CSV or cloud contact service), this seems significant.
To sum up what the route in question does, including some details omitted before:
Remove duplicates (requires one DB request & then some synchronous checking)
Check remaining for existing user
For non-existing users:
Synchronously create many records & push each to a separate array. One of these records requires async password generation for each non-existing user
Once the arrays are populated, send a DB transaction with all records
Once the transaction is cleared, create invitation records for each member
Once the invitation records are created, send emails in a MailData[] through SendGrid.
Clearly, there are quite a few tasks that must be done sequentially. If it matters, the asynchronous functions are also nested: createUsers calls createInvites calls sendEmails. In fact, from the controller, there is: updateRoster calls createUsers calls createInvites calls sendEmails.
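For reference, a simplified sketch of the password-generation step (helper names are made up, using the argon2 package):

const argon2 = require('argon2');
const crypto = require('crypto');

async function generateDummyPasswords(members) {
  // argon2.hash() is async: each hash runs on a libuv worker thread, so the
  // event loop itself stays free, though many concurrent hashes can still
  // saturate the worker pool -- the concern raised in the article
  return Promise.all(members.map(async function (member) {
    const dummy = crypto.randomBytes(16).toString('hex');
    return { ...member, passwordHash: await argon2.hash(dummy) };
  }));
}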
There are architectural patterns that are aimed at avoiding issues brought by potentially long-running operations. Note that while your example is specific, any long-running process could be harmful here.
The first obvious pattern is the cluster. If your app is handled by multiple concurrent, independent event loops of a cluster, blocking one, ten, or even a thousand loops could be insignificant if your app is scaled to handle it.
Imagine an example scenario where you have 10 concurrent loops, one is blocked for a longer time but 9 remaining are still serving short requests. Chances are, users would not even notice the temporary bottleneck caused by the one long running request.
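A minimal sketch of that setup with Node's built-in cluster module (the port and route are made up):

const cluster = require('cluster');
const os = require('os');
const express = require('express');

if (cluster.isPrimary) { // cluster.isMaster on older Node versions
  // fork one worker (one independent event loop) per CPU core
  os.cpus().forEach(function () { cluster.fork(); });
  cluster.on('exit', function () { cluster.fork(); }); // replace dead workers
} else {
  const app = express();
  app.post('/roster', function (req, res) {
    // a long-running request only blocks this worker's loop;
    // the sibling workers keep serving short requests
    res.end('ok');
  });
  app.listen(3000); // the workers share the port
}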
Another more general pattern is a separated long-running process service or the Command-Query Responsibility Segregation (I'm bringing the CQRS into attention here as the pattern description could introduce more interesting ideas you could be not familiar with).
In this approach, long-running operations are not handled directly by the backend servers. Instead, the backend servers use a message queue to send requests to yet another service layer of your app, a layer solely dedicated to running specific long-running requests. The message queue is configured with a specific throughput, so if multiple long-running requests arrive in a short time, they are queued: some of them may be delayed, but your resources are always under control. The backend that sends requests to the message queue doesn't wait synchronously; instead, you need another form of return communication.
This auxiliary process service can be maintained and scaled independently. The important part here is that the service is never accessed directly from the frontend, it's always behind a message queue with controlled throughput.
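A rough sketch of that split, assuming a Redis-backed queue library such as BullMQ (queue names, concurrency, and helpers are illustrative, not prescriptive):

// api.js -- the backend only enqueues and answers immediately
const express = require('express');
const { Queue } = require('bullmq');

const app = express();
app.use(express.json());
const importQueue = new Queue('roster-import', { connection: { host: 'localhost', port: 6379 } });

app.post('/roster/import', async function (req, res) {
  const job = await importQueue.add('import', { members: req.body.members });
  res.status(202).json({ jobId: job.id }); // client checks back later
});
app.listen(3000);

// worker.js -- a separate process dedicated to the long-running work
const { Worker } = require('bullmq');

new Worker('roster-import', async function (job) {
  await createUsers(job.data.members); // hypothetical long-running operation
}, { connection: { host: 'localhost', port: 6379 }, concurrency: 2 }); // bounded throughput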
Note that while the second approach is often implemented in real-life systems and solves most issues, it can still be incapable of handling some edge cases, e.g. when long-running requests come in faster than they are handled and the queue grows infinitely.
Such cases require careful maintenance and you either scale your app to handle the traffic or you introduce other rules that prevent users from running long processes too often.
I have created a single endpoint in Node.js.
Following is the endpoint:
app.post('/processMyRequests', function(req, res) {
  switch (req.body.functionality) {
    case "functionalityName1":
      jsFileName1.functionA(req, res);
      break;
    case "functionalityName2":
      jsFileName2.functionB(req, res);
      break;
    default:
      res.send("Sorry for that");
      break;
  }
});
In each of these functions, calls to APIs are made, then the data is processed, and finally a response is sent back.
My questions:
Since Node.js handles requests asynchronously by default, can we have a single route for all the responses?
Will concurrency be an issue, i.e., when parallel hits come in on this single route, will Node.js stall or slow down?
If the answer to question (2) is YES, how will having separate routes change that? If the same number of requests comes into a specific route, it is going to be the same issue, right?
Would be happy if someone could share real-world use cases. Thanks
You technically can have a single route for all the responses, but it's considered better practice to create endpoints that are compact, clear in their intended function/purpose, and not too complex. In your example, there could be many possible branches of code that the route could take, each requiring unique logic, which adds to the complexity of your endpoint and takes away from the clarity of the code. Imagine that when an error occurs, you now have to debug potentially multiple different files and different branches of your endpoint, when you could have created a separate endpoint for each unique "branch".
As your application grows in both size and complexity, you are going to want an easy way to manage your API. Putting lots of stuff into one endpoint is going to be a nightmare to maintain and debug.
It may be useful for you to look at some tutorials/docs about how to design and implement an API, here is a good article from Scotch.io
Example for question one:
// GET multiple records
app.get('/functionality1', function(req, res) {
  // Unique logic for functionality
});

// GET a record by an 'id' field
app.get('/functionality1/:id', function(req, res) {
  // Unique logic for functionality
});

// POST a new record
app.post('/functionality1', function(req, res) {
  // Unique logic for functionality
});

// PUT (update) a record
app.put('/functionality1', function(req, res) {
  // Unique logic for functionality
});

// DELETE a record
app.delete('/functionality1', function(req, res) {
  // Unique logic for functionality
});

app.get('/functionality2', function(req, res) {
  // Unique logic for functionality
});

...
This gives you a much clearer idea of what is happening for each endpoint, versus having to digest a lot of technically unrelated logic in a single API endpoint. Summing it up, it's better to have endpoints which are clear and concise in their purpose, and scope.
It really depends on how the logic is implemented; obviously Node.js is single-threaded, which means it can only process one "stream" of code at a time (no true concurrency or parallelism). However, Node gets around this through its event loop. The problems you see depend on whether you wrote asynchronous (non-blocking) code or synchronous (blocking) code. In Node it's almost always better, and recommended, to write non-blocking code. This helps prevent blocking the event loop, meaning your Node app can do other things while, for example, waiting for a file to finish being read, an API call to finish, or a promise to resolve. Writing blocking code will result in your application bottlenecking or "hanging", which your end-users perceive as higher latency.
Having multiple routes, or a single route isn't going to resolve this problem. It's more about how you are utilizing (or not utilizing) the event loop. It's extremely important to use asynchronous code as much as possible.
One thing that you can do if you absolutely must use synchronous code (this is actually a good approach to leverage regardless of code synchronicity) is to implement a microservice architecture, where a separate service can process your blocking (or resource-intensive) code off of your API Node service. This frees up your API service to handle requests as rapidly as possible, and leaves the heavy lifting to other services.
Another possibility is to leverage clustering. This gives you the ability to run node as if it were multi-threaded, by spawning "worker" processes, which are identical to your master process, with the difference in that they are able to process work individually. This type of approach is extremely useful if you expect that you will have a very busy API service.
Some extremely helpful resources:
Node.js Express Best Practices
A GREAT video explaining the event-loop
Parallelism vs. Concurrency in Node.js
Node.js Clustering
API Design
I know that blocking code is discouraged in node.js because it is single-threaded. My question is asking whether or not blocking code is acceptable in certain circumstances.
For example, if I was running an Express webserver that requires a MongoDB connection, would it be acceptable to block the event loop until the database connection was established? This is assuming that all pages served by Express require a database query (which would fail if MongoDB was not initialized).
Another example would be an application that requires the contents of a configuration file before initializing. Is there any benefit in using fs.readFile over fs.readFileSync in this case?
Is there a way to work around this? Is wrapping all the code in a callback or promise the best way to go? How would that be different from using blocking code in the above examples?
It is really up to you to decide what is acceptable. And you would do that by determining what the consequences of blocking would be ... on a case-by-case basis. That analysis would take into account:
how often it occurs,
how long the event loop is likely to be blocked, and
the impact that blocking in that context will have on usability [1].
Obviously, there are ways to avoid blocking, but these tend to add complexity to your application. Really, you need to decide ... on a case-by-case basis ... whether that added complexity is warranted.
Bottom line: >>you<< need to decide what is acceptable based on your understanding of your application and your users.
[1] For example, in a game it would be more acceptable to block the UI while switching "levels" than during active play. Or for a general web service, "once off" blocking while a config file is loaded or a DB connection is established during webserver startup is more acceptable than if it happened on every request.
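To make the config-file case from the footnote concrete, a sketch (file and field names are made up):

const fs = require('fs');
const express = require('express');

// one-off blocking read at startup: nothing is being served yet, so nobody waits
const config = JSON.parse(fs.readFileSync('./config.json', 'utf8'));

const app = express();
app.get('/data', function (req, res) {
  // inside a handler the async form is the right choice: a sync read here
  // would stall every concurrent request for the duration of the I/O
  fs.readFile(config.dataFile, 'utf8', function (err, text) {
    if (err) return res.status(500).end();
    res.send(text);
  });
});
app.listen(config.port);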
From my experience, most tasks should be handled in a callback or by returning a promise. You DO NOT want to block code in a Node application. That's what makes it so nice! With MongoDB, it will mostly crash before it has a chance to connect if there is no connection. It won't really have an effect on an API call, because your server will be dead!
Source: I'm a developer at a bootcamp that teaches the MEAN stack.
Your two examples are completely different. The distinction actually answers the question in and of itself.
Grabbing data from a database is dependent on being connected to that database. Any code that is dependent upon that data is then dependent upon that connection. These things have to happen serially for the app to function and be meaningful.
On the other hand, readFileSync will block ALL code, not just code that is reliant on it. You could start reading a CSV file while simultaneously establishing a database connection; once both are done, you could add that CSV data to the database.
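A sketch of that overlap (connectToDatabase, parseCsv, and insertMany are hypothetical helpers):

const fs = require('fs/promises');

async function startup() {
  // kick off both at once; neither blocks the event loop while in flight
  const [csvText, db] = await Promise.all([
    fs.readFile('./seed.csv', 'utf8'),
    connectToDatabase(), // hypothetical connection helper
  ]);
  await db.insertMany(parseCsv(csvText)); // hypothetical parse + insert
}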
I've been trying to implement a simple long polling service for use in my own projects, and maybe release it as a SaaS if I succeed. These are the two approaches I've tried so far, both using Node.js (polling PostgreSQL on the back end).
1. Periodically check all the clients in the same interval
Every new connection is pushed onto a queue of connections, which is walked through on an interval.
var queue = [];

function acceptConnection(req, res) {
  res.setTimeout(5000);
  queue.push({ req: req, res: res });
}

function checkAll() {
  queue.forEach(function(client) {
    // respond if there is something new for the client
  });
}

// this could be replaced with a timeout after all the clients are served
setInterval(checkAll, 500);
2. Check each client at a separate interval
Every client gets his own ticker which checks for new data
function acceptConnection(req, res) {
  // something which periodically checks data for the client
  // and responds if there is anything new
  new Ticker(req, res);
}
While this keeps the minimum latency for each client lower, it also introduces overhead by setting a lot of timeouts.
Conclusion
Both of these approaches solve the problem quite easily, but I don't feel that this will scale up easily to something like 10 million open connections, especially since I'm polling the database on every check for every client.
I thought about doing this without the database and just immediately broadcast new messages to all open connections, but that will fail if a client's connection dies for a few seconds while the broadcast is happening, because it is not persistent. Which means I basically need to be able to look up messages in history when the client polls for the first time.
I guess one step up here would be to have a data source where I can subscribe to new data coming in (CouchDB change notifications?), but maybe I'm missing something in the big picture here?
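(For what it's worth, PostgreSQL itself has a built-in subscription mechanism along those lines, LISTEN/NOTIFY. A minimal sketch with the pg client; the channel name is made up:)

const { Client } = require('pg');

async function subscribe(onMessage) {
  const listener = new Client(); // connection settings come from the environment
  await listener.connect();
  await listener.query('LISTEN new_messages');
  listener.on('notification', function (msg) {
    onMessage(msg.payload); // fan out to the waiting long-poll clients
  });
}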
What is the usual approach for doing highly scalable long polling? I'm not specifically bound to Node.js, I'd actually prefer any other suggestion with a reasoning why.
Not sure if this answers your question, but I like the approach of PushPin (+ explanation of concepts).
I love the idea (using a reverse proxy and communicating with return codes + delayed REST return requests), but I do have reservations about the implementation. I might be underestimating the problem, but it seems to me that the technologies used are a bit of an overkill. Not sure if I will use it or not yet; I would prefer a more lightweight solution, but I find the concept phenomenal.
Would love to hear what you used eventually.
Since you mentioned scalability, I have to get a little bit theoretical, as the only practical measure is load testing. Therefore, all I can offer is advice.
Generally speaking, once-per anything is bad for scalability. Especially once-per-connection or once-per-request since that makes part of your app proportional to the amount of traffic. Node.js removed the thread-per-connection dependency with its single-threaded asynchronous I/O model. Of course, you can't completely eliminate having something per-connection, like a request and response object and a socket.
I suggest avoiding anything that opens a database connection for every HTTP connection. This is what connection pools are for.
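For example, with node-postgres the pool is created once per process and shared by every request (a sketch; the table and query are made up):

const { Pool } = require('pg');

// one shared pool for the whole process, not one connection per HTTP client
const pool = new Pool({ max: 10 }); // connection settings come from the environment

async function checkForNew(lastSeenId) {
  const { rows } = await pool.query(
    'SELECT id, body FROM messages WHERE id > $1 ORDER BY id',
    [lastSeenId]
  );
  return rows;
}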
As for choosing between your two options above, I would personally go for the second choice because it keeps each connection isolated. The first option uses a loop over connections, which means actual execution time per connection. It's probably not a big deal given that I/O is asynchronous, but given a choice between an iteration-per-connection and the mere existence of an object-per-connection, I would prefer to just have an object. Then I have less to worry about when suddenly there are 10,000 connections.
The C10K problem seems like a good reference for this, though this is really personal judgement to be honest.
http://www.kegel.com/c10k.html
http://en.wikipedia.org/wiki/C10k_problem
I am studying Distributed Systems, and when it comes to the RPC part, I have heard about these two semantics (at-most-once and exactly-once). I understand that at-most-once is used with databases, for instance, when we don't want duplicate execution.
First question:
How is this achieved? How does the server know that it shouldn't execute the request again? It might be a duplicate, but it might be a legitimate request as well.
The second question is:
What is the difference between the two semantics in the title? I can read :). I know that at-most-once might not be executed at all, but what does exactly-once do that guarantees the execution?
Here is a pretty good explanation of the different types of messaging semantics for your second question:
At-most-once semantics: The easiest type of semantics to achieve, from an engineering complexity perspective, since it can be done in a fire-and-forget way. There's rarely any need for the components of the system to be stateful. While it's the easiest to achieve, at-most-once is also the least desirable type of messaging semantics. It provides no absolute message delivery guarantees since each message is delivered once (best case scenario) or not at all.
At-least-once semantics: This is an improvement on at-most-once semantics. There might be multiple attempts at delivering a message, so at least one attempt is successful. In other words, there's a chance messages may be duplicated, but they can't be lost. While not ideal as a system-wide characteristic, at-least-once semantics are good enough for use cases where duplication of data is of little concern or scenarios where deduplication is possible on the consumer side.
Exactly-once semantics: The ultimate message delivery guarantee and the optimal choice in terms of data integrity. As its name suggests, exactly-once semantics means that each message is delivered precisely once. The message can neither be lost nor delivered twice (or more times). Exactly-once is by far the most dependable message delivery guarantee. It’s also the hardest to achieve.
That's all part of this blog post about Exactly-once message processing (Disclosure: I work for Ably)
Hope this helps 😄
With at-most-once semantics, the request is sent again in case of failure, but requests are filtered on the server for duplicates.
With exactly-once semantics, the request is sent again and filtered for duplicates, and there is additionally a guarantee that the server will restart after a failure and resume processing requests from where it crashed.
But exactly-once is not realizable, because of what happens when the client sends a request and the server crashes before the request reaches it: there is no way of tracking the request.
http://de.wikipedia.org/wiki/Remote_Procedure_Call#Fehlersemantik
To correct Hesper's answer-
Earlier, exactly-once RPC was not realisable, but a research paper in 2015 [1] proved that it is possible. Basically, the RIFL paradigm guarantees exactly-once execution of an RPC by durably storing the result of each RPC that has executed.
[1]: Lee, Collin, et al. "Implementing linearizability at large scale and low latency." Proceedings of the 25th Symposium on Operating Systems Principles. ACM, 2015
Bump, I'm studying this too and found this, hope it helps (helped me),
At-least-once versus at-most-once?
let's take an example: acquiring a lock
if client and server stay up, client receives lock
if client fails, it may have the lock or not (server needs a plan!)
if server fails, client may have lock or not
at-least-once: client keeps trying
at-most-once: client will receive an exception
what does a client do in the case of an exception?
need to implement some application-specific protocol
ask server, do i have the lock?
server needs to have a plan for remembering state across reboots
e.g., store locks on disk.
at-least-once (if we never give up)
clients keep trying. server may run procedure several times
server must use application state to handle duplicates
if requests are not idempotent
but difficult to make all requests idempotent
e.g., server could store on disk who has the lock and the req id
check table for each request
even if server fails and reboots, we get correct semantics
What is right?
depends where RPC is used.
simple applications:
at-most-once is cool (more like procedure calls)
more sophisticated applications:
need an application-level plan in both cases
not clear at-most-once gives you a leg up
=> Handling machine failures makes RPC different than procedure calls
quoted from Distributed Systems: Principles and Paradigms, 2nd edition
For the first question, I believe that each request should have a unique id attached to it. Therefore, even if the client sends two requests with the exact same command, the server is able to filter and distinguish them via the unique id of the request.
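Something like this, sketched in Node.js (the command format, sendRequest transport, and execute helper are made up):

const crypto = require('crypto');

// client side: one id per logical request, reused across retries
const requestId = crypto.randomUUID();
sendRequest({ id: requestId, command: 'acquireLock' }); // hypothetical transport

// server side: remember ids already executed and replay their stored results
const executed = new Map(); // requestId -> stored result

function handle(request) {
  if (executed.has(request.id)) {
    return executed.get(request.id); // duplicate: same result, no re-execution
  }
  const result = execute(request.command); // hypothetical executor
  executed.set(request.id, result);
  return result;
}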
For the second question, I think this article helps define the semantics for an RPC call: http://www.cs.unc.edu/~dewan/242/f97/notes/ipc/node27.html