Migrating to microservices, how can one refactor the front end so that it can be notified of failures?

In a monolithic architecture, the front end usually makes one REST call to the backend, the backend does everything in one shot, and it returns a status code (2xx, 4xx, 5xx). The front end uses this status code to alter the information displayed to the user.
In a microservices architecture, the front end will still make one call to the backend, but this time the backend, which is broken into microservices, may return a 2xx from its front-end-facing microservice (let's call this the SUCCESS-SERVICE) while some other service fails (FAILED-SERVICE), which would need to result in a rollback.
Assume that the microservices on the backend are listening for events and that the SUCCESS-SERVICE eventually rolls back its transaction (deletes the record).
How should one design the front end to capture the failure after it has already received a success from the first service?
One pattern I can think of is, right after getting a 2xx from the service, to poll for the status of the newly created resource (GET /resource/:id) and look for a defined set of status messages that indicate whether the entire workflow succeeded or failed. Given that the backend service would have rolled back the transaction, the GET call will eventually return a 4xx because the id would no longer be valid.
Is there another or better way to design the front end?

I believe this approach would work, because there are not many options for surfacing eventual consistency at the front-end level.
One idea that complements this approach would be to create a dedicated microservice that stores the state of the operations, so that you have a single point of polling in your backend (otherwise, depending on the workflow, you may need to poll different microservices). This brings more consistency into the architecture at the cost of introducing a single point of failure.
It may also be useful to check out the way transactions can be handled across microservices (for example with the Saga pattern, which comes up again further down).
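For illustration, here is a minimal sketch of the polling approach in plain browser JavaScript. The endpoint shape, the status values (PENDING/COMPLETED/FAILED) and the timings are assumptions for the example, not something prescribed by any particular backend:

    // Poll the newly created resource until the backend workflow settles one way or the other.
    async function waitForWorkflow(resourceId, { intervalMs = 2000, maxAttempts = 30 } = {}) {
      for (let attempt = 0; attempt < maxAttempts; attempt++) {
        const res = await fetch(`/resource/${resourceId}`);
        if (res.status >= 400 && res.status < 500) {
          return { outcome: 'rolled-back' };           // record deleted after the downstream failure
        }
        if (res.ok) {
          const body = await res.json();
          if (body.status === 'COMPLETED') return { outcome: 'succeeded', resource: body };
          if (body.status === 'FAILED') return { outcome: 'failed', resource: body };
          // still PENDING: fall through and poll again
        }
        await new Promise((resolve) => setTimeout(resolve, intervalMs));
      }
      return { outcome: 'timed-out' };
    }

    // After the initial POST returned 2xx with the new id:
    //   const { outcome } = await waitForWorkflow(createdId);
    //   if (outcome !== 'succeeded') showErrorToUser();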

Related

What's a good design for making sure the Node.js Event Loop isn't blocked when adding potentially hundreds of records?

I just read this article from Node.js: Don't Block the Event Loop
The Ask
I'm hoping that someone can read over the use case I describe below and tell me whether I'm understanding how the event loop gets blocked, and whether or not my code is actually blocking it. Also, any tips on how I can find this information out for myself would be useful.
My use case
I think I have a use case in my application that could potentially cause problems. I have a feature that enables a group to add members to its roster. Each member that doesn't represent an existing system user (the common case) gets an account created, including a dummy password.
The password is hashed with argon2 (using the default hash type), which means that even before I need to wait on a DB promise to resolve (a Prisma transaction), I have to wait for each member's password hash to be generated.
I'm using Prisma for the ORM and Sendgrid for the email service and no other external packages.
A takeaway I get from the article is that this blocks the event loop. Since there could potentially be hundreds of records generated (such as when importing contacts from a CSV or a cloud contact service), this seems significant.
To sum up what the route in question does, including some details omitted before:
Remove duplicates (requires one DB request & then some synchronous checking)
Check remaining for existing user
For non-existing users:
Synchronously create many records & push each to a separate array. One of these records requires async password generation for each non-existing user
Once the arrays are populated, send a DB transaction with all records
Once the transaction is cleared, create invitation records for each member
Once the invitation records are created, send emails in a MailData[] through SendGrid.
Clearly, there are quite a few tasks that must be done sequentially. If it matters, the asynchronous functions are also nested: createUsers calls createInvites, which calls sendEmails. In fact, from the controller, the full chain is updateRoster calls createUsers calls createInvites calls sendEmails.
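For illustration, the chain roughly has this shape (function names as above, everything else a placeholder):

    // Rough shape of the chain described above (placeholder bodies, names from the question).
    async function updateRoster(groupId, incoming) {
      const members = await removeDuplicates(incoming);        // one DB read + synchronous checks
      const newMembers = await filterOutExistingUsers(members);
      // Each new member needs an argon2 hash before the Prisma transaction can even be sent.
      const users = await createUsers(newMembers);             // argon2 hashing + prisma.$transaction(...)
      const invites = await createInvites(users);              // invitation records
      await sendEmails(invites);                               // MailData[] via SendGrid
    }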
There are architectural patterns aimed at avoiding the issues brought on by potentially long-running operations. Note that while your example is specific, any long-running process could be harmful here.
The first obvious pattern is clustering. If your app is handled by multiple concurrent, independent event loops in a cluster, blocking one, ten, or even a thousand loops can be insignificant if your app is scaled to handle it.
Imagine a scenario where you have 10 concurrent loops: one is blocked for a longer time, but the remaining 9 are still serving short requests. Chances are users would not even notice the temporary bottleneck caused by the one long-running request.
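A minimal sketch of that idea with Node's built-in cluster module (the worker count and the server body are placeholders):

    // Fork one worker per CPU, so a CPU-bound request only ties up one of the event loops.
    const cluster = require('node:cluster');
    const http = require('node:http');
    const os = require('node:os');

    if (cluster.isPrimary) {
      for (let i = 0; i < os.cpus().length; i++) cluster.fork();
      cluster.on('exit', () => cluster.fork());     // replace a crashed worker
    } else {
      http.createServer((req, res) => {
        // ...normal routing, including the occasional long-running roster import...
        res.end(`handled by worker ${process.pid}\n`);
      }).listen(3000);
    }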
Another, more general pattern is a separate long-running-process service, or Command-Query Responsibility Segregation (I bring CQRS up here because the pattern description may introduce interesting ideas you are not yet familiar with).
In this approach, long-running operations are not handled directly by the backend servers. Instead, the backend servers use a message queue to send requests to yet another service layer of your app, a layer dedicated solely to running those long-running requests. The message queue is configured with a specific throughput, so that if multiple long-running requests arrive in a short time they are queued; some of them may be delayed, but your resources stay under control. The backend that sends requests to the message queue doesn't wait synchronously; you need another form of return communication instead.
This auxiliary processing service can be maintained and scaled independently. The important part is that the service is never accessed directly from the frontend; it is always behind a message queue with controlled throughput.
Note that while this second approach is often implemented in real-life systems and solves most issues, it can still be incapable of handling some edge cases, e.g. when long-running requests come in faster than they are handled and the queue grows indefinitely.
Such cases require careful maintenance: you either scale your app to handle the traffic, or you introduce other rules that prevent users from running long processes too often.
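A sketch of the queue-based variant, using BullMQ over Redis and an Express-style handler purely as examples (queue name, job payload and concurrency are assumptions; any message queue with controlled throughput works the same way):

    // Web process: enqueue the work instead of doing it inside the request handler.
    const { Queue } = require('bullmq');
    const rosterQueue = new Queue('roster-import', { connection: { host: 'localhost', port: 6379 } });

    async function handleImportRequest(req, res) {
      const job = await rosterQueue.add('import', { groupId: req.body.groupId, members: req.body.members });
      res.status(202).json({ jobId: job.id });       // the client polls or is notified later
    }

    // Worker process: dedicated to long-running jobs, scaled and restarted independently.
    const { Worker } = require('bullmq');
    new Worker('roster-import', async (job) => {
      await importRoster(job.data);                  // placeholder: hashing, Prisma transaction, emails
    }, { connection: { host: 'localhost', port: 6379 }, concurrency: 2 });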

Async Flows Design in Lagom or Microservices

How to design async flows in Lagom?
Problem faced: In our product we have a Lead aggregate which has a User Id (representing the owner of the lead). A User has a limitation saying that one user can have a maximum of 10 Leads associated with them. We designed this by creating a separate service, ResourceManagement. When a user asks to pick a Lead, we send a command to the LeadAggregate, which generates a LeadPickRequested event. A process manager listens for the event and asks ResourceManagement for the resource; on success it sends a MarkAsPicked command to the LeadAggregate and then sends a push notification to the user that the Lead is picked. But from the perspective of building a UI this is very difficult, and the same cannot be done when exposing our API to third parties.
One solution we have implemented: when a request is received by the service, we save a RequestId-to-Future mapping. We add the request id to the command, and when the LeadAggregate finally transitions into the Picked state or a pick-failure state, a process manager listens for the event, checks whether a future exists for that request id, and completes the future with the correct response. This way it works as a synchronous API for the end user (a rough sketch follows below).
Is there any better solution for this?
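For reference, the request-id-to-future correlation described above can be sketched like this in Node.js-style pseudocode (a Lagom implementation would use a Promise/CompletableFuture the same way; all names here are illustrative):

    const crypto = require('node:crypto');

    // Pending requests, keyed by requestId; completed when the terminal event arrives.
    const pending = new Map();

    // HTTP layer: send the command and hand back a promise that resolves on the final event.
    function pickLead(leadId, userId) {
      const requestId = crypto.randomUUID();
      const promise = new Promise((resolve, reject) => {
        pending.set(requestId, { resolve, reject });
        setTimeout(() => {                               // don't keep callers waiting forever
          if (pending.delete(requestId)) reject(new Error('timed out'));
        }, 10000);
      });
      sendCommand({ type: 'PickLead', leadId, userId, requestId });   // placeholder command bus
      return promise;                                    // looks synchronous to the API consumer
    }

    // Process manager / event listener: complete the future on the terminal event.
    function onLeadEvent(event) {                        // LeadPicked or LeadPickFailed
      const waiter = pending.get(event.requestId);
      if (!waiter) return;                               // already timed out or handled elsewhere
      pending.delete(event.requestId);
      event.type === 'LeadPicked' ? waiter.resolve(event) : waiter.reject(new Error('pick failed'));
    }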
If you want to provide a synchronous API, I only see two options:
Design your domain model so that Lead creation logic, the "10 leads max" rule and the list of leads for a user are co-located in the same Aggregate root (hint: an AR can spawn another AR).
Accept involving more than one non-new Aggregate in the same transaction.
The tradeoff depends on a transactional analysis of the aggregates in question: will reading from them in the same transaction lead to a lot of locking and race conditions?
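To make option 1 concrete, a rough sketch of an aggregate root that co-locates the user's lead list and the limit (the 10-lead rule comes from the question; everything else is illustrative):

    // UserLeads aggregate root: the invariant and the change live together, so the check is atomic.
    class UserLeads {
      constructor(userId) {
        this.userId = userId;
        this.leadIds = [];
      }

      pickLead(leadId) {
        if (this.leadIds.includes(leadId)) return null;            // idempotent re-pick
        if (this.leadIds.length >= 10) {
          throw new Error('User already owns the maximum of 10 leads');
        }
        this.leadIds.push(leadId);
        // The AR can "spawn" the Lead AR by emitting an event for it:
        return { type: 'LeadPicked', userId: this.userId, leadId };
      }
    }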

Reduce Replication Calls PouchDB/Cloudant

I have a fully functional process for syncing PouchDB and Bluemix/Cloudant for a current side/hobby project. It's a project planning app, so users can be constantly making changes to their travel plans.
I have continuous/live replication turned on. As you can imagine it hits Cloudant with a ton of API calls.
Any thoughts on how to reduce the API calls without taking functionality away from the app?
Thanks!
If your application's data is only generated on the client side and then pushed to the server, be sure to use PouchDB's db.replicate.to(remoteDB) call to start replication. If you use sync instead, your client will also monitor the server side's changes feed, eating up API calls as it does so.
With continuous replication, each document change (add/update/delete) is written to the server side as it happens. If using fewer API calls is your priority, you could opt for "one shot" replication (i.e. not continuous). This would bundle many changes into a single bulk write operation on the client side, using fewer API calls to transmit the information. The challenge would be deciding when to trigger replication in your app: on pressing a 'sync' button, on application startup, on shutdown, every hour?
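A sketch of both suggestions using PouchDB's replication API (the database names, Cloudant URL and trigger points are placeholders for your app):

    const PouchDB = require('pouchdb');

    const localDB = new PouchDB('plans');
    const remoteDB = new PouchDB('https://<account>.cloudant.com/plans');   // placeholder URL

    // One-shot, push-only replication: many local changes go up as bulk writes.
    async function syncNow() {
      const result = await localDB.replicate.to(remoteDB);
      console.log(`pushed ${result.docs_written} docs`);
    }

    // Trigger it at coarse-grained moments instead of running live replication, e.g.:
    //   - when the user presses a "sync" button
    //   - on application startup or before shutdown
    //   - on a timer
    setInterval(syncNow, 60 * 60 * 1000);   // every hour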

What happens if my Node.js server crashes while waiting for web services callback?

I'm just starting to look into Node.js to create a web application that asynchronously calls multiple web services to complete a single client request. I think in SOA speak this is known as a composite service/transaction.
My Node.js application will be responsible for completing any compensating actions should any web service calls fail within the composite service. For example, if service A and B return 'success', but service C returns 'fail', Node.js may need to apply a compensating action (undo effectively) on service A and B.
My question is, what if my Node.js server crashes? I could be in the middle of a composite transaction. Multiple calls to web services have been made, and I am waiting for the callbacks. If my node server crashes, responses meant for the callbacks will go unheard. It could then be possible that one of the web services was not successful, and that some compensating actions on other services would be needed.
I'm not sure how I would be able to address this once my Node server is back online. This could potentially put the system in an inconsistent state if services A and B succeeded, but C didn't.
Distributed transactions are bad for SOA - they introduce dependency, rigidity, security and performance problems. You can implement a Saga instead, which means that each of your services will need to be aware of the ongoing operation and take compensating actions if they find out there was a problem. You'd want to save state for each of the services so that, on recovery, they know how to get back to a consistent internal state.
If you find you must have distributed transactions, then you should probably rethink the boundaries between your services.
(updates from the comments)
Even if you use a Saga, you may find that you want some coordinator to control the compensation - but if your services are autonomous they won't need that central coordinator; they'd perform the compensating action themselves, for example if they use the reservation pattern (infoq.com/news/2009/09/reservations), where they can perform compensation when the reservation expires. Otherwise, you can persist the coordinator's state somewhere (Redis/DB/ZooKeeper etc.) and then check it on recovery of the coordinator.
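A hedged sketch of the "persist the state somewhere" variant, with a hypothetical durable store (Redis, a DB table, ZooKeeper, ...) behind saveStep/markComplete/loadIncomplete:

    // Record each step's intent durably BEFORE acting on it, so a crash can be recovered from.
    async function runCompositeTransaction(txId, store) {
      await store.saveStep(txId, 'A', 'pending');
      await callServiceA();                              // placeholder web-service calls
      await store.saveStep(txId, 'A', 'done');

      await store.saveStep(txId, 'B', 'pending');
      await callServiceB();
      await store.saveStep(txId, 'B', 'done');

      await store.saveStep(txId, 'C', 'pending');
      try {
        await callServiceC();
        await store.saveStep(txId, 'C', 'done');
      } catch (err) {
        await compensate(txId, store);                   // undo A and B
        throw err;
      }
      await store.markComplete(txId);
    }

    // On startup, compensate (or resume) anything that was in flight when the process died.
    async function recoverAfterCrash(store) {
      for (const tx of await store.loadIncomplete()) {
        await compensate(tx.id, store);                  // or query each service and resume instead
      }
    }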

Handling failures in Thrift in general

I read through the official documentation and the official whitepaper, but I couldn't find a satisfying answer to how Thrift handles failures in the following scenario:
Say you have a client sending a method call to a server to insert an entry in some data structure residing in that server (it doesn't really matter what it is). Suppose the server has processed the call and inserted the entry but the client couldn't receive a response due to a network failure. In such a case, how should the client handle this? A simple retry of sending the call would possibly result in a duplicate entry being inserted. Does the Thrift library persist the response somewhere so that it can resend to the client when it is back online? Or is it the application's responsibility to do so?
Would appreciate it if someone could point out the details of how it works, besides directing to its source code.
The question is an interesting one, but it is by no means limited to Thrift. A better name would be
Handling failures in asynchronous or remote calls in general
because that's, in essence, what it is. Although in the specific case of an RPC-style API like, for example, a Thrift service, the client blocks and it looks like a synchronous call, it really isn't.
The whole problem can be rephrased as the more general question of
Designing robust distributed systems
So what is the main problem we have to deal with? We have to assume that every call we make may fail. In particular, it can fail in three ways:
request died
request sent, server processing successful, response died
request sent, server processing failed, response died
In some cases, this is not a big deal, regardless of which exact case we hit. If the client just wants to retrieve some values, it can simply re-query and will eventually get results if it tries often enough.
In other cases, especially when the client modifies data on the server, it becomes more problematic. The general recommendation in such cases is to make the service calls idempotent, meaning: regardless of how often I make the same call, the end result is always the same. This can be achieved by various means and more or less depends on the use case.
For example, one method is to send some logical "ticket" value along with each request to filter out duplicate or outdated requests on the server. The server keeps track of and/or checks these tickets before processing starts. But again, whether that method suits your needs depends on your use case.
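A small sketch of that ticket idea on the server side, in Node.js-style pseudocode (in practice the seen tickets would live in a durable store with a TTL, not an in-memory map):

    // Responses cached by the client-supplied ticket (idempotency key).
    const processed = new Map();   // in production: Redis or a DB table with expiry

    async function handleInsert(request) {
      const ticket = request.ticket;                      // the client generates it once and reuses it on retries
      if (processed.has(ticket)) {
        return processed.get(ticket);                     // duplicate retry: return the stored result, insert nothing
      }
      const result = await insertEntry(request.payload);  // the actual once-only work
      processed.set(ticket, result);
      return result;
    }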
The Command and Query Responsibility Segregation (CQRS) pattern is another approach to dealing with the complexity. It basically splits the API into commands (setters) and queries (getters). I'd recommend looking into that topic, but it is not useful for every scenario. I'd also recommend looking at the Data Consistency Primer article. Last but not least, the CAP theorem is always a good read.
Good service/API design is not simple, and the fact that we have to deal with a distributed, parallel system does not make it easier - quite the opposite.
Let me try to give a straight answer.
... is it the application's responsibility to do so?
Yes.
There are four types of exceptions involved in Thrift RPC: TTransportException, TProtocolException, TApplicationException, and user-defined exceptions.
According to the book Programmer's Guide to Apache Thrift, the former two are local exceptions, while the latter two are not.
As the names imply, TTransportException includes errors like NOT_OPEN and TIMED_OUT, and TProtocolException includes INVALID_DATA, BAD_VERSION, etc. These exceptions are not propagated from the server to the client and act much like normal language exceptions.
TApplicationExceptions involve problems such as calling a method that isn’t implemented or failing to provide the necessary arguments to a method.
User-defined Exceptions are defined in IDL files and raised by the user code.
For all of these exceptions, no retry operations are performed by the Thrift RPC framework itself. Instead, they should be handled appropriately by the application code.
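In other words, retries have to be written explicitly at the application level. A generic, hedged sketch of such a wrapper (this is not part of the Thrift library, and it is only safe if the call has been made idempotent, as discussed in the previous answer):

    // Retry a remote call a bounded number of times with a growing delay; rethrow if it keeps failing.
    async function callWithRetry(fn, { attempts = 3, delayMs = 500 } = {}) {
      let lastError;
      for (let i = 0; i < attempts; i++) {
        try {
          return await fn();                             // e.g. () => client.insertEntry(ticket, entry)
        } catch (err) {
          lastError = err;                               // transport error, timeout, ...
          await new Promise((resolve) => setTimeout(resolve, delayMs * (i + 1)));
        }
      }
      throw lastError;
    }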

Resources