Handle duplicates in batch POST requests to a REST API - node.js

The stack
Express.js API server for CRUD operations over data.
MongoDB database.
Mongoose as the interface to MongoDB, for schemas.
The problem
In order to handle duplicates in just one point, I want to do it in the only possible entry point: The API.
Definition: duplicate
A duplicate is an entity which already exists in the database, so the new POST request is either the same entity with exactly the same data, or the same entity with updated data.
The API design is meant to take advantage of the new HTTP/2 protocol.
Bulk importers have been written. These programs get the data from a given source, transform it into our specific format, and make POST requests to save it. These importers are designed to handle every entity in parallel.
The API already has a duplication handler which works great when a given entity already exists in the database. The problem comes when the bulk importers make several POST requests for the same entity at the same time and the entity doesn't exist in the database yet.
....POST/1 .databaseCheck.......DataBaseResult=false..........DatabaseWrite
......POST/2 .databaseCheck.......DataBaseResult=false..........DatabaseWrite
........POST/3 .databaseCheck.......DataBaseResult=false..........DatabaseWrite
.....................POST/N .databaseCheck.......DataBaseResult=false..........DatabaseWrite
This situation results in the same entity being created several times, because the database checks haven't finished when the rest of the POST requests arrive.
Only if the number of POST requests is large enough will the first write operation have already finished, so that the databaseCheck of the Nth request returns true.
What would be the correct solution for handling this?
If I'm not mistaken, what I'm looking for is called a transaction, and I don't know whether this is something the database should offer by default, or something that I have to implement myself.
Solutions I have already considered:
1. Limit the requests to one at a time.
This is the simplest solution, but if the API stays blocked while the bulk importers make several requests, the frontend client would get very slow, and it is meant to be fast and multiplayer. So this is, in fact, not a solution.
2. Special bulk API endpoint for each entity.
If an application needs to make bulk requests, then it makes just one huge POST request with all the data as the request body.
This solution doesn't block the API and can handle duplicates very well, but what I don't like is that it would go against the HTTP/2 protocol, where many small requests are desired.
And the problem persists: other future clients may run into it if they don't notice that a bulk endpoint is available. But maybe this is not a problem. A sketch of such an endpoint is shown below.
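For illustration only, a minimal sketch of such a bulk endpoint. The route name, the externalId identity field and the saveEntity() function are assumptions, not part of the real API:

const express = require('express');
const app = express();
app.use(express.json({ limit: '10mb' })); // bulk bodies can get large

app.post('/entities/bulk', async (req, res) => {
  const items = req.body; // expected: an array of entities
  // Keep one candidate per externalId (placeholder for the real identity rule).
  const unique = [...new Map(items.map((e) => [e.externalId, e])).values()];
  try {
    const saved = await Promise.all(unique.map((e) => saveEntity(e))); // existing save logic
    res.status(201).json(saved);
  } catch (err) {
    res.status(500).json({ error: err.message });
  }
});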
3. Try to use the possible MongoDB transaction implementation
I've read a little bit about this, but I don't know if it would be possible to handle this problem with the MongoDB and Mongoose tools. I've done some searching, but I haven't found anything, because before trying to insert many documents I need to generate the data for each document, and that data comes inside each POST request. A sketch of the kind of atomic write MongoDB does offer is shown below.
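To make this option more concrete, here is a minimal sketch of the atomic write MongoDB/Mongoose already provide, assuming an Entity model and an externalId field that identifies duplicates (both are placeholders). The idea is that a unique index plus an upsert replaces the separate databaseCheck + DatabaseWrite steps, so concurrent POSTs for the same entity end up as a single document:

const mongoose = require('mongoose');

const entitySchema = new mongoose.Schema({
  externalId: { type: String, required: true, unique: true }, // unique index backs the dedup
  payload: mongoose.Schema.Types.Mixed,
});

const Entity = mongoose.model('Entity', entitySchema);

// Check and write in one atomic operation instead of two separate steps.
async function saveOrUpdate(data) {
  return Entity.findOneAndUpdate(
    { externalId: data.externalId },     // match on the natural key
    { $set: { payload: data.payload } }, // same entity with updated data
    { upsert: true, new: true }          // create if missing, return the document
  );
}

// Note: with the unique index in place, a very tight race can still surface as a
// duplicate-key error (E11000), which can simply be retried.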
4. Drop MongoDB and use a transaction friendly database.
This would have a big cost at this point, because the whole stack is already finished and we are close to launch. We aren't afraid of refactoring, but I think the considerations from the 3rd solution would apply here too.
5. Own transactions implementation at the API level?
I've designed a solution that may work for every case, and that I call the pool stream.
This is the design:
When a POST request arrives, a timer of a fixed number of milliseconds starts. That amount of time should be big enough to catch several requests, and small enough not to cause a noticeable delay.
Inside each chunk of requests, the data is processed trying to merge duplicates before writing to the database. So if n requests have been caught inside a chunk, n - m unique candidates are generated (where m <= n). A hash function is applied to each candidate in order to assign the hash result to each request-response pair. Then the write operation of the candidates to the database is done in parallel, and the current duplicates handler would work for this at write time.
When the writes for the current chunk finish, the response is sent to each request-response pair of the chunk, and then the next chunk is processed. While a chunk is in the queue waiting for the write operation, it could already be doing the unique-candidates processing, in order to accelerate the whole process.
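A rough sketch of how this pool stream could look in Express; saveEntity() and the externalId key are placeholders for the existing save logic and the real identity/hash rule:

const WINDOW_MS = 50;  // big enough to catch a burst, small enough not to be noticed
let pool = [];         // { data, res } pairs collected during the current window
let timer = null;

function enqueue(data, res) {
  pool.push({ data, res });
  if (!timer) {
    timer = setTimeout(flushPool, WINDOW_MS); // the first request of the chunk starts the timer
  }
}

async function flushPool() {
  const chunk = pool;
  pool = [];
  timer = null;

  // Merge duplicates inside the chunk: one candidate per key,
  // remembering every response that is waiting for it.
  const candidates = new Map();
  for (const { data, res } of chunk) {
    const key = data.externalId; // stands in for the hash function
    if (!candidates.has(key)) candidates.set(key, { data, waiting: [] });
    candidates.get(key).waiting.push(res);
  }

  // Write the unique candidates in parallel; the existing duplicates
  // handler still covers entities that are already in the database.
  await Promise.all(
    [...candidates.values()].map(async ({ data, waiting }) => {
      try {
        const saved = await saveEntity(data);
        waiting.forEach((res) => res.status(201).json(saved));
      } catch (err) {
        waiting.forEach((res) => res.status(500).json({ error: err.message }));
      }
    })
  );
}

// Usage: app.post('/entities', (req, res) => enqueue(req.body, res));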
What do you think?
Thank you.

Related

How to implement atomicity in node js transactions

I am working on an application in which the client (Android/React) clicks a button and five operations take place, let's say:
add a new field
update the old field
upload a photo
upload some text
delete some old fields.
Now sometimes, due to a network issue or some other issue, only some of the operations take place and the DB gets corrupted. So my question is: how can I make all these operations one transaction, i.e. atomic, so that either all of them complete or the completed operations are rolled back? And where should I do this: in the client (React/Android) or in the backend (Node.js) behind an API? I thought of making an API on the backend (since the chances of the backend going down are rare) and keeping track of the operations done (statelessly, for example using arrays). If the transaction gets stopped at any point, roll back all the operations already done. But I found this expensive, and it doesn't cover the risk of a server error. Can you suggest how I can implement/design this?
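If all five operations hit the same MongoDB database, multi-document transactions (MongoDB 4.0+, via Mongoose sessions) are the built-in answer. For mixed operations such as photo uploads, the "track what was done, undo on failure" idea from the question is usually implemented with compensating actions; a minimal sketch, where every step name and undo function is hypothetical:

async function runAtomically(steps) {
  const done = []; // steps that completed, newest last
  try {
    for (const step of steps) {
      await step.do();
      done.push(step);
    }
  } catch (err) {
    // Undo in reverse order; failures while undoing are logged, not re-thrown.
    for (const step of done.reverse()) {
      try { await step.undo(); } catch (e) { console.error('rollback failed:', e); }
    }
    throw err; // let the caller report the failed transaction
  }
}

// Usage with hypothetical operations:
// await runAtomically([
//   { do: () => addField(data),    undo: () => removeField(data) },
//   { do: () => uploadPhoto(file), undo: () => deletePhoto(file) },
// ]);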

CQS: Who is responsible for data caching and when?

When, and which component, should be responsible for caching data from API GET requests into a local data store, in a DDD architecture with CQS-based use cases?
First thing that comes to mind:
Initiate a Query to get some data from local data store and if it is empty, fetch required data from API -> cache it into local data store -> return it
This solution does not seem to follow CQS correctly because Queries should not alter data store (or can they?).
Second thing that comes to mind:
Execute a Command to fetch fresh data from API -> update data store -> raise a data updated event -> event handler listens for data updated events and executes new Query to get fresh data
The second solution seems to follow the CQS pattern better, though I am not sure if either of these solutions is in any way a correct way of handling data caching in a CQS-based architecture.
The first option isn't, to my mind, any "bigger" a violation of CQS/CQRS than the second. The query isn't altering authoritative state (e.g. a DDD aggregate); it's just copying it into a cache. It does require cache invalidation.
The second is questionable because a query results in a command. (It's sometimes reasonable to treat a query as a read-only command, provided the query limits itself to a single aggregate, in order to get a stronger consistency guarantee.)
A better approach to my mind is to have a hybrid of the two:
Queries are served from the cache if possible
When commands result in updates to aggregates, events are published (ideally listing the aggregates which changed)
Event handler listens for update events and invalidates cache based on which aggregates have changed.
A further evolution of this would be event sourcing, where the updates are the events and the queries only get served by a read model fed by the events.
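A rough sketch of that hybrid in Node.js, not tied to any particular framework; the in-memory cache, the event bus and the fetch/apply callbacks are placeholders:

const { EventEmitter } = require('events');

const bus = new EventEmitter();
const cache = new Map(); // aggregateId -> cached read model

// Query side: serve from the cache if possible, otherwise hit the data store.
async function getAggregate(id, fetchFromStore) {
  if (cache.has(id)) return cache.get(id);
  const data = await fetchFromStore(id);
  cache.set(id, data);
  return data;
}

// Command side: update the aggregate, then publish which aggregates changed.
async function updateAggregate(id, applyChange) {
  await applyChange(id);
  bus.emit('aggregatesChanged', { ids: [id] });
}

// Event handler: invalidate only the cache entries that are now stale.
bus.on('aggregatesChanged', ({ ids }) => {
  ids.forEach((id) => cache.delete(id));
});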

Cloud Functions Http Request return cached Firebase database

I'm new to Node.js and Cloud Functions for Firebase; I'll try to be specific with my question.
I have a Firebase database with objects that include a "score" field. I want the data to be retrieved based on that, and that can be done easily on the client side.
The issue is that, if the database grows big, I'm worried it will either take too long to return and/or consume a lot of resources. That's why I was thinking of an HTTP service using Cloud Functions to store a cache with the top N objects, which would update itself with a listener whenever the score of any object changes.
Then, client side just has to call something like https://myexampleprojectroute/givemethetoplevels to receive a Json with the top N levels.
Is it reasonable? If so, how can I approach that? Which structures do I need for this cache, and how do I return them in JSON format via HTTP?
At the moment I'll keep doing it client side but I'd really like to have that both for performance and learning purpose.
Thanks in advance.
EDIT:
In the end I did not implement the optimization. The reason is, first, that the Firebase database does not provide a "child count", so with my newbie JavaScript knowledge I didn't find a way to implement that. Second, and most important, I'm pretty sure it won't scale up to millions; it will have at most 10K entries, and Firebase has rules for sorted-read optimization. For more information please check out this link.
Also, I'll post a simple code snippet to retrieve data from your database via http request using cloud-functions in case someone is looking for it. Hope this helps!
// Simple test function to retrieve a JSON object from the DB
// Warning: no security measures are used, such as authentication, request method checks, etc.
const functions = require('firebase-functions');
const admin = require('firebase-admin');
admin.initializeApp();

exports.request_all_levels = functions.https.onRequest((req, res) => {
  const ref = admin.database().ref('CustomLevels');
  ref.once('value').then(function (snapshot) {
    res.status(200).send(JSON.stringify(snapshot.val()));
  });
});
You're duplicating data upon writes, to gain better read performance. That's a completely reasonable approach. In fact, it is so common in NoSQL databases to keep such derived data structures that it even has a name: denormalization.
A few things to keep in mind:
While Cloud Functions run in a more predictable environment than the average client, the resources are still limited. So reading a huge list of items to determine the latest 10 items is still a suboptimal approach. For simple operations, you'll want to keep the derived data structure up to date on every write operation.
So if you have a "latest 10" and a new item comes in, you remove the oldest item and add the new one. With this approach you have at most 11 items to consider, compared to having your Cloud Function query the whole list of items for the latest 10 upon every write, which is an O(something-with-n) operation.
Same for an averaging operation: you'll find a moving average to be most performant, because it doesn't require any of the previous data.
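As an illustration of the incremental approach, here is a sketch of a Cloud Function that keeps a small derived /topLevels list up to date on every write, instead of querying the full /CustomLevels list each time. The /topLevels path, the score field and TOP_N are assumptions:

const functions = require('firebase-functions');
const admin = require('firebase-admin');
admin.initializeApp();

const TOP_N = 10;

exports.maintainTopLevels = functions.database
  .ref('/CustomLevels/{levelId}')
  .onWrite(async (change, context) => {
    const topRef = admin.database().ref('/topLevels');

    await topRef.transaction((current) => {
      const top = current || {};
      const levelId = context.params.levelId;

      if (!change.after.exists()) {
        delete top[levelId];                     // the level was removed
      } else {
        top[levelId] = change.after.val().score; // add or update its score
      }

      // Trim to the N highest scores: at most N + 1 entries to consider.
      const kept = Object.entries(top)
        .sort((a, b) => b[1] - a[1])
        .slice(0, TOP_N);
      return Object.fromEntries(kept);
    });
  });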

How to stream data in a Node JS + Mongo DB REST API?

I am developing a REST API in Node.js + MongoDB, handled with Mongoose's middleware, in which one of the methods allows the retrieval of content associated with a certain user.
So far I've been retrieving all of the user's content, but the amount of data is starting to grow, and now I need to stream the data somehow.
The behaviour I want to implement would be for the server to answer the request with a stream of 10-20 items, and then, if the client needs more data, it would need to send another request, which would be answered with the following 10-20 items.
All I can come up with would be to answer with those first 10-20 items, and then, in case the client needs more data, to provide a new (optional) parameter for my method, which would allow the client to send the last item's id, so the server can send back the following 10-20 items.
I know that this approach will work, but I feel like it's too raw; there's gotta be a cleaner way to implement this behaviour, since it's the kind of behaviour a lot of web applications must implement.
So, my question would be: Do you know of any better way to solve this problem?
Thanks in advance.
Provide the ability to read an offset and a limit from the request, then do something like:
db.collection.find().skip(20).limit(10)
Also, set defaults on APIs you build so that someone can't request a million records at once. Maybe max results is always 200, and if the request doesn't have the above params set, return up to the first 200 results.
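A small sketch of that answer in Express + Mongoose; the Content model and the user query parameter are placeholders, and the hard cap keeps anyone from requesting a million records at once:

const MAX_LIMIT = 200;

app.get('/contents', async (req, res) => {
  const limit = Math.min(parseInt(req.query.limit, 10) || 20, MAX_LIMIT);
  const skip = Math.max(parseInt(req.query.skip, 10) || 0, 0);

  try {
    const items = await Content.find({ user: req.query.user })
      .sort({ _id: 1 }) // a stable order keeps skip/limit pages consistent
      .skip(skip)
      .limit(limit);
    res.json(items);
  } catch (err) {
    res.status(500).json({ error: err.message });
  }
});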

Generating pages that require complex calculations and data manipulation

What's the best approach for generating a page that is the result of complex calculation/data manipulation/API calls (e.g. 5 minutes per page)? Obviously I can't do the calculation within my Rails web request.
A scheduled task can produce some data, but where should I store it? Should I store it in a postgres table? Should I store it in a document oriented database? Should I store it in memory? Should I generate an html?
I have the feeling of being second-level ignorant about the subject. Is there a well known set of tools to deal with this kind of architectural problem?
Thanks.
I would suggest the following approach:
1. Once you receive initial request:
You can start processing in a separate thread when you receive the first request with the input for the calculation, and send back some token/unique identifier for the request.
2. Store the result:
Then start the calculation and store the result in memory using some tool like memcached.
3. Poll for the result:
Then the request for fetching the result should keep polling for the result with the generated token/unique request identifier. As Adriano said, you can use AJAX for that (I am assuming you are getting the requests from a web browser).
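The original question is about Rails, but the token-and-poll pattern itself is generic; here is a minimal sketch in Express, matching the rest of this thread, where an in-memory Map stands in for memcached and runHeavyCalculation() is a placeholder for the real work:

const crypto = require('crypto');

const results = new Map(); // token -> { status, data | error }

// Initial request: register the job, start it in the background, return a token.
app.post('/calculations', (req, res) => {
  const token = crypto.randomUUID();
  results.set(token, { status: 'pending' });

  runHeavyCalculation(req.body) // placeholder: returns a Promise with the result
    .then((data) => results.set(token, { status: 'done', data }))
    .catch((err) => results.set(token, { status: 'error', error: err.message }));

  res.status(202).json({ token }); // the client polls with this token
});

// Polling: the client asks for the result until status is 'done' or 'error'.
app.get('/calculations/:token', (req, res) => {
  const entry = results.get(req.params.token);
  if (!entry) return res.status(404).end();
  res.json(entry);
});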
