Generating pages that require complex calculations and data manipulation - web

What's the best approach for generating a page that is the result of complex calculations/data manipulation/API calls (e.g. 5 minutes per page)? Obviously I can't do the calculation within my Rails web request.
A scheduled task can produce some data, but where should I store it? Should I store it in a Postgres table? Should I store it in a document-oriented database? Should I store it in memory? Should I generate static HTML?
I have the feeling of being second-level ignorant about the subject. Is there a well-known set of tools to deal with this kind of architectural problem?
Thanks.

I would suggest the following approach:
1. Once you receive the initial request:
Start the processing in a separate thread/background worker as soon as the first request arrives with the input for the calculation, and immediately respond with a token/unique identifier for the request.
2. Store the result:
When the calculation finishes, store the result in memory using some tool like memcached, keyed by that token.
3. Poll for the result:
The request for fetching the result should then keep polling with the generated token/unique request identifier. As Adriano said, you can use AJAX for that (I am assuming the requests come from a web browser).
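A minimal sketch of this flow (the asker is on Rails, where the pieces would be a background job library such as Sidekiq plus Rails.cache/memcached, but the shape is the same in any stack; here it is in Node/Express with the memjs memcached client as an assumption, and runExpensiveCalculation is a made-up placeholder for the long job):
const express = require('express');
const crypto = require('crypto');
const memjs = require('memjs');

const app = express();
const cache = memjs.Client.create(); // defaults to localhost:11211

// 1. Receive the initial request, kick off the calculation, reply with a token
app.post('/reports', express.json(), (req, res) => {
  const token = crypto.randomUUID();
  runExpensiveCalculation(req.body) // placeholder for the 5-minute job
    // 2. Store the result in memcached under the token once it finishes
    .then(result => cache.set(token, JSON.stringify(result), { expires: 3600 }));
  res.status(202).json({ token });
});

// 3. The client polls with the token until the result is ready
app.get('/reports/:token', async (req, res) => {
  const { value } = await cache.get(req.params.token);
  if (!value) return res.status(202).json({ status: 'pending' });
  res.json(JSON.parse(value));
});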

Related

What is the best practice for storing rarely modified database values in NodeJS?

I've got a node app that works with Salesforce for a few different things. One of the features is letting users fill in a form and pushing it to Salesforce.
The form has a dropdown list, so I query Salesforce to get the list of available dropdown items and make them available to my form via res.locals. Currently I'm getting these values via some middleware and storing them in the user's session; then I check whether the session value is set and use it, and if not, I query Salesforce and pull the values in.
This works, but it means every user's session data in Mongo holds a whole bunch of picklist vals (they are the same for all users). I very rarely make changes to the values on the Salesforce side of things, so I'm wondering if there is a "proper" way of storing these vals in my app?
I could pull them into a Mongo collection, and trigger a manual refresh of them whenever they change. I could expire them in Mongo (but realistically if they do need to change, it's because someone needs to access the new values immediately), so not sure that makes the most sense...
Is storing them in everyone's session the best way to tackle this, or is there something else I should be doing?
To answer your question quickly: you could add them to a singleton object (instead of session data, which is per user), though you will still have to manage their lifetime (i.e. pull them again when they change). A singleton can be implemented as a simple module that can be required and returns a plain object; since Node caches required modules, every consumer shares the same object.
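For example, a minimal sketch of such a module (the file name and the fetch function are just illustrative):
// picklist-cache.js
let cachedValues = null;

module.exports = {
  async get(fetchFromSalesforce) {
    if (!cachedValues) {
      cachedValues = await fetchFromSalesforce(); // only hits Salesforce once per process
    }
    return cachedValues;
  },
  invalidate() {
    cachedValues = null; // call this when the values change on the Salesforce side
  },
};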
But if I was to do something like this, I would go about doing it differently:
I would create an API endpoint that returns your list data (possibly giving it query parameters to return different lists).
If you can afford the data being slightly outdated for a short period of time, you can write your API so that the response is cached (HTTP cache headers with a short max-age).
If your data has to be fresh in real time, then your API should return an ETag header in its response. The ETag basically acts as a checksum for your data; a good checksum would be the "last updated date" across all the records in the collection. Upon receiving a request, you check for the "If-None-Match" header, which carries the checksum the client already has. At that point you do a "lite" call to your database to pull just the checksum: if it matches, you return a 304 (Not Modified); otherwise you pull the full data you need and return it (along with the new checksum in the response ETag). Basically you are letting the browser do the caching (sketched below).
Note that you can also combine the two caching approaches above and use them together.
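A rough sketch of the ETag flow in Express (the route and the two helper functions are assumptions, not your actual code):
const express = require('express');
const router = express.Router();

router.get('/picklists/:name', async (req, res) => {
  // "lite" call: pull only the checksum, e.g. the latest "last updated" date
  const lastUpdated = await getPicklistLastUpdated(req.params.name); // hypothetical helper
  const etag = '"' + lastUpdated.getTime() + '"';

  // If the browser already has this version, answer 304 and skip the heavy query
  if (req.headers['if-none-match'] === etag) {
    return res.status(304).end();
  }

  const values = await getPicklistValues(req.params.name); // hypothetical helper
  res.set('ETag', etag);
  res.set('Cache-Control', 'public, max-age=60'); // optional: combine with plain HTTP caching
  res.json(values);
});

module.exports = router;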
More resources on this here:
https://devcenter.heroku.com/articles/increasing-application-performance-with-http-cache-headers
https://developers.facebook.com/docs/marketing-api/etags

Elasticsearch: how to check the status of a bulk indexing request?

I am bulk indexing into Elasticsearch docs containing country shapes (files here), based on the cshapes dataset.
The geoshapes have a lot of points in "geometry":{"type":"MultiPolygon", and the bulk request takes a long time to complete (and sometimes does not complete, which is a separate and already reported problem).
Since the client times out (I use the official ES node.js client), I would like to have a way to check what the status of the bulk request is, without having to use enormous timeout values.
What I would like is to have a status such as active/running, completed or aborted. I guess that just querying a single doc in the batch would not tell me whether the request has been aborted.
Is this possible?
I'm not sure if this is exactly what you're looking for, but may be helpful. Whenever I'm curious about what my cluster is doing, I check out the tasks API.
The tasks API shows you all of the tasks that are currently running on your cluster. It will give you information about individual tasks, such as the task ID, start time, and running time. Here's the command:
curl -XGET http://localhost:9200/_tasks?group_by=parents | python -m json.tool
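If you only care about the bulk operations, you can narrow the output down with the actions filter and ask for per-task details:
curl -XGET 'http://localhost:9200/_tasks?actions=*bulk*&detailed' | python -m json.tool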
Elasticsearch doesn't provide a way to check the status of an ongoing bulk request (documentation reference here).
First, check that your request succeeds with a smaller input, so you know there is no problem with the way you are making the request. Second, try dividing the data into smaller chunks and calling the Bulk API on them in parallel.
You can also try with a higher request_timeout value, but I guess that is something you don't want to do.
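A rough sketch of the chunking idea with the legacy "elasticsearch" npm client (index/type names, document ids and the chunk size are placeholders):
const elasticsearch = require('elasticsearch');
const client = new elasticsearch.Client({ host: 'localhost:9200' });

// one action line plus one source line per document, as the Bulk API expects
function toBulkBody(docs) {
  return docs.flatMap(doc => [
    { index: { _index: 'countries', _type: 'shape', _id: doc.id } },
    doc,
  ]);
}

async function bulkInChunks(docs, chunkSize = 100) {
  const chunks = [];
  for (let i = 0; i < docs.length; i += chunkSize) {
    chunks.push(docs.slice(i, i + chunkSize));
  }
  // each chunk becomes a separate, smaller bulk request, sent in parallel
  const responses = await Promise.all(
    chunks.map(chunk => client.bulk({ body: toBulkBody(chunk) }))
  );
  responses.forEach((resp, i) => {
    if (resp.errors) console.warn('chunk ' + i + ' had item-level errors');
  });
}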
Just a side note on why your requests might take so long (unless you are simply indexing too many documents in a single bulk run): if you have configured your own precision for geo shapes, also make sure you configure distance_error_pct, otherwise no error margin is assumed, resulting in documents with a huge number of terms that take a long time to index.
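For reference, this is roughly what that looks like on the (pre-7.x, prefix-tree based) geo_shape mapping; index and type names are placeholders:
curl -XPUT 'http://localhost:9200/countries' -H 'Content-Type: application/json' -d '
{
  "mappings": {
    "shape": {
      "properties": {
        "geometry": {
          "type": "geo_shape",
          "tree": "quadtree",
          "precision": "1km",
          "distance_error_pct": 0.025
        }
      }
    }
  }
}'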

Cloud Functions HTTP request to return cached Firebase database

I'm new to Node.js and Cloud Functions for Firebase; I'll try to be specific with my question.
I have a Firebase database with objects that include a "score" field. I want the data to be retrieved ordered by that field, and that can be done easily client side.
The issue is that, if the database grows big, I'm worried it will either take too long to return and/or consume a lot of resources. That's why I was thinking of an HTTP service using Cloud Functions that keeps a cache with the top N objects and updates itself, via a listener, whenever any object's score changes.
Then the client side just has to call something like https://myexampleprojectroute/givemethetoplevels to receive a JSON with the top N levels.
Is it reasonable? If so, how can I approach it? Which structures do I need for this cache, and how do I return them in JSON format via HTTP?
At the moment I'll keep doing it client side but I'd really like to have that both for performance and learning purpose.
Thanks in advance.
EDIT:
In the end I did not implement the optimization. The reason is, first, that the Firebase database does not expose a "child count", so with my newbie JavaScript knowledge I didn't find a way to implement it. Second, and most important, I'm pretty sure it won't scale up to millions of entries (it will have at most 10K), and Firebase has indexing rules to optimize sorted reads. For more information please check out this link.
Also, I'll post a simple code snippet to retrieve data from your database via an HTTP request using Cloud Functions, in case someone is looking for it. Hope this helps!
// Simple test function to retrieve a JSON object from the DB
// Warning: no security measures are applied (authentication, request method checks, etc.)
const functions = require('firebase-functions');
const admin = require('firebase-admin');
admin.initializeApp();

exports.request_all_levels = functions.https.onRequest((req, res) => {
  const ref = admin.database().ref('CustomLevels');
  // Read the node once and return its contents as JSON
  return ref.once('value').then(snapshot => {
    res.status(200).send(JSON.stringify(snapshot.val()));
  });
});
You're duplicating data upon writes, to gain better read performance. That's a completely reasonable approach. In fact, it is so common in NoSQL databases to keep such derived data structures that it even has a name: denormalization.
A few things to keep in mind:
While Cloud Functions run in a more predictable environment than the average client, the resources are still limited. So reading a huge list of items just to determine the latest 10 is still a suboptimal approach. For simple operations you'll want to keep the derived data structure up to date on every write operation.
So if you have a "latest 10" and a new item comes in, you remove the oldest item and add the new one. With this approach you have at most 11 items to consider, compared to having your Cloud Function query the whole list of items for the latest 10 upon every write, which is an O(something-with-n) operation.
Same for an averaging operation: you'll find a moving average to be most performant, because it doesn't require any of the previous data.
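A minimal sketch of what maintaining such a derived "top 10" node could look like with a Realtime Database trigger (v1 firebase-functions API; the /CustomLevels and /topLevels paths are assumptions based on the question):
const functions = require('firebase-functions');
const admin = require('firebase-admin');
admin.initializeApp();

exports.maintainTopLevels = functions.database
  .ref('/CustomLevels/{levelId}')
  .onWrite((change, context) => {
    const level = change.after.val();
    if (!level) return null; // deletion: a full rebuild could be triggered here instead

    // Transaction over the derived node: at most the 10 stored entries plus
    // the incoming one are considered, never the whole CustomLevels list.
    return admin.database().ref('/topLevels').transaction(top => {
      top = top || {};
      top[context.params.levelId] = { score: level.score };
      const best = Object.entries(top)
        .sort((a, b) => b[1].score - a[1].score)
        .slice(0, 10);
      return Object.fromEntries(best);
    });
  });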

Handle duplicates in batch POST requests to a REST API

The stack
Express.js API server for CRUD operations over data.
MongoDB database.
Mongoose interface for MongoDB for schemas.
The problem
In order to handle duplicates in just one point, I want to do it in the only possible entry point: The API.
Definition: duplicate
A duplicate is an entity which already exists in the database, so the
new POST request is either the same entity with exactly the same data, or
the same entity with updated data.
The API is designed around the new HTTP/2 protocol.
Bulk importers have been written. These programs get the data from a given source, transform it into our specific format, and make a POST request to save each entity. These importers are designed to handle every entity in parallel.
The API already has a duplication handler which works great when a given entity already exists in the database. The problem comes when the bulk importers make several POST requests for the same entity at the same time, and the entity doesn't exist in the database yet.
....POST/1 .databaseCheck.......DataBaseResult=false..........DatabaseWrite
......POST/2 .databaseCheck.......DataBaseResult=false..........DatabaseWrite
........POST/3 .databaseCheck.......DataBaseResult=false..........DatabaseWrite
.....................POST/N .databaseCheck.......DataBaseResult=false..........DatabaseWrite
This situation produces the creation of the same entity several times, because the database checks haven't finished when the rest of the POST requests arrive.
Only if the number of POST requests were big enough would the first write operation have already finished, so that the databaseCheck of the Nth request returns true.
What would be the correct solution to handle this?
If I'm not wrong, what I'm looking for is called a transaction, and I don't know whether this is something the database should offer by default, or something I have to implement myself.
Solutions I have already considered:
1. Limit the requests, just one each time.
This is the simplest solution, but if the API stays blocked while the bulk importers make several requests, then the frontend client would get very slow, and it is meant to be fast, and multiplayer. So this, in fact, is not a solution.
2. Special bulk API endpoint for each entity.
If an application needs to make bulk requests, then it makes just one huge POST request with all the data in the request body.
This solution doesn't block the API and can handle duplicates very well, but what I don't like is that it goes against the HTTP/2 protocol, where many small requests are preferred.
And the problem persists: other future clients may run into it if they don't notice that a bulk endpoint is available. But maybe this is not a problem.
3. Try to use the possible MongoDB transaction implementation
I've read a little bit about this, but I don't know whether it would be possible to handle this problem with the MongoDB and Mongoose tools. I've done some searching, but I haven't found anything, because before trying to insert many documents, I need to generate the data for each document, and that data arrives inside each separate POST request.
4. Drop MongoDB and use a transaction friendly database.
This would have a big cost at this point, because the whole stack is already finished and we are close to launch. We aren't afraid of refactoring, but I think the considerations from the 3rd solution would also apply here.
5. Own transactions implementation at the API level?
I've designed a solution that may work for every case, which I call the pool stream.
This is the design:
When a POST request arrives, a timer of a fixed number of milliseconds starts. That amount of time would be big enough to catch several requests, and small enough not to cause a noticeable delay.
Inside each chunk of requests, the data is processed to merge duplicates before writing to the database. So if a chunk catches n requests, n - m (where m <= n) unique candidates are generated. A hash function is applied to each candidate in order to assign the hash result to each request-response pair. Then the candidates are written to the database in parallel, and the current duplicates handler still applies at write time.
When the writes for the current chunk finish, a response is sent to each request-response pair of the chunk, and then the next chunk is processed. While a chunk is queued waiting for its write operation, the unique-candidates step for the next chunk could already be running, to accelerate the whole process.
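To make the idea more concrete, a rough sketch of the pool stream (names are illustrative; error handling omitted; writeToDb stands for whatever already persists one entity and runs the existing duplicates handler):
const crypto = require('crypto');

const WINDOW_MS = 50;   // the fixed amount of milliseconds mentioned above
let pool = [];          // current chunk of { hash, body, res } entries
let timer = null;

function hashEntity(body) {
  return crypto.createHash('sha1').update(JSON.stringify(body)).digest('hex');
}

async function flushPool(writeToDb) {
  const chunk = pool;
  pool = [];
  timer = null;

  // n requests -> n - m unique candidates, keyed by their hash
  const candidates = new Map();
  for (const entry of chunk) {
    if (!candidates.has(entry.hash)) candidates.set(entry.hash, entry.body);
  }

  // one parallel write pass; the existing duplicates handler still applies here
  await Promise.all([...candidates.values()].map(writeToDb));

  // answer every request-response pair of the chunk
  for (const entry of chunk) entry.res.status(201).json({ ok: true });
}

// Express handler: buffer the request and (re)arm the chunk timer
function poolStream(writeToDb) {
  return (req, res) => {
    pool.push({ hash: hashEntity(req.body), body: req.body, res });
    if (!timer) timer = setTimeout(() => flushPool(writeToDb), WINDOW_MS);
  };
}

// usage (hypothetical): app.post('/entities', express.json(), poolStream(saveEntity));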
What do you think?
Thank you.

How to stream data in a Node JS + Mongo DB REST API?

I am developing a REST API in Node.js + MongoDB, handled with Mongoose middleware, in which one of the methods allows retrieving the content associated with a certain user.
So far I've been retrieving all of the user's content, but the amount of data is starting to grow, and now I need to stream the data somehow.
The behaviour I want to implement would be for the server to answer the request with a stream of 10-20 items, and then, if the client needs more data, it would need to send another request, which would be answered with the following 10-20 items.
All I can come up with is to answer with those first 10-20 items and then, in case the client needs more data, to provide a new (optional) parameter for my method that lets the client send the last item's id, so the server can send back the following 10-20 items.
I know that this approach will work, but I feel like it's too raw; there's gotta be a cleaner way to implement this behaviour, since it's the kind of behaviour a lot of web applications must implement.
So, my question would be: Do you know of any better way to solve this problem?
Thanks in advance.
Provide the ability to read an offset and a limit from the request, then do something like:
db.collection.find().skip(20).limit(10)
Also, set defaults on APIs you build so that someone can't request a million records at once. Maybe max results is always 200, and if the request doesn't have the above params set, return up to the first 200 results.
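A minimal sketch of that with Express + Mongoose (model and route names are assumptions):
const express = require('express');
const Content = require('./models/content'); // hypothetical Mongoose model

const router = express.Router();
const MAX_LIMIT = 200; // hard cap so nobody can request a million records at once

router.get('/users/:userId/contents', async (req, res) => {
  const offset = Math.max(parseInt(req.query.offset, 10) || 0, 0);
  const limit = Math.min(parseInt(req.query.limit, 10) || 20, MAX_LIMIT);

  const items = await Content.find({ user: req.params.userId })
    .sort({ _id: 1 })  // stable order so pages don't shuffle between requests
    .skip(offset)
    .limit(limit);

  res.json({ offset, limit, items });
});

module.exports = router;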
