How to stream data in a Node.js + MongoDB REST API?

I am developing a REST API in Node.js + MongoDB, handled with Mongoose's middleware, in which one of the methods allows retrieving the content associated with a certain user.
So far I've been retrieving all of the user's content, but the amount of data is starting to grow, and now I need to stream the data somehow.
The behaviour I want to implement would be for the server to answer the request with a stream of 10-20 items, and then, if the client needs more data, it would need to send another request, which would be answered with the following 10-20 items.
All I can come up with would be to answer with those first 10-20 items, and then, in case the client needs more data, to provide a new (optional) parameter for my method, which would allow the client to send the last item's id, so the server can send back the following 10-20 items.
I know that this approach will work, but I feel like it's too raw; there's gotta be a cleaner way to implement this behaviour, since it's the kind of behaviour a lot of web applications must implement.
So, my question would be: Do you know of any better way to solve this problem?
Thanks in advance.

Provide the ability to read an offset and a limit from the request, then do something like:
db.collection.find().skip(20).limit(10)
Also, set defaults on APIs you build so that someone can't request a million records at once. Maybe max results is always 200, and if the request doesn't have the above params set, return up to the first 200 results.
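For example, a minimal sketch with Express and Mongoose (Content is a placeholder model name) could look like this:

// Offset/limit pagination with sane defaults and an upper bound.
const MAX_LIMIT = 200;

app.get('/users/:userId/contents', async (req, res) => {
  const offset = Math.max(parseInt(req.query.offset, 10) || 0, 0);
  const limit = Math.min(parseInt(req.query.limit, 10) || 20, MAX_LIMIT);

  const items = await Content.find({ user: req.params.userId })
    .sort({ _id: 1 })   // stable order so pages don't overlap
    .skip(offset)
    .limit(limit);

  res.json({ items, offset, limit });
});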

Related

What is the best practice for storing rarely modified database values in NodeJS?

I've got a node app that works with Salesforce for a few different things. One of the features is letting users fill in a form and pushing it to Salesforce.
The form has a dropdown list, so I query Salesforce to get the list of available dropdown items and make them available to my form via res.locals. Currently I'm getting these values via some middleware and storing them in the user's session; then I check whether the session value is set: if so, I use the stored values, and if not, I query Salesforce and pull them in.
This works, but it means every user's session data in Mongo holds a whole bunch of picklist vals (they are the same for all users). I very rarely make changes to the values on the Salesforce side of things, so I'm wondering if there is a "proper" way of storing these vals in my app?
I could pull them into a Mongo collection, and trigger a manual refresh of them whenever they change. I could expire them in Mongo (but realistically if they do need to change, it's because someone needs to access the new values immediately), so not sure that makes the most sense...
Is storing them in everyone's session the best way to tackle this, or is there something else I should be doing?
To answer your question quickly, you could add them to a singleton object (instead of session data, which is per user), though I'm not sure how you will manage their lifetime (i.e. pull them again when they change). A singleton can be implemented as a simple script file that can be required and returns a simple object...
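A minimal sketch of that idea, assuming a hypothetical salesforce helper and an arbitrary refresh interval to deal with the lifetime problem:

// picklist-cache.js - Node caches required modules, so this acts as a singleton.
const TTL_MS = 10 * 60 * 1000;   // refresh at most every 10 minutes (assumption)
let cached = null;
let fetchedAt = 0;

module.exports = async function getPicklistValues(salesforce) {
  if (!cached || Date.now() - fetchedAt > TTL_MS) {
    cached = await salesforce.getPicklistValues();   // hypothetical Salesforce call
    fetchedAt = Date.now();
  }
  return cached;
};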
But if I was to do something like this, I would go about doing it differently:
I would create an API endpoint that returns your list data (possibly giving it query parameters to return different lists).
If you can afford the data being outdated for a short period of time, you can write your API so that it returns the response with HTTP cache headers (cached for a short period of time).
If your data has to be realtime fresh, then your API should return an ETag in its response. The ETag header basically acts like a checksum for your data; a good checksum would be the "last updated date" of all the records in a collection. Upon receiving a request, you check whether it has the "If-None-Match" header, which would contain the checksum. At that point you do a "lite" call to your database to pull just the checksum: if it matches, you return a 304 HTTP code (Not Modified); otherwise you actually pull the full data you need and return it (alongside the new checksum in the response ETag). Basically you are letting your browser do the caching...
Note that you can also combine caching in points 1 and 2 and use them together.
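A minimal sketch of the ETag approach, assuming Express and a Mongoose model (PicklistValue and lastUpdated are placeholder names):

app.get('/api/picklist-values', async (req, res) => {
  // "Lite" call: pull only the newest lastUpdated date to use as the checksum.
  const latest = await PicklistValue.findOne()
    .sort({ lastUpdated: -1 })
    .select('lastUpdated')
    .lean();
  const etag = latest ? String(latest.lastUpdated.getTime()) : '0';

  if (req.headers['if-none-match'] === etag) {
    return res.status(304).end();   // client already has the latest data
  }

  const values = await PicklistValue.find().lean();
  res.set('ETag', etag);
  res.json(values);
});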
More resources on this here:
https://devcenter.heroku.com/articles/increasing-application-performance-with-http-cache-headers
https://developers.facebook.com/docs/marketing-api/etags

NodeJS Issue with huge data export

I have a scenario: in the DB I have a table with a huge number of records (2 million) and I need to export them to xlsx or csv.
So the basic approach I used is running a query against the DB and putting the data into an appropriate file for download.
Problems:
There is a DB timeout that I have set to 150 sec, which sometimes isn't enough, and I am not sure that extending the timeout would be a good idea!
There is also a certain timeout on the Express request, so it basically times out my HTTP request and hits the server a second time (for an unknown reason).
So as a solution, I am thinking of using a streaming DB connection, and if I can somehow pipe that into an output stream for the file, it should work.
So basically I need help with the 2nd part: in the stream I would receive records one by one, and at the same time I am thinking of letting the user download the file progressively (this would avoid the request timeout).
I don't think this is a unique problem, but I didn't find appropriate pieces to put together. Thanks in advance!
If you see it in your log, do you run the query more than once?
Does your UI time out before the server even reaches res.end()?
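If the streaming approach you describe is on the table, a rough sketch with Express and a Mongoose cursor piped into the response could look like this (Record and its fields are placeholder names, and CSV escaping is omitted):

app.get('/export.csv', (req, res) => {
  res.set('Content-Type', 'text/csv');
  res.set('Content-Disposition', 'attachment; filename="export.csv"');
  res.write('id,name,createdAt\n');   // start responding right away

  const cursor = Record.find().lean().cursor();   // streams documents one by one
  cursor.on('data', (doc) => {
    res.write(`${doc._id},${doc.name},${doc.createdAt.toISOString()}\n`);
  });
  cursor.on('end', () => res.end());
  cursor.on('error', (err) => {
    console.error(err);
    res.end();   // headers are already sent, so just stop the stream
  });
});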

Cloud Functions HTTP request returning cached Firebase database

I'm new to Node.js and Cloud Functions for Firebase; I'll try to be specific with my question.
I have a Firebase database with objects that include a "score" field. I want the data to be retrieved based on that, and that can be done easily on the client side.
The issue is that, if the database grows big, I'm worried it will either take too long to return and/or consume a lot of resources. That's why I was thinking of an HTTP service using Cloud Functions to store a cache with the top N objects, which would update itself via a listener whenever the score of any object changes.
Then the client side just has to call something like https://myexampleprojectroute/givemethetoplevels to receive a JSON with the top N levels.
Is it reasonable? If so, how can I approach that? Which structures do I need for this cache, and how do I return them in JSON format via HTTP?
At the moment I'll keep doing it client side but I'd really like to have that both for performance and learning purpose.
Thanks in advance.
EDIT:
In the end I did not implement the optimization. The reason is, first, that the Firebase database does not provide a "child count", so with my newbie JavaScript knowledge I didn't find a way to implement that. Second, and most important, I'm pretty sure it won't scale up to millions, having at most 10K entries, and Firebase has rules for sorted-read optimization. For more information please check out this link.
Also, I'll post a simple code snippet to retrieve data from your database via an HTTP request using Cloud Functions, in case someone is looking for it. Hope this helps!
// Simple test function to retrieve a JSON object from the DB.
// Warning: no security measures are used, such as authentication, request method checks, etc.
exports.request_all_levels = functions.https.onRequest((req, res) => {
  const ref = admin.database().ref('CustomLevels');
  ref.once('value').then(function(snapshot) {
    res.status(200).send(JSON.stringify(snapshot.val()));
  }).catch(function(error) {
    res.status(500).send(error.toString());
  });
});
You're duplicating data upon writes, to gain better read performance. That's a completely reasonable approach. In fact, it is so common in NoSQL databases to keep such derived data structures that it even has a name: denormalization.
A few things to keep in mind:
While Cloud Functions run in a more predictable environment than the average client, the resources are still limited. So reading a huge list of items to determine the latest 10 is still a suboptimal approach. For simple operations, you'll want to keep the derived data structure up to date on every write operation.
So if you have a "latest 10" and a new item comes in, you remove the oldest item and add the new one. With this approach you have at most 11 items to consider, compared to having your Cloud Function query the entire list of items for the latest 10 upon every write, which is an O(something-with-n) operation.
The same goes for an averaging operation: you'll find a moving average to be most performant, because it doesn't require any of the previous data.
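A rough sketch of keeping such a denormalized top-10 list up to date on every write, assuming the newer firebase-functions API; the /TopLevels node and the score field are placeholder names:

const functions = require('firebase-functions');
const admin = require('firebase-admin');
admin.initializeApp();

exports.updateTopLevels = functions.database
  .ref('/CustomLevels/{levelId}')
  .onWrite((change, context) => {
    const level = change.after.val();
    if (!level) return null;   // the level was deleted, nothing to add

    const topRef = admin.database().ref('/TopLevels');
    // A transaction prevents concurrent writes from clobbering each other.
    return topRef.transaction((top) => {
      top = top || {};
      top[context.params.levelId] = { score: level.score };
      // Keep only the 10 highest scores: at most 11 items to consider.
      const ids = Object.keys(top).sort((a, b) => top[b].score - top[a].score);
      ids.slice(10).forEach((id) => { delete top[id]; });
      return top;
    });
  });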

Handle duplicates in batch POST requests to a REST API

The stack
Express.js API server for CRUD operations over data.
MongoDB database.
Mongoose interface to MongoDB for schemas.
The problem
In order to handle duplicates in just one point, I want to do it in the only possible entry point: The API.
Definition: duplicate
A duplicate is an entity which already exists in the database, so the new POST request is either the same entity with exactly the same data, or the same entity with updated data.
The API design is meant to work with the new HTTP/2 protocol.
Bulk importers have been written. These programs get the data from a given source, transform it into our specific format, and make POST requests to save it. These importers are designed to handle every entity in parallel.
The API already has a duplication handler which works great when a given entity already exists in the database. The problem comes when the bulk importers make several POST requests for the same entity at the same time, and the entity doesn't exist in the database yet.
....POST/1 .databaseCheck.......DataBaseResult=false..........DatabaseWrite
......POST/2 .databaseCheck.......DataBaseResult=false..........DatabaseWrite
........POST/3 .databaseCheck.......DataBaseResult=false..........DatabaseWrite
.....................POST/N .databaseCheck.......DataBaseResult=false..........DatabaseWrite
This situation produces the creation of the same entity several times, because the database checks haven't finished when the rest of the POST requests arrive.
Only if the number of POST requests is big enough will the first write operation have already finished, so that the databaseCheck of the Nth request returns true.
What would be the correct solution to handle this?
If I'm not wrong, what I'm looking for is called a transaction, and I don't know whether this is something the database should offer by default, or something I have to implement.
Solutions I have already considered:
1. Limit the requests, just one each time.
This is the simplest solution, but if the API gets blocked while the bulk importers make several requests, then the frontend client would get very slow, and it is meant to be fast and multiplayer. So this, in fact, is not a solution.
2. Special bulk API endpoint for each entity.
If an application needs to make bulk requests, then it makes just one huge POST request with all the data as the request body.
This solution doesn't block the API and can handle duplicates very well, but what I don't like is that I would go against the HTTP/2 protocol, where many small requests are desired.
And the problem persists: other future clients may run into it if they don't notice that a bulk endpoint is available. But maybe this is not a problem.
3. Try to use the possible MongoDB transaction implementation
I've read a little about this, but I don't know whether it is possible to handle this problem with the MongoDB and Mongoose tools. I've done some searching but haven't found anything, because before trying to insert many documents, I need to generate the data for each document, and that data comes inside each POST request.
4. Drop MongoDB and use a transaction friendly database.
This would have a big cost at this point because the whole stack is already finished and we are near launch. We aren't afraid of refactoring, but I think the considerations of the 3rd solution would apply here too.
5. Own transactions implementation at the API level?
I've designed a solution that may work for every case, and that I call the pool stream.
This is the design:
When a POST request arrives, a timer of a fixed number of milliseconds starts. That amount of time would be big enough to catch several requests and small enough not to cause a noticeable delay.
Inside each chunk of requests, the data is processed, trying to merge duplicates before writing to the database. So if n requests have been caught inside a chunk, n - m (where m <= n) unique candidates are generated. A hash function is applied to each candidate in order to assign the hash result to each request-response pair. Then the candidates are written to the database in parallel, and the current duplicates handler would take care of this at write time.
When the writes for the current chunk finish, the response is sent to each request-response pair of the chunk, and then the next chunk is processed. While a chunk is in the queue waiting for the write operation, the unique-candidate merging for the next chunk could already be running, in order to accelerate the whole process.
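A rough sketch of how this pool could look, assuming Express and Mongoose; Entity, entityKey() and WINDOW_MS are placeholder names and values:

const WINDOW_MS = 50;          // window big enough to catch several requests (assumption)
let pool = [];                 // { data, res } pairs waiting for the next flush
let timer = null;

function entityKey(data) {
  return data.externalId;      // whatever uniquely identifies an entity
}

app.post('/entities', (req, res) => {
  pool.push({ data: req.body, res });
  if (!timer) timer = setTimeout(flushPool, WINDOW_MS);
});

async function flushPool() {
  const chunk = pool;
  pool = [];
  timer = null;

  // Merge duplicates inside the chunk before touching the database.
  const candidates = new Map();
  for (const { data } of chunk) candidates.set(entityKey(data), data);

  try {
    // Upserts keyed on the entity id keep the writes idempotent.
    await Promise.all([...candidates.values()].map((data) =>
      Entity.updateOne({ externalId: entityKey(data) }, data, { upsert: true })
    ));
    chunk.forEach(({ res }) => res.status(200).json({ ok: true }));
  } catch (err) {
    chunk.forEach(({ res }) => res.status(500).json({ error: err.message }));
  }
}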
What do you think?
Thank you.

Possible to do a general Backbone.js fetch to fill multiple collections? [Backbone.js/Express]

I have a page which uses multiple collections (6 or so), each of which needs to fetch data from the server when the page is routed to. Obviously, there are a lot of connections that need to happen, which will probably make this page slow. I was wondering if a general fetch is possible when I route to this page. This way I can retrieve all the data at once, send it as one big JSON chunk, and allocate the data to each collection at the same time. This would take only one connection. I looked around and did not see such a technique for Backbone.
Is this proper thinking? I am using Express/Node on the server side.
Thanks
I can't comment on whether or not this is good, but I have done something similar. I simply made an AJAX request using jQuery to my endpoint, which served the data for all the collections in one large JSON object, similar to:
{
  "Collection1": [...],
  "Collection2": [...],
  ...
  "CollectionN": [...]
}
When I get the response from the server (in the success callback), I take the data for each collection and just use the collection.add() function. This is basically the same as what Backbone does in fetch (make the request and pass the returned values into add). When an array is passed to collection.add(), each object in the array is used to create a model.
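A minimal sketch of that approach, assuming jQuery and Backbone are loaded; the /api/bootstrap endpoint and collection names are placeholders:

var collections = {
  Collection1: new Backbone.Collection(),
  Collection2: new Backbone.Collection()
  // ...one entry per collection the page needs
};

$.getJSON('/api/bootstrap').done(function (data) {
  // Each key in the response maps to one collection; add() builds the models.
  Object.keys(collections).forEach(function (name) {
    collections[name].add(data[name] || []);
  });
});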
