NodeJS Issue with huge data export

I have a scenario. In the DB, I have a table with a huge number of records (2 million) and I need to export them to xlsx or csv.
The basic approach I used is to run a query against the DB and put the data into an appropriate file for download.
Problems:
There is a DB timeout that I have set to 150 seconds, which sometimes isn't
enough, and I am not sure if extending the timeout would be a good idea!
There is also a timeout on the Express request, so the HTTP request times out and is fired a second time (for an unknown reason).
As a solution, I am thinking of using a streaming DB connection, and if I can pipe that into an output stream for the file, it should work.
So I basically need help with the second part: with a stream, I would receive records one by one, and at the same time I would like to let the user download the file progressively. (This would avoid the request timeout.)
I don't think this is a unique problem, but I didn't find the appropriate pieces to put together. Thanks in advance!

If you look at your logs, does the query run more than once?
Does your UI time out before the server even reaches the res.end()?

Related

Max number of records in DynamoDB

I'm writing a script that saves a CSV into a DynamoDB table. I'm using Node.js and the aws-sdk module. Everything seems correct, but I'm sending over 50k records to Dynamo while only 1181 are saved and shown on the web console.
I've tried with different numbers of records, and this is the biggest count I get, no matter whether I try saving 100k, 10k or 50k.
According to AWS's documentation, there shouldn't be any limit on the number of records. Any idea what other factors could cause this hard limit?
BTW, my code is catching errors from the insert actions, and I'm not picking up any when inserting past the 1181 mark, so the module is not really helping.
Any extra idea would be appreciated.
If you're using DynamoDB's BatchWriteItem (or another batch insert), you need to check the "UnprocessedItems" element in the response. Batch writes sometimes exceed the provisioned write capacity of your table, in which case not all of your inserts are processed; that sounds like what is happening here.
You should check the response of each insert, and if there are unprocessed items, set up a retry with an exponential backoff strategy in your code. This will allow the remaining items to be inserted until all of your CSV is processed.
Here is the reference link for DynamoDB BatchWriteItem if you want to take a closer look at the response elements. Good luck!
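A sketch of that retry loop, written generically so the actual call is pluggable. The `send` function is a hypothetical wrapper around the real DynamoDB call (e.g. `DocumentClient.batchWrite(...).promise()`); the loop itself only relies on the documented `UnprocessedItems` shape:

```javascript
// Retry unprocessed items with exponential backoff. `send` takes a
// request-items map and resolves to a response containing
// UnprocessedItems (as BatchWriteItem does); in real code it would wrap
// the aws-sdk call.
async function batchWriteWithRetry(send, requestItems, maxRetries = 5) {
  let pending = requestItems;
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    const { UnprocessedItems } = await send(pending);
    if (!UnprocessedItems || Object.keys(UnprocessedItems).length === 0) {
      return; // everything was written
    }
    pending = UnprocessedItems; // retry only the leftovers
    // Exponential backoff before the next attempt.
    await new Promise((r) => setTimeout(r, 50 * 2 ** attempt));
  }
  throw new Error('Unprocessed items remain after retries');
}
```

With the CSV split into batches of at most 25 items (the BatchWriteItem limit), running each batch through this loop should account for the records silently dropped past the 1181 mark.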

Elasticsearch how to check for a status of a bulk indexing request?

I am bulk indexing into Elasticsearch docs containing country shapes (files here), based on the cshapes dataset.
The geoshapes have a lot of points in "geometry":{"type":"MultiPolygon", and the bulk request takes a long time to complete (and sometimes does not complete, which is a separate and already reported problem).
Since the client times out (I use the official ES Node.js client), I would like a way to check what the status of the bulk request is, without having to use enormous timeout values.
What I would like is a status such as active/running, completed or aborted. I guess that just querying a single doc from the batch would not tell me whether the request has been aborted.
Is this possible?
I'm not sure if this is exactly what you're looking for, but may be helpful. Whenever I'm curious about what my cluster is doing, I check out the tasks API.
The tasks API shows you all of the tasks that are currently running on your cluster. It will give you information about individual tasks, such as the task ID, start time, and running time. Here's the command:
curl -XGET http://localhost:9200/_tasks?group_by=parents | python -m json.tool
Elasticsearch doesn't provide a way to check the status of an ongoing bulk request (documentation reference here).
First, check that your request succeeds with a smaller input, so you know there is no problem with the way you are making the request. Second, try dividing the data into smaller chunks and calling the Bulk API on them in parallel.
You can also try with a higher request_timeout value, but I guess that is something you don't want to do.
Just a side note on why your requests might take a long time (unless you are simply indexing too many docs in a single bulk run): if you have configured your own precision for geo shapes, also make sure you configure distance_error_pct; otherwise no error is assumed, resulting in documents with a lot of terms that take a long time to index.
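The chunk-and-parallelize suggestion above can be sketched as a small helper that splits the docs into bulk bodies of a fixed size. The index name and chunk size here are hypothetical; the loop at the end shows how each chunk would then be sent with the client's bulk method:

```javascript
// Split a large doc list into bulk bodies of `size` docs each. Each ES
// bulk action is a header line ({ index: ... }) followed by the doc.
function toBulkChunks(docs, size, indexName) {
  const chunks = [];
  for (let i = 0; i < docs.length; i += size) {
    const body = [];
    for (const doc of docs.slice(i, i + size)) {
      body.push({ index: { _index: indexName } });
      body.push(doc);
    }
    chunks.push(body);
  }
  return chunks;
}

// Hypothetical usage with the official client:
//   for (const body of toBulkChunks(countryDocs, 500, 'cshapes')) {
//     await client.bulk({ body });
//   }
```

Smaller chunks mean each individual bulk call finishes well inside the client timeout, so you no longer need to know the status of one enormous in-flight request.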

Handle duplicates in batch POST requests to a REST API

The stack
Express.js API server for CRUD operations over data.
MongoDB database.
Mongoose interface to MongoDB for schemas.
The problem
In order to handle duplicates in just one place, I want to do it at the only possible entry point: the API.
Definition: duplicate
A duplicate is an entity which already exists in the database, so the
new POST request is either the same entity with exactly the same data,
or the same entity with updated data.
The API design is meant to take advantage of the new HTTP/2 protocol.
Bulk importers have been written. These programs get the data from a given source, transform it into our specific format, and make POST requests to save it. These importers are designed to handle every entity in parallel.
The API already has a duplication handler which works great when a given entity already exists in the database. The problem comes when the bulk importers make several POST requests for the same entity at the same time and the entity doesn't exist in the database yet.
....POST/1 .databaseCheck.......DataBaseResult=false..........DatabaseWrite
......POST/2 .databaseCheck.......DataBaseResult=false..........DatabaseWrite
........POST/3 .databaseCheck.......DataBaseResult=false..........DatabaseWrite
.....................POST/N .databaseCheck.......DataBaseResult=false..........DatabaseWrite
This situation produces the creation of the same entity several times, because the database checks haven't finished when the rest of the POST requests arrive.
Only if the number of POST requests is big enough will the first write operation have already finished, so that the databaseCheck of the Nth request returns true.
What would be the correct solution to handle this?
If I'm not wrong, what I'm looking for goes by the name of transaction, and I don't know if this is something the database should offer by default, or something I have to implement myself.
Solutions I have already considered:
1. Limit the requests to just one at a time.
This is the simplest solution, but if the API blocks while the bulk importers make several requests, the frontend client would become very slow, and it is meant to be fast and multiplayer. So this is, in fact, not a solution.
2. Special bulk API endpoint for each entity.
If an application needs to make bulk requests, it makes just one huge POST request with all the data as the request body.
This solution doesn't block the API and can handle duplicates very well, but what I don't like is that it goes against the HTTP/2 protocol, where many small requests are desired.
And the problem persists: other future clients may run into it if they don't notice that a bulk endpoint is available. But maybe this is not a problem.
3. Try to use the possible MongoDB transaction implementation
I've read a little bit about this, but I don't know if it would be possible to handle this problem with the MongoDB and Mongoose tools. I've done some searching but haven't found anything, because before trying to insert many documents, I need to generate the data for each document, and that data comes inside each POST request.
4. Drop MongoDB and use a transaction friendly database.
This would have a big cost at this point because the whole stack is already finished and we are close to launch. We aren't afraid of refactoring, but I think the considerations from the 3rd solution would apply here too.
5. Own transactions implementation at the API level?
I've designed a solution that may work for every case, which I call the pool stream.
This is the design:
When a POST request arrives, a timer of a fixed number of milliseconds starts. That amount of time would be big enough to catch several requests and small enough not to cause a noticeable delay.
Inside each chunk of requests, the data is processed, trying to merge duplicates before writing to the database. So if n requests have been caught inside a chunk, n - m (where m <= n) unique candidates are generated. A hash function is applied to each candidate in order to assign the hash result to each request-response pair. Then the candidates are written to the database in parallel, and the current duplicates handler would take care of this at write time.
When the writes for the current chunk finish, the response is sent to each request-response pair of the chunk, and then the next chunk is processed. While a chunk is in the queue waiting for the write operation, the unique-candidates processing could already be running, in order to accelerate the whole process.
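The pool-stream idea described above can be sketched roughly as follows. Everything here is hypothetical: `writeMany` stands in for the actual database write, `keyOf` for the hash/identity function, and the window length is arbitrary:

```javascript
// Requests arriving within `windowMs` are collected into one chunk,
// deduplicated by `keyOf`, written once via `writeMany`, and every
// caller's promise then resolves with its entity's key.
function createPool(writeMany, keyOf, windowMs = 50) {
  let pending = [];
  let timer = null;

  async function flush() {
    const batch = pending;
    pending = [];
    timer = null;
    // Merge duplicates inside the chunk: last write wins per key.
    const unique = new Map();
    for (const { entity } of batch) unique.set(keyOf(entity), entity);
    await writeMany([...unique.values()]);
    // Answer every request-response pair in the chunk.
    for (const { resolve, entity } of batch) resolve(keyOf(entity));
  }

  return function submit(entity) {
    return new Promise((resolve) => {
      pending.push({ entity, resolve });
      if (!timer) timer = setTimeout(flush, windowMs);
    });
  };
}
```

In an Express route, each POST handler would `await submit(req.body)` and then send its response, so concurrent POSTs for the same entity collapse into a single write.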
What do you think?
Thank you.

How to stream data in a Node JS + Mongo DB REST API?

I am developing a REST API in Node.js + MongoDB, handled with Mongoose's middleware, in which one of the methods allows the retrieval of content associated with a certain user.
So far I've been retrieving all of the user's content, but the amount of data is starting to grow, and now I need to stream the data somehow.
The behaviour I want to implement would be for the server to answer the request with a stream of 10-20 items, and then, if the client needs more data, it would need to send another request, which would be answered with the following 10-20 items.
All I can come up with is to answer with those first 10-20 items and then, in case the client needs more data, provide a new (optional) parameter for my method that lets the client send the last item's id, so the server can send back the following 10-20 items.
I know this approach will work, but it feels too raw; there's got to be a cleaner way to implement this behaviour, since it's the kind of behaviour a lot of web applications must implement.
So, my question would be: Do you know of any better way to solve this problem?
Thanks in advance.
Provide the ability to read an offset and a limit from the request, then do something like:
db.collection.find().skip(20).limit(10)
Also, set defaults on APIs you build so that someone can't request a million records at once. Maybe max results is always 200, and if the request doesn't have the above params set, return up to the first 200 results.
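The defaults-and-caps part of that advice can be sketched as a small helper that parses and clamps the query params before they reach the Mongoose query. The parameter names and limits here are arbitrary choices, not fixed conventions:

```javascript
// Clamp paging params from the query string so a client can't request
// a million records at once; defaults apply when params are absent.
function pagingParams(query, { defaultLimit = 20, maxLimit = 200 } = {}) {
  const offset = Math.max(parseInt(query.offset, 10) || 0, 0);
  const limit = Math.min(parseInt(query.limit, 10) || defaultLimit, maxLimit);
  return { offset, limit };
}

// Hypothetical Express route using it with a Mongoose model:
//   app.get('/users/:id/content', async (req, res) => {
//     const { offset, limit } = pagingParams(req.query);
//     const items = await Content.find({ user: req.params.id })
//       .skip(offset).limit(limit);
//     res.json(items);
//   });
```

For very deep pages, the last-id approach from the question (a range query on `_id` instead of `skip`) actually scales better, since `skip` still walks past all skipped documents; it is a legitimate pattern, usually called cursor or keyset pagination.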

Custom in-memory cache

Imagine there's a web service:
Runs on a cluster of servers (nginx/node.js)
All data is stored remotely
Must respond within 20ms
Data that must be read for a response is split like this..
BatchA
Millions of small objects stored in AWS DynamoDB
Updated randomly at random times
Only consistent reads; can't be cached
BatchB
~2,000 records in SQL
Updated rarely, records up to 1KB
Can be cached for up to 60-90s
We can't read them all at once as we don't know which records to fetch from BatchB until we read from BatchA.
Reads from DynamoDB take up to 10ms. If we read BatchB from a remote location, it would leave us no time for calculations, or we would already have timed out.
My current idea is to load all BatchB records into memory of each node (that's only ~2MB). On startup, the system would connect to SQL server and fetch all records and then it would update them every 60 or 90 seconds. The question is what's the best way to do this?
I could simply read them all into a variable (an array) in node.js and then use setTimeout to update the array after 60-90s. But is this the best solution?
Your solution doesn't sound bad. It fits your needs. Go for it.
I suggest keeping two copies of the cache while in the process of updating it from remote location. While the 2MB are being received you've got yourself a partial copy of the data. I would hold on to the old cache until the new data is fully received.
Another approach would be to maintain only one cache set and update it as each record arrives. However, this is more difficult to implement and is error-prone. (For example, you should not forget to delete records from the cache if they are no longer found in the remote location.) This approach conserves memory, but I don't suppose that 2MB is a big deal.
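The two-copies suggestion can be sketched as a double-buffered cache: readers always see a complete snapshot, and a refresh builds the replacement aside before swapping it in. `fetchAll` stands in for the actual SQL query, and the refresh period is configurable:

```javascript
// Double-buffered in-memory cache. `fetchAll` resolves to the full
// record set (here assumed to carry an `id` field); readers are never
// exposed to a partially loaded copy.
function createCache(fetchAll, refreshMs = 60000) {
  let current = new Map();

  async function refresh() {
    const next = new Map(); // build the new copy off to the side
    for (const rec of await fetchAll()) next.set(rec.id, rec);
    current = next; // atomic swap; remotely deleted records vanish too
  }

  const timer = setInterval(refresh, refreshMs);
  return {
    get: (id) => current.get(id),
    refresh, // call once at startup before serving traffic
    stop: () => clearInterval(timer),
  };
}
```

Because the swap is a single reference assignment, in-flight requests on the old Map finish consistently, and the stale-deletion problem mentioned above disappears for free.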