Bulk Data Transfer through REST API - node.js

I have been informed that "REST API is not made / good for Bulk Data Transfer. It's a proven fact." I tried searching Google about this but couldn't find any fruitful answer. Can anyone tell me whether this statement is actually true or not? If it is true, then why?
Note: I am not exposing the bulk data (50 million rows from the database) over the web. I am saving it to the server as JSON (approx. 3 GB file size) and transferring it to another system. I am using Node.js for this purpose. The network is not an issue for transferring the file.

There is nothing wrong with exposing an endpoint that returns huge data.
The concern is more about how you send that data, since memory could be an issue.
Why not stream the data? That way the memory needed at any given time is only the chunk currently being streamed.
Node.js has many ways to pipe data into the response object; you can also consider the JSONStream module from npmjs.org.
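As a rough illustration of the streaming idea (the route and file name below are made up, and the 3 GB export is assumed to already exist on disk):

const express = require('express');
const fs = require('fs');

const app = express();

// Hypothetical endpoint: stream an already-generated JSON file to the client
// instead of reading it into memory first. './bulkdata.json' is a placeholder.
app.get('/bulk-data', (req, res) => {
  res.setHeader('Content-Type', 'application/json');
  const readStream = fs.createReadStream('./bulkdata.json');
  readStream.on('error', () => {
    // If the file cannot be read, just terminate the response.
    res.destroy();
  });
  // pipe() sends the file one chunk at a time, so memory usage stays at
  // roughly one chunk rather than 3 GB.
  readStream.pipe(res);
});

app.listen(3000);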

Related

How to overcome the size limits of MongoDB and, in general, store, send and retrieve large documents?

Currently I work on an application that can send and retrieve arbitrarily large files. In the beginning we decided to use JSON for this because it is quite easy to handle and store. This works until images, videos or larger things in general come in.
With our current approach we run into at least the following problems:
1 MB request size limit of Express (solved)
10 MB file size limit of axios (solved)
16 MB document size limit of MongoDB (no solution currently)
So currently we are trying to overcome the limits of MongoDB, but in general it seems like we are on the wrong path. As files get bigger there will be more and more limits that are harder to overcome, and maybe MongoDB's limit is not solvable at all. Is there a more efficient way to do this than what we currently do?
One thing is left to say: in general we need to reassemble the whole object on the server side to verify that the structure is the one we expect and to hash the whole object. So we have not considered splitting it up so far, but maybe that is the only option left. Even then, how would you send videos or similarly big chunks?
If you need to store files bigger than 16 MB in MongoDB, you can use GridFS.
GridFS works by splitting your file into smaller chunks of data and storing them separately. When the file is needed it gets reassembled and becomes available.
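A minimal sketch with the MongoDB Node.js driver's GridFSBucket; the database name, bucket name and file names are placeholders:

const { MongoClient, GridFSBucket } = require('mongodb');
const fs = require('fs');

async function main() {
  const client = await MongoClient.connect('mongodb://localhost:27017');
  const db = client.db('myapp');
  const bucket = new GridFSBucket(db, { bucketName: 'uploads' });

  // Upload: the file is split into chunks behind the scenes.
  await new Promise((resolve, reject) => {
    fs.createReadStream('./video.mp4')
      .pipe(bucket.openUploadStream('video.mp4'))
      .on('finish', resolve)
      .on('error', reject);
  });

  // Download: the chunks are reassembled into a single readable stream.
  await new Promise((resolve, reject) => {
    bucket.openDownloadStreamByName('video.mp4')
      .pipe(fs.createWriteStream('./video-copy.mp4'))
      .on('finish', resolve)
      .on('error', reject);
  });

  await client.close();
}

main().catch(console.error);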

Node Streams to MySQL

I have to parse large CSVs (approx. 1 GB), map the header to the database columns, and format every row. E.g. the CSV has "Gender": "Male", but my database only accepts enum('M', 'F', 'U').
Since the files are so large, I have to use Node streams to transform the file and then use LOAD DATA INFILE to upload it all at once.
I would like granular control over the inserts, which LOAD DATA INFILE doesn't provide: if a single line has incorrect data, the whole upload fails. I am currently using mysqljs, which doesn't provide an API to check whether the pool has reached queueLimit, and therefore I can't pause the stream reliably.
I am wondering if I can use Apache Kafka or Spark to stream the instructions so that rows are added to the database sequentially. I have skimmed the docs and read some tutorials, but none of them show how to connect them to the database; they are mostly consumer/producer examples.
I know there are multiple ways of solving this problem, but I am very much interested in a way to seamlessly integrate streams with databases. If streams can work with I/O, why not databases? I am pretty sure big companies don't use LOAD DATA INFILE or repeatedly push chunks of data into an array and insert them into the database.
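One way to get that granular control without LOAD DATA INFILE is to let the insert queries apply backpressure to the CSV stream. A rough sketch, assuming the csv-parse package and the mysql2 driver rather than mysqljs; the table, columns and Gender mapping are made up:

const fs = require('fs');
const { parse } = require('csv-parse');
const mysql = require('mysql2/promise');

const genderMap = { Male: 'M', Female: 'F' };   // hypothetical mapping

async function importCsv(path) {
  const pool = mysql.createPool({ host: 'localhost', user: 'app', database: 'app' });
  const parser = fs.createReadStream(path).pipe(parse({ columns: true }));

  let batch = [];
  for await (const row of parser) {            // for await respects backpressure
    batch.push([row.Name, genderMap[row.Gender] || 'U']);
    if (batch.length >= 1000) {
      // The parser is effectively paused while we wait for the insert.
      await pool.query('INSERT INTO people (name, gender) VALUES ?', [batch]);
      batch = [];
    }
  }
  if (batch.length) {
    await pool.query('INSERT INTO people (name, gender) VALUES ?', [batch]);
  }
  await pool.end();
}

importCsv('./big.csv').catch(console.error);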

NodeJS Issue with huge data export

I have a scenario: in the DB, I have a table with a huge number of records (2 million) and I need to export them to XLSX or CSV.
The basic approach I used is to run a query against the DB and put the data into an appropriate file for download.
Problems:
There is a DB timeout that I have set to 150 seconds, which sometimes isn't enough, and I am not sure whether extending the timeout would be a good idea.
There is also a timeout on the Express request, so my HTTP request times out and hits the server a second time (for an unknown reason).
As a solution, I am thinking of using a streaming DB connection, and if I can somehow pipe that into an output stream for the file, it should work.
Basically I need help with the second part: in the stream I would receive records one by one, and at the same time I would like to let the user download the file progressively (this would avoid the request timeout).
I don't think this is a unique problem, but I didn't find the appropriate pieces to put together. Thanks in advance!
If you look in your log, is the query being run more than once?
Does your UI time out before the server even reaches res.end()?
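For what it's worth, here is a rough sketch of that streaming idea using mysqljs' row events and pause()/resume(), so the DB stream follows the speed of the download; the table and columns are placeholders:

const express = require('express');
const mysql = require('mysql');

const app = express();
const pool = mysql.createPool({ host: 'localhost', user: 'app', database: 'app' });

// Hypothetical export endpoint: rows are written to the response as they arrive,
// so the client starts downloading immediately and nothing is buffered in full.
app.get('/export', (req, res) => {
  pool.getConnection((err, connection) => {
    if (err) return res.status(500).end();

    res.setHeader('Content-Type', 'text/csv');
    res.setHeader('Content-Disposition', 'attachment; filename="export.csv"');
    res.write('id,name,email\n');

    connection.query('SELECT id, name, email FROM users')
      .on('result', (row) => {
        // res.write returns false when the client is slower than the DB:
        // pause the connection and resume once the response has drained.
        if (!res.write(`${row.id},${row.name},${row.email}\n`)) {
          connection.pause();
          res.once('drain', () => connection.resume());
        }
      })
      .on('error', () => { connection.release(); res.end(); })
      .on('end', () => { connection.release(); res.end(); });
  });
});

app.listen(3000);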

Using GTFS data, how should I extend it with GTFS-realtime?

I am building an application using GTFS data. I am a bit confused when it comes to GTFS-realtime.
I have stored all the GTFS information in a database (Mongo), and I am able to retrieve the stop times of a specific bus stop.
Now I want to integrate GTFS-realtime information into it. What is the best way to deal with the retrieved information? I am using gtfs-realtime-bindings (the Node.js library) by Google.
I have the following idea:
Store the GTFS-realtime information in a separate database and query it after getting the stop time from GTFS. I can update that database periodically to make sure the realtime info stays up to date.
Also, I know the retrieved data is in protobuf binary format. Should I store it as ASCII, or is there a better way to deal with it?
I couldn't find much information about how to deal with the realtime data, so I hope someone can give me a direction on what to do next.
Thanks!
In your case GTFS-realtime can be treated as "ephemeral" data, and I would go with an object in memory, keyed by stop_id/route_id.
For every request:
Check whether the realtime object contains the ID; if it does, present the realtime data, otherwise load from the database.
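A tiny sketch of that lookup pattern; the shape of the realtime object and the getStopTimesFromDb helper are hypothetical:

// In-memory GTFS-realtime cache keyed by stop_id; a separate polling job
// (not shown) refreshes it from the GTFS-realtime feed periodically.
const realtimeByStop = {};   // e.g. realtimeByStop['1234'] = [{ tripId, delaySec }]

// Hypothetical stand-in for the Mongo query against the static GTFS data.
async function getStopTimesFromDb(stopId) {
  // return db.collection('stop_times').find({ stop_id: stopId }).toArray();
  return [];
}

async function getStopTimes(stopId) {
  const realtime = realtimeByStop[stopId];
  if (realtime && realtime.length > 0) {
    return realtime;                   // fresh realtime data wins
  }
  return getStopTimesFromDb(stopId);   // otherwise fall back to the schedule
}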

Custom in-memory cache

Imagine there's a web service:
Runs on a cluster of servers (nginx/node.js)
All data is stored remotely
Must respond within 20ms
Data that must be read for a response is split like this:
BatchA
Millions of small objects stored in AWS DynamoDB
Updated randomly at random times
Only consistent reads, can't be cached
BatchB
~2,000 records in SQL
Updated rarely, records up to 1KB
Can be cached for up to 60-90s
We can't read them all at once as we don't know which records to fetch from BatchB until we read from BatchA.
A read from DynamoDB takes up to 10 ms. If we then read BatchB from the remote location as well, it would leave us no time for calculations, or we would already have timed out.
My current idea is to load all BatchB records into the memory of each node (that's only ~2 MB). On startup, the system would connect to the SQL server, fetch all records, and then update them every 60 or 90 seconds. The question is: what's the best way to do this?
I could simply read them all into a variable (an array) in Node.js and then use setTimeout to refresh the array every 60-90s. But is that the best solution?
Your solution doesn't sound bad. It fits your needs. Go for it.
I suggest keeping two copies of the cache while you are updating it from the remote location. While the 2 MB are being received, you only have a partial copy of the data, so I would hold on to the old cache until the new data has been fully received.
Another approach would be to maintain only one cache set and update it as each record arrives. However, this is more difficult to implement and is error-prone. (For example, you should not forget to delete records from the cache if they are no longer found in the remote location.) This approach conserves memory, but I don't suppose that 2MB is a big deal.
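A minimal sketch of that double-buffered refresh, assuming the mysql2 driver; the query, table name and record shape are placeholders:

const mysql = require('mysql2/promise');

let batchB = new Map();                 // current cache, read by request handlers

async function loadBatchB(pool) {
  const [rows] = await pool.query('SELECT id, payload FROM batch_b');
  // Build the new copy on the side; readers keep using the old one meanwhile.
  batchB = new Map(rows.map((r) => [r.id, r]));   // single assignment = atomic swap
}

async function start() {
  const pool = mysql.createPool({ host: 'localhost', user: 'app', database: 'app' });
  await loadBatchB(pool);               // initial fill on startup
  setInterval(() => {
    loadBatchB(pool).catch((err) => {
      // On failure, keep serving the stale copy rather than an empty cache.
      console.error('BatchB refresh failed', err);
    });
  }, 60 * 1000);
}

// Request handlers just read from memory, e.g.:
function getBatchBRecord(id) {
  return batchB.get(id);
}

start().catch(console.error);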
