nodejs stream csv data into mongoDB - node.js

I'm new to NodeJS and MongoDB and trying to insert a CSV file into MongoDB.
my first version was to create an array variable and push the data into the array like this
.on('data',(data)=>{array.push(JSON.parse(data))}
then after pushing all the objects into the array I insert it into MongoDB using
TempModel.insertMany(array)
this solution worked great for me in small files and even large ones if I allocate enough memory for nodeJS so the array can store more objects.
but in very large files I get an error
FATAL ERROR: Ineffective mark-compacts heap limit Allocation failed - JavaScript heap out of memory
I am guessing this error occurred because there are too many objects in the array (correct me if I am wrong)
So my new solution was to stream the CSV file and insert every line in it as an object into MongoDB, instead of pushing it into the array.
but when I start the project it stops at the first line and doesn't insert it into the MongoDB.
that's the code I have now.
any ideas on how can I make it work?
It is good to insert millions of objects one by one into the MongoDB
instead of insertMany?
I have created a schema and model in mongoose, then created a read stream and converted the CSV file into objects, and then insert it into MongoDB
const tempSchema = new mongoose.Schema({},{stric:false});
const TempModel = mongoose.model('tempCollection',tempSchema);
fs.createReadStream(req.file.path)
.pipe(csv())
.on('data',(data) => {
TempModel.insertOne(JSON.parse(data));
})
.on('end',()=>{
console.log('finished');
)};

The snippet can be restructured to use stream pipes to control the data flow down to the MongoDB write operations. This will avoid the memory issue and provide a means to batch operations together.
A somewhat complete pseudocode example:
import util from "util";
import streams from "streams";
const tempSchema = new mongoose.Schema({},{stric:false});
const TempModel = mongoose.model('tempCollection',tempSchema);
// Promisify waiting for the file to be parsed and stored in mongodb
await util.promisify(streams.pipeline)(
fs.createReadStream(req.file.path),
csv(),
// Create a writeable to piggyback streams built-in batch processing logic
new streams.Writable({
// bulkWrite() supports at most 1000 ops/call. Consequently, we do not need
// to load/parse additional rows into memory when this queue is full
highWaterMark: 1000,
writev: async (chunks, next) => {
try {
// Bulk write documents to MongoDB
await TempModel.bulkWrite(chunks.map(({chunk: data}) => ({
insertOne: JSON.parse(data)
})), {
ordered: false
});
// Signal completion
next();
}
catch(error) {
// Propagate errors
next(error)
}
}
})
);
console.log('finished');

Related

Stream large JSON from REST API using NodeJS/ExpressJS

I have to return a large JSON, resulting from a query to MongoDB, from a REST API server build-up using ExpressJS. This JSON has to be converted into .csv so the client can directly save the resulting CSV file. I know that the best solution is to use NodeJS streams and pipe. Could anyone suggest to me a working example? Thanks.
Typically when wanting to parse JSON in Node its fairly simple. In the past I would do something like the following.
const fs = require('fs');
const rawdata = fs.readFileSync('file.json');
const data = JSON.parse(rawdata);
Or even simpler with a require statement like this
const data = require('./file.json');
Both of these work great with small or even moderate size files, but what if you need to parse a really large JSON file, one with millions of lines, reading the entire file into memory is no longer a great option.
Because of this I needed a way to “Stream” the JSON and process as it went. There is a nice module named ‘stream-json’ that does exactly what I wanted.
With stream-json, we can use the NodeJS file stream to process our large data file in chucks.
const StreamArray = require( 'stream-json/streamers/StreamArray');
const fs = require('fs');
const jsonStream = StreamArray.withParser();
//internal Node readable stream option, pipe to stream-json to convert it for us
fs.createReadStream('file.json').pipe(jsonStream.input);
//You'll get json objects here
//Key is the array-index here
jsonStream.on('data', ({key, value}) => {
console.log(key, value);
});
jsonStream.on('end', ({key, value}) => {
console.log('All Done');
});
Now our data can process without running out of memory, however in the use case I was working on, inside the stream I had an asynchronous process. Because of this, I still was consuming huge amounts of memory as this just up a very large amount of unresolved promises to keep in memory until they completed.
To solve this I had to also use a custom Writeable stream like this.
const StreamArray = require( 'stream-json/streamers/StreamArray');
const {Writable} = require('stream');
const fs = require('fs');
const fileStream = fs.createReadStream('file.json');
const jsonStream = StreamArray.withParser();
const processingStream = new Writable({
write({key, value}, encoding, callback) {
//some async operations
setTimeout(() => {
console.log(key,value);
//Runs one at a time, need to use a callback for that part to work
callback();
}, 1000);
},
//Don't skip this, as we need to operate with objects, not buffers
objectMode: true
});
//Pipe the streams as follows
fileStream.pipe(jsonStream.input);
jsonStream.pipe(processingStream);
//So we're waiting for the 'finish' event when everything is done.
processingStream.on('finish', () => console.log('All done' ));
The Writeable stream also allows each asynchronous process to complete and the promises to resolve before continuing on to the next, thus avoiding the memory backup.
This stack overflow is where I got the examples for this post.
Parse large JSON file in Nodejs and handle each object independently
Also note another thing I learned in this process is if you want to start Node with more than the default amount of RAM you can use the following command.
node --max-old-space-size=4096 file.js
By default the memory limit in Node.js is 512 mb, to solve this issue you need to increase the memory limit using command –max-old-space-size. This can be used to avoid the memory limits within node. The command above would give Node 4GB of RAM to use.

Load 1 million records from a file and save to PSQL database

I have a file of 1 million records in which I have to pass one by one record to elastic search and save the resulted data into the database.
But the issue is, it is taking very long time to do that as the records are streaming one by one to elasticsearch then it saves the data into PSQL database.
I want some suggestions that how can I improve on this or should use some other tools.
Right now I am using Nodejs with some packages:
I upload the file in nodejs application and convert it to json file using
const csv=require('csvtojson')
I use
const StreamArray = require('stream-json/streamers/StreamArray');
const {Writable} = require('stream');
For reading json and parsing it through these packages using stream as the file is too big.
I use this code
const fileStream = fs.createReadStream(this.fileName);
const jsonStream = StreamArray.withParser();
const incomingThis = this;
const processingStream = new Writable({
write({key, value}, encoding, callback) {
incomingThis.recordParser(value, (val, data) => { // pass the data to elasticsearch to get search data
incomingThis.processQueue(data); // save the data to the PSQL database
callback();
});
},
//Don't skip this, as we need to operate with objects, not buffers
objectMode: true
});
//Pipe the streams as follows
fileStream.pipe(jsonStream.input);
jsonStream.pipe(processingStream);
//So we're waiting for the 'finish' event when everything is done.
processingStream.on('finish', async () => {
console.log('stream end');
const statistics = new Statistics(jobId);
await statistics.update(); // update the job table for completion of data
});
Please suggest How can I improve on this to parse 1 million records file in couple of hours rather than days or minimum less time.
I am open to use any other tools too like redis, spark if these will help me out.
Thanks.
Instead of one by one pressing from stream . use batch approach ( create multiple batches ) to get data in elastic and save in batch .

Stream MongoDB results to Express response

I am trying to do this without and 3rd party dependencies, as I don't feel they should be needed. Please note, due to architect ruling we have to use MongoDB native, and not Mongoose (don't ask!).
Basically I have a getAll function, that will return all documents (based on passed in query) from a single collection.
The number of documents, could easily hit multiple thousand, and thus, I want to stream them out as I receive them.
I have the following code:
db.collection('documents')
.find(query)
.stream({
transform: (result) => {
return JSON.stringify(new Document(result));
}
})
.pipe(res);
Which kind of works, except it destroys the array that the documents should sit in, and it responds {...}{...}
There has to be a way of doing this right?
What you can do is to write explicitly the start of the array res.write("[") before requesting the database, put a ,, on every json stringified object and on the stream end write the end of the array res.write("]") this can work. But it is not advisable!
JSON.stringify is a very slow operation, you should try to use it as less as possible.
A better approach will be to go with a streamable JSON.stringify implementation like json-stream-stringify
const JsonStreamStringify = require('json-stream-stringify');
app.get('/api/users', (req, res, next) => {
const stream = db.collection('documents').find().stream();
new JsonStreamStringify(stream).pipe(res);
);
Be aware of using pipe in production, pipe does not destroy the source or destination stream when errors. It is advisable to go for pump or pipeline in production to avoid memory leaks.

Difficulty processing CSV file, browser timeout

I was asked to import a csv file from a server daily and parse the respective header to the appropriate fields in mongoose.
My first idea was to make it to run automatically with a scheduler using the cron module.
const CronJob = require('cron').CronJob;
const fs = require("fs");
const csv = require("fast-csv")
new CronJob('30 2 * * *', async function() {
await parseCSV();
this.stop();
}, function() {
this.start()
}, true);
Next, the parseCSV() function code is as follow:
(I have simplify some of the data)
function parseCSV() {
let buffer = [];
let stream = fs.createReadStream("data.csv");
csv.fromStream(stream, {headers:
[
"lot", "order", "cwotdt"
]
, trim:true})
.on("data", async (data) =>{
let data = { "order": data.order, "lot": data.lot, "date": data.cwotdt};
// Only add product that fulfill the following condition
if (data.cwotdt !== "000000"){
let product = {"order": data.order, "lot": data.lot}
// Check whether product exist in database or not
await db.Product.find(product, function(err, foundProduct){
if(foundProduct && foundProduct.length !== 0){
console.log("Product exists")
} else{
buffer.push(product);
console.log("Product not exists")
}
})
}
})
.on("end", function(){
db.Product.find({}, function(err, productAvailable){
// Check whether database exists or not
if(productAvailable.length !== 0){
// console.log("Database Exists");
// Add subsequent onward
db.Product.insertMany(buffer)
buffer = [];
} else{
// Add first time
db.Product.insertMany(buffer)
buffer = [];
}
})
});
}
It is not a problem if it's just a few line of rows in the csv file but just only reaching 2k rows, I encountered a problem. The culprit is due to the if condition checking when listening to the event handler on, it needs to check every single row to see whether the database contains the data already or not.
The reason I'm doing this is that the csv file will have new data added into it and I need to add all the data for the first time if database is empty or look into every single row and only add those new data into mongoose.
The 1st approach I did from here (as in the code),was using async/await to make sure that all the datas have been read before proceeding to the event handler end. This helps but I see from time to time (with mongoose.set("debug", true);), some data are being queried twice, which I have no idea why.
The 2nd approach was not to use the async/await feature, this has some downside since the data was not fully queried, it proceeded straight to the event handler end and then insertMany some of the datas which were able to get pushed into the buffer.
If i stick with the current approach, it is not an issue, but the query will take 1 to 2 minutes, not to mention even more if the database keeps growing. So, during those few minutes of querying, the event queue got blocked and therefore when sending request to the server, the server time out.
I used stream.pause() and stream.resume() before this code but I can't get it to work as it just jump straight to the end event handler first. This cause the buffer to be empty every single time since end event handler runs before the on event handler
I cant' remember the links that I used but the fundamentals that I got from is through this.
Import CSV Using Mongoose Schema
I saw these threads:
Insert a large csv file, 200'000 rows+, into MongoDB in NodeJS
Can't populate big chunk of data to mongodb using Node.js
to be similar to what I need but it's a bit too complicated for me to understand what is going on. Seems like using socket or a child process maybe? Furthermore, I still need to check conditions before adding into the buffer
Anyone care to guide me on this?
Edit: await is removed from console.log as it is not asynchronous
Forking a child process approach:
When web service got a request of csv data file save it somewhere in app
Fork a child process -> child process example
Pass the file url to the child_process to run the insert checks
When child process finish processing the csv file, delete the file
Like what Joe said, indexing the DB would speed up the processing time by a lot when there are lots(millions) of tuples.
If you create an index on order and lot. The query should be very fast.
db.Product.createIndex( { order: 1, lot: 1 }
Note: This is a compound index and may not be the ideal solution. Index strategies
Also, your await on console.log is weird. That may be causing your timing issues. console.log is not async. Additionally the function is not marked async
// removing await from console.log
let product = {"order": data.order, "lot": data.lot}
// Check whether product exist in database or not
await db.Product.find(product, function(err, foundProduct){
if(foundProduct && foundProduct.length !== 0){
console.log("Product exists")
} else{
buffer.push(product);
console.log("Product not exists")
}
})
I would try with removing the await on console.log (that may be a red herring if console.log is for stackoverflow and hiding the actual async method.) However, be sure to mark the function with async if that is the case.
Lastly, if the problem still exists. I may look into a 2 tiered approach.
Insert all lines from the CSV file into a mongo collection.
Process that mongo collection after the CSV has been parsed. Removing the CSV from the equation.

In Meteor, how do I get a node read stream from a collection's find curser?

In Meteor, on the server side, I want to use the .find() function on a Collection and then get a Node ReadStream interface from the curser that is returned. I've tried using .stream() on the curser as described in the mongoDB docs Seen Here. However I get the error "Object [object Object] has no method 'stream'" So it looks like Meteor collections don't have this option. Is there a way to get a stream from a Meteor Collection's curser?
I am trying to export some data to CSV and I want to pipe the data directly from the collections stream into a CSV parser and then into the response going back to the user. I am able to get the response stream from the Router package we are using, and it's all working except for getting a stream from the collection. Fetching the array from the find to push it into the stream manually would defeat the purpose of a stream since it would put everything in memory. I guess my other option is to use a foreach on the collection and push the rows into the stream one by one, but this seems dirty when I could pipe the stream directly through the parser with a transform on it.
Here's some sample code of what I am trying to do:
response.writeHead(200,{'content-type':'text/csv'});
// Set up a future
var fut = new Future();
var users = Users.find({}).stream();
CSV().from(users)
.to(response)
.on('end', function(count){
log.verbose('finished csv export');
response.end();
fut.ret();
});
return fut.wait();
Have you tried creating a custom function and piping to it?
Though this would only work if Users.find() supported .pipe()(again, only if Users.find inherited from node.js streamble object).
Kind of like
var stream = require('stream')
var util = require('util')
streamreader = function (){
stream.Writable.call(this)
this.end = function() {
console.log(this.data) //this.data contains raw data in a string so do what you need to to make it usable, i.e, do a split on ',' or something or whatever it is you need to make it usable
db.close()
})
}
util.inherits(streamreader,stream.Writeable)
stream.prototype._write = function (chunk, encoding, callback) {
this.data = this.data + chunk.toString('utf8')
callback()
}
Users.find({}).pipe(new streamReader())

Resources