Stream MongoDB results to Express response - node.js

I am trying to do this without any 3rd party dependencies, as I don't feel they should be needed. Please note, due to an architect's ruling we have to use the MongoDB native driver, and not Mongoose (don't ask!).
Basically I have a getAll function that will return all documents (based on a passed-in query) from a single collection.
The number of documents could easily hit multiple thousands, and thus I want to stream them out as I receive them.
I have the following code:
db.collection('documents')
    .find(query)
    .stream({
        transform: (result) => {
            return JSON.stringify(new Document(result));
        }
    })
    .pipe(res);
Which kind of works, except that it drops the array the documents should sit in, and the response comes back as {...}{...}
There has to be a way of doing this, right?

What you can do is write the start of the array explicitly with res.write("[") before querying the database, append a , after every JSON-stringified object, and on the stream's end write the closing bracket with res.write("]"). This can work, but it is not advisable!
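A rough sketch of that manual approach, assuming the native driver's cursor stream and the Document wrapper from the question (not production code):

const cursor = db.collection('documents').find(query).stream();
let first = true;

res.setHeader('Content-Type', 'application/json');
res.write('[');
cursor.on('data', (doc) => {
    // prefix a comma before every document except the first
    res.write((first ? '' : ',') + JSON.stringify(new Document(doc)));
    first = false;
});
cursor.on('end', () => {
    res.write(']');
    res.end();
});
cursor.on('error', (err) => {
    res.destroy(err); // don't leave the response hanging on a cursor error
});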
JSON.stringify is a very slow operation; you should try to use it as little as possible.
A better approach would be to go with a streamable JSON.stringify implementation like json-stream-stringify:
const JsonStreamStringify = require('json-stream-stringify');

app.get('/api/users', (req, res, next) => {
    const stream = db.collection('documents').find().stream();
    new JsonStreamStringify(stream).pipe(res);
});
Be aware of using pipe in production: pipe does not destroy the source or destination stream when an error occurs. It is advisable to go for pump or pipeline in production to avoid memory leaks.
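For example, with the built-in stream.pipeline (available since Node 10), which tears down both streams on error — a sketch along those lines:

const { pipeline } = require('stream');
const JsonStreamStringify = require('json-stream-stringify');

app.get('/api/users', (req, res, next) => {
    const cursor = db.collection('documents').find().stream();
    pipeline(new JsonStreamStringify(cursor), res, (err) => {
        if (err) next(err); // both streams have already been cleaned up here
    });
});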

Related

Difficulty processing CSV file, browser timeout

I was asked to import a CSV file from a server daily and parse the respective headers to the appropriate fields in mongoose.
My first idea was to make it run automatically with a scheduler, using the cron module.
const CronJob = require('cron').CronJob;
const fs = require("fs");
const csv = require("fast-csv");

new CronJob('30 2 * * *', async function() {
    await parseCSV();
    this.stop();
}, function() {
    this.start();
}, true);
Next, the parseCSV() function code is as follows:
(I have simplified some of the data)
function parseCSV() {
    let buffer = [];
    let stream = fs.createReadStream("data.csv");

    csv.fromStream(stream, { headers: ["lot", "order", "cwotdt"], trim: true })
        .on("data", async (data) => {
            let row = { "order": data.order, "lot": data.lot, "date": data.cwotdt };
            // Only add products that fulfill the following condition
            if (data.cwotdt !== "000000") {
                let product = { "order": data.order, "lot": data.lot };
                // Check whether the product exists in the database or not
                await db.Product.find(product, function(err, foundProduct) {
                    if (foundProduct && foundProduct.length !== 0) {
                        console.log("Product exists");
                    } else {
                        buffer.push(product);
                        console.log("Product not exists");
                    }
                });
            }
        })
        .on("end", function() {
            db.Product.find({}, function(err, productAvailable) {
                // Check whether the database is empty or not
                if (productAvailable.length !== 0) {
                    // Add subsequent records
                    db.Product.insertMany(buffer);
                    buffer = [];
                } else {
                    // Add for the first time
                    db.Product.insertMany(buffer);
                    buffer = [];
                }
            });
        });
}
It is not a problem when there are only a few rows in the CSV file, but at just 2k rows I ran into a problem. The culprit is the if-condition check inside the data event handler: it needs to query the database for every single row to see whether the data is already there or not.
The reason I'm doing this is that the CSV file will keep having new data added to it, and I need to add all the data the first time if the database is empty, or else look at every single row and only add the new ones into mongoose.
The 1st approach I tried (as in the code) was using async/await to make sure all the data had been read before proceeding to the end event handler. This helps, but I see from time to time (with mongoose.set("debug", true);) that some data is queried twice, and I have no idea why.
The 2nd approach was not to use async/await at all. This has a downside: the data was not fully queried before execution jumped straight to the end event handler, so insertMany only saw the part of the data that had made it into the buffer.
If I stick with the current approach it is not an issue, but the querying takes 1 to 2 minutes, and more as the database keeps growing. During those few minutes of querying the event loop is blocked, so requests sent to the server time out.
I used stream.pause() and stream.resume() before this code, but I can't get it to work; it just jumps straight to the end event handler first. This causes the buffer to be empty every single time, since the end handler runs before the data handler.
I can't remember the links that I used, but the fundamentals I followed came from this:
Import CSV Using Mongoose Schema
I saw these threads:
Insert a large csv file, 200'000 rows+, into MongoDB in NodeJS
Can't populate big chunk of data to mongodb using Node.js
to be similar to what I need, but they are a bit too complicated for me to understand what is going on. It seems to involve sockets or a child process, maybe? Furthermore, I still need to check conditions before adding into the buffer.
Anyone care to guide me on this?
Edit: await is removed from console.log as it is not asynchronous
Forking a child process approach (a minimal sketch follows these steps):
When the web service gets a request with a CSV data file, save it somewhere in the app
Fork a child process -> child process example
Pass the file url to the child_process to run the insert checks
When the child process has finished processing the CSV file, delete the file
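A minimal sketch of that flow, assuming a hypothetical worker script csvWorker.js that holds the existing parseCSV logic (the file paths and message shapes here are made up for illustration):

// in the web service
const { fork } = require("child_process");
const fs = require("fs");

function processUpload(filePath) {
    const worker = fork("./csvWorker.js"); // hypothetical worker script; runs in its own process, so the main event loop stays free
    worker.send({ file: filePath });
    worker.on("message", (msg) => {
        if (msg.done) {
            fs.unlink(filePath, () => {}); // delete the file once the worker has finished
        }
    });
}

// in csvWorker.js
process.on("message", async (msg) => {
    await parseCSV(msg.file); // the existing CSV/Mongo logic
    process.send({ done: true });
    process.exit(0);
});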
Like Joe said, indexing the DB would speed up the processing time by a lot when there are lots (millions) of tuples.
If you create an index on order and lot, the query should be very fast:
db.Product.createIndex({ order: 1, lot: 1 })
Note: This is a compound index and may not be the ideal solution. Index strategies
Also, your await on console.log is weird; that may be causing your timing issues. console.log is not async. Additionally, the surrounding function is not marked async.
// removing await from console.log
let product = { "order": data.order, "lot": data.lot };
// Check whether the product exists in the database or not
await db.Product.find(product, function(err, foundProduct) {
    if (foundProduct && foundProduct.length !== 0) {
        console.log("Product exists");
    } else {
        buffer.push(product);
        console.log("Product not exists");
    }
});
I would try removing the await on console.log (that may be a red herring if the console.log is only there for Stack Overflow and is hiding the actual async method). However, be sure to mark the function as async if that is the case.
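For what it's worth, a rough sketch of how that block could await the Mongoose query in its promise form (no callback) inside an async data handler — an assumption, not something tested against your setup; note that the CSV stream itself still will not wait for this handler to finish, which is part of the original problem:

.on("data", async (data) => {
    if (data.cwotdt !== "000000") {
        const product = { order: data.order, lot: data.lot };
        // Mongoose returns a promise when no callback is passed
        const found = await db.Product.find(product).exec();
        if (found.length === 0) {
            buffer.push(product);
        }
    }
})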
Lastly, if the problem still exists, I would look into a two-tiered approach (sketched below):
Insert all lines from the CSV file into a mongo collection.
Process that mongo collection after the CSV has been parsed, removing the CSV from the equation.
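A sketch of what that could look like, assuming a hypothetical RawCsvRow staging model and a bulk upsert into Product keyed on order and lot:

const rows = [];
csv.fromStream(fs.createReadStream("data.csv"), { headers: ["lot", "order", "cwotdt"], trim: true })
    .on("data", (row) => rows.push(row))
    .on("end", async () => {
        // tier 1: dump every CSV row into a staging collection as-is
        await db.RawCsvRow.insertMany(rows); // RawCsvRow is a hypothetical staging model
        // tier 2: one bulk upsert into Product, letting MongoDB skip rows that already exist
        const ops = rows
            .filter((r) => r.cwotdt !== "000000")
            .map((r) => ({
                updateOne: {
                    filter: { order: r.order, lot: r.lot },
                    update: { $setOnInsert: { order: r.order, lot: r.lot } },
                    upsert: true
                }
            }));
        if (ops.length) {
            await db.Product.bulkWrite(ops);
        }
    });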

How do I make a large but unknown number of REST http calls in nodejs?

I have an OrientDB database. I want to use nodejs with RESTful calls to create a large number of records. I need to get the #rid of each for some later processing.
My pseudo code is:
for each record
    write.to.db(record)
    when the async of write.to.db() finishes
        process based on #rid
carryon()
I have landed in serious callback hell from this. The version that was closest used a tail recursion in the .then function to write the next record to the db. However, I couldn't carry on with the rest of the processing.
A final constraint is that I am behind a corporate proxy and cannot use any other packages without going through the network administrator, so using the native nodejs packages is essential.
Any suggestions?
With a completion callback, the general design pattern for this type of problem makes use of a local function for doing each write:
var records = ....; // array of records to write
var index = 0;

function writeNext(r) {
    write.to.db(r, function(err) {
        if (err) {
            // error handling
        } else {
            ++index;
            if (index < records.length) {
                writeNext(records[index]);
            }
        }
    });
}

writeNext(records[0]);
The key here is that you can't use synchronous iterators like .forEach() because they won't iterate one at a time and wait for completion. Instead, you do your own iteration.
If your write function returns a promise, you can use the .reduce() pattern that is common for iterating an array.
var records = ...; // some array of records to write

records.reduce(function(p, r) {
    return p.then(function() {
        return write.to.db(r);
    });
}, Promise.resolve()).then(function() {
    // all done here
}, function(err) {
    // error here
});
This solution chains promises together, waiting for each one to resolve before executing the next save.
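On newer Node versions the same sequencing can also be written with async/await and no extra packages — a sketch, assuming write.to.db returns a promise that resolves with the #rid:

async function writeAll(records) {
    const rids = [];
    for (const record of records) {
        // each write waits for the previous one to finish
        const rid = await write.to.db(record);
        rids.push(rid);
    }
    return rids; // carry on with the rest of the processing here
}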
It's kinda hard to tell which function would be best for your scenario w/o more detail, but I almost always use asyncjs for this kind of thing.
From what you say, one way to do it would be with async.map:
var recordsToCreate = [...];

function functionThatCallsTheApi(record, cb) {
    // do the api call, then call cb(null, rid)
}

async.map(recordsToCreate, functionThatCallsTheApi, function(err, results) {
    // err will be set if anything failed in any function
    // results will be an array of the rids
});
You can also check out the other functions it provides to enable throttling, which is probably a good idea.
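For example, async.mapLimit caps the number of in-flight calls — a sketch, with the concurrency of 10 picked arbitrarily:

// at most 10 API calls are in flight at any time
async.mapLimit(recordsToCreate, 10, functionThatCallsTheApi, function(err, results) {
    // results is still an array of rids, in the same order as recordsToCreate
});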

Complex sequencing of promises - nested

After a lot of googling I have not been able to confirm the correct approach to this problem. The following code runs as expected but I have a grave feeling that I am not approaching this in the correct way, and I am setting myself up for problems.
The following code is initiated by the main app.js file and is passed a location to start loading XML files from and processing into MongoDB:
exports.processProfiles = function(path) {
    var deferrer = q.defer();

    q(dataService.deleteProfiles()) // simple mongodb call to empty the Profiles collection
        .then(function(deleteResult) {
            return loadFilenames(path); // method to load all filenames in the given path using fs
        })
        .then(function(filenames) {
            // now we have all the file names lets load and save
            filenames.forEach(function(filename) {
                // Here is where i think the problem is!
                // kick off another promise chain for the dynamically sized array of files to process
                q(loadFileContent(path, filename)) // first we load the data in the file
                    .then(function(inboundFile) {
                        // then parse XML structure to my new shiny JSON structure
                        // and ask Mongo to store it for me
                        return dataService.createProfile(processProfileXML(filename, inboundFile));
                    })
                    .done(function(result) {
                        console.log(result);
                    });
            });
        })
        .catch(function(err) {
            deferrer.reject('Unable to Process Profile records : ' + err);
        })
        .done(function() {
            deferrer.resolve('Profile Processing Completed');
        });

    return deferrer.promise;
}
Whilst this code works, these are my main concerns, which I have not been able to resolve on my own after a few hours of Google and reading.
1) Is this blocking? The readout to the console is difficult to interpret, so I can't tell whether this is running asynchronously as I want it to - I think it is, but advice on whether I am doing something fundamentally wrong would be great.
2) Is having a nested promise a bad idea? Should I be linking it to the outer promise? I have tried, but could not get anything to compile or run.
I haven't used Q in a really long time, but I think what you'd need to do is let it know you're about to hand back an array of promises that all need to be satisfied before moving on.
Additionally, as you're waiting for multiple promises in one section of code, rather than nesting further, throw the 'set' of promises back up once they're all satisfied.
q(dataService.deleteProfiles()) // simple mongodb call to empty the Profiles collection
    .then(function (deleteResult) {
        return loadFilenames(path); // method to load all filenames in the given path using fs
    })
    .then(function (filenames) {
        return q.all(
            filenames.map(function (filename) {
                return q(loadFileContent(path, filename))
                    .then(function (inboundFile) {
                        /* Do stuff with your file contents */
                        return dataService.createProfile(processProfileXML(filename, inboundFile));
                    });
            })
        );
    })
    .then(function (resultsOfLoadFileContentsPromises) {
        console.log('I did stuff with all the things');
    })
    .catch(function (err) { /* handle the error */ });
What you have is not 'blocking'. But really what you're doing with promises is moving things into new 'block'-ing sections. The more blocks you have, the more async-ish your code will appear. If nothing else is running apart from this promise, it will still appear procedural.
But inner promises must still resolve before the parent promises resolve thereafter.
Inner promises like yours aren't inherently bad; personally I break them out into separate files to make them easier to reason about, but I wouldn't call them 'bad' unless there's no need for the inner promise to exist. However, where possible (and in your example here) I've adjusted things so that the next set of promises is thrown back up for a new section to deal with the data once it has been fetched.
(I'm not great with Q though, this code will probably require a little further tweaking).
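For comparison, a sketch of the same shape using native Promises instead of Q, assuming loadFilenames, loadFileContent and the other helpers all return promises:

dataService.deleteProfiles()
    .then(function () {
        return loadFilenames(path);
    })
    .then(function (filenames) {
        // Promise.all resolves once every file has been loaded, parsed and saved
        return Promise.all(filenames.map(function (filename) {
            return loadFileContent(path, filename).then(function (inboundFile) {
                return dataService.createProfile(processProfileXML(filename, inboundFile));
            });
        }));
    })
    .then(function (results) {
        return 'Profile Processing Completed';
    })
    .catch(function (err) {
        throw new Error('Unable to Process Profile records : ' + err);
    });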

Is it considered bad practice to manipulate a queried database document before sending to the client in Mongoose?

So I spent too long trying to figure out how to manipulate a returned database document (using mongoose) with transforms and virtuals, but for my purposes those aren't options. The behaviour I desire is very similar to that of a transform (in which I delete a property), but I only want to delete the property from the returned document IFF it satisfies a requirement calculated using the req.session.user/req.user object (I'm using PassportJS, but any equivalent session user suffices). Obviously there is no access to the request object in a virtual or transform, so I can't do the calculation.
Then it dawned on me that I could just query normally and manipulate the returned object in the callback before I send it to the client. I could put it in a middleware function that looks nice, but something tells me this is a hacky thing to do: I'm presenting an api to the client that does not reflect the data stored/retrieved directly from the database. It may also clutter up my route configuration if I have middleware like this all over, making the code harder to maintain. Below is an example of what the manipulation looks like:
app.route('/api/items/:id').get(manipulateItem, sendItem);
app.param('id', findUniqueItem);

function findUniqueItem(req, res, next, id) {
    Item.findUniqueById(id, function(err, item) {
        if (!err) { req.itemFound = item; }
        next();
    });
}

function manipulateItem(req, res, next) {
    if (req.itemFound.people.indexOf(req.user) === -1) {
        req.itemFound.userIsInPeopleArray = false;
    } else {
        req.itemFound.userIsInPeopleArray = true;
    }
    delete req.itemFound.people;
    next();
}

function sendItem(req, res, next) {
    res.json(req.itemFound);
}
I feel like this is a workaround to a problem with a simpler solution, but I'm not sure what that solution is.
There's nothing hacky about the act of modifying it.
It's all a matter of when you modify it.
For toy servers, and learning projects, the answer is whenever you want.
In production environments, you want to do your transform on your way out of your system and into the next system (the next system might be the end user; it might be another server; it might be another big block of functionality in your own server that shouldn't have access to more information than it needs to do its job).
getItemsFromSomewhere()
    .then(transformToTypeICanUse)
    .then(filterBasedOnMyExpectations)
    .then(doOperations)
    .then(transformToTypeIPromisedYou)
    .then(outputToNextSystem);
That example might not be super-helpful in terms of an actual how, but that's sort of the point.
As you can see, you could link that system of events up to another system of events (that does its own transform to its own data-structure, does its own filtering/mapping, transforms that data into whatever its API promises, and passes it along to the next system, and eventually out to the end user).
I think part of the sense of "hacking" comes from bolting the result of the async process onto req, where req gets injected from step to step, through the middleware.
That said:
function eq(a) {
    return function (b) { return a === b; };
}

function makeOutputObject(inputObject, personWasFound) {
    // return whatever you want
}

var personFound = req.itemFound.people.some(eq(req.user));
var outputObject = makeOutputObject(req.itemFound, personFound);
Now you aren't using the actual delete keyword, or modifying the call-to-call state of that itemFound object.
You're separating your view-based logic from your app-based logic, but without the formal barriers (can always be added later, if they're needed).
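For illustration, one possible shape for makeOutputObject, assuming the goal is the same as the original middleware (flag whether the user is in the people array and keep that array out of the response); the field names besides userIsInPeopleArray are placeholders:

function makeOutputObject(inputObject, personWasFound) {
    // build a fresh response object instead of mutating the mongoose document
    return {
        id: inputObject._id, // assumes the usual _id field
        userIsInPeopleArray: personWasFound
        // copy over whichever other fields the client is allowed to see,
        // and simply never include inputObject.people
    };
}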

In Meteor, how do I get a node read stream from a collection's find cursor?

In Meteor, on the server side, I want to use the .find() function on a Collection and then get a Node ReadStream interface from the cursor that is returned. I've tried using .stream() on the cursor as described in the mongoDB docs seen here. However, I get the error "Object [object Object] has no method 'stream'", so it looks like Meteor collections don't have this option. Is there a way to get a stream from a Meteor Collection's cursor?
I am trying to export some data to CSV, and I want to pipe the data directly from the collection's stream into a CSV parser and then into the response going back to the user. I am able to get the response stream from the Router package we are using, and it's all working except for getting a stream from the collection. Fetching the array from the find and pushing it into the stream manually would defeat the purpose of a stream, since it would put everything in memory. I guess my other option is to use a forEach on the collection and push the rows into the stream one by one, but this seems dirty when I could pipe the stream directly through the parser with a transform on it.
Here's some sample code of what I am trying to do:
var Future = Npm.require('fibers/future'); // fibers Future, available on the Meteor server

response.writeHead(200, {'content-type': 'text/csv'});

// Set up a future
var fut = new Future();

var users = Users.find({}).stream();

CSV().from(users)
    .to(response)
    .on('end', function(count) {
        log.verbose('finished csv export');
        response.end();
        fut.ret();
    });

return fut.wait();
Have you tried creating a custom writable stream and piping to it? Though this would only work if Users.find() supported .pipe() (again, only if the cursor it returns inherits from a node.js streamable object).
Kind of like this:
var stream = require('stream');
var util = require('util');

function StreamReader() {
    stream.Writable.call(this);
    this.data = '';
    this.on('finish', function() {
        // this.data contains the raw data in a string, so do what you need to make it usable,
        // i.e. do a split on ',' or whatever it is you need
        console.log(this.data);
        db.close();
    });
}

util.inherits(StreamReader, stream.Writable);

StreamReader.prototype._write = function (chunk, encoding, callback) {
    this.data = this.data + chunk.toString('utf8');
    callback();
};

Users.find({}).pipe(new StreamReader());
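If .pipe() turns out not to exist on the cursor, here is a sketch of the forEach fallback the question mentions: wrap the cursor in a node Readable (object mode) and push each document into it. Note this drives the whole cursor eagerly, so documents queue up in the stream's buffer, and csvTransform in the usage comment is just a placeholder for whatever CSV transform stream is in use:

var Readable = require('stream').Readable;

function cursorToStream(cursor) {
    var rs = new Readable({ objectMode: true });
    rs._read = function () {}; // pushing is driven by the forEach below, not by reads
    cursor.forEach(function (doc) {
        rs.push(doc);
    });
    rs.push(null); // no more documents: signal end-of-stream
    return rs;
}

// e.g. cursorToStream(Users.find({})).pipe(csvTransform).pipe(response);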
