Node.js: Processing a stream without running out of memory - node.js

I'm trying to read a giant logfile (250,000 lines), parse each line into a JSON object, and insert each JSON object into CouchDB for analytics.
I'm trying to do this by creating a buffered stream that will process each chunk separately, but I always run out of memory after about 300 lines. It seems like using buffered streams and util.pump should avoid this, but apparently not.
(Perhaps there are better tools for this than node.js and CouchDB, but I'm interested in learning how to do this kind of file processing in node.js and think it should be possible.)
CoffeeScript below, JavaScript here: https://gist.github.com/5a89d3590f0a9ca62a23
fs = require 'fs'
util = require('util')
BufferStream = require('bufferstream')
files = [
  "logfile1",
]

files.forEach (file) ->
  stream = new BufferStream({encoding: 'utf8', size: 'flexible'})
  stream.split("\n")
  stream.on("split", (chunk, token) ->
    line = chunk.toString()
    # parse line into JSON and insert in database
  )
  util.pump(fs.createReadStream(file, {encoding: 'utf8'}), stream)

Maybe this helps:
Memory leak when using streams in Node.js?
Try using pipe() to solve it.
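For what it's worth, here is a minimal sketch of how that could look with pipe() and streams2-style backpressure instead of util.pump. The insertLine(doc, callback) helper is purely hypothetical and stands in for your CouchDB insert; the point is that calling back only once the insert has finished is what keeps the file read throttled:

// Sketch only: split the file into lines and throttle processing.
var fs = require('fs');
var stream = require('stream');

var liner = new stream.Transform({objectMode: true});
var leftover = '';

liner._transform = function (chunk, encoding, callback) {
  var lines = (leftover + chunk.toString()).split('\n');
  leftover = lines.pop(); // keep the trailing partial line for the next chunk
  for (var i = 0; i < lines.length; i++) this.push(lines[i]);
  callback();
};

liner._flush = function (callback) {
  if (leftover) this.push(leftover);
  callback();
};

var writer = new stream.Writable({objectMode: true});
writer._write = function (line, encoding, callback) {
  if (!line.trim()) return callback();
  // Hypothetical CouchDB insert; calling back only when it finishes
  // is what provides the backpressure that keeps memory bounded.
  insertLine(JSON.parse(line), callback);
};

fs.createReadStream('logfile1', {encoding: 'utf8'})
  .pipe(liner)
  .pipe(writer);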

Related

Create a Read Stream remotely without writing it node.js

I'm trying to create a Read Stream from a remote file without writing it to disk.
var file = fs.createWriteStream('Video.mp4');
var request = http.get('http://url.tld/video.mp4', function (response) {
  response.pipe(file);
});
Can I create a Read Stream directly from an HTTP response without writing it to disk? Maybe by creating a buffer in chunks and converting it to a readable stream?
Seems like you can use the request module.
Have a look at 7zark7's answer here: https://stackoverflow.com/a/14552721/7189461
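As a further note, the response object that http.get hands you is itself a readable stream, so you may not need to buffer anything yourself. A minimal sketch under that assumption (the remoteReadStream helper name is just for illustration):

// The HTTP response is already a Readable stream; the PassThrough just gives
// you a stream handle you can return before the response has arrived.
var http = require('http');
var PassThrough = require('stream').PassThrough;

function remoteReadStream(url) {
  var proxy = new PassThrough();
  http.get(url, function (response) {
    response.pipe(proxy);
  });
  return proxy; // readable stream backed by the remote file, nothing on disk
}

// usage: hand it to anything that expects a Readable
remoteReadStream('http://url.tld/video.mp4').pipe(process.stdout);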

how to read an incomplete file and wait for new data in nodejs

I have a UDP client that grabs some data from another source and writes it to a file on the server. Since this is a large amount of data, I don't want the end user to wait until it is fully written to the server before they can download it. So I made a Node.js server that grabs the latest data from the file and sends it to the user.
Here is the code:
var stream = fs.readFileSync(filename)
  .on("data", function (data) {
    response.write(data);
  });
The problem here is, if the download starts when the file is only, for example, 10 MB, fs.readFileSync will only read my file up to 10 MB. Even if 2 minutes later the file has grown to 100 MB, fs.readFileSync will never know about the newly added data. How can I do this in Node? I would like to somehow refresh the fs state, or perhaps wait for new data using the fs file system. Or is there some kind of fs file-content watcher?
EDIT:
I think the code below better describes what I would like to achieve; however, in this code it keeps reading forever, and I don't have any value from fs.read that can help me stop it:
fs.open(filename, 'r', function (err, fd) {
  var bufferSize = 1000,
      chunkSize = 512,
      buffer = new Buffer(bufferSize),
      bytesRead = 0;

  while (true) { // check if file has new content inside
    fs.read(fd, buffer, 0, chunkSize, bytesRead);
    bytesRead += buffer.length;
  }
});
Node has a built-in method for this in the fs module. It is tagged as unstable, so it can change in the future.
It's called fs.watchFile(filename[, options], listener).
You can read more about it here: https://nodejs.org/api/fs.html#fs_fs_watchfile_filename_options_listener
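A minimal sketch of how that could be wired up, assuming the file only ever grows, is to watch the stats and stream just the newly appended byte range each time the size increases (here the new bytes go to stdout; in the real server they would go to the HTTP response):

var fs = require('fs');

var filename = 'data.bin';
var bytesSent = 0;

// fs.watchFile polls the file's stats; when the size grows, read only the
// newly appended byte range and forward it.
fs.watchFile(filename, {interval: 1000}, function (curr, prev) {
  if (curr.size > bytesSent) {
    fs.createReadStream(filename, {start: bytesSent, end: curr.size - 1})
      .pipe(process.stdout, {end: false});
    bytesSent = curr.size;
  }
});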
But I highly suggest you use one of the good, actively maintained modules instead, like
watchr:
From its readme:
Better file system watching for Node.js. Provides a normalised API over the
file watching APIs of different Node versions, nested/recursive file
and directory watching, and accurate detailed events for
file/directory changes, deletions and creations.
The module page is here: https://github.com/bevry/watchr
(I've used the module in a couple of projects and it works great; I'm not related to it in any other way.)
You need to store the last known size of the file somewhere, such as a database.
Read the file size first.
Load your file.
Then write a script that checks whether the file has changed.
You can poll for the size (for example with jquery.post) to obtain the result and decide in JavaScript whether you need to reload.
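On the Node side, a minimal sketch of that size check could use fs.stat and compare against the last recorded size (lastKnownSize is just an illustrative variable; the answer above suggests persisting it in a database):

var fs = require('fs');

var filename = 'data.bin';
var lastKnownSize = 0;

// Poll the file's size and report whether new data has arrived since the
// last check.
setInterval(function () {
  fs.stat(filename, function (err, stats) {
    if (err) return console.error(err);
    if (stats.size > lastKnownSize) {
      console.log('file grew by', stats.size - lastKnownSize, 'bytes');
      lastKnownSize = stats.size;
    }
  });
}, 1000);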

read (pull) vs pipe(control flow) vs data(push)

Node.js has different options for consuming data: streams 0, 1, 2, 3 and so on.
My question is about the real-life application of these different options. I fairly well understand the difference between readable/read, the data event, and pipe, but I am not very confident about selecting a specific method.
For example, if I want to use flow control, read with some manual work can be used, as can pipe. The data event ignores flow control; should I stop using the plain data event?
For most things, you should be able to use
src.pipe(dest);
If you look at the source code for the Stream.prototype.pipe implementation, you can see that it's just a very handy wrapper that sets everything up for you.
For all the work I do with streams, I generally just choose the proper stream type (Readable, Writable, Duplex, Transform, or PassThrough) and then define the proper methods (_read, _write, and/or _transform) on the stream. Lastly, I use .pipe to connect everything together.
It's very common to see stream setups that appear to be "circular":
client.pipe(encoder).pipe(server).pipe(decoder).pipe(client)
As an example, here's a stream I'm using in my burro module. You can write objects to this stream, and you can read JSON strings from it.
// burro/encoder.js
var stream = require("stream"),
    util = require("util");

var Encoder = module.exports = function Encoder() {
  stream.Transform.call(this, {objectMode: true});
};
util.inherits(Encoder, stream.Transform);

Encoder.prototype._transform = function _transform(obj, encoding, callback) {
  this.push(JSON.stringify(obj));
  callback(null);
};
As a general recommendation, you will almost always write your Streams like this. That is, you write your own "class" that inherits from one of the built-in streams. It is not really practical for you to use a built-in stream directly.
To demonstrate how you might use this, start by creating a new instance of the stream
var encoder = new Encoder();
See what the encoder outputs by piping it to stdout
encoder.pipe(process.stdout);
Write some sample objects to it
encoder.write({foo: "bar", a: "b"});
// '{"foo":"bar","a":"b"}'
encoder.write({hello: "world"});
// '{"hello":"world"}'

Getting a ReadableStream from something that writes to WritableStreams

I've never used streams in Node.js, so I apologize in advance if this is trivial.
I'm using the ya-csv library to create a CSV. I use a line like this:
csvwriter = csv.createCsvStreamWriter(process.stdout)
As I understand it, this takes a writable Stream and writes to it when I add a record.
I need to use this CSV as an email attachment.
From nodemailer's docs, here is how to do that:
attachments: [
  { // stream as an attachment
    fileName: "text4.txt",
    streamSource: fs.createReadStream("file.txt")
  }
]
As I understand it, this takes a readable Stream and reads from it.
Therein lies the problem. I need a readable Stream, I need a writable Stream, but at no point do I have a Stream.
It would be nice if ya-csv had a:
csvwriter = csv.createReadableCsvStream()
But it doesn't. Is there some built-in stream that makes available for reading whatever is written to it? I've looked for a library with no success (though there are a few things that could work, but they seem like overkill).
You can use a PassThrough stream for that:
var PassThrough = require('stream').PassThrough;
var stream = new PassThrough();
var csvwriter = csv.createCsvStreamWriter(stream);
Now you can read from stream whatever is written to it.
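Putting that together with the nodemailer snippet from the question, something like the following sketch should work (it assumes ya-csv's writeRecord method for adding rows):

var PassThrough = require('stream').PassThrough;
var csv = require('ya-csv');

var stream = new PassThrough();
var csvwriter = csv.createCsvStreamWriter(stream);

csvwriter.writeRecord(['a', 'b', 'c']); // write records as usual

// ...and hand the very same stream to nodemailer as the attachment source
var attachments = [
  {
    fileName: 'report.csv',
    streamSource: stream
  }
];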

Node.js request stream ends/stalls when piped to writable file stream

I'm trying to pipe() data from Twitter's Streaming API to a file using modern Node.js Streams. I'm using a library I wrote called TweetPipe, which leverages EventStream and Request.
Setup:
var TweetPipe = require('tweet-pipe')
, fs = require('fs');
var tp = new TweetPipe(myOAuthCreds);
var file = fs.createWriteStream('./tweets.json');
Piping to STDOUT works and stream stays open:
tp.stream('statuses/filter', { track: ['bieber'] })
.pipe(tp.stringify())
.pipe(process.stdout);
Piping to the file writes one tweet and then the stream ends silently:
tp.stream('statuses/filter', { track: ['bieber'] })
.pipe(tp.stringify())
.pipe(file);
Could anyone tell me why this happens?
It's hard to say from what you have here; it sounds like the stream is getting cleaned up before you expect. This can be triggered a number of ways, see here: https://github.com/joyent/node/blob/master/lib/stream.js#L89-112
A stream could emit 'end', and then something just stops.
Although I doubt this is the problem, one thing that concerns me is this:
https://github.com/peeinears/tweet-pipe/blob/master/index.js#L173-174
destroy should be called after emitting error.
I would normally debug a problem like this by adding logging statements until I can see what is not happening right.
Can you post a script that can be run to reproduce?
(for extra points, include a package.json that specifies the dependencies :)
According to this, you should create an error handler on the stream created by tp.
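For example, here's a minimal sketch of attaching error handlers at each stage so a silent failure at least becomes visible (the stream setup is taken from the question):

tp.stream('statuses/filter', { track: ['bieber'] })
  .on('error', function (err) { console.error('source error:', err); })
  .pipe(tp.stringify())
  .on('error', function (err) { console.error('stringify error:', err); })
  .pipe(file)
  .on('error', function (err) { console.error('file write error:', err); });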
