I'm writing a small node.js application that receives a multipart POST from an HTML form and pipes the incoming data to Amazon S3. The formidable module provides the multipart parsing, exposing each part as a node Stream. The knox module handles the PUT to s3.
var form = new formidable.IncomingForm()
, s3 = knox.createClient(conf);
form.onPart = function(part) {
var put = s3.putStream(part, filename, headers, handleResponse);
put.on('progress', handleProgress);
};
form.parse(req);
I'm reporting the upload progress to the browser client via socket.io, but am having difficulty getting these numbers to reflect the real progress of the node to s3 upload.
When the browser to node upload happens near instantaneously, as it does when the node process is running on the local network, the progress indicator reaches 100% immediately. If the file is large, i.e. 300MB, the progress indicator rises slowly, but still faster than our upstream bandwidth would allow. After hitting 100% progress, the client then hangs, presumably waiting for the s3 upload to finish.
I know putStream uses Node's stream.pipe method internally, but I don't understand the detail of how this really works. My assumption is that node gobbles up the incoming data as fast as it can, throwing it into memory. If the write stream can take the data fast enough, little data is kept in memory at once, since it can be written and discarded. If the write stream is slow though, as it is here, we presumably have to keep all that incoming data in memory until it can be written. Since we're listening for data events on the read stream in order to emit progress, we end up reporting the upload as going faster than it really is.
Is my understanding of this problem anywhere close to the mark? How might I go about fixing it? Do I need to get down and dirty with write, drain and pause?
Your problem is that stream.pause isn't implemented on the part, which is a very simple readstream of the output from the multipart form parser.
Knox instructs the s3 request to emit "progress" events whenever the part emits "data". However since the part stream ignores pause, the progress events are emitted as fast as the form data is uploaded and parsed.
The formidable form, however, does know how to both pause and resume (it proxies the calls to the request it's parsing).
Something like this should fix your problem:
form.onPart = function(part) {
// once pause is implemented, the part will be able to throttle the speed
// of the incoming request
part.pause = function() {
form.pause();
};
// resume is the counterpart to pause, and will fire after the `put` emits
// "drain", letting us know that it's ok to start emitting "data" again
part.resume = function() {
form.resume();
};
var put = s3.putStream(part, filename, headers, handleResponse);
put.on('progress', handleProgress);
};
Related
This is a conceptual query regarding system level optimisation. My understanding by reading the NodeJS Documentation is that pipes are handy to perform flow control on streams.
Background: I have microphone stream coming in and I wanted to avoid an extra copy operation to conserve overall system MIPS. I understand that for audio streams this is not a great deal of MIPS being spent even if there was a memcopy under the hood, but I also have an extension planned to stream in camera frames at 30fps and UHD resolution. Making multiple copies of UHD resolution pixel data at 30fps is super inefficient, so needed some advice around this.
Example Code:
var spawn = require('child_process').spawn
var PassThrough = require('stream').PassThrough;
var ps = null;
//var audioStream = new PassThrough;
//var infoStream = new PassThrough;
var start = function() {
if(ps == null) {
ps = spawn('rec', ['-b', 16, '--endian', 'little', '-c', 1, '-r', 16000, '-e', 'signed-integer', '-t', 'raw', '-']);
//ps.stdout.pipe(audioStream);
//ps.stderr.pipe(infoStream);
exports.audioStream = ps.stdout;
exports.infoStream = ps.stderr;
}
};
var stop = function() {
if(ps) {
ps.kill();
ps = null;
}
};
//exports.audioStream = audioStream;
//exports.infoStream = infoStream;
exports.startCapture = start;
exports.stopCapture = stop;
Here are the questions:
To be able to perform flow control, does the source.pipe(dest) perform a memcpy from the source memory to the destination memory under the hood OR would it pass the reference in memory to the destination?
The commented code contains a PassThrough class instantiation - I am currently assuming the PassThrough causes memcopies as well, and so I am saving one memcpy operation in the entire system because I added in the above comments?
If I had to create a pipe between a Process and a Spawned Child process (using child_process.spawn() as shown in How to transfer/stream big data from/to child processes in node.js without using the blocking stdio?), I presume that definitely results in memcpy? Is there anyway to make that a reference rather than copy?
Does this behaviour differ from OS to OS? I presume it should be OS agnostic, but asking this anyways.
Thanks in advance for your help. It will help my architecture a great deal.
some url's for reference: https://github.com/nodejs/node/
https://github.com/nodejs/node/blob/master/src/stream_wrap.cc
https://github.com/nodejs/node/blob/master/src/stream_base.cc
https://github.com/libuv/libuv/blob/v1.x/src/unix/stream.c
https://github.com/libuv/libuv/blob/v1.x/src/win/stream.c
i tried writing a complicated / huge explaination based on theese and some other files however i came to the conclusion it would be best to give you a summary of how my experience / reading tells me node internally works:
pipe simply connects streams making it appear as if .on("data", …) is called by .write(…) without anything bloated in between.
now we need to separate the js world from the c++ / c world.
when dealing with data in js we use buffers. https://github.com/nodejs/node/blob/master/src/node_buffer.cc
they simply represent allocated memory with some candy on top to operate with it.
if you connect stdout of a process to some .on("data", …) listener it will copy the incoming chunk into a Buffer object for further usage inside the js world.
inside the js world you have methods like .pause() etc. (as you can see in nodes steam api documentation) to prevent the process to eat memory in case incoming data flows faster than its processed.
connecting stdout of a process and for example an outgoing tcp port through pipe will result in a connection similar to how nginx operates. it will connect theese streams as if they would directly talk to each other by copying incoming data directly to the outgoing stream.
as soon as you pause a stream, node will use internal buffering in case its unable to pause the incoming stream.
so for your scenario you should just do testing.
try to receive data through an incoming stream in node, pause the stream and see what happens.
i'm not sure if node will use internal buffering or if the process you try to run will just halt untill it can continue to send data.
i expect the process to halt untill you continue the stream.
for transfering huge images i recommend transfering them in chunks or to pipe them directly to an outgoing port.
the chunk way would allow you to send the data to multiple clients at once and would keep the memory footprint pretty low.
PS you should take a look at this gist that i just found: https://gist.github.com/joyrexus/10026630
it explains in depth how you can interact with streams
The documentation for node suggests that for the new best way to read streams is as follows:
var readable = getReadableStreamSomehow();
readable.on('readable', function() {
var chunk;
while (null !== (chunk = readable.read())) {
console.log('got %d bytes of data', chunk.length);
}
});
To me this seems to cause a blocking while loop. This would mean that if node is responding to an http request by reading and sending a file, the process would have to block while the chunk is read before it could be sent.
Isn't this blocking IO which node.js tries to avoid?
The important thing to note here is that it's not blocking in the sense that it's waiting for more input to arrive on the stream. It's simply retrieving the current contents of the stream's internal buffer. This kind of loop will finish pretty quickly since there is no waiting on I/O at all.
A stream can be both synchronous and asynchronous. If readable stream synchronously pushes data in the internal buffer then you'll get a synchronous stream. And yes, in that case if it pushes lots of data synchronously node's event loop won't be able to run until all the data is pushed.
Interestingly, if you even remove the while loop in readble callback, the stream module internally calls a while loop once and keeps running until all the pushed data is read.
But for asynchronous IO operations(e.g. http or fs module), they push data asynchronously in the buffer. So the while loop only runs when data is pushed in buffer and stops as soon as you've read the entire buffer.
I'm using node and socket io to stream twitter feed to the browser, but the stream is too fast. In order to slow it down, I'm attempting to use setInterval, but it either only delays the start of the stream (without setting evenly spaced intervals between the tweets) or says that I can't use callbacks when broadcasting. Server side code below:
function start(){
stream.on('tweet', function(tweet){
if(tweet.coordinates && tweet.coordinates != null){
io.sockets.emit('stream', tweet);
}
});
}
io.sockets.on("connection", function(socket){
console.log('connected');
setInterval(start, 4000);
});
I think you're misunderstanding how .on() works for streams. It's an event handler. Once it is installed, it's there and the stream can call you at any time. Your interval is actually just making things worse because it's installing multiple .on() handlers.
It's unclear what you mean by "data coming too fast". Too fast for what? If it's just faster than you want to display it, then you can just store the tweets in an array and then use timers to decide when to display things from the array.
If data from a stream is coming too quickly to even store and this is a flowing nodejs stream, then you can pause the stream with the .pause() method and then, when you're able to go again, you can call .resume(). See http://nodejs.org/api/stream.html#stream_readable_pause for more info.
The following will not work properly:
var http = require('http');
var fs = require('fs');
var theIndex = fs.createReadStream('index.html');
http.createServer(function (req, res) {
res.writeHead(200, {'Content-Type': 'text/html'});
theIndex.pipe(res);
}).listen(9000);
It will work great on the first request but for all subsequent requests no index.html will be sent to the client. The createReadStream call seems to need be inside the createServer callback. I think I can conceptualize why, but can you articulate why in words? It seems to be that once the stream has completed the file handle is closed and the stream must be created again? It can't simply be "restarted"? Is this correct?
Thanks
Streams contain internal state that keeps track of the state of the stream--in the case of a file stream, you have a file descriptor object, a read buffer, and the current position the file has been read to. Thus, it doesn't make sense to "rewind" a Node.js stream because Node.js is an asynchronous environment--this is an important point to keep in mind, as it means that two HTTP requests can be in the middle of processing at the same time.
If one HTTP request causes the stream to begin streaming from disk, and midway through the streaming process another HTTP request came in, there would be no way to use the same stream in the second HTTP request (the internal record-keeping would incorrectly send the second HTTP response the wrong data). Similarly, rewinding the stream when the second HTTP request is processed would cause the wrong data to be sent to the original HTTP request.
If Node.js were not an asynchronous environment, and it was guaranteed that the stream was completely used up before you rewound it, it might make sense to be able to rewind a stream (though there are other considerations, such as the timing of the open, end, and close events).
You do have access to the low-level fs.read mechanisms, so you could theoretically create an API that only opened a single file descriptor but spawned multiple streams; each stream would contain its own buffer and read position, but share a file descriptor. Perhaps something like:
var http = require('http');
var fs = require('fs');
var theIndexSpawner = createStreamSpawner('index.html');
http.createServer(function (req, res) {
res.writeHead(200, {'Content-Type': 'text/html'});
theIndexSpawner.spawnStream().pipe(res);
}).listen(9000);
Of course, you'll have to figure out when it's time to close the file descriptor, making sure you don't hold onto it for too long, etc. Unless you find that opening the file multiple times is an actual bottleneck in your application, it's probably not worth the mental overhead.
I'm trying to make a server based on the net module. what I don't understand is on which event I'm supposed to put the response code:
on(data,function()) could still be in the middle of receiving more data from the stream (so it might be to early to reply)
and on(end,function()) is after the connection is closed .
thank you for your help
The socket event ('data'), calls the callback function every time an incoming data buffer is ready for reading,, and the event emits the socket buffer of data,,
so use this,,
socket.on('data',function(data){
// Here is the function to detect the real data in stream
});
this can help for node v0.6.5, http://nodejs.org/docs/v0.6.5/api/net.html#event_data_
and this for clear understanding for the Readable streames,
http://nodejs.org/docs/v0.6.5/api/streams.html#readable_Stream