I have a 400 MB file split into chunks that are ~1 MB each.
Each chunk is a MongoDB document:
{
name: 'stuff.zip',
index: 15,
buffer: Binary('......'),
totalChunks: 400
}
I am fetching each chunk from my database and then streaming it to the client.
Every time I get a chunk from the DB, I push it to the readableStream, which is being piped to the client.
Here is the code:
import { Readable } from 'stream'
const name = 'stuff.zip'
const contentType = 'application/zip'
app.get('/api/download-stuff', async (req, res) => {
res.set('Content-Type', contentType)
res.set('Content-Disposition', `attachment; filename="${name}"`)
res.attachment(name)
// get `totalChunks` from random chunk
let { totalChunks } = await ChunkModel.findOne({ name }).select('totalChunks')
let index = 0
const readableStream = new Readable({
async read() {
if (index < totalChunks) {
let { buffer } = await ChunkModel.findOne({ name, index }).select('buffer')
let canContinue = readableStream.push(buffer)
console.log(`pushed chunk ${index}/${totalChunks}`)
index++
// sometimes it logs false
// which means I should be waiting before pushing more
// but I don't know how
console.log('canContinue = ', canContinue)
} else {
readableStream.push(null)
readableStream.destroy()
console.log(`all ${totalChunks} chunks streamed to the client`)
}
}
})
readableStream.pipe(res)
})
The code works.
But I'm wondering whether I risk memory overflows on my server, especially when there are many concurrent requests for the same file or the chunks are very numerous.
Question: My code is not waiting for readableStream to finish reading the chunk that was just pushed to it before pushing the next one. I thought it was, and that is why I'm using read(){..} in this probably wrong way. So how should I wait for each chunk to be pushed, read, streamed to the client, and cleared from my server's memory before I push the next one in?
I have created this sandbox in case it helps anyone
In general, when the Readable interface is implemented correctly (i.e., the backpressure signal is respected), it will prevent the code from overflowing memory regardless of the source size.
When implemented according to the API spec, the readable itself does not keep references to data that has finished passing through the stream. The memory requirement of a readable's buffer is adjusted by specifying a highWaterMark.
In this case, the snippet does not conform to the readable interface. It violates the following two concepts:
No data shall be pushed to the readable's buffer unless read() has been called. Currently, this implementation proceeds to push data from the DB immediately. Consequently, the readable buffer will start to fill before the sink has begun to consume the data.
The readable's push() method returns a boolean flag. When the flag is false, the implementation must wait for .read() to be called again before pushing additional data. If the flag is ignored, the buffer will overflow with respect to the highWaterMark.
Note that ignoring these core criteria of Readables circumvents the backpressure logic.
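For reference, a minimal pull-based sketch that stays inside this contract (reusing the name, index, totalChunks and ChunkModel variables from the question): push exactly one chunk per read() call, and Node will call read() again only once the internal buffer has drained below the highWaterMark, so the push() return value never needs to be handled manually.
const readableStream = new Readable({
  // read() is invoked only when the internal buffer is below highWaterMark,
  // so pushing a single chunk per call keeps the backpressure contract intact
  async read() {
    if (index >= totalChunks) {
      this.push(null); // end-of-stream; no explicit destroy() is needed
      return;
    }
    try {
      const { buffer } = await ChunkModel.findOne({ name, index }).select('buffer');
      index++;
      this.push(buffer);
    } catch (err) {
      this.destroy(err); // surfaces as an 'error' event on the stream
    }
  }
});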
An alternative implementation, if this is a Mongoose query:
app.get('/api/download-stuff', async (req, res) => {
// ... truncated handler
// A helper variable to relay data from the stream to the response body
const passThrough = new stream.PassThrough({objectMode: false});
// Pipe data using pipeline() to simplify handling stream errors
stream.pipeline(
// Create a cursor that fetches all relevant documents using a single query
ChunkModel.find({ name }).limit(chunksLength).select("buffer").sort({index: 1}).lean().cursor(),
// Cherry pick the `buffer` property
new stream.Transform({
objectMode: true,
transform: ({ buffer }, encoding, next) => {
next(null, buffer);
}
}),
// Write the retrieved documents to the helper variable
passThrough,
error => {
if(error){
// Log and handle error. At this point the HTTP headers are probably already sent,
// and it is therefore too late to return HTTP500
}
}
);
passThrough.pipe(res);
});
I want to pipe data from my readable stream to a writable stream but validate in between.
In my case:
Readable Stream: http response as a stream (Axios.post response as a stream to be more specific)
Writable Stream: AWS S3
The Axios.post response comes in XML format, which means the readable stream will read chunks that represent XML. I transform each chunk to a string and check whether <specificTag> (opening) and </specificTag> (closing) are present. These two checks may be satisfied in different, arbitrary chunks.
If both opening/closing tags are found, then I have to transfer the chunks to the writable stream.
I am coding like:
let openTagFound: boolean = false;
let closingTagFound: boolean = false;
readableStream.pipe(this.validateStreamData()).pipe(writableStream);
I have also defined the _transform method for validateStreamData() like:
private validateStreamData(): Transform {
let data = '', transformStream = new Transform();
let openTagFound: boolean = false;
let closingTagFound: boolean = false;
try {
transformStream._transform = function (chunk, _encoding, done) {
// Keep chunk in memory
data += chunk.toString();
if(!openTagFound) {
// Check whether openTag e.g <specificTag> is found, if yes
openTagFound = true;
}
if(!closingTagFound) {
// parse the chunk using parser
// Check whether closingTag e.g </specificTag> is found, if yes
closingTagFound = true;
}
// we are not writing anything out at this
// time, only at end during _flush
// so we don't need to call push
done();
};
transformStream._flush = function (done) {
if(openTagFound && closingTagFound) {
this.push(data);
}
done();
};
return transformStream;
} catch (ex) {
this.logger.error(ex);
transformStream.end();
throw Error(ex);
}
}
Now, you can see that I am using a variable data at:
// Keep chunk in memory
data += chunk.toString();
I want to get rid of this. I do not want to buffer the data in memory explicitly. The final goal is to get data from Axios.post and transfer it to AWS S3, but only if my validation succeeds. If not, then it should not write to S3.
Any help is much appreciated.
Thanks in Advance!!!
So, what I finally did is let the pipe run to the end while keeping some flags to track whether the data is valid or invalid; then, in the on('end') callback, if the flags say it is invalid, I explicitly destroy the destination object.
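A rough sketch of that approach, not the exact code: the transform forwards every chunk immediately so nothing accumulates in memory, keeps the flags on itself, and the destination is destroyed on 'end' if validation fails. readableStream and writableStream are the streams from above; note that a tag split across two chunks would still need a small carry-over buffer, and how exactly to abort the S3 write depends on how the writable was created.
const { Transform } = require('stream');

const validator = new Transform({
  transform(chunk, _encoding, done) {
    const text = chunk.toString();
    // Flip the flags as the tags scroll past; the chunk itself is forwarded right away
    if (!this.openTagFound && text.includes('<specificTag>')) this.openTagFound = true;
    if (!this.closingTagFound && text.includes('</specificTag>')) this.closingTagFound = true;
    done(null, chunk);
  }
});

readableStream.pipe(validator).pipe(writableStream);

validator.on('end', () => {
  if (!(validator.openTagFound && validator.closingTagFound)) {
    // Validation failed: tear down the destination so the upload does not complete
    writableStream.destroy(new Error('missing <specificTag> wrapper'));
  }
});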
I've just started playing with streaming data in Expressjs.
I'm not entirely sure, but I think the request will start to execute the handler again. For example, here is my handler:
import getDataAsync from "./somewhere";
function handler(req, res) {
console.log('requesting', req.path);
getDataAsync()
.then(data => {
let stream = renderContent(data);
stream.pipe(res);
})
.catch(err => {
res.end();
})
}
What I found was that it continues to print out console.log('requesting', req.path) (which I think means getDataAsync is re-executed).
My question is:
Is it true it will re-execute getDataAsync?
If it does, what's your approach?
Thanks heaps!
Node.js is non-blocking, so if you make another request to an endpoint with this handler, it will execute again. The handler will call getDataAsync() and then the handler gets removed from the call stack. The process is repeated for each request.
If you want the handler to wait out the stream before it calls it again you could do:
import getDataAsync from "./somewhere";
let streamComplete = true;
function handler(req, res) {
if(!streamComplete) {
return res.end();
}
console.log('requesting', req.path);
getDataAsync()
.then(data => {
streamComplete = false;
let stream = renderContent(data);
stream.pipe(res);
stream.on('end', () => streamComplete = true);
})
.catch(err => {
res.end();
})
}
I needed to sort this problem out in one of my projects. Node, or in fact any other environment/language, will have the same issue: once you start streaming the data to one client, it's rather hard to stream it to another. This is due to the fact that once you do this:
inputStream.pipe(outputStream);
...the input data will be pushed out to the output and removed from memory. So if you just pipe the inputStream again, you'll have some initial part of the data missing.
The solution I came up with was to write a Transform stream that keeps the data in memory so you can reuse it afterwards. Such a stream will have all the original chunks, and at the same time, once it catches up with the first request, it will just keep pushing the chunks through directly. I packaged the solution as an npm module and published it, so now you can use it.
This is how you use it:
const {ReReadable} = require("rereadable-stream");
// We'll use this for caching - you can use a Map if you have more streams
let cachedStream;
// This function will get the stream and rewind it to the beginning for every caller
const getCachedStream = () =>
(cachedStream || (cachedStream =
getDataAsync()
.then(
data => renderContent(data).pipe(new ReReadable())
))
)
.then(readable => readable.rewind())
Such a function will call getDataAsync once and then push the data into the rewindable stream, but every time the function is executed the stream will be rewound to the beginning.
You can read a bit more about the rereadable-stream module here.
A word of warning though: remember that you will keep all that data in memory now, so be careful to clean it up if there are a lot of chunks in there, and control your memory usage.
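For completeness, this is roughly how it plugs into the handler from the question (a sketch, assuming rewind() hands back a fresh readable as described above):
function handler(req, res) {
  console.log('requesting', req.path);
  getCachedStream()
    .then(readable => readable.pipe(res))
    .catch(err => res.end());
}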
I create a writeFileStream and pipe a readableStream into it.
On 'data', I check the length of the data; if it is too short, I don't want to create a file with the writeFileStream.
Can I abort creating the file with the writeFileStream, or unlink the file after it has been created?
Thanks for your help.
const fs = require('fs')
const { ReadableMock } = require('stream-mock')
const { assert } = require('chai')
describe.only('fs', () => {
const expectedPath = './file.txt'
const input = 'abc'
const reader = new ReadableMock(input)
const writer = fs.createWriteStream(expectedPath)
before((done) => {
let index = 0
reader.pipe(writer)
reader.on('data', () => {
index++
if (index === 1) {
reader.unpipe(writer)
done()
}
})
})
after(() => {
fs.unlinkSync('./file.txt')
})
it('should not create file', () => {
assert.isFalse(fs.existsSync(expectedPath)) // expected true to be false.
})
})
In order to achieve what you're after, I'd create a PassThrough stream and use highWaterMark to tell me when the stream has been filled - you won't need much code and the streams will add so little overhead you won't notice it (not compared with writing to disk or reading from HTTP). ;)
Here's what I'd do:
const { PassThrough } = require('stream')
const reader = new ReadableMock(input)
const checker = new PassThrough({
highWaterMark: 4096 // or how many bytes you need to gather first
});
reader
.once('pause', () => checker.pipe(fs.createWriteStream(expectedPath)))
.pipe(checker);
What happens here is:
reader is piped to checker, which is not connected to anything yet, but has its highWaterMark level of bytes that it allows (you may add an encoding there to count chars instead of bytes)
checker is paused, but on pipe, reader will unpause and try to write as much as it can
checker will accept some data before returning false on write, which will emit the 'pause' event on reader
only now does the listener create the writer and its underlying file, and pipe checker into it
checker gets unpaused, and so does reader
If the number of bytes is lower than the highWaterMark, 'pause' will not be emitted on reader, and so the file won't get created.
Mind you, you may need to close connections and clean up if this is not a mock; otherwise you may leave those hanging and waiting to be read, and you'll soon exhaust incoming connection limits or available memory.
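If it helps, one way to do that cleanup is a sketch like the following, where fileStarted is a hypothetical flag you would set inside the 'pause' listener above, just before creating the write stream:
let fileStarted = false; // set to true inside the 'pause' listener above

reader.once('end', () => {
  if (!fileStarted) {
    // 'pause' never fired, so no file was created; just discard what checker buffered
    checker.destroy();
  }
});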
I am piping the response from a Node request into a transform stream using through2Concurrent. This response comes in as a buffer and is parsed to objects using JSONStream. That then gets piped into my transform stream. The transform stream function then makes an HTTP request, formats the response, and stores it in MongoDB. We are using concurrent streams because it would take an unacceptably long time to handle everything otherwise.
response Stream -> JSONStream.parse() -> Transform Stream
Problem Description
The initial response stream contains roughly 18,000 objects once parsed. However, the stream terminates and a finish event is received before all 18,000 objects are handled. No error is thrown, but only about 2,000 - 5,000 objects are actually handled before the stream ends. The exact number handled varies.
Here is the relevant code:
const analyticsTransformer = through2Concurrent.obj({
maxConcurrency: 15
}, async (doc, enc, cb) => {
// Make an http request. This is a relatively long request.
const res = await apim.getAnalytics(doc);
// Save response to mongo.
await UsageData.save(res);
cb();
});
// Kick off the streaming.
broker.getInstances()
.pipe(JSONStream.parse('*'))
.pipe(analyticsTransformer)
.on('finish', () => {
// We reach this way too quickly, before we have handled all 18,000 objects
})
.on('error', err => {
// No errors are caught.
});
What I have Tried
Waiting for an 'end' event: Same result. Unhandled objects and early termination.
Using through2 (not through2Concurrent): Receive ETIMEOUT after several thousand objects have come through.
Setting the highWaterMark to 18,000:
This is the only thing that has worked. I can handle all of the objects if I change this highWaterMark value, but this is really just a band-aid on the problem. I want to know why this works and what I can do to fix my streaming problems in a robust way.
Setting the highWaterMark looks like this:
const analyticsTransformer = through2Concurrent.obj({
highWaterMark: 18000,
maxConcurrency: 15
}, async (doc, enc, cb) => {
// ...
});
Why does changing the highWaterMark value work?
What is the real cause of my early terminated stream?
How can I fix it?
Thanks in advance to anyone that can help! :)
I am having a really hard time wrapping my head around how to stream data back to my client when using Nodejs/Expressjs.
I am grabbing a lot of data from my database and I am doing it in chunks. I would like to stream it back to the client as I get the data, so that I do not have to store the entire dataset in memory as a JSON object before sending it back.
I would like the data to stream back as a file, i.e. I want the browser to ask my users what to do with the file on download. I was previously creating a file system write stream, streaming the contents of my data to the file system, and then sending the file back to the client when done. I would like to eliminate the middleman (creating a tmp file on the file system) and just stream the data to the client.
app.get(
'/api/export',
function (req, res, next) {
var notDone = true;
while (notDone) {
var partialData = // grab partial data from database (maybe first 1000 records);
// stream this partial data as a string to res???
if (checkIfDone) notDone = false;
}
}
);
I can call res.write("some string data") and then call res.end() when I am done. However, I am not 100% sure that this actually streams the response to the client as I write. It seems like expressjs is storing all the data until I call end and then sending the response. Is that true?
What is the proper way to stream string chunks of data to a response using expressjs?
The response object is already a writable stream. Express handles sending chunked data automatically, so you won't need to do anything extra but:
response.send(data)
You may also want to check out the built-in pipe method, http://nodejs.org/api/stream.html#stream_event_pipe.
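For example, if the partial results can be exposed as a readable stream (getExportCursor() below is only a placeholder for however you page through your database), piping it into the response sends each chunk to the client as it becomes available:
app.get('/api/export', function (req, res, next) {
  res.attachment('export.json');
  const source = getExportCursor(); // placeholder: any readable stream of strings/Buffers
  source.on('error', next);
  source.pipe(res); // pipe() handles backpressure and calls res.end() when the source ends
});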
You can do this by setting the appropriate headers and then just writing to the response object. Example:
res.writeHead(200, {
'Content-Type': 'text/plain',
'Content-Disposition': contentDisposition('foo.data')
});
var c = 0;
var interval = setInterval(function() {
res.write(JSON.stringify({ foo: Math.random() * 100, count: ++c }) + '\n');
if (c === 10) {
clearInterval(interval);
res.end();
}
}, 1000);
// extracted from Express, used by `res.download()`
var basename = require('path').basename;
function contentDisposition(filename) {
var ret = 'attachment';
if (filename) {
filename = basename(filename);
// if filename contains non-ascii characters, add a utf-8 version ala RFC 5987
ret = /[^\040-\176]/.test(filename)
? 'attachment; filename="' + encodeURI(filename) + '"; filename*=UTF-8\'\'' + encodeURI(filename)
: 'attachment; filename="' + filename + '"';
}
return ret;
}
Also, Express/node does not buffer data written to a socket unless the socket is paused (either explicitly or implicitly due to backpressure). Data buffered by node while in this paused state may or may not be combined with other data chunks that are already buffered. You can check the return value of res.write() to determine whether you should continue writing to the socket. If it returns false, listen for the 'drain' event and then continue writing.
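A small sketch of that write/'drain' handshake, using a hypothetical nextChunk() that returns the next string to send (or null when there is nothing left):
function streamChunks(res, nextChunk) {
  let chunk;
  while ((chunk = nextChunk()) !== null) {
    if (!res.write(chunk)) {
      // The socket's buffer is full: stop writing and resume once it drains
      res.once('drain', () => streamChunks(res, nextChunk));
      return;
    }
  }
  res.end();
}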