Stream data from Cassandra to file considering backpressure


I have a Node app that collects vote submissions and stores them in Cassandra. The votes are stored as base64-encoded encrypted strings. The API has an endpoint called /export that should get all of these vote strings (possibly more than 1 million), convert them to binary, and append them one after the other to a votes.egd file. That file should then be zipped and sent to the client. My idea is to stream the rows from Cassandra, converting each vote string to binary and writing it to a WriteStream.
I want to wrap this functionality in a Promise for easy use. I have the following:
streamVotesToFile(query, validVotesFileBasename) {
return new Promise((resolve, reject) => {
const writeStream = fs.createWriteStream(`${validVotesFileBasename}.egd`);
writeStream.on('error', (err) => {
logger.error(`Writestream ${validVotesFileBasename}.egd error`);
reject(err);
});
writeStream.on('drain', () => {
logger.info(`Writestream ${validVotesFileBasename}.egd drained`);
});
db.client.stream(query)
.on('readable', function() {
let row = this.read();
while (row) {
const envelope = new Buffer(row.vote, 'base64');
if(!writeStream.write(envelope + '\n')) {
logger.error(`Couldn't write vote`);
}
row = this.read()
}
})
.on('end', () => { // No more rows from Cassandra
writeStream.end();
writeStream.on('finish', () => {
logger.info(`Stream done writing`);
resolve();
});
})
.on('error', (err) => { // err is a response error from Cassandra
reject(err);
});
});
}
When I run this, it appends all the votes to a file and the download works fine. But I have a number of problems and questions:
1. If I make a request to the /export endpoint and this function runs, all other requests to the app are extremely slow, or don't finish at all until the export request is done. I'm guessing this is because the event loop is being hogged by all of the events from the Cassandra stream (thousands per second)?
2. All the votes seem to be written to the file fine, yet I get false from almost every writeStream.write() call and see the corresponding logged message (see code). Why?
3. I understand that I need to consider backpressure and the 'drain' event of the WritableStream, so ideally I would use pipe() and pipe the votes to a file, because that has built-in backpressure support (right?). But since I need to process each row (convert it to binary and possibly add data from other row fields in the future), how would I do that with pipe?

This is the perfect use case for a Transform stream:
const { Transform } = require('stream');
const myTransform = new Transform({
  // Rows arrive as objects and leave as plain bytes,
  // so only the writable side is in object mode
  writableObjectMode: true,
  transform(row, encoding, callback) {
    // Transform the row into something else
    const item = Buffer.from(row['vote'], 'base64');
    callback(null, item);
  }
});
client.stream(query, params, { prepare: true })
  .pipe(myTransform)
  .pipe(fileStream);
See more information on how to implement a TransformStream in the Node.js API Docs.
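To tie this back to the Promise wrapper in the question, here is a minimal sketch that combines a Transform with stream.pipeline, which forwards errors from every stage and respects backpressure. The function names and the fakeRowStream helper are hypothetical stand-ins; in the real app the row stream would presumably be db.client.stream(query).

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');
const { Readable, Transform, pipeline } = require('stream');

// Rows go in as objects, bytes come out: only the writable side is in object mode.
function createVoteDecoder() {
  return new Transform({
    writableObjectMode: true,
    transform(row, encoding, callback) {
      callback(null, Buffer.from(row.vote, 'base64'));
    }
  });
}

// Resolves once the file is fully written; rejects on the first error from any stage.
function streamVotesToFile(rowStream, filePath) {
  return new Promise((resolve, reject) => {
    pipeline(
      rowStream,
      createVoteDecoder(),
      fs.createWriteStream(filePath),
      (err) => (err ? reject(err) : resolve(filePath))
    );
  });
}

// Hypothetical stand-in for the Cassandra row stream: an object-mode Readable of rows.
function fakeRowStream(votes) {
  const rows = votes.map((v) => ({ vote: Buffer.from(v).toString('base64') }));
  return Readable.from(rows);
}
```

A false return value from write() is not a failure: the chunk is still buffered, and the value only signals that the caller should wait for 'drain' before writing more. pipe and pipeline do that waiting automatically, which is why the "Couldn't write vote" message in the question is misleading.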

Related

How to handle larger node js request?

How to send a response for large node js request? I have an excel sheet contains 10k items with an updated price. I am uploading the excel from the UI and sending all JSON data in the API for updating the price in the database. Now the request was timed out because of the processing time. How can I handle it in a proper way? How can I send the response after processing all the items?

Use fs.createReadStream to process large files.
For example, if you're working with CSV files, you can process each row like this:
const fs = require("fs")
const csv = require("csv-parser")
function processFile() {
fs.createReadStream("./my-file.csv")
.pipe(
csv()
)
.on("data", (row) => {
// csv-parser emits one object per row, keyed by column header
// process the row here
})
.on("end", () => {
console.log('Completed.')
// send response
})
}
// watch for file changes, e.g. a new upload
fs.watch('somedir', (eventType, filename) => {
if (filename === 'my-file.csv' && eventType === 'change') {
processFile();
}
});

How to 'pipe' oracle-db data from 'on data' event

I've been using node-oracledb for a few months and I've managed to achieve what I have needed to so far.
I'm currently working on a search app that could potentially return about 2m rows of data from a single call. To ensure I don't get a disconnect from the browser and the server, I thought I would try queryStream so that there is a constant flow of data back to the client.
I implemented the queryStream example as-is, and this worked fine for a few hundred thousand rows. However, when the number of returned rows exceeds one million, Node runs out of memory. By logging and watching both client and server log events, I can see that the client is way behind the server in terms of rows sent and received. So it looks like Node is falling over because it's buffering so much data.
It's worth noting that at this point, my selectstream implementation is within a req/res function called via Express.
To return the data, I do something like....
stream.on('data', function (data) {
rowcount++;
let obj = new myObjectConstructor(data);
res.write(JSON.stringify(obj.getJson()));
});
I've been reading about how streams and pipe can help with flow, so what I'd like to be able to do is to be able to pipe the results from the query to a) help with flow and b) to be able to pipe the results to other functions before sending back to the client.
E.g.
function getData(req, res){
var stream = myQueryStream(connection, query);
stream
.pipe(toSomeOtherFunction)
.pipe(yetAnotherFunction)
.pipe(res);
}
I've spent a few hours trying to find a solution or example that allows me to pipe results, but I'm stuck and need some help.
Apologies if I'm missing something obvious, but I'm still getting to grips with Node and especially streams.
Thanks in advance.

There's a bit of an impedance mismatch here. The queryStream API emits rows as JavaScript objects, but what you want to stream to the client is a JSON array. You basically have to add an open bracket at the beginning, a comma between rows, and a close bracket at the end.
I'll show you how to do this in a controller that uses the driver directly as you have done, instead of using separate database modules as I advocate in this series.
const oracledb = require('oracledb');
async function get(req, res, next) {
try {
const conn = await oracledb.getConnection();
const stream = await conn.queryStream('select * from employees', [], {outFormat: oracledb.OBJECT});
res.writeHead(200, {'Content-Type': 'application/json'});
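res.write('[');
let first = true;
stream.on('data', (row) => {
// Write the separator before every row except the first, so the array has no trailing comma
if (!first) {
res.write(',');
}
first = false;
res.write(JSON.stringify(row));
});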
stream.on('end', () => {
res.end(']');
});
stream.on('close', async () => {
try {
await conn.close();
} catch (err) {
console.log(err);
}
});
stream.on('error', async (err) => {
next(err);
try {
await conn.close();
} catch (err) {
console.log(err);
}
});
} catch (err) {
next(err);
}
}
module.exports.get = get;
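The handler above writes every row without checking the return value of write(), so a slow client can still cause rows to pile up in memory, which is the symptom described in the question. As a rough sketch of handling this by hand (writeJsonArray is a hypothetical helper, and dest stands for any Writable such as Express's res), the source can be paused until the destination emits 'drain':

```javascript
const { Readable, Writable } = require('stream');

// Writes each row from `source` to `dest` as one JSON array, pausing the
// source whenever dest.write() reports that its buffer is full.
function writeJsonArray(source, dest) {
  return new Promise((resolve, reject) => {
    let first = true;
    dest.write('[');
    source.on('data', (row) => {
      // Separator goes before every row except the first.
      const ok = dest.write((first ? '' : ',') + JSON.stringify(row));
      first = false;
      if (!ok) {
        source.pause();
        dest.once('drain', () => source.resume());
      }
    });
    source.on('end', () => dest.end(']'));
    source.on('error', reject);
    dest.on('error', reject);
    dest.on('finish', resolve);
  });
}
```

This is essentially what pipe does internally, so the manual version is mainly useful for understanding the mechanism or when the framing logic does not fit a Transform.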
Once you get the concepts, you can simplify things a bit with a reusable Transform class which allows you to use pipe in the controller logic:
const oracledb = require('oracledb');
const { Transform } = require('stream');
class ToJSONArray extends Transform {
constructor() {
super({objectMode: true});
this.push('[');
}
_transform (row, encoding, callback) {
if (this._prevRow) {
this.push(JSON.stringify(this._prevRow));
this.push(',');
}
this._prevRow = row;
callback(null);
}
_flush (done) {
if (this._prevRow) {
this.push(JSON.stringify(this._prevRow));
}
this.push(']');
delete this._prevRow;
done();
}
}
async function get(req, res, next) {
try {
const toJSONArray = new ToJSONArray();
const conn = await oracledb.getConnection();
const stream = await conn.queryStream('select * from employees', [], {outFormat: oracledb.OBJECT});
res.writeHead(200, {'Content-Type': 'application/json'});
stream.pipe(toJSONArray).pipe(res);
stream.on('close', async () => {
try {
await conn.close();
} catch (err) {
console.log(err);
}
});
stream.on('error', async (err) => {
next(err);
try {
await conn.close();
} catch (err) {
console.log(err);
}
});
} catch (err) {
next(err);
}
}
module.exports.get = get;

Rather than writing your own logic to create a JSON stream, you can use JSONStream to convert an object stream to (stringified) JSON before piping it to its destination (res, process.stdout, etc.). This saves the need to muck around with .on('data', ...) events.
In the example below, I've used pipeline from Node's stream module rather than the .pipe method: the effect is similar (with better error handling, I think). To get objects from oracledb.queryStream, you can specify the option {outFormat: oracledb.OUT_FORMAT_OBJECT}. Then you can make arbitrary modifications to the stream of objects produced. This can be done using a transform stream, made perhaps using through2-map, or, if you need to drop or split rows, through2. Below, the stream is sent to process.stdout after being stringified as JSON, but you could equally send it to Express's res.
require('dotenv').config() // config from .env file
const JSONStream = require('JSONStream')
const oracledb = require('oracledb')
const { pipeline } = require('stream')
const map = require('through2-map') // see https://www.npmjs.com/package/through2-map
oracledb.getConnection({
user: process.env.DB_USER,
password: process.env.DB_PASSWORD,
connectString: process.env.CONNECT_STRING
}).then(connection => {
pipeline(
connection.queryStream(`
select dual.*,'test' as col1 from dual
union select dual.*, :someboundvalue as col1 from dual
`
,{"someboundvalue":"test5"} // binds
,{
prefetchRows: 150, // for tuning
fetchArraySize: 150, // for tuning
outFormat: oracledb.OUT_FORMAT_OBJECT
}
)
,map.obj((row,index) => {
row.arbitraryModification = index
return row
})
,JSONStream.stringify() // false gives ndjson
,process.stdout // or send to express's res
,(err) => { if(err) console.error(err) }
)
})
// [
// {"DUMMY":"X","COL1":"test","arbitraryModification":0}
// ,
// {"DUMMY":"X","COL1":"test5","arbitraryModification":1}
// ]

Saving data stream into Cassandra using node.js

I have a data stream (via a Node EventEmitter) emitting data in JSON format and would like to save the stream into Cassandra as it gets emitted. Is there an elegant way to implement this functionality?
The driver that I'm using is nodejs-dse-driver and the Cassandra version is 3.11.1. Please suggest any recommended plugins that I can leverage to accomplish the above task.

This is a good use case for a Transform Stream.
If you have a true Readable stream, then you can pipe it into any Transform stream. I don't think an event emitter is a readable stream, though, so you may need to change your original data fetching implementation.
See the NodeJS documentation for implementation details.
https://nodejs.org/api/stream.html#stream_new_stream_transform_options
Something like this depending on your version of NodeJS.
const { Transform } = require('stream');
const myTransformStream = new Transform({
  objectMode: true,
  transform(row, encoding, callback) {
    // insert into Cassandra code here
    cassandra.execute(query, row, { prepare: true }, (err) => {
      // after the execute is done, callback to process more;
      // passing err along makes a failed insert surface as a stream error
      callback(err, row);
    });
  }
});
originalStream.pipe(myTransformStream);
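Since the source here is an EventEmitter rather than a stream, one possible bridge is sketched below. It assumes the emitter fires 'data', 'end', and 'error' events (hypothetical names) and that saveRow is a promise-returning function you supply, for instance a thin wrapper around the driver's execute call. A Writable is used as the sink because nothing needs to be read back out.

```javascript
const { EventEmitter } = require('events');
const { Readable, Writable, pipeline } = require('stream');

// Wrap an emitter in an object-mode Readable. An emitter cannot be paused,
// so rows queue inside the Readable until the sink catches up.
function emitterToReadable(emitter) {
  const readable = new Readable({ objectMode: true, read() {} });
  emitter.on('data', (row) => readable.push(row));
  emitter.once('end', () => readable.push(null));
  emitter.once('error', (err) => readable.destroy(err));
  return readable;
}

// Saves one row at a time; the next row is not requested until saveRow settles.
function createSaver(saveRow) {
  return new Writable({
    objectMode: true,
    write(row, encoding, callback) {
      saveRow(row).then(() => callback(), callback);
    }
  });
}

// Resolves when every emitted row has been saved; rejects on the first failure.
function saveAll(emitter, saveRow) {
  return new Promise((resolve, reject) => {
    pipeline(
      emitterToReadable(emitter),
      createSaver(saveRow),
      (err) => (err ? reject(err) : resolve())
    );
  });
}
```

Because the emitter keeps firing regardless of how fast rows are saved, this only bounds the number of concurrent inserts, not memory use; true backpressure needs a source that can be paused.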

You can read the data in chunks from your source and send it in parallel, for example (using the async library):
const limit = 10;
stream.on('readable', () => {
  let r;
  let rows = [];
  // Synchronous test function, as in async v2.x
  async.whilst(function condition() {
    // Start a fresh group; check the size before reading so no row is dropped
    rows = [];
    while (rows.length < limit && (r = stream.read()) != null) {
      rows.push(r);
    }
    return rows.length > 0;
  }, function eachGroup(next) {
    // we have a group of 10 rows or less to save
    // we can do it in a batch
    // or we can do it in parallel with async.each()
    async.each(rows, (r, eachCallback) => {
      // Adapt the row to parameters
      // For example: sample
      const params = r.split(',');
      client.execute(query, params, { prepare: true }, eachCallback);
    }, next);
  }, function groupFinished(err) {
    if (err) {
      // something happened when saving
      // TODO: do something with err
      return;
    }
    // This chunk of rows emitted by the stream was saved
  });
}).on('end', () => {
  // no more data from source
});
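The same grouping idea can be expressed without the async library by iterating the stream with for await, which only pulls more rows once the current group has been saved. This is a sketch under assumptions: saveInGroups and saveRow are hypothetical names, and saveRow is taken to return a promise.

```javascript
const { Readable } = require('stream');

// Reads rows from any Readable and saves them in groups of up to `limit`,
// running the saves within a group in parallel. Returns the number of groups.
async function saveInGroups(stream, saveRow, limit = 10) {
  let group = [];
  let groups = 0;
  for await (const row of stream) {
    group.push(row);
    if (group.length === limit) {
      await Promise.all(group.map((r) => saveRow(r)));
      group = [];
      groups++;
    }
  }
  // Save whatever is left over after the stream ends.
  if (group.length > 0) {
    await Promise.all(group.map((r) => saveRow(r)));
    groups++;
  }
  return groups;
}
```

If any save rejects, the loop exits and the stream is destroyed, so the error reaches the caller through the returned promise.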

Nodejs: How to send a readable stream to the browser

If I query the box REST API and get back a readable stream, what is the best way to handle it? How do you send it to the browser?? (DISCLAIMER: I'm new to streams and buffers, so some of this code is pretty theoretical)
Can you pass the readStream in the response and let the browser handle it? Or do you have to stream the chunks into a buffer and then send the buffer??
export function getFileStream(req, res) {
const fileId = req.params.fileId;
console.log('fileId', fileId);
req.sdk.files.getReadStream(fileId, null, (err, stream) => {
if (err) {
console.log('error', err);
return res.status(500).send(err);
}
res.type('application/octet-stream');
console.log('stream', stream);
return res.status(200).send(stream);
});
}
Will ^^ work, or do you need to do something like:
export function downloadFile(req, res) {
const fileId = req.params.fileId;
console.log('fileId', fileId);
req.sdk.files.getReadStream(fileId, null, (err, stream) => {
if (err) {
console.log('error', err);
return res.status(500).send(err);
}
const buffers = [];
console.log('stream', stream);
stream.on('data', (chunk) => {
buffers.push(chunk);
})
.on('end', function(){
const finalBuffer = Buffer.concat(buffers);
return res.status(200).send(finalBuffer);
});
});
}

The first example would work if you changed your theoretical line to:
- return res.status(200).send(stream);
+ res.writeHead(200, {header: here})
+ stream.pipe(res);
That's the nicest thing about Node streams. The second version would (in essence) work too, but it would buffer the whole file in memory unnecessarily.
If you'd like to check a working example, here's one I wrote based on scramjet, express and browserify:
https://github.com/MichalCz/scramjet/blob/master/samples/browser/browser.js
In it, the streams go from the server to the browser. With minor modifications it should fit your problem.

How to force Node.js Transform stream to finish?

Consider the following scenario. I have two Node Transform streams:
Transform stream 1
function T1(options) {
if (! (this instanceof T1)) {
return new T1(options);
}
Transform.call(this, options);
}
util.inherits(T1, Transform);
T1.prototype._transform = function(chunk, encoding, done) {
console.log("### Transforming in t1");
this.push(chunk);
done();
};
T1.prototype._flush = function(done) {
console.log("### Done in t1");
done();
};
Transform stream 2
function T2(options) {
if (! (this instanceof T2)) {
return new T2(options);
}
Transform.call(this, options);
}
util.inherits(T2, Transform);
T2.prototype._transform = function(chunk, encoding, done) {
console.log("### Transforming in t2");
this.push(chunk);
done();
};
T2.prototype._flush = function(done) {
console.log("### Done in t2");
done();
};
And, I'm wanting to apply these transform streams before returning a response. I have a simple HTTP server, and on each request, I fetch a resource and would like these transformations to be applied to this fetched resource and then send the result of the second transformation to the original response:
var options = require('url').parse('http://localhost:1234/data.json');
options.method = 'GET';
http.createServer(function(req, res) {
var req = http.request(options, function(httpRes) {
var t1 = new T1({});
var t2 = new T2({});
httpRes
.pipe(t1)
.pipe(t2)
.on('finish', function() {
// Do other stuff in here before sending request back
t2.pipe(res, { end : true });
});
});
req.end();
}).listen(3001);
Ultimately, the finish event never gets called, and the request hangs and times out because the response is never resolved. I've noticed that if I just pipe t2 into res, it seems to work fine:
.pipe(t1)
.pipe(t2)
.pipe(res, { end : true });
But, this scenario doesn't seem feasible because I need to do some extra work before returning the response.

This happens because you need to let Node know that the stream is being consumed somewhere. Otherwise the last stream just fills up its internal buffer and, if your data is larger than the highWaterMark option (historically 16 KiB by default for byte streams, or 16 items in object mode), halts, waiting for the data to be consumed.
There are three ways of consuming a stream in full:
piping it to a writable stream (what you did in the second part of your question)
reading consecutive chunks by calling the read method of the stream
listening for "data" events (essentially stream.on("data", someFunc)).
The last option is the quickest, but it consumes the stream without regard for memory usage.
I'd also note that relying on the "finish" event can be misleading: it fires when the writable side has accepted all of its input, not when the output has been read. Since a Transform stream is readable as well, it's much better to use the "end" event.
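As an illustration of the third option combined with the "end" event, the sketch below consumes the last transform through "data" events and only resolves once "end" fires, at which point any extra work can be done before the response is sent. The collect and upperCase helpers are hypothetical and exist only for the example.

```javascript
const { Readable, Transform, PassThrough, pipeline } = require('stream');

// Example transform: upper-cases every chunk.
function upperCase() {
  return new Transform({
    transform(chunk, encoding, callback) {
      callback(null, chunk.toString().toUpperCase());
    }
  });
}

// Pipes source through t1 and t2, consuming t2 via 'data' so it never stalls,
// and resolves with the complete output once t2 emits 'end'.
function collect(source, t1, t2) {
  return new Promise((resolve, reject) => {
    const chunks = [];
    const last = pipeline(source, t1, t2, (err) => {
      if (err) reject(err);
    });
    last.on('data', (chunk) => chunks.push(chunk));
    last.on('end', () => resolve(Buffer.concat(chunks)));
  });
}
```

In the server from the question, this would look roughly like collect(httpRes, t1, t2).then((body) => res.end(body)). Note that it holds the whole transformed body in memory, so it only suits responses of modest size.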
