Node.js forking stream, redundant data

I'm new to Node.js streams, and I don't understand why, when I try to fork a stream after a transform, I get the same data multiple times. This is the example:
import { createReadStream, createWriteStream } from "fs";
import { Transform } from "stream";
const inputStream = createReadStream("assets/input.txt");
const outputStream1 = createWriteStream("assets/output1.txt");
const outputStream2 = createWriteStream("assets/output2.txt");
const t1 = new Transform({
transform(chunk, encoding, callback) {
this.push(chunk.toString().toUpperCase());
callback();
}
});
inputStream.pipe(t1).pipe(outputStream1);
inputStream.pipe(t1).pipe(outputStream2);
I would expect to get the data just one time. But these are the resulting files:
Input:
hello world
output1:
HELLO WORLDHELLO WORLD
output2:
HELLO WORLDHELLO WORLD
Thank you in advance for the help.

Related

stream array of objects nested in objects in nodejs

I fetch data from an API using Node.js.
I get a response with the following structure (the response is saved by a stream into a JSON file):
{"data":{"total":40,"data":[{"date":"20220914","country":"PL","data1":1,"data2":2,"data3":3,"data4":"4"},{"date":"20220914","country":"DE","data1":21,"data2":22,"data3":23,"data4":"24"},{"date":"20220914","country":"DE","data1":21,"data2":22,"data3":23,"data4":"24"},{"date":"20220914","country":"PL","data1":1,"data2":2,"data3":3,"data4":"4"}], "total_page":1,"page":1,"page_size":100},"success":true,"code":"0","request_id":"123"}
Now I would like to read the file as a stream and apply some transforms to each object; however, I am not able to retrieve it object by object.
The problem is that the array I'm interested in is nested under the .data.data keys, and I don't know how to get each element of the array one by one and modify it.
import { pipeline, Transform } from 'stream';
import { promisify } from 'util';
import fs from 'fs';
public async processData() {
await this.api.getReport();
const reader = fs.createReadStream('./response.json');
const writer = fs.createWriteStream('properFormat.txt');
const asyncPipeline = promisify(pipeline);
const newFormatedData = (object: Record<string, string>) => {
// Here I would like to take into consideration only values with, for example, the keys: date, country and data1
console.log(object.toString());
};
const formatData = new Transform({
objectMode: true,
transform(chunk, encoding, done) {
this.push(newFormatedData(chunk));
done();
},
});
asyncPipeline(reader, formatData, writer);
}
Thank you for any hints on this!

IPFS streams not buffers

I'm trying to create an API GET URL that can be called with <img src='...'/> and will actually load the image from IPFS.
I'm getting the file from IPFS, and I can send it as a buffer via Fastify, but I can't send it as a stream.
Here's the working buffer version using ipfs.cat:
import { concat as uint8ArrayConcat } from "uint8arrays/concat";
import all from "it-all";
fastify.get(
"/v1/files/:username/:cid",
async function (request: any, reply: any) {
const { cid }: { cid: string } = request.params;
const ipfs = create();
const data = uint8ArrayConcat(await all(ipfs.cat(cid)));
reply.type("image/png").send(data);
}
);
Docs for ipfs cat
Docs for fastify reply buffers
I also tried sending it as a stream to try and not load the file into the server's memory...
import { concat as uint8ArrayConcat } from "uint8arrays/concat";
import all from "it-all";
import { Readable } from "stream";
...
fastify.get(
"/v1/files/:username/:cid",
async function (request: any, reply: any) {
const { cid }: { cid: string } = request.params;
const ipfs = create();
const bufferToStream = async (buffer: any) => {
const readable = new Readable({
read() {
this.push(buffer);
this.push(null);
},
});
return readable;
};
const data = uint8ArrayConcat(await all(ipfs.cat(cid)));
const str = await bufferToStream(data);
reply.send(str);
}
);
With a new error
Error [ERR_STREAM_WRITE_AFTER_END]: write after end
Here I'm trying to push into the stream
import { concat as uint8ArrayConcat } from "uint8arrays/concat";
import all from "it-all";
import { Readable } from "stream";
fastify.get(
"/v1/files/:username/:cid",
async function (request: any, reply: any) {
const { cid }: { cid: string } = request.params;
const ipfs = create();
const myStream = new Readable();
myStream._read = () => {};
const pushChunks = async () => {
for await (const chunk of ipfs.cat(cid)) {
myStream.push(chunk);
}
};
pushChunks();
reply.send(myStream);
}
);
The error now is:
INFO (9617): stream closed prematurely
And trying to dump it all into the stream:
import { concat as uint8ArrayConcat } from "uint8arrays/concat";
import all from "it-all";
import { Readable } from "stream";
fastify.get(
"/v1/files/:username/:cid",
async function (request: any, reply: any) {
const { cid }: { cid: string } = request.params;
const ipfs = create();
var myStream = new Readable();
myStream._read = () => {};
myStream.push(uint8ArrayConcat(await all(ipfs.cat(cid))));
myStream.push(null);
reply.send(myStream);
}
);
With the error:
WARN (14295): response terminated with an error with headers already sent
Is there any benefit to converting it to a stream? Hasn't IPFs already loaded it into memory??
The ipfs module returns many chunks as a byte array.
So a file is the sum of these chunks.
Now, if you push all these chunks into an array and then call uint8ArrayConcat, all the chunks are actually in your server's memory.
So, if the file is 10 GB, your server has a 10 GB byte array in memory.
Since this is surely unwanted, you should push every chunk from the IPFS file to the response. By doing this, each chunk passes transiently through the server's memory but is not persisted. So, in this case, you will not have the 10 GB file in memory, only a tiny slice of it.
Since ipfs.cat returns an async iterator, you could handle it manually or use something like async-iterator-to-stream to write:
const ipfsStream = asyncIteratorToStream(ipfs.cat(cid))
return reply.send(ipfsStream)
As a follow-up, I share this awesome resource about Node.js streams and buffers.

Using csv-parse with highlandjs

I would like to do a bit of parsing on CSV files to convert them to JSON and extract data from them. I'm using highland as a stream-processing library. I am creating an array of CSV parsing streams using:
import { readdir as readdirCb, createReadStream } from 'fs';
import { promisify } from 'util';
import _ from 'highland';
import parse from 'csv-parse';
const readdir = promisify(readdirCb);
const LOGS_DIR = './logs';
const options = '-maxdepth 1';
async function main() {
const files = await readdir(LOGS_DIR)
const stream = _(files)
.map(filename => createReadStream(`${LOGS_DIR}/${filename}`))
.map(parse)
}
main();
I have tried to use stream like:
const stream = _(files)
.map(filename => createReadStream(`${LOGS_DIR}/${filename}`))
.map(parse)
.each(stream => {
stream.on('parseable', () => {
let record
while (record = stream.read()) { console.log(record) }
})
})
This does not produce any records. I am not sure how to proceed to receive the JSON for each row of each CSV file.
EDIT:
Writing a function like this works for an individual file:
import parse from 'csv-parse';
import transform from 'stream-transform';
import { createReadStream } from 'fs';
export default function retrieveApplicationIds(filename) {
console.log('Parsing file', filename);
return createReadStream(filename).pipe(parser).pipe(getApplicationId).pipe(recordUniqueId);
}
Edit 2:
I have tried using the concat streams approach:
const LOGS_DIR = './logs';
function concatStreams(streamArray, streamCounter = streamArray.length) {
return streamArray.reduce((mergedStream, stream) => {
// pipe each stream of the array into the merged stream
// prevent the automated 'end' event from firing
mergedStream = stream.pipe(mergedStream, { end: false });
// rewrite the 'end' event handler
// Every time one of the stream ends, the counter is decremented.
// Once the counter reaches 0, the mergedStream can emit its 'end' event.
stream.once('end', () => --streamCounter === 0 && mergedStream.emit('end'));
return mergedStream;
}, new PassThrough());
}
async function main() {
const files = await readdir(LOGS_DIR)
const streams = files.map(parseFile);
const combinedStream = concatStreams(streams);
combinedStream.pipe(process.stdout);
}
main();
When I use this, I get the error:
(node:1050) MaxListenersExceededWarning: Possible EventEmitter memory leak detected. 11 unpipe listeners added to [Transformer]. Use emitter.setMaxListeners() to increase limit

Node.js copy a stream into a file without consuming

Given a function that parses incoming streams:
async onData(stream, callback) {
const parsed = await simpleParser(stream)
// Code handling parsed stream here
// ...
return callback()
}
I'm looking for a simple and safe way to 'clone' that stream, so I can save it to a file for debugging purposes, without affecting the code. Is this possible?
Same question in fake code: I'm trying to do something like this. Obviously, this is a made-up example and doesn't work.
const fs = require('fs')
const wstream = fs.createWriteStream('debug.log')
async onData(stream, callback) {
const debugStream = stream.clone(stream) // Fake code
wstream.write(debugStream)
const parsed = await simpleParser(stream)
// Code handling parsed stream here
// ...
wstream.end()
return callback()
}
No, you can't clone a readable stream without consuming it. However, you can pipe it into two new streams: one for creating the file and the other as the 'clone'.
The code is below:
let Readable = require('stream').Readable;
var stream = require('stream')
var s = new Readable()
s.push('beep')
s.push(null)
var stream1 = s.pipe(new stream.PassThrough())
var stream2 = s.pipe(new stream.PassThrough())
// here use stream1 for creating file, and use stream2 just like s' clone stream
// I just print them out for a quick show
stream1.pipe(process.stdout)
stream2.pipe(process.stdout)
I've tried to implement the solution provided by @jiajianrong but was struggling to get it to work with a createReadStream, because the Readable throws an error when I try to push the createReadStream directly, like:
s.push(createReadStream())
To solve this issue I have used a helper function to transform the stream into a buffer.
function streamToBuffer (stream: any) {
const chunks: Buffer[] = []
return new Promise((resolve, reject) => {
stream.on('data', (chunk: any) => chunks.push(Buffer.from(chunk)))
stream.on('error', (err: any) => reject(err))
stream.on('end', () => resolve(Buffer.concat(chunks)))
})
}
Below is the solution I found, using one pipe to generate a hash of the stream and the other pipe to upload the stream to cloud storage.
import stream from 'stream'
const Readable = require('stream').Readable
const s = new Readable()
s.push(await streamToBuffer(createReadStream()))
s.push(null)
const fileStreamForHash = s.pipe(new stream.PassThrough())
const fileStreamForUpload = s.pipe(new stream.PassThrough())
// Generating file hash
const fileHash = await getHashFromStream(fileStreamForHash)
// Uploading stream to cloud storage
await BlobStorage.upload(fileName, fileStreamForUpload)
My answer is mostly based on @jiajianrong's answer.

Node js, piping pdfkit to a memory stream

I am using pdfkit on my Node server, typically creating PDF files and then uploading them to S3.
The problem is that the pdfkit examples pipe the PDF doc into a Node write stream, which writes the file to disk. I followed the example and it worked correctly; however, my requirement now is to pipe the PDF doc into a memory stream rather than saving it to disk (I am uploading to S3 anyway).
I've followed some Node memory-stream approaches, but none of them seem to work with the PDF pipe; I could only write strings to memory streams.
So my question is: how can I pipe the pdfkit output to a memory stream (or something similar) and then read it as an object to upload to S3?
var fsStream = fs.createWriteStream(outputPath + fileName);
doc.pipe(fsStream);
An updated answer for 2020. There is no need to introduce a new memory stream because "PDFDocument instances are readable Node streams".
You can use the get-stream package to make it easy to wait for the document to finish before passing the result back to your caller.
https://www.npmjs.com/package/get-stream
const PDFDocument = require('pdfkit')
const getStream = require('get-stream')
const pdf = async () => {
const doc = new PDFDocument()
doc.text('Hello, World!')
doc.end()
return await getStream.buffer(doc)
}
// Caller could do this:
const pdfBuffer = await pdf()
const pdfBase64string = pdfBuffer.toString('base64')
You don't have to return a buffer if your needs are different. The get-stream readme offers other examples.
There's no need to use an intermediate memory stream¹ – just pipe the pdfkit output stream directly into an HTTP upload stream.
In my experience, the AWS SDK is garbage when it comes to working with streams, so I usually use request.
var upload = request({
method: 'PUT',
url: 'https://bucket.s3.amazonaws.com/doc.pdf',
aws: { bucket: 'bucket', key: ..., secret: ... }
});
doc.pipe(upload);
¹ In fact, it is usually undesirable to use a memory stream, because that means buffering the entire thing in RAM, which is exactly what streams are supposed to avoid!
You could try something like this, and upload it to S3 inside the end event.
var doc = new pdfkit();
var MemoryStream = require('memorystream');
var memStream = new MemoryStream(null, {
readable : false
});
doc.pipe(memStream);
doc.on('end', function () {
var buffer = Buffer.concat(memStream.queue);
awsservice.putS3Object(buffer, fileName, fileType, folder).then(function () { }, reject);
})
A tweak of @bolav's answer worked for me when trying to work with pdfmake rather than pdfkit. First you need to add memorystream to your project using npm or yarn.
const MemoryStream = require('memorystream');
const PdfPrinter = require('pdfmake');
const pdfPrinter = new PdfPrinter();
const docDef = {};
const pdfDoc = pdfPrinter.createPdfKitDocument(docDef);
const memStream = new MemoryStream(null, {readable: false});
const pdfDocStream = pdfDoc.pipe(memStream);
pdfDoc.end();
pdfDocStream.on('finish', () => {
console.log(Buffer.concat(memStream.queue));
});
My code to return a base64 for pdfkit:
import * as PDFDocument from 'pdfkit'
import getStream from 'get-stream'
const pdf = {
createPdf: async (text: string) => {
const doc = new PDFDocument()
doc.fontSize(10).text(text, 50, 50)
doc.end()
const data = await getStream.buffer(doc)
let b64 = Buffer.from(data).toString('base64')
return b64
}
}
export default pdf
Thanks to Troy's answer, mine worked with get-stream as well. The difference was that I did not convert it to a base64 string, but rather uploaded it to AWS S3 as a buffer.
Here is my code:
import PDFDocument from 'pdfkit'
import getStream from 'get-stream';
import { PutObjectCommand } from '@aws-sdk/client-s3';
import s3Client from 'your s3 config file';
const pdfGenerator = () => {
const doc = new PDFDocument();
doc.text('Hello, World!');
doc.end();
return doc;
}
const uploadFile = async () => {
const pdf = pdfGenerator();
const pdfBuffer = await getStream.buffer(pdf)
await s3Client.send(
new PutObjectCommand({
Bucket: 'bucket-name',
Key: 'filename.pdf',
Body: pdfBuffer,
ContentType: 'application/pdf',
})
);
}
uploadFile()