I am unable to understand how the event loop is processing my snippet. What I am trying to achieve is to:
1. read from a csv
2. download a resource found in the csv
3. upload it to S3
4. write it into a new csv file
const readAndUpload = () => {
    fs.createReadStream('filename.csv')
        .pipe(csv())
        .on('data', ((row: any) => {
            const file = fs.createWriteStream("file.jpg");
            var url = new URL(row.imageURL)
            // choose whether to make an http or https request
            let client = (url.protocol=="https:") ? https : http
            const request = client.get(row.imageURL, function(response:any) {
                // file save
                response.pipe(file);
                console.log('file saved')
                let filePath = "file.jpg";
                let params = {
                    Bucket: 'bucket-name',
                    Body : fs.createReadStream(filePath),
                    Key : "filename.jpg"
                };
                // upload to s3
                s3.upload(params, function (err: any, data: any) {
                    //handle error
                    if (err) {
                        console.log("Error", err);
                    }
                    //success
                    if (data) {
                        console.log("Uploaded in:", data.Location);
                        row.imageURL = data.Location
                        writeData.push(row)
                        // console.log(writeData)
                    }
                });
            });
        }))
        .on('end', () => {
            console.log("done reading")
            const csvWriter = createCsvWriter({
                path: 'out.csv',
                header: [
                    {id: 'id', title: 'some title'}
                ]
            });
            csvWriter
                .writeRecords(writeData)
                .then(() => console.log("The CSV file was written successfully"))
        })
}
Going by the log statements that I have added, "done reading" and "The CSV file was written successfully" are printed before "file saved". My understanding was that the end event is called after the data event, so I am unsure of where I am going wrong.
Thank you for reading!
I'm not sure if this is part of the problem or not, but you've got an extra set of parens in this part of the code. Change this:
.on('data', ((row: any) => {
.....
})).on('end', () => {
to this:
.on('data', (row: any) => {
.....
}).on('end', () => {
And, if the event handlers are set up properly, your .on('data', ...) event handler gets called before the .on('end', ....) for the same stream. If you put this:
console.log('at start of data event handler');
as the first line in that event handler, you will see it get called first.
But your data event handler uses multiple asynchronous calls, and nothing in your code makes the end event wait for all that processing to be done. So, since that processing takes a while, it's natural that the end event occurs before you're done running all the asynchronous code started from the data event.
In addition, if you can ever have more than one data event (which you normally would), you're going to have multiple data events in flight at the same time, and since you're using a fixed filename, they will probably be overwriting each other.
The usual way to solve something like this is to use stream.pause() to pause the readstream at the start of the data event processing, and then, when all your asynchronous work is done, call stream.resume() to let it start going again.
You will need to get the right stream in order to pause and resume. You could do something like this:
let stream = fs.createReadStream('filename.csv')
    .pipe(csv());

stream.on('data', (row: any) => {
    stream.pause();
    ....
});
Then, way inside your s3.upload() callback, you can call stream.resume(). You will also need much, much better error handling than you have now, or things will just get stuck if you get an error.
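As a rough sketch of that pattern (processRow() here is a hypothetical helper standing in for the download-and-upload work from the question, not an existing function):

let stream = fs.createReadStream('filename.csv')
    .pipe(csv());

stream.on('data', (row: any) => {
    // hold further rows until this one has been fully processed
    stream.pause();

    processRow(row)                       // hypothetical: download the image, upload it to S3
        .then((newLocation: string) => {
            row.imageURL = newLocation;
            writeData.push(row);
        })
        .catch((err: any) => {
            // without this, an error would leave the stream paused forever
            console.log('row failed', err);
        })
        .then(() => {
            // always let the next row through, success or failure
            stream.resume();
        });
});

stream.on('end', () => {
    // only fires once the last paused row has resumed the stream,
    // so it is safe to write out.csv here
    console.log('done reading');
});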
It also looks like you have other concurrency issues where you call:
response.pipe(file);
and then attempt to use the file without actually waiting for that .pipe() operation to be done (which is also asynchronous). Overall, this whole logic really needs a major cleanup. I don't understand exactly what you're trying to do in all the different steps well enough to write a totally clean and simpler version.
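As one illustration of that last point, here is a sketch of a download step that waits for the write stream's finish event before the file is read back for s3.upload(); the file names and bucket are just the placeholders from the question:

function downloadImage(imageURL: string, filePath: string) {
    return new Promise<void>((resolve, reject) => {
        const url = new URL(imageURL);
        const client = (url.protocol == "https:") ? https : http;
        client.get(imageURL, (response: any) => {
            const file = fs.createWriteStream(filePath);
            response.pipe(file);
            file.on('finish', () => resolve());   // only now is the image fully on disk
            file.on('error', reject);
        }).on('error', reject);
    });
}

// inside the data handler, the upload only starts once the download has finished:
downloadImage(row.imageURL, "file.jpg")
    .then(() => {
        s3.upload({
            Bucket: 'bucket-name',
            Key: "filename.jpg",
            Body: fs.createReadStream("file.jpg")
        }, function (err: any, data: any) {
            // same error/success handling as in the question
        });
    });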
Related
I'm quite new to Node.js. In the following code I am getting JSON data from an API.
let data_json = ''; // global variable

app.get('/', (req, res) => {
    request('http://my-api.com/data-export.json', (error, response, body) => {
        data_json = JSON.parse(body);
        console.log( data_json ); // data prints successfully
    });
    console.log(data_json, 'Data Test - outside request code'); // no data is printed
})
data_json is my global variable and I assign it the data returned by the request call. Within that callback the JSON data prints just fine, but when I try printing the same data outside the request call, nothing prints out.
What mistake am I making?
Instead of waiting for request to resolve (get data from your API), Node.js will execute the code outside the callback first, and it will print nothing because there is still nothing at the moment of execution. Only after Node gets data from your API (which can take a few milliseconds) will it execute the code inside the callback. This is because Node.js is asynchronous and non-blocking: it will not halt the rest of the code until your API returns data, it will just keep going and finish the callback later when it gets the response.
It's good practice to do all of the data manipulation you want inside the callback function; unfortunately, you can't rely on the structure you have.
Here's an example of your code, just commented out the order of operations:
let data_json = ''; // global variable

app.get('/', (req, res) => {
    // Node.js STARTS executing this code
    request('http://my-api.com/data-export.json', (error, response, body) => {
        // Node.js executes this code last, after the data is loaded from the server
        data_json = JSON.parse(body);
        console.log( data_json );
        // You should do all of your data_json manipulation here,
        // e.g. saving stuff to the database, processing data, just the usual logic ya know
    });
    // Node.js executes this code 2nd, before your server responds with data,
    // because it doesn't want to block the entire code until it gets a response
    console.log(data_json, 'Data Test - outside request code');
})
So let's say you want to make another request with the data from the first request - you will have to do something like this:
request('https://your-api.com/export-data.json', (err, res, body) => {
    request('https://your-api.com/2nd-endpoint.json', (err, res, body) => {
        // Process data and repeat
    })
})
As you can see, that pattern can become very messy very quickly - this is called callback hell. To avoid having a lot of nested requests, there is syntactic sugar that makes this code look far more fancy and maintainable: the async/await pattern. Here's how it works:
let data_json = ''

app.get('/', async (req, res) => {
    try {
        // note: for this await to work, request must return a promise;
        // the plain request module takes a callback instead (see request-promise-native in the next answer)
        let response = await request('https://your-api.com/endpoint')
        data_json = response.body
    } catch(error) {
        // Handle error how you see fit
    }
    console.log(data_json) // It will work
})
This code does the same thing as the one you have, but the difference is that you can make as many await request(...) calls as you want, one after another, with no nesting.
The only difference is that you have to declare that your function is asynchronous, async (req, res) => {...}, and that every let var = await request(...) needs to be inside a try-catch block. This is so you can catch your errors. You can even have requests inside the catch block if you think that's necessary.
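For example, chaining the two calls from earlier stays completely flat. This sketch assumes a promise-returning client such as request-promise-native (mentioned in the next answer), since the plain request module takes a callback and cannot be awaited directly:

const rp = require('request-promise-native')

app.get('/', async (req, res) => {
    try {
        // each await pauses only this handler, not the whole server
        const first = await rp({ uri: 'https://your-api.com/export-data.json', json: true })
        const second = await rp({ uri: 'https://your-api.com/2nd-endpoint.json', json: true })
        res.json({ first, second })
    } catch (error) {
        res.status(500).send('API request failed')
    }
})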
Hopefully this helped a bit :)
The console.log occurs before your request completes; check out the ways to get asynchronous data: callbacks, promises or async-await. Node.js APIs are async (most of them), so the outer console.log will be executed before the request API call completes.
let data_json = ''; // global variable

app.get('/', (req, res) => {
    let pr = new Promise(function(resolve, reject) {
        request('http://my-api.com/data-export.json', (error, response, body) => {
            if (error) {
                reject(error)
            } else {
                data_json = JSON.parse(body);
                console.log(data_json); // data prints successfully
                resolve(data_json)
            }
        });
    })
    pr.then(function(data) {
        // data also will have data_json
        // handle response here
        console.log(data_json); // data prints successfully
    }).catch(function(err) {
        // handle error here
    })
})
If you don't want to create a promise wrapper, you can use request-promise-native (uses native Promises) created by the Request module team.
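A sketch of the same call with it, where json: true parses the body and the returned promise resolves with it:

const rp = require('request-promise-native');

rp({ uri: 'http://my-api.com/data-export.json', json: true })
    .then(function(data) {
        // data is the parsed JSON body, same as data_json above
        console.log(data);
    })
    .catch(function(err) {
        // handle error here
    });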
Learn callbacks, promises and of course async-await.
Using Electron's net module, the aim is to fetch a resource and, once the response is received, to pipe it to a writeable stream like so:
const stream = await fetchResource('someUrl');
stream.pipe(fs.createWriteStream('./someFilepath'));
A simplified implementation of fetchResource is as follows:
import { net } from 'electron';

async function fetchResource(url) {
    return new Promise((resolve, reject) => {
        const data = [];
        const request = net.request(url);
        request.on('response', response => {
            response.on('data', chunk => {
                data.push(chunk);
            });
            response.on('end', () => {
                // Maybe do some other stuff with data...
            });
            // Return the response to then pipe...
            resolve(response);
        });
        request.end();
    });
}
The response ends up being an instance of IncomingMessage, which implements a Readable Stream interface according to the node docs, so it should be able to be piped to a write stream.
The primary issue is that there ends up being no data in the stream that gets piped through 😕
Answering my own question: the issue is that the stream is being consumed from multiple places, the resolved promise and the 'data' event listener. The 'data' listener was draining all the data before the consumer of the resolved promise could get to it.
A solution is to fork the stream into a new one that won't compete with the original if more than one consumer tries to read from it.
import stream from 'stream';
// ...make a request and get a response stream, then fork the stream...
const streamToResolve = response.pipe(new stream.PassThrough());
// Listen to events on response and pipe from it
// ...
// Resolve streamToResolve and separately pipe from it
// ...
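A fuller sketch of that idea inside fetchResource, using the same simplified setup from the question (error handling kept minimal):

import { net } from 'electron';
import stream from 'stream';

async function fetchResource(url) {
    return new Promise((resolve, reject) => {
        const request = net.request(url);
        request.on('response', response => {
            // Fork the response: the PassThrough branch goes to the caller for piping,
            // while the original response is still consumed by the listeners below.
            const streamToResolve = response.pipe(new stream.PassThrough());

            const data = [];
            response.on('data', chunk => {
                data.push(chunk);
            });
            response.on('end', () => {
                // Maybe do some other stuff with data...
            });

            resolve(streamToResolve);
        });
        request.on('error', reject);
        request.end();
    });
}

// The caller pipes the resolved stream exactly as before:
// const resourceStream = await fetchResource('someUrl');
// resourceStream.pipe(fs.createWriteStream('./someFilepath'));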
I'm using React, Electron, Node.js, async.js, Redux and thunk.
I wrote the following code, which is supposed to download a list of files and write them to disk. When the user presses a button I call this action creator:
export function downloadList(pack) {
    return (dispatch, getState) => {
        const { downloadManager } = getState();
        async.each(downloadManager.downloadQueue[pack].libs, async (url, callback) => {
            const filename = url.split('/').pop().split('#')[0].split('?')[0];
            await downloadFile(url, `dl/${filename}`);
            callback();
        }, (err) => {
            if (err) {
                console.log('A file failed to process');
            } else {
                dispatch({
                    type: DOWNLOAD_COMPLETED,
                    packName: pack
                });
            }
        });
    };
}

async function downloadFile(url, path) {
    const file = fs.createWriteStream(path);
    const request = https.get(url, (response) => {
        response.pipe(file);
        file.on('finish', () => {
            file.close();
        });
    }).on('error', (err) => { // Handle errors
        fs.unlink(path); // Delete the file async. (But we don't check the result)
    });
}
It does what it's supposed to do, but while it does that it blocks the entire UI. I really can't understand why this is happening, since if I use a setTimeout with a 3000ms delay inside the async.each it doesn't block the UI.
Another strange behaviour is that if I use the eachLimit function of asyncJS it only downloads the limit of files: if I want to download 100 files but set eachLimit to 10 in parallel, it just downloads the first 10 files and then stops. Can you enlighten me about this?
I wanted to use axios to download the files since it doesn't need to know whether the urls are http or https, but I can't find any resource on using axios with a stream responseType.
I can answer the first part. Pretty much every existing implementation of JavaScript runs on one thread. This means that the runtime is concurrent but not parallel, i.e. the runtime does one and exactly one thing at a time. This means that if there is a function call that takes a while, it will block everything else. Therefore, something in the downloadList function is blocking the event loop. However, if you use setTimeout, then that work is pushed onto the message queue, which will unblock the event loop and allow the UI to be rendered. For more information on the event loop check out this video.
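As a tiny illustration of that difference (bigList and doHeavyWorkSync are hypothetical, not code from the question):

// Blocks: the loop runs to completion before the renderer gets a chance to paint
for (const item of bigList) {
    doHeavyWorkSync(item);
}

// Doesn't block: each piece of work is queued as its own task on the message queue,
// so rendering and other events can run in between
for (const item of bigList) {
    setTimeout(() => doHeavyWorkSync(item), 0);
}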
I have a get request in Node that successfully receives data from an API.
When I pipe that response directly to a file like this, it works; the file created is a valid, readable pdf (as I expect to receive from the API).
var http = require('request');
var fs = require('fs');

http.get(
    {
        url:'',
        headers:{}
    })
    .pipe(fs.createWriteStream('./report.pdf'));
Simple. However, the file gets corrupted if I use the event emitters of the request like this:
http.get(
    {
        url:'',
        headers:{}
    })
    .on('error', function (err) {
        console.log(err);
    })
    .on('data', function(data) {
        file += data;
    })
    .on('end', function() {
        var stream = fs.createWriteStream('./report.pdf');
        stream.write(file, function() {
            stream.end();
        });
    });
I have tried all manner of writing this file and it always ends in a totally blank pdf - the only time the pdf is valid is via the pipe method.
When I console.log the events, the sequence seems to be correct - i.e. all chunks are received and then the end event fires at the end.
This makes it impossible to do anything after the pipe. What is pipe doing differently from the write stream?
I assume that you initialize file as a string:
var file = '';
Then, in your data handler, you add the new chunk of data to it:
file += data;
However, this performs an implicit conversion to (UTF-8-encoded) strings. If the data is actually binary, like with a PDF, this will invalidate the output data.
Instead, you want to collect the data chunks, which are Buffer instances, and use Buffer.concat() to concatenate all those buffers into one large (binary) buffer:
var file = [];
...
.on('data', function(data) {
    file.push(data);
})
.on('end', function() {
    file = Buffer.concat(file);
    ...
});
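Put together, a version of the event-based code that keeps the data binary might look like this (the empty url and headers are the placeholders from the question):

var http = require('request');
var fs = require('fs');

var chunks = [];

http.get(
    {
        url:'',
        headers:{}
    })
    .on('error', function (err) {
        console.log(err);
    })
    .on('data', function(data) {
        chunks.push(data);                // keep the raw Buffer chunks
    })
    .on('end', function() {
        var file = Buffer.concat(chunks); // one binary buffer, no string conversion
        fs.writeFile('./report.pdf', file, function(err) {
            if (err) console.log(err);
        });
    });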
If you wanted to do something after the file is done being written by pipe, you can add an event listener for finish on the object returned by pipe.
.pipe(fs.createWriteStream('./report.pdf'))
.on('finish', function done() { /* the file has been written */ });
Source: https://nodejs.org/api/stream.html#stream_event_finish
I'm using SendGrid for receiving files via email. SendGrid parses the incoming emails and sends the files in a multipart form to an endpoint I have set up.
I don't want the files on my local disk, so I stream them straight to Amazon S3. This works perfectly.
But before I can stream to S3 I need to get hold of the destination mail address so I can work out the correct S3 folder. This is sent in a field named "to" in the form post. Unfortunately this field sometimes arrives after the files do, hence I need a way to await the to-field before I'm ready to take the stream.
I thought I could wrap the onField in a promise and await the to-field from within the onFile. But this concept seems to lock itself up when the field arrives after the file.
I'm new to both streams and promises. I would really appreciate it if someone could tell me how to do this.
This is the non working pseudoish code:
function sendGridUpload(req, res, next) {
    var busboy = new Busboy({ headers: req.headers });

    var awaitEmailAddress = new Promise(function(resolve, reject) {
        busboy.on('field', function(fieldname, val, fieldnameTruncated, valTruncated) {
            if(fieldname === 'to') {
                resolve(val);
            } else {
                return;
            }
        });
    });

    busboy.on('file', function(fieldname, file, filename, encoding, mimetype) {
        function findInbox(emailAddress) {
            console.log('Got email address: ' + emailAddress);
            // ..find the inbox and generate an s3Key
            return s3Key;
        }

        function saveFileStream(s3Key) {
            // ..pipe the file directly to S3
        }

        awaitEmailAddress.then(findInbox)
            .then(saveFileStream)
            .catch(function(err) {
                log.error(err)
            });
    });

    req.pipe(busboy);
}
I finally got this working. The solution is not very pretty, and I have actually switched to another concept (described at the end of the post).
To buffer the incoming data until the "to"-field arrives I used stream-buffers by #samcday. When I get hold of the to-field I release the readable stream to the pipes lined up for the data.
Here is the code (some parts omitted, but essential parts are there).
var streamBuffers = require('stream-buffers');

function postInboundMail(req, res, next) {
    var busboy = new Busboy({ headers: req.headers});

    //Sometimes the fields arrive after the files are streamed.
    //We need the "to"-field before we are ready for the files,
    //therefore the onField is wrapped in a promise which gets
    //resolved when the to-field arrives
    var awaitEmailAddress = new Promise(function(resolve, reject) {
        busboy.on('field', function(fieldname, val, fieldnameTruncated, valTruncated) {
            var emailAddress;
            if(fieldname === 'to') {
                try {
                    emailAddress = emailRegexp.exec(val)[1]
                    resolve(emailAddress)
                } catch(err) {
                    return reject(err);
                }
            } else {
                return;
            }
        });
    });

    busboy.on('file', function(fieldname, file, filename, encoding, mimetype) {
        var inbox;

        //I'm using readableStreamBuffer to accumulate the data before
        //I get the email field so I can send the stream through to S3
        var readBuf = new streamBuffers.ReadableStreamBuffer();

        //I have to pause readBuf immediately. Otherwise stream-buffers starts
        //sending as soon as I put data in with put().
        readBuf.pause();

        function getInbox(emailAddress) {
            return model.inbox.findOne({email: emailAddress})
                .then(function(result) {
                    if(!result) return Promise.reject(new Error(`Inbox not found for ${emailAddress}`))
                    inbox = result;
                    return Promise.resolve();
                });
        }

        function saveFileStream() {
            console.log('=========== starting stream to S3 ========= ' + filename)
            //Have to resume readBuf since we paused it before
            readBuf.resume();
            //file.save will approximately do the following:
            // readBuf.pipe(gzip).pipe(encrypt).pipe(S3)
            return model.file.save({
                inbox: inbox,
                fileStream: readBuf
            });
        }

        awaitEmailAddress.then(getInbox)
            .then(saveFileStream)
            .catch(function(err) {
                log.error(err)
            });

        file.on('data', function(data) {
            //Fill readBuf with data as it arrives
            readBuf.put(data);
        });

        file.on('end', function() {
            //This was the only way I found to get the S3 streaming finished.
            //destroySoon will let the pipes finish the reading but no more writes are allowed
            readBuf.destroySoon()
        });
    });

    busboy.on('finish', function() {
        res.writeHead(202, { Connection: 'close', Location: '/' });
        res.end();
    });

    req.pipe(busboy);
}
I would very much like feedback on this solution, even though I'm not using it. I have a feeling this could be done much more simply and elegantly.
New solution:
Instead of waiting for the to-field I send the stream directly to S3. I figured that the more stuff I put between the incoming stream and the S3 saving, the higher the risk of losing the incoming file due to a bug in my code. (SendGrid will eventually resend the file if I don't respond with 200, but it takes some time.)
This is how I do it:
1. Save a placeholder for the file in the database
2. Pipe the stream to S3
3. Update the placeholder with more information as it arrives
This solution also gives me the opportunity to easily get hold of unsuccessful uploads since the placeholders for unsuccessful uploads will be incomplete.
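A rough sketch of that flow, where the model.file helpers, bucket and key naming are hypothetical placeholders rather than the actual implementation:

busboy.on('file', function(fieldname, file, filename, encoding, mimetype) {
    // 1. Save a placeholder for the file right away
    model.file.create({ filename: filename, status: 'uploading' })   // hypothetical model call
        .then(function(placeholder) {
            // 2. Pipe the incoming stream straight to S3
            //    (aws-sdk v2 s3.upload accepts a readable stream as Body)
            return new Promise(function(resolve, reject) {
                s3.upload({
                    Bucket: 'inbound-mail',      // hypothetical bucket
                    Key: placeholder.id,         // hypothetical key scheme
                    Body: file
                }, function(err, data) {
                    return err ? reject(err) : resolve(data);
                });
            }).then(function(data) {
                // 3. Update the placeholder with more information as it arrives
                return model.file.update(placeholder.id, { status: 'uploaded', location: data.Location });
            });
        })
        .catch(function(err) {
            log.error(err);
        });
});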
//Michael