Parse Remote CSV File using Nodejs / Papa Parse? - node.js

I am currently working on parsing a remote csv product feed from a Node app and would like to use Papa Parse to do that (as I have had success with it in the browser in the past).
Papa Parse Github: https://github.com/mholt/PapaParse
My initial attempts and web searching haven't turned up exactly how this would be done. The Papa Parse readme says that Papa Parse is now compatible with Node, and as such Baby Parse (which used to provide some of the Node parsing functionality) has been deprecated.
Here's a link to the Node section of the docs for anyone stumbling on this issue in the future: https://github.com/mholt/PapaParse#papa-parse-for-node
From that doc paragraph it looks like Papa Parse in Node can parse a readable stream instead of a File. My question is:
Is there any way to use readable stream functionality so that Papa can download and parse a remote CSV in Node, somewhat like Papa in the browser uses XMLHttpRequest to accomplish that same goal?
For Future Visibility
For those searching on this topic (and to avoid repeating a similar question): attempting to use the remote-file parsing functionality described here: http://papaparse.com/docs#remote-files will result in the following error in your console:
"Unhandled rejection ReferenceError: XMLHttpRequest is not defined"
I have opened an issue on the official repository and will update this Question as I learn more about the problems that need to be solved.

After lots of tinkering I finally got a working example of this using asynchronous streams, with no additional libraries (except fs/request). It works for remote and local files.
I needed to create a data stream, as well as a PapaParse stream (using papa.NODE_STREAM_INPUT as the first argument to papa.parse()), then pipe the data into the PapaParse stream. Event listeners need to be implemented for the data and finish events on the PapaParse stream. You can then use the parsed data inside your handler for the finish event.
See the example below:
const papa = require("papaparse");
const request = require("request");
const options = {/* options */};
const dataStream = request.get("https://example.com/myfile.csv");
const parseStream = papa.parse(papa.NODE_STREAM_INPUT, options);
dataStream.pipe(parseStream);
let data = [];
parseStream.on("data", chunk => {
data.push(chunk);
});
parseStream.on("finish", () => {
console.log(data);
console.log(data.length);
});
The data event for the parseStream happens to run once for each row in the CSV (though I'm not sure this behaviour is guaranteed). Hope this helps someone!
To use a local file instead of a remote file, you can do the same thing except the dataStream would be created using fs:
const dataStream = fs.createReadStream("./myfile.csv");
(You may want to use path.join and __dirname to specify a path relative to where the file is located rather than relative to where it was run)
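A minimal sketch of that, assuming the CSV file sits next to the script:
const path = require("path");
const dataStream = fs.createReadStream(path.join(__dirname, "myfile.csv"));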

OK, so I think I have an answer to this. But I guess only time will tell. Note that my file is .txt with tab delimiters.
var fs = require('fs');
var Papa = require('papaparse');
var file = './rawData/myfile.txt';
// When the file is a local file we need to read it in as a string first.
// This step may not be necessary when uploading via a UI.
var content = fs.readFileSync(file, "utf8");
var rows;
Papa.parse(content, {
  header: false,
  delimiter: "\t",
  complete: function(results) {
    //console.log("Finished:", results.data);
    rows = results.data;
  }
});
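For what it's worth, if the first line of the file is a header row, switching to header: true makes each entry of results.data an object keyed by column name instead of an array of cell values. A small variation on the snippet above:
Papa.parse(content, {
  header: true,        // first line is treated as column names
  delimiter: "\t",
  complete: function(results) {
    rows = results.data; // array of objects keyed by column name
  }
});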

Actually, you could use a lightweight stream transformation library called scramjet; parsing CSV straight from an http stream is one of my main examples. It also uses PapaParse to parse CSVs.
All you wrote above, with any transforms in between, can be done in just a couple of lines:
const {StringStream} = require("scramjet");
const request = require("request");
request.get("https://srv.example.com/main.csv") // fetch csv
.pipe(new StringStream()) // pass to stream
.CSVParse() // parse into objects
.consume(object => console.log("Row:", object)) // do whatever you like with the objects
.then(() => console.log("all done"))
In your own example you're saving the file to disk, which is not necessary even with PapaParse.

I am adding this answer (and will update it as I progress) in case anyone else is still looking into this.
It seems like previous users have ended up downloading the file first and then processing it. This SHOULD NOT be necessary, since Papa Parse should be able to process a read stream and it should be possible to pipe an 'http' GET response into that stream (see the sketch after the note below).
Here is one instance of someone discussing what I am trying to do and falling back to downloading the file and then parsing it: https://forums.meteor.com/t/processing-large-csvs-in-meteor-js-with-papaparse/32705/4
Note: the above discusses Baby Parse; now that Papa Parse works with Node, Baby Parse has been deprecated.
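In principle, piping the GET response straight into Papa would look something like the sketch below (untested on my side; the URL is a placeholder):
const http = require("http");
const Papa = require("papaparse");

http.get("http://example.com/feed.csv", (res) => {
  // res is a readable stream, which Papa Parse for Node can consume directly.
  Papa.parse(res, {
    header: true,
    step: (row) => console.log("Row:", row.data),
    complete: () => console.log("Done")
  });
});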
Download File Workaround
While downloading and then parsing with Papa Parse is not an answer to my question, it is the only workaround I have as of now, and someone else may want to use this approach.
My code to download and then parse currently looks something like this:
// Papa Parse for parsing CSV files
var Papa = require('papaparse');
// HTTP and FS to enable Papa Parse to download remote CSVs via node streams.
var http = require('http');
var fs = require('fs');

var destinationFile = "yourdestination.csv";

var download = function(url, dest, cb) {
  var file = fs.createWriteStream(dest);
  var request = http.get(url, function(response) {
    response.pipe(file);
    file.on('finish', function() {
      file.close(cb); // close() is async, call cb after close completes.
    });
  }).on('error', function(err) { // Handle errors
    fs.unlink(dest); // Delete the file async. (But we don't check the result.)
    if (cb) cb(err.message);
  });
};

// Parse the downloaded file once the download callback fires.
// Papa Parse in Node accepts a readable stream, so stream the saved file back in.
var parseMe = function() {
  Papa.parse(fs.createReadStream(destinationFile), {
    header: true,
    dynamicTyping: true,
    step: function(row) {
      console.log("Row:", row.data);
    },
    complete: function() {
      console.log("All done!");
    }
  });
};

// feedURL holds the remote CSV URL (defined elsewhere in the app).
download(feedURL, destinationFile, parseMe);

The http(s).get callback actually receives a readable stream (the response) as its parameter, so here is a simple solution:
try {
  var streamHttp = await new Promise((resolve, reject) =>
    https.get("https://example.com/yourcsv.csv", (res) => {
      resolve(res);
    })
  );
} catch (e) {
  console.log(e);
}

Papa.parse(streamHttp, config);
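The config argument is just an ordinary Papa Parse configuration object; for example, something like this (the callbacks shown are illustrative):
const config = {
  header: true,
  dynamicTyping: true,
  step: (row) => console.log("Row:", row.data),   // called once per parsed row
  complete: () => console.log("Parsing finished")
};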

const Papa = require("papaparse");
const { StringStream } = require("scramjet");
const request = require("request");
const req = request
.get("https://example.com/yourcsv.csv")
.pipe(new StringStream());
Papa.parse(req, {
  header: true,
  complete: (result) => {
    console.log(result);
  },
});

David Liao's solution worked for me; I tweaked it a little bit since I am using a local file. He did not include an example of how to resolve the file path in Node if you get an Error: ENOENT: no such file or directory message in your console.
To check your actual working directory and understand where your path must point, log the following; it gave me a better understanding of the file location: console.log(process.cwd()).
const fs = require('fs');
const papa = require('papaparse');
const request = require('request');
const path = require('path');
const options = {
  /* options */
};
const fileName = path.resolve(__dirname, 'ADD YOUR ABSOLUTE FILE LOCATION HERE');
const dataStream = fs.createReadStream(fileName);
const parseStream = papa.parse(papa.NODE_STREAM_INPUT, options);
dataStream.pipe(parseStream);
let data = [];
parseStream.on('data', chunk => {
  data.push(chunk);
});
parseStream.on('finish', () => {
  console.log(data);
  console.log(data.length);
});

Related

Node JS/azure functions passing video information back from api call

So essentially, what my API call does is: 1) take in video data using parse-multipart, 2) convert that video data to a real mp4 file using ffmpeg, and 3) it is then supposed to send the video data back to the client in the response body.
Steps 1 and 2 work perfectly - it's that third step that I am stuck on.
The API call creates the Out.mp4 file, but when I try to read its contents using createReadStream, the chunks array doesn't populate, and a null context.res body is returned.
Please let me know what I am doing wrong and how I can pass the video info back properly, so that it can be converted back to a playable mp4 file on the client's side.
Also, let me know if you have any questions or anything I can clarify.
Here is the API call's index.js file:
const fs = require("fs");
// parse-multipart (or a similar package) is assumed here for multipart.getBoundary / multipart.Parse
const multipart = require("parse-multipart");

module.exports = async function(context, req){
  try{
    //Get the input file setup
    context.log("Javascript HTTP trigger function processed a request.");
    var bodyBuffer = Buffer.from(req.body);
    var boundary = multipart.getBoundary(req.headers['content-type']);
    var parts = multipart.Parse(bodyBuffer, boundary);
    var temp = "C:/home/site/wwwroot/In.mp4";
    fs.writeFileSync(temp, Buffer(parts[0].data));

    //Actually execute the ffmpeg script
    var execLineBuilder = "C:/home/site/wwwroot/ffmpeg-5.1.2-essentials_build/bin/ffmpeg.exe -i C:/home/site/wwwroot/In.mp4 C:/home/site/wwwroot/Out.mp4"
    var execSync = require('child_process').execSync;
    //Executing the script
    execSync(execLineBuilder)

    //EVERYTHING WORKS UP UNTIL HERE (chunks array seems to be empty, even though outputting chunk to a file
    //populates that file with data)

    //Storing the chunks of the output mp4 into chunks array
    execSync.on('exit', ()=>{
      chunks = [];
      const myPromise = new Promise((resolve, reject) => {
        var readStream = fs.createReadStream("C:/home/site/wwwroot/Out.mp4");
        readStream.on('data', (chunk)=> {
          chunks.push(chunk);
          resolve("foo");
        });
      })
    })
    myPromise.then(()=>{
      context.res={
        status:200,
        body:chunks
      }
    })
  }catch (e){
    context.res={
      status:500,
      body:e
    }
  }
}
You can use an npm package called azure-function-express; it basically wraps your Azure Function in an Express app.
This way you can directly read the file you saved and send it back in the response.
const createHandler = require("azure-function-express").createHandler;
const express = require("express");
const fs = require('fs');

const app = express();

app.get("/api/HttpTrigger1", (req, res) => {
  res.writeHead(200, {'Content-Type': 'video/mp4'});
  let open = fs.createReadStream('./test.mp3');
  // Pipe the file stream into the response; res.send() does not accept a stream.
  open.pipe(res);
});

// Register the Express app as the Azure Function handler.
module.exports = createHandler(app);
This way you will be able to share the video, and running ffmpeg should be fairly simple as well.

How to read and parse a CSV file from an Axios GET request in Node/Express?

I'm creating a Node/Express backend that uses axios to make a GET request to this URL: https://marknadssok.fi.se/Publiceringsklient/sv-SE/Search/Search?SearchFunctionType=Insyn&Utgivare=&PersonILedandeSt%C3%A4llningNamn=&Transaktionsdatum.From=&Transaktionsdatum.To=&Publiceringsdatum.From=2021-03-30&Publiceringsdatum.To=2021-03-30&button=export&Page=1
When using a regular browser the response is a file download with the data in a CSV file.
Is there a way to read and parse this CSV file in the Node/Express backend that I'm building? I do not wish to persist the data to filesystem or anything. Simply use libraries such as "csv-parse" to turn this CSV file into an object for each row in the file.
Thanks in advance!
EDIT:
When I try the example that reads the file directly, the jsonArray I get from console.log comes out garbled (screenshot omitted).
The actual CSV file itself looks just as it should (screenshot omitted).
You can use the following example. It downloads the CSV file, reads it, and removes it. However, the CSV file at the link you gave uses an unusual format (the content is full of \x00 characters), which is why the parsed output looks like that. The best thing to do is to use a different CSV file, or to parse that generated CSV file row by row yourself. I think you can open a separate question for that problem.
const fs = require('fs');
const csv = require('csvtojson');
const https = require('https');

let url = "https://marknadssok.fi.se/Publiceringsklient/sv-SE/Search/Search?SearchFunctionType=Insyn&Utgivare=&PersonILedandeSt%C3%A4llningNamn=&Transaktionsdatum.From=&Transaktionsdatum.To=&Publiceringsdatum.From=2021-03-30&Publiceringsdatum.To=2021-03-30&button=export&Page=1";

async function parseCSV(url) {
  const dest = __dirname + '/foo.csv'; // You can use uuid-like packages to name the csv file
  return new Promise((resolve, reject) => {
    var file = fs.createWriteStream(dest);
    https.get(url, function (response) {
      response.pipe(file);
      file.on('finish', function () {
        file.close();
        csv({
          noheader: true,
          trim: true
        }).fromFile(dest).then(jsonArray => {
          fs.unlinkSync(dest);
          resolve(jsonArray);
        });
      });
    }).on('error', function (err) { // Handle errors
      fs.unlinkSync(dest);
      reject(new Error("Download failed."));
    });
  });
}

parseCSV(url).then(result => {
  result.forEach(row => {
    row = row.field1;
    console.log(row);
  });
});
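If you want to avoid touching the filesystem entirely, a rough sketch with axios and csv-parse (the libraries mentioned in the question) could look like the following. The delimiter and encoding of that particular export are assumptions and may need adjusting:
const axios = require('axios');
const { parse } = require('csv-parse');

async function fetchRows(url) {
  // Request the body as a stream instead of buffering it all in memory.
  const response = await axios.get(url, { responseType: 'stream' });
  // Pipe the response stream through the csv-parse transform stream.
  const parser = response.data.pipe(parse({ delimiter: ';', columns: true })); // delimiter is an assumption
  const rows = [];
  for await (const row of parser) {
    rows.push(row);
  }
  return rows;
}

fetchRows(url).then(rows => console.log(rows.length));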

Stream node requests to the cloud with file metadata

I'm using Koa to build a web app, and I want to allow users to upload files to it. The files need to be streamed to the cloud, but I would like to avoid saving them locally.
The problem is that I need some file metadata before I pipe the upload stream to the writeable stream. I want to have the mime-type and optionally attach other data like the original file name etc.
I tried sending the binary data with the request's "content-type" header set to the file's type, but I would like the request to have the content type application/octet-stream so I can know in the back-end how to handle the request.
I read somewhere that the better option would be to use multipart/form-data but I'm not sure how to structure the request, and how to parse the metadata in order to notify the cloud before I pipe to its write stream.
Here is the code I'm currently using. Basically, it just pipes the request as-is, and I use the request header to know the type of the file:
module.exports = async ctx => {
  // Generate a random id that will be part of the filename.
  const id = pushid();
  // Get the content type from the header.
  const contentType = ctx.header['content-type'];
  // Get the extension for the file from the content type.
  const ext = contentType.split('/').pop();

  // This is the configuration for the upload stream to the cloud.
  const uploadConfig = {
    // I must specify a content type, or know the file extension.
    contentType
    // there is some other stuff here but it's not relevant.
  };

  // Create an upload stream for the cloud storage.
  const uploadStream = bucket
    .file(`assets/${id}/original.${ext}`)
    .createWriteStream(uploadConfig);

  // Here is what took me hours to get to work... dev life is hard
  ctx.req.pipe(uploadStream);

  // Return a promise so Koa doesn't shut down the request before it's finished uploading.
  return new Promise((resolve, reject) =>
    uploadStream.on('finish', resolve).on('error', reject)
  );
};
Please assume I don't know much about the uploading protocols and managing streams.
OK, so after a lot of searching I found out that there is a parser that works with streams, called busboy. It is pretty easy to use, but before jumping into the code I highly suggest that everyone dealing with multipart/form-data requests read this article.
Here is how I solved it:
const Busboy = require('busboy');
const path = require('path');

module.exports = async ctx => {
  // Init busboy with the headers of the "raw" request.
  const busboy = new Busboy({ headers: ctx.req.headers });

  busboy.on('file', (fieldname, stream, filename, encoding, contentType) => {
    const id = pushid();
    const ext = path.extname(filename);
    const uploadStream = bucket
      .file(`assets/${id}/original${ext}`)
      .createWriteStream({
        contentType,
        resumable: false,
        metadata: {
          cacheControl: 'public, max-age=3600'
        }
      });
    stream.pipe(uploadStream);
  });

  // Pipe the request to busboy.
  ctx.req.pipe(busboy);

  // Return a promise that resolves to whatever you want.
  ctx.body = await new Promise(resolve => {
    busboy.on('finish', () => {
      resolve('done');
    });
  });
};
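For completeness, the client side of such a multipart/form-data request could be structured roughly like this (a browser-side sketch; field names are arbitrary, and metadata fields appended before the file arrive in busboy as 'field' events ahead of the 'file' event):
// Browser-side sketch: send the file plus metadata in one multipart/form-data request.
// "file" here comes from e.g. an <input type="file"> element.
const form = new FormData();
form.append('originalName', file.name);   // plain metadata field
form.append('file', file, file.name);     // the file part; its content type travels with this part
fetch('/upload', { method: 'POST', body: form });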

Efficient way to read file in NodeJS

I am receiving an image file sent from an Ajax request:
var data = canvas.toDataURL('image/jpg', 1.0);

$.post({
  url: "/upload-image",
  data: {
    file: data
  }
}).done(function(response) {
  ....
});
And on the server side, I want to transmit the image file to an API
function getOptions(buffer) {
  return {
    url: '.../face_detection',
    headers: headers,
    method: 'POST',
    formData: {
      filename: buffer
    }
  };
}
router.post('/upload-image', function(req, res, next) {
  console.log('LOG 0' + Date.now());

  var data_url = req.body.file;
  var matches = data_url.match(/^data:.+\/(.+);base64,(.*)$/);
  var ext = matches[1];
  var base64_data = matches[2];
  var buffer = new Buffer(base64_data, 'base64');

  console.log('LOG 1' + Date.now());

  request(getOptions(buffer), function(error, response, body) {
    res.json(body);
    console.log(Date.now());
  });
});
The problem I have is that the code between LOG 0 and LOG 1 is very slow, taking a few seconds, even though the image is only 650 kB. Is there a way to speed this up?
Maybe by using another method to read the header, avoiding the buffer, or changing the upload process? I don't know, but I'd like it to be faster.
Thank you very much.
I would suggest using a library to handle some of this logic. If you would prefer to keep a lean dependency list, you can take a look at the source of some of these modules and base your own solution off of them.
For converting a data URI to a buffer: data-uri-to-buffer
For figuring out a file type: file-type
I would especially recommend the file-type solution. A safer (though not the safest) way to determine what kind of file a Buffer contains is to inspect aspects of the file itself. file-type seems to at least look at the file's magic number to check its type. Not foolproof, but if you are accepting files from users, you have to accept the risks involved.
Also have a look at Security Stack Exchange questions for good practices. Although the following say PHP, all server software runs the risk of being vulnerable to user input:
Hacker used picture upload to get PHP code into my site
Can simply decompressing a JPEG image trigger an exploit?
Risks of a PHP image upload form
"use strict";
const dataUriToBuffer = require('data-uri-to-buffer'),
fileType = require("file-type"),
express = require("express"),
router = express.Router(),
util = require("util"),
fs = require("fs"),
path = require("path");
const writeFileAsync = util.promisify(fs.writeFile);
// Keep track of file types you support
const supportedTypes = [
"png",
"jpg",
"gif"
];
// Handle POSTs to upload-image
router.post("/upload-image", function (req, res, next) {
// Did they send us a file?
if (!req.body.file) {
// Unprocessable entity error
return res.sendStatus(422);
}
// Get the file to a buffer
const buff = dataUriToBuffer(req.body.file);
// Get the file type
const bufferMime = fileType(buff); // {ext: 'png', mime: 'image/png'}
// Is it a supported file type?
if (!supportedTypes.contains(bufferMime.ext)) {
// Unsupported media type
return res.sendStatus(415);
}
// Save or do whatever with the file
writeFileAsync(path.join("imageDir", `userimage.${bufferMime.ext}`), buff)
// Tell the user that it's all done
.then(() => res.sendStatus(200))
// Log the error and tell the user the save failed
.catch((err) => {
console.error(err);
res.sendStatus(500);
});
});

POSTing RAW body with restify client

I'm trying to POST a raw body with restify. I have the receive side correct, when using POSTman I can send a raw zip file, and the file is correctly created on the server's file system. However, I'm struggling to write my test in mocha. Here is the code I have, any help would be greatly appreciated.
I've tried this approach.
const should = require('should');
const restify = require('restify');
const fs = require('fs');
const port = 8080;
const url = 'http://localhost:' + port;
const client = restify.createJsonClient({
  url: url,
  version: '~1.0'
});
const testPath = 'test/assets/test.zip';
fs.existsSync(testPath).should.equal(true);
const readStream = fs.createReadStream(testPath);
client.post('/v1/deploy', readStream, function(err, req, res, data) {
  if (err) {
    throw new Error(err);
  }
  should(res).not.null();
  should(res.statusCode).not.null();
  should(res.statusCode).not.undefined();
  res.statusCode.should.equal(200);
  should(data).not.null();
  should(data.endpoint).not.undefined();
  data.endpoint.should.equal('http://endpointyouhit:8080');
  done();
});
Yet the file size on the file system is always 0. I'm not using my readStream correctly, but I'm not sure how to correct it. Any help would be greatly appreciated.
Note that I want to stream the file, not load it in memory on transmit and receive, the file can potentially be too large for an in memory operation.
Thanks,
Todd
One thing is that you would need to specify a content type of multipart/form-data. However, it looks like restify doesn't support that content type, so you're probably out of luck using the restify client to post a file.
To answer my own question, it doesn't appear to be possible to do this with the restify client. I also tried the request module, which claims to have this capability. However, when using their streaming examples, I always had a file size of 0 on the server. Below is a functional mocha integration test.
const testPath = 'test/assets/test.zip';
fs.existsSync(testPath).should.equal(true);
const readStream = fs.createReadStream(testPath);

var options = {
  host: 'localhost',
  port: port,
  path: '/v1/deploy/testvalue',
  method: 'PUT'
};

var req = http.request(options, function (res) {
  // this feels a bit backwards, but these are evaluated AFTER the read stream has closed
  var buffer = '';
  // pipe body to a buffer
  res.on('data', function(data){
    buffer += data;
  });
  res.on('end', function () {
    should(res).not.null();
    should(res.statusCode).not.null();
    should(res.statusCode).not.undefined();
    res.statusCode.should.equal(200);
    const json = JSON.parse(buffer);
    should(json).not.null();
    should(json.endpoint).not.undefined();
    json.endpoint.should.equal('http://endpointyouhit:8080');
    done();
  });
});

req.on('error', function (err) {
  if (err) {
    throw new Error(err);
  }
});

// pipe the readstream into the request
readStream.pipe(req);

/**
 * Close the request on the close of the read stream
 */
readStream.on('close', function () {
  req.end();
  console.log('I finished.');
});

// note that if we end up with larger files, we may want to support continue, much as S3 does
// https://nodejs.org/api/http.html#http_event_continue
