How to pipe a file during download with Puppeteer? - node.js

Is it possible to pipe a file while downloading it with Puppeteer?
The first code block below is an example of downloading with Puppeteer, while the second shows how I extract a file during download without Puppeteer. I want to combine part 2 into part 1 somehow.
const page = await browser.newPage(); // skipped other configs
const client = await page.target().createCDPSession(); // set the download directory for files
await client.send("Page.setDownloadBehavior", {
  behavior: "allow",
  downloadPath: process.cwd() + "\\src\\tempDataFiles\\rami",
});
// array of links from the page
const fileUrlArray = await page.$$eval("selector", (files) =>
  files.map((link) => link.getAttribute("href"))
);
// download the files
const filteredFiles = fileUrlArray.filter((url) => url !== null);
for (const file of filteredFiles) {
  await page.click(`[href="${file}"]`);
}
This code works perfectly, but the files are zipped and I want to extract them before saving.
When I download a file without Puppeteer, the extraction works as in the next code block.
(In this case I'm not yet able to use plain HTTP(S) requests for these downloads, due to lack of knowledge.)
The code that unzips the file directly during download without Puppeteer (a simple HTTP request):
const file = fs.createWriteStream(`./src/tempDataFiles/${store}/${fileName}.xml`);
const request = http.get(url, function (response) {
response.pipe(zlib.createGunzip()).pipe(file);
file.on("error", async (e) => {
Log.error(`downloadAndUnZipFile error : ${e}`);
await fileOnDb.destroy();
file.end();
});
file.on("finish", () => {
Log.success(`${fileName} download Completed`);
file.close();
});
});
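One possible way to combine the two parts (a minimal sketch, not tested against the actual site; downloadDir, fileName and store are hypothetical parameters): let Puppeteer finish the download into the temp directory, then pipe the file on disk through zlib. Note that zlib.createGunzip() only handles gzip data; true .zip archives would need a zip library such as unzipper.
const fs = require("fs");
const path = require("path");
const zlib = require("zlib");

// Hypothetical helper: extract one downloaded .gz file into the store folder.
function extractDownloadedFile(downloadDir, fileName, store) {
  const source = fs.createReadStream(path.join(downloadDir, fileName));
  const target = fs.createWriteStream(`./src/tempDataFiles/${store}/${fileName}.xml`);
  return new Promise((resolve, reject) => {
    source
      .pipe(zlib.createGunzip()) // gunzip while reading back from disk
      .pipe(target)
      .on("finish", resolve)
      .on("error", reject);
  });
}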

Related

Writing file in /tmp in a Firebase Function does not work

I am writing a Firebase function that exposes an API endpoint using express. When the endpoint is called, it needs to download an image from an external API and use that image to make a second API call. The second API call needs the image to be passed as a readableStream. Specifically, I am calling the pinFileToIPFS endpoint of the Pinata API.
My Firebase function is using axios to download the image and fs to write the image to /tmp. Then I am using fs to read the image, convert it to a readableStream and send it to Pinata.
A stripped-down version of my code looks like this:
const functions = require("firebase-functions");
const express = require("express");
const axios = require("axios");
const cors = require("cors"); // cors() is used in the route below
const fs = require('fs-extra')
require("dotenv").config();
const key = process.env.REACT_APP_PINATA_KEY;
const secret = process.env.REACT_APP_PINATA_SECRET;
const pinataSDK = require('@pinata/sdk');
const pinata = pinataSDK(key, secret);
const app = express();
const downloadFile = async (fileUrl, downloadFilePath) => {
try {
const response = await axios({
method: 'GET',
url: fileUrl,
responseType: 'stream',
});
// pipe the result stream into a file on disc
response.data.pipe(fs.createWriteStream(downloadFilePath, {flags:'w'}))
// return a promise and resolve when download finishes
return new Promise((resolve, reject) => {
response.data.on('end', () => {
resolve()
})
response.data.on('error', () => {
reject()
})
})
} catch (err) {
console.log('Failed to download image')
console.log(err);
throw new Error(err);
}
};
app.post('/pinata/pinFileToIPFS', cors(), async (req, res) => {
const id = req.query.id;
var url = '<URL of API endpoint to download the image>';
await fs.ensureDir('/tmp');
if (fs.existsSync('/tmp')) {
console.log('Folder: /tmp exists!')
} else {
console.log('Folder: /tmp does not exist!')
}
var filename = '/tmp/image-'+id+'.png';
downloadFile(url, filename);
if (fs.existsSync(filename)) {
console.log('File: ' + filename + ' exists!')
} else {
console.log('File: ' + filename + ' does not exist!')
}
var image = fs.createReadStream(filename);
const options = {
pinataOptions: {cidVersion: 1}
};
pinata.pinFileToIPFS(image, options).then((result) => {
console.log(JSON.stringify(result));
res.header("Access-Control-Allow-Origin", "*");
res.header("Access-Control-Allow-Headers", "Authorization, Origin, X-Requested-With, Accept");
res.status(200).json(JSON.stringify(result));
res.send();
}).catch((err) => {
console.log('Failed to pin file');
console.log(err);
res.status(500).json(JSON.stringify(err));
res.send();
});
});
exports.api = functions.https.onRequest(app);
Interestingly, my debug messages tell me that the /tmp folder exists, but the downloaded file does not exist in the file system:
[Error: ENOENT: no such file or directory, open '/tmp/image-314502.png']. Note that the image can be accessed correctly when I manually access the URL of the image.
I've tried to download and save the file in many ways, but none of them work. Also, based on what I've read, Firebase Functions do allow writing and reading temp files in /tmp.
Any advice will be appreciated. Note that I am very new to NodeJS and to Firebase, so please excuse my basic code.
Many thanks!
I was not able to see where you are initializing the directory, as suggested in this post:
const bucket = gcs.bucket(object.bucket);
const filePath = object.name;
const fileName = filePath.split('/').pop();
const thumbFileName = 'thumb_' + fileName;
const workingDir = join(tmpdir(), `${object.name.split('/')[0]}/`);//new
const tmpFilePath = join(workingDir, fileName);
const tmpThumbPath = join(workingDir, thumbFileName);
await fs.ensureDir(workingDir);
Also, please consider that if you are using two functions, the /tmp directory is not shared, as each function has its own. Here is an explanation from Doug Stevenson; in the same answer, there is a very well explained video about local and global scopes and how to use the tmp directory:
Cloud Functions only allows one function to run at a time in a particular server instance. Functions running in parallel run on different server instances, which have different /tmp spaces. Each function invocation runs in complete isolation from each other. You should always clean up files you write in /tmp so that they don't accumulate and cause a server instance to run out of memory over time.
I would suggest using Google Cloud Storage extended with Cloud Functions to achieve your goal.
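As a rough illustration of that suggestion (a minimal sketch, not the asker's code; the bucket name is hypothetical), the downloaded image can be streamed straight into a Cloud Storage bucket instead of being staged in /tmp:
const { Storage } = require("@google-cloud/storage");
const axios = require("axios");

const storage = new Storage();
const bucket = storage.bucket("my-app.appspot.com"); // hypothetical bucket name

// Download the image and pipe it into the bucket without touching /tmp.
async function saveImageToBucket(fileUrl, objectName) {
  const response = await axios({ method: "GET", url: fileUrl, responseType: "stream" });
  return new Promise((resolve, reject) => {
    response.data
      .pipe(bucket.file(objectName).createWriteStream())
      .on("finish", () => resolve(objectName))
      .on("error", reject);
  });
}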

Nodejs Script Not Reading GDOC File Extension

I am using the Google Drive for Developers Drive API (V3) Nodejs quickstart.
In particular I am concentrating on the following function, where I have customized pageSize to 1 for testing, and where I call my function read(file.name):
/**
 * Lists the names and IDs of up to 10 files.
 * @param {google.auth.OAuth2} auth An authorized OAuth2 client.
 */
function listFiles(auth) {
const drive = google.drive({version: 'v3', auth});
drive.files.list({
pageSize: 1, // only find the last modified file in dev folder
fields: 'nextPageToken, files(id, name)',
}, (err, res) => {
if (err) return console.log('The API returned an error: ' + err);
const files = res.data.files;
if (files.length) {
console.log('Files:');
files.map((file) => {
console.log(`${file.name} (${file.id})`);
read(file.name); // my function here
});
} else {
console.log('No files found.');
}
});
}
// custom code - function to read and output file contents
function read(fileName) {
const readableStream = fs.createReadStream(fileName, 'utf8');
readableStream.on('error', function (error) {
console.log(`error: ${error.message}`);
})
readableStream.on('data', (chunk) => {
console.log(chunk);
})
}
This code reads the file from the synced Google Drive folder, which I am using as a local folder for development. I have found that the pageSize: 1 parameter produces the last file modified in this local folder. Therefore my process has been:
Edit the .js code file
Make a minor edit on the test files (first txt, then gdoc) to ensure one is the last modified
Run the code
I am testing a text file against a GDOC file. The filenames are atest.txt & 31832_226114__0001-00028.gdoc respectively. The outputs are as follows:
PS C:\Users\david\Google Drive\Technical-local\gDriveDev> node . gdocToTextDownload.js
Files:
atest.txt (1bm1E4s4ET6HVTrJUj4TmNGaxqJJRcnCC)
atest.txt this is a test file!!
PS C:\Users\david\Google Drive\Technical-local\gDriveDev> node . gdocToTextDownload.js
Files:
31832_226114__0001-00028 (1oi_hE0TTfsKG9lr8Wl7ahGNvMvXJoFj70LssGNFFjOg)
error: ENOENT: no such file or directory, open 'C:\Users\david\Google Drive\Technical-local\gDriveDev\31832_226114__0001-00028'
My question is:
Why does the script read the text file but not the gdoc?
At this point I must 'hard code' the gdoc file extension onto the file name in the function call to produce the required output, as in the text file example, e.g.
read('31832_226114__0001-00028.gdoc');
which is obviously not what I want to do.
I am aiming to produce a script that will download a large number of gdocs that have been created from .jpg files.
------------------------- code completed below ------------------------
/**
 * Lists the names and IDs of pageSize number of files (using a query to define the folder of files).
 * @param {google.auth.OAuth2} auth An authorized OAuth2 client.
 */
function listFiles(auth) {
const drive = google.drive({version: 'v3', auth});
drive.files.list({
corpora: 'user',
pageSize: 100,
// files in a parent folder that have not been trashed
// get ID from Drive > Folder by looking at the URL after /folders/
q: `'11Sejh6XG-2WzycpcC-MaEmDQJc78LCFg' in parents and trashed=false`,
fields: 'nextPageToken, files(id, name)',
}, (err, res) => {
if (err) return console.log('The API returned an error: ' + err);
const files = res.data.files;
if (files.length) {
var ids = [ ];
var names = [ ];
files.forEach(function(file, i) {
ids.push(file.id);
names.push(file.name);
});
ids.forEach((fileId, i) => {
const fileName = names[i];
downloadFile(drive, fileId, fileName);
});
}
else
{
console.log('No files found.');
}
});
}
/**
 * @param {google.auth.OAuth2} auth An authorized OAuth2 client.
 */
function downloadFile(drive, fileId, fileName) {
// make sure you have valid path & permissions. Use UNIX filepath notation.
const filePath = `/test/test1/${fileName}`;
const dest = fs.createWriteStream(filePath);
let progress = 0;
drive.files.export(
{ fileId, mimeType: 'text/plain' },
{ responseType: 'stream' }
).then(res => {
res.data
.on('end', () => {
console.log(' Done downloading');
})
.on('error', err => {
console.error('Error downloading file.');
})
.on('data', d => {
progress += d.length;
if (process.stdout.isTTY) {
process.stdout.clearLine();
process.stdout.cursorTo(0);
process.stdout.write(`Downloading ${fileName} ${progress} bytes`);
}
})
.pipe(dest);
});
}
My question is: Why does the script read the text file but not the gdoc?
This is because you're trying to download a Google Workspace document; only files with binary content can be downloaded using the drive.files.get method. For Google Workspace documents you need to use drive.files.export, as documented here.
From your code, I see you're only listing the files. You will need to identify the type of file you want to download: you can use the mimeType field to check whether you need the export method or get. For example, a Google Doc's mime type is application/vnd.google-apps.document, while a docx file (binary) would be application/vnd.openxmlformats-officedocument.wordprocessingml.document.
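To make that decision concrete, here is a small hedged sketch (assuming the googleapis Node.js client, and a file object from files.list that includes a mimeType field):
// Choose export (Workspace documents) vs get (binary content).
async function fetchFileStream(drive, file) {
  if (file.mimeType === "application/vnd.google-apps.document") {
    // Google Doc: must be exported to a concrete format such as text/plain
    const res = await drive.files.export(
      { fileId: file.id, mimeType: "text/plain" },
      { responseType: "stream" }
    );
    return res.data;
  }
  // Binary file: can be downloaded directly
  const res = await drive.files.get(
    { fileId: file.id, alt: "media" },
    { responseType: "stream" }
  );
  return res.data;
}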
Check the following working example:
const fs = require("fs");
const getFile = async (drive, fileId, name) => {
const res = await drive.files.get({ fileId, alt: "media" }, { responseType: "stream" });
return new Promise((resolve, reject) => {
const filePath = `/tmp/${name}`;
console.log(`writing to ${filePath}`);
const dest = fs.createWriteStream(filePath);
let progress = 0;
res.data
.on("end", () => {
console.log("🎉 Done downloading file.");
resolve(filePath);
})
.on("error", (err) => {
console.error("🚫 Error downloading file.");
reject(err);
})
.on("data", (d) => {
progress += d.length;
console.log(`🕛 Downloaded ${progress} bytes`);
})
.pipe(dest);
});
};
const fileKind = "drive#file";
let filesCounter = 0;
const drive = googleClient.drive({ version: "v3" });
const files = await drive.files.list();
// Only files with binary content can be downloaded. Use Export with Docs Editors files
// Read more at https://developers.google.com/drive/api/v3/reference/files/get
// In this example, any docx folder will be downloaded in a temp folder.
const onlyFiles = files.data.files.filter(
(file) =>
file.kind === fileKind &&
file.mimeType === "application/vnd.openxmlformats-officedocument.wordprocessingml.document"
);
const numberOfFilesToDownload = onlyFiles.length;
console.log(`😏 About to download ${numberOfFilesToDownload} files`);
for await (const file of onlyFiles) {
filesCounter++;
console.log(`📁 Downloading file ${file.name}, ${filesCounter} of ${numberOfFilesToDownload}`);
await getFile(drive, file.id, file.name);
}
The answer (as I see it) is that the Node.js script above is running on Windows and therefore must comply with the native OS/file system inherited via the DOS/NT development of Windows. The gdoc extension, on the other hand, is a reference created by the Google Drive sync desktop client, and here is the important distinction: the gdoc extension references a file stored on Google Drive (the reference lives in the sync folder on the hard drive, while the file itself lives in the cloud on Google Drive). It is therefore not an extension in the usual sense, where an extension identifies a file type that a local application can validly access, read, and write. So my test function above, function read(fileName), cannot read the .gdoc the same way it reads the .txt.
Therefore the correct way to access files on Google Drive from a local application is to use the file's ID. The filename is just a convenient label for the ID, so that the user can meaningfully compare the downloaded copy of the file with the original on Google Drive.
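For illustration only (this is an assumption about the sync client's stub format, not verified here): a .gdoc file is typically a tiny JSON pointer rather than the document content, so the only useful thing to read out of it locally is the ID or URL.
const fs = require("fs");

// Hypothetical: older Drive sync clients wrote stubs like
// {"url": "https://docs.google.com/open?id=...", "doc_id": "..."}
const stub = JSON.parse(fs.readFileSync("31832_226114__0001-00028.gdoc", "utf8"));
console.log(stub.url); // the actual content must be fetched via the Drive API by ID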
(Refer to the original question.) Using the code under '------------------------- code completed below ------------------------', I have added these two functions to Google's Node.js quickstart, replacing the function listFiles(auth) and adding the function downloadFile(drive, fileId, fileName).
The total script has been used to download multiple files (more than 50 at a time) to my hard drive. It is a useful piece of code in an OCR setup in which a gscript converts .JPG images of historic electoral rolls into readable text. These gdocs are messy (still containing the original image and colored fonts of various formats); downloading them as text files with the above script cleans them up. Of course, images are removed from the text files and the fonts are standardized to plain upper/lower-case text. So it's more than just a downloader: it's a filter as well.
I hope this is of some use to someone.

How to read file from createReadStream in Node.js?

I have a web application that can upload an Excel file. When a user uploads one, the app should parse it and return the rows it contains. Parsing the file and returning the rows is the whole job, so the application doesn't need to save the file to its filesystem. But in the code below, which I wrote this morning, the file is saved to the server and then parsed, which I think wastes server resources.
I don't know how to read an Excel file with createReadStream. How can I parse the Excel file directly, without saving it? I'm not familiar with fs; of course I could delete the file after the job finishes, but is there a more elegant way?
import { createWriteStream } from 'fs'
import path from 'path'
import xlsx from 'node-xlsx'
// some graphql code here...
async singleUpload(_, { file }, context) {
try {
console.log(file)
const { createReadStream, filename, mimetype, encoding } = await file
await new Promise((res) =>
createReadStream()
.pipe(createWriteStream(path.join(__dirname, '../uploads', filename)))
.on('close', res)
)
const workSheetsFromFile = xlsx.parse(path.join(__dirname, '../uploads', filename))
for (const row of workSheetsFromFile[0].data) {
console.log(row)
}
return { filename }
} catch (e) {
throw new Error(e)
}
},
Using the express-fileupload library, which provides a buffer representation of uploaded files (through the data property), combined with exceljs, which accepts buffers, will get you there.
See express-fileupload and exceljs.
// read from a file
const workbook = new Excel.Workbook();
await workbook.xlsx.readFile(filename);
// ... use workbook
// read from a stream
const workbook = new Excel.Workbook();
await workbook.xlsx.read(stream);
// ... use workbook
// load from buffer // this is what you're looking for
const workbook = new Excel.Workbook();
await workbook.xlsx.load(data);
// ... use workbook
Here's a simplified example:
const app = require('express')();
const fileUpload = require('express-fileupload');
const { Workbook } = require('exceljs');
app.use(fileUpload());
app.post('/', async (req, res) => {
if (!req.files || Object.keys(req.files).length === 0) {
return res.status(400).send('No files were uploaded.');
}
// The name of the input field (i.e. "myFile") is used to retrieve the uploaded file
await new Workbook().xlsx.load(req.files.myFile.data)
res.sendStatus(200) // respond so the client isn't left hanging
});
app.listen(3000)
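Since the asker's goal is to return rows without saving the file, a short follow-on sketch (same exceljs API, with the myFile field name assumed from the snippet above) shows how the rows can be read once the workbook is loaded:
const workbook = await new Workbook().xlsx.load(req.files.myFile.data);
workbook.worksheets[0].eachRow((row, rowNumber) => {
  console.log(rowNumber, row.values); // row.values holds the row's cell values
});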
var xlsx = require('xlsx')
//var workbook = xlsx.readFile('testSingle.xlsx')
var workbook = xlsx.read(fileObj);
You just need to use the xlsx.read method, which parses the file's data (e.g. a Buffer) directly instead of reading from a file path like xlsx.readFile does.
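For example, a minimal sketch (assuming express-fileupload as in the first answer, so the uploaded bytes are available as a Buffer on req.files.myFile.data):
const xlsx = require('xlsx');

// Parse an uploaded buffer and return the first sheet's rows as arrays.
function parseRows(buffer) {
  const workbook = xlsx.read(buffer, { type: 'buffer' });
  const firstSheet = workbook.Sheets[workbook.SheetNames[0]];
  return xlsx.utils.sheet_to_json(firstSheet, { header: 1 });
}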
You can add an event listener before you pipe the data, so you can do something with your file before it is uploaded. It looks like this:
async singleUpload(_, { file }, context) {
  try {
    console.log(file)
    const { createReadStream, filename, mimetype, encoding } = await file
    await new Promise((res) =>
      createReadStream()
        .on('data', (data) => {
          // do something with your data/file
          console.log({ data })
          // your code here
        })
        .pipe(createWriteStream(path.join(__dirname, '../uploads', filename)))
        .on('close', res)
    )
    return { filename }
  } catch (e) {
    throw new Error(e)
  }
},
See the documentation: Node.js streams.

How to use bucket.upload() instead of file.createWriteStream() in Google Cloud Storage?

I'm trying to get the permanent (unsigned) download URL after uploading a file to Google Cloud Storage. I can get the signed download URL using file.createWriteStream() but file.createWriteStream() doesn't return the UploadResponse that includes the unsigned download URL. bucket.upload() includes the UploadResponse, and Get Download URL from file uploaded with Cloud Functions for Firebase has several answers explaining how to get the unsigned download URL from the UploadResponse. How do I change file.createWriteStream() in my code to bucket.upload()? Here's my code:
const {Storage} = require('@google-cloud/storage');
const storage = new Storage({ projectId: 'my-app' });
const bucket = storage.bucket('my-app.appspot.com');
var file = bucket.file('Audio/' + longLanguage + '/' + pronunciation + '/' + wordFileType);
const config = {
action: 'read',
expires: '03-17-2025',
content_type: 'audio/mp3'
};
function oedPromise() {
return new Promise(function(resolve, reject) {
http.get(oedAudioURL, function(response) {
response.pipe(file.createWriteStream(options))
.on('error', function(error) {
console.error(error);
reject(error);
})
.on('finish', function() {
file.getSignedUrl(config, function(err, url) {
if (err) {
console.error(err);
return;
} else {
resolve(url);
}
});
});
});
});
}
I tried this, it didn't work:
function oedPromise() {
return new Promise(function(resolve, reject) {
http.get(oedAudioURL, function(response) {
bucket.upload(response, options)
.then(function(uploadResponse) {
console.log('Then do something with UploadResponse.');
})
.catch(error => console.error(error));
});
});
}
The error message was Path must be a string. In other words, bucket.upload() expects a string path to a local file, but response is a stream.
I used the Google Cloud Text-to-Speech API to simulate what you are doing: getting the text to create the audio file from a text file. Once the file was created, I used the upload method to add it to my bucket and the makePublic method to get its public URL. I also used the async/await feature offered by Node.js instead of function chaining (using then), to avoid the 'No such object: ...' error produced when the makePublic method executes before the file finishes uploading to the bucket.
// Imports the Google Cloud client library
const {Storage} = require('@google-cloud/storage');
// Creates a client using Application Default Credentials
const storage = new Storage();
// Imports the Google Cloud client library
const textToSpeech = require('@google-cloud/text-to-speech');
// Get the bucket
const myBucket = storage.bucket('my_bucket');
// Import other required libraries
const fs = require('fs');
const util = require('util');
// Create a client
const client = new textToSpeech.TextToSpeechClient();
// Create the variable to save the text to create the audio file
var text = "";
// Function that reads my_text.txt file (which contains the text that will be
// used to create my_audio.mp3) and saves its content in a variable.
function readFile() {
// This line opens the file as a readable stream
var readStream = fs.createReadStream('/home/usr/my_text.txt');
// Read and display the file data on console
readStream.on('data', function (data) {
text = data.toString();
});
// Execute the createAndUploadFile() function once the whole file has been read
readStream.on('end', function (data) {
createAndUploadFile();
});
}
// Function that uploads the file to the bucket and generates it public URL.
async function createAndUploadFile() {
// Construct the request
const request = {
input: {text: text},
// Select the language and SSML voice gender (optional)
voice: {languageCode: 'en-US', ssmlGender: 'NEUTRAL'},
// select the type of audio encoding
audioConfig: {audioEncoding: 'MP3'},
};
// Performs the text-to-speech request
const [response] = await client.synthesizeSpeech(request);
// Write the binary audio content to a local file
const writeFile = util.promisify(fs.writeFile);
await writeFile('my_audio.mp3', response.audioContent, 'binary');
console.log('Audio content written to file: my_audio.mp3');
// Wait for the myBucket.upload() function to complete before moving on to the
// next line to execute it
let res = await myBucket.upload('/home/usr/my_audio.mp3');
// If there is an error, it is printed
if (res.err) {
console.log('error');
}
// If not, the makePublic() function is executed
else {
// Get the file in the bucket
let file = myBucket.file('my_audio.mp3');
file.makePublic();
}
}
readFile();
bucket.upload() is a convenience wrapper around file.createWriteStream() that takes a local filesystem path and uploads the file into the bucket as an object:
bucket.upload("path/to/local/file.ext", options)
.then(() => {
// upload has completed
});
To generate a signed URL, you'll need to get a file object from the bucket:
const theFile = bucket.file('file_name');
The file name will either be that of your local file or, if you specified an alternate remote name for the file on GCS, the name given in options.destination.
Then, use File.getSignedUrl() to get a signed URL:
bucket.upload("path/to/local/file.ext", options)
.then(() => {
const theFile = bucket.file('file.ext');
return theFile.getSignedUrl(signedUrlOptions); // getSignedUrl returns a Promise
})
.then((signedUrl) => {
// do something with the signedURL
});
See:
Bucket.upload() documentation
File.getSignedUrl() documentation
You can make a specific file in a bucket publicly readable with the method makePublic.
From the docs:
const {Storage} = require('@google-cloud/storage');
const storage = new Storage();
// 'my-bucket' is your bucket's name
const myBucket = storage.bucket('my-bucket');
// 'my-file' is the path to your file inside your bucket
const file = myBucket.file('my-file');
file.makePublic(function(err, apiResponse) {});
//-
// If the callback is omitted, we'll return a Promise.
//-
file.makePublic().then(function(data) {
const apiResponse = data[0];
});
Now the URI http://storage.googleapis.com/[BUCKET_NAME]/[OBJECT_NAME] is a public link to the file, as explained here.
The point is that you only need this minimal code to make an object public, for instance with a Cloud Function. Then you already know what the public link looks like and can use it directly in your app.
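For instance (a minimal sketch reusing myBucket and file from the snippet above), once makePublic() has succeeded the public link can be built directly from the bucket and object names:
// Public objects are served from a predictable URL.
const publicUrl = `https://storage.googleapis.com/${myBucket.name}/${encodeURIComponent(file.name)}`;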

download and untar file than check the content, async await problem, node.js

I am downloading a file in tar format with the request-promise module, then untarring it with the tar module, using async/await syntax.
const list = new Promise(async (resolve, reject) => {
const filePath = "somedir/myFile.tar.gz";
if (!fs.existsSync(filePath)) {
const options = {
uri: "http://tarFileUrl",
encoding: "binary"
};
try {
console.log("download and untar");
const response = await rp.get(options);
const file = await fs.createWriteStream(filePath);
file.write(response, 'binary');
file.on('finish', () => {
console.log('wrote all data to file');
//here is the untar process
tar.x(
{
file: filePath,
cwd: "lists"
}
);
console.log("extracted");
});
file.end();
} catch(e) {
reject();
}
console.log("doesn't exist");
}
}
//here I check whether the file already exists: if it does, there is no need to download or extract it (the try/catch block above)
//then an array is created containing the list content line by line
if (fs.existsSync(filePath)) {
const file = await fs.readFileSync("lists/alreadyExtractedFile.list").toString().match(/[^\r\n]+/g);
if (file) {
file.map(name => {
if (name === checkingName) {
blackListed = true;
return resolve(blackListed);
}
});
}
else {
console.log("err");
}
}
The console.log output sequence is like so:
download and untar
file doesn't exist
UnhandledPromiseRejectionWarning: Error: ENOENT: no such file or directory, open '...lists/alreadyExtractedFile.list'
wrote all data to file
extracted
So the file lists/alreadyExtractedFile.list is being checked before it's created. My guess is that I'm doing something wrong with async/await: as the console.logs show, the second checking block somehow runs before the file-creation and untarring processes finish.
Please help me figure out what I am doing wrong.
Your problem is here:
const file = await fs.readFileSync("lists/alreadyExtractedFile.list").toString().match(/[^\r\n]+/g);
the readFileSync function doesn't return a promise, so you shouldn't await it:
const file = fs.readFileSync("lists/alreadyExtractedFile.list")
.toString().match(/[^\r\n]+/g);
This should solve the issue
You need to call resolve inside the new Promise() callback. Also, if you are writing a local utility, you can use sync methods whenever possible (in fs, tar, etc.).
This is a small example where a small archive from the Node.js repository is asynchronously downloaded, synchronously written and unpacked, then a file is synchronously read:
'use strict';
const fs = require('fs');
const rp = require('request-promise');
const tar = require('tar');
(async function main() {
try {
const url = 'https://nodejs.org/download/release/latest/node-v11.10.1-headers.tar.gz';
const arcName = 'node-v11.10.1-headers.tar.gz';
const response = await rp.get({ uri: url, encoding: null });
fs.writeFileSync(arcName, response, { encoding: null });
tar.x({ file: arcName, cwd: '.', sync: true });
const fileContent = fs.readFileSync('node-v11.10.1/include/node/v8-version.h', 'utf8');
console.log(fileContent.match(/[^\r\n]+/g));
} catch (err) {
console.error(err);
}
})();
