How to get the download stream (buffer) using puppeteer? - node.js

I want to get the download content (buffer) and soon after, store the data in my S3 account. So far I haven't been able to find a solution... Looking for examples on the web, I noticed that a lot of people have this problem. I tried (unsuccessfully) to use the page.on("response") event to retrieve the raw response content, according to the following snippet:
const bucket = [];
page.on("response", async response => {
  const url = response.url();
  if (
    url ===
    "https://the.earth.li/~sgtatham/putty/0.71/w32/putty-0.71-installer.msi"
  ) {
    try {
      if (response.status() === 200) {
        bucket.push(await response.buffer());
        console.log(bucket);
        // I got the following: 'Protocol error (Network.getResponseBody): No resource with given identifier found'
      }
    } catch (err) {
      console.error(err, "ERROR");
    }
  }
});
With the code above, my intention was to detect the download dialog event and then, in some way, receive the binary content.
I'm not sure if that's the correct approach. I noticed that some people use a solution based on reading files; in other words, after the download finishes, they read the stored file from disk. There is a similar discussion at: https://github.com/GoogleChrome/puppeteer/issues/299.
My question is: Is there some way (using puppeteer), to intercept the download stream without having to save the file to disk before?
Thank you very much.

The problem is that the buffer is cleared as soon as any kind of navigation request happens. This might be a redirect or page reload in your case.
To solve this problem, you need to make sure that the page does not make any navigation requests as long as you have not finished downloading your resource. To do this we can use page.setRequestInterception.
There is a simple solution, which might get you started but might not always work, and a more complex solution to this problem.
Simple solution
This solution cancels any navigation requests after the initial request. This means, any reload or navigation on the page will not work. Therefore the buffers of the resources are not cleared.
const browser = await puppeteer.launch();
const [page] = await browser.pages();

let initialRequest = true;
await page.setRequestInterception(true);

page.on('request', request => {
  // cancel any navigation requests after the initial page.goto
  if (request.isNavigationRequest() && !initialRequest) {
    return request.abort();
  }
  initialRequest = false;
  request.continue();
});

page.on('response', async (response) => {
  if (response.url() === 'RESOURCE YOU WANT TO DOWNLOAD') {
    const buffer = await response.buffer();
    // handle buffer
  }
});

await page.goto('...');
Advanced solution
The following code will process each request one after another. If the response is the resource you want to download, it will wait until the buffer has been read before processing the next request.
const browser = await puppeteer.launch();
const [page] = await browser.pages();

let paused = false;
let pausedRequests = [];

const nextRequest = () => { // continue the next request or "unpause"
  if (pausedRequests.length === 0) {
    paused = false;
  } else {
    // continue first request in "queue"
    (pausedRequests.shift())(); // calls the request.continue function
  }
};

await page.setRequestInterception(true);
page.on('request', request => {
  if (paused) {
    pausedRequests.push(() => request.continue());
  } else {
    paused = true; // pause, as we are processing a request now
    request.continue();
  }
});

page.on('requestfinished', async (request) => {
  const response = await request.response();
  if (response.url() === 'RESOURCE YOU WANT TO DOWNLOAD') {
    const buffer = await response.buffer();
    // handle buffer
  }
  nextRequest(); // continue with next request
});
page.on('requestfailed', nextRequest);

await page.goto('...');
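To close the loop on the original question (pushing the captured buffer to S3), here is a minimal sketch assuming the AWS SDK v3 client; the region, bucket name, and object key are hypothetical placeholders:

const { S3Client, PutObjectCommand } = require('@aws-sdk/client-s3');

const s3 = new S3Client({ region: 'us-east-1' }); // assumed region

// call this from the 'response' handler once response.buffer() resolves
async function uploadBuffer(buffer) {
  await s3.send(new PutObjectCommand({
    Bucket: 'my-bucket',             // hypothetical bucket name
    Key: 'putty-0.71-installer.msi', // hypothetical object key
    Body: buffer,
  }));
}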

Related

Koa API server - Wait until previous request is processed before processing a new request

I'm building an API in Node with Koa which uses another API to process some information. A request comes in to my API from the client, and my API makes several requests to another API. The problem is, the other API is fragile and slow, so to guarantee data integrity I have to check that no previous incoming request is still being processed before starting a new process. My first idea was to use promises and a global boolean to check if there is ongoing processing and await until the process has finished. Somehow this prevents concurrent requests, but even if 3-4 requests come in during the process, only the first one is handled and that is it. Why are the rest of the incoming requests forgotten?
Edit: As a side note, I do not need to respond to the incoming request with processed information. I could send the response right after the request is received. I just need to do operations with the 3rd-party API.
My solution so far:
The entry point:
router.get('/update', (ctx, next) => {
  ctx.body = 'Updating...';
  update();
  next();
});
And the update function:
let updateInProgress = false;

const update = async () => {
  const updateProcess = () => {
    return new Promise((resolve, reject) => {
      if (!updateInProgress) {
        return resolve();
      } else {
        setTimeout(updateProcess, 5000);
      }
    });
  };
  await updateProcess();
  updateInProgress = true;
  // Process the request
  updateInProgress = false;
}
Ok, I found a working solution, not sure how elegant it is though...
I'm guessing the problem was that a new Promise was created by the timeout function, and another one, and another one, until one of them was resolved. That did not resolve the first Promise though, and the code got stuck. The solution was to create an interval which checks whether the condition is met and then resolves the Promise. If someone smarter could comment, I'd appreciate it.
let updateInProgress = false;

const update = async () => {
  const updateProcess = () => {
    return new Promise((resolve, reject) => {
      if (!updateInProgress) {
        return resolve();
      } else {
        const processCheck = setInterval(() => {
          if (!updateInProgress) {
            clearInterval(processCheck);
            return resolve();
          }
        }, 5000);
      }
    });
  };
  await updateProcess();
  updateInProgress = true;
  // Process the request
  updateInProgress = false;
}
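A simpler pattern (my sketch, not part of the original answer) is to chain every incoming request onto a single shared promise, so each run starts only after the previous one settles; processRequest is a hypothetical stand-in for the 3rd-party API work:

let queue = Promise.resolve();

const update = () => {
  // each call waits for the previous processing to settle before starting
  queue = queue
    .then(() => processRequest())        // processRequest: hypothetical worker
    .catch((err) => console.error(err)); // keep the chain alive on failure
  return queue;
};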

Node js repeating a get request until there is a change in response

I will start off by saying I am a complete newbie when it comes to Node.js. I have the code below which currently sends a GET request to the URL. It parses a specific value of the response and stores it as the search variable. It then uses the Instagram API to change the bio on my Instagram account to that search variable. However, I would like the GET requests to continue until a change is detected. For example: when the program is first run it fires off a GET request, and we will call the first response value 1. After the first response I want it to keep making GET requests, say every 5 seconds. The moment the response value changes from 1 to anything else, I want that new value to be sent to the Instagram bio. Can anyone help?
const { IgApiClient } = require("instagram-private-api")
const ig = new IgApiClient()
const https = require('https')

const USERNAME = "MYUSERNAME"
const PASSWORD = "MYPASS"
ig.state.generateDevice(USERNAME)

const main = async () => {
  let url = "https://11z.co/_w/14011/selection";
  https.get(url, (res) => {
    let body = "";
    res.on("data", (chunk) => {
      body += chunk;
    });
    res.on("end", async () => {
      try {
        search = JSON.parse(body).value;
      } catch (error) {
        console.error(error.message);
      };
    });
  }).on("error", (error) => {
    console.error(error.message);
  });

  await ig.simulate.preLoginFlow()
  await ig.account.login(USERNAME, PASSWORD)
  // log out of Instagram when done
  process.nextTick(async () => await ig.simulate.postLoginFlow())
  // fill in whatever you want your new Instagram bio to be
  await ig.account.setBiography(search)
}

main()
// code is written in main() so that I can use async/await
To be a good citizen to the target endpoint, have a look at the exponential-backoff package: https://www.npmjs.com/package/exponential-backoff
A utility that allows retrying a function with an exponential delay between attempts.
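For completeness, a hedged sketch of the polling loop the question asks for (assuming ig is already logged in as in the question's code; fetchValue just wraps the https.get logic in a Promise):

const fetchValue = (url) => new Promise((resolve, reject) => {
  https.get(url, (res) => {
    let body = "";
    res.on("data", (chunk) => body += chunk);
    res.on("end", () => {
      try { resolve(JSON.parse(body).value); }
      catch (err) { reject(err); }
    });
  }).on("error", reject);
});

let lastValue;
const poll = async () => {
  const value = await fetchValue("https://11z.co/_w/14011/selection");
  if (lastValue !== undefined && value !== lastValue) {
    await ig.account.setBiography(value); // only update the bio on change
  }
  lastValue = value;
  setTimeout(poll, 5000); // schedule the next check after this one finishes
};
poll();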

Puppeteer: return JSON response of AJAX response

While the page is loading, I am trying to wait for a certain AJAX request made by my page and then return its response's JSON body. My code does not stop iterating through every response, even after the condition is met within the listener for the 'response' event.
Once I find the response I want to return, how can I capture the JSON from the response, stop the page from loading further, and return my JSON?
async function runScrape() {
  const browser = await browserPromise;
  const context = await browser.createIncognitoBrowserContext();
  const page = await context.newPage();
  await page.setDefaultTimeout(60000);

  let apiResponse;
  page.on('response', async response => {
    let url = response.url();
    let status = response.status();
    console.info(status + " NETWORK CALL: " + url);
    if (url.match(requestPattern)) {
      apiResponse = await response.text();
      await page.evaluate(() => window.stop());
    }
  });

  await page.goto(req.query.url);
  console.log("API RESPONSE:\n" + apiResponse);
  return apiResponse
}
=== UPDATE ===
This was the solution that ended up working. It seemed this approach was required due to the specific behavior of the page being scraped.
async function runScrape() {
  const browser = await browserPromise;
  const context = await browser.createIncognitoBrowserContext();
  const page = await context.newPage();
  await page.setDefaultTimeout(60000);
  await page.setRequestInterception(true);

  let JSONResponse;
  page.on('response', async response => {
    if (!JSONResponse && response.url().match(requestPattern)) {
      JSONResponse = await response.text();
    }
  });

  page.on('request', request => {
    if (request.resourceType() === 'image' || request.resourceType() === 'stylesheet') request.abort()
    else request.continue()
  });

  await page.goto(scrapeURL, {waitUntil: 'networkidle2'});
  await page.close();
  return JSONResponse
}

runScrape()
  .then(response => {
    res.setHeader("content-type", "application/json");
    res.status(200).send(response);
  })
  .catch(err => {
    let payload = {"errorType": err.name, "errorMessage": err.message + "\n" + err.stack};
    console.error(JSON.stringify(payload));
    res.status(500).json(payload);
  });
I would simplify it to a single page.on('response', ...) handler where we look for the desired request pattern with String.includes().
Once the response is identified, we can emulate the browser's "Stop loading this page" button with await page.evaluate(() => window.stop()). The window.stop() method won't close the browser; it just stops the network requests.
let resp

page.on('response', async response => {
  if (response.url().includes(requestPattern)) {
    resp = await response.json()
    await page.evaluate(() => window.stop())
  }
})

await page.goto(req.query.url, { waitUntil: 'networkidle0' })
console.log(resp)
Edit:
To avoid an undefined response you should use the waitUntil: 'networkidle0' setting on page.goto(); see the docs about the options. You got undefined because, by default, puppeteer considers the page to be loaded when the load event fires (this is the default setting of waitUntil). So if the page is considered loaded but there are still network connections in the queue and your request pattern has not been found yet, the script moves on from goto to console.log. By waiting until all network requests have finished, you make sure the request is registered before that happens.
networkidle0: consider navigation to be finished when there are no more than 0 network connections for at least 500 ms.
Please note: with networkidle set you won't be able to disconnect once the request-pattern condition is fulfilled, so your plan to stop the responses won't be possible.
I recommend aborting those resourceTypes which are not needed; this way you may get similar results as you would by stopping the requests.
For example:
Place it right after the page.on('response', ...) block ends.
await page.setRequestInterception(true)

page.on('request', request => {
  if (request.resourceType() === 'image' || request.resourceType() === 'stylesheet') request.abort()
  else request.continue()
})
You can use it with a request.url().includes(unwantedRequestPattern) condition as well if you know which connections you don't need.
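As a side note (my addition, not from the original answers): recent puppeteer versions also ship page.waitForResponse, which can replace the hand-rolled listener when you only need the first matching response; requestPattern is assumed to be a RegExp here:

const [response] = await Promise.all([
  page.waitForResponse(res => requestPattern.test(res.url())), // resolves on first match
  page.goto(scrapeURL),
]);
const json = await response.json();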

Problem with async when downloading a series of files with nodejs

I'm trying to download a bunch of files. Let's say 1.jpg, 2.jpg, 3.jpg and so on. If 1.jpg exist, then I want to try and download 2.jpg. And if that exist I will try the next, and so on.
But the current "getFile" returns a promise, so I can't loop through it. I thought I had solved it by adding await in front of the http.get method. But it looks like it doesn't wait for the callback method to finish. Is there a more elegant way to solve this than to wrap the whole thing in a new async method?
// this returns a promise
var result = getFile(url, fileToDownload);

const getFile = async (url, saveName) => {
  try {
    const file = fs.createWriteStream(saveName);
    const request = await http.get(url, function (response) {
      const { statusCode } = response;
      if (statusCode === 200) {
        response.pipe(file);
        return true;
      }
      else
        return false;
    });
  } catch (e) {
    console.log(e);
    return false;
  }
}
I don't think your getFile method is returning a promise, and there is also no point in awaiting a callback. You should split the functionality into two parts:
- getting the file
- saving the file, if getting it returned something
Try code like this:
const getFile = url => {
  return new Promise((resolve, reject) => {
    http.get(url, response => {
      const { statusCode } = response;
      if (statusCode === 200) {
        resolve(response);
      } else {
        reject(new Error(`Unexpected status code: ${statusCode}`));
      }
    });
  });
};

async function save(url, saveName) {
  const result = await getFile(url);
  if (result) {
    const file = fs.createWriteStream(saveName);
    result.pipe(file); // pipe the response stream into the file
  }
}
What you are trying to do is get / request images in a sync fashion.
Possible solutions:
- You know the exact number of images you want to get: go ahead with the "request" or "http" module and use a promise chain.
- You do not know the exact number of images, but will stop at image N-1 if N is not found: go ahead with the sync-request module.
Your getFile does return a promise, but only because it has the async keyword before it, and it's not the kind of promise you want. http.get uses old callback-style handling; luckily it's easy to convert it to a Promise to suit your needs.
const tryToGetFile = (url, saveName) => {
  return new Promise((resolve) => {
    http.get(url, response => {
      if (response.statusCode === 200) {
        const stream = fs.createWriteStream(saveName)
        response.pipe(stream)
        resolve(true);
      } else {
        // usually it is better to reject the promise and propagate errors further,
        // but the function is called tryToGetFile as it expects that some files
        // will not be available, and this is not an error. Simply resolve to false
        resolve(false);
      }
    })
  })
}

const fileUrls = [
  'somesite.file1.jpg',
  'somesite.file2.jpg',
  'somesite.file3.jpg',
  'somesite.file4.jpg',
]

const downloadInSequence = async () => {
  // using for..of instead of forEach to be able to pause
  // downloadInSequence function execution while getting file
  // can also use classic for
  for (const fileUrl of fileUrls) {
    const success = await tryToGetFile('http://' + fileUrl, fileUrl)
    if (!success) {
      // file with this name wasn't found
      return;
    }
  }
}
This is a basic setup to show how to wrap http.get in a Promise and run the requests in sequence. Add error handling wherever you want. Also it's worth noting that it will proceed to the next file as soon as it has received a 200 status code and started downloading, rather than waiting for the full download before proceeding; see the variant below.
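If you do need to wait for the full download, a variant (my sketch, under the same assumptions as the answer's code) can resolve only when the write stream emits 'finish':

const tryToGetFile = (url, saveName) => new Promise((resolve, reject) => {
  http.get(url, (response) => {
    if (response.statusCode === 200) {
      const stream = fs.createWriteStream(saveName)
      response.pipe(stream)
      stream.on('finish', () => resolve(true)) // file fully written to disk
      stream.on('error', reject)
    } else {
      response.resume() // drain the response so the socket can be reused
      resolve(false)
    }
  }).on('error', reject)
})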

How to wait for all downloads to complete with Puppeteer?

I have a small web scraping application that downloads multiple files from a web application where the URLs require visiting the page.
It works fine if I keep the browser instance alive in between runs, but I want to close the instance in between runs. When I call browser.close() my downloads are stopped because the chrome instance is closed before the downloads have finished.
Does puppeteer provide a way to check if downloads are still active, and wait for them to complete? I've tried page.waitForNavigation({ waitUntil: "networkidle0" }) and "networkidle2", but those seem to wait indefinitely.
node.js 8.10
puppeteer 1.10.0
Update:
It's 2022. Use Playwright to get away from this mess. It manages downloads for you.
It also has 'smarter' locators, which examine selectors every time before click().
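A minimal Playwright sketch of that download handling (my addition; the button selector is hypothetical):

const [download] = await Promise.all([
  page.waitForEvent('download'),  // resolves as soon as the download starts
  page.click('#download-button'), // hypothetical action triggering the download
]);
await download.path(); // blocks until the download completes, returns the file path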
old version for puppeteer:
My solution is to use Chrome's own chrome://downloads/ page to manage downloaded files. This makes it very easy to auto-restart a failed download using Chrome's own feature.
This example is 'single threaded' currently, because it only monitors the first item that appears on the download manager page. But you can easily adapt it to 'infinite threads' by iterating through all download items (#frb0~#frbn) on that page; just take care of your network :)
dmPage = await browser.newPage()
await dmPage.goto('chrome://downloads/')

await your_download_button.click() // start download
await dmPage.bringToFront() // this is necessary

await dmPage.waitForFunction(
  () => {
    // monitoring the state of the first download item
    // if finished, return true; if failed, click retry
    const dm = document.querySelector('downloads-manager').shadowRoot
    const firstItem = dm.querySelector('#frb0')
    if (firstItem) {
      const thatArea = firstItem.shadowRoot.querySelector('.controls')
      const atag = thatArea.querySelector('a')
      if (atag && atag.textContent === '在文件夹中显示') { // "show in folder" on a Chinese-locale Chrome; you can try some ids/classes and do a better job than me lol
        return true
      }
      const btn = thatArea.querySelector('cr-button')
      if (btn && btn.textContent === '重试') { // "try again" on a Chinese-locale Chrome
        btn.click()
      }
    }
  },
  { polling: 'raf', timeout: 0 }, // polling? yes. there is a 'polling: "mutation"' which is kind of async
)
console.log('finish')
An alternative, if you have the file name, or a suggestion for other ways to check:
async function waitFile(filename) {
  // poll until the file exists on disk
  while (!fs.existsSync(filename)) {
    await delay(3000);
  }
}

function delay(time) {
  return new Promise(function (resolve) {
    setTimeout(resolve, time)
  });
}
Implementation:
var filename = `${yyyy}${mm}_TAC.csv`;
var pathWithFilename = `${config.path}\\${filename}`;
await waitFile(pathWithFilename);
You need to check the request's response.
page.on('response', (response) => { console.log(response.status(), response.url()); });
Check what comes back in the response and look at its status; a successful download comes with status 200.
Using puppeteer and Chrome, I have one more solution which might help you.
If you are downloading a file from Chrome it will always have the ".crdownload" extension, and when the file is completely downloaded that extension vanishes.
So, I am using a recurring function with a maximum number of iterations; if the file hasn't downloaded in that time, I delete it. And I am constantly checking the folder for that extension.
async checkFileDownloaded(path, timer) {
  let noOfFile;
  try {
    noOfFile = fs.readdirSync(path); // readdirSync is synchronous, no await needed
  } catch (err) {
    return "null";
  }
  for (const file of noOfFile) {
    if (file.includes('.crdownload')) {
      await this.delay(20000);
      if (timer == 0) {
        // give up: delete the partial download
        fs.unlink(path + '/' + file, (err) => {});
        return "Success";
      } else {
        return this.checkFileDownloaded(path, timer - 1);
      }
    }
  }
  return "Success";
}
Here is another function; it just waits for the pause button to disappear:
async function waitForDownload(browser: Browser) {
  const dmPage = await browser.newPage();
  await dmPage.goto("chrome://downloads/");
  await dmPage.bringToFront();
  await dmPage.waitForFunction(() => {
    try {
      const donePath = document.querySelector("downloads-manager")!.shadowRoot!
        .querySelector("#frb0")!
        .shadowRoot!.querySelector("#pauseOrResume")!;
      if ((donePath as HTMLButtonElement).innerText != "Pause") {
        return true;
      }
    } catch {
      //
    }
  }, { timeout: 0 });
  console.log("Download finished");
}
I didn't like solutions that check the DOM or the file system for the file.
From the Chrome DevTools Protocol documentation (https://chromedevtools.github.io/) I found two events,
Page.downloadProgress and Browser.downloadProgress. (Though Page.downloadProgress is marked as deprecated, that's the one that worked for me.)
This event has a property called state which tells you about the state of the download. state can be inProgress, completed, or canceled.
You can wrap this event in a Promise and await it until the state changes to completed:
async function waitUntilDownload(page, fileName = '') {
  return new Promise((resolve, reject) => {
    page._client().on('Page.downloadProgress', e => {
      if (e.state === 'completed') {
        resolve(fileName);
      } else if (e.state === 'canceled') {
        reject();
      }
    });
  });
}
and await it as follows,
await waitUntilDownload(page, fileName);
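Note (my addition, not part of the original answer): for the download to land in a known folder you typically also enable download behavior on the same CDP session; a sketch, assuming a Chrome build that still honors the deprecated Page.setDownloadBehavior command:

await page._client().send('Page.setDownloadBehavior', {
  behavior: 'allow',
  downloadPath: './downloads', // hypothetical target folder
});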
Created a simple await function that checks for the file rapidly, or times out after 10 seconds:
import fs from "fs";

awaitFileDownloaded: async (filePath) => {
  let timeout = 10000
  const delay = 200
  return new Promise(async (resolve, reject) => {
    while (timeout > 0) {
      if (fs.existsSync(filePath)) {
        resolve(true);
        return
      } else {
        // HelperUI.delay: external helper returning a Promise that resolves after the given ms
        await HelperUI.delay(delay)
        timeout -= delay
      }
    }
    reject("awaitFileDownloaded timed out")
  });
},
You can use node-watch to report updates in the target directory. When the download is complete you will receive an update event with the name of the newly downloaded file.
Run npm to install node-watch:
npm install node-watch
Sample code:
const puppeteer = require('puppeteer');
const watch = require('node-watch');
const path = require('path');

const watchDir = '/Users/home/Downloads';
const filepath = path.join(watchDir, "download_file");

(async () => {
  // Add code to initiate the download ...
  watch(watchDir, function (event, name) {
    if (event == "update" && name === filepath) {
      browser.close(); // use case specific
      process.exit(); // use case specific
    }
  });
})();
Try doing an await page.waitFor(50000); with a time as long as the download should take. Or look at watching for file changes to detect a complete file transfer.
You could search the download location for the extension files have while still downloading ('crdownload'); when the download completes, the file is renamed back to its original extension: 'video_audio_file.mp4.crdownload' becomes 'video_audio_file.mp4', without the 'crdownload' at the end.
const fs = require('fs');
const path = require('path');

const myPath = path.resolve('/your/file/download/folder');
let siNo = 0;

function stillWorking(myPath) {
  siNo = 0;
  const filenames = fs.readdirSync(myPath);
  filenames.forEach(file => {
    if (file.includes('crdownload')) {
      siNo = 1;
    }
  });
  return siNo;
}
Then you use it in an infinite loop like this, checking at a certain interval; here I check every 3 seconds, and when it returns 0 there are no files still pending download.
const { execSync } = require('child_process');

while (true) {
  execSync('sleep 3'); // block for 3 seconds between checks
  if (stillWorking(myPath) == 0) {
    await browser.close();
    break;
  }
}
