I have a small web scraping application that downloads multiple files from a web application where the URLs require visiting the page.
It works fine if I keep the browser instance alive between runs, but I want to close the instance between runs. When I call browser.close(), my downloads are stopped because the Chrome instance is closed before the downloads have finished.
Does puppeteer provide a way to check if downloads are still active, and wait for them to complete? I've tried page.waitForNavigation({ waitUntil: "networkidle0" }) and "networkidle2", but those seem to wait indefinitely.
node.js 8.10
puppeteer 1.10.0
Update:
It's 2022. Use Playwright to get away from this mess; it manages downloads for you.
It also has 'smarter' locators, which re-evaluate selectors every time before click().
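For instance, a minimal Playwright sketch could look like this (the URL and the #download-button selector are placeholders; older Playwright versions need acceptDownloads: true when creating the page):
const { chromium } = require('playwright');

(async () => {
  const browser = await chromium.launch();
  const page = await browser.newPage({ acceptDownloads: true });
  await page.goto('https://example.com/downloads'); // placeholder URL

  // Start waiting for the download before clicking, so the event is not missed
  const [download] = await Promise.all([
    page.waitForEvent('download'),
    page.click('#download-button'), // placeholder selector
  ]);

  // download.path() resolves once the download has finished
  console.log('Downloaded to', await download.path());
  // or persist it under a name you choose
  await download.saveAs('./my-file.bin');

  await browser.close();
})();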
Old version of this answer, for Puppeteer:
My solution is to use Chrome's own chrome://downloads/ page to manage downloaded files. It also makes it very easy to auto-restart a failed download using Chrome's own retry feature.
This example is 'single threaded' for now, because it only monitors the first item that appears in the download manager page. But you can easily adapt it to handle many downloads by iterating through all download items (#frb0~#frbn) on that page (see the sketch after the example below); just keep an eye on your network :)
const dmPage = await browser.newPage()
await dmPage.goto('chrome://downloads/')

await your_download_button.click() // start the download

await dmPage.bringToFront() // this is necessary

await dmPage.waitForFunction(
  () => {
    // monitor the state of the first download item:
    // if it has finished, return true; if it has failed, click retry
    const dm = document.querySelector('downloads-manager').shadowRoot
    const firstItem = dm.querySelector('#frb0')
    if (firstItem) {
      const thatArea = firstItem.shadowRoot.querySelector('.controls')
      const atag = thatArea.querySelector('a')
      if (atag && atag.textContent === '在文件夹中显示') { // 'Show in folder' in a Chinese-locale Chrome; adjust to your locale, or key off ids/classes instead of text
        return true
      }
      const btn = thatArea.querySelector('cr-button')
      if (btn && btn.textContent === '重试') { // 'Retry'
        btn.click()
      }
    }
  },
  { polling: 'raf', timeout: 0 }, // 'raf' re-checks on every animation frame; 'mutation' is also available and only re-checks on DOM mutations
)

console.log('finish')
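As mentioned above, a rough sketch for monitoring all download items instead of just the first one could look like this (the selectors and the reliance on the 'Show in folder' link are assumptions based on the snippet above and may differ between Chrome versions and locales):
await dmPage.bringToFront()
await dmPage.waitForFunction(
  () => {
    const dm = document.querySelector('downloads-manager').shadowRoot
    // every item on chrome://downloads has an id like #frb0, #frb1, ...
    const items = dm.querySelectorAll('[id^="frb"]')
    if (items.length === 0) return false // nothing has started yet
    return Array.from(items).every(item => {
      const controls = item.shadowRoot.querySelector('.controls')
      const atag = controls && controls.querySelector('a')
      // the 'Show in folder' link only shows up once that item has finished
      return Boolean(atag)
    })
  },
  { polling: 'raf', timeout: 0 },
)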
An alternative, if you know the file name (or want another way to check): poll the file system until the file exists.
const fs = require('fs');

async function waitFile(filename) {
  // poll every 3 seconds until the file exists
  while (!fs.existsSync(filename)) {
    await delay(3000);
  }
}
function delay(time) {
return new Promise(function(resolve) {
setTimeout(resolve, time)
});
}
Implementation:
var filename = `${yyyy}${mm}_TAC.csv`;
var pathWithFilename = `${config.path}\\${filename}`;
await waitFile(pathWithFilename);
You need to check the request's response.
page.on('response', (response) => { console.log(response.url(), response.status()) })
Check what comes back in each response and look at its status; a successful download response comes with status 200.
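For example, you could wait for the specific response with page.waitForResponse (the file URL and button selector below are placeholders):
const fileUrl = 'https://example.com/report.csv'; // placeholder
const responsePromise = page.waitForResponse(
  (response) => response.url() === fileUrl && response.status() === 200,
  { timeout: 60000 }
);
await page.click('#download-button'); // placeholder selector that triggers the download
const response = await responsePromise;
console.log('Download response received:', response.url(), response.status());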
Using Puppeteer and Chrome, I have one more solution which might help you.
If you are downloading a file with Chrome, it will always have a ".crdownload" extension while in progress. When the file is completely downloaded, that extension vanishes.
So I am using a recursive function with a maximum number of iterations; if the file hasn't finished downloading in that time, I delete it. Meanwhile, I keep checking the folder for that extension.
async checkFileDownloaded(path, timer) {
  let files;
  try {
    files = fs.readdirSync(path); // readdirSync is synchronous, no await needed
  } catch (err) {
    return "null"; // directory could not be read
  }
  for (const file of files) {
    if (file.includes('.crdownload')) {
      // a download is still in progress, wait and re-check
      await this.delay(20000);
      if (timer == 0) {
        // out of retries: delete the partial download
        fs.unlink(path + '/' + file, (err) => {});
        return "Success";
      }
      return this.checkFileDownloaded(path, timer - 1);
    }
  }
  return "Success";
}
Here is another function; it just waits for the Pause button to disappear:
async function waitForDownload(browser: Browser) {
  const dmPage = await browser.newPage();
  await dmPage.goto("chrome://downloads/");
  await dmPage.bringToFront();
  await dmPage.waitForFunction(() => {
    try {
      const pauseButton = document
        .querySelector("downloads-manager")!.shadowRoot!
        .querySelector("#frb0")!.shadowRoot!
        .querySelector("#pauseOrResume")!;
      if ((pauseButton as HTMLButtonElement).innerText != "Pause") {
        return true;
      }
    } catch {
      // the download item may not exist yet
    }
  }, { timeout: 0 });
  console.log("Download finished");
}
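For example, you can call it right after triggering the download and before closing the browser (someDownloadButton here stands for whatever element starts your download):
await someDownloadButton.click(); // placeholder: whatever triggers your download
await waitForDownload(browser);
await browser.close();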
I didn't like the solutions that check the DOM or the file system for the file.
From the Chrome DevTools Protocol documentation (https://chromedevtools.github.io/) I found two events,
Page.downloadProgress and Browser.downloadProgress. (Though Page.downloadProgress is marked as deprecated, that's the one that worked for me.)
This event has a property called state which tells you the state of the download: it can be inProgress, completed, or canceled.
You can wrap this event in a Promise and await it until the state changes to completed:
async function waitUntilDownload(page, fileName = '') {
return new Promise((resolve, reject) => {
page._client().on('Page.downloadProgress', e => {
if (e.state === 'completed') {
resolve(fileName);
} else if (e.state === 'canceled') {
reject();
}
});
});
}
and await it as follows,
await waitUntilDownload(page, fileName);
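For completeness, the whole flow could look something like this; the download path and button selector are placeholders, and depending on your Puppeteer version you may need page._client as a property, or a session from page.createCDPSession(), instead of page._client():
const client = page._client();
// tell Chrome where to put the file (Page.setDownloadBehavior is deprecated but still works in many versions)
await client.send('Page.setDownloadBehavior', {
  behavior: 'allow',
  downloadPath: '/tmp/my-downloads', // placeholder path
});

const downloaded = waitUntilDownload(page, 'report.csv'); // start listening before clicking
await page.click('#download-button'); // placeholder selector
await downloaded;
console.log('Download completed');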
I created a simple awaitable function that checks for the file repeatedly, or times out after 10 seconds:
import fs from "fs";

awaitFileDownloaded: async (filePath) => {
  let timeout = 10000;
  const delay = 200;
  // poll every `delay` ms until the file exists or the timeout runs out
  while (timeout > 0) {
    if (fs.existsSync(filePath)) {
      return true;
    }
    await HelperUI.delay(delay);
    timeout -= delay;
  }
  throw new Error("awaitFileDownloaded timed out");
},
You can use node-watch to report updates to the target directory. When the file download is complete you will receive an update event with the name of the new file that has been downloaded.
Run npm to install node-watch:
npm install node-watch
Sample code:
const puppeteer = require('puppeteer');
const watch = require('node-watch');
const path = require('path');

const watchDir = '/Users/home/Downloads';
const filepath = path.join(watchDir, "download_file");

(async () => {
  // Add code to launch the browser and initiate the download ...
  watch(watchDir, function (event, name) {
    if (event == "update" && name === filepath) {
      browser.close(); // use case specific
      process.exit(); // use case specific
    }
  });
})();
You could try doing an await page.waitFor(50000); with a wait as long as the download should take.
Or look at watching for file changes to detect when the file transfer is complete (see the sketch below).
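A rough sketch of that second idea using Node's built-in fs.watch (the directory, file name, and timeout are placeholders, and fs.watch behaves slightly differently between platforms):
const fs = require('fs');
const path = require('path');

// resolve once a file with the expected name appears in the download directory
function waitForFile(dir, expectedName, timeoutMs = 60000) {
  return new Promise((resolve, reject) => {
    let watcher;
    const timer = setTimeout(() => {
      watcher.close();
      reject(new Error('Timed out waiting for ' + expectedName));
    }, timeoutMs);
    watcher = fs.watch(dir, (eventType, filename) => {
      if (filename === expectedName && fs.existsSync(path.join(dir, filename))) {
        clearTimeout(timer);
        watcher.close();
        resolve(path.join(dir, filename));
      }
    });
  });
}

// usage: await waitForFile('/Users/home/Downloads', 'report.csv');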
You could search the download location for the extension files have while still downloading ('crdownload'). When the download is completed, the file is renamed back to its original extension: 'video_audio_file.mp4.crdownload' turns into 'video_audio_file.mp4', without the 'crdownload' at the end.
const fs = require('fs');
const path = require('path');
const { execSync } = require('child_process');

const myPath = path.resolve('/your/file/download/folder');

// returns 1 if any file in the folder is still being downloaded, 0 otherwise
function stillWorking(myPath) {
  let siNo = 0;
  const filenames = fs.readdirSync(myPath);
  filenames.forEach(file => {
    if (file.includes('crdownload')) {
      siNo = 1;
    }
  });
  return siNo;
}
Then you use it in an infinite loop like this, checking every so often. Here I check every 3 seconds; when stillWorking() returns 0 there are no more pending downloads.
while (true) {
execSync('sleep 3');
if (stillWorking(myPath) == 0) {
await browser.close();
break;
}
}
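Note that execSync('sleep 3') only works where a sleep binary exists (Linux/macOS); a portable variant of the same loop could use a promise-based delay instead:
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

while (stillWorking(myPath) !== 0) {
  await sleep(3000); // re-check every 3 seconds
}
await browser.close();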
Related
I get problems when I try to execute two async functions.
This is my first function; it downloads images into specific folders.
const getLayerData = async (Collection_id) => {
let layers_data = await getLayersID(Collection_id);
//let collection_name = await getcollecionName(Collection_id);
return new Promise(function(resolve, reject) {
layers_data.forEach((value, key) => {
getTraitData(key, value);
});
console.log('finish downloading layers');
});
};
My second function resizes the images, modifies them, and uploads them to Firestore.
To run both, I read states from Firestore to decide which tasks to do. When it runs, only one of the async functions executes.
const start = async () => {
var users = await readStates();
var task = users.length;
while (true) {
if (task > 0){
users.forEach(data => {
// download layers
buildSetup(); //sync
getLayerData(data[1]); //async
startCreating(); //async allow resize, modify, upload to firestore, storage
//espera (data);
//runTask();
//end task processs
task = task - 1;
});
}else{
//console.log(task);
break;
};
};
};
My plan is to connect to a page, interact with its elements for a while, and then wait and start over. Since the process of accessing the page is complicated, I would ideally log in only once and then stay on the page permanently.
index.js
const puppeteer = require('puppeteer');
const creds = require("./creds.json");
(async () => {
try {
const browser = await puppeteer.launch();
const page = await browser.newPage();
await page.goto('https://www.online-messenger.com');
await goToChats(page);
await page.waitForSelector('div[aria-label="Chats"]');
setInterval(async () => {
let index = 1;
while (index <= 5) {
if (await isUnread(page, index)) {
await page.click(`#local-chat`);
await page.waitForSelector('div[role="main"]');
let conversationName = await getConversationName(page);
if (isChat(conversationName)) {
await writeMessage(page);
}
}
index++;
}
}, 30000);
} catch (e) { console.log(e); }
await page.close();
await browser.close();
})();
Again, I do not want to close the connection, so I thought adding setInterval() would help with the problem. The core code works absolutely fine, but every single time I run the code with the interval function I get this error:
Error: Protocol error (Runtime.callFunctionOn): Session closed. Most likely the page has been closed.
I timed the main part of my code and it would typically take around 20-25 seconds. I thought the problem lay in the delay being set to 30 seconds, but I get the same error even when I increase it to e.g. 60000 (60 seconds).
What am I doing wrong? Why is setInterval not working, and is there possibly a different way of tackling my problem?
Okay, after spending some more time on the problem, I realised it was indeed the setInterval function that caused all the errors.
The code is asynchronous, and in order to make it all work I had to use an 'async' version of setInterval(): I wrapped my code in an endless loop, where each iteration ends by awaiting a promise that resolves after a specified time.
...
await goToChats(page);
await page.waitForSelector('div[aria-label="Chats"]');
while(true) {
let index = 1;
while (index <= 5) {
if (await isUnread(page, index)) {
await page.click(`#local-chat`);
await page.waitForSelector('div[role="main"]');
let conversationName = await getConversationName(page);
if (isChat(conversationName)) {
await writeMessage(page);
}
}
await waitBeforeNextIteration(10000);
index++;
}
...
waitBeforeNextIteration(ms) {
return new Promise(resolve => setTimeout(resolve, ms))
}
I am trying to create a script to download pages from multiple URLs using Node.js, but the loop doesn't wait for the request to finish and keeps printing. I also got a hint to use an async for loop, but it still didn't work.
here's my code
function GetPage(url){
console.log(` Downloading page ${url}`);
request({
url: `${url}`
},(err,res,body) => {
if(err) throw err;
console.log(` Writing html to file` );
fs.writeFile(`${url.split('/').slice(-1)[0]}`,`${body}`,(err) => {
if(err) throw err;
console.log('saved');
});
});
}
var list = [ 'https://www.someurl.com/page1.html', 'https://www.someurl.com/page2.html', 'https://www.someurl.com/page3.html' ]
const main = async () => {
for(let i = 0; i < list.length; i++){
console.log(` processing ${list[i]}`);
await GetPage(list[i]);
}
};
main().catch(console.error);
Output:
processing https://www.someurl.com/page1.html
Downloading page https://www.someurl.com/page1.html
processing https://www.someurl.com/page2.html
Downloading page https://www.someurl.com/page2.html
processing https://www.someurl.com/page3.html
Downloading page https://www.someurl.com/page3.html
Writing html to file
Writing html to file
saved
saved
Writing html to file
saved
There are a couple of problems with your code.
You are mixing callback-style code and code that should be using promises. Also, your GetPage function doesn't return a promise, so you cannot usefully await it.
You just have to return a promise from your getPage() function, and correctly resolve it or reject it.
function getPage(url) {
  return new Promise((resolve, reject) => {
    console.log(` Downloading page ${url}`);
    request({ url: `${url}` }, (err, res, body) => {
      if (err) return reject(err);
      console.log(` Writing html to file`);
      fs.writeFile(`${url.replace(/\//g, '-')}.html`, `${body}`, (writeErr) => {
        if (writeErr) return reject(writeErr);
        console.log("saved");
        resolve();
      });
    });
  });
}
You don't have to change your main() function; its loop will await the promise returned by getPage().
The for loop doesn't wait for the callback to finish; it just keeps executing. You need to either turn the getPage function into a promise or use Promise.all, as shown below.
var list = [
"https://www.someurl.com/page1.html",
"https://www.someurl.com/page2.html",
"https://www.someurl.com/page3.html",
];
function getPage(url) {
  return new Promise((resolve, reject) => {
    console.log(` Downloading page ${url}`);
    request({ url: `${url}` }, (err, res, body) => {
      if (err) return reject(err);
      console.log(` Writing html to file`);
      // sanitize the URL so it can be used as a filename
      fs.writeFile(`${url.replace(/\//g, '-')}.html`, `${body}`, (writeErr) => {
        if (writeErr) return reject(writeErr);
        console.log("saved");
        resolve();
      });
    });
  });
}
const main = async () => {
  // start all downloads in parallel and wait for them all to finish
  return Promise.all(list.map((path) => getPage(path)));
};
main().catch(console.error);
GetPage() is not built around promises and doesn't even return a promise so await on its result does NOTHING. await has no magic powers. It awaits a promise. If you don't give it a promise that properly resolves/rejects when your async operation is done, then the await does nothing. Your GetPage() function returns nothing so the await has nothing to do.
What you need is to fix GetPage() so it returns a promise that is properly tied to your asynchronous result. Because the request() library has been deprecated and is no longer recommended for new projects and because you need a promise-based solution anyway so you can use await with it, I'd suggest you switch to one of the alternative promise-based libraries recommended here. My favorite from that list is got(), but you can choose whichever one you like best. In addition, you can use fs.promises.writeFile() for promise-based file writing.
Here's how that code would look using got():
const got = require('got');
const { URL } = require('url');
const path = require('path');
const fs = require('fs');
function getPage(url) {
console.log(` Downloading page ${url}`);
return got(url).text().then(data => {
// can't just use an URL for your filename as it contains potentially illegal
// characters for the file system
// so, add some code to create a sanitized filename here
// find just the root filename in the URL
let urlObj = new URL(url);
let filename = path.basename(urlObj.pathname);
if (!filename) {
filename = "index.html";
}
let extension = path.extname(filename);
if (!extension) {
filename += ".html";
} else if (extension === ".") {
filename += "html";
}
console.log(` Writing file ${filename}`)
return fs.promises.writeFile(filename, data);
});
}
const list = ['https://www.someurl.com/page1.html', 'https://www.someurl.com/page2.html', 'https://www.someurl.com/page3.html'];
async function main() {
for (let url of list) {
console.log(` processing ${url}`);
await getPage(url);
}
}
main().then(() => {
console.log("all done");
}).catch(console.error);
If you put real URLs in the array, this is directly runnable in nodejs. I ran it myself with my own URLs.
Summary of Changes and Improvements:
Switched from request() to got() because it's promise-based and not deprecated.
Modified getPage() to return a promise that represents the asynchronous operations in the function.
Switched to fs.promises.writeFile() so we are using only promises for asynchronous control-flow.
Added legal filename generation from the base path of the URL since you can't just use a full URL as a filename (at least in some file systems).
Switched to a simpler for/of loop.
I'm trying to download a bunch of files. Let's say 1.jpg, 2.jpg, 3.jpg and so on. If 1.jpg exists, then I want to try to download 2.jpg. And if that exists I will try the next, and so on.
But the current "getFile" returns a promise, so I can't loop through it. I thought I had solved it by adding await in front of the http.get method. But it looks like it doesn't wait for the callback method to finish. Is there a more elegant way to solve this than to wrap the whole thing in a new async method?
// this returns a promise
var result = getFile(url, fileToDownload);
const getFile = async (url, saveName) => {
try {
const file = fs.createWriteStream(saveName);
const request = await http.get(url, function(response) {
const { statusCode } = response;
if (statusCode === 200) {
response.pipe(file);
return true;
}
else
return false;
});
} catch (e) {
console.log(e);
return false;
}
}
I don't think your getFile method is returning a promise you can use, and there is no point in awaiting a callback. You should split the functionality into two parts:
- getFile, which fetches the file
- save, which saves the file if getFile returns something.
Try code like this:
const getFile = url => {
  return new Promise((resolve, reject) => {
    http.get(url, response => {
      const { statusCode } = response;
      if (statusCode === 200) {
        return resolve(response);
      }
      reject(null);
    });
  });
};

async function save(url, saveName) {
  const result = await getFile(url);
  if (result) {
    const file = fs.createWriteStream(saveName);
    result.pipe(file);
  }
}
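For the original goal (1.jpg, 2.jpg, ... until one is missing), a usage sketch on top of these helpers could look like this; baseUrl is a placeholder, and the loop stops as soon as getFile rejects for a missing file:
async function downloadAll(baseUrl) {
  for (let i = 1; ; i++) {
    try {
      await save(`${baseUrl}/${i}.jpg`, `${i}.jpg`);
    } catch (err) {
      break; // i.jpg was not found, stop here
    }
  }
}

// usage: await downloadAll('http://somesite.com/images');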
What you are trying to do is request the images in a sequential (sync-like) fashion.
Possible solutions:
You know the exact number of images you want to get: then go ahead with the "request" or "http" module and use a promise chain (see the sketch after this list).
You do not know the exact number of images, but will stop at image no. N-1 if N is not found: then go ahead with the sync-request module.
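A rough sketch of the promise-chain idea for a known number of images, reusing a promise-returning helper such as tryToGetFile from the answer below (the URL pattern and count are placeholders):
const count = 10; // you know how many images there are
let chain = Promise.resolve();
for (let i = 1; i <= count; i++) {
  // each .then() starts the next download only after the previous one resolved
  chain = chain.then(() => tryToGetFile(`http://somesite.com/${i}.jpg`, `${i}.jpg`));
}
chain.then(() => console.log('all done')).catch(console.error);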
Your getFile does return a promise, but only because it has the async keyword before it, and it's not the kind of promise you want. http.get uses old callback-style handling; luckily, it's easy to wrap it in a Promise to suit your needs:
const tryToGetFile = (url, saveName) => {
return new Promise((resolve) => {
http.get(url, response => {
if (response.statusCode === 200) {
const stream = fs.createWriteStream(saveName)
response.pipe(stream)
resolve(true);
} else {
// usually it is better to reject promise and propagate errors further
// but the function is called tryToGetFile as it expects that some file will not be available
// and this is not an error. Simply resolve to false
resolve(false);
}
})
})
}
const fileUrls = [
'somesite.file1.jpg',
'somesite.file2.jpg',
'somesite.file3.jpg',
'somesite.file4.jpg',
]
const downloadInSequence = async () => {
// using for..of instead of forEach to be able to pause
// downloadInSequence function execution while getting file
// can also use classic for
for (const fileUrl of fileUrls) {
const success = await tryToGetFile('http://' + fileUrl, fileUrl)
if (!success) {
// file with this name wasn't found
return;
}
}
}
This is a basic setup to show how to wrap http.get in a Promise and run it in sequence. Add error handling wherever you want. It's also worth noting that it will proceed to the next file as soon as it has received a 200 status code and started downloading, rather than waiting for the full download to finish before proceeding.
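If you do need each file to be fully written before moving on, one option is to resolve on the write stream's 'finish' event instead; a small tweak to tryToGetFile (same http and fs requires as before):
const tryToGetFile = (url, saveName) => {
  return new Promise((resolve, reject) => {
    http.get(url, (response) => {
      if (response.statusCode !== 200) {
        resolve(false); // file not available; not treated as an error here
        return;
      }
      const stream = fs.createWriteStream(saveName);
      response.pipe(stream);
      stream.on('finish', () => resolve(true)); // file fully written to disk
      stream.on('error', reject);
    });
  });
};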
I want to get the download content (buffer) and, soon after, store the data in my S3 account. So far I haven't been able to find a solution... Looking for examples on the web, I noticed there are a lot of people with this problem. I tried (unsuccessfully) to use the page.on("response") event to retrieve the raw response content, according to the following snippet:
const bucket = [];
await page.on("response", async response => {
const url = response.url();
if (
url ===
"https://the.earth.li/~sgtatham/putty/0.71/w32/putty-0.71-installer.msi"
) {
try {
if (response.status() === 200) {
bucket.push(await response.buffer());
console.log(bucket);
// I got the following: 'Protocol error (Network.getResponseBody): No resource with given identifier found'
}
} catch (err) {
console.error(err, "ERROR");
}
}
});
With the code above, I intended to detect the download dialog event and then, in some way, receive the binary content.
I'm not sure if that's the correct approach. I noticed that some people use a solution based on reading files; in other words, after the download finishes, they read the stored file from disk. There is a similar discussion at: https://github.com/GoogleChrome/puppeteer/issues/299.
My question is: Is there some way (using puppeteer), to intercept the download stream without having to save the file to disk before?
Thank you very much.
The problem is that the buffer is cleared as soon as any kind of navigation request happens. This might be a redirect or page reload in your case.
To solve this problem, you need to make sure that the page does not make any navigation requests as long as you have not finished downloading your resource. To do this we can use page.setRequestInterception.
There is a simple solution, which might get you started but might not always work, and a more complex solution to this problem.
Simple solution
This solution cancels any navigation request after the initial one. This means any reload or navigation on the page will not work, so the buffers of the resources are not cleared.
const browser = await puppeteer.launch();
const [page] = await browser.pages();
let initialRequest = true;
await page.setRequestInterception(true);
page.on('request', request => {
// cancel any navigation requests after the initial page.goto
if (request.isNavigationRequest() && !initialRequest) {
return request.abort();
}
initialRequest = false;
request.continue();
});
page.on('response', async (response) => {
if (response.url() === 'RESOURCE YOU WANT TO DOWNLOAD') {
const buffer = await response.buffer();
// handle buffer
}
});
await page.goto('...');
Advanced solution
The following code will process each request one after another. In case you download the buffer it will wait until the buffer is downloaded before processing the next request.
const browser = await puppeteer.launch();
const [page] = await browser.pages();
let paused = false;
let pausedRequests = [];
const nextRequest = () => { // continue the next request or "unpause"
if (pausedRequests.length === 0) {
paused = false;
} else {
// continue first request in "queue"
(pausedRequests.shift())(); // calls the request.continue function
}
};
await page.setRequestInterception(true);
page.on('request', request => {
if (paused) {
pausedRequests.push(() => request.continue());
} else {
paused = true; // pause, as we are processing a request now
request.continue();
}
});
page.on('requestfinished', async (request) => {
const response = await request.response();
if (response.url() === 'RESOURCE YOU WANT TO DOWNLOAD') {
const buffer = await response.buffer();
// handle buffer
}
nextRequest(); // continue with next request
});
page.on('requestfailed', nextRequest);
await page.goto('...');