How to catch a download with playwright? - node.js

I'm trying to download a file from a website using Playwright. The button that triggers the download runs some JavaScript and then the download starts.
Clicking the button with the .click function triggers a download, but it fails with an error: Failed - Download error.
I've tried using the DevTools protocol Page.setDownloadBehavior, but this doesn't seem to do anything.
const playwright = require("playwright");
const { /*chromium,*/ devices } = require("playwright");
const iPhone = devices["iPad (gen 7) landscape"];

(async () => {
  const my_chromium = playwright["chromium"];
  const browser = await my_chromium.launch({ headless: false });
  const context = await browser.newContext({
    viewport: iPhone.viewport,
    userAgent: iPhone.userAgent
  });
  const page = await context.newPage();
  const client = await browser.pageTarget(page).createCDPSession();
  console.log(client);
  await client.send("Page.setDownloadBehavior", {
    behavior: "allow",
    downloadPath: "C:/in"
  });
  // ...and so on
  await page.click("#download-button");
  browser.close();
})();
Full file here
There is a proposal for a better download API in Playwright, but I can't find the current API.
There was a suggestion that something to do with the downloadWillBegin event would be useful, but I've no idea how to access that from Playwright.
I'm open to the suggestion that I should use Puppeteer instead, but I moved to Playwright because I couldn't work out how to download a file with Puppeteer either, and the related issue suggested that the whole team had moved to Playwright.

Take a look at the page.on("download") event:
const browser = await playwright.chromium.launch({});
const context = await browser.newContext({ acceptDownloads: true });
const page = await context.newPage();
await page.goto("https://somedownloadpage.weburl");
await page.type("#password", password);
await page.click("text=Continue");
const download = await page.waitForEvent("download");
console.log("file downloaded to", await download.path());

Embarrassingly, I was closing the browser before the download had started.
It turns out that the download error was caused by the client section. However, removing it means I have no control over where the file is saved.
The download works with headless: false but not with headless: true.
If anyone has a better answer, that'd be great!

You can use waitForTimeout.
I tried with { headless: true } and await page.waitForTimeout(1000);
It's working fine; you can check the same here.

To download a file (and get its buffer) I highly recommend the Got node module. It's much easier, cleaner and lighter.
const got = require('got');

(async () => {
  const response = await got('https://sindresorhus.com')
    .on('downloadProgress', progress => {
      // Report download progress
    })
    .on('uploadProgress', progress => {
      // Report upload progress
    });
  console.log(response);
})();

Related

Blocking specific resources (css, images, videos, etc) using crawlee and playwright

I'm using crawlee#3.0.3 (not released yet, from github), and I'm trying to block specific resources from loading with playwrightUtils.blockRequests (which isn't available in previous versions). When I try the code suggested in the official repo, it works as expected:
import { launchPlaywright, playwrightUtils } from 'crawlee';

const browser = await launchPlaywright();
const page = await browser.newPage();

await playwrightUtils.blockRequests(page, {
  // extraUrlPatterns: ['adsbygoogle.js'],
});

await page.goto('https://cnn.com');
await page.screenshot({ path: 'cnn_no_images.png' });
await browser.close();
I can see that the images aren't loaded from the screenshot. My problem has to do with the fact that I'm using PlaywrightCrawler:
const crawler = new PlaywrightCrawler({
  maxRequestsPerCrawl: 3,
  async requestHandler({ page, request }) {
    console.log(`Processing: ${request.url}`);
    await playwrightUtils.blockRequests(page);
    await page.screenshot({ path: 'cnn_no_images2.png' });
  },
});
This way, I'm not able to block specific resources. My guess is that blockRequests needs launchPlaywright to work, and I don't see a way to pass that to PlaywrightCrawler. blockRequests has been available for Puppeteer, so maybe someone has tried this before.
Also, I've tried route interception, but again, I couldn't make it work with PlaywrightCrawler.
You can set any listeners or code before navigation by using preNavigationHooks, like this:
const crawler = new PlaywrightCrawler({
  maxRequestsPerCrawl: 3,
  preNavigationHooks: [async ({ page }) => {
    await playwrightUtils.blockRequests(page);
  }],
  async requestHandler({ page, request }) {
    console.log(`Processing: ${request.url}`);
    await page.screenshot({ path: 'cnn_no_images2.png' });
  },
});

How do I replace a response file in Puppeteer?

Puppeteer seems to leave some room for optimization. I managed to filter out all resources that were not relevant for my scraping to boost the speed a little bit. Now I see that Puppeteer is stuck on the main page's HTML file for a very long time before being able to continue.
So my idea was to just download (as a first example) the index.html file and make Puppeteer read it from my own storage. (In general this might not work for all files if the server serves them dynamically.)
const puppeteer = require('puppeteer');
const fs = require('fs');

(async () => {
  const browser = await puppeteer.launch({
    headless: true,
    userDataDir: './my/path'
  });
  const page = (await browser.pages())[0];
  await page.setRequestInterception(true);

  page.on('request', request => {
    if (request.url() === 'https://stackoverflow.com/') {
      request.respond({
        status: 200,
        contentType: 'text/html; charset=utf-8',
        body: '<html>Test</html>'
      });
      // request.abort();
    }
  });

  await page.goto('https://stackoverflow.com');
  await page.screenshot({ path: 'example.jpg' });
  await browser.close();
})();
But I get the following error: Error: Request is already handled!
How do I do it before the request is handled?

How to download file with puppeteer if page doesn't send any request

I'm using puppeteer to download a file from a site. I have only an element which I click, and the file just downloads.
I googled info about downloading files with puppeteer, but all of it is based on page.on('request', ...). That doesn't work for me because the page doesn't send any request:
page.on('request', arg => {
  console.log(arg.url())
})
In the terminal I only get "https://some-site/images/csv.gif", but that's just a .gif. How does this file download at all? If the browser doesn't make any request, does that mean the file is already on the client side?
Try (replace the variables at the top to fit your needs):
const puppeteer = require('puppeteer');

const url = 'your.url.com';
const buttonElementId = '#downloadButton';

(async () => {
  const browser = await puppeteer.launch({ headless: false });
  const page = await browser.newPage();
  await page.goto(url);
  await page.click(buttonElementId);
  await browser.close();
})();

Trying to crawl a website using puppeteer but getting a timeout error

I'm trying to search the Kwik Trip website for daily deals using Node.js, but I keep getting a timeout error when I try to crawl it. Not quite sure what could be happening. Does anyone know what may be going on?
Below is my code. I'm trying to wait for .agendaItemWrap to load before it brings back all of the HTML, because it's a SPA.
function getQuickStar(req, res) {
  (async () => {
    try {
      const browser = await puppeteer.launch();
      const page = await browser.newPage();
      const navigationPromise = page.waitForNavigation({ waitUntil: 'domcontentloaded' });
      await page.goto('https://www.kwiktrip.com/savings/daily-deals');
      await navigationPromise;
      await page.waitForSelector('.agendaItemWrap', { timeout: 30000 });
      const body = await page.evaluate(() => {
        return document.querySelector('body').innerHTML;
      });
      console.log(body);
      await browser.close();
    } catch (error) {
      console.log(error);
    }
  })();
}
Here's a link to the web page I'm trying to crawl https://www.kwiktrip.com/savings/daily-deals
It appears your desired selector is located inside an iframe, not in the page.mainFrame().
You then need to wait for the iframe, and perform the waitForSelector on that particular frame.
Quick tip: you don't need any page.waitForNavigation with a page.goto, because you can set the waitUntil condition in the options. By default it waits for the page's onLoad event.

Open Puppeteer with specific configuration (download PDF instead of PDF viewer)

I would like to open Chromium with a specific configuration.
I am looking for the configuration to activate the following option :
Settings => Site Settings => Permissions => PDF documents => "Download PDF files instead of automatically opening them in Chrome"
I searched the tags on this command line switch page, but the only parameter that deals with PDF is --print-to-pdf, which does not correspond to my need.
Do you have any ideas?
There is no option you can pass into Puppeteer to force PDF downloads. However, you can use the Chrome DevTools Protocol to add a content-disposition: attachment response header to force downloads.
I'll include a full example code below. In the example below, PDF files and XML files will be downloaded in headful mode.
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch({
    headless: false,
    defaultViewport: null,
  });
  const page = await browser.newPage();
  const client = await page.target().createCDPSession();

  await client.send('Fetch.enable', {
    patterns: [
      {
        urlPattern: '*',
        requestStage: 'Response',
      },
    ],
  });

  client.on('Fetch.requestPaused', async (reqEvent) => {
    const { requestId } = reqEvent;

    let responseHeaders = reqEvent.responseHeaders || [];
    let contentType = '';
    for (let elements of responseHeaders) {
      if (elements.name.toLowerCase() === 'content-type') {
        contentType = elements.value;
      }
    }

    if (contentType.endsWith('pdf') || contentType.endsWith('xml')) {
      responseHeaders.push({
        name: 'content-disposition',
        value: 'attachment',
      });

      const responseObj = await client.send('Fetch.getResponseBody', {
        requestId,
      });

      await client.send('Fetch.fulfillRequest', {
        requestId,
        responseCode: 200,
        responseHeaders,
        body: responseObj.body,
      });
    } else {
      await client.send('Fetch.continueRequest', { requestId });
    }
  });

  await page.goto('https://pdf-xml-download-test.vercel.app/');
  await page.waitFor(100000);
  await client.send('Fetch.disable');
  await browser.close();
})();
For a more detailed explanation, please refer to the Git repo I've set up with comments. It also includes example code for Playwright.
Puppeteer currently does not support navigating (or downloading) PDFs
in headless mode that easily. Quote from the docs for the page.goto function:
NOTE Headless mode doesn't support navigation to a PDF document. See the upstream issue.
What you can do though, is detect if the browser is navigating to the PDF file and then download it yourself via Node.js.
Code sample
const puppeteer = require('puppeteer');
const http = require('http');
const fs = require('fs');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();

  page.on('request', req => {
    if (req.url() === '...') {
      const file = fs.createWriteStream('./file.pdf');
      http.get(req.url(), response => response.pipe(file));
    }
  });

  await page.goto('...');
  await browser.close();
})();
This navigates to a URL and monitors the ongoing requests. If the "matched request" is found, Node.js will manually download the file via http.get and pipe it into file.pdf. Please be aware that this is a minimal working example. You want to catch errors when downloading and might also want to use something more sophisticated than http.get depending on the situation.
Future note
In the future, there might be an easier way to do it. When puppeteer will support response interception, you will be able to simply force the browser to download a document, but right now this is not supported (May 2019).
