Blocking specific resources (CSS, images, videos, etc.) using Crawlee and Playwright - Node.js

I'm using crawlee#3.0.3 (not released yet, installed from GitHub), and I'm trying to block specific resources from loading with playwrightUtils.blockRequests (which isn't available in previous versions). When I try the code suggested in the official repo, it works as expected:
import { launchPlaywright, playwrightUtils } from 'crawlee';

const browser = await launchPlaywright();
const page = await browser.newPage();
await playwrightUtils.blockRequests(page, {
    // extraUrlPatterns: ['adsbygoogle.js'],
});
await page.goto('https://cnn.com');
await page.screenshot({ path: 'cnn_no_images.png' });
await browser.close();
I can see from the screenshot that the images aren't loaded. My problem has to do with the fact that I'm using PlaywrightCrawler:
const crawler = new PlaywrightCrawler({
    maxRequestsPerCrawl: 3,
    async requestHandler({ page, request }) {
        console.log(`Processing: ${request.url}`);
        await playwrightUtils.blockRequests(page);
        await page.screenshot({ path: 'cnn_no_images2.png' });
    },
});
This way, I'm not able to block specific resources. My guess is that blockRequests needs launchPlaywright to work, and I don't see a way to pass that to PlaywrightCrawler. blockRequests has been available for Puppeteer for a while, so maybe someone has tried this before.
I've also tried route interception, but again, I couldn't make it work with PlaywrightCrawler.

You can set up any listeners or run code before navigation by using preNavigationHooks, like this:
const crawler = new PlaywrightCrawler({
    maxRequestsPerCrawl: 3,
    preNavigationHooks: [async ({ page }) => {
        await playwrightUtils.blockRequests(page);
    }],
    async requestHandler({ page, request }) {
        console.log(`Processing: ${request.url}`);
        await page.screenshot({ path: 'cnn_no_images2.png' });
    },
});
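If you also want the extraUrlPatterns option from the first snippet, the same options object can be passed inside the hook. A minimal sketch (the pattern shown is the placeholder from the question; my understanding is that extraUrlPatterns appends to the default blocked patterns, while urlPatterns would replace them):
const crawler = new PlaywrightCrawler({
    maxRequestsPerCrawl: 3,
    preNavigationHooks: [async ({ page }) => {
        // extraUrlPatterns appends to the defaults;
        // urlPatterns would replace them entirely
        await playwrightUtils.blockRequests(page, {
            extraUrlPatterns: ['adsbygoogle.js'],
        });
    }],
    // ...
});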

Related

How do I replace a response file in Puppeteer?

Puppeteer seems to leave some room for optimization. I managed to filter out all resources that were not relevant to my scraping to boost the speed a little. Now I see that Puppeteer is stuck on the main page's HTML file for a very long time before being able to continue.
So my idea was to just download (as a first example) the index.html file and make Puppeteer read it from my own storage. (In general this might not work for all files if the server serves them dynamically.)
const puppeteer = require('puppeteer');
const fs = require('fs');

(async () => {
    const browser = await puppeteer.launch({
        headless: true,
        userDataDir: './my/path'
    });
    const page = (await browser.pages())[0];
    await page.setRequestInterception(true);
    page.on('request', request => {
        if (request.url() === 'https://stackoverflow.com/') {
            request.respond({
                status: 200,
                contentType: 'text/html; charset=utf-8',
                body: '<html>Test</html>'
            });
            //request.abort();
        }
    });
    await page.goto('https://stackoverflow.com');
    await page.screenshot({ path: 'example.jpg' });
    await browser.close();
})();
But I get the following error: Error: Request is already handled!
How do I do it before the request is handled?
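One possible cause (an assumption, since it depends on what else touches the request): with userDataDir set, the cache or another handler may resolve the request before your respond call runs. Newer Puppeteer versions offer a cooperative intercept mode, where a handler can first check whether the request is already settled. A sketch:
page.on('request', request => {
    // Skip requests that another handler (or the cache) already resolved
    if (request.isInterceptResolutionHandled()) return;
    if (request.url() === 'https://stackoverflow.com/') {
        request.respond({
            status: 200,
            contentType: 'text/html; charset=utf-8',
            body: '<html>Test</html>'
        });
    } else {
        // With interception enabled, unhandled requests must be continued
        request.continue();
    }
});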

Trying to crawl a website using Puppeteer but getting a timeout error

I'm trying to search the Kwik Trip website for daily deals using Node.js, but I keep getting a timeout error when I try to crawl it. I'm not quite sure what could be happening. Does anyone know what may be going on?
Below is my code. I'm trying to wait for .agendaItemWrap to load before it brings back all of the HTML, because it's a SPA.
function getQuickStar(req, res) {
    (async () => {
        try {
            const browser = await puppeteer.launch();
            const page = await browser.newPage();
            const navigationPromise = page.waitForNavigation({ waitUntil: "domcontentloaded" });
            await page.goto('https://www.kwiktrip.com/savings/daily-deals');
            await navigationPromise;
            await page.waitForSelector('.agendaItemWrap', { timeout: 30000 });
            const body = await page.evaluate(() => {
                return document.querySelector('body').innerHTML;
            });
            console.log(body);
            await browser.close();
        } catch (error) {
            console.log(error);
        }
    })();
}
Here's a link to the web page I'm trying to crawl https://www.kwiktrip.com/savings/daily-deals
It appears your desired selector is located inside an iframe, not in the page's main frame.
You then need to wait for your iframe and perform the waitForSelector on that particular frame.
Quick tip: you don't need any page.waitForNavigation with a page.goto, because you can set the waitUntil condition in its options. By default it waits for the page's onLoad event.
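A minimal sketch of that approach, assuming the content sits in the first iframe on the page (adjust the iframe selector to the real one):
await page.goto('https://www.kwiktrip.com/savings/daily-deals', { waitUntil: 'domcontentloaded' });
// Wait for the iframe element itself, then resolve its content frame
const frameHandle = await page.waitForSelector('iframe');
const frame = await frameHandle.contentFrame();
// Perform the waitForSelector on the iframe, not on the main frame
await frame.waitForSelector('.agendaItemWrap', { timeout: 30000 });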

How to catch a download with Playwright?

I'm trying to download a file from a website using Playwright. The button that triggers the download does some js and then the download starts.
Clicking the button using the .click function triggers a download, but it shows an error: Failed - Download error.
I've tried using the devtools protocol Page.setDownloadBehavior, but this doesn't seem to do anything.
const playwright = require("playwright");
const { /*chromium,*/ devices } = require("playwright");
const iPhone = devices["iPad (gen 7) landscape"];

(async () => {
    const my_chromium = playwright["chromium"];
    const browser = await my_chromium.launch({ headless: false });
    const context = await browser.newContext({
        viewport: iPhone.viewport,
        userAgent: iPhone.userAgent
    });
    const page = await context.newPage();
    const client = await browser.pageTarget(page).createCDPSession();
    console.log(client);
    await client.send("Page.setDownloadBehavior", {
        behavior: "allow",
        downloadPath: "C:/in"
    });
    //...and so on
    await page.click("#download-button");
    browser.close();
})();
Full file here
There is a proposal for a better download API in Playwright, but I can't find one in the current API.
There was a suggestion that something to do with the downloadWillBegin event would be useful, but I've no idea how to access that from Playwright.
I'm open to the suggestion that I should use Puppeteer instead, but I moved to Playwright because I couldn't work out how to download a file with Puppeteer either, and the related issue suggested that the whole team had moved to Playwright.
Take a look at page.on("download"):
const browser = await playwright.chromium.launch({});
const context = await browser.newContext({ acceptDownloads: true });
const page = await context.newPage();
await page.goto("https://somedownloadpage.weburl");
await page.type("#password", password);
// Start waiting for the download before clicking, to avoid missing the event
const [download] = await Promise.all([
    page.waitForEvent("download"),
    page.click("text=Continue"),
]);
console.log("file downloaded to", await download.path());
Embarrassingly, I was closing the browser before the download had started.
It turns out that the download error was caused by the client section. However, that means I have no control over where the file is saved.
The download works when headless: false but not when headless: true.
If anyone has a better answer, that'd be great!
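On controlling where the file is saved: with acceptDownloads enabled, Playwright's download object also exposes saveAs, so you can copy the file wherever you want once the event fires. A sketch (the target directory is an assumption):
const download = await page.waitForEvent('download');
// Copy the downloaded file to a directory you control
await download.saveAs('C:/in/' + download.suggestedFilename());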
You can use waitForTimeout. I tried with { headless: true } and await page.waitForTimeout(1000), and it works fine.
To download a file (and also get its buffer), I highly recommend the got Node module. It's much easier, cleaner, and lighter.
const got = require('got');

(async () => {
    const response = await got('https://sindresorhus.com')
        .on('downloadProgress', progress => {
            // Report download progress
        })
        .on('uploadProgress', progress => {
            // Report upload progress
        });
    console.log(response);
})();
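If the goal is writing the file to disk rather than inspecting the response, got also exposes a stream API that pipes straight into a file. A sketch with a placeholder URL and filename:
const fs = require('fs');
const got = require('got');

// Stream the response body directly to disk instead of buffering it
got.stream('https://example.com/file.pdf')
    .pipe(fs.createWriteStream('file.pdf'));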

Open Puppeteer with specific configuration (download PDF instead of PDF viewer)

I would like to open Chromium with a specific configuration.
I am looking for the configuration that activates the following option:
Settings => Site Settings => Permissions => PDF documents => "Download PDF files instead of automatically opening them in Chrome"
I searched this command-line switch page, but the only parameter that deals with PDFs is --print-to-pdf, which does not correspond to my need.
Do you have any ideas?
There is no option you can pass into Puppeteer to force PDF downloads. However, you can use the Chrome DevTools Protocol to add a content-disposition: attachment response header to force downloads.
I'll include full example code below, in which PDF and XML files are downloaded in headful mode.
const puppeteer = require('puppeteer');

(async () => {
    const browser = await puppeteer.launch({
        headless: false,
        defaultViewport: null,
    });
    const page = await browser.newPage();
    const client = await page.target().createCDPSession();

    // Intercept all responses
    await client.send('Fetch.enable', {
        patterns: [
            {
                urlPattern: '*',
                requestStage: 'Response',
            },
        ],
    });

    client.on('Fetch.requestPaused', async (reqEvent) => {
        const { requestId } = reqEvent;
        let responseHeaders = reqEvent.responseHeaders || [];
        let contentType = '';

        // Find the content-type of the paused response
        for (let elements of responseHeaders) {
            if (elements.name.toLowerCase() === 'content-type') {
                contentType = elements.value;
            }
        }

        if (contentType.endsWith('pdf') || contentType.endsWith('xml')) {
            // Force the browser to download instead of rendering inline
            responseHeaders.push({
                name: 'content-disposition',
                value: 'attachment',
            });
            const responseObj = await client.send('Fetch.getResponseBody', {
                requestId,
            });
            await client.send('Fetch.fulfillRequest', {
                requestId,
                responseCode: 200,
                responseHeaders,
                body: responseObj.body,
            });
        } else {
            await client.send('Fetch.continueRequest', { requestId });
        }
    });

    await page.goto('https://pdf-xml-download-test.vercel.app/');
    // Leave the page open long enough for the downloads to complete
    await page.waitFor(100000);

    await client.send('Fetch.disable');
    await browser.close();
})();
For a more detailed explanation, please refer to the Git repo I've set up with comments. It also includes example code for Playwright.
Puppeteer currently does not support navigating (or downloading) PDFs
in headless mode that easily. Quote from the docs for the page.goto function:
NOTE Headless mode doesn't support navigation to a PDF document. See the upstream issue.
What you can do, though, is detect whether the browser is navigating to the PDF file and then download it yourself via Node.js.
Code sample
const puppeteer = require('puppeteer');
const http = require('http');
const fs = require('fs');

(async () => {
    const browser = await puppeteer.launch();
    const page = await browser.newPage();
    page.on('request', req => {
        if (req.url() === '...') {
            const file = fs.createWriteStream('./file.pdf');
            http.get(req.url(), response => response.pipe(file));
        }
    });
    await page.goto('...');
    await browser.close();
})();
This navigates to a URL and monitors the ongoing requests. If the matched request is found, Node.js will manually download the file via http.get and pipe it into file.pdf. Please be aware that this is a minimal working example. You will want to catch errors when downloading, and you might also want to use something more sophisticated than http.get, depending on the situation.
Future note
In the future, there might be an easier way to do it. Once Puppeteer supports response interception, you will be able to simply force the browser to download a document, but right now this is not supported (May 2019).

Puppeteer screenshot stops on random URLs

I have a Node.js service running on Ubuntu that uses Puppeteer to take screenshots of pages, but on some random URLs the page.screenshot({ fullPage: true, type: 'jpeg' }) method doesn't work, and no errors are displayed in the log. The code is:
async takeScreenshot() {
    console.log('trying take Screenshot [...]');
    let image = await this.page.screenshot({ fullPage: true, type: 'jpeg' });
    console.log('Completed!');
    return image;
}
An example of a page where I had this problem is: https://nuevo.jumbo.cl/mayonesa-hellmanns-751-g-supreme-light/p
It seems you are not waiting until the page is fully loaded before you take the screenshot. You need 'waitUntil':
const page = await browser.newPage();
await page.goto(inputImgUrl, { waitUntil: 'networkidle0' });
So the whole thing should look something like this:
(async () => {
    const browser = await puppeteer.launch();
    const page = await browser.newPage();
    await page.goto(inputImgUrl, { waitUntil: 'networkidle0' });
    await page.setViewport(viewPortData);
    await page.screenshot({ path: outputFile, type: 'jpeg', quality: 50, clip: cropData });
    await browser.close();
})();
You did not explain how it failed or what you expected to capture, but most likely the screenshot shows the page before its content has loaded.
When you first load the page, it initially renders a loading state. That is not related to Puppeteer: you are not telling the code to wait for the content, or if you are, I don't see it specified anywhere.
For me, a plain screenshot worked, apart from the region settings. I used version 0.13 and the following code:
await page.goto("https://nuevo.jumbo.cl/mayonesa-hellmanns-751-g-supreme-light/p", {waitUntil: "networkidle0"});
await page.screenshot({
path: "test_google.png",
fullPage: true
});
});
