Puppeteer seems to leave some room for optimization. I already managed to filter out all resources that are not relevant to my scraping, which boosted the speed a little. Now I see that Puppeteer is stuck on the main page's HTML file for a very long time before it can continue.
So my idea was, as a first example, to download the index.html file myself and make Puppeteer read it from my own storage. (In general this might not work for all files if the server serves them dynamically.)
const puppeteer = require('puppeteer');
const fs = require('fs');

(async () => {
  const browser = await puppeteer.launch({
    headless: true,
    userDataDir: './my/path'
  });
  const page = (await browser.pages())[0];

  await page.setRequestInterception(true);
  page.on('request', request => {
    if (request.url() === 'https://stackoverflow.com/') {
      request.respond({
        status: 200,
        contentType: 'text/html; charset=utf-8',
        body: '<html>Test</html>'
      });
      // request.abort();
    }
  });

  await page.goto('https://stackoverflow.com');
  await page.screenshot({ path: 'example.jpg' });
  await browser.close();
})();
But I get the following error: Error: Request is already handled!
How do I do it before the request is handled?
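For reference, here is a minimal sketch of the "serve index.html from my own storage" idea described above. It assumes the page has already been saved to ./cache/index.html (a made-up path) and that a newer Puppeteer version is used; the isInterceptResolutionHandled() guard and the explicit continue() for all other requests are additions that are not part of the original snippet.

const puppeteer = require('puppeteer');
const fs = require('fs');

(async () => {
  const browser = await puppeteer.launch({ headless: true });
  const page = (await browser.pages())[0];

  // index.html saved beforehand; the path is only an example
  const cachedHtml = fs.readFileSync('./cache/index.html', 'utf8');

  await page.setRequestInterception(true);
  page.on('request', request => {
    // in newer Puppeteer versions, skip requests that were already resolved elsewhere
    if (request.isInterceptResolutionHandled()) return;

    if (request.url() === 'https://stackoverflow.com/') {
      request.respond({
        status: 200,
        contentType: 'text/html; charset=utf-8',
        body: cachedHtml
      });
    } else {
      request.continue(); // let every other request through
    }
  });

  await page.goto('https://stackoverflow.com');
  await page.screenshot({ path: 'example.jpg' });
  await browser.close();
})();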
Related
I'm using crawlee#3.0.3 (not released yet, installed from GitHub), and I'm trying to block specific resources from loading with playwrightUtils.blockRequests (which isn't available in previous versions). When I try the code suggested in the official repo, it works as expected:
import { launchPlaywright, playwrightUtils } from 'crawlee';

const browser = await launchPlaywright();
const page = await browser.newPage();

await playwrightUtils.blockRequests(page, {
  // extraUrlPatterns: ['adsbygoogle.js'],
});

await page.goto('https://cnn.com');
await page.screenshot({ path: 'cnn_no_images.png' });
await browser.close();
I can see that the images aren't loaded from the screenshot. My problem has to do with the fact that I'm using PlaywrightCrawler:
const crawler = new PlaywrightCrawler({
  maxRequestsPerCrawl: 3,
  async requestHandler({ page, request }) {
    console.log(`Processing: ${request.url}`);
    await playwrightUtils.blockRequests(page);
    await page.screenshot({ path: 'cnn_no_images2.png' });
  },
});
This way, I'm not able to block specific resources. My guess is that blockRequests needs launchPlaywright to work, and I don't see a way to pass that to PlaywrightCrawler. blockRequests has been available for Puppeteer for a while, so maybe someone has tried this before.
Also, I've tried "route interception", but again, I couldn't make it work with PlaywrightCrawler.
You can register listeners or run any other code before navigation by using preNavigationHooks, like this:
const crawler = new PlaywrightCrawler({
  maxRequestsPerCrawl: 3,
  preNavigationHooks: [async ({ page }) => {
    await playwrightUtils.blockRequests(page);
  }],
  async requestHandler({ page, request }) {
    console.log(`Processing: ${request.url}`);
    await page.screenshot({ path: 'cnn_no_images2.png' });
  },
});
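If you also need to block specific resources beyond the defaults (which was the original goal), the same options object from the standalone example should be accepted inside the hook as well; a small sketch, where the URL pattern is only an illustration:

const crawler = new PlaywrightCrawler({
  maxRequestsPerCrawl: 3,
  preNavigationHooks: [async ({ page }) => {
    // pass extra patterns on top of blockRequests' defaults
    await playwrightUtils.blockRequests(page, {
      extraUrlPatterns: ['adsbygoogle.js'], // example pattern, adjust to your needs
    });
  }],
  async requestHandler({ page, request }) {
    console.log(`Processing: ${request.url}`);
    await page.screenshot({ path: 'cnn_no_images2.png' });
  },
});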
Using Puppeteer, I'd like to load a URL in Chrome and capture the following information:
request URL
request headers
request post data
response headers text (including duplicate headers like set-cookie)
transferred response size (i.e. compressed size)
full response body
Capturing the full response body is what causes the problems for me.
Things I've tried:
Getting response content with response.buffer - this does not work if there are redirects at any point, since buffers are wiped on navigation
Intercepting requests and using getResponseBodyForInterception - this means I can no longer access the encodedLength, and I also had problems getting the correct request and response headers in some cases
Using a local proxy works, but this slowed down page load times significantly (and also changed some behavior for e.g. certificate errors)
Ideally the solution should only have a minor performance impact and have no functional differences from loading a page normally. I would also like to avoid forking Chrome.
You can enable request interception with page.setRequestInterception() and then, inside page.on('request'), use the request-promise-native module as a middleman to gather the response data before continuing the request with request.continue() in Puppeteer.
Here's a full working example:
'use strict';

const puppeteer = require('puppeteer');
const request_client = require('request-promise-native');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  const result = [];

  await page.setRequestInterception(true);
  page.on('request', request => {
    request_client({
      uri: request.url(),
      resolveWithFullResponse: true,
    }).then(response => {
      const request_url = request.url();
      const request_headers = request.headers();
      const request_post_data = request.postData();
      const response_headers = response.headers;
      const response_size = response_headers['content-length'];
      const response_body = response.body;

      result.push({
        request_url,
        request_headers,
        request_post_data,
        response_headers,
        response_size,
        response_body,
      });

      console.log(result);
      request.continue();
    }).catch(error => {
      console.error(error);
      request.abort();
    });
  });

  await page.goto('https://example.com/', {
    waitUntil: 'networkidle0',
  });

  await browser.close();
})();
Puppeteer-only solution
This can be done with Puppeteer alone. The problem you are describing, that response.buffer is cleared on navigation, can be circumvented by processing each request one after another.
How it works
The code below uses page.setRequestInterception to intercept all requests. If a request is currently being processed or waited for, new requests are put into a queue. Because there are no parallel requests, response.buffer() can then be used without other requests asynchronously wiping the buffer. As soon as the current request/response pair has been handled, the next request is processed.
Code
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const [page] = await browser.pages();

  const results = []; // collects all results

  let paused = false;
  let pausedRequests = [];

  const nextRequest = () => { // continue the next request or "unpause"
    if (pausedRequests.length === 0) {
      paused = false;
    } else {
      // continue first request in "queue"
      (pausedRequests.shift())(); // calls the request.continue function
    }
  };

  await page.setRequestInterception(true);
  page.on('request', request => {
    if (paused) {
      pausedRequests.push(() => request.continue());
    } else {
      paused = true; // pause, as we are processing a request now
      request.continue();
    }
  });

  page.on('requestfinished', async (request) => {
    const response = await request.response();

    const responseHeaders = response.headers();
    let responseBody;
    if (request.redirectChain().length === 0) {
      // body can only be accessed for non-redirect responses
      responseBody = await response.buffer();
    }

    const information = {
      url: request.url(),
      requestHeaders: request.headers(),
      requestPostData: request.postData(),
      responseHeaders: responseHeaders,
      responseSize: responseHeaders['content-length'],
      responseBody,
    };
    results.push(information);

    nextRequest(); // continue with next request
  });

  page.on('requestfailed', (request) => {
    // handle failed request
    nextRequest();
  });

  await page.goto('...', { waitUntil: 'networkidle0' });
  console.log(results);

  await browser.close();
})();
I would suggest looking for a fast proxy server that can write request logs together with the actual content.
The target setup is to let the proxy server simply write a log file, and then analyze the log for the information you need.
Don't intercept requests while the proxy is working (this will slow things down).
The performance issues you may encounter with the proxy-as-logger setup are mostly related to TLS support; make sure the proxy allows a quick TLS handshake and supports HTTP/2.
For example, Squid benchmarks show that it can process hundreds of requests per second, which should be enough for testing purposes.
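To point the browser at such a proxy, Chromium's --proxy-server switch can be passed through Puppeteer's launch args. A minimal sketch, assuming a logging proxy (for example Squid) is already listening locally on port 3128 (the address is an assumption, not part of the original suggestion):

const puppeteer = require('puppeteer');

(async () => {
  // route all browser traffic through the local logging proxy
  const browser = await puppeteer.launch({
    args: ['--proxy-server=http://127.0.0.1:3128'], // assumed proxy address
  });
  const page = await browser.newPage();

  await page.goto('https://example.com/', { waitUntil: 'networkidle0' });
  // if the proxy is configured to log bodies, its log now holds the request/response records

  await browser.close();
})();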
I would suggest using a tool called Fiddler. It will capture all of the information you mentioned when you load a URL.
Here's my workaround which I hope will help others.
I had issues with the await page.setRequestInterception(True) call blocking the flow and making the page hang until it timed out.
So I added this function:
async def request_interception(req):
    """ await page.setRequestInterception(True) would block the flow, the interception is enabled individually """
    # enable interception
    req.__setattr__('_allowInterception', True)
    if req.url.startswith('http'):
        print(f"\nreq.url: {req.url}")
        print(f" req.resourceType: {req.resourceType}")
        print(f" req.method: {req.method}")
        print(f" req.postData: {req.postData}")
        print(f" req.headers: {req.headers}")
        print(f" req.response: {req.response}")
    return await req.continue_()
Then I removed the await page.setRequestInterception(True) call and registered the function above with
page.on('request', lambda req: asyncio.ensure_future(request_interception(req)))
in my main().
Without the req.__setattr__('_allowInterception', True) statement, Pyppeteer would complain that interception is not enabled for some requests, but with it everything works fine for me.
Just in case someone is interested in the environment I'm running Pyppeteer on:
Ubuntu 20.04
Python 3.7 (venv)
...
pyee==8.1.0
pyppeteer==0.2.5
python-dateutil==2.8.1
requests==2.25.1
urllib3==1.26.3
websockets==8.1
...
I also posted the solution at https://github.com/pyppeteer/pyppeteer/issues/198
Cheers
In Chrome, press F12 and go to the "Network" tab; there you can see all the HTTP requests the website sends, and you'll be able to see the details you mentioned.
I'm using Puppeteer to download a file from a site. There is only an element which I click, and the file just downloads.
I googled for info about downloading files with Puppeteer, but all the solutions are based on page.on('request', ...). That doesn't work for me because the page doesn't seem to send any request:
page.on('request', arg => {
  console.log(arg.url())
})
In the terminal I only get "https://some-site/images/csv.gif", but that's just a .gif. How does this file get downloaded at all? If the browser doesn't make any request, does that mean the file is already on the client side?
Try this (replace the variables at the top to fit your needs):
const puppeteer = require('puppeteer');

const url = 'your.url.com';
const buttonElementId = '#downloadButton';

(async () => {
  const browser = await puppeteer.launch({ headless: false });
  const page = await browser.newPage();
  await page.goto(url);
  await page.click(buttonElementId);
  await browser.close();
})();
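Note that the snippet above closes the browser right after the click, which can cut the download short, and it does not control where the file is saved. One common addition (not part of the original answer; the ./downloads path and the fixed wait are only examples) is to set the download directory through a CDP session before clicking:

const puppeteer = require('puppeteer');

const url = 'your.url.com';
const buttonElementId = '#downloadButton';

(async () => {
  const browser = await puppeteer.launch({ headless: false });
  const page = await browser.newPage();

  // tell Chromium to save downloads into a known folder (example path)
  const client = await page.target().createCDPSession();
  await client.send('Page.setDownloadBehavior', {
    behavior: 'allow',
    downloadPath: './downloads',
  });

  await page.goto(url);
  await page.click(buttonElementId);

  // give the download some time to finish before closing (simple fixed wait)
  await new Promise(resolve => setTimeout(resolve, 10000));
  await browser.close();
})();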
I'm trying to scrape the Kwik Trip website for daily deals using Node.js, but I keep getting a timeout error when I try to crawl it. I'm not quite sure what's happening. Does anyone know what may be going on?
Below is my code. I'm trying to wait for .agendaItemWrap to load before bringing back all of the HTML, because it's a SPA.
function getQuickStar(req, res) {
  (async () => {
    try {
      const browser = await puppeteer.launch();
      const page = await browser.newPage();

      const navigationPromise = page.waitForNavigation({ waitUntil: "domcontentloaded" });
      await page.goto('https://www.kwiktrip.com/savings/daily-deals');
      await navigationPromise;

      await page.waitForSelector('.agendaItemWrap', { timeout: 30000 });

      const body = await page.evaluate(() => {
        return document.querySelector('body').innerHTML;
      });
      console.log(body);

      await browser.close();
    } catch (error) {
      console.log(error);
    }
  })();
}
Here's a link to the web page I'm trying to crawl https://www.kwiktrip.com/savings/daily-deals
It appears your desired selector is located inside an iframe, not in the page's main frame.
You then need to wait for that iframe and perform the waitForSelector on this particular frame.
Quick tip: you don't need a page.waitForNavigation together with page.goto, because you can set the waitUntil condition in the options. By default it waits for the page load event.
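A minimal sketch of that idea, assuming the deals widget sits in the first iframe on the page (the plain 'iframe' selector is an assumption; adjust it to the real frame):

const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();

  await page.goto('https://www.kwiktrip.com/savings/daily-deals', { waitUntil: 'domcontentloaded' });

  // wait for the iframe element itself, then resolve its content frame
  const iframeHandle = await page.waitForSelector('iframe'); // assumes the first iframe is the right one
  const frame = await iframeHandle.contentFrame();

  // now wait for the selector inside the frame, not on the main page
  await frame.waitForSelector('.agendaItemWrap', { timeout: 30000 });
  const body = await frame.evaluate(() => document.querySelector('body').innerHTML);
  console.log(body);

  await browser.close();
})();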
I would like to open Chromium with a specific configuration.
I am looking for the configuration that activates the following option:
Settings => Site Settings => Permissions => PDF documents => "Download PDF files instead of automatically openning them in Chrome"
I searched the switches listed on this command-line switch page, but the only parameter that deals with PDFs is --print-to-pdf, which does not correspond to my need.
Do you have any ideas?
There is no option you can pass into Puppeteer to force PDF downloads. However, you can use the Chrome DevTools Protocol to add a content-disposition: attachment response header, which forces downloads.
I'll include full example code below. In the example, PDF files and XML files will be downloaded in headful mode.
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch({
    headless: false,
    defaultViewport: null,
  });

  const page = await browser.newPage();
  const client = await page.target().createCDPSession();

  await client.send('Fetch.enable', {
    patterns: [
      {
        urlPattern: '*',
        requestStage: 'Response',
      },
    ],
  });

  await client.on('Fetch.requestPaused', async (reqEvent) => {
    const { requestId } = reqEvent;

    let responseHeaders = reqEvent.responseHeaders || [];
    let contentType = '';

    for (let elements of responseHeaders) {
      if (elements.name.toLowerCase() === 'content-type') {
        contentType = elements.value;
      }
    }

    if (contentType.endsWith('pdf') || contentType.endsWith('xml')) {
      responseHeaders.push({
        name: 'content-disposition',
        value: 'attachment',
      });

      const responseObj = await client.send('Fetch.getResponseBody', {
        requestId,
      });

      await client.send('Fetch.fulfillRequest', {
        requestId,
        responseCode: 200,
        responseHeaders,
        body: responseObj.body,
      });
    } else {
      await client.send('Fetch.continueRequest', { requestId });
    }
  });

  await page.goto('https://pdf-xml-download-test.vercel.app/');

  await page.waitFor(100000);
  await client.send('Fetch.disable');
  await browser.close();
})();
For a more detailed explanation, please refer to the Git repo I've set up with comments. It also includes example code for Playwright.
Puppeteer currently does not support navigating (or downloading) PDFs
in headless mode that easily. Quote from the docs for the page.goto function:
NOTE Headless mode doesn't support navigation to a PDF document. See the upstream issue.
What you can do though, is detect if the browser is navigating to the PDF file and then download it yourself via Node.js.
Code sample
const puppeteer = require('puppeteer');
const http = require('http');
const fs = require('fs');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();

  page.on('request', req => {
    if (req.url() === '...') {
      const file = fs.createWriteStream('./file.pdf');
      http.get(req.url(), response => response.pipe(file));
    }
  });

  await page.goto('...');
  await browser.close();
})();
This navigates to a URL and monitors the ongoing requests. If the "matched request" is found, Node.js manually downloads the file via http.get and pipes it into file.pdf. Please be aware that this is a minimal working example. You will want to catch errors while downloading and might also want to use something more sophisticated than http.get, depending on the situation.
Future note
In the future, there might be an easier way to do it. Once Puppeteer supports response interception, you will be able to simply force the browser to download a document, but right now this is not supported (May 2019).