Puppeteer not rendering images stored on AWS S3 - node.js

I am using Puppeteer on Node.js to render a page as a PDF document. However, it seems that images hosted on AWS S3 cannot be loaded, whereas images stored locally on the server itself load fine.
I tried both waitUntil: "networkidle0" and waitUntil: "networkidle2", but it still does not work. I also tried adding printBackground: true.
The images load perfectly fine on the page itself.
However, in the PDF generated by Puppeteer, the images do not load.
This is my code:
(async () => {
  const browser = await puppeteer.launch({
    args: ["--no-sandbox"]
  });
  const page = await browser.newPage();
  await page.setExtraHTTPHeaders({
    authorization: req.session.token
  });
  await page.goto(config.url + "/project/download/" + permalink, {
    waitUntil: "networkidle0"
  });
  const buffer = await page.pdf({
    filename: permalink + "_ProjectBrief" + ".pdf",
    format: "A4",
    margin: {
      top: 60,
      bottom: 60,
      left: 60,
      right: 50
    },
  });
  res.type("application/pdf");
  res.send(buffer);
  await browser.close();
})();
Any idea what I should do to get over this issue?
Thanks in advance!

I think I solved the issue.
After adding headless: false to
const browser = await puppeteer.launch({
  args: ["--no-sandbox"]
});
I realised that the images did not load because of a 400 error. My hypothesis is that Chromium did not have enough time to download the images and thus threw this error.
What I did was edit the HTML file that I want Puppeteer to render, adding this code to it:
data.ProjectImages.forEach(e => {
  window.open(e.imageUrl, '_blank');
});
What this does is open each image via its URL in a new tab. This ensures that the images are downloaded by the Chromium instance (Puppeteer runs Chromium).
The rendered PDF can now display the images.
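As an alternative to opening each image in a new tab, a sketch of another approach (not from the original answer) is to wait until every <img> element on the page has finished loading before calling page.pdf():
await page.evaluate(async () => {
  // Resolve once every <img> on the page has either loaded or errored.
  const images = Array.from(document.querySelectorAll("img"));
  await Promise.all(images.map(img => img.complete
    ? Promise.resolve()
    : new Promise(resolve => {
        img.addEventListener("load", resolve);
        img.addEventListener("error", resolve);
      })));
});
const buffer = await page.pdf({ format: "A4", printBackground: true });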

Related

puppeteer hangs while generating large pdf

I have a PDF generation API which takes JSON data, creates an HTML page using EJS templates, and generates a PDF. For a 45-page PDF it takes around 2 minutes. When I try it with a large amount of data that results in 450+ pages, it just hangs. Previously it used to throw TimeoutError: waiting for Page.printToPDF failed: timeout 30000ms exceeded, but I added the timeout: 0 option. Now it just hangs with no response. I tried looking for online solutions, with no luck.
Here's the code.
ejs.renderFile(filePath, { data }, async (err, html) => {
  if (err) {
    // error handler
  } else {
    const browser = await puppeteer.launch({
      headless: true,
      args: ["--no-sandbox", "--disable-dev-shm-usage", "--disable-gpu"]
    });
    const page = await browser.newPage();
    await page.setContent(html, { waitUntil: 'domcontentloaded' });
    const fileStream = await page.pdf({
      displayHeaderFooter: true,
      footerTemplate:
        '<div class="footer" style="font-size: 10px; color: #000; margin: 10px auto; clear: both; position: relative;"><span class="pageNumber"></span></div>',
      margin: { top: 30, bottom: 50 },
    });
    await browser.close();
    res.send(fileStream);
  }
});
I added some console logs. It gets stuck on page.pdf().
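Two knobs that are commonly raised for very large print jobs are the timeout on page.pdf() itself and Puppeteer's protocolTimeout launch option (available in recent Puppeteer versions). A rough sketch with illustrative values:
// Sketch: raise the CDP protocol timeout and disable the page.pdf timeout
// so a 450+ page print job is not cut off (values are illustrative).
const browser = await puppeteer.launch({
  headless: true,
  args: ["--no-sandbox", "--disable-dev-shm-usage", "--disable-gpu"],
  protocolTimeout: 600_000 // 10 minutes for individual CDP calls
});
const page = await browser.newPage();
await page.setContent(html, { waitUntil: "domcontentloaded" });
const buffer = await page.pdf({
  displayHeaderFooter: true,
  margin: { top: 30, bottom: 50 },
  timeout: 0 // no per-call limit for Page.printToPDF
});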

Blocking specific resources (css, images, videos, etc) using crawlee and playwright

I'm using crawlee#3.0.3 (not released yet, from github), and I'm trying to block specific resources from loading with playwrightUtils.blockRequests (which isn't available in previous versions). When I try the code suggested in the official repo, it works as expected:
import { launchPlaywright, playwrightUtils } from 'crawlee';

const browser = await launchPlaywright();
const page = await browser.newPage();
await playwrightUtils.blockRequests(page, {
  // extraUrlPatterns: ['adsbygoogle.js'],
});
await page.goto('https://cnn.com');
await page.screenshot({ path: 'cnn_no_images.png' });
await browser.close();
I can see that the images aren't loaded from the screenshot. My problem has to do with the fact that I'm using PlaywrightCrawler:
const crawler = new PlaywrightCrawler({
  maxRequestsPerCrawl: 3,
  async requestHandler({ page, request }) {
    console.log(`Processing: ${request.url}`);
    await playwrightUtils.blockRequests(page);
    await page.screenshot({ path: 'cnn_no_images2.png' });
  },
});
This way, I'm not able to block specific resources. My guess is that blockRequests needs launchPlaywright to work, and I don't see a way to pass that to PlaywrightCrawler. blockRequests has been available for Puppeteer, so maybe someone has tried this before.
Also, I've tried "route interception", but again, I couldn't make it work with PlaywrightCrawler.
You can set any listeners or run code before navigation by using preNavigationHooks, like this:
const crawler = new PlaywrightCrawler({
  maxRequestsPerCrawl: 3,
  preNavigationHooks: [async ({ page }) => {
    await playwrightUtils.blockRequests(page);
  }],
  async requestHandler({ page, request }) {
    console.log(`Processing: ${request.url}`);
    await page.screenshot({ path: 'cnn_no_images2.png' });
  },
});
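The same hook can also carry the "route interception" approach mentioned in the question. A sketch using Playwright's page.route, where the blocked resource types and the screenshot file name are illustrative:
const crawler = new PlaywrightCrawler({
  maxRequestsPerCrawl: 3,
  preNavigationHooks: [async ({ page }) => {
    // Abort image, media and font requests before navigation starts.
    await page.route('**/*', route => {
      const type = route.request().resourceType();
      return ['image', 'media', 'font'].includes(type) ? route.abort() : route.continue();
    });
  }],
  async requestHandler({ page, request }) {
    console.log(`Processing: ${request.url}`);
    await page.screenshot({ path: 'cnn_no_images3.png' });
  },
});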

How do I replace a response file in Puppeteer?

Puppeteer seems to give one some room for optimization. I managed to filter out all resources that were not relevant to my scraping to boost the speed a little bit. Now I see that Puppeteer is stuck on the main page's HTML file for a very long time before being able to continue.
So my idea was to just download (as a beginning example) the index.html file, and make Puppeteer read it from my own storage. (In general it might not work for all files if the server dynamically distributes the files.)
const puppeteer = require('puppeteer');
const fs = require('fs');

(async () => {
  const browser = await puppeteer.launch({
    headless: true,
    userDataDir: './my/path'
  });
  const page = (await browser.pages())[0];
  await page.setRequestInterception(true);
  page.on('request', request => {
    if (request.url() === 'https://stackoverflow.com/') {
      request.respond({
        status: 200,
        contentType: 'text/html; charset=utf-8',
        body: '<html>Test</html>'
      });
      //request.abort();
    }
  });
  await page.goto('https://stackoverflow.com');
  await page.screenshot({path: 'example.jpg'});
  await browser.close();
})();
But I get the following error: Error: Request is already handled!
How do I do it before the request is handled?
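One commonly suggested guard, assuming a reasonably recent Puppeteer, is to skip requests that another handler has already resolved (request.isInterceptResolutionHandled()) and to continue() everything you don't respond to. A sketch of the handler:
page.on('request', request => {
  // Skip requests that another handler has already responded to, aborted or continued.
  if (request.isInterceptResolutionHandled()) return;

  if (request.url() === 'https://stackoverflow.com/') {
    request.respond({
      status: 200,
      contentType: 'text/html; charset=utf-8',
      body: '<html>Test</html>'
    });
  } else {
    request.continue(); // let every other request through untouched
  }
});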

Puppeteer - Failed to launch the browser process! in Windows Service

I have a small JavaScript setup to convert HTML to PDF using the JavaScript library Puppeteer.
When I host the service by opening a command prompt and starting node index.js, everything works fine. The Express API hosts the service on the predefined port, and when I request the service I get the converted PDF back.
However, when I install the script as a Windows service using the library node-windows and request the service, I get the following error message:
Failed to launch the browser process!
Now I'm not sure where to search for the root cause. Is it possible that this could be a permission issue?
Here is my Puppeteer JavaScript code:
const ValidationError = require('./../errors/ValidationError.js');
const puppeteer = require('puppeteer-core');

module.exports = class PdfService {
  static async htmlToPdf(html) {
    if (!html) {
      throw new ValidationError("no html");
    }
    const browser = await puppeteer.launch({
      headless: true,
      executablePath: process.env.EDGE_PATH,
      args: ["--no-sandbox"]
    });
    const page = await browser.newPage();
    await page.setContent(html, {
      waitUntil: "networkidle2"
    });
    const pdf = await page.pdf({ format: 'A4', printBackground: true });
    await browser.close();
    return pdf;
  }
};
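When the code runs under a service account, it is worth verifying that process.env.EDGE_PATH is actually defined there and points to an existing browser binary, since puppeteer-core has no bundled browser to fall back on. A sketch of a more defensive launch inside htmlToPdf (the existence check and the dumpio flag are additions, not part of the original code):
const fs = require('fs');

// Fail fast with a clear message if the service account does not see EDGE_PATH.
const executablePath = process.env.EDGE_PATH;
if (!executablePath || !fs.existsSync(executablePath)) {
  throw new Error(`Browser executable not found at "${executablePath}"`);
}

const browser = await puppeteer.launch({
  headless: true,
  executablePath,
  args: ["--no-sandbox"],
  dumpio: true // forward browser stdout/stderr so launch failures show up in the service logs
});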

Puppeteer screenshot stop in random URL

I have a Node.js service running on Ubuntu that uses Puppeteer to take screenshots of pages, but the method page.screenshot({fullPage: true, type: 'jpeg'}) doesn't work on some random URLs and no errors are displayed in the log. The code is:
async takeScreenshot() {
  console.log('trying take Screenshot [...]');
  let image = await this.page.screenshot({fullPage: true, type: 'jpeg'});
  console.log('Completed!');
  return image;
}
An example of page that I had this problem is: https://nuevo.jumbo.cl/mayonesa-hellmanns-751-g-supreme-light/p
It seems you are not waiting until the page is fully loaded before you take the screenshot. You need 'waitUntil':
const page = await browser.newPage();
await page.goto(inputImgUrl, {waitUntil: 'networkidle'});
So the whole thing should look something like this:
(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto(inputImgUrl, {waitUntil: 'networkidle'});
  await page.setViewport(viewPortData);
  await page.screenshot({path: outputFile, type: 'jpeg', quality: 50, clip: cropData});
  await browser.close();
})();
You did not explain how it did not work, or how you expected it to look, but I'm assuming it captured the page before it finished loading.
When you first load the page, it initially shows an intermediate loading state; this is not related to Puppeteer, since you are not telling the code to wait for anything in particular, and if you are, I don't see it specified anywhere.
For me, a normal screenshot worked fine, except for the region settings. I used version 0.13 and the following code:
await page.goto("https://nuevo.jumbo.cl/mayonesa-hellmanns-751-g-supreme-light/p", {waitUntil: "networkidle0"});
await page.screenshot({
path: "test_google.png",
fullPage: true
});
});
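In current Puppeteer versions the option is spelled networkidle0 or networkidle2 rather than networkidle, and for pages that lazy-load images it can also help to scroll to the bottom before taking a fullPage screenshot. A sketch with an illustrative scrolling helper:
// Scroll to the bottom so lazy-loaded images are requested before the screenshot.
await page.evaluate(async () => {
  await new Promise(resolve => {
    let scrolled = 0;
    const step = 400;
    const timer = setInterval(() => {
      window.scrollBy(0, step);
      scrolled += step;
      if (scrolled >= document.body.scrollHeight) {
        clearInterval(timer);
        resolve();
      }
    }, 100);
  });
});
await page.screenshot({ path: "screenshot.jpg", fullPage: true, type: "jpeg" });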
