How to scrape image src the right way using puppeteer? - node.js

I'm trying to create a function that can capture the src attribute from a website, but none of the most common ways of doing so are working.
This was my original attempt.
(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  try {
    await page.setDefaultNavigationTimeout(0);
    await page.waitForTimeout(500);
    await page.goto(
      `https://www.sirved.com/restaurant/essex-ontario-canada/dairy-freez/1/menus/3413654`,
      {
        waitUntil: "domcontentloaded",
      }
    );
    const fetchImgSrc = await page.evaluate(() => {
      const img = document.querySelectorAll(
        "#menus > div.tab-content > div > div > div.swiper-wrapper > div.swiper-slide > img"
      );
      let src = [];
      for (let i = 0; i < img.length; i++) {
        src.push(img[i].getAttribute("src"));
      }
      return src;
    });
    console.log(fetchImgSrc);
  } catch (err) {
    console.log(err);
  }
  await browser.close();
})();
This logged an empty array: []
In my next attempt I tried a suggestion, and it returned an empty string.
await page.setViewport({ width: 1024, height: 768 });
const imgs = await page.$$eval("#menus img", (images) =>
  images.map((i) => i.src)
);
console.log(imgs);
And in my final attempt I followed another suggestion, and it returned an array with two empty strings inside of it.
const fetchImgSrc = await page.evaluate(() => {
  const img = document.querySelectorAll(".swiper-lazy-loaded");
  let src = [];
  for (let i = 0; i < img.length; i++) {
    src.push(img[i].getAttribute("src"));
  }
  return src;
});
console.log(fetchImgSrc);
In each attempt I only replaced the function and console.log portion of the code. I've done a lot of digging and found these are the most common ways of scraping an image src using puppeteer, and I've used them in other cases, but for some reason they aren't working for me right now. I'm not sure if I have a bug in my code or why it will not work.

To return the src link for the two menu images on this page you can use:
const fetchImgSrc = await page.evaluate(() => {
  const img = document.querySelectorAll('.swiper-lazy-loaded');
  let src = [];
  for (let i = 0; i < img.length; i++) {
    src.push(img[i].getAttribute("src"));
  }
  return src;
});
This gives us the expected output:
['https://images.sirved.com/ChIJq6qqqlrZOogRs_xGxBcn0_w/5caf3b9eabc40.jpg', 'https://images.sirved.com/ChIJq6qqqlrZOogRs_xGxBcn0_w/5caf3bbe93cc6.jpg']

You have two issues here:
Puppeteer opens the page in a smaller window by default, and the images to be scraped are lazy loaded: while they are not in the viewport, they won't be loaded (they won't even have src attributes). You need to set your puppeteer browser to a bigger size with page.setViewport.
Element.getAttribute is not advised if you are working with dynamically changing websites: it returns the attribute value as written in the markup, which is an empty string for a lazy loaded image. What you need is the src property, which is always up to date in the DOM. This is the attribute vs. property distinction in JavaScript.
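To see the difference, here is a minimal sketch run against the same page (assuming the images have already been lazy loaded into the viewport):
const comparison = await page.evaluate(() => {
  const img = document.querySelector('.swiper-lazy-loaded');
  return {
    attribute: img.getAttribute('src'), // the attribute as written in the markup – may still be ''
    property: img.src,                  // the live DOM property – the resolved, up-to-date URL
  };
});
console.log(comparison);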
By the way: you can shorten your script with page.$$eval like this:
await page.setViewport({ width: 1024, height: 768 })
const imgs = await page.$$eval('#menus img', images => images.map(i => i.src))
console.log(imgs)
Output:
[
  'https://images.sirved.com/ChIJq6qqqlrZOogRs_xGxBcn0_w/5caf3b9eabc40.jpg',
  'https://images.sirved.com/ChIJq6qqqlrZOogRs_xGxBcn0_w/5caf3bbe93cc6.jpg'
]

Related

Deleting child's child elements while web scraping and writing it to an HTML file using NodeJS puppeteer

I'm doing web scraping and writing the data to another HTML file.
On the line const content = await page.$eval('.eApVPN', e => e.innerHTML); I'm fetching the inner HTML of a div. This div has multiple p tags, and inside those p tags there are multiple hyperlink (a) tags.
I want to remove those hyperlink tags, but I'm unable to do so.
const fs = require('fs').promises;
const helps = require('./_helpers');

const OUTDIR = './results/dataset/'
// Create the output directory if it doesn't exist yet
fs.stat(OUTDIR).catch(async () => {
  await fs.mkdir(OUTDIR, { recursive: true });
});
const scraperObject = {
  async scraper(browser) {
    const dataSet = await helps.readCSV('./results/dataset.csv');
    console.log("dataset is: ", dataSet);
    let cookies = null;
    let page = await browser.newPage();
    for (let i = 0; i < dataSet.length; i++) {
      let url = dataSet[i].coinPage;
      const filename = dataSet[i].symbol;
      try {
        console.log(`Navigating to ${url}...`);
        await page.goto(url);
        if (cookies == null) {
          cookies = await page.cookies();
          await fs.writeFile('./storage/cookies', JSON.stringify(cookies, null, 2));
        }
        await helps.autoScroll(page);
        await page.waitForSelector('.eApVPN');
        const content = await page.$eval('.eApVPN', e => e.innerHTML);
        // fs.promises.writeFile returns a promise – it takes no callback argument
        await fs.writeFile(`${OUTDIR}${filename}.html`, content);
        console.log("Written to HTML successfully!");
      } catch (err) {
        console.log(err, '------->', dataSet[i].symbol);
      }
    }
    await page.close();
  }
}
module.exports = scraperObject;
Unfortunately Puppeteer doesn't have native functionality to remove nodes. However, you can use the .evaluate method to run any JavaScript against the current document. For example, a script which removes your nodes would look something like this:
await page.evaluate((sel) => {
  var elements = document.querySelectorAll(sel);
  for (var i = 0; i < elements.length; i++) {
    elements[i].remove()
  }
}, ".eApVPN>a")
The above code will remove any <a> nodes directly under a node with the eApVPN class. Then you can extract the data with your $eval selector, as sketched below.
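Putting both steps together, a hedged sketch (note that the question's <a> tags sit inside <p> tags, so the descendant selector .eApVPN a may be what's actually needed rather than the direct-child .eApVPN>a):
// Remove the links first, then grab the cleaned-up HTML
await page.evaluate(() => {
  document.querySelectorAll('.eApVPN a').forEach((a) => a.remove());
});
const content = await page.$eval('.eApVPN', (e) => e.innerHTML);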

Is there a way to iterate over a <li> list in Playwright and click on each element?

I'm trying to iterate over a list of dynamic elements with Playwright. I've tried a couple of things already, but none of them have worked:
await this.page.locator('li').click();
const elements = await this.page.locator('ul > li');
await elements.click()
await this.page.$$('ul > li').click();
await this.page.click('ul > li');
const divCounts = await elements.evaluateAll(async (divs) => await divs.click());
this.page.click('ul > li > i.red', { strict: false, clickCount: 1 },)
const elements = await this.page.$$('ul > li > i.red')
elements.forEach(async value => {
  console.log(value)
  await this.page.click('ul > li > i.red', { strict: false, clickCount: 1 },)
  await value.click();
})
Since https://playwright.dev/docs/api/class-locator#locator-element-handles doesn't have a good example of how to use .elementHandles(), here is another way to solve this issue:
const checkboxLocator = page.locator('tbody tr input[type="checkbox"]');
for (const el of await checkboxLocator.elementHandles()) {
  await el.check();
}
I managed to do it with the following code:
test('user can click multiple li', async ({ page }) => {
  const items = page.locator('ul > li');
  for (let i = 0; i < await items.count(); i++) {
    await items.nth(i).click();
  }
})
A similar question was asked recently on the Playwright Slack community.
This is copy-pasted and minimally adjusted from the answer by one of the maintainers there.
let listItems = this.page.locator('ul > li');
// In case the li elements don't appear all together, you have to wait before the loop below.
// What element to wait for depends on your situation.
await listItems.nth(9).waitFor();
// count() returns a promise, so it must be awaited for the loop to work
for (let i = 0; i < await listItems.count(); i++) {
  await listItems.nth(i).click();
}
You can achieve that using $$eval and pure client-side JavaScript:
const results = await page.$$eval(`ul > li`, (allListItems) => {
  allListItems.forEach(async singleListItem => await singleListItem.click())
});
Please note that what you write inside the callback will be executed in the browser. So if you want to output anything, you need to return it; that way it will end up inside the results variable.
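For instance, a minimal sketch of returning data instead of clicking (the list contents here are illustrative):
// The callback runs in the page; whatever it returns is serialized back to Node
const texts = await page.$$eval('ul > li', (allListItems) =>
  allListItems.map((li) => li.textContent.trim())
);
console.log(texts); // e.g. ['first item', 'second item']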
This works for me (my example):
// reset state and remove all existing bookmarks
const bookmarkedItems = page.locator('.bookmark img[src="/static/img/like_orange.png"]');
const bookmarkedItemsCounter = await bookmarkedItems.count();
if (bookmarkedItemsCounter) {
  for (let i = 0; i < bookmarkedItemsCounter; i++) {
    await bookmarkedItems.nth(i).click();
  }
}
await page.waitForTimeout(1000);
await page.waitForTimeout(1000);
Applied to your task, it would be:
test('click by each li element in the list', async ({ page }) => {
  await page.goto(some_url);
  const liItems = page.locator('ul > li');
  const liItemCounter = await liItems.count();
  if (liItemCounter) {
    for (let i = 0; i < liItemCounter; i++) {
      await liItems.nth(i).click();
    }
  }
  await page.waitForTimeout(1000);
});

Print all pdf pages except the last one using Puppeteer

I need to generate a different footer for the last pdf page. After investigation I realised that the best way is to generate two different pdfs and combine them. It works fine when I need to change the footer or use different templates for the first page, or in cases when I know which pages should look different (using the pageRanges option), but I can't find a way to get only the last (last n) pages when the total page number is unknown. Any ideas how I can generate a pdf for only the last (last n) pages? Any answers will be appreciated.
I'm using Puppeteer v2.1.0 with Node.js v8.16.0.
This is the script which I'm using for generating pdf files now:
const puppeteer = require('puppeteer');
const fs = require('fs');

const DEFAULT_HEADER = '<span></span>';
const DEFAULT_FOOTER_HEIGHT = 90;
const DEFAULT_PAGE_PADDING = 36;

const createPdf = async () => {
  let browser;
  try {
    browser = await puppeteer.launch();
    const page = await browser.newPage();
    const [, , bodyFilePath, outputFilePath, footerFilePath] = process.argv;
    await page.goto(`file:${bodyFilePath}`, { waitUntil: 'networkidle0' });
    let footerTemplate = DEFAULT_HEADER;
    if (footerFilePath) {
      footerTemplate = fs.readFileSync(footerFilePath, 'utf8');
    }
    await page.pdf({
      path: outputFilePath,
      format: 'A4',
      margin: {
        top: DEFAULT_PAGE_PADDING,
        right: DEFAULT_PAGE_PADDING,
        bottom: DEFAULT_FOOTER_HEIGHT,
        left: DEFAULT_PAGE_PADDING,
      },
      printBackground: true,
      displayHeaderFooter: true,
      headerTemplate: DEFAULT_HEADER,
      footerTemplate,
    });
  } catch (err) {
    console.log(err.message);
  } finally {
    if (browser) {
      await browser.close();
    }
    process.exit();
  }
};

createPdf();
The templates which I'm converting to pdf are .html.erb files.
Maybe there are better ways to solve this problem, but at this point I've used this approach: I generate the export using the same script as above, and then I use one more script which opens the previous pdf file, counts the pages, and generates two new files which I combine into one file on the backend: all pages except the last one, and only the last page with a different footer.
const puppeteer = require('puppeteer');
const fs = require('fs');
const pdf = require('pdf-parse');

const DEFAULT_HEADER = '<span></span>';
const DEFAULT_FOOTER_HEIGHT = 90;
const DEFAULT_PAGE_PADDING = 36;

const createPdf = async () => {
  let browser;
  try {
    browser = await puppeteer.launch();
    const page = await browser.newPage();
    const [
      ,
      ,
      bodyFilePath,
      outputFilePath,
      footerFilePath,
      lastPagePath,
      lastPageFooterPath,
    ] = process.argv;
    await page.goto(`file:${bodyFilePath}`, { waitUntil: 'networkidle0' });
    let footerTemplate = DEFAULT_HEADER;
    let lastPageFooterTemplate = DEFAULT_HEADER;
    if (footerFilePath) {
      footerTemplate = fs.readFileSync(footerFilePath, 'utf8');
    }
    if (lastPageFooterPath) {
      lastPageFooterTemplate = fs.readFileSync(lastPageFooterPath, 'utf8');
    }
    // Count the pages of the pdf generated by the first script
    const dataBuffer = fs.readFileSync(outputFilePath);
    const pdfInfo = await pdf(dataBuffer);
    const numPages = pdfInfo.numpages;
    const baseOptions = {
      path: outputFilePath,
      format: 'A4',
      margin: {
        top: DEFAULT_PAGE_PADDING,
        right: DEFAULT_PAGE_PADDING,
        bottom: DEFAULT_FOOTER_HEIGHT,
        left: DEFAULT_PAGE_PADDING,
      },
      printBackground: true,
      displayHeaderFooter: true,
      headerTemplate: DEFAULT_HEADER,
      pageRanges: `${numPages}`,
      footerTemplate: lastPageFooterTemplate,
    };
    if (numPages === 1) {
      await page.pdf(baseOptions);
    } else {
      // All pages except the last one, with the regular footer
      await page.pdf({
        ...baseOptions,
        footerTemplate,
        pageRanges: `-${numPages - 1}`,
      });
      // Only the last page, with its own footer
      await page.pdf({
        ...baseOptions,
        path: lastPagePath,
        footerTemplate: lastPageFooterTemplate,
      });
    }
  } catch (err) {
    console.log(err.message);
  } finally {
    if (browser) {
      await browser.close();
    }
    process.exit();
  }
};

createPdf();
Hope this will be helpful for someone with the same issue.
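For the combining step that happens on the backend, here is a minimal sketch using the pdf-lib package (an assumption – the question doesn't say which tool is used for merging):
const { PDFDocument } = require('pdf-lib'); // assumed dependency, not part of the original scripts
const fs = require('fs');

const mergePdfs = async (mainPath, lastPagePath, mergedPath) => {
  const merged = await PDFDocument.create();
  for (const path of [mainPath, lastPagePath]) {
    // Copy every page of each source document into the merged document, in order
    const doc = await PDFDocument.load(fs.readFileSync(path));
    const pages = await merged.copyPages(doc, doc.getPageIndices());
    pages.forEach((p) => merged.addPage(p));
  }
  fs.writeFileSync(mergedPath, await merged.save());
};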

puppeteer await page.$$('.className'), but I get only the first 11 elements with that class, why?

The code that I am using to scrape a student list:
let collection1 = await page.$$('div.layout-2DM8Md')
console.log("Student Online:")
for (let el of collection1) {
  let name = await el.$eval('div.name-uJV0GL', node => node.innerText.trim());
  console.log(name)
}
It's probably because the contents of the rest of those elements are loaded dynamically with a JavaScript framework like React or Vue. This means they only get loaded when those elements enter the viewport of the browser.
To fix this you will need to write a function that auto-scrolls the page so that those elements get into the viewport, and then you have to wait for that function to finish before you collect the data.
The scrolling function:
const autoScroll = async (page) => {
  await page.evaluate(async () => {
    await new Promise((resolve) => {
      var totalHeight = 0;
      var distance = 100;
      // Scroll down in small steps until the bottom of the page is reached
      var timer = setInterval(() => {
        var scrollHeight = document.body.scrollHeight;
        window.scrollBy(0, distance);
        totalHeight += distance;
        if (totalHeight >= scrollHeight) {
          clearInterval(timer);
          resolve();
        }
      }, 30);
    });
  });
}
Then call this function after page.goto() and before you grab the content with page.content(). I also set the viewport width and height, so the scrolling goes a little faster:
await page.goto(url, { waitUntil: 'load' });
await page.setViewport({
  width: 1200,
  height: 800
});
await autoScroll(page); // The scroll function
const html = await page.content()

Get href attribute in puppeteer Node.js

I know the common methods such as evaluate for capturing elements in puppeteer, but I am curious why I cannot get the href attribute with a JavaScript-like approach, as in:
const page = await browser.newPage();
await page.goto('https://www.example.com');
let links = await page.$$('a');
for (let i = 0; i < links.length; i++) {
  console.log(links[i].getAttribute('href'));
  console.log(links[i].href);
}
await page.$$('a') returns an array of ElementHandles — these are objects with their own puppeteer-specific API; they do not have the usual DOM API of HTML elements or DOM nodes. So you need to either retrieve attributes/properties in the browser context via page.evaluate() or use the rather complicated ElementHandle API. This is an example with both ways:
'use strict';

const puppeteer = require('puppeteer');

(async function main() {
  try {
    const browser = await puppeteer.launch();
    const [page] = await browser.pages();
    await page.goto('https://example.org/');

    // way 1
    const hrefs1 = await page.evaluate(
      () => Array.from(
        document.querySelectorAll('a[href]'),
        a => a.getAttribute('href')
      )
    );

    // way 2
    const elementHandles = await page.$$('a');
    const propertyJsHandles = await Promise.all(
      elementHandles.map(handle => handle.getProperty('href'))
    );
    const hrefs2 = await Promise.all(
      propertyJsHandles.map(handle => handle.jsonValue())
    );

    console.log(hrefs1, hrefs2);
    await browser.close();
  } catch (err) {
    console.error(err);
  }
})();
You can do it with a single page.$eval:
const yourHref = await page.$eval('selector', anchor => anchor.getAttribute('href'));
but if you are working with a handle you can do:
const handle = await page.$('selector');
const yourHref = await page.evaluate(anchor => anchor.getAttribute('href'), handle);
I don't know why it's such a pain, but this is what I found when I encountered this problem a while ago:
async function getHrefs(page, selector) {
  return await page.$$eval(selector, anchors => [].map.call(anchors, a => a.href));
}
A type-safe way for TypeScript users to return the hrefs of the links as an array of strings, casting to HTMLAnchorElement (the DOM type for <a> elements):
await page.$$eval('a', (anchors) => anchors.map((link) => (link as HTMLAnchorElement).href));
A simple way to get an href from an anchor element:
Say you fetched an anchor element with the following:
const anchorElement = await page.$('a') // or page.$<HTMLAnchorElement>('a') if using typescript
You can get the href property with the following (note that evaluate returns a promise, so it must be awaited):
const href = await anchorElement.evaluate(element => element.href)
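Note that page.$ resolves to null when nothing matches the selector, so a guard is worth adding; a minimal usage sketch:
const anchorElement = await page.$('a');
if (anchorElement) {
  // ElementHandle.evaluate runs the callback in the page, with the element as its argument
  const href = await anchorElement.evaluate((element) => element.href);
  console.log(href);
}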
