Get href attribute in Puppeteer Node.js - node.js

I know the common methods, such as evaluate, for capturing elements in Puppeteer, but I am curious why I cannot get the href attribute with a plain JavaScript-like approach such as:
const page = await browser.newPage();
await page.goto('https://www.example.com');
let links = await page.$$('a');
for (let i = 0; i < links.length; i++) {
console.log(links[i].getAttribute('href'));
console.log(links[i].href);
}

await page.$$('a') returns an array of ElementHandles. These are objects with their own Puppeteer-specific API; they do not have the usual DOM API of HTML elements or DOM nodes. So you need to either retrieve the attributes/properties in the browser context via page.evaluate() or use the rather complicated ElementHandle API. This is an example with both ways:
'use strict';
const puppeteer = require('puppeteer');
(async function main() {
try {
const browser = await puppeteer.launch();
const [page] = await browser.pages();
await page.goto('https://example.org/');
// way 1
const hrefs1 = await page.evaluate(
() => Array.from(
document.querySelectorAll('a[href]'),
a => a.getAttribute('href')
)
);
// way 2
const elementHandles = await page.$$('a');
const propertyJsHandles = await Promise.all(
elementHandles.map(handle => handle.getProperty('href'))
);
const hrefs2 = await Promise.all(
propertyJsHandles.map(handle => handle.jsonValue())
);
console.log(hrefs1, hrefs2);
await browser.close();
} catch (err) {
console.error(err);
}
})();
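The two ways above are not guaranteed to print identical values: getAttribute('href') returns the attribute as written in the markup, while the href property is the URL resolved against the page's base URL. A minimal sketch of that resolution in plain Node.js, using the built-in URL class; the helper name resolveHref and the sample URLs are illustrative assumptions, not part of Puppeteer:

```javascript
// Hypothetical helper: approximates what the `href` property reports for an
// anchor whose `href` attribute is `attributeValue` on a page at `baseUrl`.
function resolveHref(attributeValue, baseUrl) {
  return new URL(attributeValue, baseUrl).href;
}

// A root-relative attribute is resolved against the origin.
console.log(resolveHref('/about', 'https://example.org/blog/post'));
// => https://example.org/about

// A document-relative attribute is resolved against the current directory.
console.log(resolveHref('next', 'https://example.org/blog/post'));
// => https://example.org/blog/next

// An absolute attribute is returned unchanged.
console.log(resolveHref('https://other.example/x', 'https://example.org/'));
// => https://other.example/x
```

So if the relative form matters, way 1 is the one to use; if absolute URLs are wanted, way 2 (or a.href inside evaluate) gives them directly.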

const yourHref = await page.$eval('selector', anchor => anchor.getAttribute('href'));
But if you are working with a handle, you can do:
const handle = await page.$('selector');
const yourHref = await page.evaluate(anchor => anchor.getAttribute('href'), handle);

I don't know why it's such a pain, but this is what I found when I ran into this a while ago.
async function getHrefs(page, selector) {
return await page.$$eval(selector, anchors => [].map.call(anchors, a => a.href));
}

A type-safe way of returning an array of strings with the hrefs of the links, by casting each element to HTMLAnchorElement, for TypeScript users:
await page.$$eval('a', (anchors) => anchors.map((link) => (link as HTMLAnchorElement).href));

A simple way to get an href from an anchor element
Say you fetched an anchor element with the following:
const anchorElement = await page.$('a'); // or page.$<HTMLAnchorElement>('a') if using TypeScript
You can get the href property with the following (evaluate returns a promise, so it must be awaited):
const href = await anchorElement.evaluate(element => element.href);
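Note that page.$() resolves to null when nothing matches the selector, so calling evaluate on the result can throw. A small sketch that guards against that case; the helper name getHref is hypothetical, and `page` is assumed to be a Puppeteer Page:

```javascript
// Hypothetical helper; `page` is assumed to be a Puppeteer Page.
// Resolves to the anchor's href property, or null if nothing matches.
async function getHref(page, selector) {
  const handle = await page.$(selector);
  if (handle === null) {
    return null; // page.$() found no matching element
  }
  // Runs in the browser context and returns a plain string.
  return handle.evaluate(element => element.href);
}
```

It would be called as `const href = await getHref(page, 'a');` from inside an async function.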

Related

How to evaluate a relative XPath inside another XPath in Puppeteer?

Here is my code where I have got the element Handle of some target divs
const puppeteer = require("puppeteer");
(async () => {
const searchString = `https://www.google.com/maps/search/restaurants/@-6.4775265,112.057849,3.67z`;
const browser = await puppeteer.launch();
const page = await browser.newPage();
await page.goto(searchString);
const xpath_expression = '//div[contains(@aria-label, "Results for")]/div/div[./a]';
await page.waitForXPath(xpath_expression);
const targetDivs = await page.$x(xpath_expression);
// const link_urls = await page.evaluate((...targetDivs) => {
// return targetDivs.map((e) => {
// return e.textContent;
// });
// }, ...targetDivs);
})();
I have two relative XPath expressions inside these target divs that select the related data:
'link' : './a/@href'
'title': './a/@aria-label'
I have a sample of similar Python code like this:
from parsel import Selector
response = Selector(page_content)
results = []
for el in response.xpath('//div[contains(@aria-label, "Results for")]/div/div[./a]'):
    results.append({
        'link': el.xpath('./a/@href').extract_first(''),
        'title': el.xpath('./a/@aria-label').extract_first('')
    })
How to do it in Puppeteer?
I think you can get the href and ariaLabel property values with, e.g.:
const targetDivs = await page.$x(xpath_expression);
// for...of waits for each iteration; forEach would not wait for the async callbacks
for (const [pos, div] of targetDivs.entries()) {
  const links = await div.$x('a[@href]');
  const href = await (await links[0].getProperty('href')).jsonValue();
  const ariaLabel = await (await links[0].getProperty('ariaLabel')).jsonValue();
  console.log(pos, href, ariaLabel);
}
These are the element properties, not the attribute values. In the case of href, that might mean you get an absolute instead of a relative URL, but I haven't checked whether it makes a difference for that particular page. I am not sure whether $x allows direct selection of attribute nodes or string values; the documentation only talks about element handles.
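If the raw attribute values are wanted rather than the resolved properties, one option is to read them in the browser context from each element handle. A sketch, assuming each entry is a Puppeteer ElementHandle for one of the target divs and that the first a element inside it is the one of interest; the helper name extractLinks is hypothetical:

```javascript
// Hypothetical helper; `divHandles` is assumed to be an array of Puppeteer
// ElementHandles, one per target div.
async function extractLinks(divHandles) {
  const results = [];
  for (const div of divHandles) {
    // The callback runs in the browser. getAttribute returns the value as
    // written in the markup, or null when the attribute is absent.
    const item = await div.$eval('a', a => ({
      link: a.getAttribute('href') ?? '',
      title: a.getAttribute('aria-label') ?? '',
    }));
    results.push(item);
  }
  return results;
}
```

The result has the same shape as the Python list of dicts in the question: an array of `{ link, title }` objects, with empty strings for missing attributes.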

Attempting to get *all classes* from elementHandle in Puppeteer

Currently I am using
const element = await page.$('div.layout-board-section')
to get the elementHandle of the div. However, I then need to get the list of classes from that elementHandle. I've tried a couple of different solutions, but they all seem to return only the first class when using element.className in an evaluate function.
Is there any way to get all of the classes of an element?
You can use a node's .classList property.
const classes = await page.$eval(
'div.layout-board-section',
el => [...el.classList]
);
or if you already have an elementHandle:
const classes = await someElement.evaluate(el => [...el.classList]);
Complete example:
const puppeteer = require("puppeteer");
let browser;
(async () => {
const html = `<div class="foo bar baz quux">blahhh</div>`;
browser = await puppeteer.launch();
const [page] = await browser.pages();
await page.setContent(html);
const classes = await page.$eval("div", el => [...el.classList]);
console.log(classes); // => [ 'foo', 'bar', 'baz', 'quux' ]
// or with an elementHandle:
const divEl = await page.$("div");
console.log(await divEl.evaluate(el => [...el.classList]));
})()
.catch(err => console.error(err))
.finally(async () => await browser.close())
;
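For comparison, className on an HTML element is a single space-separated string, and classList is the tokenized view of the same attribute. A sketch of the equivalent manual split in plain JavaScript; the helper name splitClassName is hypothetical:

```javascript
// Hypothetical helper: splits a className string into individual class names,
// tolerating repeated or surrounding whitespace.
function splitClassName(className) {
  return className.trim().split(/\s+/).filter(Boolean);
}

console.log(splitClassName('foo bar  baz quux'));
// => [ 'foo', 'bar', 'baz', 'quux' ]
```

In most cases spreading classList, as shown above, is the simpler choice; the split is only useful when a className string is all that is at hand.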

Why am I not able to navigate through iFrames using Apify/Puppeteer?

I'm trying to manipulate forms on sites that contain iframes using Puppeteer. I tried different ways to reach a specific iframe, or even to count the iframes on a website, with no success.
Why isn't Puppeteer's object recognizing the iFrames / child frames of the page I'm trying to navigate through?
It's happening with other pages as well, such as https://www.veiculos.itau.com.br/simulacao
const Apify = require('apify');
const sleep = require('sleep-promise');
Apify.main(async () => {
// Launch the web browser.
const browser = await Apify.launchPuppeteer();
// Create and navigate new page
console.log('Open target page');
const page = await browser.newPage();
await page.goto('https://www.credlineitau.com.br/');
await sleep(15 * 1000);
for (const frame in page.mainFrame().childFrames()) {
console.log('test');
}
await browser.close();
});
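Perhaps you'll find some helpful inspiration below. (TIMEOUTS, SELECTORS, and waitForSelector are not defined in this snippet; they appear to be helpers from the answerer's own project.)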
const waitForIframeContent = async (page, frameSelector, contentSelector) => {
await page.waitForFunction((frameSelector, contentSelector) => {
const frame = document.querySelector(frameSelector);
const node = frame.contentDocument.querySelector(contentSelector);
return node && node.innerText;
}, {
timeout: TIMEOUTS.ten,
}, frameSelector, contentSelector);
};
const $frame = await waitForSelector(page, SELECTORS.frame.iframeNode).catch(() => null);
if ($frame) {
const frame = page.frames().find(frame => frame.name() === 'content-iframe');
const $cancelStatus = await waitForSelector(frame, SELECTORS.frame.membership.cancelStatus).catch(() => null);
await waitForIframeContent(page, SELECTORS.frame.iframeNode, SELECTORS.frame.membership.cancelStatus);
}
Give it a shot.
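One detail worth checking in the question's loop: for...in iterates over array indices rather than the frames themselves, and childFrames() only returns the direct children of the main frame. A sketch that walks every frame attached to the page with for...of; `page` is assumed to be a Puppeteer Page, and the helper names describeFrames and findFrameByUrl are hypothetical:

```javascript
// Hypothetical helper: lists the name and URL of every frame on the page,
// including nested frames. `page` is assumed to be a Puppeteer Page.
function describeFrames(page) {
  const described = [];
  for (const frame of page.frames()) {
    described.push({ name: frame.name(), url: frame.url() });
  }
  return described;
}

// Hypothetical helper: returns the first frame whose URL contains `urlPart`,
// or null if there is none.
function findFrameByUrl(page, urlPart) {
  return page.frames().find(frame => frame.url().includes(urlPart)) ?? null;
}
```

Logging `describeFrames(page)` after the page has settled is a quick way to see whether the iframe is attached at all before trying to query inside it.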

Check if element is visible in DOM in Node.js

I would like to check whether an element is visible in the DOM in Node.js. I use the jsdom library to get the DOM structure. There are two approaches for checking an element's visibility in client-side JavaScript, but neither works with jsdom in Node.js:
1) offsetParent property is always null, even for visible elements
2) dom.window.getComputedStyle(el).display returns block, but element's css rule is display: none
const request = require('request');
const jsdom = require("jsdom");
const { JSDOM } = jsdom;
request({ url: 'https://crooked.com/podcast-series/majority-54/', jar: true }, function (e, r, b) {
const dom = new JSDOM(b);
test(dom);
});
const test = (dom) => {
const hiddenElement = dom.window.document.querySelector('.search-outer-lg');
const visibleElement = dom.window.document.querySelector('.body-tag-inner');
console.log(dom.window.getComputedStyle(hiddenElement).display); // block
console.log(visibleElement.offsetParent); // null
}
Is this possible, or is there another way to check an element's visibility in the DOM in Node.js?
I tried puppeteer instead of jsdom and I got correct display value. Here is the snippet:
(async () => {
const browser = await puppeteer.launch();
const page = await browser.newPage();
await page.goto(uri);
const searchDiv = await page.evaluate(() => {
const btn = document.querySelector('.search-outer-lg');
return getComputedStyle(btn).display;
});
console.log(searchDiv)
await browser.close()
})()
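Building on the snippet above, a slightly more general sketch that treats an element as not displayed when it is missing, has display: none, or has visibility: hidden. It assumes `page` is a Puppeteer Page; the helper name isDisplayed is hypothetical, and the check deliberately ignores ancestors with display: none, zero-sized boxes, and opacity:

```javascript
// Hypothetical helper; `page` is assumed to be a Puppeteer Page.
// Resolves to true only if the element exists and its own computed style
// is neither display: none nor visibility: hidden.
async function isDisplayed(page, selector) {
  return page.evaluate(sel => {
    // This callback runs in the browser, where document and
    // getComputedStyle are available as globals.
    const el = document.querySelector(sel);
    if (!el) {
      return false;
    }
    const style = getComputedStyle(el);
    return style.display !== 'none' && style.visibility !== 'hidden';
  }, selector);
}
```

Because the callback returns a boolean, the result is serializable and comes back to Node.js unchanged.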
A trick method :)
function isHiddenElement(selector) {
  return document.querySelector(selector).offsetParent === null;
}
if (isHiddenElement('.search-outer-lg')) {
  alert("element hidden");
}
Try this:
const a1 = dom.window.document.querySelector('.search-outer-lg');
const componentStyle = dom.window.getComputedStyle(a1);
componentStyle.getPropertyValue('display'); // it will return none
const offsetParent = window.document.querySelector('.body-tag-inner').offsetParent;
// it returns the body element, which has the classes: archive tax-podcast_type term-majority-54 term-98
I tried this in the browser console, without using jsdom. If it does not work, tell me.

Get all p tags with Puppeteer

I am trying to get all paragraph tags from a website using Puppeteer and later extract the text from them. pTags, however, is always an empty array and I have no clue why.
Here is my code.
const puppeteer = require('puppeteer'); // the semicolon matters: the next line starts with "("
(async () => {
const browser = await puppeteer.launch();
const page = await browser.newPage();
await page.goto('https://www.newyorker.com/news/letter-from-trumps-washington/the-worst-hour-of-his-entire-life-cohen-manafort-and-the-twin-courtroom-dramas-that-changed-trumps-presidency');
const pTags = await page.evaluate(() => Array.from(document.querySelectorAll('p')));
console.log(pTags);
await browser.close();
})();
As stated in the Official Documentation:
If the function passed to the page.evaluate returns a non-Serializable value, then page.evaluate resolves to undefined.
You are attempting to return an array of DOM elements built from querySelectorAll(). DOM elements are non-serializable values, and therefore page.evaluate() cannot pass them back to Node.js in a usable form.
Instead, you can obtain an ElementHandle array of p elements using page.$$() or page.$x():
const pTags = await page.$$('p');
// or, using XPath:
const pTags = await page.$x('//p');
Use:
const pTags = await page.$$("p");
Reference: https://github.com/GoogleChrome/puppeteer/blob/v1.7.0/docs/api.md#pageselector-1
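Since the stated goal is the text of each paragraph, another option is to extract plain strings inside the browser; strings are serializable, so they survive the trip back to Node.js. A sketch, assuming `page` is a Puppeteer Page; the helper name getParagraphTexts is hypothetical:

```javascript
// Hypothetical helper; `page` is assumed to be a Puppeteer Page.
async function getParagraphTexts(page) {
  // $$eval passes an array of the matched elements to the callback, which
  // runs in the browser; only the returned strings cross back to Node.js.
  return page.$$eval('p', paragraphs =>
    paragraphs.map(p => p.textContent.trim())
  );
}
```

It would be called as `const texts = await getParagraphTexts(page);` after page.goto() has resolved.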
