Here is my code, in which I get the ElementHandles of some target divs:
const puppeteer = require("puppeteer");
(async () => {
const searchString = `https://www.google.com/maps/search/restaurants/@-6.4775265,112.057849,3.67z`;
const browser = await puppeteer.launch();
const page = await browser.newPage();
await page.goto(searchString);
const xpath_expression = '//div[contains(@aria-label, "Results for")]/div/div[./a]';
await page.waitForXPath(xpath_expression);
const targetDivs = await page.$x(xpath_expression);
// const link_urls = await page.evaluate((...targetDivs) => {
// return targetDivs.map((e) => {
// return e.textContent;
// });
// }, ...targetDivs);
})();
Inside each of these target divs, two relative XPath expressions point to the data I need:
'link': './a/@href'
'title': './a/@aria-label'
I have sample Python code that does something similar:
from parsel import Selector
response = Selector(page_content)
results = []
for el in response.xpath('//div[contains(@aria-label, "Results for")]/div/div[./a]'):
    results.append({
        'link': el.xpath('./a/@href').extract_first(''),
        'title': el.xpath('./a/@aria-label').extract_first('')
    })
How can I do this in Puppeteer?
I think you can get the href and ariaLabel property values with, for example:
const targetDivs = await page.$x(xpath_expression);
// for...of awaits each div in turn, so the output stays in document order
for (const [pos, div] of targetDivs.entries()) {
  const links = await div.$x('a[@href]');
  const href = await (await links[0].getProperty('href')).jsonValue();
  const ariaLabel = await (await links[0].getProperty('ariaLabel')).jsonValue();
  console.log(pos, href, ariaLabel);
}
These are element properties, not attribute values. For href, that can mean you get an absolute URL instead of a relative one, although I haven't checked whether it makes a difference on that particular page. I am not sure whether $x allows selecting attribute nodes or string values directly; the documentation only talks about element handles.
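If the attribute values themselves are wanted, as in the parsel example, one possible sketch is to read them inside the page with getAttribute(). The helper name extractLinks is hypothetical; it assumes targetDivs is the array returned by page.$x(xpath_expression) and that the first direct child a of each div is the one of interest.

```javascript
// Hypothetical helper: returns [{ link, title }] built from attribute values
// (not element properties), so a relative href stays relative.
async function extractLinks(targetDivs) {
  const results = [];
  for (const div of targetDivs) {
    // ElementHandle.evaluate runs the callback in the page, with the div as its argument.
    const item = await div.evaluate((el) => {
      // First direct child <a>, comparable to ./a with extract_first in the Python sample.
      const a = el.querySelector(':scope > a');
      return {
        link: (a && a.getAttribute('href')) || '',
        title: (a && a.getAttribute('aria-label')) || '',
      };
    });
    results.push(item);
  }
  return results;
}
```

It could be called as `const results = await extractLinks(await page.$x(xpath_expression));`.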
I get this error when I run the script, which is built with webpack:
Error: Evaluation failed: ReferenceError: _babel_runtime_helpers_toConsumableArray__WEBPACK_IMPORTED_MODULE_1___default is not defined at __puppeteer_evaluation_script__:2:27
But when I run the same code without webpack, I get the expected result.
Here is my function:
const getMeenaClickProducts = async (title) => {
const url = `${MEENACLICK}/${title}`;
console.log({ url });
const browser = await puppeteer.launch({ headless: true });
const page = await browser.newPage();
await page.goto(url);
await page.waitForSelector('.ant-pagination-total-text');
const products = await page.evaluate(() => {
const cards = [...document.querySelectorAll('.card-thumb')];
console.log({ cards });
return cards.map((card) => {
const productTitle = card.querySelector('.title').innerText;
const priceElement = card.querySelector('.reg-price');
const price = priceElement ? priceElement.innerText : '';
const image = card.querySelector('.img').src;
const link = card.querySelector('.main-link').href;
return {
title: productTitle,
price,
image,
link,
};
});
});
await browser.close();
const filteredProducts = products
.filter((product) =>
product.title.toLowerCase().includes(title.toLowerCase())
)
.filter((item) => item.price);
return filteredProducts;
};
What could be the reason?
The problem is with Babel, and with this part:
const products = await page.evaluate(() => {
const cards = [...document.querySelectorAll('.card-thumb')];
console.log({ cards });
return cards.map((card) => {
const productTitle = card.querySelector('.title').innerText;
const priceElement = card.querySelector('.reg-price');
const price = priceElement ? priceElement.innerText : '';
const image = card.querySelector('.img').src;
const link = card.querySelector('.main-link').href;
return {
title: productTitle,
price,
image,
link,
};
});
});
The function you pass to page.evaluate() is not the code that actually reaches the page, because Babel transforms it first.
The array spread operator in this line:
const cards = [...document.querySelectorAll('.card-thumb')];
is most likely compiled into a call to a helper function named _babel_runtime_helpers_toConsumableArray__WEBPACK_IMPORTED_MODULE_1___default. Puppeteer serializes the transformed function and executes it in the page, but that helper is not defined in the page context, which is why you get a ReferenceError.
Some options to fix it:
Don't use the spread operator with your current Babel config, so that the transformed build doesn't include a polyfill/replacement for it. Use a replacement with an equivalent effect, such as:
const cards = Array.from(document.querySelectorAll('.card-thumb'));
A more traditional for or forEach() loop that builds up the array yourself will also get the job done.
Alternatively, update your Babel config / target language level so that the spread operator is supported natively.
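As a sketch of the first option, the page-side callback can be written without the spread operator. The function name collectCards is hypothetical; the selectors are the ones from the question. Whether your Babel setup leaves every construct here untouched is an assumption (some runtime configurations also rewrite Array.from), so it is worth checking the compiled output.

```javascript
// Hypothetical page-side function: uses Array.from with a mapping callback
// instead of [...nodeList], so no array-spread helper is needed in the page.
function collectCards() {
  return Array.from(document.querySelectorAll('.card-thumb'), (card) => {
    const priceElement = card.querySelector('.reg-price');
    return {
      title: card.querySelector('.title').innerText,
      price: priceElement ? priceElement.innerText : '',
      image: card.querySelector('.img').src,
      link: card.querySelector('.main-link').href,
    };
  });
}
```

It would be passed to Puppeteer as `const products = await page.evaluate(collectCards);`.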
I'm using Puppeteer to retrieve data online and have run into an issue.
Two functions have the same name and return a serialized object: the first one returns an empty object, but the second one contains the data I'm targeting.
My question is: how can I select the second occurrence of the function instead of the first one, which returns an empty object?
Thanks.
My code:
const puppeteer = require('puppeteer');
const cheerio = require('cheerio');
const Variants = require('./variants.js');
const Feedback = require('./feedback.js');
async function Scraper(productId, feedbackLimit) {
const browser = await puppeteer.launch();
const page = await browser.newPage();
/** Scrape page for details */
await page.goto(`${productId}`);
const data = (await page.evaluate()).match(/window.runParams = {"result/)
const data = data.items
await page.close();
await browser.close();
console.log(data);
return data;
}
module.exports = Scraper;
Website source code :
window.runParams = {};
window.runParams = {"resultCount":19449,"seoFeaturedSnippet":};
Please try this; it should work:
const data = await page.content();
// Escape the dot so it matches a literal "."; matchAll requires the g flag
const regexp = /window\.runParams/g;
const matches = data.matchAll(regexp);
for (const match of matches) {
  console.log(match);
  console.log(match.index);
}
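Building on the matchAll idea, a minimal sketch for picking the first non-empty assignment could look like the following. The helper name extractRunParams is hypothetical, and the sketch assumes that each assignment sits on a single line and that the assigned object literal is valid JSON.

```javascript
// Hypothetical helper: scans the HTML for "window.runParams = {...};" assignments
// and returns the first one that parses to a non-empty object, or null.
function extractRunParams(html) {
  // "." does not match newlines, so each match stays within one line.
  const pattern = /window\.runParams\s*=\s*(\{.*\});/g;
  for (const match of html.matchAll(pattern)) {
    let parsed;
    try {
      parsed = JSON.parse(match[1]);
    } catch {
      continue; // not valid JSON, try the next occurrence
    }
    if (Object.keys(parsed).length > 0) {
      return parsed;
    }
  }
  return null;
}
```

With Puppeteer it could be used as `const data = extractRunParams(await page.content());`, which skips the empty first object.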
I would like to check whether an element is visible in the DOM in Node.js. I use the jsdom library to get the DOM structure. There are two approaches to checking an element's visibility in client-side JavaScript, but neither works with jsdom in Node.js:
1) The offsetParent property is always null, even for visible elements.
2) dom.window.getComputedStyle(el).display returns block, although the element's CSS rule is display: none.
const request = require('request');
const jsdom = require("jsdom");
const { JSDOM } = jsdom;
request({ url: 'https://crooked.com/podcast-series/majority-54/', jar: true }, function (e, r, b) {
const dom = new JSDOM(b);
test(dom);
});
const test = (dom) => {
const hiddenElement = dom.window.document.querySelector('.search-outer-lg');
const visibleElement = dom.window.document.querySelector('.body-tag-inner');
console.log(dom.window.getComputedStyle(hiddenElement).display); // block
console.log(visibleElement.offsetParent); // null
}
Is this possible, or is there another way to check an element's visibility in the DOM in Node.js?
I tried Puppeteer instead of jsdom and got the correct display value. Here is the snippet:
(async () => {
const browser = await puppeteer.launch();
const page = await browser.newPage();
await page.goto(uri);
const searchDiv = await page.evaluate(() => {
const btn = document.querySelector('.search-outer-lg');
return getComputedStyle(btn).display;
});
console.log(searchDiv)
await browser.close()
})()
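The same idea can be wrapped into a reusable check. This is only a sketch: the helper name isVisible and its definition of "visible" (not display: none, not visibility: hidden, and a non-empty bounding box) are assumptions, not something the snippet above defines.

```javascript
// Hypothetical helper: resolves to true if the first element matching
// `selector` is rendered with a non-empty box in the page.
async function isVisible(page, selector) {
  // The callback runs in the browser; `selector` is passed in as an argument.
  return page.evaluate((sel) => {
    const el = document.querySelector(sel);
    if (!el) return false;
    const style = window.getComputedStyle(el);
    const rect = el.getBoundingClientRect();
    return (
      style.display !== 'none' &&
      style.visibility !== 'hidden' &&
      rect.width > 0 &&
      rect.height > 0
    );
  }, selector);
}
```

For example, `await isVisible(page, '.search-outer-lg')` would be expected to resolve to false for an element whose rule is display: none. The bounding-box test also catches elements hidden through an ancestor, whose own computed display is not none.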
Trick method :)
function isHiddenElement(selector) {
return (document.querySelector(selector).offsetParent === null)
}
if (isHiddenElement('.search-outer-lg')) {
  alert("element hidden");
}
Try it in the browser console, without jsdom:
const a1 = window.document.querySelector('.search-outer-lg');
const componentStyle = window.getComputedStyle(a1);
componentStyle.getPropertyValue('display'); // returns "none"
const offsetParent = window.document.querySelector('.body-tag-inner').offsetParent;
// returns the body element, which has the classes "archive tax-podcast_type term-majority-54 term-98"
I tried this in the console, without jsdom. If it does not work, let me know.
I know the common methods, such as evaluate, for capturing elements in Puppeteer, but I am curious why I cannot get the href attribute with a plain JavaScript-like approach such as:
const page = await browser.newPage();
await page.goto('https://www.example.com');
let links = await page.$$('a');
for (let i = 0; i < links.length; i++) {
console.log(links[i].getAttribute('href'));
console.log(links[i].href);
}
await page.$$('a') returns an array of ElementHandles. These are objects with their own Puppeteer-specific API; they do not have the usual DOM API of HTML elements or DOM nodes. So you need to either retrieve attributes/properties in the browser context via page.evaluate() or use the rather complicated ElementHandle API. Here is an example of both ways:
'use strict';
const puppeteer = require('puppeteer');
(async function main() {
try {
const browser = await puppeteer.launch();
const [page] = await browser.pages();
await page.goto('https://example.org/');
// way 1
const hrefs1 = await page.evaluate(
() => Array.from(
document.querySelectorAll('a[href]'),
a => a.getAttribute('href')
)
);
// way 2
const elementHandles = await page.$$('a');
const propertyJsHandles = await Promise.all(
elementHandles.map(handle => handle.getProperty('href'))
);
const hrefs2 = await Promise.all(
propertyJsHandles.map(handle => handle.jsonValue())
);
console.log(hrefs1, hrefs2);
await browser.close();
} catch (err) {
console.error(err);
}
})();
const yourHref = await page.$eval('selector', anchor => anchor.getAttribute('href'));
But if you are working with a handle, you can do this:
const handle = await page.$('selector');
const yourHref = await page.evaluate(anchor => anchor.getAttribute('href'), handle);
I don't know why it's such a pain, but this is what I found when I ran into this a while ago:
async function getHrefs(page, selector) {
return await page.$$eval(selector, anchors => [].map.call(anchors, a => a.href));
}
A type-safe way for TypeScript users to return the hrefs of the links as an array of strings, by casting each element to HTMLAnchorElement:
await page.$$eval('a', (anchors) => anchors.map((link) => (link as HTMLAnchorElement).href));
A simple way to get an href from an anchor element
Say you fetched an anchor element with the following:
const anchorElement = await page.$('a'); // or page.$<HTMLAnchorElement>('a') if using TypeScript
You can get the href property with the following:
const href = await anchorElement.evaluate(element => element.href);
I am trying to get all paragraph tags from a website using Puppeteer and later extract the text from it. pTags, however, is always an empty array and I have no clue why.
Here is my code.
const puppeteer = require('puppeteer');
(async () => {
const browser = await puppeteer.launch();
const page = await browser.newPage();
await page.goto('https://www.newyorker.com/news/letter-from-trumps-washington/the-worst-hour-of-his-entire-life-cohen-manafort-and-the-twin-courtroom-dramas-that-changed-trumps-presidency');
const pTags = await page.evaluate(() => Array.from(document.querySelectorAll('p')));
console.log(pTags);
await browser.close();
})();
As stated in the official documentation:
If the function passed to the page.evaluate returns a non-Serializable value, then page.evaluate resolves to undefined.
You are attempting to return an array of DOM nodes, built from the NodeList that querySelectorAll() returns. DOM nodes are not serializable, so page.evaluate() cannot return them to Node.js in a usable form.
Instead, you can obtain an ElementHandle array of p elements using page.$$() or page.$x():
const pTags = await page.$$('p');
// or, using XPath:
const pTags = await page.$x('//p');
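Since the stated goal is the paragraph text, another option is to return only strings from the page, which are serializable. A minimal sketch, assuming the hypothetical helper name getParagraphTexts and that trimmed textContent is what is wanted:

```javascript
// Hypothetical helper: resolves to an array with the text of every <p> on the page.
async function getParagraphTexts(page) {
  // page.$$eval runs the callback in the page on all matching elements
  // and returns its (serializable) result to Node.js.
  return page.$$eval('p', (paragraphs) =>
    paragraphs.map((p) => p.textContent.trim())
  );
}
```

In the script from the question this would replace the page.evaluate() call: `const texts = await getParagraphTexts(page);`.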
Use:
const pTags = await page.$$("p");
Reference: https://github.com/GoogleChrome/puppeteer/blob/v1.7.0/docs/api.md#pageselector-1