Node.js/Puppeteer - How to use page.evaluate

I know this is a noob question, but I want to know when I should use page.evaluate.
I also know the documentation exists, but I still do not understand it.
Can anybody explain how and when to use this function when creating a scraper with Puppeteer?

First, it is important to understand that there are two main environments:
Node.js (Puppeteer) Environment
Page DOM Environment
You should use page.evaluate() when you want to interact with the page directly in the page DOM environment. You pass it a function, and it returns a Promise<Serializable> that resolves to the return value of that function.
Otherwise, if you do not use page.evaluate(), you deal with elements as ElementHandle objects in the Node.js (Puppeteer) environment.
Example Usage:
const example = await page.evaluate(() => {
  // Runs in the page DOM environment: document is available here.
  const elements = document.getElementsByClassName('example');
  const result = [];
  document.title = 'New Title';
  for (let i = 0; i < elements.length; i++) {
    result.push(elements[i].textContent);
  }
  // Only serializable values can cross back to Node.js.
  return JSON.stringify(result);
});
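By contrast, without page.evaluate() you stay in the Node.js environment and work with ElementHandle objects. A short sketch of the same extraction, assuming the same .example class:
const handles = await page.$$('.example'); // ElementHandle objects
for (const handle of handles) {
  // Pull one serializable property out of the page context per handle.
  const text = await (await handle.getProperty('textContent')).jsonValue();
  console.log(text);
}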

Related

Async, await, callback - What exactly is the execution context of a callback function?

So, I have previous programming experience in numerous languages: assembly, C, C++, BASIC, page description languages, etc.
I am currently learning Node.js, JavaScript, and Puppeteer, and have run into something I cannot quite make sense of.
I have read various things that seem to explain various limitations of the callback execution context, but I have not found anything that specifically explains this.
I am attempting to call functions or reference variables (defined in the current module) from within a callback function. I have tried a number of variations, with variables of assorted types defined in assorted locations, but this one demonstrates the problem, and I expect the solution to it will be the solution for all the variants. I am getting errors that "aFunction is not defined".
Why can't the callback function see the globally defined function aFunction()?
function aFunction(parm) {
  return something;
}

(async () => {
  let pages = await browser.pages();
  // array of browser titles
  var titles = [];
  // iterate pages extracting each title; a for loop is used because forEach cannot contain await
  for (let index = 0; index < pages.length; index++) {
    const pagex = pages[index];
    const title = await pagex.title();
    titles.push(title);
  }
  // chopped and edited a bunch to keep it simple
  // here is the home of my problem
  foundAt = 0;
  const container = await pages[foundAt].evaluate(() => {
    let elements = $('.classiwant').toArray();
    // this is the failing call
    var x = aFunction(something);
    for (i = 0; i < elements.length; i++) {
      $(elements[i]).click();
    }
  });
})();
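The two-environment distinction from the first answer above explains the error: the callback passed to evaluate() is serialized and run in the page DOM environment, where module-scope functions such as aFunction do not exist. A minimal sketch of one common workaround, using Puppeteer's page.exposeFunction to proxy calls back to Node.js (querySelectorAll stands in for the jQuery $ helper, which only exists if the page itself loads jQuery; 'something' is kept as the question's placeholder):
// Expose the Node.js function to the page before calling evaluate().
await pages[foundAt].exposeFunction('aFunction', aFunction);
const container = await pages[foundAt].evaluate(async () => {
  const elements = document.querySelectorAll('.classiwant');
  // window.aFunction proxies back to the Node.js environment
  // and therefore returns a Promise inside the page.
  const x = await window.aFunction('something');
  for (const element of elements) {
    element.click();
  }
  return x;
});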

How to get an element from multiple URLs, appending each one?

I have a website whose main URL contains several links. I want to get the first <p> element from each link on that main page.
The following code works fine to get the desired links from the main page and store them in the urls array. My issue is that I don't know how to loop over the urls array, load each URL, and either print each first <p> per iteration or append them all to a variable and print everything at the end.
How can I do this? Thanks.
var request = require('request');
var cheerio = require('cheerio');

var main_url = 'http://www.someurl.com';

request(main_url, function(err, resp, body) {
  $ = cheerio.load(body);
  links = $('a'); // get all hyperlinks from main URL
  var urls = [];

  // With this part I get the links (URLs) that I want to scrape.
  $(links).each(function(i, link) {
    lnk = 'http://www.someurl.com/files/' + $(link).attr('href');
    urls.push(lnk);
  });

  // In this part I don't know how to make a loop to load each url within urls array and get first <p>
  for (i = 0; i < urls.length; i++) {
    var p = $("p:first"); // first <p> element
    console.log(p.html());
  }
});
If you can successfully get the URLs from the main page, you already know everything needed to do this, so I suppose your issues are with the way request works, and in particular with its callback-based workflow.
My suggestion is to drop request, since it is deprecated. You can use something like got, which is Promise-based, so you can use the newer async/await features that come with it (which usually means an easier workflow). (You need at least Node.js 8 for that, though!)
Your loop would look like this:
for (let i = 0; i < urls.length; i++) {
  const source = await got(urls[i]);
  // Do your cheerio determination
  console.log(new_p.html());
}
Mind you, your function signature needs to be adjusted. In your case you didn't specify a surrounding function at all, so the module's top-level scope is used, which means you can't use await. So write a function for that:
async function pullAllUrls() {
  const mainSource = await got(main_url);
  ...
}
If you don't want to use async/await, you could work with some promise reductions, but that's rather cumbersome in my opinion. In that case, go back to promises and use a workflow library like async to help you manage the URL fetching.
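For completeness, a small sketch of the plain-Promise alternative with Promise.all, fetching all URLs concurrently (inside an async function, assuming the same got/cheerio setup and the urls array from the question):
// Fire all requests at once and wait for every response.
const responses = await Promise.all(urls.map((url) => got(url)));
for (const response of responses) {
  const $ = cheerio.load(response.body);
  console.log($('p').first().html()); // first <p> of each page
}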
A real example with async/await:
In a real-life example, I'd create a function to fetch the source of a given page, like so (don't forget to add got to your script/package.json):
async function getSourceFromUrl(thatUrl) {
  const response = await got(thatUrl);
  return response.body;
}
Then you need workflow logic to get all those links from the other page. I implemented it like this:
async function grabLinksFromUrl(thatUrl) {
  const mainSource = await getSourceFromUrl(thatUrl);
  const $ = cheerio.load(mainSource);
  const hrefs = [];
  $('ul.menu__main-list').each((i, content) => {
    $('li a', content).each((idx, inner) => {
      const wantedUrl = $(inner).attr('href');
      hrefs.push(wantedUrl);
    });
  });
  return hrefs;
}
I decided that I'd like to get the links in the <nav> element, which are usually wrapped inside a <ul> with <li> elements. So we just take those.
Then you need a workflow to work with those links. This is where the for loop comes in. I decided that I wanted the title of each page.
async function mainFlow() {
  const urls = await grabLinksFromUrl('https://netzpolitik.org/');
  for (const url of urls) {
    const source = await getSourceFromUrl(url);
    const $ = cheerio.load(source);
    // Netzpolitik has two <title> in their <head>
    const title = $('head > title').first().text();
    console.log(`${title} (${url}) has source of ${source.length} size`);
    // TODO: More work in here
  }
}
And finally, you need to call that workflow function:
return mainFlow();
The result you see on your screen should look like this:
Dossiers & Recherchen (https://netzpolitik.org/dossiers-recherchen/) has source of 413853 size
Der Netzpolitik-Podcast (https://netzpolitik.org/podcast/) has source of 333354 size
14 Tage (https://netzpolitik.org/14-tage/) has source of 402312 size
Official Netzpolitik Shop (https://netzpolitik.merchcowboy.com/) has source of 47825 size
Über uns (https://netzpolitik.org/ueber-uns/#transparenz) has source of 308068 size
Über uns (https://netzpolitik.org/ueber-uns) has source of 308068 size
netzpolitik.org-Newsletter (https://netzpolitik.org/newsletter) has source of 291133 size
netzwerk (https://netzpolitik.org/netzwerk/?via=nav) has source of 299694 size
Spenden für netzpolitik.org (https://netzpolitik.org/spenden/?via=nav) has source of 296190 size

Finding a div's content based on an attribute value

This is a Node-RED node for integrating with an older ventilation system via screen scraping, using Node.js with cheerio. It works fine for fetching some values now, but I seem unable to fetch the right element in the more complex structure that tells which operating mode is active. A screenshot of the structure is attached. And yes, I have never used jQuery and am quite a newbie with cheerio.
I have managed, in a way too complex manner, to get the value, if it is within a certain part of the tree.
const msgResult = scraped('.control-1');
const activeMode = msgResult.get(0).children.find(x => x.attribs['data-selected'] === '1').attribs['id'];
But this only works on the first match, and fails if data-selected === '1' isn't in that part of the tree. I thought I should be able to use just .find from the top of the tree, but I get no matches.
const activeMode = scraped('.control-1').find(x => x.attribs['data-selected'] === '1')
What I would like to get from the attached HTML structure is the ID of the div that has data-selected=1, which can be below either of the two divs of class control-1. Maybe also the content of the underlying span, where the mode is described in text.
[Screenshot of the HTML structure omitted.]
It's hard to tell what you're looking for, but maybe:
$('.control-1 [data-selected="1"]').attr('id')
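That selector also covers the second part of the question (the text of the underlying span). A short sketch, assuming the description really sits in a <span> inside the selected div:
const active = $('.control-1 [data-selected="1"]').first();
console.log(active.attr('id'));          // id of the selected mode div
console.log(active.find('span').text()); // the mode description text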
You could also make a loop to check every tree. Try this code, hope this works.
const cheerio = require('cheerio');
const fsextra = require('fs-extra');

(async () => {
  try {
    // Read the HTML file, then parse it with cheerio.
    const contentHTML = await fsextra.readFile('untitled.html', 'utf-8');
    const $ = cheerio.load(contentHTML);
    const selector = $('.control-1 [data-selected="1"]');
    for (let num = 0; num < selector.length; num++) {
      console.log(selector[num].attribs.id);
    }
  } catch (error) {
    console.log('ERROR: ', error);
  }
})();

Click on every 'a' tag in a page with Puppeteer

I am trying to get Puppeteer to go to all <a> tags on a page, load them, add them to an array, and return it. My Puppeteer version is 1.5.0. Here is my code:
module.exports.scrapeLinks = async (page, linkXpath) => {
  page.waitForNavigation();
  linksElement = await page.$x(linkXpath);
  var url_list_arr = [];
  console.log(linksElement.length);
  i = 1;
  for (linksElementItem in linksElement) {
    const linksData = await page.$x('(' + linkXpath + ')[' + (i + 1) + ']');
    if (linksData.length > 0) {
      linksData[0].click();
      console.log(page.url());
      url_list_arr.push(page.url());
    } else {
      throw new Error('Link not found');
    }
  }
  return url_list_arr;
};
However, with this code I get:
UnhandledPromiseRejectionWarning: Error: Node is either not visible or not an HTMLElement
I also found out through the docs that it is not possible to use an XPath with the page.click function. Is there any way to achieve this?
It would also be okay if there were a function to get all the links from a page, but I couldn't find one in the docs.
To get a handle on all <a> tags in an array:
const aTags = await page.$$('a');
Loop through them with:
for (const aTag of aTags) {...}
Inside the loop you can interact with each of these ElementHandles separately.
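For the original goal of collecting the link targets into an array, it is usually safer to read each handle's href than to click it, since reading does not navigate. A sketch using Puppeteer's getProperty/jsonValue:
const aTags = await page.$$('a');
const urlListArr = [];
for (const aTag of aTags) {
  // Reading a property does not navigate, so the handles stay valid.
  const href = await (await aTag.getProperty('href')).jsonValue();
  urlListArr.push(href);
}
console.log(urlListArr);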
Note that
await aTag.click()
will destroy (garbage collect) all ElementHandles when the page context is navigated. In that case you need a workaround, like loading the initial page inside a loop so you always start with fresh handles.
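A rough sketch of that workaround, assuming browser is already launched and startUrl is a hypothetical entry page (a click that does not actually navigate would make waitForNavigation time out):
const page = await browser.newPage();
await page.goto(startUrl);
const total = (await page.$$('a')).length;
for (let i = 0; i < total; i++) {
  // Reload the entry page so the handles are fresh after the last navigation.
  await page.goto(startUrl);
  const aTags = await page.$$('a');
  await Promise.all([
    page.waitForNavigation(),
    aTags[i].click(),
  ]);
  console.log(page.url());
}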

Nightmare, PhantomJS and extracting page data

I'm new to Nightmare/PhantomJS and am struggling to get a simple inventory of all the tags on a given page. I'm running on Ubuntu 14.04, having built PhantomJS from source and installed Node.js, Nightmare, and so forth manually, and other functions seem to work as I expect.
Here's the code I'm using:
var Nightmare = require('nightmare');

new Nightmare()
  .goto("http://www.google.com")
  .wait()
  .evaluate(function () {
    var a = document.getElementsByTagName("*");
    return (a);
  },
  function (i) {
    for (var index = 0; index < i.length; index++)
      if (i[index])
        console.log("Element " + index + ": " + i[index].nodeName);
  })
  .run(function (err, nightmare) {
    if (err)
      console.log(err);
  });
When I run this inside a "real" browser, I get a list of all the tag types on the page (HTML, HEAD, BODY, ...). When I run this using node GetTags.js, I just get a single line of output:
Element 0: HTML
I'm sure it's a newbie problem, but what am I doing wrong here?
PhantomJS has two contexts. The page context, which provides access to the DOM, can only be reached through evaluate(), so variables must be explicitly passed into and out of it. But there is a limitation (docs):
Note: The arguments and the return value to the evaluate function must be a simple primitive object. The rule of thumb: if it can be serialized via JSON, then it is fine.
Closures, functions, DOM nodes, etc. will not work!
Nightmare's evaluate() function is only a wrapper around the PhantomJS function of the same name. This means that you need to work with the elements in the page context and pass only a serializable representation to the outside. For example:
.evaluate(function () {
  var a = document.getElementsByTagName("div");
  return a.length;
},
function (i) {
  console.log(i + " divs available");
})
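Applied to the original goal of listing every tag name, the same pattern works once only serializable data (here, an array of strings) crosses the boundary. This sketch slots into the same .evaluate() position in the question's chain:
.evaluate(function () {
  var a = document.getElementsByTagName("*");
  var names = [];
  // Collect plain strings; DOM nodes cannot be serialized out.
  for (var i = 0; i < a.length; i++) {
    names.push(a[i].nodeName);
  }
  return names;
},
function (names) {
  for (var index = 0; index < names.length; index++) {
    console.log("Element " + index + ": " + names[index]);
  }
})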
