Return window object using puppeteer - node.js

I'm trying to return the whole window object from a page and then traverse it outside of puppeteer.
I need the data held in the Highcharts property, which lives on the window object. In the browser, the code would be something like window.Highcharts.charts[0].series[0].data.
I thought the easiest way would be to use puppeteer to open the site and send back the window object, which I could then use outside of puppeteer like any other JS object.
However, I'm finding it difficult to return the object as it appears when you type 'window' into the Chrome console. I'm not sure what I'm missing.
I've read through the documentation, and the following two methods seem like they should work:
const puppeteer = require('puppeteer');
(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://www.example.com', {waitUntil: 'networkidle2'});
  // METHOD 1
  // Create a Map object
  await page.evaluate(() => window.map = new Map());
  // Get a handle to the Map object prototype
  const mapPrototype = await page.evaluateHandle(() => Map.prototype);
  // Query all map instances into an array
  const mapInstances = await page.queryObjects(mapPrototype);
  console.log(mapInstances);
  await mapInstances.dispose();
  await mapPrototype.dispose();
  // METHOD 2
  const handle = await page.evaluateHandle(() => ({window, document}));
  const properties = await handle.getProperties();
  const windowHandle = properties.get('window');
  const documentHandle = properties.get('document');
  var result = await page.evaluate(win => win, windowHandle);
  console.log(result)
  await handle.dispose();
  await browser.close();
})();
However, it only returns the following in the console, and not the simple object I would like;
Not sure if I'm going about this the right way, so any help/advice is much appreciated.
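One possible direction, offered as a sketch rather than a confirmed fix: page.evaluate() can only hand back serializable values, and window is not one of them, so it may be easier to pick out the Highcharts data inside the page and return plain objects. The function below is hypothetical; it assumes each chart exposes series with name and data, and that each point has x and y, as in the expression from the question.

```javascript
// Runs inside the page: reads the global Highcharts object and copies
// only plain values, because `window` itself cannot be serialized.
function extractChartData() {
  const charts = (window.Highcharts && window.Highcharts.charts) || [];
  return charts
    .filter(Boolean) // skip empty slots in the charts array
    .map((chart) =>
      chart.series.map((series) => ({
        name: series.name,
        points: series.data.map((point) => ({ x: point.x, y: point.y })),
      }))
    );
}
```

With a page already open, this could be called as const data = await page.evaluate(extractChartData); the function has to stay self-contained, since only its source text is sent to the browser.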

Related

Puppeteer - cannot read properties of null from an element

I am trying to use puppeteer to get data from a website, mostly for learning purposes, but I am getting the following error:
Error: Evaluation failed: TypeError: Cannot read properties of null
(reading 'innerHTML')
I tested by removing the .innerHTML from the result, and it logs the whole element as an object successfully, so I know I'm hitting the right element. It's when I add .innerHTML (I tried .innerText too) that it errors.
I suspect it's down to some delay in the page loading, as very occasionally it does work, but I am not sure how to go about fixing that.
async function getData(searchJob, searchLocation) {
  const browser = await puppeteer.launch({headless: false});
  const page = await browser.newPage();
  await page.goto("https://somewebsite");
  await page.waitForSelector('#onetrust-accept-btn-handler');
  // Click the accept cookies button
  await page.evaluate(() => {
    document.querySelector('#onetrust-accept-btn-handler').click();
  });
  await page.type("#keywords", searchJob);
  await page.type("#location", searchLocation);
  await Promise.all([page.click(".btn-search"), page.waitForNavigation()]);
  const grabJobs = await page.evaluate(() => {
    const jobs = document.querySelectorAll(".job-result-card"); // get the overall container for all of the jobs
    let jobsArray = []; // create an array to put the job details into
    jobs.forEach((jobTag) => { // loop through the retrieved jobs
      const company = jobTag.querySelector(".gtmJobListingPostedBy");
      jobsArray.push([company.innerHTML]);
    });
    console.log(jobsArray);
    return jobsArray;
  });
  await browser.close();
}
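The error message means querySelector returned null for at least one card. One way to find out which cards are affected, without the whole evaluation failing, is to guard the lookup. This is only a sketch: it reuses the selectors from the question and assumes (this is not confirmed anywhere above) that some cards simply lack the element or have not rendered it yet.

```javascript
// Runs inside the page: returns one entry per job card, using null where
// the company element is missing instead of throwing on `.innerHTML`.
function collectCompanies() {
  const cards = document.querySelectorAll('.job-result-card');
  return Array.from(cards, (card) => {
    const company = card.querySelector('.gtmJobListingPostedBy');
    return company ? company.innerHTML.trim() : null;
  });
}
```

It could be used as const companies = await page.evaluate(collectCompanies); any null entries then show which cards need a different selector or a longer wait.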

playwright - get content from multiple pages in parallel

I am trying to get the page content from multiple URLs using playwright in a nodejs application. My code looks like this:
const getContent = async (url: string): Promise<string> => {
  const browser = await firefox.launch({ headless: true });
  const page = await browser.newPage();
  try {
    await page.goto(url, {
      waitUntil: 'domcontentloaded',
    });
    return await page.content();
  } finally {
    await page.close();
    await browser.close();
  }
};
const items = [
  {
    urls: ["https://www.google.com", "https://www.example.com"]
    // other props
  },
  {
    urls: ["https://www.google.com", "https://www.example.com"]
    // other props
  },
  // more items...
];
await Promise.all(
  items.map(async (item) => {
    const contents = [];
    for (const url of item.urls) {
      contents.push(await getContent(url));
    }
    return contents;
  })
);
I am getting errors like error (Page.content): Target closed. However, I noticed that if I run it without the loop:
const content = await getContent('https://www.example.com');
it works.
It looks like the loop iterations share the same browser and/or page instance, so they close or navigate away from each other's pages.
To test this, I built a web API around the getContent function. When I send two requests at (almost) the same time, one of them fails; if I send one request at a time, it always works.
Is there a way to make Playwright work in parallel?
I don't know if that solves it, but I noticed there are two missing awaits. Both firefox.launch(...) and browser.newPage() are asynchronous and need an await in front.
Also, you don't need to launch a new browser so many times. Playwright has isolated browser contexts, which are created much faster than launching a browser. It's worth experimenting with launching the browser once, before the getContent function, and using
const context = await browser.newContext();
const page = await context.newPage();
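The suggestion above could be sketched roughly as follows, in plain JavaScript. It assumes browser is a Playwright browser that was launched once elsewhere (for example with await firefox.launch()); getAllContents is a hypothetical helper name, and whether this removes the Target closed error in the original setup is untested.

```javascript
// `browser` is assumed to be an already launched Playwright browser.
// Each call gets its own context, so parallel calls do not share pages.
async function getContent(browser, url) {
  const context = await browser.newContext(); // isolated cookies and storage
  try {
    const page = await context.newPage();
    await page.goto(url, { waitUntil: 'domcontentloaded' });
    return await page.content();
  } finally {
    await context.close(); // closes the pages that belong to this context
  }
}

// Hypothetical helper: fetches several URLs in parallel on one browser.
async function getAllContents(browser, urls) {
  return Promise.all(urls.map((url) => getContent(browser, url)));
}
```

The browser itself is then closed once, after all the work is done, instead of inside getContent.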

Use post variable with querySelector

I'm facing an issue trying to scrape data from the web with puppeteer and querySelector.
I have a Node.js web server that handles a POST request and then calls a function to scrape the data. I'm sending 2 parameters (postBlogUrl & postDomValue).
postDomValue contains, as a string, the selector I'm trying to fetch data from, for example:
[itemprop='articleBody'].
If I hard-code the selector ([itemprop='articleBody']), everything works and I'm able to retrieve the data, but if I use the postDomValue variable, nothing is returned.
I already tried escaping the variable using CSS.escape(postDomValue), but no luck.
fetchBlogContent: async function(postBlogUrl, postDomValue) {
  try {
    const puppeteer = require('puppeteer');
    const browser = await puppeteer.launch();
    const page = await browser.newPage();
    await page.goto(postBlogUrl, {
      waitUntil: 'load'
    });
    let description = await page.evaluate(() => {
      // This works:
      // return document.querySelector("[itemprop='articleBody']").innerHTML;
      // This doesn't:
      // return document.querySelector(postDomValue).innerHTML;
    });
    return description;
  } catch (err) {
    // handle err
    return err;
  }
}
const description = await page.evaluate((value) =>
  document.querySelector(value).innerHTML, postDomValue);
See docs on how to pass args to page.evaluate() in puppeteer
If I understand correctly, the issue may be that you are using a variable declared in the Node.js context inside the function passed to page.evaluate(), which is executed in the browser context. In such cases, you need to pass the value of the variable as an additional argument:
let description = await page.evaluate((selector) => {
return document.querySelector(selector).innerHTML;
}, postDomValue);
See more in page.evaluate().
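To make the two contexts more concrete, here is a small simulation that runs in plain Node.js without puppeteer. It is not Puppeteer's actual implementation; runLikeEvaluate and demo are hypothetical names, and the JSON round trip only approximates how arguments are copied. What it shows is that only the function's source text travels, so outer variables are lost while explicit arguments arrive.

```javascript
// Simplified stand-in for page.evaluate(): rebuilds the function from its
// source text, which drops any variables captured from the enclosing scope.
function runLikeEvaluate(fn, ...args) {
  const source = fn.toString();
  // new Function creates the function in the global scope, without closures.
  const rebuilt = new Function(`return (${source});`)();
  // Arguments are copied by value (approximated here with JSON).
  const copiedArgs = args.map((arg) => JSON.parse(JSON.stringify(arg)));
  return rebuilt(...copiedArgs);
}

function demo() {
  const postDomValue = "[itemprop='articleBody']";
  let closureResult;
  try {
    // The rebuilt function cannot see the local postDomValue.
    closureResult = runLikeEvaluate(() => postDomValue);
  } catch (err) {
    closureResult = err.name;
  }
  // Passing the value as an argument makes it available inside.
  const argumentResult = runLikeEvaluate((selector) => selector, postDomValue);
  return { closureResult, argumentResult };
}
```

Calling demo() gives a ReferenceError name for the closure version and the selector string for the argument version, which mirrors the behaviour described in the answer.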

Got undefined with Puppeteer while scraping YouTube playlist

I am using Puppeteer to scrape data from a YouTube playlist but cannot get any data.
The selector works when I run querySelector manually in the browser, but I want to automate the process and generate a JSON file as its output.
Code:
const puppeteer = require('puppeteer');
(async () => {
  console.log("begin");
  const browser = await puppeteer.launch({headless : false });
  const page = await browser.newPage();
  console.log("after newPage");
  await page.goto('https://www.youtube.com/playlist?list=PL2-FkZlJhxqVXZO1c6gKgsAdiet0zcOAO');
  console.log("after goto ");
  const selectorA = "a.yt-simple-endpoint.ytd-playlist-video-renderer"
  await page.waitForSelector(selectorA);
  console.log("after waitForSelector ");
  const items = await page.$$eval(selectorA, rows => {
    console.log("eval " + rows);
    return rows;
  });
  console.log("items " + items);
  await browser.close();
})();
Results:
begin
after newPage
after goto
after waitForSelector
items undefined
Screenshot of the same selector in the browser
According to the docs, the various eval functions can transfer only serializable data (roughly, the data JSON can handle, with some additions). Your code returns an array of DOM elements, which are not serializable (they have methods and circular references). Try to retrieve the data in the browser context and return only serializable data. For example:
return rows.map(row => [row.innerText, row.href]);
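As a slightly fuller sketch of the same idea, the mapping can live in a named function, with a rough check for what will survive the trip back to Node.js. Both function names are hypothetical, and JSON.stringify is only an approximation of Puppeteer's serialization rules.

```javascript
// Runs inside the page: turns DOM elements into plain objects.
function toPlainRows(rows) {
  return Array.from(rows, (row) => ({
    title: row.innerText.trim(),
    href: row.href,
  }));
}

// Rough Node-side check: values with circular references (such as DOM
// elements) fail here, plain objects and strings pass.
function isJsonSerializable(value) {
  try {
    JSON.stringify(value);
    return true;
  } catch {
    return false;
  }
}
```

In the script from the question this could become const items = await page.$$eval(selectorA, toPlainRows); since the result is plain data, JSON.stringify(items, null, 2) can then be written to a file with fs.writeFileSync to get the JSON output.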

Simple web scraping with puppeteer / cheerio not working with params

I am trying to scrape https://www.premierleague.com/clubs/38/Wolverhampton-Wanderers/stats?se=274
The results returned are those for the page without the ?se=274 parameter.
That parameter is applied by using the filter dropdown on the page and selecting the 2019/20 season. If I navigate directly to the URL in a browser it works fine, but through code it does not.
I have tried cheerio and puppeteer. I was going to try nightmare too, but that seems like overkill. I am clearly not an expert! ;)
function getStats(callback){
  var url = "https://www.premierleague.com/clubs/38/Wolverhampton-Wanderers/stats?se=274";
  request(url, function (error, response, html) {
    //console.log(html);
    var $ = cheerio.load(html);
    if(!error){
      $('.allStatContainer.statontarget_scoring_att').filter(function(){
        var data = $(this);
        var vSOT = data.text();
        //console.log(data);
        console.log(vSOT);
      });
    }
  });
  callback;
}
This will return 564 instead of 2
It seems like you're calling callback before request returns. Move the callback call into the internal block, where the task you need is completed (in your case, it looks like the filter block).
It also looks like you're missing the () on the callback call.
Also, a recommendation: return the value you need through the callback.
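A minimal sketch of those three points, written so it runs without request or cheerio: fetchPage and parseStat are hypothetical stand-ins for the request call and the cheerio lookup from the question, injected as parameters. Note that this only addresses when the callback fires, not which value the page returns.

```javascript
// fetchPage(url, cb) stands in for `request`; cb receives (error, response, html).
// parseStat(html) stands in for the cheerio lookup and returns the stat text.
function getStats(fetchPage, parseStat, callback) {
  const url = 'https://www.premierleague.com/clubs/38/Wolverhampton-Wanderers/stats?se=274';
  fetchPage(url, (error, response, html) => {
    if (error) {
      callback(error); // report the failure to the caller
      return;
    }
    // Invoked with parentheses, inside the response handler, and with
    // the value passed along as the second argument.
    callback(null, parseStat(html));
  });
}
```

A caller would then receive the value in its own callback, for example getStats(request, parse, (err, shots) => console.log(err || shots)), where parse is whatever function wraps the cheerio code.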
So this code works. $10 to a rent-a-coder did the trick. Easy when you know how!
const puppeteer = require('puppeteer');
(async () => {
  const browser = await puppeteer.launch()
  const page = await browser.newPage()
  await page.goto('https://www.premierleague.com/clubs/4/Chelsea/stats?se=274')
  // Give the page time to apply the season filter before reading the stat
  const sleep = ms => new Promise(resolve => setTimeout(resolve, ms))
  await sleep(4000)
  const element = await page.$(".allStatContainer.statontarget_scoring_att");
  const text = await page.evaluate(element => element.textContent, element);
  console.log("Shots on Target:"+text)
  await browser.close()
})()

Resources