Why can't I access 'window' in an exposeFunction() function with Puppeteer? - node.js

I have a very simple Puppeteer script that uses exposeFunction() to run something inside headless Chrome.
(async function(){
var log = console.log.bind(console),
puppeteer = require('puppeteer');
const browser = await puppeteer.launch();
const page = await browser.newPage();
var functionToInject = function(){
return window.navigator.appName;
}
await page.exposeFunction('functionToInject', functionToInject);
var data = await page.evaluate(async function(){
console.log('woo I run inside a browser')
return await functionToInject();
});
console.log(data);
await browser.close();
})()
This fails with:
ReferenceError: window is not defined
Which refers to the injected function. How can I access window inside the headless Chrome?
I know I can do evaluate() instead, but this doesn't work with a function I pass dynamically:
(async function(){
var log = console.log.bind(console),
puppeteer = require('puppeteer');
const browser = await puppeteer.launch();
const page = await browser.newPage();
var data = await page.evaluate(async function(){
console.log('woo I run inside a browser')
return window.navigator.appName;
});
console.log(data);
await browser.close();
})()

evaluate the function
You can pass the dynamic script using evaluate.
(async function(){
var puppeteer = require('puppeteer');
const browser = await puppeteer.launch();
const page = await browser.newPage();
var functionToInject = function(){
return window.navigator.appName;
}
var data = await page.evaluate(functionToInject); // <-- Just pass the function
console.log(data); // outputs: Netscape
await browser.close();
})()
addScriptTag and readFileSync
You can save the function to a separate file and load it with addScriptTag.
await page.addScriptTag({path: 'my-script.js'});
or evaluate with readFileSync.
await page.evaluate(fs.readFileSync(filePath, 'utf8'));
or, pass a parameterized function as a string to page.evaluate.
await page.evaluate(new Function('foo', 'console.log(foo);'), {foo: 'bar'});
Make a new function dynamically
How about making it into a runnable function :D ?
function runnable(fn) {
return new Function("arguments", `return ${fn.toString()}(arguments)`);
}
The above will create a new function with provided arguments. We can pass any function we want.
Such as the following function with window, along with arguments,
function functionToInject() {
return window.location.href;
};
works flawlessly with promises too,
function functionToInject() {
return new Promise((resolve, reject) => {
setTimeout(() => {
resolve(window.location.href);
}, 5000);
});
}
and with arguments,
async function functionToInject(someargs) {
return someargs; // {bar: 'foo'}
};
Call the desired function with evaluate,
var data = await page.evaluate(runnable(functionToInject), {bar: "foo"});
console.log(data); // shows the location
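The runnable() wrapper above forwards a single argument object. A minimal sketch of a variant that forwards any number of arguments follows. runnableWithArgs is a hypothetical name, not part of Puppeteer or the original answer, and it assumes fn.toString() yields valid source, which holds for ordinary and arrow functions but not for native functions or method shorthand. Because the function is rebuilt from its source text, it cannot see variables from the scope where it was defined.

```javascript
// Hypothetical helper (not a Puppeteer API): rebuilds `fn` from its source
// text so that it no longer depends on the scope it was defined in.
function runnableWithArgs(fn) {
  // Rest parameters forward every argument to the rebuilt function.
  return new Function('...args', `return (${fn.toString()})(...args);`);
}

// Works in plain Node.js, which makes the wrapper easy to check in isolation.
const add = runnableWithArgs((a, b) => a + b);
const greet = runnableWithArgs(function (user) {
  return `hello ${user.name}`;
});

console.log(add(2, 3));               // 5
console.log(greet({ name: 'bar' }));  // hello bar
```

With Puppeteer, the wrapped function would be passed the same way as before, for example page.evaluate(runnableWithArgs(fn), firstArg, secondArg).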

exposeFunction() isn't the right tool for this job.
From the Puppeteer docs
page.exposeFunction(name, puppeteerFunction)
puppeteerFunction Callback function which will be called in Puppeteer's context.
'In puppeteer's context' is a little vague, but check out the docs for evaluate():
page.evaluate(pageFunction, ...args)
pageFunction Function to be evaluated in the page context
exposeFunction() doesn't expose a function to run inside the page; it exposes a function that runs in Node and can be called from the page.
I have to use evaluate():

Your problem could be related to the fact that page.exposeFunction() makes your function return a Promise (requiring the use of async and await). This happens because your function does not run inside the browser but inside your Node.js application, and its arguments and results are sent back and forth between Node.js and the browser code. This is why the function passed to page.exposeFunction() returns a promise instead of the actual result. It also explains why window is not defined: your function runs inside Node.js (not the browser), and Node.js has no window object.
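To make the difference between the two contexts concrete, here is a small sketch. whereAmI and compareContexts are hypothetical names, and page is assumed to be a Puppeteer Page created elsewhere. Under the behaviour described above, the exposed call should report Node.js and the evaluated call should report the browser.

```javascript
// Reports which runtime the function body actually executes in.
function whereAmI() {
  return typeof window === 'undefined' ? 'node' : 'browser';
}

// `page` is assumed to be a Puppeteer Page created elsewhere.
async function compareContexts(page) {
  // Registered with exposeFunction: the body runs in Node.js, and the page
  // only gets a stub that returns a Promise.
  await page.exposeFunction('whereAmIExposed', whereAmI);
  const viaExpose = await page.evaluate(() => window.whereAmIExposed());

  // Passed to evaluate: the function source is sent to the page and runs there.
  const viaEvaluate = await page.evaluate(whereAmI);

  return { viaExpose, viaEvaluate };
}
```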
Related questions:
exposeFunction() does not work after goto()
Exposed function querySelector not working in Puppeteer
How to use evaluateOnNewDocument and exposeFunction?
exposeFunction remains in memory?
Puppeteer: pass variable in .evaluate()
Puppeteer evaluate function
allow to pass a parameterized funciton as a string to page.evaluate
Functions bound with page.exposeFunction() produce unhandled promise rejections
How to pass a function in Puppeteers .evaluate() method?
How can I dynamically inject functions to evaluate using Puppeteer?

Related

Use post variable with querySelector

I'm facing an issue trying to scrape data from the web with Puppeteer and querySelector.
I have a Node.js web server that handles a POST request and then calls a function to scrape the data. I'm sending 2 parameters (postBlogUrl & postDomValue).
postDomValue contains, as a string, the selector I'm trying to fetch data from, for example:
[itemprop='articleBody'].
If I hard-code the selector ([itemprop='articleBody']), everything works and I'm able to retrieve the data, but if I use the postDomValue variable, nothing is returned.
I already tried to escape the variable using CSS.escape(postDomValue), but no luck.
fetchBlogContent: async function(postBlogUrl, postDomValue) {
try {
const puppeteer = require('puppeteer');
const browser = await puppeteer.launch();
page = await browser.newPage();
await page.goto(postBlogUrl, {
waitUntil: 'load'
})
let description = await page.evaluate(() => {
//This works return document.querySelector("[itemprop='articleBody']").innerHTML;
//This won't return document.querySelector(postDomValue).innerHTML;
})
return description
} catch (err) {
// handle err
return err;
}
}
const description = await page.evaluate((value) =>
document.querySelector(value).innerHTML, postDomValue); // pass the raw string; JSON.stringify would wrap the selector in quotes
See docs on how to pass args to page.evaluate() in puppeteer
If I understand correctly, the issue may be that you try to use a variable declared in the Node.js context inside an argument function of page.evaluate() that is executed in the browser context. In such cases, you need to transfer the value of a variable as an additional argument:
let description = await page.evaluate((selector) => {
return document.querySelector(selector).innerHTML;
}, postDomValue);
See more in page.evaluate().
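A slightly more defensive sketch of the same idea, wrapped in a helper. fetchInnerHTML is a hypothetical name, and page is assumed to be a Puppeteer Page that has already navigated to the target URL. The null check is an addition for the case where the selector matches nothing.

```javascript
// `page` is assumed to be a Puppeteer Page; `selector` is a Node.js string.
async function fetchInnerHTML(page, selector) {
  // Everything after the function is serialized and handed to it as arguments,
  // so `sel` inside the page holds the value of `selector` from Node.js.
  return page.evaluate((sel) => {
    const el = document.querySelector(sel);
    return el ? el.innerHTML : null; // null when nothing matches
  }, selector);
}
```

Called as fetchInnerHTML(page, postDomValue), this should return the markup of the first matching element, or null.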

Exposed function querySelector not working in Puppeteer

document.querySelectorAll('.summary').innerText;
This throws an error in the snippet below saying "document.querySelector is not a function" in docTest, the function I exposed to my Puppeteer page.
I want to pass a specific node to each method and get the result inside evaluate.
The same happens with document.getElementById.
const puppeteer = require('puppeteer');
//var querySelectorAll = require('query-selector');
let docTest = (document) => {
var summary = document.querySelectorAll('.summary').innerText;
console.log(summary);
return summary;
}
let scrape = async () => {
const browser = await puppeteer.launch({
headless: false
});
const page = await browser.newPage();
await page.goto('http://localhost.com:80/static.html');
await page.waitFor(5000)
await page.exposeFunction('docTest', docTest);
var result = await page.evaluate(() => {
var resultworking = document.querySelector("tr");
console.log(resultworking);
var summary = docTest(document);
console.log(resultworking);
return summary;
});
console.log(result);
await page.waitFor(7000);
browser.close();
return {
result
}
};
scrape().then((value) => {
console.log(value); // Success!
});
I just had the same question. The problem is that the page.evaluate() function callback has to be an async function and your function docTest() will return a Promise when called inside the page.evaluate(). To fix it, just add the async and await keywords to your code:
await page.exposeFunction('docTest', docTest);
var result = await page.evaluate(async () => {
var summary = await docTest(document);
console.log(summary);
return summary;
});
Just remember that page.exposeFunction() will make your function return a Promise, then, you need to use async and await. This happens because your function will not be running inside your browser, but inside your nodejs application.
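Note that arguments passed to an exposed function are serialized on their way from the page to Node.js, so a live object such as document does not arrive with its DOM methods. A sketch under that assumption: read the text inside page.evaluate() and hand only plain strings to the exposed function. cleanSummaries and collectSummaries are hypothetical names; the .summary selector is taken from the question.

```javascript
// Runs in Node.js when called from the page; receives plain data only.
function cleanSummaries(texts) {
  return texts.map((text) => text.trim()).filter((text) => text.length > 0);
}

// `page` is assumed to be a Puppeteer Page that has already navigated.
async function collectSummaries(page) {
  await page.exposeFunction('cleanSummaries', cleanSummaries);
  return page.evaluate(async () => {
    // DOM access stays in the browser context.
    const nodes = Array.from(document.querySelectorAll('.summary'));
    const texts = nodes.map((node) => node.innerText);
    // Only an array of strings crosses over to Node.js.
    return window.cleanSummaries(texts);
  });
}
```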
exposeFunction() does not work after goto()
Why can't I access 'window' in an exposeFunction() function with Puppeteer?
How to use evaluateOnNewDocument and exposeFunction?
exposeFunction remains in memory?
Puppeteer: pass variable in .evaluate()
Puppeteer evaluate function
allow to pass a parameterized funciton as a string to page.evaluate
Functions bound with page.exposeFunction() produce unhandled promise rejections
How to pass a function in Puppeteers .evaluate() method?
How can I dynamically inject functions to evaluate using Puppeteer?

Simple web scraping with puppeteer / cheerio not working with params

I am trying to scrape https://www.premierleague.com/clubs/38/Wolverhampton-Wanderers/stats?se=274
The results being returned are for the page minus the ?se=274
This is applied by using the filter dropdown on the page and selecting 2019/20 season. I can navigate directly to the page and it works fine, but through code it does not work.
I have tried in cheerio and puppeteer. I was going to try nightmare too but this seems overkill I think. I am clearly not an expert! ;)
function getStats(callback){
var url = "https://www.premierleague.com/clubs/38/Wolverhampton-Wanderers/stats?se=274";
request(url, function (error, response, html) {
//console.log(html);
var $ = cheerio.load(html);
if(!error){
$('.allStatContainer.statontarget_scoring_att').filter(function(){
var data = $(this);
var vSOT = data.text();
//console.log(data);
console.log(vSOT);
});
}
});
callback;
}
This will return 564 instead of 2
It seems like you're calling callback before request returns. Move the callback call into the internal block, where the task you need is completed (in your case, it looks like the filter block).
It also looks like you're missing the () on the callback call.
Also, a recommendation: return the value you need through the callback.
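The three points above can be sketched as follows. fakeRequest is a hypothetical stand-in for the request library so that the snippet runs on its own; the shape of getStats is an assumption about how the question's function could be reorganized.

```javascript
// Stand-in for an asynchronous, callback-style request (hypothetical).
function fakeRequest(url, done) {
  setImmediate(() => done(null, { statusCode: 200 }, `<p>${url}</p>`));
}

function getStats(url, callback) {
  fakeRequest(url, (error, response, html) => {
    if (error) {
      return callback(error);
    }
    // The result only exists here, so this is where the callback is called,
    // with parentheses, and with the value passed along.
    callback(null, html.length);
  });
}

getStats('https://example.com', (error, length) => {
  if (error) throw error;
  console.log('HTML length:', length);
});
```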
So this code works....$10 from a rent-a-coder did the trick. Easy when you know how!
const puppeteer = require('puppeteer');
(async () => {
const browser = await puppeteer.launch()
const page = await browser.newPage()
await page.goto('https://www.premierleague.com/clubs/4/Chelsea/stats?se=274')
const sleep = ms => new Promise(resolve => setTimeout(resolve, ms))
await sleep(4000)
const element = await page.$(".allStatContainer.statontarget_scoring_att");
const text = await page.evaluate(element => element.textContent, element);
console.log("Shots on Target:"+text)
browser.close()
})()

Unable to store result value from a tesseract function to a global variable inside an asynchronous function

I'm using tesseract JS to convert an image into text format. The conversion is successful and I'm able to print it out in the console. But I am unable to get this text outside the scope of the function.
I have tried assigning the text to a global variable and then printing it but nothing happens.
(async () => {
tesseract.process('new.png', (err, text) => {
if(err){return console.log("An error occurred: ", err); }
console.log("Recognized text:",text);
});
})();
Need to be able to get the value of text outside the function and use it again in another asynchronous call.
If you use asynchronous operations, such as Promises, callbacks, or async/await, you cannot rely on synchronous flow anymore.
Think of it like this: asynchronous functions are operations that will be completed in the future. If you want a value out of one, you CANNOT get that value until the asynchronous function has completed.
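A small sketch of why the global-variable approach appears to do nothing. fakeProcess is a hypothetical stand-in for a callback-style API such as tesseract.process, used so that the snippet runs without the library.

```javascript
// Stand-in for a callback-style API such as tesseract.process (hypothetical).
function fakeProcess(file, done) {
  setTimeout(() => done(null, `text from ${file}`), 10);
}

let recognized; // the "global" variable from the question's approach

fakeProcess('new.png', (err, text) => {
  recognized = text; // assigned later, once the work has finished
  console.log('inside callback:', recognized);
});

// Runs first: the callback has not fired yet, so this is still undefined.
console.log('right after the call:', recognized);
```

The assignment itself works; the read simply happens too early.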
That being said, you CAN use Promises (seemingly) like synchronous functions if you use async/await, IF you don't want to use a Promise chain. So you need to promisify the tesseract.process function:
const utils = require('util');
(async () => {
const tessProcess = utils.promisify(tesseract.process);
try {
const text = await tessProcess('new.png');
console.log("Recognized text:", text);
} catch (err) {
console.log("An error occurred: ", err);
}
})();
EDIT: After checking the code snippet:
const utils = require('util');
(async () => {
const browser = await puppeteer.launch({headless: false})
const page = await browser.newPage()
const tessProcess = utils.promisify(tesseract.process);
await page.setViewport(viewPort)
await page.goto('example.com')
await page.screenshot(options)
const text = await tessProcess('new.png');
//YOU CAN USE text HERE/////////////
await page.$eval('input[id=companyID]', (el, value) => el.value = value, text);//here too
await browser.close()
})()

How come async/await doesn't work in my code?

How come this async/await doesn't work?
I've spent all day trying different combinations, watching videos and reading about async/await to find why this doesn't work before posting this here.
I'm trying to make a second Node.js app that will run on a different port; my main app will call it so that it scrapes some data and saves it to the db as a cache.
What it's supposed to do:
Take a keyword and send it to a method called scrapSearch. This method creates a complete URI and sends it to the method that actually fetches the web page and returns it up to the first caller.
What is happening:
The console.log below the initial call is triggered before the results are returned.
Console output
Requesting : https://www.google.ca/?q=mykeyword
TypeError: Cannot read property 'substr' of undefined
at /DarkHawk/srv/NodesProjects/_scraper/node_scrapper.js:34:18
at <anonymous>
app.js:
'use strict';
var koa = require('koa');
var fs = require('fs');
var app = new koa();
var Router = require('koa-router');
var router = new Router();
app
.use(router.routes())
.use(router.allowedMethods());
app.listen(3002, 'localhost');
router.get('/scraptest', async function(ctx, next) {
var sfn = require('./scrap-functions.js');
var scrapFunctions = new sfn();
var html = await scrapFunctions.scrapSearch("mykeyword");
console.log(html.substr(0, 20));
//Normally here I'll be calling my other method to extract content
let json_extracted = scrapFunctions.exGg('mykeywords', html);
//Save to db
});
scrap-functions.js:
'use strict';
var request = require('request');
var cheerio = require('cheerio');
function Scraper() {
this.html = ''; //I tried saving html in here but the main script seems to have issues retrieving that
this.kw = {};
this.tr = {};
}
// Search G0000000gle
Scraper.prototype.scrapSearch = async function(keyword) {
let url = "https://www.google.ca/?q="+keyword;
let html = await this.urlRequest(url);
return html;
};
// Get a URL's content
Scraper.prototype.urlRequest = async function(url) {
console.log("Requesting : "+url);
await request(url, await function(error, response, html) {
if(error) console.error(error);
return response;
});
};
module.exports = Scraper;
I tried a lot of things but I finally gave up. I tried putting await/async before each method, which didn't work either.
Why isn't that working?
Edit: wrong function name based on the fact that I created 2 different projects for testing and I mixed the file while copy/pasting.
You are not returning anything from urlRequest. Because it is an async function, it will still create a promise, but it will resolve with undefined. Therefore your html is undefined as seen in the error.
The problematic part is the request function which is a callback style function, but you're treating it as a promise. Using await on any value that is not a promise, won't do anything (technically it creates a promise that resolves directly with the value, but the resulting value remains the same). Both awaits within the urlRequest are unnecessary.
request(url, function(error, response, html) {
if(error) console.error(error);
// This return is for the callback function, not the outer function
return response;
});
You cannot return a value from within the callback. As it's asynchronous, your function will already have finished by the time the callback is called. With the callback style you would do the work inside the callback.
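A runnable sketch of that failure mode. fakeRequest is a hypothetical stand-in for request that calls back asynchronously and returns nothing useful, and brokenUrlRequest mirrors the structure of the question's urlRequest.

```javascript
// Stand-in for a callback-style request function (hypothetical).
function fakeRequest(url, done) {
  setImmediate(() => done(null, { body: `<html>${url}</html>` }));
  // Nothing is returned here, so there is no promise to await.
}

async function brokenUrlRequest(url) {
  // `await` on a non-promise value does not wait for the callback.
  await fakeRequest(url, (error, response) => {
    return response; // returned to fakeRequest, which ignores it
  });
  // No return statement: the async function resolves with undefined.
}

brokenUrlRequest('https://example.com').then((html) => {
  console.log(typeof html); // undefined
});
```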
But you can turn it into a promise. You have to create a new promise and return it from urlRequest. Inside the promise you do the asynchronous work (request) and either resolve with the value (the response) or reject with the error.
Scraper.prototype.urlRequest = function(url) {
console.log("Requesting : "+url);
return new Promise((resolve, reject) => {
request(url, (err, response) => {
if (err) {
return reject(err);
}
resolve(response);
});
});
};
When an error occurs, you want to return from the callback so that the rest (the successful part) is not executed. I also removed the async keyword, because the function now creates the promise manually.
If you're using Node 8 or later, you can promisify the request function with the built-in util.promisify.
const util = require('util');
const request = require('request');
const requestPromise = util.promisify(request);
Scraper.prototype.urlRequest = function(url) {
console.log("Requesting : " + url);
return requestPromise(url);
};
Both versions will resolve with the response and to get the HTML you need to use response.body.
Scraper.prototype.scrapSearch = async function(keyword) {
let url = "https://www.google.ca/?q=" + keyword;
let response = await this.urlRequest(url);
return response.body;
};
You still need to handle errors from the promise, either with .catch() on the promise, or using try/catch when you await it.
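Both styles of error handling, sketched with a hypothetical fakeUrlRequest in place of the real request so that the snippet runs on its own. Returning null on failure is an arbitrary choice for illustration; rethrowing is equally valid.

```javascript
// Stand-in for a promise-returning request that can fail (hypothetical).
function fakeUrlRequest(url) {
  return new Promise((resolve, reject) => {
    if (!url.startsWith('https://')) {
      return reject(new Error(`unsupported URL: ${url}`));
    }
    resolve({ body: '<html></html>' });
  });
}

// Option 1: try/catch around await.
async function scrapSearch(url) {
  try {
    const response = await fakeUrlRequest(url);
    return response.body;
  } catch (err) {
    console.error('request failed:', err.message);
    return null; // hypothetical fallback value
  }
}

// Option 2: .catch() on the promise chain.
const viaCatch = fakeUrlRequest('ftp://example.com')
  .then((response) => response.body)
  .catch(() => null);
```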
It is absolutely essential to understand promises when using async/await, because it's syntactic sugar on top of promises, to make it look more like synchronous code.
See also:
Understand promises before you start using async/await
Async functions - making promises friendly
Exploring ES6 - Promises for asynchronous programming
