Puppeteer page.evaluate randomly stopped working when parsing a website

I created a web parser for option alerts about 3 weeks ago and all was going well. When I checked on it today, it was returning empty values for some reason. I thought the website might have been reformatted, but nothing is different. I have been trying fixes for the last few hours, so I'm hoping to get some help. Below is the code I use for parsing the website:
const browser = await puppeteer.launch({
args: ['--no-sandbox',
'--disable-setuid-sandbox',
'--disable-dev-shm-usage',
'--disable-accelerated-2d-canvas',
'--no-first-run',
'--no-zygote',
'--single-process',
'--disable-gpu'],
dumpio: true,
headless: true
});
const page = await browser.newPage();
await page.goto(process.env.ALERTS_PARSER_WEBSITE);
// page.on("console", msg => console.log("PAGE LOG:", msg));
const data = await page.evaluate(() =>
Array.from(document.querySelectorAll("table > tbody > tr"), (row) =>
Array.from(row.querySelectorAll("th, td"), (cell) => cell.innerText)
)
);
I then map the data into my own array and pass it back to my front end. The website I am trying to parse is Barchart Unusual Options Activity. You can inspect the site and see that the query selector should work. I'm really on my last leg with this one, so any help would be greatly appreciated.

I'm not sure what the cause is, but I can get the data only with puppeteer.launch({ headless: false }); and:
page.setDefaultTimeout(300_000);
// ...
await page.waitForSelector("table > tbody > tr");
(The last line may be needed only on slow machines like mine.)
Maybe the site has started using some protection against headless mode.
P.S. When I try to get a page screenshot in headless mode, I instantly get this:
P.P.S. It seems the solution is simple for now. As response.request().redirectChain() is empty, the site apparently checks only the user agent header of the first request. So this seems to fix the issue in headless mode (the difference can be seen by comparing the await browser.userAgent() values in both modes):
await page.setUserAgent((await browser.userAgent()).replace('HeadlessChrome', 'Chrome'));
await page.goto('https://www.barchart.com/options/unusual-activity/stocks?orderBy=tradeTime&orderDir=desc');
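A minimal sketch of the fix above wrapped in reusable helpers. The names toRegularUserAgent and openWithRegularUserAgent are hypothetical, `browser` is assumed to be the object returned by puppeteer.launch(), and the row selector is the one from the question. Whether the site keeps accepting the patched user agent is not guaranteed.

```javascript
// Pure helper: replace the headless marker in a user agent string.
function toRegularUserAgent(userAgent) {
  return userAgent.replace('HeadlessChrome', 'Chrome');
}

// Opens `url` in a new tab with the patched user agent, then waits for the
// table rows. `browser` is assumed to be a Puppeteer Browser instance.
async function openWithRegularUserAgent(browser, url) {
  const page = await browser.newPage();
  const headlessUserAgent = await browser.userAgent();
  await page.setUserAgent(toRegularUserAgent(headlessUserAgent));
  await page.goto(url);
  await page.waitForSelector('table > tbody > tr');
  return page;
}
```

The returned page can then be passed to the page.evaluate call from the question unchanged.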

Related

How to get lastModified property of another website

When I use the inspect/developer tools in Chrome, I can find the last modified date in the browser, but I want to see the same date in my Node.js application.
I have already tried
const browser = await puppeteer.launch();
const page = await browser.newPage();
await page.goto('https://www.tbsnews.net/economy/bsec-chairman-stresses-restoring-investor-confidence-mutual-funds-500126');
const names = page.evaluate( ()=> {
console.log(document.lastModified);
})
Unfortunately, this code shows the current time, i.e. when the new DOM was created, since we are using newPage(). Can anyone help me?
I have also tried JSDOM.
Thanks in advance.

How to kill old Puppeteer browser if still running?

I am trying to scrape data from different websites using only one Puppeteer instance. I don't want to launch a new browser for each website, so I need to check whether a browser has already been launched and, if so, just open a new tab. I did something like the below; these are some conditions I always check before launching a browser:
const browser = await puppeteer.launch();
browser?.isConnected()
browser.process() // null if browser is still running
Still, I found that my script sometimes relaunches the browser even though an old one is already running. So I am thinking of killing any old browser that has been launched. Or what would be the best check? Any other suggestions will be highly appreciated.

I'm not sure whether that specific operation (closing existing browsers) can be done through Puppeteer's APIs, but what I can recommend is how people usually handle this situation: make sure the browser instance is closed if any issue is encountered:
let browser = null;
try {
browser = await puppeteer.launch({
headless: true,
args: ['--no-sandbox'],
});
const page = await browser.newPage();
const url = req.query.url;
await page.goto(url);
const bodyHTML = await page.evaluate(() => document.body.innerHTML);
res.send(bodyHTML);
} catch (e) {
console.log(e);
} finally {
if (browser)
await browser.close();
}
Otherwise, you can use shell based commands like kill or pkill if you have access to the process ID of the previous browser.
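For the other half of the question (reusing one browser instead of relaunching), a minimal sketch under stated assumptions: createBrowserGetter is a hypothetical helper, and `launch` is any function that returns a Promise of a Puppeteer Browser, for example () => puppeteer.launch({ headless: true }). It relies only on browser.isConnected(), the same check used in the question.

```javascript
// Hypothetical helper: returns a getBrowser() function that launches at most
// one browser at a time and relaunches only after a disconnect.
function createBrowserGetter(launch) {
  let browserPromise = null;

  return async function getBrowser() {
    if (browserPromise) {
      const existing = await browserPromise;
      if (existing.isConnected()) return existing;
    }
    // Store the promise (not the browser) so concurrent callers share one launch.
    browserPromise = launch().catch((error) => {
      browserPromise = null; // allow a retry after a failed launch
      throw error;
    });
    return browserPromise;
  };
}
```

Each scraping job would then call `const browser = await getBrowser();` followed by `browser.newPage()`, closing the page (not the browser) in its own finally block.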

The most reliable means of closing a puppeteer instance that I've found is to close all of the pages within a BrowserContext, which automatically closes the Browser. I've seen instances of chromium linger in Task Manager after calling just await browser.close().
Here is how I do this:
const openAndCloseBrowser = async () => {
  const browser = await puppeteer.launch();
  try {
    // your logic
  } catch (error) {
    // error handling
  } finally {
    // close every open page
    const pages = await browser.pages();
    for (const page of pages) await page.close();
  }
};
If you try running await browser.close() after running the loop and closing each page individually, you should see an error stating that the browser was already closed and your Task Manager should not have lingering chromium instances.

Xpath: Working in Devtools but returns empty object with document.evaluate()

The following command works as expected in Devtools console (for example, here - https://news.ycombinator.com/)
$x('//a[@class="storylink"]') (Edge browser)
But the following code:
const page = await browser.newPage();
await page.goto("https://news.ycombinator.com/");
let urls = await page.evaluate(() => {
var item = document.evaluate(
'//a[@class="storylink"]',
document,
null,
XPathResult.FIRST_ORDERED_NODE_TYPE,
null
).singleNodeValue;
return item
})
browser.close();
returns an empty object: {}. The same is happening in every other website.
Why is that happening?

If you are automating Chrome with Puppeteer from Node.js, then the page object you have already exposes a method $x for XPath evaluation; see https://pptr.dev/#?product=Puppeteer&version=v10.4.0&show=api-pagexexpression. That means that doing
let urls = await page.$x('//a[@class="storylink"]')
should suffice.
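As for why the original code returns {}: the value returned from page.evaluate is serialized before it reaches Node.js, and a DOM node does not survive that, so it arrives as an empty object. A minimal sketch that keeps document.evaluate but returns plain strings instead; collectHrefs is a hypothetical name, and it assumes the matched elements are links with an href.

```javascript
// Runs inside the page: collects the href of every node matching `xpath`.
// Returns an array of strings, which can be serialized back to Node.js.
function collectHrefs(xpath) {
  const snapshot = document.evaluate(
    xpath,
    document,
    null,
    XPathResult.ORDERED_NODE_SNAPSHOT_TYPE,
    null
  );
  const hrefs = [];
  for (let i = 0; i < snapshot.snapshotLength; i += 1) {
    hrefs.push(snapshot.snapshotItem(i).href);
  }
  return hrefs;
}

// Usage from Node.js (page is a Puppeteer Page):
//   const urls = await page.evaluate(collectHrefs, '//a[@class="storylink"]');
```

The function is passed to page.evaluate by reference, so it must not depend on variables from the Node.js side other than its arguments.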

Possible to Get Puppeteer Audio Feed and/or Input Audio Directly to Puppeteer?

I want to feed a WAV or MP3 file into Puppeteer as microphone input. However, in headless mode the application is muted, so I was wondering if there is a way to get input directly into the browser.
I am also wondering whether it's possible to get a feed of audio from the browser while headless, and/or to record the audio and save it in a folder.

I ended up using this solution. First, I enabled some options for Chromium:
const browser = await puppeteer.launch({
args: [
'--use-fake-ui-for-media-stream',
],
ignoreDefaultArgs: ['--mute-audio'],
});
Remember to close Chromium if it is open already.
I created an <audio> element with the src set to the file that I wanted to input. I also overrode navigator.mediaDevices.getUserMedia, so that it returns a Promise with the same stream as the audio file, rather than from the microphone.
const page = await browser.newPage();
await page.goto('http://example.com', {waitUntil: 'load'});
await page.evaluate(() => {
var audio = document.createElement("audio");
audio.setAttribute("src", "http://example.com/example.mp3");
audio.setAttribute("crossorigin", "anonymous");
audio.setAttribute("controls", "");
audio.onplay = function() {
var stream = audio.captureStream();
navigator.mediaDevices.getUserMedia = async function() {
return stream;
};
};
document.querySelector("body").appendChild(audio);
});
Now whenever the website's code calls navigator.mediaDevices.getUserMedia, it will get the audio stream from the file. (You need to make sure the audio is playing first.)

User Flimm's second answer was on the ball. Not sure why it did not work for him. Adding the answer here for others, as I do not have sufficient reputation to comment on his answer.
const browser = await puppeteer.launch({
  headless: true,
  args: [
    '--use-fake-ui-for-media-stream',
    '--use-fake-device-for-media-stream',
    '--use-file-for-fake-audio-capture=C:\\Users\\Username\\Desktop\\newtest.wav',
  ],
});
One possible reason might be that on Windows the backslashes in the path need to be doubled (escaped) inside a JavaScript string, as I've done here.
The --use-fake-device-for-media-stream also needs to be enabled for this to work.
The wav file should be 44.1 kHz, 16-bit. This sample one worked for me.
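A small sketch of the same launch options built by a helper, so the file path is passed in rather than hard-coded. fakeMicrophoneLaunchOptions is a hypothetical name; the three flags are the ones listed in this answer, and the result is meant to be passed to puppeteer.launch().

```javascript
// Builds launch options that make Chromium read microphone input from a WAV file.
// `wavPath` should be an absolute path to the file.
function fakeMicrophoneLaunchOptions(wavPath) {
  return {
    headless: true,
    args: [
      '--use-fake-ui-for-media-stream',
      '--use-fake-device-for-media-stream',
      `--use-file-for-fake-audio-capture=${wavPath}`,
    ],
  };
}

// On Windows, String.raw avoids having to double every backslash:
const windowsExample = fakeMicrophoneLaunchOptions(
  String.raw`C:\Users\Username\Desktop\newtest.wav`
);
```

Usage would be `const browser = await puppeteer.launch(fakeMicrophoneLaunchOptions('/path/example.wav'));`.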

Not sure what you mean by inputting as a microphone, but you can enable audio in headless mode. This should work:
const browser = await puppeteer.launch({
ignoreDefaultArgs: ['--mute-audio']
});

In theory, to load the file example.wav and use it as if it were microphone input, this should work:
const browser = await puppeteer.launch({
args: [
'--use-fake-ui-for-media-stream',
'--use-fake-device-for-media-stream',
'--use-file-for-fake-audio-capture=/path/example.wav',
'--allow-file-access',
],
ignoreDefaultArgs: ['--mute-audio'],
});
I was told that you need to make sure that Chrome has been closed before running the script, and that the .wav file needs to be 44.1 kHz and 16-bit.
However, I could not get it to work :(

Puppeteer: How to handle multiple tabs?

Scenario: Web form for developer app registration with two part workflow.
Page 1: Fill out developer app details and click on button to create Application ID, which opens, in a new tab...
Page 2: The App ID page. I need to copy the App ID from this page, then close the tab and go back to Page 1 and fill in the App ID (saved from Page 2), then submit the form.
I understand basic usage - how to open Page 1 and click the button which opens Page 2 - but how do I get a handle on Page 2 when it opens in a new tab?
Example:
const puppeteer = require('puppeteer');
(async() => {
const browser = await puppeteer.launch({headless: false, executablePath: '/Applications/Google Chrome.app'});
const page = await browser.newPage();
// go to the new bot registration page
await page.goto('https://register.example.com/new', {waitUntil: 'networkidle'});
// fill in the form info
const form = await page.$('new-app-form');
await page.focus('#input-appName');
await page.type('App name here');
await page.focus('#input-appDescription');
await page.type('short description of app here');
await page.click('.get-appId'); //opens new tab with Page 2
// handle Page 2
// get appID from Page 2
// close Page 2
// go back to Page 1
await page.focus('#input-appId');
await page.type(appIdSavedFromPage2);
// submit the form
await form.evaluate(form => form.submit());
browser.close();
})();
Update 2017-10-25
The work for Browser.pages has been completed and merged. It fixes "Emit new Page objects when new tabs created" (#386) and "Request: browser.currentPage() or similar way to access Pages" (#443).
Still looking for a good usage example.

A new patch was committed two days ago, and you can now use browser.pages() to access all Pages in the current browser.
Works fine, I tried it myself yesterday :)
Edit:
An example of how to get a JSON value from a new page opened by a 'target: _blank' link:
const page = await browser.newPage();
await page.goto(url, {waitUntil: 'load'});
// click on a 'target:_blank' link
await page.click(someATag);
// get all the currently open pages as an array
let pages = await browser.pages();
// get the last element of the array (third in my case) and do some
// hocus-pocus to get it as JSON...
const aHandle = await pages[3].evaluateHandle(() => document.body);
const resultHandle = await pages[3].evaluateHandle(body =>
body.innerHTML, aHandle);
// get the JSON value of the page.
let jsonValue = await resultHandle.jsonValue();
// ...do something with JSON

This will work for you in the latest alpha branch:
const newPagePromise = new Promise(x => browser.once('targetcreated', target => x(target.page())));
await page.click('my-link');
// handle Page 2: you can access new page DOM through newPage object
const newPage = await newPagePromise;
await newPage.waitForSelector('#appid');
const appidHandle = await newPage.$('#appid');
const appID = await newPage.evaluate(element => element.innerHTML, appidHandle);
await newPage.close();
[...]
//back to page 1 interactions
Be sure to use the latest Puppeteer version (from the GitHub master branch) by setting the package.json dependency to
"dependencies": {
"puppeteer": "git://github.com/GoogleChrome/puppeteer"
},
Source: JoelEinbinder @ https://github.com/GoogleChrome/puppeteer/issues/386#issuecomment-343059315

According to the Official Documentation:
browser.pages()
returns: <Promise<Array<Page>>> Promise which resolves to an array of all open pages. Non visible pages, such as "background_page", will not be listed here. You can find them using target.page().
An array of all pages inside the Browser. In case of multiple browser contexts, the method will return an array with all the pages in all browser contexts.
Example Usage:
let pages = await browser.pages();
await pages[0].evaluate(() => { /* ... */ });
await pages[1].evaluate(() => { /* ... */ });
await pages[2].evaluate(() => { /* ... */ });

In theory, you could override the window.open function to always open "new tabs" on your current page and navigate via history.
Your workflow would then be:
Override the window.open function:
await page.evaluateOnNewDocument(() => {
window.open = (url) => {
top.location = url
}
})
Go to your first page and perform some actions:
await page.goto(PAGE1_URL)
// ... do stuff on page 1
Navigate to your second page by clicking the button and perform some actions there:
await page.click('#button_that_opens_page_2')
await page.waitForNavigation()
// ... do stuff on page 2, extract any info required on page 1
// e.g. const handle = await page.evaluate(() => { ... })
Return to your first page:
await page.goBack()
// or: await page.goto(PAGE1_URL)
// ... do stuff on page 1, injecting info saved from page 2
This approach obviously has its drawbacks, but I find it simplifies multi-tab navigation drastically, which is especially useful if you're already running parallel jobs on multiple tabs. Unfortunately, the current API doesn't make it an easy task.

You could remove the need to switch pages, if the new tab is caused by a target="_blank" attribute, by setting target="_self".
Example:
const element = await page.$(selector);
await page.evaluateHandle((el) => {
  el.target = '_self';
}, element);
await element.click();

If your click action triggers a page load, any scripts that run afterwards are effectively lost. To get around this, trigger the action (a click in this case) but do not await it. Instead, wait for the page load:
page.click('.get-appId');
await page.waitForNavigation();
This will allow your script to effectively wait for the next pageload event before proceeding with further actions.
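A commonly used variant of this idea, also shown in the Puppeteer documentation for waitForNavigation, starts the wait before the click and awaits both together, which avoids missing a navigation that finishes very quickly. A minimal sketch; clickAndWaitForNavigation is a hypothetical helper name and `page` is assumed to be a Puppeteer Page.

```javascript
// Clicks `selector` and resolves once the resulting navigation has finished.
// The wait is registered first so a fast navigation cannot slip past it.
async function clickAndWaitForNavigation(page, selector) {
  const [response] = await Promise.all([
    page.waitForNavigation(),
    page.click(selector),
  ]);
  return response;
}
```

For the question's form this would be `await clickAndWaitForNavigation(page, '.get-appId');`, assuming the click navigates the current tab rather than opening a new one.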

You can't currently - Follow https://github.com/GoogleChrome/puppeteer/issues/386 to know when the ability is added to puppeteer (hopefully soon)

It looks like there's a simple 'page.popup' event. The documentation describes it as:
Page corresponding to "popup" window
Emitted when the page opens a new tab or window.
For a link that opens in a new tab:
const [popup] = await Promise.all([
  new Promise(resolve => page.once('popup', resolve)),
  page.click('a[target=_blank]'),
]);
Or, for a window opened from script:
const [popup] = await Promise.all([
  new Promise(resolve => page.once('popup', resolve)),
  page.evaluate(() => window.open('https://example.com')),
]);
credit to this github issue for easier 'targetcreated'
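Applied to the two-page workflow from the question, the 'popup' event could be used roughly as follows. This is a sketch, not a tested recipe for any particular site: readFromPopup is a hypothetical helper, and the selectors '.get-appId' and '#appid' are taken from the question and an earlier answer as placeholders.

```javascript
// Clicks `clickSelector` on `page`, waits for the tab it opens, reads the
// trimmed text of `valueSelector` there, and always closes the new tab.
async function readFromPopup(page, clickSelector, valueSelector) {
  const [popup] = await Promise.all([
    new Promise((resolve) => page.once('popup', resolve)),
    page.click(clickSelector),
  ]);
  try {
    await popup.waitForSelector(valueSelector);
    return await popup.$eval(valueSelector, (el) => el.textContent.trim());
  } finally {
    await popup.close();
  }
}
```

Back on Page 1, the value can then be typed into the form, for example `const appId = await readFromPopup(page, '.get-appId', '#appid');` followed by `await page.type('#input-appId', appId);`.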

You can call browser.newPage() multiple times to open multiple tabs.
Example
const page = await browser.newPage();
await page.goto("https://www.google.com/");
const page2 = await browser.newPage();
await page2.goto("https://www.youtube.com/");
