I'm trying to scroll an auto loading page , and while doing it I want to fetch the appearing (and disapearing elements).
My code looks like that , the scrolling works great, but I'm not able to make my puppeteer code work in order to detect the elements and save their values (the code does work outside the scroll functions)
async function autoScroll(page) {
await page.evaluate(async () => {
await new Promise((resolve, reject) => {
let totalHeight = 0;
let distance = 100;
let timer = setInterval(async () => {
let scrollHeight = document.body.scrollHeight;
window.scrollBy(0, distance);
totalHeight += distance;
console.log("scrolling"); // That one never shows up
await getUsers(); // Trying to fetch elements on every scroll
if (totalHeight >= scrollHeight) {
clearInterval(timer);
resolve();
}
}, 100);
});
});
}
async function getUsers() {
let hrefs = await page.$$('div > a');
for (let i = 0; i < hrefs.length; i = i++) { adding each link to database }
-- What I want to achieve is , that every time I scroll to the bottom of the page , the getUsers functions will fetch all the links in spesific div and add them to the DB if they they don't exist yet but calling the function from the SetInterval doesn't seem to work
How can I Include my puppeteer async function while scrolling through the page?
the code does work outside the scroll functions
The getUsers function is defined in the main node.js script, but in autoScroll it is used inside of page.evaluate function, and the code inside page.evaluate runs in the browser context (as if we run it in DevTools console) where there is no getUsers function.
Since getUsers works with a database it can only work on node.js side, not in page.evaluate, you should rewrite the scraping code.
I'd suggest first getting userdata inside of page.evaluate and only after the page doesn't scroll anymore return the data to the main context and then save to the databaswe.
console.log from page.evaluate is not shown becase you need to specifically subscribe to it to see console messages.
Related
I have a working puppeteer script that I'd like to make into an API but I'm having problems with waitForSelector.
Background:
I wrote a puppeteer script that successfully searches for and scrapes the result of a query I specify in the code e.g. let address = xyz;. Now I'd like to make it into an API so that a user can query something. I managed to code everything necessary for the local API (working with express) and everything works as well. By that I mean: I coded all the server side stuff: I can make a request, the scraper function is called, puppeteer starts up, carries out my search (I need to type in an address, choose from a dropdown and press enter).
The status:
The result of my query is a form (basically 3 columns and some rows) in an iFrame and I want to scrape all the rows (I modify them into a specific json later on). The way it works is I use waitForSelector on the form's selector and then I use frame.evaluate.
Problem:
When I run my normal scraper everything works well, but when I run the (slightly modified but essentially same) code within the API framework, waitForSelector suddenly always times out. I have tried all the usual workarounds: waitForNavigation, taking a screenshot and inspecting etc but nothing helped. I've been reading quite a bit and could it be that I'm screwing something up in terms of async/await when I call my scraper from within the context of the API? I'm still quite new to this so please bear with me. This is the code of the working script - I indicated the important part
const puppeteer = require("puppeteer");
const chalk = require("chalk");
const fs = require('fs');
const error = chalk.bold.red;
const success = chalk.keyword("green");
address = 'Gumpendorfer Straße 12, 1060 Wien';
(async () => {
try {
// open the headless browser
var browser = await puppeteer.launch();
// open a new page
var page = await browser.newPage();
// enter url in page
await page.goto(`https://mein.wien.gv.at/Meine-Amtswege/richtwert?subpage=/lagezuschlag/`, {waitUntil: 'networkidle2'});
// continue without newsletter
await page.click('#dss-modal-firstvisit-form > button.btn.btn-block.btn-light');
// let everyhting load
await page.waitFor(1000)
console.log('waiting for iframe with form to be ready.');
//wait until selector is available
await page.waitForSelector('iframe');
console.log('iframe is ready. Loading iframe content');
//choose the relevant iframe
const elementHandle = await page.$(
'iframe[src="/richtwertfrontend/lagezuschlag/"]',
);
//go into frame in order to input info
const frame = await elementHandle.contentFrame();
//enter address
console.log('filling form in iframe');
await frame.type('#input_adresse', address, { delay: 100});
//choose first option from dropdown
console.log('Choosing from dropdown');
await frame.click('#react-autowhatever-1--item-0');
console.log('pressing button');
//press button to search
await frame.click('#next-button');
// scraping data
console.log('scraping')
await frame.waitForSelector('#summary > div > div > br ~ div');//This keeps failing in the API
const res = await frame.evaluate(() => {
const rows = [...document.querySelectorAll('#summary > div > div > br ~ div')];
const cells = rows.map(
row => [...row.querySelectorAll('div')]
.map(cell => cell.innerText)
);
return cells;
});
await browser.close();
console.log(success("Browser Closed"));
const mapFields = (arr1, arr2) => {
const mappedArray = arr2.map((el) => {
const mappedArrayEl = {};
el.forEach((value, i) => {
if (arr1.length < (i+1)) return;
mappedArrayEl[arr1[i]] = value;
});
return mappedArrayEl;
});
return mappedArray;
}
const Arr1 = res[0];
const Arr2 = res.slice(1,3);
let dataObj = {};
dataObj[address] = [];
// dataObj['lagezuschlag'] = mapFields(Arr1, Arr2);
// dataObj['adresse'] = address;
dataObj[address] = mapFields(Arr1, Arr2);
console.log(dataObj);
} catch (err) {
// Catch and display errors
console.log(error(err));
await browser.close();
console.log(error("Browser Closed"));
}
})();
I just can't understand why it would work in the one case and not in the other, even though I barely changed something. For the API I basically changed the name of the async function to const search = async (address) => { such that I can call it with the query in my server side script.
Thanks in advance - I'm not attaching the API code cause I don't want to clutter the question. I can update it if it's necessary
I solved this myself. Turns out the problem wasn't as complicated as I thought and it was annoyingly simple to solve. The problem wasn't with the selector that was timing out but with the previous selectors, specifically the typing and choosing from dropdown selectors. Essentially, things were going too fast. Before the search query was typed in, the dropdown was already pressed and nonsense came out. How I solved it: I included a waitFor(1000) call before the dropdown is selected and everything went perfectly. An interesting realisation was that even though that one selector timed out, it wasn't actually the source of the problem. But like I said, annoyingly simple and I feel dumb for asking this :) but maybe someone will see this and learn from my mistake
Node-red node for integrating with an older ventilation system, using screen scraping, nodejs with cheerio. Works fine for fetching some values now, but I seem unable to fetch the right element in the more complex structured telling which operating mode is active. Screenshot of structure attached. And yes, never used jquery and quite a newbie on cheerio.
I have managed, way to complex, to get the value, if it is within a certain part of the tree.
const msgResult = scraped('.control-1');
const activeMode = msgResult.get(0).children.find(x => x.attribs['data-selected'] === '1').attribs['id'];
But only works on first match, fails if the data-selected === 1 isn't in that part of the tree. Thought I should be able to use just .find from the top of the tree, but no matches.
const activeMode = scraped('.control-1').find(x => x.attribs['data-selected'] === '1')
What I would like to get from the html structure attached, is the ID of the div that has data-selected=1, which again can be below any of the two divs of class control-1. Maybe also the content of the underlying span, where the mode is described in text.
HTML structure
It's hard to tell what you're looking for but maybe:
$('.control-1 [data-selected="1"]').attr('id')
You should try to make some loop to check every tree.
Try this code, hope this works.
const cheerio = require ('cheerio')
const fsextra = require ('fs-extra');
(async () => {
try {
const parseFile = async (error, contentHTML) => {
let $ = await cheerio.load (contentHTML)
const selector = $('.control-1 [data-selected="1"]')
for (let num = 0; num < selector.length; num++){
console.log ( selector[num].attribs.id )
}
}
let activeMode = await fsextra.readFile('untitled.html', 'utf-8', parseFile )
} catch (error) {
console.log ('ERROR: ', error)
}
})()
When I trigger a .click() event in a non-headless mode in puppeteer, nothing happens, not even an error.. "non-headless mode so i could visually monitor what is being clicked"
const scraper = {
test: async () => {
let browser, page;
try {
browser = await puppeteer.launch({
headless: false,
args: ["--no-sandbox", "--disable-setuid-sandbox"]
});
page = await browser.newPage();
} catch (err) {
console.log(err);
}
try {
await page.goto("https://www.betking.com/sports/s/eventOdds/1-840-841-0-0,1-1107-1108-0-0,1-835-3775-0-0,", {
waitUntil: "domcontentloaded"
});
console.log("scraping, wait...");
} catch (err) {
console.log(err);
}
console.log("waiting....");
try {
await page.waitFor('.eventsWrapper');
} catch (err) {
console.log(err, err.response);
}
try {
let oddsListData = await page.evaluate(async () => {
let regionAreaContainer = document.querySelectorAll('.areaContainer.region .regionGroup > .regionAreas > div:first-child > .area:nth-child(5)');
regionAreaContainer = Array.prototype.slice.call(regionAreaContainer);
let t = []; //Used to monitor the element being clicked
regionAreaContainer.forEach(async (region) => {
let dat = await region.querySelector('div');
dat.innerHTML === "GG/NG" ? t.push(dat.innerHTML) : false; //Used to confirm that the right element is being clicked
dat.innerHTML === "GG/NG" ? dat.click() : false;
})
return t;
})
console.log(oddsListData);
} catch (err) {
console.log(err);
}
}
}
I expect it to click the specified button and load in some dynamic data on the page.
In Chrome's console, I get the error
Transition Rejection($id: 1 type: 2, message: The transition has been superseded by a different transition, detail: Transition#3( 'sportsMultipleEvents'{"eventMarketIds":"1-840-841-0-0,1-1107-1108-0-0,1-835-3775-0-0,"} -> 'sportsMultipleEvents'{"eventMarketIds":"1-840-841-0-0,1-1107-1108-0-0,1-835-3775-535-14,"} ))
Problem
Behaving non-human-like by executing code like element.click() (inside the page context) or element.value = '..' (see this answer for a similar problem) seems to be problematic for Angular applications. You want to try to behave more human-like by using puppeteer functions like page.click() as they simulate a "real" mouse click instead of just triggering the element's click event.
In addition the page seems to rebuild parts of the page whenever one of the items is clicked. Therefore, you need to execute the selector again after each click.
Code sample
To behave more human-like and requery the elements after each click you can change the latter part of your code to something like this:
let list = await page.$x("//div[div/text() = 'GG/NG']");
for (let i = 0; i < list.length; i++) {
await list[i].click();
// give the page some time and then query the selectors again
await page.waitFor(500);
list = await page.$x("//div[div/text() = 'GG/NG']");
}
This code uses an XPath expression to query the div elements which contain another div element with the given text. After that, a click is simulated on the element and then the contents of the page are queried another time to respect the change of the DOM elements.
Here might be a less confusing way to click those:
for(var div of document.querySelectorAll('div')){
if(div.innerHTML === 'GG/NG') div.click()
}
I am trying to get puppeteer to go to all a tags in a page and load them, add them to an array and return it. My puppeteer version is 1.5.0. Here is my code:
module.exports.scrapeLinks = async (page, linkXpath) => {
page.waitForNavigation();
linksElement = await page.$x(linkXpath);
var url_list_arr = [];
console.log(linksElement.length);
i=1;
for(linksElementItem in linksElement)
{
const linksData = await page.$x('(' + linkXpath + ')[' + (i + 1) +']');
if (linksData.length > 0) {
linksData[0].click();
console.log(page.url());
url_list_arr.push(page.url());
}
else {
throw new Error('Link not found');
}
}
return url_list_arr;
};
However with this code, I get an
UnhandledPromiseRejectionWarning: Error: Node is either not visible or
not an HTMLElement
I also found out through the docs that is not possible to use the xpath on the page.click function. Is there anyway to achieve this?
It is also okay if there is a function to get all the link from a page, but I couldn't find it in the docs.
To get a handle on all a-tags in an array:
const aTags= await page.$$('a')
Loop through them with:
for (const aTag of aTags) {...}
Inside the loop you can interact with each of these elementHandle separately.
Note that
await aTag.click()
will destroy (garbage collect) all elementHandles when the page context is navigated. In this case you need a workaround like loading the initial page inside a loop to always start with a fresh instance.
So here's the code snippet:
for (let item of items)
{
await page.waitFor(10000)
await page.click("#item_"+item)
await page.click("#i"+item)
let pages = await browser.pages()
let tempPage = pages[pages.length-1]
await tempPage.waitFor("a.orange", {timeout: 60000, visible: true})
await tempPage.click("a.orange")
counter++
}
page and tempPage are two different pages.
What happens is that page waits for 10 seconds, then clicks some stuff, which opens a second page.
What's supposed to happen is that tempPage waits for an element, clicks it, then page should wait 10 seconds before doing it all over again.
However, what actually happens is that page waits for 10 seconds, clicks the stuff, then starts waiting for 10 seconds without waiting for tempPage to finish its tasks.
Is this a bug, or am I misunderstanding something? How should I fix this so that when the for loop loops again, it is only after tempPage has clicked.
Generally, you cannot rely on await tempPage.click("a.orange") to pause execution until tempPage has "finish[ed] its tasks". For super simple code that executes synchronously, it may work. But in general, you cannot rely on it.
If the click triggers an Ajax operation, or starts a CSS animation, or starts a computation that cannot be immediately computed, or opens a new page, etc., then the result you are waiting for is asynchronous, and the .click method will not wait for this asynchronous operation to complete.
What can you do? In some cases you may be able to hook into the code that is running on the page and wait for some event that matters to you. For instance, if you want to wait for an Ajax operation to be done and the code on the page uses jQuery, then you might use ajaxComplete to detect when the operation is complete. If you cannot hook into any event system to detect when the operation is done, then you may need to poll the page to wait for evidence that the operation is done.
Here is an example that shows the issue:
const puppeteer = require('puppeteer');
function getResults(page) {
return page.evaluate(() => ({
clicked: window.clicked,
asynchronousResponse: window.asynchronousResponse,
}));
}
puppeteer.launch().then(async browser => {
const page = await browser.newPage();
await page.goto("https://example.com");
// We add a button to the page that will click later.
await page.evaluate(() => {
const button = document.createElement("button");
button.id = "myButton";
button.textContent = "My Button";
document.body.appendChild(button);
window.clicked = 0;
window.asynchronousResponse = 0;
button.addEventListener("click", () => {
// Synchronous operation
window.clicked++;
// Asynchronous operation.
setTimeout(() => {
window.asynchronousResponse++;
}, 1000);
});
});
console.log("before clicks", await getResults(page));
const button = await page.$("#myButton");
await button.click();
await button.click();
console.log("after clicks", await getResults(page));
await page.waitForFunction(() => window.asynchronousResponse === 2);
console.log("after wait", await getResults(page));
await browser.close();
});
The setTimeout code simulates any kind of asynchronous operation started by the click.
When you run this code, you'll see on the console:
before click { clicked: 0, asynchronousResponse: 0 }
after click { clicked: 2, asynchronousResponse: 0 }
after wait { clicked: 2, asynchronousResponse: 2 }
You see that clicked is immediately incremented twice by the two clicks. However, it takes a while before asynchronousResponse is incremented. The statement await page.waitForFunction(() => window.asynchronousResponse === 2) polls the page until the condition we are waiting for is realized.
You mentioned in a comment that the button is closing the tab. Opening and closing tabs are asynchronous operations. Here's an example:
puppeteer.launch().then(async browser => {
let pages = await browser.pages();
console.log("number of pages", pages.length);
const page = pages[0];
await page.goto("https://example.com");
await page.evaluate(() => {
window.open("https://example.com");
});
do {
pages = await browser.pages();
// For whatever reason, I need to have this here otherwise
// browser.pages() always returns the same value. And the loop
// never terminates.
await page.evaluate(() => {});
console.log("number of pages after evaluating open", pages.length);
} while (pages.length === 1);
let tempPage = pages[pages.length - 1];
// Add a button that will close the page when we click it.
tempPage.evaluate(() => {
const button = document.createElement("button");
button.id = "myButton";
button.textContent = "My Button";
document.body.appendChild(button);
window.clicked = 0;
window.asynchronousResponse = 0;
button.addEventListener("click", () => {
window.close();
});
});
const button = await tempPage.$("#myButton");
await button.click();
do {
pages = await browser.pages();
// For whatever reason, I need to have this here otherwise
// browser.pages() always returns the same value. And the loop
// never terminates.
await page.evaluate(() => {});
console.log("number of pages after click", pages.length);
} while (pages.length > 1);
await browser.close();
});
When I run the above, I get:
number of pages 1
number of pages after evaluating open 1
number of pages after evaluating open 1
number of pages after evaluating open 2
number of pages after click 2
number of pages after click 1
You can see it takes a bit before window.open() and window.close() have detectable effects.
In your comment you also wrote:
I thought await was basically what turned an asynchronous function into a synchronous one
I would not say it turns asynchronous functions into synchronous ones. It makes the current code wait for an asynchronous operation's promise to be resolved or rejected. However, more importantly for the issue at hand here, the problem is that you have two virtual machines executing JavaScript code: there's Node which runs puppeteer and the script that controls the browser, and there's the browser itself which has its own JavaScript virtual machine. Any await that you use on the Node side affects only the Node code: it has no bearing on the code that runs in the browser.
It can get confusing when you see things like await page.evaluate(() => { some code; }). It looks like it is all of one piece, and all executing in the same virtual machine, but it is not. puppeteer takes the parameter passed to .evaluate, serializes it, and sends it over to the browser, where it executes. Try adding something like await page.evaluate(() => { button.click(); }); in the script above, after const button = .... Something like this:
const button = await tempPage.$("#myButton");
await button.click();
await page.evaluate(() => { button.click(); });
In the script, button is defined before page.evaluate, but you'll get a ReferenceError when page.evaluate runs because button is not defined on the browser side!