I'm attempting to make a small script that clicks a random Google Search result after searching for "'what is ' + word". Nothing I've done has gotten me the results I want; heck, I can't even get the script to click a single Google Search result!
I've tried multiple things here, such as collecting all search results into an array and clicking a random one (nothing was collected into the array), clicking an element by partial link text ("https://" brought no results), and many other solutions that work in Python but don't work here.
const puppeteer = require('puppeteer');
const searchbar = "#tsf > div:nth-child(2) > div > div.RNNXgb > div > div.a4bIc > input";

async function gsearch() {
    const browser = await puppeteer.launch({ headless: false, args: ['--no-sandbox', '--disable-setuid-sandbox'] });
    const page = await browser.newPage();
    await page.goto('https://google.com');

    var fs = require("fs");
    var array = fs.readFileSync("words.txt").toString().split('\n');
    var random = array[Math.floor(Math.random() * array.length)];

    await page.click(searchbar);
    await page.keyboard.type("what is " + random);
    await page.waitFor(1000);

    await page.evaluate(() => {
        let elements = $('LC20lb').toArray();
        for (i = 0; i < elements.length; i++) {
            $(elements[i]).click();
        }
    });
}

gsearch();
(Ignore any indentation errors, I swear it looks cleaner in VS Code.)
I expected it to click a random search result. I end up getting nothing done, maybe an error or two, but that's about it.
LC20lb is not an HTML tag; it's the class name of the result h3 elements. And by using $(), are you trying to select elements with jQuery? Use document.querySelectorAll() instead.
const puppeteer = require('puppeteer');
const fs = require("fs");

async function gsearch() {
    const browser = await puppeteer.launch({
        headless: false,
        args: ['--no-sandbox', '--disable-setuid-sandbox']
    });
    const page = await browser.newPage();
    await page.goto('https://google.com');

    var array = fs.readFileSync("words.txt").toString().split('\n');
    var random = array[Math.floor(Math.random() * array.length)];

    // simple selector for the search box
    await page.click('[name=q]');
    await page.keyboard.type("what is " + random);
    // you forgot this: submit the search
    await page.keyboard.press('Enter');
    // wait for the search results to appear
    await page.waitForSelector('h3.LC20lb', { timeout: 10000 });

    await page.evaluate(() => {
        let elements = document.querySelectorAll('h3.LC20lb');
        // a "for loop" would click every element; pick one random index instead
        let randomIndex = Math.floor(Math.random() * elements.length);
        elements[randomIndex].click();
    });
}

gsearch();
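If clicking inside page.evaluate() proves flaky (Google's markup and the LC20lb class name change often), a variant of the "collect everything into an array" idea from the question is to gather the result URLs and navigate to a random one directly. This is only a sketch under the assumption that each result heading still matches h3.LC20lb and sits inside the <a> that carries the link:

// Sketch: collect result URLs, then navigate to a random one.
const links = await page.$$eval('h3.LC20lb', headings =>
    headings
        .map(h => h.closest('a'))     // walk up to the anchor wrapping each heading
        .filter(a => a && a.href)     // drop anything without a usable href
        .map(a => a.href)
);
if (links.length > 0) {
    const randomLink = links[Math.floor(Math.random() * links.length)];
    await page.goto(randomLink);
}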
Here is my code, where I have got the element handles of some target divs:
const puppeteer = require("puppeteer");

(async () => {
    const searchString = `https://www.google.com/maps/search/restaurants/@-6.4775265,112.057849,3.67z`;
    const browser = await puppeteer.launch();
    const page = await browser.newPage();
    await page.goto(searchString);

    const xpath_expression = '//div[contains(@aria-label, "Results for")]/div/div[./a]';
    await page.waitForXPath(xpath_expression);
    const targetDivs = await page.$x(xpath_expression);

    // const link_urls = await page.evaluate((...targetDivs) => {
    //     return targetDivs.map((e) => {
    //         return e.textContent;
    //     });
    // }, ...targetDivs);
})();
I have two relative XPaths inside these target divs which point to the data I need:
'link' : './a/@href'
'title': './a/@aria-label'
I have a sample of similar Python code like this:
from parsel import Selector

response = Selector(page_content)

results = []
for el in response.xpath('//div[contains(@aria-label, "Results for")]/div/div[./a]'):
    results.append({
        'link': el.xpath('./a/@href').extract_first(''),
        'title': el.xpath('./a/@aria-label').extract_first('')
    })
How do I do this in Puppeteer?
I think you can get the href and ariaLabel property values with e.g.
const targetDivs = await page.$x(xpath_expression);

// use a plain for...of loop so each await completes before moving on
// (forEach with an async callback fires them all without waiting)
let pos = 0;
for (const div of targetDivs) {
    const links = await div.$x('./a[@href]');
    const href = await (await links[0].getProperty('href')).jsonValue();
    const ariaLabel = await (await links[0].getProperty('ariaLabel')).jsonValue();
    console.log(pos++, href, ariaLabel);
}
These are the element properties, not the attribute values. In the case of href, that might mean you get an absolute URL instead of a relative one, but I haven't checked whether it makes a difference for that particular page. I am also not sure $x allows selecting attribute nodes or string values directly; the documentation only talks about element handles.
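If you need the attribute values exactly as written in the markup rather than the resolved element properties, one option (just a sketch, reusing the targetDivs handles from above) is to read them with getAttribute inside page.evaluate, which accepts an element handle as an argument:

// Read raw attribute values from the first <a> inside each div.
// Plain objects of strings are serialized back to Node.
const results = [];
for (const div of targetDivs) {
    const entry = await page.evaluate(el => {
        const a = el.querySelector('a[href]');
        return a
            ? { link: a.getAttribute('href'), title: a.getAttribute('aria-label') }
            : null;
    }, div);
    if (entry) results.push(entry);
}
console.log(results);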
In order to learn web scraping with Puppeteer, I have started a little project which aims to extract the planned power outages from the national power supplier's website. To do that, I have to change the region manually and then retrieve the list of outages. The querySelector request I use on the browser side looks totally fine, as it contains all the displayed outages without fail. But when I use it on the server end, I receive an empty list.
Here is my code; the URL of the website can be found in it.
Thanks in advance!
const puppeteer = require('puppeteer');

(async () => {
    const browser = await puppeteer.launch({ headless: false });
    const page = await browser.newPage();
    await page.goto('https://alert.eneo.cm/', { waitUntil: 'networkidle0' });

    await page.evaluate(() => {
        var region = "Littoral";
        var j = $('#regions option:contains(' + region + ')');
        $('#regions').val(j.val()).change();
    });

    const outages = await page.evaluate(() => {
        const elements = document.querySelectorAll("#contentdata .outage");
        return elements;
    });

    console.log(outages);
})();
The empty result is expected: page.evaluate() can only return serializable values, so a NodeList of DOM elements comes back as an empty object. I see there is a list of power outages on the page you want to scrape; here is how you can get the power outage data for the first div:
const puppeteer = require('puppeteer');

(async () => {
    let browser = await puppeteer.launch();
    let page = await browser.newPage();
    await page.goto('https://alert.eneo.cm/', { waitUntil: 'networkidle0' });

    // select the region by its option value (here '5')
    await page.select('select[name="regions"]', '5');

    const outageData = await page.evaluate(() => {
        let quartier = document.querySelector('div[class="quartier"]').innerText;
        let ville = document.querySelector('div[class="ville"]').innerText;
        let observations = document.querySelector('div[class="observations"]').innerText;
        let dateAndTime = document.querySelector('div[class="prog_date"]').innerText;
        return { quartier, ville, observations, dateAndTime };
    });

    await browser.close();
    console.log(outageData);
})();
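That snippet only reads the first row. If you want every outage for the selected region, you can map over all the rows inside a single evaluate call and return plain objects, which serialize cleanly (unlike the NodeList in the question). A sketch, assuming each #contentdata .outage row contains the same four child divs used above:

const allOutages = await page.evaluate(() => {
    // small helper: innerText of a child element, or '' if it is missing
    const text = (row, selector) => {
        const node = row.querySelector(selector);
        return node ? node.innerText : '';
    };
    return Array.from(document.querySelectorAll('#contentdata .outage')).map(row => ({
        quartier: text(row, '.quartier'),
        ville: text(row, '.ville'),
        observations: text(row, '.observations'),
        dateAndTime: text(row, '.prog_date')
    }));
});
console.log(allOutages);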
I'm using Puppeteer to retrieve data online and I'm facing an issue.
Two functions have the same name and return a serialized object; the first one returns an empty object, but the second one does contain the data I'm targeting.
My question is: how can I select the second occurrence of the function instead of the first one, which returns an empty object?
Thanks.
My code:
const puppeteer = require('puppeteer');
const cheerio = require('cheerio');
const Variants = require('./variants.js');
const Feedback = require('./feedback.js');

async function Scraper(productId, feedbackLimit) {
    const browser = await puppeteer.launch();
    const page = await browser.newPage();

    /** Scrape page for details */
    await page.goto(`${productId}`);

    const data = (await page.evaluate()).match(/window.runParams = {"result/)
    const data = data.items

    await page.close();
    await browser.close();

    console.log(data);
    return data;
}

module.exports = Scraper;
Website source code:
window.runParams = {};
window.runParams = {"resultCount":19449,"seoFeaturedSnippet":};
Please try this, it should work.
const data = await page.content();
const regexp = /window.runParams/g;
const matches = data.matchAll(regexp);

for (const match of matches) {
    console.log(match);
    console.log(match.index);
}
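To actually grab the second assignment (the one with the data), you can capture the object literal after each window.runParams = and parse the second match. This is only a sketch: it assumes the assignment is valid JSON and ends with "};" on the same statement, as in the source shown above, which is fragile if the payload ever contains that character sequence:

const html = await page.content();
// capture everything between "window.runParams = " and the closing "};"
const assignments = [...html.matchAll(/window\.runParams\s*=\s*(\{.*?\});/gs)];
if (assignments.length >= 2) {
    // assignments[0] is the empty "{}"; assignments[1] holds the real payload
    const payload = JSON.parse(assignments[1][1]);
    console.log(payload.resultCount);
}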
I'm trying to click through a cookie wall on a web page, but Puppeteer refuses to recognize the short selector with just the type and class (button.button-action). Changing this to the full CSS selector fixes the problem, but that isn't a viable solution, since any change in the parent elements can break the selector. As far as I know this shouldn't be a problem, because on the page in question document.querySelector("button.button-action") also returns the element I'm trying to click.
The code that doesn't work:
const puppeteer = require('puppeteer');

const main = async () => {
    const browser = await puppeteer.launch({ headless: false });
    const page = await browser.newPage();
    await page.goto("https://www.euclaim.nl/check-uw-vlucht#/problem", { waitUntil: 'networkidle2' });

    const cookiewall = await page.waitForSelector("button.button-action", { visible: true });
    await cookiewall.click();
};

main();
The code that does work:
const puppeteer = require('puppeteer');

const main = async () => {
    const browser = await puppeteer.launch({ headless: false });
    const page = await browser.newPage();
    await page.goto("https://www.euclaim.nl/check-uw-vlucht#/problem", { waitUntil: 'networkidle2' });

    const cookiewall = await page.waitForSelector("#InfoPopupContainer > div.ipBody > div > div > div.row.actionButtonContainer.mobileText > button", { visible: true });
    await cookiewall.click();
};

main();
The problem is that there are three button.button-action elements there, and the first match is not visible.
One thing you could do is waitForSelector without the visible option (because that check only looks at the first matching button), and then iterate through all the matches and click the first one that is actually visible:
await page.waitForSelector("button.button-action");
const actions = await page.$$("button.button-action");

for (let action of actions) {
    // boundingBox() returns null for elements that are not rendered,
    // so this picks the first button that is actually visible
    if (await action.boundingBox()) {
        await action.click();
        break;
    }
}
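If you would rather keep waiting until a visible candidate actually appears (instead of assuming one is already rendered when the loop runs), a waitForFunction variant is possible. This is just a sketch reusing the same button.button-action selector; offsetParent !== null is a rough visibility check, not the exact check Puppeteer performs:

// wait until at least one matching button is rendered, then click it in-page
await page.waitForFunction(() => {
    return [...document.querySelectorAll('button.button-action')]
        .some(b => b.offsetParent !== null);
});
await page.evaluate(() => {
    const visible = [...document.querySelectorAll('button.button-action')]
        .find(b => b.offsetParent !== null);
    if (visible) visible.click();
});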
I am using Puppeteer to build a basic web scraper, and so far I can return all the data I require from any given page. However, when pagination is involved, my scraper comes unstuck (it only returns the 1st page).
See the example below: it returns the title/price for the first 20 books, but doesn't look at the other 49 pages of books.
Just looking for guidance on how to overcome this - I can't see anything in the docs.
Thanks!
const puppeteer = require('puppeteer');

let scrape = async () => {
    const browser = await puppeteer.launch({ headless: false });
    const page = await browser.newPage();
    await page.goto('http://books.toscrape.com/');

    const result = await page.evaluate(() => {
        let data = [];
        let elements = document.querySelectorAll('.product_pod');
        for (var element of elements) {
            let title = element.childNodes[5].innerText;
            let price = element.childNodes[7].children[0].innerText;
            data.push({ title, price });
        }
        return data;
    });

    browser.close();
    return result;
};

scrape().then((value) => {
    console.log(value);
});
To be clear, I am following a tutorial here; this code comes from Brandon Morelli on codeburst.io: https://codeburst.io/a-guide-to-automating-scraping-the-web-with-javascript-chrome-puppeteer-node-js-b18efb9e9921
I was following the same article in order to educate myself on how to use Puppeteer.
The short answer to your question is that you need to introduce one more loop to iterate over all the available pages of the online book catalogue.
I did the following in order to collect all book titles and prices:
Extracted the page.evaluate part into a separate async function that takes page as an argument
Introduced a for-loop with a hardcoded last catalogue page number (you can extract it with Puppeteer if you wish; see the sketch at the end of this answer)
Placed the async function from step one inside the loop
The exact same code from Brandon Morelli's article, but now with one extra loop:
const puppeteer = require('puppeteer');

let scrape = async () => {
    const browser = await puppeteer.launch({ headless: true });
    const page = await browser.newPage();
    await page.goto('http://books.toscrape.com/');

    var results = []; // variable to hold the collection of all book titles and prices
    var lastPageNumber = 50; // hardcoded last catalogue page; you can set it dynamically if you wish

    // simple loop to iterate over the catalogue pages
    for (let index = 0; index < lastPageNumber; index++) {
        // wait 1 sec for the page to load
        await page.waitFor(1000);
        // call and await extractedEvaluateCall and concatenate the results on every iteration.
        // You could use results.push, but then you would end up with a collection of collections
        results = results.concat(await extractedEvaluateCall(page));
        // this is where the next button is clicked to jump to the next page
        if (index != lastPageNumber - 1) {
            // no next button on the last page
            await page.click('#default > div > div > div > div > section > div:nth-child(2) > div > ul > li.next > a');
        }
    }

    browser.close();
    return results;
};

async function extractedEvaluateCall(page) {
    // same logic as before, just extracted into a separate function;
    // it has to be async and take page as an argument in order to work
    return page.evaluate(() => {
        let data = [];
        let elements = document.querySelectorAll('.product_pod');
        for (var element of elements) {
            let title = element.childNodes[5].innerText;
            let price = element.childNodes[7].children[0].innerText;
            data.push({ title, price });
        }
        return data;
    });
}

scrape().then((value) => {
    console.log(value);
    console.log('Collection length: ' + value.length);
    console.log(value[0]);
    console.log(value[value.length - 1]);
});
Console output:
...
{ title: 'In the Country We ...', price: '£22.00' },
... 900 more items ]
Collection length: 1000
{ title: 'A Light in the ...', price: '£51.77' }
{ title: '1,000 Places to See ...', price: '£26.08' }
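As mentioned in the steps above, the hardcoded lastPageNumber can also be read from the page itself. A minimal sketch, assuming books.toscrape.com still shows "Page 1 of 50" inside the pager's li.current element:

// read "Page 1 of 50" from the pager and take the trailing number
const lastPageNumber = await page.$eval('li.current', el => {
    const parts = el.innerText.trim().split(' ');
    return parseInt(parts[parts.length - 1], 10);
});

Alternatively, you could skip counting pages altogether and keep clicking li.next > a until that link no longer exists.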