Overcoming pagination when using puppeteer (library) for web-scraping - node.js

I am using Puppeteer to build a basic web scraper. So far I can return all the data I require from any given page, but when pagination is involved my scraper comes unstuck (it only returns the first page).
See the example below - it returns the title/price for the first 20 books, but doesn't look at the other 49 pages of books.
I'm just looking for guidance on how to overcome this - I can't see anything in the docs.
Thanks!
const puppeteer = require('puppeteer');

let scrape = async () => {
    const browser = await puppeteer.launch({headless: false});
    const page = await browser.newPage();

    await page.goto('http://books.toscrape.com/');

    const result = await page.evaluate(() => {
        let data = [];
        let elements = document.querySelectorAll('.product_pod');

        for (var element of elements) {
            let title = element.childNodes[5].innerText;
            let price = element.childNodes[7].children[0].innerText;

            data.push({title, price});
        }

        return data;
    });

    browser.close();
    return result;
};

scrape().then((value) => {
    console.log(value);
});
To be clear, I am following a tutorial here - this code comes from Brandon Morelli on codeburst.io: https://codeburst.io/a-guide-to-automating-scraping-the-web-with-javascript-chrome-puppeteer-node-js-b18efb9e9921

I was following the same article in order to teach myself how to use Puppeteer.
The short answer to your question is that you need to introduce one more loop to iterate over all the available pages in the online book catalogue.
I did the following to collect all book titles and prices:
Extracted the page.evaluate part into a separate async function that takes page as an argument
Introduced a for loop with a hardcoded last catalogue page number (you can also extract it with Puppeteer if you wish; see the sketch after the console output below)
Called the async function from step one inside the loop
Here is the same code from Brandon Morelli's article, but now with one extra loop:
const puppeteer = require('puppeteer');

let scrape = async () => {
    const browser = await puppeteer.launch({ headless: true });
    const page = await browser.newPage();

    await page.goto('http://books.toscrape.com/');

    var results = []; // variable to hold the collection of all book titles and prices
    var lastPageNumber = 50; // this is the hardcoded last catalogue page; you can set it dynamically if you wish

    // simple loop to iterate over the catalogue pages
    for (let index = 0; index < lastPageNumber; index++) {
        // wait 1 sec for the page to load
        await page.waitFor(1000);

        // call and await extractedEvaluateCall and concatenate the results on every iteration.
        // You could use results.push, but then you would end up with a collection of collections
        results = results.concat(await extractedEvaluateCall(page));

        // this is where the next button on the page is clicked to jump to the next page
        if (index != lastPageNumber - 1) {
            // there is no next button on the last page
            await page.click('#default > div > div > div > div > section > div:nth-child(2) > div > ul > li.next > a');
        }
    }

    browser.close();
    return results;
};

async function extractedEvaluateCall(page) {
    // the same extraction logic, just moved into a separate function;
    // it must be async and take page as an argument
    return page.evaluate(() => {
        let data = [];
        let elements = document.querySelectorAll('.product_pod');

        for (var element of elements) {
            let title = element.childNodes[5].innerText;
            let price = element.childNodes[7].children[0].innerText;

            data.push({ title, price });
        }

        return data;
    });
}

scrape().then((value) => {
    console.log(value);
    console.log('Collection length: ' + value.length);
    console.log(value[0]);
    console.log(value[value.length - 1]);
});
Console output:
...
{ title: 'In the Country We ...', price: '£22.00' },
... 900 more items ]
Collection length: 1000
{ title: 'A Light in the ...', price: '£51.77' }
{ title: '1,000 Places to See ...', price: '£26.08' }
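For reference, here is a minimal sketch of how lastPageNumber could be read from the page instead of being hardcoded to 50. It assumes the pager on books.toscrape.com renders as an li.current element whose text reads "Page 1 of 50" - verify the selector and text before relying on it:

// Hedged sketch: derive lastPageNumber from the pager text instead of hardcoding it.
// Assumption: the pager renders as <li class="current"> Page 1 of 50 </li>.
const lastPageNumber = await page.evaluate(() => {
    const pager = document.querySelector('li.current');
    if (!pager) return 1; // no pager found - assume a single page
    const match = pager.innerText.trim().match(/of\s+(\d+)/);
    return match ? parseInt(match[1], 10) : 1;
});

You would call this once, right after the initial page.goto, and use the returned value in place of the hardcoded 50.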

Related

Page returning same results even though the results are different each time. (Node, puppeteer, cheerio)

I'm scraping a dynamic website, which doesn't instantly load its list of items.
I wait until that content loads, then get the pagination list and extract the last item's value for the page count.
Once this is done, I use a for loop to go through the website x number of times based on the page count for that list of items.
Once the first loop is complete and the data is gathered, I click the next button to move on to the next page, wait for the data to change, then go back through the for loop to gather the new data in that list.
For some reason, the results returned are the same for each page, even though they're not the same, and I can see using headless: false that the data is in fact changing - so why isn't it returning that new data?
try {
    const browser = await puppeteer.launch({
        headless: false
    });
    const page = await browser.newPage();

    await page.goto(url, {
        waitUntil: "networkidle0",
    }).catch((err) => console.log("error loading url", err));

    await page.waitForSelector(selector);

    const pageData = await page.evaluate(() => {
        return {
            html: document.documentElement.innerHTML,
        }
    });

    const $ = cheerio.load(pageData.html);

    for (let i = 0; i < pageCount; i++) {
        await page.waitForSelector(selector);

        $('.itemDivWrapper > div').each(async (i, g) => {
            let itemName = $(g).find('a > div > h3').text();
            console.log(itemName);
        });

        await Promise.all([
            page.$eval('div.itemList > nav > ul > li.nextButton > a', element =>
                element.click()
            ),
            page.waitForTimeout(5000)
        ]);
    }

    await browser.close();
} catch (error) {
    console.log(error);
}
Maybe I'm just doing something silly wrong :D and fresh eyes are needed.
I expect each loop to provide the new page's data, which 100% showed on screen before the script tried to get it, but it returns the same values each time.
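A likely culprit, going by the snippet above, is that the HTML is captured and loaded into cheerio once, before the loop, so every iteration parses the same stale snapshot even though the live page has changed. A minimal sketch that re-reads the page on every iteration (reusing the question's own url, selector and pageCount variables, so treat it as a sketch rather than a drop-in fix):

for (let i = 0; i < pageCount; i++) {
    await page.waitForSelector(selector);

    // Re-capture the page HTML on every iteration so cheerio sees the current page,
    // not the snapshot taken before the loop started.
    const html = await page.evaluate(() => document.documentElement.innerHTML);
    const $ = cheerio.load(html);

    $('.itemDivWrapper > div').each((i, g) => {
        console.log($(g).find('a > div > h3').text());
    });

    await Promise.all([
        page.$eval('div.itemList > nav > ul > li.nextButton > a', el => el.click()),
        page.waitForTimeout(5000)
    ]);
}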

How to scrape image src the right way using puppeteer?

I'm trying to create a function that can capture the src attribute from a website, but none of the most common ways of doing so are working.
This was my original attempt:
(async () => {
    const browser = await puppeteer.launch();
    const page = await browser.newPage();
    try {
        await page.setDefaultNavigationTimeout(0);
        await page.waitForTimeout(500);
        await page.goto(
            `https://www.sirved.com/restaurant/essex-ontario-canada/dairy-freez/1/menus/3413654`,
            {
                waitUntil: "domcontentloaded",
            }
        );

        const fetchImgSrc = await page.evaluate(() => {
            const img = document.querySelectorAll(
                "#menus > div.tab-content >div > div > div.swiper-wrapper > div.swiper-slide > img"
            );
            let src = [];
            for (let i = 0; i < img.length; i++) {
                src.push(img[i].getAttribute("src"));
            }
            return src;
        });
        console.log(fetchImgSrc);
    } catch (err) {
        console.log(err);
    }
    await browser.close();
})();
This logged an empty array: []
In my next attempt I tried a suggestion and got back an empty string.
await page.setViewport({ width: 1024, height: 768 });
const imgs = await page.$$eval("#menus img", (images) =>
    images.map((i) => i.src)
);
console.log(imgs);
And in my final attempt I followed another suggestion and got back an array with two empty strings inside it.
const fetchImgSrc = await page.evaluate(() => {
    const img = document.querySelectorAll(".swiper-lazy-loaded");
    let src = [];
    for (let i = 0; i < img.length; i++) {
        src.push(img[i].getAttribute("src"));
    }
    return src;
});
console.log(fetchImgSrc);
In each attempt I only replaced the function and console.log portion of the code. I've done a lot of digging and found these are the most common ways of scraping an image src using Puppeteer, and I've used them elsewhere, but for some reason they aren't working for me right now. I'm not sure if I have a bug in my code or why it won't work.
To return the src links for the two menu images on this page you can use:
const fetchImgSrc = await page.evaluate(() => {
    const img = document.querySelectorAll('.swiper-lazy-loaded');
    let src = [];
    for (let i = 0; i < img.length; i++) {
        src.push(img[i].getAttribute("src"));
    }
    return src;
});
This gives us the expected output
['https://images.sirved.com/ChIJq6qqqlrZOogRs_xGxBcn0_w/5caf3b9eabc40.jpg', 'https://images.sirved.com/ChIJq6qqqlrZOogRs_xGxBcn0_w/5caf3bbe93cc6.jpg']
You have two issues here:
Puppeteer opens the page in a small window by default, and the images you want to scrape are lazy loaded: while they are not in the viewport they won't be loaded (they won't even have src values). You need to set the Puppeteer browser to a bigger size with page.setViewport.
Element.getAttribute is not advised if you are working with dynamically changing websites: it will always return the original attribute value, which is an empty string for the lazy-loaded images. What you need is the src property, which is always up to date in the DOM. This is the attribute vs. property distinction in JavaScript.
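As a general illustration of that attribute vs. property distinction (it uses an input element rather than this site's exact lazy-load mechanism, so treat it as a sketch):

// The attribute keeps the value from the original markup;
// the property reflects the live state of the DOM.
const input = document.createElement('input');
input.setAttribute('value', 'initial');   // what the markup said
input.value = 'typed by the user';        // what the DOM holds now

console.log(input.getAttribute('value')); // 'initial'
console.log(input.value);                 // 'typed by the user'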
By the way: you can shorten your script with page.$$eval like this:
await page.setViewport({ width: 1024, height: 768 })
const imgs = await page.$$eval('#menus img', images => images.map(i => i.src))
console.log(imgs)
Output:
[
'https://images.sirved.com/ChIJq6qqqlrZOogRs_xGxBcn0_w/5caf3b9eabc40.jpg',
'https://images.sirved.com/ChIJq6qqqlrZOogRs_xGxBcn0_w/5caf3bbe93cc6.jpg'
]

How to check for a null value in an element of a json object in node js function?

I am using the google-play-scraper module in node.js to scrape Google Play reviews. The review function for a single page is as below:
var gplay = require('google-play-scraper');

gplay.reviews({
    appId: 'es.socialpoint.chefparadise',
    page: 0,
}).then(console.log, console.log);
Now, I'd like to scrape all the comments on all pages at once and save them with a logger. For this, I am using the winston logger and a for loop as below:
var gplay = require('google-play-scraper');
const winston = require('winston');

const logger = winston.createLogger({
    transports: [
        new winston.transports.Console(),
        new winston.transports.File({ filename: 'rev1.log' })
    ]
});

package_id = 'com.jetstartgames.chess';

for (i = 0; i < 112; i++) {
    gplay.reviews({
        appId: package_id,
        page: i,
    }).then(logger.info, logger.info);
}
The problem is that I have to pre-define the maximum number of review pages each application has (I have to determine the maximum value of i for the loop). To do this, I thought of checking for a null value, but I couldn't find a plausible way of doing it. The log file entry for a page that doesn't actually exist has the structure below:
{"message":[],"level":"info"}
I tried this code, which doesn't work:
max = 0;
for (i = 0; i < 10000; i++) {
    data = gplay.reviews({
        appId: 'com.jetstartgames.chess',
        page: i,
    });
    if (data.message == null || data.message == undefined) {
        break;
    } else {
        max += 1;
    }
}
Is there any way that I can figure out the maximum number of pages by checking for the first null output? Or any other suggestion for this purpose?
There are a couple of issues here. The API you're using returns Promises, so the returned value won't be available to you synchronously inside your loop.
If you're using Node.js > 7.6 you can use async/await like so:
import gplay from 'google-play-scraper';

async function getReviews(appId, page = 1) {
    return await gplay.reviews({
        appId,
        page,
    });
}

async function process(appId) {
    let page = 1;
    let messages = [];
    let result;

    do {
        result = await getReviews(appId, page);
        messages = messages.concat(result);
        ++page;
    } while (result.length > 0);

    return messages;
}

process('com.jetstartgames.chess')
    .then((messages) => {
        console.log(messages);
    });
I tried to implement it like this. Please try it and let me know if it works :)
In the documentation for reviews, please note:
Note that this method returns reviews in a specific language (english by default), so you need to try different languages to get more reviews. Also, the counter displayed in the Google Play page refers to the total number of 1-5 stars ratings the application has, not the written reviews count. So if the app has 100k ratings, don't expect to get 100k reviews by using this method.
var gplay = require('google-play-scraper');

var appId = 'com.jetstartgames.chess';
var taskList = [];

for (var i = 1; i < 10000; i++) {
    taskList.push(new Promise((res, rej) => {
        gplay.reviews({
            appId: appId,
            page: i,
            sort: gplay.sort.RATING
        }).then(result => {
            res(result.length);
        })
        .catch(err => rej(err));
    }));
}

Promise.all(taskList)
    .then(results => {
        results = results.filter(x => x > 0);
        var maxPage = results.length;
        console.log('maxPage', maxPage);
    })
    .catch(err => console.log(err));
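As the quoted note points out, reviews come back for one language at a time, so you may also want to repeat the whole process per language. A hedged sketch, reusing gplay from above and assuming the library accepts a lang option (check google-play-scraper's docs for the exact option name):

var languages = ['en', 'es', 'de', 'fr']; // languages to try - adjust to your needs

async function reviewsForAllLanguages(appId, page) {
    var all = [];
    for (var lang of languages) {
        // Assumption: reviews() takes a lang option; verify against the library docs.
        var batch = await gplay.reviews({ appId: appId, page: page, lang: lang });
        all = all.concat(batch);
    }
    return all;
}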
The problem is that I have to pre-define the maximum number of review pages each application has (I have to determine the maximum value of i for the loop).
I think we can get this data from the app response.
{
    appId: 'es.socialpoint.chefparadise',
    ...
    ratings: 27904,
    reviews: 11372, // data to determine the page number
    ...
}
Also, the reviews documentation offers a ballpark number for the page count calculation:
page (optional, defaults to 0): Number of page that contains reviews. Every page has 40 reviews at most.
Making those changes:
'use strict';

const gplay = require('google-play-scraper');
const packageId = 'es.socialpoint.chefparadise';

function getAppDetails(packageId) {
    return gplay.app({ appId: packageId })
        .catch(console.log);
}

getAppDetails(packageId).then(appDetails => {
    let { reviews, ratings } = appDetails;
    const totalPages = Math.round(reviews / 40);
    console.log(`Total reviews => ${reviews} \nTotal ratings => ${ratings}\nTotal pages => ${totalPages} `);

    let rawReview = [];
    let pageNumber = 0;

    while (pageNumber < totalPages) {
        console.log(`pageNumber =${pageNumber},totalPages=${totalPages}`);
        rawReview.push(gplay.reviews({
            appId: packageId,
            page: pageNumber,
        }).catch(err => {
            console.log(packageId, pageNumber);
            console.log(err);
        }));
        pageNumber++;
    }

    return Promise.all(rawReview);
}).then(reviewsResults => {
    console.log('***Reviews***');
    for (let review of reviewsResults) {
        console.log(review);
    }
}).catch(err => {
    console.log('Err ', err);
});
It worked well for package IDs with fewer reviews, but for es.socialpoint.chefparadise I frequently ran into Issue #298, since the data size is huge.
Output
Total reviews => 215922
Total ratings => 688107
Total pages => 5398
Reviews
....
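One way to reduce the pressure that causes those failures (not part of the original answer) is to request the review pages in small sequential batches instead of firing thousands of requests at once. A rough sketch, reusing gplay, packageId and totalPages from the code above and an assumed batch size:

const BATCH_SIZE = 10; // assumption: tune to whatever the service tolerates

async function fetchReviewsInBatches(packageId, totalPages) {
    const allReviews = [];
    for (let start = 0; start < totalPages; start += BATCH_SIZE) {
        const batch = [];
        for (let page = start; page < Math.min(start + BATCH_SIZE, totalPages); page++) {
            batch.push(gplay.reviews({ appId: packageId, page: page }).catch(err => {
                console.log(packageId, page, err);
                return []; // keep going even if one page fails
            }));
        }
        // wait for the current batch to finish before starting the next one
        const results = await Promise.all(batch);
        results.forEach(r => allReviews.push(...r));
    }
    return allReviews;
}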

Click on random Google Search result using NodeJS and Puppeteer?

I'm attempting to make a small script that clicks on a random Google search result after searching for "what is " + word. Nothing I've done has got me the results I want; heck, I can't even get the script to click a single Google search result!
I've tried multiple things here, such as collecting all the search results in an array and clicking a random one (it didn't collect them into an array), clicking an element by partial text (https:// brought no results), and many other solutions that work in Python but don't work here.
const puppeteer = require('puppeteer');

const searchbar = "#tsf > div:nth-child(2) > div > div.RNNXgb > div > div.a4bIc > input";

async function gsearch() {
    const browser = await puppeteer.launch({headless: false, args: ['--no-sandbox', '--disable-setuid-sandbox']});
    const page = await browser.newPage();
    await page.goto('https://google.com');

    var fs = require("fs");
    var array = fs.readFileSync("words.txt").toString().split('\n');
    var random = array[Math.floor(Math.random() * array.length)];

    await page.click(searchbar);
    await page.keyboard.type("what is " + random);
    await page.waitFor(1000);

    await page.evaluate(() => {
        let elements = $('LC20lb').toArray();
        for (i = 0; i < elements.length; i++) {
            $(elements[i]).click();
        }
    });
}

gsearch();
(Ignore any indentation errors - I swear it looks cleaner in VS Code.)
I expected it to click a random search result. I end up getting nothing done - maybe an error or two, but that's about it.
LC20lb is not an HTML tag; it's a class name on the h3 result headings. And by using $(), are you trying to select elements with jQuery? Use document.querySelectorAll() instead.
const puppeteer = require('puppeteer');
const fs = require("fs");

async function gsearch() {
    const browser = await puppeteer.launch({
        headless: false,
        args: ['--no-sandbox', '--disable-setuid-sandbox']
    });
    const page = await browser.newPage();
    await page.goto('https://google.com');

    var array = fs.readFileSync("words.txt").toString().split('\n');
    var random = array[Math.floor(Math.random() * array.length)];

    // simple selector for the search box
    await page.click('[name=q]');
    await page.keyboard.type("what is " + random);

    // you forgot this
    await page.keyboard.press('Enter');

    // wait for the search results
    await page.waitForSelector('h3.LC20lb', {timeout: 10000});

    await page.evaluate(() => {
        let elements = document.querySelectorAll('h3.LC20lb');
        // a "for loop" would click every element, not a random one,
        // so pick a single random index instead
        let randomIndex = Math.floor(Math.random() * elements.length);
        elements[randomIndex].click();
    });
}
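If you also need the script to wait for the result page to finish loading after that click, one option (not in the original answer) is to pair the click with page.waitForNavigation. A minimal sketch, assuming the click triggers a normal navigation:

// Start waiting for the navigation and perform the click together,
// so the navigation event isn't missed.
await Promise.all([
    page.waitForNavigation({ waitUntil: 'domcontentloaded' }),
    page.evaluate(() => {
        let elements = document.querySelectorAll('h3.LC20lb');
        elements[Math.floor(Math.random() * elements.length)].click();
    })
]);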

How to scrape multi-level links using puppeteer js?

I am scraping the table rows of a site page using Puppeteer. I have code that scrapes the content and assigns it to an object for each row in the table. In each table row there is a link that I need to open in a new page (with Puppeteer), scrape for a particular element, assign that to the same object, and return the whole object with the new keys. How is that possible with Puppeteer?
async function run() {
    const browser = await puppeteer.launch({
        headless: false
    })
    const page = await browser.newPage()
    await page.goto('https://tokenmarket.net/blockchain/', {waitUntil: 'networkidle0'})
    await page.waitFor(5000)

    var onlink = ''
    var result = await page.$$eval('table > tbody tr .col-actions a:first-child', (els) => Array.from(els).map(function(el) {
        // running ajax requests to load the inner page links.
        $.get(el.children[0].href, function(response) {
            onlink = $(response).find('#page-wrapper > main > div.container > div > table > tbody > tr > td:nth-child(2)').text()
        })

        return {
            icoImgUrl: el.children[0].children[0].children[0].currentSrc,
            icoDate: el.children[2].innerText.split('\n').shift() === 'To be announced' ? null : new Date( el.children[2].innerText.split('\n').shift() ).toISOString(),
            icoName: el.children[1].children[0].innerText,
            link: el.children[1].children[0].children[0].href,
            description: el.children[3].innerText,
            assets: onlink
        }
    }))

    console.log(result)
    UpcomingIco.insertMany(result, function(error, docs) {})
    browser.close()
}

run()
If you try opening a new tab for each ICO page in parallel you might end up with 100+ pages loading at the same time.
So the best thing you can do is first collect the URLs and then visit them one by one in a loop.
This also keeps the code simple and readable.
For example (please see my comments):
const browser = await puppeteer.launch({ headless: false });
const page = await browser.newPage();
await page.goto('https://tokenmarket.net/blockchain/');

// Gather assets page urls for all the blockchains
const assetUrls = await page.$$eval(
    '.table-assets > tbody > tr .col-actions a:first-child',
    assetLinks => assetLinks.map(link => link.href)
);

const results = [];

// Visit each assets page one by one
for (let assetsUrl of assetUrls) {
    await page.goto(assetsUrl);

    // Now collect all the ICO urls.
    const icoUrls = await page.$$eval(
        '#page-wrapper > main > div.container > div > table > tbody > tr > td:nth-child(2) a',
        links => links.map(link => link.href)
    );

    // Visit each ICO one by one and collect the data.
    for (let icoUrl of icoUrls) {
        await page.goto(icoUrl);

        const icoImgUrl = await page.$eval('#asset-logo-wrapper img', img => img.src);
        const icoName = await page.$eval('h1', h1 => h1.innerText.trim());
        // TODO: Gather all the needed info like description etc here.

        results.push([{
            icoName,
            icoUrl,
            icoImgUrl
        }]);
    }
}

// Results are ready
console.log(results);
browser.close();
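If one page at a time turns out to be too slow, a middle ground (not part of the original answer) is to visit the ICO pages a few at a time with a small, fixed number of tabs. A hedged sketch, reusing the browser and icoUrls from the answer above:

const CONCURRENCY = 3; // assumption: adjust to what the site and your machine tolerate

async function scrapeIco(browser, icoUrl) {
    const tab = await browser.newPage();
    try {
        await tab.goto(icoUrl);
        const icoImgUrl = await tab.$eval('#asset-logo-wrapper img', img => img.src);
        const icoName = await tab.$eval('h1', h1 => h1.innerText.trim());
        return { icoName, icoUrl, icoImgUrl };
    } finally {
        await tab.close();
    }
}

async function scrapeAllIcos(browser, icoUrls) {
    const results = [];
    // process the urls in chunks so only CONCURRENCY tabs are open at once
    for (let i = 0; i < icoUrls.length; i += CONCURRENCY) {
        const chunk = icoUrls.slice(i, i + CONCURRENCY);
        results.push(...await Promise.all(chunk.map(url => scrapeIco(browser, url))));
    }
    return results;
}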
