Puppeteer to Cheerio: scraping specific data from a dynamic website - node.js

I wanted to scrape certain data from a mutual fund website so that I can track only selected funds instead of all of them.
So I tried Puppeteer to scrape the dynamic table generated by the website. I manage to get the table, but when I try to parse it with Cheerio, nothing seems to happen.
const puppeteer = require('puppeteer');
const cheerio = require('cheerio');

const scrapeImages = async (username) => {
  console.log("test");
  const browser = await puppeteer.launch({
    args: ['--no-sandbox']
  });
  const page = await browser.newPage();
  await page.goto('https://www.publicmutual.com.my/Our-Products/UT-Fund-Prices');
  await page.waitFor(5000);
  const data = await page.evaluate(() => {
    const tds = Array.from(document.querySelectorAll('div.form-group:nth-child(4) > div:nth-child(1) > div:nth-child(1)'));
    return tds.map(td => td.innerHTML);
  });
  await browser.close();
  console.log(data);

  let $ = cheerio.load(data);
  $('table > tbody > tr > td').each((index, element) => {
    console.log($(element).text());
  });
};

scrapeImages("test");
Ultimately, I am not sure how I can do this directly with Puppeteer only, instead of handing the HTML off to Cheerio for the scraping. I would also like to scrape only selected funds. For instance, if you visit the page here https://www.publicmutual.com.my/Our-Products/UT-Fund-Prices
I would like to get only the funds with the abbreviations
PAIF
PAGF
PCIF
instead of all of them. How can I do this with only Puppeteer?

That page has jQuery already, which is even better than cheerio:
const rows = await page.evaluate(() => {
  return $('.fundtable tr').get().map(tr => $(tr).find('td').get().map(td => $(td).text()));
});
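If you only need specific funds, you can filter the rows inside page.evaluate before returning them. A minimal sketch, assuming the abbreviation (PAIF, PAGF, PCIF) appears as the text of one of the cells in each row; adjust the matching if it is embedded in a longer string:

const wanted = ['PAIF', 'PAGF', 'PCIF'];

const rows = await page.evaluate((wanted) => {
  return $('.fundtable tr').get()
    .map(tr => $(tr).find('td').get().map(td => $(td).text().trim()))
    // Keep only rows where one of the cells matches a wanted abbreviation.
    .filter(cells => cells.some(cell => wanted.includes(cell)));
}, wanted);

console.log(rows);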

Related

querySelectorAll returning an empty selection in Node while on the browser side it contains elements

In order to learn web scraping with Puppeteer, I have started a little project which aims to extract the planned power outages from the national power supplier's website. To do that, I have to manually change the region and then retrieve the outage program list. The querySelectorAll request I use on the browser side looks totally fine, as it contains all the displayed outages without fail. But when I use it on the server side, I receive an empty list.
Here is my code; the URL of the website can be found in it.
Thanks in advance!
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch({ headless: false });
  const page = await browser.newPage();
  await page.goto('https://alert.eneo.cm/', { waitUntil: 'networkidle0' });

  await page.evaluate(() => {
    var region = "Littoral";
    var j = $('#regions option:contains(' + region + ')');
    $('#regions').val(j.val()).change();
  });

  const outages = await page.evaluate(() => {
    const elements = document.querySelectorAll("#contentdata .outage");
    return elements;
  });

  console.log(outages);
})();
I see there is a list of power outages on the page you want to scrape. Here is how you can get the power outage data for the first div:
(async () => {
  let browser = await puppeteer.launch();
  let page = await browser.newPage();
  await page.goto('https://alert.eneo.cm/', { waitUntil: 'networkidle0' });
  await page.select('select[name="regions"]', '5');

  const outageData = await page.evaluate(async () => {
    let quartier = document.querySelector('div[class="quartier"]').innerText;
    let ville = document.querySelector('div[class="ville"]').innerText;
    let observations = document.querySelector('div[class="observations"]').innerText;
    let dateAndTime = document.querySelector('div[class="prog_date"]').innerText;
    return { quartier, ville, observations, dateAndTime };
  });

  await browser.close();
  console.log(outageData);
})();
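If you want every outage rather than just the first one, you can map over all of them inside page.evaluate. A minimal sketch, assuming each #contentdata .outage element (the selector from your own code) contains the same quartier, ville, observations and prog_date divs:

const allOutages = await page.evaluate(() => {
  const outages = Array.from(document.querySelectorAll('#contentdata .outage'));
  // Return plain serializable objects, not DOM nodes
  // (DOM nodes cannot be passed back from page.evaluate).
  return outages.map(outage => ({
    quartier: outage.querySelector('.quartier').innerText,
    ville: outage.querySelector('.ville').innerText,
    observations: outage.querySelector('.observations').innerText,
    dateAndTime: outage.querySelector('.prog_date').innerText
  }));
});

console.log(allOutages);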

Selecting a radio button with Puppeteer

I am trying to fetch data and trigger an automatic buying process on the following website: https://www.klwines.com/
I was using Puppeteer methods with Node.js to drive the script. As shown in the screenshot provided, I got stuck on an issue where I cannot select one of the radio buttons from the list, since all the radio buttons have the same id. What I am trying to do is select the last radio button from the list and then trigger the button shown in the image. I was using the following Node.js code with Puppeteer.
await page.waitForNavigation();
await page.waitForSelector('[name="continue"]');
const radio = await page.evaluate("table tr:nth-child(4) > td > input[type=radio]")
radio.click()
Please note that the page variable is defined as follows.
const browser = await puppeteer.launch({ headless: false });
const page = await browser.newPage();
If someone can help me find a way to do this, I would be really grateful.
You can try this way:
const puppeteer = require('puppeteer');

exports.yourStatus = async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://www.klwines.com/');

  const data = await page.evaluate(() => {
    function cleanData(element) {
      // Collect every element carrying this id (it is duplicated on the page).
      const items = element.querySelectorAll('#Shepmente_0__shepmentewayCode');
      return [...items].map(item => item.outerHTML);
    }
    return cleanData(document);
  });

  await browser.close();
  return data;
};
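As for picking the last radio button when the ids are duplicated, one option is to select by position instead of by id. A rough sketch, reusing the table and [name="continue"] selectors from the question (they may need adjusting to the actual markup):

await page.waitForSelector('[name="continue"]');

// Click the last radio button in the table from inside the page context,
// which avoids relying on the duplicated ids.
await page.evaluate(() => {
  const radios = document.querySelectorAll('table tr td input[type=radio]');
  if (radios.length) {
    radios[radios.length - 1].click();
  }
});

// Then trigger the continue button.
await page.click('[name="continue"]');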

Cannot reach a dynamic page with puppeteer

I need to read data from https://www.cmegroup.com/tools-information/quikstrike/options-calendar.html
I tried to click on the FX tab with page.click in Puppeteer, but the page remains on the default tab.
Any help is welcome.
const puppeteer = require('puppeteer');

let scrape = async () => {
  const browser = await puppeteer.launch({ headless: false });
  const page = await browser.newPage();
  await page.goto('https://www.cmegroup.com/tools-information/quikstrike/options-calendar.html');
  await page.waitFor(1000);

  // div select FX
  await page.click('#ctl00_MainContent_ucViewControl_IntegratedCMEOptionExpirationCalendar_ucViewControl_ucProductSelector_lvGroups_ctrl3_lbProductGroup');

  //browser.close();
  return result;
};

scrape().then((value) => {
  console.log(value); // Success!
});
I couldn't find the element you're looking for on that page. However, this might be helpful:
Wait for the selector to appear on the page before clicking on it:
await page.waitForSelector(selector);
If you are still facing the issue, try using the JavaScript click method:
await page.$eval(selector, elem => elem.click());
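Putting the two suggestions together, here is a short sketch using the selector from your own snippet (which I could not verify on the page, so treat it as a placeholder):

const selector = '#ctl00_MainContent_ucViewControl_IntegratedCMEOptionExpirationCalendar_ucViewControl_ucProductSelector_lvGroups_ctrl3_lbProductGroup';

// Wait until the tab actually exists in the DOM before interacting with it.
await page.waitForSelector(selector);

// Dispatch the click from inside the page; this sometimes works where
// page.click() fails, e.g. when the element is covered or re-rendered.
await page.$eval(selector, elem => elem.click());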

How can I get all the items (src, title and URL) from a specific page using this code?

I have been working on a web scraping script in Node.js using the npm package puppeteer to get the URL, image and title of each news item on the page, but I was only able to get the URL, image and title of the first news item.
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  const url = 'https://es.cointelegraph.com/category/latest';
  await page.goto(url, { waitUntil: 'load' });

  const datos = await page.evaluate(() =>
    Array.from(document.querySelectorAll('.categories-page__list')).map(info => ({
      titulo: info.querySelector('.post-preview-item-inline__title').innerText.trim(),
      link: info.querySelector('.post-preview-item-inline__title-link').href,
      imagen: info.querySelector('.post-preview-item-inline__figure .lazy-image__wrp img').src
    }))
  );

  console.log(datos);
  await page.close();
  await browser.close();
})();
That happens because there is just one .categories-page__list element on the page, while there are many .post-preview-list-inline__item elements.
You map over the array returned from document.querySelectorAll('.categories-page__list'), but that array has just one element, so the map callback runs only once.
So replace
document.querySelectorAll('.categories-page__list')
with
document.querySelectorAll('.post-preview-list-inline__item')
and everything works.
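For reference, this is your evaluate call with only that selector swapped; the inner selectors are unchanged:

const datos = await page.evaluate(() =>
  Array.from(document.querySelectorAll('.post-preview-list-inline__item')).map(info => ({
    titulo: info.querySelector('.post-preview-item-inline__title').innerText.trim(),
    link: info.querySelector('.post-preview-item-inline__title-link').href,
    imagen: info.querySelector('.post-preview-item-inline__figure .lazy-image__wrp img').src
  }))
);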
Let me know if you need some more help 😉

How to scrape multi-level links using puppeteer js?

I am scraping the table rows of a site page using Puppeteer. I have code that scrapes the content and assigns it to an object for each row in the table. Each table row contains a link that I need to open in a new page (with Puppeteer) and then scrape for a particular element, assign it to the same object, and return the whole object with the new keys. How is that possible with Puppeteer?
async function run() {
  const browser = await puppeteer.launch({
    headless: false
  });
  const page = await browser.newPage();
  await page.goto('https://tokenmarket.net/blockchain/', { waitUntil: 'networkidle0' });
  await page.waitFor(5000);

  var onlink = '';
  var result = await page.$$eval('table > tbody tr .col-actions a:first-child', (els) => Array.from(els).map(function(el) {
    // running ajax requests to load the inner page links.
    $.get(el.children[0].href, function(response) {
      onlink = $(response).find('#page-wrapper > main > div.container > div > table > tbody > tr > td:nth-child(2)').text();
    });
    return {
      icoImgUrl: el.children[0].children[0].children[0].currentSrc,
      icoDate: el.children[2].innerText.split('\n').shift() === 'To be announced' ? null : new Date(el.children[2].innerText.split('\n').shift()).toISOString(),
      icoName: el.children[1].children[0].innerText,
      link: el.children[1].children[0].children[0].href,
      description: el.children[3].innerText,
      assets: onlink
    };
  }));

  console.log(result);
  UpcomingIco.insertMany(result, function(error, docs) {});
  browser.close();
}

run()
If you try opening a new tab for each ICO page in parallel, you might end up with 100+ pages loading at the same time.
So the best thing you can do is first collect the URLs and then visit them one by one in a loop.
This also keeps the code simple and readable.
For example (please see my comments):
const browser = await puppeteer.launch({ headless: false });
const page = await browser.newPage();

await page.goto('https://tokenmarket.net/blockchain/');

// Gather assets page urls for all the blockchains
const assetUrls = await page.$$eval(
  '.table-assets > tbody > tr .col-actions a:first-child',
  assetLinks => assetLinks.map(link => link.href)
);

const results = [];

// Visit each assets page one by one
for (let assetsUrl of assetUrls) {
  await page.goto(assetsUrl);

  // Now collect all the ICO urls.
  const icoUrls = await page.$$eval(
    '#page-wrapper > main > div.container > div > table > tbody > tr > td:nth-child(2) a',
    links => links.map(link => link.href)
  );

  // Visit each ICO one by one and collect the data.
  for (let icoUrl of icoUrls) {
    await page.goto(icoUrl);
    const icoImgUrl = await page.$eval('#asset-logo-wrapper img', img => img.src);
    const icoName = await page.$eval('h1', h1 => h1.innerText.trim());

    // TODO: Gather all the needed info like description etc here.
    results.push({
      icoName,
      icoUrl,
      icoImgUrl
    });
  }
}

// Results are ready
console.log(results);
browser.close();
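If you still want to store the results with Mongoose as in your original snippet, you can do it once the loop has finished (UpcomingIco is the model from your own code):

// Persist the scraped data, reusing the model from the question.
await UpcomingIco.insertMany(results);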
