How to loop over divs with the same class with Puppeteer - node.js

Using Puppeteer to scrape a page, I'm able to get the contents from a list of divs with the same class, and from a nested list of divs within those, i.e.:
<div class="parent">
<div class="child"></div>
</div>
<div class="parent">
<div class="child"></div>
<div class="child"></div>
</div>
<div class="parent">
<div class="child"></div>
...
</div>
...
Now my problem is that I need to iterate over the list again and run page.click() on the child class divs to open lightboxes, select an element in the lightbox to click, and then run page.pdf() on it.
I currently have a for loop over the parent class divs, and an inner for loop over the child class divs. I'm not sure how to select the right div with the for loop index values as there is no nth-of-class etc.
I simply want to run something like
for (let a = 0; a < data.length; a++) {
for (let b = 0; b < data[a].length; b++) {
await page.click('.parent[a] .child[b]');
// other code here...
}
}
to open the lightbox, then a
await page.waitForSelector('.ReactModal')
to scrape the lightbox HTML, and then run
await page.pdf({
path: dir + "/"+ filename,
format: 'A4'
});
Any guidance on possible approaches would be appreciated.

If I understand correctly, you can try something like this:
for (const parent of await page.$$('.parent')) {
for (const child of await parent.$$('.child')) {
await child.click();
await page.waitForSelector('.ReactModal'); // maybe check if this is not the same lightbox
await page.pdf(/*...*/);
}
}
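Element handles collected before a click can go stale if the page re-renders when a lightbox opens or closes. A minimal sketch, assuming that may happen here, which re-queries by index on every pass. `visitChildren` and `onOpen` are hypothetical names, and `page` is expected to behave like a Puppeteer Page:

```javascript
// Hypothetical helper: visits every .child inside every .parent by index.
// Handles are re-queried on each pass instead of being held across clicks.
async function visitChildren(page, onOpen) {
  const parentCount = (await page.$$('.parent')).length;
  for (let a = 0; a < parentCount; a++) {
    for (let b = 0; ; b++) {
      const parent = (await page.$$('.parent'))[a];
      const child = parent && (await parent.$$('.child'))[b];
      if (!child) break; // no more children under this parent
      await child.click();
      await page.waitForSelector('.ReactModal');
      // The caller does the lightbox work here: click inside it,
      // call page.pdf(), and close it before the next pass.
      await onOpen(a, b);
    }
  }
}
```

The callback should close the lightbox before returning; otherwise the next waitForSelector call may resolve against the previous modal, which is the concern raised in the comment above.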

Related

How to scrape from 2 divs that are on the same level with Cheerio

I'm trying to web scrape content from 2 different divs that are on the same level. I'm using NodeJS, Axios, Cheerio and Express.
Basically, I'm trying to collect an image and the info related to it, but they are placed in different divs that are on the same level. Using the "main" div doesn't seem to work in my case.
<div class="main">
<div class="one">
// image
</div>
<div class="two">
// info
</div>
</div>
Below is my code to get the data from a website:
var leafletList = $('.store-flyer__info', html).each(function() {
let leaflet = {
title: $(this).find('h3').text(),
image: $(this).find('source').attr('srcset'),
link: $(this).find('a').attr('href'),
validDate: $(this).find('small').text().slice(3,-1)
}
leaflets.push(leaflet)
})
Below is the website's HTML:
The way my code is right now, it's obviously getting only the title, link and validDate. Does anyone know how I can get the srcset from the other div? I've also tried the following method, but it doesn't work:
var leafletList = $('.store-flyers', html).each(function() {
let leaflet = {
title: $(this).find('.store-flyer__info h3').text(),
image: $(this).find('.store-flyer__front source').attr('srcset'),
link: $(this).find('.store-flyer__info a').attr('href'),
validDate: $(this).find('.store-flyer__info small').text().slice(3,-1)
}
leaflets.push(leaflet)
})
There are many ways to get the result based on the HTML snippet you show, with the caveat that the developer tools can be misleading. They show elements created after page load with JS, which you won't have if you're only requesting the raw page HTML.
With that in mind, here are a few options:
const cheerio = require("cheerio"); // ^1.0.0-rc.12
const html = `
<div class="store-flyer">
<picture>
<source srcset="foo.jpeg" type="image/webp">
<source srcset="bar.jpeg" type="image/jpeg">
</picture>
</div>
<div class="store-flyer">
<picture>
<source srcset="quux.jpeg" type="image/webp">
<source srcset="garply.jpeg" type="image/jpeg">
</picture>
</div>
`;
const $ = cheerio.load(html);
const result = [...$(".store-flyer")].map(e => ({
// select using `.first()` and `.last()` Cheerio methods:
firstImage: $(e).find("source").first().attr("srcset"),
secondImage: $(e).find("source").last().attr("srcset"),
// select using CSS attribute selectors:
firstImageByType: $(e).find('source[type="image/webp"]').attr("srcset"),
secondImageByType: $(e).find('source[type="image/jpeg"]').attr("srcset"),
// select as an array of all <source> elements:
allImages: [...$(e).find("source")].map(e => $(e).attr("srcset")),
}));
console.log(result);
Output:
[
{
firstImage: 'foo.jpeg',
secondImage: 'bar.jpeg',
firstImageByType: 'foo.jpeg',
secondImageByType: 'bar.jpeg',
allImages: [ 'foo.jpeg', 'bar.jpeg' ]
},
{
firstImage: 'quux.jpeg',
secondImage: 'garply.jpeg',
firstImageByType: 'quux.jpeg',
secondImageByType: 'garply.jpeg',
allImages: [ 'quux.jpeg', 'garply.jpeg' ]
}
]
Prepending .store-flyer__front to your source selectors might be a good idea if you need to disambiguate.
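A srcset value is not always a bare URL: it can hold several comma-separated candidates, each optionally followed by a width or density descriptor. A small sketch for reducing such a value to a list of URLs, assuming the URLs themselves contain no commas or spaces (`srcsetUrls` is a hypothetical helper name, not part of Cheerio):

```javascript
// Hypothetical helper: splits a srcset string into its candidate URLs,
// dropping descriptors such as "2x" or "480w".
// Assumes the URLs contain no commas or whitespace.
function srcsetUrls(srcset) {
  if (!srcset) return []; // attr() returns undefined when the attribute is missing
  return srcset
    .split(",")
    .map((candidate) => candidate.trim().split(/\s+/)[0])
    .filter(Boolean);
}

console.log(srcsetUrls("foo.jpeg 1x, foo-large.jpeg 2x"));
// [ 'foo.jpeg', 'foo-large.jpeg' ]
```

It could be applied to any of the values above, e.g. srcsetUrls($(e).find("source").first().attr("srcset")).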
With cheerio, you can access node properties such as:
parentNode
previousSibling
nextSibling
nodeValue
firstChild
childNodes
lastChild
<div class="main">
<div class="one">
// image
</div>
<div class="two">
// info
</div>
</div>
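.main.firstChild is .one
.one.nextSibling is .two
.main.lastChild is .two
.two.previousSibling is .one
(This holds when there is no whitespace between the tags. With indented markup like the snippet above, the whitespace is parsed into text nodes that sit between the elements.)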
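One caveat with these raw properties: when the markup is indented, the whitespace between tags becomes text nodes, so nextSibling may return a text node rather than the next div. A minimal sketch that skips ahead to the next element. It assumes the nodes expose the DOM-style nodeType property (1 for elements), and `nextElement` is a hypothetical helper name:

```javascript
// Hypothetical helper: follows nextSibling until it reaches an element node.
// Assumes DOM-style nodes where nodeType === 1 marks an element.
function nextElement(node) {
  let current = node.nextSibling;
  while (current && current.nodeType !== 1) {
    current = current.nextSibling; // skip text and comment nodes
  }
  return current || null;
}
```

With Cheerio-wrapped selections this is usually unnecessary, since methods such as .next(), .prev() and .children() already work on elements only, e.g. $('.one').next() for the .two div.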

Puppeteer: Empty Children Array

I am attempting to perform a web scraping operation and would like to get all the child elements in an HTML tree similar to this:
<div class="main">
<p>Some p</p>
<a>Some a</a>
<br>
<br>
<em>
<p>Another p</p>
<a>Another a</a>
<br>
<br>
<em>
//...
</div>
I scraped the HTML using Puppeteer like so and managed to get the children, but only as strings. Here are my attempts:
const children = await page.evaluate(el => el.children, await page.$('div.main'))
console.log(children)
//prints {"1": {}, "2": {}, "3": {} ...}
I then referred to this post and this post, and attempted this:
const children = await page.evaluate(() => {
var children = [...document.querySelector('div.main').children];
return children.map((e) => e.outerHTML);
})
console.log(children)
//prints all children correctly, but all as strings
Is there a way to get all child elements under a tag with all the DOM attributes retained, so that I can loop over each element, perform some algorithmic operation and extract some attributes?
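For context, DOM elements cannot come back from page.evaluate as live objects, because only serializable values cross over from the browser. One possible approach, sketched here rather than taken from the thread, is to copy the needed properties into plain objects inside the page. `describeChildren` is a hypothetical name:

```javascript
// Hypothetical page function: turns each child element into a plain,
// serializable object so it survives the trip back to Node.
function describeChildren(root) {
  return Array.from(root.children, (el) => ({
    tag: el.tagName.toLowerCase(),
    text: (el.textContent || "").trim(),
    attributes: Object.fromEntries(
      Array.from(el.attributes, (attr) => [attr.name, attr.value])
    ),
  }));
}

// Usage with Puppeteer (not executed here):
// const children = await page.$eval('div.main', describeChildren);
```

If live handles are needed instead of copies, page.$$('div.main > *') returns one ElementHandle per child, and each handle can be evaluated separately.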

cheerio selection of a list

On a page I need to scrape (with node.js and cheerio), I have this pattern:
<h2>
<span id="2015"></span>
<span class="ignore-me"></span>
</h2>
<div>
<ol>
<li>
<a title="TITLE1" href="HREF1"></a>
<a class="image" title="ignore-me-1" href="ignore-me-1"></a>
</li>
...
<li>
<a title="TITLE2" href="HREF2"></a>
<a class="image" title="ignore-me-2" href="ignore-me-2"></a>
</li>
</ol>
</div>
I would like to extract a list with TITLEs and HREFs.
I am trying something like this:
$('h2 > span[id="2015"]').next('ol > li > a').each(function(index, element) {
console.log('title:', element.attr('title'), 'href:', element.attr('href'));
});
without success (each loop is never entered...).
Any suggestion?
The ol element isn't actually the next element of span#2015. The ol element is inside a div which is the next element of h2. The right tree traversal is:
$('h2 > span[id="2015"]')
.parent()
.next('div')
.find('ol > li > a:not([class])')
.each(function() {
var $el = $(this);
console.log('title:', $el.attr('title'), 'href:', $el.attr('href'));
});
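The selector does match span#2015; what comes back empty is .next('ol > li > a'). .next() only looks at the immediately following sibling, which here is span.ignore-me, so there is nothing to loop over.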
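You could also do it by looping over the anchor tags. Note that the callback receives a raw element, so wrap it with $() before calling .attr():
$("a").each(function(i, e) {
var $e = $(e);
if ($e.attr('title') && $e.attr('href')) console.log("... stuff ...");
});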
Or you can give your h2 an id, or remove the id from your selector. Many ways to loop.

Nested ListView or Nested Repeater

I am trying to create a nested repeater or a nested list view using WinJS 4.0, but I am unable to figure out how to bind the data source of the inner listview/repeater.
Here is a sample of what I am trying to do (note that the control could be Repeater, which I would prefer):
HTML:
<div id="myList" data-win-control="WinJS.UI.ListView">
<span data-win-bind="innerText: title"></span>
<div data-win-control="WinJS.UI.ListView">
<span data-win-bind="innerText: name"></span>
</div>
</div>
JS:
var myList = element.querySelector('#myList').winControl;
var myData = [
{
title: "line 1",
items: [
{name: "item 1.1"},
{name: "item 1.2"}
]
},
{
title: "line 2",
items: [
{name: "item 2.1"},
{name: "item 2.2"}
]
}
];
myList.data = new WinJS.Binding.List(myData);
When I try this, nothing renders for the inner list. I have attempted trying to use this answer Nested Repeaters Using Table Tags and this one WinJS: Nested ListViews but I still seem to have the same problem and was hoping it was a little less complicated (like KnockOut).
I know it is mentioned that WinJS doesn't support nested ListViews, but that seems to be from a few years ago and I am hoping that is no longer the issue.
Update
I was able to get the nested repeater to work correctly, thanks to Kraig's answer. Here is what my code looks like:
HTML:
<div id="myTemplate" data-win-control="WinJS.Binding.Template">
<div>
<span>Bucket:</span><span data-win-bind="innerText: name"></span>
<span>Amount:</span><input type="text" data-win-bind="value: amount" />
<button class="removeBucket">X</button>
<div id="bucketItems" data-win-control="WinJS.UI.Repeater"
data-win-options="{template: select('#myTemplate')}"
data-win-bind="winControl.data: lineItems">
</div>
</div>
</div>
<div id="budgetBuckets" data-win-control="WinJS.UI.Repeater"
data-win-options="{data: Data.buckets,template: select('#myTemplate')}">
</div>
JS: (after the "use strict" statement)
WinJS.Namespace.define("Data", {
buckets: new WinJS.Binding.List([
{
name: "A",
amount: 5,
lineItems: new WinJS.Binding.List( [
{ name: 'test item1', amount: 50 },
{ name: 'test item2', amount: 25 }
]
)
}
])
})
*Note that this answers part of my question, however, I would really like to do this all after a repo call and set the repeater data source programmatically. I am going to keep working towards that and if I get it I will post that as the accepted answer.
The HTML Repeater control sample for Windows 8.1 has an example in scenario 6 with a nested Repeater, and in this case the Repeater is created through a Template control. That's a good place to start. (I discuss this sample in Chapter 7 of Programming Windows Store Apps with HTML, CSS, and JavaScript, 2nd Edition, starting on page 372, or 374 for the nested part.)
Should still work with WinJS 4, though I haven't tried it.
Ok, so I have to give much credit to Kraig because he got me on the correct path to getting this worked out and the referenced book Programming Windows Store Apps with HTML, CSS, and JavaScript, 2nd Edition is amazing.
The original issue was a combination of not using templates correctly (using curly braces in the data-win-bind attribute), not structuring my HTML correctly and not setting the child lists as WinJS.Binding.List data sources. Below is the final working code structure to create a nested repeater when binding the data from code only:
HTML:
This is the template for the child lists. It looks similar, but I plan on adding more things, so I wanted it separate instead of recursive as referenced in the book. Note that the inner div after the template control declaration was important for me.
<div id="bucketItemTemplate" data-win-control="WinJS.Binding.Template">
<div>
<span>Description:</span>
<span data-win-bind="innerText: description"></span>
<span>Amount:</span>
<input type="text" data-win-bind="value: amount" />
<button class="removeBucketItem">X</button>
</div>
</div>
This is the main repeater template for the lists. Note that the inner div after the template control declaration was important for me. Another key point was using the "winControl.data" property against the property name of the child lists.
<div id="bucketTemplate" data-win-control="WinJS.Binding.Template">
<div>
<span>Bucket:</span>
<span data-win-bind="innerText: bucket"></span>
<span>Amount:</span>
<input type="text" data-win-bind="value: amount" />
<button class="removeBucket">X</button>
<div id="bucketItems" data-win-control="WinJS.UI.Repeater"
data-win-options="{template: select('#bucketItemTemplate')}"
data-win-bind="winControl.data: lineItems">
</div>
</div>
</div>
This is the main control element for the nested repeater and it is pretty basic.
<div id="budgetBuckets" data-win-control="WinJS.UI.Repeater"
data-win-options="{template: select('#bucketTemplate')}">
</div>
JavaScript:
The JavaScript came down to a few simple steps:
Getting the winControl
var bucketsControl = element.querySelector('#budgetBuckets').winControl;
Looping through the elements and making the child lists into Binding Lists (the data here is made up but could easily have come from the repo):
var bucketsData = selectedBudget.buckets;
for (var i = 0; i < bucketsData.length; i++) {
bucketsData[i].lineItems =
new WinJS.Binding.List([{ description: i, amount: i * 10 }]);
}
Then finally converting the entire data into a Binding list and setting it to the "data" property of the winControl.
bucketsControl.data = new WinJS.Binding.List(bucketsData);
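The two steps above can also be folded into one helper. This is only a sketch: `toBindingLists` is a hypothetical name, and the list constructor is passed in as a parameter (WinJS.Binding.List in the app) so the function can be exercised outside WinJS. Unlike the loop above, it wraps whatever child array each bucket already has and copies the buckets instead of modifying them in place:

```javascript
// Hypothetical helper: wraps each bucket's child array in a binding list,
// then wraps the whole collection. BindingList is expected to be a
// constructor that accepts an array, such as WinJS.Binding.List.
function toBindingLists(buckets, BindingList, childKey) {
  const wrapped = buckets.map((bucket) =>
    Object.assign({}, bucket, {
      // Buckets without a child array get an empty list.
      [childKey]: new BindingList(bucket[childKey] || []),
    })
  );
  return new BindingList(wrapped);
}

// Usage in the page (not executed here):
// bucketsControl.data =
//   toBindingLists(selectedBudget.buckets, WinJS.Binding.List, 'lineItems');
```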
*Note that this is the entire JavaScript file, for clarity.
(function () {
"use strict";
var nav = WinJS.Navigation;
WinJS.UI.Pages.define("/pages/budget/budget.html", {
// This function is called whenever a user navigates to this page. It
// populates the page elements with the app's data.
ready: function (element, options) {
// TODO: Initialize the page here.
var bindableBuckets;
require(['repository'], function (repo) {
//we can setup our save button here
var appBar = document.getElementById('appBarBudget').winControl;
appBar.getCommandById('cmdSave').addEventListener('click', function () {
//do save work
}, false);
repo.getBudgets(nav.state.budgetSelectedIndex).done(function (selectedBudget) {
var budgetContainer = element.querySelector('#budgetContainer');
WinJS.Binding.processAll(budgetContainer, selectedBudget);
var bucketsControl = element.querySelector('#budgetBuckets').winControl;
var bucketsData = selectedBudget.buckets;
for (var i = 0; i < bucketsData.length; i++)
{
bucketsData[i].lineItems = new WinJS.Binding.List([{ description: i, amount: i * 10 }]);
}
bucketsControl.data = new WinJS.Binding.List(bucketsData);
});
});
WinJS.UI.processAll();
}
});
})();

How to efficiently do web scraping in Node.js?

I am trying to scrape some data from a shopping site Express.com. Here's 1 of many products that contains image, price, title, color(s).
<div class="cat-thu-product cat-thu-product-all item-1">
<div class="cat-thu-p-cont reg-thumb" id="p-50715" style="position: relative;"><img class="cat-thu-p-ima widget-app-quickview" src="http://t.express.com/com/scene7/s7d5/=/is/image/expressfashion/25_323_2516_900/i81?$dcat191$" alt="ROCCO SLIM FIT SKINNY LEG CORDUROY JEAN"><img id="widget-quickview-but" class="widget-ie6png glo-but-css-off2" src="/assets/images/but/cat/but-cat-quickview.png" alt="Express View" style="position: absolute; left: 50px;"></div>
<ul>
<li class="cat-cat-more-colors">
<div class="productId-50715">
<img class="js-swatchLinkQuickview" title="INK BLUE" src="http://t.express.com/com/scene7/s7d5/=/is/image/expressfashion/25_323_2516_900_s/i81?$swatch$" width="16" height="6" alt="INK BLUE">
<img class="js-swatchLinkQuickview" title="GRAPHITE" src="http://t.express.com/com/scene7/s7d5/=/is/image/expressfashion/25_323_2516_924_s/i81?$swatch$" width="16" height="6" alt="GRAPHITE">
<img class="js-swatchLinkQuickview" title="MERCURY GRAY" src="http://t.express.com/com/scene7/s7d5/=/is/image/expressfashion/25_323_2516_930_s/i81?$swatch$" width="16" height="6" alt="MERCURY GRAY">
<img class="js-swatchLinkQuickview" title="HARVARD RED" src="http://t.express.com/com/scene7/s7d5/=/is/image/expressfashion/25_323_2516_853_s/i81?$swatch$" width="16" height="6" alt="HARVARD RED">
</div>
</li>
<li class="cat-thu-name"><a href="/rocco-slim-fit-skinny-leg-corduroy-jean-50715-647/control/show/3/index.pro" onclick="var x=".tl(";s_objectID="http://www.express.com/rocco-slim-fit-skinny-leg-corduroy-jean-50715-647/control/show/3/index.pro_2";return this.s_oc?this.s_oc(e):true">ROCCO SLIM FIT SKINNY LEG CORDUROY JEAN
</a></li>
<li>
<strong>$88.00</strong>
</li>
<li class="cat-thu-promo-text"><font color="BLACK" style="font-weight:normal">Buy 1, Get 1 50% Off</font>
</li>
</ul>
The very naive and possibly error-prone approach I've taken is to first grab all prices, images, titles and colors:
var price_objects = $('.cat-thu-product li strong');
var image_objects = $('.cat-thu-p-ima');
var name_objects = $('.cat-thu-name a');
var color_objects = $('.cat-cat-more-colors div');
Next, I populate arrays with the data from DOM extracted using jsdom or cheerio scraping libraries for node.js. (Cheerio in this case).
// price info
for (var i = 0; i < price_objects.length; i++) {
prices.push(price_objects[i].children[0].data);
}
// image links
for (var i = 0; i < image_objects.length; i++) {
images.push(image_objects[i].attribs.src.slice(0, -10));
}
// name info
for (var i = 0; i < name_objects.length; i++) {
names.push(name_objects[i].children[0].data);
}
// color info
for (var i = 0; i < color_objects.length; i++) {
colors.push(color_objects[i].attribs.src);
}
Lastly, based on the assumption that price, title, image and colors will match up create a product object:
for (var i = 0; i < images.length; i++) {
items.push({
id: i,
name: names[i],
price: prices[i],
image: images[i],
colors: colors[i]
});
}
This method is slow, error-prone, and very anti-DRY. I was thinking it would be nice if we could grab $('.cat-thu-product') and using a single for-loop extract relevant information from a single product a time.
But have you ever tried traversing the DOM in jsdom or cheerio? I am not sure how anyone can even comprehend it. Could someone show how would I use this proposed method of scraping, by grabbing $('.cat-thu-product') div element containing all relevant information and then extract necessary data?
Or perhaps there is a better way to do this?
I would suggest still using jQuery (because it's easy, fast and secure), with a single .each loop:
var items = [];
$('div.cat-thu-product').each(function(index, productElement) {
var product = {
id: $('div.cat-thu-p-cont', productElement).attr('id'),
name: $('li.cat-thu-name a', productElement).text().trim(),
price: $('ul li strong', productElement).text(),
image: $('.cat-thu-p-ima', productElement).attr('src'),
colors: []
};
// Adding colors array
$('.cat-cat-more-colors div img', productElement).each(function(index, colorElement) {
product.colors.push({name: $(colorElement).attr('alt'), imageUrl: $(colorElement).attr('src')});
});
items.push(product);
});
console.log(items);
And to validate that you have all the required fields, you can easily write a validator or test. But even if you are using a different library, you should still loop through the "div.cat-thu-product" elements.
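Such a validator could look roughly like the following sketch. The list of required fields and the name `missingFields` are assumptions chosen to match the product object built above:

```javascript
// Assumed set of fields every scraped product should have.
const REQUIRED_FIELDS = ["id", "name", "price", "image"];

// Hypothetical validator: returns the names of required fields that are
// undefined, null, or blank after trimming.
function missingFields(product) {
  return REQUIRED_FIELDS.filter((field) => {
    const value = product[field];
    return value === undefined || value === null || String(value).trim() === "";
  });
}

const sample = {
  id: "p-50715",
  name: "ROCCO SLIM FIT SKINNY LEG CORDUROY JEAN",
  price: "",
  image: undefined,
  colors: [],
};
console.log(missingFields(sample)); // [ 'price', 'image' ]
```

Inside the .each loop, a product could then be pushed only when missingFields(product).length is 0, and logged otherwise.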
Try node.io https://github.com/chriso/node.io/wiki
This would be a good way of doing what you are trying to do.
Using https://github.com/rc0x03/node-promise-parser:
var products = [];
pp('website.com/products')
.find('div.cat-thu-product')
.set({
'id': 'div.cat-thu-p-cont #id',
'name': 'li.cat-thu-name a',
'price': 'ul li strong',
'image': '.cat-thu-p-ima',
'colors[]': '.cat-cat-more-colors div img #alt',
})
.get(function(product) {
console.log(product);
products.push(product);
})

Resources