How to scrape from 2 divs that are on the same level with Cheerio - node.js

I'm trying to web scrape content from 2 different divs that are on the same level. I'm using NodeJS, Axios, Cheerio and Express.
Basically, I'm trying to collect an image and the info related to it, but they are placed in different divs that are on the same level. Selecting the "main" div doesn't seem to work in my case.
<div class="main">
<div class="one">
// image
</div>
<div class="two">
// info
</div>
</div>
Below is my code to get the data from a website:
var leafletList = $('.store-flyer__info', html).each(function() {
let leaflet = {
title: $(this).find('h3').text(),
image: $(this).find('source').attr('srcset'),
link: $(this).find('a').attr('href'),
validDate: $(this).find('small').text().slice(3,-1)
}
leaflets.push(leaflet)
})
Below is the website's HTML:
The way my code is right now, it's obviously getting only the title, link and validDate. Does anyone know how I can get the srcset from the other div? I've also tried the following method, but it doesn't work:
var leafletList = $('.store-flyers', html).each(function() {
let leaflet = {
title: $(this).find('.store-flyer__info h3').text(),
image: $(this).find('.store-flyer__front source').attr('srcset'),
link: $(this).find('.store-flyer__info a').attr('href'),
validDate: $(this).find('.store-flyer__info small').text().slice(3,-1)
}
leaflets.push(leaflet)
})

There are many ways to get the result based on the HTML snippet you show, with the caveat that the developer tools can be misleading: they show elements created after page load with JS, which you won't have if you're only requesting the raw page HTML.
With that in mind, here are a few options:
const cheerio = require("cheerio"); // ^1.0.0-rc.12
const html = `
<div class="store-flyer">
<picture>
<source srcset="foo.jpeg" type="image/webp">
<source srcset="bar.jpeg" type="image/jpeg">
</picture>
</div>
<div class="store-flyer">
<picture>
<source srcset="quux.jpeg" type="image/webp">
<source srcset="garply.jpeg" type="image/jpeg">
</picture>
</div>
`;
const $ = cheerio.load(html);
const result = [...$(".store-flyer")].map(e => ({
// select using `.first()` and `.last()` Cheerio methods:
firstImage: $(e).find("source").first().attr("srcset"),
secondImage: $(e).find("source").last().attr("srcset"),
// select using CSS attribute selectors:
firstImageByType: $(e).find('source[type="image/webp"]').attr("srcset"),
secondImageByType: $(e).find('source[type="image/jpeg"]').attr("srcset"),
// select as an array of all <source> elements:
allImages: [...$(e).find("source")].map(e => $(e).attr("srcset")),
}));
console.log(result);
Output:
[
{
firstImage: 'foo.jpeg',
secondImage: 'bar.jpeg',
firstImageByType: 'foo.jpeg',
secondImageByType: 'bar.jpeg',
allImages: [ 'foo.jpeg', 'bar.jpeg' ]
},
{
firstImage: 'quux.jpeg',
secondImage: 'garply.jpeg',
firstImageByType: 'quux.jpeg',
secondImageByType: 'garply.jpeg',
allImages: [ 'quux.jpeg', 'garply.jpeg' ]
}
]
Prepending .store-flyer__front to your source selectors might be a good idea if you need to disambiguate.

With cheerio, you can access node properties such as:
parentNode
previousSibling
nextSibling
nodeValue
firstChild
childNodes
lastChild
<div class="main">
<div class="one">
// image
</div>
<div class="two">
// info
</div>
</div>
Ignoring any whitespace text nodes between the tags:
.main.firstChild is .one
.one.nextSibling is .two
.main.lastChild is .two
.two.previousSibling is .one
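In markup with line breaks between the tags, the raw sibling of .one is usually a whitespace text node rather than .two. Cheerio's own .next(), .prev() and .siblings() methods only return elements, so they are often the simpler choice. To show what skipping those text nodes means at the raw node level, here is a minimal sketch. The nextElement helper and the hand-built node objects are assumptions for illustration (they only mimic the type and nextSibling properties); a real tree would come from cheerio.load(html).

```javascript
// Walk forward through raw sibling nodes until an element (tag) node is found.
// Assumes nodes that expose `type` and `nextSibling`, as Cheerio's nodes do.
function nextElement(node) {
  let current = node.nextSibling;
  while (current && current.type !== "tag") {
    current = current.nextSibling; // skip whitespace text nodes, comments, etc.
  }
  return current || null;
}

// Hand-built stand-ins for <div class="one">, a whitespace text node and
// <div class="two">. These are illustrative objects, not real parser output.
const one = { type: "tag", name: "div", attribs: { class: "one" } };
const gap = { type: "text", data: "\n" };
const two = { type: "tag", name: "div", attribs: { class: "two" } };
one.nextSibling = gap;
gap.nextSibling = two;
two.nextSibling = null;

console.log(nextElement(one).attribs.class); // two
```

Here one.nextSibling is the text node, while nextElement(one) reaches the second div.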

Related

Extract links using cheerio (with puppeteer)

I am using puppeteer & cheerio and new to this.
Here is the pertinent HTML page source code snippet:
<section class="descr">
<div class="center">
<a class="mfp-image" href="https://site.pics/store/1234/cat/img.jpg" title="Full size: 642x642" target="_blank"><img class="lazy 123" src="/assets/images/blank.gif" data-src="https://site.pics/store/1234/cat/th_img.jpg" alt="Image"></a>
</div>
<div class="info">JPG | 500px | 1MB 22.11.2021</div>
<hr id='more-3948099'>
<br>
<div class="blockSpoiler dl-links"><span class="fixHeader" id="download-links"></span><i class="sa sa-download-spoiler pl1em"></i><span class="blockTitle pl0">Get from file storage </span></div>
<div class="blockSpoiler-content txtleft c-dl-links"><a rel="external nofollow noopener" href="https://link1.net/file/a8eaa368334d6214a03e0e648f6e55d4/ssic4Bl4nkin.html" target="_blank">HOST1</a>
<br><a rel="external nofollow noopener" href="https://link2.file/view/EB54B4FD06B9297/ssic4Bl4nkin" target="_blank">HOST2</a>
<br><a rel="external nofollow noopener" href="http://www.link3.com/file/3xdhcvtkfnh4/fjJ3ssic4Bl4nkin" target="_blank">HOST3</a>
<br><a rel="external nofollow noopener" href="https://www.link4.com/riwtuwz9vjr3" target="_blank">HOST4</a>
<br>
</div>
I need to get these links:
https://site.pics/store/1234/cat/img.jpg
https://link1.net/file/a8eaa368334d6214a03e0e648f6e55d4/ssic4Bl4nkin.html
https://link2.file/view/EB54B4FD06B9297/ssic4Bl4nkin
http://www.link3.com/file/3xdhcvtkfnh4/fjJ3ssic4Bl4nkin
https://www.link4.com/riwtuwz9vjr3
Please note that there could be a link5 also in some cases (not shown in this case)
I used this code in the Chrome Developer tools:
document.querySelector("div.blockSpoiler-content.txtleft.c-dl-links").innerHTML
document.querySelector("div.blockSpoiler-content.txtleft.c-dl-links").outerHTML
I am able to get a lot of text that includes what is needed, along with unwanted text too. I have been trying for more than just a few hours, but have not been able to make any more progress.
When I write code using cheerio, I do not get any useful output:
const html = await page.content();
const $ = cheerio.load(html);
console.log($("div.blockSpoiler-content.txtleft.c-dl-links"));
console.log($("div.blockSpoiler-content.txtleft.c-dl-links").innerHTML);
console.log($("div.blockSpoiler-content.txtleft.c-dl-links").outerHTML);
Any help is appreciated.
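Cheerio objects don't have innerHTML or outerHTML properties, which is why those logs print nothing useful. To collect the links, map over the anchors and read each href: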
const $ = cheerio.load(html);
var urls = $('a[href]').map(function() {return $(this).attr('href') || '';}).toArray();
console.log('urls', urls);
In this case though, using puppeteer is better:
let urls = await page.$$eval('a', as => as.map(a => a.href))

How to use text between the tags in a svelte component?

Let's say I have two components:
Bold1.svelte:
<script>
export let t = "";
</script>
<b>{t}</b>
Usage:
<Bold1 t="my text 1" />
Works like expected.
Bold2.svelte:
<script>
</script>
<b>???</b>
Usage:
<Bold2>
my text 2
</Bold2>
What do I have to write instead of ??? to get a bold my text 2? I have tried <b>{this}</b>, but without success.
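Render the content passed between the tags with a <slot />. If you also need it as text, bind the wrapper element and read its textContent on mount: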
App.svelte:
<script>
import Child from './Child.svelte';
</script>
<Child>Hi</Child>
Child.svelte
<script>
import { onMount } from 'svelte';
let thisObj;
let text = '';
onMount(() => {
text = thisObj.textContent;
});
</script>
<div bind:this={thisObj}>
<slot />
</div>
<h3>
Slot content-1: {text}
</h3>
Are you trying to pass HTML and have it render as HTML?
If var t has HTML, you can render it like this:
{@html t}
https://svelte.dev/docs
Just watch out for XSS risk.

How do I create show page based on id of item clicked

I am creating a list of items by looping with the .map function. I want each of these items to be rendered on its own page with some other details.
import React from 'react'
import {faArrowRight, faMusic, faPlay, faPlayCircle, faTachometerAlt} from "@fortawesome/free-solid-svg-icons";
import {FontAwesomeIcon} from "@fortawesome/react-fontawesome";
import music from '../mocks/music.json'
import { Link } from 'gatsby'
import Music from '../pages/music'
const newData = music.map( (data) => {
return (
<div className="row no-gutters justify-content-between align-items-center">
<div className="col-auto">
<button className="btn-gradient btn-circle">
<FontAwesomeIcon icon={faPlayCircle} />
</button>
</div>
<div className="col">
<div className="music-list-content">
<span className="artist">{data.author}</span>
<Link to={`/music/${data.id}`}>{data.title}</Link>
<span className="play">
<FontAwesomeIcon icon={faPlay} /> {data.duration}
</span>
</div>
</div>
<div className="col-auto">
<span className="badge-dark badge">{data.genre}</span>
</div>
</div>
)
})
const membersToRender = music.filter(member => member.id)
const numRows = membersToRender.length
const Musics = () => {
return (
<div>
<div className="title">
<h5>New Music</h5>
<span>{numRows} new songs</span>
</div>
<div>
<div className="music-list card-wrapper">
{newData}
</div>
</div>
<div className="footer-wrapper">
<div>
<FontAwesomeIcon icon={faMusic} />
<span>Song Library</span>
</div>
<FontAwesomeIcon icon={faArrowRight} />
</div>
</div>
)
}
export default Musics
I created a link which, whenever I click it, takes me to another page (page not found) with the id appended and a .js extension.
How do I go about this? I want to click on the title and have the item displayed on a full page.
Your logic seems good, but you are missing the most important part: the page creation. Since you are not creating the pages, all of your links are broken.
In Gatsby, you have two different ways of creating pages:
Using gatsby-node.js to create pages dynamically: when dealing with a large amount of data, like your JSON, it's easier to let Gatsby take care of creating the pages. Since you are sourcing from a JSON file, you already have everything you need to create dynamic pages.
const path = require("path")
// Implement the Gatsby API “createPages”. This is called once the
// data layer is bootstrapped to let plugins create pages from data.
exports.createPages = async ({ graphql, actions, reporter }) => {
const { createPage } = actions
const musics = require("./data/mocks/musics.json")
const musicTemplate = path.resolve(`src/templates/music-template.js`)
musics.forEach((music) => {
createPage({
path: `/music/${music.slug}`,
component: musicTemplate,
context: {
title: music.title,
description: music.description,
// and so on for the rest of the fields
},
})
})
}
Note: I'm assuming that your JSON is properly defined and formatted, having all the fields I queried.
Your musicTemplate must be a template (inside /templates folder).
Notice that you are passing some fields through Gatsby's context. This means that those fields will be available through props.pageContext in your template. So, there, create a template like:
import React from "react"
import Layout from "../components/layout"
export default function MusicTemplate({pageContext}) {
return (
<Layout>
<div>Hello musician {pageContext.title}</div>
</Layout>
)
}
So, as I said, with this approach you are creating dynamic pages based on your JSON file. They will be available at localhost:8000/music/{music.slug}, and all references and links that point there will be valid.
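One thing to check: the links in the question are built from data.id, while the createPages code above builds paths from music.slug. The two need to agree, or the links will still lead to a 404. A minimal sketch of the id-based variant follows. The buildMusicPages helper, the template path string and the sample entries are assumptions for illustration, not part of Gatsby's API.

```javascript
// Build the argument objects that would be handed to Gatsby's createPage.
// `templatePath` and the sample entries below are placeholders.
function buildMusicPages(musics, templatePath) {
  return musics.map((music) => ({
    path: `/music/${music.id}`, // must match <Link to={`/music/${data.id}`}>
    component: templatePath,
    context: { id: music.id, title: music.title },
  }));
}

const sampleMusics = [
  { id: 1, title: "First song" },
  { id: 2, title: "Second song" },
];
const pages = buildMusicPages(sampleMusics, "src/templates/music-template.js");
console.log(pages.map((page) => page.path)); // [ '/music/1', '/music/2' ]
```

Inside exports.createPages, each of these objects would then be passed to createPage, for example with pages.forEach((page) => createPage(page)).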
I would also recommend using a static query (useStaticQuery) to retrieve the data from your JSON in that loop. If you create a static query from that data (in a separate component), you will be able to fetch it on demand across your project and reuse that piece of logic. It's better than requiring the JSON file directly.
You can follow this guide from the great Jason Lengstorf which is mostly what you need.
Adding .js files in your /pages folder: Gatsby infers the internal structure of your /pages folder and creates pages according to that structure. For instance, if you have a structure like /pages/musicians/name1.js, Gatsby will create a page at localhost:8000/musicians/name1.
As has been said, the first approach fits your requirements and is preferred for this use case, since the second one is less scalable and maintainable.
You should do some routing with React-Router (https://reactrouter.com/web/example/basic).
The link has to point to a Route inside a Switch, as in the example at that link.

Why does Videogular put the video source on controller.config instead of on the $scope?

I have a basic Videogular video player setup to play videos from Firebase Storage. In the HTML view this works:
<div ng-controller="MyController as controller" class="videogular-container">
<videogular vg-theme="controller.config.theme.url">
<vg-media vg-src="controller.config.sources" vg-native-controls="true"></vg-media>
</videogular>
</div>
In the controller this works:
var ref = firebase.database().ref(); // Create Firebase reference
var obj = $firebaseObject(ref.child($routeParams.id)); // get the record with the key passed in from the URL
var controller = this; // controller refers to the controller object
obj.$loaded( // wait until the async data loads from the remote Firebase
function(data) {
// video player
controller.config = { // provides an object to the controller
preload: "auto",
sources: [
// My Firebase video
{src: $sce.trustAsResourceUrl($scope.wordObject.videos[0].videoURL), type: "video/" + $scope.wordObject.videos[0].videoMediaFormat},
// The Videogular test videos
{src: $sce.trustAsResourceUrl("http://static.videogular.com/assets/videos/videogular.mp4"), type: "video/mp4"},
{src: $sce.trustAsResourceUrl("http://static.videogular.com/assets/videos/videogular.webm"), type: "video/webm"},
{src: $sce.trustAsResourceUrl("http://static.videogular.com/assets/videos/videogular.ogg"), type: "video/ogg"}
],
theme: {
url: "http://www.videogular.com/styles/themes/default/latest/videogular.css"
}
};
},
function(error) {
console.log("Error: ", error)
});
Everything works, to play one video. Now I want to dynamically access arrays of videos by theme. E.g., the user clicks to see all my cat videos or clicks another button to see all my dog videos. I have the Firebase Storage URLs on the $scope and ng-repeat prints out the URLs in the view:
<div class="row">
<div class="col-sm-12 col-md-12 col-lg-12 text-center">
<h3>{{currentTheme}}</h3>
<div>
<div ng-repeat="video in currentVideos">
{{video.videoURL}}
</div>
</div>
</div>
</div>
That works great too. So to spin out a series of video players with all my cat videos I just have to make an ng-repeat with a new video player for each video, with the vg-src coming from the $scope:
<div class="row">
<div class="col-sm-12 col-md-12 col-lg-12 text-center">
<h3>{{currentTheme}}</h3>
<div>
<div ng-repeat="video in currentVideos">
<div ng-controller="MyController as controller" class="videogular-container">
<videogular vg-theme="controller.config.theme.url">
<vg-media vg-src="{{video.videoURL}}" vg-native-controls="true"></vg-media>
</videogular>
</div>
</div>
</div>
</div>
</div>
That doesn't work. The error is Error: [$parse:syntax], meaning there's an Angular syntax error. The syntax error goes away when I change the vg-src back to vg-src="controller.config.sources":
<div class="row">
<div class="col-sm-12 col-md-12 col-lg-12 text-center">
<h3>{{currentWord}}</h3>
<div>
<div ng-repeat="video in currentVideos">
<div ng-controller="EnglishController as controller" class="videogular-container">
<videogular vg-theme="controller.config.theme.url">
<vg-media vg-src="controller.config.sources" vg-native-controls="true"></vg-media>
</videogular>
</div>
</div>
</div>
</div>
</div>
That works. The problem is that vg-src="controller.config.sources" works but vg-src="{{video.videoURL}}" doesn't work. Why can't Videogular source videos from the $scope?
I tried to put my video sources from the $scope onto controller.config in the controller but this never worked. Should I try to do this again tomorrow? (It's late and I'm getting confused trying to figure out why I can't put my video sources from the $scope onto controller.config in the controller.)
I wrote the question before I went to bed and woke up with (what I hope is) the answer. {{video.videoURL}} inserts the URLs of the videos. controller.config.sources inserts an object with a lot of stuff. I'll try making an array of configured objects and see what happens!
...
Yep, that worked! I wrote a tutorial for a Videogular minimum install, using the $scope instead of controller.config. I don't understand why the official How To Start tutorial uses controller.config instead of the $scope.
...
I can get the one video to play from my array of cat videos when the user clicks "Cat Videos" but I can't get ng-repeat to spin out all the videos in the array.
In the controller when the user clicks the "Cat Videos" button the handler accesses the array of cat videos on Firebase Storage, iterates through the array with forEach, for each video in the array it creates a variable for the videoSource and another variable for the video file format (videoSourceType), then makes a videoObject with an array of sources and a theme, then pushes the videoObject into the array $scope.videoObjects.
$scope.videoObjects = [];
$scope.showVideosOfTheme = function() {
theme.videos.forEach(function(video) { // iterate through the array of videos
var i = 0;
var videoSource = $scope.currentVideos[i].videoURL; // set the video source
var videoSourceType = $scope.currentVideos[i].videoMediaFormat; // set the video format
var videoObject = { // make a video object
preload: "auto",
sources: [
{src: $sce.trustAsResourceUrl(videoSource), type: "video/" + videoSourceType},
],
theme: {
url: "http://www.videogular.com/styles/themes/default/latest/videogular.css"
}
};
$scope.videoObjects.push(videoObject);
i++;
});
};
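One detail in the handler above: i is declared inside the forEach callback, so it is reset to 0 on every iteration and every video object ends up pointing at the first entry of currentVideos. A minimal sketch that uses the callback's own video argument instead is shown here. The buildVideoObjects helper, the trustUrl parameter (standing in for $sce.trustAsResourceUrl) and the sample data are assumptions for illustration.

```javascript
// Build one Videogular-style config object per video.
// `trustUrl` stands in for $sce.trustAsResourceUrl; `themeUrl` is passed in.
function buildVideoObjects(videos, trustUrl, themeUrl) {
  return videos.map(function (video) {
    return {
      preload: "auto",
      sources: [
        { src: trustUrl(video.videoURL), type: "video/" + video.videoMediaFormat },
      ],
      theme: { url: themeUrl },
    };
  });
}

// Made-up sample data; the identity function replaces $sce outside Angular.
const sampleVideos = [
  { videoURL: "cat1.mp4", videoMediaFormat: "mp4" },
  { videoURL: "cat2.webm", videoMediaFormat: "webm" },
];
const videoObjects = buildVideoObjects(sampleVideos, (url) => url, "theme.css");
console.log(videoObjects.map((v) => v.sources[0].src)); // [ 'cat1.mp4', 'cat2.webm' ]
```

In the controller this could be assigned as $scope.videoObjects = buildVideoObjects($scope.currentVideos, $sce.trustAsResourceUrl, themeUrl), which also avoids appending duplicates on repeated clicks.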
In the HTML view ng-repeat iterates through the array $scope.videoObjects and for each video object spins out a new Videogular video player using the theme and the sources. This doesn't work and the error message is Error: [$parse:syntax], in other words, an Angular syntax error.
<div ng-repeat="video in videoObjects" class="videogular-container">
<videogular vg-theme="{{video.theme.url}}">
<vg-media vg-src="{{video.sources}}" vg-native-controls="true"></vg-media>
</videogular>
</div>
I'll keep working on it!

Nested ListView or Nested Repeater

I am trying to create a nested repeater or a nested list view using WinJS 4.0, but I am unable to figure out how to bind the data source of the inner listview/repeater.
Here is a sample of what I am trying to do (note that the control could be Repeater, which I would prefer):
HTML:
<div id="myList" data-win-control="WinJS.UI.ListView">
<span data-win-bind="innerText: title"></span>
<div data-win-control="WinJS.UI.ListView">
<span data-win-bind="innerText: name"></span>
</div>
</div>
JS:
var myList = element.querySelector('#myList').winControl;
var myData = [
{
title: "line 1",
items: [
{name: "item 1.1"},
{name: "item 1.2"}
]
},
{
title: "line 2",
items: [
{name: "item 2.1"},
{name: "item 2.2"}
]
}
];
myList.data = new WinJS.Binding.List(myData);
When I try this, nothing renders for the inner list. I have tried following this answer, Nested Repeaters Using Table Tags, and this one, WinJS: Nested ListViews, but I still seem to have the same problem and was hoping it would be a little less complicated (like KnockOut).
I know it is mentioned that WinJS doesn't support nested ListViews, but that seems to have been a few years ago, and I am hoping it is no longer the case.
Update
I was able to get the nested repeater to work correctly, thanks to Kraig's answer. Here is what my code looks like:
HTML:
<div id="myTemplate" data-win-control="WinJS.Binding.Template">
<div>
<span>Bucket:</span><span data-win-bind="innerText: name"></span>
<span>Amount:</span><input type="text" data-win-bind="value: amount" />
<button class="removeBucket">X</button>
<div id="bucketItems" data-win-control="WinJS.UI.Repeater"
data-win-options="{template: select('#myTemplate')}"
data-win-bind="winControl.data: lineItems">
</div>
</div>
</div>
<div id="budgetBuckets" data-win-control="WinJS.UI.Repeater"
data-win-options="{data: Data.buckets,template: select('#myTemplate')}">
</div>
JS: (after the "use strict" statement)
WinJS.Namespace.define("Data", {
buckets: new WinJS.Binding.List([
{
name: "A",
amount: 5,
lineItems: new WinJS.Binding.List( [
{ name: 'test item1', amount: 50 },
{ name: 'test item2', amount: 25 }
]
)
}
])
})
*Note that this answers part of my question, however, I would really like to do this all after a repo call and set the repeater data source programmatically. I am going to keep working towards that and if I get it I will post that as the accepted answer.
The HTML Repeater control sample for Windows 8.1 has an example in scenario 6 with a nested Repeater, and in this case the Repeater is created through a Template control. That's a good place to start. (I discuss this sample in Chapter 7 of Programming Windows Store Apps with HTML, CSS, and JavaScript, 2nd Edition, starting on page 372, or 374 for the nested part.)
Should still work with WinJS 4, though I haven't tried it.
Ok, so I have to give much credit to Kraig because he got me on the correct path to getting this worked out and the referenced book Programming Windows Store Apps with HTML, CSS, and JavaScript, 2nd Edition is amazing.
The original issue was a combination of not using templates correctly (using curly braces in the data-win-bind attribute), not structuring my HTML correctly and not setting the child lists as WinJS.Binding.List data sources. Below is the final working code structure to create a nested repeater when binding the data from code only:
HTML:
This is the template for the child lists. It looks similar, but I plan on adding more things, so I wanted it separate instead of recursive as referenced in the book. Note that the inner div after the template control declaration was important for me.
<div id="bucketItemTemplate" data-win-control="WinJS.Binding.Template">
<div>
<span>Description:</span>
<span data-win-bind="innerText: description"></span>
<span>Amount:</span>
<input type="text" data-win-bind="value: amount" />
<button class="removeBucketItem">X</button>
</div>
</div>
This is the main repeater template for the lists. Note that the inner div after the template control declaration was important for me. Another key point was using the "winControl.data" property against the property name of the child lists.
<div id="bucketTemplate" data-win-control="WinJS.Binding.Template">
<div>
<span>Bucket:</span>
<span data-win-bind="innerText: bucket"></span>
<span>Amount:</span>
<input type="text" data-win-bind="value: amount" />
<button class="removeBucket">X</button>
<div id="bucketItems" data-win-control="WinJS.UI.Repeater"
data-win-options="{template: select('#bucketItemTemplate')}"
data-win-bind="winControl.data: lineItems">
</div>
</div>
</div>
This is the main control element for the nested repeater and it is pretty basic.
<div id="budgetBuckets" data-win-control="WinJS.UI.Repeater"
data-win-options="{template: select('#bucketTemplate')}">
</div>
JavaScript:
The JavaScript came down to a few simple steps:
Getting the winControl
var bucketsControl = element.querySelector('#budgetBuckets').winControl;
Looping through the elements and making the child lists into Binding Lists (the data here is made up but could easily have come from the repo):
var bucketsData = selectedBudget.buckets;
for (var i = 0; i < bucketsData.length; i++) {
bucketsData[i].lineItems =
new WinJS.Binding.List([{ description: i, amount: i * 10 }]);
}
Then finally converting the entire data into a Binding list and setting it to the "data" property of the winControl.
bucketsControl.data = new WinJS.Binding.List(bucketsData);
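If the repo already returns each bucket's lineItems as a plain array, the same two steps can wrap the existing data instead of made-up items. This is a minimal sketch under that assumption. The wrapLineItems helper is hypothetical, ListCtor is expected to be WinJS.Binding.List in the app, and FakeList is only a stand-in so the sketch runs outside WinJS.

```javascript
// Wrap each bucket's plain `lineItems` array in a bindable list, then wrap
// the buckets themselves. `ListCtor` would be WinJS.Binding.List in the app.
function wrapLineItems(buckets, ListCtor) {
  buckets.forEach(function (bucket) {
    bucket.lineItems = new ListCtor(bucket.lineItems || []);
  });
  return new ListCtor(buckets);
}

// Stand-in constructor (not part of WinJS) so the sketch runs anywhere.
function FakeList(items) {
  this.items = items;
  this.length = items.length;
}

const bound = wrapLineItems(
  [{ bucket: "A", amount: 5, lineItems: [{ description: "test item1", amount: 50 }] }],
  FakeList
);
console.log(bound.length, bound.items[0].lineItems.length); // 1 1
```

In the page code this would read bucketsControl.data = wrapLineItems(selectedBudget.buckets, WinJS.Binding.List).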
*Note that this is the entire JavaScript file, for clarity.
(function () {
"use strict";
var nav = WinJS.Navigation;
WinJS.UI.Pages.define("/pages/budget/budget.html", {
// This function is called whenever a user navigates to this page. It
// populates the page elements with the app's data.
ready: function (element, options) {
// TODO: Initialize the page here.
var bindableBuckets;
require(['repository'], function (repo) {
//we can setup our save button here
var appBar = document.getElementById('appBarBudget').winControl;
appBar.getCommandById('cmdSave').addEventListener('click', function () {
//do save work
}, false);
repo.getBudgets(nav.state.budgetSelectedIndex).done(function (selectedBudget) {
var budgetContainer = element.querySelector('#budgetContainer');
WinJS.Binding.processAll(budgetContainer, selectedBudget);
var bucketsControl = element.querySelector('#budgetBuckets').winControl;
var bucketsData = selectedBudget.buckets;
for (var i = 0; i < bucketsData.length; i++)
{
bucketsData[i].lineItems = new WinJS.Binding.List([{ description: i, amount: i * 10 }]);
}
bucketsControl.data = new WinJS.Binding.List(bucketsData);
});
});
WinJS.UI.processAll();
}
});
})();

Resources