Extract links using cheerio (with puppeteer) - node.js

I am using puppeteer and cheerio, and I am new to both.
Here is the pertinent HTML page source code snippet:
<section class="descr">
<div class="center">
<a class="mfp-image" href="https://site.pics/store/1234/cat/img.jpg" title="Full size: 642x642" target="_blank"><img class="lazy 123" src="/assets/images/blank.gif" data-src="https://site.pics/store/1234/cat/th_img.jpg" alt="Image"></a>
</div>
<div class="info">JPG | 500px | 1MB 22.11.2021</div>
<hr id='more-3948099'>
<br>
<div class="blockSpoiler dl-links"><span class="fixHeader" id="download-links"></span><i class="sa sa-download-spoiler pl1em"></i><span class="blockTitle pl0">Get from file storage </span></div>
<div class="blockSpoiler-content txtleft c-dl-links"><a rel="external nofollow noopener" href="https://link1.net/file/a8eaa368334d6214a03e0e648f6e55d4/ssic4Bl4nkin.html" target="_blank">HOST1</a>
<br><a rel="external nofollow noopener" href="https://link2.file/view/EB54B4FD06B9297/ssic4Bl4nkin" target="_blank">HOST2</a>
<br><a rel="external nofollow noopener" href="http://www.link3.com/file/3xdhcvtkfnh4/fjJ3ssic4Bl4nkin" target="_blank">HOST3</a>
<br><a rel="external nofollow noopener" href="https://www.link4.com/riwtuwz9vjr3" target="_blank">HOST4</a>
<br>
</div>
I need to get these links:
https://site.pics/store/1234/cat/img.jpg
https://link1.net/file/a8eaa368334d6214a03e0e648f6e55d4/ssic4Bl4nkin.html
https://link2.file/view/EB54B4FD06B9297/ssic4Bl4nkin
http://www.link3.com/file/3xdhcvtkfnh4/fjJ3ssic4Bl4nkin
https://www.link4.com/riwtuwz9vjr3
Please note that there could be a link5 also in some cases (not shown in this case)
I used this code in the Chrome Developer tools:
document.querySelector("div.blockSpoiler-content.txtleft.c-dl-links").innerHTML
document.querySelector("div.blockSpoiler-content.txtleft.c-dl-links").outerHTML
This returns a lot of text that includes what I need, along with unwanted text too. I have been trying for several hours but have not been able to make any more progress.
When I write code using cheerio, I do not get any useful output:
const html = await page.content();
const $ = cheerio.load(html);
console.log($("div.blockSpoiler-content.txtleft.c-dl-links"));
console.log($("div.blockSpoiler-content.txtleft.c-dl-links").innerHTML);
console.log($("div.blockSpoiler-content.txtleft.c-dl-links").outerHTML);
Any help is appreciated.

This should help.
const $ = cheerio.load(html);
var urls = $('a[href]').map(function() {return $(this).attr('href') || '';}).toArray();
console.log('urls', urls);

Since you are already using puppeteer, though, you can skip cheerio and read the links directly from the page:
let urls = await page.$$eval('a', as => as.map(a => a.href))

Related

How to scrape from 2 divs that are on the same level with Cheerio

I'm trying to scrape content from two different divs that are on the same level. I'm using Node.js, Axios, Cheerio and Express.
Basically, I'm trying to collect an image and the info related to it, but they are placed in different divs that are on the same level. Selecting through the parent "main" div doesn't seem to work in my case.
<div class="main">
<div class="one">
// image
</div>
<div class="two">
// info
</div>
</div>
Below is my code to get the data from a website:
var leafletList = $('.store-flyer__info', html).each(function() {
let leaflet = {
title: $(this).find('h3').text(),
image: $(this).find('source').attr('srcset'),
link: $(this).find('a').attr('href'),
validDate: $(this).find('small').text().slice(3,-1)
}
leaflets.push(leaflet)
})
Below is the website's HTML:
The way my code is right now, it obviously gets only the title, link and validDate. Does anyone know how I can get the srcset from the other div? I've also tried the following, but it doesn't work:
var leafletList = $('.store-flyers', html).each(function() {
let leaflet = {
title: $(this).find('.store-flyer__info h3').text(),
image: $(this).find('.store-flyer__front source').attr('srcset'),
link: $(this).find('.store-flyer__info a').attr('href'),
validDate: $(this).find('.store-flyer__info small').text().slice(3,-1)
}
leaflets.push(leaflet)
})
There are many ways to get the result based on the HTML snippet you show, with the caveat that the developer tools can be misleading. They show elements created after page load by JS, which you won't have if you're only requesting the raw page HTML.
With that in mind, here are a few options:
const cheerio = require("cheerio"); // ^1.0.0-rc.12
const html = `
<div class="store-flyer">
<picture>
<source srcset="foo.jpeg" type="image/webp">
<source srcset="bar.jpeg" type="image/jpeg">
</picture>
</div>
<div class="store-flyer">
<picture>
<source srcset="quux.jpeg" type="image/webp">
<source srcset="garply.jpeg" type="image/jpeg">
</picture>
</div>
`;
const $ = cheerio.load(html);
const result = [...$(".store-flyer")].map(e => ({
// select using `.first()` and `.last()` Cheerio methods:
firstImage: $(e).find("source").first().attr("srcset"),
secondImage: $(e).find("source").last().attr("srcset"),
// select using CSS attribute selectors:
firstImageByType: $(e).find('source[type="image/webp"]').attr("srcset"),
secondImageByType: $(e).find('source[type="image/jpeg"]').attr("srcset"),
// select as an array of all <source> elements:
allImages: [...$(e).find("source")].map(e => $(e).attr("srcset")),
}));
console.log(result);
Output:
[
{
firstImage: 'foo.jpeg',
secondImage: 'bar.jpeg',
firstImageByType: 'foo.jpeg',
secondImageByType: 'bar.jpeg',
allImages: [ 'foo.jpeg', 'bar.jpeg' ]
},
{
firstImage: 'quux.jpeg',
secondImage: 'garply.jpeg',
firstImageByType: 'quux.jpeg',
secondImageByType: 'garply.jpeg',
allImages: [ 'quux.jpeg', 'garply.jpeg' ]
}
]
Prepending .store-flyer__front to your source selectors might be a good idea if you need to disambiguate.
With cheerio, you can also read DOM-style properties on the underlying nodes (for example $(".main")[0].firstChild), such as:
parentNode
previousSibling
nextSibling
nodeValue
firstChild
childNodes
lastChild
<div class="main">
<div class="one">
// image
</div>
<div class="two">
// info
</div>
</div>
Ignoring the whitespace text nodes between the tags:
.main.firstChild is .one
.one.nextSibling is .two
.main.lastChild is .two
.two.previousSibling is .one

use href tag for different html pages but with same url id in express

I'm trying to switch between the home and profile pages for the same user by clicking a button for each. Apart from switching between home and profile, the id in the URL should stay the same.
But after clicking the profile button, only the id changes, and it uses "profile" as the id.
(I use EJS for my views/HTML pages.)
Any idea how I can fix it? Is that even possible?
Here is my nav code:
<nav>
<div class="nav-wrapper teal darken-4">
BAZAART
<ul class="right hide-on-med-and-down">
<li><a class="waves-effect waves-light btn teal lighten-1" href="home"> <i class="material-icons right">home</i> home</a></li>
<li><a class="waves-effect waves-light btn teal lighten-1" href="profile">profile <i class="material-icons right">account_box</i></a></li>
</ul>
</div>
</nav>
homeController:
exports.sendReqParam = (req, res) => {
let userHome = req.params.userHome;
res.render("home", { name: userHome });
// res.send(`This is the homepage for ${userHome}`);
};
exports.respondWithName = (req, res) => {
let paramsName = req.params.myName;
res.render("profile", { name: paramsName });
}
main.js
app.get("/profile/:myName", homeController.respondWithName);
app.get("/", homeController.respondInfo);
app.get("/home/:userHome", homeController.sendReqParam)
I was recently making a blog website where I write a post and it is displayed on the home page. To open a specific post, instead of making a separate page for each new post, we made a single post.ejs page and matched the requested post by title with the help of lodash. I'll show an example first so it makes more sense, and then the code we used.
So the example is this: I go to the compose.ejs page and write a random post: title=Post, content=A random lorem ipsum
and let's say we write another post: title=Another post, content=Another random lorem ipsum
Every time we write a blog post, it sends us to the home page (where we currently are), which shows the two blog posts. To go to the URL of a specific post, we simply enter localhost:3000/posts/Another post and it takes us to the second post we wrote.
And this is the code we used inside app.js:
app.get("/posts/:postName", function(req, res){
const requestedTitle = _.lowerCase(req.params.postName);
posts.forEach(function(post) {
const storedTitle = _.lowerCase(post.title);
if (storedTitle === requestedTitle) {
res.render("post", {title: post.title, content: post.content});
}
});
});
In the app.js code, the route is /posts/:postName. /posts is the fixed part that shows in the URL, and :postName is a route parameter: like a variable, it stores whatever the user writes in that position.
In the second line, we use lodash's _.lowerCase to normalize what the user wrote. For example, if the user wrote Another-POST it becomes another post (lowercase words separated by spaces), and we store it in a constant called requestedTitle.
Next is a forEach loop over the posts array (where we store every post), which goes through every post and checks the names.
In the fourth line, we use lodash again for the same thing, this time on the title of each individual post, storing it in a constant called storedTitle.
Finally, there is an if statement: if both names are the same, it renders the post.ejs page, passing down the title and content of the selected post with {title: post.title, content: post.content}.
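
One gap in the handler above: when no title matches, nothing is ever sent and the request stays open. The lookup can be sketched as a plain function, which makes the missing case explicit. The helper names normalizeTitle and findPost are hypothetical, and normalizeTitle is a simplified stand-in for _.lowerCase, not a reimplementation of it.

```javascript
// Simplified stand-in for _.lowerCase: lowercases and turns runs of
// non-alphanumeric characters into single spaces. (Hypothetical helper.)
function normalizeTitle(title) {
  return title
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, " ")
    .trim();
}

// Returns the first post whose normalized title matches, or undefined.
function findPost(posts, requestedName) {
  const wanted = normalizeTitle(requestedName);
  return posts.find((post) => normalizeTitle(post.title) === wanted);
}

const posts = [
  { title: "Post", content: "A random lorem ipsum" },
  { title: "Another post", content: "Another random lorem ipsum" },
];

console.log(findPost(posts, "another-post")); // the second post
console.log(findPost(posts, "missing"));      // undefined
```

Inside the route, an undefined result could then be answered with something like res.status(404).send("Post not found") instead of leaving the request without a response.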
And this is the code we used inside the post.ejs:
<%- include("partials/header") -%>
<br>
<div class="card">
<h2 class="card-header"> <%= title %> </h2>
<div class="card-body">
<p class="card-text"> <%= content %> </p>
</div>
</div>
<%- include("partials/footer") -%>
As you can see, post.ejs isn't hard to explain. The top and bottom lines with include("partials/...") pull in the header and footer templates I use, just to save time coding. What's in between is what post.ejs renders when it gets called.
I hope it wasn't too confusing. I'm still learning to code and I hope it helps you with what you are looking for. I think this isn't the exact answer to your question, but I think it will help you find your way through.
If you need more explanation or help, this is my instagram: #cemruniversal, I'm always happy to help if I can.
Edit: 30 minutes after original post
I think I found a way it could work, I'll show you a piece of code from the same blog website.
Whenever I want to compose a new post I use this code:
app.get("/compose", function(req, res){
res.render("compose");
});
And obviously there is a form for you to write the post, and after you submit, it sends you to the home page, and saves the post. For that I used this piece of code:
app.post("/compose", function(req, res){
const post = {
title: req.body.postTitle,
content: req.body.postBody
};
posts.push(post);
res.redirect("/");
});
I had an idea for your website: what if pressing the Profile button renders one specific page on your site, and pressing another button renders another page? That could work, couldn't it?
Please try it out and tell me how it went.
I think something like this:
<nav>
<div class="nav-wrapper teal darken-4">
BAZAART
<ul class="right hide-on-med-and-down">
<li><a class="waves-effect waves-light btn teal lighten-1" href="/home"> <i class="material-icons right">home</i> home</a></li>
<li><a class="waves-effect waves-light btn teal lighten-1" href="/profile">profile <i class="material-icons right">account_box</i></a></li>
</ul>
</div>
</nav>
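
As for why the original links put "profile" where the id should be: href="home" and href="profile" are relative URLs, and a browser resolves them against the current page by replacing the last path segment. The sketch below reproduces that resolution with Node's built-in URL class; the host localhost:3000 and the user name alice are made up for illustration.

```javascript
// The page the user is currently on (hypothetical host and user name).
const current = "http://localhost:3000/home/alice";

// Relative href: replaces the last path segment, so "alice" is lost.
const relative = new URL("profile", current).pathname;

// Root-relative href that carries the name along.
const userName = "alice";
const absolute = new URL(`/profile/${encodeURIComponent(userName)}`, current).pathname;

console.log(relative); // "/home/profile"
console.log(absolute); // "/profile/alice"
```

The first result matches the /home/:userHome route with "profile" as the parameter, which is the behavior described in the question. In the EJS nav, one way around it would be to write the links as href="/home/<%= name %>" and href="/profile/<%= name %>", since both controllers already pass name to the view.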

Navigate to new url from SPFX in Sharepoint online

I have a dropdown in an SPFx web part in SharePoint Online. In that dropdown's onchange handler, I construct a URL with a # fragment.
E.g. https://sharepointonine/default.aspx#2349-234234-23434
I need to navigate to this new URL, but I am not sure how to accomplish it.
I have tried:
window.location = url //Gives error that string is not assignable to Location
window.location.href= url//does not reload the page
window.open(url, "_self")//does not reload the page
window.location.assign(url);//does not reload the page
window.location.replace(url);//does not reload the page
Any help?
You can create an 'a' element and call click() on it to open the URL:
let a = document.createElement('a');
a.href = 'your link to open';
a.click();
This works fine.
You can also use react-router, as described here.
It also provides a Redirect component:
import { Redirect } from 'react-router';
When you need to redirect to some URL, you render a Redirect:
<Redirect to={'/to url'}></Redirect>
I tested with a no-framework SPFx web part.
Test result:
Test code for your reference:
public render(): void {
this.domElement.innerHTML = `
<div class="${styles.noframeworkSpfx}">
<div class="${styles.container}">
<div class="${styles.row}">
<div class="${styles.column}">
<span class="${styles.title}">Welcome to SharePoint!</span>
<p class="${styles.subTitle}">Customize SharePoint experiences using Web Parts.</p>
<p class="${styles.description}">${escape(this.properties.description)}</p>
<a href="https://aka.ms/spfx" class="${styles.button}">
<span class="${styles.label}">Learn more</span>
</a>
<select id="option" >
<option value="test1">test1</option>
<option value="test2">test2</option>
<option value="test3">test3</option>
</select>
</div>
</div>
</div>
</div>`;
this.domElement.querySelector('#option').addEventListener('change', (e) => {
window.location.href="https://www.google.com/search?q="+e.target["value"]
})
}
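
A possible reason the attempts in the question appear to do nothing, offered as an assumption since the page's actual URL is not shown: when the new URL differs from the current one only in its # fragment, browsers update the fragment and fire hashchange instead of reloading the page. The sketch below checks for that case and forces a reload. navigateTo is a hypothetical helper, and it takes the window object as a parameter so it can be exercised outside a browser.

```javascript
// Navigate to `url`; if only the fragment differs from the current page,
// set the hash and reload explicitly. (Hypothetical helper.)
function navigateTo(url, win) {
  const current = new URL(win.location.href);
  const target = new URL(url, current);

  const sameDocument =
    target.origin === current.origin &&
    target.pathname === current.pathname &&
    target.search === current.search;

  if (sameDocument) {
    win.location.hash = target.hash;
    win.location.reload();
  } else {
    win.location.href = target.href;
  }
  return sameDocument;
}

// Minimal stand-in for `window`, just enough to run the helper in Node.
const fakeWindow = {
  location: {
    href: "https://sharepointonine/default.aspx",
    hash: "",
    reloaded: false,
    reload() { this.reloaded = true; },
  },
};

navigateTo("https://sharepointonine/default.aspx#2349-234234-23434", fakeWindow);
console.log(fakeWindow.location.hash, fakeWindow.location.reloaded);
```

In the web part the call would be navigateTo(url, window). If a full reload is not actually required, listening for the hashchange event and updating the web part from there is the lighter alternative.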

How to get className of an element in jsdom?

First time posting, so sorry if I mess something up. Below is the code I have tried:
const domPreParse = new JSDOM(incident); //incident is the html fragment I want to parse
const dom = domPreParse.window.document;
const cNameHome = dom.querySelector('[data-type="home-icon"], svg').className;
So cNameHome returns an object with only the first class name. There are multiple class names on the element (e.g. class="class1 class2"). How can I return all the classes, preferably as a space-separated string?
And this is the code I'm trying to parse:
<div class="sco" data-type="middle">
<div class="clear">
<span class="inc" data-type="home-icon"></span>
<span class="score" data-type="score"> </span>
<span class="inc" data-type="away-icon">
<svg class="inc yellowcard"><use xlink:href="#icon-yellowcard"></use></svg>
</span>
</div>
</div>
Thanks for the help.
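The problem was my CSS selector. The comma makes it a selector list, so querySelector returned the first element matching either part rather than an svg inside the home-icon element. I should have used [data-type="home-icon"] > svg.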

How can I render specific piece of data from my mongodb into my html?

It's my first project. I'm creating a website where the administrator has the privilege to change the home page texts. So I'm keeping all texts in a collection called "Texts"; each document has a text: value pair. When the home page is rendered, I use Texts.find() to return an array of objects, each with a "text" value. I link it to my home page using the index.
Like this:
<h2>Texts[0].text</h2>, so if I have 100 texts I go all the way to <h2>Texts[100].text</h2>. They are different texts and I need to put them in a specific order, so I can't just throw them into my HTML.
I know that's a stupid way to do it, so I'm looking for a better idea.
--------------Modification------------------>>>>
I tried using the find method for arrays, but that is also tiring, so something simpler would be great. Here's a portion of the code:
<div class="card bg-dark text-white">
<img src="<%=imgsArr.find(x => x.name === 'main2').src%>" class="card-img" alt="...">
<div class="card-img-overlay ">
<h2 class="card-title"><%=textArr.find(x => x.id === 14).text%></h2>
<div class="triangle-up"></div>
<hr class="ml-0 ">
<div class="triangle-down"></div>
<p class="card-text "><%=textArr.find(x => x.id === 15).text%></p>
<div class="card-topic">
<h2 class="card-title" style=";"><%=textArr.find(x => x.id === 16).text%></h2>
<p class="card-text"><%=textArr.find(x => x.id === 17).text%></p>
</div>
<div class="card-topic">
<h2 class="card-title" style=""><%=textArr.find(x => x.id === 18).text%></h2>
<p class="card-text"><%=textArr.find(x => x.id === 19).text%></p>
</div>
<div class="btnOut ">
<button class="btn btn-lg shadow-lg ">MENU</button>
</div>
</div>
</div>
The good news is that for loops are always a good option. If you are using EJS or a similar rendering engine, something like this should work, although you will have to tweak it a bit. Take a look at the EJS documentation for the exact format, but it would look something like the code below, and no matter how long the list gets it will render as much as it receives (if it is a lot, you might want to consider pagination).
<% for (const item of Texts) { %>
<% if (item.text === "some value") { %>
<h2><%= item.text %></h2>
<% } else { %>
<p><%= item.text %></p>
<% } %>
<% } %>
If your front end is React, Vue or similar, the same principle applies but the format will be different.
No offense, friend, but you need to go through some tutorials. Let me help you as best I can, though. Everything in a web application is somewhat defined, meaning it falls into one predetermined category or another. When you get JSON, for example, its shape is based on how that particular data is defined in the database. Hence you can get JSON data from your backend that looks like
{ "title": "sneaker", "price": "200", "quantity": "50" }
Now, assuming it is a list of JSON objects, you can loop through it and assign the values to tags based on their object keys (because it gets converted to a JavaScript object). So again, your code would look something like
<% for (const item of Texts) { %>
<h2><%= item.title %></h2>
<div>
<p><%= item.price %></p>
<p><%= item.quantity %></p>
</div>
<% } %>
This would be how you render data from your database (the format might not be exactly spot on). Dealing with forms is a whole other ball game, so get a book or some tutorial videos that discuss it. You will understand better how to handle it.
Your code is very rigid. At first glance there is already a pattern; work with that pattern. From what I gather, your code just needs a simple for loop.
for (let i = 0; i < 100; i++) {
  if (i % 2 === 0) {
    console.log(`<h2>${i}</h2>`);
  } else {
    console.log(`<p>${i}</p>`);
  }
}
The reason I have resorted to plain JavaScript is so you can see what the inner workings of your render should look like. You can get the total number of ids in your database (that is what the integer 100 represents in this case), loop through them one by one and produce what you want. It is still JavaScript; do not let the HTML throw you off.
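
For the page in the question, where each text has a fixed place in the layout, a loop may fit less well than a lookup table. One option, sketched here under the assumption that every document gets a unique key field (a hypothetical addition to the schema, as are the sample values), is to convert the array from Texts.find() into a plain object once, before rendering:

```javascript
// Shape assumed for the documents; `key` is a hypothetical field.
const textArr = [
  { key: "heroTitle", text: "Welcome" },
  { key: "heroSubtitle", text: "Fresh food daily" },
  { key: "menuTitle", text: "Our menu" },
];

// Build { heroTitle: "Welcome", ... } in one pass over the array.
function indexByKey(docs) {
  return Object.fromEntries(docs.map((doc) => [doc.key, doc.text]));
}

const texts = indexByKey(textArr);

console.log(texts.heroTitle); // "Welcome"
console.log(texts.menuTitle); // "Our menu"
```

Passing texts to res.render would let the template use <%= texts.heroTitle %> in place of textArr.find(x => x.id === 14).text, and reordering the page would no longer depend on array positions or numeric ids.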
