I am trying to scrape a web page and would like to get all the child elements of a node in an HTML tree similar to this:
<div class="main">
<p>Some p</p>
<a>Some a</a>
<br>
<br>
<em>
<p>Another p</p>
<a>Another a</a>
<br>
<br>
<em>
//...
</div>
I loaded the HTML with Puppeteer and managed to get the children, but only in string form. Here are my attempts:
const children = await page.evaluate(el => el.children, await page.$('div.main'))
console.log(children)
//prints {"1": {}, "2": {}, "3": {} ...}
I then referred to this post and this post, and tried this:
const children = await page.evaluate(() => {
var children = [...document.querySelector('div.main').children];
return children.map((e) => e.outerHTML);
})
console.log(children)
//prints all children correctly, but all as strings
Is there a way to get all child elements under a tag with their DOM attributes retained, so that I can loop over each element, run some logic on it, and extract some attributes?
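One possible approach, as a minimal sketch: return values of page.evaluate are serialized before they reach Node, so live DOM elements do not survive the trip. Copying the fields you need into plain objects inside the page keeps them available. The names describeChildren and getChildren are hypothetical, and page is assumed to be an already open Puppeteer Page.

```javascript
// Runs in the browser context: copies the tag name, text and attributes of
// each direct child into a plain object that survives serialization.
function describeChildren(root) {
  return Array.from(root.children).map((el) => ({
    tag: el.tagName.toLowerCase(),
    text: el.textContent.trim(),
    attributes: Object.fromEntries(
      Array.from(el.attributes, (attr) => [attr.name, attr.value])
    ),
  }));
}

// Hypothetical helper: `page` is assumed to be an open Puppeteer Page.
async function getChildren(page, selector = 'div.main') {
  // page.$eval runs describeChildren in the page, passing it the first
  // element that matches `selector`.
  return page.$eval(selector, describeChildren);
}
```

The resulting array can be looped over in Node like any other data. If the children need to be clicked or otherwise interacted with, page.$$('div.main > *') returns ElementHandles instead of plain data.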
Related
I am trying to write a bot that converts a bunch of HTML pages to Markdown so that I can import them as Jekyll documents. For this, I use Puppeteer to get the HTML document and cheerio to manipulate it.
The source HTML is pretty complex and polluted with Google Ads tags, external scripts, etc. What I need to do is get the HTML content of a predefined selector, then remove every element inside it that matches a predefined set of selectors, so that I end up with plain HTML containing just the text, which I can convert to Markdown.
Assume the source html is something like this:
<html>
<head />
<body>
<article class="post">
<h1>Title</h1>
<p>First paragraph.</p>
<script>That for some reason has been put here</script>
<p>Second paragraph.</p>
<ins>Google ADS</ins>
<p>Third paragraph.</p>
<div class="related">A block full of HTML and text</div>
<p>Fourth paragraph.</p>
</article>
</body>
</html>
What I want to achieve is something like
<h1>Title</h1>
<p>First paragraph.</p>
<p>Second paragraph.</p>
<p>Third paragraph.</p>
<p>Fourth paragraph.</p>
I defined an array of selectors that I want to strip from the source object:
stripFromText: ['.social-share', 'script', '.adv-in', '.postinfo', '.postauthor', '.widget', '.related', 'img', 'p:empty', 'div:empty', 'section:empty', 'ins'],
And wrote the following function:
const getHTMLContent = async ($, selector) => {
let value;
try {
let content = await $(selector);
for (const s of SELECTORS.stripFromText) {
// 1
content = await content.remove(s);
// 2
// await content.remove(s);
// 3
// content = await content.find(s).remove();
// 4
// await content.find(s).remove();
// 5
// const matches = await content.find(s);
// for (m of matches) {
// await m.remove();
// }
};
value = content.html();
} catch(e) {
console.log(`- [!] Unable to get ${selector}`);
}
console.log(value);
return value;
};
Where
$ is the cheerio object created with const $ = await cheerio.load(html);
selector is the DOM selector for the container (in the example above it would be .post)
What I am unable to do is use cheerio to remove() the elements. I tried all five versions left commented in the code, without success. Cheerio's documentation hasn't helped so far; I found this link, but the proposed solution did not work for me.
I was wondering if someone more experienced with cheerio could point me in the right direction, or explain to me what I am missing here.
I found a classic newbie error in my code: I was missing an await before the .remove() call.
The working function now looks like this:
const getHTMLContent = async ($, selector) => {
let value;
try {
let content = await $(selector);
for (const s of SELECTORS.stripFromText) {
console.log(`--- Stripping ${s}`);
await content.find(s).remove();
};
value = await content.html();
} catch(e) {
console.log(`- [!] Unable to get ${selector}`);
}
return value;
};
You can remove the elements with remove:
$('script,ins,div').remove()
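For the original function, a minimal sketch of the same idea applied only inside the container. Cheerio's selection and manipulation methods are synchronous, so no await is needed. Note also that content.remove(s) filters the elements already selected rather than searching their descendants, which is why find(s) comes first. The name getCleanHtml is hypothetical, and $ is assumed to be the function returned by cheerio.load(html).

```javascript
// `$` is assumed to be the function returned by cheerio.load(html).
function getCleanHtml($, containerSelector, stripSelectors) {
  const content = $(containerSelector);
  // find() matches descendants of the container; remove() detaches them
  // from the document in place.
  content.find(stripSelectors.join(',')).remove();
  // html() returns the remaining inner HTML of the container as a string.
  return content.html();
}

// Usage, assuming cheerio is installed and `html` holds the page source:
//   const $ = cheerio.load(html);
//   const clean = getCleanHtml($, '.post', ['script', 'ins', '.related']);
```

Joining the selectors with commas removes everything in one pass; looping over them one by one, as in the question, works as well.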
I'm using Puppeteer to extract the text of a span by its class name, but nothing is returned. I don't know whether that is because the page isn't loading in time.
This is my current code:
async function Reload() {
Page.reload()
Price = await Page.evaluate(() => document.getElementsByClassName("text-robux-lg wait-for-i18n-format-render"))
console.log(Price)
}
Reload()
HTML
<div class="icon-text-wrapper clearfix icon-robux-price-container">
<span class="icon-robux-16x16 wait-for-i18n-format-render"></span>
<span class="text-robux-lg wait-for-i18n-format-render">689</span>
</div>
This happens because the function you passed to Page.evaluate() returns a non-serializable value.
From the official Puppeteer documentation:
If the function passed to the page.evaluate returns a non-Serializable value, then page.evaluate resolves to undefined
So the function passed to Page.evaluate() has to return the text of the span element rather than the Element object itself.
For example:
const puppeteer = require('puppeteer');
const htmlCode = `
<div class="icon-text-wrapper clearfix icon-robux-price-container">
<span class="icon-robux-16x16 wait-for-i18n-format-render"></span>
<span class="text-robux-lg wait-for-i18n-format-render">689</span>
</div>
`;
(async () => {
const browser = await puppeteer.launch();
const page = await browser.newPage();
await page.setContent(htmlCode);
const price = await page.evaluate(() => {
const elements = document.getElementsByClassName('text-robux-lg wait-for-i18n-format-render');
return Array.from(elements).map(element => element.innerText); // returns an array of strings instead of an array of elements
})
console.log(price); // logs the text of every element that has the classes above
console.log(price[0]); // logs the text of the first element that has the classes above
// other actions...
await browser.close();
})();
NOTE: To get the HTML from another site by its URL, use page.goto() instead of page.setContent().
NOTE: Because document.getElementsByClassName() returns a collection, the function passed to page.evaluate() in the code above returns an array of strings, not a single string as it would with document.getElementById().
NOTE: To learn the difference between serializable and non-serializable objects, read the answers to this question on Stack Overflow.
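Applied to the Reload function from the question, a minimal sketch could look like the following. It also awaits the reload and waits for the span to appear, which addresses the timing concern raised in the question. The name reloadAndReadPrice and the shortened selector are assumptions, and page is assumed to be an open Puppeteer Page.

```javascript
// Hypothetical helper: `page` is assumed to be an open Puppeteer Page.
async function reloadAndReadPrice(page) {
  // Assumed selector: the span from the question, matched by one of its classes.
  const selector = 'span.text-robux-lg';
  // Wait for the reload to finish before querying the page.
  await page.reload();
  // Wait until the span is present in the DOM.
  await page.waitForSelector(selector);
  // Return a string (serializable) rather than the element itself.
  return page.$eval(selector, (el) => el.textContent.trim());
}
```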
Using Puppeteer to scrape a page, I'm able to get the contents of a list of divs with the same class, and of the nested divs within those, i.e.
<div class="parent">
<div class="child"></div>
</div>
<div class="parent">
<div class="child"></div>
<div class="child"></div>
</div>
<div class="parent">
<div class="child"></div>
...
</div>
...
My problem now is that I need to iterate over the list again and run page.click() on each child div to open a lightbox, select an element in the lightbox to click, and then run page.pdf().
I currently have a for loop over the parent divs and an inner for loop over the child divs. I'm not sure how to select the right div using the loop index values, as there is no nth-of-class selector or similar.
I simply want to run something like
for (let a = 0; a < data.length; a++) {
for (let b = 0; b < data[a].length; b++) {
await page.click('.parent[a] .child[b]');
// other code here...
}
}
to open the lightbox, then a
await page.waitForSelector('.ReactModal')
to scrape the lightbox HTML, and then run
await page.pdf({
path: dir + "/"+ filename,
format: 'A4'
});
Any guidance on possible approaches would be appreciated.
If I understand correctly, you can try something like this:
for (const parent of await page.$$('.parent')) {
for (const child of await parent.$$('.child')) {
await child.click();
await page.waitForSelector('.ReactModal'); // maybe check if this is not the same lightbox
await page.pdf(/*...*/);
}
}
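If the loop indexes are still needed, for example to give each PDF its own file name, the same idea can be sketched with indexed loops over the element handles. The name savePdfForEachChild and the file naming scheme are hypothetical; page is assumed to be an open Puppeteer Page and dir an existing output directory.

```javascript
// Hypothetical helper: clicks every .child inside every .parent and saves
// one PDF per click. `page` is assumed to be an open Puppeteer Page.
async function savePdfForEachChild(page, dir) {
  const saved = [];
  const parents = await page.$$('.parent');
  for (let a = 0; a < parents.length; a++) {
    // Query children relative to the current parent handle.
    const children = await parents[a].$$('.child');
    for (let b = 0; b < children.length; b++) {
      await children[b].click();
      await page.waitForSelector('.ReactModal');
      // Assumed naming scheme so that files do not overwrite each other.
      const file = `${dir}/parent-${a}-child-${b}.pdf`;
      await page.pdf({ path: file, format: 'A4' });
      saved.push(file);
      // Closing the lightbox before the next click is page-specific and
      // is left out of this sketch.
    }
  }
  return saved;
}
```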
After clicking the search button, Puppeteer takes a screenshot, but I can't get the element's value.
Here is my code:
await page.$eval('#textInputSelector',(el, licenceInfo) =>
(el.value = licenceInfo),licenceInfo)
const searchBtn = await page.$x('//*[@id="searchBtnXPath"]')
await searchBtn[0].click()
await page.waitFor(4000);
console.log(await page.$eval('#selector1', el => el.innerText));
await makeScreenShot(page, screenPath, { fullPage: true })
The result is marked with a red box in my screenshot (image not included here).
Its HTML output is:
<div>
<span id="#selector1" >
Your search returned no results. Please modify your search criteria
and try again.
</span>
</div>
The button's HTML:
<div id="#selector2">
<a id="searchBtnXPath" href="
javascript:
__doPostBack('ctl00$PlaceHolderMain$btnNewSearch','');
var p = new ProcessLoading();p.showLoading(false);">
<span>Search</span>
</a>
</div>
This is the error I get:
Error: failed to find element matching selector "#selector1"
Use this instead:
console.log(await page.$eval('body #selector1', el => el.innerHTML));
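A further possibility, assuming the markup in the question is accurate: the id attribute is literally "#selector1", including the leading #, while the CSS selector #selector1 looks for an element whose id is "selector1". An attribute selector matches the id literally. The sketch below also waits for the element instead of sleeping for a fixed time. The names idSelector and readResultText are hypothetical, and page is assumed to be an open Puppeteer Page.

```javascript
// Builds an attribute selector that matches the id literally, even when it
// contains characters such as '#'. JSON.stringify is used only for quoting,
// which is sufficient for simple ids like the one in the question.
function idSelector(id) {
  return `[id=${JSON.stringify(id)}]`;
}

// Hypothetical helper: `page` is assumed to be an open Puppeteer Page.
async function readResultText(page, id) {
  const selector = idSelector(id);
  // Wait for the element to appear rather than pausing for a fixed time.
  await page.waitForSelector(selector);
  return page.$eval(selector, (el) => el.innerText.trim());
}
```

For the HTML in the question, the call would be readResultText(page, '#selector1').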
I am working on a project in React. The idea is that when you search for an artist, an image renders on the page. Once you click the image, a list of collaborating artists is rendered. You can then click a name and see that person's collaborating artists. Here is my issue: rather than the state being cleared each time a new artist is clicked, the new artists are just added to the original list. Can someone help me figure out how to clear the state so that it returns a fresh list of collaborators? I've been stuck on this for hours. Here is the code:
searchForArtist(query) {
request.get(`https://api.spotify.com/v1/search?q=${query}&type=artist`)
.then((response) => {
const artist = response.body.artists.items[0];
const name = artist.name;
const id = artist.id;
const img_url = artist.images[0].url;
this.setState({
selectedArtist: {
name,
id,
img_url,
},
});
})
.then(() => {
this.getArtistAlbums();
})
.catch((err) => {
console.error(err);
});
}
getArtistCollabs() {
console.log('reached get artist collab function');
const { artistCounts } = this.state;
// console.log(artistCounts);
const artist = Object.keys(artistCounts).map((key) => {
//kate
const i = document.createElement("div");
i.innerHTML = key;
i.addEventListener('click', () => {
this.searchForArtist(key);
})
document.getElementById("collabs").appendChild(i);
});
this.setState({});
}
//kate
renderArtists() {
const artists = this.getArtistCollabs();
}
render() {
const img_url = this.state.selectedArtist.img_url;
return (
<div>
<form onSubmit={this.handleSubmit}>
<input type='text' name='searchInput' className="searchInput" placeholder="Artist" onChange={this.handleChange} />
<input type='submit' className="button" />
</form>
<img className="artist-img" src={this.state.selectedArtist.img_url}
// kate
onClick={this.renderArtists} alt="" />
<div id="collabs">
</div>
</div>
Your problem is right here:
const artist = Object.keys(artistCounts).map((key) => {
//kate
const i = document.createElement("div");
i.innerHTML = key;
i.addEventListener('click', () => {
this.searchForArtist(key);
})
document.getElementById("collabs").appendChild(i);
What you have done here is manually create HTML elements and insert them into the DOM. As soon as this takes place, React has no control over these newly created elements, so they are never cleared when the state changes. You should only manipulate the DOM like this when it's absolutely necessary. Instead, you should make a new component called something like <ArtistCollaborators> that takes the artists as props and renders them through its own render method.
This is the React way of doing it, and it lets React stay fully in control of what is rendered into the DOM.