Using Puppeteer to extract text from span - node.js

I'm using Puppeteer to extract the text of a span by its class name, but nothing is returned. I don't know whether that's because the page isn't loading in time.
This is my current code:
async function Reload() {
  Page.reload()
  Price = await Page.evaluate(() => document.getElementsByClassName("text-robux-lg wait-for-i18n-format-render"))
  console.log(Price)
}
Reload()
HTML
<div class="icon-text-wrapper clearfix icon-robux-price-container">
<span class="icon-robux-16x16 wait-for-i18n-format-render"></span>
<span class="text-robux-lg wait-for-i18n-format-render">689</span>
</div>

You get nothing back because the function you passed to Page.evaluate() returns a non-serializable value.
From the official Puppeteer documentation:
If the function passed to the page.evaluate returns a non-Serializable value, then page.evaluate resolves to undefined
So the function passed to Page.evaluate() has to return the text of the span element rather than the collection of Element objects that getElementsByClassName() produces.
For example:
const puppeteer = require('puppeteer');
const htmlCode = `
  <div class="icon-text-wrapper clearfix icon-robux-price-container">
    <span class="icon-robux-16x16 wait-for-i18n-format-render"></span>
    <span class="text-robux-lg wait-for-i18n-format-render">689</span>
  </div>
`;
(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.setContent(htmlCode);
  const price = await page.evaluate(() => {
    const elements = document.getElementsByClassName('text-robux-lg wait-for-i18n-format-render');
    // Return an array of strings (serializable) instead of an array of elements.
    return Array.from(elements).map(element => element.innerText);
  });
  console.log(price); // logs the text of every element that has both classes
  console.log(price[0]); // logs the text of the first matching element
  // other actions...
  await browser.close();
})();
NOTE: To load the HTML from another site by its URL, use page.goto() instead of page.setContent().
NOTE: Because document.getElementsByClassName() returns a collection, the function passed to page.evaluate() in the code above returns an array of strings, not a single string as it would with document.getElementById().
NOTE: For the difference between serializable and non-serializable objects, see the answers to this question on Stack Overflow.
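Two details from the question are also worth handling: Page.reload() is called without await, so the lookup may run before the reloaded page is ready, and the asker suspected a timing problem. The following is a minimal sketch, assuming `page` is an already-open Puppeteer Page. The helper name readPrice and the combined CSS selector are illustrative and do not come from the original post.

```javascript
// Hypothetical helper: reloads the page and returns the span's text.
// Assumes `page` is a Puppeteer Page that has already navigated to the target URL.
const PRICE_SELECTOR = 'span.text-robux-lg.wait-for-i18n-format-render';

async function readPrice(page) {
  // reload() returns a promise; without await the next step may run too early.
  await page.reload();
  // Wait until the span is present in the DOM before reading it.
  await page.waitForSelector(PRICE_SELECTOR);
  // $eval runs the callback in the browser and returns its serializable result.
  return page.$eval(PRICE_SELECTOR, (el) => el.textContent.trim());
}
```

Because the callback returns a string, the value survives the trip from the browser back to Node.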

Related

Cheerio how to remove DOM elements from selection

I am trying to write a bot to convert a bunch of HTML pages to markdown, in order to import them as Jekyll documents. For this, I use puppeteer to get the HTML document, and cheerio to manipulate it.
The source HTML is pretty complex, and polluted with Google ADS tags, external scripts, etc. What I need to do is get the HTML content of a predefined selector, and then remove from it every element that matches a predefined set of selectors, so that I end up with plain HTML containing just the text, which I can convert to markdown.
Assume the source HTML is something like this:
<html>
  <head />
  <body>
    <article class="post">
      <h1>Title</h1>
      <p>First paragraph.</p>
      <script>That for some reason has been put here</script>
      <p>Second paragraph.</p>
      <ins>Google ADS</ins>
      <p>Third paragraph.</p>
      <div class="related">A block full of HTML and text</div>
      <p>Fourth paragraph.</p>
    </article>
  </body>
</html>
What I want to achieve is something like:
<h1>Title</h1>
<p>First paragraph.</p>
<p>Second paragraph.</p>
<p>Third paragraph.</p>
<p>Fourth paragraph.</p>
I defined an array of selectors that I want to strip from the source object:
stripFromText: ['.social-share', 'script', '.adv-in', '.postinfo', '.postauthor', '.widget', '.related', 'img', 'p:empty', 'div:empty', 'section:empty', 'ins'],
And wrote the following function:
const getHTMLContent = async ($, selector) => {
  let value;
  try {
    let content = await $(selector);
    for (const s of SELECTORS.stripFromText) {
      // 1
      content = await content.remove(s);
      // 2
      // await content.remove(s);
      // 3
      // content = await content.find(s).remove();
      // 4
      // await content.find(s).remove();
      // 5
      // const matches = await content.find(s);
      // for (m of matches) {
      //   await m.remove();
      // }
    };
    value = content.html();
  } catch(e) {
    console.log(`- [!] Unable to get ${selector}`);
  }
  console.log(value);
  return value;
};
Where:
$ is the cheerio object, created with const $ = await cheerio.load(html);
selector is the DOM selector for the container (in the example above it would be .post)
What I am unable to do is use cheerio to remove() the objects. I tried all five versions I left commented in the code, without success. Cheerio's documentation hasn't helped so far, and I only found this link, but the proposed solution did not work for me.
I was wondering if someone more experienced with cheerio could point me in the right direction, or explain what I am missing here.
I found a classic newbie error in my code: I was missing an await before the .remove() call.
The working function now looks like this:
const getHTMLContent = async ($, selector) => {
  let value;
  try {
    let content = await $(selector);
    for (const s of SELECTORS.stripFromText) {
      console.log(`--- Stripping ${s}`);
      await content.find(s).remove();
    };
    value = await content.html();
  } catch(e) {
    console.log(`- [!] Unable to get ${selector}`);
  }
  return value;
};
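You can remove the elements directly by selecting them with a comma-separated selector and calling remove(). Cheerio's methods are synchronous, so no await is needed: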
$('script,ins,div').remove()

Puppeteer: Empty Children Array

I am attempting to perform a web scraping operation and would like to get all the child elements in an HTML tree similar to this:
<div class="main">
<p>Some p</p>
<a>Some a</a>
<br>
<br>
<em>
<p>Another p</p>
<a>Another a</a>
<br>
<br>
<em>
//...
</div>
I scraped the HTML using Puppeteer and managed to get the children, but only as strings. Here are my attempts:
const children = await page.evaluate(el => el.children, await page.$('div.main'))
console.log(children)
//prints {"1": {}, "2": {}, "3": {} ...}
I then referred to this post and this post, and tried this:
const children = await page.evaluate(() => {
  var children = [...document.querySelector('div.main').children];
  return children.map((e) => e.outerHTML);
})
console.log(children)
//prints all children correctly, but all as strings
Is there a way to get all child elements under a tag with their DOM attributes retained, so that I can loop over each element, perform some algorithmic operation and extract some attributes?
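Live DOM nodes cannot be returned from page.evaluate(), because only serializable values cross from the browser back to Node; that is why the first attempt prints empty objects. One possible approach, sketched here under assumptions, is to copy the fields you need into plain objects inside the browser context and loop over those in Node. The helper name describeChildren and the shape of the returned objects are illustrative only.

```javascript
// Hypothetical helper: returns one plain object per direct child of `parentSelector`.
// Assumes `page` is a Puppeteer Page with the document already loaded.
async function describeChildren(page, parentSelector) {
  // The callback runs in the browser, so it must not use variables from Node.
  return page.$$eval(`${parentSelector} > *`, (elements) =>
    elements.map((el) => ({
      tag: el.tagName.toLowerCase(),
      text: el.textContent.trim(),
      // Copy every attribute into a plain { name: value } object.
      attributes: Object.fromEntries(
        Array.from(el.attributes, (attr) => [attr.name, attr.value])
      ),
    }))
  );
}
```

If you need to keep interacting with the nodes themselves (for example, to click them), page.$$('div.main > *') returns ElementHandles that stay linked to the live elements.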

With Puppeteer how can I click the parent element of my selector?

The markup I have to work with looks like this:
<label>
<input type="radio" name="myfield" value="Yes" size>
</label>
I want to call page.click(selector) with the radio as the selector, but I can't. I don't think it is visible because of the size attribute.
My JavaScript looks like this:
const page = await browser.newPage();
const selector = 'input[name="myfield"]';
await page.click(selector);
So I would like to target and click the parent label element.
How do I change the value of my selector constant to target the label?
Sorry, I didn't explain that very well. By "can't", I mean that the element is not visible, and therefore I don't believe it can technically be clicked. So I think I need to target the label, which is visible, but I don't know how to target it.
Is it never visible, or just not visible at the moment?
Have you tried using waitForSelector(selector)?
const page = await browser.newPage();
const selector = 'input[name="myfield"]';
await page.waitForSelector(selector); // waiting here before click
await page.click(selector);
Or something like:
const page = await browser.newPage();
const selector = 'label';
await page.waitForSelector(selector);
await page.evaluate((_) => {
  document.querySelector('label > input[name="myfield"]').parentElement.click()
});
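The same idea can be wrapped in a small reusable helper that starts from the input selector and walks up to the enclosing label. This is a sketch under assumptions: `page` is a Puppeteer Page, the input is nested inside a label as in the markup above, and the name clickParentLabel is made up for illustration.

```javascript
// Hypothetical helper: clicks the <label> that wraps the matched input.
// Assumes `page` is a Puppeteer Page and the input is nested inside a label.
async function clickParentLabel(page, inputSelector) {
  // By default waitForSelector only waits for presence in the DOM, not visibility.
  await page.waitForSelector(inputSelector);
  const clicked = await page.$eval(inputSelector, (input) => {
    // closest('label') walks up from the input to its nearest label ancestor.
    const label = input.closest('label');
    if (!label) return false;
    label.click();
    return true;
  });
  if (!clicked) {
    throw new Error(`No <label> ancestor found for ${inputSelector}`);
  }
}
```

Usage would look like await clickParentLabel(page, 'input[name="myfield"]'). Note that this dispatches a DOM click rather than simulating a mouse click at the label's coordinates.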

puppeteer howto get element tagName

I would like to get an element's tagName. It should be button in the following example.
const puppeteer = require('puppeteer')
async function run () {
  const browser = await puppeteer.launch({headless: false})
  const page = await browser.newPage()
  const html = `
    <div>
      <button type="button">click me</button>
      <span>Some words.</span>
    </div>
  `
  await page.setContent(html)
  const elements = await page.$$('button')
  const tagName = await elements[0].$eval('*', node => node.tagName)
  console.log(tagName) // expect to be 'button'
  await browser.close()
}
run()
The error message says: Error: failed to find element matching selector "*"
I can tell that elements matched one element, because elements.length is 1.
What is wrong here?
========== Edit ==========
Let's say I already have elements beforehand; how do I get the tagName out of it?
Thanks!
Try using page.$eval to select the button, and then get the tagName from the button:
const tagName = await page.$eval('button', button => button.tagName);
If you already have an ElementHandle like elements[0], you can read a property from that element by passing the handle to page.evaluate:
const tagName = await page.evaluate(
  element => element.tagName,
  elements[0]
);
It appears your elements is an array of ElementHandles.
In that case, there may be a slightly more straightforward syntax:
const tag_name = await (await elements[0].getProperty('tagName')).jsonValue()
This does not involve referring to the page object.
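One detail to keep in mind: for elements in an HTML document, tagName is reported in upper case, so the examples above yield BUTTON rather than button. A small sketch that normalises the result, assuming `handle` is a Puppeteer ElementHandle (the helper name getTagName is made up for illustration):

```javascript
// Hypothetical helper: returns the lower-case tag name of an element handle.
// Assumes `handle` is a Puppeteer ElementHandle, e.g. one item from page.$$('button').
async function getTagName(handle) {
  // ElementHandle.evaluate runs the callback in the browser with the element as its argument.
  return handle.evaluate((node) => node.tagName.toLowerCase());
}
```

With the question's code this would be called as await getTagName(elements[0]).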
Thanks!

Uncaught Invariant Violation: Objects are not valid as a React child for HTML Form

I wrote a function to build an HTML form based on the keys and values of an object and I am trying to return the form in my render method. However, I keep getting the error:
ReactJs 0.14 - Invariant Violation: Objects are not valid as a React child
Here is my createForm() method:
createForm() {
  const obj = {
  }
  const object_fields = resourceFields.fields;
  let form = document.createElement('form');
  _.forIn(object_fields, function(field_value, field_name) {
    let div = document.createElement('div');
    div.setAttribute('className', 'form-control');
    let label = document.createElement('label');
    label.setAttribute('htmlFor', 'name');
    label.innerHTML = field_name;
    let input = document.createElement('input');
    input.setAttribute('className', 'form-control');
    input.setAttribute('type', 'text');
    input.setAttribute('ref', field_name);
    input.setAttribute('id', field_name);
    input.setAttribute('value', field_value);
    input.setAttribute('onChange', '{this.handleChange}');
    div.appendChild(label);
    div.appendChild(input);
    form.appendChild(div);
  })
  console.log(form) //this prints out fine
  return form
}
Here is my render() method:
render() {
  return (
    <div>
      {this.createForm()}
    </div>
  )
}
Does anyone know what might be happening? My form prints out in the console just fine... Thanks in advance!
You never manipulate actual DOM nodes when you're working with React. When you build your UI in the render function, the JSX markup is translated into plain JavaScript (React.createElement function calls), which builds a representation of the DOM.
So, in your case, you should return JSX in createForm, not a DOM element.
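To illustrate, the same form can be described with React.createElement-style calls (which is what JSX compiles to) instead of document.createElement. The sketch below takes the element factory as a parameter named `h` so that it runs without React installed; in a real component you would pass React.createElement or write the equivalent JSX. The function name buildForm and its parameters are assumptions, not the asker's code.

```javascript
// Hypothetical sketch: describe the form as elements instead of building DOM nodes.
// `h` is an element factory with React.createElement's signature:
//   h(type, props, ...children)
function buildForm(h, fields, handleChange) {
  const rows = Object.entries(fields).map(([name, value]) =>
    h(
      'div',
      { key: name, className: 'form-control' },
      h('label', { htmlFor: name }, name),
      h('input', {
        className: 'form-control',
        type: 'text',
        id: name,
        value: value,
        // Pass the function itself, not the string '{this.handleChange}'.
        onChange: handleChange,
      })
    )
  );
  return h('form', null, ...rows);
}
```

In render(), something like {buildForm(React.createElement, resourceFields.fields, this.handleChange)} would then yield React elements, which are valid children, rather than a DOM node.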

Resources