I am trying to write a bot to convert a bunch of HTML pages to markdown, in order to import them as Jekyll document. For this, I use puppeteer to get the HTML document, and cheerio to manipulate it.
The source HTML is pretty complex, and polluted with Google ADS tags, external scripts, etc. What I need to do, is to get the HTML content of a predefined selector, and then remove elements that match a predefined set of selectors from it in order to get a plain HTML with just the text and convert it to markdown.
Assume the source html is something like this:
<html>
<head />
<body>
<article class="post">
<h1>Title</h1>
<p>First paragraph.</p>
<script>That for some reason has been put here</script>
<p>Second paragraph.</p>
<ins>Google ADS</ins>
<p>Third paragraph.</p>
<div class="related">A block full of HTML and text</div>
<p>Forth paragraph.</p>
</article>
</body>
</html>
What I want to achieve is something like
<h1>Title</h1>
<p>First paragraph.</p>
<p>Second paragraph.</p>
<p>Third paragraph.</p>
<p>Forth paragraph.</p>
I defined an array of selectors that I want to strip from the source object:
stripFromText: ['.social-share', 'script', '.adv-in', '.postinfo', '.postauthor', '.widget', '.related', 'img', 'p:empty', 'div:empty', 'section:empty', 'ins'],
And wrote the following function:
const getHTMLContent = async ($, selector) => {
let value;
try {
let content = await $(selector);
for (const s of SELECTORS.stripFromText) {
// 1
content = await content.remove(s);
// 2
// await content.remove(s);
// 3
// content = await content.find(s).remove();
// 4
// await content.find(s).remove();
// 5
// const matches = await content.find(s);
// for (m of matches) {
// await m.remove();
// }
};
value = content.html();
} catch(e) {
console.log(`- [!] Unable to get ${selector}`);
}
console.log(value);
return value;
};
Where
$ is the cheerio object containing const $ = await cheerio.load(html);
selector is the dome selector for the container (in the example above it would be .post)
What I am unable to do, is to use cheerio to remove() the objects. I tried all the 5 versions I left commented in the code, but without success. Cheerio's documentation didn't help so far, and I just found this link but the proposed solution did not work for me.
I was wondering if someone more experienced with cheerio could point me in the right direction, or explain me what I am missing here.
I found a classical newby error in my code, I was missing an await before the .remove() call.
The working function now looks like this, and works:
const getHTMLContent = async ($, selector) => {
let value;
try {
let content = await $(selector);
for (const s of SELECTORS.stripFromText) {
console.log(`--- Stripping ${s}`);
await content.find(s).remove();
};
value = await content.html();
} catch(e) {
console.log(`- [!] Unable to get ${selector}`);
}
return value;
};
You can remove the elements with remove:
$('script,ins,div').remove()
Related
I'm using Puppeteer to extract the text of a span by it's class name but I'm getting returned nothing. I don't know if its because the page isn't loading in time or not.
This is my current code:
async function Reload() {
Page.reload()
Price = await Page.evaluate(() => document.getElementsByClassName("text-robux-lg wait-for-i18n-format-render"))
console.log(Price)
}
Reload()
HTML
<div class="icon-text-wrapper clearfix icon-robux-price-container">
<span class="icon-robux-16x16 wait-for-i18n-format-render"></span>
<span class="text-robux-lg wait-for-i18n-format-render">689</span>
</div>
because the function that you passed to Page.evaluate() returns a non-Serializable value.
from the puppeteer official document
If the function passed to the page.evaluate returns a non-Serializable value, then page.evaluate resolves to undefined
so you have to make the function that passed to Page.evaluate() returns the text of span element rather than returns the Element object of span.
like the following code
const puppeteer = require('puppeteer');
const htmlCode = `
<div class="icon-text-wrapper clearfix icon-robux-price-container">
<span class="icon-robux-16x16 wait-for-i18n-format-render"></span>
<span class="text-robux-lg wait-for-i18n-format-render">689</span>
</div>
`;
(async () => {
const browser = await puppeteer.launch();
const page = await browser.newPage();
await page.setContent(htmlCode);
const price = await page.evaluate(() => {
const elements = document.getElementsByClassName('text-robux-lg wait-for-i18n-format-render');
return Array.from(elements).map(element => element.innerText); // as you see, now this function returns array of texts instead of Array of elements
})
console.log(price); // this will log the text of all elements that have the specific class above
console.log(price[0]); // this will log the first element that have the specific class above
// other actions...
await browser.close();
})();
NOTE: if you want to get the html code from another site by its url use page.goto() instead of page.setContent()
NOTE: because you are using document.getElementsByClassName() the returned value of the function that passed to page.evaluate() in the code above will be array of texts and not text as document.getElementById() do
NOTE: if you want to know what is the difference between Serializable objects and non-serializable objects read the answers of this question on Stackoverflow
Does lit-html have by any change something like React's ref feature?
For example in the following pseudo-code inputRef would be a callback function or an object { current: ... } where lit-html could pass/set the HTMLElement instance of the input element when the input element is created/attached.
// that #ref pseudo-attribute is fictional
html`<div><input #ref={inputRef}/></div>`
Thanks.
In lit-element you can use #query property decorator. It's just syntactic sugar around this.renderRoot.querySelector().
import { LitElement, html, query } from 'lit-element';
class MyElement extends LitElement {
#query('#first')
first;
render() {
return html`
<div id="first"></div>
<div id="second"></div>
`;
}
}
lit-html renders directly to the dom so you don't really need refs like you do in react, you can use querySelector to get a reference to the rendered input
Here's some sample code if you were only using lit-html
<html>
<head>
<title>lit-html example</title>
<script type="module">
import { render, html } from 'https://cdn.pika.dev/lit-html/^1.1.2';
const app = document.querySelector('.app');
const inputTemplate = label => {
return html `<label>${label}<input value="rendered input content"></label>`;
};
// rendering the template
render(inputTemplate('Some Label'), app);
// after the render we can access it normally
console.log(app.querySelector('input').value);
</script>
</head>
<body>
<div class="app"></div>
<label>
Other random input
<input value="this is not the value">
</label>
</body>
</html>
If you're using LitElement you could access to the inner elements using this.shadowRoot.querySelector if you're using shadow dom or this.querySelector if you aren't
As #WeiChing has mentioned somewhere above, since Lit version 2.0 you can use the newly added directive ref for that:
https://lit.dev/docs/templates/directives/#ref
-- [EDIT - 6th October 2021] ----------------------------
Since lit 2.0.0 has been released my answer below
is completely obsolete and unnecessary!
Please check https://lit.dev/docs/api/directives/#ref
for the right solution.
---------------------------------------------------------
Even if this is not exactly what I have asked for:
Depending on the concrete use case, one option to consider is the use of directives.
In my very special use-case it was for example (with a little luck and a some tricks) possible to simulate more or less that ref object behavior.
const inputRef = useElementRef() // { current: null, bind: (special-lit-html-directive) }
...
return html`<div><input ref=${inputRef.bind}/></div>`
In my use case I could do the following:
Before rendering, set elementRef.current to null
Make sure that elementRef.current cannot be read while the component is rerendered (elementRef.current is not needed while rendering and an exception will be thrown if someone tries to read it in render phase)
That elementRef.bind directive will fill elementRef.current with the actual DOM element if available.
After that, elementRef.current can be read again.
For lit-html v1, you can define your own custom Derivative:
import { html, render, directive } from "lit-html";
function createRef(){ return {value: null}; }
const ref = directive((refObj) => (attributePart) => {
refObj.value = attributePart.committer.element;
});
const inputRef = createRef();
render(html`<input ref=${ref(inputRef)} />`;
// inputRef.value is a reference to rendered HTMLInputElement
It seems that if you don't inject Material-UI stylesheets into a jest/react-testing-library test then jsdom will fail to get the correct styles from your components (e.g. running getComputedStyle(component) will return the incorrect styles for the component).
How you properly setup a jest/react-testing-library test so that the styles are correctly injected into the test? I've already wrapped the components in a theme provider, which works fine.
As a workaround reinserting the whole head (or the element where JSS styles are injected) before assertion seems to apply styles correctly with both getComputedStyle() and react testing library's toHaveStyle():
import React from "react";
import "#testing-library/jest-dom/extend-expect";
import { render } from "#testing-library/react";
test("test my styles", () => {
const { getByTestId } = render(
<div data-testid="wrapper">
<MyButtonStyledWithJSS/>
</div>
);
const button = getByTestId("wrapper").firstChild;
document.head.innerHTML = document.head.innerHTML;
expect(button).toHaveStyle(`border-radius: 4px;`);
});
This will still fail though when you're using dynamic styles, like:
myButton: {
padding: props => props.spacing,
...
}
That's because JSS uses CSSStyleSheet.insertRule method to inject these styles, and it won't appear as a style node in the head. One solution to this issue is to hook into the browser's insertRule method and add incoming rules to the head as style tags. To extract all this into a function:
function mockStyleInjection() {
const defaultInsertRule = window.CSSStyleSheet.prototype.insertRule;
window.CSSStyleSheet.prototype.insertRule = function (rule, index) {
const styleElement = document.createElement("style");
const textNode = document.createTextNode(rule);
styleElement.appendChild(textNode);
document.head.appendChild(styleElement);
return defaultInsertRule.bind(this)(rule, index);
};
// cleanup function, which reinserts the head and cleans up method overwrite
return function applyJSSRules() {
window.CSSStyleSheet.prototype.insertRule = defaultInsertRule;
document.head.innerHTML = document.head.innerHTML;
};
}
Example usage of this function in our previous test:
import React from "react";
import "#testing-library/jest-dom/extend-expect";
import { render } from "#testing-library/react";
test("test my styles", () => {
const applyJSSRules = mockStyleInjection();
const { getByTestId } = render(
<div data-testid="wrapper">
<MyButtonStyledWithJSS spacing="8px"/>
</div>
);
const button = getByTestId("wrapper").firstChild;
applyJSSRules();
expect(button).toHaveStyle("border-radius: 4px;");
expect(button).toHaveStyle("padding: 8px;");
});
This ultimately seems like an issue with JSS and various browser implementations like jsdom and and Blink (at least in Chrome). You can see it in Chrome when trying to modify/enable/disable these style rules (you can't).
The behavior appears to be a result of the JSS library using the CSSOM insertRule API. There's a stylesheet generated in the DOM for the styles we expect in our component, but the tag is empty - it's just used to link the shadow CSS back to the DOM. The styles are never written to the inline stylesheet in the DOM, and as a result, the getComputedStyle method does not return the expected results.
There's an open issue to address this behavior and make development easier.
I switched my custom components to styled-components, which does not have some of these idiosyncrasies.
Material-UI is planning on transitioning soon as well.
You could add this to a custom render function. After rendering, the function pulls the styles out of cssom and puts them into a style tag. Here is an implementation:
let customRender = (ui, options) => {
let renderResult = render(ui, options);
let styleElement = document.createElement("style");
let styleText = "";
for (let styleSheet of document.styleSheets) {
for (let rule of styleSheet.cssRules) {
styleText += rule.cssText + "\n";
}
}
styleElement.textContent = styleText.slice(0, -1);
document.head.appendChild(styleElement);
// remove old style elements
let emptyStyleElements = document.head.querySelectorAll('style[data-jss=""]');
for (let element of emptyStyleElements) {
element.remove();
}
return renderResult;
}
I can't speak specifically to Material-UI stylesheets, but you can inject a stylesheet into rendered component:
import {render} from '#testing-library/react';
import fs from 'fs';
import path from 'path';
const stylesheetFile = fs.reactFileSync(path.resolve(__dirname, '../path-to-stylesheet'), 'utf-8');
const styleTag = document.createElement('style');
styleTag.type = 'text/css';
styleTag.innerHTML = stylesheetFile;
const rendered = render(<MyComponent>);
rendered.append(style);
You don't necessarily have to read from a file, you can use whatever text you want.
I used to compile and insert JSX components via
<div key={ ID } dangerouslySetInnerHTML={ { __html: HTML } } />
which wrapped my HTML into a <div>:
<div>my html from the HTML object</div>
Now react > 16.2.0 has support for Fragments and I wonder if I can use that somehow to avoid wrapping my HTML in a <div> each time I get data from the back end.
Running
<Fragment key={ ID } dangerouslySetInnerHTML={ { __html: HTML } } />
will throw a warning
Warning: Invalid prop `dangerouslySetInnerHTML` supplied to `React.Fragment`. React.Fragment can only have `key` and `children` props.
in React.Fragment
Is this supported yet at all? Is there another way to solve this?
Update
Created an issue in the react repo for it if you want to upvote it.
Short Answer
Not possible:
key is the only attribute that can be passed to Fragment. In the
future, we may add support for additional attributes, such as event
handlers.
https://reactjs.org/docs/fragments.html
You may want to chime in and suggest this as a future addition.
https://github.com/facebook/react/issues
In the Meantime
You may want to consider using an HTML parsing library like:
https://github.com/remarkablemark/html-react-parser
Check out this example to see how it will accomplish your goal:
http://remarkablemark.org/blog/2016/10/07/dangerously-set-innerhtml-alternative/
In Short
You'll be able to do this:
<>
{require('html-react-parser')(
'<em>foo</em>'
)}
</>
Update December 2020
This issue (also mentioned by OP) was closed on Oct 2, 2019. - However, stemming from the original issue, it seems a RawHTML component has entered the RFC process but has not reached production, and has no set timeline for when a working solution may be available.
That being said, I would now like to allude to a solution I currently use to get around this issue.
In my case, dangerouslySetInnerHTML was utilized to render plain HTML for a user to download; it was not ideal to have additional wrapper tags included in the output.
After reading around the web and StackOverflow, it seemed most solutions mentioned using an external library like html-react-parser.
For this use-case, html-react-parser would not suffice because it converts HTML strings to React element(s). Meaning, it would strip all HTML that wasn't standard JSX.
Solution:
The code below is the no library solution I opted to use:
//HTML that will be set using dangerouslySetInnerHTML
const html = `<div>This is a div</div>`
The wrapper div within the RawHtml component is purposely named "unwanteddiv".
//Component that will return our dangerouslySetInnerHTML
//Note that we are using "unwanteddiv" as a wrapper
const RawHtml = () => {
return (
<unwanteddiv key={[]}
dangerouslySetInnerHTML={{
__html: html,
}}
/>
);
};
For the purpose of this example, we will use renderToStaticMarkup.
const staticHtml = ReactDomServer.renderToStaticMarkup(
<RawHtml/>
);
The ParseStaticHtml function is where the magic happens, here you will see why we named the wrapper div "unwanteddiv".
//The ParseStaticHtml function will check the staticHtml
//If the staticHtml type is 'string'
//We will remove "<unwanteddiv/>" leaving us with only the desired output
const ParseStaticHtml = (html) => {
if (typeof html === 'string') {
return html.replace(/<unwanteddiv>/g, '').replace(/<\/unwanteddiv>/g, '');
} else {
return html;
}
};
Now, if we pass the staticHtml through the ParseStaticHtml function you will see the desired output without the additional wrapper div:
console.log(ParseStaticHtml(staticHtml));
Additionally, I have created a codesandbox example that shows this in action.
Notice, the console log will throw a warning: "The tag <unwanteddiv> is unrecognized in this browser..." - However, this is fine because we intentionally gave it a unique name so we can easily differentiate and target the wrapper with our replace method and essentially remove it before output.
Besides, receiving a mild scolding from a code linter is not as bad as adding more dependencies for something that should be more simply implemented.
i found a workaround
by using react's ref
import React, { FC, useEffect, useRef } from 'react'
interface RawHtmlProps {
html: string
}
const RawHtml: FC<RawHtmlProps> = ({ html }) => {
const ref = useRef<HTMLDivElement>(null)
useEffect(() => {
if (!ref.current) return
// make a js fragment element
const fragment = document.createDocumentFragment()
// move every child from our div to new fragment
while (ref.current.childNodes[0]) {
fragment.appendChild(ref.current.childNodes[0])
}
// and after all replace the div with fragment
ref.current.replaceWith(fragment)
}, [ref])
return <div ref={ref} dangerouslySetInnerHTML={{ __html: html }}></div>
}
export { RawHtml }
Here's a solution that works for <td> elements only:
type DangerousHtml = {__html:string}
function isHtml(x: any): x is DangerousHtml {
if(!x) return false;
if(typeof x !== 'object') return false;
const keys = Object.keys(x)
if(keys.length !== 1) return false;
return keys[0] === '__html'
}
const DangerousTD = forwardRef<HTMLTableCellElement,Override<React.ComponentPropsWithoutRef<'td'>,{children: ReactNode|DangerousHtml}>>(({children,...props}, ref) => {
if(isHtml(children)) {
return <td dangerouslySetInnerHTML={children} {...props} ref={ref}/>
}
return <td {...props} ref={ref}>{children}</td>
})
With a bit of work you can make this more generic, but that should give the general idea.
Usage:
<DangerousTD>{{__html: "<span>foo</span>"}}</DangerousTD>
http://jsfiddle.net/Lc5gdvge/
In the fiddle above, I've showcased how text(), returns elements text and childrens' as well. How do I avoid this and only make it return "outerdiv"?
NOTE: Just click the blue container to call function.
$(document).ready(function() {
$('.outerdiv').click(function() {
var htmlstring = $(this).text();
alert(htmlstring);
})
});
NOTE 2: This has to be without the use of ID selector.
To get the text of only the .outerDiv and not the children elements, use the following code :
$(this).clone().children().remove().end().text();
What this does is, it clones the element, removes all the children tags, and goes back to the selected tag and gets the text.
The final code should look like :
$(document).ready(function () {
$('.outerdiv').click(function () {
var htmlstring = $(this).clone().children().remove().end().text();
alert(htmlstring);
});
});