Extract main text from HTML using Cheerio - node.js

How to extract just main text using cheerio?
I wish to go to unknown sites, and get main text (or all text) simply using nodeJS and Cheerio.

Resolved using an npm module called boilerpipe.

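Use the Request library to fetch the page and you get the HTML back as text. First check whether the site builds its content in the browser, for example with React (virtual DOM) or with Shadow DOM components. If it does, the text you want is not in the downloaded HTML, so Cheerio's methods don't work and you end up with an unusable circular object.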
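One way to run that check before handing a page to Cheerio is to see how much plain text the downloaded HTML contains. The snippet below is only a rough sketch: the function names and the 200-character threshold are made up for illustration, and the regular expressions are a crude heuristic, not a real HTML parser.

```javascript
// Rough heuristic, not a real HTML parser: drops scripts, styles, comments
// and tags with regular expressions and returns whatever text is left.
function roughVisibleText(html) {
  return html
    .replace(/<(script|style|noscript)\b[^>]*>[\s\S]*?<\/\1\s*>/gi, ' ')
    .replace(/<!--[\s\S]*?-->/g, ' ')
    .replace(/<[^>]+>/g, ' ')
    .replace(/\s+/g, ' ')
    .trim();
}

// Hypothetical threshold: if the static HTML holds fewer than minChars
// characters of text, the page is probably rendered in the browser.
function looksClientRendered(html, minChars = 200) {
  return roughVisibleText(html).length < minChars;
}

const staticPage =
  '<html><body><h1>Title</h1><p>' + 'word '.repeat(100) + '</p></body></html>';
const spaShell =
  '<html><body><div id="root"></div><script>var x = "<p>hi</p>";</script></body></html>';

console.log(looksClientRendered(staticPage)); // false
console.log(looksClientRendered(spaShell));   // true
```

If the check comes back true, a headless browser is probably a better fit than Request plus Cheerio for that site.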

Related

Is there a way to have the user upload an SVG file but then render the SVG source?

Using 2sxc on DNN, I have a website that uses SVGs for icons in content types. The client wants to be able to upload the SVG icons to 2sxc via a Link field, but instead of rendering <img src="@Content.SVG" />, they want the page to render the source code of the SVG (so we can manipulate the fill color via CSS). Is this even possible, and how could it be done?
Basically two steps:
1. Get the real file name using 2sxc and DNN.
2. Load the file as a string using the normal .NET System.IO APIs and add it to your HTML - https://learn.microsoft.com/en-us/dotnet/api/system.io.file.readalltext?view=netframework-4.5.1
Roughly like this:
<div>
@Html.Raw(System.IO.File.ReadAllText(fileName))
</div>
Some examples of how to do this can be found below
Using the fetch API
How to convert image (svg) to rendered svg in javascript?
Older methods such as XMLHttpRequest or jQuery
Include SVG files with HTML, and still be able to apply styles to them?
Using D3
(Embed and) Refer to an external SVG via D3 and/or javascript
Using a custom JS library
One example: SVGInjector
Interestingly, DNN itself does this nowadays, and you can look at the code here. If you ignore the caching, you might be able to do something similar in a View.
https://github.com/dnnsoftware/Dnn.Platform/blob/0d3b931d8483b01aaad0291a5fde2cbc0bac60ca/DNN%20Platform/Website/admin/Skins/Logo.ascx.cs#L123
That method is called from further up in the same file (see around line 71), so they really do inject the file contents inline. Caching the file access should be a priority on a high-traffic website; otherwise it is not needed, or at least secondary.

How to only get the "title" and "main content" of a page using puppeteer?

I'm trying to create a clone of getpocket.com for learning. In that app, every saved link gets converted into markdown, and the result seems to be filtered content with only the page title and body, without headers, footers, etc.
I can get the page's title with the puppeteer API in a couple of ways:
using page.title()
or get the page's opengraph "og:title"
But how do I get a summarized version containing only the main content of the page?
Note that I don't know the CSS class of the main content beforehand, since I'm planning to just enter a URL in a textbox and scrape that site from there.
I have found what I needed for this scenario.
I used the Readability.js library, which makes web pages readable by stripping out certain HTML tags. Here's the library.
This is the library Mozilla uses behind the scenes to render Firefox's Reader View.
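A minimal sketch of how Readability might be wired in is shown below. The helper name extractMainContent is made up for illustration, and the Readability constructor is passed in as an argument so the snippet does not assume how the library is loaded. It relies on Readability's parse() method, which returns an object with title, content and textContent fields, or null when no article-like content is found.

```javascript
// Hypothetical helper: runs Readability over a DOM document and keeps
// only the fields needed for a "title + main content" view.
function extractMainContent(document, Readability) {
  const article = new Readability(document).parse();
  if (!article) return null; // nothing article-like was found
  return {
    title: article.title,
    html: article.content,                    // cleaned-up article HTML
    text: (article.textContent || '').trim(), // plain text without tags
  };
}

// Possible wiring with Puppeteer and jsdom (an assumed setup, not taken
// from the question):
//   const { JSDOM } = require('jsdom');
//   const { Readability } = require('@mozilla/readability');
//   const html = await page.content();
//   const dom = new JSDOM(html, { url: page.url() });
//   const result = extractMainContent(dom.window.document, Readability);
```

The html field could then be fed to an HTML-to-markdown converter of your choice to get closer to what Pocket shows.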

CKEditor inline with Node.js

I'm trying to create a minimalistic content management system with ckeditor, using node and express as the server. I definitely want to use the inline editing capabilities of ckeditor, but I'm having no success in sending the data to the server and finally to a nosql (mongodb) database.
I would like to have multiple inline editors on a page and save them all to my database at once on a POST event. My editor instances live in individual divs with the attribute contenteditable="true". The editor instances launch just fine, but when I try to grab the data in my controller, all I get is an empty object. I can get the data from input fields, but then I lose the inline editing features. I've tried tinkering with bodyparser, without success. All the divs containing the editable content sit inside an HTML form element.
I would be more than happy if someone could at least point me in the general direction of how to accomplish this. Sorry if I was unable to make myself clear in this question :)
tldr; How can I parse data from HTML elements other than input fields and text areas in node/express with bodyparser?
The content of non-input elements isn't posted with a form, so you can't do that directly. A couple of options come to mind:
Use JavaScript to update hidden inputs on the page as those divs change. Updated content will be posted.
Use JavaScript to make the POST: on save, grab the contents, post them to the server, and then redirect from the client side.
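The second option could be sketched roughly as follows, assuming CKEditor 4, where CKEDITOR.instances maps editor names to editor objects and each editor exposes getData(). The helper name collectEditorData, the /save route and the redirect target are made up for illustration.

```javascript
// Hypothetical helper: builds a plain { editorName: html } object from a
// map of editors shaped like CKEDITOR.instances in CKEditor 4.
function collectEditorData(instances) {
  const payload = {};
  for (const name of Object.keys(instances)) {
    payload[name] = instances[name].getData(); // current HTML of that editor
  }
  return payload;
}

// In the browser, on save (the '/save' route and redirect are assumptions):
//   fetch('/save', {
//     method: 'POST',
//     headers: { 'Content-Type': 'application/json' },
//     body: JSON.stringify(collectEditorData(CKEDITOR.instances)),
//   }).then(() => { window.location.href = '/'; });
```

On the Express side, a JSON body parser such as bodyParser.json() should then expose the same object as req.body, with one key per editor.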

JavaScript function does not work with webbrowser.document.body.innerhtml

I'm using a web browser for showing some data.
When I change my web browser control content with this code:
webbrowser.document.body.innerhtml=htmlbody;
My JavaScript function does not work.
I believe that changing innerHTML in that way breaks up the DOM, so the JavaScript will not function any more.
I had the same problem. If you are trying to make changes to the body, the easiest way I found would be through injecting JavaScript into your page, without changing the HTML in this way.
I got the solution:
I put the jQuery function in the body, and that solved my problem :D

Printing a webpage with JSF

I have seen web pages with a PDF icon that you can click to print the content of the page.
The page I intend to add the print feature to is built with JSF, so is there any way I could add a print button to get the web page printed?
No, you must do this yourself. Get a PDF library (for example iText), then get the web page output (plain HTML). You will then have to iterate through the HTML and create a PDF version (for example, build an iText document). You will probably have to do this yourself because some elements (JavaScript-powered ones) will need to be turned into static content. Nobody but you knows what the output should look like.
