How to replace JSDOM with cheerio for Readability - node.js

I'd like to use Readability to parse out the "article" content within web pages. Readability does a good job, but it depends on JSDOM, which seems to be very slow and throws errors if parsing CSS content fails (I do not need the CSS at all). As I understand from the GitHub issues of the project, it's not possible to ignore CSS in JSDOM.
I've been trying to replace JSDOM with cheerio, but I haven't been able to figure out what part of its API is compatible with the output of JSDOM.
In JSDOM, the statement
var doc = new JSDOM(html);
produces a DOM object which can be passed into
let reader = new Readability(doc.window.document);
In cheerio, however, there doesn't seem to be anything that produces a DOM object.
I've tried
var $ = cheerio.load(html);
var object_something = $('html');
It throws an error when I call new Readability(object_something)
Error: First argument to Readability constructor should be a document object.
which clearly means what it says. object_something is an object, but I'm not sure what it actually is.
Is it even possible to produce a DOM object with cheerio? I have over 10 million local HTML documents, so any performance improvement would save me a lot of time.

Related

How to use `nsIParserUtils.parseFragment()` for Firefox addon

Our Firefox addon issues queries to Google at the backend (main.js), then extracts some content through xpath. For this purpose, we use innerHTML to create a document instance for xpath parsing. But when we submit this addon to Mozilla, we got rejected because:
This add-on is creating DOM nodes from HTML strings containing potentially unsanitized data, by assigning to innerHTML, jQuery.html, or through similar means. Aside from being inefficient, this is a major security risk. For more information, see https://developer.mozilla.org/en/XUL_School/DOM_Building_and_HTML_Insertion
Following the link provided, we tried to replace innerHTML with nsIParserUtils.parseFragment(). However, the example code:
let { Cc, Ci } = require("chrome");
function parseHTML(doc, html, allowStyle, baseURI, isXML) {
let PARSER_UTILS = "@mozilla.org/parserutils;1";
...
The Cc and Ci utilities can only be used in main.js, while the function requires a document (doc) as an argument. But we could not find any examples of creating a document inside main.js, where we cannot use document.implementation.createHTMLDocument(""); because main.js is a background script, which has no reference to the global built-in document.
I googled a lot, but still could not find any solutions. Could anybody kindly help?
You probably want to use nsIDOMParser instead, which is the same as the standard DOMParser accessible in window globals except that you can use it from privileged contexts without a window object.
That gives you a whole document, though, with synthesized <html> and <body> elements if you don't provide your own. If you absolutely need a fragment, you can use the HTML5 template element to extract one via DOMParser:
let partialHTML = "foo <b>baz</b> bar"
let frag = parser.parseFromString(`<template>${partialHTML}</template>`, 'text/html').querySelector("template").content

JavaScript Library XPages

I currently have JavaScript dotted around my XPages - I would like to create a JavaScript library containing all this code and then just call the individual functions, e.g. on the press of a button. Here is an example of the code I would like in the JavaScript library:
// get the user document for that person
var myView:NotesView = database.getView("xpBenutzerInnen");
var query = new java.util.Vector();
query.addElement(sessionScope.notesName);
var myDoc:NotesDocument = myView.getDocumentByKey(query, true);
When I place this code in the library I get the error:
Syntax error on token ":", ; expected
I assume that the "var name:type" declaration is specific to XPages (this is what I found on creating JavaScript vars: http://www.w3schools.com/js/js_variables.asp). I could just remove the type declaration, and the code seems to run without any problems, but I would like to keep the variable types explicit.
Is there any way that I can move the code out of the XPage but still keep my typing?
Thanking you in advance
Ursus
You need to distinguish between client side JavaScript and Server side JavaScript. Your code is server side JS and should work in a Script library just as it does inside an XPage. Have you accidentally created a client side JS lib?
A few pointers: I try to make functions in a script library independent from global objects. Mostly the database object. This function works in a library just fine:
function getUserDocument(username:string, db:database) {
    var myView:NotesView = db.getView("xpBenutzerInnen");
    var query = new java.util.Vector();
    query.addElement(username);
    var myDoc:NotesDocument = myView.getDocumentByKey(query, true);
    myView.recycle();
    return myDoc;
}
Let me know how it goes

Nodejs: parsing XML, editing values, saving the end result using sax-js module

What I'd like to achieve is:
parsing a chunk of XML
editing some values
saving the end result in a new xml file
The module is sax-js: https://github.com/isaacs/sax-js#readme
The module has some built-in mechanism to read/write any.
I thought the task would be a piece of cake; on the contrary I have been struggling with it for the whole day.
Here is my code:
var fs = require('fs');
var saxStream = require("sax").createStream(true);
saxStream.on("text", function (node) {
    if (node === 'foo') { //the content I want to update
        node = 'blabla';
    }
});
fs.createReadStream("mysongs.xml")
    .pipe(saxStream)
    .pipe(fs.createWriteStream("mysongs-copy.xml"));
I did think that updating some content (see the comment above) would suffice to write the updated stream into a new file.
What's wrong with this code?
Thanks for your help,
Roland
The sax module doesn't let you modify nodes like that. If you take a look at this bit of code, you'll see that the input is passed indiscriminately to the output.
All hope is not, however, lost! Check out the pretty-print example - it would be a good starting point for what you want to do. You'd have to do a bit of work to implement the readable part of the stream, though, if you still want to be able to .pipe() out of it.
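To illustrate the point about implementing the readable side, here is a minimal sketch that uses only Node's built-in stream module rather than sax-js. The helper names replaceTextNodes and createTextReplacer are hypothetical. The sketch assumes the document fits in memory and that the target text node contains no markup or entities; it matches strings instead of parsing XML, so treat it as an illustration of the stream mechanics, not as a replacement for re-serialising sax events.

```javascript
const { Transform } = require('stream');

// Hypothetical helper: swaps text nodes whose entire content equals `from`.
// This is plain string matching, not XML parsing.
function replaceTextNodes(xml, from, to) {
  return xml.split('>' + from + '<').join('>' + to + '<');
}

// Hypothetical helper: a Transform stream that collects its input and, once
// the input has ended, pushes the rewritten document to its readable side.
// Reassigning a handler argument changes nothing downstream; the output
// only differs when the stream itself pushes different data.
function createTextReplacer(from, to) {
  const chunks = [];
  return new Transform({
    transform(chunk, encoding, done) {
      chunks.push(chunk); // buffer only; nothing is emitted yet
      done();
    },
    flush(done) {
      const xml = Buffer.concat(chunks).toString('utf8');
      this.push(replaceTextNodes(xml, from, to));
      done();
    }
  });
}
```

Under those assumptions it would sit in the original pipeline in place of saxStream: fs.createReadStream("mysongs.xml").pipe(createTextReplacer('foo', 'blabla')).pipe(fs.createWriteStream("mysongs-copy.xml")).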
If you know the general structure of the XML, you can try xml-flow. It converts an XML stream into objects, but has a utility to convert them back to xml strings:
https://github.com/matthewmatician/xml-flow
Based on deoxxa's answer I wrote an NPM module for this https://www.npmjs.com/package/sax-streamer

How can I replicate Chrome's ability to 'resolve' a DOM from bad html?

I'm using cheerio and node.js to parse a webpage and then use css selectors to find data on it. Cheerio doesn't perform so well on malformed html. jsdom is more forgiving, but both behave differently and I've seen both break when the other works fine in certain cases.
Chrome seems to do a fine job with the same malformed html in creating a DOM.
How can I replicate Chrome's ability to create a DOM from malformed HTML, then give the 'cleaned' html representation of this DOM to cheerio for processing?
This way I'll know the html it gets is well-formed. I tried phantomjs by setting page.content, but then when I read page.content's value the html is still malformed.
So you can use https://github.com/aredridel/html5/ which is a lot more forgiving and from my experience works where jsdom fails.
But the last time I tested it, a few months back, it was super slow. I hope it has got better.
There is also the possibility of spawning a phantomjs process that writes the data you want as JSON to stdout, so that you can feed it back to your Node process.
This seems to do the trick, using phantomjs-node and jquery:
function cleanHtmlWithPhantom(html, callback){
    var phantom = require('phantom');
    phantom.create(
        function(ph){
            ph.createPage(
                function(page){
                    page.injectJs(
                        "/some_local_location/jquery_1.6.1.min.js",
                        function(){
                            // The placeholder newHtml is replaced in the function source.
                            // JSON.stringify quotes and escapes the HTML so that quotes or
                            // newlines in it cannot break the injected source.
                            page.evaluate(
                                function(){
                                    $('html').html(newHtml);
                                    return $('html').html();
                                }.toString().replace(/newHtml/g, function(){ return JSON.stringify(html); }),
                                function(result){
                                    callback("<html>" + result + "</html>");
                                    ph.exit();
                                }
                            )
                        }
                    );
                }
            )
        }
    )
}
cleanHtmlWithPhantom(
    "<p>malformed",
    function(newHtml){
        console.log(newHtml);
    }
)
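When HTML is spliced into a function's source text, as cleanHtmlWithPhantom does with its newHtml placeholder, the HTML has to be turned into a valid JavaScript string literal; wrapping it in bare single quotes breaks as soon as the HTML contains a quote or a newline. A minimal sketch of the escaping step on its own, runnable without PhantomJS; embedHtml is a hypothetical helper name and the sample HTML is made up for illustration.

```javascript
// Hypothetical helper: replaces every occurrence of the placeholder
// identifier `newHtml` in fn's source with a quoted string literal.
function embedHtml(fn, html) {
  // JSON.stringify escapes quotes, backslashes and newlines. The callback
  // form of replace() keeps "$&"-style sequences in the HTML literal.
  return fn.toString().replace(/newHtml/g, function () {
    return JSON.stringify(html);
  });
}

// Sample input containing both quote styles and a newline.
const source = embedHtml(
  function () { return newHtml; },
  "<p class='intro'>it's \"malformed\"\n"
);
```

The resulting source string is still a syntactically valid function expression, so it can be handed to an evaluator that accepts function source.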

raphael text() not working in ie9

Why is the following example not working in ie9?
http://jsfiddle.net/dzyyd/2/
It spits out a console error:
"Unexpected call to method or property access."
I found it pretty quickly. You created the element, but did not put it anywhere. Once it is added to the document body, everything seems to be fine.
this._width=300;
this._height=300;
this._bgSvgContainer = document.createElement("div");
//NOTE: add the created div to the body of the document so that it is displayed
document.body.appendChild(this._bgSvgContainer);
var bgCanvas = Raphael(this._bgSvgContainer, this._width, this._height);
this._bgCanvas = bgCanvas;
var num = this._bgCanvas.text(this._width-10,this._height-10,"1");
It's really hard to tell with such a tiny code fragment (it doesn't run on any browser for me), but it's probably a scope issue: this in IE during events is completely different from this under the W3C event model. See: quirksmode-Event order-Problems of the Microsoft model
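The suspected scope issue can be sketched without a browser. In the legacy IE model (attachEvent), a handler is not called with the originating object as this, which resembles calling a method after detaching it from its object. The Painter constructor and its labelPosition method below are hypothetical; only the dimensions mirror the snippet above.

```javascript
// Hypothetical object, loosely modelled on the snippet above.
function Painter() {
  this._width = 300;
  this._height = 300;
}

// Returns where the label would be drawn, or null when `this` is not a Painter.
Painter.prototype.labelPosition = function () {
  if (!(this instanceof Painter)) {
    return null;
  }
  return [this._width - 10, this._height - 10];
};

const painter = new Painter();

// Detached reference: called without a receiver, much as a legacy
// attachEvent handler is called without the object it was defined on.
const detached = painter.labelPosition;

// Bound reference: `this` is fixed to the painter regardless of the caller.
const bound = painter.labelPosition.bind(painter);
```

Calling detached() yields null because this is no longer the painter, while bound() still sees the stored width and height. Passing a bound function (or a closure over the object) as the event handler avoids the difference between the two event models.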
