How to parse html in a client-side script? - browser

What's the best way to write scripts for a browser?
I need to parse some HTML pages on different domains.
I'm on Windows and mostly use Firefox.

If it's just about retrieving the pages so you can do whatever you want with them, Python's built-in urllib module (urllib.request in Python 3) will do that for you.
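As a rough illustration of the urllib suggestion, here is a minimal sketch using Python 3's urllib.request. The function name fetch_html, the timeout, and the User-Agent string are my own assumptions for the example, not anything from the question.

```python
from urllib.request import Request, urlopen


def fetch_html(url: str, timeout: float = 10.0) -> str:
    """Download a page and return its body decoded to text."""
    # Hypothetical User-Agent value; some servers reject urllib's default one.
    request = Request(url, headers={"User-Agent": "example-fetcher/0.1"})
    with urlopen(request, timeout=timeout) as response:
        # Use the charset declared by the server, falling back to UTF-8.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

Calling fetch_html with a page URL returns the markup as a string, which you can then hand to whatever parser you prefer. Note that this runs outside the browser, so it is not restricted by the same-origin policy the way a client-side script would be.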

It sounds like you want to retrieve web pages and parse them to extract meaningful data. I would suggest something like TagSoup (for Java), which fires off SAX events that you can handle directly or feed into an XML module of your choice (raw DOM, JDOM, dom4j, XOM, etc.). The TagSoup page also lists a number of references for other languages, such as Beautiful Soup for Python, Rubyful Soup for Ruby, and others.
From there, I would suggest using something like XPath to retrieve the bits of data that you want. Another option would be XSLT to transform the HTML into some unified format that you can more easily manipulate.
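To give a feel for the event-driven style described above, here is a small sketch that uses Python's standard html.parser module rather than TagSoup or Beautiful Soup. It reacts to start-tag, text, and end-tag events to pull out links. The class and function names (LinkExtractor, extract_links) are made up for the example.

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collects (href, link text) pairs from <a> elements."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.links = []
        self._href = None   # href of the <a> currently being read, if any
        self._text = []     # text pieces seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def extract_links(html):
    parser = LinkExtractor()
    parser.feed(html)
    parser.close()
    return parser.links
```

The parser tolerates unclosed tags and unquoted attributes, which is the main reason to prefer this kind of tool over a strict XML parser for pages found in the wild. For XPath-style queries you would need a tree-building library on top of this.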

I'd recommend Synthetics Web. Here is a working example on jsFiddle.
jsFiddle: http://jsfiddle.net/dwayne05/YkLVw/
Synthetics Web: http://www.syntheticsweb.com/

Related

Count all nodes in a HTML file

Is there an easy way to count the nodes in an HTML file? I also need to count nodes of a certain type, such as div.
I'd like to do this, if possible, without using an external library like HTMLAgilityPack. Also, the HTML I'm dealing with is not guaranteed to be well formed or valid.
Is there a way to do this from C#?
Thanks.
First of all, are you sure a client-side solution using JavaScript isn't sufficient for your needs?
The easiest way to count nodes within an HTML document is to use jQuery in the browser.
<script src="http://code.jquery.com/jquery-1.7.min.js"></script>
<script>
$(function () { // run once the DOM is ready
  var all = $('*').length; // every element in the document
  var htmlChildren = $('html').children().length; // direct children of the html element
  var bodyChildren = $('body').children().length; // direct children of the body element
  var directDivs = $('body').children('div').length; // only the direct children of body that are divs
  var nestedDivs = $('body').find('div').length; // every div nested anywhere inside body
});
</script>
If you are unfamiliar with jQuery, take a look at www.jquery.com.
If you still need a C# solution for server-side parsing of the document, then I would recommend HTMLAgilityPack (even though you would rather avoid it). Writing your own parser seems like a waste of time to me, since you would have to handle malformed HTML/XML and similar problems, which can be a pain.
Try this Stack Overflow question: What is the best way to parse html in C#?
I hope this meets your needs.
If you have XHTML, you can load it into an XDocument and use the XML manipulation API or LINQ to XML to count the particular nodes.
If you don't, you can try regular expressions. This only works for a small number of tags of interest, though, since you have to define an expression manually for each tag.
With the LINQ to XML API, you can easily parse and loop through all the nodes of an HTML document, provided it is well-formed XML. You can find helpful articles on LINQ to XML, but they are all written in the context of parsing XML documents.
The following is a similar thread on Stack Overflow: C# Is there a LINQ to HTML, or some other good .Net HTML manipulation API?
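For comparison, here is how the counting idea looks with a lenient parser instead of regular expressions. This is only a sketch in Python using the standard html.parser module, not a C# answer, and the names TagCounter and count_elements are invented for the example. It counts the element start tags that appear in the source, so elements a browser would insert implicitly (such as tbody) are not included.

```python
from collections import Counter
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Counts element start tags by (lowercased) tag name."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        # Also reached for self-closing tags such as <br/>, because the
        # default handle_startendtag calls handle_starttag.
        self.counts[tag] += 1


def count_elements(html):
    counter = TagCounter()
    counter.feed(html)
    counter.close()
    return counter.counts
```

With the result in hand, counts["div"] gives the number of div elements and sum(counts.values()) gives the total, even when the markup has unclosed tags.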

How to add text to any html element?

I want to add text to the body element, but I don't know how. Which method will work on the body tag?
Sorry for my English, and thanks for any replies.
In Watir, you can manipulate a web page (the DOM) using JavaScript, like this:
browser.execute_script("document.getElementById('pageContent').appendChild(document.createTextNode('Great Success!'));")
I assume that the point of the question is this:
Not all users interact with the web app only by clicking buttons and links. Some of them do nasty things like altering HTTP requests to make your system do something it is not supposed to do, or just to have some fun.
To mimic this behavior, you could write a UI test that alters forms on the web page so that, for example, one could type anything into any field instead of choosing from a limited dropdown.
To do that, the UI test has to:
manipulate the DOM to free form inputs of their limitations (replace selects with inputs, etc.)
know which values to use. In many cases it is pointless to enter random values, so your web app has to provide some good "unwanted" options.
Why would you want to modify the webpage in Watir? It's for automated testing, not DOM manipulation.
If you want to add something to a DOM element in JavaScript, you can do it like this:
var txt = document.createTextNode(" This text was added to the DIV.");
document.getElementById('myDiv').appendChild(txt);
Or use some DOM manipulation library, like jQuery.
If you have not worked your way through the Watir tutorial, I would suggest you do so. It deals with things like filling in text fields.
Learn to use the developer tools for your browser: Firebug for Firefox, or the built-in tools for IE and Chrome. They will let you look at things as you interact with the site.
If the element is not a normal HTML input field of some sort, then you are dealing with a custom control. Many exist, they vary widely, and there is no single solution for dealing with them. Without knowing which control you are using, and without being able to interact with a sample of it or at least see the HTML, it is very difficult to advise you. We basically have to guess, which is often a waste of everyone's time.
Odds are that if you have a place where you can enter text, it is some form of input control. It might not start out that way, and you may need to click on some other element to make the input area appear, but without a sample of the HTML all we can do is guess.
If this is a commercial control, see if you can find a demo site that shows the control in action. Try googling things like the class names of the elements; often you get lucky.

Extract localizable content from a HTML page

I need some advice on the best approach to a feature I need to implement in a project I'm working on.
Basically, I need to be able to extract all localizable content (i.e. all the strings) from an HTML page. I really don't want to have to write an HTML parser. The application is written in C#.
Has anybody got any experience with this, or can anyone recommend an existing library that I could use to accomplish this?
Thanks.
You do not have to write your own parser. Fortunately, somebody else has already done that.
To parse an HTML file, you can use HTML Agility Pack.
It gives you a document object model, which you can walk just like any other DOM. See these examples:
https://web.archive.org/web/20211020001935/https://www.4guysfromrolla.com/articles/011211-1.aspx
http://htmlagilitypack.codeplex.com/wikipage?title=Examples&referringTitle=Home
And this question:
How to use HTML Agility pack
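The same walk-the-parsed-document idea can be sketched without HTML Agility Pack. The following is a Python illustration using the standard html.parser module, not C#. The choice of which attributes count as localizable (alt, title, placeholder) and the names StringExtractor and extract_strings are assumptions made for the example.

```python
from html.parser import HTMLParser

SKIPPED_TAGS = {"script", "style"}                 # contents are code, not text
TEXT_ATTRIBUTES = {"alt", "title", "placeholder"}  # assumed localizable attributes


class StringExtractor(HTMLParser):
    """Collects visible text and selected attribute values."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.strings = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self._skip_depth += 1
        for name, value in attrs:
            if name in TEXT_ATTRIBUTES and value and value.strip():
                self.strings.append(value.strip())

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = " ".join(data.split())  # collapse runs of whitespace
        if text and self._skip_depth == 0:
            self.strings.append(text)


def extract_strings(html):
    extractor = StringExtractor()
    extractor.feed(html)
    extractor.close()
    return extractor.strings
```

A real localization pass would probably also want to record where each string came from so that translations can be written back, but that depends on how the rest of the application stores its pages.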

BB Code versus restricted HTML

Are there any security risks in allowing (whitelist only) pure markup tags such as a, b, i, etc. in post submissions?
BB code seems like a heavy solution to the problem of code injection, and whitelisting "safe" HTML tags seems easier than going through all the parsing and conversion that BB code requires.
I have found that many BB code libraries have issues with nested elements (is this because they use an FSA or regex instead of a proper parser?), whereas blockquote or fieldset are parsed properly by the web browser.
Any and all opinions are greatly appreciated.
This is something everyone seems to get wrong, while it is so simple.
Use a parser
It doesn't matter whether you use markdown, html, bbcode, whatever.
Use a parser. A real parser. Not a bunch of regexes.
The parser gives you a syntax tree. From the syntax tree you derive the HTML (still as a tree of objects). Clean the tree (using a whitelist), then print the HTML.
Using html as syntax is perfectly fine. Just don't try to clean it with regexes.
There is nothing wrong with using HTML as long as you:
Use a proper HTML parser to process the input.
Whitelist the tags so that only things you want get through.
Whitelist the attributes on the tags. This includes parsing and whitelisting things inside style attributes if you want to allow style (and, of course, using a real CSS parser for the style attributes).
Rewrite the HTML while you parse it.
The last point is mostly about getting consistent and correct HTML output. Your parser should take care of sorting out the usual confusion (such as incorrectly nested tags) that you find in hand-written HTML.
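The four points above can be sketched roughly as follows. This is a minimal illustration in Python built on the standard html.parser module, under several assumptions of my own: the whitelist contents, the allowed URL prefixes, and the names Sanitizer and sanitize are all made up for the example. It rewrites the output as it parses rather than building a full tree, and it is not a vetted security library.

```python
from html import escape
from html.parser import HTMLParser

# Assumed whitelist: tag name -> attribute names allowed on that tag.
ALLOWED_TAGS = {"a": {"href"}, "b": set(), "i": set(), "blockquote": set()}
SAFE_URL_PREFIXES = ("http://", "https://", "/")
DROP_CONTENT = {"script", "style"}  # drop these tags and everything inside


class Sanitizer(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.open_tags = []  # whitelisted tags currently open in the output
        self._drop = 0

    def handle_starttag(self, tag, attrs):
        if tag in DROP_CONTENT:
            self._drop += 1
            return
        if self._drop or tag not in ALLOWED_TAGS:
            return  # unknown tag: the tag is removed, its text is kept
        kept = []
        for name, value in attrs:
            if name not in ALLOWED_TAGS[tag] or value is None:
                continue
            if name == "href" and not value.strip().lower().startswith(SAFE_URL_PREFIXES):
                continue  # rejects javascript: and other unexpected schemes
            kept.append(f' {name}="{escape(value, quote=True)}"')
        self.out.append(f"<{tag}{''.join(kept)}>")
        self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag in DROP_CONTENT:
            self._drop = max(0, self._drop - 1)
            return
        if self._drop or tag not in self.open_tags:
            return  # stray close tag: ignore it
        # Close anything opened inside this tag first, repairing bad nesting.
        while self.open_tags:
            open_tag = self.open_tags.pop()
            self.out.append(f"</{open_tag}>")
            if open_tag == tag:
                break

    def handle_data(self, data):
        if not self._drop:
            self.out.append(escape(data, quote=False))

    def result(self):
        # Close whatever the author left open.
        closing = "".join(f"</{t}>" for t in reversed(self.open_tags))
        return "".join(self.out) + closing


def sanitize(html):
    sanitizer = Sanitizer()
    sanitizer.feed(html)
    sanitizer.close()
    return sanitizer.result()
```

Because every tag and attribute in the output is re-emitted from the whitelist, anything the parser did not recognize as allowed simply never reaches the page. Style attributes are not allowed at all here, which sidesteps the need for a CSS parser.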

Markdown to HTML conversion

I'm still in the middle of coding my final-year project at university, and I have come across an issue where I need to convert either from HTML to Markdown or vice versa. I have no experience whatsoever of Perl, Python, etc., so I need an easy-to-implement solution, and I only have about 6 weeks left to complete this. I'm writing the data from a WMD text box to SQL Server, and I can upload it as either Markdown or HTML. If that data needs editing, though, it cannot be in HTML, as this would be too confusing for the end user, who is assumed to have little or no computing know-how.
What should I do?
Karmastan's answer is probably the best here. Keeping the raw Markdown in the database is a really good solution, as it allows users to maintain the content in a form with which they're familiar.
However, if you have a bunch of HTML which is already converted, you might want to look at something like Markdownify: The HTML to Markdown converter for PHP.
Edit: based on what you've said below, there are a few things you should keep in mind:
Make sure that the following is set in wmd.js:
wmd_options = {"output": "Markdown"};
This ensures that you're storing Markdown in the database.
Source: How do you store the markdown using WMD in ASP.NET?
When outputting the Markdown to the web, you need to transform it to HTML. To do this, you'll need a library which does Markdown -> HTML conversion. Here are two examples:
Announcing Markdown.NET
Revised Markdown.NET Library
I'm not a .NET developer, so I can't really help with how these libraries should be used, but hopefully the documentation will make that clear.
If you look at the web site for Markdown, you'll find a Perl script that converts Markdown-syntax documents to HTML. Keep Markdown text in your database and invoke the script whenever you need to display the text. No Perl knowledge required!

Resources