Count all nodes in an HTML file - c#-4.0

Is there an easy way to count the nodes in an HTML file? I also need to count nodes of a certain type, such as div.
I'd like to do this without an external library such as HTMLAgilityPack, if possible. Also, the HTML I'm dealing with is not guaranteed to be well formed or valid.
Is there a way to do this from C#?
Thanks.

First of all, are you sure a client-side solution using JavaScript isn't sufficient for your needs?
The easiest way to count nodes within an HTML document is with jQuery in the browser.
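<script src="http://code.jquery.com/jquery-1.7.min.js"></script>
<script>
// Run once the DOM is ready; .length turns each selection into a count.
$(function () {
  console.log($('*').length);                    // every element in the document
  console.log($('html').children().length);      // direct children of the html element
  console.log($('body').children().length);      // direct children of the body element
  console.log($('body').children('div').length); // direct children of body that are divs
  console.log($('body').find('div').length);     // all divs nested anywhere inside body
});
</script>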
If you are unfamiliar with jQuery, take a look at www.jquery.com.
If you still need a C# solution for server-side parsing of the document, I would recommend HTMLAgilityPack (even though you'd prefer not to use it). Writing your own parser seems like a waste of time to me, since you would have to handle malformed HTML/XML and the like, which can be a pain.
See this Stack Overflow question: What is the best way to parse html in C#?
I hope this meets your needs.
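To show why a tolerant parser makes the counting itself trivial, here is a minimal sketch in Python rather than C#, using only the standard library's html.parser as a stand-in for HTMLAgilityPack. The names TagCounter and count_tags are made up for this example. It counts element start tags as written in the source (not text nodes or comments, and without the implied elements a browser would add), and it does not care whether tags are closed or nested properly.

```python
from collections import Counter
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Counts element start tags; unclosed or misnested tags are fine."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        # html.parser lower-cases tag names before calling this hook.
        # Self-closing tags such as <br/> also end up here, once each.
        self.counts[tag] += 1


def count_tags(html):
    """Return a Counter mapping tag name -> number of start tags seen."""
    parser = TagCounter()
    parser.feed(html)
    parser.close()
    return parser.counts


# Deliberately malformed: <p>, the outer <div> and <html> are never closed.
sample = "<html><body><div><p>one<div>two</div><br/></body>"
counts = count_tags(sample)
print(sum(counts.values()))  # total number of elements
print(counts["div"])         # number of div elements
```

The same shape (one callback per start tag, one counter) should carry over to whichever C# parser you end up using.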

If you have XHTML, you can load it into an XDocument and use the XML manipulation API or LINQ to XML to count particular nodes.
If you don't, you can try regular expressions, but that only works for a small number of tags of interest, since you have to define an expression manually for each tag.
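A rough sketch of the XHTML route, written in Python with xml.etree.ElementTree standing in for XDocument (the helper names local_name and count_elements are invented for the example). It assumes the input is well-formed XML; anything else raises a parse error, which is exactly the limitation of this approach.

```python
import xml.etree.ElementTree as ET
from collections import Counter

XHTML = """<html xmlns="http://www.w3.org/1999/xhtml">
  <body>
    <div><p>one</p><div>two</div></div>
    <br/>
  </body>
</html>"""


def local_name(tag):
    # ElementTree reports namespaced tags as "{namespace}name".
    return tag.rsplit("}", 1)[-1]


def count_elements(xhtml):
    """Count elements by tag name; raises ET.ParseError on malformed input."""
    root = ET.fromstring(xhtml)
    # root.iter() walks the root element and all of its descendants.
    return Counter(local_name(el.tag) for el in root.iter())


counts = count_elements(XHTML)
print(sum(counts.values()))  # total number of elements
print(counts["div"])         # number of div elements
```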

With the LINQ to XML API, you can easily parse and loop through all the nodes of an HTML document, provided the markup is well-formed XML. You can find helpful articles on LINQ to XML, but they are all written in the context of parsing XML documents.
Here is a similar thread on Stack Overflow: C# Is there a LINQ to HTML, or some other good .Net HTML manipulation API?

Related

How to get an HTML tag as 2 strings (opening tag, closing tag), without its contents from kuchiki?

I am writing an HTML to Markdown converter in Rust, using Kuchiki to get access to the parsed tree from html5ever.
For unknown HTML tags, I want to offer the option of ignoring them and passing them through to the output string, while still processing their children as normal. For that, I need the textual representation of the tag without its contents, but I can't figure out how best to do that.
The best I can come up with is:
1. Clone the node
2. Drop its children
3. Call node.to_string
4. "Parse" the string with a regular expression to separate the opening and closing tags.
I feel there must be a better way. I don't think Kuchiki provides this functionality out of the box, but I also don't know how to get access to the html5ever API through Kuchiki, and I can't tell from the html5ever API documentation whether it provides something like this.

How to break larger HTML into shorter pieces for the Google Translate API

I am using the Google Translate API, and the HTML section I want to translate is far larger than the maximum Google allows for a single request. How can I break my HTML into pieces so that I can send multiple requests while keeping the overall HTML structure valid?
Also, I am using Node.js on the server side.
Any other ideas on how to achieve this?
Use a parser like jsdom to transform your HTML content into a DOM structure.
Then, use the translate API to translate the contents of the text nodes in the DOM structure and substitute the translated text back in to get the full translated page.
If you need it, you could also find and translate any relevant text outside of text nodes, such as alt or title attributes.
If you care about performance, you could try to translate bigger subtrees of the DOM structure at once, but then you would have to be careful not to exceed the request size limit again.
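The batching part of this can be sketched independently of the DOM library. The snippet below is in Python rather than Node.js and assumes the text nodes have already been collected into a list in document order. The names batch_texts, translate_all and max_chars are hypothetical, and fake_translate is only a stand-in for the real API call; the actual size limit and request shape have to come from the Google documentation.

```python
def batch_texts(texts, max_chars):
    """Group strings into batches whose combined length stays within max_chars."""
    batches, current, size = [], [], 0
    for text in texts:
        if current and size + len(text) > max_chars:
            batches.append(current)
            current, size = [], 0
        # A single string longer than max_chars still gets a batch of its own;
        # splitting it further (e.g. by sentence) is left out of this sketch.
        current.append(text)
        size += len(text)
    if current:
        batches.append(current)
    return batches


def translate_all(texts, translate_batch, max_chars):
    """Translate texts batch by batch, preserving their original order."""
    translated = []
    for batch in batch_texts(texts, max_chars):
        translated.extend(translate_batch(batch))
    return translated


def fake_translate(batch):
    # Stand-in for a real API request; upper-cases instead of translating.
    return [text.upper() for text in batch]


texts = ["Hello", "Big ", "world"]
print(batch_texts(texts, max_chars=9))
print(translate_all(texts, fake_translate, max_chars=9))
```

Because the results come back in the same order as the input, the i-th translated string can be written back into the i-th text node.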

Extract localizable content from a HTML page

I need some advice on the best approach to a feature I need to implement in a project I'm working on.
Basically, I need to be able to extract all localizable content (i.e. all the strings) from an HTML page. I really don't want to have to write an HTML parser. The application is written in C#.
Has anybody got any experience with this, or can anyone recommend an existing library that I could use to accomplish this?
Thanks.
You do not have to write your own parser. Fortunately somebody else already did that.
To parse an HTML file, you can use HTML Agility Pack.
It gives you a Document Object Model, which you can walk just like any other DOM. See these examples:
https://web.archive.org/web/20211020001935/https://www.4guysfromrolla.com/articles/011211-1.aspx
http://htmlagilitypack.codeplex.com/wikipage?title=Examples&referringTitle=Home
And this question:
How to use HTML Agility pack
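To give an idea of what the walk looks like, here is a small sketch in Python (not C#, and using the standard library's html.parser instead of HTML Agility Pack). The class name LocalizableExtractor and the choice of alt, title and placeholder as the attributes worth extracting are assumptions made for the example.

```python
from html.parser import HTMLParser

# Attributes assumed (for this sketch) to hold user-visible text.
TEXT_ATTRS = {"alt", "title", "placeholder"}


class LocalizableExtractor(HTMLParser):
    """Collects (source, string) pairs for text nodes and selected attributes."""

    def __init__(self):
        super().__init__()
        self.strings = []
        self._skip = None  # name of the script/style element being skipped

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip = tag
        for name, value in attrs:
            if name in TEXT_ATTRS and value and value.strip():
                self.strings.append((f"{tag}@{name}", value.strip()))

    def handle_endtag(self, tag):
        if tag == self._skip:
            self._skip = None

    def handle_data(self, data):
        text = " ".join(data.split())  # collapse runs of whitespace
        if text and self._skip is None:
            self.strings.append(("text", text))


page = ('<p title="Greeting">Hello, <b>world</b>!</p>'
        '<img src="a.png" alt="A cat"><script>var s = "x";</script>')
extractor = LocalizableExtractor()
extractor.feed(page)
extractor.close()
for source, string in extractor.strings:
    print(source, repr(string))
```

Note that inline tags split a sentence into several strings here ("Hello,", "world", "!"), which translators usually dislike; a real extractor would probably want to keep inline markup inside the string.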

BB Code versus restricted HTML

Are there any security risks in allowing (whitelist only) pure markup tags such as a, b, i, etc. in post submissions?
BB code seems like a heavy solution to the problem of code injection, and whitelisting "safe" HTML tags seems easier than going through all the parsing and conversion that BB code requires.
I have found that many BB code libraries have issues with nested elements (is this because they use an FSA or regexes instead of a proper parser?), whereas blockquote or fieldset are parsed properly by the web browser.
Any and all opinions are greatly appreciated.
This is something everyone seems to get wrong, while it is so simple.
Use a parser
It doesn't matter whether you use markdown, html, bbcode, whatever.
Use a parser. A real parser. Not a bunch of regexes.
The parser gives you a syntax tree. From the syntax tree you derive the HTML (still as a tree of objects). Clean the tree (using a whitelist), then print the HTML.
Using html as syntax is perfectly fine. Just don't try to clean it with regexes.
There is nothing wrong with using HTML as long as you:
1. Use a proper HTML parser to process the input.
2. Whitelist the tags so that only things you want get through.
3. Whitelist the attributes on the tags. This includes parsing and whitelisting things inside style attributes if you want to allow style (and, of course, using a real CSS parser for the style attributes).
4. Rewrite the HTML while you parse it.
The last point is mostly about getting consistent and correct HTML output. Your parser should take care of sorting out the usual confusion (such as incorrectly nested tags) that you find in hand-written HTML.
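As an illustration of these points only (not a production sanitizer), here is a sketch in Python using the standard library's html.parser. The whitelist in ALLOWED, the accepted URL prefixes and the names Sanitizer and sanitize are all assumptions made for the example; style attributes are simply not allowed, so no CSS parsing is attempted. The output is rebuilt from parser events rather than copied from the input, so it is always properly nested and escaped.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical whitelist for this sketch: tag -> allowed attribute names.
ALLOWED = {"a": {"href"}, "b": set(), "i": set(), "blockquote": set()}
SAFE_URL_PREFIXES = ("http://", "https://", "mailto:")
DROP_CONTENT = {"script", "style"}  # discard these elements and their contents


class Sanitizer(HTMLParser):
    def __init__(self):
        super().__init__()
        self.out = []
        self.open_tags = []   # whitelisted tags currently open in the output
        self.dropping = None  # name of the element whose content is discarded

    def handle_starttag(self, tag, attrs):
        if tag in DROP_CONTENT:
            self.dropping = tag
        if self.dropping or tag not in ALLOWED:
            return
        kept = []
        for name, value in attrs:
            if name not in ALLOWED[tag] or value is None:
                continue
            # Only URLs with an explicitly safe prefix survive.
            if name == "href" and not value.strip().lower().startswith(SAFE_URL_PREFIXES):
                continue
            kept.append(f' {name}="{escape(value, quote=True)}"')
        self.out.append(f"<{tag}{''.join(kept)}>")
        self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag == self.dropping:
            self.dropping = None
        elif tag in self.open_tags:
            # Close inner tags first so the output stays properly nested.
            while True:
                top = self.open_tags.pop()
                self.out.append(f"</{top}>")
                if top == tag:
                    break

    def handle_data(self, data):
        if not self.dropping:
            self.out.append(escape(data, quote=False))


def sanitize(html):
    parser = Sanitizer()
    parser.feed(html)
    parser.close()
    while parser.open_tags:  # close anything the input left open
        parser.out.append(f"</{parser.open_tags.pop()}>")
    return "".join(parser.out)


dirty = ('<b onclick="x()">bold</b><script>alert(1)</script>'
         '<a href="javascript:alert(1)">bad</a> '
         '<a href="https://example.com/?a=1&amp;b=2">ok</a><i>unclosed')
print(sanitize(dirty))
```

For anything facing real users, a maintained sanitizer library built on a full HTML5 parser is the safer choice; the sketch is only meant to show the parse, whitelist, rewrite sequence.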

How to parse html in a client-side script?

What's the best way to create scripts for a browser?
I need to parse some HTML pages on different domains.
I am on Windows and mostly use Firefox.
If it's just about retrieving the pages to do whatever you want with them, the built-in urllib module in Python will do that for you.
It sounds like you want to retrieve web pages and parse them to extract meaningful data. I would suggest something like TagSoup (for Java), which fires off nice SAX events that you can use directly or feed into an XML module of your choice (raw DOM, JDOM, dom4j, XOM, etc.). The TagSoup page also lists a number of references for other languages, such as Beautiful Soup for Python, Rubyful Soup for Ruby, and others.
From there, I would suggest using something like XPath to retrieve the bits of data that you want. Another option would be XSLT to transform the HTML into some unified format that you can more easily manipulate.
I'd recommend Synthetics Web. Here is a working example at jsFiddle.
jsFiddle
http://jsfiddle.net/dwayne05/YkLVw/
Synthetics Web
http://www.syntheticsweb.com/
