Extract localizable content from an HTML page - c#-4.0

I need some advice on the best approach to a feature I need to implement in a project I'm working on.
Basically, I need to be able to extract all localizable content (i.e. all the strings) from an HTML page. I really don't want to have to write an HTML parser. The application is written in C#.
Has anybody got any experience with this, or can anyone recommend an existing library that I could use to accomplish this?
Thanks.

You do not have to write your own parser. Fortunately, somebody else already did that.
To parse an HTML file, you can use the HTML Agility Pack.
It gives you a Document Object Model, which you can walk just like any other DOM. See these examples:
https://web.archive.org/web/20211020001935/https://www.4guysfromrolla.com/articles/011211-1.aspx
http://htmlagilitypack.codeplex.com/wikipage?title=Examples&referringTitle=Home
And this question:
How to use HTML Agility pack
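
The "walk the DOM and collect the strings" idea can be sketched without any external library. The example below is in Python rather than C#, purely as an illustration using the standard library's html.parser; it is not the HTML Agility Pack API. The class name, the function name, and the choice of attributes treated as localizable (alt, title, placeholder) are assumptions made for this sketch.

```python
from html.parser import HTMLParser

# Attributes that usually hold user-visible text (an assumption; extend as needed).
TEXT_ATTRIBUTES = {"alt", "title", "placeholder"}
# Elements whose content is code or styling, not translatable text.
SKIPPED_ELEMENTS = {"script", "style"}


class LocalizableTextExtractor(HTMLParser):
    """Collects visible text nodes and selected attribute values."""

    def __init__(self):
        super().__init__()
        self.strings = []
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_ELEMENTS:
            self._skip_depth += 1
        for name, value in attrs:
            if name in TEXT_ATTRIBUTES and value and value.strip():
                self.strings.append(value.strip())

    def handle_endtag(self, tag):
        if tag in SKIPPED_ELEMENTS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.strings.append(text)


def extract_localizable_strings(html):
    """Return the text nodes and localizable attribute values in document order."""
    parser = LocalizableTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.strings
```

A real extractor would probably also record where each string came from (element path or line number) so that translations can be written back; that part is left out here.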

Related

How to make advanced CMS-like blog system from scratch?

When you create a news or blog tab with a CMS, it's really easy to make a feed of posts with content previews. Also, when you follow a link to a particular post, you can see that it consists of various HTML tags and CSS styling, not just plain text, because the post was written in a rich text editor. So just getting text from the db is not enough.
My question is how to achieve the same result when making a website from scratch. It doesn't matter what language is used for the back-end; I'm just interested in the idea of how to do it. But if you could provide code examples (in any language), it would be greatly appreciated.
OK, I've figured it out. Posting the answer for anybody who has a similar question in the future.
The idea is that you store the text together with its HTML tags in your database, and when you retrieve it you put it into your desired div unescaped. The catch is that almost all view (template) engines escape HTML tags by default, so to output the markup unescaped you have to use a built-in function specific to that view engine.
To put the article with HTML tags into the db, you can either write raw HTML into an input field or attach a rich text editor to the input field. The rich text editor will generate the HTML for you.
I've researched it and found out that that's exactly how CMSs work.
So there you have it. If you want to add something, feel free to do so.
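
To make the escaped vs. unescaped distinction concrete, here is a minimal Python sketch that uses plain string formatting in place of a real template engine. The function name render_post and the markup around the post are made up for illustration, and the stored body is assumed to be trusted (or already sanitized) HTML produced by the editor.

```python
from html import escape


def render_post(title, body_html):
    """Render a post: the title is escaped, the stored body is inserted as-is."""
    # Assumption: body_html is trusted or sanitized HTML that came from the
    # rich text editor, so it is deliberately NOT escaped here.
    return '<article><h1>{}</h1><div class="post-body">{}</div></article>'.format(
        escape(title), body_html
    )


# What a template engine does by default: the tags become visible text.
escaped_body = escape("<p>Hi <em>there</em></p>")

# What the post page needs: the markup is passed through unchanged.
page = render_post("Tips & tricks", "<p>Hi <em>there</em></p>")
```

Here escaped_body would show up in the browser as literal tag text, while page keeps the paragraph and emphasis markup intact.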

Need Direction - Web Bot Creation

I want to create something that will look at a specific location on a website, read the value at this location, and then put that value into an already created block of text.
What do I need to start researching to create something like that? Simple direction, such as keywords to Google, would be extremely helpful.
Do you want to get the value from your own website or someone else's website?
If you want to get the value from an HTML element, you can easily do so using JavaScript/jQuery.
If you are looking to write a script that parses the HTML from somebody else's website, you will need an HTML parser. If you plan to use Python, look into Beautiful Soup.
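
As a rough sketch of the whole flow (fetch the page, read one value, drop it into a prepared text), here is a version that sticks to the Python standard library; Beautiful Soup would make the lookup shorter. It assumes the value sits in an element with a known id attribute. All names here (ElementTextFinder, read_value, fill_template, fetch_html) and the "price" id are hypothetical.

```python
from html.parser import HTMLParser
from urllib.request import urlopen

# Elements that never have a closing tag, so they must not affect nesting depth.
VOID_ELEMENTS = {"br", "hr", "img", "input", "link", "meta"}


class ElementTextFinder(HTMLParser):
    """Captures the text of the first element with a given id attribute."""

    def __init__(self, element_id):
        super().__init__()
        self.element_id = element_id
        self.text = None
        self._depth = 0  # > 0 while inside the target element

    def handle_starttag(self, tag, attrs):
        if tag in VOID_ELEMENTS:
            return
        if self._depth:
            self._depth += 1
        elif self.text is None and dict(attrs).get("id") == self.element_id:
            self._depth = 1
            self.text = ""

    def handle_endtag(self, tag):
        if tag not in VOID_ELEMENTS and self._depth:
            self._depth -= 1

    def handle_data(self, data):
        if self._depth:
            self.text += data


def read_value(html, element_id):
    """Return the stripped text of the element with this id, or None if absent."""
    finder = ElementTextFinder(element_id)
    finder.feed(html)
    finder.close()
    return finder.text.strip() if finder.text is not None else None


def fill_template(template, value):
    """Insert the value into a prepared text containing a {value} placeholder."""
    return template.format(value=value)


def fetch_html(url):
    """Download a page; decoding as UTF-8 is an assumption about the site."""
    with urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")
```

Usage would look like fill_template("The current price is {value}.", read_value(fetch_html(url), "price")), where url is whatever page you are watching.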

Count all nodes in a HTML file

Is there an easy way to count the nodes in an HTML file? I also need to count nodes of a certain type, such as div.
I'd like to do this without having to use an external library like HTMLAgilityPack, if possible. Also, the HTML I'm dealing with is not guaranteed to be well formed or valid.
Is there a way to do this from C#?
Thanks.
First of all, are you sure a client-side solution using JavaScript isn't sufficient for your needs?
The easiest way to count nodes within an HTML document is with jQuery in the browser:
<script src="http://code.jquery.com/jquery-1.7.min.js"></script>
<script>
$('html').children().length       // number of direct child elements of the html element
$('body').children().length       // same for the body element
$('body').children('div').length  // number of direct children of body that are div elements
$('body').find('div').length      // number of div elements nested anywhere inside body
</script>
If you are unfamiliar with jQuery, take a look at www.jquery.com.
If you still need a C# solution for server-side parsing of the document, then I would recommend using HTMLAgilityPack (even though you'd rather not). Writing your own parser seems like a waste of time to me, since you would need to handle malformed HTML/XML and the like, which can be a pain.
See this Stack Overflow question: What is the best way to parse html in C#?
Hope this satisfies your needs.
If you have XHTML, you can load it into an XDocument and use the XML manipulation API or LINQ to XML to count the particular nodes.
If you don't, you can try using regular expressions. But this only works for a small number of tags of interest, since you have to define an expression manually for each tag.
With the LinqToXml API, you can easily parse and loop through all the nodes of an HTML document, provided it is well-formed XML. You can find helpful articles on LinqToXml, but they are all written in the context of parsing XML documents.
The following is a similar thread from Stack Overflow: C# Is there a LINQ to HTML, or some other good .Net HTML manipulation API?
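
For comparison, here is what counting looks like with a tolerant, event-based parser and no external library. This is a Python sketch using the standard library's html.parser, not a C# solution; the names TagCounter and count_elements are made up. It counts start tags exactly as they appear in the source, so it copes with unclosed tags, but it does not add elements that a browser would insert implicitly.

```python
from collections import Counter
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Counts element start tags, keyed by lowercased tag name."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        # Called for normal start tags and, by default, for self-closing ones.
        self.counts[tag] += 1


def count_elements(html):
    """Return a Counter mapping tag name to number of occurrences."""
    counter = TagCounter()
    counter.feed(html)
    counter.close()
    return counter.counts


# Deliberately malformed input: the outer <div>, <p>, and <html> are never closed.
counts = count_elements("<html><body><DIV><p>one<div>two</div><br></body>")
total = sum(counts.values())
```

counts["div"] gives the number of div elements and total the number of elements overall.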

Markdown to HTML conversion

I'm still in the middle of coding my final year project at university, and I have come across an issue where I need to convert either from HTML to Markdown or vice versa. I have no experience whatsoever of Perl, Python, etc., so I need an easy-to-implement solution; I only have about 6 weeks left to complete this. I'm writing the data from a WMD text box to SQL Server, and I can upload it as either Markdown or HTML. But if that data needs editing, it cannot be in HTML, as this would be too confusing for the end user, who is assumed to have little or no computing "know how".
What should I do?
Karmastan's answer is probably the best here. Keeping the raw Markdown in the database is a really good solution, as it lets users maintain the content in a form they're familiar with.
However, if you have a bunch of HTML which is already converted, you might want to look at something like Markdownify: The HTML to Markdown converter for PHP.
Edit: based on what you've said below, there are a few things you should keep in mind:
Make sure that the following is set in wmd.js:
wmd_options = {"output": "Markdown"};
This ensures that you're storing Markdown in the database.
Source: How do you store the markdown using WMD in ASP.NET?
When outputting the Markdown to the web, you need to transform it to HTML. To do this, you'll need a library which does Markdown -> HTML conversion. Here are two examples:
Announcing Markdown.NET
Revisied Markdown.NET Library
I'm not a .NET developer, so I can't really help with how these libraries should be used, but hopefully the documentation will make that clear.
If you look at the web site for Markdown, you'll find a Perl script that converts Markdown-syntax documents to HTML. Keep Markdown text in your database and invoke the script whenever you need to display the text. No Perl knowledge required!

How to parse html in a client-side script?

What's the best way to create scripts for a browser?
I need to parse some HTML pages on different domains.
I am on Windows and mostly use Firefox.
If it's just about retrieving the pages to do whatever you want with them, the built-in urllib module in Python will do that for you.
It sounds like you want to retrieve webpages and parse them to extract meaningful data? I would suggest something like TagSoup (for Java), which fires off nice SAX events that you can use directly or feed into an XML module of your choice (raw DOM, JDOM, dom4j, XOM, etc.). The TagSoup page also lists a number of references for other languages, such as Beautiful Soup for Python, Rubyful Soup for Ruby, and others.
From there, I would suggest using something like XPath to retrieve the bits of data that you want. Another option would be XSLT to transform the HTML into some unified format that you can more easily manipulate.
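
To illustrate the XPath step, here is a small Python sketch using xml.etree.ElementTree from the standard library. It supports only a subset of XPath and needs well-formed input, so it assumes the page has already been cleaned up into XHTML (the job TagSoup does in the Java setup above). The sample markup, the class names, and the select_text helper are invented for the example.

```python
import xml.etree.ElementTree as ET

# Well-formed sample markup standing in for a cleaned-up page.
XHTML = """<html><body>
<table id="prices">
<tr><td class="name">Widget</td><td class="price">9.99</td></tr>
<tr><td class="name">Gadget</td><td class="price">19.50</td></tr>
</table>
</body></html>"""


def select_text(xhtml, path):
    """Return the stripped text of every element matched by the path."""
    root = ET.fromstring(xhtml)
    return [(node.text or "").strip() for node in root.findall(path)]


prices = select_text(XHTML, ".//td[@class='price']")
names = select_text(XHTML, ".//table[@id='prices']/tr/td[@class='name']")
```

For full XPath 1.0 or for XSLT you would need a more complete XML library than ElementTree.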
I'd recommend Synthetics Web. Here is a working example at jsFiddle.
jsFiddle
http://jsfiddle.net/dwayne05/YkLVw/
Synthetics Web
http://www.syntheticsweb.com/

Resources