Parsing html document in java - document

I want to know how is possible to load and traverse a document from URL. I want to load a document from URL and being able to select basically html elements by tag name or class name or ID.
Any help will be really appreciated.

Check out a Jsoup library. You should be able to extract data by tags using it.

Related

How to make advanced CMS-like blog system from scratch?

When you create a news or blog tab with CMS it's really easy to make a feed of posts with content preview. Also when you follow a link to a particular post you can notice that it consists of a different html tags and css styling and not just plain text. It just uses rich text editor. So just getting text from db is not enough.
My question is how to achieve the same result when making a website from scratch. It doesn't matter what language is used for back-end. I'm just interested in the idea how to do it. But if you could provide a code examples (with any language) it would be greatly appreciated
Ok I've figured it out. Posting the answer for somebody who will have the similar question in the future.
The idea is that you need to put a text with html tags into your database and then to retrieve it you need to put it in your desired div in unescaped state. The thing is that almost all view (template) engines escape html tags by default. To do that you have to use some built in functions specific to that view engine.
To put the article with html tags in db you can just write raw html into input field or you can somehow add richtext editor to input field. Richtext editor will generate html for you.
I've researched it and found out that that's exactly how cms work.
So there you have it. If you want to add something feel free to do it

Getting thumbnails in OpenSearchServer search results

I need an alternative to Google Custom Search for a website I look after, it has to be something that will crawl a website, index it, allow fiddling of priorities, and then allow search queries via REST or something similar and return XML or JSON etc. It needs to run on a Windows Server instance.
So, I'm up and running with http://www.opensearchserver.com/ and it seems to do the trick, but can't, for the life of me, work out how to get thumbnail images in the results? I've searched the documentation and read everything I could, but can't find out how to do this (or how to get my head around it).
I'm crawling standard web pages and they all have thumbnail meta data, which I'm assuming should be able to be parsed somehow for results and included in the JSON results?
Any pointers at all would be very helpful, thanks!
I figured this out, in case anyone else is struggling, here's how I did it. The answer is in the documentations, it's just not that simple.
Read: http://www.opensearchserver.com/documentation/faq/crawling/how_to_extract_specific_information_from_web_pages.md - it contains the method
Assume you set up a 'web crawler' index.
Assuming you're using a meta thumbnail like this:
<meta name="thumbnail" content="http://my_cdn.com/news/images/29637.jpg">
Go into Schema / Fields. Add a new field called 'thumbnail' with index no, store yes, vector no, analyser Text, copy of blank. Save that.
Now go to schema / parser list, edit HTML parser. Go to 'field mapping', now add a new regex for the thumbnail in the html. We map from the 'htmlSource' to the thumbnail' with the matching regex.
My imperfect regex (that works though) is:
htmlSource -> linked in: thumbnail -> captured by:
(?s)<meta name="thumbnail" content="(.*?)">
Now SAVE this and go to crawl/manual crawl, enter a url that has a thumbnail and then check if the field now appears in the list below when it's read. If not check your regex, and check you actually saved the HTML Parser changes.
To get the thumb in your results, simply add the fieldname to the JSON you send with the query:
"returnedFields": [ "
"url",
"thumbnail"
],

Which is the correct way to to retrieve all the informations from a dbpedia page?

I am working with dbpedia. In my work, my program need to read a dbpedia json file like(http://dbpedia.org/data/Germany.json) and extract all the information as a key value pair, same as dbpedia page(http://dbpedia.org/page/Germany). But I am facing some problem. For example, if you see the json file(please use some json viewer to make it human readable.), if i want to get the language(search language in the file), you will see it is in the json array, so i have to extract that information from the Array. On the other hand, if you search seeAlso, then you will find that you have to go one level up and find the information. Further more , there are some information in the HTML page(http://dbpedia.org/page/Germany) but that is not found in the metadata json
file(http://dbpedia.org/data/Germany.json). For example, "birthPlace" is in html page but not in the json file. I am totally confused that, how i will code that can read and store(as key value mapping) the data as like as seen in the html page.
DBpedia data is organized by resource, where each "resource" is a page on Wikipedia and (presumably) a thing in the real world. Each resource is referred to with a URL. The JSON file contains a whole bunch of resources (such as http://dbpedia.org/resource/Opel_Kadett_C) that have some link with the resource you're interested in, http://dbpedia.org/resource/Germany. I think this is supposed to include all the information at http://dbpedia.org/page/Germany, but clearly some entries -- such as db:Anja_Kling -- are missing. I'm not sure why that is, but it might be a bug -- if you don't get a better answer here, you should try e-mailing your questions to the dbpedia-discussion mailing list at https://lists.sourceforge.net/lists/listinfo/dbpedia-discussion. Hope that helps!

Count all nodes in a HTML file

Is there an easy way to count the nodes in a HTML file? I also need to count nodes of a certain type such as div etc.
I'd like to do this if possible without having to use an external library like HTMLAgilityPack if possible. Also, the HTML I'm dealing with is not guarenteed to be well formed and valid.
Is there a way to do this from C#?
Thanks.
first of all. are your sure a client-side solution using javascript isn't sufficent to your needs?
because the easiest way to count nodes within an HTML document is using jQuery on the client-side browser.
<script src="http://code.jquery.com/jquery-1.7.min.js"></script>
<script>
$('html').children() // will give you all child elements of the html element
$('body').children() // same for body element
$('body').children('div') // will give you just the direct children elements of 'div' type
$('body').find('div') // will give you all the nested elements of 'div' type
</script>
if you are unfamilier with jQuery then take a look at www.jquery.com
if u still need a C# solution for server-side parsing of the document then then i would recommend to use HTMLAgilityPack (even thou you wish not to). writing your own parser seems to me like a waste of time as you need to consider malformed html/xml and such which can be a pain.
try and use this s-overflow article: What is the best way to parse html in C#?
hope it will satisfy your needs
If you have XHTML you can load it in a XDocument and use XML manipulation API or LINQ to XML to count the particular modes.
If you don't you can try using Regular Expressions. But this one works in small number of interesting tags since you have to define manually an expression for each tag.
With LinqToXml API, you can easily parse and loop through all the nodes of an HTML document. You can find helpful articles related to LinqToXml but all in context of parsing XML documents.
Following is a similar thread from StackOverflow : C# Is there a LINQ to HTML, or some other good .Net HTML manipulation API?

Extract localizable content from a HTML page

I need some advice on the best aproach to a feature I need to implement in a project I'm working on.
Basically, I need to be able to extract all localizable content (i.e. all the strings) from a HTML page. I really don't want to have to go and write a HTML parser. The application is written in C#.
Has anybody got any experience with this, or can anyone recommend an existing library that I could use to accomplish this?
Thanks.
You do not have to write your own parser. Fortunately somebody else already did that.
To parse HTML file, you can use HTML Agility Pack.
In this case you would receive Document Object Model, which you can walk just like any other DOM. Please find these examples:
https://web.archive.org/web/20211020001935/https://www.4guysfromrolla.com/articles/011211-1.aspx
http://htmlagilitypack.codeplex.com/wikipage?title=Examples&referringTitle=Home
And this question:
How to use HTML Agility pack

Resources