Parsing Open Graph tags with Nutch (into Elasticsearch)

I have a running Nutch 2.3.1/HBase installation that parses and indexes web pages just fine. Now I need to parse Open Graph tags (namely og:image and og:description). From several fragments found on the web I have learned that Tika basically supports parsing Open Graph tags, but I am lost trying to figure out how to integrate this into Nutch.
Can someone point me in the right direction? Maybe with an example?
Thanks
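For reference, the extraction step itself is small once you have the page's HTML. Here is a minimal sketch using jsoup (an assumption chosen for illustration; in Nutch the natural home for this logic is a parse-filter plugin, and Tika can surface the same metadata) that pulls og:* meta tags out of a document:

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;

    public class OgTagExtractor {
        // Extracts one Open Graph property (e.g. "og:image") from raw HTML.
        // Open Graph tags look like: <meta property="og:image" content="...">
        public static String ogProperty(String html, String property) {
            Document doc = Jsoup.parse(html);
            Element meta = doc.selectFirst("meta[property=" + property + "]");
            return meta == null ? null : meta.attr("content");
        }

        public static void main(String[] args) {
            String html = "<html><head>"
                + "<meta property=\"og:image\" content=\"http://example.com/a.png\">"
                + "<meta property=\"og:description\" content=\"An example page\">"
                + "</head></html>";
            System.out.println(ogProperty(html, "og:image"));       // http://example.com/a.png
            System.out.println(ogProperty(html, "og:description")); // An example page
        }
    }

Once values like these are attached to the parse metadata, an indexing filter (e.g. Nutch's index-metadata plugin) can carry them through to Elasticsearch.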

Related

What is another way to crawl the web besides following hyperlinks?
Most major sites use Sitemaps. These give your crawler a fast way of discovering URLs and can be used alongside, or instead of, following outlinks.
The crawler-commons project provides a Sitemap parser in Java.
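A minimal sketch with crawler-commons (class names are from its sitemaps package; the sitemap URL is hypothetical, and it is worth double-checking the signatures against the version you pull in):

    import java.io.InputStream;
    import java.net.URL;

    import crawlercommons.sitemaps.AbstractSiteMap;
    import crawlercommons.sitemaps.SiteMap;
    import crawlercommons.sitemaps.SiteMapIndex;
    import crawlercommons.sitemaps.SiteMapParser;
    import crawlercommons.sitemaps.SiteMapURL;

    public class SitemapDemo {
        public static void main(String[] args) throws Exception {
            URL url = new URL("http://example.com/sitemap.xml"); // hypothetical sitemap URL
            byte[] content;
            try (InputStream in = url.openStream()) {
                content = in.readAllBytes(); // JDK 9+
            }

            AbstractSiteMap sm = new SiteMapParser().parseSiteMap(content, url);
            if (sm instanceof SiteMap) {
                // A plain sitemap: list the page URLs it contains.
                for (SiteMapURL u : ((SiteMap) sm).getSiteMapUrls()) {
                    System.out.println(u.getUrl());
                }
            } else if (sm instanceof SiteMapIndex) {
                // A sitemap index points at further sitemaps; recurse into those.
                for (AbstractSiteMap child : ((SiteMapIndex) sm).getSitemaps()) {
                    System.out.println("sub-sitemap: " + child.getUrl());
                }
            }
        }
    }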

Joomla find front-end content in back-end

I have some text in the front-end which I can't locate in the back-end.
Is there a way to search for it or otherwise figure it out?
To be more specific, it's text I wrote some time ago, but I can't find which component it belongs to. There are no articles at all, and it's not in the component that is assigned to this menu item.
With the limited information you've provided, there's very little specific advice we can give. As is, we don't know whether the text is in the database, in the code, or even part of a language definition. It could also be the result of your web server being compromised; if ads for Viagra and other such stuff are showing up, that is most likely the issue.
You can use any flavor of MySQL client to search the database and see whether the text is stored there (e.g. a LIKE '%your text%' query against the content tables).
To search the code, use your IDE's search functionality. If you're not using an IDE, download and install Notepad++; once it's launched, hit Ctrl + F to open the search utility. There is a tab there for searching all files under a specific directory root; use it to search for the text.
Although that search may turn up a core language definition, you should not edit core files; Joomla's back end has a tool for overriding core language definitions instead.
Good luck!

How to access Wikipedia using Node.js

I am looking for the simplest way to integrate Wikipedia into a Node.js app.
The requirements are to be able to search for entries and find entities in each entry.
Any known existing libs/methods for that?
Thanks
There's a newly available open-source parser for wiki text (http://sweble.org/) that might be useful to you if you roll your own solution. Of course, that would require you to download the Wikipedia data dump, parse it, and store the entities in a DB.
You could also look at DBpedia (http://dbpedia.org/About), though that would require integrating the RDF stack into your app (either running a local RDF repository or communicating with the often-flaky online version via SPARQL).
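To illustrate the online route, here is a small sketch that queries DBpedia's public SPARQL endpoint over plain HTTP (written in Java for consistency with the other examples on this page; the resource URI and endpoint parameters are assumptions about the public service):

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.URL;
    import java.net.URLEncoder;

    public class DbpediaLookup {
        public static void main(String[] args) throws Exception {
            // Ask DBpedia for the English abstract of a resource.
            String query =
                "SELECT ?abstract WHERE { "
              + "<http://dbpedia.org/resource/Node.js> "
              + "<http://dbpedia.org/ontology/abstract> ?abstract . "
              + "FILTER (lang(?abstract) = 'en') }";
            String url = "http://dbpedia.org/sparql?format=application/sparql-results%2Bjson"
                       + "&query=" + URLEncoder.encode(query, "UTF-8");

            try (BufferedReader in = new BufferedReader(
                    new InputStreamReader(new URL(url).openStream(), "UTF-8"))) {
                String line;
                while ((line = in.readLine()) != null) {
                    System.out.println(line); // raw SPARQL JSON results
                }
            }
        }
    }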
One easy approach is to use a search engine API and restrict it to site:wikipedia.org, e.g.:
http://www.google.com/search?q=node.js+site%3Awikipedia.org
I've found that it can work really well.
Spider, which scrapes using jQuery, is fantastic:
https://github.com/mikeal/spider
Mikeal is the man
Presumably you'd be using this for a side (personal) project, though. Not sure how kosher it is to run wild on Wikipedia with a scraper.

A crawler that builds the link tree from a single website

I want to know if there are any off-the-shelf solutions for a crawler that will parse only the links and pages from a given website, and will output:
1. The link tree
2. The pages (where necessary)
Thanks!
You don't need any particular framework to achieve this task. What languages do you know? If you know Java, you can use the HttpClient or HttpUnit libraries to help you with the crawling tasks.
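As a sketch of the link-tree idea (using jsoup for fetching and parsing, purely for brevity; HttpClient plus any HTML parser would work the same way), a breadth-first crawl that prints each parent -> child edge:

    import java.util.ArrayDeque;
    import java.util.HashSet;
    import java.util.Queue;
    import java.util.Set;

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;

    public class LinkTreeCrawler {
        public static void main(String[] args) throws Exception {
            String root = "http://example.com/"; // hypothetical start page
            String site = "example.com";         // crude same-site check

            Set<String> seen = new HashSet<>();
            Queue<String> queue = new ArrayDeque<>();
            queue.add(root);
            seen.add(root);

            while (!queue.isEmpty()) {
                String url = queue.poll();
                Document doc;
                try {
                    doc = Jsoup.connect(url).get(); // fetch and parse the page
                } catch (Exception e) {
                    continue;                       // skip unfetchable pages
                }
                for (Element a : doc.select("a[href]")) {
                    String child = a.absUrl("href"); // resolve relative links
                    if (child.contains(site) && seen.add(child)) {
                        // Each parent -> child pair is one edge of the link tree.
                        System.out.println(url + " -> " + child);
                        queue.add(child);
                    }
                }
            }
        }
    }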
If you are a Python user, there is a great framework called Scrapy (http://scrapy.org/). You should check it out.

Using Windows Live Writer as a web site content editor

What are your thoughts on using Windows Live Writer communicating with your website as the content editing system?
Windows Live Writer supports multiple-category blogs (i.e. categories can be news, articles, and blogs), multiple-category pages, tagging, XHTML WYSIWYG editing, and image and file uploading via services or FTP, and the client has an "Insert HTML" plug-in library with a lot of already-developed plug-ins for popular sites.
The trickiest part is implementing all of the XML-RPC methods in your services, but some digging with Reflector showed that the features are pretty simple to implement.
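For a sense of what those methods look like, here is a hypothetical server-side sketch of the two core MetaWeblog API calls Live Writer makes, written against Apache XML-RPC conventions (struct parameters arrive as Maps); everything here is illustrative, not a drop-in implementation:

    import java.util.Date;
    import java.util.Map;

    // Hypothetical handler exposing the subset of the MetaWeblog API that
    // Windows Live Writer calls; registered with Apache XML-RPC as "metaWeblog".
    public class MetaWeblogHandler {

        // metaWeblog.newPost(blogid, username, password, struct, publish)
        // The struct carries "title", "description" (the HTML body), "categories", etc.
        public String newPost(String blogId, String user, String password,
                              Map<String, Object> post, boolean publish) {
            String title = (String) post.get("title");
            String body = (String) post.get("description");
            // ... authenticate, persist title/body, return the new post's id ...
            return String.valueOf(new Date().getTime()); // placeholder post id
        }

        // metaWeblog.editPost(postid, username, password, struct, publish)
        public boolean editPost(String postId, String user, String password,
                                Map<String, Object> post, boolean publish) {
            // ... authenticate and update the stored post ...
            return true;
        }
    }

In practice Live Writer typically also calls blogger.getUsersBlogs during account setup, so that method needs to exist as well.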
I've considered it, but it's kind of like putting a triangle into a round hole. It will fit, but not quite right. Since the primary focus is on blogging, the page editing would be counter-intuitive if you presented it to someone as a page editor.
Well, in the case where a web site's normal update pattern is to post new "news", a.k.a. blog posts, page editing then becomes secondary, for updating the static content.
I was thinking the exact same thing. Using Windows Live Writer or MS Word 2007 (it supports Atom Publishing as well) to edit web materials on a site would be awesome (in theory), right?
I tried looking into creating an AtomPub server (using the Google Data API, Apache Abdera, or Project ROME) as a simple Atom publishing backend on Java Google App Engine. It would save entities and images into the GAE datastore, and the saved data could be shown via a simple front end on the site. All editing would be done in MS Word.
But creating a custom AtomPub server turned out to be overwhelmingly hard for me. I'll throw in the towel for now, at least on the AtomPub protocol. Something dead-simple like a CRUD entity interface might still be possible for Windows Live Writer, as it supports simpler protocols.
As far as I know, it hasn't been done for GAE. The Umbraco ASP.NET CMS supports it, though.
