I need to develop a vertical search engine as part of a website. The data for the search engine comes from websites in a specific category. I guess for this I need a crawler that crawls several (a few hundred) sites in a specific business category and extracts the content and URLs of products and services. Other types of pages may be irrelevant. Most of the sites are tiny or small (a few hundred pages at most). The products have 10 to 30 attributes.
Any ideas on how to write such a crawler and extractor? I have written a few crawlers and content extractors using the usual Ruby libraries, but not a full-fledged search engine. I guess the crawler, from time to time, wakes up and downloads the pages from the websites. Usual polite behavior, like checking robots exclusion rules, will be followed, of course. The content extractor can then update the database after it reads the pages. How do I synchronize the crawler and the extractor? How tightly should they be integrated?
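To make the synchronization question concrete, one arrangement I could imagine is to have the crawler and extractor never call each other at all: they just share a table and run on independent schedules. A minimal sketch follows (in Python rather than Ruby, purely for illustration; all names such as pages.db, parse_product and save_product are made up):

```python
# Hypothetical sketch (not the only way): decouple crawler and extractor
# through a shared "pages" table. The crawler only writes raw HTML; the
# extractor only reads rows it has not processed yet.
import sqlite3
import time
import urllib.request

DB = "pages.db"

def init_db():
    con = sqlite3.connect(DB)
    con.execute("""CREATE TABLE IF NOT EXISTS pages (
                       url TEXT PRIMARY KEY,
                       html TEXT,
                       fetched_at REAL,
                       extracted INTEGER DEFAULT 0)""")
    con.commit()
    return con

def crawl(con, urls):
    # Crawler pass: robots.txt checks and per-host scheduling omitted for brevity.
    for url in urls:
        html = urllib.request.urlopen(url, timeout=10).read().decode("utf-8", "replace")
        con.execute("INSERT OR REPLACE INTO pages VALUES (?, ?, ?, 0)",
                    (url, html, time.time()))
        con.commit()
        time.sleep(1)  # crude politeness delay

def parse_product(html):
    # Placeholder: real attribute extraction (Nokogiri, BeautifulSoup, ...) goes here.
    return {"title": html[:60]}

def save_product(url, attributes):
    # Placeholder: write to the product database / search index.
    print(url, attributes)

def extract(con):
    # Extractor pass: process whatever the crawler has stored since last time.
    rows = con.execute("SELECT url, html FROM pages WHERE extracted = 0").fetchall()
    for url, html in rows:
        save_product(url, parse_product(html))
        con.execute("UPDATE pages SET extracted = 1 WHERE url = ?", (url,))
    con.commit()
```

With something like this the two sides stay loosely coupled: the only contract is the shape of the shared table.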
Nutch builds on Lucene and already implements a crawler and several document parsers.
You can also hook it to Hadoop for scalability.
In the enterprise-search context that I am used to working in,
crawlers,
content extractors,
search engine indexes (and the loading of your content into these indexes),
being able to query that data efficiently and with a wide range of search operators,
programmatic interfaces to all of these layers,
optionally, user-facing GUIs
are all separate topics.
(For example, while extracting useful information from HTML pages vs. PDF vs. MS Word files is conceptually similar, the actual programming for these tasks is still very much a work in progress for any general solution.)
You might want to look at the Lucene suite of open-source tools, understand how those fit together, and possibly decide that it would be better to learn how to use those tools (or others, similar) than to reinvent the very big, complicated wheel.
I believe in books, so thanks to your query, I have discovered this book and have just ordered it. It looks like a good take on one possible solution to the search-tool conundrum.
http://www.amazon.com/Building-Search-Applications-Lucene-LingPipe/product-reviews/0615204252/ref=cm_cr_pr_hist_5?ie=UTF8&showViewpoints=0&filterBy=addFiveStar
Good luck and let us know what you find out and the approach you decide to take.
I am interested in implementing a search feature on a website. It is a location search, so address, state, and zip should all work. The search should then show results in that area and allow them to be filtered.
My question is:
What's the best approach for something like this?
There are literally dozens of ways of doing this (if not more). The exact implementation would depend on the technology stack that you use, but as a very top-level overview:
you'd need to store the things you are searching for somewhere, and tag them with a lat/long location. Often, this would be in a database of some kind.
using a programming language, you would need to write a search that accepts a postcode, translates that to a lat/long, and then searches the things in your database based on the distance between the location of the thing and the location entered in the search (see the sketch after this list).
if you want to support filtering, your search would need to support that too. This is often called "faceting" the search.
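To make the distance step concrete, here is a minimal sketch assuming the things are already geocoded and held in memory; the `places` structure and its field names are made up. In practice you would push this into the database or search engine rather than filtering in application code.

```python
# Minimal "find things within N km of a point" sketch; assumes each place
# already carries a lat/long (the dict layout here is hypothetical).
from math import radians, sin, cos, asin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    # Great-circle distance between two lat/long points, in kilometres.
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371 * asin(sqrt(a))

def nearby(places, lat, lon, radius_km, category=None):
    # The optional category argument doubles as a very crude facet filter.
    hits = []
    for p in places:
        d = haversine_km(lat, lon, p["lat"], p["lon"])
        if d <= radius_km and (category is None or p["category"] == category):
            hits.append((d, p))
    return [p for d, p in sorted(hits, key=lambda t: t[0])]  # nearest first

places = [
    {"name": "Cafe A", "lat": 51.5101, "lon": -0.1340, "category": "cafe"},
    {"name": "Gym B", "lat": 51.5500, "lon": -0.2000, "category": "gym"},
]
print(nearby(places, 51.5074, -0.1278, radius_km=2))
```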
Working out the lat/long locations will need to be done using a geolocation service. There are some, such as PostCode Anywhere, that will do this as a paid service, and others that are free (within reason), such as the Google Maps APIs.
There are probably some hosted services that will do what you want, you'd have to shop around.
Examples of search software that supports geolocation searching out of the box are things like Solr, Azure Search, Lucene and Elastic.
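As a rough illustration of the Elasticsearch flavour: assuming a `places` index with a `location` field mapped as a `geo_point` and a cluster at localhost:9200 (all assumptions, not a recipe), a radius search plus a simple filter is a single query:

```python
# Hypothetical Elasticsearch geo_distance query; index and field names are made up.
import json
import requests

query = {
    "query": {
        "bool": {
            "must": {"match": {"category": "restaurant"}},  # crude facet/filter
            "filter": {
                "geo_distance": {
                    "distance": "10km",
                    "location": {"lat": 51.5074, "lon": -0.1278},
                }
            },
        }
    }
}

resp = requests.get(
    "http://localhost:9200/places/_search",
    headers={"Content-Type": "application/json"},
    data=json.dumps(query),
)
print(resp.json())
```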
I'm a computer science student and I am a bit inexperienced when it comes to web crawling and building search engines. At this time, I am using the latest version of Open Search Server and am crawling several thousand domains. When using the built-in search engine creation tool, I get search results that are related to my query, but they are ranked using a vector model of the documents as opposed to the PageRank algorithm or something similar. As a result, the top results are only marginally helpful, whereas higher-quality results from sites such as Wikipedia are buried on the second page.
Is there some way to run a crude PageRank algorithm in Open Search Server? If not, is there a similarly easy-to-use open-source package that does this?
Thanks for the help! This is my first time doing anything like this so any feedback is greatly appreciated.
I am not familiar with Open Search Server, but I know that most of the students working on search engines use Lucene or Indri. Reading papers on novel approaches to document search, you can find that the majority of them use one of these two APIs. Lucene is more flexible than Indri in terms of defining different ranking algorithms. I suggest taking a look at these two and seeing if they are convenient for your purpose.
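If you do end up computing a crude PageRank yourself over the link graph you already have from crawling (and then feeding the scores back into whatever engine you use), the power-iteration version is short. A minimal sketch, with the graph represented as a plain dict of outgoing links:

```python
# Crude PageRank by power iteration; graph maps each page to its outgoing links.
def pagerank(graph, damping=0.85, iterations=50):
    pages = list(graph)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for p, out_links in graph.items():
            targets = [q for q in out_links if q in rank]
            if not targets:
                # Dangling page: spread its rank evenly over all pages.
                for q in pages:
                    new_rank[q] += damping * rank[p] / n
            else:
                share = damping * rank[p] / len(targets)
                for q in targets:
                    new_rank[q] += share
        rank = new_rank
    return rank

# Tiny example: A -> B, A -> C, B -> C, C -> A
print(pagerank({"A": ["B", "C"], "B": ["C"], "C": ["A"]}))
```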
As you mention, the web crawl template of OpenSearchServer uses a search query with a relevancy based on the vector space model. But if you use the latest version (v1.5.11), it also mixes in the number of backlinks.
You may change the weight of the score based on the backlinks; by default it is set to 1.
We are currently working on providing more control on the relevance. This will be visible in future versions of OpenSearchServer.
I am basically working on NLP, collecting interest-based data from web pages.
I came across http://schema.org/ as a source that is helpful for NLP work.
I went through the documentation, from which I can see that it adds additional tag properties to identify HTML tag content.
It may help search engines get specific data for a user query.
It says: Schema.org provides a collection of shared vocabularies webmasters can use to mark up their pages in ways that can be understood by the major search engines: Google, Microsoft, Yandex and Yahoo!
But I don't understand how it can help me as an NLP guy. Generally I parse web page content to process and extract data from it. Schema.org may help there, but I don't know how to utilize it.
Any example or guidance would be appreciated.
Schema.org uses the microdata format for representation. People use microdata for text analytics and for extracting curated content. There can be numerous applications.
Suppose you want to create a news summarization system. You can use the hNews microformat to extract the most relevant content and perform summarization on it.
Suppose you have a review-based search engine where you want to list products with the most positive reviews. You can use the hReview microformat to extract the reviews, then perform sentiment analysis on them to identify whether a product has negative or positive reviews.
If you want to create a skill-based resume classifier, extract content with the hResume microformat, which can give you various details such as contact (using the hCard microformat), experience, achievements, education, skills/qualifications, affiliations, publications, performance, etc. You can then run a classifier on it to classify CVs with particular skill sets.
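To make the extraction step concrete: schema.org's inline microdata (itemscope/itemprop attributes) can be pulled out with nothing more than an HTML parser. A minimal sketch with BeautifulSoup follows; the sample HTML is fabricated and nested items are not handled.

```python
# Sketch: collect schema.org microdata name/value pairs from a page.
# Requires beautifulsoup4; the HTML below is a made-up example.
from bs4 import BeautifulSoup

html = """
<div itemscope itemtype="http://schema.org/Product">
  <span itemprop="name">Espresso machine</span>
  <span itemprop="price">199.00</span>
  <span itemprop="description">Compact 15-bar pump machine.</span>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
for scope in soup.find_all(itemscope=True):
    item = {"type": scope.get("itemtype")}
    for prop in scope.find_all(itemprop=True):
        item[prop["itemprop"]] = prop.get_text(strip=True)
    print(item)
```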
Though schema.org does not help NLP folks directly, it provides a platform to perform text processing in a better way.
Check out http://en.wikipedia.org/wiki/Microformat#Specific_microformats to see the various microformats; the same page will give you more details.
Schema.org is something like a vocabulary or ontology for annotating data, here specifically Web pages.
It's a good idea to extract microdata from Web pages, but is it really used by Web developers? I don't think so; I think the majority of microdata is consumed by companies such as Google or Yahoo.
Finally, you can find data, but not a lot, and it is mainly used by a specific type of website.
What do you want to extract, and for what type of application? You can probably use another source of data, such as DBpedia or Freebase, for example.
GoodRelations also supports schema.org. You can annotate your content on the fly from the front end based on the various domain contexts defined. So schema.org is very useful for NLP extraction. One can even use it for HATEOAS services for hypermedia link relations. Metadata (data about data) for any context is good for content and data in general. Alternatives include microformats, RDFa, RDFa Lite, etc. The more context you have the better, as it will turn your data into smart content and help crawler bots understand the data. It also leads further into the web of data and helps with global queries over resource domains. In the long run such approaches will help towards domain adaptation of agents for transfer learning on the web, pretty much making the web of pages an externalized unit of a massive commonsense knowledge base. They also help advertising agencies understand publisher sites and better contextualize ad retargeting.
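For what it's worth, a lot of schema.org data is also published as JSON-LD inside a script tag rather than as inline microdata, and that is just as easy to harvest. A minimal sketch (the URL is a placeholder):

```python
# Sketch: pull schema.org JSON-LD blocks out of a fetched page.
# Requires requests and beautifulsoup4; the URL is a placeholder.
import json
import requests
from bs4 import BeautifulSoup

html = requests.get("https://example.com/some-annotated-page").text
soup = BeautifulSoup(html, "html.parser")
for tag in soup.find_all("script", type="application/ld+json"):
    try:
        data = json.loads(tag.string or "")
    except json.JSONDecodeError:
        continue
    # Pages may embed a single object or a list of them.
    items = data if isinstance(data, list) else [data]
    for item in items:
        if isinstance(item, dict):
            print(item.get("@type"), item.get("name"))
```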
Google is not using the meta-keywords tag at all, because keywords were mostly used to spam search engines.
Google is not using the meta-description tag for ranking. Sometimes the meta-description tag is used for the site snippet in search results if part of the content does not fit. But mostly the snippet is generated automatically from the content of the page and is the same as the beginning of the page content.
Google has dropped support for the meta-keywords and meta-description tags for search ranking. Google handles about 92% of all search queries in the world. So now web developers can stop using the meta-keywords and meta-description tags, because spending time on them is not worth it.
Is there any real benefit for using meta-keywords and meta-description tags?
Links:
Google Webmasters Blog about meta tags support by Google;
Video with Matt Cutts about meta tags support by Google;
StatCounter Search Engines stats usage - Google handles about 92% of all search queries in the world;
No, we should carry on using meta tags, because we aren't, and shouldn't be, just supporting Google. The approach should be: make documents as indexable as possible using a search-engine-agnostic approach, and then put in special handling for one or two top engines, such as using Google's online tools to improve search ranking.
Google are very dominant in search at present, but there's no guarantee they will always be on top. Maybe it will be Facebook in the future, or perhaps Yahoo/Bing etc. will dramatically improve search quality, and people will switch back.
Side note: for search, I really like DuckDuckGo at the moment. Lots of nice search shortcuts (see bang operators) and a meaningful privacy policy.
We should use them because they are there. Who knows - perhaps they will be useful again in the future?
When the W3C drop them we can stop using them.
Just my opinion ofc...
keywords:
Google is not the only search engine. Google market share depends on many factors (country, age, technical know-how, …). Small percentages of big numbers are also big numbers.
There are special purpose search engines (for niches; only crawling hand-selected sites; etc.) that might still consider it.
Local search engines might use it. (Local) full text search engines anyway.
Some CMS use it for site search.
There are other consuming user-agents than search engines, e.g. parser/extractor.
description:
It can be useful even for Google: when someone searches only for the title/domain of your site, Google would often display snippets like "Login / Register … back to top … please insert CAPTCHA …" etc. If a description is provided, it could be used instead.
(The points mentioned under keywords are relevant for description, too.)
If Google SEO is your only concern, then meta keywords are a complete waste of time, but if you're targeting other search engines it may be worth investigating.
I believe Baidu still uses meta keywords, and that search engine is the dominant player in the Chinese market, so it'd be worth adding meta keywords if you want your site to be popular in China.
Regardless, I wouldn't go stuffing excessive numbers of irrelevant keywords in, as there is every chance that whatever search engine you're targeting will penalise you. 5-7 words summarising your page content is a good starting point.
I’m building a website for a restaurant which consists of several static pages, like ‘About us’, and an editable menu.
I need a CMS flexible enough to be able to add items individually (by individually, I mean adding items doesn’t amount to pasting an HTML list of n products into another static page).
Each item should contain its name, description, price and category. The list of added items should be displayed using templates, the way I want it to be.
Can you suggest any lightweight CMS which can provide similar conditions?
There are tons of options for simple page creation. Have you considered just using one of the many free website builders out there? Then you don't even have to worry about finding hosting; just make it happen quickly and easily with one of them. For instance, take a look at Weebly or Wix. Both allow for free pages and both are incredibly easy to use. Squarespace is another solid option (and one of my favorites) but charges a small fee (which I personally think is worth it).
Weebly allows for some slick drag and drop of page elements into place as does Wix. They are what I would classify as the easiest of the batch while Squarespace provides for an excellent user interface experience.
Other options, if you'd prefer something hosted on your own, would depend on your experience level. I am a huge fan of Processwire, and ImpressPages has come along nicely and is a great little CMS too.
These are exceptions to the typical Top Three that everyone tends to recommend, I know, but I like to spread the word about other projects instead of the usual ones.
Cheers!
Mike
Sounds like a job for WordPress 3.0 plus the Custom Post Types UI and Verve Meta Boxes plugins. WordPress will handle the static pages; the other two plugins will allow you to make a Menu Item post type with custom fields.
It is not exactly lightweight, but you could do it with Drupal. You can define your own content type "product", use the CCK module to add your fields (price, ...), and use the Views module to display it how you want.
Drupal has a relatively steep learning curve, so it may be overkill for this project. It is definitely flexible enough for this, though.