I need to know whether it is worth building a crawler on top of the results returned by a search engine.
That is: for a given query, grab N URLs from a search engine and feed them into a crawler to find more pages relevant to the search. Is there any scientific paper or experiment showing that doing this gathers more relevant pages than taking URLs only from the search engine?
If I understood you correctly, you would be rebuilding the search engine, since its job is already to bring the most related/relevant results first for a search. Although you did not name your search engine directly, I guess it is Google, so I would suggest you use the advanced search options before trying anything else. Google provides an API for performing searches, which you can use in your system. If that approach does not fit your needs, it is possible to crawl over Google's results, and even perform custom searches (for example, filtering results by site, term, etc.), but Google would not be happy with this and would eventually block your calls. I suggest you give its open API a try...
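For reference, a minimal sketch of seeding a crawler from Google's Custom Search JSON API; the `API_KEY` and `CX` values are placeholders you would get from the Google Cloud console, and the API caps `num` at 10 results per request:

```python
import requests

API_KEY = "YOUR_API_KEY"  # placeholder: API key from the Google Cloud console
CX = "YOUR_ENGINE_ID"     # placeholder: ID of your custom search engine

def fetch_seed_urls(query, n=10):
    """Fetch the top-n result URLs for a query, to use as crawler seeds."""
    resp = requests.get(
        "https://www.googleapis.com/customsearch/v1",
        params={"key": API_KEY, "cx": CX, "q": query, "num": min(n, 10)},
        timeout=10,
    )
    resp.raise_for_status()
    return [item["link"] for item in resp.json().get("items", [])]

print(fetch_seed_urls("focused crawling"))
```

The returned links can then be handed to whatever crawler you build, instead of scraping the results pages.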
I'm a computer science student and a bit inexperienced when it comes to web crawling and building search engines. At the moment I am using the latest version of OpenSearchServer and am crawling several thousand domains. When using the built-in search engine creation tool, I get search results that are related to my query, but they are ranked using a vector space model over the documents rather than the PageRank algorithm or something similar. As a result, the top results are only marginally helpful, whereas higher-quality results from sites such as Wikipedia are buried on the second page.
Is there some way to run a crude PageRank algorithm in OpenSearchServer? If not, is there a similarly easy-to-use open-source package that does this?
Thanks for the help! This is my first time doing anything like this so any feedback is greatly appreciated.
I am not familiar with OpenSearchServer, but I know that most students working on search engines use Lucene or Indri. Reading papers on novel approaches to document search, you will find that the majority use one of these two APIs. Lucene is more flexible than Indri in terms of defining different ranking algorithms. I suggest taking a look at both and seeing whether they are convenient for your purpose.
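For what it's worth, a crude PageRank is also easy to prototype outside any engine and feed back in as per-document boosts. A minimal power-iteration sketch in Python, on a toy link graph (every page must appear as a key):

```python
def pagerank(links, damping=0.85, iterations=20):
    """Crude PageRank by power iteration.

    links maps each page to the list of pages it links to.
    """
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outlinks in links.items():
            if not outlinks:  # dangling page: spread its rank evenly
                for p in pages:
                    new_rank[p] += damping * rank[page] / n
            else:
                share = damping * rank[page] / len(outlinks)
                for target in outlinks:
                    new_rank[target] += share
        rank = new_rank
    return rank

# Toy link graph: B and C both point at A, so A should rank highest.
print(pagerank({"A": ["B"], "B": ["A"], "C": ["A"]}))
```

This is just the bare algorithm, not OpenSearchServer's or Lucene's API; the link graph would come from your own crawl data.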
As you mention, the web crawl template of OpenSearchServer uses a search query whose relevance is based on the vector space model. But if you use the latest version (v1.5.11), it also mixes in the number of backlinks.
You can change the weight of the backlink-based score; by default it is set to 1.
We are currently working on providing more control over the relevance. This will be visible in future versions of OpenSearchServer.
I am looking into using ElasticSearch as a search engine for one of the projects I am working on.
There is still one thing I need to find an answer for, and I hope someone here can help.
The customer wants to be able to see some search statistics, similar to Google Analytics: most-searched words, new search words, and so on.
Is there a way to easily set up this kind of search statistics? My idea is that ElasticSearch would store a search history of the requests made to the REST API; my customer could then use Kibana or some other visualization tool to monitor it.
Hope someone can help me with an answer for this.
Regards
Jacob
You could adjust the slow log to a threshold at which it captures all requests, but that will produce large log files which will require maintenance. Alternatively, you could write an application that handles all of your ES requests: it takes the search phrase, indexes it into a separate index (i.e. your search-history index), and then deals with the actual request as normal, returning the response to the user.
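A minimal sketch of that proxy idea, assuming the elasticsearch-py 7.x client; the index names (`search-history`, `products`) and field names (`phrase`, `content`) are placeholders to adapt:

```python
from datetime import datetime, timezone

from elasticsearch import Elasticsearch  # elasticsearch-py 7.x client

es = Elasticsearch(["http://localhost:9200"])  # adjust to your cluster

def logged_search(index, phrase):
    """Record the search phrase in a history index, then run the real query."""
    # Each logged document carries the phrase and a timestamp, so Kibana
    # can aggregate top phrases, new phrases, searches over time, etc.
    es.index(index="search-history", body={
        "phrase": phrase,
        "timestamp": datetime.now(timezone.utc).isoformat(),
    })
    # Execute the actual search as normal and return it to the caller.
    return es.search(index=index, body={"query": {"match": {"content": phrase}}})

results = logged_search("products", "red shoes")  # hypothetical index and phrase
```

Pointing Kibana at the search-history index then gives the customer the most-searched and newest phrases without touching the main indexes.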
Google does not use the meta-keywords tag at all, because keywords are mostly used to spam search engines.
Google does not use the meta-description tag for ranking either. Sometimes the meta-description is used for the site snippet in search results if part of the content does not fit; but mostly the description is generated automatically from the content of the page and is the same as the beginning of that content.
So Google has dropped support for the meta-keywords and meta-description tags in search ranking, and Google handles about 92% of all search queries in the world. It would seem that web developers can stop using the meta-keywords and meta-description tags, because spending time on them is not worth it.
Is there any real benefit to using the meta-keywords and meta-description tags?
Links:
Google Webmasters Blog post about Google's support for meta tags;
Video with Matt Cutts about Google's support for meta tags;
StatCounter search engine usage stats: Google handles about 92% of all search queries in the world.
No; we should carry on using meta tags, because we are not, and should not be, supporting only Google. The approach should be: make documents as indexable as possible in a search-engine-agnostic way, and then add special handling for one or two top engines, such as using Google's online tools to improve search ranking.
Google is very dominant in search at present, but there is no guarantee it will always be on top. Maybe it will be Facebook in the future, or perhaps Yahoo/Bing and the like will dramatically improve their search quality and people will switch back.
Side note: for search, I really like DuckDuckGo at the moment. Lots of nice search shortcuts (see bang operators) and a meaningful privacy policy.
We should use them because they are there. Who knows - perhaps they will be useful again in the future?
When the W3C drops them, we can stop using them.
Just my opinion ofc...
keywords:
Google is not the only search engine. Google's market share depends on many factors (country, age, technical know-how, …). Small percentages of big numbers are still big numbers.
There are special-purpose search engines (for niches; only crawling hand-selected sites; etc.) that might still consider it.
Local search engines might use it; (local) full-text search engines do anyway.
Some CMS use it for site search.
There are consuming user-agents other than search engines, e.g. parsers/extractors.
description:
It can be useful even for Google: when someone searches only for the title/domain of your site, Google would often display snippets like "Login / Register … back to top … please insert CAPTCHA …" etc. If a description is provided, it can be used instead.
(The points mentioned under keywords are relevant for description, too.)
If Google SEO is your only concern, then meta keywords are a complete waste of time, but if you're targeting other search engines they may be worth investigating.
I believe Baidu still uses meta keywords, and that search engine is the dominant player in the Chinese market, so it'd be worth adding meta keywords if you want your site to be popular in China.
Regardless, I wouldn't go stuffing excessive numbers of irrelevant keywords in, as there is every chance that whatever search engine you're targeting will penalise you. 5-7 words summarising your page content is a good starting point.
In each Google search one can see the number of pages found.
Is it possible to find out what this number was in the past?
The best thing you can try is Google Zeitgeist. It doesn't give the number of pages found, but it gives insights into the popularity of keywords, for example "stackoverflow". You can also check the popularity of a website, for example stackoverflow.com.
No other Google service comes to mind for your particular purpose.
No, this number is not available. You could run the query and parse the Google results to capture the information, but read the terms of use before you try to do that; in most cases doing something like that will be against them.
I need to develop a vertical search engine as part of a website. The data for the search engine comes from websites in a specific category. I guess for this I need a crawler that crawls several (a few hundred) sites in a specific business category and extracts the content and URLs of products and services; other types of pages may be irrelevant. Most of the sites are tiny or small (a few hundred pages at most). The products have 10 to 30 attributes.
Any ideas on how to write such a crawler and extractor? I have written a few crawlers and content extractors using the usual Ruby libraries, but not a full-fledged search engine. I guess the crawler wakes up from time to time and downloads the pages from the websites, following the usual polite behavior like checking robots exclusion rules, of course, while the content extractor updates the database after it reads the pages. How do I synchronize the crawler and the extractor? How tightly should they be integrated?
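To make the synchronization question concrete, here is a minimal sketch of one loosely coupled design: the crawler and the extractor are separate workers that only share a bounded queue, so neither needs to know the other's internals. It is in Python for brevity (Ruby's Queue and Thread give the same shape), and the seed URL and the extraction step are placeholders:

```python
import queue
import threading

import requests

page_queue = queue.Queue(maxsize=100)  # bounded buffer between the two stages
SEEDS = ["https://example.com/"]       # placeholder seed list

def crawler():
    """Fetch pages and hand the raw HTML to the extractor via the queue."""
    for url in SEEDS:  # a real crawler would also follow links and obey robots.txt
        resp = requests.get(url, timeout=10)
        page_queue.put((url, resp.text))
    page_queue.put(None)  # sentinel: tells the extractor the crawl is done

def extractor():
    """Consume fetched pages, parse them, and update the product database."""
    while True:
        item = page_queue.get()
        if item is None:
            break
        url, html = item
        # Placeholder: parse product attributes out of `html` and upsert to the DB.
        print(f"extracted {len(html)} bytes from {url}")

workers = [threading.Thread(target=crawler), threading.Thread(target=extractor)]
for w in workers:
    w.start()
for w in workers:
    w.join()
```

The bounded queue is the whole synchronization story: the crawler blocks when the extractor falls behind, and the two stages can be scheduled or scaled independently.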
Nutch builds on Lucene and already implements a crawler and several document parsers.
You can also hook it to Hadoop for scalability.
In the enterprise-search context that I am used to working in,
crawlers,
content extractors,
search engine indexes (and the loading of your content into these indexes),
being able to query that data efficiently and with a wide range of search operators,
programmatic interfaces to all of these layers,
optionally, user-facing GUIs
are all separate topics.
(For example, while extracting useful information from an HTML page vs. a PDF vs. an MS Word file is conceptually similar, the actual programming for these tasks is still very much a work in progress for any general solution.)
You might want to look at the Lucene suite of open-source tools, understand how those fit together, and possibly decide that it would be better to learn how to use those tools (or similar ones) than to reinvent the very big, complicated wheel.
I believe in books, so thanks to your query I have discovered this book and have just ordereded it. It looks like a good take on one possible solution to the search-tool conundrum.
http://www.amazon.com/Building-Search-Applications-Lucene-LingPipe/product-reviews/0615204252/ref=cm_cr_pr_hist_5?ie=UTF8&showViewpoints=0&filterBy=addFiveStar
Good luck and let us know what you find out and the approach you decide to take.