Why must a web crawler have robustness, politeness, scalability, quality, freshness, and extensibility?
Robustness: a web crawler must be robust to changes in web site content. Web search needs to retrieve and index every new web page as soon as possible, but if a website has only just come online, the crawler first has to work through the URLs already at the front of its frontier queue before it can focus on the new site. To handle this, crawlers are built as distributed systems in which different nodes index different sets of pages.
Politeness: a crawler must respect every web server's policies on how its pages may be crawled and re-indexed. If a web server asks that a page not be crawled aggressively, the crawler can place that page in a priority queue and re-visit it only when it reaches the front of the queue.
Scalability: new web pages are added to the internet every day, and the crawler must index every page as soon as possible. This requires fault tolerance, a distributed architecture, the ability to add extra machines, and so on. If one crawler node fails, the other nodes can divide up its work and index its share of the pages.
Quality: the search engine's ability to surface useful web pages for each user. If a page's content is far from a user's recent searches or interests, the search engine should use that user's previous behaviour to predict what kind of content they are likely to want.
Freshness: the crawler's ability to fetch and index fresh copies of each page. For example, news websites are updated every second and need to be re-indexed urgently. For this, the crawler keeps a separate priority queue for such fast-changing content, so those pages are re-indexed within a short period of time.
Extensibility: over time, new data formats, languages, and protocols are introduced. The crawler's ability to cope with new and previously unseen data formats and protocols is called extensibility. This suggests that the crawler architecture must be modular, so that changes in one module do not affect the others. If a website contains a data format unknown to the crawler, the crawler can still fetch the data, but human intervention is required to add the new format's details to the crawler's indexing module.
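As a rough sketch of the priority-queue and politeness mechanics mentioned above (Python; the URLs, priorities, and one-second delay are made-up examples, and a real frontier would rotate between hosts rather than sleep):

```python
import heapq
import time

class Frontier:
    """Minimal URL frontier: priority queue plus per-host politeness delay."""

    def __init__(self, politeness_delay=1.0):
        self.heap = []                 # (priority, seq, url) -- lower value = crawl sooner
        self.seq = 0                   # tie-breaker so heapq never compares URLs
        self.last_fetch = {}           # host -> timestamp of the last request to it
        self.politeness_delay = politeness_delay

    def add(self, url, priority=10):
        # Fast-changing pages (e.g. news front pages) get a lower number = higher priority.
        heapq.heappush(self.heap, (priority, self.seq, url))
        self.seq += 1

    def next_url(self):
        # Pop the most urgent URL, but wait if its host was contacted too recently.
        if not self.heap:
            return None
        priority, _, url = heapq.heappop(self.heap)
        host = url.split('/')[2]
        wait = self.politeness_delay - (time.time() - self.last_fetch.get(host, 0))
        if wait > 0:
            time.sleep(wait)           # a real crawler would move on to another host instead
        self.last_fetch[host] = time.time()
        return url

frontier = Frontier()
frontier.add("https://news.example.com/", priority=1)        # re-crawl often
frontier.add("https://static.example.com/about", priority=50)
print(frontier.next_url())
```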
I'm a middle school student learning computer programming, and I just have some questions about search engines like Google and Yahoo.
As far as I know, these search engines consist of:
Search algorithm & code
(Example: search.py file that accepts search query from the web interface and returns the search results)
Web interface for querying and showing result
Web crawler
What I am confused about is the Web crawler part.
Do Google's and Yahoo's web crawlers immediately search through every single webpage existing on the WWW? Or do they:
First download all the existing webpages on the WWW, save them on their huge servers, and then search through these saved pages?
If the latter is the case, then wouldn't the results appearing on the Google search page be outdated, since I suppose searching through all the webpages on the WWW would take a tremendous amount of time?
PS. One more question: Actually.. How exactly does a web crawler retrieve all the web pages existing on WWW? For example, does it search through all the possible web addresses, like www.a.com, www.b.com, www.c.com, and so on...? (although I know this can't be true)
Or is there some way to get access to all the existing webpages on world wide web?? (sorry for asking such a silly question..)
Thanks!!
The crawlers search through pages, download them, and save (parts of) them for later processing. So yes, you are right that the results a search engine returns can easily be outdated, and a couple of years ago they really were quite outdated. Only relatively recently did Google and others start doing more real-time searching by collaborating with large content providers (such as Twitter) to get data from them directly and frequently, though they took that real-time search offline again in July 2011. Otherwise, they take note of how often a web page changes, so they know which pages to crawl more often than others. And they have special systems for this, such as the Caffeine web indexing system. See also their blog post Giving you fresher, more recent search results.
So what happens is:
Crawlers retrieve pages
Backend servers process them
Parse text, tokenize it, index it for full text search
Extract links
Extract metadata such as schema.org for rich snippets
Later they do additional computation based on the extracted data, such as
Page rank computation
In parallel they can be doing lots of other stuff such as
Entity extraction for Knowledge graph information
Discovering what pages to crawl happens simply by starting with a page, following its links to other pages, following their links, and so on. In addition, search engines have other ways of learning about new websites: for example, if people use their public DNS server, they learn about pages those people visit, and the same goes for links shared on G+, Twitter, etc.
There is no way of knowing what all the existing web pages are. There may be some that are not linked from anywhere and that no one publicly shares a link to (and whose visitors don't use Google's DNS, etc.), so search engines have no way of knowing those pages exist. Then there's the problem of the Deep Web. Hope this helps.
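To make the fetch / parse / extract-links loop described above concrete, here is a toy sketch in Python (using the widely available `requests` and `beautifulsoup4` packages; the seed URL and the 20-page budget are arbitrary, and everything a real engine adds, such as politeness, robots.txt handling, deduplication, and distributed queues, is left out):

```python
# A toy crawler: start from a seed page, follow links, and build a tiny inverted index.
from collections import defaultdict
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

seed = "https://example.com/"        # hypothetical starting point
to_visit = [seed]
visited = set()
inverted_index = defaultdict(set)    # word -> set of URLs containing it

while to_visit and len(visited) < 20:        # small page budget for the demo
    url = to_visit.pop(0)
    if url in visited:
        continue
    visited.add(url)
    try:
        html = requests.get(url, timeout=5).text
    except requests.RequestException:
        continue

    soup = BeautifulSoup(html, "html.parser")

    # "Parse text, tokenize it, index it for full text search"
    for word in soup.get_text().lower().split():
        inverted_index[word].add(url)

    # "Extract links" -- this is how new pages are discovered
    for a in soup.find_all("a", href=True):
        to_visit.append(urljoin(url, a["href"]))

# Answering a query is then just a lookup, e.g. pages containing the word "contact":
print(inverted_index.get("contact", set()))
```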
Crawling is not an easy task (for example, Yahoo is now outsourcing crawling to Microsoft's Bing). You can read more about it in Brin and Page's own paper: The Anatomy of a Large-Scale Hypertextual Web Search Engine.
You can find more details about storage, architecture, etc. on the High Scalability website, for example: http://highscalability.com/google-architecture
I'm the IT Manager at a mid-size manufacturing company. We are getting our feet wet with SharePoint; so far we've got one blog in production use, and it's the CEO's.
We have use cases for a couple of list-based "applications" with some simple workflow that will be implemented by one of our developers. We also want to give our users (at least the more tech-savvy ones) the ability to create and work with their own departmental sites.
We're concerned, however, that we might be starting something that could quickly get out of control if it's widely adopted (which would be a good thing). Since we don't really understand all the architectural trade-offs, we could end up with massive amounts of user data in a structure that bites us down the road.
Our biggest question is whether to have multiple sites for each use vs. a single root site from which everything else descends. Multiple sites would give us flexibility to make changes or develop new features without creating problems for all the users. However, multiple sites might be harder to back-up, search, and maintain user profiles/security. A single massive site seems to reverse the cost/benefits.
I'd appreciate any insight on the one vs. many trade-offs, or links to resources that discuss it. Links to general SharePoint "enterprise best practices" (sorry) would also be appreciated.
Thanks.
"However, multiple sites might be harder to back-up, search, and maintain user profiles/security. A single massive site seems to reverse the cost/benefits."
I would consider this incorrect. First we need to clarify what we mean by multiple sites: multiple site collections or multiple sites? They are two entirely different things.
Even if you have multiple different site collections, in SQL Server they are still just one database, since the content database is created at the web application level and not at the site level. That covers backup.
Coming to search and user profiles, your assumption is again wrong. Search and User Profiles are Shared Services, and they work fine as long as they reside in a single Shared Services Provider; both are farm-level services.
A single massive site (if you really mean site here, not site collection) is a complete no-no and a bad design.
I would recommend having multiple site collections (one per top-level department in your company, like HR, Finance, IT) and then having subsites under each of them. This way you have one database to manage in SQL, and you can still scale by adding content databases to the existing web application.
Again, I assume you are creating your topology at the company level; if it is at some lower level, it will need to be refined.
Read some articles on taxonomy and site architecture on TechNet before going ahead with any one approach.
Planning worksheets for SharePoint Server 2010
http://technet.microsoft.com/en-us/library/cc262451.aspx
Plan sites and site collections
http://technet.microsoft.com/en-us/library/cc263267.aspx
Sites and site collections overview
http://technet.microsoft.com/en-us/library/cc262410.aspx
Plan site navigation
http://technet.microsoft.com/en-us/library/cc262951.aspx
It purely depends on your needs and requirements. In favour of having different web applications for different sites, I can offer one argument with backup as the advantage: you might have a few sites whose data does not change frequently, such as organisational policies or process documents. In that case, taking regular backups and running frequent search crawls does not make much sense (you can opt for differential backups and incremental crawls, but you still have to take a full backup weekly or fortnightly). Hence I would suggest carefully analysing your requirements before taking a decision. Microsoft has provided a good set of checklists and templates for planning purposes; a few of the links are provided in Madhur's reply, and the rest you can Google.
I have this problem: I have a web page with adult content, and for the past several months I have had PPC advertisements on it. I've noticed a big difference between the ad company's statistics for my page, the Google Analytics data, and the Awstats data on my server.
For example, the ad company tells me I have 10K pageviews per day, Google Analytics tells me I have 15K pageviews, and Awstats says around 13K pageviews. Which system should I trust? Should I write my own (and reinvent the wheel again)? If so, how? :)
The funny thing is that I have another web page with "normal" content (an MMORPG fan site), and those numbers are more or less equal in all three systems (ad company, GA, Awstats). Do you think it's because it's not an adult-oriented page?
And a final question that is totally off-topic: do you know of an ad company that pays per impression and doesn't mind adult sites?
Thanks for the answers!
First, you should make sure not to mix up »hits«, »files«, »visits«, and »unique visits«. They all have different meanings and are sometimes called different things. I recommend looking up some definitions if you are confused about the terms.
Awstats probably has the most accurate statistics, because it has access to the access.log from the web server. Unfortunately, a cached page (cached by the browser, by an ISP's proxy, or by your own caching server) might not produce a hit on the web server at all. Especially if your site is served with good caching hints that don't force revalidation, and you are running your own web cache (e.g. Squid) in front of your site, the number will be considerably lower, because it only measures the work done by the web server.
On the other hand, Google Analytics can only count requests from users who haven't blocked Google Analytics and who have JavaScript enabled (but it will count pages served by a web cache). So this count can be influenced by the user, but isn't affected by web caches.
The ad company is probably simply counting the number of requests they get from your site (likely based on their own access.log). So, to get counted there, the ad must not be cached and must not be blocked by the user.
So, as you can see, it's not that easy to get a single correct value. But as long as you use the measured values in comparison to those from the previous months, you should get at least a (nearly) correct rate of growth.
And your porn site probably serves a large amount of static content (e.g. images from disk), and most web servers are really good at automatically serving caching hints for static files. Your MMORPG site, on the other hand, probably consists mostly of dynamic scripts (PHP?) which don't send any caching hints at all, and web servers can't determine caching headers for dynamic content automatically. That's at least my explanation, without knowing your application and server configuration :)
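If you want to sanity-check the Awstats numbers yourself, a rough sketch like this one can count pageviews straight from the access.log (Python; it assumes the common combined log format, counts only successful GET requests, and treats anything that looks like an image, stylesheet, or script as a non-pageview, so the file path and filters are assumptions to adapt):

```python
# Count "pageviews" from a web server access.log: only successful GET
# requests for pages, ignoring static assets that browsers often cache.
import re
from collections import Counter

STATIC = re.compile(r"\.(jpe?g|png|gif|css|js|ico)(\?|$)", re.IGNORECASE)
# Typical combined-log line: IP - - [date] "GET /path HTTP/1.1" 200 1234 "ref" "agent"
LINE = re.compile(r'"GET (?P<path>\S+) HTTP/[\d.]+" (?P<status>\d{3})')

pageviews = Counter()
with open("access.log") as log:              # log path is an assumption
    for line in log:
        m = LINE.search(line)
        if m and m.group("status") == "200" and not STATIC.search(m.group("path")):
            pageviews[m.group("path")] += 1

print("total pageviews:", sum(pageviews.values()))
print("top pages:", pageviews.most_common(5))
```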
The text below is from sitemaps.org. What are the benefits of providing a sitemap versus just letting the crawler do its job?
Sitemaps are an easy way for webmasters to inform search engines about pages on their sites that are available for crawling. In its simplest form, a Sitemap is an XML file that lists URLs for a site along with additional metadata about each URL (when it was last updated, how often it usually changes, and how important it is, relative to other URLs in the site) so that search engines can more intelligently crawl the site.
Edit 1: I am hoping to find enough benefits to justify the development of that feature. At the moment our system does not generate sitemaps dynamically, so we have to create one with a crawler, which is not a very good process.
Crawlers are "lazy" too, so if you give them a sitemap with all your site URLs in it, they are more likely to index more pages on your site.
They also give you the ability to prioritize your pages so the crawlers know how frequently they change, which ones are more important to keep updated, etc. so they don't waste their time crawling pages that haven't changed, missing ones that do, or indexing pages you don't care much about (and missing pages that you do).
There are also lots of automated tools online that you can use to crawl your entire site and generate a sitemap. If your site isn't too big (less than a few thousand urls) those will work great.
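If your URLs already live in a database or route table, generating the sitemap yourself is not much work either. A minimal sketch (Python standard library only; the URLs, dates, change frequencies, and priorities below are placeholders):

```python
# Write a minimal sitemap.xml from a list of (url, lastmod, changefreq, priority)
# tuples -- in a real system these would come from your CMS or database.
import xml.etree.ElementTree as ET

pages = [
    ("https://www.example.com/",         "2011-10-01", "daily",  "1.0"),
    ("https://www.example.com/products", "2011-09-20", "weekly", "0.8"),
    ("https://www.example.com/terms",    "2010-01-15", "yearly", "0.2"),
]

urlset = ET.Element("urlset", xmlns="http://www.sitemaps.org/schemas/sitemap/0.9")
for loc, lastmod, changefreq, priority in pages:
    url = ET.SubElement(urlset, "url")
    ET.SubElement(url, "loc").text = loc
    ET.SubElement(url, "lastmod").text = lastmod
    ET.SubElement(url, "changefreq").text = changefreq
    ET.SubElement(url, "priority").text = priority

ET.ElementTree(urlset).write("sitemap.xml", encoding="utf-8", xml_declaration=True)
```

Hooking this up to whatever already knows your URLs (routes, database, CMS) is usually simpler and more accurate than crawling your own site to build the file.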
Well, like that paragraph says, sitemaps also provide metadata about a given URL that a crawler may not be able to work out purely by crawling. The sitemap acts as a table of contents for the crawler so that it can prioritize content and index what matters.
The sitemap helps tell the crawler which pages are more important, and also how often they can be expected to be updated. This is information that really can't be found out just by scanning the pages themselves.
Crawlers have limits on how many pages of your site they scan and how many levels deep they follow links. If you have a lot of less relevant pages, a lot of different URLs for the same page, or pages that take many steps to reach, the crawler may stop before it gets to the most interesting pages. The sitemap offers an alternative way to easily find the most interesting pages, without having to follow links and sort out duplicates.
I have the following situation:
MOSS 2007 Server Environment A -> Intranet
MOSS 2007 Server Environment B -> Collaboration Environment (approx. 150 site collections for various issues)
Both environments are on different infrastructures but we use the same Active Directory and the same groups. Now we would like to implement the following 2 things:
An overview page within the intranet listing all available site collections on environment B.
An overview page within the intranet listing only those site collections the user has access to.
Now I'm looking for good ideas on the best way to realise something like this.
Thanks in advance for any response.
The main thing to be careful of in a solution like this is performance, particularly for your second requirement. That would require looping through every site collection and retrieving permission data, either using the web services or the object model.
I would recommend writing a custom timer job (or two, one for each requirement, if that makes more sense) to execute at a low-traffic time and aggregate this information into a custom SQL database. If there is never a low-traffic window, then space out your requests to reduce the impact on the server.
A custom web part (or again, two if more appropriate) can then be deployed to both environments. The web part would query the database for the required information and display it to the user.
If the timer job needs to update this data more frequently then you would need to implement some sort of in-memory caching. Depending on your requirements this may need a lot of memory.
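A real MOSS 2007 timer job and web part would of course be written in C# against the SharePoint object model, but the general pattern described above, precomputing the expensive permission lookups on a schedule, storing the snapshot, and letting the overview page do only a cheap read, can be sketched roughly like this (Python with SQLite; the site-collection and permission lookups are hypothetical placeholders, not real SharePoint APIs):

```python
# Pattern sketch: a scheduled job aggregates "who can see which site collection"
# into a small database, so the overview page only performs a cheap lookup.
import sqlite3

def enumerate_site_collections():
    # Placeholder: in MOSS this would come from the object model or web services.
    return ["http://collab/sites/hr", "http://collab/sites/finance"]

def users_with_access(site_url):
    # Placeholder: the expensive permission query, done only inside the scheduled job.
    return ["DOMAIN\\alice", "DOMAIN\\bob"]

def run_timer_job(db_path="site_access.db"):
    con = sqlite3.connect(db_path)
    con.execute("CREATE TABLE IF NOT EXISTS access (user TEXT, site TEXT)")
    con.execute("DELETE FROM access")                      # rebuild the snapshot
    for site in enumerate_site_collections():
        for user in users_with_access(site):
            con.execute("INSERT INTO access VALUES (?, ?)", (user, site))
    con.commit()
    con.close()

def sites_for_user(user, db_path="site_access.db"):
    # What the overview web part would do on each page load: a cheap read.
    con = sqlite3.connect(db_path)
    rows = con.execute("SELECT site FROM access WHERE user = ?", (user,)).fetchall()
    con.close()
    return [site for (site,) in rows]

run_timer_job()
print(sites_for_user("DOMAIN\\alice"))
```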