The text below is from sitemaps.org. What are the benefits of providing a sitemap versus just letting crawlers do their job?
Sitemaps are an easy way for webmasters to inform search engines about pages on their sites that are available for crawling. In its simplest form, a Sitemap is an XML file that lists URLs for a site along with additional metadata about each URL (when it was last updated, how often it usually changes, and how important it is, relative to other URLs in the site) so that search engines can more intelligently crawl the site.
Edit 1: I am hoping to find enough benefits to justify developing that feature. At the moment our system does not provide sitemaps dynamically, so we have to create one with a crawler, which is not a very good process.
Crawlers are "lazy" too, so if you give them a sitemap with all your site URLs in it, they are more likely to index more pages on your site.
They also give you the ability to prioritize your pages, so the crawlers know how frequently each one changes and which ones are more important to keep updated. That way they don't waste their time crawling pages that haven't changed or indexing pages you don't care much about, while missing the pages that you do.
There are also lots of automated tools online that you can use to crawl your entire site and generate a sitemap. If your site isn't too big (less than a few thousand URLs), those will work great.
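To make the sitemap metadata mentioned above concrete, here is a minimal sketch of generating a sitemap.xml dynamically with Python's standard library. The URLs, dates, change frequencies, and priorities are made up for illustration; only the urlset namespace comes from the sitemaps.org protocol.

    # Minimal sketch: write a sitemap.xml from a list of (hypothetical) pages.
    from xml.sax.saxutils import escape

    # Each entry: (URL, last modified, expected change frequency, priority)
    pages = [
        ("https://www.example.com/", "2013-10-01", "daily", "1.0"),
        ("https://www.example.com/products", "2013-09-20", "weekly", "0.8"),
        ("https://www.example.com/about", "2013-01-15", "yearly", "0.3"),
    ]

    entries = []
    for loc, lastmod, changefreq, priority in pages:
        entries.append(
            "  <url>\n"
            f"    <loc>{escape(loc)}</loc>\n"
            f"    <lastmod>{lastmod}</lastmod>\n"
            f"    <changefreq>{changefreq}</changefreq>\n"
            f"    <priority>{priority}</priority>\n"
            "  </url>"
        )

    sitemap = (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n'
        + "\n".join(entries)
        + "\n</urlset>\n"
    )

    with open("sitemap.xml", "w", encoding="utf-8") as f:
        f.write(sitemap)

In a dynamic setup, the pages list would come from your CMS or database instead of being hard-coded.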
Well, like that paragraph says, sitemaps also provide metadata about a given URL that a crawler may not be able to infer purely by crawling. The sitemap acts as a table of contents for the crawler so that it can prioritize content and index what matters.
The sitemap helps tell the crawler which pages are more important, and also how often they can be expected to change. This is information that really can't be found out just by scanning the pages themselves.
Crawlers have a limit on how many pages of your site they scan, and how many levels deep they follow links. If you have a lot of less relevant pages, a lot of different URLs pointing to the same page, or pages that need many steps to get to, the crawler will stop before it reaches the most interesting pages. The sitemap offers an alternative way to easily find the most interesting pages, without having to follow links and sort out duplicates.
Related
I plan to set up Nutch 2.2.x in such a way that, after an initial crawl of a list of sites, I launch the crawler daily and get the HTML or plain text of only the new pages that appeared on those sites that day. Number of sites: hundreds.
Please note that I'm not interested in updated pages, only new ones. Also, I need new pages only starting from a certain date; let's suppose it is the date of the initial crawl.
Reading the documentation and searching the web, I came up with the following questions that I can't find answered anywhere else:
Which backend should I use with Nutch for this task? I need a page's text only once and then never return to it. MySQL doesn't seem to be an option, as it is no longer supported by Gora. I tried HBase, but it seems I would have to roll back to Nutch 2.1.x to get it working correctly. What are your ideas? How can I minimize disk space and other resource utilization?
Can I perform this task without an indexing engine like Solr? I'm not sure I need to store large full-text indexes. Can Nutch 2.2+ be launched without Solr, and does it need specific options to run that way? The tutorials don't clearly explain this: everybody seems to need Solr except me.
If I'd like to add a site to the crawl list, what is the best way to do it? Suppose I am already crawling a list of sites and want to add a new site to the list and monitor it from now on. I would need to crawl the new site while skipping page content, just to add it to the WebDB, and then run the daily crawl as usual. With Nutch 1.x it may be possible to perform separate crawls and then merge them; what would that look like for Nutch 2.x?
Can this task be done without custom plugins, and can it be done with Nutch at all? I could probably write a custom plugin that somehow detects whether a page is already indexed or is new, and then writes the content to XML, a database, etc. Should I write the plugin at all, or is there a way to solve the task with less effort? And what might the plugin's algorithm look like, if there is no way to get by without it?
P.S. There are a lot of Nutch questions/answers/tutorials, and I honestly searched the web for two weeks, but haven't found answers to the questions above.
I'm not using Solr either. I just checked this documentation: https://today.java.net/pub/a/today/2006/01/10/introduction-to-nutch-1.html
It seems like there are command-line tools that can show the data fetched using the WebDB. I'm new to Nutch, but I just follow this documentation. Check it out.
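For the "new pages only" requirement, whatever backend or plugin you end up with, the core idea can be sketched as a post-processing step outside Nutch: keep a persistent set of URLs you have already exported, and emit only records whose URL is new and whose fetch date is on or after the initial crawl date. Here is a rough Python sketch; the dump format (one JSON record per line with url, fetched, and text fields) and the file names are hypothetical:

    # Rough sketch: keep only pages that are new since the initial crawl date.
    import json
    from datetime import datetime, date

    SEEN_FILE = "seen_urls.txt"          # URLs already exported on earlier days
    INITIAL_CRAWL_DATE = date(2013, 10, 1)

    def load_seen():
        try:
            with open(SEEN_FILE, encoding="utf-8") as f:
                return {line.strip() for line in f if line.strip()}
        except FileNotFoundError:
            return set()

    def export_new_pages(dump_path, out_path):
        seen = load_seen()
        new_urls = []
        with open(dump_path, encoding="utf-8") as dump, \
             open(out_path, "w", encoding="utf-8") as out:
            for line in dump:
                record = json.loads(line)    # {"url": ..., "fetched": ..., "text": ...}
                url = record["url"]
                fetched = datetime.fromisoformat(record["fetched"]).date()
                if url in seen or fetched < INITIAL_CRAWL_DATE:
                    continue                 # already exported or too old: skip
                out.write(json.dumps(record) + "\n")
                seen.add(url)                # also skip duplicates within this dump
                new_urls.append(url)
        with open(SEEN_FILE, "a", encoding="utf-8") as f:
            for url in new_urls:
                f.write(url + "\n")

    export_new_pages("todays_dump.jsonl", "new_pages.jsonl")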
I'm a middle school student learning computer programming, and I just have some questions about search engines like Google and Yahoo.
As far as I know, these search engines consist of:
Search algorithm & code
(for example, a search.py file that accepts a search query from the web interface and returns the search results)
Web interface for querying and showing results
Web crawler
What I am confused about is the Web crawler part.
Do Google's and Yahoo's web crawlers immediately search through every single webpage existing on the WWW? Or do they:
First download all the existing webpages on the WWW, save them on their huge servers, and then search through these saved pages?
If the latter is the case, then wouldn't the results appearing in Google's search results be outdated, since I suppose searching through all the webpages on the WWW would take a tremendous amount of time?
P.S. One more question: how exactly does a web crawler retrieve all the web pages existing on the WWW? For example, does it try every possible web address, like www.a.com, www.b.com, www.c.com, and so on? (Although I know this can't be true.)
Or is there some way to get access to all the existing webpages on the World Wide Web? (Sorry for asking such a silly question.)
Thanks!!
The crawlers search through pages, download them, and save (parts of) them for later processing. So yes, you are right that the results that search engines return can easily be outdated. And a couple of years ago they really were quite outdated. Only relatively recently did Google and others start to do more real-time searching by collaborating with large content providers (such as Twitter) to get data from them directly and frequently, but they took real-time search offline again in July 2011. Otherwise they, for example, take note of how often a web page changes, so they know which ones to crawl more often than others. And they have special systems for this, such as the Caffeine web indexing system. See also their blog post Giving you fresher, more recent search results.
So what happens is:
Crawlers retrieve pages
Backend servers process them
Parse text, tokenize it, index it for full text search (see the sketch after this list)
Extract links
Extract metadata such as schema.org for rich snippets
Later they do additional computation based on the extracted data, such as
PageRank computation
In parallel they can be doing lots of other stuff such as
Entity extraction for Knowledge graph information
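To give a feel for what "tokenize it, index it for full text search" means, here is a toy sketch of an inverted index. The documents are made up, and real systems are vastly more sophisticated (ranking, stemming, sharding, and so on):

    # Toy inverted index: map each token to the set of documents containing it.
    import re
    from collections import defaultdict

    documents = {
        "page1": "Sitemaps help search engines crawl a site",
        "page2": "Crawlers follow links between pages",
        "page3": "Search engines rank pages by many signals",
    }

    index = defaultdict(set)
    for doc_id, text in documents.items():
        for token in re.findall(r"[a-z0-9]+", text.lower()):   # crude tokenizer
            index[token].add(doc_id)

    def search(query):
        """Return documents containing every token in the query (AND search)."""
        tokens = re.findall(r"[a-z0-9]+", query.lower())
        if not tokens:
            return set()
        result = set(index.get(tokens[0], set()))
        for token in tokens[1:]:
            result &= index.get(token, set())
        return result

    print(search("search engines"))   # -> {'page1', 'page3'}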
Discovering which pages to crawl happens simply by starting with a page, following its links to other pages, then following their links, and so on. In addition to that, they have other ways of learning about new websites: for example, if people use their public DNS server, they will learn about the pages those people visit, and sharing links on G+, Twitter, etc. helps too.
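The "start with a page and follow its links" part is essentially a breadth-first traversal of the link graph. Here is a toy sketch with a hard-coded, hypothetical link graph; a real crawler fetches pages over the network and also handles robots.txt, politeness delays, duplicates, and much more:

    # Toy breadth-first discovery over a hard-coded link graph.
    from collections import deque

    # Hypothetical link graph: page -> pages it links to.
    links = {
        "a.com": ["a.com/about", "b.com"],
        "a.com/about": ["a.com"],
        "b.com": ["c.com", "a.com"],
        "c.com": [],
    }

    def discover(seed):
        seen = {seed}
        frontier = deque([seed])
        while frontier:
            page = frontier.popleft()
            for target in links.get(page, []):
                if target not in seen:        # only enqueue unseen pages
                    seen.add(target)
                    frontier.append(target)
        return seen

    print(discover("a.com"))   # -> {'a.com', 'a.com/about', 'b.com', 'c.com'}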
There is no way of knowing what all the existing web pages are. There may be some that are not linked from anywhere and that no one publicly shares a link to (and whose visitors don't use their DNS, etc.), so they have no way of knowing these pages exist. Then there's the problem of the Deep Web. Hope this helps.
Crawling is not an easy task (for example, Yahoo is now outsourcing crawling via Microsoft's Bing). You can read more about it in Page and Brin's own paper: The Anatomy of a Large-Scale Hypertextual Web Search Engine.
You can find more details about storage, architecture, etc. on the High Scalability website, for example: http://highscalability.com/google-architecture
This issue just did not make sense to me; perhaps someone smart can help?
Why is it that many hosts do not allow you to use .htaccess or server-side includes?
Specifically, a lot of the free hosting plans do not allow this. What are some situations they have in mind that make them disable these features?
Income - you're not paying anything; the only money they get is from advertisements.
Resources - these features take up additional resources. Offering only the essentials reduces the amount of risk and maintenance.
Agreement violations / unethical practices - it may reduce the number of people signing up for free accounts to redirect websites, rename them, hog resources, etc.
That's just what comes to mind.
I am new to this as well, but one thing you can do with SSI is make your web page design a lot easier to maintain.
You can separate headers, footers, and nav divisions (or any HTML elements that appear on every page of your site) into their own HTML files. This means that when you add a page, or change one of those elements, you only have to change it once, and then make sure it is pulled in with an SSI include command at the correct spot on every page, nearly eliminating the chance of mistakes when putting the same element on every page.
SSI basically writes the referenced HTML file directly into the page wherever the include statement (for example, <!--#include virtual="/header.html" -->) sits.
I know a lot less about .htaccess because I haven't played with it yet, but you can change which pages the server looks for SSI code in by using different configuration directives. It also looks like you can enable SSI there (typically with something like Options +Includes).
I know there is a lot more to all this than I wrote here, but hopefully this gives you an idea.
Here is a website that probably explains this better: http://www.perlscriptsjavascripts.com/tutorials/howto/ssi.html
I just added rel="nofollow" to some links.
Anyone know how long it takes for google to stop following after "nofollow" is added to a link?
I added it an hour ago and still see them crawling the "nofollow" links.
It might be the case that it won't stop following your rel="nofollow" links. According to Wikipedia:
Google states that their engine takes "nofollow" literally and does not "follow" the link at all. However, experiments conducted by SEOs show conflicting results. These studies reveal that Google does follow the link, but does not index the linked-to page, unless it was in Google's index already for other reasons (such as other, non-nofollow links that point to the page).
From Google's Webmaster Central:
Google's spiders regularly crawl the web to rebuild our index. Crawls are based on many factors such as PageRank, links to a page, and crawling constraints such as the number of parameters in a URL. Any number of factors can affect the crawl frequency of individual sites.
Our crawl process is algorithmic; computer programs determine which sites to crawl, how often, and how many pages to fetch from each site. We don't accept payment to crawl a site more frequently. For tips on maintaining a crawler-friendly website, please visit our webmaster guidelines.
Using Google Webmaster Tools, you can see the last time Google crawled your website; if the links are still showing in the search results, they may be conflicting, as per Bears' post.
That depends on many things. Google's databases are massively distributed, so it may take a while for the change to propagate. Also, it may take the crawler some time to revisit the page where you added the nofollows; again, this is determined by some closed Google algorithm. In the worst cases, I've seen tens of days without the indexes getting updated; in the best case, a few minutes. Typically it would take a few days. Be patient, young Jedi ;)
I am keeping track of how Google visits around 7,000 pages. Yes, it keeps following the pages for a while even after I add the nofollow attribute. It will crawl the same page a couple of times before it finally drops it, so it will take time.
I have a website that I now support and need to list all live pages/URLs.
Is there a crawler I can point at my homepage that will list all the pages/URLs it finds?
Then I can delete any that don't make their way into this listing, as they will be orphan pages/URLs that have never been cleaned up.
I am using DNN and want to kill unneeded pages.
Since you're using a database-driven CMS, you should be able to do this either via the DNN admin interface or by looking directly in the database. Far more reliable than a crawler.
Back in the old days I used wget for this exact purpose, using its recursive retrieval functionality. It might not be the most efficient way, but it was definitely effective. YMMV, of course, since some sites will return a lot more content than others.
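If you'd rather script it than use wget, a small crawler built on Python's standard library can list same-site URLs. This is a rough sketch: the start URL is hypothetical, and it ignores robots.txt, JavaScript-generated links, and politeness delays, so results will vary just like with wget:

    # Rough sketch: crawl one site and print every same-site URL found.
    from collections import deque
    from html.parser import HTMLParser
    from urllib.parse import urljoin, urldefrag, urlparse
    from urllib.request import urlopen

    START = "http://www.example.com/"        # hypothetical start page
    HOST = urlparse(START).netloc

    class LinkCollector(HTMLParser):
        def __init__(self):
            super().__init__()
            self.links = []

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        self.links.append(value)

    seen = {START}
    frontier = deque([START])
    while frontier:
        url = frontier.popleft()
        try:
            with urlopen(url, timeout=10) as resp:
                if "text/html" not in resp.headers.get("Content-Type", ""):
                    continue
                html = resp.read().decode("utf-8", errors="replace")
        except OSError:
            continue                          # skip pages that fail to load
        parser = LinkCollector()
        parser.feed(html)
        for href in parser.links:
            absolute, _ = urldefrag(urljoin(url, href))   # resolve, drop #fragment
            if urlparse(absolute).netloc == HOST and absolute not in seen:
                seen.add(absolute)
                frontier.append(absolute)

    for url in sorted(seen):
        print(url)

Anything the CMS knows about that never shows up in this listing is a candidate orphan page, which is exactly the comparison the question is after.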