google bot rel="nofollow" how long to stop following - googlebot

I just added rel="nofollow" to some links.
Anyone know how long it takes for Google to stop following links after "nofollow" is added?
I added it an hour ago and still see Googlebot crawling the "nofollow" links.

It might be the case that it won't stop following your rel="nofollow" links. According to Wikipedia:
Google states that their engine takes "nofollow" literally and does not "follow" the link at all. However, experiments conducted by SEOs show conflicting results. These studies reveal that Google does follow the link, but does not index the linked-to page, unless it was in Google's index already for other reasons (such as other, non-nofollow links that point to the page).

From Google's Webmaster Central:
Google's spiders regularly crawl the web to rebuild our index. Crawls are based on many factors such as PageRank, links to a page, and crawling constraints such as the number of parameters in a URL. Any number of factors can affect the crawl frequency of individual sites.
Our crawl process is algorithmic; computer programs determine which sites to crawl, how often, and how many pages to fetch from each site. We don't accept payment to crawl a site more frequently. For tips on maintaining a crawler-friendly website, please visit our webmaster guidelines.
Using Google Webmaster Tools, you can see when Google last crawled your website. If the links are still showing up in searches, the results may conflict, as per Bear's post above.

That depends on many things. Google's databases are massively distributed, so it may take a while for the change to propagate. It may also take the crawler some time to revisit the page where you added the nofollows; again, this is determined by a closed Google algorithm. In the worst cases I've seen tens of days pass without the indexes getting updated; in the best case, a few minutes. The typical case is a few days. Be patient, young Jedi ;)

I am keeping track of how Google visits around 7,000 pages. Yes, it keeps following the links for a while even after I add nofollow. It will crawl the same page a couple of times before it finally stops, so it will take time.
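To make the mechanics concrete, here is a minimal sketch (not Google's actual crawler, just an illustration using Python's standard library) of how a crawler that honors rel="nofollow" might separate links while parsing a page; the HTML snippet is made up for the example.

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collects hrefs from <a> tags, separating those marked rel="nofollow"."""
    def __init__(self):
        super().__init__()
        self.follow = []    # links the crawler may follow
        self.nofollow = []  # links it should not follow

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attrs = dict(attrs)
        href = attrs.get("href")
        if not href:
            return
        rel = (attrs.get("rel") or "").lower().split()
        if "nofollow" in rel:
            self.nofollow.append(href)
        else:
            self.follow.append(href)

# Hypothetical page content, just for illustration.
html = '''
<a href="/about">About</a>
<a href="http://example.com/ad" rel="nofollow">Sponsored link</a>
'''
parser = LinkExtractor()
parser.feed(html)
print("follow:", parser.follow)      # ['/about']
print("nofollow:", parser.nofollow)  # ['http://example.com/ad']
```

Whether a real crawler fetches the nofollow URLs anyway (as the answers above suggest Google sometimes does) is a policy decision layered on top of this; the attribute itself is only a hint.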

Related

I've had Google Analytics disabled on my site for more than a month, but data keeps coming in

This entire data set should have 0 visits. I have a one-page site (Node.js) and removed the analytics code from it more than a month ago. Just tonight I took a look at the GA data on their site, and here's what I witnessed. Why are there still tons of views of my site in GA?
For curiosity's sake, I added GA back to my site two nights ago, and site visits have actually increased over the past few days, but not to the extent that New Relic reports. Any thoughts? This just seems creepy, or maybe manipulative, on Google's part.
This is most likely ghost spam. Check your referrals; you will probably have hits from free-social-buttons or guardlink. Ghost spam doesn't need your tracking code to be active; it's enough that the property exists.
Here is an example of one of my inactive accounts.
You can find more information about ghost spam in these related questions:
https://stackoverflow.com/a/28354319/3197362
https://webmasters.stackexchange.com/a/81491/49561
https://stackoverflow.com/a/29312117/3197362
https://stackoverflow.com/a/29717606/3197362
If it isn't spam, then someone is probably using your tracking code somehow, as @MrSponge mentioned.

Search engine components

I'm a middle school student learning computer programming, and I just have some questions about search engines like Google and Yahoo.
As far as I know, these search engines consist of:
Search algorithm & code
(Example: search.py file that accepts search query from the web interface and returns the search results)
Web interface for querying and showing result
Web crawler
What I am confused about is the Web crawler part.
Do Google's and Yahoo's Web crawlers immediately search through every single webpage existing on WWW? Or do they:
First download all the existing webpages on WWW, save them on their huge server, and then search through these saved pages??
If the latter is the case, then wouldn't the results shown on the Google search results page be outdated, since I suppose searching through all the webpages on the WWW would take a tremendous amount of time?
PS. One more question: Actually.. How exactly does a web crawler retrieve all the web pages existing on WWW? For example, does it search through all the possible web addresses, like www.a.com, www.b.com, www.c.com, and so on...? (although I know this can't be true)
Or is there some way to get access to all the existing webpages on world wide web?? (sorry for asking such a silly question..)
Thanks!!
The crawlers search through pages, download them, and save (parts of) them for later processing. So yes, you are right that the results that search engines return can easily be outdated. A couple of years ago they really were quite outdated. Only relatively recently did Google and others start to do more real-time searching, by collaborating with large content providers (such as Twitter) to get data from them directly and frequently, but they took real-time search offline again in July 2011. Otherwise they, for example, take note of how often a web page changes so they know which pages to crawl more often than others. And they have special systems for this, such as the Caffeine web indexing system. See also their blog post Giving you fresher, more recent search results.
So what happens is:
Crawlers retrieve pages
Backend servers process them
Parse text, tokenize it, index it for full text search
Extract links
Extract metadata such as schema.org for rich snippets
Later they do additional computation based on the extracted data, such as
Page rank computation
In parallel they can be doing lots of other stuff such as
Entity extraction for Knowledge graph information
Discovering what pages to crawl happens simply by starting with a page, following its links to other pages, following their links, and so on. In addition, they have other ways of learning about new web sites: for example, if people use their public DNS server, Google will learn about the pages those people visit, and likewise for links shared on G+, Twitter, etc.
There is no way of knowing what all the existing web pages are. Some may not be linked from anywhere, and if no one publicly shares a link to them (and doesn't use Google's DNS, etc.), there is no way of knowing those pages exist. Then there's the problem of the Deep Web. Hope this helps.
Crawling is not an easy task (for example, Yahoo now outsources crawling to Microsoft's Bing). You can read more about it in Page and Brin's own paper: The Anatomy of a Large-Scale Hypertextual Web Search Engine.
You can find more details about storage, architecture, etc. on the High Scalability website, for example: http://highscalability.com/google-architecture
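To tie the steps above together, here is a highly simplified sketch of the crawl-and-index loop: fetch a page, extract its links into the frontier, and tokenize its text into a tiny inverted index. It uses only Python's standard library and regex-based parsing; real systems distribute this across many machines and add politeness, deduplication, ranking, and much more.

```python
import re
import urllib.request
from collections import defaultdict, deque
from urllib.parse import urljoin

def crawl(seed_urls, max_pages=10):
    """Breadth-first crawl: fetch pages, extract links, build a toy inverted index."""
    frontier = deque(seed_urls)   # URLs waiting to be fetched
    seen = set(seed_urls)         # avoid queueing the same URL twice
    index = defaultdict(set)      # word -> set of URLs containing it
    fetched = 0

    while frontier and fetched < max_pages:
        url = frontier.popleft()
        try:
            html = urllib.request.urlopen(url, timeout=5).read().decode("utf-8", "replace")
        except Exception:
            continue              # skip pages that fail to download
        fetched += 1

        # "Parse text, tokenize it, index it for full text search"
        for word in re.findall(r"[a-z]{3,}", html.lower()):
            index[word].add(url)

        # "Extract links" and grow the frontier (link discovery)
        for href in re.findall(r'href="([^"]+)"', html):
            absolute = urljoin(url, href)
            if absolute.startswith("http") and absolute not in seen:
                seen.add(absolute)
                frontier.append(absolute)

    return index

# Hypothetical usage; any small site would do as a seed.
# index = crawl(["http://example.com/"])
# print(sorted(index.get("example", set())))
```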

How come google crawls some sites real time? [closed]

I posted some source code on CodePlex and, to my surprise, found that it appeared on Google within 13 hours. Also, when I made some changes to my account on CodePlex, those changes were reflected on Google within a matter of minutes. How did that happen? Does Google pay some extra attention to sites like CodePlex, Stack Overflow, etc. to make their results appear in the search results this fast? Are there some special steps I can take to make Google crawl my site somewhat faster, if not this fast?
Google prefers some sites over others. There are a lot of magic rules involved; in the case of CodePlex and Stack Overflow we can even assume that they have been manually put on some whitelist. Then Google subscribes to the RSS feed of these sites and crawls them whenever there is a new RSS post.
Example: posts on my blog are included in the index within minutes, but if I don't post for weeks, Google just passes by every week or so.
Huh?
Probably (you'd have to be an insider to know for sure) if they find enough changes from crawl to crawl, they narrow the window between crawls, until sites like popular blogs and news outlets are being crawled every few minutes.
For popular sites like stackoverflow.com the indexing occurs more often than normal; you can notice this by searching for a question that has just been asked.
It is not well known but Google relies on pigeons to rank its pages. Some pages have particularly tasty corn, which attracts the pigeons' attentions much more frequently than other pages.
Actually ... popular sites have certain feeds that they share with Google. The site updates these feeds and Google updates its index when the feed changes. For other sites that rank well, search engines crawl more often, provided there are changes. True, it's not public knowledge, and even for the popular sites there are no guarantees about when newly published data appears in the index.
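As a rough illustration of that feed-driven approach, here is a small sketch that polls a site's RSS feed and queues only the items it hasn't seen yet, instead of re-crawling the whole site. The feed URL and polling schedule are made up; Google's real pipeline is of course not public.

```python
import urllib.request
import xml.etree.ElementTree as ET

def new_items_since(feed_url, already_seen):
    """Return links from an RSS feed that have not been crawled yet."""
    xml_data = urllib.request.urlopen(feed_url, timeout=5).read()
    root = ET.fromstring(xml_data)
    fresh = []
    for item in root.iter("item"):          # standard RSS 2.0 <item> elements
        link = item.findtext("link")
        if link and link not in already_seen:
            fresh.append(link)
    return fresh

# Hypothetical usage: re-check the feed every few minutes and only
# fetch the newly announced URLs.
# already_crawled = set()
# for url in new_items_since("http://example.com/feed.rss", already_crawled):
#     already_crawled.add(url)
#     # hand the URL to the crawler immediately
```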
Real-time search is one of the newest buzzwords and battlegrounds in the search engine wars. Google's announcements and Bing's Twitter integration are good examples of this new focus on super-fresh content.
Incorporating fresh content is a real technical challenge and priority for companies like Google, since one has to crawl the documents, incorporate them into the index (which is spread across hundreds or thousands of machines), and then somehow determine whether the new content is relevant for a given query. Remember, since we are indexing brand-new documents and tweets, these things aren't going to have many inbound links, which are the typical thing that boosts PageRank.
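Since PageRank keeps coming up: here is a toy power-iteration sketch showing why a page with few inbound links starts out with a low score. The four-page link graph is invented for the example; the real computation runs over billions of pages with many refinements (dangling-node handling, personalization, spam detection, etc.).

```python
def pagerank(links, damping=0.85, iterations=50):
    """Toy power iteration over a dict: page -> list of pages it links to."""
    pages = list(links)
    rank = {p: 1.0 / len(pages) for p in pages}          # start with a uniform score
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / len(pages) for p in pages}
        for page, outlinks in links.items():
            if not outlinks:
                continue
            share = damping * rank[page] / len(outlinks)
            for target in outlinks:
                new_rank[target] += share                # each inbound link passes on rank
        rank = new_rank
    return rank

# Hypothetical graph: "new_post" was just published and has only one inbound link.
graph = {
    "home":     ["about", "blog"],
    "about":    ["home"],
    "blog":     ["home", "new_post"],
    "new_post": [],
}
print(pagerank(graph))  # "new_post" ends up with the lowest score
```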
The best way to get Google/Yahoo/Bing to crawl your site more often is to have a site with frequently updated content that gets a decent amount of traffic. (All of these companies know how popular sites are and will devote more resources to indexing sites like stackoverflow, nytimes, and amazon.)
The other thing you can do is make sure that your robots.txt isn't preventing spiders from crawling your site as much as you want, and submit a sitemap to Google/Bing-hoo so that they have a list of your URLs. But be careful what you wish for: https://blog.stackoverflow.com/2009/06/the-perfect-web-spider-storm/
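On the robots.txt point: Python's standard library can check whether a given user agent is allowed to fetch a URL, which is a quick way to verify you are not accidentally blocking spiders. The domain and paths below are placeholders.

```python
import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.set_url("http://example.com/robots.txt")   # placeholder domain
rp.read()                                     # downloads and parses the file

# Would Googlebot be allowed to crawl these paths?
for path in ("/", "/private/report.html"):
    allowed = rp.can_fetch("Googlebot", "http://example.com" + path)
    print(path, "->", "allowed" if allowed else "blocked")
```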
Well, even my own blog appears in real time (it's only PageRank 3), so it's not such a big deal, I think :)
For example, I just posted this and it appeared in Google within 37 minutes at most (maybe it was real-time, as I didn't check earlier):
http://www.google.com/search?q=rebol+cgi+hosting

What are the benefits of having an updated sitemap.xml?

The text below is from sitemaps.org. What are the benefits of doing that versus letting the crawler do its job?
Sitemaps are an easy way for webmasters to inform search engines about pages on their sites that are available for crawling. In its simplest form, a Sitemap is an XML file that lists URLs for a site along with additional metadata about each URL (when it was last updated, how often it usually changes, and how important it is, relative to other URLs in the site) so that search engines can more intelligently crawl the site.
Edit 1: I am hoping to find enough benefits to justify developing this feature. At the moment our system does not provide sitemaps dynamically, so we have to create one with a crawler, which is not a very good process.
Crawlers are "lazy" too, so if you give them a sitemap with all your site URLs in it, they are more likely to index more pages on your site.
Sitemaps also let you tell crawlers how frequently your pages change and which ones are more important to keep updated, so they don't waste time re-crawling pages that haven't changed, miss the ones that have, or index pages you don't care much about while skipping the pages you do.
There are also lots of automated tools online that you can use to crawl your entire site and generate a sitemap. If your site isn't too big (fewer than a few thousand URLs), those will work great.
Well, like that paragraph says, sitemaps also provide metadata about a given URL that a crawler may not be able to infer purely by crawling. The sitemap acts as a table of contents for the crawler so that it can prioritize content and index what matters.
The sitemap helps tell the crawler which pages are more important, and also how often they can be expected to be updated. This is information that can't really be found out just by scanning the pages themselves.
Crawlers have a limit to how many pages they scan on your site and how many levels deep they follow links. If you have a lot of less relevant pages, a lot of different URLs for the same page, or pages that take many steps to reach, the crawler will stop before it gets to the most interesting pages. The sitemap offers an alternative way to easily find the most interesting pages, without having to follow links and sort out duplicates.
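If, as in the question's edit, you need to generate the sitemap yourself, a minimal version is just an XML file in the sitemaps.org schema. Here is a small sketch that builds one from a list of URLs with the lastmod/changefreq/priority metadata discussed above; the URLs and values are placeholders.

```python
import xml.etree.ElementTree as ET

def build_sitemap(entries, path="sitemap.xml"):
    """Write a minimal sitemaps.org-style sitemap from (url, lastmod, changefreq, priority) tuples."""
    urlset = ET.Element("urlset", xmlns="http://www.sitemaps.org/schemas/sitemap/0.9")
    for loc, lastmod, changefreq, priority in entries:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = loc
        ET.SubElement(url, "lastmod").text = lastmod
        ET.SubElement(url, "changefreq").text = changefreq
        ET.SubElement(url, "priority").text = priority
    ET.ElementTree(urlset).write(path, encoding="utf-8", xml_declaration=True)

# Placeholder data: the homepage changes often and matters most,
# the archive page rarely changes and matters less.
build_sitemap([
    ("http://example.com/",        "2011-06-01", "daily",   "1.0"),
    ("http://example.com/archive", "2010-12-01", "monthly", "0.3"),
])
```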

How do i test my antispam code against bots?

I have some code and I wonder how it would stand up against bots. Is there a way I can either run a bot to check the strength of my site or set real live spam bots on it in a pre-release test? (I can use something.noip.com as a dummy domain.)
You could always just drop by some of the shadier channels on IRC, and brag about your super-secret new breakthrough in software that is able to stop 100% of all spam bots which will get you the hot babes. Be as irritating as possible, and keep poking around the area.... Eventually, you'll provoke SOMEBODY :)
You can improve the Google ranking of your site to attract more bots, and you can, as Edouard wrote, install some bots and try to "break" your tests. Although I don't think the "good" bots are downloadable for free.
I'd go for a higher Google rank and place your URL in many places on the web to raise the chance that it gets picked up.
Place it in your footer in forums, etc. Use it in your footer on high-traffic mailing lists.
But don't post there just to place your URL; this will make people angry if they notice it (and justifiably so).
You answered your own question. Set up bots and run them against your site to see if it works well.
Over time, as the popularity of your site rises, bots will spam you. You will need to keep up the race as bots get better and better.
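If you go the DIY route suggested above, a very small "dumb bot" is easy to write: it POSTs a canned payload straight at your form, filling in every field it finds, including any hidden honeypot field a human would leave blank. The form URL and field names here are made up; adjust them to your own markup.

```python
import urllib.parse
import urllib.request

# Placeholder target and field names; change them to match your own form.
FORM_URL = "http://yoursite.noip.com/comment"
PAYLOAD = {
    "name": "TotallyRealUser",
    "comment": "Buy cheap watches http://spam.example.com",
    "website": "http://spam.example.com",   # honeypot field: humans leave this empty
}

def spam_once():
    """Submit the form the way a naive spam bot would: no delay, no JavaScript."""
    data = urllib.parse.urlencode(PAYLOAD).encode("utf-8")
    request = urllib.request.Request(FORM_URL, data=data)  # POST because data is set
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status

if __name__ == "__main__":
    for _ in range(5):                       # a burst of identical submissions
        print("server answered:", spam_once())
```

If your filter is any good, these submissions should be rejected; smarter bots add realistic headers and discover forms automatically, so treat passing this test as necessary, not sufficient.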
Just publish it and make the page available to search engine spiders and other traffic. They will eventually find you!
