I use Nutch and I would like to index an intranet, but how can I make sure everything on the intranet will be indexed?
Thanks.
If you know all the URLs of the intranet, then put them in a seed list (or an equivalent page listing all the URLs) and point the crawler to it.
If you don't, then you can never be sure that you have crawled all the URLs, because you cannot verify it after the crawl.
In the latter case, your best bet is to crawl at the maximum depth.
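For reference, a Nutch seed list is just a plain-text file with one start URL per line. A minimal sketch, assuming the conventional urls/seed.txt location and placeholder intranet hostnames:

http://intranet.example.com/
http://intranet.example.com/wiki/
http://intranet.example.com/docs/index.html

You point the injector at that directory and then run as many fetch/parse/update rounds (the "depth") as you can afford.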
Regards
Related
I am using Nutch to collect all the data from a single domain. How can I ensure that Nutch has crawled every page under a given domain?
This is not technically possible, since there is no limit on the number of different pages you can have under the same domain; this is especially true for dynamically generated websites. What you could do is look for a sitemap.xml and ensure that all of those URLs are crawled/indexed by Nutch. Since the sitemap is what declares those URLs, you can use it as a guide for what needs to be crawled.
Nutch has a sitemap processor that will inject all the URLs from the sitemap into the current crawldb (i.e., it will "schedule" those URLs to be crawled).
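As a rough sketch of how that is invoked in Nutch 1.x (the exact option names vary between versions, so treat the flags below as an assumption and check the usage output of bin/nutch sitemap first):

bin/nutch sitemap crawl/crawldb -sitemapUrls urls/sitemaps/

Here crawl/crawldb is an existing crawldb, and urls/sitemaps/ is assumed to be a directory of text files listing the sitemap URLs.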
As a hint, even Google enforces a maximum number of URLs to be indexed from the same domain when doing a deep crawl. This is usually referred to as a crawl budget.
Hello, I have a multistore, multidomain PrestaShop installation with the main domain example.com, and I want to block all bots from crawling a subdomain site, subdomain.example.com, made for resellers where they can buy at lower prices, because its content duplicates the original site. I am not exactly sure how to do it. Usually, if I wanted to block the bots for a site, I would use
User-agent: *
Disallow: /
But how do I use it without hurting the whole store? And is it possible to block the bots from the .htaccess too?
Regarding your first question:
If you don't want search engines to gain access to the subdomain (sub.example.com/robots.txt), using a robots.txt file ON the subdomain is the way to go. Don't put it on your regular domain (example.com/robots.txt) - see Robots.txt reference guide.
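If both shops share the same PrestaShop document root, one way to serve a robots.txt only for the subdomain is a mod_rewrite rule in .htaccess; a sketch, assuming Apache with mod_rewrite and a hypothetical robots-resellers.txt file containing the Disallow rules:

RewriteEngine On
RewriteCond %{HTTP_HOST} ^subdomain\.example\.com$ [NC]
RewriteRule ^robots\.txt$ /robots-resellers.txt [L]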
Additionally, I would verify both domains in Google Search Console. There you can monitor and control the indexation of the subdomain and main domain.
Regarding your second question:
I've found a SO thread here which explains what you want to know: Block all bots/crawlers/spiders for a special directory with htaccess.
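The gist of that thread is matching on the user agent in .htaccess; a sketch, again assuming Apache with mod_rewrite and an illustrative (not exhaustive) list of crawler names:

RewriteEngine On
RewriteCond %{HTTP_HOST} ^subdomain\.example\.com$ [NC]
RewriteCond %{HTTP_USER_AGENT} (googlebot|bingbot|yandex|baiduspider) [NC]
RewriteRule .* - [F,L]

Keep in mind this only returns 403 to bots that identify themselves honestly; the robots.txt approach above is still the primary tool.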
We use a canonical URL to tell the search engines where to find the original content.
https://yoast.com/rel-canonical/
A canonical URL allows you to tell search engines that certain similar URLs are actually one and the same. Sometimes you have products or content that is accessible under multiple URLs, or even on multiple websites. Using a canonical URL (an HTML link tag with attribute rel=canonical) these can exist without harming your rankings.
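Concretely, that is one link element in the head of each duplicate page, pointing at the preferred URL (the address below is only an example):

<link rel="canonical" href="https://example.com/original-product-page/" />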
I have 2 domains; a site is deployed on one of them, while the other has no content and simply redirects to the one with the content. Google is indexing both of them, showing the same content from the first domain in the search results.
Q: How can I prevent the one that redirects from showing up on the search results?
Is it just a matter of deploying a robots.txt on the domain that redirects?
If all you want is to stop Google from indexing your site you should use the following robots.txt file:
User-agent: *
Disallow: /
However, if you want to make sure the correct domain shows up in Google's results you should:
a) Use HTTP 301 Redirects
b) Specify your canonical
According to Google...
Q: I have the same content available on two domains (example.com and example2.org). How do I let Google know that the two domains are the same site?
A: Use a 301 redirect to direct traffic from the alternative domain (example2.org) to your preferred domain (example.com). This tells Google to always look for your content in one location, and is the best way to ensure that Google (and other search engines!) can crawl and index your site correctly. Ranking signals (such as PageRank or incoming links) will be passed appropriately across 301 redirects. If you're changing domains, read about some best practices for making the move.
Source
So I'm guessing you either aren't doing a 301, or Google has changed its behavior.
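For completeness, a domain-level 301 on Apache might look like the following .htaccess sketch, assuming mod_rewrite and that example2.org is the domain that should only redirect:

RewriteEngine On
RewriteCond %{HTTP_HOST} ^(www\.)?example2\.org$ [NC]
RewriteRule ^(.*)$ https://example.com/$1 [R=301,L]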
I am rewriting a web application and will be porting procedural PHP code to a framework.
At the same time, the layout of the pages (but not the critical content) is not changing.
I'm thinking that I need to somehow port the old URLs by using a 301 permanent redirect.
Any advice on how to proceed? As this site is long established with a good reputation, it's important we maintain rankings in Bing and Google.
Thanks.
You should use 301 redirects; that will carry over the data the engines have for those pages, and you should be fine.
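If the new framework uses different paths, the old URLs can be mapped to the new ones in .htaccess; a sketch with made-up paths, assuming Apache's mod_alias:

Redirect 301 /about.php /about
Redirect 301 /contact.php /contact
RedirectMatch 301 ^/news/(.*)\.php$ /news/$1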
I have a domain with a lot of indexed pages; I use this one as an online test domain. I understand that I should test on an intranet or something similar, but over time Google has indexed a few websites which are not relevant anymore.
Does anyone know how to get a domain totally unindexed from most search engines?
There are a couple of things you can do:
Set up a restrictive robots.txt file
Password protect the domain root (see the sketch after this list)
Request removal directly from the search engines
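For the password-protection option, HTTP basic auth in .htaccess is usually enough; a sketch, assuming Apache and a hypothetical /path/to/.htpasswd file created with the htpasswd utility:

AuthType Basic
AuthName "Test site"
AuthUserFile /path/to/.htpasswd
Require valid-user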
If you have a static IP and you are the only one accessing the site, you can simply deny access to any IPs other than yours.
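On Apache 2.4+ that can be a single .htaccess line (203.0.113.10 is a placeholder for your own static IP; older 2.2 installs use the Order/Deny/Allow directives instead):

Require ip 203.0.113.10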
Place a robots.txt file in the root directory of your website. It can be used to control how much access search engine spiders have to your content. You can specify certain areas of your site as off limits to indexing, on a directory-by-directory basis.
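For example, to keep crawlers out of specific directories only (the directory names are placeholders):

User-agent: *
Disallow: /test/
Disallow: /staging/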
Remove the alias domain if you have one
Remove the URL redirect from the old domain to the new one
so that search engines can slowly de-index your old domain.