301 or 404 Millions of indexed pages - .htaccess

Hope someone can shed some light on the best way to do this. We currently have a website with roughly 3 1/2 million pages indexed into google. Now we've decided we no longer need this site and we're taking it down and replacing it with a much smaller (totally different) site on the same domain.
So the pages that are already indexed i would like them all to direct to a single page explaining what's happened to them. (if that makes sense).
So for arguments sake lets say the site i want to remove all pages for is 'xyz.com' so if someone clicks on an indexed page e.g 'xyz.com/indexed-page/' i want that to go to a new page on 'xyz.com' let's say 'xyz.com/what-happened-to-indexed-page/'.
Now i'm guessing i shouldn't be doing 301's here because essentially i'm not moving the indexed pages, i'm actually removing them. So would it be best to just send all the current indexed pages to a custom 404 page and explain there what's happened to said indexed pages?
Hope that makes sense.
Thanks

Use a HTTP status code of 410 GONE to indicate the pages are permanently removed. Doing a 404 will tell the searches engines to keep retrying which they will do for a while before they finally assume it is gone permanently. That will punish your server and pollute your logs.
Regardless of status code your idea of having a page explain what happened to the pages is a good one. Users will continue to find those old URLs for quite some time and explaining why they are gone is great usability. Nothing is more frustrating then thinking you're finding one piece of content only to be taken to a page that does not contain that content.

Related

Performing many get requests

I am writing a python program that uses beautifulsoup to scrape the image link off a website and then categorize the image. The website puts their images on separate pages in the given url format:
(website.com/(a-z)(a-z)(0-9)(0-9)(0-9)(0-9)
This means the the number of url possibilities are very high (+1 million). I am afraid that if I do a get request to the site this many times, it might harm the site or put me in legal danger. How can I scrape the most amount of urls without damaging the site or putting myself in legal trouble? Please let me know if you guys would want anymore information. Thank you!
P.S. I have left a psudocode of what my code does below if that helps.
P.S.S. Sorry if the format is weird or messed up, I am posting from mobile
For url in urlPossibilities:
Request.get(url)
UrlLink = FindImgLink(url)
Categorize(urlLink)
A few options I can think of...
1) Is there a way to get a listing of these image URLs? E.g. a site map, or a page with a large list of them. This would be the preferred way as by using that listing you can then only scrape what you know to exist. Based on your question I feel this is unlikely but if you have one URL is there no way to work backwards and find more?
2) Is there a pattern to the image naming? The letters might be random but the numbers might incrementally count up. E.g. AA0001 and AA0002 might exist but there may be no other images for the AA prefix?
3) Responsible scraping - if the naming within that structure truly is random and you have no option but try all URLs till you get a hit do so responsibly. Respect robot.txt's and limit the rate of requests.

How to determine the stopping point of a loop when crawling a web-site

My program currently goes through pages of a website gathering information. How do I set my loop to end when I have visited all the websites pages?
Is there some way of knowing the amount of webpages in any site?
Or do I have compare a block of pages I have visited eg 10 and if the pages are checked in that order again i know its repeating itself.
I'm sure there has to be a better way of knowing when to stop.
Keep track of pages visited ( may be keeping visited URL in a set) and when trying to scan a new page, check if it is already visited.
Breadth first search
Depth first search
Check these two algorithms. Think of the site as a graph
whose nodes are the pages and whose edges/arcs are the links
from one page to another. So two pages are neighboring
A → B, if there's a link from page A to page B.
Then just implement one of these two algorithms
(whichever you find more appropriate for your case).
Both of them have their respective stop conditions.
Your search in both cases should start with the root
page(s) which is usually default.ext or index.ext or
something similar (ext = html, asp, aspx, jsp, php, whatever).
You may want to pre-process the website with a SitemapGenerator and only visit the webpages included in the sitemap.
Is there some way of knowing the amount of webpages in any site
No. All you can do to examine a web-site is to make HTTP GET (or HEAD) requests and examine the response. That will tell you whether the URI is a valid identifier for a resource, and get you a representation of that resource. You can not know which requests will indicate a valid resource, nor can you practically generate all the possible URIs to perform an exhaustive search.
At best, all you can do is to start with a URI and find all the resources reachable from that URI, by examining resources that contain links to other resources, and then following those links.

Sort domains by number of public web pages?

I'd like a list of the top 100,000 domain names sorted by the number of distinct, public web pages.
The list could look something like this
Domain Name 100,000,000 pages
Domain Name 99,000,000 pages
Domain Name 98,000,000 pages
...
I don't want to know which domains are the most popular. I want to know which domains have the highest number of distinct, publicly accessible web pages.
I wasn't able to find such a list in Google. I assume Quantcast, Google or Alexa would know, but have they published such a list?
For a given domain, e.g. yahoo.com you can google-search site:yahoo.com; at the top of the results it says "About 141,000,000 results (0.41 seconds)". This includes subdomains like www.yahoo.com, and it.yahoo.com.
Note also that some websites generate pages on the fly, so they might, in fact, have infinite "pages". A given page will be calculated when asked for, and forgotten as soon as it is sent. Each can have a link to the next page. Since many websites compose their pages on the fly, there is no real difference (except that there are infinite pages, which you can't find out unless you ask for them all).
Keep in mind a few things:
Many websites generate pages dynamically, leaving a potentially infinite number of pages.
Pages are often behind security barriers.
Very few companies are interested in announcing how much information they maintain.
Indexes go out of date as they're created.
What I would be inclined to do for specific answers is mirror the sites of interest using wget and count the pages.
wget -m --wait=9 --limit-rate=10K http://domain.test
Keep it slow, so that the company doesn't recognize you as a Denial of Service attack.
Most search engines will allow you to search their index by site, as well, though the information on result pages might be confusing for more than a rough order of magnitude and there's no way to know how much they've indexed.
I don't see where they keep or have access to the database at a glance, but down the search engine path, you might also be interested in the Seeks and YaCy search engine projects.
The only organization I can think of that might (a) have the information easily available and (b) be friendly and transparent enough to want to share it would be the folks at The Internet Archive. Since they've been archiving the web with their Wayback Machine for a long time and are big on transparency, they might be a reasonable starting point.

Cursors + Pagination & SEO

I would like to know if it's possible to paginate using cursors and keep those pages optimized for SEO at the same time.
/page/1
/page/2
Using offsets, gives to Google bot some information about the depth, that's not the case with curors:
/page/4wd3TsiqEIbc4QTcu9TIDQ
/page/5Qd3TvSUF6Xf4QSX14mdCQ
Should I just only use them as an parameter ?
/page?c=5Qd3TvSUF6Xf4QSX14mdCQ
Well, this question is really interesting and I'll try to answer your question thoroughly.
Introduction
A general (easy to solve) con
If you are using a pagination system, you're probably showing, for each page, a snippet of your items (news, articles, pages and so on). Thus, you're dealing with the famous duplicate content issue. In the page I've linked you'll find the solution to this problem too. In my opinion, this is one of the best thing you can do:
Use 301s: If you've restructured your site, use 301 redirects
("RedirectPermanent") in your .htaccess file to smartly redirect
users, Googlebot, and other spiders. (In Apache, you can do this with
an .htaccess file; in IIS, you can do this through the administrative
console.)
A little note to the general discussion: Since few weeks, Google has been introducing a "system" to help they recognise the relationship between pages as you can see here: Pagination with rel="next" and rel="prev"
Said that, now I can go to the core of the question. In each of the two solutions, there are pros and cons.
As subfolder (page/1)
Cons: You are losing link juice on the page "page" because every piece (page) of your pagination system, will be seen as an indipendent source because they have a different url (infact you are not using parameters).
Pros: If your whole system is doing using the '/' as separator between parameters (which is in a lot of case a good thing) this solution will give coninuity to your system.
As parameter (page?param=1)
Cons: Though Google and the other S.E.s manage the parameters without problems, you're letting them decide for you if a parameter is important or not and if they have to take care to manage them or ignore them. Obviously this is true unless you're not deciding how to manage them in their respective webmaster tool panel.
Pros: You're taking all the link juice on the page "page" but indeed this is not so important because you want to give the link juice to those pages which will show the detailed items.
An "alternative" to pagination
As you can see, I posted on this website a question which is related to your. To sum up, I wanted to know an alternative to pagination. Here is the question (read the accepter answer): How to avoid pagination in a website to have a flat architecture?
Well, I really hope I've answered your question thoroughly.

How Prevent Google Duplicate Content Problem | Multi Site

I'm about to launch a multi-domain affiliate sites which have one thing in common which is content. Reading about the problem with duplicate content and Google I'm a little worried that the parent domain or sub sites could get banned from the search engine for duplicated content.
If I have 100 sites with similar look and feel and basically same content with some minor element changes, how will I go on preventing banning, indexing these correctly?
Should I should just prevent sub-sites from been indexed completely with robots?
If so how will people be able to find their site... I actually think the parent is the one that should only be indexed to avoid, but will love to her other expert thoughts.
Google have recently released an update that will allow you to include a link tag in the head of pages that are using duplicated content that point to the original version, they're called canonical links and they exist for the exact reason you mention, to be able to use duplicated content without penalisation
For more information look here..
http://googlewebmastercentral.blogspot.com/2009/02/specify-your-canonical.html
This doesn't mean that your sites with duplicated content will be ranked well for the duplicated content but it does mean the original is "protected". For decent ranking in the duplicated sites you will need to provide unique content
If I have 100 sites with similar look
and feel and basically same content
with some minor element changes, how
will I go on preventing banning,
indexing these correctly?
Unfortunately for you, this is exactly what Google downgrades in its search listings, to make search results more relevant, and less rigged / gamed.
Fortunately for us (i.e. users of Google), their techniques generally work.
If you want 100s of sites, to be properly ranked, you'll need to make sure they each have unique content.
You won't get banned straight away. You will have to be reported by a person.
I would suggest launching with the duplicate content and then iterating over it in time, creating unique content that is dispersed across your network. This will ensure that not all sites are spammy copies of each other and will result in Google picking up the content as fresh.
I would say go ahead with it, but try to work in as much unique content as possible, especially where it matters most (page titles, headings, etc).
Even if the sites did get banned (more likely they would just have results omitted, but it is certainly possible they would be banned in your situation) you're now just at basicly the same spot you would have been if you decided to "noindex" all the sites.

Resources