How Prevent Google Duplicate Content Problem | Multi Site - search

I'm about to launch a multi-domain affiliate sites which have one thing in common which is content. Reading about the problem with duplicate content and Google I'm a little worried that the parent domain or sub sites could get banned from the search engine for duplicated content.
If I have 100 sites with similar look and feel and basically same content with some minor element changes, how will I go on preventing banning, indexing these correctly?
Should I should just prevent sub-sites from been indexed completely with robots?
If so how will people be able to find their site... I actually think the parent is the one that should only be indexed to avoid, but will love to her other expert thoughts.

Google have recently released an update that will allow you to include a link tag in the head of pages that are using duplicated content that point to the original version, they're called canonical links and they exist for the exact reason you mention, to be able to use duplicated content without penalisation
For more information look here..
http://googlewebmastercentral.blogspot.com/2009/02/specify-your-canonical.html
This doesn't mean that your sites with duplicated content will be ranked well for the duplicated content but it does mean the original is "protected". For decent ranking in the duplicated sites you will need to provide unique content

If I have 100 sites with similar look
and feel and basically same content
with some minor element changes, how
will I go on preventing banning,
indexing these correctly?
Unfortunately for you, this is exactly what Google downgrades in its search listings, to make search results more relevant, and less rigged / gamed.
Fortunately for us (i.e. users of Google), their techniques generally work.
If you want 100s of sites, to be properly ranked, you'll need to make sure they each have unique content.

You won't get banned straight away. You will have to be reported by a person.
I would suggest launching with the duplicate content and then iterating over it in time, creating unique content that is dispersed across your network. This will ensure that not all sites are spammy copies of each other and will result in Google picking up the content as fresh.

I would say go ahead with it, but try to work in as much unique content as possible, especially where it matters most (page titles, headings, etc).
Even if the sites did get banned (more likely they would just have results omitted, but it is certainly possible they would be banned in your situation) you're now just at basicly the same spot you would have been if you decided to "noindex" all the sites.

Related

How can I scrape data from a website?

I want to scrape only four data items from the following page in each and every product from the following link that was an infinitive scroll down page.
name of the product
price of the product
href of the product
img src of the product.
All the data will be stored in a single csv file.
How can I do this?
Any idea?
i have not sure of this method.
get the original source code where you can get all of info of the website including the photo link or any word
This is usually considered a bad idea. If you write code to scrape a website for it's content, what happens when they change their markup? Or what happens when they realize you're scraping (stealing) their original content and ban your server's IP address or IP range even. It's a losing battle, so unless you have permission from them to do so I wouldn't recommend trying. It may work for a little while, but probably not for long. It's generally considered poor form to do something like this, so personally I wouldn't encourage anyone to teach someone how to scrape a website for it's content.
Furthermore, it says very clearly in their Terms of Use not to do exactly that:
You agree not to access (or attempt to access) the Website and the materials
or Services by any means other than through the interface that is provided by
Snapdeal. You shall not use any deep-link, robot, spider or other automatic
device, program, algorithm or methodology, or any similar or equivalent manual
process, to access, acquire, copy or monitor any portion of the Website or
Content (as defined below), or in any way reproduce or circumvent the
navigational structure or presentation of the Website, materials or any
Content, to obtain or attempt to obtain any materials, documents or
information through any means not specifically made available through the
Website.

301 or 404 Millions of indexed pages

Hope someone can shed some light on the best way to do this. We currently have a website with roughly 3 1/2 million pages indexed into google. Now we've decided we no longer need this site and we're taking it down and replacing it with a much smaller (totally different) site on the same domain.
So the pages that are already indexed i would like them all to direct to a single page explaining what's happened to them. (if that makes sense).
So for arguments sake lets say the site i want to remove all pages for is 'xyz.com' so if someone clicks on an indexed page e.g 'xyz.com/indexed-page/' i want that to go to a new page on 'xyz.com' let's say 'xyz.com/what-happened-to-indexed-page/'.
Now i'm guessing i shouldn't be doing 301's here because essentially i'm not moving the indexed pages, i'm actually removing them. So would it be best to just send all the current indexed pages to a custom 404 page and explain there what's happened to said indexed pages?
Hope that makes sense.
Thanks
Use a HTTP status code of 410 GONE to indicate the pages are permanently removed. Doing a 404 will tell the searches engines to keep retrying which they will do for a while before they finally assume it is gone permanently. That will punish your server and pollute your logs.
Regardless of status code your idea of having a page explain what happened to the pages is a good one. Users will continue to find those old URLs for quite some time and explaining why they are gone is great usability. Nothing is more frustrating then thinking you're finding one piece of content only to be taken to a page that does not contain that content.

How to determine the stopping point of a loop when crawling a web-site

My program currently goes through pages of a website gathering information. How do I set my loop to end when I have visited all the websites pages?
Is there some way of knowing the amount of webpages in any site?
Or do I have compare a block of pages I have visited eg 10 and if the pages are checked in that order again i know its repeating itself.
I'm sure there has to be a better way of knowing when to stop.
Keep track of pages visited ( may be keeping visited URL in a set) and when trying to scan a new page, check if it is already visited.
Breadth first search
Depth first search
Check these two algorithms. Think of the site as a graph
whose nodes are the pages and whose edges/arcs are the links
from one page to another. So two pages are neighboring
A → B, if there's a link from page A to page B.
Then just implement one of these two algorithms
(whichever you find more appropriate for your case).
Both of them have their respective stop conditions.
Your search in both cases should start with the root
page(s) which is usually default.ext or index.ext or
something similar (ext = html, asp, aspx, jsp, php, whatever).
You may want to pre-process the website with a SitemapGenerator and only visit the webpages included in the sitemap.
Is there some way of knowing the amount of webpages in any site
No. All you can do to examine a web-site is to make HTTP GET (or HEAD) requests and examine the response. That will tell you whether the URI is a valid identifier for a resource, and get you a representation of that resource. You can not know which requests will indicate a valid resource, nor can you practically generate all the possible URIs to perform an exhaustive search.
At best, all you can do is to start with a URI and find all the resources reachable from that URI, by examining resources that contain links to other resources, and then following those links.

Sort domains by number of public web pages?

I'd like a list of the top 100,000 domain names sorted by the number of distinct, public web pages.
The list could look something like this
Domain Name 100,000,000 pages
Domain Name 99,000,000 pages
Domain Name 98,000,000 pages
...
I don't want to know which domains are the most popular. I want to know which domains have the highest number of distinct, publicly accessible web pages.
I wasn't able to find such a list in Google. I assume Quantcast, Google or Alexa would know, but have they published such a list?
For a given domain, e.g. yahoo.com you can google-search site:yahoo.com; at the top of the results it says "About 141,000,000 results (0.41 seconds)". This includes subdomains like www.yahoo.com, and it.yahoo.com.
Note also that some websites generate pages on the fly, so they might, in fact, have infinite "pages". A given page will be calculated when asked for, and forgotten as soon as it is sent. Each can have a link to the next page. Since many websites compose their pages on the fly, there is no real difference (except that there are infinite pages, which you can't find out unless you ask for them all).
Keep in mind a few things:
Many websites generate pages dynamically, leaving a potentially infinite number of pages.
Pages are often behind security barriers.
Very few companies are interested in announcing how much information they maintain.
Indexes go out of date as they're created.
What I would be inclined to do for specific answers is mirror the sites of interest using wget and count the pages.
wget -m --wait=9 --limit-rate=10K http://domain.test
Keep it slow, so that the company doesn't recognize you as a Denial of Service attack.
Most search engines will allow you to search their index by site, as well, though the information on result pages might be confusing for more than a rough order of magnitude and there's no way to know how much they've indexed.
I don't see where they keep or have access to the database at a glance, but down the search engine path, you might also be interested in the Seeks and YaCy search engine projects.
The only organization I can think of that might (a) have the information easily available and (b) be friendly and transparent enough to want to share it would be the folks at The Internet Archive. Since they've been archiving the web with their Wayback Machine for a long time and are big on transparency, they might be a reasonable starting point.

Cursors + Pagination & SEO

I would like to know if it's possible to paginate using cursors and keep those pages optimized for SEO at the same time.
/page/1
/page/2
Using offsets, gives to Google bot some information about the depth, that's not the case with curors:
/page/4wd3TsiqEIbc4QTcu9TIDQ
/page/5Qd3TvSUF6Xf4QSX14mdCQ
Should I just only use them as an parameter ?
/page?c=5Qd3TvSUF6Xf4QSX14mdCQ
Well, this question is really interesting and I'll try to answer your question thoroughly.
Introduction
A general (easy to solve) con
If you are using a pagination system, you're probably showing, for each page, a snippet of your items (news, articles, pages and so on). Thus, you're dealing with the famous duplicate content issue. In the page I've linked you'll find the solution to this problem too. In my opinion, this is one of the best thing you can do:
Use 301s: If you've restructured your site, use 301 redirects
("RedirectPermanent") in your .htaccess file to smartly redirect
users, Googlebot, and other spiders. (In Apache, you can do this with
an .htaccess file; in IIS, you can do this through the administrative
console.)
A little note to the general discussion: Since few weeks, Google has been introducing a "system" to help they recognise the relationship between pages as you can see here: Pagination with rel="next" and rel="prev"
Said that, now I can go to the core of the question. In each of the two solutions, there are pros and cons.
As subfolder (page/1)
Cons: You are losing link juice on the page "page" because every piece (page) of your pagination system, will be seen as an indipendent source because they have a different url (infact you are not using parameters).
Pros: If your whole system is doing using the '/' as separator between parameters (which is in a lot of case a good thing) this solution will give coninuity to your system.
As parameter (page?param=1)
Cons: Though Google and the other S.E.s manage the parameters without problems, you're letting them decide for you if a parameter is important or not and if they have to take care to manage them or ignore them. Obviously this is true unless you're not deciding how to manage them in their respective webmaster tool panel.
Pros: You're taking all the link juice on the page "page" but indeed this is not so important because you want to give the link juice to those pages which will show the detailed items.
An "alternative" to pagination
As you can see, I posted on this website a question which is related to your. To sum up, I wanted to know an alternative to pagination. Here is the question (read the accepter answer): How to avoid pagination in a website to have a flat architecture?
Well, I really hope I've answered your question thoroughly.

Resources