Redirects vs. "true" page hits: a crawler's perspective [closed] - http-status-code-301

Background:
Domains such as bit.ly, ow.ly, instagr.am, and gowal.la are URL shorteners which forward elsewhere. Since most of these URLs actually forward to other, third-party sites, I'm assuming they can handle a pretty heavy load.
Question:
Is there a different politeness metric when crawling 301 redirects from a single domain (e.g. ow.ly), compared with crawling "real" content pages (e.g. blogger.com/)?
More concretely: how many times a day would we expect to be able to hit a site which issues 301 redirects, compared with a normal site which serves real content?
Some initial thoughts:
My initial guess would be around 10^6 (1,000,000) requests per day, which works out to roughly 11-12 requests per second. What I see online suggests that hitting a mature site on the order of 10^3-10^5 times a day is not a huge issue, considering that a large site like Tumblr receives around 10^7 (10,000,000+) views per day, and sites like Google are on the order of 10^9 (billions of) views per day.
In any case, I hope this very raw bit of fact-finding will spur some thoughts on defining the difference in "politeness" metrics when we are discussing 301 redirects versus "true" page crawls (which are more bandwidth-intensive).

When in doubt, check robots.txt. There's a non-standard extension called Crawl-delay which, as you might imagine, specifies how many seconds to wait between requests.
You mentioned bit.ly; their robots.txt has no such restrictions, and a human-friendly comment saying "robots welcome". As long as you are not abusive, you probably won't have a problem with them. There are also comments in there stating that they have an API. Using that API may be more useful than crawling.
As for defining abusive... well, unfortunately that's a very subjective thing, and there's not going to be any one right answer. You'd probably need to ask each specific vendor what their recommendations and limits are, if they don't provide this information through documentation on their site, robots.txt, or through an actual API, which itself may have well-defined access limits.
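If it helps, here is a minimal sketch of that check using Python's standard-library urllib.robotparser; the user-agent string and URLs are placeholders of my own, not anything the site recommends.

import time
import urllib.robotparser

# Read the site's robots.txt and honour Crawl-delay if one is declared.
rp = urllib.robotparser.RobotFileParser()
rp.set_url("https://bit.ly/robots.txt")
rp.read()

user_agent = "MyResearchCrawler"        # hypothetical user-agent
delay = rp.crawl_delay(user_agent)      # None when no Crawl-delay applies

if rp.can_fetch(user_agent, "https://bit.ly/some-short-code"):
    time.sleep(delay or 1.0)            # fall back to a conservative 1-second pause
    # ...issue the actual request here...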


Why would you stop Google from indexing pages in your website? [closed]

I've read some articles on how to stop the indexing, but I'm not clear WHY you would actually want to do that.
1) The explanation I found for why was:
"For marketers, one common reason is to prevent duplicate content (when there is more than one version of a page indexed by the search engines, as in a printer-friendly version of your content) from being indexed.
Another good example? A thank-you page (i.e., the page a visitor lands on after converting on one of your landing pages). This is usually where the visitor gets access to whatever offer that landing page promised, such as a link to an ebook PDF." [Basically you don't want the user to find your Thank You page with freebies through search without signing up]
However, in both of these cases it actually seems like a bad idea to prevent indexing. Wouldn't you rather just redirect to the sign-in page (in the second example) after the user finds you? At least the user would be able to reach your website.
2) It's also mentioned that indexing is not the same as appearing in Google search results, but it's not really clear what the difference is. Could someone enlighten me?
TIA.
Let me provide a few good reasons from my experience, though I believe many more exist.
The traditional primary reason is to save computing resources. Imagine a search engine: it probably would not want another search engine to index all of its result pages.
A big part of it is preventing wasted resources; a search engine indexing itself, for instance, could take a very long time. This also applies to binary data that contains no text.
Your first example somewhat falls into this category:
"For marketers, one common reason is to prevent duplicate content (when there is more than one version of a page indexed by the search engines, as in a printer-friendly version of your content) from being indexed.
But this is no longer considered a valid reason, as the resource consumption is generally low, and proper disambiguation should be done with HTML metadata such as:
<link rel="canonical" href="https://example.com/article/">        <!-- the preferred/permanent URL; example.com is a placeholder -->
<link rel="alternate" media="print" href="https://example.com/article/print/">
Another big reason to prevent indexing is privacy. For example, Facebook profiles are not indexed if the owner chooses so.
Another good example? A thank-you page (i.e., the page a visitor lands on after converting on one of your landing pages). This is usually where the visitor gets access to whatever offer that landing page promised, such as a link to an ebook PDF." [Basically you don't want the user to find your Thank You page with freebies through search without signing up]
This falls into the privacy category. Worse still, a search engine once indexed a set of these "thank you" pages from a mobile operator's website, and they included the messages that had been sent.
One observed reason is general newbie paranoia. It is a bad reason, because that kind of protection would be much better implemented with HTTP authentication.
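For what it's worth, the usual mechanism behind these privacy cases is a noindex signal, either a robots meta tag or an X-Robots-Tag response header. A minimal sketch, assuming a Flask app and a hypothetical /thank-you route (neither comes from the question):

from flask import Flask, make_response

app = Flask(__name__)

@app.route("/thank-you")                      # hypothetical route name
def thank_you():
    resp = make_response("Here is the download link for your ebook ...")
    # Google treats this header the same as <meta name="robots" content="noindex">.
    resp.headers["X-Robots-Tag"] = "noindex"
    return resp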

301 or 404 for millions of indexed pages

Hope someone can shed some light on the best way to do this. We currently have a website with roughly 3.5 million pages indexed in Google. We've now decided we no longer need this site, so we're taking it down and replacing it with a much smaller (totally different) site on the same domain.
So I would like all the pages that are already indexed to point to a single page explaining what has happened to them (if that makes sense).
For argument's sake, let's say the site I want to remove all pages from is 'xyz.com'. If someone clicks on an indexed page, e.g. 'xyz.com/indexed-page/', I want that to go to a new page on 'xyz.com', say 'xyz.com/what-happened-to-indexed-page/'.
Now I'm guessing I shouldn't be doing 301s here, because essentially I'm not moving the indexed pages, I'm actually removing them. So would it be best to just send all the currently indexed pages to a custom 404 page and explain there what has happened to them?
Hope that makes sense.
Thanks
Use an HTTP status code of 410 Gone to indicate the pages are permanently removed. A 404 tells the search engines to keep retrying, which they will do for a while before they finally assume the page is gone permanently. That will punish your server and pollute your logs.
Regardless of the status code, your idea of having a page explain what happened to the pages is a good one. Users will continue to find those old URLs for quite some time, and explaining why they are gone is great usability. Nothing is more frustrating than thinking you've found a piece of content only to be taken to a page that does not contain it.
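If the replacement site is an application anyway, wiring this up is small. A sketch under the assumption that the new site runs Flask and that the removed pages can be matched by a URL prefix (both assumptions are mine, not from the question):

from flask import Flask

app = Flask(__name__)

@app.route("/old-section/<path:old_path>")    # hypothetical prefix matching the removed pages
def gone(old_path):
    # 410 tells crawlers the removal is permanent, unlike 404 ("might come back"),
    # while the body still explains to human visitors what happened.
    return "This page was part of the old site, which has been retired. ...", 410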

Should I let non-members add comment to a post? [closed]

As long as it's SQL-injection proof, would it be alright for me to let non-members add comments to a post and give the author the ability to delete them?
Before you do it, consider the following questions
(and any other questions specific to your project that may spring to mind)
Do you have a good rate-limiting scheme set up so a user can't just fill your hard drive with randomly-generated comments?
Do you have a system in place to automatically ban users / IP addresses who seem to be abusive?
Do you have a limit on the number (or total kilobytes) of comments loaded per page, so someone can't fill a page with comments, making the page take forever to load and making it easy to DoS you by requesting that page repeatedly?
Is it possible to fold comments out of sight on the webpage so users can easily hide spammy comments they'd rather not see?
Is it possible for legitimate users to report spammy comments?
These are all issues that apply to full members too, of course. But they also matter for anonymous users, and since anonymous posting is low-hanging fruit, a botmaster would be more likely to target it. The main thing is simply to ask: "If I were a skilled programmer who hated this website, or wanted to make money from advertising on it, and I had a small botnet, what is the worst thing I could do to this website using anonymous comments, given the resources I have?" And that's a tough question, which depends a great deal on what else you have in place.
If you do it, here are a few pointers:
HTML-escape the comments when you fetch them from the database, before you display them; otherwise you're open to XSS.
Make sure you never run any eval-like function on the input the user gives you. This includes printf: stick with printf("%s", userStr); so printf doesn't directly try to interpret userStr. (If you care about why that's an issue, google for Aleph One's seminal paper on stack smashing.)
Never rely on the size of the input falling within a specific range (even if you check this in JavaScript; in fact, especially if you only try to ensure this in JavaScript).
Never trust that anything about the content will be true (make no assumptions about character encoding, for example; remember, a malicious user doesn't need a browser and can craft their requests however they want).
Default to paranoia. If someone posts 20 comments in a minute, ban them from commenting for a while. If they keep doing that, ban their IP. If they're a real person and they care, they'll ask you to undo it. Plus, if they're a real person with a history of posting 20 comments a minute, chances are pretty good those comments would be improved by some time under the banhammer; no one's that witty. A rough sketch of the escaping and rate-limiting points follows below.
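Here is that sketch; the limits, the in-memory storage, and the function names are placeholders of my own, not part of any particular framework:

import html
import time
from collections import defaultdict, deque

WINDOW_SECONDS = 60
MAX_COMMENTS_PER_WINDOW = 20

_recent = defaultdict(deque)        # ip -> timestamps of recent comments

def allow_comment(ip, now=None):
    # Return False once an IP exceeds the per-minute comment limit.
    now = time.time() if now is None else now
    q = _recent[ip]
    while q and now - q[0] > WINDOW_SECONDS:
        q.popleft()                 # drop timestamps outside the window
    if len(q) >= MAX_COMMENTS_PER_WINDOW:
        return False                # over the limit: reject, or start a temporary ban
    q.append(now)
    return True

def render_comment(raw_text):
    # Escape on output so stored text cannot inject markup (XSS).
    return html.escape(raw_text)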
Typically this kind of question depends on the type of community, as well as the control you give your authors. Definitely implement safety measures and a verification system (e.g. CAPTCHA), but more often than not this is something you'll have to gauge over time. If users are well-behaved, then it's fine. If they start spamming every post they get their hands on, then it's probably time for a feature like that to just go away.

Cursors + Pagination & SEO

I would like to know if it's possible to paginate using cursors and keep those pages optimized for SEO at the same time.
/page/1
/page/2
Using offsets gives Googlebot some information about depth; that's not the case with cursors:
/page/4wd3TsiqEIbc4QTcu9TIDQ
/page/5Qd3TvSUF6Xf4QSX14mdCQ
Should I just use them as a parameter?
/page?c=5Qd3TvSUF6Xf4QSX14mdCQ
Well, this question is really interesting and I'll try to answer it thoroughly.
Introduction
A general (easy to solve) con
If you are using a pagination system, you're probably showing, for each page, a snippet of your items (news, articles, pages and so on). Thus, you're dealing with the famous duplicate content issue. In the page I've linked you'll find the solution to this problem too. In my opinion, this is one of the best things you can do:
Use 301s: If you've restructured your site, use 301 redirects ("RedirectPermanent") in your .htaccess file to smartly redirect users, Googlebot, and other spiders. (In Apache, you can do this with an .htaccess file; in IIS, you can do this through the administrative console.)
A little note on the general discussion: a few weeks ago Google introduced a "system" to help it recognise the relationship between paginated pages, as you can see here: Pagination with rel="next" and rel="prev".
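To make that hint concrete with cursor-style URLs, here is a small hypothetical helper (the cursor values are the ones from the question; the function itself is just an illustration):

def pagination_links(prev_cursor=None, next_cursor=None, base="/page"):
    # Build the <link> tags for Google's rel="prev"/rel="next" pagination hint.
    tags = []
    if prev_cursor:
        tags.append('<link rel="prev" href="{}/{}">'.format(base, prev_cursor))
    if next_cursor:
        tags.append('<link rel="next" href="{}/{}">'.format(base, next_cursor))
    return "\n".join(tags)

# pagination_links("4wd3TsiqEIbc4QTcu9TIDQ", "5Qd3TvSUF6Xf4QSX14mdCQ") yields:
# <link rel="prev" href="/page/4wd3TsiqEIbc4QTcu9TIDQ">
# <link rel="next" href="/page/5Qd3TvSUF6Xf4QSX14mdCQ">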
That said, I can now get to the core of the question. Each of the two solutions has its pros and cons.
As a subfolder (page/1)
Cons: You lose link juice on the "page" page, because every page of your pagination system will be seen as an independent resource, since each one has a different URL (in fact, you are not using parameters).
Pros: If your whole system already uses '/' as the separator between parameters (which in a lot of cases is a good thing), this solution gives continuity to your URLs.
As a parameter (page?param=1)
Cons: Though Google and the other search engines handle parameters without problems, you're letting them decide for you whether a parameter is important or not and whether to honour it or ignore it. Obviously this is true unless you decide yourself how each parameter should be handled in their respective webmaster tools panels.
Pros: You keep all the link juice on the "page" page, though this is not so important, because you really want to give the link juice to the pages that show the detailed items.
An "alternative" to pagination
As you can see, I posted a question on this website which is related to yours. To sum up, I wanted to know of an alternative to pagination. Here is the question (read the accepted answer): How to avoid pagination in a website to have a flat architecture?
Well, I really hope I've answered your question thoroughly.

How to make our website appear among the top ten Google results? [closed]

How do we make our site appear in the top ten of a Google search?
I want my website to show up when a user Googles a search term like "social networking" or something like "ssit". How do I do that?
Get to know SEO (Search Engine Optimization).
1) Use proper, relevant keywords and descriptions in your meta tags
2) Include a title tag, and put the more important keywords in heading tags
3) Use proper title and alt attributes for images
4) Have a sitemap page on your website
5) Get more backlinks to your site by submitting articles, press releases and news
6) Cross-link between pages of the same website to provide more links to the most important pages; this may improve their visibility
But before anything else, you should know what type of users you want for your site and research the relevant keywords for it; Google Analytics definitely helps for this purpose.
And most importantly, don't expect your site to be at the top soon; it will take some time, six months at least, to reach the top of the search results. As the number of users of your site increases, its rank will increase. So best of luck!
The algorithm used by Google assigns a rank based on the number of other sites linking to your site in association with specific keywords.
This has been abused for so-called "Google bombing": if a lot of people spread a link to a specific site using a specific word, they connect that word to that site and push it to the top rank (e.g., it has been used to associate insults with politicians). The same technique has been used by spammers to raise the rank of garbage sites: they flood forums and blog comments with links to their sites. Although the algorithm has been improved to try to avoid this issue, it is still a viable way to raise a site's rank.
It is clear that using such methods to improve the visibility of your site will give it a very bad reputation.
I suggest instead paying Google for advertising (so you will get to the top in a legitimate way).
Of course, you are only supposed to reach the top ten if your site really is the best one for the specific topic.
Slip Google a fiver (cough), I mean, uh, get other well-known websites to link to you. Google's PageRank works out a page's final rank or priority by determining how many other sites link to it, with links from already high-ranking sites weighted more than links from an 'unknown' site.
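To illustrate the "links as votes" idea both answers describe, here is a toy version of the PageRank iteration; it is a textbook simplification, not Google's actual ranking code:

def pagerank(links, damping=0.85, iterations=50):
    # links: dict mapping each page to the list of pages it links to.
    # Assumes every linked page also appears as a key in the dict.
    pages = list(links)
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / len(pages) for p in pages}
        for page, outgoing in links.items():
            if not outgoing:
                continue                      # dangling pages just leak rank in this toy version
            share = damping * rank[page] / len(outgoing)
            for target in outgoing:
                new_rank[target] += share     # links from high-rank pages contribute more
        rank = new_rank
    return rank

# "b" is linked by both "a" and "c", so it ends up with the highest rank:
print(pagerank({"a": ["b"], "b": ["c"], "c": ["a", "b"]}))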
