Blocking my website from others - .htaccess

I would like to block access to my website for other people, and probably redirect them to a 404 page, while I am updating it, which can take some time.
Could a redirect to the 404 page every time a user goes to my website work?

You shouldn't do that. Status code 503 "Service Unavailable" is much better in this case.
RewriteRule . - [R=503,L]
This might work.
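For completeness, a fuller maintenance-mode sketch could look like the following (the /maintenance.html filename and the allowed IP address are placeholders I've assumed, not part of the original answer):
# Serve /maintenance.html as the body of the 503 response.
ErrorDocument 503 /maintenance.html
RewriteEngine On
# Let your own IP through so you can keep working on the site (placeholder address).
RewriteCond %{REMOTE_ADDR} !^203\.0\.113\.42$
# Avoid looping on the maintenance page itself.
RewriteCond %{REQUEST_URI} !^/maintenance\.html$
RewriteRule . - [R=503,L]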

If it's just a temporary redirect during site-down maintenance then you probably don't want to use a 404 code. Take a look at the other codes available to you. For a scenario such as this, 307 (temporary redirect) would make a lot more sense. It would also be better if you have any SEO or rely on search crawlers at all, as they will remove results which now produce a 404 but are smart enough to keep results which temporarily produce a 307.
The redirect itself will work fine, just redirect all traffic to a static page. (Did you need advice on how to do that, or were you just looking for alternative options and viability? It's unclear from the question. If the former, I can't help much. It's been years since I've cracked open an .htaccess file.)
Basically, a 404 tells visitors: "This resource isn't here. Don't bother asking again." Whereas a 307 tells visitors: "This resource is temporarily being handled by something else, but it hasn't really moved, please try again later."
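If you do go the 307 route, a minimal .htaccess sketch could look like this (the /updating.html page name is an assumption; any static page would do):
RewriteEngine On
# Don't redirect the static page itself, or the browser will loop.
RewriteCond %{REQUEST_URI} !^/updating\.html$
# 307 marks the redirect as explicitly temporary for browsers and crawlers.
RewriteRule ^ /updating.html [R=307,L]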

Here's a simpler idea: just make a new index page that's your original, except with the content replaced with a "site currently being updated; please come back later" sort of message. And then you'd redirect all hits to your site to that index page.
That's what many sites I've seen tend to do, at least. And it makes sense, at least to me. I mean, would you rather your users not know why the pages they want to access are no longer there, or know that it's because the site is being updated? It's basically the same as a 404 page, just with the specific information about why the desired pages aren't there.
EDIT: It seems I'm basically talking about a 503 page, going by David's link and Roland's answer.

That would work, but it would not only be wrong information (the page is not 'not found', it's just currently being updated), it would also mislead your users and crawlers. I would redirect them to an 'Update in progress' page and send it with the HTTP status code 423 (Locked) to give the client a standards-conformant answer to exactly your scenario.

Related

Duplicate URLs in my page, best solution?

I have a website that writes URLs like this:
mypage.com/post/3453/post-title-name-person
In fact, what is important is the post and ID part (3453). The title I just add for SEO.
I changed some title names recently, but people can still use the old URLs to access the pages, because I only use the ID to open the page, so:
mypage.com/post/3453/post-title-name-person
mypage.com/post/3453/name-person
...
Will open the same page.
Is it wrong? Google Webmaster Tools tells me that I have 8765 duplicate pages. So, to try to solve this, I am redirecting the old-title URLs to post/id/current-title, but it seems that Google doesn't understand this redirect and still reports duplicates.
Should I redirect to a 'not found' page if the title doesn't match the actual database? (But this could be a problem, because links that people shared won't open.) Or what?
Maybe Google has not processed your redirections yet. It may take several weeks and sometimes several months to process all pages, especially if they are not revisited often. Make sure your redirects are 301 and not 302 (temporary).
That being said, there is a better method than redirects for duplicate pages: the canonical tag. If you can, implement it. There is less risk of mixing up redirects.
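For example, every URL variant of the post would carry the same tag in its <head> (the URL here is just the asker's sample):
<link rel="canonical" href="http://mypage.com/post/3453/post-title-name-person" />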
Google can pick up your new URLs only after you implement 301 redirects, for example through the .htaccess file. Always remember that each 301 redirect should be a proper one-to-one mapping to the new URL. After this implementation, fetch the new URLs via Google Search Console so that Google indexes them faster.

Directory listing protection, blank index vs 404 vs 401

In your opinion what is the best way to protect directory listing from external users?
Option 1: Blank index. This is the standard way that I have seen on several sites; it has the advantage of not showing anything, but the disadvantage of implying that there is something there.
Option 2: 404. Send a fake 404 page and redirect. Could this cause problems with web crawlers?
Option 3: 401 error and redirection. This is similar to the blank index, except that it will show an "unauthorized" header. I think this would be a very bad option (because I'm implicitly saying that there is something important inside), but I would like to hear your thoughts on this too.
Thanks for your help. If you know of any other option that I might use, please tell me as well.
The 'best' way is to disable directory listing on the server; this will normally cause a 403 error (see 404 in the list below for a discussion of information leakage, and the snippet after this answer for the directive).
The easiest way is a blank page (normally index.html or index.htm)
Other options with returning errorcodes:
403 (Forbidden) is the default in Apache httpd, and I think this is better than a blank page.
404 is for 'not found', which is not the case here (it could be used if nobody knows that the directory exists, in order to prevent disclosure, but if people know it exists it doesn't make any sense, as its existence is already known), and
401 (authentication required) doesn't make sense in any case.
Other considerations
Some browsers do not display custom error pages. If you want to provide a link to the main page (or somewhere else), a 'blank' page containing a link or a direct 301/302 redirect could be used.
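For reference, the directive that disables the listing in Apache (usable in the server config or, where overrides are allowed, in .htaccess) is:
# Disable automatic directory listings; a directory with no index file
# will then get Apache's default 403 response.
Options -Indexes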

Is there any way to tell a browser that this is a bad URL to remember?

I'm sending emails to customers, and I'm providing a custom URL for each, which when they go to, will log them in.
This is fine, except if they are using a shared browser that will remember the URL.
Is there any way at all to suggest to the browser that it shouldn't remember a URL?
Edit: This question has nothing to do with caching of the page.
Have the link log them in once. Then make them create credentials that let them access the site in the future. What's to stop a random person from typing in the URL and gaining access to the content?
Yes. You can redirect them with a 301 or 302. Then the browser won't save the URL they went to. At least that works with Mozilla-based browsers, and I would imagine others too.
Another, uglier, way is to reply with an error and include a body which does a refresh. Will that work in most browsers? Probably not. However, browsers do not cache pages that return an error (404 Not Found would work; you could also use 403 Forbidden).
Other than that, there isn't much you can do. JavaScript does not allow you to tamper with the history anymore...

How to best normalize URLs

I'm creating a site that allows users to add Keyword --> URL links. I want multiple users to be able to link to the same url (exactly the same, same object instance).
So if user 1 types in "http://www.facebook.com/index.php" and user 2 types in "http://facebook.com" and user 3 types in "www.facebook.com" how do I best "convert" them to what these all resolve to: "http://www.facebook.com/"
The back end is in Python...
How does a search engine keep track of URLs? Does it keep a URL and then take whatever it resolves to, or does it toss URLs that differ from what they resolve to and just care about the resolved version?
Thanks!!!
So if user 1 types in "http://www.facebook.com/index.php" and user 2 types in "http://facebook.com" and user 3 types in "www.facebook.com" how do I best "convert" them to what these all resolve to: "http://www.facebook.com/"
You'd resolve user 3 by fixing up invalid URLs. www.facebook.com isn't a URL, but you can guess that http:// should go on the start. An empty path part is the same as the / path, so you can be sure that needs to go on the end too. A good URL parser should be able to do this bit.
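A rough sketch of that fix-up step, using only Python's standard library (the choice of http:// as the default scheme is an assumption):
# Sketch: normalise a user-typed URL.
from urllib.parse import urlsplit, urlunsplit

def fix_up(raw):
    # Assume http:// when the user omitted the scheme ("www.facebook.com").
    if "://" not in raw:
        raw = "http://" + raw
    scheme, netloc, path, query, fragment = urlsplit(raw)
    # An empty path means the root resource, so make that explicit.
    return urlunsplit((scheme.lower(), netloc.lower(), path or "/", query, fragment))

print(fix_up("www.facebook.com"))  # -> http://www.facebook.com/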
You could resolve user 2 by making an HTTP HEAD request to the URL. If it comes back with a status code of 301, you've got a permanent redirect to the real URL in the Location response header. Facebook does this to send facebook.com traffic to www.facebook.com, and it's definitely something that sites should be doing (even though in the real world many aren't). You might also consider allowing other redirect status codes in the 3xx family to do the same; it's not really the right thing to do, but some sites use 302 instead of 301 for the redirect because they're a bit thick.
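A sketch of that HEAD-request step (this uses the third-party requests library for brevity; timeouts and error handling are deliberately minimal):
# Sketch: follow a single permanent redirect via a HEAD request.
from urllib.parse import urljoin
import requests

def resolve_permanent_redirect(url):
    resp = requests.head(url, allow_redirects=False, timeout=5)
    if resp.status_code == 301 and "Location" in resp.headers:
        # Location may be relative, so resolve it against the original URL.
        return urljoin(url, resp.headers["Location"])
    return url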
If you have the time and network resources (plus more code to prevent the feature being abused to DoS you or others), you could also consider GETting the target web page and parsing it (assuming it turns out to be HTML). If there is a <link rel="canonical" href="..." /> element in the page, you should also treat that URL as being the proper one. (View Source: Stack Overflow does this.)
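And a sketch of that canonical-link step, again with requests plus the standard library's HTML parser (illustrative only; real pages will need more robust handling):
# Sketch: pull a rel="canonical" link out of an HTML page.
from html.parser import HTMLParser
import requests

class CanonicalFinder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "link" and (attrs.get("rel") or "").lower() == "canonical":
            self.canonical = attrs.get("href")

def canonical_url(url):
    resp = requests.get(url, timeout=5)
    finder = CanonicalFinder()
    finder.feed(resp.text)
    return finder.canonical or url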
However, unfortunately, user 1's case cannot be resolved. Facebook is serving a page at / and a page at /index.php, and though we can look at them and say they're the same, there is no technical method to describe that relationship. In an ideal world Facebook would include either a 301 redirect response or a <link rel="canonical" /> to tell people that / was the proper format URL to access a particular resource rather than /index.php (or vice versa). But they don't, and in fact most database-driven web sites don't do this yet either.
To get around this, some search engines(*) compare the content at different [sub]domains, and to a limited extent also different paths on the same host, and guess that they're the same if the content is sufficiently similar. Of course this is a lot of work, requires a lot of storage and processing, and is ultimately not terribly reliable.
I wouldn't really bother with much of this, beyond fixing up URLs like in the user 3 case. From your description it doesn't seem that essential that pages that “are the same” have to share actual identity, unless there's a particular use-case you haven't mentioned.
(*: well, Google anyway; more traditional ones traditionally didn't and would happily serve up multiple links for the same page, but I'd assume the other majors are doing something similar now.)
There's no way to know, other than "magic" knowledge about the particular website, that "/index.php" is the same as fetching "/".
So, your problem, as stated, is impossible.
I'd save the 3 links separately, since you can never reliably tell that they resolve to the same page. It all depends on how the server (out of our control) resolves the URL.

How to tell that a folder has been deleted permanently

I deleted a folder called forums from my website 3 months ago, but my Google Webmaster Tools keeps saying that, e.g., /forums/member.php?u=1092 is missing (404). Is there any way to stop these messages and tell Google that I am not going to re-upload it? Is this going to affect my SEO ranking?
I tried this code, but it's not working.
RewriteRule ^forums/(.*)$ http://www.mysite.com [R=301,L]
Thanks.
Have you tried changing the status code to 410? (There's a rewrite-rule sketch at the end of this answer.)
410 Gone
The requested resource is no longer available at the server and no forwarding address is known. This condition is expected to be considered permanent. Clients with link editing capabilities SHOULD delete references to the Request-URI after user approval. If the server does not know, or has no facility to determine, whether or not the condition is permanent, the status code 404 (Not Found) SHOULD be used instead. This response is cacheable unless indicated otherwise.
More detail is available in the RFC.
See also Google's documentation on Removing my own content from Google.
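In .htaccess terms, mod_rewrite's G (gone) flag is a compact way to answer everything under the removed folder with 410 (a sketch; adjust the pattern to your own paths):
RewriteEngine On
# Respond to every request under /forums/ with 410 Gone.
RewriteRule ^forums/ - [G]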
