How to best normalize URLs - search

I'm creating a site that allows users to add Keyword --> URL links. I want multiple users to be able to link to the same url (exactly the same, same object instance).
So if user 1 types in "http://www.facebook.com/index.php" and user 2 types in "http://facebook.com" and user 3 types in "www.facebook.com" how do I best "convert" them to what these all resolve to: "http://www.facebook.com/"
The back end is in Python...
How does a search engine keep track of URLs? Do they keep a URL and then take whatever it resolves to, or do they toss URLs that are different from what they resolve to and just care about the resolved version?
Thanks!!!

So if user 1 types in "http://www.facebook.com/index.php" and user 2 types in "http://facebook.com" and user 3 types in "www.facebook.com" how do I best "convert" them to what these all resolve to: "http://www.facebook.com/"
You'd resolve user 3 by fixing up invalid URLs. www.facebook.com isn't a URL, but you can guess that http:// should go on the start. An empty path part is the same as the / path, so you can be sure that needs to go on the end too. A good URL parser should be able to do this bit.
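Since the back end is Python, here is a minimal sketch of that fix-up step using urllib.parse from the standard library; the exact rules (guess http://, lowercase the host, treat an empty path as /) are assumptions about what you want to treat as equivalent:

from urllib.parse import urlsplit, urlunsplit

def fixup_url(raw):
    raw = raw.strip()
    # "www.facebook.com" has no scheme, so guess that http:// belongs on the front
    if "//" not in raw:
        raw = "http://" + raw
    parts = urlsplit(raw)
    scheme = parts.scheme.lower() or "http"
    netloc = parts.netloc.lower()
    path = parts.path or "/"   # an empty path means the same resource as "/"
    return urlunsplit((scheme, netloc, path, parts.query, ""))

print(fixup_url("www.facebook.com"))         # http://www.facebook.com/
print(fixup_url("http://www.facebook.com"))  # http://www.facebook.com/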
You could resolve user 2 by making an HTTP HEAD request to the URL. If it comes back with a status code of 301, you've got a permanent redirect to the real URL in the Location response header. Facebook does this to send facebook.com traffic to www.facebook.com, and it's definitely something that sites should be doing (even though in the real world many aren't). You might consider allowing other redirect status codes in the 3xx family to do the same; it's not really the right thing to do, but some sites use 302 instead of 301 for the redirect because they're a bit thick.
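A rough sketch of that HEAD check with the standard library's http.client; a real version would need redirect-loop limits, handling of relative Location headers, and better error handling:

import http.client
from urllib.parse import urlsplit

def resolve_permanent_redirect(url):
    parts = urlsplit(url)
    conn_cls = http.client.HTTPSConnection if parts.scheme == "https" else http.client.HTTPConnection
    conn = conn_cls(parts.netloc, timeout=10)
    try:
        conn.request("HEAD", parts.path or "/")
        resp = conn.getresponse()
        if resp.status == 301:   # permanent redirect: trust the Location header
            return resp.getheader("Location") or url
        return url               # anything else: keep the URL the user gave us
    finally:
        conn.close()

print(resolve_permanent_redirect("http://facebook.com/"))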
If you have the time and network resources (plus more code to prevent the feature being abused to DoS you or others), you could also consider GETting the target web page and parsing it (assuming it turns out to be HTML). If there is a <link rel="canonical" href="..." /> element in the page, you should also treat that URL as being the proper one. (View Source: Stack Overflow does this.)
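If you do go that far, pulling out the canonical link can be done with the standard library's HTMLParser; this sketch assumes you already have the page as a string and skips details like Content-Type checks and resolving relative hrefs:

from html.parser import HTMLParser

class CanonicalFinder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        # remember the first <link rel="canonical" href="..."> we see
        if tag == "link" and self.canonical is None:
            attrs = dict(attrs)
            if attrs.get("rel", "").lower() == "canonical":
                self.canonical = attrs.get("href")

finder = CanonicalFinder()
finder.feed('<head><link rel="canonical" href="http://example.com/page"></head>')
print(finder.canonical)  # http://example.com/page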
However, unfortunately, user 1's case cannot be resolved. Facebook is serving a page at / and a page at /index.php, and though we can look at them and say they're the same, there is no technical method to describe that relationship. In an ideal world Facebook would include either a 301 redirect response or a <link rel="canonical" /> to tell people that / was the proper format URL to access a particular resource rather than /index.php (or vice versa). But they don't, and in fact most database-driven web sites don't do this yet either.
To get around this, some search engines(*) compare the content at different [sub]domains, and to a limited extent also different paths on the same host, and guess that they're the same if the content is sufficiently similar. Of course this is a lot of work, requires a lot of storage and processing, and is ultimately not terribly reliable.
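As a toy illustration only (real engines use shingling/fingerprinting rather than a straight diff, and the threshold here is arbitrary):

import difflib

def probably_same_page(html_a, html_b, threshold=0.9):
    # ratio() is 1.0 for identical strings and drops as they diverge
    return difflib.SequenceMatcher(None, html_a, html_b).ratio() >= threshold

print(probably_same_page("<h1>Welcome</h1><p>Hello</p>",
                         "<h1>Welcome</h1><p>Hello!</p>"))  # True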
I wouldn't really bother with much of this, beyond fixing up URLs like in the user 3 case. From your description it doesn't seem that essential that pages that “are the same” have to share actual identity, unless there's a particular use-case you haven't mentioned.
(*: well, Google anyway; the more traditional engines didn't, and would happily serve up multiple links for the same page, but I'd assume the other majors are doing something similar now.)

There's no way to know, other than "magic" knowledge about the particular website, that "/index.php" is the same as fetching "/".
So, your problem, as stated, is impossible.

I'd save the 3 links as separate entries, since you can never reliably tell that they resolve to the same page. It all depends on how the server (which is out of our control) resolves the URL.

duplicate URLs in my page, best solution?

I have a website that writes URLs like this:
mypage.com/post/3453/post-title-name-person
In fact, what is important is the post and ID part (3453). The title I just add for SEO.
I changed some title names recently, but people can still use the old URLs to access the page, because I only use the ID to open it, so:
mypage.com/post/3453/post-title-name-person
mypage.com/post/3453/name-person
...
Will open the same page.
Is this wrong? Google Webmaster Tools tells me that I have 8765 duplicate pages. So, to try to solve this, I am redirecting the old titles to post/id/current-title, but it seems that Google doesn't understand this redirect and still reports duplicates.
Should I return a "not found" response if the title doesn't match the actual database? (But this could be a problem, because links that people have shared won't open.) Or what?
Maybe Google has not processed your redirections yet. It may take several weeks and sometimes several months to process all pages, especially if they are not revisited often. Make sure your redirects are 301 and not 302 (temporary).
That being said, there is a better method than redirections for duplicate pages: the canonical tag. If you can, implement it. There is less risk of mixing up redirections.
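For what it's worth, here is a hypothetical sketch of the redirect-to-current-slug approach written as a Flask route; the question doesn't say what the site runs on, and Post/get_post are made-up stand-ins for the real database lookup:

from dataclasses import dataclass
from flask import Flask, abort, redirect

app = Flask(__name__)

@dataclass
class Post:
    id: int
    slug: str
    title: str

def get_post(post_id):
    # stand-in for the real database query
    posts = {3453: Post(3453, "post-title-name-person", "Post title")}
    return posts.get(post_id)

@app.route("/post/<int:post_id>/<slug>")
def show_post(post_id, slug):
    post = get_post(post_id)
    if post is None:
        abort(404)
    if slug != post.slug:
        # outdated title in the URL: permanent redirect to the current one
        return redirect(f"/post/{post.id}/{post.slug}", code=301)
    # the rendered page can also emit the canonical tag, e.g.
    # <link rel="canonical" href="https://mypage.com/post/3453/post-title-name-person">
    return f"<h1>{post.title}</h1>"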
Google will only pick up your new URLs after you implement 301 redirects, e.g. through the .htaccess file. Remember that each 301 redirect should map one-to-one to the new URL. After implementing this, fetch the new URLs via Google Search Console so that Google indexes them faster.

Security concerns using robots.txt

I'm trying to prevent web search crawlers from indexing certain private pages on my web server. The instructions are to include these in the robots.txt file and place it into the root of my domain.
But I have an issue with this approach: anyone can go to www.mywebsite.com/robots.txt and see its contents:
# robots.txt for Sites
# Do Not delete this file.
User-agent: *
Disallow: /php/dontvisit.php
Disallow: /hiddenfolder/
which will tell anyone exactly which pages I don't want them to visit.
Any idea how to avoid this?
PS. Here's an example of a page that I don't want to be exposed to the public: PayPal validation page for my software license payment. The page logic will not let a dud request through, but it wastes bandwidth (for PayPal connection, as well as for validation on my server) plus it logs a connection-attempt entry into the database.
PS2. I don't know how the URL for this page got out "to the public". It is not listed anywhere except with PayPal and in the .php scripts on my server. The name of the page itself is something like /php/ipnius726.php, so it's not something simple that a crawler could just guess.
URLs are public. End of discussion. You have to assume that if you leave a URL unchanged for long enough, it'll be visited.
What you can do is:
Secure access to the functionality behind those URLs
Ask people nicely not to visit them
There are many ways to achieve number 1, but the simplest way would be with some kind of session token given to authorized users.
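A minimal sketch of number 1 in Python/Flask (the site in the question looks like PHP, so this is purely illustrative; the route name and session flag are hypothetical stand-ins for your real authorization check):

from flask import Flask, abort, session

app = Flask(__name__)
app.secret_key = "change-me"      # needed for signed session cookies

@app.route("/php/dontvisit")      # hypothetical stand-in for dontvisit.php
def private_page():
    if not session.get("authorized"):
        abort(403)                # crawlers and strangers get 403 Forbidden
    return "members-only content"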
Number 2 is achieved using robots.txt, as you mention. The big crawlers will respect the contents of that file and leave the pages listed there alone.
That's really all you can do.
You can put the stuff you want to keep both uncrawled and obscure into a subfolder. So, for instance, put the page in /hiddenfolder/aivnafgr/hfaweufi.php (where aivnafgr is the only subfolder of hiddenfolder), but just put hiddenfolder in your robots.txt.
If you put your "hidden" pages under a subdirectory, something like private, then you can just Disallow: /private without exposing the names of anything within that directory.
Another trick I've seen suggested is to create a sort of honeypot for dishonest robots by explicitly listing a file that isn't actually part of your site, just to see who requests it. Something like Disallow: /honeypot.php, and you know that any requests for honeypot.php are from a client that's scraping your robots.txt, so you can blacklist that User-Agent string or IP address.
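A rough sketch of that honeypot follow-up in Python: scan a combined-format access log for hits on the decoy path and collect the offending IPs and user agents. The log path and format are assumptions:

import re

HONEYPOT_PATH = "/honeypot.php"
# combined log format: ip - user [time] "METHOD path HTTP/x.x" status size "referer" "user-agent"
LINE_RE = re.compile(r'^(\S+) \S+ \S+ \[[^\]]+\] "(?:GET|POST|HEAD) (\S+)[^"]*" \d+ \S+ "[^"]*" "([^"]*)"')

def honeypot_hits(logfile):
    hits = []
    with open(logfile, encoding="utf-8", errors="replace") as fh:
        for line in fh:
            m = LINE_RE.match(line)
            if m and m.group(2).startswith(HONEYPOT_PATH):
                hits.append((m.group(1), m.group(3)))   # (ip, user-agent)
    return hits

for ip, ua in honeypot_hits("/var/log/apache2/access.log"):
    print(f"block candidate: {ip} ({ua})")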
You said you don’t want to rewrite your URLs (e.g., so that all disallowed URLs start with the same path segment).
Instead, you could also specify incomplete URL paths, which wouldn’t require any rewrite.
So to disallow /php/ipnius726.php, you could use the following robots.txt:
User-agent: *
Disallow: /php/ipn
This will block all URLs whose path starts with /php/ipn, for example:
http://example.com/php/ipn
http://example.com/php/ipn.html
http://example.com/php/ipn/
http://example.com/php/ipn/foo
http://example.com/php/ipnfoobar
http://example.com/php/ipnius726.php
This is to supplement David Underwood's and unor's answers (not having enough rep points, I am left with just answering the question). Recent digging shows that Google has a clause that allows them to ignore the previously respected robots.txt file, on top of other security concerns. The link is a blog post from Zac Gery explaining the new(er) policy, with some simple explanations of how to "force" the Google search engine to be nice. I realize this isn't precisely what you are looking for, but on the QA and security side I have found it to be very useful.
http://zacgery.blogspot.com/2013/01/why-robotstxt-file-is-no-longer.html

Is the url path '/#!' special or an exploit?

I am getting the path /#! requested regularly on my blog, and I was wondering why this is (as it doesn't match any URL/resource on my blog). The user agent says it's always IE7 browsers making the request, but from multiple different IP addresses. I'm trying to work out if I can ignore this or if I need to do something about it.
I specifically want to know the following:
Is it some kind of special URL for certain web browsers/web servers?
Is it connected to a specific exploit?
Can I just ignore it?
If it's relevant, the site is hosted on Windows Azure and running on MVC4.
It's a hash-bang URL. They're used by some AJAX web applications, like Facebook and Twitter. Google has some special treatment for them, to make normally uncrawlable AJAX sites crawlable.
However, if your site is not running an app that uses them, you shouldn't be seeing them. And you definitely shouldn't be seeing them on the server side, since the whole point is that everything following a # in a URL is a fragment identifier, and should be stripped off by the user agent before requesting the URL from the server.
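You can see that split with Python's urllib.parse: the fragment is separated from the path and, with a well-behaved client, never becomes part of the request:

from urllib.parse import urlsplit

parts = urlsplit("http://example.com/#!")
print(parts.path)      # "/"
print(parts.fragment)  # "!"
# a well-behaved client sends only "GET / HTTP/1.1" for this URL;
# the "#!" part never reaches the server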
Edit: If I had to guess what's requesting such URLs, I'd say it might be some buggy bot. The fact that it's apparently pretending to be IE suggests that it might not be up to anything good; maybe it's a spambot of some sort. Anyway, the requests as such are most likely harmless, and you can ignore them. If it makes you feel better, you could always set up a rewrite rule to explicitly reject them, something like:
RewriteRule \x23 - [F]
This should reject any requests for URLs containing the # character with a 403 Forbidden error.
Well, # is a valid anchor that just means "the page". You can also make a '!' anchor, e.g.
<!-- some html here -->
<a href="#!">Click me!</a>
<!-- lots more html -->
<div id="!">
Wooaaaah!
</div>
So my guess is that you can safely ignore it... but that's just a guess ;)

SSL with CartThrob - in-template redirect or htaccess on the basis of URL segment?

This is a broader question than I would probably ask of the CartThrob folks, which is why I'm posting it here. What would the community recommend as far as SSL is concerned with CartThrob? The store functions are limited to a couple of key template groups, so my thinking was that perhaps the best way to handle it would be htaccess, based on the presence of those URL segments. I would like to return the user to a non-SSL connection when they are not in the store area, so a trigger might be the first segment being "basket" or "account", for example. Or what about an in-template redirect to the secure URL?
Very interested to hear the community's suggestions on how best to handle SSL within a given area of an EE site. I'm interested in whatever makes the most sense to implement, while also ensuring that, for example, all assets - even those loaded with path variables - are loaded via SSL. Thanks all!
I've always used CartThrob's https_redirect tag (docs) on my checkout screens, which will rewrite your {path}, {permalink} (etc)-created URLs to use https, as well as redirect you to the https:// version of your page if necessary.
That, combined with using the protocol-agnostic style of calling scripts and stylesheets, should get you most of the way toward getting your secure icon in the browser.
(Example:)
<script src="//ajax.googleapis.com/ajax/libs/jqueryui/1.9.1/jquery-ui.min.js"></script>

OWASP TOP10 - #10 Unvalidated Redirects and Forwards

I read many of the articles on this topic, including the OWASP page and the Google blog article about open redirects...
I also found this question on open redirects here on Stack Overflow, but it's a different one.
I know why I should not redirect ... this makes total sense to me.
But what I really don't understand: where exactly is the difference between redirecting and putting this in a normal <a href> link?
Maybe some users are looking at the status bar, but I think most of them are not really looking at the status bar when they click a link.
Is this really the only reason?
For example, in this article they wrote:
<a href="http://bank.example.com/redirect?url=http://attacker.example.net">Click here to log in</a>
The user may assume that the link is safe since the URL starts with their trusted bank, bank.example.com. However, the user will then be redirected to the attacker's web site (attacker.example.net) which the attacker may have made to appear very similar to bank.example.com. The user may then unwittingly enter credentials into the attacker's web page and compromise their bank account. A Java servlet should never redirect a user to a URL without verifying that the redirect address is a trusted site.
So, if you have something like a guestbook where the user can put a link to their homepage, then the only difference is that the link is not redirected, but it still goes to the evil webpage.
Am I seeing this problem right?
From my understanding, it is not the redirect itself that is the problem. The main problem here is allowing a redirect (where the target is potentially controllable by the user) that contains an absolute URL.
The fact that the URL is absolute (meaning it begins with http://host/etc.) means that you are unintentionally allowing cross-domain redirects. This is very similar to classic XSS vulnerabilities, whereby JavaScript can be reflected to make cross-domain calls (and leak your domain's information).
So, as I understand it, the way to fix most of these problems is to make sure that any redirect (on the server) is done relative to the root. Then there is no way for the user-controlled query string value to go somewhere else.
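A sketch of that relative-to-root check in Python; the fallback default and the exact rules are assumptions rather than a drop-in solution:

from urllib.parse import urlsplit

def safe_redirect_target(target, default="/"):
    parts = urlsplit(target or "")
    # reject absolute URLs (http://attacker.example.net/...) and
    # scheme-relative URLs (//attacker.example.net/...)
    if parts.scheme or parts.netloc:
        return default
    # require a root-relative path, and guard against "////host" style tricks
    if not parts.path.startswith("/") or parts.path.startswith("//"):
        return default
    return target

print(safe_redirect_target("/account/settings"))             # kept as-is
print(safe_redirect_target("http://attacker.example.net/"))  # falls back to "/"
print(safe_redirect_target("//attacker.example.net/"))       # falls back to "/"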
Does that answer your question or just create more?
The main problem is that it's possible for an attacker to make the URL appear trustworthy, as it's actually a URL to a web site the victim trusts, i.e. bank.example.com.
The redirect target does not need to be as obvious as in the example. Actually, the attacker will probably use further techniques to trick both the user and possibly even the web application, if necessary with special encodings, parameter pollution, and other tricks to spoof a legitimate URL.
So even if a victim is security-conscious enough to check a URL before clicking a link or otherwise requesting its resource, all they can verify is that the URL points to the trustworthy web site bank.example.com. And that alone is too often enough.
