Summary of malformed URLs? - IIS

I am working on an IIS HTTP module whose purpose is to block various common malformed URLs that can be used to attack my site.
Are there any good references for what kinds of URLs to look out for?
I know there is the URLScan project, but I want to understand the various attack vectors.

While you are asking for a blacklist, a more valuable approach is using a whitelist, i.e., think about which URLs are valid. This will minimize the chances of missing a specific URL pattern that can be used in a malicious fashion, but that is not caught by your blacklist pattern(s).
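As a rough illustration of the whitelist idea (sketched in Python rather than the C# or C++ an IIS module would typically be written in, and with made-up patterns and limits; the point is the shape of the check, not the exact rules):
import re
from urllib.parse import urlparse, unquote

# Hypothetical whitelist: only the path shapes and query strings we expect to serve.
ALLOWED_PATH = re.compile(r"^/[A-Za-z0-9_\-/]*(?:\.(?:aspx|html|css|js|png|jpg))?$")
ALLOWED_QUERY = re.compile(r"^(?:[A-Za-z0-9_\-]+=[A-Za-z0-9_\-]*(?:&|$))*$")
MAX_URL_LENGTH = 2048

def is_allowed(raw_url):
    """Return True only if the URL matches the patterns we consider valid."""
    if len(raw_url) > MAX_URL_LENGTH:
        return False
    parts = urlparse(raw_url)
    path = unquote(parts.path)           # decode %xx escapes before checking
    if not ALLOWED_PATH.match(path):
        return False
    if parts.query and not ALLOWED_QUERY.match(parts.query):
        return False
    return True

print(is_allowed("/products/list.aspx?page=2"))        # True
print(is_allowed("/scripts/..%255c../winnt/cmd.exe"))  # False: traversal plus double encoding
Anything not explicitly matched is rejected, so a new attack encoding does not require a new rule.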

Related

Prevent indexing a domain in search engines like Google and Bing

I have a domain (e.g. domain.com) which is public for all users, and I have a secret subdomain (e.g. site1.secretdomain.com) of a second domain (here secretdomain.com) just for the site's admins.
I don't want Google or other search engines to index either the secret domain or its subdomains. Do you have any ideas? I don't think robots.txt will work, because it would affect all domains.
A not-so-foolproof solution is to remove any references to the subdomain's pages, or mark them with a nofollow directive, along with the other necessary changes in robots.txt.
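Note that robots.txt is fetched per host, so the secret subdomain can serve its own blanket Disallow without touching the public domain; an X-Robots-Tag: noindex response header is the other common lever. A minimal sketch, assuming a Flask backend purely for illustration (the framework and hostname are stand-ins):
from flask import Flask, request, Response

app = Flask(__name__)
SECRET_HOST = "site1.secretdomain.com"   # hypothetical admin subdomain

@app.route("/robots.txt")
def robots():
    # robots.txt is requested per host, so the secret subdomain can serve
    # a blanket Disallow without affecting crawling of the public domain.
    if request.host.startswith(SECRET_HOST):
        return Response("User-agent: *\nDisallow: /\n", mimetype="text/plain")
    return Response("User-agent: *\nDisallow:\n", mimetype="text/plain")

@app.after_request
def add_noindex_header(response):
    # X-Robots-Tag tells crawlers not to index or follow pages they do fetch.
    if request.host.startswith(SECRET_HOST):
        response.headers["X-Robots-Tag"] = "noindex, nofollow"
    return response
Keep in mind that a crawler blocked by robots.txt never fetches the pages and so never sees the X-Robots-Tag header, which is why the two are usually treated as alternatives rather than a combination.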
Another little more expensive, but more concrete but on a pragmatic note, would be to look into CAPTCHA or Google's ReCaptcha.
On a more theoretical note, Without much research, I guess a typical approach to the problem would be to serve a unique Cryptographic/Some form of Challenge (Computationally Expensive Problem) upon a request to and use the solution to validate a session from the user.
Even the most advanced Crawlers work with a limited Javascript execution budget; and will decide to move on to other pages once exhausted. FInd a suitable challenge, Optimize the page design to factor in for a load delay, and you have a subdomain open to all humans but not bots.

SSL with CartThrob - in-template redirect or htaccess on the basis of URL segment?

This is a broader question than I would probably ask the CartThrob folks, which is why I'm posting it here. What would the community recommend as far as SSL is concerned with CartThrob? The store functions are limited to a couple of key template groups, so my thinking was that perhaps the best way to handle it would be .htaccess rules based on the presence of those URL segments. I would like to return the user to a non-SSL connection when they are not in the store area, so a trigger might be the first segment being "basket" or "account", for example. Or what about an in-template redirect to the secure URL? Very interested to hear the community's suggestions on how best to handle SSL within a given area of an EE site. I'm interested in whatever makes the most sense to implement, while also ensuring that, for example, all assets - even those loaded with path variables - are loaded via SSL. Thanks all!
I've always used CartThrob's https_redirect tag (docs) on my checkout screens, which will rewrite your {path}, {permalink} (etc)-created URLs to use https, as well as redirect you to the https:// version of your page if necessary.
That, combined with using the protocol-agnostic style of calling scripts and stylesheets, should get you most of the way toward getting your secure icon in the browser.
(Example:)
<script src="//ajax.googleapis.com/ajax/libs/jqueryui/1.9.1/jquery-ui.min.js"></script>

Implementing HTTP or HTTPS depending on page

I want to implement HTTPS on only a selection of my web pages. I have purchased my SSL certificates, etc., and got them working. Despite this, due to speed demands I cannot afford to use HTTPS on every single page.
Instead I want my server to serve up HTTP or HTTPS depending on the page being viewed. An example where this has been done is 99designs.
The problem in slightly more detail:
When my visitors first visit my site they only have access to non-sensitive information, and therefore I want them to be presented with plain HTTP.
Then, once they log in, they are granted access to more sensitive information, e.g. profile information, which should be delivered over HTTPS.
Despite being logged in, if the user goes back to a non-sensitive page such as the homepage, then I want it delivered using HTTP.
One common solution seems to be using the .htaccess file. The problem is that my site is relatively large, meaning that this would require me to write a rule for every page (several hundred) to determine whether it should be served up using HTTP or HTTPS.
And then there is the problem of defining user generated content pages.
Please help,
Many thanks,
David
You've not mentioned anything about the architecture you are using. Assuming that SSL termination is on the web server, you should set up separate virtual hosts with completely separate and non-overlapping document trees, and for preference use a path schema which does not overlap (to avoid little accidents).
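If the switching does end up in the application rather than in web server config, the rule itself stays small: redirect based on a path prefix instead of writing one rule per page. A rough WSGI sketch in Python (the question doesn't name a backend, and the "secure" prefixes here are invented for illustration):
SECURE_PREFIXES = ("/account", "/basket", "/profile")   # hypothetical sensitive sections

class SchemeRedirect:
    """WSGI middleware: force HTTPS under SECURE_PREFIXES, plain HTTP elsewhere."""
    def __init__(self, app):
        self.app = app

    def __call__(self, environ, start_response):
        path = environ.get("PATH_INFO", "/")
        # Respect X-Forwarded-Proto when SSL is terminated upstream of the app.
        scheme = environ.get("HTTP_X_FORWARDED_PROTO",
                             environ.get("wsgi.url_scheme", "http"))
        wants_https = path.startswith(SECURE_PREFIXES)
        if wants_https != (scheme == "https"):
            target = "https" if wants_https else "http"
            host = environ.get("HTTP_HOST", "localhost")
            location = f"{target}://{host}{path}"
            if environ.get("QUERY_STRING"):
                location += "?" + environ["QUERY_STRING"]
            start_response("301 Moved Permanently", [("Location", location)])
            return [b""]
        return self.app(environ, start_response)
One caveat either way: if logged-in users are bounced back to plain HTTP, their session cookie travels unencrypted unless it is flagged Secure (in which case the HTTP pages need a separate, lower-privilege cookie).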

Is it possible to restrict a requesting domain at the application level?

I wonder how some video streaming sites can restrict videos to be played only on certain domains. More generally, how do some websites respond only to requests from certain domains?
I've looked at http://en.wikipedia.org/wiki/List_of_HTTP_header_fields and saw the Referer field, which might be used, but I understand that HTTP headers can be spoofed (can they?).
So my question is, can this be done at the application level? By application, I mean, for example, web applications deployed on a server, not a network router's operating system.
Any programming language would work for an answer. I'm just curious how this is done.
If anything's unclear, let me know. Or you can use it as an opportunity to teach me what I need to know to clearly specify the question.
HTTP headers carrying IP information are helpful (because only a small portion of them are faked), but they are not reliable. Web applications are usually built on web frameworks, which give you easy access to these headers.
Some ways to gain source information:
The originating IP address from the TCP/IP network stack itself: the problem is that this server-visible address need not match the real client's address (it could come from a company proxy, an anonymous proxy, a big ISP...).
The HTTP X-Forwarded-For header: proxies are supposed to set this header to solve the problem mentioned above, but it can also be faked, and many anonymous proxies don't set it at all.
Apart from IP source information, you can also use machine identifiers (some use the User-Agent header). Several sites, for instance, store these machine identifiers inside Flash cookies so they can re-identify a returning client and block it. But same story: this is unreliable and can be faked.
The root problem is that you need a lot of security complexity to securely identify a client (e.g. by authentication and client-side certificates). But this is a lot of effort and adds usability problems, so many sites don't do it. Most often this isn't an issue, because only a small portion of clients put in the effort to fake their identity and access the server.
The HTTP Referer header is a different thing: it shows you which page the user came from, and it is set by the browser. It is also unreliable, because its content can be spoofed and some clients don't include it at all (I remember several IE versions skipping Referer).
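To make that concrete, a small Flask sketch of reading, and deliberately distrusting, these headers (the framework, route, and allowed referrer hosts are placeholders, not something from the question):
from urllib.parse import urlparse
from flask import Flask, request, abort

app = Flask(__name__)
ALLOWED_REFERRER_HOSTS = {"example.com", "www.example.com"}   # hypothetical

@app.route("/embed/video")
def embed_video():
    # Address as seen by the TCP stack: may be a proxy, not the real user.
    client_ip = request.remote_addr
    # X-Forwarded-For, if a proxy set it; trivially spoofable by the client.
    forwarded_for = request.headers.get("X-Forwarded-For", "")
    # Referer is only sent by cooperating browsers; absent or faked otherwise.
    referrer_host = urlparse(request.headers.get("Referer", "")).netloc

    if referrer_host and referrer_host not in ALLOWED_REFERRER_HOSTS:
        abort(403)   # a soft deterrent only; scripted clients can fake Referer
    return f"serving video to {client_ip} (X-Forwarded-For: {forwarded_for or 'none'})"
This keeps casual hotlinking off other domains, but as noted above, none of these headers can be trusted against a determined client.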
These types of controls are based on the originating IP address. From the IP address, the country can be determined. Finding the IP address requires access to lower-level protocol information (e.g. from the socket).
The Referer header makes sense when you click a link from one site to another, but a typical HTTP request built with a programming library doesn't need to include it.

How to best normalize URLs

I'm creating a site that allows users to add Keyword --> URL links. I want multiple users to be able to link to the same url (exactly the same, same object instance).
So if user 1 types in "http://www.facebook.com/index.php" and user 2 types in "http://facebook.com" and user 3 types in "www.facebook.com" how do I best "convert" them to what these all resolve to: "http://www.facebook.com/"
The back end is in Python...
How does a search engine keep track of URLs? Does it keep the URL as entered and then take whatever it resolves to, or does it toss URLs that differ from what they resolve to and just care about the resolved version?
Thanks!!!
You'd resolve user 3 by fixing up invalid URLs. www.facebook.com isn't a URL, but you can guess that http:// should go on the start. An empty path part is the same as the / path, so you can be sure that needs to go on the end too. A good URL parser should be able to do this bit.
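A rough sketch of that fix-up in Python with urllib.parse (exactly which normalizations you apply is a judgment call; these are the common, safe ones):
from urllib.parse import urlsplit, urlunsplit

def fix_up(raw):
    """Normalize obvious omissions: missing scheme, empty path, case, default ports."""
    raw = raw.strip()
    if "://" not in raw:
        raw = "http://" + raw                 # guess a scheme for bare host names
    scheme, netloc, path, query, _fragment = urlsplit(raw)
    scheme, netloc = scheme.lower(), netloc.lower()
    # Drop default ports; treat an empty path as "/"; discard the fragment.
    if (scheme, netloc.rsplit(":", 1)[-1]) in (("http", "80"), ("https", "443")):
        netloc = netloc.rsplit(":", 1)[0]
    if not path:
        path = "/"
    return urlunsplit((scheme, netloc, path, query, ""))

print(fix_up("www.facebook.com"))         # http://www.facebook.com/
print(fix_up("HTTP://Facebook.com:80"))   # http://facebook.com/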
You could resolve user 2 by making an HTTP HEAD request to the URL. If it comes back with a status code of 301, you've got a permanent redirect to the real URL in the Location response header. Facebook does this to send facebook.com traffic to www.facebook.com, and it's definitely something that sites should be doing (even though in the real world many aren't). You might consider allowing other redirect status codes in the 3xx family to do the same; it's not really the right thing to do, but some sites use 302 instead of 301 for the redirect because they're a bit thick.
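A hedged sketch of that check using the requests library (assumed to be installed); 308 is the other permanent code, and whether you also honour 302/307 is the judgment call described above:
from urllib.parse import urljoin
import requests   # third-party HTTP client, assumed available

def resolve_permanent_redirects(url, max_hops=5):
    """Follow permanent (301/308) redirects via HEAD requests to the preferred URL."""
    for _ in range(max_hops):
        response = requests.head(url, allow_redirects=False, timeout=5)
        if response.status_code in (301, 308) and "Location" in response.headers:
            # Location may be relative, so resolve it against the current URL.
            url = urljoin(url, response.headers["Location"])
        else:
            break
    return url

# e.g. resolve_permanent_redirects("http://facebook.com/") should land on the
# www host, assuming Facebook still issues the 301 described above.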
If you have the time and network resources (plus more code to prevent the feature being abused to DoS you or others), you could also consider GETting the target web page and parsing it (assuming it turns out to be HTML). If there is a <link rel="canonical" href="..." /> element in the page, you should also treat that URL as being the proper one. (View Source: Stack Overflow does this.)
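A standard-library sketch of pulling that element out of fetched HTML (in practice a more forgiving parser such as lxml or BeautifulSoup is the usual choice):
from html.parser import HTMLParser

class CanonicalLinkParser(HTMLParser):
    """Collect the href of a <link rel="canonical"> element if the page declares one."""
    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "link" and (attrs.get("rel") or "").lower() == "canonical":
            self.canonical = attrs.get("href")

parser = CanonicalLinkParser()
parser.feed('<head><link rel="canonical" href="http://www.facebook.com/"></head>')
print(parser.canonical)   # http://www.facebook.com/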
However, unfortunately, user 1's case cannot be resolved. Facebook is serving a page at / and a page at /index.php, and though we can look at them and say they're the same, there is no technical method to describe that relationship. In an ideal world Facebook would include either a 301 redirect response or a <link rel="canonical" /> to tell people that / was the proper format URL to access a particular resource rather than /index.php (or vice versa). But they don't, and in fact most database-driven web sites don't do this yet either.
To get around this, some search engines(*) compare the content at different [sub]domains, and to a limited extent also different paths on the same host, and guess that they're the same if the content is sufficiently similar. Of course this is a lot of work, requires a lot of storage and processing, and is ultimately not terribly reliable.
I wouldn't really bother with much of this, beyond fixing up URLs like in the user 3 case. From your description it doesn't seem that essential that pages that “are the same” have to share actual identity, unless there's a particular use-case you haven't mentioned.
(*: well, Google anyway; more traditional ones traditionally didn't and would happily serve up multiple links for the same page, but I'd assume the other majors are doing something similar now.)
There's no way to know, other than "magic" knowledge about the particular website, that "/index.php" is the same as fetching "/".
So, your problem, as stated, is impossible.
I'd save the 3 links separately, since you can never reliably tell that they resolve to the same page. It all depends on how the server (out of our control) resolves the URL.
