Is it possible to add more than one '?' in a URL? - .htaccess

I need to rewrite a URL.
My actual URL:
http://www.domain.com/page.php?catName/ArticleName....?/&ca=7&prod=44&artId=446
I need to rewrite it like this:
http://www.domain.com/catID-catName/proID-prodName/artID-ArticleName....?/page.html

Yes it is possible. By the way, your modified URL only has one '?'.
From the [RFC][1] specifying the syntax of URIs and URLs, the query is the part of the URL that follows the http://www.example.com/path or http://www.example.com (the path is optional) component. Note that the first "?" character marks the start of the query section of the URL; any "?" after it is ordinary query data.
The crucial sentence in section 3.4 of the RFC is
The characters slash ("/") and question mark ("?") may represent data within the query component.
Here is the pertinent section of the RFC governing URI syntax.
3.4 Query
The query component contains non-hierarchical data that, along with
data in the path component (Section 3.3), serves to identify a
resource within the scope of the URI's scheme and naming authority
(if any). The query component is indicated by the first question
mark ("?") character and terminated by a number sign ("#") character
or by the end of the URI.
query = *( pchar / "/" / "?" )
The characters slash ("/") and question mark ("?") may represent data
within the query component. Beware that some older, erroneous
implementations may not handle such data correctly when it is used as
the base URI for relative references (Section 5.1), apparently
because they fail to distinguish query data from path data when
looking for hierarchical separators. However, as query components
are often used to carry identifying information in the form of
"key=value" pairs and one frequently used value is a reference to
another URI, it is sometimes better for usability to avoid percent-
encoding those characters.
[1]: http://tools.ietf.org/html/rfc3986#section-3
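To see how a standard parser treats a second '?', here is a minimal sketch using Python's urllib.parse. The URL is a hypothetical one modeled on the question, not the asker's exact address. It only illustrates that the query starts at the first '?' and that later '?' and '/' characters are kept as query data.

```python
from urllib.parse import urlsplit

# Hypothetical URL with a second "?" and a "/" inside the query.
url = "http://www.example.com/page.php?catName/ArticleName?/&ca=7&prod=44&artId=446"

parts = urlsplit(url)
path = parts.path    # everything between the host and the first "?"
query = parts.query  # everything after the first "?", up to "#" or the end

print(path)
print(query)
```

How the application then interprets that query string is up to the script that receives it.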

Related

Do Content Security Policies support a wildcard suffix for the top-level domain?

I have a Content Security Policy img-src I am unable to configure. The URL being requested has a dot country code suffix.
Example:
In Hong Kong = www.example.com.hk
In Thailand = www.example.com.th
Is there a way to add a wildcard to the end of a path? All the documentation I can find uses wildcards as a prefix.
I have tried www.example.com.* but it's an invalid source.
From MDN:
Internet hosts by name or IP address, as well as an optional URL scheme and/or port number. The site's address may include an optional leading wildcard (the asterisk character, '*'), and you may use a wildcard (again, '*') as the port number, indicating that all legal ports are valid for the source.
Nowhere does it say a wildcard can be used anywhere else.
While for you it may seem convenient in this specific context, in the bigger picture it would be a strange and probably dangerous thing to allow wildcarding the top-level domain. Each top-level domain is separately authoritative for the domains below it. There is no rule that says that www.example.com.hk and www.example.com.th must be owned by the same entity, and the same goes for the same-origin policy. If that happens to be the case, it should be seen as coincidental.
You can use csp-evaluator to try out CSP values and see what they mean.
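Since a trailing wildcard is not accepted, the practical alternative is to list each country domain explicitly. The sketch below is a simplified illustration, not the full CSP grammar: it ignores schemes, ports and paths, and the function names are made up for this example. It only accepts a wildcard as the leftmost label and builds an img-src directive from an explicit host list.

```python
import re

# Simplified host pattern: an optional leading "*." followed by dot-separated labels.
_HOST_RE = re.compile(r"^(\*\.)?[A-Za-z0-9-]+(\.[A-Za-z0-9-]+)*$")

def is_valid_host(host):
    """Return True if the wildcard, if any, appears only as the leftmost label."""
    return host == "*" or bool(_HOST_RE.match(host))

def build_img_src(hosts):
    """Build an img-src directive, rejecting hosts with a misplaced wildcard."""
    bad = [h for h in hosts if not is_valid_host(h)]
    if bad:
        raise ValueError("invalid host source(s): " + ", ".join(bad))
    return "img-src " + " ".join(hosts)

directive = build_img_src(["www.example.com.hk", "www.example.com.th"])
print(directive)
```

The resulting directive grows with every country domain you add, but each entry is a host you have explicitly decided to trust.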

How do I create a rule to block all user agents with ModSecurity V3?

I want to add a custom ModSecurity (V3) rule that can block all user agents, and allow me to whitelist certain User Agents from a file.
If this is possible, if someone could share the rule with me, that would be great. I cannot seem to figure out the rule to do this.
Thanks!
What you want to do is a bit dangerous, but I'll try to help.
I think CRS rule 913100 would be a good starting point for you.
It's a bit complex if you are new to ModSecurity and SecLang, so in short, this would be a possible solution. Create a rule for your WAF, like this:
SecRule REQUEST_HEADERS:User-Agent "!@pmFromFile allowed-user-agents.data" \
    "id:9013100,\
    phase:1,\
    deny,\
    t:none,\
    msg:'User-Agent not found in the allow list',\
    logdata:'Matched Data: illegal UA found within %{MATCHED_VAR_NAME}: %{MATCHED_VAR}'"
Please note that you can choose any id you want for your rule, but there is a list of reserved id ranges:
https://coreruleset.org/docs/rules/ruleid/#id-reservations
It's highly recommended that you choose an appropriate one to avoid collisions with other rules. 9013100 would be a good choice, and it shows which rule this one is derived from.
Then you have to create a file with the list of your allowed user agents. Note that you have to place that file in the same directory as the rule's conf file. The file must be named (as you can see above) allowed-user-agents.data. Put one agent per line. You can also add comments with # at the beginning of a line; see the CRS data files for examples.
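As a rough illustration of the data file format described above (one phrase per line, lines starting with # treated as comments), here is a small Python sketch. The file content and the function name are hypothetical, and this is not how ModSecurity itself reads the file; it only shows which lines end up as phrases.

```python
def load_phrases(text):
    """Return the phrases from a data file, skipping blank lines and # comments."""
    phrases = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        phrases.append(line)
    return phrases

# Hypothetical content of allowed-user-agents.data
sample = """\
# monitoring tools
my-user-agent

# internal health checks
health-checker/1.0
"""

phrases = load_phrases(sample)
print(phrases)
```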
How does this rule work?
SecRule is a token which tells the engine that this is a rule. REQUEST_HEADERS is a collection (a special variable) that the engine fills from the HTTP request. The : after the name indicates that you want to inspect only the named header, namely User-Agent.
The next block is the operator. As the documentation says, @pmFromFile "Performs a case-insensitive match of the provided phrases against the desired input value.". This is exactly what you need. There is a ! sign before the operator. This inverts the operator's behavior, so it will be TRUE if the User-Agent isn't in the file.
The next section is the list of actions. id is mandatory; it identifies the rule. phase:1 is optional, but setting a phase is strongly recommended. For more information, see the reference. deny is a disruptive action; it terminates the request immediately. msg appends a message to the log in every case. logdata shows detailed information about the rule result.
Why is this a little dangerous
As you can see in the documentation of the @pmFromFile operator, it uses patterns. This means you do not have to list the exact User-Agent names; it's enough to add a pattern like "curl" or "mozilla". But be careful: a pattern that is too broad can lead to false negatives, which in this case means an attacker can bypass your rule simply by including the pattern in the User-Agent.
Suppose you put the pattern my-user-agent into the data file. Now if someone just includes this pattern in their User-Agent, the rule won't match.
Handling whitelists this way is generally dangerous (in some special contexts, like this one), because they are easy to bypass.
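The bypass can be illustrated with a simplified model of the negated phrase match. This Python sketch assumes the operator behaves like a case-insensitive substring search over the allowed phrases; the real engine is implemented differently, and the function name and sample User-Agent strings are invented for this example.

```python
def is_blocked(user_agent, allowed_phrases):
    """Model of the negated phrase match: block unless some allowed phrase occurs in the UA."""
    ua = user_agent.lower()
    return not any(phrase.lower() in ua for phrase in allowed_phrases)

allowed = ["my-user-agent"]

# A scanner with its default User-Agent is blocked...
blocked_scanner = is_blocked("evil-scanner/1.0", allowed)

# ...but the same scanner passes once it appends an allowed phrase.
blocked_spoofed = is_blocked("evil-scanner/1.0 my-user-agent", allowed)

print(blocked_scanner, blocked_spoofed)
```

Because the header is fully controlled by the client, an allow list like this only filters out clients that do not bother to change their User-Agent.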

& Ampersand in URL

I am trying to figure out how to use the ampersand symbol in an url.
Having seen it here: http://www.indeed.co.uk/B&Q-jobs I wish to do something similar.
Not exactly sure what the server is going to call when the url is accessed.
Is there a way to grab a request like this with .htaccess and rewrite to a specific file?
Thanks for your help!
Ampersands are commonly used in a query string. Query strings are one or more variables at the end of the URL that the page uses to render content, track information, etc. Query strings typically look something like this:
http://www.website.com/index.php?variable=1&variable=2
Notice how the first special character in the URL after the file extension is a ?. This designates the start of the query string.
In your example, there is no ?, so no query string is started and the & is simply part of the path. RFC 3986 allows an ampersand in a path segment, so the link you provided is technically valid, but it is unusual, because the conventional purpose of an ampersand is to link variables in a query string together.
What is likely happening on the server is a rewrite. A rewrite tells the server to show a specific file based on a pattern or match. For example, an .htaccess rewrite rule that may work with your example could be:
RewriteEngine on
RewriteRule ^/?B&Q-(.*)$ /scripts/b-q.php?variable=$1 [NC,L]
This rule would find any URLs starting with http://www.indeed.co.uk/B&Q- and show the content of http://www.indeed.co.uk/scripts/b-q.php?variable=jobs instead.
For more information about Apache rewrite rules, check out their official documentation.
Lastly, I would recommend against using ampersands in URLs, even when doing rewrites, unless they are part of the query string. The conventional purpose of an ampersand in a URL is to string variables together in a query string. Using it outside of that purpose is unconventional and may cause confusion in the future.
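The difference between the two URLs discussed above can be checked with a generic URL parser. This is a minimal Python sketch using urllib.parse; it says nothing about how the site in question is actually implemented.

```python
from urllib.parse import urlsplit, parse_qs

with_query = urlsplit("http://www.website.com/index.php?variable=1&variable=2")
no_query = urlsplit("http://www.indeed.co.uk/B&Q-jobs")

# In a query string, "&" separates key=value pairs; repeated keys are collected in a list.
params = parse_qs(with_query.query)

print(params)
print(no_query.path)   # without a "?", the "&" stays in the path
print(no_query.query)  # and the query is empty
```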
A URI like /B&Q-jobs may get sent to the server percent-encoded, like this: /B%26Q-jobs. However, by the time it passes through the rewrite engine, the URI has already been decoded, so you want to match against the literal & character:
RewriteRule ^/?B&Q-jobs$ /a/specific/file.html [L]
This makes it so when someone requests /B&Q-jobs, they actually get served the content at /a/specific/file.html.
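The decode-then-match order can be imitated outside Apache. The following Python sketch is only an approximation of what the rewrite engine does: it percent-decodes the path and then applies the same regular expression as the earlier rule, using the hypothetical target script from that example.

```python
import re
from urllib.parse import unquote

encoded_path = "/B%26Q-jobs"
decoded_path = unquote(encoded_path)  # "%26" becomes "&"

# Same pattern as the RewriteRule; re.IGNORECASE plays the role of the NC flag.
pattern = re.compile(r"^/?B&Q-(.*)$", re.IGNORECASE)

match = pattern.match(decoded_path)
target = f"/scripts/b-q.php?variable={match.group(1)}" if match else None

print(decoded_path)
print(target)
```

Matching the pattern against the still-encoded path would fail, which is why the rule is written with a literal & rather than %26.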

Mod Rewrite Rule for Dynamic URL - Is this possible?

I've given myself a headache trying to figure out if this can be done. I have a forum that was recently migrated, leaving thousands of broken dynamic links.
A typical URL looks like this:
http://domain.com/Forum_Name/b10001/25/
('b10001' refers to the forum ID number and the last number refers to the page number.)
The new URL is formatted like this:
http://domain.com/forums/Forum_Name.10001/
(No page number. Also, notice the 'b' is no longer in front of the ID number.)
Is there a rewrite rule that can achieve this?
I'm not a rewriter, but following what I've read here, something like this should work:
RewriteRule ^([A-Za-z0-9_-]+)/b([0-9]+)(/[0-9]+)?/?$ forums/$1.$2/ [NC,L]
^([A-Za-z0-9_-]+) says "begins with a string of letters, digits, underscores or hyphens", then there's the /b constant, followed by ([0-9]+) (one or more digits, captured as a whole), and then an optional / with one or more digits (the page number, (/[0-9]+)?), and lastly, it ends with an optional slash (/?$).
If the URL matches that pattern, then it's rewritten to forums/$1.$2/. The dot needs no escaping in the substitution (it is only a wildcard inside the pattern), $1 is the first match of the pattern (that first string, which is the forum name), and $2 is the second match, namely, the number after the b.
Finally, NC means the pattern is case-insensitive, and L is "last", so no other rules are processed. I think that is mostly up to you, just read the linked article and pick the flags you need :)
Edit: corrected pattern checking with http://htaccess.madewithlove.be/
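A pattern like this can also be checked offline before it goes into .htaccess. The Python sketch below applies an equivalent regular expression to the example URL from the question. It approximates mod_rewrite's matching rather than reproducing the engine, and the character class for forum names is an assumption based on the single example given. It also shows why the + belongs inside the capturing group.

```python
import re

# Forum name, "/b" + numeric id, optional page number, optional trailing slash.
OLD_URL = re.compile(r"^([A-Za-z0-9_-]+)/b([0-9]+)(/[0-9]+)?/?$", re.IGNORECASE)

def rewrite(path):
    """Map an old forum path to the new format, or return None if it does not match."""
    m = OLD_URL.match(path)
    if m is None:
        return None
    return f"forums/{m.group(1)}.{m.group(2)}/"

new_path = rewrite("Forum_Name/b10001/25/")
print(new_path)

# With the + inside the group, the whole id is captured.
whole_id = re.match(r"b([0-9]+)", "b10042").group(1)
# With the + outside the group, only the last repetition (one digit) is kept.
last_digit = re.match(r"b([0-9])+", "b10042").group(1)
print(whole_id, last_digit)
```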
I think what you're looking for is
RewriteRule ^([a-zA-Z0-9_]+)/b([0-9]+)/.*$ forums/$1.$2/
Make sure the contents of the [] parts match the format you're using for forum names and ids.
For parameters, you probably want R=301 to force a permanent redirect.

Is it possible with canonical URL for this pattern in htaccess: /a/*/id/uniqueid?

A big problem is that I am not a programmer….! So I need to solve this with means within my own competence… I would be very happy for help!
I have an issue with a lot of duplicated URLs in the Google index and there are strong signs that it is causing SEO problems.
I don’t have duplicate links on the site itself, but the way it was once set up, the system allows all sorts of variations in the URL for certain pages. As long as it has a specific article ID, the same content will be presented under an infinite number of URLs.
I guess the duplicates in Google's index have been growing over a long time and are due to incorrect links from other sites that link to mine. The problem is that the system has accepted the variations.
Here are examples of variations that exists in the Google index:
site.com/a/Cow_Cat/id/5272
site.com/a/cow_cat/id/5272
site.com/a/cow…cat/id/5272
site.com/a/cowcat/id/5272
site.com/a/bird/id/5272
The first URL with mixed case is the one used site-wide and for now I have to live with it; it would take too long to change everything to lower case. I cannot make a manual effort via htaccess as there is a total of 300.000 articles. I believe there are tens of thousands that have one or more duplicates.
My question is this:
Is it possible to create rules for canonical URLs in htaccess in order to make the above URLs to be handled as one as well as for the rest of the 300.000?
I e, is there a way to say that all URLs having
/a/*/id/uniqueid
should be seen as one = based only on the unique ID and not give any regard to the text expressed with the “*”?
My hope is that it would be possible to say that a certain pattern like above should only be differentiated by the last unique segment.
If it is not possible in htaccess, how would it be done with link rel="canonical" on each page, can the code include wildcards?
I should add that the majority of the duplicates are caused by incoming links being lower case where the site itself is using a mix. Would it be OK to assign a canonical URL only with lower case although the site itself is basically always using a mix of lower/upper case?
If this is possible, I would be very happy to be helped with how to do it!!!!
Jonas
Hi Michael! I am not an expert but this is how I think it could be done:
1) My problem is that the URLs have mixed cases and I cannot change that now.
2) If it is OK for the searchengines, it would be fine for me to make the canonical URL identical to the actual URLs with the difference that it was all lower case, that would solve approx 90% of the duplicates. I e this would be the used URL: site.com/a/Cow_Cat/id/5272 and this would be the canonical: site.com/a/cow_cat/id/5272. As I understand, that would be good SEO...or...?
My idea was NOT to change the address browser address bar (i e using 301 redirect) but rather just telling the search engines which URLs that are duplicates, as I understand, that can be done by defining a canonical URL either in htaccess (as a pattern - I hope) or as a tag on each page.
3) IF it would be possible to find a wildcard solution...I am not sure if this is possible at all, but that would mean it was possible to NOT assign a specific canonical URL but rather a "group pattern", i e "Please search engine, see all URLs with this pattern - having the unique identifier at the end - as if they are one and the same URL, you SE, decide which one you prefer": /a/*/id/uniqueid
Would that work? It will only work in htaccess if canonical URLs can be defined as a group where the group is defined as a pattern with a defined part as the unique id.
Is it possible when adding a tag for each page to say that "all URLs containing this unique id should be treated the same"? If that would work it would look something similar to this
link rel="canonical" /a/*/id/5272
I don't know if this syntax with a wildcard exists but it would be nice : )
My advice would be to use 301 redirects, with URL rewriting. Ask your webmaster to place this in your Apache config or virtual host config:
RewriteMap lc int:tolower
Then inside your .htaccess file you can use the map ${lc:$1} to convert matches to lower case. Here, the $1 part is a match (backreference from brackets in a regex in the RewriteRule) and the ${lc: } part is just how you apply the lc (lowercase) function set up earlier. Here is an example of what you might want in your .htaccess file:
# Match a URL containing any uppercase characters
RewriteCond %{REQUEST_URI} [A-Z]
# Redirect to the lowercase version of the URL
RewriteRule (.*) /${lc:$1} [L,R=301]
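The effect of the condition and rule above can be previewed with a few lines of Python. This is only a model of the decision (redirect when the path contains an uppercase letter), with a made-up function name; it does not involve Apache.

```python
import re

def lowercase_redirect(path):
    """Return the lowercase redirect target, or None if no redirect is needed."""
    if re.search(r"[A-Z]", path):
        return path.lower()
    return None

mixed = lowercase_redirect("/a/Cow_Cat/id/5272")
already_lower = lowercase_redirect("/a/cow_cat/id/5272")

print(mixed)
print(already_lower)
```

Returning None for an already lowercase path mirrors the RewriteCond: without it, the rule would redirect lowercase URLs to themselves in a loop.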
As for matching the IDs, presuming your examples mean "always end with the ID" you could use a regex like:
^(.+/)(\d+)$
The first match (brackets) gets everything up to and including the forward slash before the ID, and the second part grabs the ID. We can then use it to point to a single, specific URL (like canonical, but with a 301).
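To check that the regex picks out the same ID from every variation, it can be tried against some of the example URLs from the question. This Python sketch assumes, as the answer does, that the path always ends with the numeric ID; the function name is invented for the example.

```python
import re

# Group 1: everything up to and including the last slash. Group 2: the trailing ID.
ID_PATTERN = re.compile(r"^(.+/)(\d+)$")

def article_id(path):
    """Return the trailing numeric ID of a path, or None if there is none."""
    m = ID_PATTERN.match(path)
    return m.group(2) if m else None

variants = [
    "/a/Cow_Cat/id/5272",
    "/a/cow_cat/id/5272",
    "/a/cowcat/id/5272",
    "/a/bird/id/5272",
]

ids = {article_id(v) for v in variants}
print(ids)
```

All variations collapse to one ID, which is what makes it possible to point them at a single URL.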
If you do just want to use canonical tags, then you'll have to say what you're using code wise, but an example I use (so as not add tags to hundreds of individual pages, for instance) in PHP would be:
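// Prefer the rewritten URL when one is set; otherwise fall back to the request URI.
if (!empty($_SERVER["REDIRECT_URL"])) {
    $canonicalUrl = $_SERVER["SERVER_NAME"] . $_SERVER["REDIRECT_URL"];
} else if (!empty($_SERVER["REQUEST_URI"])) {
    // Strip the query string (everything from the first "?" onward).
    $canonicalUrl = $_SERVER["SERVER_NAME"] . preg_replace('/^([^?]+)\?.*$/', "$1", $_SERVER['REQUEST_URI']);
}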
Here, the redirect URL is used if it's available, and if not, the request URI is used. This code strips off the query string (the ?something=true part in http://www.mysite.com/a/blah/12345/?something=true). Of course you can add to this code to specify a custom path, not just taking off the query string, by playing with the regex.
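The query-stripping step can be tried in isolation. The sketch below is a Python port of the same logic, written only to make the regex easy to test; the function and parameter names are assumptions for this example and stand in for the PHP server variables.

```python
import re

def canonical_url(server_name, request_uri, redirect_url=""):
    """Build a canonical URL: prefer the redirect URL, else the request URI without its query."""
    if redirect_url:
        return server_name + redirect_url
    # Keep everything before the first "?" and drop the rest.
    return server_name + re.sub(r"^([^?]+)\?.*$", r"\1", request_uri)

stripped = canonical_url("www.mysite.com", "/a/blah/12345/?something=true")
unchanged = canonical_url("www.mysite.com", "/a/blah/12345/")

print(stripped)
print(unchanged)
```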

Resources