How to extract protocols (http)[s] to check url using shell script

I tried the following:
A=https://xyz.site
echo -e ${A//:*}
Result: https
Please explain how the ${A//:*} term yields https or http and what the concept behind it is. Share an article or PDF if possible.
For the World Wide Web prefix [www]
It's pretty simple to extract this one:
A=www.google.com
echo -e ${A::3}
Result: www

${parameter:offset:length}: this is referred to as Substring Expansion. In your example ${A::3} means ${A:0:3} and returns the first 3 characters of the variable A.
${parameter/pattern/string}: this notation replaces the first match of pattern with string. If pattern begins with /, all matches of pattern are replaced with string. In your example ${A//:*} means ${A//:*/}. The pattern :* is a glob (a colon followed by any characters), so it matches everything from the first colon to the end of the value, and that match is replaced with an empty string, leaving https.
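The two expansions described above can be tried side by side. This is a minimal sketch, assuming bash; the variable names proto_all, proto_suffix, prefix and prefix_explicit are made up for illustration, and ${A%%:*} (remove the longest suffix matching :*) is shown only as a common alternative that is not part of the question.

```shell
#!/usr/bin/env bash
A=https://xyz.site

# Pattern substitution: delete every match of the glob ":*".
# The glob matches from the first colon to the end of the value.
proto_all=${A//:*}

# Alternative (not from the question): strip the longest suffix matching ":*".
proto_suffix=${A%%:*}

B=www.google.com

# Substring expansion: offset defaults to 0 when omitted.
prefix=${B::3}
prefix_explicit=${B:0:3}

printf '%s %s %s %s\n' "$proto_all" "$proto_suffix" "$prefix" "$prefix_explicit"
# prints: https https www www
```

Note that ${B::3} blindly takes three characters, so it only "extracts www" when the value really starts with www.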

Related

RewriteRule cuts off part of a variable name

I have a RewriteRule inside my .htaccess file:
RewriteRule ^([a-zA-Z-/]{2})/([a-zA-Z-/]+)/club/?([a-zA-Z0-9-]+)?/?$ /incl/pages/seo.club.php?state=$1&county=$2&title=$3 [NC,L]
For most cases it works fine, however, if the title starts with the word "club" that word is cut off.
For example, if the name is fast-cars-club the $_GET['title'] will be unchanged, as desired, however if the slug is club-of-fast-cars the $_GET['title'] will output -of-fast-cars
In the following URL:
mysite.com/tx/travis/club/fast-cars-club
$_GET['title'] == 'fast-cars-club'
But in this URL:
mysite.com/tx/travis/club/club-fast-cars
$_GET['title'] == '-fast-cars'
What am I missing?

Your rule is too broad, so it can match strings in multiple different ways. The way that you were hoping it would match isn't necessarily the one that the regular expression engine will actually process.
First, let's break down your pattern ^([a-zA-Z-/]{2})/([a-zA-Z-/]+)/club/?([a-zA-Z0-9-]+)?/?$ into the parts the engine will process:
^ start of string
[a-zA-Z-/] a lower-case letter, an upper-case letter, a hyphen - or a slash /
([a-zA-Z-/]{2}) the above must match exactly 2 characters, which will be captured as $1
/ a literal slash, not optional, not captured
([a-zA-Z-/]+) the same set of characters as earlier; this time required to match one or more times (+); captured as $2
/club the literal string /club, not optional, not captured
/? a literal slash, optional (specifically, ? means must occur zero or one times)
[a-zA-Z0-9-] a lower-case letter, an upper-case letter, a digit, or a hyphen -
([a-zA-Z0-9-]+) the above must match one or more times; captured as $3
([a-zA-Z0-9-]+)? the above capture group as a whole is optional
/? a literal slash, optional
$ end of string
Next, look at how this matches a URL, starting with the one which works how you hoped (tx/travis/club/fast-cars-club, since the mysite.com/ is processed separately):
the ^ indicates that we can't throw anything away at the start of the string
tx matches ([a-zA-Z-/]{2}) and goes into $1
/ matches
([a-zA-Z-/]+) could match the whole of travis/club/fast-cars-club but this leaves nothing for the rest of the pattern to match.
The regex engine now applies "back-tracking": it tries shorter matches until it finds one that matches more of the pattern. In this case, it finds that if it takes just travis and puts it in $2, it can match the mandatory /club which comes next
/club is followed by /, so /? matches
fast-cars-club matches [a-zA-Z0-9-]+, so is captured into $3
we've used the whole input string, so $ succeeds
Now look at the "misbehaving" string, tx/travis/club/club-fast-cars:
the ^ indicates that we can't throw anything away at the start of the string
tx matches ([a-zA-Z-/]{2}) and goes into $1
/ matches
([a-zA-Z-/]+) could match the whole of travis/club/club-fast-cars but this leaves nothing for the rest of the pattern to match.
While "back-tracking", the regex engine tries putting travis/club into $2; this is followed by another /club, so the match succeeds
there is no following /, but that's fine: /? can match zero occurrences
the remainder of the string, -fast-cars matches [a-zA-Z0-9-]+, so is captured into $3
we've used the whole input string, so $ succeeds
This behaviour of "greediness" and "back-tracking" is key to understanding complex regular expressions, but most of the time the solution is simply to make the regular expression less complex and more specific.
Only you know the full rules you want to specify, but as a starting point, let's make everything mandatory:
exactly two letters (the state) [a-zA-Z]{2}
/
one or more letters or hyphens (the county) [a-zA-Z-]+
/
the literal word club
/
one or more letters or hyphens (the title) [a-zA-Z-]+
/
Adding parentheses to capture the three parts gives ^([a-zA-Z]{2})/([a-zA-Z-]+)/club/([a-zA-Z-]+)/$
Now we can decide to make some parts optional, remembering that the more we make optional, the more ways there might be to re-interpret a URL.
We can probably safely make the trailing / optional. Alternatively, we can have a separate rule that matches any URL without a trailing / and redirects to a URL with it added on (this is quite common to allow both forms but reduce the number of duplicate URLs in search engines).
If we wanted to allow mysite.com/tx/travis/ in addition to mysite.com/tx/travis/club/club-fast-cars/ we could make the whole /club/([a-zA-Z-]+) section optional: ^([a-zA-Z]{2})/([a-zA-Z-]+)(/club/([a-zA-Z-]+))?/$ Note that the extra parentheses capture an extra variable, so what was $3 will now be $4.
Or maybe we want to allow mysite.com/tx/travis/club/, in which case we would make /([a-zA-Z-]+) optional - note that we want to include the / in the optional part, even though we don't want to capture it. That gives ^([a-zA-Z]{2})/([a-zA-Z-]+)/club(/([a-zA-Z-]+))?/$
The two things which you had, and which we almost certainly don't want, are:
Allowing / inside any of the character ranges; keep it for separating components only unless you have a really good reason to allow it elsewhere.
Making / optional in the middle; as we saw, this just leads to multiple ways of matching the same string, and makes the whole thing more complicated than it needs to be.
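The difference between the broad and the stricter pattern can be explored outside Apache. The sketch below is an approximation under stated assumptions: it uses bash's [[ =~ ]] operator, which applies POSIX extended regular expressions rather than the PCRE engine mod_rewrite uses, and it writes the character class as [a-zA-Z/-] (hyphen last) to keep the bracket expression unambiguous in POSIX syntax. The helper name show_title and the variables broad and strict are hypothetical.

```shell
#!/usr/bin/env bash
# Pattern from the question (character class reordered, see above).
broad='^([a-zA-Z/-]{2})/([a-zA-Z/-]+)/club/?([a-zA-Z0-9-]+)?/?$'
# Stricter pattern from the answer, with only the trailing slash optional.
strict='^([a-zA-Z]{2})/([a-zA-Z-]+)/club/([a-zA-Z-]+)/?$'

# Hypothetical helper: print the third capture group (the title) or "no match".
show_title() {
  local re=$1 path=$2
  if [[ $path =~ $re ]]; then
    printf '%s\n' "${BASH_REMATCH[3]}"
  else
    printf 'no match\n'
  fi
}

# Per the walk-through above, the broad pattern is expected to lose the
# leading "club" here, because $2 can swallow "travis/club".
show_title "$broad" 'tx/travis/club/club-fast-cars'

# The strict pattern has only one way to match, so the title stays intact.
show_title "$strict" 'tx/travis/club/club-fast-cars'   # club-fast-cars
show_title "$strict" 'tx/travis/club/fast-cars-club'   # fast-cars-club
```

Because the strict pattern allows no slash inside a component and makes /club/ mandatory, there is exactly one way to split the path, which is the point the answer makes about specificity.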

Extract/Parse all subdomains and domains from any file content [duplicate]

I'm trying to write a regex for grep that matches only valid domains.
My version works pretty well but also matches the following invalid domain:
#subdom..dom.ext
Here is my regex :
echo "#dom.ext" | grep "^#[[:alnum:]]\+[[:alnum:]\-\.]\+[[:alnum:]]\+\.[[:alpha:]]\+\$"
I'm working with bash so I escaped special characters.
Sample that should match :
#subdom.dom.ext
#subsubdom.subdom.dom.ext
#subsub-dom.sub-dom.ext
Thanks for help

A truly complete solution requires more work, but here's an approximation that may work well enough (note that a # prefix is assumed and the input string is expected to start with it):
^#(([a-zA-Z](-?[a-zA-Z0-9])*)\.)+[a-zA-Z]{2,}$
You can use this with egrep (or grep -E), but also with [[ ... =~ ... ]], bash's regex-matching operator.
Makes the following assumptions, which are more permissive than actual DNS name constraints:
Only ASCII (non-foreign) letters are allowed - see below for Internationalized Domain Name (IDN) considerations; also, the Punycode (ASCII-compatible) forms of IDNs - e.g., xn--bcher-kva.ch for bücher.ch - are not matched - see below.
There's no limit on the number of nested subdomains.
There's no limit on the length of any label (name component), and no limit on the overall length of the name (for actual limits, see here).
The TLD (last component) is composed of letters only and has a length of at least 2.
Both subdomain and domain names must start with a letter; subdomains are allowed to be single-letter.
Here's a quick test:
# Quote each name: an unquoted word that starts with # begins a shell comment.
for d in '#subdom..dom.ext' '#dom.ext' '#subdom.dom.ext' '#subsubdom.subdom.dom.ext' '#subsub-dom.sub-dom.ext' '#x.org'; do
[[ $d =~ \
^#(([a-zA-Z](-?[a-zA-Z0-9])*)\.)+[a-zA-Z]{2,}$ \
]] && echo YES || echo NO
done
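The same regex can also be handed to grep -E, as mentioned above. This is a minimal sketch, assuming the # prefix used throughout this question; the variable names regex and matches and the non-domain sample line are illustrative only.

```shell
#!/usr/bin/env bash
# ASCII-only regex from the answer, stored in a variable to keep quoting simple.
regex='^#(([a-zA-Z](-?[a-zA-Z0-9])*)\.)+[a-zA-Z]{2,}$'

# Feed a few candidate lines to grep -E and keep only the ones that match.
matches=$(printf '%s\n' \
  '#subdom..dom.ext' \
  '#dom.ext' \
  '#subdom.dom.ext' \
  '#subsub-dom.sub-dom.ext' \
  'not a domain' | grep -E "$regex")

printf '%s\n' "$matches"
# prints:
# #dom.ext
# #subdom.dom.ext
# #subsub-dom.sub-dom.ext
```

To scan a file instead, the printf pipeline would be replaced by a file argument to grep.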
Support for Internationalized Domain Names (IDN) with literal Unicode characters - again, a complete solution requires more work:
A simple improvement to also match IDNs is to replace [a-zA-Z] with [[:alpha:]] and [a-zA-Z0-9] with [[:alnum:]] in the above regex; i.e.:
^#(([[:alpha:]](-?[[:alnum:]])*)\.)+[[:alpha:]]{2,}$
Caveats:
No attempt is made to recognize Punycode-encoded versions of IDNs, which use an ASCII-based encoding with prefix xn--, and which would require decoding afterwards.
As Patrick Mevzek points out, the above can yield both false negatives and false positives (using his examples):
False positive: an invalid Punycode-encoded name such as ab--whatever
False positive: Invalid cross-language names; e.g., cαfe.fr, which uses a Greek letter in a French domain name - a rule that is impossible to enforce via a regex alone.
False negatives: emoji-based names such as 💄.ws (xn--jr8h.ws)
False negative: பரிட்சை is a valid TLD in IANA root today, but will not match [[:alpha:]]{2,}$
... and many more
Not all Unix-like platforms fully support all Unicode letters when matching against [[:alpha:]] or [[:alnum:]]. For instance, using UTF-8-based locales, OS X 10.9.1 apparently only matches Latin diacritics (e.g., ü, á) and Cyrillic characters (in addition to ASCII), whereas Linux 3.2 laudably appears to cover all scripts, including Asian and Arabic ones.
I'm unclear on whether names in right-to-left writing scripts are properly matched.
For the sake of completeness: even though the regex above makes no attempt to enforce length limits, attempting to do so with IDNs would be much more complex, as the length limits apply to the ASCII encoding of the name (via Punycode), not the original.
Tip of the hat to @Alfe for pointing out the problem with IDNs, and to @Arka for offering a simplified version of the regex to replace the lengthier one I had initially crafted under the mistaken assumption that single-letter domain names must be ruled out.

echo "#dom.ext" | grep -E '^#[a-zA-Z0-9]+([-.]?[a-zA-Z0-9]+)*\.[a-zA-Z]+$'
This did the job.

Use
grep '#[[:alpha:]][[:alnum:]-]*\.[[:alpha:]][[:alnum:]-]*\.[[:alpha:]][[:alnum:]-]*$'

Looking for the best way in bash shell to extract a string

I have the following string being exported from a program that is analyzing the certificate on a website which will be part of a bugfix analysis
CERT_SUMMARY:127.0.0.1:127.0.0.1:631:sha256WithRSAEncryption:
/O=bfcentos7-test/CN=bfcentos7-test/emailAddress=root$bfcentos7-
test:/O=bfcentos7-test/CN=bfcentos7-test/emailAddress=root$bfcentos7-
test:170902005715Z:270831005715Z:self signed certificate
(consider output above to be a single line)
What I need is the best way in a bash shell to extract the sha256WithRSAEncryption. This could be anything like sha384withRSAEncryption or something else.
After CERT_SUMMARY it will always be 127.0.0.1:127.0.0.1:portnum. In the example above the port is 631, but it could be anything.
This runs internally on a system and returns this string along with SSL or TLS (not pictured)
Here is another example of a return
CERT_SUMMARY:127.0.0.1:127.0.0.1:52311:sha256WithRSAEncryption:
/CN=ServerSigningCertificate_0/name=Type`Administrator
/name=DBName`ServerSigningCertificate_0:/C=US/CN=BLAHBLAH/
ST=California/L=Address, Emeryville CA 94608/O=IBM BigFix Evaluation
License/OU=Customer/emailAddress=blahblay#gmail.com/name=
Hash`sha1/name=Server`bigfix01/name=CustomActions`Enable
/name=LicenseAllocation`999999/name=CustomRetrievedProperties`Enable:
170702212459Z:270630212459Z:unable to get local issuer certificate
Thanks in advance.
Novice at shell programming, but learning!!

You ask for the best way, yet the description is not very precise: "This could be anything like sha384withRSAEncryption or something else."
Given the examples, the string you are looking for is the 5th field when : is the separator (the 4th field is the port number), so this command should work:
cut -f5 -d":"
If the output string has a strict length format, one easy option is the 'cut' command with -c. This is not the case though since there is a port number.
CERT_SUMMARY:127.0.0.1:127.0.0.1:631:sha256WithRSAEncryption:
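Field-based extraction can be checked against the sample line above. This is a minimal sketch, assuming bash; the variable names line, with_cut and with_read are illustrative, and the read-based variant is an addition not taken from the answer, shown in case external commands should be avoided.

```shell
#!/usr/bin/env bash
line='CERT_SUMMARY:127.0.0.1:127.0.0.1:631:sha256WithRSAEncryption:'

# External command: split on ":" and print the 5th field.
with_cut=$(printf '%s\n' "$line" | cut -d: -f5)

# Bash builtin only: split on ":" into throwaway fields, keeping the 5th.
# The last "_" receives whatever remains of the line.
IFS=: read -r _ _ _ _ with_read _ <<< "$line"

printf '%s\n%s\n' "$with_cut" "$with_read"
# prints:
# sha256WithRSAEncryption
# sha256WithRSAEncryption
```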

As @cyrus pointed out, this was as simple as picking the right column with awk... I am learning.
This worked
awk -F ":" '/CERT_SUMMARY/ {print $5}'
Thanks for the help!!

| sed -E 's/^([^:]*:){4}([^:]*):.*/\2/'
Regular expressions are your friend. If there is one thing you should be familiar with when you need to do a lot of string parsing or string processing, it is regular expressions.
echo 'CERT_SUMMARY:127.0.0.1:127.0.0.1:52311:sha256WithRSAEncryption:/CN=ServerSigningCertificate_0/name=Type`Administrator/name=DBName`ServerSigningCertificate_0:/C=US/CN=BLAHBLAH/ST=California/L=Address, Emeryville CA 94608/O=IBM BigFix Evaluation License/OU=Customer/emailAddress=blahblay#gmail.com/name=Hash`sha1/name=Server`bigfix01/name=CustomActions`Enable/name=LicenseAllocation`999999/name=CustomRetrievedProperties`Enable:170702212459Z:270630212459Z:unable to get local issuer certificate' | sed -E 's/^([^:]*:){4}([^:]*):.*/\2/'
prints
sha256WithRSAEncryption
It's probably a bit overkill here, but there is almost nothing that cannot be done with regular expressions and as you have also built-in regex support in many languages today, knowing regex is never going to be a waste of time.
See also here to get a nice explanation of what each part of the regex actually means, including an interactive editing view. Basically I'm telling the regex parser to skip the first 4 groups consisting of any number of characters that are not :, each followed by a single :, then capture the 5th group, which consists of any number of characters that are not :, and finally match anything else (no matter what) to the end of the string. The whole regex is part of a sed "replace" operation, where I replace the whole string with just the content captured by the second capture group (everything in round parentheses is a capture group).
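The regex described above can also be applied with bash's own [[ =~ ]] operator instead of sed. This is a sketch under assumptions: it uses the first sample from the question joined into one line, drops the trailing .* (not needed when only the capture is read), and the names line, re and algo are illustrative.

```shell
#!/usr/bin/env bash
line='CERT_SUMMARY:127.0.0.1:127.0.0.1:631:sha256WithRSAEncryption:/O=bfcentos7-test/CN=bfcentos7-test/emailAddress=root$bfcentos7-test:/O=bfcentos7-test/CN=bfcentos7-test/emailAddress=root$bfcentos7-test:170902005715Z:270831005715Z:self signed certificate'

# Four "field plus colon" groups are skipped; the second pair of
# parentheses captures the 5th field.
re='^([^:]*:){4}([^:]*):'

if [[ $line =~ $re ]]; then
  algo=${BASH_REMATCH[2]}
  printf '%s\n' "$algo"   # sha256WithRSAEncryption
fi
```

BASH_REMATCH[2] plays the role of \2 in the sed replacement.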

You could also use the following. It does not rely on the field number, so it still works if the sha...Encryption value appears at a different position in your Input_file than in the sample shown.
awk '{match($0,/sha.*Encryption:/);if(substr($0,RSTART,RLENGTH)){print substr($0,RSTART,RLENGTH-1)}}' Input_file

Pipe the output to:
awk 'BEGIN{FS=":"} {print $5}'

You could also take a step back to the openssl x509 command 'name options'. Using sep_comma_plus avoids the slashes in the output and therefore your regex will be simpler.

Finding substring of variable length in bash

I have a string, such as time=1234, and I want to extract just the number after the = sign. However, this number could be in the range of 0 and 100000 (eg. - time=1, time=23, time=99999, etc.).
I've tried things like ${string:5:8}, but this will only work for examples of a certain length.
How do I get the substring of everything after the = sign? I would prefer to do it without outside commands like cut or awk, because I will be running this script on devices that may or may not have that functionality. I know there are examples out there using outside functions, but I am trying to find a solution without the use of such.

s=time=1234
time_int=${s##*=}
echo "The content after the = in $s is $time_int"
This is a parameter expansion that removes the longest match of the pattern *= from the front of the variable's value -- thus, everything up to and including the last =.
If intending this to be non-greedy (that is, to remove only content up to the first = rather than the last =), use ${s#*=} -- a single # rather than two.
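The difference between ## and # only shows when the value contains more than one = sign. A minimal sketch, assuming bash; the sample value a=b=c and the variable names are made up for illustration.

```shell
#!/usr/bin/env bash
s='time=1234'
after_last=${s##*=}    # longest prefix match removed
after_first=${s#*=}    # shortest prefix match removed

t='a=b=c'
t_after_last=${t##*=}  # removes "a=b="
t_after_first=${t#*=}  # removes "a="

printf '%s %s %s %s\n' "$after_last" "$after_first" "$t_after_last" "$t_after_first"
# prints: 1234 1234 c b=c
```

Both forms are shell builtins, so no external command such as cut or awk is involved.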
References:
The bash-hackers page on parameter expansion
BashFAQ #100 ("How do I do string manipulations in bash?")
BashFAQ #73 ("How can I use parameter expansion? How can I get substrings? [...]")
BashSheet quick-reference, parameter expansion section

If the time= part is constant, you can remove the prefix by using ${str#time=}
Say you have str='time=123123'; if you execute echo ${str#time=} you get 123123

What is the functionality of ## and %% in bash

disk="/dev/sda"
local dev_node=${disk##*/}
dev_node is assigned with "sda".
Also,
partition="/dev/sda3"
echo ${partition%%[0-9]*}
It returns /dev/sda and remove 3.
I did not understand the functionality of ##*/ and %%[0-9]* in the above commands. I tried searching but could not find enough information.
Please explain and provide any links to tutorial related to this.

This is a very good manual / tutorial. As for your question:
${string##substring} deletes the longest match of $substring from the front of $string.
and
${string%%substring} deletes the longest match of $substring from the back of $string.
applied to your example: removing the longest substring matching */ from /dev/sda results in sda

This procedure is commonly described as parameter expansion.
In your case ## and %% are operators that remove part of the string.
## deletes the longest match of the given pattern, starting at the front of the string.
%% does the same, except that it starts from the back of the string.
Good guide is here.
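The four related operators can be compared on the values from the question. This is a minimal sketch, assuming bash; the /dev/nvme0n1p2 value and the variable names are added for illustration, to show a case where the longest and shortest matches differ.

```shell
#!/usr/bin/env bash
disk="/dev/sda"
longest_front=${disk##*/}    # removes "/dev/"
shortest_front=${disk#*/}    # removes only the leading "/"

partition="/dev/sda3"
sda_longest=${partition%%[0-9]*}   # removes "3"

# Illustrative value: several digits, so % and %% behave differently.
nvme="/dev/nvme0n1p2"
nvme_longest=${nvme%%[0-9]*}   # removes "0n1p2" (from the first digit on)
nvme_shortest=${nvme%[0-9]*}   # removes only the final "2"

printf '%s\n' "$longest_front" "$shortest_front" "$sda_longest" "$nvme_longest" "$nvme_shortest"
# prints:
# sda
# dev/sda
# /dev/sda
# /dev/nvme
# /dev/nvme0n1p
```

The pattern [0-9]* is a glob meaning "one digit followed by anything", which is why %% cuts at the first digit rather than at the last run of digits.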
