Extract/Parse all subdomains and domains from any file content [duplicate] - linux

I'm trying to write a regex for grep that matches only valid domains.
My version works pretty well but also matches the following invalid domain:
#subdom..dom.ext
Here is my regex :
echo "#dom.ext" | grep "^#[[:alnum:]]\+[[:alnum:]\-\.]\+[[:alnum:]]\+\.[[:alpha:]]\+\$"
I'm working with bash so I escaped special characters.
Sample that should match :
#subdom.dom.ext
#subsubdom.subdom.dom.ext
#subsub-dom.sub-dom.ext
Thanks for help

A truly complete solution requires more work, but here's an approximation that may work well enough (note that a # prefix is assumed and the input string is expected to start with it):
^#(([a-zA-Z](-?[a-zA-Z0-9])*)\.)+[a-zA-Z]{2,}$
You can use this with egrep (or grep -E), but also with [[ ... =~ ... ]], bash's regex-matching operator.
It makes the following assumptions, which are more permissive than actual DNS name constraints:
Only ASCII (non-foreign) letters are allowed - see below for Internationalized Domain Name (IDN) considerations; also, the Punycode (ASCII-compatible) forms of IDNs - e.g., xn--bcher-kva.ch for bücher.ch - are not matched - see below.
There's no limit on the number of nested subdomains.
There's no limit on the length of any label (name component), and no limit on the overall length of the name (actual DNS limits are 63 octets per label and 253 characters for the full name in textual form).
The TLD (last component) is composed of letters only and has a length of at least 2.
Both subdomain and domain names must start with a letter; subdomains are allowed to be single-letter.
Here's a quick test:
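# The names are quoted because an unquoted word starting with # begins a comment in bash.
for d in '#subdom..dom.ext' '#dom.ext' '#subdom.dom.ext' '#subsubdom.subdom.dom.ext' '#subsub-dom.sub-dom.ext' '#x.org'; do
[[ $d =~ \
^#(([a-zA-Z](-?[a-zA-Z0-9])*)\.)+[a-zA-Z]{2,}$ \
]] && echo YES || echo NO
done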
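As a hedged illustration of the grep -E route mentioned above, the sketch below filters a few made-up input lines and keeps only those that match the approximation. The sample names (including no-prefix.org) and the variable names re and matches are assumptions for demonstration only:

```bash
# Same approximation as above, stored in a variable to avoid quoting surprises.
re='^#(([a-zA-Z](-?[a-zA-Z0-9])*)\.)+[a-zA-Z]{2,}$'

# Feed one candidate per line to grep -E and keep only the matching lines.
matches=$(printf '%s\n' \
  '#subdom..dom.ext' \
  '#dom.ext' \
  '#subsub-dom.sub-dom.ext' \
  'no-prefix.org' | grep -E "$re")

printf '%s\n' "$matches"
# prints:
# #dom.ext
# #subsub-dom.sub-dom.ext
```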
Support for Internationalized Domain Names (IDN) with literal Unicode characters - again, a complete solution requires more work:
A simple improvement to also match IDNs is to replace [a-zA-Z] with [[:alpha:]] and [a-zA-Z0-9] with [[:alnum:]] in the above regex; i.e.:
^#(([[:alpha:]](-?[[:alnum:]])*)\.)+[[:alpha:]]{2,}$
Caveats:
No attempt is made to recognize Punycode-encoded versions of IDNs, which use an ASCII-based encoding with prefix xn--, and which would require decoding afterwards.
As Patrick Mevzek points out, the above can yield both false negatives and false positives (using his examples):
False positive: an invalid Punycode-encoded name such as ab--whatever
False positive: Invalid cross-language names; e.g., cαfe.fr, which uses a Greek letter in a French domain name - a rule that is impossible to enforce via a regex alone.
False negatives: emoji-based names such as 💄.ws (xn--jr8h.ws)
False negative: பரிட்சை is a valid TLD in IANA root today, but will not match [[:alpha:]]{2,}$
... and many more
Not all Unix-like platforms fully support all Unicode letters when matching against [[:alpha:]] or [[:alnum:]]. For instance, using UTF-8-based locales, OS X 10.9.1 apparently only matches Latin diacritics (e.g., ü, á) and Cyrillic characters (in addition to ASCII), whereas Linux 3.2 laudably appears to cover all scripts, including Asian and Arabic ones.
I'm unclear on whether names in right-to-left writing scripts are properly matched.
For the sake of completeness: even though the regex above makes no attempt to enforce length limits, attempting to do so with IDNs would be much more complex, as the length limits apply to the ASCII encoding of the name (via Punycode), not the original.
Tip of the hat to #Alfe and for pointing out the problem with IDNs, and to #Arka for offering a simplified version of the regex to replace the lengthier one I had initially crafted under the mistaken assumption that single-letter domain names must be ruled out.
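For ASCII-only names, the length limits that the regex ignores can be checked separately. The following is only a sketch under stated assumptions: is_valid_name is a hypothetical helper name, the limits used are 63 characters per label and 253 for the whole name, and IDNs are not handled (their limits apply to the Punycode form):

```bash
# is_valid_name NAME: hypothetical helper combining the ASCII regex above
# with length checks. Returns 0 if NAME passes, 1 otherwise.
is_valid_name() {
  local name=$1 label
  local re='^#(([a-zA-Z](-?[a-zA-Z0-9])*)\.)+[a-zA-Z]{2,}$'
  [[ $name =~ $re ]] || return 1
  name=${name#\#}                    # drop the leading '#'
  (( ${#name} <= 253 )) || return 1  # overall length limit
  local IFS=.
  for label in $name; do             # unquoted on purpose: split on dots
    (( ${#label} <= 63 )) || return 1
  done
  return 0
}

is_valid_name '#subsub-dom.sub-dom.ext' && echo YES || echo NO   # YES
is_valid_name '#subdom..dom.ext' && echo YES || echo NO          # NO
```

The unquoted $name in the loop is safe here only because the regex has already restricted it to letters, digits, hyphens and dots.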

echo "#dom.ext" | grep -E "^#[a-zA-Z0-9]+([-.]?[a-zA-Z0-9]+)*\.[a-zA-Z]+$"
This did the job.

Use
grep '#[[:alpha:]][[:alnum:]-]*\.[[:alpha:]][[:alnum:]-]*\.[[:alpha:]][[:alnum:]-]*$'

Related

How to extract protocols (http)[s] to check url using shell script

I have tried below one:
A=https://xyz.site
echo -e ${A//:*}
Result: https
Please explain how the ${A//:*} term yields https or http and what the concept behind it is; share an article or PDF if possible.
For the World Wide Web prefix [www]
It's pretty simple to extract this one:
A=www.google.com
echo -e ${A::3}
Result: www
${parameter:offset:length} is referred to as Substring Expansion. In your example, ${A::3} means ${A:0:3} and returns the first 3 characters of the variable A.
${parameter/pattern/string} replaces the first match of pattern with string. If pattern begins with /, all matches of pattern are replaced with string. In your example, ${A//:*} means ${A//:*/}: the pattern :* is a glob (not a regex) in which * matches any string, so it matches the first : together with everything after it, and that match is replaced with an empty string, leaving https.
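A minimal sketch of both expansions, assuming bash. The variable names scheme, scheme_alt and prefix are only illustrative, and ${A%%:*} is shown as an alternative that strips the longest suffix matching :* and gives the same result here:

```bash
A=https://xyz.site
scheme=${A//:*}        # replace every match of the glob ':*' with nothing
scheme_alt=${A%%:*}    # strip the longest suffix that matches ':*'

B=www.google.com
prefix=${B::3}         # same as ${B:0:3}: the first three characters

echo "$scheme $scheme_alt $prefix"   # https https www
```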

Looking for the best way in bash shell to extract a string

I have the following string being exported from a program that is analyzing the certificate on a website which will be part of a bugfix analysis
CERT_SUMMARY:127.0.0.1:127.0.0.1:631:sha256WithRSAEncryption:
/O=bfcentos7-test/CN=bfcentos7-test/emailAddress=root$bfcentos7-
test:/O=bfcentos7-test/CN=bfcentos7-test/emailAddress=root$bfcentos7-
test:170902005715Z:270831005715Z:self signed certificate
(consider output above to be a single line)
What I need is the best way in a bash shell to extract the sha256WithRSAEncryption. This could be anything like sha384withRSAEncryption or something else.
After CERT_SUMMARY it will always be 127.0.0.1:127.0.0.1:portnum; above, the port is 631, but it could be anything.
This runs internally on a system and returns this string along with SSL or TLS (not pictured)
Here is another example of a return
CERT_SUMMARY:127.0.0.1:127.0.0.1:52311:sha256WithRSAEncryption:
/CN=ServerSigningCertificate_0/name=Type`Administrator
/name=DBName`ServerSigningCertificate_0:/C=US/CN=BLAHBLAH/
ST=California/L=Address, Emeryville CA 94608/O=IBM BigFix Evaluation
License/OU=Customer/emailAddress=blahblay#gmail.com/name=
Hash`sha1/name=Server`bigfix01/name=CustomActions`Enable
/name=LicenseAllocation`999999/name=CustomRetrievedProperties`Enable:
170702212459Z:270630212459Z:unable to get local issuer certificate
Thanks in advance.
Novice at shell programming, but learning!!
You need the best way and yet do not seem to provide the best description - "This could be anything like sha384withRSAEncryption or something else."
Given the examples, the string you are looking for is the 5th field when : is the separator (the port number is the 4th), so this command should do:
cut -f5 -d":"
If the output string has a strict length format, one easy option is the 'cut' command with -c. This is not the case though since there is a port number.
CERT_SUMMARY:127.0.0.1:127.0.0.1:631:sha256WithRSAEncryption:
As #cyrus pointed out, this was as simple as picking the right column with awk. I am learning.
This worked
awk -F ":" '/CERT_SUMMARY/ {print $5}'
Thanks for the help!!
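For comparison, here is a small sketch of three field-based ways to pull out the fifth colon-separated field. The sample line is a shortened, made-up variant of the first example (the subject and issuer parts are abbreviated), and the variable names are assumptions:

```bash
# Abbreviated sample in the same CERT_SUMMARY layout as the question.
line='CERT_SUMMARY:127.0.0.1:127.0.0.1:631:sha256WithRSAEncryption:/O=test/CN=test:170902005715Z:270831005715Z:self signed certificate'

# 1. cut: field 5 with ':' as the delimiter
with_cut=$(printf '%s\n' "$line" | cut -d: -f5)

# 2. awk: same field, restricted to CERT_SUMMARY lines
with_awk=$(printf '%s\n' "$line" | awk -F: '/CERT_SUMMARY/ {print $5}')

# 3. pure bash: split on ':' with read, discarding unwanted fields into _
IFS=: read -r _ _ _ _ with_read _ <<< "$line"

echo "$with_cut $with_awk $with_read"
```

All three rely on the algorithm name always being the fifth field, which holds for both samples in the question.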
| sed -E 's/^([^:]*:){4}([^:]*):.*/\2/'
Regular expressions are your friend. If there is one thing you should be familiar with when you need to do a lot of string parsing or string processing, it's definitely regular expressions.
echo 'CERT_SUMMARY:127.0.0.1:127.0.0.1:52311:sha256WithRSAEncryption:
/CN=ServerSigningCertificate_0/name=Type`Administrator
/name=DBName`ServerSigningCertificate_0:/C=US/CN=BLAHBLAH/ST=California
/L=Address, Emeryville CA 94608/O=IBM BigFix Evaluation
License/OU=Customer/emailAddress=blahblay#gmail.com/name=Hash`sha1
/name=Server`bigfix01/name=CustomActions`Enable
/name=LicenseAllocation`999999
/name=CustomRetrievedProperties
`Enable:170702212459Z:270630212459Z:unable to get local issuer
certificate' | sed -E 's/^([^:]*:){4}([^:]*):.*/\2/'
prints (with the input on a single line, as in the question)
sha256WithRSAEncryption
It's probably a bit overkill here, but there is almost nothing that cannot be done with regular expressions and as you have also built-in regex support in many languages today, knowing regex is never going to be a waste of time.
See also here to get a nice explanation of what each part of the regex actually means, including an interactive editing view. Basically I'm telling the regex parser to skip the first 4 groups, each consisting of any number of characters that are not : followed by a single :, then capture the 5th group, which consists of any number of characters that are not :, and finally match anything else (no matter what) to the end of the string. The whole regex is part of a sed "replace" operation, where I replace the whole string with just the content captured by the second capture group (everything in round parentheses is a capture group).
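The same capture-group idea can be sketched without sed, using bash's =~ operator and the BASH_REMATCH array. The sample line is abbreviated (/CN=Example is a made-up placeholder for the long subject and issuer parts), and the names re and algo are assumptions:

```bash
line='CERT_SUMMARY:127.0.0.1:127.0.0.1:52311:sha256WithRSAEncryption:/CN=Example:170702212459Z:270630212459Z:unable to get local issuer certificate'

# Four "anything but ':' followed by ':'" groups, then capture the fifth field.
re='^([^:]*:){4}([^:]*):'

if [[ $line =~ $re ]]; then
  algo=${BASH_REMATCH[2]}   # second capture group, as \2 in the sed command
  echo "$algo"              # sha256WithRSAEncryption
fi
```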
You could also try the following, which does not rely on the field number, so it still works if the position of the sha256 string in your Input_file differs from the one shown.
awk '{match($0,/sha.*Encryption:/);if(substr($0,RSTART,RLENGTH)){print substr($0,RSTART,RLENGTH-1)}}' Input_file
Pipe the output to:
awk 'BEGIN{FS=":"} {print $5}'
You could also take a step back to the openssl x509 command 'name options'. Using sep_comma_plus avoids the slashes in the output and therefore your regex will be simpler.

XML schema restriction pattern for not allowing specific string

I need to write an XSD schema with a restriction on a field, to ensure that
the value of the field does not contain the substring FILENAME at any location.
For example, all of the following must be invalid:
FILENAME
ORIGINFILENAME
FILENAMETEST
123FILENAME456
None of these values should be valid.
In a regular expression language that supports negative lookahead, I could do this by writing /^((?!FILENAME).)*$/ but the XSD pattern language does not support negative lookahead.
How can I implement an XSD pattern restriction with the same effect as /^((?!FILENAME).)*$/ ?
I need to use pattern, because I don't have access to XSD 1.1 assertions, which are the other obvious possibility.
The question XSD restriction that negates a matching string covers a similar case, but in that case the forbidden string is forbidden only as a prefix, which makes checking the constraint easier. How can the solution there be extended to cover the case where we have to check all locations within the input string, and not just the beginning?
OK, the OP has persuaded me that while the other question mentioned has an overlapping topic, the fact that the forbidden string is forbidden at all locations, not just as a prefix, complicates things enough to require a separate answer, at least for the XSD 1.0 case. (I started to add this answer as an addendum to my answer to the other question, and it grew too large.)
There are two approaches one can use here.
First, in XSD 1.1, a simple assertion of the form
not(matches($value, 'FILENAME'))
ought to do the job.
Second, if one is forced to work with an XSD 1.0 processor, one needs a pattern that will match all and only strings that don't contain the forbidden substring (here 'FILENAME').
One way to do this is to ensure that the character 'F' never occurs in the input. That's too drastic, but it does do the job: strings not containing the first character of the forbidden string do not contain the forbidden string.
But what of strings that do contain an occurrence of 'F'? They are fine, as long as no 'F' is followed by the string 'ILENAME'.
Putting that last point more abstractly, we can say that any acceptable string (any string that doesn't contain the string 'FILENAME') can be divided into two parts:
a prefix which contains no occurrences of the character 'F'
zero or more occurrences of 'F' followed by a string that doesn't match 'ILENAME' and doesn't contain any 'F'.
The prefix is easy to match: [^F]*.
The strings that start with F but don't match 'FILENAME' are a bit more complicated; just as we don't want to outlaw all occurrences of 'F', we also don't want to outlaw 'FI', 'FIL', etc. -- but each occurrence of such a dangerous string must be followed either by the end of the string, or by a letter that doesn't match the next letter of the forbidden string, or by another 'F' which begins another region we need to test. So for each proper prefix of the forbidden string, we create a regular expression of the form
$prefix || '([^F' || next-character-in-forbidden-string || ']'
|| '[^F]*)?'
Then we join all of those regular expressions with or-bars.
The end result in this case is something like the following (I have inserted newlines here and there, to make it easier to read; before use, they will need to be taken back out):
[^F]*
((F([^FI][^F]*)?)
|(FI([^FL][^F]*)?)
|(FIL([^FE][^F]*)?)
|(FILE([^FN][^F]*)?)
|(FILEN([^FA][^F]*)?)
|(FILENA([^FM][^F]*)?)
|(FILENAM([^FE][^F]*)?))*
Two points to bear in mind:
XSD regular expressions are implicitly anchored; testing this with a non-anchored regular expression evaluator will not produce the correct results.
It may not be obvious at first why the alternatives in the choice all end with [^F]* instead of .*. Thinking about the string 'FEEFIFILENAME' may help. We have to check every occurrence of 'F' to make sure it's not followed by 'ILENAME'.
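The construction can be sketched mechanically. The following is an illustration only, written in bash rather than XSD: build_pattern and allows are hypothetical helper names, the explicit ^( ... )$ anchors emulate XSD's implicit anchoring, and the sketch assumes the forbidden word contains no regex metacharacters and that its first character does not recur later in the word (true for FILENAME):

```bash
# build_pattern WORD: emit the "does not contain WORD" pattern described above.
build_pattern() {
  local word=$1 first=${1:0:1} alts= i prefix next
  for (( i = 1; i < ${#word}; i++ )); do
    prefix=${word:0:i}     # proper prefix: F, FI, FIL, ...
    next=${word:i:1}       # the character that would continue the word
    alts+="${alts:+|}(${prefix}([^${first}${next}][^${first}]*)?)"
  done
  printf '%s\n' "[^${first}]*(${alts})*"
}

pattern=$(build_pattern FILENAME)

# allows STRING: succeed if STRING matches the pattern as a whole.
allows() {
  local re="^(${pattern})\$"
  [[ $1 =~ $re ]]
}

for s in FILENAME ORIGINFILENAME 123FILENAME456 FEEFIFILENAME FILENAM MYFILE.TXT; do
  allows "$s" && echo "$s: valid" || echo "$s: invalid"
done
```

Only the last two strings in the loop are reported as valid; printing $pattern should give the expression shown above with the newlines removed.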

Ruby - Parse a multi-line tab-delimited string into an array of arrays

My apologies if this has already been asked in a Ruby setting--I checked before posting, but to be perfectly honest it has been a very long day, and if I am missing the obvious, I apologize in advance!
I have the following string, which contains a list of software packages installed on a system, and for some reason I am having the hardest time parsing it. I know there has got to be a straightforward means of doing this in Ruby but I keep coming up short.
I would like to parse the below multi-line, tab-delimited, string into an array of arrays where I can then loop through each array element with an each_with_index and spit out the HTML code into my Rails app.
str = 'Product and/or Software Full Name 5242 [version 6.5.24] [Installed on: 12/31/2015]
Product and/or Software Full Name 5426 [version 22.4] [Installed on: 06/11/2013]
Product and/or Software Full Name 2451 [version 1.63] [Installed on: 12/17/2015]
Product and/or Software Full Name 5225 [version 43.22.51] [Installed on: 11/15/2011]
Product and/or Software Full Name 2420 [version 43.51-r2] [Installed on: 12/31/2015]'
The end result would be an array of arrays with 5 elements like so:
[["Product and/or Software Full Name 5242"],["version 6.5.24"],
["Installed on: 12/31/2015"],["Product and/or Software Full Name 5426"],["version 22.4"],["Installed on: 06/11/2013"],["Product and/or Software Full Name 2451"],["version 1.63"],["Installed on: 12/17/2015"]]
Please Note: Only 3 of 5 arrays are shown for brevity
I would prefer to strip out the brackets from both 'version' and 'Installed on' but I can do that with gsub separately if that cannot easily be baked into an answer.
Last thing is that there won't always be an 'Installed on' entry for every line in the multiline string, so the answer will need to take that into account as applicable.
This ought to do:
expr = /(.+?)\s+\[([^\]]+)\](?:\s+\[([^\]]+)\])?/
str.scan(expr)
The expression is actually a lot less complex than it looks. It looks complex because we're matching square brackets, which have to be escaped, and also using character classes, which are enclosed in square brackets in the regular expression language. All together it adds a lot of noise.
Here it is split up:
expr = /
(.+?) # Capture #1: Any characters (non-greedy)
\s+ # Whitespace
\[ # Literal '['
( # Capture #2:
[^\]]+ # One or more characters that aren't ']'
)
\] # Literal ']'
(?: # Non-capturing group
\s+ # Whitespace
\[ # Literal '['
([^\]]+) # Capture #3 (same as #2)
\] # Literal ']'
)? # Preceding group is optional
/x
As you can see, the third part is identical to the second part, except it's in a non-capture group followed by a ? to make it optional.
It's worth noting that this may fail if e.g. the product name contains square brackets. If that's a possibility, one potential solution is to include the version and Installed text in the match, e.g.:
expr = /(.+?)\s+\[(version [^\]]+)\](?:\s+\[(Installed [^\]]+)\])?/
P.S. Here's a solution that uses String#split instead:
expr = /\]?\s+\[|\]$/
res = str.each_line.map {|ln| ln.strip.split(expr) }
.reject {|arr| arr.empty? }
If you have brackets in your product names, a possible workaround here is to specify a minimum number of spaces between parts, e.g.:
expr = /\]?\s{3,}\[|\]$/
...which of course depends on product names never containing three or more consecutive whitespace characters followed by a bracket.

Allowing only usernames using "reasonable" characters

A username for a website can contain the space character, and yet it cannot be composed only of space characters. It can contain some symbols (like underscore and dash), but starting with certain symbols would look weird. Non-latin letters should be allowed, preferably for all languages, but tab and newline characters shouldn't. And definitely no Zalgo.
The rules composing what should and shouldn't be allowed in a reasonable naming system are complicated, however they are virtually the same for every website. Reimplementing them is probably a bad idea. Where can I find an implementation? I'm using PHP.
You should validate the username entered by the new user against a regular expression that checks it against the allowed character set.
Example: The following allows only English alphanumeric characters plus -, _ and . (the filter matches any character outside that set, and a match means the name is rejected).
function isNewUsernameValid ($name, $filter = "[^a-zA-Z0-9\-\_\.]"){
return preg_match("~" . $filter . "~iU", $name) ? false : true;
}
if ( !isNewUsernameValid ($name) ){
print "Not a valid name.";
}
For your particular case, you'll have to come up with and test the regular expression.
