AWK different versions behavior when using regex pattern

Background:
Recently I tried to build libopencm3-examples on Ubuntu 14.04 and hit a build error (on Ubuntu 16.04.1 LTS it builds fine). I started digging to find the reason. As I discovered, libopencm3 uses a specific linker script generator:
see libopencm3-examples/libopencm3/ld/README
The purpose of this tool is to pass defines specific to the target microcontroller to a linker script template. It runs the preprocessor over the template script and passes target-specific parameters like so:
-D_FPU=hard-fpv5-sp-d16 -D_ROM_OFF=0x08000000 -D_RAM_OFF=0x20000000
To retrieve these parameters, an awk script is used:
./libopencm3/scripts/genlink.awk
To generate the -D_XXX keys, this script operates on the device database ./libopencm3/ld/devices.data
like so:
awk -v PAT="$(DEVICE)" -v MODE="DEFS" -f $(OPENCM3_DIR)/scripts/genlink.awk $(OPENCM3_DIR)/ld/devices.data
Question:
The part of the awk script that extracts the defines from the database looks like this:
...
for (i = 3; i <= NF; i = i + 1) {
    ...
    else if ($i ~ /[[:upper:]]*=/) {
        if ("DEFS" == MODE)
            printf "-D_%s ",$i;
    }
}
The row in the database that the script processes:
stm32f3[01]3?c* stm32f3ccm ROM=256K RAM=40K CCM=8K
What confuses me is that the pattern (/[[:upper:]]*=/) should match [ROM]=256K, for example, but not ROM=256K (right?). Anyhow, as I already mentioned, /[[:upper:]]*=/ works on Ubuntu 16.04 (GNU Awk 4.1.3) (why?), while on 14.04 I needed to change /[[:upper:]]*=/ to /[:upper:]*=/ to make it work (is this a bug, or what?). Am I missing something?

No. The square bracket character is special in a regex, so [[:upper:]] matches a single upper-case letter, and /[[:upper:]]*=/ matches ROM=256K as intended. If you want to match a bracket literally, use \[. The expression [:upper:] inside square brackets refers to the character class consisting of upper-case characters. If literal brackets should be accepted as well, I'm guessing you want
/[][:upper:][]+=/
to form a bracket expression consisting of a literal closing square bracket, upper-case characters, and a literal opening square bracket. Notice also the switch to + instead of * to prevent matching on a lone equals sign (* means zero or more, so with zero repetitions it would match any equals sign).
Possibly the Awk you have doesn't support POSIX character classes at all. In that case, you could replace [:upper:] with A-Z, though the match will then no longer be locale-sensitive.
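The difference between * and +, and the reason the /[:upper:]*=/ workaround appears to work, can be checked in any regex engine. Below is a minimal sketch in Python rather than awk. Python's re module has no POSIX classes, so [A-Z] stands in for [[:upper:]] (an assumption that only holds for ASCII input). The fields come from the devices.data row in the question, except "=oops", which is a made-up field added to show the lone-equals case.

```python
import re

# Fields from the question's database row, plus one hypothetical field
# ("=oops") that has no name in front of the equals sign.
fields = ["stm32f3ccm", "ROM=256K", "RAM=40K", "CCM=8K", "=oops"]

# [A-Z] is an ASCII stand-in for [[:upper:]], which Python's re lacks.
star = re.compile(r"[A-Z]*=")      # zero or more capitals, then "="
plus = re.compile(r"[A-Z]+=")      # at least one capital, then "="
# Without the outer brackets, [:upper:] is just the set {:, u, p, e, r}.
bare = re.compile(r"[:upper:]*=")

star_hits = [f for f in fields if star.search(f)]
plus_hits = [f for f in fields if plus.search(f)]
bare_hits = [f for f in fields if bare.search(f)]

print(star_hits)
print(plus_hits)
print(bare_hits)
```

The bare form accepts every field that contains an equals sign, exactly like the * form, which suggests the changed pattern only works by accident: with zero repetitions allowed, the character set in front of = never has to match anything.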

Related

Extract/Parse all subdomains and domains from any file content [duplicate]

I'm trying to write a regex for grep that matches only valid domains.
My version works pretty well but also matches the following invalid domain:
#subdom..dom.ext
Here is my regex:
echo "#dom.ext" | grep "^#[[:alnum:]]\+[[:alnum:]\-\.]\+[[:alnum:]]\+\.[[:alpha:]]\+\$"
I'm working with bash so I escaped special characters.
Sample that should match :
#subdom.dom.ext
#subsubdom.subdom.dom.ext
#subsub-dom.sub-dom.ext
Thanks for help

A truly complete solution requires more work, but here's an approximation that may work well enough (note that a # prefix is assumed and the input string is expected to start with it):
^#(([a-zA-Z](-?[a-zA-Z0-9])*)\.)+[a-zA-Z]{2,}$
You can use this with egrep (or grep -E), but also with [[ ... =~ ... ]], bash's regex-matching operator.
Makes the following assumptions, which are more permissive than actual DNS name constraints:
Only ASCII (non-foreign) letters are allowed - see below for Internationalized Domain Name (IDN) considerations; also, the Punycode (ASCII-compatible) forms of IDNs - e.g., xn--bcher-kva.ch for bücher.ch - are not matched - see below.
There's no limit on the number of nested subdomains.
There's no limit on the length of any label (name component), and no limit on the overall length of the name (for actual limits, see here).
The TLD (last component) is composed of letters only and has a length of at least 2.
Both subdomain and domain names must start with a letter; subdomains are allowed to be single-letter.
Here's a quick test:
for d in '#subdom..dom.ext' '#dom.ext' '#subdom.dom.ext' '#subsubdom.subdom.dom.ext' '#subsub-dom.sub-dom.ext' '#x.org'; do
  [[ $d =~ \
    ^#(([a-zA-Z](-?[a-zA-Z0-9])*)\.)+[a-zA-Z]{2,}$ \
  ]] && echo YES || echo NO
done
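For comparison, here is a small sketch of the same check in Python. It assumes the ASCII-only pattern from above, which Python's re module accepts unchanged, and reuses the sample names from the quick test. The variable names are illustrative only.

```python
import re

# Same ASCII-only pattern as in the answer above.
DOMAIN_RE = re.compile(r"^#(([a-zA-Z](-?[a-zA-Z0-9])*)\.)+[a-zA-Z]{2,}$")

samples = [
    "#subdom..dom.ext",
    "#dom.ext",
    "#subdom.dom.ext",
    "#subsubdom.subdom.dom.ext",
    "#subsub-dom.sub-dom.ext",
    "#x.org",
]

# Map each sample to True (valid under the pattern) or False.
results = {name: bool(DOMAIN_RE.match(name)) for name in samples}

for name, ok in results.items():
    print("YES" if ok else "NO", name)
```

Only the first sample, with its empty label between two dots, is rejected.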
Support for Internationalized Domain Names (IDN) with literal Unicode characters - again, a complete solution requires more work:
A simple improvement to also match IDNs is to replace [a-zA-Z] with [[:alpha:]] and [a-zA-Z0-9] with [[:alnum:]] in the above regex; i.e.:
^#(([[:alpha:]](-?[[:alnum:]])*)\.)+[[:alpha:]]{2,}$
Caveats:
No attempt is made to recognize Punycode-encoded versions of IDNs, which use an ASCII-based encoding with prefix xn--, and which would require decoding afterwards.
As Patrick Mevzek points out, the above can yield both false negatives and false positives (using his examples):
False positive: an invalid Punycode-encoded name such as ab--whatever
False positive: Invalid cross-language names; e.g., cαfe.fr, which uses a Greek letter in a French domain name - a rule that is impossible to enforce via a regex alone.
False negatives: emoji-based names such as 💄.ws (xn--jr8h.ws)
False negative: பரிட்சை is a valid TLD in IANA root today, but will not match [[:alpha:]]{2,}$
... and many more
Not all Unix-like platforms fully support all Unicode letters when matching against [[:alpha:]] or [[:alnum:]]. For instance, using UTF-8-based locales, OS X 10.9.1 apparently only matches Latin diacritics (e.g., ü, á) and Cyrillic characters (in addition to ASCII), whereas Linux 3.2 laudably appears to cover all scripts, including Asian and Arabic ones.
I'm unclear on whether names in right-to-left writing scripts are properly matched.
For the sake of completeness: even though the regex above makes no attempt to enforce length limits, attempting to do so with IDNs would be much more complex, as the length limits apply to the ASCII encoding of the name (via Punycode), not the original.
Tip of the hat to @Alfe for pointing out the problem with IDNs, and to @Arka for offering a simplified version of the regex to replace the lengthier one I had initially crafted under the mistaken assumption that single-letter domain names must be ruled out.

echo "#dom.ext" | grep -E "^#[a-zA-Z0-9]+([-.]?[a-zA-Z0-9]+)*\.[a-zA-Z]+$"
This did the job.

Use
grep '#[[:alpha:]][[:alnum:]-]*\.[[:alpha:]][[:alnum:]-]*\.[[:alpha:]][[:alnum:]-]*$'

Looking for the best way in bash shell to extract a string

I have the following string being exported from a program that is analyzing the certificate on a website which will be part of a bugfix analysis
CERT_SUMMARY:127.0.0.1:127.0.0.1:631:sha256WithRSAEncryption:
/O=bfcentos7-test/CN=bfcentos7-test/emailAddress=root$bfcentos7-
test:/O=bfcentos7-test/CN=bfcentos7-test/emailAddress=root$bfcentos7-
test:170902005715Z:270831005715Z:self signed certificate
(consider output above to be a single line)
What I need is the best way in a bash shell to extract the sha256WithRSAEncryption part. It could be something else, such as sha384withRSAEncryption.
After CERT_SUMMARY it will always be 127.0.0.1:127.0.0.1:portnum. Above, the port is 631, but it could be anything.
This runs internally on a system and returns this string along with SSL or TLS (not pictured).
Here is another example of a return
CERT_SUMMARY:127.0.0.1:127.0.0.1:52311:sha256WithRSAEncryption:
/CN=ServerSigningCertificate_0/name=Type`Administrator
/name=DBName`ServerSigningCertificate_0:/C=US/CN=BLAHBLAH/
ST=California/L=Address, Emeryville CA 94608/O=IBM BigFix Evaluation
License/OU=Customer/emailAddress=blahblay#gmail.com/name=
Hash`sha1/name=Server`bigfix01/name=CustomActions`Enable
/name=LicenseAllocation`999999/name=CustomRetrievedProperties`Enable:
170702212459Z:270630212459Z:unable to get local issuer certificate
Thanks in advance.
Novice at shell programming, but learning!!

You ask for the best way, yet the description is not very precise: "This could be anything like sha384withRSAEncryption or something else."
Given the examples, the string you are looking for is the 5th field when : is the separator, so this command should do:
cut -f5 -d":"
If the output string had a strict fixed-width format, another easy option would be the cut command with -c. That is not the case here, though, since the port number varies in length.
CERT_SUMMARY:127.0.0.1:127.0.0.1:631:sha256WithRSAEncryption:

As @cyrus pointed out, this was as simple as picking the right column with awk... I am learning.
This worked
awk -F ":" '/CERT_SUMMARY/ {print $5}'
Thanks for the help!!
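The same field-picking idea, sketched in Python for comparison. The sample line is the first example from the question, rejoined into one line on the assumption that the breaks are only display wrapping. The function name is hypothetical, and unlike the awk /CERT_SUMMARY/ test, which matches anywhere in the line, this sketch requires CERT_SUMMARY to be the first field.

```python
# First example from the question, rejoined into a single line (assumed
# to be how the program actually emits it).
line = (
    "CERT_SUMMARY:127.0.0.1:127.0.0.1:631:sha256WithRSAEncryption:"
    "/O=bfcentos7-test/CN=bfcentos7-test/emailAddress=root$bfcentos7-test:"
    "/O=bfcentos7-test/CN=bfcentos7-test/emailAddress=root$bfcentos7-test:"
    "170902005715Z:270831005715Z:self signed certificate"
)


def signature_algorithm(summary):
    """Return the 5th colon-separated field of a CERT_SUMMARY line, else None."""
    fields = summary.split(":")
    if fields[0] != "CERT_SUMMARY" or len(fields) < 5:
        return None
    return fields[4]  # index 4 is the 5th field, like $5 in awk


algo = signature_algorithm(line)
print(algo)
```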

| sed -E 's/^([^:]*:){4}([^:]*):.*/\2/'
Regular expressions are your friend. If there is one thing you really should be familiar with when you need to do a lot of string parsing or string processing, it's regular expressions.
echo 'CERT_SUMMARY:127.0.0.1:127.0.0.1:52311:sha256WithRSAEncryption:/CN=ServerSigningCertificate_0/name=Type`Administrator/name=DBName`ServerSigningCertificate_0:/C=US/CN=BLAHBLAH/ST=California/L=Address, Emeryville CA 94608/O=IBM BigFix Evaluation License/OU=Customer/emailAddress=blahblay#gmail.com/name=Hash`sha1/name=Server`bigfix01/name=CustomActions`Enable/name=LicenseAllocation`999999/name=CustomRetrievedProperties`Enable:170702212459Z:270630212459Z:unable to get local issuer certificate' | sed -E 's/^([^:]*:){4}([^:]*):.*/\2/'
prints
sha256WithRSAEncryption
It's probably a bit overkill here, but there is almost nothing that cannot be done with regular expressions and as you have also built-in regex support in many languages today, knowing regex is never going to be a waste of time.
See also here to get a nice explanation of what each part of the regex actually means, including an interactive editing view. Basically I'm telling the regex parser to skip the first 4 groups, each consisting of any number of characters that are not : followed by a single :, then capture the 5th group, which consists of any number of characters that are not :, and finally match anything else (no matter what) to the end of the string. The whole regex is part of a sed "replace" operation, where I replace the whole string with just the content captured by the second capture group (everything in round parentheses is a capture group).
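The same substitution can be sketched with Python's re.sub, which accepts this pattern unchanged. The sample line below is a shortened, hypothetical version of the output in the question, kept on one line because . does not match a newline by default.

```python
import re

# Shortened, hypothetical CERT_SUMMARY line (single line, no wrapping).
line = (
    "CERT_SUMMARY:127.0.0.1:127.0.0.1:52311:sha256WithRSAEncryption:"
    "/CN=ServerSigningCertificate_0:170702212459Z:270630212459Z:"
    "unable to get local issuer certificate"
)

# Skip four "field:" groups, capture the fifth field, swallow the rest.
pattern = r"^([^:]*:){4}([^:]*):.*"

# Replace the whole line with the second capture group.
algo = re.sub(pattern, r"\2", line)
print(algo)

# A line that does not match the pattern is returned unchanged.
unchanged = re.sub(pattern, r"\2", "no colons here")
print(unchanged)
```

As with sed, input that does not match passes through untouched, so a caller may want to check for a match first rather than trust the result blindly.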

You could also try the following. It does not print by field number, so it still works if the sha256 string in your Input_file sits at a different position than in the sample shown.
awk '{match($0,/sha.*Encryption:/);if(substr($0,RSTART,RLENGTH)){print substr($0,RSTART,RLENGTH-1)}}' Input_file

Pipe the output to:
awk 'BEGIN{FS=":"} {print $5}'

You could also take a step back to the openssl x509 command 'name options'. Using sep_comma_plus avoids the slashes in the output and therefore your regex will be simpler.

How to find a substring of a double-quoted string with a dollar sign in Groovy

I want to correct automatically generated Linux scripts. I use the replaceAll(String, String) function to change "$APP_ARGS" to something else.
I have tried these variants:
replaceAll('"$APP_ARGS"', 'simulators ' + '"\\\\$APP_ARGS"') - doesn't find
replaceAll('\"\$APP_ARGS\"',... - doesn't find
replaceAll('"\$APP_ARGS"',... - doesn't find
replaceAll('\\"\\$APP_ARGS\\"',... - editor warning - excessive escape
replaceAll('"\\\\$APP_ARGS"',... - doesn't find
replaceAll('\\\\"\\\\$APP_ARGS\\\\"',... - doesn't find
replaceAll($/"$$APP_ARGS"/$, ...) - does not find
replaceAll('"[$]APP_ARGS"', 'something simple') - finds.
replaceAll('"[$]APP_ARGS"', '"\\\\$APP_ARGS"') - fails.
As you can see, the search works if I use the regex character-class form. But is there a way to make escaping work? I need that $ in the replacement string, too.
According to the Groovy manuals, a /../ string needs no escaping for anything except slashes themselves. But
replaceAll(/"$APP_ARGS"/,...
fails, too, with a message: Could not get unknown property 'APP_ARGS'.
It seems that the behaviour of this function follows no logic, and the correct solution has to be found by experiment.

replaceAll('"\\$APP_ARGS"', 'simulators ' + '"\\$APP_ARGS"')
A further possible pitfall is that the \\ before $ is needed in both strings, the pattern and the replacement.
The first argument of replaceAll is always treated as a regexp, so we need to quote $ (which otherwise means end of line). The second parameter may contain backreferences to groups from the regexp, which start with a $, so that one must be quoted too.
A saner way is to use replace instead of replaceAll, which already quotes/escapes both parameters for that usage.
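The split between regex-based and literal replacement exists in other languages too. The sketch below uses Python, not Groovy, purely to illustrate the pattern-side pitfall. Note the difference: Python replacement strings use backslashes rather than $ for group references, so only the escaping of the search text carries over directly. The script line is a made-up example.

```python
import re

# Hypothetical line from a generated start script.
script = 'exec java -jar app.jar "$APP_ARGS"'

target = '"$APP_ARGS"'
replacement = 'simulators "$APP_ARGS"'

# Unescaped, "$" is an end-of-line anchor, so the pattern never matches.
not_found = re.sub(target, replacement, script)

# re.escape quotes the "$"; a function as replacement avoids any
# interpretation of the replacement text.
via_regex = re.sub(re.escape(target), lambda m: replacement, script)

# Literal replacement needs no escaping on either side.
via_literal = script.replace(target, replacement)

print(not_found)
print(via_regex)
print(via_literal)
```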

Finding substring of variable length in bash

I have a string, such as time=1234, and I want to extract just the number after the = sign. However, this number could be in the range of 0 and 100000 (eg. - time=1, time=23, time=99999, etc.).
I've tried things like ${string:5:8}, but this only works for examples of a certain length.
How do I get the substring of everything after the = sign? I would prefer to do it without outside commands like cut or awk, because I will be running this script on devices that may or may not have that functionality. I know there are examples out there using outside functions, but I am trying to find a solution without the use of such.

s=time=1234
time_int=${s##*=}
echo "The content after the = in $s is $time_int"
This is a parameter expansion that removes the longest prefix matching *= from the front of the variable, that is, everything up to and including the last =.
If you intend this to be non-greedy (that is, to remove only the content up to the first = rather than the last =), use ${s#*=}, with a single # rather than two.
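Purely to illustrate the longest-versus-shortest distinction (not as a substitute for the pure-bash solution), here is how the two expansions behave, modelled in Python. The function names and the "a=b=c" value are made up for the example. Both helpers return the input unchanged when it contains no =, which is also what the bash expansions do.

```python
def strip_longest_prefix(s):
    """Model of ${s##*=}: drop everything up to and including the LAST '='."""
    head, sep, tail = s.rpartition("=")
    return tail if sep else s


def strip_shortest_prefix(s):
    """Model of ${s#*=}: drop everything up to and including the FIRST '='."""
    head, sep, tail = s.partition("=")
    return tail if sep else s


for value in ["time=1234", "a=b=c", "novalue"]:
    print(value, strip_longest_prefix(value), strip_shortest_prefix(value))
```

For time=1234 both forms give 1234. They only differ when the value contains more than one =.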
References:
The bash-hackers page on parameter expansion
BashFAQ #100 ("How do I do string manipulations in bash?")
BashFAQ #73 ("How can I use parameter expansion? How can I get substrings? [...]")
BashSheet quick-reference, parameter expansion section

If the time= part is constant, you can remove the prefix by using ${str#time=}.
Say you have str='time=123123'. If you execute echo ${str#time=}, you get 123123.

Determine the firefox version available for update using Python

I am looking for a snippet that checks which version is available to download as an update.
I use Python 3.x, so it would be nice if anyone has a hint on how I can check the version available on the server. The output should be a variable in which the version number of Firefox is stored, for example 22.0.
I am using Linux as my operating system.
To be clear:
I don't want to know which version is already installed on my system. I want to know which version I can update to.
So far i got the following code:
import os
def firefox_version_remote():
    firefox_version_fresh = os.popen("curl -s -l ftp.mozilla.org/pub/mozilla.org/firefox/releases/latest/linux-i686/de/").read()
    # short name for firefox version num fresh
    fvnf = " "
    for i in firefox_version_fresh:
        if i.isalpha() != True:
            fvnf = fvnf + i
    return fvnf.strip()
This returns -22.0..2 where it should return 22.0.

Have you considered using a regular expression to match the numbers you're trying to extract? That would be a lot easier. Something like this:
import re
matches = re.findall(r'\d+(?:\.\d+)+', firefox_version_fresh)
if matches:
    fvnf = matches[0]
That's assuming the version is of the form x.y, potentially followed by more sub-versions (e.g. x.y.z).
\d+ is one or more digits
(?: )+ is one or more of everything in the parentheses. The ?: tells the compiler that it's a non-capturing group, i.e. you're not interested in extracting the data inside the parentheses as a separate group.
\.\d+ matches a dot followed by one or more digits
So the whole expression can be described as one or more digits followed by one or more occurrences of a dot and one or more digits.
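Here is a self-contained sketch that runs without network access. It assumes the server listing contains a single file name, firefox-22.0.tar.bz2; that name is a guess, chosen because it reproduces the -22.0..2 output reported in the question. The helper name extract_version is hypothetical.

```python
import re

# Assumed listing, as `curl -s -l` might return it (one file name per line).
listing = "firefox-22.0.tar.bz2\n"

# The question's approach: drop every alphabetic character.
stripped = "".join(ch for ch in listing if not ch.isalpha()).strip()

# The regex approach: digits, then one or more ".digits" groups.
VERSION_RE = re.compile(r"\d+(?:\.\d+)+")


def extract_version(text):
    """Return the first x.y[.z...] version found in text, or None."""
    match = VERSION_RE.search(text)
    return match.group(0) if match else None


version = extract_version(listing)
print(stripped)
print(version)
```

The trailing 2 in bz2 is ignored by the regex because it is not followed by a dot and more digits.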
