Is there a way to keep g4 from tokenizing a variable name as a keyword lexer rule? - antlr4

I defined some lexer rules as given below:
DATE: D A T E ;
ID : '&'*? IDENTIFIER ;
IDENTIFIER : [a-zA-Z_] [a-zA-Z_0-9]*;
But consider a line of code like this:
keep date column1 column2;
Here, date is a variable name rather than the keyword DATE. So my question is: is it possible to make g4 treat date as an ID token instead of a DATE token?

The ANTLR lexer is in no way influenced by your parser rules.
It operates directly on the input stream of characters, and if multiple lexer rules match a sequence of characters, the tie is broken by these two rules:
1 - The rule that matches the longest sequence of input characters takes precedence. (In your case, the IDENTIFIER rule and the DATE rule both match the same "date" sequence of characters, so this rule does not settle it.)
2 - If two rules match a character sequence of the same length, the rule that appears first in the grammar "wins". (In your case, the DATE rule occurs first, so the "date" sequence of characters is recognized as a DATE token.)
It makes absolutely no difference that a parser rule might be looking for an IDENTIFIER; the lexer has tokenized the input without any influence from the parser rules, and the parser rules match against the stream of tokens the lexer produced.
If you want "date" to be acceptable in this context, you'll need to have your parser rule accept both an IDENTIFIER and a DATE token at that position.
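The two tie-breaking rules can be illustrated with a small toy model in Python. This is only a sketch of the behaviour described above, not ANTLR's implementation: the `RULES` table, the `tokenize` function, the `SEMI` rule and the `NAME_TOKENS` set are all made up for illustration, and the ID and IDENTIFIER rules from the question are folded into a single ID pattern.

```python
import re

# Toy model of the tie-breaking described above; NOT how ANTLR is implemented.
# Rules are listed in grammar order, so earlier entries win ties.
RULES = [
    ("DATE", re.compile(r"date", re.IGNORECASE)),
    ("ID", re.compile(r"&*[a-zA-Z_][a-zA-Z_0-9]*")),
    ("SEMI", re.compile(r";")),  # assumed rule so the sample line tokenizes fully
]

def tokenize(text):
    tokens = []
    pos = 0
    while pos < len(text):
        if text[pos].isspace():
            pos += 1
            continue
        best = None  # (rule name, end position) of the best match so far
        for name, pattern in RULES:
            m = pattern.match(text, pos)
            # Only a strictly longer match replaces the current best,
            # so on equal length the earlier rule is kept.
            if m and m.end() > pos and (best is None or m.end() > best[1]):
                best = (name, m.end())
        if best is None:
            raise ValueError(f"no rule matches at position {pos}")
        tokens.append((best[0], text[pos:best[1]]))
        pos = best[1]
    return tokens

tokens = tokenize("keep date column1 column2;")

# "Parser side": accept either token type wherever a column name is expected.
NAME_TOKENS = {"ID", "DATE"}
names = [text for kind, text in tokens[1:] if kind in NAME_TOKENS]
```

In this model "date" comes out as a DATE token because both patterns match four characters and DATE is listed first, while "dates" comes out as ID because the longer match wins. The last two lines mirror the suggested fix: the consumer of the tokens accepts both types where a name is allowed.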

Related

Exact pattern match for Regex

I am creating a Python program to evaluate a user-entered expression f(x) in terms of x, numbers, and mathematical operations. I want a regular expression that validates the string entered by the user.
It must contain an 'x'.
It may or may not contain numbers, mathematical operators like +, -, *, / and a decimal point '.'
For example, 1+x**2 is valid but 1+x+y is not (since there is a 'y').
Please help me with the regex. So far I have [x0-9\.\+\-\*\/]+

Regex: VBA : Evaluation of Expression with Conditions

I am trying to find a cell reference in a formula expression (in VBA) so that I can replace it with another cell reference. For example, I have the following formulas in different cells: "=AL82+L8+L82", "=L8+L82" and "=AL82+L8". I have to change "L8" in each of the formulas to "L9". I am new to regex and was trying the following pattern:
"(?=[^A-Z])([L8])(?=[^0-9])"
However, only the 8 is changed to L9. Please help me find the error.
Thanks
You can capture either a plus or an equals sign in a capturing group.
Then match L8 and assert, using a negative lookahead, that the 8 is not directly followed by a digit.
In the replacement, use group 1 followed by L9: $1L9
([+=])L8(?!\d)
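The question is about VBA, but the behaviour of the pattern can be checked quickly with a minimal Python sketch. The helper name `replace_cell` is made up for illustration, and `\g<1>` is Python's spelling of the `$1` backreference used above.

```python
import re

# Pattern from the answer: capture '+' or '=', match L8,
# and require that no digit follows the 8.
CELL_PATTERN = re.compile(r"([+=])L8(?!\d)")

def replace_cell(formula):
    # \g<1> re-inserts the captured '+' or '=' in front of the new cell name.
    return CELL_PATTERN.sub(r"\g<1>L9", formula)

for formula in ["=AL82+L8+L82", "=L8+L82", "=AL82+L8"]:
    print(formula, "->", replace_cell(formula))
```

Only the standalone L8 is rewritten in each of the three formulas from the question. AL82 is skipped because its L8 is not preceded by a plus or equals sign, and L82 is skipped because a digit follows the 8.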

remove exponents from a formula string

I would like to isolate all operands in a formula (given as a string) by removing the arithmetic operators, that is, removing: "+", "-", "/", "*", "**2"
The formula string is something like:
"y=A+B1*options+B2*items**2+B3*factor+B4"
I can manage most arithmetic operators, but not the exponent part "**2". It has to be a wildcard search or similar (not positional), because the whole formula might change in the future and might also have another exponent (e.g. **5 or **54).
What would be the easiest way to strip "**?" out of the formula, where ? can be any number?
To match the pattern you want, use the regex string r"\*\*\d+"
Breakdown:
r"" denotes a raw string literal in Python; it keeps the backslashes intact so they reach the regex engine unchanged (see the re module for more info)
\* matches a single * character - because the * is a special character in regex, we escape it with the \
\d matches a digit
+ matches the previous pattern at least once greedily: this means it will try to find at least one digit, then keep finding digits until it can find no more. So, it will match **2, **44382, and so on
As for stripping the pattern from the equation, you can do re.sub(pattern, "", equation) - replacing all instances of the pattern with nothing
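Put together, a minimal sketch on the formula from the question could look like this. The final split into operands is an assumption about what "isolating the operands" should produce, and the variable names are illustrative.

```python
import re

formula = "y=A+B1*options+B2*items**2+B3*factor+B4"

# Remove every exponent such as **2, **5 or **54.
without_exponents = re.sub(r"\*\*\d+", "", formula)

# Split on the remaining operators (and '=') to isolate the operands;
# empty strings are filtered out in case two operators are adjacent.
operands = [part for part in re.split(r"[=+\-*/]", without_exponents) if part]

print(without_exponents)
print(operands)
```

Stripping the exponents first matters: splitting on "*" directly would leave the exponent digits behind as if they were operands.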

XSD Regular expression to match first 4 characters

I need an XSD regular expression that matches the first 4 characters and does not match any other characters for the group. I have tried [A]{4}[^A]*, but the remaining characters are still being captured by the group.

Match regex defined in text/config file against strings in database with filter/whitelist

The main thrust is "how do I take a list of regex strings, contained in a text file, and find all matches on a specific database column taking into account exclusions/whitelist".
Sample Text file:
[Bad 192 address]=192\.168\.(1|2)\.\d{1,3} / (192\.168\.(1|2)\.1)
[Bad 172 address]=172\.(?:1[6-9]|2[0-9]|3[01])\.\d{1,3}\.\d{1,3} / (172\.(?:1[6-9]|2[0-9]|3[01])\.\d{1,3}\.1)
[Bad 10 address]=10\.\d{1,3}\.\d{1,3}\.\d{1,3} / (10\.0\.(0|120|250)\.1)
In the above example, I have the name of the regex match in brackets, then the raw regex match, and finally the filter I wish to apply to this regex match, in parentheses, separated from the original regex match by a "/".
I would like to point to a file containing regex matches structured like this and run them against a column within a table, or a number of tables, in a database. In this example the idea is to find all private IP space matches that aren't whitelisted and output the matches along with the associated signature's name. So a hit might look like "[Bad 192 address] 192.168.1.58".
Initially I iterated through the file line by line, split each rule into an array of 3 items (the sig, the regex, and the filter), assigned them to variables inside a function, and sent one SQL SELECT query per rule, trying to discard the whitelisted values from the returned matches. This isn't working reliably and performs terribly. For context, I need to be able to iterate through potentially 1 million rows, though I might be able to reduce that out-of-band to thousands of unique values.
I am mainly concerned with the mechanism for taking the regex strings defined in File A, running them against Database B, discarding any whitelisted values, and outputting each matching string along with its associated "signature".
