Matching [STRING] or [STRING] - antlr4

I'm trying to match lines like these in Antlr4:
John or apple and John Smith or apple sauce.
I use the following rules:
conjunction : WORDS OR WORDS ;
WORDS: [A-Za-z ]+ ;
OR: ' or ' ;
But the first WORDS that ANTLR finds also includes 'or'. So it does not see John and apple as two different words separated by or.
How can I fix this?

If two or more lexer rules match the same number of characters, the rule defined first wins. In other words, for the input or, both the rules WORDS and OR can match. Since WORDS is defined first, it wins.
Swap the order:
conjunction : WORDS OR WORDS ;
OR: ' or ' ;
WORDS: [A-Za-z ]+ ;
However, ANTLR's lexer matches as much as possible. So the above will only work if you tokenize or. If you try to tokenize John Smith or apple sauce, the rule WORDS will match everything!
You should not include space:
conjunction : WORDS+ OR WORDS+ ;
OR: 'or' ;
WORDS: [A-Za-z]+ ;
SPACES: [ \t\r\n] -> skip ;
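To make the two lexer rules concrete (longest match first, then the earliest rule on a tie), here is a minimal Python sketch of a toy tokenizer. It is not ANTLR's implementation; the function tokenize and the rule lists original, swapped and fixed are hypothetical names that mirror the three grammars in this thread.

```python
import re

def tokenize(rules, text):
    """Toy lexer: the longest match wins, ties go to the rule listed first."""
    tokens = []
    pos = 0
    while pos < len(text):
        best_name, best_len = None, 0
        for name, pattern in rules:
            m = re.compile(pattern).match(text, pos)
            # Strictly greater, so an earlier rule keeps a tie.
            if m and m.end() - pos > best_len:
                best_name, best_len = name, m.end() - pos
        if best_name is None:
            raise ValueError(f"no rule matches at position {pos}")
        if best_name != "SPACES":  # mimic '-> skip'
            tokens.append((best_name, text[pos:pos + best_len]))
        pos += best_len
    return tokens

# The three lexers discussed above, in rule order.
original = [("WORDS", r"[A-Za-z ]+"), ("OR", r" or ")]
swapped = [("OR", r" or "), ("WORDS", r"[A-Za-z ]+")]
fixed = [("OR", r"or"), ("WORDS", r"[A-Za-z]+"), ("SPACES", r"[ \t\r\n]")]

print(tokenize(original, " or "))                        # WORDS wins the tie
print(tokenize(swapped, " or "))                         # OR wins the tie
print(tokenize(swapped, "John Smith or apple sauce"))    # one big WORDS token
print(tokenize(fixed, "John Smith or apple sauce"))      # WORDS WORDS OR WORDS WORDS
```

With the fixed rules, a word such as orange still becomes a single WORDS token, because the longer match beats OR.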
When I test the inputs John or apple and John Smith or apple sauce with a parser generated from the grammar above, both are parsed as a conjunction. [parse tree images not preserved]

Related

Is there a way that I can use the cut command with space as delimiter and treat a word with space like Costa Rica as a single word?

I have created this file concacaf.txt with the following input
David Canada 5
Larin Canada 5
Borges Costa Rica 2
Buchanan Canada 2
Davis Panama 2
Gray Jamaica 2
Henriquez El Salvador 2
Is there a way that I can either use the cut command and treat Costa Rica or El Salvador as a single word or modify the text so that when I use:
cut -f 1,3 -d ' ' concacaf.txt
I get 'Borges 2' instead of 'Borges Rica'. Thanks
It is not possible using cut but it is possible using sed:
sed -E 's/^([^ ]*) .* ([^ ]*)$/\1 \2/' concacaf.txt
It searches for the first word ([^ ]*, a sequence of non-space characters) at the beginning of the line and the word at the end of the line and replaces the entire line with the first word and the last word and a space between them.
The option -E tells sed to use extended regular expressions (by default it uses basic regular expressions, where the grouping parentheses need to be escaped).
The sed command is s (substitute). It searches each line using a regular expression and replaces the matching substring with the provided replacement string. In the replacement string, \1 represents the substring matching the first capturing group, \2 the second group and so on.
The regular expression is explained below:
^ # matches the beginning of line
( # starts a group (it is not a matcher)
[^ ] # matches any character that is not a space (there is a space after `^`)
* # the previous sub-expression, zero or more times
) # close the group; the matched substring is captured
# there is a space here in the expression; it matches a space
.* # match any character, any number of times
# match a space
([^ ]*) # another group that matches a sequence of non-space characters
$ # match the end of the line
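For readers who want to try the same substitution outside sed, here is a rough Python equivalent. It is only a sketch: the helper name first_and_last is made up for illustration, and Python's re syntax happens to accept this particular expression unchanged.

```python
import re

def first_and_last(line):
    """Keep the first and the last space-separated word of a line."""
    # Same expression as the sed command: first word, anything, last word.
    return re.sub(r"^([^ ]*) .* ([^ ]*)$", r"\1 \2", line)

for line in ["David Canada 5", "Borges Costa Rica 2", "Henriquez El Salvador 2"]:
    print(first_and_last(line))
```

A line with fewer than two spaces does not match the expression and is returned unchanged, which is also what the sed command does.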
You can use rev to cut out that last field containing the integer:
$ cat concacaf.txt | rev | cut -d' ' -f2- | rev
David Canada
Larin Canada
Borges Costa Rica
Buchanan Canada
Davis Panama
Gray Jamaica
Henriquez El Salvador
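The rev | cut | rev pipeline drops the last space-separated field. A small Python sketch of the same idea, with drop_last_field as a hypothetical helper name, splits once from the right instead of reversing the line:

```python
def drop_last_field(line):
    """Remove the last space-separated field, like rev | cut -d' ' -f2- | rev."""
    # rsplit with maxsplit=1 splits only at the last space.
    return line.rsplit(" ", 1)[0]

for line in ["David Canada 5", "Borges Costa Rica 2", "Henriquez El Salvador 2"]:
    print(drop_last_field(line))
```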

Regex - Stop after finding the first pattern

For a string like this:
1. Jane, Doe2. Good, Jay3. Turn, Bob[key]
If no [key] is present, Jane, Doe needs to be extracted, i.e. whatever is between 1. and 2.
(or)
Turn, Bob if [key] is present
Put another way:
If [key] is present, then the person before [key] needs to be extracted and the process stopped.
If [key] is not present, then pick up whoever is after 1.
I tried this but it pulls up both Jane, Doe and Turn, Bob
(\.([^\.])(.+)\[key\])|(1\.(.+)2\.)
How do I stop after the first successful pattern, given that patterns are read left to right? [key] can follow any of the entries: 1, 2 or 3.
Thanks.
For these requirements, you may use this regex in Python with an alternation:
(?<=\d\.\s)[a-zA-Z, ]+(?=\[key])|(?<=1\.\s)(?!.*\[key])[a-zA-Z, ]+
RegEx Details:
(?<=\d\.\s): Positive lookbehind to assert that there is a digit followed by dot followed by a whitespace before the current position
[a-zA-Z, ]+: Match 1+ of letter, space or comma characters
(?=\[key]): Positive lookahead to assert that there is a text [key] after the current position
|: OR
(?<=1\.\s): Positive lookbehind to assert that there is a digit 1 followed by dot followed by a whitespace before the current position
(?!.*\[key]): Negative lookahead to assert that there is no [key] text after the current position
[a-zA-Z, ]+: Match 1+ of letter, space or comma characters
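A quick way to check the regex is to run it against the sample string with [key] in different places. The sketch below uses Python's re module; the helper name extract is made up for illustration.

```python
import re

PATTERN = re.compile(
    r"(?<=\d\.\s)[a-zA-Z, ]+(?=\[key])"      # a name directly before [key]
    r"|(?<=1\.\s)(?!.*\[key])[a-zA-Z, ]+"    # otherwise the name after "1. "
)

def extract(text):
    """Return the first match, or None when nothing matches."""
    m = PATTERN.search(text)
    return m.group(0) if m else None

print(extract("1. Jane, Doe2. Good, Jay3. Turn, Bob[key]"))   # Turn, Bob
print(extract("1. Jane, Doe2. Good, Jay[key]3. Turn, Bob"))   # Good, Jay
print(extract("1. Jane, Doe2. Good, Jay3. Turn, Bob"))        # Jane, Doe
```

Using search rather than findall returns only the first successful match, which is the "stop after the first pattern" behaviour asked for.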
Not sure why you put .+ into your regex but it's greedy and matches . Good, Jay3. Turn, Bob. so the left part of the alternation matches.
Suggest you remove the .+ on both sides of the alternation ( | ).

XML Schema validation pattern shall not allow string

I want to allow alphanumeric characters except for the word "AAAA"
I am using the below regex
To allow alphanumeric characters <xs:pattern value="[A-Za-z0-9]{2,4}"/>
Not to allow AAAA as <xs:pattern value="[^A]{4}"/>
But if I combine both it does not work.
Please help
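It is not easy to exclude a specific string using a regex. The pattern [^A]{4} does not mean "not 4 occurrences of A"; it means 4 occurrences of "not A".
I think something like this should work (the line breaks and the spaces around | are only for readability; remove them in the actual pattern value, because whitespace in an XSD pattern is matched literally):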
[A-Za-z0-9]{2,3} |
[B-Za-z0-9][A-Za-z0-9]{3} |
[A-Za-z0-9][B-Za-z0-9][A-Za-z0-9]{2} |
[A-Za-z0-9]{2}[B-Za-z0-9][A-Za-z0-9] |
[A-Za-z0-9]{3}[B-Za-z0-9]
which means,
a 2-char or 3-char alphanumeric string or
a 4 char alphanumeric string with the 1st char not 'A' or
a 4 char alphanumeric string with the 2nd char not 'A' or
a 4 char alphanumeric string with the 3rd char not 'A' or
a 4 char alphanumeric string with the 4th char not 'A'
There might be an easier solution, but I cannot think of it.
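The alternation can be sanity-checked outside a schema validator. The sketch below rebuilds it in Python, which is an assumption of convenience: Python's re is not the XSD regex dialect, but this subset (character classes, counts, alternation) behaves the same. XSD patterns must match the whole value, so fullmatch is used. The names ALNUM, NOT_A and is_valid are hypothetical.

```python
import re

ALNUM = "[A-Za-z0-9]"
NOT_A = "[B-Za-z0-9]"   # alphanumeric, but not an uppercase 'A'

alternatives = [
    ALNUM + "{2,3}",                    # 2 or 3 characters
    NOT_A + ALNUM + "{3}",              # 4 characters, 1st is not 'A'
    ALNUM + NOT_A + ALNUM + "{2}",      # 4 characters, 2nd is not 'A'
    ALNUM + "{2}" + NOT_A + ALNUM,      # 4 characters, 3rd is not 'A'
    ALNUM + "{3}" + NOT_A,              # 4 characters, 4th is not 'A'
]
PATTERN = re.compile("|".join(alternatives))

def is_valid(value):
    """True when the whole value matches one of the alternatives."""
    return PATTERN.fullmatch(value) is not None

for value in ["AA", "AAA", "AAAA", "AAAB", "BAAA", "ab12", "A", "AAAAA"]:
    print(value, is_valid(value))
```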

Linux Shell - Grep command

I'm having a problem using grep with these options: \{n\} , \{n,\} and \{n,m\} . I have a file named "new" with these lines:
aaaa
aaa
aa
a
When I use grep 'a\{1\}' new I get this output:
aaaa
aaa
aa
a
So, basically, this command will show me the lines that include 1, or more, consecutive occurrences of the character "a" right?
Also, will grep 'a\{1,\}' new do the same thing as grep 'a\{1\}' new? I get the same output for both.
As for the last one, \{n,m\}, I can't really understand what it does.
I would really appreciate it if anyone could help me out.
From man grep:
Repetition
A regular expression may be followed by one of several repetition
operators:
? The preceding item is optional and matched at most once.
* The preceding item will be matched zero or more times.
+ The preceding item will be matched one or more times.
{n} The preceding item is matched exactly n times.
{n,} The preceding item is matched n or more times.
{n,m} The preceding item is matched at least n times, but not more than m times.
The example grep 'a\{2,3\}' new also matches the line aaaa, because its first three (or two) a's satisfy the pattern. The rest of the line does not matter.
If you want only runs of 2 or 3 consecutive a's to be printed, you could use the -o flag. But be aware that this outputs aaa and aa from a line with aaaaa. To avoid this you have to add more context, such as the line anchors ^ and $ in the example below.
By the way, I would suggest using the -E flag (or egrep, which is the same) to get extended regex support. You don't have to escape the braces then.
For input
aaaaa
aaaa
aaa
aa
a
a call of grep -o -E '^a{2,3}$' will give the output:
aaa
aa
grep 'a\{n,m\}' new means grepping at least n number of a and at most m number of a from new.
For example, grep 'a\{2,3\}' new will output
aaaa
aaa
aa
The last line does not match because it has only one a.
For 'a\{n,\}', omitting m means any number greater than or equal to n.
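The behaviour described in these answers can be reproduced with a small Python sketch. Python writes intervals without backslashes, like grep -E. The helpers grep and grep_o are hypothetical stand-ins for plain grep and grep -o, not real bindings to the tool.

```python
import re

lines = ["aaaaa", "aaaa", "aaa", "aa", "a"]

def grep(pattern, lines):
    """Keep lines where the pattern matches anywhere, like plain grep."""
    return [line for line in lines if re.search(pattern, line)]

def grep_o(pattern, lines):
    """Return only the matched parts, like grep -o."""
    return [m for line in lines for m in re.findall(pattern, line)]

print(grep(r"a{1}", lines))           # every line contains at least one a
print(grep(r"a{2,3}", lines))         # every line except "a"
print(grep_o(r"a{2,3}", ["aaaaa"]))   # ['aaa', 'aa']
print(grep_o(r"^a{2,3}$", lines))     # ['aaa', 'aa']
```

Without anchors the interval only limits how much a single match consumes, not how many a's the line may contain, which is why a{1} and a{1,} select the same lines.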

Matching only a <tab> that is between two numbers

How to match a tab only when it is between two numbers?
Sample script
209.65834 27.23204908
119.37987 15.03317082
74.240635 8.30561924
29.1014 0
931.8861 -100.00000
-16.03784 -8.30562
;
_mirror
l
;
29.1014 0
1028.10 0.00
n
_spline
935.4875 250
924.2026913 269.8820375
912.9178825 277.4506484
890.348265 287.3181854
(in the above script, the tabs are between the numbers, not the spaces) (blank lines are significant; there is nothing in them, but I can't lose them)
I wish to get a "," between the numbers. Tried with :%s/\t/\,/ but that will touch the empty lines too, and the end of lines.
Try this:
:%s/\(\d\)\t\(-\?\d\)/\1,\2/
\d matches any digit. -\? means "an optional -". The pairs of (escaped) parentheses capture the match; \1 refers to the first captured match and \2 to the second.
google://vim+regex -> http://vimregex.com/ ->
:%s/\([0-9]\)\t\([0-9]\)/\1,\2/gc
You have 2 groups of numbers here ([0-9]) and tab-symbols \t between them. Add some escape symbols and you have the answer.
g replaces every match on a line; c asks for confirmation before each replacement.
\1 and \2 are matching groups (numbers in your case).
It's not really hard to find answer for questions like that by yourself.
try
:%s/\([0-9]\)\t\([0-9]\)/\1,\2/g
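explanation - search for the pattern <digit>\t<digit> and remember the parts that match <digit>.
\( ... \) captures and remembers the part that matches.
\1 recalls the first captured digit, \2 the second captured digit.
So if the match was on 123\t789, <digit>\t<digit> matches 3\t7,
and the 3 and 7 are remembered as \1 and \2.
Note that a tab followed by a negative number (e.g. -100.00000) is not matched by this pattern; use \(-\?[0-9]\) for the second group to allow an optional minus sign.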
or
:g/[0-9]/ s/\t/,/g
explanation - filter all lines with a digit, then substitute tabs with a comma on those lines
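To compare the digit-only pattern with the one that allows a leading minus, the substitution can be sketched in Python. This is an illustration rather than Vim itself: Python groups with bare parentheses and re.sub replaces every match, like the g flag. The function tabs_to_commas and its allow_minus parameter are hypothetical.

```python
import re

def tabs_to_commas(line, allow_minus=True):
    """Replace a tab with a comma only when it sits between two numbers."""
    # The second group optionally accepts the minus sign of a negative number.
    second = r"(-?\d)" if allow_minus else r"(\d)"
    return re.sub(r"(\d)\t" + second, r"\1,\2", line)

samples = ["209.65834\t27.23204908", "931.8861\t-100.00000", "", "_mirror", "29.1014\t0\t"]
for line in samples:
    print(repr(tabs_to_commas(line)), repr(tabs_to_commas(line, allow_minus=False)))
```

Empty lines and a tab at the end of a line are left alone in both variants; only the variant with the optional minus converts the tab in front of -100.00000.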

Resources