Lexer definition to match keyword or an abbreviation - antlr4

For writing a parser I would like to not only match full keywords but also abbreviations thereof, for example
MY-KEYWORD
must at least match
MY-KEY
but also any exact match longer than that, namely
MY-KEYW or MY-KEYWO or MY-KEYWOR or the full MY-KEYWORD
Is this possible with a reasonable lexer fragment, or will I have to define specific alternative matches?
TIA
Alex

Easiest would be to do something like this:
MY_KEYWORD
: 'MY-KEY' ('W' ('O' ('R' 'D'?)?)?)?
;
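The same nesting can be generated mechanically. The following is a minimal Python sketch, not ANTLR: it builds an equivalent regular expression from a keyword and a minimum abbreviation length, which may help when there are many keywords to abbreviate. The helper name abbreviation_regex and its parameters are hypothetical, introduced only for illustration.

```python
import re


def abbreviation_regex(keyword: str, min_len: int) -> str:
    """Return a regex matching `keyword` or any prefix of at least `min_len` characters."""
    if not 0 < min_len <= len(keyword):
        raise ValueError("min_len must be between 1 and len(keyword)")
    required, optional = keyword[:min_len], keyword[min_len:]
    tail = ""
    # Build the nested optional groups from the innermost character outwards,
    # mirroring ('W' ('O' ('R' 'D'?)?)?)? in the lexer rule.
    for ch in reversed(optional):
        tail = f"(?:{re.escape(ch)}{tail})?"
    return re.escape(required) + tail


pattern = re.compile(abbreviation_regex("MY-KEYWORD", 6))

for candidate in ["MY-KE", "MY-KEY", "MY-KEYWO", "MY-KEYWORD", "MY-KEYWD"]:
    print(candidate, bool(pattern.fullmatch(candidate)))
```

Because each later character sits inside the optional group of the previous one, a candidate such as MY-KEYWD (which skips letters) is rejected, just as it would be by the lexer rule.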

Related

Regex for specific permutations of a word

I am working on a wordle bot and I am trying to match words using regex. I am stuck at a problem where I need to look for specific permutations of a given word.
For example, if the word is "steal" these are all the permutations:
'tesla', 'stale', 'steal', 'taels', 'leats', 'setal', 'tales', 'slate', 'teals', 'stela', 'least', 'salet'.
I had some trouble creating a regex for this, but eventually stumbled on positive lookaheads which solved the issue. regex -
'(?=.*[s])(?=.*[l])(?=.*[a])(?=.*[t])(?=.*[e])'
But, if we are looking for specific permutations, how do we go about it?
For example, words that look like 's[lt]a[lt]e'. The matching words are 'slate', 'stale', 'state'. But I want to limit the count of l and t in the matched word, which means the output should be 'slate' & 'stale'. One obvious solution is the regex r'slate|stale', but this is not a general solution. I am trying to arrive at a general solution for any scenario, and the use of positive lookahead above seemed like a starting point. But I am unable to arrive at a solution.
Do we combine positive lookaheads with normal regex?
s(?=.*[lt])a(?=.*[lt])e (Did not work)
Or do we write nested lookaheads or something?
A few more regex that did not work -
s(?=.*[lt]a[tl]e)
s(?=.*[lt])(?=.*[a])(?=.*[lt])(?=.*[e])
I tried to look through the available posts on SO, but could not find anything that would help me understand this. Any help is appreciated.
You could append the regex which matches the permutations of interest to your existing regex. In your sample case, you would use:
(?=.*s)(?=.*l)(?=.*a)(?=.*t)(?=.*e)s[lt]a[lt]e
This will match only stale and slate; it won't match state because it fails the lookahead that requires an l in the word.
Note that you don't need the (?=.*s)(?=.*a)(?=.*e) in the above regex as they are required by the part that matches the permutations of interest. I've left them in to keep that part of the regex generic and not dependent on what follows it.
Demo on regex101
Note that to allow for duplicated characters you might want to change your lookaheads to something in this form:
(?=(?:[^s]*s){1}[^s]*)
You would change the quantifier on the group to match the number of occurrences of that character which are required.
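As a rough check of the combined approach, here is a small Python sketch that prepends one lookahead per required letter to the positional pattern and filters a word list. The word list and variable names are made up for illustration. The counted lookahead at the end adds a `$` anchor, which is an assumption on my part: with the anchor the group count becomes an exact count rather than a minimum.

```python
import re

words = ["slate", "stale", "state", "steal", "least"]

# One lookahead per letter that must appear somewhere in the word.
required_letters = "slate"
lookaheads = "".join(f"(?=.*{re.escape(c)})" for c in required_letters)

# Lookaheads first, then the positional pattern.
pattern = re.compile(lookaheads + "s[lt]a[lt]e")
matches = [w for w in words if pattern.fullmatch(w)]
print(matches)  # ['slate', 'stale']

# Counted variant (assumption: anchored with $ so the count is exact).
# This one requires exactly two occurrences of 't'.
two_t = re.compile(r"(?=(?:[^t]*t){2}[^t]*$)s[lt]a[lt]e")
double_t_matches = [w for w in words if two_t.fullmatch(w)]
print(double_t_matches)  # ['state']
```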

spacy how do I make a matcher which is noun-noun without white space within it?

I tried to make a matcher which could detect words like
'all-purpose'
I was trying to make a pattern like
pattern=[{'POS':'NOUN'}, {'ORTH':'-'},{'POS':'NOUN'}]
However, I realized that it only find the matches like
'all - purpose' with white space between tokens instead of 'all-purpose'.
How could I make a matcher like this?
It has to be a generalized pattern like noun-noun instead of
specific words like 'Barak Obama' as in the example in spacy documentation
Best,
What exactly are you trying to match? Using en_core_web_sm, "all-purpose" is three tokens and all has the ADV POS tag for me. So that might be the issue with your match pattern. If you just want hyphenated words this might be a better match:
pattern = [{'IS_ALPHA': True}, {'ORTH':'-'}, {'IS_ALPHA': True}]
More generally, you are correct that your pattern will only match three tokens, though that doesn't require white space - it depends on how the tokenizer works. For example, "that's" has no spaces but is two tokens.
If you are finding hyphenated words that occur as one token and want to match them, you can use regular expressions in Matcher rules. Here's an example of how that would work from the docs:
pattern = [{"TEXT": {"REGEX": "deff?in[ia]tely"}}]
In your case it could just look like this:
pattern = [{"TEXT": {"REGEX": "-"}}]

XML schema restriction pattern for not allowing specific string

I need to write an XSD schema with a restriction on a field, to ensure that
the value of the field does not contain the substring FILENAME at any location.
For example, all of the following must be invalid:
FILENAME
ORIGINFILENAME
FILENAMETEST
123FILENAME456
None of these values should be valid.
In a regular expression language that supports negative lookahead, I could do this by writing /^((?!FILENAME).)*$ but the XSD pattern language does not support negative lookahead.
How can I implement an XSD pattern restriction with the same effect as /^((?!FILENAME).)*$ ?
I need to use pattern, because I don't have access to XSD 1.1 assertions, which are the other obvious possibility.
The question XSD restriction that negates a matching string covers a similar case, but in that case the forbidden string is forbidden only as a prefix, which makes checking the constraint easier. How can the solution there be extended to cover the case where we have to check all locations within the input string, and not just the beginning?
OK, the OP has persuaded me that while the other question mentioned has an overlapping topic, the fact that the forbidden string is forbidden at all locations, not just as a prefix, complicates things enough to require a separate answer, at least for the XSD 1.0 case. (I started to add this answer as an addendum to my answer to the other question, and it grew too large.)
There are two approaches one can use here.
First, in XSD 1.1, a simple assertion of the form
not(matches($v, 'FILENAME'))
ought to do the job.
Second, if one is forced to work with an XSD 1.0 processor, one needs a pattern that will match all and only strings that don't contain the forbidden substring (here 'FILENAME').
One way to do this is to ensure that the character 'F' never occurs in the input. That's too drastic, but it does do the job: strings not containing the first character of the forbidden string do not contain the forbidden string.
But what of strings that do contain an occurrence of 'F'? They are fine, as long as no 'F' is followed by the string 'ILENAME'.
Putting that last point more abstractly, we can say that any acceptable string (any string that doesn't contain the string 'FILENAME') can be divided into two parts:
a prefix which contains no occurrences of the character 'F'
zero or more occurrences of 'F' followed by a string that doesn't match 'ILENAME' and doesn't contain any 'F'.
The prefix is easy to match: [^F]*.
The strings that start with F but don't match 'FILENAME' are a bit more complicated; just as we don't want to outlaw all occurrences of 'F', we also don't want to outlaw 'FI', 'FIL', etc. -- but each occurrence of such a dangerous string must be followed either by the end of the string, or by a letter that doesn't match the next letter of the forbidden string, or by another 'F' which begins another region we need to test. So for each proper prefix of the forbidden string, we create a regular expression of the form
$prefix || '([^F' || next-character-in-forbidden-string || ']'
|| '[^F]*)?'
Then we join all of those regular expressions with or-bars.
The end result in this case is something like the following (I have inserted newlines here and there, to make it easier to read; before use, they will need to be taken back out):
[^F]*
((F([^FI][^F]*)?)
|(FI([^FL][^F]*)?)
|(FIL([^FE][^F]*)?)
|(FILE([^FN][^F]*)?)
|(FILEN([^FA][^F]*)?)
|(FILENA([^FM][^F]*)?)
|(FILENAM([^FE][^F]*)?))*
Two points to bear in mind:
XSD regular expressions are implicitly anchored; testing this with a non-anchored regular expression evaluator will not produce the correct results.
It may not be obvious at first why the alternatives in the choice all end with [^F]* instead of .*. Thinking about the string 'FEEFIFILENAME' may help. We have to check every occurrence of 'F' to make sure it's not followed by 'ILENAME'.
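The construction can be sketched in Python to generate the pattern and try it against the examples from the question. This is an illustration under stated assumptions, not an XSD tool: the function names not_containing and is_valid are hypothetical, re.fullmatch stands in for XSD's implicit anchoring, and the sketch only handles alphanumeric forbidden strings whose first character does not recur later in the string (true for 'FILENAME').

```python
import re


def not_containing(forbidden: str) -> str:
    """Build a pattern for strings that do not contain `forbidden`.

    Assumption: `forbidden` is alphanumeric and its first character occurs
    only once in it; the construction is not valid otherwise.
    """
    first = forbidden[0]
    if not forbidden.isalnum() or first in forbidden[1:]:
        raise ValueError("sketch needs an alphanumeric string with a unique first character")
    alternatives = []
    # One alternative per proper prefix: the prefix, optionally followed by a
    # character that is neither `first` nor the next forbidden character,
    # and then any run of characters other than `first`.
    for i in range(1, len(forbidden)):
        prefix, nxt = forbidden[:i], forbidden[i]
        alternatives.append(f"({prefix}([^{first}{nxt}][^{first}]*)?)")
    return f"[^{first}]*(" + "|".join(alternatives) + ")*"


PATTERN = not_containing("FILENAME")


def is_valid(value: str) -> bool:
    # fullmatch emulates the implicit anchoring of XSD patterns.
    return re.fullmatch(PATTERN, value) is not None


for value in ["FILENAME", "ORIGINFILENAME", "123FILENAME456", "FEEFIFILENAME", "FILENAM", "FEEFIFO"]:
    print(value, is_valid(value))
```

Testing with fullmatch rather than search matters here, for the reason given in the first point above: an unanchored search would succeed on the empty prefix of almost any input.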

Troubles with returns declaration on the first parser rule in an ANTLR4 grammar

I am using returns for my parser rules, which works for all parser rules except the first one. If the first parser rule in my grammar uses the returns declaration, ANTLR4 complains as follows:
expecting ARG_ACTION while matching a rule
If I add another parser rule above which does not use "returns" ANTLR does not complain.
Here you have a grammar reduced to the problem:
grammar FirstParserRuleReturnIssue;
ID : ('a'..'z'|'A'..'Z'|'_') ('a'..'z'|'A'..'Z'|'0'..'9'|'_')*;
aRule returns [String s]: ID { $s = $ID.text; };
I searched for a special role of the first rule that could explain the behaviour but did not find anything. Is it a bug? Am I missing something?
You need to place parser rules (start with a lowercase letter) before lexer rules (start with an uppercase letter) in your grammar. After encountering a lexer rule, the [ triggers a LEXER_CHAR_SET instead of ARG_ACTION, so the token stream seen by the compiler looks like you're passing a set of characters where the return value should be.

In R, how do I replace a string that contains a certain pattern with another string?

I'm working on a project involving cleaning a list of data on college majors. I find that a lot are misspelled, so I was looking to use the function gsub() to replace the misspelled ones with its correct spelling. For example, say 'biolgy' is misspelled in a list of majors called Major. How can I get R to detect the misspelling and replace it with its correct spelling? I've tried gsub('biol', 'Biology', Major) but that only replaces the first four letters in 'biolgy'. If I do gsub('biolgy', 'Biology', Major), it works for that case alone, but that doesn't detect other forms of misspellings of 'biology'.
Thank you!
You should either define some nifty regular expression, or use agrep from base package. stringr package is another option, I know that people use it, but I'm a very huge fan of regular expressions, so it's a no-no for me.
Anyway, agrep should do the trick:
agrep("biol", "biology")
[1] 1
agrep("biolgy", "biology")
[1] 1
EDIT:
You should also use ignore.case = TRUE, but be prepared to do some bookkeeping "by hand"...
You can set up a vector of all the possible misspellings and then do a loop over a gsub call. Something like:
biologySp = c("biolgy","biologee","bologee","bugs")
for(sp in biologySp){
  Major = gsub(sp,"Biology",Major)
}
If you want to do something smarter, see if there's any fuzzy matching packages on CRAN, or something that uses 'soundex' matching....
The Wikipedia page on approximate string matching might be useful, and try searching R-help for some of the key terms.
http://en.wikipedia.org/wiki/Approximate_string_matching
You could first match the majors against a list of available majors; any not matching would then be the likely misspellings. Then use the agrep function to match these against the known majors again (agrep does approximate matching, so if it is similar to a correct value then you will get a match).
The vwr package has methods for string matching:
http://ftp.heanet.ie/mirrors/cran.r-project.org/web/packages/vwr/index.html
so your best bet might be to use the string with the minimum Levenshtein distance from the possible subject strings:
> levenshtein.distance("physcs",c("biology","physics","geography"))
  biology   physics geography
        7         1         9
If you get identical minima then flip a coin:
> levenshtein.distance("biolsics",c("biology","physics","geography"))
  biology   physics geography
        4         4         8
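For readers who want to see what the minimum-distance idea involves, here is a minimal sketch in Python rather than R. It is not the vwr implementation: levenshtein and closest are hypothetical helpers written for illustration. Returning every candidate at the minimum distance keeps ties visible instead of silently picking one.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        previous = current
    return previous[-1]


def closest(word: str, candidates: list[str]) -> list[str]:
    """Return every candidate at the minimum distance from `word`."""
    distances = {c: levenshtein(word, c) for c in candidates}
    best = min(distances.values())
    return [c for c, d in distances.items() if d == best]


majors = ["biology", "physics", "geography"]
print(closest("physcs", majors))    # ['physics']
print(closest("biolsics", majors))  # ['biology', 'physics'], a tie
```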
example 1a) perl/linux regex: 's/oldstring/newstring/'
example 1b) R equivalent of 1a: srcstring=sub(oldstring, newstring, srcstring)
example 2a) perl/linux regex: 's/oldstring//'
example 2b) R equivalent of 2a: srcstring=sub(oldstring, "", srcstring)
