In spaCy pattern matching, how do we get a bounded Kleene operator?

In spaCy pattern matching, I know that we can use the Kleene star operator for open-ended ranges. For example,
pattern = [{"LOWER": "hello"}, {"OP": "*"}]. Here the star, known as the Kleene operator, matches zero or more tokens. How can I modify the rule so that only 4 or 5 tokens are matched after the token "hello"?
In other NLP applications, for example in GATE, we can use a pattern like {Token.string == "hello"}({Token})[4,5] for this task. Does spaCy have any such mechanism?
Thanks

This isn't currently supported; see the feature request: https://github.com/explosion/spaCy/issues/5603.
In v3.0.6+, you can use the new match_alignments to filter matches in post-processing: https://spacy.io/api/matcher. The matcher can still be slow if patterns with a bare * produce many long or overlapping matches.
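In the meantime, one workaround is to match greedily with * and filter the resulting spans by length in post-processing. A minimal sketch (the helper name and the sample tuples are hypothetical; spaCy's Matcher does return (match_id, start, end) tuples):

```python
# Sketch of the post-processing workaround: run the pattern
# [{"LOWER": "hello"}, {"OP": "*"}] as usual, then keep only spans of
# the right length. spaCy's Matcher yields (match_id, start, end)
# tuples; the helper name and sample tuples below are hypothetical.
def filter_match_length(matches, lo, hi):
    """Keep matches whose token span length is between lo and hi."""
    return [(mid, s, e) for (mid, s, e) in matches if lo <= e - s <= hi]

# "hello" plus 4-5 trailing tokens means a total span of 5 or 6 tokens.
matches = [(1, 0, 3), (1, 0, 5), (1, 0, 6), (1, 0, 9)]
print(filter_match_length(matches, 5, 6))  # [(1, 0, 5), (1, 0, 6)]
```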

Related

Is there information in a spacy token indicative of the meaning of the token?

Suppose I have a spacy system and can easily mark a verb or punctuation member as having semantic meaning.
However, wherever possible I'd like to instead rely on native spacy information generated from the natural language processing pipeline.
For now, I have marked the following three items as semantic assignment operators in my code, and I rely on spaCy's branch head identification (obtained via an entity's head.lefts or head.rights) to isolate the colon. Then I analyze the semantic meaning of the sentence with the understanding that the lemma of the colon is in fact "be" or "list":
{ 'is', 'are', ':' }
However, I'd instead like to rely on some generic spacy linguistic information so that the system is less English-specific.
Is there any information, member, or property that will allow me to derive that the punctuation token is a semantic assignment operator?
For example, the verbs have the .lemma_ property that indicates they are what I am characterizing as assignment operators (.lemma_ = 'be') whereas the punctuation mark ':' does register as a token, but seems to have no indicative information as to its logical purpose.
Yet it is an explicit transitive operator, and it comes up almost 35% of the time a noun is given a state or membership in the technical prose I am analyzing.
I substituted textual colons with "is listed as" like so (the regex may not be correct under all circumstances):
re.sub(r'([A-Za-z]\.?): ', r'\1 is listed as ', text)
And spacy was able to process the sentence with the textual colon as a proper semantic token with a reasonably clear lemma.
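As a quick self-contained check of that substitution idea (with the character class tightened to [A-Za-z], since [A-z] also matches some punctuation, and a trailing space kept in the replacement; the sample sentence is invented):

```python
import re

text = "Ingredients: flour, water, salt."
# Replace "X: " with "X is listed as " so the colon becomes an
# explicit copular phrase that spaCy can lemmatize.
out = re.sub(r'([A-Za-z]\.?): ', r'\1 is listed as ', text)
print(out)  # Ingredients is listed as flour, water, salt.
```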

Repeating Pattern Matching in antlr4

I'm trying to write a lexer rule that would match following strings
a
aa
aaa
bbbb
the requirement here is all characters must be the same
I tried to use this rule:
REPEAT_CHARS: ([a-z])(\1)*
But \1 is not valid in ANTLR4. Is it possible to come up with a pattern for this?
You can’t do that in an ANTLR lexer. At least, not without target-specific code inside your grammar. And placing code in your grammar is something you should not do (it makes the grammar hard to read and ties it to that target language). It is better to do those kinds of checks/validations inside a listener or visitor.
Things like back-references and look-arounds are features that crept into the regex engines of programming languages. The regular expression syntax available in ANTLR (and every parser generator I know of) does not support those features; it describes true regular languages.
Many features found in virtually all modern regular expression libraries provide an expressive power that far exceeds the regular languages. For example, many implementations allow grouping subexpressions with parentheses and recalling the value they match in the same expression (backreferences). This means that, among other things, a pattern can match strings of repeated words like "papa" or "WikiWiki", called squares in formal language theory.
-- https://en.wikipedia.org/wiki/Regular_expression#Patterns_for_non-regular_languages
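For contrast, here is what the backreference buys you in an ordinary regex engine (Python's re module in this sketch), which is exactly the feature ANTLR's regular-language lexer lacks:

```python
import re

def all_same_char(s):
    # \1 recalls the first captured character; backreferences go beyond
    # regular languages, which is why ANTLR's lexer cannot express them.
    return bool(re.fullmatch(r'([a-z])\1*', s))

print(all_same_char("aaa"))   # True
print(all_same_char("bbbb"))  # True
print(all_same_char("aab"))   # False
```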

How to improve SpaCy matcher pattern

I use the spaCy token matcher to retrieve sentences with a certain structure, for example, "I want a banana".
Currently I use a pattern like this, based on POS tagging:
pattern = [{"POS": "PRON"}, {"POS": "VERB"}, {"POS": "NOUN"}]
But in this case, the spaCy matcher only looks for an exact sequence, and I would like it to find sentences in which these tokens appear in the declared order but not necessarily one after the other. For example, the pattern should also find the sentence "I want this banana".
I need a pattern that matches sentences whose tokens occur in the required order (as in the pattern) but may have other tokens in between.
You can insert {"OP": "*"} between tokens to match zero or more tokens of any type.
See all the operators here: https://spacy.io/usage/rule-based-matching#quantifiers
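A minimal sketch of that answer, using a blank pipeline and a Doc with hand-assigned POS tags so it runs without a trained model (the tag values are my assumptions about the example sentence):

```python
import spacy
from spacy.matcher import Matcher
from spacy.tokens import Doc

nlp = spacy.blank("en")

# The wildcard {"OP": "*"} allows any number of tokens between VERB and NOUN.
pattern = [{"POS": "PRON"}, {"POS": "VERB"}, {"OP": "*"}, {"POS": "NOUN"}]
matcher = Matcher(nlp.vocab)
matcher.add("WANT_NOUN", [pattern])

# Hand-assigned coarse POS tags stand in for a trained tagger here.
doc = Doc(nlp.vocab, words=["I", "want", "this", "banana"],
          pos=["PRON", "VERB", "DET", "NOUN"])

for _, start, end in matcher(doc):
    print(doc[start:end].text)  # I want this banana
```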

Is Rust's lexical grammar regular, context-free or context-sensitive?

The lexical grammar of most programming languages is fairly non-expressive in order to quickly lex it. I'm not sure what category Rust's lexical grammar belongs to. Most of it seems regular, probably with the exception of raw string literals:
let s = r##"Hi lovely "\" and "#", welcome to Rust"##;
println!("{}", s);
Which prints:
Hi lovely "\" and "#", welcome to Rust
As we can add arbitrarily many #, it seems like it can't be regular, right? But is the grammar at least context-free? Or is there something non-context free about Rust's lexical grammar?
Related: Is Rust's syntactical grammar context-free or context-sensitive?
The raw string literal syntax is not context-free.
If you think of it as a string surrounded by r#^k"…"#^k (using ^k to mean the # repeated k times), then you might expect it to be context-free:
raw_string_literal
: 'r' delimited_quoted_string
delimited_quoted_string
: quoted_string
| '#' delimited_quoted_string '#'
But that is not actually the correct syntax, because the quoted_string is not allowed to contain " followed by k hashes, although it can contain " followed by j hashes for any j < k.
Excluding the terminating sequence without excluding any other similar sequence of a different length cannot be accomplished with a context-free grammar because it involves three (or more) uses of the k-repetition in a single production, and stack automata can only handle two. (The proof that the grammar is not context-free is surprisingly complicated, so I'm not going to attempt it here for lack of MathJax. The best proof I could come up with uses Ogden's lemma and the uncommonly cited (but highly useful) property that context-free grammars are closed under the application of a finite-state transducer.)
C++ raw string literals are also context-sensitive [or would be if the delimiter length were not limited, see Note 1], and pretty well all whitespace-sensitive languages (like Python and Haskell) are context-sensitive. None of these lexical analysis tasks is particularly complicated so the context-sensitivity is not a huge problem, although most standard scanner generators don't provide as much assistance as one might like. But there it is.
Rust's lexical grammar offers a couple of other complications for a scanner generator. One issue is the double meaning of ', which is used both to create character literals and to mark lifetime variables and loop labels. Apparently it is possible to determine which of these applies by considering the previously recognized token. That could be solved with a lexical scanner which is capable of generating two consecutive tokens from a single pattern, or it could be accomplished with a scannerless parser; the latter solution would be context-free but not regular. (C++'s use of ' as part of numeric literals does not cause the same problem; the C++ tokens can be recognized with regular expressions, because the ' can not be used as the first character of a numeric literal.)
Another slightly context-dependent lexical issue is that the range operator, .., takes precedence over floating point values, so that 2..3 must be lexically analysed as three tokens: 2 .. 3, rather than two floating point numbers 2. .3, which is how it would be analysed in most languages which use the maximal munch rule. Again, this might or might not be considered a deviation from regular expression tokenisation, since it depends on trailing context. But since the lookahead is at most one character, it could certainly be implemented with a DFA.
Postscript
On reflection, I am not sure that it is meaningful to ask about a "lexical grammar". Or, at least, it is ambiguous: the "lexical grammar" might refer to the combined grammar for all of the language's tokens, or it might refer to the act of separating a sentence into tokens. The latter is really a transducer, not a parser, and raises the question of whether the language can be tokenised with a finite-state transducer. (The answer, again, is no, because raw strings cannot be recognized by an FSA, or even a PDA.)
Recognizing individual tokens and tokenising an input stream are not necessarily equivalent. It is possible to imagine a language in which the individual tokens are all recognized by regular expressions but an input stream cannot be handled with a finite-state transducer. That will happen if there are two regular expressions T and U such that some string matching T is the longest token which is a strict prefix of an infinite set of strings in U. As a simple (and meaningless) example, take a language with tokens:
a
a*b
Both of these tokens are clearly regular, but the input stream cannot be tokenized with a finite-state transducer, because it must examine any sequence of a's (of any length) before deciding whether to fall back to the first a or to accept the token consisting of all the a's and the following b (if present).
Few languages show this pathology (and, as far as I know, Rust is not one of them), but it is technically present in some languages in which keywords are multiword phrases.
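That a / a*b example can be sketched with a maximal-munch tokenizer. Note how the match step must look arbitrarily far ahead through a run of a's before committing, which is exactly what a finite-state transducer cannot do (the token names are made up):

```python
import re

# Maximal-munch tokenizer for the two-token language {a, a*b}.
# The regex scan must look arbitrarily far ahead through a run of a's
# before deciding between single 'A' tokens and one 'AB' token, which
# is why no finite-state transducer can do this job.
TOKENS = [("AB", re.compile(r"a*b")), ("A", re.compile(r"a"))]

def tokenize(s):
    out, i = [], 0
    while i < len(s):
        best = None
        for name, rx in TOKENS:
            m = rx.match(s, i)
            if m and (best is None or m.end() > best[1]):
                best = (name, m.end())  # keep the longest match
        if best is None:
            raise ValueError(f"no token at position {i}")
        out.append((best[0], s[i:best[1]]))
        i = best[1]
    return out

print(tokenize("aaab"))  # [('AB', 'aaab')]
print(tokenize("aaa"))   # [('A', 'a'), ('A', 'a'), ('A', 'a')]
```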
Notes
Actually, C++ raw string literals are, in a technical sense, regular (and therefore context free) because their delimiters are limited to strings of maximum length 16 drawn from an alphabet of 88 characters. That means that it is (theoretically) possible to create a regular expression consisting of 13,082,362,351,752,551,144,309,757,252,761 patterns, each matching a different possible raw string delimiter.

What is the difference between lemmatization vs stemming?

When do I use each ?
Also...is the NLTK lemmatization dependent upon Parts of Speech?
Wouldn't it be more accurate if it was?
Short and dense: http://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html
The goal of both stemming and lemmatization is to reduce inflectional forms and sometimes derivationally related forms of a word to a common base form.
However, the two words differ in their flavor. Stemming usually refers to a crude heuristic process that chops off the ends of words in the hope of achieving this goal correctly most of the time, and often includes the removal of derivational affixes. Lemmatization usually refers to doing things properly with the use of a vocabulary and morphological analysis of words, normally aiming to remove inflectional endings only and to return the base or dictionary form of a word, which is known as the lemma .
From the NLTK docs:
Lemmatization and stemming are special cases of normalization. They identify a canonical representative for a set of related word forms.
Lemmatisation is closely related to stemming. The difference is that a
stemmer operates on a single word without knowledge of the context,
and therefore cannot discriminate between words which have different
meanings depending on part of speech. However, stemmers are typically
easier to implement and run faster, and the reduced accuracy may not
matter for some applications.
For instance:
The word "better" has "good" as its lemma. This link is missed by
stemming, as it requires a dictionary look-up.
The word "walk" is the base form for word "walking", and hence this
is matched in both stemming and lemmatisation.
The word "meeting" can be either the base form of a noun or a form
of a verb ("to meet") depending on the context, e.g., "in our last
meeting" or "We are meeting again tomorrow". Unlike stemming,
lemmatisation can in principle select the appropriate lemma
depending on the context.
Source: https://en.wikipedia.org/wiki/Lemmatisation
Stemming just removes or stems the last few characters of a word, often leading to incorrect meanings and spelling. Lemmatization considers the context and converts the word to its meaningful base form, which is called the lemma. Sometimes the same word can have multiple different lemmas, so we should identify the part-of-speech (POS) tag for the word in that specific context. Here are examples to illustrate the differences and use cases:
If you lemmatize the word 'Caring', it would return 'Care'. If you stem it, it would return 'Car', which is erroneous.
If you lemmatize the word 'Stripes' in a verb context, it would return 'Strip'. If you lemmatize it in a noun context, it would return 'Stripe'. If you just stem it, it would return 'Strip'.
You would get the same results whether you lemmatize or stem words such as walking, running, swimming... to walk, run, swim, etc.
Lemmatization is computationally expensive since it involves look-up tables and the like. If you have a large dataset and performance is an issue, go with stemming. Remember that you can also add your own rules to stemming. If accuracy is paramount and the dataset isn't humongous, go with lemmatization.
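The 'Caring'/'better' contrasts above can be sketched with a crude suffix-chopper versus a tiny hand-built lemma table (both are toys of my own invention; real tools such as NLTK's PorterStemmer and WordNetLemmatizer are far more complete):

```python
# Minimal sketch contrasting a crude suffix stemmer with a tiny
# (hypothetical, hand-built) lemma lookup table.
def crude_stem(word):
    # Chop common suffixes blindly, in the spirit of rule-based stemmers.
    for suf in ("ing", "es", "s", "ed"):
        if word.endswith(suf) and len(word) > len(suf) + 2:
            return word[: -len(suf)]
    return word

# A real lemmatizer consults a full vocabulary; this table is made up.
LEMMAS = {"caring": "care", "better": "good", "stripes": "stripe"}

def toy_lemmatize(word):
    return LEMMAS.get(word, word)

print(crude_stem("caring"))      # car  (erroneous)
print(toy_lemmatize("caring"))   # care
print(crude_stem("better"))      # better (the link to 'good' is missed)
print(toy_lemmatize("better"))   # good
```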
There are two aspects to show their differences:
A stemmer will return the stem of a word, which need not be identical to the morphological root of the word. It is usually sufficient that related words map to the same stem, even if the stem is not in itself a valid root. Lemmatisation, in contrast, returns the dictionary form of a word, which must be a valid word.
In lemmatisation, the part of speech of a word must first be determined, and the normalisation rules differ for different parts of speech, while a stemmer operates on a single word without knowledge of the context and therefore cannot discriminate between words which have different meanings depending on part of speech.
Reference http://textminingonline.com/dive-into-nltk-part-iv-stemming-and-lemmatization
The purpose of both stemming and lemmatization is to reduce morphological variation. This is in contrast to the more general "term conflation" procedures, which may also address lexico-semantic, syntactic, or orthographic variations.
The real difference between stemming and lemmatization is threefold:
Stemming reduces word-forms to (pseudo)stems, whereas lemmatization reduces the word-forms to linguistically valid lemmas. This difference is apparent in languages with more complex morphology, but may be irrelevant for many IR applications;
Lemmatization deals only with inflectional variance, whereas stemming may also deal with derivational variance;
In terms of implementation, lemmatization is usually more sophisticated (especially for morphologically complex languages) and usually requires some sort of lexica. Satisfactory stemming, on the other hand, can be achieved with rather simple rule-based approaches.
Lemmatization may also be backed up by a part-of-speech tagger in order to disambiguate homonyms.
As MYYN pointed out, stemming is the process of removing inflectional and sometimes derivational affixes to a base form that all of the original words are probably related to. Lemmatization is concerned with obtaining the single word that allows you to group together a bunch of inflected forms. This is harder than stemming because it requires taking the context into account (and thus the meaning of the word), while stemming ignores context.
As for when you would use one or the other, it's a matter of how much your application depends on getting the meaning of a word in context correct. If you're doing machine translation, you probably want lemmatization to avoid mistranslating a word. If you're doing information retrieval over a billion documents with 99% of your queries ranging from 1-3 words, you can settle for stemming.
As for NLTK, the WordNetLemmatizer does use the part of speech, though you have to provide it (otherwise it defaults to nouns). Passing it "dove" and "v" yields "dive" while "dove" and "n" yields "dove".
An example-driven explanation of the differences between lemmatization and stemming:
Lemmatization handles matching “car” to “cars” along
with matching “car” to “automobile”.
Stemming handles matching “car” to “cars” .
Lemmatization implies a broader scope of fuzzy word matching that is
still handled by the same subsystems. It implies certain techniques
for low level processing within the engine, and may also reflect an
engineering preference for terminology.
[...] Taking FAST as an example,
their lemmatization engine handles not only basic word variations like
singular vs. plural, but also thesaurus operators like having “hot”
match “warm”.
This is not to say that other engines don’t handle synonyms, of course
they do, but the low level implementation may be in a different
subsystem than those that handle base stemming.
http://www.ideaeng.com/stemming-lemmatization-0601
Stemming is the process of removing the last few characters of a given word, to obtain a shorter form, even if that form doesn't have any meaning.
Examples,
"beautiful" -> "beauti"
"corpora" -> "corpora"
Stemming can be done very quickly.
Lemmatization, on the other hand, is the process of converting the given word into its base form according to the dictionary meaning of the word.
Examples,
"beautiful" -> "beauty"
"corpora" -> "corpus"
Lemmatization takes more time than stemming.
I think stemming is a rough hack people use to get all the different forms of the same word down to a base form, which need not be a legitimate word on its own.
Something like the Porter stemmer uses simple regexes to eliminate common word suffixes.
Lemmatization brings a word down to its actual base form which, in the case of irregular verbs, might look nothing like the input word
Something like Morpha which uses FSTs to bring nouns and verbs to their base form
Huang et al. describes the Stemming and Lemmatization as the following. The selection depends upon the problem and computational resource availability.
Stemming identifies the common root form of a word by removing or replacing word suffixes (e.g. “flooding” is stemmed as “flood”), while lemmatization identifies the inflected forms of a word and returns its base form (e.g. “better” is lemmatized as “good”).
Huang, X., Li, Z., Wang, C., & Ning, H. (2020). Identifying disaster related social media for rapid response: a visual-textual fused CNN architecture. International Journal of Digital Earth, 13(9), 1017–1039. https://doi.org/10.1080/17538947.2019.1633425
Stemming and lemmatization both generate a base form of the inflected words; the only difference is that a stem may not be an actual word, whereas a lemma is an actual language word.
Stemming follows an algorithm with steps to perform on the words, which makes it faster. In lemmatization, a corpus is also consulted to supply the lemma, which makes it slower than stemming. You furthermore have to specify the part of speech to get the proper lemma.
These points show that if speed is the priority, stemming should be used, since lemmatizers scan a corpus, which costs time and processing. The problem you're working on decides whether you should use stemmers or lemmatizers.
for more info visit the link:
https://towardsdatascience.com/stemming-vs-lemmatization-2daddabcb221
Stemming
is the process of reducing morphological variants of a word to a root/base form. Stemming programs are commonly referred to as stemming algorithms or stemmers.
Often when searching text for a certain keyword, it helps if the search returns variations of the word.
For instance, searching for “boat” might also return “boats” and “boating”. Here, “boat” would be the stem for [boat, boater, boating, boats].
Lemmatization
looks beyond word reduction and considers a language’s full vocabulary to apply a morphological analysis to words. The lemma of ‘was’ is ‘be’ and the lemma of ‘mice’ is ‘mouse’.
I did refer this link,
https://towardsdatascience.com/stemming-vs-lemmatization-2daddabcb221
In short, the difference between these algorithms is that only lemmatization includes the meaning of the word in the evaluation. In stemming, only a certain number of letters are cut off from the end of the word to obtain a word stem. The meaning of the word does not play a role in it.
In short:
Lemmatization: uses context to transform words to their
dictionary(base) form also known as Lemma
Stemming: uses the stem of the word, most of the time removing derivational affixes.
source