How to improve a spaCy matcher pattern

I use the spaCy token Matcher to retrieve sentences with a certain structure, for example, "I want a banana".
Right now I use a pattern like this, based on POS tags:
pattern = [{"POS": "PRON"}, {"POS": "VERB"}, {"POS": "NOUN"}]
But in this case the matcher only looks for an exact contiguous match, and I would like it to find sentences in which these tokens appear in the declared order but not necessarily one right after the other. For example, the pattern should also find the sentence "I want this banana".
I need a pattern that matches the tokens in the required order (as in the pattern above) but allows other tokens in between.

You can use {"OP": "*"} to match zero or more tokens of any type.
See all the operators here: https://spacy.io/usage/rule-based-matching#quantifiers
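For example, a minimal sketch of the adjusted pattern (assuming spaCy v3 and the en_core_web_sm model; the rule name is arbitrary):

import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)

# {"OP": "*"} matches zero or more tokens of any type, so the verb
# and the noun no longer need to be adjacent.
pattern = [{"POS": "PRON"}, {"POS": "VERB"}, {"OP": "*"}, {"POS": "NOUN"}]
matcher.add("PRON_VERB_NOUN", [pattern])

doc = nlp("I want this banana")
for match_id, start, end in matcher(doc):
    print(doc[start:end].text)  # "I want this banana"

Note that a bare "*" can produce several overlapping matches on longer sentences, so you may want to keep only the longest span per start position.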

Related

Is there information in a spacy token indicative of the meaning of the token?

Suppose I have a spaCy pipeline and can easily mark a verb or punctuation token as having semantic meaning.
However, wherever possible I'd like to instead rely on native spacy information generated from the natural language processing pipeline.
For now, I have marked the following three items as semantic assignment operators in my code, and I rely on spaCy's head identification (obtained via an entity's head.lefts or head.rights) to isolate the colon. Then I analyze the semantic meaning of the sentence with the understanding that the lemma of the colon is in fact "be" or "list":
{ 'is', 'are', ':' }
However, I'd instead like to rely on some generic spacy linguistic information so that the system is less English-specific.
Is there any information, member, or property that will allow me to derive that the punctuation token is a semantic assignment operator?
For example, the verbs have the .lemma_ property that indicates they are what I am characterizing as assignment operators (.lemma_ = 'be') whereas the punctuation mark ':' does register as a token, but seems to have no indicative information as to its logical purpose.
Yet it is an explicit transitive operator, and it comes up almost 35% of the time a noun is given a state or membership in the technical prose I am analyzing.
I substituted textual colons with "is listed as" like so (the regex may not be correct under all circumstances):
re.sub(r'([A-Za-z][.]?): ', r'\1 is listed as ', text)
And spacy was able to process the sentence with the textual colon as a proper semantic token with a reasonably clear lemma.
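As a minimal end-to-end sketch of that workaround (the example sentence is made up; assumes the en_core_web_sm model):

import re
import spacy

nlp = spacy.load("en_core_web_sm")

text = "The result: success"
rewritten = re.sub(r'([A-Za-z][.]?): ', r'\1 is listed as ', text)

# "listed" now lemmatizes to "list", giving the former colon an explicit lemma.
for token in nlp(rewritten):
    print(token.text, token.lemma_, token.pos_)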

In Spacy pattern matching, how do we get bounded Kleene operator?

In spaCy pattern matching, I know that we can use the Kleene operator for repetition. For example,
pattern = [{"LOWER": "hello"}, {"OP": "*"}]. Here the star, known as the Kleene operator, means match zero or more tokens. How can I modify the rule so that exactly 4 or 5 tokens are matched after the token "hello"?
In other NLP applications, for example in GATE, we can use a pattern like {Token.string == "hello"}({Token})[4,5] for this task. Does spaCy have any such mechanism?
Thanks
This isn't currently supported; see the feature request: https://github.com/explosion/spaCy/issues/5603.
In v3.0.6+, you can use the new match_alignments to filter matches in post-processing: https://spacy.io/api/matcher. The matcher will still be slow if patterns with a bare * end up producing a lot of long or overlapping matches.
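Until then, a common workaround is to enumerate the allowed lengths explicitly, since an empty dict {} matches any single token. A minimal sketch (the sentence is made up for illustration):

import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)

# Emulate the GATE-style ({Token})[4,5] bound with one pattern per length;
# {} is a wildcard that matches any single token.
patterns = [[{"LOWER": "hello"}] + [{}] * n for n in (4, 5)]
matcher.add("HELLO_4_5", patterns)

doc = nlp("hello there my dear old friend of mine")
for match_id, start, end in matcher(doc):
    print(doc[start:end].text)

This prints both the 4-token and the 5-token continuation after "hello".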

Semantically matching camelcase or underscore separated words

I need to merge two ontologies into a single file based on the semantic similarity of their concepts. To do this, I use NLP to determine which concepts are semantically similar. In some cases, ontology concepts are named in camelCase or as underscore-separated words. Are there any algorithms to semantically match camelCase or underscore-separated words? If there are none, could you suggest a way?
I already found some algorithms for matching two words or sentences semantically (the SEMILAR library, cortical.io, the Similarity library and, of course, WordNet), but none of them can match two camelCase or underscore-separated words. I know we can start by splitting the camelCase words, but I have no clue what to do next. I am also new to NLP and don't know if there is a simple way to achieve this.
I expect an algorithm or approach that matches two camelCase or underscore-separated words semantically and outputs a similarity score indicating how similar they are.
Update:
I also found this WS4J demo for measuring semantic similarity between words and sentences, but it still cannot handle camelCase or underscore-separated words.
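One simple approach is to split the identifiers into plain words first and then compare them with any word- or sentence-level similarity measure. A rough sketch using spaCy word vectors (assuming a vectors model such as en_core_web_md; the identifiers are made-up examples):

import re
import spacy

nlp = spacy.load("en_core_web_md")

def split_identifier(name):
    # Turn "hasAuthor" or "written_by" into "has author" / "written by".
    words = name.replace("_", " ")
    words = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", " ", words)
    return words.lower()

def identifier_similarity(a, b):
    # Cosine similarity between the averaged word vectors of the two phrases.
    return nlp(split_identifier(a)).similarity(nlp(split_identifier(b)))

print(identifier_similarity("hasAuthor", "written_by"))

Any of the word-level measures mentioned above (e.g. the WordNet-based ones) could be substituted for the vector similarity once the identifiers are split.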

What Is the Difference Between POS Tagging and Shallow Parsing?

I'm currently taking a Natural Language Processing course at my university and am still confused about some basic concepts. I found the definition of POS tagging in the Foundations of Statistical Natural Language Processing book:
Tagging is the task of labeling (or tagging) each word in a sentence with its appropriate part of speech. We decide whether each word is a noun, verb, adjective, or whatever.
But I can't find a definition of shallow parsing in the book, even though it describes shallow parsing as one of the uses of POS tagging. So I searched the web and found no direct explanation of shallow parsing except on Wikipedia:
Shallow parsing (also chunking, "light parsing") is an analysis of a sentence which identifies the constituents (noun groups, verbs, verb groups, etc.), but does not specify their internal structure, nor their role in the main sentence.
I frankly don't see the difference, but that may be because of my English or just me not understanding a simple basic concept. Can anyone please explain the difference between shallow parsing and POS tagging? Is shallow parsing often also called shallow semantic parsing?
Thanks in advance.
POS tagging would give a POS tag to each and every word in the input sentence.
Parsing the sentence (using the Stanford PCFG parser, for example) would convert the sentence into a tree whose leaves hold POS tags (which correspond to words in the sentence), but the rest of the tree tells you how exactly these words join together to make the overall sentence. For example, an adjective and a noun might combine to form a 'Noun Phrase', which might combine with another adjective to form another Noun Phrase (e.g. quick brown fox) (the exact way the pieces combine depends on the parser in question).
You can see what parser output looks like at http://nlp.stanford.edu:8080/parser/index.jsp
A shallow parser or 'chunker' comes somewhere in between these two. A plain POS tagger is really fast but does not give you enough information, and a full-blown parser is slow and gives you too much. A POS tagger can be thought of as a parser which only returns the bottom-most tier of the parse tree; a chunker can be thought of as a parser that returns some other tier of the parse tree instead. Sometimes you just need to know that a bunch of words together form a Noun Phrase but don't care about the sub-structure of the tree within those words (i.e. which words are adjectives, determiners, nouns, etc., and how they combine). In such cases you can use a chunker to get exactly the information you need instead of wasting time generating the full parse tree for the sentence.
POS tagging is the process of deciding the type of every token in a text, e.g. NOUN, VERB, DETERMINER, etc. A token can be a word or a punctuation mark.
Meanwhile, shallow parsing or chunking is the process of dividing a text into syntactically related groups.
POS tagging output
My/PRP$ dog/NN likes/VBZ his/PRP$ food/NN ./.
Chunking output
[NP My dog] [VP likes] [NP his food]
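A minimal sketch of both steps using NLTK (the chunk grammar here is a toy rule written for this one sentence, not a general grammar):

import nltk

# Requires: nltk.download("punkt") and nltk.download("averaged_perceptron_tagger")
tokens = nltk.word_tokenize("My dog likes his food.")
tagged = nltk.pos_tag(tokens)  # POS tagging: one tag per token
print(tagged)

# Shallow parsing: group the tagged tokens into flat, non-recursive chunks.
chunker = nltk.RegexpParser(r"NP: {<PRP\$>?<JJ>*<NN>+}")
print(chunker.parse(tagged))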
The Constraint Grammar framework is illustrative. In its simplest, crudest form, it takes POS-tagged text as input and adds what you could call part-of-clause tags. For an adjective, for example, it could add #NN> to indicate that it is part of an NP whose head word is to the right.
In a POS tagger, we tag words using a tagset like {noun, verb, adj, adv, ...},
while a shallow parser tries to identify sub-components, such as named entities and phrases in the sentence, like:
"I'm currently (taking a Natural (Language Processing course) at (my University)) and (still confused with some basic concept.)"
D. Jurafsky and J. H. Martin say in their book that a shallow parse (partial parse) is a parse that doesn't extract all the possible information from the sentence, but just extracts the information valuable for the specific case.
Chunking is just one of the approaches to shallow parsing. As mentioned above, it extracts only information about basic non-recursive phrases (e.g. verb phrases or noun phrases).
Other approaches, for example, produce flattened parse trees. These trees may contain information about part-of-speech tags, but defer decisions that may require semantic or contextual factors, such as PP attachments, coordination ambiguities, and nominal compound analyses.
So, a shallow parse is a parse that produces a partial parse tree, and chunking is an example of such parsing.

using Dependency Parser in Stanford coreNLP

I am using Stanford CoreNLP (http://nlp.stanford.edu/software/corenlp.shtml) to parse sentences and extract dependencies between the words.
I have managed to create the dependency graph like in the example in the supplied link, but I don't know how to work with it. I can print the entire graph using the toString() method, but the problem is that the methods that search for certain words in the graph, such as getChildList, require an IndexedWord object as a parameter. It is clear why, since the nodes of the graph are of type IndexedWord, but it's not clear to me how to create such an object in order to search for a specific node.
For example: I want to find the children of the node that represents the word "problem" in my sentence. How do I create an IndexedWord object that represents the word "problem" so I can search for it in the graph?
In general, you shouldn't be creating your own IndexedWord objects. (These are used to represent "word tokens", i.e., particular words in a text, not "word types", and so asking for the word "problem" -- a word type -- isn't really valid; in particular, a sentence could have multiple tokens of this word type.)
There are a couple of convenience methods that let you do what you want:
sg.getNodeByWordPattern(String pattern)
sg.getAllNodesByWordPattern(String pattern)
The first is a little dangerous, since it just returns the first IndexedWord matching the pattern, or null if there are none. But it's most directly what you asked for.
Some other methods to start from are:
sg.getFirstRoot() to find the (first, usually only) root of the graph and then to navigate down from there, such as by using the sg.getChildren(root) method.
sg.vertexSet() to get all of the IndexedWord objects in the graph.
sg.getNodeByIndex(int) if you already know the input sentence, and therefore can ask for words by their integer index.
Commonly these methods leave you iterating through nodes. Really, the first two get...Node... methods just do the iteration for you.