How to define pos_pattern for extracting nouns followed by a sequence of zero or more nouns or adjectives for KeyphraseCountVectorizer?

I'm trying to extract Arabic keywords from tweets. I'm using keyBERT with KeyphraseCountVectorizer
vectorizer = KeyphraseCountVectorizer(pos_pattern='<N.*>*')
I'm trying to write a more customized pos_pattern regex to select nouns followed by a sequence of zero or more nouns or adjectives, but not verbs.
Can you please help me write the right regex?
Thank you

I interpret your requirement to match "nouns followed by a sequence of zero or more nouns or adjectives" as matching one or more sequential nouns (i.e. <N.*>+), followed by zero or more adjectives (i.e. <J.*>*). Putting these together, the full regex is as follows:
vectorizer = KeyphraseCountVectorizer(pos_pattern="<N.*>+<J.*>*")
As a side point, you note that you are attempting to extract Arabic keywords. From my understanding, the keyphrase_vectorizers package relies on the text being annotated with spaCy PoS tags, so to change languages from the default (English) you have to load a corresponding pipeline/model in the desired language and set the stop words to those of the new language. For example, when using the KeyphraseCountVectorizer for German:
vectorizer = KeyphraseCountVectorizer(spacy_pipeline='de_core_news_sm', stop_words='german')
However, at present spaCy does not have a pipeline trained for Arabic text, which means that using KeyphraseCountVectorizer with Arabic text is not possible without workarounds (something you may have already solved, but I thought I'd mention it).

Related

How to add a tokenizer exception for whitespaces in spaCy language models

The following is my code, where I take a user input.
import en_core_web_sm
nlp = en_core_web_sm.load()
text = input("please enter your text or words here")
doc = nlp(text)
print([t.text for t in doc])
If the user inputs the text Deep Learning, the text is broken into
['Deep', 'Learning']
How can I add a whitespace exception in nlp, such that the output is like the below?
['Deep Learning']
The raw text from the user input is split on whitespace characters, similar to text.split(' '). Then, the tokenizer processes the text from left to right. On each substring, it performs two checks:
Does the substring match a tokenizer exception rule? For example, "don't" does not contain whitespace, but should be split into two tokens, "do" and "n't", while "U.K." should always remain one token.
Can a prefix, suffix or infix be split off? For example punctuation like commas, periods, hyphens or quotes.
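Both checks can be seen directly on a small example (a quick illustrative sketch):
import spacy

nlp = spacy.load("en_core_web_sm")
# "don't" is split by an exception rule, "U.K." stays one token,
# and the trailing comma is split off as a suffix
doc = nlp("They don't live in the U.K., do they?")
print([t.text for t in doc])
# ['They', 'do', "n't", 'live', 'in', 'the', 'U.K.', ',', 'do', 'they', '?']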
So if your user types in: Looking for Deep Learning experts
It will be tokenized as: ['Looking', 'for', 'Deep', 'Learning', 'experts']
spaCy does not know that Deep Learning is an entity on its own. If you want spaCy to recognize Deep Learning as a single entity, you need to teach it. If you have a predefined list of words that you want spaCy to recognize as a single entity, you can use the PhraseMatcher to do that.
You can check the details on how to use the PhraseMatcher in the spaCy documentation.
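For illustration, here is a minimal sketch of the PhraseMatcher plus retokenizer approach (assuming the spaCy v3 API; the term list is just an example):
import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.load("en_core_web_sm")

# Predefined multi-word terms we want treated as single tokens
terms = ["Deep Learning", "Machine Learning"]
matcher = PhraseMatcher(nlp.vocab)
matcher.add("TECH_TERMS", [nlp.make_doc(term) for term in terms])

doc = nlp("Looking for Deep Learning experts")

# Merge each matched span into a single token
with doc.retokenize() as retokenizer:
    for match_id, start, end in matcher(doc):
        retokenizer.merge(doc[start:end])

print([t.text for t in doc])  # ['Looking', 'for', 'Deep Learning', 'experts']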
UPDATE - Reply to OP's comment below
I do not think there is a way spaCy can know about the entity you are looking for without being trained in the context of your domain or being provided a predefined subset of the entities.
The only solution I can think of is to use an annotation tool to teach spaCy:
- Take a subset of your user inputs and annotate them manually (you can use the Prodigy tool by the makers of spaCy, or Brat, which is free).
- Use the annotations to train a new or existing NER model. Details on training a model can be found in the spaCy training documentation.
Given a text like "Looking for Deep Learning experts", you would annotate "Deep Learning" with a label such as "FIELD". Then train a new entity type, 'FIELD'.
Once you have trained the model in the context, spaCy will learn to detect entities of interest.
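As a rough illustration only (spaCy v3 API, with made-up annotations), training a blank pipeline on a couple of examples of the new 'FIELD' label might look like the sketch below; a real model would of course need far more annotated data:
import random
import spacy
from spacy.training import Example

# Hypothetical annotated examples: text plus character offsets for the new "FIELD" label
TRAIN_DATA = [
    ("Looking for Deep Learning experts", {"entities": [(12, 25, "FIELD")]}),
    ("We need a Computer Vision engineer", {"entities": [(10, 25, "FIELD")]}),
]

nlp = spacy.blank("en")
ner = nlp.add_pipe("ner")
ner.add_label("FIELD")

optimizer = nlp.initialize()
for _ in range(20):
    random.shuffle(TRAIN_DATA)
    for text, annotations in TRAIN_DATA:
        example = Example.from_dict(nlp.make_doc(text), annotations)
        nlp.update([example], sgd=optimizer)

doc = nlp("Hiring Deep Learning specialists")
print([(ent.text, ent.label_) for ent in doc.ents])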

Training a sentence tokenizer in spaCy

I'm trying to tokenize sentences using spaCy.
The text includes lots of abbreviations and comments which end with a period. Also, the text was obtained with OCR, and sometimes there are line breaks in the middle of sentences. spaCy doesn't seem to perform well in these situations.
I have extracted some examples of how I want these sentences to be split. Is there any way to train spaCy's sentence tokenizer?
spaCy is a little unusual in that the default sentence segmentation comes from the dependency parser, so you can't train a sentence boundary detector directly as such, but you can add your own custom component to the pipeline or pre-insert some boundaries that the parser will respect. See the documentation with examples: spaCy Sentence Segmentation
For the cases you're describing, it would potentially also be useful to be able to specify that a particular position is NOT a sentence boundary, but as far as I can tell that's not currently possible.
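To make the "pre-insert boundaries" idea concrete, here is a minimal sketch of a custom component that forces sentence starts before the parser runs (spaCy v3 API; the trigger condition is just an illustrative assumption):
import spacy
from spacy.language import Language

@Language.component("preset_boundaries")
def preset_boundaries(doc):
    # Force a new sentence to start after every semicolon; the parser keeps
    # any boundaries that are already set when it runs
    for token in doc[:-1]:
        if token.text == ";":
            doc[token.i + 1].is_sent_start = True
    return doc

nlp = spacy.load("en_core_web_sm")
nlp.add_pipe("preset_boundaries", before="parser")  # must run before the parser

doc = nlp("Reading 12.3 mg; see note 4; value within range.")
print([sent.text for sent in doc.sents])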

Internal implementation of nltk pos tagger

I am new to NLP and am trying to use the NLTK PoS tagger, and I have a doubt about its usage.
It accepts either a single word or a complete sentence and gives the PoS tags of the input. Why does it work both ways?
I got this doubt because I tried removing stop words and then used spaCy's PoS tagging, and my colleague said I shouldn't do it that way because the results change, since the tagger also checks the positioning of words.
Will it be the same for the NLTK PoS tagger as well? If yes, then why does it accept single words, since positioning is considered?
Sample usage for both use cases in nltk can be found here:
https://github.com/acrosson/nlp/blob/master/subject_extraction/subject_extraction.py#L44
A sentence of one word is still a sentence, so from a software engineering point of view, I would expect a tagger module to work the same regardless of the length of the sentence. From a linguistic point of view, that's not the case.
The word positioning is what seems to be confusing you. Many PoS taggers are based on sequence models, such as HMMs or CRFs*. These use context features, e.g. the previous/next words in the sentence. I think that's what your colleague meant. If you only consider the previous word as context, then it doesn't matter how long the sentence is. The first word in any sentence has no previous word, so the tagger has to learn to deal with that. However, adding context can change the decision of the tagger. Let's look at an example using nltk:
In [4]: import nltk
In [5]: nltk.pos_tag(['fly'])
Out[5]: [('fly', 'NN')]
In [6]: nltk.pos_tag(['I', 'fly'])
Out[6]: [('I', 'PRP'), ('fly', 'VBP')]
In [7]: nltk.pos_tag(['Large', 'fly'])
Out[7]: [('Large', 'JJ'), ('fly', 'NN')]
As you can see, changing the first word affects the tagger's output for the second word. As a consequence, you should not be removing stopwords before feeding your text into a PoS tagger.
* Although that's not always true. NLTK 3.3's PoS tagger is an averaged perceptron, and spaCy 2.0 uses a neural model, but the argument about context still holds.
The nltk.pos_tag() function takes a list of tokens as input. This list can contain an arbitrary number of tokens, including, of course, 1. There is more info in the API documentation.
So in the first example you cite, nltk.pos_tag([w]), w is supposedly a single word string and [w] places it into a list, as required by the function.
In the second case, nltk.pos_tag(sent), the sent variable in the list comprehension is a sentence that has already been tokenised into a list of tokens (see line 41 in the code you cite - sentences = tokenize_sentences(document)), which is also the format required by pos_tag().
I'm not sure why your colleague advised against using spaCy. It depends on what you want to do. Contrary to NLTK, spaCy stores a rich set of features on each token, including the token's index (position) in the document and character offset in the original text. As far as I know, NLTK does not store token index and character offsets by default, so you would have to try and retrieve this yourself (something like this perhaps).
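For example, spaCy exposes both pieces of information directly on each token (a small sketch):
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Looking for Deep Learning experts")

for token in doc:
    # token.i is the index in the Doc, token.idx the character offset in the original text
    print(token.i, token.idx, token.text, token.pos_)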

NLP: Arrange words with tags into proper English sentence?

Let's say I have a sentence:
"you hello how are ?"
I get output of:
you_PRP hello_VBP how_WRB are_VBP
What is the best way to arrange the wording into a proper English sentence like: Hello how are you ?
I am new to this whole natural language processing field, so I am unfamiliar with many terms.
The only way I can think of off the top of my head is using statements to determine:
adverb - verb - noun, and then re-arranging them based on that?
Note: Let's assume I am trying to form a proper question, so ignore determining whether it's a question or a statement.
You should look into language models. A bigram language model, for example, will give you the probability of observing a sentence on the basis of the two-word sequences in that sentence. On the basis of a corpus of texts, it will have learned that "how are" has a higher probability of occurring than "are how". If you multiply the probabilities of all these two-word sequences in a sentence, you will get the probability of the sentence.
In other words, this is how you can solve your problem:
- Find a corpus (either a simple text corpus, or a corpus that has been tagged with part-of-speech tags).
- Learn a language model from that corpus. You can do this simply on the basis of the words, or on the basis of the words and their part-of-speech tags, as in your example.
- Generate all possible sequences of your target words.
- Use the language model to compute the probabilities of all those sequences.
- Pick the sequence with the highest probability.
If you work with Python, nltk has an API for training and using language models. Otherwise, KenLM is a popular language modelling package.
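As a rough sketch of the steps above with nltk.lm (the tiny corpus below is only a placeholder; a real corpus is needed for sensible probabilities):
from itertools import permutations
from nltk.lm import Laplace
from nltk.lm.preprocessing import padded_everygram_pipeline

# Placeholder training corpus of pre-tokenised sentences
corpus = [
    ["hello", "how", "are", "you", "?"],
    ["hello", "there", "how", "are", "you", "doing", "?"],
    ["how", "are", "you", "?"],
]

n = 2
train_ngrams, vocab = padded_everygram_pipeline(n, corpus)
lm = Laplace(n)  # add-one smoothing so unseen bigrams keep non-zero probability
lm.fit(train_ngrams, vocab)

def sentence_logprob(words):
    # Sum of bigram log-probabilities, with sentence padding symbols
    padded = ["<s>"] + list(words) + ["</s>"]
    return sum(lm.logscore(w, [prev]) for prev, w in zip(padded, padded[1:]))

target = ["you", "hello", "how", "are", "?"]
best = max(permutations(target), key=sentence_logprob)
print(" ".join(best))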

How to create a simple feature to detect sentiment of a sentence using CRFs?

I want to use a CRF for sentence-level sentiment classification (positive or negative). But I am lost on how to create a very simple feature to detect this using either CRFsuite or CRF++. I have been trying for a few days; can anyone suggest how to design a simple feature which I can use as a starting point to understand how to use the tools?
Thanks.
You could start by providing gazetteers containing words separated by sentiment (e.g. positive adjectives, negative nouns, etc.) and then use the CRF to label relevant portions of the sentences. Using gazetteers, you can also provide lists of other words which won't be labeled themselves, but which could help identify sentiment terms. You could also use WordNet instead of gazetteers. Your gazetteer features could be binary, i.e. gazetteer matched or not matched. Check out http://crfpp.googlecode.com for more examples and references.
I hope this helps!
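As a starting point, here is a minimal sketch of gazetteer-based binary features in the per-token dictionary format that python-crfsuite / sklearn-crfsuite expects (the word lists and feature names are illustrative assumptions):
# Tiny illustrative gazetteers; in practice these would come from lexicons or WordNet
POSITIVE = {"great", "excellent", "love", "wonderful"}
NEGATIVE = {"terrible", "awful", "hate", "boring"}

def word2features(sent, i):
    word = sent[i].lower()
    return {
        "word.lower": word,
        "word.in_positive_gazetteer": word in POSITIVE,
        "word.in_negative_gazetteer": word in NEGATIVE,
        "prev.in_positive_gazetteer": i > 0 and sent[i - 1].lower() in POSITIVE,
        "prev.in_negative_gazetteer": i > 0 and sent[i - 1].lower() in NEGATIVE,
    }

sent = ["The", "movie", "was", "excellent"]
features = [word2features(sent, i) for i in range(len(sent))]
print(features[3])
# One such feature list per sentence (plus a label list) is what sklearn_crfsuite.CRF().fit expects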
