How can I add a new POS tag in spaCy for English - nlp

For POS tagging, I am using spaCy. I found that POS tags for gerunds and infinitives are not provided. How can I add these two new tags in spaCy? I can change tags within the list, but I am not able to add new tags. Please help. Thank you.
pattern = [tokens[t].pos_ == "VERB" and tokens[t-1].pos_ == "ADP" for t in range(len(tokens)-1)]
spacy.pipeline.tagger.Tagger.add_label(u"GERUND")
This gives the error:
TypeError: add_label() takes exactly 2 positional arguments (1 given)

Rather than modifying the output, it sounds like you should just use the .tag_ attribute, which is more detailed and language-specific. In this case the value will be VB for an infinitive and VBG for a gerund. These are Penn Treebank tags.
The .pos_ values come from Universal Dependencies and are less detailed, but appropriate for multilingual applications.
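For illustration, a minimal sketch of reading both attributes side by side, assuming the en_core_web_sm model is installed (the sentence and exact output are illustrative):
import spacy

nlp = spacy.load('en_core_web_sm')
doc = nlp("She decided to swim and now enjoys swimming.")
for token in doc:
    print(token.text, token.pos_, token.tag_)
# expected: "swim" -> VERB VB (infinitive), "swimming" -> VERB VBG (gerund)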

Related

How to get a sort of inverse lemmatization for every language?

I found the spaCy lib that allows me to apply lemmatization to words (blacks -> black, EN) (bianchi -> bianco, IT). My work is to analyze entities, not verbs or adjectives.
I'm looking for something that allows me to get all the possible words starting from the canonical form.
Like going from "black" to "blacks" for English, or from "bianco" (in Italian) to "bianca", "bianchi", "bianche", etc. Is there any library that does this?
I'm not clear on exactly what you're looking for, but if a list of English lemmas is all you need, you can extract that easily enough from a GitHub library I have. Take a look at lemminflect. Initially, this uses a dictionary approach to lemmatization, and there is a .csv file in there with all the different lemmas and their inflections. The file is LemmInflect/lemminflect/resources/infl_lu.csv.gz. You'll have to extract the lemmas from it. Something like...
import gzip

with gzip.open('LemmInflect/lemminflect/resources/infl_lu.csv.gz', 'rt') as f:
    for line in f:
        parts = line.split(',')
        lemma = parts[0]
        pos = parts[1]
        print(lemma, pos)
Alternatively, if you need a system to inflect words, this is what LemmInflect is designed to do. You can use it as a stand-alone library or as an extension to spaCy. There are examples of how to use it in the README.md and in the ReadTheDocs documentation.
I should note that this is for English only. I haven't seen a lot of code for inflecting words, and you may have some difficulty finding this for other languages.
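As a rough sketch of the stand-alone usage (getAllInflections is part of LemmInflect's documented API; treat the exact output shown as illustrative):
from lemminflect import getAllInflections

# All inflected forms of the lemma 'black', keyed by Penn Treebank tag
print(getAllInflections('black'))
# e.g. {'NN': ('black',), 'NNS': ('blacks',), 'JJ': ('black',), ...}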

What are some of the data preparation steps or techniques one needs to follow when dealing with multi-lingual data?

I'm working on multilingual word-embedding code where I need to train my data on English and test it on Spanish. I'll be using the MUSE library by Facebook for the word embeddings.
I'm looking for a way to pre-process both datasets the same way. I've looked into diacritics restoration to deal with the accents.
I'm having trouble coming up with a way to carefully remove stopwords and punctuation, and deciding whether or not I should lemmatize.
How can I uniformly pre-process both languages to create a vocabulary list which I can later use with the MUSE library?
Hi Chandana, I hope you're doing well. I would look into using the library spaCy (https://spacy.io/api/doc); its creator has a YouTube video in which he discusses implementing NLP in other languages. Below you will find code that will lemmatize and remove stopwords. As far as punctuation goes, you can always set specific characters, such as accent marks, to be ignored.
Personally, I use KNIME, which is free and open source, to do preprocessing. You will have to install NLP extensions, but what is nice is that they have different extensions for different languages, which you can install here: https://www.knime.com/knime-text-processing. The Stop Word Filter (since 2.9) and the Snowball Stemmer node can be applied to Spanish. Make sure to select the right language in the dialog of the node. Unfortunately, there is no part-of-speech tagger node for Spanish so far.
import gensim
from nltk.stem import PorterStemmer, WordNetLemmatizer

# Create functions to lemmatize, stem, and preprocess:
# turn beautiful, beautifully, beautified into the stem beauti
def lemmatize_stemming(text):
    stemmer = PorterStemmer()
    return stemmer.stem(WordNetLemmatizer().lemmatize(text, pos='v'))

# Parse docs into individual words, ignoring tokens of 3 letters or fewer
# and stopwords (him, her, them, for, there, etc.), since "their" is not a topic.
# Then append the tokens to a list.
def preprocess(text):
    result = []
    newStopWords = ['your_stopword1', 'your_stop_word2']
    for token in gensim.utils.simple_preprocess(text):
        if token not in gensim.parsing.preprocessing.STOPWORDS and token not in newStopWords and len(token) > 3:
            result.append(lemmatize_stemming(token))
    return result
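For reference, a quick hypothetical call (NLTK's WordNet data may need to be downloaded once via nltk.download('wordnet')):
print(preprocess("Their beautifully beautified gardens were truly beautiful"))
# stopwords such as "their" and "were" are dropped; the remaining tokens come
# back stemmed, e.g. "beautiful" -> "beauti" and "gardens" -> "garden"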
I hope this helps. Let me know if you have any questions :)

spaCy's rule-based Matcher finds tokens longer than specified by the shape

I want to use the rule-based Matcher (spaCy version 2.0.12) to locate codes in text that consist of 4 letters followed by 4 digits (e.g. CAPA1234). I am trying to use a pattern with the SHAPE attribute:
pattern = [{'SHAPE': 'XXXXdddd'}]
You can test it yourself with the Rule-based Matcher Explorer.
It finds the codes I am expecting, but also longer ones like CAPABCD1234 or CAPA1234567. XXXX seems to mean 4 capital letters or more, and the same goes for dddd.
Is there a setting to make the shape match the text exactly?
I found a workaround that solves my problem, but it doesn't really explain why spaCy behaves the way it does. I will leave the question open.
Use SHAPE and additionally specify LENGTH explicitly:
pattern = [{'LENGTH': 8, 'SHAPE': 'XXXXdddd'}]
Please note that the online Explorer seems to fail when LENGTH is used (no tokens are highlighted). It works fine on my machine.
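For context, a minimal sketch of using that pattern with the Matcher, assuming a spaCy 2.x install and the en_core_web_sm model (matcher.add is shown with the 2.x signature):
import spacy
from spacy.matcher import Matcher

nlp = spacy.load('en_core_web_sm')
matcher = Matcher(nlp.vocab)
# LENGTH pins the token to exactly 8 characters, so CAPABCD1234 no longer matches
matcher.add('CODE', None, [{'LENGTH': 8, 'SHAPE': 'XXXXdddd'}])

doc = nlp('Codes CAPA1234 and CAPABCD1234 appear here.')
for match_id, start, end in matcher(doc):
    print(doc[start:end].text)  # prints only CAPA1234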

How to get dependency information about a word?

I have already successfully parsed sentences to get dependency information using the Stanford Parser (version 3.9.1, run in the Eclipse IDE) with the command "TypedDependencies", but how can I get dependency information about a single word (its parent, siblings, and children)? I have searched the javadoc; it seems the class SemanticGraph is used to do this job, but it needs an IndexedWord as input. How do I get an IndexedWord? Do you have any simple samples?
You can create a SemanticGraph from a List of TypedDependencies, and then you can use the methods getChildren(IndexedWord iw), getParent(IndexedWord iw), and getSiblings(IndexedWord iw). (See the javadoc of SemanticGraph.)
To get the IndexedWord of a specific word you can, for example, use the SemanticGraph method getNodeByIndex(int i), which will return the IndexedWord of the i-th token in the sentence.

Python 3.3: Process inlineXML

Whilst trying to tag named entities with the Stanford NER tool, I get this kind of output:
A jury in <ORGANIZATION>Marion County Superior Court</ORGANIZATION> was expected to begin deliberations in the case on <DATE>Wednesday</DATE> or <DATE>Thursday</DATE>.
Of course processing any XML without a root does not work, so I added this:
<root>A jury in <ORGANIZATION>Marion County Superior Court</ORGANIZATION> was expected to begin deliberations in the case on <DATE>Wednesday</DATE> or <DATE>Thursday</DATE>.</root>
I tried building a tree using the method from stripping inline tags with Python's lxml, but it does not work... It yields this error on the line tree = etree.fromstring(text):
lxml.etree.XMLSyntaxError: xmlParseEntityRef: no name, line 1, column 1793
Does anyone know a solution for this? Or perhaps another method which allows me to build a tree from any text with inlineXML tags, keeping only the tagged tokens and removing/ignoring the rest of the text.
In the end I did it without using a parser or a tree, and just used regular expressions. This is the code, which works nicely and fast:
import re

# Entity tags produced by Stanford NER
NER = ['TIME', 'LOCATION', 'ORGANIZATION', 'PERSON', 'MONEY', 'PERCENT', 'DATE']
entities = {}
for cat in NER:
    regex_cat = re.compile('<' + cat + '>(.*?)</' + cat + '>')
    entities[cat] = re.findall(regex_cat, data)
Here data is just a string of text. The code uses regular expressions to find all entities of each category specified in NER and stores them as a list in a dictionary. This could be used for any inline-XML string where NER is a list of all possible tags in the string.
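Applied to the tagged sentence from the question, the loop above should behave roughly like this (illustrative):
data = ('A jury in <ORGANIZATION>Marion County Superior Court</ORGANIZATION> '
        'was expected to begin deliberations in the case on '
        '<DATE>Wednesday</DATE> or <DATE>Thursday</DATE>.')
# after running the loop:
# entities['ORGANIZATION'] == ['Marion County Superior Court']
# entities['DATE'] == ['Wednesday', 'Thursday']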
