I found the spaCy library, which lets me apply lemmatization to words (blacks -> black, EN; bianchi -> bianco, IT). My work is to analyze entities, not verbs or adjectives.
I'm looking for something that gives me all the possible words starting from the canonical form.
For example, from "black" get "blacks" in English, or from "bianco" (in Italian) get "bianca", "bianchi", "bianche", etc. Is there any library that does this?
I'm not clear on exactly what you're looking for, but if a list of English lemmas is all you need, you can extract that easily enough from a GitHub library I have. Take a look at lemminflect. It initially uses a dictionary approach to lemmatization, and there is a .csv file in there with all the different lemmas and their inflections. The file is LemmInflect/lemminflect/resources/infl_lu.csv.gz. You'll have to extract the lemmas from it. Something like...
import gzip

with gzip.open('LemmInflect/lemminflect/resources/infl_lu.csv.gz', 'rt') as f:
    for line in f:
        parts = line.split(',')
        lemma = parts[0]
        pos = parts[1]
        print(lemma, pos)
Alternatively, if you need a system to inflect words, this is exactly what LemmInflect is designed to do. You can use it as a stand-alone library or as an extension to spaCy. There are examples of how to use it in the README.md or in the ReadTheDocs documentation.
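For instance, here is a minimal sketch of the inflection API, based on the calls documented in LemmInflect's README (the printed values are only illustrative):

from lemminflect import getAllInflections, getInflection

# All known inflections of a lemma, keyed by Penn Treebank tag
print(getAllInflections('black'))
# e.g. {'JJ': ('black',), 'NN': ('black',), 'NNS': ('blacks',), ...}

# A single inflection, selected by tag
print(getInflection('black', tag='NNS'))
# e.g. ('blacks',)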
I should note that this is for English only. I haven't seen a lot of code for inflecting words and you may have some difficulty finding this for other languages.
I'm working on multilingual word embedding code where I need to train my data on English and test it on Spanish. I'll be using the MUSE library by Facebook for the word-embeddings.
I'm looking for a way to pre-process the data in both languages the same way. I've looked into diacritics restoration to deal with the accents.
I'm having trouble coming up with a way to consistently remove stopwords and punctuation, and deciding whether or not I should lemmatize.
How can I uniformly pre-process both languages to create a vocabulary list which I can later use with the MUSE library?
Hi Chandana, I hope you're doing well. I would look into using the library spaCy https://spacy.io/api/doc ; its creator has a YouTube video in which he discusses the implementation of NLP in other languages. Below you will find code that will lemmatize and remove stopwords. As far as punctuation goes, you can always set specific characters, such as accent marks, to ignore.
Personally I use KNIME, which is free and open source, to do preprocessing. You will have to install NLP extensions, but what is nice is that they have different extensions for different languages, which you can install here: https://www.knime.com/knime-text-processing The Stop Word Filter (since 2.9) and the Snowball Stemmer node can be applied to Spanish; make sure to select the right language in the dialog of the node. Unfortunately there is no part-of-speech tagger node for Spanish so far.
import gensim
from nltk.stem import PorterStemmer, WordNetLemmatizer
# Requires the WordNet data: nltk.download('wordnet')

# Create functions to lemmatize, stem, and preprocess
# e.g. turn beautiful, beautifully, beautified into the stem beauti
def lemmatize_stemming(text):
    stemmer = PorterStemmer()
    return stemmer.stem(WordNetLemmatizer().lemmatize(text, pos='v'))

# Parse docs into individual words, ignoring words shorter than 4 letters
# and stopwords (him, her, them, for, there, etc.), since "their" is not a topic.
# Then append the tokens to a list.
def preprocess(text):
    result = []
    newStopWords = ['your_stopword1', 'your_stop_word2']
    for token in gensim.utils.simple_preprocess(text):
        if token not in gensim.parsing.preprocessing.STOPWORDS and token not in newStopWords and len(token) > 3:
            result.append(lemmatize_stemming(token))
    return result
I hope this helps. Let me know if you have any questions :)
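If you would rather keep everything in Python, here is a minimal sketch of how a single spaCy-based function could preprocess both languages uniformly (it assumes the en_core_web_sm and es_core_news_sm models are installed; the model names and filtering choices are just examples):

import spacy

# One model per language; both expose the same token attributes,
# so a single preprocessing function covers English and Spanish.
MODELS = {
    'en': spacy.load('en_core_web_sm'),
    'es': spacy.load('es_core_news_sm'),
}

def preprocess(text, lang):
    doc = MODELS[lang](text)
    # Keep lowercased lemmas, dropping stopwords, punctuation and whitespace
    return [tok.lemma_.lower() for tok in doc
            if not tok.is_stop and not tok.is_punct and not tok.is_space]

print(preprocess("The cats are eating the mice.", 'en'))
print(preprocess("Los gatos comen los ratones.", 'es'))

Whatever you decide about lemmatization, the important thing is that the resulting tokens match the vocabulary of the pretrained embeddings you align with MUSE.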
I'm trying to use Stanford CoreNLP for French texts.
I have two questions:
I want to know if French lemmatization is available with CoreNLP?
In some cases the output dependencies do not make sense. For example, for the sentence "Le chat mange la souris" (the cat is eating the mouse) there is a problem with the token "mange", which is tagged as an adjective and not a verb, so it is not considered the root of the sentence.
But when I use the plural "Les chats mangent la souris" it's correct.
Any help would be appreciated!
At this time we do not have a French language lemmatizer.
We will be releasing a new French dependencies model soon with our official 3.7.0 release. I am curious though, how are you generating dependencies, with the "parse" annotator or "depparse" annotator?
Thanks for your response.
I use the following configuration for the parse and depparse annotators:
StanfordCoreNLP pipeline = new StanfordCoreNLP(
    PropertiesUtils.asProperties(
        "annotators", "tokenize, ssplit, pos, depparse, parse",
        "tokenize.language", "fr",
        "pos.model", "edu/stanford/nlp/models/pos-tagger/french/french.tagger",
        "parse.model", "edu/stanford/nlp/models/lexparser/frenchFactored.ser.gz",
        "depparse.model", "edu/stanford/nlp/models/parser/nndep/UD_French.gz"));
The official documentation of token.tag_ in spaCy is as follows:
A fine-grained, more detailed tag that represents the word-class and some basic morphological information for the token. These tags are primarily designed to be good features for subsequent models, particularly the syntactic parser. They are language and treebank dependent. The tagger is trained to predict these fine-grained tags, and then a mapping table is used to reduce them to the coarse-grained .pos tags.
But it doesn't list the full available tags and each tag's explanation. Where can I find it?
Finally I found it inside spaCy's source code: glossary.py. And this link explains the meaning of different tags.
Available values for token.tag_ are language specific. By language here, I don't mean English or Portuguese; I mean 'en_core_web_sm' or 'pt_core_news_sm'. In other words, they are language-model specific, and they are defined in the TAG_MAP, which is customizable and trainable. If you don't customize it, it will be the default TAG_MAP for that language.
As of the writing of this answer, spacy.io/models lists all of the pretrained models and their labeling scheme.
Now, for the explanations. If you are working with English or German text, you're in luck! You can use spacy.explain() or access its glossary on GitHub for the full list. If you are working with other languages, token.pos_ values are always those of Universal Dependencies and will work regardless.
To finish up, if you are working with other languages, for a full explanation of the tags, you are going to have to look for them in the sources listed in the models page for your model of interest. For instance, for Portuguese I had to track the explanations for the tags in the Portuguese UD Bosque Corpus used to train the model.
Here is the list of tags:
TAG_MAP = [
".",
",",
"-LRB-",
"-RRB-",
"``",
"\"\"",
"''",
",",
"$",
"#",
"AFX",
"CC",
"CD",
"DT",
"EX",
"FW",
"HYPH",
"IN",
"JJ",
"JJR",
"JJS",
"LS",
"MD",
"NIL",
"NN",
"NNP",
"NNPS",
"NNS",
"PDT",
"POS",
"PRP",
"PRP$",
"RB",
"RBR",
"RBS",
"RP",
"SP",
"SYM",
"TO",
"UH",
"VB",
"VBD",
"VBG",
"VBN",
"VBP",
"VBZ",
"WDT",
"WP",
"WP$",
"WRB",
"ADD",
"NFP",
"GW",
"XX",
"BES",
"HVS",
"_SP",
]
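As a quick way to see these tags in context, here is a small sketch (it assumes the en_core_web_sm model is installed) that prints each token's fine-grained tag, its coarse-grained POS, and the glossary explanation:

import spacy

nlp = spacy.load('en_core_web_sm')
doc = nlp("The black cats are sleeping.")

for token in doc:
    # tag_ is the fine-grained, treebank-specific tag; pos_ is the Universal Dependencies POS
    print(token.text, token.tag_, token.pos_, spacy.explain(token.tag_))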
Here is the list of tags and POS values spaCy uses, at the link below:
https://spacy.io/api/annotation
Universal parts of speech tags
English
German
You can get an explanation using:
from spacy import glossary
tag_name = 'ADP'
glossary.explain(tag_name)
Version: 3.3.0
Source: https://github.com/explosion/spaCy/blob/master/spacy/glossary.py
You can use the following:
dir(spacy.parts_of_speech)
I'm using Stanford's CoreNLP Named Entity Recognizer (NER) and Part-of-Speech (POS) tagger in my application. The problem is that my code tokenizes the text beforehand, and then I need to NER- and POS-tag each token. However, I was only able to find out how to do that using the command-line options, not programmatically.
Can someone please tell me how I can programmatically NER- and POS-tag pretokenized text using Stanford's CoreNLP?
Edit:
I'm actually using the individual NER and POS taggers, so my code was written as instructed in the tutorials that come with Stanford's NER and POS packages, but I do have CoreNLP in my classpath.
Edit:
I just found that there are instructions on how to set the properties for CoreNLP here: http://nlp.stanford.edu/software/corenlp.shtml, but I wish there were a quick way to do what I want with the Stanford NER and POS taggers so I don't have to recode everything!
If you set the property:
tokenize.whitespace = true
then the CoreNLP pipeline will tokenize on whitespace rather than the default PTB tokenization. You may also want to set:
ssplit.eolonly = true
so that you only split sentences on newline characters.
To programmatically run a classifier over a list of tokens that you've already gotten via some other means, without a kludge like pasting them together with whitespace and then tokenizing again, you can use the Sentence.toCoreLabelList method:
String[] token_strs = {"John", "met", "Amy", "in", "Los", "Angeles"};
List<CoreLabel> tokens = edu.stanford.nlp.ling.Sentence.toCoreLabelList(token_strs);
for (CoreLabel cl : classifier.classifySentence(tokens)) {
    System.out.println(cl.toShorterString());
}
Output:
[Value=John Text=John Position=0 Answer=PERSON Shape=Xxxx DistSim=463]
[Value=met Text=met Position=1 Answer=O Shape=xxxk DistSim=476]
[Value=Amy Text=Amy Position=2 Answer=PERSON Shape=Xxx DistSim=396]
[Value=in Text=in Position=3 Answer=O Shape=xxk DistSim=510]
[Value=Los Text=Los Position=4 Answer=LOCATION Shape=Xxx DistSim=449]
[Value=Angeles Text=Angeles Position=5 Answer=LOCATION Shape=Xxxxx DistSim=199]