I found the spaCy library, which lets me apply lemmatization to words (blacks -> black, EN; bianchi -> bianco, IT). My work is to analyze entities, not verbs or adjectives.
I'm looking for something that gives me all the possible words starting from the canonical form.
For example, from "black" get "blacks" in English, or from "bianco" (in Italian) get "bianca", "bianchi", "bianche", etc. Is there any library that does this?
I'm not clear on exactly what you're looking for, but if a list of English lemmas is all you need, you can extract that easily enough from a GitHub library I have. Take a look at lemminflect. It initially uses a dictionary approach to lemmatization, and there is a .csv file in there with all the different lemmas and their inflections. The file is LemmInflect/lemminflect/resources/infl_lu.csv.gz. You'll have to extract the lemmas from it. Something like...
import gzip

with gzip.open('LemmInflect/lemminflect/resources/infl_lu.csv.gz', 'rt') as f:
    for line in f:
        parts = line.split(',')
        lemma = parts[0]
        pos = parts[1]
        print(lemma, pos)
Alternatively, if you need a system to inflect words, this is exactly what LemmInflect is designed to do. You can use it as a stand-alone library or as an extension to spaCy. There are examples of how to use it in the README.md or in the ReadTheDocs documentation.
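For instance, here is a minimal sketch of the inflection API, based on the calls documented in LemmInflect's README (the printed values are only illustrative):

from lemminflect import getAllInflections, getInflection

# All known inflections of a lemma, keyed by Penn Treebank tag
print(getAllInflections('black'))
# e.g. {'JJ': ('black',), 'NN': ('black',), 'NNS': ('blacks',), ...}

# A single inflection, selected by tag
print(getInflection('black', tag='NNS'))
# e.g. ('blacks',)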
I should note that this is for English only. I haven't seen a lot of code for inflecting words and you may have some difficulty finding this for other languages.
I'm working on multilingual word embedding code where I need to train my data on English and test it on Spanish. I'll be using the MUSE library by Facebook for the word-embeddings.
I'm looking for a way to pre-process the data in both languages the same way. I've looked into diacritics restoration to deal with the accents.
I'm having trouble coming up with a way to consistently remove stopwords and punctuation, and deciding whether or not I should lemmatize.
How can I uniformly pre-process both languages to create a vocabulary list which I can later use with the MUSE library?
Hi Chandana, I hope you're doing well. I would look into using the library spaCy https://spacy.io/api/doc ; its creator has a YouTube video in which he discusses the implementation of NLP in other languages. Below you will find code that will lemmatize and remove stopwords. As far as punctuation goes, you can always set specific characters, such as accent marks, to ignore.
Personally I use KNIME, which is free and open source, to do preprocessing. You will have to install NLP extensions, but what is nice is that they have different extensions for different languages, which you can install here: https://www.knime.com/knime-text-processing The Stop Word Filter (since 2.9) and the Snowball Stemmer node can be applied to Spanish; make sure to select the right language in the dialog of the node. Unfortunately there is no part-of-speech tagger node for Spanish so far.
import gensim
from nltk.stem import PorterStemmer, WordNetLemmatizer
# Requires the WordNet data: nltk.download('wordnet')

# Create functions to lemmatize, stem, and preprocess
# e.g. turn beautiful, beautifully, beautified into the stem beauti
def lemmatize_stemming(text):
    stemmer = PorterStemmer()
    return stemmer.stem(WordNetLemmatizer().lemmatize(text, pos='v'))

# Parse docs into individual words, ignoring words shorter than 4 letters
# and stopwords (him, her, them, for, there, etc.), since "their" is not a topic.
# Then append the tokens to a list.
def preprocess(text):
    result = []
    newStopWords = ['your_stopword1', 'your_stop_word2']
    for token in gensim.utils.simple_preprocess(text):
        if token not in gensim.parsing.preprocessing.STOPWORDS and token not in newStopWords and len(token) > 3:
            result.append(lemmatize_stemming(token))
    return result
I hope this helps. Let me know if you have any questions :)
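If you would rather keep everything in Python, here is a minimal sketch of how a single spaCy-based function could preprocess both languages uniformly (it assumes the en_core_web_sm and es_core_news_sm models are installed; the model names and filtering choices are just examples):

import spacy

# One model per language; both expose the same token attributes,
# so a single preprocessing function covers English and Spanish.
MODELS = {
    'en': spacy.load('en_core_web_sm'),
    'es': spacy.load('es_core_news_sm'),
}

def preprocess(text, lang):
    doc = MODELS[lang](text)
    # Keep lowercased lemmas, dropping stopwords, punctuation and whitespace
    return [tok.lemma_.lower() for tok in doc
            if not tok.is_stop and not tok.is_punct and not tok.is_space]

print(preprocess("The cats are eating the mice.", 'en'))
print(preprocess("Los gatos comen los ratones.", 'es'))

Whatever you decide about lemmatization, the important thing is that the resulting tokens match the vocabulary of the pretrained embeddings you align with MUSE.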
I'm trying to use Stanford CoreNLP for French texts.
I have two questions:
I want to know if French lemmatization is available with CoreNLP?
In some cases the output dependencies do not make sense. For example, for the sentence "Le chat mange la souris" (the cat is eating the mouse) there is a problem with the token "mange", which is tagged as an adjective and not a verb, so it is not considered the root of the sentence.
But when I use the plural "Les chats mangent la souris" it's correct.
Any help would be appreciated!
At this time we do not have a French language lemmatizer.
We will be releasing a new French dependencies model soon with our official 3.7.0 release. I am curious though, how are you generating dependencies, with the "parse" annotator or "depparse" annotator?
Thanks for your response.
I use the following configuration for the parse and depparse annotators:
StanfordCoreNLP pipeline = new StanfordCoreNLP(
    PropertiesUtils.asProperties(
        "annotators", "tokenize, ssplit, pos, depparse, parse",
        "tokenize.language", "fr",
        "pos.model", "edu/stanford/nlp/models/pos-tagger/french/french.tagger",
        "parse.model", "edu/stanford/nlp/models/lexparser/frenchFactored.ser.gz",
        "depparse.model", "edu/stanford/nlp/models/parser/nndep/UD_French.gz"));
The official documentation of token.tag_ in spaCy is as follows:
A fine-grained, more detailed tag that represents the word-class and some basic morphological information for the token. These tags are primarily designed to be good features for subsequent models, particularly the syntactic parser. They are language and treebank dependent. The tagger is trained to predict these fine-grained tags, and then a mapping table is used to reduce them to the coarse-grained .pos tags.
But it doesn't list the full available tags and each tag's explanation. Where can I find it?
Finally I found it inside spaCy's source code: glossary.py. And this link explains the meaning of different tags.
Available values for token.tag_ are language specific. By language here, I don't mean English or Portuguese; I mean 'en_core_web_sm' or 'pt_core_news_sm'. In other words, they are language-model specific, and they are defined in the TAG_MAP, which is customizable and trainable. If you don't customize it, it will be the default TAG_MAP for that language.
As of the writing of this answer, spacy.io/models lists all of the pretrained models and their labeling scheme.
Now, for the explanations. If you are working with English or German text, you're in luck! You can use spacy.explain() or access its glossary on GitHub for the full list. If you are working with other languages, token.pos_ values are always those of Universal Dependencies and will work regardless.
To finish up, if you are working with other languages, for a full explanation of the tags, you are going to have to look for them in the sources listed in the models page for your model of interest. For instance, for Portuguese I had to track the explanations for the tags in the Portuguese UD Bosque Corpus used to train the model.
Here is the list of tags:
TAG_MAP = [
".",
",",
"-LRB-",
"-RRB-",
"``",
"\"\"",
"''",
",",
"$",
"#",
"AFX",
"CC",
"CD",
"DT",
"EX",
"FW",
"HYPH",
"IN",
"JJ",
"JJR",
"JJS",
"LS",
"MD",
"NIL",
"NN",
"NNP",
"NNPS",
"NNS",
"PDT",
"POS",
"PRP",
"PRP$",
"RB",
"RBR",
"RBS",
"RP",
"SP",
"SYM",
"TO",
"UH",
"VB",
"VBD",
"VBG",
"VBN",
"VBP",
"VBZ",
"WDT",
"WP",
"WP$",
"WRB",
"ADD",
"NFP",
"GW",
"XX",
"BES",
"HVS",
"_SP",
]
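As a quick way to see these tags in context, here is a small sketch (it assumes the en_core_web_sm model is installed) that prints each token's fine-grained tag, its coarse-grained POS, and the glossary explanation:

import spacy

nlp = spacy.load('en_core_web_sm')
doc = nlp("The black cats are sleeping.")

for token in doc:
    # tag_ is the fine-grained, treebank-specific tag; pos_ is the Universal Dependencies POS
    print(token.text, token.tag_, token.pos_, spacy.explain(token.tag_))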
Here is the list of tags and POS values spaCy uses, at the link below:
https://spacy.io/api/annotation
Universal parts of speech tags
English
German
You can get an explanation using:
from spacy import glossary
tag_name = 'ADP'
glossary.explain(tag_name)
Version: 3.3.0
Source: https://github.com/explosion/spaCy/blob/master/spacy/glossary.py
You can use the following:
dir(spacy.parts_of_speech)
I'm using Stanford's CoreNLP Named Entity Recognizer (NER) and Part-of-Speech (POS) tagger in my application. The problem is that my code tokenizes the text beforehand, and then I need to NER- and POS-tag each token. However, I was only able to find out how to do that using the command-line options, not programmatically.
Can someone please tell me how I can programmatically NER- and POS-tag pretokenized text using Stanford's CoreNLP?
Edit:
I'm actually using the individual NER and POS taggers, so my code was written as instructed in the tutorials that come with Stanford's NER and POS packages, but I do have CoreNLP in my classpath.
Edit:
I just found that there are instructions on how to set the properties for CoreNLP here: http://nlp.stanford.edu/software/corenlp.shtml, but I wish there were a quick way to do what I want with the Stanford NER and POS taggers so I don't have to recode everything!
If you set the property:
tokenize.whitespace = true
then the CoreNLP pipeline will tokenize on whitespace rather than the default PTB tokenization. You may also want to set:
ssplit.eolonly = true
so that you only split sentences on newline characters.
To programmatically run a classifier over a list of tokens that you've already gotten via some other means, without a kludge like pasting them together with whitespace and then tokenizing again, you can use the Sentence.toCoreLabelList method:
String[] token_strs = {"John", "met", "Amy", "in", "Los", "Angeles"};
List<CoreLabel> tokens = edu.stanford.nlp.ling.Sentence.toCoreLabelList(token_strs);
for (CoreLabel cl : classifier.classifySentence(tokens)) {
    System.out.println(cl.toShorterString());
}
Output:
[Value=John Text=John Position=0 Answer=PERSON Shape=Xxxx DistSim=463]
[Value=met Text=met Position=1 Answer=O Shape=xxxk DistSim=476]
[Value=Amy Text=Amy Position=2 Answer=PERSON Shape=Xxx DistSim=396]
[Value=in Text=in Position=3 Answer=O Shape=xxk DistSim=510]
[Value=Los Text=Los Position=4 Answer=LOCATION Shape=Xxx DistSim=449]
[Value=Angeles Text=Angeles Position=5 Answer=LOCATION Shape=Xxxxx DistSim=199]