I have been trying to use the Stanford CoreNLP API included in the 2015-12-09 release. I start the server using:
java -mx5g -cp "./*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer
The server works in general, but fails for some sentences, including the following:
"Aside from her specifically regional accent, she reveals by the use of the triad, ``irritable, tense, depressed, a certain pedantic itemization that indicates she has some familiarity with literary or scientific language ( i.e., she must have had at least a highschool education ) , and she is telling a story she has mentally rehearsed some time before."
I end up with a result that starts with:
{"sentences":[{"index":0,"parse":"SENTENCE_SKIPPED_OR_UNPARSABLE","basic-dependencies":
I would greatly appreciate some help in setting this up. Am I failing to include some annotators in the NLP pipeline?
This same sentence works at http://corenlp.run/
If you're looking for a dependency parse (like the one shown at corenlp.run), you should look at the basic-dependencies field rather than the parse field. If you want a constituency parse, you should include the parse annotator in the list of annotators you send to the server. By default, the server does not include the parse annotator, as it's relatively slow.
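As a rough illustration of sending the annotator list to the server, the sketch below builds a request URL whose `properties` parameter asks for the parse annotator. It assumes a server running locally on the default port 9000; the helper names `build_corenlp_url` and `annotate` are made up for this example, and `annotate` only works while a server is actually listening.

```python
import json
import urllib.parse
import urllib.request


def build_corenlp_url(base_url, annotators):
    """Return a server URL whose `properties` parameter requests the given annotators."""
    props = {"annotators": ",".join(annotators), "outputFormat": "json"}
    query = urllib.parse.urlencode({"properties": json.dumps(props)})
    return base_url.rstrip("/") + "/?" + query


def annotate(text, url):
    """POST raw text to the server and decode the JSON reply (needs a running server)."""
    request = urllib.request.Request(url, data=text.encode("utf-8"))
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


# Assumption: the server was started locally and listens on port 9000.
URL = build_corenlp_url("http://localhost:9000", ["tokenize", "ssplit", "pos", "parse"])
print(URL)
```

With the server up, `annotate("Some sentence.", URL)["sentences"][0]["parse"]` should then hold a bracketed tree instead of the SENTENCE_SKIPPED_OR_UNPARSABLE placeholder.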
I found the spaCy library, which lets me lemmatize words (blacks -> black in English; bianchi -> bianco in Italian). My work is to analyze entities, not verbs or adjectives.
I'm looking for something that gives me all the possible word forms starting from the canonical form,
for example from "black" to "blacks" in English, or from "bianco" in Italian to "bianca", "bianchi", "bianche", etc. Is there any library that does this?
I'm not clear on exactly what you're looking for, but if a list of English lemmas is all you need, you can extract that easily enough from a GitHub library of mine. Take a look at LemmInflect. It starts with a dictionary approach to lemmatization, and the repository contains a .csv file with all the different lemmas and their inflections. The file is LemmInflect/lemminflect/resources/infl_lu.csv.gz. You'll have to extract the lemmas from it. Something like...
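import gzip

# Path is relative to a checkout of the LemmInflect repository
with gzip.open('LemmInflect/lemminflect/resources/infl_lu.csv.gz', 'rt', encoding='utf-8') as f:
    for line in f:
        # The first two comma-separated fields are the lemma and its part of speech
        parts = line.strip().split(',')
        lemma = parts[0]
        pos = parts[1]
        print(lemma, pos)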
Alternatively, if you need a system to inflect words, this is what LemmInflect is designed to do. You can use it as a stand-alone library or as an extension to spaCy. There are examples of how to use it in the README.md and in the ReadTheDocs documentation.
I should note that this is for English only. I haven't seen a lot of code for inflecting words and you may have some difficulty finding this for other languages.
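For languages without an inflection library, one possible workaround is to invert the output of a lemmatizer: collect (form, lemma) pairs over a large word list and group the forms by lemma. The sketch below shows only the grouping step; the pairs in `FORM_TO_LEMMA` are hand-written placeholders, not the output of any real tool.

```python
from collections import defaultdict

# Hypothetical (inflected form, lemma) pairs. In practice these could come from
# running a lemmatizer over a large vocabulary for the language in question.
FORM_TO_LEMMA = [
    ("black", "black"), ("blacks", "black"),
    ("bianco", "bianco"), ("bianca", "bianco"),
    ("bianchi", "bianco"), ("bianche", "bianco"),
]


def invert(pairs):
    """Group inflected forms under their lemma."""
    forms = defaultdict(set)
    for form, lemma in pairs:
        forms[lemma].add(form)
    return forms


LEMMA_TO_FORMS = invert(FORM_TO_LEMMA)
print(sorted(LEMMA_TO_FORMS["bianco"]))  # ['bianca', 'bianche', 'bianchi', 'bianco']
```

The result can only contain forms that actually occur in the word list, so coverage depends entirely on the size of the vocabulary you lemmatize.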
I'm working on multilingual word embedding code where I need to train my data on English and test it on Spanish. I'll be using the MUSE library by Facebook for the word-embeddings.
I'm looking for a way to pre-process both datasets the same way. I've looked into diacritics restoration to deal with the accents.
I'm having trouble coming up with a way to carefully remove stopwords and punctuation, and deciding whether or not I should lemmatize.
How can I uniformly pre-process both languages to create a vocabulary list that I can later use with the MUSE library?
Hi Chandana, I hope you're doing well. I would look into the spaCy library (https://spacy.io/api/doc); its creator has a YouTube video in which he discusses implementing NLP in other languages. Below you will find code that lemmatizes and removes stopwords. As for punctuation, you can always set specific characters, such as accent marks, to be ignored.
Personally, I use KNIME, which is free and open source, for preprocessing. You will have to install the NLP extensions, but what is nice is that there are different extensions for different languages, which you can install here: https://www.knime.com/knime-text-processing
The Stop word filter (since 2.9) and the Snowball stemmer node can be applied to Spanish. Make sure to select the right language in the node's dialog. Unfortunately, there is no part-of-speech tagger node for Spanish so far.
import gensim
from nltk.stem import PorterStemmer, WordNetLemmatizer  # needs nltk.download('wordnet') once

# Extra stopwords of your own, on top of gensim's built-in list
NEW_STOPWORDS = ['your_stopword1', 'your_stop_word2']

# Create functions to lemmatize, stem, and preprocess
# Turn beautiful, beautifully, beautified into the stem beauti
def lemmatize_stemming(text):
    stemmer = PorterStemmer()
    return stemmer.stem(WordNetLemmatizer().lemmatize(text, pos='v'))

# Split a document into individual words, ignoring words of 3 letters or fewer
# and stopwords (him, her, them, for, there, etc.), since "their" is not a topic.
# Then append the processed tokens to a list.
def preprocess(text):
    result = []
    for token in gensim.utils.simple_preprocess(text):
        if token not in gensim.parsing.preprocessing.STOPWORDS and token not in NEW_STOPWORDS and len(token) > 3:
            result.append(lemmatize_stemming(token))
    return result
I hope this helps. Let me know if you have any questions :)
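To apply the same steps to English and Spanish, one option is a single function parameterized by language. The following is a minimal standard-library sketch, not a spaCy or KNIME recipe: the stop-word sets are tiny placeholders, the `keep_accents` flag is an invented name, and whether accents should be kept depends on how the embedding vocabulary you map onto was built.

```python
import re
import unicodedata

# Tiny hypothetical stop-word lists; real ones would come from spaCy, NLTK, etc.
STOPWORDS = {
    "en": {"the", "a", "is", "of", "and"},
    "es": {"el", "la", "es", "de", "y"},
}


def strip_accents(text):
    """Decompose characters (NFD) and drop the combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def preprocess(text, lang, keep_accents=True):
    """Lowercase, optionally strip accents, keep letter-only tokens, drop stop words."""
    # NFC first, so accented letters are single characters for the regex below.
    text = unicodedata.normalize("NFC", text.lower())
    if not keep_accents:
        text = strip_accents(text)
    tokens = re.findall(r"[^\W\d_]+", text)  # runs of Unicode letters only
    return [tok for tok in tokens if tok not in STOPWORDS[lang]]


print(preprocess("The cat is black.", "en"))
print(preprocess("¡El niño es alto!", "es", keep_accents=False))
```

Because punctuation and digits never match the token pattern, they are dropped for both languages by the same rule, which keeps the two vocabularies comparable.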
I'm trying to use Stanford CoreNLP for French texts.
I have two questions:
Is French lemmatization available in CoreNLP?
In some cases the output dependencies do not make sense. For example, in the sentence "Le chat mange la souris" (the cat is eating the mouse), the token "mange" is tagged as an adjective rather than a verb, so it is not treated as the root of the sentence.
But when I use the plural, "Les chats mangent la souris", the result is correct.
Any help would be appreciated!
At this time we do not have a French language lemmatizer.
We will be releasing a new French dependencies model soon with our official 3.7.0 release. I am curious though, how are you generating dependencies, with the "parse" annotator or "depparse" annotator?
Thanks for your response.
I use the following configuration for the parse and depparse methods:
StanfordCoreNLP pipeline = new StanfordCoreNLP(
    PropertiesUtils.asProperties(
        "annotators", "tokenize, ssplit, pos, depparse, parse",
        "tokenize.language", "fr",
        "pos.model", "edu/stanford/nlp/models/pos-tagger/french/french.tagger",
        "parse.model", "edu/stanford/nlp/models/lexparser/frenchFactored.ser.gz",
        "depparse.model", "edu/stanford/nlp/models/parser/nndep/UD_French.gz"));
Using Stanford CoreNLP, I am trying to parse text using the neural nets dependency parser. It runs really fast (that's why I want to use this and not the LexicalizedParser), and produces high-quality dependency relations. I am also interested in retrieving the parse trees (Penn-tree style) from that too. So, given the GrammaticalStructure, I am getting the root of that (using root()), and then trying to print it out using the toOneLineString() method. However, root() returns the root node of the tree, with an empty/null list of children. I couldn't find anything on this in the instructions or FAQs.
GrammaticalStructure gs = parser.predict(tagged);
// Print typed dependencies
System.err.println(gs);
// get the tree and print it out in the parenthesised form
TreeGraphNode tree = gs.root();
System.err.println(tree.toOneLineString());
The output of this is:
ROOT-0{CharacterOffsetBeginAnnotation=-1, CharacterOffsetEndAnnotation=-1, PartOfSpeechAnnotation=null, TextAnnotation=ROOT}Typed Dependencies:
[nsubj(tell-5, I-1), aux(tell-5, can-2), advmod(always-4, almost-3), advmod(tell-5, always-4), root(ROOT-0, tell-5), advmod(use-8, when-6), nsubj(use-8, movies-7), advcl(tell-5, use-8), amod(dinosaurs-10, fake-9), dobj(use-8, dinosaurs-10), punct(tell-5, .-11)]
ROOT-0
How can I get the parse tree too?
I figured out that I can use the shift-reduce constituency parser that Stanford provides. It is very fast, and the results are comparable.
I am new to NLP, but my current research requires some text parsing (or keyword extraction) of URL addresses, e.g. this fake URL:
http://ads.goole.com/appid/heads
Two constraints apply to my parsing:
The first token "ads" and the last token "heads" should be kept distinct, because the "ads" in "heads" is merely a suffix of the word, not an advertisement.
The "appid" can be parsed into two parts, "app" and "id", both of which carry semantic meaning on the Internet.
I have tried the Stanford NLP toolkit and the Google search engine. The former tries to classify each word by its grammatical role, which falls short of my expectations. Google is smarter about "appid" and suggests "app id".
I cannot access the search history that Google draws on, which presumably lets it suggest "app id" because many people have searched for those words. Are there any offline methods that perform similar parsing?
UPDATE:
Please skip the regex suggestions because there is a potentially unknown number of compositions of words like "appid" in even simple URLs.
Thanks,
Jamin
Rather than tokenization, what it sounds like you really want to do is called word segmentation. This is for example a way to make sense of asentencethathasnospaces.
I haven't gone through this entire tutorial, but it should get you started. It even gives URLs as a potential use case.
http://jeremykun.com/2012/01/15/word-segmentation/
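To give a feel for the idea, here is a minimal unigram segmenter that uses dynamic programming to pick the split with the lowest total cost. It is a sketch under strong assumptions: the counts in `COUNTS` are invented for illustration, and the penalty for unknown words is an arbitrary choice, whereas a real model would estimate both from a large corpus.

```python
import math

# Hypothetical unigram counts; a real model would use counts from a large corpus.
COUNTS = {"app": 50, "id": 40, "ads": 30, "heads": 20, "head": 25, "he": 60, "a": 100, "s": 1}
TOTAL = sum(COUNTS.values())
MAX_WORD_LEN = 20  # longest candidate word considered


def word_cost(word):
    """Negative log-probability; unknown words get a heavy, length-based penalty."""
    if word in COUNTS:
        return -math.log(COUNTS[word] / TOTAL)
    return 10.0 * len(word) + 10.0


def segment(text):
    """Return the lowest-cost split of text into words."""
    # best[i] holds (cost, words) for the best segmentation of text[:i]
    best = [(0.0, [])]
    for end in range(1, len(text) + 1):
        candidates = []
        for start in range(max(0, end - MAX_WORD_LEN), end):
            cost, words = best[start]
            piece = text[start:end]
            candidates.append((cost + word_cost(piece), words + [piece]))
        best.append(min(candidates))
    return best[-1][1]


print(segment("appid"))  # ['app', 'id']
print(segment("heads"))  # ['heads']
```

With these toy counts, "heads" stays whole because one reasonably frequent word is cheaper than "he" plus "ads", which is the behaviour the question asks for.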
The Python wordsegment module can do this. It's an Apache2 licensed module for English word segmentation, written in pure-Python, and based on a trillion-word corpus.
It is based on code from the chapter “Natural Language Corpus Data” by Peter Norvig from the book “Beautiful Data” (Segaran and Hammerbacher, 2009).
Data files are derived from the Google Web Trillion Word Corpus, as described by Thorsten Brants and Alex Franz, and distributed by the Linguistic Data Consortium. This module contains only a subset of that data. The unigram data includes only the most common 333,000 words. Similarly, bigram data includes only the most common 250,000 phrases. Every word and phrase is lowercased with punctuation removed.
Installation is easy with pip:
$ pip install wordsegment
Simply call segment to get a list of words:
>>> import wordsegment as ws
>>> ws.segment('http://ads.goole.com/appid/heads')
['http', 'ads', 'goole', 'com', 'appid', 'heads']
As you noticed, the old corpus doesn't rank "app id" very high. That's ok. We can easily teach it. Simply add it to the bigram_counts dictionary.
>>> ws.bigram_counts['app id'] = 10.2e6
>>> ws.segment('http://ads.goole.com/appid/heads')
['http', 'ads', 'goole', 'com', 'app', 'id', 'heads']
I chose the value 10.2e6 by doing a Google search for "app id" and noting the number of results.
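It may also help to split the URL into its host and path components before segmenting, so that the scheme is left out and each chunk is segmented on its own. The sketch below uses only the standard library; `url_chunks` is a name invented for this example, and each returned chunk could then be passed to a segmenter separately.

```python
import re
from urllib.parse import urlparse


def url_chunks(url):
    """Return (component, chunk) pairs for the host labels and path segments of a URL."""
    parsed = urlparse(url)
    chunks = []
    # Host labels, e.g. "ads.goole.com" -> ads, goole, com
    for label in (parsed.hostname or "").split("."):
        chunks.extend(("host", piece) for piece in re.findall(r"[a-z0-9]+", label.lower()))
    # Path segments, e.g. "/appid/heads" -> appid, heads
    for part in parsed.path.split("/"):
        chunks.extend(("path", piece) for piece in re.findall(r"[a-z0-9]+", part.lower()))
    return chunks


print(url_chunks("http://ads.goole.com/appid/heads"))
```

Keeping the component label alongside each chunk lets later steps treat a host label such as "ads" differently from a path segment such as "heads".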