I am working on an NLP project and need to use the Stanford Open Information Extraction tool from Python (NLTK if possible). I found a Python wrapper, but it is poorly documented and does not expose the full functionality of Stanford Open IE. Any suggestions?
One approach is to use the CoreNLP Server, which outputs OpenIE triples (see, e.g., corenlp.run). Among other libraries, Stanford's Stanza library is written in Python and can call a server instance to get annotations. Make sure to include all the required annotators: tokenize,ssplit,pos,lemma,ner,depparse,natlog,openie.
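For example, here is a minimal sketch using Stanza's CoreNLPClient, assuming a local CoreNLP installation pointed to by CORENLP_HOME (or an already running server); the example sentence is just an illustration:
from stanza.server import CoreNLPClient

text = 'The quick brown fox jumps over the lazy dog.'

# Start (or attach to) a CoreNLP server with all annotators OpenIE needs.
with CoreNLPClient(
        annotators=['tokenize', 'ssplit', 'pos', 'lemma', 'ner',
                    'depparse', 'natlog', 'openie'],
        be_quiet=True) as client:
    ann = client.annotate(text)
    # Each sentence in the returned document carries its OpenIE triples.
    for sentence in ann.sentence:
        for triple in sentence.openieTriple:
            print(triple.subject, triple.relation, triple.object)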
I just found another way, using pycorenlp against a running CoreNLP server:
from pycorenlp import StanfordCoreNLP

nlp = StanfordCoreNLP('<<url_to_your_server>>')   # e.g. 'http://localhost:9000'
text = 'The quick brown fox jumps over the lazy dog.'
output = nlp.annotate(text, properties={
'annotators': 'tokenize, ssplit, pos, depparse, parse, openie',
'outputFormat': 'json'
})
The available properties can be found through the keys returned by:
print(output['sentences'][0].keys())
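In particular, with the openie annotator enabled, the triples themselves sit under the 'openie' key of each sentence in the JSON output (a small sketch, assuming the annotate call above succeeded):
# Each sentence dict holds a list of subject/relation/object triples.
for sentence in output['sentences']:
    for triple in sentence['openie']:
        print(triple['subject'], triple['relation'], triple['object'])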
There are quite a lot of posts on removing namespaces in Python, but nearly all use the lxml package, which seems very nice but which I've had trouble getting working on Windows.
What I want to achieve with my tags is similar to:
removing namespace aliases from xml
but that answer is oriented toward JSON and doesn't seem to be Python-based.
Similarly, I'm unclear on how to implement this:
https://stackoverflow.com/a/61786754/9249533
This older post is somewhat helpful:
https://stackoverflow.com/a/18160058/9249533
But I'm wondering what the options are as of July 2021.
FWIW, my data are Excel XML (SpreadsheetML) exports.
My aim is to just access the data. I do not care to move it back to Excel.
Presently, if I run:
from xml.etree import ElementTree as ET
tree = ET.parse(in_path + 'myfile.xml')
root = tree.getroot()
for child in root.iter():
    print(child.tag)
it returns this for the 'Data' tag:
{urn:schemas-microsoft-com:office:spreadsheet}Data
I'd like it to just be:
Data
The prefix/namespace is hampering my very modest XML interpretation skills. Any guidance on doing this with packages that are more readily Windows-compatible (or, better still, conda-installable) would be much appreciated.
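For context, here is a rough stdlib-only sketch of the kind of thing I'm after, assuming every tag follows the usual '{namespace}local' form that ElementTree produces:
from xml.etree import ElementTree as ET

tree = ET.parse(in_path + 'myfile.xml')
root = tree.getroot()

# Strip the '{urn:...}' prefix that ElementTree prepends to each tag.
for el in root.iter():
    if isinstance(el.tag, str) and el.tag.startswith('{'):
        el.tag = el.tag.split('}', 1)[1]

for child in root.iter():
    print(child.tag)   # now prints 'Data' instead of '{urn:schemas-microsoft-com:office:spreadsheet}Data'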
All I want is to generate API docs from the function docstrings in my source code, presumably through Sphinx's autodoc extension, to form my lean API documentation. My code follows the functional programming paradigm, not OOP, as demonstrated below.
I'd probably, as a second step, add one or more documentation pages for the project, hosting things like introductory comments, code examples (leveraging doctest I guess) and of course linking to the API documentation itself.
What might be a straightforward flow to accomplish documentation from docstrings here? Sphinx is a great popular tool, yet I find its getting started pages a bit dense.
What I've tried, from within my source directory:
$ mkdir documentation
$ sphinx-apidoc -f --ext-autodoc -o documentation .
No error messages, yet this doesn't find (or handle) the docstrings in my source files; it just creates an .rst file per source file, with contents like the following:
tokenizer module
================
.. automodule:: tokenizer
:members:
:undoc-members:
:show-inheritance:
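As far as I understand, these stubs are only filled in with the actual docstrings when sphinx-build runs with autodoc enabled and my package importable, so conf.py would need something like the following (a sketch, assuming the sources sit one directory above documentation/):
# documentation/conf.py (excerpt)
import os
import sys
sys.path.insert(0, os.path.abspath('..'))   # make tokenizer.py importable for autodoc

extensions = [
    'sphinx.ext.autodoc',
]
followed by a build such as:
$ sphinx-build -b html documentation documentation/_build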
Basically, my source files look like the following, without much module ceremony or object-oriented content in them (I like functional programming, even though it's Python this time around). I've truncated the sample source file below, of course; it contains more functions than shown.
tokenizer.py
from hltk.util import clean, safe_get, safe_same_char
"""
Basic tokenization for text
not supported:
+ forms of pseuod elipsis (...)
support for the above should be added only as part of an automata rewrite
"""
always_swallow_separators = u" \t\n\v\f\r\u200e"
always_separators = ",!?()[]{}:;"
def is_one_of(char, chars):
'''
Returns whether the input `char` is any of the characters of the string `chars`
'''
return chars.count(char)
Or would you recommend a different tool and flow for this use case?
Many thanks!
If you find Sphinx too cumbersome and particular to use for simple projects, try pdoc:
$ pdoc --html tokenizer.py
In gensim's latest version, loading trained vectors from a file is done using KeyedVectors and doesn't require instantiating a new Word2Vec object. But now my code is broken because I can't use the model.vector_size property. What is the alternative to that? I mean something better than just kv[kv.index2word[0]].size.
kv.vector_size still works; I'm using gensim 2.3.0, which is the latest as I write. (I am assuming kv is your KeyedVectors object.) It appears object properties are not documented on the API page, but auto-complete suggests it, and there is no deprecation warning or anything.
Your question helped me answer my own, which was how to get the number of words: len(kv.index2word)
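Put together, a small sketch (the vector file name is just a placeholder):
from gensim.models import KeyedVectors

# Load previously trained vectors saved in word2vec format.
kv = KeyedVectors.load_word2vec_format('vectors.bin', binary=True)

print(kv.vector_size)       # dimensionality of each vector
print(len(kv.index2word))   # number of words in the vocabulary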
I can use the en-us models that come with Sphinx4, no problem:
cfg.setAcousticModelPath("resource:/edu/cmu/sphinx/models/en-us/en-us")
cfg.setDictionaryPath("resource:/edu/cmu/sphinx/models/en-us/cmudict-en-us.dict")
cfg.setLanguageModelPath("resource:/edu/cmu/sphinx/models/en-us/en-us.lm.bin")
I can use this to transcribe an English sound file recording.
Now I want to use this with German recordings. On the website I find a link to Acoustic and Language Models, and in it there is an archive, 'German Voxforge'. In it I find the corresponding files for the acoustic model path, but it does not contain a dictionary or language model as far as I can see.
How do I get the dictionary and language model path for German in Sphinx4?
You create them yourself. You can create a language model from subtitles or Wikipedia dumps. The documentation is here.
The latest German models are actually not on the CMUSphinx page; they are at github/gooofy. In this gooofy project you can find dictionary documentation, models and related materials.
I tried the German model with pocketsphinx and got some errors because the "invalid" *.lm.bin language model files were used.
I switched to the *.lm.gz files and it works fine.
The proper configuration list is:
fst = voxforge-de.fst
hmm folder = model_parameters/voxforge.cd_cont_6000
dictionary = cmusphinx-voxforge-de.dic
language model = cmusphinx-voxforge-de.lm.gz
To get the "hmm" path you should unzip an archive:
cmusphinx-de-voxforge-5.2.tar.gz
I think it should be the same for Sphinx4, so please give it a try.
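For reference, here is a minimal pocketsphinx (Python bindings) sketch using those paths; it assumes a 16 kHz, 16-bit mono raw recording, and the audio file name is just a placeholder:
from pocketsphinx import Decoder

config = Decoder.default_config()
config.set_string('-hmm', 'model_parameters/voxforge.cd_cont_6000')
config.set_string('-dict', 'cmusphinx-voxforge-de.dic')
config.set_string('-lm', 'cmusphinx-voxforge-de.lm.gz')
decoder = Decoder(config)

decoder.start_utt()
with open('aufnahme.raw', 'rb') as f:
    while True:
        buf = f.read(1024)
        if not buf:
            break
        decoder.process_raw(buf, False, False)
decoder.end_utt()

hyp = decoder.hyp()
print(hyp.hypstr if hyp is not None else '(no hypothesis)')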
I have recently upgraded to the latest version of Stanford CoreNLP. The code I previously used to get the subject or object in a sentence was
System.out.println("subject: "+dependencies.getChildWithReln(dependencies.getFirstRoot(), EnglishGrammaticalRelations.NOMINAL_SUBJECT));
but this now returns null.
I have tried creating a relation with
GrammaticalRelation subjreln =
edu.stanford.nlp.trees.GrammaticalRelation.valueOf("nsubj");
without success. If I extract a relation using code like
GrammaticalRelation target = (dependencies.childRelns(dependencies.getFirstRoot())).iterator().next();
Then run the same request,
System.out.println("target: "+dependencies.getChildWithReln(dependencies.getFirstRoot(), target));
then I get the desired result, confirming that the parsing worked fine (I also know this from printing out the full dependencies).
I suspect my problem has to do with the switch to universal dependencies, but I don't know how to create the GrammaticalRelation from scratch in a way that will match what the dependency parser found.
Since version 3.5.2 the default dependency representation in CoreNLP is Universal Dependencies. This new representation is implemented in a different class (UniversalEnglishGrammaticalRelations), so the GrammaticalRelation objects are now defined somewhere else.
All you have to do to use the new version is to replace EnglishGrammaticalRelations with UniversalEnglishGrammaticalRelations:
System.out.println("subject: "+dependencies.getChildWithReln(dependencies.getFirstRoot(), UniversalEnglishGrammaticalRelations.NOMINAL_SUBJECT));
Note, however, that some relations in the new representation are different and might no longer exist (nsubj still does). We are currently compiling a migration guide from the old representation to the new Universal Dependencies relations. It is still incomplete, but it already contains all relation names and their class names in CoreNLP.