I have already successfully parsed sentences to get dependency information using the Stanford Parser (version 3.9.1, run in the Eclipse IDE) with the "TypedDependencies" command, but how can I get dependency information about a single word (its parent, siblings, and children)? I have searched the javadoc; it seems the class SemanticGraph is used for this job, but it needs an IndexedWord as input. How do I get an IndexedWord? Do you have any simple samples?
You can create a SemanticGraph from a List of TypedDependency objects and then use the methods getChildren(IndexedWord iw), getParent(IndexedWord iw), and getSiblings(IndexedWord iw) (see the javadoc of SemanticGraph).
To get the IndexedWord of a specific word you can, for example, use the SemanticGraph method getNodeByIndex(int i), which returns the IndexedWord of the i-th token in the sentence.
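Putting the two together, a minimal sketch (untested against 3.9.1; printRelatives is just an illustrative helper name):

import java.util.Collection;
import java.util.List;
import edu.stanford.nlp.ling.IndexedWord;
import edu.stanford.nlp.semgraph.SemanticGraph;
import edu.stanford.nlp.trees.TypedDependency;

// Prints the parent, children, and siblings of the i-th token (indices are 1-based).
static void printRelatives(List<TypedDependency> deps, int i) {
    SemanticGraph graph = new SemanticGraph(deps);
    IndexedWord word = graph.getNodeByIndex(i);
    IndexedWord parent = graph.getParent(word);           // null if word is the root
    Collection<IndexedWord> children = graph.getChildren(word);
    Collection<IndexedWord> siblings = graph.getSiblings(word);
    System.out.println(word + " <- parent: " + parent);
    System.out.println("children: " + children);
    System.out.println("siblings: " + siblings);
}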
For POS tagging I am using spaCy. I found that POS tags for gerunds and infinitives are not given. How can I add these two new tags in spaCy? I can change tags within the list but am not able to add new tags. Please help. Thank you.
pattern = [tokens[t].pos_ == "VERB" and tokens[t-1].pos_ == "ADP"
           for t in range(len(tokens)-1)]
spacy.pipeline.tagger.Tagger.add_label(u"GERUND")
This gives the error:
TypeError: add_label() takes exactly 2 positional arguments (1 given)
Rather than modifying the output, it sounds like you should just use the .tag_ attribute, which is more detailed and language-specific. In this case the value will be VB for an infinitive and VBG for a gerund; these are Penn Treebank tags.
The .pos_ values are from Universal Dependencies and are less detailed, but appropriate for multilingual applications.
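A quick way to see the difference (assuming the small English model has been downloaded):

import spacy

nlp = spacy.load("en_core_web_sm")  # any English model will do
doc = nlp("I like swimming and I want to swim.")
for token in doc:
    print(token.text, token.pos_, token.tag_)
# "swimming" comes out as VERB/VBG (gerund),
# "swim" as VERB/VB (infinitive)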
I am writing detailed information for each function in a class, and hope that when others use my Python code, they can check the info for each function by using something like help(function_name).
For example, I have created a Python file called text_preprocess.py; in this file I have created a class which includes functions.
class Preprocess():

    def str_process(self, row_string):
        '''
        This Stanford CoreNLP package requires the text input as one single string.
        The input annotators are given in your command-line input.

        :param row_string: the string input for Stanford CoreNLP
        :return: JSON-format output
        '''
        parsed_json = self.nlp.annotate(row_string, properties={
            'annotators': self.standford_annotators,
            'outputFormat': 'json'
        })
        return parsed_json
In this function, as you can see, the info is within the triple quotes.
But I don't know how other users could see this info without looking into my code.
I have checked many solutions online where people use help(function_name), but it seems they wrote the code in a terminal and then typed help(function_name), and many of those examples do not have a class either. Using --help through the command line only gives them the parameter descriptions I added through argparse...
So, if I want others to be able to check the info of my functions without looking into the code, where and how could they do that?
Either be in the same directory as your script, or make sure your script is in one of the directories listed in sys.path (usually, the first option is easier for simple things, but if you want to do the second, use a virtualenv rather than trying to install the module system-wide).
Run pydoc3 text_preprocess to get documentation for the whole module. This recursively includes all of the items below the module, such as classes, their members, and functions.
Run pydoc3 text_preprocess.Preprocess if you just want the class.
Run pydoc3 text_preprocess.Preprocess.str_process if you just want the method.
Use Sphinx if you want nicely-formatted HTML or other formats such as PDF.
You may also want to remove that empty line at the beginning of your docstring; some docstring-parsing code may misinterpret it.
Putting triple-quoted strings just below the declaration of a class or method is called a docstring. You can read it using Preprocess.str_process.__doc__.
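So, for example, a user who can import text_preprocess.py can do:

from text_preprocess import Preprocess

help(Preprocess)                        # the class plus all its methods
help(Preprocess.str_process)            # just this method
print(Preprocess.str_process.__doc__)   # the raw docstring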
The ChunkerME class in OpenNLP has a chunk() method which takes two String[]: the actual tokens and their tags from the part-of-speech tagging step.
I have a tagged string in the format Sir_NNP Arthur_NNP Conan_NNP... and I'd like to chunk it using the ChunkerME class. However, the chunker does not accept this string as is, while the OpenNLP command line has a command (opennlp ChunkerME en-chunker.bin) which directly accepts a tagged sentence and returns a chunked sentence.
How can I do something like the command-line version in code?
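One way (a sketch, untested, assuming the standard opennlp.tools.chunker API and a local en-chunker.bin) is to split the word_TAG pairs yourself and pass the two arrays to the chunker:

import java.io.FileInputStream;
import java.io.InputStream;
import opennlp.tools.chunker.ChunkerME;
import opennlp.tools.chunker.ChunkerModel;

String tagged = "Sir_NNP Arthur_NNP Conan_NNP";
String[] pairs = tagged.split("\\s+");
String[] tokens = new String[pairs.length];
String[] tags = new String[pairs.length];
for (int i = 0; i < pairs.length; i++) {
    int sep = pairs[i].lastIndexOf('_');   // split word_TAG at the last underscore
    tokens[i] = pairs[i].substring(0, sep);
    tags[i] = pairs[i].substring(sep + 1);
}

try (InputStream in = new FileInputStream("en-chunker.bin")) {
    ChunkerME chunker = new ChunkerME(new ChunkerModel(in));
    String[] chunks = chunker.chunk(tokens, tags);  // tokens first, then tags
    for (int i = 0; i < tokens.length; i++) {
        System.out.println(tokens[i] + "\t" + tags[i] + "\t" + chunks[i]);
    }
}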
Using Stanford CoreNLP, I am trying to parse text with the neural-network dependency parser. It runs really fast (that's why I want to use it rather than the LexicalizedParser) and produces high-quality dependency relations. I am also interested in retrieving the parse trees (Penn style) from it. So, given the GrammaticalStructure, I get its root (using root()) and then try to print it out using the toOneLineString() method. However, root() returns the root node of the tree with an empty/null list of children. I couldn't find anything about this in the instructions or FAQs.
GrammaticalStructure gs = parser.predict(tagged);
// Print typed dependencies
System.err.println(gs);
// get the tree and print it out in the parenthesised form
TreeGraphNode tree = gs.root();
System.err.println(tree.toOneLineString());
The output of this is:
ROOT-0{CharacterOffsetBeginAnnotation=-1, CharacterOffsetEndAnnotation=-1, PartOfSpeechAnnotation=null, TextAnnotation=ROOT}Typed Dependencies:
[nsubj(tell-5, I-1), aux(tell-5, can-2), advmod(always-4, almost-3), advmod(tell-5, always-4), root(ROOT-0, tell-5), advmod(use-8, when-6), nsubj(use-8, movies-7), advcl(tell-5, use-8), amod(dinosaurs-10, fake-9), dobj(use-8, dinosaurs-10), punct(tell-5, .-11)]
ROOT-0
How can I get the parse tree too?
Figured I can use the Shift-Reduce constituency parser made available by Stanford instead: the neural-network dependency parser produces only dependency relations, so there is no constituency tree to recover from its GrammaticalStructure. The SR parser is very fast and the results are comparable.
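A sketch of that route (untested; assumes the SR-parser model jar is on the classpath, that tagged is the same tagged word list passed to predict() above, and that the model path below, the one usually shipped with the English models, is correct):

import edu.stanford.nlp.parser.shiftreduce.ShiftReduceParser;
import edu.stanford.nlp.trees.Tree;

ShiftReduceParser srParser =
        ShiftReduceParser.loadModel("edu/stanford/nlp/models/srparser/englishSR.ser.gz");
Tree tree = srParser.apply(tagged);  // constituency tree for the tagged sentence
System.err.println(tree);            // one-line parenthesised form
// tree.pennPrint();                 // or the indented Penn-treebank form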
While trying to tag named entities with the Stanford NER tool, I get this kind of output:
A jury in <ORGANIZATION>Marion County Superior Court</ORGANIZATION> was expected to begin deliberations in the case on <DATE>Wednesday</DATE> or <DATE>Thursday</DATE>.
Of course processing any XML without a root does not work, so I added this:
<root>A jury in <ORGANIZATION>Marion County Superior Court</ORGANIZATION> was expected to begin deliberations in the case on <DATE>Wednesday</DATE> or <DATE>Thursday</DATE>.</root>
I tried building a tree with the method from "stripping inline tags with python's lxml", but it does not work; it yields this error on the line tree = etree.fromstring(text):
lxml.etree.XMLSyntaxError: xmlParseEntityRef: no name, line 1, column 1793
Does anyone know a solution for this? Or perhaps another method which allows me to build a tree from any text with inlineXML tags, keeping only the tagged tokens and removing/ignoring the rest of the text.
In the end I did it without using a parser or a tree, just regular expressions. (The lxml error most likely comes from a bare & somewhere in the text, which is invalid XML.) This is the code, which works nicely and fast:
import re

NER = ['TIME', 'LOCATION', 'ORGANIZATION', 'PERSON', 'MONEY', 'PERCENT', 'DATE']
entities = {}
for cat in NER:
    regex_cat = re.compile('<' + cat + '>(.*?)</' + cat + '>')
    entities[cat] = regex_cat.findall(data)
Here data is just a string of text. The regular expressions find all entities of each category listed in NER and store them as a list in a dictionary. This can be used for any inlineXML string, where NER is just the list of all possible tags in the string.
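For example, with the tagged sentence from above:

data = ('A jury in <ORGANIZATION>Marion County Superior Court</ORGANIZATION> '
        'was expected to begin deliberations in the case on '
        '<DATE>Wednesday</DATE> or <DATE>Thursday</DATE>.')

Running the loop then gives entities['ORGANIZATION'] == ['Marion County Superior Court'] and entities['DATE'] == ['Wednesday', 'Thursday'], with the other categories as empty lists.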