Using Stanford CoreNLP, I am trying to parse text with the neural-network dependency parser. It runs really fast (that's why I want to use it rather than the LexicalizedParser) and produces high-quality dependency relations. I am also interested in retrieving the parse trees (Penn Treebank style) from it. So, given the GrammaticalStructure, I get its root (using root()) and then try to print it with the toOneLineString() method. However, root() returns the root node of the tree with an empty/null list of children. I couldn't find anything on this in the documentation or FAQs.
GrammaticalStructure gs = parser.predict(tagged);
// Print typed dependencies
System.err.println(gs);
// get the tree and print it out in the parenthesised form
TreeGraphNode tree = gs.root();
System.err.println(tree.toOneLineString());
The output of this is:
ROOT-0{CharacterOffsetBeginAnnotation=-1, CharacterOffsetEndAnnotation=-1, PartOfSpeechAnnotation=null, TextAnnotation=ROOT}Typed Dependencies:
[nsubj(tell-5, I-1), aux(tell-5, can-2), advmod(always-4, almost-3), advmod(tell-5, always-4), root(ROOT-0, tell-5), advmod(use-8, when-6), nsubj(use-8, movies-7), advcl(tell-5, use-8), amod(dinosaurs-10, fake-9), dobj(use-8, dinosaurs-10), punct(tell-5, .-11)]
ROOT-0
How can I get the parse tree too?
Figured I can use the Shift-Reduce constituency parser made available by Stanford. It's very fast and the results are comparable.
I am new to ANTLR, and I am digging into it for a project. My work would require me to generate a parse tree from a source code file, convert the parse tree into a string that holds all the information about the parse tree in a somewhat "human-readable" form. Parts of this string (representing the parse tree) will then be modified, and the modified string will have to be converted to a changed source code.
I have found out that the .toStringTree(tree) method can be used in ANTLR to print out the tree in LISP format. Is there a better way to represent the parse tree as a string that holds all information?
Can the string-parse-tree be reverted back to the original source code (in the same language) using ANTLR? If no, are there any tools for this?
Can the string-parse-tree be reverted back to the original source code (in the same language) using ANTLR?
That string does not contain the token types, just the matched text. In other words, you cannot recreate a parse tree from the output of toStringTree. Besides, many ANTLR grammars have lexer rules that skip certain input (white space and line breaks, for example), so converting a parse tree back to the original input source is not always possible.
If no, are there any tools for this?
Without a doubt there are; I suggest you search GitHub. But once you have the parse tree, it is trivial to create a custom tree structure and convert that to JSON.
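A minimal sketch of the "custom tree structure converted to JSON" idea, in plain Python rather than against the ANTLR runtime. The Node class, its field names, and the sample tree are hypothetical and only for illustration. Each leaf keeps both the token type and the matched text, which is the information the LISP-style string drops. Rebuilding the source by concatenating leaf text is exact only if the lexer keeps every token, including white space.

```python
import json
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """Hypothetical custom tree node: a rule node or a token (leaf) node."""
    kind: str                     # rule name or token type name
    text: Optional[str] = None    # matched text, set only on leaves
    children: List["Node"] = field(default_factory=list)


def to_dict(node: Node) -> dict:
    """Convert a Node into plain dicts and lists that json can serialize."""
    if node.text is not None:
        return {"kind": node.kind, "text": node.text}
    return {"kind": node.kind, "children": [to_dict(c) for c in node.children]}


def from_dict(data: dict) -> Node:
    """Rebuild a Node tree from the structure produced by to_dict."""
    if "text" in data:
        return Node(data["kind"], text=data["text"])
    return Node(data["kind"], children=[from_dict(c) for c in data["children"]])


def source_text(node: Node) -> str:
    """Concatenate leaf text in order to recover the source string."""
    if node.text is not None:
        return node.text
    return "".join(source_text(c) for c in node.children)


# Made-up tree for the input "x = 1;" with white space kept as tokens.
tree = Node("assignment", children=[
    Node("ID", "x"), Node("WS", " "), Node("EQ", "="), Node("WS", " "),
    Node("expr", children=[Node("INT", "1")]),
    Node("SEMI", ";"),
])
encoded = json.dumps(to_dict(tree))
restored = from_dict(json.loads(encoded))
```

With a real ANTLR parse tree, the same Node objects could be filled in from a visitor or listener; that part is not shown here.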
I am using the clang Python bindings to traverse C/C++ code into an AST. How can I get a tree-based AST structure?
Some pointers on where to start, tutorials or anything in this regard will be of great help!!!
I found a very useful write-up (if you want to check it out, here is the link: https://www.chess.com/blog/lockijazz/using-python-to-traverse-and-modify-clang-s-ast-tree) and tried its code, but unfortunately I didn't get a useful output.
function_calls = []
function_declarations = []
def traverse(node):
    for child in node.get_children():
        traverse(child)
    if node.type == clang.cindex.CursorKind.CALL_EXPR:
        function_calls.append(node)
    if node.type == clang.cindex.CursorKind.FUNCTION_DECL:
        function_declarations.append(node)
    print 'Found %s [line=%s, col=%s]' % (node.displayname, node.location.line, node.location.column)
clang.cindex.Config.set_library_path("/Users/tomgong/Desktop/build/lib")
index = clang.cindex.Index.create()
tu = index.parse(sys.argv[1])
root = tu.cursor
traverse(root)
Just in case anyone is still having trouble: I found that you should be using kind instead of type.
You can run clang.cindex.CursorKind.get_all_kinds() to retrieve all kinds and see that the value of node.type does not appear among them.
import sys
import clang.cindex

function_calls = []
function_declarations = []

def traverse(node):
    # Visit the children first, then classify the current node by its kind
    for child in node.get_children():
        traverse(child)
    if node.kind == clang.cindex.CursorKind.CALL_EXPR:
        function_calls.append(node)
    if node.kind == clang.cindex.CursorKind.FUNCTION_DECL:
        function_declarations.append(node)
    print('Found %s [line=%s, col=%s]' % (node.displayname, node.location.line, node.location.column))

clang.cindex.Config.set_library_path("/Users/tomgong/Desktop/build/lib")
index = clang.cindex.Index.create()
tu = index.parse(sys.argv[1])
root = tu.cursor
traverse(root)
how can I get a tree based AST structure?
The translation unit object's cursor (tu.cursor) is actually the start node of an AST. You might wanna use the clang tool to visually inspect the tree. Maybe this will shed some light and give you an intuition for how to work with the tree.
clang++ -cc1 -ast-dump test.cpp
But basically, it boils down to getting children nodes of the main node (tu.cursor) and recursively traversing them, and getting to the nodes which are of interest to you.
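The recursion described above can be sketched as a function that turns a cursor into nested dictionaries. This is only an illustration: ast_to_dict and fake_cursor are hypothetical names, and the function relies only on the attributes kind, spelling, location.line and get_children(), which clang.cindex.Cursor provides. The stand-in cursor exists so the sketch runs without libclang installed.

```python
from types import SimpleNamespace


def ast_to_dict(cursor):
    """Recursively convert a cursor-like node into nested dicts."""
    return {
        # CursorKind objects expose .name; plain strings fall back to str()
        "kind": getattr(cursor.kind, "name", str(cursor.kind)),
        "spelling": cursor.spelling,
        "line": cursor.location.line,
        "children": [ast_to_dict(child) for child in cursor.get_children()],
    }


def fake_cursor(kind, spelling, line, children=()):
    """Stand-in for clang.cindex.Cursor, used only for this demo."""
    return SimpleNamespace(
        kind=kind,
        spelling=spelling,
        location=SimpleNamespace(line=line),
        get_children=lambda: iter(children),
    )


# Made-up tree: a translation unit with one function that makes one call.
demo = fake_cursor("TRANSLATION_UNIT", "test.cpp", 0, [
    fake_cursor("FUNCTION_DECL", "main", 1, [
        fake_cursor("CALL_EXPR", "printf", 2),
    ]),
])
tree = ast_to_dict(demo)
```

With the real bindings, the call would be ast_to_dict(tu.cursor) on the translation unit returned by index.parse().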
You might also wanna check an article by Eli Bendersky on how to start working with the Python bindings:
https://eli.thegreenplace.net/2011/07/03/parsing-c-in-python-with-clang#id9
unfortunately I didn't get a useful output.
You might run into incomplete or wrong parsing when you don't give libclang the include paths used by the parsed file. For example, if the source file you want to parse uses some of the Qt includes, you need to specify the relevant include paths in the parse() call, as in this example:
index = clang.cindex.Index.create()
tu = index.parse(src_file, args=[
    '-I/usr/include/x86_64-linux-gnu/qt5/',
    '-I/usr/include/x86_64-linux-gnu/qt5/QtCore'])
Also read the comments in the clang.cindex Python module; they can help you. For example, I found the solution above by reading those comments.
I have been using pycparser to obtain the AST of C source code and explore it using Python. Note that pycparser parses C only, not C++.
You can find the API for exploring the AST in this example from the repository.
I have already successfully parsed sentences to get dependency information using the Stanford parser (version 3.9.1, run in the Eclipse IDE) with the "TypedDependencies" option, but how can I get dependency information about a single word (its parent, siblings and children)? I have searched the javadoc, and it seems the class SemanticGraph is used for this job, but it needs an IndexedWord as input. How do I get an IndexedWord? Do you have any simple samples?
You can create a SemanticGraph from a List of TypedDependencies and then you can use the methods getChildren(IndexedWord iw), getParent(IndexedWord iw), and getSiblings(IndexedWord iw). (See the javadoc of SemanticGraph).
To get the IndexedWord of a specific word, you can, for example, use the SemanticGraph method getNodeByIndex(int i), which will return the IndexedWord of the i-th token in a sentence.
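To illustrate what parent, children and siblings mean on a list of typed dependencies, here is a small sketch in plain Python. It is not the CoreNLP API: build_index and siblings are hypothetical helpers, and the input is simply the printed form of typed dependencies such as nsubj(tell-5, I-1), where the number after the hyphen is the token index.

```python
import re
from collections import defaultdict

# relation(governor-index, dependent-index), e.g. "nsubj(tell-5, I-1)"
DEP_RE = re.compile(r"([\w:]+)\((.+)-(\d+), (.+)-(\d+)\)")


def build_index(dependencies):
    """Return (parent, children) lookups keyed by token index."""
    parent = {}
    children = defaultdict(list)
    for dep in dependencies:
        rel, gov, gov_i, word, word_i = DEP_RE.fullmatch(dep).groups()
        parent[int(word_i)] = (int(gov_i), gov, rel)
        children[int(gov_i)].append((int(word_i), word, rel))
    return parent, children


def siblings(index, parent, children):
    """Other dependents that share the same governor as token `index`."""
    gov_i = parent[index][0]
    return [c for c in children[gov_i] if c[0] != index]


deps = [
    "nsubj(tell-5, I-1)",
    "aux(tell-5, can-2)",
    "advmod(always-4, almost-3)",
    "advmod(tell-5, always-4)",
    "root(ROOT-0, tell-5)",
]
parent, children = build_index(deps)
```

Here parent[1] describes the governor of "I", children[5] lists the dependents of "tell", and siblings(1, parent, children) lists the other dependents of "tell".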
Whilst trying to tag named entities with the Stanford NER tool, I get this kind of output:
A jury in <ORGANIZATION>Marion County Superior Court</ORGANIZATION> was expected to begin deliberations in the case on <DATE>Wednesday</DATE> or <DATE>Thursday</DATE>.
Of course processing any XML without a root does not work, so I added this:
<root>A jury in <ORGANIZATION>Marion County Superior Court</ORGANIZATION> was expected to begin deliberations in the case on <DATE>Wednesday</DATE> or <DATE>Thursday</DATE>.</root>
I tried building a tree with the method from "stripping inline tags with Python's lxml", but it does not work. It yields this error on the line tree = etree.fromstring(text):
lxml.etree.XMLSyntaxError: xmlParseEntityRef: no name, line 1, column 1793
Does anyone know a solution for this? Or perhaps another method which allows me to build a tree from any text with inlineXML tags, keeping only the tagged tokens and removing/ignoring the rest of the text.
In the end I did it without using a parser or a tree but just used regular expressions. This is the code that works nice and fast:
import re
# Tag names must match the tagger output exactly (e.g. ORGANIZATION, DATE)
NER = ['TIME', 'LOCATION', 'ORGANIZATION', 'PERSON', 'MONEY', 'PERCENT', 'DATE']
entities = {}
for cat in NER:
    regex_cat = re.compile('<' + cat + '>(.*?)</' + cat + '>')
    entities[cat] = re.findall(regex_cat, data)
Here data is just a string of text. The code uses regular expressions to find all entities of each category listed in NER and stores them as a list in a dictionary. This can be used for any inlineXML string, where NER is just a list of all possible tags in the string.
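As an alternative to the regular expressions, the xmlParseEntityRef: no name error in the question is what the parser reports for a bare & in the text, so escaping those before parsing lets a tree be built after all. The sketch below uses the standard library's xml.etree.ElementTree instead of lxml; tagged_entities is a hypothetical helper, and the "AT&T" in the sample is made up to trigger the problem. A bare < in the text would still need separate handling.

```python
import re
import xml.etree.ElementTree as ET
from collections import defaultdict

# An & that does not start a predefined or numeric XML entity
BARE_AMP = re.compile(r"&(?!(?:amp|lt|gt|quot|apos|#[0-9]+|#x[0-9A-Fa-f]+);)")


def tagged_entities(inline_xml):
    """Collect the text of each inline tag, grouped by tag name."""
    safe = BARE_AMP.sub("&amp;", inline_xml)
    root = ET.fromstring("<root>" + safe + "</root>")
    entities = defaultdict(list)
    for element in root:
        # Untagged text lives in root.text and element.tail and is ignored
        entities[element.tag].append(element.text)
    return dict(entities)


sample = (
    "A jury in <ORGANIZATION>Marion County Superior Court</ORGANIZATION> "
    "heard AT&T on <DATE>Wednesday</DATE> or <DATE>Thursday</DATE>."
)
result = tagged_entities(sample)
```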
How can I extract subject-verb-object (SVO) triples using NLP in Java? I am new to NLP and am currently using OpenNLP. How can I do this in Java for a particular sentence?
LexicalizedParser lp = new LexicalizedParser("englishPCFG.ser.gz");
String[] sent = { "This", "is", "an", "easy", "sentence", "." };
Tree parse = (Tree) lp.apply(Arrays.asList(sent));
parse.pennPrint();
System.out.println();
TreePrint tp = new TreePrint("penn,typedDependenciesCollapsed");
tp.print(parse);
I am getting a compilation error at
new LexicalizedParser("englishPCFG.ser.gz");
The constructor LexicalizedParser(String) is undefined
It seems you are using a newer version of the Stanford parser.
In newer versions, the model is no longer loaded through a constructor; a dedicated static method is provided instead. You can use:
LexicalizedParser lp = LexicalizedParser.loadModel("englishPCFG.ser.gz");
There are several overloads of this method.
See the Stanford documentation for the various overloads of loadModel.
This is code from the Stanford dependency parser, not from OpenNLP. Follow the example given in ParserDemo.java (and/or ParserDemo2.java) that's included in the stanford-parser directory and make sure that your demo code and the stanford-parser.jar in your classpath are from the same version of the parser. I suspect you are using a more recent version of the parser with older demo code.
You can use Stanford CoreNLP. Check the answer here for a "rough algorithm" for getting subject-predicate-object from a sentence.
You can use ReVerb. Check the answer here on using "ReVerb" for information extraction from a sentence.