Using MiniPar parser output - NLP

How can I use MiniPar parser output to extract features such as subject, object, verb, and tense, to be used for an English text to ASL conversion project?

Have a look at the Predicate-Argument Extractor (PAX), a GATE component to extract (subject, predicate, object) triples from the output of several parsers, including RASP, SUPPLE, MiniPar, Stanford, and MuNPEx.
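
Whichever component you use, the extraction step ultimately comes down to scanning dependency triples for the grammatical relations you care about. Below is a minimal sketch of that step, assuming you have already split MiniPar's raw output into (word, relation, head) triples; the relation labels 'subj'/'s' and 'obj' follow MiniPar's scheme, and the sample sentence and feature dictionary are purely illustrative.

# Sketch: pull subject/object features out of MiniPar-style dependency
# triples (word, relation, head). Parsing the raw MiniPar output into
# these triples is assumed to be done already.
deps = [
    ('John', 'subj', 'eats'),    # John is the subject of "eats"
    ('apples', 'obj', 'eats'),   # apples is the object of "eats"
]

features = {}
for word, rel, head in deps:
    if rel in ('subj', 's'):     # MiniPar subject relations
        features.setdefault(head, {})['subject'] = word
    elif rel == 'obj':           # MiniPar object relation
        features.setdefault(head, {})['object'] = word

print(features)  # {'eats': {'subject': 'John', 'object': 'apples'}}

Tense is not a dependency relation; it would come from the verb's part-of-speech annotations in the same output, so it can be collected in the same pass.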

Related

Text first data serialization with separate metadata

I'm trying to find a format that will help solve a very particular problem:
- A text-first solution.
- The ability to specify complex objects in a single text line (properties, key/value pairs, lists, complex objects).
- The object metadata structure should be separate from the data.
For example:
Metadata: Prop1:int|Prop2:string|PropList:int[,]
Data: 20|Something|10,20,30
that would mean:
Prop1 = 20
Prop2 = "Something"
PropList = [10,20,30]
Is there any existing serialization format resembling this?
I don't see any existing format that supports the scheme from your example. If you really need this schema (a type section plus a data section), then you need to write your own parser, and that's easy.
If you don't want to write your own parser, the most suitable mature format is still JSON.
As for specifying complex objects in a single text line: not YAML, not XML, not INI, not TOML.
Common formats are deliberately designed to be generic, carrying as little domain or business semantics as possible.
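
To show how little work "write your own parser" is for this scheme, here is a minimal sketch using only the Metadata/Data example above (the type names int, string, and int[,] come from that example; everything else is an assumption):

def parse_metadata(metadata):
    # 'Prop1:int|Prop2:string|PropList:int[,]' -> [(name, type), ...]
    return [tuple(spec.split(':', 1)) for spec in metadata.split('|')]

def parse_data(fields, data):
    # Decode '20|Something|10,20,30' against the field specs
    result = {}
    for (name, typ), raw in zip(fields, data.split('|')):
        if typ == 'int':
            result[name] = int(raw)
        elif typ == 'string':
            result[name] = raw
        elif typ == 'int[,]':
            result[name] = [int(x) for x in raw.split(',')]
        else:
            raise ValueError('unknown type: ' + typ)
    return result

fields = parse_metadata('Prop1:int|Prop2:string|PropList:int[,]')
print(parse_data(fields, '20|Something|10,20,30'))
# {'Prop1': 20, 'Prop2': 'Something', 'PropList': [10, 20, 30]}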

How to get a sort of inverse lemmatization for every language?

I found the spaCy library, which allows me to apply lemmatization to words (blacks -> black, EN; bianchi -> bianco, IT). My work is to analyze entities, not verbs or adjectives.
I'm looking for something that gives me all the possible word forms starting from the canonical form.
For example, from "black" (English) get "blacks", or from "bianco" (Italian) get "bianca", "bianchi", "bianche", etc. Is there any library that does this?
I'm not clear on exactly what you're looking for, but if a list of English lemmas is all you need, you can extract that easily enough from a GitHub library I have. Take a look at lemminflect. It first uses a dictionary approach to lemmatization, and there is a .csv file in there with all the different lemmas and their inflections. The file is LemmInflect/lemminflect/resources/infl_lu.csv.gz. You'll have to extract the lemmas from it. Something like...
import gzip

with gzip.open('LemmInflect/lemminflect/resources/infl_lu.csv.gz', 'rt') as f:
    for line in f:
        parts = line.split(',')
        lemma = parts[0]
        pos = parts[1]
        print(lemma, pos)
Alternatively, if you need a system to inflect words, that is exactly what lemminflect is designed to do. You can use it as a stand-alone library or as an extension to spaCy. There are examples of how to use it in the README.md and in the ReadTheDocs documentation.
I should note that this is for English only. I haven't seen a lot of code for inflecting words and you may have some difficulty finding this for other languages.
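
For the "black" -> "blacks" case, here is a short sketch against lemminflect's inflection API (getInflection and getAllInflections are its documented entry points; the exact tuples shown in the comments are indicative):

from lemminflect import getInflection, getAllInflections

# Inflect a lemma to a specific Penn Treebank tag
print(getInflection('black', tag='NNS'))   # ('blacks',)

# Or get every known inflection of the lemma at once,
# keyed by Penn Treebank tag
print(getAllInflections('black'))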

Converting ANTLR parse trees into strings and then reverting them

I am new to ANTLR and am digging into it for a project. My work requires me to generate a parse tree from a source code file and convert that parse tree into a string that holds all the information about the tree in a somewhat "human-readable" form. Parts of this string (representing the parse tree) will then be modified, and the modified string will have to be converted back into changed source code.
I have found that the .toStringTree(tree) method can be used in ANTLR to print out the tree in LISP format. Is there a better way to represent the parse tree as a string that holds all information?
Can the string parse tree be converted back to the original source code (in the same language) using ANTLR? If not, are there any tools for this?
Can the string parse tree be converted back to the original source code (in the same language) using ANTLR?
That string does not contain the token types, just the matched text. In other words, you cannot recreate a parse tree from the output of toStringTree. Besides, many ANTLR grammars have lexer rules that skip certain input (white space and line breaks, for example), so converting a parse tree back to the original input source is not always possible.
If not, are there any tools for this?
I suggest you do a search on GitHub. But once you have the parse tree, it is trivial to create a custom tree structure and convert that to JSON.
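
A minimal sketch of that JSON conversion using the ANTLR4 Python runtime: unlike toStringTree, it keeps rule names and token types, so the string is unambiguous. The parser and startRule names are placeholders for your generated classes, and note that input skipped by the lexer (whitespace, for example) is still lost.

import json
from antlr4.tree.Tree import TerminalNode

def tree_to_dict(node, rule_names):
    # Terminal: keep both the token text and its type so nothing is lost
    if isinstance(node, TerminalNode):
        token = node.getSymbol()
        return {'token': token.text, 'type': token.type}
    # Rule node: record the rule name and recurse into the children
    return {
        'rule': rule_names[node.getRuleIndex()],
        'children': [tree_to_dict(node.getChild(i), rule_names)
                     for i in range(node.getChildCount())],
    }

# Usage with your generated parser (placeholder names):
#   tree = parser.startRule()
#   print(json.dumps(tree_to_dict(tree, parser.ruleNames), indent=2))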

How to use the Chunker class in OpenNLP?

The ChunkerME class in OpenNLP has a chunk() method which takes two String[] arrays: the actual tokens and the tags from the part-of-speech tagging step.
I have a tagged string in the format Sir_NNP Arthur_NNP Conan_NNP... and I'd like to chunk it using the ChunkerME class, but the chunker does not accept this string as is. The OpenNLP command line, however, has a command (opennlp ChunkerME en-chunker.bin) which directly accepts a tagged sentence and returns a chunked sentence.
How can I do the same thing through the API?
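
The API has no equivalent of the command-line entry point, but the only missing step is splitting the tagged string into the two parallel arrays that chunk() expects. A sketch of that split (shown in Python for brevity; on the Java side, pass the resulting arrays to ChunkerME.chunk(tokens, tags)):

tagged = 'Sir_NNP Arthur_NNP Conan_NNP'

tokens, tags = [], []
for pair in tagged.split():
    word, _, tag = pair.rpartition('_')  # split on the LAST underscore
    tokens.append(word)
    tags.append(tag)

print(tokens)  # ['Sir', 'Arthur', 'Conan']
print(tags)    # ['NNP', 'NNP', 'NNP']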

Python 3.3: Process inlineXML

Whilst trying to tag named entities with the Stanford NER tool, I get this kind of output:
A jury in <ORGANIZATION>Marion County Superior Court</ORGANIZATION> was expected to begin deliberations in the case on <DATE>Wednesday</DATE> or <DATE>Thursday</DATE>.
Of course processing any XML without a root does not work, so I added this:
<root>A jury in <ORGANIZATION>Marion County Superior Court</ORGANIZATION> was expected to begin deliberations in the case on <DATE>Wednesday</DATE> or <DATE>Thursday</DATE>.</root>
I tried building a tree with the method from "stripping inline tags with python's lxml", but it does not work. It yields this error on the line tree = etree.fromstring(text):
lxml.etree.XMLSyntaxError: xmlParseEntityRef: no name, line 1, column 1793
Does anyone know a solution for this? Or perhaps another method which allows me to build a tree from any text with inlineXML tags, keeping only the tagged tokens and removing/ignoring the rest of the text.
In the end I did it without using a parser or a tree; I just used regular expressions (the XML error above comes from a raw, unescaped & in the text, which lxml refuses to parse). This is the code, which works nicely and fast:
import re

# All entity tags the tagger can produce (matching the sample output above)
NER = ['TIME', 'LOCATION', 'ORGANIZATION', 'PERSON', 'MONEY', 'PERCENT', 'DATE']

entities = {}
for cat in NER:
    # Non-greedy match of everything between an opening and closing tag
    regex_cat = re.compile('<' + cat + '>(.*?)</' + cat + '>')
    entities[cat] = regex_cat.findall(data)
Here data is just a string of text. The code uses regular expressions to find all entities of each category specified in NER and stores them as a list in a dictionary. The same approach works for any string with inlineXML tags, with NER set to the list of all possible tags in the string.
