How to use the Chunker class in OpenNLP?

The ChunkerME class in OpenNLP has a chunk() method which takes two String[] arguments: the tokens (the actual words) first, and their part-of-speech tags second.
I have a tagged string in the format Sir_NNP Arthur_NNP Conan_NNP... and I'd like to chunk it using the ChunkerME class, but the chunker does not accept this string as is. However, the OpenNLP command line has a command (opennlp ChunkerME en-chunker.bin) which directly accepts a tagged sentence and returns a chunked sentence.
How can I get the same behaviour through the API?
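
A minimal sketch of what the command-line tool does, assuming en-chunker.bin sits in the working directory (the class and variable names are illustrative): split each word_TAG pair on its last underscore into parallel token and tag arrays, then pass both to chunk().

import java.io.FileInputStream;
import java.io.InputStream;
import opennlp.tools.chunker.ChunkerME;
import opennlp.tools.chunker.ChunkerModel;

public class ChunkTagged {
    public static void main(String[] args) throws Exception {
        String tagged = "Sir_NNP Arthur_NNP Conan_NNP";
        String[] pairs = tagged.split("\\s+");
        String[] tokens = new String[pairs.length];
        String[] tags = new String[pairs.length];
        for (int i = 0; i < pairs.length; i++) {
            // word_TAG: split on the last underscore
            int sep = pairs[i].lastIndexOf('_');
            tokens[i] = pairs[i].substring(0, sep);
            tags[i] = pairs[i].substring(sep + 1);
        }
        try (InputStream modelIn = new FileInputStream("en-chunker.bin")) {
            ChunkerME chunker = new ChunkerME(new ChunkerModel(modelIn));
            // tokens first, then the POS tags
            String[] chunks = chunker.chunk(tokens, tags);
            for (int i = 0; i < chunks.length; i++) {
                System.out.println(tokens[i] + "\t" + tags[i] + "\t" + chunks[i]);
            }
        }
    }
}

The chunker returns one BIO-style chunk tag per token (B-NP, I-NP, and so on), which is essentially the information the command-line tool prints.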

Related

How to match the lemma of a string with the lemmas of a list using PhraseMatcher in spaCy

I have a list of phrases and a sentence, and I want to match the list entries against the lemmas of the words in the sentence, i.e.,
list_words = ['play', 'burn fireworks', 'eat']
sentence = "sita was playing with her friends while her brother was burning fireworks"
I tried,
from spacy.matcher import PhraseMatcher

patterns = [__model.make_doc(text) for text in list_words]
spacy_doc = __model(sentence)
matcher = PhraseMatcher(__model.vocab, attr="LEMMA")
matcher.add("list_words", None, *patterns)
matches = matcher(spacy_doc)
That is, I'm adding LEMMA as attr in the PhraseMatcher, but it did not help: it should have matched "burning fireworks" and "playing" in the sentence, and instead I am getting an empty list.
If __model has the tagger enabled (which it probably does by default), this will work if you change __model.make_doc(text) to __model(text) when you create the patterns. make_doc() only works for attr="ORTH" because it doesn't do anything beyond tokenization.
If you have a lot of lemma-based patterns and none of them need parses or named entities, you could disable parser and ner in __model to make things faster, since the lemmatizer only depends on the tagger.
(PhraseMatcher warns you that nlp(text) might be slow for ORTH-only patterns and suggests using nlp.make_doc() instead, but I think it should also try to warn you if your document doesn't have the attributes you're trying to match.)

Search a string and list all sentences matching that string

I am trying to code in Scala for the following use case:
Search a string in a text file and list only the sentences that have a match for this string.
I tried using the following:
val fileContents = Source.fromFile("/Users/sc/Documents/Scala_Code/input.txt").getLines.mkString
val sentence = fileContents.filter(line => fileContents.contains("string to search"))
This lists the entire text file even if there is only one match. I need just the sentences that have a match.
I'd appreciate it if someone could provide some input.
I think it's hard to reliably describe a sentence in a regex. Nevertheless, here's my suggestion:
for all sentences (in case you want to pattern match on them):
"""\A?\b((?!\?+"?|!+"?|\.+)(.|\n))+(\Z|\?+"?|!+"?|\.+)""".r.findAllIn(fileContents.mkString) //.toSeq
For a specific string (for example you):
"""\A?\b((?!\?+"?|!+"?|\.+)(.|\n))+(\Z|\?+"?|!+"?|\.+)""".r.findAllIn(fileContents.mkString).toIterator.withFilter(_.contains("you")) //.toSeq
toSeq (or toList) is useful for checking on a small amount of data...
You can test it here: https://scalafiddle.io/sf/0znMzyi/8
Hope it helps.

How to get dependency information about a word?

I have already successfully parsed sentences to get dependency information using the Stanford parser (version 3.9.1, run from Eclipse) with the command "TypedDependencies", but how can I get dependency information about a single word (its parent, siblings, and children)? I have searched the Javadoc, and it seems the SemanticGraph class does this job, but it needs an IndexedWord as input. How do I get an IndexedWord? Do you have any simple samples?
You can create a SemanticGraph from a List of TypedDependencies and then you can use the methods getChildren(IndexedWord iw), getParent(IndexedWord iw), and getSiblings(IndexedWord iw). (See the javadoc of SemanticGraph).
To get the IndexedWord of a specific word you can, for example, use the SemanticGraph method getNodeByIndex(int i), which returns the IndexedWord of the i-th token in the sentence.
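For example, a minimal sketch (assuming tdl is the List<TypedDependency> you already obtain from the parser; the method name is illustrative):

import java.util.List;
import edu.stanford.nlp.ling.IndexedWord;
import edu.stanford.nlp.semgraph.SemanticGraph;
import edu.stanford.nlp.trees.TypedDependency;

public class DependencyInfo {
    static void printNeighbours(List<TypedDependency> tdl, int tokenIndex) {
        SemanticGraph graph = new SemanticGraph(tdl);

        // token indices in a SemanticGraph are 1-based
        IndexedWord word = graph.getNodeByIndex(tokenIndex);

        System.out.println("parent: " + graph.getParent(word)); // null if the word is the root
        for (IndexedWord child : graph.getChildren(word)) {
            System.out.println("child: " + child);
        }
        for (IndexedWord sibling : graph.getSiblings(word)) {
            System.out.println("sibling: " + sibling);
        }
    }
}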

Python 3.3: Process inlineXML

Whilst trying to tag named entities with the Stanford NER tool, I get this kind of output:
A jury in <ORGANIZATION>Marion County Superior Court</ORGANIZATION> was expected to begin deliberations in the case on <DATE>Wednesday</DATE> or <DATE>Thursday</DATE>.
Of course processing any XML without a root does not work, so I added this:
<root>A jury in <ORGANIZATION>Marion County Superior Court</ORGANIZATION> was expected to begin deliberations in the case on <DATE>Wednesday</DATE> or <DATE>Thursday</DATE>.</root>
I tried building a tree with the method from "stripping inline tags with python's lxml", but it does not work. It yields this error on the line tree = etree.fromstring(text):
lxml.etree.XMLSyntaxError: xmlParseEntityRef: no name, line 1, column 1793
Does anyone know a solution for this? Or perhaps another method which allows me to build a tree from any text with inlineXML tags, keeping only the tagged tokens and removing/ignoring the rest of the text.
In the end I did it without a parser or a tree and just used regular expressions. This is the code, which works nicely and fast:
import re

# the entity tags produced by the Stanford NER tool
NER = ['TIME', 'LOCATION', 'ORGANIZATION', 'PERSON', 'MONEY', 'PERCENT', 'DATE']

entities = {}
for cat in NER:
    # non-greedy match of everything between <CAT> and </CAT>
    regex_cat = re.compile('<' + cat + '>(.*?)</' + cat + '>')
    entities[cat] = re.findall(regex_cat, data)
Here data is just a string of text. The code uses regular expressions to find all entities of each category listed in NER and stores the matches as a list in a dictionary. This works for any string with inlineXML tags, where NER is just the list of all possible tags in the string.

Using MiniPar parser output

How can I use MiniPar parser output to extract features like subject, object, verb, tense, etc., to be used in an English-text-to-ASL conversion project?
Have a look at the Predicate-Argument Extractor (PAX), a GATE component to extract (subject, predicate, object) triples from the output of several parsers, including RASP, SUPPLE, MiniPar, Stanford, and MuNPEx.
