I am annotating and analyzing a series of text files.
The pipeline.annotate method becomes increasingly slow each time it reads a file. Eventually, I get an OutOfMemoryError.
Pipeline is initialized ONCE:
protected void initializeNlp()
{
    Log.getLogger().debug("Starting Stanford NLP");
    // Creates a StanfordCoreNLP object with tokenization, sentence splitting,
    // POS tagging, lemmatization, NER (plus RegexNER), dependency parsing,
    // natural logic, and open IE
    Properties props = new Properties();
    props.put("annotators", "tokenize, ssplit, pos, lemma, ner, regexner, depparse, natlog, openie");
    props.put("regexner.mapping", namedEntityPropertiesPath);
    pipeline = new StanfordCoreNLP(props);
    Log.getLogger().debug("\n\n\nStarted Stanford NLP Successfully\n\n\n");
}
I then process each file using the same instance of the pipeline (as recommended elsewhere on SO and by Stanford).
public void processFile(Path file)
{
    try
    {
        Instant start = Instant.now();
        // cleanString holds the sanitized text read from the file
        Annotation document = new Annotation(cleanString);
        Log.getLogger().info("ANNOTATE");
        pipeline.annotate(document);
        long millis = Duration.between(start, Instant.now()).toMillis();
        Log.getLogger().info("Annotation Duration in millis: " + millis);
        AnalyzedFile af = AnalyzedFileFactory.getAnalyzedFile(AnalyzedFileFactory.GENERIC_JOB_POST, file);
        processSentences(af, document);
        Log.getLogger().info("\n\n\nFile Processing Complete\n\n\n\n\n");
        long millis1 = Duration.between(start, Instant.now()).toMillis();
        Log.getLogger().info("Total Duration in millis: " + millis1);
        allFiles.put(file.toUri().toString(), af);
    }
    catch (Exception e)
    {
        Log.getLogger().debug(e.getMessage(), e);
    }
}
To be clear, I expect the problem is with my configuration. However, I am certain that the stall and memory issues occur at the pipeline.annotate(document) call.
I dispose of all references to Stanford NLP objects other than pipeline (e.g., CoreLabel) after processing each file. That is, I do not keep references to any Stanford objects in my code beyond the method level.
Any tips or guidance would be deeply appreciated.
OK, that last sentence of the question made me go double-check. The answer is that I WAS keeping a reference to a CoreMap in one of my own classes. In other words, I was keeping in memory all the Trees, Tokens, and other analyses for every sentence in my corpus.
In short: keep StanfordNLP CoreMaps only for as many sentences as you actually need at once, then dispose of them.
(I expect a hard-core computational linguist would say there is rarely any need to keep a CoreMap once it has been analyzed, but I have to declare my neophyte status here.)
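For anyone who hits the same wall, here is a minimal sketch of the pattern that fixed it for me (the af.addToken helper is illustrative, not a real API): pull the plain values you need out of each sentence, store those, and let the CoreMap go out of scope.
private void processSentences(AnalyzedFile af, Annotation document)
{
    for (CoreMap sentence : document.get(CoreAnnotations.SentencesAnnotation.class))
    {
        for (CoreLabel token : sentence.get(CoreAnnotations.TokensAnnotation.class))
        {
            // Copy out plain Strings; these hold no reference back to the CoreMap
            af.addToken(token.word(), token.lemma(), token.ner()); // hypothetical helper
        }
        // Do NOT keep the sentence itself (e.g., af.addSentence(sentence)) -
        // that pins its trees, tokens, and graphs in memory
    }
    // When this method returns, each CoreMap becomes garbage-collectable
}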
The date detected by the following program gets split into two separate mentions, whereas the date in the NER output of the CoreNLP online demo is a single mention, as it should be. What should I change in my program to correct this?
Properties props = new Properties();
props.setProperty("annotators", "tokenize, ssplit, pos, lemma, ner, entitymentions");
StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

String text = "This software was released on Februrary 5, 2015.";
Annotation document = new Annotation(text);
pipeline.annotate(document);

List<CoreMap> sentences = document.get(SentencesAnnotation.class);
for (CoreMap sentence : sentences) {
    List<CoreMap> mentions = sentence.get(MentionsAnnotation.class);
    if (mentions != null) {
        for (CoreMap mention : mentions) {
            System.out.println("== Token=" + mention.get(TextAnnotation.class));
            System.out.println("NER=" + mention.get(NamedEntityTagAnnotation.class));
            System.out.println("Normalized NER=" + mention.get(NormalizedNamedEntityTagAnnotation.class));
        }
    }
}
Output from this program:
== Token=Februrary 5,
NER=DATE
Normalized NER=****0205
== Token=2015
NER=DATE
Normalized NER=2015
Output from the CoreNLP online demo: [screenshot; the demo shows "Februrary 5, 2015" as a single DATE mention]
Note that the online demo displays any sequence of consecutive tokens with the same NER tag as a single unit. Consider this sentence:
The event happened on February 5th January 9th.
This example yields "February 5th January 9th" as a single DATE in the online demo.
Yet it recognizes "February 5th" and "January 9th" as separate entity mentions.
Your sample code is looking at mentions, not NER chunks. Mentions are not being shown by the online demo.
That being said, I am not sure why SUTime is not joining February 5th and 2015 together in your example. Thanks for bringing this up, I will look into improving the module to fix this issue in future releases.
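In the meantime, if you want output that matches what the demo displays, you can group consecutive tokens sharing the same non-O NER tag yourself instead of reading entity mentions. A rough sketch (the grouping loop is mine, not a CoreNLP API):
// Merge consecutive tokens with the same NER tag, as the demo display does
List<CoreLabel> tokens = sentence.get(TokensAnnotation.class);
String prevTag = "O";
StringBuilder chunk = new StringBuilder();
for (CoreLabel token : tokens) {
    String tag = token.ner();
    if (!tag.equals(prevTag) && !prevTag.equals("O") && chunk.length() > 0) {
        // Tag changed: flush the finished chunk
        System.out.println(prevTag + ": " + chunk.toString().trim());
        chunk.setLength(0);
    }
    if (!tag.equals("O")) {
        chunk.append(token.word()).append(" ");
    }
    prevTag = tag;
}
if (!prevTag.equals("O") && chunk.length() > 0) {
    System.out.println(prevTag + ": " + chunk.toString().trim());
}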
I am trying to find the theme/noun phrase of a sentence using Stanford NLP.
For example, for "the white tiger" I would like to get
the theme/noun phrase "white tiger".
For this I used the parser with a Tregex pattern. The result I am getting is "tiger", which is not correct. The sample code I ran is:
public static void main(String[] args) throws IOException {
    Properties props = new Properties();
    props.setProperty("annotators", "tokenize,ssplit,parse");
    StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

    Annotation annotation = new Annotation("the white tiger");
    pipeline.annotate(annotation);
    List<CoreMap> sentences = annotation
            .get(CoreAnnotations.SentencesAnnotation.class);
    System.out.println("the number of sentences is......" + sentences.size());

    for (CoreMap sentence : sentences) {
        System.out.println("the sentence is..." + sentence.toString());
        Tree tree = sentence.get(TreeAnnotation.class);
        PrintWriter out = new PrintWriter(System.out);
        out.println("The first sentence parsed is:");
        tree.pennPrint(out);

        TregexPattern pattern = TregexPattern.compile("NP");
        TregexMatcher matcher = pattern.matcher(tree);
        while (matcher.find()) {
            Tree match = matcher.getMatch();
            List<Tree> children = match.getChildrenAsList();
            StringBuilder stringbuilder = new StringBuilder();
            for (Tree child : children) {
                String val = child.label().value();
                if (val.equals("NN") || val.equals("NNS")
                        || val.equals("NNP") || val.equals("NNPS")) {
                    Tree[] nn = child.children();
                    String ss = Sentence.listToString(nn[0].yield());
                    stringbuilder.append(ss).append(" ");
                }
            }
            System.out.println("the final stringbuilder is ...." + stringbuilder);
        }
    }
}
Any help is really appreciated, as are any other thoughts on how to achieve this.
It looks like you're descending the parse tree looking for NN.* tags.
"white" is a JJ (an adjective), which won't be picked up when you search only for NN.*.
You should take a close look at the Stanford Dependencies Manual and decide which part-of-speech tags encompass what you're looking for. You should also look at real linguistic data to figure out what matters in the task you're trying to complete. What about:
the tiger [with the black one] [who was white]
Simply traversing the tree in that case will give you tiger black white. Exclude PPs? Then you lose lots of good info:
the tiger [with white fur]
I'm not sure what you're trying to accomplish, but make sure what you're trying to do is restricted in the right way.
You ought to polish up on your basic syntax as well. "the white tiger" is what linguists call a noun phrase, or NP. You'd be hard-pressed to find a linguist who would call an NP a sentence. There are also often many NPs inside a sentence; sometimes they're even embedded inside one another. The Stanford Dependencies Manual is a good start. As the name suggests, the Stanford Dependencies are based on the idea of dependency grammar, though there are other approaches that bring different insights to the table.
Learning what linguists know about the structure of sentences could help you significantly in getting at what you're trying to extract, or, as happens often, in realizing that what you're trying to extract is too difficult and that you need to find a new route to a solution.
So, the problem I'm trying to tackle is the following:
I need a data source that emits messages at a certain frequency
There are N neural nets that need to process each message individually
The outputs from all neural nets are aggregated, and only when all N outputs for each message are collected should a message be declared fully processed
At the end, I should measure the time it took for a message to be fully processed (the time between when it was emitted and when all N neural-net outputs for that message have been collected)
I'm curious as to how one would approach such a task using Spark Streaming.
My current implementation uses 3 types of components: a custom receiver and two classes that implement Function, one for the neural nets, one for the end aggregator.
In broad strokes, my application is built as follows:
JavaReceiverInputDStream<...> rndLists = jssc.receiverStream(new JavaRandomReceiver(...));
Function<JavaRDD<...>, Void> aggregator = new JavaSyncBarrier(numberOfNets);

for (int i = 0; i < numberOfNets; i++) {
    rndLists.map(new NeuralNetMapper(neuralNetConfig)).foreachRDD(aggregator);
}
The main problem I'm having with this, though, is that it runs faster in local mode than when submitted to a 4-node cluster.
Is my implementation wrong to begin with, or is something else happening here?
There's also a full post here http://apache-spark-user-list.1001560.n3.nabble.com/Developing-a-spark-streaming-application-td12893.html with more details regarding the implementation of each of the three components mentioned previously.
It seems there might be a lot of repetitive instantiation and serialization of objects. The latter may be hurting your performance on a cluster.
You should try instantiating your neural networks only once; you will have to ensure that they are serializable. You should also use a single flatMap instead of multiple maps plus a union. Something along these lines:
// Initialize the neural nets first, once, on the driver
List<NeuralNetMapper> neuralNetMappers = new ArrayList<>(numberOfNets);
for (int i = 0; i < numberOfNets; i++) {
    neuralNetMappers.add(new NeuralNetMapper(neuralNetConfig));
}

// Then create a single DStream applying all of them
JavaDStream<Result> neuralNetResults = rndLists.flatMap(new FlatMapFunction<Item, Result>() {
    @Override
    public Iterable<Result> call(Item item) {
        List<Result> results = new ArrayList<>(numberOfNets);
        for (int i = 0; i < numberOfNets; i++) {
            results.add(neuralNetMappers.get(i).doYourNeuralNetStuff(item));
        }
        return results;
    }
});

// The aggregation stuff
neuralNetResults.foreachRDD(aggregator);
If you can afford to initialize the networks this way, you can save quite a lot of time. Also, the union stuff described in your linked post seems unnecessary and is penalizing your performance: a flatMap will do.
Finally, to further tune your performance on the cluster, you can use the Kryo serializer.
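Enabling Kryo is a small configuration change; for example (the app name and registered classes here are placeholders for your own):
SparkConf conf = new SparkConf()
        .setAppName("neural-net-streaming")
        // Switch from default Java serialization to Kryo
        .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer");
// Optionally register the classes you serialize most for extra speed
conf.registerKryoClasses(new Class<?>[]{ NeuralNetMapper.class, Result.class });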
I have a program where I want to draw data that is constantly being updated (it is microphone line data, incidentally). The data is an 8000-element array of doubles; I don't really care about 'losing' data which is overwritten between invocations of the paint method.
In my naive implementation it became obvious that there are synchronisation issues, where the audio data is updated while the painting routine is under way.
I'm also aware I'm slightly out of date on Java and its concurrency package, but my first response was to just put synchronised blocks around the shared-data code. Unsurprisingly this sometimes blocks the graphics thread, so I'm thinking there's probably a much better way of doing this.
Essentially I just don't have much experience with synchronisation and am screwing things up a bit somewhere. I wonder if someone with a better understanding of these matters might suggest a more elegant solution that doesn't block the graphics thread?
My naive code:
private final Object lock = new Object();
private double[] audio = new double[8000]; // array size is always exactly 8000

public void update(double[] audio) {
    synchronized (lock) {
        this.audio = audio; // and some brief processing
    }
    repaint();
}

public void paint(Graphics g) {
    synchronized (lock) {
        // draw the contents of this.audio
    }
}
Self-answer, unless someone more knowledgeable can offer something better: I just save a reference to the audio array at the beginning of the paint routine and draw from that; any update to the audio buffer then does its calculations in a separate array and assigns it to this.audio in a single step.
It appears to work. The paint routine still gets an occasional flicker, but it is nothing like before, when it was flashing very noticeably about 10% of the time due to synchronised blocking, and the audio data no longer updates half-way through the drawing routine. So... problem solved. Probably.
// volatile ensures a newly assigned array reference is immediately
// visible to the painting (EDT) thread; reference assignment is atomic
private volatile double[] audio = new double[8000]; // array size is always exactly 8000

public void update(double[] audio) {
    // do any brief processing in the new array *before* publishing it
    this.audio = audio; // the reference is re-assigned in one step
    repaint();
}

public void paint(Graphics g) {
    double[] audioNow = this.audio; // save the reference
    // draw the contents of audioNow (not this.audio)
}
I am just starting to use the SimpleTagger class in Mallet. My impression is that it expects binary features. The model that I want to implement has positive integer-valued features, and I wonder how to implement this in Mallet. Also, I have heard that non-binary features need to be normalized if the model is to make sense. I would appreciate any suggestions on how to do this.
PS: Yes, I know that there is a dedicated Mallet mailing list, but I have been waiting nearly a day for my subscription to be approved so I can post there. I'm simply in a hurry.
Well, it's 6 years later now. If you're not in a hurry anymore, you could check out the Java API to create your instances. A minimal example:
private static Instance createInstance(LabelAlphabet labelAlphabet) {
    // observations and labels should be equal in size for linear-chain CRFs
    TokenSequence observations = new TokenSequence();
    LabelSequence labels = new LabelSequence(labelAlphabet, 1); // capacity of 1 for this single-token example
    observations.add(createToken());
    labels.add("idk, some target or something");
    return new Instance(
            observations,
            labels,
            "myInstance",
            null
    );
}
private static Token createToken() {
    Token token = new Token("exampleToken");
    // Note: properties are not used for computing (I think)
    token.setProperty("SOME_PROPERTY", "hello");
    // Any old double value works as a feature value
    token.setFeatureValue("SOME_FEATURE", 666.0);
    // etc. for more features ...
    return token;
}
public static void main(String[] args) {
    // Note: the first arg is false to denote that we do *not* use binary features
    InstanceList instanceList = new InstanceList(new TokenSequence2FeatureVectorSequence(false, false));
    LabelAlphabet labelAlphabet = new LabelAlphabet();
    // Converts our tokens to feature vectors
    instanceList.addThruPipe(createInstance(labelAlphabet));
}
Or, if you want to keep using SimpleTagger, just define binary features like HAS_1_LETTER, HAS_2_LETTER, etc., though this seems tedious.
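To illustrate that last option: SimpleTagger's plain-text input expects one token per line, whitespace-separated features with the label as the last item, and a blank line between sequences. Binarizing an integer-valued feature such as token length might look like this (the feature and label names are made up):
The HAS_3_LETTERS CAPITALIZED DET
cat HAS_3_LETTERS NOUN
sat HAS_3_LETTERS VERB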