Stanford NER CharacterOffsetBegin

I am using Stanford CoreNLP for NER for a list of short documents.
java -cp "*" -Xmx2g edu.stanford.nlp.pipeline.StanfordCoreNLP
-annotators tokenize,ssplit,pos,lemma,ner -ssplit.eolonly -pos.model edu/stanford/nlp/models/pos-tagger/english-caseless-left3words-distsim.tagger
-ner.model edu/stanford/nlp/models/ner/english.all.3class.caseless.distsim.crf.ser.gz
-file .../input -outputDirectory .../stanford_ner
The problem is that the CharacterOffsetBegin and CharacterOffsetEnd values I get for each token continue counting from the previous documents. So, for example, the very first token of document_2 has a CharacterOffsetBegin of 240 rather than 0. Is there any option I can use on the command line? Any help would be greatly appreciated, thanks!

Yes, if you split your input into separate files. There's a -filelist option for batch jobs. In your case, each line of the file list is a path to one document file. For example, if you have all of your separate doc files in a directory .../input, then input.txt contains something like:
.../input/doc_1.txt
.../input/doc_2.txt
.../input/doc_3.txt
Though it might be a good idea to put the full paths there if possible. Then you'd run CoreNLP like this:
java -cp "*" -Xmx2g edu.stanford.nlp.pipeline.StanfordCoreNLP
-annotators tokenize,ssplit,pos,lemma,ner -ssplit.eolonly -pos.model edu/stanford/nlp/models/pos-tagger/english-caseless-left3words-distsim.tagger
-ner.model edu/stanford/nlp/models/ner/english.all.3class.caseless.distsim.crf.ser.gz
-filelist .../input.txt -outputDirectory .../stanford_ner
If you write a script to split your input up into multiple documents, it would probably be a good idea to generate input.txt at the same time.
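For instance, a minimal split-and-list sketch (assuming the documents in all_documents.txt are separated by blank lines; both file names are placeholders):
#!/usr/bin/env bash
# Split a combined file into one doc_N.txt per blank-line-separated block,
# then build input.txt with the full path of each piece.
mkdir -p input
awk -v RS='' '{ f = "input/doc_" NR ".txt"; print > f; close(f) }' all_documents.txt
ls "$PWD"/input/doc_*.txt > input.txt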
This will restart the character offsets (and token counter) for each document you process.

Related

How to reproduce the Stanford NLP tagging demo page?

I would like to reproduce the POS tagging shown here:
http://nlp.stanford.edu:8080/parser/index.jsp
They say they use the englishPCFG.ser.gz parser, but it is not specified which tagger they use, nor which other properties are set.
So which command should I run to replicate the tagging on the demo page? Currently I use:
java -Xmx500m -cp "*:/models/stanford-english-corenlp-2018-02-27-models.jar" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -annotators tokenize,ssplit,pos -port 9001 -timeout 15000
java -Xmx5g edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit,parse -file example.txt -outputFormat text
If you don't specify a part-of-speech model, the parser annotator will use the parsing algorithm to generate part-of-speech tags.
The solution is to add a server property (can be done in the properties file):
enforceRequirements = false
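For example, a minimal sketch (the file name server.properties is an assumption; the annotator list mirrors the parse command above):
# server.properties
annotators = tokenize,ssplit,parse
enforceRequirements = false
Then start the server with:
java -Xmx5g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 9001 -timeout 15000 -serverProperties server.properties
With enforceRequirements = false, the server no longer insists that a pos annotator run before parse, so the parser's own POS tags are used, as on the demo page.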

Tokenizer Training with StanfordNLP

So my requirement is simple to state. I need the StanfordCoreNLP default models along with my custom trained model, based on custom entities. In a final run, I need to be able to isolate specific phrases from a given sentence (RegexNER will be used).
Following are my efforts :-
EFFORT I :-
So I wanted to use the StanfordCoreNLP CRF files, tagger files and ner model files, along with my custom trained ner models.
I tried to find out if there is any official way of doing this, but didn't get anything. There is a property "ner.model" for the StanfordCoreNLP pipeline, but it will skip the default models if used.
EFFORT II :-
Next (might not be the smartest thing ever. Sorry! Just a guy trying to make ends meet!), I extracted the models jar stanford-corenlp-models-3.7.0.jar and copied all :-
*.ser.gz (Parser Models)
*.tagger (POS Tagger)
*.crf.ser.gz (NER CRF Files)
and tried to put comma-separated values in the properties "parser.model", "pos.model" and "ner.model" respectively, as follows :-
parser.model=models/ner/default/anaphoricity_model.ser.gz,models/ner/default/anaphoricity_model_conll.ser.gz,models/ner/default/classification_model.ser.gz,models/ner/default/classification_model_conll.ser.gz,models/ner/default/clauseSearcherModel.ser.gz,models/ner/default/clustering_model.ser.gz,models/ner/default/clustering_model_conll.ser.gz,models/ner/default/english-embeddings.ser.gz,models/ner/default/english-model-conll.ser.gz,models/ner/default/english-model-default.ser.gz,models/ner/default/englishFactored.ser.gz,models/ner/default/englishPCFG.caseless.ser.gz,models/ner/default/englishPCFG.ser.gz,models/ner/default/englishRNN.ser.gz,models/ner/default/englishSR.beam.ser.gz,models/ner/default/englishSR.ser.gz,models/ner/default/gender.map.ser.gz,models/ner/default/md-model-dep.ser.gz,models/ner/default/ranking_model.ser.gz,models/ner/default/ranking_model_conll.ser.gz,models/ner/default/sentiment.binary.ser.gz,models/ner/default/sentiment.ser.gz,models/ner/default/truecasing.fast.caseless.qn.ser.gz,models/ner/default/truecasing.fast.qn.ser.gz,models/ner/default/word_counts.ser.gz,models/ner/default/wsjFactored.ser.gz,models/ner/default/wsjPCFG.ser.gz,models/ner/default/wsjRNN.ser.gz
ner.model=models/ner/default/english.all.3class.caseless.distsim.crf.ser.gz,models/ner/default/english.all.3class.distsim.crf.ser.gz,models/ner/default/english.all.3class.nodistsim.crf.ser.gz,models/ner/default/english.conll.4class.caseless.distsim.crf.ser.gz,models/ner/default/english.conll.4class.distsim.crf.ser.gz,models/ner/default/english.conll.4class.nodistsim.crf.ser.gz,models/ner/default/english.muc.7class.caseless.distsim.crf.ser.gz,models/ner/default/english.muc.7class.distsim.crf.ser.gz,models/ner/default/english.muc.7class.nodistsim.crf.ser.gz,models/ner/default/english.nowiki.3class.caseless.distsim.crf.ser.gz,models/ner/default/english.nowiki.3class.nodistsim.crf.ser.gz
pos.model=models/tagger/default/english-left3words-distsim.tagger
But, I get the following exception :-
Caused by: edu.stanford.nlp.io.RuntimeIOException: Error while loading a tagger model (probably missing model file)
Caused by: java.io.StreamCorruptedException: invalid stream header: EFBFBDEF
EFFORT III :-
I thought I would be able to handle this with RegexNER, and I was successful to some extent. The problem is that the entities it learns through RegexNER are not applied to subsequent expressions. E.g., it will find the entity "CUSTOM_ENTITY" inside a text, but if I put a RegexNER rule like ( [ {ner:CUSTOM_ENTITY} ] /with/ [ {ner:CUSTOM_ENTITY} ] ), it never succeeds in finding the right phrase.
Really need help here!!! I don't want to train the complete model again; the Stanford folks have over a GB of model information which is useful to me. I just want to add custom entities too.
First of all, make sure your CLASSPATH has the proper jars in it.
Here is how you should include your custom trained NER model:
java -Xmx8g edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit,pos,lemma,ner -ner.model <csv-of-model-paths> -file example.txt
-ner.model should be set to a comma separated list of all models you want to use.
Here is an example of what you could put:
edu/stanford/nlp/models/ner/english.all.3class.distsim.crf.ser.gz,edu/stanford/nlp/models/ner/english.muc.7class.distsim.crf.ser.gz,edu/stanford/nlp/models/ner/english.conll.4class.distsim.crf.ser.gz,/path/to/custom_model.ser.gz
Note in my example that all of the standard models will be run, and then finally your custom model will be run. Make sure your custom model is in the CLASSPATH.
You also probably need to add this to your command: -ner.combinationMode HIGH_RECALL. By default, the NER combination will only use the tags for a particular class from the first model. So if you have model1,model2,model3, only model1's LOCATION tags will be used. If you set things to HIGH_RECALL, then model2's and model3's LOCATION tags will be used as well.
Another thing to keep in mind: model2 can't overwrite the decisions of model1. It can only overwrite "O". So if model1 says that a particular token is a LOCATION, model2 can't say it's an ORGANIZATION or a PERSON or anything else. So the order of the models in your list matters.
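Putting those flags together, the full command might look like this (a sketch; /path/to/custom_model.ser.gz is a placeholder):
java -Xmx8g edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit,pos,lemma,ner -ner.model edu/stanford/nlp/models/ner/english.all.3class.distsim.crf.ser.gz,/path/to/custom_model.ser.gz -ner.combinationMode HIGH_RECALL -file example.txt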
If you want to write rules that use entities found by previous rules, you should look at my answer to this question:
TokensRegex rules to get correct output for Named Entities
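For orientation, a TokensRegex rules file that matches on NER tags assigned by earlier stages looks roughly like this (a sketch; CUSTOM_ENTITY is the question's placeholder and CUSTOM_PHRASE is an assumed output tag):
ner = { type: "CLASS", value: "edu.stanford.nlp.ling.CoreAnnotations$NamedEntityTagAnnotation" }
{ ruleType: "tokens",
  pattern: ( [ { ner:"CUSTOM_ENTITY" } ] /with/ [ { ner:"CUSTOM_ENTITY" } ] ),
  action: ( Annotate($0, ner, "CUSTOM_PHRASE") ) }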
From your given context, use this instead of comma-separated values, and try to have all the jars within the same directory:
parser.model=models/ner/default/anaphoricity_model.ser.gz
parser.model=models/ner/default/anaphoricity_model_conll.ser.gz
parser.model=models/ner/default/classification_model.ser.gz
parser.model=models/ner/default/classification_model_conll.ser.gz
parser.model=models/ner/default/clauseSearcherModel.ser.gz
parser.model=models/ner/default/clustering_model.ser.gz
parser.model=models/ner/default/clustering_model_conll.ser.gz
parser.model=models/ner/default/english-embeddings.ser.gz
parser.model=models/ner/default/english-model-conll.ser.gz
parser.model=models/ner/default/english-model-default.ser.gz
parser.model=models/ner/default/englishFactored.ser.gz
parser.model=models/ner/default/englishPCFG.caseless.ser.gz
parser.model=models/ner/default/englishPCFG.ser.gz
parser.model=models/ner/default/englishRNN.ser.gz
parser.model=models/ner/default/englishSR.beam.ser.gz
parser.model=models/ner/default/englishSR.ser.gz
parser.model=models/ner/default/gender.map.ser.gz
parser.model=models/ner/default/md-model-dep.ser.gz
parser.model=models/ner/default/ranking_model.ser.gz
parser.model=models/ner/default/ranking_model_conll.ser.gz
parser.model=models/ner/default/sentiment.binary.ser.gz
parser.model=models/ner/default/sentiment.ser.gz
parser.model=models/ner/default/truecasing.fast.caseless.qn.ser.gz
parser.model=models/ner/default/truecasing.fast.qn.ser.gz
parser.model=models/ner/default/word_counts.ser.gz
parser.model=models/ner/default/wsjFactored.ser.gz
parser.model=models/ner/default/wsjPCFG.ser.gz
parser.model=models/ner/default/wsjRNN.ser.gz
Now copy the above lines, make similar entries for the other models too, and paste them into a server.properties file. If you don't have a server.properties file, create it. Then use the following command to start your server:
java -Xmx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 9000 -timeout 15000 -serverProperties server.properties

How to use cTAKES from the command line?

I wonder how to use Apache cTAKES from the command line.
E.g. :
I have a file note.txt that contains some text like "Patient had elevated blood sugar but tests confirm no diabetes. Patient's father had adult onset diabetes."
I want to use the provided analysis engine
\apache-ctakes-3.2.2-bin\apache-ctakes-3.2.2\desc\ctakes-clinical-pipeline\desc\analysis_engine\AggregatePlaintextUMLSProcessor.xml
How can I get the analysis engine's output (viz. the annotations) using the command line (i.e. without using graphical user interfaces such as the UIMA CAS Visual Debugger or the Collection Processing Engine)? I'd prefer to use the provided JAR files rather than having to compile the code.
The question is fairly simple, but I couldn't find the information in cTAKES's README or on Confluence.
Please try the following steps to use cTAKES CPE from the command line (the key class is "org.apache.uima.examples.cpe.SimpleRunCPE"):
Change directory to $CTAKES_HOME/desc/ctakes-clinical-pipeline/desc/collection_processing_engine/
Copy test_plaintext.xml to another file (e.g., "test_plaintext_test.xml").
Edit "test_plaintext_test.xml" to set input directory; find "nameValuePair" with name = "InputDirectory", and set the value string to the input directory. The following example set the input directory as "$CTAKES_HOME/note_input":
<nameValuePair>
<name>InputDirectory</name>
<value>
<string>note_input</string>
</value>
</nameValuePair>
Similarly, edit "test_plaintext_test.xml" to set the output directory ("$CTAKES_HOME/result_output" in the following example):
<nameValuePair>
<name>OutputDirectory</name>
<value>
<string>result_output</string>
</value>
</nameValuePair>
Save "test_plaintext_test.xml" and change directory to $CTAKES_HOME/bin.
Copy runctakesCPE.sh to another file (e.g., "runctakesCPE_CLI.sh").
Edit "runctakesCPE_CLI.sh"; replace the last line ("java ...") to the following line ("USER" and "PW" should be replaced by your UMLS Username and Password, and the memory setting Xms and Xms may be adjusted based on the size of memory on your machine):
java -Dctakes.umlsuser=USER -Dctakes.umlspw=PW -cp $CTAKES_HOME/lib/*:$CTAKES_HOME/desc/:$CTAKES_HOME/resources/ -Dlog4j.configuration=file:$CTAKES_HOME/config/log4j.xml -Xms2g -Xmx3g org.apache.uima.examples.cpe.SimpleRunCPE $CTAKES_HOME/desc/ctakes-clinical-pipeline/desc/collection_processing_engine/test_plaintext_test.xml
Save "runctakesCPE_CLI.sh", and then create the input directory ("$CTAKES_HOME/note_input") and the output directory ("$CTAKES_HOME/result_output").
Put your note.txt to the input directory (e.g., "$CTAKES_HOME/note_input/note.txt"), and then run "runctakesCPE_CLI.sh".
cTAKES CPE will start running under command line mode, and the resulting file will be generated in the output directory (e.g., "$CTAKES_HOME/result_output/note.txt.xml").
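In shell terms, the setup and run steps above amount to roughly this (a sketch; paths follow the examples above):
cd $CTAKES_HOME
mkdir -p note_input result_output
cp /path/to/note.txt note_input/
chmod +x bin/runctakesCPE_CLI.sh   # make the copied script executable
bin/runctakesCPE_CLI.sh            # results land in result_output/note.txt.xml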
I actually used your note.txt to run the steps above and here are the first several lines of the generated note.txt.xml:
<?xml version="1.0" encoding="UTF-8"?><CAS version="2">
<uima.cas.Sofa _indexed="0" _id="3" sofaNum="1" sofaID="_InitialView" mimeType="text" sofaString="Patient had elevated blood sugar but tests confirm no diabetes. Patient's father had adult onset diabetes.
"/>
<org.apache.ctakes.typesystem.type.structured.DocumentID _indexed="1" _id="1" documentID="note.txt"/>
<uima.tcas.DocumentAnnotation _indexed="1" _id="10" _ref_sofa="3" begin="0" end="107" language="x-unspecified"/>
<org.apache.ctakes.typesystem.type.textspan.Segment _indexed="1" _id="15" _ref_sofa="3" begin="0" end="107" id="SIMPLE_SEGMENT"/>
<org.apache.ctakes.typesystem.type.textspan.Sentence _indexed="1" _id="21" _ref_sofa="3" begin="0" end="63" sentenceNumber="0"/>
Hope this helps :-)
java -Dctakes.umlsuser=USER -Dctakes.umlspw=PW -cp $CTAKES_HOME/lib/*;$CTAKES_HOME/desc/;$CTAKES_HOME/resources/ -Dlog4j.configuration=file:$CTAKES_HOME/config/log4j.xml -Xms2g -Xmx3g to_replace $CTAKES_HOME/desc/ctakes-clinical-pipeline/desc/collection_processing_engine/test_plaintext_test.xml
replace "to_replace" with either
org.apache.ctakes.ytex.tools.RunCPE or
org.apache.ctakes.core.cpe.CmdLineCpeRunner

How to call a large list of paired files to be executed by a program in BASH?

I have a large directory of files (100+) that I'd like to pass through a program via the terminal.
The files are paired and all follow a naming scheme like this:
TS-8_S53_L001_R1_001.fastq
TS-8_S53_L001_R2_001.fastq
RS-9_S54_L001_R1_001.fastq
RS-9_S54_L001_R2_001.fastq
And the program execution looks like:
Seqprogram -i1 Blah_R1_001.fastq -i2 Blah_R2_001.fastq -o Blah_paired.fastq
All of these files are in one directory.
I'd like to be able to run the program on all of the files, with the files paired together in the proper sequence (R1 files are passed through -i1; the R1 and R2 files of a pair share the same base name), and with the output file (-o) saved under the base name with some identifier attached ("_paired", etc.).
I can envision how I'd do this in Python; however, I am trying to get better with Bash.
I'm familiar with how one might pass multiple files to a single command, e.g., uncompressing all .gz files in a particular directory:
gunzip *.gz
But this program takes two inputs, and the inputs must be ordered, so the wildcard scheme isn't sufficient.
Thanks
Use a wildcard to get one file of the pair, and then use parameter substitution to get the other corresponding filenames.
for i1 in *_R1_001.fastq; do
    i2=${i1/R1_001/R2_001}
    paired=${i1/R1_001/paired}
    Seqprogram -i1 "$i1" -i2 "$i2" -o "$paired"
done
The easiest way to do this is to match just one of the three patterned filenames, and to modify it to get the other two. That is to say:
for r1file in *_R1_*.fastq; do
    r2file=${r1file/_R1_/_R2_}
    pairfile=${r1file%_R1_*}_paired.fastq
    Seqprogram -i1 "$r1file" -i2 "$r2file" -o "$pairfile"
done
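One hedged refinement (assuming bash): if no R1 files match, the unexpanded pattern itself would go through the loop once, so you can enable nullglob first:
shopt -s nullglob   # unmatched globs expand to nothing, so the loop runs zero times
for r1file in *_R1_*.fastq; do
    r2file=${r1file/_R1_/_R2_}
    pairfile=${r1file%_R1_*}_paired.fastq
    Seqprogram -i1 "$r1file" -i2 "$r2file" -o "$pairfile"
done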

Passing data into perl script from command line

I have a perl script that creates a report based on an xml definition. Currently these definitions all exist as .xml files.
So I have the script run-report.pl, which can take a path to a definition file and create the report.
Now I want to create run-reports-from-db.pl, which will generate the report definition based on some database entries. I don't want to create temp files to pass to run-report.pl; I would just like to pass in the definition somehow.
So instead of saying:
run-report.pl -def=./path/to/def.xml
I want to be able to say:
run-report.pl --stream
And have the report definition available in <STDIN>
I am sure there is a pretty trivial way to do this???
If I understand your question correctly, all you need is one | (pipe).
./generate-xml-from-db.pl | ./run-report.pl --stream
Anything the first process in the pipeline prints to stdout will appear in the second process's stdin.
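On the receiving side, run-report.pl just needs to slurp the whole definition from STDIN when --stream is given. A minimal sketch (the flag test and variable names are assumptions, not the asker's actual code):
#!/usr/bin/perl
use strict;
use warnings;

my $definition;
if ( grep { $_ eq '--stream' } @ARGV ) {
    local $/;              # undefine the record separator: read all of STDIN at once
    $definition = <STDIN>;
}
# ... otherwise fall back to the existing -def=./path/to/def.xml handling ...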
As long as you read from STDIN, you have it available. Notice what happens when you take the code below, name it something like echo.pl, run it at the command line, and paste in reams of text.
#!/usr/bin/perl -w
use 5.010;      # enables say()
use strict;
use warnings;

# Echo every line read from STDIN (or from files named on the command line).
while ( <> ) {
    say;        # print the current line, $_, followed by a newline
}
<> is the Perl shorthand for "read from STDIN" (strictly, it reads from any files named on the command line, falling back to STDIN when there are none).
As long as the method you're using to launch the process has a way to get hold of its standard input and output, you can just write to that handle. You have to use the means that are available to you. In Java, for example, you'd have to get the input stream of the process; in a batch script you have to pipe it. At a GUI terminal you can cut and paste.
