How can I use the MIST library to de-identify a text? - nlp

I wonder how I can use the MIST library to de-identify a text, e.g., transforming
Patient ID: P89474
Mary Phillips is a 45-year-old woman with a history of diabetes.
She arrived at New Hope Medical Center on August 5 complaining
of abdominal pain. Dr. Gertrude Philippoussis diagnosed her
with appendicitis and admitted her at 10 PM.
to
Patient ID: [ID]
[NAME] is a [AGE]-year-old woman with a history of diabetes.
She arrived at [HOSPITAL] on [DATE] complaining
of abdominal pain. Dr. [PHYSICIAN] diagnosed her
with appendicitis and admitted her at 10 PM.
I've wandered through the documentation but no luck so far.

This answer was tested on Windows 7 SP1 x64 Ultimate with Anaconda Python 2.7.11 x64, and MIST 2.0.4. MIST 2.0.4 does not work with Python 3.x (according to the manual, I haven't tested it myself).
MIST (MITRE Identification Scrubber Toolkit) [1] is a customization of MAT (MITRE Annotation Toolkit), which is a tool to tag documents either automatically or manually (for manual annotation it provides a GUI served by a web server). The automatic tagger is based on Carafe (ConditionAl RAndom Fields) [2], which is an OCaml implementation of conditional random fields (CRF).
MIST does not come with any trained model, and it ships with only ~10 short, non-medical documents annotated with typical NER classes (such as organization and person).
De-id (de-identification) is the process of tagging PHI (Protected Health Information) in a document and replacing it with fake data. Let's ignore PHI replacement for now and focus on tagging. In order to tag a document (e.g., a patient note), MAT follows a typical machine learning scheme: the CRF needs to be trained on a labeled dataset (= a set of labeled documents), then we use it to tag unlabeled documents.
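To make the tag-then-replace idea concrete, here is a minimal Python sketch that is independent of MIST. It assumes a tagger has already produced character-offset spans with a label; the names `redact` and `span_of` and the (start, end, label) span format are assumptions made for illustration, not MIST's API or its output format.

```python
def span_of(text, needle, label):
    """Return a (start, end, label) span for the first occurrence of needle.

    Hypothetical helper that stands in for the output of a trained tagger.
    """
    start = text.index(needle)
    return (start, start + len(needle), label)


def redact(text, spans):
    """Replace every (start, end, label) span in text with "[LABEL]".

    Spans are assumed not to overlap.
    """
    # Work from the end of the text so that earlier offsets stay valid.
    for start, end, label in sorted(spans, reverse=True):
        text = text[:start] + "[" + label + "]" + text[end:]
    return text


note = "Patient ID: P89474\nMary Phillips is a 45-year-old woman."
spans = [
    span_of(note, "P89474", "ID"),
    span_of(note, "Mary Phillips", "NAME"),
    span_of(note, "45", "AGE"),
]
print(redact(note, spans))
```

Replacing from the last span to the first avoids having to recompute offsets after each substitution.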
The main technical concept in MAT is the task. A task is a set of activities, called workflows, which can be broken down into steps. Named-entity recognition (NER) is one task. De-id is another task (mostly NER geared toward medical texts): in other words, MIST is just one task of MAT (actually three: core, HIPAA, and AMIA; core is a parent task, while HIPAA and AMIA are two different tagsets). Steps are, for example, tokenization, tagging, or cleaning. Workflows are just lists of steps that one may follow.
With this in mind, here is the code for Microsoft Windows:
rem #######
rem Instructions for Windows 7 SP1 x64 Ultimate
rem Installing MIST: set MAT_PKG_HOME depending on where you downloaded it
SET MAT_PKG_HOME=C:\Users\Francky\Downloads\MIST_2_0_4\MIST_2_0_4\src\MAT
SET TMP=C:\Users\Francky\Downloads\MIST_2_0_4\MIST_2_0_4\temp
rem The directory that will hold the model file must already exist
mkdir %TMP%
cd C:\Users\Francky\Downloads\MIST_2_0_4\MIST_2_0_4
python install.py
rem MAT is now installed. We'll show how to use it for NER.
rem We will be taking snippets from some of the 8 tutorials.
rem Much of the tutorial content is about the annotation GUI,
rem which we don't care about here.
rem Tuto 1: install task
cd %MAT_PKG_HOME%
bin\MATManagePluginDirs.cmd install %CD%\sample\ne
rem Tuto 2: build model (i.e., train it on labeled dataset)
bin\MATModelBuilder.cmd --task "Named Entity" --model_file %TMP%\ne_model ^
--input_files "%CD%\sample\ne\resources\data\json\*.json"
rem Tuto 2: Add trained model as the default model
bin\MATModelBuilder.cmd --task "Named Entity" --save_as_default_model ^
--input_files "%CD%\sample\ne\resources\data\json\*.json"
rem Tuto 5: use CLI -> prepare the document
bin\MATEngine.cmd --task "Named Entity" --workflow Demo --steps "zone,tokenize" ^
--input_file %CD%\sample\ne\resources\data\raw\voa2.txt --input_file_type raw ^
--output_file %CD%\voa2_txt.json --output_file_type mat-json
rem Tuto 5: use CLI -> tag the document
bin\MATEngine.cmd --task "Named Entity" --workflow Demo --steps "tag" ^
--input_file %CD%\voa2_txt.json --input_file_type mat-json ^
--output_file %CD%\voa2_txt.json --output_file_type mat-json ^
--tagger_local
NER is now done.
Here are the same instructions for Ubuntu 14.04.4 LTS x64:
#######
# Instructions for Ubuntu 14.04.4 LTS x64
# Installing MIST: set MAT_PKG_HOME depending on where you downloaded it
export MAT_PKG_HOME=/home/ubuntu/mist/MIST_2_0_4/MIST_2_0_4/src/MAT
export TMP=/home/ubuntu/mist/MIST_2_0_4/MIST_2_0_4/temp
mkdir $TMP
cd /home/ubuntu/mist/MIST_2_0_4/MIST_2_0_4/
python install.py
# MAT is now installed. We'll show how to use it for NER.
# We will be taking snippets from some of the 8 tutorials.
# Much of the tutorial content is about the annotation GUI,
# which we don't care about here.
# Tuto 1: install task
cd $MAT_PKG_HOME
bin/MATManagePluginDirs install $PWD/sample/ne
# Tuto 2: build model (i.e., train it on labeled dataset)
bin/MATModelBuilder --task "Named Entity" --model_file $TMP/ne_model \
--input_files "$PWD/sample/ne/resources/data/json/*.json"
# Tuto 2: Add trained model as the default model
bin/MATModelBuilder --task "Named Entity" --save_as_default_model \
--input_files "$PWD/sample/ne/resources/data/json/*.json"
# Tuto 5: use CLI -> prepare the document
bin/MATEngine --task "Named Entity" --workflow Demo --steps "zone,tokenize" \
--input_file $PWD/sample/ne/resources/data/raw/voa2.txt --input_file_type raw \
--output_file $PWD/voa2_txt.json --output_file_type mat-json
# Tuto 5: use CLI -> tag the document
bin/MATEngine --task "Named Entity" --workflow Demo --steps "tag" \
--input_file $PWD/voa2_txt.json --input_file_type mat-json \
--output_file $PWD/voa2_txt.json --output_file_type mat-json \
--tagger_local
To run de-id, there is no need to install the de-id tasks, as they are pre-installed. There are 2 de-id tasks (\MIST_2_0_4\src\tasks\HIPAA\task.xml and \MIST_2_0_4\src\tasks\AMIA\task.xml). They come with neither a trained model nor a labeled dataset, so you will need to obtain some labeled data first (see "Physician notes with annotated PHI").
For Microsoft Windows (tested with Windows 7 SP1 x64 Ultimate):
To train the model (you can replace HIPAA Deidentification with AMIA Deidentification depending on the tagset you wish to use):
bin\MATModelBuilder.cmd --task "HIPAA Deidentification" ^
--save_as_default_model --nthreads=3 --max_iterations=15 ^
--lexicon_dir="%CD%\sample\mist\gazetteers" ^
--input_files "%CD%\sample\mist\i2b2-60-00-40\train\*.json"
To run the trained model on one file:
bin\MATEngine --task "HIPAA Deidentification" --workflow Demo ^
--input_file .\note.txt --input_file_type raw ^
--output_file .\note.json --output_file_type mat-json ^
--tagger_local ^
--steps "clean,zone,tag"
To run the trained model on one directory:
bin\MATEngine --task "HIPAA Deidentification" --workflow Demo ^
--input_dir "%CD%\sample\test" --input_file_type raw ^
--output_dir "%CD%\sample\test" --output_file_type mat-json ^
--tagger_local ^
--steps "clean,zone,tag"
As usual, one can specify the input file format to be JSON:
bin\MATEngine --task "HIPAA Deidentification" --workflow Demo ^
--input_dir "%CD%\sample\mist\i2b2-60-00-40\test" --input_file_type mat-json ^
--output_dir "%CD%\sample\mist\i2b2-60-00-40\test_out" --output_file_type mat-json ^
--tagger_local --steps "tag"
For Ubuntu 14.04.4 LTS x64:
To train the model (you can replace HIPAA Deidentification with AMIA Deidentification depending on the tagset you wish to use):
bin/MATModelBuilder --task "HIPAA Deidentification" \
--save_as_default_model --nthreads=20 --max_iterations=15 \
--lexicon_dir="$PWD/sample/mist/gazetteers" \
--input_files "$PWD/sample/mist/i2b2-60-00-40/train/*.json"
To run the trained model on one file:
bin/MATEngine --task "HIPAA Deidentification" --workflow Demo \
--input_file ./note.txt --input_file_type raw \
--output_file ./note.json --output_file_type mat-json \
--tagger_local \
--steps "clean,zone,tag"
To run the trained model on one directory:
bin/MATEngine --task "HIPAA Deidentification" --workflow Demo \
--input_dir "$PWD/sample/test" --input_file_type raw \
--output_dir "$PWD/sample/test" --output_file_type mat-json \
--tagger_local \
--steps "clean,zone,tag"
As usual, one can specify the input file format to be JSON:
bin/MATEngine --task "HIPAA Deidentification" --workflow Demo \
--input_dir "$PWD/sample/mist/i2b2-60-00-40/test" --input_file_type mat-json \
--output_dir "$PWD/sample/mist/i2b2-60-00-40/test_out" --output_file_type mat-json \
--tagger_local --steps "tag"
Typical error messages:
raise PluginError, "Carafe not configured properly for this task and workflow: " + str(e) (when trying to tag a document): it often means that no model was specified. You need to either define a default model or use --tagger_model /path/to/model/.
Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit exceeded (when training a model): it's easy to go over the heap size limit (the default is 2GB). You can increase the heap size with the --heap_size parameter. Example (Linux):
bin/MATModelBuilder --task "HIPAA Deidentification" \
--save_as_default_model --nthreads=20 --max_iterations=15 \
--lexicon_dir="$PWD/sample/mist/gazetteers" \
--heap_size=60G \
--input_files "$PWD/sample/mist/mimic-140-20-40/train/*.json"
[1] John Aberdeen, Samuel Bayer, Reyyan Yeniterzi, Ben Wellner, Cheryl Clark, David Hanauer, Bradley Malin, Lynette Hirschman, The MITRE identification scrubber toolkit: design, training, and assessment, Int. J. Med. Informatics 79 (12) (2010) 849–859, http://dx.doi.org/10.1016/j.ijmedinf.2010.09.007.
[2] B. Wellner, Sequence Models and Ranking Methods for Discourse Parsing [Ph.D. Dissertation]. Brandeis University, Waltham, MA, 2009. http://www.cs.brandeis.edu/~wellner/pubs/wellner_dissertation.pdf
Documentation for MATModelBuilder.cmd:
Usage: MATModelBuilder.cmd [task option] [config name option] [other options]
Options:
-h, --help show this help message and exit
Task option:
--task=task name of the task to use. Must be the first argument,
if present. Obligatory if the system knows of more
than one task. Known tasks are: AMIA Deidentification,
Named Entity, HIPAA Deidentification, Enhanced Named
Entity
Config name option:
--config_name=name name of the model build config to use. Must be the
first argument after --task, if present. Optional.
Default model build config will be used if no config
is specified.
Control options:
--version Print version number and exit
--debug Enable debug output.
--subprocess_debug=int
Set the subprocess debug level to the value provided,
overriding the global setting. 0 disables, 1 shows
some subprocess activity, 2 shows all subprocess
activity.
--subprocess_statistics
Enable subprocess statistics (memory/time), if the
capability is available and it isn't globally enabled.
--tmpdir_root=dir Override the default system location for temporary
files. If the directory doesn't exist, it will be
created. Use this feature to control where temporary
files are created, for added security, or in
conjunction with --preserve_tempfiles, as a debugging
aid.
--preserve_tempfiles
Preserve the temporary files created, as a debugging
aid.
--verbose_config If specified, print to stderr the source of each MAT
configuration variable the first time it's accessed.
Options for model class creation:
--partial_training_on_gold_only
When the trainer is presented with partially tagged
documents, by default MAT will ask it to train on all
annotated segments, completed or not. If this flag is
specified, only completed segments should be used for
training.
--feature_spec=FEATURE_SPEC
path to the Carafe feature spec file to use. Optional
if feature_spec is set in the <build_settings> for the
relevant model config in the task.xml file for the
task.
--training_method=TRAINING_METHOD
If present, specify a training method other than the
standard method. Currently, the only recognized value
is psa. The psa method is noticeably faster, but may
result in somewhat poorer results. You can use a value
of '' to override a previously specified training
method (e.g., a default method in your task).
--max_iterations=MAX_ITERATIONS
number of iterations for the optimized PSA training
mechanism to use. A value between 6 and 10 is
appropriate. Overrides any possible default in
<build_settings> for the relevant model config in the
task.xml file for the task.
--lexicon_dir=LEXICON_DIR
If present, the name of a directory which contains a
Carafe training lexicon. This pathname should be an
absolute pathname, and should have a trailing slash.
The content of the directory should be a set of files,
each of which contains a sequence of tokens, one per
line. The name of the file will be used as a training
feature for the token. Overrides any possible default
in <build_settings> for the relevant model config in
the task.xml file for the task.
--parallel If present, parallelizes the feature expectation
computation, which reduces the clock time of model
building when multiple CPUs are available
--nthreads=NTHREADS
If --parallel is used, controls the number of threads
used for training.
--gaussian_prior=GAUSSIAN_PRIOR
A positive float, default is 10.0. See the jCarafe
docs for details.
--no_begin Don't introduce begin states during training. Useful
if you're certain that you won't have any adjacent
spans with the same label. See the jCarafe
documentation for more details.
--l1 Use L1 regularization for PSA training. See the
jCarafe docs for details.
--l1_c=L1_C Change the penalty factor for the L1 regularizer. See
the jCarafe docs for details.
--heap_size=HEAP_SIZE
If present, specifies the -Xmx argument for the Java
JVM
--stack_size=STACK_SIZE
If present, specifies the -Xss argument for the Java
JVM
--tags=TAGS if present, a comma-separated list of tags to pass to
the training engine instead of the full tag set for
the task (used to create per-tag pre-tagging models
for multi-stage training and tagging)
--pre_models=PRE_MODELS
if present, a comma-separated list of glob-style
patterns specifying the models to include as pre-
taggers.
--add_tokens_internally
If present, Carafe will use its internal tokenizer to
tokenize the document before training. If your
workflow doesn't tokenize the document, you must
provide this flag, or Carafe will have no tokens to
base its training on. We recommend strongly that you
tokenize your documents separately; you should not use
this flag.
--word_properties=WORD_PROPERTIES
See the jCarafe docs for --word-properties.
--word_scores=WORD_SCORES
See the jCarafe docs for --word-scores.
--learning_rate=LEARNING_RATE
See the jCarafe docs for --learning-rate.
--disk_cache=DISK_CACHE
See the jCarafe docs for --disk_cache.
Input options:
--input_dir=dir A directory, all of whose files will be used in the
model construction. Can be repeated. May be specified
with --input_files.
--input_files=re A glob-style pattern describing full pathnames to use
in the model construction. May be specified with
--input_dir. Can be repeated.
--file_type=fake-xml-inline | mat-json | xml-inline
The file type of the input. One of fake-xml-inline,
mat-json, xml-inline. Default is mat-json.
--encoding=encoding
The encoding of the input. The default is the
appropriate default for the file type.
Output options:
--model_file=file Location to save the created model. The directory must
already exist. Obligatory if --save_as_default_model
isn't specified.
--save_as_default_model
If the the task.xml file for the task specifies the
<default_model> element, save the model in the
specified location, possibly overriding any existing
model.
Documentation for MATEngine:
Usage: MATEngine [core options] [input/output/task options] [other options]
Options:
-h, --help show this help message and exit
Core options:
--other_app_dir=dir
additional directory to load a task from. Optional and
repeatable.
--settings_file=file
a file of settings to use which overwrites existing
settings. The file should be a Python config file in
the style of the template in
etc/MAT_settings.config.in. Optional.
--task=task name of the task to use. Obligatory if the system
knows of more than one task. Known tasks are: AMIA
Deidentification, Named Entity, HIPAA
Deidentification, Enhanced Named Entity
--version Print version number and exit
--debug Enable debug output.
--subprocess_debug=int
Set the subprocess debug level to the value provided,
overriding the global setting. 0 disables, 1 shows
some subprocess activity, 2 shows all subprocess
activity.
--subprocess_statistics
Enable subprocess statistics (memory/time), if the
capability is available and it isn't globally enabled.
--tmpdir_root=dir Override the default system location for temporary
files. If the directory doesn't exist, it will be
created. Use this feature to control where temporary
files are created, for added security, or in
conjunction with --preserve_tempfiles, as a debugging
aid.
--preserve_tempfiles
Preserve the temporary files created, as a debugging
aid.
--verbose_config If specified, print to stderr the source of each MAT
configuration variable the first time it's accessed.

Related

PytestUnknownMarkWarning: Unknown pytest.mark.xxx - is this a typo?

I have a file called test.py with the following code:
import pytest

@pytest.mark.webtest
def test_http_request():
    pass

class TestClass:
    def test_method(self):
        pass
pytest -s test.py passed but gave the following warnings:
pytest -s test.py
=============================== test session starts ============================
platform linux -- Python 3.7.3, pytest-5.2.4, py-1.8.0, pluggy-0.13.1
rootdir: /home/user
collected 2 items
test.py ..
=============================== warnings summary ===============================
anaconda3/lib/python3.7/site-packages/_pytest/mark/structures.py:325
~/anaconda3/lib/python3.7/site-packages/_pytest/mark/structures.py:325:
PytestUnknownMarkWarning: Unknown pytest.mark.webtest - is this a typo? You can register
custom marks to avoid this warning - for details, see https://docs.pytest.org/en/latest/mark.html
PytestUnknownMarkWarning,
-- Docs: https://docs.pytest.org/en/latest/warnings.html
=============================== 2 passed, 1 warnings in 0.03s ===================
Environment: Python 3.7.3, pytest 5.2.4, anaconda3
What is the best way to get rid of the warning message?
To properly handle this you need to register the custom marker. Create a pytest.ini file and place the following inside of it.
[pytest]
markers =
    webtest: mark a test as a webtest.
Next time you run the tests, the warning about the unregistered marker will not be there.
Without updating pytest.ini, we can suppress the warning with --disable-warnings.
We can also use --disable-pytest-warnings.
Example using your case:
pytest -s test.py -m webtest --disable-warnings
@gold_cy's answer works. If you have too many custom markers to register in pytest.ini, an alternative is to use the following configuration in pytest.ini:
[pytest]
filterwarnings =
    ignore::UserWarning
or, more generally, use the following:
[pytest]
filterwarnings =
    error
    ignore::UserWarning
The configuration above will ignore all user warnings but will transform all other warnings into errors. See more at Warnings Capture.
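The interplay of `error` and `ignore::UserWarning` can be reproduced with Python's standard `warnings` module, where the most recently added filter is checked first. The sketch below is only an illustration outside of pytest: `UnknownMarkWarning` is a hypothetical stand-in for a `UserWarning` subclass such as `PytestUnknownMarkWarning`, and `classify` is a helper invented for this example.

```python
import warnings


class UnknownMarkWarning(UserWarning):
    """Hypothetical stand-in for a UserWarning subclass."""


def classify(category):
    """Return "ignored" or "error" for a warning of the given category."""
    with warnings.catch_warnings():
        # Equivalent of the "error" line: every warning becomes an exception.
        warnings.simplefilter("error")
        # Equivalent of "ignore::UserWarning": added later, so checked first.
        warnings.filterwarnings("ignore", category=UserWarning)
        try:
            warnings.warn("example", category)
        except Warning:
            return "error"
        return "ignored"


print(classify(UnknownMarkWarning))
print(classify(DeprecationWarning))
```

Because the ignore rule matches `UserWarning` and all of its subclasses, it also hides unrelated user warnings, which is the trade-off of this approach.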
test.py (updated with two custom markers)
import pytest

@pytest.mark.webtest
def test_http_request():
    print("webtest::test_http_request() called")

class TestClass:
    @pytest.mark.test1
    def test_method(self):
        print("test1::test_method() called")
Use the following commands to run desired tests:
pytest -s test.py -m webtest
pytest -s test.py -m test1
The best way to get rid of the message is to register the custom marker as per @gold_cy's answer.
However if you just wish to suppress the warning as per Jonathon's answer, rather than ignoring UserWarning (which will suppress all instances of the warning regardless of their source) you can specify the particular warning you want to suppress like so (in pytest.ini):
ignore::_pytest.warning_types.PytestUnknownMarkWarning
Note: For third party libraries/modules the full path to the warning is required to avoid an _OptionError exception
If you don't have pytest.ini and don't wanna create one just for this then you can also register it programmatically in conftest.py as described here:
def pytest_configure(config):
    # register an additional marker
    config.addinivalue_line(
        "markers", "env(name): mark test to run only on named environment"
    )
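When several markers need registering, the same hook can loop over a table of names and descriptions. This is a sketch: the `CUSTOM_MARKERS` dictionary and the description for `test1` are made up for the example, while `config.addinivalue_line` is the call used in the snippet above.

```python
# conftest.py

# Hypothetical marker table for this sketch; use your own names and descriptions.
CUSTOM_MARKERS = {
    "webtest": "mark a test as a webtest.",
    "test1": "mark a test as belonging to group test1.",
}


def pytest_configure(config):
    # Register each marker exactly as a "markers" line in pytest.ini would.
    for name, description in CUSTOM_MARKERS.items():
        config.addinivalue_line("markers", f"{name}: {description}")
```

Keeping the table in one place means a new marker only needs one extra dictionary entry.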
To add to the existing answers - custom markers can also be registered in pyproject.toml:
# pyproject.toml
[tool.pytest.ini_options]
markers = [
"webtest: mark a test as a webtest.",
]
Related docs.

cookie cutter: what's the easiest way to specify variables for the prompts

Is there anything that offers replay-type functionality, by pointing at a predefined prompt-answer file?
What works and what I'd like to achieve.
Let's take an example, using a cookiecutter to prep a Python package for pypi
cookiecutter https://github.com/audreyr/cookiecutter-pypackage.git
You've downloaded /Users/jluc/.cookiecutters/cookiecutter-pypackage before. Is it okay to delete and re-download it? [yes]:
full_name [Audrey Roy Greenfeld]: Spartacus 👈 constant for me/my organization
email [audreyr@example.com]: spartacus@example.com 👈 constant for me/my organization
...
project_name [Python Boilerplate]: GladiatorRevolt 👈 this will vary.
project_slug [q]: gladiator-revolt 👈 this too
...
OK, done.
Now, I can easily redo this, for this project, via:
cookiecutter https://github.com/audreyr/cookiecutter-pypackage.git --replay
This is great!
What I want:
Say I create another project, UnleashHell.
I want to prep a file somehow that has my developer-info and project level info for Unleash. And I want to be able to run it multiple times against this template, without having to deal with prompts. This particular pypi template gets regular updates, for example python 2.7 support has been dropped.
The problem:
A --replay will just inject the last run for this cookiecutter template. If it was run against a different pypi project, too bad.
I'm good with my developer-level info, but I need to vary all the project level info.
I tried copying the replay file via:
cp ~/.cookiecutter_replay/cookiecutter-pypackage.json unleash.json
Edit unleash.json to reflect necessary changes.
Then specify it via --config-file flag
cookiecutter https://github.com/audreyr/cookiecutter-pypackage.git --config-file unleash.json
I get an ugly error, it wants YAML, apparently.
cookiecutter.exceptions.InvalidConfiguration: Unable to parse YAML file .../000.packaging/unleash.json. Error: None of the known patterns match for {
"cookiecutter": {
"full_name": "Spartacus",
No problem, json2yaml to the rescue.
That doesn't work either.
cookiecutter.exceptions.InvalidConfiguration: Unable to parse YAML file ./cookie.yaml. Error: Unable to determine type for "
full_name: "Spartacus"
I also tried a < stdin redirect:
cookiecutter.prompts.txt:
yes
Spartacus
...
It doesn't seem to use it and just aborts.
cookiecutter https://github.com/audreyr/cookiecutter-pypackage.git < ./cookiecutter.prompts.txt
You've downloaded ~/.cookiecutters/cookiecutter-pypackage before. Is it okay to delete and re-download it? [yes]
: full_name [Audrey Roy Greenfeld]
: email [audreyr@example.com]
: Aborted
I suspect I am missing something obvious, not sure what. To start with, what is the intent and format expected for the --config file?
Debrief - how I got it working from accepted answer.
I took the accepted answer but adjusted it for ~/.cookiecutterrc usage. It works, but the format is not very clear, especially since the rc file has to be YAML, which is not often the case with rc files.
This ended up working:
file ~/.cookiecutterrc:
without nesting under default_context... tons of unhelpful yaml parse errors (on a valid yaml doc).
default_context:
    #... cut out for privacy
    add_pyup_badge: y
    command_line_interface: "Click"
    create_author_file: "y"
    open_source_license: "MIT license"
# the names to use here are:
# full_name:
# email:
# github_username:
# project_name:
# project_slug:
# project_short_description:
# pypi_username:
# version:
# use_pytest:
# use_pypi_deployment_with_travis:
# add_pyup_badge:
# command_line_interface:
# create_author_file:
# open_source_license:
I still could not get a combination of ~/.cookiecutterrc and a project-specific config.yaml to work. Too bad that expected configuration format is so lightly documented.
So I will use the .rc but enter the project name, slug and description each time. Oh well, good enough for now.
You are close.
Try this:
cookiecutter https://github.com/audreyr/cookiecutter-pypackage.git --no-input --config-file config.yaml
The --no-input parameter suppresses the terminal prompts; it is optional, of course.
The config.yaml file could look like this:
default_context:
    full_name: "Audrey Roy"
    email: "audreyr@example.com"
    github_username: "audreyr"
cookiecutters_dir: "/home/audreyr/my-custom-cookiecutters-dir/"
replay_dir: "/home/audreyr/my-custom-replay-dir/"
abbreviations:
    pp: https://github.com/audreyr/cookiecutter-pypackage.git
    gh: https://github.com/{0}.git
    bb: https://bitbucket.org/{0}
Reference to this example file: https://cookiecutter.readthedocs.io/en/1.7.0/advanced/user_config.html
You probably just need the default_context block since that is where the user input goes.
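To keep developer-level values constant while varying the project-level ones, one option is to generate a config file per project. The sketch below only produces the `default_context` block as text; the helper `render_config`, the `DEVELOPER` dictionary, and the slug `unleash-hell` are assumptions made for this example, and it does not call cookiecutter itself.

```python
import json

# Developer-level values that stay constant across projects
# (example values taken from the question).
DEVELOPER = {"full_name": "Spartacus", "email": "spartacus@example.com"}


def render_config(project):
    """Return YAML text with developer and project values under default_context."""
    context = dict(DEVELOPER, **project)
    lines = ["default_context:"]
    for key, value in context.items():
        # json.dumps yields a double-quoted string, which is also a valid
        # YAML double-quoted scalar.
        lines.append("    {}: {}".format(key, json.dumps(value)))
    return "\n".join(lines) + "\n"


# Hypothetical project-level values for the second project.
unleash = {"project_name": "UnleashHell", "project_slug": "unleash-hell"}
print(render_config(unleash), end="")
```

Saving the printed text as, say, unleash.yaml and passing it with --config-file unleash.yaml --no-input would then replay those values without prompts.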

Stanford CoreNLP: nndep.DependencyParser in pipeline with German model

I want to use nndep in CoreNLP for dependency parsing. The input is a simple German sentence and the output should look like this:
case(Schulen-3, An-1)
amod(Schulen-3, Stuttgarter-2)
nmod(gegrüßt-13, Schulen-3)
aux(gegrüßt-13, darf-4)
case(MitschülerInnen-7, wegen-5)
amod(MitschülerInnen-7, muslimischer-6)
nmod(gegrüßt-13, MitschülerInnen-7)
neg(gegrüßt-13, nicht-8)
advmod(nicht-8, mehr-9)
case(Gott-12, mit-10)
amod(Gott-12, Grüß-11)
nmod(gegrüßt-13, Gott-12)
root(ROOT-0, gegrüßt-13)
auxpass(gegrüßt-13, werden-14)
punct(gegrüßt-13, .-15)
and this command is working for a single file:
java -cp "*" -Xmx2g edu.stanford.nlp.parser.nndep.DependencyParser -model edu/stanford/nlp/models/parser/nndep/UD_German.gz -textFile /Users/.../input.txt
But I need to do this with 60,000 files, so I need the nlp.pipeline. If I execute the following command, the output is only the normal parse tree but not the parsed dependencies.
java -Xmx6g -cp "*:." -Xmx2g edu.stanford.nlp.pipeline.StanfordCoreNLP -filelist /Users/.../filelist.txt -props StanfordCoreNLP-german.properties -outputFormat text -parse.originalDependencies
Can someone help?
You need to add
-annotators tokenize,ssplit,pos,lemma,parse,depparse
and
-pos.model edu/stanford/nlp/models/pos-tagger/german/german-ud.tagger
The first addition is telling it to use the dependency parser, the second is telling it to use the UD POS tagger, which is necessary since the dependency parser expects UD POS tags.
Also make sure to use the latest Stanford CoreNLP from GitHub or the last released version (more stable) from here:
https://stanfordnlp.github.io/CoreNLP/download.html
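Putting the question's command and the two additions together, the full invocation can be assembled as an argument list, for instance to launch it from a script. This is a sketch: `corenlp_command` is a hypothetical helper, every flag is copied from the question or the answer above (the duplicate -Xmx2g is dropped), and nothing is executed here.

```python
import shlex


def corenlp_command(filelist, props="StanfordCoreNLP-german.properties"):
    """Build the argument list for one pipeline run over all files in filelist."""
    return [
        "java", "-Xmx6g", "-cp", "*:.",
        "edu.stanford.nlp.pipeline.StanfordCoreNLP",
        "-filelist", filelist,
        "-props", props,
        # Added per the answer: run the dependency parser ...
        "-annotators", "tokenize,ssplit,pos,lemma,parse,depparse",
        # ... and use the UD POS tagger it expects.
        "-pos.model", "edu/stanford/nlp/models/pos-tagger/german/german-ud.tagger",
        "-outputFormat", "text",
        "-parse.originalDependencies",
    ]


command = corenlp_command("filelist.txt")
# Print a shell-safe version of the command for inspection.
print(" ".join(shlex.quote(arg) for arg in command))
```

The list can be handed to `subprocess.run(command, check=True)` from the CoreNLP directory, provided Java and the German models are available there.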

ipython3 - Almost every time I tab complete in ipython3 it runs %rehashx, is there a workaround?

I've tried googling around but haven't found much / anything, the following also doesn't help at all...
https://ipython.org/ipython-doc/3/interactive/magics.html
typical usecase is:
In [31]: from sqlalch<TAB>
Caching the list of root modules, please wait!
(This will only be done once - type '%rehashx' to reset cache!)
em
Caching the list of root modules, please wait!
(This will only be done once - type '%rehashx' to reset cache!)
Caching the list of root modules, please wait!
(This will only be done once - type '%rehashx' to reset cache!)
sqlalchemy
Running %rehashx by itself doesn't help either. I also pip installed pyreadline.
Any ideas what is going wrong? Where does %rehashx store info?
EDIT
The output from get_ipython().db['rootmodules_cache'] gives the following:
d = get_ipython().db['rootmodules_cache']
for key in d.keys(): print(key)
# /usr/local/bin
# /usr/lib/python3/dist-packages
# /usr/lib/python3.5
# /usr/local/lib/python3.5/dist-packages <- should be in here
# /usr/lib/python3.5/lib-dynload
# /usr/lib/python35.zip
# /usr/local/lib/python3.5/dist-packages/IPython/extensions
# /usr/lib/python3.5/plat-x86_64-linux-gnu
# /home/user/.ipython
import sqlalchemy
sqlalchemy.__file__
# /usr/local/lib/python3.5/dist-packages/sqlalchemy/__init__.py
However sqlalchemy is not in the list
d = get_ipython().db['rootmodules_cache']
'sqlalchemy' in d['/usr/local/lib/python3.5/dist-packages']
# False
This command solved it for me, in IPython:
!rm .ipython/profile_default/db/*
I hope it helps you too.
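The same cleanup can be done from Python, which avoids depending on the current working directory. This is a sketch: `clear_ipython_db` is a hypothetical helper, and the default path assumes the standard profile location under the home directory, as in the command above.

```python
from pathlib import Path


def clear_ipython_db(db_dir=None):
    """Delete the regular files directly inside db_dir and return their names."""
    if db_dir is None:
        # Assumed default location of the profile database.
        db_dir = Path.home() / ".ipython" / "profile_default" / "db"
    removed = []
    for entry in sorted(Path(db_dir).glob("*")):
        # Mirror "rm db/*": skip dotfiles and leave subdirectories alone.
        if entry.is_file() and not entry.name.startswith("."):
            entry.unlink()
            removed.append(entry.name)
    return removed
```

Calling `clear_ipython_db()` and then restarting IPython should make it rebuild the root modules cache on the next tab completion.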

"invalid option" error when running cucumber with "--tags"

I've been playing around with Cucumber for about three weeks now, and everything works well, except this little thing here.
Whenever I run my tests with e.g. cucumber checkout.feature --tags @monthly, I get the following on my console after the tests have run successfully:
invalid option: --tags
Test::Unit automatic runner.
Usage: /Users/myusername/.rvm/gems/ruby-2.0.0-p0/bin/cucumber [options] [-- untouched arguments]
-r, --runner=RUNNER Use the given RUNNER.
(c[onsole], e[macs], x[ml])
--collector=COLLECTOR Use the given COLLECTOR.
(de[scendant], di[r], l[oad], o[bject]_space)
-n, --name=NAME Runs tests matching NAME.
(patterns may be used).
--ignore-name=NAME Ignores tests matching NAME.
(patterns may be used).
-t, --testcase=TESTCASE Runs tests in TestCases matching TESTCASE.
(patterns may be used).
--ignore-testcase=TESTCASE Ignores tests in TestCases matching TESTCASE.
(patterns may be used).
--location=LOCATION Runs tests that defined in LOCATION.
LOCATION is one of PATH:LINE, PATH or LINE
--attribute=EXPRESSION Runs tests that matches EXPRESSION.
EXPRESSION is evaluated as Ruby's expression.
Test attribute name can be used with no receiver in EXPRESSION.
EXPRESSION examples:
!slow
tag == 'important' and !slow
--[no-]priority-mode Runs some tests based on their priority.
--default-priority=PRIORITY Uses PRIORITY as default priority
(h[igh], i[mportant], l[ow], m[ust], ne[ver], no[rmal])
-I, --load-path=DIR[:DIR...] Appends directory list to $LOAD_PATH.
--color-scheme=SCHEME Use SCHEME as color scheme.
(d[efault])
--config=FILE Use YAML fomat FILE content as configuration file.
--order=ORDER Run tests in a test case in ORDER order.
(a[lphabetic], d[efined], r[andom])
--max-diff-target-string-size=SIZE
Shows diff if both expected result string size and actual result string size are less than or equal SIZE in bytes.
(1000)
-v, --verbose=[LEVEL] Set the output level (default is verbose).
(important-only, n[ormal], p[rogress], s[ilent], v[erbose])
--[no-]use-color=[auto] Uses color output
(default is auto)
--progress-row-max=MAX Uses MAX as max terminal width for progress mark
(default is auto)
--no-show-detail-immediately Shows not passed test details immediately.
(default is yes)
--output-file-descriptor=FD Outputs to file descriptor FD
-- Stop processing options so that the
remaining options will be passed to the
test.
-h, --help Display this help.
Deprecated options:
--console Console runner (use --runner).
I probably didn't need to put all of that here, but I wanted to give you an impression of how much text appears on my screen after each test, which can be a bit distracting.
Here is my setup:
Gemfile
source 'https://rubygems.org'
gem "rspec"
gem "cucumber"
gem "capybara"
gem "capybara-webkit"
gem "selenium"
gem "selenium-client"
gem "selenium-webdriver"
env.rb
require_relative '../../../config.rb'
require 'capybara/cucumber'
require 'capybara/rspec'
Capybara.app_host = AT_ROOT
Capybara.default_driver = :selenium
Capybara.javascript_driver = :webkit
Capybara.default_wait_time = DEFAULT_WAIT_TIME
Capybara.ignore_hidden_elements = IGNORE_HIDDEN_ELEMENTS
# Define window size of the browser here
Capybara.current_session.driver.browser.manage.window.resize_to(DEFAULT_WINDOW_HEIGHT, DEFAULT_WINDOW_WIDTH)
I couldn't find any connection to the Test::Unit automatic runner in the console output, but apparently it's got something to do with it.
Do you have any idea what that could be? I found some threads related to this issue, but they didn't help me unfortunately.
Thank you
Try
cucumber features -t @monthly
