This answer was tested on Windows 7 SP1 x64 Ultimate with Anaconda Python 2.7.11 x64 and MIST 2.0.4. MIST 2.0.4 does not work with Python 3.x (according to the manual; I have not tested it myself).
MIST (MITRE Identification Scrubber Toolkit) [1] is a customization of MAT (MITRE Annotation Toolkit), which is a tool for tagging documents either automatically or with human annotators (in the latter case it provides a GUI through a web server). The automatic tagger is based on Carafe (ConditionAl RAndom Fields) [2], which is an OCaml implementation of conditional random fields (CRFs).
MIST does not come with any trained model, and it ships with only ~10 short, non-medical documents annotated with typical NER classes (such as organization and person).
De-id (de-identification) is the process of tagging PHI (Protected Health Information) in a document and replacing it with fake data. Let's ignore the PHI replacement for now and focus on tagging. To tag a document (e.g., a patient note), MAT follows the typical machine learning scheme: the CRF must first be trained on a labeled data set (= a set of labeled documents), and then we use it to tag unlabeled documents.
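To make the "tag, then replace" idea concrete, here is a minimal Python sketch of the replacement half. It is not MIST code: the note text, the spans, the label names and the `replace_phi` helper are all made up for illustration, and the spans stand in for whatever a trained tagger would produce.

```python
# Illustrative only: the note, spans and labels below are made up, not MIST output.

def replace_phi(text, spans, surrogates):
    """Replace each tagged (start, end, label) span with a surrogate string.

    Spans are processed from right to left so that the character offsets of
    the spans still to be processed stay valid while the text changes length.
    """
    for start, end, label in sorted(spans, reverse=True):
        replacement = surrogates.get(label, "[" + label + "]")
        text = text[:start] + replacement + text[end:]
    return text


note = "John Smith was seen at Boston General on 03/04/2015."
# Hypothetical tagger output: character offsets plus a PHI label.
spans = [(0, 10, "NAME"), (23, 37, "HOSPITAL"), (41, 51, "DATE")]
# Labels without a surrogate fall back to a bracketed placeholder.
surrogates = {"NAME": "Jane Doe", "DATE": "01/01/2000"}

deidentified = replace_phi(note, spans, surrogates)
print(deidentified)
```

The hard part of de-id is producing good spans in the first place, which is why the rest of this answer is about training and running the tagger.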
The main technical concept in MAT is the task. A task is a set of activities called workflows, each of which can be broken down into steps. Named-entity recognition (NER) is one task. De-id is another task (mostly NER focused on medical text): in other words, MIST is just one MAT task (in fact 3: Core, HIPAA and AMIA; Core is the parent task, while HIPAA and AMIA are two different tagsets). Steps are, for example, tokenization, tagging or cleaning. Workflows are just lists of steps that one can carry out.
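If it helps, the task / workflow / step hierarchy can be pictured with a few lines of Python. This is only a mental model: MAT defines its tasks in task.xml files, and the `Task`, `Workflow` and `validate_steps` names below are hypothetical. The step list is taken from the commands used later in this answer and may not be the full list for that workflow.

```python
# Hypothetical data model for illustration; MAT itself defines tasks in task.xml.
from collections import namedtuple

Workflow = namedtuple("Workflow", ["name", "steps"])
Task = namedtuple("Task", ["name", "workflows"])

# Steps as they appear in the "Named Entity" commands of this answer.
demo = Workflow(name="Demo", steps=["zone", "tokenize", "tag"])
ne_task = Task(name="Named Entity", workflows={"Demo": demo})


def validate_steps(task, workflow_name, requested):
    """Split a comma-separated --steps value and check it against a workflow."""
    workflow = task.workflows[workflow_name]
    steps = [s.strip() for s in requested.split(",")]
    unknown = [s for s in steps if s not in workflow.steps]
    if unknown:
        raise ValueError(
            "unknown steps for %s: %s" % (workflow.name, ", ".join(unknown)))
    return steps


selected = validate_steps(ne_task, "Demo", "zone,tokenize")
print(selected)
```

In this picture, a command line picks one task, one workflow inside it, and a subset of that workflow's steps to run.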
With that in mind, here is the code for Microsoft Windows:
rem #######
rem Instructions for Windows 7 SP1 x64 Ultimate
rem Installing MIST: set MAT_PKG_HOME depending on where you downloaded it
SET MAT_PKG_HOME=C:\Users\Francky\Downloads\MIST_2_0_4\MIST_2_0_4\src\MAT
SET TMP=C:\Users\Francky\Downloads\MIST_2_0_4\MIST_2_0_4\temp
rem Create the temp directory if needed (the Ubuntu instructions do the same with mkdir)
if not exist "%TMP%" mkdir "%TMP%"
cd C:\Users\Francky\Downloads\MIST_2_0_4\MIST_2_0_4
python install.py
rem MAT is now installed. We'll show how to use it for NER.
rem We will be taking snippets from some of the 8 tutorials.
rem A lot of the tutorial content is about the annotation GUI,
rem which we don't care about here.
rem Continuation lines start with a space: cmd joins the line after ^ without adding one.
rem Tuto 1: install task
cd %MAT_PKG_HOME%
bin\MATManagePluginDirs.cmd install %CD%\sample\ne
rem Tuto 2: build model (i.e., train it on labeled dataset)
bin\MATModelBuilder.cmd --task "Named Entity" --model_file %TMP%\ne_model^
  --input_files "%CD%\sample\ne\resources\data\json\*.json"
rem Tuto 2: Add trained model as the default model
bin\MATModelBuilder.cmd --task "Named Entity" --save_as_default_model^
  --input_files "%CD%\sample\ne\resources\data\json\*.json"
rem Tuto 5: use CLI -> prepare the document
bin\MATEngine.cmd --task "Named Entity" --workflow Demo --steps "zone,tokenize"^
  --input_file %CD%\sample\ne\resources\data\raw\voa2.txt --input_file_type raw^
  --output_file %CD%\voa2_txt.json --output_file_type mat-json
rem Tuto 5: use CLI -> tag the document
bin\MATEngine.cmd --task "Named Entity" --workflow Demo --steps "tag"^
  --input_file %CD%\voa2_txt.json --input_file_type mat-json^
  --output_file %CD%\voa2_txt.json --output_file_type mat-json^
  --tagger_local
NER is now done.
Here are the same instructions for Ubuntu 14.04.4 LTS x64:
#######
# Instructions for Ubuntu 14.04.4 LTS x64
# Installing MIST: set MAT_PKG_HOME depending on where you downloaded it
export MAT_PKG_HOME=/home/ubuntu/mist/MIST_2_0_4/MIST_2_0_4/src/MAT
export TMP=/home/ubuntu/mist/MIST_2_0_4/MIST_2_0_4/temp
mkdir $TMP
cd /home/ubuntu/mist/MIST_2_0_4/MIST_2_0_4/
python install.py
# MAT is now installed. We'll show how to use it for NER.
# We will be taking snippets from some of the 8 tutorials.
# A lot of the tutorial content is about the annotation GUI,
# which we don't care about here.
# Tuto 1: install task
cd $MAT_PKG_HOME
bin/MATManagePluginDirs install $PWD/sample/ne
# Tuto 2: build model (i.e., train it on labeled dataset)
bin/MATModelBuilder --task "Named Entity" --model_file $TMP/ne_model \
--input_files "$PWD/sample/ne/resources/data/json/*.json"
# Tuto 2: Add trained model as the default model
bin/MATModelBuilder --task "Named Entity" --save_as_default_model \
--input_files "$PWD/sample/ne/resources/data/json/*.json"
# Tuto 5: use CLI -> prepare the document
bin/MATEngine --task "Named Entity" --workflow Demo --steps "zone,tokenize" \
--input_file $PWD/sample/ne/resources/data/raw/voa2.txt --input_file_type raw \
--output_file $PWD/voa2_txt.json --output_file_type mat-json
# Tuto 5: use CLI -> tag the document
bin/MATEngine --task "Named Entity" --workflow Demo --steps "tag" \
--input_file $PWD/voa2_txt.json --input_file_type mat-json \
--output_file $PWD/voa2_txt.json --output_file_type mat-json \
--tagger_local
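Once the tag step has run, you will probably want to look at what was tagged. The sketch below lists the annotations in a MAT-JSON document. Caveat: the layout it assumes (a top-level "signal" string plus an "asets" list whose entries have a "type" and "annots" starting with start and end offsets) is my reading of the format, not something taken from the commands above, and the sample document is made up. Check it against a file you actually produced, such as voa2_txt.json.

```python
import json


def list_annotations(doc):
    """Return (label, start, end, text) tuples from a MAT-JSON-like dict.

    Assumed layout: doc["signal"] is the raw text, and each entry of
    doc["asets"] has a "type" and a list of "annots" whose first two
    items are character offsets into the signal.
    """
    signal = doc["signal"]
    found = []
    for aset in doc.get("asets", []):
        if not aset.get("hasSpan", True):
            continue  # skip annotation types without character spans
        for annot in aset.get("annots", []):
            start, end = annot[0], annot[1]
            found.append((aset["type"], start, end, signal[start:end]))
    return sorted(found, key=lambda item: item[1])


# Made-up sample document; with real output you would use
# json.load(open("voa2_txt.json")) instead.
sample = json.loads("""
{"signal": "Barack Obama visited Paris.",
 "asets": [
   {"type": "PERSON", "hasSpan": true, "attrs": [], "annots": [[0, 12]]},
   {"type": "LOCATION", "hasSpan": true, "attrs": [], "annots": [[21, 26]]}
 ]}
""")

annotations = list_annotations(sample)
for label, start, end, text in annotations:
    print("%s\t%d-%d\t%s" % (label, start, end, text))
```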
To run de-id, there is no need to install the de-id tasks: they are pre-installed. There are 2 de-id tasks (\MIST_2_0_4\src\tasks\HIPAA\task.xml and \MIST_2_0_4\src\tasks\AMIA\task.xml). They do not come with any trained model or labeled data set, so you may want to get some data: see "Physician notes with annotated PHI".
For Microsoft Windows (tested with Windows 7 SP1 x64 Ultimate):
To train a model (you can replace HIPAA Deidentification with AMIA Deidentification depending on the tagset you want to use):
bin\MATModelBuilder.cmd --task "HIPAA Deidentification"^
  --save_as_default_model --nthreads=3 --max_iterations=15^
  --lexicon_dir="%CD%\sample\mist\gazetteers"^
  --input_files "%CD%\sample\mist\i2b2-60-00-40\train\*.json"
To run a trained model on one file:
bin\MATEngine --task "HIPAA Deidentification" --workflow Demo^
  --input_file .\note.txt --input_file_type raw^
  --output_file .\note.json --output_file_type mat-json^
  --tagger_local^
  --steps "clean,zone,tag"
To run a trained model on one directory:
bin\MATEngine --task "HIPAA Deidentification" --workflow Demo^
  --input_dir "%CD%\sample\test" --input_file_type raw^
  --output_dir "%CD%\sample\test" --output_file_type mat-json^
  --tagger_local^
  --steps "clean,zone,tag"
As usual, you can specify that the input file format is JSON:
bin\MATEngine --task "HIPAA Deidentification" --workflow Demo^
  --input_dir "%CD%\sample\mist\i2b2-60-00-40\test" --input_file_type mat-json^
  --output_dir "%CD%\sample\mist\i2b2-60-00-40\test_out" --output_file_type mat-json^
  --tagger_local --steps "tag"
For Ubuntu 14.04.4 LTS x64:
To train a model (you can replace HIPAA Deidentification with AMIA Deidentification depending on the tagset you want to use):
bin/MATModelBuilder --task "HIPAA Deidentification" \
--save_as_default_model --nthreads=20 --max_iterations=15 \
--lexicon_dir="$PWD/sample/mist/gazetteers" \
--input_files "$PWD/sample/mist/i2b2-60-00-40/train/*.json"
To run a trained model on one file:
bin/MATEngine --task "HIPAA Deidentification" --workflow Demo \
--input_file ./note.txt --input_file_type raw \
--output_file ./note.json --output_file_type mat-json \
--tagger_local \
--steps "clean,zone,tag"
To run a trained model on one directory:
bin/MATEngine --task "HIPAA Deidentification" --workflow Demo \
--input_dir "$PWD/sample/test" --input_file_type raw \
--output_dir "$PWD/sample/test" --output_file_type mat-json \
--tagger_local \
--steps "clean,zone,tag"
As usual, you can specify that the input file format is JSON:
bin/MATEngine --task "HIPAA Deidentification" --workflow Demo \
--input_dir "$PWD/sample/mist/i2b2-60-00-40/test" --input_file_type mat-json \
--output_dir "$PWD/sample/mist/i2b2-60-00-40/test_out" --output_file_type mat-json \
--tagger_local --steps "tag"
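If you would rather drive MATEngine from a Python script than from the shell, one possible approach is to assemble the argument list and hand it to subprocess. The sketch below only uses the flags shown in the commands above; the `build_engine_command` helper, its parameter names and the note.txt / note.json file names are my own placeholders. On Windows the executable would be MATEngine.cmd rather than MATEngine.

```python
import os


def build_engine_command(mat_home, task, steps, input_file, output_file,
                         input_type="raw", workflow="Demo"):
    """Assemble a MATEngine argument list from the flags used in this answer."""
    return [
        os.path.join(mat_home, "bin", "MATEngine"),
        "--task", task,
        "--workflow", workflow,
        "--steps", ",".join(steps),
        "--input_file", input_file, "--input_file_type", input_type,
        "--output_file", output_file, "--output_file_type", "mat-json",
        "--tagger_local",
    ]


command = build_engine_command(
    mat_home=os.environ.get("MAT_PKG_HOME", "."),
    task="HIPAA Deidentification",
    steps=["clean", "zone", "tag"],
    input_file="note.txt",
    output_file="note.json",
)
print(" ".join(command))

# To actually run it (requires a MIST installation and a trained model):
# import subprocess
# subprocess.check_call(command)
```

Passing a list instead of a single string avoids shell quoting problems with task names that contain spaces, such as "HIPAA Deidentification".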
Common error messages:
raise PluginError, "Carafe not configured properly for this task and workflow: " + str(e)
(when trying to tag a document): this often means that no model was specified. You need to define a default model or use --tagger_model /path/to/model/.
Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit exceeded
(when training a model): it is easy to go over the heap_size limit (2 GB by default). You can increase it with the --heap_size parameter. Example (Linux):
bin/MATModelBuilder --task "HIPAA Deidentification" \
--save_as_default_model --nthreads=20 --max_iterations=15 \
--lexicon_dir="$PWD/sample/mist/gazetteers" \
--heap_size=60G \
--input_files "$PWD/sample/mist/mimic-140-20-40/train/*.json"
[1] John Aberdeen, Samuel Bayer, Reyyan Yeniterzi, Ben Wellner, Cheryl Clark, David Hanauer, Bradley Malin, Lynette Hirschman. The MITRE Identification Scrubber Toolkit: design, training, and assessment. Int. J. Med. Inform. 79 (12) (2010) 849-859, http://dx.doi.org/10.1016/j.ijmedinf.2010.09.007.
[2] B. Wellner. Sequence Models and Ranking Methods for Discourse Parsing [Ph.D. Dissertation]. Brandeis University, Waltham, MA, 2009. http://www.cs.brandeis.edu/~wellner/pubs/wellner_dissertation.pdf
Documentation of MATModelBuilder.cmd:
Usage: MATModelBuilder.cmd [task option] [config name option] [other options]
Options:
-h, --help show this help message and exit
Task option:
--task=task name of the task to use. Must be the first argument,
if present. Obligatory if the system knows of more
than one task. Known tasks are: AMIA Deidentification,
Named Entity, HIPAA Deidentification, Enhanced Named
Entity
Config name option:
--config_name=name name of the model build config to use. Must be the
first argument after --task, if present. Optional.
Default model build config will be used if no config
is specified.
Control options:
--version Print version number and exit
--debug Enable debug output.
--subprocess_debug=int
Set the subprocess debug level to the value provided,
overriding the global setting. 0 disables, 1 shows
some subprocess activity, 2 shows all subprocess
activity.
--subprocess_statistics
Enable subprocess statistics (memory/time), if the
capability is available and it isn't globally enabled.
--tmpdir_root=dir Override the default system location for temporary
files. If the directory doesn't exist, it will be
created. Use this feature to control where temporary
files are created, for added security, or in
conjunction with --preserve_tempfiles, as a debugging
aid.
--preserve_tempfiles
Preserve the temporary files created, as a debugging
aid.
--verbose_config If specified, print to stderr the source of each MAT
configuration variable the first time it's accessed.
Options for model class creation:
--partial_training_on_gold_only
When the trainer is presented with partially tagged
documents, by default MAT will ask it to train on all
annotated segments, completed or not. If this flag is
specified, only completed segments should be used for
training.
--feature_spec=FEATURE_SPEC
path to the Carafe feature spec file to use. Optional
if feature_spec is set in the <build_settings> for the
relevant model config in the task.xml file for the
task.
--training_method=TRAINING_METHOD
If present, specify a training method other than the
standard method. Currently, the only recognized value
is psa. The psa method is noticeably faster, but may
result in somewhat poorer results. You can use a value
of '' to override a previously specified training
method (e.g., a default method in your task).
--max_iterations=MAX_ITERATIONS
number of iterations for the optimized PSA training
mechanism to use. A value between 6 and 10 is
appropriate. Overrides any possible default in
<build_settings> for the relevant model config in the
task.xml file for the task.
--lexicon_dir=LEXICON_DIR
If present, the name of a directory which contains a
Carafe training lexicon. This pathname should be an
absolute pathname, and should have a trailing slash.
The content of the directory should be a set of files,
each of which contains a sequence of tokens, one per
line. The name of the file will be used as a training
feature for the token. Overrides any possible default
in <build_settings> for the relevant model config in
the task.xml file for the task.
--parallel If present, parallelizes the feature expectation
computation, which reduces the clock time of model
building when multiple CPUs are available
--nthreads=NTHREADS
If --parallel is used, controls the number of threads
used for training.
--gaussian_prior=GAUSSIAN_PRIOR
A positive float, default is 10.0. See the jCarafe
docs for details.
--no_begin Don't introduce begin states during training. Useful
if you're certain that you won't have any adjacent
spans with the same label. See the jCarafe
documentation for more details.
--l1 Use L1 regularization for PSA training. See the
jCarafe docs for details.
--l1_c=L1_C Change the penalty factor for the L1 regularizer. See
the jCarafe docs for details.
--heap_size=HEAP_SIZE
If present, specifies the -Xmx argument for the Java
JVM
--stack_size=STACK_SIZE
If present, specifies the -Xss argument for the Java
JVM
--tags=TAGS if present, a comma-separated list of tags to pass to
the training engine instead of the full tag set for
the task (used to create per-tag pre-tagging models
for multi-stage training and tagging)
--pre_models=PRE_MODELS
if present, a comma-separated list of glob-style
patterns specifying the models to include as pre-
taggers.
--add_tokens_internally
If present, Carafe will use its internal tokenizer to
tokenize the document before training. If your
workflow doesn't tokenize the document, you must
provide this flag, or Carafe will have no tokens to
base its training on. We recommend strongly that you
tokenize your documents separately; you should not use
this flag.
--word_properties=WORD_PROPERTIES
See the jCarafe docs for --word-properties.
--word_scores=WORD_SCORES
See the jCarafe docs for --word-scores.
--learning_rate=LEARNING_RATE
See the jCarafe docs for --learning-rate.
--disk_cache=DISK_CACHE
See the jCarafe docs for --disk_cache.
Input options:
--input_dir=dir A directory, all of whose files will be used in the
model construction. Can be repeated. May be specified
with --input_files.
--input_files=re A glob-style pattern describing full pathnames to use
in the model construction. May be specified with
--input_dir. Can be repeated.
--file_type=fake-xml-inline | mat-json | xml-inline
The file type of the input. One of fake-xml-inline,
mat-json, xml-inline. Default is mat-json.
--encoding=encoding
The encoding of the input. The default is the
appropriate default for the file type.
Output options:
--model_file=file Location to save the created model. The directory must
already exist. Obligatory if --save_as_default_model
isn't specified.
--save_as_default_model
If the the task.xml file for the task specifies the
<default_model> element, save the model in the
specified location, possibly overriding any existing
model.
Documentation of MATEngine:
Usage: MATEngine [core options] [input/output/task options] [other options]
Options:
-h, --help show this help message and exit
Core options:
--other_app_dir=dir
additional directory to load a task from. Optional and
repeatable.
--settings_file=file
a file of settings to use which overwrites existing
settings. The file should be a Python config file in
the style of the template in
etc/MAT_settings.config.in. Optional.
--task=task name of the task to use. Obligatory if the system
knows of more than one task. Known tasks are: AMIA
Deidentification, Named Entity, HIPAA
Deidentification, Enhanced Named Entity
--version Print version number and exit
--debug Enable debug output.
--subprocess_debug=int
Set the subprocess debug level to the value provided,
overriding the global setting. 0 disables, 1 shows
some subprocess activity, 2 shows all subprocess
activity.
--subprocess_statistics
Enable subprocess statistics (memory/time), if the
capability is available and it isn't globally enabled.
--tmpdir_root=dir Override the default system location for temporary
files. If the directory doesn't exist, it will be
created. Use this feature to control where temporary
files are created, for added security, or in
conjunction with --preserve_tempfiles, as a debugging
aid.
--preserve_tempfiles
Preserve the temporary files created, as a debugging
aid.
--verbose_config If specified, print to stderr the source of each MAT
configuration variable the first time it's accessed.
Good luck doing this on real patient data and guaranteeing 100% that it won't "miss" something that should have been hidden...
@MarcB Right, I am not aiming for 100% (I wish): I don't expect patient notes to be clean.
@MarcB We do have good results though: Franck Dernoncourt, Ji Young Lee, Ozlem Uzuner, Peter Szolovits. [De-identification of Patient Notes with Recurrent Neural Networks](http://arxiv.org/abs/1606.03475). arXiv preprint arXiv:1606.03475, 2016.