I want to display RabbitMQ queues in ELK. For this I ran Kibana, Elasticsearch, and Logstash; all the details are below.
The result of running Kibana:
samira#elk:/var/www/apps/kibana-8.4.2-linux-x86_64(1)/kibana-8.4.2/bin$ ./kibana
[2022-10-20T12:37:57.842+03:30][INFO ][node] Kibana process configured with roles: [background_tasks, ui]
[2022-10-20T12:38:08.711+03:30][INFO ][http.server.Preboot] http server running at http://localhost:5601
[2022-10-20T12:38:08.740+03:30][INFO ][plugins-system.preboot] Setting up [1] plugins: [interactiveSetup]
[2022-10-20T12:38:08.771+03:30][WARN ][config.deprecation] The default mechanism for Reporting privileges will work differently in future versions, which will affect the behavior of this cluster. Set "xpack.reporting.roles.enabled" to "false" to adopt the future behavior before upgrading.
[2022-10-20T12:38:08.935+03:30][INFO ][plugins-system.standard] Setting up [121] plugins: [translations,monitoringCollection,licensing,globalSearch,globalSearchProviders,features,mapsEms,licenseApiGuard,usageCollection,taskManager,telemetryCollectionManager,telemetryCollectionXpack,kibanaUsageCollection,share,embeddable,uiActionsEnhanced,screenshotMode,banners,newsfeed,fieldFormats,expressions,dataViews,charts,esUiShared,customIntegrations,home,searchprofiler,painlessLab,grokdebugger,management,advancedSettings,spaces,security,lists,encryptedSavedObjects,cloud,snapshotRestore,screenshotting,telemetry,licenseManagement,eventLog,actions,console,bfetch,data,watcher,reporting,fileUpload,ingestPipelines,alerting,unifiedSearch,savedObjects,graph,savedObjectsTagging,savedObjectsManagement,presentationUtil,expressionShape,expressionRevealImage,expressionRepeatImage,expressionMetric,expressionImage,controls,eventAnnotation,dataViewFieldEditor,triggersActionsUi,transform,stackAlerts,ruleRegistry,discover,fleet,indexManagement,remoteClusters,crossClusterReplication,indexLifecycleManagement,cloudSecurityPosture,discoverEnhanced,aiops,visualizations,canvas,visTypeXy,visTypeVislib,visTypeVega,visTypeTimeseries,rollup,visTypeTimelion,visTypeTagcloud,visTypeTable,visTypeMetric,visTypeHeatmap,visTypeMarkdown,dashboard,dashboardEnhanced,expressionXY,expressionTagcloud,expressionPartitionVis,visTypePie,expressionMetricVis,expressionLegacyMetricVis,expressionHeatmap,expressionGauge,lens,osquery,maps,dataVisualizer,ml,cases,timelines,sessionView,kubernetesSecurity,securitySolution,visTypeGauge,sharedUX,observability,synthetics,infra,upgradeAssistant,monitoring,logstash,enterpriseSearch,apm,dataViewManagement]
[2022-10-20T12:38:08.948+03:30][INFO ][plugins.taskManager] TaskManager is identified by the Kibana UUID: 114adb80-0285-4b29-b403-64ea1e454f19
[2022-10-20T12:38:09.009+03:30][WARN ][plugins.security.config] Session cookies will be transmitted over insecure connections. This is not recommended.
[2022-10-20T12:38:09.028+03:30][WARN ][plugins.security.config] Session cookies will be transmitted over insecure connections. This is not recommended.
[2022-10-20T12:38:09.034+03:30][INFO ][plugins.encryptedSavedObjects] Hashed 'xpack.encryptedSavedObjects.encryptionKey' for this instance: B6ABZzCc0sMI2CQc1eJYyeLXhC0I61v8xdNjUusVvp0=
[2022-10-20T12:38:09.159+03:30][INFO ][plugins.ruleRegistry] Installing common resources shared between all indices
[2022-10-20T12:38:09.189+03:30][INFO ][plugins.cloudSecurityPosture] Registered task successfully [Task: cloud_security_posture-stats_task]
[2022-10-20T12:38:09.678+03:30][INFO ][plugins.screenshotting.config] Chromium sandbox provides an additional layer of protection, and is supported for Linux Ubuntu 20.04 OS. Automatically enabling Chromium sandbox.
[2022-10-20T12:38:09.707+03:30][ERROR][elasticsearch-service] Unable to retrieve version information from Elasticsearch nodes. connect ECONNREFUSED 127.0.0.1:9200
[2022-10-20T12:38:10.162+03:30][INFO ][plugins.screenshotting.chromium] Browser executable: /var/www/apps/kibana-8.4.2-linux-x86_64(1)/kibana-8.4.2/x-pack/plugins/screenshotting/chromium/headless_shell-linux_x64/headless_shell
[2022-10-20T12:39:34.886+03:30][INFO ][savedobjects-service] Waiting until all Elasticsearch nodes are compatible with Kibana before starting saved objects migrations...
[2022-10-20T12:39:34.887+03:30][INFO ][savedobjects-service] Starting saved objects migrations
[2022-10-20T12:39:34.935+03:30][INFO ][savedobjects-service] [.kibana] INIT -> OUTDATED_DOCUMENTS_SEARCH_OPEN_PIT. took: 28ms.
[2022-10-20T12:39:34.937+03:30][INFO ][savedobjects-service] [.kibana_task_manager] INIT -> OUTDATED_DOCUMENTS_SEARCH_OPEN_PIT. took: 27ms.
[2022-10-20T12:39:34.948+03:30][ERROR][savedobjects-service] [.kibana_task_manager] Action failed with 'search_phase_execution_exception: '. Retrying attempt 1 in 2 seconds.
[2022-10-20T12:39:34.949+03:30][INFO ][savedobjects-service] [.kibana_task_manager] OUTDATED_DOCUMENTS_SEARCH_OPEN_PIT -> OUTDATED_DOCUMENTS_SEARCH_OPEN_PIT. took: 11ms.
[2022-10-20T12:39:34.950+03:30][ERROR][savedobjects-service] [.kibana] Action failed with 'search_phase_execution_exception: '. Retrying attempt 1 in 2 seconds.
[2022-10-20T12:39:34.950+03:30][INFO ][savedobjects-service] [.kibana] OUTDATED_DOCUMENTS_SEARCH_OPEN_PIT -> OUTDATED_DOCUMENTS_SEARCH_OPEN_PIT. took: 15ms.
[2022-10-20T12:39:36.961+03:30][INFO ][savedobjects-service] [.kibana] OUTDATED_DOCUMENTS_SEARCH_OPEN_PIT -> OUTDATED_DOCUMENTS_SEARCH_READ. took: 2011ms.
The result of running elasticsearch
samira#elk:/var/www/apps/elasticsearch-8.4.2/bin$ ./elasticsearch
warning: ignoring JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64; using bundled JDK
[2022-10-20T12:39:26,417][INFO ][o.e.n.Node ] [elk.kifarunix-demo.com] version[8.4.2], pid[3594], build[tar/89f8c6d8429db93b816403ee75e5c270b43a940a/2022-09-14T16:26:04.382547801Z], OS[Linux/5.15.0-52-generic/amd64], JVM[Oracle Corporation/OpenJDK 64-Bit Server VM/18.0.2.1/18.0.2.1+1-1]
[2022-10-20T12:39:26,422][INFO ][o.e.n.Node ] [elk.kifarunix-demo.com] JVM home [/var/www/apps/elasticsearch-8.4.2/jdk], using bundled JDK [true]
[2022-10-20T12:39:26,422][INFO ][o.e.n.Node ] [elk.kifarunix-demo.com] JVM arguments [-Des.networkaddress.cache.ttl=60, -Des.networkaddress.cache.negative.ttl=10, -Djava.security.manager=allow, -XX:+AlwaysPreTouch, -Xss1m, -Djava.awt.headless=true, -Dfile.encoding=UTF-8, -Djna.nosys=true, -XX:-OmitStackTraceInFastThrow, -Dio.netty.noUnsafe=true, -Dio.netty.noKeySetOptimization=true, -Dio.netty.recycler.maxCapacityPerThread=0, -Dlog4j.shutdownHookEnabled=false, -Dlog4j2.disable.jmx=true, -Dlog4j2.formatMsgNoLookups=true, -Djava.locale.providers=SPI,COMPAT, --add-opens=java.base/java.io=ALL-UNNAMED, -XX:+UseG1GC, -Djava.io.tmpdir=/tmp/elasticsearch-11974649730192173872, -XX:+HeapDumpOnOutOfMemoryError, -XX:+ExitOnOutOfMemoryError, -XX:HeapDumpPath=data, -XX:ErrorFile=logs/hs_err_pid%p.log, -Xlog:gc*,gc+age=trace,safepoint:file=logs/gc.log:utctime,pid,tags:filecount=32,filesize=64m, -Xms2g, -Xmx2g, -XX:MaxDirectMemorySize=1073741824, -XX:G1HeapRegionSize=4m, -XX:InitiatingHeapOccupancyPercent=30, -XX:G1ReservePercent=15, -Des.distribution.type=tar, --module-path=/var/www/apps/elasticsearch-8.4.2/lib, --add-modules=jdk.net, -Djdk.module.main=org.elasticsearch.server]
[2022-10-20T12:39:27,685][INFO ][c.a.c.i.j.JacksonVersion ] [elk.kifarunix-demo.com] Package versions: jackson-annotations=2.13.2, jackson-core=2.13.2, jackson-databind=2.13.2.2, jackson-dataformat-xml=2.13.2, jackson-datatype-jsr310=2.13.2, azure-core=1.27.0, Troubleshooting version conflicts: https://aka.ms/azsdk/java/dependency/troubleshoot
[2022-10-20T12:39:28,669][INFO ][o.e.p.PluginsService ] [elk.kifarunix-demo.com] loaded module [aggs-matrix-stats]
[2022-10-20T12:39:28,670][INFO ][o.e.p.PluginsService ] [elk.kifarunix-demo.com] loaded module [analysis-common]
[2022-10-20T12:39:28,670][INFO ][o.e.p.PluginsService ] [elk.kifarunix-demo.com] loaded module [constant-keyword]
[2022-10-20T12:39:28,670][INFO ][o.e.p.PluginsService ] [elk.kifarunix-demo.com] loaded module [data-streams]
The result of running Logstash:
$ bin/logstash -f sdamiii.conf
This is my sdamiii.conf:
input {
  rabbitmq {
    host => "localhost"
    port => 5672
    heartbeat => 30
    durable => true
    queue => "system_logs"
    user => "guest"
    password => "guest"
    vhost => "/"
  }
}
output {
  elasticsearch {
    hosts => ["localhost:9200"]
  }
  rabbitmq {
    exchange => "system_logs"
    host => "localhost"
    exchange_type => "fanout"
    key => "logstash"
    persistent => false
  }
}
This is kibana.yml
elasticsearch.username: "kibana_system"
elasticsearch.password: "pass"
xpack.encryptedSavedObjects.encryptionKey: 5b6d5d7b20e971b2e562cc7c8ca181ae
xpack.reporting.encryptionKey: ba46a278d1dcb511339ea01f2a9d2651
xpack.security.encryptionKey: 11f8ec40b5bf442404f5c5a53b38ad13
This is elasticsearch.yml
network.host: localhost
xpack.security.enabled: false
xpack.security.enrollment.enabled: false
# Enable encryption for HTTP API client connections, such as Kibana, Logstash, and Agents
xpack.security.http.ssl:
  enabled: true
  keystore.path: certs/http.p12
# Enable encryption and mutual authentication between cluster nodes
xpack.security.transport.ssl:
  enabled: false
  verification_mode: certificate
  keystore.path: certs/transport.p12
  truststore.path: certs/transport.p12
# Create a new cluster with the current node only
# Additional nodes can still join the cluster later
cluster.initial_master_nodes: ["elk.kifarunix-demo.com"]
# Allow HTTP API connections from anywhere
# Connections are encrypted and require user authentication
http.host: localhost
This is logstash.yml
# Settings file in YAML
#
# Settings can be specified either in hierarchical form, e.g.:
#
# pipeline:
#   batch:
#     size: 125
#     delay: 5
#
# Or as flat keys:
#
# pipeline.batch.size: 125
# pipeline.batch.delay: 5
#
# ------------ Node identity ------------
#
# Use a descriptive name for the node:
#
# node.name: test
#
# If omitted the node name will default to the machine's host name
#
# ------------ Data path ------------------
#
# Which directory should be used by logstash and its plugins
# for any persistent needs. Defaults to LOGSTASH_HOME/data
#
# path.data:
#
# ------------ Pipeline Settings --------------
#
# The ID of the pipeline.
#
# pipeline.id: main
#
# Set the number of workers that will, in parallel, execute the filters+outputs
# stage of the pipeline.
#
# This defaults to the number of the host's CPU cores.
#
# pipeline.workers: 2
#
# How many events to retrieve from inputs before sending to filters+workers
#
# pipeline.batch.size: 125
#
# How long to wait in milliseconds while polling for the next event
# before dispatching an undersized batch to filters+outputs
#
# pipeline.batch.delay: 50
#
# Force Logstash to exit during shutdown even if there are still inflight
# events in memory. By default, logstash will refuse to quit until all
# received events have been pushed to the outputs.
#
# WARNING: Enabling this can lead to data loss during shutdown
#
# pipeline.unsafe_shutdown: false
#
# Set the pipeline event ordering. Options are "auto" (the default), "true" or "false".
# "auto" automatically enables ordering if the 'pipeline.workers' setting
# is also set to '1', and disables otherwise.
# "true" enforces ordering on the pipeline and prevent logstash from starting
# if there are multiple workers.
# "false" disables any extra processing necessary for preserving ordering.
#
# pipeline.ordered: auto
#
# Sets the pipeline's default value for `ecs_compatibility`, a setting that is
# available to plugins that implement an ECS Compatibility mode for use with
# the Elastic Common Schema.
# Possible values are:
# - disabled
# - v1
# - v8 (default)
# Pipelines defined before Logstash 8 operated without ECS in mind. To ensure a
# migrated pipeline continues to operate as it did before your upgrade, opt-OUT
# of ECS for the individual pipeline in its `pipelines.yml` definition. Setting
# it here will set the default for _all_ pipelines, including new ones.
#
# pipeline.ecs_compatibility: v8
#
# ------------ Pipeline Configuration Settings --------------
#
# Where to fetch the pipeline configuration for the main pipeline
#
# path.config:
#
# Pipeline configuration string for the main pipeline
#
# config.string:
#
# At startup, test if the configuration is valid and exit (dry run)
#
# config.test_and_exit: false
#
# Periodically check if the configuration has changed and reload the pipeline
# This can also be triggered manually through the SIGHUP signal
#
# config.reload.automatic: false
#
# How often to check if the pipeline configuration has changed (in seconds)
# Note that the unit value (s) is required. Values without a qualifier (e.g. 60)
# are treated as nanoseconds.
# Setting the interval this way is not recommended and might change in later versions.
#
# config.reload.interval: 3s
#
# Show fully compiled configuration as debug log message
# NOTE: --log.level must be 'debug'
#
# config.debug: false
#
# When enabled, process escaped characters such as \n and \" in strings in the
# pipeline configuration files.
#
# config.support_escapes: false
#
# ------------ API Settings -------------
# Define settings related to the HTTP API here.
#
# The HTTP API is enabled by default. It can be disabled, but features that rely
# on it will not work as intended.
#
# api.enabled: true
#
# By default, the HTTP API is not secured and is therefore bound to only the
# host's loopback interface, ensuring that it is not accessible to the rest of
# the network.
# When secured with SSL and Basic Auth, the API is bound to _all_ interfaces
# unless configured otherwise.
#
# api.http.host: 127.0.0.1
#
# The HTTP API web server will listen on an available port from the given range.
# Values can be specified as a single port (e.g., `9600`), or an inclusive range
# of ports (e.g., `9600-9700`).
#
# api.http.port: 9600-9700
#
# The HTTP API includes a customizable "environment" value in its response,
# which can be configured here.
#
# api.environment: "production"
#
# The HTTP API can be secured with SSL (TLS). To do so, you will need to provide
# the path to a password-protected keystore in p12 or jks format, along with credentials.
#
# api.ssl.enabled: false
# api.ssl.keystore.path: /path/to/keystore.jks
# api.ssl.keystore.password: "y0uRp4$$w0rD"
#
# The HTTP API can be configured to require authentication. Acceptable values are
# - `none`: no auth is required (default)
# - `basic`: clients must authenticate with HTTP Basic auth, as configured
# with `api.auth.basic.*` options below
# api.auth.type: none
#
# When configured with `api.auth.type` `basic`, you must provide the credentials
# that requests will be validated against. Usage of Environment or Keystore
# variable replacements is encouraged (such as the value `"${HTTP_PASS}"`, which
# resolves to the value stored in the keystore's `HTTP_PASS` variable if present
# or the same variable from the environment)
#
# api.auth.basic.username: "logstash-user"
# api.auth.basic.password: "s3cUreP4$$w0rD"
#
# When setting `api.auth.basic.password`, the password should meet
# the default password policy requirements.
# The default password policy requires non-empty minimum 8 char string that
# includes a digit, upper case letter and lower case letter.
# Policy mode sets Logstash to WARN or ERROR when HTTP authentication password doesn't
# meet the password policy requirements.
# The default is WARN. Setting to ERROR enforces stronger passwords (recommended).
#
# api.auth.basic.password_policy.mode: WARN
#
# ------------ Module Settings ---------------
# Define modules here. Modules definitions must be defined as an array.
# The simple way to see this is to prepend each `name` with a `-`, and keep
# all associated variables under the `name` they are associated with, and
# above the next, like this:
#
# modules:
#   - name: MODULE_NAME
#     var.PLUGINTYPE1.PLUGINNAME1.KEY1: VALUE
#     var.PLUGINTYPE1.PLUGINNAME1.KEY2: VALUE
#     var.PLUGINTYPE2.PLUGINNAME1.KEY1: VALUE
#     var.PLUGINTYPE3.PLUGINNAME3.KEY1: VALUE
#
# Module variable names must be in the format of
#
# var.PLUGIN_TYPE.PLUGIN_NAME.KEY
#
# modules:
#
# ------------ Cloud Settings ---------------
# Define Elastic Cloud settings here.
# Format of cloud.id is a base64 value e.g. dXMtZWFzdC0xLmF3cy5mb3VuZC5pbyRub3RhcmVhbCRpZGVudGlmaWVy
# and it may have an label prefix e.g. staging:dXMtZ...
# This will overwrite 'var.elasticsearch.hosts' and 'var.kibana.host'
# cloud.id: <identifier>
#
# Format of cloud.auth is: <user>:<pass>
# This is optional
# If supplied this will overwrite 'var.elasticsearch.username' and 'var.elasticsearch.password'
# If supplied this will overwrite 'var.kibana.username' and 'var.kibana.password'
# cloud.auth: elastic:<password>
#
# ------------ Queuing Settings --------------
#
# Internal queuing model, "memory" for legacy in-memory based queuing and
# "persisted" for disk-based acked queueing. Defaults is memory
#
# queue.type: memory
#
# If `queue.type: persisted`, the directory path where the pipeline data files will be stored.
# Each pipeline will group its PQ files in a subdirectory matching its `pipeline.id`.
# Default is path.data/queue.
#
# path.queue:
#
# If using queue.type: persisted, the page data files size. The queue data consists of
# append-only data files separated into pages. Default is 64mb
#
# queue.page_capacity: 64mb
#
# If using queue.type: persisted, the maximum number of unread events in the queue.
# Default is 0 (unlimited)
#
# queue.max_events: 0
#
# If using queue.type: persisted, the total capacity of the queue in number of bytes.
# If you would like more unacked events to be buffered in Logstash, you can increase the
# capacity using this setting. Please make sure your disk drive has capacity greater than
# the size specified here. If both max_bytes and max_events are specified, Logstash will pick
# whichever criteria is reached first
# Default is 1024mb or 1gb
#
# queue.max_bytes: 1024mb
#
# If using queue.type: persisted, the maximum number of acked events before forcing a checkpoint
# Default is 1024, 0 for unlimited
#
# queue.checkpoint.acks: 1024
#
# If using queue.type: persisted, the maximum number of written events before forcing a checkpoint
# Default is 1024, 0 for unlimited
#
# queue.checkpoint.writes: 1024
#
# If using queue.type: persisted, the interval in milliseconds when a checkpoint is forced on the head page
# Default is 1000, 0 for no periodic checkpoint.
#
# queue.checkpoint.interval: 1000
#
# ------------ Dead-Letter Queue Settings --------------
# Flag to turn on dead-letter queue.
#
# dead_letter_queue.enable: false
# If using dead_letter_queue.enable: true, the maximum size of each dead letter queue. Entries
# will be dropped if they would increase the size of the dead letter queue beyond this setting.
# Default is 1024mb
# dead_letter_queue.max_bytes: 1024mb
# If using dead_letter_queue.enable: true, the interval in milliseconds where if no further events eligible for the DLQ
# have been created, a dead letter queue file will be written. A low value here will mean that more, smaller, queue files
# may be written, while a larger value will introduce more latency between items being "written" to the dead letter queue, and
# being available to be read by the dead_letter_queue input when items are written infrequently.
# Default is 5000.
#
# dead_letter_queue.flush_interval: 5000
# If using dead_letter_queue.enable: true, controls which entries should be dropped to avoid exceeding the size limit.
# Set the value to `drop_newer` (default) to stop accepting new events that would push the DLQ size over the limit.
# Set the value to `drop_older` to remove queue pages containing the oldest events to make space for new ones.
#
# dead_letter_queue.storage_policy: drop_newer
# If using dead_letter_queue.enable: true, the interval that events have to be considered valid. After the interval has
# expired the events could be automatically deleted from the DLQ.
# The interval could be expressed in days, hours, minutes or seconds, using as postfix notation like 5d,
# to represent a five days interval.
# The available units are respectively d, h, m, s for day, hours, minutes and seconds.
# If not specified then the DLQ doesn't use any age policy for cleaning events.
#
# dead_letter_queue.retain.age: 1d
# If using dead_letter_queue.enable: true, defines the action to take when the dead_letter_queue.max_bytes is reached,
# could be "drop_newer" or "drop_older".
# With drop_newer, messages that were inserted most recently are dropped, logging an error line.
# With drop_older setting, the oldest messages are dropped as new ones are inserted.
# Default value is "drop_newer".
# dead_letter_queue.storage_policy: drop_newer
# If using dead_letter_queue.enable: true, the directory path where the data files will be stored.
# Default is path.data/dead_letter_queue
#
# path.dead_letter_queue:
#
# ------------ Debugging Settings --------------
#
# Options for log.level:
# * fatal
# * error
# * warn
# * info (default)
# * debug
# * trace
#
# log.level: info
# path.logs:
#
# ------------ Other Settings --------------
#
# Allow or block running Logstash as superuser (default: true)
# allow_superuser: false
#
# Where to find custom plugins
# path.plugins: []
#
# Flag to output log lines of each pipeline in its separate log file. Each log filename contains the pipeline.name
# Default is false
# pipeline.separate_logs: false
#
# ------------ X-Pack Settings (not applicable for OSS build)--------------
#
# X-Pack Monitoring
# https://www.elastic.co/guide/en/logstash/current/monitoring-logstash.html
#xpack.monitoring.enabled: false
#xpack.monitoring.elasticsearch.username: logstash_system
#xpack.monitoring.elasticsearch.password: password
#xpack.monitoring.elasticsearch.proxy: ["http://proxy:port"]
#xpack.monitoring.elasticsearch.hosts: ["https://es1:9200", "https://es2:9200"]
# an alternative to hosts + username/password settings is to use cloud_id/cloud_auth
#xpack.monitoring.elasticsearch.cloud_id: monitoring_cluster_id:xxxxxxxxxx
#xpack.monitoring.elasticsearch.cloud_auth: logstash_system:password
# another authentication alternative is to use an Elasticsearch API key
#xpack.monitoring.elasticsearch.api_key: "id:api_key"
#xpack.monitoring.elasticsearch.ssl.certificate_authority: "/path/to/ca.crt"
#xpack.monitoring.elasticsearch.ssl.ca_trusted_fingerprint: xxxxxxxxxx
#xpack.monitoring.elasticsearch.ssl.truststore.path: path/to/file
#xpack.monitoring.elasticsearch.ssl.truststore.password: password
#xpack.monitoring.elasticsearch.ssl.keystore.path: /path/to/file
#xpack.monitoring.elasticsearch.ssl.keystore.password: password
#xpack.monitoring.elasticsearch.ssl.verification_mode: certificate
#xpack.monitoring.elasticsearch.sniffing: false
#xpack.monitoring.collection.interval: 10s
#xpack.monitoring.collection.pipeline.details.enabled: true
#
# X-Pack Management
# https://www.elastic.co/guide/en/logstash/current/logstash-centralized-pipeline-management.html
#xpack.management.enabled: false
#xpack.management.pipeline.id: ["main", "apache_logs"]
#xpack.management.elasticsearch.username: logstash_admin_user
#xpack.management.elasticsearch.password: password
#xpack.management.elasticsearch.proxy: ["http://proxy:port"]
#xpack.management.elasticsearch.hosts: ["https://es1:9200", "https://es2:9200"]
# an alternative to hosts + username/password settings is to use cloud_id/cloud_auth
#xpack.management.elasticsearch.cloud_id: management_cluster_id:xxxxxxxxxx
#xpack.management.elasticsearch.cloud_auth: logstash_admin_user:password
# another authentication alternative is to use an Elasticsearch API key
#xpack.management.elasticsearch.api_key: "id:api_key"
#xpack.management.elasticsearch.ssl.ca_trusted_fingerprint: xxxxxxxxxx
#xpack.management.elasticsearch.ssl.certificate_authority: "/path/to/ca.crt"
#xpack.management.elasticsearch.ssl.truststore.path: /path/to/file
#xpack.management.elasticsearch.ssl.truststore.password: password
#xpack.management.elasticsearch.ssl.keystore.path: /path/to/file
#xpack.management.elasticsearch.ssl.keystore.password: password
#xpack.management.elasticsearch.ssl.verification_mode: certificate
#xpack.management.elasticsearch.sniffing: false
#xpack.management.logstash.poll_interval: 5s
# X-Pack GeoIP plugin
# https://www.elastic.co/guide/en/logstash/current/plugins-filters-geoip.html#plugins-filters-geoip-manage_update
#xpack.geoip.download.endpoint: "https://geoip.elastic.co/v1/database"
Now I don't know how to save the queue messages that Logstash reads from RabbitMQ into Elasticsearch and view them as a chart in Kibana.
I wonder how I can use the MIST library to de-identify a text, e.g., transforming
Patient ID: P89474
Mary Phillips is a 45-year-old woman with a history of diabetes.
She arrived at New Hope Medical Center on August 5 complaining
of abdominal pain. Dr. Gertrude Philippoussis diagnosed her
with appendicitis and admitted her at 10 PM.
to
Patient ID: [ID]
[NAME] is a [AGE]-year-old woman with a history of diabetes.
She arrived at [HOSPITAL] on [DATE] complaining
of abdominal pain. Dr. [PHYSICIAN] diagnosed her
with appendicitis and admitted her at 10 PM.
I've wandered through the documentation but no luck so far.
This answer was tested on Windows 7 SP1 x64 Ultimate with Anaconda Python 2.7.11 x64, and MIST 2.0.4. MIST 2.0.4 does not work with Python 3.x (according to the manual, I haven't tested it myself).
MIST (MITRE Identification Scrubber Toolkit) [1] is a customization of MAT (MITRE Annotation Toolkit), which is a tool for tagging documents either automatically or manually (for the latter it provides a GUI via a web server). The automatic tagger is based on Carafe (ConditionAl RAndom Fields) [2], which is an OCaml implementation of conditional random fields (CRF).
MIST does not come with any trained model, and it has only ~10 short, non-medical documents annotated with typical NER classes (like organization and person).
De-id (de-identification) is the process of tagging PHI (Protected Health Information) in a document and replacing it with fake data. Let's ignore PHI replacement for now and focus on tagging. To tag a document (e.g., a patient note), MAT follows a typical machine learning scheme: the CRF is first trained on a labeled dataset (= a set of labeled documents), and the trained model is then used to tag unlabeled documents.
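For reference, the replacement step that is set aside here is mechanically simple once the tagger has produced character spans. Below is a minimal standalone sketch (plain Python, not MIST code), assuming hypothetical `(start, end, label)` tuples as the tagger result; the names `redact` and `span_of` are made up for illustration.

```python
def redact(text, spans):
    """Replace each (start, end, label) character span with "[LABEL]".

    `spans` is a hypothetical list of non-overlapping tagger results;
    it is not MIST's own data structure.
    """
    out = []
    cursor = 0
    for start, end, label in sorted(spans):
        if start < cursor:
            raise ValueError("overlapping spans")
        out.append(text[cursor:start])  # untouched text before the span
        out.append("[" + label + "]")   # placeholder instead of the PHI
        cursor = end
    out.append(text[cursor:])           # tail after the last span
    return "".join(out)


def span_of(text, phrase, label):
    """Helper for the demo: locate the first occurrence of `phrase`."""
    start = text.index(phrase)
    return (start, start + len(phrase), label)


note = "Patient ID: P89474\nMary Phillips is a 45-year-old woman."
spans = [
    span_of(note, "P89474", "ID"),
    span_of(note, "Mary Phillips", "NAME"),
    span_of(note, "45", "AGE"),
]
redacted = redact(note, spans)
print(redacted)
```

Processing the spans in sorted order and copying the text between them avoids the offset shifts that in-place replacement would cause.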
The main technical concept in MAT is tasks. A task is a set of activities, called workflows, which can be broken down into steps. Named-entity recognition (NER) is one task. De-id is another task (mostly, NER geared toward medical texts): in other words, MIST is just one task of MAT (actually 3: core, HIPAA, and AMIA. Core is a parent task, while HIPAA and AMIA are two different tagsets). Steps are for example tokenization, tagging, or cleaning. Workflows are just lists of steps that one may follow.
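The task / workflow / step vocabulary can be pictured as a nested mapping. This is only a toy model for orientation, not how MAT stores tasks internally; the task, workflow, and step names are the ones passed on the command lines in this answer, and `steps_for` is a hypothetical helper.

```python
# Toy model: task name -> workflow name -> ordered list of steps.
# Only the steps actually used in this answer are listed; the real
# workflows may define more.
TASKS = {
    "Named Entity": {"Demo": ["zone", "tokenize", "tag"]},
    "HIPAA Deidentification": {"Demo": ["clean", "zone", "tag"]},
}


def steps_for(task, workflow):
    """Return a copy of the step list for a given task and workflow."""
    return list(TASKS[task][workflow])


print(steps_for("Named Entity", "Demo"))
```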
With this in mind, here is the code for Microsoft Windows:
rem #######
rem Instructions for Windows 7 SP1 x64 Ultimate
rem Installing MIST: set MAT_PKG_HOME depending on where you downloaded it
SET MAT_PKG_HOME=C:\Users\Francky\Downloads\MIST_2_0_4\MIST_2_0_4\src\MAT
SET TMP=C:\Users\Francky\Downloads\MIST_2_0_4\MIST_2_0_4\temp
cd C:\Users\Francky\Downloads\MIST_2_0_4\MIST_2_0_4
python install.py
rem MAT is now installed. We'll show how to use it for NER.
rem We will be taking snippets from some of the 8 tutorials.
rem A lot of the tutorial content is about the annotation GUI,
rem which we don't care about here.
rem Tuto 1: install task
cd %MAT_PKG_HOME%
bin\MATManagePluginDirs.cmd install %CD%\sample\ne
rem Tuto 2: build model (i.e., train it on labeled dataset)
bin\MATModelBuilder.cmd --task "Named Entity" --model_file %TMP%\ne_model ^
  --input_files "%CD%\sample\ne\resources\data\json\*.json"
rem Tuto 2: Add trained model as the default model
bin\MATModelBuilder.cmd --task "Named Entity" --save_as_default_model ^
  --input_files "%CD%\sample\ne\resources\data\json\*.json"
rem Tuto 5: use CLI -> prepare the document
bin\MATEngine.cmd --task "Named Entity" --workflow Demo --steps "zone,tokenize" ^
  --input_file %CD%\sample\ne\resources\data\raw\voa2.txt --input_file_type raw ^
  --output_file %CD%\voa2_txt.json --output_file_type mat-json
rem Tuto 5: use CLI -> tag the document
bin\MATEngine.cmd --task "Named Entity" --workflow Demo --steps "tag" ^
  --input_file %CD%\voa2_txt.json --input_file_type mat-json ^
  --output_file %CD%\voa2_txt.json --output_file_type mat-json ^
  --tagger_local
NER is now done.
Here are the same instructions for Ubuntu 14.04.4 LTS x64:
#######
# Instructions for Ubuntu 14.04.4 LTS x64
# Installing MIST: set MAT_PKG_HOME depending on where you downloaded it
export MAT_PKG_HOME=/home/ubuntu/mist/MIST_2_0_4/MIST_2_0_4/src/MAT
export TMP=/home/ubuntu/mist/MIST_2_0_4/MIST_2_0_4/temp
mkdir -p "$TMP"
cd /home/ubuntu/mist/MIST_2_0_4/MIST_2_0_4/
python install.py
# MAT is now installed. We'll show how to use it for NER.
# We will be taking snippets from some of the 8 tutorials.
# A lot of the tutorial content is about the annotation GUI,
# which we don't care about here.
# Tuto 1: install task
cd $MAT_PKG_HOME
bin/MATManagePluginDirs install $PWD/sample/ne
# Tuto 2: build model (i.e., train it on labeled dataset)
bin/MATModelBuilder --task "Named Entity" --model_file $TMP/ne_model \
--input_files "$PWD/sample/ne/resources/data/json/*.json"
# Tuto 2: Add trained model as the default model
bin/MATModelBuilder --task "Named Entity" --save_as_default_model \
--input_files "$PWD/sample/ne/resources/data/json/*.json"
# Tuto 5: use CLI -> prepare the document
bin/MATEngine --task "Named Entity" --workflow Demo --steps "zone,tokenize" \
--input_file $PWD/sample/ne/resources/data/raw/voa2.txt --input_file_type raw \
--output_file $PWD/voa2_txt.json --output_file_type mat-json
# Tuto 5: use CLI -> tag the document
bin/MATEngine --task "Named Entity" --workflow Demo --steps "tag" \
--input_file $PWD/voa2_txt.json --input_file_type mat-json \
--output_file $PWD/voa2_txt.json --output_file_type mat-json \
--tagger_local
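Once the tagger has run, the annotations can be read back from the output file. The sketch below is standalone Python, not part of MIST, and it rests on an assumption about the MAT-JSON layout: a top-level `signal` string holding the document text, and an `asets` list whose entries have a `type` and an `annots` list of `[start, end, ...]` offsets. Check this against one of your own output files before relying on it; `list_annotations` is a hypothetical helper name.

```python
import json


def list_annotations(doc):
    """Yield (label, start, end, text) from a parsed MAT-JSON-like dict.

    Assumed layout (verify against your own files):
    {"signal": "<document text>",
     "asets": [{"type": "PERSON", "annots": [[start, end, ...], ...]}]}
    """
    signal = doc["signal"]
    for aset in doc.get("asets", []):
        for annot in aset.get("annots", []):
            start, end = annot[0], annot[1]
            yield aset["type"], start, end, signal[start:end]


# Tiny hand-written stand-in for a tagged file such as voa2_txt.json.
sample = json.loads(
    '{"signal": "Mary went to Paris.", "asets": ['
    '{"type": "PERSON", "annots": [[0, 4]]},'
    '{"type": "LOCATION", "annots": [[13, 18]]}]}'
)
rows = list(list_annotations(sample))
for row in rows:
    print(row)
```

For a real file, `json.load(open(path))` would take the place of the hand-written `sample`.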
To run de-id, there is no need to install the de-id tasks, as they are pre-installed. There are 2 de-id tasks (\MIST_2_0_4\src\tasks\HIPAA\task.xml and \MIST_2_0_4\src\tasks\AMIA\task.xml). They come with neither a trained model nor a labeled dataset, so you may want to get some data (see "Physician notes with annotated PHI").
For Microsoft Windows (tested with Windows 7 SP1 x64 Ultimate):
To train the model (you can replace HIPAA Deidentification with AMIA Deidentification depending on the tagset you wish to use):
bin\MATModelBuilder.cmd --task "HIPAA Deidentification" ^
--save_as_default_model --nthreads=3 --max_iterations=15 ^
--lexicon_dir="%CD%\sample\mist\gazetteers" ^
--input_files "%CD%\sample\mist\i2b2-60-00-40\train\*.json"
To run the trained model on one file:
bin\MATEngine.cmd --task "HIPAA Deidentification" --workflow Demo ^
--input_file .\note.txt --input_file_type raw ^
--output_file .\note.json --output_file_type mat-json ^
--tagger_local ^
--steps "clean,zone,tag"
To run the trained model on one directory:
bin\MATEngine.cmd --task "HIPAA Deidentification" --workflow Demo ^
--input_dir "%CD%\sample\test" --input_file_type raw ^
--output_dir "%CD%\sample\test" --output_file_type mat-json ^
--tagger_local ^
--steps "clean,zone,tag"
As usual, one can specify the input file format to be JSON:
bin\MATEngine.cmd --task "HIPAA Deidentification" --workflow Demo ^
--input_dir "%CD%\sample\mist\i2b2-60-00-40\test" --input_file_type mat-json ^
--output_dir "%CD%\sample\mist\i2b2-60-00-40\test_out" --output_file_type mat-json ^
--tagger_local --steps "tag"
For Ubuntu 14.04.4 LTS x64:
To train the model (you can replace HIPAA Deidentification with AMIA Deidentification depending on the tagset you wish to use):
bin/MATModelBuilder --task "HIPAA Deidentification" \
--save_as_default_model --nthreads=20 --max_iterations=15 \
--lexicon_dir="$PWD/sample/mist/gazetteers" \
--input_files "$PWD/sample/mist/i2b2-60-00-40/train/*.json"
To run the trained model on one file:
bin/MATEngine --task "HIPAA Deidentification" --workflow Demo \
--input_file ./note.txt --input_file_type raw \
--output_file ./note.json --output_file_type mat-json \
--tagger_local \
--steps "clean,zone,tag"
To run the trained model on one directory:
bin/MATEngine --task "HIPAA Deidentification" --workflow Demo \
--input_dir "$PWD/sample/test" --input_file_type raw \
--output_dir "$PWD/sample/test" --output_file_type mat-json \
--tagger_local \
--steps "clean,zone,tag"
The input files can also be in MAT JSON format instead of raw text (this example runs only the tag step):
bin/MATEngine --task "HIPAA Deidentification" --workflow Demo \
--input_dir "$PWD/sample/mist/i2b2-60-00-40/test" --input_file_type mat-json \
--output_dir "$PWD/sample/mist/i2b2-60-00-40/test_out" --output_file_type mat-json \
--tagger_local --steps "tag"
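The MATEngine commands above differ only in whether they take a file or a directory, in the input file type, and in the steps. If you call MATEngine from a script, a small helper that assembles the argument list may be convenient. This is only a sketch: build_engine_argv is a hypothetical helper, it uses nothing beyond the flags already shown above, and the launcher path is passed in because it differs between platforms.

```python
import shlex


def build_engine_argv(executable, task, source, target, is_dir=False,
                      input_type="raw", steps="clean,zone,tag"):
    """Assemble a MATEngine argument list like the ones shown above."""
    suffix = "dir" if is_dir else "file"
    return [
        executable,
        "--task", task,
        "--workflow", "Demo",
        "--input_" + suffix, source,
        "--input_file_type", input_type,
        "--output_" + suffix, target,
        "--output_file_type", "mat-json",
        "--tagger_local",
        "--steps", steps,
    ]


argv = build_engine_argv("bin/MATEngine", "HIPAA Deidentification",
                         "./note.txt", "./note.json")
# Print the command in a copy-pasteable, shell-quoted form.
print(shlex.join(argv))
# To actually run it from the MIST root directory:
#   import subprocess; subprocess.run(argv, check=True)
```

Passing is_dir=True switches to --input_dir and --output_dir, and input_type="mat-json" with steps="tag" reproduces the JSON variant.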
Typical error messages:
raise PluginError, "Carafe not configured properly for this task and workflow: " + str(e) (when trying to tag a document): this usually means that no model was specified. You need to either define a default model (for example by training with --save_as_default_model) or pass --tagger_model /path/to/model/.
Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit exceeded (when training a model): it is easy to exceed the Java heap size limit (the default is 2GB). You can raise it with the --heap_size parameter. Example (Linux):
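As an illustration of the second option, the sketch below builds a matching pair of commands: one that trains to an explicit --model_file and one that tags with --tagger_model pointing at the same path. explicit_model_commands and the path models/hipaa.model are hypothetical, and I am assuming the file written through --model_file is what --tagger_model expects.

```python
def explicit_model_commands(task, train_glob, model_path, note_in, note_out):
    """Return (train_argv, tag_argv) that share one explicit model path."""
    train_argv = [
        "bin/MATModelBuilder", "--task", task,
        "--model_file", model_path,   # save the model here, not as the default
        "--input_files", train_glob,
    ]
    tag_argv = [
        "bin/MATEngine", "--task", task, "--workflow", "Demo",
        "--input_file", note_in, "--input_file_type", "raw",
        "--output_file", note_out, "--output_file_type", "mat-json",
        "--tagger_local",
        "--tagger_model", model_path,  # reuse the model trained above
        "--steps", "clean,zone,tag",
    ]
    return train_argv, tag_argv


train_argv, tag_argv = explicit_model_commands(
    "HIPAA Deidentification",
    "sample/mist/i2b2-60-00-40/train/*.json",
    "models/hipaa.model",              # hypothetical location
    "./note.txt", "./note.json",
)
print(" ".join(train_argv))
print(" ".join(tag_argv))
```

Note that the directory for --model_file must already exist, according to the MATModelBuilder help.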
bin/MATModelBuilder --task "HIPAA Deidentification" \
--save_as_default_model --nthreads=20 --max_iterations=15 \
--lexicon_dir="$PWD/sample/mist/gazetteers" \
--heap_size=60G \
--input_files "$PWD/sample/mist/mimic-140-20-40/train/*.json"
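If you would rather derive the heap size from the memory of the machine than hard-code it, something like the following may help. It is a sketch under assumptions: heap_size_flag is a hypothetical helper, the 0.75 fraction is an arbitrary choice of mine, and the G/M suffixes rely on --heap_size being passed through as the JVM -Xmx value, as the help text states.

```python
def heap_size_flag(total_bytes, fraction=0.75):
    """Format a --heap_size flag that uses a fraction of the given memory."""
    budget = int(total_bytes * fraction)
    gib = budget // 1024 ** 3
    if gib >= 1:
        # Whole gibibytes, rounded down.
        return "--heap_size=%dG" % gib
    # Below 1 GiB, fall back to mebibytes (at least 1M).
    return "--heap_size=%dM" % max(budget // 1024 ** 2, 1)


# Example for a machine with 64 GiB of RAM.
flag = heap_size_flag(64 * 1024 ** 3)
print(flag)
```

On Linux, the total memory could be obtained with os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES").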
[1] John Aberdeen, Samuel Bayer, Reyyan Yeniterzi, Ben Wellner, Cheryl Clark, David Hanauer, Bradley Malin, Lynette Hirschman, The MITRE identification scrubber toolkit: design, training, and assessment, Int. J. Med. Informatics 79 (12) (2010) 849–859, http://dx.doi.org/10.1016/j.ijmedinf.2010.09.007.
[2] B. Wellner, Sequence Models and Ranking Methods for Discourse Parsing, Ph.D. dissertation, Brandeis University, Waltham, MA, 2009, http://www.cs.brandeis.edu/~wellner/pubs/wellner_dissertation.pdf.
Documentation for MATModelBuilder.cmd:
Usage: MATModelBuilder.cmd [task option] [config name option] [other options]
Options:
-h, --help show this help message and exit
Task option:
--task=task name of the task to use. Must be the first argument,
if present. Obligatory if the system knows of more
than one task. Known tasks are: AMIA Deidentification,
Named Entity, HIPAA Deidentification, Enhanced Named
Entity
Config name option:
--config_name=name name of the model build config to use. Must be the
first argument after --task, if present. Optional.
Default model build config will be used if no config
is specified.
Control options:
--version Print version number and exit
--debug Enable debug output.
--subprocess_debug=int
Set the subprocess debug level to the value provided,
overriding the global setting. 0 disables, 1 shows
some subprocess activity, 2 shows all subprocess
activity.
--subprocess_statistics
Enable subprocess statistics (memory/time), if the
capability is available and it isn't globally enabled.
--tmpdir_root=dir Override the default system location for temporary
files. If the directory doesn't exist, it will be
created. Use this feature to control where temporary
files are created, for added security, or in
conjunction with --preserve_tempfiles, as a debugging
aid.
--preserve_tempfiles
Preserve the temporary files created, as a debugging
aid.
--verbose_config If specified, print to stderr the source of each MAT
configuration variable the first time it's accessed.
Options for model class creation:
--partial_training_on_gold_only
When the trainer is presented with partially tagged
documents, by default MAT will ask it to train on all
annotated segments, completed or not. If this flag is
specified, only completed segments should be used for
training.
--feature_spec=FEATURE_SPEC
path to the Carafe feature spec file to use. Optional
if feature_spec is set in the <build_settings> for the
relevant model config in the task.xml file for the
task.
--training_method=TRAINING_METHOD
If present, specify a training method other than the
standard method. Currently, the only recognized value
is psa. The psa method is noticeably faster, but may
result in somewhat poorer results. You can use a value
of '' to override a previously specified training
method (e.g., a default method in your task).
--max_iterations=MAX_ITERATIONS
number of iterations for the optimized PSA training
mechanism to use. A value between 6 and 10 is
appropriate. Overrides any possible default in
<build_settings> for the relevant model config in the
task.xml file for the task.
--lexicon_dir=LEXICON_DIR
If present, the name of a directory which contains a
Carafe training lexicon. This pathname should be an
absolute pathname, and should have a trailing slash.
The content of the directory should be a set of files,
each of which contains a sequence of tokens, one per
line. The name of the file will be used as a training
feature for the token. Overrides any possible default
in <build_settings> for the relevant model config in
the task.xml file for the task.
--parallel If present, parallelizes the feature expectation
computation, which reduces the clock time of model
building when multiple CPUs are available
--nthreads=NTHREADS
If --parallel is used, controls the number of threads
used for training.
--gaussian_prior=GAUSSIAN_PRIOR
A positive float, default is 10.0. See the jCarafe
docs for details.
--no_begin Don't introduce begin states during training. Useful
if you're certain that you won't have any adjacent
spans with the same label. See the jCarafe
documentation for more details.
--l1 Use L1 regularization for PSA training. See the
jCarafe docs for details.
--l1_c=L1_C Change the penalty factor for the L1 regularizer. See
the jCarafe docs for details.
--heap_size=HEAP_SIZE
If present, specifies the -Xmx argument for the Java
JVM
--stack_size=STACK_SIZE
If present, specifies the -Xss argument for the Java
JVM
--tags=TAGS if present, a comma-separated list of tags to pass to
the training engine instead of the full tag set for
the task (used to create per-tag pre-tagging models
for multi-stage training and tagging)
--pre_models=PRE_MODELS
if present, a comma-separated list of glob-style
patterns specifying the models to include as pre-
taggers.
--add_tokens_internally
If present, Carafe will use its internal tokenizer to
tokenize the document before training. If your
workflow doesn't tokenize the document, you must
provide this flag, or Carafe will have no tokens to
base its training on. We recommend strongly that you
tokenize your documents separately; you should not use
this flag.
--word_properties=WORD_PROPERTIES
See the jCarafe docs for --word-properties.
--word_scores=WORD_SCORES
See the jCarafe docs for --word-scores.
--learning_rate=LEARNING_RATE
See the jCarafe docs for --learning-rate.
--disk_cache=DISK_CACHE
See the jCarafe docs for --disk_cache.
Input options:
--input_dir=dir A directory, all of whose files will be used in the
model construction. Can be repeated. May be specified
with --input_files.
--input_files=re A glob-style pattern describing full pathnames to use
in the model construction. May be specified with
--input_dir. Can be repeated.
--file_type=fake-xml-inline | mat-json | xml-inline
The file type of the input. One of fake-xml-inline,
mat-json, xml-inline. Default is mat-json.
--encoding=encoding
The encoding of the input. The default is the
appropriate default for the file type.
Output options:
--model_file=file Location to save the created model. The directory must
already exist. Obligatory if --save_as_default_model
isn't specified.
--save_as_default_model
If the the task.xml file for the task specifies the
<default_model> element, save the model in the
specified location, possibly overriding any existing
model.
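The --lexicon_dir help text above describes the expected layout: a directory of files, one token per line, where each file name becomes a training feature. Here is a minimal sketch that writes such a directory from a dictionary. write_lexicon, the feature names and the tokens are made up for illustration, and the UTF-8 encoding is my assumption since the help does not say which encoding is expected.

```python
import os
import tempfile


def write_lexicon(directory, lexicon):
    """Write one file per feature name, with one token per line."""
    os.makedirs(directory, exist_ok=True)
    for feature_name, tokens in lexicon.items():
        path = os.path.join(directory, feature_name)
        with open(path, "w", encoding="utf-8") as handle:
            handle.write("\n".join(tokens) + "\n")
    # The help asks for an absolute pathname with a trailing slash;
    # joining with "" appends the platform's path separator.
    return os.path.join(os.path.abspath(directory), "")


lexicon_dir = write_lexicon(
    os.path.join(tempfile.mkdtemp(), "gazetteers"),
    {"first_names": ["Alice", "Bob"], "cities": ["Boston", "Waltham"]},
)
print(lexicon_dir)
```

The returned string can then be passed as --lexicon_dir when training.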
Documentation for MATEngine:
Usage: MATEngine [core options] [input/output/task options] [other options]
Options:
-h, --help show this help message and exit
Core options:
--other_app_dir=dir
additional directory to load a task from. Optional and
repeatable.
--settings_file=file
a file of settings to use which overwrites existing
settings. The file should be a Python config file in
the style of the template in
etc/MAT_settings.config.in. Optional.
--task=task name of the task to use. Obligatory if the system
knows of more than one task. Known tasks are: AMIA
Deidentification, Named Entity, HIPAA
Deidentification, Enhanced Named Entity
--version Print version number and exit
--debug Enable debug output.
--subprocess_debug=int
Set the subprocess debug level to the value provided,
overriding the global setting. 0 disables, 1 shows
some subprocess activity, 2 shows all subprocess
activity.
--subprocess_statistics
Enable subprocess statistics (memory/time), if the
capability is available and it isn't globally enabled.
--tmpdir_root=dir Override the default system location for temporary
files. If the directory doesn't exist, it will be
created. Use this feature to control where temporary
files are created, for added security, or in
conjunction with --preserve_tempfiles, as a debugging
aid.
--preserve_tempfiles
Preserve the temporary files created, as a debugging
aid.
--verbose_config If specified, print to stderr the source of each MAT
configuration variable the first time it's accessed.