capture journald properties with rsyslog - linux

I am struggling to capture systemd-journald properties in rsyslog log files.
My setup
ubuntu inside docker on arm (raspberrypi): FROM arm64v8/ubuntu:20.04
docker command (all subsequent actions taken inside running docker container)
$ docker run --privileged -ti --cap-add SYS_ADMIN --security-opt seccomp=unconfined --cgroup-parent=docker.slice --cgroupns private --tmpfs /tmp --tmpfs /run --tmpfs /run/lock systemd:origin
rsyslog, as reported by $ systemctl status rsyslog:
● rsyslog.service - System Logging Service
Loaded: loaded (/lib/systemd/system/rsyslog.service; enabled; vendor prese>
Active: active (running)
...
[origin software="rsyslogd" swVersion="8.2001.0" x-pid="39758" x-info="https://www.rsyslog.com"] start
...
My plan
A small C program writes some information to the journal:
#include <systemd/sd-journal.h>
#include <stdio.h>
#include <unistd.h>
int main(int argc, char** argv) {
    char buffer[50];
    /* SYSLOG_PID is sent as a string, so format the PID first */
    snprintf(buffer, sizeof buffer, "%lu", (unsigned long)getpid());
    printf("writing to journal\n");
    /* simple entry with only a priority and a message */
    sd_journal_print(LOG_WARNING, "%s", "a little journal test message");
    /* structured entry with standard and custom fields (MEAS_VAL) */
    sd_journal_send("MESSAGE=%s", "there shoud be a text",
                    "SYSLOG_PID=%s", buffer,
                    "PRIORITY=%i", LOG_ERR,
                    "DOCUMENTATION=%s", "any doc link",
                    "MESSAGE_ID=%s", "e5e4132e441541f89bca0cc3e7be3381",
                    "MEAS_VAL=%d", 1394,
                    NULL);
    return 0;
}
Compile it: $ gcc joutest.c -lsystemd -o jt
Execute it: $ ./jt
This shows up in the journal as follows ($ journalctl -r -o json-pretty):
{
"_GID" : "0",
"MESSAGE" : "there shoud be a text",
"_HOSTNAME" : "f1aad951c039",
"SYSLOG_IDENTIFIER" : "jt",
"_TRANSPORT" : "journal",
"CODE_FILE" : "joutest.c",
"DOCUMENTATION" : "any doc link",
"_BOOT_ID" : "06a36b314cee462591c65a2703c8b2ad",
"CODE_LINE" : "14",
"MESSAGE_ID" : "e5e4132e441541f89bca0cc3e7be3381",
"_CAP_EFFECTIVE" : "3fffffffff",
"__REALTIME_TIMESTAMP" : "1669373862349599",
"_SYSTEMD_UNIT" : "init.scope",
"CODE_FUNC" : "main",
"_MACHINE_ID" : "5aba31746bf244bba6081297fe061445",
"SYSLOG_PID" : "39740",
"PRIORITY" : "3",
"_COMM" : "jt",
"_SYSTEMD_SLICE" : "-.slice",
"MEAS_VAL" : "1394",
"__MONOTONIC_TIMESTAMP" : "390853282189",
"_PID" : "39740",
"_SOURCE_REALTIME_TIMESTAMP" : "1669373862336503",
"_UID" : "0",
"_SYSTEMD_CGROUP" : "/init.scope",
"__CURSOR" : "s=63a46a30bbbb4b8c9288a9b12c622b37;i=6cb;b=06a36b314cee46>
}
As a test, I now want to extract all properties of that journal entry via rsyslog. In rsyslog jargon, a property is in principle the name of a key in the JSON-formatted entry. If a property (key name) matches, the whole item (key and value) should be captured.
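The selection described above can be sketched outside rsyslog, which may help when checking what the journal actually holds. This is only a minimal Python sketch, assuming the entries are available one JSON object per line (as `journalctl -o json` prints them). The function name `select_fields` and the sample data are illustrative and not part of rsyslog or systemd.

```python
import json


def select_fields(lines, identifier, priority, wanted):
    """Yield the wanted key/value pairs from matching journal entries.

    lines: iterable of JSON strings, one journal entry per line.
    identifier/priority: values compared with SYSLOG_IDENTIFIER and PRIORITY.
    wanted: field names to capture when they are present in the entry.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue
        entry = json.loads(line)
        if entry.get("SYSLOG_IDENTIFIER") != identifier:
            continue
        # In the JSON export shown above, PRIORITY is a string such as "3".
        if entry.get("PRIORITY") != str(priority):
            continue
        yield {key: entry[key] for key in wanted if key in entry}


# Hypothetical sample data modelled on the two entries written by ./jt.
sample = [
    '{"SYSLOG_IDENTIFIER": "jt", "PRIORITY": "4", "MESSAGE": "a little journal test message"}',
    '{"SYSLOG_IDENTIFIER": "jt", "PRIORITY": "3", "MESSAGE": "there shoud be a text", "MEAS_VAL": "1394"}',
]
captured = list(select_fields(sample, "jt", 3, ["MESSAGE", "MEAS_VAL"]))
print(captured)
```

Only the second sample entry passes both conditions, so the custom MEAS_VAL field is captured together with its value, which is the behaviour wanted from the rsyslog configuration.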
To start with this, I've configured rsyslog as:
module(load="imjournal")
module(load="mmjsonparse")
action(type="mmjsonparse")
if $programname == 'jt' and $syslogseverity == 3 then
action(type="omfile" file="/var/log/jt_err.log" template="RSYSLOG_DebugFormat")
This config is located in /etc/rsyslog.d/filter.conf and gets automatically included by /etc/rsyslog.conf:
# /etc/rsyslog.conf configuration file for rsyslog
#
# For more information install rsyslog-doc and see
# /usr/share/doc/rsyslog-doc/html/configuration/index.html
#
# Default logging rules can be found in /etc/rsyslog.d/50-default.conf
#################
#### MODULES ####
#################
#module(load="imuxsock") # provides support for local system logging
#module(load="immark") # provides --MARK-- message capability
# provides UDP syslog reception
#module(load="imudp")
#input(type="imudp" port="514")
# provides TCP syslog reception
#module(load="imtcp")
#input(type="imtcp" port="514")
# provides kernel logging support and enable non-kernel klog messages
module(load="imklog" permitnonkernelfacility="on")
###########################
#### GLOBAL DIRECTIVES ####
###########################
#
# Use traditional timestamp format.
# To enable high precision timestamps, comment out the following line.
#
$ActionFileDefaultTemplate RSYSLOG_TraditionalFileFormat
# Filter duplicated messages
$RepeatedMsgReduction on
#
# Set the default permissions for all log files.
#
$FileOwner syslog
$FileGroup adm
$FileCreateMode 0640
$DirCreateMode 0755
$Umask 0022
$PrivDropToUser syslog
$PrivDropToGroup syslog
#
# Where to place spool and state files
#
$WorkDirectory /var/spool/rsyslog
#
# Include all config files in /etc/rsyslog.d/
#
$IncludeConfig /etc/rsyslog.d/*.conf
Applied this config: $ systemctl restart rsyslog
Which results in the following: $ cat /var/log/jt_err.log
Debug line with all properties:
FROMHOST: 'f1aad951c039', fromhost-ip: '127.0.0.1', HOSTNAME:
'f1aad951c039', PRI: 11,
syslogtag 'jt[39765]:', programname: 'jt', APP-NAME: 'jt', PROCID:
'39765', MSGID: '-',
TIMESTAMP: 'Nov 25 11:47:50', STRUCTURED-DATA: '-',
msg: ' there shoud be a text'
escaped msg: ' there shoud be a text'
inputname: imuxsock rawmsg: '<11>Nov 25 11:47:50 jt[39765]: there
shoud be a text'
$!:{ "msg": "there shoud be a text" }
$.:
$/:
My problem
Looking at the resulting rsyslog output, most, if not all, of the items from the journal entry are missing.
No property (key) is matched at all. Shouldn't all properties be present, given that this is debug output?
In particular, my custom property MEAS_VAL is not there.
The only property that appears is "msg", and it is questionable whether that comes from the journal at all, since the journal field holding the text "there shoud be a text" is named MESSAGE.
So it seems I am not hitting the journal capturing mechanism at all. Why?
Can we be sure that imjournal gets loaded properly?
I would say yes, because of the startup messages logged by rsyslogd:
Nov 28 16:27:38 f1aad951c039 rsyslogd[144703]: imjournal: Journal indicates no msgs when positioned at head. [v8.2212.0.master try https://www.rsyslog.com/e/0 ]
Nov 28 16:27:38 f1aad951c039 rsyslogd[144703]: imjournal: journal files changed, reloading... [v8.2212.0.master try https://www.rsyslog.com/e/0 ]
Nov 28 16:27:38 f1aad951c039 rsyslogd[144703]: imjournal: Journal indicates no msgs when positioned at head. [v8.2212.0.master try https://www.rsyslog.com/e/0 ]
Edit 2022-11-29
Meanwhile I've compiled my own version 8.2212.0.master. But the phenomenon persists.

You're missing most items originating from the journal because neither RSYSLOG_DebugFormat nor RSYSLOG_TraditionalFileFormat contains the needed properties (see "Reserved template names" in the rsyslog documentation). RSYSLOG_DebugFormat does, however, include at least some fields, e.g. procid, msgid and structured-data, as can be seen in the output you've provided.
This means that if you want to include all the fields, you'll have to create your own template.
The journal fields are stored as key-value pairs. The imjournal module parses these pairs and exposes them as message properties in the $! JSON tree,
so the fields of a log message can be accessed as if they were fields of a JSON object; the whole tree is available as $!all-json.
# load imjournal module
module(load="imjournal")
# specify journal as input source
input(type="imjournal")
template(name="journalTemplate" type="list") {
    property(name="timestamp" dateFormat="rfc3339")
    constant(value=" ")
    property(name="hostname")
    constant(value=" ")
    property(name="syslogtag")
    constant(value=" ")
    # all journal fields as one JSON object
    property(name="$!all-json")
    # list templates do not append a line break on their own
    constant(value="\n")
}
if $programname == 'jt' and $syslogseverity == 3 then {
    action(type="omfile" file="/var/log/jt_err.log" template="journalTemplate")
    stop
}
The output for the provided log entry would then look something like the following:
YYYY-MM-DDTHH:mm:ss myHostname syslogtag: {"_GID" : "0", "MESSAGE" : "there shoud be a text", ... }
As seen above, the properties are written out as JSON. To tailor the output instead, reference the individual fields of the $! tree in the template. In that case, however, each property must be listed explicitly.
template(name="journalTemplate" type="list") {
    property(name="timestamp" dateFormat="rfc3339")
    constant(value=" ")
    property(name="hostname")
    constant(value=" ")
    property(name="syslogtag")
    constant(value=": _GID=")
    property(name="$!_GID" format="json")
    constant(value=" MESSAGE=")
    property(name="$!MESSAGE" format="json")
    constant(value=" _HOSTNAME=")
    property(name="$!_HOSTNAME" format="json")
    ...
}
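To make the trade-off of the per-field template concrete, the same idea can be mimicked in a few lines of Python. This is a rough sketch and not a description of how rsyslog renders templates. It assumes the journal entry is already available as a dict, and `render_fields` is a hypothetical helper name. Values are encoded with `json.dumps`, which also adds surrounding quotes.

```python
import json


def render_fields(entry, fields):
    """Return 'KEY=value' pairs for the chosen fields, in the given order.

    Fields missing from the entry are rendered as an empty JSON string so
    that the layout of the line stays stable.
    """
    parts = []
    for name in fields:
        value = entry.get(name, "")
        # json.dumps escapes quotes and control characters in the value.
        parts.append(f"{name}={json.dumps(value)}")
    return " ".join(parts)


# Hypothetical subset of the journal entry shown in the question.
entry = {"_GID": "0", "MESSAGE": "there shoud be a text", "MEAS_VAL": "1394"}
line = render_fields(entry, ["_GID", "MESSAGE", "MEAS_VAL", "_HOSTNAME"])
print(line)
```

As with the template, a field only shows up if it is named explicitly, so a custom field such as MEAS_VAL has to be added to the list by hand.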

Related

Stormcrawler not retrieving all text content from web page

I'm attempting to use Stormcrawler to crawl a set of pages on our website, and while it is able to retrieve and index some of the page's text, it's not capturing a large amount of other text on the page.
I've installed Zookeeper, Apache Storm, and Stormcrawler using the Ansible playbooks provided here (thank you a million for those!) on a server running Ubuntu 18.04, along with Elasticsearch and Kibana. For the most part, I'm using the configuration defaults, but have made the following changes:
For the Elastic index mappings, I've enabled _source: true, and turned on indexing and storing for all properties (content, host, title, url)
In the crawler-conf.yaml configuration, I've commented out all textextractor.include.pattern and textextractor.exclude.tags settings, to enforce capturing the whole page
After re-creating fresh ES indices, running mvn clean package, and then starting the crawler topology, stormcrawler begins doing its thing and content starts appearing in Elasticsearch. However, for many pages, the content that's retrieved and indexed is only a subset of all the text on the page, and usually excludes the main page text we are interested in.
For example, the text in the following XML path is not returned/indexed:
<html> <body> <div#maincontentcontainer.container> <div#docs-container> <div> <div.row> <div.col-lg-9.col-md-8.col-sm-12.content-item> <div> <div> <p> (text)
While the text in this path is returned:
<html> <body> <div> <div.container> <div.row> <p> (text)
Are there any additional configuration changes that need to be made beyond commenting out all specific tag include and exclude patterns? From my understanding of the documentation, the default settings for those options are to enforce the whole page to be indexed.
I would greatly appreciate any help. Thank you for the excellent software.
Below are my configuration files:
crawler-conf.yaml
config:
topology.workers: 3
topology.message.timeout.secs: 1000
topology.max.spout.pending: 100
topology.debug: false
fetcher.threads.number: 100
# override the JVM parameters for the workers
topology.worker.childopts: "-Xmx2g -Djava.net.preferIPv4Stack=true"
# mandatory when using Flux
topology.kryo.register:
- com.digitalpebble.stormcrawler.Metadata
# metadata to transfer to the outlinks
# metadata.transfer:
# - customMetadataName
# lists the metadata to persist to storage
metadata.persist:
- _redirTo
- error.cause
- error.source
- isSitemap
- isFeed
http.agent.name: "My crawler"
http.agent.version: "1.0"
http.agent.description: ""
http.agent.url: ""
http.agent.email: ""
# The maximum number of bytes for returned HTTP response bodies.
http.content.limit: -1
# FetcherBolt queue dump => comment out to activate
# fetcherbolt.queue.debug.filepath: "/tmp/fetcher-dump-{port}"
parsefilters.config.file: "parsefilters.json"
urlfilters.config.file: "urlfilters.json"
# revisit a page daily (value in minutes)
fetchInterval.default: 1440
# revisit a page with a fetch error after 2 hours (value in minutes)
fetchInterval.fetch.error: 120
# never revisit a page with an error (or set a value in minutes)
fetchInterval.error: -1
# text extraction for JSoupParserBolt
# textextractor.include.pattern:
# - DIV[id="maincontent"]
# - DIV[itemprop="articleBody"]
# - ARTICLE
# textextractor.exclude.tags:
# - STYLE
# - SCRIPT
# configuration for the classes extending AbstractIndexerBolt
# indexer.md.filter: "someKey=aValue"
indexer.url.fieldname: "url"
indexer.text.fieldname: "content"
indexer.canonical.name: "canonical"
indexer.md.mapping:
- parse.title=title
- parse.keywords=keywords
- parse.description=description
- domain=domain
# Metrics consumers:
topology.metrics.consumer.register:
- class: "org.apache.storm.metric.LoggingMetricsConsumer"
parallelism.hint: 1
http.protocol.implementation: "com.digitalpebble.stormcrawler.protocol.selenium.RemoteDriverProtocol"
https.protocol.implementation: "com.digitalpebble.stormcrawler.protocol.selenium.RemoteDriverProtocol"
selenium.addresses: "http://localhost:9515"
es-conf.yaml
config:
# ES indexer bolt
es.indexer.addresses: "localhost"
es.indexer.index.name: "content"
# es.indexer.pipeline: "_PIPELINE_"
es.indexer.create: false
es.indexer.bulkActions: 100
es.indexer.flushInterval: "2s"
es.indexer.concurrentRequests: 1
# ES metricsConsumer
es.metrics.addresses: "http://localhost:9200"
es.metrics.index.name: "metrics"
# ES spout and persistence bolt
es.status.addresses: "http://localhost:9200"
es.status.index.name: "status"
es.status.routing: true
es.status.routing.fieldname: "key"
es.status.bulkActions: 500
es.status.flushInterval: "5s"
es.status.concurrentRequests: 1
# spout config #
# positive or negative filters parsable by the Lucene Query Parser
# es.status.filterQuery:
# - "-(key:stormcrawler.net)"
# - "-(key:digitalpebble.com)"
# time in secs for which the URLs will be considered for fetching after an ack or fail
spout.ttl.purgatory: 30
# Min time (in msecs) to allow between 2 successive queries to ES
spout.min.delay.queries: 2000
# Delay since previous query date (in secs) after which the nextFetchDate value will be reset to the current time
spout.reset.fetchdate.after: 120
es.status.max.buckets: 50
es.status.max.urls.per.bucket: 2
# field to group the URLs into buckets
es.status.bucket.field: "key"
# fields to sort the URLs within a bucket
es.status.bucket.sort.field:
- "nextFetchDate"
- "url"
# field to sort the buckets
es.status.global.sort.field: "nextFetchDate"
# CollapsingSpout : limits the deep paging by resetting the start offset for the ES query
es.status.max.start.offset: 500
# AggregationSpout : sampling improves the performance on large crawls
es.status.sample: false
# max allowed duration of a query in sec
es.status.query.timeout: -1
# AggregationSpout (expert): adds this value in mins to the latest date returned in the results and
# use it as nextFetchDate
es.status.recentDate.increase: -1
es.status.recentDate.min.gap: -1
topology.metrics.consumer.register:
- class: "com.digitalpebble.stormcrawler.elasticsearch.metrics.MetricsConsumer"
parallelism.hint: 1
#whitelist:
# - "fetcher_counter"
# - "fetcher_average.bytes_fetched"
#blacklist:
# - "__receive.*"
es-crawler.flux
name: "crawler"
includes:
- resource: true
file: "/crawler-default.yaml"
override: false
- resource: false
file: "crawler-conf.yaml"
override: true
- resource: false
file: "es-conf.yaml"
override: true
spouts:
- id: "spout"
className: "com.digitalpebble.stormcrawler.elasticsearch.persistence.AggregationSpout"
parallelism: 10
- id: "filespout"
className: "com.digitalpebble.stormcrawler.spout.FileSpout"
parallelism: 1
constructorArgs:
- "."
- "seeds.txt"
- true
bolts:
- id: "filter"
className: "com.digitalpebble.stormcrawler.bolt.URLFilterBolt"
parallelism: 3
- id: "partitioner"
className: "com.digitalpebble.stormcrawler.bolt.URLPartitionerBolt"
parallelism: 3
- id: "fetcher"
className: "com.digitalpebble.stormcrawler.bolt.FetcherBolt"
parallelism: 3
- id: "sitemap"
className: "com.digitalpebble.stormcrawler.bolt.SiteMapParserBolt"
parallelism: 3
- id: "parse"
className: "com.digitalpebble.stormcrawler.bolt.JSoupParserBolt"
parallelism: 12
- id: "index"
className: "com.digitalpebble.stormcrawler.elasticsearch.bolt.IndexerBolt"
parallelism: 3
- id: "status"
className: "com.digitalpebble.stormcrawler.elasticsearch.persistence.StatusUpdaterBolt"
parallelism: 3
- id: "status_metrics"
className: "com.digitalpebble.stormcrawler.elasticsearch.metrics.StatusMetricsBolt"
parallelism: 3
streams:
- from: "spout"
to: "partitioner"
grouping:
type: SHUFFLE
- from: "spout"
to: "status_metrics"
grouping:
type: SHUFFLE
- from: "partitioner"
to: "fetcher"
grouping:
type: FIELDS
args: ["key"]
- from: "fetcher"
to: "sitemap"
grouping:
type: LOCAL_OR_SHUFFLE
- from: "sitemap"
to: "parse"
grouping:
type: LOCAL_OR_SHUFFLE
- from: "parse"
to: "index"
grouping:
type: LOCAL_OR_SHUFFLE
- from: "fetcher"
to: "status"
grouping:
type: FIELDS
args: ["url"]
streamId: "status"
- from: "sitemap"
to: "status"
grouping:
type: FIELDS
args: ["url"]
streamId: "status"
- from: "parse"
to: "status"
grouping:
type: FIELDS
args: ["url"]
streamId: "status"
- from: "index"
to: "status"
grouping:
type: FIELDS
args: ["url"]
streamId: "status"
- from: "filespout"
to: "filter"
grouping:
type: FIELDS
args: ["url"]
streamId: "status"
- from: "filter"
to: "status"
grouping:
streamId: "status"
type: CUSTOM
customClass:
className: "com.digitalpebble.stormcrawler.util.URLStreamGrouping"
constructorArgs:
- "byDomain"
parsefilters.json
{
  "com.digitalpebble.stormcrawler.parse.ParseFilters": [
    {
      "class": "com.digitalpebble.stormcrawler.parse.filter.XPathFilter",
      "name": "XPathFilter",
      "params": {
        "canonical": "//*[@rel=\"canonical\"]/@href",
        "parse.description": [
          "//*[@name=\"description\"]/@content",
          "//*[@name=\"Description\"]/@content"
        ],
        "parse.title": [
          "//TITLE",
          "//META[@name=\"title\"]/@content"
        ],
        "parse.keywords": "//META[@name=\"keywords\"]/@content"
      }
    },
    {
      "class": "com.digitalpebble.stormcrawler.parse.filter.LinkParseFilter",
      "name": "LinkParseFilter",
      "params": {
        "pattern": "//FRAME/@src"
      }
    },
    {
      "class": "com.digitalpebble.stormcrawler.parse.filter.DomainParseFilter",
      "name": "DomainParseFilter",
      "params": {
        "key": "domain",
        "byHost": false
      }
    },
    {
      "class": "com.digitalpebble.stormcrawler.parse.filter.CommaSeparatedToMultivaluedMetadata",
      "name": "CommaSeparatedToMultivaluedMetadata",
      "params": {
        "keys": ["parse.keywords"]
      }
    }
  ]
}
Attempting to use Chromedriver
I installed the latest versions of Chromedriver and Google Chrome for Ubuntu.
First I start chromedriver in headless mode at localhost:9515 as the stormcrawler user (via a separate Python shell, as shown below), and then I restart the stormcrawler topology (also as the stormcrawler user), but I end up with a stack of errors related to Chrome. The odd thing, however, is that chromedriver runs fine within the Python shell directly, and ps -ef confirms that both the driver and the browser are running. The same stack of errors also occurs when I simply start chromedriver from the command line (i.e., chromedriver --headless &).
Starting chromedriver in headless mode (in python3 shell)
from selenium import webdriver
options = webdriver.ChromeOptions()
options.add_argument('--no-sandbox')
options.add_argument('--headless')
options.add_argument('--window-size=1200x600')
options.add_argument('--disable-dev-shm-usage')
options.add_argument('--disable-setuid-sandbox')
options.add_argument('--disable-extensions')
options.add_argument('--disable-infobars')
options.add_argument('--remote-debugging-port=9222')
options.add_argument('--user-data-dir=/home/stormcrawler/cache/google/chrome')
options.add_argument('--disable-gpu')
options.add_argument('--profile-directory=Default')
options.binary_location = '/usr/bin/google-chrome'
driver = webdriver.Chrome(chrome_options=options, port=9515, executable_path=r'/usr/bin/chromedriver')
Stack trace from starting stormcrawler topology
Run command: storm jar target/stormcrawler-1.0-SNAPSHOT.jar org.apache.storm.flux.Flux --local es-crawler.flux --sleep 60000
9486 [Thread-26-fetcher-executor[3 3]] ERROR o.a.s.util - Async loop died!
java.lang.RuntimeException: org.openqa.selenium.WebDriverException: unknown error: Chrome failed to start: exited abnormally.
(unknown error: DevToolsActivePort file doesn't exist)
(The process started from chrome location /usr/bin/google-chrome is no longer running, so ChromeDriver is assuming that Chrome has crashed.)
Build info: version: '4.0.0-alpha-6', revision: '5f43a29cfc'
System info: host: 'stormcrawler-dev', ip: '127.0.0.1', os.name: 'Linux', os.arch: 'amd64', os.version: '4.15.0-33-generic', java.version: '1.8.0_282'
Driver info: driver.version: RemoteWebDriver
remote stacktrace: #0 0x55d590b21e89 <unknown>
at com.digitalpebble.stormcrawler.protocol.selenium.RemoteDriverProtocol.configure(RemoteDriverProtocol.java:101) ~[stormcrawler-1.0-SNAPSHOT.jar:?]
at com.digitalpebble.stormcrawler.protocol.ProtocolFactory.<init>(ProtocolFactory.java:69) ~[stormcrawler-1.0-SNAPSHOT.jar:?]
at com.digitalpebble.stormcrawler.bolt.FetcherBolt.prepare(FetcherBolt.java:818) ~[stormcrawler-1.0-SNAPSHOT.jar:?]
at org.apache.storm.daemon.executor$fn__10180$fn__10193.invoke(executor.clj:803) ~[storm-core-1.2.3.jar:1.2.3]
at org.apache.storm.util$async_loop$fn__624.invoke(util.clj:482) [storm-core-1.2.3.jar:1.2.3]
at clojure.lang.AFn.run(AFn.java:22) [clojure-1.7.0.jar:?]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_282]
Caused by: org.openqa.selenium.WebDriverException: unknown error: Chrome failed to start: exited abnormally.
(unknown error: DevToolsActivePort file doesn't exist)
(The process started from chrome location /usr/bin/google-chrome is no longer running, so ChromeDriver is assuming that Chrome has crashed.)
...
Confirming that chromedriver and chrome are both running and reachable
~/stormcrawler$ ps -ef | grep -i 'driver'
stormcr+ 18862 18857 0 14:28 pts/0 00:00:00 /usr/bin/chromedriver --port=9515
stormcr+ 18868 18862 0 14:28 pts/0 00:00:00 /usr/bin/google-chrome --disable-background-networking --disable-client-side-phishing-detection --disable-default-apps --disable-dev-shm-usage --disable-extensions --disable-gpu --disable-hang-monitor --disable-infobars --disable-popup-blocking --disable-prompt-on-repost --disable-setuid-sandbox --disable-sync --enable-automation --enable-blink-features=ShadowDOMV0 --enable-logging --headless --log-level=0 --no-first-run --no-sandbox --no-service-autorun --password-store=basic --profile-directory=Default --remote-debugging-port=9222 --test-type=webdriver --use-mock-keychain --user-data-dir=/home/stormcrawler/cache/google/chrome --window-size=1200x600
stormcr+ 18899 18877 0 14:28 pts/0 00:00:00 /opt/google/chrome/chrome --type=renderer --no-sandbox --disable-dev-shm-usage --enable-automation --enable-logging --log-level=0 --remote-debugging-port=9222 --test-type=webdriver --allow-pre-commit-input --ozone-platform=headless --field-trial-handle=17069524199442920904,10206176048672570859,131072 --disable-gpu-compositing --enable-blink-features=ShadowDOMV0 --lang=en-US --headless --enable-crash-reporter --lang=en-US --num-raster-threads=1 --renderer-client-id=4 --shared-files=v8_context_snapshot_data:100
~/stormcrawler$ sudo netstat -lp
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address Foreign Address State PID/Program name
tcp 0 0 localhost:9222 0.0.0.0:* LISTEN 18026/google-chrome
tcp 0 0 localhost:9515 0.0.0.0:* LISTEN 18020/chromedriver
IIRC you need to set some additional config to work with ChromeDriver.
Alternatively (haven't tried yet) https://hub.docker.com/r/browserless/chrome would be a nice way of handling Chrome in a Docker container.

logstash custom patterns don't get resolved

I'm trying to set up an environment for grok debugging, using Docker.
Everything works fine until logstash tries to resolve a custom pattern.
Here is my environment:
I start the container with
docker run -it --name logstash_debug \
  -v /home/cloud/docker-elk/logstash/config/logstash.yml:/usr/share/logstash/config/logstash.yml \
  -v /home/cloud/docker-elk/logstash/pipeline/:/usr/share/logstash/pipeline/ \
  -v /home/cloud/docker-elk/logstash/patterns/:/usr/share/logstash/patterns \
  docker.elastic.co/logstash/logstash:7.2.0
As I said, logstash starts up, loads the pipeline (debug.conf)
input { stdin {} }
filter {
  grok {
    patterns_dir => ["/usr/share/logstash/patterns"]
    match => ["message", "%{YEAR1} \[%{LOGLEVEL:loglvl}\] %{GREEDYDATA:message}"]
  }
  date {
    match => ["customer_time", "${YEAR1}"]
    target => "@timestamp"
  }
}
output { stdout { codec => rubydebug } }
and gives me this error:
Cannot evaluate ${YEAR1}. Replacement variable YEAR1 is not
defined in a Logstash secret store or as an Environment entry and
there is no default value given.
the patterns_dir contains a file "dateformats" which contains (stripped it down to a minimum)
YEAR1 %{YEAR}
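The error message quoted above indicates that `${YEAR1}` is looked up as an environment or secret-store variable, whereas grok references are written `%{NAME}`. The sketch below illustrates how references of the `%{NAME}` kind could be resolved from a pattern table. It is a simplified Python model, not Logstash's implementation. The regex fragments in `PATTERNS` are stand-ins chosen for the example, and `expand` is a hypothetical helper.

```python
import re

# Simplified stand-ins for the stock patterns; the real Logstash definitions
# are more elaborate. YEAR1 mirrors the custom "dateformats" file.
PATTERNS = {
    "YEAR": r"\d{4}|\d{2}",
    "LOGLEVEL": r"[A-Za-z]+",
    "GREEDYDATA": r".*",
    "YEAR1": r"%{YEAR}",
}

# Matches %{NAME} and %{NAME:field}; a ${NAME} reference does not match.
GROK_REF = re.compile(r"%\{(\w+)(?::(\w+))?\}")


def expand(expression, patterns):
    """Recursively replace %{NAME} and %{NAME:field} with regex fragments."""
    def substitute(match):
        name, field = match.group(1), match.group(2)
        body = expand(patterns[name], patterns)  # KeyError if NAME is unknown
        if field:
            return f"(?P<{field}>{body})"
        return f"(?:{body})"
    return GROK_REF.sub(substitute, expression)


regex = expand(r"%{YEAR1} \[%{LOGLEVEL:loglvl}\] %{GREEDYDATA:message}", PATTERNS)
result = re.match(regex, "2019 [ERROR] disk almost full")
print(result.groupdict())

# A ${...} reference is left untouched by this kind of expansion.
unchanged = expand("${YEAR1}", PATTERNS)
print(unchanged)
```

In this model the custom YEAR1 pattern resolves through YEAR as expected, while the `${YEAR1}` string passes through unexpanded, which is consistent with it being handled by a different substitution mechanism.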
the logstash debug output gives me this:
[DEBUG][logstash.filters.grok ] config LogStash::Filters::Grok/@patterns_dir = ["/usr/share/logstash/patterns"]
[DEBUG][logstash.filters.grok ] config LogStash::Filters::Grok/@match = {"message"=>"%{YEAR1} \[%{LOGLEVEL:loglvl}\] %{GREEDYDATA:message}"}
.....
[DEBUG][logstash.filters.grok ] config LogStash::Filters::Grok/@patterns_files_glob = "*"
Normally logstash should be able to read this file (I even started the container with --user 0 to rule out a permission problem), but somehow it can't.
Can anyone give me a hint as to what's going on?
Thanks and cheers,
Wurzelseppi

Logging .net Core with Elastic stack

I'm trying to set up simple logging with Filebeat and Logstash and to view the logs in Kibana. I'm running a simple MVC .NET Core app with log4net as the logger. The log4net FileAppender appends logs to C:\Logs\Debug.log just fine; however, I'm not able to push them to Kibana.
Based on this article, I would set up Filebeat, then transform the log via Logstash, and be able to view my logs in Kibana.
logstash.yml
- module: logstash
# logs
log:
enabled: true
#var.paths: ["C:/Logs/Debug.log"] - THIS CAUSES ERRORS - should this be UNCOMMENTED?
# Convert the timestamp to UTC. Requires Elasticsearch >= 6.1.
#var.convert_timezone: false
# Slow logs
slowlog:
enabled: true
#var.paths: ["C:/Logs/Debug.log"]
# Convert the timestamp to UTC. Requires Elasticsearch >= 6.1.
#var.convert_timezone: false
logstash-sample.conf
# Sample Logstash configuration for creating a simple
# Beats -> Logstash -> Elasticsearch pipeline.
input {
  file {
    path => "C:\Logs\Debug.log"
    type => "log4net"
    codec => multiline {
      pattern => "^(DEBUG|WARN|ERROR|INFO|FATAL)"
      negate => true
      what => previous
    }
  }
}
filter {
  if [type] == "log4net" {
    grok {
      match => [ "message", "(?m)%{LOGLEVEL:level} %{TIMESTAMP_ISO8601:sourceTimestamp} %{DATA:logger} \[%{NUMBER:threadId}\] \[%{IPORHOST:tempHost}\] %{GREEDYDATA:tempMessage}" ]
    }
    mutate {
      replace => [ "message" , "%{tempMessage}" ]
      replace => [ "host" , "%{tempHost}" ]
      remove_field => [ "tempMessage" ]
      remove_field => [ "tempHost" ]
    }
  }
}
output {
  elasticsearch {
    host => localhost
    index => "%{[@metadata][beat]}-%{[@metadata][version]}-%{+YYYY.MM.dd}"
    #user => "elastic"
    #password => "changeme"
  }
}
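The multiline codec settings in the input section above (a log-level pattern, negate => true, what => previous) mean that any line not starting with a log level is attached to the line before it. A minimal Python sketch of that grouping follows, assuming plain text lines as input. The function name `join_multiline` and the sample log lines are made up for illustration and are not taken from the application's real log.

```python
import re

# Same start-of-event pattern as in the codec configuration.
LEVEL_START = re.compile(r"^(DEBUG|WARN|ERROR|INFO|FATAL)")


def join_multiline(lines):
    """Group raw lines into events.

    A line that does not start with a log level is treated as a continuation
    and appended to the event before it (negate => true, what => previous).
    """
    events = []
    for line in lines:
        line = line.rstrip("\n")
        if LEVEL_START.match(line) or not events:
            events.append(line)
        else:
            events[-1] += "\n" + line
    return events


# Hypothetical log4net-style lines, including one stack-trace continuation.
raw = [
    "ERROR 2019-07-01 10:00:00,123 MyApp.HomeController [5] [localhost] Unhandled exception",
    "   at MyApp.HomeController.Index()",
    "INFO 2019-07-01 10:00:01,456 MyApp.Startup [1] [localhost] Started",
]
events = join_multiline(raw)
print(len(events))
```

Feeding a few real lines from C:\Logs\Debug.log through such a check is one way to see whether the log-level pattern actually matches the start of each entry before debugging the rest of the pipeline.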
Running logstash with config-sample output:
Filebeat.yml
###################### Filebeat Configuration Example #########################
# This file is an example configuration file highlighting only the most common
# options. The filebeat.reference.yml file from the same directory contains all the
# supported options with more comments. You can use it as a reference.
#
# You can find the full configuration reference here:
# https://www.elastic.co/guide/en/beats/filebeat/index.html
# For more available modules and options, please see the filebeat.reference.yml sample
# configuration file.
#=========================== Filebeat inputs =============================
filebeat.inputs:
# Each - is an input. Most options can be set at the input level, so
# you can use different inputs for various configurations.
# Below are the input specific configurations.
- type: log
# Change to true to enable this input configuration.
enabled: false
# Paths that should be crawled and fetched. Glob based paths.
paths:
#- /var/log/*.log
- c:\Logs\*.log
# Exclude lines. A list of regular expressions to match. It drops the lines that are
# matching any regular expression from the list.
#exclude_lines: ['^DBG']
# Include lines. A list of regular expressions to match. It exports the lines that are
# matching any regular expression from the list.
#include_lines: ['^ERR', '^WARN']
# Exclude files. A list of regular expressions to match. Filebeat drops the files that
# are matching any regular expression from the list. By default, no files are dropped.
#exclude_files: ['.gz$']
# Optional additional fields. These fields can be freely picked
# to add additional information to the crawled log files for filtering
#fields:
# level: debug
# review: 1
### Multiline options
# Multiline can be used for log messages spanning multiple lines. This is common
# for Java Stack Traces or C-Line Continuation
# The regexp Pattern that has to be matched. The example pattern matches all lines starting with [
#multiline.pattern: ^\[
# Defines if the pattern set under pattern should be negated or not. Default is false.
#multiline.negate: false
# Match can be set to "after" or "before". It is used to define if lines should be append to a pattern
# that was (not) matched before or after or as long as a pattern is not matched based on negate.
# Note: After is the equivalent to previous and before is the equivalent to next in Logstash
#multiline.match: after
#============================= Filebeat modules ===============================
filebeat.config.modules:
# Glob pattern for configuration loading
path: ${path.config}/modules.d/*.yml
# Set to true to enable config reloading
reload.enabled: false
# Period on which files under path should be checked for changes
#reload.period: 10s
#==================== Elasticsearch template setting ==========================
setup.template.settings:
index.number_of_shards: 3
#index.codec: best_compression
#_source.enabled: false
#================================ General =====================================
# The name of the shipper that publishes the network data. It can be used to group
# all the transactions sent by a single shipper in the web interface.
#name:
# The tags of the shipper are included in their own field with each
# transaction published.
#tags: ["service-X", "web-tier"]
# Optional fields that you can specify to add additional information to the
# output.
#fields:
# env: staging
#============================== Dashboards =====================================
# These settings control loading the sample dashboards to the Kibana index. Loading
# the dashboards is disabled by default and can be enabled either by setting the
# options here, or by using the `-setup` CLI flag or the `setup` command.
#setup.dashboards.enabled: false
# The URL from where to download the dashboards archive. By default this URL
# has a value which is computed based on the Beat name and version. For released
# versions, this URL points to the dashboard archive on the artifacts.elastic.co
# website.
#setup.dashboards.url:
#============================== Kibana =====================================
# Starting with Beats version 6.0.0, the dashboards are loaded via the Kibana API.
# This requires a Kibana endpoint configuration.
setup.kibana:
host: "localhost:5601"
# Kibana Host
# Scheme and port can be left out and will be set to the default (http and 5601)
# In case you specify and additional path, the scheme is required: http://localhost:5601/path
# IPv6 addresses should always be defined as: https://[2001:db8::1]:5601
#host: "localhost:5601"
# Kibana Space ID
# ID of the Kibana Space into which the dashboards should be loaded. By default,
# the Default Space will be used.
#space.id:
#============================= Elastic Cloud ==================================
# These settings simplify using filebeat with the Elastic Cloud (https://cloud.elastic.co/).
# The cloud.id setting overwrites the `output.elasticsearch.hosts` and
# `setup.kibana.host` options.
# You can find the `cloud.id` in the Elastic Cloud web UI.
#cloud.id:
# The cloud.auth setting overwrites the `output.elasticsearch.username` and
# `output.elasticsearch.password` settings. The format is `<user>:<pass>`.
#cloud.auth:
#================================ Outputs =====================================
# Configure what output to use when sending the data collected by the beat.
#-------------------------- Elasticsearch output ------------------------------
#output.elasticsearch:
# Array of hosts to connect to.
# hosts: ["localhost:9200"]
# Enabled ilm (beta) to use index lifecycle management instead daily indices.
#ilm.enabled: false
# Optional protocol and basic auth credentials.
#protocol: "https"
#username: "elastic"
#password: "changeme"
#----------------------------- Logstash output --------------------------------
output.logstash:
# The Logstash hosts
hosts: ["localhost:5044"]
# Optional SSL. By default is off.
# List of root certificates for HTTPS server verifications
#ssl.certificate_authorities: ["/etc/pki/root/ca.pem"]
# Certificate for SSL client authentication
#ssl.certificate: "/etc/pki/client/cert.pem"
# Client Certificate Key
#ssl.key: "/etc/pki/client/cert.key"
#================================ Processors =====================================
# Configure processors to enhance or manipulate events generated by the beat.
processors:
- add_host_metadata: ~
- add_cloud_metadata: ~
#================================ Logging =====================================
# Sets log level. The default log level is info.
# Available log levels are: error, warning, info, debug
#logging.level: debug
# At debug level, you can selectively enable logging only for some components.
# To enable all selectors use ["*"]. Examples of other selectors are "beat",
# "publish", "service".
#logging.selectors: ["*"]
#============================== Xpack Monitoring ===============================
# filebeat can export internal metrics to a central Elasticsearch monitoring
# cluster. This requires xpack monitoring to be enabled in Elasticsearch. The
# reporting is disabled by default.
# Set to true to enable the monitoring reporter.
#xpack.monitoring.enabled: false
# Uncomment to send the metrics to Elasticsearch. Most settings from the
# Elasticsearch output are accepted here as well. Any setting that is not set is
# automatically inherited from the Elasticsearch output configuration, so if you
# have the Elasticsearch output configured, you can simply uncomment the
# following line.
#xpack.monitoring.elasticsearch:
My MVC app logs just fine via log4net to C:\Logs\Debug.log, but I have not been able to set things up so that these logs show up in Kibana.
How would I set it up so that I would see my logs in Kibana?
EDIT 1:
logstash.config
# Sample Logstash configuration for creating a simple
# Beats -> Logstash -> Elasticsearch pipeline.
input {
beats {
port => 5044
}
}
filter {
grok {
match => { "message" => "(?m)^%{TIMESTAMP_ISO8601:timestamp}~~\[%{DATA:thread}\]~~\[%{DATA:user}\]~~\[%{DATA:requestId}\]~~\[%{DATA:userHost}\]~~\[%{DATA:requestUrl}\]~~%{DATA:level}~~%{DATA:logger}~~%{DATA:logmessage}~~%{DATA:exception}\|\|" }
add_field => {
"received_at" => "%{@timestamp}"
"received_from" => "%{host}"
}
remove_field => ["message"]
}
date {
match => [ "timestamp", "yyyy-MM-dd HH:mm:ss:SSS" ]
}
}
output {
elasticsearch {
hosts => ["http://localhost:9200"]
sniffing => true
index => "%{app_name}_%{app_env}_%{type}-%{+YYYY.MM.dd}"
document_type => "%{[@metadata][type]}"
#user => "elastic"
#password => "changeme"
}
stdout { codec => rubydebug }
}
filebeat.yml
filebeat.inputs:
# Each - is an input. Most options can be set at the input level, so
# you can use different inputs for various configurations.
# Below are the input specific configurations.
- type: log
# Change to true to enable this input configuration.
enabled: true
# Paths that should be crawled and fetched. Glob based paths.
paths:
#- /var/log/*.log
- c:\Logs\*.log
.....
#-------------------------- Elasticsearch output ------------------------------
#output.elasticsearch:
# Array of hosts to connect to.
#hosts: ["localhost:9200"]
# Enabled ilm (beta) to use index lifecycle management instead daily indices.
#ilm.enabled: false
# Optional protocol and basic auth credentials.
#protocol: "https"
#username: "elastic"
#password: "changeme"
#----------------------------- Logstash output --------------------------------
output.logstash:
# The Logstash hosts
hosts: ["localhost:5044"]
I have Filebeat enabled and running as a service, and Logstash is running too (see the PowerShell window below). When I change anything in the Debug.log file and save, I see those changes output to the console right away.
However, when I go to the dashboard I still do not see any logs. What am I doing wrong?
I was able to solve this. Logging in .NET Core 2.0 using log4net:
1. Set up log4net as usual (make sure logging works and your logs get written to a log file; for me it's C:\Logs\Debug.log).
Install Kibana, Elasticsearch, Logstash and Filebeat: https://www.elastic.co/start
Configure filebeat.yml:
#=========================== Filebeat inputs =============================
filebeat.inputs:
- type: log
  # Change to true to enable this input configuration.
  enabled: true
  # Paths that should be crawled and fetched. Glob based paths.
  paths:
    #- /var/log/*.log
    - c:\Logs\*.log
  multiline.pattern: '^(\d{4}-\d{2}-\d{2}\s)'
  multiline.negate: true
  multiline.match: after
#============================= Filebeat modules ===============================
filebeat.config.modules:
  # Glob pattern for configuration loading
  path: ${path.config}/modules.d/*.yml
  # Set to true to enable config reloading
  reload.enabled: false
  # Period on which files under path should be checked for changes
  #reload.period: 10s
#==================== Elasticsearch template setting ==========================
setup.template.settings:
  index.number_of_shards: 3
  #index.codec: best_compression
  #_source.enabled: false
#============================== Kibana =====================================
# Starting with Beats version 6.0.0, the dashboards are loaded via the Kibana API.
# This requires a Kibana endpoint configuration.
setup.kibana:
  host: "localhost:5601"
#-------------------------- Elasticsearch output ------------------------------
#output.elasticsearch: => MAKE SURE THIS IS COMMENTED OUT
  # Array of hosts to connect to.
  # hosts: ["localhost:9200"]
  # Enabled ilm (beta) to use index lifecycle management instead daily indices.
  #ilm.enabled: false
  # Optional protocol and basic auth credentials.
  #protocol: "https"
  #username: "elastic"
  #password: "changeme"
#----------------------------- Logstash output --------------------------------
output.logstash:
  # The Logstash hosts
  hosts: ["localhost:5044"]
processors:
  - add_host_metadata: ~
  - add_cloud_metadata: ~
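To make the three multiline settings in the filebeat.yml above easier to reason about, here is a minimal Python sketch of how I understand `negate: true` with `match: after` to group lines: any line that does not start with a date is appended to the previous line that does. The function name `group_multiline` and the sample lines are made up for illustration; this is not Filebeat's implementation, and it ignores options such as timeouts and maximum line counts.

```python
import re

# Same regular expression as multiline.pattern in filebeat.yml.
PATTERN = re.compile(r"^(\d{4}-\d{2}-\d{2}\s)")


def group_multiline(lines):
    """Group lines the way negate: true / match: after is described to work.

    A line matching PATTERN starts a new event; every other line is
    appended to the event that came before it.
    """
    events = []
    for line in lines:
        if PATTERN.search(line) or not events:
            # Starts with a date (or nothing to attach to yet): new event.
            events.append(line)
        else:
            # Continuation line, e.g. part of a stack trace.
            events[-1] += "\n" + line
    return events


if __name__ == "__main__":
    sample = [
        "2019-03-05 14:22:31 ERROR Something failed",   # hypothetical log line
        "System.Exception: boom",
        "   at MyApp.Run()",
        "2019-03-05 14:22:32 INFO Recovered",
    ]
    for event in group_multiline(sample):
        print(repr(event))
```

With these settings a log4net exception and its stack trace should reach Logstash as one event instead of one event per line.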
logstash.yml
- module: logstash
  # logs
  log:
    enabled: true
    # Set custom paths for the log files. If left empty,
    # Filebeat will choose the paths depending on your OS.
    #var.paths: -C:\Logs\*.log
    # Convert the timestamp to UTC. Requires Elasticsearch >= 6.1.
    #var.convert_timezone: false
  # Slow logs
  slowlog:
    enabled: true
    # Set custom paths for the log files. If left empty,
    # Filebeat will choose the paths depending on your OS.
    #var.paths: C:\Logs\*.log
    # Convert the timestamp to UTC. Requires Elasticsearch >= 6.1.
    #var.convert_timezone: false
logstash.conf
# Sample Logstash configuration for creating a simple
# Beats -> Logstash -> Elasticsearch pipeline.
input {
beats {
port => 5044
}
}
filter {
grok {
match => { "message" => "(?m)^%{TIMESTAMP_ISO8601:timestamp}~~\[%{DATA:thread}\]~~\[%{DATA:user}\]~~\[%{DATA:requestId}\]~~\[%{DATA:userHost}\]~~\[%{DATA:requestUrl}\]~~%{DATA:level}~~%{DATA:logger}~~%{DATA:logmessage}~~%{DATA:exception}\|\|" }
add_field => {
"received_at" => "%{@timestamp}"
"received_from" => "%{host}"
}
remove_field => ["message"]
}
date {
match => [ "timestamp", "yyyy-MM-dd HH:mm:ss:SSS" ]
}
}
output {
elasticsearch {
hosts => ["http://localhost:9200"]
sniffing => true
index => "filebeat-%{+YYYY.MM.dd}"
document_type => "%{[@metadata][type]}"
#user => "elastic"
#password => "changeme"
}
stdout { codec => rubydebug }
}
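As a rough illustration of what the grok and date filters above pull out of one event, this Python sketch applies a simplified equivalent regex to a single log line. The sample line and its `~~`-separated layout are assumptions inferred from the grok pattern, not copied from a real Debug.log, and the timestamp part is narrowed to the `yyyy-MM-dd HH:mm:ss:SSS` form that the date filter expects. `parse_line` is a hypothetical helper name.

```python
import re
from datetime import datetime

# Simplified stand-in for the grok pattern; re.DOTALL plays the role of (?m),
# so a field such as the exception text may span several lines.
LINE_RE = re.compile(
    r"^(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}:\d{3})"
    r"~~\[(?P<thread>.*?)\]~~\[(?P<user>.*?)\]~~\[(?P<requestId>.*?)\]"
    r"~~\[(?P<userHost>.*?)\]~~\[(?P<requestUrl>.*?)\]"
    r"~~(?P<level>.*?)~~(?P<logger>.*?)~~(?P<logmessage>.*?)~~(?P<exception>.*?)\|\|",
    re.DOTALL,
)


def parse_line(line):
    """Return the named fields of one log event, or None if it does not match."""
    match = LINE_RE.match(line)
    if match is None:
        return None
    fields = match.groupdict()
    # Counterpart of the date filter: turn the text timestamp into a datetime.
    fields["parsed_time"] = datetime.strptime(
        fields["timestamp"], "%Y-%m-%d %H:%M:%S:%f"
    )
    return fields


if __name__ == "__main__":
    # Hypothetical log4net line in the assumed layout.
    sample = (
        "2019-03-05 14:22:31:123~~[7]~~[alice]~~[req-1]~~[127.0.0.1]"
        "~~[/home/index]~~DEBUG~~MyApp.HomeController~~Page loaded~~||"
    )
    print(parse_line(sample))
```

If a line from your own log returns None here, the grok filter would most likely tag the event with a parse failure as well, which is worth checking in the rubydebug output before looking at Kibana.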
Make sure logstash is running with this configuration (CMD):
\bin\logstash -f c:\Elastic\Logstash\config\logstash.conf
Open your log file (C:\Logs\Debug.log) and add something. You should see output in the PowerShell window where Logstash is running and pulling in data:
Open Kibana and go to the index that you configured in logstash.conf:
index => "filebeat-%{+YYYY.MM.dd}"
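For reference, the `%{+YYYY.MM.dd}` part of the index setting is filled in from the event's timestamp in UTC, so each day gets its own index. A small Python sketch of the resulting names follows; `daily_index` is a hypothetical helper, not part of Logstash.

```python
from datetime import datetime, timezone


def daily_index(event_time, prefix="filebeat"):
    """Build the index name a prefix-%{+YYYY.MM.dd} setting would produce.

    event_time must be timezone-aware; it is converted to UTC first,
    because the date in the index name is based on UTC.
    """
    utc_time = event_time.astimezone(timezone.utc)
    return f"{prefix}-{utc_time:%Y.%m.%d}"


if __name__ == "__main__":
    print(daily_index(datetime(2019, 3, 5, 23, 30, tzinfo=timezone.utc)))
```

In Kibana, an index pattern such as `filebeat-*` should then cover all of these daily indices.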

puppet not working with keys() and hiera_hash()

In Hiera, my node has the variable solr_enabled = true. The same node also has a list of fstab mount points like:
fstab_homes:
'/home1':
device: 'UUID=ac2ca97e-8bce-4774-92d7-051482253089'
'/home2':
device: 'UUID=d9daaeed-4e4e-40e9-aa6b-73632795e661'
'/home3':
device: 'UUID=21a358cf-2579-48cb-b89d-4ff43e4dd104'
'/home4':
device: 'UUID=c68041de-542a-4f72-9488-337048c41947'
'/home16':
device: 'UUID=d55eff53-3087-449b-9667-aeff49c556e7'
In solr.pp I want to get the first mounted home disk, create a folder there and make a symbolic link to it at /home/cpanelsolr.
For this I wrote the following code in /etc/puppet/environments/testing/modules/cpanel/manifests/solr.pp:
# Install SOLR - dovecot full text search plugin
class cpanel::solr(
$solr_enable = hiera('solr_enabled',false),
$homes = hiera_hash('fstab_homes', false),
$homesKeys = keys($homes),
)
{
if $solr_enable == true {
notify{"Starting Solr Installation ${homesKeys[0]}":}
if $homes != false and $homesKeys[0] != '/home' {
file { "Create Solr home symlink to ${homesKeys[0]}":
path => '/home/cpanelsolr',
ensure => 'link',
target => "${homesKeys[0]}/cpanelsolr",
}
}
exec { 'cpanel-dovecot-solr':
command => "/bin/bash -c
'/usr/local/cpanel/scripts/install_dovecot_fts'",
}
}
}
But when I run this on a dev node I get an error:
root@webcloud2 [/home1]# puppet agent -t --no-use_srv_records --server=puppet.development.internal --environment=testing --tags=cpanel::solr
Info: Retrieving pluginfacts
Info: Retrieving plugin
Info: Loading facts
2018-08-03 6:04:54 140004666824672 [Note] libgovernor.so found
2018-08-03 6:04:54 140004666824672 [Note] All governors functions found too
2018-08-03 6:04:54 140004666824672 [Note] Governor connected
2018-08-03 6:04:54 140004666824672 [Note] All governors lve functions found too
Error: Could not retrieve catalog from remote server: Error 400 on SERVER: keys(): Requires hash to work with at
/etc/puppet/environments/testing/modules/cpanel/manifests/solr.pp:6 on node webcloud2.development.internal
Warning: Not using cache on failed catalog
Error: Could not retrieve catalog; skipping run
What's wrong?
You have at least two problems there.
The first problem is that $homes won't be set at all in that context (the parameter list), so keys() does not get a hash to work with. You would need to rewrite it as:
class cpanel::solr(
  $solr_enable = hiera('solr_enabled',false),
  $homes = hiera_hash('fstab_homes', false),
)
{
  $homes_keys = keys($homes)
  ...
}
Second problem is that your YAML isn't correctly indented, so fstab_homes would not actually return a Hash. It should be:
fstab_homes:
  '/home1':
    device: 'UUID=ac2ca97e-8bce-4774-92d7-051482253089'
  '/home2':
    device: 'UUID=d9daaeed-4e4e-40e9-aa6b-73632795e661'
  '/home3':
    device: 'UUID=21a358cf-2579-48cb-b89d-4ff43e4dd104'
  '/home4':
    device: 'UUID=c68041de-542a-4f72-9488-337048c41947'
  '/home16':
    device: 'UUID=d55eff53-3087-449b-9667-aeff49c556e7'
Finally, be aware that using camelCase in parameter names in Puppet can cause issues in some contexts, so it is best to use snake_case.

IIS Logs and Event Logs

First off thank you for any advice and your time.
I recently set up an ELK stack for the company I just started working for. (This is my first experience using Logstash and NXLog.) What I would like to do is send both IIS logs and event logs from the same web server to Logstash using NXLog.
I just don't understand how to send two types of logs from one source and have the logstash.conf filter this data correctly.
This is my nxlog.conf
## This is a sample configuration file. See the nxlog reference manual about the
## configuration options. It should be installed locally and is also available
## online at http://nxlog.org/nxlog-docs/en/nxlog-reference-manual.html
## Please set the ROOT to the folder your nxlog was installed into,
## otherwise it will not start.
#define ROOT C:\Program Files\nxlog
define ROOT C:\Program Files (x86)\nxlog
Moduledir %ROOT%\modules
CacheDir %ROOT%\data
Pidfile %ROOT%\data\nxlog.pid
SpoolDir %ROOT%\data
LogFile %ROOT%\data\nxlog.log
<Extension json>
Module xm_json
</Extension>
<Input iis_1>
Module im_file
File "F:\inetpub\logs\LogFiles\W3SVC1\u_ex*.log"
ReadFromLast True
SavePos True
Exec if $raw_event =~ /^#/ drop();
</Input>
<Input iis_2>
Module im_file
File "F:\inetpub\logs\LogFiles\W3SVC2\u_ex*.log"
ReadFromLast True
SavePos True
Exec if $raw_event =~ /^#/ drop();
</Input>
<Input iis_4>
Module im_file
File "F:\inetpub\logs\LogFiles\W3SVC4\u_ex*.log"
ReadFromLast True
SavePos True
Exec if $raw_event =~ /^#/ drop();
</Input>
<Input eventlog>
Module im_msvistalog
Exec $EventReceivedTime = integer($EventReceivedTime) / 1000000; to_json();
</Input>
<Output out_iis>
Module om_tcp
Host 10.191.132.86
Port 5555
OutputType LineBased
</Output>
<Route 1>
Path iis_1, iis_2, iis_4, eventlog=> out_iis
</Route>
My Current logstash.conf
input {
tcp {
type => "iis"
port => 5555
host => "10.191.132.86"
}
}
filter {
if [type] == "iis" {
grok {
match => ["@message", "%{TIMESTAMP_ISO8601:timestamp} %{IPORHOST:hostip} %{WORD:method} %{URIPATH:page} %{NOTSPACE:query} %{NUMBER:port} %{NOTSPACE:username} %{IPORHOST:clientip} %{NOTSPACE:useragent} %{NOTSPACE:referrer} %{NUMBER:response} %{NUMBER:subresponse} %{NUMBER:scstatus} %{NUMBER:timetaken}"]
}
}
}
output {
elasticsearch {
protocol => "http"
host => "10.191.132.86"
port => "9200"
}
}
It looks like you can filter different data by setting the type and then branching on it with if/else. But if both kinds of logs come from the same source, how do I specify different types?
:) Thanks!
NXLog sets the field SourceModuleName with the value iis_1, iis_2, etc. You may want to use this instead.
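One way to do this is to filter on a known field that exists in one log type and won't exist in the other, for example cs_bytes:
e.g.
if [iisfield] {
  # iisfield is a placeholder for any field that only IIS records contain
  mutate { replace => { "type" => "iis" } }
} else {
  mutate { replace => { "type" => "eventlog" } }
}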
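To illustrate the idea of telling the two streams apart by their content, here is a small Python sketch. It assumes, as in the nxlog.conf from the question, that only the eventlog input calls to_json(), so event log records arrive as JSON objects while IIS records arrive as plain W3C text lines. The function name `classify` and the sample lines are made up for illustration.

```python
import json


def classify(line):
    """Return "eventlog" for JSON object lines and "iis" for anything else."""
    try:
        record = json.loads(line)
    except ValueError:
        # Not valid JSON: treat it as a raw IIS W3C log line.
        return "iis"
    # Only a JSON object counts as an event log record.
    return "eventlog" if isinstance(record, dict) else "iis"


if __name__ == "__main__":
    # Hypothetical samples of what each NXLog input might send.
    eventlog_line = '{"SourceModuleName": "eventlog", "EventID": 4624}'
    iis_line = "2015-01-01 00:00:00 10.0.0.1 GET /index.html - 80 - 10.0.0.2 Mozilla - 200 0 0 15"
    print(classify(eventlog_line), classify(iis_line))
```

The same distinction could be made on the SourceModuleName field instead, provided every input converts its records to JSON so that the field is actually sent.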
I have written an IIS and Event Log agent that captures logs for Logit.io; it might already do everything you want.

Resources