Exception while parsing logfile in logstash - logstash-grok

I am trying to parse Selenium Log file using logstash , but i am getting exception while parsing.
grok pattern used :
grok {
match => { "message" => "%{LOGLEVEL:severity} *%{GREEDYDATA:msg}" }
}
and sample log file is :
INFO Parental Page Loaded and Lock Content displayed
INFO Received PIN:1234
INFO The UI element mtfileTV.ParentalControls.Enabled [Enabled On] is displayed
INFO Parental Switch is already Enabled
ERROR Exception occoured while finding the given element name "mtfileTV.ParentalControls.Disabled" type "NAME" description " Parental lock is disabled text status" Value "Disabled Off"
ERROR An element could not be located on the page using the given search parameters. (WARNING: The server did not provide any stacktrace information)
Command duration or timeout: 382 milliseconds
For documentation on this error, please visit: http://seleniumhq.org/exceptions/no_such_element.html
Build info: version: 'unknown', revision: 'b526bd5', time: '2017-03-07 11:11:07 -0800'
System info: host: 'IN5033-01', ip: '192.168.86.127', os.name: 'Windows 10', os.arch: 'amd64', os.version: '10.0', java.version: '1.8.0_151'
Driver info: io.appium.java_client.windows.WindowsDriver
Capabilities [{app=mtfileTVLLC.mtfileTV_vgszm6stshdqy!App, platformName=Windows, platform=ANY}]
Session ID: 9D664F38-2D99-4751-A535-60DF0466A08F
*** Element info: {Using=name, value=Disabled Off}
ERROR Exception occoured while finding the given element name "mtfileTV.ParentalControls.Disabled" type "NAME" description " Parental lock is disabled text status" Value "Disabled Off" The specified element mtfileTV.ParentalControls.Disabled Parental lock is disabled text status does not exist
ERROR The UI element mtfileTV.ParentalControls.Disabled is not displayed
INFO Parental control is already Enabled.Process ahead
INFO The UI element mtfileTV.ParentalControls.Status.RNC17TVMA.Locked [R, NC-17, TV-MA Locked] is displayed
INFO The UI element mtfileTV.ParentalControls.Status.RNC17TVMA.Locked [R, NC-17, TV-MA Locked] is displayed
INFO The text "R, NC-17, TV-MA Locked" got from UI element with name mtfileTV.ParentalControls.Status.RNC17TVMA.Locked [R, NC-17, TV-MA Locked]
INFO The UI element mtfileTV.ParentalControls.Status.RNC17TVMA.Locked [R, NC-17, TV-MA Locked] is displayed
INFO The text "R, NC-17, TV-MA Locked" got from UI element with name mtfileTV.ParentalControls.Status.RNC17TVMA.Locked [R, NC-17, TV-MA Locked]
INFO Cur status of the Requested Rating :R, NC-17, TV-MA Locked: is = R, NC-17, TV-MA Locked
INFO R, NC-17, TV-MA Locked is already locked.Do Nothing.
ERROR Exception occoured while finding the given element name "mtfileTV.ParentalControls.Disabled" type "NAME" description " Parental lock is disabled text status" Value "Disabled Off"
ERROR An element could not be located on the page using the given search parameters. (WARNING: The server did not provide any stacktrace information)
Command duration or timeout: 438 milliseconds
For documentation on this error, please visit: http://seleniumhq.org/exceptions/no_such_element.html
Build info: version: 'unknown', revision: 'b526bd5', time: '2017-03-07 11:11:07 -0800'
System info: host: 'IN5033-01', ip: '192.168.86.127', os.name: 'Windows 10', os.arch: 'amd64', os.version: '10.0', java.version: '1.8.0_151'
Driver info: io.appium.java_client.windows.WindowsDriver
Capabilities [{app=mtfileTVLLC.mtfileTV_vgszm6stshdqy!App, platformName=Windows, platform=ANY}]
Please suggest me suitable grok pattern

The problem is that you have messages consisting of multiple lines. Therefore, you need to use the multiline codec to join all lines which do not start with a LOGLEVEL.
We can write a sample configuration (call it selenium.conf):
input {
stdin {
codec => multiline {
pattern => "^%{LOGLEVEL} "
negate => true
what => "previous"
}
}
}
filter {
grok {
match => { "message" => "%{LOGLEVEL:severity} %{GREEDYDATA:lines}" }
remove_field => [ "message" ]
}
}
output {
stdout {
codec => rubydebug
}
}
If we assume that the sample log file you provided is stored in selenium.log, we can do
$ logstash -f selenium.conf < selenium.log
and see that it works :)
Sample output follows (note the multiline tag in the first entry):
{
"host" => "d125q-PC",
"#timestamp" => 2018-07-04T11:10:28.141Z,
"lines" => "An element could not be located on the page using the given search parameters. (WARNING: The server did not provide any stacktrace information)\nCommand duration or timeout: 382 milliseconds\nFor documentation on this error, please visit: http://seleniumhq.org/exceptions/no_such_element.html\nBuild info: version: 'unknown', revision: 'b526bd5', time: '2017-03-07 11:11:07 -0800'\nSystem info: host: 'IN5033-01', ip: '192.168.86.127', os.name: 'Windows 10', os.arch: 'amd64', os.version: '10.0', java.version: '1.8.0_151'\nDriver info: io.appium.java_client.windows.WindowsDriver\nCapabilities [{app=mtfileTVLLC.mtfileTV_vgszm6stshdqy!App, platformName=Windows, platform=ANY}]\nSession ID: 9D664F38-2D99-4751-A535-60DF0466A08F\n*** Element info: {Using=name, value=Disabled Off}",
"tags" => [
[0] "multiline"
],
"#version" => "1",
"severity" => "ERROR"
}
{
"host" => "d125q-PC",
"#timestamp" => 2018-07-04T11:10:28.142Z,
"lines" => "The UI element mtfileTV.ParentalControls.Status.RNC17TVMA.Locked [R, NC-17, TV-MA Locked] is displayed",
"#version" => "1",
"severity" => "INFO"
}
{
"host" => "d125q-PC",
"#timestamp" => 2018-07-04T11:10:28.144Z,
"lines" => "The text \"R, NC-17, TV-MA Locked\" got from UI element with name mtfileTV.ParentalControls.Status.RNC17TVMA.Locked [R, NC-17, TV-MA Locked]",
"#version" => "1",
"severity" => "INFO"
}

Related

Stormcrawler not retrieving all text content from web page

I'm attempting to use Stormcrawler to crawl a set of pages on our website, and while it is able to retrieve and index some of the page's text, it's not capturing a large amount of other text on the page.
I've installed Zookeeper, Apache Storm, and Stormcrawler using the Ansible playbooks provided here (thank you a million for those!) on a server running Ubuntu 18.04, along with Elasticsearch and Kibana. For the most part, I'm using the configuration defaults, but have made the following changes:
For the Elastic index mappings, I've enabled _source: true, and turned on indexing and storing for all properties (content, host, title, url)
In the crawler-conf.yaml configuration, I've commented out all textextractor.include.pattern and textextractor.exclude.tags settings, to enforce capturing the whole page
After re-creating fresh ES indices, running mvn clean package, and then starting the crawler topology, stormcrawler begins doing its thing and content starts appearing in Elasticsearch. However, for many pages, the content that's retrieved and indexed is only a subset of all the text on the page, and usually excludes the main page text we are interested in.
For example, the text in the following XML path is not returned/indexed:
<html> <body> <div#maincontentcontainer.container> <div#docs-container> <div> <div.row> <div.col-lg-9.col-md-8.col-sm-12.content-item> <div> <div> <p> (text)
While the text in this path is returned:
<html> <body> <div> <div.container> <div.row> <p> (text)
Are there any additional configuration changes that need to be made beyond commenting out all specific tag include and exclude patterns? From my understanding of the documentation, the default settings for those options are to enforce the whole page to be indexed.
I would greatly appreciate any help. Thank you for the excellent software.
Below are my configuration files:
crawler-conf.yaml
config:
topology.workers: 3
topology.message.timeout.secs: 1000
topology.max.spout.pending: 100
topology.debug: false
fetcher.threads.number: 100
# override the JVM parameters for the workers
topology.worker.childopts: "-Xmx2g -Djava.net.preferIPv4Stack=true"
# mandatory when using Flux
topology.kryo.register:
- com.digitalpebble.stormcrawler.Metadata
# metadata to transfer to the outlinks
# metadata.transfer:
# - customMetadataName
# lists the metadata to persist to storage
metadata.persist:
- _redirTo
- error.cause
- error.source
- isSitemap
- isFeed
http.agent.name: "My crawler"
http.agent.version: "1.0"
http.agent.description: ""
http.agent.url: ""
http.agent.email: ""
# The maximum number of bytes for returned HTTP response bodies.
http.content.limit: -1
# FetcherBolt queue dump => comment out to activate
# fetcherbolt.queue.debug.filepath: "/tmp/fetcher-dump-{port}"
parsefilters.config.file: "parsefilters.json"
urlfilters.config.file: "urlfilters.json"
# revisit a page daily (value in minutes)
fetchInterval.default: 1440
# revisit a page with a fetch error after 2 hours (value in minutes)
fetchInterval.fetch.error: 120
# never revisit a page with an error (or set a value in minutes)
fetchInterval.error: -1
# text extraction for JSoupParserBolt
# textextractor.include.pattern:
# - DIV[id="maincontent"]
# - DIV[itemprop="articleBody"]
# - ARTICLE
# textextractor.exclude.tags:
# - STYLE
# - SCRIPT
# configuration for the classes extending AbstractIndexerBolt
# indexer.md.filter: "someKey=aValue"
indexer.url.fieldname: "url"
indexer.text.fieldname: "content"
indexer.canonical.name: "canonical"
indexer.md.mapping:
- parse.title=title
- parse.keywords=keywords
- parse.description=description
- domain=domain
# Metrics consumers:
topology.metrics.consumer.register:
- class: "org.apache.storm.metric.LoggingMetricsConsumer"
parallelism.hint: 1
http.protocol.implementation: "com.digitalpebble.stormcrawler.protocol.selenium.RemoteDriverProtocol"
https.protocol.implementation: "com.digitalpebble.stormcrawler.protocol.selenium.RemoteDriverProtocol"
selenium.addresses: "http://localhost:9515"
es-conf.yaml
config:
# ES indexer bolt
es.indexer.addresses: "localhost"
es.indexer.index.name: "content"
# es.indexer.pipeline: "_PIPELINE_"
es.indexer.create: false
es.indexer.bulkActions: 100
es.indexer.flushInterval: "2s"
es.indexer.concurrentRequests: 1
# ES metricsConsumer
es.metrics.addresses: "http://localhost:9200"
es.metrics.index.name: "metrics"
# ES spout and persistence bolt
es.status.addresses: "http://localhost:9200"
es.status.index.name: "status"
es.status.routing: true
es.status.routing.fieldname: "key"
es.status.bulkActions: 500
es.status.flushInterval: "5s"
es.status.concurrentRequests: 1
# spout config #
# positive or negative filters parsable by the Lucene Query Parser
# es.status.filterQuery:
# - "-(key:stormcrawler.net)"
# - "-(key:digitalpebble.com)"
# time in secs for which the URLs will be considered for fetching after a ack of fail
spout.ttl.purgatory: 30
# Min time (in msecs) to allow between 2 successive queries to ES
spout.min.delay.queries: 2000
# Delay since previous query date (in secs) after which the nextFetchDate value will be reset to the current time
spout.reset.fetchdate.after: 120
es.status.max.buckets: 50
es.status.max.urls.per.bucket: 2
# field to group the URLs into buckets
es.status.bucket.field: "key"
# fields to sort the URLs within a bucket
es.status.bucket.sort.field:
- "nextFetchDate"
- "url"
# field to sort the buckets
es.status.global.sort.field: "nextFetchDate"
# CollapsingSpout : limits the deep paging by resetting the start offset for the ES query
es.status.max.start.offset: 500
# AggregationSpout : sampling improves the performance on large crawls
es.status.sample: false
# max allowed duration of a query in sec
es.status.query.timeout: -1
# AggregationSpout (expert): adds this value in mins to the latest date returned in the results and
# use it as nextFetchDate
es.status.recentDate.increase: -1
es.status.recentDate.min.gap: -1
topology.metrics.consumer.register:
- class: "com.digitalpebble.stormcrawler.elasticsearch.metrics.MetricsConsumer"
parallelism.hint: 1
#whitelist:
# - "fetcher_counter"
# - "fetcher_average.bytes_fetched"
#blacklist:
# - "__receive.*"
es-crawler.flux
name: "crawler"
includes:
- resource: true
file: "/crawler-default.yaml"
override: false
- resource: false
file: "crawler-conf.yaml"
override: true
- resource: false
file: "es-conf.yaml"
override: true
spouts:
- id: "spout"
className: "com.digitalpebble.stormcrawler.elasticsearch.persistence.AggregationSpout"
parallelism: 10
- id: "filespout"
className: "com.digitalpebble.stormcrawler.spout.FileSpout"
parallelism: 1
constructorArgs:
- "."
- "seeds.txt"
- true
bolts:
- id: "filter"
className: "com.digitalpebble.stormcrawler.bolt.URLFilterBolt"
parallelism: 3
- id: "partitioner"
className: "com.digitalpebble.stormcrawler.bolt.URLPartitionerBolt"
parallelism: 3
- id: "fetcher"
className: "com.digitalpebble.stormcrawler.bolt.FetcherBolt"
parallelism: 3
- id: "sitemap"
className: "com.digitalpebble.stormcrawler.bolt.SiteMapParserBolt"
parallelism: 3
- id: "parse"
className: "com.digitalpebble.stormcrawler.bolt.JSoupParserBolt"
parallelism: 12
- id: "index"
className: "com.digitalpebble.stormcrawler.elasticsearch.bolt.IndexerBolt"
parallelism: 3
- id: "status"
className: "com.digitalpebble.stormcrawler.elasticsearch.persistence.StatusUpdaterBolt"
parallelism: 3
- id: "status_metrics"
className: "com.digitalpebble.stormcrawler.elasticsearch.metrics.StatusMetricsBolt"
parallelism: 3
streams:
- from: "spout"
to: "partitioner"
grouping:
type: SHUFFLE
- from: "spout"
to: "status_metrics"
grouping:
type: SHUFFLE
- from: "partitioner"
to: "fetcher"
grouping:
type: FIELDS
args: ["key"]
- from: "fetcher"
to: "sitemap"
grouping:
type: LOCAL_OR_SHUFFLE
- from: "sitemap"
to: "parse"
grouping:
type: LOCAL_OR_SHUFFLE
- from: "parse"
to: "index"
grouping:
type: LOCAL_OR_SHUFFLE
- from: "fetcher"
to: "status"
grouping:
type: FIELDS
args: ["url"]
streamId: "status"
- from: "sitemap"
to: "status"
grouping:
type: FIELDS
args: ["url"]
streamId: "status"
- from: "parse"
to: "status"
grouping:
type: FIELDS
args: ["url"]
streamId: "status"
- from: "index"
to: "status"
grouping:
type: FIELDS
args: ["url"]
streamId: "status"
- from: "filespout"
to: "filter"
grouping:
type: FIELDS
args: ["url"]
streamId: "status"
- from: "filter"
to: "status"
grouping:
streamId: "status"
type: CUSTOM
customClass:
className: "com.digitalpebble.stormcrawler.util.URLStreamGrouping"
constructorArgs:
- "byDomain"
parsefilters.json
{
"com.digitalpebble.stormcrawler.parse.ParseFilters": [
{
"class": "com.digitalpebble.stormcrawler.parse.filter.XPathFilter",
"name": "XPathFilter",
"params": {
"canonical": "//*[#rel=\"canonical\"]/#href",
"parse.description": [
"//*[#name=\"description\"]/#content",
"//*[#name=\"Description\"]/#content"
],
"parse.title": [
"//TITLE",
"//META[#name=\"title\"]/#content"
],
"parse.keywords": "//META[#name=\"keywords\"]/#content"
}
},
{
"class": "com.digitalpebble.stormcrawler.parse.filter.LinkParseFilter",
"name": "LinkParseFilter",
"params": {
"pattern": "//FRAME/#src"
}
},
{
"class": "com.digitalpebble.stormcrawler.parse.filter.DomainParseFilter",
"name": "DomainParseFilter",
"params": {
"key": "domain",
"byHost": false
}
},
{
"class": "com.digitalpebble.stormcrawler.parse.filter.CommaSeparatedToMultivaluedMetadata",
"name": "CommaSeparatedToMultivaluedMetadata",
"params": {
"keys": ["parse.keywords"]
}
}
]
}
Attempting to use Chromedriver
I installed the latest versions of Chromedriver and Google Chrome for Ubuntu.
First I start chromedriver in headless mode at localhost:9515 as the stormcrawler user (via a separate python shell, as shown below), and then I restart the stormcrawler topology (also as stormcrawler user) but end up with a stack of errors related to Chrome. The odd thing however is that I can confirm chromedriver is running OK within the Python shell directly, and I can confirm that both the driver and browser are actively running via ps -ef). This same stack of errors also occurs when I attempt to simply start chromedriver from the command line (i.e., chromedriver --headless &).
Starting chromedriver in headless mode (in python3 shell)
from selenium import webdriver
options = webdriver.ChromeOptions()
options.add_argument('--no-sandbox')
options.add_argument('--headless')
options.add_argument('--window-size=1200x600')
options.add_argument('--disable-dev-shm-usage')
options.add_argument('--disable-setuid-sandbox')
options.add_argument('--disable-extensions')
options.add_argument('--disable-infobars')
options.add_argument('--remote-debugging-port=9222')
options.add_argument('--user-data-dir=/home/stormcrawler/cache/google/chrome')
options.add_argument('--disable-gpu')
options.add_argument('--profile-directory=Default')
options.binary_location = '/usr/bin/google-chrome'
driver = webdriver.Chrome(chrome_options=options, port=9515, executable_path=r'/usr/bin/chromedriver')
Stack trace from starting stormcrawler topology
Run command: storm jar target/stormcrawler-1.0-SNAPSHOT.jar org.apache.storm.flux.Flux --local es-crawler.flux --sleep 60000
9486 [Thread-26-fetcher-executor[3 3]] ERROR o.a.s.util - Async loop died!
java.lang.RuntimeException: org.openqa.selenium.WebDriverException: unknown error: Chrome failed to start: exited abnormally.
(unknown error: DevToolsActivePort file doesn't exist)
(The process started from chrome location /usr/bin/google-chrome is no longer running, so ChromeDriver is assuming that Chrome has crashed.)
Build info: version: '4.0.0-alpha-6', revision: '5f43a29cfc'
System info: host: 'stormcrawler-dev', ip: '127.0.0.1', os.name: 'Linux', os.arch: 'amd64', os.version: '4.15.0-33-generic', java.version: '1.8.0_282'
Driver info: driver.version: RemoteWebDriver
remote stacktrace: #0 0x55d590b21e89 <unknown>
at com.digitalpebble.stormcrawler.protocol.selenium.RemoteDriverProtocol.configure(RemoteDriverProtocol.java:101) ~[stormcrawler-1.0-SNAPSHOT.jar:?]
at com.digitalpebble.stormcrawler.protocol.ProtocolFactory.<init>(ProtocolFactory.java:69) ~[stormcrawler-1.0-SNAPSHOT.jar:?]
at com.digitalpebble.stormcrawler.bolt.FetcherBolt.prepare(FetcherBolt.java:818) ~[stormcrawler-1.0-SNAPSHOT.jar:?]
at org.apache.storm.daemon.executor$fn__10180$fn__10193.invoke(executor.clj:803) ~[storm-core-1.2.3.jar:1.2.3]
at org.apache.storm.util$async_loop$fn__624.invoke(util.clj:482) [storm-core-1.2.3.jar:1.2.3]
at clojure.lang.AFn.run(AFn.java:22) [clojure-1.7.0.jar:?]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_282]
Caused by: org.openqa.selenium.WebDriverException: unknown error: Chrome failed to start: exited abnormally.
(unknown error: DevToolsActivePort file doesn't exist)
(The process started from chrome location /usr/bin/google-chrome is no longer running, so ChromeDriver is assuming that Chrome has crashed.)
...
Confirming that chromedriver and chrome are both running and reachable
~/stormcrawler$ ps -ef | grep -i 'driver'
stormcr+ 18862 18857 0 14:28 pts/0 00:00:00 /usr/bin/chromedriver --port=9515
stormcr+ 18868 18862 0 14:28 pts/0 00:00:00 /usr/bin/google-chrome --disable-background-networking --disable-client-side-phishing-detection --disable-default-apps --disable-dev-shm-usage --disable-extensions --disable-gpu --disable-hang-monitor --disable-infobars --disable-popup-blocking --disable-prompt-on-repost --disable-setuid-sandbox --disable-sync --enable-automation --enable-blink-features=ShadowDOMV0 --enable-logging --headless --log-level=0 --no-first-run --no-sandbox --no-service-autorun --password-store=basic --profile-directory=Default --remote-debugging-port=9222 --test-type=webdriver --use-mock-keychain --user-data-dir=/home/stormcrawler/cache/google/chrome --window-size=1200x600
stormcr+ 18899 18877 0 14:28 pts/0 00:00:00 /opt/google/chrome/chrome --type=renderer --no-sandbox --disable-dev-shm-usage --enable-automation --enable-logging --log-level=0 --remote-debugging-port=9222 --test-type=webdriver --allow-pre-commit-input --ozone-platform=headless --field-trial-handle=17069524199442920904,10206176048672570859,131072 --disable-gpu-compositing --enable-blink-features=ShadowDOMV0 --lang=en-US --headless --enable-crash-reporter --lang=en-US --num-raster-threads=1 --renderer-client-id=4 --shared-files=v8_context_snapshot_data:100
~/stormcrawler$ sudo netstat -lp
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address Foreign Address State PID/Program name
tcp 0 0 localhost:9222 0.0.0.0:* LISTEN 18026/google-chrome
tcp 0 0 localhost:9515 0.0.0.0:* LISTEN 18020/chromedriver
IIRC you need to set some additional config to work with ChomeDriver.
Alternatively (haven't tried yet) https://hub.docker.com/r/browserless/chrome would be a nice way of handling Chrome in a Docker container.

Unable to send newly updated events form filebeat to logstash

filebeat.inputs:
- type: log
# Change to true to enable this input configuration.
enabled: true
# Paths that should be crawled and fetched. Glob based paths.
paths:
- C:\Office\Work\processed_fileChaser_2020-05-22_12_56.log
close_eof: true
close_inactive: 10s
scan_frequency: 15s
tail_files: true
output.logstash:
# The Logstash hosts
hosts: ["XX.XXX.XXX.XX:5044"]
------------------------ logsatsh config -----------------------------------
input {
beats {
port => 5044
}
}
filter {
json {
source => "message"
}
mutate {
remove_field => ["tags", "ecs", "log", "agent", "message", "input"]
}
}
output {
file {
path => "C:/Softwares/logstash-7.5.2/output/dm_filechaser.json"
}
}
Sample input:
{"fileName": "DM.YYYYMMDD.TXT", "country": "TW", "type": "TW ", "app_code": "DM", "upstream": "XXXX", "insertionTime": "2020-05-20T13:01:04+08:00"}
{"fileName": "TRANSACTION.YYYYMMDD.TXT", "country": "TW", "type": "TW", "app_code": "DM", "upstream": "XXXX", "insertionTime": "2020-05-22T13:01:04+08:00"} ```
Versions:
Logstash: 7.5.2
Filebeat: 7.5.2
filebeat unable to send new data and always sends same data to logstash from file beat.
Can you please help how to send newly updated data from log to logsatsh.
Filebeat logs:
2020-05-22T15:39:30.630+0530 ERROR logstash/async.go:256 Failed to publish events caused by: read tcp 10.10.242.48:53400->10.10.242.48:5044: i/o timeout
2020-05-22T15:39:30.632+0530 ERROR logstash/async.go:256 Failed to publish events caused by: client is not connected
2020-05-22T15:39:32.605+0530 ERROR pipeline/output.go:121 Failed to publish events: client is not connected
2020-05-22T15:39:32.605+0530 INFO pipeline/output.go:95 Connecting to backoff(async(tcp://10.10.242.48:5044))
2020-05-22T15:39:32.606+0530 INFO pipeline/output.go:105 Connection to backoff(async(tcp://10.10.242.48:5044)) established
2020-05-22T15:39:44.588+0530 INFO [monitoring] log/log.go:145 Non-zero metrics in the last 30s {"monitoring": {"metrics": {"beat":{"cpu":{"system":{"ticks":1203,"time":{"ms":32}},"total":{"ticks":1890,"time":{"ms":48},"value":1890},"user":{"ticks":687,"time":{"ms":16}}},"handles":{"open":280},"info":{"ephemeral_id":"970ad9cc-16a8-4bf1-85ed-4dd774f0df42","uptime":{"ms":61869}},"memstats":{"gc_next":11787664,"memory_alloc":8709864,"memory_total":16065440,"rss":2510848},"runtime":{"goroutines":26}},"filebeat":{"events":{"active":2,"added":2},"harvester":{"closed":1,"open_files":0,"running":0,"started":1}},"libbeat":{"config":{"module":{"running":0}},"output":{"events":{"batches":2,"failed":4,"total":4},"read":{"errors":1},"write":{"bytes":730}},"pipeline":{"clients":1,"events":{"active":2,"filtered":2,"retry":6,"total":2}}},"registrar":{"states":{"current":1}}}}}
2020-05-22T15:39:44.657+0530 INFO log/harvester.go:251 Harvester started for file: C:\Office\Work\processed_fileChaser_2020-05-22_12_56.log

GROK pattern cannot parse logs

I want to launch the ELK-stack for gathering syslog from all my network equipment - Cisco, F5, Huawei, CheckPoint, etc. While experimenting with Logstash, writing grok patterns.
Below is an example of messages from Cisco ASR:
<191>Oct 30 16:30:10 evlogd: [local-60sec10.950] [cli 30000 debug] [8/0/30501 cliparse.c:367] [context: local, contextID: 1] [software internal system syslog] CLI command [user root, mode [local]ASR5K]: show ims-authorization policy-control statistics\u0000
<190>Oct 30 16:30:10 evlogd: [local-60sec10.959] [cli 30005 info] [8/0/30501 _commands_cli.c:1792] [software internal system syslog] CLI session ended for Security Administrator root on device /dev/pts/7\u0000
<190>Oct 30 16:30:10 evlogd: [local-60sec10.981] [snmp 22002 info] [8/0/4550 trap_api.c:930] [software internal system syslog] Internal trap notification 53 (CLISessEnd) user root privilege level Security Administrator ttyname /dev/pts/7\u0000
<190>Oct 30 16:30:12 evlogd: [local-60sec12.639] [cli 30004 info] [8/0/30575 cli_sess.c:127] [software internal system syslog] CLI session started for Security Administrator root on device /dev/pts/7 from 192.168.1.1\u0000
<190>Oct 30 16:30:12 evlogd: [local-60sec12.640] [snmp 22002 info] [8/0/30575 trap_api.c:930] [software internal system syslog] Internal trap notification 52 (CLISessStart) user root privilege level Security Administrator ttyname /dev/pts/7\u0000
All of them matching with my pattern here and here.
<%{POSINT:syslog_pri}>%{DATA:month} %{DATA:monthday} %{TIME:time} %{WORD:device}: \[%{WORD:facility}\-%{HOSTNAME}\] \[%{WORD:service} %{POSINT} %{WORD:priority}\] \[%{DATA}\] ?(\[context: %{DATA:context}, %{DATA}\])?%{SPACE}?(\[%{DATA}\] )%{GREEDYDATA:message}\\u0000
But my simple logstash configuration return tag _grokparsefailure (or _grokparsefailure_sysloginput if I use GROK in syslog input plugin), and doesn't parse my log.
Config using GROK-filter
input { udp {
port => 5140
type => syslog } }
filter {
if [type] == "syslog" {
grok {
match => ["message", "<%{POSINT:syslog_pri}>%{DATA:month} %{DATA:monthday} %{TIME:time} %{WORD:device}: \[%{WORD:facility}\-%{HOSTNAME}\] \[%{WORD:service} %{POSINT} %{WORD:priority}\] \
[%{DATA}\] ?(\[context: %{DATA:context}, %{DATA}\])?%{SPACE}?(\[%{DATA}\] )%{GREEDYDATA:response}\\u0000"]
}
}
}
output { stdout { codec => rubydebug } }
Output:
{
"#version" => "1",
"host" => "172.17.0.1",
"#timestamp" => 2018-10-31T09:46:51.121Z,
"message" => "<190>Oct 31 15:46:51 evlogd: [local-60sec51.119] [snmp 22002 info] [8/0/4550 <sitmain:80> trap_api.c:930] [software internal system syslog] Internal trap notification 53 (CLISessEnd) user kiwi privilege level Security Administrator ttyname /dev/pts/7\u0000",
"type" => "syslog",
"tags" => [
[0] "_grokparsefailure"
]
}
Config syslog-input-plugin:
input {
syslog {
port => 5140
grok_pattern => "<%{POSINT:syslog_pri}>%{DATA:month} %{DATA:monthday} %{TIME:time} %{WORD:device}: \[%{WORD:facility}\-%{HOSTNAME}\] \[%{WORD:service} %{POSINT} %{WORD:priority}\] \[%{DATA
}\] ?(\[context: %{DATA:context}, %{DATA}\])?%{SPACE}?(\[%{DATA}\] )%{GREEDYDATA:response}\\u0000"
}
}
output {
stdout { codec => rubydebug }
}
Output:
{
"severity" => 0,
"#timestamp" => 2018-10-31T09:54:56.871Z,
"#version" => "1",
"host" => "172.17.0.1",
"message" => "<191>Oct 31 15:54:56 evlogd: [local-60sec56.870] [cli 30000 debug] [8/0/22400 <cli:8022400> cliparse.c:367] [context: local, contextID: 1] [software internal system syslog] CLI command [user kiwi, mode [local]ALA3_ASR5K]: show subscribers ggsn-only sum apn osmp\u0000",
"tags" => [
[0] "_grokparsefailure_sysloginput"
],
}
What am I doing wrong? And can someone help fix it?
PS Tested on logstash 2.4.1 and 5
Unlike the online-debuggers, logstash's GROK didn't like my \\u0000 at the end of pattern.
With single backslash all is working.
Right grok-filter is:
<%{POSINT:syslog_pri}>%{DATA:month} %{DATA:monthday} %{TIME:time} %{WORD:device}: \[%{WORD:facility}\-%{HOSTNAME}\] \[%{WORD:service} %{POSINT} %{WORD:priority}\] \[%{DATA}\] ?(\[context: %{DATA:context}, %{DATA}\])?%{SPACE}?(\[%{DATA}\] )%{GREEDYDATA:message}\u0000
I have faced similar problem. The workaround is just to received those logs from routers/firewalls/switches to a syslog-ng server and then forward to the logstash.
Following is a sample configuration for syslog-ng,
source s_router1 {
udp(ip(0.0.0.0) port(1514));
tcp(ip(0.0.0.0) port(1514));
};
destination d_router1_logstash { tcp("localhost",port(5045)); };
log { source(s_router1); destination(d_router1_logstash); };

How to remove filebeat tags like id, hostname, version, grok_failure message

I am new to elk my sample log is look like
2017-01-05T14:28:00 INFO zeppelin IDExtractionService transactionId abcdef1234 operation extractOCRData received request duration 12344 exception error occured
my filebeat configuration is below
filebeat.prospectors:
- input_type: log
paths:
- /opt/apache-tomcat-7.0.82/logs/*.log
document_type: apache-access
fields_under_root: true
output.logstash:
hosts: ["10.2.3.4:5044"]
And my logstash filter.conf file:
filter {
grok {
match => [ "message", "transactionId %{WORD:transaction_id} operation %{WORD:otype} received request duration %{NUMBER:duration} exception %{WORD:error}" ]
}
}
filter {
if "beats_input_codec_plain_applied" in [tags] {
mutate {
remove_tag => ["beats_input_codec_plain_applied"]
}
}
}
;
In kibana dashboard i can see log output as below
beat.name:
ebb8a5ec413b
beat.hostname:
ebb8a5ec413b
host:
ebb8a5ec413b
tags:
beat.version:
6.2.2
source:
/opt/apache-tomcat-7.0.82/logs/IDExtraction.log
otype:
extractOCRData
duration:
12344
transaction_id:
abcdef1234
#timestamp:
April 9th 2018, 16:20:31.853
offset:
805,655
#version:
1
error:
error
message:
2017-01-05T14:28:00 INFO zeppelin IDExtractionService transactionId abcdef1234 operation extractOCRData received request duration 12344 exception error occured
_id:
7X0HqmIBj3MEd9pqhTu9
_type:
doc
_index:
filebeat-2018.04.09
_score:
6.315
1 First question is how to remove filebeat tag like id,hostname,version,grok_failure message
2 how to sort logs on timestamp basis because Newly generated logs not appearing on top of kibana dashboard
3 Is there any changes required in my grok filter
You can remove filebeat tags by setting the value of fields_under_root: false in filebeat configuration file. You can read about this option here.
If this option is set to true, the custom fields are stored as
top-level fields in the output document instead of being grouped under
a fields sub-dictionary. If the custom field names conflict with other
field names added by Filebeat, the custom fields overwrite the other
fields.
you can check if _grokparsefailure is in tags using, if "_grokparsefailure" in [tags] and remove it with remove_tag => ["_grokparsefailure"]
Your grok filter seems to be alright.
Hope it helps.

Multiline codec in Logstash 5.2.0 not working

The following codec config in Logstash in never detecting a new line:
input {
file {
path => "c:\temp\log5.log"
type => "log4net"
codec => multiline {
pattern => "^hello"
negate => true
what => previous
}
}
}
Please can someone confirm if my interpretation of the above config is correct:
If a line does not begin with the text "hello" then merge that line
with the previous line. Conversely if a line begins with the text
"hello" treat is as the start of a new log event.
With the above config, Logstash never detects a new line in my log file even though I have a few lines starting with "hello". Any ideas what the problem may be?
EDIT:
input {
file {
path => "//22.149.166.241/GatewayUnsecure/Log_2016.03.22_22.log"
start_position => "beginning"
type => "log4net"
codec => multiline {
pattern => "^%{YEAR}[/-]%{MONTHNUM}[/-]%{MONTHDAY}"
negate => true
what => previous
}
}
}
Log sample:
2016-03-22 22:00:07,768 [3] INFO AbCap.Cerberus [(null)] - Cerberus 'Cerberus Service Hosting - Unsecure', ('Local'), version 1.0.0.0, host 'WinService' 2016-03-22 22:00:07,783 [7] INFO AbCap.Cerberus [(null)] - Starting 'Cerberus Service Hosting - Unsecure' on JHBDSM020000273 in Local. 2016-03-22 22:00:07,783 [7] DEBUG AbCap.Cerberus [(null)] - Starting: WcfHostWorker 2016-03-22 22:00:07,783 [7] INFO AbCap.Cerberus [(null)] - is opening

Resources