Grok filter for selecting and formatting certain log lines - logstash

I am writing a grok filter to parse my application log, which is unstructured. What I need is to look for certain lines and generate output in a specific format. E.g., below are my logs:
2018-05-07 01:19:40 M :Memory (xivr = 513.2 Mb, system = 3502.0 Mb, physical = 5386.7 Mb), CpuLoad (sys = 0%, xivr = 0%)
2018-05-07 01:29:40 M :Memory (xivr = 513.2 Mb, system = 3495.3 Mb, physical = 5370.1 Mb), CpuLoad (sys = 0%, xivr = 0%)
2018-05-07 05:51:19 1 :Hangup call
***2018-05-07 05:51:22 24 :Answer call from 71840746 for 91783028 [C:\xivr\es\IVR-Dialin.dtx]***
2018-05-07 05:51:30 24 :Hangup call
***2018-05-07 05:51:34 24 :Answer call from 71840746 for 91783028 [C:\xivr\es\IVR-Dialin.dtx]***
2018-05-07 00:31:21 45 :Device Dialogic Digital dxxxB12C1 [gc60.dev - Dialogic (SDK 6.0) ver 3.0.702:11646] (ThreadID: 1FF0, DriverChannel: 44)
2018-05-07 00:31:22 40 :Device Dialogic Digital dxxxB10C4 [gc60.dev - Dialogic (SDK 6.0) ver 3.0.702:11646] (ThreadID: 1B2C, DriverChannel: 39)
I need to index only the lines highlighted with *** in Kibana, in the format below; other lines should simply be ignored:
Logtimestamp: 2018-05-07 05:51:22
Channel_id: 24
Source_number: 71840746
Destination_Number: 91783028
How can this be achieved?

You can explicitly write whatever is unique about that particular pattern, and use pre-defined grok patterns for the rest.
In your case, the grok pattern would be,
%{TIMESTAMP_ISO8601:Logtimestamp} %{NUMBER:Channel_id} :Answer call from %{NUMBER:Source_number} for %{NUMBER:Destination_Number} %{GREEDYDATA:etc}
It will only match lines of the following form:
2018-05-07 05:51:34 24 :Answer call from 71840746 for 91783028 [C:\xivr\es\IVR-Dialin.dtx]
Explanation
The syntax for a grok pattern is %{SYNTAX:SEMANTIC}.
In your filter,
%{TIMESTAMP_ISO8601:Logtimestamp} matches 2018-05-07 05:51:34
%{NUMBER:Channel_id} matches 24
:Answer call from matches the string literally
%{NUMBER:Source_number} matches 71840746
%{NUMBER:Destination_Number} matches 91783028
%{GREEDYDATA:etc} matches rest of the data i.e. [C:\xivr\es\IVR-Dialin.dtx]
in that order.
Output:
{
  "Logtimestamp": [
    [
      "2018-05-07 05:51:34"
    ]
  ],
  "Channel_id": [
    [
      "24"
    ]
  ],
  "Source_number": [
    [
      "71840746"
    ]
  ],
  "Destination_Number": [
    [
      "91783028"
    ]
  ],
  "etc": [
    [
      "[C:\\xivr\\es\\IVR-Dialin.dtx]"
    ]
  ]
}
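To ignore every other line, you can drop events where the grok match fails; a minimal sketch of the full filter (grok tags non-matching events with _grokparsefailure):
filter {
  grok {
    match => { "message" => "%{TIMESTAMP_ISO8601:Logtimestamp} %{NUMBER:Channel_id} :Answer call from %{NUMBER:Source_number} for %{NUMBER:Destination_Number} %{GREEDYDATA:etc}" }
  }
  # lines that don't match get tagged; drop them so they never reach Kibana
  if "_grokparsefailure" in [tags] {
    drop { }
  }
}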
You can test it with an online grok debugger, e.g. https://grokdebug.herokuapp.com/.
Hope it helps.

Related

Spacy matching priority

I'm looking to create a physics pattern library with spaCy:
I want to detect time and speed patterns. My aim is to stay flexible with those patterns.
import spacy
from spacy.matcher import Matcher

nlp = spacy.load('en_core_web_sm')

time_pattern = [
    [
        {'LIKE_NUM': True, 'OP': '?'},
        {'LOWER': {'IN': ['time', 's', 'h', 'min']}},
        {'LOWER': {'IN': ['maximum', 'minimum', 'min', 'max']}, 'OP': '?'}
    ]
]
speed_pattern = [
    [
        {'LIKE_NUM': True, 'OP': '?'},
        {'LOWER': {'IN': ['km', 'm']}},
        {'IS_PUNCT': True},
        {'LOWER': {'IN': ['h', 'hour', 's', 'min']}}
    ]
]

matcher = Matcher(nlp.vocab, validate=True)
matcher.add("SPEED", speed_pattern)
matcher.add("TIME", time_pattern)

doc = nlp("a certain time, more about 23 min, can't get above 25 km/h")
for id_match, start, end in matcher(doc):
    match_label = nlp.vocab[id_match].text
    print(match_label, '<--', doc[start:end])
So far my code returns this collection of matches:
TIME <-- time
TIME <-- 23 min
TIME <-- min
SPEED <-- 25 km/h
SPEED <-- km/h
TIME <-- h
I want the matcher to match only once, and to match "23 min" rather than "min". I would also like the matcher not to match an element that has already been matched (for example, "h" should not be matched because it was already matched as part of "km/h").
You can try adding greedy="LONGEST" to matcher.add() to return only the longest (or FIRST) matches:
matcher.add("SPEED", speed_pattern, greedy="LONGEST")
matcher.add("TIME", time_pattern, greedy="LONGEST")
But note that this doesn't handle overlaps across different match IDs:
TIME <-- 23 min
TIME <-- time
TIME <-- h
SPEED <-- 25 km/h
If you want to filter all of the matches, you can use matcher(doc, as_spans=True) to get the matches directly as spans, and then use spacy.util.filter_spans to reduce the whole list to non-overlapping spans with the longest spans preferred: https://spacy.io/api/top-level#util.filter_spans
The result:
[time, 23 min, 25 km/h]
You can use the as_spans=True option of spacy.matcher.Matcher (introduced in spaCy v3.0):
matches = matcher(doc, as_spans=True)
for span in spacy.util.filter_spans(matches):
    print(span.label_, "->", span.text)
From the documentation:
Instead of tuples, return a list of Span objects of the matches, with the match_id assigned as the span label. Defaults to False.
See the Python demo:
import spacy
from spacy.matcher import Matcher

nlp = spacy.load('en_core_web_sm')

time_pattern = [
    [
        {'LIKE_NUM': True, 'OP': '?'},
        {'LOWER': {'IN': ['time', 's', 'h', 'min']}},
        {'LOWER': {'IN': ['maximum', 'minimum', 'min', 'max']}, 'OP': '?'}
    ]
]
speed_pattern = [
    [
        {'LIKE_NUM': True, 'OP': '?'},
        {'LOWER': {'IN': ['km', 'm']}},
        {'IS_PUNCT': True},
        {'LOWER': {'IN': ['h', 'hour', 's', 'min']}}
    ]
]

matcher = Matcher(nlp.vocab, validate=True)
matcher.add("SPEED", speed_pattern)
matcher.add("TIME", time_pattern)

doc = nlp("a certain time, more about 23 min, can't get above 25 km/h")
matches = matcher(doc, as_spans=True)
for span in spacy.util.filter_spans(matches):
    print(span.label_, "->", span.text)
Output:
TIME -> time
TIME -> 23 min
SPEED -> 25 km/h

Why is white space not found in my NodeJS program?

For some obscure reason, my str.split(" ") command doesn't seem to work. I've been trying to debug the situation for a while now and can't seem to find the solution. Let me start off by saying that, unfortunately, I can't replicate the error. I've tried creating a JSFiddle, but it works correctly.
Here's my problem: I've got a JSObject "library" over which I'm looping to create MongoDocuments, and during this "construction" I get the following:
if (payload[i].asking) {
    let price = payload[i].asking;
    price = price.substring(1, price.length);
    console.log(price);
    console.log(price.indexOf(" "));
    const priceArr = price.split(" ");
    console.log(priceArr);
    price = priceArr[0];
    currency = priceArr[1];
    listing.set('asking', price);
    listing.set('currency', currency);
}
where:
- payload is the full JSObject "library",
- [i] is the current JSObject,
- and asking is the key I'm working on in this particular place of the code.
And here's the result:
500.00 eur
-1
[ '500.00 eur' ]
950.00 eur
-1
[ '950.00 eur' ]
5,000.00 usd
-1
[ '5,000.00 usd' ]
250.00 usd
-1
[ '250.00 usd' ]
800.00 usd
-1
[ '800.00 usd' ]
899.00 usd
-1
[ '899.00 usd' ]
3,500.00 usd
-1
[ '3,500.00 usd' ]
2,800.00 usd
-1
[ '2,800.00 usd' ]
2,250.00 usd
-1
[ '2,250.00 usd' ]
3,750.00 usd
-1
[ '3,750.00 usd' ]
1,500.00 usd
-1
[ '1,500.00 usd' ]
5,800.00 usd
-1
[ '5,800.00 usd' ]
2,500.00 usd
-1
[ '2,500.00 usd' ]
So I understand why price.split(" ") doesn't work: there is apparently no white space in the first place (indexOf(" ") === -1), but I'm not sure why or what's happening. payload[i].asking is a string all right (price.substring proves it), but I don't understand why this white space doesn't exist.
OK, so I found an alternative using a regex, even though it doesn't solve the exact problem I describe per se:
const priceArr = price.split(/(\s+)/)
It might help someone else...
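For what it's worth, a plausible cause (an assumption; the original data isn't available to confirm it) is that the "space" in the string is not an ASCII space at all but another Unicode whitespace character, such as a non-breaking space (U+00A0): indexOf(" ") and split(" ") only look for U+0020, while \s matches any whitespace. A quick Node demo with a made-up value:
// Hypothetical value: a non-breaking space (U+00A0) instead of a normal space
const price = "500.00\u00A0eur";

console.log(price.indexOf(" "));    // -1: no ASCII space in the string
console.log(price.split(" "));      // [ '500.00 eur' ]: nothing to split on
console.log(price.split(/\s+/));    // [ '500.00', 'eur' ]: \s matches U+00A0
Note that the capturing group in the answer's /(\s+)/ makes split() include the separator in the result array, so the currency lands at index 2 rather than 1; /\s+/ without the group avoids that.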

Logstash grok date parsefailure

With this filter
filter {
  grok {
    match => { "message" => "\[(?<timestamp>%{MONTHDAY}-%{MONTH}-%{YEAR} %{TIME} %{TZ})\] %{DATA:errortype}: %{GREEDYDATA:errormessage}" }
  }
  date {
    match => [ "timestamp", "dd-MMM-YYYY HH:mm:ss Z" ]
    #remove_field => ["timestamp"]
  }
}
And this line:
[04-Jul-2018 15:28:02 UTC] PHP Warning: count(): Parameter must be an array or an object that implements Countable in xxx.php on line 508
I get a date parse failure.
With https://grokdebug.herokuapp.com/ everything seems OK, and using -debug I only get this log:
[2018-07-09T08:38:32,925][DEBUG][logstash.inputs.file ] Received line {:path=>"/tmp/request.log", :text=>"[04-Jul-2018 15:28:02 UTC] PHP Warning: count(): Parameter must be an array or an object that implements Countable in xxx/program.php on line 508"}
[2018-07-09T08:38:32,941][DEBUG][logstash.inputs.file ] writing sincedb (delta since last write = 1531118312)
[2018-07-09T08:38:32,948][DEBUG][logstash.pipeline ] filter received {"event"=>{"@version"=>"1", "host"=>"guillaume", "path"=>"/tmp/request.log", "@timestamp"=>2018-07-09T06:38:32.939Z, "message"=>"[04-Jul-2018 15:28:02 UTC] PHP Warning: count(): Parameter must be an array or an object that implements Countable in xxx.php on line 508"}}
[2018-07-09T08:38:32,949][DEBUG][logstash.filters.grok ] Running grok filter {:event=>2018-07-09T06:38:32.939Z guillaume [04-Jul-2018 15:28:02 UTC] PHP Warning: count(): Parameter must be an array or an object that implements Countable in xxx/program.php on line 508}
[2018-07-09T08:38:32,950][DEBUG][logstash.filters.grok ] Event now: {:event=>2018-07-09T06:38:32.939Z guillaume [04-Jul-2018 15:28:02 UTC] PHP Warning: count(): Parameter must be an array or an object that implements Countable in xxx.php on line 508}
[2018-07-09T08:38:32,954][DEBUG][logstash.pipeline ] output received {"event"=>{"errormessage"=>" count(): Parameter must be an array or an object that implements Countable xxx.php on line 508", "path"=>"/tmp/request.log", "errortype"=>"PHP Warning", "@timestamp"=>2018-07-09T06:38:32.939Z, "@version"=>"1", "host"=>"guillaume", "message"=>"[04-Jul-2018 15:28:02 UTC] PHP Warning: count(): Parameter must be an array or an object that implements Countable in xxx.php on line 508", "timestamp"=>"04-Jul-2018 15:28:02 UTC", "tags"=>["_dateparsefailure"]}}
Change the YYYY to yyyy and the Z to z:
date {
  match => [ "timestamp", "dd-MMM-yyyy HH:mm:ss z" ]
}
For more details on the date format, refer to:
https://www.elastic.co/guide/en/logstash/6.3/plugins-filters-date.html#plugins-filters-date-match
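Putting it together, the whole corrected filter reads:
filter {
  grok {
    match => { "message" => "\[(?<timestamp>%{MONTHDAY}-%{MONTH}-%{YEAR} %{TIME} %{TZ})\] %{DATA:errortype}: %{GREEDYDATA:errormessage}" }
  }
  date {
    # yyyy is the plain year; z parses time-zone names such as "UTC",
    # while Z expects a numeric offset like +0000
    match => [ "timestamp", "dd-MMM-yyyy HH:mm:ss z" ]
    #remove_field => ["timestamp"]
  }
}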

logstash parse complex message from Telegram

I'm processing Telegram history (a txt file) and I need to extract & process a quite complex (nested) multiline pattern.
Here's the whole pattern
Free_Trade_Calls__AltSignals:IOC/ BTC (bittrex)
BUY : 0.00164
SELL :
TARGET 1 : 0.00180
TARGET 2 : 0.00205
TARGET 3 : 0.00240
STOP LOS : 0.000120
2018-04-19 15:46:57 Free_Trade_Calls__AltSignals:TARGET
Basically, I am looking for a pattern starting with
Free_Trade_Calls__AltSignals:
and ending with a timestamp.
Inside that pattern (the Telegram message) I need:
- the exchange - in brackets in the 1st line
- the value after BUY
- the SELL values in an array of 3, SELL[3]: TARGET 1-3
- the STOP loss value (it can be either STOP, STOP LOSS, or STOP LOS)...
I've found this: Logstash grok multiline message, but I am very new to Logstash (a friend advised it to me). I was trying to parse this text in NodeJS, but it really is a pain.
Thanks Rob :)
Since you need to grab values from each line, you don't need to use a multiline modifier. You can skip empty lines with the %{SPACE} pattern.
For your given log, this pattern can be used,
Free_Trade_Calls__AltSignals:.*\(%{WORD:exchange}\)\s*BUY\s*:\s*%{NUMBER:BUY}\s*SELL :\s*TARGET 1\s*:\s*%{NUMBER:TARGET_1}\s*TARGET 2\s*:\s*%{NUMBER:TARGET_2}\s*TARGET 3\s*:\s*%{NUMBER:TARGET_3}\s*.*:\s*%{NUMBER:StopLoss}
Please note that \s* is equivalent to %{SPACE}.
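In the pipeline, this could look like the following (a minimal sketch; it assumes the whole Telegram message reaches the filter as a single event, e.g. combined by a multiline codec on the input):
filter {
  grok {
    match => { "message" => "Free_Trade_Calls__AltSignals:.*\(%{WORD:exchange}\)\s*BUY\s*:\s*%{NUMBER:BUY}\s*SELL :\s*TARGET 1\s*:\s*%{NUMBER:TARGET_1}\s*TARGET 2\s*:\s*%{NUMBER:TARGET_2}\s*TARGET 3\s*:\s*%{NUMBER:TARGET_3}\s*.*:\s*%{NUMBER:StopLoss}" }
  }
}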
It will output,
{
  "exchange": [
    [
      "bittrex"
    ]
  ],
  "BUY": [
    [
      "0.00164"
    ]
  ],
  "BASE10NUM": [
    [
      "0.00164",
      "0.00180",
      "0.00205",
      "0.00240",
      "0.000120"
    ]
  ],
  "TARGET_1": [
    [
      "0.00180"
    ]
  ],
  "TARGET_2": [
    [
      "0.00205"
    ]
  ],
  "TARGET_3": [
    [
      "0.00240"
    ]
  ],
  "StopLoss": [
    [
      "0.000120"
    ]
  ]
}

Parsing two formats of log messages in LogStash

In a single log file, there are two formats of log messages. The first looks like this:
Apr 22, 2017 2:00:14 AM org.activebpel.rt.util.AeLoggerFactory info
INFO:
======================================================
ActiveVOS 9.* version Full license.
Licensed for All application server(s), for 8 cpus,
License expiration date: Never.
======================================================
and the second:
Apr 22, 2017 2:00:14 AM org.activebpel.rt.AeException logWarning
WARNING: The product license does not include Socrates.
The first line is the same, but the following lines can be (written in pseudo-notation) loglevel: <msg>, or loglevel:<newline><many of =><newline><multi-line msg><newline><many of =>.
I have the following configuration:
Query:
%{TIMESTAMP_MW_ERR:timestamp} %{DATA:logger} %{GREEDYDATA:info}%{SPACE}%{LOGLEVEL:level}:(%{SPACE}%{GREEDYDATA:msg}|%{SPACE}=+(%{GREEDYDATA:msg}%{SPACE})*=+)
Grok patterns:
AMPM (am|AM|pm|PM|Am|Pm)
TIMESTAMP_MW_ERR %{MONTH} %{MONTHDAY}, %{YEAR} %{HOUR}:%{MINUTE}:%{SECOND} %{AMPM}
Multiline filter:
%{LOGLEVEL}|%{GREEDYDATA}|=+
The problem is that all messages are always captured by %{SPACE}%{GREEDYDATA:msg}, so in the second case <many of => is returned as msg; the alternative %{SPACE}=+(%{GREEDYDATA:msg}%{SPACE})*=+ never matches, probably because the first msg pattern subsumes the second.
How can I parse these two patterns of msg ?
I fixed it with the following:
Query:
%{TIMESTAMP_MW_ERR:timestamp} %{DATA:logger} %{DATA:info}\s%{LOGLEVEL:level}:\s((=+\s%{GDS:msg}\s=+)|%{GDS:msg})
Patterns:
AMPM (am|AM|pm|PM|Am|Pm)
TIMESTAMP_MW_ERR %{MONTH} %{MONTHDAY}, %{YEAR} %{HOUR}:%{MINUTE}:%{SECOND} %{AMPM}
GDS (.|\s)*
Multiline pattern:
%{LOGLEVEL}|%{GREEDYDATA}
Logs are correctly parsed.
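If you prefer to keep everything in one place, recent Logstash versions let you supply the custom patterns inline through the grok filter's pattern_definitions option instead of a separate patterns file; a minimal sketch:
filter {
  grok {
    # inline equivalents of the AMPM / TIMESTAMP_MW_ERR / GDS pattern files
    pattern_definitions => {
      "AMPM" => "(am|AM|pm|PM|Am|Pm)"
      "TIMESTAMP_MW_ERR" => "%{MONTH} %{MONTHDAY}, %{YEAR} %{HOUR}:%{MINUTE}:%{SECOND} %{AMPM}"
      "GDS" => "(.|\s)*"
    }
    match => { "message" => "%{TIMESTAMP_MW_ERR:timestamp} %{DATA:logger} %{DATA:info}\s%{LOGLEVEL:level}:\s((=+\s%{GDS:msg}\s=+)|%{GDS:msg})" }
  }
}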
