Grok filter for selecting and formatting certain log lines - logstash

I am writing a grok filter to parse my application log, which is unstructured. What I need is to look for certain lines and generate output in a specific format. E.g., below are my logs:
2018-05-07 01:19:40 M :Memory (xivr = 513.2 Mb, system = 3502.0 Mb, physical = 5386.7 Mb), CpuLoad (sys = 0%, xivr = 0%)
2018-05-07 01:29:40 M :Memory (xivr = 513.2 Mb, system = 3495.3 Mb, physical = 5370.1 Mb), CpuLoad (sys = 0%, xivr = 0%)
2018-05-07 05:51:19 1 :Hangup call
***2018-05-07 05:51:22 24 :Answer call from 71840746 for 91783028 [C:\xivr\es\IVR-Dialin.dtx]***
2018-05-07 05:51:30 24 :Hangup call
***2018-05-07 05:51:34 24 :Answer call from 71840746 for 91783028 [C:\xivr\es\IVR-Dialin.dtx]***
2018-05-07 00:31:21 45 :Device Dialogic Digital dxxxB12C1 [gc60.dev - Dialogic (SDK 6.0) ver 3.0.702:11646] (ThreadID: 1FF0, DriverChannel: 44)
2018-05-07 00:31:22 40 :Device Dialogic Digital dxxxB10C4 [gc60.dev - Dialogic (SDK 6.0) ver 3.0.702:11646] (ThreadID: 1B2C, DriverChannel: 39)
I need to index only the lines highlighted with *** in Kibana, in the format below; other lines should simply be ignored:
Logtimestamp: 2018-05-07 05:51:22
Channel_id: 24
Source_number: 71840746
Destination_Number: 91783028
How can this be achieved?

You can explicitly write whatever is unique about that particular pattern, and use pre-defined grok patterns for the rest.
In your case, the grok pattern would be,
%{TIMESTAMP_ISO8601:Logtimestamp} %{NUMBER:Channel_id} :Answer call from %{NUMBER:Source_number} for %{NUMBER:Destination_Number} %{GREEDYDATA:etc}
It will only match lines of the following form:
2018-05-07 05:51:34 24 :Answer call from 71840746 for 91783028 [C:\xivr\es\IVR-Dialin.dtx]
Explanation
The syntax for a grok pattern is %{SYNTAX:SEMANTIC}.
In your filter,
%{TIMESTAMP_ISO8601:Logtimestamp} matches 2018-05-07 05:51:34
%{NUMBER:Channel_id} matches 24
:Answer call from matches the string literally
%{NUMBER:Source_number} matches 71840746
%{NUMBER:Destination_Number} matches 91783028
%{GREEDYDATA:etc} matches rest of the data i.e. [C:\xivr\es\IVR-Dialin.dtx]
in that order.
Output:
{
  "Logtimestamp": [
    [
      "2018-05-07 05:51:34"
    ]
  ],
  "Channel_id": [
    [
      "24"
    ]
  ],
  "Source_number": [
    [
      "71840746"
    ]
  ],
  "Destination_Number": [
    [
      "91783028"
    ]
  ],
  "etc": [
    [
      "[C:\\xivr\\es\\IVR-Dialin.dtx]"
    ]
  ]
}
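To ignore every other line, you can drop events where the grok match fails; a minimal sketch of the full filter (grok tags non-matching events with _grokparsefailure):
filter {
  grok {
    match => { "message" => "%{TIMESTAMP_ISO8601:Logtimestamp} %{NUMBER:Channel_id} :Answer call from %{NUMBER:Source_number} for %{NUMBER:Destination_Number} %{GREEDYDATA:etc}" }
  }
  # lines that don't match get tagged; drop them so they never reach Kibana
  if "_grokparsefailure" in [tags] {
    drop { }
  }
}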
You can test it with an online grok debugger, e.g. https://grokdebug.herokuapp.com/.
Hope it helps.

Related

Spacy matching priority

I'm looking to create a physics pattern library with spaCy:
I want to detect time and speed patterns. My aim is to stay flexible with those patterns.
import spacy
from spacy.matcher import Matcher

nlp = spacy.load('en_core_web_sm')

time_pattern = [
    [
        {'LIKE_NUM': True, 'OP': '?'},
        {'LOWER': {'IN': ['time', 's', 'h', 'min']}},
        {'LOWER': {'IN': ['maximum', 'minimum', 'min', 'max']}, 'OP': '?'}
    ]
]
speed_pattern = [
    [
        {'LIKE_NUM': True, 'OP': '?'},
        {'LOWER': {'IN': ['km', 'm']}},
        {'IS_PUNCT': True},
        {'LOWER': {'IN': ['h', 'hour', 's', 'min']}}
    ]
]

matcher = Matcher(nlp.vocab, validate=True)
matcher.add("SPEED", speed_pattern)
matcher.add("TIME", time_pattern)

doc = nlp("a certain time, more about 23 min, can't get above 25 km/h")
for id_match, start, end in matcher(doc):
    match_label = nlp.vocab[id_match].text
    print(match_label, '<--', doc[start:end])
So far my code returns this collection of matches:
TIME <-- time
TIME <-- 23 min
TIME <-- min
SPEED <-- 25 km/h
SPEED <-- km/h
TIME <-- h
I want the matcher to match only once, and to match "23 min" rather than "min". I would also like the matcher not to match an element that has already been matched (for example, "h" should not be matched because it was already matched as part of "km/h").
You can try adding greedy="LONGEST" to matcher.add() to return only the longest (or FIRST) matches:
matcher.add("SPEED", speed_pattern, greedy="LONGEST")
matcher.add("TIME", time_pattern, greedy="LONGEST")
But note that this doesn't handle overlaps across different match IDs:
TIME <-- 23 min
TIME <-- time
TIME <-- h
SPEED <-- 25 km/h
If you want to filter all of the matches, you can use matcher(doc, as_spans=True) to get the matches directly as spans, and then use spacy.util.filter_spans to reduce the whole list to non-overlapping spans with the longest spans preferred: https://spacy.io/api/top-level#util.filter_spans
The result:
[time, 23 min, 25 km/h]
You can use the as_spans=True option of spacy.matcher.Matcher (introduced in spaCy v3.0):
matches = matcher(doc, as_spans=True)
for span in spacy.util.filter_spans(matches):
    print(span.label_, "->", span.text)
From the documentation:
Instead of tuples, return a list of Span objects of the matches, with the match_id assigned as the span label. Defaults to False.
See the Python demo:
import spacy
from spacy.matcher import Matcher

nlp = spacy.load('en_core_web_sm')

time_pattern = [
    [
        {'LIKE_NUM': True, 'OP': '?'},
        {'LOWER': {'IN': ['time', 's', 'h', 'min']}},
        {'LOWER': {'IN': ['maximum', 'minimum', 'min', 'max']}, 'OP': '?'}
    ]
]
speed_pattern = [
    [
        {'LIKE_NUM': True, 'OP': '?'},
        {'LOWER': {'IN': ['km', 'm']}},
        {'IS_PUNCT': True},
        {'LOWER': {'IN': ['h', 'hour', 's', 'min']}}
    ]
]

matcher = Matcher(nlp.vocab, validate=True)
matcher.add("SPEED", speed_pattern)
matcher.add("TIME", time_pattern)

doc = nlp("a certain time, more about 23 min, can't get above 25 km/h")
matches = matcher(doc, as_spans=True)
for span in spacy.util.filter_spans(matches):
    print(span.label_, "->", span.text)
Output:
TIME -> time
TIME -> 23 min
SPEED -> 25 km/h

Why is white space not found in my NodeJS program?

For some obscure reason, my str.split(" ") command doesn't seem to work. I've been trying to debug the situation for a while now and can't seem to find the solution. Let me start off by saying that, unfortunately, I can't replicate the error. I've tried creating a JSFiddle, but it works correctly.
Here's my problem: I've got a JSObject "library" over which I'm looping to create MongoDocuments, and during this "construction" I get the following:
if (payload[i].asking) {
    let price = payload[i].asking;
    price = price.substring(1, price.length);
    console.log(price);
    console.log(price.indexOf(" "));
    const priceArr = price.split(" ");
    console.log(priceArr);
    price = priceArr[0];
    currency = priceArr[1];
    listing.set('asking', price);
    listing.set('currency', currency);
}
where:
- payload is the full JSObject "library",
- [i] is the current JSObject,
- and asking is the key I'm working on in this particular place of the code.
And here's the result:
500.00 eur
-1
[ '500.00 eur' ]
950.00 eur
-1
[ '950.00 eur' ]
5,000.00 usd
-1
[ '5,000.00 usd' ]
250.00 usd
-1
[ '250.00 usd' ]
800.00 usd
-1
[ '800.00 usd' ]
899.00 usd
-1
[ '899.00 usd' ]
3,500.00 usd
-1
[ '3,500.00 usd' ]
2,800.00 usd
-1
[ '2,800.00 usd' ]
2,250.00 usd
-1
[ '2,250.00 usd' ]
3,750.00 usd
-1
[ '3,750.00 usd' ]
1,500.00 usd
-1
[ '1,500.00 usd' ]
5,800.00 usd
-1
[ '5,800.00 usd' ]
2,500.00 usd
-1
[ '2,500.00 usd' ]
So I understand why price.split(" ") doesn't work: there is apparently no white space in the first place (indexOf(" ") === -1), but I'm not sure why or what's happening. payload[i].asking is a string all right (price.substring proves it), but I don't understand why this white space doesn't exist.
OK, so I found an alternative using a regex, even though it doesn't solve the exact problem I describe per se:
const priceArr = price.split(/(\s+)/)
It might help someone else...
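For what it's worth, a plausible cause (an assumption; the original data isn't available to confirm it) is that the "space" in the string is not an ASCII space at all but another Unicode whitespace character, such as a non-breaking space (U+00A0): indexOf(" ") and split(" ") only look for U+0020, while \s matches any whitespace. A quick Node demo with a made-up value:
// Hypothetical value: a non-breaking space (U+00A0) instead of a normal space
const price = "500.00\u00A0eur";

console.log(price.indexOf(" "));    // -1: no ASCII space in the string
console.log(price.split(" "));      // [ '500.00 eur' ]: nothing to split on
console.log(price.split(/\s+/));    // [ '500.00', 'eur' ]: \s matches U+00A0
Note that the capturing group in the answer's /(\s+)/ makes split() include the separator in the result array, so the currency lands at index 2 rather than 1; /\s+/ without the group avoids that.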

Logstash grok date parsefailure

With this filter
filter {
  grok {
    match => { "message" => "\[(?<timestamp>%{MONTHDAY}-%{MONTH}-%{YEAR} %{TIME} %{TZ})\] %{DATA:errortype}: %{GREEDYDATA:errormessage}" }
  }
  date {
    match => [ "timestamp", "dd-MMM-YYYY HH:mm:ss Z" ]
    #remove_field => ["timestamp"]
  }
}
And this line:
[04-Jul-2018 15:28:02 UTC] PHP Warning: count(): Parameter must be an array or an object that implements Countable in xxx.php on line 508
I get a date parse failure.
With https://grokdebug.herokuapp.com/ everything seems OK, and using -debug I only get this log:
[2018-07-09T08:38:32,925][DEBUG][logstash.inputs.file ] Received line {:path=>"/tmp/request.log", :text=>"[04-Jul-2018 15:28:02 UTC] PHP Warning: count(): Parameter must be an array or an object that implements Countable in xxx/program.php on line 508"}
[2018-07-09T08:38:32,941][DEBUG][logstash.inputs.file ] writing sincedb (delta since last write = 1531118312)
[2018-07-09T08:38:32,948][DEBUG][logstash.pipeline ] filter received {"event"=>{"@version"=>"1", "host"=>"guillaume", "path"=>"/tmp/request.log", "@timestamp"=>2018-07-09T06:38:32.939Z, "message"=>"[04-Jul-2018 15:28:02 UTC] PHP Warning: count(): Parameter must be an array or an object that implements Countable in xxx.php on line 508"}}
[2018-07-09T08:38:32,949][DEBUG][logstash.filters.grok ] Running grok filter {:event=>2018-07-09T06:38:32.939Z guillaume [04-Jul-2018 15:28:02 UTC] PHP Warning: count(): Parameter must be an array or an object that implements Countable in xxx/program.php on line 508}
[2018-07-09T08:38:32,950][DEBUG][logstash.filters.grok ] Event now: {:event=>2018-07-09T06:38:32.939Z guillaume [04-Jul-2018 15:28:02 UTC] PHP Warning: count(): Parameter must be an array or an object that implements Countable in xxx.php on line 508}
[2018-07-09T08:38:32,954][DEBUG][logstash.pipeline ] output received {"event"=>{"errormessage"=>" count(): Parameter must be an array or an object that implements Countable xxx.php on line 508", "path"=>"/tmp/request.log", "errortype"=>"PHP Warning", "@timestamp"=>2018-07-09T06:38:32.939Z, "@version"=>"1", "host"=>"guillaume", "message"=>"[04-Jul-2018 15:28:02 UTC] PHP Warning: count(): Parameter must be an array or an object that implements Countable in xxx.php on line 508", "timestamp"=>"04-Jul-2018 15:28:02 UTC", "tags"=>["_dateparsefailure"]}}
Change the YYYY to yyyy and the Z to z:
date {
  match => [ "timestamp", "dd-MMM-yyyy HH:mm:ss z" ]
}
For more details on the date format, refer to:
https://www.elastic.co/guide/en/logstash/6.3/plugins-filters-date.html#plugins-filters-date-match
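Putting it together, the whole corrected filter reads:
filter {
  grok {
    match => { "message" => "\[(?<timestamp>%{MONTHDAY}-%{MONTH}-%{YEAR} %{TIME} %{TZ})\] %{DATA:errortype}: %{GREEDYDATA:errormessage}" }
  }
  date {
    # yyyy is the plain year; z parses time-zone names such as "UTC",
    # while Z expects a numeric offset like +0000
    match => [ "timestamp", "dd-MMM-yyyy HH:mm:ss z" ]
    #remove_field => ["timestamp"]
  }
}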

logstash parse complex message from Telegram

I'm processing Telegram history (a txt file) and I need to extract & process a quite complex (nested) multiline pattern.
Here's the whole pattern
Free_Trade_Calls__AltSignals:IOC/ BTC (bittrex)
BUY : 0.00164
SELL :
TARGET 1 : 0.00180
TARGET 2 : 0.00205
TARGET 3 : 0.00240
STOP LOS : 0.000120
2018-04-19 15:46:57 Free_Trade_Calls__AltSignals:TARGET
Basically, I am looking for a pattern starting with
Free_Trade_Calls__AltSignals:
and ending with a timestamp.
Inside that pattern (the Telegram message) I need:
- the exchange - in brackets in the 1st line
- the value after BUY
- the SELL values in an array of 3, SELL[3]: TARGET 1-3
- the STOP loss value (it can be either STOP, STOP LOSS, or STOP LOS)...
I've found this: Logstash grok multiline message, but I am very new to Logstash (a friend advised it to me). I was trying to parse this text in NodeJS, but it really is a pain.
Thanks Rob :)
Since you need to grab values from each line, you don't need to use a multiline modifier. You can skip empty lines with the %{SPACE} pattern.
For your given log, this pattern can be used,
Free_Trade_Calls__AltSignals:.*\(%{WORD:exchange}\)\s*BUY\s*:\s*%{NUMBER:BUY}\s*SELL :\s*TARGET 1\s*:\s*%{NUMBER:TARGET_1}\s*TARGET 2\s*:\s*%{NUMBER:TARGET_2}\s*TARGET 3\s*:\s*%{NUMBER:TARGET_3}\s*.*:\s*%{NUMBER:StopLoss}
Please note that \s* is equivalent to %{SPACE}.
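In the pipeline, this could look like the following (a minimal sketch; it assumes the whole Telegram message reaches the filter as a single event, e.g. combined by a multiline codec on the input):
filter {
  grok {
    match => { "message" => "Free_Trade_Calls__AltSignals:.*\(%{WORD:exchange}\)\s*BUY\s*:\s*%{NUMBER:BUY}\s*SELL :\s*TARGET 1\s*:\s*%{NUMBER:TARGET_1}\s*TARGET 2\s*:\s*%{NUMBER:TARGET_2}\s*TARGET 3\s*:\s*%{NUMBER:TARGET_3}\s*.*:\s*%{NUMBER:StopLoss}" }
  }
}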
It will output,
{
  "exchange": [
    [
      "bittrex"
    ]
  ],
  "BUY": [
    [
      "0.00164"
    ]
  ],
  "BASE10NUM": [
    [
      "0.00164",
      "0.00180",
      "0.00205",
      "0.00240",
      "0.000120"
    ]
  ],
  "TARGET_1": [
    [
      "0.00180"
    ]
  ],
  "TARGET_2": [
    [
      "0.00205"
    ]
  ],
  "TARGET_3": [
    [
      "0.00240"
    ]
  ],
  "StopLoss": [
    [
      "0.000120"
    ]
  ]
}

Parsing two formats of log messages in LogStash

In a single log file, there are two formats of log messages. The first looks like this:
Apr 22, 2017 2:00:14 AM org.activebpel.rt.util.AeLoggerFactory info
INFO:
======================================================
ActiveVOS 9.* version Full license.
Licensed for All application server(s), for 8 cpus,
License expiration date: Never.
======================================================
and the second:
Apr 22, 2017 2:00:14 AM org.activebpel.rt.AeException logWarning
WARNING: The product license does not include Socrates.
The first line is the same, but the following lines can be (written in pseudo-notation) loglevel: <msg>, or loglevel:<newline><many of =><newline><multi-line msg><newline><many of =>.
I have the following configuration:
Query:
%{TIMESTAMP_MW_ERR:timestamp} %{DATA:logger} %{GREEDYDATA:info}%{SPACE}%{LOGLEVEL:level}:(%{SPACE}%{GREEDYDATA:msg}|%{SPACE}=+(%{GREEDYDATA:msg}%{SPACE})*=+)
Grok patterns:
AMPM (am|AM|pm|PM|Am|Pm)
TIMESTAMP_MW_ERR %{MONTH} %{MONTHDAY}, %{YEAR} %{HOUR}:%{MINUTE}:%{SECOND} %{AMPM}
Multiline filter:
%{LOGLEVEL}|%{GREEDYDATA}|=+
The problem is that all messages are always captured by %{SPACE}%{GREEDYDATA:msg}, so in the second case <many of => is returned as msg; the alternative %{SPACE}=+(%{GREEDYDATA:msg}%{SPACE})*=+ never matches, probably because the first msg pattern subsumes the second.
How can I parse these two patterns of msg ?
I fixed it with the following:
Query:
%{TIMESTAMP_MW_ERR:timestamp} %{DATA:logger} %{DATA:info}\s%{LOGLEVEL:level}:\s((=+\s%{GDS:msg}\s=+)|%{GDS:msg})
Patterns:
AMPM (am|AM|pm|PM|Am|Pm)
TIMESTAMP_MW_ERR %{MONTH} %{MONTHDAY}, %{YEAR} %{HOUR}:%{MINUTE}:%{SECOND} %{AMPM}
GDS (.|\s)*
Multiline pattern:
%{LOGLEVEL}|%{GREEDYDATA}
Logs are correctly parsed.
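If you prefer to keep everything in one place, recent Logstash versions let you supply the custom patterns inline through the grok filter's pattern_definitions option instead of a separate patterns file; a minimal sketch:
filter {
  grok {
    # inline equivalents of the AMPM / TIMESTAMP_MW_ERR / GDS pattern files
    pattern_definitions => {
      "AMPM" => "(am|AM|pm|PM|Am|Pm)"
      "TIMESTAMP_MW_ERR" => "%{MONTH} %{MONTHDAY}, %{YEAR} %{HOUR}:%{MINUTE}:%{SECOND} %{AMPM}"
      "GDS" => "(.|\s)*"
    }
    match => { "message" => "%{TIMESTAMP_MW_ERR:timestamp} %{DATA:logger} %{DATA:info}\s%{LOGLEVEL:level}:\s((=+\s%{GDS:msg}\s=+)|%{GDS:msg})" }
  }
}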
