logstash parse complex message from Telegram

I'm processing a Telegram history (a txt file) and I need to extract and process a fairly complex (nested) multiline pattern.
Here's the whole pattern:
Free_Trade_Calls__AltSignals:IOC/ BTC (bittrex)
BUY : 0.00164
SELL :
TARGET 1 : 0.00180
TARGET 2 : 0.00205
TARGET 3 : 0.00240
STOP LOS : 0.000120
2018-04-19 15:46:57 Free_Trade_Calls__AltSignals:TARGET
Basically, I am looking for a pattern starting with
Free_Trade_Calls__AltSignals: ^%(
and ending with a timestamp.
Inside that pattern (Telegram message) I need:
- the exchange - in parentheses in the 1st line
- the value after BUY
- the SELL values in an array of 3, SELL[3]: TARGET 1 to 3
- the STOP loss value (it can be either STOP, STOP LOSS, or STOP LOS)....
I've found this: Logstash grok multiline message, but I am very new to Logstash (a friend recommended it to me). I was trying to parse this text in NodeJS, but it was a real pain.
Thanks Rob :)

Since you need to grab values from each line, you don't need to use a multiline modifier. You can skip empty lines with %{SPACE}.
For your given log, this pattern can be used:
Free_Trade_Calls__AltSignals:.*\(%{WORD:exchange}\)\s*BUY\s*:\s*%{NUMBER:BUY}\s*SELL :\s*TARGET 1\s*:\s*%{NUMBER:TARGET_1}\s*TARGET 2\s*:\s*%{NUMBER:TARGET_2}\s*TARGET 3\s*:\s*%{NUMBER:TARGET_3}\s*.*:\s*%{NUMBER:StopLoss}
Please note that \s* is equivalent to %{SPACE}.
It will output:
{
"exchange": [
[
"bittrex"
]
],
"BUY": [
[
"0.00164"
]
],
"BASE10NUM": [
[
"0.00164",
"0.00180",
"0.00205",
"0.00240",
"0.000120"
]
],
"TARGET_1": [
[
"0.00180"
]
],
"TARGET_2": [
[
"0.00205"
]
],
"TARGET_3": [
[
"0.00240"
]
],
"StopLoss": [
[
"0.000120"
]
]
}
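If you want to sanity-check the extraction outside Logstash, here is a rough Python sketch of the same idea using a plain regex with named groups (the group names mirror the grok field names above; this is only an illustration, not part of the Logstash configuration):

import re

message = """Free_Trade_Calls__AltSignals:IOC/ BTC (bittrex)
BUY : 0.00164
SELL :
TARGET 1 : 0.00180
TARGET 2 : 0.00205
TARGET 3 : 0.00240
STOP LOS : 0.000120"""

# Named groups play the role of grok's %{...:name} captures;
# \s* skips the newlines and blank space between fields, like %{SPACE}.
pattern = re.compile(
    r"Free_Trade_Calls__AltSignals:.*\((?P<exchange>\w+)\)\s*"
    r"BUY\s*:\s*(?P<BUY>[\d.]+)\s*"
    r"SELL :\s*"
    r"TARGET 1\s*:\s*(?P<TARGET_1>[\d.]+)\s*"
    r"TARGET 2\s*:\s*(?P<TARGET_2>[\d.]+)\s*"
    r"TARGET 3\s*:\s*(?P<TARGET_3>[\d.]+)\s*"
    r".*:\s*(?P<StopLoss>[\d.]+)"
)

match = pattern.search(message)
if match:
    print(match.groupdict())
    # {'exchange': 'bittrex', 'BUY': '0.00164', 'TARGET_1': '0.00180',
    #  'TARGET_2': '0.00205', 'TARGET_3': '0.00240', 'StopLoss': '0.000120'}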

Related

Spacy matching priority

I'm looking to create a physics pattern library with spaCy.
I want to detect time and speed patterns. My aim is to stay flexible with those patterns.
time_pattern = [
    [
        {'LIKE_NUM': True, 'OP': '?'},
        {'LOWER': {'IN': ['time', 's', 'h', 'min']}},
        {'LOWER': {'IN': ['maximum', 'minimum', 'min', 'max']}, 'OP': '?'}
    ]
]
speed_pattern = [
    [
        {'LIKE_NUM': True, 'OP': '?'},
        {'LOWER': {'IN': ['km', 'm']}},
        {'IS_PUNCT': True},
        {'LOWER': {'IN': ['h', 'hour', 's', 'min']}}
    ]
]
matcher = Matcher(nlp.vocab, validate=True)
matcher.add("SPEED", speed_pattern)
matcher.add("TIME", time_pattern)
doc = nlp("a certain time, more about 23 min, can't get above 25 km/h")
for id_match, start, end in matcher(doc):
    match_label = nlp.vocab[id_match].text
    print(match_label, '<--', doc[start:end])
So far my code returns this collection of matches:
TIME <-- time
TIME <-- 23 min
TIME <-- min
SPEED <-- 25 km/h
SPEED <-- km/h
TIME <-- h
I want the matcher to match only once, and to match "23 min" rather than "min". I would also like the matcher not to match an element that has already been matched (for example, "h" should not be matched because it is already matched in "km/h").
You can try adding greedy="LONGEST" to matcher.add() to return only the longest matches (or greedy="FIRST" for the first ones):
matcher.add("SPEED", speed_pattern, greedy="LONGEST")
matcher.add("TIME", time_pattern, greedy="LONGEST")
But note that this doesn't handle overlaps across different match IDs:
TIME <-- 23 min
TIME <-- time
TIME <-- h
SPEED <-- 25 km/h
If you want to filter all of the matches, you can use matcher(doc, as_spans=True) to get the matches directly as spans, and then use spacy.util.filter_spans to reduce the whole list of spans to a list of non-overlapping spans, with the longest spans preferred: https://spacy.io/api/top-level#util.filter_spans
The filtered result is:
[time, 23 min, 25 km/h]
You can use the as_spans=True option with spacy.matcher.Matcher (introduced in spaCy v3.0):
matches = matcher(doc, as_spans=True)
for span in spacy.util.filter_spans(matches):
    print(span.label_, "->", span.text)
From the documentation:
Instead of tuples, return a list of Span objects of the matches, with the match_id assigned as the span label. Defaults to False.
See the Python demo:
import spacy
from spacy.tokens.doc import Doc
from spacy.matcher import Matcher

nlp = spacy.load('en_core_web_sm')

time_pattern = [
    [
        {'LIKE_NUM': True, 'OP': '?'},
        {'LOWER': {'IN': ['time', 's', 'h', 'min']}},
        {'LOWER': {'IN': ['maximum', 'minimum', 'min', 'max']}, 'OP': '?'}
    ]
]
speed_pattern = [
    [
        {'LIKE_NUM': True, 'OP': '?'},
        {'LOWER': {'IN': ['km', 'm']}},
        {'IS_PUNCT': True},
        {'LOWER': {'IN': ['h', 'hour', 's', 'min']}}
    ]
]
matcher = Matcher(nlp.vocab, validate=True)
matcher.add("SPEED", speed_pattern)
matcher.add("TIME", time_pattern)
doc = nlp("a certain time, more about 23 min, can't get above 25 km/h")
matches = matcher(doc, as_spans=True)
for span in spacy.util.filter_spans(matches):
    print(span.label_, "->", span.text)
Output:
TIME -> time
TIME -> 23 min
SPEED -> 25 km/h

Reconstructing Map in Groovy

May I ask, is it possible in Groovy to transform a map recursively by removing the key and keeping only the value for entries whose key has a certain prefix (e.g. pk-)?
For example, from this:
[
applications:[
pk-name-AppA:[name:AppA, keyA1:valueA1Prod, keyA2:valueA2],
pk-name-AppB:[name:AppB, keyB1:valueB1Prod, keyB2:valueB2],
otherAppC:[name:AppC, keyA1:valueA1Prod, keyA2:valueA2]
]
]
to
[
applications:[
[name:AppA, keyA1:valueA1Prod, keyA2:valueA2],
[name:AppB, keyB1:valueB1Prod, keyB2:valueB2],
otherAppC:[name:AppC, keyA1:valueA1Prod, keyA2:valueA2]
]
]

Grok filter for selecting and formatting certain logs lines

I am writing a grok filter for parsing my application log, which is unstructured. What I need is to look for certain lines and generate output in a specific format. For example, below are my logs:
2018-05-07 01:19:40 M :Memory (xivr = 513.2 Mb, system = 3502.0 Mb, physical = 5386.7 Mb), CpuLoad (sys = 0%, xivr = 0%)
2018-05-07 01:29:40 M :Memory (xivr = 513.2 Mb, system = 3495.3 Mb, physical = 5370.1 Mb), CpuLoad (sys = 0%, xivr = 0%)
2018-05-07 05:51:19 1 :Hangup call
***2018-05-07 05:51:22 24 :Answer call from 71840746 for 91783028 [C:\xivr\es\IVR-Dialin.dtx***]
2018-05-07 05:51:30 24 :Hangup call
***2018-05-07 05:51:34 24 :Answer call from 71840746 for 91783028 [C:\xivr\es\IVR-Dialin.dtx]***
2018-05-07 00:31:21 45 :Device Dialogic Digital dxxxB12C1 [gc60.dev - Dialogic (SDK 6.0) ver 3.0.702:11646] (ThreadID: 1FF0, DriverChannel: 44)
2018-05-07 00:31:22 40 :Device Dialogic Digital dxxxB10C4 [gc60.dev - Dialogic (SDK 6.0) ver 3.0.702:11646] (ThreadID: 1B2C, DriverChannel: 39)
I need to ingest only the lines highlighted with *** into Kibana, in the format below; other lines should simply be ignored:
Logtimestamp: 2018-05-07 05:51:22
Channel_id: 24
Source_number: 71840746
Destination_Number: 91783028
How can this be achieved?
You can explicitly write whatever is unique about that particular pattern, and use pre-defined grok patterns for the rest.
In your case, the grok pattern would be:
%{TIMESTAMP_ISO8601:Logtimestamp} %{NUMBER:Channel_id} :Answer call from %{NUMBER:Source_number} for %{NUMBER:Destination_Number} %{GREEDYDATA:etc}
It will only match lines of the following form:
2018-05-07 05:51:34 24 :Answer call from 71840746 for 91783028 [C:\xivr\es\IVR-Dialin.dtx]
Explanation
The syntax for a grok pattern is %{SYNTAX:SEMANTIC}.
In your filter,
%{TIMESTAMP_ISO8601:Logtimestamp} matches 2018-05-07 05:51:34
%{NUMBER:Channel_id} matches 24
:Answer call from matches the string literally
%{NUMBER:Source_number} matches 71840746
%{NUMBER:Destination_Number} matches 91783028
%{GREEDYDATA:etc} matches rest of the data i.e. [C:\xivr\es\IVR-Dialin.dtx]
in that order.
Output:
{
"Logtimestamp": [
[
"2018-05-07 05:51:22"
]
],
"Channel_id": [
[
"24"
]
],
"Source_number": [
[
"71840746"
]
],
"Destination_Number": [
[
"91783028"
]
],
"etc": [
[
"[C:\\xivr\\es\\IVR-Dialin.dtx***]"
]
]
}
You can test it in an online grok debugger.
Hope it helps.
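In Logstash itself, events that don't match the grok pattern are tagged with _grokparsefailure, so the remaining lines can be dropped based on that tag. As a rough cross-check outside Logstash, here is a small Python sketch of the same filtering idea with an equivalent named-group regex (the group names mirror the grok field names; this is only an illustration, not your Logstash config):

import re

# Equivalent of the grok pattern above: only ':Answer call' lines match.
answer_call = re.compile(
    r"(?P<Logtimestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) "
    r"(?P<Channel_id>\d+) :Answer call from "
    r"(?P<Source_number>\d+) for (?P<Destination_Number>\d+) (?P<etc>.*)"
)

log_lines = [
    "2018-05-07 01:19:40 M :Memory (xivr = 513.2 Mb, system = 3502.0 Mb, ...)",
    "2018-05-07 05:51:19 1 :Hangup call",
    "2018-05-07 05:51:22 24 :Answer call from 71840746 for 91783028 [C:\\xivr\\es\\IVR-Dialin.dtx]",
]

for line in log_lines:
    m = answer_call.search(line)
    if m:  # lines that don't match are simply ignored
        print(m.groupdict())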

genfromtxt return numpy array not separated by comma

I have a *.csv file that stores two columns of float data.
I am using this function to import it, but the data it generates is not separated by commas.
data=np.genfromtxt("data.csv", delimiter=',', dtype=float)
output:
[[ 403.14915 150.560364 ]
[ 403.7822265 135.13165 ]
[ 404.5017 163.4669 ]
[ 434.02465 168.023224 ]
[ 373.7655 177.904114 ]
[ 450.608429 208.4187315]
[ 454.39475 239.9666595]
[ 453.8055 248.4082 ]
[ 457.5625305 247.70315 ]
[ 451.729431 258.19335 ]
[ 366.74405 225.169922 ]
[ 377.0055235 258.110077 ]
[ 380.3581 261.760071 ]
[ 383.98615 262.33805 ]
[ 388.2516785 272.715332 ]
[ 408.378174 200.9713135]]
How can I format it to get a NumPy array like
[[ 403.14915, 150.560364 ]
[ 403.7822265, 135.13165 ],....]
?
NumPy doesn't display commas when you print arrays. If you really want to see them, you can use
print(repr(data))
The repr function forces a string representation that is not meant for "nice" printing, but for the literal representation you would use yourself to type the data into your code.
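For example, here is a minimal sketch with a couple of the values from your array, just to show the difference (numpy.array2string with its separator argument is another way to get a comma-separated display):

import numpy as np

data = np.array([[403.14915, 150.560364],
                 [403.7822265, 135.13165]])

print(data)        # str(data): NumPy's display format, no commas
print(repr(data))  # array([[...]]): the literal representation, with commas

# array2string lets you choose the separator explicitly:
print(np.array2string(data, separator=', '))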

OpenNLP yielding undesired result

I am using OpenNLP to process queries like "doctor working in Los Angeles" and "female living in Hollywood and working in Santa Monica". To an English-speaking human it is very obvious that the subjects of these sentences are "doctor" and "female". However, when I use OpenNLP, it tagged the sentence as
female_JJ living_NN in_IN hollywood_NN
[ female living ] [ in ] [ hollywood ]
Here's another sentence, "person living in santa monica and working in malibu and playing football", which was processed as
person_NN living_VBG in_IN santa_NN monica_NN and_CC working_VBG in_IN malibu_NN and_CC playing_NN football_NN
[ person ] [ living ] [ in ] [ santa monica ] and [ working ] [ in ] [ malibu and playing football ]
Why does OpenNLP's POS tagger tag these wrongly? These sentences have the simplest grammatical structures. If the most advanced NLP technologies still fail to parse these sentences, does it mean that NLP is far from being practical at present?
The accuracy of these NLP projects cannot be 100%, because they work on probabilities, so such errors can occur. Even so, these are among the most accurate working results available.
