Logstash - one event per logfile

We want to introduce a log management tool. One of the possible candidates is the ELK stack.
I looked into the manual and it says that Logstash is primarily for logs that are written continuously, with one event per line. Unfortunately, we have to deal with logs where it would be one event per logfile. For example:
************************************************************
Protokollstart: XX.XX.XXXX XX:XX:XX
SessionID: XXXXX - XXX.XXX.XXX.XXX - XXXX
Kommentar: DASY
DASY-Batchlauf für Aufgaben bis XX.XX.XXXX XX:XX:XX
Sachbearbeiter: XXXX
ACHTUNG: Echtlauf - Daten gespeichert
Selektierte Gläubiger - Achtung: keine Aufgaben auf Partnerakten betrachtet
XX - XXXXX
XX - XXXXX
XX - XXXXX
nur Aufgaben zum XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX betrachtet
nur Aufgaben zur Wiedervorlage XXXX betrachtet
************************************************************
Startzeit : XX.XX.XXXX XX:XX:XX
************************************************************
Es liegen keine Aufgaben zur Bearbeitung an.
insgesamt bearbeitete Anzahl Aufgaben: 0
************************************************************
Ende des DASY-Batchlauf für Aufgaben: XX.XX.XXXX XX:XX:XX
Statistik:
Warnungen: 0
Protokollende: XX.XX.XXXX XX:XX:XX
************************************************************
I know that there is a multiline plugin / codec, but we have some problems to deal with:
1. There has to be an indicator of whether the file is still being written or already finished, because there can be large gaps between writes to the file. The indicator should always be Protokollende: XX.XX.XXXX XX:XX:XX
2. Writing a file can take multiple hours (we once had a workload running for 48 hours), and the event must not be triggered until the indicator defined in 1. is reached.
Is there any way to implement these requirements with standard functionality?
I hope I described the problem well enough. If there are any questions, please let me know :)
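For reference, one way to stay within standard functionality is the multiline codec on the file input: treat every line that is not the Protokollende line as belonging to the next line, so the whole file only becomes an event once the closing marker has been written. This is only a minimal sketch, assuming the logs are read with the file input (the path is a placeholder) and that option names and limits are verified against the multiline codec documentation for your Logstash version:

input {
  file {
    path => "/path/to/dasy-logs/*.log"   # placeholder path
    start_position => "beginning"
    codec => multiline {
      # Lines that do NOT start with "Protokollende:" belong to the next line,
      # so the event keeps accumulating until the closing marker appears.
      pattern => "^Protokollende:"
      negate => true
      what => "next"
      # The default max_lines (500) would split long protocols; raise it generously.
      max_lines => 20000
      # Leave auto_flush_interval unset (or very high), otherwise a file that is
      # still being written - gaps of several hours are expected - would be
      # flushed as an incomplete event.
    }
  }
}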

Related

Regex to find text & value in large text

As I SSH into CM, run commands and start reading the CLI output, I get the following back:
# * A lot more output above, but it has been removed *
terminal_output = """
[24;1H [79b[1GCommand: disp sys cust<<[23;0H[0;7m [79b[1G[0m[24;0H [79b[1G[1;0H[0;7m [79b[1G[0m[2;0H [79b[1G[3;1H[0J7[1;1H[0;7mdisplay system-parameters customer-options [0m8[1;65H[0;7mPage 1 of 12[0m[2;33HOPTIONAL FEATURES[4;8HG3 Version: [4;20HV20 [4;50HSoftware Package: [4;68HEnterprise [5;10HLocation: [5;20H2[6;10HPlatform: [6;20H28 [5;51HSystem ID (SID): [5;68H9990093751 [6;51HModule ID (MID): [6;68H1 [8;60HUSED[9;29HPlatform Maximum Ports: [9;53H 81000[9;60H 436[10;35HMaximum Stations: [10;53H 135[10;60H 110[11;27HMaximum XMOBILE Stations: [11;53H 41000[11;60H 0[12;17HMaximum Off-PBX Telephones - EC500: [12;53H 135[12;60H 2[13;17HMaximum Off-PBX Telephones - OPS: [13;53H 135[13;60H 40[14;17HMaximum Off-PBX Telephones - PBFMC: [14;53H 135[14;60H 0[15;17HMaximum Off-PBX Telephones - PVFMC: [15;53H 135[15;60H 0[16;17HMaximum Off-PBX Telephones - SCCAN: [16;53H 0[16;60H 0[17;22HMaximum Survivable Processors: [17;53H 313[17;62H 1[22;9H(NOTE: You must logoff & login to effect the permission changes.)[2;50H[0m
"""
It's a lot of ANSI escape codes (I think?), which makes the output hard to read, but what I'm trying to get back from the text above is the following:
Maximum Stations: 135 110
From my understanding, a regex would be required for this.
The regexes I tried did not work:
r'Maximum Stations:\s*(\d+)(\d+)'
r'Maximum Stations: \d+'
If anyone knows how to filter out these ANSI character codes so they don't appear in the final output that'd be great too.
Thank you.
You can try the following:
"(Maximum Stations:)\s\[\d*;\d*H\s*(\d*)\[\d*;\d*H\s*(\d*)"gm
It produces three groups: the first with the "Maximum Stations:" text, then two more, each with one of the numbers you wanted to capture. You would have to combine the groups to get your final output.
I don't know if this will be generic enough for your application though.
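If the raw output still contains the ESC bytes (they are often dropped when pasting, so this is an assumption about your actual string), another option is to strip the ANSI control sequences first and then use a much simpler regex. A small Python sketch:

import re

# VT100/ANSI control sequences such as "\x1b[10;35H" or "\x1b[0;7m"
ANSI_RE = re.compile(r'\x1b\[[0-9;?]*[A-Za-z]')

def extract_max_stations(raw):
    plain = ANSI_RE.sub(' ', raw)        # replace every escape sequence with a space
    m = re.search(r'Maximum Stations:\s*(\d+)\s+(\d+)', plain)
    return 'Maximum Stations: {} {}'.format(m.group(1), m.group(2)) if m else None

# print(extract_max_stations(terminal_output))   # -> Maximum Stations: 135 110

Note that the original attempt r'Maximum Stations:\s*(\d+)(\d+)' fails because there is no separator between the two capture groups; once the escape codes are stripped, the two numbers are separated by whitespace, which \s+ handles.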

Node.js - Parse raw text to JSON using RegEx

I'm still new to Node.js and currently developing a small app for my kitchen. This app can scan recipes and uses OCR to extract the data. For the OCR extraction I'm using the ocr-space web API. Afterwards I need to parse the raw text into a JSON structure and send it to my database. I've also tested this recipe using AWS Textract, which gave me an even poorer result.
Currently I'm struggling with the parsing part using RegEx in Node.js.
Here is the JSON structure into which I want to parse the recipe data:
receipt = {
  title: 'title of receipt',
  items: [
    'item1',
    'item2',
    'item3'
  ],
  preparation: 'preparation text'
}
As most of the recipes have an items part followed by a preparation part, my general approach so far looks like the following:
Search for keywords like 'items' and 'preparation' in the raw text
Parse the text between these keywords
Do further string processing, like fixing missing whitespace, trimming, etc.
This approach doesn't work if these keywords are missing. Take, for example, the following recipe, which I'm struggling to parse into my JSON structure. The recipe is in German and there are no corresponding keywords ('items' or 'Zutaten', 'preparation' or 'Zubereitung').
The following information from the raw text is needed:
title: line 1
items: lines 2 - 9
preparation: line 10 until the end
Do you have any hints or tips on how to get closer to a solution? Or any other ideas on how to handle such situations?
Quinoa-Brot
30 g Chiasamen
350 g Quinoa
70 ml Olivenöl
1/2 TL Speisenatron
1 Prise Salz
Saft von 1/2 Zitrone
1 Handvoll Sonnenblumenkerne
30 g Schwarzkümmelsamen
1 Chiasamen mit 100 ml Wasser
verrühren und 30 Minuten quel-
len lassen. Den Ofen auf 200 oc
vorheizen, eine kleine Kastenform
mit Backpapier auslegen.
2 Quinoa mit der dreifachen
Menge Wasser in einen Topf ge-
ben, einmal aufkochen und dann
3 Minuten köcheln lassen - die
Quinoa wird so nur teilweise ge-
gegart. In ein Sieb abgießen, kalt
abschrecken und anschließend
gut abtropfen lassen.
The lines are separated by \n characters.
The parsed recipe should look like this:
receipt = {
  title: 'Quinoa-Brot',
  items: [
    '30 g Chiasamen',
    '350 g Quinoa',
    '70 ml Olivenöl',
    '1/2 TL Speisenatron',
    '1 Prise Salz',
    'Saft von 1/2 Zitrone',
    '1 Handvoll Sonnenblumenkerne',
    '30 g Schwarzkümmelsamen'
  ],
  preparation: '1 Chiasamen mit 100 ml Wasser verrühren und 30 Minuten quellen lassen. Den Ofen auf 200 oc vorheizen, eine kleine Kastenform mit Backpapier auslegen. 2 Quinoa mit der dreifachen Menge Wasser in einen Topf geben, einmal aufkochen und dann 3 Minuten köcheln lassen - die Quinoa wird so nur teilweise gegegart. In ein Sieb abgießen, kalt abschrecken und anschließend gut abtropfen lassen.'
}
Pattern-matching solutions like RegExp don't sound suitable for this sort of categorization problem. You might want to consider clustering (k-means, etc.) - training a model to differentiate between ingredients and instructions. This can be done by labeling a number of recipes (the more the better), or by using unsupervised ML to cluster line by line.
If you need to stick to RegExp for some reason, you could keep track of repeated words. It's a weak methodology, but ingredient names (Chiasamen, Quinoa, ...) will be referenced in the instructions, so you can match across multiple lines to find where the same word is repeated later on:
(?<=\b| )([^ ]+)(?= |$).+(\1)
If you run this in a loop, with some additional logic, you can find ingredient-instruction pairs and work through the document with silhouette information.
You might be able to take advantage of ingredient lines containing numeric data, or words like "piece(s), sticks, leaves", which you could store in a dictionary. That can enrich the word-boundary matches.
I would reconsider using RegExp here at all...
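If the structure is always title first, then ingredient lines, then instructions, a pragmatic variant of the dictionary idea above is to classify lines by shape: ingredient lines start with a quantity plus a unit word ('30 g', '1/2 TL', '1 Prise', ...), while instruction text starts with a bare step number or plain words. The following is only a rough sketch of that heuristic (shown in Python for brevity; the same logic ports directly to Node.js, and the unit list is an assumption tuned to this one recipe):

import re

# Lines that look like "30 g ...", "1/2 TL ...", "1 Handvoll ...", or "Saft von ..."
INGREDIENT_RE = re.compile(r'^(\d+(/\d+)?\s*(g|kg|ml|l|TL|EL|Prise|Handvoll)\b|Saft\b)')

def parse_recipe(raw_text):
    lines = [l.strip() for l in raw_text.split('\n') if l.strip()]
    title, rest = lines[0], lines[1:]

    items, prep_lines, in_items = [], [], True
    for line in rest:
        if in_items and INGREDIENT_RE.match(line):
            items.append(line)
        else:
            in_items = False          # first non-ingredient-shaped line starts the preparation
            prep_lines.append(line)

    # Re-join the preparation text, undoing the hyphenation at line breaks
    # ("quel-" + "len" -> "quellen") while keeping dashes inside a line.
    preparation = ''
    for line in prep_lines:
        preparation += line[:-1] if line.endswith('-') else line + ' '

    return {'title': title, 'items': items, 'preparation': preparation.strip()}

On the raw text above this yields the title 'Quinoa-Brot', the eight ingredient lines, and the joined preparation paragraph; a real implementation would need a broader unit dictionary and some tolerance for OCR errors.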

A complicated logstash pattern in Grok

I have the following 3 lines in a log that need to be grok'd for Elasticsearch through Logstash.
2020-01-27 13:30:43,536 INFO com.test.bestmatch.streamer.function.BestMatchProcessor - Best match for ID: COi0620200110450BAD5CB723457A9B4747F1727 Total Batch Processing time: 3942
2020-01-27 13:30:43,581 INFO HTTPConnection - COi0620200110450BAD5CB723457A9B4747F1727 | People: 51 | Addresses: 5935 | HTTP Query Time: 24
2020-01-27 13:30:43,698 INFO bestRoute - COi0620200110450BAD5CB723457A9B4747F1727 | Touch Points: 117 | Best Match Time 3943
I tried various grok patterns but couldn't arrive at a working one.
Edited as per request
I need the following in ES in the context of the specific log entry
1st line:
ID: COi0620200110450BAD5CB723457A9B4747F1727
Total Batch Processing time: 3942
2nd line:
ID: COi0620200110450BAD5CB723457A9B4747F1727
People: 51
Addresses: 5935
HTTP Query Time: 24
3rd line:
Touch Points: 117
Best Match Time: 3943
The output is from a Flink log. If there are Flink patterns out there, please let me know.
1st line:
^%{TIMESTAMP_ISO8601:time}\s*%{LOGLEVEL:loglevel}.*ID: (?<ID>[\w\d]*).*time: (?<total_time>[\d]*)$
2nd line:
^%{TIMESTAMP_ISO8601:time}\s*%{LOGLEVEL:loglevel}.* - (?<ID>[\w]*).*People: (?<people>[\w]*).*Addresses: (?<addresses>[\d]*).*HTTP Query Time: (?<query_time>[\d]*)$
3rd line:
^%{TIMESTAMP_ISO8601:time}\s*%{LOGLEVEL:loglevel}.* - (?<ID>[\w]*).*Touch Points: (?<touch_points>[\d]*).*Best Match Time (?<best_match_time>[\d]*)$
There are many ways to parse this; this is only one approach. I would recommend adjusting the field names I used to the new ECS (Elastic Common Schema): https://www.elastic.co/guide/en/ecs/current/index.html
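To run all three expressions against the same stream, they can be passed to a single grok filter as an array; grok tries them in order and keeps the first match. A minimal filter sketch using the patterns above (field names unchanged):

filter {
  grok {
    match => {
      "message" => [
        "^%{TIMESTAMP_ISO8601:time}\s*%{LOGLEVEL:loglevel}.*ID: (?<ID>[\w\d]*).*time: (?<total_time>[\d]*)$",
        "^%{TIMESTAMP_ISO8601:time}\s*%{LOGLEVEL:loglevel}.* - (?<ID>[\w]*).*People: (?<people>[\w]*).*Addresses: (?<addresses>[\d]*).*HTTP Query Time: (?<query_time>[\d]*)$",
        "^%{TIMESTAMP_ISO8601:time}\s*%{LOGLEVEL:loglevel}.* - (?<ID>[\w]*).*Touch Points: (?<touch_points>[\d]*).*Best Match Time (?<best_match_time>[\d]*)$"
      ]
    }
  }
}

A date filter on the time field (and a rename to ECS-style names such as @timestamp and log.level) would typically follow.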

How to download pubmed articles and read them?

I'm having trouble saving PubMed articles and reading them. I've seen on this page that there are some special file types, but none of them worked for me. I want to save them in a way that lets me keep using the keys to get the data. I don't know if that is possible if I save it as a text file. My code is this one:
import sys
from Bio import Entrez
import re
import os
from Bio import Medline
from Bio import SeqIO

'''Class Crawler is responsible for browsing the biological databases
from DownloadArticles import DownloadArticles
c = DownloadArticles()
c.articles_dataset_list
'''

class DownloadArticles():
    def __init__(self):
        Entrez.email = 'myemail#gmail.com'
        self.dataC = self.saveArticlesFilesInXMLMode('pubmed', '26837606')

    '''Method 4: read data in text form.'''
    def saveArticlesFilesInXMLMode(self, dbs, ids):
        net_handle = Entrez.efetch(db=dbs, id=ids, rettype="medline", retmode="txt")
        directory = "/dataset/Pubmed/DatasetArticles/" + ids + ".fasta"
        # if not os.path.exists(directory):
        #     os.makedirs(directory)
        # filename = directory + '/'
        # if not os.path.exists(filename):
        out_handle = open(directory, "w+")
        out_handle.write(net_handle.read())
        out_handle.close()
        net_handle.close()
        print("Saved")
        print("Parsing...")
        record = SeqIO.read(directory, "fasta")
        print(record)
        return(record.read())
I'm getting this error: ValueError: No records found in handle
Please, can someone help me?
Now my code is like this. I am trying to write a function to save as .fasta like you did, and one to read the .fasta files like in the answer above.
import sys
from Bio import Entrez
import re
import os
from Bio import Medline
from Bio import SeqIO

def save_Articles_Files(dbName, idNum, rettypeName):
    net_handle = Entrez.efetch(db=dbName, id=idNum, rettype=rettypeName, retmode="txt")
    filename = path + idNum + ".fasta"
    out_handle = open(filename, "w")
    out_handle.write(net_handle.read())
    out_handle.close()
    net_handle.close()
    print("Saved")

Entrez.email = 'myemail#gmail.com'
dbName = 'pubmed'
idNum = '26837606'
rettypeName = "medline"
path = "/run/media/Dropbox/codigos/Codes/" + dbName
save_Articles_Files(dbName, idNum, rettypeName)
But my function is not working. I need some help, please!
You're mixing up two concepts.
1) Entrez.efetch() is used to access NCBI. In your case you are downloading an article from Pubmed. The result that you get from net_handle.read() looks like:
PMID- 26837606
OWN - NLM
STAT- In-Process
DA - 20160203
LR - 20160210
IS - 2045-2322 (Electronic)
IS - 2045-2322 (Linking)
VI - 6
DP - 2016 Feb 03
TI - Exploiting the CRISPR/Cas9 System for Targeted Genome Mutagenesis in Petunia.
PG - 20315
LID - 10.1038/srep20315 [doi]
AB - Recently, CRISPR/Cas9 technology has emerged as a powerful approach for targeted
genome modification in eukaryotic organisms from yeast to human cell lines. Its
successful application in several plant species promises enormous potential for
basic and applied plant research. However, extensive studies are still needed to
assess this system in other important plant species, to broaden its fields of
application and to improve methods. Here we showed that the CRISPR/Cas9 system is
efficient in petunia (Petunia hybrid), an important ornamental plant and a model
for comparative research. When PDS was used as target gene, transgenic shoot
lines with albino phenotype accounted for 55.6%-87.5% of the total regenerated T0
Basta-resistant lines. A homozygous deletion close to 1 kb in length can be
readily generated and identified in the first generation. A sequential
transformation strategy--introducing Cas9 and sgRNA expression cassettes
sequentially into petunia--can be used to make targeted mutations with short
indels or chromosomal fragment deletions. Our results present a new plant species
amenable to CRIPR/Cas9 technology and provide an alternative procedure for its
exploitation.
FAU - Zhang, Bin
AU - Zhang B
AD - Chongqing Engineering Research Centre for Floriculture, Key Laboratory of
Horticulture Science for Southern Mountainous Regions, Ministry of Education,
College of Horticulture and Landscape Architecture, Southwest University,
Chongqing 400716, China.
FAU - Yang, Xia
AU - Yang X
AD - Chongqing Engineering Research Centre for Floriculture, Key Laboratory of
Horticulture Science for Southern Mountainous Regions, Ministry of Education,
College of Horticulture and Landscape Architecture, Southwest University,
Chongqing 400716, China.
FAU - Yang, Chunping
AU - Yang C
AD - Chongqing Engineering Research Centre for Floriculture, Key Laboratory of
Horticulture Science for Southern Mountainous Regions, Ministry of Education,
College of Horticulture and Landscape Architecture, Southwest University,
Chongqing 400716, China.
FAU - Li, Mingyang
AU - Li M
AD - Chongqing Engineering Research Centre for Floriculture, Key Laboratory of
Horticulture Science for Southern Mountainous Regions, Ministry of Education,
College of Horticulture and Landscape Architecture, Southwest University,
Chongqing 400716, China.
FAU - Guo, Yulong
AU - Guo Y
AD - Chongqing Engineering Research Centre for Floriculture, Key Laboratory of
Horticulture Science for Southern Mountainous Regions, Ministry of Education,
College of Horticulture and Landscape Architecture, Southwest University,
Chongqing 400716, China.
LA - eng
PT - Journal Article
PT - Research Support, Non-U.S. Gov't
DEP - 20160203
PL - England
TA - Sci Rep
JT - Scientific reports
JID - 101563288
SB - IM
PMC - PMC4738242
OID - NLM: PMC4738242
EDAT- 2016/02/04 06:00
MHDA- 2016/02/04 06:00
CRDT- 2016/02/04 06:00
PHST- 2015/09/21 [received]
PHST- 2015/12/30 [accepted]
AID - srep20315 [pii]
AID - 10.1038/srep20315 [doi]
PST - epublish
SO - Sci Rep. 2016 Feb 3;6:20315. doi: 10.1038/srep20315.
2) SeqIO.read() is used to read and parse FASTA files. This is a format that is used to store sequences. A sequence in FASTA format is represented as a series of lines. The first line in a FASTA file starts with a ">" (greater-than) symbol. Following the initial line (used for a unique description of the sequence) is the actual sequence itself in standard one-letter code.
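For illustration, a minimal FASTA record looks like this (identifier and sequence are just placeholders):

>example_id some description of the sequence
MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ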
As you can see, the result that you get back from Entrez.efetch() (which I pasted above) doesn't look like a FASTA file. So SeqIO.read() gives the error that it can't find any sequence records in the file.
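If the goal is to keep accessing the data by keys afterwards, the MEDLINE text you downloaded can be parsed with Bio.Medline instead of SeqIO: Medline.read() returns a dictionary-like record keyed by the MEDLINE tags shown above (PMID, TI, AB, AU, ...). A small sketch along those lines (the e-mail address is a placeholder):

from Bio import Entrez, Medline

Entrez.email = "your.name@example.com"   # placeholder

# Fetch the article in MEDLINE format and parse it straight from the handle.
handle = Entrez.efetch(db="pubmed", id="26837606", rettype="medline", retmode="text")
record = Medline.read(handle)
handle.close()

print(record["PMID"])   # 26837606
print(record["TI"])     # title
print(record["AB"])     # abstract
print(record["AU"])     # list of authors

The same works for a saved copy: write net_handle.read() to a .txt file and later pass an open file handle to Medline.read().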

Search multiline error log for error code and then some of it's parameters on Linux

What command would give me the output I need for each instance of an error code in a very large log file? The file has records marked by a begin line and an end line with a number of characters, such as:
SR 120
1414760452 0 1 Fri Oct 31 13:00:52 2014 2218714 4
GROVEMR2 scn
../SrxParamIF.m 284
New Exam Started
EN 120
The 5th field is the error code, 2218714 in the previous example.
I thought of just grepping for the error code and outputting -A lines afterwards, then picking what I needed from that rather than parsing the entire file. That seems easy, but my grep/awk/sed skills aren't at that level.
ONLY when error 2274021 is encountered, as in the following example, I'd like output as shown.
Show me output such as: egrep 'Coil:|Connector:|Channels faulted:|First channel:' ERRORLOG | less
Part of input file of interest:
Mon Nov 24 13:43:37 2014 2274021 1
AWHMRGE3T NSP
SCP:RfHubCanHWO::RfBias 4101
^MException Class: Unknown Severity: Unknown
Function: RF: RF Bias
PSD: VIBRANT Coil: Breast SMI Scan: 1106/14
Coil Fault - Short Circuit
A multicoil bias fault was detected.
.
Connector: Port 1 (P1)
Channels faulted: 0x200
First channel: 10 of 32, counting from 1
Fault value: -2499 mV, Channel: 10->
Output:
Coil: Breast SMI
Connector: Port 1 (P1)
Channels faulted: 0x200
First channel: 10 of 32, counting from 1
Thanks in advance for any pointers!
Try the following (with the appropriate adaptations):
#!/usr/bin/perl
use strict;

$/ = "\nEN ";            # records are separated by "\nEN "
my $error = 2274021;     # the error we are looking for

while (<>) {                        # for every record
    next unless /\b$error\b/;       # skip it unless it contains the error
    for my $line (split(/\n/, $_)) {
        print "$line\n" if ($line =~ /Coil:|Connector:|Channels faulted:|First channel:/);
    }
    print "====\n";
}
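Assuming the script is saved as, say, find_error.pl (the name is only for illustration), it can be run directly against the log:

perl find_error.pl ERRORLOG | less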
Is this what you need?
