How to push complex legacy logs into logstash?

I'd like to use ELK to analyze and visualize our GxP logs, created by our stone-old LIMS system.
At least the system runs on SLES, but the whole logging structure is a bit of a mess.
I'll try to give you an impression:
Main_Dir
|-- Log Dir
|   |-- large number of sub dirs with a lot of files in them, some of which may be of interest later
|-- Archive Dir
|   |-- [some dirs which I'm not interested in]
|   |-- gpYYMM        <-- sub dirs created automatically each month: YY = year, MM = month
|   |   |-- gpDD.log  <-- log file created automatically each day
|   |-- [more dirs which I'm not interested in]
Important: Each medical examination that I need to track is completely logged in the gpDD.log file that corresponds to the date of the order entry. The duration of the complete examination varies between minutes (if no material is available), several hours or days (e.g. 48 h for a Covid-19 examination), or even several weeks for a microbiological sample. Example: all information about a Covid-19 sample that reached us on December 30th is logged in ../gp2012/gp30.log, even if the examination was performed on January 4th and the validation / creation of the report was finished on January 5th.
Could you please give me some guidance on the right Beat to use (I guess either logbeat or filebeat) and on how to implement the log transfer?

Logstash file input:
input {
  file {
    path => "/Main Dir/Archive Dir/gp*/gp*.log"
  }
}
Filebeat input:
- type: log
  paths:
    - /Main Dir/Archive Dir/gp*/gp*.log
In both cases the wildcard path will work. However, if you need further processing of the lines, I would suggest at least using Logstash as a passthrough (with the Beats input, if you do not want to install Logstash on the source machine itself, which is understandable).
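For illustration, a minimal sketch of such a passthrough pipeline, assuming Filebeat ships to Logstash on port 5044 and Elasticsearch runs on localhost; the host and index name are placeholders, not values from the question:
input {
  beats {
    port => 5044
  }
}
filter {
  # parse the gpDD.log lines here, e.g. with grok, dissect and date filters
}
output {
  elasticsearch {
    hosts => ["http://localhost:9200"]
    index => "lims-logs-%{+YYYY.MM.dd}"
  }
}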

Related

Easy way to merge files in Linux, based on line timestamp?

We currently have a process which grabs distinct log files off a remote system and places them all in a single consolidated file for analysis.
The lines are all of the form:
2023-02-08 20:39:32 Textual stuff goes here.
so the process is rather simple:
cat source_log_file_* | sort > consolidated_log_file
Now this works fine for merging the individual files into a coherent, ordered file, but it has the problem that it also reorders lines within each source log file when they have the same timestamp. For example, the left side below is changed into the right side:
2023-02-08 20:39:32 First ==> 2023-02-08 20:39:32 First
2023-02-08 20:39:32 Second ==> 2023-02-08 20:39:32 Fourth
2023-02-08 20:39:32 Third ==> 2023-02-08 20:39:32 Second
2023-02-08 20:39:32 Fourth ==> 2023-02-08 20:39:32 Third
This makes analysis rather difficult, as the sequence within a source log file is changed.
I could temporarily insert a sequence number (per source file) between the timestamp and the text and remove it from the consolidated file, but I was wondering whether it is possible to do a merge of the files based on timestamp rather than a sort.
By that, I mean open every single source log file (which is already sorted correctly based on sequence) and, until they're all processed, grab the first line that has the earliest timestamp and add it to the consolidated file. This way, the order of lines is preserved for each source log file but the lines from the separate files are sequenced correctly.
I can write a program to do that if need be, but I was wondering if there was an easy way to do it with standard tools.
You need to use just the timestamp as the sort key, not the whole line. Then use the --stable option to keep the lines in their original order if they have the same timestamp.
sort -k 1,2 --stable source_log_file_* > consolidated_log_file
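If you would rather do an explicit merge of the already-sorted files instead of a full re-sort, GNU sort can do that as well; a small sketch:
# -m merges inputs that are each already sorted by the key;
# -s (same as --stable) keeps lines with equal keys in their existing order
sort -m -s -k 1,2 source_log_file_* > consolidated_log_file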

How do I modify/subset a wget script to specify a date range to only download certain years into different scripts?

I am trying to download a lot of data for some research from the CMIP6 website (https://esgf-node.llnl.gov/search/cmip6/) that provides wget scripts for each model.
The scripts cover either 6-hourly or monthly files from 1850 to 2014. The date format looks like this (1st script): 185001010600-185101010000, or (for the 2nd script): 195001010600-195002010000, 195002010600-195003010000.
My goal is to turn one giant script into several smaller ones covering five years each for 1980 to 2015.
As an example, I would want to subset the main script into different scripts with 5-year intervals ("19800101-19841231", then "19850101-19901231", etc.), named wget-1980_1985.sh, wget-1985_1990.sh, and so on.
For an example date range for the 2nd script, I would need:
197912010600 through 198601010000, then every 5 years after that
I'm a beginner so please help if you can!
Part of the wget script format for each file looks like this (it won't let me copy and paste the whole thing since there are too many links [see below to find the file yourself]):
1.) # These are the embedded files to be downloaded
download_files="$(cat <<EOF
'hus_6hrPlevPt_MIROC6_historical_r1i1p1f1_gn_185001010600-185101010000.nc' 'http://esgf-data2.diasjp.net/thredds/fileServer/esg_dataroot/CMIP6/CMIP/MIROC/MIROC6/historical/r1i1p1f1/6hrPlevPt/hus/gn/v20191204/hus_6hrPlevPt_MIROC6_historical_r1i1p1f1_gn_185001010600-185101010000.nc' 'SHA256'
'fa9ac4149cc700876cb10c4e681173bcc0040ea03b9a439d1c66ef47b0253c5a'
'hus_6hrPlevPt_MIROC6_historical_r1i1p1f1_gn_185101010600-185201010000.nc' 'http://esgf-data2.diasjp.net/thredds/fileServer/esg_dataroot/CMIP6/CMIP/MIROC/MIROC6/historical/r1i1p1f1/6hrPlevPt/hus/gn/v20191204/hus_6hrPlevPt_MIROC6_historical_r1i1p1f1_gn_185101010600-185201010000.nc' 'SHA256'
'4ef4f99aa34aae6dfdafaa4aab206344125abe7808df675d688890825db53047'
2.) For the second script, the dates look like this: 'ps_6hrLev_MIROC6_historical_r1i1p1f1_gn_195001010600-195002010000.nc'
To run it, you just download the script from the website (see below),
or downloading it from one of these links should work:
1.) https://esgf-node.llnl.gov/esg-search/wget/?distrib=false&dataset_id=CMIP6.CMIP.MIROC.MIROC6.historical.r1i1p1f1.6hrPlevPt.hus.gn.v20191204|esgf-data2.diasjp.net
2.) A similar script can be seen here (the dates are different but I need this one too):
https://esgf-node.llnl.gov/esg-search/wget/?distrib=false&dataset_id=CMIP6.CMIP.MIROC.MIROC6.historical.r1i1p1f1.6hrLev.ps.gn.v20191114|esgf-data2.diasjp.net
To run the script in the terminal, this is the command I use:
bash wget* -H
and it will download each file.
I can open the script in vi and delete each file I don't need (using "dd"), but this would be extremely time-consuming.
To find this data and get the wget script from the website, go to: https://esgf-node.llnl.gov/search/cmip6/
and select the variables on the left side of the page as follows:
Source ID: MIROC6,
Experiment ID: Historical,
Variant Label: r1i1p1f1,
Table ID: 6hrPlevPt,
and Variable: hus
*If these files are too big, you can also select Frequency: monthly instead for a much smaller file; I just want you to see the date format, since the monthly filenames contain only the year and month.
Then hit Search and it will give you one model to download. At the bottom, with the links, it will say "WGET Script". Click that and it will download.
You can
vi wget*
to view and/or edit it or
bash wget* -H
to run/download each file.
It might ask you to log in, but I've found that typing nonsense for the username and password still starts the download.
Please help! This will be the next 6 months of my life and I really don't want to "dd" every file I don't need for all of these!
A bash for loop can generate the relevant date ranges and output filenames.
A simple sed script can delete the relevant lines, provided they appear in order.
For example:
#!/bin/bash
in=esgf_script
for y in $(seq 1979 5 2014); do
  out="wget_${y}-$((y+4)).sh"
  sed '/_gn_/{                 # if some kind of url:
    /_gn_'$((y+5))'/,$ d;      # delete if year >= y+5
    /_gn_2015/,$ d;            # delete if year >= 2015
    /_gn_'$y'/,$ !d;           # delete if year < y
  }' <"$in" >"$out"
done
The seq command generates every fifth year starting from 1979 up to 2014.
The sed script:
looks for lines containing URLs: /_gn_/
deletes them from the point where the year gets too big (y+5 or later, or 2015 and later) through to the end
otherwise deletes the ones before the first URL whose year is big enough (y or later)
This code assumes that:
no lines except URLs contain the first regex (/_gn_/)
the URLs appear in ascending year order (e.g. URLs containing 1994 cannot appear before ones containing 1993)
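To run the generated scripts afterwards, something along these lines should work (this just reuses the bash wget* -H invocation from the question; the loop variable is only illustrative):
# run each generated 5-year script in turn
for f in wget_*-*.sh; do
  bash "$f" -H
done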

Understanding GTFS service times

I have a problem understanding the GTFS file format. Or maybe there is an error in the data. There is a GTFS feed from a public transportation agency called "Verkehrsverbund Mittelthüringen" (VMT). The data is accessible at https://transitfeeds.com/p/verkehrsverbund-mittelth-ringen/1080/latest/routes.
For example: I have taken the trip with the ID 9782458 (trips.txt).
2841_0,45,9782458,"Erfurt, Thüringenhalle","",0,,,,
This trip has service ID 45, with the following specification (from calendar.txt):
45,0,0,0,0,0,0,0,20191101,20200229
Additionally, here are the entries for this service from calendar_dates.txt:
45,20191104,1
45,20191111,1
45,20191118,1
45,20191125,1
45,20191202,1
45,20191209,1
45,20191216,1
45,20191105,1
45,20191112,1
45,20191119,1
45,20191126,1
45,20191203,1
45,20191210,1
45,20191217,1
45,20191106,1
45,20191113,1
45,20191120,1
45,20191127,1
45,20191204,1
45,20191211,1
45,20191218,1
45,20191107,1
45,20191114,1
45,20191121,1
45,20191128,1
45,20191205,1
45,20191212,1
45,20191219,1
45,20191101,1
45,20191108,1
45,20191115,1
45,20191122,1
45,20191129,1
45,20191206,1
45,20191213,1
45,20191220,1
Does this mean that the service is available at all times except from 1 November 2019 to 29 February 2020? My problem now is the output of the search engine transitfeeds.com: it says the trip with ID 9782458 is available on 14 November 2019, which is contrary to my understanding of the data, namely that the trip won't be available in November. Where is the clue I missed? Or is there an error in the data?
The line you pasted indicates that service ID 45 runs on zero days of the week (that's what all those zeros mean), so the start and end dates in the same line don't really mean anything.
If this service actually does run on Nov. 14, this could be represented in the calendar_dates.txt file, which is usually used to represent service changes for special dates.
EDIT: the data you added from calendar_dates.txt does indeed show that service 45 has been added for date 20191114.
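You can verify this from the shell; a quick sketch, assuming the feed is unzipped in the current directory (in calendar_dates.txt an exception_type of 1 means the service is added on that date):
# weekday flags and date range for service 45 (all zeros here, so no regular weekly service)
grep '^45,' calendar.txt
# explicit exceptions for 14 November 2019; the trailing 1 means "service added"
grep '^45,20191114,' calendar_dates.txt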

NLog log-rotation/archiving inconsistent behavior

I have a project that uses NLog to create and maintain log files. This includes the use of log rotation/archiving of old log files. However, I've seen that the archiving settings of NLog are not always respected, especially regarding the ArchiveEvery configuration option. Based on this Stack Overflow answer, I assume the library checks the last write time of the file to decide whether it has to archive the current file and start a new one, but only does so when a new log message is passed to the library.
In my project I have the library configured to archive every minute. This should be fine, as my project logs messages every few seconds, and I expect to see an archived file every minute because the log messages keep coming. However, I see inconsistent behavior, sometimes with multiple minutes between subsequent archived log files. For example, I currently have the following files on my disk:
Filename | Last write time
----------------------+------------------
Log.01-06-2017.2.csv | 1-6-2017 10:42
Log.01-06-2017.3.csv | 1-6-2017 10:44
Log.01-06-2017.4.csv | 1-6-2017 10:46
Log.01-06-2017.5.csv | 1-6-2017 10:47
Log.01-06-2017.6.csv | 1-6-2017 10:48
Log.01-06-2017.7.csv | 1-6-2017 10:52
Log.01-06-2017.8.csv | 1-6-2017 11:01
Log.01-06-2017.9.csv | 1-6-2017 11:04
Log.01-06-2017.20.csv | 1-6-2017 11:43
Log.01-06-2017.csv | 1-6-2017 11:46
As you can see, the archived files are not created every minute. As for my NLog config at the moment:
fileTarget.ArchiveNumbering = ArchiveNumberingMode.DateAndSequence;
fileTarget.ArchiveEvery = FileArchivePeriod.Minute;
fileTarget.KeepFileOpen = true;
fileTarget.AutoFlush = true;
fileTarget.ArchiveDateFormat = "dd-MM-yyyy";
fileTarget.ArchiveOldFileOnStartup = true;
I am struggling to get this to work "properly". I put that in quotes as I don't have much experience with NLog and don't really know how the library behaves. I had hoped to find more information on the NLog wiki over at GitHub, but I couldn't find the information I needed there.
Edit
fileTarget.FileName is composed of a base folder (storage.Folder.FullName = "C:\ProgramData\\"), a subfolder (LogFolder = "AuditLog"), and the filename (LogFileName = "Log.csv"):
fileTarget.FileName = Path.Combine(storage.Folder.FullName, Path.Combine(LogFolder, LogFileName));
The fileTarget.ArchiveFileName is not set, so I assume it uses the default.
Could it be that specifying the complete path for the FileName is screwing things up? If so, is there a different way to specify a specific folder to put the log files in?

Bash Script Efficient For Loop (Different Source File)

First of all, I'm a beginner in bash scripting. I usually code in Java, but this particular task requires me to create some bash scripts in Linux. FYI, I've already made a working script, but I think it's not efficient enough because of the large files I'm dealing with.
The problem is simple: I have two logs that I need to compare, and I need to make some corrections to one of them... I'll call them log A and log B. These two logs have different formats; here is an example:
01/04/2015 06:48:59|9691427842139609|601113090635|PR|30.00|8.3|1.7| <- log A
17978712 2015-04-01 06:48:44 601113090635 SUCCESS DealerERecharge.<-log B
17978714 2015-04-01 06:48:49 601113090635 SUCCESS DealerERecharge.<-log B
As you can see, there is a gap in the timestamps. The log B line that actually matches log A is the one with ID 17978714, because its time is closest. The largest time gap I've seen is 1 minute. I can't use range logic, because if there is more than one line in log B within the 1-minute range, then all of those lines will show up in my regenerated log.
The script I made contains a for loop which iterates the timestamp of log A until it hits something in log B (the first one it hits is the closest).
Inside the for loop I have this line of code, which makes the loop slow:
LINEOUTPUT=$(less $File2 | grep "Validation 1" | grep "Validation 2" | grep "Timestamp From Log A")
I've read some samples using sed, but the problem is that I have two more validations to consider before matching on the timestamp.
The validations work as filters to narrow down the exact match between log A and log B.
Additional info: I tried benchmarking the script I made by running the loop repeatedly. One thing I've noticed is that even though I only use one pipeline in that line, each loop tick is still slow.
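As a side note on the matching itself: the two logs use different timestamp formats (log A appears to be dd/mm/yyyy, judging by the matching log B lines), so one building block for comparing them is converting a log A timestamp into log B's format and into epoch seconds, which turns "closest" into simple arithmetic. A minimal sketch, assuming bash and GNU date; the variable names are only illustrative:
# example log A timestamp (dd/mm/yyyy HH:MM:SS)
tsA='01/04/2015 06:48:59'
# rearrange into log B's format (yyyy-mm-dd HH:MM:SS)
tsB="${tsA:6:4}-${tsA:3:2}-${tsA:0:2} ${tsA:11}"
echo "$tsB"          # prints: 2015-04-01 06:48:59
# epoch seconds, so the time gap between two lines is a plain subtraction
date -d "$tsB" +%s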

Resources