Logstash file input glob? - logstash

I am using logstash file input with glob to read my files:
path => "/home/Desktop/LogstashInput/**/*.log"
Directory structure format:
LogstashInput => server-name => date => abc.log
This reads all files ending in ".log" within every date directory.
Now I want to read only some particular log files within all the date directories. E.g. the 2014.11.05 directory has abc.log, xyz.log, ... 10 such files, and I want to read, say, only five particular ones. What should the path setting be?
I read about exclude in logstash, but that would mean listing a lot of files to exclude, since there are different types of files within the different server-name directories and different dates.

The logstash agent is written in Ruby, so the path option follows Ruby's glob rules. Depending on your actual file names, you may be able to construct a single pattern that matches only the files you want.
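For example, if the files you care about have fixed names across all the date directories, Ruby-style brace alternation may be enough. A hedged sketch, where abc.log and xyz.log come from the question and the remaining names are placeholders for your other files:
path => "/home/Desktop/LogstashInput/**/{abc,xyz,def,ghi,jkl}.log"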

Related

python 3.x: how to combine dictionary zipped log file if exist in two different directories into another directory

I have the below structure of directories containing .log.gz files. These log files each hold a string-encoded dictionary. Any one of these directories may not exist.
xyz1/
2022-08-08T01:31Z.log.gz
2022-08-08T01:33Z.log.gz
xyz2/
2022-08-08T01:30Z.log.gz
2022-08-08T01:33Z.log.gz
I want to create another directory and combine the above files into it:
xyz/
2022-08-08T01:30Z.log.gz
2022-08-08T01:31Z.log.gz
2022-08-08T01:33Z.log.gz
Conditions:
Either xyz1 or xyz2 may exist, or both can exist.
If a file with the same name exists in both directories, combine the two into the third directory "xyz".
While combining, the string dictionary format should be retained.
Solution I opted for:
Check whether one directory exists and iterate over its files.
For each file, check whether a file with the same name exists in the other directory.
If yes, decompress both files, combine them, and gzip the result into the xyz directory.
If not, copy the file into xyz.
Is there a better way to perform the above operation? Below is how I combine two log files (see also the fuller sketch after the code).
import gzip, json
combinefile = {}
with gzip.open("xyz1/file1.log.gz", "rt") as f1, gzip.open("xyz2/file1.log.gz", "rt") as f2:
    combinefile.update(json.loads(f1.read()))
    combinefile.update(json.loads(f2.read()))
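For what it's worth, here is a minimal sketch of the full per-file merge described above, assuming both source directories contain gzip-compressed JSON dictionaries and that the directory names xyz1, xyz2 and xyz are as in the question (file names are illustrative):
import gzip
import json
import os

src_dirs = ["xyz1", "xyz2"]   # either or both may exist
out_dir = "xyz"
os.makedirs(out_dir, exist_ok=True)

# Collect every file name that appears in at least one source directory.
names = set()
for d in src_dirs:
    if os.path.isdir(d):
        names.update(os.listdir(d))

for name in names:
    merged = {}
    for d in src_dirs:
        path = os.path.join(d, name)
        if os.path.exists(path):
            with gzip.open(path, "rt") as f:
                merged.update(json.loads(f.read()))   # later dirs overwrite duplicate keys
    # Write the combined dictionary back out as gzip-compressed JSON.
    with gzip.open(os.path.join(out_dir, name), "wt") as f:
        f.write(json.dumps(merged))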

How to read multiple CSV (leaving out specific ones) from a nested directory in PySpark?

Let's say I have a directory called 'all_data', and inside it several other directories named after the date of the data they contain. These directories are named date_2020_11_01 to date_2020_11_30, and each of them contains csv files which I intend to read into a single dataframe.
But I don't want to read the data for date_2020_11_15 and date_2020_11_16. How do I do it?
I'm not sure how to exclude specific files, but you can select a range of file names using brace/bracket patterns. The code below would select all files except 11_15 and 11_16:
spark.read.csv("date_2020_11_{1[0-4,7-9],[0,2-3][0-9]}.csv")
df = spark.read.format("parquet").option("header", "true").load(paths)
where paths is a list of all the paths where data is present. This worked for me.
A simple method is to read the whole data directory as it is and apply a filter condition:
df.filter("dataColumn != 'date_2020_11_15' AND dataColumn != 'date_2020_11_16'")
Alternatively, you can use the os module to list the date directories and iterate over that list, eliminating the unwanted ones with a condition (a sketch follows).
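A hedged sketch of that second approach, assuming the layout from the question (date directories under all_data) and an existing SparkSession named spark:
import os

base = "all_data"
skip = {"date_2020_11_15", "date_2020_11_16"}

# Keep every date directory except the two we want to leave out.
paths = [
    os.path.join(base, d)
    for d in os.listdir(base)
    if d.startswith("date_") and d not in skip
]

df = spark.read.option("header", "true").csv(paths)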

what are the usual problems that we face with sincedb in logstash

I am using the ELK stack, and I am working with the file input plugin in logstash.
At first I used file*.txt to match the file pattern.
Later I used masterfile.txt, a single file that holds the data of all the matching patterns.
Now I am going back to file*.txt, but here I see the problem: in Kibana I only see data from the date after file*.txt was replaced with masterfile.txt, not the history.
I feel I need to understand the behavior of logstash's sincedb here,
and also find a possible solution to get the historical data.
Logstash stores the position of the last byte read for each log file in the file given by sincedb_path. On execution, Logstash resumes reading each input file from that stored position.
Also take into account start_position and the name of the index (in the Logstash output) if you want to create a new index with different logs.
https://www.elastic.co/guide/en/logstash/current/plugins-inputs-file.html#plugins-inputs-file-sincedb_path
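As an illustration, a hedged config sketch that forces historical data to be re-read by starting from the beginning and using a throwaway sincedb (the path is a placeholder; note that start_position only applies to files Logstash has not seen before, which is why the sincedb is discarded here):
input {
  file {
    path => [ "/path/to/file*.txt" ]
    start_position => "beginning"
    sincedb_path => "/dev/null"
  }
}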

Reading from rotating log files in logstash

As per the documentation of logstash's file plugin, the section on File Rotation says the following:
To support programs that write to the rotated file for some time after
the rotation has taken place, include both the original filename and
the rotated filename (e.g. /var/log/syslog and /var/log/syslog.1) in
the filename patterns to watch (the path option).
If anyone can clarify how to specify two filenames in the path configuration, that would be of great help, as I did not find an exact example. Some examples suggest using wildcards like /var/log/syslog*, however I am looking for an example that achieves exactly what the documentation says: two filenames in the path option.
The attribute path is an array and thus you can specify multiple files as follows:
input {
  file {
    path => [ "/var/log/syslog.log", "/var/log/syslog1.log" ]
  }
}
You can also use * notation for name or directory as follows:
input {
  file {
    path => [ "/var/log/syslog.log", "/var/log/syslog1.log", "/var/log/*.log", "/var/*/*.log" ]
  }
}
When you specify a path such as /var/*/*.log, the glob is expanded so that all files with a .log extension one directory level under /var are picked up (use ** if you need a fully recursive match).
Reference Documentation

How to capture file names and check for date part in the filename in Unix code?

I am a newbie to Linux and I have a requirement where I need to capture the file names, check the date in the file names, and proceed to load the data from all the files only if all the file names have the same date in them. Let's say I have a few files:
X_US_20130420.CSV
X_CA_20130420.CSV
X_PH_20130420.CSV
X_NS_20130420.CSV
I need to check if all the files have the same date (20130420 here) and then use that date as a parameter in my next job. Please help.
There are lots of ways to go about this, but one way would be to loop through all the files, parse out the dates, and see if any date doesn't match the others. I'm not going to deprive you of the privilege of figuring out the bulk of the work, but the date parsing can be done like so:
If you have bash:
shopt -s extglob               # needed for the *( ) extended pattern below
file=X_US_20130420.CSV
myDate=${file##*([A-Z_])}      # strip the leading letters/underscores
myDate=${myDate%.CSV}          # strip the extension
# myDate is now 20130420
If you don't have bash:
file=X_US_20130420.CSV
myDate="$(echo "$file" | sed 's:^[A-Z_]\{1,\}\([0-9]\{8\}\)\.CSV:\1:')"
# myDate is now 20130420
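For completeness, a minimal sketch of the comparison loop described above (bash assumed; the X_*.CSV pattern is taken from the example names):
#!/bin/bash
shopt -s extglob

refDate=""
for file in X_*.CSV; do
    d=${file##*([A-Z_])}       # strip the leading letters/underscores
    d=${d%.CSV}                # strip the extension, leaving the date
    if [ -z "$refDate" ]; then
        refDate=$d
    elif [ "$d" != "$refDate" ]; then
        echo "Date mismatch: $file has $d, expected $refDate" >&2
        exit 1
    fi
done
echo "All files share date $refDate"   # pass $refDate to the next job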
