I wonder if you can configure logstash in the following way:
Background Info:
Every day I get a xml file pushed to my server, which should be parsed.
To indicate a complete file transfer afterwards I get an empty .ctl (custom file) transfered to the same folder.
The files both have the following name schema 'feedback_{year}{yearday}_UTC{hoursminutesseconds}_51.{extention}' (e.g. feedback_16002_UTC235953_51.xml). So they have the same file name but one is with .xml and the other is a .ctl file.
Question:
Is there a way to configure logstash to wait parsing the xml file until the according .ctl file is present?
EDIT:
Is there maybe a way to archiev that with filebeat?
EDIT2:
It would also be enough to be able to configure logstash in a way that it will wait x minutes before starting to process a new file, if that is easier.
Thanks for any help in advance
Your problem is that you don't want to start the parser before the file transfer hasn't been completed. So, why don't push the data to a file (file-complete.xml) when you find your flag file (empty.ctl)?
Here is the possible logic for a script and runs using crontab:
if empty.ctl exists:
Clear file-complete.xml
Add the content of file.xml to file-complete.xml.
Remove empty.ctl
This way, you'd need to parse the data from file-complete.xml. I think is simpler to debug and configure.
Hope it helps,
Related
I wanted to implement a log rotation option in linux. I have a *.trc file where all the logs are getting written. I wanted a new log file to be created every hour. I have done some analysis and found the below
I have done some analysis and got to know about the logrotate option. Where we need to update the rotation details for a specific file in the logrotate.conf file
I wanted to know if there is an option without using the logrotate option. I wanted to rotate the logfiles on an hourly basis, so something like appending date and hour information to the log file and create new files based on the current hour information.
Im looking for some suggestions on how to implement the log rotation using the second option specified above.
Any details on the above would be really helpful
If you have control over the process that creates the logs, you could just timestamp the file at the moment of creation. This will remove the need to rename the log.
Before you write every line you check the time. If one hour passed after that file was created, you close the current file and open a new one with a new timestamp.
If you do not have control over the process, you can pipe the output of your process (stdout,stderr) to multilog, which is a binary that's part of the package daemon-tools in most Linux distros.
https://cr.yp.to/daemontools/multilog.html
I'm currently using logstash to import dozens of log files from different webapps into Graylog. It works great the files are tagged so I know from wich webapp they originate.
I can't change the webapp thus I can't add a GELF appender to the log4j conf of the webapp. The idea is to periodically retrieve the log files, parse them and import them with logstash into Graylog.
My problem is how do I make sure I don't import a log event I've already imported.
For example, I have a log file that has a log pattern that increments: log.1, log.2, etc. So I'll have log events that could be in log.1 the first time and 2 weeks later when I reimport them they'll maybe be in log.3.
I'm afraid I can't handle that with logstash's file input "sincedb_path" and "start_position".
So here are a few options I've gathered and I'd like your input about them, if anyone encountered the same issue:
Use a logstash filter dropping all events before a certain date,
requires to keep an index of every last log date of every file
imported (potentially 50+) and a lot of configuration writing
Use of a drool rule in GrayLog to refuse logs with timestamps prior
to last log received for a given type
Ask to change the log pattern to be something like log.date instead
of a log pattern that renames files (but I'd rather avoid this one)
Any other idea?
For one of my Project, I have a certain challenge where I need to take all the reports generated in a certain path, I want this to be an automated process in "Linux". I know the way how to get the file names which have been updated in the past 120 mins, but not the files directly. Now my requirements are in such a way
Take a certain files that have been updated in past 120 mins from the path
/source/folder/which/contains/files
Now do some bussiness logic on this generated files which i can take care of
Move this files to
/destination/folder/where/files/should/go
I know how to achieve #2 and #3 but not sure of #1. Can someone help me how can i achieve this.
Thanks in Advance.
Write a shell script. Sample below. I haven't provided the commands to get the actual list of file names as you said you know how to do that.
#!/bin/sh
files=<my file list>
for file in $files; do
cp $file <destination_dirctory>
done
Currently I am using file input plugin to go over my log archive but file input plugin is not the right solution for me because file input plugin inherently expects that file is stream of events and not as a static file. Now, this is causing a great deal of problem for me because my log archive has a 100,000 + log files and I logstash opens a handle on all these files which are never going to change.
I am facing following problems
1) Logstash fails with problem mentioned in SO
2) With those many open file handles log archival storage is getting very slow.
Does anybody know a way to let logstash know that treat files statically or once a file is processed do not keep file handle on it.
In logstash Jira bug, I was told to write my own plugin with some other suggestions which won't help me much.
Logstash file input can process static file. You need to add this configuration
file {
path => "/your/logs/path"
start_position => "beginning"
}
After adding the start_position, logstash reads the file from the beginning. Please refer here for more information. Remember that this option only modifies “first contact” situations where a file is new and not seen before. If a file has already been seen before, this option has no effect. Otherwise you have set your sincedb_path to /dev/null .
For the first question, I have answer in the comment. Please try to add the maximum file opened.
For my suggestion, You can try to write a script copy the log file to the logstash monitor path and move it out constantly. You have to estimate the time that logstash process a log file.
look out for this also turn on -v and --debug for logstash
{:timestamp=>"2016-05-06T18:47:35.896000+0530",
:message=>"_discover_file: /datafiles/server.log:
**skipping because it was last modified more than 86400.0 seconds ago**",
:level=>:debug, :file=>"filewatch/watch.rb", :line=>"330",
:method=>"_discover_file"}
solution is to touch the file or change the ignore_older setting
When I start logstash, the old logs are not imported into ES.
Only the new request logs are recorded in ES.
Now I've see this in the doc.
Even if I set the start_position=>"beginning", old logs are not inserted.
This only happens when I run logstash on linux.
If I run it with the same config, old logs are imported.
I don't even need to set start_position=>"beginning" on windows.
Any idea about this ?
When you read an input log to Logstash, Logstash will keep an record about the position it read on this file, that's call sincedb.
Where to write the sincedb database (keeps track of the current position of monitored log files).
The default will write sincedb files to some path matching "$HOME/.sincedb*"
So, if you want to import old log files, you must delete all the .sincedb* at your $HOME.
Then, you need to set
start_position=>"beginning"
at your configuration file.
Hope this can help you.
Please see this line also.
This option only modifies "first contact" situations where a file is new and not seen before. If a file has already been seen before, this option has no effect.