Storing runtime logs in a folder - Linux

I am running a shell script in a Linux environment that creates some logs (dynamic log files) as text files.
I want to move all the log files it creates into a single folder after a certain amount of time.
How can I do that? Can anyone suggest some commands?
Thanks in advance.

In the script you can define that directory as a variable and use it throughout the script.
#!/bin/bash
LOG_DIR=/tmp/logs
mkdir -p "$LOG_DIR"   ## make sure the directory exists
LOG_FILE=$LOG_DIR/log_file.$$   ## $$ (the shell's PID) gives a different log file for each run
## You can also build the name from a timestamp using the date command.
<Your commands> >> "$LOG_FILE"
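For example, if you would rather name the file with a timestamp than with the PID (a minimal sketch; the date format is just one common choice):
LOG_FILE=$LOG_DIR/log_file.$(date +%Y-%m-%d_%H-%M-%S)   ## e.g. log_file.2024-06-01_14-30-05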

It really depends on your situation:
[Suggested if your log files are small in size]
You may want to back up your logs by just adding a cron job that zips/tars them into another folder as a snapshot. Since the log files are small, even archiving everything would take many, many years to fill up your hard drive.
[Suggested if your log files are large]
In the script that generates the logs, you may want to rotate through a few indexed files, say log.0 to log.6, one per weekday from Sunday to Saturday. You can then have another script back up yesterday's log (so that there are no race conditions between the log producer and the log consumer, i.e. the log mover/copier). You can have a policy for how many days of backups to keep and when older ones should be discarded.
Moving or copying yesterday's log can easily be done by a cron job, as sketched below.
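A minimal sketch of both cron jobs (the paths /tmp/logs and /backup/logs are placeholders, GNU date is assumed for -d yesterday, and % is escaped as \% because many cron daemons require it):
# 1) Snapshot: every night at 01:00, tar and gzip the whole log directory into a dated archive.
0 1 * * * tar -czf /backup/logs/logs-$(date +\%F).tar.gz -C /tmp/logs .
# 2) Indexed rotation: every night at 01:10, copy yesterday's weekday-indexed log (Sunday=0 .. Saturday=6).
10 1 * * * cp /tmp/logs/log.$(date -d yesterday +\%w) /backup/logs/log.$(date -d yesterday +\%F)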

Related

Reading only the unread files in PySpark

I have a root directory where additional directories are created daily (or sometimes hourly) containing Avro files, for example:
root/2021/12/01/file121.avro
root/2022/06/01/file611.avro
root/2022/06/01/file612.avro
root/2022/06/01/file613.avro
root/2022/06/03/file631.avro
root/2022/06/03/file632.avro
root/2022/06/05/file651.avro
root/2022/06/05/file652.avro
root/2022/06/05/file653.avro
Each time my PySpark code runs, it needs to read the files that have not been read before from any of the subdirectories under the root. I need to process one file per run of the code. The code will be run about every 5 minutes.
How can this be accomplished, in PySpark?
Any approach/strategy and code ideas would be much appreciated.
Best :)
Michael

CopyTruncate log rotation mechanism is dropping logs

I have implemented Linux-based log rotation with the copytruncate strategy. Below is the config for it:
/data/app/info.log {
missingok
copytruncate
maxsize 50M
daily
rotate 30
create 644 app app
delaycompress
compress
}
With the above config, whenever the log rotation task is triggered while the application is simultaneously writing logs, some log lines get dropped. Can someone please tell me what I am doing wrong, or suggest another log rotation strategy with no data loss?
I know this question is a few months old, but simply for the benefit of others: you are not doing anything wrong. From the manpage:
copytruncate:
"Truncate the original log file in place after creating a copy, instead of moving the old log file and optionally creating a new one. It can be used when some program cannot be told to close its logfile and thus might continue writing (appending) to the previous log file forever. Note that there is a very small time slice between copying the file and truncating it, so some logging data might be lost. When this option is used, the create option will have no effect, as the old log file stays in place."
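If dropped lines are unacceptable, one common alternative is to avoid copytruncate altogether and let logrotate move the file and create a fresh one, then tell the application to reopen its log file. This only works if the application supports reopening (for example on a signal); the PID file path and signal below are assumptions, so treat this as a sketch only:
/data/app/info.log {
missingok
daily
maxsize 50M
rotate 30
create 644 app app
compress
delaycompress
postrotate
# hypothetical: ask the app to reopen its log file; use whatever mechanism your app actually supports
kill -USR1 "$(cat /var/run/app.pid)" 2>/dev/null || true
endscript
}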

Does the file get changed in squeue if I modify it after it has been sent into the queue? [duplicate]

Say I want to run a job on the cluster: job1.m
Slurm handles the batch jobs, and I'm loading Mathematica to save the output file job1.csv.
I submit job1.m and it is sitting in the queue. Now, I edit job1.m to have different variables and parameters, and tell it to save data to job1_edited.csv. Then I re-submit job1.m.
Now I have two batch jobs in the queue.
What will happen to my output files? Will job1.csv be data from the original job1.m file? And will job1_edited.csv be data from the edited file? Or will job1.csv and job1_edited.csv be the same output?
:(
Thanks in advance!
I am assuming job1.m is a Mathematica job, run from inside a Bash submission script. In that case, job1.m is read when the job starts, so if it is modified after submission but before the job starts, the modified version will run. If it is modified after the job starts, the original version will run.
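For this first case, a minimal wrapper submission script might look like the sketch below (the module name, resource options and the math -script invocation are assumptions and depend on how Mathematica is installed on your cluster):
#!/bin/bash
#SBATCH --job-name=job1
#SBATCH --time=01:00:00
#SBATCH --output=job1.out
# Hypothetical module name; load whatever actually provides Mathematica on your cluster.
module load mathematica
# job1.m is only read here, when the job starts running, not at submission time.
math -script job1.m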
If job1.m is the submission script itself (so you run sbatch job1.m), that script is copied into a spool directory specific to the job, so even if it is modified after the job is submitted, the original version will still run.
In any case, for reproducibility and traceability, it is better to make use of a workflow manager such as Fireworks or Bosco.

Creating a cron job that sends output to a file every day and overwrites this file every month

I need help with a cron job that sends output to a file every day and overwrites this file every month. My only problem is how to make it overwrite each month. I need this in one job, so creating two jobs (one that outputs to the file and another that removes it every month) is out of the picture.
You could run it every day but use date +%d to print the day of the month and act differently based on that (call the command with > to clobber the file on the first of the month, and with >> to append on every other day).
Note that some cron daemons require % to be escaped, hence \%.
# Run every day at 00:30 but overwrite the file on the 1st of each month; append on every other day.
# Note that this requires bash (or another POSIX shell) as your cron shell.
# May need to override with SHELL=/bin/bash
30 00 * * * if [ "$(date +\%d)" = "01" ]; then /your/command > /your/logfile; else /your/command >> /your/logfile; fi
Edit:
You mention in comments above that your actual goal is log rotation.
The norm for Linux systems is to use something like logrotate to manage logs like this. That also has the advantage that you can keep multiple previous log files and compress them if you like.
I would recommend making use of a logrotate config snippet to accomplish your goal instead of doing it in the cron job itself. Putting this in the cron job is counter-intuitive if it's merely for log rotation.
Here's an example logrotate snippet, which may go in a location like /etc/logrotate.d/yourapp depending on which Linux distribution you're using.
/var/log/yourlog {
daily
missingok
# keep one year of logs
rotate 365
compress
# keep the first one uncompressed for ease of viewing
delaycompress
}
This will result in your log file being rotated daily, with the first iteration being like /var/log/yourlog.1 and then compressed iterations like /var/log/yourlog.2.gz, /var/log/yourlog.3.gz and so on.
In my opinion therefore, your question is not actually a cron question. The kind of cron trickery used above would only be appropriate in situations such as when you want a job to fire on the last Sunday of the month, or the last day of the month, or other criteria that can't be expressed in cron syntax.

Aggregator that releases partial group based on correlation but holds on to rest of the messages

I want to set the correlation strategy on an aggregator so that it uses a date from the incoming file (message) name to correlate files, so that all files with today's date belong to the same group. Now, since I might have multiple days' worth of data, it's possible that I have aggregated two days of files. I want to base the release strategy on a done file (message) that includes the date in the filename as well, so essentially each day will have a bunch of files and a done file. Ingesting the done file should release that day's files from the aggregator but keep the other day's files until the done file for that day is ingested.
So in this scenario, correlation is obviously simple, but what I am not sure about is how to release not all but only some specific messages from the group based on the correlation key. The documentation talks about the message reaper, but that goes into message store territory, and I want to do all this in memory.
Let me elaborate with an example.
I have these files in a directory which I'm polling with a file inbound channel adapter:
file-1-2014.04.27.dat
file-2-2014.04.27.dat
file-3-2014.04.27.dat
done-2014.04.27.dat
file-1-2014.04.28.dat
file-2-2014.04.28.dat
done-2014.04.28.dat
As these files are being polled in, I have an aggregator in the flow where all incoming files are aggregated. To correlate them, I was thinking I could extract the date and put it in the correlation_id header, so that the first three files are considered to belong to one group and the next two files to a second group. Now, once I consume the done-2014.04.27.dat file, I want to release the first three files for further processing in the flow, but hold on to
file-1-2014.04.28.dat
file-2-2014.04.28.dat
until I receive the
done-2014.04.28.dat
and then release these 2 files.
Any help would be appreciated.
Thanks
I am not sure what you mean when you say "correlation is simple" but then go on to say you only want to release part of the group. If they have different dates then they will be in different groups, so there's no need to release part of a group; just release the whole group by running the reaper just after midnight (or any time the next day). It's not at all clear why you need a "done" message.
By default, the aggregator uses an in-memory message store (SimpleMessageStore).
EDIT:
Just put the done file in the same group and have your release strategy detect the presence of the done file. You could use an expression, but if the group can be large, it would be more efficient to implement ReleaseStrategy and iterate over MessageGroup.getMessages() looking for the done file.
The next step depends on what's downstream of the aggregator. If you use a splitter to split them back to separate files, you can simply add a filter to drop the done file. If you deal with the collection of files directly, either ignore the done file, or add a transformer to remove it from the collection.
With respect to the reaper; assuming files arrive in real time, I was simply suggesting that if you, say, run the reaper once a day (say at 01:00) with a group timeout of, say 30 minutes, then the reaper will release yesterday's files (without the need for a done file).
EDIT:
See my comment on your "answer" below - you have 2 subscribers on filesLogger.
