How to schedule a periodic event and save files based on a configured interval? - jython-2.7

This might be two independent questions. I'm sure someone will let me know!
I need to create and write timestamped data to csv files at regular intervals. What I'm struggling with is how to schedule the writing of the data and also how to name the files. Details below.
I have 2 configuration settings for the intervals:
capture_int_s: this is the interval in seconds at which to write the timestamped data into the CSV file (e.g. set to 30, the current values are written to the CSV every 30 s). new_file_int needs to be a multiple of this.
new_file_int: this is the interval at which to create a new CSV file. This needs to be flexible enough to allow anything from seconds to weeks to calendar months to years (thinking crontab-esque).
I want to name the file like so:
If new_file_int = 30 minutes, then the filenames should be:
{prefix}_20220629_000000.csv # 000000 is HHmmss; first 30 mins of records
{prefix}_20220629_003000.csv # next 30 mins of records
etc...
{prefix}_20220629_233000.csv
{prefix}_20220630_000000.csv
etc...
If new_file_int = 1 calendar month, then the filenames should be:
{prefix}_20220601_000000.csv # holds all of June's data
{prefix}_20220701_000000.csv # holds all of July's data
etc...
How do I produce the date and time in the filename?
And what is the best way to schedule the writes? (I already have functions to write the data into a file, and the file will be created if the path doesn't exist.)
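One minimal sketch of both pieces (assuming plain Python/Jython 2.7; filename_for, filename_for_month, capture_loop and write_row are illustrative names, not part of the question): floor the current time to the start of its interval to get the filename's timestamp, and sleep until the next multiple of capture_int_s to schedule the writes.
import time
from datetime import datetime, timedelta

def filename_for(now, new_file_int_s, prefix):
    # Floor `now` to the start of its interval within the current day,
    # e.g. 00:17:42 with a 1800 s interval -> {prefix}_20220629_000000.csv
    midnight = now.replace(hour=0, minute=0, second=0, microsecond=0)
    secs = (now - midnight).total_seconds()
    floored = midnight + timedelta(seconds=secs - (secs % new_file_int_s))
    return '{}_{}.csv'.format(prefix, floored.strftime('%Y%m%d_%H%M%S'))

def filename_for_month(now, prefix):
    # Calendar-month files: floor to midnight on the first of the month.
    start = now.replace(day=1, hour=0, minute=0, second=0, microsecond=0)
    return '{}_{}.csv'.format(prefix, start.strftime('%Y%m%d_%H%M%S'))

def capture_loop(capture_int_s, new_file_int_s, prefix, write_row):
    # Simple scheduler: sleep until the next capture boundary, then write
    # one timestamped row into whichever file the current interval maps to.
    while True:
        time.sleep(capture_int_s - (time.time() % capture_int_s))
        now = datetime.now()
        write_row(filename_for(now, new_file_int_s, prefix), now)
The same flooring idea extends to weeks or years; calendar months are handled separately above because they are not a fixed number of seconds.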

Related

creating cron job that sends output to file every day and overwrites this file every month

I need help with a cron job that sends output to a file every day and overwrites this file every month. My only problem is how to make it overwrite the file each month. I need this in one job, so creating two jobs (one that outputs to a file, another that removes it every month) is out of the picture.
You could run it every day, but use date +%w to print the day-of-week number and act differently based on that: call with > to clobber the file instead of >> to append.
Note that some cron daemons require % to be escaped, hence \%.
# Run every day at 00:30 but overwrite file on Mondays; append every other day.
# Note that this requires bash as your shell.
# May need to override with SHELL=/bin/bash
30 00 * * * if [ "$(date +\%w)" = "1" ]; then /your/command > /your/logfile; else /your/command >> /your/logfile; fi
Edit:
You mention in comments above that your actual goal is log rotation.
The norm for Linux systems is to use something like logrotate to manage logs like this. That also has the advantage that you can keep multiple previous log files and compress them if you like.
I would recommend using a logrotate config snippet to accomplish your goal instead of doing it in the cron job itself. Putting this in the cron job is counter-intuitive if it's merely for log rotation.
Here's an example logrotate snippet, which may go in a location like /etc/logrotate.d/yourapp depending on which Linux distribution you're using.
/var/log/yourlog {
daily
missingok
# keep one year of logs
rotate 365
compress
# keep the first one uncompressed for ease of viewing
delaycompress
}
This will result in your log file being rotated daily, with the first iteration being like /var/log/yourlog.1 and then compressed iterations like /var/log/yourlog.2.gz, /var/log/yourlog.3.gz and so on.
In my opinion therefore, your question is not actually a cron question. The kind of cron trickery used above would only be appropriate in situations such as when you want a job to fire on the last Sunday of the month, or the last day of the month, or other criteria that can't be expressed in cron syntax.

Architecture for luigi tasks with multiple inputs

I have number of pickle files, one for each date between 2005 and 2010. Each file contains a dictionary of words with their respective frequencies for that date. I also have a "master file" with all unique words for the whole period. There are about 5 million words in total.
I need to take all that data and produce one CSV file per word, which will have one row per date. For example, the file some_word.txt:
2005-01-01,0.0003
2005-01-02,0.00034
2005-01-03,0.008
I'm having trouble organizing this process with the luigi framework. My current top-level task takes a word, looks up its associated frequency for every date and stores the result in a CSV file. I guess I could just loop through every word in my master file and run the task with that word, but I estimate that would take months, if not longer. Here's my top-level AggregateTokenFreqs task in a simplified version.
import pickle
import luigi

class AggregateTokenFreqs(luigi.Task):
    word = luigi.Parameter()

    def requires(self):
        pass  # not sure what to require here, master file?

    def output(self):
        return luigi.LocalTarget('data/{}.csv'.format(self.word))

    def run(self):
        results = []
        for date_ in some_list_of_dates:
            with open('pickles/{}.p'.format(date_), 'rb') as f:
                freqs = pickle.load(f)
            results.append((date_, freqs.get(self.word)))
        # Write results list to output CSV file
@MattMcKnight says you might be better off using multiprocessing. However, if you want to use Luigi, here's what you can do:
Luigi has the concept of workers that you configure: that's the number of local processes used to run different tasks in parallel.
Instead of "looping" through all the pickles inside one task, you can model it so that a single pickle is passed to the task (as a parameter). You will have to write the result to a TSV file in a directory, with a unique name.
Then have a loop that creates a task for each pickle (date), and configure the number of workers (e.g. 5); that way you can process 5 files at the same time.
You will also need an additional task that "joins" all the individual per-date files into one (see the sketch below).
Hope this helps.
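A rough sketch of that layout, assuming one pickle per date under pickles/ and partial TSV outputs under partial/ (the class names, paths and the dates parameter are illustrative, not from the question):
import pickle
import luigi

class ExtractDateFreqs(luigi.Task):
    # One task per date: read that day's pickle and dump every word's frequency.
    date_ = luigi.Parameter()

    def output(self):
        return luigi.LocalTarget('partial/{}.tsv'.format(self.date_))

    def run(self):
        with open('pickles/{}.p'.format(self.date_), 'rb') as f:
            freqs = pickle.load(f)
        with self.output().open('w') as out:
            for word, freq in freqs.items():
                out.write('{}\t{}\t{}\n'.format(word, self.date_, freq))

class JoinAllDates(luigi.Task):
    # Requires one ExtractDateFreqs per date; Luigi schedules them across workers.
    dates = luigi.ListParameter()

    def requires(self):
        return [ExtractDateFreqs(date_=d) for d in self.dates]

    def output(self):
        return luigi.LocalTarget('data/all_freqs.tsv')

    def run(self):
        with self.output().open('w') as out:
            for target in self.input():
                with target.open('r') as f:
                    out.write(f.read())
Running this with something like --workers 5 processes 5 dates at a time, and each pickle is loaded exactly once; the per-word CSVs can then be cut from the joined TSV by a further task without revisiting the pickles.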

Azure Data Factory Data-Set Slicing

I have some trouble understanding slicing (Dataset Availability) in Azure Data Factory. Let's say I have a source dataset which never changes, and for some reason I set up hourly slicing for it. Will each slice then be identical? What is the point of using slices at all in such a case (i.e. why is it Required)?
Or another case: let's say my source dataset is continuously appended with new data (for example an event log), and each morning I want to do some analysis on the full history of that log. Should I then set up daily slicing? Will each slice include the full history or just the last day?
The slices are the intervals in which the pipeline is executed within the period defined in the start and end properties of the pipeline.
If you have a fixed source and you execute an activity more than once, it will always use the same source (because it does not change). Let's say you set the start and end times to span one day and set the frequency to 1 hour: the activity will be executed 24 times, and you will have 24 slices, all using the same data source.
For your second scenario, if the data keeps changing, you can set the frequency to once a day. What gets processed depends on the activity you define in the pipeline: perhaps the pipeline deletes the old source once it finishes processing, or there is logic in the activity that takes only the new data.

Aggregator that releases partial group based on correlation but holds on to rest of the messages

I want to set the correlation strategy on an aggregator so that it uses a date from the incoming file's (message's) name to correlate files, so that all files with today's date belong to the same group. Since I might have multiple days' worth of data, it's possible that I have aggregated 2 days of files. I want to base the release strategy on a done file (message) that also includes the date in its filename, so essentially each day will have a bunch of files plus a done file. Ingesting a done file should release that day's files from the aggregator but still keep the other day's files until the done file for that day is ingested.
So in this scenario correlation is obviously simple, but what I am not sure about is how to release not all but only some specific messages from the group based on the correlation key. The documentation talks about the message reaper, but that gets into message store details, and I want to do all of this in memory.
Let me elaborate with an example. I have these files in a directory, which I'm polling with a file inbound channel adapter:
file-1-2014.04.27.dat
file-2-2014.04.27.dat
file-3-2014.04.27.dat
done-2014.04.27.dat
file-1-2014.04.28.dat
file-2-2014.04.28.dat
done-2014.04.28.dat
As these files are polled in, I have an aggregator in the flow where all incoming files are aggregated. To correlate, I was thinking I could extract the date and put it in the correlation_id header, so that the first 3 files are considered to belong to one group and the next 2 files to a second group. Now, once I consume the done-2014.04.27.dat file, I want to release the first 3 files for further processing in the flow but hold on to
file-1-2014.04.28.dat
file-2-2014.04.28.dat
until I receive the
done-2014.04.28.dat
and then release these 2 files.
Any help would be appreciated.
Thanks
I am not sure what you mean when you say "correlation is simple" but then go on to say you only want to release part of the group. If they have different dates then they will be in different groups, so there's no need to release part of a group, just release the whole group by running the reaper just after midnight (or any time the next day). It's not at all clear why you need a "done" message.
By default, the aggregator uses an in-memory message store (SimpleMessageStore).
EDIT:
Just put the done file in the same group and have your release strategy detect the presence of the done file. You could use an expression, but if the group can be large, it would be more efficient to implement ReleaseStrategy and iterate over MessageGroup.getMessages() looking for the done file.
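Just as an illustration of that check (shown in Python only to match the rest of this page; an actual Spring Integration ReleaseStrategy would be a Java class, and group_filenames stands in for the filenames carried by the messages from MessageGroup.getMessages()), the release logic amounts to:
def can_release(group_filenames):
    # Release the group as soon as it contains that day's done file.
    return any(name.startswith('done-') for name in group_filenames)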
The next step depends on what's downstream of the aggregator. If you use a splitter to split them back to separate files, you can simply add a filter to drop the done file. If you deal with the collection of files directly, either ignore the done file, or add a transformer to remove it from the collection.
With respect to the reaper; assuming files arrive in real time, I was simply suggesting that if you, say, run the reaper once a day (say at 01:00) with a group timeout of, say 30 minutes, then the reaper will release yesterday's files (without the need for a done file).
EDIT:
See my comment on your "answer" below - you have 2 subscribers on filesLogger.

How to deal with a large amount of logs and redis?

Say I have about 150 requests coming in every second to an api (node.js) which are then logged in Redis. At that rate, the moderately priced RedisToGo instance will fill up every hour or so.
The logs are only needed to generate daily/monthly/annual statistics: which was the top requested keyword, which was the top requested URL, total number of requests per day, etc. No super heavy calculations, but a somewhat time-consuming run through arrays to see which is the most frequent element in each.
If I analyze and then dump this data (with a setInterval function in node maybe?), say, every 30 minutes, it doesn't seem like such a big deal. But what if all of a sudden I have to deal with, say, 2500 requests per second?
All of a sudden I'm dealing with ~4.5 GB of data per hour, about 2.25 GB every 30 minutes. Even with how fast redis/node are, it'd still take a minute to calculate the most frequent requests.
Questions:
What will happen to the redis instance while 2.25 GB worth of data is being processed? (from a list, I imagine)
Is there a better way to deal with potentially large amounts of log data than moving it to redis and then flushing it out periodically?
IMO, you should not use Redis as a buffer to store your log lines and process them in batch afterwards. It does not really make sense to consume memory for this. You would be better served by collecting your logs on a single server and writing them to a filesystem.
Now what you can do with Redis is trying to calculate your statistics in real-time. This is where Redis really shines. Instead of keeping the raw data in Redis (to be processed in batch later), you can directly store and aggregate the statistics you need to calculate.
For instance, for each log line, you could pipeline the following commands to Redis:
zincrby day:top:keyword 1 my_keyword
zincrby day:top:url 1 my_url
incr day:nb_req
This will calculate the top keywords, top urls and number of requests for the current day. At the end of the day:
# Save data and reset counters (atomically)
multi
rename day:top:keyword tmp:top:keyword
rename day:top:url tmp:top:url
rename day:nb_req tmp:nb_req
exec
# Keep only the top 100 keywords and urls of the day
zremrangebyrank tmp:top:keyword 0 -101
zremrangebyrank tmp:top:url 0 -101
# Aggregate monthly statistics for keyword
multi
rename month:top:keyword tmp
zunionstore month:top:keyword 2 tmp tmp:top:keyword
del tmp tmp:top:keyword
exec
# Aggregate monthly statistics for url
multi
rename month:top:url tmp
zunionstore month:top:url 2 tmp tmp:top:url
del tmp tmp:top:url
exec
# Aggregate number of requests of the month
get tmp:nb_req
incrby month:nb_req <result of the previous command>
del tmp:nb_req
At the end of the month, the process is completely similar (using zunionstore or get/incrby on the monthly data to aggregate the yearly data).
The main benefit of this approach is that the number of operations done per log line is limited, while the monthly and yearly aggregations can easily be calculated.
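For illustration, here is the per-log-line update pipelined from application code (a redis-py sketch; the question itself uses node.js, where the equivalent would go through that client's pipelining, and this assumes redis-py >= 3.0, whose zincrby takes name, amount, value):
import redis

r = redis.Redis()

def log_request(keyword, url):
    # One round trip for the three per-line counters described above.
    pipe = r.pipeline()
    pipe.zincrby('day:top:keyword', 1, keyword)
    pipe.zincrby('day:top:url', 1, url)
    pipe.incr('day:nb_req')
    pipe.execute()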
How about using Flume or Chukwa (or perhaps even Scribe) to move log data to a different server (if available)? You could store the log data using Hadoop/HBase or any other disk-based store.
https://cwiki.apache.org/FLUME/
http://incubator.apache.org/chukwa/
https://github.com/facebook/scribe/
