Architecture for luigi tasks with multiple inputs - python-3.x

I have number of pickle files, one for each date between 2005 and 2010. Each file contains a dictionary of words with their respective frequencies for that date. I also have a "master file" with all unique words for the whole period. There are about 5 million words in total.
I need to take all that data and produce one CSV file per word, with one row per date. For example, the file some_word.txt would look like:
2005-01-01,0.0003
2005-01-02,0.00034
2005-01-03,0.008
I'm having trouble organizing this process with the luigi framework. My current top-level task takes a word, looks up its associated frequency for every date and stores the result in a CSV file. I guess I could just loop through every word in my master file and run the task with that word, but I estimate that would take months, if not longer. Here's a simplified version of my top-level AggregateTokenFreqs task.
import pickle

import luigi


class AggregateTokenFreqs(luigi.Task):
    word = luigi.Parameter()

    def requires(self):
        pass  # not sure what to require here, master file?

    def output(self):
        return luigi.LocalTarget('data/{}.csv'.format(self.word))

    def run(self):
        results = []
        for date_ in some_list_of_dates:
            with open('pickles/{}.p'.format(date_), 'rb') as f:
                freqs = pickle.load(f)
            results.append((date_, freqs.get(self.word)))
        # Write the results list to the output CSV file

@MattMcKnight says you might be better off using multiprocessing. However, if you want to use Luigi, here's what you can do:
Luigi has the concept of workers that you configure: that's the number of local processes used to run different tasks in parallel.
Instead of "looping" through all the pickles inside one task, you can model the task so that a single pickle (date) is passed to it as a parameter. That task writes its result to a TSV file with a unique name in some directory.
Then have a loop that creates one task per pickle (date), and configure the number of workers (e.g. 5). That way you can process 5 files at the same time.
You will also need an additional task that "joins" all the individual per-date files into one.
Hope this helps.
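For illustration, here is a minimal sketch of that shape (the task names, paths, and the list-of-dates parameter are my own assumptions, not from the question):

import pickle

import luigi


class TokenFreqsForDate(luigi.Task):
    """Per-date task: dump one pickle's word frequencies to a TSV file."""
    date = luigi.Parameter()  # e.g. '2005-01-01'

    def output(self):
        return luigi.LocalTarget('freqs/{}.tsv'.format(self.date))

    def run(self):
        with open('pickles/{}.p'.format(self.date), 'rb') as f:
            freqs = pickle.load(f)
        with self.output().open('w') as out:
            for word, freq in freqs.items():
                out.write('{}\t{}\n'.format(word, freq))


class JoinTokenFreqs(luigi.Task):
    """Join task: requires every per-date task, then concatenates their outputs."""
    dates = luigi.ListParameter()

    def requires(self):
        return [TokenFreqsForDate(date=d) for d in self.dates]

    def output(self):
        return luigi.LocalTarget('data/all_freqs.tsv')

    def run(self):
        with self.output().open('w') as out:
            for target in self.input():
                with target.open('r') as f:
                    out.write(f.read())

Running something like luigi --module your_module JoinTokenFreqs --workers 5 would then process several dates in parallel before the join step runs.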

Related

Celery Process a Task and Items within a Task

I am new to Celery, and I would like advice on how best to use Celery to accomplish the following.
Suppose I have ten large datasets. I realize that I can use Celery to do work on each dataset by submitting ten tasks. But suppose that each dataset consists of 1,000,000+ text documents stored in a NoSQL database (Elasticsearch in my case). The work is performed at the document level. The work could be anything - maybe counting words.
For a given dataset, I need to start the dataset-level task. The task should read documents from the data store. Then workers should process the documents - a document-level task.
How can I do this, given that the task is defined at the dataset level, not the document level? I am trying to move away from using a JoinableQueue to store documents and submit them for work with multiprocessing.
I have read that it is possible to use multiple queues in Celery, but it is not clear to me whether that is the best approach.
Let's see if this helps. You can define a workflow, add tasks to it, and then run the whole thing after building up your tasks. You can have normal Python functions return task signatures that can be added into Celery primitives (chain, group, chord, etc.). See here for more info. For example, let's say you have two tasks that process documents for a given dataset:
def some_task(documents):
    return dummy_task.si(documents)


def some_other_task(documents):
    return dummy_task.si(documents)


@celery.task(bind=True)
def dummy_task(self, *args, **kwargs):
    return True
You can then provide a task that generates the subtasks like so:
@celery.task()
def dataset_workflow(*args, **kwargs):
    datasets = get_datasets(*args, **kwargs)
    workflows = []
    for dataset in datasets:
        documents = get_documents(dataset)
        workflow = chain(some_task(documents), some_other_task(documents))
        workflows.append(workflow)
    run_workflows = chain(*workflows).apply_async()
Keep in mind that generating a lot of tasks can consume a lot of memory on the Celery workers, so throttling or breaking up the task generation might be needed as you start to scale your workloads.
Additionally, you can put the document-level tasks on a different queue than your workflow task if needed, based on resource constraints etc.
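For instance, a rough sketch of that routing (the queue names and the 'tasks' module path are illustrative, not from the code above):

# Route document-level work and the orchestration task to separate queues.
celery.conf.task_routes = {
    'tasks.dummy_task': {'queue': 'documents'},
    'tasks.dataset_workflow': {'queue': 'workflows'},
}

You could then start dedicated workers per queue, e.g. celery -A tasks worker -Q documents and celery -A tasks worker -Q workflows, and kick everything off with dataset_workflow.delay().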

How to schedule periodic event and save file based on configured interval?

This might be two independent questions. I'm sure someone will let me know!
I need to create and write timestamped data to csv files at regular intervals. What I'm struggling with is how to schedule the writing of the data and also how to name the files. Details below.
I have 2 configuration settings for the intervals:
capture_int_s: this is the interval in seconds at which to write the timestamped data into the CSV file. new_file_int needs to be a multiple of this value (e.g. set to 30s would write current values to the CSV every 30s)
new_file_int: this is the interval at which to create a new CSV file. This needs to be flexible to allow anything from seconds to weeks to calendar month to years (thinking crontab-esque)
I want to name the file like so:
If new_file_int = 30mins, then the filename should be:
{prefix}_20220629_000000.csv #000000 is HHmmss. First 30 mins of records
{prefix}_20220629_003000.csv # next 30mins of records
etc...
{prefix}_20220629_233000.csv
{prefix}_20220630_000000.csv
etc...
If new_file_int = 1 cal month, then the filename should be:
{prefix}_20220601_000000.csv # holds all of June's data
{prefix}_20220701_000000.csv # holds all of July's data
etc...
How do I produce the date and time in the filename?
And what is the best way to schedule the writes? (I have functions already to write the data into a file and the file will be created if the path doesn't exist)
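For the fixed-length intervals, a minimal sketch of one way to derive those names is to floor the timestamp to the start of its interval and format it with strftime (the helper below is illustrative only; calendar-month and year intervals would need separate handling):

from datetime import datetime, timedelta

def filename_for(prefix, ts, new_file_int_s):
    # Floor ts to the start of its interval and format it as
    # {prefix}_YYYYmmdd_HHMMSS.csv
    epoch = datetime(1970, 1, 1)
    seconds = int((ts - epoch).total_seconds())
    floored = epoch + timedelta(seconds=seconds - seconds % new_file_int_s)
    return '{}_{}.csv'.format(prefix, floored.strftime('%Y%m%d_%H%M%S'))

# filename_for('data', datetime(2022, 6, 29, 0, 42, 17), 30 * 60)
# -> 'data_20220629_003000.csv'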

In PySpark groupBy, how do I calculate execution time by group?

I am using PySpark for a university project, where I have large dataframes and I apply a PandasUDF, using groupBy. Basically the call looks like this:
df.groupBy(col).apply(pandasUDF)
I am using 10 cores in my Spark config (SparkConf().setMaster('local[10]')).
The goal is to be able to report the time each group took to run my code. I want the time each group takes to finish so that I can take the average. I am also interested in calculating the standard deviation.
I am now testing with cleaned data that I know will be separated into 10 groups, and I have the UDF print the running time using time.time(). But if I use more groups this is not going to be practical (for context, all my data will be separated into 3000-something groups). Is there a way to measure the execution time per group?
If you don't want to print the execution time to stdout, you could return it as an extra column from the Pandas UDF instead, e.g.:
#pandas_udf("my_col long, execution_time long", PandasUDFType.GROUPED_MAP)
def my_pandas_udf(pdf):
start = datetime.now()
# Some business logic
return pdf.assign(execution_time=datetime.now() - start)
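You could then aggregate that column to get the mean and standard deviation, e.g. (a sketch that assumes the grouping column col is preserved in the UDF output):

from pyspark.sql import functions as F

timed = df.groupBy(col).apply(my_pandas_udf)
# execution_time is repeated on every row of a group, so collapse it to one
# value per group before averaging
per_group = timed.groupBy(col).agg(F.first('execution_time').alias('execution_time'))
per_group.agg(
    F.mean('execution_time').alias('mean_seconds'),
    F.stddev('execution_time').alias('stddev_seconds'),
).show()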
Alternatively, to compute the average execution time in the driver application, you could accumulate the execution time and the number of UDF calls in the UDF with two Accumulators. e.g.
from datetime import datetime
from pyspark.sql.functions import pandas_udf, PandasUDFType

udf_count = sc.accumulator(0)
total_udf_execution_time = sc.accumulator(0.0)

@pandas_udf("my_col long", PandasUDFType.GROUPED_MAP)
def my_pandas_udf(pdf):
    start = datetime.now()
    # Some business logic
    udf_count.add(1)
    total_udf_execution_time.add((datetime.now() - start).total_seconds())
    return pdf

# Some Spark action to run business logic
mean_udf_execution_time = total_udf_execution_time.value / udf_count.value
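If you also need the standard deviation with this approach, you could additionally accumulate the squared execution times (a hypothetical extension, not part of the snippet above) and use Var[X] = E[X^2] - E[X]^2:

sum_sq_execution_time = sc.accumulator(0.0)

# inside the UDF, next to the other accumulator updates:
#     elapsed = (datetime.now() - start).total_seconds()
#     sum_sq_execution_time.add(elapsed ** 2)

mean_t = total_udf_execution_time.value / udf_count.value
stddev_t = (sum_sq_execution_time.value / udf_count.value - mean_t ** 2) ** 0.5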

How to reduce time taken to convert a dask dataframe to a pandas dataframe

I have a function to read large csv files using dask dataframe and then convert to pandas dataframe, which takes quite a lot of time. The code is:
import os
from glob import glob

import dask.dataframe as dd


def t_createdd(Path):
    dataframe = dd.read_csv(Path, sep=chr(1), encoding="utf-16")
    return dataframe

# Get the latest files
Array_EXT = "Export_GTT_Tea2Array_*.csv"
array_csv_files = sorted([file
                          for path, subdir, files in os.walk(PATH)
                          for file in glob(os.path.join(path, Array_EXT))])

latest_Tea2Array = array_csv_files[(len(array_csv_files) - (58 + 25)):
                                   (len(array_csv_files) - 58)]

Tea2Array_latest = t_createdd(latest_Tea2Array)

# Keep only the required columns
Tea2Array = Tea2Array_latest[['Parameter_Id', 'Reading_Id', 'X', 'Value']]

P1MI3 = Tea2Array.loc[Tea2Array['Parameter_Id'] == 168566]
P1MI3 = P1MI3.compute()

P1MJC_main = Tea2Array.loc[Tea2Array['Parameter_Id'] == 168577]
P1MJC_old = P1MJC_main.compute()
P1MI3 = P1MI3.compute() and P1MJC_old = P1MJC_main.compute() take around 10 and 11 minutes respectively to execute. Is there any way to reduce the time?
I would encourage you to consider, with reference to the Dask documentation, why you would expect the process to be any faster than using Pandas alone.
Consider:
file access may happen from several threads, but you only have one disc interface as a bottleneck, and it likely performs much better reading sequentially than trying to read several files in parallel
reading CSVs is CPU-heavy and needs the Python GIL, so the multiple threads will not actually be running in parallel
when you compute, you materialise the whole dataframe. It is true that you appear to be selecting a single row in each case, but Dask has no way to know in which file/part it is.
you call compute twice, but could have combined them: Dask works hard to evict data from memory that is not currently needed by any computation, so you do double the work. By calling compute once on both outputs, you could roughly halve the time.
Further remarks:
obviously you would do much better if you knew which partition contained what
you can get around the GIL using processes, e.g., Dask's distributed scheduler
if you only need certain columns, do not bother to load everything and then subselect, include those columns right in the read_csv function, saving a lot of time and memory (true for pandas or Dask).
To compute both lazy things at once:
dask.compute(P1MI3, P1MJC_main)
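Putting the last two points together, a rough sketch (usecols is passed through to pandas; the paths and column names are taken from the question):

import dask
import dask.dataframe as dd

# Load only the required columns, then materialise both selections
# in a single compute call.
df = dd.read_csv(latest_Tea2Array, sep=chr(1), encoding="utf-16",
                 usecols=['Parameter_Id', 'Reading_Id', 'X', 'Value'])
P1MI3_lazy = df.loc[df['Parameter_Id'] == 168566]
P1MJC_lazy = df.loc[df['Parameter_Id'] == 168577]
P1MI3, P1MJC_old = dask.compute(P1MI3_lazy, P1MJC_lazy)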

How to design spark program to process 300 most recent files?

Situation
New small files come in periodically. I need to do a calculation on the most recent 300 files. So basically there is a window moving forward: the size of the window is 300 files, and I need to do the calculation on that window.
But something very important to know is that this is not Spark Streaming computing, because in Spark Streaming the unit/scope of a window is time, while here the unit/scope is the number of files.
Solution1
I will maintain a dict with a maximum size of 300. Each time a new file comes in, I turn it into a Spark data frame and put it into the dict. Then I make sure the oldest file in the dict is popped out if the length of the dict is over 300.
After this I will merge all the data frames in the dict into a bigger one and do the calculation.
The above process runs in a loop: every time a new file comes in we go through the loop.
pseudo code for solution 1
for file in file_list:
    data_frame = get_data_frame(file)
    my_dict[timestamp] = data_frame

    for timestamp in my_dict.keys():
        if timestamp older than 24 hours:
            # not only unpersist, but also delete to make sure the memory is released
            my_dict[timestamp].unpersist()
            del my_dict[timestamp]

    # pop one data frame from the dict
    big_data_frame = my_dict.popitem()

    for timestamp in my_dict.keys():
        df = my_dict.get(timestamp)
        big_data_frame = big_data_frame.unionAll(df)

    # Then we run SQL on the big_data_frame to get report
problem for solution 1
It always hits OutOfMemoryError or "GC overhead limit exceeded".
question
Do you see anything inappropriate in the solution 1?
Is there any better solution?
Is this the right kind of situation in which to use Spark?
One observation: you probably don't want to use popitem, since the keys of a Python dictionary are not kept sorted, so you can't guarantee that you're popping the earliest item. Instead I would recreate the dictionary each time from a sorted list of timestamps. Assuming your filenames are just timestamps:
my_dict = {file:get_dataframe(file) for file in sorted(file_list)[-300:]}
Not sure if this will fix your problem, can you paste the full stacktrace of your error into the question? It's possible that your problem is happening in the Spark merge/join (not included in your question).
My suggestion here is streaming, but not with respect to time: you will still have some window and sliding interval set, but say it is 60 seconds.
So every 60 seconds you get a DStream of the file contents, in 'x' partitions. These 'x' partitions represent the files you drop onto HDFS or the file system.
This way you can keep track of how many files/partitions have been read; if there are fewer than 300, wait until the count reaches 300. Once it hits 300 you can start processing.
If it's possible to keep track of the most recent files, or to just discover them once in a while, then I'd suggest doing something like
sc.textFile(','.join(files))
or, if it's possible to identify a specific pattern that matches those 300 files, then
sc.textFile("*pattern*")
And it's even possible to have comma-separated patterns, but it might happen that some files which match more than one pattern would be read more than once.
