How to design a Spark program to process the 300 most recent files? - apache-spark

Situation
New small files come in periodically, and I need to run a calculation over the most recent 300 files. So basically there is a window moving forward: the size of the window is 300 files, and I need to do the calculation on that window.
Something very important to note is that this is not Spark Streaming computation, because in Spark Streaming the unit/scope of a window is time. Here the unit/scope is the number of files.
Solution 1
I will maintain a dict of size 300. Each time a new file comes in, I turn it into a Spark DataFrame and put it into the dict, then make sure the oldest entry is popped out if the length of the dict exceeds 300.
After this I merge all the DataFrames in the dict into a bigger one and run the calculation on it.
The above process runs in a loop: every time a new file comes in, we go through the loop.
Pseudo code for solution 1
import time

for file in file_list:
    data_frame = get_data_frame(file)
    timestamp = get_timestamp(file)        # pseudo helper: the file's epoch-second timestamp
    my_dict[timestamp] = data_frame

    # not only unpersist, but also delete, to make sure the memory is released
    for ts in list(my_dict.keys()):        # copy the keys because we delete while iterating
        if time.time() - ts > 24 * 3600:   # older than 24 hours
            my_dict[ts].unpersist()
            del my_dict[ts]

    # pop one data frame from the dict as the starting point
    _, big_data_frame = my_dict.popitem()  # popitem() returns a (key, value) pair
    for ts in my_dict:
        big_data_frame = big_data_frame.unionAll(my_dict[ts])

    # then we run SQL on big_data_frame to get the report
Problem with solution 1
It always hits OutOfMemory or "GC overhead limit exceeded" errors.
Questions
Do you see anything inappropriate in solution 1?
Is there a better solution?
Is this the right kind of situation in which to use Spark?

One observation: you probably don't want to use popitem, because the keys of a Python dictionary are not kept in sorted order, so you can't guarantee that you're popping the earliest item. Instead I would recreate the dictionary each time from a sorted list of timestamps. Assuming your filenames are just timestamps:
my_dict = {file: get_dataframe(file) for file in sorted(file_list)[-300:]}
I'm not sure whether this will fix your problem; can you paste the full stack trace of your error into the question? It's possible that your problem is happening in the Spark merge/join (not included in your question).
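A minimal sketch of that rebuild-the-dict idea, assuming the filenames sort chronologically and reusing get_dataframe() and unionAll() from above:
from functools import reduce

recent_files = sorted(file_list)[-300:]                   # the 300 newest files
my_dict = {f: get_dataframe(f) for f in recent_files}     # rebuild the dict on every pass
big_data_frame = reduce(lambda a, b: a.unionAll(b), my_dict.values())
# then run SQL / aggregations on big_data_frame as before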

My suggestion here is streaming, but not with respect to time. I mean you will still have a window and sliding interval set, but say it is 60 seconds.
So every 60 seconds you get a DStream of file contents, in 'x' partitions. These 'x' partitions represent the files you drop onto HDFS or the file system.
This way you can keep track of how many files/partitions have been read; if there are fewer than 300, wait until they reach 300. After the count hits 300 you can start processing (a rough sketch follows).
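A very rough PySpark Streaming sketch of this idea, assuming an existing SparkContext sc, files landing in an HDFS directory, and a hypothetical process_window() that runs the report; the partition count is used as a proxy for the file count, as suggested above:
from pyspark.streaming import StreamingContext

ssc = StreamingContext(sc, 60)                      # 60-second batch interval
ssc.remember(24 * 3600)                             # keep old batch RDDs around long enough to union them
seen = {"files": 0, "rdds": []}                     # driver-side state across batches

def accumulate(rdd):
    if not rdd.isEmpty():
        seen["files"] += rdd.getNumPartitions()     # rough proxy for the number of new files
        seen["rdds"].append(rdd)
    if seen["files"] >= 300:                        # enough files: process the window
        process_window(sc.union(seen["rdds"]))      # hypothetical reporting step
        seen["files"], seen["rdds"] = 0, []

ssc.textFileStream("hdfs:///data/incoming").foreachRDD(accumulate)
ssc.start()
ssc.awaitTermination()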

If it's possible to keep track of the most recent files, or to just discover them once in a while, then I'd suggest doing something like
sc.textFile(','.join(files));
or, if it's possible to identify a specific pattern that matches those 300 files, then
sc.textFile("*pattern*");
It's even possible to use comma-separated patterns, but it may happen that files matching more than one pattern are read more than once.
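For example, a hedged sketch of the first option in Python, picking the 300 newest files by modification time (the directory path is illustrative):
import glob
import os

files = sorted(glob.glob("/data/incoming/*.csv"), key=os.path.getmtime)[-300:]
rdd = sc.textFile(",".join(files))    # Spark accepts a comma-separated list of paths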

Related

Best Way to Save Sensor Data, Split Every x Megabytes in Python

I'm saving sensor data at 64 samples per second into a CSV file. The file is about 150 MB at the end of 24 hours. It takes a bit longer than I'd like to process, and I need to do some processing in real time.
value = str(milivolts)
logFile.write(str(datet) + ',' + value + "\n")
So I end up with single lines containing a date and millivolts, up to 150 MB. At the end of 24 hours it makes a new file and starts saving to it.
I'd like to know if there is a better way to do this. I have searched but can't find any good information on what compression to use while saving sensor data. Is there a way to compress while streaming/saving? What format is best for this?
While saving the sensor data, is there an easy way to split it into files of x megabytes without data gaps?
Thanks for any input.
I'd like to know if there is a better way to do this.
One of the simplest ways is to use a logging framework: it will allow you to configure which compressor to use (if any), the approximate size of a file, and when to rotate logs. You could start with this question. Try experimenting with several different compressors to see if the speed/size trade-off is OK for your app.
While saving the sensor data, is there an easy way to split it into x megabyte files without data gaps?
A logging framework would do this for you based on the configuration. You could combine several different options: have fixed-size logs and rotate at least once a day, for example.
Generally this is accurate up to the size of a logged line, so if the data is split into lines of reasonable size, it makes life super easy: one line ends up in one file, the next is written into a new file.
Files also rotate, so you can have order of the data encoded in the file names:
raw_data_<date>.gz
raw_data_<date>.gz.1
raw_data_<date>.gz.2
In the meta code it will look like this:
# Parse where to save data, should we compress data,
# what's the log pattern, how to rotate logs etc
loadLogConfig(...)
# any compression, rotation, flushing etc happens here
# but we don't care, and just write to file
logger.trace(data)
# on shutdown, save any temporary buffer to the files
logger.flush()
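As a concrete, hedged example with Python's standard logging module (file name, size, and format are illustrative): rotate at a fixed size and gzip each rotated file.
import gzip
import logging
import logging.handlers
import os
import shutil

def gzip_namer(name):
    return name + ".gz"                              # rotated files get a .gz suffix

def gzip_rotator(source, dest):
    # compress the just-rotated file and drop the uncompressed original
    with open(source, "rb") as f_in, gzip.open(dest, "wb") as f_out:
        shutil.copyfileobj(f_in, f_out)
    os.remove(source)

handler = logging.handlers.RotatingFileHandler(
    "raw_data.log", maxBytes=10 * 1024 * 1024, backupCount=100)   # ~10 MB per file
handler.namer = gzip_namer
handler.rotator = gzip_rotator
handler.setFormatter(logging.Formatter("%(asctime)s,%(message)s"))

logger = logging.getLogger("sensor")
logger.setLevel(logging.INFO)
logger.addHandler(handler)

millivolts = 3.141                     # example reading
logger.info("%s", millivolts)          # one line per sample: "<timestamp>,<value>"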

How to reduce the time taken to convert a dask dataframe to a pandas dataframe

I have a function that reads large CSV files using a dask dataframe and then converts it to a pandas dataframe, which takes quite a lot of time. The code is:
def t_createdd(Path):
    dataframe = dd.read_csv(Path, sep=chr(1), encoding="utf-16")
    return dataframe

# Get the latest file
Array_EXT = "Export_GTT_Tea2Array_*.csv"
array_csv_files = sorted([file
                          for path, subdir, files in os.walk(PATH)
                          for file in glob(os.path.join(path, Array_EXT))])
latest_Tea2Array = array_csv_files[(len(array_csv_files) - (58 + 25)):
                                   (len(array_csv_files) - 58)]

Tea2Array_latest = t_createdd(latest_Tea2Array)

# keep only the required columns
Tea2Array = Tea2Array_latest[['Parameter_Id', 'Reading_Id', 'X', 'Value']]

P1MI3 = Tea2Array.loc[Tea2Array['parameter_id'] == 168566]
P1MI3 = P1MI3.compute()

P1MJC_main = Tea2Array.loc[Tea2Array['parameter_id'] == 168577]
P1MJC_old = P1MJC_main.compute()
P1MI3 = P1MI3.compute() and P1MJC_old = P1MJC_main.compute() take around 10 and 11 minutes respectively to execute. Is there any way to reduce the time?
I would encourage you to consider, with reference to the Dask documentation, why you would expect the process to be any faster than using Pandas alone.
Consider:
file access may happen from several threads, but you only have one disk interface; that bottleneck likely performs much better reading sequentially than trying to read several files in parallel
reading CSVs is CPU-heavy and needs the Python GIL, so the multiple threads will not actually be running in parallel
when you compute, you materialise the whole dataframe. It is true that you appear to be selecting a single row in each case, but Dask has no way to know in which file/part it is.
you call compute twice, but could have combined them: Dask works hard to evict data from memory that is not currently needed by any computation, so you do double the work. By calling compute on both outputs at once, you would halve the time.
Further remarks:
obviously you would do much better if you knew which partition contained what
you can get around the GIL by using processes, e.g., Dask's distributed scheduler
if you only need certain columns, do not bother to load everything and then sub-select; include those columns right in the read_csv call (usecols=), saving a lot of time and memory (true for pandas or Dask).
To compute both lazy things at once:
dask.compute(P1MI3, P1MJC_main)
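A hedged sketch combining those two suggestions, reusing the names from the question (and assuming the real column name matches the filter, which the question writes both as 'Parameter_Id' and 'parameter_id'):
import dask
import dask.dataframe as dd

# read only the needed columns up front
Tea2Array = dd.read_csv(latest_Tea2Array, sep=chr(1), encoding="utf-16",
                        usecols=['Parameter_Id', 'Reading_Id', 'X', 'Value'])

p1mi3 = Tea2Array.loc[Tea2Array['Parameter_Id'] == 168566]
p1mjc = Tea2Array.loc[Tea2Array['Parameter_Id'] == 168577]

# a single pass over the files instead of two
P1MI3, P1MJC_old = dask.compute(p1mi3, p1mjc)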

Array of RDDs? One RDD for a time window

I have a question about bucketing time events with Spark, and the best way to handle it.
So I'm ingesting a very large dataset, with specific start/stop times for each event.
For instance, I might load in three weeks of data. Within that main time window, I divide it into buckets of smaller intervals: 3 weeks divided into 24-hour time buckets, with an array that looks like [(start_epoch, stop_epoch), (start_epoch, stop_epoch), ...]
Within each time bucket I map/reduce my events down into a smaller set.
I'd like to keep the events split up by the time bucket they belong to.
What is the best way to handle this? Each map/reduce operation results in a new RDD so I'm effectively left with a large array of RDDs.
Is it "safe" to just loop over that array from the driver, and then do other transformations/actions on each RDD to get results each time window?
Thanks!
I would suggest thinking about it a bit differently:
You want to read your data and then keyBy time rounded to hour resolution. Then you can reduceByKey (or combineByKey if you want another type in the output).
When working with Spark it is not necessary to collect items into arrays by some key (it is even an antipattern).
RDD[Event] -> keyBy ts rounded to hour -> RDD[(hour, event)] -> reduceByKey (i.e. by hour) -> RDD[(hour, aggregated view of all events in this hour)]
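A minimal PySpark sketch of that pipeline, assuming each event is a tuple like (start_epoch, value) and that summing is the aggregation you want (both are illustrative):
HOUR = 3600

hourly = (events_rdd
          .keyBy(lambda event: (event[0] // HOUR) * HOUR)   # round the start time down to the hour
          .mapValues(lambda event: event[1])                # keep just the value to aggregate
          .reduceByKey(lambda a, b: a + b))                 # RDD[(hour, aggregated view)]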

Multiple windows of different durations in Spark Streaming application

I would like to process a real-time stream of data (from Kafka) using Spark Streaming. I need to compute various stats from the incoming stream and they need to be computed for windows of varying durations. For example, I might need to compute the avg value of a stat 'A' for the last 5 mins while at the same time compute the median for stat 'B' for the last 1 hour.
In this case, what's the recommended approach to using Spark Streaming? Below are a few options I could think of:
(i) Have a single DStream from Kafka and create multiple DStreams from it using the window() method. For each of these resulting DStreams, the windowDuration would be set to a different value as required, e.g.:
// pseudo-code
val streamA = kafkaDStream.window(Minutes(5), Minutes(1))
val streamB = kafkaDStream.window(Hours(1), Minutes(10))
(ii) Run separate Spark Streaming apps - one for each stat
Questions
To me (i) seems like a more efficient approach. However, I have a couple of doubts regarding that:
How would streamA and streamB be represented in the underlying data structure?
Would they share data, since they originate from the same KafkaDStream, or would there be duplication of data?
Also, are there more efficient methods to handle such a use case?
Thanks in advance
Your (i) streams look sensible, they will share data, and you can look at WindowedDStream to get an idea of the underlying representation. Note that your streams are of course lazy, so only the batches being computed upon are in the system at any given time.
Since the state you have to maintain for the computation of an average is small (2 numbers), you should be fine. I'm more worried about the median (which requires a pair of heaps).
One thing you haven't made clear, though, is whether you really need the update component of your aggregation that is implied by the windowing operation: your streamA maintains the last 5 minutes of data, updated every minute, and your streamB maintains the last hour, updated every 10 minutes.
If you don't need that freshness, not requiring it will of course minimize the amount of data in the system. You can have a streamA with a batch interval of 5 minutes and a streamB derived from it (with window(Hours(1)), since 60 is a multiple of 5).
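In PySpark terms, a rough sketch of that last suggestion, assuming a kafkaDStream like the one in the question already exists (the Python window API takes durations in seconds):
streamA = kafkaDStream                            # 5-minute batch interval: "last 5 minutes" comes for free
streamB = kafkaDStream.window(60 * 60, 5 * 60)    # last hour, sliding every 5 minutes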

Fastest way to shuffle lines in a file in Linux

I want to shuffle a large file with millions of lines of strings in Linux. I tried 'sort -R' but it is very slow (it takes around 50 minutes for a 16M file). Is there a faster utility that I can use instead?
Use shuf instead of sort -R (see the man page).
The slowness of sort -R is probably due to it hashing every line; shuf just does a random permutation, so it doesn't have that problem.
(This was suggested in a comment but for some reason not written up as an answer by anyone.)
The 50 minutes is not caused by the actual mechanics of sorting, based on your description. The time is likely spent waiting on /dev/random to generate enough entropy.
One approach is to use an external source of random data (http://random.org, for example) along with a variation on a Schwartzian Transform. The Schwartzian Transform turns the data to be sorted into "enriched" data with the sort key embedded. The data is sorted using the key and then the key is discarded.
To apply this to your problem:
generate a text file with random numbers, 1 per line, with the same number of lines as the file to be sorted. This can be done at any time, run in the background, run on a different server, downloaded from random.org, etc. The point is that this randomness is not generated while you are trying to sort.
create an enriched version of the file using paste:
paste random_number_file.txt string_data.txt > tmp_string_data.txt
sort this file:
sort tmp_string_data.txt > sorted_tmp_string_data.txt
remove the random data:
cut -f2- sorted_tmp_string_data.txt > random_string_data.txt
This is the basic idea. I tried it and it does work, but I don't have 16 million lines of text or 16 million lines of random numbers. You may want to pipeline some of those steps instead of saving it all to disk.
You may try my tool: HugeFileProcessor. It's capable of shuffling files of hundreds of GBs in a reasonable time.
Here are the details of the shuffling implementation. It requires specifying batchSize, the number of lines to keep in RAM when writing to the output. The more the better (unless you are out of RAM), because the total shuffling time is roughly (number of lines in sourceFile) / batchSize * (time to fully read sourceFile). Please note that the program shuffles the whole file, not on a per-batch basis.
The algorithm is as follows.
Count the lines in sourceFile. This is done simply by reading the whole file line by line. (See some comparisons here.) It also gives a measurement of how long it takes to read the whole file once, so we can estimate how long a complete shuffle will take, because it requires Ceil(linesCount / batchSize) complete file reads.
As we now know the total linesCount, we can create an index array of size linesCount and shuffle it using Fisher–Yates (called orderArray in the code). This gives us the order in which we want the lines to appear in the shuffled file. Note that this is a global order over the whole file, not per batch or chunk.
Now the actual code. We need to get all lines from sourceFile in the order we just computed, but we can't read the whole file into memory. So we split the task.
We go through sourceFile reading all lines, but storing in memory only those lines whose index falls into the first batchSize entries of orderArray. Once we have all of these lines, we write them into outFile in the required order, and that is batchSize/linesCount of the work done.
Next we repeat the whole process again and again, taking the next parts of orderArray and reading sourceFile from start to end for each part. Eventually the whole orderArray is processed and we are done.
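A minimal Python sketch of that batched approach (the real tool is a separate program; file names and batch size here are illustrative):
import random

def shuffle_file(source_path, dest_path, batch_size):
    with open(source_path) as f:
        lines_count = sum(1 for _ in f)              # pass 1: count the lines

    order = list(range(lines_count))                 # output position -> source line index
    random.shuffle(order)                            # Fisher-Yates shuffle of the index array

    with open(dest_path, "w") as out:
        for start in range(0, lines_count, batch_size):
            wanted = order[start:start + batch_size] # next slice of the global order
            needed = set(wanted)
            batch = {}
            with open(source_path) as f:             # re-read the source start to end
                for i, line in enumerate(f):
                    if i in needed:
                        batch[i] = line              # keep only lines belonging to this slice
            for i in wanted:                         # emit them in shuffled order
                out.write(batch[i])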
Why does it work?
Because all we do is read the source file from start to end. There are no seeks forward or backward, which is what HDDs like. The file gets read in chunks according to internal HDD buffers, FS blocks, CPU cache, etc., and everything is read sequentially.
Some numbers
On my machine (Core i5, 16 GB RAM, Win8.1, HDD Toshiba DT01ACA200 2 TB, NTFS) I was able to shuffle a file of 132 GB (84,000,000 lines) in around 5 hours using a batchSize of 3,500,000. With a batchSize of 2,000,000 it took around 8 hours. Reading speed was around 118,000 lines per second.
