Apache Beam Combine vs GroupByKey - apache-spark

So, I'm facing this seems-to-be-classic problem: extracting time-framed toppers from an unbounded stream,
using Apache Beam (with Flink as the engine).
Assuming sites+hits tuples input:
{"aaa.com", 1001}, {"bbb.com", 21}, {"aaa.com", 1002}, {"ccc.com", 3001}, {"bbb.com", 22} ....
(Expected rate: 100K+ entries per hour.)
The goal is to output the sites that account for more than 1% of the total hits in each 1-hour timeframe,
i.e. for a 1-hour fixed window, pick the sites whose summed hits are >1% of the window's total hits.
So first, sum by key:
{"aaa.com", 2003}, {"bbb.com", 43}, {"ccc.com", 3001} ....
And finally output the >1%:
{"aaa.com"}, {"ccc.com"}
Alternatives:
1) Group + ParDo:
Fixed 1-hour time windowing, group all elements, followed by a ParDo over each window's iterable that calculates the total and outputs the >1% sites.
The cons seem to be that the whole aggregation is done in a single thread, and it also seems to require two passes: one to get the sum and one to pick the >1% sites.
2) GroupByKey + Combine:
Fixed 1-hour time windowing, GroupByKey with key=site, then a Combine with a custom accumulator that sums the hits per key.
Although the Combine option (#2) seems more suitable,
I'm missing how to get the total sum per 1-hour window, which is needed to pick the >1% elements:
can the same window be used for two combines, one per key and one for the total hits in the window?
And what is the best approach to combine both results to make the >1% call per element?
Thanks.

You can do this via side inputs. For instance, you'd do something like this (code in Python, but answer for Java is similar):
input_data = ....  # ("aaa.com", 1001), ("bbb.com", 21), ("aaa.com", 1002), ("ccc.com", 3001), ("bbb.com", 22) ....

total_per_key = input_data | beam.CombinePerKey(sum)

global_sum_per_window = beam.pvalue.AsSingleton(
    input_data
    | beam.Values()
    | beam.CombineGlobally(sum).without_defaults())

def find_more_than_1pct(elem, global_sum):
    key, value = elem
    if value > global_sum * 0.01:
        yield elem

# pass the singleton side input so each window's total is available inside the function
over_1_pct_keys = total_per_key | beam.FlatMap(find_more_than_1pct, global_sum_per_window)
In this case, the global_sum_per_window PCollection will have one value for each window, and the total_per_key will have one value per-key-per-window.
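To answer the windowing part of the question: as long as the 1-hour fixed window is applied to input_data before both combines, each aggregation is computed per window, and the side input is matched to the corresponding main-input window. A minimal sketch of that step, where raw_events is just a placeholder name for the unwindowed PCollection of (site, hits) tuples:

import apache_beam as beam
from apache_beam.transforms import window

# window once; CombinePerKey and CombineGlobally above then both
# aggregate per 1-hour window instead of over the whole stream
input_data = raw_events | beam.WindowInto(window.FixedWindows(60 * 60))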
Let me know if that works!

Related

How to lower RAM usage using xarray open_mfdataset and the quantile function

I am trying to load multiple years of daily data in nc files (one nc file per year). A single nc file has a dimension of 365 (days) * 720 (lat) * 1440 (lon). All the nc files are in the "data" folder.
import xarray as xr

ds = xr.open_mfdataset('data/*.nc',
                       chunks={'latitude': 10, 'longitude': 10})

# I need the following line (time: -1) in order to do quantile, or it throws a ValueError:
# ValueError: dimension time on 0th function argument to apply_ufunc with dask='parallelized'
# consists of multiple chunks, but is also a core dimension. To fix, either rechunk into a single
# dask array chunk along this dimension, i.e., ``.chunk(time: -1)``, or pass ``allow_rechunk=True``
# in ``dask_gufunc_kwargs`` but beware that this may significantly increase memory usage.
ds = ds.chunk({'time': -1})

# Perform the quantile "computation" (looks more like a reference to the computation, as it's fast)
ds_qt = ds.quantile(0.975, dim="time")

# Verify the shape of the loaded ds
print(ds)
# This shows the expected "concatenation" of the nc files.

# Get a sample for a given location to test the algorithm
print(len(ds.sel(lon=35.86, lat=14.375, method='nearest')['variable'].values))
print(ds_qt.sel(lon=35.86, lat=14.375, method='nearest')['variable'].values)
The result is correct. My issue is memory usage. I thought that open_mfdataset, which uses Dask under the hood, would solve this. However, loading "just" 2 years of nc files uses around 8 GB of virtual RAM, and using 10 years of data uses my entire virtual RAM (around 32 GB).
Am I missing something that would let me take a given percentile value across a Dask array (I would need 30 nc files)? I apparently have to apply chunk({'time': -1}) to the dataset to be able to use the quantile function; is this what defeats the RAM savings?
This may help somebody in the future: here is the solution I am implementing, even though it is not optimized. I basically break the dataset into slices based on geolocation and paste them back together to create the output file.
ds = xr.open_mfdataset('data/*.nc')

step = 10
min_lat = -90
max_lat = min_lat + step
output_ds = None

while max_lat <= 90:
    cropped_ds = ds.sel(lat=slice(min_lat, max_lat))
    cropped_ds = cropped_ds.chunk({'time': -1})
    cropped_ds_quantile = cropped_ds.quantile(0.975, dim="time")

    if not output_ds:
        output_ds = cropped_ds_quantile
    else:
        output_ds = xr.merge([output_ds, cropped_ds_quantile])

    min_lat += step
    max_lat += step

output_ds.to_netcdf('output.nc')
It's not great, but it limits RAM usage to manageable levels. I am still open to a cleaner/faster solution if it exists (likely).
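A possible variation on the same workaround (my own sketch, not from the answer above): slice by integer position so the latitude bands don't overlap, collect the lazy per-band quantiles in a list, and concatenate once along lat instead of calling xr.merge repeatedly. The dimension name lat and the band size are assumptions here:

import xarray as xr

ds = xr.open_mfdataset('data/*.nc')

n_lat = ds.sizes['lat']
step = 72  # number of latitude rows per band; tune to the available memory
bands = []
for start in range(0, n_lat, step):
    # time is rechunked to a single chunk only within this latitude band
    band = ds.isel(lat=slice(start, start + step)).chunk({'time': -1})
    bands.append(band.quantile(0.975, dim="time"))

# one concatenation along lat instead of repeated xr.merge calls
xr.concat(bands, dim='lat').to_netcdf('output.nc')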

How to process the data returned from a function (Python 3.7)

Background:
My question should be relatively easy; however, I am not able to figure it out.
I have written a function regarding queueing theory and it will be used for ambulance service planning. For example, how many calls for service can I expect in a given time frame.
The function takes two parameters: a starting value for the number of ambulances in my system, starting at 0 and ending at 100 ambulances. This will show the probability of zero calls for service, one call for service, two calls for service... up to 100 calls for service. The second parameter is an arrival rate, which is the historical arrival rate in my system.
The function runs and prints out the result to my screen. I have checked the math and it appears to be correct.
This is Python 3.7 with the Anaconda distribution.
My question is this:
I would like to process this data even further but I don’t know how to capture it and do more math. For example, I would like to take this list and accumulate the probability values. With an arrival rate of five, there is a cumulative probability of 61.56% of at least five calls for service, etc.
A second example of how I would like to process this data is to format it as percentages and write it out to a text file.
A third example would be to process the cumulative probabilities and exclude any values higher than the 99% cumulative value (because these vanish into extremely small numbers).
A fourth example would be to create a bar chart showing the probability of n calls for service.
These are some of the things I want to do with the queueing theory calculations. And there are a lot more. I am planning on writing a larger application. But I am stuck at this point. The function writes an output into my Python 3.7 console. How do I “capture” that output as an object or something and perform other processing on the data?
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
import math
import csv

def probability_x(start_value = 0, arrival_rate = 0):
    probability_arrivals = []
    while start_value <= 100:
        probability_arrivals = [start_value, math.pow(arrival_rate, start_value) * math.pow(math.e, -arrival_rate) / math.factorial(start_value)]
        print(probability_arrivals)
        start_value = start_value + 1
    return probability_arrivals

#probability_x(arrival_rate = 5, x = 5)
#The code written above prints to the console, but my goal is to take the returned values and make other calculations.
#How do I 'capture' this data for further processing is where I need help (for example, bar plots, cumulative frequency, etc )

#failure. TypeError: writerows() argument must be iterable.
with open('ExpectedProbability.csv', 'w') as writeFile:
    writer = csv.writer(writeFile)
    for value in probability_x(arrival_rate = 5):
        writer.writerows(value)
writeFile.close()

#Failure. Why does it return 2. Yes there are two columns but I was expecting 101 as the length because that is the end of my loop.
print(len(probability_x(arrival_rate = 5)))
The problem is, when you write
probability_arrivals = [start_value, math.pow(arrival_rate, start_value) * math.pow(math.e, -arrival_rate) / math.factorial(start_value)]
You're overwriting the previous contents of probability_arrivals. Everything that it held previously is lost.
Instead of using = to reassign probability_arrivals, you want to append another entry to the list:
probability_arrivals.append([start_value, math.pow(arrival_rate, start_value) * math.pow(math.e, -arrival_rate) / math.factorial(start_value)])
I'll also note, your while loop can be improved. You're basically just looping over start_value until it reaches a certain value. A for loop would be more appropriate here:
for s in range(start_value, 101):  # the end value is exclusive, so it's 101, not 100
    entry = [s, math.pow(arrival_rate, s) * math.pow(math.e, -arrival_rate) / math.factorial(s)]
    probability_arrivals.append(entry)  # append, as shown above, instead of reassigning
    print(entry)
Now you don't need to manually worry about incrementing the counter.
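Putting both fixes together, here is a minimal sketch of how you could capture the returned list and keep processing it; the cumulative column and the CSV header are my additions for illustration, not part of the original code:

import csv
import math

def probability_x(start_value=0, arrival_rate=0):
    probability_arrivals = []
    for s in range(start_value, 101):
        p = math.pow(arrival_rate, s) * math.exp(-arrival_rate) / math.factorial(s)
        probability_arrivals.append([s, p])
    return probability_arrivals

results = probability_x(arrival_rate=5)   # a plain list of [n, P(n)] pairs you can reuse

# accumulate the probabilities into a running total
cumulative = []
running = 0.0
for n, p in results:
    running += p
    cumulative.append([n, p, running])

# write it out; writerows() wants a list of rows, which this now is
with open('ExpectedProbability.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['calls', 'probability', 'cumulative'])
    writer.writerows(cumulative)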

Double for loop on pandas groupby

I'm trying to do a pairwise comparison of grouped dataframes based on some key, but I'm having a hard time with pandas groupby in a double for loop, since it is very slow. Is there any way I can optimize this so that I don't have to recompute the groups every time I run the outer loop?
I tried using the same groupby variable, but that doesn't seem to solve the recomputation problem.
mygroups = mydf.groupby('mykey')

for key1, subdf1 in mygroups:
    for key2, subdf2 in mygroups:
        if key2 <= key1:
            continue
        do_some_work(subdf1, subdf2)
subdf2 seems to start recomputing from the first key rather than from the next key after key1. In my use case I expected key2 to be the next key in the iteration after key1, and so on. How can I get that behavior without having to recompute?
Your observation is correct: the inner loop iterates over the whole dataframe, not just the records after key1.
The easiest way for smaller DataFrames
I would create a list with the groups first and then iterate over this list.
This is what I would do:
mygroups_list = [(key, subdf) for (key, subdf) in mydf.groupby('mykey')]

while len(mygroups_list) > 0:
    key1, subdf1 = mygroups_list.pop(0)
    for key2, subdf2 in mygroups_list:
        do_some_work(subdf1, subdf2)
You just have to make sure the groups are really sorted, but AFAIK this is done by the .groupby method anyway. If you are not sure, you can add a mygroups_list.sort(key=lambda tup: tup[0]) before the loop.
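If you don't need the pop-based bookkeeping, the same "each pair exactly once" iteration can also be written with itertools.combinations; this is just a compact variant of the list approach above (it materializes the groups in the same way), reusing the question's do_some_work:

from itertools import combinations

# groupby yields (key, sub-dataframe) pairs sorted by key, so within each
# pair produced by combinations you always have key1 < key2
for (key1, subdf1), (key2, subdf2) in combinations(mydf.groupby('mykey'), 2):
    do_some_work(subdf1, subdf2)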
If size does matter
For larger dataframes you can avoid materializing all the sub-dataframes at once and defer that until you actually need the data, like this:
# create the groupby object as usual
group_by = mydf.groupby('mykey')

# now fetch the row indices from the groupby object;
# because this is actually a dictionary, extract the
# keys from it and sort them
mygroups_dict = group_by.indices
mygroups_labels = list(mygroups_dict)
mygroups_labels.sort()

# now use a similar approach as above
while len(mygroups_labels) > 0:
    key1 = mygroups_labels.pop(0)
    # but instead of creating the sub dataframes before you
    # enter the loop, just do it within the loop and use the
    # row indices the groupby object evaluated
    subdf1 = mydf.iloc[mygroups_dict[key1]]
    for key2 in mygroups_labels:
        subdf2 = mydf.iloc[mygroups_dict[key2]]
        do_some_work(subdf1, subdf2)
That should be much less memory intensive, because you only need to store the row indices instead of the whole rows throughout the whole processing time.
For the following example setup:
import numpy as np
import pandas as pd

def do_some_work(subdf1, subdf2):
    print('{} --> {} (len={}/{})'.format(subdf1['mykey'].iat[0], subdf2['mykey'].iat[0], len(subdf1), len(subdf2)))

mydf = pd.DataFrame(dict(mykey=np.random.randint(5, size=100), col=range(1, 101)))
This outputs something like the following (of course the len info will look different from run to run because of the randint). Note the group labels to the left and right of the arrow: on the right side you have key2, which is always > key1:
0 --> 1 (len=21/16)
0 --> 2 (len=21/21)
0 --> 3 (len=21/20)
0 --> 4 (len=21/22)
1 --> 2 (len=16/21)
1 --> 3 (len=16/20)
1 --> 4 (len=16/22)
2 --> 3 (len=21/20)
2 --> 4 (len=21/22)
3 --> 4 (len=20/22)

Spark - Optimize calculation time over a data frame, by using groupBy() instead of filter()

I have a data frame which contains different columns ('features').
My goal is to calculate statistical measures for column X:
mean, standard deviation, variance,
but to calculate all of those grouped by the value of column Y.
E.g. get all rows where Y = 1 and calculate mean, stddev and var for them,
then do the same for all rows where Y = 2, and so on.
My current implementation is:
print "For CONGESTION_FLAG = 0:"
log_df.filter(log_df[flag_col] == 0).select([mean(size_col), stddev(size_col),
pow(stddev(size_col), 2)]).show(20, False)
print "For CONGESTION_FLAG = 1:"
log_df.filter(log_df[flag_col] == 1).select([mean(size_col), stddev(size_col),
pow(stddev(size_col), 2)]).show(20, False)
print "For CONGESTION_FLAG = 2:"
log_df.filter(log_df[flag_col] == 2).select([mean(size_col), stddev(size_col),
pow(stddev(size_col), 2)]).show(20, False)
I was told that the filter() approach is wasteful in terms of computation time, and I received advice that, to make these calculations run faster (I'm running this on a 1 GB data file), it would be better to use the groupBy() method.
Can someone please help me transform those lines to do the same calculations using groupBy instead?
I got mixed up with the syntax and didn't manage to do it correctly.
Thanks.
Filter by itself is not wasteful. The problem is that you are calling it multiple times (once for each value), meaning you are scanning the data 3 times. The operation you are describing is best achieved with groupBy, which basically aggregates the data per value of the grouping column.
You could do something like this:
agg_df = log_df.groupBy(flag_col).agg(mean(size_col).alias("mean"), stddev(size_col).alias("stddev"), pow(stddev(size_col),2).alias("pow"))
You might also get better performance by calculating stddev^2 after the aggregation (you should try it on your data):
agg_df = log_df.groupBy(flag_col).agg(mean(size_col).alias("mean"), stddev(size_col).alias("stddev"))
agg_df2 = agg_df.withColumn("pow", agg_df["stddev"] * agg_df["stddev"])
You can:
log_df.groupBy(log_df[flag_col]).agg(
    mean(size_col), stddev(size_col), pow(stddev(size_col), 2)
)
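For completeness, here is a minimal self-contained sketch of the groupBy approach; flag_col and size_col are the placeholder column names from the question, and the variance is derived from the already-aggregated stddev so the data is only scanned once:

from pyspark.sql.functions import mean, stddev

agg_df = (log_df
          .groupBy(flag_col)
          .agg(mean(size_col).alias("mean"),
               stddev(size_col).alias("stddev")))

# variance = stddev^2, computed on the small aggregated result
agg_df.withColumn("variance", agg_df["stddev"] * agg_df["stddev"]).show(20, False)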

How to increment counters based on a column value being fixed in a Window?

I have a dataset that, over time, indicates the region where certain users were located. From this dataset I want to calculate the number of nights they spent at each location. By "spending the night" I mean: take the user's last observed location before 23:59 of a given day; if every location observed for that user until 05:00 the next day (or the first observation after 05:00 if there is none before) matches that last location of the previous day, count it as a night spent at that location.
| Timestamp| User| Location|
|1462838468|49B4361512443A4DA...|1|
|1462838512|49B4361512443A4DA...|1|
|1462838389|49B4361512443A4DA...|2|
|1462838497|49B4361512443A4DA...|3|
|1465975885|6E9E0581E2A032FD8...|1|
|1457723815|405C238E25FE0B9E7...|1|
|1457897289|405C238E25FE0B9E7...|2|
|1457899229|405C238E25FE0B9E7...|11|
|1457972626|405C238E25FE0B9E7...|9|
|1458062553|405C238E25FE0B9E7...|9|
|1458241825|405C238E25FE0B9E7...|9|
|1458244457|405C238E25FE0B9E7...|9|
|1458412513|405C238E25FE0B9E7...|6|
|1458412292|405C238E25FE0B9E7...|6|
|1465197963|6E9E0581E2A032FD8...|6|
|1465202192|6E9E0581E2A032FD8...|6|
|1465923817|6E9E0581E2A032FD8...|5|
|1465923766|6E9E0581E2A032FD8...|2|
|1465923748|6E9E0581E2A032FD8...|2|
|1465923922|6E9E0581E2A032FD8...|2|
I'm guessing I need to use Window functions here, and I've used PySpark for other things in the past, but I'm a bit at a loss as to where to start here.
I think in the end you do need to have a function that takes a series of events and outputs nights spent... something like (example just to get the idea):
def nights_spent(location_events):
    # location_events is a list of events that have time and location
    location_events = sort_by_time(location_events)
    nights = []
    prev_event = None
    for event in location_events:
        if prev_event is not None:
            if next_day(prev_event.time, event.time) \
                    and same_location(prev_event.location, event.location):
                # TODO: How do you handle when prev_event
                # and event are more than 1 day apart?
                nights.append(prev_event.location)
        prev_event = event
    return nights
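The helper functions sort_by_time, next_day and same_location are left undefined in the sketch above; here is one hedged way to fill them in, purely for illustration (this next_day only checks "next calendar day", it does not implement the full 23:59/05:00 rule from the question):

from collections import namedtuple
from datetime import datetime, timezone

# an illustrative event record matching the attribute access used above
Event = namedtuple('Event', ['time', 'location'])

def sort_by_time(location_events):
    return sorted(location_events, key=lambda e: e.time)

def same_location(loc_a, loc_b):
    return loc_a == loc_b

def next_day(time_a, time_b):
    # timestamps are Unix seconds, as in the sample data; compare calendar days in UTC
    day_a = datetime.fromtimestamp(time_a, tz=timezone.utc).date()
    day_b = datetime.fromtimestamp(time_b, tz=timezone.utc).date()
    return (day_b - day_a).days == 1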
Then, I think that a good first approach is to first group by user so that you get all events (with location and time) for a given user.
Then you can feed that list of events to the function above, and you'll have all the (user, nights_spent) rows in an RDD.
So, in general, the RDD pipeline would look something like:
nights_spent_per_user = all_events.map(lambda x: (x.user, [(x.time, x.location)])) \
                                  .reduceByKey(lambda a, b: a + b) \
                                  .map(lambda x: (x[0], nights_spent(x[1])))
Hope that helps to get you started.

Resources