Spark RDD - flatMap with an argument which is another RDD - shapely polygons

I have 2 dataframes containing WKT strings. I need to compute the intersection of every polygon from the first dataframe with the second.
The first approach I used was to crossJoin both dataframes, then apply a function which parses the WKT into shapely polygons, computes the intersection, and returns the result. However, this is quite slow because a lot of parsing is required.
I'm trying to do the parsing upfront and keep the parsed polygons in an RDD (a dataframe would convert them back to WKT strings). Then I would like to apply a flatMap on the first RDD and return the results of all intersections with the second RDD.
Here is a code snippet:
import shapely.wkt
# creation of the 2 dataframes
df = spark.createDataFrame(
    [
        ("abc", "POLYGON((0 0,0 1,1 1,1 0,0 0))"),
        ("def", "POLYGON((1 2,1 4,3 4,3 2,1 2))"),
    ],
    ["id", "polygon"]  # add your column names here
)
df2 = spark.createDataFrame(
    [
        ("jkl", "POLYGON((0.5 0.5,0.5 1,1 1,1 0.5,0.5 0.5))"),
        ("def", "POLYGON((0 2,1 4,3 4,3 2,0 2))"),
    ],
    ["id", "polygon"]  # add your column names here
)
# convert to RDDs, parsing the WKT into shapely polygons
a = df.rdd.map(lambda row: (row["id"], shapely.wkt.loads(row["polygon"])))
b = df2.rdd.map(lambda row: (row["id"], shapely.wkt.loads(row["polygon"])))
# intersection function
def intersect(row, otherRDD):
    ans = []
    poly1 = row[1]
    for other in otherRDD:
        poly2 = other[1]
        inter = poly1.intersection(poly2)
        ans.append((row[0], other[0], str(inter)))
    return ans
# I want to run this on every row of the RDD "a" and pass the RDD "b" as an argument
test = a.flatMap(lambda row : intersect(row, b))
# then go back to dataframe
test.toDF(["id1", "id2", "intersection"]).show()
The error is:
Could not serialize object: Exception: It appears that you are attempting to broadcast an RDD or reference an RDD from an action or transformation. RDD transformations and actions can only be invoked by the driver, not inside of other transformations; for example, rdd1.map(lambda x: rdd2.values.count() * x) is invalid because the values transformation and count action cannot be performed inside of the rdd1.map transformation. For more information, see SPARK-5063.
I've taken a look at this error on the internet, but from what I've seen we need to collect the second RDD and send it into the function. However, both dataframes are quite big :s. Is there a workaround?
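For reference, a minimal sketch of the collect-and-broadcast approach mentioned above, assuming the parsed polygons of b fit in memory (the names b_local, b_bcast and intersect_bcast are illustrative, not part of the original code):
# Sketch only: collect the parsed polygons of `b` once and broadcast them,
# so the flatMap closure no longer references an RDD (avoids SPARK-5063).
b_local = b.collect()                      # list of (id, shapely polygon)
b_bcast = spark.sparkContext.broadcast(b_local)

def intersect_bcast(row):
    poly1 = row[1]
    return [(row[0], other_id, str(poly1.intersection(poly2)))
            for other_id, poly2 in b_bcast.value]

test = a.flatMap(intersect_bcast)
test.toDF(["id1", "id2", "intersection"]).show()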
Many thanks for your support,
Nicolas

Related

PySpark data skewness with Window Functions

I have a huge PySpark dataframe and I'm doing a series of Window functions over partitions defined by my key.
The issue is that my partitions get skewed by this key, which results in an Event Timeline that looks something like this:
I know that I can use the salting technique to solve this when doing a join, but how can I solve it when I'm using Window functions?
I'm using functions like lag, lead, etc. in the Window functions. I can't run the process with a salted key, because I'd get wrong results.
How to solve skewness in this case?
I'm looking for a dynamic way of repartitioning my dataframe without skewness.
Updates based on the answer from @jxc
I tried creating a sample df and running the code over that:
import numpy as np
import pandas as pd
from pyspark.sql import functions as F, Window

df = pd.DataFrame()
df['id'] = np.random.randint(1, 1000, size=150000)
df['id'] = df['id'].map(lambda x: 100 if x % 2 == 0 else x)
df['timestamp'] = pd.date_range(start=pd.Timestamp('2020-01-01'), periods=len(df), freq='60s')
sdf = sc.createDataFrame(df)  # sc here is the SparkSession
sdf = sdf.withColumn("amt", F.rand()*100)
w = Window.partitionBy("id").orderBy("timestamp")
sdf = sdf.withColumn("new_col", F.lag("amt").over(w) + F.lead("amt").over(w))
x = sdf.toPandas()
This gave me an event timeline like this:
I then tried the code from @jxc's answer:
sdf = sc.createDataFrame(df)
sdf = sdf.withColumn("amt", F.rand()*100)
N = 24*3600*365*2
sdf_1 = sdf.withColumn('pid', F.ceil(F.unix_timestamp('timestamp')/N))
w1 = Window.partitionBy('id', 'pid').orderBy('timestamp')
w2 = Window.partitionBy('id', 'pid')
sdf_2 = sdf_1.select(
'*',
F.count('*').over(w2).alias('cnt'),
F.row_number().over(w1).alias('rn'),
(F.lag('amt',1).over(w1) + F.lead('amt',1).over(w1)).alias('new_val')
)
sdf_3 = sdf_2.filter('rn in (1, 2, cnt-1, cnt)') \
.withColumn('new_val', F.lag('amt',1).over(w) + F.lead('amt',1).over(w)) \
.filter('rn in (1,cnt)')
df_new = sdf_2.filter('rn not in (1,cnt)').union(sdf_3)
x = df_new.toPandas()
I ended up with one additional stage, and the event timeline looked even more skewed.
Also, the run time increased a bit with the new code.
To process a large partition, you can try splitting it based on the orderBy column (most likely a numeric column, or a date/timestamp column which can be converted to numeric) so that all new sub-partitions maintain the correct order of rows. Process rows with the new partitioner; for calculations using the lag and lead functions, only rows around the boundaries between sub-partitions need to be post-processed. (How to merge small partitions is discussed in Task-2 below.)
Using your example sdf, assume we have the following WindowSpec and a simple calculation:
w = Window.partitionBy('id').orderBy('timestamp')
df.withColumn('new_amt', F.lag('amt',1).over(w) + F.lead('amt',1).over(w))
Task-1: split large partitions:
Try the following:
Select an N to split the timestamp range and set up an additional partitionBy column pid (using ceil, int, floor, etc.):
# N to cover 35-days' intervals
N = 24*3600*35
df1 = sdf.withColumn('pid', F.ceil(F.unix_timestamp('timestamp')/N))
Add pid into partitionBy (see w1), then calculate row_number(), lag() and lead() over w1. Also find the number of rows (cnt) in each new partition to help identify the end of each partition (rn == cnt). The resulting new_amt will be correct for the majority of rows, except for those on the boundaries of each partition.
w1 = Window.partitionBy('id', 'pid').orderBy('timestamp')
w2 = Window.partitionBy('id', 'pid')
df2 = df1.select(
'*',
F.count('*').over(w2).alias('cnt'),
F.row_number().over(w1).alias('rn'),
(F.lag('amt',1).over(w1) + F.lead('amt',1).over(w1)).alias('new_amt')
)
Below is an example df2 showing the boundary rows.
Process the boundaries: select the rows which are on the boundaries (rn in (1, cnt)) plus those whose values are used in the calculation (rn in (2, cnt-1)), do the same calculation of new_amt over w, and keep the result for the boundary rows only.
df3 = df2.filter('rn in (1, 2, cnt-1, cnt)') \
.withColumn('new_amt', F.lag('amt',1).over(w) + F.lead('amt',1).over(w)) \
.filter('rn in (1,cnt)')
Below is the resulting df3 from the above df2.
Merge df3 back into df2 to update the boundary rows (rn in (1, cnt)):
df_new = df2.filter('rn not in (1,cnt)').union(df3)
The screenshot below shows the final df_new around the boundary rows:
# drop columns which are used to implement logic only
df_new = df_new.drop('cnt', 'rn')
Some Notes:
The following 3 WindowSpecs are defined:
w = Window.partitionBy('id').orderBy('timestamp') <-- fix boundary rows
w1 = Window.partitionBy('id', 'pid').orderBy('timestamp') <-- calculate internal rows
w2 = Window.partitionBy('id', 'pid') <-- find #rows in a partition
Note: strictly speaking, it's better to use the following w to fix the boundary rows, to avoid issues with tied timestamps around the boundaries.
w = Window.partitionBy('id').orderBy('pid', 'rn') <-- fix boundary rows
If you know which partitions are skewed, just divide them and skip the others. The method above might split a small partition into 2 or more pieces if its rows are sparsely distributed:
df1 = df.withColumn('pid', F.when(F.col('id').isin('a','b'), F.ceil(F.unix_timestamp('timestamp')/N)).otherwise(1))
If, for each partition, you can retrieve count (the number of rows) and min_ts = min(timestamp), then try something more dynamic for pid (below, M is the threshold number of rows above which to split):
F.expr(f"IF(count>{M}, ceil((unix_timestamp(timestamp)-unix_timestamp(min_ts))/{N}), 1)")
Note: for skewness inside a single partition, more complex functions will be required to generate pid.
If only the lag(1) function is used, just post-process the left boundaries: filter by rn in (1, cnt) and update only the rows with rn == 1:
df3 = df2.filter('rn in (1, cnt)') \
.withColumn('new_amt', F.lag('amt',1).over(w)) \
.filter('rn = 1')
Similarly for the lead function, when we only need to fix the right boundaries and update the rows with rn == cnt.
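For example, a sketch mirroring the lag-only snippet above (not from the original answer):
# lead(1) only: fix the right boundaries and update rn == cnt
df3 = df2.filter('rn in (1, cnt)') \
    .withColumn('new_amt', F.lead('amt',1).over(w)) \
    .filter('rn = cnt')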
If only lag(2) is used, then filter and update more rows in df3:
df3 = df2.filter('rn in (1, 2, cnt-1, cnt)') \
.withColumn('new_amt', F.lag('amt',2).over(w)) \
.filter('rn in (1,2)')
You can extend the same method to mixed cases where both lag and lead are used with different offsets.
Task-2: merge small partitions:
Based on the number of records per partition (count), we can set a threshold M so that if count > M, the id keeps its own partition; otherwise we merge partitions so that the total number of records stays below M (the method below has an edge case of up to 2*M-2 records).
M = 20000
# create pandas df with columns `id`, `count` and `f`, sort rows so that rows with count>=M are located on top
d2 = pd.DataFrame([ e.asDict() for e in sdf.groupby('id').count().collect() ]) \
.assign(f=lambda x: x['count'].lt(M)) \
.sort_values('f')
# add pid column to merge smaller partitions, keeping the total row-count per partition less than or around M
# potentially there could be at most `2*M-2` records for the same pid; to ensure strictly count < M, use a for-loop to iterate d2 and set pid (see the sketch below):
d2['pid'] = (d2.mask(d2['count'].gt(M),M)['count'].shift(fill_value=0).cumsum()/M).astype(int)
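A hedged sketch of that for-loop variant (not from the original answer; it greedily packs the small ids so that each merged pid stays strictly below M rows):
# Sketch: walk the sorted frame; large ids keep their own pid, small ids are packed greedily
pids, cur_pid, cur_rows = [], 0, 0
for cnt in d2['count']:
    if cnt >= M or cur_rows + cnt >= M:  # large id, or the current bucket would overflow
        cur_pid += 1
        cur_rows = 0
    pids.append(cur_pid)
    cur_rows += cnt
d2['pid'] = pids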
# add pid to sdf. In case join is too heavy, try using Map
sdf_1 = sdf.join(spark.createDataFrame(d2).alias('d2'), ["id"]) \
.select(sdf["*"], F.col("d2.pid"))
# check pid: # of records and # of distinct ids
sdf_1.groupby('pid').agg(F.count('*').alias('count'), F.countDistinct('id').alias('cnt_ids')).orderBy('pid').show()
+---+-----+-------+
|pid|count|cnt_ids|
+---+-----+-------+
| 0|74837| 1|
| 1|20036| 133|
| 2|20052| 134|
| 3|20010| 133|
| 4|15065| 100|
+---+-----+-------+
Now the new Window should be partitioned by pid alone, with id moved into orderBy; see below:
w3 = Window.partitionBy('pid').orderBy('id','timestamp')
Customize the lag/lead functions based on the above w3 WindowSpec, and then calculate new_val:
lag_w3 = lambda col,n=1: F.when(F.lag('id',n).over(w3) == F.col('id'), F.lag(col,n).over(w3))
lead_w3 = lambda col,n=1: F.when(F.lead('id',n).over(w3) == F.col('id'), F.lead(col,n).over(w3))
sdf_new = sdf_1.withColumn('new_val', lag_w3('amt',1) + lead_w3('amt',1))
To handle such skewed data, there are a couple of things you can try out.
If you are using Databricks to run your jobs and you know which column has the skew, then you can try out an option called a skew hint.
I recommend moving to Spark 3.0, since you will have the option to use Adaptive Query Execution (AQE), which can handle most of these issues, improving your job health and potentially making your jobs run faster.
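For example, a minimal sketch of enabling AQE and its skew-join handling on Spark 3.0+ (the values shown are the stock defaults, not tuned for this job; note that AQE's skew handling targets joins and shuffles rather than Window functions specifically):
# Enable Adaptive Query Execution and its skew-join handling (Spark 3.0+)
spark.conf.set('spark.sql.adaptive.enabled', 'true')
spark.conf.set('spark.sql.adaptive.skewJoin.enabled', 'true')
# optionally tune what counts as a skewed partition
spark.conf.set('spark.sql.adaptive.skewJoin.skewedPartitionFactor', '5')
spark.conf.set('spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes', '256MB')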
In general, I suggest repartitioning your data into more evenly sized partitions before any wide operation; increasing the cluster size also helps, but I am not sure whether that will work for you.

(KNN) row-wise computation using an outer DataFrame in PySpark

Question
My data structure is like this:
train_info: (over 30,000 rows)
----------
odt:string (unique)
holiday_type:string
od_label:string
array:array<double> (variable length, depending on odt and holiday_type)
useful_index:array<int> (same length as array)
...(other unimportant cols)
label_data:(over 40000 rows)
----------
holiday_type:string
od_label: string
l_origin_array:array<double> (variable length)
...(other unimportant cols)
My expected result is like this (same length as train_info):
--------------
odt:string
holiday_label:string
od_label:string
prediction:int
My solution is like this:
if __name__ == '__main__':
    loop_item = train_info.collect()
    result = knn_for_loop(spark, loop_item, train_info.schema, label_data)
    # ----- do something -------
def knn_for_loop(spark, predict_list, schema, label_data):
    result = list()
    for i in predict_list:
        # turn this Row into a one-row DataFrame and join it with the label data,
        # then pick the label-array values at this row's useful_index positions
        predict_df = spark.sparkContext.parallelize([i]).toDF(schema) \
            .join(label_data, on=['holiday_type', "od_label"], how='left') \
            .withColumn("l_array",
                        UDFuncs.value_from_array_by_index(f.col('l_origin_array'), f.col("useful_index"))) \
            .toPandas()
        # run KNN in pandas
        train_x = predict_df.l_array.values
        train_y = predict_df.label.values
        test_x = predict_df.array.values[0]
        test_y = KNN(train_x, train_y, test_x)
        result.append((i['odt'], i['holiday_type'], i['od_label'], test_y))
    return result
It works, but it is really slow; I estimate each row needs about 18 s.
In R I can do this easily using the do function:
train_info %>% group_by(odt) %>% do(., knn_loop, label_data)
Some of my attempts:
I tried joining them beforehand and querying the joined data at compute time, but the data is too large to work with (the two dfs have 400 million rows after the join, take up 180 GB of disk space on Hive, and query really slowly).
I tried to use pandas_udf, but it only allows one pd.DataFrame parameter and was slow (see the sketch after this list).
I tried to use a UDF, but a UDF can't receive a DataFrame object.
I tried to use the spark-knn package, but I got an error; maybe my offline installation is wrong.
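A rough, untested sketch of how the grouped-pandas route could look on Spark 3.0+ with applyInPandas, broadcasting label_data once instead of joining per row (knn_on_group is a hypothetical helper; it glosses over the useful_index filtering and assumes label_data fits in memory and has a label column, as in the loop above):
import pandas as pd

label_pd = label_data.toPandas()                   # assumes it fits on the driver
label_bc = spark.sparkContext.broadcast(label_pd)

def knn_on_group(pdf: pd.DataFrame) -> pd.DataFrame:
    # pdf holds all train_info rows for one (holiday_type, od_label) pair
    labels = label_bc.value
    labels = labels[(labels.holiday_type == pdf.holiday_type.iloc[0]) &
                    (labels.od_label == pdf.od_label.iloc[0])]
    preds = [KNN(labels.l_origin_array.values, labels.label.values, row.array)
             for row in pdf.itertuples()]           # KNN as used in the question
    return pdf[['odt', 'holiday_type', 'od_label']].assign(prediction=preds)

result = (train_info
          .groupBy('holiday_type', 'od_label')
          .applyInPandas(knn_on_group,
                         schema='odt string, holiday_type string, od_label string, prediction int'))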
Thanks for your help.

How can I select specific values from list and plot a seaborn boxplot?

I have a list (length 300) of lists (each length 1000). I want to sort the list of 300 by the median of each list of 1000, and then plot a seaborn boxplot of the top 10 (i.e. the 10 lists with the greatest median).
I am able to plot the entire list of 300 but don't know where to go from there.
I can plot a range of the points, but how do I plot, for example, data[3], data[45], and data[129] all in the same plot?
ax = sns.boxplot(data = data[0:50])
I can also work out which items in the list are in the top 10 by doing this (but I realise this is not the most elegant way!)
array_median = np.median(data, axis=1)
np_sortedarray = np.sort(np.array(array_median))
sort_panda = pd.DataFrame(array_median)
TwoL = sort_panda.reset_index()
TwoL.sort_values(0)
Ultimately I want a boxplot with 10 boxes, showing the list items that have the greatest median values.
Example of data: list of 300 x 1000
[[1.236762285232544,
1.2303414344787598,
1.196462631225586,
...1.1787045001983643,
1.1760116815567017,
1.1614983081817627,
1.1546586751937866],
[1.1349891424179077,
1.1338907480239868,
1.1239897012710571,
1.1173863410949707,
...1.1015456914901733,
1.1005324125289917,
1.1005228757858276],
[1.0945734977722168,
...1.091795563697815]]
I modified your example data a bit just to make it easier.
import seaborn as sns
import pandas as pd
import numpy as np
data = [[1.236762285232544, 1.2303414344787598, 1.196462631225586, 1.1787045001983643, 1.1760116815567017, 1.1614983081817627, 1.1546586751937866],
[1.1349891424179077, 1.1338907480239868, 1.1239897012710571, 1.1173863410949707, 1.1015456914901733, 1.1005324125289917, 1.1005228757858276]]
To sort your data, since it is a list of lists and not a numpy array, you can use the sorted function with a key that tells it what to compute for each inner list, which is what the function will sort by. Setting reverse=True sorts from highest to lowest.
sorted_data = sorted(data, key = lambda x: np.median(x), reverse = True)
To select the top n lists, add [:n] to the end of the previous statement.
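For example, keeping just the 10 lists with the greatest medians:
top10 = sorted(data, key=lambda x: np.median(x), reverse=True)[:10]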
To plot in Seaborn, it's easiest to convert your data to a pandas.DataFrame.
df = pd.DataFrame(sorted_data).T
That makes a DataFrame with 10 columns (or 2 in this example). We can rename the columns to make each dataset clearer.
df = df.rename(columns={k: f'Data{k+1}' for k in range(len(sorted_data))}).reset_index()
And to plot 2 (or 10) boxplots in one plot, you can reshape the dataframe to have 2 columns, one for the data and one for the dataset number (ID) (credit here).
df = pd.wide_to_long(df, stubnames = ['Data'], i = 'index', j = 'ID').reset_index()[['ID', 'Data']]
And then you can plot it.
sns.boxplot(x='ID', y = 'Data', data = df)
See this answer for fetching top 10 elements
idx = (-median).argsort()[:10]
data[idx]
Also, you can get particular elements of data like this
data[[3, 45, 129]]
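Note that both of these snippets assume data and median are NumPy arrays rather than plain Python lists; a minimal sketch of the conversion:
arr = np.array(data)
median = np.median(arr, axis=1)
idx = (-median).argsort()[:10]
top10 = arr[idx]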

pyspark rdd of csv to data frame with large number of columns dynamically

I have an existing rdd which consists of a single column of text with many (20k+) comma separated values.
How can I convert this to a data frame without specifying every column literally?
# split into columns
split_rdd = input_rdd.map(lambda l: l.split(","))
# convert to Row types
rows_rdd = split_rdd.map(lambda p: Row(
field_1=p[0],
field_2=p[1],
field_3 = float(p[2]),
field_4 = float(p[3])
))
df = spark.createDataFrame(rows_rdd)
How can I dynamically create the
field_1=p[0],
dict?
For example
row_dict = dict(
field_1=p[0],
field_2=p[1],
field_3 = float(p[2]),
field_4 = float(p[3])
)
does not work, since the dynamically generated field names would need to be quoted strings, but then they are literals and don't get evaluated in the lambda function.
This is a large enough dataset that I need to avoid writing out the rdd and reading it back into a dataframe for performance.
You could try using a dictionary comprehension when creating the Row instances:
df = split_rdd\
.map(lambda p: {'field_%s' % index : val
for (index, val) in enumerate(p)})\
.map(lambda p: Row(**p))\
.toDF()
This first maps each list of column values from split_rdd into a dictionary with dynamically generated field_N keys mapped to the respective values. These dictionaries are then used to create the Row instances.
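As an aside, a hedged alternative sketch that skips the Row objects entirely by generating the column names up front and passing them to createDataFrame (assumes every line splits into the same number of fields; all columns come out as strings unless cast afterwards):
# Sketch: build the column names once and let Spark pair them with each split list
num_cols = len(split_rdd.first())
col_names = ['field_%d' % (i + 1) for i in range(num_cols)]
df = spark.createDataFrame(split_rdd, col_names)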

How to split column of vectors into two columns?

I use PySpark.
Spark ML's Random Forest output DataFrame has a column "probability" which is a vector with two values. I just want to add two columns to the output DataFrame, "prob1" and "prob2", which correspond to the first and second values in the vector.
I've tried the following:
output2 = output.withColumn('prob1', output.map(lambda r: r['probability'][0]))
but I get the error that 'col should be Column'.
Any suggestions on how to transform a column of vectors into columns of its values?
I figured out the problem with the suggestion above. In PySpark, "dense vectors are simply represented as NumPy array objects", so the issue is with Python and NumPy types. You need to add .item() to cast a numpy.float64 to a Python float.
The following code works:
from pyspark.sql.functions import udf
from pyspark.sql.types import FloatType

split1_udf = udf(lambda value: value[0].item(), FloatType())
split2_udf = udf(lambda value: value[1].item(), FloatType())
output2 = randomforestoutput.select(split1_udf('probability').alias('c1'), split2_udf('probability').alias('c2'))
Or to append these columns to the original dataframe:
randomforestoutput.withColumn('c1', split1_udf('probability')).withColumn('c2', split2_udf('probability'))
I got the same problem; below is the code adjusted for the situation where you have an n-length vector.
# bind i as a default argument so each UDF keeps its own index
splits = [udf(lambda value, i=i: value[i].item(), FloatType()) for i in range(n)]
out = tstDF.select(*[s('features').alias("Column" + str(i)) for i, s in enumerate(splits)])
You may want to use one UDF to extract the first value and another to extract the second. You can then use the UDFs in a select call on the output of the random forest data frame. Example:
from pyspark.sql.functions import udf, col
from pyspark.sql.types import FloatType
split1_udf = udf(lambda value: value[0], FloatType())
split2_udf = udf(lambda value: value[1], FloatType())
output2 = randomForrestOutput.select(split1_udf(col("probability")).alias("c1"),
split2_udf(col("probability")).alias("c2"))
This should give you a dataframe output2 which has columns c1 and c2 corresponding to the first and second values in the list stored in the column probability.
I tried @Rookie Boy's loop, but it seems the splits udf loop doesn't work for me.
I modified it a bit:
out = df
for i in range(n):
    # bind i as a default argument so each UDF uses its own index
    splits_i = udf(lambda x, i=i: x[i].item(), FloatType())
    out = out.withColumn('col_{}'.format(i), splits_i('probability'))
out.select(*['col_{}'.format(i) for i in range(n)]).show()
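On Spark 3.0+, a hedged alternative (not from the answers above) is to convert the vector column with pyspark.ml.functions.vector_to_array and index it directly, avoiding Python UDFs; a minimal sketch:
# Sketch: expand the probability vector without UDFs (Spark 3.0+)
from pyspark.sql import functions as F
from pyspark.ml.functions import vector_to_array

out = (randomforestoutput
       .withColumn('prob_arr', vector_to_array('probability'))
       .withColumn('prob1', F.col('prob_arr')[0])
       .withColumn('prob2', F.col('prob_arr')[1])
       .drop('prob_arr'))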
