Basic Question
I have a dataset with ~10 billion rows. I'm looking for the most performant way to calculate rolling/windowed aggregates/metrics (sum, mean, min, max, stddev) over four different time windows (3 days, 7 days, 14 days, 21 days).
Spark/AWS EMR Specs
spark version: 2.4.4
ec2 instance type: r5.24xlarge
num core ec2 instances: 10
num pyspark partitions: 600
Overview
I read a bunch of SO posts that addressed either the mechanics of calculating rolling statistics or how to make Window functions faster, but none of them combined the two in a way that solves my problem. Below I've shown a few options that produce what I want, but I need them to run faster on my real dataset, so I'm looking for suggestions that are faster/better.
My dataset is structured as follows but with ~10 billion rows:
+--------------------------+----+-----+
|date |name|value|
+--------------------------+----+-----+
|2020-12-20 17:45:19.536796|1 |5 |
|2020-12-21 17:45:19.53683 |1 |105 |
|2020-12-22 17:45:19.536846|1 |205 |
|2020-12-23 17:45:19.536861|1 |305 |
|2020-12-24 17:45:19.536875|1 |405 |
|2020-12-25 17:45:19.536891|1 |505 |
|2020-12-26 17:45:19.536906|1 |605 |
|2020-12-20 17:45:19.536796|2 |10 |
|2020-12-21 17:45:19.53683 |2 |110 |
|2020-12-22 17:45:19.536846|2 |210 |
|2020-12-23 17:45:19.536861|2 |310 |
|2020-12-24 17:45:19.536875|2 |410 |
|2020-12-25 17:45:19.536891|2 |510 |
|2020-12-26 17:45:19.536906|2 |610 |
|2020-12-20 17:45:19.536796|3 |15 |
|2020-12-21 17:45:19.53683 |3 |115 |
|2020-12-22 17:45:19.536846|3 |215 |
I need my dataset to look like the example below. Note: window statistics for a 7-day window are shown, but I need the three other windows as well.
+--------------------------+----+-----+----+-----+---+---+------------------+
|date |name|value|sum |mean |min|max|stddev |
+--------------------------+----+-----+----+-----+---+---+------------------+
|2020-12-20 17:45:19.536796|1 |5 |5 |5.0 |5 |5 |NaN |
|2020-12-21 17:45:19.53683 |1 |105 |110 |55.0 |5 |105|70.71067811865476 |
|2020-12-22 17:45:19.536846|1 |205 |315 |105.0|5 |205|100.0 |
|2020-12-23 17:45:19.536861|1 |305 |620 |155.0|5 |305|129.09944487358058|
|2020-12-24 17:45:19.536875|1 |405 |1025|205.0|5 |405|158.11388300841898|
|2020-12-25 17:45:19.536891|1 |505 |1530|255.0|5 |505|187.08286933869707|
|2020-12-26 17:45:19.536906|1 |605 |2135|305.0|5 |605|216.02468994692867|
|2020-12-20 17:45:19.536796|2 |10 |10 |10.0 |10 |10 |NaN |
|2020-12-21 17:45:19.53683 |2 |110 |120 |60.0 |10 |110|70.71067811865476 |
|2020-12-22 17:45:19.536846|2 |210 |330 |110.0|10 |210|100.0 |
|2020-12-23 17:45:19.536861|2 |310 |640 |160.0|10 |310|129.09944487358058|
|2020-12-24 17:45:19.536875|2 |410 |1050|210.0|10 |410|158.11388300841898|
|2020-12-25 17:45:19.536891|2 |510 |1560|260.0|10 |510|187.08286933869707|
|2020-12-26 17:45:19.536906|2 |610 |2170|310.0|10 |610|216.02468994692867|
|2020-12-20 17:45:19.536796|3 |15 |15 |15.0 |15 |15 |NaN |
|2020-12-21 17:45:19.53683 |3 |115 |130 |65.0 |15 |115|70.71067811865476 |
|2020-12-22 17:45:19.536846|3 |215 |345 |115.0|15 |215|100.0 |
Details
For ease of reading, I'll just do one window in these examples. Things I have tried:
Basic Window().over() syntax
Converting windowed values into an array column and using higher order functions
Spark SQL
Setup
import datetime
from pyspark.sql import SparkSession
import pyspark.sql.functions as F
from pyspark.sql.window import Window
from pyspark.sql.types import FloatType
import pandas as pd
import numpy as np
spark = SparkSession.builder.appName('example').getOrCreate()
# create spark dataframe
n = 7
names = [1, 2, 3]
date_list = [datetime.datetime.today() - datetime.timedelta(days=(n-x)) for x in range(n)]
values = [x*100 for x in range(n)]
rows = []
for name in names:
    for d, v in zip(date_list, values):
        rows.append(
            {
                "name": name,
                "date": d,
                "value": v + (5 * name)
            }
        )
df = spark.createDataFrame(data=rows)
# setup window
window_days = 7
window = (
    Window
    .partitionBy(F.col("name"))
    .orderBy(F.col("date").cast("timestamp").cast("long"))
    # frame covers exactly `window_days` days of seconds, ending at the current row
    .rangeBetween(-window_days * 60 * 60 * 24 + 1, Window.currentRow)
)
1. Basic
This creates multiple window specs (one per withColumn, as the physical plans further down show), so the windows are evaluated serially and it runs very slowly on a large dataset.
status_quo = (
    df
    .withColumn("sum", F.sum(F.col("value")).over(window))
    .withColumn("mean", F.avg(F.col("value")).over(window))
    .withColumn("min", F.min(F.col("value")).over(window))
    .withColumn("max", F.max(F.col("value")).over(window))
    .withColumn("stddev", F.stddev(F.col("value")).over(window))
)
status_quo.show()
status_quo.explain()
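One variation worth checking (a sketch I haven't benchmarked, not part of my original attempts): expressing all five aggregates in a single select over the same window spec, which may let Spark group them into one Window operator instead of one per withColumn; explain() will show whether it does.
# Sketch: same aggregates as above, written as one select over a single window spec.
single_select = df.select(
    "*",
    F.sum("value").over(window).alias("sum"),
    F.avg("value").over(window).alias("mean"),
    F.min("value").over(window).alias("min"),
    F.max("value").over(window).alias("max"),
    F.stddev("value").over(window).alias("stddev"),
)
single_select.explain()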
2. Array Column --> Higher Order Functions
Per this answer, this seems to create fewer window specs, but the aggregate() function syntax makes no sense to me, I don't know how to write stddev using higher-order functions, and the performance doesn't seem much better in small tests.
@F.udf(returnType=FloatType())
def array_stddev(row_value):
    """
    Temporary function since I don't know how to write a higher-order standard deviation.
    (Note: np.std defaults to the population stddev, ddof=0, whereas F.stddev is the sample stddev.)
    """
    return np.std(row_value, dtype=float).tolist()
# 1. collect window into array column
# 2. use higher order (array) functions to calculate aggregations over array (window values)
# Question: how to write standard deviation in aggregate()
hof_example = (
    df
    .withColumn("value_array", F.collect_list(F.col("value")).over(window))
    .withColumn("sum_example", F.expr('AGGREGATE(value_array, DOUBLE(0), (acc, x) -> acc + x)'))
    .withColumn("mean_example", F.expr('AGGREGATE(value_array, DOUBLE(0), (acc, x) -> acc + x, acc -> acc / size(value_array))'))
    .withColumn("max_example", F.array_max(F.col("value_array")))
    .withColumn("min_example", F.array_min(F.col("value_array")))
    .withColumn("std_example", array_stddev(F.col("value_array")))
)
3. Spark SQL
This appears to be the fastest in simple tests, though I haven't tried it on the full dataset. The only (minor) issue is that the rest of my codebase uses the DataFrame API.
df.createOrReplaceTempView("df")
sql_example = spark.sql(
"""
SELECT
*
, sum(value)
OVER (
PARTITION BY name
ORDER BY CAST(date AS timestamp)
RANGE BETWEEN INTERVAL 7 DAYS PRECEDING AND CURRENT ROW
) AS sum
, mean(value)
OVER (
PARTITION BY name
ORDER BY CAST(date AS timestamp)
RANGE BETWEEN INTERVAL 7 DAYS PRECEDING AND CURRENT ROW
) AS mean
, min(value)
OVER (
PARTITION BY name
ORDER BY CAST(date AS timestamp)
RANGE BETWEEN INTERVAL 7 DAYS PRECEDING AND CURRENT ROW
) AS min
, max(value)
OVER (
PARTITION BY name
ORDER BY CAST(date AS timestamp)
RANGE BETWEEN INTERVAL 7 DAYS PRECEDING AND CURRENT ROW
) AS max
, stddev(value)
OVER (
PARTITION BY name
ORDER BY CAST(date AS timestamp)
RANGE BETWEEN INTERVAL 7 DAYS PRECEDING AND CURRENT ROW
) AS stddev
FROM df"""
)
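(On the DataFrame API concern: spark.sql() returns an ordinary DataFrame, so the result still plugs into DataFrame-API code. A trivial sketch:)
# The SQL result is a regular DataFrame, so downstream DataFrame-API code keeps working.
sql_example.filter(F.col("name") == 1).select("date", "sum", "stddev").show(truncate=False)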
NOTE: I'm going to mark this as the accepted answer for the time being. If someone finds a faster/better approach, please notify me and I'll switch it!
EDIT/Clarification: The calculations shown here assume the input dataframe has been pre-processed to the day level, i.e. day-level rolling calculations.
After I posted the question I tested several different options on my real dataset (and got some input from coworkers) and I believe the fastest way to do this (for large datasets) uses pyspark.sql.functions.window() with groupby().agg instead of pyspark.sql.window.Window().
A similar answer can be found here
The steps to make this work are:
Sort the dataframe by name and date (as in the example dataframe).
.persist() the dataframe.
Compute a grouped dataframe using F.window() and join it back to df, once for every window required.
The best/easiest way to see this in action is in the SQL diagram on the SQL tab of the Spark UI. When Window() is used, the SQL execution is totally sequential; when F.window() is used, the diagram shows parallelization! NOTE: on small datasets Window() still seems faster.
In my tests with real data on 7-day windows, Window() was 3-5x slower than F.window(). The only downside is that F.window() is a bit less convenient to use. I've shown some code and screenshots below for reference.
Fastest Solution Found (F.window() with groupby.agg())
# this turned out to be super important for tricking spark into parallelizing things
df = df.orderBy("name", "date")
df.persist()
fwindow7 = F.window(
    F.col("date"),
    windowDuration="7 days",
    slideDuration="1 days",
).alias("window")
gb7 = (
    df
    .groupBy(F.col("name"), fwindow7)
    .agg(
        F.sum(F.col("value")).alias("sum7"),
        F.avg(F.col("value")).alias("mean7"),
        F.min(F.col("value")).alias("min7"),
        F.max(F.col("value")).alias("max7"),
        F.stddev(F.col("value")).alias("stddev7"),
        F.count(F.col("value")).alias("cnt7")
    )
    .withColumn("date", F.date_sub(F.col("window.end").cast("date"), 1))
    .drop("window")
)
window_function_example = df.join(gb7, ["name", "date"], how="left")
fwindow14 = F.window(
    F.col("date"),
    windowDuration="14 days",
    slideDuration="1 days",
).alias("window")
gb14 = (
    df
    .groupBy(F.col("name"), fwindow14)
    .agg(
        F.sum(F.col("value")).alias("sum14"),
        F.avg(F.col("value")).alias("mean14"),
        F.min(F.col("value")).alias("min14"),
        F.max(F.col("value")).alias("max14"),
        F.stddev(F.col("value")).alias("stddev14"),
        F.count(F.col("value")).alias("cnt14")
    )
    .withColumn("date", F.date_sub(F.col("window.end").cast("date"), 1))
    .drop("window")
)
window_function_example = window_function_example.join(gb14, ["name", "date"], how="left")
window_function_example.orderBy("name", "date").show(truncate=True)
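For reference, a sketch (my addition, assuming the same day-level pre-processing as above) that generalizes this pattern to all four required windows with a loop instead of copy-pasting each block; the "result" name and the "sum3"/"mean21"-style column names are just illustrative:
# Sketch: build one grouped aggregate per window length and left-join each back onto df.
result = df
for days in [3, 7, 14, 21]:
    fwin = F.window(
        F.col("date"),
        windowDuration="{} days".format(days),
        slideDuration="1 days",
    ).alias("window")
    gb = (
        df
        .groupBy(F.col("name"), fwin)
        .agg(
            F.sum("value").alias("sum{}".format(days)),
            F.avg("value").alias("mean{}".format(days)),
            F.min("value").alias("min{}".format(days)),
            F.max("value").alias("max{}".format(days)),
            F.stddev("value").alias("stddev{}".format(days)),
            F.count("value").alias("cnt{}".format(days)),
        )
        .withColumn("date", F.date_sub(F.col("window.end").cast("date"), 1))
        .drop("window")
    )
    result = result.join(gb, ["name", "date"], how="left")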
SQL Diagram
Option 2 from Original Question (Higher Order Functions applied to Window())
window7 = (
    Window
    .partitionBy(F.col("name"))
    .orderBy(F.col("date").cast("timestamp").cast("long"))
    .rangeBetween(-7 * 60 * 60 * 24 + 1, Window.currentRow)
)
window14 = (
    Window
    .partitionBy(F.col("name"))
    .orderBy(F.col("date").cast("timestamp").cast("long"))
    .rangeBetween(-14 * 60 * 60 * 24 + 1, Window.currentRow)
)
hof_example = (
    df
    .withColumn("value_array", F.collect_list(F.col("value")).over(window7))
    .withColumn("sum7", F.expr('AGGREGATE(value_array, DOUBLE(0), (acc, x) -> acc + x)'))
    .withColumn("mean7", F.expr('AGGREGATE(value_array, DOUBLE(0), (acc, x) -> acc + x, acc -> acc / size(value_array))'))
    .withColumn("max7", F.array_max(F.col("value_array")))
    .withColumn("min7", F.array_min(F.col("value_array")))
    .withColumn("std7", F.expr('AGGREGATE(value_array, DOUBLE(0), (acc, x) -> acc + (x - mean7)*(x - mean7), acc -> sqrt(acc / (size(value_array) - 1)))'))
    .withColumn("count7", F.size(F.col("value_array")))
    .drop("value_array")
)
hof_example = (
    hof_example
    .withColumn("value_array", F.collect_list(F.col("value")).over(window14))
    .withColumn("sum14", F.expr('AGGREGATE(value_array, DOUBLE(0), (acc, x) -> acc + x)'))
    .withColumn("mean14", F.expr('AGGREGATE(value_array, DOUBLE(0), (acc, x) -> acc + x, acc -> acc / size(value_array))'))
    .withColumn("max14", F.array_max(F.col("value_array")))
    .withColumn("min14", F.array_min(F.col("value_array")))
    .withColumn("std14", F.expr('AGGREGATE(value_array, DOUBLE(0), (acc, x) -> acc + (x - mean14)*(x - mean14), acc -> sqrt(acc / (size(value_array) - 1)))'))
    .withColumn("count14", F.size(F.col("value_array")))
    .drop("value_array")
)
hof_example.show(truncate=True)
SQL Diagram Snippet
Try this aggregate for stddev. If you want to understand the syntax, you can check the docs.
hof_example = (
    df
    .withColumn("value_array", F.collect_list(F.col("value")).over(window))
    .withColumn("sum_example", F.expr('AGGREGATE(value_array, DOUBLE(0), (acc, x) -> acc + x)'))
    .withColumn("mean_example", F.expr('AGGREGATE(value_array, DOUBLE(0), (acc, x) -> acc + x, acc -> acc / size(value_array))'))
    .withColumn("max_example", F.array_max(F.col("value_array")))
    .withColumn("min_example", F.array_min(F.col("value_array")))
    .withColumn("std_example", F.expr('AGGREGATE(value_array, DOUBLE(0), (acc, x) -> acc + (x - mean_example)*(x - mean_example), acc -> sqrt(acc / (size(value_array) - 1)))'))
)
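(A quick sanity check I'd add, not part of the original answer: compare the result against the built-in F.stddev over the same window spec; both compute the sample standard deviation, i.e. divide by n - 1.)
# Sketch: compare the aggregate()-based stddev with the built-in window stddev.
check = (
    hof_example
    .withColumn("std_builtin", F.stddev(F.col("value")).over(window))
    .select("name", "date", "std_example", "std_builtin")
)
check.show(truncate=False)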
By the way, I don't think the other two approaches (pyspark window vs spark sql) are different. The query plans look identical to me. (I only selected min and max to reduce the size of the query plan)
Pyspark query plan:
status_quo = (
    df
    .withColumn("min", F.min(F.col("value")).over(window))
    .withColumn("max", F.max(F.col("value")).over(window))
)
status_quo.explain()
== Physical Plan ==
*(4) Project [date#3793, name#3794L, value#3795L, min#3800L, max#3807L]
+- Window [max(value#3795L) windowspecdefinition(name#3794L, _w0#3808L ASC NULLS FIRST, specifiedwindowframe(RangeFrame, -604799, currentrow$())) AS max#3807L], [name#3794L], [_w0#3808L ASC NULLS FIRST]
+- *(3) Sort [name#3794L ASC NULLS FIRST, _w0#3808L ASC NULLS FIRST], false, 0
+- *(3) Project [date#3793, name#3794L, value#3795L, min#3800L, cast(date#3793 as bigint) AS _w0#3808L]
+- Window [min(value#3795L) windowspecdefinition(name#3794L, _w0#3801L ASC NULLS FIRST, specifiedwindowframe(RangeFrame, -604799, currentrow$())) AS min#3800L], [name#3794L], [_w0#3801L ASC NULLS FIRST]
+- *(2) Sort [name#3794L ASC NULLS FIRST, _w0#3801L ASC NULLS FIRST], false, 0
+- Exchange hashpartitioning(name#3794L, 200), true, [id=#812]
+- *(1) Project [date#3793, name#3794L, value#3795L, cast(date#3793 as bigint) AS _w0#3801L]
+- *(1) Scan ExistingRDD[date#3793,name#3794L,value#3795L]
Spark SQL query plan:
df.createOrReplaceTempView("df")
sql_example = spark.sql(
"""
SELECT
*
, min(value)
OVER (
PARTITION BY name
ORDER BY CAST(date AS timestamp)
RANGE BETWEEN INTERVAL 7 DAYS PRECEDING AND CURRENT ROW
) AS min
, max(value)
OVER (
PARTITION BY name
ORDER BY CAST(date AS timestamp)
RANGE BETWEEN INTERVAL 7 DAYS PRECEDING AND CURRENT ROW
) AS max
FROM df"""
)
sql_example.explain()
== Physical Plan ==
*(4) Project [date#3793, name#3794L, value#3795L, min#4670L, max#4671L]
+- Window [max(value#3795L) windowspecdefinition(name#3794L, _w1#4675 ASC NULLS FIRST, specifiedwindowframe(RangeFrame, -7 days, currentrow$())) AS max#4671L], [name#3794L], [_w1#4675 ASC NULLS FIRST]
+- *(3) Sort [name#3794L ASC NULLS FIRST, _w1#4675 ASC NULLS FIRST], false, 0
+- *(3) Project [date#3793, name#3794L, value#3795L, _w1#4675, min#4670L]
+- Window [min(value#3795L) windowspecdefinition(name#3794L, _w0#4674 ASC NULLS FIRST, specifiedwindowframe(RangeFrame, -7 days, currentrow$())) AS min#4670L], [name#3794L], [_w0#4674 ASC NULLS FIRST]
+- *(2) Sort [name#3794L ASC NULLS FIRST, _w0#4674 ASC NULLS FIRST], false, 0
+- Exchange hashpartitioning(name#3794L, 200), true, [id=#955]
+- *(1) Project [date#3793, name#3794L, value#3795L, date#3793 AS _w0#4674, date#3793 AS _w1#4675]
+- *(1) Scan ExistingRDD[date#3793,name#3794L,value#3795L]
Aggregate function query plan:
hof_example.explain()
== Physical Plan ==
Project [date#3793, name#3794L, value#3795L, value_array#5516, aggregate(value_array#5516, 0.0, lambdafunction((lambda acc#5523 + cast(lambda x#5524L as double)), lambda acc#5523, lambda x#5524L, false), lambdafunction(lambda id#5525, lambda id#5525, false)) AS sum_example#5522, aggregate(value_array#5516, 0.0, lambdafunction((lambda acc#5532 + cast(lambda x#5533L as double)), lambda acc#5532, lambda x#5533L, false), lambdafunction((lambda acc#5534 / cast(size(value_array#5516, true) as double)), lambda acc#5534, false)) AS mean_example#5531, array_max(value_array#5516) AS max_example#5541L, array_min(value_array#5516) AS min_example#5549L, aggregate(value_array#5516, 0.0, lambdafunction((lambda acc#5559 + ((cast(lambda x#5560L as double) - aggregate(value_array#5516, 0.0, lambdafunction((lambda acc#5532 + cast(lambda x#5533L as double)), lambda acc#5532, lambda x#5533L, false), lambdafunction((lambda acc#5534 / cast(size(value_array#5516, true) as double)), lambda acc#5534, false))) * (cast(lambda x#5560L as double) - aggregate(value_array#5516, 0.0, lambdafunction((lambda acc#5532 + cast(lambda x#5533L as double)), lambda acc#5532, lambda x#5533L, false), lambdafunction((lambda acc#5534 / cast(size(value_array#5516, true) as double)), lambda acc#5534, false))))), lambda acc#5559, lambda x#5560L, false), lambdafunction(SQRT((lambda acc#5561 / cast((size(value_array#5516, true) - 1) as double))), lambda acc#5561, false)) AS std_example#5558]
+- Window [collect_list(value#3795L, 0, 0) windowspecdefinition(name#3794L, _w0#5517L ASC NULLS FIRST, specifiedwindowframe(RangeFrame, -604799, currentrow$())) AS value_array#5516], [name#3794L], [_w0#5517L ASC NULLS FIRST]
+- *(2) Sort [name#3794L ASC NULLS FIRST, _w0#5517L ASC NULLS FIRST], false, 0
+- Exchange hashpartitioning(name#3794L, 200), true, [id=#1136]
+- *(1) Project [date#3793, name#3794L, value#3795L, cast(date#3793 as bigint) AS _w0#5517L]
+- *(1) Scan ExistingRDD[date#3793,name#3794L,value#3795L]
Related
I am using Spark SQL 2.4.
I am using the following SQL, which is throwing an error. The query is big and has several steps, so I have given a concise version below. When I execute it from the spark-shell, it fails with the error shown. The explain plan is rather long, so I have trimmed it to a more manageable extent:
I have checked that the values of the PARTITION BY column encnbr are fairly unique. However, the Stages tab in the Spark UI shows only one very lengthy task, indicating skew. Since the keys are unique, I'm not sure why this is happening. I have tried using CLUSTER BY encnbr, to no avail.
org.apache.spark.sql.catalyst.errors.package$TreeNodeException: execute, tree:
Exchange SinglePartition
+- *(79) LocalLimit 4
+- *(79) Project [enc_key#976, prsn_key#951, prov_key#952, clm_key#977, clm_ln_key#978... 7 more fields]
+- Window [lag(non_keys#2862, 1, null) windowspecdefinition(encnbr#2722, eff_dt#713 ASC NULLS FIRST, data_timestamp#2723 ASC NULLS FIRST, specifiedwindowframe(Rowframe, -1, -1)) AS _we0#2868], [encnbr#2722], [eff_dt#713 ASC NULLS FIRST, data_timestamp#2723 ASC NULLS FIRST]......
The query consists of several steps, each depending on the result of the previous one. The step that is failing is similar to:
select
enc_key,
prsn_key,
prov_key,
clm_key,
clm_ln_key,
birth_dt,
case when lag(non_keys) over (partition by encnbr order by eff_dt asc, data_timestamp asc) is null
then 'Y'
when lag(non_keys) <> non_keys
then 'Y'
else 'N'
end as mod_flg
FROM (
select
enc_key,
encnbr,
prsn_key,
prov_key,
clm_key,
clm_ln_key,
birth_dt,
eff_dt,
data_timestamp,
md5(enc_key || prsn_key || prov_key || clm_key || clm_ln_key) as non_keys
from
table1
where encnbr is not null
union all
select
enc_key,
encnbr,
prsn_key,
prov_key,
clm_key,
clm_ln_key,
birth_dt,
eff_dt,
data_timestamp,
md5(enc_key || prsn_key || prov_key || clm_key || clm_ln_key) as non_keys
from
table2
where encnbr is not null
)
Can you please help me resolve this issue? I have tried using CLUSTER BY encnbr in the previous step, but it still keeps failing.
Thanks.
I would like to know if there is any performance/scalability difference when using intermediate steps/columns in pyspark between:
Using .withColumn() for example:
df = df.withColumn('bar', df.foo + 1)
df = df.withColumn('baz', df.bar + 2)
then calling df.select('baz').collect()
versus
Declaring a Spark column as a Python variable:
bar = df.foo + 1
baz = bar + 2
then calling
df.select(baz.alias('baz')).collect()
Question: If many intermediate steps/columns such as bar are required, would the two options differ in space/time complexity?
I saw that my original post was deleted. In hindsight that may well have been correct, apart from the lack of communication: that example used foldLeft, which is not your use case, which is the fusing of the data pipeline.
To answer your question: because Catalyst fuses data-pipeline operations, there is no performance issue either way, as the physical plans show:
df = spark.createDataFrame([(x,x) for x in range(7)], ['foo', 'bar',])
df = df.withColumn('bar', df.foo + 1)
df = df.withColumn('baz', df.bar + 2)
df.select('baz').explain(extended=True)
== Physical Plan ==
*(1) Project [(foo#276L + 3) AS baz#283L]
+- *(1) Scan ExistingRDD[foo#276L,bar#277L]
and likewise:
df = spark.createDataFrame([(x,x) for x in range(7)], ['foo', 'bar',])
bar = df.foo + 1
baz = bar + 2
df.select(baz.alias('baz')).explain(extended=True)
== Physical Plan ==
*(1) Project [(foo#288L + 3) AS baz#292L]
+- *(1) Scan ExistingRDD[foo#288L,bar#289L]
They look pretty much identical to me. Notice that both are optimized to foo + 3.
In addition, I draw your attention to the cost of using foldLeft with .withColumn: https://manuzhang.github.io/2018/07/11/spark-catalyst-cost.html
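As a related note (my addition, not from the linked post): when many derived columns are needed, building the expressions as Python variables and applying them in a single select keeps the number of Project nodes, and hence Catalyst analysis work, down compared to chaining many withColumn calls. A sketch, reusing the `foo` column from the example above:
import pyspark.sql.functions as F
# Sketch: derive several columns from `foo` in one select instead of chained withColumn calls.
exprs = [(F.col("foo") + i).alias("foo_plus_{}".format(i)) for i in range(1, 6)]
df.select("*", *exprs).explain(extended=True)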
I have a connected graph like this:
user1|A,C,B
user2|A,E,B,A
user3|C,B,A,B,E
user4|A,C,B,E,B
where each line gives a user and the path that user followed. For example:
user1: A->C->B
user2: A->E->B->A
user3: C->B->A->B->E
user4: A->C->B->E->B
Now, I want to find all users who have reached E from A. The output should be
user2, user3, user4 (since all these users eventually reached E after A, no matter how many hops they took). How can I write the motif for this?
This is what I tried.
val vertices=spark.createDataFrame(List(("A","Billing"),("B","Devices"),("C","Payment"),("D","Data"),("E","Help"))).toDF("id","desc")
val edges = spark.createDataFrame(List(("A","C","user1"),
("C","B","user1"),
("A","E","user2"),
("E","B","user2"),
("B","A","user2"),
("C","B","user3"),
("B","A","user3"),
("A","B","user3"),
("B","E","user3"),
("A","C","user4"),
("C","B","user4"),
("B","E","user4"),
("E","B","user4"))).toDF("src","dst","user")
val pathAnalysis=GraphFrame(vertices,edges)
pathAnalysis.find("(a)-[]->();()-[]->();()-[]->(d)").filter("a.id='A'").filter("d.id='E'").distinct().show()
But I am getting an exception like this:
org.apache.spark.sql.AnalysisException: Detected implicit cartesian product for INNER join between logical plans
Join Inner
:- Project [a#355]
: +- Join Inner, (__tmp-4363599943432734077#353.src = a#355.id)
: :- LocalRelation [__tmp-4363599943432734077#353]
: +- Project [named_struct(id, _1#0, desc, _2#1) AS a#355]
: +- Filter (named_struct(id, _1#0, desc, _2#1).id = A)
: +- LocalRelation [_1#0, _2#1]
+- LocalRelation
and
LocalRelation [__tmp-1043886091038848698#371]
Join condition is missing or trivial.
Either: use the CROSS JOIN syntax to allow cartesian products between these
relations, or: enable implicit cartesian products by setting the configuration
variable spark.sql.crossJoin.enabled=true;
I am not sure whether my motif condition is correct, or how to set the property
spark.sql.crossJoin.enabled=true in a spark-shell.
I invoked my spark-shell as follows:
spark-shell --packages graphframes:graphframes:0.3.0-spark2.0-s_2.11
My suggested solution is kind of trivial, but it will work fine if the paths are relatively short and the number of users (i.e., the number of rows in the dataset) is big. If that is not the case, please let me know; other implementations are possible.
case class UserPath(
userId: String,
path: List[String])
val dsUsers = Seq(
  UserPath("user1", List("A", "B", "C")),
  UserPath("user2", List("A", "E", "B", "A")))
  .toDF.as[UserPath]
def pathExists(up: UserPath): Option[String] = {
  val prefix = up.path.takeWhile(s => s != "A")
  val len = up.path.length
  if (up.path.takeRight(len - prefix.length).contains("E"))
    Some(up.userId)
  else
    None
}
// Users with path from A -> E.
dsUsers.map(pathExists).filter(opt => !opt.isEmpty)
You can also use the BFS algorithm for this: http://graphframes.github.io/graphframes/docs/_site/api/scala/index.html#org.graphframes.lib.BFS
With your data model you will have to iterate over the users and run BFS for each of them, like this:
scala> pathAnalysis.bfs.fromExpr($"id" === "A").toExpr($"id" === "E").edgeFilter($"user" === "user3").run().show
+------------+-------------+------------+-------------+---------+
| from| e0| v1| e1| to|
+------------+-------------+------------+-------------+---------+
|[A, Billing]|[A, B, user3]|[B, Devices]|[B, E, user3]|[E, Help]|
+------------+-------------+------------+-------------+---------+
Is it possible to add/replace an existing column expression in the
DataFrame API/SQL using an extension point?
For example, assume we inject a resolution rule that checks the Project
node of the plan and, on finding the column "name", replaces it
with upper(name).
Is such a thing possible using extension points? The examples I have
found are mostly simple and do not manipulate the input expressions in the manner I need.
Kindly let me know if this is possible.
Yes, this is possible.
Let's take an example. Suppose we want to write a rule that checks for the Project operator and, if the projection includes some particular column (say 'column2'), multiplies it by 2.
import org.apache.spark.sql.catalyst.plans.logical._
import org.apache.spark.sql.catalyst.rules.Rule
import org.apache.spark.sql.catalyst.expressions._
import org.apache.spark.sql.catalyst.plans._
import org.apache.spark.sql.Column
import org.apache.spark.sql.types._
object DoubleColumn2OptimizationRule extends Rule[LogicalPlan] {
  def apply(plan: LogicalPlan): LogicalPlan = plan transform {
    case p: Project =>
      if (p.projectList.filter(_.name == "column2").size >= 1) {
        val newList = p.projectList.map { case x =>
          if (x.name == "column2") {
            Alias(Multiply(Literal(2, IntegerType), x), "column2_doubled")()
          } else {
            x
          }
        }
        p.copy(projectList = newList)
      } else {
        p
      }
  }
}
Say we have a table "table1" which has two columns - column1 and column2.
Without this rule -
> spark.sql("select column2 from table1 limit 10").collect()
Array([1], [2], [3], [4], [5], [6], [7], [8], [9], [10])
with this rule -
> spark.experimental.extraOptimizations = Seq(DoubleColumn2OptimizationRule)
> spark.sql("select column2 from table1 limit 10").collect()
Array([2], [4], [6], [8], [10], [12], [14], [16], [18], [20])
You can also call explain on the DataFrame to check the plan -
> spark.sql("select column2 from table1 limit 10").explain
== Physical Plan ==
CollectLimit 10
+- *(1) LocalLimit 10
+- *(1) Project [(2 * column2#213) AS column2_doubled#214]
+- HiveTableScan [column2#213], HiveTableRelation `default`.`table1`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [column1#212, column2#213]
I have data like this
Machine , date , hours
123,2014-06-15,15.4
123,2014-06-16,20.3
123,2014-06-18,11.4
131,2014-06-15,12.2
131,2014-06-16,11.5
131,2014-06-17,18.2
131,2014-06-18,19.2
134,2014-06-15,11.1
134,2014-06-16,16.2
I want to partition by the key Machine and find the lag of hours by 1, with a default value of 0:
Machine , date , hours lag
123,2014-06-15,15.4,0
123,2014-06-16,20.3,15.4
123,2014-06-18,11.4,20.3
131,2014-06-15,12.2,0
131,2014-06-16,11.5,12.2
131,2014-06-17,18.2,11.5
131,2014-06-18,19.2,18.2
134,2014-06-15,11.1,0
134,2014-06-16,16.2,11.1
I am using a pair RDD and the groupByKey method, but it doesn't yield the expected order.
That's because there is really no guaranteed order here. With some exceptions, RDDs should be considered unordered whenever any of the transformations you use requires shuffling.
If you need a specific order you have to sort your data manually:
case class Record(machine: Long, date: java.sql.Date, hours: Double)
case class RecordWithLag(
machine: Long, date: java.sql.Date, hours: Double, lag: Double
)
def getLag(xs: Seq[Record]): Seq[RecordWithLag] = ???
val rdd = sc.parallelize(List(
Record(123, java.sql.Date.valueOf("2014-06-15"), 15.4),
Record(123, java.sql.Date.valueOf("2014-06-16"), 20.3),
Record(123, java.sql.Date.valueOf("2014-06-18"), 11.4),
Record(131, java.sql.Date.valueOf("2014-06-15"), 12.2),
Record(131, java.sql.Date.valueOf("2014-06-16"), 11.5),
Record(131, java.sql.Date.valueOf("2014-06-17"), 18.2),
Record(131, java.sql.Date.valueOf("2014-06-18"), 19.2),
Record(134, java.sql.Date.valueOf("2014-06-15"), 11.1),
Record(134, java.sql.Date.valueOf("2014-06-16"), 16.2)
))
rdd
.groupBy(_.machine)
.mapValues(_.toSeq.sortWith((x, y) => x.date.compareTo(y.date) < 0))
.mapValues(getLag)
For performance you should consider upgrading your Spark distribution to >= 1.4.0 and using a DataFrame with window functions:
val df = sqlContext.createDataFrame(rdd)
df.registerTempTable("df")
sqlContext.sql(
  """SELECT *, lag(hours, 1, 0) OVER (
       PARTITION BY machine ORDER BY date
     ) lag FROM df"""
)
+-------+----------+-----+----+
|machine| date|hours| lag|
+-------+----------+-----+----+
| 123|2014-06-15| 15.4| 0.0|
| 123|2014-06-16| 20.3|15.4|
| 123|2014-06-18| 11.4|20.3|
| 131|2014-06-15| 12.2| 0.0|
| 131|2014-06-16| 11.5|12.2|
| 131|2014-06-17| 18.2|11.5|
| 131|2014-06-18| 19.2|18.2|
| 134|2014-06-15| 11.1| 0.0|
| 134|2014-06-16| 16.2|11.1|
+-------+----------+-----+----+
or
// requires: import org.apache.spark.sql.functions.lag
//           import org.apache.spark.sql.expressions.Window
df.select(
  $"*",
  lag($"hours", 1, 0).over(
    Window.partitionBy($"machine").orderBy($"date")
  ).alias("lag")
)