Modifying query using Spark Catalyst Logical Plan - apache-spark

Is it possible to add or replace an existing column expression in the
DataFrame API/SQL using an extension point?
For example, assume we inject a resolution rule that inspects the Project
node of the plan and, on finding a column "name", replaces it
with upper(name).
Is such a thing possible using extension points? The examples I have
found are mostly simple ones that do not manipulate the input expressions in the manner I need.
Kindly let me know if this is possible.

Yes, this is possible.
Let's take an example. Suppose we want to write a rule that checks for the Project operator and, if the projection includes some particular column (say column2), multiplies it by 2.
import org.apache.spark.sql.catalyst.plans.logical._
import org.apache.spark.sql.catalyst.rules.Rule
import org.apache.spark.sql.catalyst.expressions._
import org.apache.spark.sql.catalyst.plans._
import org.apache.spark.sql.Column
import org.apache.spark.sql.types._
object DoubleColumn2OptimizationRule extends Rule[LogicalPlan] {
  def apply(plan: LogicalPlan): LogicalPlan = plan transform {
    case p: Project =>
      if (p.projectList.exists(_.name == "column2")) {
        // replace column2 with 2 * column2, aliased as column2_doubled
        val newList = p.projectList.map { x =>
          if (x.name == "column2") {
            Alias(Multiply(Literal(2, IntegerType), x), "column2_doubled")()
          } else {
            x
          }
        }
        p.copy(projectList = newList)
      } else {
        p
      }
  }
}
Say we have a table "table1" with two columns: column1 and column2.
Without this rule:
> spark.sql("select column2 from table1 limit 10").collect()
Array([1], [2], [3], [4], [5], [6], [7], [8], [9], [10])
With this rule:
> spark.experimental.extraOptimizations = Seq(DoubleColumn2OptimizationRule)
> spark.sql("select column2 from table1 limit 10").collect()
Array([2], [4], [6], [8], [10], [12], [14], [16], [18], [20])
You can also call explain on the DataFrame to check the plan:
> spark.sql("select column2 from table1 limit 10").explain
== Physical Plan ==
CollectLimit 10
+- *(1) LocalLimit 10
+- *(1) Project [(2 * column2#213) AS column2_doubled#214]
+- HiveTableScan [column2#213], HiveTableRelation `default`.`table1`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [column1#212, column2#213]
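The demo above registers the rule through spark.experimental.extraOptimizations. Since the question asks specifically about extension points, here is a minimal sketch (assuming Spark 2.2+, where SparkSessionExtensions is available) of wiring the same rule in through withExtensions; the builder configuration is illustrative only:
import org.apache.spark.sql.SparkSession

val sparkWithRule = SparkSession.builder()
  .withExtensions { extensions =>
    // inject the rule into the optimizer; for analyzer-time rewrites,
    // extensions.injectResolutionRule can be used instead
    extensions.injectOptimizerRule { session => DoubleColumn2OptimizationRule }
  }
  .getOrCreate()
The same registration can also be packaged as a class extending Function1[SparkSessionExtensions, Unit] and supplied through the spark.sql.extensions configuration.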

Related

pyspark.sql.utils.AnalysisException: Column ambiguous but no duplicate column names

I'm getting an ambiguous column exception when joining on the id column of a dataframe, but there are no duplicate columns in the dataframe. What could be causing this error to be thrown?
Join operation, where a and input have been processed by other functions:
b = (
    input
    .where(F.col('st').like('%VALUE%'))
    .select('id', 'sii')
)
a.join(b, b['id'] == a['item'])
Dataframes:
(Pdb) a.explain()
== Physical Plan ==
*(1) Scan ExistingRDD[item#25280L,sii#24665L]
(Pdb) b.explain()
== Physical Plan ==
*(1) Project [id#23711L, sii#24665L]
+- *(1) Filter (isnotnull(st#25022) AND st#25022 LIKE %VALUE%)
+- *(1) Scan ExistingRDD[id#23711L,st#25022,sii#24665L]
Exception:
pyspark.sql.utils.AnalysisException: Column id#23711L are ambiguous. It's probably because you joined several Datasets together, and some of these Datasets are the same. This column points to one of the Datasets but Spark is unable to figure out which one. Please alias the Datasets with different names via Dataset.as before joining them, and specify the column using qualified name, e.g. df.as("a").join(df.as("b"), $"a.id" > $"b.id"). You can also set spark.sql.analyzer.failAmbiguousSelfJoin to false to disable this check.;
If I recreate the dataframe using the same schema, I do not get any errors:
b_clean = spark_session.createDataFrame([], b.schema)
a.join(b_clean, b_clean['id'] == a['item'])
What can I look at to troubleshoot what happened with the original dataframes that would cause the ambiguous column error?
This error, and the fact that your sii column has the same id in both tables (i.e. sii#24665L), tell us that both the a and b dataframes are built from the same source. So, in essence, this makes your join a self join (exactly what the error message says). In such cases it's recommended to alias the dataframes. Try this:
a.alias('a').join(b.alias('b'), F.col('b.id') == F.col('a.item'))
Again, in some systems you may not be able to save your result, as the resulting dataframe will have two sii columns. I would recommend explicitly selecting only the columns you need. Renaming columns using alias may also help if you decide you need both of the duplicate columns. E.g.:
df = (
    a.alias('a').join(b.alias('b'), F.col('b.id') == F.col('a.item'))
    .select(
        'item',
        'id',
        F.col('a.sii').alias('a_sii')
    )
)

pyspark high performance rolling/window aggregations on timeseries data

Basic Question
I have a dataset with ~10 billion rows. I'm looking for the most performant way to calculate rolling/windowed aggregates/metrics (sum, mean, min, max, stddev) over four different time windows (3 days, 7 days, 14 days, 21 days).
Spark/AWS EMR Specs
spark version: 2.4.4
ec2 instance type: r5.24xlarge
num core ec2 instances: 10
num pyspark partitions: 600
Overview
I read a bunch of SO posts that addressed either the mechanics of calculating rolling statistics or how to make Window functions faster, but none of them combined these two concepts in a way that solves my problem. I've shown below a few options that do what I want, but I need them to run faster on my real dataset, so I'm looking for suggestions that are faster/better.
My dataset is structured as follows but with ~10 billion rows:
+--------------------------+----+-----+
|date |name|value|
+--------------------------+----+-----+
|2020-12-20 17:45:19.536796|1 |5 |
|2020-12-21 17:45:19.53683 |1 |105 |
|2020-12-22 17:45:19.536846|1 |205 |
|2020-12-23 17:45:19.536861|1 |305 |
|2020-12-24 17:45:19.536875|1 |405 |
|2020-12-25 17:45:19.536891|1 |505 |
|2020-12-26 17:45:19.536906|1 |605 |
|2020-12-20 17:45:19.536796|2 |10 |
|2020-12-21 17:45:19.53683 |2 |110 |
|2020-12-22 17:45:19.536846|2 |210 |
|2020-12-23 17:45:19.536861|2 |310 |
|2020-12-24 17:45:19.536875|2 |410 |
|2020-12-25 17:45:19.536891|2 |510 |
|2020-12-26 17:45:19.536906|2 |610 |
|2020-12-20 17:45:19.536796|3 |15 |
|2020-12-21 17:45:19.53683 |3 |115 |
|2020-12-22 17:45:19.536846|3 |215 |
I need my dataset to look like the one below. Note: statistics for a 7-day window are shown, but I need the three other windows as well.
+--------------------------+----+-----+----+-----+---+---+------------------+
|date |name|value|sum |mean |min|max|stddev |
+--------------------------+----+-----+----+-----+---+---+------------------+
|2020-12-20 17:45:19.536796|1 |5 |5 |5.0 |5 |5 |NaN |
|2020-12-21 17:45:19.53683 |1 |105 |110 |55.0 |5 |105|70.71067811865476 |
|2020-12-22 17:45:19.536846|1 |205 |315 |105.0|5 |205|100.0 |
|2020-12-23 17:45:19.536861|1 |305 |620 |155.0|5 |305|129.09944487358058|
|2020-12-24 17:45:19.536875|1 |405 |1025|205.0|5 |405|158.11388300841898|
|2020-12-25 17:45:19.536891|1 |505 |1530|255.0|5 |505|187.08286933869707|
|2020-12-26 17:45:19.536906|1 |605 |2135|305.0|5 |605|216.02468994692867|
|2020-12-20 17:45:19.536796|2 |10 |10 |10.0 |10 |10 |NaN |
|2020-12-21 17:45:19.53683 |2 |110 |120 |60.0 |10 |110|70.71067811865476 |
|2020-12-22 17:45:19.536846|2 |210 |330 |110.0|10 |210|100.0 |
|2020-12-23 17:45:19.536861|2 |310 |640 |160.0|10 |310|129.09944487358058|
|2020-12-24 17:45:19.536875|2 |410 |1050|210.0|10 |410|158.11388300841898|
|2020-12-25 17:45:19.536891|2 |510 |1560|260.0|10 |510|187.08286933869707|
|2020-12-26 17:45:19.536906|2 |610 |2170|310.0|10 |610|216.02468994692867|
|2020-12-20 17:45:19.536796|3 |15 |15 |15.0 |15 |15 |NaN |
|2020-12-21 17:45:19.53683 |3 |115 |130 |65.0 |15 |115|70.71067811865476 |
|2020-12-22 17:45:19.536846|3 |215 |345 |115.0|15 |215|100.0 |
Details
For ease of reading, I'll just do one window in these examples. Things I have tried:
Basic Window().over() syntax
Converting windowed values into an array column and using higher order functions
Spark SQL
Setup
import datetime
from pyspark.sql import SparkSession
import pyspark.sql.functions as F
from pyspark.sql.window import Window
from pyspark.sql.types import FloatType
import pandas as pd
import numpy as np
spark = SparkSession.builder.appName('example').getOrCreate()
# create spark dataframe
n = 7
names = [1, 2, 3]
date_list = [datetime.datetime.today() - datetime.timedelta(days=(n-x)) for x in range(n)]
values = [x*100 for x in range(n)]
rows = []
for name in names:
    for d, v in zip(date_list, values):
        rows.append(
            {
                "name": name,
                "date": d,
                "value": v + (5 * name)
            }
        )
df = spark.createDataFrame(data=rows)
# setup window
window_days = 7
window = (
    Window
    .partitionBy(F.col("name"))
    .orderBy(F.col("date").cast("timestamp").cast("long"))
    .rangeBetween(-window_days * 60 * 60 * 24 + 1, Window.currentRow)
)
1. Basic
This creates multiple window specs, as shown here, and is therefore executed serially, which runs very slowly on a large dataset.
status_quo = (df
    .withColumn("sum", F.sum(F.col("value")).over(window))
    .withColumn("mean", F.avg(F.col("value")).over(window))
    .withColumn("min", F.min(F.col("value")).over(window))
    .withColumn("max", F.max(F.col("value")).over(window))
    .withColumn("stddev", F.stddev(F.col("value")).over(window))
)
status_quo.show()
status_quo.explain()
2. Array Column --> Higher Order Functions
Per this answer, this seems to create fewer window specs, but the aggregate() function syntax makes no sense to me, I don't know how to write stddev using higher-order functions, and the performance doesn't seem much better in small tests.
@F.udf(returnType=FloatType())
def array_stddev(row_value):
    """
    Temporary function since I don't know how to write a higher-order standard deviation.
    """
    return np.std(row_value, dtype=float).tolist()

# 1. collect window into array column
# 2. use higher order (array) functions to calculate aggregations over array (window values)
# Question: how to write standard deviation in aggregate()
hof_example = (
    df
    .withColumn("value_array", F.collect_list(F.col("value")).over(window))
    .withColumn("sum_example", F.expr('AGGREGATE(value_array, DOUBLE(0), (acc, x) -> acc + x)'))
    .withColumn("mean_example", F.expr('AGGREGATE(value_array, DOUBLE(0), (acc, x) -> acc + x, acc -> acc / size(value_array))'))
    .withColumn("max_example", F.array_max(F.col("value_array")))
    .withColumn("min_example", F.array_min(F.col("value_array")))
    .withColumn("std_example", array_stddev(F.col("value_array")))
)
3. Spark SQL
This appears to be the fastest in simple tests, though I have not tested it on the full dataset. The only (minor) issue is that the rest of my codebase uses the DataFrame API.
df.createOrReplaceTempView("df")
sql_example = spark.sql(
"""
SELECT
*
, sum(value)
OVER (
PARTITION BY name
ORDER BY CAST(date AS timestamp)
RANGE BETWEEN INTERVAL 7 DAYS PRECEDING AND CURRENT ROW
) AS sum
, mean(value)
OVER (
PARTITION BY name
ORDER BY CAST(date AS timestamp)
RANGE BETWEEN INTERVAL 7 DAYS PRECEDING AND CURRENT ROW
) AS mean
, min(value)
OVER (
PARTITION BY name
ORDER BY CAST(date AS timestamp)
RANGE BETWEEN INTERVAL 7 DAYS PRECEDING AND CURRENT ROW
) AS min
, max(value)
OVER (
PARTITION BY name
ORDER BY CAST(date AS timestamp)
RANGE BETWEEN INTERVAL 7 DAYS PRECEDING AND CURRENT ROW
) AS max
, stddev(value)
OVER (
PARTITION BY name
ORDER BY CAST(date AS timestamp)
RANGE BETWEEN INTERVAL 7 DAYS PRECEDING AND CURRENT ROW
) AS stddev
FROM df"""
)
NOTE: I'm going to mark this as the accepted answer for the time being. If someone finds a faster/better approach, please notify me and I'll switch it!
EDIT / Clarification: the calculations shown here assume input dataframes that have been pre-processed to the day level, with day-level rolling calculations.
After I posted the question I tested several different options on my real dataset (and got some input from coworkers) and I believe the fastest way to do this (for large datasets) uses pyspark.sql.functions.window() with groupby().agg instead of pyspark.sql.window.Window().
A similar answer can be found here
The steps to make this work are:
sort dataframe by name and date (in example dataframe)
.persist() dataframe
Compute grouped dataframe using F.window() and join back to df for every window required.
The best/easiest way to see this in action is the SQL diagram in the Spark UI. When Window() is used, the SQL execution is completely sequential. However, when F.window() is used, the diagram shows parallelization! NOTE: on small datasets Window() still seems faster.
In my tests with real data on 7-day windows, Window() was 3-5x slower than F.window(). The only downside is that F.window() is a bit less convenient to use. I've shown some code and screenshots below for reference.
Fastest Solution Found (F.window() with groupby.agg())
# this turned out to be super important for tricking spark into parallelizing things
df = df.orderBy("name", "date")
df.persist()
fwindow7 = F.window(
    F.col("date"),
    windowDuration="7 days",
    slideDuration="1 days",
).alias("window")

gb7 = (
    df
    .groupBy(F.col("name"), fwindow7)
    .agg(
        F.sum(F.col("value")).alias("sum7"),
        F.avg(F.col("value")).alias("mean7"),
        F.min(F.col("value")).alias("min7"),
        F.max(F.col("value")).alias("max7"),
        F.stddev(F.col("value")).alias("stddev7"),
        F.count(F.col("value")).alias("cnt7")
    )
    .withColumn("date", F.date_sub(F.col("window.end").cast("date"), 1))
    .drop("window")
)
window_function_example = df.join(gb7, ["name", "date"], how="left")

fwindow14 = F.window(
    F.col("date"),
    windowDuration="14 days",
    slideDuration="1 days",
).alias("window")

gb14 = (
    df
    .groupBy(F.col("name"), fwindow14)
    .agg(
        F.sum(F.col("value")).alias("sum14"),
        F.avg(F.col("value")).alias("mean14"),
        F.min(F.col("value")).alias("min14"),
        F.max(F.col("value")).alias("max14"),
        F.stddev(F.col("value")).alias("stddev14"),
        F.count(F.col("value")).alias("cnt14")
    )
    .withColumn("date", F.date_sub(F.col("window.end").cast("date"), 1))
    .drop("window")
)
window_function_example = window_function_example.join(gb14, ["name", "date"], how="left")
window_function_example.orderBy("name", "date").show(truncate=True)
SQL Diagram
Option 2 from Original Question (Higher Order Functions applied to Window())
window7 = (
    Window
    .partitionBy(F.col("name"))
    .orderBy(F.col("date").cast("timestamp").cast("long"))
    .rangeBetween(-7 * 60 * 60 * 24 + 1, Window.currentRow)
)
window14 = (
    Window
    .partitionBy(F.col("name"))
    .orderBy(F.col("date").cast("timestamp").cast("long"))
    .rangeBetween(-14 * 60 * 60 * 24 + 1, Window.currentRow)
)
hof_example = (
    df
    .withColumn("value_array", F.collect_list(F.col("value")).over(window7))
    .withColumn("sum7", F.expr('AGGREGATE(value_array, DOUBLE(0), (acc, x) -> acc + x)'))
    .withColumn("mean7", F.expr('AGGREGATE(value_array, DOUBLE(0), (acc, x) -> acc + x, acc -> acc / size(value_array))'))
    .withColumn("max7", F.array_max(F.col("value_array")))
    .withColumn("min7", F.array_min(F.col("value_array")))
    .withColumn("std7", F.expr('AGGREGATE(value_array, DOUBLE(0), (acc, x) -> acc + (x - mean7)*(x - mean7), acc -> sqrt(acc / (size(value_array) - 1)))'))
    .withColumn("count7", F.size(F.col("value_array")))
    .drop("value_array")
)
hof_example = (
    hof_example
    .withColumn("value_array", F.collect_list(F.col("value")).over(window14))
    .withColumn("sum14", F.expr('AGGREGATE(value_array, DOUBLE(0), (acc, x) -> acc + x)'))
    .withColumn("mean14", F.expr('AGGREGATE(value_array, DOUBLE(0), (acc, x) -> acc + x, acc -> acc / size(value_array))'))
    .withColumn("max14", F.array_max(F.col("value_array")))
    .withColumn("min14", F.array_min(F.col("value_array")))
    .withColumn("std14", F.expr('AGGREGATE(value_array, DOUBLE(0), (acc, x) -> acc + (x - mean14)*(x - mean14), acc -> sqrt(acc / (size(value_array) - 1)))'))
    .withColumn("count14", F.size(F.col("value_array")))
    .drop("value_array")
)
hof_example.show(truncate=True)
SQL Diagram Snippet
Try this aggregate for stddev. If you want to understand the syntax, you can check the docs.
hof_example = (
    df
    .withColumn("value_array", F.collect_list(F.col("value")).over(window))
    .withColumn("sum_example", F.expr('AGGREGATE(value_array, DOUBLE(0), (acc, x) -> acc + x)'))
    .withColumn("mean_example", F.expr('AGGREGATE(value_array, DOUBLE(0), (acc, x) -> acc + x, acc -> acc / size(value_array))'))
    .withColumn("max_example", F.array_max(F.col("value_array")))
    .withColumn("min_example", F.array_min(F.col("value_array")))
    .withColumn("std_example", F.expr('AGGREGATE(value_array, DOUBLE(0), (acc, x) -> acc + (x - mean_example)*(x - mean_example), acc -> sqrt(acc / (size(value_array) - 1)))'))
)
By the way, I don't think the other two approaches (pyspark window vs spark sql) are different. The query plans look identical to me. (I only selected min and max to reduce the size of the query plan)
Pyspark query plan:
status_quo = (df
    .withColumn("min", F.min(F.col("value")).over(window))
    .withColumn("max", F.max(F.col("value")).over(window))
)
status_quo.explain()
== Physical Plan ==
*(4) Project [date#3793, name#3794L, value#3795L, min#3800L, max#3807L]
+- Window [max(value#3795L) windowspecdefinition(name#3794L, _w0#3808L ASC NULLS FIRST, specifiedwindowframe(RangeFrame, -604799, currentrow$())) AS max#3807L], [name#3794L], [_w0#3808L ASC NULLS FIRST]
+- *(3) Sort [name#3794L ASC NULLS FIRST, _w0#3808L ASC NULLS FIRST], false, 0
+- *(3) Project [date#3793, name#3794L, value#3795L, min#3800L, cast(date#3793 as bigint) AS _w0#3808L]
+- Window [min(value#3795L) windowspecdefinition(name#3794L, _w0#3801L ASC NULLS FIRST, specifiedwindowframe(RangeFrame, -604799, currentrow$())) AS min#3800L], [name#3794L], [_w0#3801L ASC NULLS FIRST]
+- *(2) Sort [name#3794L ASC NULLS FIRST, _w0#3801L ASC NULLS FIRST], false, 0
+- Exchange hashpartitioning(name#3794L, 200), true, [id=#812]
+- *(1) Project [date#3793, name#3794L, value#3795L, cast(date#3793 as bigint) AS _w0#3801L]
+- *(1) Scan ExistingRDD[date#3793,name#3794L,value#3795L]
Spark SQL query plan:
df.createOrReplaceTempView("df")
sql_example = spark.sql(
"""
SELECT
*
, min(value)
OVER (
PARTITION BY name
ORDER BY CAST(date AS timestamp)
RANGE BETWEEN INTERVAL 7 DAYS PRECEDING AND CURRENT ROW
) AS min
, max(value)
OVER (
PARTITION BY name
ORDER BY CAST(date AS timestamp)
RANGE BETWEEN INTERVAL 7 DAYS PRECEDING AND CURRENT ROW
) AS max
FROM df"""
)
sql_example.explain()
== Physical Plan ==
*(4) Project [date#3793, name#3794L, value#3795L, min#4670L, max#4671L]
+- Window [max(value#3795L) windowspecdefinition(name#3794L, _w1#4675 ASC NULLS FIRST, specifiedwindowframe(RangeFrame, -7 days, currentrow$())) AS max#4671L], [name#3794L], [_w1#4675 ASC NULLS FIRST]
+- *(3) Sort [name#3794L ASC NULLS FIRST, _w1#4675 ASC NULLS FIRST], false, 0
+- *(3) Project [date#3793, name#3794L, value#3795L, _w1#4675, min#4670L]
+- Window [min(value#3795L) windowspecdefinition(name#3794L, _w0#4674 ASC NULLS FIRST, specifiedwindowframe(RangeFrame, -7 days, currentrow$())) AS min#4670L], [name#3794L], [_w0#4674 ASC NULLS FIRST]
+- *(2) Sort [name#3794L ASC NULLS FIRST, _w0#4674 ASC NULLS FIRST], false, 0
+- Exchange hashpartitioning(name#3794L, 200), true, [id=#955]
+- *(1) Project [date#3793, name#3794L, value#3795L, date#3793 AS _w0#4674, date#3793 AS _w1#4675]
+- *(1) Scan ExistingRDD[date#3793,name#3794L,value#3795L]
Aggregate function query plan:
hof_example.explain()
== Physical Plan ==
Project [date#3793, name#3794L, value#3795L, value_array#5516, aggregate(value_array#5516, 0.0, lambdafunction((lambda acc#5523 + cast(lambda x#5524L as double)), lambda acc#5523, lambda x#5524L, false), lambdafunction(lambda id#5525, lambda id#5525, false)) AS sum_example#5522, aggregate(value_array#5516, 0.0, lambdafunction((lambda acc#5532 + cast(lambda x#5533L as double)), lambda acc#5532, lambda x#5533L, false), lambdafunction((lambda acc#5534 / cast(size(value_array#5516, true) as double)), lambda acc#5534, false)) AS mean_example#5531, array_max(value_array#5516) AS max_example#5541L, array_min(value_array#5516) AS min_example#5549L, aggregate(value_array#5516, 0.0, lambdafunction((lambda acc#5559 + ((cast(lambda x#5560L as double) - aggregate(value_array#5516, 0.0, lambdafunction((lambda acc#5532 + cast(lambda x#5533L as double)), lambda acc#5532, lambda x#5533L, false), lambdafunction((lambda acc#5534 / cast(size(value_array#5516, true) as double)), lambda acc#5534, false))) * (cast(lambda x#5560L as double) - aggregate(value_array#5516, 0.0, lambdafunction((lambda acc#5532 + cast(lambda x#5533L as double)), lambda acc#5532, lambda x#5533L, false), lambdafunction((lambda acc#5534 / cast(size(value_array#5516, true) as double)), lambda acc#5534, false))))), lambda acc#5559, lambda x#5560L, false), lambdafunction(SQRT((lambda acc#5561 / cast((size(value_array#5516, true) - 1) as double))), lambda acc#5561, false)) AS std_example#5558]
+- Window [collect_list(value#3795L, 0, 0) windowspecdefinition(name#3794L, _w0#5517L ASC NULLS FIRST, specifiedwindowframe(RangeFrame, -604799, currentrow$())) AS value_array#5516], [name#3794L], [_w0#5517L ASC NULLS FIRST]
+- *(2) Sort [name#3794L ASC NULLS FIRST, _w0#5517L ASC NULLS FIRST], false, 0
+- Exchange hashpartitioning(name#3794L, 200), true, [id=#1136]
+- *(1) Project [date#3793, name#3794L, value#3795L, cast(date#3793 as bigint) AS _w0#5517L]
+- *(1) Scan ExistingRDD[date#3793,name#3794L,value#3795L]

Spark SQL .withColumn() vs Column expressions

I would like to know if there is any performance/scalability difference when using intermediate steps/columns in pyspark between:
Using .withColumn() for example:
df = df.withColumn('bar', df.foo + 1)
df = df.withColumn('baz', df.bar + 2)
then calling df.select('baz').collect()
versus
Declaring a Spark column as a Python variable:
bar = df.foo + 1
baz = bar + 2
then calling
df.select(baz.alias('baz')).collect()
Question: If many intermediate steps/columns such as bar are required, would the two options differ in space/time complexity?
I saw that my original post was deleted; in hindsight that may well have been correct, barring the lack of communication. That example used foldLeft, which is not your use case; your use case is about fusing of the data pipeline.
To answer your question: the fusing of data pipeline operations by Catalyst means there is no performance issue either way, as the physical plans show:
df = spark.createDataFrame([(x,x) for x in range(7)], ['foo', 'bar',])
df = df.withColumn('bar', df.foo + 1)
df = df.withColumn('baz', df.bar + 2)
df.select('baz').explain(extended=True)
== Physical Plan ==
*(1) Project [(foo#276L + 3) AS baz#283L]
+- *(1) Scan ExistingRDD[foo#276L,bar#277L]
and likewise:
df = spark.createDataFrame([(x,x) for x in range(7)], ['foo', 'bar',])
bar = df.foo + 1
baz = bar + 2
df.select(baz.alias('baz')).explain(extended=True)
== Physical Plan ==
*(1) Project [(foo#288L + 3) AS baz#292L]
+- *(1) Scan ExistingRDD[foo#288L,bar#289L]
They look the same to me; notice that both are optimized to a single (foo + 3) projection.
In addition, I draw your attention to this discussion of using foldLeft with .withColumn: https://manuzhang.github.io/2018/07/11/spark-catalyst-cost.html
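For reference, a minimal sketch of that foldLeft pattern (a hedged example in Scala, assuming a DataFrame df with a numeric column foo analogous to the pyspark one above); Catalyst still collapses the chained projections into a single Project:
val chained = (1 to 10).foldLeft(df) { (acc, i) =>
  val prev = if (i == 1) "foo" else s"step${i - 1}"
  // each iteration adds one more intermediate column on top of the previous one
  acc.withColumn(s"step$i", acc(prev) + 1)
}
chained.select("step10").explain(true)
// expected: a single Project computing (foo + 10) AS step10; the linked article
// discusses the analysis-time cost of very long withColumn chains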

Finding path between 2 vertices which are not directly connected

I have a connected graph like this:
user1|A,C,B
user2|A,E,B,A
user3|C,B,A,B,E
user4|A,C,B,E,B
where user is the property name and the rest is the path that particular user followed. For example:
user1: A->C->B
user2: A->E->B->A
user3: C->B->A->B->E
user4: A->C->B->E->B
Now, I want to find all users who went from A to E. The output should be
user2, user3, user4 (since all these users eventually reached E from A, no matter how many hops they took). How can I write the motif for this?
This is what I tried:
val vertices=spark.createDataFrame(List(("A","Billing"),("B","Devices"),("C","Payment"),("D","Data"),("E","Help"))).toDF("id","desc")
val edges = spark.createDataFrame(List(
  ("A","C","user1"),
  ("C","B","user1"),
  ("A","E","user2"),
  ("E","B","user2"),
  ("B","A","user2"),
  ("C","B","user3"),
  ("B","A","user3"),
  ("A","B","user3"),
  ("B","E","user3"),
  ("A","C","user4"),
  ("C","B","user4"),
  ("B","E","user4"),
  ("E","B","user4"))).toDF("src","dst","user")
val pathAnalysis=GraphFrame(vertices,edges)
pathAnalysis.find("(a)-[]->();()-[]->();()-[]->(d)").filter("a.id='A'").filter("d.id='E'").distinct().show()
But I am getting an exception like this:
org.apache.spark.sql.AnalysisException: Detected implicit cartesian product for INNER join between logical plans
Join Inner
:- Project [a#355]
: +- Join Inner, (__tmp-4363599943432734077#353.src = a#355.id)
: :- LocalRelation [__tmp-4363599943432734077#353]
: +- Project [named_struct(id, _1#0, desc, _2#1) AS a#355]
: +- Filter (named_struct(id, _1#0, desc, _2#1).id = A)
: +- LocalRelation [_1#0, _2#1]
+- LocalRelation
and
LocalRelation [__tmp-1043886091038848698#371]
Join condition is missing or trivial.
Either: use the CROSS JOIN syntax to allow cartesian products between these
relations, or: enable implicit cartesian products by setting the configuration
variable spark.sql.crossJoin.enabled=true;
I am not sure whether my condition is correct, or how to set the property
spark.sql.crossJoin.enabled=true in spark-shell.
I invoked my spark-shell as follows
spark-shell --packages graphframes:graphframes:0.3.0-spark2.0-s_2.11
My suggested solution is fairly trivial, but it will work fine if the paths are relatively short and the number of users (i.e., the number of rows in the data set) is large. If this is not the case, please let me know; other implementations are possible.
case class UserPath(
  userId: String,
  path: List[String])

val dsUsers = Seq(
    UserPath("user1", List("A", "B", "C")),
    UserPath("user2", List("A", "E", "B", "A")))
  .toDF.as[UserPath]

def pathExists(up: UserPath): Option[String] = {
  // drop everything before the first "A"; if "E" occurs afterwards, the user went A -> E
  val prefix = up.path.takeWhile(s => s != "A")
  val len = up.path.length
  if (up.path.takeRight(len - prefix.length).contains("E"))
    Some(up.userId)
  else
    None
}

// Users with path from A -> E.
dsUsers.map(pathExists).filter(opt => !opt.isEmpty)
You can also use the BFS algorithm for this: http://graphframes.github.io/graphframes/docs/_site/api/scala/index.html#org.graphframes.lib.BFS
With your data model you will have to iterate over the users and run the BFS for each of them, like this:
scala> pathAnalysis.bfs.fromExpr($"id" === "A").toExpr($"id" === "E").edgeFilter($"user" === "user3").run().show
+------------+-------------+------------+-------------+---------+
| from| e0| v1| e1| to|
+------------+-------------+------------+-------------+---------+
|[A, Billing]|[A, B, user3]|[B, Devices]|[B, E, user3]|[E, Help]|
+------------+-------------+------------+-------------+---------+
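A minimal sketch of that per-user iteration (assuming the edges DataFrame and the pathAnalysis GraphFrame from the question, with spark.implicits._ imported; the variable names are illustrative):
import org.apache.spark.sql.functions.col

// collect the distinct user ids, then run one BFS per user and keep the users
// for which a path from A to E exists within that user's edges
val userIds = edges.select("user").distinct().as[String].collect()
val usersReachingE = userIds.filter { u =>
  pathAnalysis.bfs
    .fromExpr(col("id") === "A")
    .toExpr(col("id") === "E")
    .edgeFilter(col("user") === u)
    .run()
    .count() > 0
}
As for the spark.sql.crossJoin.enabled=true property mentioned in the question, it can be passed at launch time (spark-shell --conf spark.sql.crossJoin.enabled=true, alongside the --packages option) or set from inside the session with spark.conf.set("spark.sql.crossJoin.enabled", "true").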

How to check if spark dataframe is empty in pyspark [duplicate]

Right now, I have to use df.count > 0 to check if the DataFrame is empty or not. But it is kind of inefficient. Is there any better way to do that?
PS: I want to check if it's empty so that I only save the DataFrame if it's not empty
For Spark 2.1.0, my suggestion would be to use head(n: Int) or take(n: Int) with isEmpty, whichever one has the clearest intent to you.
df.head(1).isEmpty
df.take(1).isEmpty
with Python equivalent:
len(df.head(1)) == 0 # or bool(df.head(1))
len(df.take(1)) == 0 # or bool(df.take(1))
Using df.first() or df.head() will throw java.util.NoSuchElementException if the DataFrame is empty. first() calls head() directly, which calls head(1).head.
def first(): T = head()
def head(): T = head(1).head
head(1) returns an Array, so taking head on that Array causes the java.util.NoSuchElementException when the DataFrame is empty.
def head(n: Int): Array[T] = withAction("head", limit(n).queryExecution)(collectFromPlan)
So instead of calling head(), use head(1) directly to get the array and then you can use isEmpty.
take(n) is also equivalent to head(n)...
def take(n: Int): Array[T] = head(n)
And limit(1).collect() is equivalent to head(1) (notice limit(n).queryExecution in the head(n: Int) method), so the following are all equivalent, at least from what I can tell, and you won't have to catch a java.util.NoSuchElementException exception when the DataFrame is empty.
df.head(1).isEmpty
df.take(1).isEmpty
df.limit(1).collect().isEmpty
I know this is an older question so hopefully it will help someone using a newer version of Spark.
I would say to just grab the underlying RDD. In Scala:
df.rdd.isEmpty
in Python:
df.rdd.isEmpty()
That being said, all this does is call take(1).length, so it'll do the same thing as Rohan answered...just maybe slightly more explicit?
I had the same question, and I tested three main solutions:
(df != null) && (df.count > 0)
df.head(1).isEmpty as @hulin003 suggests
df.rdd.isEmpty() as @Justin Pihony suggests
Of course all three work; however, in terms of performance, here is what I found when executing these methods on the same DataFrame on my machine (execution times):
the first takes ~9366 ms
the second takes ~5607 ms
the third takes ~1921 ms
Therefore I think the best solution is df.rdd.isEmpty(), as @Justin Pihony suggests.
Since Spark 2.4.0 there is Dataset.isEmpty.
Its implementation is:
def isEmpty: Boolean =
  withAction("isEmpty", limit(1).groupBy().count().queryExecution) { plan =>
    plan.executeCollect().head.getLong(0) == 0
  }
Note that a DataFrame is no longer a class in Scala, it's just a type alias (probably changed with Spark 2.0):
type DataFrame = Dataset[Row]
You can take advantage of the head() (or first()) functions to see if the DataFrame has a single row. If so, it is not empty.
If you do df.count > 0, it takes the counts of all partitions across all executors and adds them up at the driver. This takes a while when you are dealing with millions of rows.
The better way is to perform df.take(1) and check whether the returned array is empty. Note that df.first() and df.head() throw java.util.NoSuchElementException on an empty DataFrame instead of returning an empty row, so it is better to put a try around those calls.
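A minimal sketch of that guard, in Scala (the val names are illustrative); note that take(1) itself does not throw, it simply returns an empty array:
import scala.util.Try

// first()/head() throw java.util.NoSuchElementException on an empty DataFrame,
// so wrap them if you use them; take(1) just returns an empty Array
val nonEmptyViaTry = Try(df.first()).isSuccess
val nonEmptyViaTake = df.take(1).nonEmpty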
If you are using Pyspark, you could also do:
len(df.head(1)) > 0
For Java users, you can use this on a Dataset:
public boolean isDatasetEmpty(Dataset<Row> ds) {
    boolean isEmpty;
    try {
        isEmpty = ((Row[]) ds.head(1)).length == 0;
    } catch (Exception e) {
        return true;
    }
    return isEmpty;
}
This checks all possible scenarios (empty, null).
PySpark 3.3.0+ / Spark 2.4.0+ (Scala API)
df.isEmpty()
In PySpark, you can also use bool(df.head(1)) to obtain a True or False value.
It returns False if the dataframe contains no rows.
In Scala you can use implicits to add the methods isEmpty() and nonEmpty() to the DataFrame API, which will make the code a bit nicer to read.
object DataFrameExtensions {
  implicit def extendedDataFrame(dataFrame: DataFrame): ExtendedDataFrame =
    new ExtendedDataFrame(dataFrame: DataFrame)

  class ExtendedDataFrame(dataFrame: DataFrame) {
    def isEmpty(): Boolean = dataFrame.head(1).isEmpty // Any implementation can be used
    def nonEmpty(): Boolean = !isEmpty
  }
}
Here, other methods can be added as well. To use the implicit conversion, use import DataFrameExtensions._ in the file you want to use the extended functionality. Afterwards, the methods can be used directly as so:
val df: DataFrame = ...
if (df.isEmpty) {
// Do something
}
I found that in some cases:
>>> print(type(df))
<class 'pyspark.sql.dataframe.DataFrame'>
>>> df.take(1).isEmpty
'list' object has no attribute 'isEmpty'
The same happens with "length", or when replacing take() with head().
[Solution] For this issue we can use:
>>> df.limit(2).count() > 1
False
If you only want to find out whether the DataFrame is empty, then df.isEmpty, df.head(1).isEmpty() or df.rdd.isEmpty() should work; they all take a limit(1) if you examine their plans:
== Physical Plan ==
*(2) HashAggregate(keys=[], functions=[count(1)], output=[count#52L])
+- *(2) HashAggregate(keys=[], functions=[partial_count(1)], output=[count#60L])
+- *(2) GlobalLimit 1
+- Exchange SinglePartition
+- *(1) LocalLimit 1
+- *(1) InMemoryTableScan
+- InMemoryRelation [value#32L], StorageLevel(disk, memory, deserialized, 1 replicas)
... // the rest of the plan related to your computation
But if you are doing some other computation that requires a lot of memory and you don't want to cache your DataFrame just to check whether it is empty, then you can use an accumulator:
def accumulateRows(acc: LongAccumulator)(df: DataFrame): DataFrame =
  df.map { row => // we map to the same row, count during this map
    acc.add(1)
    row
  }(RowEncoder(df.schema))
val rowAccumulator = spark.sparkContext.longAccumulator("Row Accumulator")
val countedDF = df.transform(accumulateRows(rowAccumulator))
countedDF.write.saveAsTable(...) // main action
val isEmpty = rowAccumulator.isZero
Note that to see the row count, you should first perform the action. If we change the order of the last 2 lines, isEmpty will be true regardless of the computation.
df1.take(1).length > 0
The take method returns an array of rows, so if the array size is zero, there are no records in df.
Let's suppose we have the following empty dataframe:
df = spark.sql("show tables").limit(0)
If you are using Spark 2.1, for pyspark, to check if this dataframe is empty, you can use:
df.count() > 0
Or
bool(df.head(1))
You can do it like:
val df = sqlContext.emptyDataFrame
if (df.eq(sqlContext.emptyDataFrame))
  println("empty df")
else
  println("normal df")
dataframe.limit(1).count > 0
This also triggers a job, but since we are selecting a single record, even in the case of billion-scale records the time consumption could be much lower.
From:
https://medium.com/checking-emptiness-in-distributed-objects/count-vs-isempty-surprised-to-see-the-impact-fa70c0246ee0
