PySpark - How to output a csv/parquet file with sequential records? - apache-spark

I plan to read data from a very large BigQuery table and then output CSV files with 61,000 sequential records per file. I've tried the code below:
TMP_BUCKET = "stg-gcs-bucket"
MAX_PARTITION_BYTES = str(512 * 1024 * 1024)
# 1k Account per file
# MAX_ROW_NUM_PER_FILE = "18300"
MAX_ROW_NUM_PER_FILE = "61000"
spark = SparkSession \
    .builder \
    .master('yarn') \
    .appName('crs-bq-export-csv') \
    .config('spark.sql.execution.arrow.pyspark.enabled', 'true') \
    .config('spark.jars', 'gs://spark-lib/bigquery/spark-bigquery-with-dependencies_2.12-0.23.2.jar') \
    .config("spark.sql.broadcastTimeout", "36000") \
    .config("spark.sql.files.maxRecordsPerFile", MAX_ROW_NUM_PER_FILE) \
    .config("spark.sql.files.maxPartitionBytes", MAX_PARTITION_BYTES) \
    .config("spark.files.maxPartitionBytes", MAX_PARTITION_BYTES) \
    .config("spark.driver.maxResultSize", "24g") \
    .config("spark.sql.execution.arrow.pyspark.enabled", "true") \
    .getOrCreate()

# Try to read full data from BQ
df = spark.read.format('bigquery') \
    .option('table', TABLE_NAME) \
    .load()

df.sort('colA').sort('colB').write.mode('overwrite').csv(OUTPUT_PATH, header=True)
but the final results are not sorted by colA and colB; the rows come out in no particular order:
Expected CSV:
colA colB
1. 1
2. 2
3. 3
....
60001 60001
But got:
colA colB
2. 1
3. 3
2. 2
1. 3
I checked the Spark docs: Spark shuffles the data to get better performance, but I need the final CSV in a specific order. How can I achieve this? Any help would be appreciated!

I create the dataframe like this:
data = [("2.", "1"),
("3.", "3"),
("2.", "2"),
("1.", "3")]
columns = ["colA", "colB"]
df = spark.createDataFrame(data, columns)
df.show()
+----+----+
|colA|colB|
+----+----+
|2. |1 |
|3. |3 |
|2. |2 |
|1. |3 |
+----+----+
If I run your code I get:
df.sort('colA').sort('colB').show()
+----+----+
|colA|colB|
+----+----+
| 2.| 1|
| 2.| 2|
| 1.| 3|
| 3.| 3|
+----+----+
Let's look at the execution plan: it sorts only by colB:
== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- Sort [colB#1 ASC NULLS FIRST], true, 0
+- Exchange rangepartitioning(colB#1 ASC NULLS FIRST, 200), ENSURE_REQUIREMENTS, [plan_id=94]
+- Scan ExistingRDD[colA#0,colB#1]
And that is in line with the way the sort function is implemented: it sorts the whole dataframe based on the columns you pass to it. So the net effect of chaining sort calls is that the resulting dataframe is sorted only according to the last sort call.
Here is the correct approach for your use case:
df.sort('colA', 'colB').show()
df.sort('colA', 'colB').explain()
+----+----+
|colA|colB|
+----+----+
| 1.| 3|
| 2.| 1|
| 2.| 2|
| 3.| 3|
+----+----+
== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- Sort [colA#0 ASC NULLS FIRST, colB#1 ASC NULLS FIRST], true, 0
+- Exchange rangepartitioning(colA#0 ASC NULLS FIRST, colB#1 ASC NULLS FIRST, 200), ENSURE_REQUIREMENTS, [plan_id=148]
+- Scan ExistingRDD[colA#0,colB#1]
As you can see in the output dataframe and in the execution plan, it sorts by both columns because I am passing both columns to the sort function, first by colA and then by colB.
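Applied back to the BigQuery export from the question, the fix is simply to pass both columns to a single sort call before writing. A minimal sketch, assuming the same spark session, TABLE_NAME and OUTPUT_PATH as in the original job:
df = spark.read.format('bigquery') \
    .option('table', TABLE_NAME) \
    .load()

# Single global sort on both columns, then write; with
# spark.sql.files.maxRecordsPerFile set, each output file holds at most
# 61,000 rows of the range-partitioned (i.e. sorted) data.
df.sort('colA', 'colB') \
    .write.mode('overwrite') \
    .csv(OUTPUT_PATH, header=True)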

Related

How round-robin repartition without key might cause data skew?

Seems like I'm missing something about repartition in spark.
AFAIK, you can repartition with a key:
df.repartition("key") , in which case spark will use a hash partitioning method.
And you can repartition by setting only the number of partitions:
df.repartition(10), in which case spark will use a round robin partitioning method.
If repartitioning by only the number of partitions is done in a round-robin manner, in which case would such a partitioning end up with data skew that requires salting to spread the results evenly?
With df.repartition(10) you cannot have skew. As you mention, Spark uses a round robin partitioning method so that partitions have the same size.
We can check that:
spark.range(100000).repartition(5).explain
== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- Exchange RoundRobinPartitioning(5), REPARTITION_BY_NUM, [id=#1380]
+- Range (0, 100000, step=1, splits=16)
spark.range(100000).repartition(5).groupBy(spark_partition_id).count.show
+--------------------+-----+
|SPARK_PARTITION_ID()|count|
+--------------------+-----+
| 0|20000|
| 1|20000|
| 2|20000|
| 3|20000|
| 4|20000|
+--------------------+-----+
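For reference, the same check can be written in PySpark (a sketch; spark_partition_id is in pyspark.sql.functions):
from pyspark.sql.functions import spark_partition_id

# Round-robin repartition: every partition gets exactly 20,000 rows.
spark.range(100000).repartition(5) \
    .groupBy(spark_partition_id()) \
    .count() \
    .show()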
If you use df.repartition("key"), something different happens:
// let's specify the number of partitions as well
spark.range(100000).repartition(5, 'id).explain
== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- Exchange hashpartitioning(id#352L, 5), REPARTITION_BY_NUM, [id=#1424]
+- Range (0, 100000, step=1, splits=16)
Let's try:
spark.range(100000).repartition(5, 'id).groupBy(spark_partition_id).count.show
+--------------------+-----+
|SPARK_PARTITION_ID()|count|
+--------------------+-----+
| 0|20128|
| 1|20183|
| 2|19943|
| 3|19940|
| 4|19806|
+--------------------+-----+
Each element of the column is hashed and hashes are split between partitions. Therefore partitions have similar sizes but they don't have exactly the same size. However, two rows with the same key necessarily end up in the same partition. So if your key is skewed (one or more particular keys are over-represented in the dataframe), your partitioning will be skewed as well:
spark.range(100000)
  .withColumn("key", when('id < 1000, 'id).otherwise(lit(0)))
  .repartition(5, 'key)
  .groupBy(spark_partition_id).count.show
+--------------------+-----+
|SPARK_PARTITION_ID()|count|
+--------------------+-----+
| 0|99211|
| 1| 196|
| 2| 190|
| 3| 200|
| 4| 203|
+--------------------+-----+
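The salting mentioned in the question is the usual way to work around such key skew: add a random salt column and repartition on (key, salt) so that the over-represented key is spread over several partitions. A minimal PySpark sketch of the idea (the 10 salt buckets are an arbitrary choice, and any downstream logic that needs all rows of a key together has to re-aggregate afterwards):
from pyspark.sql.functions import col, floor, lit, rand, spark_partition_id, when

skewed = spark.range(100000) \
    .withColumn("key", when(col("id") < 1000, col("id")).otherwise(lit(0)))

# Spread each key over up to 10 salt buckets, then hash-partition on (key, salt).
salted = skewed.withColumn("salt", floor(rand() * 10)) \
    .repartition(5, "key", "salt")

salted.groupBy(spark_partition_id()).count().show()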

Spark, why does adding an 'or' clause inside a join creates a cartesian product plan

I have the two following dataframes:
df1:
+---+---+
| a| b|
+---+---+
| 1| 2|
| 1| 3|
| 1| 4|
| 2| 5|
| 2| 6|
| 3| 7|
| 3| 8|
+---+---+
info:
+---+---+------------+
| a| b| i|
+---+---+------------+
| 1| 2|1 --> 2 info|
| 1| 3|1 --> 3 info|
| 7| 3|3 --> 7 info|
+---+---+------------+
For each row in 'df1' I want to find a corresponding row in 'info':
select df1.*, info.i from df1
join info
on
(df1.a = info.a and df1.b = info.b)
This works and generates the following explain plan:
*(5) Project [a#0L, b#1L, i#6]
+- *(5) SortMergeJoin [a#0L, b#1L], [a#4L, b#5L], Inner
:- *(2) Sort [a#0L ASC NULLS FIRST, b#1L ASC NULLS FIRST], false, 0
: +- Exchange hashpartitioning(a#0L, b#1L, 200), ENSURE_REQUIREMENTS, [id=#37]
: +- *(1) Filter (isnotnull(a#0L) AND isnotnull(b#1L))
: +- *(1) Scan ExistingRDD[a#0L,b#1L]
+- *(4) Sort [a#4L ASC NULLS FIRST, b#5L ASC NULLS FIRST], false, 0
+- Exchange hashpartitioning(a#4L, b#5L, 200), ENSURE_REQUIREMENTS, [id=#43]
+- *(3) Filter (isnotnull(a#4L) AND isnotnull(b#5L))
+- *(3) Scan ExistingRDD[a#4L,b#5L,i#6]
However, looking at the output:
+---+---+------------+
| a| b| i|
+---+---+------------+
| 1| 3|1 --> 3 info|
| 1| 2|1 --> 2 info|
+---+---+------------+
This is not good enough for me: the order of (a, b) carries no meaning in the 'info' table, so I want the record a=3, b=7 in df1 to be paired with the record a=7, b=3 in info.
select df1.*, info.i from df1
join info
on
(df1.a = info.a and df1.b = info.b) or
(df1.a = info.b and df1.b = info.a)
Output is exactly as I wanted:
+---+---+------------+
| a| b| i|
+---+---+------------+
| 1| 2|1 --> 2 info|
| 1| 3|1 --> 3 info|
| 3| 7|3 --> 7 info|
+---+---+------------+
However, the explain plan worries me:
== Physical Plan ==
*(3) Project [a#0L, b#1L, i#6]
+- CartesianProduct (((a#0L = a#4L) AND (b#1L = b#5L)) OR ((a#0L = b#5L) AND (b#1L = a#4L)))
:- *(1) Scan ExistingRDD[a#0L,b#1L]
+- *(2) Scan ExistingRDD[a#4L,b#5L,i#6]
Questions:
Is adding the OR clause correct? We can assume 'df1' and 'info' tables are unique in (a,b). df1 is ordered but info is not.
why did the plan change?
I am running on Spark 3.1.2
Adding an OR condition to a join clause makes it impossible to easily ensure that the rows to be joined are regrouped on the same executor and can be effectively matched within that executor. In this case Spark is therefore forced to fall back on the naive cartesian product algorithm to join the dataframes.
To simplify, let's look at the first step of Spark's join algorithm: regrouping the rows that will be matched together onto the same executor. The second step (efficiently joining rows within the same executor) is a bit more complicated but has the same issue with OR conditions.
Spark regroups the rows of a dataframe across executors (repartition) by applying a hash function to the join columns and sending rows with similar hashes to the same executor (for instance, similar hashes could be hashes starting with the same characters).
Example
Let's take two dataframes to join, df1 and df2. Here is df1:
+---+---+---+---+
|row| k1| k2| v1|
+---+---+---+---+
|  1|  1|  A| X1|
|  2|  1|  B| X2|
|  3|  2|  A| X3|
|  4|  2|  B| X4|
+---+---+---+---+
And here is df2:
+---+---+---+---+
|row| k1| k2| v2|
+---+---+---+---+
|  5|  1|  A| Y1|
|  6|  1|  B| Y2|
|  7|  2|  A| Y3|
|  8|  2|  B| Y4|
+---+---+---+---+
We will join those two dataframes on a Spark cluster with 2 executors.
AND join condition
We use df1.k1 = df2.k1 AND df1.k2 = df2.k2 for join condition
If we use the concatenation of k1 and k2 as a hash function (a very bad hash function, by the way) and regroup on its first character, the first two rows of df1 will be matched with the first two rows of df2, and the last two rows of df1 will be matched with the last two rows of df2.
So you can split your two dataframes, putting the first two rows on one executor and last two rows on the other executor and then perform the join independently on each executor.
OR join condition
However, if we change the join condition to an OR condition, df1.k1 = df2.k1 OR df1.k2 = df2.k2, hashing doesn't work anymore: row 1 of df1 will be matched with rows 5, 6 and 7 of df2, and row 2 of df1 will be matched with rows 5, 6 and 8 of df2. So the first two rows of df1 will be matched to all rows of df2, and the first two rows of df2 will be matched to all rows of df1.
So you can't split your dataframes to send on executors parts that can be treated independently anymore.
Conclusion
As we can see, with an OR condition you can't use hashing to distribute rows for a join. So, when Spark sees an OR condition in a join, it can't select the hash-based join algorithms. For a similar reason, it can't select the Sort Merge Join algorithm.
That leaves two algorithms for Spark to choose from: Broadcast Nested Loop Join and Cartesian Product. If one of the dataframes is small, Spark will use Broadcast Nested Loop Join; otherwise it will use a Cartesian Product.
That's why adding an OR condition makes Spark use a Cartesian Product plan.
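As noted above, Spark can still pick Broadcast Nested Loop Join when one side is small. If info comfortably fits in memory, you can make that explicit with a broadcast hint so the OR condition does not degenerate into a full Cartesian Product. A sketch in PySpark, reusing the column names from the question:
from pyspark.sql.functions import broadcast

result = df1.join(
    broadcast(info),  # hint: ship the small side to every executor
    on=((df1.a == info.a) & (df1.b == info.b)) |
       ((df1.a == info.b) & (df1.b == info.a)),
    how="inner"
).select(df1.a, df1.b, "i")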
Max and min can be evaluated and used for the join (Scala):
val df1 = Seq(
  (1, 2),
  (1, 3),
  (1, 4),
  (2, 5),
  (2, 6),
  (3, 7),
  (3, 8)
).toDF("a", "b")
  .withColumn("maxValue", when($"a" > $"b", $"a").otherwise($"b"))
  .withColumn("minValue", when($"a" > $"b", $"b").otherwise($"a"))

val info = Seq(
  (1, 2, "1 --> 2 info"),
  (1, 3, "1 --> 3 info"),
  (7, 3, "3 --> 7 info")
).toDF("a", "b", "i")
  .withColumn("maxValue", when($"a" > $"b", $"a").otherwise($"b"))
  .withColumn("minValue", when($"a" > $"b", $"b").otherwise($"a"))

df1
  .join(info, Seq("maxValue", "minValue"))
  // drop unused columns
  .drop("maxValue", "minValue")
  .drop(info.col("a")).drop(info.col("b"))
  .show(false)
Output:
+---+---+------------+
|a |b |i |
+---+---+------------+
|1 |2 |1 --> 2 info|
|1 |3 |1 --> 3 info|
|3 |7 |3 --> 7 info|
+---+---+------------+
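The same normalization trick translates directly to PySpark with greatest and least (a sketch, assuming the df1 and info dataframes from the question):
from pyspark.sql.functions import greatest, least

df1_n = df1.withColumn("maxValue", greatest("a", "b")) \
    .withColumn("minValue", least("a", "b"))
info_n = info.withColumn("maxValue", greatest("a", "b")) \
    .withColumn("minValue", least("a", "b"))

# Equi-join on the normalized pair, so Spark can use a hash-based join.
result = df1_n.join(info_n.select("maxValue", "minValue", "i"),
                    on=["maxValue", "minValue"]) \
    .select("a", "b", "i")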

Spark SQL : Why am I seeing 3 jobs instead of one single job in the Spark UI?

As per my understanding, there will be one job for each action in Spark.
But often I see more than one job triggered for a single action.
I was trying to test this by doing a simple aggregation on a dataset to get the maximum score from each category (here the "subject" field).
While examining the Spark UI, I can see there are 3 "jobs" executed for the groupBy operation, while I was expecting just one.
Can anyone help me understand why there are 3 instead of just 1?
students.show(5)
+----------+--------------+----------+----+-------+-----+-----+
|student_id|exam_center_id| subject|year|quarter|score|grade|
+----------+--------------+----------+----+-------+-----+-----+
| 1| 1| Math|2005| 1| 41| D|
| 1| 1| Spanish|2005| 1| 51| C|
| 1| 1| German|2005| 1| 39| D|
| 1| 1| Physics|2005| 1| 35| D|
| 1| 1| Biology|2005| 1| 53| C|
| 1| 1|Philosophy|2005| 1| 73| B|
+----------+--------------+----------+----+-------+-----+-----+
// Task : Find Highest Score in each subject
val highestScores = students.groupBy("subject").max("score")
highestScores.show(10)
+----------+----------+
| subject|max(score)|
+----------+----------+
| Spanish| 98|
|Modern Art| 98|
| French| 98|
| Physics| 98|
| Geography| 98|
| History| 98|
| English| 98|
| Classics| 98|
| Math| 98|
|Philosophy| 98|
+----------+----------+
only showing top 10 rows
Here is the physical plan for the aggregation:
== Physical Plan ==
*(2) HashAggregate(keys=[subject#12], functions=[max(score#15)])
+- Exchange hashpartitioning(subject#12, 1)
+- *(1) HashAggregate(keys=[subject#12], functions=[partial_max(score#15)])
+- *(1) FileScan csv [subject#12,score#15] Batched: false, Format: CSV, Location: InMemoryFileIndex[file:/C:/lab/SparkLab/files/exams/students.csv], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<subject:string,score:int>
I think only #3 does the actual "job" (it executes a plan which you'll see if you open the Details for the query on the SQL tab). The other two are preparatory steps:
#1 queries the NameNode to build the InMemoryFileIndex used to read your csv, and
#2 samples the dataset to execute .groupBy("subject").max("score"), which internally requires a sortByKey.
I would suggest checking the physical plan:
highestScores.explain()
You might see something like:
*(2) HashAggregate(keys=[subject#9], functions=[max(score#12)], output=[subject#9, max(score)#51])
+- Exchange hashpartitioning(subject#9, 2)
+- *(1) HashAggregate(keys=[subject#9], functions=[partial_max(score#12)], output=[subject#9, max#61])
[Map stage] Stage #1 performs the local (partial) aggregation, and then the data is shuffled using hashpartitioning(subject). Note that the hash partitioner uses the group-by column.
[Reduce stage] Stage #2 merges the output of stage #1 to get the final max(score).
The remaining job is used to actually print the top 10 records, i.e. show(10).
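For reference, the equivalent PySpark code lets you reproduce the plan and the job breakdown yourself (a sketch, assuming a students dataframe with the schema shown above):
highest_scores = students.groupBy("subject").max("score")
highest_scores.explain()  # partial_max -> Exchange hashpartitioning(subject) -> max
highest_scores.show(10)   # this action is what actually triggers the jobs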

Left anti join in groups

I have a dataframe that has two columns a and b where the values in the b column are a subset of the values in the a column. For instance:
df
+---+---+
| a| b|
+---+---+
| 1| 2|
| 1| 3|
| 2| 1|
| 3| 2|
+---+---+
I’d like to produce a dataframe with columns a and anti_b where the values in the anti_b column are any values from the a column such that a!=anti_b and the row (a,anti_b) does not appear in the original dataframe. So in the above dataframe, the result should be:
anti df
+---+------+
| a|anti_b|
+---+------+
| 3| 1|
| 2| 3|
+---+------+
This can be accomplished with a crossJoin and a call to array_contains, but it’s extremely slow and inefficient. Does anyone know of a better spark idiom to accomplish this, something like anti_join?
Here is the inefficient example using a small dataframe so you can see what I’m after:
df = spark.createDataFrame(pandas.DataFrame(numpy.array(
[[1,2],[1,3],[2,1],[3,2]]),columns=['a','b']))
crossed_df = df.select('a').withColumnRenamed('a', '_a').distinct() \
    .crossJoin(df.select('a').withColumnRenamed('a', 'anti_b').distinct()) \
    .where(pyspark.sql.functions.col('_a') != pyspark.sql.functions.col('anti_b'))
anti_df = df.groupBy(
'a'
).agg(
pyspark.sql.functions.collect_list('b').alias('bs')
).join(
crossed_df,
on=((pyspark.sql.functions.col('a')==pyspark.sql.functions.col('_a'))&(~pyspark.sql.functions.expr('array_contains(bs,anti_b)'))),
how='inner'
).select(
'a','anti_b'
)
print('df')
df.show()
print('anti df')
anti_df.show()
Edit: This also works, but it's not much faster:
df = spark.createDataFrame(pandas.DataFrame(numpy.array(
[[1,2],[1,3],[2,1],[3,2]]),columns=['a','b']))
crossed_df = df.select('a').distinct() \
    .crossJoin(df.select('a').withColumnRenamed('a', 'b').distinct()) \
    .where(pyspark.sql.functions.col('a') != pyspark.sql.functions.col('b'))
anti_df = crossed_df.join(
df,
on=['a','b'],
how='left_anti'
)
This should be better than what you have:
from pyspark.sql.functions import collect_set, expr
anti_df = df.groupBy("a").agg(collect_set("b").alias("bs")).alias("l")\
.join(df.alias("r"), on=expr("NOT array_contains(l.bs, r.b)"))\
.where("l.a != r.b")\
.selectExpr("l.a", "r.b AS anti_b")\
anti_df.show()
#+---+------+
#| a|anti_b|
#+---+------+
#| 3| 1|
#| 2| 3|
#+---+------+
If you compare this execution plan with your methods, you'll see that it is better (because you can swap out distinct for collect_set), but it still has a Cartesian product.
anti_df.explain()
#== Physical Plan ==
#*(3) Project [a#0, b#294 AS anti_b#308]
#+- CartesianProduct (NOT (a#0 = b#294) && NOT array_contains(bs#288, b#294))
# :- *(1) Filter isnotnull(a#0)
# : +- ObjectHashAggregate(keys=[a#0], functions=[collect_set(b#1, 0, 0)])
# : +- Exchange hashpartitioning(a#0, 200)
# : +- ObjectHashAggregate(keys=[a#0], functions=[partial_collect_set(b#1, 0, 0)])
# : +- Scan ExistingRDD[a#0,b#1]
# +- *(2) Project [b#294]
# +- *(2) Filter isnotnull(b#294)
# +- Scan ExistingRDD[a#293,b#294]
However, I don't think there's any way to avoid the Cartesian product for this particular problem without more information.

How Spark SQL treats nulls during filtering? [duplicate]

I am a bit confused about the difference between using
df.filter(col("c1") === null) and df.filter(col("c1").isNull)
On the same dataframe I am getting counts with
=== null but zero counts with isNull. Please help me understand the difference. Thanks
First and foremost, don't use null in your Scala code unless you really have to for compatibility reasons.
Regarding your question, it is plain SQL. col("c1") === null is interpreted as c1 = NULL and, because NULL marks undefined values, the result is undefined for any value, including NULL itself.
spark.sql("SELECT NULL = NULL").show
+-------------+
|(NULL = NULL)|
+-------------+
| null|
+-------------+
spark.sql("SELECT NULL != NULL").show
+-------------------+
|(NOT (NULL = NULL))|
+-------------------+
| null|
+-------------------+
spark.sql("SELECT TRUE != NULL").show
+------------------------------------+
|(NOT (true = CAST(NULL AS BOOLEAN)))|
+------------------------------------+
| null|
+------------------------------------+
spark.sql("SELECT TRUE = NULL").show
+------------------------------+
|(true = CAST(NULL AS BOOLEAN))|
+------------------------------+
| null|
+------------------------------+
The only valid methods to check for NULL are:
IS NULL:
spark.sql("SELECT NULL IS NULL").show
+--------------+
|(NULL IS NULL)|
+--------------+
| true|
+--------------+
spark.sql("SELECT TRUE IS NULL").show
+--------------+
|(true IS NULL)|
+--------------+
| false|
+--------------+
IS NOT NULL:
spark.sql("SELECT NULL IS NOT NULL").show
+------------------+
|(NULL IS NOT NULL)|
+------------------+
| false|
+------------------+
spark.sql("SELECT TRUE IS NOT NULL").show
+------------------+
|(true IS NOT NULL)|
+------------------+
| true|
+------------------+
implemented in DataFrame DSL as Column.isNull and Column.isNotNull respectively.
Note:
For NULL-safe comparisons use IS DISTINCT FROM / IS NOT DISTINCT FROM:
spark.sql("SELECT NULL IS NOT DISTINCT FROM NULL").show
+---------------+
|(NULL <=> NULL)|
+---------------+
| true|
+---------------+
spark.sql("SELECT NULL IS NOT DISTINCT FROM TRUE").show
+--------------------------------+
|(CAST(NULL AS BOOLEAN) <=> true)|
+--------------------------------+
| false|
+--------------------------------+
or not(_ <=> _) / <=>
spark.sql("SELECT NULL AS col1, NULL AS col2").select($"col1" <=> $"col2").show
+---------------+
|(col1 <=> col2)|
+---------------+
| true|
+---------------+
spark.sql("SELECT NULL AS col1, TRUE AS col2").select($"col1" <=> $"col2").show
+---------------+
|(col1 <=> col2)|
+---------------+
| false|
+---------------+
in SQL and DataFrame DSL respectively.
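For completeness, here is how the same checks look in the PySpark DataFrame DSL (a sketch; eqNullSafe is the DSL counterpart of <=>, and the literal "x" is only a placeholder):
from pyspark.sql.functions import col

df.filter(col("c1").isNull())          # c1 IS NULL
df.filter(col("c1").isNotNull())       # c1 IS NOT NULL
df.filter(col("c1").eqNullSafe("x"))   # NULL-safe equality: c1 <=> 'x'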
Related:
Including null values in an Apache Spark Join
Usually the best way to shed light onto unexpected results in Spark Dataframes is to look at the explain plan. Consider the following example:
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions._

object Example extends App {

  val session = SparkSession.builder().master("local[*]").getOrCreate()

  case class Record(c1: String, c2: String)

  val data = List(Record("a", "b"), Record(null, "c"))
  val rdd = session.sparkContext.parallelize(data)

  import session.implicits._
  val df: DataFrame = rdd.toDF

  val filtered = df.filter(col("c1") === null)
  println(filtered.count()) // <-- outputs 0, not expected

  val filtered2 = df.filter(col("c1").isNull)
  println(filtered2.count()) // <-- outputs 1, as expected

  filtered.explain(true)
  filtered2.explain(true)
}
The first explain plan shows:
== Physical Plan ==
*Filter (isnotnull(c1#2) && null)
+- Scan ExistingRDD[c1#2,c2#3]
== Parsed Logical Plan ==
'Filter isnull('c1)
+- LogicalRDD [c1#2, c2#3]
This filter clause looks nonsensical: the && null ensures the condition can never resolve to true.
The second explain plan looks like:
== Physical Plan ==
*Filter isnull(c1#2)
+- Scan ExistingRDD[c1#2,c2#3]
Here the filter is what we expect and want.
