How to use pyspark to efficiently keep only those groups from a dataframe that satisfy a certain group-specific filter? - apache-spark

Let us use the following dummy data:
df = spark.createDataFrame([(1,2),(1,3),(1,40),(1,0),(2,3),(2,1),(2,4),(3,2),(3,4)],['a','b'])
df.show()
+---+---+
| a| b|
+---+---+
| 1| 2|
| 1| 3|
| 1| 40|
| 1| 0|
| 2| 3|
| 2| 1|
| 2| 4|
| 3| 2|
| 3| 4|
+---+---+
How can I filter out the data groups that do not have average(b) > 6?
Expected output:
+---+---+
| a| b|
+---+---+
| 1| 2|
| 1| 3|
| 1| 40|
| 1| 0|
+---+---+
How I am achieving it:
from pyspark.sql import functions as F

df_filter = df.groupby('a').agg(F.mean(F.col('b')).alias("avg"))
df_filter = df_filter.filter(F.col('avg') > 6.)
df.join(df_filter,'a','inner').drop('avg').show()
Problem:
The shuffle happens twice. Once for computing the df_filter and the other time for the join.
df_filter = df.groupby('a').agg(F.mean(F.col('b')).alias("avg"))
df_filter = df_filter.filter(F.col('avg') > 6.)
df.join(df_filter,'a','inner').drop('avg').explain()
== Physical Plan ==
*(5) Project [a#175L, b#176L]
+- *(5) SortMergeJoin [a#175L], [a#222L], Inner
   :- *(2) Sort [a#175L ASC NULLS FIRST], false, 0
   :  +- Exchange hashpartitioning(a#175L, 200), ENSURE_REQUIREMENTS, [plan_id=919]
   :     +- *(1) Filter isnotnull(a#175L)
   :        +- *(1) Scan ExistingRDD[a#175L,b#176L]
   +- *(4) Sort [a#222L ASC NULLS FIRST], false, 0
      +- *(4) Project [a#222L]
         +- *(4) Filter (isnotnull(avg#219) AND (avg#219 > 6.0))
            +- *(4) HashAggregate(keys=[a#222L], functions=[avg(b#223L)])
               +- Exchange hashpartitioning(a#222L, 200), ENSURE_REQUIREMENTS, [plan_id=925]
                  +- *(3) HashAggregate(keys=[a#222L], functions=[partial_avg(b#223L)])
                     +- *(3) Filter isnotnull(a#222L)
                        +- *(3) Scan ExistingRDD[a#222L,b#223L]
If I think about it, I should only need to shuffle the data once on the key a; after that no more shuffles are needed, since every partition would be self-sufficient.
Question: In general, what is an efficient way to exclude the data groups that do not satisfy a group-dependent filter?

You can use Window functionality instead of doing groupBy + join:
from pyspark.sql import Window
from pyspark.sql.functions import avg, col

out = df.withColumn("avg", avg(col("b")).over(Window.partitionBy("a")))\
        .where("avg>6").drop("avg")
out.explain()
out.show()
+- Project [a#0L, b#1L]
   +- Filter (isnotnull(avg#5) AND (avg#5 > 6.0))
      +- Window [avg(b#1L) windowspecdefinition(a#0L, specifiedwindowframe(RowFrame, unboundedpreceding$(), unboundedfollowing$())) AS avg#5], [a#0L]
         +- Sort [a#0L ASC NULLS FIRST], false, 0
            +- Exchange hashpartitioning(a#0L, 200), ENSURE_REQUIREMENTS, [plan_id=16]
               +- Scan ExistingRDD[a#0L,b#1L]
+---+---+
| a| b|
+---+---+
| 1| 2|
| 1| 3|
| 1| 40|
| 1| 0|
+---+---+
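A further option (not from the answer above): if the set of qualifying groups is expected to be small, you can keep the groupBy + join formulation and hint Spark to broadcast the filtered aggregate, so the large DataFrame is not shuffled a second time for the join. A minimal sketch, assuming pyspark.sql.functions is imported as F:
from pyspark.sql import functions as F

df_filter = df.groupby('a').agg(F.mean('b').alias('avg')).filter(F.col('avg') > 6)
# Broadcasting the small per-group result turns the SortMergeJoin into a
# BroadcastHashJoin, so only the aggregation shuffles df; the join side is a
# plain scan plus a broadcast exchange.
df.join(F.broadcast(df_filter), 'a', 'inner').drop('avg').explain()
The window approach above remains the simplest single-shuffle formulation; the broadcast variant trades the window sort for a second scan of df.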

Related

pyspark dataframe not maintaining order after dropping a column

I create a dataframe:
df = spark.createDataFrame(pd.DataFrame({'a':range(12),'c':range(12)})).repartition(8)
Its contents are:
df.show()
+---+---+
| a| c|
+---+---+
| 0| 0|
| 1| 1|
| 3| 3|
| 5| 5|
| 6| 6|
| 8| 8|
| 9| 9|
| 10| 10|
| 2| 2|
| 4| 4|
| 7| 7|
| 11| 11|
+---+---+
But if I drop a column, the remaining column gets permuted:
df.drop('c').show()
+---+
| a|
+---+
| 0|
| 2|
| 3|
| 5|
| 6|
| 7|
| 9|
| 11|
| 1|
| 4|
| 8|
| 10|
+---+
Can someone please help me understand what is happening here?
I wanted to add my answer since I felt I could explain this slightly differently.
The repartition results in a RoundRobinPartitioning. It basically redistributes the data in round-robin fashion.
Since you are evaluating the dataframe again, it recomputes the partitioning after the drop.
You can see this by running a few commands in addition to what you have shown.
df = spark.createDataFrame(pd.DataFrame({'a':range(12),'c':range(12)})).repartition(8)
df.explain()
# == Physical Plan ==
# Exchange RoundRobinPartitioning(8)
# +- Scan ExistingRDD[a#14L,c#15L]
print("Partitions structure: {}".format(df.rdd.glom().collect()))
# Partitions structure: [[], [], [], [], [], [], [Row(a=0, c=0), Row(a=1, c=1), Row(a=3, c=3), Row(a=5, c=5), Row(a=6, c=6), Row(a=8, c=8), Row(a=9, c=9), Row(a=10, c=10)], [Row(a=2, c=2), Row(a=4, c=4), Row(a=7, c=7), Row(a=11, c=11)]]
temp = df.drop("c")
temp.explain()
# == Physical Plan ==
# Exchange RoundRobinPartitioning(8)
# +- *(1) Project [a#14L]
# +- Scan ExistingRDD[a#14L,c#15L]
print("Partitions structure: {}".format(temp.rdd.glom().collect()))
# Partitions structure: [[], [], [], [], [], [], [Row(a=0), Row(a=2), Row(a=3), Row(a=5), Row(a=6), Row(a=7), Row(a=9), Row(a=11)], [Row(a=1), Row(a=4), Row(a=8), Row(a=10)]]
In the above code, the explain() shows the RoundRobinPartitioning taking place. The use of glom shows the redistribution of data across partitions.
In the original dataframe, the partitions are in the order that you see the results of show().
In the second dataframe above, you can see that the data has shuffled across the last two partitions, resulting in it not being in the same order. This is because when re-evaluating the dataframe the repartition runs again.
Edits as per discussion in the comments
If you run df.drop('b'), you are trying to drop a column that doesn't exist, so it's really a no-op (no operation) and the partitioning doesn't change.
df.drop('b').explain()
# == Physical Plan ==
# Exchange RoundRobinPartitioning(8)
# +- Scan ExistingRDD[a#70L,c#71L]
Similarly, if you add a column, the round-robin partitioning runs before the column is added. This again results in the same partitioning, and hence the order is consistent with the original dataframe.
import pyspark.sql.functions as f
df.withColumn('tt', f.rand()).explain()
# == Physical Plan ==
# *(1) Project [a#70L, c#71L, rand(-3030352041536166328) AS tt#76]
# +- Exchange RoundRobinPartitioning(8)
# +- Scan ExistingRDD[a#70L,c#71L]
In the case of df.drop('c'), the column is dropped first and then the partitioner is applied. This results in a different partitioning, since the dataframe feeding the partitioning step is different.
df.drop('c').explain()
# == Physical Plan ==
# Exchange RoundRobinPartitioning(8)
# +- *(1) Project [a#70L]
# +- Scan ExistingRDD[a#70L,c#71L]
As mentioned in another answer to this question, the round-robin partitioner is random for different data, but consistent for the same data on which the partitioning is run. So if the underlying data changes from the operation, the resulting partitioning will be different.
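Not part of the original answer, but if you want the order to stay stable across such operations, one practical sketch (assuming the same spark session and imports as above) is to cache the repartitioned DataFrame, so later transformations reuse the materialized partitions instead of re-running the round-robin exchange:
import pandas as pd

df = spark.createDataFrame(pd.DataFrame({'a': range(12), 'c': range(12)})).repartition(8)
df.cache()
df.count()           # materialize the cached partitions once
df.show()            # rows come out in some fixed (but arbitrary) order
df.drop('c').show()  # 'a' should keep that order: the drop now projects from the
                     # cached partitions instead of recomputing RoundRobinPartitioning
If a particular order actually matters, the reliable approach is still an explicit orderBy before show() or collect().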

Spark SQL : Why am I seeing 3 jobs instead of one single job in the Spark UI?

As per my understanding, there will be one job for each action in Spark.
But I often see more than one job triggered for a single action.
I was trying to test this by doing a simple aggregation on a dataset to get the maximum from each category (here, the "subject" field).
While examining the Spark UI, I can see there are 3 "jobs" executed for the groupBy operation, while I was expecting just one.
Can anyone help me understand why there are 3 instead of just 1?
students.show(5)
+----------+--------------+----------+----+-------+-----+-----+
|student_id|exam_center_id| subject|year|quarter|score|grade|
+----------+--------------+----------+----+-------+-----+-----+
| 1| 1| Math|2005| 1| 41| D|
| 1| 1| Spanish|2005| 1| 51| C|
| 1| 1| German|2005| 1| 39| D|
| 1| 1| Physics|2005| 1| 35| D|
| 1| 1| Biology|2005| 1| 53| C|
| 1| 1|Philosophy|2005| 1| 73| B|
// Task : Find Highest Score in each subject
val highestScores = students.groupBy("subject").max("score")
highestScores.show(10)
+----------+----------+
| subject|max(score)|
+----------+----------+
| Spanish| 98|
|Modern Art| 98|
| French| 98|
| Physics| 98|
| Geography| 98|
| History| 98|
| English| 98|
| Classics| 98|
| Math| 98|
|Philosophy| 98|
+----------+----------+
only showing top 10 rows
== Physical Plan ==
*(2) HashAggregate(keys=[subject#12], functions=[max(score#15)])
+- Exchange hashpartitioning(subject#12, 1)
+- *(1) HashAggregate(keys=[subject#12], functions=[partial_max(score#15)])
+- *(1) FileScan csv [subject#12,score#15] Batched: false, Format: CSV, Location: InMemoryFileIndex[file:/C:/lab/SparkLab/files/exams/students.csv], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<subject:string,score:int>
I think only #3 does the actual "job" (it executes a plan, which you'll see if you open Details for the query on the SQL tab). The other two are preparatory steps:
#1 is querying the NameNode to build the InMemoryFileIndex to read your csv, and
#2 is sampling the dataset to execute .groupBy("subject").max("score"), which internally requires a sortByKey.
I would suggest checking the physical plan:
highestScores.explain()
You might see something like:
*(2) HashAggregate(keys=[subject#9], functions=[max(score#12)], output=[subject#9, max(score)#51])
+- Exchange hashpartitioning(subject#9, 2)
+- *(1) HashAggregate(keys=[subject#9], functions=[partial_max(score#12)], output=[subject#9, max#61])
[Map stage] stage#1 achieves the local aggregation (partial aggregation), and then the shuffle happens using hashpartitioning(subject). Note that the hash partitioner uses the group-by column.
[Reduce stage] stage#2 merges the output of stage#1 to get the final max(score).
This is what is actually used to print the top 10 records with show(10).

Left anti join in groups

I have a dataframe that has two columns a and b where the values in the b column are a subset of the values in the a column. For instance:
df
+---+---+
| a| b|
+---+---+
| 1| 2|
| 1| 3|
| 2| 1|
| 3| 2|
+---+---+
I’d like to produce a dataframe with columns a and anti_b where the values in the anti_b column are any values from the a column such that a!=anti_b and the row (a,anti_b) does not appear in the original dataframe. So in the above dataframe, the result should be:
anti df
+---+------+
| a|anti_b|
+---+------+
| 3| 1|
| 2| 3|
+---+------+
This can be accomplished with a crossJoin and a call to array_contains, but it’s extremely slow and inefficient. Does anyone know of a better spark idiom to accomplish this, something like anti_join?
Here is the inefficient example using a small dataframe so you can see what I’m after:
import numpy
import pandas
import pyspark.sql.functions

df = spark.createDataFrame(pandas.DataFrame(numpy.array(
[[1,2],[1,3],[2,1],[3,2]]),columns=['a','b']))
crossed_df = df.select('a').withColumnRenamed('a','_a').distinct().crossJoin(df.select('a').withColumnRenamed('a','anti_b').distinct()).where(pyspark.sql.functions.col('_a')!=pyspark.sql.functions.col('anti_b'))
anti_df = df.groupBy(
'a'
).agg(
pyspark.sql.functions.collect_list('b').alias('bs')
).join(
crossed_df,
on=((pyspark.sql.functions.col('a')==pyspark.sql.functions.col('_a'))&(~pyspark.sql.functions.expr('array_contains(bs,anti_b)'))),
how='inner'
).select(
'a','anti_b'
)
print('df')
df.show()
print('anti df')
anti_df.show()
Edit: This also works, but it's not much faster:
df = spark.createDataFrame(pandas.DataFrame(numpy.array(
[[1,2],[1,3],[2,1],[3,2]]),columns=['a','b']))
crossed_df = df.select('a').distinct().crossJoin(df.select('a').withColumnRenamed('a','b').distinct()).where(pyspark.sql.functions.col('a')!=pyspark.sql.functions.col('b'))
anti_df = crossed_df.join(
df,
on=['a','b'],
how='left_anti'
)
This should be better than what you have:
from pyspark.sql.functions import collect_set, expr
anti_df = df.groupBy("a").agg(collect_set("b").alias("bs")).alias("l")\
.join(df.alias("r"), on=expr("NOT array_contains(l.bs, r.b)"))\
.where("l.a != r.b")\
.selectExpr("l.a", "r.b AS anti_b")
anti_df.show()
#+---+------+
#| a|anti_b|
#+---+------+
#| 3| 1|
#| 2| 3|
#+---+------+
If you compare this execution plan with your methods, you'll see that it is better (because you can swap out distinct for collect_set), but it still has a Cartesian product.
anti_df.explain()
#== Physical Plan ==
#*(3) Project [a#0, b#294 AS anti_b#308]
#+- CartesianProduct (NOT (a#0 = b#294) && NOT array_contains(bs#288, b#294))
# :- *(1) Filter isnotnull(a#0)
# : +- ObjectHashAggregate(keys=[a#0], functions=[collect_set(b#1, 0, 0)])
# : +- Exchange hashpartitioning(a#0, 200)
# : +- ObjectHashAggregate(keys=[a#0], functions=[partial_collect_set(b#1, 0, 0)])
# : +- Scan ExistingRDD[a#0,b#1]
# +- *(2) Project [b#294]
# +- *(2) Filter isnotnull(b#294)
# +- Scan ExistingRDD[a#293,b#294]
However, I don't think there's any way to avoid the Cartesian product for this particular problem without more information.

inner join not working in DataFrame using Spark 2.1

My data set:
The emp dataframe looks like this:
emp.show()
+---+-----+------+----------+-------------+
| ID| NAME|salary|department| date|
+---+-----+------+----------+-------------+
| 1| sban| 100.0| IT| 2018-01-10|
| 2| abc| 200.0| HR| 2018-01-05|
| 3| Jack| 100.0| SALE| 2018-01-05|
| 4| Ram| 100.0| IT|2018-01-01-06|
| 5|Robin| 200.0| IT| 2018-01-07|
| 6| John| 200.0| SALE| 2018-01-08|
| 7| sban| 300.0| Director| 2018-01-01|
+---+-----+------+----------+-------------+
2. Then I group by name and take its max salary; say the dataframe is grpByName:
val grpByName = emp.select(col("salary"), col("name")).groupBy(col("name")).agg(max(col("salary")).alias("max_salary"))
grpByName.select("*").show()
+-----+----------+
| name|max_salary|
+-----+----------+
| Jack| 100.0|
|Robin| 200.0|
| Ram| 100.0|
| John| 200.0|
| abc| 200.0|
| sban| 300.0|
+-----+----------+
3. Then I try to join:
val joinedBySalarywithMaxSal = emp.join(grpByName, col("emp.salary") === col("grpByName.max_salary"), "inner")
It's throwing:
18/02/08 21:29:26 INFO CodeGenerator: Code generated in 13.667672 ms
Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot resolve '`grpByName.max_salary`' given input columns: [NAME, department, date, ID, salary, max_salary, NAME];;
'Join Inner, (salary#2 = 'grpByName.max_salary)
:- Project [ID#0, NAME#1, salary#2, department#3, date#4]
: +- MetastoreRelation default, emp
+- Aggregate [NAME#44], [NAME#44, max(salary#45) AS max_salary#25]
+- Project [salary#45, NAME#44]
+- Project [ID#43, NAME#44, salary#45, department#46, date#47]
+- MetastoreRelation default, emp
I am not getting why it's not working, since when I check:
grpByName.select(col("max_salary")).show()
+----------+
|max_salary|
+----------+
| 100.0|
| 200.0|
| 100.0|
| 200.0|
| 200.0|
| 300.0|
+----------+
Thanks in advance.
The dot notation is used to refer to nested structures inside a table, not to refer to the table itself.
Call the col method defined on the DataFrame instead, like this:
emp.join(grpByName, emp.col("salary") === grpByName.col("max_salary"), "inner")
You can see an example here.
Furthermore, note that joins are inner by default, so you should just be able to write the following:
emp.join(grpByName, emp.col("salary") === grpByName.col("max_salary"))
I am not sure, but hope this can help:
val joinedBySalarywithMaxSal = emp.join(grpByName, emp.col("salary") === grpByName.col("max_salary"), "inner")

Spark train test split

I am curious if there is something similar to sklearn's http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.StratifiedShuffleSplit.html for apache-spark in the latest 2.0.1 release.
So far I could only find https://spark.apache.org/docs/latest/mllib-statistics.html#stratified-sampling, which does not seem to be a great fit for splitting a heavily imbalanced dataset into train/test samples.
Let's assume we have a dataset like this:
+---+-----+
| id|label|
+---+-----+
| 0| 0.0|
| 1| 1.0|
| 2| 0.0|
| 3| 1.0|
| 4| 0.0|
| 5| 1.0|
| 6| 0.0|
| 7| 1.0|
| 8| 0.0|
| 9| 1.0|
+---+-----+
This dataset is perfectly balanced, but this approach will work for unbalanced data as well.
Now, let's augment this DataFrame with additional information that will be useful in deciding which rows should go to train set. The steps are as follows:
Determine how many examples of every label should be a part of the train set, given some ratio.
Shuffle the rows of the DataFrame.
Use a window function to partition and order the DataFrame by label, and then rank each label's observations using row_number().
We end up with the following data frame:
+---+-----+----------+
| id|label|row_number|
+---+-----+----------+
| 6| 0.0| 1|
| 2| 0.0| 2|
| 0| 0.0| 3|
| 4| 0.0| 4|
| 8| 0.0| 5|
| 9| 1.0| 1|
| 5| 1.0| 2|
| 3| 1.0| 3|
| 1| 1.0| 4|
| 7| 1.0| 5|
+---+-----+----------+
Note: the rows are shuffled (see: random order in id column), partitioned by label (see: label column) and ranked.
Let's assume that we would like to make an 80% split. In this case, we would like four 1.0 labels and four 0.0 labels to go to the training dataset and one 1.0 label and one 0.0 label to go to the test dataset. We have this information in the row_number column, so now we can simply use it in a user defined function (if row_number is less than or equal to four, the example goes to the train set).
After applying the UDF, the resulting data frame is as follows:
+---+-----+----------+----------+
| id|label|row_number|isTrainSet|
+---+-----+----------+----------+
| 6| 0.0| 1| true|
| 2| 0.0| 2| true|
| 0| 0.0| 3| true|
| 4| 0.0| 4| true|
| 8| 0.0| 5| false|
| 9| 1.0| 1| true|
| 5| 1.0| 2| true|
| 3| 1.0| 3| true|
| 1| 1.0| 4| true|
| 7| 1.0| 5| false|
+---+-----+----------+----------+
Now, to get the train/test data one has to do:
val train = df.where(col("isTrainSet") === true)
val test = df.where(col("isTrainSet") === false)
These sorting and partitioning steps might be prohibitive for some really big datasets, so I suggest first filtering the dataset as much as possible. The physical plan is as follows:
== Physical Plan ==
*(3) Project [id#4, label#5, row_number#11, if (isnull(row_number#11)) null else UDF(label#5, row_number#11) AS isTrainSet#48]
+- Window [row_number() windowspecdefinition(label#5, label#5 ASC NULLS FIRST, specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) AS row_number#11], [label#5], [label#5 ASC NULLS FIRST]
+- *(2) Sort [label#5 ASC NULLS FIRST, label#5 ASC NULLS FIRST], false, 0
+- Exchange hashpartitioning(label#5, 200)
+- *(1) Project [id#4, label#5]
+- *(1) Sort [_nondeterministic#9 ASC NULLS FIRST], true, 0
+- Exchange rangepartitioning(_nondeterministic#9 ASC NULLS FIRST, 200)
+- LocalTableScan [id#4, label#5, _nondeterministic#9]
Here's a full working example (tested with Spark 2.3.0 and Scala 2.11.12):
import org.apache.spark.SparkConf
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.{DataFrame, Row, SparkSession}
import org.apache.spark.sql.functions.{col, row_number, udf, rand}
class StratifiedTrainTestSplitter {
  def getNumExamplesPerClass(ss: SparkSession, label: String, trainRatio: Double)(df: DataFrame): Map[Double, Long] = {
    df.groupBy(label).count().createOrReplaceTempView("labelCounts")
    val query = f"SELECT $label AS ratioLabel, count, cast(count * $trainRatio as long) AS trainExamples FROM labelCounts"
    import ss.implicits._
    ss.sql(query)
      .select("ratioLabel", "trainExamples")
      .map((r: Row) => r.getDouble(0) -> r.getLong(1))
      .collect()
      .toMap
  }

  def split(df: DataFrame, label: String, trainRatio: Double): DataFrame = {
    val w = Window.partitionBy(col(label)).orderBy(col(label))
    val rowNumPartitioner = row_number().over(w)
    val dfRowNum = df.sort(rand).select(col("*"), rowNumPartitioner as "row_number")
    dfRowNum.show()
    val observationsPerLabel: Map[Double, Long] = getNumExamplesPerClass(df.sparkSession, label, trainRatio)(df)
    val addIsTrainColumn = udf((label: Double, rowNumber: Int) => rowNumber <= observationsPerLabel(label))
    dfRowNum.withColumn("isTrainSet", addIsTrainColumn(col(label), col("row_number")))
  }
}

object StratifiedTrainTestSplitter {
  def getDf(ss: SparkSession): DataFrame = {
    val data = Seq(
      (0, 0.0), (1, 1.0), (2, 0.0), (3, 1.0), (4, 0.0), (5, 1.0), (6, 0.0), (7, 1.0), (8, 0.0), (9, 1.0)
    )
    ss.createDataFrame(data).toDF("id", "label")
  }

  def main(args: Array[String]): Unit = {
    val spark: SparkSession = SparkSession
      .builder()
      .config(new SparkConf().setMaster("local[1]"))
      .getOrCreate()

    val df = new StratifiedTrainTestSplitter().split(getDf(spark), "label", 0.8)
    df.cache()
    df.where(col("isTrainSet") === true).show()
    df.where(col("isTrainSet") === false).show()
  }
}
Note: the labels are Doubles in this case. If your labels are Strings you'll have to switch types here and there.
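Since the surrounding thread is about PySpark, here is a rough PySpark equivalent of the splitter above; this is a sketch under the assumption of an existing SparkSession and an input DataFrame with a numeric label column, not the answer's own code:
from pyspark.sql import Window
from pyspark.sql import functions as F

def stratified_split(df, label_col="label", train_ratio=0.8):
    # target number of training rows per label value
    counts = {row[label_col]: int(row["count"] * train_ratio)
              for row in df.groupBy(label_col).count().collect()}
    # shuffle rows within each label and rank them
    w = Window.partitionBy(label_col).orderBy(F.rand())
    ranked = df.withColumn("row_number", F.row_number().over(w))
    is_train = F.udf(lambda label, rn: rn <= counts[label], "boolean")
    return ranked.withColumn("isTrainSet", is_train(F.col(label_col), F.col("row_number")))

split_df = stratified_split(df, "label", 0.8)
train = split_df.where(F.col("isTrainSet"))
test = split_df.where(~F.col("isTrainSet"))
Note that F.rand() makes the assignment non-deterministic across runs unless you pass a seed, e.g. F.rand(42).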
Spark supports stratified samples as outlined in https://s3.amazonaws.com/sparksummit-share/ml-ams-1.0.1/6-sampling/scala/6-sampling_student.html
df.stat.sampleBy("label", Map(0 -> .10, 1 -> .20, 2 -> .3), 0)
Perhaps this method wasn't available when the OP posted this question, but I'm leaving this here for future reference:
# splitting dataset into train and test set
train, test = df.randomSplit([0.7, 0.3], seed=42)
Although this answer is not specific to Spark, in Apache Beam I do this to split train 66% and test 33% (just an illustrative example; you can customize the partition_fn below to be more sophisticated and accept arguments, e.g. to specify the number of buckets, bias selection towards something, or ensure randomization is fair across dimensions, etc.):
import random

raw_data = p | 'Read Data' >> Read(...)
clean_data = (raw_data
              | "Clean Data" >> beam.ParDo(CleanFieldsFn()))

def partition_fn(element, num_partitions):
    return random.randint(0, 2)

random_buckets = (clean_data | beam.Partition(partition_fn, 3))
clean_train_data = ((random_buckets[0], random_buckets[1])
                    | beam.Flatten())
clean_eval_data = random_buckets[2]
