pyspark dataframe not maintaining order after dropping a column - apache-spark

I create a dataframe:
df = spark.createDataFrame(pd.DataFrame({'a':range(12),'c':range(12)})).repartition(8)
Its contents are:
df.show()
+---+---+
| a| c|
+---+---+
| 0| 0|
| 1| 1|
| 3| 3|
| 5| 5|
| 6| 6|
| 8| 8|
| 9| 9|
| 10| 10|
| 2| 2|
| 4| 4|
| 7| 7|
| 11| 11|
+---+---+
But if I drop a column, the remaining column gets permuted:
df.drop('c').show()
+---+
| a|
+---+
| 0|
| 2|
| 3|
| 5|
| 6|
| 7|
| 9|
| 11|
| 1|
| 4|
| 8|
| 10|
+---+
Can someone help me understand what is happening here?

I wanted to add my answer since I felt I could explain this slightly differently.
The repartition results in RoundRobinPartitioning, which redistributes the data in round-robin fashion.
Since you are evaluating the dataframe again, the partitioning is recomputed after the drop.
You can see this by running a few commands in addition to what you have shown.
df = spark.createDataFrame(pd.DataFrame({'a':range(12),'c':range(12)})).repartition(8)
df.explain()
# == Physical Plan ==
# Exchange RoundRobinPartitioning(8)
# +- Scan ExistingRDD[a#14L,c#15L]
print("Partitions structure: {}".format(df.rdd.glom().collect()))
# Partitions structure: [[], [], [], [], [], [], [Row(a=0, c=0), Row(a=1, c=1), Row(a=3, c=3), Row(a=5, c=5), Row(a=6, c=6), Row(a=8, c=8), Row(a=9, c=9), Row(a=10, c=10)], [Row(a=2, c=2), Row(a=4, c=4), Row(a=7, c=7), Row(a=11, c=11)]]
temp = df.drop("c")
temp.explain()
# == Physical Plan ==
# Exchange RoundRobinPartitioning(8)
# +- *(1) Project [a#14L]
#    +- Scan ExistingRDD[a#14L,c#15L]
print("Partitions structure: {}".format(temp.rdd.glom().collect()))
# Partitions structure: [[], [], [], [], [], [], [Row(a=0), Row(a=2), Row(a=3), Row(a=5), Row(a=6), Row(a=7), Row(a=9), Row(a=11)], [Row(a=1), Row(a=4), Row(a=8), Row(a=10)]]
In the above code, the explain() shows the RoundRobinPartitioning taking place. The use of glom shows the redistribution of data across partitions.
In the original dataframe, the partitions are in the order that you see the results of show().
In the second dataframe, you can see that the data has been shuffled differently across the last two partitions, so it is no longer in the same order. This is because the repartition runs again when the dataframe is re-evaluated.
Edits as per discussion in the comments
If you run df.drop('b'), you are trying to drop a column that doesn't exist, so it's really a no-op (no operation) and the partitioning doesn't change.
df.drop('b').explain()
# == Physical Plan ==
# Exchange RoundRobinPartitioning(8)
# +- Scan ExistingRDD[a#70L,c#71L]
Similarly, if you add a column, the round-robin partitioning runs before the column is added. This again results in the same partitioning, and hence the order is consistent with the original dataframe.
import pyspark.sql.functions as f
df.withColumn('tt', f.rand()).explain()
# == Physical Plan ==
# *(1) Project [a#70L, c#71L, rand(-3030352041536166328) AS tt#76]
# +- Exchange RoundRobinPartitioning(8)
#    +- Scan ExistingRDD[a#70L,c#71L]
In the case of df.drop('c'), the column is dropped first and then the partitioner is applied. This results in a different partitioning, since the dataframe fed into the partitioning stage is different.
df.drop('c').explain()
# == Physical Plan ==
# Exchange RoundRobinPartitioning(8)
# +- *(1) Project [a#70L]
#    +- Scan ExistingRDD[a#70L,c#71L]
As mentioned in another answer to this question, the round-robin partitioner is random across different inputs but deterministic for the same input. So if the operation changes the underlying data, the resulting partitioning will be different.
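If the row order actually matters downstream, the safer approach is not to rely on the physical partitioning at all. Here is a minimal sketch of two options, reusing the df from above; note that Spark's ordering contract still gives no hard guarantee without an explicit sort:
# Option 1: impose an explicit order whenever you need one.
df.drop('c').orderBy('a').show()

# Option 2 (sketch): persist the repartitioned dataframe so the round-robin
# exchange is computed only once; later transformations then read the cached
# partitions instead of re-running the shuffle.
df.cache()
df.count()           # forces materialization of the cached partitions
df.drop('c').show()  # now projects from the cached data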

Related

Spark SQL: Why am I seeing 3 jobs instead of one single job in the Spark UI?

As per my understanding, there will be one job for each action in Spark.
But often I see more than one job triggered for a single action.
I was trying to test this by doing a simple aggregation on a dataset to get the maximum from each category (here the "subject" field).
students.show(5)
+----------+--------------+----------+----+-------+-----+-----+
|student_id|exam_center_id| subject|year|quarter|score|grade|
+----------+--------------+----------+----+-------+-----+-----+
| 1| 1| Math|2005| 1| 41| D|
| 1| 1| Spanish|2005| 1| 51| C|
| 1| 1| German|2005| 1| 39| D|
| 1| 1| Physics|2005| 1| 35| D|
| 1| 1| Biology|2005| 1| 53| C|
| 1| 1|Philosophy|2005| 1| 73| B|
// Task : Find Highest Score in each subject
val highestScores = students.groupBy("subject").max("score")
highestScores.show(10)
+----------+----------+
| subject|max(score)|
+----------+----------+
| Spanish| 98|
|Modern Art| 98|
| French| 98|
| Physics| 98|
| Geography| 98|
| History| 98|
| English| 98|
| Classics| 98|
| Math| 98|
|Philosophy| 98|
+----------+----------+
only showing top 10 rows
While examining the Spark UI, I can see there are 3 "jobs" executed for the groupBy operation, while I was expecting just one.
Can anyone help me understand why there are 3 instead of just 1?
== Physical Plan ==
*(2) HashAggregate(keys=[subject#12], functions=[max(score#15)])
+- Exchange hashpartitioning(subject#12, 1)
   +- *(1) HashAggregate(keys=[subject#12], functions=[partial_max(score#15)])
      +- *(1) FileScan csv [subject#12,score#15] Batched: false, Format: CSV, Location: InMemoryFileIndex[file:/C:/lab/SparkLab/files/exams/students.csv], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<subject:string,score:int>
I think only #3 does the actual "job" (it executes the plan, which you'll see if you open Details for the query on the SQL tab). The other two are preparatory steps:
#1 queries the NameNode to build the InMemoryFileIndex needed to read your CSV, and
#2 samples the dataset in order to execute .groupBy("subject").max("score"), which internally requires a sortByKey.
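As an aside, if the extra preparatory jobs come from schema inference on the CSV, supplying an explicit schema avoids that sampling pass. A rough sketch in PySpark (the column names are taken from the show() output above; the types, the file path, and whether inference is actually enabled in your read are assumptions):
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# Assumed schema based on the columns visible in students.show(); adjust types as needed.
schema = StructType([
    StructField("student_id", IntegerType()),
    StructField("exam_center_id", IntegerType()),
    StructField("subject", StringType()),
    StructField("year", IntegerType()),
    StructField("quarter", IntegerType()),
    StructField("score", IntegerType()),
    StructField("grade", StringType()),
])

# With an explicit schema, Spark does not need to sample the file to infer types.
students = spark.read.csv("students.csv", header=True, schema=schema)
highestScores = students.groupBy("subject").max("score")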
I would suggest checking the physical plan:
highestScores.explain()
You might see something like:
*(2) HashAggregate(keys=[subject#9], functions=[max(score#12)], output=[subject#9, max(score)#51])
+- Exchange hashpartitioning(subject#9, 2)
   +- *(1) HashAggregate(keys=[subject#9], functions=[partial_max(score#12)], output=[subject#9, max#61])
[Map stage] Stage #1 performs the local (partial) aggregation, after which the data is shuffled using hashpartitioning(subject). Note that the hash partitioner uses the group-by column.
[Reduce stage] Stage #2 merges the output of stage #1 to get the final max(score).
The result is then used to print the top 10 records with show(10).

Apache Spark: Get the first and last row of each partition

I would like to get the first and last row of each partition in spark (I'm using pyspark). How do I go about this?
In my code I repartition my dataset based on a key column using:
mydf.repartition(keyColumn).sortWithinPartitions(sortKey)
Is there a way to get the first row and last row for each partition?
Thanks
I would highly advise against working with partitions directly. Spark does a lot of DAG optimisation, so when you try executing specific functionality on each partition, all your assumptions about the partitions and their distribution might be completely false.
However, you seem to have a keyColumn and a sortKey, so I'd suggest doing the following:
import pyspark
import pyspark.sql.functions as f
w_asc = pyspark.sql.Window.partitionBy(keyColumn).orderBy(f.asc(sortKey))
w_desc = pyspark.sql.Window.partitionBy(keyColumn).orderBy(f.desc(sortKey))
res_df = mydf. \
    withColumn("rn_asc", f.row_number().over(w_asc)). \
    withColumn("rn_desc", f.row_number().over(w_desc)). \
    where("rn_asc = 1 or rn_desc = 1")
The resulting dataframe will have 2 additional columns, where rn_asc=1 indicates the first row and rn_desc=1 indicates the last row.
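For concreteness, here is a minimal self-contained sketch of the same idea; the id and value columns and the toy data are made up to stand in for your keyColumn and sortKey:
import pyspark.sql.functions as f
from pyspark.sql import Window

# Toy data: "id" plays the role of keyColumn, "value" the role of sortKey.
mydf = spark.createDataFrame(
    [(1, 1), (1, 2), (1, 3), (1, 4), (2, 1), (2, 2), (2, 3), (3, 1), (3, 3), (3, 5)],
    ["id", "value"])

w_asc = Window.partitionBy("id").orderBy(f.asc("value"))
w_desc = Window.partitionBy("id").orderBy(f.desc("value"))

res_df = mydf. \
    withColumn("rn_asc", f.row_number().over(w_asc)). \
    withColumn("rn_desc", f.row_number().over(w_desc)). \
    where("rn_asc = 1 or rn_desc = 1")

res_df.show()  # rn_asc = 1 marks the first row per id, rn_desc = 1 the last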
Scala: I think repartition does not take just a key column; it requires an integer for how many partitions you want. Here is a way to select the first and last row using Spark's Window function.
First, this is my test data.
+---+-----+
| id|value|
+---+-----+
| 1| 1|
| 1| 2|
| 1| 3|
| 1| 4|
| 2| 1|
| 2| 2|
| 2| 3|
| 3| 1|
| 3| 3|
| 3| 5|
+---+-----+
Then I use the Window function twice, because identifying the last row directly is not easy, while doing it in reverse order is.
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, rank, when}
val a = Window.partitionBy("id").orderBy("value")
val d = Window.partitionBy("id").orderBy(col("value").desc)
val df = spark.read.option("header", "true").csv("test.csv")
df.withColumn("marker", when(rank.over(a) === 1, "Y").otherwise("N"))
.withColumn("marker", when(rank.over(d) === 1, "Y").otherwise(col("marker")))
.filter(col("marker") === "Y")
.drop("marker").show
The final result is then,
+---+-----+
| id|value|
+---+-----+
| 3| 5|
| 3| 1|
| 1| 4|
| 1| 1|
| 2| 3|
| 2| 1|
+---+-----+
Here is another approach using mapPartitions from the RDD API. We iterate over the elements of each partition until we reach the end. I would expect this to be fast since we make a single pass over each partition and keep only the two edge elements. Here is the code:
df = spark.createDataFrame([
    ["Tom", "a"],
    ["Dick", "b"],
    ["Harry", "c"],
    ["Elvis", "d"],
    ["Elton", "e"],
    ["Sandra", "f"]
], ["name", "toy"])

def get_first_last(it):
    try:
        first = last = next(it)
    except StopIteration:
        return []  # empty partition: nothing to return
    for last in it:
        pass
    # Attention: if first equals last by reference return only one!
    if first is last:
        return [first]
    return [first, last]

# coalesce here is just for demonstration
first_last_rdd = df.coalesce(2).rdd.mapPartitions(get_first_last)
spark.createDataFrame(first_last_rdd, ["name", "toy"]).show()
# +------+---+
# |  name|toy|
# +------+---+
# |   Tom|  a|
# | Harry|  c|
# | Elvis|  d|
# |Sandra|  f|
# +------+---+
PS: Odd positions will contain the first element of each partition and even positions the last one. Also note that the number of results will be (numPartitions * 2) - numPartitionsWithOneItem, which I expect to be relatively small, so you shouldn't worry about the cost of the extra createDataFrame call.

Left anti join in groups

I have a dataframe that has two columns a and b where the values in the b column are a subset of the values in the a column. For instance:
df
+---+---+
| a| b|
+---+---+
| 1| 2|
| 1| 3|
| 2| 1|
| 3| 2|
+---+---+
I’d like to produce a dataframe with columns a and anti_b where the values in the anti_b column are any values from the a column such that a!=anti_b and the row (a,anti_b) does not appear in the original dataframe. So in the above dataframe, the result should be:
anti df
+---+------+
| a|anti_b|
+---+------+
| 3| 1|
| 2| 3|
+---+------+
This can be accomplished with a crossJoin and a call to array_contains, but it’s extremely slow and inefficient. Does anyone know of a better spark idiom to accomplish this, something like anti_join?
Here is the inefficient example using a small dataframe so you can see what I’m after:
df = spark.createDataFrame(pandas.DataFrame(numpy.array(
    [[1, 2], [1, 3], [2, 1], [3, 2]]), columns=['a', 'b']))

crossed_df = df.select('a').withColumnRenamed('a', '_a').distinct().crossJoin(
    df.select('a').withColumnRenamed('a', 'anti_b').distinct()
).where(pyspark.sql.functions.col('_a') != pyspark.sql.functions.col('anti_b'))

anti_df = df.groupBy(
    'a'
).agg(
    pyspark.sql.functions.collect_list('b').alias('bs')
).join(
    crossed_df,
    on=((pyspark.sql.functions.col('a') == pyspark.sql.functions.col('_a')) &
        (~pyspark.sql.functions.expr('array_contains(bs, anti_b)'))),
    how='inner'
).select(
    'a', 'anti_b'
)

print('df')
df.show()
print('anti df')
anti_df.show()
Edit: This also works, but it's not much faster:
df = spark.createDataFrame(pandas.DataFrame(numpy.array(
    [[1, 2], [1, 3], [2, 1], [3, 2]]), columns=['a', 'b']))

crossed_df = df.select('a').distinct().crossJoin(
    df.select('a').withColumnRenamed('a', 'b').distinct()
).where(pyspark.sql.functions.col('a') != pyspark.sql.functions.col('b'))

anti_df = crossed_df.join(
    df,
    on=['a', 'b'],
    how='left_anti'
)
This should be better than what you have:
from pyspark.sql.functions import collect_set, expr

anti_df = df.groupBy("a").agg(collect_set("b").alias("bs")).alias("l")\
    .join(df.alias("r"), on=expr("NOT array_contains(l.bs, r.b)"))\
    .where("l.a != r.b")\
    .selectExpr("l.a", "r.b AS anti_b")

anti_df.show()
#+---+------+
#| a|anti_b|
#+---+------+
#| 3| 1|
#| 2| 3|
#+---+------+
If you compare this execution plan with your methods, you'll see that it is better (because you can swap out distinct for collect_set), but it still has a Cartesian product.
anti_df.explain()
#== Physical Plan ==
#*(3) Project [a#0, b#294 AS anti_b#308]
#+- CartesianProduct (NOT (a#0 = b#294) && NOT array_contains(bs#288, b#294))
#   :- *(1) Filter isnotnull(a#0)
#   :  +- ObjectHashAggregate(keys=[a#0], functions=[collect_set(b#1, 0, 0)])
#   :     +- Exchange hashpartitioning(a#0, 200)
#   :        +- ObjectHashAggregate(keys=[a#0], functions=[partial_collect_set(b#1, 0, 0)])
#   :           +- Scan ExistingRDD[a#0,b#1]
#   +- *(2) Project [b#294]
#      +- *(2) Filter isnotnull(b#294)
#         +- Scan ExistingRDD[a#293,b#294]
However, I don't think there's any way to avoid the Cartesian product for this particular problem without more information.
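One hedged tweak: if the dataframe (or at least the side being paired against) is small enough to broadcast, adding a broadcast hint turns the CartesianProduct into a BroadcastNestedLoopJoin, which avoids shuffling the aggregated side. This is only a sketch of the same query as above with the hint added, not a way to remove the pairwise comparison itself:
from pyspark.sql.functions import broadcast, collect_set, expr

# Same query as above, but hinting that the right side should be broadcast.
# Only worthwhile when df is small enough to ship to every executor.
anti_df = df.groupBy("a").agg(collect_set("b").alias("bs")).alias("l") \
    .join(broadcast(df.alias("r")), on=expr("NOT array_contains(l.bs, r.b)")) \
    .where("l.a != r.b") \
    .selectExpr("l.a", "r.b AS anti_b")

anti_df.explain()  # the plan should now show BroadcastNestedLoopJoin instead of CartesianProduct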

How to rename duplicated columns after join? [duplicate]

This question already has answers here:
How to avoid duplicate columns after join?
(10 answers)
Closed 4 years ago.
I want to join 3 dataframes, but there are some columns we don't need or that have duplicate names in the other dataframes, so I want to drop some columns like below:
result_df = (aa_df.join(bb_df, 'id', 'left')
             .join(cc_df, 'id', 'left')
             .withColumnRenamed(bb_df.status, 'user_status'))
Please note that the status column is in two dataframes, i.e. aa_df and bb_df.
The above doesn't work. I also tried using withColumn, but the new column is created and the old column still exists.
If you are trying to rename the status column of the bb_df dataframe, then you can do so while joining:
result_df = aa_df.join(bb_df.withColumnRenamed('status', 'user_status'),'id', 'left').join(cc_df, 'id', 'left')
I want to join 3 dataframes, but there are some columns we don't need or that have duplicate names in the other dataframes
That's a fine use case for aliasing a Dataset using alias or as operators.
alias(alias: String): Dataset[T] or alias(alias: Symbol): Dataset[T]
Returns a new Dataset with an alias set. Same as as.
as(alias: String): Dataset[T] or as(alias: Symbol): Dataset[T]
Returns a new Dataset with an alias set.
(And honestly, I only just now noticed the Symbol-based variants.)
NOTE There are two as operators, as for aliasing and as for type mapping. Consult the Dataset API.
After you've aliased a Dataset, you can reference columns using the [alias].[columnName] format. This is particularly handy with joins and star column dereferencing using *.
val ds1 = spark.range(5)
scala> ds1.as('one).select($"one.*").show
+---+
| id|
+---+
| 0|
| 1|
| 2|
| 3|
| 4|
+---+
val ds2 = spark.range(10)
// Using joins with aliased datasets
// where clause is in a longer form to demo how to reference columns by alias
scala> ds1.as('one).join(ds2.as('two)).where($"one.id" === $"two.id").show
+---+---+
| id| id|
+---+---+
| 0| 0|
| 1| 1|
| 2| 2|
| 3| 3|
| 4| 4|
+---+---+
so I want to drop some columns like below
My general recommendation is not to drop columns, but to select what you want to include in the result. That makes life more predictable, as you know what you get (not what you don't). I was told that our brains work in positives, which could also be a point in favour of select.
So, as you asked and I showed in the above example, the result has two columns of the same name id. The question is how to have only one.
There are at least two answers that use the variant of the join operator with the join columns or condition included (as you showed in your question), but that would not answer your real question about "dropping unwanted columns", would it?
Given I prefer select (over drop), I'd do the following to have a single id column:
val q = ds1.as('one)
  .join(ds2.as('two))
  .where($"one.id" === $"two.id")
  .select("one.*") // <-- select columns from "one" dataset
scala> q.show
+---+
| id|
+---+
| 0|
| 1|
| 2|
| 3|
| 4|
+---+
Regardless of the reasons why you asked the question (which could also be answered with the points I raised above), let me answer the (burning) question of how to use withColumnRenamed when there are two matching columns (after the join).
Let's assume you ended up with the following query and so you've got two id columns (per join side).
val q = ds1.as('one)
  .join(ds2.as('two))
  .where($"one.id" === $"two.id")
scala> q.show
+---+---+
| id| id|
+---+---+
| 0| 0|
| 1| 1|
| 2| 2|
| 3| 3|
| 4| 4|
+---+---+
withColumnRenamed won't work for this use case since it does not accept aliased column names.
scala> q.withColumnRenamed("one.id", "one_id").show
+---+---+
| id| id|
+---+---+
| 0| 0|
| 1| 1|
| 2| 2|
| 3| 3|
| 4| 4|
+---+---+
You could select the columns you're interested in as follows:
scala> q.select("one.id").show
+---+
| id|
+---+
| 0|
| 1|
| 2|
| 3|
| 4|
+---+
scala> q.select("two.*").show
+---+
| id|
+---+
| 0|
| 1|
| 2|
| 3|
| 4|
+---+
Please see the docs: withColumnRenamed().
You need to pass the name of the existing column and the new name to the function. Both of these should be strings.
result_df = aa_df.join(bb_df,'id', 'left').join(cc_df, 'id', 'left').withColumnRenamed('status', 'user_status')
If you have 'status' columns in both dataframes, you can include them in the join, as in aa_df.join(bb_df, ['id', 'status'], 'left'), assuming aa_df and bb_df share that column. This way you will not end up with two 'status' columns.
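Putting the two suggestions together, here is a hedged sketch (assuming aa_df, bb_df and cc_df look like the question describes: all share id, and status is present in both aa_df and bb_df):
# Variant 1: rename the clashing column on bb_df before joining.
result_df = (aa_df
             .join(bb_df.withColumnRenamed('status', 'user_status'), 'id', 'left')
             .join(cc_df, 'id', 'left'))

# Variant 2: if the two status columns are expected to agree row-for-row,
# join on both columns so only one status column survives.
result_df = (aa_df
             .join(bb_df, ['id', 'status'], 'left')
             .join(cc_df, 'id', 'left'))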

PySpark: Randomize rows in dataframe

I have a dataframe and I want to randomize its rows. I tried sampling the data with a fraction of 1, which didn't work (interestingly, this works in Pandas).
It works in Pandas because taking a sample on a local system is typically implemented by shuffling the data. Spark, on the other hand, avoids shuffling by performing linear scans over the data. This means that sampling in Spark only randomizes which rows are members of the sample, not their order.
You can order the DataFrame by a column of random numbers:
from pyspark.sql.functions import rand
df = sc.parallelize(range(20)).map(lambda x: (x, )).toDF(["x"])
df.orderBy(rand()).show(3)
## +---+
## | x|
## +---+
## | 2|
## | 7|
## | 14|
## +---+
## only showing top 3 rows
but it is:
expensive - because it requires a full shuffle, which is something you typically want to avoid.
suspicious - because the order of values in a DataFrame is not something you can really depend on in non-trivial cases, and since a DataFrame doesn't support indexing, it is relatively useless without collecting.
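One small, hedged addition: if you need the shuffled order to be reproducible (e.g. in tests), rand() accepts a seed, so the same seed over the same data yields the same ordering:
from pyspark.sql.functions import rand

# Same full shuffle as above, but deterministic for a fixed seed and fixed input.
df.orderBy(rand(seed=42)).show(3)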
This code works for me without any RDD operations:
import pyspark.sql.functions as F
df = df.select("*").orderBy(F.rand())
Here is a more elaborate example:
import pandas as pd
import pyspark.sql.functions as F

# Example: create a DataFrame for the example
pandas_df = pd.DataFrame(([1, 2], [3, 1], [4, 2], [7, 2], [32, 7], [123, 3]), columns=["id", "col1"])
df = sqlContext.createDataFrame(pandas_df)
df = df.select("*").orderBy(F.rand())
df.show()
+---+----+
| id|col1|
+---+----+
| 1| 2|
| 3| 1|
| 4| 2|
| 7| 2|
| 32| 7|
|123| 3|
+---+----+
df.select("*").orderBy(F.rand()).show()
+---+----+
| id|col1|
+---+----+
| 7| 2|
|123| 3|
| 3| 1|
| 4| 2|
| 32| 7|
| 1| 2|
+---+----+
