I ran into a surprising behavior when using .select():
>>> my_df.show()
+---+---+---+
| a| b| c|
+---+---+---+
| 1| 3| 5|
| 2| 4| 6|
+---+---+---+
>>> a_c = my_df.select(col("a"), col("c")) # removing column b
>>> a_c.show()
+---+---+
| a| c|
+---+---+
| 1| 5|
| 2| 6|
+---+---+
>>> a_c.filter(col("b") == 3).show() # I can still filter on "b"!
+---+---+
| a| c|
+---+---+
| 1| 5|
+---+---+
This behavior got me wondering... Are the following points correct?
DataFrames are just views; a simple DataFrame is a view of itself. In my case, a_c is just a view into my_df.
When I created a_c, no new data was created; a_c is just pointing at the same data that my_df points to.
If there is additional information that is relevant, please add!
This is happening because of the lazy nature of Spark. It is "smart" enough to push the filter down so that it happens at a lower level, before the select*. Since this all happens within the same stage of execution, the column can still be resolved. In fact, you can see this in explain:
== Physical Plan ==
*Project [a#0, c#2]
+- *Filter (b#1 = 3) <---Filter before Project
+- LocalTableScan [A#0, B#1, C#2]
You can force a shuffle and a new stage, though, and then see your filter fail; the error is even caught at analysis time, before any job runs. Here's an example:
a_c.groupBy("a", "c").count().filter(col("b") == 3)
*There is also projection pruning, which pushes the column selection down to the data-source layer if Spark realizes it doesn't need the column at any point. However, I believe the filter would cause it to "need" the column and not prune it... but I didn't test that.
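If you want to check the pushdown yourself, here is a minimal PySpark sketch (assuming an active SparkSession named spark and rebuilding the my_df from the question); the Filter should show up below the Project in the output of explain():
from pyspark.sql.functions import col

# Rebuild the small example frame from the question
my_df = spark.createDataFrame([(1, 3, 5), (2, 4, 6)], ["a", "b", "c"])

a_c = my_df.select(col("a"), col("c"))
# The physical plan should show the Filter on b below the Project [a, c],
# which is why the "missing" column can still be resolved
a_c.filter(col("b") == 3).explain()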
Let us start with some basics about Spark's internals; this will make it easier to understand.
RDD: Underlying Spark Core is a data structure called the RDD, which is lazily evaluated. By lazy evaluation we mean that RDD computation happens only when an action is called (like count on an RDD or show on a Dataset).
A Dataset or DataFrame (which is a Dataset[Row]) also uses RDDs at its core.
This means every transformation (like filter) will be realized only when an action is triggered (show).
So, regarding your question:
"When I created a_c no new data was created, a_c is just pointing at the same data my_df is pointing."
Correct: no data has been realized at that point; it has to be realized to be brought into memory, and your filter works against the initial dataframe.
The only way to make your a_c.filter(col("b") == 3).show() throw a runtime exception is to cache your intermediate dataframe using dataframe.cache. Spark will then throw an
org.apache.spark.sql.AnalysisException: Cannot resolve column name
E.g. (in Scala):
val a_c = my_df.select(col("a"), col("c")).cache
a_c.filter(col("b") === 3).show()
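For completeness, a PySpark version of the same sketch, using the my_df from the question; per the explanation above, the filter on the cached result should fail with the AnalysisException rather than returning rows:
from pyspark.sql.functions import col

a_c = my_df.select(col("a"), col("c")).cache()
# Per the answer above, filtering the cached result on the dropped
# column should now raise an AnalysisException instead of returning rows
a_c.filter(col("b") == 3).show()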
I create a dataframe:
import pandas as pd

df = spark.createDataFrame(pd.DataFrame({'a': range(12), 'c': range(12)})).repartition(8)
Its contents are:
df.show()
+---+---+
| a| c|
+---+---+
| 0| 0|
| 1| 1|
| 3| 3|
| 5| 5|
| 6| 6|
| 8| 8|
| 9| 9|
| 10| 10|
| 2| 2|
| 4| 4|
| 7| 7|
| 11| 11|
+---+---+
But if I drop a column, the rows of the remaining column come back permuted:
df.drop('c').show()
+---+
| a|
+---+
| 0|
| 2|
| 3|
| 5|
| 6|
| 7|
| 9|
| 11|
| 1|
| 4|
| 8|
| 10|
+---+
Please help me understand what is happening here?
I wanted to add my answer since I felt I could explain this slightly differently.
The repartition results in RoundRobinPartitioning; it basically redistributes the data in round-robin fashion.
Since you are evaluating the dataframe again, the partitioning is recomputed after the drop.
You can see this by running a few commands in addition to what you have shown.
df = spark.createDataFrame(pd.DataFrame({'a':range(12),'c':range(12)})).repartition(8)
df.explain()
# == Physical Plan ==
# Exchange RoundRobinPartitioning(8)
# +- Scan ExistingRDD[a#14L,c#15L]
print("Partitions structure: {}".format(df.rdd.glom().collect()))
# Partitions structure: [[], [], [], [], [], [], [Row(a=0, c=0), Row(a=1, c=1), Row(a=3, c=3), Row(a=5, c=5), Row(a=6, c=6), Row(a=8, c=8), Row(a=9, c=9), Row(a=10, c=10)], [Row(a=2, c=2), Row(a=4, c=4), Row(a=7, c=7), Row(a=11, c=11)]]
temp = df.drop("c")
temp.explain()
# == Physical Plan ==
# Exchange RoundRobinPartitioning(8)
# +- *(1) Project [a#14L]
# +- Scan ExistingRDD[a#14L,c#15L]
print("Partitions structure: {}".format(temp.rdd.glom().collect()))
# Partitions structure: [[], [], [], [], [], [], [Row(a=0), Row(a=2), Row(a=3), Row(a=5), Row(a=6), Row(a=7), Row(a=9), Row(a=11)], [Row(a=1), Row(a=4), Row(a=8), Row(a=10)]]
In the above code, the explain() shows the RoundRobinPartitioning taking place. The use of glom shows the redistribution of data across partitions.
In the original dataframe, the partitions are in the order that you see the results of show().
In the second dataframe above, you can see that the data has shuffled across the last two partitions, resulting in it not being in the same order. This is because when re-evaluating the dataframe the repartition runs again.
Edits as per discussion in the comments
If you run df.drop('b'), you are trying to drop a column that doesn't exist, so it's really what is called a no-op (no operation). The partitioning therefore doesn't change.
df.drop('b').explain()
# == Physical Plan ==
# Exchange RoundRobinPartitioning(8)
# +- Scan ExistingRDD[a#70L,c#71L]
Similarly, if you add a column and run it, the round-robin partitioning runs before the column is added. This again results in the same partitioning, and hence the order is consistent with the original dataframe.
import pyspark.sql.functions as f
df.withColumn('tt', f.rand()).explain()
# == Physical Plan ==
# *(1) Project [a#70L, c#71L, rand(-3030352041536166328) AS tt#76]
# +- Exchange RoundRobinPartitioning(8)
# +- Scan ExistingRDD[a#70L,c#71L]
In the case of df.drop('c'), the column is first dropped and then the partitioner is applied. This results in a different partitioning, because the dataframe that feeds the partitioner in the stage before the exchange is different.
df.drop('c').explain()
# == Physical Plan ==
# Exchange RoundRobinPartitioning(8)
# +- *(1) Project [a#70L]
# +- Scan ExistingRDD[a#70L,c#71L]
As mentioned in another answer to this question, the round-robin partitioner is random for different data, but consistent with the same data on which the partition is run. So if the underlying data changes from the operation, the resulting partition will be different.
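If a stable display order is actually required, one option (a minimal sketch, not something the answer above relies on) is an explicit sort, so the output no longer depends on how the round-robin exchange distributed the rows:
# Sorting pins the output order explicitly, regardless of how
# RoundRobinPartitioning happened to distribute the rows
df.drop('c').orderBy('a').show()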
I have been struggling to understand this weird Spark stream behaviour.
I want to write 2 CSV files into a Delta table using Spark Structured Streaming.
I made this example only to understand how streams work; I don't want to use other solutions, I just need to understand why this is not working.
So, I have two CSV files in /test/input:
A.csv
+---+---+
| id| x|
+---+---+
| 1| A|
| 2| B|
| 3| C|
+---+---+
B.csv
+---+---+
| id| x|
+---+---+
| 4| D|
| 5| E|
+---+---+
I read the directory (so the union of the two dataframes above) as a stream:
schema = StructType([StructField("id",IntegerType(),True), StructField("x",StringType(),True)])
df = spark.readStream.format("csv").schema(schema).option("ignoreChanges", "true").option("delimiter", ";").option("header", True).load("/test/input")
I then wanted to write this stream using the following code:
def processDf(df, epoch_id):
    Ids = [x.id for x in df.select("id").distinct().collect()]
    for i in Ids:
        temp_df = df.filter(df.id == i)
        temp_df.write.format("delta").option("mergeSchema", "true").partitionBy("id").option("replaceWhere", "id==" + str(i)).mode("append").save("/test/res")
df.writeStream.format("delta").foreachBatch(processDf).queryName("x").option("checkpointLocation", "/test/check").trigger(once=True).start()
No errors are shown. The code executes successfully.
When I go to check the files in /test/res, I find all the data there.
But when I read the Delta table back, I notice that only the first row is present:
df= (spark.read.format("delta").option("sep", ";").option("header", "true").load("/test/res")).cache()
+---+---+
| id| x|
+---+---+
| 1| A|
+---+---+
Why isn't it inserting all the lines? Is it the replaceWhere option?
replaceWhere is supposed to delete only the partitions that are already in the table and got updated in source data.
What am I doing wrong, please?
EDIT:
The same behaviour occurs even if I read only one CSV as input; the code still writes only one line to the output instead of all the lines.
This was actually a syntax error. I changed the loop block to the following and it worked:
for i in ids:
    i = str(i)
    tmp = df.filter(df.id == i)
    tmp.write.format("delta").option("mergeSchema", "true").partitionBy(PartitionKey).option("replaceWhere", "id == '{i}'".format(i=i)).save("/res/")
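For reference, here is a self-contained sketch of the corrected foreachBatch handler; the /test/res path and the id partition column are taken from the question, and mode("overwrite") is assumed because the Delta Lake documentation pairs replaceWhere with overwrite mode:
from pyspark.sql import functions as F

def process_batch(batch_df, epoch_id):
    # Distinct ids present in this micro-batch
    ids = [row.id for row in batch_df.select("id").distinct().collect()]
    for i in ids:
        (batch_df.filter(F.col("id") == i)
            .write.format("delta")
            .partitionBy("id")
            # Only the partition matching this predicate gets replaced
            .option("replaceWhere", "id == {}".format(i))
            .mode("overwrite")
            .save("/test/res"))

# Hooked up as in the question:
# df.writeStream.foreachBatch(process_batch).option("checkpointLocation", "/test/check").trigger(once=True).start()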
I've come across something strange recently in Spark. As far as I understand, given the column-based storage method of Spark DataFrames, the order of the columns really doesn't have any meaning; they're like keys in a dictionary.
During a df1.union(df2), does the order of the columns matter? I would have assumed that it shouldn't, but according to the wisdom of the SQL forums, it does.
So we have df1
df1
| a| b|
+---+----+
| 1| asd|
| 2|asda|
| 3| f1f|
+---+----+
df2
| b| a|
+----+---+
| asd| 1|
|asda| 2|
| f1f| 3|
+----+---+
result
| a| b|
+----+----+
| 1| asd|
| 2|asda|
| 3| f1f|
| asd| 1|
|asda| 2|
| f1f| 3|
+----+----+
It looks like the schema from df1 was used, but the data appears to have been appended following the column order of its original dataframe.
Obviously the solution would be to do df1.union(df2.select(df1.columns))
But the main question is, why does it do this? Is it simply because it's part of pyspark.sql, or is there some underlying data architecture in Spark that I've goofed up in understanding?
Code to create the test set, if anyone wants to try:
import pandas as pd

d1 = {'a': [1, 2, 3], 'b': ['asd', 'asda', 'f1f']}
d2 = {'b': ['asd', 'asda', 'f1f'], 'a': [1, 2, 3]}
pdf1 = pd.DataFrame(d1)
pdf2 = pd.DataFrame(d2)
df1=spark.createDataFrame(pdf1)
df2=spark.createDataFrame(pdf2)
test=df1.union(df2)
The Spark union is implemented according to standard SQL and therefore resolves the columns by position. This is also stated by the API documentation:
Return a new DataFrame containing union of rows in this and another frame.
This is equivalent to UNION ALL in SQL. To do a SQL-style set union (that does deduplication of elements), use this function followed by a distinct.
Also as standard in SQL, this function resolves columns by position (not by name).
Since Spark >= 2.3 you can use unionByName to union two dataframes with the columns resolved by name.
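For example, with the df1 and df2 from the question (Spark >= 2.3), a short sketch:
# unionByName matches columns by name instead of position,
# so df2's (b, a) column order no longer matters
test = df1.unionByName(df2)
test.show()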
In Spark, union is not done on the metadata of the columns, and the data is not shuffled the way you might think. Rather, union is done by column position: if you are unioning 2 DataFrames, both must have the same number of columns, and you have to take the positions of your columns into consideration before doing the union. Unlike SQL or Oracle or other RDBMSs, the underlying files in Spark are physical files. Hope that answers your question.
I have 2 dataframes with columns as shown below.
Note: the column uid is not a unique key, and there are duplicate rows with the same uid in the dataframes.
val df1 = spark.read.parquet(args(0)).drop("sv")
val df2 = spark.read.parquet(args(1))
scala> df1.orderBy("uid").show
+----+----+---+
| uid| hid| sv|
+----+----+---+
|uid1|hid2| 10|
|uid1|hid1| 10|
|uid1|hid3| 10|
|uid2|hid1| 2|
|uid3|hid2| 10|
|uid4|hid2| 3|
|uid5|hid3| 5|
+----+----+---+
scala> df2.orderBy("uid").show
+----+----+---+
| uid| pid| sv|
+----+----+---+
|uid1|pid2| 2|
|uid1|pid1| 1|
|uid2|pid1| 2|
|uid3|pid1| 3|
|uid3|pidx|999|
|uid3|pid2| 4|
|uidx|pid1| 2|
+----+----+---+
scala> df1.drop("sv")
.join(df2, "uid")
.groupBy("hid", "pid")
.agg(count("*") as "xcnt", sum("sv") as "xsum", avg("sv") as "xavg")
.orderBy("hid").show
+----+----+----+----+-----+
| hid| pid|xcnt|xsum| xavg|
+----+----+----+----+-----+
|hid1|pid1| 2| 3| 1.5|
|hid1|pid2| 1| 2| 2.0|
|hid2|pid2| 2| 6| 3.0|
|hid2|pidx| 1| 999|999.0|
|hid2|pid1| 2| 4| 2.0|
|hid3|pid1| 1| 1| 1.0|
|hid3|pid2| 1| 2| 2.0|
+----+----+----+----+-----+
In this demo case, everything looks good.
But when I apply the same operations on the production large data, the final output contains many duplicate rows (of same (hid, pid) pair).
I thought the groupBy operator would behave like select distinct hid, pid from ..., but obviously not.
So what's wrong with my operation? Should I repartition the dataframe by hid, pid?
Thanks!
-- Update
And if I add .drop("uid") once I join the dataframes, some rows go missing from the final output.
scala> df1.drop("sv")
.join(df2, "uid").drop("uid")
.groupBy("hid", "pid")
.agg(count("*") as "xcnt", sum("sv") as "xsum", avg("sv") as "xavg")
.orderBy("hid").show
To be honest I think that there are problems with the data, not the code. Of course there shouldn't be any duplicates if pid and hid are truly different (I've seen some rogue Cyrillic symbols in data before).
To debug this issue you can try and see what combinations of 'uid' and sv values represent each duplicate row.
df1.drop( "sv" )
.join(df2, "uid")
.groupBy( "hid", "pid" )
.agg( collect_list( "uid" ), collect_list( "sv" ) )
.orderBy( "hid" )
.show
After that you'll have some start point to assess your data. Or, if the lists of uid (and 'sv') are the same, file a bug.
I think I might have found the root cause.
Maybe this is caused by AWS S3 consistency model.
The background is that I submitted 2 Spark jobs to create 2 tables, and submitted a third job to join the two tables (I split them so that if one fails I don't need to re-run the others).
I put these 3 spark-submit in a shell script running in sequence, and got the result with duplicated rows.
When I re-ran the last job just now, the result seems good.
With DataFrames, one can simply rename columns by using df.withColumnRenamed("oldName", "newName"). In Datasets, since every field is typed and named, this doesn't seem possible. The only workaround I can think of is to use map on the Dataset:
case class Orig(a: Int, b: Int)
case class OrigRenamed(a: Int, bNewName: Int)
val origDS = Seq(Orig(1,2), Orig(3,4)).toDS
origDS.show
+---+---+
| a| b|
+---+---+
| 1| 2|
| 3| 4|
+---+---+
// To rename with map
val origRenamedDS = origDS.map{ case Orig(x,y) => OrigRenamed(x,y) }
origRenamedDS.show
+---+--------+
| a|bNewName|
+---+--------+
| 1| 2|
| 3| 4|
+---+--------+
This seems a very round-about and inefficient way just to rename a column. Is there a better way?
A slightly more concise solution would be something like this:
origDS.toDF("a", "bNewName").as[OrigRenamed]
but in practice renaming is simply not meaningful on a statically typed Dataset. While it uses the same columnar representation as a DataFrame (Dataset[Row]), the semantics are completely different here.
The name of a column corresponds to a specific field of the stored objects, so it is not something that can be dynamically renamed. In other words, Datasets are not statically typed DataFrames but collections of objects.
You can make it slightly more concise, while maintaining semantics:
origDS.map(o => OrigRenamed(o.a, o.b)).show()