Spark union column order - apache-spark

I've come across something strange recently in Spark. As far as I understand, given the column-based storage model of Spark DataFrames, the order of the columns doesn't really have any meaning; they're like keys in a dictionary.
During a df.union(df2), does the order of the columns matter? I would've assumed that it shouldn't, but according to the wisdom of SQL forums it does.
So we have df1:
+---+----+
|  a|   b|
+---+----+
|  1| asd|
|  2|asda|
|  3| f1f|
+---+----+
df2:
+----+---+
|   b|  a|
+----+---+
| asd|  1|
|asda|  2|
| f1f|  3|
+----+---+
result:
+----+----+
|   a|   b|
+----+----+
|   1| asd|
|   2|asda|
|   3| f1f|
| asd|   1|
|asda|   2|
| f1f|   3|
+----+----+
It looks like the schema from df1 was used, but the data appears to have been appended following the column order of its original DataFrame.
Obviously the solution would be to do df1.union(df2.select(df1.columns))
But the main question is, why does it do this? Is it simply because it's part of pyspark.sql, or is there some underlying data architecture in Spark that I've goofed up in understanding?
Code to create the test set, if anyone wants to try it:
import pandas as pd

d1 = {'a': [1, 2, 3], 'b': ['asd', 'asda', 'f1f']}
d2 = {'b': ['asd', 'asda', 'f1f'], 'a': [1, 2, 3]}
pdf1 = pd.DataFrame(d1)
pdf2 = pd.DataFrame(d2)
df1 = spark.createDataFrame(pdf1)
df2 = spark.createDataFrame(pdf2)
test = df1.union(df2)

The Spark union is implemented according to standard SQL and therefore resolves the columns by position. This is also stated by the API documentation:
Return a new DataFrame containing union of rows in this and another frame.
This is equivalent to UNION ALL in SQL. To do a SQL-style set union (that does deduplication of elements), use this function followed by a distinct.
Also as standard in SQL, this function resolves columns by position (not by name).
Since Spark 2.3 you can use unionByName to union two DataFrames with the columns resolved by name.
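For example, with the df1 and df2 created in the question above, either of the following gives the expected result (a minimal sketch; unionByName needs Spark 2.3+):
# align the columns by name before the positional union
aligned = df1.union(df2.select(*df1.columns))
# or let Spark resolve the columns by name (Spark 2.3+)
by_name = df1.unionByName(df2)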

In Spark, union is not done on the column metadata, and the data is not shuffled the way you might think it would be. Rather, union is done by column position: if you are unioning two DataFrames, both must have the same number of columns, and you have to take the positions of your columns into consideration before doing the union. Unlike Oracle or other RDBMSs, the underlying files in Spark are plain physical files. Hope that answers your question.

Related

bucketing with QuantileDiscretizer using groupBy function in pyspark

I have a large dataset like so:
| SEQ_ID|RESULT|
+-------+------+
|3462099|239.52|
|3462099|239.66|
|3462099|239.63|
|3462099|239.64|
|3462099|239.57|
|3462099|239.58|
|3462099|239.53|
|3462099|239.66|
|3462099|239.63|
|3462099|239.52|
|3462099|239.58|
|3462099|239.52|
|3462099|239.64|
|3462099|239.71|
|3462099|239.64|
|3462099|239.65|
|3462099|239.54|
|3462099| 239.6|
|3462099|239.56|
|3462099|239.67|
The RESULT column is grouped by the SEQ_ID column.
I want to bucket/bin the RESULT values based on the counts of each group. After applying some aggregations, I have a DataFrame with the number of buckets that each SEQ_ID must be binned into, like so:
| SEQ_ID|num_buckets|
+-------+----------+
|3760290| 12|
|3462099| 5|
|3462099| 5|
|3760290| 13|
|3462099| 13|
|3760288| 10|
|3760288| 5|
|3461201| 6|
|3760288| 13|
|3718665| 18|
So for example, this tells me that the RESULT values that belong to the 3760290 SEQ_ID must be binned in 12 buckets.
For a single group, I would collect() the num_buckets value and do:
from pyspark.ml.feature import QuantileDiscretizer

discretizer = QuantileDiscretizer(numBuckets=num_buckets, inputCol='RESULT', outputCol='buckets')
df_binned = discretizer.fit(df).transform(df)
I understand that when using QuantileDiscretizer, each group would result in a separate DataFrame; I can then union them all.
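For concreteness, here is a minimal sketch of that loop-and-union approach (bucket_counts_df is an assumed name for the (SEQ_ID, num_buckets) frame shown above; df is the original data):
from functools import reduce
from pyspark.ml.feature import QuantileDiscretizer

# one num_buckets value per SEQ_ID (the sample frame above contains duplicates)
bucket_counts = bucket_counts_df.dropDuplicates(['SEQ_ID']).collect()

binned_parts = []
for row in bucket_counts:
    group_df = df.filter(df.SEQ_ID == row['SEQ_ID'])
    discretizer = QuantileDiscretizer(numBuckets=row['num_buckets'],
                                      inputCol='RESULT', outputCol='buckets')
    binned_parts.append(discretizer.fit(group_df).transform(group_df))

# union the per-group results back into one DataFrame
df_binned = reduce(lambda a, b: a.union(b), binned_parts)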
But how can I use QuantileDiscretizer to bin the various groups without using a for loop?

Spark Dataframe issue in overwriting the partition data of Hive table

Below is my Hive table definition:
CREATE EXTERNAL TABLE IF NOT EXISTS default.test2(
id integer,
count integer
)
PARTITIONED BY (
fac STRING,
fiscaldate_str DATE )
STORED AS PARQUET
LOCATION 's3://<bucket name>/backup/test2';
I have the data in the Hive table as below (I just inserted sample data):
select * from default.test2
+---+-----+----+--------------+
| id|count| fac|fiscaldate_str|
+---+-----+----+--------------+
| 2| 3| NRM| 2019-01-01|
| 1| 2| NRM| 2019-01-01|
| 2| 3| NRM| 2019-01-02|
| 1| 2| NRM| 2019-01-02|
| 2| 3| NRM| 2019-01-03|
| 1| 2| NRM| 2019-01-03|
| 2| 3|STST| 2019-01-01|
| 1| 2|STST| 2019-01-01|
| 2| 3|STST| 2019-01-02|
| 1| 2|STST| 2019-01-02|
| 2| 3|STST| 2019-01-03|
| 1| 2|STST| 2019-01-03|
+---+-----+----+--------------+
This table is partitioned on two columns (fac, fiscaldate_str), and we are trying to dynamically execute insert overwrite at the partition level by using Spark DataFrames (DataFrameWriter).
However, when trying this, we either end up with duplicate data or all the other partitions get deleted.
Below are the code snippets for this using Spark DataFrames.
First I create the DataFrame:
df = spark.createDataFrame([(99,99,'NRM','2019-01-01'),(999,999,'NRM','2019-01-01')], ['id','count','fac','fiscaldate_str'])
df.show(2,False)
+---+-----+---+--------------+
|id |count|fac|fiscaldate_str|
+---+-----+---+--------------+
|99 |99 |NRM|2019-01-01 |
|999|999 |NRM|2019-01-01 |
+---+-----+---+--------------+
I get duplicates with the snippet below:
df.coalesce(1).write.mode("overwrite").insertInto("default.test2")
All other data gets deleted and only the new data is available with either of the snippets below:
df.coalesce(1).write.mode("overwrite").saveAsTable("default.test2")
OR
df.createOrReplaceTempView("tempview")
tbl_ald_kpiv_hist_insert = spark.sql("""
INSERT OVERWRITE TABLE default.test2
partition(fac,fiscaldate_str)
select * from tempview
""")
I am using AWS EMR with Spark 2.4.0 and Hive 2.3.4-amzn-1 along with S3.
Does anyone have any idea why I am not able to dynamically overwrite the data in the partitions?
Your question is not easy to follow, but I think you mean that you want a partition overwritten. If so, then this is what you need - all you need is the second line:
df = spark.createDataFrame([(99,99,'AAA','2019-01-02'),(999,999,'BBB','2019-01-01')], ['id','count','fac','fiscaldate_str'])
df.coalesce(1).write.mode("overwrite").insertInto("test2",overwrite=True)
Note the overwrite=True. The comment made is neither here nor there, as the DF.writer is being used. I am not addressing the coalesce(1).
Comment to Asker
I ran this as I standardly do - when prototyping and answering here - on a Databricks Notebook and expressly set the following and it worked fine:
spark.conf.set("spark.sql.sources.partitionOverwriteMode","static")
spark.conf.set("hive.exec.dynamic.partition.mode", "strict")
You ask to update the answer with:
spark.conf.set("spark.sql.sources.partitionOverwriteMode","d‌​ynamic").
I can do as I have just done; maybe in your environment this is needed, but I certainly did not need to do so.
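For illustration, a minimal PySpark sketch combining that setting with insertInto, assuming the default.test2 table from the question (behaviour may differ by environment and Spark version):
# with dynamic partition overwrite, only the partitions present in df are rewritten
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

df = spark.createDataFrame(
    [(99, 99, 'NRM', '2019-01-01'), (999, 999, 'NRM', '2019-01-01')],
    ['id', 'count', 'fac', 'fiscaldate_str'])

# insertInto resolves columns by position, so the partition columns must come last
df.write.mode("overwrite").insertInto("default.test2")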
UPDATE 19/3/20
This worked on prior Spark releases; now the following applies, AFAICS:
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")
// In Databricks, the settings below did not matter
//spark.conf.set("hive.exec.dynamic.partition", "true")
//spark.conf.set("hive.exec.dynamic.partition.mode", "nonstrict")
Seq(("CompanyA1", "A"), ("CompanyA2", "A"),
("CompanyB1", "B"))
.toDF("company", "id")
.write
.mode(SaveMode.Overwrite)
.partitionBy("id")
.saveAsTable("KQCAMS9")
spark.sql(s"SELECT * FROM KQCAMS9").show(false)
val df = Seq(("CompanyA3", "A"))
.toDF("company", "id")
// disregard the coalesce(1)
df.coalesce(1).write.mode("overwrite").insertInto("KQCAMS9")
spark.sql(s"SELECT * FROM KQCAMS9").show(false)
spark.sql(s"show partitions KQCAMS9").show(false)
All OK this way now, from 2.4.x onwards.

What exactly does .select() do?

I ran into a surprising behavior when using .select():
>>> my_df.show()
+---+---+---+
| a| b| c|
+---+---+---+
| 1| 3| 5|
| 2| 4| 6|
+---+---+---+
>>> a_c = my_df.select(col("a"), col("c")) # removing column b
>>> a_c.show()
+---+---+
| a| c|
+---+---+
| 1| 5|
| 2| 6|
+---+---+
>>> a_c.filter(col("b") == 3).show() # I can still filter on "b"!
+---+---+
| a| c|
+---+---+
| 1| 5|
+---+---+
This behavior got me wondering... Are the following assumptions of mine correct?
DataFrames are just views; a simple DataFrame is a view of itself. In my case, a_c is just a view into my_df.
When I created a_c, no new data was created; a_c just points at the same data that my_df points at.
If there is additional information that is relevant, please add!
This is happening because of the lazy nature of Spark. It is "smart" enough to push the filter down so that it happens at a lower level - before the select*. Since this all happens within the same stage of execution, the column can still be resolved. In fact, you can see this in explain:
== Physical Plan ==
*Project [a#0, c#2]
+- *Filter (b#1 = 3) <---Filter before Project
+- LocalTableScan [A#0, B#1, C#2]
You can force a shuffle and a new stage, though, and then see your filter fail - even catching it at compile time. Here's an example:
a_c.groupBy("a","c").count.filter(col("b") === 3)
*There is also projection pruning that pushes the selection down to the data source layer if Spark realizes it doesn't need the column at any point. However, I believe the filter would cause it to "need" the column and not prune it... but I didn't test that.
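To reproduce the plan above, a quick sketch using the question's DataFrame (assuming my_df from the question):
from pyspark.sql.functions import col

a_c = my_df.select(col("a"), col("c"))
# the physical plan shows the Filter on b sitting below the Project,
# which is why filtering on the dropped column still works
a_c.filter(col("b") == 3).explain()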
Let us start with some basics about Spark's underlying machinery; this will make it easier to understand.
RDD: underlying the Spark core is a data structure called the RDD, which is lazily evaluated. By lazy evaluation we mean that RDD computation happens only when an action is called (like count on an RDD or show on a Dataset).
A Dataset or DataFrame (which is a Dataset[Row]) also uses RDDs at its core.
This means every transformation (like filter) will be realized only when an action is triggered (such as show).
So, to your question:
"When I created a_c no new data was created, a_c is just pointing at the same data my_df is pointing."
No data has been realized yet; we have to realize it to bring it into memory, so your filter works against the initial DataFrame.
The only way to make your a_c.filter(col("b") == 3).show() throw a runtime exception is to cache your intermediate DataFrame using dataframe.cache, e.g.:
val a_c = s_df.select(col("a"), col("c")).cache
a_c.filter(col("b") == 3).show()
Spark will then throw "main" org.apache.spark.sql.AnalysisException: Cannot resolve column name.

Subtracting DataFrames by a single ID column - duplicate columns behave differently

I am trying to compare two DataFrames with the same schema (in Spark 1.6.0, using Scala) to determine which rows in the newer table have been added (i.e. are not present in the older table).
I need to do this by ID (i.e. examining a single column, not the whole row, to see what is new). Some rows may have changed between the versions, in that they have the same id in both versions, but the other columns have changed - I do not want these in the output, so I cannot simply subtract the two versions.
Based on various suggestions, I am doing a left-outer join on the chosen ID column, then selecting rows with nulls in columns from the right side of the join (indicating that they were not present in the older version of the table):
def diffBy(field: String, newer: DataFrame, older: DataFrame): DataFrame = {
  newer.join(older, newer(field) === older(field), "left_outer")
    .filter(older(field).isNull)
  // TODO just select the leftmost columns, removing the nulls
}
However, this does not work (row 3 exists only in the newer version, so it should be in the output):
scala> newer.show
+---+-------+
| id| value|
+---+-------+
| 3| three|
| 2|two-new|
+---+-------+
scala> older.show
+---+-------+
| id| value|
+---+-------+
| 1| one|
| 2|two-old|
+---+-------+
scala> diffBy("id", newer, older).show
+---+-----+---+-----+
| id|value| id|value|
+---+-----+---+-----+
+---+-----+---+-----+
The join is working as expected:
scala> val joined = newer.join(older, newer("id") === older("id"), "left_outer")
scala> joined.show
+---+-------+----+-------+
| id| value| id| value|
+---+-------+----+-------+
| 2|two-new| 2|two-old|
| 3| three|null| null|
+---+-------+----+-------+
So the problem is in the selection of the column for filtering.
joined.where(older("id").isNull).show
+---+-----+---+-----+
| id|value| id|value|
+---+-----+---+-----+
+---+-----+---+-----+
Perhaps it is due to the duplicate id column names in the join? But if I use the value column (which is also duplicated) instead to detect nulls, it works as expected:
joined.where(older("value").isNull).show
+---+-----+----+-----+
| id|value| id|value|
+---+-----+----+-----+
| 3|three|null| null|
+---+-----+----+-----+
What is going on here - and why is the behaviour different for id and value?
You can solve the problem using a special Spark join type called "leftanti". It is equivalent to MINUS in Oracle SQL.
val joined = newer.join(older, newer("id") === older("id"), "leftanti")
This will only select columns from newer.
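A minimal sketch of that anti join against the newer/older frames shown above (PySpark syntax here for illustration; it requires a Spark version that supports the left-anti join type):
# keep only the rows of newer whose id has no match in older
new_rows = newer.join(older, newer["id"] == older["id"], "left_anti")
new_rows.show()  # expected: only the row with id 3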
I have found a solution to my problem, though not an explanation for why it occurs.
It seems to be necessary to create an alias in order to refer unambiguously to the rightmost id column, and then use a textual WHERE clause so that I can substitute in the qualified column name from the variable field:
newer.join(older.as("o"), newer(field) === older(field), "left_outer")
.where(s"o.$field IS NULL")

How to efficiently rename columns in Datasets (Spark 2.0)

With DataFrames, one can simply rename columns by using df.withColumnRenamed("oldName", "newName"). In Datasets, since every field is typed and named, this doesn't seem possible. The only workaround I can think of is to use map on the Dataset:
case class Orig(a: Int, b: Int)
case class OrigRenamed(a: Int, bNewName: Int)
val origDS = Seq(Orig(1,2), Orig(3,4)).toDS
origDS.show
+---+---+
| a| b|
+---+---+
| 1| 2|
| 3| 4|
+---+---+
// To rename with map
val origRenamedDS = origDS.map{ case Orig(x,y) => OrigRenamed(x,y) }
origRenamedDS.show
+---+--------+
| a|bNewName|
+---+--------+
| 1| 2|
| 3| 4|
+---+--------+
This seems a very round-about and inefficient way just to rename a column. Is there a better way?
A slightly more concise solution would be something like this:
origDS.toDF("a", "bNewName").as[OrigRenamed]
but in practice renaming is simply not meaningful on a statically typed Dataset. While it uses the same columnar representation as a DataFrame (Dataset[Row]), the semantics are completely different here.
The name of a column corresponds to a specific field of the stored objects, so it is not something that can be dynamically renamed. In other words, Datasets are not just statically typed DataFrames but collections of objects.
You can make it slightly more concise, while maintaining semantics:
origDS.map(o => OrigRenamed(o.a, o.b)).show()
