How to efficiently rename columns in Datasets (Spark 2.0) - apache-spark

With DataFrames, one can simply rename columns by using df.withColumnRename("oldName", "newName"). In Datasets, since every field is typed and named, this doesn't seem possible. The only work around I can think of is to use map on the Dataset:
case class Orig(a: Int, b: Int)
case class OrigRenamed(a: Int, bNewName: Int)
val origDS = Seq(Orig(1,2), Orig(3,4)).toDS
origDS.show
+---+---+
| a| b|
+---+---+
| 1| 2|
| 3| 4|
+---+---+
// To rename with map
val origRenamedDS = origDS.map{ case Orig(x,y) => OrigRenamed(x,y) }
origRenamed.show
+---+--------+
| a|bNewName|
+---+--------+
| 1| 2|
| 3| 4|
+---+--------+
This seems a very round-about and inefficient way just to rename a column. Is there a better way?

A slightly more concise solution would be something like this:
origDS.toDF("a", "bNewName").as[OrigRenamed]
but in practice renaming is simply not meaningful on statically typed Dataset. While we use the same columnar representation as Dataframe (Dataset[Row]) semantics is completely different here.
Name of the column corresponds to a specific field of the stored objects so it is not something that can be dynamically renamed. In other words Datasets are not statically typed DataFrames but collections of objects.

You can make it slightly more concise, while maintaining semantics:
origDS.map(o => OrigRenamed(o.a, o.b)).show()

Related

How to return the latest rows per group in pyspark structured streaming

I have a stream which I read in pyspark using spark.readStream.format('delta'). The data consists of multiple columns including a type, date and value column.
Example DataFrame;
type
date
value
1
2020-01-21
6
1
2020-01-16
5
2
2020-01-20
8
2
2020-01-15
4
I would like to create a DataFrame that keeps track of the latest state per type. One of the most easy methods to do when working on static (batch) data is to use windows, but using windows on non-timestamp columns is not supported. Another option would look like
stream.groupby('type').agg(last('date'), last('value')).writeStream
but I think Spark cannot guarantee the ordering here, and using orderBy is also not supported in structured streaming before the aggrations.
Do you have any suggestions on how to approach this challenge?
simple use the to_timestamp() function that can be import by from pyspark.sql.functions import *
on the date column so that you use the window function.
e.g
from pyspark.sql.functions import *
df=spark.createDataFrame(
data = [ ("1","2020-01-21")],
schema=["id","input_timestamp"])
df.printSchema()
+---+---------------+-------------------+
|id |input_timestamp|timestamp |
+---+---------------+-------------------+
|1 |2020-01-21 |2020-01-21 00:00:00|
+---+---------------+-------------------+
"but using windows on non-timestamp columns is not supported"
are you saying this from stream point of view, because same i am able to do.
Here is the solution to your problem.
windowSpec = Window.partitionBy("type").orderBy("date")
df1=df.withColumn("rank",rank().over(windowSpec))
df1.show()
+----+----------+-----+----+
|type| date|value|rank|
+----+----------+-----+----+
| 1|2020-01-16| 5| 1|
| 1|2020-01-21| 6| 2|
| 2|2020-01-15| 4| 1|
| 2|2020-01-20| 8| 2|
+----+----------+-----+----+
w = Window.partitionBy('type')
df1.withColumn('maxB', F.max('rank').over(w)).where(F.col('rank') == F.col('maxB')).drop('maxB').show()
+----+----------+-----+----+
|type| date|value|rank|
+----+----------+-----+----+
| 1|2020-01-21| 6| 2|
| 2|2020-01-20| 8| 2|
+----+----------+-----+----+

Spark union column order

I've come across something strange recently in Spark. As far as I understand, given the column based storage method of spark dfs, the order of the columns really don't have any meaning, they're like keys in a dictionary.
During a df.union(df2), does the order of the columns matter? I would've assumed that it shouldn't, but according to the wisdom of sql forums it does.
So we have df1
df1
| a| b|
+---+----+
| 1| asd|
| 2|asda|
| 3| f1f|
+---+----+
df2
| b| a|
+----+---+
| asd| 1|
|asda| 2|
| f1f| 3|
+----+---+
result
| a| b|
+----+----+
| 1| asd|
| 2|asda|
| 3| f1f|
| asd| 1|
|asda| 2|
| f1f| 3|
+----+----+
It looks like the schema from df1 was used, but the data appears to have joined following the order of their original dataframes.
Obviously the solution would be to do df1.union(df2.select(df1.columns))
But the main question is, why does it do this? Is it simply because it's part of pyspark.sql, or is there some underlying data architecture in Spark that I've goofed up in understanding?
code to create test set if anyone wants to try
d1={'a':[1,2,3], 'b':['asd','asda','f1f']}
d2={ 'b':['asd','asda','f1f'], 'a':[1,2,3],}
pdf1=pd.DataFrame(d1)
pdf2=pd.DataFrame(d2)
df1=spark.createDataFrame(pdf1)
df2=spark.createDataFrame(pdf2)
test=df1.union(df2)
The Spark union is implemented according to standard SQL and therefore resolves the columns by position. This is also stated by the API documentation:
Return a new DataFrame containing union of rows in this and another frame.
This is equivalent to UNION ALL in SQL. To do a SQL-style set union (that does >deduplication of elements), use this function followed by a distinct.
Also as standard in SQL, this function resolves columns by position (not by name).
Since Spark >= 2.3 you can use unionByName to union two dataframes were the column names get resolved.
in spark Union is not done on metadata of columns and data is not shuffled like you would think it would. rather union is done on the column numbers as in, if you are unioning 2 Df's both must have the same numbers of columns..you will have to take in consideration of positions of your columns previous to doing union. unlike SQL or Oracle or other RDBMS, underlying files in spark are physical files. hope that answers your question

switch identifiers in spark to pseudonymize dataset

I have a dataframe like:
+---+-----+
| id|value|
+---+-----+
| a| 1|
| a| 2|
| b| 1|
| b| 3|
+---+-----+
val df = Seq(("a", 1), ("a", 2), ("b", 1), ("b", 3)).toDF("id", "value")
How can I efficiently switch / rotate IDs. Note, hashing is not what I want here, I explicitly want to rotated the identifiers. How could this be implemented in spark efficiently without a self join? Maybe some RDD zipWithIndex?
Not: my intention is to pseudonyme / anonymize the dataset by rotating identifiers. My requirement is to replace each a with another identifier, i.e. possibly b. They all need to be replaced to the same value.
edit
I had a first suggestion: https://spark.apache.org/docs/latest/ml-features.html#stringindexer but this would change data types and also not rotate identifiers something I want to prevent. I need a drop in but anon- /pseudo-nymized replacement.
Also, I expect about 8 million (constant) distinct values for ID.
Collecting all distinct elements and building a map using zip with a randomly permutated list of these distinct elements works.

What exactly does .select() do?

I ran into a surprising behavior when using .select():
>>> my_df.show()
+---+---+---+
| a| b| c|
+---+---+---+
| 1| 3| 5|
| 2| 4| 6|
+---+---+---+
>>> a_c = s_df.select(col("a"), col("c")) # removing column b
>>> a_c.show()
+---+---+
| a| c|
+---+---+
| 1| 5|
| 2| 6|
+---+---+
>>> a_c.filter(col("b") == 3).show() # I can still filter on "b"!
+---+---+
| a| c|
+---+---+
| 1| 5|
+---+---+
This behavior got my wondering... Are my following points correct?
DataFrames are just views, a simple DataFrame is a view of itself. In my case a_c is just a view into my_df.
When I created a_c no new data was created, a_c is just pointing at the same data my_df is pointing.
If there is additional information that is relevant, please add!
This is happening because of the lazy nature of Spark. It is "smart" enough to push the filter down so that it happens at a lower level - before the filter*. So, since this all happens within the same stage of execution and is able to still be resolved. In fact you can see this in explain:
== Physical Plan ==
*Project [a#0, c#2]
+- *Filter (b#1 = 3) <---Filter before Project
+- LocalTableScan [A#0, B#1, C#2]
You can force a shuffle and new stage, then see your filter fail, though. Even catching it at compile time. Here's an example:
a_c.groupBy("a","c").count.filter(col("b") === 3)
*There is also a projection pruning that pushes the selection down to database layers if it realizes it doesn't need the column at any point. However I believe the filter would cause it to "need" it and not prune...but I didn't test that.
Let us start with some basics about the spark underlying.This will make your understanding easy.
RDD : Underlying the spark core is the data structure called RDD ,which are
lazily evaluated. By lazy evaluation we mean that RDD computation
happens when the action (like calling a count in RDD or show in dataset).
Dataset or Dataframe(which Dataset[Row]) also uses RDDs at the core.
This means every transformation (like filter) will be realized only when the action is triggered (show).
So your question
"When I created a_c no new data was created, a_c is just pointing at the same data my_df is pointing."
As there is no data which was realized. We have to realize it to bring it to memory. Your filter works on the initial dataframe.
The only way to make your a_c.filter(col("b") == 3).show() throw a run time exception is to cache your intermediate dataframe by using dataframe.cache.
So spark will throw"main" org.apache.spark.sql.AnalysisException: Cannot resolve column name
Eg.
val a_c = s_df.select(col("a"), col("c")).cache
a_c.filter(col("b") == 3).show()
So spark will throw"main" org.apache.spark.sql.AnalysisException: Cannot
resolve column name.

Add columns on a Pyspark Dataframe

I have a Pyspark Dataframe with this structure:
+----+----+----+----+---+
|user| A/B| C| A/B| C |
+----+----+-------------+
| 1 | 0| 1| 1| 2|
| 2 | 0| 2| 4| 0|
+----+----+----+----+---+
I had originally two dataframes, but I outer joined them using user as key, so there could be also null values. I can't find the way to sum the columns with equal name in order to get a dataframe like this:
+----+----+----+
|user| A/B| C|
+----+----+----+
| 1 | 1| 3|
| 2 | 4| 2|
+----+----+----+
Also note that there could be many equal columns, so selecting literally each column is not an option. In pandas this was possible using "user" as Index and then adding both dataframes. How can I do this on Spark?
I have a work around for this
val dataFrameOneColumns=df1.columns.map(a=>if(a.equals("user")) a else a+"_1")
val updatedDF=df1.toDF(dataFrameOneColumns:_*)
Now make the Join then the out will contain the Values with different names
Then make the tuple of the list to be combined
val newlist=df1.columns.filter(_.equals("user").zip(dataFrameOneColumns.filter(_.equals("user"))
And them Combine the value of the Columns within each tuple and get the desired output !
PS: i am guessing you can write the logic for combining ! So i am not spoon feeding !

Resources