How to convert RDD[List[Int]] to DataFrame? - apache-spark

I have an RDD[List[Int]]. I don't know the length of the List[Int] in advance, and I want to convert the RDD[List[Int]] to a DataFrame. How should I do this?
This is my input:
val l1=Array(1,2,3,4)
val l2=Array(1,2,3,4)
val Lz=Seq(l1,l2)
val rdd1=sc.parallelize(Lz,2)
This is my expected result:
+---+---+---+---+
| _1| _2| _3| _4|
+---+---+---+---+
| 1| 2| 3| 4|
| 1| 2| 3| 4|
+---+---+---+---+

There might be another, more functional way to do this, but this works too:
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

def getSchema(myArray: Array[Int]): StructType = {
  val schemaArray = scala.collection.mutable.ArrayBuffer[StructField]()
  for ((el, idx) <- myArray.view.zipWithIndex) {
    schemaArray += StructField("col" + idx, IntegerType, nullable = true)
  }
  StructType(schemaArray)
}
val l1=Array(1,2,3,4)
val l2=Array(1,2,3,4)
val Lz=Seq(l1,l2)
val rdd1=sc.parallelize(Lz,2).map(Row.fromSeq(_))
val schema = getSchema(l1) //Since both arrays will be of same type and size
val df = sqlContext.createDataFrame(rdd1, schema)
df.show()
+----+----+----+----+
|col0|col1|col2|col3|
+----+----+----+----+
| 1| 2| 3| 4|
| 1| 2| 3| 4|
+----+----+----+----+
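As a side note, the schema-building step above can also be written more functionally. This is my own sketch rather than part of the original answer, and it makes the same assumption that every array has the same length and element type as l1:

import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

// Build the StructType directly from the array's indices instead of a mutable buffer
def getSchemaFunctional(myArray: Array[Int]): StructType =
  StructType(myArray.indices.map(i => StructField(s"col$i", IntegerType, nullable = true)))

It can be used exactly like getSchema above, e.g. val schema = getSchemaFunctional(l1).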

You can do the following:
val l1=Array(1,2,3,4)
val l2=Array(1,2,3,4)
val Lz=Seq(l1,l2)
val df = sc.parallelize(Lz, 2).map {
  case Array(val1, val2, val3, val4) => (val1, val2, val3, val4)
}.toDF
df.show
// +---+---+---+---+
// | _1| _2| _3| _4|
// +---+---+---+---+
// | 1| 2| 3| 4|
// | 1| 2| 3| 4|
// +---+---+---+---+
If you have many columns, you would need to proceed differently, but you need to know the schema of your data; otherwise you won't be able to perform the following:
val sch = df.schema // I just took the schema from the old df but you can add one programmatically
val df2 = spark.createDataFrame(sc.parallelize(Lz,2).map{ Row.fromSeq(_) }, sch)
df2.show
// +---+---+---+---+
// | _1| _2| _3| _4|
// +---+---+---+---+
// | 1| 2| 3| 4|
// | 1| 2| 3| 4|
// +---+---+---+---+
Unless you provide a schema, you won't be able to do much except have an array column:
val df3 = sc.parallelize(Lz,2).toDF
// df3: org.apache.spark.sql.DataFrame = [value: array<int>]
df3.show
// +------------+
// | value|
// +------------+
// |[1, 2, 3, 4]|
// |[1, 2, 3, 4]|
// +------------+
df3.printSchema
//root
// |-- value: array (nullable = true)
// | |-- element: integer (containsNull = false)
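If you want one column per element without writing out a schema or a tuple pattern, a hedged workaround (my addition, not from the original answers) is to start from this array column and select one element per position. It assumes every row has the same length as the first one:

// Take the length from the first row, then project value(0), value(1), ... into separate columns
val n = df3.first().getAs[Seq[Int]]("value").length
val df4 = df3.select((0 until n).map(i => $"value"(i).alias(s"_${i + 1}")): _*)
df4.show()

This reproduces the _1 ... _4 layout from the expected result without hard-coding the number of columns.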

Related

How to define spark dataframe join match priority

I have two dataframes.
dataDF
+---+
| tt|
+---+
| a|
| b|
| c|
| ab|
+---+
alter
+----+-----+------+
|name|alter|profit|
+----+-----+------+
| a| aa| 1|
| b| a| 5|
| c| ab| 8|
+----+-----+------+
The task is to search for col("tt") in the alter dataframe's col("name"); if found, join them; if not found, then search for col("tt") in col("alter"). The priority of col("name") is higher than col("alter"). That means if a row's col("tt") matches col("name"), I do not want to match it against another row that only matches col("alter"). How can I achieve this?
I tried to write a join, but it does not work.
dataDF = dataDF.select("*")
  .join(broadcast(alterDF),
    col("tt") === col("Name") || col("tt") === col("alter"),
    "left")
The result is:
+---+----+-----+------+
| tt|name|alter|profit|
+---+----+-----+------+
| a| a| aa| 1|
| a| b| a| 5| // this row is not expected.
| b| b| a| 5|
| c| c| ab| 8|
| ab| c| ab| 8|
+---+----+-----+------+
You can try joining twice: first on the name column, then filter out the tt values for which no data matched and join those on the alter column. Finally, union both results. Please find the code below. I hope it is helpful.
// Creating test data
val dataDF = Seq("a", "b", "c", "ab").toDF("tt")
val alter = Seq(("a", "aa", 1), ("b", "a", 5), ("c", "ab", 8))
  .toDF("name", "alter", "profit")
val join1 = dataDF.join(alter, col("tt") === col("name"), "left")
val join2 = join1.filter(col("name").isNull).select("tt")
  .join(alter, col("tt") === col("alter"), "left")
val joinDF = join1.filter(col("name").isNotNull).union(join2)
joinDF.show(false)
+---+----+-----+------+
|tt |name|alter|profit|
+---+----+-----+------+
|a |a |aa |1 |
|b |b |a |5 |
|c |c |ab |8 |
|ab |c |ab |8 |
+---+----+-----+------+

spark drop multiple duplicated columns after join

I am getting many duplicated columns after joining two dataframes.
Now I want to drop the columns that come last; below is my printSchema:
root
|-- id: string (nullable = true)
|-- value: string (nullable = true)
|-- test: string (nullable = true)
|-- details: string (nullable = true)
|-- test: string (nullable = true)
|-- value: string (nullable = true)
Now I want to drop the last two columns:
|-- test: string (nullable = true)
|-- value: string (nullable = true)
I tried df.dropDuplicates(), but it drops them all.
How do I drop the duplicated columns that come last?
You have to use the vararg syntax to pass the column names from an array and drop them.
Check below:
scala> dfx.show
+---+---+---+---+------------+------+
| A| B| C| D| arr|mincol|
+---+---+---+---+------------+------+
| 1| 2| 3| 4|[1, 2, 3, 4]| A|
| 5| 4| 3| 1|[5, 4, 3, 1]| D|
+---+---+---+---+------------+------+
scala> dfx.columns
res120: Array[String] = Array(A, B, C, D, arr, mincol)
scala> val dropcols = Array("arr","mincol")
dropcols: Array[String] = Array(arr, mincol)
scala> dfx.drop(dropcols:_*).show
+---+---+---+---+
| A| B| C| D|
+---+---+---+---+
| 1| 2| 3| 4|
| 5| 4| 3| 1|
+---+---+---+---+
scala>
Update1:
scala> val df = Seq((1,2,3,4),(5,4,3,1)).toDF("A","B","C","D")
df: org.apache.spark.sql.DataFrame = [A: int, B: int ... 2 more fields]
scala> val df2 = df.select("A","B","C")
df2: org.apache.spark.sql.DataFrame = [A: int, B: int ... 1 more field]
scala> df.alias("t1").join(df2.alias("t2"),Seq("A"),"inner").show
+---+---+---+---+---+---+
| A| B| C| D| B| C|
+---+---+---+---+---+---+
| 1| 2| 3| 4| 2| 3|
| 5| 4| 3| 1| 4| 3|
+---+---+---+---+---+---+
scala> df.alias("t1").join(df2.alias("t2"),Seq("A"),"inner").drop($"t2.B").drop($"t2.C").show
+---+---+---+---+
| A| B| C| D|
+---+---+---+---+
| 1| 2| 3| 4|
| 5| 4| 3| 1|
+---+---+---+---+
scala>
Update2:
To remove the columns dynamically, check the below solution.
scala> val df = Seq((1,2,3,4),(5,4,3,1)).toDF("A","B","C","D")
df: org.apache.spark.sql.DataFrame = [A: int, B: int ... 2 more fields]
scala> val df2 = Seq((1,9,9),(5,8,8)).toDF("A","B","C")
df2: org.apache.spark.sql.DataFrame = [A: int, B: int ... 1 more field]
scala> val df3 = df.alias("t1").join(df2.alias("t2"),Seq("A"),"inner")
df3: org.apache.spark.sql.DataFrame = [A: int, B: int ... 4 more fields]
scala> df3.show
+---+---+---+---+---+---+
| A| B| C| D| B| C|
+---+---+---+---+---+---+
| 1| 2| 3| 4| 9| 9|
| 5| 4| 3| 1| 8| 8|
+---+---+---+---+---+---+
scala> val rem1 = Array("B","C")
rem1: Array[String] = Array(B, C)
scala> val rem2 = rem1.map(x=>"t2."+x)
rem2: Array[String] = Array(t2.B, t2.C)
scala> val df4 = rem2.foldLeft(df3) { (acc: DataFrame, colName: String) => acc.drop(col(colName)) }
df4: org.apache.spark.sql.DataFrame = [A: int, B: int ... 2 more fields]
scala> df4.show
+---+---+---+---+
| A| B| C| D|
+---+---+---+---+
| 1| 2| 3| 4|
| 5| 4| 3| 1|
+---+---+---+---+
scala>
Update3
Renaming/aliasing in one go.
scala> val dfa = Seq((1,2,3,4),(5,4,3,1)).toDF("A","B","C","D")
dfa: org.apache.spark.sql.DataFrame = [A: int, B: int ... 2 more fields]
scala> val dfa2 = dfa.columns.foldLeft(dfa) { (acc: DataFrame, colName: String) => acc.withColumnRenamed(colName,colName+"_2")}
dfa2: org.apache.spark.sql.DataFrame = [A_2: int, B_2: int ... 2 more fields]
scala> dfa2.show
+---+---+---+---+
|A_2|B_2|C_2|D_2|
+---+---+---+---+
| 1| 2| 3| 4|
| 5| 4| 3| 1|
+---+---+---+---+
scala>
df.dropDuplicates() works only for rows.
You can use df1.drop(df2.col("value")).
You can specify the columns you want to select, for example with df.select and a Seq of columns.
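To make those one-liners concrete, here is a small sketch (my addition; df1 and df2 are hypothetical — assume df1 has columns id, value, test and df2 has columns id, value, details, so id and value end up duplicated after the join):

// Drop the duplicated columns by referencing the dataframe they came from
val joined = df1.join(df2, df1("id") === df2("id"))
val deduped = joined.drop(df2("id")).drop(df2("value"))

// Or keep only the columns you want: build them as a Seq and splat it into select
val wanted = Seq(df1("id"), df1("value"), df1("test"), df2("details"))
val selected = joined.select(wanted: _*)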
Suppose you have two dataframes DF1 and DF2.
You can use either of these ways to join on particular columns:
1. DF1.join(DF2, Seq("column1", "column2"))
2. DF1.join(DF2, DF1("column1") === DF2("column1") && DF1("column2") === DF2("column2"))
So, to drop the duplicate columns:
1. With the Seq form, the join columns appear only once in the result, so there is nothing left to drop.
2. With the expression form, drop one side explicitly: DF1.join(DF2, DF1("column1") === DF2("column1") && DF1("column2") === DF2("column2")).drop(DF1("column1")).drop(DF1("column2"))
It does not matter which side you drop, since the values in the join columns are equal in this case.
I wasn't completely satisfied with the answers in this thread. For the most part, especially @stack0114106's answers, they hint at the right way and at the complexity of doing it cleanly, but they seem to be incomplete answers. To me, a clean automated way of doing this is to use df.columns to get the columns as a list of strings and then use sets to find the common columns to drop, or the unique columns to keep, depending on your use case. However, if you use select you will have to alias the dataframes so it knows which of the non-unique columns to keep. Anyway, I'm using pseudocode because I can't be bothered to write the Scala code properly.
common_cols = df_b.columns.toSet().intersection(df_a.columns.toSet())
df_a.join(df_b.drop(*common_cols))
The select version of this looks similar but you have to add in the aliasing.
unique_b_cols = df_b.columns.toSet().difference(df_a.columns.toSet()).toList
a_cols_aliased = df_a.columns.map(cols => "a." + cols)
keep_columns = a_cols_aliased.toList + unique_b_cols.toList
df_a.alias("a")
.join(df_b.alias("b"))
.select(*keep_columns)
I prefer the drop way, but having written a bunch of Spark code, I find a select statement can often lead to cleaner code.
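Since that is pseudocode, here is a rough Scala sketch of the drop variant (my addition; it assumes df_a and df_b share some column names and are joined on a key column I'm calling "id"):

// Columns present in both frames, minus the join key we still need
val commonCols = df_b.columns.toSet.intersect(df_a.columns.toSet) - "id"

// Drop the shared columns from df_b before joining, so nothing is duplicated
val joined = df_a.join(df_b.drop(commonCols.toSeq: _*), Seq("id"))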

Using when and otherwise while converting boolean values to strings in Pyspark

I have a data frame in Pyspark
df.show()
+---+----+-------+----------+-----+------+
| id|name|testing|avg_result|score|active|
+---+----+-------+----------+-----+------+
| 1| sam| null| null| null| true|
| 2| Ram| Y| 0.05| 10| false|
| 3| Ian| N| 0.01| 1| false|
| 4| Jim| N| 1.2| 3| true|
+---+----+-------+----------+-----+------+
The schema is below:
DataFrame[id: int, name: string, testing: string, avg_result: string, score: string, active: boolean]
I want to convert Y to True, N to False, true to True and false to False.
When I do it like below:
for col in cols:
    df = df.withColumn(col, f.when(f.col(col) == 'N', 'False').when(f.col(col) == 'Y', 'True')
                             .when(f.col(col) == 'true', True).when(f.col(col) == 'false', False)
                             .otherwise(f.col(col)))
I get the error below and there is no change in the data frame:
pyspark.sql.utils.AnalysisException: u"cannot resolve 'CASE WHEN (testing = N) THEN False WHEN (testing = Y) THEN True WHEN (testing = true) THEN true WHEN (testing = false) THEN false ELSE testing' due to data type mismatch: THEN and ELSE expressions should all be same type or coercible to a common type;"
+---+----+-------+----------+-----+------+
| id|name|testing|avg_result|score|active|
+---+----+-------+----------+-----+------+
| 1| sam| null| null| null| true|
| 2| Ram| Y| 0.05| 10| false|
| 3| Ian| N| 0.01| 1| false|
| 4| Jim| N| 1.2| 3| true|
+---+----+-------+----------+-----+------+
When I do it like below:
for col in cols:
    df = df.withColumn(col, f.when(f.col(col) == 'N', 'False').when(f.col(col) == 'Y', 'True').otherwise(f.col(col)))
I get the error below:
pyspark.sql.utils.AnalysisException: u"cannot resolve 'CASE WHEN if ((isnull(active) || isnull(cast(N as double)))) null else CASE cast(cast(N as double) as double) WHEN cast(1 as double) THEN active WHEN cast(0 as double) THEN NOT active ELSE false THEN False WHEN if ((isnull(active) || isnull(cast(Y as double)))) null else CASE cast(cast(Y as double) as double) WHEN cast(1 as double) THEN active WHEN cast(0 as double) THEN NOT active ELSE false THEN True ELSE active' due to data type mismatch: THEN and ELSE expressions should all be same type or coercible to a common type;"
But the data frame changes to
+---+----+-------+----------+-----+------+
| id|name|testing|avg_result|score|active|
+---+----+-------+----------+-----+------+
| 1| sam| null| null| null| true|
| 2| Ram| True| 0.05| 10| false|
| 3| Ian| False| 0.01| 1| false|
| 4| Jim| False| 1.2| 3| true|
+---+----+-------+----------+-----+------+
New attempt:
for col in cols:
    df = df.withColumn(col, f.when(f.col(col) == 'N', 'False').when(f.col(col) == 'Y', 'True')
                             .when(f.col(col) == 'true', 'True').when(f.col(col) == 'false', 'False')
                             .otherwise(f.col(col)))
Error received:
pyspark.sql.utils.AnalysisException: u"cannot resolve 'CASE WHEN if ((isnull(active) || isnull(cast(N as double)))) null else CASE cast(cast(N as double) as double) WHEN cast(1 as double) THEN active WHEN cast(0 as double) THEN NOT active ELSE false THEN False WHEN if ((isnull(active) || isnull(cast(Y as double)))) null else CASE cast(cast(Y as double) as double) WHEN cast(1 as double) THEN active WHEN cast(0 as double) THEN NOT active ELSE false THEN True WHEN if ((isnull(active) || isnull(cast(true as double)))) null else CASE cast(cast(true as double) as double) WHEN cast(1 as double) THEN active WHEN cast(0 as double) THEN NOT active ELSE false THEN True WHEN if ((isnull(active) || isnull(cast(false as double)))) null else CASE cast(cast(false as double) as double) WHEN cast(1 as double) THEN active WHEN cast(0 as double) THEN NOT active ELSE false THEN False ELSE active' due to data type mismatch: THEN and ELSE expressions should all be same type or coercible to a common type;"
How can I get the data frame to look like this?
+---+----+-------+----------+-----+------+
| id|name|testing|avg_result|score|active|
+---+----+-------+----------+-----+------+
| 1| sam| null| null| null| True|
| 2| Ram| True| 0.05| 10| False|
| 3| Ian| False| 0.01| 1| False|
| 4| Jim| False| 1.2| 3| True|
+---+----+-------+----------+-----+------+
As I mentioned in the comments, the issue is a type mismatch. You need to convert the boolean column to a string before doing the comparison. Finally, you need to cast the column to a string in the otherwise() as well (you can't have mixed types in a column).
Your code is easy to modify to get the correct output:
import pyspark.sql.functions as f
cols = ["testing", "active"]
for col in cols:
    df = df.withColumn(
        col,
        f.when(
            f.col(col) == 'N',
            'False'
        ).when(
            f.col(col) == 'Y',
            'True'
        ).when(
            f.col(col).cast('string') == 'true',
            'True'
        ).when(
            f.col(col).cast('string') == 'false',
            'False'
        ).otherwise(f.col(col).cast('string'))
    )
df.show()
#+---+----+-------+----------+-----+------+
#| id|name|testing|avg_result|score|active|
#+---+----+-------+----------+-----+------+
#| 1| sam| null| null| null| True|
#| 2| Ram| True| 0.05| 10| False|
#| 3| Ian| False| 0.01| 1| False|
#| 4| Jim| False| 1.2| 3| True|
#+---+----+-------+----------+-----+------+
However, there are some alternative approaches as well. For instance, this is a good place to use pyspark.sql.Column.isin():
from functools import reduce  # reduce is a builtin in Python 2 but must be imported in Python 3

df = reduce(
    lambda df, col: df.withColumn(
        col,
        f.when(
            f.col(col).cast('string').isin(['N', 'false']),
            'False'
        ).when(
            f.col(col).cast('string').isin(['Y', 'true']),
            'True'
        ).otherwise(f.col(col).cast('string'))
    ),
    cols,
    df
)
df.show()
#+---+----+-------+----------+-----+------+
#| id|name|testing|avg_result|score|active|
#+---+----+-------+----------+-----+------+
#| 1| sam| null| null| null| True|
#| 2| Ram| True| 0.05| 10| False|
#| 3| Ian| False| 0.01| 1| False|
#| 4| Jim| False| 1.2| 3| True|
#+---+----+-------+----------+-----+------+
(Here I used reduce to eliminate the for loop, but you could have kept it.)
You could also use pyspark.sql.DataFrame.replace() but you'd have to first convert the column active to a string:
df = df.withColumn('active', f.col('active').cast('string'))\
    .replace(['Y', 'true'], 'True', subset=cols)\
    .replace(['N', 'false'], 'False', subset=cols)
df.show()
# results omitted, but it's the same as above
Or using replace just once:
df = df.withColumn('active', f.col('active').cast('string'))\
    .replace(['Y', 'true', 'N', 'false'], ['True', 'True', 'False', 'False'], subset=cols)
Looking at the schema and the transformations applied, there is a type mismatch between the String and Boolean values returned, e.g. 'N' is returned as 'False' (a string) while 'false' is returned as False (a boolean).
You can cast the transformed columns to String to convert Y to True, N to False, true to True and false to False.
from pyspark.sql.functions import *
from pyspark.sql.types import *
from pyspark.sql import functions as f
data = [
    (1, "sam", None, None, None, True),
    (2, "Ram", "Y", 0.05, 10, False),
    (3, "Ian", "N", 0.01, 1, False),
    (4, "Jim", "N", 1.2, 3, True)
]
schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("name", StringType(), True),
    StructField("testing", StringType(), True),
    StructField("avg_result", StringType(), True),
    StructField("score", StringType(), True),
    StructField("active", BooleanType(), True)
])
df = sc.parallelize(data).toDF(schema)
Before applying the transformations
>>> df.printSchema()
root
|-- id: integer (nullable = true)
|-- name: string (nullable = true)
|-- testing: string (nullable = true)
|-- avg_result: string (nullable = true)
|-- score: string (nullable = true)
|-- active: boolean (nullable = true)
>>> df.show()
+---+----+-------+----------+-----+------+
| id|name|testing|avg_result|score|active|
+---+----+-------+----------+-----+------+
| 1| sam| null| null| null| true|
| 2| Ram| Y| 0.05| 10| false|
| 3| Ian| N| 0.01| 1| false|
| 4| Jim| N| 1.2| 3| true|
+---+----+-------+----------+-----+------+
Applying the transformation, with a cast in the otherwise clause (.otherwise(f.col(col).cast("string"))):
cols = ["testing", "active"]
for col in cols:
    df = df.withColumn(col,
                       f.when(f.col(col) == 'N', 'False')
                        .when(f.col(col) == 'Y', 'True')
                        .when(f.col(col).cast("string") == 'true', 'True')
                        .when(f.col(col).cast("string") == 'false', 'False')
                        .otherwise(f.col(col).cast("string")))
Results
>>> df.printSchema()
root
|-- id: integer (nullable = true)
|-- name: string (nullable = true)
|-- testing: string (nullable = true)
|-- avg_result: string (nullable = true)
|-- score: string (nullable = true)
|-- active: string (nullable = true)
>>> df.show()
+---+----+-------+----------+-----+------+
| id|name|testing|avg_result|score|active|
+---+----+-------+----------+-----+------+
| 1| sam| null| null| null| True|
| 2| Ram| True| 0.05| 10| False|
| 3| Ian| False| 0.01| 1| False|
| 4| Jim| False| 1.2| 3| True|
+---+----+-------+----------+-----+------+
You could convert them to boolean and then back to string.
EDIT: I'm using spark 2.3.0
e.g.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, initcap
from pyspark.sql.types import IntegerType, BooleanType, StringType, StructType, StructField
data = [(1, "Y"), (1, "N"), (2, "false"), (2, "1"), (3, "NULL"), (3, None)]
schema = StructType([StructField("id", IntegerType(), True), StructField("txt", StringType(), True)])
df = SparkSession.builder.getOrCreate().createDataFrame(data, schema)
print(df.dtypes)
df.show()
df = df.withColumn("txt", col("txt").cast(BooleanType()))
print(df.dtypes)
df.show()
df = df.withColumn("txt", col("txt").cast(StringType()))
df = df.withColumn("txt", initcap(col("txt")))
print(df.dtypes)
df.show()
will give you
[('id', 'int'), ('txt', 'string')]
+---+-----+
| id| txt|
+---+-----+
| 1| Y|
| 1| N|
| 2|false|
| 2| 1|
| 3| NULL|
| 3| null|
+---+-----+
[('id', 'int'), ('txt', 'boolean')]
+---+-----+
| id| txt|
+---+-----+
| 1| true|
| 1|false|
| 2|false|
| 2| true|
| 3| null|
| 3| null|
+---+-----+
[('id', 'int'), ('txt', 'string')]
+---+-----+
| id| txt|
+---+-----+
| 1| True|
| 1|False|
| 2|False|
| 2| True|
| 3| null|
| 3| null|
+---+-----+

pyspark two dataframes subtractbykey issue

I am trying to output a dataframe containing only the columns that end up with different values after comparing two dataframes. I am having difficulty identifying an approach.
Code:
df_a = sql_context.createDataFrame([("a", 3,"apple","bear","carrot"), ("b", 5,"orange","lion","cabbage"), ("c", 7,"pears","tiger","onion"),("c", 8,"jackfruit","elephant","raddish"),("c", 8,"watermelon","giraffe","tomato")], ["name", "id","fruit","animal","veggie"])
df_b = sql_context.createDataFrame([("a", 3,"apple","bear","carrot"), ("b", 5,"orange","lion","cabbage"), ("c", 7,"banana","tiger","onion"),("c", 8,"jackfruit","camel","raddish")], ["name", "id","fruit","animal","veggie"])
df_a = df_a.alias('df_a')
df_b = df_b.alias('df_b')
df = df_a.join(df_b, (df_a.id == df_b.id) & (df_a.name == df_b.name),'leftanti').select('df_a.*').show()
Trying to match based on the key columns (id, name) between dataframe 1 and dataframe 2.
Dataframe 1:
+----+---+----------+--------+-------+
|name| id| fruit| animal| veggie|
+----+---+----------+--------+-------+
| a| 3| apple| bear| carrot|
| b| 5| orange| lion|cabbage|
| c| 7| pears| tiger| onion|
| c| 8| jackfruit|elephant|raddish|
| c| 9|watermelon| giraffe| tomato|
+----+---+----------+--------+-------+
Dataframe 2:
+----+---+---------+------+-------+
|name| id| fruit|animal| veggie|
+----+---+---------+------+-------+
| a| 3| apple| bear| carrot|
| b| 5| orange| lion|cabbage|
| c| 7| banana| tiger| onion|
| c| 8|jackfruit| camel|raddish|
+----+---+---------+------+-------+
Expected dataframe
+----+---+----------+--------+
|name| id| fruit| animal|
+----+---+----------+--------+
| c| 7| pears| tiger|
| c| 8| jackfruit|elephant|
| c| 9|watermelon| giraffe|
+----+---+----------+--------+

Changing Nulls Ordering in Spark SQL

I need to be able to sort columns in ascending and descending order and also to have nulls come first or last. Using RDDs I could use the sortByKey method with a custom comparator. I was wondering if there is a corresponding approach using the Dataset API. I see how to add desc/asc to columns, but I have no clue about the nulls ordering.
You can also do it with the dataset API:
scala> val df = Seq("a", "b", null).toDF("x")
df: org.apache.spark.sql.DataFrame = [x: string]
scala> df.select('*).orderBy('x.asc_nulls_last).show
+----+
| x|
+----+
| a|
| b|
|null|
+----+
scala> df.select('*).orderBy('x.asc_nulls_first).show
+----+
| x|
+----+
|null|
| a|
| b|
+----+
Same thing works with desc_nulls_last and desc_nulls_first.
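If you prefer the functions API over the symbol syntax, the same orderings are exposed there as well (added around Spark 2.1, if I remember correctly):

import org.apache.spark.sql.functions.{asc_nulls_first, asc_nulls_last, desc_nulls_first, desc_nulls_last}

df.orderBy(asc_nulls_last("x")).show
df.orderBy(desc_nulls_first("x")).show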
As mentioned by Oleksandr, there was a pull request for this. Now you can optionally use "nulls first" or "nulls last" in the SQL syntax:
scala> spark.sql("select * from spark_10747 order by col3 nulls last").show
+----+----+----+
|col1|col2|col3|
+----+----+----+
| 6| 7| 4|
| 6| 11| 4|
| 6| 15| 8|
| 6| 15| 8|
| 6| 7| 8|
| 6| 12| 10|
| 6| 9| 10|
| 6| 13|null|
| 6| 10|null|
+----+----+----+
