how to execute many expressions in the selectExpr - apache-spark

it is possible to apply many expression in the same selectExpr,
for example If I have this DF:
+---+
| i|
+---+
| 10|
| 15|
| 11|
| 56|
+---+
how to multiply by 2 and rename the column as this :
df.selectExpr("i*2 as multiplication")

def selectExpr(exprs: String*): org.apache.spark.sql.DataFrame
If you have many expressions you have to pass them comma separated strings. Please check below code.
scala> val df = (1 to 10).toDF("id")
df: org.apache.spark.sql.DataFrame = [id: int]
scala> df.selectExpr("id*2 as twotimes", "id * 3 as threetimes").show
+--------+----------+
|twotimes|threetimes|
+--------+----------+
| 2| 3|
| 4| 6|
| 6| 9|
| 8| 12|
| 10| 15|
| 12| 18|
| 14| 21|
| 16| 24|
| 18| 27|
| 20| 30|
+--------+----------+

Yes, you can pass multiple expressions inside the df.selectExpr. https://spark.apache.org/docs/2.2.0/api/scala/index.html#org.apache.spark.sql.Dataset#selectExpr(exprs:String*):org.apache.spark.sql.DataFrame
scala> case class Person(name: String, lanme: String)
scala> val personDS = Seq(Person("Max", 1), Person("Adam", 2), Person("Muller", 3)).toDS()
scala > personDs.show(false)
+------+---+
|name |age|
+------+---+
|Max |1 |
|Adam |2 |
|Muller|3 |
+------+---+
scala> personDS.selectExpr("age*2 as multiple","name").show(false)
+--------+------+
|multiple|name |
+--------+------+
|2 |Max |
|4 |Adam |
|6 |Muller|
+--------+------+
Or else you can also use withColumn to achieve the same results
scala> personDS.withColumn("multiple",$"age"*2).select($"multiple",$"name").show(false)
+--------+------+
|multiple|name |
+--------+------+
|2 |Max |
|4 |Adam |
|6 |Muller|
+--------+------+

Related

Can you please explain the parameters of takeSample takeSample() in pyspark [duplicate]

Reading the spark documentation: http://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html#pyspark.sql.DataFrame.sample
There is this boolean parameter withReplacement without much explanation.
sample(withReplacement, fraction, seed=None)
What is it and how do we use it?
The parameter withReplacement controls the Uniqueness of sample result. If we treat a Dataset as a bucket of balls, withReplacement=true means, taking a random ball out of the bucket and place it back into it. that means, the same ball can be picked up again.
Assuming all unique elements in a Dataset:
withReplacement=true, same element can be produced more than once as the result of sample.
withReplacement=false, each element of the dataset will be sampled only once.
import spark.implicits._
val df = Seq(1, 2, 3, 5, 6, 7, 8, 9, 10).toDF("ids")
df.show()
df.sample(true, 0.5, 5)
.show
df.sample(false, 0.5, 5)
.show
Result
+---+
|ids|
+---+
| 1|
| 2|
| 3|
| 5|
| 6|
| 7|
| 8|
| 9|
| 10|
+---+
+---+
|ids|
+---+
| 6|
| 7|
| 7|
| 9|
| 10|
+---+
+---+
|ids|
+---+
| 1|
| 3|
| 7|
| 8|
| 9|
+---+
This is actually mentioned in the spark docs version 2.3.
https://spark.apache.org/docs/2.3.0/api/python/pyspark.sql.html#pyspark.sql.DataFrame.sample
withReplacement – Sample with replacement
case class Member(id: Int, name: String, role: String)
val member1 = new Member(1, "User1", "Data Engineer")
val member2 = new Member(2, "User2", "Software Engineer")
val member3 = new Member(3, "User3", "DevOps Engineer")
val memberDF = Seq(member1, member2, member3).toDF
memberDF.sample(true, 0.4).show
+---+-----+-----------------+
| id| name| role|
+---+-----+-----------------+
| 1|User1| Data Engineer|
| 2|User2|Software Engineer|
+---+-----+-----------------+
memberDF.sample(true, 0.4).show
+---+-----+---------------+
| id| name| role|
+---+-----+---------------+
| 3|User3|DevOps Engineer|
+---+-----+---------------+
memberDF.sample(true, 0.4).show
+---+-----+-----------------+
| id| name| role|
+---+-----+-----------------+
| 2|User2|Software Engineer|
| 3|User3| DevOps Engineer|
+---+-----+-----------------+

Spark SQL orderBy and global ordering across partitions

I want to sort the Dataframe, so that the different partitions are sorted internally (and also across each other, i.e ALL elements of one partition are gonna be either <= or >= than ALL elements of another partition). This is important because I want to use Window functions with the Window.partitionBy("partitionID"). However, there is something wrong with my understanding of how Spark works.
I run the following sample code:
val df = sc.parallelize(List((10),(8),(5),(9),(1),(6),(4),(7),(3),(2)),5)
.toDF("val")
.withColumn("partitionID",spark_partition_id)
df.show
+---+-----------+
|val|partitionID|
+---+-----------+
| 10| 0|
| 8| 0|
| 5| 1|
| 9| 1|
| 1| 2|
| 6| 2|
| 4| 3|
| 7| 3|
| 3| 4|
| 2| 4|
+---+-----------+
so far so good, 5 partitions are expected without internal or external order.
To fix that I do:
scala> val df2 = df.orderBy("val").withColumn("partitionID2",spark_partition_id)
df2: org.apache.spark.sql.DataFrame = [val: int, partitionID: int, partitionID2: int]
scala> df2.show
+---+-----------+------------+
|val|partitionID|partitionID2|
+---+-----------+------------+
| 1| 2| 2|
| 2| 4| 4|
| 3| 4| 4|
| 4| 3| 3|
| 5| 1| 1|
| 6| 2| 2|
| 7| 3| 3|
| 8| 0| 0|
| 9| 1| 1|
| 10| 0| 0|
+---+-----------+------------+
Now the val column is sorted, as expected but the partitions themselves are not "sorted". My expected result is something along the lines:
+---+-----------+------------+
|val|partitionID|partitionID2|
+---+-----------+------------+
| 1| 2| 2|
| 2| 4| 2|
| 3| 4| 4|
| 4| 3| 4|
| 5| 1| 1|
| 6| 2| 1|
| 7| 3| 3|
| 8| 0| 3|
| 9| 1| 0|
| 10| 0| 0|
+---+-----------+------------+
or something equivalent, i.e subsequent sorted elements belong in the same partition.
Can you point out what part of my logic is flawed and how to extract the intended behavior in this example? Every help is appreciated.
I run the above using scala and Spark 1.6 if that is relevant.
val df2 = df
.orderBy("val")
.repartitionByRange(5, col("val"))
.withColumn("partitionID2", spark_partition_id)
df2.show(false)
// +---+-----------+------------+
// |val|partitionID|partitionID2|
// +---+-----------+------------+
// |1 |2 |0 |
// |2 |4 |0 |
// |3 |4 |1 |
// |4 |3 |1 |
// |5 |1 |2 |
// |6 |2 |2 |
// |7 |3 |3 |
// |8 |0 |3 |
// |9 |1 |4 |
// |10 |0 |4 |
// +---+-----------+------------+

Spark DataFrame select null value

I have a spark dataframe with few columns as null. I need to create a new dataframe , adding a new column "error_desc" which will mention all the columns with null values for every row. I need to do this dynamically without mentioning each column name.
eg: if my dataframe is below
+-----+------+------+
|Rowid|Record|Value |
+-----+------+------+
| 1| a| b|
| 2| null| d|
| 3| m| null|
+-----+------+------+
my final dataframe should be
+-----+------+-----+--------------+
|Rowid|Record|Value| error_desc|
+-----+------+-----+--------------+
| 1| a| b| null|
| 2| null| d|record is null|
| 3| m| null| value is null|
+-----+------+-----+--------------+
I have added few more rows in Input DataFrame to cover more cases. You do not required to hard code any column. Use below UDF, it will give your desire output.
scala> import org.apache.spark.sql.Row
scala> import org.apache.spark.sql.expressions.UserDefinedFunction
scala> df.show()
+-----+------+-----+
|Rowid|Record|Value|
+-----+------+-----+
| 1| a| b|
| 2| null| d|
| 3| m| null|
| 4| null| d|
| 5| null| null|
| null| e| null|
| 7| e| r|
+-----+------+-----+
scala> def CheckNull:UserDefinedFunction = udf((Column:String,r:Row) => {
| var check:String = ""
| val ColList = Column.split(",").toList
| ColList.foreach{ x =>
| if (r.getAs(x) == null)
| {
| check = check + x.toString + " is null. "
| }}
| check
| })
scala> df.withColumn("error_desc",CheckNull(lit(df.columns.mkString(",")),struct(df.columns map col: _*))).show(false)
+-----+------+-----+-------------------------------+
|Rowid|Record|Value|error_desc |
+-----+------+-----+-------------------------------+
|1 |a |b | |
|2 |null |d |Record is null. |
|3 |m |null |Value is null. |
|4 |null |d |Record is null. |
|5 |null |null |Record is null. Value is null. |
|null |e |null |Rowid is null. Value is null. |
|7 |e |r | |
+-----+------+-----+-------------------------------+

What does withReplacement do, if specified for sample against a Spark Dataframe

Reading the spark documentation: http://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html#pyspark.sql.DataFrame.sample
There is this boolean parameter withReplacement without much explanation.
sample(withReplacement, fraction, seed=None)
What is it and how do we use it?
The parameter withReplacement controls the Uniqueness of sample result. If we treat a Dataset as a bucket of balls, withReplacement=true means, taking a random ball out of the bucket and place it back into it. that means, the same ball can be picked up again.
Assuming all unique elements in a Dataset:
withReplacement=true, same element can be produced more than once as the result of sample.
withReplacement=false, each element of the dataset will be sampled only once.
import spark.implicits._
val df = Seq(1, 2, 3, 5, 6, 7, 8, 9, 10).toDF("ids")
df.show()
df.sample(true, 0.5, 5)
.show
df.sample(false, 0.5, 5)
.show
Result
+---+
|ids|
+---+
| 1|
| 2|
| 3|
| 5|
| 6|
| 7|
| 8|
| 9|
| 10|
+---+
+---+
|ids|
+---+
| 6|
| 7|
| 7|
| 9|
| 10|
+---+
+---+
|ids|
+---+
| 1|
| 3|
| 7|
| 8|
| 9|
+---+
This is actually mentioned in the spark docs version 2.3.
https://spark.apache.org/docs/2.3.0/api/python/pyspark.sql.html#pyspark.sql.DataFrame.sample
withReplacement – Sample with replacement
case class Member(id: Int, name: String, role: String)
val member1 = new Member(1, "User1", "Data Engineer")
val member2 = new Member(2, "User2", "Software Engineer")
val member3 = new Member(3, "User3", "DevOps Engineer")
val memberDF = Seq(member1, member2, member3).toDF
memberDF.sample(true, 0.4).show
+---+-----+-----------------+
| id| name| role|
+---+-----+-----------------+
| 1|User1| Data Engineer|
| 2|User2|Software Engineer|
+---+-----+-----------------+
memberDF.sample(true, 0.4).show
+---+-----+---------------+
| id| name| role|
+---+-----+---------------+
| 3|User3|DevOps Engineer|
+---+-----+---------------+
memberDF.sample(true, 0.4).show
+---+-----+-----------------+
| id| name| role|
+---+-----+-----------------+
| 2|User2|Software Engineer|
| 3|User3| DevOps Engineer|
+---+-----+-----------------+

Full outer join in pyspark data frames

I have created two data frames in pyspark like below. In these data frames I have column id. I want to perform a full outer join on these two data frames.
valuesA = [('Pirate',1),('Monkey',2),('Ninja',3),('Spaghetti',4)]
a = sqlContext.createDataFrame(valuesA,['name','id'])
a.show()
+---------+---+
| name| id|
+---------+---+
| Pirate| 1|
| Monkey| 2|
| Ninja| 3|
|Spaghetti| 4|
+---------+---+
valuesB = [('dave',1),('Thor',2),('face',3), ('test',5)]
b = sqlContext.createDataFrame(valuesB,['Movie','id'])
b.show()
+-----+---+
|Movie| id|
+-----+---+
| dave| 1|
| Thor| 2|
| face| 3|
| test| 5|
+-----+---+
full_outer_join = a.join(b, a.id == b.id,how='full')
full_outer_join.show()
+---------+----+-----+----+
| name| id|Movie| id|
+---------+----+-----+----+
| Pirate| 1| dave| 1|
| Monkey| 2| Thor| 2|
| Ninja| 3| face| 3|
|Spaghetti| 4| null|null|
| null|null| test| 5|
+---------+----+-----+----+
I want to have a result like below when I do a full_outer_join
+---------+-----+----+
| name|Movie| id|
+---------+-----+----+
| Pirate| dave| 1|
| Monkey| Thor| 2|
| Ninja| face| 3|
|Spaghetti| null| 4|
| null| test| 5|
+---------+-----+----+
I have done like below but getting some different result
full_outer_join = a.join(b, a.id == b.id,how='full').select(a.id, a.name, b.Movie)
full_outer_join.show()
+---------+----+-----+
| name| id|Movie|
+---------+----+-----+
| Pirate| 1| dave|
| Monkey| 2| Thor|
| Ninja| 3| face|
|Spaghetti| 4| null|
| null|null| test|
+---------+----+-----+
As you can see that I am missing Id 5 in my result data frame.
How can I achieve what I want?
Since the join columns have the same name, you can specify the join columns as a list:
a.join(b, ['id'], how='full').show()
+---+---------+-----+
| id| name|Movie|
+---+---------+-----+
| 5| null| test|
| 1| Pirate| dave|
| 3| Ninja| face|
| 2| Monkey| Thor|
| 4|Spaghetti| null|
+---+---------+-----+
Or coalesce the two id columns:
import pyspark.sql.functions as F
a.join(b, a.id == b.id, how='full').select(
F.coalesce(a.id, b.id).alias('id'), a.name, b.Movie
).show()
+---+---------+-----+
| id| name|Movie|
+---+---------+-----+
| 5| null| test|
| 1| Pirate| dave|
| 3| Ninja| face|
| 2| Monkey| Thor|
| 4|Spaghetti| null|
+---+---------+-----+
You can either reaname the column id from the dataframe b and drop later or use the list in join condition.
a.join(b, ['id'], how='full')
Output:
+---+---------+-----+
|id |name |Movie|
+---+---------+-----+
|1 |Pirate |dave |
|3 |Ninja |face |
|5 |null |test |
|4 |Spaghetti|null |
|2 |Monkey |Thor |
+---+---------+-----+

Resources