Pyspark autoincrement for alternating group of values - apache-spark

I'm trying to create a new column in a Spark DataFrame using PySpark that represents an autoincrementing id based on groups of alternating boolean values. Let's say I have the following DataFrame:
df.show()
+-----+------------+-------------+
|id |par_id |is_on |
+-----+------------+-------------+
|40002|1 |true |
|40003|2 |true |
|40004|null |false |
|40005|17 |true |
|40006|2 |true |
|40007|17 |true |
|40008|240 |true |
|40009|1861 |true |
|40010|1862 |true |
|40011|2 |true |
|40012|null |false |
|40013|1863 |true |
|40014|626 |true |
|40016|208 |true |
|40017|2 |true |
|40018|null |false |
|40019|2 |true |
|40020|1863 |true |
|40021|2 |true |
|40022|2 |true |
+-----+------------+-------------+
I want to extend this DataFrame with an incremental id called id2 using the is_on attribute. That is, each group of boolean values should get an increasing id. The resulting DataFrame should look like this:
df.show()
+-----+------------+-------------+-----+
|id |par_id |is_on |id2 |
+-----+------------+-------------+-----+
|40002|1 |true |1 |
|40003|2 |true |1 |
|40004|null |false |2 |
|40005|17 |true |3 |
|40006|2 |true |3 |
|40007|17 |true |3 |
|40008|240 |true |3 |
|40009|1861 |true |3 |
|40010|1862 |true |3 |
|40011|2 |true |3 |
|40012|null |false |4 |
|40013|1863 |true |5 |
|40014|626 |true |5 |
|40016|208 |true |5 |
|40017|2 |true |5 |
|40018|null |false |6 |
|40019|2 |true |7 |
|40020|1863 |true |7 |
|40021|2 |true |7 |
|40022|2 |true |7 |
+-----+------------+-------------+-----+
Do you have any suggestions on how to do that? How can I write a user-defined function for this?

# This is a PySpark test script
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType
spark=SparkSession.builder.master("local").appName("durga prasad").config("spark.sql.warehouse.dir","/home/hadoop/spark-2.0.1-bin-hadoop2.7/bin/test_warehouse").getOrCreate()
df=spark.read.csv("/home/hadoop/stack_test.txt",sep=",",header=True)
# This is the UDF. It keeps its state in module-level variables, so it only behaves
# as expected when the whole DataFrame is processed in order on a single partition
# (as in this local test).
count = 1    # current group id; updated on every call
prStr = ''   # value of is_on from the previous row

def test_fun(is_on):
    global count
    global prStr
    if is_on == "false":
        # every "false" row starts a new group
        count = count + 1
        prStr = is_on
        return count
    if is_on == "true" and prStr == 'false':
        # a "true" row right after a "false" row also starts a new group
        count = count + 1
        prStr = is_on
        return count
    elif is_on == 'true':
        # consecutive "true" rows stay in the same group
        prStr = is_on
        return count
# end of UDF

testUDF = udf(test_fun, StringType())  # register the UDF
df.select("id", "par_id", "is_on", testUDF('is_on').alias("id2")).show()
####output
+-----+------+-----+---+
| id|par_id|is_on|id2|
+-----+------+-----+---+
|40002| 1| true| 1|
|40003| 2| true| 1|
|40004| null|false| 2|
|40005| 17| true| 3|
|40006| 2| true| 3|
|40007| 17| true| 3|
|40008| 240| true| 3|
|40009| 1861| true| 3|
|40010| 1862| true| 3|
|40011| 2| true| 3|
|40012| null|false| 4|
|40013| 1863| true| 5|
|40014| 626| true| 5|
|40016| 208| true| 5|
|40017| 2| true| 5|
|40018| null|false| 6|
|40019| 2| true| 7|
|40020| 1863| true| 7|
|40021| 2| true| 7|
|40022| 2| true| 7|
+-----+------+-----+---+
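For reference, the same grouping can also be computed without mutable global state, using a Window with lag plus a running sum. This is only a sketch under the assumptions of this test (is_on is read from the CSV as the strings "true"/"false", and rows are ordered by id); note that a global orderBy window moves all rows to a single partition, so it is only suitable for small data.

from pyspark.sql import Window
from pyspark.sql import functions as F

w = Window.orderBy("id")  # global ordering, matching the row order shown above
df2 = (df
    .withColumn("prev_is_on", F.lag("is_on").over(w))
    # a new group starts on every "false" row, and on a "true" row that follows a "false" row
    .withColumn("new_group",
        F.when(F.col("is_on") == "false", 1)
         .when((F.col("is_on") == "true") & (F.col("prev_is_on") == "false"), 1)
         .otherwise(0))
    # running sum of group starts, shifted so that the first group gets id 1
    .withColumn("id2", F.lit(1) + F.sum("new_group").over(w))
    .drop("prev_is_on", "new_group"))
df2.show()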

Related

Spark SQL orderBy and global ordering across partitions

I want to sort the DataFrame so that the different partitions are sorted internally (and also across each other, i.e. ALL elements of one partition are either <= or >= than ALL elements of another partition). This is important because I want to use Window functions with Window.partitionBy("partitionID"). However, there is something wrong with my understanding of how Spark works.
I run the following sample code:
val df = sc.parallelize(List((10),(8),(5),(9),(1),(6),(4),(7),(3),(2)),5)
.toDF("val")
.withColumn("partitionID",spark_partition_id)
df.show
+---+-----------+
|val|partitionID|
+---+-----------+
| 10| 0|
| 8| 0|
| 5| 1|
| 9| 1|
| 1| 2|
| 6| 2|
| 4| 3|
| 7| 3|
| 3| 4|
| 2| 4|
+---+-----------+
So far so good: there are 5 partitions, as expected, with no ordering within or across them.
To fix that I do:
scala> val df2 = df.orderBy("val").withColumn("partitionID2",spark_partition_id)
df2: org.apache.spark.sql.DataFrame = [val: int, partitionID: int, partitionID2: int]
scala> df2.show
+---+-----------+------------+
|val|partitionID|partitionID2|
+---+-----------+------------+
| 1| 2| 2|
| 2| 4| 4|
| 3| 4| 4|
| 4| 3| 3|
| 5| 1| 1|
| 6| 2| 2|
| 7| 3| 3|
| 8| 0| 0|
| 9| 1| 1|
| 10| 0| 0|
+---+-----------+------------+
Now the val column is sorted, as expected, but the partitions themselves are not "sorted". My expected result is something along the lines of:
+---+-----------+------------+
|val|partitionID|partitionID2|
+---+-----------+------------+
| 1| 2| 2|
| 2| 4| 2|
| 3| 4| 4|
| 4| 3| 4|
| 5| 1| 1|
| 6| 2| 1|
| 7| 3| 3|
| 8| 0| 3|
| 9| 1| 0|
| 10| 0| 0|
+---+-----------+------------+
or something equivalent, i.e. subsequent sorted elements belong in the same partition.
Can you point out what part of my logic is flawed, and how to get the intended behavior in this example? Any help is appreciated.
I ran the above using Scala and Spark 1.6, if that is relevant.
One way to get the expected behaviour is to explicitly range-partition on the sort column, so that each partition holds a contiguous range of val:
val df2 = df
.orderBy("val")
.repartitionByRange(5, col("val"))
.withColumn("partitionID2", spark_partition_id)
df2.show(false)
// +---+-----------+------------+
// |val|partitionID|partitionID2|
// +---+-----------+------------+
// |1 |2 |0 |
// |2 |4 |0 |
// |3 |4 |1 |
// |4 |3 |1 |
// |5 |1 |2 |
// |6 |2 |2 |
// |7 |3 |3 |
// |8 |0 |3 |
// |9 |1 |4 |
// |10 |0 |4 |
// +---+-----------+------------+
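Note that repartitionByRange does not exist in Spark 1.6, the version mentioned in the question; it was added to the Dataset API in Spark 2.3. For completeness, a rough PySpark sketch of the same idea (assuming a DataFrame df with an integer column val) would be:

from pyspark.sql import functions as F

df2 = (df.repartitionByRange(5, F.col("val"))      # each partition holds a contiguous range of val
         .sortWithinPartitions("val")              # sort the rows inside each partition
         .withColumn("partitionID2", F.spark_partition_id()))
df2.show(truncate=False)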

how to execute many expressions in the selectExpr

Is it possible to apply many expressions in the same selectExpr?
For example, if I have this DF:
+---+
| i|
+---+
| 10|
| 15|
| 11|
| 56|
+---+
how can I multiply by 2 and rename the column, like this:
df.selectExpr("i*2 as multiplication")
def selectExpr(exprs: String*): org.apache.spark.sql.DataFrame
If you have many expressions, pass them as multiple comma-separated string arguments. Please check the code below.
scala> val df = (1 to 10).toDF("id")
df: org.apache.spark.sql.DataFrame = [id: int]
scala> df.selectExpr("id*2 as twotimes", "id * 3 as threetimes").show
+--------+----------+
|twotimes|threetimes|
+--------+----------+
| 2| 3|
| 4| 6|
| 6| 9|
| 8| 12|
| 10| 15|
| 12| 18|
| 14| 21|
| 16| 24|
| 18| 27|
| 20| 30|
+--------+----------+
Yes, you can pass multiple expressions inside the df.selectExpr. https://spark.apache.org/docs/2.2.0/api/scala/index.html#org.apache.spark.sql.Dataset#selectExpr(exprs:String*):org.apache.spark.sql.DataFrame
scala> case class Person(name: String, age: Int)
scala> val personDS = Seq(Person("Max", 1), Person("Adam", 2), Person("Muller", 3)).toDS()
scala> personDS.show(false)
+------+---+
|name |age|
+------+---+
|Max |1 |
|Adam |2 |
|Muller|3 |
+------+---+
scala> personDS.selectExpr("age*2 as multiple","name").show(false)
+--------+------+
|multiple|name |
+--------+------+
|2 |Max |
|4 |Adam |
|6 |Muller|
+--------+------+
Alternatively, you can use withColumn to achieve the same result:
scala> personDS.withColumn("multiple",$"age"*2).select($"multiple",$"name").show(false)
+--------+------+
|multiple|name |
+--------+------+
|2 |Max |
|4 |Adam |
|6 |Muller|
+--------+------+
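The same multi-expression selectExpr call also works in PySpark. A minimal sketch, assuming an active SparkSession named spark:

df = spark.range(1, 11)  # single column "id" with values 1..10
df.selectExpr("id*2 as twotimes", "id*3 as threetimes").show()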

Spark DataFrame select null value

I have a Spark DataFrame in which a few columns can be null. I need to create a new DataFrame, adding a new column "error_desc" that lists all the columns with null values for every row. I need to do this dynamically, without mentioning each column name.
e.g. if my dataframe is as below:
+-----+------+------+
|Rowid|Record|Value |
+-----+------+------+
| 1| a| b|
| 2| null| d|
| 3| m| null|
+-----+------+------+
my final dataframe should be
+-----+------+-----+--------------+
|Rowid|Record|Value| error_desc|
+-----+------+-----+--------------+
| 1| a| b| null|
| 2| null| d|record is null|
| 3| m| null| value is null|
+-----+------+-----+--------------+
I have added a few more rows to the input DataFrame to cover more cases. You are not required to hard-code any column. Use the UDF below; it will give your desired output.
scala> import org.apache.spark.sql.Row
scala> import org.apache.spark.sql.expressions.UserDefinedFunction
scala> import org.apache.spark.sql.functions.{udf, lit, struct, col}
scala> df.show()
+-----+------+-----+
|Rowid|Record|Value|
+-----+------+-----+
| 1| a| b|
| 2| null| d|
| 3| m| null|
| 4| null| d|
| 5| null| null|
| null| e| null|
| 7| e| r|
+-----+------+-----+
scala> def CheckNull:UserDefinedFunction = udf((Column:String,r:Row) => {
| var check:String = ""
| val ColList = Column.split(",").toList
| ColList.foreach{ x =>
| if (r.getAs(x) == null)
| {
| check = check + x.toString + " is null. "
| }}
| check
| })
scala> df.withColumn("error_desc",CheckNull(lit(df.columns.mkString(",")),struct(df.columns map col: _*))).show(false)
+-----+------+-----+-------------------------------+
|Rowid|Record|Value|error_desc |
+-----+------+-----+-------------------------------+
|1 |a |b | |
|2 |null |d |Record is null. |
|3 |m |null |Value is null. |
|4 |null |d |Record is null. |
|5 |null |null |Record is null. Value is null. |
|null |e |null |Rowid is null. Value is null. |
|7 |e |r | |
+-----+------+-----+-------------------------------+
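In PySpark, the same error_desc column can also be built without a UDF, by combining when with concat_ws (which skips nulls). A minimal sketch, assuming a DataFrame df with the same columns:

from pyspark.sql import functions as F

# for each column, produce "<col> is null." when it is null, and null otherwise
null_msgs = [F.when(F.col(c).isNull(), F.lit(c + " is null.")) for c in df.columns]
# concat_ws drops the nulls, keeping only the messages for the null columns
df.withColumn("error_desc", F.concat_ws(" ", *null_msgs)).show(truncate=False)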

Pyspark pivot data frame based on condition

I have a data frame in pyspark like below.
df.show()
+---+-------+----+
| id| type|s_id|
+---+-------+----+
| 1| ios| 11|
| 1| ios| 12|
| 1| ios| 13|
| 1| ios| 14|
| 1|android| 15|
| 1|android| 16|
| 1|android| 17|
| 2| ios| 21|
| 2|android| 18|
+---+-------+----+
Now from this data frame I want to create another data frame by pivoting it.
df1.show()
+---+-----+-----+-----+---------+---------+---------+
| id| ios1| ios2| ios3| android1| android2| android3|
+---+-----+-----+-----+---------+---------+---------+
| 1| 11| 12| 13| 15| 16| 17|
| 2| 21| Null| Null| 18| Null| Null|
+---+-----+-----+-----+---------+---------+---------+
Here I need to apply the condition that, for each id, even if there are more than 3 rows of a given type, I want to keep only 3 (or fewer) of them.
How can I do that?
Edit
new_df.show()
+---+-------+----+
| id| type|s_id|
+---+-------+----+
| 1| ios| 11|
| 1| ios| 12|
| 1| | 13|
| 1| | 14|
| 1|andriod| 15|
| 1| | 16|
| 1| | 17|
| 2|andriod| 18|
| 2| ios| 21|
+---+-------+----+
The result I am getting is below
+---+----+----+----+--------+----+----+
| id| 1| 2| 3|andriod1|ios1|ios2|
+---+----+----+----+--------+----+----+
| 1| 13| 14| 16| 15| 11| 12|
| 2|null|null|null| 18| 21|null|
+---+----+----+----+--------+----+----+
What I want is
+---+--------+--------+--------+----+----+----+
|id |android1|android2|android3|ios1|ios2|ios3|
+---+--------+--------+--------+----+----+----+
|1 |15 | null| null| 11| 12|null|
|2 |18 | null| null| 21|null|null|
+---+--------+--------+--------+----+----+----+
Using the following logic should get you your desired result.
A Window function is used to generate a row number for each group of id and type, ordered by s_id. The generated row number is used to filter and to concatenate with type. Finally, grouping and pivoting should give you your desired output.
from pyspark.sql import Window
windowSpec = Window.partitionBy("id", "type").orderBy("s_id")
from pyspark.sql import functions as f
df.withColumn("ranks", f.row_number().over(windowSpec))\
.filter(f.col("ranks") < 4)\
.withColumn("type", f.concat(f.col("type"), f.col("ranks")))\
.drop("ranks")\
.groupBy("id")\
.pivot("type")\
.agg(f.first("s_id"))\
.show(truncate=False)
which should give you
+---+--------+--------+--------+----+----+----+
|id |android1|android2|android3|ios1|ios2|ios3|
+---+--------+--------+--------+----+----+----+
|1 |15 |16 |17 |11 |12 |13 |
|2 |18 |null |null |21 |null|null|
+---+--------+--------+--------+----+----+----+
Answer for the edited part
You just need an additional filter, as below:
df.withColumn("ranks", f.row_number().over(windowSpec)) \
.filter(f.col("ranks") < 4) \
.filter(f.col("type") != "") \
.withColumn("type", f.concat(f.col("type"), f.col("ranks"))) \
.drop("ranks") \
.groupBy("id") \
.pivot("type") \
.agg(f.first("s_id")) \
.show(truncate=False)
which would give you
+---+--------+----+----+
|id |andriod1|ios1|ios2|
+---+--------+----+----+
|1 |15 |11 |12 |
|2 |18 |21 |null|
+---+--------+----+----+
This dataframe now lacks the android2, android3 and ios3 columns because they are not present in your updated input data. You can add them using the withColumn API and populate them with null values, as sketched below.
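A minimal sketch of that last step, assuming the pivoted DataFrame from above is named result (a hypothetical name) and using the missing column names as they appear in the edited data:

from pyspark.sql import functions as f

for c in ["andriod2", "andriod3", "ios3"]:   # spelling matches the edited input data
    if c not in result.columns:
        result = result.withColumn(c, f.lit(None).cast("string"))  # cast is a placeholder; match the type of s_id
result.show(truncate=False)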

Calculate difference between value in current row and value in first row per group - pyspark [duplicate]

This question already has answers here:
Applying a Window function to calculate differences in pySpark
(2 answers)
Closed 1 year ago.
I have this DataFrame:
DataFrame[date: string, t: string, week: string, a: bigint, b: bigint]
With the following data:
+---------+--+--------+---+---+
|date |t |week |a |b |
+---------+--+--------+---+---+
|20180328 |1 |2018-W10|31 |35 |
|20180328 |1 |2018-W11|18 |37 |
|20180328 |1 |2018-W12|19 |37 |
|20180328 |1 |2018-W13|19 |38 |
|20180328 |1 |2018-W14|20 |38 |
|20180328 |1 |2018-W15|22 |39 |
|20180328 |1 |2018-W16|23 |39 |
|20180328 |1 |2018-W17|24 |40 |
|20180328 |1 |2018-W18|25 |40 |
|20180328 |1 |2018-W19|25 |41 |
|20180328 |1 |2018-W20|26 |41 |
|20180328 |1 |2018-W21|26 |41 |
|20180328 |1 |2018-W22|26 |41 |
|20180328 |2 |2018-W10|14 |26 |
|20180328 |2 |2018-W11|82 |33 |
|20180328 |2 |2018-W12|87 |36 |
|20180328 |2 |2018-W13|89 |39 |
|20180328 |2 |2018-W14|10 |45 |
|20180328 |2 |2018-W15|10 |45 |
|20180328 |2 |2018-W16|11 |48 |
|20180328 |2 |2018-W17|11 |55 |
|20180328 |2 |2018-W18|11 |60 |
|20180328 |2 |2018-W19|11 |70 |
|20180328 |2 |2018-W20|11 |79 |
|20180328 |2 |2018-W21|11 |86 |
|20180328 |2 |2018-W22|12 |93 |
+---------+--+--------+---+---+
And I want to add a new column that has, for each date and type (column t), the difference between the value of column b in that row and the value of column b in the first week for that date and type.
Something like this:
+---------+--+--------+---+---+---+
|date |t |week |a |b |h |
+---------+--+--------+---+---+---+
|20180328 |1 |2018-W10|31 |35 |0 |
|20180328 |1 |2018-W11|18 |37 |2 |
|20180328 |1 |2018-W12|19 |37 |2 |
|20180328 |1 |2018-W13|19 |38 |3 |
|20180328 |1 |2018-W14|20 |38 |3 |
|20180328 |1 |2018-W15|22 |39 |4 |
|20180328 |1 |2018-W16|23 |39 |4 |
|20180328 |1 |2018-W17|24 |40 |5 |
|20180328 |1 |2018-W18|25 |40 |5 |
|20180328 |1 |2018-W19|25 |41 |6 |
|20180328 |1 |2018-W20|26 |41 |6 |
|20180328 |1 |2018-W21|26 |41 |6 |
|20180328 |1 |2018-W22|26 |41 |6 |
|20180328 |2 |2018-W10|14 |26 |0 |
|20180328 |2 |2018-W11|82 |33 |7 |
|20180328 |2 |2018-W12|87 |36 |10 |
|20180328 |2 |2018-W13|89 |39 |13 |
|20180328 |2 |2018-W14|10 |45 |19 |
|20180328 |2 |2018-W15|10 |45 |19 |
|20180328 |2 |2018-W16|11 |48 |22 |
|20180328 |2 |2018-W17|11 |55 |29 |
|20180328 |2 |2018-W18|11 |60 |34 |
|20180328 |2 |2018-W19|11 |70 |44 |
|20180328 |2 |2018-W20|11 |79 |53 |
|20180328 |2 |2018-W21|11 |86 |60 |
|20180328 |2 |2018-W22|12 |93 |67 |
+---------+--+--------+---+---+---+
Each number in column h is the value in col('b') minus the value in col('b') at W10 for that type.
You can accomplish this using a pyspark.sql.Window.
Partition by the column 't' and order by the column 'week'. This works because sorting your week column will do a lexicographical sort, and 'W10' will be the first value for your group. If this were not the case, you would need to find another way to sort the column so that the order is what you want.
Here is a trimmed down example.
data = [
('20180328',1,'2018-W10',31,35),
('20180328',1,'2018-W11',18,37),
('20180328',1,'2018-W12',19,37),
('20180328',1,'2018-W13',19,38),
('20180328',1,'2018-W14',20,38),
('20180328',2,'2018-W10',14,26),
('20180328',2,'2018-W11',82,33),
('20180328',2,'2018-W12',87,36),
('20180328',2,'2018-W13',89,39)
]
df = sqlCtx.createDataFrame(data, ['date', 't', 'week', 'a', 'b'])
df.show()
#+--------+---+--------+---+---+
#| date| t| week| a| b|
#+--------+---+--------+---+---+
#|20180328| 1|2018-W10| 31| 35|
#|20180328| 1|2018-W11| 18| 37|
#|20180328| 1|2018-W12| 19| 37|
#|20180328| 1|2018-W13| 19| 38|
#|20180328| 1|2018-W14| 20| 38|
#|20180328| 2|2018-W10| 14| 26|
#|20180328| 2|2018-W11| 82| 33|
#|20180328| 2|2018-W12| 87| 36|
#|20180328| 2|2018-W13| 89| 39|
#+--------+---+--------+---+---+
Using pyspark DataFrame functions
Define the Window:
from pyspark.sql import Window
w = Window.partitionBy('t').orderBy('week')
Create the new column using the Window:
import pyspark.sql.functions as f
df = df.select('*', (f.col('b') - f.first('b').over(w)).alias('h'))
df.show()
#+--------+---+--------+---+---+---+
#| date| t| week| a| b| h|
#+--------+---+--------+---+---+---+
#|20180328| 1|2018-W10| 31| 35| 0|
#|20180328| 1|2018-W11| 18| 37| 2|
#|20180328| 1|2018-W12| 19| 37| 2|
#|20180328| 1|2018-W13| 19| 38| 3|
#|20180328| 1|2018-W14| 20| 38| 3|
#|20180328| 2|2018-W10| 14| 26| 0|
#|20180328| 2|2018-W11| 82| 33| 7|
#|20180328| 2|2018-W12| 87| 36| 10|
#|20180328| 2|2018-W13| 89| 39| 13|
#+--------+---+--------+---+---+---+
Using pyspark-sql
Here is the equivalent operation using pyspark-sql:
df.registerTempTable('myTable')
df = sqlCtx.sql(
"SELECT *, (b - FIRST(b) OVER (PARTITION BY t ORDER BY week)) AS h FROM myTable"
)
df.show()
#+--------+---+--------+---+---+---+
#| date| t| week| a| b| h|
#+--------+---+--------+---+---+---+
#|20180328| 1|2018-W10| 31| 35| 0|
#|20180328| 1|2018-W11| 18| 37| 2|
#|20180328| 1|2018-W12| 19| 37| 2|
#|20180328| 1|2018-W13| 19| 38| 3|
#|20180328| 1|2018-W14| 20| 38| 3|
#|20180328| 2|2018-W10| 14| 26| 0|
#|20180328| 2|2018-W11| 82| 33| 7|
#|20180328| 2|2018-W12| 87| 36| 10|
#|20180328| 2|2018-W13| 89| 39| 13|
#+--------+---+--------+---+---+---+