Spark code for SQL CASE statement and row_number equivalent - apache-spark

I have a data set like below
hduser#ubuntu:~$ hadoop fs -cat /user/hduser/test_sample/sample1.txt
Eid1,EName1,EDept1,100
Eid2,EName2,EDept1,102
Eid3,EName3,EDept1,101
Eid4,EName4,EDept2,110
Eid5,EName5,EDept2,121
Eid6,EName6,EDept3,99
I want to generate the output below using Spark code
Eid1,EName1,IT,102,1
Eid2,EName2,IT,101,2
Eid3,EName3,IT,100,3
Eid4,EName4,ComSc,121,1
Eid5,EName5,ComSc,110,2
Eid6,EName6,Mech,99,1
which is equivalent to the following SQL:
Select emp_id, emp_name,
       case when emp_dept='EDept1' then 'IT'
            when emp_dept='EDept2' then 'ComSc'
            when emp_dept='EDept3' then 'Mech'
       end dept_name,
       emp_sal,
       row_number() over (partition by emp_dept order by emp_sal desc) as rn
from emp
Can someone suggest how I can do this in Spark?

You can use RDD.zipWithIndex, then convert it to a DataFrame, then use min() and join to get the results you want.
Like this:
import org.apache.spark.sql._
import org.apache.spark.sql.types._
// SORT BY added as per comment request
val test = sc.textFile("/user/hadoop/test.txt")
.sortBy(_.split(",")(2)).sortBy(_.split(",")(3).toInt)
// Table to hold the dept name lookups
val deptDF =
sc.parallelize(Array(("EDept1", "IT"),("EDept2", "ComSc"),("EDept3", "Mech")))
.toDF("deptCode", "dept")
val schema = StructType(Array(
StructField("col1", StringType, false),
StructField("col2", StringType, false),
StructField("col3", StringType, false),
StructField("col4", StringType, false),
StructField("col5", LongType, false))
)
// join to deptDF added as per comment
val testDF = sqlContext.createDataFrame(
test.zipWithIndex.map(tuple => Row.fromSeq(tuple._1.split(",") ++ Array(tuple._2))),
schema
)
.join(deptDF, $"col3" === $"deptCode")
.select($"col1", $"col2", $"dept" as "col3", $"col4", $"col5")
.orderBy($"col5")
testDF.show
col1 col2 col3 col4 col5
Eid1 EName1 IT 100 0
Eid3 EName3 IT 101 1
Eid2 EName2 IT 102 2
Eid4 EName4 ComSc 110 3
Eid5 EName5 ComSc 121 4
Eid6 EName6 Mech 99 5
val result = testDF.join(
testDF.groupBy($"col3").agg($"col3" as "g_col3", min($"col5") as "start"),
$"col3" === $"g_col3"
)
.select($"col1", $"col2", $"col3", $"col4", $"col5" - $"start" + 1 as "index")
result.show
col1 col2 col3 col4 index
Eid4 EName4 ComSc 110 1
Eid5 EName5 ComSc 121 2
Eid6 EName6 Mech 99 1
Eid1 EName1 IT 100 1
Eid3 EName3 IT 101 2
Eid2 EName2 IT 102 3
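If window functions are available (a HiveContext on Spark 1.5/1.6, or any SparkSession on 2.x), the SQL in the question can also be translated almost literally: when/otherwise for the CASE expression and row_number() over a window spec for the ranking. A minimal sketch, assuming the same input file and that the implicits are in scope:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import sqlContext.implicits._

// parse the comma-separated lines into named columns
val emp = sc.textFile("/user/hduser/test_sample/sample1.txt")
  .map(_.split(","))
  .map(a => (a(0), a(1), a(2), a(3).toInt))
  .toDF("emp_id", "emp_name", "emp_dept", "emp_sal")

// PARTITION BY emp_dept ORDER BY emp_sal DESC
val byDept = Window.partitionBy($"emp_dept").orderBy($"emp_sal".desc)

emp.select(
  $"emp_id",
  $"emp_name",
  // CASE without ELSE: unmatched departments become null, as in the SQL
  when($"emp_dept" === "EDept1", "IT")
    .when($"emp_dept" === "EDept2", "ComSc")
    .when($"emp_dept" === "EDept3", "Mech")
    .as("dept_name"),
  $"emp_sal",
  row_number().over(byDept).as("rn")
).show()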

Related

PySpark row to struct with specified structure

This is my initial dataframe:
columns = ["CounterpartID","Year","Month","Day","churnprobability", "deadprobability"]
data = [(1234, 2021,5,12, 0.85,0.6),(1224, 2022,6,12, 0.75,0.6),(1345, 2022,5,13, 0.8,0.2),(234, 2021,7,12, 0.9,0.8)]
from pyspark.sql.types import StructType, StructField, IntegerType, DoubleType
schema = StructType([
StructField("client_id", IntegerType(), False),
StructField("year", IntegerType(), False),
StructField("month", IntegerType(), False),
StructField("day", IntegerType(), False),
StructField("churn_probability", DoubleType(), False),
StructField("dead_probability", DoubleType(), False)
])
df = spark.createDataFrame(data=data, schema=schema)
df.printSchema()
df.show(truncate=False)
Then I do some transformations on the columns (basically, splitting each float column into a before-decimal column and an after-decimal column) to get the intermediary dataframe.
abc = df.rdd.map(lambda x: (
    x[0], x[1], x[2], x[3],
    int(x[4]), int(x[4] % 1 * pow(10, 9)),
    int(x[5]), int(x[5] % 1 * pow(10, 9)),
)).toDF([
    'client_id', 'year', 'month', 'day',
    'churn_probability_unit', 'churn_probability_nano',
    'dead_probability_unit', 'dead_probability_nano',
])
display(abc)
Below is the final desired dataframe (this is just an example of one row, but of course I'll need all the rows from the intermediary dataframe).
sjson = {"clientId": {"id": 1234 },"eventDate": {"year": 2022,"month": 8,"day": 5},"churnProbability": {"rate": {"units": "500","nanos": 780000000}},"deadProbability": {"rate": {"units": "500","nanos": 780000000}}}
df = spark.read.json(sc.parallelize([sjson])).select("clientId", "eventDate", "churnProbability", "deadProbability")
display(df)
How do I reach this end state from the intermediary state efficiently for all rows?
End goal is to use this final dataframe to write to Kafka where the schema of the topic is a form of the final desired dataframe.
I would probably eliminate the RDD logic (and the extra toDF) by using just one select on your original df:
from pyspark.sql import functions as F
defg = df.select(
    F.struct(F.col('client_id').alias('id')).alias('clientId'),
    F.struct('year', 'month', 'day').alias('eventDate'),
    F.struct(
        F.struct(
            F.floor('churn_probability').alias('unit'),
            (F.col('churn_probability') % 1 * 10**9).cast('long').alias('nanos')
        ).alias('rate')
    ).alias('churnProbability'),
    F.struct(
        F.struct(
            F.floor('dead_probability').alias('unit'),
            (F.col('dead_probability') % 1 * 10**9).cast('long').alias('nanos')
        ).alias('rate')
    ).alias('deadProbability'),
)
defg.show()
# +--------+-------------+----------------+----------------+
# |clientId| eventDate|churnProbability| deadProbability|
# +--------+-------------+----------------+----------------+
# | {1234}|{2021, 5, 12}|{{0, 850000000}}|{{0, 600000000}}|
# | {1224}|{2022, 6, 12}|{{0, 750000000}}|{{0, 600000000}}|
# | {1345}|{2022, 5, 13}|{{0, 800000000}}|{{0, 200000000}}|
# | {234}|{2021, 7, 12}|{{0, 900000000}}|{{0, 800000000}}|
# +--------+-------------+----------------+----------------+
So, I was able to solve this using structs, without using to_json:
import pyspark.sql.functions as f
defg = abc.withColumn(
    "clientId",
    f.struct(f.col("client_id").alias("id"))
).withColumn(
    "eventDate",
    f.struct(
        f.col("year").alias("year"),
        f.col("month").alias("month"),
        f.col("day").alias("day"),
    )
).withColumn(
    "churnProbability",
    f.struct(
        f.struct(
            f.col("churn_probability_unit").alias("unit"),
            f.col("churn_probability_nano").alias("nanos")
        ).alias("rate")
    )
).withColumn(
    "deadProbability",
    f.struct(
        f.struct(
            f.col("dead_probability_unit").alias("unit"),
            f.col("dead_probability_nano").alias("nanos")
        ).alias("rate")
    )
).select("clientId", "eventDate", "churnProbability", "deadProbability")
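For the Kafka end goal mentioned in the question, the struct columns can be serialized to JSON and written to a topic directly (Spark 2.2+ with the spark-sql-kafka-0-10 package). A minimal sketch of the idea, shown here in Scala for illustration; the broker address and topic name are placeholders, not from the original post, and defg stands for the equivalent DataFrame built above:
import org.apache.spark.sql.functions.{to_json, struct, col}

defg
  .select(to_json(struct(col("clientId"), col("eventDate"), col("churnProbability"), col("deadProbability"))).alias("value"))
  .write
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092") // placeholder broker
  .option("topic", "client-probabilities")             // placeholder topic
  .save()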

How to concat two columns into map in hive

I have a hive table
row1 (Id_1, locale_1, value_1)
row2 (Id_1, locale_2, value_2)
row3 (Id_1, locale_3, value_3)
row4 (Id_2, locale_1, value_1)
row5 (Id_2, locale_3, value_3)
How can I turn it into a map keyed by the primary key id?
row (Id, Map<locale, value>)
like
row1 (Id_1, {locale_1 -> value_1, locale_2 -> value_2, locale_3 -> value_3})
row2 (Id_2, {locale_1 -> value_1, locale_3 -> value_3})
Thank you!
If you are not worried about the order of key-value pairs in your map columns, here is a quick way using the map function in the Spark DataFrame API:
val data = Seq(
("Id_1", "locale_1", "value_1"),
("Id_1", "locale_2", "value_2"),
("Id_1", "locale_3", "value_3"),
("Id_2", "locale_1", "value_1"),
("Id_2", "locale_3", "value_3")
)
import spark.implicits._
val df = data.toDF("id", "locale", "value")
import org.apache.spark.sql.functions._
val df2 = df.withColumn("myMap", map($"locale", $"value")).drop("locale").drop("value")
val df3 = df2
.withColumn("myMap", map_entries(col("myMap")))
.groupBy("id")
.agg(map_from_entries(flatten(collect_set("myMap"))).as("myMap"))
Output of df3.show(false) will be
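For the data above, that should look roughly like this (row order and the order of entries inside each map can vary, since collect_set gives no ordering guarantee):
+----+---------------------------------------------------------------+
|id  |myMap                                                          |
+----+---------------------------------------------------------------+
|Id_1|[locale_1 -> value_1, locale_2 -> value_2, locale_3 -> value_3]|
|Id_2|[locale_1 -> value_1, locale_3 -> value_3]                     |
+----+---------------------------------------------------------------+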

Passing an entire row as an argument to spark udf through spark dataframe - throws AnalysisException

I am trying to pass an entire row to a Spark UDF along with a few other arguments. I am not using Spark SQL; rather, I am using the DataFrame withColumn API, but I am getting the following exception:
Exception in thread "main" org.apache.spark.sql.AnalysisException: Resolved attribute(s) col3#9 missing from col1#7,col2#8,col3#13 in operator !Project [col1#7, col2#8, col3#13, UDF(col3#9, col2, named_struct(col1, col1#7, col2, col2#8, col3, col3#9)) AS contcatenated#17]. Attribute(s) with the same name appear in the operation: col3. Please check if the right attribute(s) are used.;;
The above exception can be replicated using the below code:
addRowUDF() // calling this reproduces the exception
def addRowUDF() {
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder().config(new SparkConf().set("master", "local[*]")).appName(this.getClass.getSimpleName).getOrCreate()
import spark.implicits._
val df = Seq(
("a", "b", "c"),
("a1", "b1", "c1")).toDF("col1", "col2", "col3")
execute(df)
}
def execute(df: org.apache.spark.sql.DataFrame) {
import org.apache.spark.sql.Row
def concatFunc(x: Any, y: String, row: Row) = x.toString + ":" + y + ":" + row.mkString(", ")
import org.apache.spark.sql.functions.{ udf, struct }
val combineUdf = udf((x: Any, y: String, row: Row) => concatFunc(x, y, row))
def udf_execute(udf: String, args: org.apache.spark.sql.Column*) = (combineUdf)(args: _*)
val columns = df.columns.map(df(_))
val df2 = df.withColumn("col3", lit("xxxxxxxxxxx"))
val df3 = df2.withColumn("contcatenated", udf_execute("uudf", df2.col("col3"), lit("col2"), struct(columns: _*)))
df3.show(false)
}
The output should be:
+----+----+-----------+----------------------------+
|col1|col2|col3 |contcatenated |
+----+----+-----------+----------------------------+
|a |b |xxxxxxxxxxx|xxxxxxxxxxx:col2:a, b, c |
|a1 |b1 |xxxxxxxxxxx|xxxxxxxxxxx:col2:a1, b1, c1 |
+----+----+-----------+----------------------------+
That happens because you refer to a column that is no longer in scope. When you call:
val df2 = df.withColumn("col3", lit("xxxxxxxxxxx"))
it shadows the original col3 column, effectively making preceding columns with the same name inaccessible. Even if that weren't the case, say after:
val df2 = df.select($"*", lit("xxxxxxxxxxx") as "col3")
the new col3 would be ambiguous, and indistinguishable by name from the one brought in by *.
So to achieve the required output you'll have to use another name:
val df2 = df.withColumn("col3_", lit("xxxxxxxxxxx"))
and then adjust the rest of your code accordingly:
df2.withColumn(
"contcatenated",
udf_execute("uudf", df2.col("col3_") as "col3",
lit("col2"), struct(columns: _*))
).drop("_3")
If the logic is as simple as the one in the example, you can of course just inline things:
df.withColumn(
"contcatenated",
udf_execute("uudf", lit("xxxxxxxxxxx") as "col3",
lit("col2"), struct(columns: _*))
).drop("_3")

UDF with multiple rows as response pySpark

I want to apply splitUtlisation to each row of utilisationDataFarme, passing startTime and endTime as parameters. splitUtlisation will return multiple rows of data, so I want to create a new DataFrame with (Id, Day, Hour, Minute).
def splitUtlisation(onDateTime, offDateTime):
yield onDateTime
rule = rrule.rrule(rrule.HOURLY, byminute = 0, bysecond = 0, dtstart=offDateTime)
for result in rule.between(onDateTime, offDateTime):
yield result
yield offDateTime
utilisationDataFarme = (
    sc.parallelize([
        (10001, "2017-02-12 12:01:40", "2017-02-12 12:56:32"),
        (10001, "2017-02-13 12:06:32", "2017-02-15 16:06:32"),
        (10001, "2017-02-16 21:45:56", "2017-02-21 21:45:56"),
        (10001, "2017-02-21 22:32:41", "2017-02-25 00:52:50"),
    ]).toDF(["id", "startTime", "endTime"])
    .withColumn("startTime", col("startTime").cast("timestamp"))
    .withColumn("endTime", col("endTime").cast("timestamp"))
)
In core Python I did it like this:
dayList = ['SUN' , 'MON' , 'TUE' , 'WED' , 'THR' , 'FRI' , 'SAT']
for result in hours_aligned(datetime.datetime.now(), datetime.datetime.now() + timedelta(hours=68)):
print(dayList[datetime.datetime.weekday(result)], result.hour, 60 if result.minute == 0 else result.minute)
Result
THR 21 60
THR 22 60
THR 23 60
FRI 0 60
FRI 1 60
FRI 2 60
FRI 3 60
How to create it in pySpark?
I tried to create a new schema and apply it:
schema = StructType([StructField("Id", StringType(), False), StructField("Day", StringType(), False), StructField("Hour", StringType(), False) , StructField("Minute", StringType(), False)])
udf_splitUtlisation = udf(splitUtlisation, schema)
df = sqlContext.createDataFrame([],"id" , "Day" , "Hour" , "Minute")
Still, I could not handle multiple rows in the response.
You can use PySpark's explode to unpack a single row containing multiple values into multiple rows, once you have your UDF defined correctly.
As far as I know you won't be able to use a generator with yield as a UDF. Instead, you need to return all values at once as an array (see return_type), which can then be exploded and expanded:
import pandas as pd
from pyspark.sql.functions import col, udf, explode
from pyspark.sql.types import ArrayType, StringType, MapType
# input data as given by OP
df = (
sc.parallelize(
[
(10001, "2017-02-12 12:01:40", "2017-02-12 12:56:32"),
(10001, "2017-02-13 12:06:32", "2017-02-15 16:06:32"),
(10001, "2017-02-16 21:45:56", "2017-02-21 21:45:56"),
(10001, "2017-02-21 22:32:41", "2017-02-25 00:52:50"),
]
)
.toDF(["id", "startTime", "endTime"])
.withColumn("startTime", col("startTime").cast("timestamp"))
.withColumn("endTime", col("endTime").cast("timestamp"))
)
return_type = ArrayType(MapType(StringType(), StringType()))
@udf(returnType=return_type)
def your_udf_func(start, end):
"""Insert your function to return whatever you like
as a list of dictionaries.
For example, I chose to return hourly values for
day, hour and minute.
"""
date_range = pd.date_range(start, end, freq="h")
df = pd.DataFrame(
{
"day": date_range.strftime("%a"),
"hour": date_range.hour,
"minute": date_range.minute,
}
)
values = df.to_dict("index").values()
return list(values)
extracted = your_udf_func("startTime", "endTime")
exploded = explode(extracted).alias("exploded")
expanded = [
col("exploded").getItem(k).alias(k) for k in ["hour", "day", "minute"]
]
result = df.select("id", exploded).select("id", *expanded)
And the result is:
result.show(5)
+-----+----+---+------+
| id|hour|day|minute|
+-----+----+---+------+
|10001| 12|Sun| 1|
|10001| 12|Mon| 6|
|10001| 13|Mon| 6|
|10001| 14|Mon| 6|
|10001| 15|Mon| 6|
+-----+----+---+------+
only showing top 5 rows

Specify default value for rowsBetween and rangeBetween in Spark

I have a question concerning a window operation on a Spark DataFrame in Spark 1.6.
Let's say I have the following table:
id|MONTH |number
1 201703 2
1 201704 3
1 201705 7
1 201706 6
At the moment I'm using the rowsBetween function:
val window = Window.partitionBy("id")
.orderBy(asc("MONTH"))
.rowsBetween(-2, 0)
randomDF.withColumn("counter", sum(col("number")).over(window))
This gives me the following results:
id|MONTH |number |counter
1 201703 2 2
1 201704 3 5
1 201705 7 12
1 201706 6 16
What I want to achieve is setting a default value (as with lag() and lead()) when there are not enough preceding rows to fill the window, for example '0', so that I get results like:
id|MONTH |number |counter
1 201703 2 0
1 201704 3 0
1 201705 7 12
1 201706 6 16
I've already looked in the documentation but Spark 1.6 does not allow this, and I was wondering if there was some kind of workaround.
Many thanks!
How about something like this, where you:
add an additional lag step
substitute the values with a case expression
Code:
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, Row}
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window

val rowsRdd: RDD[Row] = spark.sparkContext.parallelize(
Seq(
Row(1, 1, 201703, 2),
Row(2, 1, 201704, 3),
Row(3, 1, 201705, 7),
Row(4, 1, 201706, 6)))
val schema: StructType = new StructType()
.add(StructField("sortColumn", IntegerType, false))
.add(StructField("id", IntegerType, false))
.add(StructField("month", IntegerType, false))
.add(StructField("number", IntegerType, false))
val df0: DataFrame = spark.createDataFrame(rowsRdd, schema)
val prevRows = 2
val window = Window.partitionBy("id")
.orderBy(col("month"))
.rowsBetween(-prevRows, 0)
val window2 = Window.partitionBy("id")
.orderBy(col("month"))
val df2 = df0.withColumn("counter", sum(col("number")).over(window))
val df3 = df2.withColumn("myLagTmp", lag(lit(1), prevRows).over(window2))
val df4 = df3.withColumn("counter", expr("case when myLagTmp is null then 0 else counter end")).drop(col("myLagTmp"))
df4.sort("sortColumn").show()
Thanks to the answer of @astro_asz, I've come up with the following solution:
val numberRowsBetween = 2
val window1 = Window.partitionBy("id").orderBy("MONTH")
val window2 = Window.partitionBy("id")
.orderBy(asc("MONTH"))
.rowsBetween(-(numberRowsBetween - 1), 0)
randomDF.withColumn("counter", when(lag(col("number"), numberRowsBetween , 0).over(window1) === 0, 0)
.otherwise(sum(col("number")).over(window2)))
This solution will put a '0' as default value.
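An alternative sketch (not from the original answers) avoids the helper lag column entirely: count how many rows actually fall inside the frame and emit the default whenever the frame is not yet full. Assuming the same randomDF, the functions and Window imports from above, and the frame of the first answer (current row plus two preceding), this reproduces the desired 0, 0, 12, 16:
val prevRows = 2
val fullFrame = Window.partitionBy("id")
  .orderBy(asc("MONTH"))
  .rowsBetween(-prevRows, 0)

randomDF.withColumn(
  "counter",
  // the frame holds at most prevRows + 1 rows; while it is smaller, fall back to 0
  when(count(col("number")).over(fullFrame) <= prevRows, lit(0))
    .otherwise(sum(col("number")).over(fullFrame))
)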
