spark date_format results showing null - apache-spark

I have data source like below:
order_id,order_date,order_customer_id,order_status
1,2013-07-25 00:00:00.0,11599,CLOSED
2,2013-07-25 00:00:00.0,256,PENDING_PAYMENT
3,2013-07-25 00:00:00.0,12111,COMPLETE
4,2013-07-25 00:00:00.0,8827,CLOSED
I am trying to convert the date to mm/dd/yyyy, only for CLOSED orders, using the queries below, but I am getting null as output. Can you please help me get the required date format using either the DSL or the Spark SQL approach?
closed_df=ord_df.select(date_format(to_date('order_date','yyyy-mm-dd hh:mm:SS.a'),'mm/dd/yyyy') .\
alias("formate_date")).show()
#output:
+------------+
|formate_date|
+------------+
|        null|
|        null|
+------------+
ord_df.createOrReplaceTempView("orders")
cld_df = spark.sql( """select order_id, date_format(to_date("order_date","yyyy-mm-dd hh:mm:ss.a"),'mm/dd/yyyy') as order_date,\
order_customer_id, order_status \
from orders where order_status = 'CLOSED'""").show()
#output:
+--------+----------+-----------------+------------+
|order_id|order_date|order_customer_id|order_status|
+--------+----------+-----------------+------------+
|       1|      null|            11599|      CLOSED|
|       4|      null|             8827|      CLOSED|
+--------+----------+-----------------+------------+

The pattern to parse the string 2013-07-25 00:00:00.0 is yyyy-MM-dd HH:mm:ss.S (pattern letters are case-sensitive: MM is month, mm is minute, ss is second and S is the fraction of a second). Likewise, for formatting the date the pattern is MM/dd/yyyy. See the Spark datetime pattern documentation for more information.
from pyspark.sql.functions import to_date, date_format

data = [(1, "2013-07-25 00:00:00.0", 11599, "CLOSED"),
        (2, "2013-07-25 00:00:00.0", 256, "PENDING_PAYMENT"),
        (3, "2013-07-25 00:00:00.0", 12111, "COMPLETE"),
        (4, "2013-07-25 00:00:00.0", 8827, "CLOSED")]
ord_df = spark.createDataFrame(data, ["order_id", "order_date", "order_customer_id", "order_status"])

closed_df = (ord_df.where("order_status = 'CLOSED'")
             .select(date_format(to_date('order_date', 'yyyy-MM-dd HH:mm:ss.S'), 'MM/dd/yyyy')
                     .alias("formate_date"))).show()
"""
+------------+
|formate_date|
+------------+
| 07/25/2013|
| 07/25/2013|
+------------+
"""
ord_df.createOrReplaceTempView("orders")
cld_df = spark.sql("""select order_id,
                             date_format(to_date(order_date, 'yyyy-MM-dd HH:mm:ss.S'), 'MM/dd/yyyy') as order_date,
                             order_customer_id, order_status
                      from orders
                      where order_status = 'CLOSED'""").show()
"""
+--------+----------+-----------------+------------+
|order_id|order_date|order_customer_id|order_status|
+--------+----------+-----------------+------------+
| 1|07/25/2013| 11599| CLOSED|
| 4|07/25/2013| 8827| CLOSED|
+--------+----------+-----------------+------------+
"""

Related

Spark - Python: Select rows and dates

I have the following df in Spark (Python).
I am just trying to select, for each grupo_edad, the day on which the "datos_acumulados" column first exceeds 20480. In this case, the output should be a table like the following (including the null rows):
Results:
+----------+----------+------------+----------------+
|grupo_edad|     fecha|acumuladosMB|datos_acumulados|
+----------+----------+------------+----------------+
|         1|2020-08-04|        4864|           20921|
|         4|      null|        null|            null|
+----------+----------+------------+----------------+
Dataframe: df_datos_acumulados
+----------+----------+------------+----------------+
|grupo_edad|     fecha|acumuladosMB|datos_acumulados|
+----------+----------+------------+----------------+
|         1|2020-08-01|        6185|            6185|
|         1|2020-08-02|        5854|           12039|
|         1|2020-08-03|        4018|           16057|
|         1|2020-08-04|        4864|           20921|
|         1|2020-08-05|        5526|           26447|
|         1|2020-08-06|        4818|           31265|
|         1|2020-08-07|        5359|           36624|
|         4|2020-08-01|         674|             674|
|         4|2020-08-02|         744|            1418|
|         4|2020-08-03|         490|            1908|
|         4|2020-08-04|         355|            2263|
|         4|2020-08-05|        1061|            3324|
|         4|2020-08-06|         752|            4076|
|         4|2020-08-07|         560|            4636|
+----------+----------+------------+----------------+
Thanks!
Thanks to the answer from #pasha701 I could get the final table, but it doesn't show the null rows that I also need:
grupoDistinctDF = df_datos_acumulados.withColumn("grupo_edad", col("grupo_edad"))
grupoWindow = Window.partitionBy("grupo_edad").orderBy("fecha")
df_datos_acumulados = df_datos_acumulados.where(col("datos_acumulados") >= 20480) \
    .withColumn("row_number", row_number().over(grupoWindow)) \
    .where(col("row_number") == 1) \
    .drop("row_number")
grupoDistinctDF = grupoDistinctDF.join(df_datos_acumulados, ["grupo_edad"], "left")
Output:
+----------+----------+------------+----------------+
|grupo_edad|     fecha|acumuladosMB|datos_acumulados|
+----------+----------+------------+----------------+
|         1|2020-08-04|        4864|           20921|
+----------+----------+------------+----------------+
If the first row where "datos_acumulados" > 20480 is required, the window function row_number() can be used to get that first row per group, and the result can then be left-joined with the distinct "grupo_edad" values (Scala):
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.row_number
import spark.implicits._

val df = Seq(
  (1, "2020-08-01", 6185, 6185),
  (1, "2020-08-02", 5854, 12039),
  (1, "2020-08-03", 4018, 16057),
  (1, "2020-08-04", 4864, 20921),
  (1, "2020-08-05", 5526, 26447),
  (1, "2020-08-06", 4818, 31265),
  (1, "2020-08-07", 5359, 36624),
  (4, "2020-08-01", 674, 674),
  (4, "2020-08-02", 744, 1418),
  (4, "2020-08-03", 490, 1908),
  (4, "2020-08-04", 355, 2263),
  (4, "2020-08-05", 1061, 3324),
  (4, "2020-08-06", 752, 4076),
  (4, "2020-08-07", 560, 4636)
).toDF("grupo_edad", "fecha", "acumuladosMB", "datos_acumulados")

val grupoDistinctDF = df.select("grupo_edad").distinct()
val grupoWindow = Window.partitionBy("grupo_edad").orderBy("fecha")

val firstMatchingRowDF = df
  .where($"datos_acumulados" > 20480)
  .withColumn("row_number", row_number().over(grupoWindow))
  .where($"row_number" === 1)
  .drop("row_number")

grupoDistinctDF.join(firstMatchingRowDF, Seq("grupo_edad"), "left").show(false)
Output:
+----------+----------+------------+----------------+
|grupo_edad|fecha |acumuladosMB|datos_acumulados|
+----------+----------+------------+----------------+
|4 |null |null |null |
|1 |2020-08-04|4864 |20921 |
+----------+----------+------------+----------------+
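Since the question is in Python, here is a rough PySpark translation of the same approach (a sketch, assuming the question's df_datos_acumulados DataFrame):
from pyspark.sql import Window
from pyspark.sql import functions as F

# One row per group, so groups with no match still show up with nulls after the left join.
grupo_distinct_df = df_datos_acumulados.select("grupo_edad").distinct()

grupo_window = Window.partitionBy("grupo_edad").orderBy("fecha")

# First row per group whose accumulated data exceeds the threshold.
first_matching_row_df = (df_datos_acumulados
                         .where(F.col("datos_acumulados") > 20480)
                         .withColumn("row_number", F.row_number().over(grupo_window))
                         .where(F.col("row_number") == 1)
                         .drop("row_number"))

grupo_distinct_df.join(first_matching_row_df, ["grupo_edad"], "left").show()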

Updating rows based on the next time a specific value occurs in a dataframe pyspark

If I have a dataframe like this
data = [(("ID1", "ENGAGEMENT", 2019-03-03)), (("ID1", "BABY SHOWER", 2019-04-13)), (("ID1", "WEDDING", 2019-07-10)),
(("ID1", "DIVORCE", 2019-09-26))]
df = spark.createDataFrame(data, ["ID", "Event", "start_date"])
df.show()
+---+-----------+----------+
| ID| Event|start_date|
+---+-----------+----------+
|ID1| ENGAGEMENT|2019-03-03|
|ID1|BABY SHOWER|2019-04-13|
|ID1| WEDDING|2019-07-10|
|ID1| DIVORCE|2019-09-26|
+---+-----------+----------+
From this dataframe, the end date of each event has to be inferred from the start dates of the subsequent events.
For example, an engagement ends when the wedding takes place, so you would take the start date of the wedding as the end date of the engagement.
So the above dataframe should produce this output:
+---+-----------+----------+----------+
| ID| Event|start_date| end_date|
+---+-----------+----------+----------+
|ID1| ENGAGEMENT|2019-03-03|2019-07-10|
|ID1|BABY SHOWER|2019-04-13|2019-04-13|
|ID1| WEDDING|2019-07-10|2019-09-26|
|ID1| DIVORCE|2019-09-26| NULL|
+---+-----------+----------+----------+
I initially attempted this using the lead function over a window partitioned by the ID to look at the rows ahead, but since the "WEDDING" event can be 20 rows later, it doesn't work and is a really messy way to do it.
df = df.select("*", *([f.lead(f.col(c),default=None).over(Window.orderBy("ID")).alias("LEAD_"+c)
for c in ["Event", "start_date"]]))
activity_dates = activity_dates.select("*", *([f.lead(f.col(c),default=None).over(Window.orderBy("ID")).alias("LEAD_"+c)
for c in ["LEAD_Event", "LEAD_start_date"]]))
df = df.withColumn("end_date", f.when((col("Event") == "ENGAGEMENT") & (col("LEAD_Event") == "WEDDING"), col("LEAD_start_date"))
.when((col("Event") == "ENGAGEMENT") & (col("LEAD_LEAD_Event") == "WEDDING"), col("LEAD_LEAD_start_date"))
How can I achieve this without looping through the dataset?
Here is my try.
from pyspark.sql import Window
from pyspark.sql.functions import *

df.withColumn('end_date', expr('''
    case when Event = 'ENGAGEMENT'  then first(if(Event = 'WEDDING', start_date, null), True) over (partition by ID)
         when Event = 'BABY SHOWER' then first(if(Event = 'BABY SHOWER', start_date, null), True) over (partition by ID)
         when Event = 'WEDDING'     then first(if(Event = 'DIVORCE', start_date, null), True) over (partition by ID)
         else null end
''')).show()
+---+-----------+----------+----------+
| ID| Event|start_date| end_date|
+---+-----------+----------+----------+
|ID1| ENGAGEMENT|2019-03-03|2019-07-10|
|ID1|BABY SHOWER|2019-04-13|2019-04-13|
|ID1| WEDDING|2019-07-10|2019-09-26|
|ID1| DIVORCE|2019-09-26| null|
+---+-----------+----------+----------+
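If you prefer the DataFrame API over a SQL expr, the same logic can be written roughly as follows (a sketch of the equivalent, not the answerer's original code):
from pyspark.sql import Window
from pyspark.sql import functions as F

# Whole-partition window per ID; first(..., ignorenulls=True) picks the matching event's start date.
w = Window.partitionBy("ID")

end_date = (
    F.when(F.col("Event") == "ENGAGEMENT",
           F.first(F.when(F.col("Event") == "WEDDING", F.col("start_date")), ignorenulls=True).over(w))
     .when(F.col("Event") == "BABY SHOWER",
           F.first(F.when(F.col("Event") == "BABY SHOWER", F.col("start_date")), ignorenulls=True).over(w))
     .when(F.col("Event") == "WEDDING",
           F.first(F.when(F.col("Event") == "DIVORCE", F.col("start_date")), ignorenulls=True).over(w))
)

df.withColumn("end_date", end_date).show()
Like the expr version, this scans the whole partition per ID rather than a fixed number of rows ahead, so it does not depend on how far away the matching event is.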

Aggregate First Grouped Item from Subsequent Items

I have user game sessions containing: user id, game id, score and a timestamp when the game was played.
from pyspark import SparkContext
from pyspark.sql import HiveContext
from pyspark.sql import functions as F
sc = SparkContext("local")
sqlContext = HiveContext(sc)
df = sqlContext.createDataFrame([
    ("u1", "g1", 10, 0),
    ("u1", "g3", 2, 2),
    ("u1", "g3", 5, 3),
    ("u1", "g4", 5, 4),
    ("u2", "g2", 1, 1),
], ["UserID", "GameID", "Score", "Time"])
Desired Output
+------+-------------+-------------+
|UserID|MaxScoreGame1|MaxScoreGame2|
+------+-------------+-------------+
| u1| 10| 5|
| u2| 1| null|
+------+-------------+-------------+
I want to transform the data such that I get the max score of the first game the user played as well as the max score of the second game (bonus if I can also get the max score of all subsequent games). Unfortunately I'm not sure how that's possible to do with Spark SQL.
I know I can group by UserID and GameID and then agg to get the max score and min time. I am not sure how to proceed from there.
Clarification: MaxScoreGame1 and MaxScoreGame2 refer to the first and second game the user played, not to the GameID.
You could try using a combination of Window functions and Pivot.
Get the row number for every game partitioned by UserID ordered by Time.
Filter down to GameNumber being 1 or 2.
Pivot on that to get your desired output shape.
Unfortunately I am using Scala, not Python, but the code below should be fairly easy to port to the Python API.
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, lit, row_number, sum}

// Use a window function to number each user's sessions in time order
val rowNumberWindow = Window.partitionBy(col("UserId")).orderBy(col("Time"))

val output = {
  df
    .select(
      col("*"),
      row_number().over(rowNumberWindow).alias("GameNumber")
    )
    .filter(col("GameNumber") <= lit(2))
    .groupBy(col("UserId"))
    .pivot("GameNumber")
    .agg(
      sum(col("Score"))
    )
}

output.show()
output.show()
+------+---+----+
|UserId| 1| 2|
+------+---+----+
| u1| 10| 2|
| u2| 1|null|
+------+---+----+
Solution with PySpark (unlike the Scala version above, this first takes the max score per UserID and GameID, so MaxScoreGame2 is the max score of the second distinct game rather than the score of the second session):
from pyspark.sql import Window
from pyspark.sql import functions as F

rowNumberWindow = Window.partitionBy("UserID").orderBy(F.col("Time"))

(df
 .groupBy("UserID", "GameID")
 .agg(F.max("Score").alias("Score"),
      F.min("Time").alias("Time"))
 .select(F.col("*"),
         F.row_number().over(rowNumberWindow).alias("GameNumber"))
 .filter(F.col("GameNumber") <= F.lit(2))
 .withColumn("GameMaxScoreCol", F.concat(F.lit("MaxScoreGame"), F.col("GameNumber")))
 .groupBy("UserID")
 .pivot("GameMaxScoreCol")
 .agg(F.max("Score"))
).show()
+------+-------------+-------------+
|UserID|MaxScoreGame1|MaxScoreGame2|
+------+-------------+-------------+
| u1| 10| 5|
| u2| 1| null|
+------+-------------+-------------+

Column name when using concat

I am concatenating some columns in Spark SQL using the concat function. Here is some dummy code:
import org.apache.spark.sql.functions.{concat, lit}
val data1 = sc.parallelize(Array(2, 0, 3, 4, 5))
val data2 = sc.parallelize(Array(4, 0, 0, 6, 7))
val data3 = sc.parallelize(Array(1, 2, 3, 4, 10))
val dataJoin = data1.zip(data2).zip(data3).map((x) => (x._1._1, x._1._2, x._2 )).toDF("a1","a2","a3")
val dataConcat = dataJoin.select($"a1",concat(lit("["),$"a1", lit(","),$"a2", lit(","),$"a3", lit("]")))
Is there a way to specify or change the label of that column, to avoid the default name, which is not very practical?
+---+------------------------+
| a1|concat([,a1,,,a2,,,a3,])|
+---+------------------------+
| 2| [2,4,1]|
| 0| [0,0,2]|
| 3| [3,0,3]|
| 4| [4,6,4]|
| 5| [5,7,10]|
+---+------------------------+
Use the as or alias methods to give a name to your column.
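For instance, a minimal sketch of the same fix in PySpark (the Scala .as(...) / .alias(...) calls work the same way; dataJoin and the column names come from the question, and "concatenated" is just an example name):
from pyspark.sql import functions as F

# Give the concatenated column an explicit name instead of the auto-generated one.
data_concat = dataJoin.select(
    F.col("a1"),
    F.concat(F.lit("["), F.col("a1"), F.lit(","), F.col("a2"), F.lit(","), F.col("a3"), F.lit("]"))
     .alias("concatenated"))
data_concat.show()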

An error about Dataset.filter in Spark SQL

I want to filter the dataset so that it only contains records that can also be found in MySQL.
Here is the Dataset:
dataset.show()
+---+-----+
| id| name|
+---+-----+
| 1| a|
| 2| b|
| 3| c|
+---+-----+
And here is the table in MySQL:
+---+-----+
| id| name|
+---+-----+
| 1| a|
| 3| c|
| 4| d|
+---+-----+
This is my code (running in spark-shell):
import java.util.Properties
case class App(id: Int, name: String)
val data = sc.parallelize(Array((1, "a"), (2, "b"), (3, "c")))
val dataFrame = data.map { case (id, name) => App(id, name) }.toDF
val dataset = dataFrame.as[App]
val url = "jdbc:mysql://ip:port/tbl_name"
val table = "my_tbl_name"
val user = "my_user_name"
val password = "my_password"
val properties = new Properties()
properties.setProperty("user", user)
properties.setProperty("password", password)
dataset.filter((x: App) =>
  0 != sqlContext.read.jdbc(url, table, Array("id = " + x.id.toString), properties).count).show()
But I get "java.lang.NullPointerException"
at org.apache.spark.sql.SQLConf.getConf(SQLConf.scala:638)
at org.apache.spark.sql.SQLConf.defaultDataSourceName(SQLConf.scala:558)
at org.apache.spark.sql.DataFrameReader.<init>(DataFrameReader.scala:362)
at org.apache.spark.sql.SQLContext.read(SQLContext.scala:623)
I have tested
val x = App(1, "aa")
sqlContext.read.jdbc(url, table, Array("id = " + x.id.toString), properties).count
val y = App(5, "aa")
sqlContext.read.jdbc(url, table, Array("id = " + y.id.toString), properties).count
and I get the correct results, 1 and 0.
What's the problem with filter?
What's the problem with filter?
You get an exception because you're trying to execute an action (count on a DataFrame) inside a transformation (filter). Nested actions and transformations are not supported in Spark.
The correct solution is, as usual, either to join on compatible data structures, to look up against a local data structure, or to query the external system directly (without using Spark data structures).
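As an illustration of the join approach (a sketch, assuming a SparkSession named spark and the same url, table, user, and password as in the question), you can read the MySQL table once as a DataFrame and keep only matching ids with a left semi join:
# Load the MySQL table once instead of issuing one JDBC query per row.
mysql_df = (spark.read
            .format("jdbc")
            .option("url", url)
            .option("dbtable", table)
            .option("user", user)
            .option("password", password)
            .load())

# Keep only the rows of the dataset whose id also exists in MySQL.
dataset.join(mysql_df.select("id"), on="id", how="left_semi").show()
This avoids calling sqlContext.read inside filter entirely, so no action runs inside a transformation.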
