I am trying to convert a pyspark column of string type to date type as below.
**Date**
31 Mar 2020
2 Apr 2020
29 Jan 2019
8 Sep 2109
Output required:
31-03-2020
02-04-2020
29-01-2019
08-09-2109
Thanks.
You can use the built-in functions dayofmonth, year, month (or) date_format() (or) from_unixtime(unix_timestamp()) for this case.
Example:
#sample data
df=spark.createDataFrame([("31 Mar 2020",),("2 Apr 2020",),("29 Jan 2019",)],["Date"])
#DataFrame[Date: string]
df.show()
#+-----------+
#| Date|
#+-----------+
#|31 Mar 2020|
#| 2 Apr 2020|
#|29 Jan 2019|
#+-----------+
from pyspark.sql.functions import *
df.withColumn("new_dt", to_date(col("Date"),"dd MMM yyyy")).\
withColumn("year",year(col("new_dt"))).\
withColumn("month",month(col("new_dt"))).\
withColumn("day",dayofmonth(col("new_dt"))).\
show()
#+-----------+----------+----+-----+---+
#| Date| new_dt|year|month|day|
#+-----------+----------+----+-----+---+
#|31 Mar 2020|2020-03-31|2020| 3| 31|
#| 2 Apr 2020|2020-04-02|2020| 4| 2|
#|29 Jan 2019|2019-01-29|2019| 1| 29|
#+-----------+----------+----+-----+---+
#using date_format
df.withColumn("new_dt", to_date(col("Date"),"dd MMM yyyy")).\
withColumn("year",date_format(col("new_dt"),"yyyy")).\
withColumn("month",date_format(col("new_dt"),"MM")).\
withColumn("day",date_format(col("new_dt"),"dd")).show()
#+-----------+----------+----+-----+---+
#| Date| new_dt|year|month|day|
#+-----------+----------+----+-----+---+
#|31 Mar 2020|2020-03-31|2020| 03| 31|
#| 2 Apr 2020|2020-04-02|2020| 04| 02|
#|29 Jan 2019|2019-01-29|2019| 01| 29|
#+-----------+----------+----+-----+---+
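If you want the requested dd-MM-yyyy strings directly, here is a minimal sketch on the same sample data, using the from_unixtime(unix_timestamp()) and date_format options mentioned above:
#using from_unixtime(unix_timestamp()) (or) date_format to get dd-MM-yyyy strings
df.withColumn("new_dt", from_unixtime(unix_timestamp(col("Date"),"dd MMM yyyy"),"dd-MM-yyyy")).\
withColumn("new_dt2", date_format(to_date(col("Date"),"dd MMM yyyy"),"dd-MM-yyyy")).\
show()
#new_dt and new_dt2 should both show 31-03-2020, 02-04-2020, 29-01-2019 for the sample rows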
The to_date function would need days as 02 or ' 2' instead of 2. Therefore, we can use regex to remove spaces, then wherever the length of the string is less than the max (9), we can add a 0 to the start of the string. Then we can apply to_date and use it to extract your other columns (day, month, year). You can also use date_format to keep your date in a specified format.
df.show()  #sample df
+-----------+
|       Date|
+-----------+
|31 Mar 2020|
| 2 Apr 2020|
|29 Jan 2019|
| 8 Sep 2019|
+-----------+
from pyspark.sql import functions as F
df.withColumn("regex", F.regexp_replace("Date","\ ",""))\
.withColumn("Date", F.when(F.length("regex")<9, F.concat(F.lit(0),F.col("regex")))\
.otherwise(F.col("regex"))).drop("regex")\
.withColumn("Date", F.to_date("Date",'ddMMMyyyy'))\
.withColumn("Year", F.year("Date"))\
.withColumn("Month",F.month("Date"))\
.withColumn("Day", F.dayofmonth("Date"))\
.withColumn("Date_Format2", F.date_format("Date", 'dd-MM-yyyy'))\
.show()
#output
+----------+----+-----+---+------------+
| Date|Year|Month|Day|Date_Format2|
+----------+----+-----+---+------------+
|2020-03-31|2020| 3| 31| 31-03-2020|
|2020-04-02|2020| 4| 2| 02-04-2020|
|2019-01-29|2019| 1| 29| 29-01-2019|
|2019-09-08|2019| 9| 8| 08-09-2019|
+----------+----+-----+---+------------+
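If you are on Spark 3.x (an assumption about your version), the new datetime parser accepts one- or two-digit days with the pattern 'd MMM yyyy', so the zero-padding step can usually be skipped. A minimal sketch:
from pyspark.sql import functions as F
df.withColumn("Date", F.to_date("Date", "d MMM yyyy"))\
  .withColumn("Date_Format2", F.date_format("Date", "dd-MM-yyyy"))\
  .show()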
Related
I am currently using PySpark to do a moving average calculation over the last 12 months for different company groups. The data looks like this:
| CALENDAR_DATE| COMPANY | VALUE
| 2021-11-01 | a | 31
| 2021-10-01 | a | 31
| 2021-09-01 | a | 33
| 2021-08-01 | a | 21
| 2021-07-01 | a | 25
| 2021-06-01 | a | 28
| 2021-05-01 | a | 31
| 2021-04-01 | a | 31
| 2021-03-01 | a | 33
| 2021-04-01 | a | 31
| 2021-03-01 | a | 33
| 2021-04-01 | a | 10
| 2021-03-01 | a | 25
| 2021-04-01 | a | 30
| 2021-03-01 | a | 27
| 2021-02-01 | a | 18
| 2021-01-01 | a | 15
| 2021-11-01 | b | 31
| 2021-10-01 | b | 30
| 2021-09-01 | b | 31
| 2021-08-01 | b | 32
and I would like to get an extra column called rolling_average for each company a and b.
My code looks like this, but it doesn't give me the right answer. I really don't know what the problem is.
from pyspark.sql.functions import *
from pyspark.sql.window import *
w = Window().partitionBy('COMPANY').orderBy('CALENDAR_DATE').rowsBetween(-11, 0)
df = df.withColumn('ROLLING_AVERAGE', round(avg('VALUE').over(w), 1))
You need to use Window rangeBetween instead of rowsBetween. But first, convert the CALENDAR_DATE column into a timestamp:
from pyspark.sql import Window
from pyspark.sql import functions as F
df = df.withColumn('calendar_timestamp', F.to_timestamp('CALENDAR_DATE').cast("long"))
# 2629800 is the average number of seconds in one month (365.25 days / 12)
w = Window().partitionBy('COMPANY').orderBy('calendar_timestamp').rangeBetween(-11 * 2629800, 0)
df1 = df.withColumn(
    'ROLLING_AVERAGE',
    F.round(F.avg('VALUE').over(w), 1)
).drop('calendar_timestamp')
df1.show()
#+-------------+-------+-----+---------------+
#|CALENDAR_DATE|COMPANY|VALUE|ROLLING_AVERAGE|
#+-------------+-------+-----+---------------+
#| 2021-08-01| b| 32| 32.0|
#| 2021-09-01| b| 31| 31.5|
#| 2021-10-01| b| 30| 31.0|
#| 2021-11-01| b| 31| 31.0|
#| 2021-01-01| a| 15| 15.0|
#| 2021-02-01| a| 18| 16.5|
#| 2021-03-01| a| 33| 25.2|
#| 2021-03-01| a| 33| 25.2|
#| 2021-03-01| a| 25| 25.2|
#| 2021-03-01| a| 27| 25.2|
#| 2021-04-01| a| 31| 25.3|
#| 2021-04-01| a| 31| 25.3|
#| 2021-04-01| a| 10| 25.3|
#| 2021-04-01| a| 30| 25.3|
#| 2021-05-01| a| 31| 25.8|
#| 2021-06-01| a| 28| 26.0|
#| 2021-07-01| a| 25| 25.9|
#| 2021-08-01| a| 21| 25.6|
#| 2021-09-01| a| 33| 26.1|
#| 2021-10-01| a| 31| 26.4|
#+-------------+-------+-----+---------------+
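If you prefer to avoid the approximate seconds-per-month constant, a hedged alternative is to order the window by an integer month index (year * 12 + month) and use rangeBetween(-11, 0); this assumes CALENDAR_DATE is a date column or an ISO-formatted date string:
from pyspark.sql import Window
from pyspark.sql import functions as F

# month index: one unit per calendar month, so a range of (-11, 0) covers the last 12 months
df2 = df.withColumn('month_index', F.year('CALENDAR_DATE') * 12 + F.month('CALENDAR_DATE'))
w2 = Window.partitionBy('COMPANY').orderBy('month_index').rangeBetween(-11, 0)
df2.withColumn('ROLLING_AVERAGE', F.round(F.avg('VALUE').over(w2), 1))\
   .drop('month_index').show()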
I have a table of this format
date        dept       rate
2020-07-06  Marketing  20
2020-07-06  Sales      15
2020-07-06  Engg       40
2020-07-06  Sites      18
2020-07-08  Sales      5
2020-07-08  Engg       10
2020-07-08  Sites      7
I want to add a new "SpendRate" column such that the latest two days (7th and 8th July in the example) copy the values from 6th July's "rate" into "Spendrate":
date        dept       rate  Spendrate
2020-07-06  Marketing  20    20
2020-07-06  Sales      15    15
2020-07-06  Engg       40    40
2020-07-06  Sites      18    18
2020-07-07  Marketing  20    20
2020-07-08  Sales      5     15
2020-07-08  Engg       10    40
2020-07-08  Sites      7     18
Use the window function first(col, ignoreNulls=True) with a rangeBetween clause to define the frame.
Example:
df.show()
#+----------+---------+----+
#| date| dept|rate|
#+----------+---------+----+
#|2020-07-06|Marketing| 20|
#|2020-07-06| Sales| 15|
#|2020-07-06| Engg| 40|
#|2020-07-06| sites| 18|
#|2020-07-08| Sales| 5|
#|2020-07-08| Engg| 10|
#|2020-07-08| sites| 7|
#|2020-07-07|Marketing| 20|
#+----------+---------+----+
sql("select *, first(rate,True) over(partition by dept order by cast (date as timestamp) RANGE BETWEEN INTERVAL 2 DAYS PRECEDING AND CURRENT ROW) as Spendrate from tmp order by date").show()
#for more specific range by checking datediff -1 or 0 then generating Spendrate column.
sql("select date,dept,rate,case when diff=-1 then first(rate,True) over(partition by dept order by cast (date as timestamp) RANGE BETWEEN INTERVAL 2 DAYS PRECEDING AND CURRENT ROW) when diff=0 then first(rate,True) over(partition by dept order by cast (date as timestamp) RANGE BETWEEN INTERVAL 2 DAYS PRECEDING AND CURRENT ROW) else rate end as Spendrate from (select *,datediff(date,current_date)diff from tmp)t order by date").show()
#+----------+---------+----+----------+
#| date| dept|rate| Spendrate|
#+----------+---------+----+----------+
#|2020-07-06|Marketing| 20| 20 |
#|2020-07-06| sites| 18| 18 |
#|2020-07-06| Engg| 40| 40 |
#|2020-07-06| Sales| 15| 15 |
#|2020-07-07|Marketing| 20| 20 |
#|2020-07-08| sites| 7| 18 |
#|2020-07-08| Engg| 10| 40 |
#|2020-07-08| Sales| 5| 15 |
#+----------+---------+----+----------+
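A DataFrame-API sketch of the same idea, assuming df holds the table above and date is a date column or an ISO date string:
from pyspark.sql import functions as F
from pyspark.sql import Window

# order by epoch seconds so the 2-day range frame is expressed in seconds
w = Window.partitionBy('dept')\
          .orderBy(F.col('date').cast('timestamp').cast('long'))\
          .rangeBetween(-2 * 86400, 0)

df.withColumn('Spendrate', F.first('rate', ignorenulls=True).over(w))\
  .orderBy('date').show()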
from pyspark.sql import Window, WindowSpec
import pyspark.sql.functions as F
import pandas as pd
# Create the test data, assuming that every day we have data for all departments
date = ['2020-07-06', '2020-07-06', '2020-07-06', '2020-07-06','2020-07-07','2020-07-07','2020-07-07','2020-07-07', '2020-07-08', '2020-07-08','2020-07-08','2020-07-08' ]
dept = ['Marketing', 'Sales', 'Engg', 'Sites','Marketing', 'Sales', 'Engg', 'Sites','Marketing', 'Sales', 'Engg', 'Sites',]
rate = [20,15,40,18,20, 3, 6, 9, 100,5,10,7]
df = pd.DataFrame([date, dept, rate]).T
df.columns = ['date', 'dept', 'rate']
# create Spark DataFrame
sdf = spark.createDataFrame(df)
sdf.show()
+----------+---------+----+
| date| dept|rate|
+----------+---------+----+
|2020-07-06|Marketing| 20|
|2020-07-06| Sales| 15|
|2020-07-06| Engg| 40|
|2020-07-06| Sites| 18|
|2020-07-07|Marketing| 20|
|2020-07-07| Sales| 3|
|2020-07-07| Engg| 6|
|2020-07-07| Sites| 9|
|2020-07-08|Marketing| 100|
|2020-07-08| Sales| 5|
|2020-07-08| Engg| 10|
|2020-07-08| Sites| 7|
+----------+---------+----+
# Lag rate by 2 rows, i.e. 2 days here, since every department has exactly one row per day
windowSpec = Window.partitionBy('dept').orderBy('date')
value_column = 'rate_shift'
value_ff = F.lag(sdf['rate'], offset=2).over(windowSpec)
sdf = sdf.withColumn(value_column, value_ff)
sdf.orderBy('date').show()
+----------+---------+----+----------+
| date| dept|rate|rate_shift|
+----------+---------+----+----------+
|2020-07-06|Marketing| 20| null|
|2020-07-06| Sales| 15| null|
|2020-07-06| Engg| 40| null|
|2020-07-06| Sites| 18| null|
|2020-07-07|Marketing| 20| null|
|2020-07-07| Engg| 6| null|
|2020-07-07| Sales| 3| null|
|2020-07-07| Sites| 9| null|
|2020-07-08| Sales| 5| 15|
|2020-07-08| Engg| 10| 40|
|2020-07-08|Marketing| 100| 20|
|2020-07-08| Sites| 7| 18|
+----------+---------+----+----------+
Rates are shifted by 2 days.
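Note that lag with offset=2 counts rows, not days, so this only lines up because each department has exactly one row per day. A hedged, date-based variant (assuming date is a date column or ISO date string):
from pyspark.sql import functions as F
from pyspark.sql import Window

# order by days since epoch so the frame refers to exactly the row dated 2 days earlier
w_days = Window.partitionBy('dept')\
               .orderBy(F.datediff('date', F.lit('1970-01-01')))\
               .rangeBetween(-2, -2)
sdf.withColumn('rate_shift', F.first('rate').over(w_days)).orderBy('date').show()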
I need to write a user-defined aggregate function that captures the number of days between the previous DISCHARGE_DATE and the following ADMIT_DATE for each pair of consecutive visits.
I will also need to pivot on the "PERSON_ID" values.
I have the following input_df :
input_df :
+---------+----------+--------------+
|PERSON_ID|ADMIT_DATE|DISCHARGE_DATE|
+---------+----------+--------------+
| 111|2018-03-15| 2018-03-16|
| 333|2018-06-10| 2018-06-11|
| 111|2018-03-01| 2018-03-02|
| 222|2018-12-01| 2018-12-02|
| 222|2018-12-05| 2018-12-06|
| 111|2018-03-30| 2018-03-31|
| 333|2018-06-01| 2018-06-02|
| 333|2018-06-20| 2018-06-21|
| 111|2018-01-01| 2018-01-02|
+---------+----------+--------------+
First, I need to group by each person and sort the corresponding rows by ADMIT_DATE. That would yield "input_df2".
input_df2:
+---------+----------+--------------+
|PERSON_ID|ADMIT_DATE|DISCHARGE_DATE|
+---------+----------+--------------+
| 111|2018-01-01| 2018-01-03|
| 111|2018-03-01| 2018-03-02|
| 111|2018-03-15| 2018-03-16|
| 111|2018-03-30| 2018-03-31|
| 222|2018-12-01| 2018-12-02|
| 222|2018-12-05| 2018-12-06|
| 333|2018-06-01| 2018-06-02|
| 333|2018-06-10| 2018-06-11|
| 333|2018-06-20| 2018-06-21|
+---------+----------+--------------+
The desired output_df :
+------------------+-----------------+-----------------+----------------+
|PERSON_ID_DISTINCT| FIRST_DIFFERENCE|SECOND_DIFFERENCE|THIRD_DIFFERENCE|
+------------------+-----------------+-----------------+----------------+
| 111| 1 month 26 days | 13 days| 14 days|
| 222| 3 days| NAN| NAN|
| 333| 8 days| 9 days| NAN|
+------------------+-----------------+-----------------+----------------+
I know the maximum number a person appears in my input_df, so I know how many columns should be created by :
input_df.groupBy('PERSON_ID').count().sort('count', ascending=False).show(5)
Thanks a lot in advance,
You can use pyspark.sql.functions.datediff() to compute the difference between two dates in days. In this case, you just need to compute the difference between the current row's ADMIT_DATE and the previous row's DISCHARGE_DATE. You can do this by using pyspark.sql.functions.lag() over a Window.
For example, we can compute the duration between visits in days as a new column DURATION.
import pyspark.sql.functions as f
from pyspark.sql import Window
w = Window.partitionBy('PERSON_ID').orderBy('ADMIT_DATE')
input_df.withColumn(
'DURATION',
f.datediff(f.col('ADMIT_DATE'), f.lag('DISCHARGE_DATE').over(w))
)\
.withColumn('INDEX', f.row_number().over(w)-1)\
.sort('PERSON_ID', 'INDEX')\
.show()
#+---------+----------+--------------+--------+-----+
#|PERSON_ID|ADMIT_DATE|DISCHARGE_DATE|DURATION|INDEX|
#+---------+----------+--------------+--------+-----+
#| 111|2018-01-01| 2018-01-02| null| 0|
#| 111|2018-03-01| 2018-03-02| 58| 1|
#| 111|2018-03-15| 2018-03-16| 13| 2|
#| 111|2018-03-30| 2018-03-31| 14| 3|
#| 222|2018-12-01| 2018-12-02| null| 0|
#| 222|2018-12-05| 2018-12-06| 3| 1|
#| 333|2018-06-01| 2018-06-02| null| 0|
#| 333|2018-06-10| 2018-06-11| 8| 1|
#| 333|2018-06-20| 2018-06-21| 9| 2|
#+---------+----------+--------------+--------+-----+
Notice that I also added an INDEX column using pyspark.sql.functions.row_number(). We can filter for INDEX > 0 (because the first DURATION value will always be null) and then pivot the DataFrame:
input_df.withColumn(
'DURATION',
f.datediff(f.col('ADMIT_DATE'), f.lag('DISCHARGE_DATE').over(w))
)\
.withColumn('INDEX', f.row_number().over(w) - 1)\
.where('INDEX > 0')\
.groupBy('PERSON_ID').pivot('INDEX').agg(f.first('DURATION'))\
.sort('PERSON_ID')\
.show()
#+---------+---+----+----+
#|PERSON_ID| 1| 2| 3|
#+---------+---+----+----+
#| 111| 58| 13| 14|
#| 222| 3|null|null|
#| 333| 8| 9|null|
#+---------+---+----+----+
Now you can rename the columns to whatever you desire.
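For example, a minimal renaming sketch, assuming you saved the pivoted result above as pivoted (a hypothetical name):
# rename PERSON_ID, 1, 2, 3 to the requested column names
renamed = pivoted.toDF('PERSON_ID_DISTINCT', 'FIRST_DIFFERENCE', 'SECOND_DIFFERENCE', 'THIRD_DIFFERENCE')
renamed.show()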
Note: This assumes that ADMIT_DATE and DISCHARGE_DATE are of type date.
input_df.printSchema()
#root
# |-- PERSON_ID: long (nullable = true)
# |-- ADMIT_DATE: date (nullable = true)
# |-- DISCHARGE_DATE: date (nullable = true)
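If those columns happen to be strings in your data (an assumption), you could convert them first:
import pyspark.sql.functions as f
input_df = input_df.withColumn('ADMIT_DATE', f.to_date('ADMIT_DATE'))\
    .withColumn('DISCHARGE_DATE', f.to_date('DISCHARGE_DATE'))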
I'm not quite sure why my code gives 52 as the answer for: weekofyear("01/JAN/2017") .
Does anyone have a possible explanation for this? Is there a better way to do this?
from pyspark.sql import SparkSession, functions
spark = SparkSession.builder.appName('weekOfYear').getOrCreate()
from pyspark.sql.functions import to_date
df = spark.createDataFrame(
[(1, "01/JAN/2017"), (2, "15/FEB/2017")], ("id", "date"))
df.show()
+---+-----------+
| id| date|
+---+-----------+
| 1|01/JAN/2017|
| 2|15/FEB/2017|
+---+-----------+
Calculate the week of the year
df=df.withColumn("weekofyear", functions.weekofyear(to_date(df["date"],"dd/MMM/yyyy")))
df.printSchema()
root
|-- id: long (nullable = true)
|-- date: string (nullable = true)
|-- weekofyear: integer (nullable = true)
df.show()
The 'error' is visible below:
+---+-----------+----------+
| id| date|weekofyear|
+---+-----------+----------+
| 1|01/JAN/2017| 52|
| 2|15/FEB/2017| 7|
+---+-----------+----------+
It seems like weekofyear() will only return 1 for January 1st if the day of the week is Monday through Thursday.
To confirm, I created a DataFrame with all "01/JAN/YYYY" from 1900 to 2018:
df = sqlCtx.createDataFrame(
[(1, "01/JAN/{y}".format(y=year),) for year in range(1900,2019)],
["id", "date"]
)
Now let's convert it to a date, get the day of the week, and count the values for weekofyear():
import pyspark.sql.functions as f
df.withColumn("d", f.to_date(f.from_unixtime(f.unix_timestamp('date', "dd/MMM/yyyy"))))\
.withColumn("weekofyear", f.weekofyear("d"))\
.withColumn("dayofweek", f.date_format("d", "E"))\
.groupBy("dayofweek", "weekofyear")\
.count()\
.show()
#+---------+----------+-----+
#|dayofweek|weekofyear|count|
#+---------+----------+-----+
#| Sun| 52| 17|
#| Mon| 1| 18|
#| Tue| 1| 17|
#| Wed| 1| 17|
#| Thu| 1| 17|
#| Fri| 53| 17|
#| Sat| 53| 4|
#| Sat| 52| 12|
#+---------+----------+-----+
Note: I am using Spark v2.1, where to_date() does not accept a format argument, so I had to use the method described in this answer to convert the string to a date.
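On Spark 2.2+ (an assumption about your version), the format can be passed to to_date() directly:
df.withColumn("d", f.to_date("date", "dd/MMM/yyyy"))\
  .withColumn("weekofyear", f.weekofyear("d"))\
  .show()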
Similarly, weekofyear() only returns 1 for:
January 2nd, if the day of the week is Monday through Friday.
January 3rd, if the day of the week is Monday through Saturday.
Update
This behavior is consistent with the ISO 8601 week-numbering definition: weeks start on Monday, and week 1 is the week that contains the year's first Thursday, so January 1st can fall into week 52 or 53 of the previous ISO year.
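If you instead want January 1st to always land in week 1, a hedged workaround is to derive a simple (non-ISO) week number from the day of the year:
# simple week number: days since Jan 1 split into 7-day blocks (not ISO weeks)
df.withColumn("d", f.to_date(f.from_unixtime(f.unix_timestamp("date", "dd/MMM/yyyy"))))\
  .withColumn("simple_week", ((f.dayofyear("d") - 1) / 7).cast("int") + 1)\
  .show()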
I have data like this:
2014 a 1
2015 b 2
2014 a 2
2015 c 4
2014 b 2
How do I transform it to:
a b c
2014 3 2 0
2015 0 2 4
in Spark?
Thanks.
This is a prototypical application of a pivot table:
df.show()
//
//+------+------+----+
//|letter|number|year|
//+------+------+----+
//| a| 1|2014|
//| b| 2|2015|
//| a| 2|2014|
//| c| 4|2015|
//| b| 2|2014|
//+------+------+----+
val pivot = df.groupBy("year")
.pivot("letter")
.sum("number")
.na.fill(0,Seq("a"))
.na.fill(0,Seq("c"))
pivot.show()
//+----+---+---+---+
//|year| a| b| c|
//+----+---+---+---+
//|2014| 3| 2| 0|
//|2015| 0| 2| 4|
//+----+---+---+---+
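For completeness, a PySpark sketch of the same pivot (assuming the same column names letter, number, year):
from pyspark.sql import functions as F

pivot_df = df.groupBy("year")\
             .pivot("letter")\
             .sum("number")\
             .na.fill(0)   # fill the missing letter/year combinations with 0
pivot_df.show()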