Transform a day of the year to day_month format - apache-spark

So I was wondering if it's possible with PySpark to transform a random day of the year (0-365) to day-month format. In my case, the input would be a string.
Example:
Input: "091"
Expected output (month-day): "0331"

This is possible, but you also need the year. Convert the year to a date (January 1st of that year), add the days to get the desired result, then format it.
Here's a working example:
from pyspark.sql import functions as F

df = spark.createDataFrame([("2020", "091")], ["year", "day_of_year"])

df1 = df.withColumn(
    "first_day_year",
    F.concat_ws("-", "year", F.lit("01"), F.lit("01"))
).withColumn(
    "day_month",
    F.date_format(
        F.expr("date_add(first_day_year, cast(day_of_year as int) - 1)"),
        "MMdd"
    )
).drop("first_day_year")

df1.show()
#+----+-----------+---------+
#|year|day_of_year|day_month|
#+----+-----------+---------+
#|2020|        091|     0331|
#+----+-----------+---------+

Alternatively, you can use date_add to add the number of days (minus one) to New Year's Day. This example assumes a DataFrame with a single day column:
import pyspark.sql.functions as F

df2 = df.withColumn(
    'day_month',
    F.expr("date_format(date_add('2020-01-01', int(day - 1)), 'MMdd')")
)

df2.show()
+---+---------+
|day|day_month|
+---+---------+
| 91|     0331|
+---+---------+
Note that the result will vary depending on whether it's a leap year or not.
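For instance, day 091 falls on March 31 in a leap year but April 1 otherwise. A minimal sketch to illustrate, reusing the expr approach above with a hypothetical two-row input:
from pyspark.sql import functions as F

# Hypothetical input: the same day-of-year paired with a leap and a non-leap year
demo = spark.createDataFrame([("2020", "091"), ("2021", "091")], ["year", "day_of_year"])

demo.withColumn(
    "day_month",
    F.date_format(
        F.expr("date_add(to_date(concat(year, '-01-01')), cast(day_of_year as int) - 1)"),
        "MMdd"
    )
).show()
# day 091 -> 0331 in 2020 (leap year) but 0401 in 2021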

Related

How to get week of month in Spark 3.0+?

I cannot find any datetime formatting pattern to get the week of month in Spark 3.0+.
As use of 'W' is deprecated, is there a solution to get the week of month without using the legacy option?
The code below doesn't work in Spark 3.2.1:
df = df.withColumn("weekofmonth", f.date_format(f.col("Date"), "W"))
For completeness, it's worth mentioning that one can set the time parser policy to "LEGACY" to keep using the 'W' pattern:
from pyspark.sql import functions as F

spark.conf.set("spark.sql.legacy.timeParserPolicy", "LEGACY")

df = spark.createDataFrame(
    [('2022-07-01',),
     ('2022-07-02',),
     ('2022-07-03',)],
    ['Date'])

df.withColumn("weekofmonth", F.date_format(F.col("Date"), "W")).show()
# +----------+-----------+
# |      Date|weekofmonth|
# +----------+-----------+
# |2022-07-01|          1|
# |2022-07-02|          1|
# |2022-07-03|          2|
# +----------+-----------+
You can try using a UDF:
from calendar import monthcalendar

from pyspark.sql.functions import col, dayofmonth, month, udf, year

df = spark.createDataFrame(
    [(1, "2022-04-22"), (2, "2022-05-12")], ("id", "date"))

def get_week_of_month(year, month, day):
    # monthcalendar returns one list of day numbers per week (Monday-first),
    # padding days outside the month with 0, so we look up which week holds `day`
    return next(
        (
            week_number
            for week_number, days_of_week in enumerate(monthcalendar(year, month), start=1)
            if day in days_of_week
        ),
        None,
    )

fn1 = udf(get_week_of_month)

df = df.withColumn(
    'week_of_mon',
    fn1(year(col('date')), month(col('date')), dayofmonth(col('date'))))
df.show()  # or display(df) in Databricks notebooks
If you have a table with year, month and week numbers sorted by year and week, you may try my solution:
select
    year_iso,
    month,
    posexplode(collect_list(week_iso)) as (week_of_month, week_iso)
from your_table_with_dates
group by year_iso, month
Here we just collect the week_iso column into an array grouped by year_iso and month, and then explode it back into two columns (the position inside the month and week_iso).
Note that positions start at 0, but that's not a real problem.
(Screenshots of the source table and the resulting week-of-month table omitted.)
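The same idea can also be written with the DataFrame API; a minimal sketch, assuming a DataFrame named your_table_with_dates with year_iso, month and week_iso columns:
from pyspark.sql import functions as F

weeks_df = (
    your_table_with_dates
    .groupBy("year_iso", "month")
    .agg(F.collect_list("week_iso").alias("weeks"))
    .select("year_iso", "month", F.posexplode("weeks").alias("week_of_month", "week_iso"))
)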

Pyspark date format from multiple columns

I have four string columns 'hour', 'day', 'month', 'year' in my data frame. I would like to create a new column fulldate in the format 'dd/MM/yyyy HH:mm'.
df2 = df1.withColumn(
    "fulldate",
    to_date(
        concat(col('day'), lit('/'), col('month'), lit('/'), col('year'), lit(' '), col('hour'), lit(':'), lit('0'), lit('0')),
        'dd/MM/yyyy HH:mm'
    )
)
but it doesn't seem to work: I'm getting dates back in "yyyy-MM-dd" format.
Am I missing something?
For Spark 3+, you can use the make_timestamp function to build a timestamp column from those columns, then date_format to convert it to the desired pattern:
from pyspark.sql import functions as F

df2 = df1.withColumn(
    "fulldate",
    F.date_format(
        F.expr("make_timestamp(year, month, day, hour, 0, 0)"),
        "dd/MM/yyyy HH:mm"
    )
)
Use date_format instead of to_date.
to_date parses a string column into a DateType using the given format, while date_format renders a date/timestamp column as a string in the given format.
from pyspark.sql.functions import date_format, concat, col, lit

df2 = df1.withColumn(
    "fulldate",
    date_format(
        # build a dash-separated 'yyyy-M-d H:mm:ss' string so Spark can implicitly cast it to timestamp
        concat(col('year'), lit('-'), col('month'), lit('-'), col('day'), lit(' '), col('hour'), lit(':'), lit('00'), lit(':'), lit('00')),
        'dd/MM/yyyy HH:mm'
    )
)
For better readability, you can use format_string:
from pyspark.sql.functions import date_format, format_string, col

df2 = df1.withColumn(
    "fulldate",
    date_format(
        # %s placeholders since the columns are strings; dash-separated so the result casts to timestamp
        format_string('%s-%s-%s %s:00:00', col('year'), col('month'), col('day'), col('hour')),
        'dd/MM/yyyy HH:mm'
    )
)
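To see the to_date versus date_format distinction in isolation, a minimal sketch (DataFrame and column names hypothetical):
from pyspark.sql import functions as F

dates_demo = spark.createDataFrame([("31/01/2020",)], ["raw"])
dates_demo.select(
    F.to_date("raw", "dd/MM/yyyy").alias("as_date"),  # parses the string into a DateType, displayed as 2020-01-31
    F.date_format(F.to_date("raw", "dd/MM/yyyy"), "dd/MM/yyyy").alias("as_string")  # renders it back as '31/01/2020'
).show()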

How to convert all the date format to a timestamp for date column?

I am using PySpark version 3.0.1. I am reading a CSV file as a PySpark DataFrame that has 2 date columns, but when I print the schema both columns show up as string type.
(Screenshot of the DataFrame and its schema omitted.)
How can I convert the values in both date columns to timestamps using PySpark?
Everything I have tried requires knowing the current format, but how do I convert to a proper timestamp if I don't know which format will come in the CSV file?
I have tried the code below as well, but it creates a new column with null values:
df1 = df.withColumn('datetime', col('joining_date').cast('timestamp'))
print(df1.show())
print(df1.printSchema())
Since there are two different date formats, you need to parse with two different format strings and coalesce the results.
import pyspark.sql.functions as F

result = df.withColumn(
    'datetime',
    F.coalesce(
        F.to_timestamp('joining_date', 'MM-dd-yy'),
        F.to_timestamp('joining_date', 'MM/dd/yy')
    )
)

result.show()
+------------+-------------------+
|joining_date|           datetime|
+------------+-------------------+
|    01-20-20|2020-01-20 00:00:00|
|    01/19/20|2020-01-19 00:00:00|
+------------+-------------------+
If you want to convert all to a single format:
import pyspark.sql.functions as F

result = df.withColumn(
    'datetime',
    F.date_format(
        F.coalesce(
            F.to_timestamp('joining_date', 'MM-dd-yy'),
            F.to_timestamp('joining_date', 'MM/dd/yy')
        ),
        'MM-dd-yy'
    )
)

result.show()
+------------+--------+
|joining_date|datetime|
+------------+--------+
|    01-20-20|01-20-20|
|    01/19/20|01-19-20|
+------------+--------+
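If more candidate formats are possible, the same pattern extends to a list; a minimal sketch, with a hypothetical candidate_formats list:
import pyspark.sql.functions as F

candidate_formats = ['MM-dd-yy', 'MM/dd/yy', 'yyyy-MM-dd']  # hypothetical set of formats to try
result = df.withColumn(
    'datetime',
    F.coalesce(*[F.to_timestamp('joining_date', fmt) for fmt in candidate_formats])
)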

PySpark - to_date format from column

I am currently trying to figure out how to pass the string format argument to the to_date PySpark function via a column parameter.
Specifically, I have the following setup:
sc = SparkContext.getOrCreate()
df = sc.parallelize([('a', '2018-01-01', 'yyyy-MM-dd'),
                     ('b', '2018-02-02', 'yyyy-MM-dd'),
                     ('c', '02-02-2018', 'dd-MM-yyyy')]).toDF(
    ["col_name", "value", "format"])
I am currently trying to add a new column, where each of the dates from the column F.col("value"), which is a string value, is parsed to a date.
Separately for each format, this can be done with
df = df.withColumn("test1",F.to_date(F.col("value"),"yyyy-MM-dd")).\
withColumn("test2",F.to_date(F.col("value"),"dd-MM-yyyy"))
This however gives me 2 new columns - but I want to have 1 column containing both results - but calling the column does not seem to be possible with the to_date function:
df = df.withColumn("test3",F.to_date(F.col("value"),F.col("format")))
Here an error "Column object not callable" is being thrown.
Is is possible to have a generic approach for all possible formats (so that I do not have to manually add new columns for each format)?
You can use a column value as a parameter without a udf using the spark-sql syntax:
Spark version 2.2 and above
from pyspark.sql.functions import expr
df.withColumn("test3",expr("to_date(value, format)")).show()
#+--------+----------+----------+----------+
#|col_name|     value|    format|     test3|
#+--------+----------+----------+----------+
#|       a|2018-01-01|yyyy-MM-dd|2018-01-01|
#|       b|2018-02-02|yyyy-MM-dd|2018-02-02|
#|       c|02-02-2018|dd-MM-yyyy|2018-02-02|
#+--------+----------+----------+----------+
Or equivalently using pyspark-sql:
df.createOrReplaceTempView("df")
spark.sql("select *, to_date(value, format) as test3 from df").show()
Spark version 1.5 and above
Older versions of spark do not support having a format argument to the to_date function, so you'll have to use unix_timestamp and from_unixtime:
from pyspark.sql.functions import expr
df.withColumn(
    "test3",
    expr("from_unixtime(unix_timestamp(value,format))").cast("date")
).show()
Or equivalently using pyspark-sql:
df.createOrReplaceTempView("df")
spark.sql(
    "select *, cast(from_unixtime(unix_timestamp(value,format)) as date) as test3 from df"
).show()
As far as I know, your problem requires a UDF (user defined function) to apply the correct format per row. Inside a UDF you cannot directly use Spark functions like to_date, so I created a little workaround: first the UDF does the Python date conversion with the appropriate format taken from the column and converts the value to an ISO-format string, then a second withColumn converts that ISO string into a proper date in column test3. However, you have to adapt the format in the original column to match Python's date format strings, e.g. yyyy -> %Y, MM -> %m, ...
import datetime

from pyspark.sql.functions import col, to_date, udf

test_df = spark.createDataFrame([
    ('a', '2018-01-01', '%Y-%m-%d'),
    ('b', '2018-02-02', '%Y-%m-%d'),
    ('c', '02-02-2018', '%d-%m-%Y')
], ("col_name", "value", "format"))

def map_to_date(s, format):
    # parse with the row's own Python format string and re-emit as an ISO string
    return datetime.datetime.strptime(s, format).isoformat()

myudf = udf(map_to_date)

test_df.withColumn("test3", myudf(col("value"), col("format")))\
    .withColumn("test3", to_date("test3")).show(truncate=False)
Result:
+--------+----------+--------+----------+
|col_name|value     |format  |test3     |
+--------+----------+--------+----------+
|a       |2018-01-01|%Y-%m-%d|2018-01-01|
|b       |2018-02-02|%Y-%m-%d|2018-02-02|
|c       |02-02-2018|%d-%m-%Y|2018-02-02|
+--------+----------+--------+----------+
You don't actually need the format column. You can use coalesce to check all possible options:
from pyspark.sql import functions as F

def get_right_date_format(date_string):
    # try each candidate format; coalesce keeps the first one that parses
    return F.coalesce(
        F.to_date(date_string, 'yyyy-MM-dd'),
        F.to_date(date_string, 'dd-MM-yyyy'),
        F.to_date(date_string, 'yyyy-dd-MM')
    )

df = sc.parallelize([('a', '2018-01-01'),
                     ('b', '2018-02-02'),
                     ('c', '2018-21-02'),
                     ('d', '02-02-2018')]).toDF(
    ["col_name", "value"])

df = df.withColumn("formatted_data", get_right_date_format(df.value))
The issue with this approach, though, is that a date like 2020-02-01 would be treated as 1 Feb 2020, when it is quite possible that 2 Jan 2020 (yyyy-dd-MM) was meant; coalesce simply returns the first format that parses.
Just an alternative approach!
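A minimal sketch of that ambiguity (the order of the candidate formats decides the outcome):
from pyspark.sql import functions as F

ambiguous = spark.createDataFrame([('2020-02-01',)], ['value'])
ambiguous.select(
    F.to_date('value', 'yyyy-MM-dd').alias('month_first'),  # -> 2020-02-01 (1 Feb 2020)
    F.to_date('value', 'yyyy-dd-MM').alias('day_first')     # -> 2020-01-02 (2 Jan 2020)
).show()
# Both parses succeed, so coalesce silently picks whichever format is listed first.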

Generating monthly timestamps between two dates in pyspark dataframe

I have a DataFrame with a "date" column, and I'm trying to generate a new DataFrame with all monthly timestamps between the min and max dates in the "date" column.
One solution is below:
import pyspark.sql.functions as f
from pyspark.sql.functions import asc, col, format_string, max as max_, min as min_

month_step = 31 * 60 * 60 * 24  # 31 days in seconds

min_date, max_date = df.select(min_("date").cast("long"), max_("date").cast("long")).first()

df_ts = spark.range(
    (min_date // month_step) * month_step,      # integer division keeps the bounds as whole seconds
    ((max_date // month_step) + 1) * month_step,
    month_step
).select(col("id").cast("timestamp").alias("yearmonth"))

df_formatted_ts = df_ts.withColumn(
    "yearmonth",
    f.concat(f.year("yearmonth"), f.lit('-'), format_string("%02d", f.month("yearmonth")))
).select('yearmonth')

df_formatted_ts.orderBy(asc('yearmonth')).show(150, False)
The problem is that I took 31 days as the month_step, which isn't really correct because some months have 30 or even 28 days. Is it possible to somehow make this more precise?
Just as a note: later I only need the year and month values, so I will ignore day and time. But because I'm generating timestamps over quite a big date range (between 2001 and 2018), the timestamps shift.
That's why sometimes some months will be skipped. For example, this snapshot is missing 2010-02:
|2010-01 |
|2010-03 |
|2010-04 |
|2010-05 |
|2010-06 |
|2010-07 |
I checked and there are just 3 months which were skipped from 2001 through 2018.
Suppose you had the following DataFrame:
data = [("2000-01-01","2002-12-01")]
df = spark.createDataFrame(data, ["minDate", "maxDate"])
df.show()
#+----------+----------+
#| minDate| maxDate|
#+----------+----------+
#|2000-01-01|2002-12-01|
#+----------+----------+
You can add a column date with all of the months in between minDate and maxDate, by following the same approach as my answer to this question.
Just replace pyspark.sql.functions.datediff with pyspark.sql.functions.months_between, and use add_months instead of date_add:
import pyspark.sql.functions as f

df.withColumn("monthsDiff", f.months_between("maxDate", "minDate"))\
    .withColumn("repeat", f.expr("split(repeat(',', monthsDiff), ',')"))\
    .select("*", f.posexplode("repeat").alias("date", "val"))\
    .withColumn("date", f.expr("add_months(minDate, date)"))\
    .select('date')\
    .show(n=50)
#+----------+
#|      date|
#+----------+
#|2000-01-01|
#|2000-02-01|
#|2000-03-01|
#|2000-04-01|
# ...skipping some rows...
#|2002-10-01|
#|2002-11-01|
#|2002-12-01|
#+----------+
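On Spark 2.4+, the built-in sequence function can generate the month starts directly, which avoids the repeat/posexplode trick. A minimal sketch (not part of the original answer), reusing the minDate/maxDate DataFrame above:
import pyspark.sql.functions as f

df.select(
    f.explode(
        f.expr("sequence(to_date(minDate), to_date(maxDate), interval 1 month)")
    ).alias("date")
).show(n=50)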
