Generating monthly timestamps between two dates in pyspark dataframe - apache-spark

I have a DataFrame with a "date" column and I'm trying to generate a new DataFrame with all monthly timestamps between the min and max dates from the "date" column.
One possible solution is below:
from pyspark.sql import functions as f
from pyspark.sql.functions import asc, col, format_string, max as max_, min as min_

month_step = 31 * 60 * 60 * 24  # 31 days in seconds

min_date, max_date = df.select(min_("date").cast("long"), max_("date").cast("long")).first()

df_ts = spark.range(
    (min_date // month_step) * month_step,
    ((max_date // month_step) + 1) * month_step,
    month_step
).select(col("id").cast("timestamp").alias("yearmonth"))

df_formatted_ts = df_ts.withColumn(
    "yearmonth",
    f.concat(f.year("yearmonth"), f.lit('-'), format_string("%02d", f.month("yearmonth")))
).select('yearmonth')

df_formatted_ts.orderBy(asc('yearmonth')).show(150, False)
The problem is that I took 31 days as the month_step, which isn't really correct because some months have 30 days or even 28. Is it possible to make this more precise?
Just as a note: later I only need the year and month values, so I will ignore the day and time. But because I'm generating timestamps over quite a big date range (2001 to 2018), the timestamps drift.
That's why some months are occasionally skipped. For example, this snapshot is missing 2010-02:
|2010-01 |
|2010-03 |
|2010-04 |
|2010-05 |
|2010-06 |
|2010-07 |
I checked and there are just 3 months which were skipped from 2001 through 2018.

Suppose you had the following DataFrame:
data = [("2000-01-01","2002-12-01")]
df = spark.createDataFrame(data, ["minDate", "maxDate"])
df.show()
#+----------+----------+
#| minDate| maxDate|
#+----------+----------+
#|2000-01-01|2002-12-01|
#+----------+----------+
You can add a column date with all of the months in between minDate and maxDate, by following the same approach as my answer to this question.
Just replace pyspark.sql.functions.datediff with pyspark.sql.functions.months_between, and use add_months instead of date_add:
import pyspark.sql.functions as f
df.withColumn("monthsDiff", f.months_between("maxDate", "minDate"))\
.withColumn("repeat", f.expr("split(repeat(',', monthsDiff), ',')"))\
.select("*", f.posexplode("repeat").alias("date", "val"))\
.withColumn("date", f.expr("add_months(minDate, date)"))\
.select('date')\
.show(n=50)
#+----------+
#| date|
#+----------+
#|2000-01-01|
#|2000-02-01|
#|2000-03-01|
#|2000-04-01|
# ...skipping some rows...
#|2002-10-01|
#|2002-11-01|
#|2002-12-01|
#+----------+
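As a side note, on Spark 2.4+ the built-in sequence function can generate the list of months directly. Here is a minimal sketch against the same minDate/maxDate DataFrame (an alternative approach, not part of the original answer):
from pyspark.sql import functions as f

# sequence() builds an array of dates from minDate to maxDate in 1-month steps,
# and explode() turns that array into one row per month
df.select(
    f.explode(
        f.expr("sequence(to_date(minDate), to_date(maxDate), interval 1 month)")
    ).alias("date")
).show(n=50)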

Related

How to get week of month in Spark 3.0+?

I cannot find any datetime formatting pattern to get the week of month in Spark 3.0+.
Since the use of 'W' is deprecated, is there a solution to get the week of month without using the legacy option?
The code below doesn't work on Spark 3.2.1:
df = df.withColumn("weekofmonth", f.date_format(f.col("Date"), "W"))
For completeness, it's worth mentioning that one can set spark.sql.legacy.timeParserPolicy to "LEGACY":
from pyspark.sql import functions as F
spark.conf.set("spark.sql.legacy.timeParserPolicy", "LEGACY")
df = spark.createDataFrame(
    [('2022-07-01',),
     ('2022-07-02',),
     ('2022-07-03',)],
    ['Date'])
df.withColumn("weekofmonth", F.date_format(F.col("Date"), "W")).show()
# +----------+-----------+
# | Date|weekofmonth|
# +----------+-----------+
# |2022-07-01| 1|
# |2022-07-02| 1|
# |2022-07-03| 2|
# +----------+-----------+
You can try using a udf:
from calendar import monthcalendar
from pyspark.sql.functions import col, year, month, dayofmonth, udf

df = spark.createDataFrame(
    [(1, "2022-04-22"), (2, "2022-05-12")], ("id", "date"))

def get_week_of_month(year, month, day):
    # monthcalendar returns one list of day numbers per week (0 for padding days),
    # so the index of the week containing `day` is the week of the month
    return next(
        (
            week_number
            for week_number, days_of_week in enumerate(monthcalendar(year, month), start=1)
            if day in days_of_week
        ),
        None,
    )

fn1 = udf(get_week_of_month)
df = df.withColumn('week_of_mon', fn1(year(col('date')), month(col('date')), dayofmonth(col('date'))))
df.show()
If you have a table with year, month and week numbers sorted by year and week, you may try my solution:
select
year_iso,
month,
posexplode(collect_list(week_iso)) as (week_of_month, week_iso)
from your_table_with_dates
group by year_iso, month
Here we just transform the week_iso column into an array grouped by year_iso and month, and then explode it back into two columns (the position inside the month and week_iso).
Note that positions start at 0, but that's not a real problem.
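For reference, the same idea can be written with the DataFrame API (a rough equivalent sketch, assuming a DataFrame df with the same year_iso, month and week_iso columns as above):
from pyspark.sql import functions as F

# collect the ISO week numbers per (year, month), then posexplode to recover
# the position within the month alongside the original week number
(df.groupBy("year_iso", "month")
    .agg(F.collect_list("week_iso").alias("weeks"))
    .select("year_iso", "month",
            F.posexplode("weeks").alias("week_of_month", "week_iso"))
    .show())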

Transform a day of the year to day_month format

So I was wondering if it's possible with PySpark to transform a random day of the year (0-365) to day-month format. In my case, the input would be a string.
Example:
Input : "091"
Expected output (month-day): "0331"
This is possible, but you also need the year. Convert the year to a date (January 1st of that year), add the days to get the desired result, then format it.
Here's a working example
from pyspark.sql import functions as F
df = spark.createDataFrame([("2020", "091")], ["year", "day_of_year"])
df1 = df.withColumn(
    "first_day_year",
    F.concat_ws("-", "year", F.lit("01"), F.lit("01"))
).withColumn(
    "day_month",
    F.date_format(
        F.expr("date_add(first_day_year, cast(day_of_year as int) - 1)"),
        "MMdd"
    )
).drop("first_day_year")
df1.show()
df1.show()
#+----+-----------+---------+
#|year|day_of_year|day_month|
#+----+-----------+---------+
#|2020| 091| 0331|
#+----+-----------+---------+
You can also use date_add to add the number of days to the new year's day (this answer assumes a DataFrame with a single day-of-year column called day):
import pyspark.sql.functions as F

# illustrative input, matching the output shown below
df = spark.createDataFrame([(91,)], ['day'])

df2 = df.withColumn(
    'day_month',
    F.expr("date_format(date_add('2020-01-01', int(day - 1)), 'MMdd')")
)
df2.show()
df2.show()
+---+---------+
|day|day_month|
+---+---------+
| 91| 0331|
+---+---------+
Note that the result will vary depending on whether it's a leap year or not: day 91 falls on March 31 in 2020 (a leap year) but on April 1 in 2021.

spark - Calculating average of values in 2 or more columns and putting in new column in every row [duplicate]

This question already has answers here:
Spark DataFrame: Computing row-wise mean (or any aggregate operation)
(2 answers)
Closed 4 years ago.
Suppose I have a Dataset/DataFrame with the following contents:
name, marks1, marks2
Alice, 10, 20
Bob, 20, 30
I want to add a new column which should hold the average of the marks1 and marks2 columns.
Expected result:
name, marks1, marks2, Result(Avg)
Alice, 10, 20, 15
Bob, 20, 30, 25
For summing or any other arithmetic operation I use df.withColumn("xyz", $"marks1" + $"marks2"), but I cannot find a similar way for the average. Please help.
Additionally, the number of columns is not fixed: sometimes it might be the average of 2 columns, sometimes 3 or even more, so I want generic code that works in all cases.
One of the easiest and most optimized ways is to create a list of the marks columns and use it with withColumn:
pyspark
from pyspark.sql.functions import col
marksColumns = [col('marks1'), col('marks2')]
averageFunc = sum(x for x in marksColumns)/len(marksColumns)
df.withColumn('Result(Avg)', averageFunc).show(truncate=False)
and you should get
+-----+------+------+-----------+
|name |marks1|marks2|Result(Avg)|
+-----+------+------+-----------+
|Alice|10 |20 |15.0 |
|Bob |20 |30 |25.0 |
+-----+------+------+-----------+
scala-spark
The process in Scala is almost the same as in Python above:
import org.apache.spark.sql.functions.{col, lit}
val marksColumns = Array(col("marks1"), col("marks2"))
val averageFunc = marksColumns.foldLeft(lit(0)){(x, y) => x+y}/marksColumns.length
df.withColumn("Result(Avg)", averageFunc).show(false)
which should give you the same output as in pyspark.
I hope the answer is helpful.
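As a follow-up to the list-of-columns approach above: since the number of mark columns is not fixed, the column list could also be built dynamically, for example from a name prefix (a small sketch, assuming all mark columns start with "marks"):
from pyspark.sql.functions import col

# pick up however many marks columns the DataFrame happens to have
marksColumns = [col(c) for c in df.columns if c.startswith('marks')]
averageFunc = sum(x for x in marksColumns) / len(marksColumns)
df.withColumn('Result(Avg)', averageFunc).show(truncate=False)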
It's as easy as using a User Defined Function. By creating a specific UDF to deal with the average of many columns, you will be able to reuse it as many times as you want.
Python
In this snippet, I'm creating a UDF that takes an array of columns, and calculates the average of it.
from pyspark.sql.functions import udf, array
from pyspark.sql.types import DoubleType
avg_cols = udf(lambda array: sum(array)/len(array), DoubleType())
df.withColumn("average", avg_cols(array("marks1", "marks2"))).show()
Output:
+-----+------+------+--------+
| name|marks1|marks2| average|
+-----+------+------+--------+
|Alice| 10| 20| 15.0|
| Bob| 20| 30| 25.0|
+-----+------+------+--------+
Scala
With the Scala API, you must process the selected columns as a Row. You just have to select the columns using the Spark struct function.
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions._
import spark.implicits._
import scala.util.Try

def average = udf((row: Row) => {
  val values = row.toSeq.map(x => Try(x.toString.toDouble).toOption).filter(_.isDefined).map(_.get)
  if (values.nonEmpty) values.sum / values.length else 0.0
})

df.withColumn("average", average(struct($"marks1", $"marks2"))).show()
As you can see, I'm casting the values to Double with Try, so that if a value cannot be cast it won't throw an exception, and the average is computed only over the values that could be converted.
And that's all :)

Spark SQL weekofyear function

I am using spark sql's weekofyear function to calculate the week number for the given date.
I am using the following code,
test("udf - week number of the year") {
val spark = SparkSession.builder().master("local").appName("udf - week number of the year").getOrCreate()
import spark.implicits._
val data1 = Seq("20220101", "20220102", "20220103", "20220104", "20220105", "20220106", "20220107", "20220108", "20220109", "20220110", "20220111", "20220112")
data1.toDF("day").createOrReplaceTempView("tbl_day")
spark.sql("select day, to_date(day, 'yyyyMMdd') as date, weekofyear(to_date(day, 'yyyyMMdd')) as week_num from tbl_day").show(truncate = false)
/*
+--------+----------+--------+
|day |date |week_num|
+--------+----------+--------+
|20220101|2022-01-01|52 |
|20220102|2022-01-02|52 |
|20220103|2022-01-03|1 |
|20220104|2022-01-04|1 |
|20220105|2022-01-05|1 |
|20220106|2022-01-06|1 |
|20220107|2022-01-07|1 |
|20220108|2022-01-08|1 |
|20220109|2022-01-09|1 |
|20220110|2022-01-10|2 |
|20220111|2022-01-11|2 |
|20220112|2022-01-12|2 |
+--------+----------+--------+
*/
spark.stop
}
I am surprised to find that 20220101's week number is 52; it is the first day of 2022, so I would expect it to be 1.
I investigated the source code of weekofyear and found that it uses the following code to create the Calendar instance, which explains the result above:
@transient private lazy val c = {
  val c = Calendar.getInstance(DateTimeUtils.getTimeZone("UTC"))
  c.setFirstDayOfWeek(Calendar.MONDAY)
  c.setMinimalDaysInFirstWeek(4)
  c
}
I would like to ask why Spark SQL treats the first few days of the year in this way.
As a comparison, the following Oracle SQL gives me a week number of 1:
select to_number(to_char(to_date('01/01/2022','MM/DD/YYYY'),'WW')) from dual
In Hive, the result is the same as in Spark SQL.
I will post my findings here:
Spark SQL and Hive follow the ISO-8601 standard to calculate the week number of the year for a given date.
One point to note: Spark SQL internally uses the java.util.Calendar API to do the work. Java 8's java.time API supports ISO-8601 natively, so with java.time we wouldn't need the trick of c.setMinimalDaysInFirstWeek(4).
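To see the ISO-8601 rule outside of Spark, plain Python gives the same numbers via datetime.date.isocalendar() (a small illustrative check, not part of the original answer):
from datetime import date

# ISO-8601: weeks start on Monday and week 1 is the week containing the first Thursday
print(date(2022, 1, 1).isocalendar())  # year 2021, week 52, weekday 6 -> 2022-01-01 falls in ISO week 52 of 2021
print(date(2022, 1, 3).isocalendar())  # year 2022, week 1, weekday 1 -> the first Monday of 2022 starts week 1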
On Spark 3.0+ you can use the EXTRACT function. A few examples:
> SELECT extract(YEAR FROM TIMESTAMP '2019-08-12 01:00:00.123456');
2019
> SELECT extract(week FROM timestamp'2019-08-12 01:00:00.123456');
33
> SELECT extract(doy FROM DATE'2019-08-12');
224
> SELECT extract(SECONDS FROM timestamp'2019-10-01 00:00:01.000001');
1.000001
> SELECT extract(days FROM interval 1 year 10 months 5 days);
5
> SELECT extract(seconds FROM interval 5 hours 30 seconds 1 milliseconds 1 microseconds);
30.001001
Documentation here
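From PySpark you can run the same expression through spark.sql (a minimal usage example, consistent with the ISO week numbering discussed above):
spark.sql("SELECT extract(week FROM date'2022-01-03') AS week_num").show()
# +--------+
# |week_num|
# +--------+
# |       1|
# +--------+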

Getting the week number of month using time stamp format '2016-11-22 14:35:51' using SPARK SQL

I have a timestamp column with records like '2016-11-22 14:35:51' in Spark SQL. Could someone help me retrieve the week number of the month for this timestamp format?
I have tried:
SELECT timestamp, DATE_FORMAT(timestamp, 'u') AS WEEK FROM table_1;
But it gives the wrong output:
timestamp | WEEK
2016-11-22 | 2
I'd appreciate it if someone could help me out.
Thanks.
You are using the wrong pattern letter; you need to use W instead of u.
The letter u refers to the day number of the week.
scala> sqlContext.sql("select current_timestamp() as time, date_format(current_timestamp,'W') as week").show
// +--------------------+----+
// | time|week|
// +--------------------+----+
// |2017-01-09 13:46:...| 2|
// +--------------------+----+
scala> sqlContext.sql("select to_date('2017-01-01') as time, date_format(to_date('2017-01-01'),'W') as week").show
// +----------+----+
// | time|week|
// +----------+----+
// |2017-01-01| 1|
// +----------+----+
If you have more doubts, you can always refer to the official documentation of SimpleDateFormat in Java.
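For a PySpark version of the same idea (a small sketch; note that on Spark 3.0+ the 'W' pattern only works with the LEGACY time parser policy, as discussed in the week-of-month question above):
from pyspark.sql import functions as F

# only needed on Spark 3.0+, where the 'W' pattern is otherwise rejected
spark.conf.set("spark.sql.legacy.timeParserPolicy", "LEGACY")

df = spark.createDataFrame([('2016-11-22 14:35:51',)], ['timestamp'])
df.withColumn("week", F.date_format("timestamp", "W")).show(truncate=False)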
