Spark SQL weekofyear function - apache-spark

I am using Spark SQL's weekofyear function to calculate the week number for a given date, with the following code:
test("udf - week number of the year") {
val spark = SparkSession.builder().master("local").appName("udf - week number of the year").getOrCreate()
import spark.implicits._
val data1 = Seq("20220101", "20220102", "20220103", "20220104", "20220105", "20220106", "20220107", "20220108", "20220109", "20220110", "20220111", "20220112")
data1.toDF("day").createOrReplaceTempView("tbl_day")
spark.sql("select day, to_date(day, 'yyyyMMdd') as date, weekofyear(to_date(day, 'yyyyMMdd')) as week_num from tbl_day").show(truncate = false)
/*
+--------+----------+--------+
|day |date |week_num|
+--------+----------+--------+
|20220101|2022-01-01|52 |
|20220102|2022-01-02|52 |
|20220103|2022-01-03|1 |
|20220104|2022-01-04|1 |
|20220105|2022-01-05|1 |
|20220106|2022-01-06|1 |
|20220107|2022-01-07|1 |
|20220108|2022-01-08|1 |
|20220109|2022-01-09|1 |
|20220110|2022-01-10|2 |
|20220111|2022-01-11|2 |
|20220112|2022-01-12|2 |
+--------+----------+--------+
*/
spark.stop
}
I am surprised to find that the week number for 20220101 is 52; it is the first day of 2022, so I would expect it to be 1.
I investigated the source code of weekofyear and found that it uses the following code to create the Calendar instance, which produces the result above:
@transient private lazy val c = {
  val c = Calendar.getInstance(DateTimeUtils.getTimeZone("UTC"))
  c.setFirstDayOfWeek(Calendar.MONDAY)  // weeks start on Monday
  c.setMinimalDaysInFirstWeek(4)        // week 1 must contain at least 4 days of the new year (the ISO-8601 rule)
  c
}
I would like to ask why Spark SQL treats the first few days of the year in this way.
As a comparison, I used the following Oracle SQL to get the week number, and it gives me 1:
select to_number(to_char(to_date('01/01/2022','MM/DD/YYYY'),'WW')) from dual
In Hive, the result is the same as in Spark SQL.

I will post my findings here:
Spark SQL and Hive follow the ISO-8601 standard when calculating the week number of the year for a given date. Under ISO-8601, week 1 is the first week containing at least 4 days of the new year (equivalently, the week containing the first Thursday); 2022-01-01 falls on a Saturday, so it still belongs to week 52 of 2021. Oracle's 'WW' format, by contrast, simply counts weeks from January 1st, which is why it returns 1; its 'IW' format would return the ISO week.
One point to note: Spark SQL internally uses the java.util.Calendar API to do the work. Java 8's java.time API natively supports the ISO-8601 standard, so with java.time we don't need the c.setMinimalDaysInFirstWeek(4) trick.
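For illustration, a minimal java.time sketch (plain Scala, no Spark needed) that reproduces the same ISO-8601 week numbers:
import java.time.LocalDate
import java.time.temporal.IsoFields
// ISO-8601 week numbering out of the box, no setMinimalDaysInFirstWeek(4) trick required
LocalDate.of(2022, 1, 1).get(IsoFields.WEEK_OF_WEEK_BASED_YEAR) // 52 -- Saturday, still in the last ISO week of 2021
LocalDate.of(2022, 1, 3).get(IsoFields.WEEK_OF_WEEK_BASED_YEAR) // 1  -- the first Monday of 2022 starts ISO week 1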

On Spark 3.0 you can use the EXTRACT function. A few examples:
> SELECT extract(YEAR FROM TIMESTAMP '2019-08-12 01:00:00.123456');
2019
> SELECT extract(week FROM timestamp'2019-08-12 01:00:00.123456');
33
> SELECT extract(doy FROM DATE'2019-08-12');
224
> SELECT extract(SECONDS FROM timestamp'2019-10-01 00:00:01.000001');
1.000001
> SELECT extract(days FROM interval 1 year 10 months 5 days);
5
> SELECT extract(seconds FROM interval 5 hours 30 seconds 1 milliseconds 1 microseconds);
30.001001
Documentation here
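Applied to the date from the original question, extract(week ...) follows the same ISO-8601 rule as weekofyear, so both report 52 for 2022-01-01. A small sketch, assuming a running SparkSession on Spark 3.0+:
spark.sql("SELECT weekofyear(DATE'2022-01-01') AS week_of_year, extract(week FROM DATE'2022-01-01') AS extract_week").show()
// both columns return 52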

Related

KQL join two tables with different TimeStamp

I'm working with KQL and trying to join two tables on a timestamp field. The problem is that they have different values when it comes to seconds.
The table "TableToJoin" ingests a record every minute (so the seconds are 00), while the MeasureTime column I made has different seconds depending on when I hit the run button (it starts counting 36h back from now).
Do you know a method I could use to fix this?
I paste my code below:
range MeasureTime from ago(36h) to now() step(10m)
| join kind=rightouter
(TableToJoin| where TagName == 'TagName') on $left.MeasureTime == $right.Timestamp | take 10
TableToJoin TimeStamp:
2021-11-01T14:09:00Z
2021-11-01T14:08:00Z
2021-11-01T14:06:00Z
2021-11-01T14:05:00Z
2021-11-01T14:04:00Z
2021-11-01T14:03:00Z
2021-11-01T14:02:00Z
2021-11-01T14:01:00Z
2021-11-01T14:00:00Z
MeasureTime TimeStamp:
2021-11-01T13:59:20.5230363Z
2021-11-01T14:00:20.5230363Z
2021-11-01T14:01:20.5230363Z
2021-11-01T14:02:20.5230363Z
2021-11-01T14:03:20.5230363Z
2021-11-01T14:04:20.5230363Z
2021-11-01T14:05:20.5230363Z
2021-11-01T14:06:20.5230363Z
Thanks in advance
You can use the bin() function to "round" the timestamp. For example:
range MeasureTime from ago(36h) to now() step(10m)
| extend MeasureTime = bin(MeasureTime, 1m) // truncate to the minute so it lines up with the minute-aligned TimeStamp values
| join kind=rightouter (
TableToJoin| where TagName == 'TagName'
) on $left.MeasureTime == $right.Timestamp | take 10

Getting latest date in a partition by year / month / day using SparkSQL

I am trying to incrementally transform new partitions in a source table into a new table using Spark SQL. The data in both the source and target are partitioned as follows: /data/year=YYYY/month=MM/day=DD/. I was initially just going to select the MAX of year, month and day to get the newest partition, but that is clearly wrong. Is there a good way to do this?
If I construct a date and take the max, like MAX(CONCAT(year, '-', month, '-', day)::date), this would be quite inefficient, right? Because it would need to scan all the data just to find the newest partition.
Try below to get the latest partition without reading data at all, only metadata:
spark.sql("show partitions <table>").agg(max('partition)).show
You can use the result of show partitions, which is more efficient because it only hits the metastore. However, you can't just apply max to the string value there; you need to construct the date first and then take the max.
Here's a sample:
from pyspark.sql import functions as F
df = sqlContext.sql("show partitions <table>")
df.show(10, False)
date = F.to_date(F.regexp_replace(F.regexp_replace("partition", "[a-z=]", ""), "/", "-"))
df.select(F.max(date).alias("max_date")).show()
Input Values:
+------------------------+
|partition |
+------------------------+
|year=2019/month=11/day=5|
|year=2019/month=9/day=5 |
+------------------------+
Result:
+----------+
| max_date|
+----------+
|2019-11-05|
+----------+
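For reference, a rough Scala equivalent of the same idea (the table name my_table is a placeholder):
import org.apache.spark.sql.functions.{col, max, regexp_replace, to_date}
// "year=2019/month=11/day=5" -> "2019-11-5" -> date, then take the max
val parts = spark.sql("show partitions my_table")
val asDate = to_date(regexp_replace(regexp_replace(col("partition"), "[a-z=]", ""), "/", "-"))
parts.select(max(asDate).alias("max_date")).show()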

How to use SparkSQL to select rows in Spark DF based on multiple conditions

I am relatively new to pyspark and I have a Spark dataframe with a date column "Issue_Date". The "Issue_Date" column contains several dates from 1970-2060 (due to errors). I created a temp table from the dataframe and have been able to filter the data from year 2018. I would also like to include the data from year 2019 (i.e., multiple conditions). Is there a way to do so? I've tried many combinations but couldn't get it to work. Any form of help is appreciated, thank you.
# Filter data from 2018
sparkdf3.createOrReplaceTempView("table_view")
sparkdf4 = spark.sql("select * from table_view where year(to_date(cast(unix_timestamp(Issue_Date,'MM/dd/yyyy') as timestamp))) = 2018")
sparkdf4.count()
Did you try using year(Issue_Date) >= 2018? For example:
sparkdf4 = spark.sql("select * from table_view where year(to_date(cast(unix_timestamp(Issue_Date,'MM/dd/yyyy') as timestamp))) >= 2018")
If your column has errors and you want to specify an explicit set of years, you can use year IN (2018, 2019):
sparkdf4 = spark.sql("select * from table_view where year(to_date(cast(unix_timestamp(Issue_Date,'MM/dd/yyyy') as timestamp))) in (2018, 2019)")

Generating monthly timestamps between two dates in pyspark dataframe

I have a DataFrame with a "date" column and I'm trying to generate a new DataFrame with all monthly timestamps between the min and max dates in the "date" column.
One possible solution is below:
month_step = 31 * 60 * 60 * 24

min_date, max_date = df.select(min_("date").cast("long"), max_("date").cast("long")).first()

df_ts = spark.range(
    (min_date / month_step) * month_step,
    ((max_date / month_step) + 1) * month_step,
    month_step
).select(col("id").cast("timestamp").alias("yearmonth"))

df_formatted_ts = df_ts.withColumn(
    "yearmonth",
    f.concat(f.year("yearmonth"), f.lit('-'), format_string("%02d", f.month("yearmonth")))
).select('yearmonth')

df_formatted_ts.orderBy(asc('yearmonth')).show(150, False)
The problem is that I took 31 days as the month_step, which is not really correct because some months have 30 or even 28 days. Is it possible to make this more precise?
Just as a note: later I only need the year and month values, so I will ignore the day and time. But because I'm generating timestamps over quite a big date range (2001 to 2018), the timestamps drift.
That's why some months are sometimes skipped. For example, this snapshot is missing 2010-02:
|2010-01 |
|2010-03 |
|2010-04 |
|2010-05 |
|2010-06 |
|2010-07 |
I checked and there are just 3 months which were skipped from 2001 through 2018.
Suppose you had the following DataFrame:
data = [("2000-01-01","2002-12-01")]
df = spark.createDataFrame(data, ["minDate", "maxDate"])
df.show()
#+----------+----------+
#| minDate| maxDate|
#+----------+----------+
#|2000-01-01|2002-12-01|
#+----------+----------+
You can add a column date with all of the months in between minDate and maxDate, by following the same approach as my answer to this question.
Just replace pyspark.sql.functions.datediff with pyspark.sql.functions.months_between, and use add_months instead of date_add:
import pyspark.sql.functions as f

df.withColumn("monthsDiff", f.months_between("maxDate", "minDate"))\
    .withColumn("repeat", f.expr("split(repeat(',', monthsDiff), ',')"))\
    .select("*", f.posexplode("repeat").alias("date", "val"))\
    .withColumn("date", f.expr("add_months(minDate, date)"))\
    .select('date')\
    .show(n=50)
#+----------+
#| date|
#+----------+
#|2000-01-01|
#|2000-02-01|
#|2000-03-01|
#|2000-04-01|
# ...skipping some rows...
#|2002-10-01|
#|2002-11-01|
#|2002-12-01|
#+----------+
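As a side note, on Spark 2.4+ the built-in sequence function can generate the month boundaries directly. A hedged Scala sketch, assuming an analogous DataFrame df with minDate/maxDate string columns:
// sequence() expands a date range with a fixed interval step (here one month)
val months = df.selectExpr("explode(sequence(to_date(minDate), to_date(maxDate), interval 1 month)) AS date")
months.show() // 2000-01-01, 2000-02-01, ..., 2002-12-01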

Using Dataframe instead of spark sql for data analysis

Below is the sample Spark SQL I wrote to get the count of males and females enrolled in an agency. I used SQL to generate the output.
Is there a way to do a similar thing using the DataFrame API only, not SQL?
val districtWiseGenderCountDF = hiveContext.sql("""
| SELECT District,
| count(CASE WHEN Gender='M' THEN 1 END) as male_count,
| count(CASE WHEN Gender='F' THEN 1 END) as FEMALE_count
| FROM agency_enrollment
| GROUP BY District
| ORDER BY male_count DESC, FEMALE_count DESC
| LIMIT 10""".stripMargin)
Starting with Spark 1.6 you can use pivot + group by to achieve what you'd like.
Without sample data (and without a Spark > 1.5 environment at hand), here's a solution that should work (not tested):
val df = hiveContext.table("agency_enrollment")
df.groupBy("District").pivot("Gender").count()
see How to pivot DataFrame? for a generic example
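Alternatively, the original query maps directly onto conditional aggregation with the DataFrame API. A sketch using the column names from the question:
import org.apache.spark.sql.functions.{col, count, when}
val districtWiseGenderCountDF = hiveContext.table("agency_enrollment")
  .groupBy("District")
  .agg(
    count(when(col("Gender") === "M", 1)).as("male_count"),   // count() ignores nulls, so only matching rows are counted
    count(when(col("Gender") === "F", 1)).as("FEMALE_count")
  )
  .orderBy(col("male_count").desc, col("FEMALE_count").desc)
  .limit(10)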
