Given a date, I create a column with the ISO 8601 week date format:
from pyspark.sql import functions as F
df = spark.createDataFrame([('2019-03-18',), ('2019-12-30',), ('2022-01-03',), ('2022-01-10',)], ['date_col'])
df = df.withColumn(
    'iso_from_date',
    F.concat_ws(
        '-',
        F.expr('extract(yearofweek from date_col)'),  # ISO week-numbering year
        F.lpad(F.weekofyear('date_col'), 3, 'W0'),    # 'W' + zero-padded ISO week number
        F.expr('weekday(date_col) + 1')               # ISO day of week (Monday = 1)
    )
)
df.show()
# +----------+-------------+
# |  date_col|iso_from_date|
# +----------+-------------+
# |2019-03-18|   2019-W12-1|
# |2019-12-30|   2020-W01-1|
# |2022-01-03|   2022-W01-1|
# |2022-01-10|   2022-W02-1|
# +----------+-------------+
Using Spark 3, how can I get the date back from the ISO 8601 week date?
I tried the following, but it is both incorrect and relies on the LEGACY configuration, which I would like to avoid.
spark.conf.set("spark.sql.legacy.timeParserPolicy", "LEGACY")
df.withColumn('date_from_iso', F.to_date('iso_from_date', "YYYY-'W'ww-uu")).show()
# +----------+-------------+-------------+
# |  date_col|iso_from_date|date_from_iso|
# +----------+-------------+-------------+
# |2019-03-18|   2019-W12-1|   2019-03-18|
# |2019-12-30|   2020-W01-1|   2019-12-30|
# |2022-01-03|   2022-W01-1|   2021-12-27|
# |2022-01-10|   2022-W02-1|   2022-01-03|
# +----------+-------------+-------------+
I am aware that I could create a udf, which works:
import datetime
@F.udf('date')
def iso_to_date(iso_date):
    return datetime.datetime.strptime(iso_date, '%G-W%V-%u')
df.withColumn('date_from_iso', iso_to_date('iso_from_date')).show()
But I am looking for a more efficient option. The ideal option should not use LEGACY configuration and be translatable to SQL or Scala (no inefficient udf).
In PySpark, I have found an option nicer than a plain udf. It uses pandas_udf, which is vectorized (more efficient):
import pandas as pd
@F.pandas_udf('date')
def iso_to_date(iso_date: pd.Series) -> pd.Series:
    return pd.to_datetime(iso_date, format='%G-W%V-%u')
df.withColumn('date_from_iso', iso_to_date('iso_from_date')).show()
# +----------+-------------+-------------+
# |  date_col|iso_from_date|date_from_iso|
# +----------+-------------+-------------+
# |2019-03-18|   2019-W12-1|   2019-03-18|
# |2019-12-30|   2020-W01-1|   2019-12-30|
# |2022-01-03|   2022-W01-1|   2022-01-03|
# |2022-01-10|   2022-W02-1|   2022-01-10|
# +----------+-------------+-------------+
It works in Spark 3 without the LEGACY configuration. So it's acceptable.
However, there is room for improvement, as this option is not transferable to SQL or Scala.
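A possible UDF-free sketch that should also translate to SQL or Scala (an assumption on my part, not part of the original answers): it relies on the ISO 8601 rule that week 1 is always the week containing January 4th, and reuses the iso_from_date column built above.
from pyspark.sql import functions as F
# parse the three parts of 'YYYY-Www-d', then anchor on the Monday of ISO week 1,
# i.e. the Monday of the week containing January 4th
df = df.withColumn('y', F.substring('iso_from_date', 1, 4).cast('int')) \
       .withColumn('w', F.substring('iso_from_date', 7, 2).cast('int')) \
       .withColumn('d', F.substring('iso_from_date', 10, 1).cast('int')) \
       .withColumn('week1_monday', F.expr('date_sub(make_date(y, 1, 4), weekday(make_date(y, 1, 4)))')) \
       .withColumn('date_from_iso', F.expr('date_add(week1_monday, (w - 1) * 7 + (d - 1))')) \
       .drop('y', 'w', 'd', 'week1_monday')
Because everything here is a built-in Spark SQL function (substring, make_date, weekday, date_sub, date_add), the same expressions should carry over to a SQL query or the Scala API.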
Using PySpark, I need to parse a single dataframe column into two columns.
Input data:
+-------------------------------+
|file name                      |
+-------------------------------+
|/level1/level2/level3/file1.ext|
|/level1/file1000.ext           |
|/level1/level2/file20.ext      |
+-------------------------------+
Output:
+------------+----------------------+
|file name   |path                  |
+------------+----------------------+
|file1.ext   |/level1/level2/level3/|
|file1000.ext|/level1/              |
|file20.ext  |/level1/level2/       |
+------------+----------------------+
I know I could use substring with hard coded positions, but this is not a good case for hard coding as the length of the file name values may change from row to row, as shown in the example.
However, I know that I need to break the input string after the last slash (/). This is a rule to help avoid hard coding a specific position for splitting the input string.
There are several ways to do it with regex functions, or with the split method.
from pyspark.sql.functions import split, element_at, regexp_extract
# sample dataframe with a single string column "raw" holding the full paths
df = spark.createDataFrame([('/level1/level2/level3/file1.ext',), ('/level1/file1000.ext',), ('/level1/level2/file20.ext',)], ['raw'])
df \
    .withColumn("file_name", element_at(split("raw", "/"), -1)) \
    .withColumn("file_name2", regexp_extract("raw", "(?<=/)[^/]+$", 0)) \
    .withColumn("path", regexp_extract("raw", "^.*/", 0)) \
    .show(truncate=False)
+-------------------------------+------------+------------+----------------------+
|raw                            |file_name   |file_name2  |path                  |
+-------------------------------+------------+------------+----------------------+
|/level1/level2/level3/file1.ext|file1.ext   |file1.ext   |/level1/level2/level3/|
|/level1/file1000.ext           |file1000.ext|file1000.ext|/level1/              |
|/level1/level2/file20.ext      |file20.ext  |file20.ext  |/level1/level2/       |
+-------------------------------+------------+------------+----------------------+
A couple of other options:
from pyspark.sql import functions as F
df = spark.createDataFrame(
    [('/level1/level2/level3/file1.ext',),
     ('/level1/file1000.ext',),
     ('/level1/level2/file20.ext',)],
    ['file_name']
)
df = df.withColumn('file', F.substring_index('file_name', '/', -1))
df = df.withColumn('path', F.expr('left(file_name, length(file_name) - length(file))'))
df.show(truncate=0)
# +-------------------------------+------------+----------------------+
# |file_name                      |file        |path                  |
# +-------------------------------+------------+----------------------+
# |/level1/level2/level3/file1.ext|file1.ext   |/level1/level2/level3/|
# |/level1/file1000.ext           |file1000.ext|/level1/              |
# |/level1/level2/file20.ext      |file20.ext  |/level1/level2/       |
# +-------------------------------+------------+----------------------+
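The same two expressions can also be written as plain Spark SQL, for example against a temporary view (a small sketch; the view name files is just an illustrative choice):
df.createOrReplaceTempView('files')
spark.sql("""
    SELECT file_name,
           substring_index(file_name, '/', -1) AS file,
           left(file_name, length(file_name) - length(substring_index(file_name, '/', -1))) AS path
    FROM files
""").show(truncate=False)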
I am working on the CSV file below using PySpark (on Databricks), but I am not sure how to get the total duration of the scan events. Assume only one scan happens at a time.
+---+--------------------------+-------+-----+
|   |timestamp                 |event  |value|
+---+--------------------------+-------+-----+
|1  |2020-11-17_19:15:33.438102|scan   |start|
|2  |2020-11-17_19:18:33.433002|scan   |end  |
|3  |2020-11-17_20:05:21.538125|scan   |start|
|4  |2020-11-17_20:13:08.528102|scan   |end  |
|5  |2020-11-17_21:23:19.635104|pending|start|
|6  |2020-11-17_21:33:26.572123|pending|end  |
|7  |2020-11-17_22:05:29.738105|pending|start|
|...|...                       |...    |...  |
+---+--------------------------+-------+-----+
Below are some of my thoughts:
# first get scan start time
scan_start = df[(df['event'] == 'scan') & (df['value'] == 'start')]
scan_start_time = scan_start['timestamp']
# get scan end time
scan_end = df[(df['event'] == 'scan') & (df['value'] == 'end')]
scan_end_time = scan_end['timestamp']
# the duration of each scan
each_duration = scan_end_time.values - scan_start_time.values
# total duration
total_duration_ns = each_duration.sum()
But, I am not sure how to do the calculation in PySpark.
First, do we need to define a schema so that the 'timestamp' column is read with a timestamp type? (Assume all the columns (timestamp, event, value) are currently read as strings.)
On the other hand, assume we have many (1000+) CSV files similar to the one above stored in Databricks. How can we write reusable code that processes all of them and, eventually, builds one table storing the total scan_duration?
Can someone please share with me some code in PySpark?
Thank you so much
This code will compute for each row the difference between the current timestamp and the timestamp in the previous row.
I'm creating a dataframe for reproducibility.
from pyspark.sql import SparkSession, Window
from pyspark.sql.types import *
from pyspark.sql.functions import regexp_replace, col, lag
import pandas as pd
spark = SparkSession.builder.appName("DataFrame").getOrCreate()
data = pd.DataFrame(
    {
        "timestamp": ["2020-11-17_19:15:33.438102", "2020-11-17_19:18:33.433002", "2020-11-17_20:05:21.538125", "2020-11-17_20:13:08.528102"],
        "event": ["scan", "scan", "scan", "scan"],
        "value": ["start", "end", "start", "end"]
    }
)
df = spark.createDataFrame(data)
df.show()
# +--------------------+-----+-----+
# |           timestamp|event|value|
# +--------------------+-----+-----+
# |2020-11-17_19:15:...| scan|start|
# |2020-11-17_19:18:...| scan|  end|
# |2020-11-17_20:05:...| scan|start|
# |2020-11-17_20:13:...| scan|  end|
# +--------------------+-----+-----+
Convert "timestamp" column to TimestampType() to be able to compute differences:
df=df.withColumn("timestamp",
regexp_replace(col("timestamp"),"_"," "))
df.show(truncate=False)
# +--------------------------+-----+-----+
# |timestamp                 |event|value|
# +--------------------------+-----+-----+
# |2020-11-17 19:15:33.438102|scan |start|
# |2020-11-17 19:18:33.433002|scan |end  |
# |2020-11-17 20:05:21.538125|scan |start|
# |2020-11-17 20:13:08.528102|scan |end  |
# +--------------------------+-----+-----+
df = df.withColumn("timestamp",
regexp_replace(col("timestamp"),"_"," ").cast(TimestampType()))
df.dtypes
# [('timestamp', 'timestamp'), ('event', 'string'), ('value', 'string')]
Use the pyspark.sql.functions.lag function, which returns the value of the previous row (offset=1 by default).
See also How to calculate the difference between rows in PySpark? or Applying a Window function to calculate differences in pySpark.
df.withColumn("lag_previous", col("timestamp").cast("long") - lag('timestamp').over(
Window.orderBy('timestamp')).cast("long")).show(truncate=False)
# WARN WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.
Using a Window without a partition gives a warning.
It is better to partition the dataframe for the Window operation; here I partitioned by event type:
df.withColumn("lag_previous", col("timestamp").cast("long") - lag('timestamp').over(
Window.partitionBy("event").orderBy('timestamp')).cast("long")).show(truncate=False)
# +--------------------------+-----+-----+------------+
# |timestamp                 |event|value|lag_previous|
# +--------------------------+-----+-----+------------+
# |2020-11-17 19:15:33.438102|scan |start|null        |
# |2020-11-17 19:18:33.433002|scan |end  |180         |
# |2020-11-17 20:05:21.538125|scan |start|2808        |
# |2020-11-17 20:13:08.528102|scan |end  |467         |
# +--------------------------+-----+-----+------------+
From this table you can keep only the rows with value "end" and sum lag_previous to get the total scan duration.
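As a minimal sketch of that last step (assuming the dataframe with the lag_previous column from the previous snippet was saved as df_with_lag instead of only being shown):
from pyspark.sql import functions as F
# keep the "end" rows, whose lag_previous holds each scan's length in seconds, and sum them
total_scan = df_with_lag \
    .filter((col("event") == "scan") & (col("value") == "end")) \
    .agg(F.sum("lag_previous").alias("total_scan_duration_seconds"))
total_scan.show()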
In ISO 8601, durations are written in a format like PT5M (5 minutes) or PT2H5M (2 hours 5 minutes). I have a JSON file that contains values in such a format. I wanted to know if Spark can extract the duration in minutes. I tried to read it as "DateType" and used the "minutes" function to get the minutes, but it returned null values.
Example json
{"name": "Fennel Mushrooms","cookTime":"PT30M"}
Currently, I am reading it as a string and using the "regexp_extract" function. I would like to know if there is a more efficient way.
https://www.digi.com/resources/documentation/digidocs/90001437-13/reference/r_iso_8601_duration_format.htm
Spark does not provide a way to convert an ISO 8601 duration into an interval, and neither does timedelta in Python's datetime library.
However, pd.Timedelta can parse an ISO 8601 duration into a time delta. To support a wider range of ISO 8601 durations, we can wrap pd.Timedelta in a pandas_udf:
from pyspark.sql import functions as F
import pandas as pd
df = spark.createDataFrame([("PT5M", ), ("PT50M", ), ("PT2H5M", ), ], ("duration", ))
#F.pandas_udf("int")
def parse_iso8601_duration(str_duration: pd.Series) -> pd.Series:
return str_duration.apply(lambda duration: (pd.Timedelta(duration).seconds / 60))
df.withColumn("duration_in_minutes", parse_iso8601_duration(F.col("duration"))).show()
Output
+--------+-------------------+
|duration|duration_in_minutes|
+--------+-------------------+
|    PT5M|                  5|
|   PT50M|                 50|
|  PT2H5M|                125|
+--------+-------------------+
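One caveat: pd.Timedelta(...).seconds only returns the seconds component and wraps at 24 hours, so for durations that may include a day part (such as P1DT2H), a variant of the same pandas_udf could use total_seconds() instead (a sketch, keeping the column names from above):
@F.pandas_udf("int")
def parse_iso8601_duration_total(str_duration: pd.Series) -> pd.Series:
    # total_seconds() also counts the days component, unlike .seconds
    return str_duration.apply(lambda duration: int(pd.Timedelta(duration).total_seconds() // 60))
df.withColumn("duration_in_minutes", parse_iso8601_duration_total(F.col("duration"))).show()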
I have a dataframe with time-series data and I am trying to add many moving average columns to it, with windows of various ranges. When I do this column by column, it is pretty slow.
I have tried simply chaining the withColumn calls until I have all of them.
Pseudo code:
import pyspark.sql.functions as pysparkSqlFunctions
from pyspark.sql import Window
from pyspark.sql.functions import col
## working from a data frame with 12 columns:
## - key as a String
## - time as a DateTime
## - col_{1:10} as numeric values
window_1h = Window.partitionBy("key") \
                  .orderBy(col("time").cast("long")) \
                  .rangeBetween(-3600, 0)
window_2h = Window.partitionBy("key") \
                  .orderBy(col("time").cast("long")) \
                  .rangeBetween(-7200, 0)
df = df.withColumn("col1_1h", pysparkSqlFunctions.avg("col_1").over(window_1h))
df = df.withColumn("col1_2h", pysparkSqlFunctions.avg("col_1").over(window_2h))
df = df.withColumn("col2_1h", pysparkSqlFunctions.avg("col_2").over(window_1h))
df = df.withColumn("col2_2h", pysparkSqlFunctions.avg("col_2").over(window_2h))
What I would like is the ability to add all 4 columns (or many more) in one call, hopefully traversing the data only once for better performance.
I prefer to import the functions library as F as it looks neater and it is the standard alias used in the official Spark documentation.
The star string, '*', captures all the current columns of the dataframe. Alternatively, you could replace the star string with *df.columns; here the star unpacks the list into separate parameters for the select method.
from pyspark.sql import functions as F
df = df.select(
"*",
F.avg("col_1").over(window_1h).alias("col1_1h"),
F.avg("col_1").over(window_2h).alias("col1_2h"),
F.avg("col_2").over(window_1h).alias("col2_1h"),
F.avg("col_2").over(window_1h).alias("col2_1h"),
)
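Taking this one step further, the select list can be built programmatically, so any number of column/window combinations is still added in a single pass over the data (a sketch, assuming the col_1..col_10 naming and the two windows from the pseudo code above):
from pyspark.sql import functions as F
windows = {'1h': window_1h, '2h': window_2h}
# one aliased average expression per (column, window) pair
avg_cols = [
    F.avg(f'col_{i}').over(w).alias(f'col{i}_{suffix}')
    for i in range(1, 11)
    for suffix, w in windows.items()
]
df = df.select('*', *avg_cols)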
I'm trying to format my timestamp column to include milliseconds without success. How can I format my time to look like this - 2019-01-04 11:09:21.152 ?
I have looked at the documentation, following the SimpleDateFormat patterns, which the pyspark docs say are used by the to_timestamp function.
This is my dataframe.
+--------------------------+
|updated_date              |
+--------------------------+
|2019-01-04 11:09:21.152815|
+--------------------------+
I used the millisecond format without any success, as shown below:
>>> df.select('updated_date').withColumn("updated_date_col2",
...     to_timestamp("updated_date", "YYYY-MM-dd HH:mm:ss:SSS")).show(1, False)
+--------------------------+-------------------+
|updated_date              |updated_date_col2  |
+--------------------------+-------------------+
|2019-01-04 11:09:21.152815|2019-01-04 11:09:21|
+--------------------------+-------------------+
I expect updated_date_col2 to be formatted as 2019-01-04 11:09:21.152
I think you can use UDF and Python's standard datetime module as below.
import datetime
from pyspark.sql.functions import udf
from pyspark.sql.types import TimestampType
def _to_timestamp(s):
    return datetime.datetime.strptime(s, '%Y-%m-%d %H:%M:%S.%f')
udf_to_timestamp = udf(_to_timestamp, TimestampType())
df.select('updated_date').withColumn("updated_date_col2", udf_to_timestamp("updated_date")).show(1,False)
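A UDF-free alternative sketch: a plain cast to timestamp should keep the fractional seconds, and date_format can then render the value with exactly three fraction digits (column names as in the question; the truncated string output is an assumption on my part):
from pyspark.sql import functions as F
df.select('updated_date') \
  .withColumn("updated_ts", F.col("updated_date").cast("timestamp")) \
  .withColumn("updated_date_col2", F.date_format("updated_ts", "yyyy-MM-dd HH:mm:ss.SSS")) \
  .show(1, False)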
This is not a solution with to_timestamp, but you can easily keep your column in a time format.
The following code is an example of converting a numerical Unix timestamp (seconds, possibly with a fractional part) to a timestamp column.
from datetime import datetime
ts = datetime.now().timestamp() # ex) ts = 1547521021.83301 (seconds since the epoch)
df = spark.createDataFrame([(1, ts)], ['obs', 'time'])
df = df.withColumn('time', df.time.cast("timestamp"))
df.show(1, False)
+---+--------------------------+
|obs|time                      |
+---+--------------------------+
|1  |2019-01-15 12:15:49.565263|
+---+--------------------------+
In JS, new Date().getTime() or Date.now() gives a numerical timestamp in milliseconds, while datetime.datetime.now().timestamp() in Python gives seconds with a fractional part; divide the JS value by 1000 before casting it to a timestamp.
The reason is that pyspark's to_timestamp parses only up to seconds, while TimestampType can hold fractional seconds.
The following workaround may work:
If the timestamp pattern contains S, invoke a UDF to get a string like 'INTERVAL 256 MILLISECONDS' to use in an expression:
ts_pattern = "YYYY-MM-dd HH:mm:ss:SSS"
my_col_name = "time_with_ms"
# get the time till seconds
df = df.withColumn(my_col_name, to_timestamp(df["updated_date_col2"], ts_pattern))
# add milliseconds as interval
if 'S' in ts_pattern:
    df = df.withColumn(my_col_name, df[my_col_name] + expr("INTERVAL 256 MILLISECONDS"))
To get INTERVAL 256 MILLISECONDS we may use a Java UDF:
df = df.withColumn(my_col_name, df[my_col_name] + expr(getIntervalStringUDF(df[my_col_name], ts_pattern)))
Inside UDF: getIntervalStringUDF(String timeString, String pattern)
Use SimpleDateFormat to parse date according to pattern
return formatted date as string using pattern "'INTERVAL 'SSS' MILLISECONDS'"
return 'INTERVAL 0 MILLISECONDS' on parse/format exceptions
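For illustration only, here is a minimal Python sketch of the UDF described above (the original suggests implementing it in Java with SimpleDateFormat; the fixed '%Y-%m-%d %H:%M:%S.%f' pattern here is an assumption):
import datetime
from pyspark.sql.functions import udf
@udf('string')
def get_interval_string(time_string):
    # pull the milliseconds out of the raw string and wrap them in an INTERVAL literal,
    # falling back to 0 milliseconds on parse errors, as described above
    try:
        millis = datetime.datetime.strptime(time_string, '%Y-%m-%d %H:%M:%S.%f').microsecond // 1000
        return 'INTERVAL {} MILLISECONDS'.format(millis)
    except (ValueError, TypeError):
        return 'INTERVAL 0 MILLISECONDS'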