Parse Date Format - apache-spark

I have the following DataFrame containing timestamps in the format yyyyMMddTHH:mm:ss followed by a UTC offset:
Data Preparation
sparkDF = sql.createDataFrame([
    ("20201021T00:00:00+0530",),
    ("20211011T00:00:00+0530",),
    ("20200212T00:00:00+0300",),
    ("20211021T00:00:00+0530",),
    ("20211021T00:00:00+0900",),
    ("20211021T00:00:00-0500",)
], ['timestamp'])
sparkDF.show(truncate=False)
+----------------------+
|timestamp |
+----------------------+
|20201021T00:00:00+0530|
|20211011T00:00:00+0530|
|20200212T00:00:00+0300|
|20211021T00:00:00+0530|
|20211021T00:00:00+0900|
|20211021T00:00:00-0500|
+----------------------+
I'm aware of the date format needed to parse and convert the values to DateType.
Timestamp Parsed
sparkDF.select(F.to_date(F.col('timestamp'),"yyyyMMdd'T'HH:mm:ss+0530").alias('timestamp_parsed')).show()
+----------------+
|timestamp_parsed|
+----------------+
| 2020-10-21|
| 2021-10-11|
| null|
| 2021-10-21|
| null|
| null|
+----------------+
As you can see, it's specific to +0530 strings. I'm aware that I can use multiple patterns and coalesce the first non-null value:
Multiple Patterns & Coalesce
sparkDF.withColumn('p1',F.to_date(F.col('timestamp'),"yyyyMMdd'T'HH:mm:ss+0530"))\
.withColumn('p2',F.to_date(F.col('timestamp'),"yyyyMMdd'T'HH:mm:ss+0900"))\
.withColumn('p3',F.to_date(F.col('timestamp'),"yyyyMMdd'T'HH:mm:ss-0500"))\
.withColumn('p4',F.to_date(F.col('timestamp'),"yyyyMMdd'T'HH:mm:ss+0300"))\
.withColumn('timestamp_parsed',F.coalesce(F.col('p1'),F.col('p2'),F.col('p3'),F.col('p4')))\
.drop(*['p1','p2','p3','p4'])\
.show(truncate=False)
+----------------------+----------------+
|timestamp |timestamp_parsed|
+----------------------+----------------+
|20201021T00:00:00+0530|2020-10-21 |
|20211011T00:00:00+0530|2021-10-11 |
|20200212T00:00:00+0300|2020-02-12 |
|20211021T00:00:00+0530|2021-10-21 |
|20211021T00:00:00+0900|2021-10-21 |
|20211021T00:00:00-0500|2021-10-21 |
+----------------------+----------------+
Is there a better way to accomplish this? There may be a bunch of other UTC offsets within the data source. Is there a standard offset pattern available within Spark to parse all the cases?

I think you have got the 2nd argument of your to_date function wrong, which is causing the null values in your output.
The +0530 in your timestamp is the UTC offset, which just denotes how many hours and minutes the timestamp is ahead of (for +) or behind (for -) UTC.
Please refer to the response by Basil here: Java / convert ISO-8601 (2010-12-16T13:33:50.513852Z) to Date object. That link has full details on the topic.
To answer your question: if you replace +0530 in the pattern with Z, it should solve your problem.
Here is the Spark code in Scala that I tried and that worked:
val data = Seq("20201021T00:00:00+0530",
"20211011T00:00:00+0530",
"20200212T00:00:00+0300",
"20211021T00:00:00+0530",
"20211021T00:00:00+0900",
"20211021T00:00:00-0500")
import spark.implicits._
val sparkDF = data.toDF("custom_time")
import org.apache.spark.sql.functions._
val spark_DF2 = sparkDF.withColumn("new_timestamp", to_date($"custom_time", "yyyyMMdd'T'HH:mm:ssZ"))
spark_DF2.show(false)
Running this produces no null values in the output.
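For completeness, a minimal PySpark sketch of the same fix, assuming the sparkDF from the question (note that the parsed day can still shift depending on spark.sql.session.timeZone, as the next answer explains):
from pyspark.sql import functions as F

sparkDF.select(
    F.to_date(F.col('timestamp'), "yyyyMMdd'T'HH:mm:ssZ").alias('timestamp_parsed')
).show()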

You can usually use x, X or Z for the offset pattern, as described on the Spark datetime patterns documentation page. You can then parse your date with the following complete pattern: yyyyMMdd'T'HH:mm:ssxx
However, if you use that kind of offset pattern, your timestamp is first normalized to UTC, meaning timestamps at midnight with a positive offset will be matched to the previous day. For instance, "20201021T00:00:00+0530" will be matched to 2020-10-20 using to_date with the previous pattern.
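A minimal sketch of that behavior, assuming the sparkDF from the question and a session time zone of UTC (the exact day depends on spark.sql.session.timeZone):
from pyspark.sql import functions as F

sparkDF.select(
    'timestamp',
    F.to_date(F.col('timestamp'), "yyyyMMdd'T'HH:mm:ssxx").alias('utc_day')
).show(truncate=False)
# 20201021T00:00:00+0530 -> 2020-10-20 (midnight +0530 is 18:30 UTC on the previous day)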
If you want to keep the displayed date as a date, ignoring the offset, you should first extract the date string from the complete timestamp string using the regexp_extract function, then apply to_date.
If you take your example "20201021T00:00:00+0530", what you want to extract with a regexp is the 20201021 part, and then apply to_date to it. You can do it with the following pattern: ^(\\d+). If you're interested, you can find how to build other patterns in Java's Pattern documentation.
So your code should be:
from pyspark.sql import functions as F
sparkDF.select(
    F.to_date(
        F.regexp_extract(F.col('timestamp'), '^(\\d+)', 0), 'yyyyMMdd'
    ).alias('timestamp_parsed')
).show()
And with your input you will get:
+----------------+
|timestamp_parsed|
+----------------+
|2020-10-21 |
|2021-10-11 |
|2020-02-12 |
|2021-10-21 |
|2021-10-21 |
|2021-10-21 |
+----------------+

You can create a UDF in Spark and use it. Below is the code in Scala.
import spark.implicits._
//just to create the dataset for the example you have given
val data = Seq(
("20201021T00:00:00+0530"),
("20211011T00:00:00+0530"),
("20200212T00:00:00+0300"),
("20211021T00:00:00+0530"),
("20211021T00:00:00+0900"),
("20211021T00:00:00-0500"))
val dataset = data.toDF("timestamp")
import java.time.format.DateTimeFormatter
import java.time.{OffsetDateTime, ZoneOffset}
import org.apache.spark.sql.functions
val udfToDateUTC = functions.udf((epochMilliUTC: String) => {
  // parse the offset-aware string and shift it to the same instant in UTC
  val formatter = DateTimeFormatter.ofPattern("yyyyMMdd'T'HH:mm:ssZ")
  val res = OffsetDateTime.parse(epochMilliUTC, formatter).withOffsetSameInstant(ZoneOffset.UTC)
  res.toString()
})
dataset.select(dataset.col("timestamp"),udfToDateUTC(dataset.col("timestamp")).alias("timestamp_parsed")).show(false)
//output
+----------------------+-----------------+
|timestamp |timestamp_parsed |
+----------------------+-----------------+
|20201021T00:00:00+0530|2020-10-20T18:30Z|
|20211011T00:00:00+0530|2021-10-10T18:30Z|
|20200212T00:00:00+0300|2020-02-11T21:00Z|
|20211021T00:00:00+0530|2021-10-20T18:30Z|
|20211021T00:00:00+0900|2021-10-20T15:00Z|
|20211021T00:00:00-0500|2021-10-21T05:00Z|
+----------------------+-----------------+
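For reference, a hedged PySpark sketch of the same idea, reusing the sparkDF from the question (the function name to_utc_iso is just an illustration):
from datetime import datetime, timezone
from pyspark.sql import functions as F

@F.udf('string')
def to_utc_iso(ts):
    # parse the offset-aware string and normalize it to UTC; pass nulls through
    if ts is None:
        return None
    return datetime.strptime(ts, '%Y%m%dT%H:%M:%S%z').astimezone(timezone.utc).isoformat()

sparkDF.select('timestamp', to_utc_iso('timestamp').alias('timestamp_parsed')).show(truncate=False)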

from pyspark.sql.functions import date_format
# format the column as a yyyyMMdd string (df and <column_name> are placeholders)
customer_data = df.select("<column_name>", date_format("<column_name>", 'yyyyMMdd').alias('customer'))

Related

Pyspark - How to decode a column in URL format

Do you know how to decode the 'campaign' column below in PySpark? The records in this column are URL-encoded strings:
+--------------------+------------------------+
|user_id |campaign |
+--------------------+------------------------+
|alskd9239as23093 |MM+%7C+Cons%C3%B3rcios+%|
|lfifsf093039388 |Aquisi%C3%A7%C3%A3o+%7C |
|kasd877191kdsd999 |Aquisi%C3%A7%C3%A3o+%7C |
+--------------------+------------------------+
I know that it is possible to do this with the urllib library in Python. However, my dataset is large and it takes too long to convert it to a pandas dataframe. Do you know how to do this with a Spark DataFrame?
There is no need to convert to an intermediate pandas dataframe; you can use a PySpark user defined function (udf) to unquote the quoted strings:
from urllib.parse import unquote
from pyspark.sql import functions as F
from pyspark.sql.types import StringType
df.withColumn('campaign', F.udf(unquote, StringType())('campaign'))
If there are null values in the campaign column, then you have to do a null check before unquoting the strings:
f = lambda s: unquote(s) if s else s
df.withColumn('campaign', F.udf(f, StringType())('campaign'))
+-----------------+-----------------+
| user_id| campaign|
+-----------------+-----------------+
| alskd9239as23093|MM+|+Consórcios+%|
| lfifsf093039388| Aquisição+||
|kasd877191kdsd999| Aquisição+||
+-----------------+-----------------+
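If the + characters in the data are meant to encode spaces (an assumption; the sample output above keeps them as literal +), urllib.parse.unquote_plus can be swapped in the same way:
from urllib.parse import unquote_plus
from pyspark.sql import functions as F
from pyspark.sql.types import StringType

df.withColumn('campaign', F.udf(lambda s: unquote_plus(s) if s else s, StringType())('campaign'))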

add the day information to a timestamp in a dataframe

I am trying to read a csv file into a dataframe. The Timing cells only contain the hour and minute information and are missing the day information. I would like to read this csv file into a dataframe and transform the timing information into a format like 2021-05-07 04:04:00, i.e., I would like to add the day information. How can I achieve that?
I used the following code, but it seems that PySpark just fills in the day as 1970-01-01, apparently some kind of default.
from pyspark.sql import SparkSession
from pyspark.sql.functions import to_timestamp, col

spark = SparkSession.builder.getOrCreate()
spark.conf.set("spark.sql.legacy.timeParserPolicy", "LEGACY")
df_1 = spark.read.csv('test1.csv', header=True)
df_1 = df_1.withColumn('Timestamp', to_timestamp(col('Timing'), 'HH:mm'))
df_1.show(truncate=False)
And I got the following result:
+-------+-------------------+
| Timing| Timestamp|
+-------+-------------------+
|04:04.0|1970-01-01 04:04:00|
|19:04.0|1970-01-01 19:04:00|
+-------+-------------------+
You can concat a date string before calling to_timestamp:
import pyspark.sql.functions as F
df2 = df_1.withColumn(
    'Timestamp',
    F.to_timestamp(
        F.concat_ws(' ', F.lit('2021-05-07'), 'Timing'),
        'yyyy-MM-dd HH:mm.s'
    )
)
df2.show()
+-------+-------------------+
| Timing| Timestamp|
+-------+-------------------+
|04:04.0|2021-05-07 04:04:00|
|19:04.0|2021-05-07 19:04:00|
+-------+-------------------+
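If the day should come from the run date rather than a hard-coded literal (an assumption, since the question does not say where the day comes from), current_date can be formatted and concatenated the same way, reusing df_1 from the question:
import pyspark.sql.functions as F

df3 = df_1.withColumn(
    'Timestamp',
    F.to_timestamp(
        F.concat_ws(' ', F.date_format(F.current_date(), 'yyyy-MM-dd'), 'Timing'),
        'yyyy-MM-dd HH:mm.s'
    )
)
df3.show()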

PySpark split using regex doesn't work on a dataframe column with string type

I have a PySpark data frame with a string column (URL), and all records look like the following:
ID URL
1 https://app.xyz.com/inboxes/136636/conversations/2686735685
2 https://app.xyz.com/inboxes/136636/conversations/2938415796
3 https://app.drift.com/inboxes/136636/conversations/2938419189
I basically want to extract the number after conversations/ from the URL column into another column, using regex.
I tried the following code, but it doesn't give me the result I want.
df1 = df.withColumn('CONV_ID', split(convo_influ_new['URL'], '(?<=conversations/).*').getItem(0))
Expected:
ID URL CONV_ID
1 https://app.xyz.com/inboxes/136636/conversations/2686735685 2686735685
2 https://app.xyz.com/inboxes/136636/conversations/2938415796 2938415796
3 https://app.drift.com/inboxes/136636/conversations/2938419189 2938419189
Result:
ID URL CONV_ID
1 https://app.xyz.com/inboxes/136636/conversations/2686735685 https://app.xyz.com/inboxes/136636/conversations/2686735685
2 https://app.xyz.com/inboxes/136636/conversations/2938415796 https://app.xyz.com/inboxes/136636/conversations/2938415796
3 https://app.drift.com/inboxes/136636/conversations/2938419189 https://app.drift.com/inboxes/136636/conversations/2938419189
Not sure what's happening here. I tried the regex in different online regex tester tools and it highlights the part I want, but it never works in PySpark. I tried different PySpark functions like F.split, regexp_extract, and regexp_replace, but none of them work.
If your URLs always have that form, you can actually just use substring_index to get the last path element:
import pyspark.sql.functions as F
df1 = df.withColumn("CONV_ID", F.substring_index("URL", "/", -1))
df1.show(truncate=False)
#+---+-------------------------------------------------------------+----------+
#|ID |URL |CONV_ID |
#+---+-------------------------------------------------------------+----------+
#|1 |https://app.xyz.com/inboxes/136636/conversations/2686735685 |2686735685|
#|2 |https://app.xyz.com/inboxes/136636/conversations/2938415796 |2938415796|
#|3 |https://app.drift.com/inboxes/136636/conversations/2938419189|2938419189|
#+---+-------------------------------------------------------------+----------+
You can use regexp_extract instead:
import pyspark.sql.functions as F
df1 = df.withColumn(
    'CONV_ID',
    F.regexp_extract('URL', 'conversations/(.*)', 1)
)
df1.show()
+---+--------------------+----------+
| ID| URL| CONV_ID|
+---+--------------------+----------+
| 1|https://app.xyz.c...|2686735685|
| 2|https://app.xyz.c...|2938415796|
| 3|https://app.drift...|2938419189|
+---+--------------------+----------+
Or if you want to use split, you don't need to specify .*. You just need to specify the pattern used for splitting.
import pyspark.sql.functions as F
df1 = df.withColumn(
    'CONV_ID',
    F.split('URL', '(?<=conversations/)')[1]  # just using 'conversations/' should also be enough
)
df1.show()
+---+--------------------+----------+
| ID| URL| CONV_ID|
+---+--------------------+----------+
| 1|https://app.xyz.c...|2686735685|
| 2|https://app.xyz.c...|2938415796|
| 3|https://app.drift...|2938419189|
+---+--------------------+----------+

Extract Numeric data from the Column in Spark Dataframe

I have a Dataframe with 20 columns and I want to update one particular column (whose data is null) with the data extracted from another column and do some formatting. Below is a sample input
+------------------------+----+
|col1 |col2|
+------------------------+----+
|This_is_111_222_333_test|NULL|
|This_is_111_222_444_test|3296|
|This_is_555_and_666_test|NULL|
|This_is_999_test |NULL|
+------------------------+----+
and my output should be like below
+------------------------+-----------+
|col1 |col2 |
+------------------------+-----------+
|This_is_111_222_333_test|111,222,333|
|This_is_111_222_444_test|3296 |
|This_is_555_and_666_test|555,666 |
|This_is_999_test |999 |
+------------------------+-----------+
Here is the code I have tried; it works only when the numbers are contiguous. Could you please help me with a solution?
df.withColumn("col2",when($"col2".isNull,regexp_replace(regexp_replace(regexp_extract($"col1","([0-9]+_)+",0),"_",","),".$","")).otherwise($"col2")).show(false)
I can do this by creating a UDF, but I am wondering whether it is possible with Spark's built-in functions. My Spark version is 2.2.0.
Thank you in advance.
A UDF is a good choice here. Performance is similar to that of the withColumn approach given in the OP (see benchmark below), and it works even if the numbers are not contiguous, which is one of the issues mentioned in the OP.
import org.apache.spark.sql.functions.udf
import scala.util.Try
def getNums = (c: String) => {
  c.split("_").map(n => Try(n.toInt).getOrElse(0)).filter(_ > 0)
}
I recreated your data as follows
val data = Seq(("This_is_111_222_333_test", null.asInstanceOf[Array[Int]]),
("This_is_111_222_444_test",Array(3296)),
("This_is_555_666_test",null.asInstanceOf[Array[Int]]),
("This_is_999_test",null.asInstanceOf[Array[Int]]))
.toDF("col1","col2")
data.createOrReplaceTempView("data")
Register the UDF and run it in a query
spark.udf.register("getNums",getNums)
spark.sql("""select col1,
case when size(col2) > 0 then col2 else getNums(col1) end new_col
from data""").show
Which returns
+--------------------+---------------+
| col1| new_col|
+--------------------+---------------+
|This_is_111_222_3...|[111, 222, 333]|
|This_is_111_222_4...| [3296]|
|This_is_555_666_test| [555, 666]|
| This_is_999_test| [999]|
+--------------------+---------------+
Performance was tested with a larger data set created as follows:
val bigData = (0 to 1000).map(_ => data union data).reduce( _ union _)
bigData.createOrReplaceTempView("big_data")
With that, the solution given in the OP was compared to the UDF solution and found to be about the same.
// With UDF
spark.sql("""select col1,
case when length(col2) > 0 then col2 else getNums(col1) end new_col
from big_data""").count
/// OP solution:
bigData.withColumn("col2",when($"col2".isNull,regexp_replace(regexp_replace(regexp_extract($"col1","([0-9]+_)+",0),"_",","),".$","")).otherwise($"col2")).count
Here is another way (note that the SQL functions filter with a lambda and array_join require Spark 2.4+); please check the performance.
df.withColumn("col2", expr("coalesce(col2, array_join(filter(split(col1, '_'), x -> CAST(x as INT) IS NOT NULL), ','))"))
.show(false)
+------------------------+-----------+
|col1 |col2 |
+------------------------+-----------+
|This_is_111_222_333_test|111,222,333|
|This_is_111_222_444_test|3296 |
|This_is_555_666_test |555,666 |
|This_is_999_test |999 |
+------------------------+-----------+

Spark - Seconds to HH/mm aa

I have a script that is currently windowing 30 minute periods and calculating an average of these 30 minutes.
In order to get my windowing to work the way I wanted, I needed to convert a base timestamp MM/dd/yyyy HH:mm:ss aa to be unix_timestamp for only hours and minutes.
Current Code:
val taxiSub = spark.read.format("csv").option("header", true).option("inferSchema", true).load("/user/zeppelin/taxi/taxi_subset.csv")
taxiSub.createOrReplaceTempView("taxiSub")
val time=taxiSub.withColumn("Pickup",from_unixtime(unix_timestamp(col(("tpep_pickup_datetime")),"MM/dd/yyyy hh:mm:ss aa"),"MM/dd/yyyy HH:mm")).withColumn("Dropoff",from_unixtime(unix_timestamp(col(("tpep_dropoff_datetime")),"MM/dd/yyyy hh:mm:ss aa"),"MM/dd/yyyy HH:mm"))
val stamp = time.withColumn("tmp",to_timestamp(col("Pickup"),"MM/dd/yyyy HH:mm"))
.withColumn("StartTimestamp", unix_timestamp(concat_ws(":",hour(col("tmp")),minute(col("tmp"))),"HH:mm")).drop("tmp")
val windowSpec = Window.orderBy("StartTimestamp").rangeBetween(-1800,Window.currentRow)
val byRange = stamp.withColumn("avgPassengers",avg(col("passenger_count")).over(windowSpec)).orderBy(desc("StartTimestamp")).withColumn("EndTime",col("StartTimestamp")+1800)
byRange.createOrReplaceTempView("byRangeTable")
spark.sqlContext.sql("SELECT StartTimestamp,EndTime,avg(avgPassengers) as AvgPassengers FROM byRangeTable group by StartTimestamp,EndTime ORDER BY AvgPassengers DESC ").show(truncate=false)
Current Output:
+--------------+-------+------------------+
|StartTimestamp|EndTime|AvgPassengers |
+--------------+-------+------------------+
|28140 |29940 |2.0851063829787235|
|28200 |30000 |2.0833333333333335|
|26940 |28740 |2.0714285714285716|
How can I convert 'StartTimestamp' and 'EndTime' back to be of the form HH/mm aa?
I.e., I'm trying to convert the above to:
+--------------+------------+------------------+
|StartTimestamp|EndTime |AvgPassengers |
+--------------+------------+------------------+
|07:49:00 am |08:19:00 am |2.0851063829787235|
|07:50:00 am |08:20:00 am |2.0833333333333335|
|07:29:00 am |07:59:00 am |2.0714285714285716|
Use the from_unixtime() function with the output format 'hh:mm:ss a'.
Example:
spark.sql("select from_unixtime('28140','hh:mm:ss a')").show()
//+-----------+
//| _c0|
//+-----------+
//|01:49:00 AM|
//+-----------+
For your case:
//in dataframe api
df.withColumn("StartTimestamp",from_unixtime(col("StartTimestamp"),"hh:mm:ss a")).
withColumn("EndTime",from_unixtime(col("EndTime"),"hh:mm:ss a")).show()
//in sql
sqlContext.sql("select from_unixtime(StartTimestamp,'hh:mm:ss a') as StartTimestamp,from_unixtime(EndTime,'hh:mm:ss a') as EndTime,AvgPassengers from tmp").show()
//timestamp values differ from question based on session timezone
//+--------------+-----------+------------------+
//|StartTimestamp| EndTime| AvgPassengers|
//+--------------+-----------+------------------+
//| 01:49:00 AM|02:19:00 AM|2.0851063829787235|
//| 01:50:00 AM|02:20:00 AM|2.0833333333333335|
//+--------------+-----------+------------------+
