I have a script that is currently windowing 30 minute periods and calculating an average of these 30 minutes.
In order to get my windowing to work the way I wanted, I needed to convert a base timestamp MM/dd/yyyy HH:mm:ss aa to be unix_timestamp for only hours and minutes.
Current Code:
val taxiSub = spark.read.format("csv").option("header", true).option("inferSchema", true).load("/user/zeppelin/taxi/taxi_subset.csv")
taxiSub.createOrReplaceTempView("taxiSub")
val time=taxiSub.withColumn("Pickup",from_unixtime(unix_timestamp(col(("tpep_pickup_datetime")),"MM/dd/yyyy hh:mm:ss aa"),"MM/dd/yyyy HH:mm")).withColumn("Dropoff",from_unixtime(unix_timestamp(col(("tpep_dropoff_datetime")),"MM/dd/yyyy hh:mm:ss aa"),"MM/dd/yyyy HH:mm"))
val stamp = time.withColumn("tmp",to_timestamp(col("Pickup"),"MM/dd/yyyy HH:mm"))
.withColumn("StartTimestamp", unix_timestamp(concat_ws(":",hour(col("tmp")),minute(col("tmp"))),"HH:mm")).drop("tmp")
val windowSpec = Window.orderBy("StartTimestamp").rangeBetween(-1800,Window.currentRow)
val byRange = stamp.withColumn("avgPassengers",avg(col("passenger_count")).over(windowSpec)).orderBy(desc("StartTimestamp")).withColumn("EndTime",col("StartTimestamp")+1800)
val answer = byRange.withColumn("Start",)
byRange.createOrReplaceTempView("byRangeTable")
spark.sqlContext.sql("SELECT StartTimestamp,EndTime,avg(avgPassengers) as AvgPassengers FROM byRangeTable group by StartTimestamp,EndTime ORDER BY AvgPassengers DESC ").show(truncate=false)
Current Output:
+--------------+-------+------------------+
|StartTimestamp|EndTime|AvgPassengers |
+--------------+-------+------------------+
|28140 |29940 |2.0851063829787235|
|28200 |30000 |2.0833333333333335|
|26940 |28740 |2.0714285714285716|
How can I convert 'StartTimestamp' and 'EndTime' back to be of the form HH/mm aa.
I.e I'm trying to convert the above to:
+--------------+------------+------------------+
|StartTimestamp|EndTime |AvgPassengers |
+--------------+------------+------------------+
|07:49:00 am |08:19:00 am |2.0851063829787235|
|07:50:00 am |08:20:00 am |2.0833333333333335|
|07:29:00 am |07:59:00 am |2.0714285714285716|
Use from_unixtime() function with output format as 'hh:mm:ss a'.
Example:
spark.sql("select from_unixtime('28140','hh:mm:ss a')").show()
//+-----------+
//| _c0|
//+-----------+
//|01:49:00 AM|
//+-----------+
For your case:
//in dataframe api
df.withColumn("StartTimestamp",from_unixtime(col("StartTimestamp"),"hh:mm:ss a")).
withColumn("EndTime",from_unixtime(col("EndTime"),"hh:mm:ss a")).show()
//in sql
sqlContext.sql("select from_unixtime(StartTimestamp,'hh:mm:ss a') as StartTimestamp,from_unixtime(EndTime,'hh:mm:ss a') as EndTime,AvgPassengers from tmp").show()
//timestamp values differ from question based on session timezone
//+--------------+-----------+------------------+
//|StartTimestamp| EndTime| AvgPassengers|
//+--------------+-----------+------------------+
//| 01:49:00 AM|02:19:00 AM|2.0851063829787235|
//| 01:50:00 AM|02:20:00 AM|2.0833333333333335|
//+--------------+-----------+------------------+
Related
I have the following DataFrame containing the date format - yyyyMMddTHH:mm:ss+UTC
Data Preparation
sparkDF = sql.createDataFrame([("20201021T00:00:00+0530",),
("20211011T00:00:00+0530",),
("20200212T00:00:00+0300",),
("20211021T00:00:00+0530",),
("20211021T00:00:00+0900",),
("20211021T00:00:00-0500",)
]
,['timestamp'])
sparkDF.show(truncate=False)
+----------------------+
|timestamp |
+----------------------+
|20201021T00:00:00+0530|
|20211011T00:00:00+0530|
|20200212T00:00:00+0300|
|20211021T00:00:00+0530|
|20211021T00:00:00+0900|
|20211021T00:00:00-0500|
+----------------------+
I m aware of the date format to parse and convert the values to DateType
Timestamp Parsed
sparkDF.select(F.to_date(F.col('timestamp'),"yyyyMMdd'T'HH:mm:ss+0530").alias('timestamp_parsed')).show()
+----------------+
|timestamp_parsed|
+----------------+
| 2020-10-21|
| 2021-10-11|
| null|
| 2021-10-21|
| null|
| null|
+----------------+
As you can see , its specific to +0530 strings , I m aware of the fact that I can use multiple patterns and coalesce the first non-null values
Multiple Patterns & Coalesce
sparkDF.withColumn('p1',F.to_date(F.col('timestamp'),"yyyyMMdd'T'HH:mm:ss+0530"))\
.withColumn('p2',F.to_date(F.col('timestamp'),"yyyyMMdd'T'HH:mm:ss+0900"))\
.withColumn('p3',F.to_date(F.col('timestamp'),"yyyyMMdd'T'HH:mm:ss-0500"))\
.withColumn('p4',F.to_date(F.col('timestamp'),"yyyyMMdd'T'HH:mm:ss+0300"))\
.withColumn('timestamp_parsed',F.coalesce(F.col('p1'),F.col('p2'),F.col('p3'),F.col('p4')))\
.drop(*['p1','p2','p3','p4'])\
.show(truncate=False)
+----------------------+----------------+
|timestamp |timestamp_parsed|
+----------------------+----------------+
|20201021T00:00:00+0530|2020-10-21 |
|20211011T00:00:00+0530|2021-10-11 |
|20200212T00:00:00+0300|2020-02-12 |
|20211021T00:00:00+0530|2021-10-21 |
|20211021T00:00:00+0900|2021-10-21 |
|20211021T00:00:00-0500|2021-10-21 |
+----------------------+----------------+
Is there a better way to accomplish this, as there might be a bunch of other UTC within the data source, is there a standard UTC TZ available within Spark to parse all the cases
i think you have got the 2nd argument of your to_date function wrong which is causing null values in your output
the +530 in your timestamp is the Zulu value which just denotes how many hours and mins ahead (for +) or behind (for -) the current timestamp is withrespect to UTC.
Please refer to the response by Basil here Java / convert ISO-8601 (2010-12-16T13:33:50.513852Z) to Date object This link has full details available for the same.
To answer your question if you replace +0530 by Z it should solve your problem.
Here is the spark code in scala that I tried and worked:
val data = Seq("20201021T00:00:00+0530",
"20211011T00:00:00+0530",
"20200212T00:00:00+0300",
"20211021T00:00:00+0530",
"20211021T00:00:00+0900",
"20211021T00:00:00-0500")
import spark.implicits._
val sparkDF = data.toDF("custom_time")
import org.apache.spark.sql.functions._
val spark_DF2 = sparkDF.withColumn("new_timestamp", to_date($"custom_time", "yyyyMMdd'T'HH:mm:ssZ"))
spark_DF2.show(false)
here is the snapshot of the output. As you can see there are no null values.
You can usually use x, X or Z for offset pattern as you can find on Spark date pattern documentation page. You can then parse your date with the following complete pattern: yyyyMMdd'T'HH:mm:ssxx
However, if you use those kind of offset patterns, your date will be first converted in UTC format, meaning all timestamp with a positive offset will be matched to the previous day. For instance "20201021T00:00:00+0530" will be matched to 2020-10-20 using to_date with the previous pattern.
If you want to get displayed date as a date, ignoring offset, you should first extract date string from complete timestamp string using regexp_extract function, then perform to_date.
If you take your example "20201021T00:00:00+0530", what you want to extract with a regexp is 20201021 part and apply to_date on it. You can do it with the following pattern: ^(\\d+). If you're interested, you can find how to build other patterns in java's Pattern documentation.
So your code should be:
from pyspark.sql import functions as F
sparkDF.select(
F.to_date(
F.regexp_extract(F.col('timestamp'), '^(\\d+)', 0), 'yyyyMMdd'
).alias('timestamp_parsed')
).show()
And with your input you will get:
+----------------+
|timestamp_parsed|
+----------------+
|2020-10-21 |
|2021-10-11 |
|2020-02-12 |
|2021-10-21 |
|2021-10-21 |
|2021-10-21 |
+----------------+
You can create "udf" in spark and use it. Below is the code in scala.
import spark.implicits._
//just to create the dataset for the example you have given
val data = Seq(
("20201021T00:00:00+0530"),
("20211011T00:00:00+0530"),
("20200212T00:00:00+0300"),
("20211021T00:00:00+0530"),
("20211021T00:00:00+0900"),
("20211021T00:00:00-0500"))
val dataset = data.toDF("timestamp")
val udfToDateUTC = functions.udf((epochMilliUTC: String) => {
val formatter = DateTimeFormatter.ofPattern("yyyyMMdd'T'HH:mm:ssZ")
val res = OffsetDateTime.parse(epochMilliUTC, formatter).withOffsetSameInstant(ZoneOffset.UTC)
res.toString()
})
dataset.select(dataset.col("timestamp"),udfToDateUTC(dataset.col("timestamp")).alias("timestamp_parsed")).show(false)
//output
+----------------------+-----------------+
|timestamp |timestamp_parsed |
+----------------------+-----------------+
|20201021T00:00:00+0530|2020-10-20T18:30Z|
|20211011T00:00:00+0530|2021-10-10T18:30Z|
|20200212T00:00:00+0300|2020-02-11T21:00Z|
|20211021T00:00:00+0530|2021-10-20T18:30Z|
|20211021T00:00:00+0900|2021-10-20T15:00Z|
|20211021T00:00:00-0500|2021-10-21T05:00Z|
+----------------------+-----------------+
from pyspark.sql.functions import date_format
customer_data = select("<column_name>",date_format("<column_name>",'yyyyMMdd').cast('customer')
I am trying to read the csv file into a dataframe,the csv fileThe csv file looks like this.
The cell value only contains the hour information and miss the day information. I would like to read this csv file into a dataframe and transform the timing information into the format like 2021-05-07 04:04.00 i.e., I would like to add the day information. How to achieve that?
I used the following code, but it seems that pyspark just add the day information as 1970-01-01, kind of system setting.
spark = SparkSession.builder.getOrCreate()
spark.conf.set("spark.sql.legacy.timeParserPolicy","LEGACY")
df_1 = spark.read.csv('test1.csv', header = True)
df_1 = df_1.withColumn('Timestamp', to_timestamp(col('Timing'), 'HH:mm'))
df_1.show(truncate=False)
And I got the following result.
+-------+-------------------+
| Timing| Timestamp|
+-------+-------------------+
|04:04.0|1970-01-01 04:04:00|
|19:04.0|1970-01-01 19:04:00|
You can concat a date string before calling to_timestamp:
import pyspark.sql.functions as F
df2 = df_1.withColumn(
'Timestamp',
F.to_timestamp(
F.concat_ws(' ', F.lit('2021-05-07'), 'Timing'),
'yyyy-MM-dd HH:mm.s'
)
)
df2.show()
+-------+-------------------+
| Timing| Timestamp|
+-------+-------------------+
|04:04.0|2021-05-07 04:04:00|
|19:04.0|2021-05-07 19:04:00|
+-------+-------------------+
I'm trying to format my timestamp column to include milliseconds without success. How can I format my time to look like this - 2019-01-04 11:09:21.152 ?
I have looked at the documentation and following the SimpleDataTimeFormat , which the pyspark docs say are being used by the to_timestamp function.
This is my dataframe.
+--------------------------+
|updated_date |
+--------------------------+
|2019-01-04 11:09:21.152815|
+--------------------------+
I use the millisecond format without any success as below
>>> df.select('updated_date').withColumn("updated_date_col2",
to_timestamp("updated_date", "YYYY-MM-dd HH:mm:ss:SSS")).show(1,False)
+--------------------------+-------------------+
|updated_date |updated_date_col2 |
+--------------------------+-------------------+
|2019-01-04 11:09:21.152815|2019-01-04 11:09:21|
+--------------------------+-------------------+
I expect updated_date_col2 to be formatted as 2019-01-04 11:09:21.152
I think you can use UDF and Python's standard datetime module as below.
import datetime
from pyspark.sql.functions import udf
from pyspark.sql.types import TimestampType
def _to_timestamp(s):
return datetime.datetime.strptime(s, '%Y-%m-%d %H:%M:%S.%f')
udf_to_timestamp = udf(_to_timestamp, TimestampType())
df.select('updated_date').withColumn("updated_date_col2", udf_to_timestamp("updated_date")).show(1,False)
This is not a solution with to_timestamp but you can easily keep your column to time format
Following code is one of example on converting a numerical milliseconds to timestamp.
from datetime import datetime
ms = datetime.now().timestamp() # ex) ms = 1547521021.83301
df = spark.createDataFrame([(1, ms)], ['obs', 'time'])
df = df.withColumn('time', df.time.cast("timestamp"))
df.show(1, False)
+---+--------------------------+
|obs|time |
+---+--------------------------+
|1 |2019-01-15 12:15:49.565263|
+---+--------------------------+
if you use new Date().getTime() or Date.now() in JS or datetime.datetime.now().timestamp() in Python, you can get a numerical milliseconds.
Reason pyspark to_timestamp parses only till seconds, while TimestampType have the ability to hold milliseconds.
Following workaround may work:
If the timestamp pattern contains S, Invoke a UDF to get the string 'INTERVAL MILLISECONDS' to use in expression
ts_pattern = "YYYY-MM-dd HH:mm:ss:SSS"
my_col_name = "time_with_ms"
# get the time till seconds
df = df.withColumn(my_col_name, to_timestamp(df["updated_date_col2"],ts_pattern))
# add milliseconds as inteval
if 'S' in timestamp_pattern:
df = df.withColumn(my_col_name, df[my_col_name] + expr("INTERVAL 256 MILLISECONDS"))
To get INTERVAL 256 MILLISECONDS we may use a Java UDF:
df = df.withColumn(col_name, df[col_name] + expr(getIntervalStringUDF(df[my_col_name], ts_pattern)))
Inside UDF: getIntervalStringUDF(String timeString, String pattern)
Use SimpleDateFormat to parse date according to pattern
return formatted date as string using pattern "'INTERVAL 'SSS' MILLISECONDS'"
return 'INTERVAL 0 MILLISECONDS' on parse/format exceptions
I would like to understand the best way to do an aggregation in Spark in this scenario:
import sqlContext.implicits._
import org.apache.spark.sql.functions._
case class Person(name:String, acc:Int, logDate:String)
val dateFormat = "dd/MM/yyyy"
val filterType = // Could has "MIN" or "MAX" depending on a run parameter
val filterDate = new Timestamp(System.currentTimeMillis)
val df = sc.parallelize(List(Person("Giorgio",20,"31/12/9999"),
Person("Giorgio",30,"12/10/2009")
Person("Diego", 10,"12/10/2010"),
Person("Diego", 20,"12/10/2010"),
Person("Diego", 30,"22/11/2011"),
Person("Giorgio",10,"31/12/9999"),
Person("Giorgio",30,"31/12/9999"))).toDF()
val df2 = df.withColumn("logDate",unix_timestamp($"logDate",dateFormat).cast(TimestampType))
val df3 = df.groupBy("name").agg(/*conditional aggregation*/)
df3.show /*Expected output show below */
Basically I want to group all records by name column and then based on the filterType parameter, I want to filter all valid records for a Person, then after filtering, I want to sum all acc values obtaining a final
DataFrame with name and totalAcc columns.
For example:
filterType = MIN , I want to take all records with having min(logDate) , could be many of them, so basically in this case I completely ignore filterDate param:
Diego,10,12/10/2010
Diego,20,12/10/2010
Giorgio,30,12/10/2009
Final result expected from aggregation is: (Diego, 30),(Giorgio,30)
filterType = MAX , I want to take all records with logDate > filterDate, I for a key I don't have any records respecting this condition, I need to take records with min(logDate) as done in MIN scenario, so:
Diego, 10, 12/10/2010
Diego, 20, 12/10/2010
Giorgio, 20, 31/12/9999
Giorgio, 10, 31/12/9999
Giorgio, 30, 31/12/9999
Final result expected from aggregation is: (Diego,30),(Giorgio,60)
In this case for Diego I didn't have any records with logDate > logFilter, so I fallback to apply MIN scenario, taking just for Diego all records with min logDate.
You can write your conditional aggregation using when/otherwise as
df2.groupBy("name").agg(sum(when(lit(filterType) === "MIN" && $"logDate" < filterDate, $"acc").otherwise(when(lit(filterType) === "MAX" && $"logDate" > filterDate, $"acc"))).as("sum"))
.filter($"sum".isNotNull)
which would give you your desired output according to filterType
But
Eventually you would require both aggregated dataframes so I would suggest you to avoid filterType field and just go with aggregation by creating additional column for grouping using when/otherwise function. So that you can have both aggregated values in one dataframe as
df2.withColumn("additionalGrouping", when($"logDate" < filterDate, "less").otherwise("more"))
.groupBy("name", "additionalGrouping").agg(sum($"acc"))
.drop("additionalGrouping")
.show(false)
which would output as
+-------+--------+
|name |sum(acc)|
+-------+--------+
|Diego |10 |
|Giorgio|60 |
+-------+--------+
Updated
Since the question is updated with the logic changed, here is the idea and solution to the changed scenario
import org.apache.spark.sql.expressions._
def windowSpec = Window.partitionBy("name").orderBy($"logDate".asc)
val minDF = df2.withColumn("minLogDate", first("logDate").over(windowSpec)).filter($"minLogDate" === $"logDate")
.groupBy("name")
.agg(sum($"acc").as("sum"))
val finalDF =
if(filterType == "MIN") {
minDF
}
else if(filterType == "MAX"){
val tempMaxDF = df2
.groupBy("name")
.agg(sum(when($"logDate" > filterDate,$"acc")).as("sum"))
tempMaxDF.filter($"sum".isNull).drop("sum").join(minDF, Seq("name"), "left").union(tempMaxDF.filter($"sum".isNotNull))
}
else {
df2
}
so for filterType = MIN you should have
+-------+---+
|name |sum|
+-------+---+
|Diego |30 |
|Giorgio|30 |
+-------+---+
and for filterType = MAX you should have
+-------+---+
|name |sum|
+-------+---+
|Diego |30 |
|Giorgio|60 |
+-------+---+
In case if the filterType isn't MAX or MIN then original dataframe is returned
I hope the answer is helpful
You don't need conditional aggregation. Just filter:
df
.where(if (filterType == "MAX") $"logDate" < filterDate else $"logDate" > filterDate)
.groupBy("name").agg(sum($"acc")
I have two columns that need to be sorted in a custom way.
For Eg:
Month Column it should be sorted in a way that Jan2015 to Dec(CurrentYear)
and Also suppose I have Column as Quarter and I want it to or Order by as Q1-2015,Q2-2015,... Q4-CurrentYear ..
in orderby of Spark Sql I'll be giving as orderBy("Month","Quarter") but the Order should be Custom Sequence As before .
I have tried the code below:
import org.apache.spark.sql.SaveMode
import org.apache.spark.storage.StorageLevel
val vDF=spark.sql(""" select month,quarter from table group by month,quarter order by month,quarter """);
vDF.repartition(10).orderBy("Month","Quarter").write(results.csv);
As of now the Month gets Ordered as Apr,Aug,Dec.... in a alphabetical way and Quarter as Q1-2015,Q1-2016,.... but the requirement is the mentioned above
I'd just parse the dates:
import org.apache.spark.sql.functions._
val df = Seq(
("Jul", 2017"), ("May", "Q2-2017"),
("Jan", "Q1-2016"), ("Dec", "Q4-2016"), ("Aug", "Q1-2016")
).toDF("month", "quater")
df.orderBy(unix_timestamp(
concat_ws(" ", col("month"), substring(col("quater"), 4, 6)), "MMM yyyy"
)).show()
+-----+-------+
|month| quater|
+-----+-------+
| Jan|Q1-2016|
| Aug|Q1-2016|
| Dec|Q4-2016|
| May|Q2-2017|
| Jul|Q3-2017|
+-----+-------+