pyspark convert millisecond timestamp to timestamp - apache-spark

I am new to Spark and I would like to retrieve timestamps in my DataFrame.
checkpoint, actual values:
1594976390070
and I want the checkpoint values without milliseconds:
1594976390070 / 1000
Currently I am using this piece of code to cast it as a timestamp:
# Casting dates as Timestamp
from pyspark.sql.types import TimestampType

for d in dateFields:
    df = df.withColumn(d, checkpoint.cast(TimestampType()))
I wonder how to convert it into a simple timestamp.

Divide your column by 1000 and cast the result to timestamp type:
import pyspark.sql.functions as F

for d in dateFields:
    df = df.withColumn(
        d,
        (checkpoint / F.lit(1000.)).cast('timestamp')
    )
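If you prefer an explicit conversion, here is a minimal alternative sketch using F.from_unixtime, which expects whole seconds and drops sub-second precision; it keeps the same checkpoint column reference and dateFields loop as above:
import pyspark.sql.functions as F

for d in dateFields:
    df = df.withColumn(
        d,
        # milliseconds -> seconds, then to a 'yyyy-MM-dd HH:mm:ss' string, then to timestamp
        F.from_unixtime((checkpoint / 1000).cast('bigint')).cast('timestamp')
    )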

Related

How to convert excel date to numeric value using Python

How do I convert Excel date format to a number in Python? I'm importing a number of Excel files into a Pandas dataframe in a loop, and some values are formatted incorrectly in Excel. For example, the number column is imported as a date and I'm trying to convert this date value back into a numeric value.
Original               New
1912-04-26 00:00:00    4500
How do I convert the date value in original to the numeric value in new? I know this code can convert numeric to date, but is there any similar function that does the opposite?
df.loc[0]['Date']= xlrd.xldate_as_datetime(df.loc[0]['Date'], 0)
I tried specifying the data type when reading in the files, and I also tried simply changing the data type of the column to 'float', but neither worked.
Thank you.
I found that the number represents the number of days since 1900-01-00 (Excel's day zero).
The following code calculates how many days have passed from 1900-01-00 until the given date.
import pandas as pd
from datetime import datetime, timedelta
df = pd.DataFrame(
    {
        'date': ['1912-04-26 00:00:00'],
    }
)
print(df)
#                   date
# 0  1912-04-26 00:00:00

def date_to_int(given_date):
    given_date = datetime.strptime(given_date, '%Y-%m-%d %H:%M:%S')
    # Excel's 1900 date system counts 1900-01-01 as day 1 and also includes a
    # non-existent 1900-02-29, so the effective epoch is 1900-01-01 minus 2 days.
    base_date = datetime(1900, 1, 1) - timedelta(days=2)
    delta = given_date - base_date
    return delta.days

df['date'] = df['date'].apply(date_to_int)
print(df)
#    date
# 0  4500
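As a quick sanity check, you can go the other way with the xlrd helper already used in the question (assuming the 1900 date system, i.e. datemode 0):
import xlrd

# Serial 4500 in Excel's 1900 date system should map back to the original date
print(xlrd.xldate_as_datetime(4500, 0))
# 1912-04-26 00:00:00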

Order by ascending utcstamp not working -- missing zero from behind the numbers (Pyspark)

I need to order a PySpark SQL dataframe in ascending order of day and month. However, because the month and day values come from the UTC stamp as strings, single-digit values sort after double-digit ones (1, 10, 11, 12, 2, ...).
How can I add a leading zero to the single-digit numbers and solve this? I'm programming in PySpark. This is the code I used:
data_grouped = data.groupby('month','day').agg(mean('parameter')).orderBy(["month", "day"], ascending=[1, 1])
data_grouped.show()
You can cast the ordering columns to integer:
import pyspark.sql.functions as F

data_grouped = data.groupby('month', 'day').agg(F.mean('parameter')) \
    .orderBy(F.col("month").cast("int"), F.col("day").cast("int"))
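If you also want the stored values zero-padded (as the question literally asks), here is a hedged sketch using F.lpad, assuming month and day are string columns:
import pyspark.sql.functions as F

data = data.withColumn('month', F.lpad('month', 2, '0')) \
           .withColumn('day', F.lpad('day', 2, '0'))
# With fixed-width strings, lexicographic order matches numeric order,
# so the original orderBy(["month", "day"]) now sorts correctly.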

spark - get average of past N records excluding the current record

Given a Spark dataframe that I have
val df = Seq(
  ("2019-01-01", 100),
  ("2019-01-02", 101),
  ("2019-01-03", 102),
  ("2019-01-04", 103),
  ("2019-01-05", 102),
  ("2019-01-06", 99),
  ("2019-01-07", 98),
  ("2019-01-08", 100),
  ("2019-01-09", 47)
).toDF("day", "records")
I want to add a new column so that on a given day I get the average value of the last N records. For example, if N=3, then on a given day that value should be the average of the last 3 values, EXCLUDING the current record.
For example, for day 2019-01-05, it would be (103+102+101)/3.
How can I efficiently use the over() clause to do this in Spark?
PySpark solution.
The window definition should be 3 PRECEDING AND 1 PRECEDING, which translates to positions (-3, -1) with both boundaries included.
from pyspark.sql import Window
from pyspark.sql.functions import avg

# rowsBetween belongs on the window spec, not on the column after .over()
w = Window.orderBy(df.day).rowsBetween(-3, -1)
df_with_rsum = df.withColumn("rsum_prev_3_days", avg(df.records).over(w))
df_with_rsum.show()
The solution assumes there is one row per date in the dataframe without missing dates in between. If not, aggregate the rows by date before applying the window function.
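A minimal sketch of that pre-aggregation step, assuming records should be summed per day before computing the rolling average (adjust the aggregate to your needs):
from pyspark.sql import Window
from pyspark.sql.functions import avg, sum as sum_

# Collapse to one row per day, then apply the same 3-preceding window
daily = df.groupBy("day").agg(sum_("records").alias("records"))
w = Window.orderBy("day").rowsBetween(-3, -1)
daily.withColumn("rsum_prev_3_days", avg("records").over(w)).show()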

Apache Spark subtract days from timestamp column

I am using Spark Dataset and having trouble subtracting days from a timestamp column.
I would like to subtract days from Timestamp Column and get new Column with full datetime format. Example:
2017-09-22 13:17:39.900 - 10 ----> 2017-09-12 13:17:39.900
With date_sub functions I am getting 2017-09-12 without 13:17:39.900.
Cast the data to timestamp and use expr to subtract an INTERVAL:
import org.apache.spark.sql.functions.expr

val df = Seq("2017-09-22 13:17:39.900").toDF("timestamp")

df.withColumn(
  "10_days_before",
  $"timestamp".cast("timestamp") - expr("INTERVAL 10 DAYS")
).show(false)
+-----------------------+---------------------+
|timestamp |10_days_before |
+-----------------------+---------------------+
|2017-09-22 13:17:39.900|2017-09-12 13:17:39.9|
+-----------------------+---------------------+
If the data is already of TimestampType you can skip the cast.
Or you can simply use the date_sub function, available in PySpark 1.5+:
from pyspark.sql.functions import *
df.withColumn("10_days_before", date_sub(col('timestamp'),10).cast('timestamp'))
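Note that date_sub returns a DateType, so the cast back to timestamp lands on midnight and the time of day is lost. A hedged PySpark translation of the INTERVAL approach above, assuming Spark 2.x+ where timestamp minus interval arithmetic works in the DataFrame API, keeps the time component:
from pyspark.sql import functions as F

df.withColumn(
    "10_days_before",
    # timestamp minus a 10-day interval preserves hours/minutes/seconds
    F.col("timestamp").cast("timestamp") - F.expr("INTERVAL 10 DAYS")
)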

Convert a Spark dataframe column from string to date

I have a Spark dataframe I built from a SQL context.
I truncated a datetime field using DATE_FORMAT(time, 'Y/M/d HH:00:00') AS time_hourly
Now the column type is a string. How can I convert a string DataFrame column to a datetime type?
You can use trunc(column, format) so that you don't lose the date datatype in the first place.
There is also a to_date function to convert a string to a date.
Assuming that df is your dataframe and the column to be cast is time_hourly, you can try the following:
from pyspark.sql.types import DateType
df.select(df.time_hourly.cast(DateType()).alias('datetime'))
For more info please see:
1) the documentation of "cast()"
https://spark.apache.org/docs/1.6.2/api/python/pyspark.sql.html
2) the documentation of data-types
https://spark.apache.org/docs/1.6.2/api/python/_modules/pyspark/sql/types.html
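If you are on a newer Spark release (2.2+), a hedged alternative is to parse the string directly with to_timestamp or to_date and a format pattern; the pattern below assumes time_hourly looks like 2017/9/22 13:00:00 and may need adjusting to your actual data:
from pyspark.sql import functions as F

# Parse the formatted string back into a proper timestamp column
df.select(F.to_timestamp('time_hourly', 'y/M/d HH:mm:ss').alias('datetime'))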
