Create a timestamp Column in Spark Dataframe from other column having timestamp value - apache-spark

I have a Spark dataframe with a timestamp column (x1).
I want to get the previous day's date from that column, then add the time (3, 59, 59) to that date.
Example: value in the current column (x1): 2018-07-11 21:40:00
previous day's date: 2018-07-10
after adding the time (3, 59, 59) to the previous day's date, it should be:
2018-07-10 03:59:59 (x2)
I want to add a column to the dataframe with the "x2" value corresponding to the "x1" value in every record.
I also want one more column with values equal to the difference (x1 - x2) in total days, as an exact double value.

Subtracting a day, appending the time, and casting to timestamp type:
from pyspark.sql.functions import *
>>> df.withColumn('x2', concat(date_sub(col("x1"), 1), lit(" 03:59:59")).cast("timestamp"))
Calculating the date and time differences:
Date difference:
Using the datediff function we can calculate the date difference (in whole days):
>>> df1.withColumn("x3", datediff(col("x1"), col("x2")))
Time difference:
To calculate the time difference, convert both columns to Unix time and subtract x2 from x1 (the result is in seconds):
>>> df1.withColumn("x3", unix_timestamp(col("x1")) - unix_timestamp(col("x2")))

Related

Adding date & calendar week column in PySpark dataframe

I'm using spark 2.4.5. I want to add two new columns, date & calendar week, in my pyspark data frame df.
So I tried the following code:
from pyspark.sql.functions import lit
df.withColumn('timestamp', F.lit('2020-05-01'))
df.show()
But I'm getting the error message: AssertionError: col should be Column
Can you explain how to add the date and calendar week columns?
Looks like you missed the lit function in your code.
Here's what you were looking for:
df = df.withColumn("date", lit('2020-05-01'))
This is your answer if you want to hardcode the date and week. If you want to programmatically derive the current timestamp, I'd recommend using a UDF.
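As a rough sketch of the hardcoded approach, and using the built-in weekofyear function as one alternative to a UDF (this assumes F is pyspark.sql.functions):
from pyspark.sql import functions as F

df = df.withColumn("date", F.to_date(F.lit("2020-05-01")))        # hardcoded date literal cast to a date
df = df.withColumn("calendar_week", F.weekofyear(F.col("date")))  # week of the year as an integer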
I see two questions here: First, how to cast a string to a date. Second, how to get the week of the year from a date.
Cast string to date
You can either simply use cast("date") or the more specific F.to_date.
df = df.withColumn("date", F.to_date("timestamp", "yyyy-MM-dd"))
Extract week of year
Using date_format allows you to format a date column to any desired format. w is the week of the year; W would be the week of the month.
df = df.withColumn("week_of_year", F.date_format("date", "w"))
Related Question: pyspark getting weeknumber of month

Pandas - Remove Timestamp

I am trying to calculate basic statistics using pandas. I have precipitation values for a whole year (1956). I created a "Date" column that holds the dates for the entire year using pd.date_range. Then I calculated the maximum value for the year and the date of that maximum value. The date of the maximum value shows Timestamp('1956-06-19 00:00:00') as the output. How do I extract just the date? I do not need the Timestamp wrapper or the 00:00:00 time.
#Create Date Column
import datetime
import pandas as pd

year = 1956
start_date = datetime.date(year, 1, 1)
end_date = datetime.date(year, 12, 31)
precip_file["Date"] = pd.date_range(start=start_date, end=end_date, freq="D")
#Yearly maximum value and date of maximum value
yearly_max = precip_file["Precip (mm)"].max(skipna=True)
max_index = precip_file["Precip (mm)"].idxmax()
yearly_max_date = precip_file.iat[max_index, 2]
[Image: the output dictionary I am trying to create]
May be a duplicate of this question, although I can't tell whether you are trying to convert one DateTime or a column of DateTimes.
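In case it helps, a small sketch of both cases, reusing the names from the question (Date_only is just an illustrative column name):
# Single value: a pandas Timestamp has a .date() method that drops the time part
just_the_date = yearly_max_date.date()            # -> datetime.date(1956, 6, 19)

# Whole column: the .dt accessor does the same for a datetime64 column
precip_file["Date_only"] = precip_file["Date"].dt.date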

Finding an element of pandas series at a certain time (date)

I have a pandas series of type "pandas.core.series.Series". I know that I can see its DatetimeIndex by appending ".index" to it.
But what if I want to get the element of the series at a given time? And what if I have a "pandas._libs.tslibs.timestamps.Timestamp" and want to get the element of the series at that time?
If your series (or dataframe) is indexed by date, you can use:
df[date] to access all the rows indexed by that date (e.g. df['2019-01-01']);
df[date1:date2] to access all the rows with a date index between date1 and date2 (e.g. df['2019-01-01':'2019-11-25']);
df[:date] to access all the rows with an index before the date value (e.g. df[:'2019-01-01']);
df[date:] to access all the rows with an index after the date value (e.g. df['2019-01-01':]).
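To address the Timestamp part of the question directly, a minimal sketch with a hypothetical series s that has a DatetimeIndex:
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0],
              index=pd.date_range("2019-01-01", periods=3, freq="D"))

ts = pd.Timestamp("2019-01-02")
value_at_ts = s.loc[ts]          # element at that exact timestamp -> 2.0
value_by_str = s["2019-01-02"]   # a date string label works the same way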

spark - get average of past N records excluding the current record

Given the following Spark dataframe:
val df = Seq(
("2019-01-01",100),
("2019-01-02",101),
("2019-01-03",102),
("2019-01-04",103),
("2019-01-05",102),
("2019-01-06",99),
("2019-01-07",98),
("2019-01-08",100),
("2019-01-09",47)
).toDF("day","records")
I want to add a new column so that for each day I get the average of the last N records, EXCLUDING the current record. For example, if N=3, on a given day the value should be the average of the previous 3 values.
For example, for day 2019-01-05, it would be (103+102+101)/3.
How can I efficiently use the over() clause to do this in Spark?
PySpark solution.
The window frame should be 3 PRECEDING AND 1 PRECEDING, which translates to positions (-3, -1) with both boundaries included.
from pyspark.sql import Window
from pyspark.sql.functions import avg
w = Window.orderBy(df.day).rowsBetween(-3, -1)
df_with_rsum = df.withColumn("rsum_prev_3_days", avg(df.records).over(w))
df_with_rsum.show()
The solution assumes there is one row per date in the dataframe without missing dates in between. If not, aggregate the rows by date before applying the window function.
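A self-contained PySpark sketch of the same idea, with the data mirroring the Scala example in the question:
from pyspark.sql import SparkSession, Window
from pyspark.sql.functions import avg, col

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [("2019-01-01", 100), ("2019-01-02", 101), ("2019-01-03", 102),
     ("2019-01-04", 103), ("2019-01-05", 102), ("2019-01-06", 99),
     ("2019-01-07", 98),  ("2019-01-08", 100), ("2019-01-09", 47)],
    ["day", "records"])

# average over the previous three rows, excluding the current one
w = Window.orderBy(col("day")).rowsBetween(-3, -1)
df.withColumn("avg_prev_3", avg(col("records")).over(w)).show()
# e.g. for 2019-01-05 the value is (101 + 102 + 103) / 3 = 102.0;
# the first row has no preceding rows, so its average is null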

Construct a DatetimeIndex from multiple columns of a dataframe

I am parsing a dataframe from a sas7bdat file and I want to convert the index into datetime to resample the data.
I have one column with the date, which is of type string, and another column with the time, which is of type datetime.time. Does anybody know how to combine these into one datetime column?
I already tried pd.to_datetime like this, but it requires individual columns for year, month and day:
df['TimeIn']=str(df['TimeIn'])
df['datetime']=pd.to_datetime(df[['Date', 'TimeIn']], dayfirst=True)
This gives me a ValueError:
ValueError: to assemble mappings requires at least that [year, month, day] be specified: [day,month,year] is missing
[Image: DataFrame column headers]
If you convert both the date and time columns to str, you can concatenate them and then call to_datetime:
In[155]:
df = pd.DataFrame({'Date':['08/05/2018'], 'TimeIn':['10:32:12']})
df
Out[155]:
Date TimeIn
0 08/05/2018 10:32:12
In[156]:
df['new_date'] = pd.to_datetime(df['Date']+' '+df['TimeIn'])
df
Out[156]:
Date TimeIn new_date
0 08/05/2018 10:32:12 2018-08-05 10:32:12
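Since the question's TimeIn column actually holds datetime.time objects rather than strings, one hedged variation is to stringify both columns before concatenating (column names as in the question; dayfirst=True matches the DD/MM/YYYY date format):
import pandas as pd

df["datetime"] = pd.to_datetime(
    df["Date"].astype(str) + " " + df["TimeIn"].astype(str),  # e.g. "08/05/2018" + " " + "10:32:12"
    dayfirst=True)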
