Adding date & calendar week column in PySpark dataframe - apache-spark

I'm using Spark 2.4.5. I want to add two new columns, date & calendar week, to my PySpark dataframe df.
So I tried the following code:
from pyspark.sql.functions import lit
df = df.withColumn('timestamp', '2020-05-01')
df.show()
But I'm getting error message: AssertionError: col should be Column
Can you explain how to add the date and calendar week columns?

Looks like you missed the lit function in your code.
Here's what you were looking for:
df = df.withColumn("date", lit('2020-05-01'))
This is your answer if you want to hardcode the date. If you want to derive the current date or timestamp programmatically instead, the built-in F.current_date() and F.current_timestamp() functions cover that without needing a UDF.
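A minimal self-contained sketch of both options (the tiny example DataFrame and the column names "date" and "today" are purely illustrative):
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1,), (2,)], ["id"])
# hardcoded date string turned into a proper date column
df = df.withColumn("date", F.lit("2020-05-01").cast("date"))
# built-in function for the current date, no UDF required
df = df.withColumn("today", F.current_date())
df.show()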

I see two questions here: First, how to cast a string to a date. Second, how to get the week of the year from a date.
Cast string to date
You can either simply use cast("date") or the more specific F.to_date.
df = df.withColumn("date", F.to_date("timestamp", "yyyy-MM-dd"))
Extract week of year
Using date_format allows you to format a date column to any desired format. w is the week of the year; W would be the week of the month.
df = df.withColumn("week_of_year", F.date_format("date", "w"))
Related Question: pyspark getting weeknumber of month

Related

Change number to datetime for whole column in dataframe (ddf) in Pandas

I have an Excel .xlsb sheet with data. Some columns have numbers as output data, other columns should have dates as output. After loading the data in Python, some columns show a number instead of a date. How can I change the format of the number in that specific column to a date?
I use Pandas and ddf
The output of the dataframe of column date of birth ('dob_l1') shows '12150', which should be date '6-4-1933'.
I tried to solve this, but unfortunately I only managed to get the date '2050-01-12' which is incorrect.
I used the code ddf['nwdob_l1'] = pd.to_datetime(ddf['dob_l1'], format='%d%m%y', errors='coerce').
Who can help me? I was happy to receive some good feedback from joe90, who showed me a function that works for single dates:
import datetime
def xldate2date(xl):
    # valid for dates from 1900-03-01
    basedate = datetime.date(1899, 12, 30)
    d = basedate + datetime.timedelta(days=xl)
    return d
# Example:
# >>> print(xldate2date(44948))
# 2023-01-22
That is correct; however, I need to convert all values in the column (more than 500,000 rows), so I cannot do it one by one.
As that question is closed, I hereby open a new question.
Is there anyone who can help me to find the correct code to get the right date in the whole column?
When you read the data in using pandas, there are tools for handling dates. You want to use parse_dates.
Documentation for read_excel
example:
import pandas as pd
df = pd.read_excel('file/path/the.xlsx', parse_dates=['Date'])
This will change the date to be datetime64 format which is better than a number.
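If the column contains raw Excel serial numbers rather than date strings that parse_dates can recognise, the whole column can also be converted in one vectorized call; a sketch using the same 1899-12-30 base date as joe90's function (the sample values are made up, the column names follow the question):
import pandas as pd
ddf = pd.DataFrame({"dob_l1": [12150, 44948]})
# interpret the numbers as days since Excel's base date 1899-12-30
ddf["nwdob_l1"] = pd.to_datetime(ddf["dob_l1"], unit="D", origin="1899-12-30", errors="coerce")
print(ddf)   # 12150 -> 1933-04-06, 44948 -> 2023-01-22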

How to cast a string column to date having two different types of date formats in Pyspark

I have a dataframe column which is of type string and has dates in it. I want to cast the column from string to date but the column contains two types of date formats.
I tried using the to_date function, but it is not working as expected and gives null values after applying the function.
The column (string datatype) contains two different date formats, for example 9/1/2022 and 2022-11-24, so a single format pattern leaves some rows as null.
Please let me know how we can solve this issue and get the date column in only one format
Thanks in advance
You can use pyspark.sql.functions.coalesce to return the first non-null result in a list of columns. So the trick here is to parse using multiple formats and take the first non-null one:
from pyspark.sql import functions as F
df = spark.createDataFrame([
("9/1/2022",),
("2022-11-24",),
], ["Alert Release Date"])
x = F.col("Alert Release Date")
df.withColumn("date", F.coalesce(F.to_date(x, "M/d/yyyy"), F.to_date(x, "yyyy-MM-dd"))).show()
+------------------+----------+
|Alert Release Date| date|
+------------------+----------+
| 9/1/2022|2022-09-01|
| 2022-11-24|2022-11-24|
+------------------+----------+

How to extract the year and quarter from the String date in databricks SQL

Can someone show me how to extract the year from a string date in Databricks SQL?
I am based in the UK and our date format is normally as follows:
dd/mm/yyyy
The field containing the dates is set as StringType()
I am trying to extract the year from the string as follows:
select year(cast(financials_0_accountsDate as Date)) from `financiallimited_csv`
I'm using the following code to extract the quarter:
select quarter(cast(financials_0_accountsDate as Date)) from `financiallimited_csv`
However, both result in NULL values.
Any thoughts on how to extract the year and quarter from dates with StringType() dd/mm/yyyy?
Could you try the to_date function?
select year(to_date(financials_0_accountsDate, 'dd/MM/yyyy')) from `financiallimited_csv`
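The quarter works the same way, e.g. select quarter(to_date(financials_0_accountsDate, 'dd/MM/yyyy')) from the same table. If you need the equivalent from PySpark rather than SQL, a hedged sketch with the DataFrame API (assuming financiallimited_csv is available as a table or view; the intermediate column name "accounts_date" is just illustrative):
from pyspark.sql import functions as F
df = spark.table("financiallimited_csv")
# parse the dd/MM/yyyy string, then take year and quarter from the resulting date
df = df.withColumn("accounts_date", F.to_date("financials_0_accountsDate", "dd/MM/yyyy"))
df = df.withColumn("year", F.year("accounts_date"))
df = df.withColumn("quarter", F.quarter("accounts_date"))
df.select("financials_0_accountsDate", "year", "quarter").show()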

Converting timestamp in a dataframe to date and time in python

In the image I have a dataframe. It has a column called Timestamp, from which I want to separate the month and make it a new column. How do I do that?
If your Timestamp is not already a datetime, then convert it like so:
df["Timestamp_converted"] = pd.to_datetime(df["Timestamp"], format="%Y-%m-%d %H:%M:%S")
You get the month as a separate column with this:
df["month"] = df.Timestamp_converted.dt.month

Create a timestamp Column in Spark Dataframe from other column having timestamp value

I have a Spark dataframe with a timestamp column.
I want to get the previous day's date from that column, then add the time (3,59,59) to the date.
Example value in the current column (x1): 2018-07-11 21:40:00
previous day's date: 2018-07-10
after adding the time (3,59,59) to the previous day's date, it should be:
2018-07-10 03:59:59 (x2)
I want to add a column to the dataframe with "x2" values corresponding to the "x1" values in all records.
I want one more column with values equal to the difference (x1 - x2) in total days, as exact double values.
Subtracting a day, adding the time, and converting to timestamp type:
from pyspark.sql.functions import col, concat, date_sub, lit, datediff, unix_timestamp
df1 = df.withColumn('x2', concat(date_sub(col("x1"), 1), lit(" 03:59:59")).cast("timestamp"))
Calculating time and date difference:
Date difference:
Using the datediff function we can calculate the date difference:
df1.withColumn("x3", datediff(col("x1"), col("x2")))
Time difference:
To calculate the time difference, convert to Unix time and then subtract the x2 column from the x1 column:
df1.withColumn("x3", unix_timestamp(col("x1")) - unix_timestamp(col("x2")))
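The question also asks for the difference in days as an exact double; datediff above returns whole days only, so one option (a sketch, not part of the original answer; the column name "days_diff" is illustrative) is to divide the Unix-time difference in seconds by 86400:
from pyspark.sql.functions import col, unix_timestamp
df1 = df1.withColumn("days_diff", (unix_timestamp(col("x1")) - unix_timestamp(col("x2"))) / 86400.0)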
