Timestamp parsing in pyspark - apache-spark

df1:
Timestamp:
1995-08-01T00:00:01.000+0000
Is there a way to extract the day of the month from the timestamp column of the data frame using PySpark? I am not able to provide code, as I am new to Spark and do not know how to proceed.

You can parse this timestamp using unix_timestamp:
from pyspark.sql import functions as F
format = "yyyy-MM-dd'T'HH:mm:ss.SSSZ"
df2 = df1.withColumn('Timestamp2', F.unix_timestamp('Timestamp', format).cast('timestamp'))
Then, you can use dayofmonth in the new Timestamp column:
df2.select(F.dayofmonth('Timestamp2'))
More details about these functions can be found in the PySpark functions documentation.
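Putting it together, here is a minimal end-to-end sketch; the one-row sample frame is a stand-in built from the value shown in the question, not the asker's actual df1:
from pyspark.sql import functions as F

# hypothetical one-row frame mirroring the question's data
df1 = spark.createDataFrame([("1995-08-01T00:00:01.000+0000",)], ["Timestamp"])

fmt = "yyyy-MM-dd'T'HH:mm:ss.SSSZ"
df2 = df1.withColumn("Timestamp2", F.unix_timestamp("Timestamp", fmt).cast("timestamp"))
df2.select(F.dayofmonth("Timestamp2").alias("day")).show()  # expected day: 1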

Code (note that dayofmonth needs to be imported from pyspark.sql.functions, and the Timestamp column must be a date/timestamp or a string Spark can cast to one):
from pyspark.sql.functions import dayofmonth
df1.select(dayofmonth('Timestamp').alias('day'))

Related

Using PySpark to read in datalake table and can't parse timestamp column in Synapse Analytics

I can read in the datalake table and print its schema, but if I try to display the data I get the following error. I am working within Synapse Analytics, using a PySpark notebook and an Apache Spark pool.
See error message:
You may get a different result due to the upgrading of Spark 3.0: Fail to parse '10/27/2022 1:14:31 PM' in the new parser.
You can set spark.sql.legacy.timeParserPolicy to LEGACY to restore the behavior before Spark 3.0, or set to CORRECTED and treat it as an invalid datetime string.
I don't want to use the LEGACY version.
I've tried converting using the following code
df = df.withColumn("SinkCreatedOn",to_date(col("SinkCreatedOn"),"M/dd/yyyy h:m:s"))
df = df.withColumn("SinkModifiedOn",to_date(col("SinkModifiedOn"),"M/dd/yyyy h:m:s"))
I've also tried converting the suspect columns to StringType() or DateType() but no luck.
Any help appreciated
Thank you
Try the script with the date format below:
df = df1.withColumn("SinkCreatedOn",to_date(col("SinkCreatedOn"),"MM/dd/yyyy h:mm:s a"))
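If you need to keep the time-of-day as well, rather than just the date, a hedged variant with to_timestamp and the same pattern (with two-digit seconds spelled out) should work:
from pyspark.sql.functions import to_timestamp, col

df = df1.withColumn("SinkCreatedOn", to_timestamp(col("SinkCreatedOn"), "MM/dd/yyyy h:mm:ss a"))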
I reproduced the same with sample input. Below is the approach.
Code:
from pyspark.sql.functions import *

df1 = spark.createDataFrame(
    data=[("1", "Arpit", "10/27/2022 1:14:31 PM"), ("2", "Anand", "10/28/2022 1:14:31 PM"), ("3", "Mike", "10/29/2022 1:14:31 PM")],
    schema=["id", "Name", "SinkCreatedOn"])
df1.printSchema()
df_output = df1.withColumn("SinkCreatedOn", to_date(col("SinkCreatedOn"), "MM/dd/yyyy h:mm:s a"))
df1.show()
df_output.show()
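Since the question rules out LEGACY, you can also pin the parser policy mentioned in the error message to CORRECTED explicitly and rely on the fixed pattern (a sketch; the config key is taken verbatim from the error text):
spark.conf.set("spark.sql.legacy.timeParserPolicy", "CORRECTED")  # keep the Spark 3 parser, fix the pattern instead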

spark convert datetime to timestamp

I have a column in pyspark dataframe which is in the format 2021-10-28T22:19:03.0030059Z (string datatype). How to convert this into a timestamp datatype in pyspark?
I'm using the code snippet below, but it returns nulls because it is unable to convert the value. Can someone please recommend how to convert this?
df3.select(to_timestamp(df.DateTime, 'yyyy-MM-ddHH:mm:ss:SSS').alias('dt'),col('DateTime')).show()
You have to escape T and Z by putting them in single quotes (''):
import pyspark.sql.functions as F
df = spark.createDataFrame([{"DateTime": "2021-10-28T22:19:03.0030059Z"}])
df.select(F.to_timestamp(df.DateTime, "yyyy-MM-dd'T'HH:mm:ss.SSSSSSS'Z'").alias('dt'), F.col('DateTime')).show(truncate=False)
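As a quick sanity check (assuming the same df as above), you can materialize the parsed column and confirm its type:
df2 = df.withColumn("dt", F.to_timestamp(df.DateTime, "yyyy-MM-dd'T'HH:mm:ss.SSSSSSS'Z'"))
df2.printSchema()  # dt should now be a timestamp; rows that don't match the pattern come back as null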

Spark SQL: Parse date string from dd/mm/yyyy to yyyy/mm/dd

I want to use spark SQL or pyspark to reformat a date field from 'dd/mm/yyyy' to 'yyyy/mm/dd'. The field type is string:
from pyspark.sql import SparkSession
from pyspark.sql.types import StringType
spark = SparkSession.builder.master("local[1]")\
.appName("date.com")\
.getOrCreate()
my_df = spark.createDataFrame(["13/04/2020", "16/04/2020", "19/04/2020"], StringType()).toDF("date")
expected_df = spark.createDataFrame(["2020/04/12", "2020/04/16", "2020/04/19"], StringType()).toDF("date")
I have tried the following spark sql command, but this returns the date as literally 'yyyy/MM/dd' rather than '2020/04/12'.
select date_format(date, 'dd/MM/yyyy'), 'yyyy/MM/dd' as reformatted_date
FROM my_df
I have also looked at the following documentation but didn't see anything that fits my scenario: https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html
If it's not possible in spark sql then pyspark would work.
Any ideas?
You need to convert to date type using to_date first:
select date_format(to_date(date, 'dd/MM/yyyy'), 'yyyy/MM/dd') as reformatted_date
from my_df
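The same thing can be done with the PySpark DataFrame API (the question allows either); a sketch reusing the my_df defined above:
from pyspark.sql.functions import to_date, date_format

my_df.select(date_format(to_date("date", "dd/MM/yyyy"), "yyyy/MM/dd").alias("reformatted_date")).show()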
df1.select(to_date(date_format(to_date(lit("12/12/2020"), "dd/MM/yyyy"), "yyyy-MM-dd")).alias("campo")).show()

How do I add a new date column with constant value to a Spark DataFrame (using PySpark)?

I want to add a column with a default date ('1901-01-01') to an existing dataframe using PySpark.
I used below code snippet
from pyspark.sql import functions as F
strRecordStartTime="1970-01-01"
recrodStartTime=hashNonKeyData.withColumn("RECORD_START_DATE_TIME",
lit(strRecordStartTime).cast("timestamp")
)
It gives me following error
org.apache.spark.sql.AnalysisException: cannot resolve '1970-01-01'
Any pointer is appreciated.
Try using a native Python datetime with lit (make sure datetime and lit are both imported). Sorry, I don't have access to a machine right now.
recrodStartTime = hashNonKeyData.withColumn('RECORD_START_DATE_TIME', lit(datetime.datetime(1970, 1, 1)))
I have created one spark dataframe:
from pyspark.sql.types import StringType
df1 = spark.createDataFrame(["Ravi","Gaurav","Ketan","Mahesh"], StringType()).toDF("Name")
Now let's add one new column to the existing dataframe:
from pyspark.sql.functions import lit
import dateutil.parser
yourdate = dateutil.parser.parse('1901-01-01')
df2 = df1.withColumn('Age', lit(yourdate))  # addition of new column
df2.show()  # to print the dataframe
You can validate your schema by using the command below.
df2.printSchema()
Hope that helps.
from pyspark.sql import functions as F
strRecordStartTime = "1970-01-01"
recrodStartTime = hashNonKeyData.withColumn("RECORD_START_DATE_TIME", F.to_date(F.lit(strRecordStartTime)))
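Note that to_date produces a DateType column; if the goal, as in the original snippet, is a timestamp, a hedged variant would be:
recrodStartTime = hashNonKeyData.withColumn("RECORD_START_DATE_TIME", F.to_timestamp(F.lit(strRecordStartTime)))  # or: F.lit(strRecordStartTime).cast("timestamp")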

Save and append a file in HDFS using PySpark

I have a data frame in PySpark called df. I have registered this df as a temptable like below.
df.registerTempTable('mytempTable')
date=datetime.now().strftime('%Y-%m-%d %H:%M:%S')
Now from this temp table I will get certain values, like max_id of a column id
min_id = sqlContext.sql("select nvl(min(id),0) as minval from mytempTable").collect()[0].asDict()['minval']
max_id = sqlContext.sql("select nvl(max(id),0) as maxval from mytempTable").collect()[0].asDict()['maxval']
Now I will collect all these values like below.
test = ("{},{},{}".format(date,min_id,max_id))
I found that test is not a data frame; it is a str (string):
>>> type(test)
<type 'str'>
Now I want to save this test as a file in HDFS. I would also like to append data to the same file in HDFS.
How can I do that using PySpark?
FYI I am using Spark 1.6 and don't have access to Databricks spark-csv package.
Here you go, you'll just need to concat your data with concat_ws and write it as text:
query = """select concat_ws(',', date, nvl(min(id), 0), nvl(max(id), 0))
from mytempTable"""
sqlContext.sql(query).write("text").mode("append").save("/tmp/fooo")
Or, an even better alternative:
from pyspark.sql import functions as f
(sqlContext
.table("myTempTable")
.select(f.concat_ws(",", f.first(f.lit(date)), f.min("id"), f.max("id")))
.coalesce(1)
.write.format("text").mode("append").save("/tmp/fooo"))
