Casting date to integer returns null in Spark SQL - apache-spark

I want to convert a date column into integer using Spark SQL.
I'm following this code, but I want to use Spark SQL and not PySpark.
Reproduce the example:
from pyspark.sql.types import *
import pyspark.sql.functions as F
# DUMMY DATA
simpleData = [("James",34,"2006-01-01","true","M",3000.60),
("Michael",33,"1980-01-10","true","F",3300.80),
("Robert",37,"1992-07-01","false","M",5000.50)
]
columns = ["firstname","age","jobStartDate","isGraduated","gender","salary"]
df = spark.createDataFrame(data = simpleData, schema = columns)
df = df.withColumn("jobStartDate", df['jobStartDate'].cast(DateType()))
df = df.withColumn("jobStartDateAsInteger1", F.unix_timestamp(df['jobStartDate']))
display(df)
What I want is to do the same transformation, but using Spark SQL. I am using the following code:
df.createOrReplaceTempView("date_to_integer")
%sql
select
seg.*,
CAST (jobStartDate AS INTEGER) as JobStartDateAsInteger2 -- return null value
from date_to_integer seg
How to solve it?

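Spark has no direct DATE-to-INTEGER conversion, so the plain CAST yields null. First CAST jobStartDate to DATE, then use UNIX_TIMESTAMP to turn it into a Unix timestamp (seconds since the epoch) as an integer.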
SELECT
seg.*,
UNIX_TIMESTAMP(CAST (jobStartDate AS DATE)) AS JobStartDateAsInteger2
FROM date_to_integer seg
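To sanity-check the numbers the query returns, the same conversion can be reproduced outside Spark. This is only a sketch in plain Python: the helper name `date_to_unix_seconds` is hypothetical, and it assumes the Spark session time zone is UTC (UNIX_TIMESTAMP interprets the date in the session time zone, so other settings shift the result).

```python
from datetime import date, datetime, timezone


def date_to_unix_seconds(iso_date: str) -> int:
    """Return seconds since the Unix epoch for midnight UTC of an ISO date."""
    d = date.fromisoformat(iso_date)
    # Assumption: the date is interpreted at 00:00:00 in UTC.
    midnight = datetime(d.year, d.month, d.day, tzinfo=timezone.utc)
    return int(midnight.timestamp())


# The jobStartDate values from the dummy data in the question.
for s in ("2006-01-01", "1980-01-10", "1992-07-01"):
    print(s, date_to_unix_seconds(s))
```

If the values from Spark differ by a whole number of hours, the session time zone is the likely cause.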

Related

Apache Spark function to_timestamp() not working with PySpark on Databricks

I'm getting NULL output when I execute the code to_timestamp()
The code that I'm executing is as follows:
.withColumn("LAST_MODIFICATION_DT", to_timestamp(concat(col('LAST_MOD_DATE'), lit(' '), col('LAST_MOD_TIME')), 'yyy-MM-dd HH:mm:ss'))
The schema for the fields LAST_MOD_DATE & LAST_MOD_TIME is as follows:
I'm getting the output 'NULL' for the column 'LAST_MODIFICATION_DT'
Any thoughts?
In Spark SQL, concat doesn't convert null to ''; any null argument makes the whole result null. It's often easier to write these kinds of expressions in Python and register them as UDFs, e.g.:
from pyspark.sql.types import StringType
def concat2_(s1, s2) -> str:
    return str(s1) + ' ' + str(s2)
concat2 = spark.udf.register("concat2", concat2_, StringType())
Then you can use it in Spark queries, either from Python,
from pyspark.sql.functions import col
df = spark.sql('select 1 a, 2 b').withColumn("c",concat2(col('a'),col('b')))
display(df)
or SQL
%sql
with q as
(select 1 a, 2 b)
select a,b,concat2(a,b) c
from q
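One caveat with the UDF above: in Python, str(None) is the string 'None', so a null input produces text such as 'None 12:00:00' rather than a null. A minimal sketch of a variant that handles nulls explicitly follows; the name `concat2_null_safe` and the `sep` parameter are hypothetical, and the function is plain Python that could be registered with spark.udf.register in the same way as concat2_.

```python
from typing import Any, Optional


def concat2_null_safe(s1: Any, s2: Any, sep: str = " ") -> Optional[str]:
    """Join two values with a separator, returning None if either is None."""
    # Mirror SQL concat semantics: a null input gives a null result
    # instead of the literal text 'None'.
    if s1 is None or s2 is None:
        return None
    return str(s1) + sep + str(s2)


print(concat2_null_safe(1, 2))
print(concat2_null_safe(None, "12:00:00"))
```

Returning None keeps the failure visible as a null in the DataFrame; whether that or an empty-string fallback is preferable depends on the data.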

spark convert datetime to timestamp

I have a column in pyspark dataframe which is in the format 2021-10-28T22:19:03.0030059Z (string datatype). How to convert this into a timestamp datatype in pyspark?
I'm using the code snippet below, but it returns nulls because it's unable to convert the value. Can someone please recommend how to convert this?
df3.select(to_timestamp(df.DateTime, 'yyyy-MM-ddHH:mm:ss:SSS').alias('dt'),col('DateTime')).show()
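You have to escape the literal characters T and Z by wrapping them in single quotes, and the pattern has to match the fractional seconds in the data: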
import pyspark.sql.functions as F
df = spark.createDataFrame([{"DateTime": "2021-10-28T22:19:03.0030059Z"}])
df.select(F.to_timestamp(df.DateTime, "yyyy-MM-dd'T'HH:mm:ss.SSSSSSS'Z'").alias('dt'),F.col('DateTime')).show(truncate = False)
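To cross-check the timestamp Spark produces, the same string can be parsed in plain Python. This is a sketch under assumptions, not part of the answer above: the helper name `parse_utc_timestamp` is hypothetical, the trailing Z is taken to mean UTC, and because Python's %f directive accepts at most six fractional digits, the seventh digit is truncated.

```python
from datetime import datetime, timezone


def parse_utc_timestamp(s: str) -> datetime:
    """Parse e.g. '2021-10-28T22:19:03.0030059Z' into an aware UTC datetime."""
    if not s.endswith("Z"):
        raise ValueError("expected a trailing 'Z' (UTC designator)")
    # Split the fractional seconds off so they can be cut to microseconds.
    base, _, frac = s[:-1].partition(".")
    micros = (frac + "000000")[:6]  # pad or truncate to exactly six digits
    parsed = datetime.strptime(base + "." + micros, "%Y-%m-%dT%H:%M:%S.%f")
    return parsed.replace(tzinfo=timezone.utc)


print(parse_utc_timestamp("2021-10-28T22:19:03.0030059Z"))
```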

Spark SQL: Parse date string from dd/mm/yyyy to yyyy/mm/dd

I want to use spark SQL or pyspark to reformat a date field from 'dd/mm/yyyy' to 'yyyy/mm/dd'. The field type is string:
from pyspark.sql import SparkSession
from pyspark.sql.types import StringType
spark = SparkSession.builder.master("local[1]")\
.appName("date.com")\
.getOrCreate()
my_df = spark.createDataFrame(["13/04/2020", "16/04/2020", "19/04/2020"], StringType()).toDF("date")
expected_df = spark.createDataFrame(["2020/04/12", "2020/04/16", "2020/04/19"], StringType()).toDF("date")
I have tried the following spark sql command, but this returns the date as literally 'yyyy/MM/dd' rather than '2020/04/12'.
select date_format(date, 'dd/MM/yyyy'), 'yyyy/MM/dd' as reformatted_date
FROM my_df
I have also looked at the following documentation but didn't see anything that fits my scenario: https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html
If it's not possible in spark sql then pyspark would work.
Any ideas?
You need to convert to date type using to_date first:
select date_format(to_date(date, 'dd/MM/yyyy'), 'yyyy/MM/dd') as reformatted_date
from my_df
df1.select( to_date(date_format(to_date(lit("12/12/2020"), "dd/MM/yyyy"), "yyyy-MM-dd") ).as("campo")).show()
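The parse-then-format idea behind these answers can be checked outside Spark with the standard library. The sketch below is plain Python rather than Spark code, and the helper name `reformat_date` is hypothetical; note that Python's strptime directives (%d/%m/%Y) differ from Spark's datetime pattern letters (dd/MM/yyyy).

```python
from datetime import datetime


def reformat_date(s: str) -> str:
    """Convert a 'dd/mm/yyyy' string into 'yyyy/mm/dd'."""
    # Step 1: parse the string into a real date value (like to_date).
    parsed = datetime.strptime(s, "%d/%m/%Y")
    # Step 2: render that value in the target layout (like date_format).
    return parsed.strftime("%Y/%m/%d")


# The sample values from the question.
for s in ["13/04/2020", "16/04/2020", "19/04/2020"]:
    print(s, "->", reformat_date(s))
```

Going through a typed date in the middle, instead of rearranging substrings, means an invalid input such as '31/02/2020' raises an error rather than being silently reshuffled.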

How to convert a sql to spark dataset?

I have val test = sql("Select * from table1"), which returns a DataFrame. I want to convert it to a Dataset, but that is not working.
test.toDS throws an error.
Please provide more detail about the error.
If you want to convert a DataFrame into a Dataset, use the code below:
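import org.apache.spark.sql.Dataset
import spark.implicits._ // encoders for case classes; assumes a SparkSession named spark

case class MyClass(field1: Int, field2: Long) // for example; fields must match the table's columns
val df = sql("Select * from table1")
val ds: Dataset[MyClass] = df.as[MyClass]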

Save and append a file in HDFS using PySpark

I have a data frame in PySpark called df. I have registered this df as a temptable like below.
df.registerTempTable('mytempTable')
date=datetime.now().strftime('%Y-%m-%d %H:%M:%S')
Now from this temp table I will get certain values, like max_id of a column id
min_id = sqlContext.sql("select nvl(min(id),0) as minval from mytempTable").collect()[0].asDict()['minval']
max_id = sqlContext.sql("select nvl(max(id),0) as maxval from mytempTable").collect()[0].asDict()['maxval']
Now I will collect all these values like below.
test = ("{},{},{}".format(date,min_id,max_id))
I found that test is not a data frame but a plain string (str):
>>> type(test)
<type 'str'>
Now I want to save this test as a file in HDFS. I would also like to append data to the same file in HDFS.
How can I do that using PySpark?
FYI I am using Spark 1.6 and don't have access to Databricks spark-csv package.
Here you go, you'll just need to concatenate your data with concat_ws and write it as text:
# date is the Python string built earlier, so it is interpolated as a SQL string literal
query = """select concat_ws(',', '{}', nvl(min(id), 0), nvl(max(id), 0))
from mytempTable""".format(date)
sqlContext.sql(query).write.format("text").mode("append").save("/tmp/fooo")
Or, as an even better alternative:
from pyspark.sql import functions as f
(sqlContext
.table("myTempTable")
.select(f.concat_ws(",", f.first(f.lit(date)), f.min("id"), f.max("id")))
.coalesce(1)
.write.format("text").mode("append").save("/tmp/fooo"))
