Date conversion in pyspark or sparksql - apache-spark

I currently have a field with the date format below.
3/2/2021 18:48
I need to convert it to 2021-03-02. I tried taking a substring and converting it to a date format, but that did not produce the desired output. Any suggestions would be helpful.

Below is an option if you are using Spark SQL:
from_unixtime(unix_timestamp('3/2/2021 18:48', 'M/d/yyyy'), 'yyyy-MM-dd')
The same functions are available in the DataFrame API as well:
https://spark.apache.org/docs/2.4.0/api/sql/index.html#from_unixtime
https://spark.apache.org/docs/2.4.0/api/sql/index.html#unix_timestamp
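For the DataFrame API in PySpark, a minimal sketch of the same approach could look like the following (the column name raw_date and the sample row are assumptions for illustration; the pattern here also consumes the time portion, which keeps it working under Spark 3.x's stricter parser):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Sample data; "raw_date" is a placeholder column name.
df = spark.createDataFrame([("3/2/2021 18:48",)], ["raw_date"])

# Parse the string including its time portion, then re-format as yyyy-MM-dd.
result = df.withColumn(
    "formatted_date",
    F.from_unixtime(F.unix_timestamp("raw_date", "M/d/yyyy H:mm"), "yyyy-MM-dd"),
)
result.show()  # formatted_date -> 2021-03-02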

Related

How to read a string column in the format ‘15Aug21:12:45:24’ as a timestamp in Teradata?

I have a character column in a Teradata table with values in a format like this: ‘15AUG21:06:38:03’. I need to convert this column into a timestamp so that I can use it in an ORDER BY statement. I am using Teradata SQL Assistant to read the data.
Use TO_TIMESTAMP:
SELECT TO_TIMESTAMP ('15AUG21:06:38:03', 'DDMONYY:HH24:MI:SS');

Specifying timestamp or date format in an Athena table

I have a timestamp in ISO-8601 format and want to specify it as either a timestamp or a datetime format when creating a table in Athena. Any clues on how to do this?
Thanks!
When you create a table in Athena you can set a column as date or timestamp only in the UNIX format, as follows:
DATE, in the UNIX format, such as YYYY-MM-DD.
TIMESTAMP. Instant in time and date in the UNIX format, such as yyyy-mm-dd hh:mm:ss[.f...]. For example, TIMESTAMP '2008-09-15 03:04:05.324'. This format uses the session time zone.
If the format is different, define the column as a string and, when you query the data, use the date function:
from_iso8601_date(string) → date
For specific use cases you can make the data easier and cheaper to query by using a CTAS (CREATE TABLE AS SELECT) query, which generates a new copy of the data in a simpler and more efficient (compressed and columnar) Parquet format.

How to handle dates in cx_Oracle using Python?

I'm trying to access an Oracle table using the cx_Oracle module and convert it to a DataFrame. Everything is fine except that a couple of date columns have dates in a format like "01-JAN-01"; Python interprets them as datetime.datetime(1, 1, 1, 0, 0), and after creating the DataFrame they show up as 0001-01-01 00:00:00. I am expecting 2001-01-01 00:00:00. Please help me with this. Thanks in advance.
You have a couple of choices. You could
* Retrieve it from the Oracle database with [read_sql](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_sql.html), specifying the date in a format (via TO_CHAR) more appropriate for pandas' default date handling
* Retrieve it from the database as a string (as above) and then convert it into a date within pandas (see the sketch below).
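Here is a rough sketch of both options (the connection details, table name, and column name are placeholders, not from the original post):

import cx_Oracle
import pandas as pd

# Placeholder connection details; replace with your own credentials/DSN.
conn = cx_Oracle.connect("user", "password", "host:1521/service")

# Option 1: have Oracle return the date as an unambiguous string via TO_CHAR.
query = """
    SELECT TO_CHAR(date_col, 'YYYY-MM-DD HH24:MI:SS') AS date_col
    FROM my_table
"""
df = pd.read_sql(query, conn)

# Option 2: convert the string column to a proper datetime inside pandas.
# Note: Oracle returns unquoted column names in uppercase.
df["DATE_COL"] = pd.to_datetime(df["DATE_COL"], format="%Y-%m-%d %H:%M:%S")
print(df.dtypes)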

Why does Spark return different values for the unix_timestamp function?

One of my Hive tables has a column that holds dates in string format. After loading that table into Spark I convert those dates to unix timestamps using the unix_timestamp() function, and I get different values for the same date and the same date format in different Spark environments.
Here is the sample date for one of the rows, 2017-08-04 03:26:51.756658, and the date format passed to the unix_timestamp() function is yyyy-MM-dd HH:mm:ss.SSSSSS.
The actual command used is as follows:
val baseWithUnixTime = base
  .withColumn("ZZOTIMSTP", $"ZZOTIMSTP".cast(TimestampType))
  .withColumn("unix_time", unix_timestamp($"ZZOTIMSTP", "yyyy-MM-dd HH:mm:ss.SSSSSS"))
When I execute the above command in my local Spark, for the value 2017-08-04 03:26:51.756658 I get 1501813611 as the unix timestamp.
When I execute the same command in an EMR Spark cluster, I get 1501817211.
If I try the same thing in Hive with select unix_timestamp("2017-08-04 03:26:51.756658", "yyyy-MM-dd HH:mm:ss.SSSSSS");, I get 1501817967.
To summarize, the results per environment are as follows:
+----------------------------+------------+
| local Spark (version 2.2)  | 1501813611 |
+----------------------------+------------+
| Spark in EMR (version 2.1) | 1501817211 |
+----------------------------+------------+
| Hive                       | 1501817967 |
+----------------------------+------------+
I wonder which one is giving me the true value? Why do Hive and Spark give different values for the same function and the same input?
Use this Spark setting:
set('spark.sql.session.timeZone', 'GMT+XX')
where XX is the desired offset. The values most likely differ because each environment uses a different default session time zone, and unix_timestamp interprets the input string in that zone; pinning the time zone makes the results consistent.
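For example, in PySpark the session time zone can be set before parsing (the "UTC" value and the single-row DataFrame below are placeholders for illustration):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Pin the session time zone so parsing no longer depends on the cluster default.
# "UTC" is a placeholder; use whatever offset/zone your data was recorded in.
spark.conf.set("spark.sql.session.timeZone", "UTC")

df = spark.createDataFrame([("2017-08-04 03:26:51.756658",)], ["ZZOTIMSTP"])
df.select(
    F.unix_timestamp("ZZOTIMSTP", "yyyy-MM-dd HH:mm:ss.SSSSSS").alias("unix_time")
).show()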

Unable to read timestamp value in pyspark from Hive (Spark 1.6.1)

I am trying to read a Hive table that has a date column with datatype timestamp, length = 9.
My code looks something like the following:
df = hc.sql("select * from schema.table")
It can read all other columns (datatype = varchar), but it either reads null or gives None in the date column.
I printed df.dtypes and found that the DataFrame schema was inferred correctly and the date columns have the timestamp datatype.
Surprisingly, the same code works in a notebook and only fails in the spark-shell environment.
Can someone tell me what could be wrong, or what limitation causes this error, and how I could rectify it?
I have seen this problem in Spark, where it displays null when the datatype is timestamp; it is a bug. There is a way to get around it: read that date column as a string, using something like to_char(column_name, 'YYYY-MM-DD HH-MM-SS') as column_name, and then cast it to a timestamp (a rough sketch follows below). If you can tell me the source type and the tool you used to pull the data, such as Sqoop, or whether you are getting the data in some form of files, I can help you better.
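Assuming a Hive source, a minimal PySpark sketch of that workaround might look like this (Hive has no TO_CHAR, so CAST ... AS STRING is used here instead; the column names event_ts and other_col are placeholders, and hc is the HiveContext from the question):

from pyspark.sql import functions as F

# Read the timestamp column as a string on the Hive side, then cast it back to
# a timestamp in the DataFrame.
df = hc.sql("""
    SELECT CAST(event_ts AS STRING) AS event_ts_str, other_col
    FROM schema.table
""")
df = df.withColumn("event_ts", F.col("event_ts_str").cast("timestamp"))
df.show()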
