Timestamp casting makes value null - apache-spark

When I cast the column datatype from string to timestamp, the value becomes null.
I have values in the following format
20070811T00789.167861+0100
I want to cast the type to "timestamp". When I do the following:
df.withColumn('arrivetime', df['arrivetime'].cast('timestamp'))
the value becomes null. How can I cast the column to timestamp without affecting the value and its format?

I don't know exactly what format you are going for with the 5 digits for the time and the 6 (nanoseconds?) at the end, but do know that timestamps in Spark are milliseconds, not nanoseconds, so you are going to lose information.
That being said, you can use Spark's unix_timestamp method to convert strings to timestamps using the SimpleDateFormat syntax.
First you probably have to get rid of the last 3 digits of the fractional seconds, using Spark's regexp_replace.
In Scala that would look like:
regexp_replace(df("arrivetime"), """(\.\d{3})\d*""", """$1""")
Then you could use unix_timestamp like so:
unix_timestamp([replaced string], "yyyyMMdd'T'HHmmss.SSSz")
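Putting the two steps together in PySpark, a rough sketch (reusing the answer's format string; a numeric offset like +0100 may need Z rather than z in the pattern, and unix_timestamp resolves to whole seconds, so sub-second detail is dropped):
from pyspark.sql.functions import regexp_replace, unix_timestamp, col
# Keep only the first three fractional digits, then parse and cast to timestamp
cleaned = regexp_replace(col("arrivetime"), r"(\.\d{3})\d*", "$1")
df = df.withColumn(
    "arrivetime",
    unix_timestamp(cleaned, "yyyyMMdd'T'HHmmss.SSSz").cast("timestamp"),
)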

Related

Hive TimeStamp column with TimeZone

I have created a table in Hive with one column of timestamp datatype. While inserting into Hive, I am getting a value different from the existing one.
My column's expected value: 2021-11-03 16:57:10.842 UTC (I am receiving this as a string). How can I store the above output in the Hive table (column with datatype timestamp)?
You need to use cast to convert this to timestamp after removing the word UTC. Since Hive intentionally ignores time zones and displays data in UTC, you should be in good shape.
select cast( substr('2021-11-03 16:57:10.84 UTC',1,23) as timestamp) as ts
Please note you need to have the data in the above yyyy-MM-dd hh:mm:ss.SS format.
Also note that you cannot use from_unixtime(unix_timestamp(string_col, 'dd-MM-yyyy HH:mm:ss.SSS')) because it will lose the millisecond part.
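If you happen to be loading this table from Spark rather than querying Hive directly, the same trim-and-cast pattern might look like the sketch below (the spark session, table name, and column name ts_string are placeholders, not from the question):
from pyspark.sql.functions import substring, col
# substring(..., 1, 23) drops the trailing " UTC" before casting to timestamp
df = spark.table("my_schema.my_table") \
          .withColumn("ts", substring(col("ts_string"), 1, 23).cast("timestamp"))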

Presto epoch string to timestamp

I require your help as I am stuck on a time conversion in Presto.
I have an epoch column named timestamp with a string datatype, and I want to convert it into a date/timestamp.
I have used the below query after reading through various blogs:
SELECT date_parse(to_iso8601(from_unixtime(CAST(timestamp AS bigint)) AS date ,
'%Y-%m-%dT%H:%i:%s.%fZ'))
FROM wqmparquet;
Every time I run this query I get an error:
INVALID_FUNCTION_ARGUMENT: Invalid format: "2020-04-27T19:49:50.000Z" is malformed at "T19:49:50.000Z"
Can somebody please help me with this?
I might be oversimplifying this, but if you want to convert an epoch string to a timestamp datatype, you can just do:
from_unixtime(cast(timestamp as bigint))
You can generate a timestamp with time zone by passing a time zone string as the second argument to from_unixtime().

Azure Data Factory Mapping Data Flow: Epoch timestamp to Datetime

I have a JSON-based source I'd like to transform using ADF Mapping Data Flow. I have a string containing an epoch timestamp value that I want to transform into a Datetime value so I can later sink it into a Parquet file.
Do you know a way? Docs of this language are here.
Source file:
{
"timestamp":"1574127407",
"name":"D.A."
}
Use toTimestamp() and set the format you wish as the 2nd parameter:
toTimestamp(1574127407*1000l)
From string:
toTimestamp(toInteger(toString(byName('timestamp')))*1000l,'yyyy-MM-dd HH:mm:ss')
I have come across various epoch timestamp values which are 13 digits long, i.e. they also carry millisecond information.
In that case, converting to integer using toInteger won't serve the purpose; it will keep the values as NULL. To fix this, we need to convert to a long using toLong as below:
toTimestamp(toLong(toString(created)),'yyyy-MM-dd HH:mm:ss')
In the above expression, 'created' is a field whose value is a 13-digit epoch timestamp, e.g. created='1635359043307'.
Here, toTimestamp returns the timestamp in the above-mentioned date format.
FYI, you can use https://www.epochconverter.com/ to convert an epoch timestamp to a human-readable date.

Auto infer schema from parquet/ selectively convert string to float

I have a parquet file with 400+ columns. When I read it, the default datatype attached to a lot of the columns is String (maybe due to the schema specified by someone else).
I was not able to find a parameter similar to
inferSchema=True  # present for spark.read.csv, but not for spark.read.parquet
I tried changing
mergeSchema=True #but it doesn't improve the results
To manually cast columns as float, I used
df_temp.select(*(col(c).cast("float").alias(c) for c in df_temp.columns))
This runs without error, but converts all the actual string column values to null. I can't wrap this in a try/catch block as it isn't throwing any error.
Is there a way I can check whether a column contains only integer/float values and selectively cast those columns to float?
Parquet columns are typed, so there is no such thing as schema inference when loading Parquet files.
Is there a way I can check whether a column contains only integer/float values and selectively cast those columns to float?
You can use the same logic as Spark: define a preferred type hierarchy and attempt to cast until you find the most selective type that parses all values in the column (see the sketch after the linked questions below).
How to force inferSchema for CSV to consider integers as dates (with "dateFormat" option)?
Spark data type guesser UDAF
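A minimal PySpark sketch of that check-and-cast idea (the helper name is made up, and it only tries float, per the question; the test is simply that the cast introduces no new nulls):
from pyspark.sql import functions as F
from pyspark.sql.types import StringType

def cast_parsable_string_cols_to_float(df):
    """Cast a string column to float only if every non-null value parses as a number."""
    string_cols = [f.name for f in df.schema.fields if isinstance(f.dataType, StringType)]
    for c in string_cols:
        counts = df.agg(
            F.count(F.col(c)).alias("total"),
            F.count(F.col(c).cast("float")).alias("parsed"),
        ).head()
        # Casting a non-numeric string yields null, so parsed < total means the
        # column holds non-numeric values and should stay as string.
        if counts["total"] > 0 and counts["total"] == counts["parsed"]:
            df = df.withColumn(c, F.col(c).cast("float"))
    return df
Note that this runs one aggregation per string column; with 400+ columns you would want to collect all the counts in a single agg instead.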
There's no easy way currently. There is an existing GitHub issue that can be referred to:
https://github.com/databricks/spark-csv/issues/264
Something like https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVInferSchema.scala exists for Scala; this could be created for PySpark.

Unable to read timestamp value in pyspark from Hive (spark 1.6.1)

I am trying to read a Hive table that has a date column with datatype timestamp, length=9.
My code looks something like the following:
df = hc.sql("select * from schema.table")
It can read all other columns (datatype = varchar).
It either reads null or gives None in the date column.
I printed df.dtypes and found that the DataFrame schema was inferred correctly and the date columns have the timestamp datatype.
Surprisingly, the same code works in a notebook and only fails in the spark-shell environment.
Can someone tell me what could be wrong, or what limitation causes this error, and how I could rectify it?
I have seen this problem in Spark, where it will display null when the datatype is timestamp; it's a bug. There is a way to get around it: read that date column as a string, using something like to_char(column_name, 'YYYY-MM-DD HH-MM-SS') as column_name, and then cast it to timestamp. If you can tell me the source type and the tool you used to pull the data (e.g. Sqoop, or are you getting the data in some form of files?), I can help you better.
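A rough PySpark sketch of that workaround (table and column names are placeholders; this uses a plain cast to string in the HiveQL rather than to_char, then casts back to timestamp on the Spark side):
# Read the timestamp column back as a string in HiveQL, then cast it in Spark
df = hc.sql("select cast(event_ts as string) as event_ts_str, other_col from schema.table")
df = df.withColumn("event_ts", df["event_ts_str"].cast("timestamp"))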
