Why spark returns different value for unix_timestamp function? - apache-spark

One of my Hive tables has a column that stores dates as strings. After loading that table into Spark I convert those dates to unix timestamps using the unix_timestamp() function, but I get different values for the same date and the same date format in different Spark environments.
Here is a sample value from one of the rows: 2017-08-04 03:26:51.756658, and the date format passed to unix_timestamp() is yyyy-MM-dd HH:mm:ss.SSSSSS
The actual command used is as follows:
val baseWithUnixTime = base
  .withColumn("ZZOTIMSTP", $"ZZOTIMSTP".cast(TimestampType))
  .withColumn("unix_time", unix_timestamp($"ZZOTIMSTP", "yyyy-MM-dd HH:mm:ss.SSSSSS"))
When I execute the above command in my local Spark, I get 1501813611 as the unix timestamp for the value 2017-08-04 03:26:51.756658.
When I execute the same command in the EMR Spark cluster, I get 1501817211.
If I try the same thing in Hive using select unix_timestamp("2017-08-04 03:26:51.756658", "yyyy-MM-dd HH:mm:ss.SSSSSS");, I get 1501817967.
To summarize, the results per environment are as follows:
+----------------------------+------------+
| local Spark (version 2.2)  | 1501813611 |
+----------------------------+------------+
| Spark in EMR (version 2.1) | 1501817211 |
+----------------------------+------------+
| Hive                       | 1501817967 |
+----------------------------+------------+
I wonder which one gives the true value. Why do Hive and Spark return different values for the same function and the same input?
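The gaps between the three results are themselves suggestive (a quick sanity check on the numbers, not from the original thread): local vs. EMR differ by exactly one hour, which points at a session time-zone difference, and the Hive result is 756 seconds higher, which matches SimpleDateFormat reading the fractional part 756658 as milliseconds rather than microseconds:

```python
# Differences between the three reported unix_timestamp results
local_spark = 1501813611
emr_spark = 1501817211
hive = 1501817967

# EMR vs. local: exactly one hour -> a session time-zone offset
print(emr_spark - local_spark)  # 3600

# Hive vs. EMR: 756 seconds. SimpleDateFormat treats "SSSSSS" as a
# millisecond field, so .756658 adds 756658 ms ~= 756 extra seconds.
print(hive - emr_spark)         # 756
print(756658 // 1000)           # 756
```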

Use this Spark setting:
spark.conf.set("spark.sql.session.timeZone", "GMT+XX")
where XX is the desired offset.
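The effect of the session time zone can be illustrated in plain Python (a sketch of the underlying behavior, not Spark itself): interpreting the same wall-clock string in two different zones yields epoch values exactly one hour apart, matching the two Spark results above.

```python
from datetime import datetime, timezone, timedelta

s = "2017-08-04 03:26:51"
naive = datetime.strptime(s, "%Y-%m-%d %H:%M:%S")

# Same wall-clock time, two different zones. A clock ahead of UTC
# (GMT+1) means the instant happened earlier, so the epoch is smaller.
utc_epoch = naive.replace(tzinfo=timezone.utc).timestamp()
gmt1_epoch = naive.replace(tzinfo=timezone(timedelta(hours=1))).timestamp()

print(int(utc_epoch))              # 1501817211 (the EMR value)
print(int(gmt1_epoch))             # 1501813611 (the local value)
print(int(utc_epoch - gmt1_epoch)) # 3600
```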

Related

Read partitioned Hive table in pyspark instead of a parquet

I have a partitioned parquet dataset. It is partitioned by date like:
/server/my_dataset/dt=2021-08-02
/server/my_dataset/dt=2021-08-01
/server/my_dataset/dt=2021-07-31
...
The size is huge, so I do not want to read it all at once, and I only need the August part; therefore I use:
spark.read.parquet("/server/my_dataset/dt=2021-08*")
It works just fine. However, I am forced to move from reading the parquet directly to reading from the corresponding Hive table, something like:
spark.read.table("schema.my_dataset")
However I want to keep the same logic of reading only certain partitions of the data. Is there a way to do so?
Try filter with the like operator.
Example:
spark.read.table("schema.my_dataset").filter(col("dt").like("2021-08%"))
UPDATE:
You can get all the August partition values into a variable, then filter with an isin clause.
Example:
// get the partition values into a variable and keep only those required
val lst = df.select("dt").distinct.collect().map(x => x(0).toString)
// then use the isin function to filter only the required partitions
df.filter(col("dt").isin(lst:_*)).show()
Equivalent Python sample code:
lst=[1,2]
df.filter(col("dt").isin(*lst)).show()

Date conversion in pyspark or sparksql

I currently have a field with the date format below:
3/2/2021 18:48
I need to convert it to 2021-03-02. I tried taking a substring and converting it to a date format, but it does not produce the desired output. Any suggestions would be helpful.
If you are using Spark SQL:
from_unixtime(unix_timestamp('3/2/2021 18:48', 'M/d/yyyy'), 'yyyy-MM-dd')
The same functions are available in the DataFrame API as well:
https://spark.apache.org/docs/2.4.0/api/sql/index.html#from_unixtime
https://spark.apache.org/docs/2.4.0/api/sql/index.html#unix_timestamp
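The same parse-then-reformat logic can be sketched in plain Python (an illustration of the pattern, not Spark itself; the strptime pattern %m/%d/%Y %H:%M mirrors the Spark pattern M/d/yyyy):

```python
from datetime import datetime

# Parse "month/day/year hour:minute", then reformat as an ISO date.
# %m and %d accept non-zero-padded values, like M and d in Spark.
parsed = datetime.strptime("3/2/2021 18:48", "%m/%d/%Y %H:%M")
print(parsed.strftime("%Y-%m-%d"))  # 2021-03-02
```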

Spark SQL ignoring dynamic partition filter value

Running into an issue on Spark 2.4 on EMR 5.20 in AWS.
I have a string column as a partition column, holding date values. My goal is to use the max value of this column as a filter. The values look like 2019-01-01 for January 1st, 2019.
In the following query, I am trying to filter to a certain date value (a string data type), and Spark ends up reading all directories, not just the one matching max(value).
spark.sql("select mypartitioncolumn, column1, column2 from mydatabase.mytable where mypartitioncolumn= (select max(mypartitioncolumn) from myothertable) group by 1,2,3 ").show
However, if I hardcode the value, it reads only the proper directory.
spark.sql("select mypartitioncolumn, column1, column2 from mydatabase.mytable where mypartitioncolumn= '2019-01-01' group by 1,2,3 ").show
Why does Spark not treat both methods the same way? I made sure that running the select max(mypartitioncolumn) from myothertable query returns exactly the same value as my hardcoded method (and the same datatype).
I can't find anything in the documentation that differentiates partition querying other than data type differences. I checked that the schema in both the source table and the value are string types, and also tried casting the value to a string, cast( (select max(mypartitioncolumn) from myothertable) as string); it makes no difference.
Workaround: change this configuration:
sql("set spark.sql.hive.convertMetastoreParquet = false")
Spark docs
"When reading from and writing to Hive metastore Parquet tables, Spark SQL will try to use its own Parquet support instead of Hive SerDe for better performance. This behavior is controlled by the spark.sql.hive.convertMetastoreParquet configuration, and is turned on by default."

Error while writing data from python to redshift - Invalid date format - length must be 10 or more

I have a dataframe in Python whose date columns have the datetime64[ns] data type. Now I am trying to write this dataframe to Redshift, and I am getting the following stl_load_errors:
Invalid date format - length must be 10 or more
All my dates are in the 2016-10-21 format and thus have a length of 10. Moreover, I have ensured that no row has a malformed value like 2016-1-8, which would have only 8 characters. So the error does not make sense.
Has anyone faced a similar error while writing data to Redshift? Any explanation?
Note:
Here's some context. I am running the Python script from EC2. The script writes the data in JSON format to an S3 bucket, and then the JSON is loaded into an empty Redshift table. The Redshift table declares the date columns as the 'date' type. I know there's another way using boto3/COPY, but for now I am stuck with this method.
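One common pitfall on this EC2-to-JSON-to-S3 path (an assumption about the cause, not confirmed in the thread) is that datetime values serialized to JSON come out as full timestamps rather than plain YYYY-MM-DD strings, which a Redshift DATE column rejects. A minimal sketch of forcing ISO date strings before upload; the column name is hypothetical:

```python
import json
from datetime import date, datetime

row = {"order_date": datetime(2016, 10, 21, 14, 30)}  # hypothetical column

def date_only(value):
    # Serialize datetimes as plain 10-character YYYY-MM-DD strings,
    # matching what a Redshift DATE column expects.
    if isinstance(value, (date, datetime)):
        return value.strftime("%Y-%m-%d")
    raise TypeError(f"Not serializable: {value!r}")

payload = json.dumps(row, default=date_only)
print(payload)  # {"order_date": "2016-10-21"}
```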

Unable to read timestamp value in pyspark from Hive (spark 1.6.1)

I am trying to read a Hive table that has a date column with datatype timestamp, length=9.
My code looks something like the following:
df = hc.sql("select * from schema.table")
It can read all the other columns (datatype = varchar), but it reads either null or None in the date column.
I printed df.dtypes and found that the dataframe schema was inferred correctly and the date columns have the timestamp datatype.
Surprisingly, the same code works in a notebook and only fails in the spark-shell environment.
Can someone tell me what could be wrong, or what limitation causes this error, and how I could rectify it?
I have seen this problem in Spark, where a column displays as null when the datatype is timestamp; it's a bug. There is a way to work around it: read that date column as a string on the source side, using something like to_char(column_name, 'YYYY-MM-DD HH-MM-SS') as column_name, and then cast it to timestamp. If you can tell me the source type and the tool you used to pull the data (e.g. Sqoop, or are you receiving the data as files?), I can help you better.