A DATETIME column in Synapse tables is loading date values that are a few hours into the past compared to the incoming value

I have a datetime column in Synapse called "load_day" which is loaded through a PySpark dataframe (parquet). At runtime, the code adds a new column to the dataframe containing an incoming date ('timestamp') in the format yyyy-MM-dd hh:mm:ss.
df = df.select(lit(incoming_date).alias("load_day"), "*")
Later we write this dataframe into a Synapse table using a df.write command.
But what's strange is that every date value going into this load_day column is being written as a value that is a few hours in the past. This is happening with all the Synapse tables in my database for all the new loads that I'm doing. To my knowledge, nothing in the code has changed from before.
E.g.: if my incoming date is "2022-02-19 00:00:00", it's being written as 2022-02-18 22:00:00.000 instead of 2022-02-19 00:00:00.000. The hours part of the date is also not stable; sometimes it's written as 22:00:00.000 and sometimes 23:00:00.000.
I debugged the code but the output of the variable looks totally fine. It shows the value as 2022-02-19 00:00:00 as expected, but the moment the data is ingested into the Synapse table, it goes back a couple of hours.
I'm not understanding why this might be happening or what to look for during debugging.
Did any of you face something like this before? Any ideas on how I can approach this to find out what's causing this erroneous date?
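One thing worth checking (a debugging sketch, not part of the original question): Spark interprets timestamp values in the session time zone, so a cluster whose JVM/session zone differs from the zone the incoming date was produced in can shift values during the write. Pinning the session time zone and casting the literal explicitly makes the intended wall-clock value visible before the df.write:
from pyspark.sql import functions as F

# Pin the Spark session time zone so timestamp literals are not re-interpreted
# in the cluster's local zone (UTC here is only an example).
spark.conf.set("spark.sql.session.timeZone", "UTC")
print(spark.conf.get("spark.sql.session.timeZone"))

# Cast the incoming value explicitly and inspect it before writing to Synapse.
df = df.select(F.lit(incoming_date).cast("timestamp").alias("load_day"), "*")
df.select(F.date_format("load_day", "yyyy-MM-dd HH:mm:ss")).show(1, truncate=False)
If the displayed value already matches the incoming date but the Synapse table still shows the shift, the conversion is happening on the connector side rather than in Spark.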

Related

Time function in Azure Data Factory - Expression Builder

I only need to take the time part from the 'Timestamp type source attribute' and load it into a dedicated SQL pool table (Time datatype column). But I can't find a time function within the expression builder in ADF; is there a way I can do it?
What did I do? I took the time part from the source attribute using substring and then tried to load it into the destination table. When I did, the destination table received null values, because the column at the destination is set to the time datatype.
I tried to reproduce this and got the same issue. The following is a demonstration of the same. I have a table called mydemo as shown below.
CREATE TABLE [dbo].[mydemo]
(
id int NOT NULL,
my_date date,
my_time time
)
WITH
(
DISTRIBUTION = HASH (id),
CLUSTERED COLUMNSTORE INDEX
)
GO
The following is my source data in my dataflow.
time is not a recognized datatype in Azure Data Factory data flows (date and timestamp are accepted). Therefore, the data flow fails to convert the string (substring(<timestamp_col>,12,5)) into the time type.
For better understanding, you can load your sink table as a source in the data flow. The time column will be read as 1900-01-01 12:34:56 when the time value in the table row is 12:34:56.
-- my table row
insert into mydemo values(200,'2022-08-18','12:34:56')
So, instead of using substring(<timestamp_col>,12,5), which returns only 00:01, use concat('1900-01-01 ',substring(<timestamp_col>,12,8)), which returns 1900-01-01 00:01:00.
Configure the sink and the mapping and look at the resulting data in the data preview. Now the data flow will be able to insert the values successfully and give the desired results.
The following is the output after successful insertion of the record into the dedicated pool table.
NOTE: You can construct any valid yyyy-MM-dd hh:mm:ss value by using concat('<yyyy-MM-dd> ',substring(<timestamp_col>,12,8)) in the derived column transformation, i.e. with any date in place of 1900-01-01.

Apache PySpark - Get latest record issue

We have ~100M records that have been collected over 2 weeks. The same record can appear multiple times. For the duplicated records, I only need the latest one based on the "LastModified" date.
I have tried the following Spark script, but it seemed to pick up the values randomly.
df.orderBy(unix_timestamp(df["LastModified"], "MM/dd/yyyy hh:mm:ss a").desc()).dropDuplicates(["LastModified"])
I have checked the data, the date format, ... all looked good.
Anyone have any ideas?
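For what it's worth, a common pattern here (a sketch, not from the original post, using a hypothetical key column since the post doesn't name one) is to rank rows per key with a window and keep only the newest; orderBy followed by dropDuplicates does not guarantee which duplicate survives:
from pyspark.sql import Window, functions as F

# Hypothetical business key; replace with the columns that identify a record.
key_cols = ["RecordId"]

ts = F.unix_timestamp("LastModified", "MM/dd/yyyy hh:mm:ss a")
w = Window.partitionBy(*key_cols).orderBy(ts.desc())

# Keep the newest row per key based on LastModified.
latest = (df.withColumn("rn", F.row_number().over(w))
            .filter(F.col("rn") == 1)
            .drop("rn"))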

Azure Data Factory v2 - wrong year copying from parquet to SQL DB

I'm having a weird issue with Azure Data Factory v2. There's a Spark Job which is running and producing parquet files as output; an ADFv2 copy activity then takes the output parquet and copies the data into an Azure SQL Database. All is working fine except for dates! When the data lands in SQL, the year is 1969 years out. So today's date (2018-11-22) would land as 3987-11-22.
I've tried changing the source and destination types between Date, DateTime, DateTimeOffset and String but with no success. At the moment I'm correcting the dates in the database but this is not really ideal.
I've opened the source parquet files using Parquet Viewer, Spark and Python (desktop) and they all correctly show the year as 2018.
Based on the Parquet encoding definitions, no Date, DateTime, DateTimeOffset or String formats exist, so you do not need to try those formats.
Based on this Data type mapping for Parquet files in Azure Data Factory:
The DateTimeOffset format corresponds to Int96; I suggest you try this mapping on the source parquet file.
According to the Parquet date type definition,
https://drill.apache.org/docs/parquet-format/#sql-types-to-parquet-logical-types
The date is stored as "the number of days from the Unix epoch, 1 January 1970"
And ADF uses .NET types to do the transformation. According to the .NET type definition, time values are measured in 100-nanosecond units called ticks, and a particular date is the number of ticks since 12:00 midnight, January 1, 0001 A.D. (C.E.):
https://learn.microsoft.com/en-us/dotnet/api/system.datetime?view=netframework-4.7.2
It seems the extra 1969 years are added for this reason, but I'm not sure whether this is a bug. What is your parquet data type? Is it Date? And what is the SQL data type?
Could you provide the copy activity run id? Or maybe some parquet sample data?
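To illustrate the epoch mismatch described above (a hypothesis about where the offset could come from, not a confirmed ADF internal): Parquet's DATE is defined against the Unix epoch (1970-01-01), while .NET's DateTime counts from 0001-01-01. If a day count taken relative to year 1 were re-applied on top of the Unix epoch, the result would land roughly 1969 years in the future, which matches the 3987-11-22 symptom:
from datetime import date, timedelta

correct = date(2018, 11, 22)

# Day count relative to the proleptic-Gregorian year-1 epoch used by .NET DateTime.
days_since_year_1 = (correct - date(1, 1, 1)).days

# Re-applying that count on top of the Unix epoch (Parquet DATE's reference point)
# pushes the date about 1969 years forward.
wrong = date(1970, 1, 1) + timedelta(days=days_since_year_1)
print(wrong)  # 3987-11-22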

Hive query manipulation for missing data not produced on non-business days (weekends & holidays)

I have a question about tweaking my Hive query for the requirement defined below; I couldn't get my head around this.
Case: The data gets generated only on business days, i.e. weekdays and non-holiday dates. I load this data into Hive. Both the source and the target are HDFS.
Stringent process: The data should be replicated for every day. So for Saturday and Sunday I'll copy the same data as Friday, and the same goes for public holidays.
Current process: As of now I'm executing it manually to load weekends' data.
Requirement: I need to automate this in the query itself.
Any suggestions? A solution in Spark is also welcome if feasible.
Though it is clear what the issue is, it is unclear what you mean by "in the query itself".
Two options:
When querying results, look for data using a scalar subquery (using Impala) that first finds the max date relative to a given select date, i.e. the max date less than or equal to the given select date; thus no replication is needed.
Otherwise, use scheduling and, when scheduled, a) check whether the date is a weekend via Linux or via SQL, and b) maintain a table of holiday dates and check it for existence. If either or both of the conditions are true, then copy from the existing data as per option 1, where the select date is today; else do your regular processing.
Note you may need to allow for running processing to catch up after some error. This implies some control logic but is more robust.
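Since a Spark solution is also welcome, here is a minimal PySpark sketch of the first option above (for each calendar date, pick the latest available business-day load at or before it); the table, column and path names are hypothetical:
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

data = spark.table("source_data")          # business-day data with a load_date column
calendar = spark.table("calendar_dates")   # one row per calendar date (cal_date)

# For each calendar date, find the latest load_date at or before it,
# so Friday's data is reused for Saturday, Sunday and holidays.
latest = (calendar
    .join(data.select("load_date").distinct(), F.col("load_date") <= F.col("cal_date"))
    .groupBy("cal_date")
    .agg(F.max("load_date").alias("load_date")))

# Re-attach the full rows and write them out keyed by calendar date.
result = latest.join(data, "load_date")
result.write.mode("overwrite").partitionBy("cal_date").parquet("/target/path")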

Spark filter dataframe returns empty result

I'm working on a project with Scala and Spark, processing files that are stored in HDFS. Those files land in HDFS every day in the morning. I have a job that reads each day's file from HDFS, processes it and then writes the result to HDFS. After converting the file into a Dataframe, the job executes a filter to get only the rows with a timestamp higher than the highest timestamp processed in the last file. This filter shows unexpected behavior only on some days: some days it works as expected, and on other days, despite the new file containing rows that match the filter, the filter result is empty. This happens every time for the same file when it's executed in the TEST environment, but on my LOCAL machine it works as expected using the same file with the same HDFS connection.
I've tried to filter in different ways, but none of them work in that environment for some specific files, while all of them work fine on my LOCAL:
1) Spark sql
val diff = fp.spark.sql("select * from curr " +
  s"where TO_DATE(CAST(UNIX_TIMESTAMP(substring(${updtDtCol}, ${substrStart}, ${substrEnd}), '${dateFormat}') as TIMESTAMP))" +
  s" > TO_DATE(CAST(UNIX_TIMESTAMP('${prevDate.substring(0,10)}', '${dateFormat}') as TIMESTAMP))")
2) Spark filter functions
val diff = df.filter(date_format(unix_timestamp(substring(col(updtDtCol),0,10),dateFormat).cast("timestamp"),dateFormat).gt(date_format(unix_timestamp(substring(col("PrevDate"),0,10),dateFormat).cast("timestamp"),dateFormat)))
3) Adding extra column with the result of the filter and then filter by this new column
val test2 = df.withColumn("PrevDate", lit(prevDate.substring(0,10)))
.withColumn("DatePre", date_format(unix_timestamp(substring(col("PrevDate"),0,10),dateFormat).cast("timestamp"),dateFormat))
.withColumn("Result", date_format(unix_timestamp(substring(col(updtDtCol),0,10),dateFormat).cast("timestamp"),dateFormat).gt(date_format(unix_timestamp(substring(col("PrevDate"),0,10),dateFormat).cast("timestamp"),dateFormat)))
.withColumn("x", when(date_format(unix_timestamp(substring(col(updtDtCol),0,10),dateFormat).cast("timestamp"),dateFormat).gt(date_format(unix_timestamp(substring(col("PrevDate"),0,10),dateFormat).cast("timestamp"),dateFormat)), lit(1)).otherwise(lit(0)))
val diff = test2.filter("x == 1")
I think the issue is not caused by the filter itself, and probably not by the file either, but I would like feedback on what I should check, or to hear whether anybody has faced this before.
Please let me know what information could be useful to post here in order to receive some feedback.
A part of file example looks like the following:
|TIMESTAMP |Result|x|
|2017-11-30-06.46.41.288395|true |1|
|2017-11-28-08.29.36.188395|false |0|
The TIMESTAMP values are compared with the previousDate (for instance 2017-11-29) and I create a column called 'Result' with the result of that comparison, which always works in both environments, and also another column called 'x' with the same result.
As I mentioned before, if I use the comparison between both dates or the result in column 'Result' or 'x' to filter the dataframe, sometimes the result is an empty dataframe, but locally, using the same HDFS and file, the result contains data.
I suspect it to be a data/date format issue. Did you get a chance to verify if the dates converted are as expected?
If the date string for both the columns has timezone included, the behavior is predictable.
If only one of them has a timezone included, the results will differ when executed locally and remotely; it depends entirely on the timezone of the cluster.
For debugging the issue, I would suggest you add additional columns to capture the unix_timestamp(..)/millis of the respective date strings, and an additional column to capture the difference between the two. The diff column should help to find out where and why the conversions went wrong. Hope this helps.
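A sketch of that suggestion, shown here in PySpark for brevity (the original job is in Scala; the column names come from the question, and the date format is assumed to be yyyy-MM-dd):
from pyspark.sql import functions as F

date_fmt = "yyyy-MM-dd"  # assumed; use the job's actual dateFormat

debug = (df
    .withColumn("updt_epoch", F.unix_timestamp(F.substring(F.col(updtDtCol), 0, 10), date_fmt))
    .withColumn("prev_epoch", F.unix_timestamp(F.substring(F.col("PrevDate"), 0, 10), date_fmt))
    .withColumn("epoch_diff", F.col("updt_epoch") - F.col("prev_epoch")))

# unix_timestamp() resolves strings in the session/JVM time zone, so the same
# file can yield different epoch values on the LOCAL machine vs the TEST server.
debug.select(updtDtCol, "updt_epoch", "PrevDate", "prev_epoch", "epoch_diff").show(5, truncate=False)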
In case anybody wants to know what happened with this issue and how I finally found the cause of the error, here is the explanation. Basically it was caused by the different timezones of the machines where the job was executed (the LOCAL machine and the TEST server). The unix_timestamp function returned the correct value with respect to each server's timezone. In the end I didn't have to use the unix_timestamp function, and I didn't need to use the full content of the date field. Next time I will double-check this beforehand.
