Azure Data Factory v2 - wrong year copying from parquet to SQL DB

I'm having a weird issue with Azure Data Factory v2. There's a Spark job running that produces parquet files as output; an ADFv2 copy activity then takes the output parquet and copies the data into an Azure SQL Database. All is working fine except for dates! When the data lands in SQL the date is 1969 years out, so today's date (2018-11-22) would land as 3987-11-22.
I've tried changing the source and destination types between Date, DateTime, DateTimeOffset and String, but with no success. At the moment I'm correcting the dates in the database, but this is not really ideal.
I've opened the source parquet files using Parquet Viewer, Spark and Python (desktop), and they all correctly show the year as 2018.

Based on the Parquet encoding definitions, no Date, DateTime, DateTimeOffset or String format exists, so you do not need to keep trying those formats.
Based on the data type mapping for Parquet files in Azure Data Factory, the DateTimeOffset format corresponds to Int96, so I suggest trying that mapping on the source side of the parquet file.

According to the parquet date type definition,
https://drill.apache.org/docs/parquet-format/#sql-types-to-parquet-logical-types
a date is stored as "the number of days from the Unix epoch, 1 January 1970".
ADF uses .NET types to do the transformation. According to the .NET DateTime definition, time values are measured in 100-nanosecond units called ticks, and a particular date is the number of ticks since 12:00 midnight, January 1, 0001 A.D. (C.E.).
https://learn.microsoft.com/en-us/dotnet/api/system.datetime?view=netframework-4.7.2
It seems the extra 1969 years are added for this reason, but I'm not sure whether this is a bug. What is your parquet data type? Is it Date? And what is the SQL data type?
Could you provide the copy activity run id? Or maybe some parquet sample data?
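For what it's worth, here is a small sketch (plain Python, not ADF's actual conversion code) of the suspected behaviour: a Parquet DATE is a count of days since 1970-01-01, and applying the 1969-year gap between the Unix epoch and the .NET DateTime origin a second time reproduces exactly the shift described in the question.
from datetime import date, timedelta

# A Parquet DATE is an int32: days since the Unix epoch (1970-01-01).
days_since_epoch = 17857  # encodes 2018-11-22

# Correct interpretation of the stored value
correct = date(1970, 1, 1) + timedelta(days=days_since_epoch)
print(correct)  # 2018-11-22

# Suspected (hypothetical) bug: the 1969-year gap between the Unix epoch
# and the .NET DateTime origin (0001-01-01) is applied a second time.
shifted = correct.replace(year=correct.year + 1969)
print(shifted)  # 3987-11-22 -- matches what lands in the SQL database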

Related

Time function in Azure Data Factory - Expression Builder

I only need to take the time part from the 'Timestamp type source attribute' and load it into a dedicated SQL pool table (Time datatype column). But I can't find a time function within the expression builder in ADF; is there a way I can do it?
What did I do?
I took the time part from the source attribute using substring and then tried to load it into the destination table. When I did, the destination table got null values inserted, as the column at the destination is set to the time datatype.
I tried to reproduce this and got the same issue. The following is a demonstration of the same. I have a table called mydemo as shown below.
CREATE TABLE [dbo].[mydemo]
(
id int NOT NULL,
my_date date,
my_time time
)
WITH
(
DISTRIBUTION = HASH (id),
CLUSTERED COLUMNSTORE INDEX
)
GO
The following is my source data in my dataflow.
time is not a recognized data type in Azure data flows (date and timestamp are accepted). Therefore, the data flow fails to convert the string produced by substring(<timestamp_col>,12,5) into a time type.
For better understanding, you can load your sink table as a source in the data flow. The time column will be read as 1900-01-01 12:34:56 when the time value in the table row is 12:34:56.
-- my table row
insert into mydemo values(200,'2022-08-18','12:34:56')
So, instead of using substring(<timestamp_col>,12,5) to return 00:01, use concat('1900-01-01 ',substring(<timestamp_col>,12,8)), which returns 1900-01-01 00:01:00.
Configure the sink, mapping and look at the resulting data in data preview. Now, azure dataflow will be able to successfully insert the values and give desired results.
The following is the output after successful insertion of record into dedicated pool table.
NOTE: You can construct a valid yyyy-MM-dd hh:mm:ss value using concat('yyyy-MM-dd ',substring(<timestamp_col>,12,8)) in place of 1900-01-01 hh:mm:ss in the derived column transformation.
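As a quick sanity check of what the derived-column expression produces (plain Python is used here purely as an illustration; note that the data flow substring() is 1-based while Python slicing is 0-based):
# Mimic the data flow expression concat('1900-01-01 ', substring(ts, 12, 8))
# for the sample row inserted above. Data flow substring() is 1-based,
# so position 12, length 8 corresponds to the Python slice [11:19].
ts = "2022-08-18 12:34:56"          # timestamp-typed source attribute, as text
time_part = ts[11:19]               # "12:34:56"
sink_value = "1900-01-01 " + time_part
print(sink_value)                   # 1900-01-01 12:34:56 -> loads into a time column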

A DATETIME column in Synapse tables is loading date values that are a few hours into the past compared to the incoming value

I have a datetime column in Synapse called "load_day" which is being loaded through a pyspark dataframe (parquet). During runtime, the code adds a new column to the dataframe with an incoming date ('timestamp') of format yyyy-mm-dd hh:mm:ss.
df = df.select(lit(incoming_date).alias("load_day"), "*")
Later we are writing this dataframe into a synapse table using a df.write command.
But what's strange is that every date value that is going into this load_day column is being written as a value that is a few hours into the past. This is happening with all the synapse tables in my database for all the new loads that I'm doing. To my knowledge, nothing in the code has changed from before.
Eg: If my incoming date is "2022-02-19 00:00:00" it's being written as 2022-02-18 22:00:00.000 instead of 2022-02-19 00:00:00.000. The hours part in the date is also not stable; sometimes it writes as 22:00:00.000 and sometimes 23:00:00.000
I debugged the code but the output of the variable looks totally fine. It just shows the value as 2022-02-19 00:00:00 as expected but the moment the data is getting ingested into the Synapse table, it goes back a couple of hours.
I'm not understanding why this might be happening or what to look for during debugging.
Did any of you face something like this before? Any ideas on how I can approach this to find out what's causing this erroneous date?
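One thing that may be worth checking (purely a debugging sketch, not a confirmed cause; it assumes an existing SparkSession named spark) is the Spark session time zone, since timestamp values are interpreted in that zone and a non-UTC setting on the cluster would produce a constant shift of a few hours when they are written out:
import time

# Inspect the time zone Spark uses for timestamp values and the cluster's
# local time zone; a mismatch with UTC could explain a constant 1-2 hour shift.
print("Spark session time zone:", spark.conf.get("spark.sql.session.timeZone"))
print("Cluster local time zone:", time.tzname)

# Pinning the session to UTC before the write is one way to test the hypothesis.
spark.conf.set("spark.sql.session.timeZone", "UTC")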

Specify datetime2 format in Azure SQL data warehouse (synapse)

What is the correct way to specify the format of a datetime2 field when creating a table in Azure SQL data warehouse? I don't seem to be able to find an example in the documentation.
The data looks like this:
"2020-09-14T20:50:48.000Z"
CREATE TABLE [Foo].[Bar](
...
MyDateTime datetime2(['YYYY-MM-DDThh:mm:ss[.fractional seconds]')
)
As Panagiotis notes, the underlying representation is an int/long for the actual date value. This is how RDBMS engines can quickly compute the delta between two dates (days between Monday and Friday is a simple subtraction problem). To answer your question, you would simply define your create table as:
CREATE TABLE [Foo].[Bar](
...
MyDateTime datetime2
)
If you're interested in formatting the result in a query, you can look to the CONVERT or FORMAT functions. For example, if you wanted the format dd-mm-yyyy (Italian date), you could use either of the following:
SELECT
CONVERT(VARCHAR, CURRENT_TIMESTAMP, 105)
, FORMAT(CURRENT_TIMESTAMP, 'dd-MM-yyyy')
Note: CONVERT is generally faster than FORMAT and is the recommended approach if you have a date format that is supported. This is because the FORMAT function relies on the CLR which will include a context/process jump.
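The same storage-versus-display distinction can be illustrated outside SQL as well (a small Python sketch, not part of the original answer): the parsed value carries no format of its own, and 'dd-MM-yyyy' only exists at the point where you render it.
from datetime import datetime

# The source value from the question; the trailing 'Z' is treated here as a literal.
raw = "2020-09-14T20:50:48.000Z"
value = datetime.strptime(raw, "%Y-%m-%dT%H:%M:%S.%fZ")

# The stored value has no display format of its own...
print(repr(value))                 # datetime.datetime(2020, 9, 14, 20, 50, 48)
# ...formatting is applied only when rendering, like CONVERT/FORMAT in a query.
print(value.strftime("%d-%m-%Y"))  # 14-09-2020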

How to specify query for Azure Data Factory Source (Azure Table Storage) for yesterday's records

I am copying records from an Azure Storage Table (source) to another Azure Storage Table (sink) every day. So if I am executing the query on December 24th 2019 (UTC), for instance, then I want to copy records for December 23rd 2019 (UTC). The query works and is doing what I intend it to do. Here is the query:
Timestamp ge datetime'2019-12-23T00:00Z' and Timestamp lt datetime'2019-12-24T00:00Z'
In the query above, the Timestamp column is automatically stamped in the Azure Storage Table when a new record is inserted in it. That is how Azure Storage Table works.
And here is the screenshot of the Data Factory Pipeline:
I want to parameterize the query now. That is: if the query is run on 24th December 2019, then it should copy 23rd December 2019's records, and keep sliding as it executes every day on a schedule. I don't know how to do that. I know that there is a utcNow function and there is a subtractFromTime function. I just don't know how to put it together.
#4c74356b41, Thank you for your kind support. Based on your answers and some more googling, I was able to piece it together. Here is the final expression:
Timestamp ge #{concat('datetime','''',addDays(startOfDay(utcNow()), -1),'''')} and Timestamp lt #{concat('datetime','''',startOfDay(utcNow()),'''')}
You can do something like this:
addDays(startOfDay(utcNow()), -1)
This would find the start of the previous day.
https://learn.microsoft.com/en-us/azure/data-factory/control-flow-expression-language-functions#date-functions
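To make the resulting window concrete, here is a small Python sketch of what those expressions evaluate to for an example run on 2019-12-24 (UTC); the trigger time used below is an assumption for illustration only.
from datetime import datetime, timedelta, timezone

# Python equivalents of startOfDay(utcNow()) and addDays(startOfDay(utcNow()), -1)
utc_now = datetime(2019, 12, 24, 9, 30, tzinfo=timezone.utc)  # example trigger time
start_of_today = utc_now.replace(hour=0, minute=0, second=0, microsecond=0)
start_of_yesterday = start_of_today - timedelta(days=1)

print(start_of_yesterday.isoformat())  # 2019-12-23T00:00:00+00:00 -> lower bound (ge)
print(start_of_today.isoformat())      # 2019-12-24T00:00:00+00:00 -> upper bound (lt)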

PySpark: how to read in partitioning columns when reading parquet

I have data stored in parquet files and a Hive table partitioned by year, month and day. Thus, each parquet file is stored in a /table_name/year/month/day/ folder.
I want to read in data for only some of the partitions. I have list of paths to individual partitions as follows:
paths_to_files = ['hdfs://data/table_name/2018/10/29',
'hdfs://data/table_name/2018/10/30']
And then try to do something like:
df = sqlContext.read.format("parquet").load(paths_to_files)
However, then my data does not include the information about year, month and day, as this is not part of the data per se, rather the information is stored in the path to the file.
I could use the SQL context and send a Hive query with a SELECT statement with a WHERE clause on the year, month and day columns to select only the data from the partitions I am interested in. However, I'd rather avoid constructing a SQL query in Python, as I am very lazy and don't like reading SQL.
I have two questions:
1. What is the optimal way (performance-wise) to read in the data stored as parquet, where the information about year, month and day is not present in the parquet file itself but only in the path to the file? (Either send a Hive query using sqlContext.sql('...'), or use read.parquet, ... anything really.)
2. Can I somehow extract the partitioning columns when using the approach I outlined above?
Reading the direct file paths to the parent directory of the year partitions should be enough for a dataframe to determine that there are partitions under it. However, it wouldn't know what to name the partitions without a directory structure like /year=2018/month=10, for example.
Therefore, if you have Hive, then going via the metastore would be better because the partitions are named there, Hive stores extra useful information about your table, and then you're not reliant on knowing the direct path to the files on disk from the Spark code.
Not sure why you think you need to read/write SQL, though.
Use the DataFrame API instead, e.g.:
df = spark.table("table_name")
df_2018 = df.filter(df['year'] == 2018)
df_2018.show()
Your data isn't stored in a way that is optimal for Parquet, so you'd have to load the files one by one and add the dates yourself.
Alternatively, you can move the files to a directory structure that fits Parquet partition discovery
(e.g. .../table/year=2018/month=10/day=29/file.parquet).
Then you can read the parent directory (table) and filter on year, month and day; Spark will only read the relevant directories, and you'd also get these as columns in your dataframe.
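A minimal sketch of that last suggestion (the path below is hypothetical and assumes the files have already been rewritten under Hive-style year=/month=/day= directories):
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# With key=value directory names, Spark discovers year/month/day as columns
# automatically when reading the parent directory.
df = spark.read.parquet("hdfs://data/table/")  # hypothetical relocated layout

# Partition pruning: only the matching directories are actually scanned.
subset = df.filter((df["year"] == 2018) & (df["month"] == 10) & (df["day"].isin(29, 30)))
subset.show()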
