How to specify query for Azure Data Factory Source (Azure Table Storage) for yesterday's records - azure

I am copying records from an Azure Storage Table (source) to another Azure Storage Table (sink) every day. So if I am executing the query on December 24th 2019 (UTC), for instance, then I want to copy the records for December 23rd 2019 (UTC). The query works and does what I intend it to do. Here is the query:
Timestamp ge datetime'2019-12-23T00:00Z' and Timestamp lt datetime'2019-12-24T00:00Z'
In the query above, the Timestamp column is automatically stamped in the Azure Storage Table when a new record is inserted in it. That is how Azure Storage Table works.
A screenshot of the Data Factory pipeline accompanied the original question.
I want to parameterize the query now. That is: if the query runs on 24th December 2019, it should copy 23rd December 2019's records, and it should keep sliding as it executes every day on a schedule. I don't know how to do that. I know there is a utcNow function and there is a subtractFromTime function; I just don't know how to put them together.

@4c74356b41, thank you for your kind support. Based on your answers and some more googling, I was able to piece it together. Here is the final expression:
Timestamp ge @{concat('datetime','''',addDays(startOfDay(utcNow()), -1),'''')} and Timestamp lt @{concat('datetime','''',startOfDay(utcNow()),'''')}
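For example, when the pipeline runs on 24 December 2019 (UTC), that expression should resolve to roughly the following filter (the exact fractional-seconds formatting depends on the default ISO 8601 output of startOfDay and utcNow):
Timestamp ge datetime'2019-12-23T00:00:00.0000000Z' and Timestamp lt datetime'2019-12-24T00:00:00.0000000Z'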

You can do something like this:
addDays(startOfDay(utcNow()), -1)
This finds the start of the previous day.
https://learn.microsoft.com/en-us/azure/data-factory/control-flow-expression-language-functions#date-functions

Related

Incremental load in Azure Data Factory

I am replicating my data from an Azure SQL DB to another Azure SQL DB. Some tables have date columns, while other tables have only ID columns that serve as the primary key. While performing an incremental load in ADF, I can select the date as the watermark column for the tables that have a date column, and the ID as the watermark column for the tables that only have an ID column. The issue is that my IDs are GUID values, so can I take that as my watermark column? If yes, why does the copy activity give me the following error in ADF?
Please see the attached image for the error.
How can I overcome this issue? Help is appreciated.
Thank you
Gp
I have tried dynamic mapping (https://martinschoombee.com/2022/03/22/dynamic-column-mapping-in-azure-data-factory/), but it does not work; it still gives me the same error.
Regarding your question about the watermark:
A watermark is a column that holds the last-updated timestamp or an incrementing key.
So a GUID column would not be a good fit.
Try to find a date column, or an ever-incrementing integer identity, to use as the watermark.
Since your source is SQL Server, you can also use change data capture.
Links:
Incremental loading in ADF
Change data capture
Regards,
Chen
The watermark logic takes advantage of the fact that only the new records inserted after the last saved watermark need to be considered for copying from source A to B; basically, we are using the ">=" operator to our advantage here.
In the case of a GUID you cannot use that logic: a GUID is certainly unique, but ">=" or "<=" comparisons against it are meaningless, so the incremental filter will not work (a sketch follows below).
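As a minimal sketch of that watermark pattern (the table, column, and parameter names are hypothetical, not taken from the question), the incremental source query typically looks something like this:
-- Hypothetical incremental (watermark) source query; a GUID column cannot drive
-- the range comparison below, which is why a date or identity column is needed.
SELECT *
FROM dbo.SourceTable
WHERE LastModifiedDate >  @last_watermark     -- value saved by the previous run
  AND LastModifiedDate <= @current_watermark; -- value captured at the start of this run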

Cognos Analytics: Split variable into two by month

In Cognos Analytics I have a dataset containing rows of data (used disk space in MB), with each row belonging to either February or June. Because I want to compare the two months, I want to create two new variables: one with the February data and one with the June data.
In the query editor I've tried: count (MB) when month = 'February'. This, and a couple of other variations, don't work.
I wonder if anyone can provide me with the right line of code.
Thanks in advance!
Try this:
Go to the Query Explorer
Create a query for each month
Join the two queries (this will result in a third query)
At this point you should be able to handle each month as a separate data item
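If you would rather keep a single query, a conditional data item is another common approach. A rough sketch in Cognos report-expression syntax (the [Month] and [MB] item names are assumptions; substitute your actual data items):
February MB: if ([Month] = 'February') then ([MB]) else (0)
June MB: if ([Month] = 'June') then ([MB]) else (0)
You can then aggregate each of the two items (sum, count, etc.) and compare the months side by side.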

Getting dates for historical data copy in Azure Data Factory

I have to copy historical data from a REST API for one year, e.g., from March 1, 2019 to March 1, 2020. The REST API takes a start and end date as params.
To limit the load, however, I have to call the API in pieces: first copy with the start and end dates as March 1, 2019 to March 30, 2019; once that's done, April 1, 2019 to April 30, 2019; and so on until March 1, 2020, automatically and without manual intervention.
I was able to use utcNow and addDays to copy the previous day's data up to the current start of day, but I am unable to figure out the historical copy. Any idea if this is possible?
You can try something like this:
1. Create two variables named start_date and end_date.
2. Create an Until activity with this expression: @greaterOrEquals(variables('end_date'),'2020-03-01T00:00:00Z')
3. Inside the Until loop, create a Copy activity and configure its source setting, source dataset, and sink dataset (shown as screenshots in the original answer).
4. Create two Set Variable activities to advance start_date and end_date on each iteration (see the expression sketch below).
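As a rough sketch of step 4 (the one-month stride, the update order, and the initial values are assumptions, not taken from the original answer), the two Set Variable activities could use expressions like these, with start_date and end_date defaulted to something like 2019-03-01T00:00:00Z and 2019-04-01T00:00:00Z:
Set variable start_date: @variables('end_date')
Set variable end_date: @addToTime(variables('start_date'), 1, 'Month')
Updating start_date from end_date first means each iteration slides the copy window forward by one month.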
By the way, you can change the date format according to your needs; for reference, see https://learn.microsoft.com/en-us/dotnet/standard/base-types/standard-date-and-time-format-strings.
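For example (a hedged sketch; the variable name and format string are only illustrative), formatDateTime can render a variable in whatever shape the REST API expects:
@{formatDateTime(variables('start_date'), 'yyyy-MM-dd')}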

Azure Data Factory v2 - wrong year copying from parquet to SQL DB

I'm having a weird issue with Azure Data Factory v2. There's a Spark job which runs and produces Parquet files as output; an ADFv2 copy activity then takes the output Parquet and copies the data into an Azure SQL Database. All is working fine except for dates! When the data lands in SQL the year is 1969 years out, so today's date (2018-11-22) lands as 3987-11-22.
I've tried changing the source and destination types between Date, DateTime, DateTimeOffset and String, but with no success. At the moment I'm correcting the dates in the database, but this is not really ideal.
I've opened the source Parquet files using Parquet Viewer, Spark and Python (desktop), and they all correctly show the year as 2018.
Based on the Parquet encoding definitions, no Date, DateTime, DateTimeOffset or String formats exist, so you do not need to try those formats.
Based on this data type mapping for Parquet files in Azure Data Factory:
The DateTimeOffset format corresponds to Int96; I suggest trying this mapping on the source of the Parquet file.
According to parquet date type definition,
https://drill.apache.org/docs/parquet-format/#sql-types-to-parquet-logical-types
The date is stored as "the number of days from the Unix epoch, 1 January 1970"
And ADF uses .NET types for the transformation. According to the .NET definition, time values are measured in 100-nanosecond units called ticks, and a particular date is the number of ticks since 12:00 midnight, January 1, 0001 A.D. (C.E.):
https://learn.microsoft.com/en-us/dotnet/api/system.datetime?view=netframework-4.7.2
It seems the extra 1969 years are added because of this epoch difference (2018 + 1969 = 3987, which matches what you see), but I am not sure whether this is a bug. What is your Parquet data type? Is it Date? And what is the SQL data type?
Could you provide the copy activity run id? Or maybe some parquet sample data?

Hive query manipulation for missing data not produced on non-business days (weekends & holidays)

I have a question about tweaking my Hive query for the requirement defined below; I couldn't get my head around it.
Case: the data gets generated only on business days, i.e., weekdays and non-holiday dates. I load this data into Hive. Both the source and the target are HDFS.
Stringent process: the data should be replicated for every day. So, for Saturday and Sunday, I'll copy the same data as Friday. The same applies to public holidays.
Current process: As of now I'm executing it manually to load weekends' data.
Requirement: I need to automate this in the query itself.
Any suggestions? A solution in spark for the same is also welcome if feasible.
Though it is clear what the issue is, it is unclear what you mean by "in the query itself".
Two options:
When querying results, use a scalar subquery (in Impala) that first looks for the max date relative to a given select date, i.e. the max date less than or equal to the given select date; thus no replication is needed (see the sketch after these options).
Otherwise use scheduling: when the job runs, a) check whether the date is a weekend via Linux or via SQL, and b) maintain a table of holiday dates and check for existence. If either or both conditions are true, copy from the existing data as per option 1, with the select date being today; otherwise do your regular processing.
Note that you may need to assume you are running processing to catch up after some error. That implies some control logic but is more robust.
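A minimal sketch of option 1 (the table and column names are hypothetical, and load_date is assumed to be a 'yyyy-MM-dd' string partition column; Impala supports uncorrelated scalar subqueries like this, while Hive support varies by version):
SELECT d.*
FROM daily_data d                      -- hypothetical table of the daily rows
WHERE d.load_date = (
        SELECT MAX(load_date)          -- latest business day on or before the requested date
        FROM daily_data
        WHERE load_date <= '2019-06-16'   -- the requested, possibly non-business, date
      );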
