Copy data every 1 minute from DataLake by DataFactory - azure

I have a Data Lake storage with the following folder structure:
{YEAR}
  - {MONTH}
    - {DAY}
      - {HOUR}
        - {sometext}_{YEAR}_{MONTH}_{DAY}_{HOUR}_{Minute}_{someuuid}.json
Could you please help me configure the Data Factory Copy data activity?
I need a trigger to run every 1 minute and copy the previous minute's data from Data Lake to Cosmos DB.
I've tried this
where the first expression is
@formatDateTime(utcnow(),'yyyy/MM/dd/HH')
and the second one
@{formatDateTime(utcnow(),'yyyy')}_@{formatDateTime(utcnow(),'MM')}_@{formatDateTime(utcnow(),'dd')}_@{formatDateTime(utcnow(),'HH')}_@{formatDateTime(addMinutes(utcnow(), -1),'mm')}*.json
But it can skip some data, especially when Hour changes.
I'm new to Data Factory and don't know the most efficient way to do this. Please help.

The Pipeline Expression Language has a number of date functions built in. You can use the addMinutes function to add or subtract minutes (for example, -1 minute to get the previous minute).
To avoid clock skew, I would capture the utcnow() value and store it without any formatting:
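For example (a sketch; the variable name WindowTime is an assumption, not from the original post), a Set variable activity whose value is simply:
@utcnow()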
In another variable, apply addMinutes to the captured value (here with -1, for the previous minute) rather than executing utcnow() again:
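A sketch of that second Set variable activity, again with an assumed name (PrevMinuteTime):
@addMinutes(variables('WindowTime'), -1)
Because both the folder path and the file name are then derived from this one captured value, the hour and minute parts can never disagree, which is what caused the missed files at the hour boundary.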
Once you have those variables, just use them to format the date string(s).
Result:
NOTE: use concat together with formatDateTime to build the wildcard value you want:
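For example, in the source dataset's folder path and wildcard file name fields (a sketch against the file pattern shown in the question; adjust the format strings to your exact naming):
Folder path: @formatDateTime(variables('PrevMinuteTime'), 'yyyy/MM/dd/HH')
Wildcard file name: @concat('*_', formatDateTime(variables('PrevMinuteTime'), 'yyyy_MM_dd_HH_mm'), '_*.json')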
Result:

Related

Time function in Azure Data Factory - Expression Builder

I only need to take the time part from the 'Timestamp type source attribute' and load it into a dedicated SQL pool table (time datatype column). But I don't find a time function in the expression builder in ADF. Is there a way I can do this?
What did I do? I took the time part from the source attribute using substring and then tried to load it into the destination table. When I did, the destination table got null values inserted, since the column at the destination is set to the time datatype.
I tried to reproduce this and got the same issue. The following is a demonstration of the same. I have a table called mydemo as shown below.
CREATE TABLE [dbo].[mydemo]
(
id int NOT NULL,
my_date date,
my_time time
)
WITH
(
DISTRIBUTION = HASH (id),
CLUSTERED COLUMNSTORE INDEX
)
GO
The following is my source data in my dataflow.
time is not a recognized datatype in Azure data flows (date and timestamp are accepted). Therefore, the data flow fails to convert the string (substring(<timestamp_col>,12,5)) into a time type.
For better understanding, you can load your sink table as a source in the data flow. The time column will be read as 1900-01-01 12:34:56 when the time value in the table row is 12:34:56.
#my table row
insert into mydemo values(200,'2022-08-18','12:34:56')
So, instead of using substring(<timestamp_col>,12,5), which returns a value like 00:01, use concat('1900-01-01 ',substring(<timestamp_col>,12,8)), which returns 1900-01-01 00:01:00.
Configure the sink, mapping and look at the resulting data in data preview. Now, azure dataflow will be able to successfully insert the values and give desired results.
The following is the output after successful insertion of record into dedicated pool table.
NOTE: You can construct any valid yyyy-MM-dd hh:mm:ss value by using concat('<some date> ',substring(<timestamp_col>,12,8)) in the derived column transformation instead of hardcoding 1900-01-01; only the time portion ends up in the time column anyway.
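A short worked example of the derived column expression (my_timestamp is a placeholder column name; add toString() first if your source column is a timestamp rather than a string). Given a source value of 2022-08-18 12:34:56:
substring(my_timestamp, 12, 8)                        -> 12:34:56
concat('1900-01-01 ', substring(my_timestamp, 12, 8)) -> 1900-01-01 12:34:56
The second value maps cleanly onto the time column of the dedicated SQL pool table, since only the time portion is kept on insert.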

Getting dates for historical data copy in Azure datafactory

I have to copy historical data from a REST API for one year, e.g. from March 1, 2019 to March 1, 2020. The REST API takes a start and end date as params.
To limit the load, however, I have to call the API in pieces, e.g. copy with a start and end date of March 1, 2019 to March 30, 2019; once that's done, then April 1, 2019 to April 30, 2019, and so on until March 1, 2020, automatically and without manual intervention.
I was able to use utcnow and addDays to copy the previous day's data up to the current start of day, but I am unable to figure out the copy of historical data. Any idea if this is possible?
You can try something like this:
1. Create two variables named start_date and end_date.
2. Create an Until activity with this expression: @greaterOrEquals(variables('end_date'),'2020-03-01T00:00:00Z')
3. Create a Copy activity
Source setting:
Source dataset:
Sink dataset:
4. Create two Set variable activities to advance start_date and end_date (sketched below).
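A sketch of the two expressions for step 4 (the original screenshots are not reproduced here, so treat this as one possible way to advance a monthly window; the order of the two activities matters):
Set variable start_date: @variables('end_date')
Set variable end_date: @addToTime(variables('start_date'), 1, 'Month')
With default values of 2019-03-01T00:00:00Z for start_date and 2019-04-01T00:00:00Z for end_date, the Until loop copies one month per iteration until end_date reaches 2020-03-01T00:00:00Z.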
Result:
By the way, you can change the date format according to your needs. Reference: https://learn.microsoft.com/en-us/dotnet/standard/base-types/standard-date-and-time-format-strings

Hive query manipulation for missing data not produced on non-business days (weekends & holidays)

I have a question about tweaking my Hive query for the requirement defined below; I couldn't get my head around this.
Case: The data gets generated only on business days, i.e. weekdays and non-holiday dates. I load this data into Hive. The source and target are both HDFS.
Stringent process: The data should be replicated for every day. So, for Saturday and Sunday, I'll copy Friday's data. The same applies to public holidays.
Current process: As of now I'm executing it manually to load the weekend data.
Requirement: I need to automate this in the query itself.
Any suggestions? A solution in spark for the same is also welcome if feasible.
Though it is clear what the issue is, it is unclear what you mean by "in the query itself".
Two options:
1. When querying results, look for the data using a scalar subquery (using Impala) that first finds the max date relative to a given select date, i.e. the max date less than or equal to the given select date; thus no replication is needed (see the sketch at the end of this answer).
2. Otherwise use scheduling and, when scheduled, a) check the date for a weekend via Linux or via SQL, and b) maintain a table of holiday dates and check for existence. If either or both conditions are true, then copy from the existing data as per option 1, with the select date being today; else do your regular processing.
Note you may need to assume that you are running processing to catch up after some error. This implies some control logic but is more robust.
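A sketch of option 1 in Impala-style SQL (table and column names are hypothetical): pick the latest load date that is less than or equal to the requested date, so weekends and holidays simply fall back to the last business day's data without replicating anything:
SELECT t.*
FROM daily_data t
WHERE t.load_date = (
    SELECT MAX(load_date)
    FROM daily_data
    WHERE load_date <= '2020-03-01'  -- the requested (possibly non-business) date
);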

Adding Extraction DateTime in Azure Data Factory

I want to write a generic Data Factory pipeline in V2 with the below scenario.
Source ---> Extract (from Salesforce or some other source), which doesn't have an extraction timestamp ---> Write it to Blob with an extraction timestamp.
I want it to be generic, so I don't want to give column mapping anywhere.
Is there any way to use an expression or system variable in a Custom activity to append a column to the output dataset? I'd like a very simple solution to keep the implementation realistic.
To do that, you should change the query to add the column you need, using the query property in the copy activity of the pipeline. https://learn.microsoft.com/en-us/azure/data-factory/connector-salesforce#copy-activity-properties
I don't know much about Salesforce, but in SQL Server you can do the following:
SELECT *, CURRENT_TIMESTAMP as AddedTimeStamp from [schema].[table]
This will give you every field on your table and will add a column named AddedTimeStamp with the CURRENT_TIMESTAMP value in every row of the result.
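If you want the timestamp to come from Data Factory itself (for instance the pipeline run time) rather than from the source database clock, one possible sketch is to put dynamic content in the copy activity's query property (the table name is a placeholder):
SELECT *, '@{utcnow()}' AS ExtractionTimeStamp FROM [schema].[table]
Here @{utcnow()} is ADF string interpolation, so the run timestamp is baked into the query text before it is sent to the source.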
Hope this helped!

How to perform operations inside Cassandra Trigger?

My application collects per-second data from devices and inserts it into a Cassandra table. My idea is to write a trigger on the per-second table that automatically converts the per-second data into hourly/daily data. I'll also store the hourly and daily data in the same table with a different key. To achieve this use case, I need to perform the operations below inside my trigger code.
How can I insert data into the same table, which will invoke the trigger again? (This will be used for converting per-hour data to per-day data.)
How can I insert data into a different table? (To store accumulated data in a temp table.)
How can I select data from a different table? (To fetch the last data for accumulation.)
If I know the above, my application will just insert the per-second data and the rest (the per-second to hourly to daily conversion) will automatically be taken care of by my trigger code.
Can you please help me with the above? It would be great if you could give a code snippet for the same.
Unless you're comfortable with Cassandra internals, you should do this in a data abstraction layer instead of a trigger.
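If it helps, a minimal sketch of what that abstraction layer could write, keeping the questioner's idea of one table with a different key per granularity (the schema and names here are made up for illustration):
CREATE TABLE metrics (
    device_id   text,
    granularity text,        -- 'sec', 'hour' or 'day'
    bucket_ts   timestamp,   -- start of the second/hour/day bucket
    value       double,
    PRIMARY KEY ((device_id, granularity), bucket_ts)
);
-- the application (not a trigger) writes the raw point and, as part of the
-- same write path or on a schedule, the hourly/daily rollups it computes:
INSERT INTO metrics (device_id, granularity, bucket_ts, value)
VALUES ('dev-1', 'sec', '2024-01-01 00:00:01', 3.2);
INSERT INTO metrics (device_id, granularity, bucket_ts, value)
VALUES ('dev-1', 'hour', '2024-01-01 00:00:00', 118.6);
Reads for accumulation are then ordinary SELECTs against the relevant partition, with no trigger re-entrancy to worry about.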
