I'm ingesting data with API calls and would like to parameterize it with widgets. In Azure I have the following set up:
I have a list of attribute_codes; I read them with a Lookup activity and pass each one as a parameter into the Databricks notebook. Code inside the notebook:
data, response = get_data_url(url=f"https://p.cloud.com/api/rest/v1/attributes/{attribute_code}/options",access_token=access_token)
#Removing the folder in Data Lake
dbutils.fs.rm(f'/mnt/bronze/attribute_code/{day}',True)
#Creating the folder in the Data Lake
dbutils.fs.mkdirs(f'/mnt/bronze/attribute_code/{day}')
count = 0
#Putting the response inside of the Data Lake folder
dbutils.fs.put(f'/mnt/bronze/attribute_code/{day}/data_{count}.json', response.text)
My problem is that, since this runs inside the ForEach loop, every time a new parameter is passed the notebook deletes the entire folder along with the previously loaded data. Someone might suggest removing the line where I drop and recreate the specific daily folder, but the pipeline has to run multiple times a day, and on each run I need to drop the data previously loaded that day and load the new data.
My goal is to iterate over the entire list of attribute_codes and load them all into one folder, with files named "data_{count}.json".
Instead of using dbutils.fs.rm in your notebook, you can use a Delete activity before the ForEach activity to get the desired result.
With dbutils.fs.rm, the folder is deleted each time the notebook is triggered inside the ForEach loop, deleting the previously created files as well.
So, by using a Delete activity only once before the ForEach loop to delete the folder (it deletes only if the folder exists), you can load the data as required.
For the path, I have used the following dynamic content:
attribute/#{formatDateTime(utcNow(),'yyyy-MM-dd')}
And I am using the following code in my Databricks notebook:
#I used similar code
data, response = get_data_url(url=f"https://p.cloud.com/api/rest/v1/attributes/{attribute_code}/options",access_token=access_token)
#Creating the folder in the Data Lake
dbutils.fs.mkdirs(f'/mnt/bronze/attribute_code/{day}')
count = 0
#Putting the response inside of the Data Lake folder
dbutils.fs.put(f'/mnt/bronze/attribute_code/{day}/data_{count}.json', response.text)
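For reference, below is a minimal sketch of how the notebook side could look if both the attribute code and a file index are passed in from the ForEach loop as notebook parameters, so that every iteration writes its own data_{count}.json into the shared daily folder. The widget names are assumptions, and get_data_url/access_token are the helper and token from the question, not a fixed API.
#A minimal sketch, assuming the ForEach passes 'attribute_code' and 'file_index' as notebook parameters
from datetime import datetime, timezone

attribute_code = dbutils.widgets.get('attribute_code')   #current item from the Lookup output (assumed widget name)
count = dbutils.widgets.get('file_index')                #position of the item, used for data_{count}.json (assumed widget name)
day = datetime.now(timezone.utc).strftime('%Y-%m-%d')

#get_data_url and access_token come from the question's notebook
data, response = get_data_url(url=f"https://p.cloud.com/api/rest/v1/attributes/{attribute_code}/options", access_token=access_token)

#The Delete activity before the ForEach clears the daily folder, so nothing is removed here
dbutils.fs.mkdirs(f'/mnt/bronze/attribute_code/{day}')
dbutils.fs.put(f'/mnt/bronze/attribute_code/{day}/data_{count}.json', response.text, True)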
Let's say I have the following output from my Lookup activity:
When I run the pipeline, it would run successfully. Only the latest Lookup data would be loaded.
In my Azure Data Factory I need to copy data from an SFTP source that has structured the data into date-based directories with the following hierarchy:
year -> month -> date -> file
I have created a linked service and a binary dataset where the dataset's "filesystem" points to the host and "Directory" points to the folder that contains the year directories, e.g. host/exampledir/yeardir/, with yeardir containing the year directories.
When I manually write the folder "2015" into the dataset, it copies the entirety of the 2015 folder. However, if I put a parameter on the directory and pass the same folder path from a copy activity, it creates an empty file called "2015" inside my blob storage.
My current workaround is a nested sequence of Get Metadata activities and ForEach loops that drill into each folder and subfolder and copy the individual leaf files. The desired result, however, is to have the single binary dataset copy each folder without the need for Get Metadata.
Is this possible within the scope of the data factory?
edit:
manual filepath that works
parameterized filepath
properties used in copy activity
To add further context, I have tried manually writing the file path into the copy activity as shown in the photo. I have also attempted to use variables, dynamic content for the parameter (using the base file path and concat), and putting the base file path into the dataset alongside #dataset().filePath. None of these solutions has worked for me so far; they either copy nothing or create the empty file I mentioned earlier.
The sink is a binary dataset linked to Azure Data Lake Storage Gen2.
sink filepath
Update:
The accepted answer is the solution. My problem was that the source dataset path, when retrieved and passed as a parameter, had a newline at the end. I used concat to clean this up and it has worked since then.
Since giving exampledir/yeardir/2015 worked perfectly for you, and you want to copy all the folders present in exampledir/yeardir, you can follow the procedure below:
I have taken a Get Metadata activity to get the child items of the folder exampledir/yeardir/ (in my demonstration, I have used the path 'maindir/yeardir').
This will give you all the year folders present. I have taken only 2020 and 2021 as an example.
Now, with only one ForEach activity, whose items value is the child items output of the Get Metadata activity, I have directly used a Copy activity.
#activity('Get Metadata1').output.childItems
Now, inside the ForEach I have my Copy data activity. For both source and sink, I have created a dataset parameter for the path. I have given the following dynamic content for the source path.
maindir/yeardir/#{item().name}
For sink, I have given the output directory as follows:
outputDir/#{item().name}
Since giving the path manually as exampledir/yeardir/2015 worked, we get the list of year folders using the Get Metadata activity, loop through each of them, and copy each folder with the source path exampledir/yeardir/<current_iteration_year_folder>.
Based on how I have given my sink path, the data will be copied along with its contents. The following is a reference image.
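Outside of Data Factory, the same loop logic can be sketched in a few lines of Python; this is only an illustration of how the source and sink paths are built per year folder (the folder list is hard-coded where Get Metadata would supply it), not the Copy activity itself.
#Illustration only: mimics the Get Metadata -> ForEach -> Copy path construction
year_folders = ['2020', '2021']                #stands in for Get Metadata childItems

for year in year_folders:                      #stands in for the ForEach activity
    source_path = f'maindir/yeardir/{year}'    #source dataset parameter: maindir/yeardir/#{item().name}
    sink_path = f'outputDir/{year}'            #sink dataset parameter: outputDir/#{item().name}
    print(source_path, '->', sink_path)        #the Copy activity copies the whole folder per iteration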
I have files that are uploaded into an on-prem folder daily. From there, one pipeline pulls them into a blob storage container (input), and another pipeline copies from blob (input) to blob (output); the data flow sits between those two blobs. Finally, output is linked to SQL. However, I want the blob-to-blob pipeline to pull only the file that was uploaded that day and run it through the data flow. The way I have it set up, every time the pipeline runs, it doubles my files. I've attached images below.
[![Blob to Blob Pipeline][1]][1]
Please let me know if there is anything else that would make this more clear
[1]: https://i.stack.imgur.com/24Uky.png
I want the blob to blob pipeline to pull only the file that was uploaded that day and run through the dataflow.
To achieve the above scenario, you can use Filter by last modified by passing the dynamic content below:
#startOfDay(utcnow()) : it takes the start of the day for the current timestamp.
#utcnow() : it takes the current timestamp.
Input and output of the Get Metadata activity (it filters files for that day only):
If there are multiple files for a particular day, then you have to use a ForEach activity and pass the output of the Get Metadata activity to the ForEach as
#activity('Get Metadata1').output.childItems
Then add a Data flow activity inside the ForEach and create a source dataset with a filename parameter.
Use this filename parameter as the dynamic value for the file name.
Then pass the source parameter filename as #item().name
It will run the data flow for each file that Get Metadata returns.
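For comparison, here is a rough Python sketch of the same "start of day to now" filter using the azure-storage-blob SDK; the connection string is a placeholder and the container name is taken from the question, so treat it as an illustration of the window rather than part of the pipeline.
#Rough sketch: list blobs modified today (UTC), mirroring
#Filter by last modified = startOfDay(utcnow()) .. utcnow()
from datetime import datetime, timezone
from azure.storage.blob import ContainerClient

container = ContainerClient.from_connection_string(
    '<storage-connection-string>',   #placeholder
    container_name='input')          #the input container from the question

now = datetime.now(timezone.utc)
start_of_day = now.replace(hour=0, minute=0, second=0, microsecond=0)

todays_files = [b.name for b in container.list_blobs()
                if start_of_day <= b.last_modified <= now]
print(todays_files)                  #the files the ForEach/data flow would process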
I was able to solve this by selecting "Delete source files" in the data flow. This way the first pipeline pulls the new daily report into input, and when the second pipeline (with the data flow) pulls the file from input to output, it deletes the file in input, so nothing gets duplicated.
When copying a file from S3 to Azure Blob Storage, I would like to add a date and time string in addition to the source file name.
In essence, the S3 folder structure looks like this
data/yyyy/mm/dd/files
*yyyy=2019-2022, mm=01-12, dd=01-31
And when copying these to Blob, we want to store them in the following folder structure.
data/year=yyyy/month=mm/day=dd/files
Attached is a picture of the folder structure of the S3 bucket and the folder structure we want to achieve with Blob Storage.
I manually renamed all the photo folders in Blob Storage, but there are thousands of files and it takes time, so I want to do it automatically.
Do I use the "GetMetadata" or "ForEach" activity?
Or use dynamic parameters in the "Copy" activity to set up a sink dataset?
Also, I am not an experienced data engineer and am not familiar with Synapse, so I have no idea how to do this due to my lack of knowledge.
Any help would be appreciated.
Thanks.
Using the Get Metadata activity, ForEach activity, and Execute Pipeline activity, get the nested folder structure from the source dataset. Pass the extracted folder structure to the sink dataset dynamically, adding the required string values to the folder names.
Create a source dataset with the dataset parameter for the directory.
Pipeline1:
Using the Get Metadata activity, get the child items under the container (data/).
Pass the child items to the ForEach activity to loop each folder.
#activity('get sub folder list_yyyy').output.childItems
Inside ForEach activity, add the execute pipeline activity. Create a new pipeline (pipeline2) with 2 parameters in it to hold the source and sink folder structure. Pass the pipeline2 parameter values from pipeline1.
SubFolder1: #item().name
Sink_dir1: #concat('year=',item().name)
Pipeline2:
In pipeline2, repeat the same processes as pipeline1. Using Get Metadata activity get the child items under the folder (yyyy folder) and pass the child items to ForEach activity.
Pipeline2 parameters:
Get Metadata:
Dataset property - dir: #pipeline().parameters.SubFolder1
Inside ForEach activity, add execute pipeline to pass the current item to nested pipeline (pipeline3). Create 2 pipeline parameters inside pipeline3 to hold source and sink structures.
SubFolder2: #concat(pipeline().parameters.SubFolder1,'/',item().name)
sink_dir2: #concat(pipeline().parameters.sink_dir1,'/month=',item().name)
Pipeline3:
Using the Get Metadata activity get the child items under the source structure.
Dataset property – dir: #pipeline().parameters.SubFolder2
Pass the child items to ForEach activity. Inside ForEach activity add copy data activity to copy files from source to sink.
Connect the source to the source dataset and pass the directory parameter dynamically by concatenating the parameter value and current child item.
dir: #concat(pipeline().parameters.SubFolder2,'/',item().name,'/')
Create a sink dataset with dataset parameters to pass the directory path dynamically.
In the sink, pass the directory path dynamically by concatenating the parameter value with the current child item path.
Sink_dir: #concat(pipeline().parameters.sink_dir2,'/day=',item().name,'/')
Output structure: It creates the folder structure automatically if not available in the sink.
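The path rewrite that these nested pipelines build up step by step can be summarised in a few lines of Python; this is only an illustration of the mapping from data/yyyy/mm/dd/file to data/year=yyyy/month=mm/day=dd/file (the sample file name is made up), not a replacement for the pipelines.
#Illustration of the source-to-sink path mapping assembled by the nested pipelines
def to_partitioned_path(source_path: str) -> str:
    #source_path is expected to look like 'data/2021/01/31/<file>'
    root, yyyy, mm, dd, filename = source_path.split('/', 4)
    return f'{root}/year={yyyy}/month={mm}/day={dd}/{filename}'

print(to_partitioned_path('data/2021/01/31/photo-001.jpg'))
#data/year=2021/month=01/day=31/photo-001.jpg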
You will first need the file name (use Get Metadata). Then, for each file name, append a date and time string using functions like concat(). You can also create a variable 'NewFileName' and pass it as a parameter to the Copy activity. The copy source will then have the original file name and the sink will have the new file name. The Copy activity will be parameterized, as you will be passing the file name dynamically.
Hope this helps.
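As a rough illustration of this approach, the new file name could be built from the original name plus a timestamp as below; the helper and the sample file name are hypothetical, and in the pipeline the same result comes from concat() and the 'NewFileName' variable.
#Hypothetical helper: append a UTC date-time string to the original file name
import os
from datetime import datetime, timezone

def with_timestamp(file_name: str) -> str:
    stem, ext = os.path.splitext(file_name)
    stamp = datetime.now(timezone.utc).strftime('%Y%m%d_%H%M%S')
    return f'{stem}_{stamp}{ext}'

print(with_timestamp('photo-001.jpg'))
#e.g. photo-001_20240101_120000.jpg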
I'm trying to solve the following scenario in Azure Data Factory:
I have a large number of folders in Azure Blob Storage. Each folder contains a varying number of files in Parquet format. The folder name contains the date when the data in the folder was generated, something like this: DATE=2021-01-01. I need to filter the files and save them into another container in delimited format, and each file should have the date from the source folder name in its file name.
So when my input looks something like this...
DATE=2021-01-01/
data-file-001.parquet
data-file-002.parquet
data-file-003.parquet
DATE=2021-01-02/
data-file-001.parquet
data-file-002.parquet
...my output should look something like this:
output-data/
data_2021-01-01_1.csv
data_2021-01-01_2.csv
data_2021-01-01_3.csv
data_2021-01-02_1.csv
data_2021-01-02_2.csv
Reading files from subfolders, filtering them, and saving them is easy. Problems start when I try to set the output dataset file name dynamically. I can get the folder names using a Get Metadata activity and then use a ForEach activity to set them into variables. However, I can't figure out how to use this variable in the filtering data flow's sink dataset.
Update:
My Get Metadata1 activity: I set the container input as follows:
My debug info is as follows:
I think I've found the solution. I'm using CSV files as an example.
My input looks something like this:
container:input
2021-01-01/
data-file-001.csv
data-file-002.csv
data-file-003.csv
2021-01-02/
data-file-001.csv
data-file-002.csv
My debug result is as follows:
I use a Get Metadata1 activity to get the folder list and then a ForEach1 activity to iterate over this list.
Inside the ForEach1 activity, we use a data flow to move the data.
Set the source dataset to the container and declare a parameter FolderName.
Then add dynamic content #dataset().FolderName to the source dataset.
Back in the ForEach1 activity, we can add dynamic content #item().name to the parameter FolderName.
Key in File_Name as the 'Column to store file name' in the source settings. It will store the file name as a column, e.g. /2021-01-01/data-file-001.csv.
Then we can process this column to get the file name we want via DerivedColumn1.
Add the expression concat('data_',substring(File_Name,2,10),'_',split(File_Name,'-')[5]).
In the sink Settings, we can select 'Name file as column data' and choose the File_Name column.
That's all.
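To make the derived column easier to follow, here is a small Python equivalent of that expression applied to a sample File_Name value; note that the data flow expression language uses 1-based positions and array indexes, while Python is 0-based.
#Illustration only: reproduces what the DerivedColumn1 expression does to File_Name
def derive_output_name(file_name: str) -> str:
    #file_name looks like '/2021-01-01/data-file-001.csv'
    date_part = file_name[1:11]              #substring(File_Name, 2, 10)
    last_part = file_name.split('-')[4]      #split(File_Name, '-')[5]
    return f'data_{date_part}_{last_part}'   #concat('data_', ..., '_', ...)

print(derive_output_name('/2021-01-01/data-file-001.csv'))
#data_2021-01-01_001.csv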
I am using Stream Analytics to join streaming data (via IoT Hub) and reference data (via blob storage). The reference data blob file is generated every minute with the latest data and is named in the format "filename-{date} {time}.csv". The reference blob's data is used as parameters in the Azure Machine Learning function in the SA job. The output of the Stream Analytics job (into Azure SQL or Power BI) seems to generate multiple rows instead of one for the Azure Machine Learning function's output, one for each set of parameter values from the previous blob files. My understanding is that it should only use the latest blob file's content, but it looks like it is using all the blob files and generating multiple rows from the AML output. Here is the query I am using:
SELECT
    AMLFunction(Ref.Input1, Ref.Input2), *
FROM IoTInput Stream
LEFT JOIN RefBlobInput Ref ON Stream.DeviceId = Ref.[DeviceID]
Please can you advise whether the query or the file path needs changing to avoid duplicate records? Thanks
For only the latest file to take effect, you need to store your files in a particular folder structure.
If you have noticed, whenever you select a reference data file as a stream input, the input dialog asks you for a folder structure along with a date and time format.
Stream Analytics always looks for the reference file in the latest {date}/{time} folder, i.e. you need to store your file like:
2018-01-25/07:30/filename.json (YYYY-MM-DD/HH-mm/filename.json)
NOTE: Your time folder needs to be unique for each minute, just as the date folder needs to be unique for each date. Whenever you create a new file, create it under a new time-stamp folder within the current date folder.
You can use any date/time format that the stream input supports.
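As a quick illustration, the path for the current minute's reference file could be generated like this; the file name is a placeholder and the separators should match whatever date and time formats are configured on the reference input.
#Illustration: build the reference blob path for the current UTC minute,
#following the YYYY-MM-DD/HH-mm/filename.json pattern described above
from datetime import datetime, timezone

now = datetime.now(timezone.utc)
path = f'{now:%Y-%m-%d}/{now:%H-%M}/filename.json'
print(path)   #e.g. 2018-01-25/07-30/filename.json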