I have a scenario where I am storing payloads from 2 subscribers (Service Bus topics) into 2 different storage containers. The storing mechanism is different in each case.
Now I have to run a sync every 30 minutes which will compare the files created in the target and the source;
if anything is missing in the target, it should copy that file from the source to the target.
I am looking at AzCopy sync, but that is a local application. There are Logic App and Function App options as well.
Kindly share what the best solution to this problem is.
This method is a little on the brute force side, but should work. Here is the top level pipeline diagram:
Here are the steps:
Get Metadata for the "Needs sync" folder (FolderA). Be sure to add the "Child items" argument:
ForEach over the FolderA child items to extract the file names and append them to an array variable:
This makes it easier to work with the names later.
Get Metadata for the "Always right" folder (FolderB). Same process as above, but over the FolderB location.
ForEach over FolderB's child items.
Inside the ForEach, add an If Condition to test whether or not the FolderB item exists in the FolderA list.
If the FolderB item is not in the FolderA list, append it to a Missing_Items array variable.
From here, it's a matter of looping over the Missing Items array and handling it however you prefer [probably with a Copy activity].
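The comparison in the steps above boils down to a set difference. Here is a minimal Python sketch of that logic only, not of the pipeline itself; the file names and the find_missing helper are made up for illustration:

```python
def find_missing(always_right, needs_sync):
    """Return names present in always_right (FolderB) but absent from needs_sync (FolderA)."""
    existing = set(needs_sync)  # plays the role of the FolderA name array
    # Same test as the If Condition inside the ForEach over FolderB.
    return [name for name in always_right if name not in existing]


# Hypothetical listings standing in for the two Get Metadata outputs.
folder_a = ["a.json", "c.json"]
folder_b = ["a.json", "b.json", "c.json", "d.json"]

missing_items = find_missing(folder_b, folder_a)
print(missing_items)  # the files a Copy activity would then handle
```

Each entry in missing_items corresponds to one element of the Missing_Items array variable.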
I am trying to copy files out of an S3 bucket using Azure Data Factory. Firstly I want a list of the directories.
Using the CLI I would use {aws s3 ls }.
From there I can work through the list in a ForEach and push that into a variable.
In ADF, I have tried to use 'Get Metadata'. This works in theory, but in practice there are 76 files in each directory and the loop is over 1.5m. This just isn't worth it; it takes far too long, especially as listing the directories only takes about 20 seconds for 20000 directories.
Is there a method to do this listing? When creating the dataset we get a "no permissions" error; however, when we use a specific location it works.
Many thanks
I have found another way of completing this task.
So to begin with I am using get metadata with the child option. It produces an array.
I push this into a string variable. With this variable you can then create a stored procedure to pick this apart, using openjson to get just the value. This can then be pulled apart further to get the directory names.
I then merge these into a table.
Using lookup I can then run another stored procedure to return the value I require from the table. This whole process runs in a couple of minutes.
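To show what the "pick this apart" step is doing, here is a rough Python equivalent of the OPENJSON parsing (the real thing is a stored procedure, so this is only an illustration). The sample payload is hypothetical, and it assumes the child items array has the usual name/type entries:

```python
import json


def folder_names(child_items_json):
    """Extract folder names from a Get Metadata childItems array held as a JSON string."""
    items = json.loads(child_items_json)
    # Keep only the directory entries and return just their names.
    return [item["name"] for item in items if item.get("type") == "Folder"]


# Hypothetical childItems output pushed into a string variable.
sample = (
    '[{"name": "dir_001", "type": "Folder"},'
    ' {"name": "readme.txt", "type": "File"},'
    ' {"name": "dir_002", "type": "Folder"}]'
)

names = folder_names(sample)
print(names)
```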
If anyone wants a further explanation, please ask and I will try to create a walkthrough to assist.
I am new to Synapse and I have to make a pipeline that will delete files from folders in a hierarchy like the one in the attached image (expected hierarchy). The red half circles mark the files I would like to delete, for example files older than 2 months.
So far I have made a pipeline for a single folder, and using the ForEach loop I can get to the files and delete the corresponding ones. That works, but since I have about 60-70 folders and even more files, I wanted to go a level higher and have the pipeline execute for each folder. This is where the problem is: when I use a Get Metadata activity on the top folder and a ForEach loop to take the folder names, I cannot access the files in each folder, only the folder itself. Could someone help me solve this?
Deleting pipeline for a single folder using a ForEach loop
We can achieve this using nested for each activities with the help of the execute pipeline activity. As mentioned, Get metadata with wildcards returns all files without folders, and the Delete activity is unable to recognize wildcard folder paths (Folder/*).
I have created a similar folder structure for demo. In my pipeline, I have first created an array parameter req_files (sample1.csv and sample2.csv) with names of files required.
Note: If you want to dynamically do this, you can use append variable to build required file names (file09/22 and file08/22).
I used one get metadata to get folder names (which are inside root folder). I am iterating through the output of get metadata in my for each activity (items value is @activity('root folder contents').output.childItems).
Inside my for each, I used another get metadata activity to loop through each of the sub folders (to get file contents).
Now I have the folder name and list of files inside it. I am going to use execute pipeline to implement nested for each. Create 3 parameters in a new pipeline called delete_pipeline (where I perform delete) as current_folder, folder_files and files_needed.
Pass the following dynamic content for each of them from parent pipeline.
current_folder: @item().name
folder_files: @activity('sub folder contents').output.childItems
files_needed: @pipeline().parameters.req_files
Now in delete_pipeline, I have a for each loop to loop through the list of files we are passing (items value is @pipeline().parameters.folder_files).
Inside this for each, I am using an If condition activity. This is because I want to delete files which are not in my req_files parameter (array from parent pipeline which we passed to files_needed parameter in delete_pipeline). The condition for if condition activity will be as following:
@contains(pipeline().parameters.files_needed,item().name)
We need to delete the file only when it is not present in req_files (files_needed). So, when the condition is false, we perform delete.
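If it helps to see the nested loop outside of the pipeline designer, here is a small Python sketch of the same keep/delete decision. It only models the logic; the folder contents are hypothetical and files_to_delete is an illustrative helper, not an ADF function:

```python
def files_to_delete(current_folder, folder_files, files_needed):
    """Inner loop (delete_pipeline): pick files whose names are not in files_needed."""
    keep = set(files_needed)
    return [
        f"{current_folder}/{item['name']}"
        for item in folder_files
        if item["name"] not in keep  # the If Condition evaluating to false
    ]


req_files = ["sample1.csv", "sample2.csv"]

# Hypothetical sub folder contents, shaped like Get Metadata child items.
root = {
    "folder1": [{"name": "sample1.csv", "type": "File"},
                {"name": "old.csv", "type": "File"}],
    "folder2": [{"name": "sample2.csv", "type": "File"},
                {"name": "tmp.csv", "type": "File"},
                {"name": "notes.txt", "type": "File"}],
}

doomed = []
for folder, files in root.items():  # outer for each (parent pipeline)
    doomed.extend(files_to_delete(folder, files, req_files))  # execute pipeline call

print(doomed)
```

The outer loop passes the folder name and its file list down, just as the three parameters are passed to delete_pipeline.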
I have created 2 parameters file_namepath_of_file_to_delete and file_name_to_delete in the dataset I am using for delete activity with following dynamic content.
file_namepath_of_file_to_delete: Folder/@{pipeline().parameters.current_folder}
file_name_to_delete: @item().name
When I run the pipeline, it keeps the required files and deletes the rest. The following are output images for reference.
Debug output: https://i.imgur.com/E6GNVHW.png
My folder after I run the pipeline: https://i.imgur.com/bqN00Dw.png
I'm quite new to Data Factory and Logic Apps (but I have been working with SSIS for many years).
I succeeded in loading a folder with 100 text files into SQL Azure with Data Factory,
but the files themselves are untouched.
Now, another requirement is that I loop through the folders to get all files with a certain file extension.
In the end I should move (= copy & delete) all the files from the 'To_be_processed' folder to the 'Processed' folder.
I cannot find where to put 'wildcards' and such:
for example, get all files with file extensions .001, .002, .003, .004, .005, ... until ... .996, .997, .998, .999 (a thousand files)
--> also searching in the subfolders.
Is it possible to call a Data Factory from within a Logic App? (although this seems unnecessary)
Please find some more detailed information in this screenshot:
Thanks in advance helping me out exploring this new technology!
Interesting situation.
I agree that using Logic Apps just for this additional layer of file handling seems unnecessary, but Azure Data Factory may currently be unable to deal with exactly what you need...
In terms of adding wild cards to your Azure Data Factory datasets you have 3 attributes available within the JSON type properties block, as follows.
Folder Path - specifies the directory. It can work with a partition by clause for a time slice start and end. Required.
File Name - specifies the file. Again, it can work with a partition by clause for a time slice start and end. Not required.
File Filter - this is where wildcards can be used: (*) matches multiple characters and (?) matches a single character. Not required.
More info here: https://learn.microsoft.com/en-us/azure/data-factory/data-factory-onprem-file-system-connector
I have to say that separately none of the above are ideal for what you require and I've already fed back to Microsoft that we need a more flexible attribute that combines the 3 above values into 1, allowing wildcards in various places and a partition by condition that works with more than just date time values.
That said, try something like the below.
"typeProperties": {
    "folderPath": "TO_BE_PROCESSED",
    "fileFilter": "17-SKO-??-MD1.*" //looks like 2 middle values in image above
}
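To get a feel for which names a filter like this would pick up, here is a quick Python check using the standard fnmatch module. Treat it as an approximation: it assumes Data Factory's (*) and (?) behave like ordinary glob wildcards, and the candidate file names are invented:

```python
from fnmatch import fnmatchcase

FILE_FILTER = "17-SKO-??-MD1.*"  # same pattern as the fileFilter above

# Hypothetical file names in the TO_BE_PROCESSED folder.
candidates = [
    "17-SKO-01-MD1.001",  # two middle characters, any extension
    "17-SKO-99-MD1.999",
    "17-SKO-1-MD1.001",   # only one middle character
    "17-ABC-01-MD1.001",  # different prefix
]

# ? must match exactly one character, * matches any run of characters.
matched = [name for name in candidates if fnmatchcase(name, FILE_FILTER)]
print(matched)
```

Only the first two names pass, since (?) has to consume exactly one character each time.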
On a side note, there is already a Microsoft feedback item that's been raised for a file move activity, which is currently under review.
See here: https://feedback.azure.com/forums/270578-data-factory/suggestions/13427742-move-activity
Hope this helps
We have used a C# application which we call through 'app services' -> webjobs.
It is much easier to iterate through folders that way. To call SQL we used sql bulkinsert.
I have a file that I want to add to source control on Linux using cleartool.
I've followed the IBM documentation for this, and I've tried this:
cleartool mkelem testScript.sh
I got an error: Can't modify directory "." because it is not checked out.
I would also like to know how I can checkout/checkin files or directories and set activities.
You need to checkout the parent folder first.
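cd /path/to/file/
cleartool mkact newfile
# check out the parent directory so a new entry can be added to it
cleartool checkout -c "add file" .
# create the element (it is left checked out by default)
cleartool mkelem testScript.sh
# check in both the new file and the parent directory
cleartool checkin -nc testScript.sh .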
The cleartool mkact would work if you are in a UCM view.
It will create and set a new activity, which will record the files and folders you will modify.
Here, the new activity newfile will record the new version of the parent folder, as well as versions 0 and 1 of the file.
You should create separate questions for... separate questions.
Going back to the original - the reason why it isn't working is, as VonC has pointed out, that you haven't checked out the parent of the file. Remember, when you run "cleartool mkelem", you are about to modify the contents of the parent directory (. in this case) by adding a new "pointer" to the element you're now creating. As with everything else in ClearCase, when you want to modify the contents of an element, you have to check it out first.
One of ClearCase's greatest strengths (and one of the hardest things to wrap one's head around) is the concept of an "element", IMO. "Everything" behaves similarly with an element. Making any change to an "element" (file or directory) means you have to check it out first to make that change.
In the case of a file, that's easy to grasp - you're just editing lines in a file. For a directory, it's almost as easy - you can think of a directory as just a list of pointers to data blobs. We make the name of the blob something convenient we can remember (like foo.java or myapplication.cc or README.md). But we can also change the name of the pointer (even though it points to the same data blob) by renaming a file. We can remove the pointer to the blob without impacting the blob itself by using "rmname". That's essentially what "rmname" does.
In ClearCase's case, the mkelem command is a little bit special - it creates the initial data blob, and adds a pointer to that data blob in the current directory (kind of does 2 things at once).
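If the "list of pointers" picture is still fuzzy, here is a toy Python model of it. This is purely a mental model and not how ClearCase is implemented or scripted; the Vob class and the helper functions are invented for illustration, and checkouts and versions are left out entirely:

```python
class Vob:
    """Toy store: blobs live here, directories only hold name -> blob id pointers."""

    def __init__(self):
        self.blobs = {}
        self.next_id = 0

    def mkelem(self, directory, name, data):
        """Do two things at once: create the blob and add a pointer to it."""
        blob_id = self.next_id
        self.next_id += 1
        self.blobs[blob_id] = data   # 1) create the initial data blob
        directory[name] = blob_id    # 2) modify the directory (hence the checkout)
        return blob_id


def rename(directory, old, new):
    """Change the pointer's name; the blob it points to is untouched."""
    directory[new] = directory.pop(old)


def rmname(directory, name):
    """Remove the pointer only; the blob stays in the store."""
    del directory[name]


vob = Vob()
current_dir = {}
blob = vob.mkelem(current_dir, "README.md", "hello")
rename(current_dir, "README.md", "README.txt")
rmname(current_dir, "README.txt")
print(current_dir, vob.blobs)
```

Every operation that touches current_dir is a change to the directory element, which is why the directory has to be checked out first.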
For a group of developers, all the differences are stored in a normal property file:
token1=some value
token2=9000
etc.
The 'tokens' are used in a series of XML files that reside in the normal src/main/resources directory. When Gradle copies these files into the build directory (and I don't know for sure what task that is), is there any opportunity to execute custom code? Specifically, I would like to have the token values from the property file substituted into the copy. Thus, the original copy remains untouched, but the version in the runtime has the desired values for the given developer.
Finally, I know this can be done by brute force with two or three steps that change the file after it is copied. I really want to know if there is an elegant way to do this in a single step.
As part of the build, Gradle runs the processResources task, which copies the resources into the build directory. While copying resources, processResources can be configured to do the filtering (or possibly to execute custom code by adding a doLast):
processResources {
    filter org.apache.tools.ant.filters.ReplaceTokens, tokens: [
        ...
    ]
}
These two links can provide more help:
http://java.dzone.com/articles/resource-filtering-gradle
http://mrhaki.blogspot.in/2010/11/gradle-goodness-add-filtering-to.html
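For anyone unsure what the token filtering actually does to a file, here is a rough Python imitation of it. It is only an illustration, not Gradle code: it assumes the default @token@ delimiters of Ant's ReplaceTokens, uses a simplified key=value parser, and the XML snippet is made up:

```python
import re


def load_properties(text):
    """Parse simple key=value lines, skipping blanks and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        props[key.strip()] = value.strip()
    return props


def replace_tokens(content, tokens):
    """Substitute @name@ markers; unknown tokens are left as they are."""
    return re.sub(r"@(\w+)@", lambda m: tokens.get(m.group(1), m.group(0)), content)


# Values from the developer's property file in the question.
props = load_properties("token1=some value\ntoken2=9000\n")

# Hypothetical resource file content using the tokens.
xml = '<server name="@token1@" port="@token2@" other="@unknown@"/>'

result = replace_tokens(xml, props)
print(result)
```

The source string stays untouched and only the returned copy carries the developer's values, which is the same idea as filtering during the copy into the build directory.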