Can an Alteryx ML pipeline workflow be built with flat files first and the input later swapped for an API connector?

I'm trying to build an ML pipeline using Alteryx. I'll be pulling data via an API and building an automated workflow, but first, until I get the license and confirm the approach, I'd like to use flat files to build the pipeline. The data in the flat files and the data from the API (batch) would essentially be the same. Can I develop the full pipeline first and then swap out the ingestion portion for the API connector later?
I have searched online but haven't found an answer to this.

Related

Is there a way to visualise the changes to an ADF pipeline when reviewing a PR?

I am currently reviewing between 5 and 15 pull requests a week on a project being developed using Azure Data Factory (ADF) and Databricks.
Most pull requests contain changes to our ADF pipelines, which get stored in source control as nested JSON.
What I've found is that, as a reviewer, being able to visually see the changes being made to an ADF pipeline in the pull request makes a huge difference in the speed and accuracy with which I can perform my review. Obviously, I can check out the branch and view the pipelines for that branch directly in ADF, but that does not give me a differential view.
My question is this: is there a way to parse two ADF pipeline JSON objects (source and destination branch versions of the same file) and generate a visual representation of each object? Ideally highlighting the differences, but just showing them would be a good first stab.
Bonus points if we can fit this into an Azure DevOps release pipeline and generate it automatically as part of the CI/CD pipeline.
If you are already using Azure DevOps then you should have exactly what you want available in every pull request. For any pull request you can click on the Files tab and it will show a side-by-side comparison of every file, color coded to show additions, updates, and removals. It is very helpful for review.
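If you also want a programmatic diff that could run as part of the CI/CD pipeline, a minimal sketch of the idea in Python (standard library only; the file paths are hypothetical, and this produces a side-by-side text diff rather than a rendered pipeline graph) could look like this:

import difflib
import json

def load_pretty(path):
    # Re-serialise with stable key order so the diff shows real changes
    # rather than formatting noise.
    with open(path) as f:
        return json.dumps(json.load(f), indent=2, sort_keys=True).splitlines()

# Hypothetical paths to the same pipeline file on the two branches
source = load_pretty("source-branch/pipeline/my_pipeline.json")
target = load_pretty("target-branch/pipeline/my_pipeline.json")

# Write a side-by-side HTML diff with additions and removals highlighted
html = difflib.HtmlDiff(wrapcolumn=80).make_file(
    source, target, fromdesc="source branch", todesc="target branch"
)
with open("pipeline_diff.html", "w") as f:
    f.write(html)

That still does not render the pipeline as a graph, but it gives a readable diff that could be attached to the pull request automatically.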

Azure Data Factory V2 multiple environments like in SSIS

I'm coming from a long SSIS background, and we're looking to use Azure Data Factory v2, but I'm struggling to find any clear way of working with multiple environments. In SSIS we would have project parameters tied to the Visual Studio project configuration (e.g. development/test/production etc...), and if there were two parameters, SourceServerName and DestinationServerName, these would point to different servers depending on whether we were in development or test.
From my initial playing around I can't see any way to do this in Data Factory. I've searched Google of course, but any information I've found seems to be about CI/CD, then talks about Git 'branches', and is difficult to follow.
I'm basically looking for a very simple explanation and example of how this would be achieved in Azure Data Factory v2 (if it is even possible).
It works differently: you create one Data Factory instance per environment, and your environments are effectively embedded in each instance.
So here's one simple approach:
Create three data factories: dev, test, prod
Create your linked services in the dev environment pointing at dev sources and targets
Create linked services with the same names in test, but of course these point at your test systems
Now when you "migrate" your pipelines from dev to test, they use the same logical name (just like a connection manager)
So you don't designate an environment at execution time or map variables or anything... everything in test just runs against test, because that's the way the linked services have been defined.
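To make the "same logical name, different target" idea concrete, here is a rough sketch in Python (the factory names, server names and the exact shape of the linked service JSON are illustrative, not taken from the answer):

import json

# Hypothetical per-environment connection details; the linked service keeps
# the same logical name in every factory, only the target server changes.
ENVIRONMENTS = {
    "adf-dev":  {"server": "sql-dev.example.net",  "database": "StagingDb"},
    "adf-test": {"server": "sql-test.example.net", "database": "StagingDb"},
    "adf-prod": {"server": "sql-prod.example.net", "database": "StagingDb"},
}

def linked_service_definition(factory_name):
    # Build the body for a linked service called LS_AzureSql in the given
    # factory. Property names are approximate; check the ADF docs.
    env = ENVIRONMENTS[factory_name]
    return {
        "name": "LS_AzureSql",  # same logical name in dev, test and prod
        "properties": {
            "type": "AzureSqlDatabase",
            "typeProperties": {
                "connectionString": (
                    f"Server=tcp:{env['server']},1433;"
                    f"Database={env['database']};"
                )
            },
        },
    }

print(json.dumps(linked_service_definition("adf-test"), indent=2))

Because every factory defines LS_AzureSql under the same name, a pipeline migrated from dev to test picks up the test connection without any changes.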
That's the first step.
The next step is to connect only the dev ADF instance to Git. If you're a newcomer to Git it can be daunting but it's just a version control system. You save your code to it and it remembers every change you made.
Once your pipeline code is in Git, the theory is that you migrate code out of Git into higher environments in an automated fashion.
If you go through the links provided in the other answer, you'll see how you set it up.
I do have an issue with this approach though - you have to look up all of your environment values in a key store, which to me is silly, because why do we need to designate the test server's hostname every time we deploy to test?
One last thing: if you have a pipeline that doesn't use a linked service (say a REST pipeline), I haven't found a way to make that environment aware. I ended up building logic around the current data factory's name to dynamically change endpoints.
This is a bit of a brain dump, but feel free to ask questions.
Although it's not recommended - yes, you can do it.
Take a look at the Linked Service; in this case, I have a connection to an Azure SQL Database.
You can use dynamic content for both the server name and the database name.
Just add a parameter to your pipeline, pass it to the Linked Service, and use it in the required field.
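For illustration, the parameterised linked service definition might look roughly like the following (written here as a Python dict so it can be printed; the property names and expression syntax are approximate, so check the ADF documentation on parameterising linked services):

import json

# Rough shape of a parameterised Azure SQL linked service; the values of
# serverName and databaseName are supplied wherever the linked service is used.
linked_service = {
    "name": "LS_AzureSql_Parameterised",
    "properties": {
        "type": "AzureSqlDatabase",
        "parameters": {
            "serverName": {"type": "String"},
            "databaseName": {"type": "String"},
        },
        "typeProperties": {
            "connectionString": (
                "Server=tcp:@{linkedService().serverName},1433;"
                "Database=@{linkedService().databaseName};"
            )
        },
    },
}

print(json.dumps(linked_service, indent=2))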
Let me know if I explained it clearly enough.
Yes, it's possible, although not as simple as it was in Visual Studio for SSIS.
1) First of all: there is no desktop application for developing ADF, only the browser.
Therefore developers make their changes in the DEV environment, and for many reasons the best way to do this is with a Git repository connected.
2) Then, you "only" need to:
a) publish the changes (this creates/updates the adf_publish branch in Git)
b) with Azure DevOps, deploy the code from adf_publish, replacing the required parameters for the target environment.
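As a rough sketch of what "replacing the required parameters for the target environment" can look like (plain Python; the parameter names below are made up, and in practice this is often done with the release pipeline's built-in parameter overrides rather than a script):

import json

# Hypothetical per-environment overrides for the ARM template parameters file
# that the adf_publish branch contains (ARMTemplateParametersForFactory.json).
OVERRIDES = {
    "test": {"factoryName": "adf-test"},
    "prod": {"factoryName": "adf-prod"},
}

def write_parameters(environment, src="ARMTemplateParametersForFactory.json"):
    # Read the published parameters file and write a copy with the values
    # replaced for the target environment.
    with open(src) as f:
        params = json.load(f)
    for name, value in OVERRIDES[environment].items():
        params["parameters"].setdefault(name, {})["value"] = value
    out = f"parameters.{environment}.json"
    with open(out, "w") as f:
        json.dump(params, f, indent=2)
    return out

print(write_parameters("test"))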
I know that at the beginning it sounds horrible, but the sooner you set up an environment like this the more time you save while developing pipelines.
How to do these things step by step?
I describe all the steps in the following posts:
- Setting up Code Repository for Azure Data Factory v2
- Deployment of Azure Data Factory with Azure DevOps
I hope this helps.

Azure Pipelines (DevOps): Custom Consumable Statistic/Metric

I have a build set up on Azure Pipelines, and one of the steps provides a code metric that I would like to be consumable after the build is done. Ideally, this would be in the form of a badge (text on the left and the metric as a number on the right). I'd like to put such a badge on the README of the repository to make this metric visible on a per-build basis.
Azure DevOps does have a REST API that one can use to access built-in aspects of a given build. But as far as I can tell there's no way to expose a custom statistic or value that is generated or provided during a build.
(The equivalent in TeamCity would be outputting ##teamcity[buildStatisticValue key='My Custom Metric' value='123'] via Console.WriteLine() from a simple C# program, that TeamCity can then consume and use/make available.)
Anyone have experience with this?
One option is to use a combination of adding a build tag using a command:
##vso[build.addbuildtag]"My Custom Metric.123"
Then use the Tags - Get Build Tags API.
GET https://dev.azure.com/{organization}/{project}/_apis/build/builds/{buildId}/tags?api-version=5.0
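Reading the tag back could then be scripted against that endpoint, for example in Python with the requests package (the organisation, project, build id and personal access token below are placeholders):

import requests

ORG, PROJECT, BUILD_ID = "myorg", "myproject", 1234  # placeholders
PAT = "<personal-access-token>"                      # needs Build (read) scope

url = (
    f"https://dev.azure.com/{ORG}/{PROJECT}/_apis/build/builds/{BUILD_ID}/tags"
    "?api-version=5.0"
)
# Azure DevOps accepts a PAT as the password of HTTP basic auth with any username
response = requests.get(url, auth=("", PAT))
response.raise_for_status()

# The response body is {"count": n, "value": ["tag1", ...]}; the exact tag text
# depends on how the ##vso command quoted it, so just pick out the matches here.
tags = response.json().get("value", [])
print([tag for tag in tags if "My Custom Metric" in tag])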

Azure data factory | incremental data load from SFTP to Blob

I created a run-once Data Factory (V2) pipeline to load files (.lta.gz) from an SFTP server into an Azure blob to get the historical data.
Worked beautifully.
Every day there will be several new files on the SFTP server (which cannot be manipulated or deleted), so I want to create an incremental load pipeline which checks daily for new files and, if there are any, copies them.
Does anyone have any tips for me how to achieve this?
Thanks for using Data Factory!
To incrementally load newly generated files on SFTP server, you can leverage the GetMetadata activity to retrieve the LastModifiedDate property:
https://learn.microsoft.com/en-us/azure/data-factory/control-flow-get-metadata-activity
Essentially you author a pipeline containing the following activities:
getMetadata (return list of files under a given folder)
ForEach (iterate through each file)
getMetadata (return lastModifiedTime for a given file)
IfCondition (compare lastModifiedTime with trigger WindowStartTime)
Copy (copy file from source to destination)
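Purely to make the logic of that pattern concrete (this is not how the ADF activities themselves are authored), a rough Python equivalent, assuming the paramiko package and made-up connection details, could look like this:

from datetime import datetime, timedelta, timezone
import paramiko

# Hypothetical connection details; in ADF the window start comes from the
# trigger, here we simply use "the last 24 hours".
HOST, USER, PASSWORD, REMOTE_DIR = "sftp.example.com", "user", "secret", "/outbound"
window_start = datetime.now(timezone.utc) - timedelta(days=1)

transport = paramiko.Transport((HOST, 22))
transport.connect(username=USER, password=PASSWORD)
sftp = paramiko.SFTPClient.from_transport(transport)

try:
    # getMetadata: list the files under the folder and check each one's mtime
    for entry in sftp.listdir_attr(REMOTE_DIR):
        modified = datetime.fromtimestamp(entry.st_mtime, tz=timezone.utc)
        # IfCondition: only act on files modified after the window start
        if modified >= window_start:
            # Copy: downloaded locally here; uploading to Blob storage
            # (e.g. with the azure-storage-blob package) is left out.
            sftp.get(f"{REMOTE_DIR}/{entry.filename}", entry.filename)
finally:
    sftp.close()
    transport.close()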
Have fun building data integration flows using Data Factory!
Since I posted my previous answer in May last year, many of you have contacted me asking for a pipeline sample to achieve the incremental file copy scenario using the getMetadata-ForEach-getMetadata-If-Copy pattern. This has been important feedback: incremental file copy is a common scenario that we want to further optimize.
Today I would like to post an updated answer - we recently released a new feature that allows a much easier and more scalable approach to achieve the same goal:
You can now set modifiedDatetimeStart and modifiedDatetimeEnd on the SFTP dataset to specify a time range filter, so that only files created or modified during that period are extracted. This enables you to achieve the incremental file copy using a single activity:
https://learn.microsoft.com/en-us/azure/data-factory/connector-sftp#dataset-properties
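For a feel of what that looks like, here is a rough sketch of an SFTP dataset with the two filters filled in (shown as a Python dict; property names are approximate, so refer to the linked connector documentation for the exact format):

import json

sftp_dataset = {
    "name": "SftpSourceFiles",
    "properties": {
        "type": "FileShare",  # dataset type used by the file-based connectors
        "linkedServiceName": {"referenceName": "LS_Sftp", "type": "LinkedServiceReference"},
        "typeProperties": {
            "folderPath": "outbound",
            # Literal values shown for simplicity; in a scheduled pipeline these
            # would normally be driven by the trigger window.
            "modifiedDatetimeStart": "2019-07-01T00:00:00Z",
            "modifiedDatetimeEnd": "2019-07-02T00:00:00Z",
        },
    },
}

print(json.dumps(sftp_dataset, indent=2))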
This feature is enabled for these file-based connectors in ADF: AWS S3, Azure Blob Storage, FTP, SFTP, ADLS Gen1, ADLS Gen2, and on-prem file system. Support for HDFS is coming very soon.
Further, to make it even easier to author an incremental copy pipeline, we have now released common pipeline patterns as solution templates. You can select one of the templates, fill out the linked service and dataset info, and click deploy; it is that simple!
https://learn.microsoft.com/en-us/azure/data-factory/solution-templates-introduction
You should be able to find the incremental file copy solution in the gallery:
https://learn.microsoft.com/en-us/azure/data-factory/solution-template-copy-new-files-lastmodifieddate
Once again, thank you for using ADF and happy coding data integration with ADF!

Add datafactory project in visual studio for version2 of ADF

We can add a Data Factory project for ADF version 1. Can we add a project for ADF version 2, and if yes, how can we do that? Whenever I try to add a Data Factory project it gives me the option for version 1 and not for version 2. Is there any way to add an ADF version 2 project to my solution?
Things have changed between ADF V1 and V2. I am presuming that when you say an ADF V1 project in Visual Studio, you mean the empty project for a data integration solution where you had to create linked services, datasets, and pipelines based on the defined templates in JSON format.
From ADF V2 onwards, if you want to code ADF V2 in C# you can simply create a console application and code your way through ADF V2. Likewise, if you choose other programming methods, the way to do it changes.
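The answer describes a C# console application; purely as an illustration of the same "code your way through ADF V2" idea, a sketch in Python, assuming the azure-identity and azure-mgmt-datafactory packages, might look roughly like this:

from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient

SUBSCRIPTION_ID = "<subscription-id>"     # placeholder
RESOURCE_GROUP = "<resource-group-name>"  # placeholder

client = DataFactoryManagementClient(DefaultAzureCredential(), SUBSCRIPTION_ID)

# List the data factories in a resource group and the pipelines in each one
for factory in client.factories.list_by_resource_group(RESOURCE_GROUP):
    print(factory.name)
    for pipeline in client.pipelines.list_by_factory(RESOURCE_GROUP, factory.name):
        print("  ", pipeline.name)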
For more detailed reference please have a look at the complete ADFV2 documentation.
