Filesystem SDK vs Azure Data Factory

I'm very new to Azure Data Lake Storage and currently training on Data Factory. I have a developer background, so right away I'm not a fan of the 'tools' approach to development. I really don't like how there are all these settings to set and objects you have to create everywhere. I much prefer a code approach, which lets us detach the logic from the service (I don't like the publishing step to save), see everything by scrolling or navigating to different objects in a project, see differences more easily in source control, and so on. So I found Microsoft's Filesystem SDK, which seems to be an alternative to Data Factory:
https://azure.microsoft.com/en-us/blog/filesystem-sdks-for-azure-data-lake-storage-gen2-now-generally-available/
What has been your experience using this approach? Is this a good alternative? Is there a way to run SDK code in Data Factory, so we can leverage its scheduling and triggers? I guess I'm looking for pros/cons.
Thank you.

Well, the docs refer to several SDKs, one of them being the .NET SDK, and the title is
Use .NET (or Python or Java etc.) to manage directories, files, and ACLs in Azure Data Lake Storage Gen2
So, the SDK lets you manage the filesystem only. There is no support for triggers, pipelines, data flows and the like. You will have to stick with Azure Data Factory for that.
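To give a concrete feel for what "manage the filesystem" means here, below is a minimal sketch using the Python flavour of the SDK (the `azure-storage-file-datalake` package). The account, container, and path names are placeholders, and a real app would likely authenticate with Azure AD rather than an account key:

```python
# pip install azure-storage-file-datalake
from azure.storage.filedatalake import DataLakeServiceClient

# Placeholder account URL and key; DefaultAzureCredential is the more common choice.
service = DataLakeServiceClient(
    account_url="https://<storage-account>.dfs.core.windows.net",
    credential="<account-key>",
)

fs = service.get_file_system_client("my-container")   # a Gen2 filesystem (container)
directory = fs.create_directory("raw/2021")           # create a directory
file = directory.create_file("sales.csv")             # create a file inside it

data = b"id,amount\n1,100\n"
file.append_data(data, offset=0, length=len(data))    # stage the bytes
file.flush_data(len(data))                            # commit them

# ACL management is also exposed on the directory/file clients:
acl = directory.get_access_control()                  # owner, group, permissions, acl
print(acl["permissions"])
```

That's the extent of it: directories, files, and ACLs. Nothing in there schedules or orchestrates anything.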
Regarding this:
I'm not a fan of the 'tools' approach for development
I hate to tell you, but the world is moving that way whether you like it or not. Take Logic Apps, for example. Azure Data Factory isn't aimed at the hardcore developer, but it fulfils a need for people working with large sets of data, like data engineers. I'm already glad it integrates with Git very well. Yes, there is some overhead in defining sinks and sources, but they are reusable across pipelines.
If you really want to use code, try Azure Databricks. Take a look at this Q&A as well.
TL;DR:
The FileSystem SDK is not an alternative.

The code-centric alternative to Azure Data Factory for building and managing your Azure Data Lake is Spark. Typically either Azure Databricks or Azure Synapse Spark.
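For context, a typical Spark-based flow over a Data Lake looks something like the following minimal PySpark sketch. It assumes a Databricks or Synapse Spark session where `spark` is already defined and the cluster has access to the storage account; the container, account, paths, and column name are placeholders:

```python
# Read raw CSVs from ADLS Gen2, aggregate, and write curated Parquet back out.
raw_path = "abfss://raw@<storage-account>.dfs.core.windows.net/sales/2021/*.csv"
curated_path = "abfss://curated@<storage-account>.dfs.core.windows.net/sales_by_region"

df = (spark.read
      .option("header", "true")
      .csv(raw_path))

summary = df.groupBy("region").count()   # example transformation

(summary.write
 .mode("overwrite")
 .parquet(curated_path))
```

All of the logic lives in code, can be kept in source control as ordinary files, and can still be scheduled from Data Factory via a Databricks or Synapse activity.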

Related

What would be the best technology for extracting and parsing a file

I'm pretty new to Azure and wanted some direction regarding my needs. I have a flat file from a provider, hosted on their FTP server. I need to retrieve it, extract data from the file, store the results, etc.
What feature on Azure would you recommend?
It depends, but I recommend Azure Data Factory, as it's the default ETL solution on Azure.
Other alternatives:
Azure Logic Apps
Azure Functions (with a timer trigger; see the sketch below)
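If the Functions route is interesting, here is a rough sketch of what a timer-triggered function doing the FTP pull and parse could look like. It assumes the Python v1 programming model (a `function.json` with a timerTrigger binding), and the FTP host, credentials, remote path, and delimiter are all placeholders:

```python
import io
import ftplib
import logging

import azure.functions as func


def main(mytimer: func.TimerRequest) -> None:
    # Download the flat file from the provider's FTP server into memory.
    ftp = ftplib.FTP("ftp.provider.example.com")
    ftp.login("user", "password")

    buffer = io.BytesIO()
    ftp.retrbinary("RETR /exports/daily.txt", buffer.write)
    ftp.quit()

    # Parse each delimited record; storing the results (e.g. Azure SQL Database
    # or a Blob Storage output binding) would go where the log call is.
    for line in buffer.getvalue().decode("utf-8").splitlines():
        fields = line.split("|")
        logging.info("parsed record: %s", fields)
```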

Azure Datalake on-premise or hybrid stack

We are trying to evaluate a good fit for our solution. We want to process big data, and for that we want to build the solution around the Hadoop stack. We wanted to know how Azure can help in these situations. The solution we are building is a SaaS offering, but some of our clients have confidential data that they want to keep only on their premises.
So can we run Azure Data Lake on-premises for those clients?
Can we have a hybrid model where the storage is on-premises but the processing is done in the cloud?
The reason we are asking is to address questions of scalability and reliability.
I know this is vague but if you need more clarification please let us know.
Azure Data Lake (Gen2) hierarchical filesystem support in Azure Stack would let you use it natively for your storage requirements. Unfortunately, Azure Stack does not currently support Azure Data Lake.
You can find the list of available services here: https://azure.microsoft.com/en-gb/overview/azure-stack/keyfeatures/. Some of the big data ecosystem Azure tools are in development but not yet generally available.
There is a feature request to add this support. You can vote this up to help Microsoft prioritize this. https://feedback.azure.com/forums/344565-azure-stack/suggestions/38222872-support-adls-gen2-on-azure-stack

JavaScript in U-SQL: is it a possible option in Azure Data Lake Analytics?

We plan to follow the Lambda architecture for our solution. The solution stack is on top of Azure: Azure Data Lake Analytics is used for batch processing, and Stream Analytics for online processing. We want to use the same code and configuration in both the batch and streaming layers. Is there any option to use JavaScript in U-SQL with the help of .NET assemblies? Azure Stream Analytics supports only JavaScript UDFs. Has anyone tried similar options on the Azure stack?
This probably won't help much, but I wanted to share it: if someone does have C# experience, it would be an interesting approach.
https://josephwoodward.co.uk/2016/09/executing-javascript-inside-dot-net-core-using-javascript-services
Regarding your specific issue, have you looked at the Node.js module?
https://learn.microsoft.com/en-us/javascript/api/overview/azure/data-lake-analytics?view=azure-node-latest

Which Azure products are needed for a staging database?

I have several external data APIs that I access using some Python scripts. My scripts run from an on-premises server, transform the data, and store it in a SQL Server database on the same server. I suppose it's a rudimentary ETL system run with Python and T-SQL.
The system is about to grow quite a bit with new APIs and will require more complex data pipelines (for example, some of the API data will be spun off to more than one table). I think this would be a good time to move the system onto Azure (we are heavily integrated with Microsoft so it will have to be Azure!).
I have spent a few days researching the Azure products that would let me run Python scripts to access data from web APIs and store the processed data in a cloud database. I'm looking for advice on what sort of Azure products other people have used for similar jobs. At the moment it seems I will need:
Azure SQL Database to hold the processed data that can be accessed by various colleagues.
Azure Data Factory to manage, log, and schedule the pipeline jobs and to run my custom Python scripts (is this even possible?).
Azure Batch to run the aforementioned Python scripts but I'm not sure about this.
I want to put together a proposal basically and start thinking about costs but it would be good to hear from someone who has done something similar - am I on the right track or completely off? Should I just stay on-premises? Thank you in advance.
Azure SQL Database and Azure SQL Data Warehouse are good for relational data. If you want to use NoSQL, you could go with Azure Cosmos DB, and if you want to store data as files, you could use Azure Data Lake.
For the Python scripts, you could use a custom activity or Databricks from Azure Data Factory.
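For illustration, the kind of script such a custom activity (or a plain scheduled job) would run might look roughly like this. It assumes the `requests` and `pyodbc` packages, an ODBC driver on the host, and placeholder API endpoint, connection details, and table/column names:

```python
# pip install requests pyodbc
import requests
import pyodbc

API_URL = "https://api.example.com/v1/orders"          # placeholder endpoint
CONN_STR = (
    "Driver={ODBC Driver 17 for SQL Server};"
    "Server=tcp:<server>.database.windows.net,1433;"
    "Database=<database>;Uid=<user>;Pwd=<password>;"
    "Encrypt=yes;TrustServerCertificate=no;"
)


def run() -> None:
    # Pull data from the external API, then load it into an Azure SQL staging table.
    rows = requests.get(API_URL, timeout=30).json()
    with pyodbc.connect(CONN_STR) as conn:
        cursor = conn.cursor()
        for row in rows:
            cursor.execute(
                "INSERT INTO staging.Orders (OrderId, Amount) VALUES (?, ?)",
                row["id"], row["amount"],
            )
        conn.commit()


if __name__ == "__main__":
    run()
```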
Azure SQL Data Warehouse should be used if the amount of data you want to load is in the petabyte range. Also, Azure SQL Data Warehouse is not meant for complex transformations; I would recommend it for plain data loads with PolyBase.

Copy files from on-prem to azure

I'm new to the Azure ecosystem. I'm doing some research on copying data from on-prem to Azure. I found the following options:
AzCopy
Azure Data Factory (Copy Data Tool)
Data Management Gateway
Ours is a Microsoft shop, so I'm looking for tools that gel with the MS platform. Also, down the line, we want to automate the entire thing as much as we can, so I think Azure Storage Explorer is out of the question. Is there a preference among the above three? Or are there any better tools?
I think you are mixing things up: the Copy Data Tool is just an Azure Data Factory wizard that sets up simple data movement between resources. Azure Data Factory uses the Data Management Gateway to reach on-premises resources such as files and databases.
What you want to do can be done with Azure Data Factory. I recommend using version 2 (even in its preview form) because its authoring experience is easier to understand if you are new to the tool. You can graphically configure linked services, datasets, and pipelines from there.
I hope this helped, if you need further help just ask away!
If you're already familiar with SSIS, there's also the option to run SSIS in ADF, which enables on-prem data access via a VNet.
