Azure Data Lake on-premises or hybrid stack

We are trying to evaluate a good fit for our solution. We want to process big data, and for that we want to build the solution around the Hadoop stack. We wanted to know how Azure can help in these situations. The solution we are building is a SaaS, but some of our clients have confidential data which they want to hold only on their own premises.
So can we run Azure Data Lake on-premises for those clients?
Can we have a hybrid model where the storage stays on-premises but the processing is done in the cloud?
The reason we are asking this is to answer the question of scalability and reliability.
I know this is vague, but if you need more clarification please let us know.

Azure Data Lake Storage (Gen2) hierarchical filesystem support in Azure Stack would enable you to use it natively for your storage requirements. Unfortunately, Azure Stack does not currently support Azure Data Lake.
You can find the list of available services here: https://azure.microsoft.com/en-gb/overview/azure-stack/keyfeatures/. Some of the Azure big data ecosystem tools are in development but not yet generally available.
There is a feature request to add this support. You can vote it up to help Microsoft prioritize it: https://feedback.azure.com/forums/344565-azure-stack/suggestions/38222872-support-adls-gen2-on-azure-stack

Related

Azure backup lifecycle management

Looking for a link on Azure Backup lifecycle management; a help file will also do, or any pointers on how to design the backup lifecycle management.
The Azure Backup service provides simple, secure, and cost-effective solutions to back up your data and recover it from Azure. You can back up anything from on-premises data to Azure File Shares and VMs, including Azure PostgreSQL databases.
To start off, you can learn more about Azure Backup here. This article summarizes Azure Backup architecture, components, and processes. While it’s easy to start protecting infrastructure and applications on Azure, you must ensure that the underlying Azure resources are set up correctly and being used optimally in order to accelerate your time to value.
To learn more about the capabilities of Azure Backup, and how to efficiently implement solutions that better protect your deployments, detailed guidance and best practices are available for designing your backup solution on Azure.
For additional reading, also refer to some Frequently asked questions about Azure Backup.

Filesystem SDK vs Azure Data Factory

I'm very new to Azure Data Lake Storage and currently training on Data Factory. I have a developer background, so right away I'm not a fan of the 'tools' approach for development. I really don't like how there are all these settings to set and objects you have to create everywhere. I much prefer a code approach which allows us to detach the logic from the service (I don't like the publishing thing to save), see everything by scrolling or navigating to different objects in a project, see differences more easily in source control, etc. So I found Microsoft's Filesystem SDK, which seems to be an alternative to Data Factory:
https://azure.microsoft.com/en-us/blog/filesystem-sdks-for-azure-data-lake-storage-gen2-now-generally-available/
What has been your experience using this approach? Is this a good alternative? Is there a way to run SDK code in Data Factory, so that we can leverage scheduling and triggers? I guess I'm looking for pros/cons.
Thank you.
Well, the docs refer to several SDKs, one of them being the .NET SDK, and the title is
Use .NET (or Python or Java etc.) to manage directories, files, and ACLs in Azure Data Lake Storage Gen2
So, the SDK lets you manage the filesystem only. There is no support for triggers, pipelines, data flows and the like. You will have to stick to Azure Data Factory for that.
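To make that concrete, here is roughly what the Filesystem SDK gives you, sketched with the Python flavour (azure-storage-file-datalake); the account name, key, container and paths below are placeholders, not anything from the question:

```python
# Minimal sketch of the ADLS Gen2 Filesystem SDK (Python flavour).
# Assumes the azure-storage-file-datalake package; account, key,
# container and paths are placeholders.
from azure.core.exceptions import ResourceExistsError
from azure.storage.filedatalake import DataLakeServiceClient

ACCOUNT_URL = "https://<your-account>.dfs.core.windows.net"
ACCOUNT_KEY = "<your-account-key>"

service = DataLakeServiceClient(account_url=ACCOUNT_URL, credential=ACCOUNT_KEY)

# A "file system" is the ADLS Gen2 container; create it if it isn't there yet.
fs = service.get_file_system_client(file_system="raw")
try:
    fs.create_file_system()
except ResourceExistsError:
    pass

# Directory and file management -- this is what the SDK is for.
directory = fs.create_directory("landing/2024")
file_client = directory.create_file("sample.csv")
file_client.upload_data(b"id,value\n1,42\n", overwrite=True)

# ACLs are also managed here, not in Data Factory.
directory.set_access_control(acl="user::rwx,group::r-x,other::---")
```

Nothing in there schedules or orchestrates anything; it is purely filesystem management, which is why it doesn't replace Data Factory.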
Regarding this:
I'm not a fan of the 'tools' approach for development
I hate to tell you, but the world is moving that way whether you like it or not. Take Logic Apps, for example. Azure Data Factory isn't aimed at the hardcore developer but fulfils a need for people working with large sets of data, like data engineers. I am already glad it integrates with Git very well. Yes, there is some overhead in defining sinks and sources, but they are reusable across pipelines.
If you really want to use code try Azure Databricks. Take a look at this Q&A as well.
TL;DR:
The FileSystem SDK is not an alternative.
The code-centric alternative to Azure Data Factory for building and managing your Azure Data Lake is Spark. Typically either Azure Databricks or Azure Synapse Spark.
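For a feel of what that code-first route looks like, here is a minimal PySpark sketch of the kind of transform you would otherwise build as a Data Factory data flow; the storage account, container, column names and paths are placeholders, and cluster authentication to ADLS is assumed to be configured already:

```python
# Minimal PySpark sketch of a code-first transform against ADLS Gen2.
# Account/container/paths are placeholders; auth (service principal,
# managed identity, or account key) is assumed to be set up on the cluster.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("adls-example").getOrCreate()

source = "abfss://raw@<your-account>.dfs.core.windows.net/landing/2024/"
target = "abfss://curated@<your-account>.dfs.core.windows.net/daily_totals/"

# Read raw CSVs, aggregate, and write the result back as Parquet.
df = spark.read.option("header", "true").csv(source)
daily = df.groupBy("id").agg(
    F.sum(F.col("value").cast("double")).alias("total")  # CSV columns are strings
)
daily.write.mode("overwrite").parquet(target)
```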

JavaScript in U-SQL: is it a possible option in Azure Data Lake Analytics?

We plan to follow the Lambda architecture for our solution, and the solution stack is built on top of Azure. Azure Data Lake Analytics is used for batch processing, and Stream Analytics for online processing. We want to use the same code and configuration in both the batch and streaming layers. Is there any option to use JavaScript in U-SQL with the help of .NET assemblies? Azure Stream Analytics supports only JavaScript UDFs. Has anyone tried similar options on the Azure stack?
This probably won't help much, but I wanted to share it, as it would be an interesting approach if someone does have C# experience.
https://josephwoodward.co.uk/2016/09/executing-javascript-inside-dot-net-core-using-javascript-services
Regarding your specific issue, have you looked at the Node.js module?
https://learn.microsoft.com/en-us/javascript/api/overview/azure/data-lake-analytics?view=azure-node-latest

Copy files from on-prem to Azure

I'm new to the Azure ecosystem. I'm doing some research on copying data from on-prem to Azure. I found the following options:
AzCopy
Azure Data Factory (Copy Data Tool)
Data Management Gateway
Ours is a Microsoft shop, so I'm looking for tools that gel with the MS platform. Also, down the line, we want to automate the entire thing as much as we can, so I think Azure Storage Explorer is out of the question. Is there a preference among the above three? Or are there any better tools?
I think you are mixing things up: the Copy Data Tool is just an Azure Data Factory wizard for moving some sample data between resources. Azure Data Factory uses the Data Management Gateway to reach on-premises resources such as files and databases.
What you want to do can be done with Azure Data Factory. I recommend using version 2 (even in its preview version) because its authoring experience is easier to understand if you are new to the tool. You can graphically configure linked services, datasets, and pipelines from there.
I hope this helps; if you need further help, just ask away!
If you're already familiar with SSIS, there's also the option to run SSIS in ADF, which enables on-prem data access via a VNet.
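If AzCopy ends up being your pick, it is also easy to wrap in a script for the automation you mentioned. A minimal Python sketch (assuming AzCopy v10 is on the PATH and you have a SAS URL with write access to the target container; the local path and URL are placeholders):

```python
# Minimal sketch: wrapping AzCopy v10 for automated uploads to Blob Storage.
# Assumes azcopy is on the PATH and the SAS URL grants write access;
# the local directory and container URL below are placeholders.
import subprocess

LOCAL_DIR = "/data/exports"
CONTAINER_SAS_URL = "https://<your-account>.blob.core.windows.net/backups?<sas-token>"

result = subprocess.run(
    ["azcopy", "copy", LOCAL_DIR, CONTAINER_SAS_URL, "--recursive"],
    capture_output=True,
    text=True,
)

if result.returncode != 0:
    # AzCopy writes its diagnostics to stdout/stderr; surface them on failure.
    raise RuntimeError(f"AzCopy failed:\n{result.stdout}\n{result.stderr}")
```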

How is Azure Storage Tables implemented?

I'm the type of developer that likes to understand the whole stack, and viewing Azure Storage Tables as a black box makes me uncomfortable.
RDBMS is an entire field of study in computer science. The components necessary to support ACID operations and query optimization, down to the details of the B-trees used to build indexes, are essentially a well-documented, solved problem.
Apache HBase and MongoDB are open source and Google has published multiple papers on BigTable, but I can't find anything on Microsoft's Azure Storage Tables, other than usage / developer documentation. Has Microsoft published any details on the actual implementation (algorithms, data structures and infrastructure) behind Azure Storage Tables?
The Azure Storage team presented a paper at SOSP 2011, "Windows Azure Storage: A Highly Available Cloud Storage Service with Strong Consistency", describing the inner workings of the Azure Storage service (including the Table service).
