Azure CosmosDB Spark OLTP Connector with Managed Identity - apache-spark

We are trying to ingest some data from Data Lake into Azure Cosmos DB, and the Spark OLTP Connector seems to be the easiest to use.
But due to company policy we are not supposed to use the master keys, and we usually use managed identity for our applications. I see the Cosmos DB Java client builder has the 'TokenCredential' option, with sample code such as:
CosmosAsyncClient client = new CosmosClientBuilder()
    .endpoint("<account-endpoint>")   // e.g. https://<account>.documents.azure.com:443/
    .credential(new DefaultAzureCredentialBuilder().build())
    .buildAsyncClient();
Is there any way to set up the connector to use the same authentication mechanism with managed identity?

I see the Cosmos DB Java client builder has the 'TokenCredential' option with sample code
With the connector's CosmosAsyncClient configuration you still have to supply the master key; there is no way to use managed identities there.
we are not supposed to use the master keys and we usually use managed identity for the applications.
As you want to transfer data from Data Lake to Cosmos DB with managed identities, you can use the Copy Data tool in Azure Data Factory. Create a linked service for Cosmos DB and, for the authentication type, select managed identity (either system-assigned or user-assigned).
You can refer to this SO thread by KarthikBhyresh-MT for more details on the Copy Data tool.

Currently, the Spark connector does not support MSI. I see you correctly created an issue on the repo that holds the source code: https://github.com/Azure/azure-sdk-for-java/issues/29958
That will be used for tracking, or at least linked to the work item that tracks progress in that area. The feature will be available in the future, but there is currently no ETA.
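For context, here is a minimal sketch of how the connector is configured today, i.e. with the account key (the endpoint, key, database, container, and source path are placeholders, and the Spark Java API is used purely for illustration):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class CosmosIngest {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().appName("cosmos-ingest").getOrCreate();

        // Read the source data from the data lake (path is a placeholder).
        Dataset<Row> df = spark.read().parquet("abfss://data@<account>.dfs.core.windows.net/input/");

        // The Cosmos DB Spark 3 OLTP connector currently authenticates with the account key.
        df.write()
          .format("cosmos.oltp")
          .option("spark.cosmos.accountEndpoint", "https://<account>.documents.azure.com:443/")
          .option("spark.cosmos.accountKey", "<account-key>")
          .option("spark.cosmos.database", "appdb")
          .option("spark.cosmos.container", "orders")
          .mode("append")
          .save();

        spark.stop();
    }
}

Once managed identity support lands in the connector, the expectation is that the key option would be replaced by an AAD-based credential, similar to the CosmosClientBuilder sample in the question.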

Related

Data Migration from Snowflake (on GCP Instance) to Snowflake (Azure Instance)

I am looking for some inputs on how to do a GCP-to-Azure cloud data migration.
Scenario -
I have a Snowflake instance configured on GCP (multiple databases holding legacy data) and another Snowflake instance configured on Azure (the DWH is created on this instance).
I want to move/copy the data of all the databases (including all child objects - schemas, tables, views, etc.) from the GCP Snowflake instance to the Snowflake instance configured on Azure.
Can you please guide me on the best solution for such a data migration? Any steps or documentation links would be really helpful.
Many thanks - Minti
Please check the database replication mechanism, which can be used as a migration tool for a Snowflake account from one cloud platform to another. https://docs.snowflake.com/en/user-guide/database-replication-intro.html
Not something I've done before, to be honest, but if you didn't want to use external tools, one possible method would be to secure-share your GCP databases with your Azure Snowflake account.
You then might be able to create a new database that is a clone of this share (not sure if this is possible).
Most objects get cloned apart from stages and pipes, but tables, views, etc. should carry over.
This is a pretty easy process with a couple of prerequisites:
1. Make sure you have Organizations enabled on your GCP account. This feature allows you to self-provision Snowflake accounts on any cloud provider/region; open a support case to enable it.
Introduction to Organizations
2. Create a new account on Azure if you haven't already.
3. Enable replication on both accounts. This can be done while logged into the account with the ORGADMIN role.
4. Replicate your databases (a minimal sketch of the commands involved follows below).
Note: this will give you a replica of the GCP Snowflake account's databases in your Azure Snowflake account. If you want to permanently migrate your databases, you need to set up Failover/Failback. This is a Business Critical feature, but Snowflake support will enable it for lower editions until you complete your migration, at which point they will disable it.
Replicating a Database to Another Account
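For step 4, a minimal sketch of the replication commands, issued here through the Snowflake JDBC driver (organization, account, database, and credential values are placeholders, and the role running them needs the appropriate privileges):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class ReplicateDatabase {
    public static void main(String[] args) throws Exception {
        // On the source (GCP) account: allow the Azure account to replicate this database.
        try (Connection source = DriverManager.getConnection(
                 "jdbc:snowflake://myorg-gcp_account.snowflakecomputing.com", "USER", "PASSWORD");
             Statement stmt = source.createStatement()) {
            stmt.execute("ALTER DATABASE legacy_db ENABLE REPLICATION TO ACCOUNTS myorg.azure_account");
        }

        // On the target (Azure) account: create a local replica and pull the data across.
        try (Connection target = DriverManager.getConnection(
                 "jdbc:snowflake://myorg-azure_account.snowflakecomputing.com", "USER", "PASSWORD");
             Statement stmt = target.createStatement()) {
            stmt.execute("CREATE DATABASE legacy_db AS REPLICA OF myorg.gcp_account.legacy_db");
            stmt.execute("ALTER DATABASE legacy_db REFRESH");
        }
    }
}

The REFRESH can be re-run (or scheduled) to keep the replica current until you cut over.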
There are two options.
You could make use of the replication feature. High-level steps include the below:
a. Target account to be created - you can use the Organizations feature available in Snowflake (enabled by Snowflake Support upon request).
b. Account-level objects should be created manually in the target account.
Note: The Failover feature is supported for accounts whose edition is Business Critical and above. However, for account migration scenarios, this feature will be enabled for a temporary period by Snowflake Support.
c. Replication - the links below can be referenced for a complete understanding of the process.
https://docs.snowflake.com/en/user-guide/database-replication-intro.html#introduction-to-database-replication-across-multiple-accounts
https://docs.snowflake.com/en/user-guide/database-replication-config.html#replicating-a-database-to-another-account
https://docs.snowflake.com/en/user-guide/database-failover-config.html#failing-over-databases-across-multiple-accounts
Please find the link below for an overview of the associated costs:
https://docs.snowflake.com/en/user-guide/database-replication-billing.html#understanding-billing-for-database-replication
Limitations
https://docs.snowflake.com/en/user-guide/database-replication-intro.html#current-limitations-of-replication
One other option is to create the target account and use the unloading and loading feature (a minimal sketch follows after the links below).
https://docs.snowflake.com/en/user-guide-data-unload.html
https://docs.snowflake.com/en/user-guide-data-load.html
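For this option, a minimal sketch of the unload/load commands, again issued via JDBC (the named stage, table, and connection details are hypothetical):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class UnloadAndReload {
    public static void main(String[] args) throws Exception {
        // On the source (GCP) account: unload the table to a named stage as gzip-compressed CSV.
        try (Connection source = DriverManager.getConnection(
                 "jdbc:snowflake://myorg-gcp_account.snowflakecomputing.com", "USER", "PASSWORD");
             Statement stmt = source.createStatement()) {
            stmt.execute("COPY INTO @migration_stage/my_table/ FROM legacy_db.public.my_table "
                       + "FILE_FORMAT = (TYPE = CSV COMPRESSION = GZIP) HEADER = TRUE OVERWRITE = TRUE");
        }

        // Move the staged files across (or point both accounts at a shared external stage),
        // then on the target (Azure) account: load them into the recreated table.
        try (Connection target = DriverManager.getConnection(
                 "jdbc:snowflake://myorg-azure_account.snowflakecomputing.com", "USER", "PASSWORD");
             Statement stmt = target.createStatement()) {
            stmt.execute("COPY INTO legacy_db.public.my_table FROM @migration_stage/my_table/ "
                       + "FILE_FORMAT = (TYPE = CSV COMPRESSION = GZIP SKIP_HEADER = 1)");
        }
    }
}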

How do I setup an AutoResolve Integrated Runtime in Azure Purview

I am trying to test out Azure Purview and connect it to an Azure SQL Server. Since the SQL Server is hosted in the cloud, I want to use the default AutoResolve integration runtime to get connected, but there isn't one set up or an option to set up a new one. Has anyone else using Purview been able to set up (or needed to set up) an AutoResolve IR?
To connect to Azure SQL DB/MI, you can go directly to the Azure Purview portal, register new data sources, and select Azure SQL DB/MI.
In this article - Manage data sources in Azure Purview (Preview), you learn how to register new data sources, manage collections of data sources, and view sources in Azure Purview (Preview).
Only to connect to an on-premises SQL Server do you need to set up a self-hosted integration runtime to scan the data source.
If the data source is located in Azure, you don't need any integration runtime to scan it.
Reference: Register and scan an Azure SQL Database.
CHEEKATLAPRADEEP-MSFT is absolutely correct. To go a step further: since you know what an AutoResolve integration runtime is, you are probably using Azure Data Factory, so in addition to registering your SQL Server you can also link your Azure Data Factory for data lineage purposes. Based on the pipelines that are executed, it will automatically create the data lineage.
Navigation to Link Data Factory
Data lineage created by linking Data Factory
Keep in mind, you will have to execute pipelines after linking for it to pick up the data lineage. Also, for sources or destinations that are not yet supported, it will not capture data lineage.

How to manage CosmosDB Stored procedures, Function and Triggers as like SQL DB project

I am developing a SaaS-based application which has a hybrid DB architecture (Azure SQL Server and Azure Cosmos DB).
To manage SQL Server tables, stored procedures, triggers, and functions, we create a SQL DB project (.sqlproj); we can also generate a .dacpac and deploy it to SQL Server.
As with SQL, we will have collections, stored procedures, triggers, and functions in Azure Cosmos DB.
How do we manage Cosmos DB collections, procedures, and triggers? Is there any project template available to manage them? Please suggest a solution to proceed.
Based on my experience with Cosmos DB, I believe there is no such project template available for Cosmos DB, because it is not as straightforward as a SQL DB project.
I suggest you store them as JSON/JavaScript files in your solution under version control and version them accordingly.
You could write the necessary programming logic to execute these scripts/Cosmos DB artifacts using the SQL API for .NET or another platform. This way you control the collections, UDFs, triggers, etc. from your code, and you can version that code accordingly.
More references here: https://learn.microsoft.com/en-us/azure/cosmos-db/programming
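As an illustration of the deploy-from-code idea above (shown here with the Java SDK rather than .NET; the endpoint, key, database, container, and script path are hypothetical):

import com.azure.cosmos.CosmosClient;
import com.azure.cosmos.CosmosClientBuilder;
import com.azure.cosmos.CosmosContainer;
import com.azure.cosmos.models.CosmosStoredProcedureProperties;
import java.nio.file.Files;
import java.nio.file.Paths;

public class DeployStoredProcedure {
    public static void main(String[] args) throws Exception {
        CosmosClient client = new CosmosClientBuilder()
            .endpoint("<account-endpoint>")
            .key("<account-key>")
            .buildClient();
        CosmosContainer container = client.getDatabase("appdb").getContainer("orders");

        // Read the stored procedure body from a script file kept under version control.
        String body = Files.readString(Paths.get("cosmos/storedprocs/bulkImport.js"));
        container.getScripts().createStoredProcedure(
            new CosmosStoredProcedureProperties("bulkImport", body));

        client.close();
    }
}

Triggers and UDFs can be pushed the same way via createTrigger and createUserDefinedFunction on the same CosmosScripts object, so a small deployment step in your pipeline can keep the database in sync with the files in source control.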
Azure Cosmos DB resources can also be managed through ARM templates. You can use these to version your databases/collections/etc. See the Microsoft.DocumentDB resource types documentation.

Azure Data Lake Store exposed via OData

Is it possible to expose an Azure Data Lake Store via OData? The main goal is consume this service in Salesforce (via Salesforce Connect).
If so, should it take place through Azure Data Factory?
Update
A little more context:
We have historical data stored in Azure Data Lake Storage (ADLS) that we want to expose via OData (to be visualised in Salesforce via Salesforce Connect / external objects). After digging into the issue and potential solutions, we don't think ADLS is the right service to use in this particular case. Instead, we might need to configure a Data Factory pipeline to copy the data we are interested in to a SQL Database and read the data from there via a simple ASP.NET application using an Entity Data Model and the WCF Data Services Entity Framework Provider (got some insights from this website).
I don't think OData has a connector for ADLS. However, given that OData is basically a REST API, you could probably build an OData API over the existing ADLS REST APIs if they are not providing what you need. I am not sure how ADF would come into the picture.
Maybe it would be useful if you tell us what you want to achieve?
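If you did end up putting a thin OData/REST façade in front of ADLS, the data-access part could be as simple as reading files with the Data Lake Storage Gen2 SDK; a minimal sketch (the account, file system, and path are hypothetical):

import com.azure.identity.DefaultAzureCredentialBuilder;
import com.azure.storage.file.datalake.DataLakeFileClient;
import com.azure.storage.file.datalake.DataLakeServiceClient;
import com.azure.storage.file.datalake.DataLakeServiceClientBuilder;
import java.io.ByteArrayOutputStream;

public class ReadFromAdls {
    public static void main(String[] args) {
        // Authenticate with Azure AD (e.g. a managed identity) rather than account keys.
        DataLakeServiceClient service = new DataLakeServiceClientBuilder()
            .endpoint("https://<account>.dfs.core.windows.net")
            .credential(new DefaultAzureCredentialBuilder().build())
            .buildClient();

        DataLakeFileClient file = service
            .getFileSystemClient("historical-data")
            .getFileClient("2020/sales.csv");

        // Download the file; an OData layer would translate content like this into entity responses.
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        file.read(out);
        System.out.println(out.toString());
    }
}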

Could anyone help me with how to perform Azure Table storage deployment through VSTS?

I am new to Azure. Could anyone explain what Table storage is in Azure and how I can do a Table storage deployment through VSTS? Please share your thoughts on what steps are involved and which plugin/task I can use in VSTS to do this.
About Azure Table storage, you can refer to this article: Azure Table storage overview.
Regarding Azure Table storage with VSTS, you can manage Azure tables and table entities through the Azure PowerShell task.
Azure Table storage stores large amounts of structured data. The service is a NoSQL datastore which accepts authenticated calls from inside and outside the Azure cloud. Azure tables are ideal for storing structured, non-relational data. Common uses of Table storage include:
- Storing TBs of structured data capable of serving web-scale applications
- Storing datasets that don't require complex joins, foreign keys, or stored procedures and can be denormalized for fast access
- Quickly querying data using a clustered index
- Accessing data using the OData protocol and LINQ queries with the WCF Data Services .NET libraries
You can use Table storage to store and query huge sets of structured, non-relational data, and your tables will scale as demand increases.
You'll have to install the Azure Storage Client Library for .NET to work with Azure Storage.
For more details, refer to the documentation Get started with Azure Table storage using .NET and Get started with Azure Table storage and Visual Studio Connected Services (ASP.NET), in case you haven't checked them earlier.
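The linked guides use the .NET client; as a language-agnostic illustration of the same operations, here is a small sketch using the Java Tables SDK (azure-data-tables), with a placeholder connection string (in a VSTS/Azure DevOps release this would typically come from a secret pipeline variable):

import com.azure.data.tables.TableClient;
import com.azure.data.tables.TableServiceClient;
import com.azure.data.tables.TableServiceClientBuilder;
import com.azure.data.tables.models.TableEntity;

public class TableStorageDemo {
    public static void main(String[] args) {
        TableServiceClient service = new TableServiceClientBuilder()
            .connectionString("<storage-connection-string>")
            .buildClient();

        // Create the table if it does not already exist, then insert a simple entity.
        service.createTableIfNotExists("Customers");
        TableClient table = service.getTableClient("Customers");
        table.createEntity(new TableEntity("partition1", "row1")
            .addProperty("Email", "user@example.com"));
    }
}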
