How to connect data from Data Lake Gen 1 to Tableau - python-3.x

I would like to connect data from a 'Data Lake Storage Gen 1' account to Tableau. What Tableau version is recommended for a Gen 1 connection?
PS: I am aware Gen 2 can be connected to Tableau 2021.1

The Tableau documentation doesn't say which version is recommended for a Gen 1 connection, but it does provide a way to connect to a Data Lake Storage Gen 1 account.
You can refer to this document: Visualize Live Azure Data Lake Storage Data in Tableau:
Authenticating to a Gen 1 DataLakeStore Account
Gen 1 uses OAuth 2.0 in Azure AD for authentication.
For this, an Active Directory web application is required. You can create one as follows:
Sign in to your Azure account through the Azure portal.
Select "Azure Active Directory".
Select "App registrations".
Select "New application registration".
Provide a name and URL for the application. Select Web app for the type of application you want to create.
Select "Required permissions" and change the required permissions for this app. At a minimum, "Azure Data Lake" and "Windows Azure Service Management API" are required.
Select "Key" and generate a new key. Add a description and a duration, and take note of the generated key. You won't be able to see it again.
To authenticate against a Gen 1 DataLakeStore account, the following properties are required:
Schema: Set this to ADLSGen1.
Account: Set this to the name of the account.
OAuthClientId: Set this to the application Id of the app you created.
OAuthClientSecret: Set this to the key generated for the app you created.
TenantId: Set this to the tenant Id. See the property for more information on how to acquire this.
Directory: Set this to the path which will be used to store the replicated file. If not specified, the root directory will be used.
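If you want to sanity-check those values outside Tableau first, here is a minimal Python sketch using the azure-datalake-store package, assuming the tenant ID, application ID, key, and account name collected in the steps above (all placeholder values below are hypothetical):
# Minimal connectivity check for a Gen 1 account with the azure-datalake-store package.
# pip install azure-datalake-store
from azure.datalake.store import core, lib

TENANT_ID = "<tenant-id>"            # TenantId
CLIENT_ID = "<application-id>"       # OAuthClientId of the registered app
CLIENT_SECRET = "<generated-key>"    # OAuthClientSecret (the key you noted down)
ACCOUNT = "<datalake-account-name>"  # Account (without .azuredatalakestore.net)

# Acquire an OAuth 2.0 token from Azure AD for the service principal.
token = lib.auth(tenant_id=TENANT_ID, client_id=CLIENT_ID, client_secret=CLIENT_SECRET)

# Connect to the Gen 1 store and list the root directory.
adl = core.AzureDLFileSystem(token, store_name=ACCOUNT)
print(adl.ls("/"))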

Related

Trying to set a linked service through a registered app to Azure Data Lake Storage and keep getting 24200 error

I am new to Azure. We have Azure Data Lake Storage set up. I am trying to set up the linked service from Data Factory to Azure Data Lake Storage Gen2, and it keeps failing when I test the linked service to the Data Lake Storage. As far as I can see, I have granted the "Storage blob contributor" role to the user on the Azure Data Lake Storage, but I still keep getting a permission denied error when I test the linked service.
ADLS Gen2 operation failed for: Storage operation '' on container 'testconnection' get failed with 'Operation returned an invalid status code 'Forbidden''. Possible root causes: (1). It's possible because the service principal or managed identity don't have enough permission to access the data. (2). It's possible because some IP address ranges of Azure Data Factory are not allowed by your Azure Storage firewall settings. Azure Data Factory IP ranges please refer https://learn.microsoft.com/en-us/azure/data-factory/azure-integration-runtime-ip-addresses.. Account: 'dlsisrdatapoc001'. ErrorCode: 'AuthorizationFailure'. Message: 'This request is not authorized to perform this operation.'.
What I could observe is that when I open the network to all (public) on the Data Lake Storage it works, but when I restrict the firewall to CIDR ranges it fails. I couldn't narrow down the cause of the problem. I do have "Allow Azure services on the trusted services list to access this account" checked.
Completely lost
As mentioned in the error description, the error usually occurs if you don't have sufficient permissions to perform the action or if you don't add the required IPs in the firewall settings of your storage account.
To resolve the error, please check whether you have added the Storage Blob Data Contributor role to your managed identity along with the user, like below:
Go to Azure Portal -> Storage Accounts -> Your Storage Account -> Access Control (IAM) -> Add role assignment
Make sure to select the managed identity, based on the authentication method you selected while creating linked service.
As mentioned in this MsDoc, make sure to add all the required IPs based on your resource location and service tag.
Download the JSON file to know the IP range for service tag in your resource location and add them in the firewall settings like below:
Make sure to select the Resource type as Microsoft.DataFactory/factories while choosing CIDR.
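If it helps, below is a small Python sketch for pulling out those ranges, assuming you have already downloaded the service tag JSON file from the link above (saved locally as ServiceTags_Public.json; the region tag name is only an example, use the tag for your resource location):
# Print the Data Factory address ranges for one region from the downloaded
# service tags file, so they can be added to the storage account firewall.
import json

REGION_TAG = "DataFactory.EastUS"  # example tag; use the one for your resource location
with open("ServiceTags_Public.json") as f:  # JSON file downloaded from the link above
    tags = json.load(f)

for value in tags["values"]:
    if value["name"] == REGION_TAG:
        for prefix in value["properties"]["addressPrefixes"]:
            print(prefix)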
For more details, please refer to the links below:
Error when I am trying to connect between Azure Data factory and Azure Data lake Gen2 by Anushree Garg
Storage Account V2 access with firewall, VNET to data factory V2 by Cindy Pau

AzureML to use geo-replication / secondary Blob Storage container for Datastore

For the sake of safety I wish to use a geo-replicated / secondary Blob Storage container as a data source for an AzureML Datastore. So I do the following:
New Datastore
Enter name + Azure Blob Storage + Enter manually
For URL I paste the "Secondary Blob Service Endpoint" value from "Storage account endpoints" and add the container name at the end, e.g. https://somedata-secondary.blob.core.windows.net/container-name
Select subscription ID
I select the resource group in which somedata is hosted,
I add the account key taken from the "Access keys" section; I also tried with a SAS token
After finalizing, the new datastore seems to appear in the list but it is impossible to Browse (preview); it throws the error "Invalid host".
What is the correct way of doing this?
Is it possible at all to access this geo-replication / secondary Blob Storage as datastore?
Please check the below points:
Initially, please check if the shared access signature (SAS) token is outdated or expired.
Please note that both the primary and geo-secondary are required to have the same service tier, and it is strongly recommended that the geo-secondary is configured with the same backup storage redundancy and compute size as the primary.
Note: You can only access your storage account by its primary name. In the event of failover, that name will be mapped to the alternate datacenter.
There are two disadvantages of GRS redundancy:
Replication between regions is asynchronous and so data is propagated with a small delay
The second region cannot be accessed or read until the storage account fails over
Active geo-replication - Azure SQL Database | Microsoft Docs
The replicated endpoint will be https://account-secondary.blob.core.windows.net. Note that this DNS entry won't even be registered unless read-access geo-redundant replication is enabled.
The access keys for your storage account are the same for both the primary and secondary endpoints. You can use the same primary (or secondary) access key for the secondary too.
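To quickly confirm whether the secondary endpoint is readable at all (i.e. RA-GRS is enabled) before registering it as a datastore, a rough Python check with the azure-storage-blob package could look like the sketch below; the account, container, and key values are placeholders based on your question:
# Rough check that the secondary (read-access geo-redundant) endpoint is readable.
# pip install azure-storage-blob
from azure.core.credentials import AzureNamedKeyCredential
from azure.storage.blob import BlobServiceClient

# Keys are shared between primary and secondary, so sign with the primary account name.
credential = AzureNamedKeyCredential("somedata", "<access-key>")
client = BlobServiceClient(
    account_url="https://somedata-secondary.blob.core.windows.net",
    credential=credential,
)
container = client.get_container_client("container-name")

# If RA-GRS is not enabled, this listing will fail (e.g. host not found / forbidden).
for blob in container.list_blobs():
    print(blob.name)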

Azure Synapse serverless SQL pool - query execution fails

After completing tutorial 1, I am working on this tutorial 2 from the Microsoft Azure team to run the following query (shown in step 3), but the query execution gives the error shown below:
Question: What may be the cause of the error, and how can we resolve it?
Query:
SELECT
    TOP 100 *
FROM
    OPENROWSET(
        BULK 'https://contosolake.dfs.core.windows.net/users/NYCTripSmall.parquet',
        FORMAT='PARQUET'
    ) AS [result]
Error:
Warning: No datasets were found that match the expression 'https://contosolake.dfs.core.windows.net/users/NYCTripSmall.parquet'. Schema cannot be determined since no files were found matching the name pattern(s) 'https://contosolake.dfs.core.windows.net/users/NYCTripSmall.parquet'. Please use WITH clause in the OPENROWSET function to define the schema.
NOTE: The path of the file in the container is correct; in fact, I generated the above query just by right-clicking the file inside the container and generating the script, as shown below:
Remarks:
Azure Data Lake Storage Gen2 account name: contosolake
Container name: users
Firewall settings used on the Azure Data Lake account:
Azure Data Lake Storage Gen2 account is allowing public access (ref):
Container has required access level (ref)
UPDATE:
The owner of the subscription is someone else, and I did not get the option to check the "Assign myself the Storage Blob Data Contributor role on the Data Lake Storage Gen2 account" box described in item 3 of the Basics tab > Workspace details section of tutorial 1. I also do not have permissions to add roles, although I am the owner of the Synapse workspace. So I am using the workaround described in Configure anonymous public read access for containers and blobs from the Azure team.
--Workaround
If you are unable to grant Storage Blob Data Contributor, use ACLs to grant permissions.
All users that need access to some data in this container also need to have the EXECUTE permission on all parent folders up to the root (the container). Learn more about how to set ACLs in Azure Data Lake Storage Gen2.
Note:
Execute permission on the container level needs to be set within Azure Data Lake Gen2. Permissions on the folder can be set within Azure Synapse.
Go to the container holding NYCTripSmall.parquet.
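For reference, here is a rough Python sketch of granting such an ACL with the azure-storage-file-datalake package (not the tutorial's exact steps; the account key and the object ID below are hypothetical placeholders for a credential with sufficient rights and for the user or workspace identity that needs access):
# Grant a specific principal read access to the file through ACLs.
# Container-level execute still needs to be set in ADLS Gen2 (see the note above).
# pip install azure-storage-file-datalake
from azure.storage.filedatalake import DataLakeServiceClient

ACCOUNT_URL = "https://contosolake.dfs.core.windows.net"
ACCOUNT_KEY = "<account-key>"      # or an AAD credential with sufficient rights
PRINCIPAL_OID = "<object-id>"      # hypothetical: object ID of the user/identity

service = DataLakeServiceClient(account_url=ACCOUNT_URL, credential=ACCOUNT_KEY)
fs = service.get_file_system_client("users")
file_client = fs.get_file_client("NYCTripSmall.parquet")

# set_access_control replaces the whole ACL, so keep the owner/group/other
# entries and append an entry for the principal that needs read access.
file_client.set_access_control(
    acl=f"user::rwx,group::r-x,other::---,user:{PRINCIPAL_OID}:r--"
)
print(file_client.get_access_control())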
--Update
As per your update in the comments, it seems you would have to do the following.
Contact the Owner of the storage account, and ask them to perform the following tasks:
Assign the workspace MSI to the Storage Blob Data Contributor role on the storage account
Assign you to the Storage Blob Data Contributor role on the storage account
--
I was able to get the query results following the tutorial doc you have mentioned for the same dataset.
Since you confirm that the file is present and in the right path, refresh the linked ADLS source and publish the query before running, in case it is a transient issue.
Two things I suspect are:
Try setting Microsoft network routing in the Network routing settings of the ADLS account.
Check if the built-in pool is online and you have at least Contributor roles on both the Synapse workspace and the storage account (if the credentials used to run the query did not create the resources).

Empty error while executing SSIS package in Azure Data Factory

I have created a simple SSIS project and in this project, I have a package that will delete a particular file in Downloads folder.
I deployed this project to Azure, and when I try to execute this package using Azure Data Factory, the pipeline fails with an empty error (I am attaching the screenshot here).
What I have done to fix this error is:
I have added self-hosted IR to Azure-SSIS IR as the proxy to access the data on-premise.
Set the ConnectByProxy as True.
Converted the project to Project Deployment Model.
Please help me out to fix this error and if you need more details then just leave a comment.
Windows Authentication:
To access data stores such as SQL servers/file shares on-premises or Azure Files, check the Windows authentication check box.
If this check box is selected, fill in the Domain, Username, and Password fields with the values for your package execution credentials. For example, to access Azure Files, the domain is Azure, the username is <storage account name>, and the password is <storage account key>.
Using the secrets stored in your Azure Key Vault
As a substitute, you can use secrets stored in your Azure Key Vault as values. To do so, select the AZURE KEY VAULT check box next to them. Create a new Key Vault linked service or choose or edit an existing one, then select the secret name and version for your value. When creating or editing your Key Vault linked service, you can pick or edit an existing key vault or create a new one. If you haven't previously done so, grant the Data Factory managed identity access to your key vault. You may also enter your secret directly in the format <key vault linked service name>/<secret name>/<secret version>.
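If you go the Key Vault route, a small Python sketch for storing the storage account key as a secret that the linked service can then reference could look like the one below (the vault URL and secret name are hypothetical):
# Store the storage account key in Key Vault so the SSIS execution credential
# can reference it instead of a plain-text password.
# pip install azure-identity azure-keyvault-secrets
from azure.identity import DefaultAzureCredential
from azure.keyvault.secrets import SecretClient

VAULT_URL = "https://<your-key-vault>.vault.azure.net"   # hypothetical vault name
SECRET_NAME = "ssis-storage-account-key"                  # hypothetical secret name

client = SecretClient(vault_url=VAULT_URL, credential=DefaultAzureCredential())
secret = client.set_secret(SECRET_NAME, "<storage account key>")
print(secret.name, secret.properties.version)   # reference this name/version in ADF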
Note: If you are using Windows Authentication, there are four methods to access data stores with Windows authentication from SSIS packages running on your Azure-SSIS IR: Access data stores and file shares with Windows authentication from SSIS packages in Azure | Docs
Make sure your scenario falls under one of those methods; otherwise it could potentially fail at run time.

Azure blob storage networking rules (IP) for Azure data warehouse

I need to load external data (in blob storage) to my Azure data warehouse using Polybase. I had it working fine when I was using Classic Azure Storage.
Recently, I had to update our Storage to ARM and I could not figure out how to set up the firewall rule on the ARM Storage for my Azure data warehouse. If I set the firewall to "All networks" everything works seamlessly. However, I cannot leave the blob wide open.
I tried using nslookup to find the outbound IP for our Azure data warehouse and put the value into the firewall of the Storage; I got a "This request is not authorized to perform this operation." error.
Is there a way I can find the IP address for an Azure data warehouse? Or should I use a different approach to make it work?
Any Suggestions are appreciated.
Kevin
Under the section 1.1 Create a Credential, it states:
Don't skip this step if you are using this tutorial as a template for loading your own data. To access data through a credential, use the following script to create a database-scoped credential, and then use it when defining the location of the data source.
-- A: Create a master key.
-- Only necessary if one does not already exist.
-- Required to encrypt the credential secret in the next step.
CREATE MASTER KEY;
-- B: Create a database scoped credential.
-- IDENTITY: Provide any string; it is not used for authentication to Azure storage.
-- SECRET: Provide your Azure storage account key.
CREATE DATABASE SCOPED CREDENTIAL AzureStorageCredential
WITH
    IDENTITY = 'user',
    SECRET = '<azure_storage_account_key>'
;
-- C: Create an external data source.
-- TYPE: HADOOP - PolyBase uses Hadoop APIs to access data in Azure blob storage.
-- LOCATION: Provide Azure storage account name and blob container name.
-- CREDENTIAL: Provide the credential created in the previous step.
CREATE EXTERNAL DATA SOURCE AzureStorage
WITH (
    TYPE = HADOOP,
    LOCATION = 'wasbs://<blob_container_name>@<azure_storage_account_name>.blob.core.windows.net',
    CREDENTIAL = AzureStorageCredential
);
Edit: (additional way to access Blobs from ADW through the use of SAS):
You can also create a Storage linked service by using a shared access signature. It provides the data factory with restricted/time-bound access to all/specific resources (blob/container) in the storage.
A shared access signature provides delegated access to resources in your storage account. You can use a shared access signature to grant a client limited permissions to objects in your storage account for a specified time. You don't have to share your account access keys. The shared access signature is a URI that encompasses in its query parameters all the information necessary for authenticated access to a storage resource. To access storage resources with the shared access signature, the client only needs to pass in the shared access signature to the appropriate constructor or method. For more information about shared access signatures, see Shared access signatures: Understand the shared access signature model.
The full document can be found here.
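As an illustration, a container-level SAS with read and list permissions could be generated in Python with the azure-storage-blob package roughly as below; the account, container, and key names simply reuse the placeholders from the script above:
# Generate a time-bound, read-only SAS for the container holding the external data.
# pip install azure-storage-blob
from datetime import datetime, timedelta
from azure.storage.blob import generate_container_sas, ContainerSasPermissions

sas_token = generate_container_sas(
    account_name="<azure_storage_account_name>",
    container_name="<blob_container_name>",
    account_key="<azure_storage_account_key>",
    permission=ContainerSasPermissions(read=True, list=True),
    expiry=datetime.utcnow() + timedelta(hours=24),   # restricted, time-bound access
)
print(sas_token)   # supply this as the shared access signature secret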
