I've been building a nice ETL solution with Data Factory. We're about to go to production, and now security becomes a real concern. Somehow I can't seem to get this right.
I've set up a CosmosDB and SQL Server/DB in Azure. I've added those to a virtual network and disallowed any connections outside of that network.
In Data Factory I've set up an Integration Runtime with virtual network configuration.
I've added a Managed Private Endpoint in Data Factory, connected to the SQL server.
I've set up a Linked Service to the SQL server using that endpoint.
When I set up a Dataset using that Linked Service it works as expected. I can test the connection successfully, select a table and retrieve its schema.
However, I've set up a dataflow that retrieves data from the CosmosDB, does all kinds of magic with it and writes it to a sink using the dataset defined above. When I try to test the connection on this sink, it fails, stating that it can't access the SQL database.
I'm assuming this has something to do with the difference between running a pipeline (using the IR) and running a dataflow (whatever that uses?). I can't, however, find what runs that dataflow and how to make sure that "thing" can access the SQL server.
What am I doing wrong?
After activating, then deactivating, the "Deny public network access" option on the SQL Server "Firewalls and Virtual Networks" setting page, everything started working as expected.
Apparently "Have you tried turning it off and on again" is still the solution to some issues.
Is it possible to use Azure Data Factory on-premises without letting the data run through the cloud? I know Talend has a product where the data is transferred only on our machines and not in the cloud.
I read the documentation on Microsoft.com but didn't find any useful information.
You may use a self-hosted integration runtime to transfer data entirely through your on-premises infrastructure, as long as both the data source and sink are on-premises.
However, the control flow will still happen through the cloud, even if the data itself never leaves your data center. For this reason, setting up a self-hosted integration runtime will still require outbound network access from your infrastructure to Azure.
Check out this piece of documentation for more information: https://learn.microsoft.com/en-us/azure/data-factory/create-self-hosted-integration-runtime?tabs=data-factory#command-flow-and-data-flow
I'm trying to build a very basic data flow in Azure Data Factory pulling a JSON file from blob storage, performing a transformation on some columns, and storing in a SQL database. I originally authenticated to the storage account using Managed Identity, but I get the error below when attempting to test the connection to the source:
com.microsoft.dataflow.broker.MissingRequiredPropertyException:
account is a required property for [myStorageAccountName].
com.microsoft.dataflow.broker.PropertyNotFoundException: Could not
extract value from [myStorageAccountName] - RunId: xxx
I also see the following message in the Factory Validation Output:
[MyDataSetName] AzureBlobStorage does not support SAS,
MSI, or Service principal authentication in data flow.
With this I assumed that all I would need to do is switch my Blob Storage Linked Service to the Account Key authentication method. After switching to Account Key authentication and selecting my subscription and storage account, though, I get the following error when testing the connection:
Connection failed Fail to connect to
https://[myBlob].blob.core.windows.net/: Error Message: The
remote server returned an error: (403) Forbidden. (ErrorCode: 403,
Detail: This request is not authorized to perform this operation.,
RequestId: xxxx), make sure the
credential provided is valid. The remote server returned an error:
(403) Forbidden.StorageExtendedMessage=, The remote server returned an
error: (403) Forbidden. Activity ID:
xxx.
I've tried selecting from Azure directly and also entering the key manually and get the same error either way. One thing to note is the storage account only allows access to specified networks. I tried connecting to a different, public storage account and am able to access fine. The ADF account has the Storage Account Contributor role and I've added the IP address of where I am working currently as well as the IP range of Azure Data Factory that I found here: https://learn.microsoft.com/en-us/azure/data-factory/azure-integration-runtime-ip-addresses
Also note, I have about 5 copy data tasks working perfectly fine with Managed Identity currently, but I need to start doing more complex operations.
This seems similar to the issue in "Unable to create a linked service in Azure Data Factory", but the Storage Account Contributor and Owner roles I have assigned should supersede the Reader role suggested in the reply. I'm also not sure whether that poster is using a public or a private storage account.
Thank you in advance.
At the very bottom of the article linked above about whitelisting the IP ranges of the integration runtime, Microsoft says the following:
When connecting to Azure Storage account, IP network rules have no
effect on requests originating from the Azure integration runtime in
the same region as the storage account. For more details, please refer
this article.
I spoke to Microsoft support about this. The issue is that whitelisting public IP addresses does not work for resources within the same region: because the resources are on the same network, they connect to each other using private IPs rather than public ones.
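As a rough illustration of the rule quoted above, here is a minimal Python sketch. The function name `ip_rules_apply` and the region strings are hypothetical and only encode the documented behavior (IP network rules have no effect when the Azure integration runtime and the storage account share a region); it does not call any Azure API.

```python
def _normalize(region: str) -> str:
    """Normalize a region name so 'East US' and 'eastus' compare equal."""
    return region.strip().lower().replace(" ", "")


def ip_rules_apply(runtime_region: str, storage_region: str) -> bool:
    """Return True when storage IP network rules are expected to take effect.

    Assumption: this only mirrors the documented same-region exception;
    region names are illustrative inputs supplied by the caller.
    """
    return _normalize(runtime_region) != _normalize(storage_region)


if __name__ == "__main__":
    # Same region: whitelisting the runtime's public IPs is not expected to help.
    print(ip_rules_apply("East US", "eastus"))
    # Different regions: IP rules are evaluated for the request.
    print(ip_rules_apply("East US 2", "East US"))
```

If the check returns False, adding the integration runtime's public IP ranges to the storage firewall is not expected to change anything, which matches what I saw.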
There are four options to resolve the original issue:
1. Allow access from all networks under Firewalls and Virtual Networks in the storage account (obviously this is a concern if you are storing sensitive data). I tested this and it works.
2. Create a new Azure-hosted integration runtime that runs in a different region. I tested this as well. My ADF data flow is running in the East region; I created a runtime that runs in East 2 and it worked immediately. The issue for me here is that I would have to have this reviewed by security before pushing to prod, because we'd be sending data across the public network. Even though it's encrypted, it's still not as secure as having two resources talk to each other in the same network.
3. Use a separate activity such as an HDInsight activity like Spark, or an SSIS package. I'm sure this would work, but the issue with SSIS is cost, as we would have to spin up an SSIS DB and then pay for the compute. You also need to execute multiple activities in the pipeline to start and stop the SSIS pipeline before and after execution. Also, I don't feel like learning Spark just for this.
4. Finally, the solution I ended up using: I created a new connection for the data set that replaces the Blob Storage connection with an Azure Data Lake Storage Gen2 connection. It worked like a charm. Unlike the Blob Storage connection, Managed Identity is supported for Azure Data Lake Storage Gen2, as per this article. In general, the more specific the connection type, the more likely the features will work for the specific need.
This is what you are facing now:
From the description, this is a storage connection error. I also assigned the Contributor role to the data factory, but still got the problem.
The problem comes from the network and firewall settings of your storage account. Please check them.
Make sure you have added the client id and the 'Trusted Microsoft services' exception.
Have a look at this doc:
https://learn.microsoft.com/en-us/azure/storage/common/storage-network-security#trusted-microsoft-services
Then go to your ADF and choose these:
After that, it should be ok.
I have recently started using MS Power BI, and have come across a problem which seems inconsistent, and the answer likewise.
I am connecting to an Azure SQL database, and therefore have chosen this as the data source in the desktop app. Everything seems to be working just fine, and I can create tables, graphs and whatnot. One thing is off, though: When I choose Azure SQL DB as the source, the connection dialog box does not appear to be any different than if I just choose (non-Azure) SQL DB. Puzzling.
The other thing, which is actually the main issue: In Power BI (the website), I can open my published reports, but some of them don't show up, and I get an error message in a pink bar at the top, saying the data source is not available because the gateway can't be reached. I am well aware of this, because I have deliberately stopped the gateway service (PBIEgwService) running locally, since I have read in several places that if the data source is Azure, an on-premises gateway is not needed. (E.g.: "Question: Do I need a gateway for cloud data sources like Azure SQL Database?
Answer: No! The service will be able to connect to that data source without a gateway." here: https://learn.microsoft.com/en-us/power-bi/service-gateway-onprem-faq)
So in short: Why does PBI not (always) connect directly to Azure?
And yes, I have checked the credentials. I can connect just fine in PBI desktop.
Are you allowing Azure Services to connect in your Azure SQL firewall?
https://learn.microsoft.com/en-us/azure/sql-database/sql-database-firewall-configure#manage-server-level-ip-firewall-rules-using-the-azure-portal
I was trying to retrieve data from Azure SQL Database using Azure Data Lake Analytics by following this guide. I ran a U-SQL job on Azure Data Lake Analytics and got the following error:
Failed to connect to data source: 'SampleSource', with error(s):
'Cannot open server '' requested by the login. Client with
IP address '25.66.9.211' is not allowed to access the server. To
enable access, use the Windows Azure Management Portal or run
sp_set_firewall_rule on the master database to create a firewall rule
for this IP address or address range. It may take up to five minutes
for this change to take effect.'
After running my job a couple of times, I observed that the IP range that needs to be added on the server is pretty wide. It seems we need to add 25.66.xxx.xxx. I have two questions:
How can we narrow this range?
Why doesn't the typical setting that allows access for all Azure services work?
At this point you will have to add the full range (these are all internal IPs). The typical setting is not working because the machines needing access are not managed in the same way. We are working on having them added. What data center is the database in?
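To make "add the full range" concrete, here is a minimal Python sketch that turns the observed 25.66.xxx.xxx pattern into start and end addresses and builds the corresponding `sp_set_firewall_rule` statement mentioned in the error message. Reading the pattern as the CIDR block 25.66.0.0/16 is an assumption, and the rule name `adla-internal` is a hypothetical placeholder. The script only prints the T-SQL; running it against the master database is left to you.

```python
import ipaddress
from typing import Tuple


def firewall_range(cidr: str) -> Tuple[str, str]:
    """Return the first and last address of a CIDR block as strings."""
    network = ipaddress.ip_network(cidr, strict=True)
    return str(network[0]), str(network[-1])


def is_covered(ip: str, cidr: str) -> bool:
    """Check whether a single client IP falls inside the CIDR block."""
    return ipaddress.ip_address(ip) in ipaddress.ip_network(cidr)


def firewall_rule_sql(rule_name: str, cidr: str) -> str:
    """Build an sp_set_firewall_rule call to run against the master database."""
    start_ip, end_ip = firewall_range(cidr)
    safe_name = rule_name.replace("'", "''")  # escape single quotes for T-SQL
    return (
        "EXECUTE sp_set_firewall_rule "
        f"@name = N'{safe_name}', "
        f"@start_ip_address = '{start_ip}', "
        f"@end_ip_address = '{end_ip}';"
    )


if __name__ == "__main__":
    # Assumption: 25.66.xxx.xxx is interpreted as 25.66.0.0/16.
    cidr = "25.66.0.0/16"
    # The address from the error message should fall inside the block.
    print(is_covered("25.66.9.211", cidr))
    # Hypothetical rule name; pick whatever fits your naming scheme.
    print(firewall_rule_sql("adla-internal", cidr))
```

Checking a few of the addresses reported in failed jobs with `is_covered` is a cheap way to confirm the assumed block is wide enough before creating the rule.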
I am working on a solution that uses SQL Azure. Part of the project deals with backups and using the DAC Web Services for backups.
The issue is that there is a different endpoint depending on which region the Azure SQL database is in. As I am working with multiple groups, and cannot ensure which region the database will be in, I am looking for a way to programmatically determine the region.
The region is also important, as I want to copy the backups to a different region just to be on the safe side.
I know that I can look in the Admin console, but I would like to use code to solve this problem.
Additional information:
The application is running on Azure using Worker Roles for functionality.
I do not have access to all of the account IDs needed to use the full REST API.
I do have access to the master database on the Azure SQL Server.
I'm working on this in C# (I forgot to mention the language).
You can use the Get Servers request (GET https://management.database.windows.net:8443/<subscription-id>/servers) of the Azure REST API to enumerate SQL servers; the response gives the Location (region) of each server. More info on MSDN: http://msdn.microsoft.com/en-us/library/windowsazure/gg715269.aspx
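A minimal sketch of how that request could be assembled, shown in Python rather than C# for brevity. Only the URL shape comes from the request above; the function name `build_get_servers_request` and the all-zero subscription ID are hypothetical. Authentication, any required headers, and parsing of the response are deliberately left out, so check the MSDN page for those before sending anything.

```python
from urllib.parse import quote
from urllib.request import Request

# Endpoint taken from the Get Servers request described above.
MANAGEMENT_ENDPOINT = "https://management.database.windows.net:8443"


def build_get_servers_request(subscription_id: str) -> Request:
    """Build (but do not send) the GET request that enumerates SQL servers."""
    # Percent-encode the subscription ID so it is safe as a path segment.
    url = f"{MANAGEMENT_ENDPOINT}/{quote(subscription_id, safe='')}/servers"
    return Request(url, method="GET")


if __name__ == "__main__":
    # Hypothetical placeholder subscription ID.
    request = build_get_servers_request("00000000-0000-0000-0000-000000000000")
    print(request.get_method(), request.full_url)
```

The same URL can be used from C# with any HTTP client; the region is then read from the Location value returned for each server.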