Get data from Azure Synapse to Azure Machine Learning - azure

I am trying to load data (tabular data in tables, in a schema named 'x') from a Spark pool in Azure Synapse. I can't seem to find how to do that. Until now I have only linked Synapse and my pool to the ML studio. How can I do that?

The Lake Database contents are stored as Parquet files and exposed via your Serverless SQL endpoint as External Tables, so you can technically just query them via the endpoint. This is true for any tool or service that can connect to SQL, like Power BI, SSMS, Azure Machine Learning, etc.
WARNING, HERE THERE BE DRAGONS: Due to the manner in which the serverless engine allocates memory for text queries, using this approach may result in significant performance issues, up to and including service interruption. Speaking from personal experience, this approach is NOT recommended. I recommend that you limit use of the Lake Database to Spark workloads or very limited investigation in the SQL pool. Fortunately there are a couple of ways to sidestep these problems.
Approach 1: Read directly from your Lake Database's storage location. This will be in your workspace's root container (declared at creation time) under the following path structure:
synapse/workspaces/{workspacename}/warehouse/{databasename}.db/{tablename}/
These are just Parquet files, so there are no special rules about accessing them directly.
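Since the question is about getting this data into Azure Machine Learning, here is a minimal sketch of Approach 1 using the v1 azureml-core SDK. It assumes the workspace's root ADLS Gen2 container is already registered as an Azure ML datastore; the datastore name, Synapse workspace name, database name ('x'), and table name are all placeholders:

from azureml.core import Workspace, Datastore, Dataset

ws = Workspace.from_config()                   # uses the config.json downloaded from the ML studio
datastore = Datastore.get(ws, "synapse_root")  # datastore pointing at the Synapse root container

# Path structure from above: synapse/workspaces/{workspacename}/warehouse/{databasename}.db/{tablename}/
parquet_path = (datastore, "synapse/workspaces/myworkspace/warehouse/x.db/mytable/**/*.parquet")

dataset = Dataset.Tabular.from_parquet_files(path=parquet_path)
df = dataset.to_pandas_dataframe()             # or dataset.register(ws, name="mytable") for reuse later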
Approach 2: You can also create Views over your Lake Database (External Table) in a serverless database and use the WITH clause to explicitly assign properly sized schemas. Similarly, you can ignore the External Table altogether and use OPENROWSET over the same storage mentioned above. I recommend this approach if you need to access your Lake Database via the SQL Endpoint.
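If you do go the serverless route, here is a hedged sketch of the OPENROWSET variant run from Python over pyodbc; the endpoint name, storage URL, and column list are placeholders, and the WITH clause is where you explicitly size the text columns instead of taking the oversized defaults that cause the problems described above:

import pyodbc

conn = pyodbc.connect(
    "Driver={ODBC Driver 17 for SQL Server};"
    "Server=myworkspace-ondemand.sql.azuresynapse.net;"
    "Database=master;"
    "Authentication=ActiveDirectoryInteractive;"
)

query = """
SELECT *
FROM OPENROWSET(
    BULK 'https://mystorageaccount.dfs.core.windows.net/root/synapse/workspaces/myworkspace/warehouse/x.db/mytable/**',
    FORMAT = 'PARQUET'
) WITH (
    id         INT,
    name       VARCHAR(100),   -- sized explicitly rather than defaulting to VARCHAR(8000)
    created_at DATETIME2
) AS rows;
"""

for row in conn.cursor().execute(query):
    print(row)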

Related

Can Azure Synapse query from external relational stores?

This diagram from this URL states Azure Synapse cannot query external relational stores but Azure Databricks can.
But here I see it is possible with Azure Synapse. We could also use PolyBase in Azure Synapse. Neither of these articles is outdated. So what am I missing?
Your second URL is for External Tables, which are not the same as external relational stores (Azure SQL, MySQL, PostgreSQL, etc.). I do not believe any of the Synapse engines can connect directly to relational data stores [although I'm not certain of Spark's limitations in this regard], but Pipelines can. While they both use Spark, Databricks is a separate product and not related to Synapse.
Polybase uses External Tables, which are metadata references over blobs in storage (Blob or ADLS). Synapse supports External tables in both Dedicated SQL Pool and Serverless SQL. Spark tables are also queryable from Serverless SQL because they are stored as Parquet files in ADLS. I believe this is also implemented as an External Table reference, although it does not display as such in the Workspace UI.
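For a concrete picture of what "metadata references over blobs" means, here is an illustrative set of DDL statements (names and paths are placeholders) wrapped as a Python string, since you would submit it against a serverless database with whatever SQL client you use; creating the table moves no data, it only points at the Parquet files:

external_table_ddl = """
CREATE EXTERNAL DATA SOURCE MyLake
WITH (LOCATION = 'https://mystorageaccount.dfs.core.windows.net/root');

CREATE EXTERNAL FILE FORMAT ParquetFormat
WITH (FORMAT_TYPE = PARQUET);

CREATE EXTERNAL TABLE dbo.MyTable (
    id   INT,
    name VARCHAR(100)
)
WITH (
    LOCATION    = 'synapse/workspaces/myworkspace/warehouse/x.db/mytable/',
    DATA_SOURCE = MyLake,
    FILE_FORMAT = ParquetFormat
);
"""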

Databricks Delta Tables - Where are they normally stored?

I'm beginning my journey into Delta tables and one thing that is still confusing me is where the best place is to save your Delta tables if you need to query them later.
For example, I'm migrating several tables from on-prem to Azure Databricks as individual Delta tables. My question is, should I save the individual Delta tables, which could be significant in size, into the DBFS Databricks internal storage, or should I mount a blob storage location and save the Delta Lake tables there? What do people normally do in these situations?
I usually recommend storing data in a separate storage account (either mounted, or used directly), and not using the workspace's internal storage for that task (a minimal PySpark sketch follows the guides below). The primary reason is that it's easier to share this data with other workspaces or other systems if necessary. Internal storage should be used primarily for temp files, libraries, init scripts, etc.
There are a number of useful guides available that can help:
Azure Databricks Best Practices, which specifically talks about internal storage
About securing access to Azure Data Lake
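As a minimal illustration of the "separate storage account, used directly" option, here is a PySpark sketch for a Databricks notebook; the storage account, container, and table names are placeholders, and it assumes the cluster can already authenticate to that account (service principal, account key, or credential passthrough):

# df is an existing Spark DataFrame produced by your migration job
path = "abfss://delta@mystorageaccount.dfs.core.windows.net/bronze/customers"

(df.write
   .format("delta")
   .mode("overwrite")
   .save(path))

# Optionally register it as an external (unmanaged) table so it can be queried by name later
spark.sql(f"CREATE TABLE IF NOT EXISTS customers USING DELTA LOCATION '{path}'")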

How can I implement the Row Level Security feature on external tables in the Microsoft Azure / Microsoft Synapse serverless SQL Pool service?

I am looking at a Data Lake csv file and want to create an external table in the serverless SQL Pool of Microsoft Synapse. The goal is to query this file with Row Level Security constraints in place.
When the external table is created on a dedicated SQL pool, I am able to query the file with Row Level Security constraints in place.
How can I implement Row Level Security for external tables on a serverless SQL pool?
Unfortunately, row-level security is not supported in serverless SQL pool at the moment.
Can you please vote for this on our User Voice?
https://feedback.azure.com/forums/307516-sql-data-warehouse?category_id=171048
You can't use the feature as it is. T-SQL support on Serverless is limited.
E.g. CREATE FUNCTION isn't supported.
This syntax is not supported by serverless SQL pool in Azure Synapse Analytics.
You could of course try to DIY using Views which are supported in Serverless.
In the figure below Entitlements would become another CSV and EXTERNAL TABLE that you would create.
You'll have to either find the right function to get the current user and/or role for the View's SELECT query, or provide it via some wrapper code from some other place where you maintain your own Context.
Disclaimer: I've not done this in Serverless so can't say for sure.
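To make the DIY idea concrete, here is a sketch of what such a View could look like, wrapped as a Python string because you would submit it against the serverless endpoint with whatever client you use. All object and column names are placeholders, Entitlements is the extra CSV-backed external table mentioned above, and it assumes Azure AD logins so that SUSER_SNAME() identifies the caller (verify that function behaves as expected on your serverless pool before relying on it):

rls_view_ddl = """
CREATE VIEW dbo.SalesFiltered
AS
SELECT s.*
FROM dbo.Sales AS s
JOIN dbo.Entitlements AS e
    ON e.Region = s.Region
WHERE e.UserLogin = SUSER_SNAME();   -- the current login acts as the security 'context'
"""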

Could anyone help me with how to perform Azure Table storage deployment through VSTS?

I am new to Azure. Could anyone help me understand what Table storage is in Azure and how I can do Table storage deployment through VSTS? Please share your thoughts on what steps are involved in this and which plugin/task I can use in VSTS to perform this.
About Azure Table storage, you can refer to this article: Azure Table storage overview.
Regarding Azure Table storage with VSTS, you can manage Azure tables and table entities through the Azure PowerShell task.
Azure Table storage stores large amounts of structured data. The service is a NoSQL datastore which accepts authenticated calls from inside and outside the Azure cloud. Azure tables are ideal for storing structured, non-relational data. Common uses of Table storage include:
Storing TBs of structured data capable of serving web scale applications
Storing datasets that don't require complex joins, foreign keys, or stored procedures and can be denormalized for fast access
Quickly querying data using a clustered index
Accessing data using the OData protocol and LINQ queries with WCF Data Service .NET Libraries
You can use Table storage to store and query huge sets of structured, non-relational data, and your tables will scale as demand increases.
You'll have to install the Azure Storage Client Library for .NET to work with Azure Storage.
For more details, refer to the documentation Get started with Azure Table storage using .NET and Get started with Azure table storage and Visual Studio Connected Services (ASP.NET), in case you haven't checked them earlier.
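If you would rather script this in Python than PowerShell, a pipeline script task could do the equivalent with the azure-data-tables package; the connection string, table name, and entity below are placeholders:

from azure.data.tables import TableServiceClient

service = TableServiceClient.from_connection_string("<storage-account-connection-string>")
table = service.create_table_if_not_exists("Orders")

# Insert or update a single entity; PartitionKey and RowKey together identify the row
table.upsert_entity({
    "PartitionKey": "customer-001",
    "RowKey": "order-42",
    "Total": 19.99,
})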

Google Analytics data in Azure

Has anybody ever moved Google Analytics data into Azure? I have seen a handful of ways to do it but I am not sure what I am getting myself into. The Google Analytics data is becoming quite large and I am wondering if it is best suited to leave it in google storage and access it from Azure or move it to something like HDInsight or Data Lake. I need to join the data across several disparate data stores, SQL Azure, Blob, and Table Storage. I was also looking into Apache Drill and Presto as a possible solution to unify the data access. Just looking to see if anybody out there has dealt with this same issue and has any experience to share. Thanks!
Preface
I don't have experience with Presto so I can only comment on the feasibility of doing this with Drill. Also I have not used Azure services so my advice is theoretical.
Drill Storage Plugins
Drill will allow you to perform any SQL queries you want on data originating from different sources, provided that each data source has a storage plugin. A storage plugin is simply a piece of code in Drill that allows you to interface with a data source. Since you are concerned with performing queries on 3 data sources, we need to determine if each of those 3 data sources have a Storage plugin.
SQL Azure
I assume SQL Azure has a JDBC driver for Java. If so, then Drill can be configured to use SQL Azure by following these instructions.
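As a hedged sketch of what that configuration could look like, here is a JDBC storage plugin registered through Drill's REST API from Python; the plugin name, JDBC URL, and credentials are placeholders, and it assumes the Microsoft SQL Server JDBC driver jar is already on Drill's classpath:

import json
import requests

plugin_config = {
    "type": "jdbc",
    "driver": "com.microsoft.sqlserver.jdbc.SQLServerDriver",
    "url": "jdbc:sqlserver://myserver.database.windows.net:1433;database=mydb",
    "username": "myuser",
    "password": "mypassword",
    "enabled": True,
}

# Create or update the plugin via Drill's REST API (Drill web UI port, default 8047)
resp = requests.post(
    "http://localhost:8047/storage/azuresql.json",
    data=json.dumps({"name": "azuresql", "config": plugin_config}),
    headers={"Content-Type": "application/json"},
)
print(resp.status_code, resp.text)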
Azure Blob
Azure Blob storage has an implementation of the Hadoop filesystem API, which Drill uses to read data from file systems. So you could theoretically add the hadoop-azure jar and its dependencies https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-azure/2.7.0 to Drill's classpath and configure Drill's DFS storage plugin to use it.
Additionally, the data in Azure Blob would have to be stored in a supported file format like JSON, Parquet, CSV, or Hadoop sequence files.
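A sketch of what the corresponding DFS plugin configuration could look like, expressed as the Python dict you would post to the same /storage/ REST endpoint as above; the account, container, and format details are placeholders, and hadoop-azure would expect the account key under fs.azure.account.key.<account>.blob.core.windows.net in Drill's core-site.xml:

blob_plugin_config = {
    "type": "file",
    "connection": "wasbs://mycontainer@mystorageaccount.blob.core.windows.net",
    "workspaces": {
        "root": {"location": "/", "writable": False, "defaultInputFormat": None}
    },
    "formats": {
        "parquet": {"type": "parquet"},
        "json": {"type": "json", "extensions": ["json"]},
    },
}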
Azure Table
This looks like Microsoft's custom NoSQL database. Currently Drill does not support it.
Conclusion
With a bit of work you could use Drill to query data on both Azure SQL and Blob, but not Azure Table.
