Can Azure Synapse query from external relational stores?


This diagram from this URL states that Azure Synapse cannot query external relational stores, but Azure Databricks can.
But here I see that it is possible with Azure Synapse. We could also use PolyBase in Azure Synapse. None of these articles is outdated. So what am I missing?

Your second URL is for External Tables, which are not the same as external relational stores (Azure SQL, MySQL, PostgreSQL, etc.). I do not believe any of the Synapse engines can connect directly to relational data stores (although I'm not certain of Spark's limitations in this regard), but Pipelines can. While they both use Spark, Databricks is a separate product and is not related to Synapse.
PolyBase uses External Tables, which are metadata references over blobs in storage (Blob or ADLS). Synapse supports External Tables in both Dedicated SQL Pool and Serverless SQL. Spark tables are also queryable from Serverless SQL because they are stored as Parquet files in ADLS. I believe this is also implemented as an External Table reference, although it does not display as such in the Workspace UI.

Related

Get data from Azure Synapse to Azure Machine Learning

I am trying to load data (tabular data in tables, in a schema named 'x') from a Spark pool in Azure Synapse into Azure Machine Learning, but I can't find out how to do that. So far I have only linked Synapse and my pool to the ML studio. How can I do that?

The Lake Database contents are stored as Parquet files and exposed via your Serverless SQL endpoint as External Tables, so you can technically just query them via the endpoint. This is true for any tool or service that can connect to SQL, like Power BI, SSMS, Azure Machine Learning, etc.
WARNING, HERE THERE BE DRAGONS: Due to the manner in which the serverless engine allocates memory for text queries, using this approach may result in significant performance issues, up to and including service interruption. Speaking from personal experience, this approach is NOT recommended. I recommend that you limit use of the Lake Database for Spark workloads or very limited investigation in the SQL pool. Fortunately there are a couple ways to sidestep these problems.
Approach 1: Read directly from your Lake Database's storage location. This will be in your workspace's root container (declared at creation time) under the following path structure:
synapse/workspaces/{workspacename}/warehouse/{databasename}.db/{tablename}/
These are just Parquet files, so there are no special rules about accessing them directly.
Approach 2: You can also create Views over your Lake Database (External Table) in a serverless database and use the WITH clause to explicitly assign properly sized schemas. Similarly, you can ignore the External Table altogether and use OPENROWSET over the same storage mentioned above. I recommend this approach if you need to access your Lake Database via the SQL Endpoint.
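As a rough sketch of both approaches, the helper below composes the storage path from Approach 1 and an OPENROWSET query with an explicit WITH schema as in Approach 2. The storage account, container, workspace, database, table, and column names are all hypothetical placeholders, and the `https://{account}.dfs.core.windows.net/{container}/` URL form and the trailing `**` wildcard are assumptions about how your workspace storage is addressed; adjust them to your environment.

```python
def lake_table_path(workspace: str, database: str, table: str) -> str:
    """Relative path of a Lake Database table inside the workspace's root container."""
    return f"synapse/workspaces/{workspace}/warehouse/{database}.db/{table}/"


def openrowset_query(storage_account: str, container: str, workspace: str,
                     database: str, table: str, columns: dict) -> str:
    """Compose a serverless SQL query that reads the table's Parquet files directly.

    `columns` maps column names to explicit T-SQL types for the WITH clause,
    so the engine does not have to infer (and oversize) the schema.
    """
    # Assumed URL form for ADLS Gen2; '**' is meant to match all files under the table folder.
    url = (f"https://{storage_account}.dfs.core.windows.net/{container}/"
           f"{lake_table_path(workspace, database, table)}**")
    schema = ",\n    ".join(f"[{name}] {sql_type}" for name, sql_type in columns.items())
    return (
        "SELECT TOP 10 *\n"
        "FROM OPENROWSET(\n"
        f"    BULK '{url}',\n"
        "    FORMAT = 'PARQUET'\n"
        ") WITH (\n"
        f"    {schema}\n"
        ") AS rows"
    )


# Hypothetical names, for illustration only.
print(openrowset_query("mystorageacct", "root", "myworkspace", "sales", "orders",
                       {"order_id": "INT", "customer": "VARCHAR(50)"}))
```

The generated statement can be wrapped in a CREATE VIEW in a serverless database, so that tools connecting through the SQL endpoint see properly sized column types.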

Difference between the dedicated sql pool and dedicated sql pool inside the azure synapse analytics?

While provisioning Azure Synapse Analytics, we use Azure Data Lake Storage Gen2. As per MSDN, the data will be stored in Azure Storage Gen2, but Gen2 uses HDFS features. So how does Synapse Analytics use the DFS feature?

They are the same thing. You can either create a dedicated SQL pool first and then link it with a Synapse workspace, or create the Synapse workspace first and then create the dedicated pool inside it.
A dedicated SQL pool offers T-SQL based compute and storage capabilities. After creating a dedicated SQL pool in your Synapse workspace, data can be loaded, modeled, processed, and delivered for faster analytic insight.
Apart from the dedicated SQL pool, Azure Synapse provides serverless SQL and Apache Spark pools. You can choose the appropriate one based on your requirements.
Serverless SQL pool is a query service over the data in your data lake. It enables you to access your data through the following functionalities:
A familiar T-SQL syntax to query data in place without the need to copy or load data into a specialized store.
Integrated connectivity via the T-SQL interface that offers a wide range of business intelligence and ad-hoc querying tools, including the most popular drivers.
You pass the path of the file stored in Data Lake Gen2 directly in the T-SQL statement. See the example below:
select top 10 *
from openrowset(
    bulk 'https://pandemicdatalake.blob.core.windows.net/public/curated/covid-19/ecdc_cases/latest/ecdc_cases.csv',
    format = 'csv',
    parser_version = '2.0',
    firstrow = 2
) as rows
For more related information, I recommend going through this document.

Load data from Databricks to Azure Analysis Services (AAS)

Objective
I'm storing data as Delta Lake format at ADLS gen2. Also they are available through Hive catalog.
It's important to notice that we're currently using PowerBI, but in future we may switch to Excel over AAS.
Question
What is the best way (or hack) to connect AAS to my ADLS gen2 data in Delta Lake format?
The issue
There are no Databricks/Hive among AAS supported sources. AAS supports ADLS gen2 through Blob connector, but AFAIK, it doesn't support Delta Lake format, only parquet.
Possible solution
From this article I see that the issue may be potentially solved with PowerBI on-premise API gateway:
One example is the integration between Azure Analysis Services (AAS)
and Databricks; Power BI has a native connector to Databricks, but
this connector hasn’t yet made it to AAS. To compensate for this, we
had to deploy a Virtual Machine with the Power BI Data Gateway and
install Spark drivers in order to make the connection to Databricks
from AAS. This wasn’t a show stopper, but we’ll be happy when AAS has
a more native Databricks connection.
The issue with this solution is that we're planning to stop using PowerBI. I don't quite understand how it works, what PBI license and implementation/maintenance efforts it requires. Could you please provide deeper insight on how it'll work?
UPD, 26 Dec 2020
Now that Azure Synapse Analytics is GA, it has full support for SQL on-demand. That means that serverless Synapse may theoretically be used as glue between AAS and Delta Lake. See "Direct Query Databricks' Delta Lake from Azure Synapse".
At the same time, is it possible to query the Databricks Catalog (internal/external) from Synapse on-demand using ODBC? Synapse supports ODBC as an external source.

Power BI Dataflows now supports Parquet files, so you can load from those files into Power BI. However, the standard design pattern is to use Azure SQL Data Warehouse to load the files and then layer Azure Analysis Services (AAS) over that. AAS does not support Parquet, so you would have to create a CSV version of the final table or load it into a SQL database.
As mentioned, the typical architecture is to have Databricks do some or all of the ETL, then have Azure SQL DW sit over it.
Azure SQL DW has since morphed into Azure Synapse, which has the benefit that a Databricks/Spark database now has a shadow copy that is accessible through the SQL on-demand functionality. SQL on-demand doesn't require you to have an instance of the data warehouse component of Azure Synapse; it runs on demand, and you pay per TB of data queried. A good outline of how it can help is here. The other option is to have Azure Synapse load the data from an external table into that service and then connect AAS to that.

Hadoop on Azure using IaaS

I am looking at having a Hadoop cluster setup for Big Data analytics using the virtualized environment in Azure. As the data volume is very high, I am looking at having data stored in secondary storage like Azure Data Lake Store and Hadoop cluster storage will act as the primary storage.
I would like to know how this can be configured so that when I create a Hive table and partition it, part of the data can reside in primary storage and the rest in secondary storage.
Thanks
Regards,
Madhu

You can't mix file systems with a Hive table by default. The Hive metastore only consists of one filesystem location for a database / table definition.
You might try to use Waggle Dance to set up a federated Hive solution, but that is probably more work than simply allowing the Hive data to exist in Azure.

I don't know about Hadoop and Hive, but you could combine Azure Data Lake Store (ADLS) and Azure SQL Data Warehouse (ADW), i.e. use PolyBase in ADW to create an external table over the 'cold' data in ADLS and an internal table for your 'warm' data. ADW has the advantage that you can pause it.
Optionally create a view over the top to combine the external and internal table.
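To illustrate the view idea only, here is a minimal runnable sketch that uses Python's built-in sqlite3 as a stand-in for ADW. The table names `sales_warm` and `sales_cold`, the view name `sales_all`, and the columns are hypothetical; in ADW the cold table would be a PolyBase external table over ADLS rather than a regular table, and the view would be written in T-SQL.

```python
import sqlite3

# In-memory SQLite database standing in for the warehouse (an assumption for illustration).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales_warm (id INTEGER, amount REAL);  -- stands in for the internal table
    CREATE TABLE sales_cold (id INTEGER, amount REAL);  -- stands in for the external table
    INSERT INTO sales_warm VALUES (3, 30.0);
    INSERT INTO sales_cold VALUES (1, 10.0), (2, 20.0);

    -- One view over both tiers, tagging each row with where it came from.
    CREATE VIEW sales_all AS
        SELECT id, amount, 'warm' AS tier FROM sales_warm
        UNION ALL
        SELECT id, amount, 'cold' AS tier FROM sales_cold;
""")

rows = conn.execute("SELECT id, amount, tier FROM sales_all ORDER BY id").fetchall()
print(rows)  # [(1, 10.0, 'cold'), (2, 20.0, 'cold'), (3, 30.0, 'warm')]
```

Queries against the view do not need to know which tier a row lives in, while the `tier` column still lets you filter to warm data when the slower external read is not wanted.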

Could anyone help me perform an Azure Table storage deployment through VSTS?

I am new to Azure. Could anyone explain what Table storage in Azure is and how I can do a Table storage deployment through VSTS? Please share your thoughts on what steps are involved and which plugin/task I can use in VSTS to do this.

About Azure Table storage, you can refer to this article: Azure Table storage overview.
Regarding Azure table storage with VSTS, you can manage azure tables and table entities through Azure PowerShell task.

Azure Table storage stores large amounts of structured data. The service is a NoSQL datastore which accepts authenticated calls from inside and outside the Azure cloud. Azure tables are ideal for storing structured, non-relational data. Common uses of Table storage include:
Storing TBs of structured data capable of serving web scale applications
Storing datasets that don't require complex joins, foreign keys, or stored procedures and can be denormalized for fast access
Quickly querying data using a clustered index
Accessing data using the OData protocol and LINQ queries with WCF Data Service .NET Libraries
You can use Table storage to store and query huge sets of structured, non-relational data, and your tables will scale as demand increases.
You’ll have to install Azure Storage Client Library for .NET to work with Azure Storage.
For more details, refer to the documentation Get started with Azure Table storage using .NET and Get started with Azure table storage and Visual Studio Connected Services (ASP.NET), in case you haven't already checked them.
