What are best practices for using Azure Synapse MPP dedicated SQL pools with large data volumes? [closed]

Hello Stack Overflow community,
Regarding Microsoft's Azure Synapse Analytics (ASA) cloud-based OLAP offering, there seems to be a critical gap in ASA dedicated MPP SQL pools.
I'm rolling my own IDENTITY-style surrogate keys because Azure Synapse MPP dedicated SQL pools don't seem to support them. In doing so, data sets that could be loaded with O(N) performance (i.e., pre-sorted data sets) are degraded to O(N log N), because each time I insert a record I must calculate its surrogate key (e.g., surrogateKey = MAX(table.surrogateKey) + 1).
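Roughly, that workaround looks like the sketch below (my reconstruction of what is described above, with placeholder table and column names, run here via pyodbc):

```python
# Rough reconstruction (mine) of the MAX(key) + 1 workaround described above, with
# placeholder table/column names. The MAX() lookup against the target table is the
# part an IDENTITY column would make unnecessary.
import pyodbc

INSERT_WITH_MAX_PLUS_ONE = """
INSERT INTO dbo.DimCustomer (CustomerKey, SourceCustomerId, CustomerName)
SELECT m.MaxKey + ROW_NUMBER() OVER (ORDER BY s.SourceCustomerId) AS CustomerKey,
       s.SourceCustomerId,
       s.CustomerName
FROM staging.Customer AS s
CROSS JOIN (SELECT COALESCE(MAX(CustomerKey), 0) AS MaxKey
            FROM dbo.DimCustomer) AS m;
"""

with pyodbc.connect("<dedicated-sql-pool-connection-string>", autocommit=True) as conn:
    conn.cursor().execute(INSERT_WITH_MAX_PLUS_ONE)
```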
Initial loads, as well as incremental loads, of many large ODS data sets (sometimes exceeding billions of rows) are a requirement for the DW environments I support.
This currently does not appear to be supported. If this is a limitation of Azure Synapse dedicated SQL pools, how do we overcome it?
How do we best implement the necessary surrogate keys for OLAP data architectures in ASA, without sound assistance and support from the RDBMS, and still have it be performant?
Please also help with these important, related questions:
How do I best ingress and ingest source data into ASA dedicated SQL pool analytics tables that:
0.) maintain their own auto-numbered (i.e., SQL Server IDENTITY(seed, increment)) enforced, unique, primary-key surrogate key columns?
1.) also support NONCLUSTERED indexes?
2.) allow bulk insert operations to run in O(N) with pre-sorted source data sets? How do I accomplish massive insert operations with O(N) performance in Azure Synapse SQL pool databases? It sure seems like this is currently O(N log N).
3.) support PK-FK referential integrity? How do we approach this and still have a performant analytics data model with Azure SQL pool databases?
There must be a better way to do this with Azure dedicated SQL pool databases.
Thank you.

Synapse dedicated SQL pools do support IDENTITY, and there shouldn't be any reason they can't handle tables with billions of rows. Large data loads should use COPY INTO or PolyBase for best performance.
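For reference, a minimal sketch (mine, not the answerer's) of what that can look like: a dedicated SQL pool table with an IDENTITY surrogate key, bulk-loaded with COPY INTO, run here via pyodbc. The server, credentials, table and storage paths are placeholders.

```python
# A minimal sketch (not the answerer's code): a dedicated SQL pool table with an
# IDENTITY surrogate key, bulk-loaded with COPY INTO. Server, credentials, table and
# storage paths are placeholders.
import pyodbc

CONN_STR = (
    "Driver={ODBC Driver 18 for SQL Server};"
    "Server=tcp:<workspace>.sql.azuresynapse.net,1433;"
    "Database=<dedicated_pool>;Uid=<user>;Pwd=<password>;Encrypt=yes;"
)

CREATE_TABLE = """
CREATE TABLE dbo.FactSales
(
    SalesKey  BIGINT IDENTITY(1,1) NOT NULL,   -- surrogate key generated by the pool
    OrderId   BIGINT        NOT NULL,
    Amount    DECIMAL(18,2) NOT NULL
)
WITH
(
    DISTRIBUTION = HASH(OrderId),   -- the IDENTITY column cannot be the distribution column
    CLUSTERED COLUMNSTORE INDEX     -- typical choice for large fact tables
);
"""

COPY_INTO = """
COPY INTO dbo.FactSales (OrderId, Amount)   -- IDENTITY column omitted from the list
FROM 'https://<storageaccount>.blob.core.windows.net/<container>/sales/*.parquet'
WITH (FILE_TYPE = 'PARQUET', CREDENTIAL = (IDENTITY = 'Managed Identity'));
"""

with pyodbc.connect(CONN_STR, autocommit=True) as conn:
    cursor = conn.cursor()
    cursor.execute(CREATE_TABLE)
    cursor.execute(COPY_INTO)
```

Note that IDENTITY values in a dedicated SQL pool are unique but not guaranteed to be contiguous, which is usually fine for surrogate keys.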

Related

ADF, Azure Function or Hybrid [closed]

I need to download data from several APIs, some using basic REST and some using GraphQL, parse the data, and map some of the fields to various Azure SQL tables, discarding the unneeded data (it will later be visualised in Power BI).
I started with Azure Data Factory but got frustrated with the lack of simple functions, like converting a JSON field containing HTML into text.
I then looked at Azure Functions, thinking Python (although I'm open to Node.js); however, I have a lot of data to download and upsert into the database, and there are mentions on the internet that ADF is the most efficient way to bulk upsert data.
Then I thought: ADF using a Function to get the data, and ADF to do the bulk copy.
So my question is: what should I be using for my use case? I'm open to any suggestions, but it needs to be on Azure and cost sensitive. The ingestion needs to run daily, upserting around 300,000 records.
I think this pretty much comes down to taste, as you can probably solve this entirely with ADF alone or with an Azure Function alone, depending on the specifics of your case. In my personal experience I've often ended up using the hybrid variant, because it can be easier thanks to the extra flexibility compared to the standard API components of ADF: do the extraction from the API in an Azure Function, store the data in Blob Storage / Data Lake, and then load the data into the database with ADF. This setup can be pretty cost effective in my experience, depending on whether you can use an Azure Functions consumption plan (cheaper than the alternatives) and/or can avoid data flows in ADF (a significant cost driver in ADF).
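To make the hybrid pattern concrete, here is a minimal sketch of the extraction step only (my illustration, not the answerer's code; the API URL, container and connection string are hypothetical placeholders). It lands raw JSON in blob storage for an ADF copy activity to pick up and upsert into Azure SQL.

```python
# A minimal sketch of the "extract in a Function, land in blob, load with ADF" pattern
# described above (my illustration). API URL, container name and connection string are
# hypothetical placeholders.
import json
from datetime import datetime, timezone

import requests
from azure.storage.blob import BlobServiceClient

API_URL = "https://api.example.com/v1/records"            # placeholder source API
STORAGE_CONN_STR = "<storage-account-connection-string>"  # placeholder
CONTAINER = "raw"                                         # landing container read by ADF

def extract_to_blob() -> str:
    """Pull one API payload and land it as a dated JSON blob for ADF to copy."""
    payload = requests.get(API_URL, timeout=30).json()

    blob_name = f"records/{datetime.now(timezone.utc):%Y/%m/%d}/records.json"
    blob = (
        BlobServiceClient.from_connection_string(STORAGE_CONN_STR)
        .get_blob_client(container=CONTAINER, blob=blob_name)
    )
    blob.upload_blob(json.dumps(payload), overwrite=True)
    return blob_name  # an ADF pipeline then copies/upserts this file into Azure SQL

# In an Azure Function this would be called from a timer-triggered entry point.
```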

REST API to query Databricks table

I have a use case and need help with the best available approach.
I use Azure Databricks to build data transformations and create tables in the presentation/gold layer. The underlying data for these tables lives in an Azure Storage account.
The transformation logic runs twice daily and updates the gold layer tables.
I have several such tables in the gold layer, e.g. a table to store single-customer-view data.
An external application from a different system needs access to this data, i.e. the application initiates an API call for details about a customer, and I need to send back the matching details (customer details) by querying the single customer view table.
Question:
Is the Databricks SQL API the solution for this?
As it is a Spark table, I assume the response will not be quick. Is this correct, or is there a better solution?
Is Databricks designed for such use cases, or is it a better approach to copy this (gold layer) table into an operational database such as Azure SQL DB after the transformations are done in PySpark via Databricks?
What are the cons of that approach? One would be that the Databricks cluster has to be up and running all the time, i.e. an interactive cluster. Anything else?
It's possible to use Databricks for that, although it depends heavily on the SLAs, i.e. how fast the response needs to be. Answering your questions in order:
There is no standalone API for executing queries and getting back results (yet). But you can create a thin wrapper using one of the drivers for Databricks: Python, Node.js, Go, or JDBC/ODBC.
Response time depends heavily on the size of the data, whether the data is already cached on the nodes, and other factors (partitioning of the data, data skipping, etc.). Databricks SQL warehouses can also cache query results, so they won't reprocess the data if the same query was already executed.
Storing the data in an operational database is also an approach often used by customers. But it depends heavily on the size of the data and other factors: if you have a huge gold layer, a SQL database may not be the best solution from a cost/performance perspective either.
For such queries it's recommended to use Databricks SQL, which is more cost efficient than an always-running interactive cluster. Also, on some cloud platforms there is already support for serverless Databricks SQL, where the startup time is very short (seconds instead of minutes), so if queries against the gold layer don't happen very often, you can configure the warehouse with auto-termination and pay only when it is used.
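As an illustration of the "thin wrapper" option, here is a minimal sketch (mine, not the answerer's) of a small Flask endpoint that queries the gold-layer table through the Databricks SQL Connector for Python. The hostname, HTTP path, access token and table/column names are placeholders.

```python
# A minimal sketch of a thin REST wrapper over a Databricks SQL warehouse (my
# illustration). Hostname, HTTP path, access token and table/column names are placeholders.
from databricks import sql          # pip install databricks-sql-connector
from flask import Flask, jsonify    # pip install flask

app = Flask(__name__)

@app.get("/customers/<customer_id>")
def get_customer(customer_id: str):
    # Open a short-lived connection to the SQL warehouse for each request.
    with sql.connect(
        server_hostname="<workspace>.azuredatabricks.net",
        http_path="/sql/1.0/warehouses/<warehouse-id>",
        access_token="<personal-access-token>",
    ) as connection:
        with connection.cursor() as cursor:
            # Named parameter markers as supported by recent connector versions.
            cursor.execute(
                "SELECT * FROM gold.single_customer_view WHERE customer_id = :cid",
                {"cid": customer_id},
            )
            columns = [col[0] for col in cursor.description]
            rows = [dict(zip(columns, row)) for row in cursor.fetchall()]
    return jsonify(rows)
```

Whether this meets the external application's SLA then comes down to warehouse start-up time and result caching, as described above.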

How efficient can the Azure BLOB Table service be?

How efficient can Azure BLOB tables be?
The Azure BLOB service has various components, like Containers, Queues and Tables. How efficient can Tables be, what is their exact use case, and why are they generally used with a supporting service like Azure Cosmos DB?
Can anyone help me understand the concept and thought behind it?
Edit: The problem I am facing is that I have to log a processing batch of 700,000 data rows in C# into BLOB Tables. How do I achieve this following best practices?
This is a three in one question :-)
How efficient can tables be
Very efficient, if used properly. Every row in a table has a PartitionKey and RowKey. When querying data it performs very well if you can reduce the set by using (parts of) the PartitionKey and RowKey. As soon as you start filtering on other columns, performance can decrease very quickly. See also the docs on this topic.
what is their exact use case
It is basically a key/value-pair NoSQL solution. It can be used very efficiently to store simple data in a fast and cheap manner; it is one of the cheapest options when it comes to data storage. Tables don't have a fixed schema (hence, NoSQL) and are used to store, for example, logs, configuration data and simple data structures.
and why are they generally used with a supporting service like Azure CosmosDB.
This is not the case. Azure Table Storage can be used on its own. Cosmos DB has a Table API that lets you use Cosmos DB with code written for Azure Table Storage, without code modifications. It allows for premium performance because not only the PartitionKey and RowKey are indexed, but all the other columns as well, so performance stays very good even when you filter on other columns. But it will cost you more.
Data is best written in batches, since data is written per partition. See the answer of Ivan.
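For the 700,000-row logging scenario mentioned in the question's edit, here is a minimal sketch of partition-keyed, batched writes (my illustration in Python; the question asks for C#, but the same PartitionKey/transaction pattern applies there; the connection string and names are placeholders).

```python
# A minimal sketch of batched writes to Azure Table Storage (my illustration).
# Each transaction may contain at most 100 operations and all entities in it must
# share the same PartitionKey.
from itertools import islice
from azure.data.tables import TableServiceClient   # pip install azure-data-tables

CONN_STR = "<storage-account-connection-string>"    # placeholder

def chunked(iterable, size):
    """Yield lists of up to `size` items from `iterable`."""
    it = iter(iterable)
    while chunk := list(islice(it, size)):
        yield chunk

def log_rows(rows, batch_id: str):
    """Write an iterable of dicts as entities, all under one PartitionKey per batch run."""
    service = TableServiceClient.from_connection_string(CONN_STR)
    table = service.create_table_if_not_exists(table_name="ProcessingLog")

    entities = (
        {"PartitionKey": batch_id, "RowKey": f"{i:09d}", **row}
        for i, row in enumerate(rows)
    )
    for chunk in chunked(entities, 100):
        table.submit_transaction([("upsert", entity) for entity in chunk])
```

If write throughput matters, spread the rows over several PartitionKey values and batch per partition, since throughput limits apply per partition and a transaction cannot span partitions.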
Some more material on when to use it:
https://markheath.net/post/azure-tables-what-are-they-good-for
https://blogs.msdn.microsoft.com/brunoterkaly/2013/01/13/knowing-when-to-choose-windows-azure-table-storage-or-windows-azure-sql-database/

Azure elastic query performance

We have two applications that use separate databases on SQL Server 2012; however, we have several stored procedures that get data from the other database using INNER JOINs (7 joins in total). We are looking to see whether it is possible to move to Azure, and have set up a test using our existing databases, with external tables to get the data from the other database.
The problem is that the performance of these queries goes from 1-15 seconds on our server to 4+ minutes on Azure. We have tried moving the tables into the same database, and that did fix the speed problem, although it isn't ideal to move all the tables over to the same DB.
For the purpose of our test, we are using Azure Standard elastic pool with 50 DTUs.
Cross-database queries show good performance when the remote tables are not big. When the remote tables are big, this article shows how to perform the joins remotely using table variables and improve performance.
This other article also shows how to push parameterized operations to remote databases and improve performance.
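Since the linked articles aren't reproduced here, a rough sketch of one common shape of that advice (my illustration; the external table, columns and connection string are placeholders): filter the remote data through the external table into a table variable first, so only the reduced row set crosses databases, then run the joins locally.

```python
# A rough sketch (mine, not the linked articles' code) of reducing remote data into a
# table variable before joining locally. ext.Orders is a placeholder external table
# defined over the remote database.
import pyodbc

CONN_STR = "<azure-sql-connection-string>"   # placeholder

REMOTE_THEN_JOIN = """
SET NOCOUNT ON;

DECLARE @remote_orders TABLE (OrderId BIGINT, CustomerId INT, Amount DECIMAL(18,2));

-- The filter is evaluated against the remote database, so only the reduced row set
-- is transferred across databases.
INSERT INTO @remote_orders (OrderId, CustomerId, Amount)
SELECT OrderId, CustomerId, Amount
FROM ext.Orders
WHERE OrderDate >= DATEADD(day, -7, GETUTCDATE());

-- The joins then run locally against the small table variable instead of the
-- external table.
SELECT c.CustomerName, r.OrderId, r.Amount
FROM dbo.Customers AS c
JOIN @remote_orders AS r ON r.CustomerId = c.CustomerId;
"""

with pyodbc.connect(CONN_STR) as conn:
    rows = conn.cursor().execute(REMOTE_THEN_JOIN).fetchall()
```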
Hope this helps.

Comparisons between BigData Solutions. [closed]

I've been researching Big Data for the last couple of months and have started my FYP, which is to analyse Big Data using MapReduce and HDInsight in Windows Azure.
I've reached a point of confusion about which platform would be better for Big Data analytics in terms of cost, performance, stability, etc., such as Amazon, Oracle or IBM. This question may be too broad, but I just want a basic idea of how they differ when compared to Azure HDInsight.
In short: HDInsight vs other Big Data solutions for Big Data analytics. Any help would be appreciated.
Comparison among different Hadoop distributions: [image in the original answer]
You can find a reference about the Microsoft distribution in this article.
One important detail is that it is not only about the platform. I agree that it is important to understand your options, but I humbly suggest that you also take your (and your team's) skills into consideration.
One platform may be better than another, but if you are starting from scratch you may fail to achieve your goals in terms of timelines or budget, or even fail completely.
Combining Operational and Analytical Technologies Using Hadoop
New technologies like NoSQL, MPP databases and Hadoop have emerged to address Big Data challenges and to enable new types of products and services to be delivered by the business.
One of the most common ways companies are leveraging the capabilities of both systems is by integrating a NoSQL database such as MongoDB with Hadoop. The connection is easily made by existing APIs and allows analysts and data scientists to perform complex, retroactive queries for Big Data analysis and insights while maintaining the efficiency and ease-of-use of a NoSQL database.
NoSQL, MPP databases and Hadoop are complementary: NoSQL systems should be used to capture Big Data and provide operational intelligence to users, and MPP databases and Hadoop should be used to provide analytical insight for analysts and data scientists. Together, NoSQL, MPP databases and Hadoop enable businesses to capitalize on Big Data.
As per my knowledge, each cloud service provider has its positives and negatives.
I have good knowledge of Google Cloud, so I have tried to compare with respect to Google Cloud.
Below are two links that provide product mappings with respect to Google Cloud:
https://cloud.google.com/free/docs/map-azure-google-cloud-platform
https://cloud.google.com/free/docs/map-aws-google-cloud-platform
For example, Azure HDInsight maps to Google Cloud Dataproc and Google Cloud Dataflow. With Dataproc we can run Hadoop MapReduce jobs; Dataflow can be used for both batch and stream data processing.
In AWS, Azure HDInsight maps to Amazon Elastic MapReduce (EMR).
Each service provider has a different pricing mechanism based on CPU type, number of cores and storage options. In Google Cloud there is the option of preemptible instances, which are very cheap but can only be used for short-lived workloads (max 24 hrs).
You can compare pricing from the links below:
https://cloud.google.com/dataproc/pricing
https://cloud.google.com/dataflow/pricing
https://azure.microsoft.com/en-us/pricing/details/hdinsight/
https://aws.amazon.com/emr/pricing/
There is a tool on the market to compare different cloud services:
https://github.com/GoogleCloudPlatform/PerfKitBenchmarker
