Maintaining relationships between data stored in Azure SQL and Table Storage

Working on a project in which high-volume data will be moved from SQL Server (running on an Azure VM) to Azure Table Storage for scalability and cheaper storage. Several of the values being moved to Table Storage are foreign keys that reference GUID primary keys in the SQL tables. Obviously, there is no way to enforce referential integrity, since transactions do not span different Azure storage types. I would like to know if anyone has had success with this storage design. Are there any transaction management solutions that allow transactions spanning SQL Server and Azure Table Storage? And what are the implications of queries that read from both stores (SQL and Table Storage)?

If you're trying to perform a transaction that spans both SQL Server and Azure Tables, your best bet is the eventually consistent transaction pattern.
In a nutshell, you put your updates into a queue message, then have a worker process (whether it is a WebJob, a Worker Role, or something running on your VM) dequeue the message with a peek-lock, make sure that all steps within the transaction are executed, and then call complete on the message.
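Because the queue message can be delivered more than once, each step the worker performs has to be safe to retry. On the SQL side that typically means an idempotent upsert keyed on the GUID; here is a minimal, hedged sketch (the table, columns, and values are hypothetical, not taken from your schema):

-- Hypothetical table standing in for the SQL side of the transaction.
IF OBJECT_ID('dbo.Orders') IS NULL
    CREATE TABLE dbo.Orders (OrderId uniqueidentifier PRIMARY KEY, Status nvarchar(20) NOT NULL);

-- In the worker, these values would come from the queue message.
DECLARE @OrderId uniqueidentifier = NEWID();
DECLARE @Status  nvarchar(20)     = N'Processed';

-- Idempotent upsert: re-running this step after a redelivered message is harmless.
MERGE dbo.Orders AS target
USING (SELECT @OrderId AS OrderId, @Status AS Status) AS source
    ON target.OrderId = source.OrderId
WHEN MATCHED THEN
    UPDATE SET target.Status = source.Status
WHEN NOT MATCHED THEN
    INSERT (OrderId, Status) VALUES (source.OrderId, source.Status);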
If you're looking to perform a transaction on just an Azure Table, you can do so with batch updates (entity group transactions), as long as all the entities live within the same partition.

Related

Moving data from Azure Blob Storage to Azure Synapse (SQL dedicated pools)

I have a requirement to move Azure Blob Storage data to Azure Synapse (SQL dedicated pools).
The Azure Blob Storage container holds around 850 GB of data (in the form of multiple JSON files). I have created a Synapse pipeline and used PolyBase to move the data from Blob Storage to the SQL dedicated pool. PolyBase requires a staging environment, for which I have used a staging blob container.
Azure Blob Storage -> staging container -> SQL dedicated pool (Azure Synapse)
I have not set any restrictions on DIUs or parallel processing, so the copy uses 32 DIUs and the parallelism goes up to 120-130.
The first stage completed in 5 hours, moving 850 GB of data to the staging container, but the second stage has now been running for 15 hours and is still not complete; the DIU count I can see is 2 and the parallelism is 1.
Do I need to explicitly specify the DIUs and parallel processing?
Is there a better way to do this other than PolyBase?
There are three key points missing from your question:
Why are you using a staging table?
Do you need to perform any transformations on the data?
Why is your staging table in Blob Storage?
As per the documented best practice:
It is best practice to load data into a staging table. Staging tables allow you to handle errors without interfering with the production tables. A staging table also gives you the opportunity to use the SQL pool's built-in distributed query processing capabilities for data transformations before inserting the data into production tables.
Since you haven't mentioned whether your approach is Extract, Transform and Load (ETL) or Extract, Load and Transform (ELT), the reason for the staging table isn't clear. If your approach is ETL, a staging table makes sense; otherwise it is not required.
Secondly, staging tables should be in the SQL pool, not in Blob Storage.
Before loading the data into the staging table, you need to define external tables in your data warehouse. PolyBase uses external tables to define and access the data in Azure Storage. An external table is similar to a database view: it contains the table schema and points to data that is stored outside the data warehouse.
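For reference, a minimal sketch of those external objects; all names, the storage location, and the delimited-text format are hypothetical placeholders. Note that dedicated SQL pool PolyBase reads delimited text, Parquet, ORC or RC files, so JSON sources would first need to be landed in one of those formats, and a database scoped credential may be required if the container is not public.

-- External data source pointing at the Blob Storage container (hypothetical names).
CREATE EXTERNAL DATA SOURCE SensorBlobSource
WITH (
    TYPE = HADOOP,
    LOCATION = 'wasbs://data@mystorageaccount.blob.core.windows.net'
);

-- File format for pipe-delimited text files.
CREATE EXTERNAL FILE FORMAT PipeDelimitedText
WITH (
    FORMAT_TYPE = DELIMITEDTEXT,
    FORMAT_OPTIONS (FIELD_TERMINATOR = '|')
);

-- External table: schema only, the data stays in storage until it is loaded.
CREATE EXTERNAL TABLE dbo.SensorReadings_ext
(
    DeviceId     uniqueidentifier,
    ReadingTime  datetime2,
    ReadingValue float
)
WITH (
    LOCATION = '/sensor-readings/',
    DATA_SOURCE = SensorBlobSource,
    FILE_FORMAT = PipeDelimitedText
);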
You also need to take care of the resource class and the distribution method. Refer to this third-party article to learn more about these two workload management concepts.
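As a hedged illustration of both points, here is a load through a round-robin heap staging table with CTAS, plus a static resource class assignment for the loading user (the names are hypothetical and continue the sketch above):

-- Staging table: heap + ROUND_ROBIN distribution is usually fastest for loads.
CREATE TABLE dbo.SensorReadings_stage
WITH (
    DISTRIBUTION = ROUND_ROBIN,
    HEAP
)
AS
SELECT DeviceId, ReadingTime, ReadingValue
FROM dbo.SensorReadings_ext;

-- Give the (hypothetical) loading user more memory per query via a static resource class.
EXEC sp_addrolemember 'staticrc40', 'LoadUser';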
Since there are many parameters to consider before you can pinpoint the exact issue, I suggest you first go through the official documents below and then make the appropriate changes to your architecture.
Helpful links:
Design a PolyBase data loading strategy for dedicated SQL pool in Azure Synapse Analytics
Workload management with resource classes in Azure Synapse Analytics
Best practices for loading data into a dedicated SQL pool in Azure Synapse Analytics

Multi-tenant IoT application with Azure

I'm building an Azure IoT Hub application. I have several customers, each with a set of devices. Do you think all those customers should be connected to the same hub, or to different ones?
I would also like to populate a multi-tenant database (a single database with multiple schemas) via Azure Stream Analytics. The idea is to use a job that partitions the data by customer and saves it into a table in a customer-specific schema in my database. Is it possible to do this, or is the only way to keep customer data separate to have several databases (instead of one database with multiple schemas)?
I'm building an Azure IoT Hub application. I have several customers, each with a set of devices. Do you think all those customers should be connected to the same hub, or to different ones?
It really depends on the data being processed and on your actual requirements. If sharing the IoT Hub resource details with other customers is not an issue, then you can use the same IoT Hub; otherwise, use individual IoT Hubs.
I would also like to populate a multi-tenant database (a single database with multiple schemas) via Azure Stream Analytics. The idea is to use a job that partitions the data by customer and saves it into a table in a customer-specific schema in my database. Is it possible to do this, or is the only way to keep customer data separate to have several databases (instead of one database with multiple schemas)?
SQL output in Azure Stream Analytics supports writing in parallel as an option. This option allows for fully parallel job topologies, where multiple output partitions are writing to the destination table in parallel. Enabling this option in Azure Stream Analytics however may not be sufficient to achieve higher throughputs, as it depends significantly on your database configuration and table schema. The choice of indexes, clustering key, index fill factor, and compression have an impact on the time to load tables. For more information about how to optimize your database to improve query and load performance based on internal benchmarks, see SQL Database performance guidance. Ordering of writes is not guaranteed when writing in parallel to SQL Database.
See Increase throughput performance to Azure SQL Database from Azure Stream Analytics for more details.
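On the schema-per-customer part of the question: the SQL output of a Stream Analytics job writes to a single configured table, so a common approach is to define one output per customer (each pointing at that customer's schema and table) and route rows to it in the query. A hedged sketch in the Stream Analytics query language (SQL-like); the input, outputs, and the CustomerId column are hypothetical:

-- One SELECT ... INTO per customer output; each output is configured against
-- that customer's schema (for example tenantA.DeviceReadings).
SELECT
    CustomerId,
    DeviceId,
    Temperature,
    EventEnqueuedUtcTime
INTO
    [sql-output-tenant-a]
FROM
    [iothub-input] PARTITION BY PartitionId
WHERE
    CustomerId = 'tenant-a';

SELECT
    CustomerId,
    DeviceId,
    Temperature,
    EventEnqueuedUtcTime
INTO
    [sql-output-tenant-b]
FROM
    [iothub-input] PARTITION BY PartitionId
WHERE
    CustomerId = 'tenant-b';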

Storing IOT Data in Azure: SQL vs Cosmos vs Other Methods

The project I am working on as an architect has an IoT setup where many sensors send data such as water pressure and temperature to an FTP server (this can't be changed, as we have no control over it for security reasons). From there, a few Windows services on Azure pull the data and store it in an Azure SQL Database.
Here is my observation with respect to this architecture:
Problem 1: The 1 TB limit in Azure SQL. With a higher tier it can go to 4 TB, but that's the maximum, so it does not appear to be infinitely scalable, and as the size grows, query performance could become an issue. Columnstore indexes and partitioning seem to be options, but the size limitation and DTUs are a deal breaker.
Problem 2: The IoT data and the SQL Database (the downstream storage) are tightly coupled. If a customer wants to extract a few months of data, or more, amounting to millions of rows, the database will get busy and possibly throttle other customers through DTU exhaustion.
I would like some ideas on scaling this further. SQL DB is great, and with JSON support it is awesome, but it is not a horizontally scalable solution.
Here is what I am thinking:
All messages should be consumed from FTP by Azure IoT Hub by some means.
From the central hub, I want to push all messages to Azure Blob Storage in 128 MB files for later analysis at low cost.
At the same time, I would like all messages to go to IoT Hub and from there to Azure CosmosDB (for long-term storage) or Azure SQL DB (also long term, but I'm not sure about it because of the size restriction).
I am keeping data in Blob Storage because if the client wants a machine learning team (in-house or hired) to create some models, I would prefer them to pull data from Blob Storage rather than hitting my DB.
Kindly suggest a few ideas on this. Thanks in advance!
Chandan Jha
First, Azure SQL DB does have Hyperscale, which goes well beyond 4 TB. That said, there is a tipping point where it makes sense to consider alternative architectures once you grow bigger than what one machine can handle for your solution. While CosmosDB gives you a horizontal sharding solution, you can do the same with N SQL databases (there are libraries to help there).
Stepping back, it is actually pretty important to understand what you want to do with the data once it is in a database. Both CosmosDB and SQL DB are set up for OLTP-style operations (with some limited forms of broader queries; SQL DB supports columnstore and batch mode, for example, which means you could run a reasonably sized data mart just fine there too). If you are just storing things in the database in the hope of supporting future data scientists, then you may or may not really need either of these two OLTP stores.
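As a concrete, hedged illustration of the columnstore point (the table and its columns are hypothetical):

-- A clustered columnstore index stores the telemetry as compressed column
-- segments, which cuts storage size and speeds up analytic scans; batch mode
-- execution applies automatically to queries over it.
CREATE TABLE dbo.SensorTelemetry
(
    DeviceId    uniqueidentifier NOT NULL,
    ReadingTime datetime2        NOT NULL,
    Pressure    float            NULL,
    Temperature float            NULL
);

CREATE CLUSTERED COLUMNSTORE INDEX cci_SensorTelemetry
    ON dbo.SensorTelemetry;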
Synapse SQL is set up for analytics and generally has support to read from data in formats in Azure Storage. So, this may be a better strategy if you want to support arbitrarily-large IoT data and do analytics/ML processing over it.
If you know your solution will never grow to that kind of scale, you may not need to consider something like Synapse, but it is set up for those scenarios once you reach sufficient size.
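As mentioned above, Synapse SQL can query files sitting in Azure Storage directly; a hedged sketch using a serverless SQL pool (the storage URL, folder, and columns are hypothetical, and it assumes the files were landed as Parquet):

-- Ad-hoc analytics over the raw files in Blob Storage, without loading them
-- into a database first.
SELECT
    r.DeviceId,
    CAST(r.ReadingTime AS date) AS ReadingDate,
    AVG(r.Pressure) AS AvgPressure
FROM OPENROWSET(
        BULK 'https://mystorageaccount.blob.core.windows.net/telemetry/*.parquet',
        FORMAT = 'PARQUET'
     ) AS r
GROUP BY r.DeviceId, CAST(r.ReadingTime AS date);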
Option 1:
Why don't you extract and serialize the data based on the partition ID (device ID) and send it to the IoT Hub, where you can have Azure Functions or Logic Apps deserialize the data into files that are stored in blob containers?
Option 2:
You can also create a module that extracts the data into an Excel file, which is then sent to the IoT Hub and stored in the storage containers.

SQL Azure and CDN

What is the best way to limit latency for SQL Azure in global applications?
My application uses SQL Azure, and I would like to know whether, based on users' network locations, it is possible to connect them to a SQL Azure instance near them.
So, logically, I would need a SQL Azure database with global replication, but not geo-replication, since each copy would serve as a master and not a secondary.
Thank you in advance.
You may want to try CosmosDB to distribute data globally and obtain low latency, as explained in this article and this documentation.
For replicating data with SQL Data Sync and Azure SQL Database, take paired regions into consideration, as they may reduce latency. With SQL Data Sync you can define a hub database and many member databases in other regions, and data can be synced in both directions between the hub and any member database.

SQL Azure failover / backup strategy for web app

I'm building a web app using Azure & SQL Azure. I'm setting it up so each organization has their own database. Low to moderate traffic per customer organization.
I'm thinking about using SQL Azure Data Sync as part of a failover/backup plan, so that if SQL Azure goes down, my app can switch over to my on-premises SQL Server (read-only mode).
I would also be able to do all of my backups on-prem, instead of in the cloud which could incur costs.
One issue may be trying to data-sync multiple databases to my on-prem SQL Server (I'm not sure what the limit is on the number of databases that can be synced to one server).
Bandwidth may be an issue, but I'll probably only sync daily.
Does anyone see any other problems with this approach?
Data Sync is ok, but may or may not be good for your particular DR plan since it's not a transactional sync model.
One option to consider is making a database copy:
CREATE DATABASE destination_database_name
AS COPY OF [source_server_name.]source_database_name
Then you can create a backup from this copy, store the backup in blob storage, and (optionally) delete the database copy. While this does add an additional cost due to a second database being live, you can keep that cost to a minimum if you delete the database instance after creating a backup and storing to blob storage (remember that databases are amortized daily).
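To know when the copy has finished before you take the backup, you can poll from the logical server's master database; a small hedged sketch (the database name is the same placeholder used above):

-- The copy shows up as COPYING until it completes, then goes ONLINE.
SELECT name, state_desc
FROM sys.databases
WHERE name = 'destination_database_name';

-- Progress of in-flight copies (includes percent_complete).
SELECT *
FROM sys.dm_database_copies;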
Since your backups would then be in blob storage, you could keep multiple backups in blob storage, and pull a backup to your on-premises server if needed.
