Spark errors when writing to Synapse DWH pool - apache-spark

I am trying to write a dataframe in either append or overwrite mode into a Synapse table using the ("com.databricks.spark.sqldw") connector. The official docs don't say much about the ACID properties of this write operation. My question is: if the write fails in the middle of the operation, would the actions performed previously be rolled back?
One thing the docs do mention is that two classes of exception can be thrown during this operation: SqlDWConnectorException and SqlDWSideException. My logic is that if the write operation is ACID compliant, we do not need to do anything; if not, we plan to wrap the operation in a try-catch block and look at other options (maybe retry, or timeout).
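For context, here is a minimal sketch of the retry wrapper we have in mind (the JDBC URL, staging directory and table name below are placeholders, not real values):

    # Hypothetical connection placeholders
    jdbc_url = "jdbc:sqlserver://myworkspace.sql.azuresynapse.net:1433;database=mypool"
    temp_dir = "abfss://tempdata@mystorageacct.dfs.core.windows.net/synapse-staging"

    def write_to_synapse(df, table_name, mode="append", max_attempts=3):
        for attempt in range(1, max_attempts + 1):
            try:
                (df.write
                   .format("com.databricks.spark.sqldw")
                   .option("url", jdbc_url)
                   .option("tempDir", temp_dir)
                   .option("forwardSparkAzureStorageCredentials", "true")
                   .option("dbTable", table_name)
                   .mode(mode)
                   .save())
                return
            except Exception:
                # SqlDWConnectorException / SqlDWSideException surface on the
                # Python side wrapped in a Py4J error; retry, then give up.
                if attempt == max_attempts:
                    raise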

As a good practice you should write your code to be re-runnable, e.g. delete potentially duplicate records. Imagine you are re-running a file for a failed day, or someone wants to reprocess a certain period. However, SQL pools do implement ACID through transaction isolation levels:
Use transactions in a SQL pool in Azure Synapse
SQL pool implements ACID transactions. The isolation level of the transactional support is default to READ UNCOMMITTED. You can change it to READ COMMITTED SNAPSHOT ISOLATION by turning ON the READ_COMMITTED_SNAPSHOT database option for a user SQL pool when connected to the master database.
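If you do decide to change it, here is a hedged sketch of turning the option on from Python (the connection string, login and pool name are placeholders; note the quoted docs say to connect to the master database):

    import pyodbc

    conn = pyodbc.connect(
        "DRIVER={ODBC Driver 17 for SQL Server};"
        "SERVER=myworkspace.sql.azuresynapse.net;"
        "DATABASE=master;UID=sqladminuser;PWD=<password>",
        autocommit=True,  # ALTER DATABASE cannot run inside an explicit transaction
    )
    conn.cursor().execute("ALTER DATABASE [mydedicatedpool] SET READ_COMMITTED_SNAPSHOT ON")
    conn.close()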
You should bear in mind that the default transaction isolation level for dedicated SQL pools is READ UNCOMMITTED, which does allow dirty reads. The way I think about it is: ACID (Atomic, Consistent, Isolated, Durable) is a standard, and each provider implements the standard to different degrees through transaction isolation levels. Each isolation level can meet ACID strongly or weakly. Here is my summary for READ UNCOMMITTED:
A - you should reasonably expect your transaction to be atomic, but you should (IMHO) write your code to be re-runnable
C - you should reasonably expect your transaction to be consistent, but bear in mind dedicated SQL pools do not support foreign keys and the NOT ENFORCED keyword is applied to unique indexes on creation.
I - READ UNCOMMITTED does not meet the 'I' (Isolated) criterion of ACID, allowing dirty reads (uncommitted data), but the gain is concurrency. You can change the default to READ COMMITTED SNAPSHOT ISOLATION as described above, but you would need a good reason to do so and should test your application extensively, as there will be impacts on behaviour, performance, concurrency, etc.
D - you should reasonably expect your transaction to be durable
So the answer to your question is: depending on your transaction isolation level (bearing in mind the default is READ UNCOMMITTED in a dedicated SQL pool), each transaction meets ACID to a degree; most notably, Isolation (I) is not fully met. You have the opportunity to change this by altering the default isolation level, at the cost of reduced concurrency and a now-obligatory regression test. I think you are most interested in Atomicity, and my advice there is: make sure your code is re-runnable anyway (a sketch follows at the end of this answer).
You tend to see the 'higher' transaction isolation levels (e.g. SERIALIZABLE) in OLTP systems rather than MPP systems like Synapse, the cost being concurrency. You want your bank withdrawal to work, right?
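To make the "re-runnable" advice concrete, here is a minimal sketch. It assumes the Databricks Synapse connector's preActions option and a hypothetical load_date column on a hypothetical dbo.FactSales table; the JDBC URL and staging directory are placeholders. Re-running the same day then simply deletes and re-appends that slice:

    # Hypothetical connection placeholders
    jdbc_url = "jdbc:sqlserver://myworkspace.sql.azuresynapse.net:1433;database=mypool"
    temp_dir = "abfss://tempdata@mystorageacct.dfs.core.windows.net/synapse-staging"
    load_date = "2021-06-01"  # the slice being (re)loaded

    (df.write
       .format("com.databricks.spark.sqldw")
       .option("url", jdbc_url)
       .option("tempDir", temp_dir)
       .option("forwardSparkAzureStorageCredentials", "true")
       .option("dbTable", "dbo.FactSales")
       # delete the slice first so a re-run cannot create duplicates
       .option("preActions", f"DELETE FROM dbo.FactSales WHERE load_date = '{load_date}'")
       .mode("append")
       .save())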

It has guaranteed ACID transaction behavior. Refer to What is Delta Lake, which states:
Azure Synapse Analytics is compatible with Linux Foundation Delta Lake. Delta Lake is an open-source storage layer that brings ACID (atomicity, consistency, isolation, and durability) transactions to Apache Spark and big data workloads. This is fully managed using Apache Spark APIs available in Azure Synapse.

Related

REST API to query Databricks table

I have a use case and need help with the best available approach.
I use Azure Databricks to run data transformations and create tables in the presentation/gold layer. The underlying data for these tables is in an Azure Storage account.
The transformation logic runs twice daily and updates the gold layer tables.
I have several such tables in the gold layer, e.g. a table to store single customer view data.
An external application from a different system needs access to this data, i.e. the application would initiate an API call for details regarding a customer, and we need to send back a response with the matching customer details by querying the single customer view table.
Question:
Is the Databricks SQL API the solution for this?
As it is a Spark table, I assume the response will not be quick. Is this correct, or is there a better solution for this?
Is Databricks designed for such use cases, or is it a better approach to copy this table (gold layer) into an operational database such as Azure SQL DB after the transformations are done in PySpark via Databricks?
What are the cons of this approach? One would be that the Databricks cluster should be up and running at all times, i.e. use an interactive cluster. Anything else?
It's possible to use Databricks for that, although it heavily depends on the SLAs - how fast the response should be. Answering your questions in order:
There is no standalone API for executing queries and getting back results (yet). But you can create a thin wrapper using one of the drivers to work with Databricks: Python, Node.js, Go, or JDBC/ODBC (see the sketch at the end of this answer).
Response time depends heavily on the size of the data, whether the data is already cached on the nodes, and other factors (partitioning of the data, data skipping, etc.). Databricks SQL warehouses are also able to cache query results, so they won't reprocess the data if the same query was already executed.
Storing data in an operational database is also an approach that is often used by customers. But it depends heavily on the size of the data and other factors - if you have a huge gold layer, then a SQL database may also not be the best solution from a cost/performance perspective.
For such queries it's recommended to use Databricks SQL, which is more cost efficient than an always-running interactive cluster. Also, on some cloud platforms there is already support for serverless Databricks SQL, where the startup time is very short (seconds instead of minutes), so if your queries against the gold layer don't happen very often, you may have the warehouses configured with auto-termination and pay only when they are used.
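As a hedged illustration of the first point, a thin wrapper around the Python driver (the databricks-sql-connector package); the hostname, HTTP path, token and table name are placeholders, and real code should use the connector's parameter binding rather than string formatting:

    from databricks import sql

    def get_customer(customer_id: str):
        with sql.connect(server_hostname="adb-1234567890123456.7.azuredatabricks.net",
                         http_path="/sql/1.0/warehouses/abcdef1234567890",
                         access_token="<personal-access-token>") as connection:
            with connection.cursor() as cursor:
                # Illustrative query against the gold-layer single customer view
                cursor.execute(
                    "SELECT * FROM gold.single_customer_view "
                    f"WHERE customer_id = '{customer_id}'")
                return cursor.fetchall()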

Does Cosmos DB store data differently depending on which API you use?

What are the ramifications of using each API?
For example, if I am using the SQL API, am I sacrificing ACID, and which part of CAP am I getting? How does Azure achieve horizontal scalability?
If I am using the document API or key-value API, is the internal data layout different?
The internal storage format for the various database APIs in Cosmos DB doesn't have any bearing on ACID or CAP. The API you choose should be driven by the appropriate use case and/or your familiarity with it. For instance, both the SQL API and the MongoDB API are document databases. But if you have experience using Mongo, then it's probably a better choice rather than, say, Gremlin which is a graph database, Table which is a key/value store, or Cassandra which is a wide-column store.
Cosmos DB provides ACID support for data within the same logical partition. With regard to CAP, that depends on which consistency level/model you choose. Strong consistency trades "A" availability for "C" consistency. All the other consistency models trade consistency for availability, but to varying degrees. For instance, bounded staleness defines an upper bound that data can lag in consistency, by time or number of updates. As that boundary approaches, Cosmos will throttle new writes to allow the replication queue to catch up, ensuring the consistency guarantees are met.
Achieving horizontal scalability is a function of your partitioning strategy. For write-heavy workloads, the objective should be choosing a partition key that allows writes to be distributed across a wide range of values. For read-heavy workloads, you want queries to be served by one or a bounded number of partitions. If it's both, then you can use change feed to copy data into two or more containers as needed, such that the cost of copying the data is cheaper than serving it from a container where queries would be cross-partition. At the end of the day, choosing a good partition key requires testing under production levels of load and data.
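A sketch of the partitioning point using the azure-cosmos Python SDK; the account URL, key, database/container names and the /customerId key are all placeholders:

    from azure.cosmos import CosmosClient, PartitionKey

    client = CosmosClient("https://myaccount.documents.azure.com:443/", credential="<account-key>")
    db = client.create_database_if_not_exists("sales")

    # Write-heavy: pick a key with many distinct values so writes spread across partitions
    orders = db.create_container_if_not_exists(
        id="orders",
        partition_key=PartitionKey(path="/customerId"),
    )

    # Read-heavy: include the partition key so the query is served by a single partition
    for item in orders.query_items(
            query="SELECT * FROM c WHERE c.customerId = @id",
            parameters=[{"name": "@id", "value": "C123"}],
            partition_key="C123"):
        print(item)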

Mapping Dataflow vs SQL Stored Procedure in ADF pipeline

I have a requirement where I need to choose between Mapping Data Flows and SQL stored procedures in an ADF pipeline to implement some business scenarios. The data volume is not too huge now but might get larger at a later stage.
The business logic is at times complex, where I will have to join multiple tables, write subqueries, use window functions, nested CASE statements, etc.
All of my business requirements could easily be implemented through an SP, but there is a slight inclination towards Mapping Data Flows, considering that they run Spark underneath and can scale up as required.
Do ADF Mapping Data Flows have an upper hand over SQL stored procedures when used in an ADF pipeline?
Some of the concerns I have with Mapping Data Flows are below.
The time taken to implement complex logic using data flows is much more than with a stored procedure.
The execution time for a mapping data flow is much higher, considering the time it takes to spin up the Spark cluster.
Now, if I decide to use SQL SPs in the pipeline, what could be the disadvantages?
Would there be issues with scalability if the data volume grows rapidly at some point in time?
This is kind of an opinion question, which doesn't tend to do well on Stack Overflow, but the fact that you're comparing Mapping Data Flows with stored procs tells me that you have Azure SQL Database (or similar) and Azure Data Factory (ADF) in your architecture.
If you think about the fact that Mapping Data Flows is backed by Spark clusters, and you already have Azure SQL DB, then what you really have is two types of compute. So why have both? There's nothing better than SQL at doing joins, nested queries, etc. Azure SQL DB can easily be scaled up and down (e.g. via its REST API, or T-SQL as sketched below) - that seemed to be one of your points.
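For example, a hedged sketch of scaling the database around an ELT window with plain T-SQL rather than the REST call (server, credentials, database name and tier below are placeholders; the change is applied asynchronously):

    import pyodbc

    conn = pyodbc.connect(
        "DRIVER={ODBC Driver 17 for SQL Server};"
        "SERVER=myserver.database.windows.net;DATABASE=master;"
        "UID=sqladmin;PWD=<password>",
        autocommit=True,
    )
    # Scale up before the heavy ELT window; scale back down afterwards
    conn.cursor().execute("ALTER DATABASE [mydb] MODIFY (SERVICE_OBJECTIVE = 'S6')")
    conn.close()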
Having said that, Mapping Data Flows is powerful and offers a nice low-code experience. So if your requirement is to have low-code with powerful transforms then it could be a good choice. Just bear in mind that if your data is already in a database and you're using Mapping Data Flows, what you're doing is taking data out of SQL, up into a Spark cluster, processing it, then pushing it back down. This seems like duplication to me, and I reserve Mapping Data Flows (and Databricks notebooks) for things I cannot already do in SQL; e.g. advanced analytics, hard maths, or complex string manipulation might be good candidates. Another use case might be work offloading, where you deliberately want to offload work from your db. Just remember the cost implication of having two types of compute running at the same time.
I also saw an example recently where someone had implemented a slowly changing dimension type 2 (SCD2) using Mapping Data Flows but had used 20+ different MDF components to do it. That is low-code in name only to me: high complexity, hard to maintain and debug. The same process can be done with a single MERGE statement in SQL.
So my personal view is, use Mapping Data Flows for things that you can't already do with SQL, particularly when you already have SQL databases in your architecture. I personally prefer an ELT pattern, using ADF for orchestration (not MDF) which I regard as easier to maintain.
Some other questions you might ask are:
what skills do your team have? SQL is a fairly common skill. MDF is still low-code but niche.
what skills do your support team have? Are you going to train them on MDF when you hand this over?
how would you rate the complexity and maintainability of the two approaches, given the above?
HTH
One disadvantage of using SPs in your pipeline is that your SP will run directly against the database server. So if you have any other queries/transactions or jobs running against the DB at the same time that your SP is executing, you may experience longer run times for each (depending on query complexity, records read, etc.). This issue could compound as data volume grows.
We have decided to use SPs in our organization instead of Mapping Data Flows. The cluster spin-up time was an issue for us as we scaled up. To address the issue I mentioned previously with SPs, we stagger our workload and schedule jobs to run during off-peak hours.

Pagination in QLDB

I noticed QLDB does not support LIMIT or SKIP query parameters required to implement basic pagination.
Is this going to be supported in the future or is there some other way to implement pagination in QLDB?
LIMIT/SKIP is not currently supported. QLDB is purpose built for data ingestion. We recommend doing reporting and analytics in another purpose built database.
Let's consider a banking application with 2 use-cases:
Moving money between accounts
Providing monthly statements
The first is a very good fit for QLDB, where indexes are used to read balances and then a few documents are updated or created. Under OCC, QLDB makes it easy to write these transactions correctly, and performance should be very good. For example, if an account has $50 remaining and two competing transactions try to deduct $50, only one will succeed (the other will fail to commit). Meanwhile, other transactions will continue to succeed. Beyond being simple and performant, you also get integrity via the QLDB hash chain and proof system.
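A sketch of that withdrawal transaction using the pyqldb driver; the ledger, table and field names are illustrative, not part of the original example:

    from pyqldb.driver.qldb_driver import QldbDriver

    driver = QldbDriver(ledger_name="bank")

    def withdraw(account_id, amount):
        def txn(executor):
            rows = executor.execute_statement(
                "SELECT balance FROM Accounts WHERE accountId = ?", account_id)
            balance = next(iter(rows))["balance"]
            if balance < amount:
                raise ValueError("insufficient funds")
            executor.execute_statement(
                "UPDATE Accounts SET balance = ? WHERE accountId = ?",
                balance - amount, account_id)
        # The driver retries the whole function on OCC conflicts, so of two
        # competing $50 withdrawals against a $50 balance only one can commit
        driver.execute_lambda(txn)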
The second is not a good fit. To compute a statement, we would need to lookup transactions for an account. But, what happens if that account changes (maybe somebody just sent you some money!) while we're doing the lookup? Again, under OCC, we will fail the transaction and the statement generation will need to retry. For a small bank, that's probably fine, but I think you can see where this is going. QLDB is purpose built for data ingestion, and the further you stray from what it was built for, the poorer the performance will be.
This raises the question of how to actually do these queries in another database. You can use the S3 export or Kinesis data streaming features to get data out. S3 exports are better suited to bulk operations (which many analytic databases prefer, e.g. Redshift), while streams are better for real-time analytics (e.g. using Elasticsearch).
Conversely, I would not recommend using Redshift or Elasticsearch for the first use case, as you will not get the performance, integrity or durability that databases designed for OLTP use cases offer (e.g. QLDB, DynamoDB, Aurora).

multi-master in cosmosdb/documentdb

How can I set up multiple write regions in Cosmos DB so that I do not need to combine query results from two or more different regions in my application layer? From this documentation, it seems like Cosmos DB global distribution is global replication with one writer and multiple read secondaries, not true multi-master. https://learn.microsoft.com/en-us/azure/documentdb/documentdb-multi-region-writers
As of May 2018, Cosmos DB now supports multi-master natively using a combination of CRDT data types and automatic conflict resolution.
Multi-master in Azure Cosmos DB provides high levels of availability (99.999%), single-digit millisecond latency to write data and scalability with built-in comprehensive and flexible conflict resolution support.
Multi-master is composed of multiple master regions that equally participate in a write-anywhere model (active-active pattern) and it is used to ensure that data is available at any time where you need it. Updates made to an individual region are asynchronously propagated to all other regions (which in turn are master regions in their own right). Azure Cosmos DB regions operating as master regions in a multi-master configuration automatically work to converge the data of all replicas and ensure global consistency and data integrity.
Azure Cosmos DB implements the logic for handling conflicting writes inside the database engine itself. Azure Cosmos DB offers comprehensive and flexible conflict resolution support by offering several conflict resolution models, including Automatic (CRDT - conflict-free replicated data types), Last Write Wins (LWW), and Custom (stored procedure) for automatic conflict resolution. The conflict resolution models provide correctness and consistency guarantees and remove the burden from developers to have to think about consistency, availability, performance, replication latency, and complex combinations of events under geo-failovers and cross-region write conflicts.
More details here: https://learn.microsoft.com/en-us/azure/cosmos-db/multi-region-writers
It's currently in preview and might require approval before you can use it.
According to your supplied link, and based on my understanding, multi-master in Cosmos DB/DocumentDB is implemented by having multiple DocumentDB write regions separately and reading the documents from a combined query. Currently it does not seem to be supported to set up multiple write regions in Cosmos DB in a way that avoids combining query results from two or more different regions.
The referenced article describes how to implement multi-master in Cosmos DB, while explicitly stating that it is not a multi-master database.
There are ways to "simulate" multi-master scenarios by configuring the consistency level (e.g. Session), which will allow callers to see their local copy without it having been written to the write region. You can find the details of the various levels here: https://learn.microsoft.com/en-us/azure/cosmos-db/consistency-levels.
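A hedged sketch of that configuration with the azure-cosmos Python SDK: a client can request a weaker consistency level than the account default (never a stronger one); the account URL and key are placeholders.

    from azure.cosmos import CosmosClient

    client = CosmosClient(
        "https://myaccount.documents.azure.com:443/",
        credential="<account-key>",
        consistency_level="Session",  # relax the account-level default for this client
    )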
Aside from that, consider if you truly need multi-master by working with the consistency levels, considering what acceptable latency is, etc. There are few scenarios that can't tolerate latency, particularly when you have adequate tools to provide a user experience that approximates a local write master. There is no such thing as real-time when remote networks are involved ;)
