We are migrating from Oracle to VoltDB, should we put all the business logic (the migrated stored procedures) inside the database? Is that the best practice for maximum performance?
I work at VoltDB. There isn't one right answer to your question, it would depend on the particular schema and procedures, but I can explain a little about the stored procedures in VoltDB and differences with Oracle.
First, VoltDB is not a general-purpose database like Oracle, but was specially-designed to provide high performance and scalability for OLTP and "Fast Data" workloads. Typically these workloads involve discrete transactions on small set of records, but which come at rates of thousands to millions per second. The use cases range from providing real-time analytics on fast-moving data sets, to transforming and enriching streaming data, to providing low-latency responses that often involve a data-driven decision to high scale interactive applications.
The procedures in VoltDB are typically focused on applying atomic changes to small sets of records, and they often are used to make event-driven changes in real-time vs. running batch processes on data in bulk as you often see in Oracle. VoltDB automatically generates CRUD-style procedures for each table in the schema now including UPSERT. You can declare single-SQL-statement procedures in the DDL. Procedures that include multiple SQL statements and control flow logic are written as simple java classes that run on the database. VoltDB also supports Ad-Hoc SQL statements (ANSI SQL-92 compatible) sent directly from a client, using either a native language client library or the JDBC or ODBC drivers, or over the embedded HTTP-JSON interface.
If the stored procedures in Oracle are for OLTP operations then they may translate somewhat directly into VoltDB procedures. If they are performing batch operations on data in bulk, then often these processes may be redesigned to be event-driven real-time processes that would produce the same result incrementally. If they still have to be done as a long-running batch, typically they would be broken into separate more discrete procedures driven by a client process.
Related
I am trying to write a dataframe in either append/overwrite mode into a Synapse table using ("com.databricks.spark.sqldw") connector .The official docs doesn't mention much about ACID properties of this write operation. My question is that , if the write operation fails in the middle of the write, would the actions preformed previously be rolled back?
One thing that the docs does mention is that there are two classes of exception that could be thrown during this operation: SqlDWConnectorException and SqlDWSideException .My logic is that if the write operation is ACID compliant,then we do not do anything,but if not,then we plan to encapsulate this operation in a try-catch block and look for other options(maybe retry,or timeout).
As a good practice you should write your code to be re-runnable, eg delete potentially duplicate records. Imagine you are re-running a file for a failed day or someone want to reprocess a certain period. However SQL pools does implement ACID through transaction isolation levels:
Use transactions in a SQL pool in Azure Synapse
SQL pool implements ACID transactions. The isolation level of the
transactional support is default to READ UNCOMMITTED. You can change
it to READ COMMITTED SNAPSHOT ISOLATION by turning ON the
READ_COMMITTED_SNAPSHOT database option for a user SQL pool when
connected to the master database.
You should bear in mind that the default transaction isolation level for dedicated SQL pools is READ UNCOMMITTED which does allow dirty reads. So the way I think about it is, ACID (Atomic, Consistent, Isolated, Durable) is a standard and each provider implements the standard to different degrees through transaction isolation levels. Each transaction isolation level can be strongly meeting ACID or weakly meeting ACID. Here is my summary for READ UNCOMMITTED:
A - you should reasonably expect your transaction to be atomic but you should (IMHO) write your code to be re-runnable
C - you should reasonably expect your transaction to be consistent but bear in mind dedicated SQL pools does not support foreign keys and the NOT ENFORCED keyword is applied to unique indexes on creation.
I - READ UNCOMMITED does not meet 'I' Isolated criteria of ACID, allowing dirty reads (uncommitted data) but the gain is concurrency. You can change the default to READ COMMITTED SNAPSHOT ISOLATION as described above, but you would need a good reason to do so and conduct extensive tests on your application as the impacts on behaviour, performance, concurrency etc
D - you should reasonably expect your transaction to be durable
So the answer to your question is, depending on your transaction isolation level (bearing in mind the default is READ UNCOMMITTED in a dedicated SQL pool), each transaction meets ACID to a degree, most notably Isolation (I) is not fully met. You have the opportunity to change this by altering the default transaction
at the cost of reducing concurrency and the now obligatory regression test. I think you are most interested in Atomicity and my advice is there, make sure your code is re-runnable anyway.
You tend to see the 'higher' transaction isolation levels (READ SERIALIZABLE) in more OLTP systems rather than MPP systems like Synapse, the cost being concurrency. You want your bank withdrawal to work right?
It has guaranteed ACID transaction behavior.
Refer: What is Delta Lake, where it states:
Azure Synapse Analytics is compatible with Linux Foundation Delta
Lake. Delta Lake is an open-source storage layer that brings ACID
(atomicity, consistency, isolation, and durability) transactions to
Apache Spark and big data workloads. This is fully managed using
Apache Spark APIs available in Azure Synapse.
I have a requirement where I need to choose between Mapping Data Flow vs SQL Stored Procedures in an ADF pipeline to implement some business scenarios. The data volume is not too huge now but might get larger at a later stage.
The business logic are at times complex where I will have to join multiple tables, write sub queries, use windows functions, nested case statements, etc.
All of my business requirements could be easily implemented through a SP but there is a slight inclination towards mapping data flow considering that it runs spark underneath and can scale up as required.
Does ADF Mapping data flow has an upper hand over SQL Stored Procedures when used in an ADF pipeline?
Some of the concerns that I have with the mapping data flow are as below.
Time taken to implement complex logic using data flows is much more
than a stored procedure
The execution time for a mapping data flow is
much higher considering the time it takes to spin up the spark
cluster.
Now, if I decide to use SQL SPs in the pipeline, what could be the disadvantages?
Would there be issues with the scalability if the data volume grows rapidly at some point in time?
This is kind of an opinion question which doesn't tend to do well on stackoverflow, but the fact you're comparing Mapping Data Flows with stored procs tells me that you have Azure SQL Database (or similar) and Azure Data Factory (ADF) in your architecture.
If you think about the fact Mapping Data Flows is backed by Spark clusters, and you already have Azure SQL DB, then what you really have is two types of compute. So why have both? There's nothing better than SQL at doing joins, nested queries etc. Azure SQL DB can easily be scaled up and down (eg via its REST API) - that seemed to be one of your points.
Having said that, Mapping Data Flows is powerful and offers a nice low-code experience. So if your requirement is to have low-code with powerful transforms then it could be a good choice. Just bear in mind that if your data is already in a database and you're using Mapping Data Flows, that what you're doing is taking data out of SQL, up into a Spark cluster, processing it, then pushing it back down. This seems like duplication to me, and I reserve Mapping Data Flows (and Databricks notebooks) for things I cannot already do in SQL, eg advanced analytics, hard maths, complex string manipulation might be good candidates. Another use case might be work offloading, where you deliberately want to offload work from your db. Just remember the cost implication of having two types of compute running at the same time.
I also saw an example recently where someone had implemented a slowly changing dimension type 2 (SCD2) using Mapping Data Flows but had used 20+ different MDF components to do it. This is low-code in name only to me, with high complexity, hard to maintain and debug. The same process can be done with a single MERGE statement in SQL.
So my personal view is, use Mapping Data Flows for things that you can't already do with SQL, particularly when you already have SQL databases in your architecture. I personally prefer an ELT pattern, using ADF for orchestration (not MDF) which I regard as easier to maintain.
Some other questions you might ask are:
what skills do your team have? SQL is a fairly common skill. MDF is still low-code but niche.
what skills do your support team have? Are you going to train them on MDF when you hand this over?
how would you rate the complexity and maintainability of the two approaches, given the above?
HTH
One disadvantage with using SP's in your pipeline, is that your SP will run directly against the database server. So if you have any other queries/transactions or jobs running against the DB at the same time that your SP is executing you may experience longer run times for each (depending on query complexity, records read, etc.). This issue could compound as data volume grows.
We have decided to use SP's in our organization instead of Mapping Data Flows. The cluster spin up time was an issue for us as we scaled up. To address the issue I mentioned previously with SP's, we stagger our workload, and schedule jobs to run during off-peak hours.
Is the SPARK SQL family of API's for writing to a database like this:
DF.write.mode("append").jdbc(url, table, prop)
able to work at scale?
Or is there a time that sqoop should then be used?
In general writing over JDBC will be typically limited by the capabilities of the destination system. In general JDBC connectors are not designed for batch data migrations, and majority of vendors, have their own, platform specific bulk insert tools.
Specific writing mode like append has little or no impact at all.
And as always - if you have questions about performance implications of a specific choice it's best to benchmark it yourself on the platform you use, data that reflects properties of the real input and using resources comparable to the ones, you have at your disposal in production.
I'm looking at both projects and I can't really see the difference
from Cassandra Site:
Cassandra is a highly scalable, eventually consistent, distributed, structured key-value store...Cassandra is eventually consistent. Like BigTable, Cassandra provides a ColumnFamily-based data model richer than typical key/value systems.
from CouchDB Site:
Apache CouchDB is a distributed, fault-tolerant and schema-free document-oriented database accessible via a RESTful HTTP/JSON API.
That said, I see the specific differences between each project as: access methods, written languages, etc. but to put AN EXAMPLE, when you talk about SOLR or Sphinx you know both are indexers with big differences but at the end are indexers.
Can I say here that Cassandra and CouchDB are non-relational databases that in some cases one can replace the other?
CouchDB is a document store. You put documents (JSON objects) in it and define views (indexes) over them. The objects can be arbitrarily complex with potentially deep structure. Further, they are not constrained to following some consistent schema.
Cassandra is a ragged-table key-value store. It just stores rows, each of which has a set of named columns grouped in to families with values. It sounds quite close to BigTable; BigTable doesn't require each row to have the same structure (unlike an SQL database). The values may have some structure, but this kind of store doesn't know anything about that -- they're just strings/byte sequences.
Yes, they are both non-relational databases, and there is probably a fair amount of overlap in their applicability, but they do have distinctly different data organization models. Each can probably be forced into emulating the other, but each model will map best to a different set of problems.
CouchDB has a feature present in very few open source database technologies: offline replication. CouchDB is designed so that applications can be run at the edge of the network. These applications are available even when internet connectivity fails.
Offline replication can also be leveraged to build large clusters, but CouchDB is designed to be robust and simple whether it is running on a single server, a datacenter, or even a smartphone.
I'm interested in people's thoughts comparing storing data in a traditional SQL based Database or utilising a Memory-Mapped File such as the one in the new .Net 4.0 runtime. The data in question would be arrays of simple structures.
Obvious pros and cons:
SQL Database Pros
Adhoc query support
SQL Management Tools
Schema changes (adding more columns and setting default values)
Memory-Mapped Pros
Lighter overhead? (this is an assumption on my part)
Shareable between process threads
Any others?
Is it worth it for performance gains?
You could try out MongoDB, and get a mixture of both worlds (database-like features over a memory mapped store).
MongoDB bridges the gap between
key-value stores (which are fast and
highly scalable) and traditional RDBMS
systems (which provide rich queries
and deep functionality).
Here's a good article that can walk you through installing and coding to MongoDB:
Going NoSQL with MongoDB
SQLServer can use memory mapped files if you choose "SharedMemory" as the protocol. Otherwise it'll use Pipes, TCP or VIA.
Regarding pros and cons.. to me they are amost not comparable. SQL has the whole query/multiuser/transaction etc infrastructure built in. If you store with MMF's you are on your own regarding all that. On the other hand, MMF are built in the OS.. no seed for a server/service.