Confusion About Delta Lake - delta-lake

I have tried to read a lot about Databricks Delta Lake. From what I understand, it adds ACID transactions to your data storage and accelerates query performance with the Delta engine. If so, why do we need other data lakes which do not support ACID transactions? Delta Lake claims to combine the best of data lakes and data warehouses. We know it cannot yet replace a traditional data warehouse due to its current support for operations, but should it replace data lakes? And why the need to have two copies of data - one in a data lake and one in Delta Lake?

Delta Lake is one implementation of the lakehouse approach; other examples include Apache Hudi and Apache Iceberg.
A lakehouse is a layer that manages the data lake efficiently and supports ACID transactions and advanced features such as data versioning.
The question should really be: "Is there any benefit to using a pure data lake over a lakehouse?"
I guess the main advantage of a pure data lake is that it works out of the box (OOTB) and is therefore cheaper and less complex than a lakehouse, which gives you advantages you don't always need.
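To make the "two copies" concern concrete: Delta Lake is a storage layer over the files you already have, so you don't necessarily need a second copy. Below is a rough sketch (PySpark with the delta-spark package; the S3 path is hypothetical) of converting an existing unpartitioned Parquet directory to a Delta table in place - only a transaction log is added next to the data files.

```python
# Rough sketch: turning existing Parquet data into a Delta table in place.
# Assumes the delta-spark package; the S3 path is hypothetical.
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("delta-on-existing-lake")
    # Register Delta's SQL extension and catalog so the session understands Delta tables.
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

# Convert an existing (unpartitioned) Parquet directory to Delta in place:
# this only writes a _delta_log/ next to the Parquet files, it does not copy data.
DeltaTable.convertToDelta(spark, "parquet.`s3a://my-lake/events`")

# The very same files are now readable as a Delta table with ACID guarantees.
events = spark.read.format("delta").load("s3a://my-lake/events")
events.show(5)
```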

Delta Lake is a product (like Redshift) rather than a concept/approach/theory (like dimensional modelling).
As with any product in any walk of life, some of the claims made for the product will be true and some will be marketing spin. Whether the claimed benefits for a product actually make it superior to an alternative product will change from use case to use case.
Asking why there are other data lake solutions besides Delta Lake is a bit like asking why there is more than one DBMS in the world.

In my own case there was already a data store in place (Sybase IQ), but its performance is poor compared to the queries I can run through Spark against Delta. Speed is an important factor, and on partitioned tables the difference is remarkable.

Delta Lake is an open standard, not a product. The ACID transactions are about writes that fail midway: a transaction is a safety mechanism that keeps readers from ever seeing a half-written result. Core support lives in Spark, but other tools have added support for Delta Lake as well. There is also the lakehouse design, which again isn't a product but a way of approaching how you build a data lake; if you follow the principles you can use any technology.
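As an illustration that Delta Lake is just an open library you add to plain Apache Spark rather than a vendor product, here is a minimal sketch assuming the delta-spark PyPI package is installed; the paths are made up.

```python
# Minimal sketch of Delta Lake as a library on open-source Spark
# (pip install delta-spark); paths are made up.
from delta import configure_spark_with_delta_pip
from pyspark.sql import SparkSession

builder = (
    SparkSession.builder
    .appName("open-source-delta")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
)
# Pulls the matching Delta jars into the session - no vendor product involved.
spark = configure_spark_with_delta_pip(builder).getOrCreate()

df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])

# Each write is an atomic commit: readers see either the old snapshot or the
# new one, never a half-written directory.
df.write.format("delta").mode("overwrite").save("/tmp/demo_table")

# Data versioning: read the table as of an earlier commit.
v0 = spark.read.format("delta").option("versionAsOf", 0).load("/tmp/demo_table")
v0.show()
```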

Related

HBase to Delta Tables

We are in the process of migrating a Hadoop workload to Azure Databricks. In the existing Hadoop ecosystem we have some HBase tables which contain some data (not big). Since Azure Databricks does not support HBase, we are planning to replace the HBase tables with Delta tables.
Is this technically feasible? If yes, are there any challenges or issues we might face during the migration or in the target system?
It all comes down to the access patterns. HBase is an OLTP system where you usually operate on individual records (read/insert/update/delete) and expect sub-second (or millisecond) response times. Delta Lake, on the other hand, is an OLAP system designed for efficient processing of many records together, but it can be slower when you read individual records, and especially when you update or delete them.
If your application needs sub-second queries, especially with updates, then it makes sense to set up a test to check whether Delta Lake is the right choice - you may want to look into Databricks SQL, which applies a lot of optimizations for fast data access.
If that doesn't fulfill your requirements, you may look at other products in the Azure ecosystem, such as Azure Cache for Redis or Azure Cosmos DB, which are designed for OLTP-style data processing.
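If you do run such a test, the HBase-style operations map onto Delta roughly as in this hedged sketch (PySpark + delta-spark; the table path and columns are hypothetical) - batch upserts via MERGE, and point reads via a filter that will usually be far slower than an HBase get.

```python
# Hedged sketch of HBase-style access patterns on a Delta table.
# Assumes a Delta-enabled PySpark session; path and columns are hypothetical.
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # assumed to be Delta-enabled

target = DeltaTable.forPath(spark, "/mnt/lake/customer_profile")

# "Put"-style batch upsert keyed on row_key.
updates = spark.createDataFrame([("k1", "alice", 42)], ["row_key", "name", "score"])
(
    target.alias("t")
    .merge(updates.alias("u"), "t.row_key = u.row_key")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute()
)

# "Get"-style point lookup: scans data files unless they can be pruned, which is
# why latency is typically far above HBase's millisecond range.
row = (
    spark.read.format("delta")
    .load("/mnt/lake/customer_profile")
    .where("row_key = 'k1'")
    .collect()
)
```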

Can multiple data pipeline merge data on the same delta table simultaneously without causing inconsistency?

I know ACID transactions are one of the important features of Delta Lake when performing reads and writes. Is this also true for the merge operation? What if two pipelines try to perform an update based on different conditions on the same record - can that cause any data inconsistency?
Well, it depends.
Delta Lake uses optimistic concurrency control to handle this. That means it will likely work IF you're writing to HDFS, since Delta needs the underlying store to support "compare-and-swap" operations, or some way for a writer to fail if two writers are trying to overwrite each other's log entries - and HDFS supports that.
On S3, this is not supported. From the Delta Lake documentation:
"Delta Lake has built-in support for S3. Delta Lake supports concurrent reads from multiple clusters, but concurrent writes to S3 must originate from a single Spark driver in order for Delta Lake to provide transactional guarantees. This is because S3 currently does not provide mutual exclusion, that is, there is no way to ensure that only one writer is able to create a file."
On the proprietary Delta Engine, Databricks does support multi-cluster writes to S3, using a proprietary service that handles those calls.
So to sum it up:
It should be possible if you're writing to HDFS.
On S3, it won't work unless you're using the paid (Databricks) version of Delta Lake.
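When concurrent writers are possible, the usual pattern is to retry: Delta's optimistic concurrency control aborts one of the conflicting commits with an exception, and the failed writer re-runs against the new snapshot. A hedged sketch, assuming a Delta-enabled PySpark session and a delta-spark release that exposes the delta.exceptions module (the exact exception class names may vary by version); the path and join column are hypothetical.

```python
# Hedged sketch: retrying a MERGE when two pipelines conflict.
# Assumes a Delta-enabled PySpark session and a delta-spark release that ships
# delta.exceptions; the path and join column are hypothetical.
import time

from delta.exceptions import DeltaConcurrentModificationException
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # assumed to be Delta-enabled


def merge_with_retry(updates_df, path, max_attempts=3):
    """Optimistic concurrency: Delta aborts one of the conflicting commits, so retry it."""
    for attempt in range(1, max_attempts + 1):
        try:
            target = DeltaTable.forPath(spark, path)  # re-read the latest snapshot
            (
                target.alias("t")
                .merge(updates_df.alias("u"), "t.id = u.id")
                .whenMatchedUpdateAll()
                .whenNotMatchedInsertAll()
                .execute()
            )
            return
        except DeltaConcurrentModificationException:
            if attempt == max_attempts:
                raise
            time.sleep(2 ** attempt)  # back off, then re-attempt against the new snapshot
```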

Delta Lake independent of Apache Spark?

I have been exploring the data lakehouse concept and Delta Lake. Some of its features seem really interesting. Right there on the project home page https://delta.io/ there is a diagram showing Delta Lake running on "your existing data lake" without any mention of Spark. Elsewhere it suggests that Delta Lake indeed runs on top of Spark. So my question is, can it be run independently of Spark? Can I, for example, set up Delta Lake with S3 buckets for storage in Parquet format, schema validation etc., without using Spark in my architecture?
You might keep an eye on this: https://github.com/delta-io/delta-rs
It's early and currently read-only, but worth watching as the project evolves.
tl;dr No
Delta Lake up to and including 0.8.0 is tightly integrated with Apache Spark, so it's impossible to have Delta Lake without Spark.
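For reference, this is roughly what Spark-free access looks like through the delta-rs Python bindings (pip install deltalake). It's a hedged sketch: the bucket path is hypothetical and the API surface may change as the project matures.

```python
# Hedged sketch: reading a Delta table without Spark via the delta-rs Python
# bindings (pip install deltalake). The path is hypothetical and the API may
# differ between versions.
from deltalake import DeltaTable

dt = DeltaTable("s3://my-bucket/events")

print(dt.version())  # current snapshot version from the transaction log
print(dt.files())    # Parquet files that make up that snapshot

# Materialize the snapshot as a pandas DataFrame - no Spark cluster involved.
df = dt.to_pandas()
print(df.head())
```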

What specific benefits can we get by using SparkSQL to access Hive tables compared to using JDBC to read tables from SQL server?

I just got this question while designing the storage part of a Hadoop-based platform. If we want data scientists to have access to tables that are already stored in a relational database (e.g. SQL Server on an Azure Virtual Machine), is there any particular benefit to importing the tables from SQL Server into HDFS (e.g. WASB) and creating Hive tables on top of them?
In other words, since Spark allows users to read data from other databases using JDBC, is there any performance improvement if we persist the tables from the database in an appropriate format (Avro, Parquet etc.) in HDFS and use Spark SQL to access them using HQL?
I am sorry if this question has been asked before; I have done some research but could not find a comparison between the two approaches.
I think there will be a big performance improvement, as the data is local (assuming Spark is running on the same Hadoop cluster where the data is stored in HDFS). With JDBC, if the processing is interactive, the user has to wait for the data to be loaded over JDBC from another machine (network latency and I/O throughput), whereas if that load is done upfront, the user (data scientist) can concentrate on the analysis straight away.
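A hedged sketch of the two access paths side by side (PySpark; the JDBC URL, credentials, and table names are hypothetical):

```python
# Hedged sketch of the two access paths; the JDBC URL, credentials, and table
# names are hypothetical, and the SQL Server JDBC driver is assumed to be on
# the classpath.
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Option 1: read straight from SQL Server over JDBC - every interactive query
# pays network latency plus the source database's read throughput.
jdbc_df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:sqlserver://myvm.example.com:1433;databaseName=sales")
    .option("dbtable", "dbo.orders")
    .option("user", "reader")
    .option("password", "***")
    .load()
)

# Option 2: land the data once in HDFS/WASB as Parquet behind a Hive table -
# later Spark SQL queries read local, columnar, splittable files.
spark.sql("CREATE DATABASE IF NOT EXISTS lake")
jdbc_df.write.mode("overwrite").format("parquet").saveAsTable("lake.orders")

spark.sql("SELECT COUNT(*) FROM lake.orders").show()
```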

What are the fundamental architectural, SQL compliance, and data use scenario differences between Presto and Impala?

Can some experts give some succinct answers to the differences between Presto and Impala from these perspectives?
Fundamental architecture design
SQL compliance
Real-world latency
Any SPOF or fault-tolerance functionality
Structured and unstructured data use scenario performance
Apache Impala is a query engine for HDFS/Hive systems only.
PrestoDB, as well as the community version Trino, is on the other hand a generic query engine that supports HDFS as just one of many choices. There is a long list of connectors available; Hive/HDFS support is just one of them. This also means that you can query different data sources in the same system, at the same time.
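For example, a single Presto/Trino query can join a Hive-backed table with a table living in a relational database. A hedged sketch using the trino Python client (pip install trino); the coordinator host, catalogs, and table names are all hypothetical.

```python
# Hedged sketch of a federated Presto/Trino query joining two catalogs.
# Uses the trino Python client (pip install trino); host, catalogs, and table
# names are all hypothetical.
import trino

conn = trino.dbapi.connect(
    host="trino-coordinator.example.com",
    port=8080,
    user="analyst",
    catalog="hive",   # default catalog; others can still be referenced by name
    schema="web",
)
cur = conn.cursor()
cur.execute("""
    SELECT c.country, count(*) AS page_views
    FROM hive.web.page_views pv    -- files on HDFS/S3 via the Hive connector
    JOIN mysql.crm.customers c     -- rows in MySQL via the MySQL connector
      ON pv.customer_id = c.id
    GROUP BY c.country
""")
for row in cur.fetchall():
    print(row)
```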
