I have been exploring the data lakehouse concept and Delta Lake. Some of its features seem really interesting. Right there on the project home page https://delta.io/ there is a diagram showing Delta Lake running on "your existing data lake" without any mention of Spark. Elsewhere it suggests that Delta Lake indeed runs on top of Spark. So my question is: can it be run independently of Spark? Can I, for example, set up Delta Lake with S3 buckets for storage in Parquet format, schema validation, etc., without using Spark in my architecture?
You might keep an eye on this: https://github.com/delta-io/delta-rs
It's early and currently read-only, but worth watching as the project evolves.
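For illustration, here is a minimal sketch of what read-only access without Spark looks like through the delta-rs Python bindings (pip install deltalake); the S3 path is a placeholder and AWS credentials are assumed to be available in the environment:

from deltalake import DeltaTable

dt = DeltaTable("s3://my-bucket/path/to/table")   # hypothetical table location
print(dt.version())                    # current version from the transaction log
print(dt.files())                      # Parquet data files that make up that version
df = dt.to_pyarrow_table().to_pandas() # materialize the data locally, no Spark involved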
tl;dr No
Delta Lake, up to and including 0.8.0, is tightly integrated with Apache Spark, so it's impossible to use Delta Lake without Spark.
I recently started working with Spark and was eager to know: if I have to run queries, which would be better, Spark SQL or Databricks SQL, and why?
We need to distinguish two things here:
Spark SQL as a dialect of the SQL language. It originally started as the Shark & Hive on Spark projects (blog), and it is now moving closer to ANSI SQL.
Spark SQL as the execution engine inside Spark.
As was mentioned in this answer, Databricks SQL as a language is primarily based on Spark SQL, with some additions specific to Delta Lake tables (like CREATE TABLE CLONE, ...). ANSI compatibility in Databricks SQL is controlled with the ANSI_MODE setting, and will be enabled by default in the future.
But when it comes to execution, Databricks SQL differs from the Spark SQL engine because it uses the Photon engine, which is heavily optimized for modern hardware and BI/DW workloads. With Photon you can get a significant speedup (2-3x) compared to the standard Spark SQL engine on complex queries that process a lot of data.
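As a small illustration of the dialect side (not Photon, which is only available on Databricks), here is a sketch of toggling ANSI behaviour in open-source Spark SQL; spark.sql.ansi.enabled is the Spark configuration, while Databricks SQL exposes the equivalent behaviour through its ANSI_MODE setting:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ansi-demo").getOrCreate()

# Opt in to the ANSI dialect; it is off by default in open-source Spark 3.x
spark.conf.set("spark.sql.ansi.enabled", "true")

# With ANSI mode on, an invalid cast raises an error instead of silently returning NULL
try:
    spark.sql("SELECT CAST('not a number' AS INT)").show()
except Exception as e:
    print("ANSI mode rejected the cast:", type(e).__name__)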
In a nutshell, you can download Apache Spark pre-built for Hadoop; the package is free to download. Additionally, you can add Delta Lake and other third-party software.
Databricks, on the other hand, is a paid platform; it contains Apache Spark + Delta Lake + many built-in extras.
As expected, performance and SQL dialect differ between a plain Hadoop setup and Delta Lake, since they are different systems.
You can install Delta Lake on top of Apache Spark yourself and compare plain Hadoop against Delta Lake; a rough sketch of that setup follows below.
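Here is a minimal sketch of wiring Delta Lake into a plain Apache Spark download from PySpark (the version numbers and path are illustrative; pick the Delta release that matches your Spark version):

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("delta-on-plain-spark")
    # pull the Delta Lake package and enable its SQL extensions/catalog
    .config("spark.jars.packages", "io.delta:delta-core_2.12:0.8.0")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

# write a small Delta table to verify the setup
spark.range(10).write.format("delta").mode("overwrite").save("/tmp/demo.delta")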
We are in the process of migrating a Hadoop workload to Azure Databricks. In the existing Hadoop ecosystem, we have some HBase tables which contain some data (not big). Since Azure Databricks does not support HBase, we were planning to replace the HBase tables with Delta tables.
Is this technically feasible? If yes, are there any challenges or issues we might face during the migration or in the target system?
It all comes down to the access patterns. HBase is an OLTP system where you usually operate on individual records (read/insert/update/delete) and expect subsecond (or millisecond) response times. Delta Lake, on the other side, is an OLAP system designed for efficient processing of many records together, but it can be slower when you read individual records, and especially when you update or delete them.
If your application needs subsecond queries, especially with updates, then it makes sense to set up a test to check whether Delta Lake is the right choice; you may want to look into Databricks SQL, which does a lot of optimizations for fast data access.
If it doesn't fulfill your requirements, then you may look into other products in the Azure ecosystem, such as Azure Cache for Redis or Azure Cosmos DB, which are designed for OLTP-style data processing.
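As a concrete illustration of why the access pattern matters, here is a minimal sketch (the table path and columns are hypothetical, and it assumes an existing SparkSession named spark with the delta-spark package available) of the kind of single-record update an HBase workload would issue. In Delta Lake such an update rewrites whole Parquet files under the hood, which is why per-record latency is much higher than in an OLTP store:

from delta.tables import DeltaTable

deltaTable = DeltaTable.forPath(spark, "/mnt/lake/customers")   # hypothetical path

# update a single record by key; Delta rewrites the affected Parquet files
deltaTable.update(
    condition="customer_id = 'C-1001'",
    set={"status": "'inactive'"}
)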
Is it possible to implement a delta lake on-premise? If yes, what software/tools need to be installed?
I'm trying to implement a delta lake on-premise to analyze some log files and database tables. My current machine is loaded with Ubuntu and Apache Spark; I'm not sure what other tools are required.
Are there any other tool suggestions for implementing the data lake concept on-premise?
Yes, you can use Delta Lake on-premise. It's just a matter of using the correct version of the Delta library (0.6.1 for Spark 2.4, 0.8.0 for Spark 3.0), or running spark-shell/pyspark as follows (for Spark 3.0):
pyspark --packages io.delta:delta-core_2.12:0.8.0
Then you can write data in Delta format, like this:
spark.range(1000).write.format("delta").mode("append").save("1.delta")
It works with local files as well, but if you need to build a real data lake, then you should use something like HDFS, which is also supported out of the box.
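For example, continuing the snippet above, you can read the table back and point the very same API at HDFS instead of the local filesystem (the namenode URL below is just a placeholder for your own cluster):

# read the local table written above back in
df = spark.read.format("delta").load("1.delta")
print(df.count())

# the same API works against HDFS for a "real" on-premise data lake
spark.range(1000).write.format("delta").mode("append").save("hdfs://namenode:8020/lake/1.delta")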
I have tried to read a lot about Databricks Delta Lake. From what I understand, it adds ACID transactions to your data storage and accelerates query performance with a Delta engine. If so, why do we need other data lakes which do not support ACID transactions? Delta Lake claims to combine both worlds of data lakes and data warehouses; we know that it cannot replace a traditional data warehouse yet due to its current support of operations. But should it replace data lakes? Why the need to have two copies of data, one in a data lake and one in Delta Lake?
Delta Lake is a type of lakehouse. Other examples of lakehouses include Hudi and Iceberg.
A lakehouse is a tool that manages the data lake in an efficient way and supports ACID transactions and advanced features like data versioning.
The question should be: "Is there any benefit to using a pure data lake over a lakehouse?"
I guess the best advantage of a pure data lake is that it works out of the box, and is therefore cheaper/less complex than a lakehouse, which provides advantages that you don't always need.
Delta Lake is a product (like Redshift) rather than a concept/approach/theory (like dimensional modelling).
As with any product in any walk of life, some of the claims made for the product will be true and some will be marketing spin. Whether the claimed benefits for a product actually make it superior to an alternative product will change from use case to use case.
Asking why there are other data lake solutions besides Delta Lake is a bit like asking why there is more than one DBMS in the world.
In my personal case there was already a data lake, a Sybase IQ, but its performance is poor compared to the queries I can run through Spark against Delta. Speed is an important factor, and with partitioned tables the difference is remarkable.
Delta Lake is an open standard. ACID transactions are in reference to writes that fail midway; transactions are a safety mechanism. Core support is in Spark, but other tools have added support for Delta Lake. Delta Lake is not a product. There is also the lakehouse design, which again isn't a product but a way to approach building a data lake; if you follow the principles you can use any technology.
According to the article by Databricks, it is possible to integrate Delta Lake with AWS Glue. However, I am not sure if it is also possible to do this outside of the Databricks platform. Has someone done that? Also, is it possible to add Delta Lake related metadata using Glue crawlers?
This is not possible. Although you can crawl the S3 Delta files outside the Databricks platform, you won't find the data in the tables.
As per the docs:
Warning
Do not use AWS Glue Crawler on the location to define the table in AWS Glue. Delta Lake maintains files corresponding to multiple versions of the table, and querying all the files crawled by Glue will generate incorrect results.
It is finally possible to use AWS Glue Crawlers to detect and catalog Delta Tables.
Here is a blog post explaining how to do it.
I am currently using a solution to generate manifests of Delta tables using Apache Spark (https://docs.delta.io/latest/presto-integration.html#language-python).
I generate a manifest file for each Delta Table using:
from delta.tables import DeltaTable

# forPath needs the active SparkSession as its first argument
deltaTable = DeltaTable.forPath(spark, <path-to-delta-table>)
deltaTable.generate("symlink_format_manifest")
Then I created the table using the DDL below, which also creates the table inside the Glue Catalog; you can then access the data from AWS Glue using the Glue Data Catalog.
CREATE EXTERNAL TABLE mytable ([(col_name1 col_datatype1, ...)])
[PARTITIONED BY (col_name2 col_datatype2, ...)]
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.SymlinkTextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION '<path-to-delta-table>/_symlink_format_manifest/'  -- location of the generated manifest
It would be better if you could clarify what you mean by "integrate Delta Lake with AWS Glue".
At this moment, there is no direct Glue API for Delta Lake support; however, you could write customized code using the Delta Lake library to save output as a Delta Lake table.
To use a Crawler to add metadata of Delta tables to the Catalog, here is a workaround. The workaround is not pretty and has two major parts.
1) Get the manifest of files referenced by the Delta table. You could refer to the Delta Lake source code, play with the logs in _delta_log, or use a brute-force method such as:
import org.apache.spark.sql.functions.input_file_name

// list the distinct Parquet files backing the current version of the table
spark.read.format("delta")
  .load(<path-to-delta-lake>)
  .select(input_file_name())
  .distinct()
2) Use the Scala or Python Glue API together with the manifest to create or update the table in the Catalog; a rough sketch follows.
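A minimal, hypothetical sketch of step 2 using the boto3 Glue API; the database name, table name, columns and S3 location are placeholders, and in practice the columns would come from your table schema and the location/files from the manifest gathered in step 1:

import boto3

glue = boto3.client("glue")
glue.create_table(
    DatabaseName="my_database",          # placeholder
    TableInput={
        "Name": "my_delta_table",        # placeholder
        "TableType": "EXTERNAL_TABLE",
        "StorageDescriptor": {
            "Columns": [{"Name": "id", "Type": "bigint"}],   # derive from your schema
            "Location": "s3://my-bucket/path/to/delta-table/",
            "InputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat",
            "OutputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat",
            "SerdeInfo": {
                "SerializationLibrary": "org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe"
            },
        },
    },
)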
AWS Glue Crawler now allows us to update metadata from Delta table transaction logs in the Glue metastore.
Ref - https://docs.aws.amazon.com/glue/latest/dg/crawler-configuration.html#crawler-delta-lake
But there are a few downsides to it:
It creates a symlink table in the Glue metastore.
This symlink-based approach wouldn't work well with multiple versions of the table, since the manifest file only points to the latest version.
There is no identifier in the Glue metadata to tell whether a given table is a Delta table, in case you have different types of tables in your metastore.
Any execution engine that accesses the Delta table via manifest files won't utilize other auxiliary data in the transaction logs, like column stats.
Yes, it is possible, but only recently.
See the AWS blog entry below for details on this just-announced capability.
https://aws.amazon.com/blogs/big-data/introducing-native-delta-lake-table-support-with-aws-glue-crawlers/