I recently started working with spark and was eager to know if I have to perform queries which would be better spark sql or databricks sql and why?
We need to distinguish two things here:
Spark SQL as a dialect of the SQL language. Originally started as Shark & Hive on Spark projects (blog), it's now going close to ANSI SQL.
Spark SQL as execution engine inside Spark.
As was mentioned in this answer, Databricks SQL as language is primarily based on Spark SQL with some additions specific to Delta Lake tables (like CREATE TABLE CLONE, ...). ANSI compatibility in Databricks SQL is controlled with ANSI_MODE setting, and will be enabled by default in the future.
But when it comes to the execution, Databricks SQL is different from Spark SQL engine because it uses Photon engine heavily optimized for modern hardware and BI/DW workloads. With Photon you can get significant speedup (2-3x) compared to standard Spark SQL engine on the complex queries that process a lot of data.
In basic nut shell, you can download Apache Spark with pre-built Hadoop. You need to download the package from free. Additionally you can add Delta Lake and other third-party software.
Now Databricks is platform where you have to pay, it contains Apache SPARK + Delta Lake + many built in extras.
As expected, performance and SQL dialect between Hadoop and Delta Lake are different since they are different databases.
You can install Delta Lake in Apache Spark so you compare Hadoop vs Delta Lake
Related
I'm looking at using both Databricks and Snowflake, connected by the Spark Connector, all running on AWS. I'm struggling to understand the following before triggering a decision:
How well does the Spark Connector perform? (performance, extra costs, compatibility)
What comparisons can be made between Databricks SQL and Snowflake SQL in terms of performance and standards?
What have been the “gotchas” or unfortunate surprises about trying to use both?
Snowflake has invested in the Spark connector's performance and according to benchmarks[0] it performs well.
The SQL dialects are similar. "Databricks SQL maintains compatibility with Apache Spark SQL semantics." [1] "Snowflake supports most of the commands and statements defined in SQL:1999." [2]
I haven't experienced gotchas. I would avoid using different regions. The performance characteristics of DataBricks SQL are different since 6/17 when they made their Photon engine default.
As always, the utility will depend on your use case, for example:
If you were doing analytical DataBricks SQL queries on partitioned compressed Parquet DeltaLake, then the performance ought to be roughly similar to Snowflake -- but if you were doing analytical DataBricks SQL queries against a JDBC MySQL connection then performance of Snowflake should be vastly better.
If you were doing wide table scan style queries (e.g. select * from foo (no where, no limit)) in DataBricks SQL and then doing analysis in a kernel (or something) then switching to Snowflake isn't going to do much for you.
etc
[0] - https://www.snowflake.com/blog/snowflake-connector-for-spark-version-2-6-turbocharges-reads-with-apache-arrow/
[1] - https://docs.databricks.com/sql/release-notes/index.html
[2] - https://docs.snowflake.com/en/sql-reference/intro-summary-sql.html
Is it possible to implement a delta lake on-premise ? if yes, what softwares/tools needs to be installed?
I'm trying to implement a delta lake on premise to analyze some log files and database tables. My current machine is loaded with ubuntu, apache spark. Not sure what other tools are required.
Are there any other tool suggestions to implement on-premise data lake concept?
Yes, you can use Delta Lake on-premise. It's just a matter of the using correct version of the Delta library (0.6.1 for Spark 2.4, 0.8.0 for Spark 3.0). Or running the spark-shell/pyspark as following (for Spark 3.0):
pyspark --packages io.delta:delta-core_2.12:0.8.0
then you can write data in Delta format, like this:
spark.range(1000).write.format("delta").mode("append").save("1.delta")
It can work with local files as well, but if you need to build a real data lake, then you need to use something like HDFS that is also supported out of the box.
Can you please help me to understand the difference between Spark SQl and Hive?
The Apache Hive data warehouse software facilitates reading, writing, and managing large datasets residing in distributed storage and queried using SQL syntax.
Built on top of Apache Hadoop, Hive provides the following features:
Tools to enable easy access to data via SQL, thus enabling data warehousing tasks such as extract/transform/load (ETL), reporting, and data analysis.
Access to files stored either directly in Apache HDFS or in other data storage systems such as Apache HBase
Sub-second query retrieval via Hive LLAP, Apache YARN and Apache Slider.
A mechanism to impose structure on a variety of data formats
Where as, Apache Spark is a fast and general-purpose cluster computing system. It provides high-level APIs in Java, Scala, Python and R, and an optimized engine that supports general execution graphs. It also supports a rich set of higher-level tools including Spark SQL for SQL and structured data processing.
Spark SQL is a Spark module for structured data processing, in which in-memory processing is its core. Using Spark SQL, can read the data from any structured sources, like JSON, CSV, parquet, avro, sequencefiles, jdbc , hive etc.
Spark SQL can also be used to read data from an existing Hive installation. Thus, Spark SQL is the generalized module which can be used to process any structured data-source.
We are in the process of migrating from traditional RDMS technologies to HIVE and in the process we have few questions on HIVE spatial capabilties. 1) Mapping of operation dbo.fHTMToString spatial from sql server operation to HIVE 2) Mapping of operation dbo.FHtmToLatLon spatial from sql server operation in HIVE
Spatial functions are not available yet in Hive. There are a proposal document you can find here: https://cwiki.apache.org/confluence/display/Hive/Spatial+queries
Try the Esri Spatial Framework for Hadoop, which includes an implementation of ST_Geometry for Hive.
You can start with the sample for Hive in the GIS Tools for Hadoop repository.
There are example calls to the various ST_Geometry UDFs in the scripts under spatial-framework-for-hadoop/hive/test/.
(Disclosure: I'm a collaborator.)
SparkSQL CLI internally uses HiveQL and in case Hive on spark(HIVE-7292) , hive uses spark as backend engine. Can somebody throw some more light, how exactly these two scenarios are different and pros and cons of both approaches?
When SparkSQL uses hive
SparkSQL can use HiveMetastore to get the metadata of the data stored in HDFS. This metadata enables SparkSQL to do better optimization of the queries that it executes. Here Spark is the query processor.
When Hive uses Spark See the JIRA entry: HIVE-7292
Here the the data is accessed via spark. And Hive is the Query processor. So we have all the deign features of Spark Core to take advantage of. But this is a Major Improvement for Hive and is still "in progress" as of Feb 2 2016.
There is a third option to process data with SparkSQL
Use SparkSQL without using Hive. Here SparkSQL does not have access to the metadata from the Hive Metastore. And the queries run slower. I have done some performance tests comparing options 1 and 3. The results are here.
SparkSQL vs Spark API you can simply imagine you are in RDBMS world:
SparkSQL is pure SQL, and Spark API is language for writing stored procedure
Hive on Spark is similar to SparkSQL, it is a pure SQL interface that use spark as execution engine, SparkSQL uses Hive's syntax, so as a language, i would say they are almost the same.
but Hive on Spark has a much better support for hive features, especially hiveserver2 and security features, hive features in SparkSQL is really buggy, there is a hiveserver2 impl in SparkSQL, but in latest release version (1.6.x), hiveserver2 in SparkSQL doesn't work with hivevar and hiveconf argument anymore, and the username for login via jdbc doesn't work either...
see https://issues.apache.org/jira/browse/SPARK-13983
i believe hive support in spark project is really very low priority stuff...
sadly Hive on spark integration is not that easy, there are a lot of dependency conflicts... such as
https://issues.apache.org/jira/browse/HIVE-13301
and, when i'm trying hive with spark integration, for debug purpose, i'm always starting hive cli like this:
export HADOOP_USER_CLASSPATH_FIRST=true
bin/hive --hiveconf hive.root.logger=DEBUG,console
our requirement is using spark with hiveserver2 in a secure way (with authentication and authorization), currently SparkSQL alone can not provide this, we are using ranger/sentry + Hive on Spark.
hope this can help you to get a better idea which direction you should go.
here is related answer I find in the hive official site:
1.3 Comparison with Shark and Spark SQL
There are two related projects in the Spark ecosystem that provide Hive QL support on Spark: Shark and Spark SQL.
●The Shark project translates query plans generated by Hive into its own representation and executes them over Spark.
●Spark SQL is a feature in Spark. It uses Hive’s parser as the frontend to provide Hive QL support. Spark application developers can easily express their data processing logic in SQL, as well as the other Spark operators, in their code. Spark SQL supports a different use case than Hive.
Compared with Shark and Spark SQL, our approach by design supports all existing Hive features, including Hive QL (and any future extension), and Hive’s integration with authorization, monitoring, auditing, and other operational tools.
3. Hive-Level Design
As noted in the introduction, this project takes a different approach from that of Shark or Spark SQL in the sense that we are not going to implement SQL semantics using Spark's primitives. On the contrary, we will implement it using MapReduce primitives. The only new thing here is that these MapReduce primitives will be executed in Spark. In fact, only a few of Spark's primitives will be used in this design.
The approach of executing Hive’s MapReduce primitives on Spark that is different from what Shark or Spark SQL does has the following direct advantages:
1.Spark users will automatically get the whole set of Hive’s rich features, including any new features that Hive might introduce in the future.
2.This approach avoids or reduces the necessity of any customization work in Hive’s Spark execution engine.
3.It will also limit the scope of the project and reduce longterm maintenance by keeping Hive-on-Spark congruent to Hive MapReduce and Tez.