Spatial functions in HIVE - apache-spark

We are in the process of migrating from traditional RDBMS technologies to Hive, and along the way we have a few questions about Hive's spatial capabilities: 1) What is the Hive equivalent of the SQL Server spatial operation dbo.fHTMToString? 2) What is the Hive equivalent of the SQL Server spatial operation dbo.FHtmToLatLon?

Spatial functions are not available in Hive yet. There is a proposal document you can find here: https://cwiki.apache.org/confluence/display/Hive/Spatial+queries

Try the Esri Spatial Framework for Hadoop, which includes an implementation of ST_Geometry for Hive.
You can start with the sample for Hive in the GIS Tools for Hadoop repository.
There are example calls to the various ST_Geometry UDFs in the scripts under spatial-framework-for-hadoop/hive/test/.
(Disclosure: I'm a collaborator.)
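For illustration, here is a minimal sketch of registering and calling a couple of the ST_Geometry UDFs from Spark SQL (the JAR path and table name are placeholders, and the implementing class names are taken from the framework's repo; in the plain Hive CLI the same ADD JAR and CREATE TEMPORARY FUNCTION statements apply):

from pyspark.sql import SparkSession

# Hive support is needed so Hive-style UDFs can be registered.
spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Placeholder path to the JAR built from spatial-framework-for-hadoop.
spark.sql("ADD JAR /path/to/spatial-sdk-hive.jar")

# Register two of the ST_Geometry UDFs by their implementing classes.
spark.sql("CREATE TEMPORARY FUNCTION ST_Point AS 'com.esri.hadoop.hive.ST_Point'")
spark.sql("CREATE TEMPORARY FUNCTION ST_Distance AS 'com.esri.hadoop.hive.ST_Distance'")

# Hypothetical table with lon/lat columns: distance from each row to a fixed point.
spark.sql("""
    SELECT id,
           ST_Distance(ST_Point(lon, lat), ST_Point(-122.0, 37.0)) AS dist
    FROM points_table
""").show()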

Related

Spark SQL vs Databricks SQL

I recently started working with Spark and am eager to know: if I have to run queries, which would be better, Spark SQL or Databricks SQL, and why?
We need to distinguish two things here:
Spark SQL as a dialect of the SQL language. It originally started as the Shark and Hive on Spark projects (blog), and it is now moving closer to ANSI SQL.
Spark SQL as execution engine inside Spark.
As was mentioned in this answer, Databricks SQL as a language is primarily based on Spark SQL, with some additions specific to Delta Lake tables (like CREATE TABLE CLONE, ...). ANSI compatibility in Databricks SQL is controlled with the ANSI_MODE setting, and it will be enabled by default in the future.
But when it comes to execution, Databricks SQL differs from the Spark SQL engine because it uses the Photon engine, which is heavily optimized for modern hardware and BI/DW workloads. With Photon you can get a significant speedup (2-3x) over the standard Spark SQL engine on complex queries that process a lot of data.
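As a concrete illustration, open-source Spark exposes the equivalent ANSI behavior through the spark.sql.ansi.enabled setting (a minimal sketch; Databricks SQL surfaces this as the ANSI_MODE setting mentioned above):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# With ANSI mode off (the historical default), an invalid cast returns NULL.
spark.conf.set("spark.sql.ansi.enabled", "false")
spark.sql("SELECT CAST('abc' AS INT)").show()    # -> NULL

# With ANSI mode on, the same cast raises a runtime error instead.
spark.conf.set("spark.sql.ansi.enabled", "true")
# spark.sql("SELECT CAST('abc' AS INT)").show()  # raises an exception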
In a nutshell: you can download Apache Spark pre-built for Hadoop, and the package is free. Additionally, you can add Delta Lake and other third-party software.
Databricks, on the other hand, is a platform you have to pay for; it contains Apache Spark + Delta Lake + many built-in extras.
As expected, performance and SQL dialect differ between plain Hadoop storage and Delta Lake, since they are different systems.
You can install Delta Lake in Apache Spark yourself, so you can compare plain Hadoop storage vs Delta Lake.
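If you want to try that comparison, here is a minimal sketch of starting Spark with Delta Lake wired in (the delta-core artifact version is a placeholder and must match your Spark/Scala build):

from pyspark.sql import SparkSession

# Pull Delta Lake as a package and register its SQL extension and catalog.
spark = (SparkSession.builder
         .config("spark.jars.packages", "io.delta:delta-core_2.12:2.4.0")
         .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
         .config("spark.sql.catalog.spark_catalog",
                 "org.apache.spark.sql.delta.catalog.DeltaCatalog")
         .getOrCreate())

# Write a small table in the Delta format and read it back.
spark.range(5).write.format("delta").mode("overwrite").save("/tmp/delta-demo")
spark.read.format("delta").load("/tmp/delta-demo").show()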

How to link Virtuoso distributed version to Hadoop

I have a cluster of 4 nodes, and I installed Hadoop + Spark (GraphX)...
Now I have to process a big RDF dataset.
My question is: can I install Virtuoso on the cluster to store these RDF datasets and to be able to execute distributed SPARQL queries?
Also, I need a web endpoint that allows users to submit their SPARQL queries.
In other words: is Virtuoso a good solution that works in a Hadoop cluster and can use Spark to execute the distributed queries?
The Apache Spark website indicates that Spark SQL can be used to query across JDBC and JSON data sources --
DataFrames and SQL provide a common way to access a variety of data sources, including Hive, Avro, Parquet, ORC, JSON, and JDBC. You can even join data across these sources.
Virtuoso (both Open Source and Enterprise Edition) can deliver SPARQL results as JSON serializations, so that is an option.
We (OpenLink Software) also provide JDBC drivers for Virtuoso (again, both Open Source and Enterprise Edition), so that is also an option.
We are not Apache Spark experts, so we cannot provide much guidance for getting these working beyond assisting with Virtuoso JDBC URLs and/or retrieving SPARQL query results in JSON serialization.
In the other direction, Virtuoso (Enterprise Edition; not Open Source Edition) can be used to query against external ODBC data sources, and there are ODBC drivers available for Hadoop/SPARK data sources, so this is also an option.
We are not Apache Spark experts, so we cannot provide much guidance for getting their drivers working, but once you have a functional ODBC DSN on the Virtuoso host, we can assist in getting Virtuoso connected to and querying against it.
Are you seeking to upload RDF datasets from your Hadoop cluster using Spark jobs? If so, you can use JDBC for the connection to Virtuoso.
I stumbled upon a DZone doc that covers Spark and JDBC; once you understand it, you can apply the same pattern to Virtuoso via its ability to process SPARQL queries over SQL connections, as sketched below.
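To make that concrete, here is a hedged sketch of pulling SPARQL results into a Spark DataFrame over Virtuoso's JDBC driver (the host, credentials, and query are placeholders; Virtuoso accepts SPARQL over its SQL channel when the statement is prefixed with the sparql keyword, though whether Spark's JDBC query wrapping is compatible with that prefix should be verified against your server):

from pyspark.sql import SparkSession

# Assumes the Virtuoso JDBC driver JAR (virtjdbc4.jar) is available locally.
spark = (SparkSession.builder
         .config("spark.jars", "/path/to/virtjdbc4.jar")
         .getOrCreate())

# Placeholder connection details; Virtuoso's default SQL port is 1111.
df = (spark.read.format("jdbc")
      .option("url", "jdbc:virtuoso://virtuoso-host:1111")
      .option("driver", "virtuoso.jdbc4.Driver")
      .option("user", "dba")
      .option("password", "dba")
      .option("query", "sparql SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 10")
      .load())

df.show()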
I hope that helps, if not, we can discuss further.

Need use case or example for Spark’s Relationship to Hive

I am reading the Spark Definitive Guide.
In the "Spark's Relationship to Hive" section, the below lines are given:
"With Spark SQL, you can connect to your Hive metastore (if you already have one) and access table metadata to reduce file listing when accessing information. This is popular for users who are migrating from a legacy Hadoop environment and beginning to run all their workloads using Spark."
I am not able to understand what this means. Could someone please help me with examples for the above use case?
Spark, being the latest tool in the Hadoop ecosystem, has connectivity with the earlier Hadoop tools. Hive was the most popular of these until recently. Most Hadoop platforms have data stored in Hive tables, which can be accessed using Hive as a SQL engine. However, Spark can do the same things.
So the quoted passage means that you can connect Spark to the Hive metastore (which contains information about existing tables and databases: their locations, schemas, file types, etc.) and then run Hive-style queries on those tables with Spark, just like you would with Hive.
Below are two examples of what you can do with Spark once it is connected to the Hive metastore:
spark.sql("show databases")
spark.sql("select * from test_db.test_table")
I hope this answers your question.

Difference Between Spark SQL and Hive

Can you please help me understand the difference between Spark SQL and Hive?
The Apache Hive data warehouse software facilitates reading, writing, and managing large datasets residing in distributed storage and queried using SQL syntax.
Built on top of Apache Hadoop, Hive provides the following features:
Tools to enable easy access to data via SQL, thus enabling data warehousing tasks such as extract/transform/load (ETL), reporting, and data analysis.
Access to files stored either directly in Apache HDFS or in other data storage systems such as Apache HBase
Sub-second query retrieval via Hive LLAP, Apache YARN and Apache Slider.
A mechanism to impose structure on a variety of data formats
Apache Spark, on the other hand, is a fast and general-purpose cluster computing system. It provides high-level APIs in Java, Scala, Python and R, and an optimized engine that supports general execution graphs. It also supports a rich set of higher-level tools, including Spark SQL for SQL and structured data processing.
Spark SQL is a Spark module for structured data processing, with in-memory processing at its core. Using Spark SQL, you can read data from many structured sources: JSON, CSV, Parquet, Avro, sequence files, JDBC, Hive, etc.
Spark SQL can also be used to read data from an existing Hive installation. Thus, Spark SQL is a generalized module that can be used to process any structured data source.
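For instance, here is a short sketch that reads two of those sources and joins them with SQL (the file paths, view names, and columns are hypothetical):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Read two different structured sources into DataFrames (placeholder paths).
orders = spark.read.json("/data/orders.json")
users = spark.read.parquet("/data/users.parquet")

# Expose them as temporary views and join them with plain SQL.
orders.createOrReplaceTempView("orders")
users.createOrReplaceTempView("users")
spark.sql("""
    SELECT u.name, COUNT(*) AS n_orders
    FROM orders o JOIN users u ON o.user_id = u.id
    GROUP BY u.name
""").show()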

SparkSQL vs Hive on Spark - Difference and pros and cons?

The SparkSQL CLI internally uses HiveQL, and in the case of Hive on Spark (HIVE-7292), Hive uses Spark as its backend engine. Can somebody throw some more light on how exactly these two scenarios are different, and on the pros and cons of both approaches?
When SparkSQL uses Hive:
SparkSQL can use the Hive metastore to get the metadata of the data stored in HDFS. This metadata enables SparkSQL to do better optimization of the queries that it executes. Here Spark is the query processor.
When Hive uses Spark (see the JIRA entry HIVE-7292):
Here the data is accessed via Spark, and Hive is the query processor, so we have all the design features of Spark Core to take advantage of. But this is a major improvement for Hive, and it is still "in progress" as of February 2, 2016.
There is a third option to process data with SparkSQL:
Use SparkSQL without using Hive. Here SparkSQL does not have access to the metadata from the Hive metastore, and the queries run slower. I have done some performance tests comparing options 1 and 3. The results are here.
SparkSQL vs Spark API: you can simply imagine you are in the RDBMS world.
SparkSQL is pure SQL, and the Spark API is the language for writing stored procedures.
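To see the analogy, here is the same aggregation written both ways (a toy DataFrame; all names are made up):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, "a"), (2, "b"), (3, "a")], ["id", "grp"])
df.createOrReplaceTempView("t")

# SparkSQL: declarative SQL text, as you would write in an RDBMS.
spark.sql("SELECT grp, COUNT(*) AS n FROM t GROUP BY grp").show()

# Spark API: the same logic expressed programmatically.
df.groupBy("grp").agg(F.count("*").alias("n")).show()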
Hive on Spark is similar to SparkSQL: it is a pure SQL interface that uses Spark as the execution engine. SparkSQL uses Hive's syntax, so as a language, I would say they are almost the same.
But Hive on Spark has much better support for Hive features, especially HiveServer2 and the security features. The Hive features in SparkSQL are really buggy: there is a HiveServer2 implementation in SparkSQL, but in the latest release version (1.6.x), HiveServer2 in SparkSQL no longer works with the hivevar and hiveconf arguments, and the username for login via JDBC doesn't work either...
see https://issues.apache.org/jira/browse/SPARK-13983
I believe Hive support in the Spark project is really very low-priority stuff...
Sadly, Hive on Spark integration is not that easy; there are a lot of dependency conflicts, such as
https://issues.apache.org/jira/browse/HIVE-13301
And when I'm trying the Hive-with-Spark integration, for debugging purposes I always start the Hive CLI like this:
# Make user-supplied JARs take precedence over Hadoop's bundled classes.
export HADOOP_USER_CLASSPATH_FIRST=true
# Start the Hive CLI with DEBUG logging sent to the console.
bin/hive --hiveconf hive.root.logger=DEBUG,console
Our requirement is to use Spark with HiveServer2 in a secure way (with authentication and authorization); currently SparkSQL alone cannot provide this, so we are using Ranger/Sentry + Hive on Spark.
I hope this helps you get a better idea of which direction you should go.
Here is a related answer I found on the official Hive site:
1.3 Comparison with Shark and Spark SQL
There are two related projects in the Spark ecosystem that provide Hive QL support on Spark: Shark and Spark SQL.
● The Shark project translates query plans generated by Hive into its own representation and executes them over Spark.
● Spark SQL is a feature in Spark. It uses Hive's parser as the frontend to provide Hive QL support. Spark application developers can easily express their data processing logic in SQL, as well as the other Spark operators, in their code. Spark SQL supports a different use case than Hive.
Compared with Shark and Spark SQL, our approach by design supports all existing Hive features, including Hive QL (and any future extension), and Hive's integration with authorization, monitoring, auditing, and other operational tools.
3. Hive-Level Design
As noted in the introduction, this project takes a different approach from that of Shark or Spark SQL in the sense that we are not going to implement SQL semantics using Spark's primitives. On the contrary, we will implement it using MapReduce primitives. The only new thing here is that these MapReduce primitives will be executed in Spark. In fact, only a few of Spark's primitives will be used in this design.
The approach of executing Hive's MapReduce primitives on Spark, which is different from what Shark or Spark SQL does, has the following direct advantages:
1. Spark users will automatically get the whole set of Hive's rich features, including any new features that Hive might introduce in the future.
2. This approach avoids or reduces the necessity of any customization work in Hive's Spark execution engine.
3. It will also limit the scope of the project and reduce long-term maintenance by keeping Hive-on-Spark congruent to Hive MapReduce and Tez.
