SQL View on Delta Lake table - apache-spark

I need to create an abstraction on top of an existing Delta Lake table in Databricks.
Is it possible to create something like a SQL Server view based on a Delta Lake table in Spark?

A SQL view can now be created on a Delta Lake table in multiple ways.
Through Spark:
CREATE OR REPLACE VIEW sqlView
AS SELECT col1, .., coln FROM delta_table
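As a quick illustration, a minimal PySpark sketch of creating such a view and querying it (the view name, column names, and delta_table are placeholders following the snippet above):
# Create (or replace) the view over the Delta table.
spark.sql("""
    CREATE OR REPLACE VIEW sqlView
    AS SELECT col1, col2 FROM delta_table
""")
# The view can then be queried like any other table.
spark.sql("SELECT * FROM sqlView LIMIT 10").show()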
A Hive table can be created on top of a Delta table (path). Just add the jars to the Hive environment, set the following properties, and create the external table (Hive 2.x is supported):
ADD JAR /path/to/delta-core-shaded-assembly_2.11-0.1.0.jar;
ADD JAR /path/to/hive-delta_2.11-0.1.0.jar;
SET hive.input.format=io.delta.hive.HiveInputFormat;
SET hive.tez.input.format=io.delta.hive.HiveInputFormat;
CREATE EXTERNAL TABLE deltaTable(col1 INT, col2 STRING)
STORED BY 'io.delta.hive.DeltaStorageHandler'
LOCATION '/delta/table/path'
For more details: https://github.com/delta-io/connectors
Presto and Athena can also be integrated with Delta:
https://docs.delta.io/latest/presto-integration.html

A view can be created on a Delta Lake table just like in relational databases, using the DDL statement below:
CREATE OR REPLACE VIEW SampleDB.Sample_View
AS
SELECT
ColA
,ColB
FROM SampleDB.Sample_Table
See the CREATE VIEW documentation.

Related

Deleting data from Hive managed table (Partitioned and Bucketed)

We have a Hive managed table (it's both partitioned and bucketed, and transactional = 'true').
We are using Spark (version 2.4) to interact with this Hive table.
We are able to successfully ingest data into this table using the following:
sparkSession.sql("insert into table values('')")
But we are not able to delete a row from this table. We are attempting to delete using the command below:
sparkSession.sql("delete from table where col1 = '' and col2 = ''")
We are getting an OperationNotAccepted exception.
Do we need to do anything specific to be able to perform this action?
Thanks
Anuj
Unless it is a DELTA table, this is not possible.
Plain Spark cannot DELETE from ORC-backed Hive bucketed (ACID) tables. See https://github.com/qubole/spark-acid
Apache Hudi on AWS could also be an option.
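For comparison, here is a minimal sketch of how a row-level delete looks once the data is stored as a Delta table (the table and column names are placeholders, and the configuration assumes the open-source delta-core package is on the classpath):
from pyspark.sql import SparkSession

# Build a session with the Delta Lake extensions enabled.
spark = (
    SparkSession.builder
    .appName("delta-delete-sketch")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

# Delta tables support row-level DELETE directly through Spark SQL.
spark.sql("DELETE FROM my_delta_table WHERE col1 = 'a' AND col2 = 'b'")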

How to add data to an existing delta table in Databricks?

I have data in Parquet format in ADLS Gen2. I want to implement Delta layers in my project.
So I kept all the data from on-prem in ADLS Gen2 via ADF, in a separate container called the landing zone.
Now I have created a separate container called Bronze where I want to keep the Delta table.
For this I did the following.
I created a database in Databricks, and I created a Delta table in Databricks using the SQL code below.
create table if not exists externaltables.actv_snap_view(
id String,
mbr_id String,
typ_id String,
strt_dttm String,
otcome_typ_id String,
cdc String
)
using delta
location '/mnt/Storage/Bronze/actv_snap_view'
Right now my table does not have any data.
How can I add the data that is in the data lake landing zone into the Delta table I created?
My database is in Databricks; after data is added to the table, where will the underlying data be stored?
You can follow the steps below to create a table using data from the landing zone (the source of the Parquet files), where the table belongs to the database present in the Bronze container.
Assuming your ADLS containers are mounted, you can create a database and specify its location as your Bronze container mount point, as suggested by @Ganesh Chandrasekaran.
create database demo location "/mnt/bronzeoutput/"
Now use the following SQL syntax to create a table from a Parquet file present in the mount point of the landing zone container.
create table demo.<table_name> (<columns>) using parquet location '/mnt/landingzoneinput/<parquet_file_name>';
With the steps above, you have created a database in your Bronze container where you can store your tables. To populate a table created inside this database, you use the files present in your landing zone container.
Update:
The create table statement above creates a table with data from the Parquet file, but the table is not reflected in the data lake.
You can instead use the queries given below. The first creates a table in the database (present in the Bronze container); the second inserts the values from your Parquet file present in the landing zone.
create table demo.<table_name> (<columns>);
-- demo database is inside bronze container
insert into demo.<table_name> select * from <data_source>.`/mnt/landingzoneinput/source_file`
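Alternatively, a minimal PySpark sketch (using the mount paths and table name from the question; the Parquet file name is a placeholder) that reads the landing zone files and appends them to the Delta table, so the underlying data files end up under the table's location in the Bronze container:
# Read the raw Parquet data from the landing zone container.
source_df = spark.read.parquet("/mnt/landingzoneinput/<parquet_file_name>")

# Append into the existing Delta table; the data files are written under the
# table's location, /mnt/Storage/Bronze/actv_snap_view in this example.
(source_df.write
    .format("delta")
    .mode("append")
    .saveAsTable("externaltables.actv_snap_view"))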

How can I see the location of an external Delta table in Spark using Spark SQL?

If I create an external table in Databricks, how can I check its location (in Delta Lake) using a SQL query?
This can be done in multiple ways.
%sql
show create table database.tablename
or
%sql
desc formatted database.tablename
It can also be done using the following command:
describe detail <the table>
The location will be listed in the location column.
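If you need the path programmatically, a small sketch (the table name is a placeholder) that pulls it out of the DESCRIBE DETAIL result:
# DESCRIBE DETAIL returns a single row whose location column holds the table path.
detail = spark.sql("DESCRIBE DETAIL database.tablename")
print(detail.select("location").first()["location"])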

Spark Sql - Insert Into External Hive Table Error

I am trying to insert data into an external Hive table through Spark SQL.
My Hive table is bucketed via a column.
The query to create the external Hive table is this:
create external table tab1 ( col1 type,col2 type,col3 type) clustered by (col1,col2) sorted by (col1) into 8 buckets stored as parquet
Now I tried to store data from a Parquet file (stored in HDFS) into the table.
This is my code:
SparkSession session = SparkSession.builder().appName("ParquetReadWrite").
config("hive.exec.dynamic.partition", "true").
config("hive.exec.dynamic.partition.mode", "nonstrict").
config("hive.execution.engine","tez").
config("hive.exec.max.dynamic.partitions","400").
config("hive.exec.max.dynamic.partitions.pernode","400").
config("hive.enforce.bucketing","true").
config("optimize.sort.dynamic.partitionining","true").
config("hive.vectorized.execution.enabled","true").
config("hive.enforce.sorting","true").
enableHiveSupport()
.master(args[0]).getOrCreate();
String insertSql="insert into tab1 select * from"+"'"+parquetInput+"'";
session.sql(insertSql);
When I run the code, it throws the error below:
mismatched input ''hdfs://url:port/user/clsadmin/somedata.parquet'' expecting (line 1, pos 50)
== SQL ==
insert into UK_DISTRICT_MONTH_DATA select * from 'hdfs://url:port/user/clsadmin/somedata.parquet'
--------------------------------------------------^^^
at org.apache.spark.sql.catalyst.parser.ParseException.withCommand(ParseDriver.scala:239)
at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:115)
at org.apache.spark.sql.execution.SparkSqlParser.parse(SparkSqlParser.scala:48)
What is the difference between using Tez versus Spark as the Hive execution engine?
Have you tried
LOAD DATA LOCAL INPATH '/path/to/data'
OVERWRITE INTO TABLE tablename;
When creating an external table in Hive, the HDFS location needs to be specified:
create external table tab1 ( col1 type,col2 type,col3 type)
clustered by (col1,col2) sorted by (col1) into 8 buckets
stored as parquet
LOCATION 'hdfs://url:port/user/clsadmin/tab1'
There is no requirement that Hive populate the data; either the same application or another application can ingest data into that location, and Hive will access the data through the schema defined on top of the location.
== SQL ==
insert into UK_DISTRICT_MONTH_DATA select * from 'hdfs://url:port/user/clsadmin/somedata.parquet'
--------------------------------------------------^^^
parquetInput is a Parquet HDFS file path and not a Hive table name, hence the error.
There are two ways you can solve this issue:
Define an external table over "parquetInput" and use that table name in the INSERT (see the sketch below), or
use LOAD DATA INPATH 'hdfs://url:port/user/clsadmin/somedata.parquet' INTO TABLE tab1.
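A minimal PySpark sketch of a lighter-weight variation of the first option, using a temporary view over the Parquet path instead of a permanent external table (the path and target table come from the question; the view name is made up):
# Read the Parquet file and expose it under a name the INSERT can reference.
parquet_df = spark.read.parquet("hdfs://url:port/user/clsadmin/somedata.parquet")
parquet_df.createOrReplaceTempView("somedata_staging")

# Insert into the bucketed Hive table from the staging view.
spark.sql("INSERT INTO tab1 SELECT * FROM somedata_staging")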

save dataframe as external hive table

I have used one way to save a dataframe as an external table using the Parquet file format, but is there some other way to save dataframes directly as an external table in Hive, like we have saveAsTable for managed tables?
You can do it this way:
df.write.format("ORC").options(Map("path" -> "yourpath")).saveAsTable("anubhav")
In PySpark, an external table can be created as below:
df.write.option('path','<External Table Path>').saveAsTable('<Table Name>')
For an external table, don't use saveAsTable. Instead, save the data at the location of the external table specified by path, then add the partition so that it is registered with the Hive metadata. This will allow you to run Hive queries by partition later.
// hc is HiveContext, df is DataFrame.
df.write.mode(SaveMode.Overwrite).parquet(path)
val sql =
s"""
|alter table $targetTable
|add if not exists partition
|(year=$year,month=$month)
|location "$path"
""".stripMargin
hc.sql(sql)
You can also save the dataframe with a manual create table statement:
dataframe.registerTempTable("temp_table");
hiveSqlContext.sql("create external table if not exists table_name as select * from temp_table");
The link below has a good explanation of CREATE TABLE: https://docs.databricks.com/spark/latest/spark-sql/language-manual/create-table.html
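On newer Spark versions, a roughly equivalent sketch in PySpark (registerTempTable is deprecated in favor of createOrReplaceTempView; the table name, location path, and view name here are placeholders) that creates the external table via CTAS with an explicit LOCATION:
# Register the DataFrame under a temporary name.
df.createOrReplaceTempView("temp_table")

# CREATE TABLE ... USING ... LOCATION ... AS SELECT creates an unmanaged
# (external) table whose data files live at the given path.
spark.sql("""
    CREATE TABLE IF NOT EXISTS table_name
    USING parquet
    LOCATION '/path/to/external/table'
    AS SELECT * FROM temp_table
""")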
