save dataframe as external hive table - apache-spark

I have one way to save a DataFrame as an external table, using the Parquet file format. Is there some other way to save DataFrames directly as an external table in Hive, the way saveAsTable works for managed tables?

You can do it this way:
df.write.format("orc").options(Map("path" -> "yourpath")).saveAsTable("anubhav")

In PySpark, an external table can be created as follows:
df.write.option('path','<External Table Path>').saveAsTable('<Table Name>')

For an external table, don't use saveAsTable. Instead, save the data at the location of the external table, specified by path. Then add the partition so that it is registered in the Hive metastore. This lets you query by partition in Hive later.
// hc is a HiveContext, df is a DataFrame.
import org.apache.spark.sql.SaveMode
df.write.mode(SaveMode.Overwrite).parquet(path)
val sql =
s"""
|alter table $targetTable
|add if not exists partition
|(year=$year,month=$month)
|location "$path"
""".stripMargin
hc.sql(sql)
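The statement built above can also be assembled by a small helper, which keeps the partition spec and the location in one place. This is only a minimal Python sketch: the function name `add_partition_sql`, the table name `sales` and the path are hypothetical, and it assumes table and column names are trusted identifiers (no escaping is attempted).

```python
def add_partition_sql(table, partition, location):
    """Build an ALTER TABLE ... ADD IF NOT EXISTS PARTITION statement.

    `partition` maps partition column names to values,
    e.g. {"year": 2020, "month": 1}. Numbers are emitted bare,
    any other value is wrapped in single quotes.
    """
    def literal(value):
        if isinstance(value, (int, float)):
            return str(value)
        return "'" + str(value) + "'"

    spec = ", ".join(f"{col}={literal(val)}" for col, val in partition.items())
    return (
        f"ALTER TABLE {table} "
        f"ADD IF NOT EXISTS PARTITION ({spec}) "
        f"LOCATION '{location}'"
    )


# Hypothetical table name and path, for illustration only.
stmt = add_partition_sql(
    "sales", {"year": 2020, "month": 1}, "/data/sales/year=2020/month=1"
)
print(stmt)
# With a Hive-enabled SparkSession named `spark` (assumed), after writing the
# data to the same location, the statement would be run as: spark.sql(stmt)
```

The data is written to the partition directory first, and the statement is then run once per partition so the metastore learns about it.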

You can also save the DataFrame by creating the table manually. Note that Spark SQL expects a LOCATION clause when creating an external table:
dataframe.registerTempTable("temp_table");
hiveSqlContext.sql("create external table if not exists table_name location '<External Table Path>' as select * from temp_table");
The following link has a good explanation of CREATE TABLE: https://docs.databricks.com/spark/latest/spark-sql/language-manual/create-table.html

Related

Deleting data from Hive managed table (Partitioned and Bucketed)

We have a Hive managed table (it is both partitioned and bucketed, with transactional = 'true').
We are using Spark (version 2.4) to interact with this Hive table.
We can successfully ingest data into this table using the following:
sparkSession.sql("insert into table values('')")
But we are not able to delete a row from this table. We are attempting the delete with the command below:
sparkSession.sql("delete from table where col1 = '' and col2 = ''")
We get an operationNotAccepted exception.
Do we need to do anything specific to be able to perform this action?
Thanks
Anuj
Unless it is a Delta table, this is not possible.
ORC does not support delete for Hive bucketed tables. See https://github.com/qubole/spark-acid
Hudi on AWS could also be an option.

API or Catalog operations for creating Hive External table using HiveWarehouseSession.session(spark).build()

I'm using an HDP 3.x cluster and running Spark SQL through spark_llap. Is there a way to create an external Hive table using hive.createTable? The example on the Hortonworks website uses the following code, but it creates a managed table and I need an external table.
hive.createTable("web_sales").ifNotExists().column("sold_time_sk", "bigint").column("ws_ship_date_sk", "bigint").create()
You can use the Spark session directly to create a table.
Example 1:
//drop the table if already created
spark.sql("drop table if exists my_table");
//create the table using the dataframe schema
spark.sql("create table my_table(....) row format delimited fields terminated by '|' location '/my/hdfs/location'");
Example 2:
spark.sql('create table movies \
(movieId int,title string,genres string) \
row format delimited fields terminated by ","\
stored as textfile') # in textfile format
spark.sql("create table ratings\
(userId int,movieId int,rating float,timestamp string)\
stored as ORC" )

Spark Sql - Insert Into External Hive Table Error

I am trying to insert data into an external Hive table through Spark SQL.
My Hive table is bucketed on a column.
The query to create the external Hive table is this:
create external table tab1 ( col1 type,col2 type,col3 type) clustered by (col1,col2) sorted by (col1) into 8 buckets stored as parquet
Now I tried to store data from a parquet file (stored in hdfs) into the table.
This is my code
SparkSession session = SparkSession.builder().appName("ParquetReadWrite").
config("hive.exec.dynamic.partition", "true").
config("hive.exec.dynamic.partition.mode", "nonstrict").
config("hive.execution.engine","tez").
config("hive.exec.max.dynamic.partitions","400").
config("hive.exec.max.dynamic.partitions.pernode","400").
config("hive.enforce.bucketing","true").
config("optimize.sort.dynamic.partitionining","true").
config("hive.vectorized.execution.enabled","true").
config("hive.enforce.sorting","true").
enableHiveSupport()
.master(args[0]).getOrCreate();
String insertSql="insert into tab1 select * from"+"'"+parquetInput+"'";
session.sql(insertSql);
When I run the code, it throws the error below:
mismatched input ''hdfs://url:port/user/clsadmin/somedata.parquet'' expecting (line 1, pos 50)
== SQL ==
insert into UK_DISTRICT_MONTH_DATA select * from 'hdfs://url:port/user/clsadmin/somedata.parquet'
--------------------------------------------------^^^
at org.apache.spark.sql.catalyst.parser.ParseException.withCommand(ParseDriver.scala:239)
at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:115)
at org.apache.spark.sql.execution.SparkSqlParser.parse(SparkSqlParser.scala:48)
Also, what is the difference between using Tez and Spark as the Hive execution engine?
Have you tried:
LOAD DATA LOCAL INPATH '/path/to/data'
OVERWRITE INTO TABLE tablename;
When creating an external table in Hive, the HDFS location has to be specified:
create external table tab1 ( col1 type,col2 type,col3 type)
clustered by (col1,col2) sorted by (col1) into 8 buckets
stored as parquet
LOCATION 'hdfs://url:port/user/clsadmin/tab1'
Hive itself does not have to populate the data. The same application or another application can write data into that location, and Hive reads it through the schema defined on top of the location.
== SQL ==
insert into UK_DISTRICT_MONTH_DATA select * from 'hdfs://url:port/user/clsadmin/somedata.parquet'
--------------------------------------------------^^^
parquetInput is the HDFS path of a Parquet file, not a Hive table name, hence the error.
There are two ways you can solve this issue:
1. Define an external table over "parquetInput" and use that table name in the query.
2. Use LOAD DATA INPATH 'hdfs://url:port/user/clsadmin/somedata.parquet' INTO TABLE tab1
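A third possibility, not mentioned in the answers above, is Spark SQL's ability to query a file directly with the parquet.`<path>` syntax, where the path is wrapped in backticks instead of single quotes. The sketch below only builds such a statement. The helper name `insert_from_parquet_sql` is hypothetical, and it addresses the parse error only: whether Spark accepts the insert into a bucketed Hive table is a separate question.

```python
def insert_from_parquet_sql(target_table, parquet_path):
    """Build an INSERT that reads a Parquet path through parquet.`<path>`.

    The path is quoted with backticks, which is how Spark SQL expects a
    file path to be written when it stands in for a table name.
    """
    if "`" in parquet_path:
        raise ValueError("path must not contain a backtick")
    return f"INSERT INTO {target_table} SELECT * FROM parquet.`{parquet_path}`"


stmt = insert_from_parquet_sql(
    "tab1", "hdfs://url:port/user/clsadmin/somedata.parquet"
)
print(stmt)
# With a SparkSession named `session` (as in the question), this would be
# run as: session.sql(stmt)
```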

drop table command is not deleting path of hive table which was created by spark-sql

I am trying to drop an internal table that was created by Spark SQL. The table itself gets dropped, but the table's location still exists. Can someone let me know how to do this?
I tried both Beeline and Spark SQL:
create table something(hello string)
PARTITIONED BY(date_d string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY "^"
LOCATION "hdfs://path";
Drop table something;
No rows affected (0.945 seconds)
Thanks
Spark internally uses the Hive metastore to create tables. If the table is created from Spark as an external Hive table, i.e. the data lives in HDFS and Hive only provides a table view over it, the drop table command deletes only the metastore information and does not delete the data from HDFS.
So there are a couple of alternative strategies you could take:
1. Manually delete the data from HDFS using the hadoop fs -rm -r command.
2. Alter the table you want to delete to change it from an external table to an internal table, then drop it:
ALTER TABLE <table-name> SET TBLPROPERTIES('EXTERNAL'='FALSE');
drop table <table-name>;
The first statement converts the external table to an internal table, and the second statement deletes the table along with its data.
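Before dropping, it can help to check whether the metastore treats the table as managed or external. The sketch below scans rows shaped like the output of DESCRIBE FORMATTED. It assumes each row is a (col_name, data_type, comment) tuple and that the type appears under a "Type" key (Spark) or a "Table Type:" key (Hive). The helper names and sample rows are hypothetical.

```python
def table_type(describe_rows):
    """Return the table type found in DESCRIBE FORMATTED-style rows, or None."""
    for row in describe_rows:
        # Normalise keys such as "Type" or "Table Type:   " before comparing.
        key = (row[0] or "").strip().rstrip(":").lower()
        if key in ("type", "table type"):
            return (row[1] or "").strip().upper()
    return None


def drop_removes_data(describe_rows):
    """True only when the rows describe a managed (internal) table."""
    kind = table_type(describe_rows)
    return kind is not None and kind.startswith("MANAGED")


# Hypothetical rows, shaped like the output for the table in the question.
sample_rows = [
    ("hello", "string", None),
    ("date_d", "string", None),
    ("# Detailed Table Information", "", ""),
    ("Type", "EXTERNAL", ""),
    ("Location", "hdfs://path", ""),
]
print(table_type(sample_rows))
print(drop_removes_data(sample_rows))
# With a SparkSession named `spark` (assumed), the rows could come from:
# spark.sql("DESCRIBE FORMATTED something").collect()
```

If the type comes back as external, dropping the table leaves the files in place, which matches the behaviour described in the question.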

Create Spark SQL tables from multiple parquet paths

I use Databricks. I am trying to create a table as follows:
target_table_name = 'test_table_1'
spark.sql("""
drop table if exists %s
""" % target_table_name)
spark.sql("""
create table if not exists {0}
USING org.apache.spark.sql.parquet
OPTIONS (
path "/mnt/sparktables/ds=*/name=xyz/"
)
""".format(target_table_name))
Using "*" gives me the flexibility to load different files through pattern matching and create a table from them, but I want to create a table based on two completely different paths (no pattern matching):
path1 = /mnt/sparktables/ds=*/name=xyz/
path2 = /mnt/sparktables/new_path/name=123fo/
Spark uses the Hive metastore to create these permanent tables. They are essentially external tables in Hive.
In general, what you are trying is not possible, because a Hive external table has a single location that is fixed at creation time.
However, you can still get a Hive table that spans different locations if you use a partitioning strategy in the Hive metastore.
In the Hive metastore, partitions of the same table can point to different locations.
There is no off-the-shelf way to achieve this, though. First specify a partition key for your dataset and create a table from the first location, where all of the data belongs to one partition. Then alter the table to add a new partition.
Sample:
create external table tableName(<schema>) partitioned by (name string) location '/mnt/sparktables/ds=*/name=xyz/'
Then you can add partitions:
alter table tableName add partition(name='123fo') location '/mnt/sparktables/new_path/name=123fo/'
The alternative to this process is to create two DataFrames from the two locations, combine them, and then call saveAsTable.
I would do something like this:
create or replace view mytable as
select * from parquet.`path1`
union all
select * from parquet.`path2`
The view knows how to query both locations. I assume you will not append to or overwrite the table, as that would lead to more ambiguity.
You can also create DataFrames separately for two or more Parquet files and then union them (assuming they have identical schemas):
df1.union(df2)
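The view-based suggestion extends to any number of paths. Here is a minimal sketch that builds the view definition from a list of paths, assuming every path holds Parquet data with the same column order. The helper name `union_view_sql` and the view name are hypothetical.

```python
def union_view_sql(view_name, paths):
    """Build a CREATE OR REPLACE VIEW statement that unions Parquet paths."""
    if not paths:
        raise ValueError("at least one path is required")
    # One SELECT per path, each reading the files directly.
    selects = [f"SELECT * FROM parquet.`{path}`" for path in paths]
    return f"CREATE OR REPLACE VIEW {view_name} AS\n" + "\nUNION ALL\n".join(selects)


stmt = union_view_sql(
    "test_table_1",
    ["/mnt/sparktables/ds=*/name=xyz/", "/mnt/sparktables/new_path/name=123fo/"],
)
print(stmt)
# With a SparkSession named `spark` (assumed), this would be run as:
# spark.sql(stmt)
```

Both UNION ALL in SQL and DataFrame.union match columns by position, not by name, so the column order has to agree across all paths.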
