I am trying to insert data into an external Hive table through Spark SQL.
My Hive table is bucketed (clustered by col1 and col2).
The query used to create the external Hive table is:
create external table tab1 (col1 type, col2 type, col3 type) clustered by (col1, col2) sorted by (col1) into 8 buckets stored as parquet
Now I am trying to load data from a Parquet file (stored in HDFS) into the table.
This is my code:
SparkSession session = SparkSession.builder().appName("ParquetReadWrite")
    .config("hive.exec.dynamic.partition", "true")
    .config("hive.exec.dynamic.partition.mode", "nonstrict")
    .enableHiveSupport()
    .getOrCreate();
String insertSql = "insert into tab1 select * from " + "'" + parquetInput + "'";
When I run the code, it throws the error below:
mismatched input ''hdfs://url:port/user/clsadmin/somedata.parquet'' expecting (line 1, pos 50)
== SQL ==
insert into UK_DISTRICT_MONTH_DATA select * from 'hdfs://url:port/user/clsadmin/somedata.parquet'
at org.apache.spark.sql.catalyst.parser.ParseException.withCommand(ParseDriver.scala:239)
at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:115)
at org.apache.spark.sql.execution.SparkSqlParser.parse(SparkSqlParser.scala:48)
What is the difference between using Tez and Spark as the Hive execution engine?
Have you tried LOAD DATA?
LOAD DATA LOCAL INPATH '/path/to/data' INTO TABLE tab1
When creating an external table in Hive, specify the HDFS location:
create external table tab1 (col1 type, col2 type, col3 type)
clustered by (col1, col2) sorted by (col1) into 8 buckets
stored as parquet
LOCATION 'hdfs://url:port/user/clsadmin/tab1'
Hive does not have to be the one that populates the data. The same application, or any other application, can write files into that location, and Hive reads them through the schema defined on top of the location.
== SQL ==
insert into UK_DISTRICT_MONTH_DATA select * from 'hdfs://url:port/user/clsadmin/somedata.parquet'
parquetInput is the HDFS path of a Parquet file, not a Hive table name, so the parser rejects it. Hence the error.
There are two ways you can solve this issue:
Define an external table over "parquetInput" and use that table's name in the select
Use LOAD DATA INPATH 'hdfs://url:port/user/clsadmin/somedata.parquet' INTO TABLE tab1
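The first option can be sketched as a pair of SQL statements. This is only a minimal sketch in Python that builds the statement text. The staging table name staging_tab1 and the column types are assumptions, because the original DDL leaves the types unspecified, and the LOCATION is assumed to be a directory containing the Parquet files.

```python
def staging_table_statements(staging_table, target_table, columns, parquet_dir):
    """Return (create_sql, insert_sql) for loading via an external staging table.

    columns: list of (name, type) pairs, in table order.
    parquet_dir: HDFS directory holding the Parquet files; a table
    LOCATION refers to a directory, not to a single file.
    """
    column_list = ", ".join(f"{name} {col_type}" for name, col_type in columns)
    create_sql = (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {staging_table} ({column_list}) "
        f"STORED AS PARQUET LOCATION '{parquet_dir}'"
    )
    insert_sql = f"INSERT INTO {target_table} SELECT * FROM {staging_table}"
    return create_sql, insert_sql


# Hypothetical staging table name and column types (assumptions for illustration).
create_sql, insert_sql = staging_table_statements(
    "staging_tab1",
    "tab1",
    [("col1", "string"), ("col2", "string"), ("col3", "int")],
    "hdfs://url:port/user/clsadmin/somedata.parquet",
)
print(create_sql)
print(insert_sql)
```

Each statement would then be passed to session.sql(...) in turn, so the select reads from a table name rather than from a quoted path.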
We have a Hive managed table (it is both partitioned and bucketed, with transactional = 'true').
We are using Spark (version 2.4) to interact with this Hive table.
We can successfully ingest data into this table using the following:
sparkSession.sql("insert into table values('')")
But we are not able to delete a row from this table. We are attempting the delete with the command below:
sparkSession.sql("delete from table where col1 = '' and col2 = ''")
We get an operationNotAccepted exception.
Do we need to do anything specific to be able to perform this action?
Unless it is a Delta table, this is not possible.
Spark 2.4 does not natively support delete on Hive transactional (ORC, bucketed) tables. See https://github.com/qubole/spark-acid
Apache Hudi on AWS could also be an option.
I'm using an HDP 3.x cluster and running Spark SQL through spark_llap. Is there a way to create an external Hive table using hive.createTable? The example on the Hortonworks website uses the following code, but it creates a managed table and I need an external table.
hive.createTable("web_sales").ifNotExists().column("sold_time_sk", "bigint").column("ws_ship_date_sk", "bigint").create()
You can use the Spark session directly to create a table.
Example 1:
//drop the table if already created
spark.sql("drop table if exists my_table");
//create the table using the dataframe schema
spark.sql("create table my_table(....) row format delimited fields terminated by '|' location '/my/hdfs/location'");
Example 2:
spark.sql('create table movies \
(movieId int,title string,genres string) \
row format delimited fields terminated by "," \
stored as textfile') # in textfile format
spark.sql("create table ratings \
(userId int,movieId int,rating float,timestamp string) \
stored as ORC")
I am working on an AWS cluster with Hive and Spark. Yesterday I ran into a strange situation while running an ETL PySpark script against an external table in Hive.
We have a control table with an extract date column. We filter data from a staging table (a managed table in Hive, but located in an S3 bucket) by extract date and load it into a target table, which is an external table with its data in an S3 bucket. We load the table as below:
spark.sql("INSERT OVERWRITE TABLE target_table select * from DF_made_from_stage_table")
When I check count(*) on the target table through Spark and through the Hive CLI directly, the two give different counts.
In Spark:
spark.sql("select count(1) from target") -- gives 50K records
In Hive:
select count(1) from target -- gives 50K - 100
Note: there was an intermittent issue with statistics on the external table, which gave -1 as the count in Hive. We resolved this by running
But even after all this, we still get original_count - 100 in Hive, while the count in Spark is correct.
There was a mistake in the DDL used for the external table. It contained "skip.header.line.count"="1", and we have 100 output files, so one line per file was skipped. That is why Hive returned original count - 100 while Spark counted correctly. After removing "skip.header.line.count"="1", the count is as expected.
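The arithmetic behind the discrepancy can be checked with a few lines of Python. This is only an illustrative sketch: the row count of 50,000 is an assumed round figure standing in for the "50K" above, and it assumes every output file contains at least one row.

```python
def hive_visible_rows(actual_rows, file_count, skip_header_lines):
    """Rows Hive reports when it skips header lines in every data file."""
    return actual_rows - file_count * skip_header_lines


spark_count = 50_000   # assumed figure for the "50K" rows Spark reports
output_files = 100     # number of output files mentioned above

with_skip = hive_visible_rows(spark_count, output_files, 1)
without_skip = hive_visible_rows(spark_count, output_files, 0)
print(with_skip)     # 49900
print(without_skip)  # 50000
```

The skip applies per file, not per table, which is why the gap equals the number of files.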
I am trying to drop an internal table that was created with Spark SQL. The table gets dropped, but the table's location still exists. Can someone let me know how to do this?
I tried both Beeline and Spark SQL.
create table something(hello string)
PARTITIONED BY(date_d string)
LOCATION "hdfs://path"
Drop table something;
No rows affected (0.945 seconds)
Spark internally uses the Hive metastore to create tables. If the table is created from Spark as an external Hive table, i.e. the data lives in HDFS and Hive only provides a table view on top of it, the drop table command deletes only the metastore information and does not delete the data from HDFS.
So there are some alternative strategies you could take:
Manually delete the data from HDFS using the hadoop fs -rm -r -f command
Alter the table you want to delete, changing it from an external table to an internal table, then drop the table.
ALTER TABLE <table-name> SET TBLPROPERTIES('EXTERNAL'='FALSE');
drop table <table-name>;
The first statement converts the external table to an internal table, and the second statement deletes the table together with its data.
I have found one way to save a DataFrame as an external table using the Parquet file format, but is there another way to save DataFrames directly as an external table in Hive, like saveAsTable does for managed tables?
You can do it this way:
df.write.format("ORC").options(Map("path"-> "yourpath")) saveAsTable "anubhav"
In PySpark, an external table can be created as below:
df.write.option('path','<External Table Path>').saveAsTable('<Table Name>')
For an external table, don't use saveAsTable. Instead, save the data at the location of the external table specified by path, then add the partition so that it is registered in the Hive metadata. This lets you query by partition in Hive later.
// hc is HiveContext, df is DataFrame.
// partitionSpec is a placeholder for the partition columns and values, e.g. "year=2016,month=1".
val sql =
  s"""
     |alter table $targetTable
     |add if not exists partition ($partitionSpec)
     |location "$path"
   """.stripMargin
hc.sql(sql)
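As a sketch of what the complete statement might look like, the Python helper below builds the ALTER TABLE text from a partition mapping. The table name events, the partition column date_d and the path are hypothetical values chosen for illustration, and values are quoted as strings, which suits string-typed partition columns.

```python
def add_partition_sql(table, partition, location):
    """Build an ALTER TABLE ... ADD IF NOT EXISTS PARTITION statement.

    partition: dict mapping partition column name -> value. Values are
    emitted as quoted string literals.
    """
    spec = ", ".join(f"{col}='{value}'" for col, value in partition.items())
    return (
        f"ALTER TABLE {table} ADD IF NOT EXISTS "
        f"PARTITION ({spec}) LOCATION '{location}'"
    )


# Hypothetical table, partition value and path (assumptions for illustration).
sql = add_partition_sql(
    "events",
    {"date_d": "2020-01-01"},
    "/warehouse/events/date_d=2020-01-01",
)
print(sql)
```

The resulting string would be passed to the sql method of the Hive-enabled context after the data files have been written to that location.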
You can also save the DataFrame by creating the table manually:
hiveSqlContext.sql("create external table if not exists table_name as select * from temp_table");
The following link has a good explanation of create table: https://docs.databricks.com/spark/latest/spark-sql/language-manual/create-table.html