How to set hive parameters and run multiple statements in spark sql - apache-spark

I have a Parquet table, and the table contains a column with newline data. When a Hive query is fired on this table, the newline data gets treated as new records. I can overcome this in Hive by setting the parameter "set hive.query.result.fileformat=SequenceFile;". Now I am migrating this parameter and the MR query to run in Spark SQL. I also want to run some other statements, such as a drop table statement, before the actual query.
My code is like below
spark.sql(set hive.query.result.fileformat=SequenceFile;drop table output_table; create table output_table stored as orc as select * from source_table;)
With this query I get a parser error at the semicolon (;) position. How can I properly execute the above code in Spark SQL?

You shouldn't have semicolons inside the code. Remove the semicolons, pass each statement to its own spark.sql() call, and supply your set parameter through the Spark session config. Then it should work.
ex:
spark = (SparkSession
    .builder
    .appName('hive_validation_test')
    .enableHiveSupport()
    .config("hive.query.result.fileformat", "SequenceFile")
    .getOrCreate())

spark.sql('drop table output_table')
spark.sql('create table output_table stored as orc as select * from source_table')
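Once the session exists, the parameter can also be issued at runtime as its own statement; a minimal sketch, keeping in mind that spark.sql() accepts exactly one statement per call and that the effect of this particular Hive property under Spark's execution engine is worth verifying:

# SET adjusts the session configuration at runtime; one statement per spark.sql() call.
spark.sql("SET hive.query.result.fileformat=SequenceFile")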

Related

Altered column in Hive, showing Null value in Spark SQL result

I need to alter one column name in Hive, so I did that with the below query. After altering, I could see the result in Hive for "select columnnm2 from tablename":
alter table tablename change columnnm1 columnnm2 string;
But when I try to execute select columnnm2 through spark.sql, I get NULL values, whereas I can see valid values in Hive.
It's a managed table. I tried a Spark metadata refresh, but still no luck. For now, dropping the old table and creating a new Hive table with the correct column name works, but how should this ALTER scenario be handled?
spark.catalog.refreshTable("<tablename>")
Thank you.
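For reference, a minimal pySpark sketch of the refresh-and-requery step described above, run after the ALTER statement has been issued in Hive (table and column names are the placeholders from the question):

# After running ALTER TABLE ... CHANGE in Hive, refresh Spark's cached metadata
# for the table and re-read the renamed column.
spark.catalog.refreshTable("tablename")
spark.sql("SELECT columnnm2 FROM tablename").show()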

Query Hive table in spark 2.2.0

I have a Hive table (say table1) in Avro file format with 1900 columns. When I query the table in Hive I am able to fetch data, but when I query the same table in Spark SQL I get "metastore client lost connection. Attempting to reconnect".
I have also queried another Hive table (say table2) in Avro file format with 130 columns; it fetches data in both Hive and Spark.
What I observed is that I can see data in the HDFS location of table2, but I can't see any data in table1's HDFS location (even though it fetches data when I query it in Hive).
The split count tells you the number of mappers in an MR job.
It doesn't show you the exact location from which the data was read.
The steps below will help you check where the data for table1 is stored in HDFS.
For table1: you can check the location of the data in HDFS by running a SELECT query with WHERE conditions in Hive, using MapReduce as the execution engine. Once the job is complete, check the map task's log in the YARN application (specifically for the text "Processing file") to find where the input data files were read from.
Also, check the location of the data for both tables in the Hive metastore by running "SHOW CREATE TABLE <table_name>;" in Hive for each table. In the result, check the "LOCATION" details.
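As a rough illustration, the same checks can also be run from a pySpark session; SHOW CREATE TABLE is the statement mentioned above, and DESCRIBE FORMATTED is an alternative that also reports a Location row (table names are taken from the question):

# Print the full DDL, including the LOCATION clause, for both tables.
spark.sql("SHOW CREATE TABLE table1").show(truncate=False)
spark.sql("SHOW CREATE TABLE table2").show(truncate=False)

# DESCRIBE FORMATTED lists the table's Location among its detailed properties.
spark.sql("DESCRIBE FORMATTED table1").show(100, truncate=False)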

Set Hive TBLPROPERTIES using (py)Spark

I need to set a custom property in one of my Hive tables using pySpark.
Normally, I would just run this command in any Hive interface to do it:
ALTER TABLE table_name SET TBLPROPERTIES ('key1'='value1');
But the question is, can I accomplish the same within a pySpark script?
Thanks!
Well, that was actually easy... it can be set using sqlContext in pySpark:
sqlContext.sql("ALTER TABLE table_name SET TBLPROPERTIES('key1' = 'value1')")
It will return an empty DataFrame: DataFrame[]
But the property is actually present on the target table. It can be similarly retrieved using:
sqlContext.sql("SHOW TBLPROPERTIES table_name('key1')").collect()[0].asDict()
{'value': u'value1'}
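On Spark 2.x and later the same calls can go through the SparkSession entry point instead of sqlContext; a minimal equivalent sketch, reusing the table and property names from the example above:

from pyspark.sql import SparkSession

# A Hive-enabled SparkSession plays the role of sqlContext here.
spark = SparkSession.builder.enableHiveSupport().getOrCreate()

spark.sql("ALTER TABLE table_name SET TBLPROPERTIES('key1' = 'value1')")

# Returns a one-row DataFrame whose 'value' column holds the property value.
spark.sql("SHOW TBLPROPERTIES table_name('key1')").collect()[0].asDict()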

Spark Sql - Insert Into External Hive Table Error

I am trying to insert data into an external Hive table through Spark SQL.
My Hive table is bucketed by a column.
The query to create the external Hive table is this:
create external table tab1 ( col1 type,col2 type,col3 type) clustered by (col1,col2) sorted by (col1) into 8 buckets stored as parquet
Now I tried to store data from a Parquet file (stored in HDFS) into the table.
This is my code:
SparkSession session = SparkSession.builder().appName("ParquetReadWrite").
config("hive.exec.dynamic.partition", "true").
config("hive.exec.dynamic.partition.mode", "nonstrict").
config("hive.execution.engine","tez").
config("hive.exec.max.dynamic.partitions","400").
config("hive.exec.max.dynamic.partitions.pernode","400").
config("hive.enforce.bucketing","true").
config("optimize.sort.dynamic.partitionining","true").
config("hive.vectorized.execution.enabled","true").
config("hive.enforce.sorting","true").
enableHiveSupport()
.master(args[0]).getOrCreate();
String insertSql="insert into tab1 select * from"+"'"+parquetInput+"'";
session.sql(insertSql);
When I run the code, it throws the below error:
mismatched input ''hdfs://url:port/user/clsadmin/somedata.parquet'' expecting (line 1, pos 50)
== SQL ==
insert into UK_DISTRICT_MONTH_DATA select * from 'hdfs://url:port/user/clsadmin/somedata.parquet'
--------------------------------------------------^^^
at org.apache.spark.sql.catalyst.parser.ParseException.withCommand(ParseDriver.scala:239)
at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:115)
at org.apache.spark.sql.execution.SparkSqlParser.parse(SparkSqlParser.scala:48)
Also, what is the difference between using Tez and Spark as the Hive execution engine?
Have you tried
LOAD DATA LOCAL INPATH '/path/to/data'
OVERWRITE INTO TABLE tablename;
When creating the external table in Hive, the HDFS location should be specified:
create external table tab1 ( col1 type,col2 type,col3 type)
clustered by (col1,col2) sorted by (col1) into 8 buckets
stored as parquet
LOCATION 'hdfs://url:port/user/clsadmin/tab1'
That way Hive does not need to populate the data itself: the same application or another application can ingest data into that location, and Hive will access the data through the schema defined on top of the location.
== SQL ==
insert into UK_DISTRICT_MONTH_DATA select * from 'hdfs://url:port/user/clsadmin/somedata.parquet'
--------------------------------------------------^^^
parquetInput is a Parquet HDFS file path, not a Hive table name; hence the error.
There are two ways you can solve this issue:
Define an external table for "parquetInput" and give that table name in the query (a closely related variant is sketched after these options)
Use LOAD DATA INPATH 'hdfs://url:port/user/clsadmin/somedata.parquet' INTO TABLE tab1
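A closely related pySpark variant of the first option: instead of registering a permanent external table, read the Parquet path into a DataFrame, expose it as a temporary view, and insert from that view by name. The view name below is a hypothetical placeholder, and it is worth checking how your Spark version handles inserts into Hive bucketed tables before relying on this:

# Path taken from the question; the view name is just a placeholder.
parquet_input = "hdfs://url:port/user/clsadmin/somedata.parquet"
df = spark.read.parquet(parquet_input)
df.createOrReplaceTempView("parquet_input_view")

# The source now has a name that an INSERT ... SELECT statement can reference.
spark.sql("INSERT INTO tab1 SELECT * FROM parquet_input_view")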

drop table command is not deleting path of hive table which was created by spark-sql

I am trying to drop an (internal) table that was created by Spark-SQL. Somehow the table is getting dropped, but the location of the table still exists. Can someone let me know how to do this?
I tried both Beeline and Spark-SQL:
create table something(hello string)
PARTITIONED BY(date_d string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY "^"
LOCATION "hdfs://path"
)
Drop table something;
No rows affected (0.945 seconds)
Thanks
Spark internally uses the Hive metastore to create the table. If the table is created as an external Hive table from Spark, i.e. the data is present in HDFS and Hive only provides a table view over it, the drop table command will only delete the metastore information and will not delete the data from HDFS.
So there are a couple of alternative strategies you could take:
Manually delete the data from HDFS using the hadoop fs -rm -r -f command.
Do an alter table on the table you want to delete, changing the external table to an internal table, and then drop the table:
ALTER TABLE <table-name> SET TBLPROPERTIES('external'='false');
drop table <table-name>;
The first statement converts the external table to an internal (managed) table, and the second statement drops the table along with its data.
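Since the question mentions running this through Spark-SQL as well, the same two statements can be issued from a pySpark session; a minimal sketch using the table name from the question:

# Flip the table from external to managed, then drop it so the data is removed as well.
spark.sql("ALTER TABLE something SET TBLPROPERTIES('external'='false')")
spark.sql("DROP TABLE something")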
