Spark SQL returns empty dataframe when reading Hive managed table - apache-spark

Using Spark 2.4 and Hive 3.1.0 on HDP 3.1, I am trying to read a managed table from Hive using Spark SQL, but it returns an empty dataframe, while it can read an external table without any problem.
How can I read a managed table from Hive with Spark SQL?
Note: the Hive managed table is not empty when read with the Hive client.
1- I tried formatting the table as both ORC and Parquet, and it failed in both cases.
2- I failed to read it using HWC.
3- I failed to read it using JDBC.
import os
from pyspark.sql import SparkSession

os.environ["HADOOP_USER_NAME"] = 'hdfs'
spark = SparkSession \
    .builder \
    .appName('NHIC') \
    .config('spark.sql.warehouse.dir', 'hdfs://192.168.1.65:50070/user/hive/warehouse') \
    .config("hive.metastore.uris", "thrift://192.168.1.66:9083") \
    .enableHiveSupport() \
    .getOrCreate()
HiveTableName ='nhic_poc.nhic_data_sample_formatted'
data = spark.sql('select * from '+HiveTableName)
The expected result is a dataframe with the data, but the actual dataframe is empty.

Could you check whether your Spark environment is over-configured?
Try running the code with the environment's default configuration by removing these lines from your code:
os.environ["HADOOP_USER_NAME"] = 'hdfs'
.config('spark.sql.warehouse.dir', 'hdfs://192.168.1.65:50070/user/hive/warehouse')
.config("hive.metastore.uris", "thrift://192.168.1.66:9083")
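If the defaults don't help, keep in mind that on HDP 3.1 Hive managed tables are transactional (ACID) by default and plain Spark SQL cannot see their data; the Hive Warehouse Connector is the documented read path. A minimal sketch, not a tested fix for this exact cluster, assuming the HWC assembly jar is on the classpath and the spark.datasource.hive.warehouse.* properties are configured:

```scala
// Sketch: reading a Hive managed (ACID) table through the
// Hive Warehouse Connector instead of spark.sql.
import com.hortonworks.hwc.HiveWarehouseSession

val hive = HiveWarehouseSession.session(spark).build()
val data = hive.executeQuery("select * from nhic_poc.nhic_data_sample_formatted")
data.show()
```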

Related

bucketBy property with Spark saveAsTable method using the Spark 'default' database instead of the Hive database: HDP 3.0

I'm saving a Spark DataFrame using the saveAsTable method with the code below.
val options = Map("path" -> hiveTablePath)
df.write.format("orc")
  .partitionBy("partitioncolumn")
  .options(options)
  .mode(SaveMode.Append)
  .saveAsTable(hiveTable)
It works fine and I am able to see data in the Hive table, but when I add one more property, bucketBy(5, bucketted_column):
df.write.format("orc")
  .partitionBy("partitioncolumn")
  .bucketBy(5, bucketted_column)
  .options(options)
  .mode(SaveMode.Append)
  .saveAsTable(hiveTable)
it tries to save the table in the Spark 'default' database instead of the Hive database.
Can someone please suggest why bucketBy(5, bucketted_column) is not working with saveAsTable?
Note: Framework: HDP 3.0
Spark: 2.1
You can try adding this parameter:
.option("dbtable", "schema.tablename")
df.write.mode(...).saveAsTable("database.table") should do the trick.
Note, however, that tables written with bucketBy cannot be read via Hive, Hue, or Impala; this is currently not supported.
Not sure what you mean by "spark db" - I think you mean the Spark metastore?
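Combining the suggestions above, a hedged sketch (the table and column names are the question's placeholders; I haven't verified this on HDP 3.0):

```scala
// Sketch: fully qualify the target table so saveAsTable resolves it
// in the Hive database rather than Spark's 'default' database.
df.write.format("orc")
  .partitionBy("partitioncolumn")
  .bucketBy(5, "bucketted_column")
  .options(options)
  .mode(SaveMode.Append)
  .saveAsTable("hive_database.hiveTable")
```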

Write Spark Dataframe to Hive accessible table in HDP2.6

I know there are already lots of answers on writing to Hive from Spark, but none of them seem to work for me. So first some background: this is an older cluster running HDP 2.6, i.e. Hive 2 and Spark 2.1.
Here an example program:
case class Record(key: Int, value: String)

val spark = SparkSession.builder()
  .appName("Test App")
  .config("spark.sql.warehouse.dir", "/app/hive/warehouse")
  .enableHiveSupport()
  .getOrCreate()

val recordsDF = spark.createDataFrame((1 to 100).map(i => Record(i, s"val_$i")))
recordsDF.write.saveAsTable("records_table")
If I log into the spark-shell and run that code, a new table called records_table shows up in Hive. However, if I deploy that code in a jar, and submit it to the cluster using spark-submit, I will see the table show up in the same HDFS location as all of the other HIVE tables, but it's not accessible to HIVE.
I know that in HDP 3.1 you have to use a HiveWarehouseConnector class, but I can't find any reference to that in HDP 2.6. Some people have mentioned the HiveContext class, while others say to just use the enableHiveSupport call in the SparkSessionBuilder. I have tried both approaches, but neither seems to work. I have tried saveAsTable. I have tried insertInto. I have even tried creating a temp view, then hiveContext.sql("create table if not exists mytable as select * from tmptable"). With each attempt, I get a parquet file in hdfs:/apps/hive/warehouse, but I cannot access that table from HIVE itself.
Based on the information provided, here is what I suggest you do.
Create the Spark session; enableHiveSupport is required:
val spark = SparkSession.builder()
  .appName("Test App")
  .enableHiveSupport()
  .getOrCreate()
Next, execute the DDL for the resulting table via spark.sql:
val ddlStr: String =
  s"""CREATE EXTERNAL TABLE IF NOT EXISTS records_table(key int, value string)
     |ROW FORMAT SERDE
     |  'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
     |STORED AS INPUTFORMAT
     |  'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
     |OUTPUTFORMAT
     |  'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
     |LOCATION '$hdfsLocation'""".stripMargin

spark.sql(ddlStr)
Write the data as per your use case:
val recordsDF = spark.createDataFrame((1 to 100).map(i => Record(i, s"val_$i")))
recordsDF.write.format("orc").insertInto("records_table")
Notes:
The behavior is the same for spark-shell and spark-submit.
Partitioning can be defined in the DDL, so do not use partitionBy when writing the data frame.
Bucketing/clustering is not supported.
Hope this helps. Cheers.

Spark - Hive table returning null values on shell

I am trying to pull Hive table data in the Spark shell using spark.sql(" ") but it's giving null values.
The Hive table contains data. I have even written code using a HiveContext object, but the same issue persists.
hc = SQLContext(sc)
hc.sql("select * from <dbname>.<tablename>").show()
Could you try setting spark.sql.warehouse.dir to your actual Hive warehouse directory instead of /user/hive/warehouse, and hive.metastore.uris to your thrift server?
val spark = SparkSession
  .builder()
  .appName("YourName")
  .config("spark.sql.warehouse.dir", "/user/hive/warehouse")
  .config("hive.metastore.uris", "thrift://localhost:9083")
  .enableHiveSupport()
  .getOrCreate()
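To confirm the session is actually talking to the intended metastore (rather than a local Derby one), you can list what Spark sees; the output should match what beeline shows. A small sketch (<dbname> is the question's placeholder):

```scala
// If these only show 'default' with no tables, the session is not
// connected to your Hive metastore.
spark.sql("show databases").show()
spark.sql("show tables in <dbname>").show()
```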

How to insert spark structured streaming DataFrame to Hive external table/location?

I have a question about Spark Structured Streaming integration with Hive tables.
I have tried some examples of Spark Structured Streaming; here is my example:
val spark = SparkSession.builder().appName("StatsAnalyzer")
  .enableHiveSupport()
  .config("hive.exec.dynamic.partition", "true")
  .config("hive.exec.dynamic.partition.mode", "nonstrict")
  .config("spark.sql.streaming.checkpointLocation", "hdfs://pp/apps/hive/warehouse/ab.db")
  .getOrCreate()

// Register the dataframe as a Hive table
val userSchema = new StructType().add("name", "string").add("age", "integer")
val csvDF = spark.readStream.option("sep", ",").schema(userSchema).csv("file:///home/su/testdelta")
csvDF.createOrReplaceTempView("updates")
val query = spark.sql("insert into table_abcd select * from updates")
query.writeStream.start()
As you can see, in the last step, while writing the data-frame to the hdfs location, the data is not getting inserted into the existing directory (my existing directory has some old data partitioned by "age").
I am getting:
org.apache.spark.sql.AnalysisException: Queries with streaming sources must be executed with writeStream.start()
Can you help me understand why I am not able to insert data into the existing directory in the hdfs location? Or is there any other way to do an "insert into" operation on a Hive table?
Looking for a solution
Spark Structured Streaming does not support writing the result of a streaming query to a Hive table.
scala> println(spark.version)
2.4.0
val sq = spark.readStream.format("rate").load
scala> :type sq
org.apache.spark.sql.DataFrame
scala> assert(sq.isStreaming)
scala> sq.writeStream.format("hive").start
org.apache.spark.sql.AnalysisException: Hive data source can only be used with tables, you can not write files of Hive data source directly.;
at org.apache.spark.sql.streaming.DataStreamWriter.start(DataStreamWriter.scala:246)
... 49 elided
If a target system (aka sink) is not supported you could use the foreach and foreachBatch operations (highlighting mine):
The foreach and foreachBatch operations allow you to apply arbitrary operations and writing logic on the output of a streaming query. They have slightly different use cases - while foreach allows custom write logic on every row, foreachBatch allows arbitrary operations and custom logic on the output of each micro-batch.
I think foreachBatch is your best bet.
import org.apache.spark.sql.DataFrame

sq.writeStream.foreachBatch { case (ds: DataFrame, batchId: Long) =>
  // do whatever you want with your input DataFrame
  // incl. writing to Hive
  // I simply decided to print out the rows to the console
  ds.show
}.start
There is also Apache Hive Warehouse Connector that I've never worked with but seems like it may be of some help.
On HDP 3.1 with Spark 2.3.2 and Hive 3.1.0 we have used Hortonworks' spark-llap library to write a structured streaming DataFrame from Spark to Hive. On GitHub you will find some documentation on its usage.
The required library hive-warehouse-connector-assembly-1.0.0.3.1.0.0-78.jar is available on Maven and needs to be passed in the spark-submit command. There are many more recent versions of that library, although I haven't had the chance to test them.
After creating the Hive table manually (e.g. through beeline/Hive shell) you could apply the following code:
import com.hortonworks.hwc.HiveWarehouseSession

val csvDF = spark.readStream.[...].load()

val query = csvDF.writeStream
  .format(HiveWarehouseSession.STREAM_TO_STREAM)
  .option("database", "database_name")
  .option("table", "table_name")
  .option("metastoreUri", spark.conf.get("spark.datasource.hive.warehouse.metastoreUri"))
  .option("checkpointLocation", "/path/to/checkpoint/dir")
  .start()

query.awaitTermination()
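For completeness, the spark-submit invocation might look roughly like this (paths, host names and the JDBC setting are illustrative placeholders, not values from the answer above):

```shell
spark-submit \
  --jars /path/to/hive-warehouse-connector-assembly-1.0.0.3.1.0.0-78.jar \
  --conf spark.datasource.hive.warehouse.metastoreUri=thrift://<metastore-host>:9083 \
  --conf spark.sql.hive.hiveserver2.jdbc.url="jdbc:hive2://<hiveserver2-host>:10000" \
  --class your.streaming.MainClass your-app.jar
```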
Just in case someone actually tried the code from Jacek Laskowski: it does not really compile in Spark 2.4.0 (check my gist, tested on AWS EMR 5.20.0 and vanilla Spark). So I guess that was his idea of how it should work in some future Spark version.
The real code is:
scala> import org.apache.spark.sql.Dataset
import org.apache.spark.sql.Dataset

scala> sq.writeStream.foreachBatch((batchDs: Dataset[_], batchId: Long) => batchDs.show).start
res0: org.apache.spark.sql.streaming.StreamingQuery = org.apache.spark.sql.execution.streaming.StreamingQueryWrapper@5ebc0bf5

How to create a Spark DataFrame (v1.6) on a secured HBase table?

I am trying to create a Spark dataframe on an existing HBase table (HBase is secured via Kerberos). I need to perform some Spark SQL operations on this table.
I have tried creating an RDD on the HBase table but am unable to convert it into a dataframe.
You can create a Hive external table with the HBase storage handler and then use that table to run your spark-sql queries.
Creating the Hive external table:
CREATE EXTERNAL TABLE foo(rowkey STRING, a STRING, b STRING)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ('hbase.columns.mapping' = ':key,f:c1,f:c2')
TBLPROPERTIES ('hbase.table.name' = 'bar');
Spark-sql:
val df = spark.sql("SELECT * FROM foo WHERE …")
Note: Here spark is a SparkSession
