Write Spark Dataframe to Hive accessible table in HDP2.6 - apache-spark

I know there are already lots of answers about writing to Hive from Spark, but none of them seem to work for me. First, some background: this is an older cluster running HDP 2.6, which means Hive 2 and Spark 2.1.
Here is an example program:
import org.apache.spark.sql.SparkSession
case class Record(key: Int, value: String)
val spark = SparkSession.builder()
.appName("Test App")
.config("spark.sql.warehouse.dir", "/app/hive/warehouse")
.enableHiveSupport()
.getOrCreate()
val recordsDF = spark.createDataFrame((1 to 100).map(i => Record(i, s"val_$i")))
recordsDF.write.saveAsTable("records_table")
If I log into the spark-shell and run that code, a new table called records_table shows up in Hive. However, if I package that code in a jar and submit it to the cluster with spark-submit, the table's files show up in the same HDFS location as all of the other Hive tables, but the table is not accessible from Hive.
I know that in HDP 3.1 you have to use the HiveWarehouseConnector class, but I can't find any reference to that in HDP 2.6. Some people have mentioned the HiveContext class, while others say to just use the enableHiveSupport call on the SparkSession builder. I have tried both approaches, but neither seems to work. I have tried saveAsTable. I have tried insertInto. I have even tried creating a temp view and then hiveContext.sql("create table if not exists mytable as select * from tmptable"). With each attempt, I get a parquet file in hdfs:/apps/hive/warehouse, but I cannot access that table from Hive itself.

Based on the information provided, here is what I suggest you do.
First, create a SparkSession; enableHiveSupport is required:
val spark = SparkSession.builder()
.appName("Test App")
.enableHiveSupport()
.getOrCreate()
Next, execute the DDL for the resultant table via spark.sql:
val ddlStr: String =
s"""CREATE EXTERNAL TABLE IF NOT EXISTS records_table(key int, value string)
|ROW FORMAT SERDE
| 'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
|STORED AS INPUTFORMAT
| 'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
|OUTPUTFORMAT
| 'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
|LOCATION '$hdfsLocation'""".stripMargin
spark.sql(ddlStr)
Then write the data as per your use case:
val recordsDF = spark.createDataFrame((1 to 100).map(i => Record(i, s"val_$i")))
recordsDF.write.format("orc").insertInto("records_table")
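To sanity-check that the table is registered in the metastore and readable back, a quick verification sketch using the same session:
spark.sql("SHOW TABLES").show()
spark.table("records_table").count()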
Notes:
The behavior will be the same for spark-shell and spark-submit.
Partitioning can be defined in the DDL, so do not use partitionBy while writing the DataFrame (see the sketch after these notes).
Bucketing/clustering is not supported.
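For the partitioning note, here is a rough sketch of what a partitioned variant could look like; the dt column and the recordsWithDtDF DataFrame are made up for illustration, $hdfsLocation is the same placeholder as above, and dynamic partitioning (hive.exec.dynamic.partition.mode=nonstrict) must be enabled on the session:
val partitionedDdl =
  s"""CREATE EXTERNAL TABLE IF NOT EXISTS records_table_part(key int, value string)
     |PARTITIONED BY (dt string)
     |STORED AS ORC
     |LOCATION '$hdfsLocation'""".stripMargin
spark.sql(partitionedDdl)
// recordsWithDtDF must list the partition column dt last
recordsWithDtDF.write.insertInto("records_table_part")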
Hope this helps/ Cheers.

Related

Spark SQL returns an empty dataframe when reading a Hive managed table

Using Spark 2.4 and Hive 3.1.0 on HDP 3.1, I am trying to read a managed table from Hive using Spark SQL, but it returns an empty dataframe, while it can read an external table easily.
How can I read a managed Hive table with Spark SQL?
Note: the Hive managed table is not empty when it is read with the Hive client.
1- I tried formatting the table as ORC and as Parquet, and it failed with both.
2- I failed to read it using HWC.
3- I failed to read it when using JDBC.
os.environ["HADOOP_USER_NAME"] = 'hdfs'
spark = SparkSession\
.builder\
.appName('NHIC')\
.config('spark.sql.warehouse.dir', 'hdfs://192.168.1.65:50070/user/hive/warehouse')\
.config("hive.metastore.uris", "thrift://192.168.1.66:9083")\
.enableHiveSupport()\
.getOrCreate()
HiveTableName ='nhic_poc.nhic_data_sample_formatted'
data = spark.sql('select * from '+HiveTableName)
The expected result is a dataframe containing the data, but the actual dataframe is empty.
Could you check whether your Spark environment is over-configured?
Try running the code with the environment's default configuration by removing these lines from your code:
os.environ["HADOOP_USER_NAME"] = 'hdfs'
.config('spark.sql.warehouse.dir', 'hdfs://192.168.1.65:50070/user/hive/warehouse')
.config("hive.metastore.uris", "thrift://192.168.1.66:9083")

How to insert spark structured streaming DataFrame to Hive external table/location?

I have a question about Spark Structured Streaming integration with a Hive table.
I have tried some examples of Spark Structured Streaming.
Here is my example:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.StructType
val spark = SparkSession.builder().appName("StatsAnalyzer")
.enableHiveSupport()
.config("hive.exec.dynamic.partition", "true")
.config("hive.exec.dynamic.partition.mode", "nonstrict")
.config("spark.sql.streaming.checkpointLocation", "hdfs://pp/apps/hive/warehouse/ab.db")
.getOrCreate()
// Register the dataframe as a Hive table
val userSchema = new StructType().add("name", "string").add("age", "integer")
val csvDF = spark.readStream.option("sep", ",").schema(userSchema).csv("file:///home/su/testdelta")
csvDF.createOrReplaceTempView("updates")
val query= spark.sql("insert into table_abcd select * from updates")
query.writeStream.start()
As you can see, in the last step, while writing the dataframe to the HDFS location, the data does not get inserted into the existing directory (my existing directory contains some old data partitioned by "age").
I am getting:
org.apache.spark.sql.AnalysisException: Queries with streaming sources must be executed with writeStream.start()
Can you help me understand why I am not able to insert data into the existing directory in the HDFS location? Or is there any other way to do an "insert into" operation on a Hive table?
I am looking for a solution.
Spark Structured Streaming does not support writing the result of a streaming query to a Hive table.
scala> println(spark.version)
2.4.0
val sq = spark.readStream.format("rate").load
scala> :type sq
org.apache.spark.sql.DataFrame
scala> assert(sq.isStreaming)
scala> sq.writeStream.format("hive").start
org.apache.spark.sql.AnalysisException: Hive data source can only be used with tables, you can not write files of Hive data source directly.;
at org.apache.spark.sql.streaming.DataStreamWriter.start(DataStreamWriter.scala:246)
... 49 elided
If a target system (aka sink) is not supported, you could use the foreach and foreachBatch operations (highlighting mine):
The foreach and foreachBatch operations allow you to apply arbitrary operations and writing logic on the output of a streaming query. They have slightly different use cases - while foreach allows custom write logic on every row, foreachBatch allows arbitrary operations and custom logic on the output of each micro-batch.
I think foreachBatch is your best bet.
import org.apache.spark.sql.DataFrame
sq.writeStream.foreachBatch { case (ds: DataFrame, batchId: Long) =>
// do whatever you want with your input DataFrame
// incl. writing to Hive
// I simply decided to print out the rows to the console
ds.show
}.start
There is also the Apache Hive Warehouse Connector, which I've never worked with, but it seems like it may be of some help.
On HDP 3.1 with Spark 2.3.2 and Hive 3.1.0 we have used Hortonworks' spark-llap library to write a structured streaming DataFrame from Spark to Hive. On GitHub you will find some documentation on its usage.
The required library hive-warehouse-connector-assembly-1.0.0.3.1.0.0-78.jar is available on Maven and needs to be passed on in the spark-submit command. There are many more recent versions of that library, although I haven't had the chance to test them.
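For example, the spark-submit invocation might look roughly like this (the jar path, application class, and metastore host are placeholders, and your cluster may require additional HWC settings from the HDP documentation):
spark-submit \
  --jars /path/to/hive-warehouse-connector-assembly-1.0.0.3.1.0.0-78.jar \
  --conf spark.datasource.hive.warehouse.metastoreUri=thrift://<metastore-host>:9083 \
  --class com.example.MyStreamingApp \
  my-streaming-app.jar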
After creating the Hive table manually (e.g. through beeline/Hive shell) you could apply the following code:
import com.hortonworks.hwc.HiveWarehouseSession
val csvDF = spark.readStream.[...].load()
val query = csvDF.writeStream
.format(HiveWarehouseSession.STREAM_TO_STREAM)
.option("database", "database_name")
.option("table", "table_name")
.option("metastoreUri", spark.conf.get("spark.datasource.hive.warehouse.metastoreUri"))
.option("checkpointLocation", "/path/to/checkpoint/dir")
.start()
query.awaitTermination()
In case someone actually tried the code from Jacek Laskowski: it does not really compile in Spark 2.4.0 (check my gist, tested on AWS EMR 5.20.0 and vanilla Spark). So I guess that was his idea of how it should work in some future Spark version.
The real code is:
scala> import org.apache.spark.sql.Dataset
import org.apache.spark.sql.Dataset
scala> sq.writeStream.foreachBatch((batchDs: Dataset[_], batchId: Long) => batchDs.show).start
res0: org.apache.spark.sql.streaming.StreamingQuery = org.apache.spark.sql.execution.streaming.StreamingQueryWrapper@5ebc0bf5
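Building on the foreachBatch approach: each micro-batch is a plain DataFrame, so inside the batch function you can fall back to the regular batch writer to insert into the Hive table. A rough sketch, assuming the session was created with enableHiveSupport, that table_abcd from the question already exists, and that the checkpoint path is a placeholder:
import org.apache.spark.sql.{Dataset, Row}
csvDF.writeStream
  .option("checkpointLocation", "/tmp/checkpoints/table_abcd")
  .foreachBatch((batchDF: Dataset[Row], batchId: Long) => {
    // the normal batch API applies inside foreachBatch
    batchDF.write.mode("append").insertInto("table_abcd")
  })
  .start()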

Set partition location in Qubole metastore using Spark

How to set partition location for my Hive table in Qubole metastore?
I know that this is a MySQL DB, but how can I access it and apply a SQL fix script using Spark?
UPD: The issue is that ALTER TABLE table_name [PARTITION (partition_spec)] SET LOCATION is slow for >1000 partitions. Do you know how to update the metastore directly in Qubole? I want to pass the locations to the metastore in a batch to improve performance.
Set Hive metastore uris in your Spark config, if not set already. This can be done in the Qubole cluster settings.
Set up a SparkSession with some properties:
val spark: SparkSession =
SparkSession
.builder()
.enableHiveSupport()
.config("hive.exec.dynamic.partition", "true")
.config("hive.exec.dynamic.partition.mode", "nonstrict")
.getOrCreate()
Assuming AWS, define an external table on S3 using spark.sql
CREATE EXTERNAL TABLE foo (...) PARTITIONED BY (...) LOCATION 's3a://bucket/path'
Generate your dataframe according to that table schema.
Register a temp table for the dataframe. Let's call it tempTable
Run an insert command with your partitions, again using spark.sql
INSERT OVERWRITE TABLE foo PARTITION(part1, part2)
SELECT x, y, z, part1, part2 from tempTable
The partition columns must go last in the SELECT list.
Partition locations will be placed within the table location in S3.
If you want to use external partitions, check out the Hive documentation on ALTER TABLE ... PARTITION (spec) ... SET LOCATION, which accepts a LOCATION path.
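Putting the steps above together, a rough end-to-end sketch (the columns, storage format, and S3 path are illustrative, and df is a hypothetical DataFrame matching the table schema with the partition columns last):
spark.sql(
  """CREATE EXTERNAL TABLE IF NOT EXISTS foo (x string, y string, z double)
    |PARTITIONED BY (part1 string, part2 string)
    |STORED AS PARQUET
    |LOCATION 's3a://bucket/path'""".stripMargin)
df.createOrReplaceTempView("tempTable")
spark.sql(
  """INSERT OVERWRITE TABLE foo PARTITION(part1, part2)
    |SELECT x, y, z, part1, part2 FROM tempTable""".stripMargin)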

pyspark, how to read Hive tables with SQLContext?

I am new to the Hadoop ecosystem and I am still confused about a few things. I am using Spark 1.6.0 (Hive 1.1.0-cdh5.8.0, Hadoop 2.6.0-cdh5.8.0).
I have some Hive table that exist and I can do some SQL queries using HUE web interface with Hive (map reduce) and Impala (mpp).
I am now using pySpark (I think behind this is the pyspark-shell) and I wanted to understand and test HiveContext and SQLContext. There are many threads that discuss the differences between the two, for various versions of Spark.
With Hive context, I have no issue to query the Hive tables:
from pyspark.sql import HiveContext
mysqlContext = HiveContext(sc)
FromHive = mysqlContext.sql("select * from table.mytable")
FromHive.count()
320
So far so good. Since SQLContext is a subset of HiveContext, I was thinking that a basic SQL select should work:
from pyspark.sql import SQLContext
sqlSparkContext = SQLContext(sc)
FromSQL = sqlSparkContext.sql("select * from table.mytable")
FromSQL.count()
Py4JJavaError: An error occurred while calling o81.sql.
: org.apache.spark.sql.AnalysisException: Table not found: `table`.`mytable`;
I added the hive-site.xml to pyspark-shell. When running
sc._conf.getAll()
I see:
('spark.yarn.dist.files', '/etc/hive/conf/hive-site.xml'),
My questions are:
Can I access Hive tables with SQLContext for simple queries? (I know HiveContext is more powerful, but for me this is just to understand things.)
If this is possible, what is missing? I couldn't find any info apart from the hive-site.xml that I tried, but it doesn't seem to work.
Thanks a lot
Cheers
Fabien
As mentioned in the other answer, you can't use SQLContext to access Hive tables; Spark 1.x provides a separate HiveContext, which is basically an extension of SQLContext.
Reason:
Hive uses an external metastore to keep all the metadata, for example the information about databases and tables. This metastore can be configured to live in MySQL etc.; the default is Derby.
This is done so that all users accessing Hive see the same contents, as facilitated by the metastore.
Derby creates a private metastore, as a metastore_db directory in the directory from which the Spark app is executed. Since this metastore is private, whatever you create or edit in this session will not be accessible to anyone else. SQLContext basically facilitates a connection to that private metastore.
Needless to say, in Spark 2.x the two were merged into SparkSession, which acts as the single entry point to Spark. You can enable Hive support while creating the SparkSession with .enableHiveSupport().
You cannot use standard SQLContext to access Hive directly. To work with Hive you need Spark binaries built with Hive support and HiveContext.
You could use a JDBC data source, but it won't be acceptable performance-wise for large-scale processing.
To query a dataframe with SQLContext, you need to register it as a temporary table first. Then you can easily run SQL queries on it. Suppose you have some data in the form of JSON; you can load it into a dataframe.
Like below:
from pyspark.sql import SQLContext
sqlSparkContext = SQLContext(sc)
df = sqlSparkContext.read.json("your json data")
df.registerTempTable("mytable")
FromSQL = sqlSparkContext.sql("select * from mytable")
FromSQL.show()
You can also collect the SQL results as an array of Row objects, as below:
r = FromSQL.collect()
for row in r:
    print row.column_Name
Try it without passing sc into sqlContext. I think that when we create the sqlContext object with sc, Spark tries to call HiveContext, but we have sqlContext instead:
>>>df=sqlContext.sql("select * from <db-name>.<table-name>")
Use the superset of SQLContext, i.e. HiveContext, to connect and load the Hive tables into Spark dataframes:
>>>df=HiveContext(sc).sql("select * from <db-name>.<table-name>")
(or)
>>>df=HiveContext(sc).table("default.text_Table")
(or)
>>> hc=HiveContext(sc)
>>> df=hc.sql("select * from default.text_Table")

Read avro data using spark dataset in java

I am a newbie to Spark and am trying to load Avro data into a Spark 'dataset' (Spark 1.6) using Java. I see some examples in Scala but not in Java.
Any pointers to examples in Java would be helpful. I tried creating a JavaRDD and then converting it to a 'dataset'. I believe there must be a more straightforward way.
First of all, you need to set hadoop.home.dir:
System.setProperty("hadoop.home.dir", "C:/app/hadoopo273/winutils-master/hadoop-2.7.1");
Then create a Spark session, specifying where the Avro file will be located:
SparkSession spark = SparkSession.builder()
    .master("local")
    .appName("ASH")
    .config("spark.cassandra.connection.host", "127.0.0.1")
    .config("spark.sql.warehouse.dir", "file:///C:/cygwin64/home/a622520/dev/AshMiner2/cass-spark-embedded/cassspark/cassspark.all/spark-warehouse/")
    .getOrCreate();
In my code I am using an embedded Spark environment.
// Creates a DataFrame from a specified file
Dataset<Row> df = spark.read().format("com.databricks.spark.avro").load("./Ash.avro");
df.createOrReplaceTempView("words");
Dataset<Row> wordCountsDataFrame = spark.sql("select count(*) as total from words");
wordCountsDataFrame.show();
Hope this helps.
