How to use SPARK to query on HIVE? - apache-spark

I am trying to use Spark to run queries on a Hive table.
I have followed lots of articles on the internet, but had no success.
I have moved the hive-site.xml file to the Spark conf location.
Could you please explain how to do that? I am using Spark 1.6.
Thank you in advance.
Please find my code below.
import sqlContext.implicits._
import org.apache.spark.sql
val eBayText = sc.textFile("/user/cloudera/spark/servicesDemo.csv")
val hospitalDataText = sc.textFile("/user/cloudera/spark/servicesDemo.csv")
val header = hospitalDataText.first()
val hospitalData = hospitalDataText.filter(a=>a!=header)
case class Services(uhid:String,locationid:String,doctorid:String)
val hData = hospitalData.map(_.split(",")).map(p=>Services(p(0),p(1),p(2)))
val hosService = hData.toDF()
hosService.write.format("parquet").mode(org.apache.spark.sql.SaveMode.Append).save("/user/hive/warehouse/hosdata")
This code created a 'hosdata' folder at the specified path, which contains data in parquet format.
But when I went to Hive to check whether the table got created, I was not able to see any table named 'hosdata'.
So I ran the commands below.
hosService.write.mode("overwrite").saveAsTable("hosData")
sqlContext.sql("show tables").show
which shows the result below:
+---------+-----------+
|tableName|isTemporary|
+---------+-----------+
|  hosdata|      false|
+---------+-----------+
But again, when I check in Hive, I cannot see a table 'hosdata'.
Could anyone let me know what step I am missing?

There are multiple ways to query Hive from Spark.
Just as in the Hive CLI, you can run your queries with Spark SQL.
In spark-shell (or in a compiled Spark application) you first define the required objects, such as a Hive context built on top of the Spark configuration; the sql() method on that context lets you execute the same queries you would have executed in Hive.
Performance tuning is definitely an important aspect as well, since you can use broadcast joins and other methods for faster execution.
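For example, since you are on Spark 1.6, a minimal sketch looks like the following, assuming hive-site.xml is on Spark's conf path so the shell can reach the Hive metastore (my_database.my_table and my_table_copy are placeholder names):
import org.apache.spark.sql.hive.HiveContext

// sc is the SparkContext provided by spark-shell
val hiveContext = new HiveContext(sc)

// Run the same HiveQL you would run in the Hive CLI
hiveContext.sql("show tables").show()
val df = hiveContext.sql("select * from my_database.my_table")

// Writing through the HiveContext registers the table in the Hive metastore,
// so it also becomes visible from the Hive CLI
df.write.mode("overwrite").saveAsTable("my_table_copy")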
Hope this helps.

Related

Write Spark Dataframe to Hive accessible table in HDP2.6

I know there are already lots of answers about writing to Hive from Spark, but none of them seem to work for me. So first, some background: this is an older cluster, running HDP 2.6, which means Hive 2 and Spark 2.1.
Here is an example program:
import org.apache.spark.sql.SparkSession

case class Record(key: Int, value: String)
val spark = SparkSession.builder()
  .appName("Test App")
  .config("spark.sql.warehouse.dir", "/app/hive/warehouse")
  .enableHiveSupport()
  .getOrCreate()
val recordsDF = spark.createDataFrame((1 to 100).map(i => Record(i, s"val_$i")))
recordsDF.write.saveAsTable("records_table")
If I log into the spark-shell and run that code, a new table called records_table shows up in Hive. However, if I deploy that code in a jar, and submit it to the cluster using spark-submit, I will see the table show up in the same HDFS location as all of the other HIVE tables, but it's not accessible to HIVE.
I know that in HDP 3.1 you have to use a HiveWarehouseConnector class, but I can't find any reference to that in HDP 2.6. Some people have mentioned the HiveContext class, while others say to just use the enableHiveSupport call in the SparkSessionBuilder. I have tried both approaches, but neither seems to work. I have tried saveAsTable. I have tried insertInto. I have even tried creating a temp view, then hiveContext.sql("create table if not exists mytable as select * from tmptable"). With each attempt, I get a parquet file in hdfs:/apps/hive/warehouse, but I cannot access that table from HIVE itself.
Based on the information provided, here is what I suggest you do.
First, create the Spark session; enableHiveSupport is required:
val spark = SparkSession.builder()
.appName("Test App")
.enableHiveSupport()
.getOrCreate()
Next, execute the DDL for the resultant table via spark.sql (hdfsLocation is the HDFS path where the table data should live):
val ddlStr: String =
  s"""CREATE EXTERNAL TABLE IF NOT EXISTS records_table(key int, value string)
     |ROW FORMAT SERDE
     |  'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
     |STORED AS INPUTFORMAT
     |  'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
     |OUTPUTFORMAT
     |  'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
     |LOCATION '$hdfsLocation'""".stripMargin
spark.sql(ddlStr)
Then write the data as per your use case:
val recordsDF = spark.createDataFrame((1 to 100).map(i => Record(i, s"val_$i")))
recordsDF.write.format("orc").insertInto("records_table")
Notes:
The behaviour will be the same for spark-shell and spark-submit.
Partitioning can be defined in the DDL, so do not use partitionBy while writing the DataFrame (see the sketch after these notes).
Bucketing/clustering is not supported.
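As referenced in the partitioning note above, here is a minimal sketch of a partitioned variant (records_table_part, the dt column, and the literal date are illustrative assumptions, not from the original answer):
import org.apache.spark.sql.functions.lit

// Partitioning is declared in the DDL rather than via partitionBy
spark.sql(
  """CREATE TABLE IF NOT EXISTS records_table_part(key int, value string)
    |PARTITIONED BY (dt string)
    |STORED AS ORC""".stripMargin)

// Enable dynamic partition inserts
spark.sql("SET hive.exec.dynamic.partition=true")
spark.sql("SET hive.exec.dynamic.partition.mode=nonstrict")

// insertInto matches columns by position, with partition columns last
recordsDF.withColumn("dt", lit("2019-01-01"))
  .write.format("orc").insertInto("records_table_part")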
Hope this helps. Cheers.

Does spark saveAsTable really create a table?

This may be a dumb question since I lack some fundamental knowledge of Spark. I tried this:
SparkSession spark = SparkSession.builder().appName("spark ...").master("local").enableHiveSupport().getOrCreate();
Dataset<Row> df = spark.range(10).toDF();
df.write().saveAsTable("foo");
This creates a table under the 'default' database in Hive, and of course I can fetch data from the table anytime I want.
I updated the above code to get rid of "enableHiveSupport":
SparkSession spark = SparkSession.builder().appName("spark ...").master("local").getOrCreate();
Dataset<Row> df = spark.range(10).toDF();
df.write().saveAsTable("bar");
The code runs fine without any error, but when I try "select * from bar", Spark says:
Caused by: org.apache.spark.sql.catalyst.analysis.NoSuchTableException: Table or view 'bar' not found in database 'default';
So I have 2 questions here,
1) Is it possible to create a 'raw' Spark table, not a Hive table? I know Hive maintains its metadata in a database like MySQL; does Spark also have a similar mechanism?
2) In the second code snippet, what does Spark actually create when calling saveAsTable?
Many thanks.
Check the answers below:
If you want to create a raw table only in Spark, createOrReplaceTempView could help you. For the second part, check the next answer.
By default, if you call saveAsTable on your DataFrame, it will persist the table to the Hive metastore if you used enableHiveSupport. If you don't enableHiveSupport, the table is managed by Spark and the data is written under the spark-warehouse location. You will lose these tables after restarting the Spark session.
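As a minimal sketch of the temporary-view approach (shown in Scala; the view name is arbitrary, and no metastore entry is created):
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("temp view demo").master("local").getOrCreate()

// A temporary view lives only in this session's in-memory catalog;
// nothing is written to a metastore or to the spark-warehouse directory
spark.range(10).toDF().createOrReplaceTempView("foo_view")
spark.sql("select * from foo_view").show()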

How to list partition-pruned inputs for Hive tables?

I am using Spark SQL to query data in Hive. The data is partitioned and Spark SQL correctly prunes the partitions when querying.
However, I need to list either the source tables along with partition filters or the specific input files (.inputFiles would be an obvious choice for this but it does not reflect pruning) for a given query in order to determine on which part of the data the computation will be taking place.
The closest I was able to get was by calling df.queryExecution.executedPlan.collectLeaves(). This contains the relevant plan nodes as HiveTableScanExec instances. However, this class is private[hive] for the org.apache.spark.sql.hive package. I think the relevant fields are relation and partitionPruningPred.
Is there any way to achieve this?
Update: I was able to get the relevant information thanks to Jacek's suggestion and by using getHiveQlPartitions on the returned relation and providing partitionPruningPred as the parameter:
scan.findHiveTables(execPlan).flatMap(e => e.relation.getHiveQlPartitions(e.partitionPruningPred))
This contained all the data I needed, including the paths to all input files, properly partition pruned.
Well, you're asking for low-level details of the query execution and things are bumpy down there. You've been warned :)
As you noted in your comment, all the execution information is in the private[hive] HiveTableScanExec.
One way to get some insight into the HiveTableScanExec physical operator (which represents a Hive table at execution time) is to create a sort of backdoor in the org.apache.spark.sql.hive package that is not private[hive].
package org.apache.spark.sql.hive

import org.apache.spark.sql.hive.execution.HiveTableScanExec

object scan {
  def findHiveTables(execPlan: org.apache.spark.sql.execution.SparkPlan) =
    execPlan.collect { case hiveTables: HiveTableScanExec => hiveTables }
}
Change the code to meet your needs.
To define scan.findHiveTables, I usually use :paste -raw while in spark-shell to sneak into such "uncharted areas".
You could then simply do the following:
scala> spark.version
res0: String = 2.4.0-SNAPSHOT
// Create a Hive table
import org.apache.spark.sql.types.StructType
spark.catalog.createTable(
  tableName = "h1",
  source = "hive", // <-- that makes for a Hive table
  schema = new StructType().add($"id".long),
  options = Map.empty[String, String])
// select * from h1
val q = spark.table("h1")
val execPlan = q.queryExecution.executedPlan
scala> println(execPlan.numberedTreeString)
00 HiveTableScan [id#22L], HiveTableRelation `default`.`h1`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [id#22L]
// Use the above code and :paste -raw in spark-shell
import org.apache.spark.sql.hive.scan
scala> scan.findHiveTables(execPlan).size
res11: Int = 1
The relation field is the Hive table after it has been resolved by the ResolveRelations and FindDataSourceTable logical rules that the Spark analyzer uses to resolve data source and Hive tables.
You can get pretty much all the information Spark uses from a Hive metastore through the ExternalCatalog interface, which is available as spark.sharedState.externalCatalog. That gives you pretty much all the metadata Spark uses to plan queries over Hive tables.
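A minimal sketch (the ExternalCatalog API is internal and may change between Spark versions; default and h1 refer to the table created above):
// spark.sharedState.externalCatalog talks to the configured metastore directly
val catalog = spark.sharedState.externalCatalog

// Table-level metadata, including the storage location
val table = catalog.getTable("default", "h1")
println(table.storage.locationUri)

// For a partitioned table, listPartitions returns one CatalogTablePartition
// per partition, each with its own storage location
catalog.listPartitions("default", "h1").foreach(p => println(p.storage.locationUri))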

pyspark, how to read Hive tables with SQLContext?

I am new to the Hadoop ecosystem and I am still confused by a few things. I am using Spark 1.6.0 (Hive 1.1.0-cdh5.8.0, Hadoop 2.6.0-cdh5.8.0).
I have some Hive tables that already exist and I can run SQL queries on them through the HUE web interface with Hive (MapReduce) and Impala (MPP).
I am now using pySpark (I think behind this is pyspark-shell) and I wanted to understand and test HiveContext and SQLContext. There are many threads that discuss the differences between the two for various versions of Spark.
With the Hive context, I have no issue querying the Hive tables:
from pyspark.sql import HiveContext
mysqlContext = HiveContext(sc)
FromHive = mysqlContext.sql("select * from table.mytable")
FromHive.count()
320
So far so good. Since SQLContext is a subset of HiveContext, I was thinking that a basic SQL select should work:
from pyspark.sql import SQLContext
sqlSparkContext = SQLContext(sc)
FromSQL = sqlSparkContext.sql("select * from table.mytable")
FromSQL.count()
Py4JJavaError: An error occurred while calling o81.sql.
: org.apache.spark.sql.AnalysisException: Table not found: `table`.`mytable`;
I added the hive-site.xml to pyspark-shell. When running
sc._conf.getAll()
I see:
('spark.yarn.dist.files', '/etc/hive/conf/hive-site.xml'),
My questions are:
Can I access Hive tables with SQLContext for simple queries? (I know HiveContext is more powerful, but for me this is just to understand things.)
If this is possible, what is missing? I couldn't find any info apart from the hive-site.xml that I tried, but it doesn't seem to work.
Thanks a lot
Cheers
Fabien
As mentioned in the other answer, you can't use SQLContext to access Hive tables; Spark 1.x.x provides a separate HiveContext, which is basically an extension of SQLContext.
Reason:
Hive uses an external metastore to keep all the metadata, for example the information about databases and tables. This metastore can be configured to be kept in MySQL etc.; the default is Derby.
This is done so that all users accessing Hive can see all the contents facilitated by the metastore.
Derby creates a private metastore, as a directory metastore_db, in the directory from where the Spark app is executed. Since this metastore is private, whatever you create or edit in this session will not be accessible to anyone else. SQLContext basically facilitates a connection to that private metastore.
Needless to say, in Spark 2.x.x the two have been merged into SparkSession, which acts as the single entry point to Spark. You can enable Hive support while creating a SparkSession via .enableHiveSupport().
You cannot use standard SQLContext to access Hive directly. To work with Hive you need Spark binaries built with Hive support and HiveContext.
You could use the JDBC data source, but it won't be acceptable performance-wise for large-scale processing.
To query a DataFrame through SQLContext, you need to register it as a temporary table first. Then you can easily run SQL queries on it. Suppose you have some data in the form of JSON; you can load it into a DataFrame,
like below:
from pyspark.sql import SQLContext
sqlSparkContext = SQLContext(sc)
df = sqlSparkContext.read.json("your json data")
df.registerTempTable("mytable")
FromSQL = sqlSparkContext.sql("select * from mytable")
FromSQL.show()
You can also collect the SQL results as an array of Row objects, as below:
rows = FromSQL.collect()
for row in rows:
    print row.column_Name
Try it without creating your own sqlContext from sc. I think that when we create the sqlContext object with sc ourselves, Spark is expected to give us a HiveContext, but we end up with a plain SQLContext instead:
>>>df=sqlContext.sql("select * from <db-name>.<table-name>")
Or use the superset of SQLContext, i.e. HiveContext, to connect and load the Hive tables into Spark DataFrames:
>>>df=HiveContext(sc).sql("select * from <db-name>.<table-name>")
(or)
>>>df=HiveContext(sc).table("default.text_Table")
(or)
>>> hc=HiveContext(sc)
>>> df=hc.sql("select * from default.text_Table")

How to work with PySpark, SparkSQL and Cassandra?

I am a bit confused with the different actors in this story: PySpark, SparkSQL, Cassandra and the pyspark-cassandra connector.
As I understand it, Spark has evolved quite a bit and SparkSQL is now a key component (with the 'dataframes'). Apparently there is absolutely no reason to work without SparkSQL, especially if connecting to Cassandra.
So my question is: what components are needed and how do I connect them together in the simplest way possible?
With spark-shell in Scala I could do simply
./bin/spark-shell --jars spark-cassandra-connector-java-assembly-1.6.0-M1-SNAPSHOT.jar
and then
import org.apache.spark.sql.cassandra.CassandraSQLContext
val cc = new CassandraSQLContext(sc)
cc.setKeyspace("mykeyspace")
val dataframe = cc.sql("SELECT count(*) FROM mytable group by beamstamp")
How can I do that with pyspark?
Here are a couple of subquestions along with partial answers I have collected (correct me if I'm wrong).
Is pyspark-cassandra needed? (I don't think so - I don't understand what it was doing in the first place.)
Do I need to use pyspark or could I use my regular jupyter notebook and import the necessary things myself?
Pyspark should be started with the spark-cassandra-connector package as described in the Spark Cassandra Connector python docs.
./bin/pyspark
--packages com.datastax.spark:spark-cassandra-connector_$SPARK_SCALA_VERSION:$SPARK_VERSION
With this loaded you will be able to use any of the DataFrame operations already present inside Spark on C* DataFrames; see the connector docs for more details on the options for working with C* DataFrames.
To set this up to run with jupyter notebook just set up your env with the following properties.
export PYSPARK_DRIVER_PYTHON=ipython
export PYSPARK_DRIVER_PYTHON_OPTS=notebook
And calling pyspark will start up a notebook correctly configured.
There is no need to use pyspark-cassandra unless you are interested in working with RDDs in Python, which has a few performance pitfalls.
In Python, the connector exposes the DataFrame API. As long as spark-cassandra-connector is available and SparkConf contains the required configuration, there is no need for additional packages. You can simply specify the format and options:
df = (sqlContext
.read
.format("org.apache.spark.sql.cassandra")
.options(table="mytable", keyspace="mykeyspace")
.load())
If you want to use plain SQL you can register the DataFrame as follows:
df.registerTempTable("mytable")
## Optionally cache
sqlContext.cacheTable("mytable")
sqlContext.sql("SELECT count(*) FROM mytable group by beamstamp")
Advanced features of the connector, like CassandraRDD, are not exposed to Python, so if you need something beyond DataFrame capabilities then pyspark-cassandra may prove useful.
