pyspark, how to read Hive tables with SQLContext? - apache-spark

I am new to the Hadoop ecosystem and I am still confused about a few things. I am using Spark 1.6.0 (Hive 1.1.0-cdh5.8.0, Hadoop 2.6.0-cdh5.8.0).
I have some Hive tables that exist, and I can run SQL queries on them through the HUE web interface with Hive (MapReduce) and Impala (MPP).
I am now using pySpark (I think behind this is pyspark-shell) and I wanted to understand and test HiveContext and SQLContext. There are many threads that discuss the differences between the two for various versions of Spark.
With Hive context, I have no issue to query the Hive tables:
from pyspark.sql import HiveContext
mysqlContext = HiveContext(sc)
FromHive = mysqlContext.sql("select * from table.mytable")
FromHive.count()
320
So far so good. Since SQLContext is a subset of HiveContext, I was thinking that a basic SQL select should work:
from pyspark.sql import SQLContext
sqlSparkContext = SQLContext(sc)
FromSQL = sqlSparkContext.sql("select * from table.mytable")
FromSQL.count()
Py4JJavaError: An error occurred while calling o81.sql.
: org.apache.spark.sql.AnalysisException: Table not found: `table`.`mytable`;
I added the hive-site.xml to pyspark-shell. When running
sc._conf.getAll()
I see:
('spark.yarn.dist.files', '/etc/hive/conf/hive-site.xml'),
My questions are:
Can I access Hive tables with SQLContext for simple queries? (I know HiveContext is more powerful, but for me this is just to understand things.)
If this is possible, what is missing? I couldn't find any info apart from the hive-site.xml that I tried, but it doesn't seem to work.
Thanks a lot
Cheers
Fabien

As mentioned in the other answer, you can't use SQLContext to access Hive tables; Spark 1.x.x provides a separate HiveContext, which is basically an extension of SQLContext.
Reason:
Hive uses an external metastore to keep all the metadata, for example the information about databases and tables. This metastore can be configured to be kept in MySQL etc. The default is Derby.
This is done so that all users accessing Hive see the same contents, facilitated by the metastore.
Derby creates a private metastore, as a directory metastore_db, in the directory from where the Spark app is executed. Since this metastore is private, whatever you create or edit in this session will not be accessible to anyone else. SQLContext basically facilitates a connection to a private metastore.
Needless to say, in Spark 2.x.x they've merged the two into SparkSession, which acts as the single entry point to Spark. You can enable Hive support while creating the SparkSession with .enableHiveSupport().
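For example, a minimal PySpark sketch of the Spark 2.x entry point with Hive support enabled (the app, database and table names are placeholders):
from pyspark.sql import SparkSession
# Build a SparkSession that talks to the shared Hive metastore instead of a private Derby one
spark = SparkSession.builder \
    .appName("hive-example") \
    .enableHiveSupport() \
    .getOrCreate()
# Hive tables are now visible through plain SQL
df = spark.sql("select * from mydb.mytable")
print(df.count())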

You cannot use the standard SQLContext to access Hive directly. To work with Hive you need Spark binaries built with Hive support and a HiveContext.
You could use the JDBC data source, but it won't be acceptable performance-wise for large scale processing.
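If you still want to try the JDBC route, a rough sketch looks like the following; the HiveServer2 host, port and driver class are assumptions for a typical setup, and expect SQL dialect quirks and poor performance compared to native Hive support:
from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)
# Read one Hive table through HiveServer2's JDBC interface (host/port/table are placeholders)
jdbc_df = sqlContext.read \
    .format("jdbc") \
    .option("url", "jdbc:hive2://hiveserver-host:10000/default") \
    .option("dbtable", "mytable") \
    .option("driver", "org.apache.hive.jdbc.HiveDriver") \
    .load()
jdbc_df.show()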

To query a DataFrame through SQLContext, you need to register it as a temporary table first. Then you can easily run SQL queries on it. Suppose you have some data in the form of JSON; you can load it into a DataFrame like below:
from pyspark.sql import SQLContext
sqlSparkContext = SQLContext(sc)
df = sqlSparkContext.read.json("your json data")
df.registerTempTable("mytable")
FromSQL = sqlSparkContext.sql("select * from mytable")
FromSQL.show()
Also, you can collect the SQL result as a list of Row objects, as below:
r = FromSQL.collect()
print r[0].column_Name

Try without passing sc into SQLContext; I think when we create the sqlContext object with sc, Spark tries to call HiveContext, but we have a plain SQLContext instead:
>>>df=sqlContext.sql("select * from <db-name>.<table-name>")
Use the superset of SQLContext, i.e. HiveContext, to connect and load the Hive tables into Spark dataframes:
>>>df=HiveContext(sc).sql("select * from <db-name>.<table-name>")
(or)
>>>df=HiveContext(sc).table("default.text_Table")
(or)
>>> hc=HiveContext(sc)
>>> df=hc.sql("select * from default.text_Table")

Related

spark.sql vs SqlContext

I have used SQL in Spark, in this example:
results = spark.sql("select * from ventas")
where ventas is a dataframe, previously registered as a table:
df.createOrReplaceTempView('ventas')
but I have seen other ways of working with SQL in Spark, using the class SqlContext:
df = sqlContext.sql("SELECT * FROM table")
What is the difference between both of them?
Thanks in advance
From a user's perspective (not a contributor), I can only rehash what the developers provided in the upgrade notes:
Upgrading From Spark SQL 1.6 to 2.0
SparkSession is now the new entry point of Spark that replaces the old SQLContext and HiveContext. Note that the old SQLContext and HiveContext are kept for backward compatibility. A new catalog interface is accessible from SparkSession - existing API on databases and tables access such as listTables, createExternalTable, dropTempView, cacheTable are moved here.
Before 2.0, the SqlContext needed an extra call to the factory that creates it. With SparkSession, they made things a lot more convenient.
If you take a look at the source code, you'll notice that the SqlContext class is mostly marked @deprecated. Closer inspection shows that the most commonly used methods simply call sparkSession.
For more info, take a look at the developer notes, Jira issues, conference talks on spark 2.0, and Databricks blog.
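As an illustration of the new catalog interface mentioned in the upgrade notes, a minimal sketch (the view name is made up):
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
# Register a temporary view and query it through the unified entry point
df = spark.range(5)
df.createOrReplaceTempView("numbers")
spark.sql("select * from numbers").show()
# The catalog API that used to live on SQLContext is now under spark.catalog
print(spark.catalog.listDatabases())
print(spark.catalog.listTables())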
Before Spark 2.x, SQLContext was built with the help of SparkContext, but since Spark 2.x SparkSession has been introduced, which has the functionality of both HiveContext and SQLContext. So there is no need to create a SQLContext separately.
Before Spark 2.x:
sCont = SparkContext()
sqlCont = SQLContext(sCont)
After Spark 2.x:
spark = SparkSession.builder.getOrCreate()
SparkSession is the preferred way of working with Spark now. Both HiveContext and SQLContext are available as part of this single object, SparkSession.
You are using the latest syntax by creating a view with df.createOrReplaceTempView('ventas').
Next create df1 using sqlContext:
df1=sqlContext.sql("select col1,col2,col3 from table")
Next create df2 using the SparkSession:
df2=spark.sql("select col1,col2,col3 from table")
Check whether they differ using type(df1) and type(df2).

Understanding how Hive SQL gets executed in Spark

I am new to Spark and Hive. I need to understand what happens behind the scenes when a Hive table is queried in Spark. I am using PySpark.
Ex:
warehouse_location = '/user/hive/warehouse'
from pyspark.sql import SparkSession
spark =SparkSession.builder.appName("Pyspark").config("spark.sql.warehouse.dir", warehouse_location).enableHiveSupport().getOrCreate()
DF = spark.sql("select * from hive_table")
In the above case, does the actual SQL run in the Spark framework, or does it run in the MapReduce framework of Hive?
I am just wondering how the SQL is being processed: in Hive or in Spark?
enableHiveSupport() and HiveContext are quite misleading, as they suggest some deeper relationship with Hive.
In practice, Hive support means that Spark will use the Hive metastore to read and write metadata. Before 2.0 there were some additional benefits (window function support, better parser), but this is no longer the case today.
Hive support does not imply:
Full Hive Query Language compatibility.
Any form of computation on Hive.
SparkSQL allows reading and writing data to Hive tables. In addition to Hive data, any RDD can be converted to a DataFrame, and SparkSQL can be used to run queries on the DataFrame.
The actual execution will happen on Spark. You can check this in your example by running DF.count() and tracking the job via the Spark UI at http://localhost:4040.
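To illustrate the point that SparkSQL runs on DataFrames built from any RDD, here is a minimal sketch (the data and view name are made up):
from pyspark.sql import SparkSession, Row
spark = SparkSession.builder.enableHiveSupport().getOrCreate()
# Build a DataFrame from a plain RDD rather than from a Hive table
rdd = spark.sparkContext.parallelize([Row(name="a", qty=1), Row(name="b", qty=2)])
df = spark.createDataFrame(rdd)
df.createOrReplaceTempView("rdd_table")
# This query is planned and executed entirely by Spark, not by Hive
spark.sql("select name, qty from rdd_table where qty > 1").show()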

Comparing 2 dataframes in spark when using hivecontext for 1 dataframe and sqlcontext for the other

When I store a Hive table in one dataframe using HiveContext and a DB2 table in another dataframe using sqlContext, querying across both dataframes does not detect the DB2 table, while it does detect the Hive one. What is a common sqlContext that can be used?
TL;DR Use the same context for all tables.
If you need Hive support use HiveContext or SparkSession with Hive support. Don't create a separate session to connect to specific DataSource.
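A rough sketch of doing both reads through the same Hive-enabled session, so that the two DataFrames can be compared or joined; the DB2 URL, credentials, driver class and table/column names are placeholders for a typical setup:
from pyspark.sql import SparkSession
spark = SparkSession.builder.enableHiveSupport().getOrCreate()
# Hive table via the shared metastore
hive_df = spark.table("mydb.hive_table")
# DB2 table via the JDBC data source
db2_df = spark.read \
    .format("jdbc") \
    .option("url", "jdbc:db2://db2-host:50000/MYDB") \
    .option("dbtable", "MYSCHEMA.DB2_TABLE") \
    .option("user", "db2user") \
    .option("password", "db2password") \
    .option("driver", "com.ibm.db2.jcc.DB2Driver") \
    .load()
# Both DataFrames come from the same session, so they can be joined directly
hive_df.join(db2_df, hive_df["id"] == db2_df["ID"]).show()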

Difference between various sparkcontexts in Spark 1.x and 2.x

Can anyone explain the difference between the SparkContext, SQLContext, HiveContext and SparkSession entry points and each one's use cases?
SparkContext is used for the basic RDD API on both Spark 1.x and Spark 2.x.
SparkSession is used for the DataFrame API and Structured Streaming API on Spark 2.x.
SQLContext and HiveContext are used for the DataFrame API on Spark 1.x and are deprecated from Spark 2.x (see the sketch below).
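A minimal PySpark sketch of the two styles; the snippets are alternatives for the respective major versions (not meant to run together), they assume a Spark build with Hive support, and the app name is a placeholder:
# Spark 1.x style: SparkContext for RDDs, SQLContext/HiveContext for DataFrames
from pyspark import SparkContext
from pyspark.sql import SQLContext, HiveContext
sc = SparkContext(appName="entry-points-1x")
rdd = sc.parallelize([1, 2, 3])      # RDD API via SparkContext
sqlContext = SQLContext(sc)          # DataFrame API without the Hive metastore
hiveContext = HiveContext(sc)        # DataFrame API plus Hive metastore access

# Spark 2.x style: one SparkSession replaces SQLContext and HiveContext
from pyspark.sql import SparkSession
spark = SparkSession.builder.enableHiveSupport().getOrCreate()
rdd2 = spark.sparkContext.parallelize([1, 2, 3])   # the underlying SparkContext is still there
df = spark.range(3)                                # DataFrame API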
SparkContext is a class in the Spark API and the first stage of building a Spark application. The functionality of the SparkContext is to allocate the memory in RAM that we call driver memory and to allocate the number of executors and cores; in short, it is all about cluster management. SparkContext can be used to create RDDs and shared variables. SparkContext is a class; to access it we need to create an object of it.
This way we can create a SparkContext: var sc = new SparkContext()
SparkSession is a new object added since Spark 2.x, which is a replacement for SQLContext and HiveContext.
Earlier we had two options: SQLContext, which was the way to do SQL operations on a DataFrame, and HiveContext, which managed the Hive connectivity and fetched/inserted data from/to Hive tables.
Since 2.x came out, we can create a SparkSession for SQL operations on a DataFrame, and if you have any Hive-related work, just call the method enableHiveSupport(); then you can use the SparkSession for both DataFrame and Hive-related SQL operations.
This way we can create a SparkSession for SQL operations on a DataFrame:
val sparksession=SparkSession.builder().getOrCreate();
The second way is to create a SparkSession for SQL operations on a DataFrame as well as Hive operations:
val sparkSession=SparkSession.builder().enableHiveSupport().getOrCreate()

Converting CSV to ORC with Spark

I've seen this blog post by Hortonworks for support for ORC in Spark 1.2 through datasources.
It covers version 1.2 and addresses the creation of the ORC file from objects, not the conversion from CSV to ORC.
I have also seen ways, as intended, to do these conversions in Hive.
Could someone please provide a simple example of how to load a plain CSV file in Spark 1.6+, save it as ORC and then load it back as a data frame in Spark.
I'm going to omit the CSV reading part because that question has been answered quite a few times before, and plenty of tutorials are available on the web for that purpose; it would be overkill to write it again. Check here if you want!
ORC support:
Concerning ORC files, they are supported with the HiveContext.
HiveContext is an instance of the Spark SQL execution engine that integrates with data stored in Hive. SQLContext provides a subset of the Spark SQL support that does not depend on Hive, but ORC, window functions and some other features depend on HiveContext, which reads the configuration from hive-site.xml on the classpath.
You can define a HiveContext as following :
import org.apache.spark.sql.hive.orc._
import org.apache.spark.sql._
val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
If you are working with the spark-shell, you can directly use sqlContext for this purpose without creating a hiveContext, since by default sqlContext is created as a HiveContext.
Specifying stored as orc at the end of the SQL statement below ensures that the Hive table is stored in the ORC format, e.g.:
val df : DataFrame = ???
df.registerTempTable("orc_table")
val results = hiveContext.sql("create table orc_table (date STRING, price FLOAT, user INT) stored as orc")
Saving as an ORC file
Let's persist the DataFrame as ORC files:
df.write.format("orc").save("data_orc")
To store the results in the Hive warehouse directory rather than the user directory, use the path /apps/hive/warehouse/data_orc instead (the Hive warehouse path from hive-default.xml).
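For the end-to-end round trip the question asks about, here is a rough PySpark sketch for Spark 1.6; it assumes the spark-csv package is available (e.g. started with --packages com.databricks:spark-csv_2.10:1.5.0) and the file paths and app name are placeholders:
from pyspark import SparkContext
from pyspark.sql import HiveContext
sc = SparkContext(appName="csv-to-orc")
hiveContext = HiveContext(sc)        # ORC support needs a HiveContext in Spark 1.x
# 1. Load the plain CSV file (header handling and schema inference via spark-csv)
csv_df = hiveContext.read \
    .format("com.databricks.spark.csv") \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .load("/path/to/input.csv")
# 2. Save it as ORC files
csv_df.write.format("orc").save("/path/to/data_orc")
# 3. Load the ORC data back as a DataFrame
orc_df = hiveContext.read.format("orc").load("/path/to/data_orc")
orc_df.show()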
