PySpark throwing ParseException for syntactically correct Hive query - apache-spark

I have a DDL query that works fine in Beeline, but when I try to run the same query within a SparkSession it throws a ParseException.
from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession, HiveContext
# Initialise Hive metastore
SparkContext.setSystemProperty("hive.metastore.uris", "thrift://localhost:9083")
# Create Spark Session
sparkSession = (SparkSession\
.builder\
.appName('test_case')\
.enableHiveSupport()\
.getOrCreate())
sparkSession.sql("CREATE EXTERNAL TABLE B LIKE A")
PySpark exception:
pyspark.sql.utils.ParseException: u"\nmismatched input 'LIKE' expecting <EOF>(line 1, pos 53)\n\n== SQL ==\nCREATE EXTERNAL TABLE B LIKE A\n-----------------------------------------------------^^^\n"
How can I make the HiveQL statement work within PySpark?
The problem seems to be that the query is parsed as a Spark SQL query rather than a HiveQL query, even though I have enableHiveSupport activated for the SparkSession.

spark.sql() uses Spark SQL's own parser by default. To enable HiveQL syntax, I believe you need to give it a hint about your intent via a comment. (In fairness, I don't think this is well documented; I've only been able to find a tangential reference to this being a thing here, and only in the Scala version of the example.)
For example, I'm able to get my command to parse by writing:
%sql
-- `USING HIVE`
CREATE TABLE narf LIKE poit
Now, I don't have Hive Support enabled on my session, so my query fails... but it does parse!
Edit: Since your SQL statement is in a Python string, you can use a multi-line string to use the single-line comment syntax, like this:
sparkSession.sql("""
-- `USING HIVE`
CREATE EXTERNAL TABLE B LIKE A
""")
There's also a delimited comment syntax in SQL, e.g.
sparkSession.sql("/* `USING HIVE` */ CREATE EXTERNAL TABLE B LIKE A")
which may work just as well.
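Once the statement goes through on a Hive-enabled session, a quick sanity check (a minimal sketch, assuming table B now exists) is to look at the table metadata, which should list the schema copied from A along with the table type among the detailed table information:
# Hedged sketch: inspect the metadata of the newly created table B.
# DESCRIBE FORMATTED lists the columns plus table details such as its Type (e.g. EXTERNAL).
sparkSession.sql("DESCRIBE FORMATTED B").show(100, truncate=False)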

Related

Structured Streaming with Apache Spark coded in Spark.SQL

Streaming transformations in Apache Spark with Databricks are usually coded in either Scala or Python. However, can someone let me know if it's also possible to code streaming in SQL on Delta?
For example, the following sample code uses PySpark for Structured Streaming; can you let me know what the equivalent would be in Spark SQL?
from pyspark.sql.functions import expr  # needed for expr()

simpleTransform = streaming.withColumn("stairs", expr("gt like '%stairs%'"))\
    .where("stairs")\
    .where("gt is not null")\
    .select("gt", "model", "arrival_time", "creation_time")\
    .writeStream\
    .queryName("simple_transform")\
    .format("memory")\
    .outputMode("update")\
    .start()
You can just register that streaming DF as a temporary view, and perform queries on it. For example (using rate source just for simplicity):
df = spark.readStream.format("rate").load()
df.createOrReplaceTempView("my_stream")
then you can perform SQL queries directly on that view, for example select * from my_stream.
Or you can create another view, applying whatever transformations you need. For example, we can select only every 5th value if we use this SQL statement:
create or replace temp view my_derived as
select * from my_stream where (value % 5) == 0
and then query that view with select * from my_derived.
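Putting this together for the original example: a minimal sketch, assuming a streaming DataFrame named streaming with the gt, model, arrival_time and creation_time columns from the question, and an active SparkSession named spark, that expresses the same transformation in SQL over a temporary view:
# Register the streaming DataFrame as a view so it can be queried with SQL.
streaming.createOrReplaceTempView("activity")

# The same filtering/projection as the PySpark version, expressed in SQL.
simple_transform_df = spark.sql("""
    SELECT gt, model, arrival_time, creation_time
    FROM activity
    WHERE gt LIKE '%stairs%' AND gt IS NOT NULL
""")

# The sink/trigger setup still uses the DataStreamWriter API rather than SQL.
query = (simple_transform_df.writeStream
         .queryName("simple_transform")
         .format("memory")
         .outputMode("update")
         .start())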

The first entry point to Spark SQL

I'm having trouble finding what the first line executed in the Spark source code is
after I run "spark.sql(SQL_QUERY).explain()".
Does anyone have any idea which module/package I could start to look into?
Thanks.
First of all, you need to create a SparkSession (or SQLContext) and register a temporary table from a DataFrame, then query that temporary table like this:
results = spark.sql("SELECT * FROM people")
names = results.rdd.map(lambda p: p.name)  # map over the underlying RDD of Rows
So I guess the first line is this one:
https://github.com/apache/spark/blob/v2.4.4/sql/core/src/main/scala/org/apache/spark/sql/SparkSession.scala#L642
But many lines have already been "executed" by that point, specifically to create the SparkSession.
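If the goal is to see what Spark does with the query text after it enters SparkSession.sql(), a quick way to orient yourself (a minimal sketch, assuming only an active SparkSession named spark) is to ask for the extended plan, which prints the stages the query passes through on that code path:
# Prints the parsed logical plan, analyzed logical plan, optimized logical plan
# and physical plan for the query.
spark.sql("SELECT 1 AS x").explain(True)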

Spark SQL does not work with Hive View when Hive Regex Column support is enabled

Our team has a bunch of Hive QL, so when migrating to Spark we want to reuse the existing HQL, which uses the Hive regex column specification like SELECT `(ds)?+.+` FROM.
This can be done by simply enabling the following configuration:
spark.conf.set('spark.sql.parser.quotedRegexColumnNames', 'true')
However, with the above configuration enabled, querying any Hive view using Spark SQL fails and the Spark SQL analyzer complains:
pyspark.sql.utils.AnalysisException: u"Invalid usage of '*' in expression 'unresolvedextractvalue';"
A simple PySpark script that reproduces the issue looks like the following:
from pyspark.sql import SparkSession

def main():
    spark = SparkSession.builder.appName('test_spark').enableHiveSupport().getOrCreate()
    spark.conf.set('hive.exec.dynamic.partition.mode', 'nonstrict')
    spark.conf.set('spark.sql.sources.partitionOverwriteMode', 'dynamic')
    spark.conf.set('spark.sql.parser.quotedRegexColumnNames', 'true')
    spark_sql = r'''
        SELECT
            id_listing
        FROM
            <A Hive View>
        WHERE
            ds = '2019-03-09'
    '''
    result = spark.sql(spark_sql)
    print(result.count())

if __name__ == '__main__':
    main()
I was wondering if there is a way to make the regex column specification and Hive views coexist in Spark.
I observed this behavior in both Spark 2.3.0 and 2.4.0.
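One workaround to experiment with (not verified here) is to toggle the flag per statement, since spark.sql.parser.quotedRegexColumnNames is a runtime SQL conf: leave it off when querying Hive views and switch it on only for the statements that actually need the regex column specification. A rough sketch, where my_hive_view and my_table are placeholder names:
# Hedged sketch of a per-statement toggle; whether this sidesteps the analyzer
# error on views would need to be confirmed in your environment.
spark.conf.set('spark.sql.parser.quotedRegexColumnNames', 'false')
view_df = spark.sql("SELECT id_listing FROM my_hive_view WHERE ds = '2019-03-09'")

spark.conf.set('spark.sql.parser.quotedRegexColumnNames', 'true')
regex_df = spark.sql("SELECT `(ds)?+.+` FROM my_table")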

Spark DataFrame column name case sensitivity in sparkSQL and Spark Submit

When I query DataFrames in spark-shell (version 1.6), the column names are case-insensitive.
On spark-shell:
val a = sqlContext.read.parquet("<my-location>")
a.filter($"name" <=> "andrew").count()
a.filter($"NamE" <=> "andrew").count()
Both of the above give me the right count.
But when I build this into a jar and run it via spark-submit, the code below fails, saying NamE does not exist, since the underlying Parquet data was saved with the column as "name".
Fails:
a.filter($"NamE" <=> "andrew").count()
Passes:
a.filter($"name" <=> "andrew").count()
Am I missing something here? Is there a way I can make it case-insensitive?
I know I can use a select before filtering and alias all columns to lowercase, but I wanted to know why it behaves differently.
It's a bit tricky here: the plain answer is because you think you're using the same SQLContext in both cases when, actually, you're not. In spark-shell, a SQLContext is created for you, but it's actually a HiveContext:
scala> sqlContext.getClass
res3: Class[_ <: org.apache.spark.sql.SQLContext] = class org.apache.spark.sql.hive.HiveContext
and in your spark-submit, you probably use a plain SQLContext. According to #LostInOverflow's link, Hive is case insensitive while Parquet is not, so my guess is the following: by using a HiveContext you're probably using some code associated with Hive to read your Parquet data. Hive being case insensitive, it works fine. With a plain SQLContext it doesn't, which is the expected behavior.
The part you're missing:
... is case insensitive, while Parquet is not
You can try:
val b = df.toDF(df.columns.map(_.toLowerCase): _*)
b.filter(...)
Try to control the case sensitivity with sqlContext explicitly.
Turn off case sensitivity using the statement below and check if it helps.
sqlContext.sql("set spark.sql.caseSensitive=false")

Error inside where clause while comparing items in Spark SQL

I have a Cloudera VM running Spark version 1.6.0.
I created a DataFrame from a CSV file and am now filtering rows based on a where clause:
df = sqlContext.read.format('com.databricks.spark.csv').options(header='true').load('file:///home/cloudera/sample.csv')
df.registerTempTable("closedtrips")
result = sqlContext.sql("SELECT id,`safety rating` as safety_rating, route FROM closedtrips WHERE `trip frozen` == 'YES'")
However, it gives me a runtime error on the sql line:
py4j.protocol.Py4JJavaError: An error occurred while calling o21.sql.
: java.lang.RuntimeException: [1.96] failure: identifier expected
SELECT consigner,`safety rating` as safety_rating, route FROM closedtrips WHERE `trip frozen` == 'YES'
^
Where am I going wrong here?
The above command fails on the VM command line, but works fine when run in the Databricks environment.
Also, why are column names case sensitive in the VM? It fails to recognise 'trip frozen' because the actual column is 'Trip Frozen'.
All of this works fine in Databricks and breaks in the VM.
In your VM, are you creating sqlContext as a SQLContext or as a HiveContext?
In Databricks, the automatically-created sqlContext will always point to a HiveContext.
In Spark 2.0 this distinction between HiveContext and regular SQLContext should not matter because both have been subsumed by SparkSession, but in Spark 1.6 the two types of contexts differ slightly in how they parse SQL language input.
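If the VM script is creating a plain SQLContext, one way to test that hypothesis (a hedged sketch for Spark 1.6, keeping the question's query as-is) is to build a HiveContext explicitly and run the same statement through it:
from pyspark import SparkContext
from pyspark.sql import HiveContext

# Spark 1.6-style setup: HiveContext uses the HiveQL parser, which is what
# handles the backticked column names in the original query.
sc = SparkContext(appName="closedtrips_test")
sqlContext = HiveContext(sc)

df = sqlContext.read.format('com.databricks.spark.csv') \
    .options(header='true') \
    .load('file:///home/cloudera/sample.csv')
df.registerTempTable("closedtrips")

result = sqlContext.sql(
    "SELECT id, `safety rating` AS safety_rating, route "
    "FROM closedtrips WHERE `trip frozen` == 'YES'"
)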
