Error inside where clause while comparing items in Spark SQL - apache-spark

I have cloudera vm running spark version 1.6.0
I created a dataframe from a CSV file and now filtering columns based on some where clause
df = sqlContext.read.format('com.databricks.spark.csv').options(header='true').load('file:///home/cloudera/sample.csv')
df.registerTempTable("closedtrips")
result = sqlContext.sql("SELECT id,`safety rating` as safety_rating, route FROM closedtrips WHERE `trip frozen` == 'YES'")
However it gives me runtime error on the sql line.
py4j.protocol.Py4JJavaError: An error occurred while calling o21.sql.
: java.lang.RuntimeException: [1.96] failure: identifier expected
SELECT consigner,`safety rating` as safety_rating, route FROM closedtrips WHERE `trip frozen` == 'YES'
^
Where am I going wrong here?
The above command fails in vm command line, however works fine when ran on databricks environment
Also why are column names case sensitive in vm, it fails to recognise 'trip frozen' because the actual column is 'Trip Frozen'.
All of this works fine in databricks and breaks in vm

In your VM, are you creating sqlContext as a SQLContext or as a HiveContext?
In Databricks, the automatically-created sqlContext will always point to a HiveContext.
In Spark 2.0 this distinction between HiveContext and regular SQLContext should not matter because both have been subsumed by SparkSession, but in Spark 1.6 the two types of contexts differ slightly in how they parse SQL language input.

Related

What changes are required when moving simple synapsesql implementation from Spark 2.4.8 to Spark 3.1.2?

I have a simple implementation of .write.synapsesql() method (code shown below) that works in Spark 2.4.8 but not in Spark 3.1.2 (documentation/example here). The data in use is a simple notebook-created foobar type table. Searching for key phrases online from and about the error did not turn up any new information for me.
What is the cause of the error in 3.1.2?
Spark 2.4.8 version (behaves as desired):
val df = spark.sql("SELECT * FROM TEST_TABLE")
df.write.synapsesql("my_local_db_name.schema_name.test_table", Constants.INTERNAL, None)
Spark 3.1.2 version (extra method is same as in documentation, can also be left out with a similar result):
val df = spark.sql("SELECT * FROM TEST_TABLE")
df.write.synapsesql("my_local_db_name.schema_name.test_table", Constants.INTERNAL, None,
Some(callBackFunctionToReceivePostWriteMetrics))
The resulting error (only in 3.1.2) is:
WriteFailureCause -> java.lang.IllegalArgumentException: Failed to derive `https` scheme based staging location URL for SQL COPY-INTO}
As the documentation from the question states, ensure that you are setting the options correctly with
val writeOptionsWithAADAuth:Map[String, String] = Map(Constants.SERVER -> "<dedicated-pool-sql-server-name>.sql.azuresynapse.net",
Constants.TEMP_FOLDER -> "abfss://<storage_container_name>#<storage_account_name>.dfs.core.windows.net/<some_temp_folder>")
and including the options in your .write statement like so:
df.write.options(writeOptionsWithAADAuth).synapsesql(...)

Spark SQL does not work with Hive View when Hive Regex Column support is enabled

Our team has a bunch of Hive QL, so when migrating to spark, we want to reuse the existng HQL which uses Hive Regex Column Specification like SELECT `(ds)?+.+` FROM.
This could be done simply be simply enable the following configuration:
spark.conf.set('spark.sql.parser.quotedRegexColumnNames', 'true')
However, with the above configuration enabled, querying any Hive view using Spark SQL failed and Spark SQL Analyzer will complain
pyspark.sql.utils.AnalysisException: u"Invalid usage of '*' in expression 'unresolvedextractvalue';"
A simple pyspark script to reproduce the issue is like the following:
from pyspark.sql import SparkSession
def main():
spark = SparkSession.builder.appName('test_spark').enableHiveSupport().getOrCreate()
spark.conf.set('hive.exec.dynamic.partition.mode', 'nonstrict')
spark.conf.set('spark.sql.sources.partitionOverwriteMode','dynamic')
spark.conf.set('spark.sql.parser.quotedRegexColumnNames', 'true')
spark_sql = r'''
SELECT
id_listing
FROM
<A Hive View>
WHERE
ds = '2019-03-09'
'''
result = spark.sql(spark_sql)
print(result.count())
if __name__ == '__main__':
main()
I was wondering if there is a way to make Regex Column specification and Hive View coexist in Spark.
I observed this behavior in both Spark 2.3.0 and 2.4.0

PySpark throwing ParseException for syntactical correct Hive Query

I got a DDL query that works fine within beeline, but when I try to run the same query within a sparkSession it throws a parse Exception.
from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession, HiveContext
# Initialise Hive metastore
SparkContext.setSystemProperty("hive.metastore.uris","thrift://localhsost:9083")
# Create Spark Session
sparkSession = (SparkSession\
.builder\
.appName('test_case')\
.enableHiveSupport()\
.getOrCreate())
sparkSession.sql("CREATE EXTERNAL TABLE B LIKE A")
Pyspark Exception:
pyspark.sql.utils.ParseException: u"\nmismatched input 'LIKE' expecting <EOF>(line 1, pos 53)\n\n== SQL ==\nCREATE EXTERNAL TABLE B LIKE A\n-----------------------------------------------------^^^\n"
How Can I make the hiveQL function work within pySpark?
The problem seems to be that the query is executed like a SparkSQL-Query and not like a HiveQL-Query, even though I got enableHiveSupport activated for the sparkSession.
Spark SQL queries use SparkSQL by default. To enable HiveQL syntax, I believe you need to give it a hint about your intent via a comment. (In fairness, I don't think this is well-documented; I've only been able to find a tangential reference to this being a thing here, and only in the Scala version of the example.)
For example, I'm able to get my command to parse by writing:
%sql
-- `USING HIVE`
CREATE TABLE narf LIKE poit
Now, I don't have Hive Support enabled on my session, so my query fails... but it does parse!
Edit: Since your SQL statement is in a Python string, you can use a multi-line string to use the single-line comment syntax, like this:
sparkSession.sql("""
-- `USING HIVE`
CREATE EXTERNAL TABLE B LIKE A
""")
There's also a delimited comment syntax in SQL, e.g.
sparkSession.sql("/* `USING HIVE` */ CREATE EXTERNAL TABLE B LIKE A")
which may work just as well.

Spark org.apache.spark.sql.catalyst.analysis.UnresolvedException error in loading Hive table

While trying to load data from a dataset into Hive table getting the error:
org.apache.spark.sql.catalyst.analysis.UnresolvedException: Invalid
call to dataType on unresolved object, tree: 'ipl_appl_signed_date
My dataset contains same columns as the Hive table and the column for which am getting the error has Date datatype in my code(Java) as well as in Hive.
java code:
Date IPL_APPL_SIGNED_DATE =rs.getDate("DTL.IPL_APPL_SIGNED_DATE"); //using jdbc to get record.
Encoder<DimPolicy> encoder = Encoders.bean(Foo.class);
Dataset<DimPolicy> test=spark.createDataset(allRows,encoder); //spark is the spark session
test.write().mode("append").insertInto("someSchema.someTable"); //
I think the issue is due to a bug in Spark i.e. [SPARK-26379] Use dummy TimeZoneId for CurrentTimestamp to avoid UnresolvedException in CurrentBatchTimestamp, that got fixed in 2.3.3, 2.4.1, 3.0.0.
A solution is to downgrade to the version of Spark that is unaffected by the bug (or wait for a new version).

Select not null values in dataframes in Spark

I'm reading a CSV file in Spark 2.0 and counting not null values in a column using the following:
val df = spark.read.option("header", "true").csv(dir)
df.filter("IncidntNum is not null").count()
and it works fine when I test it using spark-shell. When I create a jar file containing the code and submit it to spark-submit, I get an exception at second line above :
Exception in thread "main" org.apache.spark.sql.catalyst.parser.ParseException:
extraneous input '' expecting {'(', 'SELECT', ..
== SQL ==
IncidntNum is not null
^^^
at org.apache.spark.sql.catalyst.parser.ParseException.withCommand(ParseDriver.scala:197)
Any idea why this would happen when I'm using the code that works in spark-shell?
This question has been sitting around a while, but better late than never.
The most likely reason I can think of is that when running using spark-submit you are running in "cluster" mode. That means the driver process will be located on a different machine than when you run spark-shell. That could cause Spark to read a different file.

Resources