Spark SQL does not work with Hive View when Hive Regex Column support is enabled - apache-spark

Our team has a bunch of Hive QL, so when migrating to Spark we want to reuse the existing HQL, which uses Hive's regex column specification (e.g. SELECT `(ds)?+.+` FROM ...).
This can be done simply by enabling the following configuration:
spark.conf.set('spark.sql.parser.quotedRegexColumnNames', 'true')
However, with the above configuration enabled, querying any Hive view through Spark SQL fails, and the Spark SQL analyzer complains:
pyspark.sql.utils.AnalysisException: u"Invalid usage of '*' in expression 'unresolvedextractvalue';"
A simple PySpark script that reproduces the issue:
from pyspark.sql import SparkSession

def main():
    spark = SparkSession.builder.appName('test_spark').enableHiveSupport().getOrCreate()
    spark.conf.set('hive.exec.dynamic.partition.mode', 'nonstrict')
    spark.conf.set('spark.sql.sources.partitionOverwriteMode', 'dynamic')
    spark.conf.set('spark.sql.parser.quotedRegexColumnNames', 'true')
    spark_sql = r'''
        SELECT
            id_listing
        FROM
            <A Hive View>
        WHERE
            ds = '2019-03-09'
    '''
    result = spark.sql(spark_sql)
    print(result.count())

if __name__ == '__main__':
    main()
I was wondering if there is a way to make regex column specification and Hive views coexist in Spark.
I observed this behavior in both Spark 2.3.0 and 2.4.0.
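For reference, this is the kind of query the setting is meant to enable; a minimal sketch, assuming a hypothetical plain table some_db.listings (the failure above only shows up when the FROM clause is a Hive view):
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('regex_columns').enableHiveSupport().getOrCreate()
spark.conf.set('spark.sql.parser.quotedRegexColumnNames', 'true')

# With the setting enabled, a backquoted column name is interpreted as a regex:
# `(ds)?+.+` expands to every column except the partition column ds.
# some_db.listings is a hypothetical plain Hive table, not a view.
df = spark.sql("SELECT `(ds)?+.+` FROM some_db.listings WHERE ds = '2019-03-09'")
df.show()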

Related

Structured Streaming with Apache Spark coded in Spark.SQL

Streaming transformations in Apache Spark with Databricks are usually coded in either Scala or Python. However, can someone let me know if it's also possible to code streaming in SQL on Delta?
For example, the following sample code uses PySpark for Structured Streaming; can you let me know what the equivalent would be in Spark SQL?
simpleTransform = streaming.withColumn("stairs", expr("gt like '%stairs%'"))\
  .where("stairs")\
  .where("gt is not null")\
  .select("gt", "model", "arrival_time", "creation_time")\
  .writeStream\
  .queryName("simple_transform")\
  .format("memory")\
  .outputMode("update")\
  .start()
You can just register that streaming DF as a temporary view, and perform queries on it. For example (using rate source just for simplicity):
df=spark.readStream.format("rate").load()
df.createOrReplaceTempView("my_stream")
Then you can perform SQL queries directly on that view, e.g. select * from my_stream.
Or you can create another view, applying whatever transformations you need. For example, we can select only every 5th value if we use this SQL statement:
create or replace temp view my_derived as
select * from my_stream where (value % 5) == 0
and then query that view with select * from my_derived.
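Putting those pieces together, here is a minimal end-to-end PySpark sketch (rate source again; the query name my_derived_query is just an illustrative choice):
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sql_streaming_sketch").getOrCreate()

# Register the streaming source as a temporary view.
stream_df = spark.readStream.format("rate").load()
stream_df.createOrReplaceTempView("my_stream")

# Express the transformation purely in SQL; the result is still a streaming DataFrame.
derived = spark.sql("SELECT timestamp, value FROM my_stream WHERE (value % 5) = 0")

# Start the query; the memory sink and query name are only for ad-hoc inspection.
query = (derived.writeStream
         .queryName("my_derived_query")
         .format("memory")
         .outputMode("append")
         .start())

# The in-memory result can then be queried like any other table.
spark.sql("SELECT * FROM my_derived_query").show()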

PySpark throwing ParseException for syntactically correct Hive query

I have a DDL query that works fine within Beeline, but when I try to run the same query within a SparkSession it throws a ParseException.
from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession, HiveContext
# Initialise Hive metastore
SparkContext.setSystemProperty("hive.metastore.uris","thrift://localhost:9083")
# Create Spark Session
sparkSession = (SparkSession\
.builder\
.appName('test_case')\
.enableHiveSupport()\
.getOrCreate())
sparkSession.sql("CREATE EXTERNAL TABLE B LIKE A")
Pyspark Exception:
pyspark.sql.utils.ParseException: u"\nmismatched input 'LIKE' expecting <EOF>(line 1, pos 53)\n\n== SQL ==\nCREATE EXTERNAL TABLE B LIKE A\n-----------------------------------------------------^^^\n"
How can I make this HiveQL statement work within PySpark?
The problem seems to be that the query is executed as a Spark SQL query and not as a HiveQL query, even though I have enableHiveSupport activated for the SparkSession.
Spark SQL queries use the Spark SQL parser by default. To enable HiveQL syntax, I believe you need to give it a hint about your intent via a comment. (In fairness, I don't think this is well documented; I've only been able to find a tangential reference to this being a thing here, and only in the Scala version of the example.)
For example, I'm able to get my command to parse by writing:
%sql
-- `USING HIVE`
CREATE TABLE narf LIKE poit
Now, I don't have Hive Support enabled on my session, so my query fails... but it does parse!
Edit: Since your SQL statement is in a Python string, you can use a multi-line string to use the single-line comment syntax, like this:
sparkSession.sql("""
-- `USING HIVE`
CREATE EXTERNAL TABLE B LIKE A
""")
There's also a delimited comment syntax in SQL, e.g.
sparkSession.sql("/* `USING HIVE` */ CREATE EXTERNAL TABLE B LIKE A")
which may work just as well.

How can we convert an external table to managed table in SPARK 2.2.0?

The command below successfully converted external tables to managed tables in Spark 2.0.0:
ALTER TABLE {table_name} SET TBLPROPERTIES(EXTERNAL=FALSE);
However, the same command fails in Spark 2.2.0 with the error below:
Error in query: Cannot set or change the preserved property key:
'EXTERNAL';
As @AndyBrown pointed out in a comment, you have the option of dropping to the console and invoking the Hive statement there. In Scala this worked for me:
import sys.process._
val exitCode = Seq("hive", "-e", "ALTER TABLE {table_name} SET TBLPROPERTIES(\"EXTERNAL\"=\"FALSE\")").!
I faced this problem using Spark 2.1.1, where @Joha's answer does not work because spark.sessionState is not accessible due to being declared lazy.
In Spark 2.2.0 you can do the following:
import org.apache.spark.sql.catalyst.TableIdentifier
import org.apache.spark.sql.catalyst.catalog.CatalogTable
import org.apache.spark.sql.catalyst.catalog.CatalogTableType
val identifier = TableIdentifier("table", Some("database"))
val oldTable = spark.sessionState.catalog.getTableMetadata(identifier)
val newTableType = CatalogTableType.MANAGED
val alteredTable = oldTable.copy(tableType = newTableType)
spark.sessionState.catalog.alterTable(alteredTable)
The issue is case sensitivity on Spark 2.1 and above.
Please try setting the TBLPROPERTIES key in lower case:
ALTER TABLE <TABLE NAME> SET TBLPROPERTIES('external'='false')
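If you would rather issue that statement from Spark itself than from a Hive shell, a minimal PySpark call could look like this (spark is an existing SparkSession with Hive support, and db.my_table is a hypothetical table name):
# Assumes a SparkSession named spark with Hive support enabled.
spark.sql("ALTER TABLE db.my_table SET TBLPROPERTIES('external'='false')")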
I had the same issue while using a Hive external table. I solved the problem by directly setting the property external to false in the Hive metastore, using a Hive metastore client:
Table table = hiveMetaStoreClient.getTable("db", "table");
table.putToParameters("EXTERNAL","FALSE");
hiveMetaStoreClient.alter_table("db", "table", table,true);
I tried the above option from a Scala Databricks notebook, and the external table was converted to a MANAGED table. The good part is that desc formatted from Spark on the new table still shows the location to be on my ADLS. This was one limitation that Spark had: we cannot specify the location for a managed table.
As of now I am able to do a truncate table on it. Hopefully there will be a more direct option for creating a managed table with a specified location from Spark SQL.

Error inside where clause while comparing items in Spark SQL

I have a Cloudera VM running Spark version 1.6.0.
I created a DataFrame from a CSV file and am now filtering it using a WHERE clause:
df = sqlContext.read.format('com.databricks.spark.csv').options(header='true').load('file:///home/cloudera/sample.csv')
df.registerTempTable("closedtrips")
result = sqlContext.sql("SELECT id,`safety rating` as safety_rating, route FROM closedtrips WHERE `trip frozen` == 'YES'")
However, it gives me a runtime error on the sql line:
py4j.protocol.Py4JJavaError: An error occurred while calling o21.sql.
: java.lang.RuntimeException: [1.96] failure: identifier expected
SELECT consigner,`safety rating` as safety_rating, route FROM closedtrips WHERE `trip frozen` == 'YES'
^
Where am I going wrong here?
The above command fails on the VM command line, but works fine when run in the Databricks environment.
Also, why are column names case sensitive in the VM? It fails to recognise 'trip frozen' because the actual column is 'Trip Frozen'.
All of this works fine in Databricks and breaks in the VM.
In your VM, are you creating sqlContext as a SQLContext or as a HiveContext?
In Databricks, the automatically-created sqlContext will always point to a HiveContext.
In Spark 2.0 this distinction between HiveContext and regular SQLContext should not matter because both have been subsumed by SparkSession, but in Spark 1.6 the two types of contexts differ slightly in how they parse SQL language input.
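So on the Spark 1.6 VM, explicitly constructing a HiveContext should make the query parse; a minimal sketch, reusing the file path and column names from the question (behaviour assumed from the HiveContext/SQLContext distinction described above):
from pyspark import SparkContext
from pyspark.sql import HiveContext

sc = SparkContext(appName="closedtrips")
sqlContext = HiveContext(sc)  # HiveContext uses the HiveQL parser in Spark 1.6

df = (sqlContext.read.format('com.databricks.spark.csv')
      .options(header='true')
      .load('file:///home/cloudera/sample.csv'))
df.registerTempTable("closedtrips")

# Backticked column names containing spaces parse under the HiveQL parser;
# plain '=' is the safer equality operator across both dialects.
result = sqlContext.sql(
    "SELECT id, `safety rating` AS safety_rating, route "
    "FROM closedtrips WHERE `trip frozen` = 'YES'")
result.show()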

save Spark dataframe to Hive: table not readable because "parquet not a SequenceFile"

I'd like to save data in a Spark (v 1.3.0) dataframe to a Hive table using PySpark.
The documentation states:
"spark.sql.hive.convertMetastoreParquet: When set to false, Spark SQL will use the Hive SerDe for parquet tables instead of the built in support."
Looking at the Spark tutorial, it seems that this property can be set:
from pyspark.sql import HiveContext
sqlContext = HiveContext(sc)
sqlContext.sql("SET spark.sql.hive.convertMetastoreParquet=false")
# code to create dataframe
my_dataframe.saveAsTable("my_dataframe")
However, when I try to query the saved table in Hive it returns:
hive> select * from my_dataframe;
OK
Failed with exception java.io.IOException:java.io.IOException:
hdfs://hadoop01.woolford.io:8020/user/hive/warehouse/my_dataframe/part-r-00001.parquet
not a SequenceFile
How do I save the table so that it's immediately readable in Hive?
I've been there...
The API is kinda misleading on this one.
DataFrame.saveAsTable does not create a Hive table, but an internal Spark table source.
It also stores something in the Hive metastore, but not what you intend.
This remark was made on the spark-user mailing list regarding Spark 1.3.
If you wish to create a Hive table from Spark, you can use this approach:
1. Use CREATE TABLE ... via Spark SQL so the Hive metastore gets a real Hive table.
2. Use DataFrame.insertInto(tableName, overwriteMode) for the actual data (Spark 1.3); a sketch of both steps follows below.
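A rough PySpark sketch of that two-step approach, assuming a Spark 1.3-era HiveContext, the my_dataframe from the question, and a hypothetical table name and column set:
from pyspark.sql import HiveContext

sqlContext = HiveContext(sc)  # sc is an existing SparkContext

# Step 1: create the table through HiveQL so the metastore entry is a real Hive table.
sqlContext.sql("""
    CREATE TABLE IF NOT EXISTS my_hive_table (user_id STRING, email STRING, ts STRING)
    STORED AS PARQUET
""")

# Step 2: write the DataFrame's rows into that table (Spark 1.3 API).
my_dataframe.insertInto("my_hive_table", overwrite=True)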
I hit this issue last week and was able to find a workaround.
Here's the story:
I can see the table in Hive if I create the table without partitionBy:
spark-shell>someDF.write.mode(SaveMode.Overwrite)
.format("parquet")
.saveAsTable("TBL_HIVE_IS_HAPPY")
hive> desc TBL_HIVE_IS_HAPPY;
OK
user_id string
email string
ts string
But Hive can't understand the table schema (the schema is empty...) if I do this:
spark-shell>someDF.write.mode(SaveMode.Overwrite)
.format("parquet")
.saveAsTable("TBL_HIVE_IS_NOT_HAPPY")
hive> desc TBL_HIVE_IS_NOT_HAPPY;
# col_name data_type from_deserializer
[Solution]:
spark-shell>sqlContext.sql("SET spark.sql.hive.convertMetastoreParquet=false")
spark-shell>df.write
.partitionBy("ts")
.mode(SaveMode.Overwrite)
.saveAsTable("Happy_HIVE")//Suppose this table is saved at /apps/hive/warehouse/Happy_HIVE
hive> DROP TABLE IF EXISTS Happy_HIVE;
hive> CREATE EXTERNAL TABLE Happy_HIVE (user_id string,email string,ts string)
PARTITIONED BY(day STRING)
STORED AS PARQUET
LOCATION '/apps/hive/warehouse/Happy_HIVE';
hive> MSCK REPAIR TABLE Happy_HIVE;
The problem is that the data source table created through the DataFrame API (partitionBy + saveAsTable) is not compatible with Hive (see this link). By setting spark.sql.hive.convertMetastoreParquet to false as suggested in the docs, Spark only puts the data onto HDFS but won't create the table in Hive. You can then go into the Hive shell manually and create an external table with the proper schema and partition definition pointing to the data location.
I've tested this in Spark 1.6.1 and it worked for me. I hope this helps!
I have done this in PySpark, Spark version 2.3.0:
Create an empty table where we need to save/overwrite the data, like:
create table databaseName.NewTableName like databaseName.OldTableName;
Then run the command below:
df1.write.mode("overwrite").partitionBy("year","month","day").format("parquet").saveAsTable("databaseName.NewTableName");
The issue is that you can't read this table with Hive, but you can read it with Spark.
(A note on MSCK REPAIR TABLE, used above: it recovers partitions whose metadata doesn't already exist. In other words, it adds any partitions that exist on HDFS but not in the metastore to the Hive metastore.)
