How to execute HQL file in pyspark using Hive warehouse connector - apache-spark

I have an HQL file that I want to run using PySpark with the Hive Warehouse Connector. There is an executeQuery method for running queries. I want to know whether HQL files can be run that way, and whether complex queries can be run like that.
Please suggest.
Thanks

I have the following solution, where I have assumed that there will be multiple queries in the HQL file.
HQL File : sample_query.hql
select * from schema.table;
select * from schema.table2;
Code: Iterate over each query. You can do as you wish (in terms of HWC operations) in each iteration.
with open('sample_query.hql', 'r') as file:
    hql_file = file.read().rstrip()

# split on ';', trim whitespace, and skip empty fragments (e.g. after the final ';')
for query in [x.strip() for x in hql_file.split(";") if x.strip()]:
    hive.executeQuery(query)
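For context, the hive handle used above is assumed to come from the Hive Warehouse Connector session builder. A minimal sketch (assuming the HWC jar and the pyspark_llap zip are available to the job) might look like this:
from pyspark.sql import SparkSession
from pyspark_llap import HiveWarehouseSession  # shipped with the Hive Warehouse Connector

# Build a SparkSession first, then an HWC session on top of it.
spark = SparkSession.builder.appName("hql-runner").getOrCreate()
hive = HiveWarehouseSession.session(spark).build()

# hive.executeQuery(...) returns a DataFrame for SELECT-style statements;
# depending on the HWC version, DDL/DML may go through hive.executeUpdate(...) instead.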

Related

Structured Streaming with Apache Spark coded in Spark.SQL

Streaming transformations in Apache Spark with Databricks are usually coded in either Scala or Python. However, can someone let me know if it's also possible to code streaming in SQL on Delta?
For example, the following sample code uses PySpark for Structured Streaming; can you let me know what the equivalent would be in spark.sql?
simpleTransform = streaming.withColumn("stairs", expr("gt like '%stairs%'"))\
    .where("stairs")\
    .where("gt is not null")\
    .select("gt", "model", "arrival_time", "creation_time")\
    .writeStream\
    .queryName("simple_transform")\
    .format("memory")\
    .outputMode("update")\
    .start()
You can just register that streaming DF as a temporary view, and perform queries on it. For example (using rate source just for simplicity):
df = spark.readStream.format("rate").load()
df.createOrReplaceTempView("my_stream")
Then you can perform SQL queries directly on that view, e.g. select * from my_stream.
Or you can create another view, applying whatever transformations you need. For example, we can select only every 5th value if we use this SQL statement:
create or replace temp view my_derived as
select * from my_stream where (value % 5) == 0
and then query that view with select * from my_derived.
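Putting the pieces together, a runnable sketch might look like this (using the rate source and a memory sink; the query name my_derived_query is made up):
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sql-streaming").getOrCreate()

# Register the streaming DataFrame as a temp view so it is visible to SQL.
df = spark.readStream.format("rate").load()
df.createOrReplaceTempView("my_stream")

# Define the transformation purely in SQL.
spark.sql("""
    create or replace temp view my_derived as
    select * from my_stream where (value % 5) == 0
""")

# spark.table() on the derived view returns a streaming DataFrame,
# so it can be written out with writeStream as usual.
query = (spark.table("my_derived")
              .writeStream
              .queryName("my_derived_query")
              .format("memory")
              .outputMode("append")
              .start())

# The in-memory sink can then be queried, e.g.:
# spark.sql("select * from my_derived_query").show()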

Writing spark.sql dataframe result to parquet file

I started the following Spark SQL session:
from pyspark.sql import SparkSession

# creating Spark context and connection
spark = (SparkSession.builder.appName("appName").enableHiveSupport().getOrCreate())
and am able to see the results of the following query:
spark.sql("select year(plt_date) as Year, month(plt_date) as Mounth, count(build) as B_Count, count(product) as P_Count from first_table full outer join second_table on key1=CONCAT('SS',key_2) group by year(plt_date), month(plt_date)").show()
However, when I try to write the resulting dataframe from this query to hdfs, I get the following error:
I am able to save the resulting dataframe of a simpler version of this query to the same path. The problem appears when I add functions such as count(), year(), etc.
What is the problem, and how can I save the results to HDFS?
It is giving the error due to the '(' present in the column name 'year(CAST(plt_date AS DATE))'.
Use selectExpr to rename it:
data = data.selectExpr("year(CAST(plt_date AS DATE)) as nameofcolumn")
Upvote if it works.
Refer : Rename Spark Column
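Putting that together, a sketch of the full flow might look like this (the HDFS output path is made up; aliasing every column keeps characters like '(' out of the Parquet schema):
# Alias every output column so no generated name (e.g. 'year(CAST(plt_date AS DATE))')
# with characters Parquet rejects ends up in the schema; the path below is hypothetical.
result = spark.sql("""
    select year(plt_date)  as Year,
           month(plt_date) as Month,
           count(build)    as B_Count,
           count(product)  as P_Count
    from first_table
    full outer join second_table on key1 = CONCAT('SS', key_2)
    group by year(plt_date), month(plt_date)
""")

result.write.mode("overwrite").parquet("hdfs:///tmp/plt_counts")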

The first entry point to Spark SQL

I am having trouble finding the first line executed in the Spark source code after I run "spark.sql(SQL_QUERY).explain()".
Does anyone have any idea which module/package I could start looking into?
Thanks.
First of all, you need to create a SparkSession (or SQLContext) and register a temporary table from a DataFrame, then query the temporary table like this:
results = spark.sql("SELECT * FROM people")
names = results.rdd.map(lambda p: p.name)  # in Spark 2.x the DataFrame must be converted to an RDD before map()
So I guess the first line is this one:
https://github.com/apache/spark/blob/v2.4.4/sql/core/src/main/scala/org/apache/spark/sql/SparkSession.scala#L642
But many lines have already been "executed" before that, specifically to create the SparkSession.
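For reference, a minimal self-contained snippet that exercises that entry point might look like this (the people data is made up):
from pyspark.sql import Row, SparkSession

spark = SparkSession.builder.appName("entry-point-demo").getOrCreate()

# Register a temporary view so the SQL query has something to resolve against.
people = spark.createDataFrame([Row(name="Alice", age=34), Row(name="Bob", age=23)])
people.createOrReplaceTempView("people")

# spark.sql(...) is the entry point referenced above; explain() prints the query plans.
spark.sql("SELECT name FROM people WHERE age > 30").explain()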

How to do append insertion in sparksql?

I have an API endpoint written with Spark SQL, with the following sample code. Every time the API accepts a request it runs sparkSession.sql(sql_to_hive), which creates a single file in HDFS. Is there any way to do the insert by appending data to the existing file in HDFS? Thanks.
sqlContext = SQLContext(sparkSession.sparkContext)
df = sqlContext.createDataFrame(ziped_tuple_list, schema=schema)
df.registerTempTable('TMP_TABLE')

sql_to_hive = 'insert into log.%(table_name)s partition%(partition)s select %(title_str)s from TMP_TABLE' % {
    'table_name': table_name,
    'partition': partition_day,
    'title_str': title_str
}
sparkSession.sql(sql_to_hive)
I don't think it is possible to append data to an existing file.
But you can work around this in either of these ways:
Approach 1:
Using Spark, write to an intermediate temporary table and then insert overwrite into the final table:
existing_df = spark.table("existing_hive_table")    # get the current data from Hive
current_df = ...                                     # new dataframe with the rows to add
union_df = existing_df.union(current_df)
union_df.write.mode("overwrite").saveAsTable("temp_table")    # write the data to a temp table
temp_df = spark.table("temp_table")                  # read back from the temp table
temp_df.repartition(<number>).write.mode("overwrite").saveAsTable("existing_hive_table")    # overwrite the final table
Approach 2:
Hive (not Spark) supports overwriting and selecting from the same table, i.e.
insert overwrite table default.t1 partition(partition_column)
select * from default.t1; -- overwrite and select from the same t1 table
If you follow this approach, a Hive job needs to be triggered once your Spark job finishes.
Hive will acquire a lock while running the overwrite/select on the same table, so any other job writing to the table will have to wait.
In addition: the ORC format offers alter table ... concatenate, which merges small ORC files to create a new, larger file:
alter table <db_name>.<orc_table_name> [partition (partition_column="val")] concatenate;
We can also use distribute by and sort by clauses to control the number of output files.
Approach 3: use hadoop fs -getmerge to merge all the small files into one (this method works for text files; I haven't tried it for ORC, Avro, etc. formats).
When you write the resulting dataframe:
result_df = sparkSession.sql(sql_to_hive)
set its save mode to append (in PySpark the mode is passed as a string, followed by the writer call for your target; the table name here follows the question's code):
result_df.write.mode("append").saveAsTable("log." + table_name)

Spark 2.2.1 HQL to ALTER Hive Table with partitions fails with InvalidOperationException

I have an application where I'm sending HQL using the SparkSession.sql() method.
First I create a table with partitions:
CREATE TABLE table_name (Id BigInt) PARTITIONED BY (Age BigInt)
After this I have the following ALTER TABLE statement:
ALTER TABLE table_name ADD COLUMNS(Name String)
The ALTER command fails with the following exception:
InvalidOperationException(message: partition keys can not be changed)
When I try to execute the ALTER statement on a table which doesn't have any partitions, it runs fine.
Also, the above code runs fine with Spark 1.6.
The above HQL statements also work fine if I run them directly in Hive.
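For reference, the full reproduction through SparkSession.sql() looks roughly like this (a sketch using the statements above):
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("alter-repro").enableHiveSupport().getOrCreate()

# The CREATE succeeds; on Spark 2.2.1 the ALTER fails because the table is partitioned.
spark.sql("CREATE TABLE table_name (Id BIGINT) PARTITIONED BY (Age BIGINT)")
spark.sql("ALTER TABLE table_name ADD COLUMNS (Name STRING)")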
I read the Spark change log but couldn't find any explanation for this behavior.
Can anyone help me figure out what is happening?
