Spark SQL passing variables - Synapse (Spark pool) - Azure

I have the following Spark SQL (Spark pool - Spark 3.0) code and I want to pass a variable to it. How can I do that? I tried the following:
# cell 1 (toggle parameter cell):
%%pyspark
stat = 'A'
# cell 2:
select * from silver.employee_dim where Status = '$stat'

When you're running your cell as PySpark, you can pass a variable to your query like this:
# cell 1 (toggle parameter cell):
%%pyspark
stat = 'A'  # define the variable
# cell 2:
%%pyspark
query = "select * from silver.employee_dim where Status = '" + stat + "'"
spark.sql(query)  # execute the SQL
Since you're executing a SELECT statement, I assume you might want to load the result to a DataFrame:
sqlDf = spark.sql(query)
sqlDf.head(5)  # return the first 5 rows
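As a hedged alternative sketch, an f-string (available on Spark 3.0 pools, which run Python 3) makes the substitution easier to read than concatenation; the table name is the question's, and note that any string interpolation is open to SQL injection if the variable comes from untrusted input:
%%pyspark
stat = 'A'
query = f"select * from silver.employee_dim where Status = '{stat}'"  # f-string substitution
sqlDf = spark.sql(query)
sqlDf.show(5)  # display the first 5 rows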

Related

Does Spark load the entire table into memory if the select condition is based on an RDD transformation?

Dataset<Row> a = spark.read().format("com.memsql.spark.connector")
        .option("query", "select * from a").load();
a = a.filter((FilterFunction<Row>) row -> row.getAs("x").equals(row.getAs("y")));
String xstring = "...select all values of x from a and make comma separated string";
Dataset<Row> b = spark.read().format("com.memsql.spark.connector")
        .option("query", "select * from b where x in " + xstring).load();
b.show();
In this case, would Spark load the entire b table into memory and then filter out the xstring rows, or would it actually build that xstring first and then load only a subset of table b into memory when we call show?
When MemSQL is queried using option("query", "select * from ..."), the entire result (not the table) is read from MemSQL into the executors. The MemSQL Spark Connector 2.0 supports column and filter pushdown, for which the SQL needs to contain the filter and join conditions, rather than applying the filter and join on the DataFrame. In your example, predicate pushdown will be used: the entire table 'a' will be read because there is no filter condition, xstring will be built, and then only the part of table 'b' that matches the x in (...) condition is read.
Here is the MemSQL documentation explaining this.
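A hedged PySpark sketch of the pushdown-friendly approach (the connector format string and the table/column names mirror the question and are illustrative); the key point is that the IN filter lives inside the SQL string, so MemSQL returns only the matching subset of b:
# Build the comma-separated list of x values from table a first
a = (spark.read
     .format("com.memsql.spark.connector")
     .option("query", "select x from a")
     .load())
xstring = ", ".join(str(row.x) for row in a.collect())

# The filter is pushed into the SQL, so only matching rows of b are read
b = (spark.read
     .format("com.memsql.spark.connector")
     .option("query", "select * from b where x in (" + xstring + ")")
     .load())
b.show()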

Using parameterized SQL query while reading large table into pandas dataframe using COPY

I am trying to read a large table (10-15M rows) from a database into a pandas DataFrame, and I'm using the following code:
import tempfile
import pandas

def read_sql_tmpfile(query, db_engine):
    with tempfile.TemporaryFile() as tmpfile:
        copy_sql = "COPY ({query}) TO STDOUT WITH CSV {head}".format(
            query=query, head="HEADER"
        )
        conn = db_engine.raw_connection()
        cur = conn.cursor()
        cur.copy_expert(copy_sql, tmpfile)  # stream the COPY output into the temp file
        tmpfile.seek(0)
        df = pandas.read_csv(tmpfile)
        return df
I can use this if I have a simple query like the one below and pass it into the above function:
'''SELECT * from hourly_data'''
But what if I want to pass a variable into this query, i.e.
'''SELECT * from hourly_data where starttime >= %s '''
Where do I pass the parameter?
You cannot use parameters with COPY. Unfortunately, that extends to the query you use inside COPY, even though you could use parameters with that query if you ran it directly.
You will have to construct a query string that includes the parameter (beware of SQL injection) and use that with COPY.
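A hedged sketch of one safe way to build that string, assuming the engine is backed by psycopg2: cursor.mogrify interpolates the parameters client-side with proper quoting, and the result is then embedded in the COPY statement (read_sql_tmpfile_param is a hypothetical name):
import tempfile
import pandas

def read_sql_tmpfile_param(query_template, params, db_engine):
    conn = db_engine.raw_connection()
    cur = conn.cursor()
    # mogrify returns the query with params safely quoted (as bytes)
    query = cur.mogrify(query_template, params).decode()
    copy_sql = "COPY ({query}) TO STDOUT WITH CSV HEADER".format(query=query)
    with tempfile.TemporaryFile() as tmpfile:
        cur.copy_expert(copy_sql, tmpfile)
        tmpfile.seek(0)
        return pandas.read_csv(tmpfile)

# Usage:
# df = read_sql_tmpfile_param(
#     "SELECT * from hourly_data where starttime >= %s",
#     ("2021-01-01",), db_engine)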

Unable to query complex SQL statements, from hive table using pyspark

Hi, I am trying to query a Hive table from a Spark context.
My code:
from pyspark.sql import HiveContext
hive_context = HiveContext(sc)
bank = hive_context.table('db.table_name')
bank.show()
Simple queries like this work fine, without any error,
but when I try the below query:
query = """with table1 as ( select distinct a,b
from db_first.table_first
order by b )
--select * from table1 order by b
,c as ( select *
from db_first.table_two)
--select * from c
,d as ( select *
from c
where upper(e) = 'Y')
--select * from d
,f as ( select table1.b
,cast(regexp_extract(g,'(\\d+)-(A|B)-
(\\d+)(.*)',1) as Int) aid1
,regexp_extract(g,'(\\d+)-(A|B)-
(\\d+)(.*)',2) aid2
,cast(regexp_extract(g,'(\\d+)-(A|B)-
(\\d+)(.*)',3) as Int) aid3
,from_unixtime(cast(substr(lastdbupdatedts,1,10) as int),"yyyy-MM-dd
HH:mm:ss") lastupdts
,d.*
from d
left outer join table1
on d.hiba = table1.a)
select * from f order by b,aid1,aid2,aid3 limit 100"""
I get the below error, please help.
ParseExceptionTraceback (most recent call last)
<ipython-input-27-cedb6fad210d> in <module>()
3 hive_context = HiveContext(sc)
4 #bank = hive_context.table("bdalab.test_prodapt_inv")
----> 5 bank = hive_context.table(first)
ParseException: u"\nmismatched input '*' expecting <EOF>(line 1, pos 7)\n\n== SQL ==\nselect *
You need to use the .sql method instead of the .table method if you are running a SQL query.
1. Using the .table method, you need to provide a table name:
>>> hive_context.table("<db_name>.<table_name>").show()
2. Using the .sql method, provide your query (including the with CTE expression):
>>> first = "with cte..."
>>> hive_context.sql(first).show()
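Putting that together with the question's code, a minimal sketch (reusing the query string defined above):
from pyspark.sql import HiveContext

hive_context = HiveContext(sc)
bank = hive_context.sql(query)  # .sql() parses a full statement, CTEs included
bank.show()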

spark magic - enter sql context as string

Connecting to Spark over Livy works fine in Jupyter,
as does the following spark magic:
%%spark -c sql
select * from some_table
Now how can I use string variables to query tables?
The following does not work:
query = 'select * from some_table'
Next cell:
%%spark -c sql
query
Nor does the following work:
%%spark -c sql
'select * from some_table'
Any ideas? Is it possible to "echo" the content of a string variable into a cell?
Seems like I found a solution.
There is a function that turns strings into cell magic commands:
%%local
from IPython import get_ipython

ipython = get_ipython()
line = '-c sql -o df'  # magic arguments: use the SQL context, output to df
query = 'select * from some_table'  # the query held in a string variable
ipython.run_cell_magic(magic_name='spark', line=line, cell=query)
After this, the query result is in the pandas DataFrame df.
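As a hedged variation on the same trick, the string can be assembled from Python variables before it is dispatched (table_name is illustrative):
%%local
from IPython import get_ipython

ipython = get_ipython()
table_name = 'some_table'  # illustrative; substitute your own table
query = f'select * from {table_name}'  # build the SQL from a variable
ipython.run_cell_magic(magic_name='spark', line='-c sql -o df', cell=query)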

Converting the Hive SQL output to an array[Double]

I am reading some data from a Hive table using a HiveContext in Spark, and the output is a DataFrame of Rows with only one column. I need to convert this to an Array[Double]. I have tried all possible ways to do it myself with no success. Can somebody please help with this?
val qRes = hiveContext.sql("""
    select sum(EQUnit) * sum(Units)
    from pos_Tran_orc t
    inner join brand_filter b
      on t.mbbrandid = b.mbbrandid
    inner join store_filter s
      on t.msstoreid = s.msstoreid
    group by Transdate
""")
What next?
You can simply map over the result using the Row.getDouble method:
qRes.map(_.getDouble(0)).collect()
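For completeness, a hedged PySpark equivalent of the same idea (position 0 is assumed to be the single numeric column, as in the Scala snippet):
values = qRes.rdd.map(lambda row: row[0]).collect()  # list of floats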
