How to query a TempView in Databricks - apache-spark

I have created a temp view as follows:
df.createOrReplaceTempView("table_test")
However, when I run the following command, it doesn't work:
%sql select top 10 * from table_test;

Try the following:
%sql select * from table_test limit 10;
TOP 10 is specific to SQL Server (T-SQL); Spark SQL, the engine behind Databricks notebooks, uses LIMIT instead.
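As a quick illustration of the difference (using SQLite here purely as a stand-in engine, since it follows the same LIMIT convention as Spark SQL and most non-T-SQL dialects):

```python
# Illustration: LIMIT restricts the number of returned rows, as in Spark SQL.
# SQLite is used only as a stand-in; the table name mirrors the question.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table_test (id INTEGER)")
conn.executemany("INSERT INTO table_test VALUES (?)", [(i,) for i in range(100)])

rows = conn.execute("SELECT * FROM table_test LIMIT 10").fetchall()
print(len(rows))  # 10
```

In T-SQL the equivalent would be `SELECT TOP 10 * FROM table_test`, which is why the original query fails in a Databricks notebook.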

Related

Databricks accessing DataFrame in SQL

I'm learning Databricks and got stuck on the simplest step.
I'd like to query my DataFrame from the Databricks SQL side.
Here are my steps:
df = spark.read.csv('dbfs:/databricks-datasets/COVID/covid-19-data/us.csv', header=True, inferSchema=True)
display(df)
Everything is fine, df is displayed. Then submitting:
df.createOrReplaceGlobalTempView("covid")
Finally:
%sql
show tables
No results are displayed. When trying:
display(spark.sql('SELECT * FROM covid LIMIT 10'))
Getting the error:
[TABLE_OR_VIEW_NOT_FOUND] The table or view `covid` cannot be found
When executing:
df.createGlobalTempView("covid")
I get a message that covid already exists.
How can I access my df from the SQL side?
In a Databricks notebook, if you want to use SQL to query a DataFrame loaded in Python, you can do so as follows (using your example data):
Set up df in Python:
df = spark.read.csv('dbfs:/databricks-datasets/COVID/covid-19-data/us.csv', header=True, inferSchema=True)
Set up your global view:
df.createGlobalTempView("covid")
Then a simple SQL query is equivalent to the display() function:
%sql
SELECT * FROM global_temp.covid
If you want to avoid the global_temp prefix, use df.createTempView instead. Global temp views are registered in the system-preserved global_temp database (which is also why they don't appear under a plain show tables), whereas a regular temp view is session-scoped and can be queried without qualification.

Passing a variable value into a direct Databricks SQL query instead of spark.sql(""" """)

In the Databricks notebook, I have written this query:
%sql
set four_date='2021-09-16';
select * from df2_many where four_date='{{four_date}}'
It's not working. Please advise how to apply the variable in the direct query instead of spark.sql(""" """).
Note: don't suggest $; it prompts for a value in a text box. Please confirm whether there is an alternative solution.
How can I apply variable values in a direct query in Databricks?
If you are using a Databricks Notebook, you will need to use Widgets:
https://docs.databricks.com/notebooks/widgets.html
CREATE WIDGET TEXT four_date DEFAULT "2021-09-16";
SELECT * FROM df2_many WHERE four_date = getArgument("four_date")

How to execute HQL file in pyspark using Hive warehouse connector

I have an HQL file that I want to run using PySpark with the Hive Warehouse Connector. HWC provides an executeQuery method for running queries; can HQL files be run through it as well? Can we run complex queries that way?
Please suggest.
Thanks
The following solution assumes there will be multiple queries in the HQL file.
HQL File : sample_query.hql
select * from schema.table;
select * from schema.table2;
Code: iterate over each query; you can perform whatever HWC operation you need in each iteration.
with open('sample_query.hql', 'r') as file:
    hql_file = file.read().rstrip()

# Split on ';', dropping empty fragments and surrounding whitespace
for query in (q.strip() for q in hql_file.split(';') if q.strip()):
    hive.executeQuery(query)

Hive query result to Excel

I am a newbie to Hadoop and Hive. My current requirement is to collect the counts of records loaded into 15 tables on each run day, instead of executing each SELECT COUNT(*) query and copying the output into Excel manually. Could anyone suggest the best method to automate this task?
Note: we do not have a GUI for running Hive queries; we submit them from a normal Unix terminal.
Export to a CSV or TSV file, then open the file in Excel. Hive normally generates tab-separated output; here is how to convert it to comma-separated if you prefer CSV:
hive -e "SELECT 'table1' as source, count(*) cnt FROM db.table1
UNION ALL
SELECT 'table2' as source, count(*) cnt FROM db.table2" | tr "\t" "," > mydata.csv
Add more tables to the query.
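With 15 tables, writing the UNION ALL query by hand gets tedious. A short Python sketch can generate it from a list of table names (the names and the db schema below are placeholders for your real ones):

```python
# Sketch: build a single UNION ALL count query from a list of table names.
# "db" and the table names are placeholders; substitute your real schema.
tables = ["table1", "table2", "table3"]  # extend to all 15 tables

query = "\nUNION ALL\n".join(
    f"SELECT '{t}' AS source, count(*) AS cnt FROM db.{t}" for t in tables
)
print(query)
```

The generated string can then be passed to `hive -e` and piped through `tr` exactly as in the command above.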
You can mount the directory where the output file is written on Windows using Samba/NFS. Schedule the command with crontab, and every day you will have an updated file.
You can also connect directly using ODBC drivers:
https://mapr.com/blog/connecting-apache-hive-to-odbc/
https://learn.microsoft.com/en-us/azure/hdinsight/hadoop/apache-hadoop-connect-excel-hive-odbc-driver

How can i send multiple queries in jaydebeapi in python (Netezza JDBC)

How can I send multiple queries in a single execute statement? A simplified version of the query I am trying to execute using jaydebeapi is:
CREATE TEMP TABLE tempTable AS
SELECT * FROM table1 WHERE x = y;
SELECT * FROM tempTable;
UPDATE: I am using the Netezza JDBC driver.
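jaydebeapi's cursor.execute() runs one statement per call, so one workaround is to split the script on semicolons and execute each statement in turn. A minimal sketch (the connection details in the comments are placeholders, and naive semicolon splitting will break if a string literal contains ';'):

```python
# Sketch: split an SQL script into single statements for one-at-a-time execution.
def split_statements(sql_script):
    """Return non-empty, whitespace-trimmed statements from an SQL script."""
    return [s.strip() for s in sql_script.split(";") if s.strip()]

script = """
CREATE TEMP TABLE tempTable AS
SELECT * FROM table1 WHERE x = y;
SELECT * FROM tempTable
"""

statements = split_statements(script)
print(statements)

# With an open jaydebeapi cursor (assumed: conn and curs already created
# against the Netezza JDBC driver):
#     for stmt in statements:
#         curs.execute(stmt)
#     rows = curs.fetchall()   # rows from the final SELECT
```

Since the temp table is created on the same connection, the later SELECT can see it as long as every statement goes through the same cursor/connection.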
