Databricks accessing DataFrame in SQL - apache-spark

I'm learning Databricks and got stuck on the simplest step.
I'd like to query my DataFrame from the SQL side of Databricks.
Here are my steps:
df = spark.read.csv('dbfs:/databricks-datasets/COVID/covid-19-data/us.csv', header=True, inferSchema=True)
display(df)
Everything is fine, df is displayed. Then submitting:
df.createOrReplaceGlobalTempView("covid")
Finally:
%sql
show tables
No results are displayed. When trying:
display(spark.sql('SELECT * FROM covid LIMIT 10'))
Getting the error:
[TABLE_OR_VIEW_NOT_FOUND] The table or view `covid` cannot be found
When executing:
df.createGlobalTempView("covid")
I get a message that covid already exists, so the view was evidently created.
How can I access my df from the SQL side, please?

In a Databricks notebook, if you want to use SQL to query a DataFrame loaded in Python, you can do so in the following way (using your example data):
Set up the DataFrame in Python:
df = spark.read.csv('dbfs:/databricks-datasets/COVID/covid-19-data/us.csv', header=True, inferSchema=True)
Set up your global view:
df.createGlobalTempView("covid")
A global temp view is registered under the global_temp database, so a simple SQL query with that prefix is equivalent to the display() call:
%sql
SELECT * FROM global_temp.covid
This is also why plain SHOW TABLES returned nothing: the global view only shows up under SHOW TABLES IN global_temp. If you want to avoid the global_temp prefix, use df.createTempView (or df.createOrReplaceTempView) instead.
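For example, a minimal sketch of the session-scoped variant, reusing the names from the question:
# Python cell: register a temp view visible to SQL within the same Spark session
df.createOrReplaceTempView("covid")
%sql
-- SQL cell: the view now appears in SHOW TABLES and can be queried without a prefix
SELECT * FROM covid LIMIT 10
A temp view is scoped to the current session, whereas a global temp view is shared across sessions on the same cluster until the Spark application ends.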

Related

Loading Data from Azure Synapse Database into a DataFrame with Notebook

I am attempting to load data from Azure Synapse DW into a dataframe as shown in the image.
However, I'm getting the following error:
Traceback (most recent call last):
AttributeError: 'DataFrameReader' object has no attribute 'sqlanalytics'
Any thoughts on what I'm doing wrong?
That particular method has been renamed to synapsesql (as per the notes here) and is currently Scala-only, as I understand it. The correct syntax would therefore be:
%%spark
val df = spark.read.synapsesql("yourDb.yourSchema.yourTable")
It is possible to share the Scala dataframe with Python via the createOrReplaceTempView method, but I'm not sure how efficient that is. Mixing and matching is described here. So for your example you could mix and match Scala and Python like this:
Cell 1
%%spark
// Get table from dedicated SQL pool and assign it to a dataframe with Scala
val df = spark.read.synapsesql("yourDb.yourSchema.yourTable")
// Save the dataframe as a temp view so it's accessible from PySpark
df.createOrReplaceTempView("someTable")
Cell 2
%%pyspark
## The Scala dataframe's data is now accessible from PySpark via the temp view
df = spark.sql("select * from someTable")
## !!TODO do some work in PySpark
## ...
The above linked example shows how to write the dataframe back to the dedicated SQL pool too if required.
This is a good article on importing/exporting data with Synapse notebooks; the limitation is described in the Constraints section:
https://learn.microsoft.com/en-us/azure/synapse-analytics/spark/synapse-spark-sql-pool-import-export#constraints

How to query INFORMATION_SCHEMA view using spark bq connector?

I'm trying to identify partitions which got updated from a BQ table using the below query:
select * from PROJECT-ID.DATASET.INFORMATION_SCHEMA.PARTITIONS where
table_name='TABLE-NAME' and
extract(date from last_modified_time)='TODAY-DATE'
This is working fine from the BQ console. However when I use the same query from spark-bq connector it's failing.
spark.read.format("bigquery").load("PROJECT-ID.DATASET.INFORMATION_SCHEMA.PARTITIONS")
Error:
"Invalid project ID PROJECT-ID. Project IDs must contain 6-63 lowercase letters, digits, or dashes. Some project IDs also include domain name separated by a colon. IDs must start with a letter and may not end with a dash."
I tried multiple combinations, such as adding backticks around PROJECT-ID, but the API still throws a 400 error.
What is the right way to query the INFORMATION_SCHEMA from spark-bq connector?
Setting the project id as parentProject solves the issue:
df = (
    spark.read
    .format("bigquery")
    .option("parentProject", project_id)
    .load("PROJECT-ID.DATASET.INFORMATION_SCHEMA.PARTITIONS")
)
INFORMATION_SCHEMA is not a standard dataset in BigQuery, and as such is not available via the BigQuery Storage API used by the spark-bigquery connector. However, you can query it and load the data into a dataframe in the following manner:
spark.conf.set("viewsEnabled","true")
spark.conf.set("materializationDataset","<dataset>")
val tablesDF = spark.read.format("bigquery").load("select * from `<projectId>.<dataset>.__TABLES__`")
table = "INFORMATION_SCHEMA.TABLES"
sql = f"""SELECT *
FROM {project_id}.{dataset}.{table}
"""
return (
spark.
read.
format('bigquery').
load(sql)
)
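Putting the two together for the original question, a minimal sketch (project_id, dataset and table_name are placeholders; CURRENT_DATE() stands in for the TODAY-DATE value, and running a query requires viewsEnabled plus a materializationDataset):
spark.conf.set("viewsEnabled", "true")
spark.conf.set("materializationDataset", dataset)
query = f"""
    SELECT *
    FROM `{project_id}.{dataset}.INFORMATION_SCHEMA.PARTITIONS`
    WHERE table_name = '{table_name}'
      AND EXTRACT(DATE FROM last_modified_time) = CURRENT_DATE()
"""
partitions_df = (
    spark.read
    .format("bigquery")
    .option("parentProject", project_id)
    .option("query", query)
    .load()
)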

Upload Pandas dataframe to HANA database using HDBCLI / DBAPI

I connect to HANA database from Python and read any given table from a schema into a Pandas dataframe using the following code:
import pandas as pd
from hdbcli import dbapi
conn = dbapi.connect(
address=XXXX,
port=32015,
user="username",
password="password",
)
schema = <schema_name>
tablename = <table name>
pd.read_sql(f'select * from {schema}.{tablename}',conn)
This code works without any issue - I am able to download the table into a Pandas Data Frame.
However, I am unable to upload any Pandas Data Frame back to HANA db, even if it is the same schema.
xy.to_sql('new_table',conn)
I even tried to pre-define the target table in HANA Studio, including its columns and data types. Nonetheless, I get the following error:
DatabaseError: Execution failed on sql 'SELECT name FROM sqlite_master WHERE type='table' AND name=?;': (259, 'invalid table name: Could not find table/view SQLITE_MASTER in schema <RANDOM_SCHEMA>: line 1 col 18 (at pos 17)')
It is important to note that the <RANDOM_SCHEMA> in the above error is not the schema defined above, but my username for HANA Studio.
I thought that since I can read the table into Data Frame, I should be able to write the data frame into a HANA DB table. Am I wrong? What am I missing?
The code reads from the SQLite catalog table sqlite_master because pandas' to_sql only accepts an SQLAlchemy connectable or a sqlite3 connection; when it is handed a raw DBAPI connection such as the one from hdbcli, it falls back to its SQLite code path, and that catalog table doesn't exist on HANA (or on any other DBMS that is not SQLite).
That is why reading with a plain connection works, but writing does not.
However, for HANA there is a "machine learning" Python library, hana-ml, that provides easy integration of dataframes with the HANA database.
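A minimal sketch of that route, assuming the hana-ml package is installed (pip install hana-ml; the exact signature may differ slightly between versions), reusing the connection details and names from the question:
from hana_ml.dataframe import ConnectionContext, create_dataframe_from_pandas
# Connect with the same credentials used for hdbcli
cc = ConnectionContext(address=XXXX, port=32015, user="username", password="password")
# Create a HANA table from the pandas DataFrame xy
create_dataframe_from_pandas(
    connection_context=cc,
    pandas_df=xy,
    table_name="new_table",
    schema=schema,   # target schema from the question
    force=True,      # replace the table if it already exists
)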

Writing spark.sql dataframe result to parquet file

I set up the following Spark session:
# creating Spark context and connection
spark = (SparkSession.builder.appName("appName").enableHiveSupport().getOrCreate())
and am able to see the results of the following query:
spark.sql("select year(plt_date) as Year, month(plt_date) as Mounth, count(build) as B_Count, count(product) as P_Count from first_table full outer join second_table on key1=CONCAT('SS',key_2) group by year(plt_date), month(plt_date)").show()
However, when I try to write the resulting dataframe from this query to hdfs, I get the following error:
I am able to save the resulting dataframe of a simple version of this query to the same path. The problem appears by adding functions such as count(), year() and etc.
What is the problem? and how can I save the results to hdfs?
The error is caused by the '(' characters in the generated column name 'year(CAST(plt_date AS DATE))'; parquet does not accept parentheses (or spaces and a few other characters) in column names.
Rename the column before writing, for example:
data = data.selectExpr("year(CAST(plt_date AS DATE)) as nameofcolumn")
Refer: Rename Spark Column
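A minimal sketch of the full round trip, based on the query from the question (the HDFS output path is a placeholder): give every projected column a plain alias so that no output column name contains parentheses, then write the result as parquet.
result = spark.sql("""
    SELECT year(plt_date)  AS Year,
           month(plt_date) AS Month,
           count(build)    AS B_Count,
           count(product)  AS P_Count
    FROM first_table
    FULL OUTER JOIN second_table ON key1 = CONCAT('SS', key_2)
    GROUP BY year(plt_date), month(plt_date)
""")
# All output columns now have plain names, so the parquet writer accepts them
result.write.mode("overwrite").parquet("hdfs:///tmp/plt_summary")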

PySpark throwing ParseException for syntactical correct Hive Query

I have a DDL query that works fine in beeline, but when I try to run the same query through a SparkSession it throws a ParseException.
from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession, HiveContext
# Initialise Hive metastore
SparkContext.setSystemProperty("hive.metastore.uris", "thrift://localhost:9083")
# Create Spark Session
sparkSession = (SparkSession\
.builder\
.appName('test_case')\
.enableHiveSupport()\
.getOrCreate())
sparkSession.sql("CREATE EXTERNAL TABLE B LIKE A")
Pyspark Exception:
pyspark.sql.utils.ParseException: u"\nmismatched input 'LIKE' expecting <EOF>(line 1, pos 53)\n\n== SQL ==\nCREATE EXTERNAL TABLE B LIKE A\n-----------------------------------------------------^^^\n"
How can I make the HiveQL statement work within PySpark?
The problem seems to be that the query is parsed as a Spark SQL query and not as a HiveQL query, even though I have enableHiveSupport activated for the SparkSession.
Spark SQL queries use SparkSQL by default. To enable HiveQL syntax, I believe you need to give it a hint about your intent via a comment. (In fairness, I don't think this is well-documented; I've only been able to find a tangential reference to this being a thing here, and only in the Scala version of the example.)
For example, I'm able to get my command to parse by writing:
%sql
-- `USING HIVE`
CREATE TABLE narf LIKE poit
Now, I don't have Hive Support enabled on my session, so my query fails... but it does parse!
Edit: Since your SQL statement is in a Python string, you can use a multi-line string to use the single-line comment syntax, like this:
sparkSession.sql("""
-- `USING HIVE`
CREATE EXTERNAL TABLE B LIKE A
""")
There's also a delimited comment syntax in SQL, e.g.
sparkSession.sql("/* `USING HIVE` */ CREATE EXTERNAL TABLE B LIKE A")
which may work just as well.
