How do I access the Databricks table format column - azure

I can display the Databricks table format using: DESCRIBE {database name}.{table name};
This will display something like:
format    id      ...
hive      null    ...
Is there a way to write a SQL statement like:
SELECT FORMAT FROM {some table} where database = {db name} and table = {table name};
I would like to know if there is a Databricks catalog table that I can query directly, so that I can list all of the Databricks tables that have format = 'delta'.

Unlike a traditional relational database management system, there is no system catalog to query this information from directly.
You need to combine three Spark SQL statements with Python DataFrame code to get the answer you want.
%sql
show databases
This command will list all the databases (schemas).
%sql
show tables from dim;
This command will list all the tables in a database (schema).
%sql
describe table extended dim.employee
This command will return detailed information about a table.
As you can see, we want to pick up the following fields (database, table, location, provider and type) for every table in every database, and then filter for provider = 'delta'.
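Stitched together with a little PySpark, the program might look roughly like the sketch below. It assumes a Databricks notebook (so spark and display already exist) and that DESCRIBE TABLE EXTENDED reports Location, Provider and Type rows; the exact output columns of SHOW and DESCRIBE can vary slightly between runtimes.
%python
# Sketch: walk every database and table, pull the detailed description,
# and keep only tables whose provider is delta.
rows = []
for db in spark.sql("show databases").collect():
    db_name = db[0]  # first column is the schema name; its label differs across runtimes
    for tbl in spark.sql(f"show tables from {db_name}").collect():
        if tbl.isTemporary:
            continue
        detail = spark.sql(f"describe table extended {db_name}.{tbl.tableName}").collect()
        info = {r.col_name: r.data_type for r in detail}  # key/value rows from DESCRIBE
        rows.append((db_name, tbl.tableName,
                     info.get("Location"), info.get("Provider"), info.get("Type")))

result = spark.createDataFrame(
    rows, "database string, table string, location string, provider string, type string")
display(result.filter("lower(provider) = 'delta'"))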
Databricks has Unity Catalog in public preview right now.
https://learn.microsoft.com/en-us/azure/databricks/data-governance/unity-catalog/
As part of this new release, Databricks has implemented the information schema found in most relational database management systems.
https://docs.databricks.com/sql/language-manual/sql-ref-information-schema.html
In theory, a query against information_schema.tables would bring back information on all tables if Unity Catalog were enabled in my workspace. Since it is not enabled, the query processor does not understand the request.
In short, use spark.sql() and DataFrames to write a program that grabs the information, but this is a lengthy task. An easier alternative is to use Unity Catalog; make sure it is available in your region.
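For reference, once Unity Catalog is available, the information schema query would look something like this sketch (the data_source_format column name is taken from the information schema documentation; verify it against your workspace):
%python
# Requires Unity Catalog; this will not parse on a workspace without it.
delta_tables = spark.sql("""
  select table_catalog, table_schema, table_name, data_source_format
  from system.information_schema.tables
  where data_source_format = 'DELTA'
""")
display(delta_tables)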

To return the table format, you generally use DESCRIBE FORMATTED:
DESCRIBE FORMATTED [db_name.]table_name
DESCRIBE FORMATTED delta.`path-to-table` (Managed Delta Lake)
You cannot use a SELECT statement to get the format of the table.
The supported forms of the SELECT statement are:
SELECT * FROM boxes
SELECT width, length FROM boxes WHERE height=3
SELECT DISTINCT width, length FROM boxes WHERE height=3 LIMIT 2
SELECT * FROM VALUES (1, 2, 3) AS (width, length, height)
SELECT * FROM VALUES (1, 2, 3), (2, 3, 4) AS (width, length, height)
SELECT * FROM boxes ORDER BY width
SELECT * FROM boxes DISTRIBUTE BY width SORT BY width
SELECT * FROM boxes CLUSTER BY length
For more details, refer to "Azure Databricks – SQL Guide: Select".
Hope this helps.

Related

Is there a way to calculate the number of rows by table, schema and catalog in Databricks SQL (Spark SQL)?

I need to create a dashboard inside Databricks that summarizes the number of rows in the current workspace right now.
Is there a way to create a SQL query to calculate the number of rows by table, schema, and catalog? The expected result would be:
Catalog              Schema        Table            Rows
example_catalog_1    Finance       table_example_1  1567000
example_catalog_1    Finance       table_example_2  67000
example_catalog_2    Procurement   table_example_1  45324888
example_catalog_2    Procurement   table_example_2  89765987
example_catalog_2    Procurement   table_example_3  145000
Currently, I am working on a pure SQL workflow, so I would like to understand if it's possible to execute such an action using SQL, because as far as I know, the dashboards in Databricks do not accept PySpark code.
I was looking for a way to do that. I know that it's possible to access the tables in the workspace by using system.information_schema.tables, but how can I use it to count the total rows for each table listed there?
I know that in SQL Server this is possible via the sys schema, a dynamic query, or a BEGIN...END block. I couldn't find a way to do that in Databricks.
I strongly doubt you can run that kind of query in the Databricks dashboard. The link shared by #Sharma is more about how to get the record count using a DataFrame, not how to link that with the Databricks dashboard.
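For what it's worth, the DataFrame approach the linked answer describes would look roughly like the sketch below. It assumes Unity Catalog (for system.information_schema.tables) and read access to every table, and it will be slow on large workspaces since it counts each table one by one.
%python
# Sketch: list the tables from the information schema, then count each one.
tables = spark.sql("""
  select table_catalog, table_schema, table_name
  from system.information_schema.tables
  where table_type in ('MANAGED', 'EXTERNAL')
""").collect()

counts = [(t.table_catalog, t.table_schema, t.table_name,
           spark.table(f"{t.table_catalog}.{t.table_schema}.{t.table_name}").count())
          for t in tables]

summary = spark.createDataFrame(counts, "catalog string, schema string, table string, rows long")
display(summary.orderBy("catalog", "schema", "table"))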

Azure Synapse - Select Insert

This is my first time working with Azure Synapse, and it seems SELECT ... INTO is not working. Is there any workaround where I can just use a SELECT statement and then dump the result into a temporary table?
Here is the error prompted:
The query references an object that is not supported in distributed processing mode.
And this is my query:
Select *
Into #Temp1
FROM [dbo].[TblSample]
This is the Azure Synapse endpoint we are currently using:
ondemand-sql.azuresynapse.net
In Synapse On-Demand (serverless SQL pool), the use of temporary tables is limited. In your case, I am assuming that dbo.TblSample is an external table, which is possibly why you are facing this restriction.
Instead of using a temp table, can you either just JOIN TblSample directly, or use a CTE if you are SELECTing specific rows and columns?

Databricks Magic Sql - Export Data

Is it possible to export the output of a "magic SQL" command cell in Databricks?
I like the fact that one doesn't have to escape the SQL command and that it can be easily formatted, but I can't seem to be able to use the output in other cells. What I would like to do is export the data to a CSV file, but potentially finish some final manipulation of the DataFrame before I write it out.
sql = "select * from calendar"
df = sqlContext.sql(sql)
display(df.limit(10))
vs. (Databricks formatted the following code):
%sql
select
*
from
calendar
But imagine once you bring in escaped strings, nested joins, etc. I am wondering if there is a better way to work with SQL in Databricks.
The simplest solution is the most obvious one that I didn't think of: create a view!
%sql
CREATE OR REPLACE TEMPORARY VIEW vwCalendar as
/*
Comments to make your future self happy!
*/
select
c.line1, -- more comments
c.line2, -- more comments
c.zipcode
from
calendar c
where
c.status <> 'just an example\'s' -- <<imagine escaping this
Now you can use the view vwCalendar in subsequent SQL cells just like any other table.
And if you want to use it in a Python cell:
df = spark.table("vwCalendar")
display(df.limit(3))
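And if the end goal is the CSV export mentioned in the question, one option (a sketch only; the output path is just a placeholder) is to write that DataFrame out directly:
%python
# Write the view's contents out as CSV; /tmp/calendar_export is a placeholder path.
(spark.table("vwCalendar")
      .write
      .mode("overwrite")
      .option("header", True)
      .csv("/tmp/calendar_export"))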
https://docs.databricks.com/spark/latest/spark-sql/language-manual/sql-ref-syntax-ddl-create-view.html
https://docs.databricks.com/spark/latest/spark-sql/udf-python.html#user-defined-functions---python

Create Spark SQL tables from multiple parquet paths

I use Databricks. I am trying to create a table as below:
target_table_name = 'test_table_1'
spark.sql("""
drop table if exists %s
""" % target_table_name)
spark.sql("""
create table if not exists {0}
USING org.apache.spark.sql.parquet
OPTIONS (
path ("/mnt/sparktables/ds=*/name=xyz/")
)
""".format(target_table_name))
Even though using "*" gives me flexibility in loading different files (pattern matching) and eventually creating a table, I wish to create a table based on two completely different paths (no pattern matching).
path1 = /mnt/sparktables/ds=*/name=xyz/
path2 = /mnt/sparktables/new_path/name=123fo/
Spark uses the Hive metastore to create these permanent tables; they are essentially external tables in Hive.
Generally, what you are trying to do is not possible, because a Hive external table's location needs to be unique at the time of creation.
However, you can still achieve a Hive table with different locations if you incorporate a partitioning strategy in your Hive metastore.
In the Hive metastore you can have partitions that point to different locations.
There is no off-the-shelf way to achieve this, though. First, you would need to specify a partition key for your dataset and create a table from the first location, where all of that data belongs to one partition. Then alter the table to add a new partition.
Sample:
create external table tableName(<schema>) partitioned by (name string) location '/mnt/sparktables/ds=*/name=xyz/'
Then you can add partitions
alter table tableName add partition(name='123fo') location '/mnt/sparktables/new_path/name=123fo/'
The alternative to this process is to create two DataFrames from the two locations, combine them, and then saveAsTable.
I would do something like this:
create or replace view mytable as
select * from parquet.`path1`
union all
select * from parquet.`path2`
The view understands how to query from both locations. I assume you will not append/overwrite the table as it would lead to more ambiguity.
You can create data frames separately for two or more parquet files and then union them (assuming they have identical schemas)
df1.union(df2)
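Fleshed out a little, that could look like the sketch below (the paths are the ones from the question; the target table name is just an example):
%python
# Read each location separately, union them, and persist as a single table.
df1 = spark.read.parquet("/mnt/sparktables/ds=*/name=xyz/")
df2 = spark.read.parquet("/mnt/sparktables/new_path/name=123fo/")

combined = df1.union(df2)  # union matches columns by position, so the schemas must line up
combined.write.mode("overwrite").saveAsTable("test_table_1")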

Create view in SQL Server CE 3.5

I'm using SQL Server CE as my database.
Can I create a view in SQL Server CE 3.5? I tried to, but it says the CREATE VIEW statement is not supported.
In my application I have a table called Alarm with 12 columns, but I'm always accessing only three of them. So I want to create a view with those three columns.
Will it improve performance?
It appears that SQL Server Compact Edition indeed does not support the creation of views.
But if you're selecting only three columns from your table, a view will not help you here at all.
If you have a view AlarmView which is defined as
CREATE VIEW dbo.AlarmView
AS
SELECT Col1, Col2, Col3 FROM dbo.Alarm
then selecting from that view (SELECT * FROM dbo.AlarmView WHERE ...) essentially becomes
SELECT Col1, Col2, Col3 FROM dbo.Alarm
WHERE ........
so you get the same statement you'd write yourself.
Views are mostly not designed for performance gains (it helps a little that a view limits the number of columns returned by your SELECT); they are designed for limiting and modelling access to the tables. For example, you could grant a user SELECT permission on the view but not on the underlying table, so that user would never be able to see or select any of the other columns.
