Get table name from spark catalog - apache-spark

I have a DataSourceV2Relation object and I would like to get the name of its table from spark catalog. spark.catalog.listTables() will list all the tables, but is there a way to get the specific table directly from the object?

PySpark
from pyspark.sql.catalog import Table
def get_t_object_by_name(t_name: str, db_name: str = 'default') -> Table:
    # look the table up in the given database (the original hardcoded 'default')
    catalog_t_list = spark.catalog.listTables(db_name)
    return next((t for t in catalog_t_list if t.name == t_name), None)
Call as : get_t_object_by_name('my_table_name')
Result example : Table(name='my_table_name', database='default', description=None, tableType='MANAGED', isTemporary=False)
Table class definition : https://spark.apache.org/docs/2.3.0/api/java/org/apache/spark/sql/catalog/Table.html
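If you only need the catalog entry for a single known name, newer PySpark versions also expose spark.catalog.getTable, which avoids scanning the whole table list; verify it exists in your version's Catalog API. A minimal sketch:
# assumes an active SparkSession named spark and a recent PySpark version
# that provides Catalog.getTable; the name is resolved in the current database
t = spark.catalog.getTable('my_table_name')
print(t.name, t.database, t.tableType)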

Related

Get the Dataframe from temp view name

I can register a Dataframe in the catalog using
df.createOrReplaceTempView("my_name")
I can check if the dataframe is registered
next(filter(lambda table:table.name=='my_name',spark.catalog.listTables()))
But how can I get the DataFrame associated with the name? Something like:
a_df = spark.catalog.getTempView()
assert id(a_df)==id(df)
The spark.table() method returns the DataFrame for the given table or view name:
a_df = spark.table('my_name')
You may also use:
df=spark.sql("select * from my_name")
if you want to do some SQL before loading your df.
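A note on the id() check from the question: spark.table() builds a new DataFrame object on top of the registered view, so object identity will not match the original df even though it reads the same data. A small sketch of what does and does not hold:
# assumes an active SparkSession named spark
df.createOrReplaceTempView("my_name")
a_df = spark.table("my_name")
assert id(a_df) != id(df)        # different Python wrapper objects
assert a_df.schema == df.schema  # but the same underlying view/data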

Generate database schema diagram for Databricks

I'm creating a Databricks application and the database schema is getting to be non-trivial. Is there a way I can generate a schema diagram for a Databricks database (something similar to the schema diagrams that can be generated from MySQL)?
There are 2 variants possible:
using Spark SQL with show databases, show tables in <database>, describe table ...
using spark.catalog.listDatabases, spark.catalog.listTables, spark.catalog.listColumns.
The 2nd variant isn't very performant when you have a lot of tables in the database/namespace, although it's slightly easier to use programmatically (see the sketch below). In both cases, the implementation is just 3 nested loops: iterate over the list of databases, then over the tables inside each database, and then over the columns inside each table. That data can then be used to generate a diagram with your favorite diagramming tool.
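For reference, a minimal sketch of the 2nd variant using the catalog API; it only prints the schema as text, and what it finds depends entirely on your metastore:
# assumes an active SparkSession named spark
for db in spark.catalog.listDatabases():
    print(f"database: {db.name}")
    for tbl in spark.catalog.listTables(db.name):
        if tbl.isTemporary:
            continue  # temp views have no database
        print(f"  table: {tbl.name}")
        for col in spark.catalog.listColumns(tbl.name, db.name):
            print(f"    {col.name}: {col.dataType}")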
Here is the code for generating the source for PlantUML (full code is here):
# This script generates a PlantUML diagram for tables visible to Spark.
# The diagram is stored in the db_schema.puml file, so just run
# 'java -jar plantuml.jar db_schema.puml' to get a PNG file
from pyspark.sql import SparkSession
from pyspark.sql.utils import AnalysisException

# Variables

# list of databases/namespaces to analyze. Could be empty, then all existing
# databases/namespaces will be processed
databases = ["a", "airbnb"]  # put databases/namespaces to handle
# change this if you want to include temporary tables as well
include_temp = False

# implementation
spark = SparkSession.builder.appName("Database Schema Generator").getOrCreate()

# if databases aren't specified, then fetch the list from Spark
if len(databases) == 0:
    databases = [db["namespace"] for db in spark.sql("show databases").collect()]

with open("db_schema.puml", "w") as f:
    f.write("\n".join(
        ["@startuml", "skinparam packageStyle rectangle", "hide circle",
         "hide empty methods", "", ""]))
    # note: this example only processes the first 3 databases
    for database_name in databases[:3]:
        f.write(f'package "{database_name}" {{\n')
        tables = spark.sql(f"show tables in `{database_name}`")
        for tbl in tables.collect():
            table_name = tbl["tableName"]
            db = tbl["database"]
            if include_temp or not tbl["isTemporary"]:
                lines = []
                try:
                    lines.append(f'class {table_name} {{')
                    cols = spark.sql(f"describe table `{db}`.`{table_name}`")
                    for cl in cols.collect():
                        col_name = cl["col_name"]
                        data_type = cl["data_type"]
                        lines.append(f'{{field}} {col_name} : {data_type}')
                    lines.append('}\n')
                    f.write("\n".join(lines))
                except AnalysisException as ex:
                    print(f"Error when trying to describe {tbl.database}.{table_name}: {ex}")
        f.write("}\n\n")
    f.write("@enduml\n")
The generated db_schema.puml file can then be rendered into a diagram image with PlantUML.

How to convert sql query to list?

I am trying to convert my sql query output into a list to look a certain way.
Here is my code:
def get_sf_metadata():
    import sqlite3

    # Tables I want to be dynamically created
    table_names = ['AcceptedEventRelation', 'Asset', 'Book']

    # SQLite connection
    conn = sqlite3.connect('aaa_test.db')
    c = conn.cursor()

    # select the metadata table records
    c.execute("select name, type from sf_field_metadata1 limit 10")
    print(list(c))

get_sf_metadata()
Here is my output:
[('Id', 'id'), ('RelationId', 'reference'), ('EventId', 'reference')]
Is there any way to make the output look like this:
[Id id, RelationId reference, EventId reference]
You can try
print(["{} {}".format(i[0], i[1]) for i in list(c)])
That will print you
['Id id', 'RelationId reference', 'EventId reference']
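If you want exactly the unquoted bracket format shown in the question, join the pairs yourself; note that the result is a single string rather than a Python list, and that you should fetch the rows instead of calling print(list(c)) first, since the cursor can only be iterated once:
rows = c.fetchall()
print("[" + ", ".join(f"{name} {typ}" for name, typ in rows) + "]")
# -> [Id id, RelationId reference, EventId reference]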

Using parameterized SQL query while reading large table into pandas dataframe using COPY

I am trying to read a large table (10-15M rows) from a database into pandas dataframe and I'm using the following code:
import tempfile
import pandas

def read_sql_tmpfile(query, db_engine):
    with tempfile.TemporaryFile() as tmpfile:
        copy_sql = "COPY ({query}) TO STDOUT WITH CSV {head}".format(
            query=query, head="HEADER"
        )
        conn = db_engine.raw_connection()
        cur = conn.cursor()
        cur.copy_expert(copy_sql, tmpfile)
        tmpfile.seek(0)
        df = pandas.read_csv(tmpfile)
        return df
This works if I have a simple query like the following and pass it into the above function:
'''SELECT * from hourly_data'''
But what if I want to pass a variable into the query, e.g.
'''SELECT * from hourly_data where starttime >= %s '''
Now where do I pass the parameter?
You cannot use parameters with COPY. Unfortunately, that restriction extends to the query you run inside COPY, even though the same query would accept parameters on its own.
You will have to construct a query string including the parameter (beware of SQL injection) and use that with COPY.
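A minimal sketch of that approach, assuming the driver is psycopg2 (which the copy_expert call suggests): let the driver bind the parameters first with cursor.mogrify, then embed the resulting SQL in the COPY statement. The function name read_sql_tmpfile_params is only for illustration:
import tempfile
import pandas

def read_sql_tmpfile_params(query, params, db_engine):
    conn = db_engine.raw_connection()
    cur = conn.cursor()
    # mogrify returns the query with the parameters safely bound (bytes in psycopg2)
    bound_query = cur.mogrify(query, params).decode()
    copy_sql = "COPY ({q}) TO STDOUT WITH CSV HEADER".format(q=bound_query)
    with tempfile.TemporaryFile() as tmpfile:
        cur.copy_expert(copy_sql, tmpfile)
        tmpfile.seek(0)
        return pandas.read_csv(tmpfile)

# usage, e.g.:
# df = read_sql_tmpfile_params("SELECT * FROM hourly_data WHERE starttime >= %s",
#                              ("2021-01-01",), db_engine)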

Pulling Name Out of Schema in Spark DataFrame

I am trying to make a function that will pull the name of the column out of a dataframe schema. So what I have is the initial function defined:
val df = sqlContext.parquetFile(inputVal.toString)
val dfSchema = df.schema
def schemaMatchP(schema: StructType): Map[String, List[Int]] =
  schema
    // get the 1st word (column type) in upper case
    .map(columnDescr => columnDescr
If I do something like this:
.map(columnDescr => columnDescr.toString.split(',')(0).toUpperCase)
I will get STRUCTFIELD(HH_CUST_GRP_MBRP_ID,BINARYTYPE,TRUE)
How do you handle a StructField so that I can grab the first element (the column name) out of each field in the schema, i.e. column names like HH_CUST_GRP_MBRP_ID, etc.?
When in doubt, look at what the source itself does; DataFrame.toString has the answer :). StructField is a case class with a name property, so just do:
schema.map(f => s"${f.name}")
