Generate database schema diagram for Databricks - apache-spark

I'm creating a Databricks application and the database schema is getting to be non-trivial. Is there a way I can generate a schema diagram for a Databricks database (something similar to the schema diagrams that can be generated from MySQL)?

There are two possible variants:
1. using Spark SQL with show databases, show tables in <database>, describe table ...
2. using spark.catalog.listDatabases, spark.catalog.listTables, spark.catalog.listColumns.
The second variant isn't very performant when you have a lot of tables in the database/namespace, although it's slightly easier to use programmatically. In both cases, the implementation is just three nested loops: iterate over the list of databases, then the list of tables inside each database, and then the list of columns inside each table. This data can then be used to generate a diagram with your favorite diagramming tool.
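For illustration, a minimal sketch of the second, catalog-based variant might look like the following; it only collects the metadata into a nested dict (the full script below uses the SQL-based variant instead), and how you feed that dict into a diagramming tool is up to you:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Schema Collector").getOrCreate()

# Collect {database: {table: [(column, type), ...]}} using the catalog API
schema = {}
for db in spark.catalog.listDatabases():
    schema[db.name] = {}
    for tbl in spark.catalog.listTables(db.name):
        if tbl.isTemporary:
            continue  # skip temporary views
        cols = spark.catalog.listColumns(tbl.name, db.name)
        schema[db.name][tbl.name] = [(c.name, c.dataType) for c in cols]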
Here is the code for generating the source for PlantUML (full code is here):
# This script generates a PlantUML diagram for tables visible to Spark.
# The diagram is stored in the db_schema.puml file, so just run
# 'java -jar plantuml.jar db_schema.puml' to get a PNG file
from pyspark.sql import SparkSession
from pyspark.sql.utils import AnalysisException

# Variables

# List of databases/namespaces to analyze. Could be empty, then all existing
# databases/namespaces will be processed
databases = ["a", "airbnb"]  # put the databases/namespaces to handle here

# Change this if you want to include temporary tables as well
include_temp = False

# Implementation
spark = SparkSession.builder.appName("Database Schema Generator").getOrCreate()

# If databases aren't specified, then fetch the list from Spark
if len(databases) == 0:
    databases = [db["namespace"] for db in spark.sql("show databases").collect()]

with open("db_schema.puml", "w") as f:
    f.write("\n".join(
        ["@startuml", "skinparam packageStyle rectangle", "hide circle",
         "hide empty methods", "", ""]))
    for database_name in databases[:3]:  # note: [:3] limits this to the first three databases
        f.write(f'package "{database_name}" {{\n')
        tables = spark.sql(f"show tables in `{database_name}`")
        for tbl in tables.collect():
            table_name = tbl["tableName"]
            db = tbl["database"]
            if include_temp or not tbl["isTemporary"]:
                lines = []
                try:
                    lines.append(f'class {table_name} {{')
                    cols = spark.sql(f"describe table `{db}`.`{table_name}`")
                    for cl in cols.collect():
                        col_name = cl["col_name"]
                        data_type = cl["data_type"]
                        lines.append(f'{{field}} {col_name} : {data_type}')
                    lines.append('}\n')
                    f.write("\n".join(lines))
                except AnalysisException as ex:
                    print(f"Error when trying to describe {db}.{table_name}: {ex}")
        f.write("}\n\n")
    f.write("@enduml\n")
The generated db_schema.puml file can then be rendered into a diagram with PlantUML.

Related

Delta Live Tables and ingesting AVRO

So, I'm trying to load Avro files into DLT, create pipelines, and so forth.
As a simple data frame in Databricks, I can read and unpack the Avro files using spark.read.json / rdd.map / a lambda function. I can then create a temp view, run a SQL query, and select the fields I want.
# example command
in_path = '/mnt/file_location/*/*/*/*/*.avro'
avroDf = spark.read.format("com.databricks.spark.avro").load(in_path)
jsonRdd = avroDf.select(avroDf.Body.cast("string")).rdd.map(lambda x: x[0])
data = spark.read.json(jsonRdd)
data.createOrReplaceTempView("eventhub")
# selecting the data
sql_query1 = sqlContext.sql("""
select distinct
data.field.test1 as col1
,data.field.test2 as col2
,data.field.fieldgrp.city as city
from
eventhub
""")
However, I am trying to replicate the process, but using Delta Live Tables and pipelines.
I have used Auto Loader to load the files into a table and kept the format as is, so bronze is just Avro in its rawest form.
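For reference, a simplified sketch of that bronze Auto Loader step (the exact options and names may differ from what I actually used) is:
import dlt

# Bronze table: raw Avro files ingested as-is via Auto Loader
@dlt.table(name="bronze_file_list")
def bronze_file_list():
    return (
        spark.readStream
        .format("cloudFiles")
        .option("cloudFiles.format", "avro")  # keep the raw Avro format
        .load("/mnt/file_location/*/*/*/*/*.avro")
    )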
I then planned to create a view that exposes the unpacked Avro data, much like I did above with "eventhub", which would then allow me to create queries.
The trouble is, I can't get it to work in DLT. I fail at the second step, after I have imported the files into the bronze layer. It just does not seem to apply the functions to make the data readable/selectable.
This is the sort of code I have been trying. However, it does not seem to pick up the schema, so it is as if the functions are not working; when I try to select a column, it is not recognised.
# unpacked data
import dlt

@dlt.view(name="eventdata_v")
def eventdata_v():
    avroDf = spark.read.format("delta").table("live.bronze_file_list")
    jsonRdd = avroDf.select(avroDf.Body.cast("string")).rdd.map(lambda x: x[0])
    data = spark.read.json(jsonRdd)
    return data

# trying to query the data, but it does not recognise field names, even when I select "data" only
@dlt.view(name="eventdata2_v")
def eventdata2_v():
    df = (
        dlt.read("eventdata_v")
        .select("data.field.test1")
    )
    return df
I have been working on this for weeks, trying different approaches, but still no luck.
Any help would be greatly appreciated. Thank you.

AWS Glue dynamic frames not getting updated

We are currently facing an issue where we cannot insert more than 600K records into an Oracle DB using AWS Glue. We are getting a connection reset error and the DBAs are currently looking into it. As a temporary solution we thought of adding the data in chunks, by splitting the dataframe into multiple dataframes and looping over that list to add the data. We are sure that the splitting algorithm works fine, and here is the code we use:
from pyspark.sql.functions import col, monotonically_increasing_id, ntile
from pyspark.sql.window import Window

def split_by_row_index(df, num_partitions=10):
    # Let's assume you don't have a row_id column that has the row order
    t = df.withColumn('_row_id', monotonically_increasing_id())
    # Using ntile() because monotonically_increasing_id is discontinuous across partitions
    t = t.withColumn('_partition', ntile(num_partitions).over(Window.orderBy(t._row_id)))
    return [t.filter(t._partition == i + 1) for i in range(num_partitions)]
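For context, the chunked-write loop described above (converting each chunk to a DynamicFrame and writing it to Oracle) looks roughly like the sketch below; the Glue connection name and connection options are placeholders, not our actual job code:
from awsglue.dynamicframe import DynamicFrame

chunks = split_by_row_index(df_trns_details, int(df_trns_details.count() / 100000))
for i, chunk_df in enumerate(chunks):
    # Convert each chunk and write it through the Glue JDBC connection
    dyf = DynamicFrame.fromDF(chunk_df, glueContext, f"trnsDetails_{i}")
    glueContext.write_dynamic_frame.from_jdbc_conf(
        frame=dyf,
        catalog_connection="oracle-connection",  # placeholder Glue connection name
        connection_options={"dbtable": "TRNS_DETAILS", "database": "ORCL"},  # placeholders
    )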
Each of these dataframes has unique data, but somehow when we convert them to dynamic frames in a loop we get common data in each dynamic frame. Here is a small snippet demonstrating this:
df_trns_details_list = split_by_row_index(df_trns_details, int(df_trns_details.count() / 100000))
trnsDetails1 = DynamicFrame.fromDF(df_trns_details_list[0], glueContext, "trnsDetails1")
trnsDetails2 = DynamicFrame.fromDF(df_trns_details_list[1], glueContext, "trnsDetails2")
print(df_trns_details_list[0].count())# counts are same
print(trnsDetails1.count())
print('-------------------------------')
print(df_trns_details_list[1].count()) # counts are same
print(trnsDetails2.count())
print('-------------------------------')
subDf1 = trnsDetails1.toDF().select(col("id"), col("details_id"))
subDf2 = trnsDetails2.toDF().select(col("id"), col("details_id"))
common = subDf1.intersect(subDf2)
# ------------------ common data exists----------------
print(common.count())
subDf3 = df_trns_details_list[0].select(col("id"), col("details_id"))
subDf4 = df_trns_details_list[1].select(col("id"), col("details_id"))
#------------------0 common data----------------
common1 = subDf3.intersect(subDf4)
print(common1.count())
Here the id and details_id combination should be unique.
We used this logic in multiple areas where it worked; we are not sure why this is happening here.
We are also quite new to Python and AWS Glue, so any suggestions for improvement are also welcome. Thanks

Query to show Tables and Table definition in Starcounter Database

How can I get a list of table names and definitions by either SQL statement or code behind for the Starcounter DB?
Metadata about created tables, their columns, and indexes is stored in metadata tables. Database classes are publicly exposed for the corresponding metadata tables.
Tables or types are described by Starcounter.Metadata.RawView and Starcounter.Metadata.ClrClass, and both extend Starcounter.Metadata.Table. ClrClass contains descriptions for loaded CLR classes only, while RawView describes all created tables. They include descriptions of user-defined classes/tables and metadata classes/tables.
For example, all loaded user-defined classes can be enumerated:
foreach (ClrClass c in Db.SQL<ClrClass>(
    "select c from Starcounter.Metadata.ClrClass c where Updatable = ?", true)) {
    Console.WriteLine(c.FullName);
}
The Updatable property of Table is true for user-defined tables and false for metadata/system tables.
Properties or columns are described by Starcounter.Metadata.Member and its children. An example of enumerating all columns for all user-defined tables is:
foreach (Member m in Db.SQL<Member>(
    "select m from Column m, RawView v where m.Table = v and v.Updatable = ?",
    true)) {
    Console.WriteLine(m.Name);
}
Indexes are described by Starcounter.Metadata.Index and Starcounter.Metadata.IndexedColumn.
Currently there is a one-to-one match between database classes and tables; however, this and the metadata schema might change in the future.

non-ordinal access to rows returned by Spark SQL query

In the Spark documentation, it is stated that the result of a Spark SQL query is a SchemaRDD. Each row of this SchemaRDD can in turn be accessed by ordinal. I am wondering if there is any way to access the columns using the field names of the case class on top of which the SQL query was built. I appreciate the fact that the case class is not associated with the result, especially if I have selected individual columns and/or aliased them: however, some way to access fields by name rather than ordinal would be convenient.
A simple way is to use the "language-integrated" select method on the resulting SchemaRDD to select the column(s) you want -- this still gives you a SchemaRDD, and if you select more than one column then you will still need to use ordinals, but you can always select one column at a time. Example:
// setup and some data
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext._
case class Score(name: String, value: Int)
val scores =
  sc.textFile("data.txt").map(_.split(",")).map(s => Score(s(0), s(1).trim.toInt))
scores.registerAsTable("scores")
// initial query
val original =
  sqlContext.sql("Select value AS myVal, name FROM scores WHERE name = 'foo'")
// now a simple "language-integrated" query -- no registration required
val secondary = original.select('myVal)
secondary.collect().foreach(println)
Now secondary is a SchemaRDD with just one column, and it works despite the alias in the original query.
Edit: but note that you can register the resulting SchemaRDD and query it with straight SQL syntax without needing another case class.
original.registerAsTable("original")
val secondary = sqlContext.sql("select myVal from original")
secondary.collect().foreach(println)
Second edit: When processing an RDD one row at a time, it's possible to access the columns by name using pattern-matching syntax:
val secondary = original.map { case Row(myVal: Int, _) => myVal }
although this could get cumbersome if the right-hand side of the '=>' requires access to a lot of the columns, as they would each need to be matched on the left. (This is from a very useful comment in the source code for the Row companion object.)

Querying with cqlengine

I am trying to hook the cqlengine CQL 3 object mapper into my web application running on CherryPy. Although the documentation is very clear about querying, I am still not sure how to make queries on an existing table (and an existing keyspace) in my Cassandra database. For instance, I already have this table Movies containing the fields Title, rating, Year. I want to make the CQL query
SELECT * FROM Movies
How do I go ahead with the query after establishing the connection with
from cqlengine import connection
connection.setup(['127.0.0.1:9160'])
The KEYSPACE is called "TEST1".
Abhiroop Sarkar,
I highly suggest that you read through all of the documentation at:
Current Object Mapper Documentation
Legacy CQLEngine Documentation
Installation: pip install cassandra-driver
And take a look at this example project by the creator of CQLEngine, rustyrazorblade:
Example Project - Meat bot
Keep in mind, CQLEngine has been merged into the DataStax Cassandra-driver:
Official Python Cassandra Driver Documentation
You'll want to do something like this:
CQLEngine <= 0.21.0:
from cqlengine.connection import setup
setup(['127.0.0.1'], 'keyspace_name', retry_connect=True)
If you still need to create the keyspace:
from cqlengine.management import create_keyspace

create_keyspace(
    'keyspace_name',
    replication_factor=1,
    strategy_class='SimpleStrategy'
)
Setup your Cassandra Data Model
You can do this in the same .py or in your models.py:
import datetime
import uuid

from cqlengine import columns, Model

class YourModel(Model):
    __key_space__ = 'keyspace_name'  # Not Required
    __table_name__ = 'columnfamily_name'  # Not Required
    some_int = columns.Integer(
        primary_key=True,
        partition_key=True
    )
    time = columns.TimeUUID(
        primary_key=True,
        clustering_order='DESC',
        default=uuid.uuid1,
    )
    some_uuid = columns.UUID(primary_key=True, default=uuid.uuid4)
    created = columns.DateTime(default=datetime.datetime.utcnow)
    some_text = columns.Text(required=True)

    def __str__(self):
        return self.some_text

    def to_dict(self):
        data = {
            'text': self.some_text,
            'created': self.created,
            'some_int': self.some_int,
        }
        return data
Sync your Cassandra ColumnFamilies
from cqlengine.management import sync_table
from .models import YourModel
sync_table(YourModel)
Considering everything above, you can put all of the connection and syncing together, as many examples have outlined. Say this is connection.py in our project:
from cqlengine.connection import setup
from cqlengine.management import sync_table
from .models import YourTable

def cass_connect():
    setup(['127.0.0.1'], 'keyspace_name', retry_connect=True)
    sync_table(YourTable)
Actually Using the Model and Data
from __future__ import print_function
from .connection import cass_connect
from .models import YourTable

def add_data():
    cass_connect()
    YourTable.create(
        some_int=5,
        some_text='Test0'
    )
    YourTable.create(
        some_int=6,
        some_text='Test1'
    )
    YourTable.create(
        some_int=5,
        some_text='Test2'
    )

def query_data():
    cass_connect()
    query = YourTable.objects.filter(some_int=5)
    # This will output each YourTable entry where some_int = 5
    for item in query:
        print(item)
Feel free to ask for further clarification if necessary.
The most straightforward way to achieve this is to make model classes which mirror the schema of your existing CQL tables, then run queries on them.
cqlengine is primarily an Object Mapper for Cassandra. It does not interrogate an existing database in order to create objects for existing tables. Rather it is usually intended to be used in the opposite direction (i.e. create tables from python classes). If you want to query an existing table using cqlengine you will need to create python models that exactly correspond to your existing tables.
For example, if your current Movies table has three columns, id, title, and release_date, you would need to create a cqlengine model that has those three columns. Additionally, you would need to ensure that the __table_name__ attribute on the class is exactly the same as the table name in the database.
from cqlengine import columns, Model

class Movie(Model):
    __table_name__ = "movies"
    id = columns.UUID(primary_key=True)
    title = columns.Text()
    release_date = columns.Date()
The key thing is to make sure that the model exactly mirrors the existing table. If there are small differences, you may be able to use sync_table(MyModel) to update the table to match your model.
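As a rough usage sketch (assuming the connection has already been set up as shown in the earlier answer and that the Movie model above exactly matches the existing table), the equivalent of SELECT * FROM movies would then be:
from cqlengine import connection

# Assumes the Movie model defined above is in scope
connection.setup(['127.0.0.1'], 'TEST1')  # keyspace name from the question

# Equivalent of SELECT * FROM movies
for movie in Movie.objects.all():
    print(movie.title, movie.release_date)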
