Databricks tables/schemas deployment

Objective
We use a Databricks cluster for our ETL process and Databricks notebooks for DS, ML and QA activities.
Currently, we don't use the Databricks catalog or an external Hive Metastore. We define schemas programmatically in Spark StructType format and hardcode paths as follows:
tables/some_table.py:

import os

from pyspark.sql import DataFrame
from pyspark.sql import functions as F
from pyspark.sql.types import StructType


class SomeTable(TableBase):
    PATH = os.getenv('SOME_TABLE_PATH', '/some_folder/some_subfolder/')  # actually it's passed as a constructor arg
    SCHEMA = {
        "type": "struct",
        "fields": [
            {
                "name": "some_field",
                "type": "string",
                "nullable": True
            },
            ...
        ]
    }

    def schema(self) -> StructType:
        return StructType.fromJson(self.SCHEMA)

    def save(self, df: DataFrame):
        df.write.parquet(self.PATH)

    def read(self, year: str, month: str, day: str) -> DataFrame:
        return self.spark \
            .read \
            .parquet(self.PATH) \
            .filter((F.col('YEAR') == year) & ...)
The issue
From time to time we do some refactoring, changing a table's path, schema or partitioning. This is a problem, since Databricks is a shared platform between developers, QA and data scientists. On each change we have to update all notebooks and documentation in multiple places.
I would also like to use bucketing (clustering), table statistics, Delta Lake, SQL-syntax data exploration, views and some security features in the future. Those features also require table definitions to be accessible to Databricks.
The question
How do you usually deploy Databricks schemas and their updates? Shall I use SQL scripts that are executed automatically by an infrastructure-as-code tool on cluster start? Or is there a simpler/better solution?
Schemas for data frames that are written with Databricks/Spark can be created with df.write.saveAsTable('some_table'). But this is not the best solution, because:
I want to have the schema definition before the first write. For example, I'm transforming a dataset of 500 columns down to 100 columns and want to select only the required columns based on the schema definition (see the sketch after these points).
There are read-only datasets that are ingested (written) with other tools (like ADF or NiFi).
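A hedged sketch of that column-pruning step, assuming an instance table of the SomeTable class above and a wide input DataFrame raw_df (both names are illustrative):

# Keep only the columns declared in the table's programmatic schema definition.
# `table` is an instance of the SomeTable class above; `raw_df` stands in for the
# wide 500-column input DataFrame (both are illustrative).
target_schema = table.schema()
pruned_df = raw_df.select(*target_schema.fieldNames())
table.save(pruned_df)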
Update
I liked the experience with AWS Glue (used as a Hive Metastore by EMR), deployed through CloudFormation. I suppose Databricks has a similar or even simpler experience; I'm just wondering what the best practice is.
Update #2
Extra points for answering this question: how do we avoid duplicating the schema definition between the Databricks catalog (or an external Hive Metastore) and our codebase?
If we describe our schemas with SQL syntax, we won't be able to reuse them in unit tests. Is there any clean solution for deploying schemas based on the format described above (see the code snippet)?
PS: currently we use Azure cloud

For Databricks on AWS, the AWS Glue Catalog is a strong option for centralizing your metastore, so that all of your compute and query engines can use the same data definitions. The Glue Catalog promotes a cloud-wide data strategy and avoids the data silos created by product-specific data catalogs and access controls.
See the Databricks documentation for more information: https://docs.databricks.com/data/metastores/aws-glue-metastore.html
Performance-wise, you'll see a lift by having the schema defined, and you'll have the ability to collect table and column statistics in the metastore. Delta Lake collects file-level statistics within the Delta transaction log, enabling data skipping.
Consistent use of the Glue Catalog will prevent schema duplication.
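To avoid duplicating the definition between the catalog and the codebase (the Update #2 question), one hedged sketch, with illustrative names, is to register the StructType already defined in code as an external table in whatever metastore the cluster is attached to (Glue, an external Hive Metastore, or the Databricks catalog), so the code stays the single source of truth:

from pyspark.sql.types import StructType, StructField, StringType

# The StructType could equally be built from SomeTable.SCHEMA in the codebase.
table_schema = StructType([StructField("some_field", StringType(), True)])

# Registers an external (unmanaged) table over the existing Parquet location,
# so the metastore entry mirrors the schema defined in code.
spark.catalog.createTable(
    "analytics.some_table",               # hypothetical database.table name
    path="/some_folder/some_subfolder/",  # existing Parquet location
    source="parquet",
    schema=table_schema,
)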
Spark can figure out the schema when it reads Parquet or Delta Lake tables.
For Parquet and JSON tables, you can speed up schema inference by giving Spark just one file to infer the schema from, then reading the entire folder in the next pass. A metastore avoids this hassle and speeds up your queries.
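A hedged sketch of that single-file trick (paths are illustrative; shown for JSON, since Parquet already carries its schema in the file footer):

# Infer the schema from one file, then reuse it when reading the whole folder,
# avoiding a second full inference pass over every file.
sample_schema = spark.read.json("s3://my-bucket/events/part-00000.json").schema
events = spark.read.schema(sample_schema).json("s3://my-bucket/events/")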

Related

How to use Spark on Azure with a Data Factory in order to load and transform 2 files containing data

I am very new to Spark as well as to the Data Factory resource in Azure.
I would like to use Spark on Azure to load and transform 2 files containing data.
Here is what I would like to achieve in more details:
Use a data factory on Azure and create a pipeline and use a Spark activity
Load 2 files containing data in JSONL format
Transform them by doing a "JOIN" on a given field that exists in both
Output a new file containing the merged data
Can anyone help me achieve that?
As of now, I don't even understand how to load 2 files to work with in a Spark activity from a data factory pipeline...
The two easiest ways to use Spark in an Azure Data Factory (ADF) pipeline are either via a Databricks cluster and the Databricks activity, or via an Azure Synapse Analytics workspace, its built-in Spark notebooks and a Synapse pipeline (which is mostly ADF under the hood).
I was easily able to load a JSON Lines file (using this example) in a Synapse notebook with the following Python:
%%pyspark
df1 = spark.read.load(['abfss://someLake@someStorage.dfs.core.windows.net/raw/json lines example.jsonl'], format='json')
df1.createOrReplaceTempView("main")

df2 = spark.read.load(['abfss://someLake@someStorage.dfs.core.windows.net/raw/json lines example 2.jsonl'], format='json')
df2.createOrReplaceTempView("ages")
You can now either join them in SQL, eg
%%sql
-- Join in SQL
SELECT m.name, a.age
FROM main m
INNER JOIN ages a ON m.name = a.name
Or join them in Python:
df3 = df1.join(df2, on="name")
df3.show()
I haven't tested this in Databricks, but I imagine it's similar. You might have to set up some more permissions there, as the Synapse integration is slightly easier.
You could also look at Mapping Data Flows in ADF which uses Spark clusters under the hood and offers a low-code / GUI-based experience.
There is literally a JOIN transformation built into ADF in the Data Flow activity that executes on Spark for you without needing to know anything about clusters or Spark programming :)
https://learn.microsoft.com/en-us/azure/data-factory/data-flow-join

Spark SQL encapsulation of data sources

I have a dataset where 98% of the data (older than one day) is in Parquet files and 2% (the current day's real-time feed) is in HBase; I always need to union them to get the final dataset for that particular table or entity.
So I would like my clients to use the data seamlessly, like below, in any language they use for accessing Spark, via the Spark shell or any BI tool:
spark.read.format("my.datasource").load("entity1")
Internally I will read entity1's data from Parquet and HBase, union them and return the result.
I googled and found a few examples on extending DataSourceV2. Most of them say you need to develop a reader, but here I do not need a new reader; I need to make use of the existing ones (Parquet and HBase).
As I am not introducing any new datasource as such, do I need to create a new datasource? Or is there any higher-level abstraction/hook available?
You have to implement a new datasource, say "parquet+hbase". In the implementation you will make use of the existing Parquet and HBase readers, perhaps extending your classes with both of them, unioning the results, and so on.
For your reference, here are some links which can help you implement a new datasource:
spark "bigquery" datasource implementation
https://github.com/GoogleCloudDataproc/spark-bigquery-connector
Implementing custom datasource
https://michalsenkyr.github.io/2017/02/spark-sql_datasource
After going through various resources, below is what I found and implemented.
It might help someone, so I'm adding it as an answer.
A custom datasource is required only if we introduce a new datasource. For combining existing datasources we have to extend SparkSession and DataFrameReader. In the extended DataFrameReader we can invoke the Spark Parquet read method and the HBase reader, get the corresponding datasets, then combine the datasets and return the combined dataset.
In Scala we can use implicits to add custom logic to the SparkSession and DataFrame.
In Java we need to extend SparkSession and DataFrame, and then use imports of the extended classes when using them.
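For illustration, here is a simplified Python sketch of the union idea only (not the SparkSession/DataFrameReader extension itself); the path and the HBase helper are hypothetical:

from pyspark.sql import DataFrame, SparkSession


def read_entity1(spark: SparkSession) -> DataFrame:
    # Historical data (older than one day) from Parquet; illustrative path.
    historical = spark.read.parquet("/data/entity1/")

    # Placeholder for the current day's slice; in practice this would go through
    # an HBase connector, whose format and options depend on your setup.
    current = read_entity1_from_hbase(spark)  # hypothetical helper

    # Combine both slices into the single view clients expect.
    # unionByName needs Spark 2.3+; plain union also works if column order matches.
    return historical.unionByName(current)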

Parquet with Athena VS Redshift

I hope someone out there can help me with this issue. I am currently working on a data pipeline project, and my current dilemma is whether to use Parquet with Athena or to store the data in Redshift.
2 Scenarios:
First,
EVENTS --> STORE IT IN S3 AS JSON.GZ --> USE SPARK(EMR) TO CONVERT TO PARQUET --> STORE PARQUET BACK INTO S3 --> ATHENA FOR QUERY --> VIZ
Second,
EVENTS --> STORE IT IN S3 --> USE SPARK(EMR) TO STORE DATA INTO REDSHIFT
Issues with this scenario:
Spark JDBC with Redshift is slow
The Spark-Redshift repo by Databricks has a failing build and was last updated 2 years ago
I am unable to find useful information on which method is better. Should I even use Redshift, or is Parquet good enough?
Also, it would be great if someone could tell me whether there are any other methods for connecting Spark with Redshift, because I've only seen two solutions online: JDBC and Spark-Redshift (Databricks).
P.S. the pricing model is not a concern to me; also, I'm dealing with millions of events.
Here are some ideas / recommendations:
Don't use JDBC.
Spark-Redshift works fine but is a complex solution.
You don't have to use Spark to convert to Parquet; there is also the option of using Hive. See https://docs.aws.amazon.com/athena/latest/ug/convert-to-columnar.html
Athena is great when used against Parquet, so you may not need Redshift at all.
If you want to use Redshift, then use Redshift Spectrum to set up a view against your Parquet tables, then, if necessary, a CTAS within Redshift to bring the data in.
AWS Glue Crawler can be a great way to create the metadata needed to map the Parquet into Athena and Redshift Spectrum.
My proposed architecture:
EVENTS --> STORE IT IN S3 --> HIVE to convert to parquet --> Use directly in Athena
and/or
EVENTS --> STORE IT IN S3 --> HIVE to convert to parquet --> Use directly in Redshift using Redshift Spectrum
You may not need to convert to Parquet at all: if you use the right partitioning structure (S3 folders) and gzip the data, then Athena/Spectrum performance can be good enough without the complexity of converting to Parquet. This depends on your use case (data volumes and the types of query you need to run).
Which one to use depends on your data and access patterns. Athena directly uses S3 key structure to limit the amount of data to be scanned. Let's assume you have event type and time in events. The S3 keys could be e.g. yyyy/MM/dd/type/* or type/yyyy/MM/dd/*. The former key structure allows you to limit the amount of data to be scanned by date or date and type but not type alone. If you wanted to search only by type x but don't know the date, it would require a full bucket scan. The latter key schema would be the other way around. If you mostly need to access the data just one way (e.g. by time), Athena might be a good choice.
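As a hedged sketch of how such a key layout is typically produced from Spark (the bucket, column names and events_df are illustrative; note that Spark writes Hive-style key=value partition folders rather than bare values):

# Writing partitioned Parquet so Athena can prune on date and type:
# produces keys like .../year=2020/month=01/day=15/type=click/part-....parquet
(events_df
    .write
    .mode("append")
    .partitionBy("year", "month", "day", "type")
    .parquet("s3://my-bucket/events/"))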
On the other hand, Redshift is a PostgreSQL-based data warehouse which is much more complicated and flexible than Athena. Data partitioning plays a big role in terms of performance, but the schema can be designed in many ways to suit your use case. In my experience the best way to load data into Redshift is to first store it in S3 and then use COPY (https://docs.aws.amazon.com/redshift/latest/dg/r_COPY.html). It is orders of magnitude faster than JDBC, which I found to be good only for testing with small amounts of data. This is also how Kinesis Firehose loads data into Redshift. If you don't want to implement the S3 copying yourself, Firehose provides an alternative for that.
There are a few details missing in the question, such as how you would manage incremental upserts in the data pipeline.
If you have implemented a Slowly Changing Dimension (SCD type 1 or 2), it can't be managed using Parquet files alone, but it can easily be managed in Redshift.

Using spark sql DataFrameWriter to create external Hive table

As part of a data integration process I am working on, I have a need to persist a Spark SQL DataFrame as an external Hive table.
My constraints at the moment:
Currently limited to Spark 1.6 (v1.6.0)
Need to persist the data in a specific location, retaining the data even if the table definition is dropped (hence external table)
I have found what appears to be a satisfactory solution to write the dataframe, df, as follows:
df.write.saveAsTable('schema.table_name',
                     format='parquet',
                     mode='overwrite',
                     path='/path/to/external/table/files/')
Doing a describe extended schema.table_name against the resulting table confirms that it is indeed external. I can also confirm that the data is retained (as desired) even if the table itself is dropped.
My main concern is that I can't really find a documented example of this anywhere, nor much mention of it in the official docs (https://spark.apache.org/docs/1.6.0/api/python/pyspark.sql.html#pyspark.sql.DataFrameWriter), particularly the use of path to enforce the creation of an external table.
Is there a better/safer/more standard way to persist the dataframe?
I'd rather create the Hive tables myself (e.g. CREATE EXTERNAL TABLE IF NOT EXISTS) exactly as I need them, and then in Spark just do: df.write.saveAsTable('schema.table_name', mode='overwrite').
This way you have control over the table creation and don't depend on the HiveContext doing what you need. In the past there were issues with Hive tables created this way, and the behavior can change in the future, since that API is generic and cannot guarantee the underlying implementation by the HiveContext.
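A hedged sketch of that pattern in the Spark 1.6 / HiveContext setup described in the question (the column list is hypothetical; table name and location mirror the question's placeholders):

# Create the external table exactly as needed, up front and under your control.
sqlContext.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS schema.table_name (
        some_field STRING
    )
    STORED AS PARQUET
    LOCATION '/path/to/external/table/files/'
""")

# Then let Spark write into the pre-defined table.
df.write.saveAsTable('schema.table_name', format='parquet', mode='overwrite')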

Writing SQL vs using Dataframe APIs in Spark SQL

I am a newbie in the Spark SQL world. I am currently migrating my application's ingestion code, which includes ingesting data into the stage, raw and application layers in HDFS and doing CDC (change data capture). This is currently written in Hive queries and executed via Oozie, and it needs to be migrated into a Spark application (currently version 1.6). The other sections of code will be migrated later on.
In Spark SQL, I can create DataFrames directly from tables in Hive and simply execute the queries as they are (like sqlContext.sql("my hive hql")). The other way would be to use the DataFrame APIs and rewrite the HQL that way.
What is the difference in these two approaches?
Is there any performance gain with using Dataframe APIs?
Some people have suggested that there is an extra layer of SQL that the Spark core engine has to go through when using "SQL" queries directly, which may impact performance to some extent, but I didn't find any material substantiating that statement. I know the code would be much more compact with the DataFrame APIs, but when I already have all my HQL queries handy, is it really worth rewriting everything with the DataFrame API?
Thank You.
Question: What is the difference between these two approaches? Is there any performance gain from using the DataFrame APIs?
Answer:
There is a comparative study done by Hortonworks (source...).
The gist is that each one is right depending on the situation/scenario; there is no hard and fast rule to decide this. Please go through the points below.
RDDs, DataFrames, and SparkSQL (in fact 3 approaches, not just 2):
At its core, Spark operates on the concept of Resilient Distributed Datasets, or RDD’s:
Resilient - if data in memory is lost, it can be recreated
Distributed - an immutable, distributed collection of objects in memory, partitioned across many data nodes in a cluster
Dataset - the initial data can come from files, be created programmatically, come from data in memory, or come from another RDD
DataFrames API is a data abstraction framework that organizes your data into named columns:
Create a schema for the data
Conceptually equivalent to a table in a relational database
Can be constructed from many sources including structured data files, tables in Hive, external databases, or existing RDDs
Provides a relational view of the data for easy SQL like data manipulations and aggregations
Under the hood, it is an RDD of Row objects
SparkSQL is a Spark module for structured data processing. You can interact with SparkSQL through:
SQL
DataFrames API
Datasets API
Test results:
RDD’s outperformed DataFrames and SparkSQL for certain types of data processing
DataFrames and SparkSQL performed almost about the same, although with analysis involving aggregation and sorting SparkSQL had a slight advantage
Syntactically speaking, DataFrames and SparkSQL are much more intuitive than using RDD’s
Took the best out of 3 for each test
Times were consistent and not much variation between tests
Jobs were run individually with no other jobs running
Random lookup against 1 order ID from 9 million unique order IDs
GROUP all the different products with their total COUNTs and SORT DESCENDING by product name
In your Spark SQL string queries, you won't know a syntax error until runtime (which could be costly), whereas in DataFrames syntax errors can be caught at compile time.
A couple more additions: DataFrames use the Tungsten memory representation, and the Catalyst optimizer is used by SQL queries as well as the DataFrame API. With the Dataset API, you also have more control over the actual execution plan than with SparkSQL.
If a query is lengthy, writing and running it efficiently as a SQL string is difficult.
On the other hand, the DataFrame and Column APIs help the developer write compact code, which is ideal for ETL applications.
Also, all operations (e.g. greater than, less than, select, where, etc.) run through the DataFrame API build an "Abstract Syntax Tree (AST)", which is then passed to "Catalyst" for further optimizations. (Source: Spark SQL whitepaper, Section 3.3)
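As a hedged illustration (the orders table and its columns are made up), here is the aggregation from the test list above expressed both ways; both go through Catalyst, and comparing the plans with .explain() shows they compile to equivalent physical plans:

from pyspark.sql import functions as F

# SQL string version: parsed at runtime, so typos surface only when it executes.
sql_df = spark.sql("""
    SELECT product, COUNT(*) AS cnt
    FROM orders
    GROUP BY product
    ORDER BY product DESC
""")

# DataFrame API version: builds the same logical plan, but errors in column or
# method names surface earlier and the query composes more easily in code.
api_df = (spark.table("orders")
          .groupBy("product")
          .agg(F.count("*").alias("cnt"))
          .orderBy(F.col("product").desc()))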
