How to enable the Databricks Delta feature

I upgraded my Azure Databricks workspace from the standard tier to the premium tier and am trying to start using Databricks Delta:
create table t
using delta
as select * from test_db.src_data;
but it fails with:
Databricks Delta is not enabled in your account. Please reach out to your account manager to talk about using Delta;
I'm the account manager, but I cannot find this setting. Where is it?

Using the Spark SQL context in Python and Scala notebooks:
sql("SET spark.databricks.delta.preview.enabled=true")
sql("SET spark.databricks.delta.merge.joinBasedMerge.enabled=true")
In SQL notebooks:
SET spark.databricks.delta.preview.enabled=true
SET spark.databricks.delta.merge.joinBasedMerge.enabled=true
If you want the cluster to support Delta by default, add just this one line to the cluster's Spark configuration (under Advanced Options) when spinning it up in the UI:
spark.databricks.delta.preview.enabled=true
Or, the last and easiest option: when you spin up your cluster, select Databricks Runtime 5.0 or above; Delta is enabled by default on those runtimes.
And finally welcome to Databricks Delta :)
Also, just to help you out with your code there, it should look like this:
%sql
CREATE TABLE t
USING DELTA
PARTITIONED BY (YourPartitionColumnHere)
LOCATION "/mnt/data/path/to/the/location/where/you/want/these/parquetFiles/to/be/present"
AS SELECT * FROM test_db.src_data

Related

How do we test notebooks that use Delta Live Tables

I cannot execute Delta Live Tables code in a notebook; I always have to create a DLT pipeline by going into the Workflows tab. Is there an easy way to test Delta Live Tables code in a notebook?
Thanks
Debugging Delta Live Tables pipelines is challenging. Luckily, Souvik Pratiher has created an open-source library for debugging Delta Live Tables notebooks on regular Databricks clusters.

Azure Databricks cluster spark configuration is disabled

When creating an Azure Databricks workspace and configuring its cluster, I chose the default languages for Spark to be Python and SQL. But now I want to add Scala as well. When running a Scala script I got the error below. My online search took me to an article describing that you can change the cluster configuration by going to the Advanced options section of the cluster settings page and clicking on the Spark tab there. But I find the Spark section there greyed out (disabled).
Question: How can I enable the Spark section of the Advanced options on the cluster settings page so I can edit the last line of that section? Note: I created the Databricks workspace and its cluster, and hence I am the admin.
Databricks Notebook error: Your administrator has only allowed sql and python commands on this cluster.
You need to click the "Edit" button in the cluster controls - after that you should be able to change the Spark configuration. But you can't enable Scala for High Concurrency clusters with credential passthrough, as they support only Python & SQL (doc) - the primary reason is that with Scala you can bypass user isolation.
If you need credential passthrough + Scala, then you need to use a Standard cluster, but it will work only with a single specific user (doc).
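For completeness, the "only allowed sql and python commands" restriction behind the error is normally driven by the allowed-languages property in the cluster's Spark config. A minimal sketch of the line you would edit there, assuming your workspace and cluster policy permit changing it (the value is only an example; list whichever languages you want to allow):
spark.databricks.repl.allowedLanguages python,sql,scala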

How can I show Hive tables using PySpark

Hello, I created a Spark HDInsight cluster on Azure and I'm trying to read Hive tables with PySpark, but the problem is that it only shows me the default database.
Does anyone have an idea?
If you are using HDInsight 4.0, Spark and Hive do not share metadata anymore.
By default you will not see Hive tables from PySpark; this is a problem I described in this post: How to save/update a table in Hive to be readable from Spark.
But, anyway, here are things you can try:
If you only want to test on the head node, you can change hive-site.xml: set the property "metastore.catalog.default" to the value hive, and after that open pyspark from the command line (see the sketch of this change after the steps below).
If you want to apply the change to all cluster nodes, it needs to be made in Ambari:
Log in to Ambari as admin
Go to Spark2 > Configs > hive-site-override
Again, update the property "metastore.catalog.default" to the value hive
Restart all required services from the Ambari panel
These changes set the Hive metastore catalog as the default.
You can now see the Hive databases and tables, but depending on the table structure, you may not see the table data properly.
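For reference, the property change described above would look roughly like this in hive-site.xml (a sketch showing only the relevant property element):
<property>
  <name>metastore.catalog.default</name>
  <value>hive</value>
</property>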
If you have created tables in other databases, try show tables from database_name. Replace database_name with the actual name.
You are missing the Hive server details in your SparkSession. If you haven't added any, Spark will create and use the default database when you run Spark SQL.
If you've added the configuration details for spark.sql.warehouse.dir and spark.hadoop.hive.metastore.uris in the Spark defaults conf file, then just add enableHiveSupport() while creating the SparkSession.
Otherwise, add the configuration details while creating the SparkSession:
from pyspark.sql import SparkSession
spark = (SparkSession.builder
         .config("spark.sql.warehouse.dir", "/user/hive/warehouse")
         .config("hive.metastore.uris", "thrift://localhost:9083")
         .enableHiveSupport()
         .getOrCreate())
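Once the session is created with Hive support, a quick sanity check is to list what Spark can now see (the database name below is a placeholder):
spark.sql("SHOW DATABASES").show()
spark.sql("SHOW TABLES IN your_database_name").show()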

Azure Synapse Apache Spark : Pipeline level spark configuration

I'm trying to configure Spark for an entire Azure Synapse pipeline. I found the "Spark session config magic command" and "How to set Spark / Pyspark custom configs in Synapse Workspace spark pool". The %%configure magic command works fine for a single notebook. Example:
Insert a cell with the content below at the beginning of the notebook:
%%configure -f
{
    "driverMemory": "28g",
    "driverCores": 4,
    "executorMemory": "32g",
    "executorCores": 4,
    "numExecutors": 5
}
Then the code below prints the expected values.
spark_executor_instances = spark.conf.get("spark.executor.instances")
print(f"spark.executor.instances {spark_executor_instances}")
spark_executor_memory = spark.conf.get("spark.executor.memory")
print(f"spark.executor.memory {spark_executor_memory}")
spark_driver_memory = spark.conf.get("spark.driver.memory")
print(f"spark.driver.memory {spark_driver_memory}")
However, if I add that notebook as the first activity in an Azure Synapse pipeline, the Apache Spark application that executes that notebook has the correct configuration, but the rest of the notebooks in the pipeline fall back to the default configuration.
How can I configure Spark for the entire pipeline? Should I copy-paste the %%configure cell above into each and every notebook in the pipeline, or is there a better way?
Yes, as far as I know this is the well-known option: you need to define %%configure -f at the beginning of each notebook in order to override the default settings for your job.
Alternatively, you can go to the Spark pool in the Azure portal and set the configuration on the pool itself by uploading a configuration text file, which looks like this:
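A minimal sketch of such a file, assuming the usual spark-defaults style of one property and value per line (the values below are only examples mirroring the %%configure settings above):
spark.driver.memory 28g
spark.driver.cores 4
spark.executor.memory 32g
spark.executor.cores 4
spark.executor.instances 5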
Please refer to this third-party article for more details.
Moreover, it looks like one cannot specify fewer than 4 cores for the executor or the driver. If you do, you get 1 core, but 4 cores are nevertheless reserved.

Read Azure Synapse table with Spark

I'm looking, so far with no success, for how to read an Azure Synapse table from Scala Spark. On https://learn.microsoft.com I found connectors for other Azure databases with Spark, but nothing for the new Azure Data Warehouse.
Does anyone know if it is possible?
It is now directly possible, and with trivial effort (there is even a right-click option added in the UI for this), to read data from a DEDICATED SQL pool in Azure Synapse (the new Analytics workspace, not just the DWH) for Scala (and unfortunately, ONLY Scala right now).
Within Synapse workspace (there is of course a write API as well):
val df = spark.read.sqlanalytics("<DBName>.<Schema>.<TableName>")
If you are outside of the integrated notebook experience, you need to add these imports:
import com.microsoft.spark.sqlanalytics.utils.Constants
import org.apache.spark.sql.SqlAnalyticsConnector._
It sounds like they are working on expanding to SERVERLESS SQL pool, as well as other SDKs (e.g. Python).
Read the top portion of this article for reference: https://learn.microsoft.com/en-us/learn/modules/integrate-sql-apache-spark-pools-azure-synapse-analytics/5-transfer-data-between-sql-spark-pool
Maybe I misunderstood your question, but normally you would use a JDBC connection in Spark to consume data from a remote database.
Check this doc:
https://docs.databricks.com/data/data-sources/azure/synapse-analytics.html
Keep in mind that Spark would have to ingest data from the Synapse tables into memory for processing and perform the transformations there, so it is not going to push operations down into Synapse.
Normally, you want to run a SQL query against the source database and only bring the results of that SQL into a Spark dataframe.
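To illustrate that last point, here is a minimal PySpark sketch of a JDBC read that pushes a query down to the source database and only brings its result back into a dataframe; the server, database, query, and credential values are placeholders, and the same options apply from Scala:
# Assumes the SQL Server JDBC driver is available on the cluster
df = (spark.read
      .format("jdbc")
      .option("url", "jdbc:sqlserver://<server>.database.windows.net:1433;database=<database>")
      .option("query", "SELECT col1, col2 FROM dbo.my_table WHERE col1 > 0")
      .option("user", "<user>")
      .option("password", "<password>")
      .load())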
