Truncate tables on databricks - apache-spark

I'm working with two environments in Azure: Databricks and SQL Database. I have a function that generates a DataFrame, which is then used to overwrite a table stored in the SQL Database. The problem is that df.write.jdbc(mode = 'overwrite') only drops the table and, I'm guessing, my user doesn't have the permissions needed to recreate it (I've already checked the DML and DDL permissions required for that). In short, my function drops the table but never recreates it.
We discussed what the problem could be and concluded that the best thing I can probably do is truncate the table and reload the new data into it. I'm trying to find out how to truncate the table; I tried these two approaches but couldn't find more information about it:
df.write.jdbc()
&
spark.read.jdbc()
Can you help me with this? The overwrite doesn't work (maybe I don't have adequate permissions) and I can't figure out how to truncate that table using JDBC.

It's in the Spark documentation - you need to add the truncate option when writing:
df.write.mode("overwrite").option("truncate", "true")....save()
Also, if you have a lot of data, then maybe it's better to use Microsoft's Spark connector for SQL Server - it has some performance optimizations that should allow you to write faster.
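As a rough sketch of that truncate-and-overwrite write, assuming a JDBC URL and credentials for the Azure SQL database (all connection values below are placeholders):
# Hedged sketch: with truncate=true, Spark issues TRUNCATE TABLE instead of DROP + CREATE,
# so the existing table definition stays untouched.
(df.write
    .format("jdbc")
    .option("url", jdbc_url)                # placeholder, e.g. jdbc:sqlserver://<server>:1433;database=<db>
    .option("dbtable", "dbo.my_table")      # placeholder target table
    .option("user", jdbc_user)
    .option("password", jdbc_password)
    .option("truncate", "true")
    .mode("overwrite")
    .save())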

You can create a stored procedure for truncating or dropping the table in SQL Server and call that stored procedure from Databricks over an ODBC connection.
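If you go that route, here is a minimal sketch using pyodbc, assuming the pyodbc package and a SQL Server ODBC driver are installed on the cluster and that a stored procedure with the name below already exists (all names are illustrative):
import pyodbc

# Placeholder connection string and object names.
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=myserver.database.windows.net;DATABASE=mydb;"
    "UID=my_user;PWD=my_password"
)
cursor = conn.cursor()
cursor.execute("EXEC dbo.usp_truncate_my_table")  # hypothetical stored procedure
conn.commit()
conn.close()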

Related

How to synchronize an external database with a Spark session

I have a Delta Lake on an S3 bucket.
Since I would like to use Spark's SQL API, I need to synchronize the Delta Lake with the local Spark session. Is there a quick way to make all the tables available without having to create a temporary view for each one?
At the moment this is what I do (let's suppose I have 3 tables in the s3_bucket_path "folder").
s3_bucket_path = 's3a://bucket_name/delta_lake/'
spark.read.format('delta').load(s3_bucket_path + 'table_1').createOrReplaceTempView('table_1')
spark.read.format('delta').load(s3_bucket_path + 'table_2').createOrReplaceTempView('table_2')
spark.read.format('delta').load(s3_bucket_path + 'table_3').createOrReplaceTempView('table_3')
I was wondering if there is a quicker way to make all the tables available (without having to use boto3 and iterate through the folder to get the table names), or if I'm not following best practices for working with the Spark SQL APIs: should I use a different approach? I've been studying Spark for a week and I'm not 100% familiar with its architecture yet.
Thank you very much for your help.
Sounds like you'd like to use managed tables, so you have easy access to query the data with SQL, without manually registering views.
You can create a managed table as follows:
df.write.format("delta").saveAsTable("table_1")
The table path and schema information are stored in the Hive metastore (or another metastore if you've configured a different one). Managed tables save you from having to create the views manually.
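For example, a short sketch based on the paths from the question - saving one Delta path as a managed table (which copies the data into the warehouse location) and then querying it by name with SQL:
# Hedged sketch: persist the Delta data as a managed table, then query it with plain SQL.
df = spark.read.format("delta").load(s3_bucket_path + "table_1")
df.write.format("delta").mode("overwrite").saveAsTable("table_1")

# No temporary view needed any more - the table resolves by name.
spark.sql("SELECT COUNT(*) FROM table_1").show()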

How to solve the maximum view depth error in Spark?

I have a very long task that creates a bunch of views using Spark SQL and I get the following error at some step: pyspark.sql.utils.AnalysisException: The depth of view 'foobar' exceeds the maximum view resolution depth (100).
I have been searching in Google and SO and couldn't find anyone with a similar error.
I have tried caching the view foobar, but that doesn't help. I'm thinking of creating temporary tables as a workaround, as I would prefer not to change the current Spark configuration if possible, but I'm not sure if I'm missing something.
UPDATE:
I tried creating tables in Parquet format so that I reference tables rather than views, but I still get the same error. I applied this to all the input tables of the SQL query that causes the error.
If it makes a difference, I'm using ANSI SQL, not the Python API.
Solution
Using Parquet tables worked for me after all. I spotted that I was still missing one table to persist, which is why it hadn't been working.
So I changed my SQL statements from this:
CREATE OR REPLACE TEMPORARY VIEW `VIEW_NAME` AS
SELECT ...
To:
CREATE TABLE `TABLE_NAME` USING PARQUET AS
SELECT ...
This moves all the critical views to Parquet tables under spark-warehouse/ - or whatever warehouse location you have configured.
Note:
This will write the table to the master (driver) node's disk. Make sure you have enough disk space, or consider writing to an external data store such as S3. Read this as an alternative - and now preferred - solution using checkpoints.
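For reference, a minimal sketch of that checkpoint-based alternative (the checkpoint directory and view names here are illustrative, not from the original post):
# Hedged sketch: materialize an intermediate result and cut its lineage with a checkpoint
# instead of stacking yet another view on top of the previous ones.
spark.sparkContext.setCheckpointDir("s3a://bucket_name/checkpoints/")

deep_df = spark.table("some_deep_view")             # hypothetical view close to the depth limit
flat_df = deep_df.checkpoint()                      # eager by default; materializes the data
flat_df.createOrReplaceTempView("some_deep_view")   # later views now build on a flat plan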

Writing Data to External Databases Through PySpark

I want to write the data from a PySpark DataFrame to external databases, say an Azure MySQL database. So far, I have managed to do this using .write.jdbc():
spark_df.write.jdbc(url=mysql_url, table=mysql_table, mode="append", properties={"user":mysql_user, "password": mysql_password, "driver": "com.mysql.cj.jdbc.Driver" })
Here, if I am not mistaken, the only options available for mode are append and overwrite. However, I want to have more control over how the data is written. For example, I want to be able to perform update and delete operations.
How can I do this? Is it possible to, say, write SQL queries to write data to the external databases? If so, please give me an example.
First, I suggest you use the dedicated Azure SQL connector: https://learn.microsoft.com/en-us/azure/azure-sql/database/spark-connector.
Then I recommend you use bulk mode, as row-by-row mode is slow and can incur unexpected charges if you have Log Analytics turned on.
Lastly, for any kind of data transformation, you should use an ELT pattern:
Load raw data into an empty staging table
Run SQL code - or even better, a stored procedure - which performs the required logic, for example merging from the staging table into a final table with DML (a sketch of this pattern follows below)
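As a rough illustration of that pattern, reusing the variables from the question (staging_table and the stored procedure name are hypothetical):
# Hedged sketch: land the DataFrame in a staging table over JDBC, then let the database
# apply the update/delete/merge logic.
(spark_df.write
    .format("jdbc")
    .option("url", mysql_url)
    .option("dbtable", "staging_table")
    .option("user", mysql_user)
    .option("password", mysql_password)
    .option("driver", "com.mysql.cj.jdbc.Driver")
    .option("truncate", "true")   # keep the staging table, just empty it on each load
    .mode("overwrite")
    .save())

# Step 2 runs on the database side, e.g. CALL merge_staging_into_final();
# a stored procedure that merges/updates/deletes from staging_table into the final table.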

Does registerTempTable cause the table to get cached?

I have a SQL query which does a group by on many fields. The table that it uses is also big (4TB in size). I'm registering the table as a temp table, but I don't know whether the table gets cached when I register it as a temp table. I also don't know whether it would be more performant to convert my query into Scala functions (e.g. df.groupBy().agg()...) rather than having it as a SQL statement. Any help on that?
SQL is most likely going to be the fastest by far - see the Databricks blog.
Did you also try to partition/repartition your DataFrame to see whether it improves performance?
Regarding registerTempTable: it only registers the table within a Spark context. You can check this in the Spark UI.
val test = List((1,2,3),(4,5,6)).toDF("bla","blb","blc")
test.createOrReplaceTempView("test")
test.show()
The Storage tab in the Spark UI stays blank.
vs
val test = List((1,2,3),(4,5,6)).toDF("bla","blb","blc")
test.cache().createOrReplaceTempView("test")
test.show()
By the way, registerTempTable is deprecated since Spark 2.0 and has been replaced by createOrReplaceTempView.
I have a SQL query which does a group by on many fields. The table that it uses is also big (4TB in size). I'm registering the table as a temp table, but I don't know whether the table gets cached when I register it as a temp table.
registerTempTable and createOrReplaceTempView don't cache the data in memory or on disk by themselves unless you use the cache() function.
I also don't know whether it would be more performant to convert my query into Scala functions (e.g. df.groupBy().agg()...) rather than having it as a SQL statement. Any help on that?
Keep in mind that the terms in a SQL query ultimately call the same functions underneath, so whether you use SQL syntax or the functions available in code doesn't matter - it is the same thing.
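A small PySpark illustration of both points - nothing is cached until you ask for it, and the SQL and DataFrame versions compile to the same plan (table and column names are made up):
df = spark.range(100).withColumnRenamed("id", "bla")
df.createOrReplaceTempView("test")            # registers a name only; nothing is cached yet

spark.sql("SELECT bla, COUNT(*) FROM test GROUP BY bla").explain()
df.groupBy("bla").count().explain()           # same physical plan as the SQL version above

spark.catalog.cacheTable("test")              # or df.cache(); caching is still lazy at this point
spark.table("test").count()                   # an action materializes the cache; Storage tab fills up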

How to migrate data between two tables in Cassandra properly

I have to change the schema of one of my tables in Cassandra. It cannot be done simply by using the ALTER TABLE command, because there are some changes in the primary key.
So the question is: How to do such a migration in the best way?
Using the COPY command in cqlsh is not an option here because the dump file can be really huge.
Can I solve this problem without creating some custom application?
As Guillaume suggested in the comment, you can't do this directly in Cassandra. Schema-altering operations are very limited there. You have to perform such a migration manually using one of the tools suggested there, OR, if you have very large tables, you can leverage Spark.
Spark can efficiently read the data from your nodes, transform it locally, and save it back to the database. Remember that such a migration requires reading the whole table's contents, so it might take a while. It is probably the most performant solution, but it needs some bigger preparation - a Spark cluster setup.
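A hedged sketch of such a Spark-based migration, assuming the DataStax spark-cassandra-connector is available on the cluster (keyspace, table and column names below are placeholders):
# Read the old table, reshape it for the new primary key, and append into the new table,
# which must already exist with the new schema.
old_df = (spark.read
    .format("org.apache.spark.sql.cassandra")
    .options(keyspace="my_keyspace", table="old_table")
    .load())

migrated_df = old_df.select("new_pk_col", "other_col1", "other_col2")  # illustrative reshaping

(migrated_df.write
    .format("org.apache.spark.sql.cassandra")
    .options(keyspace="my_keyspace", table="new_table")
    .mode("append")
    .save())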
