Deleting a Managed Table Without Deleting Underlying Data in Databricks

I am looking for a way to delete a Managed table without deleting the underlying data in Databricks.
I accidentally created a Managed table on top of the data stored in my mounted storage account when I was actually looking to create an External (Unmanaged) table.
Is there a way to do this? Maybe 'delete' is not the correct term here; perhaps what I am looking for is a way to de-register the table from the Hive metastore?
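One commonly suggested workaround is to mark the table as external in the metastore before dropping it, so that DROP TABLE removes only the metadata and leaves the files in place. A minimal PySpark sketch, assuming a Hive-metastore table with the hypothetical name my_db.my_table; note that some Spark versions treat EXTERNAL as a reserved table property and reject the ALTER, so test this on a throwaway table first:

# Flip the managed table to external so DROP only de-registers it.
# my_db.my_table is a hypothetical name; some Spark versions reserve the
# EXTERNAL property and will reject this ALTER -- verify on a test table.
spark.sql("ALTER TABLE my_db.my_table SET TBLPROPERTIES('EXTERNAL'='TRUE')")
spark.sql("DROP TABLE my_db.my_table")  # metastore entry removed; data files remain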

Related

File is not read completely by Copy Data in Azure Data Factory

I'm developing a pipeline that should be able to insert data from a .txt file located in Blob Storage into a table in a SQL database.
Problem: Somehow the activity configuration is not working properly, because it is not reading all the records in the file and consequently is not loading all the data into the database. (I noticed this issue when I opened the file and compared the number of records in the .txt file against the SQL table; also, when I searched the table for records from the last month, I didn't find them.)
Note: I checked the character size limits of the columns in the SQL table, and that isn't the problem.
I'd like to share the Copy Data activity and Source Dataset configuration as well:
Sink Dataset:
Do you know what I'm doing wrong here? I hope you can help me. Best regards.
P.S. Here's the Source Dataset
As discussed in the comments, when using the Copy activity you have to make sure the schema is set before running the activity. By design the schema mapping is left empty and has to be configured by the user, either manually or by asking ADF to import the schema from the dataset.
Note: while using the Auto create table option in the sink, it automatically creates the sink table (if nonexistent) based on the source schema, but this is not supported when a stored procedure is specified on the sink side or when staging is enabled.
When using the COPY statement to load data into Azure Synapse Analytics as the sink, the connector supports automatically creating the destination table with DISTRIBUTION = ROUND_ROBIN if it does not exist, based on the source schema.
Refer to the official doc: Copy and transform data in Azure Synapse Analytics by using Azure Data Factory or Synapse pipelines
Source...
Sink...
So Azure Synapse will be used as the sink. Additionally, an Azure Synapse table has to be created which matches the column names, column order, and column data types of the source.
For dynamic mapping:
If you view the pipeline code, you can see in the translator section the JSON equivalent of the mapping section from the UI.
You can reuse this as a base in dynamic mapping so that further similar files can be copied without having to manually configure the schema.
Copy the JSON under mappings in the translator.
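For illustration, a translator block with explicit mappings looks roughly like this (column names and types here are hypothetical; extend the mappings array with one entry per column):

{
    "translator": {
        "type": "TabularTranslator",
        "mappings": [
            { "source": { "name": "Id", "type": "Int32" }, "sink": { "name": "Id", "type": "Int32" } },
            { "source": { "name": "Name", "type": "String" }, "sink": { "name": "Name", "type": "String" } }
        ]
    }
}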

How to access Delta Lake tables without a Databricks cluster running

I have created Delta Lake tables on a Databricks cluster, and I am able to access these tables from an external system/application. However, I need to keep the cluster up and running all the time to be able to access the table data.
Question:
Is it possible to access the Delta Lake tables when the cluster is down?
If yes, then how can I set that up?
I tried to look this up in the docs and found that the Databricks Premium plan has Table Access Control (disabled otherwise). It says:
Enabling Table Access Control will allow users to control who can select, create, and modify databases, tables, views, and functions that they create.
I also found this doc
I don't think this is the option for my requirement. Please suggest.
The solution I found is to store all Delta Lake tables on Storage Gen2, so the data is accessible to external resources irrespective of Databricks clusters.
We only need the cluster up and running while reading a file or writing into a table; the rest of the time it can be shut down.
From the docs: in Databricks we can create Delta tables of two types, managed and unmanaged. Managed tables are those whose data is stored in DBFS (Databricks File System), while unmanaged tables are those for which an external ADLS Gen2 location can be specified.
dataframe.write.mode("overwrite").option("path", "abfss://[ContainerName]@[StorageAccount].dfs.core.windows.net").saveAsTable("table")
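If the goal is to read those Delta files with no cluster at all, one option outside Databricks is the open-source deltalake (delta-rs) Python package, which reads the same ADLS Gen2 path directly. A minimal sketch, assuming account-key auth; the storage_options key names should be verified against the package docs for your version:

# Read the Delta table straight from ADLS Gen2 -- no Spark cluster required.
# pip install deltalake; all bracketed names are placeholders.
from deltalake import DeltaTable

dt = DeltaTable(
    "abfss://[ContainerName]@[StorageAccount].dfs.core.windows.net/[TablePath]",
    storage_options={
        "account_name": "[StorageAccount]",  # key names assumed; check delta-rs docs
        "account_key": "[AccessKey]",
    },
)
df = dt.to_pandas()  # or dt.to_pyarrow_table()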

Is it possible to rename Azure Synapse SQL Pool?

I am using an Azure SQL Database for our team's reporting, and the data size right now is too big to be handled by a single database (at least I think so; it has 2 fact tables with around 100m rows in each table).
The Azure SQL Database is named "operation-db" and the Synapse is named "operation-synapse".
I want to make the transition as smooth as possible for my team, so I'm planning to copy all the tables, views, stored procedures, and user-defined functions over to Synapse.
Once I'm done with that, is there a way to rename "operation-synapse" to "operation-db" so the team doesn't have to go to their code base to change the name of the db?
Thanks!
It is not possible to rename a SQL pool via SQL Server Management Studio; you will receive the following error:
ALTER DATABASE NAME statement is not supported in a Synapse workspace. To update the name of a SQL pool, use the Azure Synapse Portal or the Synapse REST API. (Microsoft SQL Server, Error: 49978)
The REST API however does list a move method to change names:
POST https://management.azure.com/subscriptions/{subscriptionId}/resourceGroups/{resourceGroupName}/providers/Microsoft.Synapse/workspaces/{workspaceName}/sqlPools/{sqlPoolName}/move?api-version=2019-06-01-preview
I couldn't get it to work though, YMMV. Not renaming your db shouldn't be a big deal, however: your team should feel comfortable with changing connection strings etc., and it will help them understand they are moving to a different product (Synapse) with different characteristics.
Before you move to Synapse, however, have you looked at clustered columnstore indexes in Azure SQL DB? They are the default index type in a SQL pool database but are also available in SQL DB. They can compress your data 5-10x, so it might end up not that big at all. Columnstore is great for aggregate queries but less so for point lookups, so have a think about your workload before you migrate.
100 million rows is not big enough for Synapse. The CCI data in each shard will only have one row group (1 million rows).
Consider using partitioning or CCI in your SQL DB itself.
Also, what's your usage pattern? If you are doing point lookups and updates, clustered indexes will perform better.
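If you want to try the columnstore suggestion before migrating, the index change itself is a single statement. A minimal sketch via pyodbc, with hypothetical server, database, and table names; if the table already has a clustered rowstore index, add WITH (DROP_EXISTING = ON):

# Sketch: build a clustered columnstore index on one of the fact tables.
# Server, database, and table names are hypothetical placeholders.
import pyodbc

conn = pyodbc.connect(
    "Driver={ODBC Driver 18 for SQL Server};"
    "Server=tcp:[server].database.windows.net,1433;"
    "Database=operation-db;Uid=[user];Pwd=[password];Encrypt=yes;"
)
conn.execute("CREATE CLUSTERED COLUMNSTORE INDEX cci_FactSales ON dbo.FactSales")
conn.commit()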
You can rename a Synapse database easily using the SSMS GUI. (I've just tried this on v18.8).
Just click once on the database name in the Object Explorer to select it, then press the F2 key to rename it.
The Synapse service must be running (i.e. not paused) for the rename to work.
You can rename a Synapse database using T-SQL. The command is as follows:
ALTER DATABASE [OldSynapseDBName]
MODIFY NAME = [NewSynapseDBName]
Note that you need to be connected to (i.e. issue the command from) the master database, otherwise it will not work.
The command can take around 30 seconds on a 100 GB DB, and there are some caveats, such as that the DB must not be in use during the operation.

How to upsert/insert records in all tables in an Azure SQL Database with Azure Data Factory v2

I have an Azure SQL database with many tables that I want to update frequently with any change made, be it an update or an insert, using Azure Data Factory v2.
There is a section in the documentation that explains how to do this.
However, the example is about two tables, and for each table a TYPE needs to be defined, and for each table a Stored Procedure is built.
I don't know how to generalize this for a large number of tables.
Any suggestion would be welcome.
You can follow my answer https://stackoverflow.com/a/69896947/7392069. I don't know how to generalise the creation of table types and stored procedures either, but the metadata table of the metadata-driven copy task at least provides a lot of comfort in achieving what you need. A sketch of the per-table pattern follows.
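For reference, the per-table pattern from the docs boils down to a table type plus a MERGE stored procedure. A minimal sketch with hypothetical table and column names (dbo.Customer, CustomerId, Name), created from Python via pyodbc; generalizing it would mean generating this pair for every table, e.g. from INFORMATION_SCHEMA.COLUMNS:

# Sketch of the per-table artifacts ADF's upsert-via-stored-procedure needs.
# dbo.Customer and its columns are hypothetical placeholders.
import pyodbc

create_type = """
CREATE TYPE dbo.CustomerTableType AS TABLE (
    CustomerId INT NOT NULL,
    Name NVARCHAR(100)
);
"""

create_proc = """
CREATE PROCEDURE dbo.spUpsertCustomer @changes dbo.CustomerTableType READONLY
AS
BEGIN
    MERGE dbo.Customer AS target
    USING @changes AS source
    ON target.CustomerId = source.CustomerId
    WHEN MATCHED THEN
        UPDATE SET Name = source.Name
    WHEN NOT MATCHED THEN
        INSERT (CustomerId, Name) VALUES (source.CustomerId, source.Name);
END
"""

conn = pyodbc.connect("[connection-string]")  # placeholder connection string
conn.execute(create_type)   # each execute runs as its own batch
conn.execute(create_proc)
conn.commit()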

Create Azure table storage folders

I have a worker role running that creates tables in table storage, and I would like to be able to group these tables into categories like you would under a folder.
I cannot see any way to do this with the table classes in .NET, but when I look at my table storage 'Tables', I see a 'Metrics Table' entry which looks like a 'folder' and expands to show multiple metrics tables below it.
How can I create/add one of these myself programmatically?
Any ideas gratefully received.
I'm afraid this is not possible. Metrics tables are handled differently by Visual Studio. They are not even returned when using the Query Tables REST API (you can only use them directly by name). Tools like Azure Storage Explorer do not show them at all.
Back to your question: best practice is to use a common prefix for tables in the same 'category', e.g. WAD* for all Azure diagnostics tables, NLog* for NLog tables.
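To make the prefix convention concrete, here is a minimal sketch using the azure-data-tables Python SDK (table names are hypothetical; the ge/lt pair emulates a prefix match lexicographically):

# 'Folder'-like grouping by table-name prefix with azure-data-tables.
from azure.data.tables import TableServiceClient

service = TableServiceClient.from_connection_string("[connection-string]")

# Create tables that share a category prefix.
service.create_table("SalesOrders")
service.create_table("SalesInvoices")

# List only the 'Sales' category via a ge/lt range on the name.
for table in service.query_tables("TableName ge 'Sales' and TableName lt 'Salet'"):
    print(table.name)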
The simple answer is that you can't. The Table Storage service contains tables, and each table contains entities. The Metrics Table functionality you're talking about is a UI feature where the UI combines these tables together.
