I have multiple Delta Lake tables storing image data. Now I want to take specific rows via a filter from those tables and put them in another Delta table. I do not want to copy the original data, just a reference or shallow copy. I am using PySpark and Databricks. Can someone please help me find the correct approach for this?
What you actually need is a view over the original table. Use CREATE VIEW to create it with the necessary filter expression, like this:
CREATE VIEW <name> AS
SELECT * FROM <source_table> WHERE <your filter condition>
Then this view can be queried like a normal table, but the data will be filtered according to your condition.
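If you prefer to create the view from PySpark, here is a minimal sketch; the table, view, and column names (images_bronze, filtered_images, label) are placeholders, not names from your setup:

# Hedged sketch: create a filtered view over an existing Delta table.
# `images_bronze`, `filtered_images`, and `label` are placeholder names.
spark.sql("""
    CREATE OR REPLACE VIEW filtered_images AS
    SELECT * FROM images_bronze
    WHERE label = 'cat'
""")

# The view can be queried like a normal table; no data is copied.
spark.table("filtered_images").show()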
I am trying to get a list of tables and columns in a database, so that I can find which tables have a particular column. The best I could find is to use separate queries, like one to show all tables and then one to show all columns in one table, e.g. SHOW TABLES FROM database_name, SHOW COLUMNS FROM databasename.tablename. That is not ideal when you have many tables to go through. Is there any solution out there at all?
Unfortunately, there is no way to fetch all metadata in one call. You can only do SHOW DATABASES, SHOW TABLES IN ..., DESCRIBE TABLE .... There is also spark.catalog.listTables, etc., but these could be slower than the corresponding SQL queries.
I answered a related question yesterday - you can find code there.
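As a rough sketch of how those calls can be stitched together in PySpark to find every table containing a given column; the database and column names (my_database, my_column) are placeholders:

# Hedged sketch: scan all tables in one database for a given column name.
# `my_database` and `my_column` are placeholder names.
target_column = "my_column"

matches = []
for table in spark.catalog.listTables("my_database"):
    columns = spark.catalog.listColumns(table.name, "my_database")
    if any(col.name == target_column for col in columns):
        matches.append(table.name)

print(matches)

This loops over the catalog once per table, so it can be slow on databases with many tables, but it saves you from running a SHOW COLUMNS query by hand for each one.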
How can I drop a Delta table in Databricks? I can't find any information in the docs... maybe the only solution is to delete the files inside the 'delta' folder with the magic command or dbutils:
%fs rm -r delta/mytable?
EDIT:
For clarification, here is a very basic example.
# create a DataFrame
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

cSchema = StructType([
    StructField("items", StringType()),
    StructField("number", IntegerType()),
])
test_list = [["furniture", 1], ["games", 3]]
df = spark.createDataFrame(test_list, schema=cSchema)
and save it in a Delta table
df.write.format("delta").mode("overwrite").save("/delta/test_table")
Then, if I try to delete it, it's not possible with DROP TABLE or a similar action:
%SQL
DROP TABLE 'delta.test_table'
nor with other options like drop table 'delta/test_table', etc., etc...
If you want to completely remove the table then a dbutils command is the way to go:
dbutils.fs.rm('/delta/test_table', recurse=True)
From my understanding, the Delta table you've saved is sitting in blob storage. Dropping the connected database table will drop it from the database, but not from storage.
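A minimal sketch of that behaviour; it assumes the Delta files at '/delta/test_table' were also registered in the metastore under a hypothetical table name test_table:

# Hedged sketch: `test_table` is a hypothetical metastore table registered
# over the external path '/delta/test_table'.
spark.sql("DROP TABLE IF EXISTS test_table")   # removes only the metastore entry

# The underlying Delta files are still in storage:
display(dbutils.fs.ls('/delta/test_table'))

# To remove the data itself, delete the files as well:
dbutils.fs.rm('/delta/test_table', recurse=True)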
You can do that using a SQL command:
%sql
DROP TABLE IF EXISTS <database>.<table>
Basically, in Databricks there are two types of tables - managed and unmanaged.
1. Managed - tables for which Spark manages both the data and the metadata; Databricks stores the metadata and data in DBFS in your account.
2. Unmanaged - Databricks manages only the metadata; the data itself is not managed by Databricks.
So if you write a DROP query for a managed table, it will drop the table and delete the data as well. In the case of an unmanaged table, a DROP query simply deletes the pointer (the meta-information of the table) to the table location, but your data is not deleted, so you need to delete the data externally using rm commands (see the sketch below the link).
for more info:
https://docs.databricks.com/data/tables.html
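A small sketch of how you might check whether a table is managed or external before deciding whether you also need an rm; default and my_table are placeholder names:

# Hedged sketch: inspect the table type before dropping it.
# `default` and `my_table` are placeholder names.
for t in spark.catalog.listTables("default"):
    if t.name == "my_table":
        print(t.name, t.tableType)   # MANAGED or EXTERNAL

# Alternatively, DESCRIBE TABLE EXTENDED shows a "Type" row.
spark.sql("DESCRIBE TABLE EXTENDED default.my_table").show(truncate=False)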
Databricks has unmanaged tables and managed tables, but your code snippet just writes Delta files to a path. It doesn't create a managed or unmanaged table in the metastore. The DROP TABLE syntax doesn't work because you haven't created a table.
Remove files
As @Papa_Helix mentioned, here's the syntax to remove files:
dbutils.fs.rm('/delta/test_table', recurse=True)
Drop managed table
Here's how you could have written your data as a managed table.
df.write.saveAsTable("your_managed_table")
Check to make sure the table exists:
spark.sql("show tables").show()
+---------+------------------+-----------+
|namespace| tableName|isTemporary|
+---------+------------------+-----------+
| default|your_managed_table| false|
+---------+------------------+-----------+
When the data is in a managed table, dropping the table deletes both the table metadata and the underlying data files:
spark.sql("drop table if exists your_managed_table")
Drop unmanaged table
When the data is saved as an unmanaged table, you can drop the table, but doing so only deletes the table metadata and does not delete the underlying data files. Create the unmanaged table and then drop it:
df.write.option("path", "tmp/unmanaged_data").saveAsTable("your_unmanaged_table")
spark.sql("drop table if exists your_unmanaged_table")
The tmp/unmanaged_data folder will still contain the data files, even though the table has been dropped.
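You can confirm this by reading the leftover files directly (same path as in the example above):

# The dropped table's files are still on disk, so the data can still be
# read straight from the path (and re-registered as a table if needed).
spark.read.format("delta").load("tmp/unmanaged_data").show()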
Check to make sure the table has been dropped:
spark.sql("show tables").show()
+---------+---------+-----------+
|namespace|tableName|isTemporary|
+---------+---------+-----------+
+---------+---------+-----------+
So the table isn't there, but you'd still need to run an rm command to delete the underlying data files.
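For completeness, that cleanup looks like this (same path as above; adjust it if relative paths resolve differently in your workspace):

# Remove the leftover data files for the dropped unmanaged table
dbutils.fs.rm("tmp/unmanaged_data", recurse=True)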
Delete from the GUI:
Data -> Database Tables -> pick your database -> select the drop-down next to your table and delete.
I don't know the consequences of this type of delete, so caveat emptor.
I found that to fully delete a Delta table and be able to create a new one under the same name with, say, a different schema, you also have to delete the temp files (otherwise you get an error saying that an old file no longer exists).
dbutils.fs.rm('/delta/<my_schema>/<my_table>', recurse=True)
dbutils.fs.rm('/tmp/delta/<my_schema>/<my_table>', recurse=True)
Business Case:
I have a list of key IDs in an Excel spreadsheet. I want to use Power Query to join these IDs with a details table in a SQL Server database.
Problem
Currently, using Power Query, I only know how to import the entire table (more than 1 million records) and then do a left join against an existing query that targets a local table of IDs.
What I want to do is send that set of IDs in the original query so I'm not pulling back the entire table and then filtering it.
Question
Is there an example of placing an IN clause targeting a local table similar to what is shown below?
= Sql.Database("SQLServer001", "SQLDatabase001",
[Query="SELECT * FROM DTree WHERE ParentID
IN(Excel.CurrentWorkbook(){[Name="tbl_IDs"]}[Content])"])
I would first build a "Connection only" query on the Excel spreadsheet of key IDs.
Then I would start a new Query by connecting to the SQL table. In that query I would add a Merge step to apply the key IDs query as an Inner Join (filter).
This will download the 1M rows to apply the filter, but it is surprisingly quick as most of the work is done in memory. It will only write the filtered result to an Excel table.
To improve performance, filter the rows and columns as much as you can before the Merge step.
Sample Data
I have an Access database with more than 1 million rows of data, as you can see from the screenshot. I want to dedupe the data in terms of BRUIDREQID, as it has duplicates. Is there any way that, when I connect the data from Access to Power Pivot, I can get a deduped dataset?
What I am doing now is using Python to dedupe the data and extract it as a CSV file. I want to know whether I can use Power Pivot instead and save time when deduping a large dataset.
When accessing the Access database, you should be able to write arbitrary SQL, and you could just do a

SELECT DISTINCT *
FROM Table

which would de-dupe the table.
Power Pivot does not offer any functionality to change the existing data in a table once imported - you cannot add or remove rows, nor can you alter the values of any imported fields.