I'm new to Delta Lake, but I want to create indexes on some Delta Lake tables for fast retrieval. Based on the docs, the closest thing I can find is data skipping, by creating a data skipping index:
CREATE DATASKIPPING INDEX ON [DBName.]TableName
I can't seem to find any method of creating indexes other than data skipping.
How do I create indexes in Delta Lake, just like I would for tables in an RDBMS?
Thanks!
Indexing happens automatically on Databricks Delta and OSS Delta Lake as of v1.2.0. As you write data, the columns in the files you write are indexed and added to the internal table metadata. As you query the data and filter, data skipping is applied.
In addition, you can use Z-Ordering on Databricks Delta to optimize the files based on specific columns. Indexing will still be used for the other columns as well.
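For example, to z-order a Databricks Delta table on the columns you filter by most often (the table and column names below are just placeholders):
OPTIMIZE myTable
ZORDER BY (eventDate, deviceId)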
I am trying to do an incremental data load to Azure SQL from CSV files in ADLS through ADF. The problem I am facing is that Azure SQL generates the primary key column (ID) and the data gets inserted into Azure SQL. But when the pipeline is re-triggered, the data gets duplicated again. So how do I handle these duplicates? Only the incremental load should be applied each time, but since the primary key column is generated by SQL, there are duplicates on every run. Please help!
You can consider comparing the source and sink data first, excluding the primary key column, then filtering out only the rows that were added or modified and loading those into the sink table.
In the video below, I create a hash over a few columns from the source and the sink and compare them to identify changed data. In the same way, you can check for changed data first and then load only that into the sink table.
https://www.youtube.com/watch?v=i2PkwNqxj1E
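The video does this in an ADF data flow, but as a rough sketch of the same idea expressed in Azure SQL (T-SQL), with placeholder table and column names, you could hash the business columns on both sides and keep only staged rows whose hash does not yet exist in the target:
-- Placeholder names; the identity column is deliberately left out of the hash.
SELECT s.*
FROM staging_table s
WHERE NOT EXISTS (
    SELECT 1
    FROM target_table t
    WHERE HASHBYTES('SHA2_256', CONCAT(t.col1, '|', t.col2, '|', t.col3))
        = HASHBYTES('SHA2_256', CONCAT(s.col1, '|', s.col2, '|', s.col3))
);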
I am working on a use case with pyspark.
My pyspark job should read from Hive tables periodically and apply some aggregations and transformations on top of them.
But I can't read the full table each time, as I need to append the output to another table. Can anyone please suggest any ideas? One approach I am thinking of is to keep track of the rowId or rownum of the Hive table after each run.
PS: this is not a streaming use case.
Note: I am new to spark.
Thanks,
Albin
Let's break the problem down.
Create two tables to replace the existing one:
Create a Base table, and a Delta table.
Create a view that is the union of both tables.
(This gives you a complete view of all data as of 'now'. Exclude "Processing"-tagged data from the view; I'll explain why later.)
When data is added it's added to the delta table.
When it's time to start processing data: tag the data in the delta table as "Processing".
Copy "Processing" data to the Base table, and complete any required Process/update to aggregations.
Delete the "Processing" data from the delta table once you have completed your calculations.
It's hopefully now clear why you'd exclude data tagged with "Processing" from your view of 'now'.
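A minimal Spark SQL sketch of that layout (table, column, and tag names are placeholders, and both tables are assumed to be Delta tables so that UPDATE and DELETE are available):
-- Base and delta tables share the same business columns; the delta table carries a status tag.
CREATE TABLE base_table  (id BIGINT, payload STRING) USING delta;
CREATE TABLE delta_table (id BIGINT, payload STRING, status STRING) USING delta;

-- View of 'now': everything in base plus delta rows not currently being processed.
CREATE OR REPLACE VIEW current_view AS
  SELECT id, payload FROM base_table
  UNION ALL
  SELECT id, payload FROM delta_table WHERE status IS NULL OR status <> 'Processing';

-- When a processing run starts: tag a batch, copy it into base, then remove it from delta.
UPDATE delta_table SET status = 'Processing' WHERE status IS NULL;
INSERT INTO base_table SELECT id, payload FROM delta_table WHERE status = 'Processing';
DELETE FROM delta_table WHERE status = 'Processing';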
I am connecting to a delta table in Azure gen 2 data lake by mounting in Databricks and creating a table ('using delta'). I am then connecting to this in Power BI using the Databricks connector.
Firstly, I am unclear as to the relationship between the data lake and the Spark table in Databricks. Is it correct that the Spark table retrieves the latest snapshot from the data lake (delta lake) every time it is itself queried? Is it also the case that it is not possible to effect changes in the data lake via operations on the Spark table?
Secondly, what is the best way to reduce the columns in the Spark table (ideally before it is read into Power BI)? I have tried creating the Spark table with a specified subset of columns but get a 'cannot change schema' error. Instead I can create another Spark table that selects from the first Spark table, but this seems pretty inefficient and (I think) will need to be recreated frequently in line with the refresh schedule of the Power BI report. Is it possible to have a Spark Delta table that references another Spark Delta table, so that the former is also always the latest snapshot when queried?
As you can tell, my understanding of this is limited (as is the documentation!) but any pointers very much appreciated.
Thanks in advance and for reading!
A table in Spark is just metadata that specifies where the data is located. So when you're reading the table, Spark under the hood looks up in the metastore where the data is stored, what the schema is, etc., and accesses that data. Changes made on ADLS will also be reflected in the table. It's also possible to modify the table from those tools, but it depends on what access rights are available to the Spark cluster that processes the data - you can set permissions either at the ADLS level or using table access control.
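For example, a table defined over an existing Delta location (the mount path and table name below are just placeholders) is only a pointer to that data:
CREATE TABLE mytable
USING delta
LOCATION '/mnt/mymount/path/to/delta'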
For the second part - you just need to create a view over the original table that selects only a limited set of columns. The data is not copied, and the latest updates in the original table will always be available for querying. Something like:
CREATE OR REPLACE VIEW myview
AS SELECT col1, col2 FROM mytable
P.S. If you're only accessing the data via Power BI or other BI tools, you may look into Databricks SQL (once it is in public preview), which is heavily optimized for BI use cases.
I posted this question on the Databricks forum; I'll copy it below, but basically I need to ingest new data from parquet files into a Delta table. I think I have to figure out how to use a MERGE statement effectively and/or use an ingestion tool.
I'm mounting some parquet files and then I create a table like this:
sqlContext.sql("CREATE TABLE myTableName USING parquet LOCATION 'myMountPointLocation'");
And then I create a delta table with a subset of columns and also a subset of the records. If I do both these things, my queries are super fast.
sqlContext.sql("CREATE TABLE $myDeltaTableName USING DELTA SELECT A, B, C FROM myTableName WHERE Created > '2021-01-01'");
What happens if I now run:
sqlContext.sql("REFRESH TABLE myTableName");
Does my table now update with any additional data that may be present in my parquet source files? Or do I have to re-mount those parquet files to get additional data?
Does my delta table also update with new records? I doubt it but one can hope...
Is this a case for AutoLoader? Or maybe I do some combination of mounting, re-creating / refreshing my source table, and then maybe MERGE new records / updated records into my delta table?
I am trying to MERGE two tables using Spark SQL and getting an error with the statement.
The tables are created as external tables pointing to Azure ADLS storage. The SQL is executed using Databricks.
Table 1:
Name,Age,Sex
abc,24,M
bca,25,F
Table 2:
Name,Age,Sex
abc,25,M
acb,25,F
Table 1 is the target table and Table 2 is the source table.
In Table 2 I have one insert and one update record, which need to be merged into the target Table 1.
Query:
MERGE INTO table1 using table2 ON (table1.name=table2.name)
WHEN MATCHED AND table1.age <> table2.age AND table1.sex<>table2.sex
THEN UPDATE SET table1.age=table2.age AND table1.sex=table2.sex
WHEN NOT MATCHED
THEN INSERT (name,age,sex) VALUES (table2.name,table2.age,table2.sex)
Does Spark SQL support MERGE, or is there another way of achieving this?
Thanks
Sat
To use MERGE you need the Delta Lake option (and its associated jars); then MERGE is available.
Otherwise, SQL MERGE is not supported by Spark, and you need the DataFrame writer APIs with your own logic; there are a few different ways to do this. Even with ORC ACID, Spark will not work this way.
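Assuming both tables are recreated as (or converted to) Delta tables, a corrected version of the statement above might look like this. Note that the assignments in the SET clause are separated by commas rather than AND, and the matched condition uses OR so a row is updated when either column has changed:
MERGE INTO table1
USING table2
ON table1.name = table2.name
WHEN MATCHED AND (table1.age <> table2.age OR table1.sex <> table2.sex) THEN
  UPDATE SET table1.age = table2.age, table1.sex = table2.sex
WHEN NOT MATCHED THEN
  INSERT (name, age, sex) VALUES (table2.name, table2.age, table2.sex)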