I am connecting to a delta table in Azure gen 2 data lake by mounting in Databricks and creating a table ('using delta'). I am then connecting to this in Power BI using the Databricks connector.
Firstly, I am unclear as to the relationship between the data lake and the Spark table in Databricks. Is it correct that the Spark table retrieves the latest snapshot from the data lake (delta lake) every time it is itself queried? Is it also the case that it is not possible to effect changes in the data lake via operations on the Spark table?
Secondly, what is the best way to reduce the columns in the Spark table (ideally before it is read into Power BI)? I have tried creating the Spark table with a specified subset of columns, but I get a "cannot change schema" error. Instead I can create another Spark table that selects from the first Spark table, but this seems pretty inefficient and (I think) will need to be recreated frequently in line with the refresh schedule of the Power BI report. I don't know if it's possible to have a Spark Delta table that references another Spark Delta table, so that the former always reflects the latest snapshot when queried?
As you can tell, my understanding of this is limited (as is the documentation!) but any pointers very much appreciated.
Thanks in advance and for reading!
A table in Spark is just metadata that specifies where the data is located. So when you read the table, Spark under the hood looks up in the metastore where the data is stored, what the schema is, etc., and accesses that data. Changes made on ADLS will also be reflected in the table. It's also possible to modify the table from these tools, but it depends on what access rights are available to the Spark cluster that processes the data - you can set permissions either at the ADLS level or using table access control.
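For illustration, here is a minimal sketch of that registration, assuming a hypothetical mount path and table name; only metadata goes into the metastore, the data itself stays in the lake:

-- Register a Delta table over data that already lives in the mounted ADLS path.
CREATE TABLE IF NOT EXISTS mytable
USING DELTA
LOCATION '/mnt/mydatalake/path/to/delta';

-- The table's metadata, including the location it points to, can be inspected with:
DESCRIBE DETAIL mytable;

Because the location is resolved at read time, every query against mytable sees the current state of the Delta table in the lake.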
For the second part - you just need to create a view over the original table that selects only a limited set of columns. The data is not copied, and the latest updates to the original table will always be available for querying. Something like:
CREATE OR REPLACE VIEW myview
AS SELECT col1, col2 FROM mytable
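Power BI can then be pointed at myview through the Databricks connector; since a view stores no data of its own, every report refresh reads the current snapshot of mytable.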
P.S. If you're only accessing the data via Power BI or other BI tools, you may want to look at Databricks SQL (once it is in public preview), which is heavily optimized for BI use cases.
Related
I need to create a dashboard inside Databricks that summarizes the current number of rows per table in the workspace.
Is there a way to create a SQL query to calculate the number of rows by table, schema, and catalog? The expected result would be:
Catalog            Schema        Table            Rows
example_catalog_1  Finance       table_example_1  1567000
example_catalog_1  Finance       table_example_2  67000
example_catalog_2  Procurement   table_example_1  45324888
example_catalog_2  Procurement   table_example_2  89765987
example_catalog_2  Procurement   table_example_3  145000
Currently, I am working in a pure SQL workflow, so I would like to understand if it's possible to do this using SQL, because as far as I know, the dashboards in Databricks do not accept PySpark code.
I was looking for a way to do that. I know that it's possible to access the tables in the workspace using system.information_schema.tables, but how can I use it to count the total rows for each table listed there?
I found that in SQL Server this is possible via the sys schema, dynamic queries, or a BEGIN...END block, but I couldn't find a way to do the same in Databricks.
I strongly doubt you can run that kind of query in a Databricks dashboard. The link shared by #Sharma is more about how to get the record count using a DataFrame, not how to link that with a Databricks dashboard.
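That said, if Unity Catalog's system.information_schema is available, one workaround is to let SQL generate the per-table COUNT(*) statements for you and then run the generated UNION ALL query (or paste it into a dashboard tile). This is only a sketch; the trailing UNION ALL on the last generated line has to be removed by hand, and the column aliases are illustrative:

-- Generate one COUNT(*) statement per table from the metadata.
SELECT concat(
         "SELECT '", table_catalog, "' AS catalog_name, '",
         table_schema, "' AS schema_name, '",
         table_name, "' AS table_name, COUNT(*) AS row_count FROM ",
         table_catalog, '.', table_schema, '.', table_name,
         ' UNION ALL'
       ) AS generated_sql
FROM system.information_schema.tables
WHERE table_type IN ('MANAGED', 'EXTERNAL')
  AND table_schema <> 'information_schema';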
I have 100-150 Azure databases with the same table schema. There are 300-400 tables in each database. Separate reports are enabled on all these databases.
Now I want to merge these databases into a centralized database and generate some different Power BI reports from this centralized database.
The approach I am thinking of is:
- There will be a Master table in the target database which will have DatabaseID and Name.
- All the tables in the target database will have a composite primary key created from the source primary key and the DatabaseID.
- There will be multiple (30-35) instances of the Azure Data Factory pipeline, and each instance will be responsible for merging data from 10-15 databases.
- These ADF pipelines will be scheduled to run weekly.
Can anyone please advise whether the above approach is feasible in this scenario, or whether there is another option we could go for?
Thanks in advance.
You are trying to create a data warehouse.
I hope you never simply merge all 150 Azure SQL Databases as-is, because as soon as you try to query that beefy archive you will hit errors.
This is because Power BI, like any other tool, comes with limitations:
Limitation of Distinct values in a column: there is a 1,999,999,997 limit on the number of distinct values that can be stored in a column.
Row limitation: If the query sent to the data source returns more than one million rows, you see an error and the query fails.
Column limitation: The maximum number of columns allowed in a dataset, across all tables in the dataset, is 16,000 columns.
A data warehouse is not just a merge of ALL of your data. You need to clean the data and import only the most useful parts.
So the approach you are proposing is overall OK, but just import what you need.
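As a minimal sketch of the composite-key idea, assuming illustrative table and column names rather than your actual schema:

-- Master table identifying each source database.
CREATE TABLE dbo.SourceDatabase (
    DatabaseID    INT           NOT NULL PRIMARY KEY,
    DatabaseName  NVARCHAR(128) NOT NULL
);

-- Example merged table: the source primary key alone is no longer unique,
-- so the composite primary key includes DatabaseID.
CREATE TABLE dbo.Invoice (
    DatabaseID  INT           NOT NULL,
    InvoiceID   INT           NOT NULL,  -- primary key in the source database
    Amount      DECIMAL(18,2) NULL,
    CONSTRAINT PK_Invoice PRIMARY KEY (DatabaseID, InvoiceID),
    CONSTRAINT FK_Invoice_SourceDatabase
        FOREIGN KEY (DatabaseID) REFERENCES dbo.SourceDatabase (DatabaseID)
);

The ADF pipelines can then stamp each copied row with the DatabaseID of the source it came from.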
We are using an SAP ABAP on Oracle environment. I'm trying to implement Change Data Capture for the SAP BSEG table in Azure Data Factory using the SAP Table connector. In the SAP Table connector, I don't see an option to pass any join conditions. Based on what fields can we capture CDC on the BSEG table?
BSEG is a cluster table.
It dates back to R/2 days on mainframes.
See SE11 BSEG --> menu option Database Object --> Database utility.
Run Check.
It will most likely say NOT ON DATABASE.
If you want to access the data via views, see one of the numerous index tables: BSxx, with descriptions like "Accounting: Secondary Index for xxxxx".
These so-called index tables are separate tables that behave like indexes
on BSEG, but they aren't true indexes, since cluster tables cannot have indexes.
The index tables are real tables you can access with joins/views.
The document number can be used to read BSEG later, should that still be necessary.
You may find FI_DOCUMENT_READ and BKPF useful too.
In theory the Index tables should be enough.
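Purely to illustrate the kind of selection the index tables make possible (the SAP Table connector itself still reads one table at a time, as the next answer notes), here is a hedged SQL sketch; the field names are the standard FI document keys and dates, but verify them in SE11 for your release:

-- New or changed documents can be found via the header table BKPF's entry/change
-- dates and then joined to a secondary index table such as BSID on the document key.
SELECT h.BUKRS, h.BELNR, h.GJAHR, i.BUZEI, i.KUNNR, i.DMBTR
FROM   BKPF h
JOIN   BSID i
  ON   i.BUKRS = h.BUKRS
  AND  i.BELNR = h.BELNR
  AND  i.GJAHR = h.GJAHR
WHERE  h.CPUDT >= '20240101'   -- entry date used as the CDC watermark
   OR  h.AEDAT >= '20240101';  -- date of last document change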
From the SAP Table connector help:
Currently SAP Table connector only supports one single table with the default function module. To get the joined data of multiple tables, you can leverage the customRfcReadTableFunctionModule property in the SAP Table connector following steps below
...
So no, table joins are not supported by default; you need to write a custom FM in the SAP backend with the predefined interface. The interface to implement is described in the help.
If you use Azure Data Factory to load into Azure Data Explorer, big tables like BSEG can be handled with a workaround.
Although BSEG is a cluster of tables in SAP, from the SAP connector's point of view it is a table with rows and columns which can be partitioned.
Here is an example for MSEG which is similar.
MSEG_Partitioned
Kind Regards
Gauchet
I am trying to MERGE two tables using Spark SQL and I am getting an error with the statement.
The tables are created as external tables pointing to Azure ADLS storage. The SQL is executed using Databricks.
Table 1:
Name,Age,Sex
abc,24,M
bca,25,F
Table 2:
Name,Age,Sex
abc,25,M
acb,25,F
Table 1 is the target table and Table 2 is the source table.
Table 2 contains one insert and one update record which need to be merged into the target table, Table 1.
Query:
MERGE INTO table1 using table2 ON (table1.name=table2.name)
WHEN MATCHED AND table1.age <> table2.age AND table1.sex<>table2.sex
THEN UPDATE SET table1.age=table2.age AND table1.sex=table2.sex
WHEN NOT MATCHED
THEN INSERT (name,age,sex) VALUES (table2.name,table2.age,table2.sex)
Does Spark SQL support MERGE, or is there another way of achieving this?
Thanks
Sat
To use MERGE you need the Delta Lake option (and the associated jars); then you can use MERGE.
Otherwise, SQL MERGE is not supported by Spark, and you need the DataFrame writer APIs with your own logic. There are a few different ways to do this. Even with ORC ACID, Spark will not work this way.
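For reference, assuming both tables are Delta tables, a corrected version of the statement from the question would look like the sketch below. Note that the SET assignments are separated by commas (AND is not valid there), and the matched condition uses OR so that a change in either column triggers the update:

MERGE INTO table1
USING table2
ON table1.name = table2.name
WHEN MATCHED AND (table1.age <> table2.age OR table1.sex <> table2.sex) THEN
  UPDATE SET table1.age = table2.age, table1.sex = table2.sex
WHEN NOT MATCHED THEN
  INSERT (name, age, sex) VALUES (table2.name, table2.age, table2.sex);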
I am trying to ingest data from a Sybase source into Azure Data Lake. I am ingesting several tables using a watermark table that holds the table names from the Sybase source. The process works fine for a full import; however, we are trying to import tables every 15 minutes to feed a dashboard. We don't need to ingest the whole table, as we don't need all of the data in it.
The table doesn't have a dateModified column or any kind of incremental ID to perform an incremental load. The only way of filtering out unwanted data is to join onto another lookup table at the source and then use a filter value in the WHERE clause.
Is there a way we can do this in Azure Data Factory? I have attached a screenshot of my current pipeline just to make it a bit clearer.
Many thanks for looking into this. I have managed to find a solution. I was using a watermark table to ingest about 40 tables using one pipeline. My only issue was how to use the join and WHERE filter in my query without hard-coding it in the pipeline. I achieved this by adding "Join" and "Where" fields to my watermark table and then passing them into the "Query" as #{item().Join} #{item().Where}. It worked like magic.
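For anyone following the same route, here is a hedged sketch of what the dynamic source query in the ForEach copy activity can look like, keeping the same #{item()...} interpolation style as above (the watermark column names TableName, Join and Where are illustrative):

-- item().Join holds something like:  JOIN dbo.Lookup l ON l.KeyID = s.KeyID
-- item().Where holds something like: WHERE l.IsActive = 'Y'
SELECT s.*
FROM #{item().TableName} s
#{item().Join}
#{item().Where}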