Greenplum-Spark Connector generates too many external tables - apache-spark

I'm programming with the Greenplum-Spark Connector, and I find that every time I use Spark to read table data, an external table is created in Greenplum and is not deleted after the read. When I query the same table again, another external table is generated. Can anybody tell me whether these external tables can be reused in the future? Will they be cleaned up automatically?
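For context, the kind of read that triggers this behavior looks roughly like the sketch below (PySpark). The option names follow the connector's documentation, but the URL, schema, table, credentials, and partition column are placeholders and may need adjusting for your connector version:

from pyspark.sql import SparkSession

# Minimal sketch of a Greenplum-Spark Connector read; all connection details are placeholders.
spark = SparkSession.builder.appName("greenplum-read-example").getOrCreate()

gpdf = (
    spark.read.format("greenplum")                                # data source name registered by the connector
    .option("url", "jdbc:postgresql://gpmaster:5432/testdb")      # Greenplum master JDBC URL (placeholder)
    .option("dbschema", "public")                                 # schema containing the source table
    .option("dbtable", "my_table")                                # table to read; each load spawns an external table in Greenplum
    .option("user", "gpadmin")                                    # placeholder credentials
    .option("password", "changeme")
    .option("partitionColumn", "id")                              # integer column used to split the read across Spark partitions
    .load()
)

gpdf.show(5)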

Related

Delta load on BSEG table into Azure using SAP table connector

We are using an SAP ABAP on Oracle environment. I'm trying to implement change data capture (CDC) for the SAP BSEG table in Azure Data Factory using the SAP Table connector. In the SAP Table connector, I don't see an option to pass any join conditions. Based on which fields can we capture CDC on the BSEG table?
BSEG is a cluster table.
It dates back to R/2 days on mainframes.
See SE11 BSEG --> menu option Database Object --> Database utility.
Run Check.
It will most likely say NOT ON DATABASE.
If you want to access the data via views see one of the numerous index tables.
BSxx, description "Accounting: Secondary Index for xxxxx".
These so-called index tables are separate tables that behave like indexes
on BSEG but aren't true indexes, as cluster tables cannot have indexes.
The index tables are real tables you can access with joins/views.
The document number can be used to read BSEG later, should that still be necessary.
You may find FI_DOCUMENT_READ and BKPF useful too.
In theory the Index tables should be enough.
From the SAP Table connector help:
Currently SAP Table connector only supports one single table with the default function module. To get the joined data of multiple tables, you can leverage the customRfcReadTableFunctionModule property in the SAP Table connector following steps below
...
So no, table joins are not supported by default; you need to write a custom function module (FM) with the predefined interface in the SAP backend. The required interface is described in the help.
If you use Azure Data Factory with Azure Data Explorer, big tables like BSEG can be handled with a workaround.
Although BSEG is a cluster table in SAP, from the SAP connector's point of view it is a table with rows and columns which can be partitioned.
Here is an example for MSEG which is similar.
MSEG_Partitioned
Kind Regards
Gauchet

Delta tables in Databricks and into Power BI

I am connecting to a delta table in Azure gen 2 data lake by mounting in Databricks and creating a table ('using delta'). I am then connecting to this in Power BI using the Databricks connector.
Firstly, I am unclear as to the relationship between the data lake and the Spark table in Databricks. Is it correct that the Spark table retrieves the latest snapshot from the data lake (delta lake) every time it is itself queried? Is it also the case that it is not possible to effect changes in the data lake via operations on the Spark table?
Secondly, what is the best way to reduce the columns in the Spark table (ideally before it is read into Power BI)? I have tried creating the Spark table with a specified subset of columns but get a "cannot change schema" error. Instead I can create another Spark table that selects from the first Spark table, but this seems pretty inefficient and (I think) will need to be recreated frequently in line with the refresh schedule of the Power BI report. I don't know if it's possible to have a Spark Delta table that references another Spark Delta table, so that the former is also always the latest snapshot when queried?
As you can tell, my understanding of this is limited (as is the documentation!) but any pointers very much appreciated.
Thanks in advance and for reading!
A table in Spark is just metadata that specifies where the data is located. So when you're reading the table, Spark under the hood just looks up in the metastore where the data is stored, what the schema is, etc., and accesses that data. Changes made on ADLS will also be reflected in the table. It's also possible to modify the table from other tools, but it depends on what access rights are available to the Spark cluster that processes the data - you can set permissions either at the ADLS level or using table access control.
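If you want to see exactly where a table's metadata points, you can check it from a notebook. A small sketch, assuming the table is called mytable (the name is just a placeholder):

# Sketch: inspect where the Delta table's data actually lives.
detail = spark.sql("DESCRIBE DETAIL mytable")
detail.select("format", "location", "numFiles").show(truncate=False)

# Reading the table resolves through the metastore to the current snapshot in the data lake.
spark.table("mytable").printSchema()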
For the second part - you just need to create a view over the original table that selects only a limited set of columns - the data is not copied, and the latest updates in the original table will always be available for querying. Something like:
CREATE OR REPLACE VIEW myview
AS SELECT col1, col2 FROM mytable
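The same thing can be done from a notebook, plus a quick check that only the selected columns come through (the table and column names are the placeholders from the snippet above):

# Sketch: create the column-limited view and verify its schema.
spark.sql("CREATE OR REPLACE VIEW myview AS SELECT col1, col2 FROM mytable")
spark.table("myview").printSchema()   # should list only col1 and col2
spark.table("myview").show(5)         # always reflects the latest snapshot of mytable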
P.S. If you're only accessing the data via Power BI or other BI tools, you may look into Databricks SQL (once it is in public preview), which is heavily optimized for BI use cases.

Azure Data Factory DataFlow Filter is taking a lot of time

I have an ADF pipeline which executes a Data Flow.
The Data Flow has: a source table A, which has around 1 million rows;
a Filter with a query to select only yesterday's records from the source table;
Alter Row settings that use upsert;
and a sink, the archival table where the records are upserted.
This whole pipeline takes around 2 hours, which is not acceptable. Actually, only around 3,000 records are being transferred/upserted.
Core count is 16. Tried the partitioning with round robin and 20 partitions.
Similar archival doesn't take more than 15 minutes for another table which has around 100K records.
I thought of creating a source that would select only yesterday's records, but in the dataset we can only select a table.
Please suggest if I am missing anything to optimize it.
The table of the Data Set really doesn't matter. Whichever activity you use to access that Data Set can be toggled to use a query instead of the whole table, so that you can pass in a value to select only yesterday's data from the database.
Of course, if you have the ability to create a stored procedure on the source, you could also do that.
When migrating really large sets of data, you'll get much better performance using a Copy activity to stage the data into an Azure Storage Blob before using another Copy activity to pull from that Blob into the destination. But for what you're describing here, that doesn't seem necessary.

Is there a way to use a join query in Azure Data Factory when copying data from a Sybase source

I am trying to ingest data from a Sybase source into Azure Data Lake. I am ingesting several tables using a watermark table that holds the table names from the Sybase source. The process works fine for a full import; however, we are trying to import tables every 15 minutes to feed a dashboard. We don't need to ingest the whole table, as we don't need all the data from it.
The table doesn't have a dateModified column or any kind of incremental ID to perform an incremental load. The only way of filtering out unwanted data is to perform a join onto another lookup table at the source and then use a filter value in the "Where" clause.
Is there a way we can perform this in Azure Data Factory? I have attached my current pipeline screenshot just to make it a bit clearer.
Many thanks for looking into this. I have managed to find a solution. I was using a watermark table to ingest about 40 tables with one pipeline. My only issue was how to use a join and a "Where" filter in my query without hard-coding them in the pipeline. I achieved this by adding "Join" and "Where" fields to my watermark table and then passing them into the "Query" as #{item().Join} #{item().Where}. It worked like magic.

Create a volatile table in Teradata

I have a SharePoint list which I have linked to in MS Access.
The information in this table needs to be compared to information in our data warehouse, based on keys both sets of data have.
I want to be able to create a query which will upload the ishare data into our data warehouse under my login, run the comparison, and then export the details to Excel somewhere. MS Access seems to be the way to go here.
I have managed to link the ishare list (with difficulties due to the attachment fields) and then create a local table based on it.
I have managed to create the temp table in my volatile space.
How do I append the newly created table that I created from the list into my temporary space?
I am using Access 2010 and SharePoint 2007.
Thank you for your time
If you can avoid using Access I'd recommend it since it is an extra step for what you are trying to do. You can easily manipulate or mesh data within the Teradata session and export results.
You can run the following types of queries using the standard Teradata SQL Assistant:
CREATE VOLATILE TABLE NewTable (
column1 DEC(18,0),
column2 DEC(18,0)
)
PRIMARY INDEX (column1)
ON COMMIT PRESERVE ROWS;
Change your assistant to Import Mode (File-> Import Data)
INSERT INTO NewTable (?,?)
Browse for your file; this example expects a comma-delimited file with two numeric columns, with column one being the index.
You can now query or join this table to any information in the uploaded database.
When you are finished you can drop with:
DROP TABLE NewTable
You can export results using File->Export Data as well.
If this is something you plan on running frequently, there are many ways to easily do these types of imports and exports. The Python module pandas has simple functionality for reading a query directly into a DataFrame and dropping it into Excel through the pandas.io.sql.read_frame() and .to_excel() functions.
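A rough sketch of that approach, using the teradatasql driver and the modern pandas.read_sql in place of the older read_frame; the host, credentials, and file name are placeholders. Note that a volatile table is only visible to the session that created it, so you would create and load it over this same connection before querying it:

import pandas as pd
import teradatasql  # Teradata's Python DB API driver (pip install teradatasql)

# Placeholder connection details.
with teradatasql.connect(host="tdhost", user="myuser", password="mypassword") as con:
    # NewTable is the volatile table from the example above; it must be created
    # and loaded in this same session before this query will work.
    df = pd.read_sql("SELECT column1, column2 FROM NewTable", con)

# Export the result to Excel (requires openpyxl).
df.to_excel("comparison_result.xlsx", index=False)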
