Azure Synapse PolyBase/external tables - return only the latest file

We have files partitioned in the data lake and are using an Azure Synapse serverless SQL pool to query them through external tables before visualising in Power BI.
Files are stored in the following partition format: {source}/{year}/{month}/{filename}_{date}.parquet
We then have an external table that loads all files for that source.
For sources that add an incremental file each day this works great, as we want all files to be included. However, for some integrations we want to return only the latest file (i.e. the latest file sent to us is the current state that we want to load into Power BI).
Is it possible in the external table statement to only return the latest file? Or do we have to add extra logic?
We could load all the files in, and then filter for the latest filename and save that in a new location. Alternatively we could try to create an external table that changes every day.
Is there a better way to approach this?

If you are using dedicated pools, I would point the table's LOCATION at the folder that holds the latest files.
Load each day into a new folder and then alter the LOCATION of the external table to look at the current/latest day's folder. You may need additional logic, for example a control table that tracks the latest successful load date; a rough sketch of that daily swap is below.
Unfortunately I have not found a better way to do this myself.
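A minimal sketch of that daily re-point, assuming a pyodbc connection to the pool; the object names (SalesLatest, DataLakeSource, ParquetFormat), the columns and the folder layout are all hypothetical, and the table is simply dropped and recreated with the new LOCATION:

import pyodbc
from datetime import date

# Placeholder connection string; authentication details depend on your setup.
conn = pyodbc.connect(
    "Driver={ODBC Driver 17 for SQL Server};"
    "Server=<workspace>.sql.azuresynapse.net;Database=<db>;UID=<user>;PWD=<password>",
    autocommit=True,
)
cursor = conn.cursor()

# In practice this folder would come from the control table of successful loads.
latest_folder = f"source/{date.today():%Y/%m}/"

# Drop and recreate the external table so it points at the latest folder only.
cursor.execute("IF OBJECT_ID('dbo.SalesLatest') IS NOT NULL DROP EXTERNAL TABLE dbo.SalesLatest;")
cursor.execute(f"""
CREATE EXTERNAL TABLE dbo.SalesLatest (
    id INT,
    amount DECIMAL(18, 2)
)
WITH (
    LOCATION = '{latest_folder}',
    DATA_SOURCE = DataLakeSource,   -- existing external data source
    FILE_FORMAT = ParquetFormat     -- existing parquet file format
);
""")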

Related

Build a pipeline in Azure Data Factory to load Excel files, format content, transform to CSV and send to Azure SQL DB

I'm new to the Azure environment and have been watching tutorials/reading documents, but I'm trying to figure out how to set up a flow that enables the process I describe below. The starting point is reports in .xlsx format produced monthly by the Mktg Dept: the requirement is to bring them into an Azure SQL DB so that the data can be stored and analysed. So far I have managed to put those files (previously converted manually to .csv format) in Blob storage and to build an ADF pipeline that copies each file into a table in the SQL DB.
The problem is that, as far as I understand, ADF cannot directly handle .xlsx files, and I'm wondering how to set up an automated procedure that converts the .xlsx files to .csv and saves them to Blob storage. I was thinking of adding a Python script/Databricks notebook to the pipeline to convert the format, but I'm not sure this is the best solution. Any hint/reference to existing tutorials or resources would be very appreciated.
I found a tutorial which uses Logic Apps to do the conversion.
Datanovice indirectly suggested using a Custom activity to run either a C# or Python application to do the conversion for you.
The least expensive solution would be to do the conversion before uploading to blob, like Datanovice said.
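For illustration, here is a minimal sketch of the conversion step itself, assuming pandas (with openpyxl for .xlsx support) and the azure-storage-blob SDK; the connection string, container names and file names are placeholders:

import io
import pandas as pd
from azure.storage.blob import BlobServiceClient

# Placeholder connection string and blob names.
service = BlobServiceClient.from_connection_string("<storage-connection-string>")
source = service.get_blob_client("raw-xlsx", "mktg_report_2020_01.xlsx")
target = service.get_blob_client("csv", "mktg_report_2020_01.csv")

# Download the workbook, convert the first sheet to CSV, and upload the result.
workbook = io.BytesIO(source.download_blob().readall())
df = pd.read_excel(workbook)  # reading .xlsx requires the openpyxl engine
target.upload_blob(df.to_csv(index=False), overwrite=True)

The same script could run inside a Custom activity, an Azure Function, or a Databricks notebook triggered by the pipeline.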

Copy latest files from S3 to Azure Blob (using Azure Data Factory V2)

I'm still new to Azure Data Factory and am trying to move files that are dumped in my S3 folder/bucket daily to Azure blob. I already created datasets (for source and sink) and linked services in Data Factory.
But since my S3 bucket receives a new file every day, I'm wondering how to move the latest file that was dropped in S3 (say at 5am EST) on a daily basis. I have looked through several existing answers online, but none of them explains how to figure out which is the latest file in S3 (maybe based on last modified date/time, or by matching a file name pattern like 'my_report_YYYYMMDD.csv.gz') and copy only that file to the destination blob.
Thank you in advance for your help/answer!
My idea is as below:
1. First, configure your pipeline execution with a schedule trigger (see the schedule trigger documentation).
2. Use a Get Metadata activity, which supports the Amazon S3 connector, to list the files in your S3 dataset.
Get metadata such as the last modified time and the file name.
3. Pass this metadata array, which contains the lastModified time and file name, to a Web activity or Azure Function activity. In that REST API or function you can apply your sorting logic to find the latest modified file (a sketch of that step follows the Custom activity note below).
4. Get the fileName back from the Web activity or Azure Function activity, then copy that file into Azure Blob Storage.
Another idea is to use a Custom activity and implement your requirements in .NET code.
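As a rough illustration of step 3, here is a minimal sketch of an HTTP-triggered Azure Function in Python that picks the most recently modified file; the payload shape is an assumption, not the exact output of the Get Metadata activity:

import json
import azure.functions as func

def main(req: func.HttpRequest) -> func.HttpResponse:
    # Expected body (assumed shape): [{"name": "...", "lastModified": "2020-01-02T05:00:00Z"}, ...]
    files = req.get_json()
    latest = max(files, key=lambda f: f["lastModified"])  # ISO timestamps sort correctly as strings
    return func.HttpResponse(
        json.dumps({"fileName": latest["name"]}),
        mimetype="application/json",
    )

The pipeline can then read the returned fileName from the activity output and pass it to the Copy activity.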
(Side note: thanks to Jay Gong above for suggesting a solution)
I found the answer, and it's simpler than I expected. There is dynamic content/an expression that we can add to the 'Filter by last modified' field of the S3 dataset. I used a dynamic expression there to pick only files that are no more than 5 hours old. The Data Factory expression language documentation covers these expressions in more detail.
Hope this is helpful.

Table Replication and Synchronization in Azure

I am pretty new to the Azure cloud and am stuck at a point where I want to replicate one table into another database with the same schema and table name.
By replication I mean that the new table in the other database should be automatically synced with the original table. I can do this using an elastic table, but the queries take far too long and sometimes time out, so I am thinking of having a local table in the other database instead of an elastic table, but I am not sure how I can do this in Azure.
Note: Both databases reside on the same DB server.
Any examples or links would be helpful.
Thanks
To achieve this you can use a DACPAC (data-tier application package). A DACPAC can be created in Visual Studio or extracted from an existing database; it contains the database creation scripts and manages your deltas for you. The Microsoft documentation on data-tier applications describes this in more detail, including how to build and deploy a DACPAC both from Visual Studio and by extracting it from an existing database.
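As a rough illustration of that extract-and-deploy workflow, here is a sketch that drives the SqlPackage command-line tool from Python; it assumes SqlPackage is installed and on the PATH, and the server, database and credential values are placeholders:

import subprocess

source = "Server=myserver.database.windows.net;Database=SourceDb;User Id=<user>;Password=<password>;"
target = "Server=myserver.database.windows.net;Database=TargetDb;User Id=<user>;Password=<password>;"

# Extract the schema of the source database into a .dacpac file.
subprocess.run(
    ["SqlPackage", "/Action:Extract",
     f"/SourceConnectionString:{source}",
     "/TargetFile:source_db.dacpac"],
    check=True,
)

# Publish the .dacpac to the target database, creating or updating its schema.
subprocess.run(
    ["SqlPackage", "/Action:Publish",
     "/SourceFile:source_db.dacpac",
     f"/TargetConnectionString:{target}"],
    check=True,
)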

Upgrading from CouchDB 1.2.x and separating views

We are in the process of upgrading from 1.2.x to 1.5.1 and would like to take advantage of the fact that you can now store databases and views in separate locations. Everything I've read so far indicates that all you have to do is set the view_index_dir property. However, since we are upgrading from a time before this feature was available, I'm worried this won't work, because when I look at our current data directory I only see one file per database. To put it simply, will it be possible for us to relocate our views?
In your current data directory, you should have a database file named
<databasename>.couch
and a view folder named
.<databasename>_design
per database.
For the migration, just stop CouchDB, move the view folders to the new location, configure the new location in local.ini (property view_index_dir) and restart CouchDB.
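For reference, the relevant setting in local.ini would look something like this (the path shown is just a placeholder):

[couchdb]
view_index_dir = /data/couchdb/view_index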

Recover Deleted Data Table Azure Mob Service

I accidentally deleted a table from Data of Mobile Service. Is there any way I can recover it?
I used the default free database that comes with creating a mobile service. I really do not care about the data in the table; instead, I really want the scripts that ran on it.
In order to retrieve the data I did the following:
1. Cloned the mobile service and reverted it to a previous commit.
2. Copied the deleted table and its script files.
3. Pulled again from the server.
4. Added the table and the script files back where they should be.
5. Added the files to the git tracking index and pushed the commit to master.
Now the files are there in the Azure mobile service, but the table is not being displayed in the GUI.
I tried restarting the Azure mobile service, but it is still not there.
To confirm that the table and its files were indeed there, I even cloned the mobile service again, and this time the table folder contained users.json and its script files, but sadly they are not visible in the Azure portal.
To get the table to show again in the UI, you need to use the portal's create table command. It will basically be a no-op if it detects that the table already exists in SQL. I don't believe it will touch your table scripts; however, it may overwrite the .json permissions file.
If it does overwrite the js/json files, then after creating the table through the UI you can revert the commit that changed those table files as part of that process.
At that point you should be good again.
