External table from existing data with additional column - apache-spark

This is my first question ever so thanks in advance for answering me.
I want to create an external table with Spark in Azure Databricks. The data is already in my ADLS and is automatically extracted from different sources every day. The folder structure on the storage is like ../filename/date/file.parquet.
I do not want to duplicate the files by saving a copy of them in another folder/container.
My problem is that I want to add a date column, extracted from the folder path, to the table without copying or changing the source files.
I am using Spark SQL to create the table.
CREATE TABLE IF NOT EXISTS my_ext_tbl
USING parquet
OPTIONS (path "/mnt/some-dir/source_files/")
Is there any proper way to add such a column in one easy and readable step, or do I have to read the raw data into a DataFrame, add the column, and then save it as an external table to a different location?
I am aware that unmanaged tables store only metadata in DBFS. However, I am wondering whether this is even possible.
Hope it's clear.
EDIT:
Since it seems there is no viable solution that avoids copying or interfering with the source files, I would like to ask: how are you handling such challenges?
EDIT2:
I think that link might provide a solution. The difference in my case is that the date inside the folder path is not a real partition; it is just a date added by the pipeline that extracts the data from the external source.
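For what it's worth, here is a minimal PySpark sketch of the DataFrame route, assuming a Databricks notebook (where spark is predefined), that the date is the folder directly above the parquet file, and that input_file_name() is available on your cluster; the view name and regex are only illustrative:

from pyspark.sql import functions as F

# Read the existing parquet files in place; nothing is copied or rewritten.
df = spark.read.parquet("/mnt/some-dir/source_files/")

# input_file_name() returns the full path each row was read from, so the date
# folder can be pulled out of it (the regex assumes .../filename/yyyy-MM-dd/<file>.parquet).
df = df.withColumn(
    "extract_date",
    F.to_date(
        F.regexp_extract(F.input_file_name(), r"/([^/]+)/[^/]+\.parquet$", 1),
        "yyyy-MM-dd",
    ),
)

# A view keeps the data where it is; saveAsTable() would write a copy elsewhere.
df.createOrReplaceTempView("my_ext_view")

The same regexp_extract(input_file_name(), ...) expression can also be used inside a CREATE VIEW statement in Spark SQL if you prefer to stay in SQL, so the extra column lives only in metadata.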

Related

Add file name to Copy activity in Azure Data Factory

I want to copy data from a CSV file (source) on Blob storage to an Azure SQL Database table (sink) via a regular Copy activity, but I also want to copy the file name alongside every entry into the table. I am new to ADF, so the solution is probably easy, but I have not been able to find the answer in the documentation or anywhere else on the internet so far.
My mapping currently looks like this (I have created an output table with a file name column, but this data is not explicitly defined at the column level in the CSV file, therefore I need to extract it from the metadata and pair it to the column):
At first I thought I would put dynamic content in there and solve the problem that way, but there is no option to use dynamic content in each individual box, so I do not know how to implement that. My next thought was to use a pre-copy script, but I have not seen how I could use it for this purpose. What is the best way to solve this issue?
In the Mapping columns of the copy activity you cannot add dynamic content from the Get Metadata activity.
First give the source CSV dataset to the Get Metadata activity, then connect it to the copy activity like below.
You can add the file name column via Additional columns in the copy activity source itself, by using the dynamic content of the Get Metadata activity (which is given the same source CSV dataset).
@activity('Get Metadata1').output.itemName
If you are sure about the data types of your data, there is no need to go to the mapping; you can execute your pipeline.
Here I am copying the contents of the samplecsv.csv file to a SQL table named output.
My output for your reference:
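If it helps to prototype the expected result outside ADF, here is a rough Python/pandas sketch of what the Additional columns setting produces (the file and column names are hypothetical; the pipeline itself does this inside the copy activity):

import pandas as pd

source_file = "samplecsv.csv"   # hypothetical local copy of the source blob

# Read the CSV and attach the file name to every row, like the additional column does.
df = pd.read_csv(source_file)
df["filename"] = source_file

print(df.head())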

Create list of files in Azure Storage and send it to sql table using ADF

I need to copy the file names of Excel files that are stored as blobs in my Azure Storage and then put these names into a SQL Server table using ADF. The file path can serve as the name of a file, but the hardest part is that in the dataset which takes all the files from one specific folder I have to select a sheet name, and these sheet names are different for each file, so it returns an error. Is there a way to create a collective dataset without indicating the sheet name?
So, if I understand your question correctly, you are looking for a way to write all Excel file names to a SQL database using ADF.
You can use the generic Get Metadata activity with a binary dataset as the source. Select Child items as a field to retrieve; this will retrieve all files in the folder. Then add a filter to select only the Excel file types.
Hope that this gets you on the right track.
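If the ADF route gets awkward, the same outcome can also be scripted instead of using the Get Metadata activity; this is only a sketch using the azure-storage-blob and pyodbc packages, with placeholder connection strings, a placeholder container name, and a hypothetical dbo.FileNames table:

from azure.storage.blob import ContainerClient
import pyodbc

# List blob names only; the workbooks are never opened, so sheet names do not matter.
container = ContainerClient.from_connection_string(
    "<storage-connection-string>", container_name="excel-files"
)
excel_names = [b.name for b in container.list_blobs() if b.name.endswith(".xlsx")]

# Insert the names into the SQL Server table.
conn = pyodbc.connect("<sql-connection-string>")
cursor = conn.cursor()
cursor.executemany(
    "INSERT INTO dbo.FileNames (FileName) VALUES (?)",
    [(name,) for name in excel_names],
)
conn.commit()
conn.close()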

Is there any automated way to convert the CSV files in Azure Blob Storage to Excel tables?

I am trying to export the content of CSV files stored in Azure Blob Storage to Excel tables in an automated way. I did some research and found in a few blog articles that Azure Logic Apps could be used for the conversion. I tried to implement something similar but couldn't succeed.
Any suggestions/help would be appreciated.
Thanks.
If you really want to go this route, I built this the other day; not that I think it is the best way to handle this, but it is a way. You can build further upon this example: change the input to a storage blob and the output to Excel. I am just pasting the extra step where I set the output to the Excel "Add a row into a table" action. Keep in mind you will need to purge the header and the last row, so you need to at least fix that part.
You can find the entire flow in the other question I linked to earlier. The difference is just that I now look in a storage blob, compose the output, and in the end write to the Excel table.
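If the Logic App ends up being more trouble than it is worth, the conversion itself is also straightforward in a small script (for example in an Azure Function or a scheduled job); this sketch uses pandas with openpyxl and azure-storage-blob, and the connection string, container, and blob names are placeholders:

import io
import pandas as pd
from azure.storage.blob import BlobClient

csv_blob = BlobClient.from_connection_string(
    "<storage-connection-string>", container_name="input", blob_name="data.csv"
)
xlsx_blob = BlobClient.from_connection_string(
    "<storage-connection-string>", container_name="output", blob_name="data.xlsx"
)

# Download the CSV and load it into a DataFrame.
df = pd.read_csv(io.BytesIO(csv_blob.download_blob().readall()))

# Write the DataFrame back out as an .xlsx workbook (requires openpyxl).
buffer = io.BytesIO()
df.to_excel(buffer, index=False, sheet_name="Sheet1")
xlsx_blob.upload_blob(buffer.getvalue(), overwrite=True)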

Spotfire - Change datasource from SQL to excel

I'm looking for a way to change my data source using a script. In other words, I'm currently connected to the SQL database. I have some graphs, some tables, and some functions.
Is there a way to tell Spotfire to switch from the SQL data source to a folder on my computer? This folder contains just the same tables, with the same names, but fixed at a given date and no longer changing.
So I'm looking for a button driven by a Python script that changes the data source. If you know a package/function to do so, I would be happy to try it.
Thanks for your time,

How to remove table column names in the Microsoft Azure Storage Explorer tool?

I want to delete some unnecessary columns that were created but are not currently used. How can I delete the columns manually in Microsoft Azure Storage Explorer, without deleting the table data or the table itself?
Mildly annoying, but you can fix the issue by deleting the entire contents of:
AppData\Roaming\StorageExplorer
that will fix the issue. You'll need to reauthorize the accounts, but that's a mild inconvenience. There is likely a file or two within that directory that actually caches that data, which would be a more surgical approach, but the few most obvious candidates didn't work for me, so I just deleted the whole directory.
It's not possible to delete columns for all entities in a table, since Azure Table storage is a schema-less database. In other words, the entities within a table can each have different properties. You have to query all the entities, remove the unwanted properties from them one by one, and then replace the modified entities back into the table.
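A minimal sketch of that query-strip-replace loop with the azure-data-tables Python SDK follows; the connection string, table name, and UnusedColumn property are placeholders:

from azure.data.tables import TableClient, UpdateMode

# Table storage is schema-less, so "deleting a column" means rewriting every
# entity without that property.
table = TableClient.from_connection_string(
    "<storage-connection-string>", table_name="mytable"
)

for entity in table.list_entities():
    if "UnusedColumn" in entity:
        del entity["UnusedColumn"]
        # REPLACE (not MERGE) so the removed property does not survive.
        table.update_entity(entity, mode=UpdateMode.REPLACE)

Note that Storage Explorer may still show the old column headers from its cached view until that cache is cleared, as described in the other answers.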
Spinning off of the answer from #frigon, I performed the following steps:
Close the storage explorer
Delete \AppData\Roaming\StorageExplorer\Local Storage
Open storage explorer
It kept me logged in, kept all my settings, and only appears to have reset my account filters and table caches.
It looks like the caches are stored in LevelDB in that folder, so you may be able to crack that open and find the specific value(s) to drop.
I've found that if you copy the table and paste it into a different storage account, or simply rename the table, then the new table will not reference the unused columns. However, if you paste it back to the original location or rename the table back to its original name, the unused columns will still be shown, even if you delete the table first.
Strangely, if you create a brand-new table with the same name, it will only have the default columns. But import the contents of the original table from a file and the superfluous columns reappear, even though there is no reference to those columns in the CSV file.
