How to configure path for Delta Live Table in cloud_files - apache-spark

I am new to Databricks Delta Live Tables. I have a few questions and need help understanding the concept behind it; I am unable to proceed without this.
I have a file in an Azure Data Lake container, and I know that I need to give the path in "cloud_files" so that the Delta Live Table can read files from this folder and show them. But if I give only the path, how do I specify the storage account name and container name? Also, do I need to provide an access key in order to read the data securely?
I think I am missing something. I have gone through various articles and YouTube demo videos, and they all just mention the path without explaining how to configure it.
Please help me to understand this concept.
Thank You.
This is my code for the Delta Live table:
CREATE STREAMING LIVE TABLE customers_raw
COMMENT "This is raw table"
AS
SELECT *
FROM cloud_files("/raw_data/customers.csv", "csv")

You need to specify the full URL for this folder, for example abfss://<container>@<storage>.dfs.core.windows.net/raw_data/customers.csv. If you specify only /raw_data/customers.csv, it is treated as a path on DBFS, and the read will fail. Note that with the full URL you also need to set the corresponding Spark properties so that DLT can access the data; you can find them in the following answer.
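To make the URL format concrete, here is a minimal Python sketch that assembles the abfss:// path and the name of the Spark property used for account-key access. The helper names and the sample container, storage account, secret scope, and secret key are hypothetical. The `fs.azure.account.key.<storage>.dfs.core.windows.net` property is the standard Hadoop ABFS setting; the `spark.hadoop.` prefix and the `{{secrets/<scope>/<key>}}` reference are assumptions about how the pipeline configuration is written, so check them against your workspace.

```python
def abfss_url(container: str, storage_account: str, path: str) -> str:
    """Build the full ADLS Gen2 URL for a path inside a container."""
    return (
        f"abfss://{container}@{storage_account}.dfs.core.windows.net/"
        f"{path.lstrip('/')}"
    )


def account_key_conf(storage_account: str, secret_scope: str, secret_key: str) -> dict:
    """Return a one-entry pipeline configuration for account-key access.

    The value is a secret reference rather than the key itself, so the
    access key never appears in the pipeline settings (assumed syntax).
    """
    name = f"spark.hadoop.fs.azure.account.key.{storage_account}.dfs.core.windows.net"
    return {name: "{{secrets/" + secret_scope + "/" + secret_key + "}}"}


# Hypothetical names, for illustration only.
print(abfss_url("raw", "mystorageacct", "/raw_data/customers.csv"))
# prints: abfss://raw@mystorageacct.dfs.core.windows.net/raw_data/customers.csv
print(account_key_conf("mystorageacct", "my-scope", "storage-key"))
```

The printed URL is what would go into cloud_files in place of /raw_data/customers.csv, and the dictionary entry is what would go into the pipeline's configuration.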

Related

Copying data using Data Copy into individual files for blob storage

I am entirely new to Azure, so if this is easy please just tell me to RTFM, but I'm not used to the terminology yet so I'm struggling.
I've created a data factory and pipeline to copy data, using a simple query, from my source data. The target data is a .txt file in my blob storage container. This part is all working quite well.
Now, what I'm attempting to do is to store each row that's returned from my query into an individual file in blob storage. This is where I'm getting stuck, and I'm not sure where to look. This seems like something that'll be pretty easy, but as I said I'm new to Azure and so far am not sure where to look.
Set Max rows per file to 1 in the sink settings of the copy activity, and leave the file name empty in the sink dataset. If you need one, you can specify a prefix in the File name prefix setting.
[Screenshots omitted: the sink dataset, the sink settings in the copy data activity, and the result.]

Azure Data Factory - Recording file name when reading all files in folder from Azure Blob Storage

I have a set of CSV files stored in Azure Blob Storage. I am reading the files into a database table using the Copy Data task. The Source is set to the folder where the files reside, so it grabs each file and loads it into the database. The issue is that I can't seem to map the file name in order to read it into a column. I'm sure there are more complicated ways to do it, for instance first reading the metadata and then reading the files in a loop, but surely the file metadata should be available while traversing the files?
Thanks
This is not possible in a regular copy activity. Mapping Data Flows can do it; the feature is still in preview, but it may help you out. If you check the documentation, you will find an option to specify a column that stores the file name.

How to prepare test data for textsum?

I have been able to successfully run the pre-trained model of TextSum (Tensorflow 1.2.1). The output consists of summaries of CNN & Dailymail articles (which are chunked into bin format prior to testing).
I have also been able to create the aforementioned bin format test data for CNN/Dailymail articles and the vocab file (per instructions here). However, I am not able to create my own test data to check how good the summary is. I have tried modifying the make_datafiles.py code to remove hard-coded values. I am able to create tokenized files, but the next step seems to be failing. It would be great if someone could help me understand what url_lists is being used for. Per the github readme -
"For each of the url lists all_train.txt, all_val.txt and all_test.txt, the corresponding tokenized stories are read from file, lowercased and written to serialized binary files train.bin, val.bin and test.bin. These will be placed in the newly-created finished_files directory."
How is a URL such as http://web.archive.org/web/20150401100102id_/http://www.cnn.com/2015/04/01/europe/france-germanwings-plane-crash-main/ being mapped to the corresponding story in my data folder? If someone has had success with this, please do let me know how to go about this. Thanks in advance!
Update: I was able to figure out how to use own data to create bin files for testing (and avoid using url_lists altogether).
This will be helpful - https://github.com/dondon2475848/make_datafiles_for_pgn
Will update the answer once I figure out how to fix ROUGE scoring for this.
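On the original question of how a URL is matched to a story file: as far as I can tell from make_datafiles.py, each URL in the url list is hashed with SHA-1 and the tokenized story is looked up under the name `<hash>.story`. The sketch below reproduces that lookup as I understand it; the function name `story_filename` is mine, not the script's, so treat it as an illustration rather than the script's exact code.

```python
import hashlib


def story_filename(url: str) -> str:
    """Return the story file name assumed to correspond to a URL.

    Assumption: the preprocessing script names each story after the
    SHA-1 hex digest of its URL, with a ".story" extension.
    """
    digest = hashlib.sha1(url.encode("utf-8")).hexdigest()
    return digest + ".story"


url = (
    "http://web.archive.org/web/20150401100102id_/"
    "http://www.cnn.com/2015/04/01/europe/france-germanwings-plane-crash-main/"
)
print(story_filename(url))
```

If this is right, it also explains why your own articles fail at that step: they have no URL whose hash matches the file name, which is why skipping url_lists works.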

Want to set up settings data in Windows Azure Stream Analytics

I need help setting up reference data in Stream Analytics. I want to add the default settings data of my application to Stream Analytics. I can add the reference data, and with the upload sample file option I can upload a JSON or CSV file. However, when I run a join query it returns 0 rows, because none of the reference data has been stored (so the values are null with a left outer join).
I investigated the issue and I think it is due to the Path Pattern, but I do not know much about it.
Based on your description, you believe the issue is caused by the Path Pattern/Path Prefix Pattern, but I cannot give a specific suggestion without more details, such as a screenshot of your Path Pattern setting.
So I will just list some resources for reference, in the hope that they help you resolve the issue.
[Two screenshots omitted: the Path Prefix Pattern/Path Pattern settings as introduced in Link 1 & 2.]
The sample Use Stream Analytics to process exported data from Application Insights shows how to read stream data from Blob Storage in its section Create an Azure Stream Analytics instance; the steps are similar to those for reference data.
Hope it helps.
The issue turned out to be an improperly formatted JSON file.

How can I move data from an Azure Table storage production table to a development table?

I've set up two different tables with two connection strings. Now I want to move data from one to the other, but I'm not sure where to start. Has anyone else coded up a solution for this? If so, I would really appreciate some tips or advice on where to start. My Table storage tables are small; some are around 500-2000 rows. I would just like to make sure that no data gets lost.
Please note it is tablestorage that I am using and NOT Sql Azure
Thanks if you have some ideas.
If you're dealing with only a few thousand rows then this is easy.
Download TableXplorer here: http://clumsyleaf.com/products/tablexplorer
There is an option to export all data from a table to an XML or CSV file. Once you have your file then use the import option with the file and import it into the other table.
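If you would rather code it than use a tool, the same copy can be sketched in a few lines of Python. This is a minimal sketch, assuming the azure-data-tables package; `copy_table` and `open_table` are hypothetical helper names, and the connection strings and table name in the usage comment are placeholders. `copy_table` only relies on the `list_entities()` and `upsert_entity()` methods of the table client, so it can be tried against a stand-in object first.

```python
def copy_table(source, dest) -> int:
    """Copy every entity from source to dest and return how many were copied.

    Both arguments are expected to behave like azure.data.tables.TableClient:
    source.list_entities() yields entities, dest.upsert_entity() writes one.
    """
    copied = 0
    for entity in source.list_entities():
        # Upsert keeps the copy safe to re-run: existing rows are updated
        # instead of raising a conflict.
        dest.upsert_entity(entity)
        copied += 1
    return copied


def open_table(connection_string: str, table_name: str):
    """Open a table client (assumes the azure-data-tables package is installed)."""
    from azure.data.tables import TableClient

    return TableClient.from_connection_string(connection_string, table_name=table_name)


# Usage with placeholder values (not run here):
# prod = open_table("<production connection string>", "<table name>")
# dev = open_table("<development connection string>", "<table name>")
# print(copy_table(prod, dev), "entities copied")
```

Comparing the returned count with the number of rows in the source table is a simple way to check that nothing was lost.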
