File transfer from DBFS to Azure Blob Storage

I need to transfer the files in the DBFS path below:
%fs ls /FileStore/tables/26AS_report/customer_monthly_running_report/parts/
to the Azure Blob Storage container below:
dbutils.fs.ls("wasbs://" + blob.storage_account_container + "@" + blob.storage_account_name + ".blob.core.windows.net/")
What series of steps should I follow? Please suggest.

The simplest way would be to load the data into a DataFrame and then write that DataFrame to the target:
df = spark.read.format(format).load("dbfs:/FileStore/tables/26AS_report/customer_monthly_running_report/parts/*")
df.write.format(format).save("wasbs://" + blob.storage_account_container + "@" + blob.storage_account_name + ".blob.core.windows.net/")
You will have to replace "format" with the source file format and the format you want in the target folder.
Keep in mind that if you do not want to do any transformations to the data but just move it, it will most likely be more efficient not to use PySpark but to use the azcopy command-line tool instead. You can also run that in Databricks with the %sh magic command if needed.
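If a plain copy from inside Databricks is all you need, a minimal sketch using dbutils is shown below. It assumes the storage account access key is available; the container, account name, and key are placeholders you would replace:
# Assumption: the storage account access key is available (placeholders below)
spark.conf.set("fs.azure.account.key.<storage_account_name>.blob.core.windows.net", "<access_key>")

# Recursively copy the whole parts/ folder from DBFS to the blob container
dbutils.fs.cp(
    "dbfs:/FileStore/tables/26AS_report/customer_monthly_running_report/parts/",
    "wasbs://<container>@<storage_account_name>.blob.core.windows.net/parts/",
    recurse=True,
)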

Related

How to open index.html file in Databricks or browser?

I am trying to open an index.html file through Databricks. Can someone please let me know how to deal with it? I am using Great Expectations (GX) with Databricks, and currently Databricks stores this file here: dbfs:/great_expectations/uncommitted/data_docs/local_site/index.html. I want to send the index.html file to stakeholders.
I suspect that you need to copy the whole folder, as there should be images, etc. The simplest way to do that is to use the Databricks CLI fs cp command to access DBFS and copy the files to local storage, like this:
databricks fs cp -r 'dbfs:/.....' local_name
To open the file directly in the notebook you can use something like this (note that dbfs:/ should be replaced with /dbfs/):
with open("/dbfs/...", "r") as f:
    data = "".join([l for l in f])
displayHTML(data)
but this will break links to images. Alternatively you can follow this approach to display Data docs inside the notebook.
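Since the goal is to send the docs to a stakeholder, one further option (a sketch, not from the original answer; it assumes the default GX data_docs path from the question) is to zip the whole local_site folder through the /dbfs FUSE mount and copy the archive to FileStore so it can be downloaded and shared:
import shutil

# Zip the whole data_docs site; /dbfs/... is the FUSE mount of dbfs:/...
shutil.make_archive(
    "/tmp/data_docs",  # produces /tmp/data_docs.zip on the driver
    "zip",
    "/dbfs/great_expectations/uncommitted/data_docs/local_site",
)

# Copy the archive to FileStore so it can be downloaded via the workspace /files/ URL
dbutils.fs.cp("file:/tmp/data_docs.zip", "dbfs:/FileStore/data_docs.zip")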

Return the name of a saved file in Databricks

I have a notebook in Databricks that does some transformations and writes a parquet file to Azure Data Lake Storage. At the end of the notebook I would like to have an exit parameter with the name of the parquet file that the notebook has just saved. I would like to use this parameter in Azure Data Factory later.
In general I would like to have a copy activity in Azure Data Factory which moves the just-saved parquet file to a database table. The thing is that the name of the parquet file changes every time the notebook is run. Let me know if there is a better solution to this problem.
Thank you!
The code below sends the file name back to ADF:
dbutils.notebook.exit(filename)
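Since Spark picks the part file name itself, you first have to look it up after the write. A minimal sketch, assuming the output folder is known (the path and activity name below are placeholders):
# Write the DataFrame, then list the output folder to find the generated part file
output_path = "abfss://<container>@<account>.dfs.core.windows.net/reports/monthly/"
df.write.mode("overwrite").parquet(output_path)

# Pick the parquet part file Spark just created and return its name to ADF
part_name = [f.name for f in dbutils.fs.ls(output_path) if f.name.endswith(".parquet")][0]
dbutils.notebook.exit(part_name)
In ADF the value comes back on the Notebook activity output, e.g. @activity('RunTransformNotebook').output.runOutput, which you can then pass on to the copy activity.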

Azure Data Factory - Data Flow - After completion - Move

I am using an ADF v2 Data Flow activity to load data from a CSV file in Blob Storage into a table in an Azure SQL database. In the Data Flow (Source - Blob Storage), under Source options, there is an option 'After completion (No action / Delete source file / Move)'. I would like to use the Move option to save those CSV files in a container, renaming the files by concatenating today's date. How do I frame the logic for this? Can someone please help?
You can define the file name explicitly in both the From and To fields. This is not so well (if at all) documented; I found it just by trying different approaches.
You can also add dynamic content such as timestamps. Here's an example:
concat('incoming/archive/', toString(currentUTC(), 'yyyy-MM-dd_HH.mm.ss_'), 'target_file.csv')
You could parameterize the source file to achieve that. Please refer to my example.
Data Flow parameter settings:
Set the source file and move expression in Source Options:
Expressions to rename the source with "name + current date":
concat(substring($filename, 1, length($filename) - 4), toString(currentUTC(), 'yyyy-MM-dd'))
My full file name is "word.csv"; the output file name is "word2020-01-26".
HTH.

Error using Data Factory for copy activity from Blob Storage as source

Why do I keep getting this error when using a folder from a blob container (which contains only one GZ-compressed file) as the source of a copy activity in Data Factory v2, with another blob storage as the sink (where I want the file decompressed)?
"message":"ErrorCode=UserErrorFormatIsRequired,
'Type=Microsoft.DataTransfer.Common.Shared.HybridDeliveryException,
Message=Format setting is required for file based store(s) in this scenario.,Source=Microsoft.DataTransfer.ClientLibrary,'",
I know it means I need to explicitly specify the format for my sink dataset, but I am not sure how to do that.
I suggest using the copy data tool.
According to your comment, I tried many times: unless you choose the compressed file as the source dataset and import the schemas, the Azure Data Factory copy activity will not decompress the file for you.
If the files inside the compressed file don't have the same schema, the copy activity can also fail.
Hope this helps.
The easiest way to do this: go to the dataset, click on the Schema tab, then Import Schema.
Hope this helped!

How to export data from a dataframe to a file in Databricks

I'm currently doing the Introduction to Spark course on edX.
Is there a possibility to save dataframes from Databricks onto my computer?
I'm asking this question because this course provides Databricks notebooks which probably won't work after the course.
In the notebook, data is imported using the command:
log_file_path = 'dbfs:/' + os.path.join('databricks-datasets',
                                        'cs100', 'lab2', 'data-001', 'apache.access.log.PROJECT')
I found this solution but it doesn't work:
df.select('year','model').write.format('com.databricks.spark.csv').save('newcars.csv')
Databricks runs on cloud VMs and has no idea where your local machine is located. If you want to save the CSV results of a DataFrame, you can run display(df) and there's an option to download the results.
You can also save it to the file store and download it via its handle, e.g.
df.coalesce(1).write.format("com.databricks.spark.csv").option("header", "true").save("dbfs:/FileStore/df/df.csv")
You can find the handle in the Databricks GUI by going to Data > Add Data > DBFS > FileStore > your_subdirectory > part-00000-...
Download link in this case (for a Databricks West Europe instance):
https://westeurope.azuredatabricks.net/files/df/df.csv/part-00000-tid-437462250085757671-965891ca-ac1f-4789-85b0-akq7bc6a8780-3597-1-c000.csv
I haven't tested it, but I would assume the limit of 1 million rows that you would have when downloading it via the mentioned answer from @MrChristine does not apply here.
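A small sketch of finding the exact part file name to build that /files/ download link programmatically (the workspace hostname is a placeholder):
# List the saved folder; Spark writes the data into a part-00000-... file inside it
files = dbutils.fs.ls("dbfs:/FileStore/df/df.csv")
part_name = [f.name for f in files if f.name.startswith("part-")][0]

# dbfs:/FileStore/... is served under the workspace /files/ URL
print("https://<your-workspace>.azuredatabricks.net/files/df/df.csv/" + part_name)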
Try this:
df.write.format("com.databricks.spark.csv").save("file:///home/yphani/datacsv")
This will save the files to the local (Unix) file system of the server.
If you give only /home/yphani/datacsv, it looks for the path on HDFS.
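For small results, another common route (not from the answers above; a sketch that assumes the selected data fits in driver memory) is converting to pandas and writing a single CSV to FileStore, which you can then download through the workspace /files/ URL:
# Collect a small DataFrame to the driver and write one plain CSV file
pdf = df.select('year', 'model').toPandas()
pdf.to_csv("/dbfs/FileStore/newcars.csv", index=False)

# The file is then downloadable at:
#   https://<your-workspace>.azuredatabricks.net/files/newcars.csv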
