Copy a file from BDL to ADLS - apache-spark

I need to copy a file from BDL (business data lake) to ADLS and then split it into 5 files in ADLS based on a condition.
I am new to Azure Databricks; any suggestions on how I can do this would be appreciated.

Copy a file from the data lake to ADLS. Please follow the approach below.
Using this code you can copy data from the data lake to ADLS:
spark.conf.set(
    "fs.azure.account.key.<your-storage-account-name>.dfs.core.windows.net",
    "<your-storage-account-access-key>")
dbutils.fs.cp("abfss://<container>@<storage_account>.dfs.core.windows.net/<folder>/<subfolder>/",
              "abfss://mycontainer@mydatalake.dfs.core.windows.net/", recurse=True)
Or
dbutils.fs.mv("dbfs:/tmp/test/", "dbfs:/tmp/<folder_name>/", recurse=True)
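For the second part of the question (splitting the copied file into 5 files based on a condition), one option is to read the file with Spark and write each filtered subset to its own folder. A minimal sketch, assuming a CSV input and hypothetical columns ("region", "amount") standing in for the real condition:
# Read the copied file from ADLS (the path and column names below are placeholders)
df = spark.read.csv("abfss://mycontainer@mydatalake.dfs.core.windows.net/<folder>/input.csv",
                    header=True, inferSchema=True)

# Option 1: one output folder per value of the column that drives the split
df.write.mode("overwrite").partitionBy("region") \
    .csv("abfss://mycontainer@mydatalake.dfs.core.windows.net/split_by_region/")

# Option 2: explicit filters if the 5 conditions are not based on a single column
conditions = {
    "part1": df["amount"] < 100,
    "part2": (df["amount"] >= 100) & (df["amount"] < 200),
    # ... add the remaining three conditions here
}
for name, cond in conditions.items():
    df.filter(cond).write.mode("overwrite") \
        .csv(f"abfss://mycontainer@mydatalake.dfs.core.windows.net/split/{name}/")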

Related

Need help to write a pyarrow table as parquet file in ADLS Gen2 account

I am struggling to write a pyarrow table as a parquet file to an ADLS Gen2 storage container. I'm working in Azure Synapse Analytics using a notebook.
Here is what I am able to do:
Mount the ADLS Gen2 account to access files. Spark uses a unique syntax to achieve this.
E.g.
df = spark.read.load("synfs:/"+jobId+"/mnt/bronze/workday"+varFilepath
, format='csv',header=True)
print(type(df))
df.show()
This works fine. I then convert it to a pandas dataframe to do some manipulation. Now I want to write this out as a parquet file.
import pyarrow as pa
import pyarrow.parquet as pq

df_csv = df.toPandas()
pq_tbl = pa.Table.from_pandas(df_csv)
print(type(pq_tbl))
pq.write_table(pq_tbl, "workday/example.parquet", filesystem="synfs:/" + jobId + "/mnt/bronze")
I get an error: Unrecognized filesystem type in URI: synfs:/7/mnt/bronze
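A possible workaround (not from the original thread): since the synfs mount is already readable from Spark, the manipulated pandas dataframe can be converted back to a Spark DataFrame and written as parquet through Spark, which does understand the synfs scheme. A minimal sketch, assuming the same jobId and mount path as above:
# Convert the pandas dataframe back to Spark and let Spark handle the synfs filesystem
df_out = spark.createDataFrame(df_csv)
df_out.write.mode("overwrite").parquet("synfs:/" + jobId + "/mnt/bronze/workday/example_parquet/")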

Extracting Files from Onprem server to Azure Blob storage while filtering files with no data

I am trying to transfer on-premises files to Azure Blob Storage. However, out of the 5 files that I have, 1 has "no data", so I can't map the schema. Is there a way I can filter out this file while importing it to Azure? Or would I have to import all of them into Azure Blob Storage as-is and then filter them into another blob storage? If so, how would I do this?
(Screenshots: DataPath, CompleteFiles, Nodata)
If your on-prem source is your local file system, then first copy the files with their folder structure to a temporary blob container using azcopy with a SAS key. Please refer to this thread to know more about it.
Then use an ADF pipeline to filter out the empty files and store the rest in the final blob container.
These are my files in the blob container, and sample2.csv is an empty file.
First, use a Get Metadata activity to get the list of files in that container.
It will list all the files; pass that array to the ForEach as @activity('Get Metadata1').output.childItems
Inside the ForEach, use a Lookup to get the row count of every file, and if the count != 0 then use a Copy activity to copy the file.
Use a dataset parameter to pass the file name.
Inside the If condition, use the expression below.
@not(equals(activity('Lookup1').output.count,0))
Inside the True activities, use a Copy activity.
Copy sink to another blob container:
Execute this pipeline and you can see the empty file is filtered out.
If your on-prem source is SQL, use a Lookup to get the list of tables and then a ForEach. Inside the ForEach, follow the same procedure for individual tables.
If your on-prem source is something other than the above, first copy all the files to blob storage and then follow the same procedure.
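Outside of ADF, the same "skip files with zero data rows" logic can be sketched in Python with the azure-storage-blob SDK. This is only an illustration of the filtering idea, not the pipeline above; the connection string and container names are placeholders:
import io
import pandas as pd
from azure.storage.blob import ContainerClient

conn_str = "<storage-account-connection-string>"
src = ContainerClient.from_connection_string(conn_str, "temp-container")
dst = ContainerClient.from_connection_string(conn_str, "final-container")

for blob in src.list_blobs():
    data = src.download_blob(blob.name).readall()
    # Count data rows the same way the Lookup activity does (header excluded)
    try:
        rows = len(pd.read_csv(io.BytesIO(data)))
    except pd.errors.EmptyDataError:
        rows = 0
    if rows != 0:
        dst.upload_blob(name=blob.name, data=data, overwrite=True)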

Unable to copy multiple BLOBs to Synapse using data factory

I want to copy bulk data from BLOB storage to Azure Synapse with the following structure:
BLOB STORAGE:
devpresented (storage account)
    processedout (container)
        Merge_user (folder)
            part-00000-tid-89051a4e7ca02.csv
        Sales_data (folder)
            part-00000-tid-5579282100a02.csv
SYNAPSE SQLDW:
SCHEMA: PIPELINEDEMO
TABLES: Merge_user, Sales_data
Using data factory I want to copy BLOB data to Synapse database as below:
BLOB >> SQLDW
Merge_user >> PIPELINEDEMO.Merge_user
Sales_data >> PIPELINEDEMO.Sales_data
The following doc is mentioned for SQL DB to SQL DW:
https://learn.microsoft.com/en-us/azure/data-factory/tutorial-bulk-copy-portal
However, I didn't find anything for a BLOB source in data factory.
Can anyone please suggest how I can move multiple BLOB files to different tables?
If you need to copy the contents of different csv files to different tables in Azure Synapse, then you can use multiple copy activities within a pipeline.
I create a .csv file with this content:
111,222,333,
444,555,666,
777,888,999,
I upload this to my storage account, then I set this .csv file as the source of the copy activity.
After that, I create a table in azure synapse, and set this as the sink of the copy activity:
create table testbowman4(Prop_0 int, Prop_1 int, Prop_2 int)
At last, trigger this pipeline and you will find the data in the table:
You can create multiple similar copy activities, and each copy activity performs a copy action from blob to azure synapse.
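If Databricks (as in the first question on this page) is available instead of ADF, the same folder-to-table copies can be sketched in PySpark with the Azure Synapse connector. This is an alternative approach, not the answer above; the JDBC URL, credentials, and staging path are placeholders:
# Assumes the storage account key is already configured, e.g.
# spark.conf.set("fs.azure.account.key.devpresented.blob.core.windows.net", "<key>")

# Folder-to-table mapping taken from the question
mapping = {
    "Merge_user": "PIPELINEDEMO.Merge_user",
    "Sales_data": "PIPELINEDEMO.Sales_data",
}

for folder, table in mapping.items():
    df = spark.read.csv(
        f"wasbs://processedout@devpresented.blob.core.windows.net/{folder}/",
        header=False, inferSchema=True)
    (df.write
        .format("com.databricks.spark.sqldw")
        .option("url", "jdbc:sqlserver://<synapse-server>.database.windows.net:1433;database=<dw-name>;user=<user>;password=<password>")
        .option("forwardSparkAzureStorageCredentials", "true")
        .option("dbTable", table)
        .option("tempDir", "wasbs://<temp-container>@devpresented.blob.core.windows.net/tempdir")
        .mode("append")
        .save())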

Azure Data Factory - Data flow activity changing file names

I am running a data flow activity using Azure Data Factory.
Source data source - Azure blob
Destination data source - Azure Data Lake Gen 2
For example, I have a file named "test_123.csv" in Azure blob. When I create a data flow activity to filter some data and copy it to Data Lake, it changes the file name to "part-00.csv" in Data Lake.
How can I keep my original filename?
Yes, you can do that; please look at the screenshot below. Please do let me know how it goes.

Write DataFrame from Databricks to Data Lake

It happens that I am manipulating some data using Azure Databricks. Such data is in an Azure Data Lake Storage Gen1. I mounted the data into DBFS, but now, after transforming the data I would like to write it back into my data lake.
To mount the data I used the following:
configs = {"dfs.adls.oauth2.access.token.provider.type": "ClientCredential",
"dfs.adls.oauth2.client.id": "<your-service-client-id>",
"dfs.adls.oauth2.credential": "<your-service-credentials>",
"dfs.adls.oauth2.refresh.url": "https://login.microsoftonline.com/<your-directory-id>/oauth2/token"}
dbutils.fs.mount(source = "adl://<your-data-lake-store-account-name>.azuredatalakestore.net/<your-directory-name>", mount_point = "/mnt/<mount-name>",extra_configs = configs)
I want to write back a .csv file. For this task I am using the following line
dfGPS.write.mode("overwrite").format("com.databricks.spark.csv").option("header", "true").csv("adl://<your-data-lake-store-account-name>.azuredatalakestore.net/<your-directory-name>")
However, I get the following error:
IllegalArgumentException: u'No value for dfs.adls.oauth2.access.token.provider found in conf file.'
Any piece of code that can help me, or a link that walks me through it?
Thanks.
If you mount Azure Data Lake Store, you should use the mountpoint to store your data, instead of "adl://...". For details on how to mount Azure Data Lake Store (ADLS) Gen1, see the Azure Databricks documentation. You can verify whether the mountpoint works with:
dbutils.fs.ls("/mnt/<newmountpoint>")
So, after mounting ADLS Gen1, try:
dfGPS.write.mode("overwrite").format("com.databricks.spark.csv").option("header", "true").csv("/mnt/<mount-name>/<your-directory-name>")
This should work if you added the mountpoint properly and also have the access rights with the Service Principal on the ADLS.
Spark always writes multiple files into a directory, because each partition is saved individually. See also the following Stack Overflow question.
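If a single output file is needed, a common workaround (a sketch, not part of the original answer; the mount and file names are placeholders) is to coalesce to one partition and then move the single part file with dbutils:
# Write to a temporary folder with a single partition
(dfGPS.coalesce(1)
    .write.mode("overwrite")
    .option("header", "true")
    .csv("/mnt/<mount-name>/<your-directory-name>/_tmp_out"))

# Locate the single part-*.csv file, rename it, and clean up the temporary folder
part_file = [f.path for f in dbutils.fs.ls("/mnt/<mount-name>/<your-directory-name>/_tmp_out")
             if f.name.startswith("part-")][0]
dbutils.fs.mv(part_file, "/mnt/<mount-name>/<your-directory-name>/output.csv")
dbutils.fs.rm("/mnt/<mount-name>/<your-directory-name>/_tmp_out", recurse=True)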
