Recreate resulting CSV as own file in Azure blob

When writing out my dataframe, it drops the result into a folder called "file.csv" containing a "part-000..." file. I need to take this resulting file and write it out/copy it as its own CSV file with a proper name. I am using the logic here, but it appears this won't suffice for Azure blob, as it's not recognizing a WASB path.
Code to create the dataframe:
val dfOutput = spark.sql("""SELECT * FROM Query""")
dfOutput.coalesce(1).write.option("header","true").mode("overwrite").format("csv").save(OutputFile)
This creates the folder and outputs the dataframe as a "part-000..." CSV file inside it. The output path in this case is wasb://mycontainer@myexamplestorage.blob.core.windows.net/file.csv (example).
The next part should grab the "part-000..." file, copy it out as its own file with FileUtil, and then remove the "file.csv" folder.
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, FileUtil, Path}
import java.io.{File}
val hadoopConfig = new Configuration()
val hdfs = FileSystem.get(hadoopConfig)
val srcPath = new Path(OutputFile)
val destPath = new Path("wasb://mycontainer@myexamplestorage.blob.core.windows.net/resultfile.csv")
val srcFile = FileUtil.listFiles(new File(OutputFile))
  .filterNot(f => f.getPath.endsWith(".csv"))(0)
FileUtil.copy(srcFile, hdfs, destPath, true, hadoopConfig)
hdfs.delete(srcPath, true)
This next part fails on listFiles with the error "IOException: Invalid directory or I/O error occurred for dir: wasb://mycontainer@myexamplestorage.blob.core.windows.net/file.csv", and from what I can tell this is because it is not able to list files from Azure blob storage.
I need to be able to get the CSV file from Azure blob, then copy it as its own file to Azure blob without the folder and "part-000..." file. I played around with setting the file configuration, but this entire approach appears to be incompatible with Azure blob storage, or there is a configuration missing somewhere that allows these classes to query blob storage.

Related

Fastest way to combine multiple CSV files from a blob storage container into one CSV file on another blob storage container in an Azure function

I'd like to know whether it's possible to improve the code below to make it run faster (and maybe cheaper) as part of an Azure Function for combining multiple CSV files from a source blob storage container into one CSV file in a target blob storage container on Azure, using Python (please note that it would also be fine for me to use a library other than pandas if need be).
from azure.identity import DefaultAzureCredential
from azure.keyvault.secrets import SecretClient
from azure.storage.blob import ContainerClient
import pandas as pd
from io import StringIO
# Used for getting access to secrets on Azure key vault for authentication purposes
credential = DefaultAzureCredential()
vault_url = 'AzureKeyVaultURL'
secret_client = SecretClient(vault_url=vault_url, credential=credential)
azure_datalake_connection_str = secret_client.get_secret('Datalake_connection_string')
# Connecting to a source Azure blob storage container where multiple CSV files are stored
blob_block_source = ContainerClient.from_connection_string(
    conn_str=azure_datalake_connection_str.value,
    container_name="sourceContainerName"
)
# Connecting to a target Azure blob storage container to where the CSV files from the source should be combined into one CSV file
blob_block_target = ContainerClient.from_connection_string(
    conn_str=azure_datalake_connection_str.value,
    container_name="targetContainerName"
)
# Retrieve a list of the blob names from the source Azure blob storage container, keeping only those that end with the .csv file extension
blobNames = [name.name for name in blob_block_source.list_blobs()]
only_csv_blob_names = list(filter(lambda x:x.endswith(".csv") , blobNames))
# Creating a list of dataframes - one dataframe from each CSV file found in the source Azure blob storage container
listOfCsvDataframes = []
for csv_blobname in only_csv_blob_names:
    df = pd.read_csv(StringIO(blob_block_source.download_blob(csv_blobname, encoding='utf-8').content_as_text(encoding='utf-8')), encoding='utf-8', header=0, low_memory=False)
    listOfCsvDataframes.append(df)
# Concatenating the different dataframes into one dataframe
df_concat = pd.concat(listOfCsvDataframes, axis=0, ignore_index=True)
# Creating a CSV object from the concatenated dataframe
outputCSV = df_concat.to_csv(index=False, sep = ',', header = True)
# Upload the combined dataframes as a CSV file (i.e. the CSV files have been combined into one CSV file)
blob_block_target.upload_blob('combinedCSV.csv', outputCSV, blob_type="BlockBlob", overwrite = True)
Instead of using an Azure Function, you can use Azure Data Factory (ADF) to concatenate your files.
It will probably be more efficient with ADF than with an Azure Function running pandas.
Take a look at this blog post https://www.sqlservercentral.com/articles/merge-multiple-files-in-azure-data-factory
If you want to use an Azure Function, try to concatenate the files without using pandas. If all your files have the same columns in the same order, you can concatenate the raw text directly and remove the header line, if any, of every file but the first, as in the sketch below.
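As an illustration only, here is a minimal sketch of that plain-text approach, reusing the blob_block_source, blob_block_target and only_csv_blob_names objects from the question (everything else is an assumption, not the answerer's actual code):
# Combine the CSV blobs as plain text, keeping only the header of the first file.
# Assumes all files share the same columns in the same order.
combined_parts = []
for i, csv_blobname in enumerate(only_csv_blob_names):
    text = blob_block_source.download_blob(csv_blobname).content_as_text(encoding='utf-8')
    if i > 0 and '\n' in text:
        # Drop the header line from every file except the first.
        text = text.split('\n', 1)[1]
    if text.strip():
        combined_parts.append(text.rstrip('\n') + '\n')
blob_block_target.upload_blob('combinedCSV.csv', ''.join(combined_parts), blob_type="BlockBlob", overwrite=True)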

How to name a csv file after overwriting in Azure Blob Storage

I am using a Databricks notebook to read a file and write it back into the same location. But when I write the file I am getting a lot of files with different names.
I am not sure why these files are created in the location I specified.
Also, another file with the name "new_location" was created after I performed the write operation.
What I want is that, after reading the file from Azure Blob Storage, I write it back to the same location with the same name as the original, but I am unable to do so. Please help me out, as I am new to PySpark.
I have already mounted the storage and I am now reading the CSV file stored in an Azure Blob Storage container.
The overwritten file is created with the name "part-00000-tid-84371752119947096-333f1e37-6fdc-40d0-97f5-78cee0b108cf-31-1-c000.csv"
Code:
df = spark.read.csv("/mnt/ndemo/nsalman/addresses.csv", inferSchema = True)
df = df.toDF("firstName","lastName","street","town","city","code")
df.show()
file_location_new = "/mnt/ndemo/nsalman/new_location"
# write the dataframe as a single file to blob storage
df.write.format('com.databricks.spark.csv') \
    .mode('overwrite').option("header", "true").save(file_location_new)
Spark will save a partial CSV file for each partition of your dataset. To generate a single CSV file, you can convert the dataframe to a pandas dataframe and then write it out.
Try to change these lines:
df.write.format('com.databricks.spark.csv') \
    .mode('overwrite').option("header", "true").save(file_location_new)
to this line
df.toPandas().to_csv(file_location_new, header=True)
You might need to prepend "/dbfs/" to file_location_new for this to work.
Here is a minimal self-contained example that demonstrates how to write a CSV file with pandas:
df = spark.createDataFrame([(1,3),(2,2),(3,1)], ["Testing", "123"])
df.show()
df.toPandas().to_csv("/dbfs/" + "/mnt/ndemo/nsalman/" + "testfile.csv", header=True)

Spark saveAsTextFile with file extension

I want to partition my results and save them as CSV files to a specified location. However, I didn't find any option to specify the file format using the code below. All the files are created with names of the form part-000**. How can I specify the required file format here?
records.repartition(partitionNum).saveAsTextFile(path)
You can try this:
df.coalesce(1).write.option("header", true).csv(path)
Here path will be a folder, it must not already exist, and you cannot specify the name of the generated CSV file. But you can rename the HDFS file afterwards with the Hadoop API (which is bundled with Spark).
import org.apache.hadoop.fs._
// Locate the generated part file inside the output folder and rename it.
val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
val file = fs.globStatus(new Path(s"$path/part*"))(0).getPath().getName()
val result: Boolean = fs.rename(new Path(s"$path/$file"), new Path(s"$hdfsFolder/${fileName}"))

AzureML create dataset from datastore with multiple files - path not valid

I am trying to create a dataset in Azure ML where the data source is multiple files (e.g. images) in Blob Storage. How do you do that correctly?
Here is the error I get when following the documented approach in the UI.
When I create the dataset in the UI and select the blob storage and directory with either just dirname or dirname/**, the files cannot be found in the Explorer tab, with the error ScriptExecution.StreamAccess.NotFound: The provided path is not valid or the files could not be accessed. When I try to download the data with the code snippet in the Consume tab, then I get the error:
from azureml.core import Workspace, Dataset
# set variables
workspace = Workspace(subscription_id, resource_group, workspace_name)
dataset = Dataset.get_by_name(workspace, name='teststar')
dataset.download(target_path='.', overwrite=False)
Error Message: ScriptExecutionException was caused by StreamAccessException.
StreamAccessException was caused by NotFoundException.
Found no resources for the input provided: 'https://mystoragename.blob.core.windows.net/data/testdata/**'
When I just select one of the files instead of dirname or dirname/**, then everything works. Does Azure ML actually support datasets consisting of multiple files?
Here is my setup:
I have an Azure Storage account with one container, data. In it there is a directory testdata containing testfile1.txt and testfile2.txt.
In Azure ML I created a datastore testdatastore, in which I selected the data container of my storage account.
Then in Azure ML I create a dataset from the datastore: I select file dataset and the datastore above. I can then browse the files, select a folder, and choose that files in subdirectories should be included. This creates the path testdata/**, which does not work as described above.
I got the same issue when creating the dataset and datastore in Python:
import azureml.core
from azureml.core import Workspace, Datastore, Dataset
ws = Workspace.from_config()
datastore = Datastore(ws, "mydatastore")
datastore_paths = [(datastore, 'testdata')]
test_ds = Dataset.File.from_files(path=datastore_paths)
test_ds.register(ws, "testpython")
Datasets definitely support multiple files, so your problem is almost certainly in the permissions given when creating the "mydatastore" datastore (I suspect you have used a SAS token to create this datastore). In order to access anything but individual files, you need to give list permissions to the datastore.
This would not be a problem if you registered the datastore with an account key (see the sketch below), but it could be a limitation of the access token.
The second part of the error message, "the files could not be accessed", refers to potential permission issues.
You can also verify that the folder/** syntax works by creating a dataset from the defaultblobstore that was provisioned for you with your ML workspace.
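For reference, a minimal sketch of registering the datastore with an account key (the account and container names are taken from the question; the key is a placeholder):
from azureml.core import Workspace, Datastore, Dataset

ws = Workspace.from_config()
# Registering with an account key gives the datastore list permission on the container.
datastore = Datastore.register_azure_blob_container(
    workspace=ws,
    datastore_name="mydatastore",
    container_name="data",
    account_name="mystoragename",
    account_key="<storage-account-key>",  # placeholder
)
# With list permission, the folder (and folder/**) path should now resolve.
test_ds = Dataset.File.from_files(path=[(datastore, "testdata")])
test_ds.register(ws, "testpython")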
I uploaded and registered the files with this script and everything works as expected.
from azureml.core import Datastore, Dataset, Workspace
import logging
logger = logging.getLogger(__name__)
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s.%(msecs)03d %(levelname)s %(module)s - %(funcName)s: %(message)s",
    datefmt="%Y-%m-%d %H:%M:%S",
)
datastore_name = "mydatastore"
dataset_path_on_disk = "./data/images_greyscale"
dataset_path_in_datastore = "images_greyscale"
azure_dataset_name = "images_grayscale"
azure_dataset_description = "dataset transformed into the coco format and into grayscale images"
workspace = Workspace.from_config()
datastore = Datastore.get(workspace, datastore_name=datastore_name)
logger.info("Uploading data...")
datastore.upload(
    src_dir=dataset_path_on_disk, target_path=dataset_path_in_datastore, overwrite=False
)
logger.info("Uploading data done.")
logger.info("Registering dataset...")
datastore_path = [(datastore, dataset_path_in_datastore)]
dataset = Dataset.File.from_files(path=datastore_path)
dataset.register(
    workspace=workspace,
    name=azure_dataset_name,
    description=azure_dataset_description,
    create_new_version=True,
)
logger.info("Registering dataset done.")

Spark dataframe(in Azure Databricks) save in single file on data lake(gen2) and rename the file

I am trying to achieve the same functionality as this SO post, Spark dataframe save in single file on hdfs location, except my file is located in Azure Data Lake Gen2, and I am using PySpark in a Databricks notebook.
Below is the code snippet I am using to rename the file:
from py4j.java_gateway import java_import
java_import(spark._jvm, 'org.apache.hadoop.fs.Path')
destpath = "abfss://" + contianer + "#" + storageacct + ".dfs.core.windows.net/"
fs = spark._jvm.org.apache.hadoop.fs.FileSystem.get(spark._jsc.hadoopConfiguration())
file = fs.globStatus(sc._jvm.Path(destpath+'part*'))[0].getPath().getName()
#Rename the file
I receive an IndexError: list index out of range on this line
file = fs.globStatus(sc._jvm.Path(destpath+'part*'))[0].getPath().getName()
The part* file does exist in the folder.
Is this the right approach to rename a file that Databricks (PySpark) writes to Azure Data Lake Gen2? If not, how else can I accomplish this?
I was able to resolve this by installing the azure.storage.filedatalake client library in my Databricks notebook. Using the FileSystemClient and DataLakeFileClient classes, I was able to rename the file in Data Lake Gen2.
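For reference, a minimal sketch of that approach (the account, credential, container and folder names are placeholders, not the exact code used); note that DataLakeFileClient.rename_file expects the new path prefixed with the file system name:
from azure.storage.filedatalake import DataLakeServiceClient

# Placeholder account, credential, container and folder names for illustration only.
service_client = DataLakeServiceClient(
    account_url="https://storageacct.dfs.core.windows.net",
    credential="<storage-account-key>",
)
file_system_client = service_client.get_file_system_client(file_system="container")
# Find the single part file that Spark wrote into the output folder.
part_path = next(
    p.name for p in file_system_client.get_paths(path="output_folder")
    if p.name.split("/")[-1].startswith("part-")
)
# rename_file expects "<filesystem>/<new path>".
file_client = file_system_client.get_file_client(part_path)
file_client.rename_file(file_client.file_system_name + "/output_folder/result.csv")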
