Fastest way to combine multiple CSV files from a blob storage container into one CSV file on another blob storage container in an Azure function - python-3.x

I'd like to know whether the code below can be improved to run faster (and perhaps cheaper) as part of an Azure Function that combines multiple CSV files from a source blob storage container into one CSV file in a target blob storage container, using Python. (Note that I'd also be fine with using a library other than pandas if need be.)
from io import StringIO

import pandas as pd
from azure.identity import DefaultAzureCredential
from azure.keyvault.secrets import SecretClient
from azure.storage.blob import ContainerClient

# Used to get secrets from Azure Key Vault for authentication purposes
credential = DefaultAzureCredential()
vault_url = 'AzureKeyVaultURL'
secret_client = SecretClient(vault_url=vault_url, credential=credential)
azure_datalake_connection_str = secret_client.get_secret('Datalake_connection_string')

# Connect to the source Azure blob storage container where the CSV files are stored
blob_block_source = ContainerClient.from_connection_string(
    conn_str=azure_datalake_connection_str.value,
    container_name="sourceContainerName"
)

# Connect to the target Azure blob storage container where the CSV files should be combined into one file
blob_block_target = ContainerClient.from_connection_string(
    conn_str=azure_datalake_connection_str.value,
    container_name="targetContainerName"
)

# List the blob names in the source container, keeping only those that end with the .csv file extension
blob_names = [blob.name for blob in blob_block_source.list_blobs()]
only_csv_blob_names = [name for name in blob_names if name.endswith(".csv")]

# Create a list of dataframes - one dataframe per CSV file found in the source container
list_of_csv_dataframes = []
for csv_blobname in only_csv_blob_names:
    csv_text = blob_block_source.download_blob(csv_blobname).content_as_text(encoding='utf-8')
    df = pd.read_csv(StringIO(csv_text), header=0, low_memory=False)
    list_of_csv_dataframes.append(df)

# Concatenate the dataframes into one dataframe
df_concat = pd.concat(list_of_csv_dataframes, axis=0, ignore_index=True)

# Serialize the concatenated dataframe as a CSV string
output_csv = df_concat.to_csv(index=False, sep=',', header=True)

# Upload the combined CSV file (i.e. the source CSV files combined into one)
blob_block_target.upload_blob('combinedCSV.csv', output_csv, blob_type="BlockBlob", overwrite=True)

Instead of using an Azure Function, you can use Azure Data Factory (ADF) to concatenate your files.
It will probably be more efficient with ADF than with an Azure Function running pandas.
Take a look at this blog post: https://www.sqlservercentral.com/articles/merge-multiple-files-in-azure-data-factory
If you want to use an Azure Function, try to concatenate the files without using pandas. If all your files have the same columns in the same order, you can concatenate the text directly and drop the header line, if any, of every file but the first.
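That header-stripping merge can be sketched without pandas by streaming each blob's text and keeping only the first file's header. The pure merge helper below is illustrative (not from the original post); the Azure SDK calls in the comment reuse the names from the question's code and are an untested sketch:

```python
from typing import Iterable

def merge_csv_texts(csv_texts: Iterable[str]) -> str:
    """Concatenate CSV file contents, keeping only the first file's header line."""
    parts = []
    for i, text in enumerate(csv_texts):
        lines = text.splitlines()
        if i > 0 and lines:
            lines = lines[1:]  # drop the repeated header of every file but the first
        parts.extend(lines)
    return "\n".join(parts) + "\n"

# With the Azure SDK the inputs could come from something like (untested sketch):
# texts = (blob_block_source.download_blob(name).content_as_text(encoding="utf-8")
#          for name in only_csv_blob_names)
# blob_block_target.upload_blob("combinedCSV.csv", merge_csv_texts(texts), overwrite=True)
```

Because this never builds dataframes, memory use stays roughly proportional to the combined text size, which can matter on the consumption plan.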

Related

Recreate resulting CSV as own file in Azure blob

When writing out my dataframe, Spark drops the result into a folder called "file.csv" containing a "part-000..." file. I need to take this resulting file and write it out/copy it as its own CSV file with a proper name. I am using the logic here, but it appears this won't suffice for Azure Blob Storage, as it doesn't recognize a WASB path.
Code to create the dataframe:
val dfOutput = spark.sql("""SELECT * FROM Query""")
dfOutput.coalesce(1).write.option("header","true").mode("overwrite").format("csv").save(OutputFile)
Creates and outputs the dataframe as a CSV file within a folder, as a "part-000..." file. The output path in this case is wasb://mycontainer#myexamplestorage.blob.core.windows.net/file.csv (example).
The next part should grab the "part-000..." file and create it under its own file by copying it with the FileUtil then remove the "file.csv" path.
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, FileUtil, Path}
import java.io.File
val hadoopConfig = new Configuration()
val hdfs = FileSystem.get(hadoopConfig)
val srcPath = new Path(OutputFile)
val destPath = new Path("wasb://mycontainer#myexamplestorage.blob.core.windows.net/resultfile.csv")
val srcFile = FileUtil.listFiles(new File(OutputFile))
  .filterNot(f => f.getPath.endsWith(".csv"))(0)
FileUtil.copy(srcFile,hdfs,destPath,true,hadoopConfig)
hdfs.delete(srcPath,true)
This next part fails on the listFiles call with the error "IOException: Invalid directory or I/O error occurred for dir: wasb://mycontainer#myexamplestorage.blob.core.windows.net/file.csv", and from what I can tell this is caused because it's not able to list files from Azure Blob Storage.
I need to be able to get the CSV file from Azure Blob Storage and then copy it as its own named file to Azure Blob Storage, without the folder and the "part-000..." file. I played around with the file configuration, but this entire approach appears to be incompatible with Azure Blob Storage, or there is a configuration missing somewhere that would allow these calls to query blob storage.
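One hedged alternative on Databricks is to do the copy/rename with dbutils.fs instead of Hadoop's FileUtil, since dbutils understands WASB paths (note that WASB paths normally take the form wasb://container@account.blob.core.windows.net/...). Picking the part file out of a directory listing is plain string logic; the helper below is illustrative, and the dbutils calls in the comment are a sketch that assumes a standard Databricks notebook environment:

```python
def find_part_file(names):
    """Return the first name whose final path segment looks like a Spark part file."""
    for name in names:
        base = name.rstrip("/").rsplit("/", 1)[-1]
        if base.startswith("part-") and base.endswith(".csv"):
            return name
    return None

# In a Databricks notebook (sketch, not runnable outside Databricks):
# entries = [f.path for f in dbutils.fs.ls("wasb://mycontainer@myexamplestorage.blob.core.windows.net/file.csv")]
# part = find_part_file(entries)
# dbutils.fs.cp(part, "wasb://mycontainer@myexamplestorage.blob.core.windows.net/resultfile.csv")
# dbutils.fs.rm("wasb://mycontainer@myexamplestorage.blob.core.windows.net/file.csv", recurse=True)
```

This sidesteps java.io.File entirely, which only sees the local filesystem and is why listFiles fails on a wasb:// path.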

How to name a csv file after overwriting in Azure Blob Storage

I am using a Databricks notebook to read a file and write it back to the same location. But when I write the file I get a lot of files with different names, like this:
[screenshot of the generated files]
I am not sure why these files are created in the location I specified.
Also, another file with the name "new_location" was created after I performed the write operation.
What I want is that after reading the file from Azure Blob Storage I should write it back to the same location, with the same name as the original. But I am unable to do so. Please help me out, as I am new to PySpark.
I have already mounted the storage, and now I am reading the CSV file stored in an Azure Blob Storage container.
The overwritten file is created with the name "part-00000-tid-84371752119947096-333f1e37-6fdc-40d0-97f5-78cee0b108cf-31-1-c000.csv"
Code:
df = spark.read.csv("/mnt/ndemo/nsalman/addresses.csv", inferSchema = True)
df = df.toDF("firstName","lastName","street","town","city","code")
df.show()
file_location_new = "/mnt/ndemo/nsalman/new_location"
# write the dataframe as a single file to blob storage
df.write.format('com.databricks.spark.csv') \
    .mode('overwrite').option("header", "true").save(file_location_new)
Spark saves one partial CSV file per partition of your dataset. To generate a single CSV file, you can convert the dataset to a pandas dataframe and then write that out.
Try to change these lines:
df.write.format('com.databricks.spark.csv') \
    .mode('overwrite').option("header", "true").save(file_location_new)
to this line
df.toPandas().to_csv(file_location_new, header=True)
You might need to prepend "/dbfs/" to file_location_new for this to work.
Here is a minimal self-contained example that demonstrates how to write a CSV file with pandas:
df = spark.createDataFrame([(1,3),(2,2),(3,1)], ["Testing", "123"])
df.show()
df.toPandas().to_csv("/dbfs/" + "/mnt/ndemo/nsalman/" + "testfile.csv", header=True)
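One pandas detail worth noting about that to_csv call: by default pandas also writes the row index as an unnamed leading column, so passing index=False is usually wanted when producing a plain CSV. A small local illustration (plain pandas, independent of Spark/Databricks):

```python
import pandas as pd

df = pd.DataFrame({"Testing": [1, 2, 3], "123": [3, 2, 1]})

# Default: the row index becomes an unnamed leading column
with_index_header = df.to_csv(header=True).splitlines()[0]

# With index=False the header is just the column names
clean_header = df.to_csv(header=True, index=False).splitlines()[0]

print(with_index_header)  # ,Testing,123
print(clean_header)       # Testing,123
```

If downstream consumers expect exactly the original columns, add index=False to the to_csv call in the answer above.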

AzureML create dataset from datastore with multiple files - path not valid

I am trying to create a dataset in Azure ML where the data source are multiple files (eg images) in a Blob Storage. How do you do that correctly?
Here is the error I get following the documented approach in the UI
When I create the dataset in the UI and select the blob storage and directory with either just dirname or dirname/**, the files cannot be found in the Explorer tab, with the error ScriptExecution.StreamAccess.NotFound: The provided path is not valid or the files could not be accessed. When I try to download the data with the code snippet in the Consume tab, I get this error:
from azureml.core import Workspace, Dataset
# set variables
workspace = Workspace(subscription_id, resource_group, workspace_name)
dataset = Dataset.get_by_name(workspace, name='teststar')
dataset.download(target_path='.', overwrite=False)
Error Message: ScriptExecutionException was caused by StreamAccessException.
StreamAccessException was caused by NotFoundException.
Found no resources for the input provided: 'https://mystoragename.blob.core.windows.net/data/testdata/**'
When I just select one of the files instead of dirname or dirname/** then everything works. Does AzureML actually support Datasets consisting of multiple files?
Here is my setup:
I have a storage account with one container named data. In it there is a directory testdata containing testfile1.txt and testfile2.txt.
In AzureML I created a datastore testdatastore and there I select the data container in my data storage.
Then, in Azure ML, I create a dataset from the datastore: I select "file dataset" and the datastore above. I can then browse the files, select a folder, and tick the option that files in subdirectories should be included. This creates the path testdata/**, which does not work, as described above.
I got the same issue when creating the dataset and datastore in python:
import azureml.core
from azureml.core import Workspace, Datastore, Dataset
ws = Workspace.from_config()
datastore = Datastore(ws, "mydatastore")
datastore_paths = [(datastore, 'testdata')]
test_ds = Dataset.File.from_files(path=datastore_paths)
test_ds.register(ws, "testpython")
Datasets definitely support multiple files, so your problem is almost certainly in the permissions given when creating the "mydatastore" datastore (I suspect you used a SAS token to create it). In order to access anything but individual files, you need to give the datastore list permissions.
This would not be a problem if you registered the datastore with an account key, but it can be a limitation of a SAS token.
The second part of the error, "the provided path is not valid or the files could not be accessed", refers to potential permission issues.
You can also verify that the folder/** syntax works by creating a dataset from the default blob store (defaultblobstore) that was provisioned for you with your ML workspace.
I uploaded and registered the files with this script and everything works as expected.
from azureml.core import Datastore, Dataset, Workspace
import logging

logger = logging.getLogger(__name__)
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s.%(msecs)03d %(levelname)s %(module)s - %(funcName)s: %(message)s",
    datefmt="%Y-%m-%d %H:%M:%S",
)

datastore_name = "mydatastore"
dataset_path_on_disk = "./data/images_greyscale"
dataset_path_in_datastore = "images_greyscale"
azure_dataset_name = "images_grayscale"
azure_dataset_description = "dataset transformed into the coco format and into grayscale images"

workspace = Workspace.from_config()
datastore = Datastore.get(workspace, datastore_name=datastore_name)

logger.info("Uploading data...")
datastore.upload(
    src_dir=dataset_path_on_disk, target_path=dataset_path_in_datastore, overwrite=False
)
logger.info("Uploading data done.")

logger.info("Registering dataset...")
datastore_path = [(datastore, dataset_path_in_datastore)]
dataset = Dataset.File.from_files(path=datastore_path)
dataset.register(
    workspace=workspace,
    name=azure_dataset_name,
    description=azure_dataset_description,
    create_new_version=True,
)
logger.info("Registering dataset done.")

Read excel sheets from gcp bucket

I am currently trying to read data into my GCP notebook from a shared GCP storage bucket. I am an admin, so restrictions shouldn't apply as far as I know, but I get an error before I can even read anything in with pandas. Is this possible, or am I going about this the wrong way?
This is the code I have tried:
from google.cloud import storage
from io import BytesIO
import pandas as pd
client = storage.Client()
bucket = "our_data/deid"
blob = storage.blob.Blob("B_ACTIVITY.xlsx",bucket)
content = blob.download_as_string()
df = pd.read_excel(BytesIO(content))
I was hoping for the data to simply be brought in once the bucket was specified, but I get an error "'str' object has no attribute 'path'".
bucket needs to be a bucket object, not just a string.
Try changing that line to
bucket = client.bucket("<BUCKET_NAME>")
Note that a GCS bucket name cannot contain slashes, so the bucket here is presumably "our_data" and "deid" is a prefix inside it; the blob name would then be "deid/B_ACTIVITY.xlsx".
Here's a link to the constructor:
https://googleapis.dev/python/storage/latest/client.html#google.cloud.storage.client.Client.bucket

Importing multiple files from Google Cloud Bucket to Datalab instance

I have a bucket set up on Google Cloud containing a few hundred json files and am trying to work with them in a datalab instance running python 3.
So, I can easily see them as objects using
gcs list --objects gs://<BUCKET_NAME>
Further, I can read in an individual file/object using
import google.datalab.storage as storage
import pandas as pd
from io import BytesIO
myBucket = storage.Bucket('<BUCKET_NAME>')
data_csv = myBucket.object('<FILE_NAME.json>')
uri = data_csv.uri
%gcs read --object $uri --variable data
df = pd.read_csv(BytesIO(data))
df.head()
(FYI, I understand that my example is reading a JSON file as a CSV, but let's ignore that; I'll cross that bridge on my own.)
What I can't figure out is how to loop through the bucket and pull all of the json files into pandas...how do I do that? Is that the way I should be thinking of this- is there a way to call the files in the bucket from pandas directly (since they're already treated as objects)?
As an extra bit- what if a file is saved as a json, but isn't actually that structure? How can I handle that?
Essentially, I guess, I'm looking for the functionality of the blob package, but using cloud buckets + datalab.
Any help is greatly appreciated.
This can be done using Bucket.objects which returns an iterator with all matching files. Specify a prefix or leave it empty to match all files in the bucket. I did an example with two files countries1.csv and countries2.csv:
$ cat countries1.csv
id,country
1,sweden
2,spain
$ cat countries2.csv
id,country
3,italy
4,france
And used the following Datalab snippet:
import google.datalab.storage as storage
import pandas as pd
from io import BytesIO
myBucket = storage.Bucket('BUCKET_NAME')
object_list = myBucket.objects(prefix='countries')
df_list = []
for obj in object_list:
  %gcs read --object $obj.uri --variable data
  df_list.append(pd.read_csv(BytesIO(data)))
concatenated_df = pd.concat(df_list, ignore_index=True)
concatenated_df.head()
which will output the combined csv:
id country
0 1 sweden
1 2 spain
2 3 italy
3 4 france
Take into account that I combined all the CSV files into a single pandas dataframe with this approach, but you might want to load them into separate dataframes depending on the use case. If you want to retrieve all files in the bucket, just use this instead:
object_list = myBucket.objects()
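As for the side question in the post above (a file saved with a .json extension that isn't actually valid JSON), one simple approach is to attempt the parse and fall back, so a single bad file doesn't abort the whole loop. The helper below is an illustrative sketch, not part of the original answer:

```python
import json

def try_parse_json(text):
    """Return (parsed_object, None) for valid JSON, else (None, error_message)."""
    try:
        return json.loads(text), None
    except json.JSONDecodeError as exc:
        return None, str(exc)
```

Files that fail the parse can be logged and skipped, or routed through another reader such as pd.read_csv, instead of raising inside the loop.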
