Reading a CSV file from Azure Blob Storage with PySpark

I'm trying to do a machine learning project using a PySpark HDInsight cluster on Microsoft Azure. To work on the cluster I use a Jupyter notebook. My data (a CSV file) is stored in Azure Blob storage.
According to the documentation the syntax of the path to my file is:
path = 'wasb[s]://springboard@6zpbt6muaorgs.blob.core.windows.net/movies_plus_genre_info_2.csv'
However, when I try to read the CSV file with the following command:
csvFile = spark.read.csv(path, header=True, inferSchema=True)
I get the following error:
'java.net.URISyntaxException: Illegal character in scheme name at index 4: wasb[s]://springboard@6zpbt6muaorgs.blob.core.windows.net/movies_plus_genre_info_2.csv'
Any ideas on how to fix this?

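The [s] in the documentation marks the s as optional; the brackets are not part of the URI, and "[" is the illegal character at index 4. The scheme is either (unencrypted):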
wasb://...
or (encrypted):
wasbs://...
not
wasb[s]://...
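As a minimal sketch of the point above, the path can be built with a small helper so that the scheme is always either wasb or wasbs. The helper name wasb_uri is hypothetical, and the sketch assumes the usual container@account.blob.core.windows.net layout for WASB URIs; the container, account and file names are the ones from the question.

```python
def wasb_uri(container: str, account: str, blob_path: str, secure: bool = True) -> str:
    """Build a WASB URI (hypothetical helper, not part of PySpark).

    Assumes the layout <scheme>://<container>@<account>.blob.core.windows.net/<path>.
    """
    scheme = "wasbs" if secure else "wasb"  # never the literal "wasb[s]"
    return f"{scheme}://{container}@{account}.blob.core.windows.net/{blob_path.lstrip('/')}"


path = wasb_uri("springboard", "6zpbt6muaorgs", "movies_plus_genre_info_2.csv")
print(path)

# In the notebook, where `spark` is already defined, the read then becomes:
# csvFile = spark.read.csv(path, header=True, inferSchema=True)
```

Passing secure=False yields the unencrypted wasb:// form instead.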

Related

How to name a csv file after overwriting in Azure Blob Storage

I am using a Databricks notebook to read a file and write it back to the same location. But when I write the file, I get a lot of files with different names.
I am not sure why these files are created in the location I specified.
Also, another file named "new_location" was created after I performed the write operation.
What I want is to read the file from Azure Blob Storage and then write it back to the same location under its original name, but I am unable to do so. Please help me out, as I am new to PySpark.
I have already mounted the storage, and I am reading the CSV file stored in an Azure Blob Storage container.
The overwritten file is created with the name "part-00000-tid-84371752119947096-333f1e37-6fdc-40d0-97f5-78cee0b108cf-31-1-c000.csv".
Code:
df = spark.read.csv("/mnt/ndemo/nsalman/addresses.csv", inferSchema = True)
df = df.toDF("firstName","lastName","street","town","city","code")
df.show()
file_location_new = "/mnt/ndemo/nsalman/new_location"
# write the dataframe as a single file to blob storage
df.write.format('com.databricks.spark.csv') \
.mode('overwrite').option("header", "true").save(file_location_new)
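Spark saves one partial CSV file for each partition of your dataset, inside a directory with the name you passed to save(). To generate a single CSV file, you can convert the DataFrame to a pandas DataFrame and then write it out. Note that toPandas() collects all the data on the driver, so this only suits data that fits in driver memory.
Try changing these lines: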
df.write.format('com.databricks.spark.csv') \
.mode('overwrite').option("header", "true").save(file_location_new)
to this line:
df.toPandas().to_csv(file_location_new, header=True)
You might need to prepend "/dbfs" to file_location_new for this to work.
Here is a minimal self-contained example that demonstrates how to write a CSV file with pandas:
df = spark.createDataFrame([(1,3),(2,2),(3,1)], ["Testing", "123"])
df.show()
df.toPandas().to_csv("/dbfs" + "/mnt/ndemo/nsalman/" + "testfile.csv", header=True)
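If the data is too large for pandas, an alternative (not from the answer above, just a sketch) is to let Spark write a single partition and then move the one part file to the name you want. The helper promote_single_part and the target name addresses_new.csv are hypothetical; the sketch assumes the output directory is reachable through the local file API (on Databricks, under /dbfs) and that the target file lies outside that directory.

```python
import glob
import os
import shutil


def promote_single_part(output_dir: str, target_file: str) -> str:
    """Move the only part-*.csv in output_dir to target_file, then drop output_dir.

    Hypothetical helper: target_file must live outside output_dir,
    because output_dir is deleted at the end.
    """
    parts = glob.glob(os.path.join(output_dir, "part-*.csv"))
    if len(parts) != 1:
        raise ValueError(
            f"expected exactly one part file in {output_dir}, found {len(parts)}"
        )
    shutil.move(parts[0], target_file)
    shutil.rmtree(output_dir)  # removes _SUCCESS and other marker files
    return target_file


# Usage in the notebook (paths taken from the question, target name assumed):
# df.coalesce(1).write.mode("overwrite").option("header", "true") \
#     .csv("/mnt/ndemo/nsalman/new_location")
# promote_single_part("/dbfs/mnt/ndemo/nsalman/new_location",
#                     "/dbfs/mnt/ndemo/nsalman/addresses_new.csv")
```

coalesce(1) still funnels the whole write through one task, so it trades parallelism for a single file; it avoids holding the data in driver memory, though.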

Return the name of a saved file in Databricks

I have a notebook in Databricks that does some transformations and writes a parquet file to Azure Data Lake Storage. At the end of the notebook I would like to have an exit parameter with the name of the parquet file that the notebook has just saved. I would like to use this parameter in Azure Data Factory later.
In general, I would like to have a copy activity in Azure Data Factory that moves the just-saved parquet file to a database table. The problem is that the name of the parquet file changes every time the notebook is run. Let me know if there is a better solution to this problem.
Thank you!
The following call sends the file name back to ADF, where filename is a string holding the name of the file the notebook has just written:
dbutils.notebook.exit(filename)
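A minimal sketch of how the exit value could be assembled. dbutils.notebook.exit takes a single string, so returning more than one field means serialising it, for example as JSON. The helper name build_exit_payload and the path /mnt/datalake/out/sales_2024.parquet are hypothetical placeholders, not values from the question.

```python
import json


def build_exit_payload(output_path: str) -> str:
    """Serialise the saved file's path and name into one string (hypothetical helper)."""
    file_name = output_path.rstrip("/").rsplit("/", 1)[-1]
    return json.dumps({"path": output_path, "file_name": file_name})


# Assumed example path; in the real notebook this is whatever was just written.
payload = build_exit_payload("/mnt/datalake/out/sales_2024.parquet")
print(payload)

# Last statement of the Databricks notebook (dbutils exists only there):
# dbutils.notebook.exit(payload)
```

On the Data Factory side, the returned string should then be readable from the notebook activity's output.runOutput property; the activity name in that expression is whatever you called the activity in your pipeline.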

Failed to save a file in azure data lake from azure data bricks

I'm trying to save string content to Azure Data Lake as an XML file.
A string variable contains the XML content below.
<project>
  <dateformat>dd-MM-yy</dateformat>
  <timeformat>HH:mm</timeformat>
  <useCDATA>true</useCDATA>
</project>
I have used the code below to write the file to the data lake.
xmlfilewrite = "/mnt/adls/ProjectDataDecoded.xml"
with open(xmlfilewrite, "w") as f:
    f.write(project_processed_var)
It throws the following error:
No such file or directory: '/mnt/adls/ProjectDataDecoded.xml'
I'm able to access the data lake through the mount point above, but not with the "open" function.
Can anyone help me?
The issue is solved.
In Databricks, when a mount point exists on Azure Data Lake, you need to prefix the path with "/dbfs" before passing it to Python's open function, because open goes through the local file API rather than DBFS.
The issue is solved by using the code below:
xmlfilewrite = "/dbfs/mnt/adls/ProjectDataDecoded.xml"
with open(xmlfilewrite, "w") as f:
    f.write(project_processed_var)
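The same rule can be wrapped in a small helper so that one mount path works for both Spark APIs and plain Python file calls. This is only a sketch; the name to_local_path is hypothetical and the function merely rewrites the string, it does not check that the mount exists.

```python
def to_local_path(dbfs_path: str) -> str:
    """Map a DBFS path to its local file API form under /dbfs (hypothetical helper)."""
    if dbfs_path.startswith("/dbfs/"):
        return dbfs_path  # already in local form
    if dbfs_path.startswith("dbfs:"):
        dbfs_path = dbfs_path[len("dbfs:"):]  # drop the URI scheme
    return "/dbfs/" + dbfs_path.lstrip("/")


print(to_local_path("/mnt/adls/ProjectDataDecoded.xml"))

# In the notebook:
# with open(to_local_path("/mnt/adls/ProjectDataDecoded.xml"), "w") as f:
#     f.write(project_processed_var)
```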
You could try using the Spark-XML library. Convert your string to a dataframe where each row denotes one project. Then you can write it to ADLS in this way.
df.select("dateformat", "timeformat","useCDATA").write \
.format('xml') \
.options(rowTag='project', rootTag='project') \
.save('/mnt/adls/ProjectDataDecoded.xml')
Here is how you can include an external library: https://docs.databricks.com/libraries.html#create-a-library

Spark dataframe(in Azure Databricks) save in single file on data lake(gen2) and rename the file

I am trying to achieve the same functionality as the SO post "Spark dataframe save in single file on hdfs location", except that my file is located in Azure Data Lake Gen2 and I am using PySpark in a Databricks notebook.
Below is the code snippet I am using to rename the file:
from py4j.java_gateway import java_import
java_import(spark._jvm, 'org.apache.hadoop.fs.Path')
destpath = "abfss://" + contianer + "@" + storageacct + ".dfs.core.windows.net/"
fs = spark._jvm.org.apache.hadoop.fs.FileSystem.get(spark._jsc.hadoopConfiguration())
file = fs.globStatus(sc._jvm.Path(destpath+'part*'))[0].getPath().getName()
# Rename the file
I receive "IndexError: list index out of range" on this line:
file = fs.globStatus(sc._jvm.Path(destpath+'part*'))[0].getPath().getName()
The part* file does exist in the folder.
Is this the right approach to rename a file that Databricks (PySpark) writes to Azure Data Lake Gen2? If not, how else can I accomplish this?
I was able to resolve this by installing the azure.storage.filedatalake client library in my Databricks notebook. By using the FileSystemClient and DataLakeFileClient classes, I was able to rename the file in Data Lake Gen2.
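The answer does not show its code, so the following is only a sketch of what such a rename might look like with the azure-storage-file-datalake package, not the author's actual solution. The function names, the credential argument and all parameter values are assumptions; the sketch lists the directory, picks the single part file and calls rename_file, which expects the new name in the form "<filesystem>/<path>".

```python
def pick_part_file(names):
    """Return the one path whose file name starts with 'part-' (hypothetical helper)."""
    parts = [n for n in names if n.rsplit("/", 1)[-1].startswith("part-")]
    if len(parts) != 1:
        raise ValueError(f"expected exactly one part file, found {len(parts)}")
    return parts[0]


def build_rename_target(filesystem, directory, new_name):
    """Build the '<filesystem>/<directory>/<file>' string that rename_file expects."""
    pieces = [filesystem, directory.strip("/"), new_name]
    return "/".join(p for p in pieces if p)


def rename_part_file(account_url, credential, filesystem, directory, new_name):
    """Rename the single part file in `directory` to `new_name` (sketch, untested)."""
    # Imported here so the helpers above work without the Azure package installed.
    from azure.storage.filedatalake import DataLakeServiceClient

    service = DataLakeServiceClient(account_url=account_url, credential=credential)
    fs_client = service.get_file_system_client(filesystem)
    # get_paths yields entries whose .name is the path relative to the file system root
    names = [p.name for p in fs_client.get_paths(path=directory, recursive=False)]
    source = pick_part_file(names)
    file_client = fs_client.get_file_client(source)
    file_client.rename_file(build_rename_target(filesystem, directory, new_name))


# Example call; every value here is a placeholder:
# rename_part_file("https://<storageacct>.dfs.core.windows.net", "<account key>",
#                  "<container>", "output/run1", "result.csv")
```

Listing the directory first also sidesteps the IndexError from the question: if no part file is found, the helper raises a descriptive error instead of indexing into an empty result.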

How to export data from a dataframe to a file databricks

I'm currently taking the Introduction to Spark course at edX.
Is it possible to save DataFrames from Databricks to my computer?
I'm asking because the course provides Databricks notebooks, which probably won't work after the course ends.
In the notebook data is imported using command:
log_file_path = 'dbfs:/' + os.path.join('databricks-datasets',
'cs100', 'lab2', 'data-001', 'apache.access.log.PROJECT')
I found this solution, but it doesn't work:
df.select('year','model').write.format('com.databricks.spark.csv').save('newcars.csv')
Databricks runs a cloud VM and does not have any idea where your local machine is located. If you want to save the CSV results of a DataFrame, you can run display(df) and there's an option to download the results.
You can also save it to the FileStore and download it via its handle, e.g.
df.coalesce(1).write.format("com.databricks.spark.csv").option("header", "true").save("dbfs:/FileStore/df/df.csv")
You can find the handle in the Databricks GUI by going to Data > Add Data > DBFS > FileStore > your_subdirectory > part-00000-...
The download URL in this case (for a Databricks West Europe instance) is:
https://westeurope.azuredatabricks.net/files/df/df.csv/part-00000-tid-437462250085757671-965891ca-ac1f-4789-85b0-akq7bc6a8780-3597-1-c000.csv
I haven't tested it, but I would assume that the limit of 1 million rows you would hit when downloading via the mentioned answer from @MrChristine does not apply here.
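The mapping from the saved path to the download URL in the example above can be sketched as a small function: the dbfs:/FileStore/ prefix is replaced by https://<instance>/files/. The helper name filestore_download_url is hypothetical, and the part file name in the usage line is shortened for illustration.

```python
def filestore_download_url(instance_host: str, dbfs_path: str) -> str:
    """Turn a dbfs:/FileStore/... path into its /files/... URL (hypothetical helper)."""
    prefix = "dbfs:/FileStore/"
    if not dbfs_path.startswith(prefix):
        raise ValueError("only paths under dbfs:/FileStore/ map to /files/ URLs")
    return f"https://{instance_host}/files/{dbfs_path[len(prefix):]}"


url = filestore_download_url(
    "westeurope.azuredatabricks.net",
    "dbfs:/FileStore/df/df.csv/part-00000.csv",  # shortened, illustrative file name
)
print(url)
```

Because df.csv is a directory written by Spark, the URL has to point at the part file inside it, not at df.csv itself.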
Try this:
df.write.format("com.databricks.spark.csv").save("file:///home/yphani/datacsv")
This saves the output to the local file system of the Unix server.
If you give only /home/yphani/datacsv, Spark looks for the path on HDFS.
