I am having trouble writing a file on my Databricks cluster's driver (as a temp file). I have a Scala notebook on my company's Azure Databricks which contains these lines of code:
import java.io.File
import org.apache.commons.io.FileUtils

val xml: String = Controller.requestTo(url)
val bytes: Array[Byte] = xml.getBytes
val path: String = "dbfs:/data.xml"
val file: File = new File(path)
FileUtils.writeByteArrayToFile(file, bytes)
dbutils.fs.ls("dbfs:/")
val df = spark.read.format("com.databricks.spark.xml")
.option("rowTag", "generic:Obs")
.load(path)
df.show
file.delete()
However, it crashes with org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does not exist: dbfs:/data.xml. When I run an ls on the root of the DBFS, it doesn't show the file data.xml, so to me FileUtils is not doing its job. What puzzles me even more is that the following code works when run on the same cluster, same Azure resource group, same instance of Databricks, but in another notebook:
val path: String = "mf-data.grib"
val file: File = new File(path)
FileUtils.writeByteArrayToFile(file, bytes)
I tried restarting the cluster, removing "dbfs:/" from the path, putting the file in the dbfs:/tmp/ directory, and using FileUtils.writeStringToFile(file, xml, StandardCharsets.UTF_8) instead of FileUtils.writeByteArrayToFile, but none of those solutions worked, even when combined.
If you're using local APIs, like File, you need to use the corresponding local file path: instead of using dbfs:/ you need to prefix the path with /dbfs/, so your code will look like the following:
val file: File = new File(path.replaceFirst("dbfs:", "/dbfs"))
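For context, here is a minimal sketch of the whole round trip, reusing the xml/bytes values and the data.xml file name from the question: write through the driver's /dbfs/ FUSE mount with local-file APIs, then read the same file back with Spark using the dbfs:/ URI.

import java.io.File
import org.apache.commons.io.FileUtils

// write with local-file APIs through the driver's /dbfs/ FUSE mount
val localFile = new File("/dbfs/data.xml")
FileUtils.writeByteArrayToFile(localFile, bytes)

// read the same file back with Spark, which expects the dbfs:/ scheme
val df = spark.read
  .format("com.databricks.spark.xml")
  .option("rowTag", "generic:Obs")
  .load("dbfs:/data.xml")
df.show()

// clean up via the local path (dbutils.fs.rm("dbfs:/data.xml") would also work)
localFile.delete()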
Try removing the dbfs prefix here: val path: String = "dbfs:/data.xml". For understanding purposes I have given 3 different magic command cells: %sh, %fs, %scala. You can ref: here
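As a quick way to tell which view of DBFS you are looking at, here is a small hedged sketch: the same root listed through the DBFS URI (what a %fs ls / cell shows) and through the driver's local /dbfs/ mount (what a %sh ls /dbfs cell shows).

// DBFS view, equivalent to a %fs ls / cell
display(dbutils.fs.ls("dbfs:/"))

// driver-local view of the same root, equivalent to a %sh ls /dbfs cell
new java.io.File("/dbfs/").listFiles().foreach(f => println(f.getName))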
In an Azure Synapse Analytics (PySpark) notebook, using the Spark context's Hadoop file system, I'm able to delete a folder or file,
but not able to move/copy or rename the files; I keep getting an error.
Below is the snippet I used:
from pyspark.sql import SparkSession
# prepare spark session
spark = SparkSession.builder.appName('filesystemoperations').getOrCreate()
# spark context
sc = spark.sparkContext
# File path declaration
containerName = "ZZZZZZZ"
fullRootPath = "abfss://{cName}#{cName}.dfs.core.windows.net".format(cName=containerName)
tablePath = "/ws/tables/"
eventName = 'abcd'
tempPath = "temp/"  # assumed: temp sub-folder used for the staged output
tableFilename = "Game_" + eventName + ".kql"
tableFile = fullRootPath + tablePath + tableFilename
tempTableFile = fullRootPath + tablePath + tempPath + tableFilename
#empty the paths
sc._jsc.hadoopConfiguration().set('fs.defaultFS', fullRootPath)
fs = (sc._jvm.org.apache.hadoop.fs.FileSystem.get(sc._jsc.hadoopConfiguration()))
fs.delete(sc._jvm.org.apache.hadoop.fs.Path(fullRootPath+tablePath), True)
## This one worked fine.
# dfStringFinalF contains a string
spark.sparkContext.parallelize([dfStringFinalF]).coalesce(1).saveAsTextFile(tempTableFile)
fs.copy(tempTableFile + '/part-00000', tableFile)
# Copy, rename, cp, mv - nothing working on fs
Please help
It seems Azure Synapse Analytics has limitations with the Spark context and shutil libraries.
The mssparkutils library helps with copying/moving files in the blob containers.
Here is a code sample:
from notebookutils import mssparkutils
fullRootPath = "abfss://{cName}#{cName}.dfs.core.windows.net".format(cName=zzzz)
fsmounting = mssparkutils.fs.mount(
fullRootPath,
"/ws",
{"linkedService":"abcd"}
)
fsMove = mssparkutils.fs.mv(tempTableFile+'/part-00000',tableFile )
More reference in the link below:
https://learn.microsoft.com/en-us/azure/synapse-analytics/spark/synapse-file-mount-api
I am working on some code on my local machine in PyCharm.
The execution is done on a Databricks cluster, while the data is stored on Azure Data Lake.
Basically, I need to list the files in an Azure Data Lake directory and then apply some reading logic to them; for this I am using the code below:
sc = spark.sparkContext
hadoop = sc._jvm.org.apache.hadoop
fs = hadoop.fs.FileSystem
conf = hadoop.conf.Configuration()
path = hadoop.fs.Path('adl://<Account>.azuredatalakestore.net/<path>')
for f in fs.get(conf).listStatus(path):
    print(f.getPath(), f.getLen())
The above code runs fine in Databricks notebooks, but when I try to run the same code through PyCharm using databricks-connect I get the following error:
"Wrong FS expected: file:///....."
On some digging it turns out that the code is looking in my local drive to find the "path".
I had a similar issue with Python libraries (os, pathlib).
I have no issue running other code on the cluster.
I need help figuring out how to run this so that it searches the data lake and not my local machine.
Also, the azure-datalake-store client is not an option due to certain restrictions.
You may use this.
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{Path, FileSystem}
import org.apache.spark.deploy.SparkHadoopUtil
import org.apache.spark.sql.execution.datasources.InMemoryFileIndex
import java.net.URI
def listFiles(basep: String, globp: String): Seq[String] = {
  val conf = new Configuration(sc.hadoopConfiguration)
  val fs = FileSystem.get(new URI(basep), conf)

  def validated(path: String): Path = {
    if (path startsWith "/") new Path(path)
    else new Path("/" + path)
  }

  val fileCatalog = InMemoryFileIndex.bulkListLeafFiles(
    paths = SparkHadoopUtil.get.globPath(fs, Path.mergePaths(validated(basep), validated(globp))),
    hadoopConf = conf,
    filter = null,
    sparkSession = spark)

  fileCatalog.flatMap(_._2.map(_.path))
}
val root = "/mnt/{path to your file directory}"
val globp = "[^_]*"
val files = listFiles(root, globp)
files.toDF("path").show()
I am running PySpark scripts to write a dataframe to a CSV in a Jupyter notebook, as below:
df.coalesce(1).write.csv('Data1.csv',header = 'true')
After an hour of runtime I am getting the below error.
Error: Invalid status code from http://.....session isn't active.
My config looks like this:
spark.conf.set("spark.dynamicAllocation.enabled","true")
spark.conf.set("shuffle.service.enabled","true")
spark.conf.set("spark.dynamicAllocation.minExecutors",6)
spark.conf.set("spark.executor.heartbeatInterval","3600s")
spark.conf.set("spark.cores.max", "4")
spark.conf.set("spark.sql.tungsten.enabled", "true")
spark.conf.set("spark.eventLog.enabled", "true")
spark.conf.set("spark.app.id", "Logs")
spark.conf.set("spark.io.compression.codec", "snappy")
spark.conf.set("spark.rdd.compress", "true")
spark.conf.set("spark.executor.instances", "6")
spark.conf.set("spark.executor.memory", '20g')
spark.conf.set("hive.exec.dynamic.partition", "true")
spark.conf.set("hive.exec.dynamic.partition.mode", "nonstrict")
spark.conf.set("spark.driver.allowMultipleContexts", "true")
spark.conf.set("spark.master", "yarn")
spark.conf.set("spark.driver.memory", "20G")
spark.conf.set("spark.executor.instances", "32")
spark.conf.set("spark.executor.memory", "32G")
spark.conf.set("spark.driver.maxResultSize", "40G")
spark.conf.set("spark.executor.cores", "5")
I have checked the container nodes and the error there is:
ExecutorLostFailure (executor 2 exited caused by one of the running tasks) Reason: Container marked as failed:container_e836_1556653519610_3661867_01_000005 on host: ylpd1205.kmdc.att.com. Exit status: 143. Diagnostics: Container killed on request. Exit code is 143
Not able to figure out the issue.
Judging by the output, if your application is not finishing with a FAILED status, this sounds like a Livy timeout error: your application is likely taking longer than the defined timeout for a Livy session (which defaults to 1h), so even though the Spark app succeeds, your notebook will receive this error if the app runs longer than the Livy session's timeout.
If that's the case, here's how to address it:
edit the /etc/livy/conf/livy.conf file (on the cluster's master node)
set livy.server.session.timeout to a higher value, like 8h (or larger, depending on your app), as in the snippet below
restart Livy to pick up the setting: sudo restart livy-server on the cluster's master node
test your code again
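For reference, the relevant line in livy.conf would look something like this (8h is only an example value; pick one longer than your longest expected run):

# /etc/livy/conf/livy.conf
livy.server.session.timeout = 8h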
I am not well versed in PySpark, but in Scala the solution would involve something like the following.
First we need a method for creating a header file:
import java.io.PrintWriter
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, FileUtil, Path}
import org.apache.spark.sql.{DataFrame, SparkSession}

def createHeaderFile(headerFilePath: String, colNames: Array[String]): Unit = {
  // format the header file path
  val fileName = "dfheader.csv"
  val headerFileFullName = "%s/%s".format(headerFilePath, fileName)
  // write the header to HDFS as a single comma-separated line
  val hadoopConfig = new Configuration()
  val fileSystem = FileSystem.get(hadoopConfig)
  val output = fileSystem.create(new Path(headerFileFullName))
  val writer = new PrintWriter(output)
  writer.write(colNames.mkString(",") + "\n")
  writer.close()
}
You will also need a method that calls Hadoop to merge the part files written by the df.write method:
def mergeOutputFiles(sourcePaths: String, destLocation: String, tempOutputFolder: String): Unit = {
  val hadoopConfig = new Configuration()
  val hdfs = FileSystem.get(hadoopConfig)
  // in case of an Array[String], use a for loop to iterate over the multiple source paths
  // for (sourcePath <- sourcePaths) {
  // get the path under the destination where the partitioned files are temporarily stored
  val pathText = sourcePaths.split("/")
  val destPath = "%s/%s".format(destLocation, pathText.last)
  // merge the part files into one
  FileUtil.copyMerge(hdfs, new Path(sourcePaths), hdfs, new Path(destPath), true, hadoopConfig, null)
  // }
  // delete the temp partitioned files once the merge is complete
  val tempfilesPath = "%s%s".format(destLocation, tempOutputFolder)
  hdfs.delete(new Path(tempfilesPath), true)
}
Here is the method that generates the output files, i.e. your df.write call, where you pass in your huge DataFrame to be written out to HDFS:
def generateOutputFiles(processedDf: DataFrame, opPath: String, tempOutputFolder: String,
                        spark: SparkSession): String = {
  import spark.implicits._
  val fileName = "%s%sNameofyourCsvFile.csv".format(opPath, tempOutputFolder)
  // write as csv to the output directory and create the header file alongside it
  processedDf.write.mode("overwrite").csv(fileName)
  createHeaderFile(fileName, processedDf.columns)
  // keep the partitioned file path so it can be sent for merging
  val outputFilePathList = fileName
  // you can use an Array[String] instead of a single String if the output needs to be divided
  // into multiple files based on some parameter; in that case change the return type to Array[String],
  // collect each fileName into the array (e.g. outputFilePathList(counter) = fileName)
  // and increment the counter inside a loop
  outputFilePathList
}
With all the methods defined, here is how you can implement them:
def processYourLogic(/* your parameters, if any */): DataFrame = {
  // your logic to do whatever needs to be done to your data
}
Assuming the above method returns a dataframe, here is how you can put everything together:
val yourBigDf = processYourLogic(/* your parameters */) // returns a DataFrame
yourBigDf.cache // caching just in case you need it later
val outputPathFinal = "location where you want your file to be saved"
val tempOutputFolderLocation = "temp/"
val partFiles = generateOutputFiles(yourBigDf, outputPathFinal, tempOutputFolderLocation, spark)
mergeOutputFiles(partFiles, outputPathFinal, tempOutputFolderLocation)
Let me know if you have any other questions relating to this. If the answer you seek is different, then the original question should be updated.
I'm running Spark 2.1 and I want to write a CSV with results to Amazon S3.
After repartitioning, the CSV file has a kind of long, cryptic name and I want to change that to a specific filename.
I'm using the databricks lib for writing to S3.
dataframe
.repartition(1)
.write
.format("com.databricks.spark.csv")
.option("header", "true")
.save("folder/dataframe/")
Is there a way to rename the file afterwards or even save it directly with the correct name? I've already looked for solutions and haven't found much.
Thanks
You can use the code below to rename the output file.
dataframe.repartition(1).write.format("com.databricks.spark.csv").option("header", "true").save("folder/dataframe/")
import org.apache.hadoop.fs._
val fs = FileSystem.get(sc.hadoopConfiguration)
val filePath = "folder/dataframe/"
val fileName = fs.globStatus(new Path(filePath+"part*"))(0).getPath.getName
fs.rename(new Path(filePath+fileName), new Path(filePath+"file.csv"))
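One hedged caveat, since the question targets S3: FileSystem.get(sc.hadoopConfiguration) returns the cluster's default filesystem, which may not be the one that owns an S3 output path. Resolving the filesystem from the output path itself is safer; s3a://your-bucket below is a hypothetical placeholder.

import org.apache.hadoop.fs._

// resolve the filesystem that actually owns the output path rather than the default FS
val outPath = new Path("s3a://your-bucket/folder/dataframe/")
val s3fs = outPath.getFileSystem(sc.hadoopConfiguration)
val partFile = s3fs.globStatus(new Path(outPath, "part*"))(0).getPath
s3fs.rename(partFile, new Path(outPath, "file.csv"))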
The code you mentioned here returns Unit. You would need to confirm that your Spark application has completed its run (assuming this is a batch case) and then rename:
dataframe
.repartition(1)
.write
.format("com.databricks.spark.csv")
.option("header", "true")
.save("folder/dataframe/")
You can rename the part files to any specific name using dbutils. Use the code below to rename the generated part CSV files; this code works fine for PySpark.
x = 'dbfs:/mnt/source_path/'       # your source path
y = 'dbfs:/mnt/destination_path/'  # your destination path
Files = dbutils.fs.ls(x)

# moving or renaming the part-0000x CSV files to a normal or specific name
i = 0
for file in Files:
    print(file.name)
    i = i + 1
    if file.name[-4:] == '.csv':  # you can use any file extension like parquet, JSON, etc.
        dbutils.fs.mv(x + file.name, y + 'OutputData-' + str(i) + '.csv')  # you can provide any specific name here

dbutils.fs.rm(x, True)  # optionally remove the source path after renaming all the generated part files
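Since the original question is in Scala, and assuming you are on Databricks where dbutils is available, here is a hedged Scala sketch of the same rename-the-part-files idea; the paths are hypothetical placeholders.

val src = "dbfs:/mnt/source_path/"       // hypothetical source directory written by df.write
val dst = "dbfs:/mnt/destination_path/"  // hypothetical destination directory

// move every part CSV to a predictable name, then optionally drop the source directory
dbutils.fs.ls(src)
  .filter(_.name.endsWith(".csv"))
  .zipWithIndex
  .foreach { case (f, i) =>
    dbutils.fs.mv(src + f.name, dst + s"OutputData-${i + 1}.csv")
  }
dbutils.fs.rm(src, true)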
I want to merge multiple files generated by a Spark job into one file. Usually I'd do something like:
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, FileUtil, Path}

val hadoopConfig = new Configuration()
val hdfs = FileSystem.get(hadoopConfig)
val deleteSrcFiles = true
FileUtil.copyMerge(hdfs, new Path(srcPath), hdfs, new Path(dstPath), deleteSrcFiles, hadoopConfig, null)
This runs fine locally, using paths like /tmp/some/path/to.csv, but results in an exception when executed on a cluster (my-cluster):
Wrong FS: gs://myBucket/path/to/result.csv, expected: hdfs://my-cluster-m
Is it possible to get a FileSystem for gs:// paths from Scala/Java code running on a Dataproc cluster?
EDIT
Found the google-storage client library:
https://cloud.google.com/storage/docs/reference/libraries#client-libraries-install-java
You can only use Paths belonging to a particular filesystem with that filesystem; e.g. you cannot pass a gs:// path to HDFS as you did above.
The following snippet works for me:
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileUtil, Path}

val hadoopConfig = new Configuration()
val srcPath = new Path("hdfs:/tmp/foo")
val hdfs = srcPath.getFileSystem(hadoopConfig)
val dstPath = new Path("gs://bucket/foo")
val gcs = dstPath.getFileSystem(hadoopConfig)
val deleteSrcFiles = true
FileUtil.copyMerge(hdfs, srcPath, gcs, dstPath, deleteSrcFiles, hadoopConfig, null)
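As a side note, FileUtil.copyMerge no longer exists in Hadoop 3.x, so on newer clusters you may need to merge the part files by hand. A minimal sketch of that approach, reusing the srcPath, dstPath, filesystems and flags from the snippet above:

import org.apache.hadoop.io.IOUtils

// stream every part file from the source directory into a single destination file
val out = gcs.create(dstPath)
hdfs.listStatus(srcPath)
  .filter(_.getPath.getName.startsWith("part-"))
  .sortBy(_.getPath.getName)
  .foreach { status =>
    val in = hdfs.open(status.getPath)
    IOUtils.copyBytes(in, out, hadoopConfig, false) // keep the output stream open between files
    in.close()
  }
out.close()
if (deleteSrcFiles) hdfs.delete(srcPath, true)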