I want to merge multiple files generated by a Spark job into one file. Usually I'd do something like:
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, FileUtil, Path}

val hadoopConfig = new Configuration()
val hdfs = FileSystem.get(hadoopConfig)
val deleteSrcFiles = true
FileUtil.copyMerge(hdfs, new Path(srcPath), hdfs, new Path(dstPath), deleteSrcFiles, hadoopConfig, null)
This runs fine locally, using paths like /tmp/some/path/to.csv, but results in an exception when executed on a cluster (my-cluster):
Wrong FS: gs://myBucket/path/to/result.csv, expected: hdfs://my-cluster-m
Is it possible to get a FileSystem for gs:// paths from Scala/Java code running on a Dataproc cluster?
EDIT
Found the Google Cloud Storage client library:
https://cloud.google.com/storage/docs/reference/libraries#client-libraries-install-java
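A minimal sketch of using that client library directly, in case it helps; the bucket and object names are just the placeholders from the question:

import java.nio.file.Paths
import com.google.cloud.storage.{BlobId, StorageOptions}

// Uses the default credentials available on a Dataproc VM.
val storage = StorageOptions.getDefaultInstance.getService
val blob = storage.get(BlobId.of("myBucket", "path/to/result.csv"))
blob.downloadTo(Paths.get("/tmp/result.csv"))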
You can only use Paths belonging to a particular filesystem with that filesystem; e.g. you cannot pass a gs:// path to an HDFS FileSystem as you did above. Path.getFileSystem resolves the right filesystem implementation from the path's scheme.
The following snippet works for me:
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileUtil, Path}

val hadoopConfig = new Configuration()
val srcPath = new Path("hdfs:/tmp/foo")
val hdfs = srcPath.getFileSystem(hadoopConfig)
val dstPath = new Path("gs://bucket/foo")
val gcs = dstPath.getFileSystem(hadoopConfig)
val deleteSrcFiles = true
FileUtil.copyMerge(hdfs, srcPath, gcs, dstPath, deleteSrcFiles, hadoopConfig, null)
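One caveat worth noting: FileUtil.copyMerge was removed in Hadoop 3.x, so on newer clusters you may need to merge manually. A rough sketch under that assumption, reusing hdfs, gcs, srcPath, dstPath and hadoopConfig from the snippet above:

import org.apache.hadoop.io.IOUtils

// Concatenate every file under srcPath into a single object at dstPath.
val out = gcs.create(dstPath)
try {
  hdfs.listStatus(srcPath)
    .filter(_.isFile)
    .sortBy(_.getPath.getName) // keep part files in order
    .foreach { status =>
      val in = hdfs.open(status.getPath)
      try IOUtils.copyBytes(in, out, hadoopConfig, false)
      finally in.close()
    }
} finally out.close()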
Related
I am having trouble writing a file on my Databricks cluster's driver (as a temp file). I have a Scala notebook on my company's Azure Databricks which contains these lines of code:
import java.io.File
import org.apache.commons.io.FileUtils

val xml: String = Controller.requestTo(url)
val bytes: Array[Byte] = xml.getBytes
val path: String = "dbfs:/data.xml"
val file: File = new File(path)
FileUtils.writeByteArrayToFile(file, bytes)
dbutils.fs.ls("dbfs:/")

val df = spark.read.format("com.databricks.spark.xml")
  .option("rowTag", "generic:Obs")
  .load(path)
df.show
file.delete()
However, it crashes with org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does not exist: dbfs:/data.xml. When I run an ls on the root of the DBFS, it doesn't show the file data.xml, so to me FileUtils is not doing its job. What confuses me even more is that the following code works when run on the same cluster, same Azure resource group, same instance of Databricks, but in another notebook:
val path: String = "mf-data.grib"
val file: File = new File(path)
FileUtils.writeByteArrayToFile(file, bytes)
I tried restarting the cluster, removing "dbfs:/" from the path, putting the file in the dbfs:/tmp/ directory, and using FileUtils.writeStringToFile(file, xml, StandardCharsets.UTF_8) instead of FileUtils.writeByteArrayToFile, but none of these worked, even in combination.
If you're using local APIs, like File, you need to use the corresponding local file access: instead of the dbfs:/ prefix you need to prefix the path with /dbfs/ (the DBFS FUSE mount), so your code would look as follows:
val file: File = new File(path.replaceFirst("dbfs:", "/dbfs"))
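Putting that together with the question's snippet, a corrected sketch might look like this (writing through the /dbfs mount, reading through the dbfs:/ URI):

import java.io.File
import org.apache.commons.io.FileUtils

val path: String = "dbfs:/data.xml"
// Local file APIs see DBFS under the /dbfs mount point.
val file: File = new File(path.replaceFirst("dbfs:", "/dbfs"))
FileUtils.writeByteArrayToFile(file, bytes)

// Spark APIs keep using the dbfs:/ URI.
val df = spark.read.format("com.databricks.spark.xml")
  .option("rowTag", "generic:Obs")
  .load(path)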
Another suggestion: try removing the dbfs: prefix here: val path: String = "dbfs:/data.xml". For understanding purposes, try three different magic-command cells (%sh, %fs, %scala) and compare what each one sees; see the Databricks documentation on magic commands for reference.
Posting a similar question, as the existing thread is very old. I am using the code below to check whether the file exists at target_path. Though the file is present, I am getting 'false' as the return value. Am I missing some setting?
val config = sc.hadoopConfiguration
val fileSystem = org.apache.hadoop.fs.FileSystem.get(config)
var existCheck = fileSystem.exists(new org.apache.hadoop.fs.Path(target_path))
I also tried the snippets below, given on the site, but they also return 'false':
new java.io.File(target_path).isFile
scala.reflect.io.File(target_path).exists
target_path contains one delta_log and a parquet part file. Please help me get the correct status.
(DBR-7.3 LTS, spark-3.0.1)
You were very close :)
Below I use listStatus to get back an array of the statuses of all of the files under pathToFolder, which would be the path to the folder containing the parquet file. I then check the paths of each of the files under the folder to look for matches to target_path.
import org.apache.hadoop.fs.Path
import org.apache.spark.SparkContext

val sc: SparkContext = ???
val pathToFolder: String = ???
val pathToParquetFile: String = target_path

val config = sc.hadoopConfiguration
val src = new Path(pathToFolder)
val fs = src.getFileSystem(config)

val parquetFileExists: Boolean = fs
  .listStatus(src)
  .map(_.getPath.toString)
  .find(_ == pathToParquetFile)
  .isDefined
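Incidentally, resolving the FileSystem from the path itself, as above, rather than via FileSystem.get(config) is often what fixes these false negatives: FileSystem.get returns the default filesystem, which may not match target_path's scheme. A minimal sketch of the exists check with that change:

val fsForTarget = new Path(target_path).getFileSystem(sc.hadoopConfiguration)
val existCheck = fsForTarget.exists(new Path(target_path))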
I am working on some code on my local machine in PyCharm. The execution is done on a Databricks cluster, while the data is stored on Azure Data Lake.
Basically, I need to list the files in an Azure Data Lake directory and then apply some reading logic to them; for this I am using the code below:
sc = spark.sparkContext
hadoop = sc._jvm.org.apache.hadoop

fs = hadoop.fs.FileSystem
conf = hadoop.conf.Configuration()
path = hadoop.fs.Path('adl://<Account>.azuredatalakestore.net/<path>')

for f in fs.get(conf).listStatus(path):
    print(f.getPath(), f.getLen())
The above code runs fine in Databricks notebooks, but when I try to run the same code through PyCharm using databricks-connect I get the following error:
"Wrong FS expected: file:///....."
On some digging it turns out that the code is looking on my local drive to find the path.
I had a similar issue with Python libraries (os, pathlib).
I have no issue running other code on the cluster.
I need help figuring out how to run this so that it searches the Data Lake and not my local machine.
Also, the azure-datalake-store client is not an option due to certain restrictions.
You may use this.
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.deploy.SparkHadoopUtil
import org.apache.spark.sql.execution.datasources.InMemoryFileIndex
import java.net.URI

def listFiles(basep: String, globp: String): Seq[String] = {
  val conf = new Configuration(sc.hadoopConfiguration)
  val fs = FileSystem.get(new URI(basep), conf)

  // make sure both fragments start with a slash before merging them
  def validated(path: String): Path = {
    if (path startsWith "/") new Path(path)
    else new Path("/" + path)
  }

  val fileCatalog = InMemoryFileIndex.bulkListLeafFiles(
    paths = SparkHadoopUtil.get.globPath(fs, Path.mergePaths(validated(basep), validated(globp))),
    hadoopConf = conf,
    filter = null,
    sparkSession = spark)

  fileCatalog.flatMap(_._2.map(_.path))
}
val root = "/mnt/{path to your file directory}"
val globp = "[^_]*"
val files = listFiles(root, globp)
files.toDF("path").show()
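Note that InMemoryFileIndex and SparkHadoopUtil are internal Spark APIs, so the exact signatures may shift between Spark versions; the advantage of this approach is that the listing goes through the cluster's Hadoop filesystem rather than your local one, which is what avoids the Wrong FS error with databricks-connect.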
I am running PySpark scripts to write a dataframe to a CSV in a Jupyter notebook, as below:
df.coalesce(1).write.csv('Data1.csv', header='true')
After an hour of runtime, I get the error below:
Error: Invalid status code from http://.....session isn't active.
My config is as follows:
spark.conf.set("spark.dynamicAllocation.enabled","true")
spark.conf.set("shuffle.service.enabled","true")
spark.conf.set("spark.dynamicAllocation.minExecutors",6)
spark.conf.set("spark.executor.heartbeatInterval","3600s")
spark.conf.set("spark.cores.max", "4")
spark.conf.set("spark.sql.tungsten.enabled", "true")
spark.conf.set("spark.eventLog.enabled", "true")
spark.conf.set("spark.app.id", "Logs")
spark.conf.set("spark.io.compression.codec", "snappy")
spark.conf.set("spark.rdd.compress", "true")
spark.conf.set("spark.executor.instances", "6")
spark.conf.set("spark.executor.memory", '20g')
spark.conf.set("hive.exec.dynamic.partition", "true")
spark.conf.set("hive.exec.dynamic.partition.mode", "nonstrict")
spark.conf.set("spark.driver.allowMultipleContexts", "true")
spark.conf.set("spark.master", "yarn")
spark.conf.set("spark.driver.memory", "20G")
spark.conf.set("spark.executor.instances", "32")
spark.conf.set("spark.executor.memory", "32G")
spark.conf.set("spark.driver.maxResultSize", "40G")
spark.conf.set("spark.executor.cores", "5")
I have checked the container nodes and the error there is:
ExecutorLostFailure (executor 2 exited caused by one of the running tasks) Reason: Container marked as failed:container_e836_1556653519610_3661867_01_000005 on host: ylpd1205.kmdc.att.com. Exit status: 143. Diagnostics: Container killed on request. Exit code is 143
I am not able to figure out the issue.
Judging by the output, if your application is not finishing with a FAILED status, this sounds like a Livy timeout error: your application is likely taking longer than the defined timeout for a Livy session (which defaults to 1h), so even though the Spark app succeeds, your notebook will receive this error if it runs longer than the Livy session's timeout.
If that's the case, here's how to address it:
1. Edit the /etc/livy/conf/livy.conf file (on the cluster's master node).
2. Set livy.server.session.timeout to a higher value, like 8h (or larger, depending on your app).
3. Restart Livy to apply the setting: sudo restart livy-server on the cluster's master node.
4. Test your code again.
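For reference, the relevant line in livy.conf would look something like this (8h is just an example value):

# /etc/livy/conf/livy.conf
livy.server.session.timeout = 8h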
I am not well versed in PySpark, but in Scala the solution would involve something like the following.
First we need to create a method for creating a header file:
import java.io.PrintWriter
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, FileUtil, Path}

def createHeaderFile(headerFilePath: String, colNames: Array[String]): Unit = {
  // format the header file path
  val fileName = "dfheader.csv"
  val headerFileFullName = "%s/%s".format(headerFilePath, fileName)
  // write the header line to hdfs
  val hadoopConfig = new Configuration()
  val fileSystem = FileSystem.get(hadoopConfig)
  val output = fileSystem.create(new Path(headerFileFullName))
  val writer = new PrintWriter(output)
  writer.write(colNames.mkString(",")) // avoids the trailing comma of writing column by column
  writer.write("\n")
  writer.close()
}
You will also need a method that calls Hadoop to merge the part files written by the df.write method; it also takes the temp output folder so it can clean it up after the merge:
def mergeOutputFiles(sourcePaths: String, destLocation: String, tempOutputFolder: String): Unit = {
  val hadoopConfig = new Configuration()
  val hdfs = FileSystem.get(hadoopConfig)
  // in case of Array[String], iterate over the multiple source paths with a for loop:
  // for (sourcePath <- sourcePaths) {
  // get the path under destination where the partitioned files are temporarily stored
  val pathText = sourcePaths.split("/")
  val destPath = "%s/%s".format(destLocation, pathText.last)
  // merge the part files into one
  FileUtil.copyMerge(hdfs, new Path(sourcePaths), hdfs, new Path(destPath), true, hadoopConfig, null)
  // }
  // delete the temp partitioned files once the merge is complete
  val tempfilesPath = "%s%s".format(destLocation, tempOutputFolder)
  hdfs.delete(new Path(tempfilesPath), true)
}
Here is a method for generating the output files, i.e. your df.write step, where you pass your huge DataFrame to be written out to HDFS:
import org.apache.spark.sql.{DataFrame, SparkSession}

def generateOutputFiles(processedDf: DataFrame, opPath: String, tempOutputFolder: String,
                        spark: SparkSession): String = {
  val fileName = "%s%sNameofyourCsvFile.csv".format(opPath, tempOutputFolder)
  // write as csv to the output directory, then create the matching header file
  processedDf.write.mode("overwrite").csv(fileName)
  createHeaderFile(fileName, processedDf.columns)
  // if the output needs to be split into multiple files based on some parameter,
  // change the return type to Array[String] and collect each fileName into it
  // with a counter instead of returning a single path
  fileName
}
With all the methods defined, here is how you can implement them:
def processyourlogic(/* your parameters, if any */): DataFrame = {
  // your logic to do whatever needs to be done to your data
  ???
}
Assuming the above method returns a dataframe, here is how you can put everything together:
val yourbigDf = processyourlogic(/* your parameters */) // returns a DataFrame
yourbigDf.cache // caching just in case you need it
val outputPathFinal = "location where you want your file to be saved"
val tempOutputFolderLocation = "temp/"
val partFiles = generateOutputFiles(yourbigDf, outputPathFinal, tempOutputFolderLocation, spark)
mergeOutputFiles(partFiles, outputPathFinal, tempOutputFolderLocation)
Let me know if you have any other questions relating to this. If the answer you seek is different, then it should be asked as a new question.
I am trying to load a test file using Spark and Java. The code works fine in client mode (on my local machine) but gives a FileNotFoundException in cluster mode (i.e. on the server).
SparkSession spark = SparkSession
        .builder()
        .config("spark.mesos.coarse", "true")
        .config("spark.scheduler.mode", "FAIR")
        .appName("1")
        .master("local")
        .getOrCreate();

spark.sparkContext().addFile("https://mywebsiteurl/TestFile.csv");
String[] fileServerUrlArray = fileServerUrl.split("/");
fileName = fileServerUrlArray[fileServerUrlArray.length - 1];
String file = SparkFiles.get(fileName);
String modifiedFile = "file://" + file;

spark.read()
        .option("header", "true")
        .load(modifiedFile); // getting FileNotFoundException on this line
I am getting a FileNotFoundException.
While running your job in cluster mode, Spark will never write to the local area of the driver. The best option would be to collect() or use toLocalIterator() if you can read the file into a buffer. Please try the code below and let me know whether it works for you:
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs._

val conf = new Configuration()
val hdfspath = new Path("hdfs:///user/home/testFile.dat")
val localpath = new Path("file:///user/home/test/")
// resolve the filesystem from the source path itself
val fs = hdfspath.getFileSystem(conf)
fs.copyToLocalFile(hdfspath, localpath)
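Once copied, the driver-local file can be read back with a file:// URI; a hypothetical follow-up (the path mirrors localpath above):

val df = spark.read
  .option("header", "true")
  .csv("file:///user/home/test/testFile.dat")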