How to get the Hadoop path with Java/Scala API in Code Repositories - apache-spark

I need to read other formats (JSON, binary, XML) and infer the schema dynamically within a transform in Code Repositories, using the Spark DataSource API.
Example:
val df = spark.read.json(<hadoop_path>)
For that, I need an accessor to the Foundry file system path, which is something like:
foundry://...#url:port/datasets/ri.foundry.main.dataset.../views/ri.foundry.main.transaction.../startTransactionRid/ri.foundry.main.transaction...
This is possible with PySpark API (Python):
filesystem = input_transform.filesystem()
hadoop_path = filesystem.hadoop_path
However, for Java/Scala I didn’t find a way to do it properly.

A getter for the Hadoop path has recently been added to the Foundry Java API. After upgrading the Java transform version (transformsJavaVersion >= 1.188.0), you can get it:
val hadoopPath = myInput.asFiles().getFileSystem().hadoopPath()
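With that accessor available, the original goal can be sketched roughly as follows (a minimal sketch inside a Scala transform; the wildcard suffix and reading via spark.read are my assumptions, not something the Foundry API above prescribes):
val hadoopPath = myInput.asFiles().getFileSystem().hadoopPath()
// Read the raw JSON files backing the input dataset and let Spark infer the schema
val df = spark.read.json(s"$hadoopPath/*")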

Related

Databricks FileInfo: java.lang.ClassCastException: com.databricks.backend.daemon.dbutils.FileInfo cannot be cast to com.databricks.service.FileInfo

I'm getting a ClassCastException when trying to traverse the directories in a mounted Databricks volume.
java.lang.ClassCastException: com.databricks.backend.daemon.dbutils.FileInfo cannot be cast to com.databricks.service.FileInfo
at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238)
at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:38)
at scala.collection.TraversableLike.map(TraversableLike.scala:238)
at scala.collection.TraversableLike.map$(TraversableLike.scala:231)
at scala.collection.AbstractTraversable.map(Traversable.scala:108)
at com.mycompany.functions.UnifiedFrameworkFunctions$.getAllFiles(UnifiedFrameworkFunctions.scala:287)
where the getAllFiles function looks like:
import com.databricks.service.{DBUtils, FileInfo}
...
def getAllFiles(path: String): Seq[String] = {
  val files = DBUtils.fs.ls(path)
  if (files.isEmpty)
    List()
  else
    files.map(file => { // line where exception is raised
      val path: String = file.path
      if (DBUtils.fs.dbfs.getFileStatus(new org.apache.hadoop.fs.Path(path)).isDirectory) getAllFiles(path)
      else List(path)
    }).reduce(_ ++ _)
}
Locally it runs fine with Databricks Connect, but when the source code is packaged as a jar and run on a Databricks cluster, the above exception is raised.
Databricks' documentation suggests using com.databricks.service.DBUtils, and calling DBUtils.fs.ls(path) returns FileInfo from that same service package - so is this a bug, or should the API be used in some other way?
I'm using Databricks Connect and Runtime version 8.1.
I have tried a workaround to get file names from a folder, performing the following steps to get the list of file names from a mounted directory.
I have stored 3 files at the "/mnt/Sales/" location.
Step 1: Use the display(dbutils.fs.ls("/mnt/Sales/")) command.
Step 2: Assign the file location to a variable.
Step 3: Load the variable into a DataFrame and get the names of the files.
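A rough Scala equivalent of these steps (only a sketch; it assumes a Databricks notebook context where spark, dbutils and display are available, and that the files live under /mnt/Sales/):
// Step 1: list the mounted directory
display(dbutils.fs.ls("/mnt/Sales/"))
// Step 2: assign the file location to a variable
val salesDir = "/mnt/Sales/"
// Step 3: load the listing into a DataFrame and keep only the file names
import spark.implicits._
val fileNames = dbutils.fs.ls(salesDir).toDF().select("name").as[String].collect()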
You could convert the directories (of type Seq[com.databricks.service.FileInfo]) to a string, split the string, and use pattern matching to extract the file names as you traverse the new Array[String]. Using Scala:
val files = dbutils.fs.ls(path).mkString(";").split(";")
val pattern = "dbfs.*/(Sales_Record[0-9]+.csv)/.*".r
files.map(file => { val pattern(res) = file; res })
Or you could try
val pattern = "dbfs.*/(.*.csv)/.*".r
to get all file names ending in .csv. The pattern can be constructed to suit your needs.
To use dbutils on your local machine (with Databricks Connect), you need to import it. However, dbutils is already available on the Databricks cluster, so you should not import it there. Thus, when your source code (with DBUtils.fs.ls(path)) is packaged as a jar and run on a Databricks cluster, you get this error. A minimal example on my Databricks cluster confirmed this behavior.
So Scala users should just use dbutils.* for code that runs on the Databricks cluster.
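In other words (a hedged sketch of the two contexts; /mnt/Sales/ is just an example path):
// On the Databricks cluster (notebook or packaged jar): use the ambient dbutils, no import needed
val clusterFiles = dbutils.fs.ls("/mnt/Sales/")

// Locally with Databricks Connect: import DBUtils explicitly
import com.databricks.service.DBUtils
val localFiles = DBUtils.fs.ls("/mnt/Sales/")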
For PySpark users, the Databricks documentation says the following here:
When using Databricks Runtime 7.3 LTS or above, to access the DBUtils module in a way that works both locally and in Databricks clusters, use the following get_dbutils():
The document goes on to show the code for this purpose, so consult the Python code tab in the link for more details.

Spark "modifiedBefore" option while reading data from files

I am using Spark 2.4 to read files from Hadoop.
The requirement is to read the files whose modification time is before some provided value.
I came across the Spark documentation that mentions the option modifiedBefore (see the Spark doc on Modification Time Path Filters), but I am not sure whether it's available in Spark 2.4; if not, how can I achieve this?
The options modifiedBefore and modifiedAfter are available since Spark 3.0 and can be used only for batch reads, not streaming. For Spark 2.4, you can use the Hadoop FileSystem globStatus method and filter files using getModificationTime.
Here is an example of a function that takes a path and a threshold and returns the list of file paths filtered using the threshold:
import org.apache.hadoop.fs.Path
def getFilesModifiedBefore(path: Path, modifiedBefore: String) = {
  val format = new java.text.SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss")
  val thresholdTime = format.parse(modifiedBefore).getTime()
  val files = path.getFileSystem(sc.hadoopConfiguration).globStatus(path)
  files.filter(_.getModificationTime < thresholdTime).map(_.getPath.toString)
}
Then use it with spark.read.csv:
val df = spark.read.csv(getFilesModifiedBefore(new Path("/mypath"), "2021-03-17T10:46:12"):_*)
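For comparison, on Spark 3.0+ the built-in option covers this directly for batch reads (a minimal sketch; the path and timestamp are just examples):
val df = spark.read
  .option("modifiedBefore", "2021-03-17T10:46:12")
  .csv("/mypath")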

Are Apache Spark 2.0 parquet files incompatible with Apache Arrow?

The problem
I have written an Apache Spark DataFrame as a parquet file for a deep learning application in a Python environment; I am currently running into issues implementing basic examples of both the petastorm (following this notebook) and horovod frameworks, specifically when reading the aforementioned file. The DataFrame has the following type: DataFrame[features: array<float>, next: int, weight: int] (much like in Databricks' notebook, I had features as a VectorUDT, which I converted to an array).
In both cases, Apache Arrow throws an ArrowIOError: Invalid parquet file. Corrupt footer. error.
What I have found so far
I discovered in this question and in this PR that as of version 2.0, Spark doesn't write _metadata or _common_metadata files unless spark.hadoop.parquet.enable.summary-metadata is set to true in Spark's configuration; those files are indeed missing.
I thus tried rewriting my DataFrame with this setting, but there is still no _common_metadata file. What does work is to explicitly pass a schema to petastorm when constructing a reader (passing schema_fields to make_batch_reader, for instance), which is a problem with horovod as there is no such parameter in horovod.spark.keras.KerasEstimator's constructor.
How could I either make Spark output those files, or get Arrow to infer the schema, just as Spark seems to be doing?
Minimal example with horovod
# Saving df
print(spark.conf.get('spark.hadoop.parquet.enable.summary-metadata'))  # outputs 'true'
df.repartition(10).write.mode('overwrite').parquet(path)
# ...
# Training
import horovod.spark.keras as hvd
from horovod.spark.common.store import Store

model = build_model()
opti = Adadelta(learning_rate=0.015)
loss = 'sparse_categorical_crossentropy'

store = Store().create(prefix_path=prefix_path,
                       train_path=train_path,
                       val_path=val_path)
keras_estimator = hvd.KerasEstimator(
    num_proc=16,
    store=store,
    model=model,
    optimizer=opti,
    loss=loss,
    feature_cols=['features'],
    label_cols=['next'],
    batch_size=auto_steps_per_epoch,
    epochs=auto_nb_epochs,
    sample_weight_col='weight'
)
keras_model = keras_estimator.fit_on_parquet()  # Fails here with ArrowIOError
The problem is solved in pyarrow 0.14+ (issues.apache.org/jira/browse/ARROW-4723); be sure to install the updated version with pip (up until Databricks Runtime 6.5, the included version is 0.13).
Thanks to #joris' comment for pointing this out.

How to specify Hadoop Configuration when reading CSV

I am using Spark 2.0.2. How can I specify the Hadoop configuration item textinputformat.record.delimiter for the TextInputFormat class when reading a CSV file into a Dataset?
In Java I can write: spark.read().csv(<path>); however, there doesn't seem to be a way to provide a Hadoop configuration specific to the read.
It is possible to set the item using spark.sparkContext().hadoopConfiguration(), but that is global.
Thanks,
You cannot. The Data Source API uses its own configuration which, as of 2.0, is not even compatible with Hadoop configuration.
If you want to use a custom input format or other Hadoop configuration, use SparkContext.hadoopFile, SparkContext.newAPIHadoopRDD, or related methods.
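For example, here is a hedged Scala sketch of the newAPIHadoopFile route for the original textinputformat.record.delimiter question (the delimiter value and path are placeholders):
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

// Per-read Hadoop configuration, leaving the global one untouched
val conf = new Configuration(spark.sparkContext.hadoopConfiguration)
conf.set("textinputformat.record.delimiter", "\u0001")

val lines = spark.sparkContext.newAPIHadoopFile(
  "/path/to/file.csv",
  classOf[TextInputFormat],
  classOf[LongWritable],
  classOf[Text],
  conf
).map { case (_, text) => text.toString }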
The delimiter can be set using option() in Spark 2.0:
val df = spark.read.option("header", "true").option("delimiter", "\t").csv("/hdfs/file/location")

Submit Spark with additional input

I have used Spark to build a machine learning pipeline that takes a job XML file as input, where users can specify data, features, models, and their parameters. The reason for using this job XML input file is that users can simply modify their XML file to configure the pipeline without recompiling the source code. However, the Spark job is typically packaged into an uber-jar file, and there seems to be no way to provide additional XML inputs when the job is submitted to YARN.
I wonder if there are any solutions or alternatives?
I'd look into Spark-JobServer. You can use it to submit your job to a Spark cluster together with a configuration. You might have to adapt your XML to the JSON format used by the config, or maybe encapsulate it somehow.
Here's an example of how to submit a job + config:
curl -d "input.string = a b c a b see" 'localhost:8090/jobs?appName=test&classPath=spark.jobserver.WordCountExample'
{
  "status": "STARTED",
  "result": {
    "jobId": "5453779a-f004-45fc-a11d-a39dae0f9bf4",
    "context": "b7ea0eb5-spark.jobserver.WordCountExample"
  }
}
You should use the resources directory to place the XML file if you want it to be bundled with the jar. This is a basic Java/Scala thing.
Suggested reading: Get a resource using getResource()
To replace the XML in the jar without rebuilding the jar: How do I update one file in a jar without repackaging the whole jar?
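For instance, a hedged sketch of loading an XML file bundled in the jar's resources (the file name job-config.xml is hypothetical):
import scala.xml.XML
// Resolves /job-config.xml from the classpath (e.g. placed under src/main/resources)
val jobXmlUrl = getClass.getResource("/job-config.xml")
val jobXml = XML.load(jobXmlUrl)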
The final solution that I used to solve this problem is:
1. Store the XML file in HDFS,
2. Pass in the file location of the XML file,
3. Use inputStreamHDFS to read directly from HDFS:
val hadoopConf = sc.hadoopConfiguration
val jobfileIn: Option[InputStream] = inputStreamHDFS(hadoopConf, filename)
if (jobfileIn.isDefined) {
  logger.info("Job file found in file system: " + filename)
  xml = Some(XML.load(jobfileIn.get))
}
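For completeness, a hedged sketch of what the inputStreamHDFS helper used above might look like (the answer does not show its actual implementation):
import java.io.InputStream
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

def inputStreamHDFS(conf: Configuration, filename: String): Option[InputStream] = {
  val path = new Path(filename)
  val fs: FileSystem = path.getFileSystem(conf)  // resolve the FileSystem behind the path (HDFS here)
  if (fs.exists(path)) Some(fs.open(path)) else None
}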
