Unable to serialize an Apache Spark transformer in MLeap - apache-spark

I use Spark 2.1.0 and Scala 2.11.8.
I am trying to build a twitter sentiment analysis model in apache spark and service it using MLeap.
When I am running the model without using mleap, things work smoothly.
Problem happens only when I try to save the model in mleap's serialization format so I can serve the model later using mleap.
Here is the code that throws the error -
val modelSavePath = "/tmp/sampleapp/model-mleap/"
val pipelineConfig = json.get("PipelineConfig").get.asInstanceOf[Map[String, Any]]
val loaderConfig = json.get("LoaderConfig").get.asInstanceOf[Map[String, Any]]
val loaderPath = loaderConfig
  .get("DataLocation")
  .get
  .asInstanceOf[String]
var data = sqlContext.read.format("com.databricks.spark.csv").
  option("header", "true").
  option("delimiter", "\t").
  option("inferSchema", "true").
  load(loaderPath)
val pipeline = Pipeline(pipelineConfig)
val model = pipeline.fit(data)
val mleapPipeline: Transformer = model
I get java.util.NoSuchElementException: key not found: org.apache.spark.ml.feature.Tokenizer in the last line.
When I did a quick search, I found that MLeap does not support all transformers, but I was not able to find an exhaustive list.
How do I find out whether the transformers I am using are actually unsupported, or whether there is some other error?

I am one of the creators of MLeap, and we do support Tokenizer! I am curious, which version of MLeap are you trying to use? I think you may be looking at an outdated codebase from TrueCar; check out our new codebase here:
https://github.com/combust/mleap
We also have fairly complete documentation here, including a full list of supported transformers:
Documentation: http://mleap-docs.combust.ml/
Transformer List: http://mleap-docs.combust.ml/core-concepts/transformers/support.html
I hope this helps, and if things still aren't working, file an issue on GitHub and we can help you debug it from there.
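For reference, with the combust codebase, saving a fitted PipelineModel as an MLeap bundle looks roughly like the sketch below. This is a minimal sketch based on the MLeap documentation, not code from the question: it reuses the question's model and data values, the bundle path is illustrative, and the exact import paths can differ between MLeap versions.
import ml.combust.bundle.BundleFile
import ml.combust.mleap.spark.SparkSupport._
import org.apache.spark.ml.bundle.SparkBundleContext
import resource._
// Capture the schema of the transformed dataset so MLeap records the input/output fields
val sbc = SparkBundleContext().withDataset(model.transform(data))
// Write the fitted pipeline to a zip bundle (the path here is illustrative)
for (bundle <- managed(BundleFile("jar:file:/tmp/sampleapp/model-mleap/model.zip"))) {
  model.writeBundle.save(bundle)(sbc).get
}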

Related

What changes are required when moving a simple synapsesql implementation from Spark 2.4.8 to Spark 3.1.2?

I have a simple implementation of the .write.synapsesql() method (code shown below) that works in Spark 2.4.8 but not in Spark 3.1.2 (documentation/example here). The data in use is a simple notebook-created foobar type table. Searching online for key phrases from and about the error did not turn up any new information for me.
What is the cause of the error in 3.1.2?
Spark 2.4.8 version (behaves as desired):
val df = spark.sql("SELECT * FROM TEST_TABLE")
df.write.synapsesql("my_local_db_name.schema_name.test_table", Constants.INTERNAL, None)
Spark 3.1.2 version (the extra callback argument is the same as in the documentation; it can also be left out, with a similar result):
val df = spark.sql("SELECT * FROM TEST_TABLE")
df.write.synapsesql("my_local_db_name.schema_name.test_table", Constants.INTERNAL, None,
  Some(callBackFunctionToReceivePostWriteMetrics))
The resulting error (only in 3.1.2) is:
WriteFailureCause -> java.lang.IllegalArgumentException: Failed to derive `https` scheme based staging location URL for SQL COPY-INTO}
As the documentation linked in the question states, ensure that you are setting the write options correctly with
val writeOptionsWithAADAuth: Map[String, String] = Map(Constants.SERVER -> "<dedicated-pool-sql-server-name>.sql.azuresynapse.net",
  Constants.TEMP_FOLDER -> "abfss://<storage_container_name>@<storage_account_name>.dfs.core.windows.net/<some_temp_folder>")
and including the options in your .write statement like so:
df.write.options(writeOptionsWithAADAuth).synapsesql(...)
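Put together, the Spark 3.1.2 write then looks like the sketch below. This simply combines the pieces above; it assumes Constants and the synapsesql implicit are already in scope, exactly as in the question's snippets, and uses the three-argument form since the callback can be left out.
val df = spark.sql("SELECT * FROM TEST_TABLE")
val writeOptionsWithAADAuth: Map[String, String] = Map(
  Constants.SERVER -> "<dedicated-pool-sql-server-name>.sql.azuresynapse.net",
  Constants.TEMP_FOLDER -> "abfss://<storage_container_name>@<storage_account_name>.dfs.core.windows.net/<some_temp_folder>")
// Per the answer above, setting the staging (TEMP_FOLDER) location is what addresses the
// "Failed to derive `https` scheme based staging location URL" error
df.write
  .options(writeOptionsWithAADAuth)
  .synapsesql("my_local_db_name.schema_name.test_table", Constants.INTERNAL, None)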

Are Apache Spark 2.0 parquet files incompatible with Apache Arrow?

The problem
I have written an Apache Spark DataFrame as a Parquet file for a deep learning application in a Python environment; I am currently experiencing issues with basic examples of both the petastorm (following this notebook) and horovod frameworks, specifically when reading the aforementioned file. The DataFrame has the following type: DataFrame[features: array<float>, next: int, weight: int] (much like in Databricks' notebook, I had features be a VectorUDT, which I converted to an array).
In both cases, Apache Arrow throws an ArrowIOError: Invalid parquet file. Corrupt footer. error.
What I found until now
I discovered in this question and in this PR that as of version 2.0, Spark doesn't write _metadata or _common_metadata files unless spark.hadoop.parquet.enable.summary-metadata is set to true in Spark's configuration; those files are indeed missing.
I thus tried rewriting my DataFrame with this configuration, but there is still no _common_metadata file. What does work is to explicitly pass a schema to petastorm when constructing a reader (passing schema_fields to make_batch_reader, for instance), which is a problem with horovod, as there is no such parameter in horovod.spark.keras.KerasEstimator's constructor.
How can I, if it is at all possible, either make Spark output those files, or make Arrow infer the schema, just as Spark seems to do?
Minimal example with horovod
# Saving df
print(spark.conf.get('spark.hadoop.parquet.enable.summary-metadata'))  # outputs 'true'
df.repartition(10).write.mode('overwrite').parquet(path)
# ...
# Training
import horovod.spark.keras as hvd
from horovod.spark.common.store import Store
model = build_model()
opti = Adadelta(learning_rate=0.015)
loss='sparse_categorical_crossentropy'
store = Store().create(prefix_path=prefix_path,
                       train_path=train_path,
                       val_path=val_path)
keras_estimator = hvd.KerasEstimator(
    num_proc=16,
    store=store,
    model=model,
    optimizer=opti,
    loss=loss,
    feature_cols=['features'],
    label_cols=['next'],
    batch_size=auto_steps_per_epoch,
    epochs=auto_nb_epochs,
    sample_weight_col='weight'
)
keras_model = keras_estimator.fit_on_parquet() # Fails here with ArrowIOError
The problem is solved in pyarrow 0.14+ (issues.apache.org/jira/browse/ARROW-4723); be sure to install the updated version with pip (up until Databricks Runtime 6.5, the included version is 0.13).
Thanks to @joris' comment for pointing this out.

Cannot create Spark Phoenix DataFrames

I am trying to load data from Apache Phoenix into a Spark DataFrame.
I have been able to successfully create an RDD with the following code:
val sc = new SparkContext("local", "phoenix-test")
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val foo: RDD[Map[String, AnyRef]] = sc.phoenixTableAsRDD(
  table = "FOO",
  columns = Seq("ID", "MESSAGE_EPOCH", "MESSAGE_VALUE"),
  zkUrl = Some("<zk-ip-address>:2181:/hbase-unsecure"))
foo.collect().foreach(x => println(x))
However I have not been so lucky trying to create a DataFrame. My current attempt is:
val sc = new SparkContext("local", "phoenix-test")
val sqlContext = new SQLContext(sc)
val df = sqlContext.phoenixTableAsDataFrame(
  table = "FOO",
  columns = Seq("ID", "MESSAGE_EPOCH", "MESSAGE_VALUE"),
  zkUrl = Some("<zk-ip-address>:2181:/hbase-unsecure"))
df.select(df("ID")).show
Unfortunately the above code results in a ClassCastException:
java.lang.ClassCastException: org.apache.spark.sql.catalyst.expressions.GenericMutableRow cannot be cast to org.apache.spark.sql.Row
I am still very new to Spark. If anyone can help it would be very much appreciated!
Although you haven't mentioned your Spark version or the details of the exception...
Please see PHOENIX-2287, which is fixed and says:
Environment: HBase 1.1.1 running in standalone mode on OS X, Spark 1.5.0, Phoenix 4.5.2
Josh Mahonin added a comment - 23/Sep/15 17:56
Updated patch adds support for Spark 1.5.0, and is backwards compatible back down to 1.3.0 (manually tested, Spark version profiles may be worth looking at in the future). In 1.5.0, they've gone and explicitly hidden the GenericMutableRow data structure. Fortunately, we are able to use the external-facing 'Row' data type, which is backwards compatible, and should remain compatible in future releases as well. As part of the update, Spark SQL deprecated a constructor on their 'DecimalType'. In updating this, I exposed a new issue, which is that we don't carry-forward the precision and scale of the underlying Decimal type through to Spark. For now I've set it to use the Spark defaults, but I'll create another issue for that specifically. I've included an ignored integration test in this patch as well.
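In short, upgrading to a phoenix-spark build that includes PHOENIX-2287 should make the DataFrame path work. For completeness, here is a hedged sketch of the equivalent load through Spark's generic data source API, with the table name and zkUrl copied from the question; option names may vary slightly between Phoenix releases, so check the phoenix-spark documentation for your version.
import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext
val sc = new SparkContext("local", "phoenix-test")
val sqlContext = new SQLContext(sc)
// Load the Phoenix table FOO as a DataFrame via the org.apache.phoenix.spark data source
val df = sqlContext.read
  .format("org.apache.phoenix.spark")
  .options(Map(
    "table" -> "FOO",
    "zkUrl" -> "<zk-ip-address>:2181:/hbase-unsecure"))
  .load()
df.select(df("ID")).show()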

How to save a Spark LogisticRegressionModel model?

I am using MLlib 1.1.0 and struggling to find a way to save my model. The docs do not seem to support such a feature in this version. Any ideas?
There is a save model option like:
// Save and load model
model.save(sc, "myModelPath")
val sameModel = LogisticRegressionModel.load(sc, "myModelPath")
But I see it starting from v1.3; I'm not sure whether it will still be valid for 1.1.
You can try this, and upgrade if it does not work.
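If upgrading is not an option, one possible workaround on 1.1 is to persist the coefficients yourself and rebuild the model when loading, since the weights and intercept fully determine a LogisticRegressionModel. This is a rough, untested sketch; it assumes the (weights, intercept) constructor is accessible in your MLlib version.
import org.apache.spark.mllib.classification.LogisticRegressionModel
import org.apache.spark.mllib.linalg.Vector
// Save: persist the (weights, intercept) pair as an object file
sc.parallelize(Seq((model.weights, model.intercept)), 1)
  .saveAsObjectFile("myModelPath")
// Load: read the pair back and reconstruct the model
// (assumes the two-argument constructor is public in your MLlib version)
val (weights, intercept) = sc.objectFile[(Vector, Double)]("myModelPath").first()
val sameModel = new LogisticRegressionModel(weights, intercept)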

Spark workflow with jar

I'm trying to understand the extent to which one must compile a jar to use Spark.
I'd normally write ad-hoc analysis code in an IDE, then run it locally against data with a single click (in the IDE). If my experiments with Spark are giving me the right indication, then I have to compile my script into a jar and send it to all the Spark nodes. I.e. my workflow would be:
Write the analysis script, which will upload a jar of itself (created below).
Go make the jar.
Run the script.
For ad-hoc iterative work this seems a bit much, and I don't understand how the REPL gets away without it.
Update:
Here's an example, which I couldn't get to work unless I compiled it into a jar and did sc.addJar. But the fact that I must do this seems odd, since there is only plain Scala and Spark code.
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
import org.apache.spark.SparkFiles
import org.apache.spark.rdd.RDD
object Runner {
  def main(args: Array[String]) {
    val logFile = "myData.txt"
    val conf = new SparkConf()
      .setAppName("MyFirstSpark")
      .setMaster("spark://Spark-Master:7077")
    val sc = new SparkContext(conf)
    sc.addJar("Analysis.jar")
    sc.addFile(logFile)
    val logData = sc.textFile(SparkFiles.get(logFile), 2).cache()
    Analysis.run(logData)
  }
}
object Analysis {
  def run(logData: RDD[String]) {
    val numA = logData.filter(line => line.contains("a")).count()
    val numB = logData.filter(line => line.contains("b")).count()
    println("Lines with 'a': %s, Lines with 'b': %s".format(numA, numB))
  }
}
You are creating an anonymous function in your use of 'filter':
scala> (line: String) => line.contains("a")
res0: String => Boolean = <function1>
The class generated for that function is not available on the workers unless the jar is distributed to them. Did the stack trace on the worker highlight a missing symbol?
If you just want to debug locally without having to distribute the jar you could use the 'local' master:
val conf = new SparkConf().setAppName("myApp").setMaster("local")
While creating JARs is the most common way of handling long-running Spark jobs, for interactive development work Spark has shells available directly in Scala, Python & R. The current quick start guide ( https://spark.apache.org/docs/latest/quick-start.html ) only mentions the Scala & Python shells, but the SparkR guide discusses how to work with SparkR interactively as well (see https://spark.apache.org/docs/latest/sparkr.html ). Best of luck with your journeys into Spark as you find yourself working with larger datasets :)
You can use SparkContext.jarOfObject(Analysis) to automatically include the jar that you want to distribute without packaging it yourself.
Find the JAR from which a given class was loaded, to make it easy for
users to pass their JARs to SparkContext.
def jarOfClass(cls: Class[_]): Option[String]
def jarOfObject(obj: AnyRef): Option[String]
You want to do something like:
sc.addJar(SparkContext.jarOfObject(Analysis).get)
HTH!
