For some days now I have been trying to run ML algorithms inside a map function in Spark. I posted a more specific question earlier, but referencing Spark's ML algorithms gives me the following error:
AttributeError: Cannot load _jvm from SparkContext. Is SparkContext initialized?
Obviously I cannot reference SparkContext inside the apply_classifier function.
My code is similar to what was suggested in the previous question I asked, but I still haven't found a solution to what I am looking for:
from pyspark.ml.classification import DecisionTreeClassifier, RandomForestClassifier

def apply_classifier(clf):
    if clf == 0:
        clf = DecisionTreeClassifier(labelCol="indexedLabel", featuresCol="indexedFeatures", maxDepth=3)
    elif clf == 1:
        clf = RandomForestClassifier(labelCol="indexedLabel", featuresCol="indexedFeatures", numTrees=5)

classifiers = [0, 1]
sc.parallelize(classifiers).map(lambda x: apply_classifier(x)).collect()
I have tried using flatMap instead of map, but then I get 'NoneType' object is not iterable.
I would also like to pass a broadcast dataset (which is a DataFrame) as a parameter to the apply_classifier function.
Finally, is it possible to do what I am trying to do? What are the alternatives?
is it possible to do what I am trying to do?
It is not. Apache Spark doesn't support any form of nesting, and distributed operations can be initialized only by the driver. This includes access to distributed data structures, like the Spark DataFrame.
What are the alternatives?
This depends on many factors like the size of the data, amount of available resources, and choice of algorithms. In general you have three options:
Use Spark only as a task management tool to train local, non-distributed models. It looks like you have already explored this path to some extent. For a more advanced implementation of this approach, check spark-sklearn.
In general, this approach is particularly useful when the data is relatively small. Its advantage is that there is no competition between multiple jobs.
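A minimal sketch of this option, assuming scikit-learn is available on the workers and a small local dataset; the names X_local, y_local, train_local, and the parameter grid are all illustrative, not from the question:

from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

def train_local(spec, X, y):
    # train a plain, non-distributed scikit-learn model inside a Spark task
    name, kwargs = spec
    model = (DecisionTreeClassifier(**kwargs) if name == "tree"
             else RandomForestClassifier(**kwargs))
    return name, model.fit(X, y)

param_grid = [("tree", {"max_depth": 3}), ("forest", {"n_estimators": 5})]
data = sc.broadcast((X_local, y_local))  # plain local arrays, not a DataFrame

models = (sc.parallelize(param_grid, numSlices=len(param_grid))
            .map(lambda spec: train_local(spec, *data.value))
            .collect())

Note that what gets broadcast here is plain local data, not a DataFrame; as stated above, distributed data structures cannot be accessed inside tasks.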
Use standard multithreading tools to submit multiple independent jobs from a single context. You can use, for example, threading or joblib.
While this approach is possible, I wouldn't recommend it in practice. Not all Spark components are thread-safe, and you have to be pretty careful to avoid unexpected behavior. It also gives you very little control over resource allocation.
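A hedged sketch of this option, submitting two independent Spark ML jobs concurrently from one context; train_df is an assumed, already-prepared DataFrame with the indexed columns:

from multiprocessing.pool import ThreadPool
from pyspark.ml.classification import DecisionTreeClassifier, RandomForestClassifier

classifiers = [
    DecisionTreeClassifier(labelCol="indexedLabel", featuresCol="indexedFeatures", maxDepth=3),
    RandomForestClassifier(labelCol="indexedLabel", featuresCol="indexedFeatures", numTrees=5),
]

# each fit() is an independent Spark job; the threads only coordinate submission
pool = ThreadPool(len(classifiers))
models = pool.map(lambda clf: clf.fit(train_df), classifiers)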
Parametrize your Spark application and use an external pipeline manager (Apache Airflow, Luigi, Toil) to submit your jobs.
While this approach has some drawbacks (it requires saving data to persistent storage), it is also the most universal and robust, and it gives you a lot of control over resource allocation.
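An illustrative Airflow sketch of this option (Airflow 2.x style; the DAG id, script name, arguments, and paths are all assumptions):

from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG("train_models", start_date=datetime(2024, 1, 1), schedule_interval=None) as dag:
    for clf in ("tree", "forest"):
        # each task is an independent, parametrized spark-submit job
        BashOperator(
            task_id=f"train_{clf}",
            bash_command=f"spark-submit train.py --classifier {clf} --input /data/train.parquet",
        )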
I know the advantages of Datasets (type safety etc.), but I can't find any documentation on Spark Dataset limitations.
Are there any specific scenarios where a Spark Dataset is not recommended and it is better to use a DataFrame?
Currently all our data engineering flows use Spark (Scala) DataFrames.
We would like to use Datasets for all our new flows, so knowing all the limitations/disadvantages of Datasets would help us.
EDIT: This is not similar to Spark 2.0 Dataset vs DataFrame, which explains some operations on DataFrames/Datasets, or to other questions, most of which explain the differences between RDDs, DataFrames, and Datasets and how they evolved. This question is targeted at knowing when NOT to use Datasets.
There are a few scenarios where I find that a DataFrame (or Dataset[Row]) is more useful than a typed Dataset.
For example, when I'm consuming data without a fixed schema, like JSON files containing records of different types with different fields. Using a DataFrame I can easily select the fields I need without needing to know the whole schema, or even use a runtime configuration to specify the fields I'll access.
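A small sketch of that point (PySpark shown for brevity; the Scala DataFrame API is analogous, and the field names and input path are assumptions):

# select a runtime-configured subset of fields from semi-structured JSON
# without declaring the full schema up front
fields = ["user.id", "event.type"]  # could come from configuration
df = spark.read.json("/data/events.json")
df.select(*fields).show()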
Another consideration is that Spark can better optimize the built-in Spark SQL operations and aggregations than UDAFs and custom lambdas. So if you want to get the square root of a value in a column, that's a built-in function (df.withColumn("rootX", sqrt("X"))) in Spark SQL but doing it in a lambda (ds.map(X => Math.sqrt(X))) would be less efficient since Spark can't optimize your lambda function as effectively.
There are also many untyped DataFrame functions (like statistical functions) that are implemented for DataFrames but not for typed Datasets, and you'll often find that even if you start out with a Dataset, by the time you've finished your aggregations you're left with a DataFrame, because the functions work by creating new columns and modifying the schema of your dataset.
In general I don't think you should migrate from working DataFrame code to typed Datasets unless you have a good reason to. Many of the Dataset features are still flagged as "experimental" as of Spark 2.4.0, and as mentioned above not all DataFrame features have Dataset equivalents.
Limitations of Spark Datasets:
Datasets used to be less performant (not sure if that's been fixed yet)
You need to define a new case class whenever you change the Dataset schema, which is cumbersome
Datasets don't offer as much type safety as you might expect. We can pass the reverse function a date object and it'll return a garbage response rather than erroring out.
import java.sql.Date
import org.apache.spark.sql.functions.reverse
import spark.implicits._

case class Birth(hospitalName: String, birthDate: Date)

val birthsDS = Seq(
  Birth("westchester", Date.valueOf("2014-01-15"))
).toDS()

birthsDS.withColumn("meaningless", reverse($"birthDate")).show()
+------------+----------+-----------+
|hospitalName| birthDate|meaningless|
+------------+----------+-----------+
| westchester|2014-01-15| 51-10-4102|
+------------+----------+-----------+
I am working on a project where configurable pipelines and lineage tracking of alterations to Spark DataFrames are both essential. The endpoints of this pipeline are usually just modified DataFrames (think of it as an ETL task). What made the most sense to me was to leverage the already existing Spark ML Pipeline API to track these alterations. In particular, the alterations (adding columns based on others, etc.) are implemented as custom Spark ML Transformers.
However, we are now having an internal debate about whether or not this is the most idiomatic way of implementing this pipeline. The other option would be to implement these transformations as a series of UDFs and to build our own lineage tracking based on a DataFrame's schema history (or Spark's internal DataFrame lineage tracking). The argument for this side is that Spark's ML Pipelines are not intended for just ETL jobs and should always be implemented with the goal of producing a column that can be fed to a Spark ML Evaluator. The argument against this side is that it requires a lot of work that mirrors already existing functionality.
Is there any problem with leveraging Spark's ML Pipelines strictly for ETL tasks? Tasks that only make use of Transformers and don't include Evaluators?
To me it seems like a great idea, especially if you can compose the different generated Pipelines into new ones, since a Pipeline can itself be made of other Pipelines: Pipeline extends PipelineStage up the tree (source: https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.ml.Pipeline).
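A minimal sketch of such composition (PySpark shown; the Scala API is analogous, and the SQL statements and column names are illustrative):

from pyspark.ml import Pipeline
from pyspark.ml.feature import SQLTransformer

cleaning = Pipeline(stages=[
    SQLTransformer(statement="SELECT *, upper(name) AS name_norm FROM __THIS__"),
])
enrichment = Pipeline(stages=[
    SQLTransformer(statement="SELECT *, length(name_norm) AS name_len FROM __THIS__"),
])

etl = Pipeline(stages=[cleaning, enrichment])    # a pipeline of pipelines
result = etl.fit(input_df).transform(input_df)   # input_df assumed to exist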
But keep in mind that you will probably be doing the same thing under the hood, as explained here (https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/spark-mllib/spark-mllib-transformers.html):
Internally, transform method uses Spark SQL’s udf to define a function (based on createTransformFunc function described above) that will create the new output column (with appropriate outputDataType). The UDF is later applied to the input column of the input DataFrame and the result becomes the output column (using DataFrame.withColumn method).
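For reference, a bare-bones custom Transformer of the kind discussed, which under the hood is just a column expression applied via withColumn (a PySpark sketch; the class and column names are illustrative):

from pyspark.ml import Transformer
from pyspark.sql import functions as F

class UpperCaseName(Transformer):
    # ETL-only stage: adds a name_upper column derived from name
    def _transform(self, dataset):
        return dataset.withColumn("name_upper", F.upper(F.col("name")))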
If you have decided on another approach or found a better way, please comment. It's nice to share knowledge about Spark.
Is it possible to save a Spark ML pipeline to a database (Cassandra for example)? From the documentation I can only see the save to path option:
myMLWritable.save(toPath);
Is there a way to somehow wrap or change the myMLWritable.write() MLWriter instance and redirect the output to the database?
It is not possible (or at least not supported) at this moment. The ML writer is not extendable and depends on Parquet files and a directory structure to represent models.
Technically speaking, you can extract the individual components and use internal, private APIs to recreate the models from scratch, but that is likely the only option.
Spark 2.0.0+
At first glance, all Transformers and Estimators implement MLWritable. If you use Spark <= 1.6.0 and experience issues with model saving, I would suggest switching versions.
Spark >= 1.6
Since Spark 1.6 it's possible to save your models using the save method, because almost every model implements the MLWritable interface. For example, LogisticRegressionModel has it, so it's possible to save your model to the desired path using it.
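A hedged sketch of saving and reloading a fitted model (PySpark shown, where ML persistence arrived with Spark 2.0; the Scala save method is analogous, and train_df and the path are assumptions):

from pyspark.ml.classification import LogisticRegression, LogisticRegressionModel

model = LogisticRegression(maxIter=10).fit(train_df)
model.save("/models/lr")  # persisted as Parquet files plus metadata
reloaded = LogisticRegressionModel.load("/models/lr")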
Spark < 1.6
Some operations on DataFrames can be optimized, and that translates to improved performance compared to plain RDDs. DataFrames provide efficient caching, and the SQL-ish API is arguably easier to comprehend than the RDD API.
ML Pipelines are extremely useful, and tools like the cross-validator or the different evaluators are simply must-haves in any machine learning pipeline; even if none of the above is particularly hard to implement on top of the low-level MLlib API, it is much better to have a ready-to-use, universal, and relatively well-tested solution.
I believe that, at the end of the day, what you get by using ML over MLlib is a quite elegant, high-level API. One thing you can do is combine both to create a custom multi-step pipeline (sketched after the list):
use ML to load, clean and transform data,
extract the required data (see for example the extractLabeledPoints method) and pass it to an MLlib algorithm,
add custom cross-validation / evaluation
save the MLlib model using a method of your choice (Spark model or PMML)
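A hedged sketch of that combination (raw_df, the column names, and the paths are illustrative assumptions):

from pyspark.ml.feature import VectorAssembler
from pyspark.mllib.linalg import Vectors as MLlibVectors
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.tree import DecisionTree

# 1. use ML to transform the data
assembled = VectorAssembler(inputCols=["x1", "x2"], outputCol="features").transform(raw_df)

# 2. extract labeled points and pass them to an MLlib algorithm
labeled = assembled.select("label", "features").rdd.map(
    lambda row: LabeledPoint(row.label, MLlibVectors.fromML(row.features)))
model = DecisionTree.trainClassifier(labeled, numClasses=2, categoricalFeaturesInfo={})

# 3. save the MLlib model (MLlib models save via the SparkContext)
model.save(sc, "/models/dt")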
There is also a temporary solution provided in the related Jira ticket.
Spark uses in-memory computing and caching to decrease latency on complex analytics; however, this is mainly for "iterative algorithms".
If I needed to perform a more basic analytic, say each element were a group of numbers and I wanted to find elements with a standard deviation less than x, would Spark still decrease latency compared to regular cluster computing (without in-memory computing)? Assume I used the same commodity hardware in each case.
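For concreteness, a sketch of the kind of analytic I mean (groups_rdd and the threshold x are placeholders):

from statistics import pstdev

# keep the elements (groups of numbers) whose standard deviation is below x
small_spread = groups_rdd.filter(lambda nums: pstdev(nums) < x).collect()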
It tied for the top sorting framework while using none of those extra mechanisms, so I would argue that is reason enough. But you can also run streaming, graph, or machine learning workloads without having to switch gears. Then add in that you should use DataFrames wherever possible, and you get query optimizations beyond any other framework that I know of. So, yes, Spark is the clear choice in almost every instance.
One good thing about Spark is its Data Source API; combining it with Spark SQL gives you the ability to query and join different data sources together. Spark SQL now includes a decent optimizer, Catalyst. As mentioned in one of the answers, along with core Spark (RDDs) you can also work with streaming data, apply machine learning models, and run graph algorithms. So yes.
I've been tasked with figuring out how to extend Spark's API to include some custom hooks for another program, like IPython Notebook, to latch on to. I've already gone through the quick start guide, the cluster mode overview, the submitting-applications doc, and this Stack Overflow question. Everything I'm seeing indicates that, to get something to run in Spark, you need to use
spark-submit
to make it happen. As such, I whipped up some code that, vis-à-vis Spark, pulled the first ten rows of test data out of an Accumulo table I created. My team lead, however, is telling me to modify Spark itself. Is this the preferred way to accomplish the task I described? If so, why? What's the value proposition?
No details have been provided about what types of operations your application requires, so an answer here will need to remain general in nature.
The question of whether to extend Spark itself may come down to:
Can I achieve the needs of the application by leveraging the existing methods within the Spark(/SQL/Hive/Streaming)Context and RDD (/SchemaRDD/DStream/..)?
Additional choices:
Is it possible to embed the required functionality inside the transformation methods of RDDs, either with custom code or by invoking third-party libraries?
The likely distinguishing factor here is whether the existing data access and shuffle/distribution structures support your needs. When it comes to data transformations, in most cases you should be able to embed the required logic within the methods of RDD.
So:
case class InputRecord(..)
case class OutputRecord(..)

def myTransformationLogic(inputRec: InputRecord): OutputRecord = {
  // put your biz rules/transforms here and return the resulting OutputRecord
  outputRec
}

val myData = sc.textFile(<hdfs path>).map { l => InputRecord.fromInputLine(l) }
val outputData = myData.map(myTransformationLogic)
outputData.saveAsTextFile(<hdfs path>)