how to extend apache spark api? - apache-spark

i've been tasked with figuring out how to extend spark's api to include some custom hooks for another program like iPython Notebook to latch on to. I've already gone through the quick start guide, the cluster mode overview, the submitting applications doc, and this stack overflow question. Everything I'm seeing indicates that, to get something to run in Spark, you need to use
spark-submit
to make it happen. As such I whipped up some code that, visa vis spark, pulled the first ten rows of test data out of an accumulo table I created. My team lead, however, is telling me to modify spark itself. Is this the preferred way to accomplish the task I described? If so, why? What's the value proposition?

No details have been provided about what types of operations your application requires so an answer here will need to remain general in nature.
The question of extending spark itself may come down to:
Can I achieve the needs of the application by leveraging the existing
methods within Spark(/SQL/Hive/Streaming)Context and RDD
(/SchemaRDD/DStream/..)
additional choices:
Is it possible to embed the required functionality inside the
transformation methods of RDD's - either by custom code or by invoking
third party libraries.
The likely distinguishing factors here would be if the existing data access and shuffle/distribution structures support your needs. When it comes to data tranformations - in most cases you should be able to embed the required logic within the methods of RDD.
So:
case class InputRecord(..)
case class OutputRecord(..)
def myTranformationLogic(inputRec: InputRecord) : OutputRecord = {
// put your biz rules/transforms here
(return) outputRec
}
val myData = sc.textFile(<hdfs path>).map{ l => InputRecord.fromInputLine(l)}
val outputData = myData.map(myTransformationLogic)
outputData.saveAsTextFile(<hdfs path>)

Related

How to update rows in Jooq without Codegen using JSON

I am using Jooq version 3.17.0 and attempting to insert data into a table without codegen.
At the minute, I am designing a system that allows data to be imported into multiple tables (one at a time, and starting with just one), yet I do not want to write specific code for each table and as of now, I haven't had a need for codegen.
The code currently works for importing data via JSON, with json being a String formatted in the 'Jooq' format. This imports data correctly into the database. This also allows us to send json data of table updates from one system to our main system that uses Jooq. Yet it gives me an error when I try to update.
I am using MYSQL as my database.
The original code for insertion is :
Result<Record> convertedJson = dslContext.fetchFromJSON(json);
Loader<Record> res1 = dslContext.loadInto(table(tableName)).loadJSON(json).fields(convertedJson.fields()).execute();
However, if we try to update data by sending in the same json, but with one field changed, jooq gives an error org.jooq.exception.DataAccessException stating that there is a duplicate entry for key.
I tried to use :
Loader<Record> res2 = dslContext.loadInto(table(tableName)).onDuplicateKeyUpdate().loadJSON(json).fields(convertedJson.fields()).execute();
But then this throws an error ON DUPLICATE KEY UPDATE only works on tables with explicit primary keys. Table is not updatable : <tableName> since in LoaderImpl.onDuplicateKeyUpdate():220 since table.getPrimaryKey() is null which technically makes sense since table(tableName) returns a Table that does not know it's fields.
My question is probably two-fold.
Is there a way to have a table that is aware of it's fields without codegen?
Is there a way for me to allow jooq to update rows this way.
My preferences is to steer clear of codegen, unless it's really needed. I probably could switch to codegen if needed, but again I would still need to be able to execute SQL without writing specific code for each table. Using JSON is still very much desired, as that allows me to send data from one application to another for import.
Using code generation
You've run into one of those many reasons why code generation is very helpful with jOOQ. If your various tables are known at compile time, and all you're doing is switch table names, then I would go with generated code, making the lookup of the table dynamic. That would solve the problem easily.
From experience with various similar support cases, I've always recommended this first, because as soon as these kinds of troubles start, it's a good idea to re-think the code generation strategy as you will run into other, similar problems, having to work around the lack of ubiquitously available meta data all the time. There are many other benefits to using the code generator.
Emulating code generation
If for some reason you cannot (e.g. the tables aren't known at compile time) or do not want to use the code generator, then you can do the code generator's work yourself at runtime, by building CustomTable types as documented here.
Using other means of providing meta information
Another way to provide jOOQ with meta data is to use one of various forms of implementing org.jooq.Meta, which include:
Looking up meta data from the JDBC driver's DatabaseMetaData (this can be slow, depending on your schema)
Letting jOOQ interpret some DDL scripts
Using jOOQ's XML representation of the standard SQL INFORMATION_SCHEMA
Using generated code

Implications of getting Dataframe from Logical Plan in Apache Spark

I'm writing some custom additions to SparkSQL and in some places, I require a reference to Dataframe.
I've already checked questions like this, and I know that Logical Plans can be converted to data frame by Dataset.ofRows(Sqlctx, Plan).
My question is more about the implications of this like will this conversion trigger a job/ data scan? I'm not performing any explicit action on this data frame, just chaining it through some more transformations.
Please help me here, and let me know if any more details are required.

Using Spark ML Pipelines just for Transformations

I am working on a project where configurable pipelines and lineage tracking of alterations to Spark DataFrames are both essential. The endpoints of this pipeline are usually just modified DataFrames (think of it as an ETL task). What made the most sense to me was to leverage the already existing Spark ML Pipeline API to track these alterations. In particular, the alterations (adding columns based on others, etc.) are implemented as custom Spark ML Transformers.
However, we are now having an internal about debate whether or not this is the most idiomatic way of implementing this pipeline. The other option would be to implement these transformations as series of UDFs and to build our own lineage tracking based on a DataFrame's schema history (or Spark's internal DF lineage tracking). The argument for this side is that Spark's ML pipelines are not intended just ETL jobs, and should always be implemented with goal of producing a column which can be fed to a Spark ML Evaluator. The argument against this side is that it requires a lot of work that mirrors already existing functionality.
Is there any problem with leveraging Spark's ML Pipelines strictly for ETL tasks? Tasks that only make use of Transformers and don't include Evaluators?
For me, seems like a great idea, especially if you can compose the different Pipelines generated into new ones since a Pipeline can itself be made of different pipelines since a Pipeline extends from PipelineStage up the tree (source: https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.ml.Pipeline).
But keep in mind that you will probably being doing the same thing under the hood as explained here (https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/spark-mllib/spark-mllib-transformers.html):
Internally, transform method uses Spark SQL’s udf to define a function (based on createTransformFunc function described above) that will create the new output column (with appropriate outputDataType). The UDF is later applied to the input column of the input DataFrame and the result becomes the output column (using DataFrame.withColumn method).
If you have decided for other approach or found a better way, please, comment. It's nice to share knowledge about Spark.

Run ML algorithm inside map function in Spark

So I have been trying for some days now to run ML algorithms inside a map function in Spark. I posted a more specific question but referencing Spark's ML algorithms gives me the following error:
AttributeError: Cannot load _jvm from SparkContext. Is SparkContext initialized?
Obviously I cannot reference SparkContext inside the apply_classifier function.
My code is similar to what was suggested in the previous question I asked but still haven't found a solution to what I am looking for:
def apply_classifier(clf):
dt = DecisionTreeClassifier(labelCol="indexedLabel", featuresCol="indexedFeatures", maxDepth=3)
if clf == 0:
clf = DecisionTreeClassifier(labelCol="indexedLabel", featuresCol="indexedFeatures", maxDepth=3)
elif clf == 1:
clf = RandomForestClassifier(labelCol="indexedLabel", featuresCol="indexedFeatures", numTrees=5)
classifiers = [0, 1]
sc.parallelize(classifiers).map(lambda x: apply_classifier(x)).collect()
I have tried using flatMap instead of map but I get NoneType object is not iterable.
I would also like to pass a broadcasted dataset (which is a DataFrame) as parameter inside the apply_classifier function.
Finally, is it possible to do what I am trying to do? What are the alternatives?
is it possible to do what I am trying to do?
It is not. Apache Spark doesn't support any form of nesting and distributed operations can be initialized only by the driver. This includes access to distributed data structures, like Spark DataFrame.
What are the alternatives?
This depends on many factors like the size of the data, amount of available resources, and choice of algorithms. In general you have three options:
Use Spark only as task management tool to train local, non-distributed models. It looks like you explored this path to some extent already. For more advanced implementation of this approach you can check spark-sklearn.
In general this approach is particularly useful when data is relatively small. Its advantage is that there is no competition between multiple jobs.
Use standard multithreading tools to submit multiple independent jobs from a single context. You can use for example threading or joblib.
While this approach is possible I wouldn't recommend it in practice. Not all Spark components are thread-safe and you have to pretty careful to avoid unexpected behaviors. It also gives you very little control over resource allocation.
Parametrize your Spark application and use external pipeline manager (Apache Airflow, Luigi, Toil) to submit your jobs.
While this approach has some drawbacks (it will require saving data to a persistent storage) it is also the most universal and robust and gives a lot of control over resource allocation.

Avoid the use of Java data structures in Apache Spark to avoid copying the data

I have a MySQL database with a single table containing about 100 million records (~25GB, ~5 columns). Using Apache Spark, I extract this data via a JDBC connector and store it in a DataFrame.
From here, I do some pre-processing of the data (e.g. replacing the NULL values), so I absolutely need to go through each record.
Then I would like to perform dimensionality reduction and feature selection (e.g. using PCA), perform clustering (e.g. K-Means) and later on do the testing of the model on new data.
I have implemented this in Spark's Java API, but it is too slow (for my purposes) since I do a lot of copying of the data from a DataFrame to a java.util.Vector and java.util.List (to be able to iterate over all records and do the pre-processing), and later back to a DataFrame (since PCA in Spark expects a DataFrame as input).
I have tried extracting information from the database into a org.apache.spark.sql.Column but cannot find a way to iterate over it.
I also tried avoiding the use of Java data structures (such as List and Vector) by using the org.apache.spark.mllib.linalg.{DenseVector, SparseVector}, but cannot get that to work either.
Finally, I also considered using JavaRDD (by creating it from a DataFrame and a custom schema), but couldn't work it out entirely.
After a lengthy description, my question is: is there a way to do all steps mentioned in the first paragraph, without copying all the data into a Java data structure?
Maybe one of the options I tried could actually work, but I just can't seem to find out how, as the docs and literature on Spark are a bit scarce.
From the wording of your question, it seems there is some confusion about the stages of Spark processing.
First, we tell Spark what to do by specifying inputs and transformations. At this point, the only things that are known are (a) the number of partitions at various stages of processing and (b) the schema of the data. org.apache.spark.sql.Column is used at this stage to identify the metadata associated with a column. However, it doesn't contain any of the data. In fact, there is no data at all at this stage.
Second, we tell Spark to execute an action on a dataframe/dataset. This is what kicks off processing. The input is read and flows through the various transformations and into the final action operation, be it collect or save or something else.
So, that explains why you cannot "extract information from the database into" a Column.
As for the core of your question, it's hard to comment without seeing your code and knowing exactly what it is you are trying to accomplish but it is safe to say that much migrating between types is a bad idea.
Here are a couple of questions that might help guide you to a better outcome:
Why can't you perform the data transformations you need by operating directly on the Row instances?
Would it be convenient to wrap some of your transformation code into a UDF or UDAF?
Hope this helps.

Resources