Is there any out-of-the-box functionality in Spark for binding an RDD to a REST service? That is, calling a web service and getting an RDD back.
Or is the simplest approach to call the REST service myself and convert the resulting collection to an RDD?
Thanks.
I used the Jersey client, read a string (one complete JSON document per line), and with this string did this:
import session.implicits._ // supplies the Encoder[String] that createDataset needs
val stringResponse = request.request().get(classOf[String])
val jsonDataset = session.createDataset[String](Seq(stringResponse))
// try with case class
val parsedResponse = session.read.json(jsonDataset)
...which results in a DataFrame that you can select stuff on.
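For the "call the service yourself and convert the result" route, the payload format matters more than the HTTP client. The following is a minimal plain-Python sketch, not Spark code, of splitting a body that holds one JSON document per line into records. The name parse_json_lines and the sample body are made up for illustration; the resulting list is the kind of collection you would then hand to Spark yourself (for example with SparkContext.parallelize), a step not shown here.

```python
import json


def parse_json_lines(body):
    """Parse a response body that holds one JSON document per line."""
    records = []
    for line in body.splitlines():
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between documents
        records.append(json.loads(line))
    return records


# Hypothetical response body, standing in for the string returned by the REST call.
sample_body = '{"id": 1, "name": "a"}\n{"id": 2, "name": "b"}\n'
records = parse_json_lines(sample_body)
print(records)
```

Parsing up front like this also makes malformed lines fail early, before any cluster resources are involved.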
You can refer to Spark-Jobserver.
Some of the Spark-Jobserver features that I think you are looking for are:
"Spark as a Service": Simple REST interface for all aspects of job, context management
Start and stop job contexts for RDD sharing and low-latency jobs; change resources on restart
Asynchronous and synchronous job API. Synchronous API is great for low latency jobs!
Named RDDs to cache and retrieve RDDs by name, improving RDD sharing and reuse among jobs.
Hope this helps.
Can we have one Spark driver that acts as a REST API?
Using this REST API (one driver), I could spin up multiple executors on the fly (in real time).
I mean that whenever a new request comes to the Spark driver (REST API), it needs to spin up new executors, not another new driver.
Example scenario:
Assume you have a model with 3 steps
1) Read from one set of tables and apply a join and many transformations.
2) Read from a second set of tables and apply a join and many transformations.
3) Finally, compare the above DataFrames and update some records.
Here we have 3 input values to the model.
Likewise we have 1000 combinations of input values for the model to run.
Of course, you can use the driver as a REST API.
Whenever you get a request, just prepare your RDD/DataFrame and then perform an action, and it'll work.
You can do it in the driver (which means the SparkContext is always up and holds resources), or you can wrap it in a REST API that submits jobs to your cluster on request (in which case a new SparkContext is created for each job).
I have a Kafka broker with JSON data from my IoT applications. I connect to this server from a Spark Streaming application in order to do some processing.
I'd like to save some specific fields of my JSON data in memory (RAM), which I believe I could achieve using the cache() and persist() operators.
The next time I receive new JSON data in the Spark Streaming application, I check in memory (RAM) whether there are fields in common that I can retrieve. If so, I do some simple computations and finally update the values of the fields I saved in memory (RAM).
Thus, I would like to know whether what I described is possible. If so, do I have to use cache() or persist()? And how can I retrieve my fields from memory?
It's possible with cache / persist which uses memory or disk for the data in Spark applications (not necessarily for Spark Streaming applications only -- it's a more general use of caching in Spark).
But...in Spark Streaming you've got special support for such use cases which are called stateful computations. See Spark Streaming Programming Guide to explore what's possible.
I think for your use case mapWithState operator is exactly what you're after.
Spark does not work that way. Please think it through in a distributed way.
For the first part, keeping data in RAM: you can use either cache() or persist(), since by default both keep the data in the memory of the workers.
You can verify this in the Apache Spark source code.
/** Persist this RDD with the default storage level (`MEMORY_ONLY`). */
def persist(): this.type = persist(StorageLevel.MEMORY_ONLY)
/** Persist this RDD with the default storage level (`MEMORY_ONLY`). */
def cache(): this.type = persist()
As far as I understand your use case, you need the updateStateByKey operation to implement the second part.
For more on Windowing see here.
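To make the stateful idea concrete, here is a plain-Python model, not Spark code, of what a per-key state update does across micro-batches. The helper update_state, the running_sum function, and the sensor keys are all made up for illustration. The update function takes the new values for a key plus the previous state and returns the new state, which is the shape of function that updateStateByKey expects; returning None drops the key.

```python
def update_state(state, batch, update_func):
    """Apply one micro-batch to the running per-key state and return the new state.

    state: dict mapping key -> previous state value
    batch: list of (key, value) pairs received in this batch
    update_func: function(new_values, previous_state_or_None) -> new state
    """
    grouped = {}
    for key, value in batch:
        grouped.setdefault(key, []).append(value)
    new_state = {}
    # Every known key is revisited, even when the batch has no new values for it.
    for key in set(state) | set(grouped):
        result = update_func(grouped.get(key, []), state.get(key))
        if result is not None:  # returning None drops the key from the state
            new_state[key] = result
    return new_state


def running_sum(new_values, previous):
    """Hypothetical update function: keep a running total per key."""
    return (previous or 0) + sum(new_values)


state = {}
for batch in [[("sensor-1", 2), ("sensor-2", 5)], [("sensor-1", 3)]]:
    state = update_state(state, batch, running_sum)
print(state)
```

The point of the model is that the state lives between batches and is addressed by key, so "check whether I have seen this field before, then update it" becomes the body of the update function rather than a lookup into a cached RDD.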
I am creating a Spark RDD by loading data from Elasticsearch using the elasticsearch-hadoop connector in python (importing pyspark) as:
es_cluster_read_conf = {
    "es.nodes" : "XXX",
    "es.port" : "XXX",
    "es.resource" : "XXX"
}
es_cluster_rdd = sc.newAPIHadoopRDD(
    inputFormatClass="org.elasticsearch.hadoop.mr.EsInputFormat",
    keyClass="org.apache.hadoop.io.NullWritable",
    valueClass="org.elasticsearch.hadoop.mr.LinkedMapWritable",
    conf=es_cluster_read_conf)
Now, if I have only these two statements in my file and run it, the Spark Web UI's Application Details page shows one job: take at SerDeUtil.scala:201
I have 2 questions now:
1) I was under the impression that in Spark, RDDs are computed lazily, i.e., if no action is applied, no job is launched. In the above scenario I am not applying any action, yet I see a job being run in the web UI.
2) If this is a job, what does this "take" operation actually mean? Does it mean that the data is actually loaded from my Elasticsearch node and passed to the Spark node? I understand why some jobs are listed as collect, count, etc., because these are valid actions in Spark. However, even after extensive research, I still couldn't figure out the semantics of this take operation.
I was under the impression that in Spark RDDs are computed lazily i.e, if no action is applied, there would not be any job launched.
This is more or less true, although there are a few exceptions where an action can be triggered by a secondary task, such as creating a partitioner or converting data between the JVM and guest languages. It is even more complicated when you work with the high-level Dataset API and DataFrames.
If this is a job, what does this "take" operation actually mean? Does this mean that the data is actually loaded from my ElasticSearch node and passed to Spark node?
It is a job, and some amount of data is actually fetched from the source. This is required to determine the serializer for the key-value pairs.
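As a loose analogy only (plain Python generators, not Spark internals), the sketch below shows how a pipeline can stay lazy while a small "peek" at the first record still forces a little data to be read. The read_source function and its row format are made up for illustration.

```python
from itertools import islice

fetched = []  # records which source rows were actually read


def read_source():
    """Hypothetical data source; a row is only read when a consumer asks for it."""
    for i in range(1000):
        fetched.append(i)
        yield ("key-%d" % i, {"value": i})


# Building the pipeline reads nothing, much like defining transformations.
pipeline = ((k, v["value"] * 2) for k, v in read_source())
assert fetched == []

# Taking the first element forces just enough of the source to be read.
sample = list(islice(pipeline, 1))
print(sample, len(fetched))
```

In the same spirit, a take that inspects a record to choose a serializer reads only a small amount of data, not the whole index.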
We are trying to implement a use case with Spark Streaming and Spark SQL that lets us run user-defined rules against some data (see below for how the data is captured and used). The idea is to use SQL to specify the rules and return the results as alerts to the users. Executing the query for each incoming event batch seems to be very slow. We would appreciate it if anyone could suggest a better approach to implementing this use case. We would also like to know whether Spark executes the SQL on the driver or on the workers. Thanks in advance. These are the steps we perform:
1) Load the initial dataset from an external database as a JDBCRDD
JDBCRDD<SomeState> initialRDD = JDBCRDD.create(...);
2) Create an incoming DStream (that captures updates to the initialized data)
JavaReceiverInputDStream<SparkFlumeEvent> flumeStream =
FlumeUtils.createStream(ssc, flumeAgentHost, flumeAgentPort);
JavaDStream<SomeState> incomingDStream = flumeStream.map(...);
3) Create a Pair DStream using the incoming DStream
JavaPairDStream<Object,SomeState> pairDStream =
incomingDStream.map(...);
4) Create a Stateful DStream from the pair DStream using the initialized RDD as the base state
JavaPairDStream<Object,SomeState> statefulDStream = pairDStream.updateStateByKey(...);
JavaRDD<SomeState> updatedStateRDD = statefulDStream.map(...);
5) Run a user-driven query against the updated state based on the values in the incoming stream
incomingDStream.foreachRDD(new Function<JavaRDD<SomeState>,Void>() {
    @Override
    public Void call(JavaRDD<SomeState> events) throws Exception {
        updatedStateRDD.count();
        SQLContext sqx = new SQLContext(events.context());
        DataFrame schemaDf = sqx.createDataFrame(updatedStateRDD, SomeState.class);
        schemaDf.registerTempTable("TEMP_TABLE");
        sqx.sql("SELECT col1 from TEMP_TABLE where <condition1> and <condition2> ...");
        //collect the results and process and send alerts
        ...
        return null;
    }
});
The first step should be to identify which step is taking most of the time.
Check the Spark Master UI to see which step/phase takes the most time.
Here are a few best practices and observations of mine that you can consider:
Use Singleton SQLContext - See example - https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/streaming/SqlNetworkWordCount.scala
updateStateByKey can be a memory-intensive operation when there is a large number of keys. Check the size of the data processed by the updateStateByKey function and whether it fits in the available memory.
How is your GC behaving?
Are you really using "initialRDD"? If not, do not load it. If it is a static dataset, cache it.
Check the time taken by your SQL Query too.
Here are a few more questions/areas that can help you:
What is the StorageLevel for DStreams?
Size of cluster and configuration of Cluster
version of Spark?
Lastly, foreachRDD is an output operation that executes the given function on the driver, but that function usually applies actions to the RDD, and those actions are executed on the worker nodes.
You may want to read this for a better explanation of output operations: http://spark.apache.org/docs/latest/streaming-programming-guide.html#output-operations-on-dstreams
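The singleton advice above boils down to creating the expensive context once and reusing it on every batch, instead of constructing a new one inside the per-batch function. Here is a minimal plain-Python sketch of that lazily instantiated singleton pattern. FakeSQLContext and get_context are hypothetical stand-ins, not Spark classes.

```python
class FakeSQLContext:
    """Hypothetical stand-in for an expensive-to-create context object."""
    instances_created = 0

    def __init__(self):
        FakeSQLContext.instances_created += 1


_context = None


def get_context():
    """Create the context on first use and return the same one afterwards."""
    global _context
    if _context is None:
        _context = FakeSQLContext()
    return _context


# Simulate the per-batch callback being invoked for several batches.
for _ in range(5):
    ctx = get_context()
print(FakeSQLContext.instances_created)
```

In the question's code the equivalent change would be to replace the per-batch constructor call with a lookup like this one.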
I am facing the same issue too. Could you please let me know if you have found a solution? I have described my detailed use case in the post below.
Spark SQL + Window + Streming Issue - Spark SQL query is taking long to execute when running with spark streaming
We are developing a Spark-based framework in which we move historical data into RDDs.
Basically, RDD is immutable, read only dataset on which we do operations.
Based on that we have moved historical data into RDD and we do computations like filtering/mapping, etc on such RDDs.
Now there is a use case where a subset of the data in the RDD gets updated and we have to recompute the values.
HistoricalData is in the form of RDD.
I create another RDD based on request scope and save the reference of that RDD in a ScopeCollection
So far I have been able to think of the approaches below:
Approach 1: broadcast the change
1. For each change request, my server fetches the scope-specific RDD and spawns a job.
2. In the job, apply a map phase on that RDD:
2.a. For each node in the RDD, do a lookup on the broadcast and create a new, updated value, thereby creating a new RDD.
2.b. Redo all the computations (multiplication, reduction, etc.) on the new RDD from step 2.a.
2.c. Save this RDD's reference back in my ScopeCollection.
Approach 2: create an RDD for the updates
1. For each change request, my server fetches the scope-specific RDD and spawns a job.
2. On each RDD, do a join with the new RDD that holds the changes.
3. Redo all the computations (multiplication, reduction, etc.) on the new RDD from step 2.
Approach 3:
I had thought of creating a streaming RDD where I keep updating the same RDD and redo the computation. But as far as I understand, it can take streams from Flume or Kafka, whereas in my case the values are generated in the application itself based on user interaction.
Hence I cannot see any integration points for a streaming RDD in my context.
Any suggestions on which approach is better, or on any other approach suitable for this scenario?
TIA!
The use case presented here is a good match for Spark Streaming. The two other options raise the question: "How do you submit a re-computation of the RDD?"
Spark Streaming offers a framework to continuously submit work to Spark based on some stream of incoming data and preserve that data in RDD form. Kafka and Flume are only two possible Stream sources.
You could use socket communication with the SocketInputDStream, read files in a directory using FileInputDStream, or even use a shared queue with the QueueInputDStream. If none of those options fits your application, you could write your own InputDStream.
In this use case, using Spark Streaming, you will read your base RDD and use the incoming dstream to incrementally transform the existing data and maintain an evolving in-memory state. dstream.transform will allow you to combine the base RDD with the data collected during a given batch interval, while the updateStateByKey operation could help you build an in-memory state addressed by keys. See the documentation for further information.
Without more details on the application, it is hard to go down to the code level on what's possible using Spark Streaming. I'd suggest you explore this path and post new questions for any specific topics.
I suggest taking a look at the IndexedRDD implementation, which provides an updatable RDD of key-value pairs. That might give you some insights.
The idea is based on knowing the key, which allows you to zip your updated chunk of data with the same keys of the already created RDD. During the update it's possible to filter out the previous version of the data.
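The key-based update idea can be modelled in plain Python (this is not the IndexedRDD API, just the underlying logic): drop the previous versions of the keys being changed, add the new versions, then recompute. The function apply_updates and the sample trade data are made up for illustration.

```python
def apply_updates(base, updates):
    """Return a new list of (key, value) pairs where updated keys replace old versions."""
    updated_keys = {key for key, _ in updates}
    # Filter out the previous versions, then append the new ones.
    kept = [(k, v) for k, v in base if k not in updated_keys]
    return kept + list(updates)


historical = [("trade-1", 100), ("trade-2", 250), ("trade-3", 75)]
changes = [("trade-2", 300)]
current = apply_updates(historical, changes)
print(current)
print(sum(v for _, v in current))
```

Note that the original collection is left untouched and a new one is produced, which matches the immutable nature of RDDs: an "update" is really a new dataset derived from the old one plus the changes.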
Since you have historical data, I'd say each event needs some sort of identity.
Regarding streaming and consumption, it's possible to use a TCP port. This way the driver might open the TCP connection that Spark expects to read from and send the updates there.