Spark driver as a REST API - apache-spark

Can we have one spark driver which acts as a Rest API?
Using this Rest API (1 driver) i can spin up multiple executors on fly(real time).
I mean when ever a new request comes to spark Driver(RestAPI) it need to spin up new executors not another new driver.
Example scenario:
Assume you have a model with 3 steps
1) Read from one set of tables and applies join and many transformations.
2) Read from second set of tables and applies join and many transformations.
3) Finally compare above dataframes and update back some records.
Here we have 3 input values to the model.
Likewise we have 1000 combinations of input values for the model to run.

Offcourse, you can use the driver as rest api.
When ever you get a request just prepare your RDD/DF then perform an action and it`ll work.
You can do it in the driver (which mean SparkContext always up and it take resources), or you can wrap it with REST Api that submit jobs to your cluster by request.(And then for each job a new SparkContext will be created).

Related

Does every operation you write in a spark job performed in a spark cluster?

lets say for the operation
val a = 12 + 4, or something simple.
Will it still be distributed by the driver into cluster ?
lets say I have a map , say Map[String,String] (very large say 1000000 key value pairs)(hypothetical assumption)
Now when I do get*("something"),
Will this be distributed across the cluster to get that value?
If not , then what is the use of spark if it doesn't computes simple task together?
How is the number of tasks determined by spark also number of job determined ?
If there is a stream and some action is perform for each batch. Is it so that for each batch new job is created?
Answers:
No, This is still a driver side compute.
If you create the map in a Driver program then it remains on driver. If you try access a key then it would simply lookup on the map you created on driver memory and return you back the value.
If you create a RDD out of the collection (Reference) , and if you run any transformation then it will be run on Spark cluster.
Number of partitions usually corresponds to the number of tasks. You can explicitly tell how many partitions you want when you parallelize the collection ( like the map in your case)
Yes, there will be a job created for action performed on each batch.

How lazy Data structure works

Having some doubts on action and transformation in Spark.
I'm using spark API from last couple of months. (Learned) Spark api has a power which says that it wont load any data into memory until any action is taken to store final transformed data somewhere. Is it correct understanding ?
More refined defination:
Spark will create a DAG(Directed Acyclic Graph) using the applied operation, source RDD and function used for transformation. And it will keep on building this graph using the references till you apply any action operation on the last lined up RDD. That is why the transformation in Spark are lazy.
The moment action is triggered (For example writing to file), data will start loading into memory from source and then transforming and finally writing into file. Is it correct meaning of action ? OR Action is something when driver program submits transformation and action graph to master and then master sends respected data and code to different worker nodes to execute ? Which one is correct understanding ?
Have Read online posts, but not cleared.
You right, spark won`t do anything untill certain action will be taken (e.g write).
Every transformation will return a new RDD contains its DAG and when you submit an action, spark will execute the DAG (If you use dataset itll make optimizations too).
Action is a method that submit the DAG as said (writing into file/foreach and more actions).
The driver is response to parallize the work, keep a live on executors and send them tasks.

Batch processing job (Spark) with lookup table that's too big to fit into memory

I'm trying to write a batch job to process a couple of hundreds of terabytes that currently sit in an HBase database (in an EMR cluster in AWS), all in a single large table. For every row I'm processing, I need to get additional data from a lookup table (a simple integer to string mapping) that is in a second HBase table. We'd be doing 5-10 lookups per row.
My current implementation uses a Spark job that's distributing partitions of the input table to its workers, in the following shape:
Configuration hBaseConfig = newHBaseConfig();
hBaseConfig.set(TableInputFormat.SCAN, convertScanToString(scan));
hBaseConfig.set(TableInputFormat.INPUT_TABLE, tableName);
JavaPairRDD<ImmutableBytesWritable, Result> table = sparkContext.newAPIHadoopRDD(hBaseConfig, TableInputFormat.class, ImmutableBytesWritable.class, Result.class);
table.map(val -> {
// some preprocessing
}).foreachPartition(p -> {
p.forEachRemaining(row -> {
// code that does the lookup
});
});
The problem is that the lookup table is too big to fit in the workers' memory. They all need access to all parts of the lookup table, but their access pattern would significantly benefit from a cache.
Am I right in thinking that I cannot use a simple map as a broadcast variable because it'd need to fit into memory?
Spark uses a shared nothing architecture, so I imagine there won't be an easy way to share a cache across all workers, but can we build a simple LRU cache for every individual worker?
How would I implement such a local worker cache that gets the data from the lookup table in HBase on a cache miss? Can I somehow distribute a reference to the second table to all workers?
I'm not set on my choice of technology, apart from HBase as the data source. Is there a framework other than Spark which could be a better fit for my use case?
You have a few of options for dealing with this requirement:
1- Use RDD or Dataset joins
You can load both of your HBase tables as Spark RDD or Datasets and then do a join on your lookup key.
Spark will split both RDD into partitions and shuffle content around so that rows with the same keys end up on the same executors.
By managing the number of number of partitions within spark you should be able to join 2 tables on any arbitrary sizes.
2- Broadcast a resolver instance
Instead of broadcasting a map, you can broadcast a resolver instance that does a HBase lookup and temporary LRU cache. Each executor will get a copy of this instance and can manage its own cache and you can invoke them within for foreachPartition() code.
Beware, the resolver instance needs to implement Serializable so you will have to declare the cache, HBase connections and HBase Configuration properties as transient to be initialized on each executor.
I run such a setup in Scala on one of the projects I maintain: it works and can be more efficient than the straight Spark join if you know your access patterns and manage you cache efficiently
3- Use the HBase Spark connector to implement your lookup logic
Apache HBase has recently incorporated improved HBase Spark connectors
The documentation is pretty sparse right now, you need to look at the JIRA tickets and the documentation of the previous incarnation of these tools
Cloudera's SparkOnHBase but the last unit test in the test suite looks pretty much like what you want
I have no experience with this API though.

How does mllib code run on spark?

I am new to distributed computing, and I'm trying to run Kmeans on EC2 using Spark's mllib kmeans. As I was reading through the tutorial I found the following code snippet on
http://spark.apache.org/docs/latest/mllib-clustering.html#k-means
I am having trouble understanding how this code runs inside the cluster. Specifically, I'm having trouble understanding the following:
After submitting the code to master node, how does spark know how to parallelize the job? Because there seem to be no part of the code that deals with this.
Is the code copied to all nodes and executed on each node? Does the master node do computation?
How do node communitate the partial result of each iteration? Is this dealt inside the kmeans.train code, or is the spark core takes care of it automatically?
Spark divides data to many partitions. For example, if you read a file from HDFS, then partitions should be equal to partitioning of data in HDFS. You can manually specify number of partitions by doing repartition(numberOfPartitions). Each partition can be processed on separate node, thread, etc. Sometimes data are partitioned by i.e. HashPartitioner, which looks on hash of the data.
Number of partitions and size of partitions generally tells you if data is distributed/parallelized correctly. Creating partitions of data is hidden in RDD.getPartitions methods.
Resource scheduling depends on cluster manager. We can post very long post about them ;) I think that in this question, the partitioning is the most important. If not, please inform me, I will edit answer.
Spark serializes clusures, that are given as arguments to transformations and actions. Spark creates DAG, which is sent to all executors and executors execute this DAG on the data - it launches closures on each partition.
Currently after each iteration, data is returned to the driver and then next job is scheduled. In Drizzle project, AMPLab/RISELab is creating possibility to create multiple jobs on one time, so data won't be sent to the driver. It will create DAG one time and schedules i.e. job with 10 iterations. Shuffle between them will be limited / will not exists at all. Currently DAG is created in each iteration and job in scheduled to executors
There is very helpful presentation about resource scheduling in Spark and Spark Drizzle.

How to update an RDD?

We are developing Spark framework wherein we are moving historical data into RDD sets.
Basically, RDD is immutable, read only dataset on which we do operations.
Based on that we have moved historical data into RDD and we do computations like filtering/mapping, etc on such RDDs.
Now there is a use case where a subset of the data in the RDD gets updated and we have to recompute the values.
HistoricalData is in the form of RDD.
I create another RDD based on request scope and save the reference of that RDD in a ScopeCollection
So far I have been able to think of below approaches -
Approach1: broadcast the change:
For each change request, my server fetches the scope specific RDD and spawns a job
In a job, apply a map phase on that RDD -
2.a. for each node in the RDD do a lookup on the broadcast and create a new Value which is now updated, thereby creating a new RDD
2.b. now I do all the computations again on this new RDD at step2.a. like multiplication, reduction etc
2.c. I Save this RDDs reference back in my ScopeCollection
Approach2: create an RDD for the updates
For each change request, my server fetches the scope specific RDD and spawns a job
On each RDD, do a join with the new RDD having changes
now I do all the computations again on this new RDD at step2 like multiplication, reduction etc
Approach 3:
I had thought of creating streaming RDD where I keep updating the same RDD and do re-computation. But as far as I understand it can take streams from Flume or Kafka. Whereas in my case the values are generated in the application itself based on user interaction.
Hence I cannot see any integration points of streaming RDD in my context.
Any suggestion on which approach is better or any other approach suitable for this scenario.
TIA!
The usecase presented here is a good match for Spark Streaming. The two other options bear the question: "How do you submit a re-computation of the RDD?"
Spark Streaming offers a framework to continuously submit work to Spark based on some stream of incoming data and preserve that data in RDD form. Kafka and Flume are only two possible Stream sources.
You could use Socket communication with the SocketInputDStream, reading files in a directory using FileInputDStream or even using shared Queue with the QueueInputDStream. If none of those options fit your application, you could write your own InputDStream.
In this usecase, using Spark Streaming, you will read your base RDD and use the incoming dstream to incrementally transform the existing data and maintain an evolving in-memory state. dstream.transform will allow you to combine the base RDD with the data collected during a given batch interval, while the updateStateByKey operation could help you build an in-memory state addressed by keys. See the documentation for further information.
Without more details on the application is hard to go up to the code level on what's possible using Spark Streaming. I'd suggest you to explore this path and make new questions for any specific topics.
I suggest to take a look at IndexedRDD implementation, which provides updatable RDD of key value pairs. That might give you some insights.
The idea is based on the knowledge of the key and that allows you to zip your updated chunk of data with the same keys of already created RDD. During update it's possible to filter out previous version of the data.
Having historical data, I'd say you have to have sort of identity of an event.
Regarding streaming and consumption, it's possible to use TCP port. This way the driver might open a TCP connection spark expects to read from and sends updates there.

Resources