How to define a global read\write variables in Spark - apache-spark

Spark has broadcast variables, which are read only, and accumulator variables, which can be updates by the nodes, but not read. Is there way - or a workaround - to define a variable which is both updatable and can be read?
One requirement for such a read\write global variable would be to implement a cache. As files are loaded and processed as rdd's, calculations are performed. The results of these calculations - happening in several nodes running in parallel - need to be placed into a map, which has as it's key some of the attributes of the entity being processed. As subsequent entities within the rdd's are processed, the cache is queried.
Scala does have ScalaCache, which is a facade for cache implementations such as Google Guava. But how would such a cache be included and accessed within a Spark application?
The cache could be defined as a variable in the driver application which creates the SparkContext. But then there would be two issues:
Performance would presumably be bad because of the network overhead
between the nodes and the driver application.
To my understanding, each rdd will be passed a copy of the variable
(cache in this case) when the variable is first accessed by the
function passed to the rdd. Each rdd would have it's own copy, not access to a shared global variable .
What is the best way to implement and store such a cache?
Thanks

Well, the best way of doing this is not doing it at all. In general Spark processing model doesn't provide any guarantees* regarding
where,
when,
in what order (excluding of course the order of transformations defined by the lineage / DAG)
and how many times
given piece of code is executed. Moreover, any updates which depend directly on the Spark architecture, are not granular.
These are the properties which make Spark scalable and resilient but at the same this is the thing that makes keeping shared mutable state very hard to implement and most of the time completely useless.
If all you want is a simple cache then you have multiple options:
use one of the methods described by Tzach Zohar in Caching in Spark
use local caching (per JVM or executor thread) combined with application specific partitioning to keep things local
for communication with external systems use node local cache independent of Spark (for example Nginx proxy for http requests)
If application requires much more complex communication you may try different message passing tools to keep synchronized state but in general it requires a complex and potentially fragile code.
* This partially changed in Spark 2.4, with introduction of the barrier execution mode (SPARK-24795, SPARK-24822).

Related

Hazelcast load data from RDBMS in client-server topology

I am using client-server topology for hazelcast cache. I have multiple maps which I load eagerly using MapLoaders. When there is a cache miss- Maploader's load(key) method is called. The MapLoader.load(key) method seems to be executed by the partition thread, which means that all other operations on the partition are blocked until loading is done. A very common use case for the MapLoader is to load data from a DB, which arguably can take some time. So what is the best possible approach to take so that other operations on partition are not blocked when laod is taking place? Is there any other way to load missing data at runtime?(Hazelcast version : Hazelcast 4.0.3)
There's a good answer to this question that gives a few options.
MapLoader.load(key) only loads a single entry, but if the remote source is really slow or there are lots of cache misses it's going to mount up.
Another alternative to #mike-yawn 's answer would be to have a Runnable that fetches needed items from the database and writes them directly into the map. You can still have the MapLoader.load(key) as well, but the chances of cache miss are reduced if your fetcher code is good at predicting what entries will be needed.
If you don't cache 100% of the records, then a cache miss is inevitable. If it's punitively slow you could always return a Entry.Value that contains some sort of flag that it's a placeholder and launch a thread to do the actual load. Then your code has to deal with that placeholder and try again later -- noting that when it tries later the eventual result of the database query could be no record found.

Is it possible for Spark worker nodes broadcast variables?

I have a set of large variables that I broadcast. These variables are loaded from a clustered database. Is it possible to distribute the load from the database across worker nodes and then have each one broadcast their specific variables to all nodes for subsequent map operations?
Thanks!
Broadcast variables are generally passed to workers, but I can tell you what I did in a similar case in python.
If you know the total number of rows, you can try to create an RDD of that length and then run a map operation on it (which will be distributed to workers). In the map, the workers are running a function to get some piece of data (not sure how you are going to make them all get different data).
Each worker would retrieve required data through making the calls. You could then do a collectAsMap() to get a dictionary and broadcast that to all workers.
However keep in mind that you need all software dependencies of making client requests on each worker. You also need to keep socket usage in mind. I just did something similar with querying an API and did not see a rise in sockets, although I was making regular HTTP requests. Not sure....
Ok, so the answer it seems is no.
Calling sc.broadcast(someRDD) results in an error. You have to collect() it back to the driver first.

Spark: Importing Data

I currently have a spark app that reads a couple of files and forms a data frame out of them and implements some logic on the data frames.
I can see the number and size of these files growing by a lot in the future and wanted to understand what goes on behind the scenes to be able to keep up with this growth.
Firstly, I just wanted to double check that since all machines on the cluster can access the files (which is a requirement by spark), the task of reading in data from these files is distributed and no one machine is burdened by it?
I was looking at the Spark UI for this app but since it only shows what actions were performed by which machines and since "sc.textFile(filePath)" is not an action I couldn't be sure what machines are performing this read.
Secondly, what advantages/disadvantages would I face if I were to read this data from a database like Cassandra instead of just reading in files?
Thirdly, in my app I have some code where I perform a collect (val treeArr = treeDF.collect()) on the dataframe to get an array and then I have some logic implemented on those arrays. But since these are not RDDs, how does Spark distribute this work? Or does it distribute them at all?
In other words, should I be doing maximum amount of my work transforming and performing actions on RDDs than converting them into arrays or some other data structure and then implementing the logic like I would in any programming language?
I am only about two weeks into Spark so I apologize if these are stupid questions!
Yes, sc.textFile is distributed. It even has an optional minPartitions argument.
This question is too broad. But the short answer is that you should benchmark it for yourself.
collect fetches all the data to the master. After that it's just a plain array. Indeed the idea is that you should not use collect if you want to perform distributed computations.

Spring batch difference between Multithreading vs partitioning

I cannot understand the difference between multi-threading and partitioning in Spring batch. The implementation is of course different: In partitioning you need to prepare the partitions then process it. I want to know what is the difference and which one is more efficient way to process when the bottleneck is the item-processor.
TL;DR;
Neither approach is intended to help when the bottleneck is in the processor. You will see some gains by having multiple items going through a processor at the same time, but both of the options you point out get their full benefits when used in processes that are I/O bound. The AsyncItemProcessor/AsyncItemWriter may be a better option.
Overview of Spring Batch Scalability
There are five options for scaling Spring Batch jobs:
Multithreaded step
Parallel steps
Partitioning
Remote chunking
AsyncItemProcessor/AsyncItemWriter
Each has it's own benefits and disadvantages. Let's walk through each:
Multithreaded step
A multithreaded step takes a single step and executes each chunk within that step on a separate thread. This means that the same instances of each of the batch components (readers, writers, etc) are shared across the threads. This can increase performance by adding some parallelism to the step at the cost of restartability in most cases. You sacrifice restartability because in most cases, the ability to restart is based on the state maintained within the reader/writer/etc. With multiple threads updating that state, it becomes invalid and useless for restart. Because of this, you typically need to turn save state off on individual components and set the restartable flag to false on the job.
Parallel steps
Parallel steps are achieved via a split. It allows you to execute multiple, independent steps in parallel via threads. This does not sacrifice restartability, but does not help improve the performance of a single step or piece of business logic.
Partitioning
Partitioning is the dividing of data, in advance, into smaller chunks (called partitions) by a master step and then having slaves work independently on the partitions. In Spring Batch, both the master and each slave, is an independent step so you can get the benefits of parallelism within a single step without sacrificing restartability. Partitioning also provides the ability to scale beyond a single JVM in that the slaves do not have to be local (you can use various communication mechanisms to communicate with remote slaves).
An important note about partitioning is that the only communication between the master and slave is a description of the data and not the data itself. For example, the master may tell slave1 to process records 1-100, slave2 to process records 101-200, etc. The master does not send the actual data, only the information required for the slave to obtain the data it is supposed to process. Because of this, the data must be local to the slave processes and the master can be located anywhere.
Remote chunking
Remote chunking allows you to scale the process and optionally the write logic across JVMs. In this use case, the master reads the data and then sends it over the wire to the slaves where it is processed and then either written locally to the slave or returned to the master for writing local to the master.
The important difference between partitioning and remote chunking is that instead of a description going over the wire, remote chunking sends the actual data over the wire. So instead of a single packet saying process records 1-100, remote chunking is going to send the actual records 1-100. This can have a large impact on the I/O profile of a step, but if the processor is enough of a bottleneck, this can be useful.
AsyncItemProcessor/AsyncItemWriter
The final option for scaling Spring Batch processes is the AsyncItemProcessor/AsycnItemWriter combination. In this case, the AsyncItemProcessor wraps your ItemProcessor implementation and executes the call to your implementation in a separate thread. The AsyncItemProcessor then returns a Future that is passed to the AsyncItemWriter where it is unwrapped and passed to the delegate ItemWriter implementation.
Because of the nature of how data flows through this option, certain listener scenarios are not supported (since we don't know the outcome of the ItemProcessor call until inside the ItemWriter) but overall, it can provide a useful tool for parallelizing just the ItemProcessor logic in a single JVM without sacrificing restartability.

Is it possible to get and use a JavaSparkContext from within a task?

I've come across a situation where I'd like to do a "lookup" within a Spark and/or Spark Streaming pipeline (in Java). The lookup is somewhat complex, but fortunately, I have some existing Spark pipelines (potentially DataFrames) that I could reuse.
For every incoming record, I'd like to potentially launch a spark job from the task to get the necessary information to decorate it with.
Considering the performance implications, would this ever be a good idea?
Not considering the performance implications, is this even possible?
Is it possible to get and use a JavaSparkContext from within a task?
No. The spark context is only valid on the driver and Spark will prevent serialization of it. Therefore it's not possible to use the Spark context from within a task.
For every incoming record, I'd like to potentially launch a spark job
from the task to get the necessary information to decorate it with.
Considering the performance implications, would this ever be a good
idea?
Without more details, my umbrella answer would be: Probably not a good idea.
Not considering the performance implications, is this even possible?
Yes, probably by bringing the base collection to the driver (collect) and iterating over it. If that collection doesn't fit in memory of the driver, please previous point.
If we need to process every record, consider performing some form of join with the 'decorating' dataset - that will be only 1 large job instead of tons of small ones.

Resources