I am trying to implement distributed cache in spark using java. Here I want an Integer variable which is expected to be updated and read at different executor nodes and updated value should be reflected to other executors.
I know that accumulators can be updated on executors but they cannot be read on executors. Only driver program can read their value.
Can anyone please help on this?
Related
I am new to spark and i have read several articles about Closure but I am unable to understand the logic can someone please clarify.
For counters we should use accumulators coz executer cant see the counter variable on driver side in cluster mode and works on local copy...the isnt this the same with any operation, transformation...since executors reside on different nodes then they cant see whats on driver node. What is the difference if I do a collect() then also executors are not able to view driver node variables what happens then?
I thought the executors work on its own partitions and send their data to driver & driver combines all output. Is my understanding wrong?
Another item that I read little about.
Leaving S3 aside, and not in the position just now to try out on a bare metal classic data locality approach to Spark, Hadoop, and not in Dynamic Resource Allocation mode, then:
What if a large dataset in HDFS is distributed over (all) N data nodes in the Cluster, but the total-executor-cores parameter is set lower than N, and we need to read all the data on obviously (all) N relevant Data Nodes?
I assume Spark has to ignore this parameter for reading from HDFS. Or not?
If it is ignored, an Executor Core needs to be allocated on that Data Node and is thus acquired by the overall Job and thus this parameter needs to be interpreted to mean for processing and not for reading blocks?
Is the data from such a Data Node immediately shuffled to where the Executors were allocated?
Thanks in advance.
There seems to be little bit of confusion here.
Optimal Data locality (node local) is something we want to achieve, not guarantee. All Spark can do is request resources (for example with YARN - How YARN knows data locality in Apache spark in cluster mode) and hope that it will get resources, which satisfy data locality constraints.
If it doesn't it will simply fetch data from remote nodes. However it is not shuffle. It just a simple transfer over network.
So to answer your question - Spark will use resource which has been allocated, trying to do its best do satisfy the constraints. It cannot use nodes, which hasn't been acquired, so it won't automatically get additional nodes for reads.
I would like to use Spark structured streaming to watch a drop location that exists on the driver only. I do this with
val trackerData = spark.readStream.text(sourcePath)
After that I would like to parse, filter, and map incoming data and write it out to elastic.
This works well except that it does only work when spark.master is set to e.g. local[*]. When set to yarn, no files get found even when deployment mode is set to client.
I thought that reading data in from local driver node is achieved by setting deployment to client and doing the actual processing and writing within the Spark cluster.
How could I improve my code to use driver for reading in and cluster for processing and writing?
What you want is possible, but not recommended in Spark Structured Streaming in particular and in Apache Spark in general.
The main motivation of Apache Spark is to bring computation to the data not the opposite as Spark is to process petabytes of data that a single JVM (of a driver) would not be able to handle.
The driver's "job" (no pun intended) is to convert a RDD lineage (= a DAG of transformations) to tasks that know how to load a data. Tasks are executed on Spark executors (in most cases) and that's where data processing happens.
There are some ways to make the reading part on driver and processing on executors and among them the most "lucrative" would be to use broadcast variables.
Broadcast variables allow the programmer to keep a read-only variable cached on each machine rather than shipping a copy of it with tasks. They can be used, for example, to give every node a copy of a large input dataset in an efficient manner. Spark also attempts to distribute broadcast variables using efficient broadcast algorithms to reduce communication cost.
One idea that came to my mind is that you could "hack" Spark "Streams" and write your own streaming sink that would do collect or whatever. That could make the processing local.
I have a long-running Spark streaming job which uses 16 executors which only one core each.
I use default partitioner(HashPartitioner) to equally distribute data to 16 partitions. Inside updateStateByKeyfunction, i checked for the partition id from TaskContext.getPartitionId() for multiple batches and found out the partition-id of a executor is quite consistent but still changing to another id after a long run.
I'm planing to do some optimization to spark "updateStateByKey" API, but it can't be achieved if the partition-id keeps changing among batches.
So when does Spark change the partition-id of a executor?
Most probably, the task has failed and restart again, so the TaskContext has changed, and so as the partitionId.
When I submit my spark job into yarn cluster with --num-executers=4 , I can see in the spark UI, 4 executors are allocated in 4 nodes in the cluster. In my spark application I am taking inputs from various HDFS locations in various steps. But the allocated executors remain the same through out the execution.
My doubt is whether spark do anything for data-locality, since the nodes it selects at the very beginning irrespective of where input data situated(at least just in case of HDFS)?
I know map reduce does it in some extent.
Yes, it does. Spark still uses Hadoop InputFormat and RecordReader interfaces and appropriate implementations like i.e. TextInputFormat. So Spark's behaviour in this case is very similar to common MapReduce. Spark driver retrieves block locations of the file and assigns task to executors with regard to data locality.