I have two sources; they can be different types of sources (databases or files) or the same type.
Dataset1 = source1.load;
Dataset2 = source2.load;
Will Spark load the data into the different datasets in parallel, or will it load them sequentially?
Actions occur sequentially. Your question ... will it load in parallel into different datasets ... has the answer: sequentially, because these are separate Actions.
The data pipelines required for an Action, including its Transformations, occur in parallel where possible. E.g. creating a DataFrame from 4 loads that are subject to a union, say, will cause those loads to occur in parallel, provided enough Executors (slots) can be allocated.
So, as the comment also states, you need an Action, and the DAG path will determine the flow and any parallelism that can be applied. You can see that in the Spark UI.
To demonstrate:
rdd1 = get some data
rdd2 = get some other data
rdd3 = get some other other data
rddA = rdd1 union rdd2 union rdd3
rddA.toDF.write ...
// followed by
rdd1' = get some data
rdd2' = get some other data
rdd3' = get some other other data
rddA' = rdd1' union rdd2' union rdd3'
rddA'.toDF.write ...
rddA'.toDF.write ... will occur after rddA.toDF.write ... completes. None of the rdd1', rdd2', and rdd3' Transformations occur in parallel with rddA.toDF.write's Transformations / Action; that cannot be the case. This means that if you want write parallelism you need two separate Spark apps running concurrently, provided resources allow that, of course.
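For illustration, here is a minimal Scala sketch of the union case mentioned earlier (the paths and the parquet format are only placeholders): four loads feed a single union, and the single Action at the end lets Spark scan the four inputs in parallel, subject to available Executor slots.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("parallel-loads").getOrCreate()

// Four loads feeding a single union; the paths are placeholders.
val parts = Seq("/data/part1", "/data/part2", "/data/part3", "/data/part4")
  .map(p => spark.read.parquet(p))

// One Action at the end: all four scans belong to one job's DAG, so they can
// run in parallel if enough Executor slots are available.
val unioned = parts.reduce(_ union _)
unioned.write.mode("overwrite").parquet("/data/out")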
Related
I'm trying to read a list of directories each into its own dataframe. E.g.
dir_list = ['dir1', 'dir2', ...]
df1 = spark.read.csv(dir_list[0])
df2 = spark.read.csv(dir_list[1])
...
Each directory has data of varying schemas.
I want to do this in parallel, so just a simple for loop won't work. Is there any way of doing this?
Yes there is :))
"How" will depend on what kind of processing you do after read, because by itself the spark.read.csv(...) won't execute until you call an action (due to Spark's lazy evaluation) and putting multiple reads in a for loop will work just fine.
So, if the results of evaluating multiple dataframes have the same schema, parallelism can be simply achieved by UNION'ing them. For example,
df1 = spark.read.csv(dir_list[0])
df2 = spark.read.csv(dir_list[1])
from pyspark.sql.functions import lit

(df1.withColumn("dfid", lit("df1")).groupBy("dfid").count()
 .union(df2.withColumn("dfid", lit("df2")).groupBy("dfid").count())
 .show(truncate=False))
... will cause both dir_list[0] and dir_list[1] to be read in parallel.
If this is not feasible, then there is always a Spark Scheduling route:
Inside a given Spark application (SparkContext instance), multiple parallel jobs can run simultaneously if they were submitted from separate threads. By "job", in this section, we mean a Spark action (e.g. save, collect) and any tasks that need to run to evaluate that action.
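As a rough sketch of that route (written in Scala for brevity; the directory list, the output paths, and an existing spark session are assumptions), each action submitted from its own thread becomes a separate job that the scheduler can run concurrently:
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration

// Each Future triggers its own action from its own thread, so the reads
// become independent jobs that the Spark scheduler may run concurrently.
val dirList = Seq("dir1", "dir2")

val jobs = dirList.map { dir =>
  Future {
    spark.read.csv(dir).write.mode("overwrite").parquet(s"/tmp/out/$dir")
  }
}

Await.result(Future.sequence(jobs), Duration.Inf)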
What am I trying to do:
read a large terabyte-size RDD
filter it using a broadcast variable; this reduces it to a few gigabytes
join the filtered RDD with another RDD, which is a few gigabytes too
persist the join result and reuse it multiple times
Expectation:
join executed once
join result persisted
join result reused several times w/o recomputation
IRL:
join is recomputed several times.
half of the entire job runtime is spent re-computing the same thing several times.
My pseudo-code:
val nonPartitioned = sparkContext.readData("path")
val terabyteSizeRDD = nonPartitioned
  .keyBy(_.joinKey)
  .partitionBy(new HashPartitioner(nonPartitioned.getNumPartitions))

// filters down to a few gigabytes
val filteredTerabyteSizeRDD = terabyteSizeRDD.mapPartitions(filterAndMapPartitionFunc, preservesPartitioning = true)

val (joined, count) = {
  val result = filteredTerabyteSizeRDD
    .leftOuterJoin(anotherFewGbRDD, filteredTerabyteSizeRDD.partitioner.get)
    .map(mapJoinRecordFunc)
  result.persist()
  result -> result.count()
}
DAG says that join is executed several times
first time
another time for the .count(); I don't know another way to trigger the persist
three more times, since the code uses joined three times to create other RDDs.
How can I align expectation and reality?
You can cache or persist data in Spark with df.cache() or df.persist(). If you use persist you have further options than with cache. If you use persist without an argument it's just like a simple cache(), see here. Why don't you cache your filteredTerabyteSizeRDD? It should fit in memory if it's just a few GB. If it doesn't fit in memory, you could try filteredTerabyteSizeRDD.persist(StorageLevel.MEMORY_AND_DISK).
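A minimal sketch of that suggestion, reusing the names from the question (the storage level is just this answer's assumption):
import org.apache.spark.storage.StorageLevel

// Persist the filtered RDD (a few GB) so the join input is not rebuilt from
// the terabyte-sized source each time downstream work runs.
filteredTerabyteSizeRDD.persist(StorageLevel.MEMORY_AND_DISK)

// The join result itself can stay persisted as in the question; the count
// simply forces materialization once before 'joined' is reused.
val joined = filteredTerabyteSizeRDD
  .leftOuterJoin(anotherFewGbRDD, filteredTerabyteSizeRDD.partitioner.get)
  .map(mapJoinRecordFunc)
joined.persist(StorageLevel.MEMORY_AND_DISK)
joined.count()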
Hope I could answer your question.
Suppose we start from some data and get some intermediate result df_intermediate. Along the pipeline from the source data to df_intermediate, all transformations are lazy and nothing is actually computed.
Then I would like to perform two different transformations to df_intermediate. For example, I would like to calculate df_intermediate.agg({"col":"max"}) and df_intermediate.approxquantile("col", [0.1,0.2,0.3], 0.01) using two separate commands.
I wonder, in the following scenario, does Spark need to recompute df_intermediate when it is performing the second transformation? In other words, does Spark perform the calculations for the above two transformations both starting from the raw data, without storing the intermediate result? Obviously I can cache the intermediate result, but I'm just wondering if Spark does this kind of optimization internally.
It is somewhat disappointing. But firstly you need to see it in terms of Actions. I will not consider the caching.
If you do the following there will be optimization for sure.
val df1 = df0.withColumn(...
val df2 = df1.withColumn(...
Your example needs an Action like count to work. But the two statements are too diverse, so there is no skipped processing evident; there is thus no sharing.
In general, Action = Job is the correct way to look at it. For DFs, the Catalyst Optimizer can kick a Job off even though you may not realize it. For RDDs (legacy) this was a little different.
This does not get optimized either:
import org.apache.spark.sql.functions._
val df = spark.range(1,10000).toDF("c1")
val df_intermediate = df.withColumn("c2", col("c1") + 100)
val x = df_intermediate.agg(max("c2"))
val y = df_intermediate.agg(min("c2"))
val z = x.union(y).count
x and y both go back to the source. One would have thought that would be easier to optimize, and it is also only 1 Action here. You would need to do the .explain to confirm, but the idea is to leave it to Spark due to lazy evaluation, etc.
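If the recomputation matters, the cache the question mentions is the straightforward fix. A minimal sketch based on the same toy example (the exact sharing behaviour still depends on how the two branches are scheduled):
import org.apache.spark.sql.functions._

val df = spark.range(1, 10000).toDF("c1")
// Mark the intermediate result as cached; it is materialized on first use.
val df_intermediate = df.withColumn("c2", col("c1") + 100).cache()

// The second branch can then read the cached partitions instead of going
// back to the source.
val x = df_intermediate.agg(max("c2"))
val y = df_intermediate.agg(min("c2"))
val z = x.union(y).count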
As an aside: Is it efficient to cache a dataframe for a single Action Spark application in which that dataframe is referenced more than once? & In which situations are the stages of DAG skipped?
I am consuming from a Kafka topic. This topic has 3 partitions.
I am using foreachRDD to process each batch RDD (using a processData method to process each RDD, and ultimately create a Dataset from that).
Now, you can see that I have a count variable, and I am incrementing this count variable in the "processData" method to check how many actual records I have processed. (I understand each RDD is a collection of Kafka topic records, and the number depends on the batch interval size.)
Now, the output is something like this:
1 1 1 2 3 2 4 3 5 ....
This makes me think that it's because I might have 3 consumers (as I have 3 partitions), and each of these will call the "foreachRDD" method separately, so the same count is being printed more than once, as each consumer might have cached its copy of count.
But the final output Dataset that I get has all the records.
So, does Spark internally union all the data? How does it make out what to union?
I am trying to understand the behaviour, so that I can form my logic.
int count = 0;

messages.foreachRDD(new VoidFunction<JavaRDD<ConsumerRecord<K, V>>>() {
    @Override
    public void call(JavaRDD<ConsumerRecord<K, V>> rdd) {
        System.out.println("Number of elements in RDD : " + rdd.count());

        List<Row> rows = rdd.map(record -> processData(record))
                            .reduce((rows1, rows2) -> {
                                rows1.addAll(rows2);
                                return rows1;
                            });

        StructType schema = DataTypes.createStructType(fields);
        Dataset<Row> ds = ss.createDataFrame(rows, schema);
        ds.createOrReplaceTempView("trades");
        ds.show();
    }
});
The assumptions are not completely accurate.
foreachRDD is one of the so-called output operations in Spark Streaming. The function of output operations is to schedule the provided closure at the interval dictated by the batch interval. The code in that closure executes once each batch interval on the Spark driver; it is not distributed in the cluster.
In particular, foreachRDD is a general purpose output operation that provides access to the underlying RDD within the DStream. Operations applied on that RDD will execute on the Spark cluster.
So, coming back to the code of the original question, code in the foreachRDD closure such as System.out.println("Number of elements in RDD : " + rdd.count()); executes on the driver. That's also the reason why we can see the output in the console. Note that the rdd.count() in this print will trigger a count of the RDD on the cluster, so count is a distributed operation that returns a value to the driver; then, on the driver, the print operation takes place.
Now comes a transformation of the RDD:
rdd.map(record -> processData(record))
As we mentioned, operations applied to the RDD will execute on the cluster. That execution will take place following the Spark execution model; that is, transformations are assembled into stages and applied to each partition of the underlying dataset. Given that we are dealing with 3 partitions of the Kafka topic, we will have 3 corresponding partitions in Spark. Hence, processData will be applied once to each partition.
So, does Spark internally union all the data? How does it make out what to union?
The same way we have output operations for Spark Streaming, we have actions for Spark. Actions will potentially apply an operation to the data and bring the results to the driver. The simplest one is collect, which brings the complete dataset to the driver, with the risk that it might not fit in memory. Another common action, count, summarizes the number of records in the dataset and returns a single number to the driver.
In the code above, we are using reduce, which is also an action that applies the provided function and brings the resulting data to the driver. It's the use of that action that "internally unions all the data", as expressed in the question. In the reduce expression, we are actually collecting all the data that was distributed into a single local collection. It would be equivalent to doing: rdd.map(record -> processData(record)).collect()
If the intention is to create a Dataset, we should avoid "moving" all the data to the driver first.
A better approach would be:
JavaRDD<Row> rows = rdd.flatMap(record -> processData(record).iterator());
Dataset<Row> df = ss.createDataFrame(rows, schema);
...
In this case, the data of each partition will remain local to the executor where it is located.
Note that moving data to the driver should be avoided. It is slow and in cases of large datasets will probably crash the job as the driver cannot typically hold all data available in a cluster.
Let's assume for the following that only one Spark job is running at every point in time.
What I get so far
Here is what I understand what happens in Spark:
When a SparkContext is created, each worker node starts an executor.
Executors are separate processes (JVMs) that connect back to the driver program. Each executor has the jar of the driver program. Quitting a driver shuts down the executors. Each executor can hold some partitions.
When a job is executed, an execution plan is created according to the lineage graph.
The execution job is split into stages, where each stage contains as many neighbouring (in the lineage graph) transformations and actions as possible, but no shuffles. Thus stages are separated by shuffles.
I understand that
A task is a command sent from the driver to an executor by serializing the Function object.
The executor deserializes (with the driver jar) the command (task) and executes it on a partition.
but
Question(s)
How do I split the stage into those tasks?
Specifically:
Are the tasks determined by the transformations and actions, or can multiple transformations/actions be in a task?
Are the tasks determined by the partitions (e.g. one task per stage per partition)?
Are the tasks determined by the nodes (e.g. one task per stage per node)?
What I think (only partial answer, even if right)
In https://0x0fff.com/spark-architecture-shuffle, the shuffle is explained with the image
and I get the impression that the rule is
each stage is split into #number-of-partitions tasks, with no regard for the number of nodes
For my first image I'd say that I'd have 3 map tasks and 3 reduce tasks.
For the image from 0x0fff, I'd say there are 8 map tasks and 3 reduce tasks (assuming that there are only three orange and three dark green files).
Open questions in any case
Is that correct? But even if it is, my questions above are not all answered, because it is still open whether multiple operations (e.g. multiple maps) are within one task or are separated into one task per operation.
What others say
What is a task in Spark? How does the Spark worker execute the jar file? and How does the Apache Spark scheduler split files into tasks? are similar, but I did not feel that my question was answered clearly there.
You have a pretty nice outline here. To answer your questions
A separate task does need to be launched for each partition of data for each stage. Consider that each partition will likely reside on distinct physical locations - e.g. blocks in HDFS or directories/volumes for a local file system.
Note that the submission of Stages is driven by the DAG Scheduler. This means that stages that are not interdependent may be submitted to the cluster for execution in parallel: this maximizes the parallelization capability on the cluster. So if operations in our dataflow can happen simultaneously we will expect to see multiple stages launched.
We can see that in action in the following toy example in which we do the following types of operations:
load two datasources
perform some map operation on both of the data sources separately
join them
perform some map and filter operations on the result
save the result
So then how many stages will we end up with?
1 stage each for loading the two datasources in parallel = 2 stages
A third stage representing the join that is dependent on the other two stages
Note: all of the follow-on operations working on the joined data may be performed in the same stage because they must happen sequentially. There is no benefit to launching additional stages because they cannot start work until the prior operation is completed.
Here is that toy program
val sfi = sc.textFile("/data/blah/input").map{ x => val xi = x.toInt; (xi,xi*xi) } // load datasource 1 + map
val sp = sc.parallelize{ (0 until 1000).map{ x => (x,x * x+1) }}                   // load datasource 2 + map
val spj = sfi.join(sp)                                                             // join the two sources
val sm = spj.mapPartitions{ iter => iter.map{ case (k,(v1,v2)) => (k, v1+v2) }}    // map on the joined data
val sf = sm.filter{ case (k,v) => v % 10 == 0 }                                    // filter on the joined data
sf.saveAsTextFile("/data/blah/out")                                                // save the result (the Action)
And here is the DAG of the result
Now: how many tasks? The number of tasks should be equal to
sum over all stages of (#partitions in that stage)
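For instance, assuming (purely hypothetically) that the textFile load produced 4 partitions from its input splits, the parallelized range had 8 partitions, and the post-join stage ran with 8 partitions, the total would be 4 + 8 + 8 = 20 tasks.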
This might help you better understand different pieces:
Stage: a collection of tasks, the same process running against different subsets of the data (partitions).
Task: represents a unit of work on a partition of a distributed dataset. So in each stage, number-of-tasks = number-of-partitions, or as you said, "one task per stage per partition".
Each executor runs in one YARN container, and each container resides on one node.
Each stage utilizes multiple executors; each executor is allocated multiple vcores.
Each vcore can execute exactly one task at a time.
So at any stage, multiple tasks can be executed in parallel: number-of-tasks running = number-of-vcores being used.
If I understand correctly there are 2 ( related ) things that confuse you:
1) What determines the content of a task?
2) What determines the number of tasks to be executed?
Spark's engine "glues" together simple operations on consecutive rdds, for example:
rdd1 = sc.textFile( ... )
rdd2 = rdd1.filter( ... )
rdd3 = rdd2.map( ... )
rdd3RowCount = rdd3.count
so when rdd3 is (lazily) computed, Spark will generate a task per partition of rdd1, and each task will execute both the filter and the map per line to produce rdd3.
The number of tasks is determined by the number of partitions. Every RDD has a defined number of partitions. For a source RDD that is read from HDFS ( using sc.textFile( ... ) for example ) the number of partitions is the number of splits generated by the input format. Some operations on RDD(s) can result in an RDD with a different number of partitions:
rdd2 = rdd1.repartition( 1000 ) will result in rdd2 having 1000 partitions ( regardless of how many partitions rdd1 had ).
Another example is joins:
rdd3 = rdd1.join( rdd2 , numPartitions = 1000 ) will result in rdd3 having 1000 partitions ( regardless of partitions number of rdd1 and rdd2 ).
(Most) operations that change the number of partitions involve a shuffle. When we do, for example:
rdd2 = rdd1.repartition( 1000 )
what actually happens is that the task on each partition of rdd1 needs to produce an end-output that can be read by the following stage so that rdd2 ends up with exactly 1000 partitions (how do they do it? hash or sort). Tasks on this side are sometimes referred to as "Map (side) tasks".
A task that will later run on rdd2 will act on one partition ( of rdd2! ) and would have to figure out how to read/combine the map-side outputs relevant to that partition. Tasks on this side are sometimes referred to as "Reduce ( side ) tasks".
The 2 questions are related: the number of tasks in a stage is the number of partitions ( common to the consecutive rdds "glued" together ) and the number of partitions of an rdd can change between stages ( by specifying the number of partitions to some shuffle causing operation for example ).
Once the execution of a stage commences, its tasks can occupy task slots. The number of concurrent task-slots is numExecutors * ExecutorCores. In general, these can be occupied by tasks from different, non-dependent stages.
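A small sketch of both points; the partition counts and executor sizing below are only assumptions, and rdd1's keys/values are placeholders:
// A hypothetical pair RDD built from the earlier toy input path.
val rdd1 = sc.textFile("/data/blah/input").map(line => (line.hashCode, line))

val rdd2 = rdd1.repartition(1000)
println(rdd2.getNumPartitions)    // 1000: the stage acting on rdd2 runs 1000 tasks

val rdd3 = rdd1.join(rdd2, 1000)  // shuffle-producing join
println(rdd3.getNumPartitions)    // 1000: 1000 reduce-side tasks in that stage

// Concurrent task slots: with, say, 10 executors and 4 cores each,
// numExecutors * ExecutorCores = 40 of those 1000 tasks run at any one time.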