kafka streaming behaviour for more than one partition - apache-spark

I am consuming from a Kafka topic. This topic has 3 partitions.
I am using foreachRDD to process each batch RDD (using a processData method to process each RDD, and ultimately create a DataSet from it).
Now, you can see that I have a count variable, and I am incrementing it in the processData method to check how many records I have actually processed. (I understand each RDD is a collection of Kafka topic records, and the number depends on the batch interval size.)
Now, the output is something like this:
1 1 1 2 3 2 4 3 5 ....
This makes me think it's because I might have 3 consumers (as I have 3 partitions), and each of these will call the foreachRDD method separately, so the same count is being printed more than once, as each consumer might have cached its copy of count.
But the final output DataSet that I get has all the records.
So, does Spark internally union all the data? How does it work out what to union?
I am trying to understand the behaviour so that I can form my logic.
int count = 0;

messages.foreachRDD(new VoidFunction<JavaRDD<ConsumerRecord<K, V>>>() {
    @Override
    public void call(JavaRDD<ConsumerRecord<K, V>> rdd) {
        System.out.println("Number of elements in RDD : " + rdd.count());

        // processData returns a List<Row> per record; reduce merges all of them
        List<Row> rows = rdd.map(record -> processData(record))
                            .reduce((rows1, rows2) -> {
                                rows1.addAll(rows2);
                                return rows1;
                            });

        StructType schema = DataTypes.createStructType(fields);
        Dataset<Row> ds = ss.createDataFrame(rows, schema);
        ds.createOrReplaceTempView("trades");
        ds.show();
    }
});

The assumptions are not completely accurate.
foreachRDD is one of the so-called output operations in Spark Streaming. The function of output operations is to schedule the provided closure at the interval dictated by the batch interval. The code in that closure executes once each batch interval on the Spark driver; it is not distributed across the cluster.
In particular, foreachRDD is a general purpose output operation that provides access to the underlying RDD within the DStream. Operations applied on that RDD will execute on the Spark cluster.
So, coming back to the code of the original question, code in the foreachRDD closure such as System.out.println("Number of elements in RDD : " + rdd.count()); executes on the driver. That's also the reason why we can see the output in the console. Note that the rdd.count() in this print will trigger a count of the RDD on the cluster, so count is a distributed operation that returns a value to the driver; then, on the driver, the print operation takes place.
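To make that split concrete, here is a minimal Scala sketch (it reuses the question's messages stream and processData method as assumptions; the comments mark where each piece runs):

messages.foreachRDD { rdd =>
  // This closure is scheduled once per batch interval and runs on the driver.
  val total = rdd.count()                        // count() runs as a distributed job; only the number comes back
  println(s"Number of elements in RDD : $total") // printed on the driver, hence visible in its console

  // Anything applied to the RDD itself runs as tasks on the executors:
  val mapped = rdd.map(record => processData(record))
}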
Now comes a transformation of the RDD:
rdd.map(record -> processData(record))
As we mentioned, operations applied to the RDD will execute on the cluster. And that execution will take place following the Spark execution model; that is, transformations are assembled into stages and applied to each partition of the underlying dataset. Given that we are dealing with 3 partitions of the Kafka topic, we will have 3 corresponding partitions in Spark. Hence, processData will be applied once to each partition.
So, does Spark internally union all the data? How does it work out what to union?
In the same way that we have output operations for Spark Streaming, we have actions for Spark. Actions will potentially apply an operation to the data and bring the results to the driver. The simplest is collect, which brings the complete dataset to the driver, with the risk that it might not fit in memory. Another common action, count, returns the number of records in the dataset as a single number to the driver.
In the code above, we are using reduce, which is also an action: it applies the provided function and brings the resulting data to the driver. It is that action which "internally unions all the data", as expressed in the question. In the reduce expression, we are actually collecting all the data that was distributed into a single local collection. It would be equivalent to doing: rdd.map(record -> processData(record)).collect()
If the intention is to create a Dataset, we should avoid "moving" all the data to the driver first.
A better approach would be:
val rows = rdd.map(record => processData(record))
val df = ss.createDataFrame(rows, schema)
...
In this case, the data of all partitions will remain local to the executor where they are located.
Note that moving data to the driver should be avoided. It is slow, and with large datasets it will probably crash the job, as the driver typically cannot hold all the data available in a cluster.
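Putting that together, a hedged sketch of the whole foreachRDD body that keeps the data distributed could look like the following (it assumes processData is adapted to return a single Row per Kafka record; use flatMap instead of map if it returns several):

messages.foreachRDD { rdd =>
  // The rows stay on the executors; no reduce/collect to the driver.
  val rows = rdd.map(record => processData(record))

  val ds = ss.createDataFrame(rows, schema)
  ds.createOrReplaceTempView("trades")
  ds.show()   // show() only brings a small sample of rows to the driver
}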

Related

Can a long-running map operation on one partition delay processing on other partitions in Spark?

I have a mapGroupsWithState function that I would like to add some additional functionality to based on the groupBy key. Roughly it would look like this:
dataFrame
  .as[Log]
  .groupByKey(_.id)
  .mapGroupsWithState(GroupStateTimeout.NoTimeout())(processData)
  .writeStream
  .trigger(Trigger.ProcessingTime(s"$x seconds"))
  .outputMode(OutputMode.Update())
  .foreachBatch(postProcess _)
  .start()

def processData(id: String, logs: Iterator[Log], oldState: GroupState[Checkpoint]): Array[Log] = {
  if (f(id)) {
    // long running operation
  } else {
    ...
  }
}
My dataframe is partitioned by the id field. I realize that since this if() branch is long running, it may delay the processing of other batches of data whose id maps to the same partition. However, some concern was raised as to whether this long-running operation could also delay the processing of data batches on other partitions. I'm not sure how Spark handles batches of data when taking the output of mapGroupsWithState and then passing it to foreachBatch; am I in danger of delaying data output on all partitions with this setup? It seems counterintuitive to me that delays on one partition could affect another, but I'd like to be sure.
mapGroupsWithState runs at the Executor level.
This means parallel operations: an Executor core is assigned to a Task servicing a given partition for the duration of the processing of that partition.
Assuming you have enough Executors, and thus cores, there should be no issue; but of course, if you have only 1 Executor with 1 core available to your app, then you would get an issue at the Task level.
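As a hedged illustration (the property names are standard Spark settings, the values are placeholders), the parallelism available to the stateful stage is bounded by the executor resources requested for the application:

import org.apache.spark.sql.SparkSession

// Minimal sketch: enough total cores that one long-running group does not
// leave the tasks of other state partitions queued behind it.
val spark = SparkSession.builder()
  .appName("mapGroupsWithState-parallelism")
  .config("spark.executor.instances", "3")      // placeholder: number of executors
  .config("spark.executor.cores", "2")          // placeholder: cores per executor
  .config("spark.sql.shuffle.partitions", "6")  // number of state partitions (default 200)
  .getOrCreate()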

What algorithm spark uses to bring same keys together

What algorithm does Spark use to identify matching keys and push the data to the next stage?
Scenarios include:
When I apply distinct(), I know a pre-distinct is applied in the current stage and then the data is shuffled to the next stage. In this case, all the matching keys need to be in the same partition in the next stage.
When Dataset1 joins with Dataset2 (SortMergeJoin), all the matching keys in Dataset1 and Dataset2 need to be in the same partition in the next stage.
There are other scenarios as well, but the overall picture is this:
How does Spark do this efficiently? And will there be any time lag between Stage 1 and Stage 2 when identifying the matching keys?
The algorithm Spark uses to partition the data is hash partitioning by default. Also, stages don't push data; they pull it from the previous stage.
Spark creates a stage boundary whenever a shuffle is needed. The second stage will wait until all the tasks in the first stage complete and write their output to temporary files. The second stage then starts pulling the data needed for its partitions from across the partitions written in stage 1.
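As a rough sketch (this mirrors what Spark's HashPartitioner does, simplified), the bucket a record is written to in stage 1, and therefore which stage-2 task pulls it, is derived purely from the key's hash:

// Simplified view of hash partitioning: every stage-1 task buckets its output
// by this rule, and each stage-2 task pulls exactly one bucket id from all of
// the stage-1 outputs, so equal keys always meet in the same partition.
def partitionFor(key: Any, numPartitions: Int): Int = {
  val mod = key.hashCode % numPartitions
  if (mod < 0) mod + numPartitions else mod   // keep the result non-negative
}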
Distinct, as you can see, isn't as simple as it looks. Spark does distinct by applying aggregates. Shuffling is needed because duplicates can be in multiple partitions. One of the conditions for shuffling is that Spark needs a pair RDD, and if your parent isn't one, it will create intermediary pair RDDs.
If you see the logical plan of Distinct, it would be more or less like
Parent RDD ---> Mapped RDD (record as key and null values) ---> MapPartitionsRDD (running distinct at partition level) ----> Shuffled RDD (pulling needed partitions data) ----> MapPartitionsRDD (distinct from segregated partitions for each key) ----> Mapped RDD (collecting only keys and discarding null values for result)
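In RDD terms, distinct is roughly equivalent to the following sketch (a hedged re-implementation, not the exact Spark source, run here on a small in-memory sample):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("distinct-sketch").master("local[*]").getOrCreate()
val sc = spark.sparkContext

val rdd = sc.parallelize(Seq("a", "b", "a", "c", "b"))

// distinct() expressed with pair-RDD operations: duplicates are partially
// collapsed on the map side of reduceByKey, then the shuffle brings equal
// keys into the same partition, where the remaining duplicates collapse.
val distinctLike = rdd
  .map(x => (x, null))        // record as key, null as value
  .reduceByKey((x, _) => x)   // keep one value per key
  .map(_._1)                  // keep only the keys, discard the nulls

distinctLike.collect().foreach(println)   // a, b, c (in some order)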
Spark uses RDD dependencies to work out how data is shuffled to the next stage, and determining which dependency applies is a fairly complex process.
The getDependencies function in RDD.scala describes how an RDD depends on its parents:
/**
* Implemented by subclasses to return how this RDD depends on parent RDDs. This method will only
* be called once, so it is safe to implement a time-consuming computation in it.
*/
protected def getDependencies: Seq[Dependency[_]] = deps
Some RDDs don't have a parent RDD to read from, so their compute doesn't need to fetch data from a parent; a data source RDD is an example.
A shuffled row RDD usually appears in the middle of a compute chain, so it usually does have parent data to fetch.
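If you want to see these dependencies and the resulting stage boundary for a concrete job, a quick hedged sketch (assuming a SparkContext sc is already in scope) is to inspect the lineage:

val pairs = sc.parallelize(1 to 10).map(i => (i % 3, i))
val summed = pairs.reduceByKey(_ + _)         // forces a shuffle, hence a new stage

println(summed.toDebugString)                 // indentation marks the shuffle boundary
summed.dependencies.foreach(d => println(d.getClass.getSimpleName))   // e.g. ShuffleDependency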

Spark Streaming appends to S3 as Parquet format, too many small partitions

I am building an app that uses Spark Streaming to receive data from Kinesis streams on AWS EMR. One of the goals is to persist the data into S3 (EMRFS), and for this I am using a 2-minute non-overlapping window.
My approach:
Kinesis Stream -> Spark Streaming with a batch duration of about 60 seconds, using a non-overlapping window of 120s, and save the streamed data into S3 as:
val rdd1 = kinesisStream.map(record => /* decode the data */)

rdd1.window(Seconds(120), Seconds(120)).foreachRDD { rdd =>
  val spark = SparkSession...
  import spark.implicits._

  // convert the rdd to a DataFrame
  val df = rdd.toDF(columnNames: _*)
  df.write.parquet("s3://bucket/20161211.parquet")
}
Here is what s3://bucket/20161211.parquet looks like after a while (screenshot of the output directory omitted):
As you can see, lots of fragmented small partitions (which is horrendous for read performance)... the question is, is there any way to control the number of small partitions as I stream data into this S3 parquet file?
Thanks
What I am thinking of doing is, each day, something like this:
val df = spark.read.parquet("s3://bucket/20161211.parquet")
df.coalesce(4).write.parquet("s3://bucket/20161211_4parition.parquet")
where I kind of repartition the dataframe into 4 partitions and save them back...
It works, but I feel that doing this every day is not an elegant solution...
That's actually pretty close to what you want to do; each partition will get written out as an individual file in Spark. However, coalesce is a bit confusing, since it can (effectively) apply upstream of where it is called. The warning from the Scala doc is:
However, if you're doing a drastic coalesce, e.g. to numPartitions = 1,
this may result in your computation taking place on fewer nodes than
you like (e.g. one node in the case of numPartitions = 1). To avoid this,
you can pass shuffle = true. This will add a shuffle step, but means the
current upstream partitions will be executed in parallel (per whatever
the current partitioning is).
With Datasets it's a bit easier to persist and count to force a wide evaluation, since the default coalesce function doesn't take repartition as a flag for input (although you could construct an instance of Repartition manually).
Another option is to have a second periodic batch job (or even a second streaming job) that cleans up/merges the results, but this can be a bit complicated as it introduces a second moving part to keep track of.
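If shrinking the file count inside the streaming job itself is acceptable, a hedged sketch (reusing the names from the question; the partition count is a placeholder) is to coalesce each window just before writing and append to the same dataset:

import org.apache.spark.sql.{SaveMode, SparkSession}

rdd1.window(Seconds(120), Seconds(120)).foreachRDD { rdd =>
  val spark = SparkSession.builder().getOrCreate()
  import spark.implicits._

  val df = rdd.toDF(columnNames: _*)

  // Fewer output partitions => fewer, larger parquet files per window.
  // 4 is a placeholder; size it to the data volume of one window.
  df.coalesce(4)
    .write
    .mode(SaveMode.Append)
    .parquet("s3://bucket/20161211.parquet")
}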

how spark master splits input data to different nodes

I'm using Spark with MongoDB, and I want to know how the input RDD is split across the different worker nodes in the cluster. My job is to club two records (one a request, the other a response) into one, based on the msg_id and flag fields (the flag indicates request or response); msg_id is the same in both records. While Spark splits the input RDD, with each split going to a node, how do I handle the case where the request record lands on one node and the response record on another?
Firstly, the Spark master does not split data. It just controls the workers.
Secondly, RDD splits (while reading from external sources) are decided by InputSplits, implemented through the input format. This part is fairly similar to MapReduce. So in your case, the RDD splits (or partitions, in Spark terms) are decided by the MongoDB input format.
In your case, I believe what you are looking for is to co-locate all records for a msg_id on one node. That can be achieved with partitionBy and a key-based partitioner, as sketched below.
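A hedged sketch of that idea (inputRDD and the partition count are placeholders): key the records by msg_id and repartition by that key so that the request and the response for the same id end up in the same partition:

import org.apache.spark.HashPartitioner

// (msg_id, whole record) pairs, then force a key-based partition layout.
val keyed = inputRDD.map(line => (line.split(",")(0), line))
val colocated = keyed.partitionBy(new HashPartitioner(8))   // 8 is a placeholder

// All records for a given msg_id are now in exactly one partition, so the
// request/response pair can be combined per partition or per key.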
RDDs will be built based on your transformations (subject to the scenario), and the master has little room to play a role here. Refer to this link: How does Spark paralellize slices to tasks/executors/workers?
In your case, you may need to apply a groupBy() or groupByKey() (the latter is not recommended) transformation to group your values based on the key (msg_id).
For example
val baseRDD = sc.parallelize(Array("1111,REQUEST,abcd","1111,RESPONSE,wxyz","2222,REQUEST,abcd","2222,RESPONSE,wxyz"))
//convert your base rdd to keypair RDD
val keyValRDD = baseRDD.map { line => (line.split(",")(0), line) }
//Group it by message_id
val groupedRDD = keyValRDD.groupBy(keyvalue => keyvalue._1)
groupedRDD.saveAsTextFile("c:\\result")
Result :
(1111,CompactBuffer((1111,1111,REQUEST,abcd), (1111,1111,RESPONSE,wxyz)))
(2222,CompactBuffer((2222,2222,REQUEST,abcd), (2222,2222,RESPONSE,wxyz)))
In the above case, all the values for a given key end up in the same partition after the shuffle, regardless of data volume and the computing resources available at run time.
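Since groupByKey ships every value through the shuffle, a hedged alternative for this specific pairing use case is reduceByKey, which merges the request and response per key and moves already-combined values across the network (sketch, reusing keyValRDD from above; the separator is arbitrary):

// Combine the two records per msg_id while shuffling already-merged values.
val merged = keyValRDD.reduceByKey((left, right) => left + "|" + right)

merged.collect().foreach(println)
// e.g. (1111,1111,REQUEST,abcd|1111,RESPONSE,wxyz)   -- merge order is not guaranteed
// e.g. (2222,2222,REQUEST,abcd|2222,RESPONSE,wxyz)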

Does updateStateByKey in Spark shuffle the data across

I'm a newbie in Spark and I would like to understand whether I need to aggregate the DStream data by key before calling updateStateByKey.
My application basically counts the number of words every second using Spark Streaming, where I perform a couple of map operations before doing a stateful update, as follows:
val words = inputDstream.flatMap(x => x.split(" "))
val wordDstream = words.map(x => (x, 1))
val stateDstream = wordDstream.updateStateByKey(UpdateFunc _)
stateDstream.print()
Say after the second map operation, the same keys (words) might be present across worker nodes due to the various partitions, so I assume that the updateStateByKey method internally shuffles and aggregates the values of a key as Seq[Int] and calls the updateFunc. Is my assumption correct?
Correct: as you can see in the method signature, it takes an optional partitionNum/Partitioner argument, which denotes the number of reducers, i.e. state updaters. This leads to a shuffle.
Also, I suggest explicitly putting a number there, otherwise Spark may significantly decrease your job's parallelism while trying to run tasks locally with respect to the location of the blocks of the HDFS checkpoint files.
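For example, a hedged sketch of passing an explicit partition count (16 is a placeholder; there is also an overload that takes a Partitioner instead of a number):

// Fix the number of partitions for the state RDD rather than letting Spark
// derive it from the locality of the checkpoint blocks.
val stateDstream = wordDstream.updateStateByKey(UpdateFunc _, 16)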
updateStateByKey() does not shuffle the state; rather, the new data is brought to the nodes containing the state for the same key.
Link to Tathagata's answer to a similar question: https://www.mail-archive.com/user#spark.apache.org/msg43512.html
