Spark Streaming not distributing task to nodes on cluster - apache-spark

I have a two-node standalone cluster for Spark stream processing. Below is sample code that demonstrates the process I am executing.
sparkConf.setMaster("spark://rsplws224:7077")
val ssc = new StreamingContext()
println(ssc.sparkContext.master)
val inDStream = ssc.receiverStream              // batch of 500 ms as I would like to have 1 sec latency
val filteredDStream = inDStream.filter          // filtering unwanted tuples
val keyDStream = filteredDStream.map            // converting to a pair DStream
val stateStream = keyDStream.updateStateByKey   // updating state for history
stateStream.checkpoint(Milliseconds(2500))      // to cut the long lineage and materialize the state stream
stateStream.count()
val withHistory = keyDStream.join(stateStream)  // joining state with the input stream for further processing
val alertStream = withHistory.filter            // decision taken by comparing history state and current tuple data
alertStream.foreach                             // notification to other system
My problem is that Spark is not distributing this state RDD to multiple nodes, and not distributing tasks to the other node, which causes high response latency; my input load is around 100,000 tuples per second.
I have tried the following, but nothing is working:
1) set spark.locality.wait to 1 sec
2) reduced the memory allocated to the executor process to check whether Spark distributes the RDD or tasks, but it does not, even when it goes beyond the memory limit of the first node (m1), where the driver is also running
3) increased spark.streaming.concurrentJobs from 1 (default) to 3
4) checked in the streaming UI storage that there are around 20 partitions for the state DStream RDD, all located on the local node m1
If I run SparkPi 100000, Spark is able to utilize the other node after a few seconds (30-40), so I am sure that my cluster configuration is fine.
Edit
One thing I have noticed: even if I set the storage level of my RDD to MEMORY_AND_DISK_SER_2, the app UI storage still shows Memory Serialized 1x Replicated.

Spark will not distribute stream data across the cluster automatically, because it tends to make full use of data locality (launching a task where its data lies is better; this is the default behaviour). But you can use repartition to distribute the stream data and improve parallelism. See http://spark.apache.org/docs/1.3.0/streaming-programming-guide.html#performance-tuning for more information.
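For example, a minimal sketch of that idea against the code in the question (the partition count of 8 is only an illustrative value, roughly 2-3x the total number of executor cores):
// spread the receiver stream over more partitions before the rest of the pipeline
val distributedDStream = inDStream.repartition(8)
// then build filter/map/updateStateByKey on distributedDStream instead of inDStream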

If you're not hitting the cluster and your jobs only run locally, it most likely means the Spark master in your SparkConf is set to the local URI, not the master URI.

By default the spark.default.parallelism property behaves as in "local mode", so all the tasks will be executed on the node that receives the data.
Change this property in the spark-defaults.conf file in order to increase the parallelism level.
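For example, a spark-defaults.conf entry could look like the following (the value 16 is only illustrative; a common rule of thumb is 2-3 tasks per CPU core in the cluster):
spark.default.parallelism    16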

Related

How persist(StorageLevel.MEMORY_AND_DISK) works in Spark 3.1 with Java implementation

I am using Apache Spark 3.1 with Java in a GCP Dataproc cluster, and my code structure is like this.
Dataset<Row> dataset1 = readSpannerData(session, session.sessionState().newHadoopConf());
Dataset<Row> dataset2 = ...; // reading some data from table1 in Bigtable
Dataset<Row> result1 = dataset1.join(dataset2);
dataset1.persist(StorageLevel.MEMORY_AND_DISK());
dataset2.persist(StorageLevel.MEMORY_AND_DISK()); // once the usage is done I am persisting both datasets
System.out.println(result1.count()); // it throws the error on this line
The exact error in the YARN UI refers to the select query on the Spanner table that I use at the start of the job, not to any Bigtable read. I persisted dataset1 only after its usage was done.
My cluster has autoscaling enabled, with a max of 250 worker nodes, each with 8 cores and 1024 GB memory.
It is configured to use 2 executors on each node (4 cores per executor).
It was working fine with a low volume of data, but it throws the error when running with a huge volume.
Why does it throw the error in this situation? Will it look back at the parent in-memory dataset when using the result calculated from that already-persisted parent? If we have to keep that dataset around ourselves, what is the use of in-memory storage?
How does it work in low-data environments? On how many nodes, and for how long, is the in-memory dataset maintained in a Spark job? Does the volume of the data affect the in-memory dataset?
Can anyone clarify this?
Thanks in advance :)

Spark Streaming Dynamic Allocation ExecutorAllocationManager

We have a Spark 2.1 streaming application using mapWithState, with spark.streaming.dynamicAllocation.enabled=true. The pipeline is as follows:
var rdd_out = ssc.textFileStream()
.map(convertToEvent(_))
.combineByKey(...., new HashPartitioner(partitions))
.mapWithState(stateSpec)
.map(s => sessionAnalysis(s))
.foreachRDD( rdd => rdd.toDF().....save(output));
The streaming app starts with 2 executors and, after some time, creates new executors as the load increases, as expected. The problem is that the load is not shared by those executors.
The number of partitions is big enough to spill over to the new executors, and the keys are equally distributed. I set it up with 40+ partitions, but I can see only 8 partitions (2 executors x 4 cores each) in the mapWithState storage. I expected that when new executors are allocated, those 8 partitions would be split and reassigned to the new ones, but this never happens.
Please advise.
Thanks,
Apparently the answer was staring me in the face all along :). As per the documentation quoted below, RDDs should inherit the upstream partitioning.
* Otherwise, we use a default HashPartitioner. For the number of partitions, if
* spark.default.parallelism is set, then we'll use the value from SparkContext
* defaultParallelism, otherwise we'll use the max number of upstream partitions.
The state inside mapWithState, however, does not have an upstream RDD. It is therefore set to the default parallelism, unless you specify the number of partitions directly on the StateSpec, as in the example below.
val stateSpec = StateSpec.function(crediting.formSession _)
.timeout(timeout)
.numPartitions(partitions) // <----------
var rdd_out = ssc.textFileStream()
.map(convertToEvent(_))
.combineByKey(...., new HashPartitioner(partitions))
.mapWithState(stateSpec)
.map(s => sessionAnalysis(s))
.foreachRDD( rdd => rdd.toDF().....save(output));
Still need to figure out how to make the number of partitions dynamic, as with dynamic allocation, this should change at runtime.

Spark foreachPartition connection improvements

I have written a Spark job which does the below operations:
1) Reads data from HDFS text files.
2) Does a distinct() call to filter duplicates.
3) Does a mapToPair phase and generates a pair RDD.
4) Does a reduceByKey call.
5) Does the aggregation logic for the grouped tuples.
6) Calls a foreach on #5.
Inside the foreach it:
- makes a call to the Cassandra DB
- creates an AWS SNS and SQS client connection
- does some JSON record formatting
- publishes the record to SNS/SQS
When I run this job it creates three Spark stages:
first stage - performs the distinct; takes nearly 45 sec
second stage - mapToPair and reduceByKey; takes 1.5 min
third stage - takes 19 min
What I did:
I turned off the Cassandra call to check whether the DB hit was the cause - that part takes relatively little time.
The offending part I found is creating the SNS/SQS connection for each partition;
it takes more than 60% of the entire job time.
I am creating the SNS/SQS connection within foreachPartition to keep the number of connections down. Is there an even better way?
I cannot create the connection object on the driver, as it is not serializable.
I am now using 9 executors, 15 executor cores, driver memory 2g, executor memory 5g.
Each machine has 16 cores and 64 GB of memory.
Cluster size: 1 master and 9 slaves, all with the same configuration.
EMR deployment, Spark 1.6.
It sounds like you would want to set up exactly one SNS/SQS connection per node and then use it to process all of your data on each node.
I think foreachPartition is the right idea here, but you might want to coalesce your RDD beforehand. This will collapse partitions on the same node without shuffling, and will allow you to avoid starting extra SNS/SQS connections.
See here:
https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.RDD#coalesce(numPartitions:Int,shuffle:Boolean,partitionCoalescer:Option[org.apache.spark.rdd.PartitionCoalescer])(implicitord:Ordering[T]):org.apache.spark.rdd.RDD[T]
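As a rough sketch of that idea in Scala (assuming the AWS SDK for Java v1 SNS client; the RDD name, coalesce target and topic ARN are only illustrative, not taken from the question):
import com.amazonaws.services.sns.AmazonSNSClientBuilder

// collapse partitions so that roughly one connection is opened per executor
// (9 here is just an example value matching the 9 slave nodes)
val coalesced = aggregatedRdd.coalesce(9)

coalesced.foreachPartition { records =>
  // one client per partition, created on the executor, never on the driver
  val sns = AmazonSNSClientBuilder.defaultClient()
  records.foreach { record =>
    // format the record as needed and publish it (topic ARN is hypothetical)
    sns.publish("arn:aws:sns:us-east-1:123456789012:my-topic", record.toString)
  }
}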

How to Dynamically Increase Active Tasks in Spark running on Yarn

I am running a Spark streaming process where I get a batch of 6000 events, but when I look at the executors only one active task is running. I tried dynamic allocation as well as setting the number of executors, etc. Even if I have 15 executors, only one active task is running at a time. Can anyone please guide me on what I am doing wrong here?
It looks like you have only one partition in your DStream. You should try to explicitly repartition your input stream:
val input: DStream[...] = ...
val partitionedInput = input.repartition(numPartitions = 16)
This way you would have 16 partitions in your input DStream, and each of those partitions could be processed in a separate task (and each of those tasks could be executed on a separate executor).

Repartitioning of dataframe in spark does not work

I have a Cassandra database with a large number of records, ~4 million. I have 3 slave machines and one driver. I want to load this data into Spark memory and process it. When I do the following, it reads all the data onto one slave machine (300 MB out of 6 GB) while the memory of all the other slave machines stays unused. I did a repartition of the dataframe into 3, but the data still sits on one machine. Because of this it takes a lot of time to process the data, since every job is executed on one machine. This is what I am doing:
val tabledf = _sqlContext.read.format("org.apache.spark.sql.cassandra").options(Map( "table" -> "events", "keyspace" -> "sams")).load
tabledf.registerTempTable("tempdf");
_sqlContext.cacheTable("tempdf");
val rdd = _sqlContext.sql(query);
val partitionedRdd = rdd.repartition(3)
val count = partitionedRdd.count.toInt
When I do some operations on partitionedRdd, they are executed on only one machine, since all the data is present on that one machine.
UPDATE
I am using --conf spark.cassandra.input.split.size_in_mb=32 in the configuration; still, all my data is loaded into one executor.
UPDATE
I am using Spark version 1.4 and the released Spark Cassandra Connector version 1.4.
If "Query" only accesses a single C* partition key you will only get a single task because we don't have a way (yet) of automatically getting a single cassandra partition in parallel. If you are accessing multiple C* partitions then try futher shrinking the input split_size in mb.
