I'm running a simple groupBy on 350 GB of data. Since I'm running this on a single node (on an HPC cluster), I requested 400 GB of compute resources and then ran the Spark job with spark.driver.memory set to 350 GB.
Since it's running on a single node, the driver node acts as both master and slave. The job currently takes more than 6 hours to complete. All it does is a simple groupBy followed by merging the result into a single Parquet output:
val data = spark.read.parquet("path_to_folder/*")
val grouped = data.groupBy("i","j").agg(sum("count").alias("count"))
grouped.write.parquet("output_folder_path")
Is there a way to make this process more efficient? Specifically, is there a way to force the driver node to spawn multiple slaves, even though the driver node is acting as both master and slave (0 workers), so that the grouping runs faster?
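For illustration, here is a minimal sketch of the kind of single-node setup being described, assuming the session is built in local mode; the shuffle-partition value and the coalesce(1) step are illustrative assumptions, not verified tuning:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.sum

// In local mode the driver JVM runs the executor tasks as threads,
// so local[*] already parallelises the groupBy across all cores of the node.
// spark.driver.memory still has to be passed on the command line
// (e.g. to spark-submit), because it must be set before the JVM starts.
val spark = SparkSession.builder()
  .master("local[*]")                            // use every core on the single node
  .config("spark.sql.shuffle.partitions", "400") // assumed value; tune to the data size
  .getOrCreate()

val data = spark.read.parquet("path_to_folder/*")
data.groupBy("i", "j")
  .agg(sum("count").alias("count"))
  .coalesce(1)              // only if a single output file is really required
  .write.parquet("output_folder_path")

In local mode there are no separate worker processes to spawn; parallelism comes from the number of cores given to local[N] and from the shuffle partition count.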
Posting this question to learn how Apache Spark collects and coordinates the results from executors.
Suppose I'm running a job with 3 executors. My DataFrame is partitioned and distributed across these 3 executors.
Now, when I execute a count() or collect() action on the DataFrame, how will Spark coordinate the results from these 3 executors?
val prods = spark.read.format("csv").option("header", "true").load("testFile.csv")
prods.count(); // How does Spark collect data from the three executors? Who coordinates the results from the different executors and returns them to the driver?
When you do spark-submit and specify a master, a client program (the driver) starts running on YARN if yarn is specified as the master, or locally if local is specified. https://spark.apache.org/docs/latest/submitting-applications.html
Since you have added the yarn tag to the question, I assume you mean a YARN master URL, so YARN launches the client program (driver) on one of the nodes of the cluster, then registers and assigns workers (executors) to the driver so that tasks can be executed on each node. Each transformation/action runs in parallel on each worker node (executor). Once each node completes its work, it returns its results to the driver program.
OK, which part is not clear?
Let me make it generic: the client/driver program launches and asks the master (local / standalone master / YARN, a.k.a. the cluster manager) for the resources it needs to perform its tasks, i.e. to allocate workers to the driver. The cluster manager in turn allocates workers, launches executors on the worker nodes, and tells the client program which workers it can use to do its job. The data is then divided across the worker nodes, and tasks/transformations run in parallel on them. Once collect() or count() is called (I assume this is the final part of the job), each executor returns its result back to the driver.
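To illustrate the split the answer describes, here is a minimal sketch (reusing the hypothetical testFile.csv from the question): each executor produces a partial result for its own partitions, and the driver combines those partial results, which is conceptually what count() relies on:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()
val prods = spark.read.format("csv").option("header", "true").load("testFile.csv")

// Each task counts the rows of one partition on its executor...
val partialCounts = prods.rdd
  .mapPartitions(rows => Iterator(rows.size.toLong))
  .collect()               // ...and the partial results are sent back to the driver

// ...where the driver combines them into the final answer.
val total = partialCounts.sum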
I'm using a machine with 64 GB RAM & 24 cores and have allocated 32 GB to the JVM. I wanted to run the following processes:
7 Kafka brokers
3 ZooKeeper instances
Elasticsearch
Cassandra
Spark
MongoDB
MySQL
Kafka Manager
Node.js
& running 4-5 Spark applications on 5-6 executors with 1 GB each simultaneously. The Spark jobs work as follows (a sketch of the first one follows the list):
1) One Spark job takes data from Kafka and inserts it into Cassandra.
2) One Spark job takes data from another Kafka topic and inserts it into a different Cassandra table.
3) Two Spark jobs take data from Cassandra, do some processing/analysis, and write the results into their own separate Cassandra tables.
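For context, a minimal sketch of what job 1 might look like, assuming the Kafka 0.8 direct API and the DataStax spark-cassandra-connector; the broker address, topic, keyspace, table, and the comma-separated record layout are all hypothetical:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils
import kafka.serializer.StringDecoder
import com.datastax.spark.connector.SomeColumns
import com.datastax.spark.connector.streaming._

object KafkaToCassandra {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("kafka-to-cassandra")
      .set("spark.cassandra.connection.host", "127.0.0.1") // assumed Cassandra host

    val ssc = new StreamingContext(conf, Seconds(5))

    val kafkaParams = Map("metadata.broker.list" -> "localhost:9092") // assumed broker
    val topics = Set("events")                                        // hypothetical topic

    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, topics)

    // Parse each record into an (id, payload) pair and save every micro-batch to Cassandra
    stream.map { case (_, value) =>
      val fields = value.split(",")
      (fields(0), fields(1)) // hypothetical two-column record layout
    }.saveToCassandra("my_keyspace", "my_table", SomeColumns("id", "payload"))

    ssc.start()
    ssc.awaitTermination()
  }
}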
So, sometimes my insertion application hangs. It takes around 500 records/second from Kafka. After running for some time, it starts queueing up batches; there is no error, yet the processing time in the Spark dashboard keeps increasing gradually.
I have used top to check CPU usage and found one process, "0QrmJB", taking 1500+% CPU while Java is taking 200%.
What might be the issue? I'm not able to analyse it. Is it OK to run this many processes on a single machine? Thanks.
I have a long-running Spark Streaming job which uses 16 executors with only one core each.
I use the default partitioner (HashPartitioner) to distribute data equally across 16 partitions. Inside the updateStateByKey function, I checked the partition id from TaskContext.getPartitionId() for multiple batches and found that the partition id seen by an executor is fairly consistent, but it still changes to another id after a long run.
I'm planning to do some optimization on Spark's updateStateByKey API, but it can't be achieved if the partition id keeps changing between batches.
So when does Spark change the partition id handled by an executor?
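For reference, a minimal sketch of the kind of check described above, assuming a pair DStream of (String, Int) counts; the names are illustrative:

import org.apache.spark.TaskContext
import org.apache.spark.streaming.dstream.DStream

// Log which partition each key's state is updated on, batch after batch.
def countWithPartitionLog(pairs: DStream[(String, Int)]): DStream[(String, Int)] =
  pairs.updateStateByKey { (newValues: Seq[Int], state: Option[Int]) =>
    println(s"update running on partition ${TaskContext.getPartitionId()}")
    Some(state.getOrElse(0) + newValues.sum)
  }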
Most probably, the task has failed and been restarted, so the TaskContext has changed, and so has the partitionId.
I am running Spark v1.6.1 on a single machine in standalone mode, with 64 GB RAM and 16 cores.
I have created five worker instances to get five executors, since in standalone mode there cannot be more than one executor per worker node.
Configuration:
SPARK_WORKER_INSTANCES 5
SPARK_WORKER_CORE 1
SPARK_MASTER_OPTS "-Dspark.deploy.default.Cores=5"
All other configurations in spark-env.sh are left at their defaults.
I am running a Spark Streaming direct Kafka job at an interval of 1 minute, which takes data from Kafka and, after some aggregation, writes the data to Mongo.
Problems:
When I start the master and slave, it starts one master process and five worker processes, each consuming only about 212 MB of RAM. When I submit the job, it creates 5 executor processes and 1 job process, and total memory usage grows to 8 GB and keeps growing slowly over time, even when there is no data to process.
We unpersist the cached RDDs at the end and have also set spark.cleaner.ttl to 600, but memory still keeps growing.
One more thing: I have seen that SPARK-1706 has been merged, so why am I unable to create multiple executors within one worker? Also, in the spark-env.sh file, every executor-related setting appears to apply to YARN mode only (see the sketch below).
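As a point of comparison, a minimal sketch of the post-SPARK-1706 way to get several executors on one standalone worker, using application-level properties rather than spark-env.sh; the concrete numbers and the master URL are assumptions:

import org.apache.spark.SparkConf

// With spark.executor.cores smaller than the worker's total cores,
// the standalone master can place several executors on the same worker.
val conf = new SparkConf()
  .setMaster("spark://master-host:7077")   // hypothetical master URL
  .set("spark.executor.cores", "1")        // cores per executor
  .set("spark.executor.memory", "2g")      // assumed heap per executor
  .set("spark.cores.max", "5")             // total cores for this application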
Any help would be greatly appreciated,
Thanks
I have a two-node standalone cluster for Spark stream processing. Below is sample code which demonstrates the process I am executing:
sparkConf.setMaster("spark://rsplws224:7077")
val ssc = new StreamingContext(sparkConf, Milliseconds(500))
println(ssc.sparkContext.master)
val inDStream = ssc.receiverStream // batch of 500 ms as I would like to have 1 sec latency
val filteredDStream = inDStream.filter // filtering unwanted tuples
val keyDStream = filteredDStream.map // converting to pair dstream
val stateStream = keyDStream.updateStateByKey // updating state for history
stateStream.checkpoint(Milliseconds(2500)) // to remove long lineage and materialize the state stream
stateStream.count()
val withHistory = keyDStream.join(stateStream) // joining state with input stream for further processing
val alertStream = withHistory.filter // decision to be taken by comparing history state and current tuple data
alertStream.foreach // notification to other system
My problem is that Spark is not distributing this state RDD to multiple nodes, or not distributing tasks to the other node, which causes high response latency; my input load is around 100,000 tuples per second.
I have tried the following, but nothing works:
1) Set spark.locality.wait to 1 second.
2) Reduced the memory allocated to the executor process to check whether Spark would distribute the RDD or tasks, but it does not, even when memory use goes beyond the limit of the first node (m1), where the driver is also running.
3) Increased spark.streaming.concurrentJobs from 1 (the default) to 3.
4) Checked in the streaming UI storage tab that there are around 20 partitions for the state DStream RDD, all located on the local node m1.
If I run SparkPi 100000, Spark is able to utilize the other node after a few seconds (30-40), so I am sure my cluster configuration is fine.
Edit
One thing I have noticed: even if I set the storage level of my RDD to MEMORY_AND_DISK_SER_2, the app UI storage tab still shows "Memory Serialized 1x Replicated".
Spark will not distribute stream data across the cluster automatically, because it tends to make full use of data locality (launching a task where its data lies is preferable, and this is the default behaviour). But you can use repartition to distribute the stream data and improve parallelism. You can refer to http://spark.apache.org/docs/1.3.0/streaming-programming-guide.html#performance-tuning for more information.
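A minimal sketch of that repartition suggestion, assuming an input DStream of raw strings and an illustrative keying scheme:

import org.apache.spark.streaming.dstream.DStream

// Shuffle the received records across more partitions before the stateful
// transformations, so their tasks can be scheduled on both worker nodes.
def spreadLoad(in: DStream[String], numPartitions: Int = 16): DStream[(String, String)] =
  in.repartition(numPartitions)              // redistributes data across the cluster
    .map(record => (record.take(8), record)) // hypothetical key extraction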
If you're not hitting the cluster and your jobs only run locally, it most likely means the Spark master in your SparkConf is set to the local URI rather than the master URI.
By default, the value of the spark.default.parallelism property is determined by local mode, so all tasks will be executed on the node that receives the data.
Change this property in the spark-defaults.conf file in order to increase the parallelism level.
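Alternatively, a minimal sketch of setting the same property programmatically; the value is an assumption to be tuned:

import org.apache.spark.SparkConf

// Equivalent to adding "spark.default.parallelism 32" to spark-defaults.conf.
val conf = new SparkConf()
  .set("spark.default.parallelism", "32") // assumed level; match it to the cluster's total cores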