Spark Streaming job memory keeps growing - apache-spark

I am running Spark 1.6.1 in standalone mode on a single machine with 64 GB RAM and 16 cores.
I have created five worker instances to get five executors, because in standalone mode there cannot be more than one executor per worker.
Configuration:
SPARK_WORKER_INSTANCES 5
SPARK_WORKER_CORES 1
SPARK_MASTER_OPTS "-Dspark.deploy.defaultCores=5"
All other settings in spark-env.sh are left at their defaults.
I am running a Spark Streaming direct Kafka job with a 1-minute batch interval. It reads data from Kafka and, after some aggregation, writes the results to MongoDB.
Problems:
When I start the master and slaves, one master process and five worker processes come up, each consuming only about 212 MB of RAM. When I submit the job, five executor processes and one job process are created, total memory use grows to 8 GB, and it keeps growing slowly over time, even when there is no data to process.
We unpersist the cached RDDs at the end and have also set spark.cleaner.ttl to 600, but memory still grows.
One more thing: I have seen that SPARK-1706 was merged, so why am I still unable to create multiple executors within a worker? Also, in the spark-env.sh file, every executor-related setting is listed under YARN-only mode.
Any help would be greatly appreciated,
Thanks

Related

Spark - only one partition is processed at each node

I see in my Spark job that usually (but not always) only one partition is being processed on each node. What could be the possible reasons? How can I debug it?
You should check the executor's resources configuration:
spark.executor.memory
spark.executor.cores
These settings control how many executors can run concurrently on each node and therefore how many partitions are processed concurrently. Each executor runs one task (one partition) per core at a time; on YARN, spark.executor.cores defaults to 1, so by default every executor processes a single partition.
For example, if your nodes have 8 cores and 32 GB of memory each and your Spark application is defined with:
spark.executor.memory=25g
spark.executor.cores=3
only one executor can run on each node at a time, because memory is the limiting resource. To run 2 executors concurrently, the node would need at least 50 GB of memory.
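The arithmetic behind that example can be sketched as follows. This is a simplified model, not something Spark exposes: the function name `executors_per_node` is made up for illustration, and the calculation ignores per-executor memory overhead and whatever the OS and other daemons reserve, so real nodes fit slightly fewer executors than this suggests.

```python
def executors_per_node(node_cores, node_mem_gb, executor_cores, executor_mem_gb):
    """Return how many executors fit on one node, limited by both cores and memory."""
    by_cores = node_cores // executor_cores      # limit imposed by CPU cores
    by_memory = node_mem_gb // executor_mem_gb   # limit imposed by RAM
    return min(by_cores, by_memory)

# The example above: 8 cores and 32 GB per node, 3 cores and 25 GB per executor.
print(executors_per_node(8, 32, 3, 25))  # 1: cores would allow 2, memory allows only 1

# With 50 GB on the node, memory allows 2 as well.
print(executors_per_node(8, 50, 3, 25))  # 2
```

Lowering spark.executor.memory has the same effect as adding RAM in this model: at 10 GB per executor, the 32 GB node would be limited by cores instead.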

How Spark and S3 interact

I'm wondering how data is loaded into Spark in the scenario below:
There is 10 GB of transaction data stored in S3 in Parquet format, and I'm going to run a Spark program to categorize every record in that 10 GB Parquet file (e.g. Income, Shopping, Dining).
I have the following questions:
How would this 10 GB be distributed to the different workers in the Spark cluster? Is the 10 GB file loaded into the Spark master, which then splits the data and sends it to the executors?
Does all of this happen in memory? What if one of the executors crashes during a job run: will the master load the 10 GB file from S3 again, extract the subset of data that was supposed to be processed by the crashed executor, and send it to another executor?
How would this 10 GB be distributed to the different workers in the Spark cluster? Is the 10 GB file loaded into the Spark master, which then splits the data and sends it to the executors?
Answer:
Spark follows a master-slave architecture: there is one master (the driver, or coordinator) and multiple distributed worker nodes. The driver process runs on the master node, and the program's main method runs in the driver process. The driver creates the SparkSession or SparkContext and converts user code into tasks, based on the transformations and actions in the code, using the lineage graph. It builds the logical and physical plans and, once the physical plan is ready, coordinates with the cluster manager to obtain executors to run the tasks. The driver only keeps track of the state of the data (metadata) for each executor.
So the 10 GB file is not loaded onto the master node. S3 is distributed storage, and Spark reads from it in splits. The driver only decides how the data is split and what each executor works on. Even if you cache the data, it is cached only on the executor nodes, according to the partitions each executor is working on. Also, nothing is triggered until you call an action such as count or collect. Spark builds a lineage graph and a DAG to keep track of this information.
Does all of this happen in memory? What if one of the executors crashes during a job run: will the master load the 10 GB file from S3 again, extract the subset of data that was supposed to be processed by the crashed executor, and send it to another executor?
Answer:
As answered in the first question, data is loaded into memory only when an action is performed. "Loaded into memory" does not mean loaded into driver memory: depending on the action, data is loaded into the memory of the driver or of the executors. If you use collect, everything is loaded into driver memory; for other operations such as count, if you have cached the DataFrame, the data is loaded into memory on each of the executor nodes.
Now, if one of the executors crashes during the job run, the driver still has the lineage graph and the metadata for the data the crashed executor held, so it reruns the same lineage on another executor to complete the task. This is what makes Spark resilient and fault tolerant.
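The recovery idea can be illustrated with a toy model in plain Python. This is not Spark's API or its scheduler: the names `read_split`, `compute_partition`, `exec-A` and `exec-B` are invented for the sketch, and a short list stands in for the file in S3. The point it tries to show is that a partition is defined by a recipe (its lineage), so only the lost partitions are recomputed, each from its own slice of the source.

```python
# Toy model (not Spark's API): each partition is described by a recipe, its lineage,
# rather than by the data itself, so a lost partition can be rebuilt elsewhere.

source = list(range(12))   # stands in for the file in S3
num_partitions = 4

def read_split(i):
    """Read only the slice of the source that belongs to partition i."""
    size = len(source) // num_partitions
    return source[i * size:(i + 1) * size]

def compute_partition(i):
    """Lineage for partition i: read its split, then apply the transformation."""
    return [x * 10 for x in read_split(i)]

# The driver keeps only metadata: which executor owns which partition.
assignment = {0: "exec-A", 1: "exec-A", 2: "exec-B", 3: "exec-B"}
results = {i: compute_partition(i) for i, owner in assignment.items() if owner == "exec-A"}

# exec-B crashes before finishing: reassign its partitions and recompute only those.
lost = [i for i, owner in assignment.items() if owner == "exec-B"]
for i in lost:
    assignment[i] = "exec-A"
    results[i] = compute_partition(i)

print(lost)        # [2, 3]
print(results[2])  # [60, 70, 80]
```

Partitions 0 and 1 are never touched during recovery, and nothing in the model ever holds the whole source in one place except the stand-in list itself.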
Each worker will issue one or more GET requests for the ranges of the Parquet file it has been given, and more as it seeks around the file. The whole 10 GB file is never loaded anywhere.
Each worker does its own read of its own split; this counts against the overall I/O capacity of the store/shard.
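To make the ranged reads concrete, here is a rough sketch of how a 10 GB object could be carved into byte ranges, one per split. The 128 MB split size is an assumption for illustration, and the helper `split_ranges` is hypothetical. A real Parquet reader also fetches the file footer and individual column chunks, so actual requests will not line up exactly with these ranges.

```python
def split_ranges(file_size, split_size):
    """Yield inclusive (start, end) byte ranges that cover the file without overlap."""
    start = 0
    while start < file_size:
        end = min(start + split_size, file_size) - 1
        yield start, end
        start = end + 1

GB = 1024 ** 3
MB = 1024 ** 2

# Assumed sizes: a 10 GB object read in 128 MB splits.
ranges = list(split_ranges(10 * GB, 128 * MB))
print(len(ranges))  # 80

# Each split maps to an HTTP Range header on a GET request.
first_start, first_end = ranges[0]
print(f"Range: bytes={first_start}-{first_end}")  # Range: bytes=0-134217727
```

Under these assumptions the job has 80 independent reads to hand out, and no single process ever requests the full object.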

Running 16 processes on single jvm machine

I'm using a machine with 64 GB RAM and 24 cores, and have allocated 32 GB to the JVM. I want to run the following processes:
7 Kafka brokers
3 instances of ZooKeeper
Elasticsearch
Cassandra
Spark
MongoDB
MySQL
Kafka Manager
Node.js
I am also running 4-5 Spark applications simultaneously, on 5-6 executors with 1 GB each. The Spark jobs work as follows:
1) One Spark job takes data from Kafka and inserts it into Cassandra.
2) One Spark job takes data from another Kafka topic and inserts it into a different Cassandra table.
3) Two Spark jobs take data from Cassandra, do some processing/analysis, and write the results to their respective Cassandra tables.
Sometimes my insertion application hangs. It takes in around 500 records/second from Kafka. After running for some time, batches start queuing up; there is no error, but the processing time shown in the Spark dashboard increases gradually.
I used top to check CPU usage and found one process, "0QrmJB", using more than 1500% CPU, while java is using 200%.
What might the issue be? I'm not able to analyse it. Is it OK to run this many processes on a single-JVM machine? Thanks,

unable to launch more tasks in spark cluster

I have a 6-node cluster with 8 cores and 32 GB RAM per node. I am reading a simple CSV file from Azure Blob Storage and writing it to a Hive table.
When the job runs, I see only a single task being launched and a single executor working, while all the other executors and instances sit idle/dead.
How can I increase the number of tasks so the job runs faster?
Any help appreciated.
I'm guessing that your CSV file is in one block. Your data is therefore in only one partition, and since Spark creates "only" one task per partition, you get a single task.
You can call repartition(X) on your DataFrame/RDD just after reading it to increase the number of partitions. Reading won't be faster, but all your transformations and the write will be parallelized.
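A small model of the "one task per partition" rule shows why a single partition leaves the cluster idle and what a reasonable X might look like. It assumes one core per task and that the application is given all 48 cores of the cluster from the question. `stage_waves` is a made-up helper, not a Spark function.

```python
import math

def stage_waves(num_partitions, total_cores):
    """Return (cores kept busy, number of task waves) for one stage."""
    busy = min(num_partitions, total_cores)             # one task per partition, one core per task
    waves = math.ceil(num_partitions / total_cores)     # rounds needed to run every task
    return busy, waves

total_cores = 6 * 8  # the cluster from the question: 6 nodes with 8 cores each

print(stage_waves(1, total_cores))    # (1, 1): one task, 47 cores idle
print(stage_waves(48, total_cores))   # (48, 1): every core has a task
print(stage_waves(100, total_cores))  # (48, 3): three waves of tasks
```

In this model, any X at or above the total core count keeps every core busy; a small multiple of the core count is a common starting point because it evens out tasks of unequal size.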

Bad read performance on Spark over HBase Hadoop

When reading 161 000 elements from HBase (462 MB based on HDFS file size) Spark spends at least 6 seconds to read them.
HBase is configured to use a block cache. During the test (there is no other process running at that moment), the block cache has a size of 470.1 MB (752.0 MB free).
All the elements are in the block cache.
The executor is running in a YARN container (yarn mode) with 1408 MB of memory.
Everything is running on a single node (including the master) on an Amazon m4.large instance.
There are no other rows in the table, and a range scan is performed.
RDD initialized like this
Executor Logs (it took 8 seconds in debug logging level)
The job is executed via Spark JobServer
Even a simple count on the RDD (no other operation) takes 5 seconds
I don't know what I can do based on the figures below. Where does the executor spend its time? How can I identify the bottleneck?
Thank you very much,
Sébastien.

Resources