Bad read performance on Spark over HBase Hadoop - apache-spark

When reading 161 000 elements from HBase (462 MB based on HDFS file size) Spark spends at least 6 seconds to read them.
HBase is configured to use a block cache. During the test (there is no other process running at that moment), the block cache has a size of 470.1 MB (752.0 MB free).
All the elements are in the block cache.
The executor is running in an Yarn container (yarn mode) of 1408 MB memory.
Everything is running on a single node (including the master) over an Amazon m4 large node.
There is no other row in the table and a range scanning is performed.
RDD initialized like this
Executor Logs (it took 8 seconds in debug logging level)
The job is executed via Spark JobServer
Even a simple count on the RDD (no other operation) takes 5 seconds
I don't know what I can do based on the figures below. Where does the executor spend its time? How can I identify the bottleneck?
Thank you very much,
Sébastien.

Related

Can spark manage partitions larger than the executor size?

Question:
Spark seems to be able to manage partitions that are bigger than the executor size. How does it do that?
What I have tried so far:
I picked up a CSV with: Size on disk - 12.3 GB, Size in memory deserialized - 3.6 GB, Size in memory serialized - 1964.9 MB. I got these sizes from caching the data in memory deserialized and serialized both and 12.3 GB is the size of the file on the disk.
To check if spark can handle partitions larger than the executor size, I created a cluster with just one executor with spark.executor.memory equal to 500mb. Also, I set executor cores (spark.executor.cores) to 2 and, increased spark.sql.files.maxPartitionBytes to 13 GB. I also switched off Dynamic allocation and adaptive for good measure. The entire session configuration is:
spark = SparkSession.builder.\
config("spark.dynamicAllocation.enabled",False).\
config("spark.executor.cores","2").\
config("spark.executor.instances","1").\
config("spark.executor.memory","500m").\
config("spark.sql.adaptive.enabled", False).\
config("spark.sql.files.maxPartitionBytes","13g").\
getOrCreate()
I read the CSV and checked the number of partitions that it is being read in by df.rdd.getNumPartitions(). Output = 2. This would be confirmed later on as well in the number of tasks
Then I run df.persist(storagelevel.StorageLevel.DISK_ONLY); df.count()
Following are the observations I made:
No caching happens till the data for one batch of tasks (equal to number of cpu cores in case you have set 1 cpu core per task) is read in completely. I conclude this since there is no entry that shows up in the storage tab of the web UI.
Each partition here ends up being around 6 GB on disk. Which should, at a minimum, be around 1964.9 MB/2 (=Size in memory serializez/2) in memory. Which is around 880 MB. There is no spill. Below is the relevant snapshot of the web UI from when around 11 GB of the data has been read in. You can see that Input has been almost 11GB and at this time there was nothing in the storage tab.
Questions:
Since the memory per executor is 300 MB (Execution + Storage) + 200 MB (User memory). How is spark able to manage ~880 MB partitions that too 2 of them in parallel (one by each core)?
The data read in does not show up in the Storage, is not (and, can not be) in the executor and, there is no spill as well. where exactly is that read in data?
Attaching a SS of the web UI post that job completion in case that might be useful
Attaching a SS of the Executors tab in case that might be useful:

Running 16 processes on single jvm machine

I'm using a 64 GB RAM & 24 core machine and allocated 32 GB to JVM. I wanted to run following processes :-
7 Kafka Brokers
3 instance of zookeeper
Elastic Search
Cassandra
Spark
MongoDB
Mysql
Kafka Manager
Node.js
& running 4-5 Spark Application on 5-6 executor with 1GB each simulatenously. The working of Spark jobs are as follows :-
1) 1 Spark job takes data from kafka and inserts into Cassandra
2) 1 Spark job takes data from another kafka topic and inserts into different Cassandra Table.
3) 2 Spark Job takes data from Cassandra, did some processing/analysis and writes data into their respective different cassandra table.
So, Sometimes my insertion application gets hang. It is taking around 500 records/second from Kafka. After running for sometime, It starts creating batches in queue and there is no error still processing time in Spark dashboard is increasing gradually.
I have used TOP to check CPU usage and found there is one process "0QrmJB" which is taking 1500+ CPU% usage and java is taking 200% usage.
What might be the issues ? I'm not able to analyse.Is it ok to run these many processes on single JVM machine? Thanks,

How Apache Spark partitions data of a big file [duplicate]

This question already has answers here:
How does Spark partition(ing) work on files in HDFS?
(4 answers)
Closed 4 years ago.
Let's say I have a cluster of 4 nodes each having 1 core. I have a 600 Petabytes size big file which I want to process through Spark. File could be stored in HDFS.
I think that way to determine no. of partitions is file size / total no. of cores in the cluster. If that is the case indeed, I will have 4 partitions(600/4) so each partition will be of 125 PB size.
But I think 125 PB is too big a size for partition so is my thinking correct related to deducing no. of partitions.
PS: I have just started with Apache Spark. So, apologies if this is a naive question.
As you are storing your data on HDFS, it will be partitioned already in 64 MB or 128 MB blocks as per your HDFS configuration. (Lets assume 128 MB Blocks.)
So 600 petabytes will result in 4687500000 blocks of 128 MB each. (600 petabytes/128 MB)
Now when you run your Spark job, each executor will read few blocks of data (number of blocks will be equal to the number of cores in executor) and process them in parallel.
Basically, each core will process 1 partition. So the more cores you give to an executor the more data it can process, but at the same time you will need to allocate more memory to executor to handle the size of data loaded in memory.
It is advised to have moderate size executors. Having too many small executors will cause a lot of data shuffle.
Now coming to your scenario, if you have a 4 node cluster with 1 core each. You will have 3 executors running on them at max as 1 core will be taken for spark driver.
So to process the data, you will be able to process 3 partitions in parallel.
so it will take your job 4687500000/3 = 1562500000 iteration to process the whole data.
Hope that helps!
Cheers!
To answer your question, if you have stored file in HDFS it is already partitioned based on your HDFS configuration i.e. if block size is 64MB, your total file will be divided in such blocks and spread across Hadoop cluster. Spark will generate tasks according to your num.executors configuration to decide how many parallel tasks can be executed. Expect no_of_hdfs_blocks=no_of_total_tasks.
Next what matters is how you are processing logic on this data, are you doing any shuffling of data, something similar to repartition(*) which will move the data around the cluster and change partition number to be processed by your spark job.
HTH!

Spark only running one executor for big gz files

I have input source files(compressed .gz) which I need to process using Spark. Each file is 5 GBs (compressed gz) and there are around 11-12 files.
But when I give the source as input, spark just launches one executor. I understand that this may be due to the non-splittable nature of the file but even when I use a high RAM box e.g c3.8xlarge, it still doesnot use more executors. the executor memory being assigned is 45 GB and the executor cores as 31.

Spark Streaming Job Keeps growing memory

I am running spark v 1.6.1 on a single machine in standalone mode, having 64GB RAM and 16cores.
I have created five worker instances to create five executor as in standalone mode, there cannot be more than one executor in one worker node.
Configuration:
SPARK_WORKER_INSTANCES 5
SPARK_WORKER_CORE 1
SPARK_MASTER_OPTS "-Dspark.deploy.default.Cores=5"
all other configurations are default in spark_env.sh
I am running a spark streaming direct kafka job at an interval of 1 min, which takes data from kafka and after some aggregation write the data to mongo.
Problems:
when I start master and slave, it starts one master process and five worker processes. each only consume about 212 MB of ram.when i submit the job , it again creates 5 executor processes and 1 job process and also the memory uses grows to 8GB in total and keeps growing over time (slowly) also when there is no data to process.
we are unpersisting cached rdd at the end also set spark.cleaner.ttl to 600. but still memory is growing.
one more thing, I have seen the merged SPARK-1706, then also why i am unable to create multiple executor within a worker.and also in spark_env.sh file , setting any configuration related to executor comes under YARN only mode.
Any help would be greatly appreciated,
Thanks

Resources