How to support spark broadcast join for 5 GB table - apache-spark

I am doing a broadcast join in Spark via a broadcast hint. My broadcast table is 5 GB with 224M records and 2 columns. As per Spark, a broadcast join can support tables up to 8 GB or 500M records.
But in my case, once the table goes beyond roughly 30M records, broadcast failures and sometimes RPC timeout issues occur.
My driver memory is 8 GB and executor memory is 8 GB. Also, the driver's max result size is 4 GB. What additional settings can I apply to support broadcasting this 5 GB table?
My nodes have enough bandwidth, and I tried with 12 GB driver memory as well, but it still doesn't work.
Please let me know if there is any other Spark setting I can apply to make it work.
Thanks
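
For reference, a rough sketch of the session-level settings usually adjusted in this situation; the values, paths, and join key below are illustrative assumptions, not a verified fix for this job:

from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

# spark.driver.memory has to be supplied before the driver JVM starts
# (e.g. spark-submit --driver-memory 16g or spark-defaults.conf); the
# settings below can live in the session config.
spark = SparkSession.builder.\
    config("spark.driver.maxResultSize", "8g").\
    config("spark.sql.broadcastTimeout", "1200").\
    getOrCreate()

# With an explicit hint, spark.sql.autoBroadcastJoinThreshold is not consulted,
# but the broadcast table still has to be collected and shipped, hence the
# timeout and result-size headroom above.
dim_df = spark.read.parquet("/path/to/5gb_table")      # placeholder path
fact_df = spark.read.parquet("/path/to/fact_table")    # placeholder path
joined = fact_df.join(broadcast(dim_df), "key")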

Related

Can spark manage partitions larger than the executor size?

Question:
Spark seems to be able to manage partitions that are bigger than the executor size. How does it do that?
What I have tried so far:
I picked up a CSV with: size on disk - 12.3 GB, size in memory deserialized - 3.6 GB, size in memory serialized - 1964.9 MB. I got the in-memory sizes by caching the data both deserialized and serialized; 12.3 GB is the size of the file on disk.
To check if Spark can handle partitions larger than the executor size, I created a cluster with just one executor, with spark.executor.memory equal to 500 MB. I also set executor cores (spark.executor.cores) to 2 and increased spark.sql.files.maxPartitionBytes to 13 GB, and switched off dynamic allocation and adaptive query execution for good measure. The entire session configuration is:
from pyspark.sql import SparkSession

# One executor with 500 MB of memory and 2 cores, dynamic allocation and AQE
# switched off, and maxPartitionBytes raised so the whole file fits in ~2 partitions.
spark = SparkSession.builder.\
    config("spark.dynamicAllocation.enabled", False).\
    config("spark.executor.cores", "2").\
    config("spark.executor.instances", "1").\
    config("spark.executor.memory", "500m").\
    config("spark.sql.adaptive.enabled", False).\
    config("spark.sql.files.maxPartitionBytes", "13g").\
    getOrCreate()
I read the CSV and checked the number of partitions it is read into with df.rdd.getNumPartitions(). Output = 2. This is confirmed later on by the number of tasks as well.
Then I ran df.persist(storagelevel.StorageLevel.DISK_ONLY); df.count()
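For reference, a minimal self-contained version of these two steps, reusing the `spark` session configured above (the CSV path is a placeholder):

from pyspark import StorageLevel

# Read the CSV and check how many partitions Spark splits it into.
df = spark.read.csv("/path/to/large_file.csv", header=True)   # placeholder path
print(df.rdd.getNumPartitions())          # 2 in the run described here

# Persist to disk only, then force materialization with an action.
df.persist(StorageLevel.DISK_ONLY)
df.count()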
Following are the observations I made:
No caching happens until the data for one batch of tasks (equal to the number of CPU cores, if you have set 1 CPU core per task) has been read in completely. I conclude this because no entry shows up in the Storage tab of the web UI until then.
Each partition here ends up being around 6 GB on disk, which should, at a minimum, be around 1964.9 MB / 2 (= size in memory serialized / 2) in memory, i.e. around 880 MB. There is no spill. Below is the relevant snapshot of the web UI from when around 11 GB of the data had been read in. You can see that Input is almost 11 GB, and at this point there was nothing in the Storage tab.
Questions:
The memory per executor is 300 MB (execution + storage) + 200 MB (user memory). How is Spark able to manage ~880 MB partitions, and two of them in parallel at that (one per core)?
The data read in does not show up in Storage, is not (and cannot be) in the executor memory, and there is no spill either. Where exactly is that read-in data?
Attaching a screenshot of the web UI after job completion in case that might be useful.
Attaching a screenshot of the Executors tab in case that might be useful:

Spark is not using all configured storage memory capacity

My Spark task uses image data for prediction. I am working on a standalone Spark cluster, but I have an issue utilizing all of the available memory capacity: the available storage memory is 2.7 GB (which makes sense, coming from an executor memory configured as 5 GB * 0.6 * 0.9 = 2.7 GB), but the used memory is only 342 MB, and beyond that value my Spark session crashes. I don't know why it is this specific value!
I have tested my application both in local mode and in standalone cluster mode; in addition, whatever value I configure for executor memory, the memory limit hit during execution is still 342 MB. As shown below, a data size of 290691 KB led to the crash of my Spark session, and it works fine if I decrease the number of images.
Screenshots of the issue follow:
This is the output error from the crash with a data size of 290691 KB.
Here, the Storage Memory in my Spark UI did not exceed 342 MB.
Is there any advice, or what is the correct Spark configuration?
It's a warning, initially.
The general gist here is that you need to repartition into more, smaller partitions, so as to get more parallelism and higher throughput. You can find many such issues discussed on the Internet.
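
For what it's worth, a minimal sketch of that suggestion; the input path is a placeholder and the partition count of 200 is purely illustrative:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Placeholder input; the point is only the repartition step.
df = spark.read.parquet("/path/to/image_features")

# 200 is an arbitrary illustration -- in practice the target count would be
# chosen from the data size and the number of available cores.
df = df.repartition(200)
df.count()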

Spark - What could be the ideal broadcast table size

We use broadcast as one of the join optimization techniques in Spark. Could you please help me understand the points below?
1) Should the broadcast table size always be less than the driver memory?
For example, suppose my broadcast table size is 4 GB but driver memory is 3 GB. Can I increase the driver memory to 6 GB and broadcast the 4 GB table?
2) What is the maximum driver memory I can provide? Is there any limit?
I think it totally depends on what we are bringing to the driver (broadcast, collect, etc.).
3) I heard we can broadcast only up to 2 GB of data, because Java serialization supports only up to 2 GB. Is that true?
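
On question 1, a small illustration of the setup being asked about; the 6 GB figure comes from the question, the paths and key name are placeholders, and this is not a claim that broadcasting 4 GB is advisable:

from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

# spark.driver.memory cannot be raised from inside a running application;
# it has to be supplied at launch, e.g.
#   spark-submit --driver-memory 6g my_job.py
spark = SparkSession.builder.getOrCreate()

dim_df = spark.read.parquet("/path/to/4gb_table")     # placeholder path
fact_df = spark.read.parquet("/path/to/fact_table")   # placeholder path

# The broadcast relation is collected to the driver and shipped to every executor,
# so both the driver and each executor need headroom for the deserialized table.
result = fact_df.join(broadcast(dim_df), "key")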

Memory regarding broadcast variables in spark

Assuming I have a cluster with two worker nodes and from these two workers, I have 10 executors. How much memory will be used up in my cluster if I choose to broadcast a 1gb Map?
Will it be 1gb per worker so 2gb in total? Or will it be 1gb per executor so 10gb in total?
Apologies for the simple question, but for me, a number of articles written about broadcast variables aren’t 100% clear on this issue.
Executors are the entities that perform the actual work. Each executor is its own JVM process with allotted memory.
(As described here: http://spark.apache.org/docs/latest/cluster-overview.html)
The broadcast is materialized at the executor level, so in the above example: 10 GB.
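
A minimal sketch of the pattern being discussed; the small dict is just a stand-in for the 1 GB Map:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Stand-in for the 1 GB Map from the question.
lookup = {"a": 1, "b": 2}
bc_lookup = spark.sparkContext.broadcast(lookup)

# Every executor that runs one of these tasks fetches and caches its own copy
# of the broadcast, which is why the footprint scales with the executor count.
rdd = spark.sparkContext.parallelize(["a", "b", "a"])
print(rdd.map(lambda k: bc_lookup.value.get(k)).collect())   # [1, 2, 1]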

Cassandra out of memory heap

I have 4 Cassandra nodes in a cluster and one column family which has 10 columns, where rows cannot grow very wide (maybe 1000 columns max).
I have "peak" writes where I insert up to 500 000 records in a 5-10 minute range.
I use node.js driver: node-cassandra-cql
3 nodes are working fine but one node crashes every time on heavy writes.
All nodes currently have around 1.5 GB data size and problematic node has 1.9 GB data size.
All nodes have max heap space of 1 GB (I have 4 GB of RAM on the machines, so the default Cassandra config file calculated this amount of heap).
I use default cassandra configuration except I increased write/read timeouts.
Question: Does anyone know what could be the reason for this?
Is the heap size really that small?
What and how should I configure the Cassandra cluster for this use case (heavy writes in a small time range, and the rest of the time doing nothing or only small writes)?
I haven't tried to increase the heap size manually; first I would like to know if there is maybe something else to configure instead of just increasing it.
