I am using Node.js version 4.2.3. Recently I observed that memory piles up quickly while uploading data to an S3 bucket (each file is approximately 1.5 GB). I took a heap snapshot, which shows a TLSWrap object with a retained size of around 1 GB.
Has anyone faced the same issue? Thanks in advance.
Question:
Spark seems to be able to manage partitions that are bigger than the executor size. How does it do that?
What I have tried so far:
I picked up a CSV with: size on disk 12.3 GB, size in memory deserialized 3.6 GB, size in memory serialized 1964.9 MB. I got the in-memory sizes by caching the data both deserialized and serialized; 12.3 GB is the size of the file on disk.
To check whether Spark can handle partitions larger than the executor size, I created a cluster with just one executor, with spark.executor.memory equal to 500 MB. I also set executor cores (spark.executor.cores) to 2 and increased spark.sql.files.maxPartitionBytes to 13 GB, and I switched off dynamic allocation and adaptive query execution for good measure. The entire session configuration is:
from pyspark.sql import SparkSession

spark = SparkSession.builder.\
    config("spark.dynamicAllocation.enabled", False).\
    config("spark.executor.cores", "2").\
    config("spark.executor.instances", "1").\
    config("spark.executor.memory", "500m").\
    config("spark.sql.adaptive.enabled", False).\
    config("spark.sql.files.maxPartitionBytes", "13g").\
    getOrCreate()
I read the CSV and checked the number of partitions it was read in with df.rdd.getNumPartitions(). Output = 2. This is confirmed later by the number of tasks as well.
Then I ran df.persist(StorageLevel.DISK_ONLY); df.count() (with StorageLevel imported via from pyspark import StorageLevel).
Following are the observations I made:
No caching happens until the data for one batch of tasks (equal to the number of CPU cores, if you have set 1 CPU core per task) is read in completely. I conclude this because no entry shows up in the Storage tab of the web UI during the read.
Each partition here ends up being around 6 GB on disk. In memory, each should be at least around 1964.9 MB / 2 (= size in memory serialized / 2), i.e. roughly 982 MB. There is no spill. Below is the relevant snapshot of the web UI from when around 11 GB of the data had been read in. You can see that Input is almost 11 GB, and at that time there was nothing in the Storage tab.
Questions:
The memory per executor is 300 MB (execution + storage) + 200 MB (user memory). How is Spark able to manage ~982 MB partitions, and two of them in parallel (one per core)?
The data read in does not show up in Storage, is not (and cannot be) in executor memory, and there is no spill either. Where exactly is that read-in data?
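For reference, the sizes of these memory regions can be worked through with Spark's unified memory model. This is a sketch assuming the defaults (a fixed 300 MB reserved region and spark.memory.fraction = 0.6); note the split it yields is even smaller than the 300 MB + 200 MB estimate above:

```python
# Spark unified memory model arithmetic (Spark 2.x/3.x defaults).
executor_memory_mb = 500        # spark.executor.memory
reserved_mb = 300               # fixed reserved memory
memory_fraction = 0.6           # spark.memory.fraction (default)

usable_mb = executor_memory_mb - reserved_mb    # 200 MB left after reserve
unified_mb = usable_mb * memory_fraction        # execution + storage region
user_mb = usable_mb * (1 - memory_fraction)     # user memory region

print(round(unified_mb), round(user_mb))  # 120 80
```

Part of the answer to the first question is generally that a plain scan does not need a whole partition in memory at once: rows are streamed through the operators, and a DISK_ONLY persist writes blocks out incrementally rather than materializing the full partition.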
Attaching a screenshot of the web UI after job completion, in case that might be useful.
Attaching a screenshot of the Executors tab, in case that might be useful:
My Spark task uses image data for prediction. I am working on a standalone Spark cluster, but I have an issue utilizing all the available memory capacity. The available memory shown is 2.7 GB (coming from an executor configured with 5 GB: 5 GB × 0.6 × 0.9 = 2.7 GB, which is expected), but the used memory is only 342 MB; beyond that value my Spark session crashes, and I don't know why it is this specific value.
I tested my application both locally and in standalone cluster mode; whatever value I configure for executor memory, the limit for execution memory stays at 342 MB. As shown, a data size of 290691 KB led to the crash of my Spark session, and it works fine if I decrease the number of images.
Screenshots of the issue follow.
This is the error output from the crash with a data size of 290691 KB.
Here, my Spark UI Storage Memory did not exceed 342 MB.
So, is there any advice, or what is the correct Spark configuration?
It's a warning, initially.
The general gist here is that you need to repartition to get more, but smaller, partitions, so as to get more parallelism and higher throughput. You can find many such issues out there on the Internet.
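To make the repartitioning suggestion concrete: in Spark this is just df.repartition(n) (with n a placeholder partition count), and the effect — the same rows spread across more, smaller partitions — can be illustrated with a pure-Python sketch of hash partitioning:

```python
# Pure-Python sketch of what df.repartition(n) does conceptually;
# the partition counts and row values here are placeholders.
def partition(rows, num_partitions):
    parts = [[] for _ in range(num_partitions)]
    for row in rows:
        parts[hash(row) % num_partitions].append(row)
    return parts

rows = list(range(10_000))
few = partition(rows, 2)    # 2 large partitions -> at most 2-way parallelism
many = partition(rows, 50)  # 50 small partitions -> much more parallelism

print(max(len(p) for p in few))   # 5000
print(max(len(p) for p in many))  # 200
```

In PySpark itself this would be df.repartition(50), or df.repartition(50, "key_col") to co-locate rows by a key (column name hypothetical); a common heuristic is 2–4 partitions per available core.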
Spark version: 2.4.0
My application is a simple PySpark-Kafka structured streaming application using spark-sql-kafka-0-10_2.11:2.4.0.
To avoid this possible memory leak problem, I am also using foreachBatch for each microbatch.
Also, each microbatch is supposed to be composed of <= 1000 rows, meaning it is very unlikely to cause an out-of-memory issue as long as caches are cleared properly. To be extra cautious, I call spark.catalog.clearCache() at the end of each microbatch to ensure all caches are cleared.
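For reference, a minimal sketch of that foreachBatch-plus-clearCache pattern (the sink path, topic, and broker address are placeholders, not the actual job):

```python
# Hedged sketch of the foreachBatch pattern; sink path and options
# are placeholders.
def process_batch(batch_df, batch_id):
    """Handle one microbatch, then drop any cached state."""
    batch_df.write.mode("append").parquet("/tmp/stream_out")
    # As in the question: clear caches so nothing accumulates across batches
    # (DataFrame.sql_ctx.sparkSession reaches the session in Spark 2.4).
    batch_df.sql_ctx.sparkSession.catalog.clearCache()

# Wiring it up (requires a running Kafka broker; shown for shape only):
# (spark.readStream.format("kafka")
#      .option("kafka.bootstrap.servers", "host:9092")
#      .option("subscribe", "some_topic")
#      .option("maxOffsetsPerTrigger", 10)
#      .load()
#      .writeStream.foreachBatch(process_batch)
#      .start())
```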
However, after having it run for a while (~30 minutes), it raises the following error.
22/01/11 10:39:36 ERROR Client: Application diagnostics message: Application application_1641893608223_0002 failed 2 times due to AM Container for appattempt_1641893608223_0002_000002 exited with exitCode: -104
Failing this attempt.Diagnostics: Container [pid=17995,containerID=container_1641893608223_0002_02_000001] is running beyond physical memory limits. Current usage: 1.4 GB of 1.4 GB physical memory used; 4.4 GB of 6.9 GB virtual memory used. Killing container.
Even though 1.4 GB is a small amount of memory, each microbatch itself is pretty small as well, so it shouldn't be a problem.
Also, there are a lot of tasks stacked up in the Kafka queue. In order to prevent overload in Spark Streaming, I have set spark.streaming.blockInterval to 40000 ms and maxOffsetsPerTrigger to 10.
What could be possibly causing this out-of-memory issue?
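One hedged explanation for the 1.4 GB figure in the log: on YARN, a container's limit is the JVM heap plus spark.executor.memoryOverhead, which defaults to max(384 MB, 10% of the heap). Assuming a 1 GB heap for the AM container (the actual setting isn't shown in the question):

```python
# YARN container limit = JVM heap + memory overhead,
# where the overhead defaults to max(384 MB, 10% of the heap).
heap_mb = 1024  # assumption: a 1 GB heap
overhead_mb = max(384, int(0.10 * heap_mb))
container_mb = heap_mb + overhead_mb

print(container_mb)                   # 1408
print(round(container_mb / 1024, 1))  # 1.4 (GB), matching the log line
```

Since exitCode -104 came from the AM container rather than an executor, raising spark.yarn.am.memory (client mode) or spark.driver.memory / spark.driver.memoryOverhead (cluster mode) may be the first thing to try; the container limit itself, not the microbatch size, is what the log is complaining about.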
I have a Spark job that reads a CSV file and does a bunch of joins and renaming columns.
The file size is in the MB range.
x = info_collect.collect()
x's size in Python is around 100 MB.
However, I get a memory crash; checking Ganglia, the memory goes up to 80 GB.
I have no idea why collecting 100 MB can cause memory to spike like that.
Could someone please advise?
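One hedged explanation, since the job itself isn't shown: collect() materializes every row as Python objects on the driver, while the JVM side also holds the serialized results, so the in-memory footprint can be many times the logical 100 MB. The per-object overhead alone is easy to demonstrate:

```python
import sys

# A ~100-byte record as raw bytes vs. as a tuple of 12 Python floats
# (roughly the same logical payload: 12 * 8 bytes).
record_bytes = b"a" * 100
record_objs = tuple(float(i) for i in range(12))

size_bytes = sys.getsizeof(record_bytes)
size_objs = sys.getsizeof(record_objs) + sum(sys.getsizeof(v) for v in record_objs)

print(size_bytes)  # the raw form: barely more than 100 bytes
print(size_objs)   # the object form: several times larger
```

That said, an 80 GB spike points more at the joins before the collect (wide shuffles, or an accidental near-cartesian join) than at the collect itself; inspecting the plan with df.explain(), or replacing collect() with toLocalIterator() or a write to storage, would help narrow it down.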
I’m using Cassandra 1.2.1, and I am using the COPY command to insert millions of rows. Each row is 100 bytes long. The issue is that insertion happens rather slowly, at a rate of 1500 rows per second. We have a 3-node cluster with 50 GB of disk space and 4 GB of RAM on each node. The Cassandra process is running with a max heap size of 1 GB. We are storing commit logs and data files on the same disk. What could be the cause of this behaviour? Any help would be appreciated.
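For scale, the reported rate works out to very little raw throughput, which suggests per-row client overhead in cqlsh's COPY rather than disk or network bandwidth as the bottleneck. Simple arithmetic (the 10-million-row figure is an illustrative assumption):

```python
rows_per_sec = 1500
row_bytes = 100

# Effective write throughput at the reported rate:
throughput_kb_s = rows_per_sec * row_bytes / 1024
print(round(throughput_kb_s, 1))  # 146.5 KB/s -- far below disk bandwidth

# Time to load, say, 10 million rows at this rate:
hours = 10_000_000 / rows_per_sec / 3600
print(round(hours, 1))  # ~1.9 hours
```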
Apparently as of now, they are not planning to improve the speed of COPY.
See https://issues.apache.org/jira/browse/CASSANDRA-4588