I am new to Spark. I am using an r5a.4xlarge AWS cluster with a minimum of 1 worker and a maximum of 16 workers. This instance type has 128 GB of memory and 16 cores.
I have set spark.executor.cores to 5.
As per the memory-management calculation, the memory per executor comes to around 42 GB. After subtracting the 10% overhead memory, the net memory available is around 37 GB.
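For reference, a minimal sketch of that calculation, assuming floor(16 cores / 5 executor cores) = 3 executors per node:
# Sketch of the per-executor memory estimate described above.
node_memory_gb = 128
node_cores = 16
executor_cores = 5
executors_per_node = node_cores // executor_cores              # 3
memory_per_executor_gb = node_memory_gb / executors_per_node   # ~42.7 GB
heap_per_executor_gb = memory_per_executor_gb * 0.9            # ~38.4 GB; rounded down to 37G in my config
print(executors_per_node, memory_per_executor_gb, heap_per_executor_gb)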
I have kept off-heap memory enabled. Whenever I try to use the Spark configuration below, I get the following error.
Error updating cluster for job S2C_ER_ENROLMENT_PREM_DTL_HISTORY_TEST: Specified heap memory (37888 MB) and off heap memory (73404 MB) is above the maximum executor memory (97871 MB) allowed for node type r5a.4xlarge.
I have three questions about the above error message:
How can the maximum executor memory be 97871 MB? This is nowhere near my calculated result.
How does the off-heap memory become 73404 MB when I did not set it explicitly?
How is the off-heap memory calculated?
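For context on the off-heap questions: in open-source Spark, off-heap memory is controlled by spark.memory.offHeap.enabled and spark.memory.offHeap.size (the size defaults to 0 and is normally set explicitly), so I assume the platform is deriving the 73404 MB default for me. A minimal sketch of setting it explicitly (the 8g value is only an illustration; the same keys can also go in the cluster Spark config):
from pyspark.sql import SparkSession

# Sketch only: size off-heap memory explicitly so that
# heap + off-heap stays below the node's maximum executor memory (97871 MB here).
spark = (SparkSession.builder
         .config('spark.memory.offHeap.enabled', 'true')
         .config('spark.memory.offHeap.size', '8g')   # illustrative value, not a recommendation
         .getOrCreate())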
Below is the configuration I was trying to use.
spark.broadcast.compress true
spark.dynamicAllocation.enabled true
yarn.fail-fast true
spark.shuffle.reduceLocality.enabled true
spark.task.cpus 5
spark.dynamicAllocation.shuffleTracking.enabled true
mapreduce.shuffle.listen.queue.size 2048
spark.rpc.message.maxSize 1024
spark.storage.memoryFraction 0.8
spark.files.useFetchCache true
spark.scheduler.mode FAIR
spark.shuffle.compress true
spark.sql.adaptive.coalescePartitions.initialPartitionNum 3
spark.sql.adaptive.coalescePartitions.enabled true
mapreduce.shuffle.max.connections 100
spark.executor.cores 5
spark.executor.memory 37G
spark.driver.memory 37G
spark.driver.cores 5
spark.executor.instances 29
spark.sql.adaptive.coalescePartitions.parallelismFirst false
spark.storage.replication.proactive true
spark.sql.adaptive.skewJoin.enabled true
spark.network.timeout 300000
spark.broadcast.blockSize 128m
spark.sql.adaptive.coalescePartitions.minPartitionSize 256M
spark.akka.frameSize 1024
spark.speculation true
spark.cleaner.periodicGC.interval 12000
mapreduce.reduce.shuffle.memory.limit.percent 85
spark.logConf false
spark.executor.heartbeatInterval 200000
spark.worker.cleanup.enabled true
spark.sql.adaptive.enabled true
spark.sql.adaptive.advisoryPartitionSizeInBytes 1024M
spark.shuffle.io.preferDirectBufs true
mapreduce.map.log.level ERROR
spark.sql.adaptive.skewJoin.skewedPartitionFactor 6
spark.default.parallelism 40
spark.hadoop.databricks.fs.perfMetrics.enable false
mapreduce.shuffle.max.threads 99
Related
I am running some complex logic in Spark and, in order to get some metrics, I added a count in a few places. There are 2 stages:
After reading the data from S3, I got 22 tasks, and these tasks were distributed across 22 executors.
When I run count, the data shuffles to only 4 executors, with 200 tasks.
code:
val newRdd = rdd.toDF()
  .repartition(Constants.cappingSparkPartitionNum, $"internal_id")
  .rdd // line 389
  .map(convertRowToConfiguration)
Log.info(s"source, rdd.getNumPartitions: ${rdd.getNumPartitions}, newRdd.getNumPartitions: ${newRdd.getNumPartitions}, count: ${newRdd.count()}") // line 392
Why did this happen?
How can I control the data so that it is distributed evenly across the executors (40 executors)?
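For illustration, a minimal PySpark sketch (my code above is Scala, but the same repartition overloads exist on the Dataset API): repartitioning by a partition count alone does, as I understand it, a round-robin distribution, whereas repartitioning by internal_id hash-partitions on that key and can concentrate rows in a few partitions if the key values are skewed.
# Sketch only: 'rdd' is the same RDD as in the snippet above.
even_df = rdd.toDF().repartition(40)                    # round-robin across 40 partitions
keyed_df = rdd.toDF().repartition(200, 'internal_id')   # hash-partitioned by internal_id (what I do now)
print(even_df.rdd.getNumPartitions(), keyed_df.rdd.getNumPartitions())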
We are processing a roughly 500 MB file of data on EMR.
I am performing the following operations on the file.
Read the CSV:
val df = spark.read.format("csv").load(s3)
Aggregate by key and create the list:
val data = filteredDf.groupBy($"<key>")
  .agg(collect_list(struct(cols.head, cols.tail: _*)) as "finalData")
  .toJSON
Iterate through each partition, store the per-key aggregation to S3, and send the key to SQS:
data.foreachPartition(partition => {
  partition.foreach(json => ......)
})
The data is skewed, with one account having almost 10M records (~400 MB). I am experiencing an out-of-memory issue during foreachPartition for that account.
Configuration:
1 driver: m4.4xlarge (CPU cores: 16, memory: 64 GB)
1 executor: m4.2xlarge (CPU cores: 8, memory: 32 GB)
driver-memory: 20G
executor-memory: 10G
Partitions: default 200 [most of them don't do anything]
Any help is much appreciated! Thanks a lot in advance :)
I am using an m4.2xlarge master + 12 r5.12xlarge core instances to run my Spark job (Spark 2.4, EMR 5.21).
I gave the following cluster config:
[
  {
    "classification": "spark-defaults",
    "properties": {
      "spark.executor.memory": "39219M",
      "spark.driver.memory": "39219M",
      "spark.driver.cores": "5",
      "spark.executor.cores": "5",
      "spark.memory.storageFraction": "0.27",
      "spark.memory.fraction": "0.80",
      "spark.executor.instances": "107",
      "spark.yarn.executor.memoryOverhead": "4357M",
      "spark.dynamicAllocation.enabled": "false",
      "spark.yarn.driver.memoryOverhead": "4357M"
    },
    "configurations": []
  }
]
As the EC2 instance types page says, the r5.12xlarge has 384 GB of memory. I calculated the above as follows:
# of cores per executor = 5
# of executors per r5.12x instance = floor(48/5) = 9
spark.executor.instances = 9 * 12 - 1 (minus 1 for driver)
spark.executor.memory = floor(((383 * 1024)/9) * 0.9) = 39219MB
spark.executor.memoryOverhead = floor(((383 * 1024)/9) * 0.1) = 4357MB
(and same for driver)
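For reference, a Python sketch of the arithmetic, including the unified-memory formula that I believe backs the "Storage Memory" column in the UI, i.e. (heap - 300 MB reserved) * spark.memory.fraction (the 300 MB reserved figure is my assumption from Spark's UnifiedMemoryManager):
import math

cores_per_node = 48                  # r5.12xlarge vCPUs
nodes = 12
executor_cores = 5
node_mem_mb = 383 * 1024             # usable memory assumed above

executors_per_node = cores_per_node // executor_cores                     # 9
executor_instances = executors_per_node * nodes - 1                       # 107 (one slot kept for the driver)
executor_memory_mb = math.floor(node_mem_mb / executors_per_node * 0.9)   # 39219
memory_overhead_mb = math.floor(node_mem_mb / executors_per_node * 0.1)   # 4357

unified_mb = (executor_memory_mb - 300) * 0.80                            # ~31135 MB, about 30.4 GB
print(executor_instances, executor_memory_mb, memory_overhead_mb, round(unified_mb))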
Yet when the cluster launches, 95 executors + 1 driver are created (instead of 107 executors + 1 driver), each with a storage memory (as per the Spark UI) of 29 GB (which should have been spark.memory.fraction * 39219 ≈ 30.63 GB). Why are 12 fewer executors (including the driver) created? And why is the storage memory in the UI less? Am I correct about the formula used to derive the storage memory shown in the UI?
I tested the throughput performance of a Cassandra cluster with 2, 3, and 4 nodes. There was a significant improvement when I used 3 nodes (as compared to 2); however, the improvement wasn't as significant when I used 4 nodes instead of 3.
Given below are the specs of the 4 nodes:
N -> number of physical CPU cores, Ra -> total RAM, Rf -> free RAM
Node 1: N=16, Ra=189 GB, Rf=165 GB
Node 2: N=16, Ra=62 GB, Rf=44 GB
Node 3: N=12, Ra=24 GB, Rf=38 GB
Node 4: N=16, Ra=189 GB, Rf=24 GB
All nodes are on RHEL 6.5
Case 1 (2 nodes in the cluster: Node 1 and Node 2)
Throughput: 12K ops/second
Case 2 (3 nodes in the cluster: Node 1, Node 2, and Node 3)
Throughput: 20K ops/second
Case 3 (all 4 nodes in the cluster)
Throughput: 23K ops/second
One operation involved 1 read + 1 write (the read and write take place on the same row; the row cache can't be used). In all cases, read consistency = 2 and write consistency = 1. Both reads and writes were asynchronous. The client application used DataStax's C++ driver and was run with 10 threads.
Given below are the keyspace and table details:
CREATE KEYSPACE cass WITH replication = {'class': 'NetworkTopologyStrategy', 'dc1': '2'} AND durable_writes = true;
CREATE TABLE cass.test_table (
pk text PRIMARY KEY,
data1_upd int,
id1 int,
portid blob,
im text,
isflag int,
ms text,
data2 int,
rtdata blob,
rtdynamic blob,
rtloc blob,
rttdd blob,
rtaddress blob,
status int,
time bigint
) WITH bloom_filter_fp_chance = 0.001
AND caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'}
AND comment = ''
AND compaction = {'class': 'org.apache.cassandra.db.compaction.LeveledCompactionStrategy'}
AND compression = {'chunk_length_in_kb': '64', 'class': 'org.apache.cassandra.io.compress.LZ4Compressor'}
AND crc_check_chance = 1.0
AND dclocal_read_repair_chance = 0.1
AND default_time_to_live = 0
AND gc_grace_seconds = 864000
AND max_index_interval = 2048
AND memtable_flush_period_in_ms = 0
AND min_index_interval = 128
AND read_repair_chance = 0.0
AND speculative_retry = '99PERCENTILE';
Some parameters from the YAML are given below (all 4 nodes used similar YAML files):
commitlog_segment_size_in_mb: 32
concurrent_reads: 64
concurrent_writes: 256
concurrent_counter_writes: 32
memtable_offheap_space_in_mb: 20480
memtable_allocation_type: offheap_objects
memtable_flush_writers: 1
concurrent_compactors: 2
Some parameters from jvm.options are given below (all nodes used the same values):
### CMS Settings
-XX:+UseParNewGC
-XX:+UseConcMarkSweepGC
-XX:+CMSParallelRemarkEnabled
-XX:SurvivorRatio=4
-XX:MaxTenuringThreshold=6
-XX:CMSInitiatingOccupancyFraction=75
-XX:+UseCMSInitiatingOccupancyOnly
-XX:CMSWaitDuration=10000
-XX:+CMSParallelInitialMarkEnabled
-XX:+CMSEdenChunksRecordAlways
-XX:+CMSClassUnloadingEnabled
Given below are some of the client's connection-specific parameters:
cass_cluster_set_max_connections_per_host ( ms_cluster, 20 );
cass_cluster_set_queue_size_io ( ms_cluster, 102400*1024 );
cass_cluster_set_pending_requests_low_water_mark(ms_cluster, 50000);
cass_cluster_set_pending_requests_high_water_mark(ms_cluster, 100000);
cass_cluster_set_write_bytes_low_water_mark(ms_cluster, 100000 * 2024);
cass_cluster_set_write_bytes_high_water_mark(ms_cluster, 100000 * 2024);
cass_cluster_set_max_requests_per_flush(ms_cluster, 10000);
cass_cluster_set_request_timeout ( ms_cluster, 12000 );
cass_cluster_set_connect_timeout (ms_cluster, 60000);
cass_cluster_set_core_connections_per_host(ms_cluster,1);
cass_cluster_set_num_threads_io(ms_cluster,10);
cass_cluster_set_connection_heartbeat_interval(ms_cluster, 60);
cass_cluster_set_connection_idle_timeout(ms_cluster, 120);
Is there anything wrong with the configuration that could explain why Cassandra didn't scale much when the number of nodes was increased from 3 to 4?
During a test, you may check ThreadPools using nodetool tpstats.
You will be able to see if some stages have too many pending (or blocked) tasks.
If there are no issues with the thread pools, maybe you could launch a benchmark using cassandra-stress in order to see whether the limitation comes from your client.
I don't know if this is only for test purposes, but as far as I know, read-before-write is an antipattern with Cassandra.
I have a dataset with ~5M rows x 20 columns, containing a groupID and a rowID. My goal is to check whether (some) columns contain more than a fixed fraction (say, 50%) of missing (null) values within a group. If this is found, the entire column is set to missing (null), for that group.
from pyspark.sql import functions as F

df = spark.read.parquet('path/to/parquet/')
check_columns = {'col1': ..., 'col2': ..., ...}  # currently len(check_columns) = 8

for col, _ in check_columns.items():
    total = (df
             .groupBy('groupID').count()
             .toDF('groupID', 'n_total')
             )
    missing = (df
               .where(F.col(col).isNull())
               .groupBy('groupID').count()
               .toDF('groupID', 'n_missing')
               )
    # count_missing = count_missing.persist()  # PERSIST TRY 1
    # print('col {} found {} missing'.format(col, missing.count()))  # missing.count() is b/w 1k-5k
    poor_df = (total
               .join(missing, 'groupID')
               .withColumn('freq', F.col('n_missing') / F.col('n_total'))
               .where(F.col('freq') > 0.5)
               .select('groupID')
               .toDF('poor_groupID')
               )
    df = (df
          .join(poor_df, df['groupID'] == poor_df['poor_groupID'], 'left_outer')
          .withColumn(col, (F.when(F.col('poor_groupID').isNotNull(), None)
                            .otherwise(df[col])
                            )
                      )
          .select(df.columns)
          )
    stats = (missing
             .withColumnRenamed('n_missing', 'cnt')
             .collect()  # FAIL 1
             )
    # df = df.persist()  # PERSIST TRY 2
    print(df.count())  # FAIL 2
I initially assigned 1G of spark.driver.memory and 4G of spark.executor.memory, eventually increasing the spark.driver.memory up to 10G.
Problem(s):
The loop runs like a charm during the first iterations, but towards the end,
around the 6th or 7th iteration I see my CPU utilization dropping (using 1
instead of 6 cores). Along with that, execution time for one iteration
increases significantly.
At some point, I get an OutOfMemory Error:
spark.driver.memory < 4G: at collect() (FAIL 1)
4G <= spark.driver.memory < 10G: at the count() step (FAIL 2)
Stack Trace for FAIL 1 case (relevant part):
[...]
py4j.protocol.Py4JJavaError: An error occurred while calling o1061.collectToPython.
: java.lang.OutOfMemoryError: Java heap space
[...]
The executor UI does not reflect excessive memory usage (it shows a <50k used
memory for the driver and <1G for the executor). The Spark metrics system
(app-XXX.driver.BlockManager.memory.memUsed_MB) does not either: it shows
600M to 1200M of used memory, but always >300M remaining memory.
(This would suggest that 2G driver memory should do it, but it doesn't.)
It also does not matter which column is processed first (as it is a loop over
a dict(), it can be in arbitrary order).
My questions thus:
What causes the OutOfMemory Error and why are not all available CPU cores
used towards the end?
And why do I need 10G spark.driver.memory when I am transferring only a few kB from the executors to the driver?
A few (general) questions to make sure I understand things properly:
If I get an OOM error, is the right place to look almost always the driver (since the executors spill to disk)?
Why would count() cause an OOM error? I thought this action would only consume resources on the executor(s), delivering just a few bytes to the driver.
Are the memory metrics (metrics system, UI) mentioned above the correct
places to look at?
BTW: I run Spark 2.1.0 in standalone mode.
UPDATE 2017-04-28
To drill down further, I enabled a heap dump for the driver:
from pyspark import SparkConf
cfg = SparkConf()
cfg.set('spark.driver.extraJavaOptions', '-XX:+HeapDumpOnOutOfMemoryError')
I ran it with 8G of spark.driver.memory and I analyzed the heap dump with
Eclipse MAT. It turns out there are two classes of considerable size (~4G each):
java.lang.Thread
- char (2G)
- scala.collection.IndexedSeqLike
- scala.collection.mutable.WrappedArray (1G)
- java.lang.String (1G)
org.apache.spark.sql.execution.ui.SQLListener
- org.apache.spark.sql.execution.ui.SQLExecutionUIData
(various of up to 1G in size)
- java.lang.String
- ...
I tried to turn off the UI, using
cfg.set('spark.ui.enabled', 'false')
which made the UI unavailable but didn't help with the OOM error. I also tried to have the UI keep less history, using
cfg.set('spark.ui.retainedJobs', '1')
cfg.set('spark.ui.retainedStages', '1')
cfg.set('spark.ui.retainedTasks', '1')
cfg.set('spark.sql.ui.retainedExecutions', '1')
cfg.set('spark.ui.retainedDeadExecutors', '1')
This also did not help.
UPDATE 2017-05-18
I found out about Spark's pyspark.sql.DataFrame.checkpoint method. This is like persist, but it gets rid of the DataFrame's lineage. Thus it helps to circumvent the issues mentioned above.
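For reference, a minimal sketch of how checkpointing can be wired in (the checkpoint directory path below is just a placeholder):
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark.sparkContext.setCheckpointDir('hdfs:///tmp/checkpoints')   # placeholder path

df = spark.read.parquet('path/to/parquet/')
# ... per-column null handling as in the loop above ...
df = df.checkpoint(eager=True)   # materializes df and truncates its lineage
print(df.count())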