Spark : 100 GB file OutOfMemoryError: Requested array size exceeds VM limit [closed] - apache-spark

I have a 100 GB XML file that I'd like to process using Spark / AWS Glue. However, I'm getting an OutOfMemoryError: Requested array size exceeds VM limit. I have tested my code on a 60 MB file and it works without any issues, but it chokes on the 100 GB file. Is partitioning the XML file even an option here? If it is, how can I split the file into pieces of a certain size (say 5 GB) while making sure each piece ends at the closing tag of a record? Meaning:
<provider> ....many many inner and nested elements ...
</provider>
<provider> ....many many inner and nested elements ...
</provider>
<provider> ....many many inner and nested elements ...
</provider> <!-- this 5 GB file, for example, would end at the closing tag -->
Any help, examples, or approaches for parsing and processing this 100 GB file would be appreciated.
One of the log files has the following output:
# java.lang.OutOfMemoryError: Requested array size exceeds VM limit
# -XX:OnOutOfMemoryError="kill -9 %p"
# Executing /bin/sh -c "kill -9 7"...
Another log file suggests that I'm not able to convert my DynamicFrame to a Spark DataFrame because of the amount of data:
Error: An error occurred while calling o102.toDF.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 3) (10.223.68.133 executor 4): ExecutorLostFailure (executor 4 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2863)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2799)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2798)
at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:58)
at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:51)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2798)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1239)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1239)
at scala.Option.foreach(Option.scala:257)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1239)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:3051)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2993)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2982)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:1009)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2229)
Code below:
import sys
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext
from pyspark.sql import functions as F

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

try:
    # Script generated for node S3 bucket
    df = glueContext.create_dynamic_frame.from_options(
        format_options={"rowTag": "PRVDR"},
        connection_type="s3",
        format="xml",
        connection_options={
            "paths": ["s3://aws-glue-bucket/xml_files/myfile.XML"]
        },
    )
    df1 = df.toDF()
    df1.printSchema()

    df_provider_info = df1.withColumn(
        "pec_ind_name_exploded",
        F.explode(df1["PRVDR_INFO.INDVDL_INFO.NAME_LIST.PEC_INDVDL_NAME"])
    ).select(
        df1["PRVDR_INFO.INDVDL_INFO.ID"].alias("ID"),
        df1["PRVDR_INFO.INDVDL_INFO.BIRTH_DT"],
        df1["PRVDR_INFO.INDVDL_INFO.BIRTH_STATE_CD"],
        df1["PRVDR_INFO.INDVDL_INFO.BIRTH_STATE_NAME"],
        df1["PRVDR_INFO.INDVDL_INFO.BIRTH_CNTRY_CD"],
        # many more elements selected to flatten ...
    )

    df_provider_info.write.format("snowflake") \
        .options(**snowflake_options) \
        .option("dbtable", "PROVIDER_INFO") \
        .mode("overwrite") \
        .save()
# (except/finally clause not shown in the snippet)
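One variant I'm considering (just a sketch, assuming the spark-xml package, e.g. com.databricks:spark-xml, can be attached to the Glue job; the rowTag and column names are the same as in the code above) is to skip the DynamicFrame entirely and read straight into a DataFrame:

from pyspark.sql import functions as F

# Sketch: spark-xml reads records delimited by the rowTag, which should avoid
# having to split the big file by hand.
df1 = (spark.read.format("xml")
       .option("rowTag", "PRVDR")
       .load("s3://aws-glue-bucket/xml_files/myfile.XML"))

df_provider_info = (df1
    .withColumn("pec_ind_name_exploded",
                F.explode("PRVDR_INFO.INDVDL_INFO.NAME_LIST.PEC_INDVDL_NAME"))
    .select(F.col("PRVDR_INFO.INDVDL_INFO.ID").alias("ID"),
            F.col("PRVDR_INFO.INDVDL_INFO.BIRTH_DT")))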

Related

AWS Glue ExecutorLostFailure (executor 15 exited caused by one of the running tasks) Reason: Remote RPC client disassociated

I have a simple Glue job where I am using PySpark to read 14 million rows from RDS over JDBC and then trying to save them to S3. I can see in the Glue output logs that reading and creating the DataFrame is quick, but the write operation fails with the error:
error occurred while calling o89.save. Job aborted due to stage failure: Task 0 in stage 2.0 failed 4 times, most recent failure: Lost task 0.3 in stage 2.0 (TID 5, 10.150.85.95, executor 15): ExecutorLostFailure (executor 15 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
I have tried the following solutions:
Adding --conf with spark.executor.memory=10g, and also 30g, after seeing some solutions on SO.
Tried converting the Spark DataFrame to a DynamicFrame and then calling the save operation.
Tried increasing the workers to 500!
And still no luck getting it to pass.
One odd thing I observed: after I create the DataFrame by reading from JDBC, the entire DataFrame stays in one partition until I repartition it. The reading step itself completes without any error.
I used the same code for 6 million rows and the job completes in 5 minutes,
but it fails for 14 million rows with the ExecutorLostFailure error.
I also sometimes see this error if I dig deep into the logs:
2023-01-22 10:36:52,972 WARN [allocator] glue.ExecutorTaskManagement (Logging.scala:logWarning(66)): executor task creation failed for executor 203, restarting within 15 secs. restart reason: Executor task resource limit has been temporarily hit..
Code:
def read_from_db():
    logger.info(f'Starts Reading Data from {DB_TABLE} table')
    start = time.perf_counter()
    filter_query = f'SELECT * FROM {DB_TABLE}'
    sql_query = '({}) as query'.format(filter_query)
    spark_df = (glueContext.read.format('jdbc')
                .option('driver', 'org.postgresql.Driver')
                .option('url', JDBC_URL)
                .option('dbtable', sql_query)
                .option('user', DB_USERS)
                .option('password', DB_PASSWORD)
                .load()
                )
    end = time.perf_counter()
    logger.info(f'Count of records in DB is {spark_df.count()}')
    logger.info(f'Elapsed time for reading records from {DB_TABLE} table = {end - start:0.4f} seconds')
    logger.info(f'Finished Reading Data from {DB_TABLE} table')
    logger.info(f"Total no. of partitions - {spark_df.rdd.getNumPartitions()}")

    # def write_to_s3(spark_df_rep):
    #     S3_PATH = (
    #         f"{S3_BUCKET}/all-entities-update/{date}/{cur_time}"
    #     )
    #     spark_df_rep.write.format("csv").option("header", "true").save(S3_PATH)

    spark_df = spark_df.repartition(20)
    logger.info(f"Completed Repartitioning. Total no. of partitions - {spark_df.rdd.getNumPartitions()}")

    # spark_df.foreachPartition(write_to_s3)
    # spark_dynamic_frame = DynamicFrame.fromDF(spark_df, glueContext, "spark_dynamic_frame")
    # logger.info("Conversion to DynamicFrame complete")
    # glueContext.write_dynamic_frame.from_options(
    #     frame=spark_dynamic_frame,
    #     connection_type="s3",
    #     connection_options={"path": S3_PATH},
    #     format="csv"
    # )

    S3_PATH = (
        f"{S3_BUCKET}/all-entities-update/{date}/{cur_time}"
    )
    spark_df.write.format("csv").option("header", "true").save(S3_PATH)
    return
In many cases this rather cryptic error message signals an OOM. Setting spark.task.cpus to a value greater than the default of 1 (up to 8, which is the number of cores on a G.2X worker for Glue version 3 or higher) helped me. This effectively increases the amount of memory a single Spark task gets (at the cost of a few cores sitting idle).
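For example (a minimal sketch; in a Glue job the same setting can also be passed through the --conf special job parameter):

from pyspark import SparkConf, SparkContext

# Sketch: give each task more cores, and therefore a larger share of executor memory.
# The value 4 is only an example; as noted above, a G.2X worker allows up to 8.
conf = SparkConf().set("spark.task.cpus", "4")
sc = SparkContext.getOrCreate(conf)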
I understood this was because no memory was left in one executor. Increasing workers doesn't help, because 1 worker → 1 executor → 2 DPUs; even the maximum configuration with G.2X doesn't help.
This issue arose because the data was skewed. All rows in my database were similar except for 2 of the 13 columns, and PySpark wasn't able to spread them across partitions; it was trying to load all the rows into a single partition.
So increasing workers/executors was of no help.
I solved this by loading the data into different partitions manually. Spark was indeed trying to keep everything in one partition; I verified that it was in one partition.
Even adding repartitioning didn't help.
I was getting the error while writing, not while reading, which was the cause of the confusion. But the actual issue was with reading: the read is only triggered when the write (an action) is called. So we were seeing the error at the write step.
From other SO answers:
Spark reads the data as soon as an action is applied; since you are just reading and writing to S3, the data is read when the write is triggered.
Spark is not optimized to read bulk data from an RDBMS, as it establishes only a single connection to the database.
Write the data out in Parquet format in parallel (see the sketch below).
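For the last point, a minimal sketch (spark_df and S3_PATH as in the code above; the partition count is only an example):

# Write out as Parquet with an explicit number of partitions so the write runs in
# parallel, producing one output file per partition.
spark_df.repartition(70).write.mode("overwrite").parquet(S3_PATH)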
Also see:
Databricks Spark Pyspark RDD Repartition - "Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues."
Manually partition for skewed data
I added a temporary column called RNO (row number), which is used as the partitionColumn to split the data into partitions; it has to be an int or a datetime. Once the job is done, I drop this RNO column in the job itself or manually.
I had to read 14 million records from the RDBMS and then write them to S3, with each file holding around 200k records.
This is where we can use upperBound, lowerBound and numPartitions along with the partitionColumn.
I ran with upperBound=14,000,000, lowerBound=1 and numPartitions=70 so that each file gets about 200k records ((upperBound - lowerBound) / numPartitions). It created 65 files and the job ran successfully within 10 minutes.
filter_query = f'select ROW_NUMBER() OVER(ORDER BY (SELECT NULL)) AS RNO, * from {DB_TABLE}'
sql_query = '({}) as query'.format(filter_query)
spark_df = (spark.read.format('jdbc')
            .option('driver', 'org.postgresql.Driver')
            .option('url', JDBC_URL)
            .option('dbtable', sql_query)
            .option('user', DB_USERS)
            .option('password', DB_PASSWORD)
            .option('partitionColumn', 'RNO')
            .option('numPartitions', 70)
            .option('lowerBound', 1)
            .option('upperBound', 14000000)
            .load()
            )
Additional references:
https://blog.knoldus.com/understanding-the-working-of-spark-driver-and-executor/

Spark breaks when you need to make a very large shuffle

I'm working with 1 terabyte of data, and at one point I need to join two smaller dataframes. I don't know their exact size, but it is more than 200 GB, and I get the error below.
The failure occurs in the middle of the operation, after 2 hours.
It seems to me to be a memory issue, but that doesn't make sense, because looking at the Ganglia UI, the RAM usage doesn't reach the limit, as the Ganglia screenshot showed.
Does anyone have any idea how I can solve this without decreasing the amount of data analyzed?
My cluster has:
1 x master node n1-highmem-32
4 x slave node n1-highmem-32
org.apache.spark.SparkException: Job aborted due to stage failure: Task 3 in stage 482.1 failed 4 times, most recent failure: Lost task 3.3 in stage 482.1 (TID 119785, 10.0.101.141, executor 1): java.io.FileNotFoundException: /tmp/spark-83927f3e-4511-1b/3d/shuffle_248_72_0.data.f3838fbc-3d38-4889-b1e9-298f743800d0 (No such file or directory)
at java.io.FileOutputStream.open0(Native Method)
Caused by: java.io.FileNotFoundException: /tmp/spark-83927f3e-4511-1b/3d/shuffle_248_72_0.data.f3838fbc-3d38-4889-b1e9-298f743800d0 (No such file or directory)
at java.io.FileOutputStream.open0(Native Method)
at java.io.FileOutputStream.open(FileOutputStream.java:270)
These types of errors typically occur when there are deeper problems with some tasks, like significant data skew. Since you don't provide enough details (please be sure to read How to Ask and How to create a Minimal, Complete, and Verifiable example) or job statistics, the only approach I can think of is to significantly increase the number of shuffle partitions:
sqlContext.setConf("spark.sql.shuffle.partitions", 2048)
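The equivalent from PySpark, via the SparkSession conf API (a short usage sketch):

# Same setting applied through the runtime conf of an existing SparkSession:
spark.conf.set("spark.sql.shuffle.partitions", "2048")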

Kryo serialization failed: Buffer overflow

We read data stored in hourly folders in S3 through Spark in Scala. For example:
sparkSession
  .createDataset(sc
    .wholeTextFiles("s3://<Bucket>/<key>/<yyyy>/<MM>/<dd>/<hh>/*")
    .values
    .flatMap(x => {
      x.replace("\n", "")
        .replace("}{", "}}{{")
        .split("\\}\\{")
    }))
We do the above slicing and dicing (the replace and split) to convert the pretty-printed JSON into JSON lines (one JSON record per line).
Now I am getting this error while running on EMR:
Job aborted due to stage failure: Task 1 in stage 11.0 failed 4 times, most recent failure: Lost task 1.3 in stage 11.0 (TID 43, ip-10-0-2-22.eu-west-1.compute.internal, executor 1): org.apache.spark.SparkException: Kryo serialization failed: Buffer overflow. Available: 0, required: 1148334. To avoid this, increase spark.kryoserializer.buffer.max value.
I have tried increasing the Kryo serializer buffer with --conf spark.kryoserializer.buffer.max=2047m, but I am still getting this error when reading data for some hour locations (like hours 09 and 10), while other hours read fine.
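For reference, this is roughly how the buffer setting is applied (a PySpark sketch; our job is in Scala, but the conf keys are the same and need to be in place before the SparkSession is created):

from pyspark.sql import SparkSession

# Sketch: Kryo settings belong in the application conf, not at runtime.
spark = (SparkSession.builder
         .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
         .config("spark.kryoserializer.buffer.max", "2047m")  # value from above
         .getOrCreate())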
I wanted to ask how to get rid of this error, and whether I need to change anything else in the Spark configuration, like the number of partitions. Thanks.

skip a very large cell parquet

I have a Parquet file of 250 MB.
One of the cells has bad data. I am assuming there is no schema issue, but rather a length issue. When I skip reading this column, I am able to read the file via Spark.
When I try to read the column, Spark runs out of memory. I have tried giving 100 GB of RAM to the executor and it still fails.
There are 58k rows in this file. Is there a way to recover the rest of the data and ignore that one row / one cell?
The column is named meta and is of type struct<name:string,schema_version:string>.
I did try converting to JSON and then skipping the row, but the conversion to JSON fails.
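What I can do today, for reference, is read everything except that column (a rough sketch; the paths are placeholders):

# Sketch: project away the problematic "meta" column so it is never read from the
# Parquet file, then persist the remaining columns.
df = spark.read.parquet("s3://<bucket>/<key>/bad-file.snappy.parquet")
df_without_meta = df.drop("meta")
df_without_meta.write.mode("overwrite").parquet("s3://<bucket>/recovered/")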
Stack trace on spark:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 7 in stage 10.0 failed 4 times, most recent failure: Lost task 7.3 in stage 10.0 (TID 157, ip-10-1-131-191.us-west-2.compute.internal, executor 22): ExecutorLostFailure (executor 22 exited caused by one of the running tasks) Reason: Container killed by YARN for exceeding memory limits. 11.6 GB of 11.1 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead or disabling yarn.nodemanager.vmem-check-enabled because of YARN-4714.
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:2041)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:2029)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:2028)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2028)
Since we had isolated the problem to a specific file, we tried the following:
parquet-tools cat /Users/gaurav/Downloads/part-00142-84bd71d5-268e-4db2-a962-193c171ed889.c000.snappy.parquet > ~/Downloads/parquue_2.json
java.lang.OutOfMemoryError: Java heap space
Parquet column dump
parquet-tools dump -c meta part-00142-84bd71d5-268e-4db2-a962-193c171ed889.c000.snappy.parquet
row group 0
--------------------------------------------------------------------------------
row group 1
--------------------------------------------------------------------------------

Databricks Checksum error while writing to a file

I am running a job on 9 nodes.
All of them write some information to files, doing simple writes like the one below:
dfLogging.coalesce(1).write.format('delta').mode('append').save('/dbfs/' + loggingLocation)
However, I am receiving this exception:
py4j.protocol.Py4JJavaError: An error occurred while calling
o106.save. : java.util.concurrent.ExecutionException:
org.apache.spark.SparkException: Job aborted due to stage failure:
Task 1 in stage 14.0 failed 1 times, most recent failure: Lost task
1.0 in stage 14.0 (TID 259, localhost, executor driver): org.apache.hadoop.fs.ChecksumException: Checksum error:
file:/dbfs/delta/Logging/_delta_log/00000000000000000063.json at 0
exp: 1179219224 got: -1020415797
It looks to me that, because of concurrency, Spark is somehow failing and generating checksum errors.
Is there any known scenario that may be causing it?
There are a couple of things going on here, and they should explain why coalesce may not work.
What coalesce does is essentially combine the partitions on each worker. For example, if you have three workers, you can perform coalesce(3), which would consolidate the partitions on each worker.
What repartition does is shuffle the data to increase or decrease the total number of partitions. In your case, if you have more than one worker and need a single output, you would have to use repartition(1), since you want the data to be on a single partition before writing it out.
Why would coalesce not work?
Spark limits shuffling during coalesce: you cannot perform a full shuffle (across different workers) when using coalesce, whereas you can with repartition, although it is an expensive operation.
Here is the code that would work:
dfLogging.repartition(1).write.format('delta').mode('append').save('/dbfs/' + loggingLocation)
