skip a very large cell parquet - apache-spark

I have a parquet file of 250 mb
One of the cell has bad data. I am assuming there is no schema issue but there is a length issue. When I skip reading this column I am able to read file via spark.
When I try to read the column then spark runs out of memory. I have tried giving 100gb ram to executor and it still fails
there are 58k rows in this file. Is there a way to recover rest of the data and ignore that 1 row / 1 cell ?
the column is named meta and is of type struct<name:String,schema_version:string>
I did try converting to json and then skip the row, but conversion to json fails
Stack trace on spark:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 7 in stage 10.0 failed 4 times, most recent failure: Lost task 7.3 in stage 10.0 (TID 157, ip-10-1-131-191.us-west-2.compute.internal, executor 22): ExecutorLostFailure (executor 22 exited caused by one of the running tasks) Reason: Container killed by YARN for exceeding memory limits. 11.6 GB of 11.1 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead or disabling yarn.nodemanager.vmem-check-enabled because of YARN-4714.
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:2041)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:2029)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:2028)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2028)
Since we had isolated to a speciific file we tried following:
parquet-tools cat /Users/gaurav/Downloads/part-00142-84bd71d5-268e-4db2-a962-193c171ed889.c000.snappy.parquet > ~/Downloads/parquue_2.json
java.lang.OutOfMemoryError: Java heap space
Parquet column dump
parquet-tools dump -c meta part-00142-84bd71d5-268e-4db2-a962-193c171ed889.c000.snappy.parquet
row group 0
--------------------------------------------------------------------------------
row group 1
--------------------------------------------------------------------------------

Related

How to extend the memory limit of PySpark running locally on Windows 10 / JVM 64bit

I try to make PySpark operations in Jupyter Notebook, and it seems that there is (a rather low) threshold of working memory when it halts with an error message. The laptop has 16GB RAM (out of which 50% is free when running the script), so the physical memory shouldn't be a problem. Spark runs on JVM (64bit) 1.8.0_301. The Jupyter Notebook runs on Python 3.9.5.
The dataframe consists only of 360K rows and two 'long' type columns (i.e. ca. 3.8MB only). The script works properly if I reduce the size of the dataframe 1.5MB memory usage (49,200 rows). But above that, the script collapses using the df.toPandas() command with the following error message (extract):
Py4JJavaError: An error occurred while calling o234.collectToPython.
: org.apache.spark.SparkException: Job aborted due to stage failure:
Task 0 in stage 50.0 failed 1 times, most recent failure:
Lost task 0.0 in stage 50.0 (TID 577) (BEXXXXXX.subdomain.domain.com executor driver):
TaskResultLost (result lost from block manager)
This is a well known error message when PySpark runs into memory limits, so I tried to adjust the settings as follows:
In the %SPARK_HOME%/conf/spark-defaults.conf file:
spark.driver.memory 4g
In Jupyter notebook itself:
spark = SparkSession.builder\
.config("spark.driver.memory", "4G")\
.config("spark.driver.maxResultSize", "4G")\
.appName("MyApp")\
.getOrCreate()
spark.sparkContext.setLogLevel("ERROR")
spark.sparkContext.setSystemProperty('spark.executor.memory', '4G')
I tried to play with the values of spark.driver.memory, spark.executor.memory etc., but the threshold seems to remain the same.
The Spark panel (on http://localhost:4040) says in the Executors menu, that Storage memory is 603 KiB / 2 GiB, Input is 4.1 GiB, Shuffle read: 60.6 MiB, Shuffle write: 111.3 MiB. But this is essentially the same, if I reduce the dataframe size below 1.5MB and the script runs properly.
Have you got any ideas, how to raise this 1.5MB memory limit somehow, and where is it coming from?

Kryo serialization failed: Buffer overflow

We read data present in hour format present in S3 through spark in scala.For example,
sparkSession
.createDataset(sc
.wholeTextFiles(("s3://<Bucket>/<key>/<yyyy>/<MM>/<dd>/<hh>/*"))
.values
.flatMap(x=> {x
.replace("\n", "")
.replace("}{", "}}{{")
.split("\\}\\{")}))
Doing the above slice and dice (like replace and split)to convert the pretty json data in form of json lines(one json record per json).
Now I am getting this error while running on EMR:
Job aborted due to stage failure: Task 1 in stage 11.0 failed 4 times, most recent failure: Lost task 1.3 in stage 11.0 (TID 43, ip-10-0-2-22.eu-west-1.compute.internal, executor 1): org.apache.spark.SparkException: Kryo serialization failed: Buffer overflow. Available: 0, required: 1148334. To avoid this, increase spark.kryoserializer.buffer.max value.
I have tried increasing the value for kyro serializer buffer --conf spark.kryoserializer.buffer.max=2047m but still I am getting this error for reading data for some hour locations like hours 09,10 and for other hours it is reading fine.
I wanted to ask how to remove this error and whether I need to add something else in spark configurations like change number of partitions?Thanks

AWS Glue: Data Skewed or not Skewed?

I have a job in AWS Glue that fails with:
An error occurred while calling o567.pyWriteDynamicFrame. Job aborted due to stage failure: Task 168 in stage 31.0 failed 4 times, most recent failure: Lost task 168.3 in stage 31.0 (TID 39474, ip-10-0-32-245.ec2.internal, executor 102): ExecutorLostFailure (executor 102 exited caused by one of the running tasks) Reason: Container killed by YARN for exceeding memory limits. 22.2 GB of 22 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead or disabling yarn.nodemanager.vmem-check-enabled because of YARN-4714.
The main message is Container killed by YARN for exceeding memory limits. 22.2 GB of 22 GB physical memory used.
I have used broadcasts for the small dfs and salt technique for bigger tables.
The input consists of 75GB of JSON files to process.
I have used a a grouping of 32MB for the input files:
additional_options={
'groupFiles': 'inPartition',
'groupSize': 1024*1024*32,
},
The output file is written with 256 partitions:
output_df = output_df.coalesce(256)
In AWS Glue I launch the job with 60 G.2X executors = 60 x (8 vCPU, 32 GB of memory, 128 GB disk).
Below is the plot representing the metrics for this job. From that, the data don't look skewed... Am I wrong?
Any advice to successfully run this is welcome!
Try to use repartition instead of coalesce. The latter one will do the complete execution based on the number of the partitions you have provided. In your case it tries to process all the input data with the 256 partitions, when it can't handle the input data volume you will get the error.

Prediction.io - pio train fails with OutOfMemoryError

We are getting the following error after running "pio train". It works about 20 minutes and fails on Stage 26.
[ERROR] [Executor] Exception in task 0.0 in stage 1.0 (TID 3)
[ERROR] [SparkUncaughtExceptionHandler] Uncaught exception in thread Thread[Executor task launch worker-0,5,main]
[ERROR] [SparkUncaughtExceptionHandler] Uncaught exception in thread Thread[Executor task launch worker-4,5,main]
[WARN] [TaskSetManager] Lost task 2.0 in stage 1.0 (TID 5, localhost): java.lang.OutOfMemoryError: Java heap space
at com.esotericsoftware.kryo.io.Output.<init>(Output.java:35)
at org.apache.spark.serializer.KryoSerializer.newKryoOutput(KryoSerializer.scala:80)
at org.apache.spark.serializer.KryoSerializerInstance.output$lzycompute(KryoSerializer.scala:289)
at org.apache.spark.serializer.KryoSerializerInstance.output(KryoSerializer.scala:289)
at org.apache.spark.serializer.KryoSerializerInstance.serialize(KryoSerializer.scala:293)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:239)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Our server has about 30gb memory, but about 10gb is taken by hbase+elasticsearch.
We are trying to process about 20 millions of records created by Universal Recommender.
I've tried the following command to increase executor/driver memory, but it didn't help:
pio train -- --driver-memory 6g --executor-memory 8g
What options could we try to fix the issue? Is it possible to process that amount of events on server with that amount of memory?
Vertical scaling can take you only so far but you could try increasing the memory available if it's AWS by stopping and restarting with a larger instance.
CF looks at a lot of data, Since Spark gets it's speed by doing in-memory calculations (by default) you will need enough memory to hold all of your data spread over all Spark workers and in your case you have only 1.
Another thing that comes to mind is that this is a Kryo error so you might try increasing the Kryo buffer size a little, which is configured in engine.json
Also there is a Google Group for community support here: https://groups.google.com/forum/#!forum/actionml-user

Error ExecutorLostFailure when running a task in Spark

when I am trying to run it on this folder it is throwing me ExecutorLostFailure everytime
Hi I am a beginner in Spark. I am trying to run a job on Spark 1.4.1 with 8 slave nodes with 11.7 GB memory each 3.2 GB Disk . I am running the Spark task from one of the slave node (from 8 nodes) (so with 0.7 storage fraction approx 4.8 gb only is available on each node )and using Mesos as the Cluster Manager. I am using this configuration :
spark.master mesos://uc1f-bioinfocloud-vamp-m-1:5050
spark.eventLog.enabled true
spark.driver.memory 6g
spark.storage.memoryFraction 0.7
spark.core.connection.ack.wait.timeout 800
spark.akka.frameSize 50
spark.rdd.compress true
I am trying to Run Spark MLlib Naive Bayes Algorithm on a Folder around 14 GB of data. (There is no issue when I am running the task on a 6 GB folder) I am reading this folder from google storage as RDD and giving 32 as partition parameter.(I have tried increasing the partition as well). Then using TF to create feature vector and predict on basis of that.
But when I am trying to run it on this folder it is throwing me ExecutorLostFailure everytime. I tried different configurations but nothing is helping. May be I am missing something very basic but not able to figure out. Any help or suggestion will be highly valuable.
Log is:
15/07/21 01:18:20 ERROR TaskSetManager: Task 3 in stage 2.0 failed 4 times; aborting job
15/07/21 01:18:20 INFO TaskSchedulerImpl: Cancelling stage 2
15/07/21 01:18:20 INFO TaskSchedulerImpl: Stage 2 was cancelled
15/07/21 01:18:20 INFO DAGScheduler: ResultStage 2 (collect at /opt/work/V2ProcessRecords.py:213) failed in 28.966 s
15/07/21 01:18:20 INFO DAGScheduler: Executor lost: 20150526-135628-3255597322-5050-1304-S8 (epoch 3)
15/07/21 01:18:20 INFO BlockManagerMasterEndpoint: Trying to remove executor 20150526-135628-3255597322-5050-1304-S8 from BlockManagerMaster.
15/07/21 01:18:20 INFO DAGScheduler: Job 2 failed: collect at /opt/work/V2ProcessRecords.py:213, took 29.013646 s
Traceback (most recent call last):
File "/opt/work/V2ProcessRecords.py", line 213, in <module>
secondPassRDD = firstPassRDD.map(lambda ( name, title, idval, pmcId, pubDate, article, tags , author, ifSigmaCust, wclass): ( str(name), title, idval, pmcId, pubDate, article, tags , author, ifSigmaCust , "Yes" if ("PMC" + pmcId) in rddNIHGrant else ("No") , wclass)).collect()
File "/usr/local/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 745, in collect
File "/usr/local/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 538, in __call__
File "/usr/local/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line 300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 3 in stage 2.0 failed 4 times, most recent failure: Lost task 3.3 in stage 2.0 (TID 12, vamp-m-2.c.quantum-854.internal): ExecutorLostFailure (executor 20150526-135628-3255597322-5050-1304-S8 lost)
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1266)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1257)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1256)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1256)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:730)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:730)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:730)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1450)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1411)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
15/07/21 01:18:20 INFO BlockManagerMaster: Removed 20150526-135628-3255597322-5050-1304-S8 successfully in removeExecutor
15/07/21 01:18:20 INFO DAGScheduler: Host added was in lost list earlier:vamp-m-2.c.quantum-854.internal
Jul 21, 2015 1:01:15 AM INFO: parquet.hadoop.ParquetFileReader: Initiating action with parallelism: 5
15/07/21 01:18:20 INFO SparkContext: Invoking stop() from shutdown hook
{"Event":"SparkListenerTaskStart","Stage ID":2,"Stage Attempt ID":0,"Task Info":{"Task ID":11,"Index":6,"Attempt":2,"Launch Time":1437616381852,"Executor ID":"20150526-135628-3255597322-5050-1304-S8","Host":"uc1f-bioinfocloud-vamp-m-2.c.quantum-device-854.internal","Locality":"PROCESS_LOCAL","Speculative":false,"Getting Result Time":0,"Finish Time":0,"Failed":false,"Accumulables":[]}}
{"Event":"SparkListenerExecutorRemoved","Timestamp":1437616389696,"Executor ID":"20150526-135628-3255597322-5050-1304-S8","Removed Reason":"Lost executor"}
{"Event":"SparkListenerTaskEnd","Stage ID":2,"Stage Attempt ID":0,"Task Type":"ResultTask","Task End Reason":{"Reason":"ExecutorLostFailure","Executor ID":"20150526-135628-3255597322-5050-1304-S8"},"Task Info":{"Task ID":11,"Index":6,"Attempt":2,"Launch Time":1437616381852,"Executor ID":"20150526-135628-3255597322-5050-1304-S8","Host":"uc1f-bioinfocloud-vamp-m-2.c.quantum-device-854.internal","Locality":"PROCESS_LOCAL","Speculative":false,"Getting Result Time":0,"Finish Time":1437616389697,"Failed":true,"Accumulables":[]}}
{"Event":"SparkListenerExecutorAdded","Timestamp":1437616389707,"Executor ID":"20150526-135628-3255597322-5050-1304-S8","Executor Info":{"Host":"uc1f-bioinfocloud-vamp-m-2.c.quantum-device-854.internal","Total Cores":1,"Log Urls":{}}}
{"Event":"SparkListenerTaskStart","Stage ID":2,"Stage Attempt ID":0,"Task Info":{"Task ID":12,"Index":6,"Attempt":3,"Launch Time":1437616389702,"Executor ID":"20150526-135628-3255597322-5050-1304-S8","Host":"uc1f-bioinfocloud-vamp-m-2.c.quantum-device-854.internal","Locality":"PROCESS_LOCAL","Speculative":false,"Getting Result Time":0,"Finish Time":0,"Failed":false,"Accumulables":[]}}
{"Event":"SparkListenerExecutorRemoved","Timestamp":1437616397743,"Executor ID":"20150526-135628-3255597322-5050-1304-S8","Removed Reason":"Lost executor"}
{"Event":"SparkListenerTaskEnd","Stage ID":2,"Stage Attempt ID":0,"Task Type":"ResultTask","Task End Reason":{"Reason":"ExecutorLostFailure","Executor ID":"20150526-135628-3255597322-5050-1304-S8"},"Task Info":{"Task ID":12,"Index":6,"Attempt":3,"Launch Time":1437616389702,"Executor ID":"20150526-135628-3255597322-5050-1304-S8","Host":"uc1f-bioinfocloud-vamp-m-2.c.quantum-device-854.internal","Locality":"PROCESS_LOCAL","Speculative":false,"Getting Result Time":0,"Finish Time":1437616397743,"Failed":true,"Accumulables":[]}}
{"Event":"SparkListenerStageCompleted","Stage Info":{"Stage ID":2,"Stage Attempt ID":0,"Stage Name":"collect at /opt/work/V2ProcessRecords.py:215","Number of Tasks":72,"RDD Info":[{"RDD ID":6,"Name":"PythonRDD","Parent IDs":[0],"Storage Level":{"Use Disk":false,"Use Memory":false,"Use ExternalBlockStore":false,"Deserialized":false,"Replication":1},"Number of Partitions":72,"Number of Cached Partitions":0,"Memory Size":0,"ExternalBlockStore Size":0,"Disk Size":0},{"RDD ID":0,"Name":"gs://uc1f-bioinfocloud-vamp-m/literature/xml/P*/*.nxml","Scope":"{\"id\":\"0\",\"name\":\"wholeTextFiles\"}","Parent IDs":[],"Storage Level":{"Use Disk":false,"Use Memory":false,"Use ExternalBlockStore":false,"Deserialized":false,"Replication":1},"Number of Partitions":72,"Number of Cached Partitions":0,"Memory Size":0,"ExternalBlockStore Size":0,"Disk Size":0}],"Parent IDs":[],"Details":"","Submission Time":1437616365566,"Completion Time":1437616397753,"Failure Reason":"Job aborted due to stage failure: Task 6 in stage 2.0 failed 4 times, most recent failure: Lost task 6.3 in stage 2.0 (TID 12, uc1f-bioinfocloud-vamp-m-2.c.quantum-device-854.internal): ExecutorLostFailure (executor 20150526-135628-3255597322-5050-1304-S8 lost)\nDriver stacktrace:","Accumulables":[]}}
{"Event":"SparkListenerJobEnd","Job ID":2,"Completion Time":1437616397755,"Job Result":{"Result":"JobFailed","Exception":{"Message":"Job aborted due to stage failure: Task 6 in stage 2.0 failed 4 times, most recent failure: Lost task 6.3 in stage 2.0 (TID 12, uc1f-bioinfocloud-vamp-m-2.c.quantum-device-854.internal): ExecutorLostFailure (executor 20150526-135628-3255597322-5050-1304-S8 lost)\nDriver stacktrace:","Stack Trace":[{"Declaring Class":"org.apache.spark.scheduler.DAGScheduler","Method Name":"org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages","File Name":"DAGScheduler.scala","Line Number":1266},{"Declaring Class":"org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1","Method Name":"apply","File Name":"DAGScheduler.scala","Line Number":1257},{"Declaring Class":"org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1","Method Name":"apply","File Name":"DAGScheduler.scala","Line Number":1256},{"Declaring Class":"scala.collection.mutable.ResizableArray$class","Method Name":"foreach","File Name":"ResizableArray.scala","Line Number":59},{"Declaring Class":"scala.collection.mutable.ArrayBuffer","Method Name":"foreach","File Name":"ArrayBuffer.scala","Line Number":47},{"Declaring Class":"org.apache.spark.scheduler.DAGScheduler","Method Name":"abortStage","File Name":"DAGScheduler.scala","Line Number":1256},{"Declaring Class":"org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1","Method Name":"apply","File Name":"DAGScheduler.scala","Line Number":730},{"Declaring Class":"org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1","Method Name":"apply","File Name":"DAGScheduler.scala","Line Number":730},{"Declaring Class":"scala.Option","Method Name":"foreach","File Name":"Option.scala","Line Number":236},{"Declaring Class":"org.apache.spark.scheduler.DAGScheduler","Method Name":"handleTaskSetFailed","File Name":"DAGScheduler.scala","Line Number":730},{"Declaring Class":"org.apache.spark.scheduler.DAGSchedulerEventProcessLoop","Method Name":"onReceive","File Name":"DAGScheduler.scala","Line Number":1450},{"Declaring Class":"org.apache.spark.scheduler.DAGSchedulerEventProcessLoop","Method Name":"onReceive","File Name":"DAGScheduler.scala","Line Number":1411},{"Declaring Class":"org.apache.spark.util.EventLoop$$anon$1","Method Name":"run","File Name":"EventLoop.scala","Line Number":48}]}}}
The Most common cause of ExecutorLostFailure as per my understanding is OOM in executor.
In order to resolve the OOM issue, one needs to figure out what exactly is causing it. Simply increasing the default parallelism or increasing the executor memory is not a strategic solution.
If you look at what increasing parallelism do is it tries to create more executors so that each executor can work on less and less data. But if your data is skewed such that the key on which data partitioning happens (for parallelism) has more data, simply increasing parallelism will be of no effect.
Similarly just by increasing Executor memory will be a very inefficient way of handing such a scenario as if only one executor is failing with ExecutorLostFailure , requesting increased memory for all the executors will make your application require much more memory then actually expected.
This error is occurring because a task failed more than four times.
Try increase the parallelism in your cluster using the following parameter.
--conf "spark.default.parallelism=100"
Set the parallelism value to 2 to 3 time the number of cores available on your cluster. If that doesn't work. try increase the parallelism in an exponential fashion. i.e if your current parallelism doesn't work multiply it by two and so on. Also I have observed that it helps if your level of parallelism is a prime number especially if you are using groupByKkey.
It is hard to say what the problem is without the log of the failed executor and not the driver's but most likely it is a memory problem. Try increasing the partition number significantly (if your current is 32 try 200)
I was having this issue, and the problem for me was very high incidence of one key in a reduceByKey task. This was (I think) causing a massive list to collect on one of the executors, which would then throw OOM errors.
The solution for me was to just filter out keys with high population before doing the reduceByKey, but I appreciate that this may or may not be possible depending on your application. I didn't need all my data anyway.

Resources