Parquet decoding exception while reading parquet files over spark - apache-spark

I've generated parquet files using append data mode over spark. But while reading those files, throwing parquet decode exceptions.
I'm already using merge-schema option, but the problem I'm facing is with files part of some partitions. Other partitions are not throwing any kind of exception.
df = spark.read.parquet("s3://bucket/folder/date=<>/")
org.apache.spark.sql.execution.QueryExecutionException: Encounter error while reading parquet files. One possible cause: Parquet column cannot be converted in the corresponding files. Details:
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:193)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:101)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:255)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:247)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:836)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:836)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:121)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:402)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:408)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.parquet.io.ParquetDecodingException: Can not read value at 0 in block -1 in file s3://bucket/key/date=<>/part-00028-625e653b-c000.snappy.parquet
at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:251)
at org.apache.parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:207)
at org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:101)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:181)
... 21 more
Caused by: java.lang.ClassCastException: Expected instance of group converter but got "org.apache.spark.sql.execution.datasources.parquet.ParquetRowConverter$ParquetStringConverter"
at org.apache.parquet.io.api.Converter.asGroupConverter(Converter.java:34)
at org.apache.parquet.io.RecordReaderImplementation.<init>(RecordReaderImplementation.java:267)
at org.apache.parquet.io.MessageColumnIO$1.visit(MessageColumnIO.java:147)
at org.apache.parquet.io.MessageColumnIO$1.visit(MessageColumnIO.java:109)
at org.apache.parquet.filter2.compat.FilterCompat$NoOpFilter.accept(FilterCompat.java:165)
at org.apache.parquet.io.MessageColumnIO.getRecordReader(MessageColumnIO.java:109)
at org.apache.parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:137)
at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:222)
... 25 more
Reading the mentioned file, doesn't throw any exception. But reading all the files under folder, throw exception wrt a single file. There is other data set in similar folders, but that doesn't threw any exceptions.
I'm not able to understand the real cause for this error, and is there any option or way to fix this ?

Related

Spark 3.0.1 tasks are failing when using zstd compression codec

I'm using Spark 3.0.1 with user provided Hadoop 3.2.0 and Scala 2.12.10 running on Kubernetes.
Everything works fine when reading a parquet file compressed as snappy, however when I try to read a parquet file compressed as zstd several tasks fails under the following error:
java.io.IOException: Decompression error: Version not supported
at com.github.luben.zstd.ZstdInputStream.readInternal(ZstdInputStream.java:164)
at com.github.luben.zstd.ZstdInputStream.read(ZstdInputStream.java:120)
at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
at java.io.BufferedInputStream.read1(BufferedInputStream.java:286)
at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
at java.io.ObjectInputStream$PeekInputStream.read(ObjectInputStream.java:2781)
at java.io.ObjectInputStream$PeekInputStream.readFully(ObjectInputStream.java:2797)
at java.io.ObjectInputStream$BlockDataInputStream.readShort(ObjectInputStream.java:3274)
at java.io.ObjectInputStream.readStreamHeader(ObjectInputStream.java:934)
at java.io.ObjectInputStream.(ObjectInputStream.java:396)
at org.apache.spark.MapOutputTracker$.deserializeObject$1(MapOutputTracker.scala:954)
at org.apache.spark.MapOutputTracker$.deserializeMapStatuses(MapOutputTracker.scala:964)
at org.apache.spark.MapOutputTrackerWorker.$anonfun$getStatuses$2(MapOutputTracker.scala:856)
at org.apache.spark.util.KeyLock.withLock(KeyLock.scala:64)
at org.apache.spark.MapOutputTrackerWorker.getStatuses(MapOutputTracker.scala:851)
at org.apache.spark.MapOutputTrackerWorker.getMapSizesByExecutorId(MapOutputTracker.scala:808)
at org.apache.spark.shuffle.sort.SortShuffleManager.getReader(SortShuffleManager.scala:128)
at org.apache.spark.sql.execution.ShuffledRowRDD.compute(ShuffledRowRDD.scala:185)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:313)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:313)
at org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:89)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:313)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:313)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:313)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:127)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:446)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:449)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
What I don't get is that those tasks succeed after a retry but not always and thus my jobs are failing frequently. As mentioned before if I use the same dataset compressed as snappy everything works.
I've also tried building Spark and Hadoop, changing the zstd-jni version, but the same behavior still happens.
Does anyone knows what might be happening?
Thanks!
As commented, I updated Spark (3.0.1) configuration with following property to permanently fix the issue in my case. The file path and configuration added are as follows:
$SPARK_HOME/conf/spark-defaults.conf
spark.shuffle.mapStatus.compression.codec lz4

Last task stuck forever... only sometimes

I have pySpark Job (spark 2.4.1) that seems to be working fine in 10% of the cases, and the other times seemingly stuck forever on a single task, where I can't really understand what is happening.
Here's what I'm doing in my pyspark code:
df = ss.read.parquet(...)
df2 = df.withColumn("A", my_python_udf(sf.col("position.latitude"))
print(df2.groupBy(sf.spark_partition_id()).count().agg(sf.min("count"), sf.max("count"), sf.avg("count")).toPandas())
I seem to be forever stuck in the evaluation of the "toPandas" call.
When I check the executors tab, only one executor is runnable with the following call stack:
java.net.SocketInputStream.socketRead0(Native Method)
java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
java.net.SocketInputStream.read(SocketInputStream.java:171)
java.net.SocketInputStream.read(SocketInputStream.java:141)
java.io.BufferedInputStream.read1(BufferedInputStream.java:284)
java.io.BufferedInputStream.read(BufferedInputStream.java:345) => holding Monitor(java.io.BufferedInputStream#1118259716})
java.io.DataInputStream.readFully(DataInputStream.java:195)
java.io.DataInputStream.readFully(DataInputStream.java:169)
org.apache.spark.sql.execution.python.PythonUDFRunner$$anon$1.read(PythonUDFRunner.scala:74)
org.apache.spark.sql.execution.python.PythonUDFRunner$$anon$1.read(PythonUDFRunner.scala:64)
org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:406)
org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.agg_doAggregateWithKeys_0$(Unknown Source)
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.processNext(Unknown Source)
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636)
scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:187)
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)
org.apache.spark.scheduler.Task.run(Task.scala:121)
org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:403)
org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:409)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
java.lang.Thread.run(Thread.java:748)
I have several questions:
Why does it seems like the call stack is doing stuff related to the UDF evaluatin, which is not needed for my computation?
What is actually going on? I'm not sure if that thread is deadlocked or live from the call stack
How to fix this?
edit :
I also have 2 executors that are failing with the following error:
java.io.IOException: expected more bytes in input stream
at net.razorvine.pickle.PickleUtils.readbytes_into(PickleUtils.java:75)
at net.razorvine.pickle.PickleUtils.readbytes(PickleUtils.java:55)
at net.razorvine.pickle.Unpickler.load_binunicode(Unpickler.java:473)
at net.razorvine.pickle.Unpickler.dispatch(Unpickler.java:190)
at net.razorvine.pickle.Unpickler.load(Unpickler.java:99)
at net.razorvine.pickle.Unpickler.loads(Unpickler.java:112)
at org.apache.spark.sql.execution.python.BatchEvalPythonExec$$anonfun$evaluate$1.apply(BatchEvalPythonExec.scala:90)
at org.apache.spark.sql.execution.python.BatchEvalPythonExec$$anonfun$evaluate$1.apply(BatchEvalPythonExec.scala:89)
at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:435)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:441)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.agg_doAggregateWithKeys_0$(Unknown Source)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
at org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:187)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)
at org.apache.spark.scheduler.Task.run(Task.scala:121)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:403)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:409)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Which makes me believe that something outside of my code is going wrong
In my case, my PySpark jobs always (not randomly) got stuck in PythonUDFRunner waiting for data.
I found out it is waiting for data coming from a PySpark daemon which is used for launching python code (Python UDFs) without wasting loads of memory while forking a Python process (UDF) from a Java process (Spark) More information in this link: pyspark daemon
So, roughly speaking, the Python UDF is serialized by Spark and sent to this daemon which is in charge of running the Python code.
This daemon is run by means of Python code located in a file called pyspark.zip
In my case, for reasons that I prefer not to tell ;) ;) the pyspark.zip was coming from Spark 2.3.3 but Spark itself was running with Spark 2.4.5. I replaced pyspark.zip with the one coming from Spark 2.4.5 and everything started to run with success.
I do not think you will have the same problem as I had but perhaps it could give you some ideas about what is going on in your set up.

Spark: Can not build a HashedRelation that is larger than 8G

I am currently encountering this exception in Spark 2.3, while running in Azure HDInsight 3.6 on an 80 node cluster:
java.lang.UnsupportedOperationException: Can not build a HashedRelation that is larger than 8G
at org.apache.spark.sql.execution.joins.LongToUnsafeRowMap.grow(HashedRelation.scala:623)
at org.apache.spark.sql.execution.joins.LongToUnsafeRowMap.append(HashedRelation.scala:570)
at org.apache.spark.sql.execution.joins.LongHashedRelation$.apply(HashedRelation.scala:867)
at org.apache.spark.sql.execution.joins.HashedRelation$.apply(HashedRelation.scala:111)
at org.apache.spark.sql.execution.joins.ShuffledHashJoinExec.org$apache$spark$sql$execution$joins$ShuffledHashJoinExec$$buildHashedRelation(ShuffledHashJoinExec.scala:56)
at org.apache.spark.sql.execution.joins.ShuffledHashJoinExec$$anonfun$doExecute$1.apply(ShuffledHashJoinExec.scala:68)
at org.apache.spark.sql.execution.joins.ShuffledHashJoinExec$$anonfun$doExecute$1.apply(ShuffledHashJoinExec.scala:67)
at org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:89)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:49)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:49)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
at org.apache.spark.scheduler.Task.run(Task.scala:109)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
This has occurred multiple times while performing a union between 6 tables, one of which is several GB. However, it does not always occur and I cannot reproduce it. This union has run on a much larger dataframe on the same size and number of executors without failing. It happened on a particular run for all 5 retries, and after setting "spark.sql.join.preferSortMergeJoin" to true, it then ran through. Now when trying to reproduce it on a new cluster, with everything else being the same, I cannot, and it runs as expected.
Are there any ideas on what could cause this?
Since I was able to resolve this, here is what I found:
It was discovered that one of the dataframes in the union was not cached, and was causing this issue. This dataframe had a large number of partitions, but had 0 rows.
This was difficult to discover as it was unexpected. I don't have a link, but there is a known issue of extreme performance degradation of having a multi-partition dataframe with 0 rows.

How to reset spark structured streaming data to last available offset

I am running a structured stream application with Kafka. I found that if for some reason the system is down for days... The checkpoint become stale and offsets corresponding to the checkpoint is not found in Kafka. How do I let Spark Structured Streaming app pick the last available offset and start from there. I tried setting offset reset to earlier/latest but the system was crashing with following error:
org.apache.kafka.clients.consumer.OffsetOutOfRangeException: Offsets out of range with no configured reset policy for partitions: {MyTopic-574=6559828}
at org.apache.kafka.clients.consumer.internals.Fetcher.parseCompletedFetch(Fetcher.java:970)
at org.apache.kafka.clients.consumer.internals.Fetcher.fetchedRecords(Fetcher.java:490)
at org.apache.kafka.clients.consumer.KafkaConsumer.pollForFetches(KafkaConsumer.java:1259)
at org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:1187)
at org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:1115)
at org.apache.spark.sql.kafka010.InternalKafkaConsumer.fetchData(KafkaDataConsumer.scala:470)
at org.apache.spark.sql.kafka010.InternalKafkaConsumer.org$apache$spark$sql$kafka010$InternalKafkaConsumer$$fetchRecord(KafkaDataConsumer.scala:361)
at org.apache.spark.sql.kafka010.InternalKafkaConsumer$$anonfun$get$1.apply(KafkaDataConsumer.scala:251)
at org.apache.spark.sql.kafka010.InternalKafkaConsumer$$anonfun$get$1.apply(KafkaDataConsumer.scala:234)
at org.apache.spark.util.UninterruptibleThread.runUninterruptibly(UninterruptibleThread.scala:77)
at org.apache.spark.sql.kafka010.InternalKafkaConsumer.runUninterruptiblyIfPossible(KafkaDataConsumer.scala:209)
at org.apache.spark.sql.kafka010.InternalKafkaConsumer.get(KafkaDataConsumer.scala:234)
at org.apache.spark.sql.kafka010.KafkaDataConsumer$class.get(KafkaDataConsumer.scala:64)
at org.apache.spark.sql.kafka010.KafkaDataConsumer$CachedKafkaDataConsumer.get(KafkaDataConsumer.scala:500)
at org.apache.spark.sql.kafka010.KafkaMicroBatchInputPartitionReader.next(KafkaMicroBatchReader.scala:357)
at org.apache.spark.sql.execution.datasources.v2.DataSourceRDD$$anon$1.hasNext(DataSourceRDD.scala:49)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636)
at org.apache.spark.sql.execution.UnsafeExternalRowSorter.sort(UnsafeExternalRowSorter.java:216)
at org.apache.spark.sql.execution.SortExec$$anonfun$1.apply(SortExec.scala:108)
at org.apache.spark.sql.execution.SortExec$$anonfun$1.apply(SortExec.scala:101)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:836)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:836)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:123)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
If the system was down for days then it means that some logs might have been compacted. More precisely, your application attempts to read offset 6559828 from the 574th partition in topic MyTopic.
In order to find the earliest available offset per partition, you can simply run the following command:
bin/kafka-run-class.sh kafka.tools.GetOffsetShell \
--broker-list localhost:9092 \
--topic MyTopic \
--time -2

Read all partitioned parquet files in PySpark

I want to load all parquet files that are stored in a folder structure in S3 AWS.
The folder structure is as follows: S3/bucket_name/folder_1/folder_2/folder_3/year=2019/month/day
What I want is to read all parquet files at once, so I want PySpark to read all data from 2019 for all months and days that are available and then store it in one dataframe (so you get a concatenated/unioned dataframe with all days in 2019).
I am told that these are partitioned files (though I am not sure of this).
Is this possible in PySpark and if so how?
When I try spark.read.parquet('S3/bucket_name/folder_1/folder_2/folder_3/year=2019')
it works. However, when I want to take a look at the Spark dataframe using spark.read.parquet('S3/bucket_name/folder_1/folder_2/folder_3/year=2019').show()
it says:
An error occurred while calling o934.showString.
: org.apache.spark.SparkException:
Job aborted due to stage failure: Task 0 in stage 36.0 failed 4 times,
most recent failure:
Lost task 0.3 in stage 36.0 (TID 718, executor 7):
java.lang.UnsupportedOperationException:
org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainIntegerDictionary
at org.apache.parquet.column.Dictionary.decodeToBinary(Dictionary.java:44)
at org.apache.spark.sql.execution.datasources.parquet.ParquetDictionary.decodeToBinary(ParquetDictionary.java:51)
at org.apache.spark.sql.execution.vectorized.WritableColumnVector.getUTF8String(WritableColumnVector.java:372)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$1.hasNext(WholeStageCodegenExec.scala:614)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:253)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:247)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:830)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:830)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:109)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
I want to be able to show the dataframe.
Please refer the "Partition discovery" part of the documentation:
https://spark.apache.org/docs/2.3.1/sql-programming-guide.html#partition-discovery
In PySpark, you can do this simply as follows:
from pyspark.sql.functions import col
(
spark.read
.parquet('S3/bucket_name/folder_1/folder_2/folder_3')
.filter(col('year') == 2019)
)
So you will point the path to the folder where it is partitioned into some subfolders and you apply the partition filter which should take the data only from the given year subfolder.

Resources