I have a large Parquet file with 25k columns that is about 10GB. I'm trying to view it, and convert some rows to CSV.
All the tools I've tried have blown up (parquet-tools, fastparquet, pandas) so I'm using PySpark now but am running into Java out of memory errors (java.lang.OutOfMemoryError: Java heap space).
My machine has 96GB of RAM. Prior to running Python, I use
export JAVA_OPTS="-Xms36g -Xmx90g"
I've also experimented by setting the driver memory to 80GB.
Here's the code I'm using. Unfortunately I can't share the data set.
from pyspark.sql import SQLContext
from pyspark import SparkContext
from pyspark.sql.types import *
sc = SparkContext(appName="foo")
sqlContext = SQLContext(sc)
sc._conf.set('spark.driver.memory', '80g')
readdf = sqlContext.read.parquet('dataset.parquet')
readdf.head(2)
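Side note: from what I've read, spark.driver.memory has to be set before the SparkContext (and its JVM) is created, so setting it on sc._conf after the fact probably has no effect. A minimal sketch of what I assume is the intended way (same 80g figure as above):
from pyspark import SparkConf, SparkContext
conf = SparkConf().setAppName("foo").set('spark.driver.memory', '80g')  # set before the JVM starts
sc = SparkContext(conf=conf)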
Here's the error:
In [5]: df.head(2)
23/02/01 20:48:43 WARN package: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.
23/02/01 20:48:48 ERROR Executor: Exception in task 7.0 in stage 3.0 (TID 13)
java.lang.OutOfMemoryError: Java heap space
at java.base/java.util.Arrays.copyOfRange(Arrays.java:4030)
at java.base/java.lang.StringCoding.decodeUTF8(StringCoding.java:732)
at java.base/java.lang.StringCoding.decode(StringCoding.java:257)
at java.base/java.lang.String.<init>(String.java:507)
at java.base/java.lang.String.<init>(String.java:561)
at shaded.parquet.org.apache.thrift.protocol.TCompactProtocol.readString(TCompactProtocol.java:687)
at org.apache.parquet.format.InterningProtocol.readString(InterningProtocol.java:216)
at org.apache.parquet.format.KeyValue$KeyValueStandardScheme.read(KeyValue.java:406)
at org.apache.parquet.format.KeyValue$KeyValueStandardScheme.read(KeyValue.java:384)
at org.apache.parquet.format.KeyValue.read(KeyValue.java:321)
at org.apache.parquet.format.FileMetaData$FileMetaDataStandardScheme.read(FileMetaData.java:1317)
at org.apache.parquet.format.FileMetaData$FileMetaDataStandardScheme.read(FileMetaData.java:1242)
at org.apache.parquet.format.FileMetaData.read(FileMetaData.java:1116)
at org.apache.parquet.format.Util.read(Util.java:362)
at org.apache.parquet.format.Util.readFileMetaData(Util.java:151)
at org.apache.parquet.format.converter.ParquetMetadataConverter$3.visit(ParquetMetadataConverter.java:1428)
at org.apache.parquet.format.converter.ParquetMetadataConverter$3.visit(ParquetMetadataConverter.java:1410)
at org.apache.parquet.format.converter.ParquetMetadataConverter$RangeMetadataFilter.accept(ParquetMetadataConverter.java:1205)
at org.apache.parquet.format.converter.ParquetMetadataConverter.readParquetMetadata(ParquetMetadataConverter.java:1410)
at org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:582)
at org.apache.parquet.hadoop.ParquetFileReader.<init>(ParquetFileReader.java:776)
at org.apache.spark.sql.execution.datasources.parquet.SpecificParquetRecordReaderBase.initialize(SpecificParquetRecordReaderBase.java:99)
at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.initialize(VectorizedParquetRecordReader.java:173)
at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat.$anonfun$buildReaderWithPartitionValues$1(ParquetFileFormat.scala:342)
at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$Lambda$2673/0x0000000101256040.apply(Unknown Source)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:209)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:270)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:116)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:364)
at org.apache.spark.sql.execution.SparkPlan$$Lambda$2676/0x0000000101281040.apply(Unknown Source)
at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:890)
23/02/01 20:48:49 ERROR Executor: Exception in task 2.0 in stage 3.0 (TID 8)
java.lang.OutOfMemoryError: Java heap space
(identical stack trace to the one above)
Any suggestions for dealing with this file? Thanks
Related
I am facing some issues while writing Parquet files from one blob to another. Below is the code I'm using.
df = spark.read.load(FilePath1,
format="parquet", modifiedAfter=datetime)
spark.conf.set("spark.sql.parquet.enableVectorizedReader","false")
df.coalesce(1).write.format("parquet").mode("overwrite").save(FilePath2)
Error -
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 3 in stage 86.0 failed 1 times, most recent failure: Lost task 3.0 in stage 86.0 (TID 282) (10.0.55.68 executor driver): com.databricks.sql.io.FileReadException: Error while reading file dbfs:file.parquet. Possible cause: Parquet column cannot be converted.
java.lang.ClassCastException: org.apache.spark.sql.catalyst.expressions.MutableInt cannot be cast to org.apache.spark.sql.catalyst.expressions.MutableLong.
Any help is appreciated. Thanks.
The cause of this error is possibly that a decimal-type column is being decoded into binary format by the vectorized Parquet reader.
The vectorized Parquet reader is enabled by default in Databricks Runtime 7.3 and higher for reading datasets in Parquet files. The read schema uses only atomic data types: binary, boolean, date, text, and timestamp.
The solution is: if your source data contains decimal-type columns, you should disable the vectorized Parquet reader.
To disable the vectorized Parquet reader at the cluster level, set spark.sql.parquet.enableVectorizedReader to false in the cluster's Spark configuration.
At the notebook level, you can also disable the vectorized Parquet reader by running:
spark.conf.set("spark.sql.parquet.enableVectorizedReader","false")
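Applied to the code in the question, that would look roughly like the sketch below (FilePath1, FilePath2, and the modifiedAfter filter are taken from the question); the setting is moved before the read so it is definitely in effect when the Parquet files are scanned:
spark.conf.set("spark.sql.parquet.enableVectorizedReader", "false")  # disable before reading
df = spark.read.load(FilePath1, format="parquet", modifiedAfter=datetime)
df.coalesce(1).write.format("parquet").mode("overwrite").save(FilePath2)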
References:
Apache Spark job fails with Parquet column cannot be converted error
Pyspark job aborted error due to stage failure
I am trying to load some data from a Hive table where one of the columns is like this:
id - bigint
When I load the table into a dataframe and do a printSchema, I see that Spark agrees with the Hive Metastore that id is of type long. However, when I try to do anything with the table, I get this error:
SparkException: Job aborted due to stage failure: Task 0 in stage 14.0 failed 4 times, most recent failure: Lost task 0.3 in stage 14.0 (TID 218, 10.139.64.41, executor 1): java.lang.ClassCastException: org.apache.hadoop.io.IntWritable cannot be cast to org.apache.hadoop.io.LongWritable
at org.apache.spark.sql.execution.datasources.orc.OrcDeserializer$$anonfun$org$apache$spark$sql$execution$datasources$orc$OrcDeserializer$$newWriter$6.apply(OrcDeserializer.scala:94)
at org.apache.spark.sql.execution.datasources.orc.OrcDeserializer$$anonfun$org$apache$spark$sql$execution$datasources$orc$OrcDeserializer$$newWriter$6.apply(OrcDeserializer.scala:93)
at org.apache.spark.sql.execution.datasources.orc.OrcDeserializer$$anonfun$2$$anonfun$apply$1.apply(OrcDeserializer.scala:51)
at org.apache.spark.sql.execution.datasources.orc.OrcDeserializer$$anonfun$2$$anonfun$apply$1.apply(OrcDeserializer.scala:51)
at org.apache.spark.sql.execution.datasources.orc.OrcDeserializer.deserialize(OrcDeserializer.scala:64)
at org.apache.spark.sql.execution.datasources.orc.OrcFileFormat$$anonfun$buildReaderWithPartitionValues$2$$anonfun$apply$8.apply(OrcFileFormat.scala:234)
at org.apache.spark.sql.execution.datasources.orc.OrcFileFormat$$anonfun$buildReaderWithPartitionValues$2$$anonfun$apply$8.apply(OrcFileFormat.scala:233)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:410)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1$$anon$2.getNext(FileScanRDD.scala:283)
at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:401)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:249)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:640)
at org.apache.spark.sql.execution.collect.UnsafeRowBatchUtils$.encodeUnsafeRows(UnsafeRowBatchUtils.scala:62)
at org.apache.spark.sql.execution.collect.Collector$$anonfun$2.apply(Collector.scala:159)
at org.apache.spark.sql.execution.collect.Collector$$anonfun$2.apply(Collector.scala:158)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.doRunTask(Task.scala:140)
at org.apache.spark.scheduler.Task.run(Task.scala:113)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$13.apply(Executor.scala:528)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1526)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:534)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
It seems that Spark reads the data, decides that an Integer is enough to hold the data, and then tries to reconcile with the Hive Metastore, which expects a Long. How do I go about solving this?
EDIT:
I have tried reading the data in two ways, although I suspect they are the same:
spark.sql("select * from databaseName.tableName")
spark.table("databaseName.tableName")
bigint is not a supported datatype, as documented here: https://spark.apache.org/docs/latest/sql-reference.html
You can use the cast function to convert bigint to long:
val newDF = df.select($"cola", $"colb".cast("Long"))
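If you are working in PySpark rather than Scala, the equivalent cast would be roughly as follows (cola and colb are the placeholder column names from the example above):
from pyspark.sql.functions import col
new_df = df.select(col("cola"), col("colb").cast("long"))  # cast colb explicitly to long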
I have a Parquet file of about 250 MB.
One of the cells has bad data. I am assuming there is no schema issue, but there is a length issue. When I skip reading this column, I am able to read the file via Spark.
When I try to read the column, Spark runs out of memory. I have tried giving 100GB of RAM to the executor and it still fails.
There are 58k rows in this file. Is there a way to recover the rest of the data and ignore that one row / one cell?
The column is named meta and is of type struct<name:String,schema_version:string>.
I did try converting to JSON and then skipping the row, but the conversion to JSON fails.
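For reference, skipping the column looks roughly like this (a sketch; the file path is a placeholder), relying on Spark only reading the Parquet columns that the query actually needs:
df = spark.read.parquet("path/to/file.snappy.parquet").drop("meta")  # meta is pruned, so its data should not be read
df.write.parquet("path/to/recovered/")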
Stack trace on spark:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 7 in stage 10.0 failed 4 times, most recent failure: Lost task 7.3 in stage 10.0 (TID 157, ip-10-1-131-191.us-west-2.compute.internal, executor 22): ExecutorLostFailure (executor 22 exited caused by one of the running tasks) Reason: Container killed by YARN for exceeding memory limits. 11.6 GB of 11.1 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead or disabling yarn.nodemanager.vmem-check-enabled because of YARN-4714.
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:2041)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:2029)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:2028)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2028)
Since we had isolated the problem to a specific file, we tried the following:
parquet-tools cat /Users/gaurav/Downloads/part-00142-84bd71d5-268e-4db2-a962-193c171ed889.c000.snappy.parquet > ~/Downloads/parquue_2.json
java.lang.OutOfMemoryError: Java heap space
Parquet column dump
parquet-tools dump -c meta part-00142-84bd71d5-268e-4db2-a962-193c171ed889.c000.snappy.parquet
row group 0
--------------------------------------------------------------------------------
row group 1
--------------------------------------------------------------------------------
I'm running a Spark cluster in standalone mode. Both Master and Worker nodes are reachable, with logs in the Spark Web UI.
I'm trying to load data into a PySpark session so I can work on Spark DataFrames.
Following several examples (among them, one from the official docs), I tried using different methods, all failing with the same error, e.g.:
from pyspark.conf import SparkConf
from pyspark.context import SparkContext
from pyspark.sql import SparkSession, SQLContext
conf = SparkConf().setAppName('NAME').setMaster('spark://HOST:7077')
sc = SparkContext(conf=conf)
spark = SparkSession.builder.getOrCreate()
# a try
df = spark.read.load('/path/to/file.csv', format='csv', sep=',', header=True)
# another try
sql_ctx = SQLContext(sc)
df = sql_ctx.read.csv('/path/to/file.csv', header=True)
# and a few other tries...
Every time, I get the same error:
Py4JJavaError: An error occurred while calling o81.csv. :
org.apache.spark.SparkException: Job aborted due to stage failure:
Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3
in stage 0.0 (TID 3, 192.168.X.X, executor 0):
java.io.StreamCorruptedException: invalid stream header: 0000000B
I'm loading data from JSON and CSV (tweaking the method calls appropriately, of course); the error is the same for both, every time.
Does someone understand what the problem is?
To whom it may concern, I finally figured out the problem thanks to this response.
The pyspark version used for the SparkSession did not match the Spark application version (2.4 vs. 2.3).
Re-installing pyspark at version 2.3 instantly solved the issue. #facepalm
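A quick way to check for this kind of mismatch is to compare the pyspark package version on the driver with the Spark version of the cluster (a sketch; sc is the running SparkContext, and the master UI address is an assumption based on the default standalone port):
import pyspark
print(pyspark.__version__)  # version of the pyspark Python package driving the session
print(sc.version)           # version reported by the running context
# compare against the Spark version shown on the standalone master's web UI (http://HOST:8080)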
I am new to Spark and Hadoop. I have a use case in which I am trying to read, count the number of records in, and sort the data of a 20GB CSV file. The problem is that when I use these functions, it doesn't work. Here is my code; please have a look and suggest an approach for handling a large file with a Spark RDD.
import findspark
findspark.init()
from pyspark import SparkConf, SparkContext
APP_NAME = 'My Spark Application'
file = 0
conf = SparkConf().setAppName(APP_NAME).setMaster("local")
sc = SparkContext(conf=conf)
val_file = sc.textFile("hdfs://localhost:50000/yottaa/transactions.csv")
val_file.count() ### It's taking 10 mins to execute and produce a result.
val_file.count() ---> It takes 10 minutes to count the rows. How can I increase the speed? I'm using a laptop with 16GB of RAM, and when I run the val_file.collect() statement it shows me the following error:
Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe. :
org.apache.spark.SparkException: Job aborted due to stage failure:
Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost): java.lang.OutOfMemoryError: GC overhead limit exceeded
at java.nio.HeapCharBuffer.<init>(HeapCharBuffer.java:57)
at java.nio.CharBuffer.allocate(CharBuffer.java:331)
at java.nio.charset.CharsetDecoder.decode(CharsetDecoder.java:777)
at org.apache.hadoop.io.Text.decode(Text.java:412)
at org.apache.hadoop.io.Text.decode(Text.java:389)
at org.apache.hadoop.io.Text.toString(Text.java:280) at