Cannot read parquet using PySpark - apache-spark

I'm writing a PySpark program that writes a DataFrame to Parquet in a loop, appending new data to the Parquet dataset on each iteration. The dataset is stored in an S3 bucket.
I can write the Parquet data, but when I load it into a DataFrame and try to read it using
df.take(5)
I get the following error message:
An error occurred while calling o461.collectToPython.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 32.0 failed 4 times, most recent failure: Lost task 0.3 in stage 32.0 (TID 57, ip-10-0-2-219.ec2.internal, executor 5):
java.lang.UnsupportedOperationException: org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainBinaryDictionary
...
However, I can run the following commands on the dataframe:
df.count()
df.printSchema()
Any idea why this error occurs?

Related

Pyspark: Job aborted due to stage failure: Task 3 in stage 86.0 failed 1 times Possible cause: Parquet column cannot be converted

I am facing an issue while writing Parquet files from one blob to another. Below is the code I'm using.
df = spark.read.load(FilePath1,
format="parquet", modifiedAfter=datetime)
spark.conf.set("spark.sql.parquet.enableVectorizedReader","false")
df.coalesce(1).write.format("parquet").mode("overwrite").save(FilePath2)
Error -
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 3 in stage 86.0 failed 1 times, most recent failure: Lost task 3.0 in stage 86.0 (TID 282) (10.0.55.68 executor driver): com.databricks.sql.io.FileReadException: Error while reading file dbfs:file.parquet. Possible cause: Parquet column cannot be converted.
java.lang.ClassCastException: org.apache.spark.sql.catalyst.expressions.MutableInt cannot be cast to org.apache.spark.sql.catalyst.expressions.MutableLong.
Any help is appreciated. Thanks.
A likely cause of this error is that the vectorized Parquet reader decodes decimal-type columns into a binary format.
The vectorized Parquet reader is enabled by default in Databricks Runtime 7.3 and higher for reading datasets in Parquet files. The read schema uses atomic data types: binary, boolean, date, string, and timestamp.
If your source data contains decimal-type columns, the solution is to disable the vectorized Parquet reader.
To disable it at the cluster level, set spark.sql.parquet.enableVectorizedReader to false in the cluster's Spark configuration.
At the notebook level, you can disable it by running:
spark.conf.set("spark.sql.parquet.enableVectorizedReader","false")
References:
Apache Spark job fails with Parquet column cannot be converted error
Pyspark job aborted error due to stage failure

Record larger than the Split size in AWS GLUE?

I'm new to AWS Glue and Spark.
I am building my ETL with them.
When I connect to my S3 bucket, files of approximately 200 MB are not read.
The error is:
An error was encountered:
An error occurred while calling o99.toDF.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 10.0 failed 1 times, most recent failure: Lost task 1.0 in stage 10.0 (TID 16) (91ec547edca7 executor driver): com.amazonaws.services.glue.util.NonFatalException: Record larger than the Split size: 67108864
Update 1:
When I split my JSON file (200 MB) into two parts with jq, AWS Glue reads both parts normally.
My workaround is a Lambda function that splits the file, but I want to know how AWS Glue splitting works.
Thanks and regards
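The split size in the error message, 67108864 bytes, is 64 × 1024 × 1024 (64 MiB), which suggests that some single record in the file was larger than one split. One possible workaround, in the spirit of the Lambda splitter, is to rewrite the file as JSON Lines so that each element becomes its own small record. The sketch below assumes the file is one top-level JSON array (the question does not show the layout); the function name array_to_json_lines and the temporary files are hypothetical stand-ins for the real S3 objects.

```python
import json
import os
import tempfile

SPLIT_SIZE = 64 * 1024 * 1024  # 67108864 bytes, the value in the error message


def array_to_json_lines(src_path, dst_path):
    """Rewrite a top-level JSON array as one JSON object per line.

    Returns the size in bytes of the largest record written.
    Loads the whole source file into memory, which is acceptable for a
    file of a few hundred megabytes but not for arbitrarily large input.
    """
    with open(src_path, "r", encoding="utf-8") as src:
        records = json.load(src)
    if not isinstance(records, list):
        raise ValueError("expected a top-level JSON array")

    largest = 0
    with open(dst_path, "w", encoding="utf-8") as dst:
        for record in records:
            line = json.dumps(record, separators=(",", ":"))
            largest = max(largest, len(line.encode("utf-8")))
            dst.write(line + "\n")
    return largest


# Small demonstration with temporary files standing in for the S3 objects.
workdir = tempfile.mkdtemp()
src_file = os.path.join(workdir, "input.json")
dst_file = os.path.join(workdir, "output.jsonl")
with open(src_file, "w", encoding="utf-8") as f:
    json.dump([{"id": 1, "name": "a"}, {"id": 2, "name": "b"}], f)

largest_record = array_to_json_lines(src_file, dst_file)
with open(dst_file, encoding="utf-8") as f:
    output_lines = f.read().splitlines()
```

Comparing the returned largest record size against SPLIT_SIZE gives a quick check of whether any record would still exceed one split after the rewrite.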

How do I read parquet with Spark that has unsupported types?

I would like to use PySpark to pull data from a parquet file that contains UINT64 columns, which currently map to typeNotSupported() in Spark. I do not need these columns, so I was hoping I could pull only the other columns (column pruning) with the following command:
spark.read.parquet('path/to/dir/').select('legalcol1', 'legalcol2')
However, I was still met with the following error.
An error was encountered:
An error occurred while calling o86.parquet.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times,
most recent failure: Lost task 0.3 in stage 0.0 (TID 3, ..., executor 1):
org.apache.spark.sql.AnalysisException: Parquet type not supported: INT64 (UINT_64);
Is there a way to ingest this data without throwing the above error?
You can try casting the columns to another type:
from pyspark.sql.functions import col
df = spark.read.parquet('path/to/dir/')
df.select(col('legalcol1').cast('string').alias('col1'), col('legalcol2').cast('string').alias('col2'))
To convert to the bigint column type:
df.select(col('uint64col').cast('bigint').alias('bigint_col'))
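Note that the error in the question is raised by the parquet call itself (o86.parquet), so a select or cast applied afterwards may never be reached. Another thing to try is supplying the schema up front so that Spark does not have to infer it from the file footers. The sketch below is untested against a UINT64 file and may behave differently across Spark versions; the helper names ddl_schema and read_supported_columns are hypothetical, and the STRING column types are placeholders for the real types of legalcol1 and legalcol2.

```python
def ddl_schema(columns):
    """Build a DDL schema string such as 'a STRING, b DOUBLE'."""
    return ", ".join(f"{name} {dtype}" for name, dtype in columns)


def read_supported_columns(spark, path, columns):
    """Read only the listed columns by supplying the schema up front.

    `spark` is an existing SparkSession; DataFrameReader.schema accepts
    a DDL-formatted string in Spark 2.3 and later.
    """
    return spark.read.schema(ddl_schema(columns)).parquet(path)


# Column types here are placeholders; use the real types of the columns.
wanted = [("legalcol1", "STRING"), ("legalcol2", "STRING")]
schema_string = ddl_schema(wanted)
```

With a running session, read_supported_columns(spark, 'path/to/dir/', wanted) would return a DataFrame containing only the two listed columns.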

convert pickle (.pck) file into spark data frame using python

Hello!
Dear members, I want to train a model using BigDL. My dataset consists of medical images stored as pickle object files (.pck); each pickle file is a 3D image (3D array).
I have tried to convert one into a Spark DataFrame using the BigDL Python API:
from pyspark.sql import SQLContext
pickleRdd = sc.pickleFile("/home/student/BigDL-trainings/elephantscale/data/volumetric_data/329637-8.pck")
sqlContext = SQLContext(sc)
df = sqlContext.createDataFrame(pickleRdd)
It throws this error:
Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.runJob.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 2.0 failed 1 times, most recent failure: Lost task 0.0 in stage 2.0 (TID 2, localhost, executor driver)
: java.io.IOException: file:/home/student/BigDL-trainings/elephantscale/data/volumetric_data/329637-8.pck not a SequenceFile
I have run this code on Python 3.5 as well as 2.7; I get the error in both cases.
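sc.pickleFile is meant for data previously written with RDD.saveAsPickleFile, which is stored as a SequenceFile; that is consistent with the "not a SequenceFile" error for a plain pickle file. A minimal sketch of an alternative, assuming the .pck file was written with Python's pickle module and holds the 3D array as nested lists (a NumPy array would need .tolist() first): load it on the driver and flatten it into rows. The function names load_volume and volume_to_rows are hypothetical, and the tiny demo file stands in for the real image.

```python
import os
import pickle
import tempfile


def load_volume(path):
    """Load a pickled 3D volume from a local file with the pickle module."""
    with open(path, "rb") as f:
        return pickle.load(f)


def volume_to_rows(volume):
    """Flatten a 3D nested list into (z, y, x, value) tuples."""
    return [
        (z, y, x, value)
        for z, plane in enumerate(volume)
        for y, row in enumerate(plane)
        for x, value in enumerate(row)
    ]


# Demonstration with a tiny 2x2x2 volume standing in for the real image.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.pck")
with open(demo_path, "wb") as f:
    pickle.dump([[[0, 1], [2, 3]], [[4, 5], [6, 7]]], f)

rows = volume_to_rows(load_volume(demo_path))
```

With a running SparkContext, the rows could then be passed to sqlContext.createDataFrame(rows, ["z", "y", "x", "value"]). One row per voxel can be large for real volumes, so this layout is only one option.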

Read, sort and count 20GB CSV file stored in HDFS by using pyspark RDD

I am new to Spark and Hadoop. I have a use case in which I am trying to read a 20 GB CSV file, count its records, and sort the data. The problem is that these functions are not working when I use them. Here is my code; please have a look and suggest an approach for handling a large file with a Spark RDD.
import findspark
findspark.init()
from pyspark import SparkConf, SparkContext
APP_NAME = 'My Spark Application'
file = 0
conf = SparkConf().setAppName(APP_NAME).setMaster("local")
sc = SparkContext(conf=conf)
val_file = sc.textFile("hdfs://localhost:50000/yottaa/transactions.csv")
val_file.count()  # takes 10 minutes to execute and produce a result
val_file.count() takes 10 minutes to count the rows. How can I increase the speed? I'm using a laptop with 16 GB of RAM, and when I run val_file.collect() I get the following error:
Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost): java.lang.OutOfMemoryError: GC overhead limit exceeded
at java.nio.HeapCharBuffer.<init>(HeapCharBuffer.java:57)
at java.nio.CharBuffer.allocate(CharBuffer.java:331)
at java.nio.charset.CharsetDecoder.decode(CharsetDecoder.java:777)
at org.apache.hadoop.io.Text.decode(Text.java:412)
at org.apache.hadoop.io.Text.decode(Text.java:389)
at org.apache.hadoop.io.Text.toString(Text.java:280)
at ...
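collect() brings every line of the 20 GB file back to the driver, which is consistent with the OutOfMemoryError on a 16 GB laptop. A minimal sketch of an alternative, assuming the goal is a record count plus a sorted copy written back to storage: keep the data in the RDD and return only the count. The names sort_key and count_and_sort are hypothetical, sorting by the first column is an assumption, and a header row is not handled.

```python
def sort_key(line, column=0, sep=","):
    """Return the field to sort a CSV line by (first column by default)."""
    return line.split(sep)[column]


def count_and_sort(sc, src, dst):
    """Count the lines of a text file and write a sorted copy to dst.

    `sc` is an existing SparkContext. Nothing is collected to the
    driver; only the count comes back.
    """
    lines = sc.textFile(src)
    total = lines.count()
    lines.sortBy(sort_key).saveAsTextFile(dst)
    return total


# Driver-side preview of the sort key on two made-up lines.
sample = ["b,2", "a,1"]
sample_sorted = sorted(sample, key=sort_key)
```

Creating the context with setMaster("local[*]") instead of setMaster("local") lets Spark use all cores rather than a single thread, which may shorten the count. To inspect a few records, val_file.take(5) returns only five lines instead of the whole file.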
