to_date conversion failing in PySpark on Spark 3.0 - apache-spark

I know about the calendar change in Spark 3.0, but I am trying to understand why the conversion fails in this particular instance. Spark 3.0 has issues with dates before the year 1582; however, in this example the year is greater than 1582.
from pyspark.sql import Row

rdd = sc.parallelize(["3192016"])
df = rdd.map(lambda x: Row(date=x)).toDF()
df.createOrReplaceTempView("date_test")
sqlDF = spark.sql("SELECT to_date(date, 'yyyymmdd') FROM date_test")
sqlDF.show()
Fails with
Py4JJavaError: An error occurred while calling o1519.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 10 in stage 167.0 failed 4 times, most recent failure: Lost task 10.3 in stage 167.0 (TID 910) (172.36.189.123 executor 3): org.apache.spark.SparkUpgradeException: You may get a different result due to the upgrading of Spark 3.0: Fail to parse '3192016' in the new parser. You can set spark.sql.legacy.timeParserPolicy to LEGACY to restore the behavior before Spark 3.0, or set to CORRECTED and treat it as an invalid datetime string.

You just need to set spark.sql.legacy.timeParserPolicy to LEGACY to restore the behaviour from previous versions.
The exception message itself points to the fix:
SparkUpgradeException: You may get a different result due to the upgrading of Spark 3.0: Fail to parse '3192016' in the new parser. You can set spark.sql.legacy.timeParserPolicy to LEGACY to restore the behavior before Spark 3.0, or set to CORRECTED and treat it as an invalid datetime string.
Here is how you can do it in Python:
spark.sql("set spark.sql.legacy.timeParserPolicy=CORRECTED")

Related

Pyspark: Job aborted due to stage failure: Task 3 in stage 86.0 failed 1 times Possible cause: Parquet column cannot be converted

I am facing some issues while writing parquet files from one blob to another. Below is the code I'm using.
df = spark.read.load(FilePath1, format="parquet", modifiedAfter=datetime)
spark.conf.set("spark.sql.parquet.enableVectorizedReader", "false")
df.coalesce(1).write.format("parquet").mode("overwrite").save(FilePath2)
Error -
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 3 in stage 86.0 failed 1 times, most recent failure: Lost task 3.0 in stage 86.0 (TID 282) (10.0.55.68 executor driver): com.databricks.sql.io.FileReadException: Error while reading file dbfs:file.parquet. Possible cause: Parquet column cannot be converted.
java.lang.ClassCastException: org.apache.spark.sql.catalyst.expressions.MutableInt cannot be cast to org.apache.spark.sql.catalyst.expressions.MutableLong.
Any help is appreciated. Thanks.
The cause of this error is most likely that a decimal-type column is being decoded into binary format by the vectorized Parquet reader.
The vectorized Parquet reader is enabled by default in Databricks Runtime 7.3 and above for reading Parquet datasets; it supports only atomic data types in the read schema, such as binary, boolean, date, text, and timestamp.
The solution: if your source data contains decimal-type columns, disable the vectorized Parquet reader.
To disable the vectorized Parquet reader at the cluster level, set spark.sql.parquet.enableVectorizedReader to false in the cluster's Spark configuration.
At the notebook level, you can also disable the vectorized Parquet reader by running:
spark.conf.set("spark.sql.parquet.enableVectorizedReader","false")
References:
Apache Spark job fails with Parquet column cannot be converted error
Pyspark job aborted error due to stage failure

Creating dynamic frame issue without the pushdown predicate

New to AWS Glue, so pardon my question:
Why do I get an error when I don't include a pushdown predicate when creating the dynamic frame? I am trying to use it without the predicate because I will be using a bookmark, so only new files get processed regardless of the date partition.
datasourceDyF = gluecontext.create_dynamic_frame.from_catalog(database=db_name, table_name=table1, transformation_ctx="datasourceDyF")
datasourceDyF.toDF().show(20)
vs
datasourceDyF = gluecontext.create_dynamic_frame.from_catalog(database=db_name, table_name=table1, transformation_ctx="datasourceDyF", push_down_predicate="salesdate = '2020-01-01'")
datasourceDyF.toDF().show(20)
The first snippet gives this error:
py4j.protocol.Py4JJavaError: An error occurred while calling o76.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 4 times,
most recent failure: Lost task 0.3 in stage 1.0 (TID 4, xxx.xx.xxx.xx, executor 5):
java.lang.UnsupportedOperationException: org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainLongDictionary
A pushdown predicate is good to use when reading from an RDBMS or a catalog table: it helps Spark identify which data needs to be loaded into memory (there is no point in loading data that is not required by the downstream system). The benefit is that, with less data loaded, execution is much faster than a full table load.
Now, in your case, your underlying table is probably a partitioned one, hence the pushdown predicate was required.
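As a hedged sketch (all names are taken from the question; salesdate is the question's own partition column), a bookmark and a pushdown predicate can be combined, so the predicate prunes partitions while transformation_ctx still lets the bookmark skip already-processed files:
# Sketch only: same call as in the question, keeping both the bookmark
# context and a predicate that restricts the scan to recent partitions
datasourceDyF = gluecontext.create_dynamic_frame.from_catalog(
    database=db_name,
    table_name=table1,
    transformation_ctx="datasourceDyF",              # enables job-bookmark tracking
    push_down_predicate="salesdate >= '2020-01-01'"  # prunes partitions before reading
)
datasourceDyF.toDF().show(20)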

Kryo serialization failed: Buffer overflow

We read hourly-partitioned data from S3 through Spark in Scala. For example,
import sparkSession.implicits._   // encoder needed by createDataset

sparkSession
  .createDataset(sc
    .wholeTextFiles("s3://<Bucket>/<key>/<yyyy>/<MM>/<dd>/<hh>/*")
    .values
    .flatMap(x => {
      x.replace("\n", "")
        .replace("}{", "}}{{")
        .split("\\}\\{")
    }))
The slice and dice above (the replace and split calls) converts the pretty-printed JSON into JSON lines (one JSON record per line).
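For illustration, here is the same replace-and-split trick applied to a tiny made-up sample (plain Python for brevity; the sample string is hypothetical):
# Two pretty-printed JSON objects concatenated back to back
pretty = '{"a": 1,\n "b": 2}{"a": 3,\n "b": 4}'

records = (pretty
           .replace("\n", "")       # drop the pretty-print newlines
           .replace("}{", "}}{{")   # duplicate the record boundary
           .split("}{"))            # split on the original boundary

# records == ['{"a": 1, "b": 2}', '{"a": 3, "b": 4}']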
Now I am getting this error while running on EMR:
Job aborted due to stage failure: Task 1 in stage 11.0 failed 4 times, most recent failure: Lost task 1.3 in stage 11.0 (TID 43, ip-10-0-2-22.eu-west-1.compute.internal, executor 1): org.apache.spark.SparkException: Kryo serialization failed: Buffer overflow. Available: 0, required: 1148334. To avoid this, increase spark.kryoserializer.buffer.max value.
I have tried increasing the Kryo serializer buffer with --conf spark.kryoserializer.buffer.max=2047m, but I still get this error when reading data for some hour locations (e.g. hours 09 and 10), while other hours read fine.
How can I get rid of this error, and do I need to change anything else in the Spark configuration, such as the number of partitions? Thanks.
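For reference, a hedged sketch of how these Kryo settings could be set when building the session (shown in PySpark for brevity; the same keys work as --conf flags, and whether this alone fixes the problematic hours is not established here):
from pyspark.sql import SparkSession

# Sketch: set the Kryo buffer limits up front; spark.kryoserializer.buffer.max
# must stay below 2048m, so 2047m is effectively the ceiling
spark = (SparkSession.builder
         .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
         .config("spark.kryoserializer.buffer", "64m")
         .config("spark.kryoserializer.buffer.max", "1024m")
         .getOrCreate())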

How do I read parquet with Spark that has unsupported types?

I would like to use PySpark to pull data from a parquet file that contains UINT64 columns, which currently map to typeNotSupported() in Spark. I do not need these columns, so I was hoping I could pull only the other columns with the following command:
spark.read.parquet('path/to/dir/').select('legalcol1', 'legalcol2')
However, I was still met with the following error.
An error was encountered:
An error occurred while calling o86.parquet.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times,
most recent failure: Lost task 0.3 in stage 0.0 (TID 3, ..., executor 1):
org.apache.spark.sql.AnalysisException: Parquet type not supported: INT64 (UINT_64);
Is there a way to ingest this data without throwing the above error?
You can try converting a column from one type to another:
from pyspark.sql.functions import col

df = spark.read.parquet('path/to/dir/')
df.select(col('legalcol1').cast('string').alias('col1'), col('legalcol2').cast('string').alias('col2'))
Or convert to a bigint column type:
df.select(col('uint64col').cast('bigint').alias('bigint_col'))
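A compact usage sketch of the same idea, assuming the hypothetical column names above and that the read succeeds once the unsupported column is cast:
from pyspark.sql.functions import col

converted = (spark.read.parquet('path/to/dir/')
             .select(col('legalcol1').cast('string').alias('col1'),
                     col('uint64col').cast('bigint').alias('bigint_col')))
converted.printSchema()   # verify the resulting types before running an action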

Can't get the johnsnow OCR notebook run on databricks

So I am trying to follow this notebook and get it to work in a Databricks notebook: https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/jupyter/ocr-spell/OcrSpellChecking.ipynb. However, after installing all the packages, I still get stuck by the time I get to:
{ // for displaying
  val regions = data.select("region").collect().map(_.get(0))
  regions.foreach { chunk =>
    println("---------------")
    println(chunk)
  }
}
Error message is:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 3.0 failed 4 times, most recent failure: Lost task 0.3 in stage 3.0 (TID 51, 10.195.249.145, executor 4): java.lang.NoClassDefFoundError: Could not initialize class net.sourceforge.tess4j.TessAPI
Does anyone know why? Much appreciated!
To use Spark NLP OCR you need to install Tesseract 4.x+, as the documentation states, and in a cluster it must be installed on all the nodes. However, if you are only dealing with PDFs and not scanned images, you can probably skip the Tesseract 4.x+ installation:
import com.johnsnowlabs.nlp.util.io.OcrHelper
val ocrHelper = new OcrHelper()
val df = ocrHelper.createDataset(spark, "/tmp/Test.pdf")
Update: There is a new doc for Spark OCR and special instructions for Databricks:
https://nlp.johnsnowlabs.com/docs/en/ocr
