Creating dynamic frame issue without the pushdown predicate - apache-spark

New to AWS Glue, so pardon my question:
Why do I get an error when I don't include a pushdown predicate when creating the dynamic frame? I am trying to use it without the predicate because I will be using job bookmarks, so only new files will be processed regardless of the date partition.
datasourceDyF = gluecontext.create_dynamic_frame.from_catalog(database=db_name, table_name=table1, transformation_ctx="datasourceDyF")
datasourceDyF.toDF().show(20)
vs
datasourceDyF = gluecontext.create_dynamic_frame.from_catalog(database=db_name, table_name=table1, transformation_ctx="datasourceDyF", push_down_predicate="salesdate = '2020-01-01'")
datasourceDyF.toDF().show(20)
Code 1 gives this error:
py4j.protocol.Py4JJavaError: An error occurred while calling o76.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 4 times,
most recent failure: Lost task 0.3 in stage 1.0 (TID 4, xxx.xx.xxx.xx, executor 5):
java.lang.UnsupportedOperationException: org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainLongDictionary

The pushdown predicate is good to use when connecting to an RDBMS or reading a table: it helps Spark identify which data actually has to be loaded into memory (there is no point in loading data that is not required by the downstream system). The benefit is that, with less data read, execution is much faster than a full table load.
Now, in your case, your underlying table is most likely partitioned, hence the pushdown predicate was required.
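To make the partition angle concrete, here is a sketch of the same read with a partition predicate (it assumes the catalog table is partitioned on salesdate and that gluecontext, db_name and table1 are defined as in the question; the predicate value is illustrative):
# With a layout like .../salesdate=2020-01-01/part-*.parquet, the predicate
# prunes whole partitions at listing time; job bookmarks still track which
# files inside the remaining partitions have already been processed.
datasourceDyF = gluecontext.create_dynamic_frame.from_catalog(
    database=db_name,
    table_name=table1,
    transformation_ctx="datasourceDyF",
    push_down_predicate="salesdate >= '2020-01-01'"
)
datasourceDyF.toDF().show(20)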

Related

to_date conversion failing in PySpark on Spark 3.0

I know about the calendar change in Spark 3.0, and I am trying to understand why the cast fails in this particular instance. Spark 3.0 has issues with dates before the year 1582; however, in this example the year is greater than 1582.
from pyspark.sql import Row

rdd = sc.parallelize(["3192016"])
df = rdd.map(lambda d: Row(date=d)).toDF()
df.createOrReplaceTempView("date_test")
sqlDF = spark.sql("SELECT to_date(date, 'yyyymmdd') FROM date_test")
sqlDF.show()
Fails with
Py4JJavaError: An error occurred while calling o1519.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 10 in stage 167.0 failed 4 times, most recent failure: Lost task 10.3 in stage 167.0 (TID 910) (172.36.189.123 executor 3): org.apache.spark.SparkUpgradeException: You may get a different result due to the upgrading of Spark 3.0: Fail to parse '3192016' in the new parser. You can set spark.sql.legacy.timeParserPolicy to LEGACY to restore the behavior before Spark 3.0, or set to CORRECTED and treat it as an invalid datetime string.
You just need to set spark.sql.legacy.timeParserPolicy to LEGACY to get the behaviour from previous versions.
The error itself points to this:
SparkUpgradeException: You may get a different result due to the upgrading of Spark 3.0: Fail to parse '3192016' in the new parser. You can set spark.sql.legacy.timeParserPolicy to LEGACY to restore the behavior before Spark 3.0, or set to CORRECTED and treat it as an invalid datetime string.
Here is how you can do it with Python:
spark.sql("set spark.sql.legacy.timeParserPolicy=LEGACY")
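If you prefer the conf API over the SQL set command, the same option can be set like this (a sketch assuming the existing SparkSession named spark and the date_test view from the question):
spark.conf.set("spark.sql.legacy.timeParserPolicy", "LEGACY")  # restore pre-3.0 parsing
sqlDF = spark.sql("SELECT to_date(date, 'yyyymmdd') FROM date_test")
sqlDF.show()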

Kryo serialization failed: Buffer overflow

We read data stored in S3 under hourly prefixes through Spark in Scala. For example,
sparkSession
  .createDataset(sc
    .wholeTextFiles("s3://<Bucket>/<key>/<yyyy>/<MM>/<dd>/<hh>/*")
    .values
    .flatMap(x => x
      .replace("\n", "")
      .replace("}{", "}}{{")
      .split("\\}\\{")))
The slice and dice above (the replace and split) converts the pretty-printed JSON into JSON lines (one JSON record per line).
Now I am getting this error while running on EMR:
Job aborted due to stage failure: Task 1 in stage 11.0 failed 4 times, most recent failure: Lost task 1.3 in stage 11.0 (TID 43, ip-10-0-2-22.eu-west-1.compute.internal, executor 1): org.apache.spark.SparkException: Kryo serialization failed: Buffer overflow. Available: 0, required: 1148334. To avoid this, increase spark.kryoserializer.buffer.max value.
I have tried increasing the Kryo serializer buffer with --conf spark.kryoserializer.buffer.max=2047m, but I still get this error when reading data for some hour locations (e.g. hours 09 and 10), while other hours read fine.
I wanted to ask how to remove this error, and whether I need to change anything else in the Spark configuration, such as the number of partitions? Thanks
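For reference, this is roughly how that buffer setting is applied when the session is built (a sketch in PySpark for consistency with the rest of this page; the same --conf applies to the Scala job):
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         # Illustrative: the hard ceiling for this setting is just under 2048m,
         # so the 2047m already tried above is effectively the maximum Kryo allows.
         .config("spark.kryoserializer.buffer.max", "2047m")
         .getOrCreate())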

How do I read parquet with Spark that has unsupported types?

I would like to use PySpark to pull data from a parquet file that contains UINT64 columns, which currently map to typeNotSupported() in Spark. I do not need these columns, so I was hoping I could pull the other columns using predicate pushdown with the following command:
spark.read.parquet('path/to/dir/').select('legalcol1', 'legalcol2')
However, I was still met with the following error.
An error was encountered:
An error occurred while calling o86.parquet.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times,
most recent failure: Lost task 0.3 in stage 0.0 (TID 3, ..., executor 1):
org.apache.spark.sql.AnalysisException: Parquet type not supported: INT64 (UINT_64);
Is there a way to ingest this data without throwing the above error?
You can try converting any column to another column type:
from pyspark.sql.functions import col

df = spark.read.parquet('path/to/dir/')
df.select(col('legalcol1').cast('string').alias('col1'), col('legalcol2').cast('string').alias('col2'))
Convert to a bigint column type:
df.select(col('uint64col').cast('bigint').alias('bigint_col'))

When the underlying files have changed, should PySpark refresh the view or the source tables?

Let's say we have a Hive table foo that's backed by a set of parquet files on e.g. s3://some/path/to/parquet. These files are known to be updated at least once per day, but not always at the same hour of the day.
I have a view on that table, for example defined as
spark.sql("SELECT bar, count(baz) FROM foo GROUP BY bar").createOrReplaceTempView('foo_view')
When I use the foo_view the application will occasionally fail with
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 975.0 failed 4 times, most recent failure: Lost task 0.3 in stage 975.0 (TID 116576, 10.56.247.98, executor 193): com.databricks.sql.io.FileReadException: Error while reading file s3a://some/path/to/parquet. It is possible the underlying files have been updated. You can explicitly invalidate the cache in Spark by running 'REFRESH TABLE tableName' command in SQL or by recreating the Dataset/DataFrame involved.
I've tried prefixing all my queries on foo_view with a call to spark.catalog.refreshTable('foo'), but the problem keeps on showing up.
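i.e. every access currently looks roughly like this (illustrative query; foo and foo_view as defined above):
spark.catalog.refreshTable('foo')  # refresh the source table before touching the view
spark.sql("SELECT * FROM foo_view").show()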
Am I doing this right? Or should I call refreshTable() on the view instead of the source table?

Databricks Checksum error while writing to a file

I am running a job on 9 nodes.
All of them write some information to files, doing simple writes like the one below:
dfLogging.coalesce(1).write.format('delta').mode('append').save('/dbfs/' + loggingLocation)
However I am receiving this exception:
py4j.protocol.Py4JJavaError: An error occurred while calling
o106.save. : java.util.concurrent.ExecutionException:
org.apache.spark.SparkException: Job aborted due to stage failure:
Task 1 in stage 14.0 failed 1 times, most recent failure: Lost task
1.0 in stage 14.0 (TID 259, localhost, executor driver): org.apache.hadoop.fs.ChecksumException: Checksum error:
file:/dbfs/delta/Logging/_delta_log/00000000000000000063.json at 0
exp: 1179219224 got: -1020415797
It looks to me like Spark is somehow failing because of concurrency, and that generates the checksum errors.
Is there any known scenario that may be causing this?
There are a couple of things going on here, and they explain why coalesce may not work.
What coalesce does is essentially combine the partitions on each worker. For example, if you have three workers, you can perform coalesce(3), which consolidates the partitions on each worker.
What repartition does is shuffle the data to increase or decrease the total number of partitions. In your case, if you have more than one worker and you need a single output, you have to use repartition(1), since you want the data to be on a single partition before writing it out.
Why would coalesce not work?
Spark limits the shuffling during coalesce. So you cannot perform a full shuffle (across different workers) when you use coalesce, whereas you can perform a full shuffle when you use repartition, although it is an expensive operation.
Here is the code that would work:
dfLogging.repartition(1).write.format('delta').mode('append').save('/dbfs/' + loggingLocation)
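Both calls end up with a single partition; the difference is whether a full shuffle happens first. A quick way to check (illustrative numbers; dfLogging from the question behaves the same way):
df = spark.range(0, 1000, numPartitions=8)
print(df.coalesce(1).rdd.getNumPartitions())     # 1, partitions merged without a full shuffle
print(df.repartition(1).rdd.getNumPartitions())  # 1, reached via a full shuffle across workers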
