Spark: Why does dropping columns cause a Spark job to fail? - apache-spark

In Spark 2.0, I am running a PySpark job where I read from a table, add some columns whose logic is based on windowing over 30 days' worth of data, and then use df.createOrReplaceTempView followed by spark.sql("CREATE TABLE ... AS SELECT * FROM ...") to create a table in HDFS.
This job runs successfully and creates a table in HDFS. However, I don't need all of the columns I just created in my dataframe; I only need half of the new columns, so I add some logic to drop the ones I don't need (all of the columns being dropped were just created). When I run the drop, df = df.select([c for c in df.columns if c not in ('a','b','d','e')]), the Spark job now fails!
error: Job aborted due to stage failure: Task 139 in stage 1.0 failed 4 times, most recent failure: Lost task 139.3 in stage 1.0 (TID 405, myhost, executor 197): ExecutorLostFailure (executor 197 exited caused by one of the running tasks) Reason: Container marked as failed: container_111 on host: myhost. Exit status: 143. Diagnostics: Container killed on request. Exit code is 143

You can use .drop("colname") to drop columns from the dataframe:
df1 = df.drop("a", "b", "c", "d")
Hope it helps.
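For reference, a minimal sketch comparing the select-based filter from the question with drop (the column names are just the placeholders used in the question):

# assuming df is the dataframe from the question, and a/b/d/e are the columns to remove
cols_to_drop = ['a', 'b', 'd', 'e']

# select-based approach, as in the question
df_selected = df.select([c for c in df.columns if c not in cols_to_drop])

# equivalent drop-based approach
df_dropped = df.drop(*cols_to_drop)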

Related

AWS Glue ExecutorLostFailure (executor 15 exited caused by one of the running tasks) Reason: Remote RPC client disassociated

I have a simple Glue job where I am using PySpark to read 14 million rows from RDS using JDBC and then trying to save them to S3. I can see in the Glue output logs that reading and creating the dataframe is quick, but the write operation fails with the error:
error occurred while calling o89.save. Job aborted due to stage failure: Task 0 in stage 2.0 failed 4 times, most recent failure: Lost task 0.3 in stage 2.0 (TID 5, 10.150.85.95, executor 15): ExecutorLostFailure (executor 15 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
I have tried the following solutions:
Adding --conf with spark.executor.memory=10g and also with 30g after seeing some solutions on SO.
Tried to convert the Spark df to a DynamicFrame and then call the save operation.
Tried increasing the workers to 500!
And still no luck getting it to pass.
One weird thing I observed is that after I create the dataframe by reading from JDBC, the entire df stays in one partition until I repartition it. The reading step itself completes without any error.
The same code runs for 6M rows and the job completes in 5 minutes.
But it fails for 14M rows with the ExecutorLostFailure error.
I also see this error sometimes if I dig deep in the Logs:
2023-01-22 10:36:52,972 WARN [allocator] glue.ExecutorTaskManagement (Logging.scala:logWarning(66)): executor task creation failed for executor 203, restarting within 15 secs. restart reason: Executor task resource limit has been temporarily hit..
Code:
def read_from_db():
    logger.info(f'Starts Reading Data from {DB_TABLE} table')
    start = time.perf_counter()
    filter_query = f'SELECT * FROM {DB_TABLE}'
    sql_query = '({}) as query'.format(filter_query)
    spark_df = (glueContext.read.format('jdbc')
                .option('driver', 'org.postgresql.Driver')
                .option('url', JDBC_URL)
                .option('dbtable', sql_query)
                .option('user', DB_USERS)
                .option('password', DB_PASSWORD)
                .load()
                )
    end = time.perf_counter()
    logger.info(f'Count of records in DB is {spark_df.count()}')
    logger.info(f'Elapsed time for reading records from {DB_TABLE} table = {end - start:0.4f} seconds')
    logger.info(f'Finished Reading Data from {DB_TABLE} table')
    logger.info(f"Total no. of partitions - {spark_df.rdd.getNumPartitions()}")

    # def write_to_s3(spark_df_rep):
    #     S3_PATH = (
    #         f"{S3_BUCKET}/all-entities-update/{date}/{cur_time}"
    #     )
    #     spark_df_rep.write.format("csv").option("header", "true").save(S3_PATH)

    spark_df = spark_df.repartition(20)
    logger.info(f"Completed Repartitioning. Total no. of partitions - {spark_df.rdd.getNumPartitions()}")

    # spark_df.foreachPartition(write_to_s3)
    # spark_dynamic_frame = DynamicFrame.fromDF(spark_df, glueContext, "spark_dynamic_frame")
    # logger.info("Conversion to DynamicFrame complete")
    # glueContext.write_dynamic_frame.from_options(
    #     frame=spark_dynamic_frame,
    #     connection_type="s3",
    #     connection_options={"path": S3_PATH},
    #     format="csv"
    # )

    S3_PATH = (
        f"{S3_BUCKET}/all-entities-update/{date}/{cur_time}"
    )
    spark_df.write.format("csv").option("header", "true").save(S3_PATH)
    return
In many cases this rather cryptic error message signals an OOM. Setting spark.task.cpus to a value greater than the default of 1 (up to 8, which is the number of cores on a G.2X worker for Glue version 3 or higher) helped me. This effectively increases the amount of memory a single Spark task gets (at the cost of a few cores sitting idle).
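A sketch of how that might be configured (values are illustrative; in Glue the usual route is the --conf special job parameter, and the session-builder form only applies before a SparkContext exists):

# Option A: Glue job parameter (set on the job, not in code)
#   key:   --conf
#   value: spark.task.cpus=4

# Option B: set it when the Spark session is first created
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .config("spark.task.cpus", "4")  # each task reserves 4 cores, so fewer concurrent tasks share an executor's memory
         .getOrCreate())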
I understood this was because no memory was left in a single executor. Increasing workers doesn't help, because 1 worker → 1 executor → 2 DPUs; even the max configuration with G.2X doesn't help.
This issue came up because the data was skewed: all rows in my database were similar except for 2 of the 13 columns, so PySpark wasn't able to spread them across partitions and tried to load all of my rows into a single partition.
So increasing workers/executors was of no help.
I solved this by loading the data into different partitions manually. Spark really was keeping everything in one partition; I verified it.
Even adding repartitioning didn't help.
I was getting the error while writing, not while reading, which was the cause of the confusion. But the actual issue was with reading, and the read is only triggered when the write (an action) is called, so the error surfaced at the write step:
From other SO answers:
Spark reads the data only when an action is applied; since you are just reading and writing to S3, the data is read when the write is triggered.
Spark is not optimized to read bulk data from an RDBMS, as by default it establishes only a single connection to the database.
Write the data out in parallel, e.g. in Parquet format (see the sketch below).
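A minimal sketch of that last point (the partition count and S3_PATH are placeholders; the quoted answer suggests Parquet, though CSV as in the question would behave the same way):

# write out in parallel: each partition becomes its own output file
spark_df.repartition(70).write.mode("overwrite").parquet(S3_PATH)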
Also see:
Databricks Spark Pyspark RDD Repartition - "Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues."
Manually partition for skewed data
I added a temporary column called RNO (row number), which is used as the partitionColumn to split the data into partitions; the partition column has to be a numeric, date, or timestamp type. Once the job is done, I drop this RNO column in the job itself or manually.
I had to read 14 million records from the RDBMS and then write them to S3, where each file should have around 200k records.
This is where we can use upperBound, lowerBound, and numPartitions along with the partitionColumn.
I ran with upperBound = 14,000,000, lowerBound = 1, and numPartitions = 70 to check that each file gets roughly 200k records ((upperBound - lowerBound) / numPartitions). It created 65 files and the job ran successfully within 10 minutes.
filter_query = f'select ROW_NUMBER() OVER(ORDER BY (SELECT NULL)) AS RNO, * from {DB_TABLE}'
sql_query = '({}) as query'.format(filter_query)
spark_df = (spark.read.format('jdbc')
            .option('driver', 'org.postgresql.Driver')
            .option('url', JDBC_URL)
            .option('dbtable', sql_query)
            .option('user', DB_USERS)
            .option('password', DB_PASSWORD)
            .option('partitionColumn', 'RNO')
            .option('numPartitions', 70)
            .option('lowerBound', 1)
            .option('upperBound', 14000000)
            .load()
            )
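As noted above, the temporary RNO column can then be dropped in the job itself before writing (a small sketch; S3_PATH is the same placeholder as in the question):

# drop the helper partitioning column so it does not appear in the output files
spark_df = spark_df.drop('RNO')
spark_df.write.format("csv").option("header", "true").save(S3_PATH)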
Additional references:
https://blog.knoldus.com/understanding-the-working-of-spark-driver-and-executor/

Pyspark: Job aborted due to stage failure: Task 3 in stage 86.0 failed 1 times Possible cause: Parquet column cannot be converted

I am facing some issues while writing parquet files from one blob to another. Below is the code I'm using:
df = spark.read.load(FilePath1,
                     format="parquet", modifiedAfter=datetime)
spark.conf.set("spark.sql.parquet.enableVectorizedReader","false")
df.coalesce(1).write.format("parquet").mode("overwrite").save(FilePath2)
Error -
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 3 in stage 86.0 failed 1 times, most recent failure: Lost task 3.0 in stage 86.0 (TID 282) (10.0.55.68 executor driver): com.databricks.sql.io.FileReadException: Error while reading file dbfs:file.parquet. Possible cause: Parquet column cannot be converted.
java.lang.ClassCastException: org.apache.spark.sql.catalyst.expressions.MutableInt cannot be cast to org.apache.spark.sql.catalyst.expressions.MutableLong.
Any help is appreciated. Thanks.
The cause of this error is possibly that a decimal-type column is being decoded into binary format by the vectorized Parquet reader.
The vectorized Parquet reader is enabled by default in Databricks Runtime 7.3 and higher for reading datasets in Parquet files. The read schema uses atomic data types: binary, boolean, date, text, and timestamp.
The solution: if your source data contains decimal-type columns, disable the vectorized Parquet reader.
To disable the vectorized Parquet reader at the cluster level, set spark.sql.parquet.enableVectorizedReader to false in the cluster's Spark configuration.
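In the cluster's Spark config text that is a single line, for example:

spark.sql.parquet.enableVectorizedReader false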
At the notebook level, you can also disable the vectorized Parquet reader by running:
spark.conf.set("spark.sql.parquet.enableVectorizedReader","false")
References:
Apache Spark job fails with Parquet column cannot be converted error
Pyspark job aborted error due to stage failure

When the underlying files have changed, should PySpark refresh the view or the source tables?

Let's say we have a Hive table foo that's backed by a set of parquet files on e.g. s3://some/path/to/parquet. These files are known to be updated at least once per day, but not always at the same hour of the day.
I have a view on that table, for example defined as
spark.sql("SELECT bar, count(baz) FROM foo GROUP BY bar").createOrReplaceTempView('foo_view')
When I use foo_view, the application will occasionally fail with:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 975.0 failed 4 times, most recent failure: Lost task 0.3 in stage 975.0 (TID 116576, 10.56.247.98, executor 193): com.databricks.sql.io.FileReadException: Error while reading file s3a://some/path/to/parquet. It is possible the underlying files have been updated. You can explicitly invalidate the cache in Spark by running 'REFRESH TABLE tableName' command in SQL or by recreating the Dataset/DataFrame involved.
I've tried prefixing all my queries on foo_view with a call to spark.catalog.refreshTable('foo'), but the problem keeps on showing up.
Am I doing this right? Or should I call refreshTable() on the view instead of the source table?
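For reference, the pattern described in the question looks roughly like this (a sketch of what is being attempted, not a verified fix; foo and foo_view are the names from the question):

# refresh Spark's cached metadata/file listing for the source table...
spark.catalog.refreshTable('foo')
# ...then recreate the temp view on top of it before querying
spark.sql("SELECT bar, count(baz) FROM foo GROUP BY bar").createOrReplaceTempView('foo_view')
result = spark.sql("SELECT * FROM foo_view")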

Databricks Checksum error while writing to a file

I am running a job on 9 nodes.
All of them write some information to files, doing simple writes like the one below:
dfLogging.coalesce(1).write.format('delta').mode('append').save('/dbfs/' + loggingLocation)
However I am receiving this exception:
py4j.protocol.Py4JJavaError: An error occurred while calling
o106.save. : java.util.concurrent.ExecutionException:
org.apache.spark.SparkException: Job aborted due to stage failure:
Task 1 in stage 14.0 failed 1 times, most recent failure: Lost task
1.0 in stage 14.0 (TID 259, localhost, executor driver): org.apache.hadoop.fs.ChecksumException: Checksum error:
file:/dbfs/delta/Logging/_delta_log/00000000000000000063.json at 0
exp: 1179219224 got: -1020415797
It looks to me that, because of concurrency, Spark is somehow failing and generating checksum errors.
Is there any known scenario that may be causing it?
There are a couple of things going on here that should explain why coalesce may not work.
What coalesce does is essentially combine the partitions on each worker. For example, if you have three workers, you can perform coalesce(3), which would consolidate the partitions into three, roughly one per worker.
What repartition does is shuffle the data to increase or decrease the total number of partitions. In your case, if you have more than one worker and you need a single output, you would have to use repartition(1), since you want the data to be in a single partition before writing it out.
Why would coalesce not work?
Spark limits the shuffling during coalesce, so you cannot perform a full shuffle (across different workers) when you are using coalesce, whereas repartition can perform a full shuffle, although that is an expensive operation.
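A small sketch of the difference (the numbers are illustrative):

df = spark.range(1000000).repartition(8)         # start with 8 partitions

print(df.coalesce(1).rdd.getNumPartitions())     # 1, merged without a full shuffle
print(df.repartition(1).rdd.getNumPartitions())  # 1, but via a full shuffle across workers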
Here is the code that would work:
dfLogging.repartition(1).write.format('delta').mode('append').save('/dbfs/' + loggingLocation)

Calling simple count() on Spark dataframe fails

Cluster manager: YARN
Deploy-mode: None
I was told that if the deploy mode is set to none, the stdout of the driver process goes to the root path instead of inside the container ID of the driver process.
Spark UI logs give the error "Container executed on lost node..."
I have unpersisted all other dataframes/datasets before making this call, to ensure they are not cached in memory.
Calling a simple action like count() keeps failing.
I am essentially doing the following:
columnNames.keys.foreach(col => {
  // count the non-null values in this column
  val nonNullColCount =
    dataset.select(dataset(col)).filter(row => row.getAs(col) != null).count()
  println(nonNullColCount)
})
So I am calling count() on the dataset in a loop.
In each iteration, I select a column from a list of column names.
Errors are generic and misleading, in the form of:
Job aborted due to stage failure: Task 284 in stage 14.0 failed 4 times,
most recent failure: Lost task 284.3 in stage 14.0 (TID 100923, ip-172-31-50-226.ec2.internal, executor 266):
ExecutorLostFailure (executor 266 exited caused by one of the running tasks)
Reason: Container marked as failed: container_1506075842477_0672_01_017877 on host: ip-172-31-50-226.ec2.internal.
Exit status: -100.
Diagnostics: Container released on a *lost* node
If you are using AWS spot instances and a spot instance is taken back because of a price change, you can get the following error:
Exit status: -100. Diagnostics: Container released on a lost node
Workaround: split the Spark job into many independent steps, so you can save the result of each step to S3 at short intervals, or go with non-spot instances.
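A minimal sketch of that workaround in PySpark (the S3 paths, column names, and dataframe name are placeholders; the idea is just to materialize intermediate results so a lost spot node does not throw away all previous work):

# step 1: compute an intermediate result and persist it to S3
step1_df = df.filter(df["someCol"].isNotNull())
step1_df.write.mode("overwrite").parquet("s3://my-bucket/checkpoints/step1/")

# step 2: restart from the materialized data instead of the original lineage
step1_df = spark.read.parquet("s3://my-bucket/checkpoints/step1/")
result = step1_df.groupBy("someCol").count()
result.write.mode("overwrite").parquet("s3://my-bucket/checkpoints/step2/")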
