Can Spark ignore a task failure due to an account data issue and continue the job for other accounts?

I want Spark to ignore some tasks that fail due to data issues. I also don't want Spark to stop the whole job because of a few insert failures.

If you are using Databricks, you can handle bad records and files as explained in this article:
https://docs.databricks.com/spark/latest/spark-sql/handling-bad-records.html
From the documentation:
Databricks provides a unified interface for handling bad records and
files without interrupting Spark jobs. You can obtain the exception
records/files and reasons from the exception logs by setting the data
source option badRecordsPath. badRecordsPath specifies a path to store
exception files for recording the information about bad records for
CSV and JSON sources and bad files for all the file-based built-in
sources (for example, Parquet).
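For example, a minimal sketch of setting this option when reading JSON; badRecordsPath is a Databricks-specific data source option, and the paths and schema below are made up:

val accounts = spark.read
  .option("badRecordsPath", "/mnt/logs/bad_records")   // rows/files that fail to parse are recorded here, with the reason
  .schema("account_id STRING, amount DOUBLE")          // hypothetical schema
  .json("/mnt/input/accounts/")                        // the same option works for CSV and the file-based sources

The job keeps running; the rejected records and the rejection reasons end up under the badRecordsPath location for later inspection.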
You can also use a data cleansing library such as Pandas, Optimus, sparkling.data, vanilla Spark, or Dora. These give you insight into the bad data and let you fix it before running your analysis.
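If you are on vanilla Spark (no Databricks), a rough sketch of the same idea is to parse in PERMISSIVE mode and split out the corrupt rows yourself; the schema and input path here are made-up examples:

import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types._

val schema = new StructType()
  .add("account_id", StringType)
  .add("amount", DoubleType)
  .add("_corrupt_record", StringType)                  // unparseable rows land here instead of failing the task

val accounts = spark.read
  .schema(schema)
  .option("mode", "PERMISSIVE")                        // the default: malformed fields become nulls
  .option("columnNameOfCorruptRecord", "_corrupt_record")
  .json("/data/accounts/")
  .cache()                                             // avoids the Spark 2.3+ restriction on querying only the corrupt column

val good = accounts.filter(col("_corrupt_record").isNull).drop("_corrupt_record")
val bad  = accounts.filter(col("_corrupt_record").isNotNull)

good can go on to the insert step, while bad can be written somewhere for later inspection, so one bad account does not stop the whole job.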

Related

How to do a global commit for a Spark batch job on ADLS Gen2?

I have a Spark batch application writing to ADLS Gen2 (hierarchical namespace).
When designing the application I assumed Spark would perform a global commit once the job is committed, but what it really does is commit per task: as soon as a task finishes writing, its output is moved from the temp location to the target storage.
So if the batch fails we are left with partial data, and on retry we get duplicated data. Our scale is really huge, so rolling back (deleting data) is not an option for us; finding the data to delete would take a lot of time.
Is there any "built-in" solution, something we can use out of the box?
Right now we are considering writing to some temp destination and move files only after the whole job completed, but we would like to find some more elegant solution (if exists).
This is a known issue. Apache Iceberg, Hudi, and Delta Lake are among the possible solutions.
Alternatively, instead of writing the output directly to the "official" location, write it to a staging directory. Once the job is done, rename the staging directory to the official location.
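A minimal sketch of that approach using the Hadoop FileSystem API; the paths are placeholders and df is assumed to be the DataFrame your job produces. On ADLS Gen2 with a hierarchical namespace the directory rename is a metadata operation, which is what makes this cheap:

import org.apache.hadoop.fs.Path

val staging = "abfss://container@account.dfs.core.windows.net/staging/run_001"
val target  = "abfss://container@account.dfs.core.windows.net/official/dataset"

df.write.mode("overwrite").parquet(staging)            // all tasks write under the staging path

// Only after the whole job succeeded, promote the staging directory in one rename.
val conf = spark.sparkContext.hadoopConfiguration
val fs = new Path(target).getFileSystem(conf)
val targetPath = new Path(target)
if (fs.exists(targetPath)) fs.delete(targetPath, true) // simplistic: clears the old data first
fs.rename(new Path(staging), targetPath)

Note this simple version deletes the old target before the rename, so there is a short window with no data; if that matters, rename the old target aside first and delete it only after the new data is in place.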

Custom processing of multiple json event streams with spark/databricks

I have many (hundreds of) event streams, each persisted as multiple blobs in Azure Blob Storage and encoded as multi-line JSON, and I need to perform an analysis on these streams.
For the analysis I need to "replay" them, which basically is a giant reduce operation per stream using a big custom function, that is not commutative. Since other departments are using databricks, I thought I could parallelize the tasks with it.
My main question: Is spark/databricks a suitable tool for the job and if so, how would you approach it?
I am completely new to Spark, but I am currently reading up on it using the "Complete Guide" and "Learning Spark, 2nd ed.", and I have trouble answering that question myself.
As far as I can tell, most of the Dataset / Spark SQL API is not suitable for this task. Can I just inject custom code into a Spark application that does not use these APIs, and how do I control how the tasks get distributed afterwards?
Can I read in all blob names, partition them by stream, and then generate tasks that read all blobs of a partition and just feed them into my function, without Spark trying to be clever in the background?
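Something like the following sketch is what I have in mind, using the RDD API; streamIdOf, readBlob and replay are placeholders for my own listing/parsing/reduce code:

import org.apache.spark.SparkContext

def replayAllStreams(
    sc: SparkContext,
    blobNames: Seq[String],                   // all blob names, listed e.g. via the Azure SDK
    streamIdOf: String => String,             // blob name -> stream id
    readBlob: String => Seq[String],          // blob name -> ordered events of that blob
    replay: Seq[String] => String             // the big non-commutative reduce, per stream
): Map[String, String] = {
  sc.parallelize(blobNames)
    .groupBy(streamIdOf)                      // (streamId, all blob names of that stream)
    .map { case (streamId, names) =>
      // Each group is processed inside a single task, so the custom reduce
      // sees the blobs of one stream sequentially and in a well-defined order.
      val events = names.toSeq.sorted.flatMap(readBlob)
      (streamId, replay(events))
    }
    .collect()
    .toMap
}

Would that be a reasonable way to get one task per stream?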

java.io.FileNotFoundException: Item not found Concurrent read/write on ORC table

When I try concurrent read/write on a table using spark application, I get the following error:
19/10/28 15:26:49 WARN TaskSetManager: Lost task 213.0 in stage 6.0 (TID 407, prod.internal, executor 3): java.io.FileNotFoundException: Item not found: 'gs://bucket/db_name/table_name/p1=xxx/part-1009-54ad3fbb-5eed-43ba-a7da-fb875382897c.c000'. If you enabled STRICT generation consistency, it is possible that the live version is still available but the intended generation is deleted.
at com.google.cloud.hadoop.repackaged.gcs.com.google.cloud.hadoop.gcsio.GoogleCloudStorageExceptions.getFileNotFoundException(GoogleCloudStorageExceptions.java:38)
at com.google.cloud.hadoop.repackaged.gcs.com.google.cloud.hadoop.gcsio.GoogleCloudStorageImpl.open(GoogleCloudStorageImpl.java:631)
I am using Google Cloud Dataproc Version 1.4 and stock hadoop component versions.
I was previously writing to and reading from the same partition of a PARQUET table, but it used to throw a "refresh table" error. Now I'm using an ORC table, but the error stays roughly the same. Are there any solutions for concurrent read/write on Hive tables from Spark applications?
You can try running
spark.sql("refresh table your_table")
before your read/write operation; it can work "occasionally".
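For reference, the equivalent forms (the table name is a placeholder):

spark.sql("REFRESH TABLE db_name.table_name")          // SQL form
spark.catalog.refreshTable("db_name.table_name")       // catalog API, same effect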
The first error line indicates that a file was not found in your bucket; you may want to look into this. Make sure your folders exist and that the files and the requested generations are accessible.
As for "STRICT generation consistency", this is most probably related to Cloud Storage and produced by the connector, more precisely to its strongly consistent operations:
https://cloud.google.com/storage/docs/consistency
Have you looked into your error logs to see why this error occurs? What type of environment are you running your application on?
This may be more of a Hive issue, related to the concurrency mechanism you want to implement:
https://cwiki.apache.org/confluence/display/Hive/Locking
Also, I would advise you to look into the recommendations and functionality for using Apache Hive on Cloud Dataproc. You can also consider using a multi-regional bucket if the Hive data needs to be accessed from Hive servers located in multiple regions.
https://cloud.google.com/solutions/using-apache-hive-on-cloud-dataproc

How do I monitor progress and recover in a long-running Spark map job?

We're using Spark to run an ETL process by which data gets loaded in from a massive (500+GB) MySQL database and converted into aggregated JSON files, then gets written out to Amazon S3.
My question is two-fold:
This job could take a long time to run, and it would be nice to know how the mapping is going. I know Spark has a built-in log manager. Is it as simple as putting a log statement inside each map? I'd like to know when each record gets mapped.
Suppose this massive job fails in the middle (maybe it chokes on a DB record, or the MySQL connection drops). Is there an easy way to recover from this in Spark? I've heard that caching/checkpointing can potentially solve this, but I'm not sure how.
Thanks!
These seem like two questions with lots of possible answers and detail. Anyway, assuming a non-Spark-Streaming job, and referencing other material based on my own reading/research, a limited response:
On logging and progress checking of stages, tasks and jobs:
Global logging via log4j, tailored from the log4j.properties.template file stored under the SPARK_HOME/conf folder, which serves as a basis for defining your own logging requirements at the Spark level.
Programmatically, using a Logger via import org.apache.log4j.{Level, Logger} (see the sketch after this list).
The REST API to get the status of Spark jobs. See this enlightening blog: http://arturmkrtchyan.com/apache-spark-hidden-rest-api
There is also a SparkListener that can be used.
The Web UI at http://<host>:8080 to see progress.
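A minimal sketch of the programmatic option above, assuming an RDD-based map stage; transform is a placeholder for your own per-record work:

import org.apache.log4j.{Level, Logger}
import org.apache.spark.rdd.RDD
import scala.reflect.ClassTag

def mapWithLogging[T, U: ClassTag](rdd: RDD[T])(transform: T => U): RDD[U] =
  rdd.mapPartitions { records =>
    // Create the logger inside the task: log4j Loggers are not serializable.
    val log = Logger.getLogger("etl.mapper")
    log.setLevel(Level.INFO)
    records.map { record =>
      log.info(s"mapping record: $record")   // very chatty at 500+ GB; consider logging every N records instead
      transform(record)
    }
  }

The log lines land in the executor logs (visible through the Web UI or yarn logs), not on the driver's console.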
For recovery, it depends on the type of failure: graceful vs. non-graceful, fault-tolerance aspects vs. memory usage issues, and things like serious database duplicate-key errors, depending on the API used.
See How does Apache Spark handle system failure when deployed in YARN? Spark handles its own failures by looking at the DAG and attempting to reconstruct a partition by re-executing what is needed. This all falls under fault tolerance, for which nothing needs to be done.
Things outside of Spark's domain and control mean it's over: e.g. memory issues from exceeding various parameters on large-scale computations, a DataFrame JDBC write against a store that hits a duplicate-key error, or JDBC connection outages. These mean re-execution.
As an aside, some aspects are not logged as failures even though they are, e.g. duplicate-key inserts on some Hadoop storage managers.
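On the checkpointing part of question 2, a minimal sketch (connection details and paths are made up): checkpointing writes the data to reliable storage and truncates the lineage, so later task failures recompute from the checkpoint instead of going all the way back to MySQL. It does not, by itself, resume a job whose driver has died.

spark.sparkContext.setCheckpointDir("s3a://my-bucket/spark-checkpoints")  // hypothetical bucket

val loaded = spark.read
  .format("jdbc")
  .option("url", "jdbc:mysql://db-host:3306/warehouse")   // hypothetical connection
  .option("dbtable", "events")
  .option("user", "etl")
  .option("password", sys.env.getOrElse("DB_PASSWORD", ""))
  .load()

val stable = loaded.checkpoint()   // eager by default: materializes the DataFrame now

// ... do the expensive aggregation / JSON conversion on `stable`, then write to S3 ...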

Avoiding re-processing of data during Spark Structured Streaming application updates

I am using Structured Streaming with Spark 2.2. We use Kafka as our source and rely on checkpoints for failure recovery and end-to-end exactly-once guarantees. I would like some more information on how to handle updates to the application when there is a change in stateful operations and/or the output schema.
As some sources suggest, I can run the updated application in parallel with the old one until it catches up in terms of data, and then kill the old one. But then the new application has to re-read/re-process all the data in Kafka, which could take a long time.
I want to avoid this re-processing of the data in the newly deployed updated application.
One way I can think of is for the application to keep writing the offsets somewhere in addition to the checkpoint directory, for example in ZooKeeper/HDFS. Then, on an update of the application, I tell the Kafka readStream() to start reading from the offsets stored in this new location (ZooKeeper/HDFS), since the updated application can't read from the checkpoint directory, which is now deemed incompatible.
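A sketch of that idea, assuming the offsets were persisted in the JSON format the Kafka source expects; loadSavedOffsetsJson() and the broker/topic names are placeholders:

val startingOffsetsJson: String = loadSavedOffsetsJson()
// e.g. """{"events":{"0":1234,"1":5678}}"""  (topic -> partition -> offset to start from)

val stream = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker1:9092")     // hypothetical brokers
  .option("subscribe", "events")
  .option("startingOffsets", startingOffsetsJson)        // only honored when there is no existing checkpoint
  .load()

Note that startingOffsets is only honored when the query starts without an existing checkpoint, which fits this scheme: point the updated application at a fresh checkpoint directory and seed it from the externally stored offsets.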
So a couple of questions:
Is the above-stated solution a valid solution?
If yes, how can I automate detecting whether the application is being restarted because of a failure/maintenance or because of code changes to stateful operations and/or the output schema?
Any guidance, example or information source is appreciated.