I am writing huge datasets to a database (DB2) using the dataset.write.jdbc method. I see that if one of the records has an issue being inserted into the db, the entire dataset fails. This is turning out to be expensive, since the dataset is prepared by running a huge pipeline, and re-running the entire pipeline for the sake of a failed persistence does not make sense.
The problem is much bigger than exception handling. Ideally, a data pipeline should be designed so that it validates the data before it is processed/transformed. In a general context this is called data validation and cleansing. In this phase you identify NULL/empty values and treat them accordingly, especially in the attributes that participate in joins or lookups. You need to transform them so that they don't create issues in subsequent steps of your pipeline; in fact, this applies to pretty much every transformation. Hope this helps.
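As a minimal sketch of that validation step (field names here are made up for illustration), you could split the batch into clean records and rejects before the write, so one bad record cannot fail the whole batch. In Spark, this predicate would back a Dataset.filter() call placed before dataset.write.jdbc, and the rejects could go to a quarantine table:

```python
def is_valid(record, required_keys=("id", "join_key")):
    """A record is writable only if its join/lookup attributes are present."""
    for key in required_keys:
        value = record.get(key)
        if value is None or (isinstance(value, str) and value.strip() == ""):
            return False
    return True

def split_clean_and_rejects(records):
    """Separate writable records from ones that would fail persistence."""
    clean, rejects = [], []
    for record in records:
        (clean if is_valid(record) else rejects).append(record)
    return clean, rejects

records = [
    {"id": 1, "join_key": "A", "amount": 10},
    {"id": 2, "join_key": "", "amount": 20},    # empty join key -> reject
    {"id": 3, "join_key": None, "amount": 30},  # NULL join key  -> reject
]
clean, rejects = split_clean_and_rejects(records)
```

Writing the rejects somewhere inspectable (rather than discarding them) lets you fix and replay just those rows instead of re-running the pipeline.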
We have a Spark app that does some customized aggregation on our data. We've created a Spark Aggregator to do it, which basically extends org.apache.spark.sql.expressions.Aggregator and is called through a UDAF. These aggregators define a pre-aggregate step so that executors can process the data locally first and only then go to the exchange phase, which is pretty sweet. More on them can be found here: https://spark.apache.org/docs/latest/sql-ref-functions-udf-aggregate.html
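For context, the two-phase flow these aggregators follow can be sketched roughly like this (a toy sum aggregator in plain Python, not our actual code or Spark's internals):

```python
from collections import defaultdict

def partial_aggregate(partition):
    """Phase 1: each executor reduces its local rows into small buffers
    (analogous to Aggregator.reduce in the partial ObjectHashAggregate)."""
    buffers = defaultdict(int)
    for key, value in partition:
        buffers[key] += value
    return dict(buffers)

def merge_buffers(all_buffers):
    """Phase 2, after the exchange: merge buffers that share a key
    (analogous to Aggregator.merge in the final ObjectHashAggregate)."""
    merged = defaultdict(int)
    for buffers in all_buffers:
        for key, value in buffers.items():
            merged[key] += value
    return dict(merged)

partitions = [[("a", 1), ("b", 2)], [("a", 3), ("b", 4), ("a", 5)]]
partials = [partial_aggregate(p) for p in partitions]
result = merge_buffers(partials)
```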
Now my question is: I've noticed that for these aggregators, the second ObjectHashAggregate step is always much slower than the first, partial ObjectHashAggregate. For example, in the attached Spark job screenshot, the second ObjectHashAggregate is significantly slower than the first one.
However, in our code both of them should be running essentially the same aggregation methods, either on the raw data or on the intermediate results. So why is the second one so much slower than the first? Is it because of the hashing and exchange that happen after the first ObjectHashAggregate? I'm not really sure what could be causing this behavior in Spark.
I'm using Delta Live Tables from Databricks, and I was trying to implement a complex data quality check (a so-called expectation) by following this guide. After I tested my implementation, I realized that even though the expectation is failing, the downstream tables that depend on the source table are still loaded.
To illustrate what I mean, here is an image describing the situation.
Image of the pipeline lineage and the incorrect behaviour
I would assume that if report_table fails because the expectation is not met (in my case, it was validating for correct primary keys), then the Customer_s table would not be loaded. However, as can be seen in the photo, this is not what happened.
Do you have any idea how to achieve the desired result? How can I define a complex validation in SQL that prevents the downstream nodes from being loaded (or makes the pipeline fail)?
The default behavior when an expectation violation occurs in Delta Live Tables is to load the data but track the data quality metrics (retain invalid records). The other options are ON VIOLATION DROP ROW and ON VIOLATION FAIL UPDATE. Since you want the pipeline to fail and the downstream tables not to load, ON VIOLATION FAIL UPDATE is the one to choose; ON VIOLATION DROP ROW would silently drop the invalid records and keep loading the rest.
https://docs.databricks.com/workflows/delta-live-tables/delta-live-tables-expectations.html#drop-invalid-records
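As a sketch (table, column, and constraint names are illustrative, not from your pipeline), a SQL expectation that fails the whole update on a primary-key violation could look like:

```sql
CREATE OR REFRESH LIVE TABLE report_table (
  -- Fail the entire update (and stop downstream loads) on a bad key
  CONSTRAINT valid_primary_key EXPECT (id IS NOT NULL)
  ON VIOLATION FAIL UPDATE
)
AS SELECT * FROM LIVE.source_table;
```

For validations too complex for a single boolean expression, the usual pattern is to compute the check in a view and reference it in the EXPECT condition.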
If nondeterministic code runs on Spark, it can cause a problem when recovery from the failure of a node is necessary, because the new output may not be exactly the same as the old output. My interpretation is that the entire job might need to be rerun in this case, because otherwise the output data could be inconsistent with itself (different data produced at different times). At the very least, any nodes downstream from the recovered node would probably need to be restarted from scratch, because they have processed data that may now change. That's my understanding of the situation anyway; please correct me if I am wrong.
My question is whether Spark can somehow automatically detect whether code is nondeterministic (for example by comparing the old output to the new output) and adjust the failure recovery accordingly. If this were possible, it would relieve application developers of the requirement to write deterministic code, which can sometimes be challenging and is in any case easy to forget.
No. Spark will not be able to handle non-deterministic code in case of failures. The fundamental data structure of Spark, the RDD, is not only immutable but should also be a deterministic function of its input. This is necessary because otherwise the framework would not be able to recompute a partial RDD (a partition) in case of failure. If a recomputed partition is not deterministic, Spark would have to re-run the transformations on the full RDDs in the lineage. I don't think Spark is the right framework for non-deterministic code.
If Spark has to be used for such a use case, the application developer has to take care of keeping the output consistent by writing code carefully. It can be done by using RDDs only (no DataFrame or Dataset) and persisting the output after every transformation that executes non-deterministic code. If performance is a concern, the intermediate RDDs can be persisted on Alluxio.
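A toy sketch of why persisting after each non-deterministic transformation is what keeps downstream output consistent (the call counter stands in for any non-deterministic source like a timestamp or random value; in Spark the "persist" step would be writing the RDD out, e.g. to Alluxio or HDFS):

```python
calls = {"n": 0}

def nondeterministic(x):
    # Result depends on how many times the function has ever been invoked,
    # so recomputing the same input yields a different output.
    calls["n"] += 1
    return (x, calls["n"])

data = [10, 20, 30]

first_run = [nondeterministic(x) for x in data]   # original computation
second_run = [nondeterministic(x) for x in data]  # lineage replay after a "failure"
assert first_run != second_run                    # recomputation diverges

persisted = list(first_run)    # materialize once, outside the lineage
assert persisted == first_run  # downstream reads stay consistent
```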
A longer-term approach would be to open a feature request in the Apache Spark JIRA, though I am not too optimistic about its acceptance: a small hint in the syntax indicating whether code is deterministic would let the framework switch between recovering an RDD partially or fully.
Non-deterministic results are not detected and accounted for in failure recovery (at least in Spark 2.4.1, which I'm using).
I have encountered issues with this a few times in Spark. For example, let's say I use a window function:
first_value(field_1) over (partition by field_2 order by field_3)
If field_3 is not unique, the result is non-deterministic and can differ each time the function is run. If a Spark executor dies and restarts while calculating this window function, you can actually end up with two different first_value results for the same field_2 partition.
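A toy demonstration of the effect (plain Python standing in for the window function): with a tie on the ordering key, the "first" row depends purely on arrival order, which is exactly what changes when an executor recomputes a partition. The usual fix, sketched in the second function, is adding a unique column as a tiebreaker, i.e. `order by field_3, field_1` in the window definition:

```python
rows_run1 = [("p1", "x", 1), ("p1", "y", 1)]  # (field_2, field_1, field_3)
rows_run2 = [("p1", "y", 1), ("p1", "x", 1)]  # same rows, different arrival order

def first_value(rows):
    # sorted() is stable, so ties on field_3 keep arrival order: the "first"
    # row differs between the two runs even though the data is identical.
    return sorted(rows, key=lambda r: r[2])[0][1]

assert first_value(rows_run1) == "x"
assert first_value(rows_run2) == "y"   # same data, different answer

def first_value_deterministic(rows):
    # Break ties with a unique column (field_1 here) so the result no longer
    # depends on arrival order.
    return sorted(rows, key=lambda r: (r[2], r[1]))[0][1]

assert first_value_deterministic(rows_run1) == first_value_deterministic(rows_run2)
```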
I have created a Spark job using the Dataset API. There is a chain of operations performed until the final result, which is written to HDFS.
But I also need to know how many records were read for each intermediate dataset. Let's say I apply 5 operations on a dataset (map, groupBy, etc.); I need to know how many records there were in each of the 5 intermediate datasets. Can anybody suggest how this can be obtained at the dataset level? I guess I can find this out at the task level (using listeners), but I am not sure how to get it at the dataset level.
Thanks
The closest thing in the Spark documentation related to metrics is accumulators. However, these are reliable only for actions; the documentation mentions that accumulators are not guaranteed to be updated for transformations.
You can still use count to get the latest counts after each operation. But keep in mind that it is an extra job like any other, and you need to decide whether the ingestion should run faster with fewer metrics or slower with all metrics.
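A sketch of the count-after-every-step approach (step names are hypothetical; the list operations stand in for your Dataset transformations). In Spark each count() triggers an extra job, so you would typically cache() the intermediate Dataset first to avoid recomputing the whole lineage:

```python
def run_with_counts(data, steps):
    """Apply each named transformation in turn, recording the record count.
    Dataset equivalent per step: ds = ds.transform(fn); ds.cache(); ds.count()."""
    counts = {}
    for name, fn in steps:
        data = fn(data)
        counts[name] = len(data)
    return data, counts

steps = [
    ("parse",  lambda rows: [r for r in rows if r is not None]),
    ("filter", lambda rows: [r for r in rows if r > 0]),
    ("map",    lambda rows: [r * 2 for r in rows]),
]
result, counts = run_with_counts([1, -2, None, 3], steps)
```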
Now, coming back to listeners: a SparkListener can receive events about when applications, jobs, stages, and tasks start and complete, as well as other infrastructure-centric events like drivers being added or removed, an RDD being unpersisted, or environment properties changing. All the information you can find about the health of Spark applications and the entire infrastructure is in the WebUI.
Your requirement is more of a custom implementation, and I'm not sure you can achieve it with the built-in tooling. Some info regarding exporting metrics is here.
All the metrics you can collect are at job start, job end, task start, and task end. You can check the docs here.
Hope the above info guides you toward a better solution.
I've come across a situation where I'd like to do a "lookup" within a Spark and/or Spark Streaming pipeline (in Java). The lookup is somewhat complex, but fortunately, I have some existing Spark pipelines (potentially DataFrames) that I could reuse.
For every incoming record, I'd like to potentially launch a Spark job from within the task to get the information needed to decorate it.
Considering the performance implications, would this ever be a good idea?
Not considering the performance implications, is this even possible?
Is it possible to get and use a JavaSparkContext from within a task?
No. The SparkContext is only valid on the driver, and Spark will prevent it from being serialized. It is therefore not possible to use the SparkContext from within a task.
For every incoming record, I'd like to potentially launch a spark job from the task to get the necessary information to decorate it with.
Considering the performance implications, would this ever be a good idea?
Without more details, my umbrella answer would be: Probably not a good idea.
Not considering the performance implications, is this even possible?
Yes, probably by bringing the base collection to the driver (collect) and iterating over it. If that collection doesn't fit in the driver's memory, see the previous point.
If every record needs to be processed, consider performing some form of join with the 'decorating' dataset instead: that will be one large job rather than tons of small ones.
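A sketch of that join-based alternative (names are illustrative). In Spark this would be `df.join(broadcast(lookup_df), "key")`; below, a plain dict stands in for the broadcast side so the single-pass idea is visible:

```python
lookup = {"u1": "gold", "u2": "silver"}  # small 'decorating' dataset

def decorate(records, lookup_table):
    """Left-join each record with its lookup value in a single pass,
    instead of launching one lookup job per record."""
    return [
        {**record, "tier": lookup_table.get(record["user"])}
        for record in records
    ]

incoming = [{"user": "u1", "amount": 5}, {"user": "u3", "amount": 7}]
decorated = decorate(incoming, lookup)
```

Records with no match keep a None tier (the left-join behavior), so nothing is silently dropped.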