I have a DLT pipeline that creates a Delta table by reading from SQL Server and then calls a few APIs to update metadata in our Cosmos DB. Whenever we start it, it gets stuck in the initializing state.
But when we run the same code in a standalone notebook on an interactive cluster, it works fine.
Can someone help me understand this issue?
The DLT pipeline shouldn't get stuck in the initializing state.
The problem is that you've structured your DLT program incorrectly. Programs written for DLT should be declarative by design, but in your case you're performing your actions at the top level, not inside the functions marked with @dlt.table. When a DLT pipeline starts, it builds the execution graph by evaluating all of the code and identifying the vertices of the graph that are marked with @dlt annotations (you can see that your function is called several times, as explained here). And because your code has the side effects of reading all the data with spark.read.jdbc, interacting with Cosmos DB, etc., the initialization step is really slow.
To illustrate the problem, let's look at your code structure. Right now you have the following:
def read(...):
  1. Perform read via `spark.read.jdbc` into `df`
  2. Perform operations with Cosmos DB
  3. Return annotated function that will just return the captured `df`
As a result, items 1 and 2 are performed during the initialization stage, not when the actual pipeline is executed.
To mitigate this problem you need to change the structure to the following (see the sketch after the outline):
def read(...):
  1. Return annotated function that will:
     1. Perform read via `spark.read.jdbc` into `df`
     2. Perform operations with Cosmos DB
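A minimal sketch of that corrected structure is below. The table, URL, and Cosmos helper names are hypothetical and the Cosmos update is stubbed as a passed-in function, but it shows the key point: both side effects live inside the @dlt.table-decorated function, so they run only when the pipeline materializes the table, not while DLT evaluates the file to build the execution graph.

```python
import dlt  # available in DLT pipelines; `spark` is provided by the DLT runtime


def read(table_name, jdbc_url, update_cosmos_metadata):
    @dlt.table(name=table_name)
    def _table():
        # Deferred: the JDBC read happens only when the pipeline runs this table,
        # not during graph initialization.
        df = spark.read.jdbc(url=jdbc_url, table=table_name)
        # Deferred as well: the Cosmos DB metadata update (hypothetical helper).
        update_cosmos_metadata(table_name)
        return df

    return _table
```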
Related
Short question: what are the best practices for using Spark in large ETL processes in terms of reliability and fault tolerance?
My team and I are working on a PySpark pipeline that processes many (~50) tables, resulting in wide tables (~5000 columns). The pipeline is so complex that the usual way of using Spark (a series of joins and transformations) cannot be applied here: Spark takes a long time just to construct the execution plan and often fails during execution.
Instead, we use intermediate steps in the form of temporary tables. Every few joins we save the data to a table and use it afterwards. This really helps with reliability but reduces the speed of the process: subsequent steps are not executed until the previous steps have completed. Additionally, intermediate tables help us debug the pipeline and compare different versions with each other.
Our solution to the speed problem is to parallelise the execution of steps manually: we separate the steps that can run independently and put them into different files. These files are then launched in Airflow as different operators.
The approach described above feels like a big crutch, because it seems we are doing Spark's job manually. Are there any other ways to tackle these problems?
We considered using Spark's .checkpoint() method, but it has drawbacks:
The storage the method uses is not a regular table, and it is not possible (or not convenient) to use it for debugging or comparison purposes
If the pipeline fails, you have to restart the whole process from the start. With our approach, one can restart only the failed operator in Airflow and use the results of previous operators to continue the job
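For concreteness, a minimal sketch of the intermediate-table pattern described above, with hypothetical database and table names; each materialized table becomes a natural restart point for a separate Airflow operator.

```python
# Sketch only: "raw.*" and "staging.*" tables are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Step 1: a few joins, then materialize the result so later steps (and reruns)
# can start from here instead of re-executing the joins above.
step1 = spark.table("raw.orders").join(spark.table("raw.customers"), "customer_id")
step1.write.mode("overwrite").saveAsTable("staging.orders_enriched")

# Step 2: possibly a separate file / Airflow operator, reads the intermediate
# table back instead of depending on step 1's in-memory plan.
step2 = spark.table("staging.orders_enriched").join(spark.table("raw.payments"), "order_id")
step2.write.mode("overwrite").saveAsTable("staging.orders_with_payments")
```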
I am currently developing an Azure ML pipeline that, as one of its outputs, maintains a SQL table holding all of the unique items fed into it. There is no way to know in advance whether the data fed into the pipeline contains new unique items or repeats of previous items, so before updating the table it maintains, the pipeline pulls the data already in that table and drops any of the new items that already appear.
However, because of this self-reference there are cases where zero new items are found, and as such there is nothing to export to the SQL table. When this happens Azure ML throws an error, since it is considered an error for there to be zero lines of data to export. In my case, however, this is expected behaviour, and as such absolutely fine.
Is there any way for me to suppress this error, so that when it has zero lines of data to export it just skips the export module and moves on?
It sounds as if you are struggling to orchestrate a data pipeline because the orchestration is happening in two places. My advice would be to either move more of the orchestration into Azure ML, or make the separation between the two greater. One way to do this would be to have a regular export to blob storage of the table you want to use for training. Then you can use a Logic App to trigger the pipeline whenever a non-empty blob lands in that location.
This issue has been resolved by an update to Azure Machine Learning: you can now run pipelines with a "Continue on Failure Step" flag set, which means that steps following the failed data export will continue to run.
This does mean you will need to design your pipeline so its downstream modules can handle upstream failures; this must be done very carefully.
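As a hedged sketch only: with the Azure ML Python SDK v1, the equivalent behaviour is (to my understanding) controlled by the continue_on_step_failure argument when submitting the pipeline. The workspace, experiment name, scripts, and compute target below are placeholders, and the exact flag wording in the Studio UI may differ.

```python
from azureml.core import Workspace, Experiment
from azureml.pipeline.core import Pipeline
from azureml.pipeline.steps import PythonScriptStep

ws = Workspace.from_config()

export_step = PythonScriptStep(
    name="export-to-sql",
    script_name="export_to_sql.py",   # placeholder script
    source_directory="./steps",
    compute_target="cpu-cluster",     # placeholder compute target
)
downstream_step = PythonScriptStep(
    name="post-export",
    script_name="post_export.py",     # placeholder script
    source_directory="./steps",
    compute_target="cpu-cluster",
)
# In a real pipeline the steps would be linked via data dependencies.

pipeline = Pipeline(workspace=ws, steps=[export_step, downstream_step])
run = Experiment(ws, "unique-items-pipeline").submit(
    pipeline,
    continue_on_step_failure=True,  # later steps keep running if the export fails
)
run.wait_for_completion(show_output=True)
```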
I am writing huge datasets to a database (DB2) using the dataset.write.jdbc method. I see that if one of the records has an issue getting inserted into the DB, the entire dataset fails. This is turning out to be expensive, since the dataset is prepared by running a huge pipeline. Re-running the entire pipeline just because persistence failed does not make sense.
The problem seems to be much bigger than exception handling. Ideally a data pipeline should be designed in such a way that it validates the data before it is processed/transformed. In a general context this is called data validation and cleansing. In this phase you may want to identify NULL/empty values and treat them accordingly, especially in the attributes that participate in joins or lookups. You need to transform them so that they don't create issues in subsequent steps of your pipeline. In fact, this is pretty much applicable to every transformation. Hope this helps.
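A small sketch of the validation/cleansing step the answer recommends, applied just before the JDBC write; the source table, column names, constraints, and DB2 connection details are all hypothetical.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.table("staging.orders")  # placeholder for the pipeline's output

required = ["ORDER_ID", "CUSTOMER_ID"]

# Keep only rows that satisfy the basic constraints the target table enforces.
valid = df.na.drop(subset=required).filter(F.length("ORDER_ID") <= 36)

# Persist the rejected rows for inspection instead of letting one bad record
# fail the whole dataset.write.jdbc call.
df.subtract(valid).write.mode("append").parquet("/data/rejects/orders")

valid.write.jdbc(
    url="jdbc:db2://db2-host:50000/SAMPLE",
    table="ORDERS",
    mode="append",
    properties={
        "user": "db2user",
        "password": "***",
        "driver": "com.ibm.db2.jcc.DB2Driver",
    },
)
```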
Spark newbie alert. I've been exploring ideas to design a requirement that involves the following:
Build a base predictive model for linear regression (one-off activity)
Pass the data points to get the value of the response variable
Do something with the result
Update the models at regular intervals
This has to be done in a sync (request/response) mode so that the caller code invokes the prediction code, gets the result, and carries on with its downstream work. The caller code is outside Spark (it's a web app).
I'm struggling to understand whether Spark/Spark Streaming is a good fit for doing the linear regression, purely because of its asynchronous nature.
From what I understand, it simply works on a job paradigm where you tell it a source (DB, queue, etc.) and it does the computation and pushes the result to a destination (DB, queue, file, etc.). I can't see an HTTP/REST interface that could be used to get the results.
Is Spark the right choice for me? Or are there any better ideas to approach this problem?
Thanks.
If I got it right, then in general you have to solve three basic problems:
Build the model
Use that model to perform predictions per synchronous HTTP request (for scoring or something like this) from the outside (in your case this will be the webapp)
Update the model within some interval to make it more precise
In general, Spark is a platform for performing distributed batch computations on datasets in a pipeline manner, so you were right about the job paradigm. A job is actually a pipeline that will be executed by Spark and that has start and end operations. You get benefits such as distributed workload across your cluster, effective resource utilization, and good performance (compared with other such platforms and frameworks) thanks to data partitioning, which allows narrow operations to be executed in parallel.
So for me the right solution would be to use Spark to build and update your model and then export it to some other system that will serve your requests and use this model for predictions.
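As a rough illustration of that split, here is a sketch of the Spark side only: train a Spark ML LinearRegression and persist it so that a separate serving layer (see the links below) can load the coefficients and answer synchronous requests. The paths and column names are made up.

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression

spark = SparkSession.builder.appName("train-lr").getOrCreate()

# Hypothetical training data with feature columns x1..x3 and a "label" column.
train_df = spark.read.parquet("/data/training")

assembler = VectorAssembler(inputCols=["x1", "x2", "x3"], outputCol="features")
lr = LinearRegression(featuresCol="features", labelCol="label")

model = lr.fit(assembler.transform(train_df))

# Persist the model; a serving layer (REST app, PMML scorer, etc.) can load
# the saved coefficients and intercept to answer requests synchronously.
model.write().overwrite().save("/models/linear_regression_v1")
print(model.coefficients, model.intercept)
```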
Do something with the result
In step 3 you can use Kafka and Spark Streaming to pass the corrected results back and improve your model's precision.
Some useful links that can probably help:
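One way to wire that up (using the Structured Streaming Kafka source rather than the older DStream API the answer may have had in mind) is sketched below; the broker, topic, and storage paths are placeholders, and the spark-sql-kafka connector package is assumed to be on the classpath.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("feedback-stream").getOrCreate()

# Consume corrected results / feedback published by the webapp.
feedback = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "prediction-feedback")
    .load()
    .select(F.col("value").cast("string").alias("json"))
)

# Append the feedback to storage; a scheduled job can periodically refit the
# model on it and publish a new model version for the serving layer.
query = (
    feedback.writeStream.format("parquet")
    .option("path", "/data/feedback")
    .option("checkpointLocation", "/chk/feedback")
    .start()
)
```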
https://www.phdata.io/exploring-spark-mllib-part-4-exporting-the-model-for-use-outside-of-spark/
https://datascience.stackexchange.com/questions/13028/loading-and-querying-a-spark-machine-learning-model-outside-of-spark
http://openscoring.io/blog/2016/07/04/sparkml_realtime_prediction_rest_approach/
I am working on a cost function for Spark SQL.
While modelling the TABLE SCAN behaviour, I cannot understand whether READ and WRITE are carried out in a pipeline or in sequence.
Let us consider the following SQL query:
SELECT * FROM table1 WHERE columnA = 'xyz';
Each task:
Reads a data block (either locally or from a remote node)
Filters out the tuples that do not satisfy the predicate
Writes the remaining tuples to disk
Are (1), (2) and (3) carried out in sequence or in a pipeline? In other words, is the data block completely read first (all the disk pages composing it), then filtered, and then rewritten to disk, or are these activities carried out in a pipeline (i.e. while the (n+1)-th tuple is being read, the n-th tuple can be processed and written)?
Thanks in advance.
Whenever you submit a job, the first thing Spark does is create a DAG (directed acyclic graph) for your job.
After creating the DAG, Spark knows which tasks it can run in parallel, which tasks depend on the output of a previous step, and so on.
So, in your case,
Spark will read your data in parallel (which you can see in the partitions) and filter it (in each partition).
Now, since saving requires filtering, it will wait for the filtering to finish for at least one partition, then start to save it.
After some more digging I found out that Spark SQL uses a so-called "volcano style pull model".
According to this model, a simple scan-filter-write query would be executed in a pipeline and fully distributed.
In other words, while a partition (HDFS block) is being read, the filter can already be applied to the rows read so far. There is no need to read the whole block before the filtering kicks off. Writing is performed accordingly.
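If you want to see this for yourself, here is a small sketch (with a hypothetical table1 and output path) that prints the physical plan: the scan and filter typically show up fused into a single WholeStageCodegen stage, i.e. they are pipelined within each task, and the write runs in those same tasks rather than as a separate pass over materialized data.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# The example query from the question, against a placeholder table.
df = spark.table("table1").where("columnA = 'xyz'")

# The plan usually shows Scan + Filter inside one WholeStageCodegen block,
# confirming that rows are filtered as they are read, per partition.
df.explain()

df.write.mode("overwrite").parquet("/tmp/filtered_output")
```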