I am writing unit tests for a function that performs read, write and move operations in PySpark. This is the first time I am doing this, so I need some input, or any good resources, to understand the concepts of unit testing.
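For reference, here is roughly the shape of test I have in mind (a minimal pytest sketch; the `move_data` function here is a made-up stand-in for the function under test):

```python
import pytest
from pyspark.sql import SparkSession


def move_data(spark, src_path, dst_path):
    # Stand-in for the function under test: read from src, write to dst.
    spark.read.parquet(src_path).write.mode("overwrite").parquet(dst_path)


@pytest.fixture(scope="session")
def spark():
    # Local SparkSession shared by all tests in the session.
    return (SparkSession.builder
            .master("local[2]")
            .appName("unit-tests")
            .getOrCreate())


def test_move_data(spark, tmp_path):
    src = str(tmp_path / "src")
    dst = str(tmp_path / "dst")

    # Arrange: write a small known dataset to the source location.
    input_df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])
    input_df.write.parquet(src)

    # Act: call the function under test.
    move_data(spark, src, dst)

    # Assert: the destination contains exactly the rows we wrote.
    result = spark.read.parquet(dst)
    assert sorted(result.collect()) == sorted(input_df.collect())
```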
I have a DLT pipeline that creates a Delta table by reading from SQL Server, and then we call a few APIs to update metadata in our Cosmos DB. Whenever we start it, it gets stuck in the initialising state.
But when we run the same code in a standalone notebook on an interactive cluster, it works fine.
Can someone help me understand this issue?
The DLT pipeline shouldn't get stuck in the initialising state.
The problem is that you've structured your DLT program incorrectly. Programs written for DLT should be declarative by design, but in your case you're performing your actions at the top level, not inside the functions marked with @dlt.table. When a DLT pipeline starts, it builds the execution graph by evaluating all the code and identifying the vertices of the execution graph that are marked with @dlt annotations (you can see that your function is called several times, as explained here). And because your code has the side effects of reading all the data with `spark.read.jdbc`, interacting with Cosmos DB, etc., the initialization step is really slow.
To illustrate the problem, let's look at your code structure. Right now you have the following:
def read(...):
1. Perform read via `spark.read.jdbc` into `df`
2. Perform operations with Cosmos DB
3. Return annotated function that will just return captured `df`
As a result, items 1 & 2 are performed during the initialization stage, not when the actual pipeline is executed.
To mitigate this problem you need to change the structure to the following:
def read(...):
1. Return annotated function that will:
1. Perform read via `spark.read.jdbc` into `df`
2. Perform operations with Cosmos DB
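A minimal sketch of that corrected shape in Python might look like this (the connection values and the Cosmos helper are made up for illustration; `spark` is provided by the DLT runtime):

```python
import dlt


def update_cosmos_metadata(table_name):
    # Stand-in for the real Cosmos DB metadata call.
    pass


def read(table_name, jdbc_url):
    @dlt.table(name=table_name)
    def load():
        # Side effects run here, when the pipeline executes this table,
        # not while DLT evaluates the code to build the execution graph.
        df = (spark.read.format("jdbc")        # `spark` comes from the DLT runtime
              .option("url", jdbc_url)         # made-up connection values
              .option("dbtable", table_name)
              .load())
        update_cosmos_metadata(table_name)
        return df

    return load


read("my_table", "jdbc:sqlserver://host;databaseName=db")  # made-up values
```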
Short question: what are the best practices for using Spark in large ETL processes, in terms of reliability and fault tolerance?
My team and I are working on a PySpark pipeline that processes many (~50) tables, resulting in wide tables (~5000 columns). The pipeline is so complex that the usual way of using Spark (a series of joins and transformations) cannot be applied here: Spark takes a lot of time just to construct the execution plan, and it often fails during execution.
Instead, we use intermediate steps, which are temporary tables. Every few joins we save the data to a table and use it afterwards. This really helps with reliability but reduces the speed of the process: subsequent steps are not executed until the previous steps have been completed. Additionally, the intermediate tables help us debug the pipeline and compare different versions with each other.
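Roughly, each intermediate step looks something like this (a simplified sketch; table names and join keys are made up):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("etl-step").getOrCreate()

orders = spark.table("raw.orders")
customers = spark.table("raw.customers")

# Step 1: a few joins, then materialise the result as a real table.
step1 = orders.join(customers, "customer_id", "left")
step1.write.mode("overwrite").saveAsTable("intermediate.orders_enriched")

# Step 2 (possibly a separate file / Airflow operator) reads the
# materialised table instead of re-deriving it, which truncates the plan.
step1_loaded = spark.table("intermediate.orders_enriched")
payments = spark.table("raw.payments")
step2 = step1_loaded.join(payments, "order_id", "left")
step2.write.mode("overwrite").saveAsTable("intermediate.orders_with_payments")
```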
Our solution to the speed problem is to parallelise the execution of steps manually: we separate the ones that can be run independently and put them into different files. These files are then launched in Airflow as different operators.
The approach described above feels like a big crutch, because it seems like we are doing Spark's job manually. Are there any other possibilities to tackle these problems?
We considered using Spark's .checkpoint() method, but it has drawbacks:
The storage the method uses is not a usual table, and it is not possible (or not convenient) to use it for debugging or comparison purposes
If the pipeline fails, you have to restart the whole process from the start. With our approach, one can restart only the failed operator in Airflow and use the results of the previous operators to continue the job
I am trying to understand how Apache Beam works, and I'm not quite sure that I do. So I'd like someone to tell me whether my understanding is right:
Beam is a layer of abstraction over big data frameworks like Spark, Hadoop, Google Dataflow, etc. It doesn't cover quite every functionality, but that is broadly the case.
Beam treats data in two forms: bounded and unbounded. Bounded is something like a .csv file, and unbounded is something like a Kafka subscription. There are different I/O read methods for each. For unbounded data we need to implement windowing (attaching a timestamp to each data point) and a trigger (a timestamp). A batch here would be all the data points in a window until a trigger is hit. For bounded datasets, however, the whole dataset is loaded into RAM (? if yes, how do I make Beam work on batches?). The output of an I/O method is a PCollection.
There are PTransforms (these are the operations I want to run on the data) that apply to each element of the PCollection. I can make these PTransforms run on a Spark or Flink cluster (this choice goes in the initial options set for the pipeline). Each PTransform emits a PCollection, and that is how we chain various PTransforms together. The end result is a PCollection that can be saved to disk.
The end of the pipeline could be a save to some file system (how does this happen when I am reading a .csv in batches?).
Please point out any lapses in my understanding.
Beam is not like Google Cloud Dataflow; Cloud Dataflow is a runner on top of Apache Beam that executes Apache Beam pipelines. But you can run an Apache Beam job with a local runner, not on the cloud. There are plenty of different runners that you can find in the documentation: https://beam.apache.org/documentation/#available-runners
One specific aspect of Beam is that it's the same pipeline for batch and streaming, and that's the purpose. You can specify --streaming as an argument to execute your pipeline in streaming mode; without it, it should execute in batch. But it mostly depends on your inputs; the data will just flow into the pipeline. And that's one important point: PCollections do not contain persistent data, just like RDDs in Spark.
You can apply a PTransform to part of your data; it's not necessarily applied to all the data. All the PTransforms together form the pipeline.
It really depends on where and in what format you want your output...
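To make this concrete, a minimal Beam pipeline in Python might look like the sketch below (file names and the transforms are made up; the runner is chosen via the pipeline options). Note that a bounded source like ReadFromText is split and processed in bundles by the runner rather than loaded into memory all at once.

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Runner is picked via options: DirectRunner locally, or DataflowRunner,
# SparkRunner, FlinkRunner, etc. on a cluster.
options = PipelineOptions(runner="DirectRunner")

with beam.Pipeline(options=options) as p:
    (p
     | "Read"  >> beam.io.ReadFromText("input.csv")        # bounded source
     | "Parse" >> beam.Map(lambda line: line.split(","))   # element-wise PTransform
     | "Keep"  >> beam.Filter(lambda cols: len(cols) > 1)
     | "Fmt"   >> beam.Map(lambda cols: ",".join(cols))
     | "Write" >> beam.io.WriteToText("output"))           # sharded output files
```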
Spark newbie alert. I've been exploring ideas to design a requirement which involves the following:
Build a base predictive model for linear regression (a one-off activity).
Pass the data points to get the value of the response variable.
Do something with the result.
At regular intervals, update the model.
This has to be done in a sync (request/response) mode, so that the caller code invokes the prediction code, gets the result and carries on with the downstream processing. The caller code is outside Spark (it's a web app).
I'm struggling to understand whether Spark/Spark Streaming is a good fit for doing the linear regression, purely because of its async nature.
From what I understand, it simply works off a job paradigm where you tell it a source (DB, queue, etc.) and it does the computation and pushes the result to a destination (DB, queue, file, etc.). I can't see an HTTP/REST interface which could be used to get the results.
Is Spark the right choice for me? Or are there any better ideas to approach this problem?
Thanks.
If I got it right, then in general you have to solve three basic problems:
Build the model
Use that model to perform predictions per synchronous HTTP request (for scoring or something like that) from the outside (in your case this will be the web app)
Update the model within some interval to make it more precise
In general, Spark is a platform for performing distributed batch computations on datasets in a pipeline manner. So you were right about the job paradigm. A job is actually a pipeline which will be executed by Spark and which has start and end operations. You get benefits such as distributing the workload across your cluster, effective resource utilization, and good performance (compared with other such platforms and frameworks) due to data partitioning, which allows narrow operations to be executed in parallel.
So for me the right solution would be to use Spark to build and update your model, and then export it to some other solution which will serve your requests and use this model for predictions.
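A minimal sketch of the build-and-export step could look like this (table, column and path names are made up; exporting to PMML or another serving format is covered in the links below):

```python
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("train-lr").getOrCreate()

# Hypothetical training table with feature columns f1, f2 and a label column.
train_df = spark.table("training_data")
assembler = VectorAssembler(inputCols=["f1", "f2"], outputCol="features")
lr = LinearRegression(featuresCol="features", labelCol="label")

model = lr.fit(assembler.transform(train_df))

# Persist the fitted model so a separate serving component (outside Spark)
# can load or convert it for real-time predictions.
model.write().overwrite().save("/models/linear_regression/v1")
```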
Do something with the result
In step 3 you can use Kafka and Spark Streaming to pass the corrected results back and update your model's precision.
Some useful links which can probably help:
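For example, the feedback loop could be sketched like this with Structured Streaming (broker address, topic and paths are made up; this assumes the Spark Kafka connector package is available):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("model-feedback").getOrCreate()

# Read the stream of corrected results from a (hypothetical) Kafka topic.
corrections = (spark.readStream
               .format("kafka")
               .option("kafka.bootstrap.servers", "broker:9092")
               .option("subscribe", "prediction-feedback")
               .load())

# Accumulate the feedback somewhere a periodic retraining job can pick it up.
query = (corrections.selectExpr("CAST(value AS STRING) AS feedback")
         .writeStream
         .format("parquet")
         .option("path", "/data/feedback")
         .option("checkpointLocation", "/chk/feedback")
         .start())
```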
https://www.phdata.io/exploring-spark-mllib-part-4-exporting-the-model-for-use-outside-of-spark/
https://datascience.stackexchange.com/questions/13028/loading-and-querying-a-spark-machine-learning-model-outside-of-spark
http://openscoring.io/blog/2016/07/04/sparkml_realtime_prediction_rest_approach/
I have just written a nontrivial reduce function in which the rereduce execution path differs from the standard reduce.
It is my understanding that rereduce is only performed in certain situations; specifically, to merge reduce operations performed across multiple nodes or large ranges of keys.
As my local development environment runs on a single node with a relatively small dataset, how can I effectively test the behaviour of my reduce function in a rereduce scenario?
You could write a script to generate larger amounts of realistic dummy data. The only way I've been able to test my map-reduces is with real or fake data, but lots of it.
It's a side benefit, and perversely fun, but you also get a good idea of how long indexing and view requests will take, and a peek at how your app will do at scale. Load testing never hurts.
I don't test my scripts in CouchDB. Instead, I use:
a JS IDE (WebStorm)
behaviour-driven testing (Jasmine)
a full suite of JSON test documents
a self-written runner script that mocks the calls to the map and reduce functions
an external version control system to manage changes to the queries.
This way I can regression-test and unit-test my map and reduce functions, while at the same time supporting a level of complexity that just cannot be supported in CouchDB.
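For illustration only, a runner along these lines can exercise the rereduce path by first reducing chunks of rows and then reducing the partial results with rereduce=True, which mimics how CouchDB combines results across B-tree nodes. The sketch below is in Python with a made-up summing reduce; the real view functions would be JavaScript.

```python
def my_reduce(keys, values, rereduce):
    # Stand-in reduce function. When rereduce is True, `values` are the
    # outputs of previous reduce calls and `keys` is None.
    return sum(values)


def simulate_view(rows, chunk_size=3):
    """Reduce rows in chunks, then rereduce the partial results."""
    partials = []
    for i in range(0, len(rows), chunk_size):
        chunk = rows[i:i + chunk_size]
        keys = [k for k, _ in chunk]
        values = [v for _, v in chunk]
        partials.append(my_reduce(keys, values, rereduce=False))
    return my_reduce(None, partials, rereduce=True)


rows = [(["a", "doc1"], 1), (["a", "doc2"], 2), (["b", "doc3"], 3),
        (["b", "doc4"], 4), (["c", "doc5"], 5)]

# The chunked-plus-rereduce result should match a single direct reduce.
assert simulate_view(rows) == my_reduce([k for k, _ in rows],
                                        [v for _, v in rows], rereduce=False)
```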