I have a practical use case. three notebooks (pyspark) all have one common parameter.
need to schedule all three notebooks in a sequence.
is there any way to run them by setting one parameter value, as they are same in all?
please suggest the best way to do it.
Related
Short question: what are the best practices for using spark in large ETL processes in terms of reliability and fault tolerance?
My team and I are working on the pyspark pipeline processing many (~50) tables resulting in wide tables (~5000 columns). The pipeline is so complex that usual way of using spark (series of joins and transformation) cannot be applied here: spark takes a lot of time just to construct the execution plan and fails often during the execution.
Instead, we use intermediate steps which are temporary tables. Every few joins we save the data to some table and use it afterwards. It does really help with reliability but reduces the speed of process: subsequent steps are not executed until the previous steps have been completed. Additionally, intermediate tables help to debug the pipeline and compare different versions between each other.
Our solution to the speed problem is to parallelise the execution of steps manually: we separate ones which can be run independently and put them into different files. These files then are launched in airflow as different operators.
The approach we use which is described above sounds like a big crutch because it feels like we are doing the spark’s job manually. Are there any other possibilities to tackle these problems?
We considered using spark’s .checkpoint() method but it has drawbacks:
The storage the method uses is not a usual table and it is not possible (or not convenient) to use for debug or compare purposes
If the pipeline fails than you have to restart the whole process from the start. Using our approach one can restart only failed operator in airflow and use results of previous operators to continue the job
I have a databricks job that run many commands and at the end it tries to save the results to a folder. However, it is failed because it tried to write a file to folder but folder was not exists.
I simply created the folder.
However, how can I make it continue where it left without executing all the previous commands.
I assume that by Databricks job you refer to the way to run non-interactive code in a Databricks cluster.
I do not think that what you ask is possible, namely getting the output of a certain Spark task from a previous job run on Databricks. As pointed out in the other answer, "if job is finished, then all processed data is gone". This has to do with the way Spark works under the hood. If you are curious about this topic, I suggest you start reading this post about Transformations and Actions in Spark.
Although you can think of a few workarounds, for instance if you are curious about certain intermediate outputs of your job, you could decide to temporary write your DataFrame/Dataset to some external location. In this way you can easily resume a the job from your preferred point by reading one of your checkpoints as input. This approach is a bit clanky and I do not recommend it, but it's a quick and dirty solution you might want to choose if you are in the testing/designing phase.
A more robust solution would involve splitting your job in multiple sub-jobs and setting upstream & downstream dependencies among them. You can do that using Databricks natively (Task dependencies section) or an external scheduler that integrates with Databricks, like Airflow.
In this way, you can split your tasks and you will be able to have an higher control granularity on your Application. So, in case of again failures on the writing step, you will be able to re run only the writing easily.
If job is finished, then all processed data is gone, until you write some intermediate states (additional tables, etc.) from which you can continue processing. In most cases, Spark actually execute the code only when it's writing results of execution of provided transformations.
So right now you just need to rerun the job.
I am trying to understand how the apache beam works and im not quite sure if i do. So, i want someone to tell me if my understanding is right:
Beam is a layer of abstraction over big data frameworks like spark,hadoop,google data flow etc. Now quite every functionality but almost that is the case
Beam treats data in two forms bounded and unbounded. Bounded like a .csv and unbounded like a kafka subscription. There are different i/o read methods for each. For unbounded data we need to implement windowing (attaching a timestamp to each data point) and trigger (a timestamp). A batch here would be all the datapoints in a window till a trigger is hit. For the bounded datasets however, all the dataset is loaded in RAM (? if yes, how do i make beam work on batches?). The output of a i/o method is a pCollection
There are pTransformations (these are the operations i want run on the data) that apply to each element of the of the pCollection. I can make these pTransformations apply over a spark or flint cluster (this choice goes in the initial options set for the pipeline). each pTransformation emits a pCollection and that is how we chain various pTransformations together. End is a pCollection that can be saved to disk
End of the pipeline could be a save on some file system (How does this happen when i am reading a .csv in batches?)
Please point out to me any lapses in my understanding
Beam is not like google cloud dataflow, Cloud Dataflow is a runner on top of Apache Beam. It executes Apache Beam pipelines. But you can run an Apache Beam job with a local runner not on the cloud. There are plenty of different runners that you can find in the documentation : https://beam.apache.org/documentation/#available-runners
One specific aspect of Beam is that it's the same pipeline for Batch and Stream and that's the purpose. You can specify --streaming as an argument to execute your pipeline, withou it it should execute it in batch. But it mostly depends on you inputs, the data will just flow into the pipeline. And that's one important point, PCollections do not contain persistent data just like RDD's for Spark RDD.
You can apply a PTransform on part of your data, it's not necessarly on all the data. All the PTranforms together forms the pipeline.
It really depends where and what format you want for your output...
I need your suggestion for a system that we are building on spark. Our use case is as follows.
We have four independent business rules that we need to calculate.
Every business rule consists of operation like reading from several table and joining and aggregating them and producing one output table.
Intermediate output results of each business rule we need to combine with each other and produce final output. For example we need to combine say rule1 and rule2 result and then rule3 and rule4 followed by combining their result as final output. Let's call these steps as combining steps.
So this seems to be a DAG (Direct acyclic graph) that we need to process where rules become initial nodes to process followed by combining steps that has to be executed in some order.
Now the question is, should I consider a work-flow management system like Oozie as technology choice to process this DAG? For e.g. in Oozie all individual rules will be become independent "actions" and combining steps will be dependent "actions" that should run post business rule actions. If I follow this, my system will be decoupled in several independent sub-component; all will be configured in Oozie xml file.
This sounds fascinating but I am not very convinced for above approach. "My" reasons are -
1. Each layer has to create independent SparkContext, single spark context can not be shared across all steps.
2. Spark in itself is nothing but process a DAG. Using Oozie on top of this would be an over kill.
3. Oozie should be used for use cases where we need to build a workflow for different components. For e.g. I want to call spark application followed by some shell script followed by Pig script and so on.
If I don't use a work-flow management system like Oozie, I need to write small framework in my application that combines these rules in that order. Of-course in future course of action if I need to run rules in parallel, I need to modify my this tiny framework.
So my questions are -
1. Do my reasons of not using Oozie correct?
2. If not Oozie, there is another framework that fits in my use case?
3. Does Writing my small framework to handle this workflow is actually reinventing the wheel?
I need to run identical jobs in schedule, and they differ only in few strings.
As you may know, there is no a convenient way to create identical jobs with different parameters. For now i prefer so "codeless" way to do so, or with "as less code as possilbe".
So lets imagine they are stored in a rows of JobsConfigurations table of the website-related database.
How I can get the Job name of job being running to pick the right configuration from the table?
Thanks for help!
See https://github.com/projectkudu/kudu/wiki/Web-Jobs#environment-settings
The WEBJOBS_NAME environment variable will give you the name of the current WebJob.