Short question: what are the best practices for using spark in large ETL processes in terms of reliability and fault tolerance?
My team and I are working on the pyspark pipeline processing many (~50) tables resulting in wide tables (~5000 columns). The pipeline is so complex that usual way of using spark (series of joins and transformation) cannot be applied here: spark takes a lot of time just to construct the execution plan and fails often during the execution.
Instead, we use intermediate steps which are temporary tables. Every few joins we save the data to some table and use it afterwards. It does really help with reliability but reduces the speed of process: subsequent steps are not executed until the previous steps have been completed. Additionally, intermediate tables help to debug the pipeline and compare different versions between each other.
Our solution to the speed problem is to parallelise the execution of steps manually: we separate ones which can be run independently and put them into different files. These files then are launched in airflow as different operators.
The approach we use which is described above sounds like a big crutch because it feels like we are doing the spark’s job manually. Are there any other possibilities to tackle these problems?
We considered using spark’s .checkpoint() method but it has drawbacks:
The storage the method uses is not a usual table and it is not possible (or not convenient) to use for debug or compare purposes
If the pipeline fails than you have to restart the whole process from the start. Using our approach one can restart only failed operator in airflow and use results of previous operators to continue the job
Related
I have a databricks job that run many commands and at the end it tries to save the results to a folder. However, it is failed because it tried to write a file to folder but folder was not exists.
I simply created the folder.
However, how can I make it continue where it left without executing all the previous commands.
I assume that by Databricks job you refer to the way to run non-interactive code in a Databricks cluster.
I do not think that what you ask is possible, namely getting the output of a certain Spark task from a previous job run on Databricks. As pointed out in the other answer, "if job is finished, then all processed data is gone". This has to do with the way Spark works under the hood. If you are curious about this topic, I suggest you start reading this post about Transformations and Actions in Spark.
Although you can think of a few workarounds, for instance if you are curious about certain intermediate outputs of your job, you could decide to temporary write your DataFrame/Dataset to some external location. In this way you can easily resume a the job from your preferred point by reading one of your checkpoints as input. This approach is a bit clanky and I do not recommend it, but it's a quick and dirty solution you might want to choose if you are in the testing/designing phase.
A more robust solution would involve splitting your job in multiple sub-jobs and setting upstream & downstream dependencies among them. You can do that using Databricks natively (Task dependencies section) or an external scheduler that integrates with Databricks, like Airflow.
In this way, you can split your tasks and you will be able to have an higher control granularity on your Application. So, in case of again failures on the writing step, you will be able to re run only the writing easily.
If job is finished, then all processed data is gone, until you write some intermediate states (additional tables, etc.) from which you can continue processing. In most cases, Spark actually execute the code only when it's writing results of execution of provided transformations.
So right now you just need to rerun the job.
I am trying to understand how the apache beam works and im not quite sure if i do. So, i want someone to tell me if my understanding is right:
Beam is a layer of abstraction over big data frameworks like spark,hadoop,google data flow etc. Now quite every functionality but almost that is the case
Beam treats data in two forms bounded and unbounded. Bounded like a .csv and unbounded like a kafka subscription. There are different i/o read methods for each. For unbounded data we need to implement windowing (attaching a timestamp to each data point) and trigger (a timestamp). A batch here would be all the datapoints in a window till a trigger is hit. For the bounded datasets however, all the dataset is loaded in RAM (? if yes, how do i make beam work on batches?). The output of a i/o method is a pCollection
There are pTransformations (these are the operations i want run on the data) that apply to each element of the of the pCollection. I can make these pTransformations apply over a spark or flint cluster (this choice goes in the initial options set for the pipeline). each pTransformation emits a pCollection and that is how we chain various pTransformations together. End is a pCollection that can be saved to disk
End of the pipeline could be a save on some file system (How does this happen when i am reading a .csv in batches?)
Please point out to me any lapses in my understanding
Beam is not like google cloud dataflow, Cloud Dataflow is a runner on top of Apache Beam. It executes Apache Beam pipelines. But you can run an Apache Beam job with a local runner not on the cloud. There are plenty of different runners that you can find in the documentation : https://beam.apache.org/documentation/#available-runners
One specific aspect of Beam is that it's the same pipeline for Batch and Stream and that's the purpose. You can specify --streaming as an argument to execute your pipeline, withou it it should execute it in batch. But it mostly depends on you inputs, the data will just flow into the pipeline. And that's one important point, PCollections do not contain persistent data just like RDD's for Spark RDD.
You can apply a PTransform on part of your data, it's not necessarly on all the data. All the PTranforms together forms the pipeline.
It really depends where and what format you want for your output...
Lets say my job performs several spark actions, where the first few are not using multiple cores for a single task so I would like each instance to perform (executor.cores) tasks in parallel (spark.task.cpus=1).
Then suppose I have another action which can be parallelized - I'm desiring a feature where I could increase spark.task.cpus (say to use more cores on the executor), and perform fewer tasks simultaneously on each instance.
My workaround right now is to save data, start a new sparkContext with new settings, and reload the data.
The use case: my later actions may be unavoidable skewed and I may want to apply more than one core per task to avoid bottlenecking on such large tasks, but I don't want this to impact the earlier actions which can benefit from using 1 core per task.
From looking around my guess is that I can't do this currently, so I'm mainly wondering if there is a a significant limitation for not allowing this. Alternatively, suggestions for how I could trick spark into achieving something similar.
Note: Currently using 1.6.2 but willing to hear other options for Spark2+
We have two clusters X and Y with same keyspaces but distinct data sets. We are planning to merge these into single cluster. What would be ideal steps to achieve this without downtime for applications? We have time series data stream continuously writing to Cassandra.
We have ruled out export/import as that will make us lose data during the time of copy.
We also ruled out sstableloader as that is not reliable. It fails often and there is not way to start from where it failed. Also it has same issue mentioned above.
Do double writes (to both clusters) then any of the methods above will work.
1) A spark job is probably best if you have a ton of data.
2) import export is not as good as Brian's cassandra-loader so maybe give that a try https://github.com/brianmhess/cassandra-loader
We also ruled out sstableloader as that is not reliable. It fails
often and there is not way to start from where it failed. Also it has
same issue mentioned above.
I'm assuming your writes are idempotent so it's not a huge deal if you need to run the job over.
I've come across a situation where I'd like to do a "lookup" within a Spark and/or Spark Streaming pipeline (in Java). The lookup is somewhat complex, but fortunately, I have some existing Spark pipelines (potentially DataFrames) that I could reuse.
For every incoming record, I'd like to potentially launch a spark job from the task to get the necessary information to decorate it with.
Considering the performance implications, would this ever be a good idea?
Not considering the performance implications, is this even possible?
Is it possible to get and use a JavaSparkContext from within a task?
No. The spark context is only valid on the driver and Spark will prevent serialization of it. Therefore it's not possible to use the Spark context from within a task.
For every incoming record, I'd like to potentially launch a spark job
from the task to get the necessary information to decorate it with.
Considering the performance implications, would this ever be a good
idea?
Without more details, my umbrella answer would be: Probably not a good idea.
Not considering the performance implications, is this even possible?
Yes, probably by bringing the base collection to the driver (collect) and iterating over it. If that collection doesn't fit in memory of the driver, please previous point.
If we need to process every record, consider performing some form of join with the 'decorating' dataset - that will be only 1 large job instead of tons of small ones.