Spark | For a synchronous request/response use case - apache-spark

Spark Newbie alert. I've been exploring the ideas to design a requirement which involves the following:
Building a base predictive model for Linear Regression(One off activity)
Pass the data points to get the value for the response variable.
Do something with the result.
At regular intervals update the models.
This has to be done in a sync (req/resp) mode so that the caller code invokes the prediction code, gets the result and carries on with the downstream. The caller code is outside spark (it's a webapp).
I'm struggling to understand if Spark/Spark Streaming is a good fit for doing the Linear Regression purely because of it's async nature.
From what I understand, it simply works of a Job Paradigm where you tell it a source (DB, Queue etc) and it does the computation and pushes the result to a destination (DB, Queue, File etc). I can't see a HTTP/Rest interface which could be used to get the results.
Is Spark the right choice for me? Or are there any better ideas to approach this problem?
Thanks.

If i got it right then in general you have to solve three basic problems:
Build model
Use that model to perform predictions per sync http
request (for scoring or smth like this) from the outside (in your
case this will be webapp)
Update model to make it more precise
within some interval
In general Spark is a platform for performing distributed batch computations on some datasets in a pipeline manner. So with job paradigm you were right. Job is actually a pipeline which will be executed by Spark and which have start and end operations. You will get such benefits as distributed workload among your cluster and effective resource utilization and good (in compare with other such platforms and frameworks) performance due to data partiotioning which allows to perform parallel executions in narrow operation.
So for me the right solution will be to use Spark to build and update your model and then export it in some other solution which will serve your requests and use this model for predictions.
Do something with the result
In step 3 you can use kafka and spark streaming to pass the corrected result and update your model's precision.
Some useful links which can probable help:
https://www.phdata.io/exploring-spark-mllib-part-4-exporting-the-model-for-use-outside-of-spark/
https://datascience.stackexchange.com/questions/13028/loading-and-querying-a-spark-machine-learning-model-outside-of-spark
http://openscoring.io/blog/2016/07/04/sparkml_realtime_prediction_rest_approach/

Related

Spark ETL pipeline reliability

Short question: what are the best practices for using spark in large ETL processes in terms of reliability and fault tolerance?
My team and I are working on the pyspark pipeline processing many (~50) tables resulting in wide tables (~5000 columns). The pipeline is so complex that usual way of using spark (series of joins and transformation) cannot be applied here: spark takes a lot of time just to construct the execution plan and fails often during the execution.
Instead, we use intermediate steps which are temporary tables. Every few joins we save the data to some table and use it afterwards. It does really help with reliability but reduces the speed of process: subsequent steps are not executed until the previous steps have been completed. Additionally, intermediate tables help to debug the pipeline and compare different versions between each other.
Our solution to the speed problem is to parallelise the execution of steps manually: we separate ones which can be run independently and put them into different files. These files then are launched in airflow as different operators.
The approach we use which is described above sounds like a big crutch because it feels like we are doing the spark’s job manually. Are there any other possibilities to tackle these problems?
We considered using spark’s .checkpoint() method but it has drawbacks:
The storage the method uses is not a usual table and it is not possible (or not convenient) to use for debug or compare purposes
If the pipeline fails than you have to restart the whole process from the start. Using our approach one can restart only failed operator in airflow and use results of previous operators to continue the job

How to run apache-beam in batches on a bounded data?

I am trying to understand how the apache beam works and im not quite sure if i do. So, i want someone to tell me if my understanding is right:
Beam is a layer of abstraction over big data frameworks like spark,hadoop,google data flow etc. Now quite every functionality but almost that is the case
Beam treats data in two forms bounded and unbounded. Bounded like a .csv and unbounded like a kafka subscription. There are different i/o read methods for each. For unbounded data we need to implement windowing (attaching a timestamp to each data point) and trigger (a timestamp). A batch here would be all the datapoints in a window till a trigger is hit. For the bounded datasets however, all the dataset is loaded in RAM (? if yes, how do i make beam work on batches?). The output of a i/o method is a pCollection
There are pTransformations (these are the operations i want run on the data) that apply to each element of the of the pCollection. I can make these pTransformations apply over a spark or flint cluster (this choice goes in the initial options set for the pipeline). each pTransformation emits a pCollection and that is how we chain various pTransformations together. End is a pCollection that can be saved to disk
End of the pipeline could be a save on some file system (How does this happen when i am reading a .csv in batches?)
Please point out to me any lapses in my understanding
Beam is not like google cloud dataflow, Cloud Dataflow is a runner on top of Apache Beam. It executes Apache Beam pipelines. But you can run an Apache Beam job with a local runner not on the cloud. There are plenty of different runners that you can find in the documentation : https://beam.apache.org/documentation/#available-runners
One specific aspect of Beam is that it's the same pipeline for Batch and Stream and that's the purpose. You can specify --streaming as an argument to execute your pipeline, withou it it should execute it in batch. But it mostly depends on you inputs, the data will just flow into the pipeline. And that's one important point, PCollections do not contain persistent data just like RDD's for Spark RDD.
You can apply a PTransform on part of your data, it's not necessarly on all the data. All the PTranforms together forms the pipeline.
It really depends where and what format you want for your output...

Can Spark automatically detect nondeterministic results and adjust failure recovery accordingly?

If nondeterministic code runs on Spark, this can cause a problem when recovery from failure of a node is necessary, because the new output may not be exactly the same as the old output. My interpretation is that the entire job might need to be rerun in this case, because otherwise the output data could be inconsistent with itself (as different data was produced at different times). At the very least any nodes that are downstream from the recovered node would probably need to be restarted from scratch, because they have processed data that may now change. That's my understanding of the situation anyway, please correct me if I am wrong.
My question is whether Spark can somehow automatically detect if code is nondeterministic (for example by comparing the old output to the new output) and adjust the failure recovery accordingly. If this were possible it would relieve application developers of the requirement to write nondeterministic code, which might sometimes be challenging and in any case this requirement can easily be forgotten.
No. Spark will not be able to handle non deterministic code in case of failures. The fundamental data structure of Spark, RDD is not only immutable but it
should also be determinstic function of it's input. This is necessary otherwise Spark framework will not be able to recompute the partial RDD (partition) in case of
failure. If the recomputed partition is not deterministic then it had to re-run the transformation again on full RDDs in lineage. I don't think that Spark is a right
framework for non-deterministic code.
If Spark has to be used for such use case, application developer has to take care of keeping the output consistent by writing code carefully. It can be done by using RDD only (no datframe or dataset) and persisting output after every transformation executing non-determinstic code. If performance is the concern, then the intermediate RDDs can be persisted on Alluxio.
A long term approach would be to open a feature request in apache spark jira. But I am not too positive about the acceptance of feature. A little hint in syntax to know wether code is deterministic or not and framework can switch to recover RDD partially or fully.
Non-deterministic results are not detected and accounted for in failure recovery (at least in spark 2.4.1, which I'm using).
I have encountered issues with this a few times on spark. For example, let's say I use a window function:
first_value(field_1) over (partition by field_2 order by field_3)
If field_3 is not unique, the result is non-deterministic and can differ each time that function is run. If a spark executor dies and restarts while calculating this window function, you can actually end up with two different first_value results output for the same field_2 partition.

Records processed metric for intermediate datasets

I have created a spark job using DATASET API. There is chain of operations performed until the final result which is collected on HDFS.
But I also need to know how many records were read for each intermediate dataset. Lets say I apply 5 operations on dataset (could be map, groupby etc), I need to know how many records were there for each of 5 intermediate dataset. Can anybody suggest how this can be obtained at dataset level. I guess I can find this out at task level (using listeners) but not sure how to get it at dataset level.
Thanks
The nearest from Spark documentation related to metrics is Accumulators. However this is good only for actions and they mentioned that acucmulators will not be updated for transformations.
You can still use count to get the latest counts after each operation. But should keep in mind that its an extra step like any other and you need to see if the ingestion should be done faster with less metrics or slower with all metrics.
Now coming back to listerners, I see that a SparkListener can receive events about when applications, jobs, stages, and tasks start and complete as well as other infrastructure-centric events like drivers being added or removed, when an RDD is unpersisted, or when environment properties change. All the information you can find about the health of Spark applications and the entire infrastructure is in the WebUI.
Your requirement is more of a custom implementation. Not sure if you can achieve this. Some info regarding exporting metrics is here.
All metrics which you can collect are at job start, job end, task start and task end. You can check the docs here
Hope the above info might guide you in finding a better solutions

Is it possible to get and use a JavaSparkContext from within a task?

I've come across a situation where I'd like to do a "lookup" within a Spark and/or Spark Streaming pipeline (in Java). The lookup is somewhat complex, but fortunately, I have some existing Spark pipelines (potentially DataFrames) that I could reuse.
For every incoming record, I'd like to potentially launch a spark job from the task to get the necessary information to decorate it with.
Considering the performance implications, would this ever be a good idea?
Not considering the performance implications, is this even possible?
Is it possible to get and use a JavaSparkContext from within a task?
No. The spark context is only valid on the driver and Spark will prevent serialization of it. Therefore it's not possible to use the Spark context from within a task.
For every incoming record, I'd like to potentially launch a spark job
from the task to get the necessary information to decorate it with.
Considering the performance implications, would this ever be a good
idea?
Without more details, my umbrella answer would be: Probably not a good idea.
Not considering the performance implications, is this even possible?
Yes, probably by bringing the base collection to the driver (collect) and iterating over it. If that collection doesn't fit in memory of the driver, please previous point.
If we need to process every record, consider performing some form of join with the 'decorating' dataset - that will be only 1 large job instead of tons of small ones.

Resources