Performance of a PySpark pipeline in Python classes - apache-spark

I have a PySpark pipeline running on Databricks. A pipeline is basically a number of functions executed in sequence which read/create tables, join, transform, etc. (i.e. common Spark stuff). So, for example, it could be something like below:
def read_table():
    ...

def perform_transforms():
    ...

def perform_further_transforms():
    ...

def run_pipeline():
    read_table()
    perform_transforms()
    perform_further_transforms()
Now to structure the code better I encapsulated the constants and functions of the pipeline into a class with static methods and a run method like below:
class CustomPipeline:
    class_variable_1 = "some_variable"
    class_variable_2 = "another_variable"

    @staticmethod
    def read_table():
        ...

    @staticmethod
    def perform_transforms():
        ...

    @staticmethod
    def perform_further_transforms():
        ...

    @staticmethod
    def run():
        CustomPipeline.read_table()
        CustomPipeline.perform_transforms()
        CustomPipeline.perform_further_transforms()
Now, this may be a stupid question, but conceptually, can this affect the performance of the pipeline in any way? For example, could encapsulating the parts of the pipeline into a class result in some extra overhead in communication from the Python interpreter to the JVM running Spark?
Any help is appreciated, thanks. Also, comment if any other details are needed.

Not directly, no; it doesn't matter.
I suppose it could matter if, for example, your class did a bunch of initialization work for every step, no matter which step was actually executed. But I don't see that here.
This isn't different on Spark or Databricks.
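For what it's worth, here is a hypothetical sketch (the class and table name are made up, not from the question) of the kind of eager initialization that would add overhead: anything evaluated at class-definition time runs on the driver regardless of which step you later call.
class HeavyPipeline:
    # Runs at class-definition (import) time, even if run() is never called;
    # collect() triggers a Spark job right there on the driver.
    lookup_rows = spark.read.table("some_lookup_table").collect()

    @staticmethod
    def run():
        ...
Static methods that merely wrap the same DataFrame operations, as in CustomPipeline above, add no such cost; the plan Spark ends up executing is the same.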

Related

Using PySpark instead of the multiprocessing module - How can I translate my code?

I am currently using the multiprocessing module to parallelize iterations as shown in this example. The thing is that this way I would be using only one worker and its cores, not all the workers available. I am also not able to parallelize experiments (I am running several experiments, and several iterations for each experiment).
This code is taking too long to run, and my understanding is that the runtime could be greatly reduced using PySpark. My Spark knowledge is very limited and I don't know how to translate this code to use it with Spark.
All the functions and classes used here are written in pure Python (NumPy and pandas).
import concurrent.futures
import multiprocessing as mp

def process_simulation(experiment):
    number_of_workers = mp.cpu_count()
    with concurrent.futures.ProcessPoolExecutor(max_workers=number_of_workers) as executor:
        results = list(executor.map(Simulation.simulation_steps, iterations_generator()))
    experiment.simulations = []
    for result in results:
        experiment.simulations.append(result)
For context, Experiment and Simulation are classes (there's no inheritance). One experiment needs multiple simulations to be completed.
Thank you!
You can use Fugue to bring this type of logic to PySpark with a minimal wrapper. The only thing is that you need to start with a DataFrame of inputs, and then you can do something like:
from fugue import transform

transform(
    input_df,
    Simulation.simulation_steps,
    schema=<your output schema here>,
    partition={"how": "per_row"},
    engine="spark",
)
I can always help more with getting it into this shape if I have more details on what the logic is. It may just need one wrapper function (contact info in bio).
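As a rough illustration of what that wrapper could look like (a hedged sketch: run_simulation, the experiment_id column, and the Simulation(**row) constructor call are assumptions, not from the question), Fugue infers the interface from pandas type annotations, so the per-row logic stays plain Python:
import pandas as pd
from fugue import transform

def run_simulation(df: pd.DataFrame) -> pd.DataFrame:
    # Each incoming row is assumed to describe one experiment/iteration.
    out = []
    for row in df.to_dict("records"):
        sim = Simulation(**row)  # assumes Simulation accepts these fields
        out.append({"experiment_id": row["experiment_id"],
                    "result": sim.simulation_steps()})
    return pd.DataFrame(out)

result_df = transform(
    input_df,                                # DataFrame of experiment/iteration inputs
    run_simulation,
    schema="experiment_id:long,result:str",  # adjust to the real output schema
    engine="spark",
)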

Celery Process a Task and Items within a Task

I am new to Celery, and I would like advice on how best to use Celery to accomplish the following.
Suppose I have ten large datasets. I realize that I can use Celery to do work on each dataset by submitting ten tasks. But suppose that each dataset consists of 1,000,000+ text documents stored in a NoSQL database (Elasticsearch in my case). The work is performed at the document level. The work could be anything - maybe counting words.
For a given dataset, I need to start the dataset-level task. The task should read documents from the data store. Then workers should process the documents - a document-level task.
How can I do this, given that the task is defined at the dataset level, not the document level? I am trying to move away from using a JoinableQueue to store documents and submit them for work with multiprocessing.
I have read that it is possible to use multiple queues in Celery, but it is not clear to me that that is the best approach.
Let's see if this helps. You can define a workflow, add tasks to it, and then run the whole thing after building up your tasks. You can have normal Python functions return tasks that can be added into Celery primitives (chain, group, chord, etc.); see here for more info. For example, let's say you have two tasks that process documents for a given dataset:
def some_task(documents):
    return dummy_task.si(documents)

def some_other_task(documents):
    return dummy_task.si(documents)

@celery.task(bind=True)
def dummy_task(self, *args, **kwargs):
    return True
You can then provide a task that generates the subtasks like so:
from celery import chain

@celery.task()
def dataset_workflow(*args, **kwargs):
    datasets = get_datasets(*args, **kwargs)
    workflows = []
    for dataset in datasets:
        documents = get_documents(dataset)
        workflow = chain(some_task(documents), some_other_task(documents))
        workflows.append(workflow)
    run_workflows = chain(*workflows).apply_async()
Keep in mind that generating a lot of tasks can consume a lot of memory for the Celery workers, so throttling or breaking the task generation up might be needed as you start to scale your workloads.
Additionally, you can have the document-level tasks on a different queue than your workflow task if needed, based on resource constraints, etc.
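As a minimal sketch of one way to throttle that fan-out (process_document, get_document_ids and the broker URL are assumptions, not part of the answer), Celery's chunks primitive packs many document-level calls into far fewer task messages:
from celery import Celery

celery = Celery("docs", broker="redis://localhost:6379/0")  # assumed broker

@celery.task()
def process_document(doc_id):
    # document-level work, e.g. counting words
    ...

def submit_dataset(dataset, batch_size=10_000):
    doc_ids = get_document_ids(dataset)  # hypothetical helper reading from the data store
    # chunks() groups batch_size document ids into a single task message
    process_document.chunks(((d,) for d in doc_ids), batch_size).apply_async()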

Why is pyspark implemented such that exiting a session is stopping the underlying spark context?

I just massively shot my foot by writing "pythonic" spark code like this:
# spark = ... getOrCreate() # essentially provided by the environment (Databricks)
with spark.newSession() as session:
    session.catalog.setCurrentDatabase("foo_test")
    do_something_within_database_scope(session)

assert spark.currentDatabase() == "default"
And oh was I surprised that when executing this notebook cell, somehow the cluster terminated.
I read through this answer which tells me, that there can only be one spark context. That is fine. But why is exiting a session terminating the underlying context? Is there some requirement for this or is this just a design flaw in pyspark?
I also understand that the session's __exit__ call invokes context.stop() - I want to know why it is implemented like that!
I always think of a session as some user-initiated thing, like with databases or HTTP clients, which I can create and discard at will. If the session provides __enter__ and __exit__, then I try to use it from within a with context to make sure I clean up after I am done.
Is my understanding wrong, or alternatively why does pyspark deviate from that concept?
Edit: I tested this together with databricks-connect which comes with its own pyspark python module, but as pri pointed out below it seems to be implemented the same way in standard pyspark.
I looked at the code, and it calls the method below:
@since(2.0)
def __exit__(
    self,
    exc_type: Optional[Type[BaseException]],
    exc_val: Optional[BaseException],
    exc_tb: Optional[TracebackType],
) -> None:
    """
    Enable 'with SparkSession.builder.(...).getOrCreate() as session: app' syntax.

    Specifically stop the SparkSession on exit of the with block.
    """
    self.stop()
And the stop method is:
@since(2.0)
def stop(self) -> None:
    """Stop the underlying :class:`SparkContext`."""
    from pyspark.sql.context import SQLContext

    self._sc.stop()
    # We should clean the default session up. See SPARK-23228.
    self._jvm.SparkSession.clearDefaultSession()
    self._jvm.SparkSession.clearActiveSession()
    SparkSession._instantiatedSession = None
    SparkSession._activeSession = None
    SQLContext._instantiatedContext = None
So I don't think you can stop just the SparkSession. Whenever a SparkSession gets stopped (regardless of how; in this case, __exit__ is called when execution leaves the with block), it kills the underlying SparkContext along with it.
Link to the relevant Apache Spark code below:
https://github.com/apache/spark/blob/master/python/pyspark/sql/session.py#L1029
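A minimal sketch of a workaround, assuming all you want is a session-scoped current database (newSession() shares the SparkContext but keeps its own SQL conf, temp views and current database):
# Plain scope instead of "with": never call stop() on the session,
# because SparkSession.stop() stops the shared SparkContext.
session = spark.newSession()
session.catalog.setCurrentDatabase("foo_test")
do_something_within_database_scope(session)

# The parent session is unaffected, since newSession() isolates session state:
assert spark.catalog.currentDatabase() == "default"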

Is it possible to get the attempt number of a running Spark Task?

I have a Spark job in which I am applying a custom transformation function to all records of my RDD. This function is very long and somewhat "fragile" as it might fail unexpectedly. I also have a fallback function that should be used in case of a failure - this one is much faster and stable. For reasons beyond the scope of this question, I can't split the primary function into smaller pieces, nor can I catch the primary function's failure inside the function itself and handle the fallback (with try/except for example) as I also want to handle failures caused by the execution environment, such as OOM.
Here's a simple example of what I'm trying to achieve:
def primary_logic(*args):
    ...

def fallback_logic(*args):
    ...

def spark_map_function(*args):
    current_task_attempt_num = ...  # how do I get this?
    if current_task_attempt_num == 0:
        return primary_logic(*args)
    else:
        return fallback_logic(*args)

result = rdd.map(spark_map_function)
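As a hedged sketch of one way to read the attempt number inside the mapped function: PySpark exposes per-task metadata through pyspark.TaskContext on the executors.
from pyspark import TaskContext

def spark_map_function(*args):
    # TaskContext.get() is only valid inside a running task on an executor
    attempt_number = TaskContext.get().attemptNumber()
    if attempt_number == 0:
        return primary_logic(*args)
    return fallback_logic(*args)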

How do I get data on spark jobs and stages from python [duplicate]

This question already has answers here:
How to add a SparkListener from pySpark in Python?
(2 answers)
Closed 3 years ago.
Following the breadcrumbs, I cobbled some code that seems to do what I want: run in the background, look at ongoing jobs, then collect... whatever information may be available:
import threading
import time

import pyspark

def do_background_monitoring(sc: pyspark.context.SparkContext):
    thread = threading.Thread(target=monitor, args=[sc])
    thread.start()
    return thread

def monitor(sc: pyspark.context.SparkContext):
    job_tracker: pyspark.status.StatusTracker = sc.statusTracker()  # should this go inside the loop?
    while True:
        time.sleep(1)
        for job_id in job_tracker.getActiveJobsIds():
            job: pyspark.status.SparkJobInfo = job_tracker.getJobInfo(job_id)
            stages = job.stageIds
            # ???
However, that's where I hit a dead end. According to the docs, stageIds is an int[], and apparently py4j or whatever doesn't know what to do with it? (py4j claims otherwise...)
ipdb> stages
JavaObject id=o34
ipdb> stages.
equals notify wait
getClass notifyAll
hashCode toString
ipdb> stages.toString()
'[I@4b1009f3'
Is this a dead end? Are there other ways to achieve this? If I were willing to write scala to do this, would I be able to have just this bit be in Scala and keep the rest in Python?
...while the REPL made it look like Python knew nothing about the object other than that it was some sort of Object, py4j does make the contents of the array available to you:
ipdb> type(si)
<class 'py4j.java_collections.JavaArray'>
ipdb> tuple(si)
(0,)
and now I feel really silly :)
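To continue the monitor loop from there (a hedged sketch; the printed fields are just examples of what SparkStageInfo exposes), the stage ids can be fed straight back into the same StatusTracker:
for job_id in job_tracker.getActiveJobsIds():
    job = job_tracker.getJobInfo(job_id)
    for stage_id in job.stageIds:  # the py4j JavaArray is iterable
        stage = job_tracker.getStageInfo(stage_id)
        if stage:  # None once the stage information has been dropped
            print(stage_id, stage.name, stage.numActiveTasks, stage.numCompletedTasks)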
