(1) Can I have the cluster object global across machines so that once a job is submitted, that job in turn can submit other dissimilar smaller jobs?
cluster = dispy.JobCluster(compute)
(2) Can the "compute" function be different each time I invoke a submit?
(1) I believe you need to look into the SharedCluster object, of which I'm not familiar.
(2) You can create different "functions" inside of the compute function using if statements and passing a selection argument to compute:
def compute(param):
if param == 'a':
'Do something'
if param == 'b':
'Do something else'
cluster = dispy.JobCluster(compute)
for i in params:
cluster.submit(i)
Related
I'm exploring Prefect's map-reduce capability as a powerful idiom for writing massively-parallel, robust importers of external data.
As an example - very similar to the X-Files tutorial - consider this snippet:
#task
def retrieve_episode_ids():
api_connection = APIConnection(prefect.context.my_config)
return api_connection.get_episode_ids()
#task(max_retries=2, retry_delay=datetime.timedelta(seconds=3))
def download_episode(episode_id):
api_connection = APIConnection(prefect.context.my_config)
return api_connection.get_episode(episode_id)
#task(trigger=all_finished)
def persist_episodes(episodes):
db_connection = DBConnection(prefect.context.my_config)
...store all episodes by their ID with a success/failure flag...
with Flow("import_episodes") as flow:
episode_ids = retrieve_episode_ids()
episodes = download_episode.map(episode_ids)
persist_episodes(episodes)
The peculiarity of my flow, compared with the simple X-Files tutorial, is that I would like to persist results for all the episodes that I have requested, even for the failed ones. Imagine that I'll be writing episodes to a database table as the episode ID decorated with an is_success flag. Moreover, I'd like to write all episodes with a single task instance, in order to be able to perform a bulk insert - as opposed to inserting each episode one by one - hence my persist_episodes task being a reduce task.
The trouble I'm having is in being able to gather the episode ID for the failed downloads from that reduce task, so that I can store the failed information in the table under the appropriate episode ID. I could of course rewrite the download_episode task with a try/catch and always return an episode ID even in the case of failure, but then I'd lose the automatic retry/failure functionality which is a good deal of the appeal of Prefect.
Is there a way for a reduce task to infer the argument(s) of a failed mapped task? Or, could I write this differently to achieve what I need, while still keeping the same level of clarity as in my example?
Mapping over a list preserves the order. This is a property you can use to link inputs with the errors. Check the code I have below, will add more explanation after.
from prefect import Flow, task
import prefect
#task
def retrieve_episode_ids():
return [1,2,3,4,5]
#task
def download_episode(episode_id):
if episode_id == 5:
return ValueError()
return episode_id
#task()
def persist_episodes(episode_ids, episodes):
# Note the last element here will be the ValueError
prefect.context.logger.info(episodes)
# We change that ValueError into a "fail" message
episodes = ["fail" if isinstance(x, BaseException) else x for x in episodes]
# Note the last element here will be the "fail"
prefect.context.logger.info(episodes)
result = {}
for i, episode_id in enumerate(episode_ids):
result[episode_id] = episodes[i]
# Check final results
prefect.context.logger.info(result)
return
with Flow("import_episodes") as flow:
episode_ids = retrieve_episode_ids()
episodes = download_episode.map(episode_ids)
persist_episodes(episode_ids, episodes)
flow.run()
The handling will largely happen in the persist_episodes. Just pass the list of inputs again and then we can match the inputs with the failed tasks. I added some handling around identifying errors and replacing them with what you want. Does that answer the question?
Always happy to chat more. You can reach out in the Prefect Slack or Discourse as well.
Below is the function called for scheduling a job on server start.
But somehow the scheduled job is getting called again and again, and this is causing too many calls to that respective function.
Either this is happening because of multiple function calls or something else? Suggestions please.
def redis_schedule():
with current_app.app_context():
redis_url = current_app.config["REDIS_URL"]
with Connection(redis.from_url(redis_url)):
q = Queue("notification")
from ..tasks.notification import send_notifs
task = q.enqueue_in(timedelta(minutes=5), send_notifs)
Refer - https://python-rq.org/docs/job_registries/
Needed to read scheduled_job_registry and retrieve jobids.
Currently below logic works for me as I only have a single scheduled_job.
But in case of multiple jobs, I will need to loop these jobids to find the right job exists or not.
def redis_schedule():
with current_app.app_context():
redis_url = current_app.config["REDIS_URL"]
with Connection(redis.from_url(redis_url)):
q = Queue("notification")
if len(q.scheduled_job_registry.get_job_ids()) == 0:
from ..tasks.notification import send_notifs
task = q.enqueue_in(timedelta(seconds=30), send_notifs)
_pickle.PicklingError: Could not serialize object:
Exception: It appears that you are attempting to reference SparkContext from a broadcast variable, action, or transformation.
SparkContext can only be used on the driver, not in code that it run on workers. For more information, see SPARK-5063.
Super simple EXAMPLE app to try and run some calculations in parallel. Works (sometimes) but most times crashes with the above exception.
I don't think I have nested RDD, but the part about not being able to use the sparkContext in workers is worrisome since I think I need that to achieve some level of parallelism. If I can't use the sparkContext in the worker threads, how do I get the computational results back?
At this point I still expect it to be serialized, and was going to enable the parallel run after this. But can't even get the serialized multi-threaded version to run....
from pyspark import SparkContext
import threading
THREADED = True. # Set this to false and it always works but is sequential
content_file = "file:///usr/local/Cellar/apache-spark/3.0.0/README.md"
sc = SparkContext("local", "first app")
content = sc.textFile(content_file).cache() # For the non-threaded version
class Worker(threading.Thread):
def __init__(self, letter, *args, **kwargs):
super().__init__(*args, **kwargs)
self.letter = letter
def run(self):
print(f"Starting: {self.letter}")
nums[self.letter] = content.filter(lambda s: self.letter in s).count() # SPOILER self.letter turns out to be the problem
print(f"{self.letter}: {nums[self.letter]}")
nums = {}
if THREADED:
threads = []
for char in range(ord('a'), ord('z')+1):
letter = chr(char)
threads.append(Worker(letter, name=letter))
for thread in threads:
thread.start()
for thread in threads:
thread.join()
else:
for char in range(ord('a'), ord('z')+1):
letter = chr(char)
nums[letter] = content.filter(lambda s: letter in s).count()
print(f"{letter}: {nums[letter]}")
print(nums)
Even when I change the code to use one thread at a time
threads = []
for char in range(ord('a'), ord('z')+1):
letter = chr(char)
thread = Worker(letter, name=letter)
threads.append(thread)
thread.start()
thread.join()
It raises the same exception, I guess because it is trying to get the results back in a worker thread and not the main thread (where the SparkContext is declared).
I need to be able to wait on several values simultaneously if spark is going to provide any benefit here.
The real problem I'm trying to solve looks like this:
__________RESULT_________
^ ^ ^
A B C
a1 ^ a2 b1 ^ b2 c1 ^ c2...
To get my result I want to calculate A B and C in parallel, and each of those pieces will have to calculate a1, a2, a3, .... in parallel. I'm breaking it into threads so I can request multiple values simultaneously so that spark can run the computation in parallel.
I created the sample above simply because I want to get the threading correct, I'm not trying to figure out how to count the # of lines with a character in it. But this seemed super simple to vet the threading aspect.
This little change fixes things right up. self.letter was blowing up in the lambda, dereferencing it before the filter call removed the crash
def run(self):
print(f"Starting: {self.letter}")
letter = self.letter
nums[self.letter] = content.filter(lambda s: letter in s).count()
print(f"{self.letter}: {nums[self.letter]}")
The Exception says
It appears that you are attempting to reference SparkContext from a
broadcast variable, action, or transformation
In your case the reference to the SparkContext is held by the following line:
nums[self.letter] = self.content.filter(lambda s: self.letter in s).count()
in this line, you define a filter (which counts as a transformation) using the following lambda expression:
lambda s: self.letter in s
The Problem with this expression is: You reference the member variable letter of the object-reference self. To make this reference available during the execution of your batch, Spark needs to serialize the object self. But this object holds not only the member letter, but also content, which is a Spark-RDD (and every Spark-RDD holds a reference to the SparkContext it was created from).
To make the lambda serializable, you have to ensure not to reference anything that is not serializable inside it. The easiest way to achieve that, given your example, is to define a local variable based on the member letter:
def run(self):
print(f"Starting: {self.letter}")
letter = self.letter
nums[self.letter] = self.content.filter(lambda s: letter in s).count()
print(f"{self.letter}: {nums[self.letter]}")
The Why
To understand why we can't do this, we have to understand what Spark does with every transformation in the background.
Whenever you have some piece of code like this:
sc = SparkContext(<connection information>)
You're creating a "Connection" to the Spark-Master. It may be a simple in-process local Spark-Master or a Spark-Master running on a whole different server.
Given the SparkContext-Object, we can define where our pipeline should get it's data from. For this example, let's say we want to read our data from a text-file (just like in your question:
rdd = sc.textFile("file:///usr/local/Cellar/apache-spark/3.0.0/README.md")
As I mentioned before, the SparkContext is more or less a "Connection" to the Spark-Master. The URL we specify as the location of our text-file must be accessable from the Spark-Master, not from the system you're executing the python-script on!
Based on the Spark-RDD we created, we can now define how the data should be processed. Let's say we want to count only lines that contain a given string "Hello World":
linesThatContainHelloWorld = rdd.filter(lambda line: "Hello World" in line).count()
What Spark does once we call a terminal function (a computation that yields a result, like count() in this case) is that it serializes the function we passed to filter, transfers the serialized data to the Spark-Workers (which may run on a totally different server) and these Spark-Workers deserialize that function to be able to execute the given function.
That means that this piece of code: lambda line: "Hello World" in line will actually not be executed inside the Python-Process you're currently in, but on the Spark-Workers.
Things start to get trickier (for Spark) whenever we reference a variable from the upper scope inside one of our transformations:
stringThatALineShouldContain = "Hello World"
linesThatContainHelloWorld = rdd.filter(lambda line: stringThatALineShouldContain in line).count()
Now, Spark not only has to serialize the given function, but also the referenced variable stringThatALineShouldContain from the upper scope. In this simple example, this is no problem, since the variable stringThatALineShouldContain is serializable.
But whenever we try to access something that is not serializable or simply holds a reference to something that is not serialize, Spark will complain.
For example:
stringThatALineShouldContain = "Hello World"
badExample = (sc, stringThatALineShouldContain) # tuple holding a reference to the SparkContext
linesThatContainHelloWorld = rdd.filter(lambda line: badExample[1] in line).count()
Since the function now references badExample, Spark tries to serialize this variable and complains that it holds a reference to the SparkContext.
This not only applies to the SparkContext, but to everything that is not serializable, such as Connection-Objects to Databases, File-Handles and many more.
If, for any reason, you have to do something like this, you should only reference an object that contains information of how to create that unserializable object.
An example
Invalid example
dbConnection = MySQLConnection("mysql.example.com") # Not sure if this class exists, only for the example
rdd.filter(lambda line: dbConnection.insertIfNotExists("INSERT INTO table (col) VALUES (?)", line)
Valid example
# note that this is still "bad code", since the connection is never cleared. But I hope you get the idea
class LazyMySQLConnection:
connectionString = None
actualConnection = None
def __init__(self, connectionString):
self.connectionString = connectionString
def __getstate__(self):
# tell pickle (the serialization library Spark uses for transformations) that the actualConnection member is not part of the state
state = dict(self.__dict__)
del state["actualConnection"]
return state
def getOrCreateConnection(self):
if not self.actualConnection:
self.actualConnection = MySQLConnection(self.connectionString)
return self.actualConnection
lazyDbConnection = LazyMySQLConnection("mysql.example.com")
rdd.filter(lambda line: lazyDbConnection.getOrCreateConnection().insertIfNotExists("INSERT INTO table (col) VALUES (?)", line)
# remember, the lambda we supplied for the filter will be executed on the Spark-Workers, so the connection will be etablished from each Spark-Worker!
You're trying to use (Py)Spark in a way it is not intended to be used. You're mixing up plain-python data processing with spark-processing where you could completely realy on spark.
The Idea with Spark (and other Data Processing Frameworks) is, that you define how your data should be processed and all the multithreading + distribution stuff is just a independent "configuration".
Also, I don't really see what you would like to gain by using multiple threads.
Every Thread would:
Have to read every single character from your input file
Check if the current line contains the letter that was assigned to this thread
Count
This would (if it worked) yield a correct result, sure, but is inefficient, since there would be many threads fighting for those read operations on that file (remember, every thread would have to read the COMPLETE file in the first place, the be able to filter based on its assigned letter).
Work with spark, not against it, to get the most out of it.
# imports and so on
content_file = "file:///usr/local/Cellar/apache-spark/3.0.0/README.md"
sc = SparkContext("local", "first app")
rdd = sc.textFile(content_file) # read from this file
rdd = rdd.flatMap(lambda line: [letter for letter in line]) # forward every letter of each line to the next operator
# initialize the letterRange "outside" of spark so we reduce the runtime-overhead
relevantLetterRange = [chr(char) for char in range(ord('a'), ord('z') + 1)]
rdd = rdd.filter(lambda letter: letter in relevantLetterRange)
rdd = rdd.keyBy(lambda letter: letter) # key by the letter itself
countsByKey = rdd.countByKey() # count by key
You can of course simply write this in one chain:
# imports and so on
content_file = "file:///usr/local/Cellar/apache-spark/3.0.0/README.md"
sc = SparkContext("local", "first app")
relevantLetterRange = [chr(char) for char in range(ord('a'), ord('z') + 1)]
countsByKey = sc.textFile(content_file)\
.flatMap(lambda line: [letter for letter in line])\
.filter(lambda letter: letter in relevantLetterRange)\
.keyBy(lambda letter: letter)
.countByKey()
Having 500, continously growing DataFrames, I would like to submit operations on the (for each DataFrame indipendent) data to dask. My main question is: Can dask hold the continously submitted data, so I can submit a function on all the submitted data - not just the newly submitted?
But lets explain it on an example:
Creating a dask_server.py:
from dask.distributed import Client, LocalCluster
HOST = '127.0.0.1'
SCHEDULER_PORT = 8711
DASHBOARD_PORT = ':8710'
def run_cluster():
cluster = LocalCluster(dashboard_address=DASHBOARD_PORT, scheduler_port=SCHEDULER_PORT, n_workers=8)
print("DASK Cluster Dashboard = http://%s%s/status" % (HOST, DASHBOARD_PORT))
client = Client(cluster)
print(client)
print("Press Enter to quit ...")
input()
if __name__ == '__main__':
run_cluster()
Now I can connect from my my_stream.py and start to submit and gather data:
DASK_CLIENT_IP = '127.0.0.1'
dask_con_string = 'tcp://%s:%s' % (DASK_CLIENT_IP, DASK_CLIENT_PORT)
dask_client = Client(self.dask_con_string)
def my_dask_function(lines):
return lines['a'].mean() + lines['b'].mean
def async_stream_redis_to_d(max_chunk_size = 1000):
while 1:
# This is a redis queue, but can be any queueing/file-stream/syslog or whatever
lines = self.queue_IN.get(block=True, max_chunk_size=max_chunk_size)
futures = []
df = pd.DataFrame(data=lines, columns=['a','b','c'])
futures.append(dask_client.submit(my_dask_function, df))
result = self.dask_client.gather(futures)
print(result)
time sleep(0.1)
if __name__ == '__main__':
max_chunk_size = 1000
thread_stream_data_from_redis = threading.Thread(target=streamer.async_stream_redis_to_d, args=[max_chunk_size])
#thread_stream_data_from_redis.setDaemon(True)
thread_stream_data_from_redis.start()
# Lets go
This works as expected and it is really quick!!!
But next, I would like to actually append the lines first before the computation takes place - And wonder if this is possible? So in our example here, I would like to calculate the mean over all lines which have been submitted, not only the last submitted ones.
Questions / Approaches:
Is this cummulative calculation possible?
Bad Alternative 1: I
cache all lines locally and submit all the data to the cluster
every time a new row arrives. This is like an exponential overhead. Tried it, it works, but it is slow!
Golden Option: Python
Program 1 pushes the data. Than it would be possible to connect with
another client (from another python program) to that cummulated data
and move the analysis logic away from the inserting logic. I think Published DataSets are the way to go, but are there applicable for this high-speed appends?
Maybe related: Distributed Variables, Actors Worker
Assigning a list of futures to a published dataset seems ideal to me. This is relatively cheap (everything is metadata) and you'll be up-to-date as of a few milliseconds
client.datasets["x"] = list_of_futures
def worker_function(...):
futures = get_client().datasets["x"]
data = get_client.gather(futures)
... work with data
As you mention there are other systems like PubSub or Actors. From what you say though I suspect that Futures + Published datasets are simpler and a more pragmatic option.
I am trying to write a Python function for fast calculation of the sum of a list, using parallel computing. Initially I tried to use the Python multithreading library, but then I noticed that all threads run on the same CPU, so there is no speed gain, so I switched to using multiprocessing. In the first version I made the list a global variable:
from multiprocessing import Pool
array = 100000000*[1]
def sumPart(fromTo:tuple):
return sum(array[fromTo[0]:fromTo[1]])
with Pool(2) as pool:
print(sum(pool.map(sumPart, [(0,len(array)//2), (len(array)//2,len(array))])))
This worked well and returned the correct sum after about half the time of a serial computation.
But then I wanted to make it a function that accepts the array as an argument:
def parallelSum(theArray):
def sumPartLocal(fromTo: tuple):
return sum(theArray[fromTo[0]:fromTo[1]])
with Pool(2) as pool:
return (sum(pool.map(sumPartLocal, [(0, len(theArray) // 2), (len(theArray) // 2, len(theArray))])))
Here I got an error:
AttributeError: Can't pickle local object 'parallelSum.<locals>.sumPartLocal'
What is the correct way to write this function?
When scheduling jobs to a Python Pool you need to ensure both the function and it's arguments can be serialized as they will be transferred over a pipe.
Python uses the pickle protocol to serialize its objects. You can see what can be pickled in the module documentation. In your case, you are facing this limitation.
functions defined at the top level of a module (using def, not lambda)
Under the hood, the Pool is sending a string with the function name and its parameters. The Python interpreter in the child process looks for that function name in the module and fails to find it as it's nested in the scope of another function parallelSum.
Move sumPartLocal outside parallelSum and everything will be fine.
I believe you are hitting this, or see the documentation
What you could do is leave def sumPartLocal at module level, and pass theArray as third component of your tuple so that would be fromTo[2] inside the sumPartLocal function.
Example:
from multiprocessing import Pool
def sumPartLocal(fromTo: tuple):
return sum(fromTo[2][fromTo[0]:fromTo[1]])
def parallelSum(theArray):
with Pool(2) as pool:
return (sum
(pool.map
(sumPartLocal, [
(0, len(theArray) // 2, theArray),
(len(theArray) // 2, len(theArray), theArray)
]
)
)
)
if __name__ == '__main__':
theArray = 100000000*[1]
s = parallelSum(theArray)
print(s)
[EDIT 15-Dec-2017 based on comments]
Anyone who is thinking of multi-threading in python, I strongly recommend reading up about the Global Interpreter Lock
Also, some good answers on this question here on SO