Multithread not finding first target function python - python-3.x

So I have a peculiar situation where I'm multi-threading a set of functions, some of which spawn further threads. In one such case, the first thread in a block of three fails to execute its target function and instead produces this error:
NameError: name '<FUNCTION_NAME>' is not defined
This is the code:
threadA = self.pool.apply_async(functionA)
threadB = self.pool.apply_async(functionB)
threadC = self.pool.apply_async(functionC)
valueA = threadA.get()
valueB = threadB.get()
valueC = threadC.get()
And the relevant function is defined above.
If I switch the order of these thread assignments, whichever comes first produces the NameError. E.g. if threadB were assigned first, the error would be:
NameError: name 'functionB' is not defined
There's plenty of other threading going on while this is happening, so I'm not sure if it's a resource issue.
Edit:
I'm using multiprocessing.pool.ThreadPool, not processes.
Any help would be great,
Cheers :)

Related

Python3 and Multithreading - How does "join" method work?

I started to use the threading library recently because I need to make my software faster, but unfortunately I can't. Below is an example of what I want to do:
from threading import Thread

# in my PC, it takes around 30 seconds to complete the task:
def do_something(string):
    a = string
    for n in range(1000000000):
        a += a + a
    return {"a": "aa", "b": "bb", "c": "cc"}

ls_a, ls_b, ls_c = [], [], []
ls_strings = ["ciao", "hello", "hola"]
for key in ls_strings:
    t = Thread(target=do_something, args=(key,))
    t.start()
    dic = t.join()
    ls_a.append(dic["a"])  # <--- TypeError: 'NoneType' object is not subscriptable
    ls_b.append(dic["b"])
    ls_c.append(dic["c"])
print(ls_a)
print(ls_b)
print(ls_c)
This code doesn't work; it raises an exception when Python reaches the line "ls_a.append(dic["a"])":
TypeError: 'NoneType' object is not subscriptable
There is this error because the instruction "dic=t.join()" returns "None" and I really don't understand why (I expected to receive the dictionary, not "None"). Why doesn't the method "join" work? How can I fix my code? Can you guys help me understand?
What I want to do is run the "do_something" function for several strings (in my example "ciao", "hello" and "hola") at the same time.
The trick in that case is to not join any of them until all of them have been started. Use two loops instead of just one:
threads = []
for key in ls_strings:
    t = Thread(target=do_something, args=(key,))
    t.start()
    threads.append(t)
# optionally, do something else here while the threads run.
for t in threads:
    t.join()
Note: this does not solve your problem of how to "return" a value from a thread. There are plenty of questions already answered on this site that tell you how to do that (e.g., How to get the return value from a thread in Python?)
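For illustration only (this is my addition, not part of the original answer): one common approach is concurrent.futures.ThreadPoolExecutor, which does the start/join bookkeeping for you and hands back each function's return value through a future:

from concurrent.futures import ThreadPoolExecutor

def do_something(string):
    # stand-in for the expensive work; returns a dict like the original
    return {"a": string + "-a", "b": string + "-b", "c": string + "-c"}

ls_strings = ["ciao", "hello", "hola"]
with ThreadPoolExecutor() as executor:
    # executor.map runs the calls in worker threads and yields the
    # return values in input order
    results = list(executor.map(do_something, ls_strings))

ls_a = [dic["a"] for dic in results]
ls_b = [dic["b"] for dic in results]
ls_c = [dic["c"] for dic in results]
print(ls_a, ls_b, ls_c)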

Multi-threading PySpark, Could not serialize object exception

_pickle.PicklingError: Could not serialize object:
Exception: It appears that you are attempting to reference SparkContext from a broadcast variable, action, or transformation.
SparkContext can only be used on the driver, not in code that it run on workers. For more information, see SPARK-5063.
Super simple EXAMPLE app to try and run some calculations in parallel. It works (sometimes), but most times it crashes with the above exception.
I don't think I have a nested RDD, but the part about not being able to use the SparkContext in workers is worrisome, since I think I need it to achieve some level of parallelism. If I can't use the SparkContext in the worker threads, how do I get the computational results back?
At this point I still expect the execution to be serialized, and was going to enable the parallel run after this. But I can't even get the serialized multi-threaded version to run...
from pyspark import SparkContext
import threading

THREADED = True  # Set this to False and it always works, but is sequential

content_file = "file:///usr/local/Cellar/apache-spark/3.0.0/README.md"
sc = SparkContext("local", "first app")
content = sc.textFile(content_file).cache()  # For the non-threaded version

class Worker(threading.Thread):
    def __init__(self, letter, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.letter = letter

    def run(self):
        print(f"Starting: {self.letter}")
        nums[self.letter] = content.filter(lambda s: self.letter in s).count()  # SPOILER: self.letter turns out to be the problem
        print(f"{self.letter}: {nums[self.letter]}")

nums = {}
if THREADED:
    threads = []
    for char in range(ord('a'), ord('z') + 1):
        letter = chr(char)
        threads.append(Worker(letter, name=letter))
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
else:
    for char in range(ord('a'), ord('z') + 1):
        letter = chr(char)
        nums[letter] = content.filter(lambda s: letter in s).count()
        print(f"{letter}: {nums[letter]}")
print(nums)
Even when I change the code to use one thread at a time:
threads = []
for char in range(ord('a'), ord('z') + 1):
    letter = chr(char)
    thread = Worker(letter, name=letter)
    threads.append(thread)
    thread.start()
    thread.join()
It raises the same exception, I guess because it is trying to get the results back in a worker thread and not the main thread (where the SparkContext is declared).
I need to be able to wait on several values simultaneously if spark is going to provide any benefit here.
The real problem I'm trying to solve looks like this:
        __________RESULT_________
        ^           ^           ^
        A           B           C
     a1 ^ a2     b1 ^ b2     c1 ^ c2 ...
To get my result I want to calculate A, B, and C in parallel, and each of those pieces will in turn have to calculate a1, a2, a3, ... in parallel. I'm breaking it into threads so I can request multiple values simultaneously and Spark can run the computation in parallel.
I created the sample above simply because I want to get the threading correct; I'm not trying to figure out how to count the number of lines containing a character. But it seemed a super simple way to vet the threading aspect.
This little change fixes things right up. self.letter was blowing up in the lambda; dereferencing it into a local variable before the filter call removed the crash:
def run(self):
    print(f"Starting: {self.letter}")
    letter = self.letter
    nums[self.letter] = content.filter(lambda s: letter in s).count()
    print(f"{self.letter}: {nums[self.letter]}")
The Exception says
It appears that you are attempting to reference SparkContext from a
broadcast variable, action, or transformation
In your case, the reference to the SparkContext is held by the following line:
nums[self.letter] = self.content.filter(lambda s: self.letter in s).count()
In this line, you define a filter (which counts as a transformation) using the following lambda expression:
lambda s: self.letter in s
The problem with this expression is that you reference the member variable letter of the object reference self. To make this reference available during the execution of your batch, Spark needs to serialize the object self. But this object holds not only the member letter, but also content, which is a Spark-RDD (and every Spark-RDD holds a reference to the SparkContext it was created from).
To make the lambda serializable, you have to ensure that you don't reference anything non-serializable inside it. The easiest way to achieve that, given your example, is to define a local variable based on the member letter:
def run(self):
    print(f"Starting: {self.letter}")
    letter = self.letter
    nums[self.letter] = self.content.filter(lambda s: letter in s).count()
    print(f"{self.letter}: {nums[self.letter]}")
The Why
To understand why we can't do this, we have to understand what Spark does with every transformation in the background.
Whenever you have some piece of code like this:
sc = SparkContext(<connection information>)
You're creating a "Connection" to the Spark-Master. It may be a simple in-process local Spark-Master or a Spark-Master running on a whole different server.
Given the SparkContext object, we can define where our pipeline should get its data from. For this example, let's say we want to read our data from a text file (just like in your question):
rdd = sc.textFile("file:///usr/local/Cellar/apache-spark/3.0.0/README.md")
As I mentioned before, the SparkContext is more or less a "Connection" to the Spark-Master. The URL we specify as the location of our text file must be accessible from the Spark-Master, not from the system you're executing the Python script on!
Based on the Spark-RDD we created, we can now define how the data should be processed. Let's say we want to count only lines that contain a given string "Hello World":
linesThatContainHelloWorld = rdd.filter(lambda line: "Hello World" in line).count()
What Spark does once we call a terminal function (a computation that yields a result, like count() in this case) is serialize the function we passed to filter, transfer the serialized data to the Spark-Workers (which may run on a totally different server), and have those Spark-Workers deserialize that function in order to execute it.
That means that this piece of code: lambda line: "Hello World" in line will actually not be executed inside the Python-Process you're currently in, but on the Spark-Workers.
Things start to get trickier (for Spark) whenever we reference a variable from the upper scope inside one of our transformations:
stringThatALineShouldContain = "Hello World"
linesThatContainHelloWorld = rdd.filter(lambda line: stringThatALineShouldContain in line).count()
Now, Spark not only has to serialize the given function, but also the referenced variable stringThatALineShouldContain from the upper scope. In this simple example, this is no problem, since the variable stringThatALineShouldContain is serializable.
But whenever we try to access something that is not serializable, or that simply holds a reference to something that is not serializable, Spark will complain.
For example:
stringThatALineShouldContain = "Hello World"
badExample = (sc, stringThatALineShouldContain) # tuple holding a reference to the SparkContext
linesThatContainHelloWorld = rdd.filter(lambda line: badExample[1] in line).count()
Since the function now references badExample, Spark tries to serialize this variable and complains that it holds a reference to the SparkContext.
This not only applies to the SparkContext, but to everything that is not serializable, such as connection objects to databases, file handles, and many more.
If, for any reason, you have to do something like this, you should only reference an object that contains the information needed to create that unserializable object.
An example
Invalid example
dbConnection = MySQLConnection("mysql.example.com")  # Not sure if this class exists, only for the example
rdd.filter(lambda line: dbConnection.insertIfNotExists("INSERT INTO table (col) VALUES (?)", line))
Valid example
# note that this is still "bad code", since the connection is never cleared. But I hope you get the idea
class LazyMySQLConnection:
    connectionString = None
    actualConnection = None

    def __init__(self, connectionString):
        self.connectionString = connectionString

    def __getstate__(self):
        # tell pickle (the serialization library Spark uses for transformations)
        # that the actualConnection member is not part of the state
        state = dict(self.__dict__)
        state.pop("actualConnection", None)  # may never have been set on the driver
        return state

    def getOrCreateConnection(self):
        if not self.actualConnection:
            self.actualConnection = MySQLConnection(self.connectionString)
        return self.actualConnection

lazyDbConnection = LazyMySQLConnection("mysql.example.com")
rdd.filter(lambda line: lazyDbConnection.getOrCreateConnection().insertIfNotExists("INSERT INTO table (col) VALUES (?)", line))
# remember, the lambda we supplied for the filter will be executed on the Spark-Workers, so the connection will be established from each Spark-Worker!
You're trying to use (Py)Spark in a way it is not intended to be used. You're mixing up plain-Python data processing with Spark processing where you could rely completely on Spark.
The idea with Spark (and other data processing frameworks) is that you define how your data should be processed, and all the multithreading + distribution stuff is just an independent "configuration".
Also, I don't really see what you would like to gain by using multiple threads.
Every thread would:
- have to read every single character from your input file
- check whether the current line contains the letter assigned to that thread
- count
This would (if it worked) yield a correct result, sure, but is inefficient, since there would be many threads fighting over those read operations on that file (remember, every thread would have to read the COMPLETE file in the first place to be able to filter based on its assigned letter).
Work with spark, not against it, to get the most out of it.
# imports and so on
content_file = "file:///usr/local/Cellar/apache-spark/3.0.0/README.md"
sc = SparkContext("local", "first app")
rdd = sc.textFile(content_file) # read from this file
rdd = rdd.flatMap(lambda line: [letter for letter in line]) # forward every letter of each line to the next operator
# initialize the letterRange "outside" of spark so we reduce the runtime-overhead
relevantLetterRange = [chr(char) for char in range(ord('a'), ord('z') + 1)]
rdd = rdd.filter(lambda letter: letter in relevantLetterRange)
rdd = rdd.keyBy(lambda letter: letter) # key by the letter itself
countsByKey = rdd.countByKey() # count by key
You can of course simply write this in one chain:
# imports and so on
content_file = "file:///usr/local/Cellar/apache-spark/3.0.0/README.md"
sc = SparkContext("local", "first app")
relevantLetterRange = [chr(char) for char in range(ord('a'), ord('z') + 1)]
countsByKey = sc.textFile(content_file)\
    .flatMap(lambda line: [letter for letter in line])\
    .filter(lambda letter: letter in relevantLetterRange)\
    .keyBy(lambda letter: letter)\
    .countByKey()
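countByKey() is an action, so countsByKey is already a plain dictionary-like result on the driver; no further Spark calls are needed to use it. A possible usage line (my addition, not from the original answer):

for letter, count in sorted(countsByKey.items()):
    print(f"{letter}: {count}")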

Why is the destructor not automatically being called?

I'm working on an assignment for school and having some difficulty understanding the __del__ method. I understand that it is called after all the references to the object are deleted, but I'm not exactly sure how to get to that point. The assignment states that the __del__ method should be called automatically, but I'm having a rough time even getting del to automatically call __del__ as I understand it should.
I've tried manually calling the del method and have looked at various sample code. Something is just not clicking with me. The only way I can somewhat get it to be called is by using this piece of code at the end:
for faq in faqs:
    Faq.__del__(faq)
But I know that is not correct.
class Faq:
    def __init__(self, question, answer):
        self.question = question
        self.answer = answer
        return

    def print_faq(self):
        print('\nQuestion: {}'.format(self.question))
        print('Answer: {}'.format(self.answer))

    def __del__(self):
        print('\nQuestion: {}'.format(self.question))
        print('FAQ deleted')

faqs = []
faq1 = Faq('Does this work?', 'Yes.')
faqs.append(faq1)
faq2 = Faq('What about now?', 'Still yes.')
faqs.append(faq2)
faq3 = Faq('Should I give up?', 'Nope!')
faqs.append(faq3)

print("FAQ's:")
print('='*30)
for faq in faqs:
    obj = Faq.print_faq(faq)
    print()
print('='*30)
I expect the code to output the __del__ print statements to verify the code ran.
The method __del__ is called "when the instance is about to be destroyed". This happens when there are no more references to it.
del x doesn’t directly call x.__del__() — the former decrements the reference count for x by one, and the latter is only called when x’s reference count reaches zero.
So the reason you don't see the expected prints is that each Faq object has 2 references to it:
- The variable it is assigned to (faq1, faq2 ...)
- A reference from the list faqs
So doing del faq1 is not enough, as this still leaves one last reference from the list. To delete those references too, you can do del faqs[:].
As for the code posted here, I am guessing you expect to see the prints because when the program finishes all resources are released. Well, that is true, but:
It is not guaranteed that __del__() methods are called for objects that still exist when the interpreter exits.
You bind faq1, faq2, faq3 and faqs, and you keep these references. You need to destroy the bindings.
For example:
faq1 = None # del faq1
faq2 = None
faq3 = None
faqs = None # [] or del faqs
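A minimal sketch of my own illustrating the reference counting (on CPython, where __del__ runs as soon as the last reference disappears):

class Faq:
    def __init__(self, question):
        self.question = question

    def __del__(self):
        print('FAQ deleted:', self.question)

faqs = [Faq('Does this work?')]
faq1 = faqs[0]   # two references now: faq1 and faqs[0]

del faq1         # one reference left (the list): nothing prints
del faqs[:]      # last reference gone: __del__ runs and prints here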

Python 3.5 asyncio execute coroutine on event loop from synchronous code in different thread

I am hoping someone can help me here.
I have an object whose attributes can return coroutine objects. This works beautifully; however, I have a situation where I need to get the result of a coroutine object from synchronous code in a separate thread, while the event loop is currently running. The code I came up with is:
def get_sync(self, key: str, default: typing.Any=None) -> typing.Any:
    """
    Get an attribute synchronously and safely.

    Note:
        This does nothing special if an attribute is synchronous. It only
        really has a use for asynchronous attributes. It processes
        asynchronous attributes synchronously, blocking everything until
        the attribute is processed. This helps when running SQL code that
        cannot run asynchronously in coroutines.

    Args:
        key (str): The Config object's attribute name, as a string.
        default (Any): The value to use if the Config object does not have
            the given attribute. Defaults to None.

    Returns:
        Any: The value of the Config object's attribute, or the default
            value if the Config object does not have the given attribute.
    """
    ret = self.get(key, default)
    if asyncio.iscoroutine(ret):
        if loop.is_running():
            loop2 = asyncio.new_event_loop()
            try:
                ret = loop2.run_until_complete(ret)
            finally:
                loop2.close()
        else:
            ret = loop.run_until_complete(ret)
    return ret
What I am looking for is a safe way to synchronously get the result of a coroutine object in a multithreaded environment. self.get() can return a coroutine object for attributes I have set up to provide them. The issue I have found revolves around whether the event loop is running or not. After searching for a few hours on Stack Overflow and a few other sites, my (broken) solution is above. If the loop is running, I make a new event loop and run my coroutine in the new event loop. This works, except that the code hangs forever on the ret = loop2.run_until_complete(ret) line.
Right now, I have the following scenarios with results:
- result of self.get() is not a coroutine: returns results. [Good]
- result of self.get() is a coroutine & event loop is not running (basically in the same thread as the event loop): returns results. [Good]
- result of self.get() is a coroutine & event loop is running (basically in a different thread than the event loop): hangs forever waiting for results. [Bad]
Does anyone know how I can go about fixing the bad result so I can get the value I need? Thanks.
I hope I made some sense here.
I do have a good, and valid reason to be using threads; specifically I am using SQLAlchemy which is not async and I punt the SQLAlchemy code to a ThreadPoolExecutor to handle it safely. However, I need to be able to query these asynchronous attributes from within these threads for the SQLAlchemy code to get certain configuration values safely. And no, I won't switch away from SQLAlchemy to another system just in order to accomplish what I need, so please do not offer alternatives to it. The project is too far along to switch something so fundamental to it.
I tried using asyncio.run_coroutine_threadsafe() and loop.call_soon_threadsafe() and both failed. So far, this has gotten the farthest on making it work, I feel like I am just missing something obvious.
When I get a chance, I will write some code that provides an example of the problem.
Ok, I implemented an example case, and it worked the way I would expect. So it is likely my problem is elsewhere in the code. Leaving this open and will change the question to fit my real problem if I need to.
Does anyone have any possible ideas as to why a concurrent.futures.Future from asyncio.run_coroutine_threadsafe() would hang forever rather than return a result?
My example code that does not duplicate my error, unfortunately, is below:
import asyncio
import typing

loop = asyncio.get_event_loop()

class ConfigSimpleAttr:
    __slots__ = ('value', '_is_async')

    def __init__(
        self,
        value: typing.Any,
        is_async: bool=False
    ):
        self.value = value
        self._is_async = is_async

    async def _get_async(self):
        return self.value

    def __get__(self, inst, cls):
        if self._is_async and loop.is_running():
            return self._get_async()
        else:
            return self.value

class BaseConfig:
    __slots__ = ()
    attr1 = ConfigSimpleAttr(10, True)
    attr2 = ConfigSimpleAttr(20, True)

    def get(self, key: str, default: typing.Any=None) -> typing.Any:
        return getattr(self, key, default)

    def get_sync(self, key: str, default: typing.Any=None) -> typing.Any:
        ret = self.get(key, default)
        if asyncio.iscoroutine(ret):
            if loop.is_running():
                fut = asyncio.run_coroutine_threadsafe(ret, loop)
                print(fut, fut.running())
                ret = fut.result()
            else:
                ret = loop.run_until_complete(ret)
        return ret

config = BaseConfig()

def example_func():
    return config.get_sync('attr1')

async def main():
    a1 = await loop.run_in_executor(None, example_func)
    a2 = await config.attr2
    val = a1 + a2
    print('{a1} + {a2} = {val}'.format(a1=a1, a2=a2, val=val))
    return val

loop.run_until_complete(main())
This is the stripped down version of exactly what my code is doing, and the example works, even if my actual application doesn't. I am stuck as far as where to look for answers. Suggestions are welcome as to where to try to track down my "stuck forever" problem, even if my code above doesn't actually duplicate the problem.
It is very unlikely that you need to run several event loops at the same time, so this part looks quite wrong:
if loop.is_running():
    loop2 = asyncio.new_event_loop()
    try:
        ret = loop2.run_until_complete(ret)
    finally:
        loop2.close()
else:
    ret = loop.run_until_complete(ret)
Even testing whether the loop is running or not doesn't seem to be the right approach. It's probably better to explicitly pass the (only) running loop to get_sync and schedule the coroutine using run_coroutine_threadsafe:
def get_sync(self, key, loop, default=None):
    ret = self.get(key, default)
    if not asyncio.iscoroutine(ret):
        return ret
    future = asyncio.run_coroutine_threadsafe(ret, loop)
    return future.result()
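For reference, a self-contained sketch of this pattern (my own illustration, with the loop running in a background thread and the synchronous caller in the main thread):

import asyncio
import threading

async def fetch_value():
    await asyncio.sleep(0.1)
    return 42

def sync_caller(loop):
    # runs in a non-loop thread; schedule the coroutine on the running loop
    future = asyncio.run_coroutine_threadsafe(fetch_value(), loop)
    return future.result()  # blocks only this thread

loop = asyncio.new_event_loop()
t = threading.Thread(target=loop.run_forever)
t.start()
try:
    print(sync_caller(loop))  # prints 42
finally:
    loop.call_soon_threadsafe(loop.stop)
    t.join()
    loop.close()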
EDIT: Hanging problems can be related to tasks being scheduled on the wrong loop (e.g. forgetting about the optional loop argument when calling a coroutine). This kind of problem should be easier to debug with PR 303 (now merged): a RuntimeError is raised instead when the loop and the future don't match. So you might want to run your tests with the latest version of asyncio.
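A quick sketch (my own, not from the answer) of that failure mode: if the target loop never actually runs, the concurrent.futures.Future returned by run_coroutine_threadsafe simply never completes, which looks exactly like a hang:

import asyncio
import concurrent.futures

async def coro():
    return 1

dead_loop = asyncio.new_event_loop()  # created but never run
fut = asyncio.run_coroutine_threadsafe(coro(), dead_loop)
try:
    fut.result(timeout=1)
except concurrent.futures.TimeoutError:
    print("future never completes: the loop never executed the coroutine")
dead_loop.close()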
Ok, I got my code working, by taking a different approach to it. The problem was tied with using something that had file IO, which I was converting into a coroutine using loop.run_in_executor() on the file IO components. Then, I was trying to use this in a sync function being called from another thread, processed using another loop.run_in_executor() on that function. This is a very important routine in my code (called probably a million times or more during the execution of my short-running code), and I made a decision that my logic was just getting too complicated. So... I uncomplicated it. Now, if I want to use the file IO components asynchronously, I explicitly use my "get_async()" method, otherwise, I use my attribute through normal attribute access.
By removing the complexity of my logic, it made the code cleaner, easier to understand, and even more importantly, it actually works. While I am not 100% certain that I know the root cause of the issue (I believe it has something to do with a thread processing an attribute, which then in turn starts another thread that tries to read the attribute before it is processed, which caused something like a race condition and halting my code, but I could never duplicate the error outside of my application unfortunately to completely prove it out), I was able to get past it and continue with my development efforts.

NameError: name 'value' is not defined

(background first: I am NEW to programming and currently in my very first "intro to programming" class in college. This is our second assignment dealing with functions. So far functions have been a pain in the ass for me because they don't really make any sense. Ex: you can use miles_gas(gas) but then not use "miles_gas" anywhere else, and the program still runs?? Anyways.)
Okay, I've looked EVERYWHERE online for this and can't find an answer. Everything uses "exceptions" and "try" and all that advanced stuff. I'm NEW, so I have no idea what exceptions are, or try, nor do I care to use them, considering my teacher hasn't assigned anything like that yet.
My project is to make a program that gives you the assessment value and the property tax upon entering your property price. Here is the code I came up with (following the video from my class, as well as the book):
ASSESSMENT_VALUE = .60
TAX = 0.64

def main():
    price = float(input('Enter the property value: '))
    show_value(value)
    show_tax(tax)

def show_value():
    value = price * ASSESSMENT_VALUE
    print('Your properties assessment value is $', \
          format(value, ',.2f'), \
          sep='')

def show_tax(value, TAX):
    tax = value * TAX
    print('Your property tax will be $', \
          format(tax, ',.2f'), \
          sep='')

main()
Upon running it, I get it to ask "blah blah enter price:", so I enter the price, and then I get a huge red error saying:
Traceback (most recent call last):
  File "C:/Users/Gret/Desktop/chapter3/exercise6.py", line 41, in <module>
    main()
  File "C:/Users/Gret/Desktop/chapter3/exercise6.py", line 24, in main
    show_value(value)
NameError: name 'value' is not defined
But I DID define 'value'... so why is it giving me an error??
Python is lexically scoped. A variable defined in a function isn't visible outside the function. You need to return values from functions and assign them to variables in the scopes where you want to use the values. In your case, value is local to show_value.
When you define a function, it needs parameters to take in. You pass arguments in the brackets when you call the function, and when you define the function, you name the parameters that will receive them. I'll show you an example momentarily.
Basically what's happened is that you've passed the function an argument when calling it, but in your definition you don't have a parameter there, so it doesn't know what to do with it.
Change this line:
def show_value():
To this line:
def show_value(price):
And change the call show_value(value) to show_value(price).
For example, in this type of error:

def addition(a, b):
    c = a + b
    return c

addition()  # you're calling the function,
            # but not telling it the values of a and b

With your error:

def addition():
    c = a + b
    return c

addition(1, 2)  # you're giving it values, but it
                # has no idea to give those to a and b
The thing about functions is that their variables only exist inside the function, and the names of the parameters don't matter, only the order. I understand that's frustrating, but if you carry on programming with a more open mind about it, I guarantee you'll appreciate it. If you want to keep those values, you just need to return them at the end. You can return multiple variables by writing return c, a, b and writing the call like this: sum, number1, number2 = addition(1, 2)
Another problem is that I could call my addition function like this:
b = 1
a = 2
addition(b,a)
and now inside the function, a = 1 and b = 2, because it's not about the variable names; it's about the order I passed them to the function in.
You also don't need to pass TAX into show_tax, because TAX is already a global variable; it was defined outside a function, so it can be used anywhere. Additionally, you don't want to pass tax to show_tax, you want to pass value to it. But because show_value hasn't returned value, you've lost it. So return value in show_value and capture it in a variable, like so: value = show_value(price).
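Putting all of that together, a corrected version of the program could look like this (a sketch following the advice above, not the only possible fix):

ASSESSMENT_VALUE = 0.60
TAX = 0.64

def main():
    price = float(input('Enter the property value: '))
    value = show_value(price)  # capture the returned assessment value
    show_tax(value)

def show_value(price):
    value = price * ASSESSMENT_VALUE
    print("Your property's assessment value is $", format(value, ',.2f'), sep='')
    return value  # return it so main() can pass it on

def show_tax(value):
    tax = value * TAX  # TAX is a global constant, no need to pass it in
    print('Your property tax will be $', format(tax, ',.2f'), sep='')

main()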
