Val at object level and thread safety in Scala

Stumbled upon the following code in an existing codebase I am looking at. There are other similar calls which "set" values on myService, etc. I want to confirm that the following piece isn't thread-safe, given that myService is not local: if two threads enter createUser at the same time and both call myService.newUser at the same time, they will corrupt the subsequent persona.firstName and persona.lastName calls. Is this understanding correct?
object WFService {
  lazy private val myService = engine.getMyService

  def createUser(persona: Persona): String = {
    val user = myService.newUser(persona.id.toString)
    persona.firstName.map(n => user.setFirstName(n))
    persona.lastName.map(n => user.setLastName(n))
    // ...
  }
}

Lazy vals in Scala are thread safe [1]. You don't need to worry about multiple calls from different threads resulting in the RHS being executed twice.
Since you have an object, you only have one instance of WFService too.
[1] http://code-o-matic.blogspot.co.uk/2009/05/double-checked-locking-idiom-sweet-in.html

It is a val member, so there can be only one assignment to it; a second assignment would be a compile-time error.
As mentioned before, lazy vals in Scala are thread-safe. Please refer to:
Lazy Vals initialization

Related

Simplifying Init Method Python

Is there a better way of doing this?
def __init__(self, **kwargs):
    self.ServiceNo = kwargs["ServiceNo"]
    self.Operator = kwargs["Operator"]
    self.NextBus = kwargs["NextBus"]
    self.NextBus2 = kwargs["NextBus2"]
    self.NextBus3 = kwargs["NextBus3"]
The attributes (ServiceNo, Operator, ...) always exist.
That depends on what you mean by "simpler".
For example, is what you wrote simpler than what I would write, namely
def __init__(self, ServiceNo, Operator, NextBus, NextBus2, NextBus3):
    self.ServiceNo = ServiceNo
    self.Operator = Operator
    self.NextBus = NextBus
    self.NextBus2 = NextBus2
    self.NextBus3 = NextBus3
True, I've repeated each attribute name an additional time, but I've made it much clearer which arguments are legal for __init__. The caller is not free to add any additional keyword argument they like, only to see it silently ignored.
Of course, there's a lot of boilerplate here; that's something a dataclass can address:
from dataclasses import dataclass

@dataclass
class Foo:
    ServiceNo: int
    Operator: str
    NextBus: Bus
    NextBus2: Bus
    NextBus3: Bus
(Adjust the types as necessary.)
Now each attribute is mentioned once, and you get the __init__ method shown above for free.
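As a quick, self-contained check of that behaviour (using str for the bus fields here, since Bus isn't defined in this snippet):
from dataclasses import dataclass

@dataclass
class Foo:
    ServiceNo: int
    Operator: str
    NextBus: str
    NextBus2: str
    NextBus3: str

foo = Foo(ServiceNo=10, Operator="SBS", NextBus="3 min",
          NextBus2="9 min", NextBus3="15 min")
print(foo)  # Foo(ServiceNo=10, Operator='SBS', NextBus='3 min', ...)
# Unknown keywords now fail loudly instead of being silently ignored:
# Foo(ServiceNo=10, Operator="SBS", NextBus="3", NextBus2="9", NextBus3="15", Oops=1)
# -> TypeError: __init__() got an unexpected keyword argument 'Oops'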
Better how? You don't really describe what problem you're trying to solve.
If it's error handling, you can use the dictionary .get() method in case a key doesn't exist.
If you just want a more succinct way of initializing variables, you could remove the ** and keep the dictionary as a variable itself, then use it elsewhere in your code; but that depends on what your other methods are doing.
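For instance, a minimal sketch of the .get() variant (the implicit None defaults are just placeholders; pick whatever defaults make sense):
def __init__(self, **kwargs):
    # .get() returns None (or a supplied default) instead of raising KeyError
    self.ServiceNo = kwargs.get("ServiceNo")
    self.Operator = kwargs.get("Operator")
    self.NextBus = kwargs.get("NextBus")
    self.NextBus2 = kwargs.get("NextBus2")
    self.NextBus3 = kwargs.get("NextBus3")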
A hacky option that is available since the attribute and argument names match exactly is to copy directly from the kwargs dict into the instance's dict, then check that you got all the keys you expected, e.g.:
def __init__(self, **kwargs):
    vars(self).update(kwargs)
    if vars(self).keys() != {"ServiceNo", "Operator", "NextBus", "NextBus2", "NextBus3"}:
        raise TypeError(f"{type(self).__name__} missing required arguments")
I don't recommend this; chepner's options are all superior to this sort of hackery, and they're more reliable (for example, this solution fails if you use __slots__ to prevent autovivification of attributes, as the instance won't have a backing dict you can pull with vars).

QCheckbox issue [duplicate]

I am struggling to get this working.
I tried to transpose from a C++ post into Python, with no joy:
QMessageBox with a "Do not show this again" checkbox
My rough code goes like:
from PyQt5 import QtWidgets as qtw
...
mb = qtw.QMessageBox
cb = qtw.QCheckBox
# following 3 lines to get over runtime errors
# trying to pass the types it was asking for
# and surely messing up
mb.setCheckBox(mb(), cb())
cb.setText(cb(), "Don't show this message again")
cb.show(cb())
ret = mb.question(self,
                  'Close application',
                  'Do you really want to quit?',
                  mb.Yes | mb.No)
if ret == mb.No:
    return
self.close()
The above executes with no errors, but the checkbox isn't showing (the message box does).
Consider that I am genetically stupid... and slow, very slow.
So please go easy on my learning curve.
When trying to "port" code, it's important to know the basis of the source language and have a deeper knowledge of the target.
For instance, taking the first lines of your code and the referenced question:
QCheckBox *cb = new QCheckBox("Okay I understand");
The line above in C++ means that a new object (cb) of type QCheckBox is being created, and it's assigned the result of QCheckBox(...), which returns an instance of that class. To clarify how objects are declared, here's how a simple integer variable is created:
int mynumber = 10;
This is because C++, like many languages, requires the object type for its declaration.
In Python, which is a dynamically typed language, declaring the type is not required (though annotating it has been possible since Python 3.6), but you still need to create the instance, and this is achieved by using parentheses on the class (which results in calling it, causing both __new__ and then __init__ to run). The first two lines of your code should then be:
mb = qtw.QMessageBox()
cb = qtw.QCheckBox()
Then, the problem is that you're calling the other methods with new instances of the above classes every time.
An instance method (such as setCheckBox) is implicitly called with the instance as first argument, commonly known as self.
checkboxInstance = QCheckBox()
checkboxInstance.setText('My checkbox')
# is actually the result of:
QCheckBox.setText(checkboxInstance, 'My checkbox')
The last line means, more or less: call the setText function of the class QCheckBox, using the instance and the text as its arguments.
In fact, if QCheckBox were an actual Python class, setText() would look like this:
class QCheckBox:
    def setText(self, text):
        self.text = text
When you did cb = qtw.QCheckBox you only created another reference to the class, and every time you call cb() you create a new instance; the same happens for mb, since you created another reference to the message box class.
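The difference is easy to demonstrate without Qt at all (a minimal sketch):
class Box:
    pass

ref = Box          # just another name for the class itself
a = ref()          # calling it creates a new instance
b = ref()          # ...and this creates a second, unrelated instance
print(a is b)      # False: two distinct objects
print(ref is Box)  # True: ref and Box are the same class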
The following line:
mb.setCheckBox(mb(), cb())
is the same as:
QMessageBox.setCheckBox(QMessageBox(), QCheckBox())
Since you're creating new instances every time, the result is absolutely nothing: there's no reference to the new instances, and they will get immediately discarded ("garbage collected", aka, deleted) after that line is processed.
This is how the above should actually be done:
mb = qtw.QMessageBox()
cb = qtw.QCheckBox()
mb.setCheckBox(cb)
cb.setText("Don't show this message again")
Now, there's a fundamental flaw in your code: question() is a static method (actually, for Python, it's more of a class method). Static and class methods are functions that don't act on an instance, but only on/for a class. Static methods of QMessageBox like question or warning create a new instance of QMessageBox using the provided arguments, so everything you've done before on the instance you created is completely ignored.
These methods are convenience functions that allow simple creation of message boxes without the need to write too much code. Since those methods only allow customization based on their arguments (which don't include adding a check box), you obviously cannot use them, and you must code what they do "under the hood" explicitly.
Here is how the final code should look:
# create the dialog with a parent, which will make it *modal*
mb = qtw.QMessageBox(self)
mb.setWindowTitle('Close application')
mb.setText('Do you really want to quit?')
# you can set the text on a checkbox directly from its constructor
cb = qtw.QCheckBox("Don't show this message again")
mb.setCheckBox(cb)
mb.setStandardButtons(mb.Yes | mb.No)
ret = mb.exec_()
# call some function that stores the checkbox state
self.storeCloseWarning(cb.isChecked())
if ret == mb.No:
    return
self.close()

Multi-threading PySpark, Could not serialize object exception

_pickle.PicklingError: Could not serialize object:
Exception: It appears that you are attempting to reference SparkContext from a broadcast variable, action, or transformation.
SparkContext can only be used on the driver, not in code that it run on workers. For more information, see SPARK-5063.
A super simple EXAMPLE app to try and run some calculations in parallel. It works (sometimes), but most times it crashes with the above exception.
I don't think I have a nested RDD, but the part about not being able to use the SparkContext in workers is worrisome, since I think I need that to achieve some level of parallelism. If I can't use the SparkContext in the worker threads, how do I get the computational results back?
At this point I still expect it to run serially, and I was going to enable the parallel run after this. But I can't even get the serial multi-threaded version to run...
from pyspark import SparkContext
import threading

THREADED = True  # Set this to False and it always works, but is sequential

content_file = "file:///usr/local/Cellar/apache-spark/3.0.0/README.md"
sc = SparkContext("local", "first app")
content = sc.textFile(content_file).cache()  # For the non-threaded version

class Worker(threading.Thread):
    def __init__(self, letter, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.letter = letter

    def run(self):
        print(f"Starting: {self.letter}")
        nums[self.letter] = content.filter(lambda s: self.letter in s).count()  # SPOILER self.letter turns out to be the problem
        print(f"{self.letter}: {nums[self.letter]}")

nums = {}
if THREADED:
    threads = []
    for char in range(ord('a'), ord('z')+1):
        letter = chr(char)
        threads.append(Worker(letter, name=letter))
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
else:
    for char in range(ord('a'), ord('z')+1):
        letter = chr(char)
        nums[letter] = content.filter(lambda s: letter in s).count()
        print(f"{letter}: {nums[letter]}")
print(nums)
Even when I change the code to use one thread at a time:
threads = []
for char in range(ord('a'), ord('z')+1):
    letter = chr(char)
    thread = Worker(letter, name=letter)
    threads.append(thread)
    thread.start()
    thread.join()
It raises the same exception, I guess because it is trying to get the results back in a worker thread and not the main thread (where the SparkContext is declared).
I need to be able to wait on several values simultaneously if spark is going to provide any benefit here.
The real problem I'm trying to solve looks like this:
        __________RESULT_________
       ^            ^            ^
       A            B            C
    a1 ^ a2      b1 ^ b2      c1 ^ c2 ...
To get my result I want to calculate A, B, and C in parallel, and each of those pieces will have to calculate a1, a2, a3, ... in parallel. I'm breaking it into threads so I can request multiple values simultaneously, so that Spark can run the computation in parallel.
I created the sample above simply because I want to get the threading correct; I'm not trying to figure out how to count the number of lines with a character in it. But this seemed like a super simple way to vet the threading aspect.
This little change fixes things right up. self.letter was blowing up in the lambda; dereferencing it before the filter call removed the crash:
def run(self):
    print(f"Starting: {self.letter}")
    letter = self.letter
    nums[self.letter] = content.filter(lambda s: letter in s).count()
    print(f"{self.letter}: {nums[self.letter]}")
The Exception says
It appears that you are attempting to reference SparkContext from a
broadcast variable, action, or transformation
In your case the reference to the SparkContext is held by the following line:
nums[self.letter] = content.filter(lambda s: self.letter in s).count()
In this line, you define a filter (which counts as a transformation) using the following lambda expression:
lambda s: self.letter in s
The problem with this expression is that you reference the member variable letter of the object reference self. To make this reference available during the execution of your batch, Spark needs to serialize the object self. But self does not travel alone: serializing it ends up dragging in content as well, and content is a Spark RDD (and every Spark RDD holds a reference to the SparkContext it was created from).
To make the lambda serializable, you have to ensure not to reference anything that is not serializable inside it. The easiest way to achieve that, given your example, is to define a local variable based on the member letter:
def run(self):
    print(f"Starting: {self.letter}")
    letter = self.letter
    nums[self.letter] = content.filter(lambda s: letter in s).count()
    print(f"{self.letter}: {nums[self.letter]}")
The Why
To understand why we can't do this, we have to understand what Spark does with every transformation in the background.
Whenever you have some piece of code like this:
sc = SparkContext(<connection information>)
You're creating a "Connection" to the Spark-Master. It may be a simple in-process local Spark-Master or a Spark-Master running on a whole different server.
Given the SparkContext-Object, we can define where our pipeline should get it's data from. For this example, let's say we want to read our data from a text-file (just like in your question:
rdd = sc.textFile("file:///usr/local/Cellar/apache-spark/3.0.0/README.md")
As I mentioned before, the SparkContext is more or less a "Connection" to the Spark-Master. The URL we specify as the location of our text-file must be accessable from the Spark-Master, not from the system you're executing the python-script on!
Based on the Spark-RDD we created, we can now define how the data should be processed. Let's say we want to count only lines that contain a given string "Hello World":
linesThatContainHelloWorld = rdd.filter(lambda line: "Hello World" in line).count()
What Spark does once we call a terminal function (a computation that yields a result, like count() in this case) is serialize the function we passed to filter, transfer the serialized data to the Spark workers (which may run on a totally different server), and have those workers deserialize the function so they can execute it.
That means that this piece of code, lambda line: "Hello World" in line, will actually not be executed inside the Python process you're currently in, but on the Spark workers.
Things start to get trickier (for Spark) whenever we reference a variable from the enclosing scope inside one of our transformations:
stringThatALineShouldContain = "Hello World"
linesThatContainHelloWorld = rdd.filter(lambda line: stringThatALineShouldContain in line).count()
Now Spark not only has to serialize the given function, but also the referenced variable stringThatALineShouldContain from the enclosing scope. In this simple example that's no problem, since the variable is serializable.
But whenever we try to access something that is not serializable, or that simply holds a reference to something that is not serializable, Spark will complain.
For example:
stringThatALineShouldContain = "Hello World"
badExample = (sc, stringThatALineShouldContain) # tuple holding a reference to the SparkContext
linesThatContainHelloWorld = rdd.filter(lambda line: badExample[1] in line).count()
Since the function now references badExample, Spark tries to serialize this variable and complains that it holds a reference to the SparkContext.
This applies not only to the SparkContext, but to everything that is not serializable, such as connection objects to databases, file handles, and many more.
If, for any reason, you have to do something like this, you should only reference an object that contains the information needed to create that unserializable object.
An example
Invalid example
dbConnection = MySQLConnection("mysql.example.com")  # Not sure if this class exists, only for the example
rdd.filter(lambda line: dbConnection.insertIfNotExists("INSERT INTO table (col) VALUES (?)", line))
Valid example
# note that this is still "bad code", since the connection is never closed. But I hope you get the idea
class LazyMySQLConnection:
    connectionString = None
    actualConnection = None

    def __init__(self, connectionString):
        self.connectionString = connectionString

    def __getstate__(self):
        # tell pickle (the serialization library Spark uses for transformations)
        # that the actualConnection member is not part of the state
        state = dict(self.__dict__)
        state.pop("actualConnection", None)  # it may not have been created yet, so don't assume the key exists
        return state

    def getOrCreateConnection(self):
        if not self.actualConnection:
            self.actualConnection = MySQLConnection(self.connectionString)
        return self.actualConnection

lazyDbConnection = LazyMySQLConnection("mysql.example.com")
rdd.filter(lambda line: lazyDbConnection.getOrCreateConnection().insertIfNotExists("INSERT INTO table (col) VALUES (?)", line))
# remember, the lambda we supplied for the filter will be executed on the Spark workers, so the connection will be established from each Spark worker!
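As an aside (not part of the original answer): a common Spark idiom for this kind of per-record resource usage is mapPartitions, which lets each worker create one connection per partition instead of shipping a connection-like object around at all. A rough sketch, still using the hypothetical MySQLConnection from above:
def insertPartition(lines):
    # runs on the worker: one connection per partition, created locally
    connection = MySQLConnection("mysql.example.com")
    for line in lines:
        connection.insertIfNotExists("INSERT INTO table (col) VALUES (?)", line)
        yield line

rdd.mapPartitions(insertPartition).count()  # count() just forces evaluation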
You're trying to use (Py)Spark in a way it is not intended to be used. You're mixing up plain-Python data processing with Spark processing where you could rely completely on Spark.
The idea with Spark (and other data processing frameworks) is that you define how your data should be processed, and all the multithreading + distribution stuff is just an independent "configuration".
Also, I don't really see what you would gain by using multiple threads.
Every Thread would:
Have to read every single character from your input file
Check if the current line contains the letter that was assigned to this thread
Count
This would (if it worked) yield a correct result, sure, but it is inefficient: there would be many threads fighting over read operations on that file (remember, every thread would have to read the COMPLETE file in the first place, to be able to filter based on its assigned letter).
Work with spark, not against it, to get the most out of it.
# imports and so on
content_file = "file:///usr/local/Cellar/apache-spark/3.0.0/README.md"
sc = SparkContext("local", "first app")
rdd = sc.textFile(content_file) # read from this file
rdd = rdd.flatMap(lambda line: [letter for letter in line]) # forward every letter of each line to the next operator
# initialize the letterRange "outside" of spark so we reduce the runtime-overhead
relevantLetterRange = [chr(char) for char in range(ord('a'), ord('z') + 1)]
rdd = rdd.filter(lambda letter: letter in relevantLetterRange)
rdd = rdd.keyBy(lambda letter: letter) # key by the letter itself
countsByKey = rdd.countByKey() # count by key
You can of course simply write this as one chain:
# imports and so on
content_file = "file:///usr/local/Cellar/apache-spark/3.0.0/README.md"
sc = SparkContext("local", "first app")
relevantLetterRange = [chr(char) for char in range(ord('a'), ord('z') + 1)]
countsByKey = sc.textFile(content_file) \
    .flatMap(lambda line: [letter for letter in line]) \
    .filter(lambda letter: letter in relevantLetterRange) \
    .keyBy(lambda letter: letter) \
    .countByKey()
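countByKey() is an action, so it returns the counts to the driver as an ordinary dict-like object, which you can inspect directly:
for letter in sorted(countsByKey):
    print(f"{letter}: {countsByKey[letter]}")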

Multithread not finding first target function python

So I have a peculiar situation where I'm multi-threading a set of functions, with some of these spawning further threads. In one such case, the first thread in a block of three fails to execute its target function and instead produces this error:
NameError: name '<FUNCTION_NAME>' is not defined
This is the code:
threadA = self.pool.apply_async(functionA)
threadB = self.pool.apply_async(functionB)
threadC = self.pool.apply_async(functionC)
valueA = threadA.get()
valueB = threadB.get()
valueC = threadC.get()
And the relevant function is defined above.
If I switch the order of these thread assignments, the first will produce the NameError. E.g., if threadB was assigned first, the error would be:
NameError: name 'functionB' is not defined
There's plenty of other threading going on while this is happening so I'm not sure if it's a resource issue.
Edit:
I'm using multiprocessing.pool.ThreadPool, not processes.
Any help would be great,
Cheers :)

Why does scala hang evaluating a by-name parameter in a Future?

The below (contrived) code attempts to print a by-name String parameter within a future, and return when the printing is complete.
import scala.concurrent._
import concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

class PrintValueAndWait {
  def printIt(param: => String): Unit = {
    val printingComplete = future {
      println(param) // why does this hang?
    }
    Await.result(printingComplete, Duration.Inf)
  }
}

object Go {
  val str = "Rabbits"
  new PrintValueAndWait().printIt(str)
}

object RunMe extends App {
  Go
}
However, when running RunMe, it simply hangs while trying to evaluate param. Changing printIt to take its parameter by value makes the application return as expected. Alternatively, changing printIt to simply print the value and return synchronously (in the same thread) works fine as well.
What's happening exactly here? Is this somehow related to the Go object not having been fully constructed yet, and so the str field not being visible yet to the thread attempting to print it? Is hanging the expected behaviour here?
I've tested with Scala 2.10.3 on both Mac OS Mavericks and Windows 7, on Java 1.7.
Your code is deadlocking on the initialization of the Go object. This is a known issue; see e.g. SI-7646 and this SO question.
Objects in Scala are lazily initialized, and a lock is taken during this time to prevent two threads from racing to initialize the object. However, if two threads simultaneously try to initialize an object and one depends on the other to complete, there will be a circular dependency and a deadlock.
In this particular case, the initialization of the Go object can only complete once new PrintValueAndWait().printIt(str) has completed. However, when param is a by-name argument, essentially a code block gets passed in, which is evaluated each time it is used. In this case the str argument in new PrintValueAndWait().printIt(str) is shorthand for Go.str, so when the thread the future runs on tries to evaluate param, it is essentially calling Go.str. But since Go hasn't completed initialization yet, it will try to initialize the Go object too. The other thread initializing Go holds the lock on its initialization, so the future thread blocks. So the first thread is waiting on the future to complete before it finishes initializing, and the future thread is waiting for the first thread to finish initializing: deadlock.
In the by-value case, the string value of str is passed in directly, so the future thread doesn't try to initialize Go and there is no deadlock.
Similarly, if you leave param as by-name but change Go as follows:
object Go {
  val str = "Rabbits"

  {
    val s = str
    new PrintValueAndWait().printIt(s)
  }
}
it won't deadlock, since the already evaluated local string value s is passed in, instead of Go.str, so the future thread won't try and initialize Go.
