Multi-threading PySpark: "Could not serialize object" exception

_pickle.PicklingError: Could not serialize object:
Exception: It appears that you are attempting to reference SparkContext from a broadcast variable, action, or transformation.
SparkContext can only be used on the driver, not in code that it run on workers. For more information, see SPARK-5063.
Here is a super simple EXAMPLE app that tries to run some calculations in parallel. It works (sometimes), but most times it crashes with the above exception.
I don't think I have a nested RDD, but the part about not being able to use the SparkContext in workers is worrisome, since I think I need it to achieve some level of parallelism. If I can't use the SparkContext in the worker threads, how do I get the computational results back?
At this point I still expect execution to be effectively serial, and I was going to enable the parallel run after this. But I can't even get the serial, multi-threaded version to run...
from pyspark import SparkContext
import threading

THREADED = True  # Set this to False and it always works, but is sequential

content_file = "file:///usr/local/Cellar/apache-spark/3.0.0/README.md"
sc = SparkContext("local", "first app")
content = sc.textFile(content_file).cache()  # For the non-threaded version

class Worker(threading.Thread):
    def __init__(self, letter, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.letter = letter

    def run(self):
        print(f"Starting: {self.letter}")
        nums[self.letter] = content.filter(lambda s: self.letter in s).count()  # SPOILER: self.letter turns out to be the problem
        print(f"{self.letter}: {nums[self.letter]}")

nums = {}
if THREADED:
    threads = []
    for char in range(ord('a'), ord('z') + 1):
        letter = chr(char)
        threads.append(Worker(letter, name=letter))
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
else:
    for char in range(ord('a'), ord('z') + 1):
        letter = chr(char)
        nums[letter] = content.filter(lambda s: letter in s).count()
        print(f"{letter}: {nums[letter]}")
print(nums)
Even when I change the code to use one thread at a time
threads = []
for char in range(ord('a'), ord('z') + 1):
    letter = chr(char)
    thread = Worker(letter, name=letter)
    threads.append(thread)
    thread.start()
    thread.join()
It raises the same exception, I guess because it is trying to get the results back in a worker thread and not the main thread (where the SparkContext is declared).
I need to be able to wait on several values simultaneously if Spark is going to provide any benefit here.
The real problem I'm trying to solve looks like this:
              RESULT
           /    |    \
          A     B     C
        a1 a2  b1 b2  c1 c2 ...
To get my result I want to calculate A, B, and C in parallel, and each of those pieces will have to calculate a1, a2, a3, ... in parallel. I'm breaking it into threads so I can request multiple values simultaneously and Spark can run the computation in parallel.
I created the sample above simply because I want to get the threading correct; I'm not trying to figure out how to count the number of lines containing a given character. It just seemed like a super simple way to vet the threading aspect.
This little change fixes things right up. self.letter was blowing up in the lambda; dereferencing it into a local variable before the filter call removed the crash:
def run(self):
    print(f"Starting: {self.letter}")
    letter = self.letter
    nums[self.letter] = content.filter(lambda s: letter in s).count()
    print(f"{self.letter}: {nums[self.letter]}")

The Exception says
It appears that you are attempting to reference SparkContext from a
broadcast variable, action, or transformation
In your case the reference to the SparkContext is held by the following line:
nums[self.letter] = self.content.filter(lambda s: self.letter in s).count()
In this line, you define a filter (which counts as a transformation) using the following lambda expression:
lambda s: self.letter in s
The problem with this expression is that you reference the member variable letter of the object reference self. To make this reference available during the execution of your batch, Spark needs to serialize the object self. But this object holds not only the member letter, but also content, which is a Spark-RDD (and every Spark-RDD holds a reference to the SparkContext it was created from).
To make the lambda serializable, you have to ensure not to reference anything that is not serializable inside it. The easiest way to achieve that, given your example, is to define a local variable based on the member letter:
def run(self):
    print(f"Starting: {self.letter}")
    letter = self.letter
    nums[self.letter] = self.content.filter(lambda s: letter in s).count()
    print(f"{self.letter}: {nums[self.letter]}")
The Why
To understand why we can't do this, we have to understand what Spark does with every transformation in the background.
Whenever you have some piece of code like this:
sc = SparkContext(<connection information>)
You're creating a "Connection" to the Spark-Master. It may be a simple in-process local Spark-Master or a Spark-Master running on a whole different server.
Given the SparkContext object, we can define where our pipeline should get its data from. For this example, let's say we want to read our data from a text file (just like in your question):
rdd = sc.textFile("file:///usr/local/Cellar/apache-spark/3.0.0/README.md")
As I mentioned before, the SparkContext is more or less a "Connection" to the Spark-Master. The URL we specify as the location of our text-file must be accessible from the Spark-Master, not from the system you're executing the python-script on!
Based on the Spark-RDD we created, we can now define how the data should be processed. Let's say we want to count only lines that contain a given string "Hello World":
linesThatContainHelloWorld = rdd.filter(lambda line: "Hello World" in line).count()
What Spark does once we call a terminal function (an action that yields a result, like count() in this case) is that it serializes the function we passed to filter, transfers the serialized data to the Spark-Workers (which may run on a totally different server), and these Spark-Workers deserialize that function in order to execute it.
That means that this piece of code: lambda line: "Hello World" in line will actually not be executed inside the Python-Process you're currently in, but on the Spark-Workers.
Things start to get trickier (for Spark) whenever we reference a variable from the upper scope inside one of our transformations:
stringThatALineShouldContain = "Hello World"
linesThatContainHelloWorld = rdd.filter(lambda line: stringThatALineShouldContain in line).count()
Now, Spark not only has to serialize the given function, but also the referenced variable stringThatALineShouldContain from the upper scope. In this simple example, this is no problem, since the variable stringThatALineShouldContain is serializable.
But whenever we try to access something that is not serializable, or that simply holds a reference to something that is not serializable, Spark will complain.
For example:
stringThatALineShouldContain = "Hello World"
badExample = (sc, stringThatALineShouldContain) # tuple holding a reference to the SparkContext
linesThatContainHelloWorld = rdd.filter(lambda line: badExample[1] in line).count()
Since the function now references badExample, Spark tries to serialize this variable and complains that it holds a reference to the SparkContext.
This not only applies to the SparkContext, but to everything that is not serializable, such as Connection-Objects to Databases, File-Handles and many more.
If, for any reason, you have to do something like this, you should only reference an object that contains information about how to create that unserializable object.
An example
Invalid example
dbConnection = MySQLConnection("mysql.example.com")  # Not sure if this class exists, only for the example
rdd.filter(lambda line: dbConnection.insertIfNotExists("INSERT INTO table (col) VALUES (?)", line))
Valid example
# note that this is still "bad code", since the connection is never closed. But I hope you get the idea
class LazyMySQLConnection:
    connectionString = None
    actualConnection = None

    def __init__(self, connectionString):
        self.connectionString = connectionString

    def __getstate__(self):
        # tell pickle (the serialization library Spark uses for transformations)
        # that the actualConnection member is not part of the state
        state = dict(self.__dict__)
        state.pop("actualConnection", None)  # safe even if the connection was never created
        return state

    def getOrCreateConnection(self):
        if not self.actualConnection:
            self.actualConnection = MySQLConnection(self.connectionString)
        return self.actualConnection

lazyDbConnection = LazyMySQLConnection("mysql.example.com")
rdd.filter(lambda line: lazyDbConnection.getOrCreateConnection().insertIfNotExists("INSERT INTO table (col) VALUES (?)", line))
# remember, the lambda we supplied for the filter will be executed on the Spark-Workers,
# so the connection will be established from each Spark-Worker!
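A common alternative, sketched here with the same hypothetical MySQLConnection class, is to open one connection per partition with mapPartitions instead of carrying lazy per-object state:

def insert_partition(lines):
    # Executed on a Spark-Worker: one connection per partition, not per record.
    connection = MySQLConnection("mysql.example.com")  # hypothetical class, as above
    for line in lines:
        connection.insertIfNotExists("INSERT INTO table (col) VALUES (?)", line)
    return []  # nothing to emit; we only care about the side effect

rdd.mapPartitions(insert_partition).count()  # count() just forces execution

foreachPartition would express the same intent without the dummy count(), but the idea is identical: create the unserializable resource on the worker instead of shipping it there.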

You're trying to use (Py)Spark in a way it is not intended to be used. You're mixing up plain-Python data processing with Spark processing where you could rely completely on Spark.
The idea with Spark (and other data processing frameworks) is that you define how your data should be processed, and all the multithreading and distribution work is just an independent "configuration".
Also, I don't really see what you would like to gain by using multiple threads.
Every Thread would:
Have to read every single character from your input file
Check if the current line contains the letter that was assigned to this thread
Count
This would (if it worked) yield a correct result, sure, but it is inefficient, since many threads would be fighting over read operations on that file (remember, every thread would have to read the COMPLETE file in the first place to be able to filter based on its assigned letter).
Work with Spark, not against it, to get the most out of it.
# imports and so on
content_file = "file:///usr/local/Cellar/apache-spark/3.0.0/README.md"
sc = SparkContext("local", "first app")
rdd = sc.textFile(content_file) # read from this file
rdd = rdd.flatMap(lambda line: [letter for letter in line]) # forward every letter of each line to the next operator
# initialize the letterRange "outside" of spark so we reduce the runtime-overhead
relevantLetterRange = [chr(char) for char in range(ord('a'), ord('z') + 1)]
rdd = rdd.filter(lambda letter: letter in relevantLetterRange)
rdd = rdd.keyBy(lambda letter: letter) # key by the letter itself
countsByKey = rdd.countByKey() # count by key
You can of course simply write this in one chain:
# imports and so on
content_file = "file:///usr/local/Cellar/apache-spark/3.0.0/README.md"
sc = SparkContext("local", "first app")
relevantLetterRange = [chr(char) for char in range(ord('a'), ord('z') + 1)]
countsByKey = sc.textFile(content_file)\
    .flatMap(lambda line: [letter for letter in line])\
    .filter(lambda letter: letter in relevantLetterRange)\
    .keyBy(lambda letter: letter)\
    .countByKey()
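countByKey is an action, so the result is already plain Python data on the driver; for example (the counts shown are placeholders, the real values depend on the file):

print(dict(countsByKey))  # e.g. {'a': 123, 'b': 45, ...}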

Related

Shared memory and how to access a global variable from within a class in Python, with multiprocessing?

I am currently developing some code that deals with big multidimensional arrays. Of course, Python gets very slow if you try to perform these computations in a serialized manner. Therefore, I got into code parallelization, and one of the possible solutions I found has to do with the multiprocessing library.
What I have come up with so far is first dividing the big array into smaller chunks and then doing some operation on each of those chunks in parallel, using a Pool of workers from multiprocessing. For that to be efficient, and based on this answer, I believe that I should use a shared memory array object defined as a global variable, to avoid copying it every time a process from the pool is called.
Here I add some minimal example of what I'm trying to do, to illustrate the issue:
import numpy as np
from functools import partial
import multiprocessing as mp
import ctypes

class Trials:
    # Perform computation along first dimension of shared array, representing the chunks
    def Compute(i, shared_array):
        shared_array[i] = shared_array[i] + 2

    # The function you actually call
    def DoSomething(self):
        # Initializer function for Pool, should define the global variable shared_array
        # I have also tried putting this function outside DoSomething, as a part of the class,
        # with the same results
        def initialize(base, State):
            global shared_array
            shared_array = np.ctypeslib.as_array(base.get_obj()).reshape(125, 100, 100) + State

        base = mp.Array(ctypes.c_float, 125*100*100)  # Create base array
        state = np.random.rand(125, 100, 100)  # Create seed

        # Initialize pool of workers and perform calculations
        with mp.Pool(processes=10,
                     initializer=initialize,
                     initargs=(base, state,)) as pool:
            run = partial(self.Compute,
                          shared_array=shared_array)  # Here the error says that shared_array is not defined
            pool.map(run, np.arange(125))
            pool.close()
            pool.join()
        print(shared_array)

if __name__ == '__main__':
    Trials = Trials()
    Trials.DoSomething()
The trouble I am encountering is that when I define the partial function, I get the following error:
NameError: name 'shared_array' is not defined
From what I understand, I think that means I cannot access the global variable shared_array. I'm sure that the initialize function is executing, as putting a print statement inside of it produces output in the terminal.
What am I doing incorrectly? Is there any way to solve this issue?
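For reference, a minimal sketch of the initializer-plus-global pattern the example is aiming for: the worker reads the module-level global set by the initializer instead of receiving the array through partial (names and shapes mirror the question):

import ctypes
import multiprocessing as mp
import numpy as np

def initialize(base):
    # Runs once in every worker process; exposes the shared buffer as a global.
    global shared_array
    shared_array = np.ctypeslib.as_array(base.get_obj()).reshape(125, 100, 100)

def compute(i):
    # Uses the global set by initialize(); nothing large is pickled per task.
    shared_array[i] = shared_array[i] + 2

if __name__ == '__main__':
    base = mp.Array(ctypes.c_float, 125 * 100 * 100)
    # Seed the shared buffer once in the parent before starting the workers.
    np.ctypeslib.as_array(base.get_obj()).reshape(125, 100, 100)[:] = np.random.rand(125, 100, 100)
    with mp.Pool(processes=4, initializer=initialize, initargs=(base,)) as pool:
        pool.map(compute, range(125))
    # The workers wrote into shared memory, so the parent sees the updates.
    print(np.ctypeslib.as_array(base.get_obj()).reshape(125, 100, 100))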

.get_dummies() works alone but doesn't save within function

I have a dataset and I want to make a function that does the .get_dummies() so I can use it in a pipeline for specific columns.
When I run dataset = pd.get_dummies(dataset, columns=['Embarked','Sex'], drop_first=True)
alone, it works: when I run df.head() I can still see the dummified columns. But when I have a function like this,
def dummies(df):
    df = pd.get_dummies(df, columns=['Embarked', 'Sex'], drop_first=True)
    return df
Once I run dummies(dataset) it shows me the dummified columns in that same cell, but when I try dataset.head() it isn't dummified anymore.
What am I doing wrong?
Thanks.
You should assign the result of the function back to your dataframe; call the function like:
dataset=dummies(dataset)
Functions have their own independent namespace for variables defined there, either in the signature or inside the body.
For example:
a = 0

def fun(a):
    a = 23
    return a

fun(a)
print("a is", a)  # a is 0
Here you might think that a will have the value 23 at the end, but that is not the case, because the a inside fun is not the same a as the one outside. When you call fun(a), you pass into the function a reference to the real object that is somewhere in memory, so the a inside will have the same reference and thus the same value.
With a = 23 you're changing what the inner a points to, which in this example is 23.
And with fun(a) the function itself returns a value, but since that value is not saved anywhere, the result gets lost.
To update the variable outside, you need to reassign it to the result of the function:
a = 0

def fun(a):
    a = 23
    return a

a = fun(a)
print("a is", a)  # a is 23
which in your case would be dataset = dummies(dataset).
If you want your function to make changes in place to the object it receives, you can't use =; you need to use something that the object itself provides to allow modifications in place. For example,
this would not work:
a = []

def fun2(a):
    a = [23]
    return a

fun2(a)
print("a is", a)  # a is []
but this would
a = []

def fun2(a):
    a.append(23)
    return a

fun2(a)
print("a is", a)  # a is [23]
because we are using an in-place modification method that the object provides; in this example, that is the append method of list.
But such in-place modification can lead to unforeseen results, especially if the object being modified is shared between threads or processes, so I would rather recommend the previous approach.
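Applied back to the pandas example, reassign the returned frame (or, equivalently, chain it with pipe); a minimal sketch using the column names from the question:

import pandas as pd

def dummies(df):
    return pd.get_dummies(df, columns=['Embarked', 'Sex'], drop_first=True)

dataset = dummies(dataset)        # reassign the returned frame
# ...or, equivalently, as part of a method chain:
# dataset = dataset.pipe(dummies)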

Access Violation when using Ctypes to Interface with Fortran DLL

I have a set of dlls created from Fortran that I am running from python. I've successfully created a wrapper class and have been running the dlls fine for weeks.
Today I noticed an error in my input and changed it, but to my surprise this caused the following:
OSError: exception: access violation reading 0x705206C8
It seems that certain input values somehow cause me to try to access illegal data. I created the following MCVE and it does reproduce the issue. Specifically, an error is thrown when 338 < R_o < 361. Unfortunately I cannot publish the raw Fortran code, nor create an MCVE which replicates the problem and is sufficiently abstracted that I could share it. All of the variables are declared as either integer or real(8) types in the Fortran code.
import ctypes
import os

DLL_PATH = "C:\Repos\CASS_PFM\dlls"

class wrapper:
    def __init__(self, data):
        self.data = data
        self.DLL = ctypes.CDLL(os.path.join(DLL_PATH, "MyDLL.dll"))
        self.fortran_subroutine = getattr(self.DLL, "MyFunction_".lower())
        self.output = {}

    def run(self):
        out = (ctypes.c_longdouble * len(self.data))()
        in_data = []
        for item in self.data:
            item.convert_to_ctypes()
            in_data.append(ctypes.byref(item.c_val))
        self.fortran_subroutine(*in_data, out)
        for item in self.data:
            self.output[item.name] = item.convert_to_python()

class FortranData:
    def __init__(self, name, py_val, ctype, some_param=True):
        self.name = name
        self.py_val = py_val
        self.ctype = ctype
        self.some_param = some_param

    def convert_to_ctypes(self):
        ctype_converter = getattr(ctypes, self.ctype)
        self.c_val = ctype_converter(self.py_val)
        return self.c_val

    def convert_to_python(self):
        self.py_val = self.c_val.value
        return self.py_val

def main():
    R_o = 350
    data = [
        FortranData("R_o", R_o, 'c_double', False),
        FortranData("thick", 57.15, 'c_double', False),
        FortranData("axial_c", 100, 'c_double', False),
        FortranData("sigy", 235.81, 'c_double', False),
        FortranData("sigu", 619.17, 'c_double', False),
        FortranData("RO_alpha", 1.49707, 'c_double', False),
        FortranData("RO_sigo", 235.81, 'c_double', False),
        FortranData("RO_epso", 0.001336, 'c_double', False),
        FortranData("RO_n", 6.6, 'c_double', False),
        FortranData("Resist_Jic", 116, 'c_double', False),
        FortranData("Resist_C", 104.02, 'c_double', False),
        FortranData("Resist_m", 0.28, 'c_double', False),
        FortranData("pressure", 15.51375, 'c_double', False),
        FortranData("i_write", 0, 'c_int', False),
        FortranData("if_flag_twc", 0, 'c_int',),
        FortranData("i_twc_ll", 0, 'c_int',),
        FortranData("i_twc_epfm", 0, 'c_int',),
        FortranData("i_err_code", 0, 'c_int',),
        FortranData("Axial_TWC_ratio", 0, 'c_double',),
        FortranData("Axial_TWC_fail", 0, 'c_int',),
        FortranData("c_max_ll", 0, 'c_double',),
        FortranData("c_max_epfm", 0, 'c_double',)
    ]
    obj = wrapper(data)
    obj.run()
    print(obj.output)

if __name__ == "__main__": main()
It's not just the R_o value either; there are some combinations of values that cause the same error (seemingly without rhyme or reason). Is there anything within the above Python that might lead to an access violation depending on the values passed to the DLL?
Python version is 3.7.2, 32-bit
I see 2 problems with the code (and a potential 3rd one):
argtypes (and restype) not being specified. Check [SO]: C function called from Python via ctypes returns incorrect value (#CristiFati's answer) for more details. A short sketch of declaring them follows at the end of this answer.
This may be a consequence of (or at least it's closely related to) the previous one. I can only guess without the Fortran (or better: C) function prototype, but anyway there is certainly something wrong. I assume that for the input data things should be the same as for the output data, so the function would take 2 arrays (same size), and the input one's elements would be void*s (since their type is not consistent). Then, you'd need something like this (although I can't imagine how Fortran would know which element contains an int and which a double):
in_data = (ctypes.c_void_p * len(self.data))()
for idx, item in enumerate(self.data):
    item.convert_to_ctypes()
    in_data[idx] = ctypes.addressof(item.c_val)
Since you're on 32-bit, you should also take the calling convention into account (ctypes.CDLL vs ctypes.WinDLL).
But again, without the function prototype, everything is just speculation.
Also, why "MyFunction_".lower() instead of "myfunction_"?
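To illustrate the first point, here is a minimal sketch of declaring the signature up front; the exact argument list is an assumption, since the Fortran prototype isn't shown:

import ctypes

dll = ctypes.CDLL(r"C:\Repos\CASS_PFM\dlls\MyDLL.dll")
func = getattr(dll, "myfunction_")
# Declare what the routine expects so ctypes stops guessing at call time.
# (Hypothetical signature: real(8)/integer scalars passed by reference,
# plus an output array of doubles.)
func.argtypes = [
    ctypes.POINTER(ctypes.c_double),  # R_o
    ctypes.POINTER(ctypes.c_double),  # thick
    ctypes.POINTER(ctypes.c_int),     # i_write
    ctypes.POINTER(ctypes.c_double),  # output array
]
func.restype = None  # a Fortran subroutine returns nothing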

How do I return a dataframe object from a thread

I had previously asked this question but may not have been clear enough in explaining my particular situation. My previous question was voted as a duplicate of how to get the return value from a thread in python?
Perhaps I should have explained more. I had already read and tried the referenced thread, but nothing I did from there seemed to work. (I could just be implementing it incorrectly.)
My main class that does all the work and data transformation is:
class SolrPull(object):
    def __init__(self, **kwargs):
        self.var1 = kwargs['var1'] if 'var1' in kwargs else 'this'
        self.var2 = kwargs['var2'] if 'var2' in kwargs else 'that'

    def solr_main(self):
        # This is where the main data transformation takes place.
        return self.flattened_df
I need to create multiple objects and have them pull from a Solr database and transform data simultaneously in different threads.
My arguments must be passed to the SolrPull class, not to the solr_main function.
I need to wait for those returns before continuing with processing.
I tried a couple of different answers from the referenced thread, but nothing worked.
Using the accepted answer for that thread, I did:
class TierPerf(object):
    def pull_current(self):
        pool = ThreadPool(processes=5)
        CustomerRecv_df_result = pool.apply_async(SolrPull(var1='this', var2='that').solr_main())
        APS_df_result = pool.apply_async(SolrPull(var1='this', var2='that').solr_main())
        self.CustomerRecv_df = CustomerRecv_df_result.get()
        self.APS_df = APS_df_result.get()
But the pulls and transformations do not happen simultaneously.
Then when I do the .get(), I get the error 'DataFrame object is not callable'.
As an end result, I need to be able to call SolrPull(*args).solr_main() for several objects simultaneously and get back pandas dataframes that will then be used for further processing.
Well, after all the struggle and pain over that, I finally figured out my specifics after posting this question.
I went back to my original solution and then just set my desired dataframe (self.CustomerRecv_df) to the returned object's dataframe attribute (CustomerRecv_df.flattened_df).
class TierPerf(object):
    def pull_current(self):
        thread_list = []

        CustomerRecv_df = SolrPull(var1='this', var2='that')
        tr_CustomerRecv_df = threading.Thread(name='Customerrecev_tier', target=CustomerRecv_df.solr_main)
        thread_list.append(tr_CustomerRecv_df)

        APS_df = SolrPull(var1='this', var2='other')
        tr_APS_df = threading.Thread(name='APS_tier', target=APS_df.solr_main)
        thread_list.append(tr_APS_df)

        for thread in thread_list:
            print('Starting', thread)
            thread.start()

        for thread in thread_list:
            print('Joining', thread)
            thread.join()

        self.CustomerRecv_df = CustomerRecv_df.flattened_df
        self.APS_df = APS_df.flattened_df
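For completeness, a minimal sketch of an alternative that hands the dataframes back directly, using concurrent.futures (same SolrPull class as above; note that the bound method is passed uncalled):

from concurrent.futures import ThreadPoolExecutor

class TierPerf(object):
    def pull_current(self):
        with ThreadPoolExecutor(max_workers=2) as executor:
            customer_future = executor.submit(SolrPull(var1='this', var2='that').solr_main)
            aps_future = executor.submit(SolrPull(var1='this', var2='other').solr_main)
            # result() blocks until the thread finishes and returns the dataframe.
            self.CustomerRecv_df = customer_future.result()
            self.APS_df = aps_future.result()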

Creating a list of Class objects from a file with no duplicates in attributes of the objects

I am currently taking some computer science courses in school and have come to a dead end, so I need a little help. Like the title says, I need to create a list of class objects from a file, where an object that duplicates an attribute of one already in the list is not added. I was able to do this successfully with a Python set(), but apparently that isn't allowed for this particular assignment. I have tried various other ways but can't seem to get it working without using a set. I believe the point of this assignment is comparing data structures in Python, using the slowest method possible, as it also has to be timed. My code using the set() is provided below.
import time

class Students:
    def __init__(self, LName, FName, ssn, email, age):
        self.LName = LName
        self.FName = FName
        self.ssn = ssn
        self.email = email
        self.age = age

    def getssn(self):
        return self.ssn

def main():
    t1 = time.time()
    f = open('InsertNames.txt', 'r')
    studentlist = []
    seen = set()
    for line in f:
        parsed = line.split(' ')
        parsed = [i.strip() for i in parsed]
        if parsed[2] not in seen:
            studentlist.append(Students(parsed[0], parsed[1], parsed[2], parsed[3], parsed[4]))
            seen.add(parsed[2])
        else:
            print(parsed[2], 'already in list, not added')
    f.close()
    print('final list length: ', len(studentlist))
    t2 = time.time()
    print('time = ', t2 - t1)

main()
Note that the only duplicates to check for are those of the .ssn attribute, and the duplicate should not be added to the list. Is there a way to check what is already in the list by that specific attribute before adding a new object?
Edit: Forgot to mention that only one list is allowed in memory.
You can write
if not any(s.ssn==parsed[2] for s in studentlist):
without committing to this comparison as the meaning of ==. At this level of work, you probably are expected to write out the loop and set a flag yourself rather than use a generator expression.
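Written out as an explicit loop with a flag, the same check would look like this inside the existing for line in f loop:

duplicate = False
for s in studentlist:
    if s.ssn == parsed[2]:
        duplicate = True
        break
if not duplicate:
    studentlist.append(Students(parsed[0], parsed[1], parsed[2], parsed[3], parsed[4]))
else:
    print(parsed[2], 'already in list, not added')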
Since you already took the time to write a class representing a student and since ssn is a unique identifier for the instances, consider writing an __eq__ method for that class.
def __eq__(self, other):
    return self.ssn == other.ssn
This will make your life easier when you want to compare two students, and in your case make a list (specifically not a set) of students.
Then your code would look something like:
student_list = []
with open('InsertNames.txt') as f:
    for line in f:
        student = Students(*line.strip().split())
        if student not in student_list:
            student_list.append(student)
Explanation
Opening a file with the with statement makes your code cleaner and gives it the ability to handle errors and do cleanup correctly. And since 'r' is the default mode for open, it doesn't need to be there.
You should strip the line before splitting it, just to handle some edge cases, but this is not obligatory.
split without an argument already splits on whitespace, so passing ' ' isn't necessary.
Just to clarify: the absence of a parameter makes split treat any run of whitespace as a separator. It does not mean that a single space character is the default.
Creating the student object before adding it to the list sounds like too much overhead for this simple use, but since only one __init__ call happens either way, it is not that bad. The plus side is that it makes the code more readable with the not in check.
The in operator (and not in as well, of course) checks whether an object is in the list using that object's __eq__ method. Since you implemented that method, in and not in work for your Students class instances.
Only if the student doesn't already exist in the list will it be added.
One final thing: no list is created here other than the return value of split and the student_list you created.
