Broadcasting a single value Spark(python) - apache-spark

Lets say I am having the following list and a single value:
alist = [1,2,3,4,5]
alistRDD = sc.parallelize(alist)
single_value = 3
and I got the following function:
def a_fun(x,y):
return x+y
And I am doing the following:
alistRDD.map(lambda x:a_fun(x,single_value))
So I am using for this function as second argument the single_value. Does it make sense to broadcast this single_value in order to be in all the nodes?

When the driver submit this transformation to the workers it would simply pass the value it self and not the parameter. So from performance perspective it may even be better. There is no value in broadcasting data that is assigned without any logic. You be better simply pass the variable and let the serialization process turn it into the value itself. Hope this answers your question.

Related

Is there a way changing actual value of an int without creating a new instance? [duplicate]

How can I pass an integer by reference in Python?
I want to modify the value of a variable that I am passing to the function. I have read that everything in Python is pass by value, but there has to be an easy trick. For example, in Java you could pass the reference types of Integer, Long, etc.
How can I pass an integer into a function by reference?
What are the best practices?
It doesn't quite work that way in Python. Python passes references to objects. Inside your function you have an object -- You're free to mutate that object (if possible). However, integers are immutable. One workaround is to pass the integer in a container which can be mutated:
def change(x):
x[0] = 3
x = [1]
change(x)
print x
This is ugly/clumsy at best, but you're not going to do any better in Python. The reason is because in Python, assignment (=) takes whatever object is the result of the right hand side and binds it to whatever is on the left hand side *(or passes it to the appropriate function).
Understanding this, we can see why there is no way to change the value of an immutable object inside a function -- you can't change any of its attributes because it's immutable, and you can't just assign the "variable" a new value because then you're actually creating a new object (which is distinct from the old one) and giving it the name that the old object had in the local namespace.
Usually the workaround is to simply return the object that you want:
def multiply_by_2(x):
return 2*x
x = 1
x = multiply_by_2(x)
*In the first example case above, 3 actually gets passed to x.__setitem__.
Most cases where you would need to pass by reference are where you need to return more than one value back to the caller. A "best practice" is to use multiple return values, which is much easier to do in Python than in languages like Java.
Here's a simple example:
def RectToPolar(x, y):
r = (x ** 2 + y ** 2) ** 0.5
theta = math.atan2(y, x)
return r, theta # return 2 things at once
r, theta = RectToPolar(3, 4) # assign 2 things at once
Not exactly passing a value directly, but using it as if it was passed.
x = 7
def my_method():
nonlocal x
x += 1
my_method()
print(x) # 8
Caveats:
nonlocal was introduced in python 3
If the enclosing scope is the global one, use global instead of nonlocal.
Maybe it's not pythonic way, but you can do this
import ctypes
def incr(a):
a += 1
x = ctypes.c_int(1) # create c-var
incr(ctypes.ctypes.byref(x)) # passing by ref
Really, the best practice is to step back and ask whether you really need to do this. Why do you want to modify the value of a variable that you're passing in to the function?
If you need to do it for a quick hack, the quickest way is to pass a list holding the integer, and stick a [0] around every use of it, as mgilson's answer demonstrates.
If you need to do it for something more significant, write a class that has an int as an attribute, so you can just set it. Of course this forces you to come up with a good name for the class, and for the attribute—if you can't think of anything, go back and read the sentence again a few times, and then use the list.
More generally, if you're trying to port some Java idiom directly to Python, you're doing it wrong. Even when there is something directly corresponding (as with static/#staticmethod), you still don't want to use it in most Python programs just because you'd use it in Java.
Maybe slightly more self-documenting than the list-of-length-1 trick is the old empty type trick:
def inc_i(v):
v.i += 1
x = type('', (), {})()
x.i = 7
inc_i(x)
print(x.i)
A numpy single-element array is mutable and yet for most purposes, it can be evaluated as if it was a numerical python variable. Therefore, it's a more convenient by-reference number container than a single-element list.
import numpy as np
def triple_var_by_ref(x):
x[0]=x[0]*3
a=np.array([2])
triple_var_by_ref(a)
print(a+1)
output:
7
The correct answer, is to use a class and put the value inside the class, this lets you pass by reference exactly as you desire.
class Thing:
def __init__(self,a):
self.a = a
def dosomething(ref)
ref.a += 1
t = Thing(3)
dosomething(t)
print("T is now",t.a)
In Python, every value is a reference (a pointer to an object), just like non-primitives in Java. Also, like Java, Python only has pass by value. So, semantically, they are pretty much the same.
Since you mention Java in your question, I would like to see how you achieve what you want in Java. If you can show it in Java, I can show you how to do it exactly equivalently in Python.
class PassByReference:
def Change(self, var):
self.a = var
print(self.a)
s=PassByReference()
s.Change(5)
class Obj:
def __init__(self,a):
self.value = a
def sum(self, a):
self.value += a
a = Obj(1)
b = a
a.sum(1)
print(a.value, b.value)// 2 2
In Python, everything is passed by value, but if you want to modify some state, you can change the value of an integer inside a list or object that's passed to a method.
integers are immutable in python and once they are created we cannot change their value by using assignment operator to a variable we are making it to point to some other address not the previous address.
In python a function can return multiple values we can make use of it:
def swap(a,b):
return b,a
a,b=22,55
a,b=swap(a,b)
print(a,b)
To change the reference a variable is pointing to we can wrap immutable data types(int, long, float, complex, str, bytes, truple, frozenset) inside of mutable data types (bytearray, list, set, dict).
#var is an instance of dictionary type
def change(var,key,new_value):
var[key]=new_value
var =dict()
var['a']=33
change(var,'a',2625)
print(var['a'])

Receiving function as another function´s parameter

Let´s say I have 2 functions like these ones:
def list(n):
l=[x for x in range(n)]
return l
def square(l):
l=list(map(lambda x:x**2,l))
print(l)
The first one makes a list from all numbers in a given range which is "n" and the second one receives a list as a parameter and returns the squared values of this list.
However when I write:
square(list(20))
it raises the error "map object cannot be interpreted as an integer" and whenever I erase one of the functions above and run the other one it runs perfectly and I have no idea what mistake I made.
You redefined the standard function list()! Rename it to my_list() and clean the code accordingly.
As a side note, your function list() is doing exactly what list(range(n)) would do. Why do you need it at all? In fact, for most purposes (including your example), range(n) alone is sufficient.
Finally, you do not pass a function as a parameter. You pass the value generated by another function. It is not the same.

Pythonic way of creating if statement for nested if statements

I'm kind of newbie as programmer, but I wish to master Python and I'm developing open source application. This application has function to gather some information. This function takes 1 parameter. This parameter can be 0, 1 or 2. 0 = False, 1 = True, 2 = Multi. Also I have an if statement that does 2 actions. 1st - (when False) gathers single type value, 2nd - (when True) gathers multiple type values and when parameter is 2 (multi) then it will gather single type (1st) and multiple types (2nd). My if statement looks like this:
if False:
get_single_type = code.of.action
generators.generate_data(False, get_single_type)
elif True:
get_multiple_type = code.of.action
generators.generate_data(True, get_multiple_type)
else:
get_single_type = code.of.action
generators.generate_data(False, get_single_type)
get_multiple_type = code.of.action
generators.generate_data(True, get_multiple_type)
Is there maybe better way of avoiding this kind of coding, like in last else statement when I call both single and multiple.
Thank you in advance.
One thing I learned from Python is that although it lacks the Switch operator, you can use dictionary in a similar fashion to get things done since everything is an object:
def get_single():
# define your single function
get_single_type = code.of.action
generators.generate_data(False, get_single_type)
def get_multi():
# define your multi function
get_multiple_type = code.of.action
generators.generate_data(True, get_multiple_type)
actions = {
0: [get_single],
1: [get_multi],
2: [get_single, get_multi]
}
parameter = 0 # replace this line with however you are capturing the parameter
for action in actions[parameter]:
action()
This way you avoid c+p your code everywhere and have it referenced from the function, and your "actions" dictionary define the function to be used based on the parameter given.
In this case since you have multiple functions you want to call, I kept all dictionary items as a list so the structure is consistent and it can be iterated through to perform any number of actions.
Ensure you use leave out the () in the dictionary so that the functions aren't instantiated when the dictionary is defined. And remember to add () when you are actually calling the function from the dictionary to instantiate it.
This is something you will often encounter and it is pretty much always bad practice to be repeating code. Anyway, the way to do this is use two if-statements. This way, even if the first case passes, the second case can still pass. Oh, and assuming your variable that can be 0, 1 or 2 is called x, then we could either use or and two checks:
if x == 0 or x == 2:
but, personally, I prefer using in on a tuple:
if x in (0, 2):
get_single_type = code.of.action
generators.generate_data(False, get_single_type)
if x in (1, 2):
get_multiple_type = code.of.action
generators.generate_data(True, get_multiple_type)

Pass by Reference in Haskell?

Coming from a C# background, I would say that the ref keyword is very useful in certain situations where changes to a method parameter are desired to directly influence the passed value for value types of for setting a parameter to null.
Also, the out keyword can come in handy when returning a multitude of various logically unconnected values.
My question is: is it possible to pass a parameter to a function by reference in Haskell? If not, what is the direct alternative (if any)?
There is no difference between "pass-by-value" and "pass-by-reference" in languages like Haskell and ML, because it's not possible to assign to a variable in these languages. It's not possible to have "changes to a method parameter" in the first place in influence any passed variable.
It depends on context. Without any context, no, you can't (at least not in the way you mean). With context, you may very well be able to do this if you want. In particular, if you're working in IO or ST, you can use IORef or STRef respectively, as well as mutable arrays, vectors, hash tables, weak hash tables (IO only, I believe), etc. A function can take one or more of these and produce an action that (when executed) will modify the contents of those references.
Another sort of context, StateT, gives the illusion of a mutable "state" value implemented purely. You can use a compound state and pass around lenses into it, simulating references for certain purposes.
My question is: is it possible to pass a parameter to a function by reference in Haskell? If not, what is the direct alternative (if any)?
No, values in Haskell are immutable (well, the do notation can create some illusion of mutability, but it all happens inside a function and is an entirely different topic). If you want to change the value, you will have to return the changed value and let the caller deal with it. For instance, see the random number generating function next that returns the value and the updated RNG.
Also, the out keyword can come in handy when returning a multitude of various logically unconnected values.
Consequently, you can't have out either. If you want to return several entirely disconnected values (at which point you should probably think why are disconnected values being returned from a single function), return a tuple.
No, it's not possible, because Haskell variables are immutable, therefore, the creators of Haskell must have reasoned there's no point of passing a reference that cannot be changed.
consider a Haskell variable:
let x = 37
In order to change this, we need to make a temporary variable, and then set the first variable to the temporary variable (with modifications).
let tripleX = x * 3
let x = tripleX
If Haskell had pass by reference, could we do this?
The answer is no.
Suppose we tried:
tripleVar :: Int -> IO()
tripleVar var = do
let times_3 = var * 3
let var = times_3
The problem with this code is the last line; Although we can imagine the variable being passed by reference, the new variable isn't.
In other words, we're introducing a new local variable with the same name;
Take a look again at the last line:
let var = times_3
Haskell doesn't know that we want to "change" a global variable; since we can't reassign it, we are creating a new variable with the same name on the local scope, thus not changing the reference. :-(
tripleVar :: Int -> IO()
tripleVar var = do
let tripleVar = var
let var = tripleVar * 3
return()
main = do
let x = 4
tripleVar x
print x -- 4 :(

How to create an array of functions which partly depend on outside parameters? (Python)

I am interested in creating a list / array of functions "G" consisting of many small functions "g". This essentially should correspond to a series of functions 'evolving' in time.
Each "g" takes-in two variables and returns the product of these variables with an outside global variable indexed at the same time-step.
Assume obs_mat (T x 1) is a pre-defined global array, and t corresponds to the time-steps
G = []
for t in range(T):
# tried declaring obs here too.
def g(current_state, observation_noise):
obs = obs_mat[t]
return current_state * observation_noise * obs
G.append(g)
Unfortunately when I test the resultant functions, they do not seem to pick up on the difference in the obs time-varying constant i.e. (Got G[0](100,100) same as G[5](100,100)). I tried playing around with the scope of obs but without much luck. Would anyone be able to help guide me in the right direction?
This is a common "gotcha" to referencing variables from an outer scope when in an inner function. The outer variable is looked up when the inner function is run, not when the inner function is defined (so all versions of the function see the variable's last value). For each function to see a different value, you either need to make sure they're looking in separate namespaces, or you need to bind the value to a default parameter of the inner function.
Here's an approach that uses an extra namespace:
def make_func(x):
def func(a, b):
return a*b*x
return func
list_of_funcs = [make_func(i) for i in range(10)]
Each inner function func has access to the x parameter in the enclosing make_func function. Since they're all created by separate calls to make_func, they each see separate namespaces with different x values.
Here's the other approach that uses a default argument (with functions created by a lambda expression):
list_of_funcs = [lambda a, b, x=i: a*b*x for i in range(10)]
In this version, the i variable from the list comprehension is bound to the default value of the x parameter in the lambda expression. This binding means that the functions wont care about the value of i changing later on. The downside to this solution is that any code that accidentally calls one of the functions with three arguments instead of two may work without an exception (perhaps with odd results).
The problem you are running into is one of scoping. Function bodies aren't evaluated until the fuction is actually called, so the functions you have there will use whatever is the current value of the variable within their scope at time of evaluation (which means they'll have the same t if you call them all after the for-loop has ended)
In order to see the value that you would like, you'd need to immediately call the function and save the result.
I'm not really sure why you're using an array of functions. Perhaps what you're trying to do is map a partial function across the time series, something like the following?
from functools import partial
def g(current_state, observation_noise, t):
obs = obs_mat[t]
return current_state * observation_noise * obs
g_maker = partial(g, current, observation)
results = list(map(g_maker, range(T)))
What's happening here is that partial creates a partially-applied function, which is merely waiting for its final value to be evaluated. That final value is dynamic (but the first two are fixed in this example), so mapping that partially-applied function over a range of values gets you answers for each value.
Honestly, this is a guess because it's hard to see what else you are trying to do with this data and it's hard to see what you're trying to achieve with the array of functions (and there are certainly other ways to do this).
The issue (assuming that your G.append call is mis-indented) is simply that the name t is mutated when you loop over the iterator returned by range(T). Since every function g you create stores returns the same name t, they wind up all returning the same value, T - 1. The fix is to de-reference the name (the simplest way to do this is by sending t into your function as a default value for an argument in g's argument list):
G = []
for t in range(T):
def g(current_state, observation_noise, t_kw=t):
obs = obs_mat[t_kw]
return current_state * observation_noise * obs
G.append(g)
This works because it creates another name that points at the value that t references during that iteration of the loop (you could still use t rather than t_kw and it would still just work because tg is bound to the value that tf is bound to - the value never changes, but tf is bound to another value on the next iteration, while tg still points at the "original" value.

Resources