Spark: Difference Between Reduce() vs Fold() [duplicate] - apache-spark

This question already has an answer here:
Why is the fold action necessary in Spark?
I'm studying Spark using the book Learning Spark: Lightning-Fast Big Data Analysis.
I have visited many sites and read many articles, but I still don't understand the difference between reduce() and fold().
According to the book that I'm using:
"Similar to reduce() is fold(), which also takes a function with the same signature as needed for reduce(), but in addition takes a “zero value” to be used for the initial call on each partition. The zero value you provide should be the identity element for your operation; that is, applying it multiple times with your function should not change the value (e.g., 0 for +, 1 for *, or an empty list for concatenation)."
To help me better understand, I ran the following code:
rdd = sc.parallelize([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], 2)
rdd.getNumPartitions()
Out[1]: 2
rdd.glom().collect()
Out[2]: [[1, 2, 3, 4, 5], [6, 7, 8, 9, 10]]
rdd.reduce(lambda x,y: x+y)
Out[3]: 55
rdd.fold(0, lambda x,y: x+y)
Out[4]: 55
Question:
1) Referencing: "but in addition takes a “zero value” to be used for the initial call on each partition." What does "the initial call on each partition" mean?
2) Referencing: "The zero value you provide should be the identity element for your operation; that is, applying it multiple times with your function should not change the value." If the zero value must not change the result, what is the point of providing it at all?
3) In the example above, both calls produce the sum 55. What's the difference?

The difference is that fold lets you change the type of the result by supplying a zero value of that type, whereas reduce has to start from values taken from the data, so its result has the same type as the elements.
e.g.
rdd.fold("",lambda x,y: x+str(y))
'12345678910'
Your example doesn't change the type of the result, and indeed in that case you could use reduce instead of fold.
A "normal" fold in a non-distributed environment uses the initial value exactly once. However, because Spark runs distributed, it starts the fold with the initial value in each partition and uses it again when combining the per-partition results.
Because in your example the 10 numbers were created in 2 partitions, if we call the following:
rdd.fold("HERE",lambda x,y: x+str(y))
we'd get
'HEREHERE12345HERE678910'
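To make the per-partition behaviour concrete, here is a minimal plain-Python sketch (not Spark code) that reproduces that result, assuming the same two-partition layout shown by rdd.glom().collect() above; the real merge order on the driver depends on which task finishes first, which doesn't matter for this example:

from functools import reduce

# Spark's fold uses the zero value once per partition,
# and once more when merging the partition results on the driver.
partitions = [[1, 2, 3, 4, 5], [6, 7, 8, 9, 10]]  # same layout as rdd.glom().collect()
zero = "HERE"
op = lambda x, y: x + str(y)

# 1) fold each partition, starting from the zero value
per_partition = [reduce(op, part, zero) for part in partitions]
# -> ['HERE12345', 'HERE678910']

# 2) merge the partition results, again starting from the zero value
result = reduce(op, per_partition, zero)
print(result)  # 'HEREHERE12345HERE678910'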

Related

Reduce operation in Spark with constant values gives a constant result irrespective of input

ser = sc.parallelize([1, 2, 3, 4, 5])
freq = ser.reduce(lambda x, y: 1 + 2)
print(freq)  # answer is 3
If I run a reduce operation with a lambda that returns a constant value, it just gives the sum of those two constants, so in this case the answer is 3. I was expecting 3 + 3 + 3 + 3 = 12, since there are 5 elements and the summation would happen 4 times. I'm not able to understand the internals of reduce here. Any help, please?
You're misunderstanding what reduce does. It does not apply an aggregation operation (which you assume to be sum, for some reason) to a mapping of all elements (which you suppose you are doing with lambda x, y: 1 + 2).
Reducing that RDD will, roughly speaking, do something like this:
call your lambda with 1, 2 -> lambda returns 3
carry 3 and call lambda with 3, 3 -> lambda returns 3
carry 3 and call lambda with 3, 4 -> lambda returns 3
carry 3 and call lambda with 3, 5 -> lambda returns 3
The reduce method returns the last value, which is 3.
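For intuition, the same carrying behaviour can be reproduced locally with Python's functools.reduce (a simplification: Spark also reduces within and across partitions, but with a constant lambda that makes no difference):

from functools import reduce

# The lambda ignores both of its arguments and always returns 1 + 2 == 3,
# so whatever value is carried, the final result is 3.
freq = reduce(lambda x, y: 1 + 2, [1, 2, 3, 4, 5])
print(freq)  # 3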
If your intention is to compute 1 + 2 for each element in the RDD, then you need to map and then reduce, something like:
freq = ser.map(lambda x: 1 + 2).reduce(lambda a, b: a + b)  # see how reduce works
# which you can rewrite as
freq = ser.map(lambda x: 1 + 2).sum()
But the result of this is 15, not 12 (since there are 5 elements). I don't know of any operation that computes a mapped value for each "reduction" step and allows further reduction.
That is probably the wrong question to ask. You could approximate it with the map-and-reduce approach above while skipping one element, but I strongly doubt that is what you want: the commutative and associative operation passed to reduce can be called an arbitrary number of times, depending on how the RDD is partitioned.

How to iterate two different lists in parallel when they converge into one

a = [1, 2, 7, 5, 11]
b = [3, 4, 5, 11]
The above example corresponds to:
(1)-->(2)-->(7)-->(5)-->(11)
                 /
(3)-->(4)-------
Here node 5 is the merging point of the two lists.
For the case of a guaranteed common tail:
Iterate both lists in reverse, from the ends, until a difference is found. Remember the indexes where the common tail starts.
Then, if needed, traverse the beginnings of both lists up to the junction point (a sketch of this follows below).
Alternative approach, if both lists are strictly ordered:
At every step, advance in the list whose current element is smaller; when both current elements are equal, you have reached the junction.
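Here is a short sketch of the first (reverse-tail) approach, using the example lists a and b from the question; it assumes the two lists really do share the same tail:

def find_junction(a, b):
    # Walk both lists from the back while the elements match.
    i, j = len(a) - 1, len(b) - 1
    while i >= 0 and j >= 0 and a[i] == b[j]:
        i -= 1
        j -= 1
    # i + 1 and j + 1 are the first positions of the common tail.
    return i + 1, j + 1

a = [1, 2, 7, 5, 11]
b = [3, 4, 5, 11]
ia, ib = find_junction(a, b)
print(a[ia], a[ia:])  # 5 [5, 11] -> node 5 is the merge point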

Understanding factorials and alternate code

Python newbie here. I am trying to understand the following code to calculate Euler's number:
import math
def num(i=10):
    return sum([1 / math.factorial(z) for z in range(0, i)])
I would really like to get a better grasp of how equations are done in code. I have read many tutorials, but I don't understand them well enough to apply the concepts to new situations like the code above. Can someone explain what is happening in this code step by step? Additionally, I have not been able to figure out factorials: it would be very helpful if someone explained how to implement a factorial in a function (the hard way), without imports.
For you to understand the above code, you first must understand the language itself and the formula it implements: e = 1/0! + 1/1! + 1/2! + 1/3! + 1/4! + ..., so you need to do:
total = 0
for i in range(100):
    total += 1 / math.factorial(i)
print(total)
# 2.7182818284590455
That is, assuming you understand what a for loop is and how it runs. This is much faster than what you wrote above.
Now, in Python there is something called a list comprehension: creating a list from a for loop without pre-defining the list. So you can do `[i for i in range(10)]`, which creates a list of 10 elements. You can also manipulate each element as you create the list, e.g.:
[i**2 for i in range(10)]
[0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
So in your case you are doing [1/math.factorial(i) for i in range(100)]. This creates the list [1.0, 1.0, 0.5, 0.16666666666666666, 0.041666666666666664, ...], which you then add up by calling sum on it, i.e. sum([1/math.factorial(i) for i in range(100)]).
Defining your own factorial
A factorial multiplies all the integers from 1 up to the specified value, with factorial(0) defined as 1.
factorial(3) = 1*2*3.
thus you can define it as:
def factorial(x):
    if x == 0:
        return 1
    val = 1
    for i in range(1, x + 1):
        val *= i
    return val
factorial(3)
Out[40]: 6
factorial(4)
Out[41]: 24
You can also use recursion to define factorial:
def factorial(x):
    if x == 0:
        return 1
    else:
        return x * factorial(x - 1)
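To tie this back to the original question, here is a small sketch that reuses the hand-written factorial above to approximate e with no imports at all (the num name and the choice of 20 terms are just illustrative):

def factorial(x):  # the recursive version defined above
    return 1 if x == 0 else x * factorial(x - 1)

def num(i=10):
    # Sum the first i terms of the series 1/0! + 1/1! + 1/2! + ...
    return sum(1 / factorial(z) for z in range(i))

print(num(20))  # ~2.718281828459045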

How to change the type of a constraint's arguments in ortools

I don't know if what I want is possible or not. I am using ortools to solve an optimization problem, and I know that the constraint bounds should be given as doubles, like this:
constraints[i] = solver.Constraint(0.0, 10.0)
But I don't want to use this type of argument when creating constraints. For example, I want to pass a list.
So I wrote this in my code:
constraints[i] = solver.Constraint([1,2,3,...])
And I got this error:
return _pywraplp.Solver_Constraint(self, *args)
NotImplementedError: Wrong number or type of arguments for overloaded
function 'Solver_Constraint'.
Possible C/C++ prototypes are:
operations_research::MPSolver::MakeRowConstraint(double,double)
operations_research::MPSolver::MakeRowConstraint()
operations_research::MPSolver::MakeRowConstraint(double,double,std::string
const &)
operations_research::MPSolver::MakeRowConstraint(std::string const &)
Is there any way to change the type of the constraint's arguments?
My Assumptions
your constraint expression is "a sum of some lists", meaning something along the lines of what the NumPy library does: e.g., if you have two lists of values, [1, 2, 3] and [4, 5, 6], their sum would be element-wise, s.t. [1, 2, 3] + [4, 5, 6] = [1+4, 2+5, 3+6] = [5, 7, 9].
your "list constraint" is also element-wise; e.g., [x1, x2, x3] <= [1, 2, 3] means x1 <= 1, x2 <= 2 and x3 <= 3.
you're using the GLOP Linear Solver. (Everything I say below applies to the ILP/CP/CP-SAT solvers, but some of the particular method names/other details are different.)
My Answer
The thing is, ortools only lets you set scalar values (like numbers) as variables; you can't make a "list variable", so to speak.
Therefore, you'll have to make a list of scalar variables that effectively represents the same thing.
For example, let's say you wanted your "list variable" to be a list of values, each one subjected to a particular constraint which you have stored in a list. Let's say you have a list of upper bounds:
upper_bounds = [1, 2, 3, ..., n]
And you have several lists of solver variables like so:
vars1 = [
    # variable bounds here are chosen arbitrarily; set them to your purposes
    solver.NumVar(0, solver.infinity(), 'x{0}'.format(i))
    for i in range(n)
]
vars2 = [...] # you define any other variable lists in the same way
Then, you would make a list of constraint objects, one constraint for each upper bound in your list:
constraints = [
    solver.Constraint(0, ubound)
    for ubound in upper_bounds
]
And you insert the variables into your constraints however is dictated for your problem:
# Example expression: X1 - X2 + 0.5*X3 < UBOUND
for i in range(n):
    constraints[i].SetCoefficient(vars1[i], 1)
    constraints[i].SetCoefficient(vars2[i], -1)
    constraints[i].SetCoefficient(vars3[i], 0.5)
Hope this helps! I recommend taking a(nother) look at the examples for your particular solver; the GLOP examples are in the OR-Tools linear solver documentation.
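Putting the pieces together, here is a small self-contained sketch; the bounds, coefficients, and objective are made up purely for illustration, and it assumes a recent ortools version that provides pywraplp.Solver.CreateSolver:

from ortools.linear_solver import pywraplp

solver = pywraplp.Solver.CreateSolver('GLOP')

upper_bounds = [1, 2, 3]  # illustrative values
n = len(upper_bounds)

# One scalar variable per "slot" of the would-be list variable
xs = [solver.NumVar(0, solver.infinity(), 'x{0}'.format(i)) for i in range(n)]

# One row constraint per upper bound: 0 <= 1.0 * x_i <= upper_bounds[i]
constraints = [solver.Constraint(0, ub) for ub in upper_bounds]
for i in range(n):
    constraints[i].SetCoefficient(xs[i], 1)

# A made-up objective: maximize the sum of the variables
objective = solver.Objective()
for x in xs:
    objective.SetCoefficient(x, 1)
objective.SetMaximization()

if solver.Solve() == pywraplp.Solver.OPTIMAL:
    print([x.solution_value() for x in xs])  # [1.0, 2.0, 3.0] with these bounds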

Scikit Learn - Random Forest: how is a continuous feature handled?

Random Forest accepts numerical data. Usually features with text data are converted to numerical categories, and continuous numerical data is fed in as-is without discretization. How does the RF treat continuous data when creating nodes? Does it bin the continuous numerical data internally, or treat each value as a discrete level?
For example:
I want to feed a data set (after categorizing the text features, of course) to an RF. How is the continuous data handled by the RF?
Is it advisable to discretize the continuous data (longitudes and latitudes, in this case) before feeding it in, or would doing so lose information?
As far as I understand, you are asking how the threshold is chosen for continuous features. Candidate thresholds occur at values where the class changes. For example, consider the following 1D dataset with x as the feature and y as the class variable:
x = [ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [ 1, 1, 0, 0, 0, 0, 0, 1, 1, 1]
Two candidate cuts will be considered: (i) between 2 and 3 (which in practice looks like x < 2.5) and (ii) between 7 and 8 (x < 7.5).
Of these two candidates the second will be chosen, since it provides a better separation. Then the algorithm goes on to the next step.
Therefore it is not advisable to discretize the data yourself. Think about it with the data above: if, for example, you discretize the data into 5 bins [1, 2 | 3, 4 | 5, 6 | 7, 8 | 9, 10], you miss the best split (since 7 and 8 end up in the same bin).
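You can see this choice directly by fitting a depth-1 decision tree on the data above (a sketch using scikit-learn's DecisionTreeClassifier; a single stump stands in for what each tree in the forest does at a node):

import numpy as np
from sklearn.tree import DecisionTreeClassifier

X = np.array([[1], [2], [3], [4], [5], [6], [7], [8], [9], [10]])  # feature x as a column
y = np.array([1, 1, 0, 0, 0, 0, 0, 1, 1, 1])

stump = DecisionTreeClassifier(max_depth=1, random_state=0).fit(X, y)
# The root node's threshold is the better of the two candidate cuts
print(stump.tree_.threshold[0])  # 7.5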
You are really asking about decision trees: RandomForest is an ensemble model and, by itself, doesn't know anything about the data; it fully relies on the decisions of its base estimators (in this case DecisionTrees) and aggregates them.
So, how does a DecisionTree treat continuous features? Look at the official documentation page: a DecisionTreeClassifier fitted on a continuous dataset (Fisher's irises) shows, in the picture of the tree, a threshold value in each node over the feature chosen at that node.
