I have multiple lists with data of the form (this is a simple example; in reality, the dimension of the row-vectors is much larger):
list 1: [num1] [[1,0,0,1,0], [0,0,1,0,1], [0,1,0,1,0], ...]
list 2: [num2] [[0,0,0,1,0], [1,0,0,1,0], [0,0,1,0,0], ...]
...
list n: [numn] [[1,1,0,1,0], [1,0,0,1,1], [0,0,1,0,1], ...]
Every list is marked with its own number [num] (numbers are not repeated).
The main question is: how can I efficiently find all the nums of lists that contain identical row-vectors, together with those vectors?
In detail:
For example, the row-vector [1,0,0,1,0] occurs in list 1 and list 2, so I should return [1,0,0,1,0] : [num1], [num2]
Hash tables come to mind first. I think they are the best fit given the large amount of data, but I only know hash tables superficially and can't structure a clear algorithm in my head for this case. Can anyone advise what I should pay attention to and which modules I should consider? Perhaps there are other efficient approaches?
It is beyond the scope of a regular question to dive into hash tables and such. But suffice it to say that sets in Python are backed by hash tables, and checking for set membership is almost instantaneous and much more efficient than searching through lists.
If order doesn't matter within your list of vectors, you should just think of them as unordered collections (sets). Sets need to contain immutable things, so you cannot put a list into a set, but you can put in tuples. So, if you restructure your data into sets of tuples, you are in good shape.
There are many things you might do from there; below are a few examples.
data = {1: {(1, 0, 0), (1, 1, 0)},
        2: {(0, 0, 0), (1, 0, 0)},
        3: {(1, 0, 0), (1, 0, 1), (1, 1, 0)}}
# find common vectors in 2 sets
def common_vecs(a, b):
    return a.intersection(b)

# find all the common vectors in a group of sets
def all_common_vecs(grps):
    return set.intersection(*grps)

# find which sets contain a specific vector
def find(vec, data):
    result = set()
    for idx, grp in data.items():
        if vec in grp:
            result.add(idx)
    return result
print(common_vecs(data[1], data[3]))
print(all_common_vecs(data.values()))
print(find((1,0,1), data))
Output:
{(1, 0, 0), (1, 1, 0)}
{(1, 0, 0)}
{3}
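If you want to go the other way round, i.e. list every vector that appears in more than one set together with the numbers of those sets, you can build an inverted index keyed by the tuples. A minimal sketch, reusing the data dict from above (the names index and shared are just for illustration):
from collections import defaultdict

# map each vector (tuple) to the set of list numbers that contain it
index = defaultdict(set)
for num, vectors in data.items():
    for vec in vectors:
        index[vec].add(num)

# keep only the vectors that occur in more than one list
shared = {vec: nums for vec, nums in index.items() if len(nums) > 1}
print(shared)   # e.g. {(1, 0, 0): {1, 2, 3}, (1, 1, 0): {1, 3}}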
I'm studying Spark with the Learning Spark: Lightning-Fast Data Analysis book.
I have visited many sites and read many articles, but I still don't understand the difference between reduce() and fold().
According to the book that I'm using:
"Similar to reduce() is fold(), which also takes a function with the same signature as needed for reduce(), but in addition takes a “zero value” to be used for the initial call on each partition. The zero value you provide should be the identity element for your operation; that is, applying it multiple times with your function should not change the value (e.g., 0 for +, 1 for *, or an empty list for concatenation)."
To help me understand better, I ran the following code:
rdd = sc.parallelize([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], 2)
rdd.getNumPartitions()
Out[1]: 2
rdd.glom().collect()
Out[2]: [[1, 2, 3, 4, 5], [6, 7, 8, 9, 10]]
rdd.reduce(lambda x,y: x+y)
Out[3]: 55
rdd.fold(0, lambda x,y: x+y)
Out[4]: 55
Questions:
1) Referencing "but in addition takes a “zero value” to be used for the initial call on each partition": what does "the initial call on each partition" mean?
2) Referencing "The zero value you provide should be the identity element for your operation; that is, applying it multiple times with your function should not change the value": if that's the case, what is the point of providing the value for the operation at all?
3) In the example above, both calls produced the sum of 55. What's the difference?
The difference is that fold lets you change the type of the result, whereas reduce doesn't and thus can only use values from the data.
e.g.
rdd.fold("",lambda x,y: x+str(y))
'12345678910'
Your example doesn't change the type of the result, and indeed in that case you could use reduce instead of fold.
A "normal" fold, used in a non-distributed environment, uses the initial value once. However, since Spark runs distributed, it runs a fold that starts with the initial value in each partition, and uses it again when combining the per-partition results.
Because in your example you've created the 10 numbers above in 2 partitions, if we call the following:
rdd.fold("HERE",lambda x,y: x+str(y))
we'd get
'HEREHERE12345HERE678910'
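To see the effect of a zero value that is not the identity element (your question 2), try the same sum with 1 instead of 0. Assuming the same RDD with 2 partitions as above, the zero value is applied once per partition and once more when the partition results are combined, so the result is inflated:
rdd.fold(1, lambda x, y: x + y)
# with 2 partitions this returns 58, not 55: the extra 1 is added once in each
# of the 2 partitions and once more when the partition results are combined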
I don't know whether what I'm asking is possible or not. I am using ortools to solve an optimization problem, and I know that in the constraints the arguments should be of type double, like this:
constraints[i] = solver.Constraint(0.0, 10.0)
But my problem is that I don't want to use this type of argument when creating constraints. For example, I want to pass a list.
So I wrote this in my code:
constraints[i] = solver.Constraint([1,2,3,...])
And I got this error:
return _pywraplp.Solver_Constraint(self, *args)
NotImplementedError: Wrong number or type of arguments for overloaded function 'Solver_Constraint'.
Possible C/C++ prototypes are:
    operations_research::MPSolver::MakeRowConstraint(double,double)
    operations_research::MPSolver::MakeRowConstraint()
    operations_research::MPSolver::MakeRowConstraint(double,double,std::string const &)
    operations_research::MPSolver::MakeRowConstraint(std::string const &)
Is there any way to change the type of the constraint's arguments?
My Assumptions
your constraint expression is "a sum of some lists", meaning something along the lines of what the NumPy library does: e.g., if you have two lists of values, [1, 2, 3] and [4, 5, 6], their sum would be element-wise, s.t. [1, 2, 3] + [4, 5, 6] = [1+4, 2+5, 3+6] = [5, 7, 9].
your "list constraint" is also element-wise; e.g., [x1, x2, x3] <= [1, 2, 3] means x1 <= 1, x2 <= 2 and x3 <= 3.
you're using the GLOP Linear Solver. (Everything I say below applies to the ILP/CP/CP-SAT solvers, but some of the particular method names/other details are different.)
My Answer
The thing is, ortools only lets you set scalar values (like numbers) as variables; you can't make a "list variable", so to speak.
Therefore, you'll have to make a list of scalar variables that effectively represents the same thing.
For example, let's say you wanted your "list variable" to be a list of values, each one subject to a particular constraint, which you have stored in a list. Let's say you have a list of upper bounds:
upper_bounds = [1, 2, 3, ..., n]
And you have several lists of solver variables like so:
vars1 = [
    # variable bounds here are chosen arbitrarily; set them to your purposes
    solver.NumVar(0, solver.infinity(), 'x{0}'.format(i))
    for i in range(n)
]
vars2 = [...] # you define any other variable lists in the same way
Then, you would make a list of constraint objects, one constraint for each upper bound in your list:
constraints = [
    solver.Constraint(0, ubound)
    for ubound in upper_bounds
]
And you insert the variables into your constraints however your problem dictates:
# example expression: 0 <= X1 - X2 + 0.5*X3 <= UBOUND
for i in range(n):
    constraints[i].SetCoefficient(vars1[i], 1)
    constraints[i].SetCoefficient(vars2[i], -1)
    constraints[i].SetCoefficient(vars3[i], 0.5)
Hope this helps! I recommend taking a look (another look, if you already have) at the examples for your particular solver. The one for GLOP can be found here.
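To tie the pieces together, here is a minimal, self-contained sketch of the same pattern with GLOP. The data (n, the bounds, the objective) is made up purely for illustration, so adapt it to your problem:
from ortools.linear_solver import pywraplp

solver = pywraplp.Solver.CreateSolver('GLOP')

n = 3
upper_bounds = [1, 2, 3]   # element-wise upper bounds, chosen arbitrarily

# two "list variables", each represented as a plain Python list of scalar variables
vars1 = [solver.NumVar(0, 10, 'x{0}'.format(i)) for i in range(n)]
vars2 = [solver.NumVar(0, 10, 'y{0}'.format(i)) for i in range(n)]

# one constraint per element: 0 <= vars1[i] - vars2[i] <= upper_bounds[i]
constraints = [solver.Constraint(0, ub) for ub in upper_bounds]
for i in range(n):
    constraints[i].SetCoefficient(vars1[i], 1)
    constraints[i].SetCoefficient(vars2[i], -1)

# an arbitrary objective, just to make the model complete
objective = solver.Objective()
for v in vars1:
    objective.SetCoefficient(v, 1)
objective.SetMaximization()

if solver.Solve() == pywraplp.Solver.OPTIMAL:
    print([v.solution_value() for v in vars1])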
Random Forest accepts numerical data. Usually, features with text data are converted to numerical categories, and continuous numerical data is fed in as it is, without discretization. How does RF treat the continuous data when creating nodes? Will it bin the continuous numerical data internally, or treat each value as a discrete level?
For example:
I want to feed a data set (of course, after categorizing the text features) to RF. How is the continuous data handled by the RF?
Is it advisable to discretize the continuous data (longitudes and latitudes, in this case) before feeding it in? Or is information lost by doing so?
As far as I understand, you are asking how the threshold is chosen for continuous features. The candidate thresholds are the values where your class changes. For example, consider the following 1D dataset with x as the feature and y as the class variable:
x = [ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [ 1, 1, 0, 0, 0, 0, 0, 1, 1, 1]
Two candidate cuts will be considered: (i) between 2 and 3 (which in practice looks like x < 2.5) and (ii) between 7 and 8 (x < 7.5).
Of these two candidates the second one will be chosen, since it provides a better separation. Then the algorithm goes to the next step.
Therefore it is not advisable to discretize the data yourself. Think about it with the data above: if, for example, you discretize the data into 5 bins [1, 2 | 3, 4 | 5, 6 | 7, 8 | 9, 10], you miss the best split (since 7 and 8 end up in the same bin).
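If you are using scikit-learn (an assumption on my part), you can check this behaviour directly: a depth-1 decision tree fitted on the raw continuous feature above should pick the threshold at about 7.5, the better of the two candidate cuts:
import numpy as np
from sklearn.tree import DecisionTreeClassifier

X = np.array([[1], [2], [3], [4], [5], [6], [7], [8], [9], [10]])
y = np.array([1, 1, 0, 0, 0, 0, 0, 1, 1, 1])

# a single split (depth 1) on the raw, undiscretized feature
stump = DecisionTreeClassifier(max_depth=1).fit(X, y)
print(stump.tree_.threshold[0])   # expected: 7.5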
You are actually asking about decision trees. Because RandomForest is an ensemble model, by itself it doesn't know anything about the data; it fully relies on the decisions of its base estimators (in this case decision trees) and aggregates them.
So, how does a DecisionTree treat continuous features? Look at this official documentation page: a DecisionTreeClassifier is fitted there on a continuous dataset (Fisher's irises), and if you look at the picture of the tree, each node has a threshold value over the feature chosen at that node.
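If you want to see those thresholds for a forest rather than a single tree, a fitted RandomForestClassifier exposes its base decision trees via estimators_, and each of them stores the chosen feature and threshold per node. A small sketch on the iris data mentioned above (again assuming scikit-learn):
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
forest = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y)

# root split of the first base tree: a feature index and a raw (un-binned) threshold
tree = forest.estimators_[0].tree_
print(tree.feature[0], tree.threshold[0])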
I'm very new to constraint programming and am trying to find some real situations to test it on.
I found one that I think may be solvable with CP.
Here it is:
I have a group of kids that I have to assign to activities.
Each kid fills out a form where they specify 3 choices in order of preference.
Activities have a maximum number of participants, so the idea is to find a solution where the choices are respected as well as possible without exceeding the maximums.
As a first approach, I defined variables for the kids with [1,2,3] as the domain (the link between choice number, activity and child being known elsewhere).
But then I don't really know how to define relevant constraints, so I end up generating all the permutations (very long), then giving each a score (adding up the choice numbers to find the minimum) and eliminating results with groups that are too big.
I think there must be a good way to do this using CP, but I can't figure it out.
Can someone help me?
Thanks
I'm not sure that I understand everything in your description, for example "I end up generating all the permutations (very long)" and "giving each a score (adding up the choice numbers to find the minimum)". That said, here is a simple encoding of what I think would be a model of your problem, or at least a starting point.
It's written in MiniZinc and is shown below with a small example of 6 kids and 4 activities. The full model (including variants of some constraints) is here as well: http://hakank.org/minizinc/max_activity.mzn
Description of the variables:
"x" is an array of decision variables containing the selected activity for each kid. "scores" holds the score of each kid's selected activity (3, 2, or 1, depending on whether the kid got their first, second, or third preference), and "total_score" just sums the "scores" array.
include "globals.mzn";
int: num_kids;
array[1..num_kids, 1..3] of int: prefs;
int: num_activities;
array[1..num_activities] of int: activity_size;
% decision variables
array[1..num_kids] of var 1..num_activities: x; % the selected activity
array[1..num_kids] of var 1..num_activities: scores;
var int: total_score = sum(scores);
solve maximize total_score;
constraint
   forall(k in 1..num_kids) (
      % select one of the preferred activities
      let {
         var 1..3: p
      } in
         x[k] = prefs[k,p] /\
         scores[k] = 4-p % score for the selected activity
   )
   /\ % ensure size of the activities
   global_cardinality_low_up(x, [i | i in 1..num_activities], [0 | i in 1..num_activities], activity_size)
;
output [
   "x : ", show(x), "\n",
   "scores: ", show(scores), "\n",
   "total_score: ", show(total_score), "\n",
];
%
% some small fake data
%
num_kids = 6;
num_activities = 4;
% Activity preferences for each kid
prefs = array2d(1..num_kids, 1..3,
   [
     1,2,3,
     4,2,1,
     2,1,4,
     4,2,1,
     3,2,4,
     4,1,3
   ]);
% max size of activity
activity_size = [2,2,2,3];
The solution of this problem instance is:
x : [1, 4, 2, 4, 3, 4]
scores: [3, 3, 3, 3, 3, 3]
total_score: 18
This is a unique solution.
Using a slightly smaller activity_size ([2,2,2,2]) we get another optimal solution (total_score = 17), since there can be only 2 kids in activity #4 (kid #6 is then forced to take activity #1 instead):
x : [1, 4, 2, 4, 3, 1]
scores: [3, 3, 3, 3, 3, 2]
total_score: 17
There are two other possible selections for the second variant, namely:
x : [1, 4, 2, 2, 3, 4]
scores: [3, 3, 3, 2, 3, 3]
total_score: 17
----------
x : [1, 2, 2, 4, 3, 4]
scores: [3, 2, 3, 3, 3, 3]
total_score: 17
Update: I also did a Picat model using the same general approach: http://hakank.org/picat/max_activity.pi
Update 2: The above model assumes that every kid gets one of their preferred activities. When this assumption is not met, one has to fix this somehow instead of just getting "UNSATISFIED" as an answer. One way is to assign the kid some other, non-preferred, activity, which yields a score of 0. This is done in this model: http://hakank.org/minizinc/max_activity2.mzn
The changes compared to the original model are small:
the domain of "scores" is 0..num_activities
we add a disjunction "\/ scores[k] = 0" to the forall loop that selects the activity
Since this is a maximization problem, a score of 0 will not be used unless it is necessary.
I also added a sanity check that there are enough activities for all kids:
constraint
   assert(sum(activity_size) >= num_kids, "There are not enough activity slots for all the kids.")
;
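For completeness, the same idea can also be sketched in Python with OR-Tools' CP-SAT solver. This is not a translation of the model above, just a rough equivalent under the same assumptions, reusing the fake instance (activities are 1-based in the data and 0-based in the variables):
from ortools.sat.python import cp_model

num_kids, num_activities = 6, 4
prefs = [[1, 2, 3], [4, 2, 1], [2, 1, 4], [4, 2, 1], [3, 2, 4], [4, 1, 3]]
activity_size = [2, 2, 2, 3]

model = cp_model.CpModel()
x = [model.NewIntVar(0, num_activities - 1, 'x%i' % k) for k in range(num_kids)]
score = [model.NewIntVar(1, 3, 'score%i' % k) for k in range(num_kids)]

for k in range(num_kids):
    # each kid takes exactly one of their three preferred activities
    choice = [model.NewBoolVar('c%i_%i' % (k, p)) for p in range(3)]
    model.Add(sum(choice) == 1)
    for p in range(3):
        model.Add(x[k] == prefs[k][p] - 1).OnlyEnforceIf(choice[p])
        model.Add(score[k] == 3 - p).OnlyEnforceIf(choice[p])

# respect the activity capacities
for a in range(num_activities):
    takes_a = []
    for k in range(num_kids):
        b = model.NewBoolVar('t%i_%i' % (k, a))
        model.Add(x[k] == a).OnlyEnforceIf(b)
        model.Add(x[k] != a).OnlyEnforceIf(b.Not())
        takes_a.append(b)
    model.Add(sum(takes_a) <= activity_size[a])

model.Maximize(sum(score))

solver = cp_model.CpSolver()
if solver.Solve(model) in (cp_model.OPTIMAL, cp_model.FEASIBLE):
    print([solver.Value(v) + 1 for v in x], sum(solver.Value(s) for s in score))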