In the deep learning tutorials, all training data is stored in a shared array and only an index into that array is passed to the training function to slice out a minibatch.
I understand that this allows the data to be left in GPU memory, as opposed to passing small chunks of data as a parameter to the training function for each minibatch.
In some previous questions, this was given as an answer as to why the givens mechanism is used in the tutorials.
I don't yet see the connection between these two concepts, so I'm probably missing out on something essential.
As far as I understand, the givens mechanism swaps out a variable in the graph with a given symbolic expression (i.e., some given subgraph is inserted in place of that variable).
Then why not define the computational graph the way we need it in the first place?
Here is a minimal example. I define a shared variable X and an integer index, and I either create a graph that already contains the slicing operation, or I create one where the slicing operation is inserted post-hoc via givens.
By all appearances, the two resulting functions get_nogivens and get_tutorial are identical (see the debugprints at the end).
But then why do the tutorials use the givens pattern?
import numpy as np
import theano
import theano.tensor as T
X = theano.shared(np.arange(100),borrow=True,name='X')
index = T.scalar(dtype='int32',name='index')
X_slice = X[index:index+5]
get_tutorial = theano.function([index], X, givens={X: X[index:index+5]}, mode='DebugMode')
get_nogivens = theano.function([index], X_slice, mode='DebugMode')
> theano.printing.debugprint(get_tutorial)
DeepCopyOp [#A] '' 4
|Subtensor{int32:int32:} [#B] '' 3
|X [#C]
|ScalarFromTensor [#D] '' 0
| |index [#E]
|ScalarFromTensor [#F] '' 2
|Elemwise{add,no_inplace} [#G] '' 1
|TensorConstant{5} [#H]
|index [#E]
> theano.printing.debugprint(get_nogivens)
DeepCopyOp [#A] '' 4
|Subtensor{int32:int32:} [#B] '' 3
|X [#C]
|ScalarFromTensor [#D] '' 0
| |index [#E]
|ScalarFromTensor [#F] '' 2
|Elemwise{add,no_inplace} [#G] '' 1
|TensorConstant{5} [#H]
|index [#E]
They use givens here only to decouple the actual data that is passed to the graph from the symbolic input variable. You could instead build the graph directly with X[index * batch_size: (index + 1) * batch_size] in place of the input variable, but that is just a little messier.
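The slice expression in the answer maps a minibatch number to a contiguous range of rows. A minimal sketch of that index arithmetic, using plain NumPy rather than Theano; `data`, `batch_size` and `minibatch` are illustrative names and values, not taken from the tutorials:

```python
import numpy as np

# Hypothetical stand-in for the shared training array.
data = np.arange(100)
batch_size = 5  # assumed value, matching the slice width in the question


def minibatch(index):
    """Return minibatch number `index` as a slice of the full array."""
    return data[index * batch_size:(index + 1) * batch_size]


first = minibatch(0)   # rows 0..4
third = minibatch(2)   # rows 10..14
```

Only the integer `index` changes between calls; the full array stays where it is.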
I want to solve the following optimization problem using Gekko in Python 3.7 on Windows.
Original Problem
Here, x_s are continuous variables, D and Epsilon are deterministic and they are also parameters.
However, since a min function appears in the objective function, I removed it using binary variables (z1, z2), and the problem then becomes an MINLP as follows.
Modified problem
With Gekko,
(1) Can both original problem & modified problem be solved?
(2) How can I code summation in the objective function and also D & epsilon which are parameters in Gekko?
Thanks in advance.
Both problems should be feasible with Gekko but the original appears easier to solve. Here are a few suggestions for the original problem:
Use m.Maximize() for the objective
Use sum() for the inner summation and m.sum() for outer summation for the objective function. I switch to m.sum() when the summation would create an expression that is over 15,000 characters. Using sum() creates one long expression and m.sum() breaks the summation into pieces but takes longer to compile.
Use m.min3() for the min(Dt,xs) terms or slack variables s with x[i]+s[i]=D[i]. It appears that Dt (size 30) is an upper bound, but it has different dimensions than xs (size 100). Slack variables are much more efficient than using binary variables.
D = np.array(100)
x = m.Array(m.Var,100,lb=0,ub=2000000)
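One way to read the slack-variable suggestion, as a sketch only: the original problem is shown as an image, so the index $i$ is illustrative, and this assumes nothing else in the objective rewards $x_i$ for exceeding $D_i$.

```latex
x_i + s_i = D_i,\quad s_i \ge 0,\quad x_i \ge 0
\;\;\Longrightarrow\;\;
x_i \le D_i
\;\;\Longrightarrow\;\;
\min(D_i, x_i) = x_i .
```

The equality with a non-negative slack forces $x_i \le D_i$, so the min term can be replaced by $x_i$ itself and no binary variables are needed.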
The modified problem has 6000 binary variables and 100 continuous variables. There are 2^6000 potential combinations of those variables so it may take a while to solve, even with the efficient branch and bound method of APOPT. Here are a few suggestions for the modified problem:
Use matrix multiplications when possible. Below is an example of matrix operations with Gekko.
from gekko import GEKKO
import numpy as np
m = GEKKO(remote=False)
ni = 3; nj = 2; nk = 4
# solve AX=B
A = m.Array(m.Var,(ni,nj),lb=0)
X = m.Array(m.Var,(nj,nk),lb=0)
AX = np.dot(A,X)
B = m.Array(m.Var,(ni,nk),lb=0)
# equality constraints
m.Equations([AX[i,j]==B[i,j] for i in range(ni) \
for j in range(nk)])
m.Equation(5==m.sum([m.sum([A[i][j] for i in range(ni)]) \
for j in range(nj)]))
m.Equation(2==m.sum([m.sum([X[i][j] for i in range(nj)]) \
for j in range(nk)]))
# objective function
m.Minimize(m.sum([m.sum([B[i][j] for i in range(ni)]) \
for j in range(nk)]))
m.solve()
print(A)
print(X)
print(B)
Declare z1 and z2 variables as integer type with integer=True. Here is more information on using the integer type.
Solve locally with m=GEKKO(remote=False). The processing time will be large and the public server resets connections and deletes jobs every day. Switch to local mode to avoid a potential disruption.
I'm trying to investigate the behavior of the following Delayed Differential Equation using Python:
y''(t) = -y(t)/τ^2 - 2y'(t)/τ - Nd*f(y(t-T))/τ^2,
where f is a cut-off function which is essentially equal to the identity when the absolute value of its argument is between 1 and 10 and otherwise is equal to 0 (see figure 1), and Nd, τ and T are constants.
For this I'm using the package JiTCDDE. This provides a reasonable solution to the above equation. Nevertheless, when I try to add noise to the right-hand side of the equation, I obtain a solution which stabilizes to a non-zero constant after a few oscillations. This is not a mathematical solution of the equation (the only possible constant solution being equal to zero). I don't understand why this problem arises or whether it is possible to solve it.
I reproduce my code below. Here, for the sake of simplicity, I substituted the noise with a high-frequency cosine, which is introduced into the system of equations as the initial condition for a dummy variable (the cosine could have been introduced directly into the system, but for general noise this doesn't seem possible). To simplify the problem further, I also removed the term involving the function f, as the problem arises without it as well. Figure 2 shows the plot of the function given by the code.
from jitcdde import jitcdde, y, t
import numpy as np
from matplotlib import pyplot as plt
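import math
import symengine  # provides erf, used in functionf below
from chspy import CubicHermiteSpline
# Definition of function f
# (Bmin and Bmax are the cut-off bounds; they are not defined in this reduced example):
def functionf(x):
    return x/4*(1+symengine.erf(x**2-Bmin**2))*(1-symengine.erf(x**2-Bmax**2))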
#parameters:
τ = 42.9
T = 35.33
Nd = 8.32
# Definition of the initial conditions:
dt = .01 # Time step.
totT = 10000. # Total time.
Nmax = int(totT / dt) # Number of time steps.
Vt = np.linspace(0., totT, Nmax) # Vector of times.
# Definition of the "noise"
X = np.zeros(Nmax)
for i in range(Nmax):
    X[i]=math.cos(Vt[i])
past=CubicHermiteSpline(n=3)
for time, datum in zip(Vt,X):
    regular_past = [10.,0.]
    past.append((
        time-totT,
        np.hstack((regular_past,datum)),
        np.zeros(3)
    ))
noise= lambda t: y(2,t-totT)
# Integration of the DDE
g = [
    y(1),
    -y(0)/τ**2-2*y(1)/τ+0.008*noise(t)
]
g.append(0)
DDE = jitcdde(g)
DDE.add_past_points(past)
DDE.adjust_diff()
data = []
for time in np.arange(DDE.t, DDE.t+totT, 1):
    data.append( DDE.integrate(time)[0] )
plt.plot(data)
plt.show()
Incidentally, I noticed that even without noise, the solution seems to be discontinuous at the point zero (y is set to be equal to zero for negative times), and I don't understand why.
As the comments unveiled, your problem eventually boiled down to this:
step_on_discontinuities assumes delays that are small with respect to the integration time and performs steps that are placed on those times where the delayed components point to the integration start (0 in your case). This way initial discontinuities are handled.
However, implementing an input with a delayed dummy variable introduces a large delay into the system, totT in your case.
The respective step for step_on_discontinuities would be at totT itself, i.e., after the desired integration time.
Thus when you reach for time in np.arange(DDE.t, DDE.t+totT, 1): in your code, DDE.t is totT.
Therefore you have made a big step before you actually start integrating and observing, which may look like a discontinuity and lead to weird results. In particular, you do not see the effect of your input, because it has already “ended” at this point.
To avoid this, use adjust_diff or integrate_blindly instead of step_on_discontinuities.
I want to check whether a data set is linearly separable or not. I am using the method mentioned at this link for the purpose. Here are the constraint equations that I want to implement using pulp:
-h^Ta + B <= -1
h^Tb - B <= -1
In the above equations 'a' represents data belonging to one class and 'b' represents data belonging to the other class.
The data stored in variable A has 11 columns. The last column contains the value -1 or 1, depending on whether the row belongs to the first or the second equation. Similarly, the remaining columns contain all negative or all positive values, depending on whether the row belongs to the first or the second equation. Below is the code that I am using:
try:
    import os
    #import random
    import traceback
    import datetime
    #import numpy as np
    import scipy.io as sio
    import pulp
    os.system('cls')
    dicA = sio.loadmat('A1.mat')
    A = dicA.get('A1')
    var = pulp.LpVariable.dicts("var",range(11),cat =pulp.LpContinuous)
    model = pulp.LpProblem("Data linearly seaparable", pulp.LpMinimize)
    print(datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S'))
    for i in range(len(A)):
        expr = pulp.LpAffineExpression()
        for j in range(len(A[i])):
            expr += var[j]*A[i][j]
        expr = expr <= -1
        model+= expr
    print(datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S'))
    model.solve()
    print(pulp.LpStatus[model.status])
    print(datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S'))
except:
    print('exception')
    tb = traceback.format_exc()
    print(tb)
finally:
    print('reached finally')
And, I am getting the following output:
2017-08-31 07:28:30
2017-08-31 07:28:36
Infeasible
2017-08-31 07:28:42
reached finally
According to PuLP's documentation, the first equation that we add to the model should be the objective function, but there is no objective function in this case, so I am adding only constraints to the model. Is this right, or is there a way to specify that there is no objective function?
According to pulps documentation, the first equation that we add to the model should be the objective function, but there is no objective function in this case
First thing you add to the "model" (problem) is not an equation but a formula, that acts as the objective function.
If you have no objective function, add an arbitrary one. Here's an example from the documentation:
# The arbitrary objective function is added
prob += 0, "Arbitrary Objective Function"
I have variable 'x_data' sized 360x190, I am trying to select particular rows of data.
x_data_train = []
x_data_train = np.append([x_data_train,
x_data[0:20,:],
x_data[46:65,:],
x_data[91:110,:],
x_data[136:155,:],
x_data[181:200,:],
x_data[226:245,:],
x_data[271:290,:],
x_data[316:335,:]],axis = 0)
I get the following error :
TypeError: append() missing 1 required positional argument: 'values'
Where did I go wrong?
If I am using
x_data_train = []
x_data_train.append(x_data[0:20,:])
x_data_train.append(x_data[46:65,:])
x_data_train.append(x_data[91:110,:])
x_data_train.append(x_data[136:155,:])
x_data_train.append(x_data[181:200,:])
x_data_train.append(x_data[226:245,:])
x_data_train.append(x_data[271:290,:])
x_data_train.append(x_data[316:335,:])
the size of the output is 8 instead of 160 rows.
Update:
In matlab, I will load the text file and x_data will be variable having 360 rows and 190 columns.
If I want to select rows 1 to 20, 46 to 65, ... of the data, I simply write
x_data_train = xdata([1:20,46:65,91:110,136:155,181:200,226:245,271:290,316:335], :);
and the resulting x_data_train is the array I want.
How can I do that in Python? My attempt results in a list of 8 sub-arrays of 20*192 each, but I want one array of 160*192.
Short version: the most idiomatic and fastest way to do what you want in python is this (assuming x_data is a numpy array):
x_data_train = np.vstack([x_data[0:20,:],
x_data[46:65,:],
x_data[91:110,:],
x_data[136:155,:],
x_data[181:200,:],
x_data[226:245,:],
x_data[271:290,:],
x_data[316:335,:]])
This can be shortened (but made very slightly slower) by doing:
x_data[np.r_[0:20,46:65,91:110,136:155,181:200,226:245,271:290,316:335], :]
For your case where you have a lot of indices I think it helps readability, but in cases where there are fewer indices I would use the first approach.
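A quick check of the two approaches on a stand-in array, as a sketch: the 360x190 array of consecutive integers is an assumption, not the asker's data. It also shows that the slice bounds copied from the question select 153 rows, because Python slices are 0-based and exclude the stop index; the 1-based, inclusive MATLAB ranges correspond to slices that start one lower.

```python
import numpy as np

# Stand-in for the 360x190 data set from the question.
x_data = np.arange(360 * 190).reshape(360, 190)

starts = [0, 46, 91, 136, 181, 226, 271, 316]
stops = [20, 65, 110, 155, 200, 245, 290, 335]

# Approach 1: stack the individual slices vertically.
stacked = np.vstack([x_data[a:b, :] for a, b in zip(starts, stops)])

# Approach 2: build one index array with np.r_ and index once.
rows = np.r_[0:20, 46:65, 91:110, 136:155, 181:200, 226:245, 271:290, 316:335]
indexed = x_data[rows, :]

print(stacked.shape)                     # (153, 190)
print(np.array_equal(stacked, indexed))  # True

# MATLAB's 46:65 is 1-based and inclusive (20 rows); in Python that is 45:65.
matlab_rows = np.r_[0:20, 45:65, 90:110, 135:155,
                    180:200, 225:245, 270:290, 315:335]
matlab_like = x_data[matlab_rows, :]
print(matlab_like.shape)                 # (160, 190)
```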
Long version:
There are several different issues at play here.
First, in python, [] makes a list, not an array like in MATLAB. Lists are more like 1D cell arrays. They can hold any data type, including other lists, but they cannot have multiple dimensions. The equivalent of MATLAB matrices in Python are numpy arrays, which are created using np.array.
Second, [x, y] in Python always creates a list where the first element is x and the second element is y. In MATLAB [x, y] can do one of several completely different things depending on what x and y are. In your case, you want to concatenate. In Python, you need to concatenate explicitly. For two lists, there are several ways to do that. The simplest is using x += y, which modifies x in-place by putting the contents of y at the end. You can combine multiple lists by doing something like x += y + z + w. If you want to keep x unchanged, you can assign to a new variable using something like z = x + y. Finally, you can use x.extend(y), which is roughly equivalent to x += y but works with some data types besides lists.
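The list operations described above, as a small sketch with made-up values:

```python
x = [1, 2]
y = [3, 4]

z = x + y          # new list; x and y are left unchanged
x_before = list(x)

x += y             # in-place: the items of y are added to the end of x

w = [0]
w.extend(y)        # same effect as `w += y`
w.extend((5, 6))   # extend accepts other iterables too, such as tuples
```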
For numpy arrays, you need to use a slightly different approach. While Python lists can be modified in-place, strictly speaking neither MATLAB matrices nor numpy arrays can be resized in-place. MATLAB pretends to allow this, but it is really creating a new matrix behind the scenes (which is why you get a warning if you try to resize a matrix in a loop). Numpy requires you to be more explicit about creating a new array. The simplest approach is to use np.hstack, which concatenates arrays horizontally (or np.vstack or np.dstack for vertical and depth concatenation, respectively). So you could do z = np.hstack([v, w, x, y]). There is an np.append function in numpy (arrays themselves have no append method), but it copies the whole array on every call and flattens its inputs unless you pass axis, so it is rarely what you want.
Third, what append does is create one new element in the target list and put whatever variable append is called with in that element. So if you do x.append([1,2,3]), it adds one new element to the end of list x containing the list [1,2,3]. In MATLAB terms it would be more like x = [x, {{1,2,3}}], where x is a cell array.
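This is why the second attempt in the question ends up with 8 items rather than one array. A reduced sketch with two pieces and a zero-filled stand-in array (both are assumptions made to keep it short):

```python
import numpy as np

x_data = np.zeros((360, 190))  # stand-in for the real data

pieces = []
pieces.append(x_data[0:20, :])   # adds ONE element: a 20x190 array
pieces.append(x_data[46:65, :])  # adds ONE element: a 19x190 array

# `pieces` is a list of 2 arrays; stacking turns it into a single array.
combined = np.vstack(pieces)
```

Collecting the pieces in a list and stacking once at the end is a common pattern when the pieces are produced in a loop.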
Fourth, Python makes heavy use of "methods", which are basically functions attached to data (it is a bit more complicated than that in practice, but those complexities aren't really relevant here). Recent versions of MATLAB have added them as well, but they aren't really integrated into MATLAB data types like they are in Python. So where in MATLAB you would usually use sum(x), for numpy arrays you would use x.sum(). In this case, assuming you were appending to a list (which you aren't), you wouldn't use np.append(x, y), you would use x.append(y).
Finally, in MATLAB x:y creates a matrix of values from x to y. In Python, however, it creates a "slice", which doesn't actually contain all the values and so can be processed much more quickly by lists and numpy arrays. However, you can't really work with multiple slices like you do in your example (nor does it make sense to, because slices in numpy don't make copies like they do in MATLAB, while using multiple indexes does make a copy). You can get something close to what you have in MATLAB using np.r_, which creates a numpy array based on indexes and slices. So to reproduce your example in numpy, where x_data is a numpy array, you can do x_data[np.r_[0:20,46:65,91:110,136:155,181:200,226:245,271:290,316:335], :]
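The view-versus-copy distinction mentioned above can be checked directly. A small sketch on a made-up 4x3 array:

```python
import numpy as np

xdata = np.arange(12).reshape(4, 3)

view = xdata[0:2, :]          # basic slice: a view onto xdata's memory
copy = xdata[np.r_[0:2], :]   # integer-array index: an independent copy

index_array = np.r_[0:2, 3:4]  # slices expanded into one index array
print(index_array)                     # [0 1 3]
print(np.shares_memory(xdata, view))   # True
print(np.shares_memory(xdata, copy))   # False
```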
More information on x_data and np might be needed to solve this but...
First: You're creating 2 copies of the same list: np and x_data_train
Second: Your indexes on x_data are strange
Third: You're passing 3 objects to append() when it only accepts 2.
I'm pretty sure revisiting your indexes on x_data will be where you solve the current error, but it will result in another error related to passing 2 values to append.
And I'm also sure you want
x_data_train.append(object)
not
x_data_train = np.append(object)
and you may actually want
x_data_train.extend([objects])
More on append vs extend here: append vs. extend
Given 1 billion records containing the following information:
ID x1 x2 x3 ... x100
1 0.1 0.12 1.3 ... -2.00
2 -1 1.2 2 ... 3
...
For each ID above, I want to find the top 10 closest IDs, based on Euclidean distance of their vectors (x1, x2, ..., x100).
What's the best way to compute this?
As it happens, I have a solution to this, involving combining sklearn with Spark: https://adventuresindatascience.wordpress.com/2016/04/02/integrating-spark-with-scikit-learn-visualizing-eigenvectors-and-fun/
The gist of it is:
Use sklearn’s k-NN fit() method centrally
But then use sklearn’s k-NN kneighbors() method distributedly
Performing a brute-force comparison of all records against all records is a losing battle. My suggestion would be to go for a ready-made implementation of k-Nearest Neighbor algorithm such as the one provided by scikit-learn then broadcast the resulting arrays of indices and distances and go further.
Steps in this case would be:
1- vectorize the features as Bryce suggested and let your vectorizing method return a list (or numpy array) of floats with as many elements as your features
2- fit your scikit-learn nn to your data:
nbrs = NearestNeighbors(n_neighbors=10, algorithm='auto').fit(vectorized_data)
3- run the trained algorithm on your vectorized data (training and query data are the same in your case)
distances, indices = nbrs.kneighbors(qpa)
Steps 2 and 3 will run on your pyspark node and are not parallelizable in this case. You will need to have enough memory on this node. In my case with 1.5 Million records and 4 features, it took a second or two.
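To make the shape of the result concrete, here is a brute-force NumPy stand-in that returns a (distances, indices) pair like the one in step 3. It is not scikit-learn's implementation, `knn_brute_force` is a hypothetical helper, and it builds the full n-by-n distance matrix, so it only suits small inputs. It excludes each point from its own neighbor list; as far as I know, calling kneighbors on the training data itself returns each point as its own first neighbor, so you would ask for one extra neighbor there.

```python
import numpy as np


def knn_brute_force(data, k):
    """Return (distances, indices) of the k nearest other rows for each row."""
    # Squared distances via ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b
    sq_norms = (data ** 2).sum(axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (data @ data.T)
    np.maximum(sq_dists, 0.0, out=sq_dists)  # clip tiny negatives from rounding
    np.fill_diagonal(sq_dists, np.inf)       # exclude each point itself
    # argpartition finds the k smallest per row without a full sort
    nearest = np.argpartition(sq_dists, k - 1, axis=1)[:, :k]
    nearest_sq = np.take_along_axis(sq_dists, nearest, axis=1)
    order = np.argsort(nearest_sq, axis=1)   # sort only the k candidates
    indices = np.take_along_axis(nearest, order, axis=1)
    distances = np.sqrt(np.take_along_axis(nearest_sq, order, axis=1))
    return distances, indices


points = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 3.0], [5.0, 5.0]])
distances, indices = knn_brute_force(points, k=2)
```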
Until we get a good implementation of NN for spark I guess we would have to stick to these workarounds. If you'd rather like to try something new, then go for http://spark-packages.org/package/saurfang/spark-knn
You haven't provided a lot of detail, but the general approach I would take to this problem would be to:
Convert the records to a data structure like a LabeledPoint with (ID, x1..x100) as label and features
Map over each record and compare that record to all the other records (lots of room for optimization here)
Create some cutoff logic so that once you start comparing ID = 5 with ID = 1 you interrupt the computation because you have already compared ID = 1 with ID = 5
Some reduce step to get a data structure like {id_pair: [1,5], distance: 123}
Another map step to find the 10 closest neighbors of each record
You've identified pyspark and I generally do this type of work using scala, but some pseudo code for each step might look like:
import math

# 1. vectorize the features (plain-Python stand-in for a LabeledPoint)
def vectorize_raw_data(record):
    return record[0], list(record[1:])

# 2,3 + 4 map over each record for comparison
broadcast_var = []

def calc_distance(record, comparison):
    # Euclidean distance: subtract the features pairwise, square the
    # differences, keep a running sum of the squares, take the square root
    (id_a, features_a), (id_b, features_b) = record, comparison
    total = sum((a - b) ** 2 for a, b in zip(features_a, features_b))
    return {"id_pair": (id_a, id_b), "distance": math.sqrt(total)}

def compare_all(all_records):
    points = [vectorize_raw_data(r) for r in all_records]
    for i, record in enumerate(points):
        # starting at i + 1 skips pairs that were already compared
        for comparison in points[i + 1:]:
            broadcast_var.append(calc_distance(record, comparison))

# 5. map for 10 closest neighbors
def closest_neighbors(record_id, n=10):
    own_pairs = [d for d in broadcast_var if record_id in d["id_pair"]]
    return sorted(own_pairs, key=lambda d: d["distance"])[:n]
The pseudocode is terrible, but I think it communicates the intent. There will be a lot of shuffling and sorting here, as you are comparing all records with all other records. IMHO, you want to store the key pair/distance in a central place (like a broadcast variable that gets updated, though this is dangerous) to reduce the total number of Euclidean distance calculations you perform.