Consider the array t below. When using the min_frequency kwarg of the OneHotEncoder class, I cannot understand why the category snake is still present when transforming a new array. This label accounts for only 2 of the 40 samples. Shouldn't the shape of e be (4, 3) instead?
sklearn.__version__ == '1.1.1'
t = np.array([['dog'] * 8 + ['cat'] * 20 + ['rabbit'] * 10 +
['snake'] * 2], dtype=object).T
enc = OneHotEncoder(min_frequency= 4/40,
sparse=False).fit(t)
print(enc.infrequent_categories_)
# [array(['snake'], dtype=object)]
e = enc.transform(np.array([['dog'], ['cat'], ['dog'], ['snake']]))
array([[0., 1., 0., 0.],
[1., 0., 0., 0.],
[0., 1., 0., 0.],
[0., 0., 0., 1.]]) # snake is present?
Check out enc.get_feature_names_out():
array(['x0_cat', 'x0_dog', 'x0_rabbit', 'x0_infrequent_sklearn'],
dtype=object)
"snake" isn't considered its own category anymore, but lumped into the infrequent category. If you added some other rare categories, they'd be assigned to the same, and if you additionally set handle_unknown="infrequent_if_exist", you would also encode unseen categories to the same.
Suppose I have two tensors S and T defined as:
S = torch.rand((3,2,1))
T = torch.ones((3,2,1))
We can think of these as containing batches of tensors with shapes (2, 1). In this case, the batch size is 3.
I want to concatenate all possible pairings between batches. A single concatenation of batches produces a tensor of shape (4, 1). And there are 3*3 combinations so ultimately, the resulting tensor C must have a shape of (3, 3, 4, 1).
One solution is to do the following:
C = torch.empty(3, 3, 4, 1)
for i in range(S.shape[0]):
    for j in range(T.shape[0]):
        C[i,j,:,:] = torch.cat((S[i,:,:],T[j,:,:]))
But the for loop doesn't scale well to large batch sizes. Is there a PyTorch command to do this?
I don't know of any out-of-the-box command that performs this operation. However, you can pull it off in a straightforward way using a single matrix multiplication.
The trick is to start from the already stacked S and T tensors and multiply the stack by a suitably chosen mask tensor, which yields a tensor containing all pairs of batch elements. In this method, keeping track of shapes and dimension sizes is essential.
The stack is given by (notice the reshape, we essentially flatten the batch elements from S and T into a single batch axis on ST):
>>> ST = torch.stack((S, T)).reshape(6, 2)
>>> ST
tensor([[0.7792, 0.0095],
[0.1893, 0.8159],
[0.0680, 0.7194],
[1.0000, 1.0000],
[1.0000, 1.0000],
[1.0000, 1.0000]])
# ST.shape = (6, 2)
You can retrieve the indices of all (S[i], T[j]) pairs using range and itertools.product:
>>> from itertools import product
>>> indices = torch.tensor(list(product(range(0, 3), range(3, 6))))
tensor([[0, 3],
[0, 4],
[0, 5],
[1, 3],
[1, 4],
[1, 5],
[2, 3],
[2, 4],
[2, 5]])
# indices.shape = (9, 2)
From there, we construct one-hot encodings of the indices using torch.nn.functional.one_hot:
>>> from torch.nn.functional import one_hot
>>> mask = one_hot(indices).float()
tensor([[[1., 0., 0., 0., 0., 0.],
[0., 0., 0., 1., 0., 0.]],
[[1., 0., 0., 0., 0., 0.],
[0., 0., 0., 0., 1., 0.]],
[[1., 0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0., 1.]],
[[0., 1., 0., 0., 0., 0.],
[0., 0., 0., 1., 0., 0.]],
[[0., 1., 0., 0., 0., 0.],
[0., 0., 0., 0., 1., 0.]],
[[0., 1., 0., 0., 0., 0.],
[0., 0., 0., 0., 0., 1.]],
[[0., 0., 1., 0., 0., 0.],
[0., 0., 0., 1., 0., 0.]],
[[0., 0., 1., 0., 0., 0.],
[0., 0., 0., 0., 1., 0.]],
[[0., 0., 1., 0., 0., 0.],
[0., 0., 0., 0., 0., 1.]]])
# mask.shape = (9, 2, 6)
Finally, we compute the matrix multiplication and reshape it to the final form:
>>> (mask@ST).reshape(3, 3, 4, 1)
tensor([[[[0.7792],
[0.0095],
[1.0000],
[1.0000]],
[[0.7792],
[0.0095],
[1.0000],
[1.0000]],
[[0.7792],
[0.0095],
[1.0000],
[1.0000]]],
[[[0.1893],
[0.8159],
[1.0000],
[1.0000]],
[[0.1893],
[0.8159],
[1.0000],
[1.0000]],
[[0.1893],
[0.8159],
[1.0000],
[1.0000]]],
[[[0.0680],
[0.7194],
[1.0000],
[1.0000]],
[[0.0680],
[0.7194],
[1.0000],
[1.0000]],
[[0.0680],
[0.7194],
[1.0000],
[1.0000]]]])
I initially went with torch.einsum: torch.einsum('bf,pib->pif', ST, mask). I later realized that bf,pib->pif reduces nicely to a simple torch.Tensor.matmul operation if we switch the two operands, i.e. pib,bf->pif (subscript b is reduced in the middle).
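As a point of comparison, the same result can probably be reached without a mask by broadcasting each tensor along a new axis and concatenating. This is a sketch of a different technique from the matmul approach above, not a replacement for it; the names S_exp, T_exp and C_loop are introduced here for illustration only.

```python
import torch

S = torch.rand((3, 2, 1))
T = torch.ones((3, 2, 1))
n, m = S.shape[0], T.shape[0]

# Add a pairing axis to each tensor and expand it (expand creates a view,
# so no data is copied until torch.cat).
S_exp = S[:, None].expand(n, m, *S.shape[1:])  # shape (3, 3, 2, 1)
T_exp = T[None, :].expand(n, m, *T.shape[1:])  # shape (3, 3, 2, 1)

# Concatenate along the axis that held the original size-2 dimension.
C = torch.cat((S_exp, T_exp), dim=2)           # shape (3, 3, 4, 1)

# Reference result from the double loop in the question.
C_loop = torch.empty(n, m, 4, 1)
for i in range(n):
    for j in range(m):
        C_loop[i, j] = torch.cat((S[i], T[j]))

print(C.shape, torch.equal(C, C_loop))
```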
In NumPy, np.meshgrid is used for this.
https://stackoverflow.com/a/35608701/3259896
So in pytorch, it would be
torch.stack(
    torch.meshgrid(x, y)
).T.reshape(-1,2)
where x and y are your two lists. You can pass any number of them: x, y, z, etc.
You then reshape to the number of lists you used.
So if you used three lists, use .reshape(-1,3), for four use .reshape(-1,4), etc.
So for 5 tensors, use
torch.stack(
    torch.meshgrid(a, b, c, d, e)
).T.reshape(-1,5)
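For 1-D inputs, a small sketch of the same idea follows. It assumes a PyTorch version where torch.meshgrid accepts the indexing argument (1.10 or later) and compares the result with torch.cartesian_prod. Note that stacking on the last axis gives rows in a different order from the .T variant above; the set of pairs is the same.

```python
import torch

x = torch.tensor([1, 2, 3])
y = torch.tensor([4, 5])

# Built-in Cartesian product of 1-D tensors, one row per pair.
pairs = torch.cartesian_prod(x, y)

# Meshgrid variant: stack the grids on a trailing axis, then flatten.
grid_pairs = torch.stack(
    torch.meshgrid(x, y, indexing="ij"), dim=-1
).reshape(-1, 2)

print(pairs)
print(torch.equal(pairs, grid_pairs))
```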
I'm trying to determine p and q values for an ARMA model. The time series is already stationary and I was looking at ACF and PACF plots, but I need to get those p and q values "on the go" (as when running a simulation).
I noticed that in statsmodels there are actually two functions for acf and pacf, but I'm not understanding how to use them properly.
This is what the code looks like:
>>> from statsmodels.tsa.stattools import acf, pacf
>>> acf(data, qstat=True)
(array([1. , 0.98707179, 0.9809318 , 0.9774078 , 0.97436479,
0.97102392, 0.96852746, 0.96620799, 0.9642253 , 0.96288455,
0.96128443, 0.96026672, 0.95912503, 0.95806287, 0.95739194,
0.95622575, 0.9545498 , 0.95381055, 0.95318588, 0.95203675,
0.95096276, 0.94996035, 0.94892427, 0.94740811, 0.94582933,
0.94420572, 0.9420396 , 0.9408416 , 0.93969163, 0.93789606,
0.93608273, 0.93413445, 0.93343312, 0.93233588, 0.93093149,
0.93033546, 0.92983324, 0.92910616, 0.92830326, 0.92799811,
0.92642784]),
array([ 2916.11296684, 5797.02377904, 8658.22999328, 11502.6002944 ,
14328.44503612, 17140.72034976, 19940.48013538, 22729.69637912,
25512.09429552, 28286.18290207, 31055.33003897, 33818.82409725,
36577.1270353 , 39332.49361223, 42082.0755955 , 44822.94911057,
47560.49941212, 50295.38504714, 53024.59880222, 55748.57526173,
58467.72758802, 61181.8659989 , 63888.25003765, 66586.53110019,
69276.46332225, 71954.97102175, 74627.57217707, 77294.54406888,
79952.23080669, 82600.54514273, 85238.73829645, 87873.86209917,
90503.68343426, 93126.47509834, 95746.79574474, 98365.17422285,
100980.34471949, 103591.88164688, 106202.58634768, 108805.3453693 ]),
array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0.]))
>>>pacf(data)
array([ 1. , 0.98740203, 0.26463067, 0.18709112, 0.11351714,
0.0540612 , 0.06996315, 0.05159168, 0.05358487, 0.06867607,
0.03915513, 0.06099868, 0.04020074, 0.0390229 , 0.05198753,
0.01873783, -0.00169158, 0.04387457, 0.03770717, 0.01360295,
0.01740693, 0.01566421, 0.01409722, -0.00988412, -0.00860644,
-0.00905181, -0.0344616 , 0.0199406 , 0.01123293, -0.02002155,
-0.01415968, -0.0266674 , 0.03583483, 0.0065682 , -0.00483241,
0.0342638 , 0.02353691, 0.01704061, 0.01292073, 0.03163407,
-0.02838961])
How can I get p and q with these functions? The acf function returns only one array if qstat is set to False.
Selecting the order of an ARMA(p,q) model using estimated ACFs/PACFs is usually not the best approach. This is simply because, in the case of an ARMA process, both the ACF and PACF decay slowly (in absolute terms) for increasing lags, so you cannot really infer the lag order from them. Instead they are mostly used for pure AR/MA models, in which you observe a clear cutoff in one of the two series (but even then it is more of a graphical approach).
If you want to determine p and q "on the fly" for an ARMA model it seems more reasonable to use information criteria (e.g. AIC, BIC, etc.). statsmodels provides the function arma_order_select_ic() for this very purpose. So what you want is something like this:
from statsmodels.tsa.stattools import arma_order_select_ic
arma_order_select_ic(data, max_ar=4, max_ma=4, ic='bic')
I want to create a tensor like
tensor([[[1,0,0],[0,1,0],[0,0,1]],[[2,0,0],[0,2,0],[0,0,2]]])
That is, given a torch tensor B of size (1,n), I want to create a torch tensor A of size (n,3,3) such that A[i] is B[i] * (identity matrix of size 3x3).
How do I create this without using a for loop?
Use torch.einsum (Einstein summation notation):
A = torch.eye(3)
b = torch.tensor([1.0, 2.0, 3.0])
torch.einsum('ij,k->kij', A, b)
Will return:
tensor([[[1., 0., 0.],
[0., 1., 0.],
[0., 0., 1.]],
[[2., 0., 0.],
[0., 2., 0.],
[0., 0., 2.]],
[[3., 0., 0.],
[0., 3., 0.],
[0., 0., 3.]]])
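Plain broadcasting should give the same tensor without einsum. The sketch below assumes B has shape (1, n) as in the question and flattens it first; the names b, A and expected are illustrative.

```python
import torch

B = torch.tensor([[1.0, 2.0, 3.0]])   # shape (1, n), as in the question
b = B.flatten()                        # shape (n,)

# (n, 1, 1) * (3, 3) broadcasts to (n, 3, 3): each slice is b[i] * identity.
A = b[:, None, None] * torch.eye(3)

# Cross-check against the einsum formulation from the answer.
expected = torch.einsum('ij,k->kij', torch.eye(3), b)
print(A.shape, torch.equal(A, expected))
```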
I'm pretty certain this is trivial, but I haven't yet managed to quite get my head around scan. I want to iteratively build a matrix of values, m, where
m[i,j] = f(m[k,l]) for k < i, j < l
so you could think of it as a dynamic programming problem. However, I can't even generate the list [1..100] by iterating over the list [1..100] and updating the shared value as I go.
import numpy as np
import theano as T
import theano.tensor as TT
def test():
    arr = T.shared(np.zeros(100))
    def grid(idx, arr):
        return {arr: TT.set_subtensor(arr[idx], idx)}
    T.scan(
        grid,
        sequences=TT.arange(100),
        non_sequences=[arr])
    return arr
run = T.function([], outputs=test())
run()
which returns
array([ 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0.])
There are a few things here that point towards some misunderstandings. scan really can be a hard bit of Theano to wrap your head around!
Here's some updated code that does what I think you're trying to do, but I wouldn't recommend using this code at all. The basic issue is that you seem to be using a shared variable inappropriately.
import numpy as np
import theano as T
import theano.tensor as TT
def test():
    arr = T.shared(np.zeros(100))
    def grid(idx, arr):
        return {arr: TT.set_subtensor(arr[idx], idx)}
    # Keep the updates returned by scan instead of discarding them.
    _, updates = T.scan(
        grid,
        sequences=TT.arange(100),
        non_sequences=[arr])
    return arr, updates
outputs, updates = test()
run = T.function([], outputs=outputs, updates=updates)
print(run())
print(outputs.get_value())
This code is changed from the original in two ways:
The updates from the scan have to be captured (originally discarded) and passed to theano.function's updates parameter. Without this the shared variable won't be updated at all.
The contents of the shared variable need to be examined after the function is executed (see below).
This code prints two sets of values. The first is the output of the Theano function from when it's executed. The second is the contents of the shared variable after the Theano function has executed. The Theano function returns the shared variable so you might think that these two sets of values should be the same, but you'd be wrong! No shared variables are updated until after all of the function's output values have been computed. So it's only after the function has been executed and we look at the contents of the shared variable that we see the values we expected to see originally.
Here's an example of implementing a dynamic programming algorithm in Theano. The algorithm is a simplified version of dynamic time warping which has a lot of similarities to edit distance.
import numpy
import theano
import theano.tensor as tt
def inner_step(j, c_ijm1, i, c_im1, x, y):
    insert_cost = tt.switch(tt.eq(j, 0), numpy.inf, c_ijm1)
    delete_cost = tt.switch(tt.eq(i, 0), numpy.inf, c_im1[j])
    match_cost = tt.switch(tt.eq(i, 0), numpy.inf, c_im1[j - 1])
    in_top_left = tt.and_(tt.eq(i, 0), tt.eq(j, 0))
    min_c = tt.min(tt.stack([insert_cost, delete_cost, match_cost]))
    c_ij = tt.abs_(x[i] - y[j]) + tt.switch(in_top_left, 0., min_c)
    return c_ij
def outer_step(i, c_im1, x, y):
    outputs, _ = theano.scan(inner_step, sequences=[tt.arange(y.shape[0])],
                             outputs_info=[tt.constant(0, dtype=theano.config.floatX)],
                             non_sequences=[i, c_im1, x, y], strict=True)
    return outputs
def main():
    x = tt.vector()
    y = tt.vector()
    outputs, _ = theano.scan(outer_step, sequences=[tt.arange(x.shape[0])],
                             outputs_info=[tt.zeros_like(y)],
                             non_sequences=[x, y], strict=True)
    f = theano.function([x, y], outputs=outputs)
    a = numpy.array([1, 2, 4, 8], dtype=theano.config.floatX)
    b = numpy.array([2, 3, 4, 7, 8, 9], dtype=theano.config.floatX)
    print(a)
    print(b)
    print(f(a, b))
main()
This is highly simplified and I wouldn't recommend using it for real. In general Theano is very bad at doing dynamic programming because theano.scan is so slow in comparison to native looping. If you need to propagate gradients through a dynamic program then you may not have any choice but if you don't need gradients you should probably avoid using Theano for dynamic programming.
If you want a much more thorough implementation of DTW which gets over some of the performance hits Theano imposes by computing many comparisons in parallel (i.e. batching) then take a look here: https://github.com/danielrenshaw/TheanoBatchDTW.
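For the case where gradients are not needed, the same simplified recurrence can be sketched with plain NumPy loops. This is an illustration, not a port of the Theano code: the function name simple_dtw is made up here, and the sketch treats the diagonal move as unavailable in the first column (the usual DTW convention), whereas the Theano version above indexes c_im1[j - 1] there.

```python
import numpy as np

def simple_dtw(x, y):
    """Fill the cumulative cost matrix of a simplified DTW."""
    cost = np.empty((len(x), len(y)))
    for i in range(len(x)):
        for j in range(len(y)):
            # Moves that would leave the matrix cost infinity.
            insert_cost = cost[i, j - 1] if j > 0 else np.inf
            delete_cost = cost[i - 1, j] if i > 0 else np.inf
            match_cost = cost[i - 1, j - 1] if i > 0 and j > 0 else np.inf
            # The top-left cell has no predecessor.
            if i == 0 and j == 0:
                best = 0.0
            else:
                best = min(insert_cost, delete_cost, match_cost)
            cost[i, j] = abs(x[i] - y[j]) + best
    return cost

a = np.array([1, 2, 4, 8], dtype=float)
b = np.array([2, 3, 4, 7, 8, 9], dtype=float)
cost = simple_dtw(a, b)
print(cost)
print(cost[-1, -1])  # total alignment cost, 4.0 for these inputs
```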