Get feature importances for dictionary of dataframes

Get feature importances for dictionary of dataframes - python-3.x

I'm currently working on a use case using RandomForestRegressor. To get training and test data separately based on one column, let's say Home, the dataframe was split into dictionary. Almost done with the modelling, but stuck in getting the feature importance for each of the key in dictionary (number of keys = 21). Please have a look at the codes below:
hp = pd.get_dummies(hp)
hp = {i: g for i, g in hp.set_index(["Home"]).groupby(level = [0])}
feature = {}; feature_train = {}; feature_test = {}
target = {}; target_train = {}; target_test = {}; target_pred = {}
importances = {}
for k, v in hp.items():
target[k] = np.array(v["HP"])
feature[k] = v.drop(["HP", "Corr"], axis = 1)
feature_list = list(feature[1].columns)
for k, v in zip(feature, target):
feature[k] = np.array(feature[v])
for k, v in zip(feature_train, target_train):
feature_train[k], feature_test[k], target_train[k], target_test[k] = train_test_split(
feature[v], target[v], test_size = 0.25, random_state = 42)
What I've tried after a help from Random Forest Feature Importance Chart using Python
for name, importance in zip(feature_list, list(rf.feature_importances_)):
print(name, "=", importance)
but this prints importance for only one of the dictionary (and I don't know which). What I want is to get it printed for all the keys in dictionary "importances". Thanks in advance!

If I understand you correctly, you want feature's importance for both train and test data.
That's not how it works, first it creates RandomForest from your training data, and after that operation it can calculate importance of each feature based on how many times it was used to split the space (and how 'good' were the splits, e.g. how low was, for example, the gini impurity, for many trees of course).
So you obtain feature's importance for training data, for test data the learned tree architecture is used in order to predict values.

Related

How do I build a probability matrix output layer in Keras

Suppose I need to build a network that takes two inputs:
A patient's information, represented as an array of features
Selected treatment, represented as one-hot encoded array
Now how do I build a network that outputs a 2D probability matrix A where A[i,j] represents the probability the patient will end up at state j under treatment i. Let's say there are n possible states, and under any treatment, the total probability of all n states sums up to 1.
I wanted to do this because I was motivated by a similar network, where the inputs are the same as above, but the output is a 1d array representing the expected lifetime after treatment i is delivered. And such network is built as follows:
def default_dense(feature_shape, n_treatment):
feature_input = keras.layers.Input(feature_shape)
treatment_input = keras.layers.Input((n_treatments,))
hidden_1 = keras.layers.Dense(16, activation = 'relu')(feature_input)
hidden_2 = keras.layers.Dense(16, activation = 'relu')(hidden_1)
output = keras.layers.Dense(n_treatments)(hidden_2)
output_on_action = keras.layers.multiply([output, treatment_input])
model = keras.models.Model([feature_input, treatment_input], output_on_action)
model.compile(optimizer=tf.optimizers.Adam(0.001),loss='mse')
return model
And the training is simply
model.fit(x = [features, encoded_treatments], y = encoded_treatments * lifetime[:, np.newaxis], verbose = 0)
This is super handy because when predicting, I can use np.ones() as the encoded_treatments, and the network gives expected lifetimes under all treatments, thus choosing the best one is one-step. Certainly I can create multiple networks, each for a treatment, but it would be much less efficient.
Now the questions is, can I do the same to probability output?

I have figured it out myself. The trick is to use RepeatVector() and Permute() layers to generate a matrix mask for treatments.
The output is an element-wise Multiply() of the mask and a Softmax() of same size.

LSTM getting caught up in loop

I recently implemented a name generating RNN "from scratch" which was doing ok but far from perfect. So I thought about trying my luck with pytorch's LSTM class to see if it makes a difference. Indeed it does and the outpus looks way better for the first 7 ~ 8 characters. But then the networks gets caught in a loop and outputs things like "laulaulaulau" or "rourourourou" (it is supposed the generate french names).
Is it a often occuring problem ? If so do you know a way to fix it ? I'm concern about the fact the network doesn't produce EOS tokens...
This is an issue which has already been asked here Why does my keras LSTM model get stuck in an infinite loop?
but not really answered hence my post.
here is the model :
class pytorchLSTM(nn.Module):
def __init__(self,input_size,hidden_size):
super(pytorchLSTM,self).__init__()
self.input_size = input_size
self.hidden_size = hidden_size
self.lstm = nn.LSTM(input_size, hidden_size)
self.output_layer = nn.Linear(hidden_size,input_size)
self.tanh = nn.Tanh()
self.softmax = nn.LogSoftmax(dim = 2)
def forward(self, input, hidden)
out, hidden = self.lstm(input,hidden)
out = self.tanh(out)
out = self.output_layer(out)
out = self.softmax(out)
return out, hidden
The input and target are two sequences of one-hot encoded vectors respectively with a start of sequence and end of sequence vector at the start and the end. They represent the characters inside of a name taken from the name list (database).
I use a and token on each name from the database. here are the function I use
def inputTensor(line):
#tensor starts with <start of sequence> token.
tensor = torch.zeros(len(line)+1, 1, n_letters)
tensor[0][0][n_letters - 2] = 1
for li in range(len(line)):
letter = line[li]
tensor[li+1][0][all_letters.find(letter)] = 1
return tensor
# LongTensor of second letter to end (EOS) for target
def targetTensor(line):
letter_indexes = [all_letters.find(line[li]) for li in range(len(line))]
letter_indexes.append(n_letters - 1) # EOS
return torch.LongTensor(letter_indexes)
training loop :
def train_lstm(model):
start = time.time()
criterion = nn.NLLLoss()
optimizer = torch.optim.Adam(model.parameters())
n_iters = 20000
print_every = 1000
plot_every = 500
all_losses = []
total_loss = 0
for iter in range(1,n_iters+1):
line = randomChoice(category_line)
input_line_tensor = inputTensor(line)
target_line_tensor = targetTensor(line).unsqueeze(-1)
optimizer.zero_grad()
loss = 0
output, hidden = model(input_line_tensor)
for i in range(input_line_tensor.size(0)):
l = criterion(output[i], target_line_tensor[i])
loss += l
loss.backward()
optimizer.step()
the sampling function :
def sample():
max_length = 20
input = torch.zeros(1,1,n_letters)
input[0][0][n_letters - 2] = 1
output_name = ""
hidden = (torch.zeros(2,1,lstm.hidden_size),torch.zeros(2,1,lstm.hidden_size))
for i in range(max_length):
output, hidden = lstm(input)
output = output[-1][:][:]
l = torch.multinomial(torch.exp(output[0]),num_samples = 1).item()
if l == n_letters - 1:
break
else:
letter = all_letters[l]
output_name += letter
input = inputTensor(letter)
return output_name
The typical sampled output looks something like that :
Laurayeerauerararauo
Leayealouododauodouo
Courouauurourourodau
Do you know how I can improve that ?

I found the explanation :
When using instances of the LSTM class as part of a RNN, the default input dimensions are (seq_length,batch_dim,input_size). To be able to interpret the output of the lstm as a probability (over the set of inputs) I needed to pass it to a Linear layer before the Softmax call, which is where the problem happens : Linear instances expects the input to be in the format (batch_dim,seq_length,input_size).
To fix this, one needs to pass batch_first = True as an argument to the LSTM upon creation, and then feed the RNN with an input of the form (batch_dim, seq_length, input_size).

Some tips to improve the network in the order of importance (and ease of implementing):
1. Training data
If you want your generated samples to look real, you have to give some real data to the network. Find a set of names, split those into letters and transform into indices. This step alone would give way more realistic names.
2. Separate start and end tokens.
I would go with <SON> (Start Of Name) and <EON> (End Of Name). In this configuration neural network can learn combinations of letters leading to <EON> and combinations of letters coming after <SON>. ATM it's trying to fit two different concepts into this one custom token.
3. Unsupervised Pretaining
You may want to give your letters some semantic meaning instead of one-hot encoded vectors, check word2vec for basic approach.
Basically, each letter would be represented by N-dimensional vector (say 50 dimensions) and would be closer in space if the letter occurs more often next to another letter (a closer to k than x).
Simple way to implement that would be taking some text dataset and trying to predict next letter at each timestep. Each letter would be represented by random vector at the beginning, through backpropagation letter representations would be updated to reflect their similarity.
Check pytorch embedding tutorial for more info.
4. Different architecture
You may want to check Andrej Karpathy's idea for generating baby names. It is simply described here.
Essentially, after training, you feed your model with random letters (say 10) and tell it to predict the next letter.
You remove last letter from random seed and put the predicted one in it's place. Iterate until <EON> is outputted.

zeroinflatedpoisson model in python

I want to use python3 to build a zeroinflatedpoisson model. I found in library statsmodel the function statsmodels.discrete.count_model.ZeroInflatePoisson.
I just wonder how to use it. It seems I should do:
ZIFP(Y_train,X_train).fit().
But when I wanted to do prediction using X_test.
It told me the length of X_test doesn't fit X_train.
Or is there another package to fit this model?
Here is the code I used:
X1 = [random.randint(0,1) for i in range(200)]
X2 = [random.randint(1,2) for i in range(200)]
y = np.random.poisson(lam = 2,size = 100).tolist()
for i in range(100):y.append(0)
df['x1'] = x1
df['x2'] = x2
df['y'] = y
df_x = df.iloc[:,:-1]
x_train,x_test,y_train,y_test = train_test_split(df_x,df['y'],test_size = 0.3)
clf = ZeroInflatedPoisson(endog = y_train,exog = x_train).fit()
clf.predict(x_test)
ValueError:operands could not be broadcat together with shapes (140,)(60,)
also tried:
clf.predict(x_test,exog = np.ones(len(x_test)))
ValueError: shapes(60,) and (1,) not aligned: 60 (dim 0) != 1 (dim 0)

This looks like a bug to me.
As far as I can see:
If there are no explanatory variables, exog_infl, specified for the inflation model, then a array of ones is used to model a constant inflation probability.
However, if exog_infl in predict is None, then it uses the model.exog_infl which is an array of ones with the length equal to the training sample.
As work around specifying a 1-D array of ones of correct length in predict should work.
Try:
clf.predict(test_x, exog_infl=np.ones(len(test_x))
I guess the same problem will occur if exposure was used in the model, but is not explicitly specified in predict.

I ran into the same problem, landing me on this thread. As noted by Josef, it seems like you need to provide exog_infl with a 1-D array of ones of correct length to work.
However, the code Josef provided misses the 1-D array-part, so the full line required to generate the required array is actually
clf.predict(test_x, exog_infl=np.ones((len(test_x),1))

How to get KMeans's inertia_ value after using Pipline

I want to combine StandardScaler() and KMeans() by using Pipeline and also check the kmeans's inertia_ because I want to check which number of cluster is best.
The code is as following:
ks = range(3, 5)
inertias = []
inertias_temp = 9999.0
for k in ks:
scaler = StandardScaler()
kmeans = KMeans(n_clusters=k, random_state=rng)
pipeline = make_pipeline(scaler, kmeans)
pipeline.fit(X_pca)
labels = pipeline.predict(X_pca)
np.round(kmeans.cluster_centers_, decimals=3)
inertias.append(kmeans.inertia_)
if (kmeans.inertia_ < inertias_temp):
n_clusters_min = k
kmeans_min = kmeans
inertias_temp = kmeans.inertia_
However, I think that maybe the value of kmeans.inertia_ is not correct because it should be got after pipeline.predict(). But I have no way to get this value after pipeline.predict(). Can anyone help me on this?

It is possible to observe the inertia distance of the cluster run from the make_pipeline instance. However, it is not necessary to perform .predict() to observe the distance of the number of centroids. To access the inertia value in your case, you may type as below:
pipeline.named_steps['kmeans'].inertia_
And then process it as you like!
Moreover, I had some free time, so I rewrote the code for you a little bit to make it more interesting:
scaler = StandardScaler()
cluster = KMeans(random_state=1337)
pipe = make_pipeline(scaler, cluster)
centroids = []
inertias = []
min_ks = []
inertia_temp = 9999.0
for k in range(3, 5):
pipe.set_params(cluster__n_clusters=k)
pipe.fit(X_pca)
centroid = pipe.named_steps['cluster'].cluster_centers_
inertia = pipe.named_steps['cluster'].inertia_
centroids.append(centroid)
inertias.append(inertia)
if inertia < inertia_temp:
min_ks.append(k)
Thank you for the question!

How to add a confusion matrix to Theano examples?

I want to make use of Theano's logistic regression classifier, but I would like to make an apples-to-apples comparison with previous studies I've done to see how deep learning stacks up. I recognize this is probably a fairly simple task if I was more proficient in Theano, but this is what I have so far. From the tutorials on the website, I have the following code:
def errors(self, y):
# check if y has same dimension of y_pred
if y.ndim != self.y_pred.ndim:
raise TypeError(
'y should have the same shape as self.y_pred',
('y', y.type, 'y_pred', self.y_pred.type)
)
# check if y is of the correct datatype
if y.dtype.startswith('int'):
# the T.neq operator returns a vector of 0s and 1s, where 1
# represents a mistake in prediction
return T.mean(T.neq(self.y_pred, y))
I'm pretty sure this is where I need to add the functionality, but I'm not certain how to go about it. What I need is either access to y_pred and y for each and every run (to update my confusion matrix in python) or to have the C++ code handle the confusion matrix and return it at some point along the way. I don't think I can do the former, and I'm unsure how to do the latter. I've done some messing around with an update function along the lines of:
def confuMat(self, y):
x=T.vector('x')
classes = T.scalar('n_classes')
onehot = T.eq(x.dimshuffle(0,'x'),T.arange(classes).dimshuffle('x',0))
oneHot = theano.function([x,classes],onehot)
yMat = T.matrix('y')
yPredMat = T.matrix('y_pred')
confMat = T.dot(yMat.T,yPredMat)
confusionMatrix = theano.function(inputs=[yMat,yPredMat],outputs=confMat)
def confusion_matrix(x,y,n_class):
return confusionMatrix(oneHot(x,n_class),oneHot(y,n_class))
t = np.asarray(confusion_matrix(y,self.y_pred,self.n_out))
print (t)
But I'm not completely clear on how to get this to interface with the function in question and give me a numpy array I can work with.
I'm quite new to Theano, so hopefully this is an easy fix for one of you. I'd like to use this classifer as my output layer in a number of configurations, so I could use the confusion matrix with other architectures.

I suggest using a brute force sort of a way. You need an output for a prediction first. Create a function for it.
prediction = theano.function(
inputs = [index],
outputs = MLPlayers.predicts,
givens={
x: test_set_x[index * batch_size: (index + 1) * batch_size]})
In your test loop, gather the predictions...
labels = labels + test_set_y.eval().tolist()
for mini_batch in xrange(n_test_batches):
wrong = wrong + int(test_model(mini_batch))
predictions = predictions + prediction(mini_batch).tolist()
Now create confusion matrix this way:
correct = 0
confusion = numpy.zeros((outs,outs), dtype = int)
for index in xrange(len(predictions)):
if labels[index] is predictions[index]:
correct = correct + 1
confusion[int(predictions[index]),int(labels[index])] = confusion[int(predictions[index]),int(labels[index])] + 1
You can find this kind of an implementation in this repository.

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string