Add extra columns into test dataset in addition to Features and Labels - python-3.x

I am using tf.contrib.data.make_csv_dataset to convert my CSV data into a dataset that serves Features and Labels nicely for the specified columns from the CSV.
How can I specify extra column(s) from the CSV which I may want to be available during testing of the model, but not to be used for training and for model calculations? For instance, while evaluating testing accuracy, I want to know for which specific rows in the CSV dataset the predictions were wrong. Is there a way to serve an additional parameter that I could leverage to figure out what exactly the model got wrong?
Right now the code looks something like this (based on the Tensorflow example pages):
test_dataset = tf.contrib.data.make_csv_dataset(
    CSV_file,
    BATCH_TEST_SIZE,
    column_names=column_names,
    select_columns=column_select,
    label_name=label_name,
    num_epochs=1,
    shuffle=False).map(pack_features_vector)
And then during testing, the code does this:
for (x, y) in test_dataset:
    logits = model(x)
    prediction = tf.argmax(logits, axis=1, output_type=tf.int32)
    print('Act\t{}\nPred\t{}\n\n'.format(y, prediction))
Since the generator function only serves x and y values, how can I tell specifically for which rows from the original CSV file the predictions were wrong?
How could I do something like
for (x, y, z) in test_dataset:
    print(z[x])
where z would be that additional column, which I could then examine?
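One way this could work, as a sketch rather than a definitive answer: include an extra identifier column in select_columns and split it off in the map step, so it travels alongside x and y without being fed to the model. The row_id column and the packing logic here are assumptions, not names from the original setup:
def pack_with_id(features, label):
    row_id = features.pop('row_id')  # hypothetical ID column; kept out of the model input
    return tf.stack(list(features.values()), axis=1), label, row_id

test_dataset = tf.contrib.data.make_csv_dataset(
    CSV_file,
    BATCH_TEST_SIZE,
    column_names=column_names,
    select_columns=column_select + ['row_id'],  # also select the hypothetical ID column
    label_name=label_name,
    num_epochs=1,
    shuffle=False).map(pack_with_id)

for (x, y, z) in test_dataset:
    prediction = tf.argmax(model(x), axis=1, output_type=tf.int32)
    wrong = tf.not_equal(prediction, tf.cast(y, tf.int32))
    print('Wrong rows:', tf.boolean_mask(z, wrong))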

Thank you for clarifying your question. I believe the way to see which rows were predicted incorrectly is to use model.predict_classes() in Keras. The following code should give you an array of your model's predicted classes:
predictionArr = model.predict_classes(testData).reshape(-1)
This will give you an array the length of your test data set that you can compare against the ground truths.
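For instance, to pull out the indices of the misclassified rows (a sketch; y_test here is an assumed array of ground-truth labels, not a name from your code):
import numpy as np

predictionArr = model.predict_classes(testData).reshape(-1)
wrong_rows = np.where(predictionArr != y_test)[0]  # row indices the model got wrong
print(wrong_rows)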
Hope this is helpful and answers your question!

Related

Multilabel classification with null values with scikit-learn

I'd like to do multilabel classification (with values 0 and 1 per label) for a bunch of text (but that's not really important, the vectorisation already works fine).
However, some of my label columns will also contain null/NaN (whatever representation for "there is no information").
Is this possible with scikit-learn?
My (working) code for a regular dataset: (rough edited snippet)
from sklearn.multiclass import OneVsRestClassifier
from sklearn.naive_bayes import ComplementNB
from sklearn import metrics

clf = OneVsRestClassifier(ComplementNB(alpha=.01))
clf.fit(x_train, train_labels)
preds = clf.predict(x_val)
f1_score = metrics.f1_score(val_labels, preds, average="micro")
So I'm offloading a lot of the work to scikit-learn. I can't just go about dropping all records that have a null somewhere in their labels, or I'd lose out on a bunch of information.
I guess what I want is "when learning a model for each label separately (which is what OneVsRest does, afaik), then only learn from the records that have actual values". So like a local dropna()? I'm not interested in imputing the missing values, that would make no sense. The information given is "we don't know if this label applies or not, so just ignore it for this record".
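One hedged way to get that "local dropna()" is to skip OneVsRestClassifier and fit one binary classifier per label yourself, masking out the NaN rows for each label separately. A minimal sketch, assuming train_labels is a DataFrame with one column per label:
import numpy as np
from sklearn.naive_bayes import ComplementNB

classifiers = {}
for label in train_labels.columns:
    known = train_labels[label].notna()  # "local dropna": rows where this label is known
    clf = ComplementNB(alpha=.01)
    clf.fit(x_train[known.values], train_labels.loc[known, label].astype(int))
    classifiers[label] = clf

# Predict each label independently on the validation set
preds = np.column_stack([classifiers[l].predict(x_val) for l in train_labels.columns])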

fastai tabular model trained but cannot find categorical mapping

After training my dataset which has a number of categorical data using fastai's tabular model, I wish to read out the entity embedding and use it to map to my original data values.
I can see the embedding weights. The number of inputs doesn't seem to match anything, but maybe it is based on the unique categorical values in the train_ds.
To get that map, I would like to get the self.categories dictionary from the Categorify transform class. Is there any way to get that from the data variable obtained by calling TabularList.from_df?
Or maybe someone can tell me a better way to get this map. I know the input df into TabularList.from_df() is not it, because the number of rows is wrong, most likely because df is split into train and valid subsets. But there is no easy way to obtain the train part of the TabularList to check just that part.
It's strange I can't find any code example that shows this. Doesn't anyone else care to map the entity embedding value back to its original categorical value?
I found it.
It is in data.train_ds.inner_df.
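For anyone else looking, a rough sketch of how that might be used to recover the mapping (fastai v1 assumed; the column name is hypothetical):
train_df = data.train_ds.inner_df              # processed training frame
cats = train_df['my_cat_col'].cat.categories   # category order used for the encoding
emb = learn.model.embeds[0].weight             # embedding matrix for the first cat column
# Note: fastai typically reserves embedding row 0 for na/unknown,
# so category i usually maps to row i + 1 -- verify against your emb_szs.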

I want to understand Python prediction model

I am trying to code a predictive model and I found this code somewhere and wanted to know what it means. Here it is: "X_train.reset_index(inplace = True)".
I think it would help if you provided more context to your question. But in the meanwhile, it seems that the line of code you have shown resets the index of the training dataset of whatever model you're working with (usually X denotes the data and Y denotes the labels).
The dataset is a pandas DataFrame object, and the reset_index method replaces the DataFrame's index with the default integer index (0, 1, 2, ...), moving the old index into a regular column; inplace=True makes it modify the DataFrame directly instead of returning a new one. You can find more information in the documentation for this method:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.reset_index.html
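A short illustration of what the call does:
import pandas as pd

df = pd.DataFrame({'a': [10, 20]}, index=['x', 'y'])
df.reset_index(inplace=True)
# The old index ('x', 'y') becomes an 'index' column and the rows are
# renumbered 0, 1; inplace=True modifies df instead of returning a copy.
print(df)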

How to make a custom Gaussian noise layer that imposes a different stddev on each column of the dataset in Keras?

I want to make a Gaussian noise layer in Keras that imposes noise with a different stddev level on each column of the dataset. However, since I don't know much about coding, there is a big problem that I cannot solve by myself.
Starting from the source code of the Keras GaussianNoise layer, I wrote the code below:
def call(self, inputs, training=None):
    def noised():
        temp = inputs
        for i in range(100):
            temp[:, i] = temp[:, i] + K.random_normal(shape=(len(inputs), 1),
                                                      mean=0., stddev=self.stddev[i])
        return temp
    return K.in_train_phase(noised, inputs, training=training)
However, it raises an error:
object of type 'Tensor' has no len()
I believe the error comes from the different kinds of shape involved.
The original code, which looks like this:
def noised():
    return inputs + K.random_normal(shape=K.shape(inputs),
                                    mean=0.,
                                    stddev=self.stddev)
uses the symbolic shape (K.shape), whereas what I used is a plain integer (len()).
However, I have no idea how to overcome the problem.
It would really be a great help if you could show me a way to solve it.
Thank you so much for your assistance.
I know it's super late, but maybe it's still interesting for other people. I'm using Tensorflow 2.3.0 and I can just use numpy-style slicing. So slice the tensor, apply the individual layers, and merge them back together:
import tensorflow as tf
from tensorflow.keras.layers import GaussianNoise, Concatenate

# Slice out each column, apply its own noise level, then merge back together
inp = tf.keras.Input(shape=(None, 3))
x1 = GaussianNoise(0.1)(inp[:, :, 0:1])
x2 = GaussianNoise(0.2)(inp[:, :, 1:2])
x3 = GaussianNoise(0.3)(inp[:, :, 2:3])
x = Concatenate()([x1, x2, x3])
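Since the original question loops over 100 columns, the same idea also scales as a list comprehension; a sketch reusing the imports above, with placeholder stddev values:
stddevs = [0.1] * 100  # placeholder: one stddev per column
inp100 = tf.keras.Input(shape=(None, 100))
cols = [GaussianNoise(s)(inp100[:, :, i:i + 1]) for i, s in enumerate(stddevs)]
x = Concatenate()(cols)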

Machine Learning Linear Regression - Sklearn

I'm new to the machine learning domain and I have some doubts about linear regression.
1: While practicing the sklearn linear regression model's prediction method, I am getting the error below.
Code:
sklearn.linear_model.LinearRegression.predict(25)
Error:
"ValueError: Expected 2D array, got scalar array instead: array=25. Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample."
Do I need to pass a 2-D array? I checked the sklearn documentation page and haven't found anything about a version update.
Running my code on Kaggle:
https://www.kaggle.com/aman9d/bikesharingdemand-upx/
2: Is the index of the dataset going to affect the model's score (weights)?
First of all, you should post the code as you actually use it:
# import, instantiate, fit
from sklearn.linear_model import LinearRegression
linreg = LinearRegression()
linreg.fit(X, y)
# use the predict method
linreg.predict(25)  # this is the call that raises the ValueError above
What you posted in the question is not directly executable, because the predict method is not static on the LinearRegression class.
When you fit a model, the first step is to recognize what kind of data the input will be; in your case it will be similar to X. That means that if you pass something with a different shape than X to the model, it will raise an error.
In your example X seems to be a pd.DataFrame() instance with only 1 column; this is interchangeable with a 2-dimensional array of shape (number of examples, number of features). So if you try:
linreg.predict([[25]])
should work.
For example, if you were running a regression with more than one feature (column), say temp and humidity, your input would look like this:
linreg.predict([[25, 56]])
I hope this helps; always keep in mind what the shape of your data is.
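Putting it together, a minimal runnable sketch with toy data:
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy single-feature data: X must be 2D, shape (n_samples, n_features)
X = np.array([[10.0], [20.0], [30.0]])
y = np.array([15.0, 25.0, 35.0])

linreg = LinearRegression()
linreg.fit(X, y)
print(linreg.predict([[25]]))  # 2D input: one sample, one feature -> [30.]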
Documentation: LinearRegression fit
X : array-like or sparse matrix, shape (n_samples, n_features)
