Using a subset of classes in ImageNet - pytorch

I'm aware that subsets of ImageNet exist, however they don't fulfill my requirement. I want 50 classes at their native ImageNet resolutions.
To this end, I used torch.utils.data.dataset.Subset to select specific classes from ImageNet. However, it turns out, class labels/indices must be greater than 0 and less than num_classes.
Since ImageNet contains 1000 classes, the idx of my selected classes quickly goes over 50. How can I reassign the class indices and do so in a way that allows for evaluation later down the road as well?
Is there a way more elegant way to select a subset?

I am not sure I understood your conclusions about labels being greater than zero and less than num_classes. The torch.utils.data.Subset helper takes in a torch.utils.data.Dataset and a sequence of indices, they correspond to indices of data points from the Dataset you would like to keep in the subset. These indices have nothing to do with the classes they belong to.
Here's how I would approach this:
Load your dataset through torchvision.datasets (custom datasets would work the same way). Here I will demonstrate it with FashionMNIST since ImageNet's data is not made available directly through torchvision's API.
>>> ds = torchvision.datasets.FashionMNIST('.')
>>> len(ds)
60000
Define the classes you want to select for the subset dataset. And retrieve all indices from the main dataset which correspond to these classes:
>>> targets = [1, 3, 5, 9]
>>> indices = [i for i, label in enumerate(ds.targets) if label in targets]
You have your subset:
>>> ds_subset = Subset(ds, indices)
>>> len(ds_subset)
24000
At this point, you can use a dictionnary to remap your labels using targets:
>>> remap = {i:x for i, x in enumerate(targets)}
{0: 1, 1: 3, 2: 5, 3: 9}
For example:
>>> x, y = ds_subset[10]
>>> y, remap[y] # old_label, new_label
1, 3

Related

Multiplying two sub-matrices in NumPy without copying the matrices

I have two matrices A and B and I'm interested in multiplying a sub-matrix of A (defined by some of its rows) and a sub-matrix of B (defined by some of its columns) as shown in the example below:
import numpy as np
# create two matrices
A = np.ones((5, 3))
B = np.ones((3, 5))
# define sub-matrices
rows_idx = [0, 2, 3]
cols_idx = [1, 2, 4]
# print sub-matrices
print(A[rows_idx])
print(B[:, cols_idx])
I'm aware that this can be achieved directly through
A[rows_idx, :] # B[:, cols_idx]
However, the latter comes with the drawback of making copies of the rows of A and columns of B since rows_idx and cols_idx are lists, as mentioned here.
This is less efficient in terms of time and memory.
Also, I refrained from creating a Python function to perform the multiplication as Python loops are slow. Numpy array operations are much faster. Here are some timing comparisons.
Is there way to compute A[rows_idx, :] # B[:, cols_idx] without copying?

Expand the tensor by several dimensions

In PyTorch, given a tensor of size=[3], how to expand it by several dimensions to the size=[3,2,5,5] such that the added dimensions have the corresponding values from the original tensor. For example, making size=[3] vector=[1,2,3] such that the first tensor of size [2,5,5] has values 1, the second one has all values 2, and the third one all values 3.
In addition, how to expand the vector of size [3,2] to [3,2,5,5]?
One way to do it I can think is by means of creating a vector of the same size with ones-Like and then einsum but I think there should be an easier way.
You can first unsqueeze the appropriate number of singleton dimensions, then expand to a view at the target shape with torch.Tensor.expand:
>>> x = torch.rand(3)
>>> target = [3,2,5,5]
>>> x[:, None, None, None].expand(target)
A nice workaround is to use torch.Tensor.reshape or torch.Tensor.view to do perform multiple unsqueezing:
>>> x.view(-1, 1, 1, 1).expand(target)
This allows for a more general approach to handle any arbitrary target shape:
>>> x.view(len(x), *(1,)*(len(target)-1)).expand(target)
For an even more general implementation, where x can be multi-dimensional:
>>> x = torch.rand(3, 2)
# just to make sure the target shape is valid w.r.t to x
>>> assert list(x.shape) == list(target[:x.ndim])
>>> x.view(*x.shape, *(1,)*(len(target)-x.ndim)).expand(target)

What is meant by id's and labels in keras data generator?

https://stanford.edu/~shervine/blog/keras-how-to-generate-data-on-the-fly
The like above is the documentation regarding the custom keras data generator.
I have doubt in the "NOTATION" heading in the above link which says the following:-
Before getting started, let's go through a few organizational tips that are particularly useful when dealing with large datasets.
Let ID be the Python string that identifies a given sample of the dataset. A good way to keep track of samples and their labels is to adopt the following framework:
1. Create a dictionary called partition where you gather:
a) in partition['train'] a list of training IDs
b) in partition['validation'] a list of validation IDs
2. Create a dictionary called labels where for each ID of the dataset, the associated label is given by labels[ID]
For example, let's say that our training set contains id-1, id-2 and id-3 with respective labels 0, 1 and 2, with a validation set containing id-4 with label 1. In that case, the Python variables partition and labels look like
>>> partition
{'train': ['id-1', 'id-2', 'id-3'], 'validation': ['id-4']}
and
>>> labels
{'id-1': 0, 'id-2': 1, 'id-3': 2, 'id-4': 1}
I'm really not able to understand what does labels and id's mean.
For example:- Say, I have a data frame, where there are 1000 columns. Each row corresponds to id's i.e., each ID meant to be just a "DATA POINT".
OR
Say, I have multiple data frame. Each data frame represents different id's?
It seems labels meant not to be the number of class-variable.
I would like to have a clear understanding regarding id's and labels WITH SOME EXAMPLES.
The mentioned article provides a good practice to better organize your data between training and validation. To do so, it's relevant to store line indexes from your dataframe (named IDs here) and corresponding target values (named label here) in an independent object so that in case of transformation on the input, you don't lose track of things.
Here is a basic example using a train/test split
import pandas as pd
from sklearn.model_selection import train_test_split
df = pd.DataFrame([[0.1, 1, 'label_a'], [0.2, 2, 'label_a'], [0.3, 3, 'label_a'], [0.4, 4, 'label_b']], columns=['feature_a', 'feature_b', 'target'])
# df.index.tolist() results in [0, 1, 2, 3] (4 rows)
partitions = dict()
labels = dict()
X_train, X_test, y_train, y_test = train_test_split(df[['feature_a', 'feature_b']], df['target'], test_size=0.25, random_state=42)
partitions['train'] = X_train.index.tolist()
partitions['validation'] = X_test.index.tolist()
# partitions['train'] results in [3, 0, 2]
# partitions['validation'] results in [1]
labels = df['target'].to_dict()
# labels is {0: 'label_a', 1: 'label_a', 2: 'label_a', 3: 'label_b'}```

Removing the redundant feature from classification dataset ( make_classification )

In the make_classification method,
X,y = make_classification(n_samples=10, n_features=8, n_informative=7, n_redundant=1, n_repeated=0 , n_classes=2,random_state=6)
Docstring about n_redundant: The number of redundant features. These features are generated as
random linear combinations of the informative features.
Docstring about n_repeated: The number of duplicated features, drawn randomly from the informative
n_repeated features are picked easily as they are highly correlated with informative features.
The docstring for repeated and redundant features indicates that both are drawn from informative features.
My question is: how redundant features can be removed/highlighted, what are their characteristics.
Attached is the correlation heatmap among all the features, Which feature in the image is redundant.
Please help.
To check how many independent columns use np.linalg.matrix_rank(X)
To find indices of linearly independent rows of matrix X use sympy.Matrix(X).rref()
DEMO
Generate dataset and check number of independent columns (matrix rank):
from sklearn.datasets import make_classification
from sympy import Matrix
X, _ = make_classification(
n_samples=10, n_features=8, n_redundant=2,random_state=6
)
np.linalg.matrix_rank(X, tol=1e-3)
# 6
Find indices of linearly independent columns:
_, inds = Matrix(X).rref(iszerofunc=lambda x: abs(x)<1e-3)
inds
#(0, 1, 2, 3, 6, 7)
Remove dependent columns and check matrix rank (num of independent columns):
#linearly independent
X_independent = X[:,inds]
assert np.linalg.matrix_rank(X_independent, tol=1e-3) == X_independent.shape[1]

how to "normalize" vectors values when using Spark CountVectorizer?

CountVectorizer and CountVectorizerModel often creates a sparse feature vector that looks like this:
(10,[0,1,4,6,8],[2.0,1.0,1.0,1.0,1.0])
this basically says the total size of the vocabulary is 10, the current document has 5 unique elements, and in the feature vector, these 5 unique elements take position at 0, 1, 4, 6 and 8. Also, one of the elements show up twice, therefore the 2.0 value.
Now, I would like to "normalize" the above feature vector and make it look like this,
(10,[0,1,4,6,8],[0.3333,0.1667,0.1667,0.1667,0.1667])
i.e., each value is divided by 6, the total number of all elements together. For example, 0.3333 = 2.0/6.
So is there a way to do this efficiently here?
Thanks!
You can use Normalizer
class pyspark.ml.feature.Normalizer(*args, **kwargs)
Normalize a vector to have unit norm using the given p-norm.
with 1-norm
from pyspark.ml.linalg import SparseVector
from pyspark.ml.feature import Normalizer
df = spark.createDataFrame([
(SparseVector(10,[0,1,4,6,8],[2.0,1.0,1.0,1.0,1.0]), )
], ["features"])
Normalizer(inputCol="features", outputCol="features_norm", p=1).transform(df).show(1, False)
# +--------------------------------------+---------------------------------------------------------------------------------------------------------------------+
# |features |features_norm |
# +--------------------------------------+---------------------------------------------------------------------------------------------------------------------+
# |(10,[0,1,4,6,8],[2.0,1.0,1.0,1.0,1.0])|(10,[0,1,4,6,8],[0.3333333333333333,0.16666666666666666,0.16666666666666666,0.16666666666666666,0.16666666666666666])|
# +--------------------------------------+---------------------------------------------------------------------------------------------------------------------+

Resources