how to "normalize" vectors values when using Spark CountVectorizer? - apache-spark

CountVectorizer and CountVectorizerModel often creates a sparse feature vector that looks like this:
(10,[0,1,4,6,8],[2.0,1.0,1.0,1.0,1.0])
this basically says the total size of the vocabulary is 10, the current document has 5 unique elements, and in the feature vector, these 5 unique elements take position at 0, 1, 4, 6 and 8. Also, one of the elements show up twice, therefore the 2.0 value.
Now, I would like to "normalize" the above feature vector and make it look like this,
(10,[0,1,4,6,8],[0.3333,0.1667,0.1667,0.1667,0.1667])
i.e., each value is divided by 6, the total number of all elements together. For example, 0.3333 = 2.0/6.
So is there a way to do this efficiently here?
Thanks!

You can use Normalizer
class pyspark.ml.feature.Normalizer(*args, **kwargs)
Normalize a vector to have unit norm using the given p-norm.
with 1-norm
from pyspark.ml.linalg import SparseVector
from pyspark.ml.feature import Normalizer
df = spark.createDataFrame([
(SparseVector(10,[0,1,4,6,8],[2.0,1.0,1.0,1.0,1.0]), )
], ["features"])
Normalizer(inputCol="features", outputCol="features_norm", p=1).transform(df).show(1, False)
# +--------------------------------------+---------------------------------------------------------------------------------------------------------------------+
# |features |features_norm |
# +--------------------------------------+---------------------------------------------------------------------------------------------------------------------+
# |(10,[0,1,4,6,8],[2.0,1.0,1.0,1.0,1.0])|(10,[0,1,4,6,8],[0.3333333333333333,0.16666666666666666,0.16666666666666666,0.16666666666666666,0.16666666666666666])|
# +--------------------------------------+---------------------------------------------------------------------------------------------------------------------+

Related

What is meant by id's and labels in keras data generator?

https://stanford.edu/~shervine/blog/keras-how-to-generate-data-on-the-fly
The like above is the documentation regarding the custom keras data generator.
I have doubt in the "NOTATION" heading in the above link which says the following:-
Before getting started, let's go through a few organizational tips that are particularly useful when dealing with large datasets.
Let ID be the Python string that identifies a given sample of the dataset. A good way to keep track of samples and their labels is to adopt the following framework:
1. Create a dictionary called partition where you gather:
a) in partition['train'] a list of training IDs
b) in partition['validation'] a list of validation IDs
2. Create a dictionary called labels where for each ID of the dataset, the associated label is given by labels[ID]
For example, let's say that our training set contains id-1, id-2 and id-3 with respective labels 0, 1 and 2, with a validation set containing id-4 with label 1. In that case, the Python variables partition and labels look like
>>> partition
{'train': ['id-1', 'id-2', 'id-3'], 'validation': ['id-4']}
and
>>> labels
{'id-1': 0, 'id-2': 1, 'id-3': 2, 'id-4': 1}
I'm really not able to understand what does labels and id's mean.
For example:- Say, I have a data frame, where there are 1000 columns. Each row corresponds to id's i.e., each ID meant to be just a "DATA POINT".
OR
Say, I have multiple data frame. Each data frame represents different id's?
It seems labels meant not to be the number of class-variable.
I would like to have a clear understanding regarding id's and labels WITH SOME EXAMPLES.
The mentioned article provides a good practice to better organize your data between training and validation. To do so, it's relevant to store line indexes from your dataframe (named IDs here) and corresponding target values (named label here) in an independent object so that in case of transformation on the input, you don't lose track of things.
Here is a basic example using a train/test split
import pandas as pd
from sklearn.model_selection import train_test_split
df = pd.DataFrame([[0.1, 1, 'label_a'], [0.2, 2, 'label_a'], [0.3, 3, 'label_a'], [0.4, 4, 'label_b']], columns=['feature_a', 'feature_b', 'target'])
# df.index.tolist() results in [0, 1, 2, 3] (4 rows)
partitions = dict()
labels = dict()
X_train, X_test, y_train, y_test = train_test_split(df[['feature_a', 'feature_b']], df['target'], test_size=0.25, random_state=42)
partitions['train'] = X_train.index.tolist()
partitions['validation'] = X_test.index.tolist()
# partitions['train'] results in [3, 0, 2]
# partitions['validation'] results in [1]
labels = df['target'].to_dict()
# labels is {0: 'label_a', 1: 'label_a', 2: 'label_a', 3: 'label_b'}```

Using a subset of classes in ImageNet

I'm aware that subsets of ImageNet exist, however they don't fulfill my requirement. I want 50 classes at their native ImageNet resolutions.
To this end, I used torch.utils.data.dataset.Subset to select specific classes from ImageNet. However, it turns out, class labels/indices must be greater than 0 and less than num_classes.
Since ImageNet contains 1000 classes, the idx of my selected classes quickly goes over 50. How can I reassign the class indices and do so in a way that allows for evaluation later down the road as well?
Is there a way more elegant way to select a subset?
I am not sure I understood your conclusions about labels being greater than zero and less than num_classes. The torch.utils.data.Subset helper takes in a torch.utils.data.Dataset and a sequence of indices, they correspond to indices of data points from the Dataset you would like to keep in the subset. These indices have nothing to do with the classes they belong to.
Here's how I would approach this:
Load your dataset through torchvision.datasets (custom datasets would work the same way). Here I will demonstrate it with FashionMNIST since ImageNet's data is not made available directly through torchvision's API.
>>> ds = torchvision.datasets.FashionMNIST('.')
>>> len(ds)
60000
Define the classes you want to select for the subset dataset. And retrieve all indices from the main dataset which correspond to these classes:
>>> targets = [1, 3, 5, 9]
>>> indices = [i for i, label in enumerate(ds.targets) if label in targets]
You have your subset:
>>> ds_subset = Subset(ds, indices)
>>> len(ds_subset)
24000
At this point, you can use a dictionnary to remap your labels using targets:
>>> remap = {i:x for i, x in enumerate(targets)}
{0: 1, 1: 3, 2: 5, 3: 9}
For example:
>>> x, y = ds_subset[10]
>>> y, remap[y] # old_label, new_label
1, 3

Removing the redundant feature from classification dataset ( make_classification )

In the make_classification method,
X,y = make_classification(n_samples=10, n_features=8, n_informative=7, n_redundant=1, n_repeated=0 , n_classes=2,random_state=6)
Docstring about n_redundant: The number of redundant features. These features are generated as
random linear combinations of the informative features.
Docstring about n_repeated: The number of duplicated features, drawn randomly from the informative
n_repeated features are picked easily as they are highly correlated with informative features.
The docstring for repeated and redundant features indicates that both are drawn from informative features.
My question is: how redundant features can be removed/highlighted, what are their characteristics.
Attached is the correlation heatmap among all the features, Which feature in the image is redundant.
Please help.
To check how many independent columns use np.linalg.matrix_rank(X)
To find indices of linearly independent rows of matrix X use sympy.Matrix(X).rref()
DEMO
Generate dataset and check number of independent columns (matrix rank):
from sklearn.datasets import make_classification
from sympy import Matrix
X, _ = make_classification(
n_samples=10, n_features=8, n_redundant=2,random_state=6
)
np.linalg.matrix_rank(X, tol=1e-3)
# 6
Find indices of linearly independent columns:
_, inds = Matrix(X).rref(iszerofunc=lambda x: abs(x)<1e-3)
inds
#(0, 1, 2, 3, 6, 7)
Remove dependent columns and check matrix rank (num of independent columns):
#linearly independent
X_independent = X[:,inds]
assert np.linalg.matrix_rank(X_independent, tol=1e-3) == X_independent.shape[1]

keras pre-processing of text using one_hot class

I came across this code while learning keras online.
from keras.preprocessing.text import one_hot
from keras.preprocessing.text import text_to_word_sequence
text = 'One hot encoding in Keras'
tokens = text_to_word_sequence(text)
length = len(tokens)
one_hot(text, length)
This returns the intergers like this...
[3, 1, 1, 2, 3]
I did not understand why and how does unique words return duplicate numbers. For e.g. 3 and 1 is repeated even if the words in the text are unique.
From the documentation of one_hot it is described how it is a wrapper of hashing_trick:
This is a wrapper to the hashing_trick function using hash as the hashing function; unicity of word to index mapping non-guaranteed.
From the documentation of hasing_trick:
Two or more words may be assigned to the same index, due to possible collisions by the hashing function. The probability of a collision is in relation to the dimension of the hashing space and the number of distinct objects.
Since hashing is used there is a probability that different words will be hashed to the same index. The probability of a non-unique hash is proportional to the vocabulary size selected.
It is suggested by Jason Brownlee Jason Brownlee to use a vocabulary size 25% larger than the word size to increase the uniqueness of the hashes.
Following Jason Brownlee suggestion in you case results in:
from tensorflow.keras.preprocessing.text import one_hot
from tensorflow.keras.preprocessing.text import text_to_word_sequence
from tensorflow.random import set_random_seed
import math
set_random_seed(1)
text = 'One hot encoding in Keras'
tokens = text_to_word_sequence(text)
length = len(tokens)
print(one_hot(text, math.ceil(length*1.25)))
which returns the integers
[3, 4, 5, 1, 6]

Behaviour of train_test_split() from Scikit-learn

I am curious how the train_test_split() method of Scikit-learn will behave in the following scenario:
An imaginary dataset:
id, count, size
1, 4, 8
2, 5, 9
3, 6, 0
say I would divide it into two separate sets like this (keeping 'id' in both):
id, count | id, size
1, 4 | 1, 8
2, 5 | 2, 9
3, 6 | 3, 0
And split them both with train_test_split() with the same random_state of 0. Would the order of both be the same with 'id' as reference? (since you are shuffling the same dataset but with different parts left out)
I am curious as to how this works because I have two models. The first one gets trained with the dataset and adds it's results to the dataset, part of which is then used to train the second model.
When doing this it's important that when testing the generalization of the second model, no data points are used which were also used to train the first model. This is because the data was 'seen before' and the model will know what to do with it, so then you are not testing the generalization to new data anymore.
It would be great if train_test_split() would shuffle it the same since then one would not need to keep track of what data was used to train the first algorithm to prevent contamination of the test results.
They should have the same resulting indices if you use the same random_state parameter in each call.
However--you could also just reverse your order of operations. Call test/train split on the parent dataset, then create two sub-sets from both the test and train sets that result.
Example:
print(df)
id count size
0 1 4 8
1 2 5 9
2 3 6 0
from sklearn.model_selection import train_test_split
dfa = df[['id', 'count']].copy()
dfb = df[['id', 'size']].copy()
rstate = 123
traina, testa = train_test_split(dfa, random_state=123)
trainb, testb = train_test_split(dfb, random_state=123)
assert traina.index.equals(trainb.index)
# True

Resources