MXNet - Dot Product of Sparse Matrices - python-3.x

I'm in the process of building a content recommendation model using MXNet. Although the dataset has only ~10K rows, out-of-memory errors are thrown with both CPU and GPU contexts in MXNet. The current code is below.
```
import mxnet as mx
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel
df = pd.read_csv("df_text.csv")
tf = TfidfVectorizer(analyzer="word",
                     ngram_range=(1, 3),
                     min_df=2,
                     stop_words="english")
tfidf_matrix = tf.fit_transform(df["text_column"])
mx_tfidf = mx.nd.array(tfidf_matrix, ctx=mx.gpu())
# Out of memory error occurs here.
cosine_similarities = mx.ndarray.dot(mx_tfidf, mx_tfidf.T)
```
I'm aware that the dot product is a sparse matrix multiplied by a dense matrix, which may be part of the issue. That said, would the dot product have to be calculated across multiple GPUs in order to prevent out-of-memory issues?

In MXNet (and AFAIK on all other platforms) there is no magical "perform dot across GPUs" solution. One option is to use sparse matrices in MXNet (see this tutorial).
Another option is to implement your own multi-GPU dot product by slicing your input array into multiple matrices and computing part of the dot product on each GPU.
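The slicing idea can be sketched without any GPU code. The example below is only an illustration under assumptions: it uses SciPy sparse matrices rather than MXNet arrays, `blockwise_top_k` is a hypothetical helper name, and placing each block on its own device is left out. It shows the core point, which is that only one row block of the similarity matrix has to be dense at any time.

```python
import numpy as np
from scipy import sparse


def blockwise_top_k(X, k=5, block_rows=1000):
    """Hypothetical helper: for every row of sparse X, return the indices of
    the k rows with the largest dot product, computed one row block at a time."""
    X = sparse.csr_matrix(X)
    Xt = X.T.tocsr()
    n = X.shape[0]
    top = np.empty((n, k), dtype=np.int64)
    for start in range(0, n, block_rows):
        stop = min(start + block_rows, n)
        # Only a (block_rows x n) slice of the similarity matrix is dense here.
        sims = (X[start:stop] @ Xt).toarray()
        # Mask each row's similarity with itself so it is never selected.
        sims[np.arange(stop - start), np.arange(start, stop)] = -np.inf
        top[start:stop] = np.argsort(-sims, axis=1)[:, :k]
    return top


# Toy stand-in for the TF-IDF matrix from the question.
X_demo = sparse.random(200, 50, density=0.05, random_state=0, format="csr")
top_demo = blockwise_top_k(X_demo, k=3, block_rows=64)
print(top_demo.shape)
```

In a multi-GPU setting each loop iteration would be the unit of work handed to one device; whether that is worth the transfer cost depends on the data.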

Related

Combine numpy array with TfidfVectorizer as a joint feature matrix in SKLearn

I have a dataset input, which is a list of ~40000 letters (that are represented as strings).
With SKLearn, I first used a TfidfVectorizer to create a TF-IDF matrix representation1:
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
import sklearn.pipeline
vectorizer = TfidfVectorizer(lowercase=False)
representation1 = vectorizer.fit_transform(input) # TFIDF representation
Now, I want to manually add one feature, representation2, for every letter. This feature should give the ratio of distinct words to total words in a specific letter/string:
count_vectorizer = CountVectorizer()
sum_words = np.sum(count_vectorizer.fit_transform(input).toarray(), axis=-1)
sum_different_words = np.count_nonzero(count_vectorizer.fit_transform(input).toarray(), axis=-1)
representation2 = np.divide(sum_different_words, sum_words) # percentage of different words
The array representation2 is now an array of shape (39077,) (as expected). I now want to combine representation1 and representation2 into one feature vector representation.
I read about using FeatureUnion to combine two kinds of features in SKLearn, but I am not sure how to correctly use the NumPy array representation2 as a feature here. I tried:
union = sklearn.pipeline.make_union([representation1, representation2])
But now I can't use e.g. union.get_feature_names_out(), since it throws: AttributeError: Transformer list (type list) does not provide get_feature_names_out.
What did I understand incorrectly here?

scikit-learn: most important feature due to SelectKBest() is not the same feature of top node in DecisionTreeClassifier() with unedited data?

I am applying the breast cancer dataset to a decision tree as simple as possible:
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest
from sklearn.model_selection import train_test_split
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.tree import export_graphviz
import graphviz
cancer = load_breast_cancer()
#print(cancer.feature_names)
X_train, X_test, y_train, y_test = train_test_split(cancer.data, cancer.target, random_state=0)
tree = DecisionTreeClassifier(random_state=0, max_depth=2)
tree.fit(X_train, y_train)
print(f"\nscore train: {tree.score(X_train, y_train)}")
print(f"score test : {tree.score(X_test, y_test)}")
>>>
score train: 0.9413145539906104
score test : 0.9370629370629371
export_graphviz(tree, out_file=f"./src/dot/testing/breast_cancer.dot", class_names=['malignant', 'benign'], feature_names=cancer.feature_names, impurity=False, filled=True)
with open(f"./src/dot/testing/breast_cancer.dot") as f:
    dot_graph = f.read()
graphviz.Source(dot_graph)
Which leads to this graph:
Playing with feature selection, I want to get only the most important feature. In my understanding it should be the feature at the root node, no? Unfortunately it's not: the most important feature turns out to be "worst concave points". Here is what I did to get the most important feature:
select = SelectKBest(k=1)
select.fit(X_train, y_train)
X_train_selected = select.transform(X_train)
print("X_train.shape : {}".format(X_train.shape))
print("X_train_selected.shape: {}\n".format(X_train_selected.shape))
>>>
X_train.shape : (426, 30)
X_train_selected.shape: (426, 1)
mask = select.get_support()
# plt.matshow(mask.reshape(1, -1), cmap='gray_r')
# plt.xlabel("Sample index")
print("most important features:")
for selected, feature in zip(mask, cancer.feature_names):
    if selected:
        print(feature)
>>>
most important features:
worst concave points
I guess I am getting something wrong here. Could somebody clarify this? Any hint? Thanks
The most important feature according to a univariate test is not necessarily the one used for the first split. sklearn.tree.DecisionTreeClassifier picks each split by the reduction in node impurity (Gini impurity by default, entropy if criterion="entropy"), whereas SelectKBest scores each feature on its own, using the ANOVA F-test (f_classif) by default. Because the two criteria differ, there is no reason for both methods to rank the features in the same order. Even the same feature reduces impurity by different amounts at different stages of a tree classifier.
As a side note, trees do not always consider all features when making a split. Take a look at max_features here. With the default max_features=None every feature is considered, but if you set it lower then, depending on random_state, your tree may or may not have considered "worst concave points" when making the first split.
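To see the two rankings side by side, a small sketch like the following can help. It reuses the split and hyperparameters from the question; the variable names are made up for illustration, and the printed names depend on the scikit-learn version, so no particular output is claimed here.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, random_state=0)

# Univariate ranking: SelectKBest defaults to the ANOVA F-test (f_classif),
# which scores every feature independently of the others.
select = SelectKBest(k=1).fit(X_train, y_train)
top_univariate = cancer.feature_names[np.argmax(select.scores_)]

# Tree ranking: the root split is the one that most reduces impurity
# (Gini by default) over all candidate features and thresholds.
tree = DecisionTreeClassifier(random_state=0, max_depth=2).fit(X_train, y_train)
root_feature = cancer.feature_names[tree.tree_.feature[0]]
top_by_importance = cancer.feature_names[np.argmax(tree.feature_importances_)]

print("SelectKBest (F-test):", top_univariate)
print("Tree root split     :", root_feature)
print("Tree top importance :", top_by_importance)
```

Passing a different score_func to SelectKBest (for example mutual_info_classif) may change its pick as well, which is another sign that "most important" depends on the criterion.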

Recover feature names from pipeline [duplicate]

This seems like a very important issue for this library, and so far I don't see a decisive answer, although it seems like for the most part, the answer is 'No.'
Right now, any method that uses the transformer api in sklearn returns a numpy array as its results. Usually this is fine, but if you're chaining together a multi-step process that expands or reduces the number of columns, not having a clean way to track how they relate to the original column labels makes it difficult to use this section of the library to its fullest.
As an example, here's a snippet that I just recently used, where the inability to map new columns to the ones originally in the dataset was a big drawback:
numeric_columns = train.select_dtypes(include=np.number).columns.tolist()
cat_columns = train.select_dtypes(include=object).columns.tolist()  # np.object was removed in NumPy 1.24
numeric_pipeline = make_pipeline(SimpleImputer(strategy='median'), StandardScaler())
cat_pipeline = make_pipeline(SimpleImputer(strategy='most_frequent'), OneHotEncoder())
transformers = [
    ('num', numeric_pipeline, numeric_columns),
    ('cat', cat_pipeline, cat_columns)
]
combined_pipe = ColumnTransformer(transformers)
train_clean = combined_pipe.fit_transform(train)
test_clean = combined_pipe.transform(test)
In this example I split up my dataset using the ColumnTransformer and then added additional columns using the OneHotEncoder, so my arrangement of columns is not the same as what I started out with.
I could easily end up with different arrangements if I used other modules that follow the same API: OrdinalEncoder, SelectKBest, etc.
If you're doing multi-step transformations, is there a way to consistently see how your new columns relate to your original dataset?
There's an extensive discussion about it here, but I don't think anything has been finalized yet.
Yes, you are right that there isn't complete support for tracking feature names in sklearn as of now. Initially, it was decided to keep things generic at the level of NumPy arrays. The latest progress on adding feature names to sklearn estimators can be tracked here.
Anyhow, we can create wrappers to get the feature names out of the ColumnTransformer. I am not sure whether this captures all possible kinds of ColumnTransformer, but at least it can solve your problem.
From Documentation of ColumnTransformer:
Notes
The order of the columns in the transformed feature matrix follows the order of how the columns are specified in the transformers list. Columns of the original feature matrix that are not specified are dropped from the resulting transformed feature matrix, unless specified in the passthrough keyword. Those columns specified with passthrough are added at the right to the output of the transformers.
Try this!
import pandas as pd
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import make_pipeline, Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder, MinMaxScaler
from sklearn.feature_extraction.text import _VectorizerMixin
from sklearn.feature_selection._base import SelectorMixin
from sklearn.feature_selection import SelectKBest
from sklearn.feature_extraction.text import CountVectorizer
train = pd.DataFrame({'age': [23, 12, 12, np.nan],
                      'Gender': ['M', 'F', np.nan, 'F'],
                      'income': ['high', 'low', 'low', 'medium'],
                      'sales': [10000, 100020, 110000, 100],
                      'foo': [1, 0, 0, 1],
                      'text': ['I will test this',
                               'need to write more sentence',
                               'want to keep it simple',
                               'hope you got that these sentences are junk'],
                      'y': [0, 1, 1, 1]})
numeric_columns = ['age']
cat_columns = ['Gender','income']
numeric_pipeline = make_pipeline(SimpleImputer(strategy='median'), StandardScaler())
cat_pipeline = make_pipeline(SimpleImputer(strategy='most_frequent'), OneHotEncoder())
text_pipeline = make_pipeline(CountVectorizer(), SelectKBest(k=5))
transformers = [
    ('num', numeric_pipeline, numeric_columns),
    ('cat', cat_pipeline, cat_columns),
    ('text', text_pipeline, 'text'),
    ('simple_transformer', MinMaxScaler(), ['sales']),
]
combined_pipe = ColumnTransformer(
    transformers, remainder='passthrough')
# drop the target by keyword; the positional axis argument was removed in pandas 2.0
transformed_data = combined_pipe.fit_transform(
    train.drop(columns='y'), train['y'])
def get_feature_out(estimator, feature_in):
    if hasattr(estimator, 'get_feature_names'):
        if isinstance(estimator, _VectorizerMixin):
            # handling all vectorizers
            return [f'vec_{f}'
                    for f in estimator.get_feature_names()]
        else:
            return estimator.get_feature_names(feature_in)
    elif isinstance(estimator, SelectorMixin):
        return np.array(feature_in)[estimator.get_support()]
    else:
        return feature_in

def get_ct_feature_names(ct):
    # handles all estimators, pipelines inside ColumnTransformer
    # doesn't work when remainder == 'passthrough'
    # which requires the input column names.
    output_features = []
    for name, estimator, features in ct.transformers_:
        if name != 'remainder':
            if isinstance(estimator, Pipeline):
                current_features = features
                for step in estimator:
                    current_features = get_feature_out(step, current_features)
                features_out = current_features
            else:
                features_out = get_feature_out(estimator, features)
            output_features.extend(features_out)
        elif estimator == 'passthrough':
            output_features.extend(ct._feature_names_in[features])
    return output_features

pd.DataFrame(transformed_data,
             columns=get_ct_feature_names(combined_pipe))

How to cluster "text document" with "spherical k-means" using Python?

I have finished implementing the traditional k-means text clustering. However, right now, I need to revise my program to "spherical k-means text clustering" but have not succeeded yet.
I've searched for solutions on sites but still cannot revise my program successfully.
The following resources should be helpful for my project, but I still cannot figure out a way:
https://github.com/jasonlaska/spherecluster
https://github.com/khyatith/Clustering-newsgroup-dataset
Spherical k-means implementation in Python
This is my traditional K-means program:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score
import joblib  # store model (sklearn.externals.joblib was removed in scikit-learn 0.23)
vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(tag_document)  # tag_document is a list that contains many strings
true_k = 3  # assume that I want to have 3 clusters
model = KMeans(n_clusters=true_k, init='k-means++', max_iter=100, n_init=1)
model.fit(X)
#store
joblib.dump(model,'save/cluster.pkl')
#restore
clu2 = joblib.load('save/cluster.pkl')
order_centroids = model.cluster_centers_.argsort()[:, ::-1]
terms = vectorizer.get_feature_names()
I expect to cluster text documents with "spherical k-means clustering".
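First check that cosine distance is a meaningful measure for your texts, i.e. that two similar texts really do have a small cosine distance. If so, you can just L2-normalize the vectors and cluster them with ordinary k-means: for unit-length vectors the squared Euclidean distance equals 2 - 2*cos(a, b), so the nearest centroid in Euclidean terms is also the one with the highest cosine similarity. Note that KMeans does not rescale its centroids to unit length, so this approximates spherical k-means rather than reproducing it exactly.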
I did something like this:
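from sklearn.cluster import KMeans
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer
k = 20
kmeans = KMeans(n_clusters=k, init='random', random_state=0)
normalizer = Normalizer(copy=False)  # L2-normalizes every row to unit length
sphere_kmeans = make_pipeline(normalizer, kmeans)
# word2vec_tfidf_vectors is the document feature matrix; labels end up in kmeans.labels_
cluster_distances = sphere_kmeans.fit_transform(word2vec_tfidf_vectors)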
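As a rough check of the normalize-then-cluster idea on TF-IDF input, the sketch below uses four made-up documents (an assumption for illustration only). It checks numerically that squared Euclidean distance and cosine similarity are tied together on unit vectors, then runs plain KMeans and rescales the centroids to unit length afterwards.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import normalize

# Made-up documents, only for illustration.
docs = [
    "the cat chased the mouse",
    "a cat and a kitten played",
    "stocks fell as markets closed",
    "investors sold stocks and bonds",
]

X = TfidfVectorizer(stop_words="english").fit_transform(docs)
X_unit = normalize(X)  # L2 norm per row (TfidfVectorizer already does this by default)

# On unit vectors: ||a - b||^2 == 2 - 2 * cos(a, b)
a = X_unit[0].toarray().ravel()
b = X_unit[1].toarray().ravel()
cosine = float(a @ b)
sq_euclid = float(np.sum((a - b) ** 2))
print(sq_euclid, 2 - 2 * cosine)

model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_unit)
# Rescale centroids so they can be compared to documents by cosine similarity.
centroids_unit = normalize(model.cluster_centers_)
print(model.labels_)
```

A true spherical k-means would renormalize the centroids inside every iteration; doing it once at the end, as here, is only an approximation.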

Sklearn.mixture.dpgmm not functioning correctly

I'm having trouble with sklearn.mixture.dpgmm. The main issue is that it is not returning correct covariances for synthetic data (2 separated 2D Gaussians), where it really should have no issue. In particular, when I call dpgmm._get_covars(), the covariance matrices have diagonal elements that are always exactly 1.0 too large, regardless of the input data distributions. This seems like a bug, as gmm works perfectly (when limited to the known exact number of groups).
Another issue is that dpgmm.weights_ makes no sense: the values sum to one but appear meaningless.
Does anyone have a solution to this or see something clearly wrong with my example?
Here is the exact script I'm running:
import itertools
import numpy as np
from scipy import linalg
import matplotlib.pyplot as plt
import matplotlib as mpl
import pdb
from sklearn import mixture
# Generate 2D random sample, two gaussians each with 10000 points
rsamp1 = np.random.multivariate_normal(np.array([5.0,5.0]),np.array([[1.0,-0.2],[-0.2,1.0]]),10000)
rsamp2 = np.random.multivariate_normal(np.array([0.0,0.0]),np.array([[0.2,-0.0],[-0.0,3.0]]),10000)
X = np.concatenate((rsamp1,rsamp2),axis=0)
# Fit a mixture of Gaussians with EM using 2
gmm = mixture.GMM(n_components=2, covariance_type='full',n_iter=10000)
gmm.fit(X)
# Fit a Dirichlet process mixture of Gaussians using 10 components
dpgmm = mixture.DPGMM(n_components=10, covariance_type='full',min_covar=0.5,tol=0.00001,n_iter = 1000000)
dpgmm.fit(X)
print("Groups With data in them")
print(np.unique(dpgmm.predict(X)))
##print the input and output covars as example, should be very similar
correct_c0 = np.array([[1.0,-0.2],[-0.2,1.0]])
print("Input covar")
print(correct_c0)
covars = dpgmm._get_covars()
c0 = np.round(covars[0], decimals=1)
print("Output Covar")
print(c0)
print("Output Variances Too Big by 1.0")
According to the DPGMM docs, this class is deprecated as of version 0.18 and will be removed in version 0.20.
You should use the BayesianGaussianMixture class instead, with the parameter weight_concentration_prior_type set to "dirichlet_process".
Hope it helps
Instead of writing:
from sklearn.mixture import GMM
gmm = GMM(2, covariance_type='full', random_state=0)
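you should write (GaussianMixture is the direct replacement for the plain GMM class; BayesianGaussianMixture is the one that replaces DPGMM):
from sklearn.mixture import BayesianGaussianMixture
# pass n_components by keyword; recent scikit-learn releases reject it positionally
gmm = BayesianGaussianMixture(n_components=2, covariance_type='full', random_state=0)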
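For reference, here is a sketch of how the script from the question might look with the suggested replacement class. The sample size is reduced to 2000 points per Gaussian and a fixed seed is used, both choices made here for illustration; the printed covariances should be inspected rather than taken for granted, since the fit depends on the priors and the scikit-learn version.

```python
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

rng = np.random.default_rng(0)
cov1 = np.array([[1.0, -0.2], [-0.2, 1.0]])
cov2 = np.array([[0.2, 0.0], [0.0, 3.0]])
X = np.concatenate([
    rng.multivariate_normal([5.0, 5.0], cov1, 2000),
    rng.multivariate_normal([0.0, 0.0], cov2, 2000),
])

# Upper bound of 10 components; the Dirichlet process prior is expected
# to push the weights of unneeded components towards zero.
dpgmm = BayesianGaussianMixture(
    n_components=10,
    covariance_type="full",
    weight_concentration_prior_type="dirichlet_process",
    max_iter=1000,
    random_state=0,
).fit(X)

used = np.unique(dpgmm.predict(X))
print("Components with data in them:", used)
for idx in used:
    print(f"component {idx}: weight={dpgmm.weights_[idx]:.3f}")
    print(np.round(dpgmm.covariances_[idx], 1))
```

The fitted covariances are available through the public covariances_ attribute, so there is no need for a private accessor like _get_covars().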
