Building vocabulary using document vector - doc2vec

I am not able to build vocabulary and getting an error:
TypeError: 'int' object is not iterable
Here is my code that is based on medium article:
https://towardsdatascience.com/implementing-multi-class-text-classification-with-doc2vec-df7c3812824d
I tried to provide pandas series, list to build_vocab function.
import pandas as pd
from gensim.test.utils import common_texts
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from sklearn.model_selection import train_test_split
import multiprocessing
import nltk
from nltk.corpus import stopwords
def tokenize_text(text):
tokens = []
for sent in nltk.sent_tokenize(text):
for word in nltk.word_tokenize(sent):
if len(word) < 2:
continue
tokens.append(word.lower())
return tokens
df = pd.read_csv("https://raw.githubusercontent.com/RaRe-Technologies/movie-plots-by-genre/master/data/tagged_plots_movielens.csv")
tags_index = {
"sci-fi": 1,
"action": 2,
"comedy": 3,
"fantasy": 4,
"animation": 5,
"romance": 6,
}
df["tindex"] = df.tag.replace(tags_index)
df = df[["plot", "tindex"]]
mylist = list()
for i, q in df.iterrows():
mylist.append(
TaggedDocument(tokenize_text(str(q["plot"])), tags=q["tindex"])
)
df["tdoc"] = mylist
X = df[["tdoc"]]
y = df["tindex"]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
cores = multiprocessing.cpu_count()
model_doc2vec = Doc2Vec(
dm=1,
vector_size=300,
negative=5,
hs=0,
min_count=2,
sample=0,
workers=cores,
)
model_doc2vec.build_vocab([x for x in X_train["tdoc"]])
The documentation is very confusing for this method.

Doc2Vec needs an iterable sequence of TaggedDocument-like objects for its corpus (as is fed to build_vocab() or train()).
When showing an error, you should also show the full stack that accompanied it, so that it is clear what line-of-code, and surrounding call-frames, are involved.
But, it's unclear if what you've fed into the dataframe, then out via dataframe-bracket-access, then through the train_test_split(), is actually that.
So I'd suggest assigning things to descriptive interim variables, and verifying that they contain the right sorts of things at each step.
Is X_train["tdoc"][0] a proper TaggedDocument, with a words property that is a list-of-strings, and tags property a list-of-tags? (And, where each tag is probably a string, but could perhaps be a plain-int, counting upward from 0.)
Is mylist[0] a proper TaggedDocument?
Separately: many online examples of Doc2Vec use have egregious errors, and the Medium article you link is no exception. Its practice of calling train() multiple times in a loop is usually unneeded, and very error-prone, and in fact in that article results in severe learning-rate alpha mismanagement. (For example, deducting 0.002 from the starting-default alpha of 0.025 30 times results in a negative effective alpha, which is never justified and means the model is making itself worse with every example. This may be a factor contributing to the awful reported classifier accuracy.)
I would disregard that article entirely and seek better examples elsewhere.

Related

sklearn dcg_score not working as expected

This is my code:
from sklearn.metrics import dcg_score
true_relevance = np.asarray([[10]])
scores = np.asarray([[.1]])
dcg_score(true_relevance, scores)
The below code should produce 10 as the dcg_score. The formula from wikipedia gives 10/log2 = 10 But, instead I get ValueError: Only ('multilabel-indicator', 'continuous-multioutput', 'multiclass-multioutput') formats are supported. Got binary instead
Did anyone encounter this?
Since computing dcg on a single element is not meaningful, the sklearn library requires at least two y_true and y_score elements in the corresponding arrays.
You can check this by exploring the sklearn code (or through debugging): https://github.com/scikit-learn/scikit-learn/blob/f3f51f9b611bf873bd5836748647221480071a87/sklearn/utils/multiclass.py#L158
Like:
true_relevance = np.asarray([[10, 5]])
scores = np.asarray([[.1, .2]])
dcg_score(true_relevance, scores)

Scaling row-wise with MinMaxScaler from Sklearn

By default, scalers from Sklearn work column-wise. But i need my data to be scaled line-wise, so i did the following:
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
import numpy as np
# %% Generating sample data
x = np.array([[-1, 4, 2], [-0.5, 8, 9], [3, 2, 3]])
y = np.array([1, 2, 3])
#%% Train/Test split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=2)
scaler = MinMaxScaler()
x_train = scaler.fit_transform(x_train.T).T # scaling line-wise
x_test = scaler.transform(x_test) <-------- Error here
But i am getting the following error:
ValueError: X has 3 features, but MinMaxScaler is expecting 2 features as input.
I don't understand whats wrong here. Why it says it is expecting 2 features, when all my X (x, x_train and x_test) has 3 features? How can i fix this?
StandardScaler is stateful: when you fit it, it calculates and saves the columns' means and standard deviations; when transforming (train or test sets), it uses those saved statistics. Your transpose trick doesn't work with that: each row has saved statistics, and then your test set doesn't have the same rows, so transform cannot work correctly (throwing an error if different number of rows, or silently mis-scaling if the same number of rows).
What you want isn't stateful: test sets should be transformed completely independently of the training set. Indeed, every row should be transformed independently of each other. So you could just do this kind of transformation before splitting, or using fit_transform on the test set('s transpose).
For l2 normalization of rows, there's a builtin for this: Normalizer (docs). I don't think there's an analogue for min-max normalization, but I think you could write a FunctionTransformer to do it.
This is possible to do. I can think of a scenario where this would be useful. Normally, MinMaxScaler would scale each x, y, and z with respect to other observations of that feature. That's the "series" scaling. Now imagine that instead, you wanted to map each point constrained by x+y+z = 1. I think this is what OP is asking for. I have done this in the past, I will describe how I did it.
You need to treat your individual observations as a column multi-index and treat it like a higher-dimensional feature. Then, you need to build a pipeline within which the observations are transformed from column-wise to row wise, post which you do the min/max scaling. This gets you to x+y+z=1, but you still need to get back to the original shape of the data, for which you will need to track the index of each observation. Within the pipeline, you'll need to use something like a DataframeFunctionTransformer which I have seen on the interwebs, reproducing it below. This way you can use pandas functions to shape the data and merge back in with the indices.
class DataframeFunctionTransformer():
def __init__(self, func):
self.func = func
def transform(self, input_df, **transform_params):
return self.func(input_df)
def fit(self, X, y=None, **fit_params):
return self
Regarding the statefulness of MinMaxScaler, I think in a scenario such as this, the state of MinMaxScaler doesn't get used, it is purely acting as a transformer that maps these points to a different space meeting the constraint that x, y, and z are scaled such that they add up to 1.
#Murilo hope this gets you started with a solution. Must be an interesting problem.

Statistical tests: how do (perception; actual results; and next) interact?

What is the interaction between perception, outcome, and outlook?
I've brought them into categorical variables to [potentially] simplify things.
import pandas as pd
import numpy as np
high, size = 100, 20
df = pd.DataFrame({'perception': np.random.randint(0, high, size),
'age': np.random.randint(0, high, size),
'smokes_cat': pd.Categorical(np.tile(['lots', 'little', 'not'],
size//3+1)[:size]),
'outcome': np.random.randint(0, high, size),
'outlook_cat': pd.Categorical(np.tile(['positive', 'neutral',
'negative'],
size//3+1)[:size])
})
df.insert(2, 'age_cat', pd.Categorical(pd.cut(df.age, range(0, high+5, size//2),
right=False, labels=[
"{0} - {1}".format(i, i + 9)
for i in range(0, high, size//2)])))
def tierify(i):
if i <= 25:
return 'lowest'
elif i <= 50:
return 'low'
elif i <= 75:
return 'med'
return 'high'
df.insert(1, 'perception_cat', df['perception'].map(tierify))
df.insert(6, 'outcome_cat', df['outcome'].map(tierify))
np.random.shuffle(df['smokes_cat'])
Run online: http://ideone.com/fftuSv or https://repl.it/repls/MicroLeftSequences
This is faked data but should present the idea. The individual have a perceived view perception, then they are presented with actual outcome, and from that can decide their outlook.
Using Python (pandas, or anything open-source really), how do I show the probability—and p-value—of the interaction between these 3 dependent columns (possibly using the age, smokes_cat as potential confounders)?
You can use interaction plots for this particular purpose. This fits pretty well to your case. I would use such plot for your data. I've tried it for your dummy data generated in the question, and you can write your code like below. Think it as a pseudo-code though, you must tailor the code to your need.
In its simple form:
If the lines in the plot have an intersection or likely to have for other values, then you may assume that there is an interaction effect.
If the lines are parellel or not likely to have an intersection, then you assume there is no interaction effect.
Yet, for additional and deeper understanding, I placed some links that you can check out.
Code
... # The rest of the code in the question.
# Interaction plot
import matplotlib.pyplot as plt
from statsmodels.graphics.factorplots import interaction_plot
p = interaction_plot(
x = df['perception'],
trace=df['outlook_cat'],
response= df['outcome']
)
plt.savefig('./my_interaction_plot.png') # or plt.show()
You can find the documentation of interaction_plot() here. Besides, I also suggest you run an ANOVA.
Further reading
You can check out these links:
(A paper) titled Interaction Effects in ANOVA.
(A case) in practice case.
(Another case) in practice case.
One option is a Multinomial logit model:
# Create one-hot encoded version of categorical variables
from sklearn.preprocessing import LabelEncoder
enc = LabelEncoder()
all_enc_df = pd.DataFrame({column: enc.fit_transform(df[column])
for column in ('perception_cat', 'age_cat',
'smokes_cat', 'outlook_cat')})
# Regression
from sklearn.linear_model import LogisticRegression
X, y = (all_enc_df[['age_cat', 'smokes_cat', 'outlook_cat']],
all_enc_df[['perception_cat']])
#clf = LogisticRegression(random_state=0, solver='lbfgs',
# multi_class='multinomial').fit(X, y)
import statsmodels.api as sm
mullogit = sm.MNLogit(y,X)
mulfit = mullogit.fit(method='bfgs', maxiter=100)
print(mulfit.summary())
https://repl.it/repls/MicroLeftSequences

k means cluster method score negative

guys. I am yet a beginner trying to learn ML so do forgive me for such a simple question. I had a dataset from UCI ML Repository. So, started applying all kinds of unsupervised algorithm in which i also applied K Means Cluster algorithm. When I printed out the accuracy score it was negative, not just once but many times. As far as I know scores aren't negative. So could you please help me as to why it's negative.
Any help is appreciated.
import pandas as pd
import numpy as np
a = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/abalone/abalone.data', names = ["a", "b", "c", "d","e","f","g","h","i"])
b = a
c = b.filter(a.columns[[8]], axis=1)
a.drop(a.columns[[8]], axis=1, inplace=True)
from sklearn.preprocessing import LabelEncoder
le1 = LabelEncoder()
le1.fit(a.a)
a.a = le1.transform(a.a)
from sklearn.preprocessing import OneHotEncoder
x = np.array(a)
y = np.array(c)
ohe = OneHotEncoder(categorical_features=[0])
ohe.fit(x)
x = ohe.transform(x).toarray()
from sklearn.model_selection import train_test_split
xtr, xts, ytr, yts = train_test_split(x,y,test_size=0.2)
from sklearn import cluster
kmean = cluster.KMeans(n_clusters=2, init='k-means++', max_iter=100, n_init=10)
kmean.fit(xtr,ytr)
print(kmean.score(xts,yts))
Thank you!!
The k-means score is an indication of how far the points are from the centroids.
In scikit learn, the score is better the closer to zero it is.
Bad scores will return a large negative number, whereas good scores return close to zero. Generally, you will want to take the absolute value of the output from the scores method for better visualization.
Clustering is not classification.
Note that the 'y' argument of fit is ignored. Kmeans will always predict 0,1,...,k-1. So it will never make a correct label on this data set, because it doesn't even know what a label is supposed to look like. It really doesn't work to transfer what you did in classification to clustering. You need to relearn this from scratch. Different workflow, different evaluation.
It was explained in a book called "Hands-on Machine Learning with Scikit Learn Keras and TensorFlow" by Geron Aurelien.
On page 243 of the book (Chapter 9), it says that "The score() method returns the negative inertia. Why negative? Because a predictor’s score() method must always respect Scikit-Learn’s “greater is better” rule: if a predictor is better than another, its score() method should return a greater score."
Hope this helped!

Informative Features Code not Working

I want to implement a most informative features function for binary NB in SciKit Learn. I am using Python3.
First off, I understand that the question of implementing some sort of 'informative features' function for SciKit's multinomial NB has been asked. However, I have tried the responses and have had no luck - so I think either SciKit updated, or I am doing something very wrong. I am using
tobigue's answer here for a function.
from nltk.corpus import stopwords
import numpy as np
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split
#Array contains a list of (headline, source) tupples where there are two sources.
#I want to classify each headline as belonging to a given source.
array = [('toyota showcases humanoid that mirrors user', 'drudge'), ('virginia again delays vote certification after error in ballot distribution', 'npr'), ("do doctors need to use computers? one physician's case highlights the quandary", 'npr'), ('office sex summons', 'drudge'), ('launch calibrated to avoid military response?', 'drudge'), ('snl skewers alum al franken, trump sons', 'npr'), ('mulvaney shows up for work at consumer watchdog group, as leadership feud deepens', 'npr'), ('indonesia tries to evacuate 100,000 people away from erupting volcano on bali', 'npr'), ('downing street blasts', 'drudge'), ('stocks soar more; records smashed', 'drudge'), ('aid begins to filter back into yemen, as saudi-led blockade eases', 'npr'), ('just look at these fancy port-a-potties', 'npr'), ('nyt turns to twitter activism to thwart', 'drudge'), ('uncertainty reigns in battle for virginia house of delegates', 'npr'), ('u.s. reverses its decision to close palestinian office in d.c.', 'npr'), ("'i don't believe in science,' says flat-earther set to launch himself in own rocket", 'npr'), ("bosnian war chief 'dies' after being filmed 'drinking poison' at the hague", 'drudge'), ('federal judge blocks new texas anti-abortion law', 'npr'), ('gm unveils driverless cars, aiming to lead pack', 'drudge'), ('in japan, a growing scandal over companies faking product-quality data', 'npr')]
#I want to classify each headline as belonging to a given source.
def scikit_naivebayes(data_array):
headlines = [element[0] for element in data_array]
sources = [element[1] for element in data_array]
text_clf = Pipeline([('vect', CountVectorizer(stop_words='english')), ('tfidf', TfidfTransformer()),('clf', MultinomialNB())])
cf1 = text_clf.fit(headlines, sources)
train(cf1,headlines,sources)
#Call most_informative_features function on CountVectorizer and classifier
show_most_informative_features(CountVectorizer, cf1)
def train(classifier, X, y):
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=33)
classifier.fit(X_train, y_train)
print ("Accuracy: {}".format(classifier.score(X_test, y_test)))
#tobigue's code:
def show_most_informative_features(vectorizer, clf, n=20):
feature_names = vectorizer.get_feature_names()
coefs_with_fns = sorted(zip(clf.coef_[0], feature_names))
top = zip(coefs_with_fns[:n], coefs_with_fns[:-(n + 1):-1])
for (coef_1, fn_1), (coef_2, fn_2) in top:
print ("\t%.4f\t%-15s\t\t%.4f\t%-15s" % (coef_1, fn_1, coef_2, fn_2))
def main():
scikit_naivebayes(array)
main()
#ERROR:
# File "file_path_here", line 34, in program_name
# feature_names = vectorizer.get_feature_names()
# TypeError: get_feature_names() missing 1 required positional argument: 'self'
You need to fit the CountVectorizer before calling vectorizer.get_feature_names(). In your code, you only call the other function with the class CountVectorizer, which won't lead to anything.
You should try independtly from your pipeline to create a vectorizer with CountVectorizer, and then call fit on your text, and eventually use the function already provided, though you should further adapt it by yourself to your problem.
You should understand easily that the function you use needs an instanciated object, and not a class. Tell me if you don't.
Edit
coef_ is an attribute only accessible by an estimator, i.e a classifier (and not all). Pipeline is a sklearn object used to combined different steps in order to feed a classifier. Typically, a bag-of-words pipeline is constitued by a feature extractor and a classifier (here logistic regression):
pipeline = Pipeline([
('vectorizer', CountVectorizer(args)),
('classifier', LogisticRegression()
])
So, in your case, you should either avoid using pipeline (what I recommend you to begin), or use get_params() method from the pipeline to access the classifier.
I suggest you to fit_transform the text, then feed the transformed result to a logistic regression or naive bayes classifier, and then call the function you have :
vectorizer = CountVectorizer(stop_words='english')
X = vectorizer.fit_transform(headlines, sources)
naive_bayes = MultinomialNB()
naive_bayes.fit(X, sources)
show_most_informative_features(vectorizer, naive_bayes)
First try that, and if it works you'll understand better how to then use a pipeline. Note that your Pipeline should not work as you combine to feature extractors, the last step should be an estimator. If you want to stack to features extractors, you need to look out for FeatureUnion

Resources