I am having trouble applying a function that compares two columns of a dataframe.
import nltk, string
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(tokenizer=normalize, stop_words='english')
def cosine_sim1(text1, text2):
    tfidf = vectorizer.fit_transform([text1, text2])
    return ((tfidf * tfidf.T).A)[0,1]
When I call the function on two strings:
cosine_sim1('like football', 'football')
it returns a score as expected.
My problem is applying that function across two columns of a dataframe to calculate a score for each row. Here is a small sample of the data:
d = pd.DataFrame({'A': ['my name is', 'i live in', 'i like football'], 'B': ['london is nice city', 'london city', 'football']})
I tried the following, but it raises an error:
def cosine_sim1(text1, text2):
    tfidf = vectorizer.fit_transform([text1(d['A']), text2(d['B'])])
    return ((tfidf * tfidf.T).A)[0,1]
d.apply(cosine_sim1, axis=1)
The error is:
TypeError: ("cosine_sim1() missing 1 required positional argument: 'text2'", 'occurred at index 0')

I believe it should be:
def cosine_sim1(text1, text2):
    tfidf = vectorizer.fit_transform([text1, text2])
    return ((tfidf * tfidf.T).A)[0,1]
d.apply(lambda x: cosine_sim1(x.A, x.B), axis=1)
You are applying the function to the DataFrame without passing the two parameters you defined. With axis=1, apply passes each row as a single argument, so the lambda unpacks the row into the two columns.
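To make the row-wise call concrete, here is a minimal sketch. It swaps the TF-IDF scorer for a plain bag-of-words cosine similarity (the hypothetical helper `cosine_sim`) so that it runs without the undefined `normalize` tokenizer; the point is only how the two columns reach a two-argument function.

```python
import math
from collections import Counter

import pandas as pd


def cosine_sim(text1, text2):
    """Cosine similarity on raw word counts (a stand-in for the TF-IDF version)."""
    a = Counter(text1.lower().split())
    b = Counter(text2.lower().split())
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


d = pd.DataFrame({'A': ['my name is', 'i live in', 'i like football'],
                  'B': ['london is nice city', 'london city', 'football']})

# axis=1 hands each row to the lambda, which unpacks the two columns
d['score'] = d.apply(lambda row: cosine_sim(row['A'], row['B']), axis=1)

# Equivalent without apply: iterate over both columns in parallel
d['score_zip'] = [cosine_sim(a, b) for a, b in zip(d['A'], d['B'])]
```

The zip form avoids building a Series for every row, which tends to matter on larger frames.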


Is there a way to add a 'sentiment' column after applying CountVectorizer or TfidfTransformer to a dataframe?

I am working with app store reviews to classify them as class "0" or class "1" based on the text in the review and the sentiment the review carries.
In my classification steps I apply the following methods to my dataframe:
def get_sentiment(s):
    vs = analyzer.polarity_scores(s)
    if vs['compound'] >= 0.5:
        return 1
    elif vs['compound'] <= -0.5:
        return -1
    return 0
df['sentiment'] = df['review'].apply(get_sentiment)
The data I'm working with has already been labeled as either class '0' or '1' in the classification column; I am training the model to classify new instances that have not been labeled yet.
Then I split the data as follows:
msg_train, msg_test, label_train, label_test = train_test_split(df.drop('classification', axis=1), df['classification'], test_size=0.3, random_state=42)
So the dataframe for the X parameter has review and sentiment, and for the y parameter I only have the classification that I am training my model on.
Since the normalization is repetitive, I am running a pipeline like so for simplicity:
pipeline1 = Pipeline([
    ('bow', CountVectorizer(analyzer=clean_review)),
    ('tfidf', TfidfTransformer()),
    ('classifier', MultinomialNB())
])
Where the clean_review function is as follows:
def clean_review(sentence):
    no_punc = [c for c in sentence if c not in string.punctuation]
    no_punc = ''.join(no_punc)
    no_stopwords = [w.lower() for w in no_punc.split() if w not in stopwords_set]
    stemmed_words = [ps.stem(w) for w in no_stopwords]
    return stemmed_words
Here stopwords_set is the collection of English stopwords from the nltk library, and ps is an instance of nltk's PorterStemmer (for word stemming).
I get the following error: ValueError: Found input variables with inconsistent numbers of samples: [2, 505]
When I searched this error before, I saw that the likely issue could've been that there is a mismatch in the number of records for each attribute. I've found this not to be the case. All the records that I am using have values for every column.
Can someone else help me interpret what this error could mean?
My end goal is to have a dataframe that has CountVectorizer and TfidfTransformer applied to the text, but that also retains the sentiment column for each review.
I would then like to be able to train the MultinomialNB classifier on this dataframe and apply this model to other tasks.
I'm not sure what is causing the error, since I don't know what size your dataframe should be. I would need more information. On which line is the error thrown?
As for retaining the sentiment column: you could apply CountVectorizer and TfidfTransformer only to the text data (by the way, you could skip a step and apply TfidfVectorizer directly) and then have another transformer in the pipeline that adds the original sentiment column back before the data reaches the classifier.
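One way to sketch that idea is with scikit-learn's ColumnTransformer, which routes the text column to a vectorizer and the sentiment column to a separate transformer. The sample rows below are made up for illustration. Two choices here are assumptions rather than part of the original pipeline: the custom clean_review analyzer is left out, and the sentiment is one-hot encoded because MultinomialNB rejects negative feature values such as -1. A plausible reading of the [2, 505] error is that the vectorizer received the whole two-column dataframe: iterating over a DataFrame yields its column names, so it would see 2 "documents" against 505 labels.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical sample data standing in for the app store reviews
df = pd.DataFrame({
    'review': ['great app love it', 'crashes all the time', 'it is okay',
               'love the new update', 'terrible and slow', 'works as described'],
    'sentiment': [1, -1, 0, 1, -1, 0],
    'classification': [1, 0, 0, 1, 0, 1],
})

features = ColumnTransformer([
    # A string selects a 1-D column, which is what text vectorizers expect
    ('tfidf', TfidfVectorizer(), 'review'),
    # A list selects a 2-D block; one-hot keeps every feature non-negative
    ('sentiment', OneHotEncoder(handle_unknown='ignore'), ['sentiment']),
])

pipeline = Pipeline([
    ('features', features),
    ('classifier', MultinomialNB()),
])

X = df[['review', 'sentiment']]
y = df['classification']
pipeline.fit(X, y)
predictions = pipeline.predict(X)

# One row of features per review: the sample counts now match the labels
n_feature_rows = pipeline.named_steps['features'].transform(X).shape[0]
```

In real use the pipeline would be fitted on the training split only and evaluated on the held-out split.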

How do I make my algo work with KNN text classification?

I am trying to make my classification accept a text (string) rather than just a number. I am working with a set of pulled articles, and I want the classification algo to show which ones to proceed with and which ones to drop. Passing a number works fine, but it is not very intuitive, even though I know the number identifies an article that belongs to one of the two classes I am working with.
How do I change the logic in the algo so that it accepts a text as the search criterion instead of an anonymous number picked from the 'Unique_id' column? The columns are 'Title', 'Abstract', 'Relevant', 'Label', 'Unique_id'. The reason for concatenating the dataframes at the end of the algo is that I want to compare results. Finally, note that the column 'Label' consists of a list of keywords, so basically I want the algo to read from that column.
When reading the data sources, I did try changing index_col='Unique_id' to index_col='Label', but that did not work either.
An example of what I want:
print("\nPrint KNN1")
print(get_closest_neighs1('search word'), "\n")
print("\nPrint KNN2")
print(get_closest_neighs2('search word'), "\n")
print("\nPrint KNN3")
print(get_closest_neighs3('search word'), "\n")
This is the full code (view end of algo to see above example as it runs today, using a number to identify nearest neighbor):
import pandas as pd
print("\nPerforming Analysis using Text Classification")
data = pd.read_csv('File_1_coltest_demo.csv', sep=';', encoding="ISO-8859-1").dropna()
data['Unique_id'] = data.groupby(['Title', 'Abstract', 'Relevant']).ngroup()
data.to_csv('File_2_coltest_demo_KNN.csv', sep=';', encoding="ISO-8859-1", index=False)
data1 = pd.read_csv('File_2_coltest_demo_KNN.csv', sep=';', encoding="ISO-8859-1", index_col='Unique_id')
data2 = pd.DataFrame(data1, columns=['Abstract', 'Relevant'])
data2.to_csv('File_3_coltest_demo_KNN_reduced.csv', sep=';', encoding="ISO-8859-1", index=False)
print("\nData top 25 items")
print("\nData info")
print("\nData columns")
from sklearn.feature_extraction.text import CountVectorizer
from nltk.tokenize import RegexpTokenizer
token = RegexpTokenizer(r'[a-zA-Z0-9]+')
cv = CountVectorizer(lowercase=True, stop_words='english', ngram_range=(1, 1), tokenizer=token.tokenize)
text_counts = cv.fit_transform(data2['Abstract'])
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
text_counts, data2['Abstract'], test_size=0.5, random_state=1)
print("\nTF IDF")
from sklearn.feature_extraction.text import TfidfVectorizer
tf = TfidfVectorizer()
text_tf = tf.fit_transform(data2['Abstract'])
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
text_tf, data2['Abstract'], test_size=0.3, random_state=123)
from sklearn.neighbors import NearestNeighbors
import pandas as pd
nbrs = NearestNeighbors(n_neighbors=20, metric='euclidean').fit(text_tf)
def get_closest_neighs1(Abstract):
    row = data2.index.get_loc(Abstract)
    distances, indices = nbrs.kneighbors(text_tf.getrow(row))
    names_similar = pd.Series(indices.flatten()).map(data2.reset_index()['Abstract'])
    result = pd.DataFrame({'distance1' : distances.flatten(), 'Abstract' : names_similar}) # 'Unique_id' : names_similar,
    return result
def get_closest_neighs2(Unique_id):
    row = data2.index.get_loc(Unique_id)
    distances, indices = nbrs.kneighbors(text_tf.getrow(row))
    names_similar = pd.Series(indices.flatten()).map(data2.reset_index()['Unique_id'])
    result1 = pd.DataFrame({'Distance' : distances.flatten() / 10, 'Unique_id' : names_similar}) # 'Unique_id' : names_similar,
    return result1
def get_closest_neighs3(Relevant):
    row = data2.index.get_loc(Relevant)
    distances, indices = nbrs.kneighbors(text_tf.getrow(row))
    names_similar = pd.Series(indices.flatten()).map(data2.reset_index()['Relevant'])
    result2 = pd.DataFrame({'distance2' : distances.flatten(), 'Relevant' : names_similar}) # 'Unique_id' : names_similar,
    return result2
print("\nPrint KNN1")
print(get_closest_neighs1(114), "\n")
print("\nPrint KNN2")
print(get_closest_neighs2(114), "\n")
print("\nPrint KNN3")
print(get_closest_neighs3(114), "\n")
data3 = pd.DataFrame(get_closest_neighs1(114))
data4 = pd.DataFrame(get_closest_neighs2(114))
data5 = pd.DataFrame(get_closest_neighs3(114))
del data5['distance2']
data6 = pd.concat([data3, data4, data5], axis=1).reindex(data3.index)
del data6['distance1']
data6.to_csv('File_4_coltest_demo_KNN_results.csv', sep=';', encoding="ISO-8859-1", index=False)
If I understand you right you are trying to do this:
You have vectorised all your documents by their "Abstract" field. Therefore documents with abstracts with similar word distributions should be nearby in TFIDF space.
You want to find the nearest neighbours to a document which has the search keyword.
Therefore you'd need to search the original corpus for the first (or all) documents that contain that keyword,
then find the index of that document (or those documents), and then find their neighbours.
If multiple documents contain the keyword, you would need to sort the index list and merge the overall results somehow, with some weighting.
If this is true, then the keyword search/lookup isn't really "inside" the model, it's just preselecting a document from the corpus. Once you have the document index, you can perform the KNN (repeatedly).
I'm not hugely familiar with Pandas, but I've done this kind of thing "manually" before e.g. by keeping the document titles in a separate array, with a map to the indexes.
I would imagine you would need to replace your data2.index.get_loc() calls with an iteration over the column values for "Label" and do a simple string search on each. Or does Pandas provide search functions within the corpus?
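pandas does provide string search on a column through the Series.str accessor, so the preselection step can be sketched as follows. The sample frame and the helper name find_rows_by_keyword are made up for illustration; the column name 'Label' comes from the question.

```python
import numpy as np
import pandas as pd

# Hypothetical corpus; 'Label' holds the keyword list for each article
docs = pd.DataFrame({
    'Abstract': ['deep nets for image recognition',
                 'crop yield under drought',
                 'camera based robot navigation'],
    'Label': ['neural networks; vision', 'soil; crops', 'Vision; robotics'],
})


def find_rows_by_keyword(frame, keyword, column='Label'):
    """Return positional row numbers whose `column` contains `keyword`."""
    # regex=False treats the keyword literally; na=False skips missing labels
    mask = frame[column].str.contains(keyword, case=False, regex=False, na=False)
    return np.flatnonzero(mask.to_numpy())


rows = find_rows_by_keyword(docs, 'vision')
```

Each returned position could then be passed to text_tf.getrow(row) in place of the data2.index.get_loc(...) lookup, with one kneighbors call per matching document.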

BERT binary text classification gives different results on every run

I am doing binary text classification with BERT, using the Simple Transformers library.
I work in Colab with a GPU runtime.
I generated the train and test sets with sklearn's StratifiedKFold. I have two files with the dictionaries containing my folds.
I run the classification in the following while loop:
from sklearn.metrics import matthews_corrcoef, f1_score
import sklearn
counter = 0
resultatos = []
while counter != len(trainfolds):
    model = ClassificationModel('bert', 'bert-base-multilingual-cased', args={'num_train_epochs': 4, 'learning_rate': 1e-5, 'fp16': False,
                                'max_seq_length': 160, 'train_batch_size': 24, 'eval_batch_size': 24,
                                'warmup_ratio': 0.0, 'weight_decay': 0.00,
                                'overwrite_output_dir': True})
    print("start with fold_{}".format(counter))
    trainfolds["{}_fold".format(counter)].to_csv("/content/data/train.tsv", sep="\t", index = False, header=False)
    print("{}_fold train exported as train.tsv".format(counter))
    testfolds["{}_fold".format(counter)].to_csv("/content/data/dev.tsv", sep="\t", index = False, header=False)
    print("{}_fold test exported as dev.tsv".format(counter))
    train_df = pd.read_csv("/content/data/train.tsv", delimiter='\t', header=None)
    eval_df = df = pd.read_csv("/content/data/dev.tsv", delimiter='\t', header=None)
    train_df = pd.DataFrame({
        'text': train_df[3].replace(r'\n', ' ', regex=True),
    eval_df = pd.DataFrame({
        'text': eval_df[3].replace(r'\n', ' ', regex=True),
    result, model_outputs, wrong_predictions = model.eval_model(eval_df, f1 = sklearn.metrics.f1_score)
    counter += 1
I get different results when running this code on the same folds.
Here, for example, are the F1 scores for two runs:
The F1 Score is: 0.6224511646974427
The F1 Score is: 0.618727463283623
How can they differ like that for the same folds?
I already tried setting a fixed random seed right before the loop starts.
I initialize the model inside the loop because, when it is created outside the loop, it keeps what it has learned from earlier folds: after the 2nd fold I get an F1 score of almost one, even though I delete the cache.
I figured it out myself: set all seeds, and in addition set
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False
as shown in this post, and I get the same results on every run!
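"Set all seeds" can be sketched as a small helper that is called once before the fold loop (or at the top of each iteration). The function name set_all_seeds and the seed value are assumptions; the torch calls are only made when PyTorch is installed, so the sketch also runs on a machine without it.

```python
import random

import numpy as np


def set_all_seeds(seed=42):
    """Seed the Python, NumPy and (if available) PyTorch random generators."""
    random.seed(seed)
    np.random.seed(seed)
    try:
        import torch
    except ImportError:
        return  # PyTorch not installed; nothing more to seed
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)  # silently ignored when no GPU is present
    # Ask cuDNN for deterministic kernels and disable its autotuner
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False


# Re-seeding reproduces the same random draws
set_all_seeds(42)
first = (random.random(), np.random.rand())
set_all_seeds(42)
second = (random.random(), np.random.rand())
```

Deterministic cuDNN kernels can be slower, so this is usually a trade-off between speed and repeatability.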

How to encode multiple categorical columns for test data efficiently?

I have many categorical columns (nearly 50). I apply a custom frequency encoding to the training data and save it as a nested dictionary. For the test data I use the map function to encode, and unseen labels are replaced with 0. Is there a more efficient way?
I have already tried the pandas replace method, but it ignores unseen labels and leaves them as they are. I am mainly concerned about time: I want, say, 80 columns and 1 row to be encoded within 60 ms, so I need the most efficient way to do it. I have taken my example from here.
import pandas
from sklearn import preprocessing
df = pandas.DataFrame({'pets': ['cat', 'dog', 'cat', 'monkey', 'dog', 'meo'],
'owner': ['Champ', 'Ron', 'Brick', 'Champ', 'Veronica', 'Ron'],
'location': ['San_Diego', 'New_York', 'New_York', 'San_Diego', 'San_Diego',
My dict looks something like this :
enc = {'pets': {'cat': 0, 'dog': 1, 'monkey': 2},
'owner': {'Brick': 0, 'Champ': 1, 'Ron': 2, 'Veronica': 3},
'location': {'New_York': 0, 'San_Diego': 1}}
for col in enc:
    if col in input_df.columns:
        input_df[col] = input_df[col].map(enc[col]).fillna(0)
I would also like to encode multiple columns at once, without a loop over every column. I guess that can't be done with map, so replace looks like a good choice, but as I said it does not handle unseen labels.
This is the code I am using for now. Please note that there is only 1 row in the test dataframe (I am not sure whether I should handle it as a numpy array to reduce time). I need to bring the time under 60 ms; currently it takes 331.74 ms. I only have a dictionary for the mapping (I can't use one-hot encoding because of my use case). Any idea how to do this more efficiently? I am not sure multiprocessing would help. With the replace method I ran into two issues: 1. It does not handle unseen labels and leaves them as they are (a problem for strings). 2. It has problems when keys and values overlap.
from string import ascii_lowercase
import itertools
import pandas as pd
import numpy as np
import time
def iter_all_strings():
    # Yield 'a', 'b', ..., 'z', 'aa', 'ab', ... indefinitely
    for size in itertools.count(1):
        for s in itertools.product(ascii_lowercase, repeat=size):
            yield "".join(s)
# Collect column names up to and including 'gr'
l = []
for s in iter_all_strings():
    l.append(s)
    if s == 'gr':
        break
columns = l
df = pd.DataFrame(columns=columns)
for col in df.columns:
    df[col] = np.random.randint(1, 4000, 3000)
# Build one value -> code dictionary per column
transform_dict = {}
for col in df.columns:
    cats = pd.Categorical(df[col]).categories
    d = {}
    for i, cat in enumerate(cats):
        d[cat] = i
    transform_dict[col] = d
print(f"The length of the dictionary is {len(transform_dict)}")
# Creating another test data frame
df2 = pd.DataFrame(columns=columns)
for col in df2.columns:
    df2[col] = np.random.randint(1, 4000, 1)
print(f"The shape of the 2nd data frame is {df2.shape}")
t1 = time.time()
for col in df2.columns:
    df2[col] = df2[col].map(transform_dict[col]).fillna(0)
print(f"Time taken is {time.time() - t1}")
# print(df)
First, when you encode a categorical variable that is not ordinal (meaning there is no inherent ordering between the values of the column, e.g. cat, dog), one-hot encoding is the usual choice, because integer codes suggest an order that does not exist.
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
df = pd.DataFrame({'pets': ['cat', 'dog', 'cat', 'monkey', 'dog', 'meo'],
'owner': ['Champ', 'Ron', 'Brick', 'Champ', 'Veronica', 'Ron'],
'location': ['San_Diego', 'New_York', 'New_York', 'San_Diego', 'San_Diego',
enc = [['cat','dog','monkey'],
['Brick', 'Champ', 'Ron', 'Veronica'],
['New_York', 'San_Diego']]
ohe = OneHotEncoder(categories=enc, handle_unknown='ignore', sparse=False)
Here, I have modified your enc so that it can be passed to OneHotEncoder.
Now comes the question of how to handle unseen values.
When you set handle_unknown='ignore', an unseen value gets zeros in all of its dummy variables, which in a way helps the model understand that it is an unknown value.
# Fit first: ohe.categories_ only exists after fitting
encoded = ohe.fit_transform(df)
colnames = ['{}_{}'.format(col, val) for col, unique_values in zip(df.columns, ohe.categories_)
            for val in unique_values]
pd.DataFrame(encoded, columns=colnames)
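To see the all-zeros behaviour for an unseen value in isolation, here is a minimal sketch on a single made-up column. It calls .toarray() on the default sparse output instead of passing a dense-output flag, since the name of that flag differs between scikit-learn versions.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Known categories for one column; 'meo' is deliberately not among them
ohe = OneHotEncoder(categories=[['cat', 'dog', 'monkey']], handle_unknown='ignore')
ohe.fit(np.array([['cat'], ['dog'], ['monkey']], dtype=object))

# The unseen 'meo' becomes a row of zeros; 'dog' sets the middle column
encoded = ohe.transform(np.array([['meo'], ['dog']], dtype=object)).toarray()
```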
If you are fine with ordinal encoding, the following change could help.
df2.apply(lambda row: [transform_dict[col].get(val, 0)
                       for col, val in row.items()],
          axis=1, result_type='expand')
#1000 loops, best of 3: 1.17 ms per loop
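For a single incoming row, the same lookup can be sketched without pandas at all, which avoids per-column Series overhead. The helper name encode_row and the `unseen` parameter are illustrative; the enc dictionary is the one from the question.

```python
enc = {'pets': {'cat': 0, 'dog': 1, 'monkey': 2},
       'owner': {'Brick': 0, 'Champ': 1, 'Ron': 2, 'Veronica': 3},
       'location': {'New_York': 0, 'San_Diego': 1}}


def encode_row(row, encoding, unseen=0):
    """Encode one record (a dict of column -> value) with plain dict lookups."""
    return {col: encoding[col].get(val, unseen) if col in encoding else val
            for col, val in row.items()}


row = {'pets': 'meo', 'owner': 'Ron', 'location': 'New_York'}
encoded = encode_row(row, enc)
```

Note that with unseen=0 the unknown 'meo' gets the same code as 'cat'; passing a value such as -1 keeps unknowns distinguishable if the model can handle it.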

How do I fix KeyError bug in my code while implementing K-Nearest Neighbours from scratch?

I am trying to implement K-Nearest Neighbours algorithm from scratch in Python. The code I wrote worked well for the Breast-Cancer-Wisconsin.csv dataset.
However, the same code when I try to run for Iris.csv dataset, my implementation fails and gives KeyError.
The only difference between the 2 datasets is that Breast-Cancer-Wisconsin.csv has only 2 classes ('2' for benign and '4' for malignant), and both labels are integers, whereas Iris.csv has 3 classes ('setosa', 'versicolor', 'virginica') and all 3 labels are strings.
Here is the code I wrote (for Iris.csv) :
import numpy as np
from math import sqrt
import matplotlib.pyplot as plt
from matplotlib import style
from collections import Counter
import warnings
import pandas as pd
import random
dataset = {'k':[[1,2],[2,3],[3,1]], 'r':[[6,5],[7,7],[8,6]]}
new_features = [5,7]
#[[plt.scatter(j[0],j[1], s=100, color=i) for j in dataset[i]] for i in dataset]
#plt.scatter(new_features[0], new_features[1], s=100)
def k_nearest_neighbors(data, predict, k=3):
    if len(data) >= k:
        warnings.warn('K is set to a value less than total voting groups!')
    distances = []
    for group in data:
        for features in data[group]:
            euclidean_distance = np.linalg.norm(np.array(features) - np.array(predict))
            distances.append([euclidean_distance, group])
    votes = [i[1] for i in sorted(distances)[:k]]
    vote_result = Counter(votes).most_common(1)[0][0]
    return vote_result
df = pd.read_csv('iris.csv')
df.replace('?', -99999, inplace=True)
#full_data = df.astype(float).values.tolist()
test_size = 0.2
train_set = {'setosa':[], 'versicolor':[], 'virginica':[]}
test_set = {'setosa':[], 'versicolor':[], 'virginica':[]}
train_data = full_data[:-int(test_size*len(full_data))]
test_data = full_data[-int(test_size*len(full_data)):]
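for i in train_data:
    # Last column is the class label; the rest are the features
    train_set[i[-1]].append(i[:-1])
for i in test_data:
    test_set[i[-1]].append(i[:-1])
correct = 0
total = 0
for group in test_set:
    for data in test_set[group]:
        vote = k_nearest_neighbors(train_set, data, k=5)
        if group == vote:
            correct += 1
        total += 1
print('Accuracy : ', correct/total)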
When I run the above code, I get a KeyError message at line number 49.
Could anyone please explain to me where I am going wrong? Also, it would be great if someone could point out how do I modify this algorithm to classify multiple classes (instead of 2 or 3) in the future?
Also, how do I handle if the classes are in string type instead of integer?
one solution I thought of was to convert all string types to integer types and try to solve but would that work?
Let's start from your last question:
one solution I thought of was to convert all string types to integer types and try to solve but would that work?
Yes, that would work. You shouldn't have to hardcode the names of all the classes of every problem in your code. Instead, you can just write a function that reads all the different values for the class attribute, and assigns a numeric value to each different one.
Could anyone please explain to me where I am going wrong?
Most likely, the problem is that you are reading an instance whose class attribute is not 'setosa', 'versicolor', 'virginica' (something like Iris-setosa perhaps?). The idea above should fix this problem.
Also, it would be great if someone could point out how do I modify this algorithm to classify multiple classes (instead of 2 or 3) in the future?
As discussed before, you just need to avoid hard-coding the names of the classes in your code.
Also, how do I handle if the classes are in string type instead of integer?
def get_class_values(data):
    classes_seen = {}
    for i in data:
        # The class label is the last element of each instance
        _class = i[-1]
        if _class not in classes_seen:
            classes_seen[_class] = len(classes_seen)
    return classes_seen
A function like this one would return a mapping between all your classes (no matter the type) and numeric codes (from 0 to N-1). Using this mapping would also solve all the problems mentioned before.
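As a usage sketch, the mapping can replace the hard-coded train_set and test_set keys entirely. The sample rows and the helper group_by_class are made up for illustration, and the function is repeated here so the snippet runs on its own.

```python
from collections import defaultdict


def get_class_values(rows):
    """Map each distinct class label (last element of a row) to 0..N-1."""
    classes_seen = {}
    for row in rows:
        label = row[-1]
        if label not in classes_seen:
            classes_seen[label] = len(classes_seen)
    return classes_seen


def group_by_class(rows, class_map):
    """Group feature vectors by numeric class code, with no hard-coded keys."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[class_map[row[-1]]].append(row[:-1])
    return dict(grouped)


rows = [[5.1, 3.5, 'Iris-setosa'], [7.0, 3.2, 'Iris-versicolor'],
        [6.3, 3.3, 'Iris-virginica'], [4.9, 3.0, 'Iris-setosa']]
class_map = get_class_values(rows)
train_set = group_by_class(rows, class_map)
```

Because the keys come from the data, a label such as 'Iris-setosa' can no longer trigger a KeyError, and the same code works for any number of classes.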
Convert String Labels In CSV Files To Integer Labels
After going through some GitHub repos I came across a very simple yet elegant piece of code that solves the above problem. Hope it helps those who have faced this problem before (beginners especially!)
# read the csv file
df = pd.read_csv('iris.csv')
# clean the data file
df.replace('?', -99999, inplace=True)
# convert the string classes into integer types.
# integers are assigned from 0 to N-1.
# species is the name of the column which has class labels.
df['species'] = df['species'].astype('category')
df['species_value'] = df['species'].cat.codes
df.drop(['species'], axis=1, inplace=True)
# convert the data frame to list
full_data = df.astype(float).values.tolist()
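The category conversion can be checked on a tiny made-up frame. The sketch below also keeps a code-to-label dictionary (the name code_to_label is an assumption) so that predictions can be translated back to species names.

```python
import pandas as pd

# Hypothetical four-row sample in the shape of iris.csv
df = pd.DataFrame({'sepal_length': [5.1, 7.0, 6.3, 4.9],
                   'species': ['setosa', 'versicolor', 'virginica', 'setosa']})

species = df['species'].astype('category')
# .cat.codes gives one integer per row, numbered by sorted category order
df['species_value'] = species.cat.codes
# Keep the reverse mapping to turn predicted codes back into names
code_to_label = dict(enumerate(species.cat.categories))

# With the string column dropped, the float conversion succeeds
numeric = df.drop(columns=['species']).astype(float)
```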
Post Debugging
Turns out that we need not use the above piece of code also, i.e I can get the answer without explicitly converting the string labels into integer labels (using the above code).
I have posted the original code after some minor changes (below) and the key error is now fixed. Also, I am now getting an accuracy of 97% to 100% (only on IRIS dataset).
test_size = 0.2
train_set = {0:[], 1:[], 2:[]}
test_set = {0:[], 1:[], 2:[]}
That is the only change you need to make to the original code I posted in order to make it work!! Simple!
However, please note that the numbers have to be given as integers and not string (otherwise it would lead to key error!).
There are some commented lines in the original code which I thought would be good to explain in case somebody ran into some issues. Here's one snippet with the comments removed (compare with original code in the question).
df = pd.read_csv('iris.csv')
df.replace('?', -99999, inplace=True)
full_data = df.astype(float).values.tolist()
Here's the output you get:
ValueError: could not convert string to float: 'virginica'
What went wrong?
Note that here we did not convert the string labels into integer labels. Therefore, when we tried to convert the data in the CSV to float values, the kernel threw an error because a string cannot be converted to float!
One way around this is simply not to convert the data to floating point values, which avoids the error. However, in many cases you do need all the data as floating point (e.g. for normalisation, accuracy, long mathematical calculations, or to prevent loss of precision).
Hence after heavy debugging and going through a lot of articles I finally came up with a simple version of the original code (below):
import numpy as np
from math import sqrt
import matplotlib.pyplot as plt
from matplotlib import style
from collections import Counter
import warnings
import pandas as pd
import random
def k_nearest_neighbors(data, predict, k=3):
    if len(data) >= k:
        warnings.warn('K is set to a value less than total voting groups!')
    distances = []
    for group in data:
        for features in data[group]:
            euclidean_distance = np.linalg.norm(np.array(features) - np.array(predict))
            distances.append([euclidean_distance, group])
    votes = [i[1] for i in sorted(distances)[:k]]
    vote_result = Counter(votes).most_common(1)[0][0]
    return vote_result
df = pd.read_csv('iris.csv')
df.replace('?', -99999, inplace=True)
df['species'] = df['species'].astype('category')
df['species_value'] = df['species'].cat.codes
df.drop(['species'], axis=1, inplace=True)
full_data = df.astype(float).values.tolist()
test_size = 0.2
train_set = {0:[], 1:[], 2:[]}
test_set = {0:[], 1:[], 2:[]}
train_data = full_data[:-int(test_size*len(full_data))]
test_data = full_data[-int(test_size*len(full_data)):]
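for i in train_data:
    # Last column is the numeric class code; the rest are the features
    train_set[i[-1]].append(i[:-1])
for i in test_data:
    test_set[i[-1]].append(i[:-1])
correct = 0
total = 0
for group in test_set:
    for data in test_set[group]:
        vote = k_nearest_neighbors(train_set, data, k=5)
        if group == vote:
            correct += 1
        total += 1
print('Accuracy : ', (correct/total)*100,'%')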
Hope this helps!
