How to remove a KeyError from the program? - python-3.x

As you can see in the image, my DataFrame has an ID column, but I am still getting a KeyError. I am trying to build a recommendation algorithm, and calling the function at the bottom of the code (first argument: the id of the book, second argument: the number of books to recommend) raises this error:
KeyError: <built-in function id>
I am following this article: https://towardsdatascience.com/recommender-engine-under-the-hood-7869d5eab072
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

ds = pd.read_csv("test1.csv")  # you can plug in your own list of products or movies or books here as a csv file
tf = TfidfVectorizer(analyzer='word', ngram_range=(1, 3), min_df=0, stop_words='english')
# ngram explanation begins
# ngram (1,3) can be explained as follows:
# ngram (1,3) encompasses unigrams, bigrams and trigrams
# consider the sentence "The ball fell"
# ngram (1,3) would be: the, ball, fell, the ball, ball fell, the ball fell
# ngram explanation ends
tfidf_matrix = tf.fit_transform(ds['Book Title'])
cosine_similarities = linear_kernel(tfidf_matrix, tfidf_matrix)
results = {}  # dictionary created to store the result in the format ID: (score, item_id)
for idx, row in ds.iterrows():  # iterates through all the rows
    # 'similar_indices' stores similar ids based on cosine similarity, sorted in ascending
    # order. [:-5:-1] is then used so that the indices with the most similarity come first.
    # 0 means no similarity and 1 means perfect similarity.
    similar_indices = cosine_similarities[idx].argsort()[:-5:-1]
    # stores the 5 most similar books; you can change it as per your needs
    similar_items = [(cosine_similarities[idx][i], ds['ID'][i]) for i in similar_indices]
    results[row['ID']] = similar_items[1:]

# 'item(id)' returns the Book Title of the row matching the id. Initially it is a
# DataFrame, then we convert it to a list.
def item(id):
    return ds.loc[ds['ID'] == id]['Book Title'].tolist()[0]

def recommend(id, num):
    if num == 0:
        print("Unable to recommend any book as you have not chosen the number of books to be recommended")
    elif num == 1:
        print("Recommending " + str(num) + " book similar to " + item(id))
    else:
        print("Recommending " + str(num) + " books similar to " + item(id))
    print("----------------------------------------------------------")
    recs = results[id][:num]
    for rec in recs:
        print("You may also like to read: " + item(rec[1]) + " (score:" + str(rec[0]) + ")")

# the first argument is the id of the book, the second argument is the number of books you want recommended
recommend(5, 2)
I have tried running it, and it works up to the point where results is built; after that I get the error.

This happens because Python's built-in function id is what gets picked up around def item(id):. Instead of id you should declare another identifier; I think this is the only reason for the KeyError.
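A minimal sketch of the failure mode (the results dictionary here is a hypothetical stand-in for the one built above): if the built-in id ever reaches the dictionary lookup instead of a book id, you get exactly the reported error.

results = {5: [(0.4, 7), (0.2, 9)]}
print(results[5])   # works: 5 is a key
print(results[id])  # KeyError: <built-in function id> -- id is Python's built-in, not a book id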

As the error suggests, id is a built-in function in Python 3. So if you change the name of the parameter id in def item(id) and def recommend(id, num), together with all their references, the code should work.
After renaming id and correcting the indentation, the code could look like this:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

ds = pd.read_csv("test1.csv")  # you can plug in your own list of products or movies or books here as a csv file
tf = TfidfVectorizer(analyzer='word', ngram_range=(1, 3), min_df=0, stop_words='english')
# ngram explanation begins
# ngram (1,3) can be explained as follows:
# ngram (1,3) encompasses unigrams, bigrams and trigrams
# consider the sentence "The ball fell"
# ngram (1,3) would be: the, ball, fell, the ball, ball fell, the ball fell
# ngram explanation ends
tfidf_matrix = tf.fit_transform(ds['Book Title'])
cosine_similarities = linear_kernel(tfidf_matrix, tfidf_matrix)
results = {}  # dictionary created to store the result in the format ID: (score, item_id)
for idx, row in ds.iterrows():  # iterates through all the rows
    # 'similar_indices' stores similar ids based on cosine similarity, sorted in ascending
    # order. [:-5:-1] is then used so that the indices with the most similarity come first.
    # 0 means no similarity and 1 means perfect similarity.
    similar_indices = cosine_similarities[idx].argsort()[:-5:-1]
    # stores the 5 most similar books; you can change it as per your needs
    similar_items = [(cosine_similarities[idx][i], ds['ID'][i]) for i in similar_indices]
    results[row['ID']] = similar_items[1:]

# 'item(ID)' returns the Book Title of the row matching the ID. Initially it is a
# DataFrame, then we convert it to a list.
def item(ID):
    return ds.loc[ds['ID'] == ID]['Book Title'].tolist()[0]

def recommend(ID, num):
    if num == 0:
        print("Unable to recommend any book as you have not chosen the number of books to be recommended")
    elif num == 1:
        print("Recommending " + str(num) + " book similar to " + item(ID))
    else:
        print("Recommending " + str(num) + " books similar to " + item(ID))
    print("----------------------------------------------------------")
    recs = results[ID][:num]
    for rec in recs:
        print("You may also like to read: " + item(rec[1]) + " (score:" + str(rec[0]) + ")")

# the first argument is the ID of the book, the second argument is the number of books you want recommended
recommend(5, 2)

Related

How do I get accurate OCR results with PyTorch with a vertical line next to the text I want?

I want to pull the BOL number from the invoices in the picture below. I've pulled a lot of data successfully, but sometimes the vertical line in the table, to the left of the target data, gets misread as a |, a \, or a 1, which causes problems.
Currently I'm failing the invoice upload when the confidence is low, but the confidence is too often low.
How can I make PyTorch "understand" that the vertical line is not part of the target data, and thus improve confidence?
I'm using EasyOCR, which uses PyTorch.
Here's some relevant code:
import cv2
import easyocr
import numpy as np
from pdf2image import convert_from_path

def extract_bol_from_pdf(data: dict, result, testing=False):
    """
    Try to get the BOL number from an invoice.

    :param data: contains information needed for uploads, such as
        BOL, invoice #, etc.
    :param result: nested lists and tuples of x- and y- coordinates
        of where the strings were found in the images.
    :param testing: if True, pull BOL regardless of confidence
    :return: the data dict
    """
    low_confidence = None
    bol_tuples = [tup for tup in result if "BOL" in tup[1]]
    if len(bol_tuples) == 0:
        logger.info("Couldnt find BOL string")
    else:
        bol_tuple = bol_tuples[0]
        bol_num = find_match_by_vertical_alignment(bol_tuple, "BOL", result)
        if len(bol_num) == 1:
            bol_num_tuple = bol_num[0]
            confidence = bol_num_tuple[2]
            if confidence > 0.93 or testing:
                logger.info(f"Got the BOL #: {bol_num_tuple[1]}")
                data["bol"] = bol_num_tuple[1]
            else:
                logger.warning(
                    "Image recognition confidence is too low, cannot "
                    f"capture BOL. Confidence: {confidence} BOL Guess: {bol_num_tuple[1]}"
                )
                low_confidence = True
        else:
            logger.info("Couldnt find BOL number")
    return data, low_confidence

pages = convert_from_path(attachment_path, dpi=300, grayscale=True, fmt="jpeg")
for i, page in enumerate(pages, start=1):
    image_name = attachment_path.parent.joinpath(f"page_{str(i).zfill(3)}.jpg")
    page.save(image_name, "JPEG")
pages = [
    page
    for page in sorted(list(attachment_path.parent.iterdir()))
    if "page" in page.name
]
for page in pages:
    reader = easyocr.Reader(["en"], gpu=USE_GPU)
    result = reader.readtext(str(page))
    invoice_data, low_confidence = extract_bol_from_pdf(
        invoice_data, result, testing=testing
    )
Removing the vertical lines beforehand might be helpful, but existing questions about this refer to pytesseract rather than PyTorch: pytesseract Erase table borders as delicately as possible
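The same idea still applies here, though, because EasyOCR simply consumes whatever image you hand it: you can erase long vertical strokes with OpenCV morphology before calling reader.readtext. A rough sketch, where the file name and the 40-pixel kernel height are assumptions to tune against your 300 dpi scans:

import cv2

img = cv2.imread("page_001.jpg", cv2.IMREAD_GRAYSCALE)  # hypothetical page image
# binarize so strokes are white on black, which is what the morphology expects
thresh = cv2.threshold(img, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)[1]
# a 1x40 kernel only responds to long vertical strokes (table borders), not characters
vertical_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (1, 40))
lines = cv2.morphologyEx(thresh, cv2.MORPH_OPEN, vertical_kernel, iterations=2)
# paint the detected lines white in the original page, then OCR the cleaned image
cleaned = img.copy()
cleaned[lines > 0] = 255
cv2.imwrite("page_001_nolines.jpg", cleaned)  # feed this path to reader.readtext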

How to create a dataframe of a particular size containing both continuous and categorical values with a uniform random distribution

So, I'm trying to generate some fake random data of a given dimension. Essentially, I want a dataframe in which the data has a uniform random distribution. The data consist of both continuous and categorical values. I've written the following code, but it doesn't work the way I want it to.
import random
import pandas as pd
import time
from datetime import datetime

# declare global variables
adv_name = ['soft toys', 'kitchenware', 'electronics',
            'mobile phones', 'laptops']
adv_loc = ['location_1', 'location_2', 'location_3',
           'location_4', 'location_5']
adv_prod = ['baby product', 'kitchenware', 'electronics',
            'mobile phones', 'laptops']
adv_size = [1, 2, 3, 4, 10]
adv_layout = ['static', 'dynamic']  # advertisement layout type on website
# adv_date, start_time, end_time = []
num = 10  # the given dimension

# define function to generate random advert locations
def rand_shuf_loc(str_lst, num):
    lst = adv_loc
    # using list comprehension
    rand_shuf_str = [item for item in lst for i in range(num)]
    return rand_shuf_str

# define function to generate random advert names
def rand_shuf_prod(loc_list, num):
    rand_shuf_str = [item for item in loc_list for i in range(num)]
    random.shuffle(rand_shuf_str)
    return rand_shuf_str

# define function to generate random impression and click data
def rand_clic_impr(num):
    rand_impr_lst = []
    click_lst = []
    for i in range(num):
        rand_impr_lst.append(random.randint(0, 100))
        click_lst.append(random.randint(0, 100))
    return {'rand_impr_lst': rand_impr_lst, 'rand_click_lst': click_lst}

# define function to generate random product price and discount
def rand_prod_price_discount(num):
    prod_price_lst = []   # advertised product price
    prod_discnt_lst = []  # advertised product discount
    for i in range(num):
        prod_price_lst.append(random.randint(10, 100))
        prod_discnt_lst.append(random.randint(10, 100))
    return {'prod_price_lst': prod_price_lst, 'prod_discnt_lst': prod_discnt_lst}

def rand_prod_click_timestamp(stime, etime, num):
    prod_clik_tmstmp = []
    frmt = '%d-%m-%Y %H:%M:%S'
    for i in range(num):
        rtime = int(random.random() * 86400)
        hours = int(rtime / 3600)
        minutes = int((rtime - hours * 3600) / 60)
        seconds = rtime - hours * 3600 - minutes * 60
        time_string = '%02d:%02d:%02d' % (hours, minutes, seconds)
        prod_clik_tmstmp.append(time_string)
    time_stmp = [item for item in prod_clik_tmstmp for i in range(num)]
    return {'prod_clik_tmstmp_lst': time_stmp}

def main():
    print('generating data...')
    # print('generating random geographic coordinates...')
    # get the impressions and click data
    impression = rand_clic_impr(num)
    clicks = rand_clic_impr(num)
    product_price = rand_prod_price_discount(num)
    product_discount = rand_prod_price_discount(num)
    prod_clik_tmstmp = rand_prod_click_timestamp("20-01-2018 13:30:00",
                                                 "23-01-2018 04:50:34", num)
    lst_dict = {"ad_loc": rand_shuf_loc(adv_loc, num),
                "prod": rand_shuf_prod(adv_prod, num),
                "imprsn": impression['rand_impr_lst'],
                "cliks": clicks['rand_click_lst'],
                "prod_price": product_price['prod_price_lst'],
                "prod_discnt": product_discount['prod_discnt_lst'],
                "prod_clik_stmp": prod_clik_tmstmp['prod_clik_tmstmp_lst']}
    fake_data = pd.DataFrame.from_dict(lst_dict, orient="index")
    res = fake_data.apply(lambda x: x.fillna(0)
                          if x.dtype.kind in 'biufc'
                          # where 'biufc' means boolean, integer,
                          # unsigned integer, float & complex data types
                          else x.fillna(random.randint(0, 100)))
    print(res.transpose())
    res.to_csv("fake_data.csv", sep=",")

# invoke the main function
if __name__ == "__main__":
    main()
Problem 1
When I execute the above code snippet, the printout looks fine, but when written to csv the data is positioned horizontally. How do I position it vertically when writing to the csv file? What I want is 7 columns (see the lst_dict variable above) with n rows.
Problem 2
I don't understand why a random date is generated for the first 50 columns while the remaining columns are filled with numerical values.
To answer your first question: transpose returns a new DataFrame rather than modifying res in place, so keep the transposed result before printing and writing. Replace
print(res.transpose())
with
res = res.transpose()
print(res)
To answer your second question, look at the length of the output of the method
rand_shuf_loc()
It, as well as the other helper functions, only produces a list of 50 items.
The creation of res using
fake_data.apply
replaces all NaN values with a random numeric value, so it also fills the columns that had no predefined values with numbers.
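If the end goal is simply 7 named columns with n rows each, a sketch of a column-wise construction (assuming you first pad every list in lst_dict from your main() to a common length n) avoids the transpose altogether, because pandas treats dict keys as column headers:

import pandas as pd

n = max(len(v) for v in lst_dict.values())
# pad the shorter lists so every column has exactly n entries
padded = {k: v + [None] * (n - len(v)) for k, v in lst_dict.items()}
fake_data = pd.DataFrame(padded)  # the 7 keys become the 7 column headers
fake_data.to_csv("fake_data.csv", index=False)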

How to simplify text comparison for a big data set where the text meaning is the same but not exact - deduplicate text data

I have a text data set (different menu items like chocolate, cake, coke, etc.) of around 1.8 million records belonging to 6 categories (A, B, C, D, E, F). One of the categories has around 700k records. Many of the menu items are mixed into categories they don't belong to; for example, cake belongs to category 'A' but is found in categories 'B' and 'C' as well.
I want to identify those misclassified items and report them, but the challenge is that the item names are not always spelled correctly because they are entirely human-typed text. For example, chocolate might be entered as hot chclt, sweet choklate, chocolat, etc. There can also be items like chocolate cake ;)
To handle this, I tried a simple method using cosine similarity to compare category-wise and identify the anomalies, but it takes a lot of time since I am comparing each item to 1.8 million records (sample code shown below). Can anyone suggest a better way to deal with this problem?
# Function
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

def cos_similarity(a, b):
    X = a
    Y = b
    # tokenization
    X_list = word_tokenize(X)
    Y_list = word_tokenize(Y)
    # sw contains the list of stopwords
    sw = stopwords.words('english')
    l1 = []
    l2 = []
    # remove stop words from the string
    X_set = {w for w in X_list if w not in sw}
    Y_set = {w for w in Y_list if w not in sw}
    # form a set containing keywords of both strings
    rvector = X_set.union(Y_set)
    for w in rvector:
        if w in X_set:
            l1.append(1)  # create a vector
        else:
            l1.append(0)
        if w in Y_set:
            l2.append(1)
        else:
            l2.append(0)
    c = 0
    # cosine formula
    for i in range(len(rvector)):
        c += l1[i] * l2[i]
    if float((sum(l1) * sum(l2)) ** 0.5) > 0:
        cosine = c / float((sum(l1) * sum(l2)) ** 0.5)
    else:
        cosine = 0
    return cosine

# Base code
cos_sim_list = []
for i in category_B.index:
    ln_cosdegree = 0
    ln_degsem = []
    for j in category_A.index:
        ln_j = str(category_A['item_name'][j])
        ln_i = str(category_B['item_name'][i])
        degreeOfSimilarity = cos_similarity(ln_j, ln_i)
        if degreeOfSimilarity > 0.5:
            cos_sim_list.append([ln_j, ln_i, degreeOfSimilarity])
Assume the text is already cleaned.
I used NearestNeighbors and cosine-style TF-IDF matching to solve this case. Though I run the code multiple times to compare category by category, it is still effective because of the small number of categories. Please suggest a better solution if one is available.
import re
import time
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

# character n-gram analyzer used by the vectorizer below; this helper was
# missing from the original snippet, and a common definition looks like this
def ngrams(string, n=3):
    string = re.sub(r'[,-./]', '', string)
    return [''.join(g) for g in zip(*[string[i:] for i in range(n)])]

cat_A_clean = category_A['item_name'].unique()
print('Vectorizing the data - this could take a few minutes for large datasets...')
vectorizer = TfidfVectorizer(min_df=1, analyzer=ngrams, lowercase=False)
tfidf = vectorizer.fit_transform(cat_A_clean)
print('Vectorizing completed...')

nbrs = NearestNeighbors(n_neighbors=1, n_jobs=-1).fit(tfidf)
unique_B = set(category_B['item_name'].values)

def getNearestN(query):
    queryTFIDF_ = vectorizer.transform(query)
    distances, indices = nbrs.kneighbors(queryTFIDF_)
    return distances, indices

t1 = time.time()
print('getting nearest n...')
distances, indices = getNearestN(unique_B)
t = time.time() - t1
print("COMPLETED IN:", t)

unique_B = list(unique_B)
print('finding matches...')
matches = []
for i, j in enumerate(indices):
    # j holds the index of the single nearest neighbour in cat_A_clean,
    # which is a plain array of unique names (not a DataFrame)
    temp = [round(distances[i][0], 2), cat_A_clean[j[0]], unique_B[i]]
    matches.append(temp)
print('Building data frame...')
matches = pd.DataFrame(matches, columns=['Match confidence (lower is better)', 'ITEM_A', 'ITEM_B'])
print('Done')
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def clean_string(text):
    text = str(text)
    text = text.lower()
    return text

def cosine_sim_vectors(vec1, vec2):
    vec1 = vec1.reshape(1, -1)
    vec2 = vec2.reshape(1, -1)
    return cosine_similarity(vec1, vec2)[0][0]

def cos_similarity(sentences):
    cleaned = list(map(clean_string, sentences))
    vectorizer = CountVectorizer().fit_transform(cleaned)
    vectors = vectorizer.toarray()
    return cosine_sim_vectors(vectors[0], vectors[1])

cos_sim_list = []
for ind in matches.index:
    a = matches['Match confidence (lower is better)'][ind]
    b = matches['ITEM_A'][ind]
    c = matches['ITEM_B'][ind]
    degreeOfSimilarity = cos_similarity([b, c])
    cos_sim_list.append([a, b, c, degreeOfSimilarity])

How to add an if statement that skips a method if there is no user input

I am creating a simple movie recommendation system that returns the movies most similar to a user's input. However, I also created a block that returns the top rated movies from the dataset. I'm trying to figure out how to implement an if statement that skips the similar-movies logic if the user doesn't input a movie title.
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Read the dataset
df = pd.read_csv("movie_dataset.csv")

# Select the best features that will result in better recommendations
features = ['keywords', 'cast', 'genres', 'director']

# Combine features into a single string so that they can be
# represented by a single line on a graph (for cosine similarity)
def combine_features(row):
    return row['keywords'] + " " + row['cast'] + " " + row['genres'] + " " + row['director']

# Clean the data so there are no NaN values (preprocessing)
for feature in features:
    df[feature] = df[feature].fillna('')  # fills all NaNs with a blank string

# Apply the combine_features method to the dataset
df["combined_features"] = df.apply(combine_features, axis=1)

# Using the strings created by combine_features, create a count matrix (for the creation of the similarity matrix)
cv = CountVectorizer()
count_matrix = cv.fit_transform(df["combined_features"])

# Using the count matrix, create a similarity matrix using cosine similarity
cosine_sim = cosine_similarity(count_matrix)

# These functions simply help get the title from the index and the index from the title
def get_title_from_index(index):
    return df[df.index == index]["title"].values[0]

def get_index_from_title(title):
    try:
        return df[df.title == title]["index"].values[0]
    except IndexError:
        return print("\nPlease select a different Movie\nHere are the Top Rated movies:\n")

# Asks for user input to return movies that the user likes
MTitle = input("\nType in a movie title: ")
movie_user_likes = MTitle

# Takes the user input so the program can locate the row the title is on
movie_index = get_index_from_title(MTitle)

# Produces the similarity scores that the program can search to find the most similar movies
similar_movies = list(enumerate(cosine_sim[movie_index]))

# Sorts the similarity scores so the program can return the best scores first
sorted_similar_movies = sorted(similar_movies, key=lambda x: x[1], reverse=True)[1:]

# Returns the most similar movies based on the user's input
i = 0
print("\nSimilar movies to " + movie_user_likes + " are:\n")
for element in sorted_similar_movies:
    print(get_title_from_index(element[0]))
    i = i + 1
    if i > 9:
        break
I want the program to default to here if the user doesn't enter a movie title.
# If the user does not enter a movie title the program can default to the top rated movies in the dataset
sort_by_average_vote = sorted(sorted_similar_movies, key=lambda x: df["vote_average"][x[0]], reverse=True)
# print(sort_by_average_vote)
i = 0
print("\nTop Rated Movies:\n")
for element in sort_by_average_vote:
    print(get_title_from_index(element[0]))
    i = i + 1
    if i > 5:
        break
If no input is provided to input() (e.g. the enter key is pressed on its own), it returns an empty string, which in Python is falsy, so it can be used directly in a condition:
movie_title = input("\nType in a movie title: ")
if movie_title:
    print("User typed in movie: '{}'".format(movie_title))
else:
    print("User did not type in a movie title")
To protect against the user entering only a space or tab, you can strip() the string you receive from input(), which removes leading and trailing whitespace:
movie_title = input("\nType in a movie title: ").strip()
if movie_title:
    print("User typed in movie: '{}'".format(movie_title))
else:
    print("User did not type in a movie title")

I want to make a dictionary of trigrams out of a text file, but something is wrong and I do not know what it is

I have written a program which counts trigrams that occur 5 times or more in a text file. The trigrams should be printed out according to their frequency.
I cannot find the problem!
I get the following error message:
list index out of range
I have tried making the range bigger, but that did not work.
f = open("bsp_file.txt", encoding="utf-8")
text = f.read()
f.close()

words = []
for word in text.split():
    word = word.strip(",.:;-?!-–—_ ")
    if len(word) != 0:
        words.append(word)

trigrams = {}
for i in range(len(words)):
    word = words[i]
    nextword = words[i + 1]
    nextnextword = words[i + 2]
    key = (word, nextword, nextnextword)
    trigrams[key] = trigrams.get(key, 0) + 1

l = list(trigrams.items())
l.sort(key=lambda x: x[1])
l.reverse()

for key, count in l:
    if count < 5:
        break
    word = key[0]
    nextword = key[1]
    nextnextword = key[2]
    print(word, nextword, nextnextword, count)
The result should look like this (simplified):
s = "this is a trigram which is an example............."
this is a
is a trigram
a trigram which
trigram which is
which is an
is an example
As the comments pointed out, you're iterating over your list words with i while accessing words[i + 1] and words[i + 2]; when i reaches the last cells of words, i + 1 and i + 2 are out of range.
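The minimal fix is to stop the loop two items early, so that i + 2 is always a valid index:

for i in range(len(words) - 2):
    word = words[i]
    nextword = words[i + 1]
    nextnextword = words[i + 2]
    key = (word, nextword, nextnextword)
    trigrams[key] = trigrams.get(key, 0) + 1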
For a more general approach, I suggest you read this tutorial on generating n-grams with pure Python: http://www.albertauyeung.com/post/generating-ngrams-python/
Answer
If you don't have much time to read it all, here's the function I recommend, adapted from the link:
def get_ngrams_count(words, n):
    # generates a list of tuples representing all n-grams
    ngrams_tuple = zip(*[words[i:] for i in range(n)])
    # turn the list into a dictionary with the counts of all ngrams
    ngrams_count = {}
    for ngram in ngrams_tuple:
        if ngram not in ngrams_count:
            ngrams_count[ngram] = 0
        ngrams_count[ngram] += 1
    return ngrams_count

trigrams = get_ngrams_count(words, 3)
Please note that you can make this function a lot simpler by using a Counter (which subclasses dict, so it will be compatible with your code) :
from collections import Counter

def get_ngrams_count(words, n):
    # turn the list into a dictionary with the counts of all ngrams
    return Counter(zip(*[words[i:] for i in range(n)]))

trigrams = get_ngrams_count(words, 3)
Side Notes
You can use the bool argument reverse in .sort() to sort your list from most common to least common:
l = list(trigrams.items())
l.sort(key=lambda x: x[1], reverse=True)
This is a tad faster than sorting your list in ascending order and then reversing it with .reverse().
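And since the simplified version above returns a Counter, you could skip the manual sort entirely: Counter.most_common() already yields (ngram, count) pairs ordered from most to least common.

l = trigrams.most_common()  # list of (ngram, count), most frequent first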
A more generic function for printing your sorted list (it works for any n-grams, not just trigrams):
for ngram, count in l:
    if count < 5:
        break
    # " ".join(ngram) combines all elements of ngram into a string, separated with spaces
    print(" ".join(ngram), count)
