I'm pretty new to Python and I'm currently working on an assignment to implement a movie recommendation system. I have a .csv file that contains various attributes describing each movie. I ask the user for a movie title and then the system returns similar movies.
The dataset is named movie_dataset.csv from this folder on GitHub: https://github.com/codeheroku/Introduction-to-Machine-Learning/tree/master/Building%20a%20Movie%20Recommendation%20Engine
The problem I am encountering is that when I ask the user to enter a movie title, the program only works for certain titles.
The code:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
#helper functions#
def get_title_from_index(index):
    return df[df.index == index]["title"].values[0]

def get_index_from_title(title):
    return df[df.title == title]["index"].values[0]

df = pd.read_csv("movie_dataset.csv")
#print (df.columns)

features = ['keywords','cast','genres','director']
for feature in features:
    df[feature] = df[feature].fillna('')

def combine_features(row):
    return row['keywords'] +" "+ row['cast'] +" "+ row['genres'] +" "+ row['director']

df["combine_features"] = df.apply(combine_features, axis=1)
#print (df["combine_features"].head())

cv = CountVectorizer()
count_matrix = cv.fit_transform(df["combine_features"])

#MTitle = input("Type in a movie title: ")
cosine_sim = cosine_similarity(count_matrix)

movie_user_likes = 'Avatar'  #MTitle
movie_index = get_index_from_title(movie_user_likes)
similar_movies = list(enumerate(cosine_sim[movie_index]))
sorted_similar_movies = sorted(similar_movies, key= lambda x:x[1], reverse=True)

i = 0
for movie in sorted_similar_movies:
    print (get_title_from_index(movie[0]))
    i=i+1
    if i>10:
        break
When I enter "Batman" the program runs fine, but when I enter "Harry Potter" I get:
IndexError Traceback (most recent call last)
<ipython-input-51-687ddb420709> in <module>
30 movie_user_likes = MTitle
31
---> 32 movie_index = get_index_from_title(movie_user_likes)
33
34 similar_movies = list(enumerate(cosine_sim[movie_index]))
<ipython-input-51-687ddb420709> in get_index_from_title(title)
8
9 def get_index_from_title(title):
---> 10 return df[df.title == title]["index"].values[0]
11
12 df = pd.read_csv("movie_dataset.csv")
IndexError: index 0 is out of bounds for axis 0 with size 0
There's simply no entry in the database for the movie "Harry Potter".
You should add some error handling for these cases, such as:
def get_index_from_title(title):
    try:
        return df[df.title == title]["index"].values[0]
    except IndexError:
        return None
Then of course in the calling code you'll have to test if you got a None from the function and act accordingly.
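For example, a minimal sketch of that check in the calling code (reusing df, cosine_sim and the helper functions defined above; the prompt text is just illustrative):
MTitle = input("Type in a movie title: ")
movie_index = get_index_from_title(MTitle)

if movie_index is None:
    # the title is not in the dataset, so tell the user instead of crashing
    print("Sorry, '%s' was not found in the dataset." % MTitle)
else:
    similar_movies = list(enumerate(cosine_sim[movie_index]))
    sorted_similar_movies = sorted(similar_movies, key=lambda x: x[1], reverse=True)
    for movie in sorted_similar_movies[:11]:
        print(get_title_from_index(movie[0]))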
Related
I am feeding a long list of inputs into a function that calls an API to retrieve data. My list contains around 40,000 unique inputs. Currently, the function returns output every 1-2 seconds or so. Quick maths tells me that it would take over 10+ hours before my function is done. I therefore want to speed this process up, but I am struggling to find a solution. I am quite a beginner, so threading/pooling is quite difficult for me. I hope someone is able to help me out here.
The function:
import quandl
import datetime
import numpy as np
quandl.ApiConfig.api_key = 'API key here'
def get_data(issue_date, stock_ticker):
    # Prepare var
    stock_ticker = "EOD/" + stock_ticker

    # Volatility
    date_1 = datetime.datetime.strptime(issue_date, "%d/%m/%Y")
    pricing_date = date_1 + datetime.timedelta(days=-40)      # -40 days of issue date
    volatility_date = date_1 + datetime.timedelta(days=-240)  # -240 days of issue date (-40,-240 range)

    # Check if code exists : if not -> return empty array
    try:
        stock = quandl.get(stock_ticker, start_date=volatility_date, end_date=pricing_date)  # get pricing data
    except quandl.errors.quandl_error.NotFoundError:
        return []

    daily_close = stock['Adj_Close'].pct_change()    # returns using adj.close
    stock_vola = np.std(daily_close) * np.sqrt(252)  # annualized volatility

    # Average price
    stock_pricing_date = date_1 + datetime.timedelta(days=-2)    # -2 days of issue date
    stock_pricing_date2 = date_1 + datetime.timedelta(days=-12)  # -12 days of issue date
    stock_price = quandl.get(stock_ticker, start_date=stock_pricing_date2, end_date=stock_pricing_date)
    stock_price_average = np.mean(stock_price['Adj_Close'])  # get average price

    # Amihud's liquidity measure
    liquidity_pricing_date = date_1 + datetime.timedelta(days=-20)
    liquidity_pricing_date2 = date_1 + datetime.timedelta(days=-120)
    stock_data = quandl.get(stock_ticker, start_date=liquidity_pricing_date2, end_date=liquidity_pricing_date)
    p = np.array(stock_data['Adj_Close'])
    returns = np.array(stock_data['Adj_Close'].pct_change())
    dollar_volume = np.array(stock_data['Adj_Volume'] * p)
    illiq = np.divide(returns, dollar_volume)
    print(np.nanmean(illiq))
    illiquidity_measure = np.nanmean(illiq, dtype=float) * (10 ** 6)  # multiply by 10^6 for expositional purposes

    return [stock_vola, stock_price_average, illiquidity_measure]
I then use a separate script to select the csv file that contains the list of rows, each row containing the issue_date and stock_ticker:
import function
import csv
import tkinter as tk
from tkinter import filedialog
# Open File Dialog
root = tk.Tk()
root.withdraw()
file_path = filedialog.askopenfilename()
# Load Spreadsheet data
f = open(file_path)
csv_f = csv.reader(f)
next(csv_f)
result_data = []
# Iterate
for row in csv_f:
    try:
        return_data = function.get_data(row[1], row[0])
        if len(return_data) != 0:
            # print(return_data)
            result_data_loc = [row[1], row[0]]
            result_data_loc.extend(return_data)
            result_data.append(result_data_loc)
    except AttributeError:
        print(row[0])
        print('\n\n')
        print(row[1])
        continue

if result_data is not None:
    with open('resuls.csv', mode='w', newline='') as result_file:
        csv_writer = csv.writer(result_file, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)
        for result in result_data:
            # print(result)
            csv_writer.writerow(result)
else:
    print("No results found!")
It is quite messy, but like I mentioned before, I am definitely a beginner. Speeding this up would greatly help me.
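Since almost all of the time here is spent waiting on the quandl API (I/O-bound work), one option is to run several requests concurrently with a thread pool. This is only a minimal sketch, under the assumption that get_data is safe to call from multiple threads and that the API's rate limits allow it; the worker count is an arbitrary example:
import concurrent.futures
import function

def process_row(row):
    # row = [stock_ticker, issue_date]; returns None when no data came back
    return_data = function.get_data(row[1], row[0])
    if return_data:
        return [row[1], row[0]] + return_data
    return None

# the rows read from the csv file, header already skipped
rows = list(csv_f)

result_data = []
with concurrent.futures.ThreadPoolExecutor(max_workers=10) as executor:
    # executor.map keeps the output in the same order as the input rows
    for result in executor.map(process_row, rows):
        if result is not None:
            result_data.append(result)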
I'm getting an IndexError using multiprocessing to process parts of a pandas DataFrame in parallel. vacancies is a pandas DataFrame containing several vacancies, of which one column is the raw text.
import pickle
import numpy as np
import pandas as pd

def addSkillRelevance(vacancies):
    skills = pickle.load(open("skills.pkl", "rb"))

    vacancies['skill'] = ''
    vacancies['skillcount'] = 0
    vacancies['all_skills_in_vacancy'] = ''
    new_vacancies = pd.DataFrame(columns=vacancies.columns)

    for vacancy_index, vacancy_row in vacancies.iterrows():
        #Create a df for which each row is a found skill (with the other attributes of the vacancy)
        per_vacancy_df = pd.DataFrame(columns=vacancies.columns)
        all_skills_in_vacancy = []
        skillcount = 0

        for skill_index, skill_row in skills.iterrows():
            #Making the search for the skill in the text body a bit smarter
            spaceafter = ' ' + skill_row['txn_skill_name'] + ' '
            newlineafter = ' ' + skill_row['txn_skill_name'] + '\n'
            tabafter = ' ' + skill_row['txn_skill_name'] + '\t'

            #Statement that returns true if we find a variation of the skill in the text body
            if((spaceafter in vacancies.at[vacancy_index,'body']) or (newlineafter in vacancies.at[vacancy_index,'body']) or (tabafter in vacancies.at[vacancy_index,'body'])):
                #Adding the skill to the list of skills found in the vacancy
                all_skills_in_vacancy.append(skill_row['txn_skill_name'])
                #Increasing the skillcount
                skillcount += 1
                #Adding the skill to the row
                vacancies.at[vacancy_index,'skill'] = skill_row['txn_skill_name']
                #Add a row to the vacancy df where 1 row means 1 skill
                per_vacancy_df = per_vacancy_df.append(vacancies.iloc[vacancy_index])

        #Adding the list of all found skills in the vacancy to each (skill) row
        per_vacancy_df['all_skills_in_vacancy'] = str(all_skills_in_vacancy)
        per_vacancy_df['skillcount'] = skillcount

        #Adds the individual vacancy df to a new vacancy df
        new_vacancies = new_vacancies.append(per_vacancy_df)

    return(new_vacancies)

def executeSkillScript(vacancies):
    from multiprocessing import Pool

    vacancies = vacancies.head(100298)
    num_workers = 47
    pool = Pool(num_workers)

    vacancy_splits = np.array_split(vacancies, num_workers)
    results_list = pool.map(addSkillRelevance, vacancy_splits)
    new_vacancies = pd.concat(results_list, axis=0)

    pool.close()
    pool.join()

executeSkillScript(vacancies)
The function addSkillRelevance() takes in a pandas DataFrame and outputs a pandas DataFrame (with more columns). For some reason, after all the multiprocessing finishes, I get an IndexError on results_list = pool.map(addSkillRelevance, vacancy_splits). I'm quite stuck, as I don't know how to handle the error. Does anyone have tips as to why the IndexError is occurring?
The error:
IndexError Traceback (most recent call last)
<ipython-input-11-7cb04a51c051> in <module>()
----> 1 executeSkillScript(vacancies)
<ipython-input-9-5195d46f223f> in executeSkillScript(vacancies)
14
15 vacancy_splits = np.array_split(vacancies, num_workers)
---> 16 results_list = pool.map(addSkillRelevance,vacancy_splits)
17 new_vacancies = pd.concat(results_list, axis=0)
18
~/anaconda3/envs/amazonei_tensorflow_p36/lib/python3.6/multiprocessing/pool.py in map(self, func, iterable, chunksize)
264 in a list that is returned.
265 '''
--> 266 return self._map_async(func, iterable, mapstar, chunksize).get()
267
268 def starmap(self, func, iterable, chunksize=None):
~/anaconda3/envs/amazonei_tensorflow_p36/lib/python3.6/multiprocessing/pool.py in get(self, timeout)
642 return self._value
643 else:
--> 644 raise self._value
645
646 def _set(self, i, obj):
IndexError: single positional indexer is out-of-bounds
As per the suggestion, the error is coming from this line:
per_vacancy_df = per_vacancy_df.append(vacancies.iloc[vacancy_index])
The error is occurring because vacancy_index is an index label from the original DataFrame, while .iloc does purely positional indexing: inside each split produced by np.array_split the valid positions only run up to the length of that split, so the lookup goes out of bounds.
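A minimal sketch of the fix (an illustration based on the code above, not necessarily the author's final change): use the label-based .loc instead of the positional .iloc:
# inside the inner loop of addSkillRelevance():
# .loc does label-based lookup, so it works in every split,
# because vacancy_index is an index label, not a position
per_vacancy_df = per_vacancy_df.append(vacancies.loc[vacancy_index])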
I'm trying to retrieve content from more than 1,000 rows. So I have tried to append each item to a list so that I can use it afterwards.
# coding: utf-8
import os
import json
from azure import *
from azure.storage import *
from azure.storage.table import TableService, Entity
import datetime

def Retrives_datas():
    twenty_hours_before_now = datetime.datetime.now() - datetime.timedelta(days=1)
    now = twenty_hours_before_now.isoformat()
    filter = "Timestamp gt datetime'" + now + "'"
    maker = None
    i=0
    table_service = TableService(account_name='MyAccount', sas_token='MySAS')
    while True:
        tasks = table_service.query_entities('MyTable', filter = filter, timeout=None, num_results=1000, marker=maker)
        for task in tasks:
            i += 1
            print(i,tasks.items[i]['Status'])
        if tasks.next_marker != {}:
            maker = tasks.next_marker
        else:
            break
I get the error below:
999 Success
Traceback (most recent call last):
    print(i,tasks.items[i]['Status'])
IndexError: list index out of range
Note that when I replace
print(i,tasks.items[i]['Status'])
with
print(i)
I get more than 2,770 rows.
The common "list index out of range" error occurs because you try to access list[1000] even though the list only has 1000 items (you set num_results=1000). You can only access up to list[999], because the index starts at 0.
Just move the i += 1 line down:
for task in tasks:
    print(i,tasks.items[i]['Status'])
    i += 1
In summary, tasks.items only holds the page of results that was just returned (up to num_results entries), so the running index has to be reset for each page instead of incrementing across pages.
Solution: added if i == <num_results>: i = 0
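For illustration, a minimal sketch of that reset inside the loop above (assuming the page size stays at num_results=1000, as in the original call):
for task in tasks:
    # tasks.items only holds the current page, so restart the index
    # once it reaches the page size
    if i == 1000:
        i = 0
    print(i, tasks.items[i]['Status'])
    i += 1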
I get an error when trying to call the calculate_similarity2 function, which is in the DocSim.py file, from my notebook.
The error message is: 'DocSim' object has no attribute 'calculate_similarity2'
Here is the content of my DocSim file:
import numpy as np

class DocSim(object):
    def __init__(self, w2v_model , stopwords=[]):
        self.w2v_model = w2v_model
        self.stopwords = stopwords

    def vectorize(self, doc):
        """Identify the vector values for each word in the given document"""
        doc = doc.lower()
        words = [w for w in doc.split(" ") if w not in self.stopwords]
        word_vecs = []
        for word in words:
            try:
                vec = self.w2v_model[word]
                word_vecs.append(vec)
            except KeyError:
                # Ignore, if the word doesn't exist in the vocabulary
                pass

        # Assuming that document vector is the mean of all the word vectors
        # PS: There are other & better ways to do it.
        vector = np.mean(word_vecs, axis=0)
        return vector

    def _cosine_sim(self, vecA, vecB):
        """Find the cosine similarity distance between two vectors."""
        csim = np.dot(vecA, vecB) / (np.linalg.norm(vecA) * np.linalg.norm(vecB))
        if np.isnan(np.sum(csim)):
            return 0
        return csim

    def calculate_similarity(self, source_doc, target_docs=[], threshold=0):
        """Calculates & returns similarity scores between given source document & all
        the target documents."""
        if isinstance(target_docs, str):
            target_docs = [target_docs]

        source_vec = self.vectorize(source_doc)
        results = []
        for doc in target_docs:
            target_vec = self.vectorize(doc)
            sim_score = self._cosine_sim(source_vec, target_vec)
            if sim_score > threshold:
                results.append({
                    'score' : sim_score,
                    'sentence' : doc
                })
            # Sort results by score in desc order
            results.sort(key=lambda k : k['score'] , reverse=True)
        return results
        def calculate_similarity2(self, source_doc=[], target_docs=[], threshold=0):
            """Calculates & returns similarity scores between given source document & all the target documents."""
            if isinstance(source_doc, str):
                target_docs = [source_doc]
            if isinstance(target_docs, str):
                target_docs = [target_docs]

            #source_vec = self.vectorize(source_doc)
            results = []
            for doc in source_doc:
                source_vec = self.vectorize(doc)
                for doc1 in target_docs:
                    target_vec = self.vectorize(doc)
                    sim_score = self._cosine_sim(source_vec, target_vec)
                    if sim_score > threshold:
                        results.append({
                            'score' : sim_score,
                            'source sentence' : doc,
                            'target sentence' : doc1
                        })
                # Sort results by score in desc order
                results.sort(key=lambda k : k['score'] , reverse=True)
            return results
Here is the code in my notebook where I try to call the function. To create the DocSim object:
ds = DocSim(word2vec_model, stopwords=stopwords)
sim_scores = ds.calculate_similarity2(source_doc, target_docs)
The error message is:
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-54-bb0bd1e0e0ad> in <module>()
----> 1 sim_scores = ds.calculate_similarity2(source_doc, target_docs)
AttributeError: 'DocSim' object has no attribute 'calculate_similarity2'
I don't understand how to resolve this problem. I can access all the functions except calculate_similarity2.
Can you help me please?
Thanks.
You have defined the calculate_similarity2 function inside the scope of another method instead of directly in the class body, so it never becomes an attribute of DocSim instances. Try moving it out of there, to class level.
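A minimal sketch of that layout fix (the method bodies are elided here; the only point is the indentation level of calculate_similarity2):
class DocSim(object):
    def __init__(self, w2v_model, stopwords=[]):
        self.w2v_model = w2v_model
        self.stopwords = stopwords

    def calculate_similarity(self, source_doc, target_docs=[], threshold=0):
        ...

    # indented one level under `class` and not nested inside another def,
    # so it becomes a normal method and ds.calculate_similarity2(...) resolves
    def calculate_similarity2(self, source_doc=[], target_docs=[], threshold=0):
        ...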
When I use a large amount of data I get this error: statistics.StatisticsError: no unique mode; found 2 equally common values. But when I use only 100 rows of data it works. I can't understand why it doesn't work; can anyone help me figure out how to solve this error, please?
Data link: https://github.com/YoeriNijs/TweetAnalyzer
The code:
import warnings
warnings.filterwarnings("ignore")
import nltk, random, csv, sys
from nltk.probability import FreqDist, ELEProbDist
from nltk.classify.util import apply_features,accuracy
from nltk.corpus import names
from nltk.tokenize import word_tokenize
import nltk.classify.util
from nltk import NaiveBayesClassifier
from textblob import TextBlob
from nltk.classify.scikitlearn import SklearnClassifier
from sklearn.naive_bayes import MultinomialNB,BernoulliNB
from sklearn.linear_model import LogisticRegression,SGDClassifier
from sklearn.svm import SVC, LinearSVC, NuSVC
from nltk.classify import ClassifierI
from statistics import mode

class VoteClassifier(ClassifierI):
    def __init__(self, *classifiers):
        self._classifiers = classifiers

    def classify(self, features):
        votes = []
        for c in self._classifiers:
            v = c.classify(features)
            votes.append(v)
        return mode(votes)

def get_words_in_tweets(tweets):
    all_words = []
    try:
        for (words, sentiment) in tweets:
            all_words.extend(words)
        return all_words
    except Exception as e:
        print(e)

def get_word_features(wordlist):
    wordlist = FreqDist(wordlist)
    word_features = wordlist.keys()
    #print (word_features)
    return word_features

def selectTweets(row):
    tweetWords = []
    words = row[0].split()
    for i in words:
        i = i.lower()
        i = i.strip('##\'"?,.!')
        tweetWords.append(i)
    row[0] = tweetWords
    if counter <= 120:
        trainTweets.append(row)
        #print(trainTweets)
        #print(('*')*30)
    else:
        testTweets.append(row)
        #print(testTweets)

def extract_features(document):
    document_words = set(document)
    features = {}
    for word in word_features:
        features['contains(%s)' % word] = (word in document_words)
    return features

trainTweets = []
testTweets = []

#csvfile.csv
while True:
    # Ask for filename
    filename = str(input("> Please enter a filename (.csv): "))
    #Check if filename ends with .csv
    if filename.endswith(".csv"):
        try:
            #Open file
            with open(filename, 'r',encoding='utf-8') as csvfile:
                reader = csv.reader(csvfile, delimiter=';', quotechar='|')
                #Print succes message
                print ("> File opened successfully!")
                counter = 0
                for row in reader:
                    selectTweets(row)
                    counter += 1
                print (counter,"> Wait a sec for the results...")
                word_features = get_word_features(get_words_in_tweets(trainTweets))
                training_set = apply_features(extract_features, trainTweets)
                test_training_set=apply_features(extract_features, testTweets)
                classifier = nltk.classify.NaiveBayesClassifier.train(training_set)
                classifier.show_most_informative_features(5)
                print (nltk.classify.util.accuracy(classifier,test_training_set))
                MNB_classifier = SklearnClassifier(MultinomialNB())
                MNB_classifier.train(training_set)
                print("MultinomialNB accuracy percent:",nltk.classify.accuracy(MNB_classifier, test_training_set))
                BNB_classifier = SklearnClassifier(BernoulliNB())
                BNB_classifier.train(training_set)
                print("BernoulliNB accuracy percent:",nltk.classify.accuracy(BNB_classifier, test_training_set))
                LogisticRegression_classifier = SklearnClassifier(LogisticRegression())
                LogisticRegression_classifier.train(training_set)
                print("LogisticRegression_classifier accuracy percent:", (nltk.classify.accuracy(LogisticRegression_classifier, test_training_set))*100)
                SGDClassifier_classifier = SklearnClassifier(SGDClassifier())
                SGDClassifier_classifier.train(training_set)
                print("SGDClassifier_classifier accuracy percent:", (nltk.classify.accuracy(SGDClassifier_classifier, test_training_set))*100)
                SVC_classifier = SklearnClassifier(SVC())
                SVC_classifier.train(training_set)
                print("SVC_classifier accuracy percent:", (nltk.classify.accuracy(SVC_classifier, test_training_set))*100)
                LinearSVC_classifier = SklearnClassifier(LinearSVC())
                LinearSVC_classifier.train(training_set)
                print("LinearSVC_classifier accuracy percent:", (nltk.classify.accuracy(LinearSVC_classifier, test_training_set))*100)
                voted_classifier = VoteClassifier(classifier,
                                                  LinearSVC_classifier,
                                                  SGDClassifier_classifier,
                                                  MNB_classifier,
                                                  BNB_classifier,
                                                  LogisticRegression_classifier)
                print("voted_classifier accuracy percent:", (nltk.classify.accuracy(voted_classifier,test_training_set ))*100)
                while True:
                    tweet = str(input("Please enter the text of the tweet you want to analize: "))
                    print (classifier.classify(extract_features(tweet.split())))
                    while True:
                        print
                        repeat = str(input("> Do you want to check another tweet (y/n)? "))
                        if repeat == "n":
                            print ("Exit program")
                            sys.exit()
                        if repeat != "y":
                            print ("Something went wrong")
                        if repeat == "y":
                            break
        #If file does not exist, display this"""
        except IOError:
            print ("File does not exist.")
    #Else if file does not end with .csv, do this
    else:
        print ("Please open a file that ends with .csv")
It shows this error:
Traceback (most recent call last):
File "C:\Users\Nahid\Desktop\main folder\newcheck.py", line 163, in <module>
print("voted_classifier accuracy percent:", (nltk.classify.accuracy(voted_classifier,test_training_set ))*100)
File "C:\Users\Nahid\AppData\Local\Programs\Python\Python36-32\lib\site-packages\nltk\classify\util.py", line 87, in accuracy
results = classifier.classify_many([fs for (fs, l) in gold])
File "C:\Users\Nahid\AppData\Local\Programs\Python\Python36-32\lib\site-packages\nltk\classify\api.py", line 77, in classify_many
return [self.classify(fs) for fs in featuresets]
File "C:\Users\Nahid\AppData\Local\Programs\Python\Python36-32\lib\site-packages\nltk\classify\api.py", line 77, in <listcomp>
return [self.classify(fs) for fs in featuresets]
File "C:\Users\Nahid\Desktop\main folder\newcheck.py", line 35, in classify
return mode(votes)
File "C:\Users\Nahid\AppData\Local\Programs\Python\Python36-32\lib\statistics.py", line 507, in mode
'no unique mode; found %d equally common values' % len(table)
statistics.StatisticsError: no unique mode; found 2 equally common values
The easiest way to solve this is to upgrade Python to 3.8 or higher.
In Python 3.7 and older, statistics.mode requires a single value that occurs more often than any other in the whole data set. If the data contains two or more equally common values, the mode is considered inconclusive and you get exactly the error shown above.
Since version 3.8, however, the behaviour has changed: when there are two or more modes in the data, statistics.mode returns the first one encountered in the input instead of raising an error.
Example:
result = statistics.mode([1,1,2,2,3,3])
has three possible and equally common candidates: 1, 2 and 3, as each number occurs twice in the data;
in Python 3.7 this raises StatisticsError,
in Python 3.8 this returns 1 as the mode.
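If upgrading Python is not an option, a possible workaround (a sketch, not part of the original code) is to replace statistics.mode in VoteClassifier.classify with collections.Counter, which simply picks one of the tied labels instead of raising:
from collections import Counter
from nltk.classify import ClassifierI

class VoteClassifier(ClassifierI):
    def __init__(self, *classifiers):
        self._classifiers = classifiers

    def classify(self, features):
        votes = [c.classify(features) for c in self._classifiers]
        # most_common(1) never raises on a tie; it returns the first of the
        # equally common labels in counting order
        return Counter(votes).most_common(1)[0][0]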