Improve speed of wikipedia.search with Python - python-3.x

I have a dictionary of famous people's names sorted by their initials. I want to convert these names into their respective Wikipedia title page names. These are the same for the first three given in this example, but Alexander Bell gets correctly converted to Alexander Graham Bell after running this code.
The algorithm works, although it took about an hour to do all the 'AA' names, and I am hoping to run it all the way up to 'ZZ'.
Is there any optimisation I can do on this? For example I saw something about batch requests but am not sure if it applies to my algorithm.
Or is there a more efficient method that I could use to get this same information?
Thanks.
import wikipedia
PeopleDictionary = {'AA':['Amy Adams', 'Aaron Allston'], 'AB':['Alia Bhatt', 'Alexander Bell']}
for key, val in PeopleDictionary.items():
    for val in range(len(PeopleDictionary[key])):
        Name_URL_All = wikipedia.search(PeopleDictionary[key][val])
        if Name_URL_All:
            Name_URL = Name_URL_All[0]
            PeopleDictionary[key][val] = Name_URL
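Each wikipedia.search call is an independent network request, so most of the hour is probably spent waiting on one response at a time. A minimal sketch of one option, assuming the service tolerates a handful of parallel requests: run the lookups through a thread pool. The helper resolve_titles, the max_workers value, and the fake_search stand-in (used so the example runs offline) are illustrative names, not part of the wikipedia package; in real use you would pass wikipedia.search instead of fake_search.

```python
from concurrent.futures import ThreadPoolExecutor


def resolve_titles(people, search, max_workers=8):
    """Return a copy of `people` with each name replaced by its first search hit.

    `search` is any callable that takes a name and returns a list of titles,
    for example wikipedia.search. Names with no hits are kept unchanged.
    """
    # Flatten all names so they can be looked up concurrently.
    names = [name for group in people.values() for name in group]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        hits = list(pool.map(search, names))  # results come back in input order
    best = {name: (found[0] if found else name) for name, found in zip(names, hits)}
    return {key: [best[name] for name in group] for key, group in people.items()}


def fake_search(name):
    # Hypothetical offline stand-in for wikipedia.search (assumption for the demo).
    canned = {"Alexander Bell": ["Alexander Graham Bell", "Bell"]}
    return canned.get(name, [name])


people = {"AA": ["Amy Adams", "Aaron Allston"], "AB": ["Alia Bhatt", "Alexander Bell"]}
resolved = resolve_titles(people, fake_search)
print(resolved)
```

Threads suit this case because the work is I/O-bound; keeping max_workers small is a guess at what stays polite to the server, not a documented limit.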

Related

Python: How to execute a variable in a string in a for loop in a function?

I have an Excel file (the output of a survey) and I am trying to write code that assigns marks based on the survey entries of different students.
Suppose the data at hand is given as follows:
import warnings, pickle
import numpy as np, pandas as pd
warnings.filterwarnings('ignore')
A='pickle.loads(b\'\\x80\\x03cpandas.core.frame\\nDataFrame\\nq\\x00)\\x81q\\x01}q\\x02(X\\x05\\x00\\x00\\x00_dataq\\x03cpandas.core.internals.managers\\nBlockManager\\nq\\x04)\\x81q\\x05(]q\\x06(cpandas.core.indexes.base\\n_new_Index\\nq\\x07cpandas.core.indexes.base\\nIndex\\nq\\x08}q\\t(X\\x04\\x00\\x00\\x00dataq\\ncnumpy.core.multiarray\\n_reconstruct\\nq\\x0bcnumpy\\nndarray\\nq\\x0cK\\x00\\x85q\\rC\\x01bq\\x0e\\x87q\\x0fRq\\x10(K\\x01K\\x04\\x85q\\x11cnumpy\\ndtype\\nq\\x12X\\x02\\x00\\x00\\x00O8q\\x13K\\x00K\\x01\\x87q\\x14Rq\\x15(K\\x03X\\x01\\x00\\x00\\x00|q\\x16NNNJ\\xff\\xff\\xff\\xffJ\\xff\\xff\\xff\\xffK?tq\\x17b\\x89]q\\x18(X\\x04\\x00\\x00\\x00Nameq\\x19X \\x00\\x00\\x00Which companies have you chosen?q\\x1aX\\n\\x00\\x00\\x00Question 1q\\x1bX\\n\\x00\\x00\\x00Question 2q\\x1cetq\\x1dbX\\x04\\x00\\x00\\x00nameq\\x1eNu\\x86q\\x1fRq h\\x07cpandas.core.indexes.range\\nRangeIndex\\nq!}q"(h\\x1eNX\\x05\\x00\\x00\\x00startq#K\\x00X\\x04\\x00\\x00\\x00stopq$K\\x04X\\x04\\x00\\x00\\x00stepq%K\\x01u\\x86q&Rq\\\'e]q(h\\x0bh\\x0cK\\x00\\x85q)h\\x0e\\x87q*Rq+(K\\x01K\\x04K\\x04\\x86q,h\\x15\\x89]q-(X\\x03\\x00\\x00\\x00Ayaq.X\\x04\\x00\\x00\\x00Ramiq/X\\x07\\x00\\x00\\x00Geniousq0X\\x05\\x00\\x00\\x00Samirq1G\\x7f\\xf8\\x00\\x00\\x00\\x00\\x00\\x00X?\\x00\\x00\\x00 Mobil, ConsolidatedEdisonq2X?\\x00\\x00\\x00 DataGeneral, GeneralPublicUtilitiesq3h3X\\x11\\x00\\x00\\x00Uploaded the dataq4X\\xfe\\x00\\x00\\x00Uploaded the data,Specified the reason behind the selected stocks,You were successful in cleaning your data-sets,You have had a justification why you selected this particular time period of your chosen stock. This should have been answered in Question 2.q5X{\\x01\\x00\\x00Uploaded the data,Specified the reason behind the selected stocks,You selected both stocks from different industries,You were successful in cleaning your data-sets,You have had a justification why you selected this particular time period of your chosen stock. 
This should have been answered in Question 2.,You did extra work for this question and you deserve a round of applause.q6X\\xec\\x00\\x00\\x00Specified the reason behind the selected stocks,You were successful in cleaning your data-sets,You have had a justification why you selected this particular time period of your chosen stock. This should have been answered in Question 2.q7G\\x7f\\xf8\\x00\\x00\\x00\\x00\\x00\\x00X(\\x01\\x00\\x00You plot a graph over the time axis that demonstrate how your selected returns behaved in exciting periods,You produced a scatter plot that shows that you may change your x-axis from time to market return.,You also were able to compare your selected stock to other stocks within the same industryq8X\\xde\\x01\\x00\\x00You plot a graph over the time axis that demonstrate how your selected returns behaved in exciting periods,You produced a scatter plot that shows that you may change your x-axis from time to market return.,You have shown some references to what has been going in your chosen period and you gave a good justification.,You also were able to compare your selected stock to other stocks within the same industry,You did extra work for this question and you deserve a very high mark.q9h8etq:ba]q;h\\x07h\\x08}q<(h\\nh\\x0bh\\x0cK\\x00\\x85q=h\\x0e\\x87q>Rq?(K\\x01K\\x04\\x85q#h\\x15\\x89]qA(h\\x19h\\x1ah\\x1bh\\x1cetqBbh\\x1eNu\\x86qCRqDa}qEX\\x06\\x00\\x00\\x000.14.1qF}qG(X\\x04\\x00\\x00\\x00axesqHh\\x06X\\x06\\x00\\x00\\x00blocksqI]qJ}qK(X\\x06\\x00\\x00\\x00valuesqLh+X\\x08\\x00\\x00\\x00mgr_locsqMcbuiltins\\nslice\\nqNK\\x00K\\x04K\\x01\\x87qORqPuaustqQbX\\x04\\x00\\x00\\x00_typqRX\\t\\x00\\x00\\x00dataframeqSX\\t\\x00\\x00\\x00_metadataqT]qUub.\')'
df1=eval(A)
I wrote these functions to help me extract each student's answers and score them:
# This function will give you the student's answers {Answer_Student_1_Q_1, Answer_Student_1_Q_2}
def Get_student_answers_per_question(i1, total_number_of_questions):
    g=df1.index[df1['Name']!='Genious'][i1]
    Gen1=df1.ix[g,:]
    print(Gen1)
    for i in range(total_number_of_questions):
        if isinstance(Gen1[i+2], float):
            foo='Answer_Student_'+str(i1+1)+'_Q_'+str(i+1)+ '= Gen1['+ str(i+2) +']'
        else:
            foo='Answer_Student_'+str(i1+1)+'_Q_'+str(i+1)+ '= Gen1['+ str(i+2) +'].split(",")'
        exec(foo, locals(), globals())
    pass
# Get student score for each answer in question
def Get_score_per_question(student, question, dictionary_mark, total_number_of_questions):
    # Get_all_answers will get all the answers for student (student)
    Get_all_answers=Get_student_answers_per_question(student-1,total_number_of_questions)
    foo='Answer_Student_'+str(student)+'_Q_'+str(question)
    print(foo)
    v=[dictionary_mark.get(i) for i in exec(foo)]
    return v
Now in the last function, Get_score_per_question, I was trying to code
v=[dictionary_mark.get(i) for i in exec(foo)]
where v is the list of scores found in the dictionary for each entry of the student's answer.
So, depending on the entries of the variable named by the string foo, the result should be a list of numbers of the same length.
The example that I am trying to run is this
student=1
question=1
dictionary_mark={'Uploaded the data': 1,
                 'Specified the reason behind the selected stocks': 1,
                 'You selected both stocks from different industries': 1,
                 'You were successful in cleaning your data-sets': 2,
                 'You have had a justification why you selected this particular time period of your chosen stock. This should have been answered in Question 2.': 1,
                 'You did extra work for this question and you deserve a round of applause.': 1}
total_number_of_questions=2
Get_score_per_question(student, question, dictionary_mark, total_number_of_questions)
Where, as you can foresee, I will get the following error
TypeError: 'NoneType' object is not iterable
Can somebody help me with this? Also, is there a tutorial or page someone could refer me to for writing better code for such surveys in Python, especially when handling splits and so on?
From the documentation of the exec() function: https://docs.python.org/3/library/functions.html#exec
Be aware that the return and yield statements may not be used outside of function definitions even within the context of code passed to the exec() function. The return value is None.
You have used this piece of code: [dictionary_mark.get(i) for i in exec(foo)]
Since exec(foo) returns None, and None does not implement __iter__, you get a TypeError when the comprehension tries to loop over it.
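To make the difference concrete, here is a small sketch: exec() runs statements and returns None, while eval() evaluates a single expression and returns its value. It also shows, as one possible alternative, keeping the answers in a dictionary keyed by (student, question) so that no variable names have to be built from strings. The answers dictionary and get_score_per_question are hypothetical illustrations, not the asker's data or code.

```python
# exec() always returns None, even when the code is just an expression.
namespace = {"answer": "Uploaded the data,Specified the reason".split(",")}
exec_result = exec("answer", namespace)
eval_result = eval("answer", namespace)
print(exec_result)  # None
print(eval_result)  # the list stored under "answer"

# Alternative without exec/eval: store answers under (student, question) keys.
# The sample answer below is made up for the demonstration.
answers = {
    (1, 1): ["Uploaded the data", "You were successful in cleaning your data-sets"],
}
dictionary_mark = {
    "Uploaded the data": 1,
    "You were successful in cleaning your data-sets": 2,
}


def get_score_per_question(student, question):
    """Look up the mark for each entry of one student's answer to one question."""
    return [dictionary_mark.get(item) for item in answers[(student, question)]]


scores = get_score_per_question(1, 1)
print(scores)
```

Replacing exec(foo) with eval(foo) would be the smallest change to the original function, but a plain dictionary is usually easier to debug than dynamically named globals.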

Pandas Column Names Not Lining Up When .dat File Read

I'm going through Wes McKinney's Python for Data Analysis 2nd Edition and in Chapter 2 he has several examples based on merging three .dat files about movie reviews.
I can get two of the three data files to work (users and reviews), but the third one (movie titles) I can not get to work and can't figure out what to do.
Here's the code:
import pandas as pd
mnames = ['movie_id', 'title', 'genres']
movies = pd.read_table('movies.dat', sep = '::', header = None, engine = 'python', names = mnames)
print(movies[:5])
And here is what the output/problem looks like. It seems the file is not being split on the separator correctly. I've tried recreating the file and comparing it to the other two files, which work, but they look exactly the same.
Here's some sample data taken from here:
1::Toy Story (1995)::Animation|Children's|Comedy
2::Jumanji (1995)::Adventure|Children's|Fantasy
3::Grumpier Old Men (1995)::Comedy|Romance
4::Waiting to Exhale (1995)::Comedy|Drama
5::Father of the Bride Part II (1995)::Comedy
6::Heat (1995)::Action|Crime|Thriller
7::Sabrina (1995)::Comedy|Romance
8::Tom and Huck (1995)::Adventure|Children's
9::Sudden Death (1995)::Action
10::GoldenEye (1995)::Action|Adventure|Thriller
11::American President, The (1995)::Comedy|Drama|Romance
12::Dracula: Dead and Loving It (1995)::Comedy|Horror
13::Balto (1995)::Animation|Children's
14::Nixon (1995)::Drama
I'd like to be able to read this file properly so I can join it to the other two example files and keep learning Pandas :)
Try adding encoding='UTF-16' to pd.read_table().
(Sorry, not enough reputation to add a comment.)
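One way to check whether the encoding suggestion applies is to look at the first two bytes of the file for a UTF-16 byte order mark. The sketch below uses only the standard library; guess_encoding and read_movies are hypothetical helper names, and the UTF-8 fallback is an assumption that may not match the real file. The detected value could then be passed as the encoding argument of pd.read_table.

```python
import codecs
import os
import tempfile


def guess_encoding(path, fallback="utf-8"):
    """Return 'utf-16' if the file starts with a UTF-16 BOM, else `fallback`."""
    with open(path, "rb") as handle:
        head = handle.read(2)
    if head in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE):
        return "utf-16"
    return fallback


def read_movies(path):
    """Parse 'movie_id::title::genres' lines into tuples."""
    rows = []
    with open(path, encoding=guess_encoding(path)) as handle:
        for line in handle:
            line = line.rstrip("\n")
            if line:
                movie_id, title, genres = line.split("::")
                rows.append((int(movie_id), title, genres))
    return rows


# Demo: write two of the sample lines as UTF-16, then detect and parse them.
sample = (
    "1::Toy Story (1995)::Animation|Children's|Comedy\n"
    "2::Jumanji (1995)::Adventure|Children's|Fantasy\n"
)
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "movies.dat")
    with open(path, "w", encoding="utf-16") as handle:
        handle.write(sample)
    detected = guess_encoding(path)
    movies = read_movies(path)

print(detected)
print(movies)
```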

Python 3: Saving API Results into CSV

I'm writing a script which requires a daily updated CSV source file which lists many movie details and have decided to use Python3 to create and update it even though I don't know too much about it.
I believe I've got the code down to pull the information I need via TheMovieDB.org's API, but currently I can only get it to echo the results and not save them in a CSV. Below are a couple of questions I have, the code that I currently have, and an example of its current output.
Questions:
1. What do I need to add to get the resulting data into a CSV? I've tried many things but so far haven't gotten anything to work.
2. What would I need to add so that rerunning the script would completely overwrite the CSV produced from the last run? (not append or error out)
3. Optional: Unless tedious or a pain, it would be nice to have a column for each of the values provided per title within the CSV.
Thanks!!
Current Code
import http.client
import requests
import csv
conn = http.client.HTTPSConnection("api.themoviedb.org")
payload = "{}"
conn.request("GET", "/3/discover/movie?page=20&include_video=false&include_adult=false&sort_by=primary_release_date.desc&language=en-US&api_key=XXXXXXXXXXXXXXXXXXXXXXXXXXX", payload)
res = conn.getresponse()
data = res.read()
print(data.decode("utf-8"))
Result That's Echoed from the above Current Code
{"page":20,"total_results":360846,"total_pages":18043,"results":[{"vote_count":0,"id":521662,"video":false,"vote_average":0,"title":"森のかたみ","popularity":1.098018,"poster_path":"/qmj1gJ33lF7BhEOWAvK0mt6hRGH.jpg","original_language":"ja","original_title":"森のかたみ","genre_ids":[],"backdrop_path":null,"adult":false,"overview":"","release_date":"2019-01-01"},{"vote_count":0,"id":518636,"video":false,"vote_average":0,"title":"Stadtkomödie:
Geschenkt","popularity":1.189812,"poster_path":null,"original_language":"de","original_title":"Stadtkomödie:
Geschenkt","genre_ids":[35],"backdrop_path":null,"adult":false,"overview":"","release_date":"2019-01-01"},{"vote_count":0,"id":520720,"video":false,"vote_average":0,"title":"Kim
Possible","popularity":1.188148,"poster_path":"/3QGHTLgNKRphu3bLvGpoTZ1Ce9U.jpg","original_language":"en","original_title":"Kim
Possible","genre_ids":[10751,28,12],"backdrop_path":null,"adult":false,"overview":"Live-action
film adaptation of the Disney Channel original series Kim
Possible.","release_date":"2019-01-01"},{"vote_count":0,"id":521660,"video":false,"vote_average":0,"title":"Speak
Low","popularity":1.098125,"poster_path":"/qYQQlizCTfD5km7GIrTWrBb4E9b.jpg","original_language":"ja","original_title":"小さな声で囁いて","genre_ids":[],"backdrop_path":null,"adult":false,"overview":"","release_date":"2019-01-01"},{"vote_count":0,"id":497834,"video":false,"vote_average":0,"title":"Saturday Fiction","popularity":1.148142,"poster_path":null,"original_language":"zh","original_title":"兰心大剧院","genre_ids":[],"backdrop_path":null,"adult":false,"overview":"An
actress working undercover for the Allies in 1941 Shanghai discovers
the Japanese plan to attack Pearl
Harbor.","release_date":"2019-01-01"},{"vote_count":0,"id":523461,"video":false,"vote_average":0,"title":"Wie
gut ist deine
Beziehung?","popularity":1.188171,"poster_path":null,"original_language":"de","original_title":"Wie
gut ist deine
Beziehung?","genre_ids":[35],"backdrop_path":null,"adult":false,"overview":"","release_date":"2019-01-01"},{"vote_count":0,"id":507118,"video":false,"vote_average":0,"title":"Schwartz &
Schwartz","popularity":1.345715,"poster_path":null,"original_language":"de","original_title":"Schwartz
&
Schwartz","genre_ids":[80],"backdrop_path":null,"adult":false,"overview":"","release_date":"2019-01-01"},{"vote_count":0,"id":505916,"video":false,"vote_average":0,"title":"Kuru","popularity":1.107158,"poster_path":null,"original_language":"ja","original_title":"来る","genre_ids":[],"backdrop_path":null,"adult":false,"overview":"After
the inexplicable message, at his workplace, of a mysterious death, a
man is introduced to a freelance writer and his
girlfriend.","release_date":"2019-01-01"},{"vote_count":0,"id":521028,"video":false,"vote_average":0,"title":"Tsokos:
Zersetzt","popularity":1.115739,"poster_path":null,"original_language":"de","original_title":"Tsokos:
Zersetzt","genre_ids":[53],"backdrop_path":null,"adult":false,"overview":"","release_date":"2019-01-01"},{"vote_count":0,"id":516910,"video":false,"vote_average":0,"title":"Rufmord","popularity":1.658291,"poster_path":null,"original_language":"de","original_title":"Rufmord","genre_ids":[18],"backdrop_path":null,"adult":false,"overview":"","release_date":"2019-01-01"},{"vote_count":0,"id":514224,"video":false,"vote_average":0,"title":"Shadows","popularity":1.289124,"poster_path":null,"original_language":"en","original_title":"Shadows","genre_ids":[16],"backdrop_path":null,"adult":false,"overview":"Plot
kept under
wraps.","release_date":"2019-01-01"},{"vote_count":0,"id":483202,"video":false,"vote_average":0,"title":"Eli","popularity":1.118757,"poster_path":null,"original_language":"en","original_title":"Eli","genre_ids":[27],"backdrop_path":null,"adult":false,"overview":"A
boy receiving treatment for his auto-immune disorder discovers that
the house he's living isn't as safe as he
thought.","release_date":"2019-01-01"},{"vote_count":0,"id":491287,"video":false,"vote_average":0,"title":"Untitled Lani Pixels
Project","popularity":1.951231,"poster_path":null,"original_language":"en","original_title":"Untitled
Lani Pixels
Project","genre_ids":[10751,16,12,35],"backdrop_path":null,"adult":false,"overview":"Evil
forces have invaded an isolated island and have targeted Patrick and
Susan's grandfather, Mr. Campbell. Guided by Jack, a charming Irish
rogue, the siblings end up on a dangerous journey filled with magic
and
mystery.","release_date":"2019-01-01"},{"vote_count":2,"id":49046,"video":false,"vote_average":0,"title":"All
Quiet on the Western
Front","popularity":6.197559,"poster_path":"/jZWVtbxyztDTSM0LXDcE6vdVTVC.jpg","original_language":"en","original_title":"All
Quiet on the Western
Front","genre_ids":[28,12,18,10752],"backdrop_path":null,"adult":false,"overview":"A
young German soldier's terrifying experiences and distress on the
western front during World War
I.","release_date":"2018-12-31"},{"vote_count":1,"id":299782,"video":false,"vote_average":0,"title":"The
Other Side of the
Wind","popularity":4.561363,"poster_path":"/vnfNbuyPqo5zJavqlgI3J50xJSi.jpg","original_language":"en","original_title":"The
Other Side of the
Wind","genre_ids":[35,18],"backdrop_path":null,"adult":false,"overview":"Orson
Welles' unfinished masterpiece, restored and assembled based on
Welles' own notes. During the last 15 years of his life, Welles, who
died in 1985, worked obsessively on the film, which chronicles a
temperamental film director—much like him—who is battling with the
Hollywood establishment to finish an iconoclastic
work.","release_date":"2018-12-31"},{"vote_count":0,"id":289600,"video":false,"vote_average":0,"title":"The
Sandman","popularity":3.329464,"poster_path":"/eju4vLNx9sSvscowmnKNLi3sFVe.jpg","original_language":"en","original_title":"The
Sandman","genre_ids":[27],"backdrop_path":"/zo67d5klQiFR3PCyvER39IMwZ73.jpg","adult":false,"overview":"THE
SANDMAN tells the story of Nathan, a young student in the city who
struggles to forget his childhood trauma at the hands of the serial
killer dubbed \"The Sandman.\" Nathan killed The Sandman years ago, on
Christmas Eve, after he witnessed the murder of his mother... until he
sees the beautiful woman who lives in the apartment across the way
dying at the hands of that same masked killer. This brutal murder
plunges Nathan into an odyssey into the night country of his past, his
dreams... and the buried secrets of The
Sandman.","release_date":"2018-12-31"},{"vote_count":0,"id":378177,"video":false,"vote_average":0,"title":"Luxembourg","popularity":1.179703,"poster_path":null,"original_language":"en","original_title":"Luxembourg","genre_ids":[],"backdrop_path":null,"adult":false,"overview":"The
story of a group of people living in a permanent nuclear winter in the
ruins of the old civilisation destroyed by an atomic
war.","release_date":"2018-12-31"},{"vote_count":0,"id":347392,"video":false,"vote_average":0,"title":"Slice","popularity":3.248065,"poster_path":"/ySWPZihd5ynCc1aNLQUXmiw5H2V.jpg","original_language":"en","original_title":"Slice","genre_ids":[35],"backdrop_path":"/rtL9nzXtSvo1MW05kho9oeimCdb.jpg","adult":false,"overview":"When
a pizza delivery driver is murdered on the job, the city searches for
someone to blame: ghosts? drug dealers? a disgraced
werewolf?","release_date":"2018-12-31"},{"vote_count":0,"id":438674,"video":false,"vote_average":0,"title":"Dragged
Across
Concrete","popularity":3.659627,"poster_path":"/p4tpV4nGeocuOKhp0enuiQNDvhi.jpg","original_language":"en","original_title":"Dragged
Across
Concrete","genre_ids":[18,80,53,9648],"backdrop_path":null,"adult":false,"overview":"Two
policemen, one an old-timer (Gibson), the other his volatile younger
partner (Vaughn), find themselves suspended when a video of their
strong-arm tactics becomes the media's cause du jour. Low on cash and
with no other options, these two embittered soldiers descend into the
criminal underworld to gain their just due, but instead find far more
than they wanted awaiting them in the
shadows.","release_date":"2018-12-31"},{"vote_count":0,"id":437518,"video":false,"vote_average":0,"title":"Friend
of the
World","popularity":4.189267,"poster_path":"/hf3LucIg7t7DUvgGJ9DjQyHcI4J.jpg","original_language":"en","original_title":"Friend
of the
World","genre_ids":[35,18,27,878,53,10752],"backdrop_path":null,"adult":false,"overview":"After
a catastrophic war, an eccentric general guides a filmmaker through a
ravaged bunker.","release_date":"2018-12-31"}]}
import json
import http.client
import requests
import csv
conn = http.client.HTTPSConnection("api.themoviedb.org")
payload = "{}"
conn.request("GET", "/3/discover/movie?page=20&include_video=false&include_adult=false&sort_by=primary_release_date.desc&language=en-US&api_key=XXXXXXXXXXXXXXXXXXXXXXXXXXX", payload)
res = conn.getresponse()
data = res.read()
json_data = json.loads(data)
results=json_data["results"]
for item in results:
    # each item is a dict, so index it with square brackets
    print(item['vote_count'])
    #write code to get necessary objects to write in csv
This is one way you can do it. Comment if you have any questions.
That looks like a JSON object, so you can parse it into a Python dictionary using:
import json
mydict = json.loads(data)
The values you want are probably in mydict["results"], which is a list of dictionaries of key:value pairs. Depending on how you want these, you could use a CSV library or just iterate through them and print the contents with a tab between them.
for item in mydict["results"]:
    for k in item:
        print("{}\t{}".format(k,item.get(k)))
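For the CSV part of the question, a minimal sketch using the standard csv module is below. It gives one column per key, and opening the file in "w" mode means every run replaces the previous file. The function name write_results_csv, the "|" separator for list values such as genre_ids, and the abridged two-movie sample are assumptions made for the illustration; in the real script the raw JSON would be the decoded API response.

```python
import csv
import json
import os
import tempfile


def write_results_csv(raw_json, csv_path):
    """Write the "results" list of a JSON response to csv_path, one column per key."""
    results = json.loads(raw_json)["results"]
    # Collect column names in first-seen order across all items.
    fieldnames = []
    for item in results:
        for key in item:
            if key not in fieldnames:
                fieldnames.append(key)
    # Mode "w" truncates an existing file, so each run overwrites the last CSV.
    with open(csv_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames)
        writer.writeheader()
        for item in results:
            # Flatten list values (e.g. genre_ids) into a single cell.
            row = {
                key: "|".join(map(str, value)) if isinstance(value, list) else value
                for key, value in item.items()
            }
            writer.writerow(row)
    return len(results)


# Abridged sample in the shape of the echoed response (fields trimmed for brevity).
sample = json.dumps({"page": 20, "results": [
    {"id": 516910, "title": "Rufmord", "genre_ids": [18], "release_date": "2019-01-01"},
    {"id": 520720, "title": "Kim Possible", "genre_ids": [10751, 28, 12], "release_date": "2019-01-01"},
]})

with tempfile.TemporaryDirectory() as tmp:
    csv_path = os.path.join(tmp, "movies.csv")
    write_results_csv(sample, csv_path)
    count = write_results_csv(sample, csv_path)  # second run replaces the first file
    with open(csv_path, newline="", encoding="utf-8") as handle:
        rows = list(csv.DictReader(handle))

print(count, rows)
```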

Looking for an efficient way to remove all occurrences of a pattern in a string

I have many text files (10,000,000) and I want to remove all tokens that start with /m/ from them. I wrote the following code, but I am looking for something faster, because it tokenizes the string and then joins the tokens again.
import nltk
def removeMId(str):
    words=[]
    tokens = nltk.word_tokenize(str)
    i=0
    for tkn in tokens:
        if not tkn.startswith('/m/'):
            words.append(tkn)
        i+=1
    return ' '.join(words)
All tokens to be removed follow this pattern: r''' /\m+(?:[-'/]\w+)*'''
This is an example of input:
the manuscript handling fee (checks must be in /m/09c7w0 dollars, drawn on a /m/09c7w0 bank, MICR encoded at the bottom of the check, and payable to ASPET). /m/01kqjn /m/021b7r and /m/01w6dw credit cards are also accepted.
Alright, I'll bite - the straightforward approach is to do it with regex:
import re
TOKEN_REMOVER = re.compile(r"/m+[-'/]\w+\s?")
def removeMId(source):
    return TOKEN_REMOVER.sub("", source)
It ought to be significantly faster than your example. That being said, Python is not designed to break speed/efficiency records so if all you need is one-time file modification you might want to look into dedicated compiled tools for the exact same task.
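With millions of files, it may help to stream each file line by line rather than loading and tokenizing it. The sketch below is one way to do that, under stated assumptions: MID_TOKEN is a simplified variant of the pattern above that assumes IDs look like /m/09c7w0, and it matches a literal trailing space instead of \s so that a token at the end of a line does not swallow the newline. The names clean_file and MID_TOKEN are illustrative.

```python
import os
import re
import tempfile

# Simplified variant (assumption): "/m/" followed by word characters,
# plus at most one trailing space. A literal space is used instead of \s
# so line breaks are never removed.
MID_TOKEN = re.compile(r"/m/\w+ ?")


def clean_file(src_path, dst_path):
    """Copy src_path to dst_path with every /m/... token removed."""
    with open(src_path, encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            dst.write(MID_TOKEN.sub("", line))


# Demo on a small temporary file.
with tempfile.TemporaryDirectory() as tmp:
    src_path = os.path.join(tmp, "input.txt")
    dst_path = os.path.join(tmp, "output.txt")
    with open(src_path, "w", encoding="utf-8") as handle:
        handle.write("checks must be in /m/09c7w0 dollars\n"
                     "/m/01kqjn and /m/021b7r\n"
                     "plain line\n")
    clean_file(src_path, dst_path)
    with open(dst_path, encoding="utf-8") as handle:
        cleaned = handle.read()

print(cleaned)
```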

How do I get a list of the most common words in various languages?

Stack Overflow implemented its "Related Questions" feature by taking the title of the current question being asked and removing from it the 10,000 most common English words according to Google. The remaining words are then submitted as a fulltext search to find related questions.
How do I get such a list of the most common English words? Or most common words in other languages? Is this something I can just get off the Google website?
A word frequency list is what you want. You can also make your own, or customize one for use within a particular domain, and it is a nice way to become familiar with some good libraries. Start with some text such as discussed in this question, then try out some variants of this back-of-the-envelope script:
from nltk.stem.porter import PorterStemmer
import os
import string
from collections import defaultdict
ps = PorterStemmer()
word_count = defaultdict(int)
source_directory = '/some/dir/full/of/text'
for root, dirs, files in os.walk(source_directory):
    for item in files:
        current_text = os.path.join(root, item)
        words = open(current_text, 'r').read().split()
        for word in words:
            # strip punctuation, lowercase, then reduce the word to its stem
            entry = ps.stem(word.strip(string.punctuation).lower())
            word_count[entry] += 1
results = [[word_count[i], i] for i in word_count]
print(sorted(results))
This gives the following on a couple of books downloaded, re the most common words:
[2955, 'that'], [4201, 'in'], [4658, 'to'], [4689, 'a'], [6441, 'and'], [6705, 'of'], [14508, 'the']]
See what happens when you filter the most common x, y, or z words out of your queries, or leave them out of your text search index entirely. You might also get some interesting results if you include real-world data: for example, "community" and "wiki" are not likely to be common words on a generic list, but on SO that obviously wouldn't be the case, and you might want to exclude them.
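As a rough illustration of the filtering step, the sketch below builds a small frequency list with collections.Counter (no stemming) and then strips the most common words from a title, in the spirit of the "Related Questions" approach described in the question. The function names top_words and strip_common and the toy corpus are made up for the example.

```python
import string
from collections import Counter


def top_words(text, n):
    """Return the n most frequent words in text, lowercased, punctuation stripped."""
    words = (w.strip(string.punctuation).lower() for w in text.split())
    counts = Counter(w for w in words if w)
    return [word for word, _ in counts.most_common(n)]


def strip_common(title, common):
    """Return the words of title that are not in the common-word list."""
    blocked = set(common)
    words = (w.strip(string.punctuation).lower() for w in title.split())
    return [w for w in words if w and w not in blocked]


# Toy corpus; a real list would be built from a much larger body of text.
corpus = "the cat and the dog and the bird"
common = top_words(corpus, 2)
keywords = strip_common("The cat and the hat", common)
print(common, keywords)
```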

Resources