Python 3: Saving API Results into CSV - python-3.x

I'm writing a script which requires a daily-updated CSV source file listing many movie details, and I've decided to use Python 3 to create and update it even though I don't know too much about it.
I believe I've got the code down to pull the information I need via TheMovieDB.org's API, but currently I can only get it to echo the results, not save them to a CSV. Below are a couple of questions I have, the code I currently have, and an example of its current output.
Questions:
1. What do I need to add to get the resulting data into a CSV? I've tried many things but so far haven't gotten anything to work.
2. What would I need to add so that rerunning the script completely overwrites the CSV produced by the last run? (not append or error out)
3. Optional: Unless it's tedious or a pain, it would be nice to have a column for each of the values provided per title within the CSV.
Thanks!!
Current Code
import http.client
import requests
import csv
conn = http.client.HTTPSConnection("api.themoviedb.org")
payload = "{}"
conn.request("GET", "/3/discover/movie?page=20&include_video=false&include_adult=false&sort_by=primary_release_date.desc&language=en-US&api_key=XXXXXXXXXXXXXXXXXXXXXXXXXXX", payload)
res = conn.getresponse()
data = res.read()
print(data.decode("utf-8"))
Result Echoed from the Above Current Code
{"page":20,"total_results":360846,"total_pages":18043,"results":[
{"vote_count":0,"id":521662,"video":false,"vote_average":0,"title":"森のかたみ","popularity":1.098018,"poster_path":"/qmj1gJ33lF7BhEOWAvK0mt6hRGH.jpg","original_language":"ja","original_title":"森のかたみ","genre_ids":[],"backdrop_path":null,"adult":false,"overview":"","release_date":"2019-01-01"},
{"vote_count":0,"id":518636,"video":false,"vote_average":0,"title":"Stadtkomödie: Geschenkt","popularity":1.189812,"poster_path":null,"original_language":"de","original_title":"Stadtkomödie: Geschenkt","genre_ids":[35],"backdrop_path":null,"adult":false,"overview":"","release_date":"2019-01-01"},
{"vote_count":0,"id":520720,"video":false,"vote_average":0,"title":"Kim Possible","popularity":1.188148,"poster_path":"/3QGHTLgNKRphu3bLvGpoTZ1Ce9U.jpg","original_language":"en","original_title":"Kim Possible","genre_ids":[10751,28,12],"backdrop_path":null,"adult":false,"overview":"Live-action film adaptation of the Disney Channel original series Kim Possible.","release_date":"2019-01-01"},
{"vote_count":0,"id":521660,"video":false,"vote_average":0,"title":"Speak Low","popularity":1.098125,"poster_path":"/qYQQlizCTfD5km7GIrTWrBb4E9b.jpg","original_language":"ja","original_title":"小さな声で囁いて","genre_ids":[],"backdrop_path":null,"adult":false,"overview":"","release_date":"2019-01-01"},
{"vote_count":0,"id":497834,"video":false,"vote_average":0,"title":"Saturday Fiction","popularity":1.148142,"poster_path":null,"original_language":"zh","original_title":"兰心大剧院","genre_ids":[],"backdrop_path":null,"adult":false,"overview":"An actress working undercover for the Allies in 1941 Shanghai discovers the Japanese plan to attack Pearl Harbor.","release_date":"2019-01-01"},
{"vote_count":0,"id":523461,"video":false,"vote_average":0,"title":"Wie gut ist deine Beziehung?","popularity":1.188171,"poster_path":null,"original_language":"de","original_title":"Wie gut ist deine Beziehung?","genre_ids":[35],"backdrop_path":null,"adult":false,"overview":"","release_date":"2019-01-01"},
{"vote_count":0,"id":507118,"video":false,"vote_average":0,"title":"Schwartz & Schwartz","popularity":1.345715,"poster_path":null,"original_language":"de","original_title":"Schwartz & Schwartz","genre_ids":[80],"backdrop_path":null,"adult":false,"overview":"","release_date":"2019-01-01"},
{"vote_count":0,"id":505916,"video":false,"vote_average":0,"title":"Kuru","popularity":1.107158,"poster_path":null,"original_language":"ja","original_title":"来る","genre_ids":[],"backdrop_path":null,"adult":false,"overview":"After the inexplicable message, at his workplace, of a mysterious death, a man is introduced to a freelance writer and his girlfriend.","release_date":"2019-01-01"},
{"vote_count":0,"id":521028,"video":false,"vote_average":0,"title":"Tsokos: Zersetzt","popularity":1.115739,"poster_path":null,"original_language":"de","original_title":"Tsokos: Zersetzt","genre_ids":[53],"backdrop_path":null,"adult":false,"overview":"","release_date":"2019-01-01"},
{"vote_count":0,"id":516910,"video":false,"vote_average":0,"title":"Rufmord","popularity":1.658291,"poster_path":null,"original_language":"de","original_title":"Rufmord","genre_ids":[18],"backdrop_path":null,"adult":false,"overview":"","release_date":"2019-01-01"},
{"vote_count":0,"id":514224,"video":false,"vote_average":0,"title":"Shadows","popularity":1.289124,"poster_path":null,"original_language":"en","original_title":"Shadows","genre_ids":[16],"backdrop_path":null,"adult":false,"overview":"Plot kept under wraps.","release_date":"2019-01-01"},
{"vote_count":0,"id":483202,"video":false,"vote_average":0,"title":"Eli","popularity":1.118757,"poster_path":null,"original_language":"en","original_title":"Eli","genre_ids":[27],"backdrop_path":null,"adult":false,"overview":"A boy receiving treatment for his auto-immune disorder discovers that the house he's living isn't as safe as he thought.","release_date":"2019-01-01"},
{"vote_count":0,"id":491287,"video":false,"vote_average":0,"title":"Untitled Lani Pixels Project","popularity":1.951231,"poster_path":null,"original_language":"en","original_title":"Untitled Lani Pixels Project","genre_ids":[10751,16,12,35],"backdrop_path":null,"adult":false,"overview":"Evil forces have invaded an isolated island and have targeted Patrick and Susan's grandfather, Mr. Campbell. Guided by Jack, a charming Irish rogue, the siblings end up on a dangerous journey filled with magic and mystery.","release_date":"2019-01-01"},
{"vote_count":2,"id":49046,"video":false,"vote_average":0,"title":"All Quiet on the Western Front","popularity":6.197559,"poster_path":"/jZWVtbxyztDTSM0LXDcE6vdVTVC.jpg","original_language":"en","original_title":"All Quiet on the Western Front","genre_ids":[28,12,18,10752],"backdrop_path":null,"adult":false,"overview":"A young German soldier's terrifying experiences and distress on the western front during World War I.","release_date":"2018-12-31"},
{"vote_count":1,"id":299782,"video":false,"vote_average":0,"title":"The Other Side of the Wind","popularity":4.561363,"poster_path":"/vnfNbuyPqo5zJavqlgI3J50xJSi.jpg","original_language":"en","original_title":"The Other Side of the Wind","genre_ids":[35,18],"backdrop_path":null,"adult":false,"overview":"Orson Welles' unfinished masterpiece, restored and assembled based on Welles' own notes. During the last 15 years of his life, Welles, who died in 1985, worked obsessively on the film, which chronicles a temperamental film director—much like him—who is battling with the Hollywood establishment to finish an iconoclastic work.","release_date":"2018-12-31"},
{"vote_count":0,"id":289600,"video":false,"vote_average":0,"title":"The Sandman","popularity":3.329464,"poster_path":"/eju4vLNx9sSvscowmnKNLi3sFVe.jpg","original_language":"en","original_title":"The Sandman","genre_ids":[27],"backdrop_path":"/zo67d5klQiFR3PCyvER39IMwZ73.jpg","adult":false,"overview":"THE SANDMAN tells the story of Nathan, a young student in the city who struggles to forget his childhood trauma at the hands of the serial killer dubbed \"The Sandman.\" Nathan killed The Sandman years ago, on Christmas Eve, after he witnessed the murder of his mother... until he sees the beautiful woman who lives in the apartment across the way dying at the hands of that same masked killer. This brutal murder plunges Nathan into an odyssey into the night country of his past, his dreams... and the buried secrets of The Sandman.","release_date":"2018-12-31"},
{"vote_count":0,"id":378177,"video":false,"vote_average":0,"title":"Luxembourg","popularity":1.179703,"poster_path":null,"original_language":"en","original_title":"Luxembourg","genre_ids":[],"backdrop_path":null,"adult":false,"overview":"The story of a group of people living in a permanent nuclear winter in the ruins of the old civilisation destroyed by an atomic war.","release_date":"2018-12-31"},
{"vote_count":0,"id":347392,"video":false,"vote_average":0,"title":"Slice","popularity":3.248065,"poster_path":"/ySWPZihd5ynCc1aNLQUXmiw5H2V.jpg","original_language":"en","original_title":"Slice","genre_ids":[35],"backdrop_path":"/rtL9nzXtSvo1MW05kho9oeimCdb.jpg","adult":false,"overview":"When a pizza delivery driver is murdered on the job, the city searches for someone to blame: ghosts? drug dealers? a disgraced werewolf?","release_date":"2018-12-31"},
{"vote_count":0,"id":438674,"video":false,"vote_average":0,"title":"Dragged Across Concrete","popularity":3.659627,"poster_path":"/p4tpV4nGeocuOKhp0enuiQNDvhi.jpg","original_language":"en","original_title":"Dragged Across Concrete","genre_ids":[18,80,53,9648],"backdrop_path":null,"adult":false,"overview":"Two policemen, one an old-timer (Gibson), the other his volatile younger partner (Vaughn), find themselves suspended when a video of their strong-arm tactics becomes the media's cause du jour. Low on cash and with no other options, these two embittered soldiers descend into the criminal underworld to gain their just due, but instead find far more than they wanted awaiting them in the shadows.","release_date":"2018-12-31"},
{"vote_count":0,"id":437518,"video":false,"vote_average":0,"title":"Friend of the World","popularity":4.189267,"poster_path":"/hf3LucIg7t7DUvgGJ9DjQyHcI4J.jpg","original_language":"en","original_title":"Friend of the World","genre_ids":[35,18,27,878,53,10752],"backdrop_path":null,"adult":false,"overview":"After a catastrophic war, an eccentric general guides a filmmaker through a ravaged bunker.","release_date":"2018-12-31"}]}

import json
import http.client
import csv

conn = http.client.HTTPSConnection("api.themoviedb.org")
payload = "{}"
conn.request("GET", "/3/discover/movie?page=20&include_video=false&include_adult=false&sort_by=primary_release_date.desc&language=en-US&api_key=XXXXXXXXXXXXXXXXXXXXXXXXXXX", payload)
res = conn.getresponse()
data = res.read()
json_data = json.loads(data)
results = json_data["results"]
for item in results:
    print(item['vote_count'])  # dict values are accessed with [], not ()
# write code here to pick the necessary fields and write them to a CSV
This is a way you can do it. Comment if you have any query.
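To answer questions 1 and 2 directly: open the file in mode "w" (which truncates it, so each run completely overwrites the last), and csv.DictWriter gives you one column per field (question 3). A minimal sketch; in the script above you would pass json_data["results"], but here it is demonstrated with two entries shaped like the API output:

```python
import csv

# one CSV column per value provided for each title
FIELDS = ["id", "title", "original_title", "original_language",
          "release_date", "popularity", "vote_count", "vote_average",
          "overview", "poster_path"]

def write_movies_csv(results, path="movies.csv"):
    # mode "w" truncates the file, so rerunning overwrites rather than appends
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS, extrasaction="ignore")
        writer.writeheader()       # header row with the column names
        writer.writerows(results)  # one row per movie dict

# sample entries shaped like the API response above (extra keys are ignored)
sample = [
    {"id": 520720, "title": "Kim Possible", "original_title": "Kim Possible",
     "original_language": "en", "release_date": "2019-01-01",
     "popularity": 1.188148, "vote_count": 0, "vote_average": 0,
     "overview": "Live-action film adaptation.", "poster_path": None,
     "genre_ids": [10751, 28, 12]},
    {"id": 516910, "title": "Rufmord", "original_title": "Rufmord",
     "original_language": "de", "release_date": "2019-01-01",
     "popularity": 1.658291, "vote_count": 0, "vote_average": 0,
     "overview": "", "poster_path": None, "genre_ids": [18]},
]
write_movies_csv(sample)
```

In your script the call would be write_movies_csv(json_data["results"]). The encoding="utf-8" matters here because titles like 森のかたみ are non-ASCII.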

That looks like a JSON object, so you can parse it into a Python dictionary using:
import json
mydict = json.loads(data)
The values you want are probably in mydict["results"], which is a list of dictionaries. Depending on how you want these, you could use the csv library or just iterate through them and print the contents with a tab between them:
for item in mydict["results"]:
    for k in item:
        print("{}\t{}".format(k, item.get(k)))

Related

Extracting data from text files

I have a file with almost 2000 tweets in English. It looks like this:
{"data":[{"no.":"1241583652212862978","created":"2020-03-22T04:33:04.000Z","tweet":"#OHAOregon My friend says we should not reuse masks to combat coronavirus, is that correct?"},{"no.":"1241583655538941959","created":"2020-03-22T04:33:05.000Z","tweet":" I know it’s from a few days ago, but these books are in good shape}, .......]}
I want to extract only the tweet from the text file. How can I extract only the tweet part from the text file? Any suggestions will be helpful. Thanks in advance.
Your file is in JSON format. Check Python's json lib and you will be able to extract the tweets: https://docs.python.org/3/library/json.html
Assuming you use d to represent the parsed object, it's as simple as:
tweet = d["data"][0]["tweet"]
Also, if it helps, here's a working example I did in the shell using your data:
>>> d = {'data': [{'no.': '1241583652212862978', 'created': '2020-03-22T04:33:04.000Z', 'tweet': '#OHAOregon My friend says we should not reuse masks to combat coronavirus, is that correct?'}, {'no.': '1241583655538941959', 'created': '2020-03-22T04:33:05.000Z', 'tweet': ' I know it’s from a few days ago, but these books are in good shape'}]}
>>> print(d["data"])
[{'no.': '1241583652212862978', 'created': '2020-03-22T04:33:04.000Z', 'tweet': '#OHAOregon My friend says we should not reuse masks to combat coronavirus, is that correct?'}, {'no.': '1241583655538941959', 'created': '2020-03-22T04:33:05.000Z', 'tweet': ' I know it’s from a few days ago, but these books are in good shape'}]
>>> print(d["data"][0]["tweet"])
#OHAOregon My friend says we should not reuse masks to combat coronavirus, is that correct?
>>>
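To pull every tweet out of the file rather than just the first, load it with the json module and loop over d["data"]. A sketch, demonstrated inline with the sample data; the filename in the commented-out lines is hypothetical since the question doesn't give one:

```python
import json

def extract_tweets(d):
    # d is the parsed JSON object; each entry in d["data"] has a "tweet" key
    return [entry["tweet"] for entry in d["data"]]

d = {"data": [
    {"no.": "1241583652212862978", "created": "2020-03-22T04:33:04.000Z",
     "tweet": "#OHAOregon My friend says we should not reuse masks to combat coronavirus, is that correct?"},
    {"no.": "1241583655538941959", "created": "2020-03-22T04:33:05.000Z",
     "tweet": " I know it's from a few days ago, but these books are in good shape"},
]}
tweets = extract_tweets(d)

# To read from the actual file instead (hypothetical filename):
# with open("tweets.json", encoding="utf-8") as f:
#     tweets = extract_tweets(json.load(f))
```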

Spark does not recognize new lines, &amp, etc. from String

I'm trying to process text data (Twitter tweets) with PySpark. Emojis and special characters are being read correctly, but "\n" and "&amp" appear to stay escaped; Spark does not recognize them (probably others too). One example tweet in my Spark DataFrame looks like this:
"Hello everyone\n\nHow is it going? 😉 Take care & enjoy"
I would like Spark to read them correctly. The files are stored as parquet and I'm reading them like this:
tweets = spark.read.format('parquet')\
    .option('header', 'True')\
    .option('encoding', 'utf-8')\
    .load(path)
Below are some sample input data, which I took from the original JSONL files (I stored the data as parquet later).
"full_text": "RT #OurWarOnCancer: Where is our FEDERAL vaccination
education campaign for HPV?! Where is our FEDERAL #lungcancer
screening program?! (and\u2026"
"full_text": "\u2b55\ufe0f#HPV is the most important cause of
#CervicalCancer But it doesn't just cause cervical cancer (see the figure\ud83d\udc47) \n\u2b55\ufe0fThat means they can be PREVENTED"
Reading directly from the JSONL files results in the same recognition problems.
tweets = spark.read\
    .option('encoding', 'utf-8')\
    .json(path)
How can Spark recognize them correctly? Thank you in advance.
The code below might help solve your problem.
Input taken:
"Hello everyone\n\nHow is it going? 😉 Take care & enjoy"
"full_text": "RT #OurWarOnCancer: Where is our FEDERAL vaccination education campaign for HPV?! Where is our FEDERAL #lungcancer screening program?! (and\u2026 &"
"full_text": "\u2b55\ufe0f#HPV is the most important cause of #CervicalCancer But it doesn't just cause cervical cancer (see the figure\ud83d\udc47) \n\u2b55\ufe0fThat means they can be PREVENTED #theNCI #NCIprevention #AmericanCancer #cancereu #uicc #IARCWHO #EuropeanCancer #KanserSavascisi #AUTF_DEKANLIK #OncoAlert"
code to solve the problem:
from pyspark.sql.functions import regexp_replace

# read the raw tweets, then strip "&" and literal "\n" sequences
df = spark.read.csv("file:///home/sathya/Desktop/stackoverflo/raw-data/input.tweet")
df1 = df.withColumn("cleandata", regexp_replace('_c0', '&|\\\\n', ''))
df1.select("cleandata").show(truncate=False)
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|cleandata |
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|Hello everyoneHow is it going? 😉 Take care & enjoy |
|"full_text": "RT #OurWarOnCancer: Where is our FEDERAL vaccination education campaign for HPV?! Where is our FEDERAL #lungcancer screening program?! (and\u2026 &" |
|"full_text": "\u2b55\ufe0f#HPV is the most important cause of #CervicalCancer But it doesn't just cause cervical cancer (see the figure\ud83d\udc47) \u2b55\ufe0fThat means they can be PREVENTED #theNCI #NCIprevention #AmericanCancer #cancereu #uicc #IARCWHO #EuropeanCancer #KanserSavascisi #AUTF_DEKANLIK #OncoAlert"|
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
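The key point is that these sequences are literal characters in the stored text, not real newlines or ampersands, so a plain string substitution fixes them. Since the asker wants the sequences recognized rather than deleted, here is a sketch of the same idea in plain Python that decodes them instead of stripping them (my own variant, not the Spark answer above):

```python
import re

def clean(text):
    # "&amp" (with or without the trailing ";") is an escaped HTML entity;
    # decode it back to a real "&"
    text = re.sub(r"&amp;?", "&", text)
    # "\\n" here is a literal backslash followed by "n" in the stored text;
    # turn it into a real newline
    text = text.replace("\\n", "\n")
    return text

print(clean("Hello everyone\\n\\nHow is it going? Take care &amp; enjoy"))
```

For arbitrary entities (&lt;, &gt;, ...) the stdlib html.unescape is more robust than a hand-rolled regex; in Spark the same decoding can be applied per column with regexp_replace or a UDF.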

Improve speed of wikipedia.search with Python

I have a dictionary of famous people's names sorted by their initials. I want to convert these names into their respective Wikipedia title page names. These are the same for the first three given in this example, but Alexander Bell gets correctly converted to Alexander Graham Bell after running this code.
The algorithm works, although it took about an hour to do all the 'AA' names, and I am hoping for it to run all the way up to 'ZZ'.
Is there any optimisation I can do on this? For example I saw something about batch requests but am not sure if it applies to my algorithm.
Or is there a more efficient method that I could use to get this same information?
Thanks.
import wikipedia

PeopleDictionary = {'AA': ['Amy Adams', 'Aaron Allston'], 'AB': ['Alia Bhatt', 'Alexander Bell']}
for key, val in PeopleDictionary.items():
    for val in range(len(PeopleDictionary[key])):
        Name_URL_All = wikipedia.search(PeopleDictionary[key][val])
        if Name_URL_All:
            Name_URL = Name_URL_All[0]
            PeopleDictionary[key][val] = Name_URL
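Most of that hour is network latency, one blocking HTTP request per name, so the cheapest big win is issuing the lookups concurrently with a thread pool. A sketch of a generic threaded mapper (parallel_lookup and the worker count are my own names, not part of the wikipedia library); in the loop above you would pass lambda name: (wikipedia.search(name) or [name])[0] as lookup_fn. Demonstrated here with str.title as a stand-in so it runs offline:

```python
from concurrent.futures import ThreadPoolExecutor

def parallel_lookup(names, lookup_fn, workers=8):
    # run lookup_fn over all names concurrently; pool.map preserves the
    # order of names, so results line up with the input list
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lookup_fn, names))

# stand-in for wikipedia.search; with the real library you would use
# lambda name: (wikipedia.search(name) or [name])[0]
results = parallel_lookup(["amy adams", "alia bhatt"], str.title)
print(results)  # ['Amy Adams', 'Alia Bhatt']
```

Threads (not processes) are the right fit because the work is I/O-bound; also note that wikipedia.search makes one API call per name, so respect the site's rate limits when raising the worker count.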

Accessing unicode content from DataFrame returns unicode content with additional backslash in Python3

I have a CSV file that consists of some tweets downloaded through the API. The tweets contain some Unicode characters, and I have a pretty fair idea how to decode them.
I put the CSV File into DataFrame,
df = pd.read_csv('sample.csv', header=None)
columns = ['time', 'tweet']
df.columns = columns
one of the tweets is -
b'RT : This little girl dressed as her father for Halloween, a employee \xf0\x9f\x98\x82\xf0\x9f\x98\x82\xf0\x9f\x91\x8c (via )'
But when I access this tweet through the command -
df['tweet'][0]
the output is returned in below format -
"b'RT : This little girl dressed as her father for Halloween, a employee \\xf0\\x9f\\x98\\x82\\xf0\\x9f\\x98\\x82\\xf0\\x9f\\x91\\x8c (via ) '"
I am not able to figure out why this extra backslash is getting appended to the tweet. As a result, this content is not getting decoded. Below are the few rows from the DataFrame.
time tweet
0 2018-11-02 05:55:46 b'RT : This little girl dressed as her father for Halloween, a employee \xf0\x9f\x98\x82\xf0\x9f\x98\x82\xf0\x9f\x91\x8c (via )'
1 2018-11-02 05:46:41 b'RT : This little girl dressed as her father for Halloween, a employee \xf0\x9f\x98\x82\xf0\x9f\x98\x82\xf0\x9f\x91\x8c (via )'
2 2018-11-02 03:44:35 b'Like, you could use a line map that just shows the whole thing instead of showing a truncated map that\xe2\x80\x99s confusing.\xe2\x80\xa6 (via )
3 2018-11-02 03:37:03 b' service is a joke. No service northbound No service northbound from Navy Yard after a playoff game at 11:30pm. And they\xe2\x80\xa6'
Screenshot of 'sample.csv'.
As I mentioned before, if any of these tweets is accessed directly, an extra backslash appears in the output.
Can anyone please explain why this is happening and how to avoid it?
Thanks.
You did not show the contents of your CSV file, but it looks like whoever created it recorded the string representation of the bytes object as it came from Twitter - that is, inside the CSV file itself, you will find the literal b'\xf0...' characters.
So, when you read it from Python, although when printed it appears to be a bytes object (the kind represented with b'...'), it is actually a string whose content is that representation.
One way to get these back as proper strings is to let Python evaluate their content - then they become valid bytes objects, which can be decoded into text. Always prefer ast.literal_eval here, as plain eval will execute arbitrary expressions.
So, after you have your data loaded into your dataframe, this should fix your tweets column:
import ast
df['tweet'] = df['tweet'].map(lambda x: ast.literal_eval(x).decode('utf-8') if x.startswith("b'") else x)
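To see why this works, here is the same fix applied to a single literal string of the kind stored in the CSV (a standalone sketch, independent of the DataFrame):

```python
import ast

# what the CSV actually contains: the *text* "b'...'" with literal backslashes
raw = "b'This little girl dressed as her father \\xf0\\x9f\\x98\\x82'"

# literal_eval parses that text back into a real bytes object...
as_bytes = ast.literal_eval(raw)
# ...which can then be decoded: \xf0\x9f\x98\x82 is the UTF-8 encoding of 😂
text = as_bytes.decode("utf-8")
print(text)  # This little girl dressed as her father 😂
```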

Using quantopian for data analysis

I want to know where Quantopian gets its data from.
If I want to do an analysis on a stock market other than the NYSE, will I get the data? If not, can I manually upload the data so that I can run my algorithms on it?
1.) Quantopian gets its data from several places, and provides most of it online, although some sources are premium and require a subscription.
2.) Yes, you can get standard stock market data, but if you have something like a Bloomberg subscription, another feed, or something you've built yourself and want to pull it in, you can use Fetcher.
The basic signature is:
fetch_csv(url, pre_func=None, post_func=None, date_column='date',
          date_format='%m/%d/%y', timezone='UTC', symbol=None, **kwargs)
Here is an example for something like Dropbox:
def initialize(context):
    # Fetch data from a CSV file somewhere on the web.
    # Note that one of the columns must be named 'symbol' for
    # the data to be matched to the stock symbol.
    fetch_csv('https://dl.dropboxusercontent.com/u/169032081/fetcher_sample_file.csv',
              date_column='Settlement Date',
              date_format='%m/%d/%y')
    context.stock = symbol('NFLX')

def handle_data(context, data):
    record(Short_Interest=data.current(context.stock, 'Days To Cover'))
You can get data for non-NYSE stocks as well, such as Nasdaq securities. Screens are also available by fundamentals (market, exchange, market cap); these can limit the stocks analyzed from the broad universe.
You can also get stock data from Yahoo or other quant sites.
