The Guardian API - Script crashes - python-3.x

import json
import requests
from os import makedirs
from os.path import join, exists
from datetime import date, timedelta

ARTICLES_DIR = join('tempdata', 'articles')
makedirs(ARTICLES_DIR, exist_ok=True)
API_ENDPOINT = 'http://content.guardianapis.com/search'
my_params = {
    'q': 'coronavirus,stock,covid',
    'sectionID': 'business',
    'from-date': "2019-01-01",
    'to-date': "2020-09-30",
    'order-by': "newest",
    'show-fields': 'all',
    'page-size': 300,
    'api-key': '### my cryptic key ###'
}
# day iteration from here:
# http://stackoverflow.com/questions/7274267/print-all-day-dates-between-two-dates
start_date = date(2019, 1, 1)
end_date = date(2020, 9, 30)
dayrange = range((end_date - start_date).days + 1)
for daycount in dayrange:
    dt = start_date + timedelta(days=daycount)
    datestr = dt.strftime('%Y-%m-%d')
    fname = join(ARTICLES_DIR, datestr + '.json')
    if not exists(fname):
        # then let's download it
        print("Downloading", datestr)
        all_results = []
        my_params['from-date'] = datestr
        my_params['to-date'] = datestr
        current_page = 1
        total_pages = 1
        while current_page <= total_pages:
            print("...page", current_page)
            my_params['page'] = current_page
            resp = requests.get(API_ENDPOINT, my_params)
            data = resp.json()
            all_results.extend(data['response']['results'])
            # if there is more than one page
            current_page += 1
            total_pages = data['response']['pages']
        with open(fname, 'w') as f:
            print("Writing to", fname)
            # re-serialize it for pretty indentation
            f.write(json.dumps(all_results, indent=2))
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
<ipython-input-18-f04b4f0fe9ed> in <module>
49 resp = requests.get(API_ENDPOINT, my_params)
50 data = resp.json()
---> 51 all_results.extend(data['response']['results'])
52 # if there is more than one page
53 current_page += 1
KeyError: 'results'
The same error occurs for 'pages'.
At first there were no issues and I was able to run it. The download crashed after 2020-03-24, and since then I can't get the code running again.
I'm referring to lines 51 and 54; at least at this point the code crashes.
I'm not sure how to get rid of the issue. Any ideas?

Understanding the error message would be the first step - it complains about a missing key. Check whether data['response']['results'] is present (hint: it is not) and check what exactly the structure of your data['response'] is.
Fortunately one can use the API key 'test', so we can help using that key:
my_params = {
    'q': 'coronavirus,stock,covid',
    'sectionID': 'business',
    'from-date': "2019-01-01",
    'to-date': "2020-09-30",
    'order-by': "newest",
    'show-fields': 'all',
    'page-size': 300,
    'api-key': 'test' # test key for that API
}
On running, I get the same exception; inspecting data['response'] shows an error message instead of any results.
Let's see what parameters are given, shall we?
my_params = {
    'q': 'coronavirus,stock,covid',
    'sectionID': 'business',
    'from-date': "2019-01-01",
    'to-date': "2020-09-30",
    'order-by': "newest",
    'show-fields': 'all',
    'page-size': 300, # TOO BIG
    'api-key': 'test'
}
Fix that to 200 and you'll get:
Downloading 2019-01-01
...page 1
Writing to tempdata\articles\2019-01-01.json
Downloading 2019-01-02
...page 1
Writing to tempdata\articles\2019-01-02.json
Downloading 2019-01-03
...page 1
Writing to tempdata\articles\2019-01-03.json
Downloading 2019-01-04
...page 1
Writing to tempdata\articles\2019-01-04.json
Downloading 2019-01-05
[snip]
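As a general pattern (a sketch, not part of the fix itself), it helps to guard the dictionary lookups so that an error response from the API is surfaced instead of triggering a bare KeyError. This reuses the names from the script above:

resp = requests.get(API_ENDPOINT, my_params)
data = resp.json()
response = data.get('response', {})
if 'results' not in response:
    # Show whatever the API actually said instead of crashing on a missing key.
    raise RuntimeError('Unexpected API response: ' + json.dumps(data, indent=2))
all_results.extend(response['results'])
total_pages = response.get('pages', 1)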

Related

How to make my pandas script export to csv

I have managed to get the output I want from the script below, but I am having trouble exporting it to a csv using:
v.to_csv(n + '.csv', index=False)
I get this error:
Traceback (most recent call last):
  File "Python/CouponRedemptions/start.py", line 22, in <module>
    print(v['invoice_line_normal_price'])
  File "~/.local/lib/python3.8/site-packages/pandas/core/frame.py", line 2902, in __getitem__
    indexer = self.columns.get_loc(key)
  File "~/.local/lib/python3.8/site-packages/pandas/core/indexes/base.py", line 2893, in get_loc
    raise KeyError(key) from err
KeyError: 'invoice_line_normal_price'
I think it is the way the DF is structured; you cannot export it in its current state. I was wondering how I would go about making this work, or any suggestions on where I can start looking.
import pandas as pd
import re
r = pd.read_csv('cp.csv', low_memory=False)
r = r.filter(['shop_name','order_coupon_code','invoice_line_type','invoice_date','invoice_line_normal_price'])
r = r[r.order_coupon_code.notnull()]
r['invoice_line_normal_price'] = pd.to_numeric(r['invoice_line_normal_price'],errors = 'coerce')
n = input("Enter the coupon name: ")
nr = r[r.order_coupon_code.str.match(n,flags=re.IGNORECASE)]
nr = nr[nr.invoice_line_type.str.match('charge')]
nr = nr.sort_values('shop_name')
v = nr.groupby(['shop_name'])['invoice_line_normal_price'].value_counts().to_frame('counts')
print(v)
An example of the csv data:
shop_name order_coupon_code invoice_line_type invoice_date invoice_line_normal_price moresome moreother hello
0 shop1 nv55 sell 01.01.2016 01:00:00.000 15.0 3 tt hi
1 shop2 nv44 quote 01.01.2016 02:00:00.000 22.0 4 rr hey
2 shop3 nv22 charge 01.01.2016 03:00:00.000 27.0 5 dd what
import pandas as pd
# The low_memory option is not properly deprecated, but it should be, since it does not actually do anything differently
r = pd.read_csv('cp.csv')
print(r)
# r = r.loc[:,['shop_name', 'order_coupon_code', 'invoice_line_type', 'invoice_date', 'invoice_line_normal_price']]
r = r.filter(['shop_name','order_coupon_code','invoice_line_type','invoice_date','invoice_line_normal_price'])
r = r[r.order_coupon_code.notnull()]
r['invoice_line_normal_price'] = pd.to_numeric(r['invoice_line_normal_price'],errors = 'coerce')
# Enter the coupon name: nv22
n = input("Enter the coupon name: ")
nr = r[r.order_coupon_code.str.contains(n.lower())]
nr = nr[nr.invoice_line_type.str.match('charge')]
nr = nr.sort_values('shop_name')
v = nr.groupby(['shop_name'])['invoice_line_normal_price'].value_counts().to_frame('counts')
print(v)
v.to_csv(n + '.csv', index=False)
The output:
shop_name invoice_line_normal_price
shop3 27.0 1
Let's say you need to append more rows to a single csv file:
v.to_csv(n + '.csv', mode='a', index=False)
And with no header:
v.to_csv(n + '.csv', mode='a', index=False, header=False)
Just to make sure: this error means the column name is not in your csv file, so check the column names in your csv file:
get_loc raise KeyError(key) from err KeyError: 'invoice_line_normal_price'
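One hedged addition to the answer above: v is a groupby result, so shop_name and invoice_line_normal_price live in its index, and index=False drops them from the exported file. Calling reset_index() first turns them back into ordinary columns:

# reset_index() moves the group keys out of the index so they survive index=False
v.reset_index().to_csv(n + '.csv', index=False)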

Download survey results from Qualtrics into Python

I am trying to get the survey responses from Qualtrics directly into a pandas dataframe in Python. Is there a way of doing so?
import shutil
import os
import requests
import zipfile
import json
import io
# Setting user Parameters
# apiToken = "myKey"
# surveyId = "mySurveyID"
# fileFormat = "csv"
# dataCenter = "az1"
apiToken = "HfDjOn******"
surveyId = "SV_868******"
fileFormat = "csv"
dataCenter = 'uebs.eu'
# Setting static parameters
requestCheckProgress = 0
progressStatus = "in progress"
baseUrl = "https://{0}.qualtrics.com/API/v3/responseexports/".format(dataCenter)
headers = {
    "content-type": "application/json",
    "x-api-token": apiToken,
}
Then, for Step 1 (Creating Data Export):
downloadRequestUrl = baseUrl
When I try to access the URL from my Chrome browser, it gives me the following:
{"meta":{"httpStatus":"404 - Not Found","error":{"errorMessage":"The requested resource does not exist."}}}
which I believe is the main reason why, after running this code:
# Step 1: Creating Data Export
downloadRequestUrl = baseUrl
downloadRequestPayload = '{"format":"' + fileFormat + '","surveyId":"' + surveyId + '"}'
downloadRequestResponse = requests.request("POST", downloadRequestUrl, data=downloadRequestPayload, headers=headers)
progressId = downloadRequestResponse.json()["result"]["id"]
print(downloadRequestResponse.text)
It gives me this error
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
<ipython-input-38-cd611e49879c> in <module>
3 downloadRequestPayload = '{"format":"' + fileFormat + '","surveyId":"' + surveyId + '"}'
4 downloadRequestResponse = requests.request("POST", downloadRequestUrl, data=downloadRequestPayload, headers=headers)
----> 5 progressId = downloadRequestResponse.json()["result"]["id"]
6 print(downloadRequestResponse.text)
KeyError: 'result'
I am somewhat new to the Qualtrics/Python interface. Can someone share why I am having this difficulty? Is it because of the dataCenter?
Thank you.
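A hedged observation rather than a definitive answer: the 404 JSON you see in the browser suggests the baseUrl (and therefore the dataCenter value) does not point at a valid Qualtrics endpoint, so the POST response carries only an error object under "meta" and no "result" key. Guarding the lookup makes the real cause visible; a minimal sketch reusing the names from the code above:

downloadRequestResponse = requests.request("POST", downloadRequestUrl, data=downloadRequestPayload, headers=headers)
payload = downloadRequestResponse.json()
if "result" not in payload:
    # Failures arrive under "meta" (as in the 404 shown above); report them instead of crashing.
    raise RuntimeError("Export request failed: " + downloadRequestResponse.text)
progressId = payload["result"]["id"]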

How to fix string concatenation error in Python 3

When I run the code below, it begins to loop through and prints the time and the length of temp_data on each iteration, but it throws the error below before reaching 100 iterations. If I update the code to sleep for 5 seconds instead of 1, it makes it through all 100 iterations.
import time
import requests

start_date = 1483228800*1000 #jan 1 2017
pair = 'ETHBTC'
timeframe = '1m'
final_data = []
for _ in range(0, 100):
    url = 'https://api.bitfinex.com/v2/candles/trade:'+timeframe+':t'+pair+'/hist?sort=1&limit=1000&start='+str(start_date)
    r = requests.get(url)
    temp_data = r.json()
    final_data = final_data+temp_data
    start_date = temp_data[len(temp_data)-1][0]+60*1000
    print(time.ctime(), len(temp_data))
    time.sleep(1)
print(len(final_data))
Error:
Traceback (most recent call last):
  File "/Users/michael/PycharmProjects/bot/venv/datasets/dataset.py", line 18, in <module>
    start_date = temp_data[len(temp_data)-1][0]+60*1000
TypeError: can only concatenate str (not "int") to str
You should convert 60*1000 to str:
start_date = temp_data[len(temp_data)-1][0] + str(60*1000)
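A hedged aside on the root cause: the fact that a 5-second sleep lets all 100 iterations finish suggests the API rate-limits the 1-second loop and then returns an error payload instead of candle rows, so temp_data[-1][0] is a string rather than a millisecond timestamp. Assuming Bitfinex v2 signals this with a list starting with 'error', a sketch that backs off and retries instead of concatenating:

import time
import requests

start_date = 1483228800 * 1000  # Jan 1 2017, in milliseconds
pair = 'ETHBTC'
timeframe = '1m'
final_data = []
fetched = 0
while fetched < 100:
    url = ('https://api.bitfinex.com/v2/candles/trade:' + timeframe +
           ':t' + pair + '/hist?sort=1&limit=1000&start=' + str(start_date))
    temp_data = requests.get(url).json()
    # Assumption: a rate-limited response looks like ['error', <code>, <message>].
    if not temp_data or temp_data[0] == 'error':
        time.sleep(5)  # back off, then retry the same window
        continue
    final_data += temp_data
    start_date = temp_data[-1][0] + 60 * 1000  # last candle's timestamp + one minute
    fetched += 1
    time.sleep(1)
print(len(final_data))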

TypeError: 'str' object cannot be interpreted as an integer when multiplying 2 fields

I am using two columns in a file, i.e. Revenue and Margin, to calculate profit for each row and then upload it to SearchAds. I am calculating the profit in the function, and it is throwing the error below:
Traceback (most recent call last):
  File "C:\Python36\lib\site-packages\pandas\core\indexes\base.py", line 4381, in get_value
    return libindex.get_value_box(s, key)
  File "pandas\_libs\index.pyx", line 52, in pandas._libs.index.get_value_box
  File "pandas\_libs\index.pyx", line 48, in pandas._libs.index.get_value_at
  File "pandas\_libs\util.pxd", line 113, in pandas._libs.util.get_value_at
  File "pandas\_libs\util.pxd", line 98, in pandas._libs.util.validate_indexer
TypeError: 'str' object cannot be interpreted as an integer
KeyError: 'MarginData'
I tried calculating the profit right after the if clause; it still gives me the same error. Below is the code.
for filename in os.listdir('//AwsSQl/Share/ftpdata/'):
    file = '//AwsSQl/Share/ftpdata/' + filename
    if filename.startswith('Revenue_'):
        print(filename)
        file_name = filename
        logging.info("Uploading Conversions from " + filename)
        columns = ['Timestamp', 'OrderID', 'Revenue', 'MarginPct']
        data = pd.read_csv(file, delimiter='\t')
        data['Revenue'] = data['Revenue'].map(lambda x: '{:.2f}'.format(x))
        data['MarginPct'] = data['MarginPct'].map(lambda x: '{:.2f}'.format(x))
        pd.set_option('display.max_columns', 500)
        pd.set_option('display.width', 1000)
        dir = '//AwsSQl/Share/ftpdata/'
        print(data.head(data['Timestamp'].count()))
        print(data['Timestamp'].count())
        for index, row in data.iterrows():
            dt = parse(row['Timestamp'])
            millisecond = int(round(dt.timestamp() * 1000))
            if row['Orders'] > 0:
                profit_upload(service, row['GCLID'], str(row['OrderID']) + "Pro" + str(index), millisecond, row['Revenue'], row['MarginData'])

def profit_upload(service, gclid, orderId, mill, rev, mar):
    """Upload conversion data to Adobe - Revenue.
    Args:
      service: An authorized Doubleclicksearch service. See Set Up Your Application.
      gclid, orderId, millisecond, revenue, row
    """
    request = service.conversion().insert(
        body=
        {
            'conversion': [{
                'agencyId': agencyId,
                'advertiserId': advertiserId,
                'attributionModel': 'External Attribution Model',
                'clickId': gclid,
                'conversionId': orderId,
                'conversionTimestamp': mill,
                'segmentationType': 'FLOODLIGHT',
                'segmentationName': 'Adobe - DSG - Profit',
                'type': 'Transaction',
                'revenueMicros': (round(float(rev), 2) * round(float(mar), 2) * 1000000), #10 million revenueMicros is equivalent to $10 of revenue
                'countMillis': 0 * 1000,
                'currencyCode': 'USD',
            }]
        }
    )
It seems that you have a str type where it needs an int type.
... row['MarginData']) # <-- I expect this is where the problem starts
Either it can't find any column for row['MarginData'] (the columns list above defines 'MarginPct', not 'MarginData') or it expects something else.
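A minimal sketch of the likely fix, assuming the margin column really is named MarginPct as in the columns list above (service, profit_upload, millisecond, and index all come from the asker's code):

if row['Orders'] > 0:
    # 'MarginPct' is the column the script actually creates; 'MarginData' does not exist.
    # float() is needed because both columns were reformatted as strings above.
    profit_upload(service, row['GCLID'],
                  str(row['OrderID']) + "Pro" + str(index),
                  millisecond,
                  float(row['Revenue']),
                  float(row['MarginPct']))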

Python 3.x unsupported operand type in using encode decode

I am trying to build a generic crawler for my marketing project and keep track of where the information came from, viz. blogs, testimonials, etc. I am using Python 3.5 and Spyder/PyCharm as IDE, and I keep getting the following error in using encode-decode. The input to my code is a list of company names and product features in an Excel file. I also searched for possible solutions, but the recommendations in the community are for typecasting, which I am not sure is the problem.
Kindly let me know if some more clarification is required from my side.
from __future__ import division, unicode_literals
import codecs
import re
import os
import urllib.request  # the loop below calls urllib.request.Request/urlopen
import xlrd
import requests
from urllib.request import urlopen
from time import sleep
from bs4 import BeautifulSoup
import openpyxl
from collections import Counter
page=0
b=0
n=0
w=0
p=0
o=0
workbook=xlrd.open_workbook("C:\Product.xlsx")
workbook1=xlrd.open_workbook("C:\linkslist.xlsx")
sheet_names = workbook.sheet_names()
sheet_names1 = workbook1.sheet_names()
wb= openpyxl.Workbook() #User Spreadsheet
ws = wb.active
ws.title = "User"
ws['A1'] = 'Feature'
ws['B1'] = 'Customer-Testimonials'
ws['C1'] = 'Case Study'
ws['D1'] = 'Blog'
ws['E1'] = 'Press'
ws['F1'] = 'Total posts'
ws1 = wb.create_sheet(title="Ml")
ws1['A1'] = 'Feature'
ws1['B1'] = 'Phrase'
ws1['C1'] = 'Address'
ws1['D1'] = 'Tag Count'
worksheet = workbook.sheet_by_name(sheet_names[0])
worksheet1 = workbook1.sheet_by_name(sheet_names[0])
for linknumber in range(0, 25):
    u = worksheet1.cell(linknumber, 0).value
    url = 'www.' + u.lower() + '.com'
    print(url)
    r = ''
    while r == '':
        try:
            print("in loop")
            r = requests.get("http://" + url)
        except:
            sleep(3)  # if the code still gives that error then try increasing the sleep time to 5 maybe
    print(r)
    data = r.text
    #print data
    soup1 = BeautifulSoup(data, "html.parser")
    #print soup1
    num = 3  # starting row number and keep the column same.
    word = ''
    word = worksheet.cell(num, 3).value
    while not word == 'end':
        print(num)
        #print word
        tag_list = []
        phrase = []
        counts = []
        address = []
        counts = Counter(tag_list)
        for link in soup1.find_all('a'):
            #print link
            add = link.encode("ascii", "ignore")
            print(add)
            if not 'Log In' in add:
                #print link.get('href')
                i = 0
                content = ''
                for i in range(1, 5):
                    if content == '':
                        try:
                            print(link.get('href'))
                            i += 1
                            req = urllib.request.Request(link.get('href'))
                            with urllib.request.urlopen(req) as response:
                                content = response.read()
                        except:
                            sleep(3)
                            # if the code still gives that error then try increasing the sleep time to 5 maybe
                            continue
                soup = BeautifulSoup(content, "html.parser")
                s = soup(text=re.compile(word))
                if s:
                    print("TRUE")
                    add = link.encode('ascii', 'ignore')
                    print(type(add))
                    if 'customer-testimonial' in add:
                        b += 1
                    elif 'case-study' in add:
                        n += 1
                    elif 'blog' in add:
                        w += 1
                    elif 'press' in add:
                        p += 1
                    else:
                        o += 1
                    #phrase_type=["Customer testimonials","news","ads","twitter","facebook","instagram"]
                    #print(os.path.join(root, name))
                    print(add)
                    for tag in s:
                        parent_html = tag.parent.name
                        print(parent_html)
                        tag_list.append(parent_html)
                    phrase.append(s)
                    address.append(add)
                    #print str(phrase)
                    counts = Counter(tag_list)
                    page += 1
        else:
            counts = Counter(tag_list)
        no = num - 1
        print(counts)
        print(word)
        ws['A%d' % no] = word.encode('utf-8', 'ignore')
        ws1['A%d' % no] = word.encode('utf-8', 'ignore')
        print("Number of pages is %d" % page)
        print("Number of Customer testimonials posts is %d" % b)
        ws['B%d' % no] = b
        print("Number of Case Studies posts is %d" % n)
        ws['C%d' % no] = n
        print("Number of blog posts is %d" % w)
        ws['D%d' % no] = w
        print("Number of press posts is %d" % p)
        ws['E%d' % no] = p
        print("Number of posts is %d" % page)
        ws['F%d' % no] = page
        ws1['B%d' % no] = phrase.encode('utf-8', 'ignore')
        ws1['C%d' % no] = address.encode('utf-8', 'ignore')
        ws1['D%d' % no] = counts.encode('utf-8', 'ignore')
        counts.clear()
        num += 1
        word = worksheet.cell(num, 3).value
        #print word
        page = 0
        b = 0
        n = 0
        w = 0
        p = 0
        o = 0
        phrase = []
        address = []
        tag_list = []
    wb.save('%s.xlsx' % worksheet1.cell(linknumber, 0).value)
I get the following output and error while running the code:
www.amobee.com
in loop
<Response [200]>
3
Traceback (most recent call last):
File "C:/project_web_parser.py", line 69, in <module>
add = link.encode("ascii", "ignore")
File "C:\ProgramData\Ana3\lib\site-packages\bs4\element.py", line 1094, in encode
u = self.decode(indent_level, encoding, formatter)
File "C:\ProgramData\Ana3\lib\site-packages\bs4\element.py", line 1159, in decode
indent_space = (' ' * (indent_level - 1))
TypeError: unsupported operand type(s) for -: 'str' and 'int'
Process finished with exit code 1
The traceback shows the error in line 69, where you try to encode link. To fix it, just change that line to:
add = link.encode("ascii", errors="ignore")
Why does it happen?
Your link variable is of type bs4.element.Tag:
>>> type(link)
<class 'bs4.element.Tag'>
The .encode() method for tags takes more arguments than the .encode() method for strings.
In the source code of bs4, in the file \bs4\element.py on line 1089, you can find its definition:
def encode(self, encoding=DEFAULT_OUTPUT_ENCODING,
           indent_level=None, formatter="minimal",
           errors="xmlcharrefreplace"):
The first argument is encoding, the second is indent_level (int or None), and errors handling is the fourth.
The error
unsupported operand type(s) for -: 'str' and 'int'
means that you tried to subtract 'ignore' - 1: your second positional argument 'ignore' was bound to indent_level, and the method then computed indent_level - 1.
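A hedged follow-up to the fix above: since the script later runs substring checks like 'blog' in add, it may be simpler to take the link target as a plain str and skip encoding altogether. A self-contained sketch:

from bs4 import BeautifulSoup

soup = BeautifulSoup('<a href="/blog/post-1">Read more</a>', 'html.parser')
for link in soup.find_all('a'):
    href = link.get('href') or ''  # a plain str, so 'in' checks just work
    if 'blog' in href:
        print('blog link:', href)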
