The script below raises UnicodeEncodeError: 'ascii' codec can't encode character '\xf8' in position 118: ordinal not in range(128),
and I can't find a good explanation for it.
from urllib.request import urlopen
from bs4 import BeautifulSoup
import pandas as pd

results = {}
for page_num in range(0, 1000, 20):
    address = 'https://www.proff.no/nyetableringer?industryCode=p441&fromDate=22.01.2007&location=Nord-Norge&locationId=N&offset=' + str(page_num) + '&industry=Entreprenører'
    html = urlopen(address)
    soup = BeautifulSoup(html, 'lxml')
    table = soup.find_all(class_='table-condensed')
    output = pd.read_html(str(table))[0]
    results[page_num] = output

df = pd.concat([v for v in results.values()], axis=0)
You are using the standard library to open the URL. urllib forces the address to be encoded into ASCII, so non-ASCII characters like ø raise a UnicodeEncodeError. From lines 1116-1117 of http/client.py:
# Non-ASCII characters should have been eliminated earlier
self._output(request.encode('ascii'))
As an alternative to urllib.request, the third-party requests library is great:
import requests
address = 'https://www.proff.no/nyetableringer?industryCode=p441&fromDate=22.01.2007&location=Nord-Norge&locationId=N&offset=' + str(page_num) + '&industry=Entreprenører'
html = requests.get(address).text
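If you would rather stay with the standard library, one workaround (a sketch, not part of the original answer) is to percent-encode the non-ASCII characters yourself with urllib.parse.quote before calling urlopen:

from urllib.parse import quote
from urllib.request import urlopen

# Percent-encode non-ASCII characters (ø becomes %C3%B8) while leaving
# the URL's structural characters (:/?&=) intact.
safe_address = quote(address, safe=':/?&=')
html = urlopen(safe_address)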
I am trying to do an assignment: write a Python program that expands on http://www.py4e.com/code3/urllinks.py. The program will use urllib to read the HTML from the data files below, extract the href= values from the anchor tags, scan for a tag that is in a particular position relative to the first name in the list, follow that link, repeat the process a number of times, and report the last name found.
Actual problem: Start at: http://py4e-data.dr-chuck.net/known_by_Kylen.html
Find the link at position 18 (the first name is 1). Follow that link. Repeat this process 7 times. The answer is the last name that you retrieve.
Hint: The first character of the name of the last page that you will load is: P
Code I used:
import re
import urllib
import urllib.request
import urllib.parse
import urllib.error
from urllib.request import urlopen
from bs4 import BeautifulSoup
import ssl

ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

url = input('Enter URL:')
count = int(input('Enter count:'))
position = int(input('Enter position:')) - 1

html = urllib.urlopen(url, context=ctx).read()
soup = BeautifulSoup(html, "html.parser")
href = soup('a')
#print href

for i in range(count):
    link = href[position].get('href', None)
    print(href[position].contents[0])
    html = urllib.urlopen(link).read()
    soup = BeautifulSoup(html, "html.parser")
    href = soup('a')
But I got an error on html = urllib.urlopen(url, context=ctx).read():
AttributeError: module 'urllib' has no attribute 'urlopen'
Can anyone provide a solution for this?
You imported urlopen already but never used it. Instead you used urllib.urlopen, which doesn't exist in Python 3 (urlopen lives in urllib.request there).
Instead of using urllib.urlopen, just use urlopen.
Example:
from urllib.request import urlopen
# before: html = urllib.urlopen(url, context=ctx).read()
html = urlopen(url, context=ctx).read()
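Applied to the whole loop from the question, the corrected script might look like this (a sketch; note that the second urlopen call should also receive the SSL context, since the followed links may be served over HTTPS too):

from urllib.request import urlopen
from bs4 import BeautifulSoup
import ssl

# Disable certificate verification, as in the original script.
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

url = input('Enter URL:')
count = int(input('Enter count:'))
position = int(input('Enter position:')) - 1

html = urlopen(url, context=ctx).read()
soup = BeautifulSoup(html, "html.parser")
href = soup('a')

for i in range(count):
    link = href[position].get('href', None)
    print(href[position].contents[0])
    # Pass the same context here as well.
    html = urlopen(link, context=ctx).read()
    soup = BeautifulSoup(html, "html.parser")
    href = soup('a')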
I have a problem with my code, written in Python.
I would like to open the Telegram API URL with a modification, so that the item scraped from the site is sent to the chat.
# Import libraries
import requests
import urllib.request
import time
import sys
from bs4 import BeautifulSoup
stdoutOrigin=sys.stdout
sys.stdout = open("log.txt", "w")
# Set the URL you want to webscrape from
url = 'https://31asdasdasdasdasd.com/'
# Connect to the URL
response = requests.get(url)
# Parse HTML and save to BeautifulSoup object
soup = BeautifulSoup(response.text, "html.parser")
zapisane = ''
row = soup.find('strong')
print(">> Ilosc opinii ktora przeszla:")
send = print(row.get_text()) # Print row as text
import urllib.request
u = urllib.request.urlopen("https://api.telegram.org/botid:ts/sendMessage?chat_id=-3channel1&text=")
You likely want to use string formatting with a variable in your last line of code. Here's a helpful resource on string formatting: https://www.geeksforgeeks.org/python-format-function/
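For instance, with an f-string (a sketch that keeps the placeholder bot token and chat id from the question; urllib.parse.quote makes the scraped text safe to embed in a URL):

import urllib.parse
import urllib.request

text = row.get_text()
# URL-encode the text so spaces and special characters survive the request.
encoded = urllib.parse.quote(text)
u = urllib.request.urlopen(
    f"https://api.telegram.org/botid:ts/sendMessage?chat_id=-3channel1&text={encoded}"
)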
Here is my code:
from bs4 import BeautifulSoup
import requests
import csv
import pandas as pd
links = pd.read_csv('C:\\Users\\acer\\Desktop\\links.csv',encoding = 'utf-8',dtype=str)
for i in range(1, 10):
    link = links.iloc[i, 0]
    for count in range(1, 5):
        r = requests.get(link + str(count))
        soup = BeautifulSoup(r.text, 'lxml')
        # comp links
        for links in soup.find_all('th', {"id": "c_name"}):
            link = links.find('a')
            li = link['href'][3:]
            print("https://www.hindustanyellowpages.in/Ahmedabad/" + li)
I am getting the below error :
TypeError: unsupported operand type(s) for +: 'Tag' and 'str'
In the last line of your source code, you concatenated an instance of the Tag class (li) with the string that holds the URL.
Extract the information you want from the tag before the concatenation. For example, if you want the text inside the link, use this:
print("https://www.hindustanyellowpages.in/Ahmedabad/" + li.text)
I am having trouble scraping a list of hits.
For each year there is a hit list on a certain webpage with a certain URL. The URL contains the year, so I'd like to make a single CSV file for each year with that year's hit list.
Unfortunately I cannot make it work sequentially, and I get the following error:
ValueError: unknown url type: 'h'
Here is the code I am trying to use. I apologize if there are simple mistakes, but I'm a newbie in Python and I couldn't find an example in the forum to adapt to this case.
import urllib
import urllib.request
from bs4 import BeautifulSoup
from urllib.request import urlopen as uReq

years = list(range(1947, 2016))
for year in years:
    my_urls = ('http://www.hitparadeitalia.it/hp_yends/hpe' + str(year) + '.htm')
    my_url = my_urls[0]
    for my_url in my_urls:
        uClient = uReq(my_url)
        html_input = uClient.read()
        uClient.close()
        page_soup = BeautifulSoup(html_input, "html.parser")
        container = page_soup.findAll("li")
        filename = "singoli" + str(year) + ".csv"
        f = open(singoli + str(year), "w")
        headers = "lista"
        f.write(headers)
        lista = container.text
        print("lista: " + lista)
        f.write(lista + "\n")
        f.close()
You think you are defining a tuple with ('http://www.hitparadeitalia.it/hp_yends/hpe' + str(year) + '.htm'), but you have just defined a plain string.
So you are looping over a string, letter by letter, not URL by URL.
When you want to define a tuple with a single element you have to make it explicit with a trailing comma, for example: ("foo",).
Fix:
my_urls = ('http://www.hitparadeitalia.it/hp_yends/hpe' + str(year) + '.htm', )
Reference:
A special problem is the construction of tuples containing 0 or 1 items: the syntax has some extra quirks to accommodate these. Empty tuples are constructed by an empty pair of parentheses; a tuple with one item is constructed by following a value with a comma (it is not sufficient to enclose a single value in parentheses). Ugly, but effective.
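A quick way to see the difference (a minimal illustration):

url = 'http://www.hitparadeitalia.it/hp_yends/hpe1947.htm'
print(type((url)))     # <class 'str'>  -- parentheses alone don't make a tuple
print(type((url,)))    # <class 'tuple'> -- the trailing comma does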
Try this. Hopefully it will solve the issue:
import csv
import urllib.request
from bs4 import BeautifulSoup

outfile = open("hitparade.csv", "w", newline='', encoding='utf8')
writer = csv.writer(outfile)

for year in range(1947, 2016):
    my_urls = urllib.request.urlopen('http://www.hitparadeitalia.it/hp_yends/hpe' + str(year) + '.htm').read()
    soup = BeautifulSoup(my_urls, "lxml")
    # Drop <script> elements so their contents don't pollute the text.
    [scr.extract() for scr in soup('script')]
    for container in soup.select(".li1,.liy,li"):
        writer.writerow([container.text.strip()])
        print("lista: " + container.text.strip())

outfile.close()
I'm trying to make a proxy scraper; this is my code:
import bs4
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup as soup
import lxml
from contextlib import redirect_stdout

meh = []
pathf = '/home/user/tests.txt'

url = Request('https://www.path.to/table', headers={'User-Agent': 'Mozilla/5.0'})
page_html = urlopen(url).read()
page_soup = soup(page_html, features="xml")
final = page_soup.tbody
meh.append(final)

with open(pathf, 'w') as f:
    with redirect_stdout(f):
        print(meh[0].text.strip())
Now I want the text to show in a more readable way, because it currently comes out like this:
12.183.20.3615893USUnited StatesSocks5AnonymousYes11 seconds ago220.133.97.7445657TWTaiwanSocks5AnonymousYes11 seconds ago
How can I turn this text into a more readable file? Something like:
12.183.20.36 15893 US United States Socks5 Anonymous Yes 11 seconds ago (new line) ...
Here is the actual output without the .text.strip() formatting, after a jsbeautifier pass, if it helps:
https://ghostbin.com/paste/g56qe
You can extract all td elements as a list instead of extracting the complete table body:
final_list = page_soup.findAll('td')
and then get a list of text nodes:
list_of_text_nodes = [td.text.strip() for td in final_list]
Output:
[u'182.235.38.81', u'40748', u'TW', u'Taiwan', u'Socks5', u'Anonymous'...]
or get all the text nodes as a single string:
complete_text = " ".join([i.text.strip() for i in final_list])
Output:
'182.235.38.81 40748 TW Taiwan Socks5 Anonymous Yes 14 seconds ago ...'
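If you want one proxy per line, as in your desired output, one approach (a sketch, assuming each proxy sits in its own tr row inside the tbody) is to join the cells row by row:

# One line of output per table row: join that row's cells with spaces.
for row in page_soup.find_all('tr'):
    cells = [td.text.strip() for td in row.find_all('td')]
    if cells:  # skip header or empty rows
        print(" ".join(cells))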