I am having trouble scraping a list of hits.
For each year there is a hit list on a certain web page with a certain URL. The URL contains the year, so I'd like to make a single CSV file for each year with that year's hit list.
Unfortunately I cannot make it work sequentially and I get the following error:
ValueError: unknown url type: 'h'
Here is the code I am trying to use. I apologize if there are simple mistakes, but I'm a newbie in Python and I couldn't find any example in the forum to adapt to this case.
import urllib
import urllib.request
from bs4 import BeautifulSoup
from urllib.request import urlopen as uReq

years = list(range(1947,2016))

for year in years:
    my_urls = ('http://www.hitparadeitalia.it/hp_yends/hpe' + str(year) + '.htm')
    my_url = my_urls[0]
    for my_url in my_urls:
        uClient = uReq(my_url)
        html_input = uClient.read()
        uClient.close()
        page_soup = BeautifulSoup(html_input, "html.parser")
        container = page_soup.findAll("li")
        filename = "singoli" + str(year) + ".csv"
        f = open(singoli + str(year), "w")
        headers = "lista"
        f.write(headers)
        lista = container.text
        print("lista: " + lista)
        f.write(lista + "\n")
        f.close()
You think you are defining a tuple with ('http://www.hitparadeitalia.it/hp_yends/hpe' + str(year) + '.htm'), but you have just defined a plain string.
So you are looping over a string, that is, letter by letter, not URL by URL.
When you want to define a tuple with a single element you have to make it explicit with a trailing comma, for example: ("foo",).
Fix:
my_urls = ('http://www.hitparadeitalia.it/hp_yends/hpe' + str(year) + '.htm', )
Reference:
A special problem is the construction of tuples containing 0 or 1
items: the syntax has some extra quirks to accommodate these. Empty
tuples are constructed by an empty pair of parentheses; a tuple with
one item is constructed by following a value with a comma (it is not
sufficient to enclose a single value in parentheses). Ugly, but
effective.
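To see concretely why the error message complains about 'h', here is a small illustration (a sketch, not part of the original code):

year = 1947
not_a_tuple = ('http://www.hitparadeitalia.it/hp_yends/hpe' + str(year) + '.htm')   # just a string
one_tuple = ('http://www.hitparadeitalia.it/hp_yends/hpe' + str(year) + '.htm',)    # one-element tuple

print(type(not_a_tuple))  # <class 'str'>
print(type(one_tuple))    # <class 'tuple'>

# Looping over the string yields single characters, so urlopen receives 'h' first,
# which is exactly the "unknown url type: 'h'" error from the question.
for my_url in not_a_tuple:
    print(my_url)  # 'h', 't', 't', 'p', ...
    break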
Try this. Hope it will solve the issue:
import csv
import urllib.request
from bs4 import BeautifulSoup

outfile = open("hitparade.csv","w",newline='',encoding='utf8')
writer = csv.writer(outfile)

for year in range(1947,2016):
    my_urls = urllib.request.urlopen('http://www.hitparadeitalia.it/hp_yends/hpe' + str(year) + '.htm').read()
    soup = BeautifulSoup(my_urls, "lxml")
    [scr.extract() for scr in soup('script')]
    for container in soup.select(".li1,.liy,li"):
        writer.writerow([container.text.strip()])
        print("lista: " + container.text.strip())
outfile.close()
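If you really do want one CSV file per year, as the question describes, a minimal variation of the same approach might look like this (the per-year file name is my assumption, based on the "singoli" prefix in the question):

import csv
import urllib.request
from bs4 import BeautifulSoup

for year in range(1947, 2016):
    html = urllib.request.urlopen('http://www.hitparadeitalia.it/hp_yends/hpe' + str(year) + '.htm').read()
    soup = BeautifulSoup(html, "lxml")
    [scr.extract() for scr in soup('script')]
    # one file per year, e.g. singoli1947.csv
    with open("singoli" + str(year) + ".csv", "w", newline='', encoding='utf8') as outfile:
        writer = csv.writer(outfile)
        for container in soup.select(".li1,.liy,li"):
            writer.writerow([container.text.strip()])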
Related
To preface: I am quite new to Python and my HTML skills are kindergarten level.
I am trying to save the quotes from this website, which has many links in it, one for each of the US election candidates.
I have managed to get the actual code to extract the quotes (with the help of some Stack Overflow users), but I am lost on how to write these quotes into separate text files for each candidate.
For example, the first page, with all of Justin Amash's quotes, should be written to a file: JustinAmash.txt.
The second page, with all of Michael Bennet's quotes, should be written to MichaelBennet.txt (or something in that form), and so on. Is there a way to do this?
For reference, to scrape the pages, the following code works:
import bs4
from urllib.request import Request, urlopen as uReq, HTTPError
#Import HTTPError in order to avoid the links with no content/resource of interest
from bs4 import BeautifulSoup as soup_
import re

#define url of interest
my_url = 'http://archive.ontheissues.org/Free_Trade.htm'

def make_soup(url):
    # set up known browser user agent for the request to bypass HTMLError
    req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
    #opening up connection, grabbing the page
    uClient = uReq(req)
    page_html = uClient.read()
    uClient.close()
    #html is jumbled at the moment, so call html using soup function
    soup = soup_(page_html, "lxml")
    return soup

# Test: print title of page
#soup.title

soup = make_soup(my_url)
tags = soup.findAll("a", href=re.compile("javascript:pop\("))
#print(tags)

# open a text file and write it if it doesn't exist
file1 = open("Quotefile.txt", "w")

# get list of all URLs
for links in tags:
    link = links.get('href')
    if "java" in link:
        print("http://archive.ontheissues.org" + link[18:len(link)-3])
        main_url = "http://archive.ontheissues.org" + link[18:len(link)-3]
        try:
            sub_soup = make_soup(main_url)
            content_collexn = sub_soup.body.contents #Splitting up the page into contents for iterative access
            #text_data = [] #This list can be used to store data related to every person
            for item in content_collexn:
                #Accept an item if it belongs to the following classes
                if(type(item) == str):
                    print(item.get_text())
                elif(item.name == "h3"):
                    #Note that over here, every h3 tagged title has a string following it
                    print(item.get_text())
                    #Hence, grab that string too
                    print(item.next_sibling)
                elif(item.name in ["p", "ul", "ol"]):
                    print(item.get_text())
        except HTTPError: #Takes care of missing pages and related HTTP exception
            print("[INFO] Resource not found. Skipping to next link.")
You can store that text data in the list you had already started, text_data. Join all those items and then write them to a file.
So something like:
import bs4
from urllib.request import Request, urlopen as uReq, HTTPError
#Import HTTPError in order to avoid the links with no content/resource of interest
from bs4 import BeautifulSoup as soup_
import re

#define url of interest
my_url = 'http://archive.ontheissues.org/Free_Trade.htm'

def make_soup(url):
    # set up known browser user agent for the request to bypass HTMLError
    req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
    #opening up connection, grabbing the page
    uClient = uReq(req)
    page_html = uClient.read()
    uClient.close()
    #html is jumbled at the moment, so call html using soup function
    soup = soup_(page_html, "lxml")
    return soup

# Test: print title of page
#soup.title

soup = make_soup(my_url)
tags = soup.findAll("a", href=re.compile("javascript:pop\("))
#print(tags)

# open a text file and write it if it doesn't exist
#file1 = open("Quotefile.txt","w")

# get list of all URLs
candidates = []
for links in tags:
    link = links.get('href')
    if "java" in link:
        #print("http://archive.ontheissues.org" + link[18:len(link)-3])
        main_url = "http://archive.ontheissues.org" + link[18:len(link)-3]
        candidate = link.split('/')[-1].split('_Free_Trade')[0]
        if candidate in candidates:
            continue
        else:
            candidates.append(candidate)
        try:
            sub_soup = make_soup(main_url)
            content_collexn = sub_soup.body.contents #Splitting up the page into contents for iterative access
            text_data = [] #This list can be used to store data related to every person
            for item in content_collexn:
                #Accept an item if it belongs to the following classes
                if(type(item) == str):
                    #print(item.get_text())
                    text_data.append(item.get_text())
                elif(item.name == "h3"):
                    #Note that over here, every h3 tagged title has a string following it
                    #print(item.get_text())
                    text_data.append(item.get_text())
                    #Hence, grab that string too
                    #print(item.next_sibling)
                    text_data.append(item.next_sibling)
                elif(item.name in ["p", "ul", "ol"]):
                    #print(item.get_text())
                    text_data.append(item.get_text())
        except HTTPError: #Takes care of missing pages and related HTTP exception
            print("[INFO] Resource not found. Skipping to next link.")
            candidates.remove(candidate)
            continue
        text_data = '\n'.join(text_data)
        with open("C:/%s.txt" %(candidate), "w") as text_file:
            text_file.write(text_data)
        print('Acquired: %s' %(candidate))
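For clarity, here is a tiny standalone illustration of the candidate-name extraction used above; the href value is a made-up example in the javascript:pop(...) style the page appears to use, not taken from the site:

# Hypothetical href, only to show how the two split() calls behave
link = "javascript:pop('/Celeb/Justin_Amash_Free_Trade.htm')"
candidate = link.split('/')[-1].split('_Free_Trade')[0]
print(candidate)  # Justin_Amash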
I am trying Python + BeautifulSoup to loop through a website in order to find a matching string contained in a tag.
When the matching substring is found, the iteration should stop and the span should be printed; I can't find a way to make this work.
This is what I could manage to work out so far:
import urllib.request
from bs4 import BeautifulSoup as b

num = 1
base_url = "https://v-tac.it/led-products-results-page/?q="
request = '500'
separator = '&start='
page_num = "1"
url = base_url + request + separator + page_num

html = urllib.request.urlopen(url).read()
soup = b(html, "html.parser")

for i in range(100):
    for post in soup.findAll("div", {"class": "spacer"}):
        h = post.findAll("span")[0].text
        if "request" in h:
            break
        print(h)
    num += 1
    page_num = str(num)
    url = base_url + request + separator + page_num
    html = urllib.request.urlopen(url).read()
    soup = b(html, "html.parser")
    print("We are at page " + page_num)
But it doesn't return anything; it only cycles through the pages.
Thanks in advance for any help.
If it is in the text then, with bs4 4.7.1, you should be able to use :contains:
soup.select_one('.spacer span:contains("request")').text if soup.select_one('.spacer span:contains("request")') is not None else 'Not found'
I'm not sure why, when you have for i in range(100), you don't use i instead of num later; then you wouldn't need the += either.
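A minimal sketch combining the two suggestions (bs4 4.7.1 or newer assumed; the selector and search value are taken from the question and answer, so swap in whatever string you actually want to match):

import urllib.request
from bs4 import BeautifulSoup

base_url = "https://v-tac.it/led-products-results-page/?q="
request = '500'
separator = '&start='

for i in range(1, 101):  # use the loop counter directly instead of a separate num
    url = base_url + request + separator + str(i)
    soup = BeautifulSoup(urllib.request.urlopen(url).read(), "html.parser")
    hit = soup.select_one('.spacer span:contains("' + request + '")')
    if hit is not None:
        print(hit.text)
        break
    print("We are at page " + str(i))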
Here is my code:
from bs4 import BeautifulSoup
import requests
import csv
import pandas as pd

links = pd.read_csv('C:\\Users\\acer\\Desktop\\links.csv', encoding='utf-8', dtype=str)

for i in range(1,10):
    link = links.iloc[i,0]
    for count in range(1,5):
        r = requests.get(link + str(count))
        soup = BeautifulSoup(r.text,'lxml')
        ##comp links
        for links in soup.find_all('th',{"id":"c_name"}):
            link = links.find('a')
            li = link['href'][3:]
            print("https://www.hindustanyellowpages.in/Ahmedabad/" + li)
I am getting the below error :
TypeError: unsupported operand type(s) for +: 'Tag' and 'str'
In the last line of your source code you concatenate an instance of the Tag class (li) with the string containing the URL.
Try to extract the information that you want from the tag before the concatenation. For example, if you want the text of the link, use this:
print("https://www.hindustanyellowpages.in/Ahmedabad/" + li.text)
I finished my scraper for one page and extracted the href for the next page.
I can't get the scraper into a loop for each subsequent page. I tried a while True loop, but this kills my results from the first page.
This code works perfectly for the first page:
import bs4
from urllib.request import urlopen as ireq
from bs4 import BeautifulSoup as soup

myurl = ('https://www.podiuminfo.nl/concertagenda/')
uClient = ireq(myurl)
page_html = uClient.read()
uClient.close()
page_soup = soup(page_html, "html.parser")

filename = "db.csv"
f = open(filename, "w")
headers = "Artist, Venue, City, Date\n"
f.write(headers)

DayContainer = page_soup.findAll("section",{"class":"overflow"})
print("Days on page: " + str(len(DayContainer)) + "\n")

def NextPage():
    np = page_soup.findAll("section", {"class":"next_news"})
    np = np[0].find('a').attrs['href']
    print(np)

for days in DayContainer:
    shows = days.findAll("span", {"class":"concert_uitverkocht"})
    for soldout in shows:
        if shows:
            soldoutPlu = shows[0].parent.parent.parent
            artist = soldoutPlu.findAll("div", {"class":"td_2"})
            artist = artist[0].text.strip()
            venue = soldoutPlu.findAll("div", {"class":"td_3"})
            venue = venue[0].text
            city = soldoutPlu.findAll("div", {"class":"td_4"})
            city = city[0].text
            date = shows[0].parent.parent.parent.parent.parent
            date = date.findAll("section", {"class":"concert_agenda_date"})
            date = date[0].text
            date = date.strip().replace("\n", " ")
            print("Datum gevonden!")
            print("Artiest: " + artist)
            print("Locatie: " + venue)
            print("Stad: " + city)
            print("Datum: " + date + "\n")
            f.write(artist + "," + date + "," + city + "," + venue + "\n")
        else:
            pass

NextPage()
No need for a base URL + number method, I suppose, because I can extract the correct URL from each page using findAll. I'm fairly new, so the mistake must be pretty dumb.
Thanks for helping out!
Try the below script to get the required fields while traversing the different pages and write them to a csv file accordingly. I tried to clean up your repetitive coding and applied a slightly cleaner approach in its place. Give it a go:
import csv
from urllib.request import urlopen
from bs4 import BeautifulSoup

link = 'https://www.podiuminfo.nl/concertagenda/?page={}&input_plaats=&input_datum=2018-06-30&input_podium=&input_genre=&input_provincie=&sort=&input_zoek='

with open("output.csv","w",newline="",encoding="utf-8") as infile:
    writer = csv.writer(infile)
    writer.writerow(['Artist','Venue','City'])
    pagenum = -1  #make sure to get the content of the first page as well which is "0" in the link
    while True:
        pagenum += 1
        res = urlopen(link.format(pagenum)).read()
        soup = BeautifulSoup(res, "html.parser")
        container = soup.find_all("section",class_="concert_rows_info")
        if len(container) <= 1: break  ##as soon as there is no content the scraper should break out of the loop
        for items in container:
            artist = items.find(class_="td_2")("a")[0].get_text(strip=True)
            venue = items.find(class_="td_3").get_text(strip=True)
            city = items.find(class_="td_4").get_text(strip=True)
            writer.writerow([artist,city,venue])
            print(f'{artist}\n{venue}\n{city}\n')
Your mistake:
You have to fetch the url that you found; at the end of your file you just call NextPage(), but all it does is print out the url.
That was your mistake :)
import bs4
from urllib.request import urlopen as ireq
from bs4 import BeautifulSoup as soup

filename = "db.csv"
#at the beginning of the document you create the file in 'w' (write) mode,
#but later you should open it in 'a' (append) mode, because 'w' would overwrite the file
f = open(filename, "w")
headers = "Artist, Venue, City, Date\n"
f.write(headers)
f.close()

#create a function url_fetcher that will go and fetch the html every time
def url_fetcher(url):
    myurl = (url)
    uClient = ireq(myurl)
    page_html = uClient.read()
    uClient.close()
    page_soup = soup(page_html, "html.parser")
    DayContainer = page_soup.findAll("section", {"class":"overflow"})
    print("Days on page: " + str(len(DayContainer)) + "\n")
    get_artist(DayContainer, page_soup)

#here you have to fetch the url, otherwise it won't work
def NextPage(page_soup):
    np = page_soup.findAll("section", {"class":"next_news"})
    np = np[0].find('a').attrs['href']
    url_fetcher(np)

#get_artist still has some repetition, but with a little tweaking it will work
def get_artist(DayContainer, page_soup):
    for days in DayContainer:
        shows = days.findAll("span", {"class":"concert_uitverkocht"})
        for soldout in shows:
            print(soldout)
            if shows:
                soldoutPlu = shows[0].parent.parent.parent
                artist = soldoutPlu.findAll("div", {"class":"td_2"})
                artist = artist[0].text.strip()
                venue = soldoutPlu.findAll("div", {"class":"td_3"})
                venue = venue[0].text
                city = soldoutPlu.findAll("div", {"class":"td_4"})
                city = city[0].text
                date = shows[0].parent.parent.parent.parent.parent
                date = date.findAll("section", {"class":"concert_agenda_date"})
                date = date[0].text
                date = date.strip().replace("\n", " ")
                print("Datum gevonden!")
                print("Artiest: " + artist)
                print("Locatie: " + venue)
                print("Stad: " + city)
                print("Datum: " + date + "\n")
                with open(filename, "a") as f:
                    f.write(artist + "," + date + "," + city + "," + venue + "\n")
            else:
                pass
    NextPage(page_soup)

url_fetcher('https://www.podiuminfo.nl/concertagenda/')
Recap:
For easier understanding I've made it one big loop, but it works :)
You need to make some adjustments so that there are no repeated names and dates in db.csv.
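One caveat with the recursive approach above: once the last page has no "next" link anymore, np[0] will raise an IndexError. A small defensive variant of NextPage (the assumption that the final page simply lacks the next_news link is mine):

def NextPage(page_soup):
    np = page_soup.findAll("section", {"class":"next_news"})
    # stop quietly when there is no further page instead of raising an IndexError
    if np and np[0].find('a'):
        url_fetcher(np[0].find('a').attrs['href'])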
The script below returns 'UnicodeEncodeError: 'ascii' codec can't encode character '\xf8' in position 118: ordinal not in range(128)'
and I can't find a good explanation for it.
from urllib.request import urlopen
from bs4 import BeautifulSoup
import pandas as pd

results = {}
for page_num in range(0, 1000, 20):
    address = 'https://www.proff.no/nyetableringer?industryCode=p441&fromDate=22.01.2007&location=Nord-Norge&locationId=N&offset=' + str(page_num) + '&industry=Entreprenører'
    html = urlopen(address)
    soup = BeautifulSoup(html, 'lxml')
    table = soup.find_all(class_='table-condensed')
    output = pd.read_html(str(table))[0]
    results[page_num] = output

df = pd.concat([v for v in results.values()], axis=0)
You are using the standard library to open the url. This library forces the address to be encoded into ASCII, so non-ASCII characters like ø will throw a UnicodeEncodeError.
Lines 1116-1117 of http/client.py:
# Non-ASCII characters should have been eliminated earlier
self._output(request.encode('ascii'))
As an alternative to urllib.request, the third-party requests library is great:
import requests
address = 'https://www.proff.no/nyetableringer?industryCode=p441&fromDate=22.01.2007&location=Nord-Norge&locationId=N&offset=' + str(page_num) + '&industry=Entreprenører'
html = requests.get(address).text
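If you would rather stay with the standard library, here is a minimal sketch that percent-encodes the non-ASCII query value before opening the URL (parameter names copied from the question):

from urllib.parse import quote
from urllib.request import urlopen

page_num = 0  # as in the question's loop
address = ('https://www.proff.no/nyetableringer?industryCode=p441&fromDate=22.01.2007'
           '&location=Nord-Norge&locationId=N&offset=' + str(page_num)
           + '&industry=' + quote('Entreprenører'))
html = urlopen(address)  # the URL is now pure ASCII, so urlopen no longer raises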