I am playing around trying to extract some info from a webpage and I have the following code:
import re
from math import ceil
from urllib.request import urlopen as uReq, Request
from bs4 import BeautifulSoup as soup

InitUrl = "https://mtgsingles.gr/search?q="
NumOfCrawledPages = 0
URL_Next = ""
NumOfPages = 5

for i in range(0, NumOfPages):
    if i == 0:
        Url = InitUrl
    else:
        Url = URL_Next

    UClient = uReq(Url)  # downloading the url
    page_html = UClient.read()
    UClient.close()

    page_soup = soup(page_html, "html.parser")
    cards = page_soup.findAll("div", {"class": ["iso-item", "item-row-view"]})

    for card in cards:
        card_name = card.div.div.strong.span.contents[3].contents[0].replace("\xa0 ", "")

        if len(card.div.contents) > 3:
            cardP_T = card.div.contents[3].contents[1].text.replace("\n", "").strip()
        else:
            cardP_T = "Does not exist"

        cardType = card.contents[3].text
        print(card_name + "\n" + cardP_T + "\n" + cardType + "\n")

    try:
        URL_Next = "https://mtgsingles.gr" + page_soup.findAll("li", {"class": "next"})[0].contents[0].get("href")
        print("The next URL is: " + URL_Next + "\n")
    except IndexError:
        print("Crawling process completed! No more information to retrieve!")
    else:
        print("The next URL is: " + URL_Next + "\n")
        NumOfCrawledPages += 1
        Url = URL_Next
    finally:
        print("Moving to page : " + str(NumOfCrawledPages + 1) + "\n")
The code runs fine and no errors occur, but the results are not as expected. I am trying to extract some information from the page as well as the url of the next page. Ultimately I would like the program to run 5 times and crawl 5 pages. But this code crawls the initial page given (InitUrl="xyz.com") all 5 times and does not proceed to the next-page url that is extracted. I tried debugging it by adding some print statements to see where the problem lies, and I think it lies in these statements:
UClient = uReq(Url)
page_html = UClient.read()
UClient.close()
For some reason urlopen does not work repeatedly in the for loop. Why does this happen? Is it wrong to use urlopen in a for statement?
This site loads its data via an Ajax request, so you must send post requests to get the data.
Tip: select the url correctly, for example: https://mtgsingles.gr/search?ajax=products-listing&q=
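A minimal sketch of that idea (the endpoint, its parameters, and whether it really wants a POST are assumptions taken from the tip above; check the browser's network tab to confirm):
import requests
from bs4 import BeautifulSoup

# Hypothetical Ajax endpoint based on the tip above; verify the exact
# parameters and HTTP method in your browser's network tab.
ajax_url = "https://mtgsingles.gr/search?ajax=products-listing&q="
resp = requests.post(ajax_url)  # the answer suggests a post request
page_soup = BeautifulSoup(resp.text, "html.parser")
cards = page_soup.findAll("div", {"class": ["iso-item", "item-row-view"]})
print("Cards returned by the Ajax endpoint:", len(cards))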
Related
I am using Python + BeautifulSoup to loop through a website in order to find a matching string contained in a tag.
When the matching substring is found, I want to stop the iteration and print the span, but I can't find a way to make this work.
This is what I could manage to work out so far:
import urllib.request
from bs4 import BeautifulSoup as b

num = 1
base_url = "https://v-tac.it/led-products-results-page/?q="
request = '500'
separator = '&start='
page_num = "1"
url = base_url + request + separator + page_num
html = urllib.request.urlopen(url).read()
soup = b(html, "html.parser")

for i in range(100):
    for post in soup.findAll("div", {"class": "spacer"}):
        h = post.findAll("span")[0].text
        if "request" in h:
            break
            print(h)
    num += 1
    page_num = str(num)
    url = base_url + request + separator + page_num
    html = urllib.request.urlopen(url).read()
    soup = b(html, "html.parser")
    print("We are at page " + page_num)
But it doesn't return anything, it only cycles through the pages.
Thanks in advance for any help
If it is in the text, then with bs4 4.7.1 you should be able to use :contains:
soup.select_one('.spacer span:contains("request")').text if soup.select_one('.spacer span:contains("request")') is not None else 'Not found'
I'm also not sure why, when you have for i in range(100), you don't use i instead of num later; then you wouldn't need the += bookkeeping.
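A minimal sketch of plugging that selector into the paging loop (assumes bs4 4.7.1+ as noted above; the query string '500' and the 100-page cap are carried over from the question):
import urllib.request
from bs4 import BeautifulSoup as b

base_url = "https://v-tac.it/led-products-results-page/?q="
request = '500'
separator = '&start='

for page in range(1, 101):
    url = base_url + request + separator + str(page)
    soup = b(urllib.request.urlopen(url).read(), "html.parser")
    # look for the literal query string rather than the word "request"
    hit = soup.select_one('.spacer span:contains("' + request + '")')
    print("We are at page " + str(page))
    if hit is not None:
        print(hit.text)
        break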
I am trying to scrape all the test match details from Cricinfo using bs4, but it is showing HTTP Error 504: Gateway Timeout. I do get the details of some test matches, but after that nothing shows.
I need to scrape the details of 2000 test matches; this is my code:
import os
import time
import unicodedata
import urllib.request as req
from urllib.parse import urljoin
from bs4 import BeautifulSoup

BASE_URL = 'http://www.espncricinfo.com'

if not os.path.exists('./espncricinfo-fc'):
    os.mkdir('./espncricinfo-fc')

for i in range(0, 2000):
    soupy = BeautifulSoup(req.urlopen('http://search.espncricinfo.com/ci/content/match/search.html?search=test;all=1;page=' + str(i)).read())
    time.sleep(1)
    for new_host in soupy.findAll('a', {'class': 'srchPlyrNmTxt'}):
        try:
            new_host = new_host['href']
        except:
            continue
        odiurl = BASE_URL + urljoin(BASE_URL, new_host)
        new_host = unicodedata.normalize('NFKD', new_host).encode('ascii', 'ignore')
        print(new_host)
        html = req.urlopen(odiurl).read()
        if html:
            with open('espncricinfo-fc/{0!s}'.format(str.split(new_host, "/")[4]), "wb") as f:
                f.write(html)
            print(html)
        else:
            print("no html")
This usually happens when making multiple requests too fast; the server may be down, or your connection may be blocked by the server's firewall. Try increasing your sleep() or adding a random sleep:
import random
.....
for i in range(0, 2000):
    soupy = BeautifulSoup(....)
    time.sleep(random.randint(2, 6))
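If the timeouts persist even with longer sleeps, retrying with a growing delay is another common option. A rough sketch, assuming you keep urllib for the page fetch (the helper name, retry count, and delays are my own choices, not part of the original answer):
import time
import random
import urllib.request as req
from urllib.error import HTTPError

def fetch_with_retry(url, retries=3):
    # retry on gateway errors with a growing, randomised delay before giving up
    for attempt in range(retries):
        try:
            return req.urlopen(url).read()
        except HTTPError as e:
            if e.code in (503, 504) and attempt < retries - 1:
                time.sleep(random.randint(2, 6) * (attempt + 1))
            else:
                raise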
Not sure why; it seems to be working for me.
I made a few changes in the loop through the links. I'm not sure how you want the output to look when written to your file, so I left that part alone. But like I said, it seems to be working fine on my end.
import bs4
import requests
import os
import time
import urllib.request as req

BASE_URL = 'http://www.espncricinfo.com'

if not os.path.exists('C:/espncricinfo-fc'):
    os.mkdir('C:/espncricinfo-fc')

for i in range(0, 2000):
    url = 'http://search.espncricinfo.com/ci/content/match/search.html?search=test;all=1;page=%s' % i
    html = requests.get(url)
    print('Checking page %s of 2000' % (i + 1))
    soupy = bs4.BeautifulSoup(html.text, 'html.parser')
    time.sleep(1)
    for new_host in soupy.findAll('a', {'class': 'srchPlyrNmTxt'}):
        try:
            new_host = new_host['href']
        except:
            continue
        odiurl = BASE_URL + new_host
        new_host = odiurl
        print(new_host)
        html = req.urlopen(odiurl).read()
        if html:
            with open('C:/espncricinfo-fc/{0!s}'.format('_'.join(str.split(new_host, "/")[4:])), "wb") as f:
                f.write(html)
            #print(html)
        else:
            print("no html")
I finished my scraper for one page and extracted the href for the next page.
I can't get the scraper into a loop for each subsequent page. I tried a while True loop, but this kills my results from the first page.
This code works perfectly for the first page:
import bs4
from urllib.request import urlopen as ireq
from bs4 import BeautifulSoup as soup

myurl = ('https://www.podiuminfo.nl/concertagenda/')
uClient = ireq(myurl)
page_html = uClient.read()
uClient.close()
page_soup = soup(page_html, "html.parser")

filename = "db.csv"
f = open(filename, "w")
headers = "Artist, Venue, City, Date\n"
f.write(headers)

DayContainer = page_soup.findAll("section", {"class": "overflow"})
print("Days on page: " + str(len(DayContainer)) + "\n")

def NextPage():
    np = page_soup.findAll("section", {"class": "next_news"})
    np = np[0].find('a').attrs['href']
    print(np)

for days in DayContainer:
    shows = days.findAll("span", {"class": "concert_uitverkocht"})
    for soldout in shows:
        if shows:
            soldoutPlu = shows[0].parent.parent.parent
            artist = soldoutPlu.findAll("div", {"class": "td_2"})
            artist = artist[0].text.strip()
            venue = soldoutPlu.findAll("div", {"class": "td_3"})
            venue = venue[0].text
            city = soldoutPlu.findAll("div", {"class": "td_4"})
            city = city[0].text
            date = shows[0].parent.parent.parent.parent.parent
            date = date.findAll("section", {"class": "concert_agenda_date"})
            date = date[0].text
            date = date.strip().replace("\n", " ")
            print("Datum gevonden!")
            print("Artiest: " + artist)
            print("Locatie: " + venue)
            print("Stad: " + city)
            print("Datum: " + date + "\n")
            f.write(artist + "," + date + "," + city + "," + venue + "\n")
        else:
            pass

NextPage()
No need for a baseurl + number method I suppose, because I can extract the correct url from each page using findAll. I'm fairly new so the mistake must be pretty dumb.
Thanks for helping out!
Try the script below to get the required fields while traversing the different pages and write them to a csv file. I tried to clean up your repetitive code and applied a slightly cleaner approach in its place. Give it a go:
import csv
from urllib.request import urlopen
from bs4 import BeautifulSoup

link = 'https://www.podiuminfo.nl/concertagenda/?page={}&input_plaats=&input_datum=2018-06-30&input_podium=&input_genre=&input_provincie=&sort=&input_zoek='

with open("output.csv", "w", newline="", encoding="utf-8") as infile:
    writer = csv.writer(infile)
    writer.writerow(['Artist', 'Venue', 'City'])
    pagenum = -1  # make sure to get the content of the first page as well, which is "0" in the link
    while True:
        pagenum += 1
        res = urlopen(link.format(pagenum)).read()
        soup = BeautifulSoup(res, "html.parser")
        container = soup.find_all("section", class_="concert_rows_info")
        if len(container) <= 1: break  # as soon as there is no content the scraper should break out of the loop
        for items in container:
            artist = items.find(class_="td_2")("a")[0].get_text(strip=True)
            venue = items.find(class_="td_3").get_text(strip=True)
            city = items.find(class_="td_4").get_text(strip=True)
            writer.writerow([artist, city, venue])
            print(f'{artist}\n{venue}\n{city}\n')
Your mistake
You have to fetch the url that you found. At the end of your file you are just calling NextPage(), but all it does is print out the url.
That was your mistake :)
import bs4
from urllib.request import urlopen as ireq
from bs4 import BeautifulSoup as soup

filename = "db.csv"
# at the beginning of the document you create the file in 'w' (write) mode,
# but later you should open it in 'a' (append) mode, because 'w' would rewrite the file
f = open(filename, "w")
headers = "Artist, Venue, City, Date\n"
f.write(headers)
f.close()

# create a function url_fetcher that will go and fetch the html every time
def url_fetcher(url):
    myurl = (url)
    uClient = ireq(myurl)
    page_html = uClient.read()
    uClient.close()
    page_soup = soup(page_html, "html.parser")
    DayContainer = page_soup.findAll("section", {"class": "overflow"})
    print("Days on page: " + str(len(DayContainer)) + "\n")
    get_artist(DayContainer, page_soup)

# here you have to fetch the url, otherwise it won't work
def NextPage(page_soup):
    np = page_soup.findAll("section", {"class": "next_news"})
    np = np[0].find('a').attrs['href']
    url_fetcher(np)

# get_artist still has some repetition, but with a little tweaking it will work
def get_artist(DayContainer, page_soup):
    for days in DayContainer:
        shows = days.findAll("span", {"class": "concert_uitverkocht"})
        for soldout in shows:
            print(soldout)
            if shows:
                soldoutPlu = shows[0].parent.parent.parent
                artist = soldoutPlu.findAll("div", {"class": "td_2"})
                artist = artist[0].text.strip()
                venue = soldoutPlu.findAll("div", {"class": "td_3"})
                venue = venue[0].text
                city = soldoutPlu.findAll("div", {"class": "td_4"})
                city = city[0].text
                date = shows[0].parent.parent.parent.parent.parent
                date = date.findAll("section", {"class": "concert_agenda_date"})
                date = date[0].text
                date = date.strip().replace("\n", " ")
                print("Datum gevonden!")
                print("Artiest: " + artist)
                print("Locatie: " + venue)
                print("Stad: " + city)
                print("Datum: " + date + "\n")
                with open(filename, "a") as f:
                    f.write(artist + "," + date + "," + city + "," + venue + "\n")
            else:
                pass
    NextPage(page_soup)

url_fetcher('https://www.podiuminfo.nl/concertagenda/')
url_fetcher('https://www.podiuminfo.nl/concertagenda/')
Recap
For easier understanding I've made it one big loop, but it works :)
You will need to make some adjustments so there are no repeated names and dates in db.csv, as sketched below.
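A rough sketch of one way to do that deduplication (the helper name and the choice of the (artist, date) pair as the key are my own assumptions, not part of the original answer):
filename = "db.csv"
seen = set()  # (artist, date) pairs already written to the csv

def write_row(artist, date, city, venue):
    # skip rows whose artist/date pair was already written
    key = (artist, date)
    if key in seen:
        return
    seen.add(key)
    with open(filename, "a") as f:
        f.write(artist + "," + date + "," + city + "," + venue + "\n")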
I have been using Beautiful Soup for parsing webpages for some data extraction. It has worked perfectly well for me so far, on other webpages. However, I'm now trying to count the number of <a> tags in this page:
from bs4 import BeautifulSoup
import requests

catsection = "cricket"
url_base = "http://www.dnaindia.com/"
i = 89
url = url_base + catsection + "?page=" + str(i)
print(url)
# This is the page I'm trying to parse and also the one in the hyperlink
# I get the correct url I'm looking for at this stage
r = requests.get(url)
data = r.text
soup = BeautifulSoup(data, 'html.parser')
j = 0
for num in soup.find_all('a'):
    j = j + 1
print(j)
I'm getting the output as 0. This makes me think that the two lines after r = requests.get(url) are probably not working (there's obviously no chance that there are zero <a> tags on the page), and I'm not sure what alternative solution I can use here. Has anybody found a solution or faced a similar kind of problem before?
Thanks in advance.
You need to pass some information along with the request to the server (here, a User-agent header).
The following code should work... You can play around with other parameters as well:
from bs4 import BeautifulSoup
import requests

catsection = "cricket"
url_base = "http://www.dnaindia.com/"
i = 89
url = url_base + catsection + "?page=" + str(i)
print(url)
headers = {
    'User-agent': 'Mozilla/5.0'
}
# This is the page I'm trying to parse and also the one in the hyperlink
# I get the correct url I'm looking for at this stage
r = requests.get(url, headers=headers)
data = r.text
soup = BeautifulSoup(data, 'html.parser')
j = 0
for num in soup.find_all('a'):
    j = j + 1
print(j)
Put any url in the parser and check the number of "a" tags available on that page:
from bs4 import BeautifulSoup
import requests
url_base = "http://www.dnaindia.com/cricket?page=1"
res = requests.get(url_base, headers={'User-agent': 'Existed'})
soup = BeautifulSoup(res.text, 'html.parser')
a_tag = soup.select('a')
print(len(a_tag))
So I'm trying to get all the statistics from the statistics box on the page for each team. An example of what the page looks like is at the hyperlink below. I'm trying to have it print out:
month : win %
month : win %
All time: win%
But I am not too sure how to write that code, since the last piece of code I wrote in the main block was giving me an error.
http://www.gosugamers.net/counterstrike/teams/16448-nasty-gravy-runners
import time
import requests
from bs4 import BeautifulSoup

def get_all(url, base):  # When called it will print all the team links
    r = requests.get(url)
    page = r.text
    soup = BeautifulSoup(page, 'html.parser')
    for team_links in soup.select('div.details h3 a'):
        members = int(team_links.find_next('th', text='Members:').find_next_sibling('td').text.strip().split()[0])
        if members < 5:
            continue
        yield base + team_links['href']
    next_page = soup.find('div', {'class': 'pages'}).find('span', text='Next')
    while next_page:
        # Gives the server a break
        time.sleep(0.2)
        r = requests.get(BASE_URL + next_page.find_previous('a')['href'])
        page = r.text
        soup = BeautifulSoup(page)
        for team_links in soup.select('div.details h3 a'):
            yield BASE_URL + team_links['href']
        next_page = soup.find('div', {'class': 'pages'}).find('span', text='Next')

if __name__ == '__main__':
    BASE_URL = 'http://www.gosugamers.net'
    URL = 'http://www.gosugamers.net/counterstrike/teams'
    for links in get_all(URL, BASE_URL):  # When run it will generate all the links for all the teams
        r = requests.get(links)
        page = r.content
        soup = BeautifulSoup(page)
        for statistics in soup.select('div.statistics tr'):
            win_rate = int(statistics.find('th', text='Winrate:').find_next_sibling('td'))
            print(win_rate)
Not sure exactly what you want but this will get all the team stats:
from bs4 import BeautifulSoup, Tag
import requests
soup = BeautifulSoup(requests.get("http://www.gosugamers.net/counterstrike/teams/16448-nasty-gravy-runners").content)
table = soup.select_one("table.stats-table")
head1 = [th.text.strip() for th in table.select("tr.header th") if th.text]
head2 = [th.text.strip() for th in table.select_one("tr + tr") if isinstance(th, Tag)]
scores = [th.text.strip() for th in table.select_one("tr + tr + tr") if isinstance(th, Tag)]
print(head1, head2, scores)
Output:
([u'Jun', u'May', u'All time'], [u'Winrate:', u'0%', u'0%', u'0%'], [u'Matches played:', u'0 / 0 / 0', u'0 / 0 / 0', u'0 / 0 / 0'])
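To get closer to the "month : win %" format the question asks for, you could pair the month headers with the winrate row from the same parse (a sketch that reuses head1 and head2 from the snippet above):
months = head1        # e.g. ['Jun', 'May', 'All time']
winrates = head2[1:]  # drop the leading 'Winrate:' label
for month, rate in zip(months, winrates):
    print(month + ": " + rate)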