Pagination Webscraping Python3- BS4 - While loop

I finished my scraper for one page and extracted the href for the next page.
I can't get the scraper to loop over each subsequent page. I tried a while True loop, but that wipes out my results from the first page.
This code works perfectly for the first page:
import bs4
from urllib.request import urlopen as ireq
from bs4 import BeautifulSoup as soup
myurl = ('https://www.podiuminfo.nl/concertagenda/')
uClient = ireq(myurl)
page_html = uClient.read()
uClient.close()
page_soup = soup(page_html, "html.parser")
filename = "db.csv"
f = open(filename, "w")
headers = "Artist, Venue, City, Date\n"
f.write(headers)
DayContainer = page_soup.findAll("section",{"class":"overflow"})
print("Days on page: " + str(len(DayContainer)) + "\n")
def NextPage():
    np = page_soup.findAll("section", {"class":"next_news"})
    np = np[0].find('a').attrs['href']
    print(np)
for days in DayContainer:
    shows = days.findAll("span", {"class":"concert_uitverkocht"})
    for soldout in shows:
        if shows:
            soldoutPlu = shows[0].parent.parent.parent
            artist = soldoutPlu.findAll("div", {"class":"td_2"})
            artist = artist[0].text.strip()
            venue = soldoutPlu.findAll("div", {"class":"td_3"})
            venue = venue[0].text
            city = soldoutPlu.findAll("div", {"class":"td_4"})
            city = city[0].text
            date = shows[0].parent.parent.parent.parent.parent
            date = date.findAll("section", {"class":"concert_agenda_date"})
            date = date[0].text
            date = date.strip().replace("\n", " ")
            print("Datum gevonden!")
            print("Artiest: " + artist)
            print("Locatie: " + venue)
            print("Stad: " + city)
            print("Datum: " + date+ "\n")
            f.write(artist + "," + date + "," + city + "," + venue + "\n")
        else:
            pass
NextPage()
No need for a baseurl + number method I suppose, because I can extract the correct url from each page using findAll. I'm fairly new so the mistake must be pretty dumb.
Thanks for helping out!

Try the script below. It walks through the pages, collects the required fields, and writes them to a CSV file. I cleaned up the repetitive parts of your code along the way. Give it a go:
import csv
from urllib.request import urlopen
from bs4 import BeautifulSoup
link = 'https://www.podiuminfo.nl/concertagenda/?page={}&input_plaats=&input_datum=2018-06-30&input_podium=&input_genre=&input_provincie=&sort=&input_zoek='
with open("output.csv", "w", newline="", encoding="utf-8") as outfile:
    writer = csv.writer(outfile)
    writer.writerow(['Artist', 'Venue', 'City'])
    pagenum = -1  # start at -1 so the first request fetches the first page, which is "0" in the link
    while True:
        pagenum += 1
        res = urlopen(link.format(pagenum)).read()
        soup = BeautifulSoup(res, "html.parser")
        container = soup.find_all("section", class_="concert_rows_info")
        if len(container) <= 1:
            break  # as soon as there is no content, the scraper leaves the loop
        for items in container:
            artist = items.find(class_="td_2")("a")[0].get_text(strip=True)
            venue = items.find(class_="td_3").get_text(strip=True)
            city = items.find(class_="td_4").get_text(strip=True)
            writer.writerow([artist, venue, city])  # same column order as the header row
            print(f'{artist}\n{venue}\n{city}\n')

Your mistake
You have to fetch the URL that you found. At the end of your file you just call NextPage(), but all that function does is print the URL.
That was your mistake :)
import bs4
from urllib.request import urlopen as ireq
from bs4 import BeautifulSoup as soup
filename = "db.csv"
# at the beginning of the script, create the file in "w" (write) mode
# later on, open it in "a" (append) mode, because "w" would overwrite the file
f = open(filename, "w")
headers = "Artist, Venue, City, Date\n"
f.write(headers)
f.close()
# url_fetcher fetches and parses the HTML every time it is called
def url_fetcher(url):
    myurl = (url)
    uClient = ireq(myurl)
    page_html = uClient.read()
    uClient.close()
    page_soup = soup(page_html, "html.parser")
    DayContainer = page_soup.findAll("section",{"class":"overflow"})
    print("Days on page: " + str(len(DayContainer)) + "\n")
    get_artist(DayContainer, page_soup)
# here you have to actually fetch the next URL, otherwise it won't work
def NextPage(page_soup):
    np = page_soup.findAll("section", {"class":"next_news"})
    np = np[0].find('a').attrs['href']
    url_fetcher(np)
# get_artist still has some repetition, but with a little tweaking it works
def get_artist(DayContainer, page_soup):
    for days in DayContainer:
        shows = days.findAll("span", {"class":"concert_uitverkocht"})
        for soldout in shows:
            print(soldout)
            if shows:
                soldoutPlu = shows[0].parent.parent.parent
                artist = soldoutPlu.findAll("div", {"class":"td_2"})
                artist = artist[0].text.strip()
                venue = soldoutPlu.findAll("div", {"class":"td_3"})
                venue = venue[0].text
                city = soldoutPlu.findAll("div", {"class":"td_4"})
                city = city[0].text
                date = shows[0].parent.parent.parent.parent.parent
                date = date.findAll("section", {"class":"concert_agenda_date"})
                date = date[0].text
                date = date.strip().replace("\n", " ")
                print("Datum gevonden!")
                print("Artiest: " + artist)
                print("Locatie: " + venue)
                print("Stad: " + city)
                print("Datum: " + date+ "\n")
                with open(filename, "a") as f:
                    f.write(artist + "," + date + "," + city + "," + venue + "\n")
            else:
                pass
    NextPage(page_soup)
url_fetcher('https://www.podiuminfo.nl/concertagenda/')
Recap
For easier understanding I made one big loop, but it works :)
You still need to make some adjustments so that there are no repeated names and dates in db.csv.
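The recursive approach calls url_fetcher again for every page, so the call stack grows with the number of pages, and np[0] raises an IndexError once a page has no next link. An iterative while loop is another option. Below is a minimal sketch of that loop only. It assumes a hypothetical callable fetch_page(url) that would wrap the urlopen and BeautifulSoup calls and return (rows, next_url), with next_url set to None on the last page. The names crawl, fetch_page, max_pages and FAKE_SITE are illustrative and not part of the original code.

```python
def crawl(start_url, fetch_page, max_pages=50):
    """Follow 'next page' links iteratively and collect every row.

    fetch_page is a hypothetical callable: it takes a URL and returns
    (rows, next_url), where next_url is None on the last page.
    """
    rows = []
    seen = set()          # URLs already visited, to avoid endless cycles
    url = start_url
    while url and url not in seen and len(seen) < max_pages:
        seen.add(url)
        page_rows, next_url = fetch_page(url)
        rows.extend(page_rows)   # results from earlier pages are kept
        url = next_url
    return rows


# Stand-in for the real site: three fake pages linked by "next" URLs.
FAKE_SITE = {
    "page1": ([("Artist A", "Venue A")], "page2"),
    "page2": ([("Artist B", "Venue B")], "page3"),
    "page3": ([("Artist C", "Venue C")], None),
}

all_rows = crawl("page1", FAKE_SITE.get)
print(all_rows)
```

Because the rows are accumulated in one list outside the per-page work, fetching the next page cannot overwrite the results of the first one, and the file can be written once at the end.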

Related

How to write to a csv from python

I am web scraping with python from pacsun.com and I am trying to put it into a csv file but when I open the file only the headers print and not the product_name, price, or the new_arrival.
So my question is how do I get these values to print out under the headers in a csv file?
from bs4 import BeautifulSoup as soup
import csv
my_url = ('https://www.pacsun.com/mens/')
uClient = uReq(my_url)
page_html = uClient.read()
uClient.close()
page_soup = soup(page_html, 'html.parser')
product_data = page_soup.findAll('div',{'class':'product-data'})
#print(len(product_data))
#print(product_data[0])
product = product_data[0]
filename = 'pacsun.csv'
f = open(filename,'w')
headers = 'product_name, price, new_arrival\n'
f.write(headers)
for product in product_data:
    #name = product.div.a["title"]
    product_name = print('product: ' + product.div.a["title"])
    #the code above gets the title of the product
    price = product.findAll('div',{'class':'product-price group'})
    #the code above gets the price of the product
    new_arrival = product.findAll('div',{'class':'new'})
    #the code above gets
    print(price[0].text)
    print(new_arrival[0].text)
    thewriter = csv.DictWriter(filename, headers)
    thewriter.writerow({'product_name':product_name, 'price':price, 'new_arrival':new_arrival})
    #f.write(product_name.replace(",", "|") + "," + price + ","+ new_arrival + "\n")
f.close()

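The problem is in how the data is written. product_name is assigned the return value of print(), which is None, and csv.DictWriter is given the filename string instead of the open file. I fixed that and it works fine. In the version below the file is opened in append mode, f = open(filename,'a'), and f.write is called inside the loop: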
from bs4 import BeautifulSoup as soup
import csv
from urllib.request import urlopen as uReq
my_url = ('https://www.pacsun.com/mens/')
uClient = uReq(my_url)
page_html = uClient.read()
uClient.close()
page_soup = soup(page_html, 'html.parser')
product_data = page_soup.findAll('div',{'class':'product-data'})
#print(len(product_data))
#print(product_data[0])
product = product_data[0]
filename = 'pacsun.csv'
f = open(filename,"a")
headers = 'product_name, price, new_arrival\n'
f.write(headers)
for product in product_data:
    #name = product.div.a["title"]
    product_name = print('product: ' + product.div.a["title"])
    #the code above gets the title of the product
    price = product.findAll('div',{'class':'product-price group'})
    #the code above gets the price of the product
    new_arrival = product.findAll('div',{'class':'new'})
    price_ = ''
    new_arrival_ = ''
    product_name_ = ''
    # product_name_ = ' '.join([str(elem) for elem in product.div.a["title"]])
    for price_text in price:
        price_ = price_text.text
    for new_arrival_text in new_arrival:
        new_arrival_ = new_arrival_text.text
    f.write(product.div.a["title"]+","+price_+ "," + new_arrival_ + "\n")
f.close()
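If you would rather keep csv.DictWriter from the question, it needs an open file object and a list of field names, not the filename and the header string. A minimal sketch, assuming the scraped values have already been reduced to plain strings; the sample rows are made up for illustration, write_products is a hypothetical helper name, and io.StringIO stands in for the real file so the result can be inspected:

```python
import csv
import io


def write_products(out_file, products):
    """Write product dictionaries to an already opened text file."""
    fieldnames = ["product_name", "price", "new_arrival"]
    writer = csv.DictWriter(out_file, fieldnames=fieldnames)
    writer.writeheader()              # header row, written once
    for product in products:
        writer.writerow(product)      # one CSV row per product


# Made-up sample data standing in for the scraped values.
buffer = io.StringIO()
write_products(buffer, [
    {"product_name": "Plain Tee", "price": "$19.95", "new_arrival": "New"},
    {"product_name": "Shorts, Blue", "price": "$34.95", "new_arrival": ""},
])
csv_text = buffer.getvalue()
print(csv_text)
```

With a real file, open it with open(filename, "w", newline="") and pass the file object in place of buffer. The writer quotes the product name that contains a comma, so no manual replace is needed.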

Parse through website with Beautiful Soup to find matching Data

I am trying to use Python + BeautifulSoup to loop through a website and find a matching string contained in a tag.
When the matching substring is found, the iteration should stop and print the span, but I can't find a way to make this work.
This is what I have managed to work out so far:
import urllib.request
from bs4 import BeautifulSoup as b
num = 1
base_url = "https://v-tac.it/led-products-results-page/?q="
request = '500'
separator = '&start='
page_num = "1"
url = base_url + request + separator + page_num
html = urllib.request.urlopen(url).read()
soup = b(html, "html.parser")
for i in range(100) :
    for post in soup.findAll("div",{"class" : "spacer"}):
        h = post.findAll("span")[0].text
        if "request" in h:
            break
            print(h)
    num += 1
    page_num = str(num)
    url = base_url + request + separator + page_num
    html = urllib.request.urlopen(url).read()
    soup = b(html, "html.parser")
    print("We are at page " + page_num)
But it doesn't return anything, it only cycles through the pages.
Thanks in advance for any help

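If it is in the text, then with bs4 4.7.1 you should be able to use :contains (newer Soup Sieve releases deprecate it in favor of :-soup-contains):
soup.select_one('.spacer span:contains("request")').text if soup.select_one('.spacer span:contains("request")') is not None else 'Not found'
Note that "request" in quotes is the literal word; to search for the value of your request variable, build the selector string from it.
I'm not sure why, when you have for i in range(100), you don't use i instead of num later; then you wouldn't need the +=.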
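One more thing to watch in the loop from the question: break only leaves the inner for loop, so the outer loop keeps cycling through pages even after a match. Putting the search in a function and returning on the first hit stops both loops at once. A minimal sketch of that early-stop logic with the page fetching left out; pages is a hypothetical list of lists of span texts standing in for the scraped pages, and find_first_match is an illustrative name:

```python
def find_first_match(pages, needle):
    """Return (page_number, text) for the first text containing needle.

    pages is a list of pages, each page being a list of span texts.
    Returns None when nothing matches.
    """
    for page_number, texts in enumerate(pages, start=1):
        for text in texts:
            if needle in text:            # the variable, not the literal "request"
                return page_number, text  # leaves both loops immediately
    return None


# Made-up span texts standing in for two scraped result pages.
sample_pages = [
    ["LED bulb 10W", "LED bulb 20W"],
    ["LED panel 500 lumen", "LED strip 5m"],
]
match = find_first_match(sample_pages, "500")
print(match)
```

In the real scraper, each inner list would come from the span texts of one fetched page, and the function would fetch the next page only when the current one has no match.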

By Beautiful Soup i scrape twitter data. I am able to get data but can't save in csv file

I scraped Twitter for user name, Tweets, replies, retweets but can't save in a CSV file.
Here is the code:
from urllib.request import urlopen
from bs4 import BeautifulSoup
file = "5_twitterBBC.csv"
f = open(file, "w")
Headers = "tweet_user, tweet_text, replies, retweets\n"
f.write(Headers)
for page in range(0,5):
    url = "https://twitter.com/BBCWorld".format(page)
    html = urlopen(url)
    soup = BeautifulSoup(html,"html.parser")
    tweets = soup.find_all("div", {"class":"js-stream-item"})
    for tweet in tweets:
        try:
            if tweet.find('p',{"class":'tweet-text'}):
                tweet_user = tweet.find('span',{"class":'username'}).text.strip()
                tweet_text = tweet.find('p',{"class":'tweet-text'}).text.encode('utf8').strip()
                replies = tweet.find('span',{"class":"ProfileTweet-actionCount"}).text.strip()
                retweets = tweet.find('span', {"class" : "ProfileTweet-action--retweet"}).text.strip()
                print(tweet_user, tweet_text, replies, retweets)
                f.write("{}".format(tweet_user).replace(",","|")+ ",{}".format(tweet_text)+ ",{}".format( replies).replace(",", " ")+ ",{}".format(retweets) + "\n")
        except: AttributeError
f.close()
I get the data but can't save it in a CSV file. Can someone explain how to save the data in a CSV file?

You've only made a small error in finding the tweets. tweets = soup.find_all("div", {"class":"js-stream-item"}) looks for div elements, while the working code below selects li elements: tweets = soup.find_all("li", attrs={"class":"js-stream-item"}). Passing the dictionary positionally or as attrs= is equivalent; the tag name is what changes the result.
This is a working solution, but it only fetches the first 20 tweets:
from urllib.request import urlopen
from bs4 import BeautifulSoup
file = "5_twitterBBC.csv"
f = open(file, "w")
Headers = "tweet_user, tweet_text, replies, retweets\n"
f.write(Headers)
url = "https://twitter.com/BBCWorld"
html = urlopen(url)
soup = BeautifulSoup(html, "html.parser")
# Gets the tweets
tweets = soup.find_all("li", attrs={"class":"js-stream-item"})
# Writes each fetched tweet to the file
for tweet in tweets:
    try:
        if tweet.find('p',{"class":'tweet-text'}):
            tweet_user = tweet.find('span',{"class":'username'}).text.strip()
            tweet_text = tweet.find('p',{"class":'tweet-text'}).text.encode('utf8').strip()
            replies = tweet.find('span',{"class":"ProfileTweet-actionCount"}).text.strip()
            retweets = tweet.find('span', {"class" : "ProfileTweet-action--retweet"}).text.strip()
            # String interpolation technique
            f.write(f'{tweet_user},/^{tweet_text}$/,{replies},{retweets}\n')
    except AttributeError:
        # skip tweets that are missing one of the expected elements
        continue
f.close()

filename = "output.csv"
f = open(filename, "w",encoding="utf-8")
headers = " tweet_user, tweet_text, replies, retweets \n"
f.write(headers)
***your code***
***loop****
f.write(''.join(tweet_user + [","] + tweet_text + [","] + replies + [","] + retweets + [","] + ["\n"]) )
f.close()
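Joining the fields by hand with "," breaks as soon as a tweet contains a comma, a quote, or a line break, which is why the code above replaces commas manually. The csv module handles that quoting itself. A minimal sketch, assuming the four values are plain strings (no .encode call); the sample tweets are made up for illustration, and io.StringIO stands in for the real file so the rows can be read back:

```python
import csv
import io

# Made-up rows standing in for scraped tweets.
rows = [
    ("@BBCWorld", 'Breaking: "quoted", with a comma', "12", "34"),
    ("@BBCWorld", "Two\nlines", "5", "6"),
]

buffer = io.StringIO(newline="")
writer = csv.writer(buffer)
writer.writerow(["tweet_user", "tweet_text", "replies", "retweets"])
writer.writerows(rows)   # commas, quotes and newlines are escaped by the writer

# Read the data back to check that every field survived unchanged.
buffer.seek(0)
read_back = list(csv.reader(buffer))
print(read_back)
```

With a real file, open it with open(filename, "w", newline="", encoding="utf-8") and pass the file object to csv.writer in place of buffer.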

'list' object has no attribute 'timeout' and only prints first item in the table

I am trying to pull a table from a list of URLs. When I input only one URL, it prints only the first item in the table, and when I add more URLs to the list I get the error message 'list' object has no attribute 'timeout'. What is the best way to get the rest of the items and to add more URLs?
Below is the code I am running.
import time, random, csv, bs4, requests, io
import pandas as pd
timeDelay = random.randrange(5, 20)
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup
my_urls = [
    "https://www.lonza.com/products-services/bio-research/electrophoresis-of-nucleic-acids-and-proteins/nucleic-acid-electrophoresis/precast-gels-for-dna-and-rna-analysis/truband-gel-anchors.aspx",
    "https://www.lonza.com/products-services/bio-research/transfection/nucleofector-kits-for-primary-cells/nucleofector-kits-for-primary-epithelial-cells/nucleofector-kits-for-human-mammary-epithelial-cells-hmec.aspx",
    "https://www.lonza.com/products-services/bio-research/transfection/nucleofector-kits-for-primary-cells/nucleofector-kits-for-primary-neural-cells/nucleofector-kits-for-mammalian-glial-cells.aspx",
]
uClient = uReq(my_urls)
page_html = uClient.read()
uClient.close()
page_soup = soup(page_html, "html.parser")
containers = page_soup.findAll('tbody')
product_name_list =[]
cat_no_list = []
size_list = []
price_list =[]
for container in containers:
    if (len(container) > 0):
        #try:
        title_container = container.findAll('td')
        Product_name = title_container[0].text.strip()
        product_name_list.append(Product_name)
        CatNo_container = container.findAll('td')
        CatNo = CatNo_container[1].text.strip()
        cat_no_list.append(CatNo)
        #Size_container = container.findAll('div',{'class':'col-xs-2 noPadding'})
        #Size = Size_container[0].text.strip()
        #size_list.append(Size)
        Price_container = container.findAll('td')
        Price = Price_container[4].text.strip()
        price_list.append(Price)
        print('Product_name: '+ Product_name)
        print('CatNo: ' + CatNo)
        print('Size: ' + 'N/A')
        print('Price: ' + Price)
        print(" ")
        time.sleep(timeDelay)

You are passing a list, my_urls, in uClient = uReq(my_urls), where a single URL string is required.
You need to pass the individual elements of the list, i.e. the strings, one at a time.
Here is the edited code, which works for multiple URLs.
UPDATED CODE (to get all items):
import time, random, csv, bs4, requests, io
import pandas as pd
timeDelay = random.randrange(5, 20)
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup
my_urls = [
    "https://www.lonza.com/products-services/bio-research/electrophoresis-of-nucleic-acids-and-proteins/nucleic-acid-electrophoresis/precast-gels-for-dna-and-rna-analysis/truband-gel-anchors.aspx",
    "https://www.lonza.com/products-services/bio-research/transfection/nucleofector-kits-for-primary-cells/nucleofector-kits-for-primary-epithelial-cells/nucleofector-kits-for-human-mammary-epithelial-cells-hmec.aspx",
    "https://www.lonza.com/products-services/bio-research/transfection/nucleofector-kits-for-primary-cells/nucleofector-kits-for-primary-neural-cells/nucleofector-kits-for-mammalian-glial-cells.aspx",
]
for url in my_urls:
    print("URL using: ", url)
    uClient = uReq(url)
    page_html = uClient.read()
    uClient.close()
    page_soup = soup(page_html, "html.parser")
    containers = page_soup.findAll('tbody')
    product_name_list =[]
    cat_no_list = []
    size_list = []
    price_list =[]
    for container in containers:
        if (len(container) > 0):
            #try:
            items = container.findAll('tr')
            for item in items:
                item = item.text.split('\n')
                Product_name = item[1]
                product_name_list.append(Product_name)
                CatNo = item[2]
                cat_no_list.append(CatNo)
                #Size_container = container.findAll('div',{'class':'col-xs-2 noPadding'})
                #Size = Size_container[0].text.strip()
                #size_list.append(Size)
                Price = item[6]
                price_list.append(Price)
                print('Product_name: '+ Product_name)
                print('CatNo: ' + CatNo)
                print('Size: ' + 'N/A')
                print('Price: ' + Price)
                print(" ")
                time.sleep(timeDelay)

Crawler in Python, urlopen not working

I am playing around trying to extract some info from a webpage and I have the following code:
import re
from math import ceil
from urllib.request import urlopen as uReq, Request
from bs4 import BeautifulSoup as soup
InitUrl="https://mtgsingles.gr/search?q="
NumOfCrawledPages = 0
URL_Next = ""
NumOfPages=5
for i in range(0, NumOfPages):
    if i == 0:
        Url = InitUrl
    else:
        Url = URL_Next
    UClient = uReq(Url) # downloading the url
    page_html = UClient.read()
    UClient.close()
    page_soup = soup(page_html, "html.parser")
    cards = page_soup.findAll("div", {"class": ["iso-item", "item-row-view"]})
    for card in cards:
        card_name = card.div.div.strong.span.contents[3].contents[0].replace("\xa0 ", "")
        if len(card.div.contents) > 3:
            cardP_T = card.div.contents[3].contents[1].text.replace("\n", "").strip()
        else:
            cardP_T = "Does not exist"
        cardType = card.contents[3].text
        print(card_name + "\n" + cardP_T + "\n" + cardType + "\n")
    try:
        URL_Next = "https://mtgsingles.gr" + page_soup.findAll("li", {"class": "next"})[0].contents[0].get("href")
        print("The next URL is: " + URL_Next + "\n")
    except IndexError:
        print("Crawling process completed! No more infomation to retrieve!")
    else:
        print("The next URL is: " + URL_Next + "\n")
        NumOfCrawledPages += 1
        Url= URL_Next
    finally:
        print("Moving to page : " + str(NumOfCrawledPages + 1) + "\n")
The code runs fine and no errors occur, but the results are not as expected. I am trying to extract some information from the page as well as the URL of the next page. Ultimately I would like the program to run 5 times and crawl 5 pages. Instead, this code crawls the initial page (InitUrl="xyz.com") all 5 times and never proceeds to the next page URL that it extracts. I tried debugging it with some print statements, and I think the problem lies in these statements:
UClient = uReq(Url)
page_html = UClient.read()
UClient.close()
For some reason urlopen does not work repeatedly in the for loop. Why does this happen? Is it wrong to use urlopen in a for statement?

This site loads its data through Ajax requests, so you must send POST requests to get the data.
Tip: choose the URL carefully, for example: https://mtgsingles.gr/search?ajax=products-listing&q=
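Building such URLs with urllib.parse is less error-prone than concatenating strings, both for the query string and for the next-page href. A minimal sketch of the URL handling only; it does not contact the site. The ajax and q parameters come from the tip above, while the page parameter and the helper names build_search_url and absolute_next_url are assumptions made for illustration:

```python
from urllib.parse import urlencode, urljoin

BASE_URL = "https://mtgsingles.gr/search"


def build_search_url(query, page=None):
    """Build the listing URL from its query parameters."""
    params = {"ajax": "products-listing", "q": query}
    if page is not None:
        params["page"] = page   # hypothetical parameter name, not confirmed by the site
    return BASE_URL + "?" + urlencode(params)


def absolute_next_url(current_url, href):
    """Turn the href of a 'next' link into an absolute URL."""
    return urljoin(current_url, href)


first_url = build_search_url("")
next_url = absolute_next_url(first_url, "/search?q=&page=2")
print(first_url)
print(next_url)
```

urljoin also leaves an href that is already absolute untouched, so the same call works whether the site emits relative or absolute links.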
