Python requests pull not always retrieving data - python-3.x

To practice programming, I am trying to help a friend review a subreddit's data via web scraping with requests and bs4. (I prefer requests for this task since I am moving this script over to my Raspberry Pi and don't think its little heart could even get Chrome installed.)
I am running into an issue where my request only outputs results some of the time, meaning it will pull the name and URL of the post roughly 1 out of 5 times when run. When the request returns no data, it doesn't raise an error; the program just stops.
from time import sleep
import requests
import os
import re
from bs4 import BeautifulSoup

i = 1
selections = ""
r = requests.get("https://www.reddit.com/r/hardwareswap/search?q=Vive&restrict_sr=1", timeout=None)
soup = BeautifulSoup(r.text, 'html.parser')
results = soup.find_all('a', attrs={'data-click-id': 'body'})
textitems = []
for result in results:
    textitems.append(result.text.strip())
for result in textitems:
    print(result)
links = soup.find_all('a', attrs={'data-click-id': 'body'})
for link in links:
    print(link.attrs['href'])
Any thoughts as to why this happens? My initial thought was that it was due to either a reddit policy or an invalid URL.
Thanks!
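One plausible cause, which I can't confirm from here, is that reddit throttles or serves an empty shell page to the default python-requests User-Agent. Below is a minimal sketch that sends a browser-like User-Agent and retries a few times; the header value and retry count are assumptions, not a verified fix:
from time import sleep
import requests
from bs4 import BeautifulSoup

URL = "https://www.reddit.com/r/hardwareswap/search?q=Vive&restrict_sr=1"
HEADERS = {"User-Agent": "Mozilla/5.0"}  # assumption: a browser-like UA avoids the empty bot response

results = []
for attempt in range(5):  # retry a few times in case reddit returns an empty page
    r = requests.get(URL, headers=HEADERS, timeout=10)
    soup = BeautifulSoup(r.text, "html.parser")
    results = soup.find_all("a", attrs={"data-click-id": "body"})
    if results:
        break
    sleep(2)  # back off briefly before retrying

for link in results:
    print(link.text.strip(), link.get("href"))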

Related

BeautifulSoup WebScraping Issue: Cannot find specific classes for this specific Website (Python 3.7)

I am a bit new to web scraping. I have created web scrapers with the methods below before, but with this specific website I am running into an issue where the parser cannot locate the specific class ('mainTitle___mbpq1'), which is the class that refers to the text of the announcement. Whenever I run the code it returns None, and this is also the case for the majority of other classes. I want to capture this info without using selenium, since that slows the process down from what I understand. I think the issue is that it is a JSON file, and so script tags are being used (I may be completely wrong, just a guess), but I do not know much about this area, so any help would be much appreciated.
I have attempted the code below, with no success.
from bs4 import BeautifulSoup
import re
import requests
# Method 1
url_4 = "https://www.kucoin.com/news/categories/listing"
res = requests.get(url_4)
html_page = res.content
soup = BeautifulSoup(html_page, 'html.parser')
texts = soup.body
text = soup.body.div.find('div',{'class':'mainTitle___mbpq1'})
print(text)
from bs4 import BeautifulSoup
import urllib3
import re
# Method2
http = urllib3.PoolManager()
comm = re.compile("<!--|-->")
def make_soup(url):
    page = http.request('GET', url)
    soupdata = BeautifulSoup(page.data, features="lxml")
    return soupdata

soup = make_soup(url_4)
Annouce_Info = soup.find('div', {'class': 'mainTitle___mbpq1'})
print(Annouce_Info)
Link: KuCoin Listing
The data is loaded from an external source via JavaScript. To print all article titles, you can use this example:
import json
import requests
url = "https://www.kucoin.com/_api/cms/articles"
params = {"page": 1, "pageSize": 10, "category": "listing", "lang": ""}
data = requests.get(url, json=params).json()
# uncomment this to print all data:
# print(json.dumps(data, indent=4))
for item in data["items"]:
    print(item["title"])
Prints:
PhoenixDAO (PHNX) Gets Listed on KuCoin!
LABS Group (LABS) Gets Listed on KuCoin! World Premiere!
Polkadex (PDEX) Gets Listed on KuCoin! World Premiere!
Announcement of Polkadex (PDEX) Token Sale on KuCoin Spotlight
KuCoin Futures Has Launched USDT Margined NEO, ONT, XMR, SNX Contracts
Introducing the Polkadex (PDEX) Token Sale on KuCoin Spotlight
Huobi Token (HT) Gets Listed on KuCoin!
KuCoin Futures Has Launched USDT Margined XEM, BAT, XTZ, QTUM Contracts
RedFOX Labs (RFOX) Gets Listed on KuCoin!
Boson Protocol (BOSON) Gets Listed on KuCoin! World Premiere!
If you are trying to scrape information about new listings at crypto exchanges, you may be interested in this API:
https://rapidapi.com/Diver44/api/new-cryptocurrencies-listings/
import requests
url = "https://new-cryptocurrencies-listings.p.rapidapi.com/new_listings"
headers = {
    'x-rapidapi-host': "new-cryptocurrencies-listings.p.rapidapi.com",
    'x-rapidapi-key': "your-key"
}
response = requests.request("GET", url, headers=headers)
print(response.text)
It includes an endpoint with new listings from the biggest exchanges, and a very useful endpoint with information about the exchanges where you can buy a specific coin and its price on those exchanges.

Python/Selenium - how to loop through hrefs in <li>?

Web URL: https://www.ipsos.com/en-us/knowledge/society/covid19-research-in-uncertain-times
I want to parse the HTML as below:
I want to get all hrefs within the <li> elements and the highlighted text. I tried this code:
elementList = driver.find_element_by_class_name('block-wysiwyg').find_elements_by_tag_name("li")
for i in range(len(elementList)):
    driver.find_element_by_class_name('blcokwysiwyg').find_elements_by_tag_name("li").get_attribute("href")
But the block returned none.
Can anyone please help me with the above code?
I suppose it will fetch you the required content.
import requests
from bs4 import BeautifulSoup
link = 'https://www.ipsos.com/en-us/knowledge/society/covid19-research-in-uncertain-times'
r = requests.get(link)
soup = BeautifulSoup(r.text,"html.parser")
for item in soup.select(".block-wysiwyg li"):
    item_text = item.get_text(strip=True)
    item_link = item.select_one("a[href]").get("href")
    print(item_text, item_link)
Try it this way:
coronas = driver.find_element_by_xpath("//div[@class='block-wysiwyg']/ul/li")
hr = coronas.find_element_by_xpath('./a')
print(coronas.text)
print(hr.get_attribute('href'))
Output:
The coronavirus is touching the lives of all Americans, but race, age, and income play a big role in the exact ways the virus — and the stalled economy — are affecting people. Here's what that means.
https://www.ipsos.com/en-us/america-under-coronavirus
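If every link inside the list is wanted rather than just the first one, a loop over find_elements (plural) along the same lines should work; this is only a sketch that assumes the same .block-wysiwyg structure shown above:
# sketch: iterate over every <li> under the block-wysiwyg div and print its text and link
items = driver.find_elements_by_xpath("//div[@class='block-wysiwyg']/ul/li")
for li in items:
    link = li.find_element_by_xpath('./a')
    print(li.text)
    print(link.get_attribute('href'))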

Web Scraping with BeautifulSoup code review

from bs4 import BeautifulSoup
import requests
import pandas as pd
records=[]
keep_looking = True
url = 'https://www.tapology.com/fightcenter'
while keep_looking:
    re = requests.get(url)
    soup = BeautifulSoup(re.text, 'html.parser')
    data = soup.find_all('section', attrs={'class': 'fcListing'})
    for d in data:
        event = d.find('a').text
        date = d.find('span', attrs={'class': 'datetime'}).text[1:-4]
        location = d.find('span', attrs={'class': 'venue-location'}).text
        mainEvent = first.find('span', attrs={'class': 'bout'}).text
    url_tag = soup.find('div', attrs={'class': 'fightcenterEvents'})
    if not url_tag:
        keep_looking = False
    else:
        url = "https://www.tapology.com" + url_tag.find('a')['href']
I am wondering if there are any errors in my code? It is running, but it is taking a very long time to finish and I am afraid it might be stuck in an infinite loop. Any feedback would be helpful. Please do not rewrite all of this and post it, as I would like to keep this format, since I am learning and want to improve.
Although this is not the right site to seek help with a review-related task, I considered giving a solution, as it sounds like you may be falling into an infinite loop according to your statement above.
Try this to get information from that site. It will keep running as long as there is a next-page link to traverse; when there is no new page link to follow, the script will stop automatically.
from bs4 import BeautifulSoup
from urllib.parse import urljoin
import requests
url = 'https://www.tapology.com/fightcenter'
while True:
    re = requests.get(url)
    soup = BeautifulSoup(re.text, 'html.parser')
    for data in soup.find_all('section', attrs={'class': 'fcListing'}):
        event = data.select_one('.name a').get_text(strip=True)
        date = data.find('span', attrs={'class': 'datetime'}).get_text(strip=True)[:-1]
        location = data.find('span', attrs={'class': 'venue-location'}).get_text(strip=True)
        try:
            mainEvent = data.find('span', attrs={'class': 'bout'}).get_text(strip=True)
        except AttributeError:
            mainEvent = ""
        print(f'{event} {date} {location} {mainEvent}')

    urltag = soup.select_one('.pagination a[rel="next"]')
    if not urltag:
        break  # as soon as there is no next page link, break out of the loop
    url = urljoin(url, urltag.get("href"))  # urljoin saves you from using a hardcoded prefix
For future reference: feel free to post any question on this site to get your code reviewed.

How can i get the links under a specific class

So 2 days ago I was trying to parse the data between two of the same classes, and Keyur helped me a lot, but after that some other problems were left behind.. :D
Page link: https://www.hltv.org/matches
Now I want to get the links under a specific class, here is my code, and here are the errors.
from bs4 import BeautifulSoup
import urllib.request
import datetime
headers = {}  # Headers give information about you, like your operating system, your browser, etc.
headers['User-Agent'] = 'Mozilla/5.0'  # I defined a user agent because HLTV perceives my connection as a bot.
hltv = urllib.request.Request('https://www.hltv.org/matches', headers=headers) # Basically connecting to website
session = urllib.request.urlopen(hltv)
sauce = session.read() # Getting the source of website
soup = BeautifulSoup(sauce, 'lxml')
a = 0
b = 1
# Getting the match pages' links.
for x in soup.find('span', text=datetime.date.today()).parent:
    print(x.find('a'))
Error:
Actually there isn't any error, but the output looks like:
None
None
None
-1
None
None
-1
Then I researched and saw that if there isn't any data to give, the find function gives you nothing, which is None.
Then I tried to use find_all.
Code:
print(x.find_all('a'))
Output:
AttributeError: 'NavigableString' object has no attribute 'find_all'
This is the class name:
<div class="standard-headline">2018-05-01</div>
I don't want to post all the code to here so here is the link hltv.org/matches/ so you can check the classes more easily.
I'm not quite sure I could understand what links the OP really wants to grab. However, I took a guess. The links are within the compound classes a-reset block upcoming-match standard-box, and if you can spot the right class, then one individual class will suffice to fetch the data, just as selectors do. Give it a shot.
from bs4 import BeautifulSoup
from urllib.request import Request, urlopen
from urllib.parse import urljoin
import datetime
url = 'https://www.hltv.org/matches'
req = Request(url, headers={"User-Agent":"Mozilla/5.0"})
res = urlopen(req).read()
soup = BeautifulSoup(res, 'lxml')
for links in soup.find(class_="standard-headline", text=(datetime.date.today())).find_parent().find_all(class_="upcoming-match")[:-2]:
    print(urljoin(url, links.get('href')))
Output:
https://www.hltv.org/matches/2322508/yeah-vs-sharks-ggbet-ascenso
https://www.hltv.org/matches/2322633/team-australia-vs-team-uk-showmatch-csgo
https://www.hltv.org/matches/2322638/sydney-saints-vs-control-fe-lil-suzi-winner-esl-womens-sydney-open-finals
https://www.hltv.org/matches/2322426/faze-vs-astralis-iem-sydney-2018
https://www.hltv.org/matches/2322601/max-vs-fierce-tiger-starseries-i-league-season-5-asian-qualifier
and so on ------

Beautiful Soup saving sessions to checkout products

I am trying to write a bot to add items to my cart then purchase them for me because I need to make very regular purchases and it becomes tedious to purchase them myself.
from bs4 import BeautifulSoup
import requests
import numpy as np
page = requests.get("http://www.onlinestore.com/shop")
soup = BeautifulSoup(page.content, 'html.parser')
try:
    for i in soup.find_all('a'):
        if "shop" in i['href']:
            shop_page = requests.get("http://www.onlinestore.com" + i['href'])
            item_page = BeautifulSoup(shop_page.content, 'html.parser')
            for h in item_page.find_all('form', class_="add"):
                print(h['action'])
                try:
                    shop_page = requests.get("http://www.online.com" + h['action'])
                except:
                    print("None left")
            for h in item_page.find_all('h1', class_="protect"):
                print(h.getText())
except:
    print("either ended or error occurred")

checkout_page = requests.get("http://www.onlinestore.com/checkout")
checkout = BeautifulSoup(checkout_page.content, 'html.parser')
for j in checkout.find_all('strong', id_="total"):
    print(j)
I was having some trouble checking out the products because the items don't carry over. Is there a way that I can implement cookies so that it keeps track my items I have added to cart?
Thanks
Even though requests is not a browser, it can still persist headers and cookies across multiple requests, but only if you use a Session:
with requests.Session() as session:
    page = session.get("http://www.onlinestore.com/shop")
    # use "session" instead of "requests" later on
Note that you would also get a performance boost for free because of the persistent connection:
if you're making several requests to the same host, the underlying TCP connection will be reused, which can result in a significant performance increase
Depending on how the cart and checkout on this specific online store are implemented this may or may not work. But, at least, this is a step forward.
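Whether this particular store accepts a plain POST to each form's action URL is an assumption, but a compressed sketch of the original flow rewritten on top of a single Session could look like this; the cookies collected along the way are what let the checkout page see the cart:
import requests
from bs4 import BeautifulSoup

with requests.Session() as session:
    page = session.get("http://www.onlinestore.com/shop")
    soup = BeautifulSoup(page.content, 'html.parser')
    for i in soup.find_all('a'):
        if "shop" in i.get('href', ''):
            item_html = session.get("http://www.onlinestore.com" + i['href']).content
            item_page = BeautifulSoup(item_html, 'html.parser')
            for h in item_page.find_all('form', class_="add"):
                # assumption: submitting the add-to-cart form is a POST to its action URL
                session.post("http://www.onlinestore.com" + h['action'])
    # the session's cookie jar now carries the cart, so checkout should show the added items
    checkout = BeautifulSoup(session.get("http://www.onlinestore.com/checkout").content, 'html.parser')
    print(checkout.find('strong', id="total"))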
