Beautiful Soup saving sessions to checkout products - python-3.x

I am trying to write a bot that adds items to my cart and then purchases them for me, because I make these purchases very regularly and doing it by hand has become tedious.
from bs4 import BeautifulSoup
import requests

page = requests.get("http://www.onlinestore.com/shop")
soup = BeautifulSoup(page.content, 'html.parser')
try:
    for i in soup.find_all('a'):
        if "shop" in i['href']:
            shop_page = requests.get("http://www.onlinestore.com" + i['href'])
            item_page = BeautifulSoup(shop_page.content, 'html.parser')
            for h in item_page.find_all('form', class_="add"):
                print(h['action'])
                try:
                    # hit the add-to-cart URL taken from the form's action attribute
                    shop_page = requests.get("http://www.onlinestore.com" + h['action'])
                except:
                    print("None left")
            for h in item_page.find_all('h1', class_="protect"):
                print(h.getText())
except:
    print("either ended or an error occurred")
checkout_page = requests.get("http://www.onlinestore.com/checkout")
checkout = BeautifulSoup(checkout_page.content, 'html.parser')
for j in checkout.find_all('strong', id="total"):  # id, not id_ (bs4 only special-cases class_)
    print(j)
I was having some trouble checking out the products because the items don't carry over between requests. Is there a way I can implement cookies so that it keeps track of the items I have added to my cart?
Thanks

requests is not a browser, but it can still persist headers and cookies across multiple requests if you use a Session:
with requests.Session() as session:
    page = session.get("http://www.onlinestore.com/shop")
    # use "session" instead of "requests" later on
Note that you would also get a performance boost for free because of the persistent connection: "if you're making several requests to the same host, the underlying TCP connection will be reused, which can result in a significant performance increase".
Depending on how the cart and checkout on this specific online store are implemented this may or may not work. But, at least, this is a step forward.
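For example, here is a rough sketch of how the whole flow could run through one session so the cart cookies survive until checkout. The add-to-cart and checkout paths come from the question's code, but whether the store expects a GET or a POST for the form action, and what the responses look like, are assumptions:
import requests
from bs4 import BeautifulSoup

with requests.Session() as session:
    # cookies set by any response are sent back automatically on later requests
    page = session.get("http://www.onlinestore.com/shop")
    soup = BeautifulSoup(page.content, 'html.parser')
    for form in soup.find_all('form', class_="add"):
        # submit the add-to-cart form through the same session
        session.post("http://www.onlinestore.com" + form['action'])
    checkout_page = session.get("http://www.onlinestore.com/checkout")
    checkout = BeautifulSoup(checkout_page.content, 'html.parser')
    total = checkout.find('strong', id="total")
    print(total.get_text() if total else "cart appears to be empty")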

Related

BeautifulSoup WebScraping Issue: Cannot find specific classes for this specific Website (Python 3.7)

I am a bit new to web scraping. I have created web scrapers with the methods below before, but with this specific website I am running into an issue where the parser cannot locate the specific class ('mainTitle___mbpq1'), which is the class that refers to the text of the announcement. Whenever I run the code it returns None, and this is also the case for the majority of other classes. I want to capture this info without using Selenium, since that slows the process down from what I understand. I think the issue is that the data comes from a JSON file and script tags are being used (I may be completely wrong, just a guess), but I do not know much about this area, so any help would be much appreciated.
The code below I have attempted using, with no success.
from bs4 import BeautifulSoup
import re
import requests
# Method 1
url_4 = "https://www.kucoin.com/news/categories/listing"
res = requests.get(url_4)
html_page = res.content
soup = BeautifulSoup(html_page, 'html.parser')
texts = soup.body
text = soup.body.div.find('div',{'class':'mainTitle___mbpq1'})
print(text)
from bs4 import BeautifulSoup
import urllib3
import re
# Method2
http = urllib3.PoolManager()
comm = re.compile("<!--|-->")
def make_soup(url):
    page = http.request('GET', url)
    soupdata = BeautifulSoup(page.data, features="lxml")
    return soupdata
soup = make_soup(url_4)
Annouce_Info = soup.find('div',{'class':'mainTitle___mbpq1'})
print(Annouce_Info)
The data is loaded from an external source via JavaScript. To print all article titles, you can use this example:
import json
import requests
url = "https://www.kucoin.com/_api/cms/articles"
params = {"page": 1, "pageSize": 10, "category": "listing", "lang": ""}
data = requests.get(url, params=params).json()
# uncomment this to print all data:
# print(json.dumps(data, indent=4))
for item in data["items"]:
    print(item["title"])
Prints:
PhoenixDAO (PHNX) Gets Listed on KuCoin!
LABS Group (LABS) Gets Listed on KuCoin! World Premiere!
Polkadex (PDEX) Gets Listed on KuCoin! World Premiere!
Announcement of Polkadex (PDEX) Token Sale on KuCoin Spotlight
KuCoin Futures Has Launched USDT Margined NEO, ONT, XMR, SNX Contracts
Introducing the Polkadex (PDEX) Token Sale on KuCoin Spotlight
Huobi Token (HT) Gets Listed on KuCoin!
KuCoin Futures Has Launched USDT Margined XEM, BAT, XTZ, QTUM Contracts
RedFOX Labs (RFOX) Gets Listed on KuCoin!
Boson Protocol (BOSON) Gets Listed on KuCoin! World Premiere!
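If you want more than the first ten titles, the page value in params presumably controls pagination (an assumption about this unofficial endpoint, so verify it against the raw JSON first). A small loop along those lines:
import requests

url = "https://www.kucoin.com/_api/cms/articles"
titles = []
for page in range(1, 4):  # first three pages; adjust as needed
    params = {"page": page, "pageSize": 10, "category": "listing", "lang": ""}
    data = requests.get(url, params=params).json()
    titles.extend(item["title"] for item in data["items"])
print(len(titles), "titles collected")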
If you are trying to scrape information about new listings at crypto exchanges, you may be interested in this API:
https://rapidapi.com/Diver44/api/new-cryptocurrencies-listings/
import requests
url = "https://new-cryptocurrencies-listings.p.rapidapi.com/new_listings"
headers = {
    'x-rapidapi-host': "new-cryptocurrencies-listings.p.rapidapi.com",
    'x-rapidapi-key': "your-key"
}
response = requests.request("GET", url, headers=headers)
print(response.text)
It includes an endpoint with new listings from the biggest exchanges, and a very useful endpoint that tells you which exchanges you can buy a specific coin on and the price of that coin at each of those exchanges.

Python requests pull not always retrieving data

To practice programming, I am trying to help a friend review a subreddit's data via web scraping with requests and bs4. (I prefer requests for this task since I am moving this script over to my Raspberry Pi and don't think its little heart could even get Chrome installed.)
I am running into an issue where my request only outputs the results sometimes, meaning it will pull the name and URL of the post, say, 1 out of 5 times when run. When the request returns no data, it doesn't return an error; the program just stops.
from time import sleep
from bs4 import BeautifulSoup
import requests
import os
import re
i = 1
selections = ""
r = requests.get("https://www.reddit.com/r/hardwareswap/search?q=Vive&restrict_sr=1", timeout = None)
soup = BeautifulSoup(r.text, 'html.parser')
results = soup.find_all('a', attrs = {'data-click-id':'body'})
textitems = []
for result in results:
textitems.append(result.text.strip())
for result in textitems:
print(result)
links = soup.find_all('a', attrs = {'data-click-id':'body'})
for link in links:
print(link.attrs['href'])
Any thoughts as to why this happens? My initial thoughts were it was either due to a reddit policy or an invalid URL.
Thanks!
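One thing worth ruling out (a guess, not something confirmed in the question): Reddit is known to throttle clients that send the default python-requests User-Agent, and a throttled or blocked response simply won't contain the post markup, which would look exactly like "no data, no error". Checking the status code and sending a descriptive User-Agent costs little:
import requests
from bs4 import BeautifulSoup

# the User-Agent string here is illustrative; anything descriptive and unique is fine
headers = {"User-Agent": "hardwareswap-practice-script/0.1"}
r = requests.get(
    "https://www.reddit.com/r/hardwareswap/search?q=Vive&restrict_sr=1",
    headers=headers,
    timeout=30,
)
print(r.status_code)  # a 429 or a block page here would explain the intermittent empty results
soup = BeautifulSoup(r.text, 'html.parser')
print(len(soup.find_all('a', attrs={'data-click-id': 'body'})))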

how to read links from a list with beautifulsoup?

I have a list with lots of links and I want to scrape them with BeautifulSoup in Python 3.
links is my list, and it contains hundreds of URLs. I have tried this code to scrape them all, but it's not working for some reason:
links= ['http://www.nuforc.org/webreports/ndxe201904.html',
'http://www.nuforc.org/webreports/ndxe201903.html',
'http://www.nuforc.org/webreports/ndxe201902.html',
'http://www.nuforc.org/webreports/ndxe201901.html',
'http://www.nuforc.org/webreports/ndxe201812.html',
'http://www.nuforc.org/webreports/ndxe201811.html',...]
raw = urlopen(i in links).read()
ufos_doc = BeautifulSoup(raw, "html.parser")
raw should be a list containing the data of each web-page. For each entry in raw, parse it and create a soup object. You can store each soup object in a list (I called it soups):
from urllib.request import urlopen
from bs4 import BeautifulSoup

links = ['http://www.nuforc.org/webreports/ndxe201904.html',
         'http://www.nuforc.org/webreports/ndxe201903.html',
         'http://www.nuforc.org/webreports/ndxe201902.html',
         'http://www.nuforc.org/webreports/ndxe201901.html',
         'http://www.nuforc.org/webreports/ndxe201812.html',
         'http://www.nuforc.org/webreports/ndxe201811.html']

raw = [urlopen(i).read() for i in links]

soups = []
for page in raw:
    soups.append(BeautifulSoup(page, 'html.parser'))
You can then access eg. the soup object for the first link with soups[0].
Also, for fetching the response of each URL, consider using the requests module instead of urllib. See this post.
You need a loop over the list links. If you have a lot of these to do, as mentioned in the other answer, consider requests. With requests you can create a Session object, which lets you re-use the underlying connection and therefore scrape more efficiently:
import requests
from bs4 import BeautifulSoup as bs
links= ['http://www.nuforc.org/webreports/ndxe201904.html',
'http://www.nuforc.org/webreports/ndxe201903.html',
'http://www.nuforc.org/webreports/ndxe201902.html',
'http://www.nuforc.org/webreports/ndxe201901.html',
'http://www.nuforc.org/webreports/ndxe201812.html',
'http://www.nuforc.org/webreports/ndxe201811.html']
with requests.Session() as s:
    for link in links:
        r = s.get(link)
        soup = bs(r.content, 'lxml')
        # do something with soup here
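For example, if you just want to keep whatever you extract from each page grouped by URL, something like this works (the choice of 'tr' elements is only an assumption about what you need from the NUFORC report pages):
import requests
from bs4 import BeautifulSoup as bs

links = ['http://www.nuforc.org/webreports/ndxe201904.html',
         'http://www.nuforc.org/webreports/ndxe201903.html']  # same list as above, shortened here

rows_by_link = {}
with requests.Session() as s:
    for link in links:
        r = s.get(link)
        soup = bs(r.content, 'lxml')
        rows_by_link[link] = soup.find_all('tr')  # swap in whatever elements you actually need

for link, rows in rows_by_link.items():
    print(link, len(rows))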

Selenium not updating website (Python)

In a project I am doing, I am telling Selenium to go and scrape the data on the next page, which has the exact same URL.
My code:
import time

from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Chrome()
driver.get("https://etherscan.io/token/0x168296bb09e24a88805cb9c33356536b980d3fc5#balances")
iframe1 = driver.find_element_by_id('tokeholdersiframe')
driver.switch_to.frame(iframe1)
soup = BeautifulSoup(driver.page_source, 'html.parser')
token_holders = soup.find_all('tr')
driver.find_element_by_link_text('Next').click()
time.sleep(10)
token_holders2 = soup.find_all('tr') #I get the data from previous page (exact same as token_holder) rather than the new data.
However, Selenium doesn't update and I still get the same data from the previous page.
I tried using an implicit wait after the click:
driver.implicitly_wait(30)
but it doesn't work. I also tried resetting the soup to the driver.page_source, as well as making the driver re-find the iframe using driver.find_element_by_id("id"), but neither works.
From the question I assume Selenium is not waiting for the next page to load. One way of ensuring this happens (while not the most elegant) is to pick an element on the current page that you know will change, and wait for that change to happen after clicking; see https://selenium-python.readthedocs.io/waits.html for details on the waits Selenium offers.
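A minimal sketch of that idea, continuing from the question's setup (the driver is already switched into the iframe; that the first row's text differs between pages is an assumption):
from selenium.webdriver.support.ui import WebDriverWait

# remember something from the current page that should differ on the next one
first_row_text = driver.find_element_by_css_selector("tr td span").text
driver.find_element_by_link_text('Next').click()

# wait up to 15 seconds for that remembered value to change
WebDriverWait(driver, 15).until(
    lambda d: d.find_element_by_css_selector("tr td span").text != first_row_text
)
soup = BeautifulSoup(driver.page_source, 'html.parser')
token_holders2 = soup.find_all('tr')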
Alternatively, you can add a fixed sleep after your click, e.g.:
from time import sleep
...
driver.find_element_by_link_text('Next').click()
sleep(0.5)  # wait for half a second
# scrape the page
After you create a soup it won't dynamically update to reflect the driver.page_source. You need to create a new instance of BeautifulSoup and pass the updated page source.
token_holders = soup.find_all('tr')
driver.find_element_by_link_text('Next').click()
soup = BeautifulSoup(driver.page_source, 'html.parser')
token_holders2 = soup.find_all('tr')
>>> token_holders[1:]
[<tr><td>1</td><td><span>0xd35a2d8c651f3eba4f0a044db961b5b0ccf68a2d</span></td><td>310847219.011683</td><td>31.0847%</td></tr>,
<tr><td>2</td><td><span>0xe17c20292b2f1b0ff887dc32a73c259fae25f03b</span></td><td>200000001</td><td>20.0000%</td></tr>,
...
]
>>> token_holders2[1:]
[<tr><td>51</td><td><span>0x5473621d6d5f68561c4d3c6a8e23f705c8db7661</span></td><td>687442.69121294</td><td>0.0687%</td></tr>,
<tr><td>52</td><td><span>0xbc14ca2a57ea383a94281cc158f34870be345eb6</span></td><td>619772.39698</td><td>0.0620%</td></tr>,
...
]

Optimizing a webcrawl

The following crawl, though very short, is painfully slow. I mean, "Pop in a full-length feature film," slow.
from urllib.request import urlopen
from bs4 import BeautifulSoup

def bestActressDOB():
    # create empty bday list
    bdays = []
    # for every base url
    for actress in getBestActresses("https://en.wikipedia.org/wiki/Academy_Award_for_Best_Actress"):
        # use actress list to create unique actress url
        URL = "http://en.wikipedia.org" + actress
        # connect to html
        html = urlopen(URL)
        # create soup object
        bsObj = BeautifulSoup(html, "lxml")
        # get text from <span class="bday">
        try:
            bday = bsObj.find("span", {"class": "bday"}).get_text()
        except AttributeError:
            print(URL)
        bdays.append(bday)
        print(bday)
    return bdays
It grabs the name of every actress nominated for an Academy Award from a table on one Wikipedia page, converts that to a list, then uses those names to build URLs and visit each actress's wiki page, where it grabs her date of birth. The data will be used to calculate the age at which each actress was nominated for, or won, the Academy Award for Best Actress. Beyond Big O, is there a way to speed this up in real time? I have little experience with this sort of thing, so I am unsure of how normal this is. Thoughts?
Edit: Requested sub-routine
def getBestActresses(URL):
    bestActressNomineeLinks = []
    html = urlopen(URL)
    try:
        soup = BeautifulSoup(html, "lxml")
        table = soup.find("table", {"class": "wikitable sortable"})
    except AttributeError:
        print("Error creating/navigating soup object")
    table_row = table.find_all("tr")
    for row in table_row:
        first_data_cell = row.find_all("td")[0:1]
        for datum in first_data_cell:
            actress_name = datum.find("a")
            links = actress_name.attrs['href']
            bestActressNomineeLinks.append(links)
    #print(bestActressNomineeLinks)
    return bestActressNomineeLinks
I would recommend trying it on a faster computer, or even running it on a service like Google Cloud Platform, Microsoft Azure, or Amazon Web Services. There is no code that will make it go faster.
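For anyone who does want to try a code-level change anyway: most of the wall-clock time here is spent waiting on the network, so fetching the actress pages concurrently is at least worth experimenting with. A rough sketch using concurrent.futures, reusing getBestActresses() from the question (an illustration, not a drop-in replacement):
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen

from bs4 import BeautifulSoup

def fetch_bday(actress):
    # fetch one actress page and pull the bday span, mirroring bestActressDOB()
    URL = "http://en.wikipedia.org" + actress
    bsObj = BeautifulSoup(urlopen(URL), "lxml")
    tag = bsObj.find("span", {"class": "bday"})
    return tag.get_text() if tag else None

actresses = getBestActresses("https://en.wikipedia.org/wiki/Academy_Award_for_Best_Actress")
with ThreadPoolExecutor(max_workers=10) as pool:
    bdays = list(pool.map(fetch_bday, actresses))
print(bdays)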
