Beautiful Soup Questions - python-3.x

I am trying to scrape the following website; however, I have encountered some problems. As you can see, I am not really familiar with regex, and I hope you can give me some pointers on how to solve this problem.
Basically, I hope I can download all the transaction data into the database. However, I need to retrieve it first.
Thank you
Below is my code:
from bs4 import BeautifulSoup
import requests
import re
url = 'https://bochk.etnet.com.hk/content/bochkweb/eng/quote_transaction_daily_history.php?services=STK&code=6881\
&ip=boc.etnet.com.hk&host=BochkUMSContent&sessionId=44c99b61679e019666f0570db51ad932'
pattern = re.compile('var json_result = (.*?);')
def turnover_detail(url):
    response = requests.get(url)
    html = response.content
    soup = BeautifulSoup(html, "html.parser")
    data = soup.find_all("script")
    for json in data:
        print(json)
turnover_detail(url)
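For reference, here is a minimal sketch of how the compiled pattern could be applied, reusing the imports, pattern, and url defined above (the helper name extract_json_result is made up here). It assumes the page embeds the transaction data as JSON in a var json_result = ...; statement inside one of the script tags; the pattern may need re.DOTALL if that JSON spans several lines:
import json

def extract_json_result(url):
    response = requests.get(url)
    soup = BeautifulSoup(response.content, "html.parser")
    for script in soup.find_all("script"):
        match = pattern.search(script.get_text())
        if match:
            # parse the captured JSON text into Python data structures
            return json.loads(match.group(1))
    return None

print(extract_json_result(url))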

Related

Use beautifulsoup to download href links

Looking to download href links using BeautifulSoup 4, Python 3, and the requests library.
This is the code that I have now. I thought it would be tough to use regex in this situation, but I'm not sure if this can be done with BeautifulSoup instead. I have to download all of the shape files from the grid and am looking to automate this task. Thank you!
URL:
https://earth-info.nga.mil/index.php?dir=coordsys&action=gars-20x20-dloads
import requests
from bs4 import BeautifulSoup
import re
URL = 'https://earth-info.nga.mil/index.php?dir=coordsys&action=gars-20x20-dloads'
page = requests.get(URL)
soup = BeautifulSoup(page.content,'html.parser')
results = re.findall(r'<a[^>]* href="([^"]*)"', page)
print(results)
Those files are all associated with the area tag, so I would simply select those:
import requests
from bs4 import BeautifulSoup as bs
r = requests.get('https://earth-info.nga.mil/index.php?dir=coordsys&action=gars-20x20-dloads')
soup = bs(r.content, 'lxml')
files = ['https://earth-info.nga.mil/' + i['href'] for i in soup.select('area')]
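From there, a possible way to save each file, reusing the files list and requests import above. This is a sketch only: it assumes each href points straight at a downloadable archive and simply uses the last URL segment as the file name, and 'gars_files' is just a made-up output folder:
import os

os.makedirs('gars_files', exist_ok=True)
for file_url in files:
    name = file_url.rsplit('/', 1)[-1]  # crude file name: last segment of the URL
    resp = requests.get(file_url)
    with open(os.path.join('gars_files', name), 'wb') as f:
        f.write(resp.content)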
Alternatively, you can keep the regex but search the response text (a string) rather than the Response object itself.
Instead of:
results = re.findall(r'<a[^>]* href="([^"]*)"', page)
Use:
results = re.findall(r'<a[^>]* href="([^"]*)"', page.text)
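For completeness, the same href extraction can also be done without regex, using the soup object already built from page.content in the question:
hrefs = [a['href'] for a in soup.find_all('a', href=True)]
print(hrefs)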

BeautifulSoup response does not match view-source output

While comparing the response from my code with the Chrome page source, I observe that the response returned via BeautifulSoup does not match the page source code. I want to fetch class="rc"; I can see the class "rc" in the page source, but I cannot find it in the printed response. I checked with "lxml" and "html.parser" too.
I am a beginner in Python, so my question might sound basic. I have also already checked a few articles related to my problem (BeautifulSoup returning different html than view source) but could not find a solution.
Below is my code:
import sys, requests
import re
import docx
import webbrowser
from bs4 import BeautifulSoup
query = sys.argv
url = "https://google.com/search?q=" + "+".join(query[1:])
print(url)
res = requests.get(url)
# print(res[:1000])
if res.status_code == 200:
    soup = BeautifulSoup(res.text, "html5lib")
    print(type(soup))
    all_select = soup.select("div", {"class": "rc"})
    print("All Select ", all_select)
I had the same problem; try using another parser, such as "lxml", instead of "html5lib".
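A minimal sketch of that change, continuing from the question's code above. As an aside, find_all accepts an attrs dict while select expects a CSS selector string, so soup.select("div", {"class": "rc"}) does not actually filter on the class:
soup = BeautifulSoup(res.text, "lxml")             # swap in the lxml parser (requires the lxml package)
results = soup.find_all("div", {"class": "rc"})    # find_all takes an attrs dict
# equivalent CSS-selector form: results = soup.select("div.rc")
print(len(results))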

find() in Beautifulsoup returns None

I'm very new to programming in general and I'm trying to write my own little torrent leecher. I'm using BeautifulSoup to extract the title and the magnet link of a torrent file. However, find() keeps returning None no matter what I do. The page is correct. I've also tested with find_next_sibling and read all the similar questions, but to no avail. Since there are no errors, I have no idea what my mistake is.
Any help would be much appreciated. Below is my code:
import urllib3
from bs4 import BeautifulSoup
print("Please enter the movie name: \n")
search_string = input("")
search_string.rstrip()
search_string.lstrip()
open_page = ('https://www.yify-torrent.org/search/' + search_string + '/s-1/all/all/') # get link - creates a search string with input value
print(open_page)
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)
manager = urllib3.PoolManager(10)
page_content = manager.urlopen('GET',open_page)
soup = BeautifulSoup(page_content,'html.parser')
magnet = soup.find('a', attrs={'class': 'movielink'}, href=True)
print(magnet)
Check out the following script, which does exactly what you want to achieve. I used the requests library instead of urllib3. The main mistake you made is that you looked for the magnet link in the wrong place: you need to go one layer deeper to dig out that link. Also, try using quote instead of string manipulation to fit your search query into the URL.
Give this a shot:
import requests
from urllib.parse import urljoin
from urllib.parse import quote
from bs4 import BeautifulSoup
keyword = 'The Last Of The Mohicans'
url = 'https://www.yify-torrent.org/search/'
# build the search URL with the query percent-encoded by quote()
base = f"{url}{quote(keyword)}{'/p-1/all/all/'}"
res = requests.get(base)
soup = BeautifulSoup(res.text, 'html.parser')
# grab the link to the first matching movie's detail page from the search results
tlink = urljoin(url, soup.select_one(".img-item .movielink").get("href"))
# go one layer deeper: the title and magnet link live on the detail page
req = requests.get(tlink)
sauce = BeautifulSoup(req.text, "html.parser")
title = sauce.select_one("h1[itemprop='name']").text
magnet = sauce.select_one("a#dm").get("href")
print(f"{title}\n{magnet}")

Python - Issue Scraping with BeautifulSoup

I'm trying to scrape the Stack Overflow jobs page using Beautiful Soup 4 and urllib as a personal project. I'm facing an issue where I'm trying to scrape all the links to the 50 jobs listed on each page. I'm using a regex to identify these links. Even though I reference the tag properly, I am facing these two specific issues:
Instead of the 50 links clearly visible in the source code, I get only 25 results each time as my output (after accounting for and removing an initial irrelevant link).
There's a difference between how the links are ordered in the source code and in my output.
Here's my code. Any help on this will be greatly appreciated:
import bs4
import urllib.request
import re
#Obtaining source code to parse
sauce = urllib.request.urlopen('https://stackoverflow.com/jobs?med=site-ui&ref=jobs-tab&sort=p&pg=0').read()
soup = bs4.BeautifulSoup(sauce, 'html.parser')
snippet = soup.find_all("script",type="application/ld+json")
strsnippet = str(snippet)
print(strsnippet)
joburls = re.findall('https://(?:[a-zA-Z]|[0-9]|[$-_#.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', strsnippet)
print("Urls: ",joburls)
print(len(joburls))
Disclaimer: I did some asking of my own for a part of this answer.
from bs4 import BeautifulSoup
import requests
import json
# note: link is slightly different; yours just redirects here
link = 'https://stackoverflow.com/jobs?med=site-ui&ref=jobs-tab&sort=p'
r = requests.get(link)
soup = BeautifulSoup(r.text, 'html.parser')
s = soup.find('script', type='application/ld+json')
urls = [el['url'] for el in json.loads(s.text)['itemListElement']]
print(len(urls))
50
Process:
Use soup.find rather than soup.find_all. This gives a single bs4.element.Tag whose text is the JSON-LD payload.
json.loads(s.text) then gives a nested structure. The value of the itemListElement key is a list of job entries; pull the url from each entry and collect them into a list (an illustrative shape of the payload is sketched below).
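Roughly, the payload the code above walks through is shaped like this (illustrative only; the values are placeholders, not real data):
# json.loads(s.text) produces something shaped like:
# {
#     "@context": "https://schema.org",
#     "@type": "ItemList",
#     "itemListElement": [
#         {"@type": "ListItem", "position": 1, "url": "https://stackoverflow.com/jobs/..."},
#         {"@type": "ListItem", "position": 2, "url": "..."},
#         ...
#     ]
# }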

Asynchronous scraping with Python: grequests and Beautifulsoup4

I am trying to scrape this site. I managed to do it using urllib and BeautifulSoup, but urllib is too slow. I want to make asynchronous requests because there are thousands of URLs. I found that a nice package for this is grequests.
example:
import grequests
from bs4 import BeautifulSoup
pages = []
page="https://www.spitogatos.gr/search/results/residential/sale/r100/m100m101m102m103m104m105m106m107m108m109m110m150m151m152m153m154m155m156m157m158m159m160m161m162m163m164m165m166m167m168m169m170m171m172m173m174m175m176m177m178m179m180m181m182m183m184m185m186m187m188m189m190m191m192m193m194m195m196m197m198m106001m125000m"
for i in range(1,1000):
    pages.append(page)
    page="https://www.spitogatos.gr/search/results/residential/sale/r100/m100m101m102m103m104m105m106m107m108m109m110m150m151m152m153m154m155m156m157m158m159m160m161m162m163m164m165m166m167m168m169m170m171m172m173m174m175m176m177m178m179m180m181m182m183m184m185m186m187m188m189m190m191m192m193m194m195m196m197m198m106001m125000m"
    page = page + "/offset_{}".format(i*10)
rs = (grequests.get(item) for item in pages)
a = grequests.map(rs)
The problem is that I don't know how to continue and use BeautifulSoup so as to get the HTML code of every page.
It would be nice to hear your ideas. Thank you!
Refer to the script below, and also check the linked source; it will help.
# 'links' here is the list of page URLs to fetch (the 'pages' list in your case)
reqs = (grequests.get(link) for link in links)
resp = grequests.imap(reqs, size=10)  # run up to 10 requests concurrently
for r in resp:
    soup = BeautifulSoup(r.text, 'lxml')
    results = soup.find_all('a', attrs={"class": 'product__list-name'})
    print(results[0].text)
    prices = soup.find_all('span', attrs={'class': "pdpPriceMrp"})
    print(prices[0].text)
    discount = soup.find_all("div", attrs={"class": "listingDiscnt"})
    print(discount[0].text)
Source: https://blog.datahut.co/asynchronous-web-scraping-using-python/
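Applied to the pages list built in the question, the same pattern would look roughly like this (a sketch only; the parsing inside the loop depends on the markup of the target site, so it is left as a placeholder):
rs = (grequests.get(u) for u in pages)
for r in grequests.imap(rs, size=10):
    soup = BeautifulSoup(r.text, 'html.parser')
    # parse each results page here with soup.find_all(...) as needed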
