bs4 how to use find after using find_all?

I would like to search a BBS and find the content of one specific user. Here is the code I have:
import requests
from bs4 import BeautifulSoup
main_html = requests.get('http://bbs.tianya.cn/post-free-2328897-1.shtml')
main_soup = BeautifulSoup(main_html.text, 'lxml')
all_a = main_soup.find_all('div', _host='羡鱼则结网')
Now I have all the HTML from this user, and I would like to search it again for the content. However, I can't call find on the result, because find_all returns a list.
What should I do to search twice by BeautifulSoup?
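A minimal sketch of one way to do this (the 'bbs-content' class below is an assumption about Tianya's markup, not taken from the page): every element of the list that find_all returns is itself a Tag, so you can call find on each one in a loop:
import requests
from bs4 import BeautifulSoup

main_html = requests.get('http://bbs.tianya.cn/post-free-2328897-1.shtml')
main_soup = BeautifulSoup(main_html.text, 'lxml')

# find_all returns a ResultSet (a list of Tag objects), so search each
# Tag in the list individually instead of calling find on the list itself.
for div in main_soup.find_all('div', _host='羡鱼则结网'):
    content = div.find('div', class_='bbs-content')  # hypothetical class name
    if content is not None:
        print(content.get_text(strip=True))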

Related

Use beautifulsoup to download href links

Looking to download href links using beautifulsoup4, Python 3 and the requests library.
This is the code that I have now. I thought it would be tough to use regex in this situation, but I'm not sure if this can be done with BeautifulSoup either. I have to download all of the shape files from the grid and am looking to automate this task. Thank you!
URL:
https://earth-info.nga.mil/index.php?dir=coordsys&action=gars-20x20-dloads
import requests
from bs4 import BeautifulSoup
import re
URL = 'https://earth-info.nga.mil/index.php?dir=coordsys&action=gars-20x20-dloads'
page = requests.get(URL)
soup = BeautifulSoup(page.content,'html.parser')
results = re.findall(r'<a[^>]* href="([^"]*)"', page)
print(results)
Those files are all associated with an area tag, so I would simply select those:
import requests
from bs4 import BeautifulSoup as bs
r = requests.get('https://earth-info.nga.mil/index.php?dir=coordsys&action=gars-20x20-dloads')
soup = bs(r.content, 'lxml')
files = ['https://earth-info.nga.mil/' + i['href'] for i in soup.select('area')]
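To actually save the files from there, a minimal follow-up sketch (it assumes each URL points directly at a downloadable file and derives a filename from the last path segment):
# Download each file; deriving the filename from the URL is an assumption.
for url in files:
    filename = url.rsplit('/', 1)[-1] or 'download'
    resp = requests.get(url)
    resp.raise_for_status()
    with open(filename, 'wb') as f:
        f.write(resp.content)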
You can convert page to a string in order to search for all the a tags using regex.
Instead of:
results = re.findall(r'<a[^>]* href="([^"]*)"', page)
Use:
results = re.findall(r'<a[^>]* href="([^"]*)"', page.text)
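That said, regex over raw HTML is brittle; since the page is already parsed into soup, a sketch of the same extraction using BeautifulSoup's own attribute filtering:
# Collect the href of every <a> tag that actually has one.
results = [a['href'] for a in soup.find_all('a', href=True)]
print(results)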

I'm having trouble returning an HTML link as I pull links from a google search query in Python

I'm attempting to pull website links from a Google search, but I'm having trouble returning any value. I think the issue is with the attributes I'm using to call the web link, but I'm not sure why, as I was able to use the same attributes in webdriver to accomplish the result.
Here's the code:
import requests
import sys
import webbrowser
import bs4
from parsel import Selector
import xlsxwriter
from openpyxl import load_workbook
import pandas as pd
print('Searching...')
res = requests.get('https://google.com/search?q="retail software" AND "mission"')
soup = bs4.BeautifulSoup(res.content, 'html.parser')
for x in soup.find_all('div', class_='yuRUbf'):
    anchors = x.find_all('a')
    if anchors:
        link = anchors[0]['href']
        print(link)
This is the output:
Searching...
That's it. Thanks in advance for the help!
The class value is dynamic, so you should use the following selector to retrieve the href value:
"a[href*='/url']"
This will get any a tag whose href contains the pattern /url.
So, just change your for loop to:
for anchor_tags in soup.select("a[href*='/url']"):
    print(anchor_tags.attrs.get('href'))
Example of an href that gets printed:
/url?q=https://www.devontec.com/mission-vision-and-values/&sa=U&ved=2ahUKEwjShcWU-aPtAhWEH7cAHYfQAlEQFjAAegQICBAB&usg=AOvVaw14ziJ-ipXIkPWH3oMXCig1
To get the link, you only need to split the string. You can do it like this:
for anchor_tags in soup.select("a[href*='/url']"):
    link = anchor_tags.attrs.get('href').split('q=')[-1]  # split on 'q=' and take the last piece
    print(link)
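Note that the split is fragile if the target URL itself contains 'q='; a sketch using urllib.parse from the standard library extracts the q parameter explicitly:
from urllib.parse import urlparse, parse_qs

for anchor_tags in soup.select("a[href*='/url']"):
    href = anchor_tags.attrs.get('href')
    # hrefs look like /url?q=<target>&sa=... - parse the query string and take 'q'
    query = parse_qs(urlparse(href).query)
    if 'q' in query:
        print(query['q'][0])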

results of soup.find are None despite the content existing

I'm trying to track the price of a product on Amazon using Python in a Jupyter notebook. I've imported bs4 and requests for this task.
When I inspect the HTML on the product page I can see <span id="productTitle" class="a-size-large">.
However, when I try to search for it using soup.find(id="productTitle"), the result comes out as None.
I've tried soup.find with other ids and classes, but the results are still None.
title = soup.find(id="productTitle")
This is the code I use to find the id.
If I fix this, I hope to be able to get the name of the product whose price I will be tracking.
That info is stored in various places in the returned HTML. Have you checked your response to see that you are not blocked or getting an unexpected response?
I found it with that id using lxml and strip:
import requests
from bs4 import BeautifulSoup as bs
r = requests.get('https://www.amazon.com/dp/B00M4LWO8O/')
soup = bs(r.content, 'lxml')
print(soup.select_one('#productTitle').text.strip())
Also,
import requests
from bs4 import BeautifulSoup as bs
r = requests.get('https://www.amazon.com/dp/B00M4LWO8O/')
soup = bs(r.content, 'lxml')
print(soup.select_one('#imgTagWrapperId img[alt]')['alt'])
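If the check above suggests you are blocked, sending browser-like headers sometimes helps (a sketch; the User-Agent string is illustrative, and Amazon may still block automated requests):
import requests
from bs4 import BeautifulSoup as bs

# Illustrative User-Agent header; any recent browser string may work.
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'}
r = requests.get('https://www.amazon.com/dp/B00M4LWO8O/', headers=headers)
soup = bs(r.content, 'lxml')
title = soup.select_one('#productTitle')
print(title.text.strip() if title else 'productTitle not found - likely blocked')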

Why is 'amp;' included in many parts of the links ('a') that I'm trying to scrape using BeautifulSoup in Python? What's the better way to remove it?

I am using findAll('a'), or variations of it, to extract a particular tag or class, but I'm getting 'amp;' in the middle of the link in many places.
Example:
The two links, the actual one and the erroneous ('amp;') one:
https://help.cybonline.co.uk/system/selfservice.controller?CONFIGURATION=1113&PARTITION_ID=1&secureFlag=true&TIMEZONE_OFFSET=&CMD=VIEW_ARTICLE&ARTICLE_ID=14311&CUST_PREV_CMD=null
https://help.cybonline.co.uk/system/selfservice.controller?CONFIGURATION=1113&amp;PARTITION_ID=1&amp;secureFlag=true&amp;TIMEZONE_OFFSET=&amp;CMD=VIEW_ARTICLE&amp;ARTICLE_ID=14311&amp;CUST_PREV_CMD=null
"selfservice.controller?CONFIGURATION=1113&PARTITION_ID=1&secureFlag=false&TIMEZONE_OFFSET=&CMD=VIEW_ARTICLE&ARTICLE_ID=14271&CUST_PREV_CMD=BROWSE_TOPIC"
I can get rid of it using regex, but is there a better way to do it?
The website I'm having a problem with is cybonline.
I don't see that problem at all with lxml. Can you try running the following?
import requests
from bs4 import BeautifulSoup as bs
base_url = 'https://help.cybonline.co.uk/system/'
r = requests.get('https://help.cybonline.co.uk/system/selfservice.controller?CONFIGURATION=1113&PARTITION_ID=1&secureFlag=true&TIMEZONE_OFFSET=&CMD=BROWSE_TOPIC&TOPIC_ID=55956')
soup = bs(r.content, 'lxml')
links = [base_url + item['href'] for item in soup.select('.articleAnchor')]
print(links)
If not, you can use replace:
base_url + item['href'].replace('amp;', '')
If you want to remove that 'amp;' value, you can simply use replace while fetching the value.
import requests
from bs4 import BeautifulSoup
html=requests.get("https://help.cybonline.co.uk/system/selfservice.controller?CONFIGURATION=1113&PARTITION_ID=1&secureFlag=true&TIMEZONE_OFFSET=&CMD=BROWSE_TOPIC&TOPIC_ID=55956").text
soup=BeautifulSoup(html,'html.parser')
for a in soup.find_all('a', class_='articleAnchor'):
    link = a['href'].replace('amp;', '')
    print(link)
OR
import requests
from bs4 import BeautifulSoup
html=requests.get("https://help.cybonline.co.uk/system/selfservice.controller?CONFIGURATION=1113&PARTITION_ID=1&secureFlag=true&TIMEZONE_OFFSET=&CMD=BROWSE_TOPIC&TOPIC_ID=55956").text
soup=BeautifulSoup(html,'html.parser')
for a in soup.select('a.articleAnchor'):
    link = a['href'].replace('amp;', '')
    print(link)
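If the hrefs really contain literal '&amp;' sequences, a cleaner alternative to string replacement is to unescape the entities with the standard library html module (a sketch building on the soup above):
import html

for a in soup.select('a.articleAnchor'):
    link = html.unescape(a['href'])  # turns '&amp;' back into '&'
    print(link)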

How to Extract just the text from a website using Jupyter?

I am trying to get the text of an article from a link, but along with the text I am getting all the other links, advertisement links, and image names, which I don't need for my analysis.
import re
from nltk import word_tokenize, sent_tokenize, ngrams
from collections import Counter
from urllib import request
from bs4 import BeautifulSoup
url = "https://www.yahoo.com/news/best-bites-weeknight-meals-cauliflower-
120000419.html" #this is the link
html = request.urlopen(url).read().decode('utf8')
raw = BeautifulSoup(html,"lxml").get_text()
raw
I am getting this result (I copied just a few lines; the actual text of the article is also there, on other lines):
window.performance && window.performance.mark &&
window.performance.mark(\'PageStart\');Best Bites: Weeknight meals
cauliflower vegetable fried rice!function(s,f,p){var
a=[],e={_version:"3.6.0",_config:{classPrefix:"",enableClasses:!0,enableJSClass:!0,usePrefixes:!0},_q:[],on:function(e,t){var
n=this;setTimeout(function(){t(n[e])},0)},addTest:function(e,t,n){a.push({name:e,fn:t,options:n})},addAsyncTest:function(e){a.push({name:null,fn:e})}},l=function(){};l.prototype=e,l=new
l;var c=[];function v(e,t){return typeof e===t}var t="Moz O ms
Webkit",u=e._config
I just want to know if there is any way for me to extract just the text of an article, ignoring all these values.
When BS4 parses a site it creates its own DOM internally as an object.
To access different parts of the DOM, we have to use the correct accessors or tags, like below:
import re
from collections import Counter
from urllib import request
from bs4 import BeautifulSoup
url = "https://www.yahoo.com/news/best-bites-weeknight-meals-cauliflower-120000419.html" #this is the link
html = request.urlopen(url).read().decode('utf8')
parsedHTML = BeautifulSoup(html, "html.parser")
readableText = parsedHTML.article.get_text() # <- we got the text from inside the <article> tag
print(readableText)
You were close, but you didn't specify which tag you wanted to get_text() from.
Also, find() and find_all() are very useful for finding tags on a page.
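For example, a sketch of the same extraction with find(), falling back gracefully when the tag is missing:
article = parsedHTML.find('article')  # first <article> tag, or None if absent
if article is not None:
    # find_all collects every match, e.g. each paragraph inside the article
    for p in article.find_all('p'):
        print(p.get_text())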
