I can't select by id in bs4 (BeautifulSoup) because the id is a number.
import bs4
soup = bs4.BeautiFullSoup("<td id='1'>This is text</td>", 'lxml')
td = soup.select('#1')
This shows the following error:
raise SelectorSyntaxError(msg, self.pattern, index)
soupsieve.util.SelectorSyntaxError: Malformed id selector at position 0
line 1:
#1
Try this. Use bs4.BeautifulSoup instead of bs4.BeautiFullSoup, then look the element up by its id attribute:
td = soup.find(attrs={'id': '1'})
You can also select the parent element of the td and then use a for loop to get the desired output.
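A minimal sketch of both working approaches (this uses html.parser so the bare <td> fragment is kept; lxml may drop a <td> that is not wrapped in a <table>):

import bs4

# Note the correct spelling: BeautifulSoup, not BeautiFullSoup.
soup = bs4.BeautifulSoup("<td id='1'>This is text</td>", 'html.parser')

# A CSS id selector like #1 cannot start with a digit, but an attribute selector can.
td = soup.select_one('[id="1"]')
print(td.get_text())  # This is text

# Equivalent lookup without a CSS selector.
td = soup.find(attrs={'id': '1'})
print(td.get_text())  # This is text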
It's not only because of that: BeautifulSoup is also spelled wrong, and it looks like some key parts of the code are missing. If you aren't very experienced with this code yet, I'd suggest using PyCharm, since errors like these are easy to spot and fix there.
Web URL: https://www.ipsos.com/en-us/knowledge/society/covid19-research-in-uncertain-times
I want to parse the HTML as below:
I want to get all hrefs within the <li> elements, along with the highlighted text. I tried this code:
elementList = driver.find_element_by_class_name('block-wysiwyg').find_elements_by_tag_name("li")
for i in range(len(elementList)):
    driver.find_element_by_class_name('blcokwysiwyg').find_elements_by_tag_name("li").get_attribute("href")
But the block returned None.
Can anyone please help me with the above code?
This should fetch you the required content:
import requests
from bs4 import BeautifulSoup

link = 'https://www.ipsos.com/en-us/knowledge/society/covid19-research-in-uncertain-times'
r = requests.get(link)
soup = BeautifulSoup(r.text, "html.parser")
for item in soup.select(".block-wysiwyg li"):
    item_text = item.get_text(strip=True)
    item_link = item.select_one("a[href]").get("href")
    print(item_text, item_link)
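One caveat with that snippet: select_one("a[href]") returns None for a list item that has no link, so a guarded variant (same selectors, just a sketch) might look like this:

for item in soup.select(".block-wysiwyg li"):
    item_text = item.get_text(strip=True)
    anchor = item.select_one("a[href]")
    # Fall back to None instead of raising AttributeError on link-less items
    item_link = anchor.get("href") if anchor is not None else None
    print(item_text, item_link)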
Try it this way:
coronas = driver.find_element_by_xpath("//div[@class='block-wysiwyg']/ul/li")
hr = coronas.find_element_by_xpath('./a')
print(coronas.text)
print(hr.get_attribute('href'))
Output:
The coronavirus is touching the lives of all Americans, but race, age, and income play a big role in the exact ways the virus — and the stalled economy — are affecting people. Here's what that means.
https://www.ipsos.com/en-us/america-under-coronavirus
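If you need every <li>, not just the first, here is a rough sketch using find_elements. Note that Selenium 4 removed the find_element_by_* helpers in favour of By locators; the Chrome driver setup here is an assumption:

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()  # assumes chromedriver is available on PATH
driver.get("https://www.ipsos.com/en-us/knowledge/society/covid19-research-in-uncertain-times")

# Every list item inside the article body
for li in driver.find_elements(By.CSS_SELECTOR, "div.block-wysiwyg li"):
    anchors = li.find_elements(By.TAG_NAME, "a")  # empty list if the item has no link
    if anchors:
        print(li.text, anchors[0].get_attribute("href"))

driver.quit()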
I have been trying to delete the first instance of an element using BeautifulSoup, and I am sure I am missing something. I did not use find_all since I need to target only the first instance, which is always a header (div) with the class HubHeader. The class is used in other places in combination with a div tag. Unfortunately, I can't change the setup of the base HTML.
I also tried select_one outside of a loop and it still did not work.
def delete_header(filename):
    html_docs = open(filename, 'r')
    soup = BeautifulSoup(html_docs, "html.parser")

    print(soup.select_one(".HubHeader"))  # testing

    for div in soup.select_one(".HubHeader"):
        div.decompose()

    print(soup.select_one(".HubHeader"))  # testing
    html_docs.close()

delete_header("my_file")
The most recent error is this:
AttributeError: 'NavigableString' object has no attribute 'decompose'
I am using select_one() and decompose().
Short answer: replace
for div in soup.select_one(".HubHeader"):
    div.decompose()
with this one line:
soup.select_one(".HubHeader").decompose()
Longer answer: your code iterates over a bs4.element.Tag object. .select_one() returns a single object, while .select() returns a list. If you were using .select(), your loop would work, but it would remove all occurrences of elements with the selected class, not just the first.
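A quick sketch of that .select() variant, in case removing every match is actually what is wanted (the two-header snippet below is hypothetical, just to show the difference):

from bs4 import BeautifulSoup

html = '<div class="HubHeader">one</div><p>body</p><div class="HubHeader">two</div>'
soup = BeautifulSoup(html, "html.parser")

# .select() returns a list, so this removes every HubHeader, not just the first
for div in soup.select(".HubHeader"):
    div.decompose()

print(soup)  # <p>body</p>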
I have built a bot which pulls match info from HLTV. The problem is that before 10 am there aren't any live matches, so when my bot tries to pull the page's links it raises an error.
I tried to ignore it like:
if links is None:
    pass
It still returns:
AttributeError: 'NoneType' object has no attribute 'find_all'
I tried try/except, but then the whole block of code runs again and again, like a loop inside a loop :D, which is annoying. Is there any way to solve it?
My code is below, but you won't get that error right now because it's past 10 am :D
from bs4 import BeautifulSoup
import requests, datetime

matchlinks_lm = []
r = requests.get('https://hltv.org/matches')
sauce = r.content
soup = BeautifulSoup(sauce, 'lxml')
for links in soup.find('div', class_='live-matches').find_all('a'):
    matchlinks_lm.append('https://hltv.org' + links.get('href'))
What can I do?
You're chaining the calls to .find() and find_all() in
for links in soup.find('div', class_='live-matches').find_all('a'):
which makes it impossible to check whether the first .find() returned None before .find_all() is called on it.
You could do something like this instead:
div = soup.find('div', class_='live-matches')
if div is not None:
    for link in div.find_all("a"):
        if link is None:
            continue
        matchlinks_lm.append('https://hltv.org' + link.get('href'))
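An alternative sketch: soup.select() with a combined CSS selector returns an empty list when nothing matches, so there is no NoneType to guard against at all:

from bs4 import BeautifulSoup
import requests

matchlinks_lm = []

r = requests.get('https://hltv.org/matches')
soup = BeautifulSoup(r.content, 'lxml')

# An empty list if .live-matches is missing, so the loop simply does nothing before 10 am
for link in soup.select('div.live-matches a[href]'):
    matchlinks_lm.append('https://hltv.org' + link.get('href'))

print(matchlinks_lm)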
I want to get text from a span tag but I'm having problems.
I wrote this:
import bs4 as bs
import urllib.request
page = urllib.request.urlopen('http://www.accuweather.com/en/az/baku/27103/current-weather/27103').read()
soup = bs.BeautifulSoup(page, 'html.parser')
print(soup.find_all('li', class_='wind'))
and it returned this: [<li class="wind"><strong>28 km/h</strong></li>]
but I want to get just "28 km/h"
Then I tried this:
page = urllib.request.urlopen('http://www.accuweather.com/en/az/baku/27103/current-weather/27103').read()
soup = bs.BeautifulSoup(page, 'html.parser')
print(soup.find_all("span" , { "class" : "wind" }))
but it did not work either. Please help me with it.
You need to use .find() and not .find_all() to get a single element and call .get_text() to get the text of the desired element:
print(soup.find('li', class_='wind').get_text())
Or, you can also use .select_one() and locate the same element using a CSS selector:
print(soup.select_one('li.wind').get_text())
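If the element might be missing (for example when the site serves a different layout to scripted requests), a defensive sketch using the fragment quoted in the question:

import bs4 as bs

page = '<li class="wind"><strong>28 km/h</strong></li>'
soup = bs.BeautifulSoup(page, 'html.parser')

wind = soup.find('li', class_='wind')
if wind is not None:
    print(wind.get_text(strip=True))  # 28 km/h
else:
    print('wind element not found')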
As a side note, look up the "AccuWeather API" - that might be a faster, easier, and more appropriate way to get the desired data.
I have written code that returns this:
<div id="IncidentDetailContainer"><p>The Fire Service received a call reporting a car on fire at the above location. One fire appliance from Ashburton attended.</p><p>Fire crews confirmed one car well alight and severley damaged by fire. The vehicle was extinguished by fire crews using two breathing apparatus wearers and one hose reel jet. The cause of the fire is still under investigation by the Fire Service and Police.</p><p> </p><p> </p></div>
I want to search through it and find the "Ashburton" part, but so far no matter what I use I get None or [] back.
My question is this: is this a normal string that can be searched (and I'm doing something wrong), or is it because I got it from a webpage's source code that I can't search through it the normal way?
It should be simple, I know, but I still get None!
from bs4 import BeautifulSoup
from urllib import request
import sys, traceback

webpage = request.urlopen("http://www.dsfire.gov.uk/News/Newsdesk/IncidentsPast7days.cfm?siteCategoryId=3&T1ID=26&T2ID=35")
soup = BeautifulSoup(webpage)
incidents = soup.find(id="CollapsiblePanel1")

Links = []
for line in incidents.find_all('a'):
    Links.append("http://www.dsfire.gov.uk/News/Newsdesk/" + line.get('href'))

n = 0
e = len(Links)
if e == n:
    print("No Incidents Found Please Try Later")
    sys.exit(0)

while n < e:
    webpage = request.urlopen(Links[n])
    soup = BeautifulSoup(webpage)
    station = soup.find(id="IncidentDetailContainer")
    # search string
    print(soup.body.findAll(text='Ashburton'))
    n = n + 1
Just FYI, if the webpage doesn't have any incidents today, it won't search for anything (obviously); that's why I included the returned string, so if you run it and get nothing, that's why.
Provide a pattern in text=. From the HTML you have provided, if you want to find the tag that has "Ashburton" in it, you can use something like this:
soup.find_all('p', text=re.compile(r'Ashburton'))
You can get only the matching text with this:
soup.find_all(text=re.compile(r'Ashburton'))
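A quick, self-contained check against the snippet quoted in the question, just to show what each call returns (note the re import):

import re
from bs4 import BeautifulSoup

html = ('<div id="IncidentDetailContainer"><p>The Fire Service received a call reporting a car '
        'on fire at the above location. One fire appliance from Ashburton attended.</p></div>')
soup = BeautifulSoup(html, "html.parser")

# The matching <p> tag(s); the regex is run against each tag's string
print(soup.find_all('p', text=re.compile(r'Ashburton')))

# Just the matching text nodes (NavigableString objects)
print(soup.find_all(text=re.compile(r'Ashburton')))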