Delete first instance of an element using Beautiful Soup - python-3.x

I have been trying to delete the first instance of an element using BeautifulSoup, and I am sure I am missing something. I did not use find_all, since I need to target the first instance, which is always a header (a div) with the class HubHeader. The class is used in other places in combination with a div tag. Unfortunately I can't change the setup of the base HTML.
I also tried select_one outside of a loop, and it still did not work.
from bs4 import BeautifulSoup

def delete_header(filename):
    html_docs = open(filename, 'r')
    soup = BeautifulSoup(html_docs, "html.parser")
    print(soup.select_one(".HubHeader"))  # testing
    for div in soup.select_one(".HubHeader"):
        div.decompose()
    print(soup.select_one(".HubHeader"))  # testing
    html_docs.close()

delete_header("my_file")
The most recent error is this:
AttributeError: 'NavigableString' object has no attribute 'decompose'
I am using select_one() and decompose().

Short answer: replace

for div in soup.select_one(".HubHeader"):
    div.decompose()

with one line:

soup.select_one(".HubHeader").decompose()

Longer answer: your code iterates over a bs4.element.Tag object. .select_one() returns a single object, while .select() returns a list. If you were using .select(), your code would work, but it would take out all occurrences of elements with the selected class.
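A minimal sketch contrasting the two (the two-div snippet here is made up for illustration):

from bs4 import BeautifulSoup

html = """
<div class="HubHeader">first</div>
<div class="HubHeader">second</div>
"""
soup = BeautifulSoup(html, "html.parser")

# .select_one() returns a single Tag, so call .decompose() on it directly
soup.select_one(".HubHeader").decompose()
print(soup.select(".HubHeader"))  # the second div is still present

# .select() returns a list; looping removes every remaining match
for div in soup.select(".HubHeader"):
    div.decompose()
print(soup.select(".HubHeader"))  # now an empty list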

Related

How to select an id in bs4 which is a number?

I can't select by id in bs4 (BeautiFullSoup) because the id is a number.
import bs4
soup = bs4.BeautiFullSoup("<td id='1'>This is text</td>", 'lxml')
td = soup.select('#1')
Which shows this error:
raise SelectorSyntaxError(msg, self.pattern, index)
soupsieve.util.SelectorSyntaxError: Malformed id selector at position 0
line 1:
#1
Try this: use bs4.BeautifulSoup instead of bs4.BeautiFullSoup, then look the element up by attribute:
td = soup.find(attrs={'id': '1'})
You can also select the parent element of the td and then use a for loop to get the desired output.
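For reference, a runnable sketch built from the question's snippet; the attribute-selector variant at the end is an extra option, not part of the answer above:

import bs4

soup = bs4.BeautifulSoup("<td id='1'>This is text</td>", 'lxml')

# CSS id selectors cannot start with a digit, which is why
# soup.select('#1') raises SelectorSyntaxError. Match the id as a
# plain attribute instead:
td = soup.find(attrs={'id': '1'})
print(td.text)  # This is text

# An attribute selector also works with .select_one():
td = soup.select_one('[id="1"]')
print(td.text)  # This is text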
It's not only because of that: BeautifulSoup is also spelled wrong, and it generally looks like some key parts are missing. If you aren't very experienced with that code yet, I'd suggest using PyCharm, since those errors are easily fixed.

Stripping text in Scrapy

I'm trying to run a spider to extract real estate advertisement information.
My code:
import scrapy
from ..items import RealestateItem

class AddSpider(scrapy.Spider):
    name = 'Add'
    start_urls = ['https://www.exampleurl.com/2-bedroom-apartment-downtown-4154251/']

    def parse(self, response):
        items = RealestateItem()
        whole_page = response.css('body')
        for item in whole_page:
            Title = response.css(".obj-header-text::text").extract()
            items['Title'] = Title
            yield items
After running in console:
scrapy crawl Add -o Data.csv
In the .csv file I get:
['\n 2-bedroom-apartment ']
I tried adding the strip method:
Title = response.css(".obj-header-text::text").extract().strip()
But scrapy returns:
Title = response.css(".obj-header-text::text").extract().strip()
AttributeError: 'list' object has no attribute 'strip'
Is there an easy way to make Scrapy write just this to the .csv file:
2-bedroom-apartment
AttributeError: 'list' object has no attribute 'strip'
You get this error because .extract() returns a list, and .strip() is a string method.
If that selector always returns ONE item, you can replace .extract() with .get() (or .extract_first()); this returns the first item as a string instead of a list.
If you need it to return a list, you can loop through the list, calling strip in each item like:
title = response.css(".obj-header-text::text").extract()
title = [item.strip() for item in title]
You can also use an XPath selector instead of a CSS selector; that way you can use normalize-space to strip whitespace.
title = response.xpath('normalize-space(.//*[@class="obj-header-text"]/text())').extract()
This XPath may need some adjustment; as you didn't post the source, I couldn't check it.
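Applied to the spider from the question, a sketch of the .get() variant (same item class and selector as above):

import scrapy
from ..items import RealestateItem

class AddSpider(scrapy.Spider):
    name = 'Add'
    start_urls = ['https://www.exampleurl.com/2-bedroom-apartment-downtown-4154251/']

    def parse(self, response):
        items = RealestateItem()
        # .get() returns the first match as a string (or None), so .strip() is safe
        title = response.css(".obj-header-text::text").get()
        if title:
            items['Title'] = title.strip()
        yield items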

In BeautifulSoup / Python, how do I extract a single element from a result set?

I'm using Python 3.7 and BeautifulSoup 4. I'm having a problem getting text from an element. I'm trying this
req = urllib2.Request(fullurl, headers=settings.HDR)
html = urllib2.urlopen(req, timeout=settings.SOCKET_TIMEOUT_IN_SECONDS).read()
bs = BeautifulSoup(html, features="lxml")
...
author_elts = bs.find_all("a", class_="author")
author_elt = author_elts.first
But on the "author_elt = author_elts.first" line, I'm getting the error
AttributeError: ResultSet object has no attribute 'first'. You're probably treating a list of items like a single item. Did you call find_all() when you meant to call find()?
What's the right way to extract the element from the ResultSet?
find_all returns a list, so why not use author_elts[0] to get the first element?
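For illustration, a sketch of both options, reusing the bs object from the question (the empty-result guard is an addition):

# A ResultSet behaves like a list, so index into it
author_elts = bs.find_all("a", class_="author")
if author_elts:  # guard against no matches
    author_elt = author_elts[0]
    print(author_elt.get_text())

# Equivalently, ask for just the first match:
author_elt = bs.find("a", class_="author")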

How can I change my code to get URL link from the HTML code?

I'm trying to use beautifulsoup4 to scrape URLs out of HTML code in Python, but I get an error like this: AttributeError: 'NoneType' object has no attribute 'get'
HTML code:
<a class="top NQHJEb dfhHve" href="https://globalnews.ca/news/5137005/donald-trump-robert-mueller-report/" ping="/url?sa=t&source=web&rct=j&url=https://globalnews.ca/news/5137005/donald-trump-robert-mueller-report/&ved=0ahUKEwiS9pn-4rzhAhWOyIMKHSOPD6QQvIgBCDcwAg"><img class="th BbeB2d" src="https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcQ_Nf-kVlqsQz8NeNgQ9a9YRiA7Fl4DJ6Jod0sxNXapOK_iJebx20dgROk5YBl8IqFQX6S-eeY2" alt="Story image for trump from Globalnews.ca" onload="typeof google==='object'&&google.aft&&google.aft(this)" data-iml="1554598687532" data-atf="3"></a>
My python code:
URL_results = soup.find_all('a', class_= 'top NQHJEb dfhHve').get('href')
You are applying the method to a list. Instead, you want to apply it to each element:
URL_results = [a.attrs.get('href') for a in soup.find_all('a', class_= 'top NQHJEb dfhHve')]
I prefer
URL_results = [item['href'] for item in soup.select('a.top.NQHJEb.dfhHve')]
And you may be able to remove some of the classes from the current compound class selector e.g.
URL_results = [item['href'] for item in soup.select('a.dfhHve')]
You will need to play around and see.
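Put together, a self-contained sketch using a shortened copy of the posted anchor tag:

from bs4 import BeautifulSoup

html = ('<a class="top NQHJEb dfhHve" '
        'href="https://globalnews.ca/news/5137005/donald-trump-robert-mueller-report/">'
        'story</a>')
soup = BeautifulSoup(html, 'html.parser')

# .select() returns a list of Tags, so pull href from each element
URL_results = [item['href'] for item in soup.select('a.top.NQHJEb.dfhHve')]
print(URL_results)  # ['https://globalnews.ca/news/5137005/...']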

Unable to figure out where to add wait statement in python selenium

I am searching for the elements in my list (one by one) by entering them into the search bar of a website, getting the Apple product names that appear in the search results, and printing them. However, I am getting the following exception:
StaleElementReferenceException: Message: stale element reference: element is not attached to the page document
I know it's because the element changes very quickly, so I need to add a wait, such as
wait(driver, 10).until(EC.visibility_of_element_located((By.ID, "submitbutton")))
or explicitly.
Q1. But I don't understand where I should add it. Here is my code, please help!
Q2. I want to go to all the next pages using the following, but that's not working:
driver.find_element_by_xpath('//div[@class="no-hover"]/a').click()
Earlier the exception was raised on the submit button, and now at the if statement.
That's not what an implicit wait is for. Since the page changes regularly, you can't be sure whether the object currently held in the variable is still valid.
My suggestion is to run the above code in a loop using try/except, something like the following:
from selenium.common.exceptions import StaleElementReferenceException

for element in mylist:
    while True:
        try:
            do_something_useful_while_the_page_can_change(element)
        except StaleElementReferenceException:
            # the page changed under us: retry this element
            continue
        else:
            # success: go to the next element
            break
Where:
def do_something_useful_while_the_page_can_change(element):
    searchElement = driver.find_element_by_id("searchbar")
    searchElement.send_keys(element)
    driver.find_element_by_id("searchbutton").click()
    items = driver.find_elements_by_class_name('searchresult')
    for item in items:
        if 'apple' in item.text:
            print(item.text)
    items_count = len(items)
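On Q1 specifically, a hedged sketch of where an explicit wait could slot into this helper; the element IDs are the ones assumed above, and driver is the existing WebDriver instance:

from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By

def do_something_useful_while_the_page_can_change(element):
    searchElement = driver.find_element_by_id("searchbar")
    searchElement.send_keys(element)
    driver.find_element_by_id("searchbutton").click()
    # block (up to 10 s) until the results have re-rendered before reading them
    WebDriverWait(driver, 10).until(
        EC.presence_of_all_elements_located((By.CLASS_NAME, "searchresult"))
    )
    for item in driver.find_elements_by_class_name("searchresult"):
        if 'apple' in item.text:
            print(item.text)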
I think what you had was doing too much and can be simplified. You basically need to loop through a list of search terms, myList. Inside that loop you send the search term to the search box and click search. Still inside that loop, you grab all the elements on the page that are search results (class='search-result-product-url') and whose text contains 'apple'. The XPath locator I provided does both, so the returned collection contains only the elements you want to print... so print each. End loop... back to the next search term.
for element in mylist:
    driver.find_element_by_id("search-input").send_keys(element)
    driver.find_element_by_id("button-search").click()
    # may need a wait here?
    for item in driver.find_elements_by_xpath("//a[@class='search-result-product-url'][contains(., 'apple')]"):
        print(item.text)
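On Q2, a sketch of one way to walk the next-page links with the locator from the question (untested against the real site):

from selenium.common.exceptions import NoSuchElementException

while True:
    # ... scrape the current page here ...
    try:
        next_link = driver.find_element_by_xpath('//div[@class="no-hover"]/a')
    except NoSuchElementException:
        break  # no "next" link left, so this is the last page
    next_link.click()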
