Stripping text in Scrapy - python-3.x

I'm trying to run a Scrapy spider to extract real estate advertisement information.
My code:
import scrapy
from ..items import RealestateItem

class AddSpider(scrapy.Spider):
    name = 'Add'
    start_urls = ['https://www.exampleurl.com/2-bedroom-apartment-downtown-4154251/']

    def parse(self, response):
        items = RealestateItem()
        whole_page = response.css('body')
        for item in whole_page:
            Title = response.css(".obj-header-text::text").extract()
            items['Title'] = Title
            yield items
After running this in the console:
scrapy crawl Add -o Data.csv
In the .csv file I get:
['\n 2-bedroom-apartment ']
I tried adding the strip method:
Title = response.css(".obj-header-text::text").extract().strip()
But scrapy returns:
Title = response.css(".obj-header-text::text").extract().strip()
AttributeError: 'list' object has no attribute 'strip'
Is there some easy way to make Scrapy write just this into the .csv file:
2-bedroom-apartment

AttributeError: 'list' object has no attribute 'strip'
You get this error because .extract() returns a list, and .strip() is a string method.
If that selector always returns ONE item, you could replace .extract() with .get() (or extract_first()); this returns the first item as a string instead of a list. Read more in the Scrapy selectors documentation.
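For example, a minimal sketch using the same selector as above (the default='' guard is only an assumption, so .strip() never runs on None):
Title = response.css(".obj-header-text::text").get(default='').strip()
items['Title'] = Title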
If you need it to return a list, you can loop through the list, calling strip() on each item, like:
title = response.css(".obj-header-text::text").extract()
title = [item.strip() for item in title]
You can also use an XPath selector instead of a CSS selector; that way you can use normalize-space to strip the whitespace.
title = response.xpath('normalize-space(.//*[@class="obj-header-text"]/text())').extract()
This XPath may need some adjustment; since you didn't post the page source, I couldn't check it.
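Putting it together, a minimal corrected spider could look like the sketch below (it keeps the item class from the question and drops the redundant loop over body):
import scrapy
from ..items import RealestateItem

class AddSpider(scrapy.Spider):
    name = 'Add'
    start_urls = ['https://www.exampleurl.com/2-bedroom-apartment-downtown-4154251/']

    def parse(self, response):
        items = RealestateItem()
        # .get() returns a single string (or the default), so .strip() works on it
        items['Title'] = response.css(".obj-header-text::text").get(default='').strip()
        yield items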

Related

scrapy RuntimeError: To use XPath or CSS selectors, ItemLoader must be instantiated with a selector

I have a ProductItemLoader, which is just a simple ItemLoader that loads into a simple ProductItem with an offer_type field.
I run this code:
il = ProductItemLoader(response=response)
il.add_css('offer_type', '.incentive-type-label')
and receive:
RuntimeError: To use XPath or CSS selectors, ItemLoader must be instantiated with a selector
What am I doing wrong?
The more concise way of declaring an ItemLoader is the following:
item = ItemLoader(item=ProductItem(), selector=response)
If you loop over a broader selector:
sel = response.xpath('//xpath/selection')  # returns a list of Selectors
for one_product in sel:
    item = ItemLoader(item=ProductItem(), response=response, selector=one_product)
    # item populating
    # yielding the item
So it turns out that I had this code running on Scrapy 1.4 and have now moved to Scrapy 2.3. In the old version it worked fine, but now, in order to make use of selectors, I had to add some lines and remove old ones.
So instead of this:
il = ProductItemLoader(response=response)
I now needed to do this:
from scrapy.selector import Selector
selector = Selector(response=response, type='html')
il = AudiDealItemLoader(selector=selector)
Reference: the ItemLoader section of the Scrapy docs.
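A compact sketch of the same fix inside a parse method, combining the two snippets above (ProductItem comes from your items module; adding ::text to the CSS selector is an assumption if you only want the text):
from scrapy.selector import Selector
from scrapy.loader import ItemLoader

def parse(self, response):
    selector = Selector(response=response, type='html')
    il = ItemLoader(item=ProductItem(), selector=selector)
    il.add_css('offer_type', '.incentive-type-label')  # add ::text to grab only the text
    yield il.load_item()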

In BeautifulSoup / Python, how do I extract a single element from a result set?

I'm using Python 3.7 and BeautifulSoup 4. I'm having a problem getting text from an element. I'm trying this
req = urllib2.Request(fullurl, headers=settings.HDR)
html = urllib2.urlopen(req, timeout=settings.SOCKET_TIMEOUT_IN_SECONDS).read()
bs = BeautifulSoup(html, features="lxml")
...
author_elts = bs.find_all("a", class_="author")
author_elt = author_elts.first
But on the "author_elt = author_elts.first" line, I'm getting the error
AttributeError: ResultSet object has no attribute 'first'. You're probably treating a list of items like a single item. Did you call find_all() when you meant to call find()
What's the right way to extract the element from the ResultSet?
find_all() returns a list (a ResultSet), so why not use author_elts[0] to get the first element?
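For instance, a small sketch of both options (indexing the ResultSet, or using find() to get only the first match):
author_elts = bs.find_all("a", class_="author")
if author_elts:                      # ResultSet is empty if nothing matched
    author_elt = author_elts[0]
    print(author_elt.get_text(strip=True))

# or fetch just the first match directly
author_elt = bs.find("a", class_="author")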

Clear old text and input new text in Python Selenium

I am using Selenium to clear old text in a text area before inputting new text in a web browser. This is my code:
MY_UDP_SESSION = 32768
elem = driver.find_element_by_id("udp-session-quota").clear()
time.sleep(1)
elem.send_keys(MY_UDP_SESSION)
but I see this error:
'NoneType' object has no attribute 'send_keys'
clear()
clear() clears the text if it's a text entry element and is defined as:
def clear(self):
    """Clears the text if it's a text entry element."""
    self._execute(Command.CLEAR_ELEMENT)
As clear() doesn't return anything, with your line of code:
elem = driver.find_element_by_id("udp-session-quota").clear()
elem is assigned None, i.e. NoneType. So when you then try to invoke send_keys() on elem:
elem.send_keys(MY_UDP_SESSION)
You see the error.
Solution
Once the WebElement is returned, you can invoke clear() and send_keys() on it as follows:
MY_UDP_SESSION = "32768"
elem = driver.find_element_by_id("udp-session-quota")
elem.clear()
elem.send_keys(MY_UDP_SESSION)
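As a side note, if you are on Selenium 4, the find_element_by_* helpers are deprecated; a sketch of the same fix with the By locator (element id taken from the question):
from selenium.webdriver.common.by import By

elem = driver.find_element(By.ID, "udp-session-quota")
elem.clear()
elem.send_keys("32768")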

Not getting the double for-loop in Python to work

I'm trying to scrape some content, but I cannot get a double for-loop to work. I tried looking up other examples/solutions but have had no luck on my own. Using Python 3.x and BS4.
Context:
In the HTML content there is a container holding 11x ("div", {"class": "days"})
Within each of these, there can be 1-8x ("div", {"class": "item"})
Of each item, I want the 'name' and 'description' fields
page_soup = soup(page_html, "html.parser")
days = page_soup.findAll("div", {"class": "days"})
for item in days.findAll("div", {"class": "item"}):
    name = item.h3.a.text
    description = item.h4.a.text
    print(name, description)
This gives me the error AttributeError: ResultSet object has no attribute 'findAll'. When I add days = days[0], it gives me the correct details of the first 'days' element. But I want it to loop through all 11 'days'; how do I loop through them?
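A minimal sketch of the nested loop: iterate over the days ResultSet itself, then over each day's items (class names taken from the question):
for day in page_soup.findAll("div", {"class": "days"}):       # all 11 day containers
    for item in day.findAll("div", {"class": "item"}):        # 1-8 items inside each day
        name = item.h3.a.text
        description = item.h4.a.text
        print(name, description)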

Delete first instance of an element using Beautiful Soup

I have been trying to delete the first instance of an element using BeautifulSoup, and I am sure I am missing something. I did not use find_all since I need to target only the first instance, which is always a header (div) with the class HubHeader. The class is used in other places in combination with a div tag. Unfortunately I can't change the setup of the base HTML.
I also tried select_one outside of a loop and it still did not work.
def delete_header(filename):
    html_docs = open(filename, 'r')
    soup = BeautifulSoup(html_docs, "html.parser")

    print(soup.select_one(".HubHeader"))  # testing
    for div in soup.select_one(".HubHeader"):
        div.decompose()
    print(soup.select_one(".HubHeader"))  # testing

    html_docs.close()

delete_header("my_file")
The most recent error is this:
AttributeError: 'NavigableString' object has no attribute 'decompose'
I am using select_one() and decompose().
Short answer: replace
for div in soup.select_one(".HubHeader"):
    div.decompose()
with this one line:
soup.select_one(".HubHeader").decompose()
Longer answer: your code iterates over a bs4.element.Tag object. The function .select_one() returns a single Tag, while .select() returns a list. If you were using .select(), your loop would work, but it would take out all occurrences of elements with the selected class, not just the first.
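A minimal sketch of the fixed function, keeping the file handling from the question (the None check is an added guard in case the element is missing):
from bs4 import BeautifulSoup

def delete_header(filename):
    with open(filename, 'r') as html_docs:
        soup = BeautifulSoup(html_docs, "html.parser")

    header = soup.select_one(".HubHeader")
    if header is not None:
        header.decompose()   # removes only the first .HubHeader element
    return soup

soup = delete_header("my_file")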
