Python-Beautiful Soup- Parsing XML-repeated string - python-3.x

I am trying to get some information from a page. Here is how I set it up:
import bs4
import requests

url = 'https://kith.com/collections/all/products.atom'
r = requests.get(url)
soup = bs4.BeautifulSoup(r.text, "xml")
I am trying to get the 'size' for each product on this page. The size is tagged as <title>, which is located in the <variant> tag inside each <entry> tag.
items = soup.find_all('entry')
for item in items:
    variants = item.find_all('variant')
    for variant in variants:
        size = variant.title.text
        print('size: ' + str(size))
For some reason, instead of size: 3 it prints out size: 33.
When I run it a third time it prints out size: 333, and so on. Why is it repeating itself, and how can I fix it?

Related

I am doing some web scraping and have run into an error which states: "'NoneType' object is not subscriptable"

I am using bs4 for web scraping. This is the html code that I am scraping. items is a list of these multiple div tags, i.e. <div class="list_item odd" itemscope=""...>, and the tag I really want from each element of items is:
<p class="cert-runtime-genre">
<img title="R" alt="Certificate R" class="absmiddle certimage" src="https://m...>
<time datetime="PT119M">119 min</time>
-
<span>Drama</span>
<span class="ghost">|</span>
<span>War</span>
</p>
The main class of this list is saved in items. From that I want to scrape the img tag and then access its title attribute, so that I can save all the certifications of the movies (R, PG, etc.) in a database. But when I loop over items it raises an error that the result is not subscriptable. I tried list comprehensions, a simple for loop, and indexing items through a predefined integer array; nothing works and I still get the same error. (items itself is not None and is subscriptable, i.e. it is a list.) When I index it with a direct integer it works fine, i.e. items[0] or items[1], and gives the correct result for each corresponding element of the items list. The failing lines are below:
cert = [item.find(class_="absmiddle certimage")["title"] for item in items]
or
cert = [item.find("img", {"class": "absmiddle certimage"})["title"] for item in items]
and this is what works fine:
cert = items[0].find(class_="absmiddle certimage")["title"]
Any suggestion will be appreciated.

Scrape info from a span title

My html looks like this:
<h3>Current Guide Price <span title="92"> 92
</span></h3>
The info I am trying to get is the 92.
Here is another html page where I need to get the same data:
<h3>Current Guide Price <span title="4,161"> 4,161
</span></h3>
I would need to get the 4,161 from this page.
Here is the link to the page for reference:
http://services.runescape.com/m=itemdb_oldschool/viewitem?obj=1613
What I have tried:
/h3/span[@title="92"]/@title
/h3/span[@title="92"]/text()
/div[@class="stats"]/h3/span[@title="4,161"]/@title
Since the info I need is in the actual span tag, it is hard to grab the data in a dynamic way that I can use for many different pages.
from lxml import html
import requests

baseUrl = 'http://services.runescape.com/m=itemdb_oldschool/viewitem?obj=2355'
page = requests.get(baseUrl)
tree = html.fromstring(page.content)

price = tree.xpath('//h3/span')
price2 = tree.xpath('//h3/span/@title')

for p in price:
    print(p.text.strip())
for p2 in price2:
    print(p2)
The output is 92 in both cases.

web scraping: Iterate over the pages of a site when the URL cannot be edited, with Python and Requests

I'm extracting the data from this reseller site for cars, but I cannot find a way to iterate over the pages. I usually iterate by altering some index present in the URL, but the URL of this site has no page index.
Here is an example code of how I usually do when I can iterate the pages by editing the url:
import requests as req
url = "https://www.seminovosunidas.com.br/veiculos/page:{}?utm_source=afilio&utm_medium=display&utm_campaign=maio&utm_content=ron_ambos&utm_term=120x600_promocaomaio_performance_-_-"
indice_pagina = 1
dados = {}
r = req.get(url.format(indice_pagina))
print(r.text)
I think you are new to scraping. There are links in each div; you can find them at this path and iterate over them for more pages:
#resultadoPesquisa > div:nth-child(1) > a
Get the href attribute that holds the link, like
/Paginas/detalhes-do-carro.aspx?o=fmKOUbLvWxA%3d
which you can append to the base URL to request the product page, so it would look like this:
complete_url = 'https://seminovos.localiza.com' + '/Paginas/detalhes-do-carro.aspx?o=fmKOUbLvWxA%3d'
Comment if you have any questions.

Web Scraping Location Data with BeautifulSoup

I am trying to scrape a webpage for address data (the street address highlighted in the attached image) using the find() function of the BeautifulSoup library. Most online tutorials only provide examples where the data can be easily pinpointed to a certain class; however, for this particular site the street address is an element within a larger class="dataCol col02 inlineEditWrite", and I'm not sure how to get at it with the find() function.
What would be the arguments to find() to get the street address in this example? Any help would be greatly appreciated.
This should get you started: it finds every div element with the class "dataCol col02 inlineEditWrite", then searches for td elements within each and prints the first td element's text:
divTags = soup.find_all("div", {"class": "dataCol col02 inlineEditWrite"})
for divTag in divTags:
    tdTags = divTag.find_all("td")
    print(tdTags[0].text)
The above example assumes you want to print the first td element from every div with the class "dataCol col02 inlineEditWrite"; otherwise, take just the first such div:
divTag = soup.find("div", {"class": "dataCol col02 inlineEditWrite"})
tdTags = divTag.find_all("td")
print(tdTags[0].text)

Selenium scrapes only one result and ignores other related reults

I am new to selenium. When I search a web site, I get 10 results per page. The results are shown as lists (li tags) on the page, and each list item contains the same attributes. When my conditions are met, I go to another, related web page and get the desired content. However, as my code keeps looping over the list items, it fails to find the same attributes for the others. Here is my code:
p_url = "https://www.linkedin.com/vsearch/f?keywords=BARCO%2BNV%2Bkortrijk&pt=people&page_num=5"
driver.get(p_url)
time.sleep(5)

results = driver.find_element_by_id("results-container")
employees = results.find_elements_by_tag_name('li')

#emp_list = []
#for i in range(len(employees)):
#    emp_list.append(employees[i])

for emp in employees:
    try:
        main_emp = emp.find_element_by_css_selector("a.title.main-headline")
        name = emp.find_element_by_css_selector("a.title.main-headline").text
        href = main_emp.get_attribute("href")
        if name != "LinkedIn Member":
            location = emp.find_element_by_class_name("demographic").text
            href = main_emp.get_attribute("href")
            print(href)
            print(location)
            driver.get(href)
            exp = driver.find_element_by_id("background-experience")
            amkk = exp.find_elements_by_class_name("editable-item")
            for amk in amkk:
                him = amk.find_element_by_tag_name("header").text
                him2 = amk.find_element_by_class_name("experience-date-locale").text
                if '\n' in him:
                    a = him.split('\n')
                    print(a[0])
                    print(a[1])
                print(him2)
    except Exception as exc:
        print(exc)
        continue
In this code, the line main_emp = emp.find_element_by_css_selector("a.title.main-headline") stops working after the first iteration. As a result I get the error Message: stale element reference: element is not attached to the page document.
From stackoverflow questions I saw some say the content is removed from the DOM structure, and in another post someone suggested filling a list with the results. Here is what I have tried:
emp_list = []
for i in range(len(employees)):
    emp_list.append(employees[i])
However, it also did not work out. How can I overcome this?
The selector you are using is wrong. You are getting the results using the results-container id. This works fine, but collecting the elements from it is not working: it returns more elements than just the employees (I'm not quite sure why).
If you change your selectors to this single selector, you will get just the employees and no other unwanted elements:
employees = results.find_elements_by_css_selector("ol[id='results']>li")
Edit
Since you are navigating to the employees and thereby losing the list of elements, you might want to try opening each employee in a new tab, performing your actions there, and closing the tab afterwards.
Example:
for emp in employees:
    try:
        main_emp = emp.find_element_by_css_selector("a.title.main-headline")
        # Do stuff you need...
        # Open employee in new tab (make sure Keys is imported)
        main_emp.send_keys(Keys.CONTROL + 't')
        # Focus on new tab
        driver.switch_to_window(driver.window_handles[1])
        # Do stuff inside the employee page
        # Close the tab you opened
        driver.close()
        # Switch back to the first tab
        driver.switch_to_window(driver.window_handles[0])
    except Exception as exc:
        print(exc)
Note: For OSX you should use main_emp.send_keys(Keys.COMMAND + 't')